1. Introduction
Color anomaly detection methods identify pixel regions in multispectral images that have a low probability of occurring in the background landscape and are therefore considered to be outliers. Such techniques are used in remote sensing applications for agriculture, wildlife observation, surveillance, or search and rescue. Occlusion caused by vegetation, however, remains a major challenge.
Airborne Optical Sectioning (AOS) [1,2,3,4,5,6,7,8,9,10,11,12,13] is a synthetic aperture sensing technique that computationally removes occlusion in real-time by registering and integrating multiple images captured within a large synthetic aperture area above the forest (cf. Figure 1). With the resulting shallow-depth-of-field integral images, it becomes possible to locate targets (e.g., people, animals, vehicles, wildfires, etc.) that are otherwise hidden under the forest canopy. Image pixels that correspond to the same target on the synthetic focal plane (i.e., the forest ground) are computationally aligned and enhanced, while occluders above the focal plane (i.e., trees) are suppressed in strong defocus. AOS is real-time and wavelength-independent (i.e., it can be applied to images in all spectral bands), which is beneficial for many areas of application. Thus far, AOS has been applied to the visible [1,11] and the far-infrared (thermal) spectrum [4] for various applications, such as archeology [1,2], wildlife observation [5], and search and rescue [8,9]. By employing a randomly distributed statistical model [3,10,12], the limits of AOS and its efficacy with respect to its optimal sampling parameters can be explained. Common image processing tasks, such as classification with deep neural networks [8,9] or color anomaly detection [11], have been shown to perform significantly better when applied to AOS integral images than when applied to conventional aerial images. We also demonstrated the real-time capability of AOS by deploying it on a fully autonomous and classification-driven adaptive search and rescue drone [9]. In [11,13], we presented the first solutions to tracking moving people through densely occluding foliage.
Anomaly detection methods for wilderness search and rescue have been evaluated earlier [14], and bimodal systems using a composition of visible and thermal information were already used to improve detection rates of machine learning algorithms [15,16]. However, none of the previous work considered occlusion.
With AOS, we are able to combine multispectral recordings into a single integral image. Our previous work has shown that image processing tasks, such as person classification with deep neural networks [8,9,10], perform significantly better on integral images when compared to single images. These classifiers are based on supervised architectures, which have the disadvantage that training data must be collected and labeled in a time-consuming manner and that the trained neural networks do not generalize well into other domains. It was also shown in [11] that the image integration process of AOS decreases variance and covariance, which allows better separation of target and background pixels when applying the Reed–Xiaoli (RX) unsupervised anomaly detection [17].
In this article, we evaluate several common unsupervised anomaly detection methods applied to multispectral integral images that are captured from a drone flying over open and occluded (forest) landscapes. We show that their performance can be significantly improved by the right combination of spectral bands and choice of color space input format. Especially for forest-like environments, detection rates of occluded people can be consistently increased if visible and thermal bands are combined and if HSV or HLS color spaces are used for the visible bands instead of common RGB. Furthermore, we also evaluate the runtime behavior of these methods when considered for time-critical applications, such as search and rescue.
2. Materials and Methods
For our evaluation, we used the dataset from [8], which was originally used to show that integral images improve people classification under occluded conditions. It consists of RGB and thermal images (captured simultaneously in pairs) recorded with a drone prototype over multiple forest types (broadleaf, conifer, mixed) and open landscapes, as shown in Figure 2. In all images, targets (persons lying on the ground) are manually labeled. Additional telemetry data (GPS and IMU sensor values) of the drone during capturing are also provided for each image.
While the visible bands were converted from RGB to other color spaces (HLS, HSV, LAB, LUV, XYZ, and YUV), the thermal data were optionally added as a fourth (alpha) channel, resulting in additional input options (RGB-T, HLS-T, HSV-T, LAB-T, LUV-T, XYZ-T, and YUV-T).
All images had a resolution of 512 × 512 pixels, so the input dimensions were either (512, 512, 3) or (512, 512, 4). Methods that do not require spatial information used flattened images with (262144, 3) or (262144, 4) dimensions.
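The input layouts described above can be sketched with NumPy. This is only an illustration under stated assumptions, not the authors' pipeline: the helper names `stack_thermal` and `flatten_pixels` are hypothetical, random arrays stand in for real images, and the color space conversion of the visible bands is assumed to have happened beforehand.

```python
import numpy as np

def stack_thermal(visible: np.ndarray, thermal: np.ndarray) -> np.ndarray:
    """Append a single-channel thermal image as a fourth channel (hypothetical helper)."""
    if visible.shape[:2] != thermal.shape[:2]:
        raise ValueError("visible and thermal images must share height and width")
    return np.concatenate([visible, thermal[..., np.newaxis]], axis=-1)

def flatten_pixels(image: np.ndarray) -> np.ndarray:
    """Reshape (H, W, C) into (H*W, C) for detectors that ignore spatial layout."""
    return image.reshape(-1, image.shape[-1])

# Random placeholder data in place of a real RGB/thermal image pair.
rng = np.random.default_rng(0)
visible = rng.random((512, 512, 3))
thermal = rng.random((512, 512))

rgbt = stack_thermal(visible, thermal)
flat = flatten_pixels(rgbt)
print(rgbt.shape, flat.shape)  # (512, 512, 4) (262144, 4)
```

Row-major flattening keeps the pixel order, so scores computed on the flattened array can be reshaped back to (512, 512) for visualization.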
The publicly available C/C++ implementation of AOS (source code: https://github.com/JKU-ICG/AOS, accessed on 24 November 2022) was used to compute integral images from single images.
2.1. Color Anomaly Detectors
Unsupervised color anomaly detectors have been widely used in the past [17,18,19,20,21,22], with the Reed–Xiaoli (RX) detector [17] being commonly considered as a benchmark. Several variations of RX exist, where the standard implementation calculates global background statistics (over the entire image) and then compares individual pixels based on the Mahalanobis distance. In the remainder of this article, we refer to this particular RX detector as Reed–Xiaoli Global (RXG).
The following briefly summarizes the considered color anomaly detectors, while details can be found through the provided references:
The Reed–Xiaoli Global (RXG) detector [17] computes an $n \times n$ covariance matrix $K$ of the image, where $n$ is the number of input channels (e.g., for RGB, $n = 3$ and for RGB-T, $n = 4$). The pixel under test is the $n$-dimensional vector $r$, and the mean is given by the $n$-dimensional vector $\mu$:
$$\delta_{RXG}(r) = (r - \mu)^{T} K^{-1} (r - \mu)$$
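The Reed–Xiaoli Modified (RXM) detector [18] is a variation of RXG, where an additional constant $\kappa = \lVert r - \mu \rVert^{-1}$ is used for normalization (with $r$, $\mu$, and $K$ as for RXG):
$$\delta_{RXM}(r) = \kappa \cdot (r - \mu)^{T} K^{-1} (r - \mu)$$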
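The two global detectors can be sketched in a few lines of NumPy. This is a minimal illustration rather than the implementation used for the evaluation: it assumes that the global score is the squared Mahalanobis distance to the image mean, that the modified variant divides this score by the Euclidean norm of the centered pixel, and that a pseudo-inverse is an acceptable guard against a singular covariance matrix. The function name `rx_scores` is hypothetical.

```python
import numpy as np

def rx_scores(pixels: np.ndarray, modified: bool = False) -> np.ndarray:
    """Global RX scores for an (N, n) array of pixels.

    modified=False: squared Mahalanobis distance to the global mean (RXG).
    modified=True:  the same score divided by ||r - mu|| (assumed RXM form).
    """
    pixels = np.asarray(pixels, dtype=np.float64)
    mu = pixels.mean(axis=0)
    centered = pixels - mu
    cov = np.atleast_2d(np.cov(pixels, rowvar=False))
    cov_inv = np.linalg.pinv(cov)  # assumption: pseudo-inverse in case K is singular
    # Per-pixel quadratic form (r - mu)^T K^-1 (r - mu).
    scores = np.einsum("ij,jk,ik->i", centered, cov_inv, centered)
    if modified:
        norms = np.linalg.norm(centered, axis=1)
        # Pixels that coincide with the mean keep a score of zero.
        scores = np.divide(scores, norms, out=np.zeros_like(scores), where=norms > 0)
    return scores

# Synthetic four-channel background with one strongly deviating pixel appended.
rng = np.random.default_rng(0)
background = rng.normal(0.5, 0.05, size=(1000, 4))
target = np.array([[0.9, 0.1, 0.9, 0.9]])
pixels = np.vstack([background, target])

rxg = rx_scores(pixels)
rxm = rx_scores(pixels, modified=True)
print(int(np.argmax(rxg)), int(np.argmax(rxm)))
```

Both variants need only one pass over the image for the statistics and one for the scores, which is consistent with their short runtimes.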
The Reed–Xiaoli Local (RXL) detector computes covariance and mean over smaller local areas and, therefore, does not use global background statistics. The areas are defined by an inner window and an outer window. The mean $\mu$ and covariance $K$ are calculated from the pixels of the outer window, excluding those of the inner window. The window sizes were chosen based on the projected pixel sizes of the targets in the forest landscape.
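The local variant can be sketched as a straightforward (and deliberately unoptimized) double loop. The window sizes `inner=3` and `outer=9` are placeholder values chosen for this small example and are not the values used in the evaluation. Clipping the windows at the image border and using a pseudo-inverse are further assumptions of this sketch.

```python
import numpy as np

def rxl_scores(image: np.ndarray, inner: int = 3, outer: int = 9) -> np.ndarray:
    """Local RX scores for an (H, W, n) image; window sizes are placeholders."""
    if inner >= outer:
        raise ValueError("the inner window must be smaller than the outer window")
    image = np.asarray(image, dtype=np.float64)
    h, w, _ = image.shape
    ri, ro = inner // 2, outer // 2
    scores = np.zeros((h, w))
    for y in range(h):
        for x in range(w):
            # Outer window, clipped at the image border (assumption).
            y0, y1 = max(0, y - ro), min(h, y + ro + 1)
            x0, x1 = max(0, x - ro), min(w, x + ro + 1)
            window = image[y0:y1, x0:x1]
            # Exclude the inner window around the pixel under test.
            keep = np.ones(window.shape[:2], dtype=bool)
            keep[max(0, y - ri) - y0:min(h, y + ri + 1) - y0,
                 max(0, x - ri) - x0:min(w, x + ri + 1) - x0] = False
            local_background = window[keep]
            mu = local_background.mean(axis=0)
            cov = np.atleast_2d(np.cov(local_background, rowvar=False))
            diff = image[y, x] - mu
            scores[y, x] = diff @ np.linalg.pinv(cov) @ diff
    return scores

# Small synthetic image with a single anomalous pixel.
rng = np.random.default_rng(0)
image = rng.normal(0.5, 0.05, size=(32, 32, 3))
image[16, 16] = [1.0, 0.0, 1.0]

scores = rxl_scores(image)
print(np.unravel_index(np.argmax(scores), scores.shape))
```

Because statistics are recomputed for every pixel, the cost grows with the image size times the outer window area, which illustrates why a local detector is much slower than the global ones.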
The principal component analysis (PCA) [19] uses singular value decomposition for a linear dimensionality reduction. The covariance matrix of the image is decomposed into eigenvectors and their corresponding eigenvalues. A low-dimensional hyperplane is constructed from a selected subset of the eigenvectors. Outlier scores for each sample are then obtained from their Euclidean distance to the constructed hyperplane. The number of eigenvectors used was chosen depending on $n$, the number of input channels (e.g., for RGB, $n = 3$ and for RGB-T, $n = 4$).
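The PCA-based score described above can be sketched as the reconstruction error after projecting onto the leading eigenvectors. The function name `pca_scores` and the value `n_components=2` in the example are hypothetical choices for illustration, not the setting used in the evaluation.

```python
import numpy as np

def pca_scores(pixels: np.ndarray, n_components: int) -> np.ndarray:
    """Euclidean distance of each sample to the hyperplane of the leading eigenvectors."""
    pixels = np.asarray(pixels, dtype=np.float64)
    centered = pixels - pixels.mean(axis=0)
    # Rows of vt are the covariance eigenvectors, sorted by decreasing eigenvalue.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:n_components]
    projected = centered @ basis.T @ basis
    return np.linalg.norm(centered - projected, axis=1)

# Samples close to the x-y plane, plus one sample far off the plane.
rng = np.random.default_rng(0)
plane = np.column_stack([rng.normal(size=(500, 2)), rng.normal(0.0, 0.01, size=500)])
samples = np.vstack([plane, [[0.0, 0.0, 3.0]]])

pca = pca_scores(samples, n_components=2)
print(int(np.argmax(pca)))
```

With all $n$ eigenvectors retained, the hyperplane spans the whole space and every score becomes zero, so the number of retained eigenvectors has to be smaller than the number of channels.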
The Gaussian mixture model (GMM) [20] is a clustering approach, where multiple Gaussian distributions are used to characterize the data. The data are fit to each of the single Gaussians, which are considered as a representation of clusters. For each sample, the algorithm calculates the probability of belonging to each cluster, where low probabilities are an indication of being an anomaly. The number of Gaussians is a hyper-parameter that was fixed in advance.
The cluster-based anomaly detection (CBAD) [21] estimates background statistics over clusters instead of sliding windows. The image background is partitioned (using any clustering algorithm) into clusters, where each cluster can be modeled as a Gaussian distribution. Similar to GMM, anomalies have values that deviate significantly from the cluster distributions. Each sample is assigned to the nearest background cluster and becomes an anomaly if its value deviates farther from the mean than the background pixel values in that cluster. The number of clusters is a hyper-parameter that was fixed in advance.
The local outlier factor (LOF) [22] uses a distance metric (e.g., Minkowski distance) to determine the distances between a data point and its $k$ nearest neighbors. Based on the inverse of the average of those distances, the local density is calculated. This is then compared to the local densities of the surrounding neighborhood. Samples that have significantly lower densities than their neighbors are considered isolated and, therefore, become outliers. The number of neighbors to use was chosen to be $k = 200$.
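The density comparison behind LOF can be sketched with a brute-force NumPy version. This is an illustration only: it builds the full pairwise distance matrix (quadratic memory), so it is suitable for a few hundred samples and not for a flattened 512 × 512 image, it uses the Euclidean distance, and it ignores distance ties. The function name `lof_scores` and the small `k=10` in the example are hypothetical choices.

```python
import numpy as np

def lof_scores(points: np.ndarray, k: int) -> np.ndarray:
    """Brute-force local outlier factor; values well above 1 indicate outliers."""
    points = np.asarray(points, dtype=np.float64)
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]          # k nearest neighbors
    neighbor_dists = np.take_along_axis(dists, neighbors, axis=1)
    k_distance = neighbor_dists[:, -1]                    # distance to the k-th neighbor
    # Reachability distance of p from neighbor o: max(k_distance(o), d(p, o)).
    reach = np.maximum(neighbor_dists, k_distance[neighbors])
    lrd = 1.0 / reach.mean(axis=1)                        # local reachability density
    # Ratio of the neighbors' average density to the point's own density.
    return lrd[neighbors].mean(axis=1) / lrd

# One dense cluster and one isolated sample.
rng = np.random.default_rng(0)
cluster = rng.normal(0.0, 1.0, size=(100, 2))
points = np.vstack([cluster, [[8.0, 8.0]]])

lof = lof_scores(points, k=10)
print(int(np.argmax(lof)))
```

The neighbor search dominates the cost, which fits the long runtimes reported for LOF on full images.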
2.2. Evaluation
The evaluation of the methods summarized above was carried out on a consumer PC (Intel Core i9-11900H @ 2.50 GHz) for the landscapes shown in Figure 2.
Precision (Equation (1)) and recall (Equation (2)) were used as metrics for performance comparisons.
The task can be formulated as a binary classification problem, where positive predictions are considered anomalous pixels. The data we want to classify (image pixels) is highly unbalanced, as most of the pixels are considered as background (majority class) and only some of the pixels are considered anomalies (minority class).
The true positive (TP) pixels are determined by checking whether they lie within one of the labeled rectangles, as shown in Figure 2. Pixels detected outside these rectangles are considered false positives (FP), and pixels inside a rectangle that are not classified as anomalous are considered false negatives (FN). Since the dataset only provides rectangles as labels and not perfect masks around the persons, the recall results are biased (in general, lower than they would be with exact masks). As we are mainly interested in the performance difference between individual methods and the error introduced is always constant (rectangle area minus real person mask), the conclusions drawn from the results should be the same, even if perfect masks were used instead.
The precision (Equation (1)) quantifies the number of correct positive predictions made, and recall (Equation (2)) quantifies the number of correct positive predictions made out of all positive predictions that could have been made:
$$\mathrm{precision} = \frac{TP}{TP + FP} \quad (1)$$
$$\mathrm{recall} = \frac{TP}{TP + FN} \quad (2)$$
Precision and recall both focus on the minority class (anomalous pixels) and are therefore less concerned with the majority class (background pixels), which is important for our unbalanced dataset.
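The pixel-wise counting described above can be sketched as follows. The rectangle format `(x0, y0, x1, y1)` with exclusive end coordinates and the helper names are assumptions made for this example; the dataset's actual label format may differ.

```python
import numpy as np

def rectangles_to_mask(shape, rectangles):
    """Rasterize labeled rectangles (x0, y0, x1, y1; end exclusive) into a boolean mask."""
    mask = np.zeros(shape, dtype=bool)
    for x0, y0, x1, y1 in rectangles:
        mask[y0:y1, x0:x1] = True
    return mask

def precision_recall(predicted, labeled):
    """Pixel-wise precision and recall of a binary prediction against a label mask."""
    tp = np.count_nonzero(predicted & labeled)    # anomalous and inside a rectangle
    fp = np.count_nonzero(predicted & ~labeled)   # anomalous but outside all rectangles
    fn = np.count_nonzero(~predicted & labeled)   # inside a rectangle but not detected
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# 8 x 8 toy image: one 4 x 4 labeled rectangle and a 2 x 4 detection that overlaps it.
labeled = rectangles_to_mask((8, 8), [(2, 2, 6, 6)])
predicted = np.zeros((8, 8), dtype=bool)
predicted[3:5, 3:7] = True

precision, recall = precision_recall(predicted, labeled)
print(precision, recall)  # 0.75 0.375
```

In the example, 6 of the 8 detected pixels fall inside the rectangle (precision 0.75), while only 6 of the 16 labeled pixels are detected (recall 0.375), which mirrors the recall bias caused by rectangular labels.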
Since the anomaly detection methods provide probabilistic scores on the likelihood of a pixel being considered anomalous, a threshold value must be chosen to obtain a final binary result.
The precision–recall curve (PRC) in Figure 3 shows the relationship between precision and recall for every possible threshold value that could be chosen. Thus, a method performing well would have high precision and high recall over different threshold values. We use the area under the precision–recall curve (AUPRC), which is simply the integral of the PRC, as the final evaluation metric.
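The threshold sweep and the area computation can be sketched as follows. Note the assumption: the integral is approximated here by a step-wise sum (often called average precision), which is one common convention; a trapezoidal rule would give slightly different values, and the text does not state which one was used. The function names are hypothetical.

```python
import numpy as np

def precision_recall_curve(scores, labels):
    """Precision and recall for every distinct threshold, from strict to permissive."""
    scores = np.asarray(scores, dtype=np.float64).ravel()
    labels = np.asarray(labels, dtype=bool).ravel()
    order = np.argsort(-scores, kind="stable")
    sorted_scores, sorted_labels = scores[order], labels[order]
    tp = np.cumsum(sorted_labels)     # true positives when keeping the top-i scores
    fp = np.cumsum(~sorted_labels)    # false positives when keeping the top-i scores
    # Evaluate only at the last position of each distinct score value.
    last = np.r_[np.nonzero(np.diff(sorted_scores))[0], scores.size - 1]
    precision = tp[last] / (tp[last] + fp[last])
    recall = tp[last] / labels.sum()
    return precision, recall

def auprc(scores, labels):
    """Step-wise area under the precision-recall curve (assumed convention)."""
    precision, recall = precision_recall_curve(scores, labels)
    recall = np.r_[0.0, recall]
    return float(np.sum(np.diff(recall) * precision))

example_scores = [0.9, 0.8, 0.7, 0.6]
example_labels = [1, 0, 1, 0]
print(round(auprc(example_scores, example_labels), 4))  # 0.8333
```

A detector that ranks all anomalous pixels above all background pixels reaches a value of 1, while the interleaved ranking in the example reaches 5/6.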
The AUPRC metric provides comparable results on the overall performance of a method but is not well suited when it comes to finding the best threshold for a single image. To obtain the best threshold value for a single image, we use the $F_\beta$-score (Equation (3)), which is also calculated from precision and recall:
$$F_\beta = (1 + \beta^2) \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\beta^2 \cdot \mathrm{precision} + \mathrm{recall}} \quad (3)$$
where $\beta$ is used as a weighting factor that can be chosen such that recall is considered $\beta$-times more important than precision.
The balanced F1-score is the harmonic mean of precision and recall and is widely used. However, as we care more about minimizing false positives than minimizing false negatives, we select a weighting factor $\beta < 1$, with the specific value determined by a grid search. With this setting, precision is weighted more heavily than recall. The metric is only used to threshold the image scores for comparison purposes, as shown in Figure 6.
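How the weighting factor shifts the selected threshold can be sketched in plain Python. The values `beta=0.5` and `beta=2.0` below are illustrative only and are not the setting found by the grid search; the function names are hypothetical.

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """Weighted harmonic mean of precision and recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)

def best_threshold(scores, labels, beta):
    """Return (threshold, score) maximizing F-beta when predicting score >= threshold."""
    best_t, best_f = None, -1.0
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, l in zip(scores, labels) if s >= t and l)
        fp = sum(1 for s, l in zip(scores, labels) if s >= t and not l)
        fn = sum(1 for s, l in zip(scores, labels) if s < t and l)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f = f_beta(precision, recall, beta)
        if f > best_f:
            best_t, best_f = t, f
    return best_t, best_f

toy_scores = [0.9, 0.8, 0.7, 0.6, 0.2]
toy_labels = [1, 1, 0, 1, 0]

print(best_threshold(toy_scores, toy_labels, beta=0.5)[0])  # 0.8
print(best_threshold(toy_scores, toy_labels, beta=2.0)[0])  # 0.6
```

With a factor below 1, the stricter threshold 0.8 wins because it avoids the false positive at 0.7; with a factor above 1, the more permissive threshold 0.6 wins because it recovers the third anomalous sample.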
3. Results
Figure 4 shows the AUPRC values across different color spaces and methods. The methods are evaluated on each color space, once with three channels (visible spectrum only) and once with four channels (visible and thermal spectrum). The results of the forest landscape are average values over F0, F1 and F5, and the results of the open landscape are average values over O1, O3 and O5.
As expected (and as we have also seen in Figure 3), the overall AUPRC of the open landscapes is much higher than the AUPRC values of the more challenging forest landscapes. The reason is the occlusion caused by the forest.
The AUPRC values of the four-channel (color + thermal) and three-channel (color only) inputs are overlaid in the same bar. The slightly lighter colored four-channel results are always higher than the three-channel results, regardless of the method or the color space used. However, the difference is more pronounced for the forest landscapes than for the open landscapes. This shows that regardless of the scenery, the method, and the color space used, the additional thermal information always improves the performance of anomaly detection.
Looking at the AUPRC values in the forest landscapes, we can observe that RXL gives the overall best results and outperforms all other methods. Utilizing the additional thermal information gives, in this case, a further gain. This can also be observed visually in the anomaly detection scores shown in Figure 6, where FP detections decrease strongly and TP detections increase strongly if the thermal channel is added (e.g., in F1, many background pixels are considered anomalous in the visible spectral band, while those misclassified pixels are eliminated with the additional thermal information).
Looking at the AUPRC values in the open landscapes, we can observe that the difference between the methods is not as pronounced as in the forest landscapes. An obvious outlier, however, seems to be LOF, which nevertheless performs very well (second best) in the forest landscapes. This can be explained by the fact that the hyper-parameters of the methods were specifically chosen for the forest landscape. In the case of LOF, the number of neighbors was set to 200, which seems suboptimal for the open landscapes. The same holds for RXL (window sizes), CBAD (number of clusters), GMM (number of components), and PCA (number of components). The remaining methods (RXG and RXM) do not require hyper-parameters.
Another observation that can be made is that some color spaces consistently give better results than others. In the forest landscapes, HSV(-T) usually gives the best results, regardless of the methods being used. In the open landscapes, it is not as clear which color space performs best, but HSV(-T) still gives overall good results. In general, and especially for RXM, the improvements achieved by choosing HSV(-T) over other color spaces are clearly noticeable.
The individual results plotted in Figure 4 are also shown in Table 1 and Table 2, where the mean values over all color spaces (last row) may give a useful estimate of the method’s overall performance. The highest AUPRC value for the forest and open scenery is highlighted in bold.
Since anomaly detection for time-critical applications should deliver reliable results in real-time, we have also measured the runtimes of the methods, as shown in Table 3. The best-performing methods on the forest landscapes in terms of AUPRC values are RXL and LOF. In terms of runtime, both are found to be very slow, as they require 20 to 35 s for computations, whereas all other algorithms provide anomaly scores in under a second (cf. Figure 5).
4. Discussion
The AUPRC results in Figure 4 show that all color anomaly detection methods benefit from additional thermal information, especially for the forest landscapes.
In challenging environments, where the distribution of colors has a much higher variance (e.g., F1 in Figure 6, due to bright sunlight), the additional thermal information improves results significantly. If the temperature difference between targets and the surroundings is large enough, the thermal spectral band may add spatial information (e.g., distinct clusters of persons), which is beneficial for methods that calculate results based on locality properties (e.g., RXL, LOF).
In forest-like environments, the RXL anomaly detector performs best regardless of the input color space. This could be explained by the specific characteristics of an integral image. In the case of occlusion, the integration process produces highly blurred images caused by defocused occluders (forest canopy) above the ground, which results in a much more uniformly distributed background. Since target pixels on the ground stay in focus, anomaly detection methods such as RXL, which calculate background statistics on a smaller window around the target, benefit from the uniformly distributed (local) background. The same is true for LOF, where the local density in the blurred background regions is much higher than the local density in the focused target region, resulting in overall better outlier detection rates. Since most objects in open landscapes are located near the focal plane (i.e., at nearly the same altitude above the ground), there is no out-of-focus effect caused by the integration process. Thus, these methods do not produce similarly good results for open landscapes.
For the forest landscapes, the HSV(-T) and HLS(-T) color spaces consistently give better results than others. The color spaces HSV (hue, saturation, value) and HLS (hue, lightness, saturation) are both based on cylindrical color space geometries and differ mainly in their brightness dimension (value vs. lightness). The hue and saturation dimensions can be considered more important when distinguishing colors, as the remaining dimension only describes the value (brightness) or lightness of a color. We assume that the more uniform background resulting from the integration process also has a positive effect on the distance metric calculations when those two color spaces are used, especially if the background mainly consists of a very similar color tone. This is again more pronounced for the forest landscapes than for the open landscapes.
Although the AUPRC results obtained from RXL and LOF are best for forest landscapes, the high runtime indicates that these methods are impractical for real-time applications. A trade-off must be made between good anomaly detection results and fast runtime; we therefore further consider only the top-performing methods that provide reliable results within milliseconds.
Based on the AUPRC and runtime results shown in Figure 5, one could suggest that the RXM method may be used. Its AUPRC results combined with HSV-T are the best among methods that run in under one second, regardless of the landscape. Since this method does not require a priori settings to be chosen (only the final thresholding value) and its runtime is one of the fastest, it would be well suited for usage in forests and open landscapes. The second-best algorithm based on the AUPRC values would be CBAD, with the disadvantage that it requires a hyper-parameter setting and does not generalize well for open landscapes.
5. Conclusions
In this article, we have shown that the performance of unsupervised color anomaly detection methods applied to multispectral integral images can be further improved by an additional thermal channel. Each of the evaluated methods performs significantly better when thermal information is utilized in addition, regardless of the landscape (forest or open). Another finding is that even without the additional thermal band, the choice of input color space (for the visible channels) already has an influence on the results. Color spaces such as HSV and HLS can outperform the widely used RGB color space, especially in forest-like landscapes.
These findings might guide decisions on the choice of color anomaly detection method, input format, and applied spectral band, depending on individual use cases. Occlusion caused by vegetation, such as forests, remains challenging for many of them. In the future, we will investigate anomalies caused by motion in the context of synthetic aperture sensing. In combination with color and thermal anomaly detection, motion anomaly detection has the potential to further improve detection results for moving targets, such as people, animals, or vehicles.