How to effectively combine information from different modalities is a key issue in the fusion tracking task. We therefore design a classification response aggregation module that combines hyperspectral and RGB information, aiming to improve tracking performance by enhancing the discrimination between the object and the background. In addition, the reliability of the two modalities varies with imaging conditions, so reliability-aware fusion can make fuller use of multi-modality information. To reduce the noise introduced by low-reliability modality data and to ensure that the more reliable modality plays the larger role in multi-modality fusion tracking, we also propose a method to evaluate the reliability of the hyperspectral and RGB modalities and use the modality reliability to guide information fusion.
2.3.1. The Evaluation Method of Modality Reliability
Since the features of the template patch need to propagate to all subsequent frames, the reliability of different modality template patches is an essential factor affecting the tracking performance. Based on this, we use the reliability of the template patch to represent the modality reliability. In this part, we propose a simple method to evaluate the reliability of the template patch. The evaluation flow is as follows:
First, compute the gray template patch of each modality. Reweight the template patch according to weights calculated from the mean value of each channel of that modality's template patch, then superimpose the processed template patch along the channel direction to obtain the gray template patch. In particular, to facilitate calculation, the gray template patches of the different modalities are standardized.
Second, count the number of gray levels of each modality. The gray range of the gray template patch is divided into M gray levels, and the gray-levels number is obtained by counting the gray levels that contain no fewer than N pixels in each modality.
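A brief NumPy sketch of the first two steps. The exact channel-mean weighting scheme, the z-score standardization, and the function names are assumptions about details the text does not pin down:

```python
import numpy as np

def gray_template_patch(patch):
    """Collapse a multi-channel template patch (H, W, C) into a gray
    template patch: reweight each channel by its mean value, superimpose
    along the channel direction, then standardize (assumed z-score)."""
    patch = patch.astype(np.float64)
    means = patch.mean(axis=(0, 1))                     # per-channel means
    weights = means / (means.sum() + 1e-12)             # channel weights
    gray = (patch * weights).sum(axis=2)                # superimpose channels
    return (gray - gray.mean()) / (gray.std() + 1e-12)  # standardize

def count_gray_levels(gray, M=500, N=12):
    """Divide the gray range into M levels and count the levels that
    contain no fewer than N pixels (the paper reports M = 500, N = 12)."""
    hist, _ = np.histogram(gray, bins=M)
    return int((hist >= N).sum())
```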
Third, calculate the sharpness of the different modality template patches. The sharpness of a template patch is related to its high-frequency component: when the template patch is clear, the high-frequency component is at its highest, and the difference between a mutation pixel and its adjacent pixels becomes larger. We calculate the sharpness of each modality's gray template patch as the sum of the squared differences between each pixel and its second nearest neighbor to the horizontal right, which can be summarized as:

$$
D(f) = \sum_{y}\sum_{x} \left| f(x+2, y) - f(x, y) \right|^{2},
$$

where $f(x, y)$ is the gray value of the pixel $(x, y)$ in the gray template image $f$, and $D(f)$ is the sharpness of the gray template image.
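This sharpness measure (a Brenner-style gradient) can be sketched as:

```python
import numpy as np

def sharpness(gray):
    """Sum of squared differences between each pixel and its second
    nearest horizontal neighbor. The last two columns have no such
    neighbor and simply drop out; that boundary handling is an
    assumption."""
    gray = gray.astype(np.float64)
    diff = gray[:, 2:] - gray[:, :-2]
    return float((diff ** 2).sum())
```

A perfectly uniform patch has sharpness 0; sharper edges produce larger values.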
Finally, evaluate the reliability of the different modality template patches. The reliability $r_i$ of the template patch of modality $i$ is calculated as follows:

$$
r_i =
\begin{cases}
\lambda \, G_i / D_i, & D_i > t \ \text{and} \ D_j > t, \\
\lambda / (G_i + D_i), & D_i \le t \ \text{and} \ D_j \le t, \\
\lambda \, D_i, & \text{otherwise},
\end{cases}
$$

where $A$ represents the collection of the modality kinds (RGB and hyperspectral), $i, j \in A$ ($j \ne i$) represent the two modalities, $\lambda$ is the coefficient in the reliability, $t$ represents a threshold, $G_i$ represents the gray-levels number of the modality-$i$ template patch, and $D_i$ represents the sharpness of the modality-$i$ template patch.
If the sharpness of a template patch is greater than a certain threshold, the template patch is relatively clear. When both modality template patches have high sharpness values, we use $\lambda$ times the ratio of the gray-levels number to the image sharpness to represent the reliability of the template patch. In this case, if one modality template patch contains more gray levels than the other, the semantic information it contains is richer, so its reliability value is higher.
On the contrary, if the sharpness of a template patch is less than the threshold, the template patch is relatively blurred. When both template patches are blurred, we use $\lambda$ times the reciprocal of the sum of the gray-levels number and the image sharpness to represent the reliability of the template patch. In this case, if one modality template patch contains more gray levels than the other, it contains more noise, so its reliability value is lower.
However, if one modality template patch is clear and the other is blurred, we use $\lambda$ times the sharpness of the template patch to directly represent its reliability. Obviously, in this case, the clear template patch has a higher reliability value than the blurred one.
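The three cases above combine into one piecewise function. The variable names are illustrative, and the defaults use the values λ = 1 and t = 76 reported for the experiments below:

```python
def reliability(G_i, D_i, D_other, lam=1.0, t=76.0):
    """Reliability of one modality's template patch.

    G_i:     gray-levels number of this modality's template patch
    D_i:     sharpness of this modality's template patch
    D_other: sharpness of the other modality's template patch
    """
    if D_i > t and D_other > t:      # both patches clear
        return lam * G_i / D_i
    if D_i <= t and D_other <= t:    # both patches blurred
        return lam / (G_i + D_i)
    return lam * D_i                 # one clear, one blurred
```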
Two examples are used to verify the reliability of the two modality template patches in Figure 7. The related images of the Coin sequence are at the top of Figure 7, and those of the Rider2 sequence are at the bottom. The RGB template patches are displayed in the second column. Columns 3 and 4 are the grayscale images of the RGB and hyperspectral template patches, and columns 5 and 6 are the feature thermal images of the RGB and hyperspectral template patches, respectively. In this work, $M$ is 500, $N$ is 12, $\lambda$ is 1, and $t$ is 76.
We can see that the third and fourth images of the video named Coin are very clear, and the fourth image contains more visual information than the third. Therefore, it can be judged that the reliability of the fourth image is higher than that of the third, and more useful features can be extracted from it. In addition, the feature thermal images of the different modality patches show that the feature area in the sixth image matches the object area more closely and the interfering features around the object are minor, which also indicates that the sixth image (corresponding to the fourth image) is more reliable. The reliability of the third and fourth images of the Coin sequence is calculated by the proposed method. The sharpness values of the two images are 189 and 96, respectively, both greater than the threshold, so both images are relatively clear. The reliability of the fourth image is 4.896, which is higher than that of the third image (1.709), consistent with subjective perception.
It is easy to observe that the third and fourth images of the video named Rider2 are both blurred. Compared with the fourth image, the third image is more complex and of poorer quality; it is difficult to extract useful features from it, so its reliability is low. That the fourth image is more reliable can also be verified by comparing the feature thermal images shown in the fifth and sixth images of the Rider2 video. From a numerical point of view, the sharpness values of the third and fourth images are 66 and 74, respectively, both smaller than the threshold, and the computed reliability of the third image is lower than that of the fourth, so the fourth image is more reliable, which conforms to the visual characteristics.
2.3.2. Classification Response Aggregation Module
The structure of the classification response aggregation module is shown in Figure 8. To reduce the noise effect caused by low-reliability modality data and to let the more reliable modality play the larger role in the classification task, we use the modality reliability to predict the contribution of each modality to the classification task and use the contribution value to weight the MS representations of the corresponding modality, maximizing the benefit of the aggregation module to tracking performance. We denote the contribution of HSI as $w_h$ and the contribution of RGB as $w_r$ ($w_h + w_r = 1$). Comparing the reliability of the hyperspectral and RGB template patches, if the reliability of the hyperspectral template patch is higher than that of RGB, then $w_h > w_r$; on the contrary, if it is smaller, then $w_h < w_r$. The weighted MS representations of the two modalities are input into the classification prediction head separately, yielding the hyperspectral and RGB classification responses; the final classification response results from the fusion of these modality classification responses. Denote $R$ as the final classification response, $F_h$ as the HSI MS representations, $F_r$ as the RGB MS representations, and $\phi_h$ and $\phi_r$ as the hyperspectral and RGB classification prediction heads, respectively. We utilize the same network for classification prediction, so in this study $\phi_h$ and $\phi_r$ are identical. The final classification response is defined as:

$$
R = \phi_h\left(w_h F_h\right) + \phi_r\left(w_r F_r\right).
$$
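A minimal sketch of this reliability-guided aggregation. Deriving the contributions as the normalized reliabilities and fusing the two head outputs by addition are assumptions; the text only requires that the contributions sum to 1, with the more reliable modality weighted more heavily:

```python
import numpy as np

def fuse_classification(F_h, F_r, r_h, r_r, head):
    """Fuse HSI and RGB classification responses, guided by the
    modality reliabilities r_h and r_r. `head` stands in for the
    shared classification prediction head (the same network is used
    for both modalities)."""
    w_h = r_h / (r_h + r_r)   # HSI contribution (assumed normalization)
    w_r = 1.0 - w_h           # RGB contribution, so w_h + w_r = 1
    return head(w_h * F_h) + head(w_r * F_r)
```

For example, with reliabilities 3 and 1, the HSI branch receives weight 0.75 and the RGB branch 0.25.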