4.2. Evaluation Metrics
To quantitatively compare different saliency models on the ORSSD, EORSSD and ORSI-4199 datasets, we adopted the following five evaluation metrics: the precision–recall (PR) curve [33], the F-measure curve, the max F-measure ($F_\beta^{max}$) [33], the S-measure ($S_m$) [56] and the mean absolute error (MAE) [57].
Precision and recall are standard metrics for evaluating model performance. They are computed by binarizing the predicted saliency map at a series of thresholds and comparing each binary mask with the ground truth; the resulting (recall, precision) pairs form the PR curve.
The F-measure is a weighted harmonic mean of precision and recall, defined as follows:

$$F_\beta = \frac{(1 + \beta^2) \times \mathrm{Precision} \times \mathrm{Recall}}{\beta^2 \times \mathrm{Precision} + \mathrm{Recall}},$$

where $\beta^2$ is set to 0.3 to emphasize precision over recall, as recommended in [33]. The larger the F-measure, the more accurate the prediction result. The maximum value computed over all thresholds is reported as the max F-measure ($F_\beta^{max}$). We also plotted the F-measure curve, which pairs the F-measure score with each binarization threshold in [0, 255]. In general, a more expressive model achieves a higher max F-measure and covers a larger area under the F-measure curve.
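To make this computation concrete, the following minimal NumPy sketch evaluates the F-measure at every threshold in [0, 255] and takes the maximum. The array names `sal` (a saliency map scaled to [0, 255]) and `gt` (a binary ground truth mask) are illustrative assumptions, not names from our implementation.

```python
import numpy as np

def fmeasure_curve(sal, gt, beta2=0.3, eps=1e-8):
    """F-measure at every binarization threshold t in [0, 255] (beta^2 = 0.3)."""
    scores = []
    for t in range(256):
        pred = (sal >= t).astype(np.float64)   # binary mask at threshold t
        tp = (pred * gt).sum()                 # true positives
        precision = tp / (pred.sum() + eps)
        recall = tp / (gt.sum() + eps)
        scores.append((1 + beta2) * precision * recall /
                      (beta2 * precision + recall + eps))
    return np.array(scores)

# The max F-measure is the best score over all thresholds:
# f_curve = fmeasure_curve(sal, gt); max_f = f_curve.max()
```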
The S-measure evaluates the structural similarity between the saliency map and the ground truth map from both the region and the object perspectives. It focuses on assessing the structural information of the saliency map, which is closer to the human visual system than the F-measure. It is defined as follows:

$$S_m = \alpha \times S_o + (1 - \alpha) \times S_r,$$

where $\alpha$ is usually set to 0.5, $S_o$ indicates object-aware structural similarity and $S_r$ indicates region-aware structural similarity. A larger value indicates a smaller structural error and better model performance.
MAE computes the mean absolute error between the predicted saliency map S and the ground truth map G, both normalized to the range [0, 1]:

$$\mathrm{MAE} = \frac{1}{W \times H}\sum_{x=1}^{W}\sum_{y=1}^{H}\left|S(x, y) - G(x, y)\right|,$$

where W and H are the width and height of the saliency map.
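For completeness, a correspondingly minimal MAE sketch, using the same assumed `sal` and `gt` arrays, here already normalized to [0, 1]:

```python
import numpy as np

def mae(sal, gt):
    """Mean absolute error over all W x H pixels; inputs are in [0, 1]."""
    return np.abs(sal.astype(np.float64) - gt.astype(np.float64)).mean()
```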
4.3. Comparison with SOTA Methods
In line with the three popular ORSI-SOD benchmarks [
18,
19,
20], we conducted a comprehensive evaluation of our method by comparing it with 17 state-of-the-art NSI-SOD and ORSI-SOD methods. Specifically, the compared methods consist of three traditional NSI-SOD methods (LC [
32], FT [
33], and MBD [
58]), six deep-learning-based NSI-SOD methods (VST [
10], GateNet [
9], F3Net [
5], PoolNet [
7], SCRNet [
8] and CPDNet [
6]) and eight recent deep-learning-based ORSI-SOD methods (MCCNet [
40], MSCNet [
43], LVNet [
18], ACCoNet [
42], DAFNet [
19], EMFINet [
22], MJRBMNet [
20] and RRNet [
41]). All results were generated using the code provided by the respective authors. For a fair comparison, we retrained the compared SOD methods on the three publicly available datasets. LVNet [18] is the earliest ORSI-SOD method; its code is not publicly available, but its authors provide test results for ORSSD and EORSSD.
- (1)
Quantitative comparison:
- (a)
Quantitative comparison of ORSSD: We present the performance of our method, TCM-Net, on the ORSSD [18] dataset by quantitatively comparing it with the other methods. The evaluation was conducted based on three metrics, namely MAE, the max F-measure ($F_\beta^{max}$) and the S-measure ($S_m$); the results are shown in Table 1. Our method demonstrates superior performance compared to all other methods on all three metrics; while MCCNet [40] is the best performer among the remaining 17 methods, our method performs even better. Specifically, our method outperforms MCCNet with a 2.14% improvement in $F_\beta^{max}$, a 13.8% lower MAE and a 0.07% better $S_m$. Despite RRNet [41] and DAFNet [19] achieving higher $F_\beta^{max}$ scores of 0.9210 and 0.9166, our method surpasses them by 1.62% and 2.11% in this metric, respectively. Furthermore, Figure 7 illustrates the PR curve and the F-measure curve. Across the three datasets and 17 compared methods, our model's PR curve is positioned closest to the upper-right corner, while our F-measure curve covers a larger area than those of the other methods.
- (b)
Quantitative comparison of EORSSD: We present a quantitative comparison of our approach with other methods on the EORSSD [
19] dataset, as shown in the middle column of
Table 1. Additionally, we plotted the PR curve and the F-measure curve in Figure 7, which highlight our method's superior performance. Among the compared methods, our approach achieves the best performance in two metrics, MAE and $F_\beta^{max}$, and also has the best PR curve and F-measure curve. However, our $S_m$ is slightly behind ACCoNet [42] by 0.03%: our $S_m$ value is 0.9332, whereas ACCoNet's is 0.9335. Despite this, our $F_\beta^{max}$ outperforms ACCoNet's by 3.70%, and our MAE is 12% lower. Notably, RRNet [41] has the highest $F_\beta^{max}$ value, 0.9031, among all the compared methods except our approach; however, our $F_\beta^{max}$ is still superior to RRNet's by 1.25%.
- (c)
Quantitative comparison of ORSI-4199: Based on the ORSI-4199 dataset [
20], we performed a quantitative comparison of the three metrics and present the results in
Table 1. The corresponding PR curve and F-measure curve are shown in
Figure 7 on the right. Overall, our method outperforms the other methods, achieving an MAE of 0.0275, an $F_\beta^{max}$ of 0.8717 and an $S_m$ of 0.8760. In addition, our method showed the best performance in both the PR curve and the F-measure curve. Among the other methods, ACCoNet [42] achieved the best results, with MAE, $F_\beta^{max}$ and $S_m$ values of 0.0328, 0.8584 and 0.8800, respectively. Compared to ACCoNet, our $S_m$ was 0.46% lower, but our $F_\beta^{max}$ was 1.55% higher and our MAE was 19.3% lower. Moreover, our comparison with VSTNet [10], which showed the highest performance among the NSI-SOD methods, demonstrates the high applicability of transformer models to the SOD task. VSTNet achieved an $F_\beta^{max}$ of 0.8543, an MAE of 0.0306 and an $S_m$ of 0.8752. Compared to VSTNet, our method achieved a better $F_\beta^{max}$ (improved by 2.04%), a lower MAE (reduced by 11.3%) and a slightly improved $S_m$ (by 0.09%).
Our quantitative comparison results demonstrate that our proposed method outperforms 17 other methods for ORSI-SOD. We provide a new state-of-the-art benchmark for comparison. The success of our model is attributed to the proposed dual-path complementary codec network structure and gating mechanism, which is better suited for ORSIs. We conducted more detailed ablation experiments to further investigate the advantages of each module.
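As a sanity check, the relative gains and reductions quoted above follow the usual relative-change formula; a small sketch using two of the EORSSD numbers cited in the text (other Table 1 entries are not reproduced here):

```python
def relative_gain(ours, other):
    """Relative improvement of a higher-is-better metric, in percent."""
    return 100.0 * (ours - other) / other

def relative_reduction(ours, other):
    """Relative reduction of a lower-is-better metric such as MAE, in percent."""
    return 100.0 * (other - ours) / other

print(round(relative_gain(0.9144, 0.9031), 2))  # F-measure vs. RRNet on EORSSD   -> 1.25
print(round(relative_gain(0.9332, 0.9335), 2))  # S-measure vs. ACCoNet on EORSSD -> -0.03
```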
- (2)
Visual comparison: To qualitatively compare all methods, we selected experimental results from the ORSI-4199 dataset, which includes representative and challenging scenes, as shown in
Figure 8. These scenes involve multiple objects, complex scenes, low contrast, shadow occlusion and more. Our model’s predictions demonstrate greater completeness and accuracy in detecting salient objects compared to other methods shown in
Figure 8. The specific advantages are reflected in the following aspects.
- (a)
Superiority in scenes with multiple or multiscale objects: In the first and third examples of
Figure 8, both traditional models (e.g., LC [
32] and MBD [
58] shown in
Figure 8q,s) and deep-learning-based models (e.g., EMFINet [
22], DAFNet [
19] and F3Net [
5] presented in
Figure 8h,j,m) fail to highlight the foreground regions accurately, resulting in prominent errors. In contrast, our model provides complete and accurate inference for salient objects. Similarly, in the eighth and thirteenth examples of
Figure 8, some state-of-the-art deep-learning models (e.g., ACCoNet [
42], MCCNet [
40] and MSCNet [
43] depicted in
Figure 8d,e,g) either incorrectly detect multiple objects or fail to fully pop out the salient objects. In stark contrast, our model, shown in Figure 8c, can still successfully highlight the salient objects, and our results exhibit clear boundaries, particularly in preserving the overall integrity of multiple small airplanes and a single large building. This is clearly attributed to the superiority of our U-shaped encoder–decoder architecture and the control of redundant information by the AG module.
- (b)
Superiority in cluttered background and low contrast scenes: In the fourth, fifth and sixth examples of
Figure 8, traditional models (e.g., LC [
32] and FT [
33] shown in
Figure 8r,s) completely fail to detect salient objects, while deep-learning-based models (e.g., ACCoNet [
42], MSCNet [
43], PoolNet [
7] presented in
Figure 8d,g,n) either provide incomplete detection or incorrectly highlight background regions. In contrast, our model successfully detects salient bridges, four ships and two airplanes from the aforementioned three examples. Similarly, in
Figure 8, for the fourteenth, fifteenth and seventeenth examples, state-of-the-art deep-learning models (e.g., ACCoNet [
42], MCCNet [
40] and RRNet [
41]) either erroneously highlight background regions or fail to clearly distinguish salient objects, as shown in
Figure 8d–f. In contrast, our model, depicted in Figure 8c, can fully and clearly pop out all salient objects. This is clearly attributed to our effective control of global contextual and local detailed information, as well as the LGFF module's ability to fuse information from both sources effectively.
- (c)
Superiority in salient regions with complicated edges or irregular topology: In the tenth, eleventh and twelfth examples of
Figure 8, both traditional models (e.g., LC [
32] and MBD [
58] shown in
Figure 8q,s) and deep-learning-based models (e.g., ACCoNet [
42], MCCNet [
40], EMFINet [
22], MJRBMNet [
20] and GateNet [
9]) presented in
Figure 8d,e,h,i,l fail to accurately delineate the lake regions comprehensively and also fall short in detecting irregular rivers and buildings. In contrast, our model, as shown in
Figure 8c, outperforms the other models. The forest regions are effectively suppressed, irregular lakes are fully highlighted and, for irregular topological structures of rivers and buildings, our model is capable of generating a more complete saliency map with more accurate boundaries. It is evident that these superior results are attributed to the addition of edge supervision in our hybrid loss and our effective control of global-to-local information for image localization.
- (3)
Attribute-based study: In the latest ORSI-4199 dataset [
20], each image is meticulously categorized according to the distinctive attributes commonly found in ORSIs. These attributes include the presence of big salient objects (BSOs), small salient objects (SSOs), off center (OC) objects, complex salient objects (CSOs), complex scenes (CSs), narrow salient objects (NSOs), multiple salient objects (MSOs), low contrast scenes (LCSs) and incomplete salient objects (ISOs). These annotations allow us to compare the strengths and weaknesses of our proposed model and other models under different conditions.
Table 2 reports the attribute-wise scores of our model and other state-of-the-art models. Our model ranks first on seven of the nine attributes and second on the remaining two. Additionally, we used radar plots for the first time to depict the scores of the top five ranked methods across the different attributes, and the results confirm that our model outperforms existing models in the majority of challenging scenarios, as shown in Figure 9.
4.5. Ablation Study
- (1)
Module ablation experiments: We conducted module ablation experiments on the EORSSD dataset to evaluate the effectiveness of our proposed ResNet34+Swin transformer model compared to baseline models using either ResNet34 or the Swin transformer alone. Additionally, we evaluated the U-shaped encoder–decoder network architecture to determine its effectiveness, with the modular ablation study focusing on two specific modules, namely LGFF and AG. To visually demonstrate the effectiveness of our proposed architecture and modules, we present saliency maps in
Figure 11, highlighting the impact of different ablation modules and baselines.
As shown in
Figure 11d,e, the incorporation of AG modules significantly improves the ability to characterize feature information across multiple layers; it reduces redundant interference information and suppresses image noise, which ultimately leads to improved prediction accuracy. Our LGFF module successfully integrates global and local information, as shown in Figure 11d,f, effectively capturing both the macro global context and fine local details. While the direct U-shaped network structure, as shown in the second and fourth panels of Figure 11, may have slightly lower accuracy, it performs better on salient objects with irregular topology, as shown in the first and fifth panels of Figure 11. Comparing panels (g)–(h) and (i)–(j) in Figure 11, it can be seen that the U-shaped codec structure yields significantly better results than relying solely on the basic Swin transformer and ResNet34 models, providing better detection results and capturing more detailed information.
Table 4 shows the quantitative results of our ablation study. We performed eight ablation experiments to evaluate the effectiveness of the U-shape codec architecture, the LGFF module and the AG module. To meet the requirements of the ablation experiments in which the LGFF module is a variable, we inserted the baseline codec layer (ResNet34 + Swin transformer + U-shape codec architecture) at the appropriate position. Although the performance of the TCM-Net baseline was lower than that of many of the comparison methods in Table 1, it can still be considered a relatively good model. To address the multi-scale problem of ORSIs, we utilized a U-shape codec architecture to improve the detection accuracy, feeding the features of the encoding part into the decoding modules through skip connections. The results in
Table 4 show a significant performance improvement with the adoption of the U-shape codec architecture. Adding the U-shape architecture to ResNet34 improves $F_\beta^{max}$ by 7.3% and $S_m$ by 3.7% and reduces the MAE by 17.9% compared to the original ResNet34. Adding the U-shape architecture to the Swin transformer results in even more significant improvements: $F_\beta^{max}$ is improved by 60%, $S_m$ is improved by 21.1% and the MAE is reduced by 61.2% compared to the original Swin transformer.
To optimize the fusion of local and global information from the CNN and transformer branches, we introduced the LGFF module. However, the improvement achieved with the LGFF module alone is not large, because our TCM-Net baseline uses a 7-layer decoding structure and adding the LGFF module removes one decoding layer. By improving feature extraction and reducing redundant information, our AG module brings a clear performance gain, raising $F_\beta^{max}$ from 0.8980 to 0.9126 and $S_m$ from 0.9235 to 0.9332 and reducing the MAE from 0.0077 to 0.0066. Our TCM-Net structure integrates local and global information using the LGFF module and the U-shaped decoder architecture, while the AG module controls redundant information to improve feature characterization. Compared to our network baseline, the full TCM-Net structure achieves a 1.83% improvement in $F_\beta^{max}$, a 1.05% improvement in $S_m$ and a 14.5% reduction in MAE.
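For readers who want intuition for the gating idea evaluated above, the following is a generic additive attention gate in the spirit of the AG module: it re-weights skip-connection features with a decoder-side signal. It is an illustrative sketch only, with assumed channel sizes and design choices, not the actual TCM-Net implementation.

```python
import torch
import torch.nn as nn

class AttentionGate(nn.Module):
    """Generic additive attention gate: suppresses redundant responses in a skip path."""
    def __init__(self, skip_ch, gate_ch, inter_ch):
        super().__init__()
        self.theta = nn.Conv2d(skip_ch, inter_ch, kernel_size=1)  # project skip features
        self.phi = nn.Conv2d(gate_ch, inter_ch, kernel_size=1)    # project gating signal
        self.psi = nn.Conv2d(inter_ch, 1, kernel_size=1)          # per-pixel attention score
        self.relu = nn.ReLU(inplace=True)
        self.sigmoid = nn.Sigmoid()

    def forward(self, skip, gate):
        # `skip`: encoder feature map; `gate`: decoder feature map resized to the same H x W.
        attn = self.sigmoid(self.psi(self.relu(self.theta(skip) + self.phi(gate))))
        return skip * attn  # gated skip features passed on to the decoder
```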
- (2)
Loss ablation experiments: We performed ablation experiments on the EORSSD dataset [
19] to demonstrate the effectiveness of our hybrid loss function strategy, which is specifically designed for the network structure. To demonstrate the complementarity of BCE, IoU, SSIM and Fm in the loss function, we compared five training settings, the first four being common loss strategies: (1) training our model only with BCE loss, e.g., [
6]; (2) training our model with BCE-IoU loss, e.g., [
42]; (3) training our model jointly with the three loss functions of BCE, IoU and SSIM, e.g., [
22]; (4) training our model jointly with the three loss functions of BCE, IoU and Fm, e.g., [
40]; and (5) using our selected hybrid function and allocation strategy that includes BCE, IoU, SSIM and Fm. The results are shown in
Table 5.
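For illustration, the sketch below combines the four loss terms named above (BCE, IoU, SSIM and Fm) with equal weights on probability maps. The weighting and the allocation of losses to side outputs in our actual strategy are not reproduced here, so treat this as an assumption-laden reference implementation rather than our exact loss.

```python
import torch
import torch.nn.functional as F

def iou_loss(pred, gt, eps=1e-6):
    # Soft IoU loss on probability maps of shape (B, 1, H, W).
    inter = (pred * gt).sum(dim=(2, 3))
    union = (pred + gt - pred * gt).sum(dim=(2, 3))
    return (1 - (inter + eps) / (union + eps)).mean()

def fmeasure_loss(pred, gt, beta2=0.3, eps=1e-6):
    # Differentiable F-measure loss with beta^2 = 0.3, mirroring the evaluation metric.
    tp = (pred * gt).sum(dim=(2, 3))
    precision = tp / (pred.sum(dim=(2, 3)) + eps)
    recall = tp / (gt.sum(dim=(2, 3)) + eps)
    fm = (1 + beta2) * precision * recall / (beta2 * precision + recall + eps)
    return (1 - fm).mean()

def ssim_loss(pred, gt, window=11, eps=1e-6):
    # Single-scale SSIM loss using an average-pooling window (a common simplification).
    C1, C2 = 0.01 ** 2, 0.03 ** 2
    mu_p = F.avg_pool2d(pred, window, 1, window // 2)
    mu_g = F.avg_pool2d(gt, window, 1, window // 2)
    sigma_p = F.avg_pool2d(pred * pred, window, 1, window // 2) - mu_p ** 2
    sigma_g = F.avg_pool2d(gt * gt, window, 1, window // 2) - mu_g ** 2
    sigma_pg = F.avg_pool2d(pred * gt, window, 1, window // 2) - mu_p * mu_g
    ssim = ((2 * mu_p * mu_g + C1) * (2 * sigma_pg + C2)) / \
           ((mu_p ** 2 + mu_g ** 2 + C1) * (sigma_p + sigma_g + C2) + eps)
    return (1 - ssim).mean()

def hybrid_loss(logits, gt):
    # logits: raw network output (B, 1, H, W); gt: binary ground truth in [0, 1].
    pred = torch.sigmoid(logits)
    bce = F.binary_cross_entropy_with_logits(logits, gt)
    return bce + iou_loss(pred, gt) + ssim_loss(pred, gt) + fmeasure_loss(pred, gt)
```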
In this study, we evaluated the effectiveness of different loss functions for our deep-learning-based ORSI-SOD model on the EORSSD dataset. Our model trained with only the BCE loss function was the least effective, achieving an $F_\beta^{max}$ of 0.8459, an MAE of 0.0095 and an $S_m$ of 0.8864. Nevertheless, even this variant outperformed four of the six deep-learning-based ORSI-SOD methods listed in Table 1. In Experiment 2, we used the common BCE-IoU loss and achieved better performance than the 17 state-of-the-art methods in Table 1, with an $F_\beta^{max}$ of 0.9129, an MAE of 0.0070 and an $S_m$ of 0.9309. This result indirectly indicates that our model is well suited to combining multiple loss functions.
We also tested our model using the BCE-IoU loss in combination with the SSIM loss and the Fm loss, respectively. Compared to the BCE-IoU loss, the hybrid loss using SSIM resulted in a decrease in $F_\beta^{max}$ and $S_m$ to 0.9080 and 0.9295, respectively, with only a small decrease in MAE to 0.0069. The hybrid loss using Fm resulted in a decrease in $F_\beta^{max}$ to 0.9103, an increase in MAE to 0.0071 and an improvement in $S_m$ to 0.9314. We also compared our designed hybrid loss function strategy with a control group that simply stacked the detail loss, global loss and final prediction loss together. Our designed strategy outperformed the control group, achieving an $F_\beta^{max}$ of 0.9144, an MAE of 0.0066 and an $S_m$ of 0.9332, whereas the control group achieved an $F_\beta^{max}$ of 0.9038, an MAE of 0.0071 and an $S_m$ of 0.9272. These results show that simply stacking additional loss functions does not necessarily improve the performance of the network. Our experimental comparison confirms that the best performance is achieved with our hybrid loss function strategy, which is specifically designed for the network structure.
4.9. Failure Case
Although TCM-Net shows superior performance on the three public ORSI-SOD datasets, our method still cannot achieve perfect results on some very challenging examples.
The model exhibits poor performance in highly cluttered backgrounds and low contrast scenes, as shown in
Figure 12a, making it difficult to segment objects clearly. Additionally, the subtle difference between salient and non-salient objects causes our model to misclassify non-salient objects as foreground, as presented in Figure 12b,c. Furthermore, our model misidentifies the areas between adjacent salient objects and background regions, as depicted in Figure 12d,e. Finally, salient object parts that closely resemble the background may be missed and treated as background regions during detection, as illustrated in Figure 12f,g. Addressing these issues may require further model optimization and the acquisition of training data from more diverse scenarios.