1. Introduction
As an active microwave sensor, synthetic aperture radar (SAR) operates in all weather conditions, day and night, and can penetrate dry soil and vegetation canopies, which has led to its widespread adoption in both military and civilian applications [1,2,3,4,5]. Considering that objects in SAR images generally convey essential scene information crucial for interpretation, object detection has long been a fundamental research challenge in SAR image analysis.
In practical tasks, object extraction often has to be performed on a single SAR image. Scholars have investigated various solutions to this problem. For example, Shirvany et al. proposed a method anchored in polarimetric features for maritime target detection [1,2,3]; references [6,7,8,9] achieved ship detection by leveraging the physical principles of SAR imaging; and Song et al. approached object segmentation from the perspective of statistical modeling [10].
Among various detection algorithms, the Constant False Alarm Rate (CFAR) detector [11,12] has been extensively studied and widely applied. CFAR-based methods essentially treat targets as statistical anomalies distinct from background clutter and conduct detection with adaptive thresholds derived from specific clutter distribution models. Under homogeneous clutter conditions and well-matched statistical models, these techniques can demonstrate satisfactory detection performance. However, with the expansion of SAR applications and improvements in imaging resolution, monitoring scenarios have become increasingly complicated. This evolution substantially increases the difficulty of accurate clutter modeling, consequently limiting the effectiveness of CFAR in practice. Although researchers have developed enhanced frameworks [13,14], these modifications still fail to fully address the demanding requirements of contemporary detection tasks. Meanwhile, CFAR processing employs a local contrast strategy, making it inherently sensitive to object size [15] and often ineffective for objects exhibiting high global but low local contrast. Moreover, CFAR methods rely exclusively on pixel-wise backscatter intensity while failing to exploit other rich information in SAR imagery, rendering them particularly vulnerable to multiplicative texture noise. These limitations collectively impair the performance of CFAR-based algorithms in increasingly complex detection scenarios.
Recently, the rapid development of computer vision (CV) techniques and their demonstrated success in optical image detection have inspired researchers to explore CV-based approaches for SAR image analysis. Saliency-based methods and deep learning (DL)-based techniques represent two prominent directions in this trend.
DL-based methods usually leverage the powerful automatic feature extraction capability of neural networks to capture high-level semantic information from images and then realize end-to-end object detection. With enough samples for training the neural networks, DL-based methods can achieve excellent detection performance and generalization ability. For instance, Zhou et al. proposed an anchor-free Convolutional Neural Network (CNN) framework that integrates multi-level feature refinement and outperforms several typical competitors in SAR ship detection tasks [16]; Jie Zhou et al. designed a novel network that introduces diffusion models to SAR target detection and successfully locates aircraft [17]. Although DL-based methods have demonstrated their performance in various application scenes, this work focuses on detection based on a single SAR image. In this case, there is a lack of sufficient data and prior knowledge to support the training of CNN models. Therefore, DL-based methods are not the research subject of this study. It should be noted that while simulation-based and synthetic data augmentation techniques have been adopted to build SAR image datasets, as exemplified by the Synthetic and Measured Paired and Labeled Experiment (SAMPLE) dataset [18,19], two critical challenges still persist: insufficient fidelity in synthetic data representations and limited generalization of trained models when applied to real-world measured SAR data. These issues require further in-depth research for reliable SAR interpretation.
On the other hand, when performing tasks such as scene understanding, the human visual system can rapidly focus attention on regions of interest while ignoring the background. This selective process is known as the visual attention mechanism [20]. In recent years, many scholars have introduced visual attention into SAR object detection, developing numerous saliency models that have demonstrated their effectiveness. For example, Lai et al. designed a mechanism for weak target detection by referring to the ITTI saliency model [21]; Zhang et al. proposed an oil tank detection method based on saliency analysis [22]; Jin et al. achieved hurricane eye extraction by utilizing a classic saliency framework that integrates brightness and orientation features [23].
Compared with CFAR detection, saliency-based methods often do not require pre-processing for noise reduction and are not reliant on specific clutter models. Meanwhile, saliency detection can integrate multiple features, including echo power, local contrast, and global contrast, fully exploiting the various information carried by SAR images. It can be said that saliency-based methods successfully circumvent several major obstacles that traditional algorithms, represented by CFAR detectors, struggle to overcome.
Currently, research on SAR saliency detection primarily progresses along two main directions. The first approach integrates saliency with learning-based techniques, particularly CNNs. In such methods, saliency serves as crucial information and is typically involved in model construction in the form of features or weights. Similar to DL-based processing, if there are enough SAR images for model training, learning-based saliency can achieve excellent detection performance. However, this work focuses on the interpretation of single SAR imagery, and learning-based processing is not applicable to the discussion herein. The second approach derives from the definition of saliency, conducting detection based on an understanding of saliency. For example, Zhai et al. argued that salient objects in SAR images should present both high local and global contrast, thereby realizing inshore ship detection based on this understanding [24]. Ni et al. considered regions that significantly differ from the background as salient targets and then designed a two-stage detection framework accordingly [25]. Gao et al. believed that saliency should comprehensively consider local contrast, edge density, and boundary connectivity, and based on this, they constructed a saliency algorithm for river detection [26]. Generally, definition-based methods first construct a saliency map (SM) to highlight objects and suppress the background from the perspective of saliency understanding; then, they achieve object extraction by segmenting the SM. Unlike learning-based processing, these methods do not rely on sample datasets but directly operate on single SAR images. Therefore, definition-based saliency is more appropriate for the discussion in this paper.
Traditional definition-based algorithms predominantly employ a bottom-up mechanism to construct saliency maps. In recent years, many researchers have attempted to incorporate top-down strategies to further enhance the accuracy of saliency detection. For instance, references [27,28] determined the scale of the Gaussian pyramid based on specific tasks, ensuring that the SM has an excellent highlighting effect on the target. Reference [29] utilized morphological operations to filter out interference regions in the SM according to object sizes, thereby more effectively suppressing background noise. However, such processing depends on the specific detection task and requires corresponding prior knowledge. To address this limitation, this work proposes a two-channel framework by referring to guided search theory (GST) [30]. Within this framework, the prior acquisition processing channel simulates the "where" pathway in GST, automatically computing an object prior indication based on edge information and assigning more saliency to potential object regions. Meanwhile, the feature extraction processing channel simulates the "what" pathway in GST, extracting four typical features—brightness, frequency, global contrast, and local contrast—to further reinforce the object presence. Finally, the outputs of these two channels are fused via Bayesian inference to generate an initial SM.
An ideal SM should thoroughly suppress background regions while exclusively highlighting objects. However, due to the complexity of practical scenes, some non-object areas may also exhibit high saliency levels, which inevitably hinders subsequent object extraction based on the SM. Current research often designs complicated segmentation strategies, or employs discrimination processing, to ensure the final performance of saliency detection. For example, Zhang et al. proposed a localization algorithm based on the active contour model, which accurately identifies oil tanks from the SM [31]. Wang et al. applied morphological filtering and clustering to eliminate false alarms after obtaining preliminary detection results through SM binarization, thereby improving the accuracy of the final output [27]. However, most of these methods have high computational complexity and may be subject to certain limitations in practical tasks. Correspondingly, an adaptive iteration mechanism is designed for SM improvement in this work. Through multiple cycles of iteration, it continuously reinforces the presence of objects and suppresses the saliency of the background. In the final SM, there is a clear distinction between objects and the background; thus, object extraction can be realized with a simple global threshold.
Integrating the prior calculation and the SM modification, this paper proposes a two-channel Bayesian framework with adaptive iteration for salient object detection in single SAR imagery. The main contributions of this work are summarized as follows:
To improve the performance of object detection in single SAR images, this paper proposes a two-channel saliency framework with Bayesian inference and adaptive iteration. The qualitative and quantitative experiments on real SAR datasets demonstrate that our method can present better detection results than several classic competitors.
To acquire an ideal SM that assigns high saliency values to potential object regions, we develop a two-channel framework with a top-down mechanism by imitating guided search theory in the human visual system. It first calculates the prior and the feature information in the two channels, respectively, then utilizes Bayesian inference to integrate them, and finally generates an initial SM.
To further rectify the results of SM generation, we provide an adaptive iteration mechanism. Through iterative processing, object areas are progressively highlighted while background clutter is continuously suppressed. Ultimately, a distinct contrast between the object and background emerges in the SM, allowing for straightforward object extraction via simple threshold segmentation.
2. Methodology
The structure of the proposed algorithm is illustrated in Figure 1. The prior acquisition processing channel first calculates an improved object indication using edge strength and standard deviation, subsequently generating object and background priors. Meanwhile, the feature extraction processing channel derives four features: brightness, frequency (count of pixel value occurrence), local contrast, and global contrast. The outputs from these channels are then fused through Bayesian inference to produce an initial SM. After that, an iterative mechanism is implemented: if the SM fails to meet the termination condition, it serves as a new improved object indication to update the priors for Bayesian inference in the next round, thereby producing a new SM. Upon satisfying the termination condition, the SM generated by the current iteration becomes the final SM of the proposed algorithm. At this stage, there is a clear distinction between objects and the background in the SM, and the detection results can be obtained through simple OTSU segmentation [32].
2.1. Acquisition of Object/Background Prior
The prior acquisition processing channel operates in a task-independent manner, automatically computing improved object indication to quantify the probability of corresponding areas being objects, and finally outputting the prior for Bayesian inference.
2.1.1. Edge Strength Index
Edges can effectively delineate the structure of scene distribution and reveal potential locations of objects. Therefore, we utilize edge information to estimate the probability of specific regions belonging to object components. The most classical approach for SAR edge detection is the Ratio of Average (ROA) operator [33], which we first introduce in this section.
(1) ROA Operator
The ROA operator enables CFAR edge detection along four directions $\theta \in \{0°, 45°, 90°, 135°\}$, and its overall structure is illustrated in Figure 2, in which the blue point represents the current pixel, while the gray regions $W_1^{\theta}$ and $W_2^{\theta}$ denote the reference windows established around the current pixel in a specified direction $\theta$. When the ROA operator is applied to an image, it first calculates the gray-level mean of the pixels within each reference window, as expressed in Equation (1),

$$\mu_k^{\theta} = W_k^{\theta} * I, \quad k = 1, 2,$$

where $I$ denotes the input image; $W_k^{\theta}$ represents the reference window in the direction $\theta$; "*" denotes the convolution operation; and $\mu_k^{\theta}$ is the gray-level weighted mean of the pixels within the reference window $W_k^{\theta}$. Subsequently, the edge strength response of the current pixel, referred to as $\mathrm{ES}_{\mathrm{ROA}}$, can be calculated by Equation (2), where max(·) and min(·) represent the maximum and minimum operators, respectively.
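For readers who want to reproduce this step, the following NumPy/SciPy sketch computes a ratio-of-averages edge response in the spirit of Equations (1) and (2); the window size, the placement of the opposing reference windows, and the normalization of the ratio are illustrative assumptions rather than the exact settings of the paper.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def roa_edge_strength(img, win=3):
    """Ratio-of-Averages edge response, a sketch in the spirit of Eqs. (1)-(2).

    The means of two opposing reference windows are estimated with a box
    filter and shifted copies of it for each of the four directions; the
    per-pixel edge strength is the strongest normalized ratio response.
    Window size, offsets, and normalization are illustrative assumptions.
    """
    img = img.astype(np.float64) + 1e-6              # avoid division by zero
    mu = uniform_filter(img, size=win)               # local mean over a win x win box
    d = win                                          # distance between the two window centres
    offsets = [((0, -d), (0, d)),                    # 0 deg
               ((-d, d), (d, -d)),                   # 45 deg
               ((-d, 0), (d, 0)),                    # 90 deg
               ((-d, -d), (d, d))]                   # 135 deg
    es = np.zeros_like(img)
    for (dy1, dx1), (dy2, dx2) in offsets:
        mu1 = np.roll(mu, shift=(dy1, dx1), axis=(0, 1))
        mu2 = np.roll(mu, shift=(dy2, dx2), axis=(0, 1))
        ratio = np.minimum(mu1 / mu2, mu2 / mu1)     # in (0, 1], 1 means no edge
        es = np.maximum(es, 1.0 - ratio)             # strongest response over directions
    return es
```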
(2) Directional Derivative Function of Anisotropic Gaussian kernels
It should be noted that the ROA operator only employs kernels in four directions. This may lead to compromised detection performance when handling edges with finer directional variations. To address this limitation, we augment the ROA operator by incorporating directional derivatives of anisotropic Gaussian kernels. These kernels not only provide enhanced directional sensitivity but also exhibit inherent noise suppression capabilities, thereby improving the overall robustness of edge detection.
The two-dimensional Gaussian kernel function is given by Equation (3), where $\sigma$ and $\rho$, respectively, indicate the scale factor and the anisotropic factor, and $(x, y)$ denotes one specific cell in the kernel. According to Reference [34], the anisotropic Gaussian kernel can be derived by applying a rotation matrix to the 2D Gaussian kernel, as shown in Equation (4), where $R_{\theta}$ is the rotation matrix.
Taking the derivative of the anisotropic Gaussian kernel along the $\theta$ direction, the directional derivative function of the anisotropic Gaussian kernel, denoted as $g'_{\theta}$, is expressed as shown in Equation (5). As illustrated in Figure 3, the pair of reference windows in the kernel have opposite signs, which results in a high sensitivity to grayscale variations. Therefore, $g'_{\theta}$ can be regarded as an edge detector along the $\theta$ direction.
Unlike the ROA operator, the directional derivative function does not impose specific constraints on the value of $\theta$ (given the symmetry of the kernel, $\theta$ is typically selected from $[0, \pi]$). This allows for edge detection in arbitrary directions. In this work, $\theta$ is sampled at a set of discrete directions, and the corresponding kernels are illustrated in Figure 3.
The edge strength response ($\mathrm{ES}_{\mathrm{G}}$) of this detector can be computed from Equation (6) by applying $g'_{\theta}$ to the image, where "*" denotes the convolution operation.
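A hedged sketch of this detector is given below: it builds a directional-derivative kernel of an anisotropic Gaussian and convolves it with the image. The parametrization of Equations (3)–(5), the number of sampled directions, and the kernel normalization are assumptions made only for illustration.

```python
import numpy as np
from scipy.ndimage import convolve

def aniso_gauss_deriv_kernel(theta, sigma=2.0, rho=2.0, radius=8):
    """Directional derivative of an anisotropic Gaussian (cf. Eqs. (3)-(5)).

    The kernel is elongated by the factor rho along the direction theta and
    differentiated along that direction, yielding the opposite-signed lobes
    shown in Figure 3.  The exact parametrization in the paper may differ.
    """
    y, x = np.mgrid[-radius:radius + 1, -radius:radius + 1].astype(np.float64)
    u = x * np.cos(theta) + y * np.sin(theta)        # coordinate along theta
    v = -x * np.sin(theta) + y * np.cos(theta)       # coordinate across theta
    g = np.exp(-(u ** 2 / rho ** 2 + (rho ** 2) * v ** 2) / (2.0 * sigma ** 2))
    dg = -(u / (rho ** 2 * sigma ** 2)) * g          # derivative along the theta direction
    return dg / np.sum(np.abs(dg))                   # normalise kernel energy

def es_gaussian(img, n_dirs=8, **kw):
    """ES_G (cf. Eq. (6)): maximum magnitude of the directional responses."""
    img = img.astype(np.float64)
    responses = [np.abs(convolve(img, aniso_gauss_deriv_kernel(t, **kw)))
                 for t in np.linspace(0.0, np.pi, n_dirs, endpoint=False)]
    return np.max(responses, axis=0)
```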
(3) Edge Strength Index
In the above process, we utilized the ROA operator and the directional derivative function of the anisotropic Gaussian kernel to obtain $\mathrm{ES}_{\mathrm{ROA}}$ and $\mathrm{ES}_{\mathrm{G}}$, respectively. The final Edge Strength Index (ESI) for the original SAR image can be calculated using Equation (7), where "$\odot$" denotes the element-wise multiplication of two matrices.
Figure 4 illustrates an example of ESI calculation. It can be seen that $\mathrm{ES}_{\mathrm{ROA}}$ exhibits significant interference caused by speckle noise. $\mathrm{ES}_{\mathrm{G}}$ appears overly sensitive to variations in grayscale intensity, resulting in fragmented detections in regions where the edge direction changes, as well as strong responses within the object interior. In contrast, the ESI effectively suppresses such interference while providing a more complete delineation of the object contour, thereby facilitating subsequent processing.
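Reading Equation (7) directly as an element-wise product, the combination can be written as follows; the min–max normalization applied to each response before the product is an added assumption.

```python
import numpy as np

def edge_strength_index(es_roa, es_g):
    """ESI (cf. Eq. (7)): element-wise product of the two edge responses."""
    norm = lambda m: (m - m.min()) / (m.max() - m.min() + 1e-12)  # min-max scaling (assumed)
    return norm(es_roa) * norm(es_g)
```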
2.1.2. Improved Object Indication
In this subsection, we first extract single-pixel-width edges from the ESI using Non-maxima Suppression (NMS) [35] and a straightforward global threshold segmentation. Empirically, the threshold is set as a function of $\mu$ and $\sigma$, where $\mu$ and $\sigma$ are, respectively, the mean and standard deviation of the NMS-processed image. Based on the edge information, we further adopt a computational approach inspired by visual filling to generate the object indication (OI), which quantifies the probability of specified regions belonging to object components [36].
Given the abundant speckle noise and possible shadows inherent in SAR images, edge detection results are prone to false alarms, causing certain non-object regions in the OI map to also exhibit high indication values, as exemplified by the areas enclosed in red boxes in Figure 5.
References [37,38] demonstrate that local variance can effectively characterize the edge and shape information of objects. Additionally, in SAR images, the grayscale intensity of objects is typically higher than that of the background and shadows, and the grayscale fluctuations caused by multiplicative speckle noise are more pronounced on objects. This implies that local variance can also reflect the presence of objects. Therefore, we incorporate the local variance feature to suppress the indication values of non-object regions in the OI map. The local variance $V(i, j)$ is calculated as in Equation (8),

$$V(i, j) = \frac{1}{N^2} \sum_{(p, q) \in \Omega_{i,j}} \left[ I(p, q) - \bar{I}_{\Omega_{i,j}} \right]^2,$$

where $N$ represents the size of the kernel and is empirically set as 7 (that is, the size of the local-variance window $\Omega_{i,j}$ is 7 × 7); $I$ denotes the input image; $\bar{I}_{\Omega_{i,j}}$ is the mean gray level within the window; and $(i, j)$ indicates the coordinates of the current pixel. As exemplified in Figure 5c, compared with the background and shadows, edges and edge-surrounded regions are obviously highlighted in the variance feature map.
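As a reference implementation, the local variance of Equation (8) can be evaluated with box filters; the E[I²] − (E[I])² formulation below is an equivalent computational shortcut, with the 7 × 7 window taken from the paper.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def local_variance(img, n=7):
    """Local variance over an n x n window (cf. Eq. (8)); n = 7 in the paper."""
    img = img.astype(np.float64)
    mean = uniform_filter(img, size=n)               # E[I] over the window
    mean_sq = uniform_filter(img ** 2, size=n)       # E[I^2] over the window
    return np.maximum(mean_sq - mean ** 2, 0.0)      # clip tiny negatives from rounding
```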
Furthermore, we normalize the local variance feature and fuse it with the OI, further enhancing the presence of objects while suppressing other components. This process is detailed in Equation (9),

$$\mathrm{IOI} = N(\mathrm{OI}) \odot N(V),$$

where $N(\cdot)$ denotes the linear normalization operator (that is, min–max normalization, which linearly scales all elements to [0, 1]); "$\odot$" represents the element-wise multiplication of two matrices; and IOI stands for the improved object indication. As exemplified in Figure 5d, after this refinement, only the object regions in the IOI map maintain high indication values. It can be said that under the combined effect of the OI and the local variance feature, the potential locations of objects are estimated rather appropriately.
2.1.3. Object/Background Prior
A higher indication value in the IOI map corresponds to a greater probability of the location belonging to object components. We normalize the IOI to the range [0, 1] and utilize it as the object prior, as specified in Equation (10),

$$P(\mathrm{obj}) = N(\mathrm{IOI}),$$

where $N(\cdot)$ denotes the normalization operator and $P(\mathrm{obj})$ represents the prior probability of a region belonging to a salient object. Correspondingly, the prior probability of it belonging to the background is $P(\mathrm{back}) = 1 - P(\mathrm{obj})$.
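Putting Equations (9) and (10) together, a minimal sketch of the channel output is given below, where oi and var stand for the object indication map and the local variance map computed above, and N(·) is realized as plain min–max scaling.

```python
import numpy as np

def priors_from_ioi(oi, var):
    """IOI (cf. Eq. (9)) and object/background priors (cf. Eq. (10))."""
    norm = lambda m: (m - m.min()) / (m.max() - m.min() + 1e-12)  # min-max scaling
    ioi = norm(oi) * norm(var)    # improved object indication
    p_obj = norm(ioi)             # prior probability of belonging to an object
    p_back = 1.0 - p_obj          # prior probability of belonging to the background
    return ioi, p_obj, p_back
```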
2.2. Feature Extraction
The feature extraction channel is designed for acquiring feature information that enhances the saliency of objects, thereby enabling more accurate determination of their existence. In SAR images, object saliency can be considered from three perspectives: (1) objects generally exhibit strong backscatter intensity; (2) whether viewed in the context of the entire image or compared to local surrounding regions, objects are expected to demonstrate uniqueness and distinctiveness in terms of intensity, shape, or other attributes; (3) the human visual system instinctively focuses on rare or anomalous regions within a scene [39]; compared with the abundant background elements, objects usually occupy a smaller proportion, making them more likely to attract visual attention. Accordingly, we employ four features—brightness, frequency, local contrast, and global contrast—to characterize object saliency.
(1) Brightness
In SAR images, the intensity of the echo power is reflected by grayscale levels. We directly utilize the original image, after linear normalization, as the brightness feature, as shown in Equation (11).
(2) Frequency
In SAR images, particularly those capturing large scenes, objects usually occupy a minor proportion of the image. Thus, the corresponding grayscale components will appear infrequently across the image. We employ a frequency feature to characterize this sparsity. Here, “frequency” refers to the occurrence rate of pixel values, which differs from the conventional concept in image processing (typically computed via Fourier transform) that describes the rate of grayscale variations.
The frequency feature can be calculated by Equation (12), where $N_{\mathrm{total}}$ denotes the total number of pixels in the input image. If a pixel at coordinates $(i, j)$ has a grayscale level $g$, then $n_g$ represents the count of all pixels in the SAR image with grayscale level $g$. The parameter $\lambda$ is an empirical constant, set to 1 in the experiments presented in this work. Due to the infrequent occurrence of the corresponding grayscale components, objects will exhibit higher frequency feature values, while the background is relatively suppressed.
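Because the exact expression of Equation (12) is not reproduced here, the snippet below only illustrates the stated behavior—rarely occurring gray levels receive large feature values—using an assumed inverse-count form with λ = 1.

```python
import numpy as np

def frequency_feature(img, lam=1.0, levels=256):
    """Frequency feature (cf. Eq. (12)): rare grey levels receive large values.

    The inverse-count form used here is an assumption; only the per-level
    counts and the empirical constant lambda = 1 come from the paper.
    """
    img = np.clip(img, 0, levels - 1).astype(np.int64)
    counts = np.bincount(img.ravel(), minlength=levels)          # occurrences of each grey level
    feat = img.size / (counts[img].astype(np.float64) + lam)     # rarer level -> larger value
    return (feat - feat.min()) / (feat.max() - feat.min() + 1e-12)
```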
(3) Local contrast
Objects usually exhibit various differences from their surrounding elements. We quantify this local dissimilarity using the method introduced in reference [40]. The detailed steps are described below:
(S1) A reference window of size $3w \times 3w$ is centered at the current pixel. This window is uniformly divided into 9 sub-windows of size $w \times w$, which are numbered from 0 to 8, as illustrated in Figure 6.
(S2) The maximum grayscale value of the pixels within sub-window 0 is denoted as $L_0$.
(S3) Calculate the mean grayscale values of sub-windows 1 to 8, respectively. The maximum of these 8 mean values is denoted as $m_{\max}$.
(S4) The local contrast feature $\mathrm{LC}(i, j)$ of the current pixel $(i, j)$ is calculated from $L_0$ and $m_{\max}$ using Equation (13), where $k$ is an empirical parameter set to 5 in this work (a code sketch of these steps follows).
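The sketch below follows steps (S1)–(S4) with sliding-window filters; the sub-window size w, the ratio-based combination of L0 and m_max, and the cap at k = 5 are assumptions about the exact form of Equation (13).

```python
import numpy as np
from scipy.ndimage import uniform_filter, maximum_filter

def local_contrast(img, w=3, k=5.0):
    """Local contrast feature following steps (S1)-(S4) (cf. Eq. (13)).

    L0 is approximated by the maximum over the central w x w sub-window and
    m_max by the largest mean over the eight surrounding sub-windows; the
    ratio-based combination capped by k = 5 is an assumed form of Eq. (13).
    """
    img = img.astype(np.float64)
    l0 = maximum_filter(img, size=w)                 # max of the central sub-window
    mean = uniform_filter(img, size=w)               # mean of a w x w sub-window
    m_max = np.zeros_like(img)
    for dy in (-w, 0, w):
        for dx in (-w, 0, w):
            if dy == 0 and dx == 0:
                continue                             # skip the central sub-window
            m_max = np.maximum(m_max, np.roll(mean, (dy, dx), axis=(0, 1)))
    return np.minimum(l0 / (m_max + 1e-6), k)
```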
(4) Global contrast
From the perspective of an entire SAR scene, objects typically attract visual attention due to their distinctiveness. We employ the global contrast feature to quantify this object–background discriminability, as mathematically formulated in Equation (14), where $\mathrm{GC}(i, j)$ denotes the global contrast corresponding to the pixel at coordinates $(i, j)$; $I$ represents the input image, $\mu_I$ is its mean grayscale value, and $N_{\mathrm{total}}$ is the total number of pixels in the input image $I$; the parameter $\alpha$ is an empirical threshold. In this paper, we adopt an adaptive approach to set $\alpha$, as shown in Equation (15), where $\sigma_I$ represents the standard deviation of the input image.
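The ingredients of Equations (14) and (15)—the image mean, its standard deviation, and the adaptive threshold α—can be combined as in the following sketch; the specific thresholded-deviation form and the choice α = μ + σ are assumptions rather than the paper's exact expressions.

```python
import numpy as np

def global_contrast(img):
    """Global contrast feature (cf. Eqs. (14)-(15)).

    The thresholded-deviation form and the choice alpha = mu + sigma are
    assumptions; only the ingredients (image mean, standard deviation, and
    an adaptive threshold alpha) come from the paper.
    """
    img = img.astype(np.float64)
    mu, sigma = img.mean(), img.std()
    alpha = mu + sigma                               # assumed form of the adaptive threshold
    gc = np.where(img > alpha, img - mu, 0.0)        # deviation of globally distinctive pixels
    return (gc - gc.min()) / (gc.max() - gc.min() + 1e-12)
```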
2.3. Saliency Map Generation
Based on IOI revealing potential locations of objects and four features determining the object presence, this section first utilizes Bayesian inference to fuse channel outputs, generating an initial SM. Furthermore, an adaptive iteration mechanism is utilized to refine the SM progressively. In the final SM, object regions will exhibit high saliency values, while other components are substantially suppressed.
According to Bayesian principles, the posterior probability that a certain pixel $x$ in an image belongs to the object components can be expressed as Equation (16),

$$P(\mathrm{obj} \mid x) = \frac{P(\mathrm{obj})\, p(x \mid \mathrm{obj})}{P(\mathrm{obj})\, p(x \mid \mathrm{obj}) + P(\mathrm{back})\, p(x \mid \mathrm{back})},$$

where $P(\mathrm{obj})$ and $P(\mathrm{back})$ are the prior probabilities introduced in Section 2.1.3; $p(x \mid \mathrm{obj})$ and $p(x \mid \mathrm{back})$ represent the likelihood probabilities of observing the sample $x$ in the object and background regions, respectively.
Calculating the likelihood probabilities requires object and background sample sets. Here, the threshold selection method described in reference [41] is introduced to determine the optimal threshold $T_{\mathrm{opt}}$. Then, we use $T_{\mathrm{opt}}$ to binarize the IOI map, roughly estimating the object/background sample sets. Specifically, pixels with indication values higher than $T_{\mathrm{opt}}$ form the object sample set $S_{\mathrm{obj}}$, while the remaining areas constitute the background sample set $S_{\mathrm{back}}$.
2.3.1. The Calculation of Weight for Feature Fusion
The Bayesian inference process, when utilized to integrate the two channel outputs, requires us to fuse multiple features. Many current saliency algorithms based on feature integration treat different features as equally important. For instance, reference [24] directly combines local contrast and global contrast under the assumption that they contribute equally. However, such processing fails to consider that some features may have a stronger descriptive capability and should be assigned more weight, which might further enhance the SM performance.
Accordingly, we assign fusion weights based on the descriptive performance of the features. The calculation of the weight is shown in Equation (17), where $w_f$ is the fusion weight for feature $f$; $F_{\mathrm{obj}}^{f}$ (or $F_{\mathrm{back}}^{f}$) denotes the set of values in feature map $f$ corresponding to the sample set $S_{\mathrm{obj}}$ (or $S_{\mathrm{back}}$); and $\mathrm{mean}(\cdot)$ calculates the average value. Thus, features that demonstrate significant differences between the object and the background are assigned higher fusion weights, whereas features with minor differences receive lower weights. Compared with an SM generated by equal-weight fusion, our SM will exhibit superior object highlighting and improved background suppression.
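A compact reading of Equation (17) is sketched below: each feature is weighted by the gap between its mean over the object samples and its mean over the background samples; normalizing the weights to sum to one is an added assumption.

```python
import numpy as np

def fusion_weights(feature_maps, obj_mask, back_mask):
    """Fusion weights (cf. Eq. (17)) from object/background mean differences."""
    diffs = np.array([abs(f[obj_mask].mean() - f[back_mask].mean())
                      for f in feature_maps])
    return diffs / (diffs.sum() + 1e-12)             # normalisation to sum 1 is an assumption
```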
2.3.2. The Calculation of Likelihood Probabilities
The estimation of the likelihood probabilities relies on the 'feature conditional independence assumption' [42], which presumes statistical independence among features. In fact, it remains challenging to rigorously prove this assumption mathematically, and the independence between brightness and the contrast features derived from it has always been controversial [43]. On the other hand, numerous methods that directly adopt this assumption have achieved excellent results; for instance, the classic Naive Bayes classifier, widely applied in pattern recognition, operates under this assumption [42]. Therefore, our method still adheres to this assumption.
Based on this independence, the likelihood probabilities of a pixel $x$ in the input image $I$, $p(x \mid \mathrm{obj})$ and $p(x \mid \mathrm{back})$, can be calculated by Equations (18) and (19), respectively,

$$p(x \mid \mathrm{obj}) = \prod_{f} p(x_f \mid \mathrm{obj}), \qquad p(x \mid \mathrm{back}) = \prod_{f} p(x_f \mid \mathrm{back}),$$

where $p(x_f \mid \mathrm{obj})$ represents the probability that $x_f$, the pixel in the feature map $f$ ($f \in \{$brightness, frequency, local contrast, global contrast$\}$), belongs to the object regions; similarly, $p(x_f \mid \mathrm{back})$ represents the probability that $x_f$ belongs to the background regions. $x$ and $x_f$ correspond to each other, which indicates that the feature map $f$ has the same size as the input image $I$, and the pixel $x_f$ in $f$ has the same position as the pixel $x$ in $I$. $p(x_f \mid \mathrm{obj})$ and $p(x_f \mid \mathrm{back})$ can be calculated by Equations (20) and (21), respectively,

$$p(x_f \mid \mathrm{obj}) = \frac{n_{\mathrm{obj}}^{f}(x_f)}{N_{\mathrm{obj}}^{f}}, \qquad p(x_f \mid \mathrm{back}) = \frac{n_{\mathrm{back}}^{f}(x_f)}{N_{\mathrm{back}}^{f}},$$

where $N_{\mathrm{obj}}^{f}$ represents the total number of pixels in $F_{\mathrm{obj}}^{f}$ (i.e., the number of pixels in the object set $S_{\mathrm{obj}}$); $n_{\mathrm{obj}}^{f}(x_f)$ represents the number of pixels in $F_{\mathrm{obj}}^{f}$ that have the same feature value as $x_f$ ($N_{\mathrm{back}}^{f}$ and $n_{\mathrm{back}}^{f}(x_f)$ are defined analogously for the background).
By substituting the derived likelihood probabilities into Equation (16), we obtain the posterior probability, which represents the saliency value of pixel x in the current iteration.
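Under the conditional-independence assumption, the likelihoods of Equations (18)–(21) and the posterior of Equation (16) can be estimated as in the sketch below. Quantizing each normalized feature map into a fixed number of bins and applying Laplace smoothing are implementation choices made here, and the fusion weights of Equation (17) are omitted for brevity.

```python
import numpy as np

def saliency_from_bayes(feature_maps, p_obj_prior, obj_mask, back_mask, bins=64):
    """Likelihoods (cf. Eqs. (18)-(21)) and posterior saliency (cf. Eq. (16)).

    Feature maps are assumed to be normalised to [0, 1]; per-feature
    likelihoods are relative frequencies of the quantised feature value
    within the object/background sample sets.
    """
    like_obj = np.ones(feature_maps[0].shape, dtype=np.float64)
    like_back = np.ones(feature_maps[0].shape, dtype=np.float64)
    for f in feature_maps:
        q = np.clip((f * (bins - 1)).astype(np.int64), 0, bins - 1)   # quantised feature values
        hist_obj = np.bincount(q[obj_mask], minlength=bins)
        hist_back = np.bincount(q[back_mask], minlength=bins)
        like_obj *= (hist_obj[q] + 1) / (hist_obj.sum() + bins)       # Laplace smoothing (assumed)
        like_back *= (hist_back[q] + 1) / (hist_back.sum() + bins)
    p_back_prior = 1.0 - p_obj_prior
    num = p_obj_prior * like_obj                                      # numerator of Eq. (16)
    return num / (num + p_back_prior * like_back + 1e-12)
```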
2.3.3. The Adaptive Iteration Mechanism
To more thoroughly highlight salient objects and suppress the background, we design an adaptive iteration mechanism to further improve the SM performance. Assuming the iteration index is $t$, $\mathrm{SM}(t)$ represents the SM generated in the current iteration, where the saliency value of pixel $x$ is denoted as $\mathrm{SM}(t, x)$. If the termination condition remains unsatisfied, $\mathrm{SM}(t)$ undergoes simple smoothing (median filtering is employed in our experiments, with the window size empirically set as 7 × 7) and subsequently serves as the new IOI for the next iteration. Then, the object/background sample sets, prior probabilities, and likelihood probabilities are re-calculated, thereby generating a new saliency map, $\mathrm{SM}(t+1)$. This process continues iteratively until the termination condition is satisfied, and the SM produced by the current iteration becomes the final SM of our method.
Here, we utilize the Mean Absolute Error (MAE) [44] to establish the termination condition for the iterative process, as shown in Equations (22) and (23),

$$\mathrm{MAE}(t+1) = \frac{1}{N_{\mathrm{total}}} \sum_{x} \left| \mathrm{SM}(t+1, x) - \mathrm{SM}(t, x) \right|,$$

$$\mathrm{MAE}(t+1) < T_{\mathrm{MAE}},$$

where $T_{\mathrm{MAE}}$ is an empirical threshold. If $T_{\mathrm{MAE}}$ is set too large, the iterative optimization will be insufficient, making it difficult to achieve ideal performance. Conversely, if $T_{\mathrm{MAE}}$ is set too small, the number of iteration rounds will increase, significantly raising the overall computational complexity of the algorithm; moreover, some small or weak objects may be continuously weakened during the iteration process, thereby affecting the local performance of the SM. In our experiments, $T_{\mathrm{MAE}}$ is set to 0.25.
When Equation (23) holds, we consider the variation between $\mathrm{SM}(t+1)$ and $\mathrm{SM}(t)$ to be negligible and the iteration process to have approached convergence. At this stage, the current $\mathrm{SM}(t+1)$ is adopted as the final SM of our algorithm.
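The overall loop can be summarized as follows, where one_round is a hypothetical placeholder for a single pass of the Bayesian inference described above (sample sets, priors, likelihoods, Equation (16)); the saliency maps are assumed to lie in [0, 1] when the MAE is compared against $T_{\mathrm{MAE}}$ = 0.25.

```python
import numpy as np
from scipy.ndimage import median_filter

def iterate_saliency(init_sm, one_round, t_mae=0.25, max_iter=20):
    """Adaptive iteration mechanism (cf. Eqs. (22)-(23))."""
    sm = init_sm
    for _ in range(max_iter):                        # max_iter is a safety cap (assumed)
        ioi = median_filter(sm, size=7)              # 7 x 7 median smoothing, as in the paper
        sm_next = one_round(ioi)                     # one pass: priors, likelihoods, Eq. (16)
        mae = np.mean(np.abs(sm_next - sm))          # Eq. (22)
        sm = sm_next
        if mae < t_mae:                              # termination condition, Eq. (23)
            break
    return sm
```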
2.4. Saliency Map Segmentation
In the final SM of our proposed method, there is a distinct difference between the object and the background. We can easily extract the salient objects using simple OTSU threshold segmentation.
3. Experiments
This section comprehensively evaluates the performance of our proposed algorithm in terms of SM and detection results with real SAR images. The ground-object scenes used in the experiments are selected from the Moving and Stationary Target Acquisition and Recognition (MSTAR) dataset [45,46], as shown in Figure 7a. The maritime-object scenes are derived from the SAR Ship Detection Dataset (SSDD) [47], as illustrated in Figure 8a.
3.1. Data Description
The MSTAR dataset was jointly developed by the Defense Advanced Research Projects Agency (DARPA) and the Air Force Research Laboratory (AFRL) in the United States. The corresponding SAR system operates in the X-band with spotlight mode and HH polarization, achieving an imaging resolution of 0.3 m × 0.3 m. The images employed in the experiments are taken from the BTR-60 (test) subset, in which each image has dimensions of 128 × 128 pixels and the total number of images is 195.
The SSDD is manually constructed and specifically designed for ship detection in SAR imagery. The images are acquired under four polarization modes, with spatial resolutions ranging from 1 m to 15 m. The ships, located either in open seas or inshore areas, are typically positioned at the center of the images. We randomly select 200 images from the SSDD training set for the experiments. Among them, the majority are open-sea scenes, while a small portion are inshore scenes. In the inshore scenes, ships may be connected to harbor facilities with strong echoes, making it impossible to distinguish between them, as illustrated in Figure 8a(i,ii). Since object recognition and identification are not within the scope of this study, these facilities are treated as object regions in the following experiments.
3.2. Experimental Settings
In our experiments, the important parameters were configured as follows: λ, used for calculating the frequency feature, is set to 1; α, used for generating the global contrast feature, is calculated adaptively with Equation (15); $T_{\mathrm{MAE}}$, used for constructing the termination condition, is set to 0.25.
All experiments were conducted on a personal computer equipped with an Intel Core i5-9600 CPU and 32 GB of RAM. The software environment was MATLAB R2022a.
3.3. Experimental Analysis of Saliency Map
The essential step in our proposed method is SM generation, and the quality of the SM can indirectly reflect the final detection results. Hence, we firstly evaluate the performance of our method from the perspective of the SM.
To comprehensively evaluate our SM, on the one hand, we conducted experiments using both ground and maritime scenes from the MSTAR and SSDD datasets, respectively. On the other hand, we introduced four classical saliency models—ITTI [48], Region Contrast (RC) [49], Spectral Residual (SR) [50], and Contour-Guided Visual Search (CGVS) [41]—as benchmarks. Among them, the ITTI model, based on feature integration theory and a local contrast strategy, represents the most classical visual saliency algorithm. The RC model employs a global contrast strategy, and its performance has been successfully validated in optical applications. Unlike the ITTI and RC models, which compute saliency values in the color (or grayscale) space, the SR model is the most classical saliency algorithm that models visual attention in the frequency domain. The CGVS model, similar to the proposed algorithm, also utilizes Bayesian inference.
3.3.1. Qualitative and Quantitative Analysis of SM Generation Results
The SM generation results of the above algorithms based on ground and maritime scenarios are illustrated in
Figure 7 and
Figure 8. Within one figure, row (a) presents several original SAR images used in the experiments; row (b) shows the SM generation results of our proposed method corresponding to SAR images in row (a); rows (c–f), respectively, provide the SM generation results of four benchmark algorithms, which are ITTI, SR, RC, and CGVS in sequence; row (g) presents the ground truth (GT) that is manually annotated through human observation.
As illustrated in
Figure 7c(iv,vi), although ITTI roughly highlights object areas, it fails to effectively suppress the background regions. This will potentially complicate the subsequent segmentation processing. Perhaps due to the adoption of Gaussian pyramid and cross-scale difference, ITTI cannot uniformly enhance the entire object. Instead, it tends to emphasize the edge regions, resulting in low saliency values in the central region of the object, as indicated by the areas marked with red boxes in
Figure 7c(i) and
Figure 8c(i). In contrast, the SR algorithm demonstrates better background suppression, while its effect on enhancing objects is slightly insufficient (comparing
Figure 7d(i) with
Figure 7c(i)), which can easily increase the risk of missed detections in the SM-based segmentation. The RC model achieves acceptable performance in both object enhancement and background inhibition. Its SM primarily exhibits two limitations: (1) under high-power clutter conditions, certain background regions still present undesirably high saliency values, as demonstrated in
Figure 7e(vi); (2) owing to utilizing a global contrast mechanism, shadow areas—which typically exhibit lower grayscale levels than both background clutter and objects—are often assigned elevated saliency values, as indicated by the blue-boxed areas in
Figure 7e. These shortcomings may adversely affect subsequent salient object extraction. The CGVS model can assign high saliency values to object areas, and there is a clear distinction between regions with high and low saliency values in its SM. However, it simultaneously enhances numerous background regions and presents a more pronounced shadow enhancement effect. As shown in
Figure 7f(v), shadow and object regions show comparable saliency levels.
By comparison, our algorithm achieves uniform enhancement across entire object regions, with high-saliency areas precisely matching object regions. It also demonstrates superior background inhibition relative to benchmark methods. A representative interference area exhibiting strong echo (green box,
Figure 8a(iii)) exemplifies this advantage: while comparative methods assign it saliency levels comparable to actual objects, our SM maintains near-zero values in such region, thereby minimally impacting downstream processing. Moreover, our SM presents an intrinsic binarization property, enabling direct extraction of high-saliency regions through computationally efficient threshold segmentation. This characteristic significantly simplifies the subsequent object segmentation.
Furthermore, we utilize the P-R (precision–recall) curve [51,52] to quantitatively analyze the SM generation results. The definitions of precision and recall are given in Equation (24),

$$\mathit{precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}, \qquad \mathit{recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}},$$

where TP, FP, and FN, respectively, represent the numbers of true positive, false positive, and false negative samples. Theoretically, if an SM presents excellent performance, its corresponding P-R curve will approach the upper-right corner and have a large BEP (break-even point) value [53]. The break-even point of a P-R curve is the position at which precision equals recall.
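For completeness, a P-R curve for one SM can be traced by sweeping a threshold over the saliency values and evaluating Equation (24) at each level, as in the short sketch below (the number of threshold levels is arbitrary).

```python
import numpy as np

def pr_curve(sm, gt, n_thresholds=256):
    """Precision-recall pairs (cf. Eq. (24)) obtained by thresholding the SM."""
    precisions, recalls = [], []
    for th in np.linspace(0.0, 1.0, n_thresholds):
        det = sm >= th
        tp = np.logical_and(det, gt).sum()
        fp = np.logical_and(det, ~gt).sum()
        fn = np.logical_and(~det, gt).sum()
        precisions.append(tp / (tp + fp + 1e-12))
        recalls.append(tp / (tp + fn + 1e-12))
    return np.array(precisions), np.array(recalls)
```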
The P-R curves corresponding to SMs based on ground and maritime scenes are presented in
Figure 9a,b, respectively, and BEP values of these curves are recorded in
Table 1. It can be seen that blue curves exhibit the most obvious tendency towards the upper-right corner and demonstrate the best results at the BEP, indicating that the SMs provided by our method present superior overall performance. This finding is consistent with the qualitative analysis presented above.
3.3.2. Analysis of the Binarization Property of SMs
An ideal SM should effectively highlight objects while thoroughly suppressing the background, leading to a clear distinction between areas of high and low saliency values. This property significantly facilitates subsequent object extraction based on the SM. This subsection evaluates the segmentation potential of SMs from the perspective of the binarization property and elaborates on its implications for object extraction. Experiments were conducted on the maritime scenes. For consistency, the OTSU threshold—employed in our proposed algorithm for object extraction—was uniformly applied to segment objects from the SMs produced by the other benchmark algorithms.
Figure 10 presents partial experimental results. The corresponding original SAR images and GT are shown in
Figure 8a and 8g, respectively.
Figure 10a–e sequentially display segmentation examples based on the proposed SMs, ITTI SMs, SR SMs, RC SMs, and CGVS SMs. Experimental observations indicate that while CGVS SMs exhibit a favorable binarization property, with high correspondence between segmentation results and regions of high saliency values, they simultaneously activate extensive non-object areas, leading to severe false alarms that render CGVS ineffective in this scenario. ITTI and SR SMs achieve better segmentation quality, with results largely covering object regions and far fewer false alarms than CGVS + OTSU. However, the residual false alarms persisting in
Figure 10b,c suggest that simple OTSU segmentation struggles to precisely extract objects from ITTI and SR SMs, and more advanced post-processing strategies would be required for these algorithms to achieve excellent detection performance. Results based on RC SMs demonstrate marked improvement over those of ITTI and SR, though noticeable false alarms remain (e.g., the red-boxed region in
Figure 10d(iii)). By contrast, the proposed method generates SMs with precise high-saliency localization in object regions and effective background suppression. Correspondingly, its segmentation results exhibit superior control over both false alarms and missed detections, outperforming all four benchmark algorithms. For instance, the high-intensity clutter region marked with a green box in
Figure 8a(iii) is thoroughly suppressed in our segmentation, while all other methods produce distinct false alarms in this area.
Quantitative evaluation using
precision and
recall metrics (
Table 2) corroborates the qualitative analysis. CGVS, ITTI, and SR methods achieve low
precision scores, indicating substantial false alarms that limit their practical utility. RC maintains relatively balanced
precision and
recall scores, while the proposed algorithm significantly surpasses RC in both metrics. Notably, although CGVS and ITTI achieve higher
recall than our method, this comes at the cost of severely degraded
precision, compromising their detection reliability. Comparative analysis of
Table 1 (BEP scores) and
Table 2 reveals a strong positive correlation between SM quality (BEP) and segmentation performance. This indicates that improving SM quality facilitates achieving better object detection results through simpler segmentation processes, such as basic thresholding.
In summary, the proposed algorithm simultaneously achieves strong object activation and effective background suppression. Its SMs exhibit exceptional binary characteristics, enabling high-quality object extraction via simple threshold segmentation. Experimental validation confirms that our method delivers superior detection performance with minimal post-processing requirements, fulfilling the practical demands of efficient and accurate SAR object detection.
3.4. The Effect of Adaptive Iterative Mechanism
In contrast to conventional saliency detection algorithms that directly segment objects based on the initial SM, the proposed framework incorporates an adaptive iteration mechanism to progressively refine the preliminary SM, thereby ensuring improved performance of the final output. In this subsection, maritime scenes were utilized, and all images that underwent five iteration rounds were selected as illustrative examples to analyze the impact of the proposed adaptive iteration mechanism on SM generation.
Figure 11 provides an exemplification of SM evolution during the adaptive iteration. The iterative process conducts five rounds of Bayesian inference, and the SMs generated from each round are sequentially shown in
Figure 11b–f. In the SM corresponding to the first round (
Figure 11b), objects appear relatively blurred, and some background regions exhibit high saliency levels. If this SM were directly applied to segmentation processing, as is common with conventional saliency algorithms, the final detection results would inevitably suffer accuracy loss. As the iteration process progresses, object components in the SM gradually become clearer, while the background regions are increasingly suppressed. In the final SM (
Figure 11f), there is a distinct difference between object areas and the background, allowing us to conduct segmentation with a simple global threshold. Therefore, the adaptive iteration mechanism can effectively improve the SM generated from a single round of Bayesian inference, and according to the analysis in
Section 3.3.2, it will indirectly enhance the overall detection performance of our method.
The P-R curves corresponding to each round are presented in
Figure 12. As the number of iterations increases, the P-R curves progressively shift towards the upper-right corner, quantitatively revealing the enhancement of SM performance through the iteration processing. On the other hand, the variation from
Figure 11e to
Figure 11f is minimal, and correspondingly, the blue curve in
Figure 12 nearly overlaps with the pink curve. It indicates that the SM reaches a stable state where further iterations provide negligible improvement, prompting the adaptive mechanism to automatically terminate the process. These results confirm that the iteration mechanism not only effectively enhances SM quality but also possesses the capability to terminate the process at an appropriate stage, demonstrating excellent self-adaptability.
3.5. Experimental Analysis of Detection Results
Object extraction results directly reflect the comprehensive performance of detection algorithms. This subsection systematically evaluates the final detection results of the proposed saliency algorithm. We employed both ground scenes (BTR-60 (test)) and maritime scenes (SSDD (train)). For ground object detection, the Two-Parameter CFAR (TP-CFAR) detector [11] and the Background-Context-Aware Saliency (BCAsaliency) method [25] were selected as benchmark algorithms. The former is a classical approach in SAR image detection and has been widely applied in practical detection, while the latter—a recently proposed saliency algorithm for SAR ground targets by Ni et al.—demonstrates outstanding object highlighting performance on the MSTAR dataset. Since reference [25] did not provide an extraction scheme for BCAsaliency SMs, the OTSU threshold was adopted to segment its SMs for a fair comparison with our proposed method. Thus, the actual benchmark method is denoted as BCAsaliency + OTSU, hereafter abbreviated as BCAsaliency. For maritime object detection, TP-CFAR, Density Censoring-CFAR (DC-CFAR) [54], and Superpixel and Gamma-CFAR (SPG-CFAR) [55] were chosen as baseline algorithms. The latter two are improved CFAR strategies proposed in recent years for SAR maritime targets: DC-CFAR employs superpixel-level processing to mitigate speckle noise and incorporates a screening mechanism to reduce computational redundancy; SPG-CFAR similarly utilizes superpixel-level processing for noise suppression and specifically employs the Gamma distribution for clutter modeling tailored to maritime backgrounds.
In the ground scenes, the TP-CFAR detector fails to completely extract objects, revealing significant limitations in detection performance. While the BCAsaliency approach generally preserves object integrity and demonstrates robust clutter suppression in homogeneous backgrounds, it exhibits three critical shortcomings. First, as shown in
Figure 13d(i,ii), it tends to misclassify background components adjacent to objects as regions of interest, causing detection results to substantially exceed true object boundaries. Second, its clutter suppression capability deteriorates sharply in heterogeneous backgrounds, leading to pronounced false alarms (see
Figure 13d(iii,vi)); notably, in a strong-clutter scenario depicted in
Figure 13d(vi), the object becomes entirely indistinguishable from background noise. Third, BCAsaliency usually generates false alarms in shadow regions (
Figure 13d(v,vii)), which is a consequence of its global contrast mechanism. Shadow areas exhibit extremely low backscattered energy, creating strong contrast against typical SAR backgrounds and thereby triggering false positives. In contrast, our proposed algorithm overcomes these limitations by fully segmenting entire objects while maintaining effective clutter suppression across both homogeneous and heterogeneous environments. By integrating multiple features—including echo power (brightness feature)—it successfully circumvents shadow-induced interference, ultimately achieving superior comprehensive detection performance.
In the maritime scenes, all evaluated algorithms achieved satisfactory detection results in relatively clean backgrounds, as shown in
Figure 14(i). However, when processing those containing complex content, our method demonstrates superior detection performance. As exemplified in
Figure 14(ii), TP-CFAR confronts severe missed detections; both DC-CFAR and SPG-CFAR yield marginally inferior object extraction with non-negligible false alarms; comparatively, the proposed method achieves optimal detection results. Furthermore, our algorithm exhibits more outstanding suppression effects against high-power interference. As illustrated in
Figure 14(iii), where a strong interference marked by a green box exists, the proposed algorithm suppresses this interference thoroughly, while the results from the other three methods all retain conspicuous components corresponding to this interference.
Comprehensively considering experiments in both ground and maritime scenes, the proposed algorithm achieves better results than other classic benchmark methods, presenting excellent detection performance.
We also employ the Fβ-measure, which comprehensively considers both precision and recall, to quantitatively analyze the detection results. The Fβ-measure is defined as the weighted harmonic mean of precision and recall [16], as shown in Equation (25),

$$F_{\beta} = \frac{(1 + \beta^{2}) \cdot \mathit{precision} \cdot \mathit{recall}}{\beta^{2} \cdot \mathit{precision} + \mathit{recall}},$$

where the parameter β characterizes the relative importance of recall with respect to precision. If β > 1, recall exerts a more dominant influence on the Fβ-measure; conversely, if β < 1, precision contributes more significantly to the metric. In this experiment, we consider precision and recall to be of equal importance and set β to 1.
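Equation (25) is the standard weighted harmonic mean and can be evaluated directly:

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta measure (Eq. (25)): weighted harmonic mean of precision and recall."""
    b2 = beta ** 2
    return (1.0 + b2) * precision * recall / (b2 * precision + recall + 1e-12)
```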
Figure 15a,b present the Fβ-measure values for the detection results in the ground and maritime scenarios, respectively. In the ground scenes, our proposed algorithm achieves a much higher Fβ-measure than the two classical benchmarks (TP-CFAR and BCAsaliency), demonstrating optimal detection performance. In the maritime scenes, although the two improved CFAR variants (DC-CFAR and SPG-CFAR) outperform the traditional TP-CFAR, our method still surpasses all three CFAR detectors, yielding the most competitive results.
Comparative analysis of
Figure 15a,b reveals that the
Fβ-measure of TP-CFAR in the ground scene is markedly higher than that in the maritime scene, indicating its sensitivity to scene variations and significant performance instability. In contrast, the proposed algorithm maintains robust performance across both scenarios, with closely aligned
Fβ-measure values, denoting its consistent efficacy in heterogeneous tasks. These results confirm the superior robustness and generalization capability of our algorithm.
Based on the above analysis, our method exhibits superior detection capabilities compared with the classic SAR detection algorithms. It achieves reliable results across different scenarios, demonstrating excellent detection performance.
3.6. Experimental Analysis of the Setting of Parameter TMAE
The parameter $T_{\mathrm{MAE}}$ plays a critical role in our algorithm by determining the termination condition for the adaptive iteration, which directly influences the final detection results. To systematically evaluate its operational mechanism, experiments were conducted on the ground scenes with $T_{\mathrm{MAE}}$ values sequentially set to {0.10, 0.15, 0.20, 0.25, 0.30, 0.35}. We then quantitatively analyzed the detection results under these configurations via the Fβ-measure to reveal performance variations. The corresponding Fβ-measure scores for each parameter setting are listed in Table 3.
Experimental results demonstrate only minor fluctuations in detection performance when $T_{\mathrm{MAE}}$ varies within [0.10, 0.35], with the proposed algorithm consistently maintaining high effectiveness. This confirms the low sensitivity of our method to the setting of $T_{\mathrm{MAE}}$, indicating strong parameter robustness. Moreover, these findings further verify the stability and reliability of the proposed algorithm.