Salient Object Detection via Fusion of Multi-Visual Perception

Abstract: Salient object detection aims to distinguish the most visually conspicuous regions and plays an important role in computer vision tasks. However, complex natural scenarios can challenge salient object detection, hindering accurate extraction of objects with rich morphological diversity. This paper proposes a novel method for salient object detection leveraging multi-visual perception, mirroring the human visual system's rapid identification of, and focus on, impressive objects/regions within complex scenes. First, feature maps are derived from the original image. Then, salient object detection results are obtained for each perception feature and combined via a feature fusion strategy to produce a saliency map. Finally, superpixel segmentation is employed for precise salient object extraction, removing interference areas. This multi-feature approach to salient object detection harnesses complementary features to adapt to complex scenarios. Competitive experiments on the MSRA10K and ECSSD datasets place our method in the first tier, achieving 0.1302 MAE and 0.9382 F-measure on the MSRA10K dataset and 0.0783 MAE and 0.9635 F-measure on the ECSSD dataset, demonstrating superior salient object detection performance in complex natural scenarios.


Introduction
Attention plays an extremely important role in human perception systems [1]. Human visual perception often selectively disregards unnecessary information through attentional mechanisms. To effectively extract target information, it selectively attends to salient regions by integrating local contextual cues. The human visual system [2] is an important medium for human cognition of the world and has strong recognition and information processing capabilities. Using this system, humans can autonomously ignore a large amount of secondary and redundant information and accurately locate and extract the key information in an image. This key information can be perceived and refined to extract richer high-level information in the attention stage. Salient object detection imitates the human visual attention mechanism through computation, with the aim of distinguishing the most visually attractive targets and quickly screening images by focusing on the target areas that human eyes find most interesting [3]. Salient object detection is a binary classification problem, and its results need to highlight boundaries and give complete targets. With the development of computer vision, this attention mechanism of the human visual system has attracted great interest. To detect more important and valuable information in images or videos, research on salient object detection is gradually emerging, with the aim of highlighting the most salient objects in an image [4]. Such research holds practical significance for applications including video surveillance, virtual reality, human-computer interaction, and autonomous navigation. Over the past few decades, many approaches have been proposed to detect salient objects [5]. In general, there are two main categories of salient object detection methods: bottom-up and top-down [6][7][8][9].
Bottom-up approaches are fast, data-driven, and task-independent [10] and are usually based on easy-to-implement primitive features such as color, intensity, and texture. However, salient objects cannot be detected using these features alone. Therefore, bottom-up methods make assumptions about object and background properties, including the contrast prior, center prior, and background prior [11]. For example, the contrast prior assumes that salient regions always differ from their neighbors or scenes and can be further divided into the local contrast prior and the global contrast prior [2]; the center prior builds on the observation that salient objects are more likely to appear at the center of the image; and the background prior indicates that the boundaries of the image are more likely to be part of the background. With these assumptions, salient regions can be highlighted more effectively while the background is suppressed.
In addition, top-down approaches are slow, well-controlled, and task-driven and require supervised learning [12] based on manually labeled training samples. With the development of deep learning, such methods have achieved great success in salient object detection. Deep-learning-based algorithms [13,14] can process semantic-level or image-level features of training samples and achieve the best performance for saliency detection. However, deep neural networks require a large number of labeled samples to iteratively adjust massive numbers of training parameters, have a long training process, and cannot be applied to online learning. For example, commonly used CNN (convolutional neural network) architectures, such as ResNet-50 [15] and VGG16 [16], have large numbers of parameters: ResNet-50 has over 20 million parameters, while VGG16 has approximately 130 million. The time required to complete a full training and inference cycle is far longer than that of the method proposed in this paper. Therefore, most existing deep-learning-based methods are time-consuming and highly reliant on well-annotated training datasets [10]. Besides the training efficiency problem, labeling a large number of samples is also a difficult task in practice [6].
To tackle the problems mentioned above, this paper proposes a bottom-up salient object detection method using multi-visual perception. Most of the intuitive perception of human vision is unsupervised and mostly based on low-level visual features such as color, texture, and brightness. Contrast, a very important feature for salient region recognition, has a critical impact on visual effects: the greater the contrast, the clearer and more eye-catching the image, and the easier it is for human eyes to notice the object. The reason why salient objects can attract the attention of the human visual system is that their feature appearance differs from the surrounding environment. Therefore, this paper mainly uses five different low-level visual perception features, the gradient feature, mean subtracted contrast normalized (MSCN) coefficients [17], the dark channel, saturation, and hue, to detect salient objects by analyzing the visual presentation characteristics of natural images. More details are described in Sections 3.1 and 3.2. The method follows the feature utilization of the human visual system when recognizing objects in images. First, saliency maps are acquired from every single feature, then fused through an effective strategy, and finally combined with superpixel segmentation to obtain the final result. During the experimental phase, we utilized two challenging datasets, MSRA10K [18] and ECSSD [19], and evaluated our method using the common precision-recall (PR) curve, ROC curve, F-measure, and mean absolute error (MAE) metrics. Detailed descriptions of the specific evaluation procedures are provided in Sections 5.1 and 5.2. In short, the main contributions of this paper are as follows: (1) Methodology for salient object detection: The paper introduces an innovative approach to salient object detection leveraging multi-visual perception. This method capitalizes on the synergy of various visual cues to enhance accuracy. (2) Enhanced feature utilization: By incorporating five distinct low-level visual features, the proposed method significantly bolsters its ability to handle intricate natural scenes. These nuanced features contribute to robustness and adaptability. (3) Coherent framework for accurate detection: The integration of multiple features within a coherent framework yields substantial improvements in salient object detection, notably leading to a pronounced reduction in both false positives and false negatives.
In summary, our method fully utilizes the underlying representation features of images, thoroughly explores and integrates global and local characteristics, and achieves good performance without requiring costly training data or teacher signals.
The remainder of this paper is organized as follows. Section 2 discusses related work, and Section 3 explains the motivation and background for our work. Section 4 presents the details of our proposed method. Then, Section 5 describes experiments demonstrating the proposed method's performance. The explanation of experimental results and further validation of the proposed method are expanded upon in Section 6. Finally, Section 7 concludes.

Related Work
Over the past two decades, salient object detection methods have developed rapidly. Vision-based object detection is an interdisciplinary research topic spanning image processing, computer vision, and pattern recognition [4]. Some strategies for salient object detection use unsupervised techniques by combining visual features and prior theory. The earliest creative method, proposed by Itti et al. [20], was based on center-surround contrast of color, intensity, and orientation. Then, Jiang et al. [21] introduced a regional-level saliency descriptor primarily constructed on local contrast, background, and other well-known features. Achanta et al. [22] described a frequency-tuned, global-contrast-based method to measure saliency. Yang et al. [23] used a graph-based approach to measure the similarity of regions to foreground or background cues through the boundary prior and manifold ranking. Jia and Han [24] calculated saliency scores for each region and then compared them to the soft foreground and background. Liu et al. [25] utilized a Bayesian framework for saliency detection. To overcome the limitation of using only contrast, Lou et al. [26] proposed the use of statistical and color contrast to detect salient objects. Furthermore, there are some successful examples of salient object detection combining visual features with prior theory [27][28][29].
With the boom in deep learning [12], supervised learning methods have been introduced to saliency detection. Among them, the region-based CNN models and FCN-based (fully convolutional network) models are worth mentioning. Region-based methods [14] divide the input image into multi-scale or smaller regions and use CNNs to extract their high-level features. Then, a multi-layer perceptron (MLP) outputs the saliency value of each small region from these high-level features. Although region-based CNN models have achieved good performance, they cannot preserve spatial information due to the small-region segmentation. Therefore, FCN-based methods were designed to overcome this shortcoming [13]. These methods operate at the pixel level rather than at the region or patch level, overcoming the limitations of region-based CNN models while preserving contextual information well. However, substantial labeled data with teacher signals or ground truth is necessary for their training, leading to inherent computational complexity and potential unavailability.
The traditional methods mentioned above typically focus on a limited set of features, such as local contrast and color contrast, resulting in under-utilization of image information. Deep learning methods typically extract semantic features through convolution, neglecting surface-level features of images. Additionally, they perform convolution at the block level, overlooking the connectivity of salient objects across blocks. Furthermore, they require a large amount of annotated data, which is time-consuming and labor-intensive to produce, and the complexity of the networks results in high time complexity.
To address the aforementioned challenges, this paper proposes a multi-feature detection method leveraging the rich features of images. Employing multiple features is not only beneficial for saliency detection in complex natural scenes but also mitigates the limitations of single-feature approaches. Moreover, the proposed bottom-up method is simple and data-independent.

Motivation
The generation of visual saliency arises from a new and distinct stimulus that captures the observer's attention due to the contrast between visual objects and the surrounding environment. This visual contrast is often induced by the elements constituting the image itself. Regions with higher contrast are more likely to attract the visual system's attention and are referred to as salient regions, while visual stimuli are described through features [30].
Generally, an image feature describes a visual attribute of a specific aspect of the image object. Different dimensions of features can describe the image from various perspectives [30], such as contours of different subjects, texture features, and brightness contrast depicting sensory depth and distant/nearby objects, while color features aid in object recognition and differentiation. The human eye comprehends images through the integrated analysis of multiple perceptual features to achieve a final understanding of the image. To simulate human visual mechanisms, multiple features are selected. For instance, the MSCN and gradient features describe the texture distribution in the image. Given the heightened sensitivity of the human eye to edges and textures, these features help simulate that sensitivity, thereby enhancing the perceptual quality of the image. The dark channel, mapping the minimum pixel values in the image, is employed to extract depth information from the scene, simulating human perception of image depth and the relative distances between objects. In natural environments, objects often possess unique colors, aiding in the perception and memorization of the surroundings. Colors serve as crucial identifying features for remembered objects and scenes.
The aforementioned features collectively cover almost all the information acquired by the human visual system [31], leading us to integrate MSCN coefficients, gradient feature, dark channel, saturation, and hue features to emulate the human visual perception system.

Visual Perceptual Features
The proposed method uses five different perceptual features, i.e., the gradient feature, MSCN coefficients, the dark channel, saturation, and hue in the HSV color space. These perceptual features play an essential role in the human visual recognition process. The details of each perceptual feature are described as follows.
Gradient feature: The human visual system recognizes objects by their edges. The higher the value in a gradient image, the closer it is to an edge observed by human vision. The gradient feature uses edge detection operators to extract texture information from images and represent it as finer object structures. In this work, a robust contrast operator [32] (CVO: contrast value operator) was used for gradient feature extraction.
MSCN coefficients: Due to the smooth transition of image regions, adjacent pixels of natural images tend to have high correlations. The human visual system detects salient objects after removing this high correlation, and the MSCN coefficients are proven to reduce image region correlation [33]. Therefore, an MSCN coefficient feature map can decorrelate salient objects from their surroundings. The MSCN coefficients are calculated as follows:

Î(i, j) = [I_gray(i, j) − μ(i, j)] / [σ(i, j) + C], (1)

μ(i, j) = Σ_{k=−K}^{K} Σ_{l=−L}^{L} ω_{k,l} I_gray(i + k, j + l), (2)

σ(i, j) = √( Σ_{k=−K}^{K} Σ_{l=−L}^{L} ω_{k,l} [I_gray(i + k, j + l) − μ(i, j)]² ), (3)

where i ∈ {1, 2, ..., M} and j ∈ {1, 2, ..., N} are spatial indices, C is a small constant that prevents division by zero, ω is a 2D circularly symmetric Gaussian weighting function sampled out to three standard deviations (K = L = 3) and rescaled to unit volume, and I_gray(i, j) is the grayscale version of a raw image I.
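To make the computation concrete, the following Python sketch computes MSCN coefficients with Gaussian filtering per Equations (1)-(3). The three-standard-deviation window matches the text; the values sigma = 7/6 and C = 1 are assumptions borrowed from the common BRISQUE convention rather than settings reported in this paper.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def mscn_coefficients(gray, sigma=7/6, C=1.0):
    """Minimal MSCN sketch per Equations (1)-(3); sigma and C are assumed."""
    gray = gray.astype(np.float64)
    # Local mean mu(i, j): Gaussian window sampled to three standard deviations
    mu = gaussian_filter(gray, sigma, truncate=3.0)
    # Local standard deviation sigma(i, j); abs() guards tiny negative residues
    var = gaussian_filter(gray * gray, sigma, truncate=3.0) - mu * mu
    sigma_map = np.sqrt(np.abs(var))
    return (gray - mu) / (sigma_map + C)  # decorrelated coefficients, Equation (1)
```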
In addition, we find that the distribution of salient object values tends to be concentrated in part of the MSCN coefficient value range. Therefore, the feature map is combined with its histogram to detect salient objects.
Dark channel: Although the dark channel prior was first proposed for image dehazing, several papers have studied its use in detecting salient objects. Refs. [34,35] demonstrated that the dark channel prior can effectively suppress detection failures due to large background regions, foregrounds touching image boundaries, and foregrounds and backgrounds sharing similar color appearances.
The dark channel is a statistical characteristic of outdoor haze-free images: the study found that in most non-sky patches, at least one color channel has some pixels with intensities very close to zero [36]. It is obtained as follows:

I_dark(x) = min_{y∈Ω(x)} ( min_{c∈{r,g,b}} I^c(y) ), (4)

where I^c(y) is a color channel of I and Ω(x) is a local patch centered at x.

Saturation: Saturation represents how color density falls from its maximum value to 0. The purer the color, the higher the saturation, and high saturation equals high recognizability of colors.
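As an illustration of Equation (4), a minimal Python sketch of the dark channel is given below: the per-pixel channel minimum is followed by a minimum filter over the local patch Ω(x). The patch size of 15 is an assumption, not a value from this paper.

```python
import numpy as np
from scipy.ndimage import minimum_filter

def dark_channel(img_rgb, patch=15):
    """Dark channel per Equation (4); the patch size is an assumed value."""
    min_rgb = img_rgb.min(axis=2)                  # min over color channels c
    return minimum_filter(min_rgb, size=patch)     # min over local patch Omega(x)
```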
Hue: Hue is considered the primary characteristic of color. Changing the hue leads to a more considerable difference than changing the same amount of saturation or value [11]. Therefore, hue is also an important perceptual feature when human eyes focus on salient objects.
Five feature maps f_m (m = 1, 2, ..., 5) with size M × N are obtained from the original image: gradient feature f_1, MSCN coefficients f_2, dark channel f_3, saturation f_4, and hue f_5. To improve detection efficiency while preserving the effective information of the feature maps, the feature maps are downsampled by taking local maxima. In this work, the downsampled feature maps f̃_m were obtained by K-fold maximum downsampling to size (M/K) × (N/K). Then, the downsampled maps are normalized as follows:

f̃_m(i, j) = [f̃_m(i, j) − f̃_{m,min}] / (f̃_{m,max} − f̃_{m,min}), (5)

where f̃_{m,min} and f̃_{m,max} are the minimum and maximum values of f̃_m.
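The downsampling and normalization steps admit a short sketch. K = 4 follows the setting used for 300 × 400 images (Table 1); the epsilon guard and the discarding of remainder rows/columns are implementation details assumed here.

```python
import numpy as np

def max_downsample(fmap, K=4):
    """K-fold maximum downsampling to roughly (M/K) x (N/K);
    rows/columns beyond a multiple of K are discarded (an assumption)."""
    M, N = fmap.shape
    blocks = fmap[:M - M % K, :N - N % K].reshape(M // K, K, N // K, K)
    return blocks.max(axis=(1, 3))

def min_max_normalize(fmap, eps=1e-12):
    """Normalization as in Equation (5); eps avoids division by zero."""
    return (fmap - fmap.min()) / (fmap.max() - fmap.min() + eps)
```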

Methodology
In reality, the human visual system acquires various kinds of information for object recognition. Therefore, the method proposed in this paper simulates the recognition process of human vision through a variety of perceptual features. The framework of our method is illustrated in Figure 1. First, the saliency map corresponding to each perceptual feature is detected. Second, the saliency maps of the different perceptual features are merged into one with an effective strategy. Third, superpixel segmentation is involved in generating the final saliency map.

Single Feature Detection
Humans understand different features in different ways, so the same detection method cannot be used for every feature. According to the characteristics of each feature, different detection strategies are designed to acquire the saliency map of each single feature.
Gradient feature: Gradient values are usually large at the contours of salient objects, so gradient features are selected to detect salient objects. During gradient feature map detection, salient objects are obtained by finding salient object edges row by row. First, each row of data is plotted, and its peak points, which are likely to be the edges of salient objects, are found. In this paper, peak points larger than λ1 × Map_max are considered edges of salient objects, where Map_max is the maximum value of the whole map and λ1 is set to 0.45. Then, salient regions are identified through those peak points: if the peak points are concentrated, the space between the concentrated peak points is a salient area. Finally, salient objects can be obtained from the salient regions of each row. Figure 2 illustrates the gradient feature detection process. Figure 2b shows one row of gradient changes taken from Figure 2a; the peak points larger than λ1 × Map_max are marked with asterisks, and the salient region of this row is obtained. Figure 2c shows the saliency map obtained from gradient features.
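A simplified sketch of the row-wise procedure is shown below. It marks the span between the first and last qualifying peak in each row as salient; the paper's grouping of "concentrated" peaks is approximated here by that full span, which is an assumption of this sketch.

```python
import numpy as np
from scipy.signal import find_peaks

def gradient_row_saliency(grad_map, lam1=0.45):
    """Row-by-row detection on a gradient feature map (simplified sketch)."""
    sal = np.zeros_like(grad_map, dtype=np.uint8)
    thresh = lam1 * grad_map.max()                 # lam1 * Map_max
    for r in range(grad_map.shape[0]):
        peaks, _ = find_peaks(grad_map[r], height=thresh)  # candidate edge peaks
        if len(peaks) >= 2:
            sal[r, peaks[0]:peaks[-1] + 1] = 1     # span between edge peaks
    return sal
```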
MSCN coefficients: By studying the characteristics of the MSCN coefficients, we note a significant difference in the MSCN distribution between salient objects and the background, as shown in Figure 3. In this paper, the statistical histogram is defined as the discrete function

h(r_k) = n_k, (6)

where r_k represents the value of level k and n_k is the number of pixels with value r_k in the image. The corresponding statistical histogram can then be obtained from the MSCN coefficient feature map via Equation (6). The histogram is normalized as follows:

p(r_k) = n_k / n, (7)

where n is the total number of pixels, so that p(r_k) estimates the probability that the value r_k appears in the MSCN coefficient feature map. When the histogram distribution satisfies one of three conditions on its shape, the value range greater than λ2 is chosen as the distribution range of salient objects; if none of the three conditions is satisfied, the distribution range of salient objects lies in the region less than λ2. The initial value of λ2 is set to 0.4. If the salient region obtained with the initial value is small, λ2 is changed to enlarge the salient region. Figure 4 shows some examples of MSCN coefficient detection: the histogram distribution of the first image satisfies one of the three conditions, so the range greater than λ2 is the salient object distribution range, while the other images do not meet the three conditions, so the range less than λ2 is the salient object distribution range. Figure 4c shows the detection result of the image.
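A compact sketch of the histogram step follows. The three shape conditions are not reproduced here, since their exact form is specific to this paper; the sketch assumes the common case in which the range below λ2 holds the salient values, and the bin count is an assumption.

```python
import numpy as np

def mscn_histogram_saliency(mscn_map, lam2=0.4, bins=64):
    """Histogram-based MSCN detection; assumes a map normalized to [0, 1]."""
    hist, _ = np.histogram(mscn_map, bins=bins, range=(0.0, 1.0))
    p = hist / mscn_map.size                       # p(r_k) = n_k / n, Equation (7)
    salient = (mscn_map < lam2).astype(np.uint8)   # assumed common case w.r.t. lam2
    return p, salient
```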
The three other features, dark channel, saturation, and hue: The statistical histogram is combined with the background prior theory [28] to detect salient objects. In most scenarios, the background prior theory assumes that the edges of the image tend to be the background; thus, the edges of the image are considered non-salient. Therefore, when the statistical histogram grows from the edge to the whole image, the newly emerging values almost always belong to salient objects. In addition, salient objects sometimes touch the edges of the image, so when the statistical histogram values increase abnormally from the edges to the whole, these values are also likely to be salient object values. To obtain the values of salient objects, first, the statistical histogram of f̃_m is defined as h_m^all(r_k), where m ∈ {3, 4, 5}. Second, the outermost values of the four sides of f̃_m are extracted, and their statistical histogram is obtained as h_m^edge(r_k). The statistics of each level are then processed as whole/edge to obtain the pixel increase multiplier for each level:

h(r_k) = h_m^all(r_k) / h_m^edge(r_k). (8)

Through the multiplier, the saliency map can be obtained as follows:

S_m(r_k) = 1 if h_m^edge(r_k) = 0 or h(r_k) > T; otherwise, S_m(r_k) = 0, (9)

where h_m^edge(r_k) = 0 denotes a value that does not exist in h_m^edge(r_k) but exists in h_m^all(r_k), and T = mean(h(r_k)) + λ4 × std(h(r_k)) is a threshold for judging whether a value increases abnormally in the process from the edge to the whole; mean(h(r_k)) and std(h(r_k)) are the mean and standard deviation of h(r_k), and λ4 is set to 1.5. The threshold T thus varies with the mean and standard deviation. If the histogram level of a feature map value satisfies either of the above two conditions, it is set to 1; otherwise, it is set to 0. As one example, Figure 5 illustrates the hue detection process. Figure 5a shows the statistics histogram h_m^all(r_k) of f̃_5. Figure 5b shows the edge statistics histogram h_m^edge(r_k) of f̃_5. Figure 5c is the hue growth result from the edge to the whole image calculated by Equation (8). Figure 5d is the hue's final saliency map. The dark channel and saturation features follow the same process.
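The edge-to-whole comparison can be sketched as follows: bin the normalized feature map, compute the growth multiplier of Equation (8), and apply the two conditions of Equation (9). The bin count is an assumption of this sketch.

```python
import numpy as np

def edge_growth_saliency(fmap, lam4=1.5, bins=64):
    """Background-prior detection for dark channel, saturation, and hue;
    assumes fmap is normalized to [0, 1]."""
    border = np.concatenate([fmap[0], fmap[-1], fmap[:, 0], fmap[:, -1]])
    h_all, edges = np.histogram(fmap, bins=bins, range=(0.0, 1.0))
    h_edge, _ = np.histogram(border, bins=bins, range=(0.0, 1.0))
    # Growth multiplier h(r_k) = h_all / h_edge, Equation (8)
    h = np.where(h_edge > 0, h_all / np.maximum(h_edge, 1), np.inf)
    finite = h[np.isfinite(h)]
    T = finite.mean() + lam4 * finite.std()        # adaptive threshold T
    salient_level = (h_edge == 0) | (h > T)        # the two conditions, Equation (9)
    level_idx = np.clip(np.digitize(fmap, edges) - 1, 0, bins - 1)
    return salient_level[level_idx].astype(np.uint8)
```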

Fusion with Superpixel Map
For individual images, feature saliency maps are determined based on the histogram distributions of the different features. After these feature saliency maps are obtained, they are fused; Algorithm 1 shows the fusion process. There are some non-salient points in each feature saliency map, so, for accurate detection, we fuse the multi-perception features to calculate the final accurate saliency map. The method first calculates the density of salient points around each salient point of each feature saliency map f̃_m: a rectangle is drawn centered on the point, and the density of salient points within the rectangle is computed. After the saliency density of each salient point is obtained, salient points whose saliency density is less than λ5 × the maximum density are regarded as noise points and removed; λ5 is set to 0.45, the best value obtained from experiments. Then, the denoised feature saliency maps are fused into the fused saliency map f̃. For any given image, a region that meets the foreground conditions of three or more features is very likely to be foreground. Therefore, we assign an equal weight of 1:1 to each feature, add the values of the five feature saliency maps, and consider pixels with a sum greater than or equal to 3 as foreground and those below 3 as background. Finally, an inverse up-sampling process is applied to the fused saliency map, restoring it to the input image size. The final fused saliency map of Figure 6a is shown in Figure 6b as one typical example.
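A minimal sketch of the denoising and voting steps is given below; the rectangular density window is approximated with a uniform filter, and the window size is an assumption.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def denoise_and_fuse(feature_maps, lam5=0.45, win=15):
    """Fuse five binary feature saliency maps as in Algorithm 1."""
    cleaned = []
    for fm in feature_maps:
        dens = uniform_filter(fm.astype(np.float64), size=win)  # density in a rectangle
        cleaned.append(fm * (dens >= lam5 * dens.max()))        # drop sparse noise points
    votes = np.sum(cleaned, axis=0)
    return (votes >= 3).astype(np.uint8)                        # at least three features agree
```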
As can be seen from Figure 6b, the edges of salient objects are relatively rough, so a superpixel segmentation algorithm is introduced for refinement. Here, the SLIC superpixel segmentation algorithm [37,38] is used to remove interference areas and obtain the final salient object. In this paper, the fusion strategy of SLIC is improved by considering both the color and the area of superpixel blocks when merging them. As before, the SLIC algorithm is first used to segment the image [38]. Then, blocks with a small color difference are merged, and the area of each merged block is checked: if the area of a block is too small, it is merged with the surrounding superpixel block with the smallest color difference. After each superpixel block is processed through the above steps, the final superpixel segmentation map is obtained. Figure 6c is the final superpixel segmentation map of Figure 6a. The final superpixel segmentation map splits the image into many superpixel blocks. The mean saliency of each superpixel block is taken as its saliency value, i.e., T_c = mean(B_c), where c is the index of the superpixel block, B_c is the superpixel block with index c, and T_c represents the mean saliency value of the block. The saliency map is then obtained by normalizing the saliency values, as shown in Figure 6d, and binarized using SC's adaptive threshold method [18]. The resulting binary map is shown in Figure 6e.
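The refinement step can be sketched with an off-the-shelf SLIC implementation. This version omits the paper's color/area-based block merging, and n_segments is an assumed value.

```python
import numpy as np
from skimage.segmentation import slic

def refine_with_superpixels(image_rgb, fused_map, n_segments=300):
    """Assign each superpixel block its mean fused saliency, T_c = mean(B_c)."""
    labels = slic(image_rgb, n_segments=n_segments, compactness=10)
    sal = np.zeros_like(fused_map, dtype=np.float64)
    for c in np.unique(labels):
        mask = labels == c
        sal[mask] = fused_map[mask].mean()          # block saliency T_c
    return (sal - sal.min()) / (sal.max() - sal.min() + 1e-12)  # normalized map
```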

Experiments
All experiments were run on an AMD Ryzen 7 5800X 8-core CPU with 32 GB RAM, using the MATLAB (R2021b) platform.

Evaluation Metrics
To comprehensively evaluate the performance of the algorithm, four quantitative evaluation indicators are used: the precision-recall (PR) curve, ROC curve, F-measure, and mean absolute error (MAE).
The binary maps are obtained from the saliency maps by applying thresholds from 0 to 255. Then, precision (P) and recall (R) value pairs are obtained from these binary maps and the ground truth, and a PR curve is plotted. P and R are calculated as:

P = |B ∩ GT| / |B|, (10)

R = |B ∩ GT| / |GT|, (11)

where B is the binary map and GT is the ground truth. The ROC curve is obtained from the false positive rate (FPR) and true positive rate (TPR). FPR is the probability that a negative sample is mispredicted as positive, while TPR is the probability that a positive sample is correctly predicted as positive.
The F-measure is also obtained from the P-R value pairs and is calculated as:

F_β = (1 + β²) × P × R / (β² × P + R), (12)

where β² = 0.3, indicating that improving precision is weighted more heavily than improving recall. Mean absolute error (MAE) is a simple and reliable binary map evaluation metric. It is computed as the mean pixel-level absolute error between the binary map B and the corresponding GT:

MAE = (1 / (W × H)) Σ_{i=1}^{H} Σ_{j=1}^{W} |B(i, j) − GT(i, j)|, (13)

where H and W represent the height and width of the image, respectively.
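For reference, the four metrics for a single binary map can be computed as in the following sketch; the adaptive thresholding that produces the binary map is assumed to have been applied beforehand.

```python
import numpy as np

def evaluate(binary_map, gt, beta2=0.3):
    """Precision, recall, F-measure (Equation (12)), and MAE (Equation (13))
    for one 0/1 binary map against the 0/1 ground truth."""
    tp = np.logical_and(binary_map, gt).sum()      # |B intersect GT|
    precision = tp / max(binary_map.sum(), 1)
    recall = tp / max(gt.sum(), 1)
    f_measure = ((1 + beta2) * precision * recall
                 / max(beta2 * precision + recall, 1e-12))
    mae = np.abs(binary_map.astype(np.float64) - gt).mean()
    return precision, recall, f_measure, mae
```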

Datasets
Two challenging public datasets were adopted for detection and evaluation in this work. The MSRA10K dataset [18] is a large image dataset containing 10,000 images, where each image contains a salient object or a unique foreground object. The image content of the dataset is varied, but the background structure is mostly simple and smooth. It was the first dataset for the quantitative evaluation of salient object detection.
The ECSSD dataset is an extension of the CSSD dataset [19]. It consists of complex scenes that exhibit textures and structures common in real-world images. The dataset contains a total of 1000 complex images and their respective ground-truth saliency maps [39].
In this work, pixel-level ground truth was employed to evaluate the MSRA10K and ECSSD datasets, which achieves more accurate results than bounding-box-based ground truth.

Ablation Experiments
In this study, multi-visual perception features were employed for salient object detection. To verify their positive facilitation effects, ablation experiments were conducted by selecting only one feature at a time from the multi-visual perception features. PR evaluation metrics for each feature's detection results on the ECSSD dataset are shown in Figure 7. This experiment shows that each perceptual feature can detect salient objects on its own but is less effective in isolation. The experimental results indicate that when all perceptual features work together, the salient object detection method is better suited to complex natural scenarios.

Results
The results of the proposed method on the ECSSD and MSRA10K datasets were compared with those of the MSS [40], COV [41], GR [42], PCA [43], LDS [44], FCB [45], CNS [26], MDF [46], PPMRF [47], saliency bagging [2], MTL-ITSOD [48], PDLSO [7], and LC-SA [6] methods. These algorithms were selected to demonstrate the performance of our proposed method. The methods of Refs. [40,41] can handle complex natural scenarios, and those of Refs. [42,43], based on prior theory, make results more robust. Others [26,46,47] are representative methods that have emerged in recent years. All these methods utilize color and some prior theories to suppress background information and highlight salient objects. The parameter settings in our work are shown in Table 1. A visual comparison of all evaluated methods is shown in Figure 8. The saliency maps generated by our method are more consistent with the ground truth: our method better suppresses background noise and highlights salient objects.
In Figure 9, the saliency maps are evaluated through PR and ROC curves to show the performance of our method. The curve of our method lies above those of the other methods, which shows that our method leads the comparison.

Time Complexity Analysis
The input image has dimensions of M × N pixels, and the program executes on the order of k × M × N operations, where k is a constant. In the feature extraction phase, each of the five features requires a constant number of operations per pixel, so the time complexity of the feature extraction phase is O(M × N). The feature fusion phase involves adding the feature maps obtained from feature extraction and performing binary thresholding; this requires processing each pixel in each of the five feature maps, resulting in a time complexity of O(M × N) for the feature fusion module. The superpixel segmentation utilizes the SLIC algorithm, which also has a time complexity of O(M × N). Therefore, the overall time complexity of the proposed algorithm is O(M × N).
The proposed method processes images of 300 × 400 pixels in an average time of 2.1 s on an AMD Ryzen 7 5800X 8-core CPU with 32 GB RAM using MATLAB. This speed is limited by MATLAB's serial processing, where the single-core capability of the CPU affects the algorithm's runtime; theoretically, with more advanced CPU processors, our algorithm will achieve a higher processing speed. For comparison, the COV algorithm [41], also implemented in MATLAB, has an average runtime of approximately 3.6 s, while the GR algorithm [42] runs in approximately 0.8 s owing to its more efficient C++ implementation. Additionally, our method extracts more low-level image features, which increases the computational workload.
Time complexity analysis suggests that the method's computational demands scale linearly with the product of the image dimensions. The method's efficiency, demonstrated by its execution time on specific hardware, supports its viability for real-time or near-real-time image processing applications.

Parameter Settings
By analyzing the distribution of various visual perceptual features in natural images, we fully leverage low-level image characteristics to identify salient objects. For example, using the MSCN feature, we found that most salient regions account for 30-50% of the total MSCN feature values, while non-salient regions typically account for about 70% (see Figure 10b,c). Peak values are mostly near 0.4, so we set λ3 to 0.7 and λ2's initial value to 0.4. On the other hand, regarding the H channel, salient regions often exhibit H values that differ significantly from those at the edges (see Figure 10d-f). We calculate each H value's occurrence count in the original image divided by its count at the edges. We assume a salient region's growth ratio must exceed the average; if the standard deviation is large, it indicates one or more peaks. These peak regions of the H channel are what we aim to capture, so we set λ4 to 1.5 to better obtain salient regions using the H channel.

Results Analysis
In the ablation experiments, the curve for the multi-visual perceptual feature detection method lies above all the other curves, indicating that this approach is more effective and versatile than single-feature detection methods. Moreover, when designing the detection method for each single feature, we emphasized its contribution to the whole method. Consequently, methods relying on a single feature cannot adapt to all complex natural environments and perform poorly on the datasets.
It is worth mentioning that these results are sufficient to demonstrate that these features can be applied to salient object detection. Visually, our method can detect salient objects more completely and accurately than the other methods. Quantitatively, some comparison methods outperform our method on some evaluation metrics: for example, our F-measure for the ECSSD and MSRA10K datasets in Table 2 is not the best among all methods. The reason is that the multi-feature fusion detection method has yet to be optimized. In particular, the MSCN-based detection summarizes three conditions from the statistical properties of the feature; these three conditions only cover most cases and need further refinement in the future.
Furthermore, with multiple perceptual features participating in salient object detection rather than reliance on a single feature, the method is effective in complex scenarios. Additionally, the denoising operation and the superpixel segmentation method further strengthen its performance.

Performance with Low Resolution
To further validate our method's effectiveness on low-resolution images, we conducted relevant experiments. Testing across different image resolutions yielded the typical results shown in Figure 11. Our method performs well across various low-resolution images. However, as resolution decreases further, detection becomes increasingly affected by background interference, making background areas easier to mistake for salient regions. When resolution reaches its limit, such as 200 × 150, the detection of subtle features, including shape structures and fine edges, is impaired, lowering detection accuracy; nevertheless, salient areas remain recognizable. Therefore, our multi-visual-perception-feature-based method maintains effectiveness even at low image resolutions, although detection precision decreases at very low resolutions. In general, our method optimizes salient object detection by integrating image features aligned with human visual perception, as elaborated in Section 3.2. It is particularly effective in natural environments and adept at capturing complex scene nuances through nuanced use of perceptually relevant features. This approach markedly outperforms conventional methods, showcasing adaptability to the dynamic visual characteristics of natural images. The method's precision and robustness make it valuable for diverse applications in natural image processing and computer vision.

Conclusions
In this paper, we introduce an innovative technique for salient object detection utilizing the fusion of multiple perceptual features. Salient objects are extracted from five distinct perceptual features and combined using a carefully designed fusion approach. A saliency map is generated, visually representing the relative importance of different image regions. The superpixel algorithm is incorporated to obtain the final result, minimizing noise interference. Experimental analysis shows that our method excels at detecting salient objects even in complex natural scenarios. As we continue refining and enhancing this method, we plan to incorporate a broader spectrum of perceptual features, aiming to better simulate the human visual system and uncover new possibilities for our technique's application in computer vision.

Figure 1 .
Figure 1. The framework of the proposed method.

Figure 2 .
Figure 2. An example of the gradient feature detection process. (a) Gradient feature map. (b) One row of gradient changes. (c) Gradient saliency map.

Figure 3 .
Figure 3. The MSCN distribution between the salient object and background. (a) MSCN feature map. (b) MSCN distribution. (c) The background distribution of MSCN. (d) The salient object distribution of MSCN.

Algorithm 1
Multi-perception feature map fusion
1: Input: Each feature saliency map f̃_m
2: Output: The fused saliency map f̃
3: m ← 1 // fuse from the first feature map until all five feature maps are fused
4: while m ≤ 5 do
5:   dens(i, j) ← Density(f̃_m(i, j)) // salient-point density within a local rectangle
6:   set f̃_m(i, j) ← 0 where dens(i, j) < λ5 × max(dens) // remove noise points
7:   m ← m + 1
8: end while
9: f̃ ← Σ_m f̃_m; f̃(i, j) ← 1 if f̃(i, j) ≥ 3, otherwise 0 // keep pixels voted salient by at least three features

Figure 7 .
Figure 7. Ablation experiments on multi-visual perception features for ECSSD.

Figure 9 .
Figure 9. PR and ROC curves for the ECSSD and MSRA10K datasets. (a) PR curve for the ECSSD dataset. (b) ROC curve for the ECSSD dataset. (c) PR curve for the MSRA10K dataset. (d) ROC curve for the MSRA10K dataset.

Figure 11 .
Figure 11. Typical results at different resolutions.

Table 1 .
Setting parameters and explanations.

K: Downsampling factor of the feature map. Different image sizes use different downsampling factors; for the 300 × 400 images in this paper, 4-fold downsampling is used.

λ2: The critical value delimiting the region where the salient objects of the MSCN feature are located. It is initially set to 0.4; when the salient region contains few values, λ2 dynamically changes to enlarge the salient region.

Table 2 .
MAE and F-measure on two benchmark datasets.