7.2. Evaluation Metric
The Matthews Correlation Coefficient (MCC) [30,31] is used to evaluate the segmentation accuracy of the seven multi-level thresholding methods mentioned above. Compared to metrics such as Precision, Recall, Specificity, Dice, and Jaccard (also known as Intersection Over Union), the MCC is a more robust metric, as it yields a high value only when all classes are segmented accurately. The specific formulation of the MCC for multi-class scenarios is
$$\mathrm{MCC} = \frac{c \cdot s - \sum_{k} p_k t_k}{\sqrt{\left(s^2 - \sum_{k} p_k^2\right)\left(s^2 - \sum_{k} t_k^2\right)}},$$
where $t_k$ represents the actual occurrence count of class $k$, $p_k$ represents the number of predictions for class $k$, $c$ indicates the total number of samples that have been correctly predicted, and $s$ represents the total number of samples.
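For reference, a minimal NumPy sketch of this multi-class MCC computation from a confusion matrix is given below; the function name and the 3-class confusion matrix are illustrative only and do not correspond to any experiment in this paper.

```python
import numpy as np

def multiclass_mcc(confusion):
    """Multi-class MCC from a KxK confusion matrix (rows: actual, cols: predicted)."""
    C = np.asarray(confusion, dtype=np.float64)
    t = C.sum(axis=1)          # t_k: actual occurrences of class k
    p = C.sum(axis=0)          # p_k: number of predictions of class k
    c = np.trace(C)            # correctly predicted samples
    s = C.sum()                # total number of samples
    numerator = c * s - np.dot(p, t)
    denominator = np.sqrt((s**2 - np.dot(p, p)) * (s**2 - np.dot(t, t)))
    return numerator / denominator if denominator != 0 else 0.0

# Example: a 3-class confusion matrix (illustrative values only)
conf = [[50, 2, 1],
        [3, 45, 4],
        [0, 5, 40]]
print(multiclass_mcc(conf))  # ~0.85 for this mostly-correct example
```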
The MCC is adopted as the evaluation metric for multi-level thresholding in place of traditional metrics such as the Structural Similarity Index (SSIM), Feature Similarity Index (FSIM), and Peak Signal-to-Noise Ratio (PSNR), primarily because it more accurately reflects the semantic information and segmentation accuracy of the thresholded images. The following experiment demonstrates that the MCC is more robust than SSIM, FSIM, and PSNR.
Figure 6a shows a grayscale image degraded by Gaussian noise with a variance of 0.003, while Figure 6c,d present color visualizations of thresholding results under different thresholds. The SSIM, FSIM, and PSNR values for Figure 6c are 0.2798, 0.5697, and 19.1189, respectively, whereas for Figure 6d, the SSIM, FSIM, and PSNR values are 0.6827, 0.8275, and 20.0621, respectively. It is typically believed that higher SSIM, FSIM, and PSNR values indicate better multi-level thresholding results. However, the visual effect of Figure 6c is clearly superior to that of Figure 6d. Compared to these three traditional visual quality metrics, the MCC metric demonstrates greater robustness in reflecting semantic information and segmentation accuracy. The experimental results reveal that images with better visual effects often exhibit higher MCC values. For instance, the MCC for Figure 6c is 0.9926, significantly surpassing the MCC of 0.2992 for Figure 6d.
7.3. Comparison Experiments on Synthetic Images
To evaluate the thresholding applicability of the CLCSE method and the compared methods, a test set of six synthetic images with varying numbers of targets was generated. By constructing synthetic images of increasing complexity, ranging from single-target to multi-target scenarios, we systematically evaluated the performance of the CLCSE method in multi-target, multi-level thresholding tasks.
Figure 7 illustrates the Shannon entropy objective function (SEOF) curves used in the CLCSE method, along with the grayscale histograms of the synthetic images and the thresholds determined by the CLCSE method. Figure 8 and Table 1 present qualitative and quantitative comparison results of the seven methods across the six synthetic images.
Figure 7a is a grayscale image containing a single target, where the size ratio between the target and the background is severely imbalanced. In this case, the CLCSE method achieves perfect segmentation (MCC = 1.0000), while the other methods fail to distinguish the target (MCC < 0.05). The success of CLCSE is rooted in its multiscale multiplication image selection mechanism (Section 3), which maximizes the Shannon entropy difference (Equation (11)) to suppress noise and enhance edge localization. Specifically, for Figure 7a, the optimal number of images participating in the multiscale multiplication transformation is determined by maximizing the entropy difference, and the final multiscale multiplication image is generated by multiplying the gradient magnitudes from the resulting five distinct scales (Equation (10)). This process effectively suppresses random noise (e.g., the background texture in Figure 2a) while amplifying the target's boundaries, enabling precise threshold selection even under extreme size imbalance. In contrast, metaheuristic methods (e.g., INGO, HHO) rely on global grayscale statistics (e.g., cross-entropy or inter-class variance) that assign equal weights to all pixels, causing thresholds to cluster around high-intensity regions (≈100) dominated by background pixels. Similarly, 2DNLM's non-local spatial fusion introduces noise sensitivity, further degrading its performance (MCC = 0.0112). These results validate that maximizing the Shannon entropy difference of multiscale multiplication images is critical for handling size-imbalanced segmentation tasks, whereas traditional methods lack such hierarchical edge discrimination.
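The multiscale multiplication idea can be illustrated with the short sketch below, which multiplies Gaussian gradient-magnitude images computed at several scales so that genuine edges survive the product while uncorrelated noise responses are attenuated. The SciPy-based implementation, the synthetic test image, and the five sigma values are assumptions for illustration only; they do not reproduce Equations (10) and (11) or the paper's scale-selection rule.

```python
import numpy as np
from scipy import ndimage

def multiscale_multiplication(image, sigmas):
    """Multiply gradient-magnitude images computed at several Gaussian scales.

    Strong edges persist across scales and survive the product, while
    scale-inconsistent noise responses are suppressed (a simplified sketch).
    """
    image = image.astype(np.float64)
    product = np.ones_like(image)
    for sigma in sigmas:
        grad_mag = ndimage.gaussian_gradient_magnitude(image, sigma=sigma)
        product *= grad_mag
    return product

# Illustrative usage with five assumed scales (not the paper's exact settings)
rng = np.random.default_rng(0)
img = np.zeros((128, 128))
img[32:96, 32:96] = 100.0                  # a single bright target
img += rng.normal(0.0, 5.0, img.shape)     # additive noise
edge_map = multiscale_multiplication(img, sigmas=[1, 2, 3, 4, 5])
```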
For the grayscale image containing two targets shown in Figure 7b, all methods except the 2DNLM method yielded satisfactory results. The 2DNLM method extends Rényi entropy to multi-level thresholding by combining a non-local means two-dimensional histogram with the exponential Kbest gravitational search method to determine the optimal thresholds. While this approach considers both pixel grayscale information and spatial information, its complex multi-information processing framework can lead to inferior performance compared to other methods in certain scenarios.
For the grayscale image with three targets shown in Figure 7c, the CLCSE method successfully achieved complete segmentation of all targets, while the other methods exhibited inferior performance. The initial thresholds of these methods are commonly concentrated in the range [0, 40], which can be attributed to the large number of pixels with a grayscale of 0. This concentration reduces the frequency differences among the other grayscales in this range, leading these methods to favor lower grayscales during the initial threshold selection process. A similar phenomenon can be observed in Figure 7e, which reveals the strong dependence of these methods on the grayscale distribution. This reliance hampers their effectiveness in handling complex or non-uniform grayscale distributions, resulting in suboptimal segmentation performance.
In Figure 7d–f, as the number of targets in the grayscale image increases, the proposed CLCSE method maintains robust segmentation accuracy (MCC > 0.985), outperforming all compared methods. This success is rooted in the composite local contour Shannon entropy strategy (Section 5), which selectively combines local contours through a partial fusion approach to ensure precise edge localization for each target. Specifically, CLCSE retains the contour image with the highest number of contour pixels (e.g., the background in Figure 7d) as a standalone reference and combines the remaining contours using Equation (16). This partial combination strategy avoids the redundancy caused by merging all local contours (Figure 5), ensuring that each retained contour accurately reflects the boundaries of distinct targets. For instance, in Figure 7f, retaining the largest contour (the background) as a standalone reference while combining the remaining contours allows the method to focus on the subtle edges of individual targets, ensuring that the composite contour accurately reflects the true boundaries of all targets, thereby achieving an MCC of 0.9858. In contrast, metaheuristic methods such as QLGJO and HHO rely on fixed fitness functions (e.g., inter-class variance) that bias thresholds toward dominant grayscale regions, leading to misclassification of smaller or adjacent targets (QLGJO MCC = 0.4915). The 2DNLM method, despite incorporating spatial context through non-local means 2D histograms, amplifies noise in dense configurations (MCC = 0.7102), while IPSO's divergence-based objective fails to adapt to increasing target complexity (MCC = −0.0277). These results demonstrate that the partial contour fusion strategy in CLCSE, guided by composite local contour entropy, is essential for multi-target segmentation, as it adaptively preserves critical edges while suppressing redundant information, a capability absent in parameter-dependent frameworks.
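A minimal sketch of this partial fusion idea is given below, assuming binary contour images: the contour image with the most contour pixels is kept as a standalone reference, and the remaining contour images are merged by a pixel-wise union. The helper name and the union rule are illustrative simplifications, not the exact combination defined by Equation (16).

```python
import numpy as np

def partial_contour_fusion(contour_images):
    """Keep the contour image with the most contour pixels standalone and
    union the remaining contour images into a single composite contour.

    contour_images: list of binary (0/1) arrays of identical shape.
    Returns (standalone_contour, composite_of_the_rest).
    """
    counts = [int(c.sum()) for c in contour_images]
    largest = int(np.argmax(counts))                  # e.g., the background contour
    standalone = contour_images[largest]
    rest = [c for i, c in enumerate(contour_images) if i != largest]
    composite = np.zeros_like(standalone)
    for c in rest:
        composite = np.logical_or(composite, c).astype(standalone.dtype)
    return standalone, composite

# Illustrative usage with three synthetic binary contour maps
a = np.zeros((64, 64), np.uint8); a[0, :] = 1          # large "background" contour
b = np.zeros((64, 64), np.uint8); b[20, 20:30] = 1     # small target contour
c = np.zeros((64, 64), np.uint8); c[40, 10:15] = 1     # another target contour
standalone, composite = partial_contour_fusion([a, b, c])
```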
7.4. Comparison Experiments on Real-World Images
To further demonstrate the potential applications of the proposed CLCSE method in various real-world scenarios, we tested and compared seven multi-level thresholding methods across nine representative images. These images were selected from diverse application domains, including material non-destructive testing, brain tumor MRI imaging, street-view thermal infrared imaging, satellite remote sensing for oil spill detection, ground scene remote sensing, infrared thermography for circuit breaker fault detection, steel surface defect inspection, and ship target monitoring.
Figure 9 presents the threshold selection results of the CLCSE method applied to nine real-world images. Figure 9a shows an industrial micro-defect image that requires precise differentiation of subtle differences due to the extremely small target and complex background texture. Figure 9b illustrates a brain tumor MRI image characterized by highly variable tumor morphologies and blurry edges, demanding robust recognition of shapes and boundaries. Figure 9c depicts an industrial product image with blurred shadows, where the method must maintain high performance despite noise and uncertain boundary conditions. Figure 9d presents a steel surface defect image that features low background contrast and unclear defect boundaries, requiring effective handling of low-contrast images. Figure 9e shows an infrared thermography image of circuit breaker faults, which necessitates the ability to detect minimal temperature differences due to a uniform grayscale distribution. Figure 9f highlights a ship monitoring image, where the grayscales of ocean waves are similar to those of the hull, requiring strong discrimination capabilities to avoid mis-segmentation. Figure 9g provides an oil spill detection remote sensing image that poses challenges of uneven target-to-background ratios and complex ocean textures, necessitating effective segmentation of imbalanced scales. Figure 9h presents a ground remote sensing image, where intricate grass textures and random details increase the difficulty of target area extraction. Figure 9i illustrates a street-view thermal infrared image, where people appear small and their grayscale is similar to that of the background, necessitating that the method effectively handles small targets with precision.
Figure 10 and Figure 11 present the thresholding results obtained by the seven methods on the images shown in Figure 9, while Table 2 summarizes the corresponding MCC values of these methods on the nine real-world images. The results indicate that the CLCSE method achieves relatively higher MCC values on most test images, demonstrating superior thresholding adaptability and robustness in addressing challenges such as complex backgrounds, low contrast, blurry boundaries, and imbalanced size ratios. In comparison, the 2DNLM method consistently exhibits lower MCC values on all test images, reflecting its weaker thresholding adaptation in these complex scenarios. The QLGJO method performs well on the test images in Figure 9b,d,e, achieving MCC values exceeding 0.8500, which indicates that the QLGJO method has specific advantages in scenarios characterized by low contrast. Additionally, the EBMO method attains higher MCC values on the images in Figure 9a,f,g than the other methods, except for CLCSE, indicating its relatively strong adaptability and robustness in handling tasks involving complex backgrounds and imbalanced target-to-background ratios. For the INGO, HHO, and IPSO methods, the MCC values fluctuate noticeably across these test images, indicating that their thresholding accuracy is not consistently stable across different scenarios.
7.7. Discussions
The proposed CLCSE method fundamentally diverges from existing multi-level thresholding techniques through its edge-driven entropy optimization framework, which directly selects thresholds based on local maxima of composite contour Shannon entropy, bypassing iterative metaheuristic optimization. Unlike conventional methods (e.g., 2DNLM, HHO, QLGJO) that rely on a two-stage paradigm—designing objective functions (e.g., cross-entropy, divergence measures) and deploying population-based algorithms to search for thresholds—CLCSE integrates a multiscale multiplication transform to enhance critical edges while suppressing noise, and dynamically constructs contour-guided entropy models to identify thresholds from semantically meaningful boundaries. This eliminates parameter dependency, e.g., population size and mutation rates. While baseline methods suffer from premature convergence in texture-rich or low-contrast scenarios due to global grayscale statistics, CLCSE prioritizes localized edge distributions, achieving superior accuracy in target-distinct cases.
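To make this edge-driven selection concrete, the sketch below picks thresholds at the most prominent local maxima of a one-dimensional entropy-versus-gray-level curve. The peak-ranking rule and the placeholder curve are assumptions for illustration only; they stand in for, but do not reproduce, the composite local contour Shannon entropy model defined in the paper.

```python
import numpy as np
from scipy.signal import find_peaks

def thresholds_from_entropy_curve(entropy_curve, num_thresholds):
    """Select thresholds at the most prominent local maxima of an entropy curve
    indexed by gray level (a conceptual sketch, not the paper's exact rule)."""
    peaks, props = find_peaks(entropy_curve, prominence=0.0)
    order = np.argsort(props["prominences"])[::-1]        # strongest peaks first
    chosen = np.sort(peaks[order][:num_thresholds])
    return chosen.tolist()

# Illustrative curve with three bumps (placeholder data, not a real entropy model)
g = np.arange(256, dtype=np.float64)
curve = (np.exp(-((g - 60) ** 2) / 200) +
         np.exp(-((g - 130) ** 2) / 300) +
         np.exp(-((g - 200) ** 2) / 150))
print(thresholds_from_entropy_curve(curve, num_thresholds=3))  # approx. [60, 130, 200]
```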
Many existing approaches, including 2DNLM, INGO, HHO, IPSO, EBMO, and QLGJO, predominantly follow a “black-box optimization” framework, where thresholds are derived by maximizing/minimizing predefined objective functions (e.g., Otsu’s variance, exponential entropy) via metaheuristic algorithms. While these methods demonstrate theoretical potential, their practical performance is constrained by intrinsic limitations in objective function design and extrinsic bottlenecks in metaheuristic optimization.
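As a concrete example of such a predefined objective, the sketch below evaluates the multi-level Otsu inter-class variance over a toy histogram and finds the best threshold pair by exhaustive search, which stands in for the metaheuristic stage of the two-stage paradigm; the histogram values and the brute-force search are illustrative assumptions, not any compared method's actual setup.

```python
import numpy as np
from itertools import combinations

def between_class_variance(hist, thresholds):
    """Otsu-style inter-class variance for a gray-level histogram split by thresholds."""
    p = hist / hist.sum()
    levels = np.arange(len(hist))
    mu_total = np.sum(levels * p)
    bounds = [0] + [t + 1 for t in sorted(thresholds)] + [len(hist)]
    score = 0.0
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        w = p[lo:hi].sum()                       # class probability
        if w > 0:
            mu = np.sum(levels[lo:hi] * p[lo:hi]) / w
            score += w * (mu - mu_total) ** 2    # weighted squared mean deviation
    return score

# Toy 16-bin histogram; exhaustive search stands in for the metaheuristic optimizer
hist = np.array([30, 40, 35, 5, 2, 1, 25, 30, 20, 3, 2, 1, 15, 25, 20, 10], float)
best = max(combinations(range(15), 2), key=lambda t: between_class_variance(hist, t))
print(best)  # the threshold pair maximizing the inter-class variance
```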
- (1) Intrinsic Limitations
The predefined objective functions used by baseline methods often fail to adapt to complex scenarios due to their reliance on global statistical measures or rigid mathematical formulations:
- The 2D Rényi entropy objective function in the 2DNLM method combines non-local spatial information with grayscale distributions. While it improves noise robustness compared to 1D methods, its computational complexity becomes prohibitive for high-resolution images. Additionally, the fixed Rényi parameter α limits adaptability to images with varying texture complexity.
- The symmetric cross-entropy objective function in the INGO method assumes balanced foreground and background distributions. In size-imbalanced cases (e.g., small defects in industrial images), this function biases thresholds toward dominant regions (e.g., backgrounds), leading to misclassification.
- The Kullback–Leibler divergence-based cross-entropy in the HHO method measures global similarity between the original and segmented images. However, it struggles with low-contrast boundaries (e.g., blurred edges in medical images), where grayscale overlap between regions creates ambiguous fitness landscapes.
- The arithmetic–geometric divergence criterion in the IPSO method focuses on minimizing statistical discrepancies between regions. While effective for homogeneous textures, it fails to capture edge information, resulting in fragmented segmentation for images with complex structures.
- The exponential entropy in the EBMO method's objective function replaces logarithmic terms with exponential gains to avoid undefined values. However, this modification amplifies noise sensitivity, as seen in low-SNR images.
- Otsu's inter-class variance in the QLGJO method maximizes separability between regions but ignores spatial coherence. For images with overlapping intensity distributions, this leads to oversegmentation.
- (2) Extrinsic Limitations
The 2DNLM method exhibits three primary shortcomings. Firstly, the computation of its 2D histogram requires analyzing relationships between pixel values and non-local means, significantly increasing complexity compared to 1D histograms. This process involves calculating and statistically aggregating neighborhood pixels for every image pixel, resulting in substantial computational overhead. Secondly, the method is highly sensitive to parameter settings. Its enhanced gravitational search algorithm (eKGSA) relies on empirically tuned parameters (e.g., population size, gravitational constant G, and iteration limits; see Table 6), which lack theoretical guidance and risk suboptimal convergence or local minima traps. Additionally, the Rényi entropy parameter α, fixed at 0.45 in the experiments, limits adaptability across diverse images despite its stability within α ∈ [0.1, 0.9]. Thirdly, the method struggles with noise robustness. While non-local means filtering mitigates noise to some extent, it fails to eliminate strong or structurally complex noise, often blurring critical details and distorting the 2D histogram. Consequently, residual noise alters the threshold distribution, leading to misclassification or blurred boundaries in the segmentation results.
The INGO method suffers from two critical drawbacks. Firstly, the integration of multiple enhancement strategies, including cubic chaotic initialization, best-worst reverse learning, and lens imaging reverse learning, significantly increases algorithmic complexity. This structural intricacy raises implementation challenges and necessitates the adjustment of additional parameters, where improper configurations may degrade performance. Secondly, the method faces hyperparameter selection difficulties (see Table 6). Key parameters, such as the cubic chaos control parameter and the lens imaging scaling factor, lack systematic guidance for optimal tuning. In practice, users must empirically determine these values through extensive experimentation, which complicates usability and limits adaptability to diverse datasets.
The HHO method exhibits two key limitations. Firstly, its performance heavily depends on image feature characteristics. The method’s reliance on grayscale distributions makes it sensitive to uneven intensity variations or noise interference, leading to inaccurate boundary detection in homogeneous regions. Additionally, its minimum cross-entropy criterion, based solely on grayscale information, fails to capture structural or chromatic details in complex images (e.g., color-rich or texture-dense scenes), resulting in suboptimal segmentation. Secondly, the algorithm is prone to local optima stagnation, a common issue in population-based optimization algorithms. In complex segmentation tasks with large search spaces, the Harris hawks’ convergence strategy prioritizes exploitation over exploration, causing premature convergence to suboptimal thresholds and degrading accuracy.
The IPSO method faces four primary limitations. Firstly, its robustness to noise remains underexplored, despite the prevalence of noise in practical applications such as crack detection. Sensitivity to noise can distort grayscale distributions, leading to inaccurate threshold selection. Secondly, the algorithm requires manual tuning of multiple parameters, including the learning factors, population size, and maximum iterations (see Table 6). The absence of systematic guidelines for parameter optimization restricts its adaptability across diverse datasets. Thirdly, while local stochastic perturbations are introduced to mitigate local optima, the algorithm still risks stagnation in high-dimensional or complex search spaces, particularly when handling intricate image data. Lastly, the method relies primarily on grayscale intensity distributions for thresholding, neglecting critical spatial features such as texture and edge information. This oversimplification limits segmentation accuracy in scenarios where structural or contextual cues are essential.
The EBMO method exhibits two critical limitations in algorithmic performance. Despite incorporating a Gaussian mutation strategy and random flow steps to enhance exploration, the algorithm remains susceptible to local optima when addressing high-dimensional or multi-modal optimization problems, particularly with complex fitness landscapes. This limitation arises from insufficient global search capability, preventing it from identifying true global thresholds in intricate scenarios. Additionally, the method struggles to balance convergence speed and precision. While improvements in convergence rate are achieved, this often occurs at the expense of segmentation accuracy, especially in tasks demanding extremely high precision, such as fine-grained medical image analysis.
The QLGJO method exhibits three primary limitations. Firstly, it demonstrates significant parameter sensitivity, requiring meticulous tuning of key hyperparameters such as the Q-learning rate (λ), discount factor (γ), and mutation strategy coefficients (see Table 6). Suboptimal parameter settings can slow convergence or trap the algorithm in local optima, necessitating extensive experimental calibration in practical applications. Secondly, despite integrating reinforcement learning and mutation mechanisms to enhance population diversity, the algorithm remains prone to local optima stagnation in complex, multi-modal optimization landscapes. This issue stems from insufficient exploration capability in later iterations, where diminished diversity restricts global search effectiveness. Lastly, the method's robustness is inconsistent, particularly under data variations or noise interference. While it achieves stable performance in controlled experiments, real-world scenarios with subtle intensity shifts or artifacts may degrade segmentation reliability. Collectively, these limitations highlight challenges in balancing adaptability, precision, and computational stability.