Edge- and Color–Texture-Aware Bag-of-Local-Features Model for Accurate and Interpretable Skin Lesion Diagnosis

Liu, Dichao; Suzuki, Kenji

doi:10.3390/diagnostics15151883

Open AccessArticle

Edge- and Color–Texture-Aware Bag-of-Local-Features Model for Accurate and Interpretable Skin Lesion Diagnosis

by

Dichao Liu

^*

and

Kenji Suzuki

^*

Biomedical Artificial Intelligence Research Unit, Institute of Integrated Research (IIR), Institute of Science Tokyo, 4259 Nagatsuta-cho, Midori-ku, Yokohama 226-8503, Kanagawa, Japan

^*

Authors to whom correspondence should be addressed.

Diagnostics 2025, 15(15), 1883; https://doi.org/10.3390/diagnostics15151883

Submission received: 27 April 2025 / Revised: 23 June 2025 / Accepted: 23 July 2025 / Published: 27 July 2025

(This article belongs to the Special Issue Lesion Detection and Analysis Using Artificial Intelligence, Third Edition)

Download

Browse Figures

Versions Notes

Abstract

Background/Objectives: Deep models have achieved remarkable progress in the diagnosis of skin lesions but face two significant drawbacks. First, they cannot effectively explain the basis of their predictions. Although attention visualization tools like Grad-CAM can create heatmaps using deep features, these features often have large receptive fields, resulting in poor spatial alignment with the input image. Second, the design of most deep models neglects interpretable traditional visual features inspired by clinical experience, such as color–texture and edge features. This study aims to propose a novel approach integrating deep learning with traditional visual features to handle these limitations. Methods: We introduce the edge- and color–texture-aware bag-of-local-features model (ECT-BoFM), which limits the receptive field of deep features to a small size and incorporates edge and color–texture information from traditional features. A non-rigid reconstruction strategy ensures that traditional features enhance rather than constrain the model’s performance. Results: Experiments on the ISIC 2018 and 2019 datasets demonstrated that ECT-BoFM yields precise heatmaps and achieves high diagnostic performance, outperforming state-of-the-art methods. Furthermore, training models using only a small number of the most predictive patches identified by ECT-BoFM achieved diagnostic performance comparable to that obtained using full images, demonstrating its efficiency in exploring key clues. Conclusions: ECT-BoFM successfully combines deep learning and traditional visual features, addressing the interpretability and diagnostic accuracy challenges of existing methods. ECT-BoFM provides an interpretable and accurate framework for skin lesion diagnosis, advancing the integration of AI in dermatological research and clinical applications.

Keywords:

diagnosis; biomedical imaging; computer-assisted intervention; handcrafted and deep feature fusion

1. Introduction

In recent years, the rising incidence of skin disease mortality has garnered public attention [1], and deep learning techniques such as ResNets [2] have become dominant computer-aided dermatologic diagnosis methods. However, despite their promising success, deep models are often regarded as “black boxes,” as their decision-making process is not always transparent or clinically interpretable. Given an image of a skin lesion, they can predict the lesion category but cannot explain the prediction’s basis. To address this issue, many previous studies have used attention visualization (AV) methods, such as class activation maps (CAMs) [3] and gradient-based class activation maps (Grad-CAM) [4], to explain the prediction basis of deep models in dermatologic diagnosis [5,6,7,8]. AV refers to generating a heatmap to highlight the regions in the input image leading to the model’s prediction. For example, Jiang et al. [6] added a CAM to some attention modules to generate a heatmap that identifies the lesion regions considered the decision basis. Wang et al. [5] utilized Grad-CAM as a part of an interpretable module to produce a heatmap for explaining the dermatologic diagnostic results.

However, many recent studies have pointed out that the heatmap yielded by existing AV methods is imprecise [9,10,11]. Existing AV methods usually employ a model’s deep layer to obtain a low-resolution activation map and up-sample it to create a heatmap of the same size as the input. This strategy requires the sampling results to be aligned with the input image spatially. However, the receptive field is very large in deep layers. For example, in ResNet50 [2], the receptive field of each pixel of a deep feature map can be even larger than the original input [12]. That is, in this case, from a theoretical point of view, each pixel of the deep activation maps may potentially sense any location on the input; thus it is not reasonable to use them to identify particular locations. From a practical point of view, deep activation maps somehow do correlate spatially with the input image. However, the underlying problem of too large receptive fields still causes many surface problems such as receptive field misalignment [9]. These problems have also been observed previously by other researchers [9,10,11].

Additionally, due to the large receptive field, most deep models are highly effective at capturing global features but struggle to capture fine-grained spatial details. However, for skin disease classification, these fine details are crucial. The differences between skin disease categories, such as melanoma and nevus, are often subtle and lie in fine-grained features like edge irregularity and color unevenness. While these diseases may appear similar in overall shape, it is these detailed characteristics that often determine whether a lesion is malignant or benign.

Another issue with today’s deep learning-based dermatologic diagnostics is the neglect of statistically-based traditional visual features derived from clinical experience, which are mathematically interpretable. For example, extracting edge information from a lesion and checking whether the pigment pattern shows a sharp, abrupt cutoff or a gradual, inconspicuous one is a crucial step in the clinical diagnostic process of a skin lesion [13,14]. Therefore, traditional computer-aided diagnostics often utilize edge information (EI) such as Sobel [15] as part of the descriptors [16]. Other examples include color–texture information (CTI), which also plays a crucial role in both clinical and traditional computer-aided diagnosis of skin lesions [13]. Although it is possible for a deep model to implicitly acquire features with similar functionality to traditional ones during feature extraction, it remains uncertain whether they actually do so. Additionally, understanding how deep models base decisions on these features is unattainable.

In response to the above issues, we revisited the dermatological image diagnosis process used before the rise in deep learning. In this traditional approach, handcrafted features are first extracted from small image patches and then statistically analyzed before being aggregated into image-level representations [17]. This method has two advantages: first, the extracted features from each patch are statistically interpretable; second, identifying the most influential patches helps pinpoint the visual cues contributing to the diagnosis. Inspired by this, we explore whether modern deep learning systems can retain these advantages while maintaining their performance benefits. This led us to bag-of-feature (BoF) models [18], which classify an image based on the occurrence of small local images, and we propose the edge- and color–texture-aware bag-of-local-features model (ECT-BoFM), which allows the BoF model to draw insights from traditional features without being constrained by them, thereby further enhancing performance.

Specifically, as shown in Figure 1, in ECT-BoFM, a backbone network with a small receptive field extracts local features from small patches obtained by splitting an input image. The local features are then averaged to be the global descriptor for classification. In this way, each deep feature element strictly corresponds spatially to a particular patch of the input image. ECT-BoFM is trained with multiple tasks, including classification and the reconstruction of traditional EI and CTI features using deep features. The latter task not only enables the model to learn EI and CTI from the traditional features but also provides heatmaps from different perspectives. This allows us to understand not only where the model deems important but also the perspective (EI or CTI) upon which the model bases its importance assessment. Our contributions are as follows:

(i): We propose the novel ECT-BoFM for skin lesion diagnosis, which limits the size of the deep features’ receptive field to obtain precise heatmaps and learns CTI and EI from traditional features.
(ii): We propose a non-rigid reconstruction strategy that allows the model to learn from traditional features without being constrained by them.
(iii): ECT-BoFM, while sacrificing some capability in capturing global image information, significantly enhances the extraction of fine-grained features, which are more critical for skin lesion diagnosis, achieving high diagnostic performance. It surpassed recent state-of-the-art methods on two widely used public datasets, ISIC2018 [19] and ISIC2019 [19,20].
(iv): ECT-BoFM yields precise and detailed heatmaps showing the prediction basis, potentially making clinical applications of computer-aided diagnosis more trustworthy.
(v): Our experiments show that using only a small subset of image regions selected by our algorithm enables global-feature-learning models such as Vision Transformers [21] to achieve classification performance comparable to full-image inputs. This raises questions about the necessity and effectiveness of global learning mechanisms like self-attention and deep-stacked CNNs for skin disease diagnosis. We believe that this insight could inspire the development of more efficient network architectures in the future.

Figure 1. A simplified illustration of the receptive fields [22] of general convolutional neural networks (CNNs) and our model. For ease of understanding, a feature map is simplified as a row of solid circles and shown. The hollow circles represent padding. Here we use a 3 × 3 convolution with a stride of one as an example. In conventional CNNs, stacking multiple convolutional layers progressively enlarges the receptive field [22,23], allowing each pixel in the deep feature map to theoretically correspond to any location in the entire image. In contrast, our model divides the input image into independent patches, each processed by a network with a small receptive field and shared weights. The predictions (logits) from all patches are then averaged to produce the final output. This ensures that each pixel in the deep feature map strictly corresponds to a specific patch. Note that for illustration, we show a 3 × 3 grid of patches in this figure. In our actual work, a 225 × 225 input image is divided into 25 × 25 patches using a 33 × 33 window with a stride of eight, enabling more precise heatmap generation.

2. Related Work

Over the past decade, the rapid advancement of artificial intelligence has significantly transformed skin disease image diagnosis algorithms. Traditional methods, predominantly based on statistical principles and handcrafted feature extraction, have given way to deep learning-based approaches [17]. Traditional algorithms often relied on the integration of clinical medical knowledge with statistical techniques. Medical insights guided feature selection, focusing on characteristics such as shape, color, and texture, which have clear clinical interpretations. For example, malignant tumors like melanoma typically exhibit irregular edges and uneven pigmentation, making these features crucial for classification [24]. Statistical methods quantified these features, providing input for machine learning models, while medical expertise helped prioritize relevant features to enhance diagnostic accuracy.

In recent years, deep learning models have revolutionized the field, consistently outperforming traditional methods. State-of-the-art (SOTA) algorithms in skin disease diagnosis are predominantly deep learning-based, with innovations broadly categorized into advancements in network architecture [25,26,27] and training strategies [1,25,28].

2.1. Innovations in Network Architecture

Advancements in network architecture often involve designing models tailored to the specific characteristics of skin disease images or broader medical imaging tasks. For instance, Huo et al. [25] proposed HiFuse, a three-branch hierarchical network integrating convolution-based feature extraction, adaptive hierarchical feature fusion (HFF) blocks, and inverted residual MLPs. This model achieves outstanding performance on various medical image classification datasets, combining enhanced accuracy with computational efficiency.

More recently, with the emergence of novel token-based architectures like Vision Transformers [21], researchers have explored hybrid models that combine these token-based architectures with convolutional neural networks (CNNs) to further enhance skin disease classification. For example, Manzari et al. [26] introduced a robust CNN–Transformer hybrid model for medical image classification. By leveraging an efficient convolutional attention mechanism, the model captures long-range dependencies while reducing the quadratic complexity of self-attention. Additionally, the model emphasizes global structural features and utilizes feature space augmentation with mean and variance interpolation, achieving smoother decision boundaries and better generalization across diverse medical domains. Yue et al. [27] proposed MedMamba, a model that combines convolutional layers with state-space models (SSMs), effectively capturing both short- and long-range dependencies. By employing grouped convolution and channel-shuffle strategies, this approach balances computational efficiency with parameter reduction.

2.2. Innovations in Training Strategies

Innovations in training strategies have primarily targeted the imbalanced data distribution inherent in skin disease datasets. Certain diseases, such as melanoma, are relatively common, while others, such as actinic keratosis, are rare, resulting in a long-tailed distribution. Addressing this imbalance is critical to improving model performance. For instance, Gessert et al. [29] proposed a patch-based attention mechanism combined with pretrained architectures to enhance skin lesion classification. This method captures the global context across local patches, outperforming traditional aggregation approaches. Furthermore, diagnosis-guided loss weighting was introduced to incorporate meta-information about diagnostic methods, mitigating the impact of class imbalance. Zhang et al. [1] proposed a class-enhancement contrastive learning (ECL) strategy to address data imbalance and diagnostic difficulty. By utilizing a hybrid-proxy model and a cycle update strategy, minority class representation was improved. The proposed losses also accounted for both data imbalance and diagnostic complexity, leading to superior performance on imbalanced dermoscopic datasets.

2.3. The Position of Our Work

Our work primarily focuses on innovations in training strategies combined with certain network architectural enhancements. We rethink traditional handcrafted features used in skin disease classification, noting their medical and statistical interpretability, as well as their focus on extracting statistical features from small image patches and aggregating them at the final stage to capture fine-grained details. While the performance of handcrafted features has fallen behind deep learning-based methods, we argue that the underlying principles of these traditional approaches remain valuable. By leveraging suitable deep network architectures and designing training strategies, we explore how to extract relevant features guided by domain knowledge embedded in traditional methods, aiming to enhance the interpretability of deep features. Building on this, we integrated the bag-of-feature model into skin disease classification and adapted it to align with our tailored training strategy. This approach yielded precise and interpretable heatmaps from different traditional feature perspectives. Notably, despite not employing architectures capable of capturing global image information or training strategies specifically tailored for long-tailed challenges, our diagnostic performance surpasses the aforementioned SOTA methods. Additionally, we analyzed the impact of image patches selected by our model on classification results, comparing them to the influence of the entire image in deep learning. We believe that the findings from this analysis can provide valuable insights for the design of future dermatological lesion diagnostic models.

In addition to advances in architecture and training, substantial efforts have been made to improve the explainability of deep models in skin disease classification. Saliency-based visualization methods such as CAM [3], Grad-CAM [4], and their dermatology-specific adaptations [5,6,8] are widely adopted to highlight diagnostic regions. While effective to some extent, these methods often rely on deep-layer activations with large receptive fields, leading to heatmaps that may lack spatial precision or clinical specificity [9,30]. Some alternative techniques, including attribution-based methods such as SHAP [31], have also been explored.

Our work builds upon this line of research by proposing a local-feature-based framework that explicitly reconstructs traditional visual features, such as edge and color-texture, from deep features. This allows the model not only to localize informative regions with high spatial precision but also to generate clinically grounded interpretations aligned with expert diagnostic criteria (see details in Section 4.4).

3. Method

Figure 2 illustrates the overall framework of ECT-BoFM. Let

I \in R^{H \times W \times 3}

denote an input image. The framework conducts three main processes on I: (i) extracting the RGB (red, green, and blue) and YUV (luma, chrominance blue, and chrominance red) channels from I and concatenating the local binary pattern (LBP) histograms [32] (59-D) computed on the six channels to form a color–texture descriptor

C \in R^{354}

; (ii) obtaining a deep feature map X using a backbone network

B

; and (iii) deriving an EI map

E \in R^{H \times W}

with the Sobel operator [15].

The backbone network

B

is based on BagNet-33 [18], so the receptive field is restricted to 33 × 33 pixels. In this paper, we set the stride for the 33 × 33 window as 8. The default input size of the original BagNet-33 model is 224 × 224, which results in the regions at the end of the input image not being traversed by the 33 × 33 window. To solve this problem, we set the size of I as 225 × 225, making

X \in R^{25 \times 25 \times 2048}

. Each element

x \in R^{1 \times 1 \times 2048}

in X strictly corresponds to a certain 33 × 33 region in I.

Thereafter, a CTI diverger and an EI diverger take X as the input to generate a CTI-biased feature map

X_{CTI} \in R^{25 \times 25 \times 2048}

and an EI-biased feature map

X_{EI} \in R^{25 \times 25 \times 2048}

, respectively. Then, a CTI decoder and an EI decoder take

X_{CTI}

and

X_{EI}

as the input to reconstruct C and E, respectively. During training, the decoders are optimized with the proposed non-rigid reconstruction strategy, which can prevent the optimization of the decoders from harming the classification performance. Meanwhile, a CTI gate and an EI gate enhance the features in

X_{CTI}

and

X_{EI}

that are beneficial for classification and weaken the unbeneficial features, respectively. The gated features are concatenated and then processed by global pooling (GAP) and fully connected (FC) layers for classification. During training, to obtain optimal results, we use a strategy called partial sharpness-aware minimization (PSAM) to optimize the whole framework. For inference, the decoders are removed, and the traditional features are not needed.

Divergers: The divergers are designed to diverge X into a CTI-biased feature map versus an EI-biased feature map. Let

f_{Conv}^{M \times N \times H^{'} \times W^{'}} (.)

denote a convolutional layer, which has M input channels, N output channels, and a kernel size of

H^{'} \times W^{'}

. All the convolutional layers given in this section do not use padding layers and have a stride of 1, so we omit them in the equations. The CTI and EI divergers have the same structure but do not share the parameters. The functions can be written as

X_{CTI} = f_{Conv}^{2048 \times 2048 \times 1 \times 1} (X)

,

X_{EI} = f_{Conv}^{2048 \times 2048 \times 1 \times 1} (X)

.

Decoders: The decoders are designed to reconstruct the color–texture descriptor

C \in R^{354}

and edge map

E \in R^{225 \times 225}

. This design allows the model to learn about CTI and EI during training and subsequently analyze the classification results from the perspectives of CTI and EI through visualization. In a typical reconstruction task, the decoder generates an output of the same size as the target and minimizes the point-to-point difference between this output and the target using a loss function such as the widely used Smooth L1 Loss. However, the main task in this paper is the classification task. Directly applying the typical strategies in the reconstruction task may result in degraded classification performance. For example, there may be competition between the low-level vision task (reconstruction) and high-level vision task (classification). Also, traditional feature extractors such as LBP and Sobel have been behind deep models in terms of classification performance for a long time. Making deep features strictly approximate traditional features may limit their usefulness. To solve this problem, we propose non-rigid reconstruction (NRR), which does not view the traditional features as direct reconstruction targets of

X_{CTI}

and

X_{EI}

, but rather as a special case in distributions defined by

X_{CTI}

and

X_{EI}

. Given

X_{CTI}

and

X_{EI}

, NRR first defines two Gaussian distributions,

N (μ_{CTI},

σ_{CTI}^{2})

and

N (μ_{EI}, σ_{EI}^{2})

, where

μ_{CTI}, σ_{CTI} \in R^{354}

and

μ_{EI}, σ_{EI} \in R^{25 \times 25 \times 2048}

are

\begin{matrix} μ_{CTI} = f_{FC}^{128 \times 354} (f_{ReLU} (f_{FC}^{2048 \times 128} (f_{ReLU} (f_{GAP} (X_{CTI}))))) \end{matrix}

(1)

\begin{matrix} σ_{CTI} = f_{FC}^{128 \times 354} (f_{ReLU} (f_{FC}^{2048 \times 128} (f_{ReLU} (f_{GAP} (X_{CTI}))))) \end{matrix}

(2)

\begin{matrix} μ_{EI} = f_{Conv}^{128 \times 2048 \times 1 \times 1} (f_{ReLU} (f_{Conv}^{2048 \times 128 \times 1 \times 1} (f_{ReLU} (X_{EI})))) \end{matrix}

(3)

\begin{matrix} σ_{EI} = f_{Conv}^{128 \times 2048 \times 1 \times 1} (f_{ReLU} (f_{Conv}^{2048 \times 128 \times 1 \times 1} (f_{ReLU} (X_{EI})))) \end{matrix}

(4)

Here,

f_{FC}

,

f_{GAP}

, and

f_{ReLU}

represent fully connected, global average pooling, and ReLU operations, respectively. Thereafter, we generate new features

X_{CTI}^{'} \in R^{354}

and

X_{EI}^{'} \in R^{25 \times 25 \times 2048}

by randomly sampling from the obtained distributions. However, due to the non-differentiability of random sampling, we follow Kingma et al. [33] to utilize reparameterization in practice. Let

N (μ,

σ^{2})

denote one of the above distributions, with the feature dimensionality omitted. Instead of directly sampling from

N (μ,

σ^{2})

, the reparameterization trick samples

ϵ

from a standard normal distribution

N (1,

0)

and computes the desired sample z as

z = μ + σ ϵ

. z can refer to either

X_{CTI}^{'}

or

X_{EI}^{'}

).

X_{EI}^{'}

is then up-sampled as

\begin{matrix} X_{EI}^{''} & = f_{Tconv}^{2048 \times 1 \times 33 \times 33} (f_{Conv}^{128 \times 2048 \times 1 \times 1} (f_{ReLU} (f_{Conv}^{2048 \times 128 \times 1 \times 1} (f_{ReLU} (X_{EI}^{'}))))) \end{matrix}

(5)

where

f_{Tconv}^{2048 \times 1 \times 33 \times 33} (.)

is a transposed convolution layer, and its stride is set as 8. Thus, the up-sampling process of this layer is in strict spatial correspondence with the down-sampling process of

B

. The proposed reconstruction loss is

\begin{matrix} L_{Rec} = \log (L_{KL} (C, X_{CTI}^{'}) + 1) + \log (L_{KL} (E, X_{EI}^{''}) + 1) \end{matrix}

(6)

where

L_{KL} (.)

denotes a KL loss with a sum reduction, and the log function can prevent the loss value from being too large. The design of NRR avoids strictly forcing

X_{CTI}

and

X_{CTI}

to adhere to traditional features to the extent that it may limit classification performance. At the same time, this design enables

X_{CTI}

and

X_{EI}

to learn macroscopic knowledge about CTI and EI during training that goes beyond the information covered by the traditional features [33].

Gates: The gates are designed to enhance features in

X_{CTI}

and

X_{EI}

that are beneficial for classification and weaken those that are unhelpful. The gated features

X_{CTI}^{g} \in R^{25 \times 25 \times 2048}

and

X_{EI}^{g} \in R^{25 \times 25 \times 2048}

are obtained as

\begin{matrix} X_{CTI}^{g} = X_{CTI} \otimes A_{CTI}, \\ where A_{CTI} = Sigmoid (f_{Conv}^{64 \times 1 \times 1 \times 1} (f_{Conv}^{2048 \times 64 \times 1 \times 1} (X_{CTI}))) \end{matrix}

(7)

\begin{matrix} X_{EI}^{g} = X_{EI} \otimes A_{EI}, \\ where A_{EI} = Sigmoid (f_{Conv}^{64 \times 1 \times 1 \times 1} (f_{Conv}^{2048 \times 64 \times 1 \times 1} (X_{EI}))) \end{matrix}

(8)

Here, ⊗ denotes the multiplication broadcast across the feature channels.

A_{CTI}

and

A_{EI}

are CTI and EI gating weights, respectively.

X_{CTI}^{g}

and

X_{EI}^{g}

then undergo concatenation, GAP, and FC operations to obtain categorical logits l.

Partial Sharpness-aware Minimization: During training, the overall loss can be written as

L = L_{S} (l, t) + L_{Rec}

, where

L_{S} (.)

denotes a softmax loss, and l, t are the predicted logits and target, respectively. Minimizing

L

is multi-task learning that encompasses both high-level and low-level vision tasks. To avoid pulling between tasks leading to suboptimal results, we propose PSAM, which is adapted from sharpness-aware minimization (SAM) [34]. Generally, the optimization of a model is defined as looking for the weights w that can result in the minimal loss such as

{min}_{w} L_{w} + {λ | | w | |}_{2}^{2}

, where

{λ | | w | |}_{2}^{2}

is an L2 regularization term. Unlike this general optimization strategy, SAM seeks parameters that lie in neighborhoods that have a uniformly low loss such as

{min}_{w} SAM (w) + {λ | | w | |}_{2}^{2}

, where

SAM (w) = {max}_{{| | ζ | |}_{2} \leq ρ} L_{w + ζ}

, and

ρ > 0

is a hyperparameter that defines the range of parameter neighborhoods that need to be searched. Instead of simply finding parameters leading to a low loss, SAM aims to find parameters where the entire neighborhood has a uniformly low training loss. This prevents the model from converging to sharp minima, where a parameter may lead to a low loss, but its neighboring parameters lead to a high loss. However, SAM may ignore some parameters that are in the sharp neighborhood but can actually lead to good performance. Thus, to further improve the results, we propose PSAM, which randomly selects 50% of the parameters to be updated with the general optimization strategy and the remaining 50% with SAM. PSAM also functions similarly to dropout by introducing noise as a form of regularization.

Visualization: In ECT-BoFM, each deep feature is in strict correspondence with a certain 33 × 33 patch in the input image. Given a 33 × 33 patch p, the model infers one logit

l_{p}^{k}

for the predicted skin lesion class k, together with one EI gating weight

A_{EI}^{p}

and one CTI gating weight

A_{CTI}^{p}

(refer to Equations (7) and (8)). Thus, for each pixel of the heatmaps, we can obtain three types of values: (i)

A_{CTI}^{p} l_{p}^{k}

, which denotes the model’s judgment of p’s importance from a CTI perspective; (ii)

A_{EI}^{p} l_{p}^{k}

, which denotes the model’s judgment of p’s importance from an EI perspective; and (iii)

α_{CTI} A_{CTI}^{p} l_{p}^{k} + α_{EI} A_{EI}^{p} l_{p}^{k}

, which denotes the model’s overall judgment of p’s importance. Here,

α_{CTI} = Σ (w_{CTI}^{FC}) / Σ (w^{FC})

and

α_{EI} = Σ (w_{EI}^{FC}) / Σ (w^{FC})

, where

w^{FC} \in R^{4096}

is the weights of the FC classifier for the predicted category k, and its negative values are set to 0.

w_{CTI}^{FC} \in R^{2048}

and

w_{EI}^{FC} \in R^{2048}

are the elements in

w^{FC}

handling the features from the CTI and EI gates, respectively. To generate the heatmaps, we pad I by 16 pixels on all sides with zeros and extract a total of 225 × 225 patches using a 33 × 33 window with a stride of 1. Then, we can obtain three heatmaps: a CTI heatmap, an EI heatmap, and an overall heatmap. The heatmaps are normalized with min-max normalization.

4. Experiments

4.1. Datasets and Implementation Details

Datasets: We evaluated the ECT-BoFM on two public dermoscopic datasets: ISIC2018 [19] with 10,015 images across seven classes (melanoma, melanocytic nevus, basal cell carcinoma, actinic keratosis/Bowen’s disease, benign keratosis, dermatofibroma, and vascular lesion) and ISIC2019 [19,20] with 25,331 images across eight classes (melanoma, melanocytic nevus, basal cell carcinoma, actinic keratosis, benign keratosis, dermatofibroma, vascular lesion, and squamous cell carcinoma).

Implementation Details: We trained the models using stochastic gradient descent (SGD) with 100 epochs, a momentum of 0.9, a weight decay of

5 \times 10^{- 4}

, and a mini-batch size of 16. The initial learning rate was 0.002, and it was updated with cosine annealing. During training, images are resized to

256 \times 256

with random cropping and flipping as the only augmentations. For inference, we applied center cropping. We followed previous studies [1,35,36] to use three common evaluation metrics: accuracy (Acc), macro f1-score (F1), and macro area under the curve (AUC). We repeated the experiments five times and reported the average results.

4.2. Comparison with State-of-the-Art Methods

To show the superiority of ECT-BoFM, we compared the diagnostic performance of ECT-BoFM with recent state-of-the-art (SOTA) methods on ISIC 2018 and 2019. Different studies have used various partitions of the ISIC 2018 dataset to evaluate performance. In our experiments, we followed the partitions used in previous studies [1,35,36,37] and compared our results against those obtained with the same partitions. Specifically, there were three partitions: (i) train/val/test = 1:1:1; (ii) train/val/test = 3:1:1; and (iii) using all 10,015 images as training or validating samples and testing the obtained model on the ISIC Task 3 validation set (193 images). For the third partition, we used 30% of the training data for validation. The previous SOTA methods on ISIC 2019 predominantly split this dataset into train/val/test = 3:1:1 [1], and thus we followed this partition. The comparison results are shown in Table 1, Table 2, Table 3 and Table 4. On ISIC 2018, ECT-BoFM significantly outperformed other SOTA methods in all metrics. Overall, the closest competitor in performance to our method is CEM [28]. However, our approach not only outperforms CEM significantly across all diagnostic metrics but also eliminates the need of manually labeling concept tags on large datasets and conducting additional training for generalization to other datasets, as required by CEM. On ISIC 2019, ECT-BoFM performed marginally lower than the previous best method, ECL [1], in the F1 score but higher than all previous SOTA methods in the other two metrics. ECT-BoFM has a significant advantage with the highest level in most metrics.

Although ECT-BoFM achieves significant improvements over prior methods on ISIC2018, the performance gain on ISIC2019 is relatively limited. We attribute this to the increased number of diagnostic categories in ISIC2019, which introduces greater inter-class similarity and more complex decision boundaries. For example, actinic keratosis (AK), a precancerous condition, often shares visual characteristics with both benign keratosis (BKL) and squamous cell carcinoma (SCC), making it particularly difficult to distinguish. This visual ambiguity increases the challenge not just for our method, but for all competing approaches.

This trend is clearly reflected in our experimental results, as shown in Table 3 and Table 4; classification performance across all methods is consistently lower on ISIC2019 than on ISIC2018. Therefore, we believe the relatively smaller margin between our method and other state-of-the-art models on ISIC2019 does not reflect a specific limitation of our approach, but rather highlights a broader challenge faced by current techniques when dealing with fine-grained, high-confusion medical categories. These findings suggest that further methodological advances are still needed to address the inherent difficulty of multi-class skin lesion diagnosis in more diverse datasets.

4.3. Ablation Studies

The design of ECT-BoFM included a BagNet-based backbone

B

, a CTI stream, an EI stream, the non-rigid reconstruction strategy, and the PSAM training strategy. We conducted ablation studies to evaluate the components of this design. Additionally, to demonstrate the effectiveness of limiting receptive fields, we also constructed two backbones with unlimited receptive fields by inserting MlpFusion and TransFusion layers into the final two stages of the original backbone. These fusion layers reshape the input tensor for processing through MLP or Transformer layers designed to globally integrate the local features and then reshape back the output to maintain the original tensor dimensions. In these ablation experiments, we constructed a very challenging diagnosis task by randomly dividing ISIC-2018 into the training, validation, and test sets in a 1:1:1 ratio. The results are shown in Table 5.

Adding a single CTI or EI stream to

B

improved all three metrics, with the addition of both CTI and EI streams leading to a greater improvement in all three metrics compared to the original

B

and to

B

with the addition of a single CTI or EI stream. This result indicates that CTI and EI streams are complementary, and a fusion of them is beneficial for diagnostic performance. Whether using TransFusion or MlpFusion, the network’s performance decreases when the receptive field is not limited. This indicates that constraining the receptive field can enhance diagnostic performance. Then, when we trained

B

with the addition of the two streams without NRR, the accuracy and F1 score were even lower than those of the original

B

, indicating the effectiveness of NRR. Furthermore, introducing SAM decreased AUC and F1, while introducing PSAM improved all metrics.

4.4. Analysis of Interpretable Heatmaps

In this section, we analyze the fidelity of the generated interpretable heatmaps. All experimental results are based on a 3:1:1 train–val–test split of the ISIC 2018 dataset, with all relevant models trained using the SGD optimizer.

Visualization Results: Figure 3 shows the visualization results of our model as well as those of Resnet50-based CAM and Grad-CAM, which are common strategies for prior interpretable dermatological diagnosis methods [3,4]. The CTI and EI heatmaps can provide explanations of the categorical predictions from two separate perspectives, enabling us to move a step forward in understanding the perspective from which the model makes its decisions. The overall heatmap can be seen as a synthesis of the CTI and EI heatmaps, capturing more comprehensive information. For dermatologic diagnosis, comprehensive information is beneficial. Consider the first row in Figure 3 as an example. The input image is a photo of a basal cell carcinoma (BCC). In this case, the EI stream mainly captures pigmented structures, such as a blue-gray ovoid nest, while the CTI stream mainly captures vascular structures, such as arborizing telangiectasias. Both are important for the diagnosis of BCC [17].

Impact of Heatmap-Guided Masking on Accuracy. To quantitatively analyze the fidelity of the generated ECT-BoFM heatmaps, we examined how accuracy changes when masking the most predictive 33 × 33 image patches. We employed three methods for masking patches: (i) defining the masking locations based on the ECT-BoFM heatmaps; (ii) randomly masking patches from all possible 33 × 33 windows with a stride of eight; and (iii) generating a heatmap using ResNet50 + CAM (note that CAM is similar to Grad-CAM for skin lesion diagnosis tasks, as illustrated in Figure 3; therefore we used CAM exclusively). For the third method, we traversed the generated heatmap with 33 × 33 windows and masked patches based on the highest sum of heatmap values within each window. We conducted these experiments on both the proposed ECT-BoFM and a trained ViT-B-16 model. The results, shown in Figure 4, demonstrate that masking the predictive patches identified by ECT-BoFM leads to a significant drop in diagnostic performance for both ECT-BoFM and ViT-B-16, compared to masking patches using other methods. Notably, masking just 20 patches, which constituted a small portion of the original image, resulted in negligible performance degradation when conducted randomly. In contrast, masking 20 important patches identified by ECT-BoFM caused a substantial performance decline in both models, which highlighted that the predictive patches provided by ECT-BoFM were naturally crucial to the model’s performance and were not only important to specific models but also fundamentally important.

As mentioned above, we used CAM exclusively based on its visual similarity to Grad-CAM in our experiments (Figure 3). To further support this choice, we note that dermoscopic images typically exhibit localized diagnostic features and relatively low spatial complexity, which naturally reduces the functional distinction between CAM- and gradient-based attribution methods. Previous studies in this domain, including Lucieri et al. [8] and Jiang et al. [6], have also observed that CAM and Grad-CAM tend to highlight nearly identical regions in skin lesion classification tasks. Hence, we adopted CAM throughout our framework to ensure simplicity and compatibility with our interpretable feature pathways, without compromising the quality or faithfulness of visual explanations.

Results of Training with Heatmap-Identified Key Regions on Diagnostic Performance.

We leveraged the interpretability of ECT-BoFM to analyze the impact of global and local information captured by models in skin lesion diagnosis. Using a trained ECT-BoFM, we identified the most predictive 33 × 33 image patches, retaining only these small regions while masking out the rest of the image. These modified images were then used to train both ViT-B-16 and ECT-BoFM. We conducted experiments where only 4, 8, 12, 16, or 20 of the most predictive image patches were retained, with the results shown in Figure 5. In Figure 5, we also provide two reference baselines: (i) the performance using original, unmasked images and (ii) the performance when retaining the 20 most predictive image patches identified by ResNet50+CAM from all possible 33 × 33 windows with a stride of eight while masking out the remaining regions.

As shown in Figure 5, whether training with ECT-BoFM or ViT-B-16, the diagnosis performance using the small number of patches identified by ECT-BoFM was significantly higher than that obtained using ResNet50+CAM. This indicated that the predictive patches identified by ECT-BoFM were genuinely critical regions, which is consistent with our earlier conclusions. Compared to the models trained on the original images, those using ECT-BoFM-identified patches achieved comparable or even superior accuracy, as well as similar F1 scores, but the AUC was better with models trained on the original images. These experimental results showed that for skin lesion diagnosis, a very small number of localized regions provide sufficient classification cues in most cases. While full input image information mainly enhanced the ability to differentiate some minority categories more comprehensively, the AUC improvement remained limited to about 2–3%. This phenomenon also applied to ViT models, which were designed to capture long-range dependencies between different local regions. Despite the extensive parameters and computations aimed at capturing such global information, the results in Figure 5 suggested that this design, while theoretically promising, might be less effective or even wasteful in practical skin lesion diagnosis. Training a ViT model with images where most regions were masked still achieved high diagnostic performance, which raises questions about whether ViT’s long-range dependency capture truly benefited skin lesion tasks. Future skin lesion diagnosis algorithms might be more effective by focusing on extracting local features and supplementing them with appropriate global information to improve the distinction of minority classes, thereby making better use of model parameters and computational resources.

4.5. Rethinking the Necessity of Global Modeling in Skin Lesion Diagnosis

In recent years, deep learning models with global modeling capabilities—such as Transformers with self-attention mechanisms—have shown impressive performance across various vision tasks. These models are based on the assumption that modeling long-range dependencies and holistic context is essential for high-level understanding, including in medical imaging. However, our findings suggest that this assumption may not universally hold for all medical imaging scenarios, particularly for skin lesion classification tasks.

As demonstrated in our experiments, local modeling alone—guided by interpretable and clinically relevant heatmaps—can achieve strong classification performance. For instance, Table 1 through Table 4 show that ECT-BoFM consistently outperforms or matches state-of-the-art methods on both ISIC2018 and ISIC2019 across multiple backbones, despite operating with a local receptive field and a limited global context. Figure 4 further shows that the removal of high-response regions, as highlighted by the interpretable channels, leads to substantial performance degradation, indicating that those localized regions carry essential diagnostic information. Figure 5 also reveals that even when only a small number of high-response patches are retained for training, classification performance remains close to that of the full model, further supporting the sufficiency of local features.

Architecturally, our use of BagNet-33 limits the effective receptive field to 33 × 33 pixels, minimizing the aggregation of non-local information. The model explicitly integrates edge and color–texture priors, which encourage it to focus on fine-grained features such as border irregularities and pigmentation details. The combination of these elements results in a lightweight, interpretable framework that is surprisingly effective without deep semantic fusion.

Nonetheless, we acknowledge that certain global visual patterns do carry diagnostic value in clinical practice. For example, dermatologists often assess asymmetry, lesion size relative to the surrounding area, or overall texture heterogeneity—features that cannot be captured by small receptive fields alone. Therefore, we do not claim that global information is irrelevant. Rather, our findings suggest that existing generic global modeling mechanisms, such as standard self-attention layers, may not be well aligned with the domain-specific needs of skin lesion diagnosis.

At the same time, although our method achieves state-of-the-art or competitive performance across multiple settings, there remains a gap to perfection. This opens a promising direction: rethinking global modeling not by discarding it, but by designing new task-specific global mechanisms that complement strong local representations with clinically meaningful global cues. Our findings offer a step in this direction by highlighting when and how local features dominate in this task and by raising the broader question of what kinds of global structures are truly needed in dermatological AI systems.

5. Conclusions

This paper introduces the novel ECT-BoFM for skin lesion diagnosis, which is required to perform two kinds of tasks: classification and reconstruction. The reconstruction task involves the model sampling new features from distributions defined by its deep features to complete the reconstruction of traditional edge and color–texture features. This approach allows the model to learn edge and color–texture information from traditional features effectively, without being overly constrained by them, thereby preventing negative effects on diagnostic performance. The proposed model obtained high diagnostic performance that clearly surpassed recent state-of-the-art methods on ISIC 2018 and ISIC 2019. By restricting the receptive field of deep features, this model generated precise and detailed heatmaps that reflect prediction basis from the edge, color–texture, and overall perspectives, potentially enhancing the reliability of AI-aided diagnosis. Analysis on ECT-BoFM’s interpretable heatmaps showed that masking a small number of the most predictive patches results in significant drops in diagnostic performance across models. Training with these patches alone achieved performance similar to using full images and suggested that, for other models designed to capture long-range dependencies and global information, the benefits from such capabilities are quite limited. We believe that the contribution of our work is not limited to proposing a new framework; it also provides valuable insights that can inspire the design of future dermatological lesion diagnostic models.

Author Contributions

Conceptualization, D.L.; Methodology, D.L.; Software, D.L.; Validation, D.L.; Writing—original draft, D.L.; Writing—review & editing, K.S.; Visualization, D.L.; Supervision, K.S.; Project administration, K.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by JST GTIE GAP Fund Program, Grant Number GTIE2024_EX10, Japan.

Institutional Review Board Statement

Ethical review and approval were waived for this study due to the use of publicly available and de-identified datasets (ISIC 2018 and ISIC 2019).

Informed Consent Statement

Patient consent was waived due to the use of publicly available and fully anonymized dermatological image datasets (ISIC 2018 and ISIC 2019).

Data Availability Statement

The experimental results in this study are based on the publicly available ISIC 2018 and ISIC 2019 datasets, which can be accessed at https://challenge.isic-archive.com/data/#2018 (accessed on 22 July 2025) and https://challenge.isic-archive.com/data/#2019 (accessed on 22 July 2025), respectively. The source code and implementation details are publicly available at https://github.com/Dichao-Liu/ECT-BoFM (accessed on 22 July 2025) to support reproducibility.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Zhang, Y.; Chen, J.; Wang, K.; Xie, F. ECL: Class-Enhancement Contrastive Learning for Long-Tailed Skin Lesion Classification. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Vancouver, BC, Canada, 8–12 October 2023; pp. 244–254. [Google Scholar]
He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
Zhou, B.; Khosla, A.; Lapedriza, A.; Oliva, A.; Torralba, A. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Cision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2921–2929. [Google Scholar]
Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 618–626. [Google Scholar]
Wang, S.; Yin, Y.; Wang, D.; Wang, Y.; Jin, Y. Interpretability-based multimodal convolutional neural networks for skin lesion diagnosis. IEEE Trans. Cybern. 2021, 52, 12623–12637. [Google Scholar] [CrossRef] [PubMed]
Jiang, S.; Li, H.; Jin, Z. A visually interpretable deep learning framework for histopathological image-based skin cancer diagnosis. IEEE J. Biomed. Health Informatics 2021, 25, 1483–1494. [Google Scholar] [CrossRef] [PubMed]
Gamage, L.; Isuranga, U.; Meedeniya, D.; De Silva, S.; Yogarajah, P. Melanoma skin cancer identification with explainability utilizing mask guided technique. Electronics 2024, 13, 680. [Google Scholar] [CrossRef]
Lucieri, A.; Bajwa, M.N.; Braun, S.A.; Malik, M.I.; Dengel, A.; Ahmed, S. On interpretability of deep learning based skin lesion classifiers using concept activation vectors. In Proceedings of the 2020 International Joint Conference on Neural Networks (IJCNN), Glasgow, UK, 19–24 July 2020; pp. 1–10. [Google Scholar]
Xia, P.; Niu, H.; Li, Z.; Li, B. On the receptive field misalignment in CAM-based visual explanations. Pattern Recognit. Lett. 2021, 152, 275–282. [Google Scholar] [CrossRef]
Kim, B.J.; Koo, G.; Choi, H.; Kim, S.W. Extending class activation mapping using Gaussian receptive field. Comput. Vis. Image Underst. 2023, 231, 103663. [Google Scholar] [CrossRef]
Zia, T.; Wahab, A.; Windridge, D.; Tirunagari, S.; Bhatti, N.B. Visual attribution using adversarial latent transformations. Comput. Biol. Med. 2023, 166, 107521. [Google Scholar] [CrossRef] [PubMed]
Araujo, A.; Norris, W.; Sim, J. Computing receptive fields of convolutional neural networks. Distill 2019, 4, e21. [Google Scholar] [CrossRef]
Maglogiannis, I.; Doukas, C.N. Overview of advanced computer vision systems for skin lesions characterization. IEEE Trans. Inf. Technol. Biomed. 2009, 13, 721–733. [Google Scholar] [CrossRef] [PubMed]
Li, H.; Pan, Y.; Zhao, J.; Zhang, L. Skin disease diagnosis with deep learning: A review. Neurocomputing 2021, 464, 364–393. [Google Scholar] [CrossRef]
Sobel, I. History and definition of the sobel operator. Retrieved World Wide Web 2014, 1505. [Google Scholar]
Monisha, M.; Suresh, A.; Rashmi, M. Artificial intelligence based skin classification using GMM. J. Med. Syst. 2019, 43, 1–8. [Google Scholar] [CrossRef] [PubMed]
Álvarez-Salafranca, M.; Ara, M.; Zaballos, P. Dermoscopy in basal cell carcinoma: An updated review. Actas Dermo-Sifiliográficas 2021, 112, 330–338. [Google Scholar] [CrossRef] [PubMed]
Brendel, W.; Bethge, M. Approximating CNNs with Bag-of-local-Features models works surprisingly well on ImageNet. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
Tschandl, P.; Rosendahl, C.; Kittler, H. The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Sci. Data 2018, 5, 1–9. [Google Scholar] [CrossRef] [PubMed]
Combalia, M.; Codella, N.C.; Rotemberg, V.; Helba, B.; Vilaplana, V.; Reiter, O.; Carrera, C.; Barreiro, A.; Halpern, A.C.; Puig, S.; et al. Bcn20000: Dermoscopic lesions in the wild. arXiv 2019, arXiv:1908.02288. [Google Scholar]
Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. In Proceedings of the International Conference on Learning Representations, Vienna, Austria, 3–7 May 2021. [Google Scholar]
Liu, W.; Wang, Z.; Liu, X.; Zeng, N.; Liu, Y.; Alsaadi, F.E. A survey of deep neural network architectures and their applications. Neurocomputing 2017, 234, 11–26. [Google Scholar] [CrossRef]
Lu, Y.; Yi, S.; Zeng, N.; Liu, Y.; Zhang, Y. Identification of rice diseases using deep convolutional neural networks. Neurocomputing 2017, 267, 378–384. [Google Scholar] [CrossRef]
Cula, O.G.; Dana, K.J.; Murphy, F.P.; Rao, B.K. Bidirectional imaging and modeling of skin texture. IEEE Trans. Biomed. Eng. 2004, 51, 2148–2159. [Google Scholar] [CrossRef] [PubMed]
Huo, X.; Sun, G.; Tian, S.; Wang, Y.; Yu, L.; Long, J.; Zhang, W.; Li, A. HiFuse: Hierarchical multi-scale feature fusion network for medical image classification. Biomed. Signal Process. Control 2024, 87, 105534. [Google Scholar] [CrossRef]
Manzari, O.N.; Ahmadabadi, H.; Kashiani, H.; Shokouhi, S.B.; Ayatollahi, A. MedViT: A robust vision transformer for generalized medical image classification. Comput. Biol. Med. 2023, 157, 106791. [Google Scholar] [CrossRef] [PubMed]
Yue, Y.; Li, Z. Medmamba: Vision mamba for medical image classification. arXiv 2024, arXiv:2403.03849. [Google Scholar]
Gao, Y.; Gao, Z.; Gao, X.; Liu, Y.; Wang, B.; Zhuang, X. Evidential Concept Embedding Models: Towards Reliable Concept Explanations for Skin Disease Diagnosis. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Marrakesh, Morocco, 6–10 October 2024; pp. 308–317. [Google Scholar]
Gessert, N.; Sentker, T.; Madesta, F.; Schmitz, R.; Kniep, H.; Baltruschat, I.; Werner, R.; Schlaefer, A. Skin Lesion Classification Using CNNs With Patch-Based Attention and Diagnosis-Guided Loss Weighting. IEEE Trans. Biomed. Eng. 2020, 67, 495–503. [Google Scholar] [CrossRef] [PubMed]
Nunnari, F.; Kadir, M.A.; Sonntag, D. On the overlap between grad-cam saliency maps and explainable visual features in skin cancer images. In Proceedings of the International Cross-Domain Conference for Machine Learning and Knowledge Extraction, Virtual, 17–20 August 2021; pp. 241–253. [Google Scholar]
Lundberg, S.M.; Lee, S.I. A unified approach to interpreting model predictions. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar]
Ojala, T.; Pietikäinen, M.; Harwood, D. A comparative study of texture measures with classification based on featured distributions. Pattern Recognit. 1996, 29, 51–59. [Google Scholar] [CrossRef]
Kingma, D.P.; Welling, M. Auto-encoding variational bayes. In Proceedings of the International Conference on Learning Representations, Banff, AB, Canada, 14–16 April 2014. [Google Scholar]
Foret, P.; Kleiner, A.; Mobahi, H.; Neyshabur, B. Sharpness-aware minimization for efficiently improving generalization. In Proceedings of the International Conference on Learning Representations, Vienna, Austria, 3–7 May 2021; pp. 1–19. [Google Scholar]
Zhu, J.; Wang, Z.; Chen, J.; Chen, Y.P.P.; Jiang, Y.G. Balanced contrastive learning for long-tailed visual recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 6908–6917. [Google Scholar]
Li, T.; Cao, P.; Yuan, Y.; Fan, L.; Yang, Y.; Feris, R.S.; Indyk, P.; Katabi, D. Targeted supervised contrastive learning for long-tailed recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 6918–6928. [Google Scholar]
Li, C.; Li, M.; Peng, C.; Lovell, B.C. Dynamic Curriculum Learning via In-Domain Uncertainty for Medical Image Classification. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Vancouver, BC, Canada, 8–12 October 2023; pp. 747–757. [Google Scholar]
Tolstikhin, I.O.; Houlsby, N.; Kolesnikov, A.; Beyer, L.; Zhai, X.; Unterthiner, T.; Yung, J.; Steiner, A.; Keysers, D.; Uszkoreit, J.; et al. Mlp-mixer: An all-mlp architecture for vision. Adv. Neural Inf. Process. Syst. 2021, 34, 24261–24272. [Google Scholar]
Liu, X.; Peng, H.; Zheng, N.; Yang, Y.; Hu, H.; Yuan, Y. Efficientvit: Memory efficient vision transformer with cascaded group attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 14420–14430. [Google Scholar]
Araújo, D.J.; Verdelho, M.R.; Bissoto, A.; Nascimento, J.C.; Santiago, C.; Barata, C. Key Patches Are All You Need: A Multiple Instance Learning Framework For Robust Medical Diagnosis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–18 June 2024; pp. 5231–5240. [Google Scholar]
Hacohen, G.; Weinshall, D. On the power of curriculum learning in training deep networks. In Proceedings of the International Conference on Machine Learning, Zhuhai, China, 22–24 February 2019; pp. 2535–2544. [Google Scholar]
Kong, Y.; Liu, L.; Wang, J.; Tao, D. Adaptive curriculum learning. In Proceedings of the IEEE International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 5067–5076. [Google Scholar]
Yao, P.; Shen, S.; Xu, M.; Liu, P.; Zhang, F.; Xing, J.; Shao, P.; Kaffenberger, B.; Xu, R.X. Single model deep learning on imbalanced small datasets for skin lesion classification. IEEE Trans. Med. Imaging 2021, 41, 1242–1254. [Google Scholar] [CrossRef] [PubMed]
Wang, P.; Han, K.; Wei, X.S.; Zhang, L.; Wang, L. Contrastive learning based hybrid networks for long-tailed image classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 943–952. [Google Scholar]
Marrakchi, Y.; Makansi, O.; Brox, T. Fighting class imbalance with contrastive learning. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Strasbourg, France, 27 September–1 October 2021; pp. 466–476. [Google Scholar]

Figure 2. The overview of our edge- and color–texture-aware bag-of-local-features model.

Figure 3. Comparison of visualization results.

Figure 4. Decrease in classification performance in ECT-BoFM and ViT-B-16 if increasingly more patches are masked according to the heatmaps of ECT-BoFM, ResNet50+CAM, and random masking.

Figure 5. Classification performance on models trained with images where only a small portion of the patches is retained and the rest is masked.

Table 1. Comparison results on ISIC 2018 (train/val/test = 1:1:1).

	Acc	F1	AUC
ViT-B-16 (ICLR 2021 [21])	83.87	70.34	95.89
Vit-B-32 (ICLR 2021 [21])	85.70	74.57	96.67
MLP Mixer (NeurIPS 2021 [38])	80.25	61.06	93.76
EfficientViT B3 (ICCV 2023 [39])	75.70	47.21	92.43
MedViT (Elsevier CBM 2023 [26])	80.16	64.50	93.22
HiFuse (Elsevier BSPC 2024 [25])	82.77	70.71	95.84
MIL (CVPRW 2024 [40])	83.28	73.27	95.30
CEM (MICCAI 2024 [28])	86.07	75.15	96.72
MedMamba (arXiv 2024 [27])	77.34	53.81	94.72
Ours	87.65	75.95	97.37

Table 2. Comparison results on ISIC 2018 (test on the ISIC 2018 Task 3 validation set).

	Acc	F1	AUC
FCL (ICML 2019 [41])	82.85	39.87	72.14
ViT-B-16 (ICLR 2021 [21])	86.53	67.16	96.87
AdaptCL (ICCV 2021 [42])	83.17	37.64	70.71
DCLU (MICCAI 2023 [37])	85.10	40.58	73.57
MedViT (Elsevier CBM 2023 [26])	80.31	53.99	93.89
HiFuse (Elsevier BSPC 2024 [25])	83.94	73.37	97.49
MIL (CVPRW 2024 [40])	84.46	76.00	97.40
CEM (MICCAI 2024 [28])	86.53	68.60	97.18
MedMamba (arXiv 2024 [27])	79.27	56.59	92.64
Ours	90.16	80.23	98.72

Table 3. Comparison results on ISIC 2018 (train/val/test = 3:1:1).

	Acc	F1	AUC
ViT-B-16 (ICLR 2021 [21])	87.56	80.40	97.65
MWNL (IEEE TMI 2021 [43])	84.90	74.92	96.79
Hyb-SC (CVPR 2021 [44])	86.30	74.34	96.33
SCL (MICCAI 2021 [45])	86.13	74.27	96.56
BCL (CVPR 2022 [35])	84.92	71.57	95.61
TSC (CVPR 2022 [36])	85.94	74.94	95.83
ECL (MICCAI 2023 [1])	87.20	76.76	96.55
MedViT (Elsevier CBM 2023 [26])	82.78	67.75	96.04
HiFuse (Elsevier BSPC 2024 [25])	84.32	72.61	96.64
MIL (CVPRW 2024 [40])	84.82	72.52	96.90
CEM (MICCAI 2024 [28])	88.62	80.72	97.88
MedMamba (arXiv 2024 [27])	78.58	60.04	93.51
Ours	91.01	83.25	98.33

Table 4. Comparison results on ISIC 2019 (train/val/test = 3:1:1).

	Acc	F1	AUC
ViT-B-16 [21]	85.12	75.98	97.01
MWNL (IEEE TMI 2021 [43])	84.10	75.08	96.61
Hyb-SC (CVPR 2021 [44])	84.69	73.27	96.67
SCL (MICCAI [45])	84.60	75.07	96.21
BCL (CVPR 2022 [35])	83.47	73.50	95.95
TSC (CVPR 2022 [36])	84.75	75.13	95.84
ECL (MICCAI 2023 [1])	86.11	79.46	96.78
MedViT (Elsevier CBM 2023 [26])	79.89	71.19	95.51
HiFuse (Elsevier BSPC 2024 [25])	83.13	75.93	96.80
MIL (CVPRW 2024 [40])	83.07	74.44	96.82
CEM (MICCAI 2024 [28])	84.93	77.94	97.10
MedMamba (arXiv 2024 [27])	77.32	67.12	94.34
Ours	86.23	79.34	97.58

Table 5. Ablation studies on ISIC 2018 (train/val/test = 1:1:1).

	Only $B$	$B$ + CTI	$B$ + EI	$B$ + EI + CTI	$B$ + EI + CTI + TF	$B$ + EI + CTI + MF	$B$ + EI + CTI (w/o NRR)	$B$ + EI + CTI + SAM	$B$ + EI + CTI + PSAM
Acc	85.31	86.34	86.46	87.20	86.50	86.15	85.13	87.23	87.65
F1	72.63	73.87	73.17	75.62	74.84	73.57	71.67	75.43	75.95
AUC	96.69	97.06	96.47	97.32	96.90	96.54	97.09	97.30	97.37

‘w/o NRR’ indicates that the decoders were trained by smooth L1 loss, and the reconstruction results were generated directly from

X_{CTI}

and

X_{EI}

rather than samples from the distributions they define. ‘TF’ and ‘MF’ denote TransFusion and MlpFusion, respectively.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Liu, D.; Suzuki, K. Edge- and Color–Texture-Aware Bag-of-Local-Features Model for Accurate and Interpretable Skin Lesion Diagnosis. Diagnostics 2025, 15, 1883. https://doi.org/10.3390/diagnostics15151883

AMA Style

Liu D, Suzuki K. Edge- and Color–Texture-Aware Bag-of-Local-Features Model for Accurate and Interpretable Skin Lesion Diagnosis. Diagnostics. 2025; 15(15):1883. https://doi.org/10.3390/diagnostics15151883

Chicago/Turabian Style

Liu, Dichao, and Kenji Suzuki. 2025. "Edge- and Color–Texture-Aware Bag-of-Local-Features Model for Accurate and Interpretable Skin Lesion Diagnosis" Diagnostics 15, no. 15: 1883. https://doi.org/10.3390/diagnostics15151883

APA Style

Liu, D., & Suzuki, K. (2025). Edge- and Color–Texture-Aware Bag-of-Local-Features Model for Accurate and Interpretable Skin Lesion Diagnosis. Diagnostics, 15(15), 1883. https://doi.org/10.3390/diagnostics15151883

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Edge- and Color–Texture-Aware Bag-of-Local-Features Model for Accurate and Interpretable Skin Lesion Diagnosis

Abstract

1. Introduction

2. Related Work

2.1. Innovations in Network Architecture

2.2. Innovations in Training Strategies

2.3. The Position of Our Work

3. Method

4. Experiments

4.1. Datasets and Implementation Details

4.2. Comparison with State-of-the-Art Methods

4.3. Ablation Studies

4.4. Analysis of Interpretable Heatmaps

4.5. Rethinking the Necessity of Global Modeling in Skin Lesion Diagnosis

5. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI