Article

Evaluating Skin Tone Fairness in Convolutional Neural Networks for the Classification of Diabetic Foot Ulcers

1 ISEP, Polytechnic of Porto, 4249-015 Porto, Portugal
2 CIETI, ISEP, Polytechnic of Porto, 4249-015 Porto, Portugal
3 INESC TEC—Institute for Systems and Computer Engineering Technology and Science, 4200-465 Porto, Portugal
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(15), 8321; https://doi.org/10.3390/app15158321
Submission received: 25 June 2025 / Revised: 21 July 2025 / Accepted: 24 July 2025 / Published: 26 July 2025
(This article belongs to the Special Issue Advances and Applications of Machine Learning for Bioinformatics)

Abstract

The present paper investigates the application of convolutional neural networks (CNNs) for the classification of diabetic foot ulcers, using VGG16, VGG19 and MobileNetV2 architectures. The primary objective is to develop and compare deep learning models capable of accurately identifying ulcerated regions in clinical images of diabetic feet, thereby aiding in the prevention and effective treatment of foot ulcers. A comprehensive study was conducted using an annotated dataset of medical images, evaluating the performance of the models in terms of accuracy, precision, recall and F1-score. VGG19 achieved the highest accuracy at 97%, demonstrating superior ability to focus activations on relevant lesion areas in complex images. MobileNetV2, while slightly less accurate, excelled in computational efficiency, making it a suitable choice for mobile devices and environments with hardware constraints. The study also highlights the limitations of each architecture, such as increased risk of overfitting in deeper models and the lower capability of MobileNetV2 to capture fine clinical details. These findings suggest that CNNs hold significant potential in computer-aided clinical diagnosis, particularly in the early and precise detection of diabetic foot ulcers, where timely intervention is crucial to prevent amputations.

1. Introduction

1.1. Background

Diabetes mellitus (DM) is a chronic, non-communicable condition for which there is no known cure. This disease is characterized by high levels of glucose in the blood due to a lack of insulin, a hormone produced by the pancreas. Insulin’s main function is to allow sugar from food to enter the cells, where it is transformed into energy. In the case of diabetes, this entry is compromised, which causes glucose to accumulate in the bloodstream, leading to hyperglycaemia, which is the main symptom of the disease. Although some diabetic patients show symptoms, many remain asymptomatic, unaware of the presence of the disease and, consequently, maintain the same lifestyle habits [1].
Diabetes can have various causes, and its different types are classified according to how they arise [2]. Type 1 DM is characterized by an autoimmune process in which the beta cells in the pancreas, which are responsible for producing insulin, are destroyed. This results in a partial or total deficiency in insulin production and a consequent build-up of glucose in the blood. This leads to metabolic problems that are clinically manifested through symptoms such as frequent urination, excessive thirst, ketoacidosis and involuntary weight loss. It can appear at any age, from birth to young adulthood, and is one of the diseases with the highest incidence in young people [3]. Type 2 DM, on the other hand, is a multifactorial condition that results from deficiencies in insulin secretion and action, both influenced by genetic factors, and generally affects individuals over the age of 40. It is the most common form of the disease, accounting for between 90% and 95% of patients. In this type of diabetes, although the pancreas produces insulin normally, the body develops resistance to this hormone, which prevents glucose from entering the cells properly to produce energy. As a result, glucose accumulates in the bloodstream and can, in some cases, reach levels up to ten times higher than normal, while the cells remain short of energy. Most patients with type 2 DM are overweight or obese. Diagnostic tests usually show high levels of triglycerides, an increase in low-density lipoprotein (LDL) cholesterol, known as ‘bad cholesterol’, which contributes to plaque build-up in the arteries, and a reduction in high-density lipoprotein (HDL) cholesterol, the ‘good cholesterol’, which helps remove excess cholesterol from the arteries [4].
DM lasts throughout life and can result in multiple complications over time. If not properly controlled, the disease can cause severe symptoms such as headaches, restlessness, irritability, paleness, sweating, tachycardia, mental confusion, fainting, convulsions and even coma. Lack of adequate treatment significantly increases the risk of long-term complications, such as the development of blood clotting problems, making it difficult for ulcers to heal and potentially resulting in amputations. In addition, poor blood circulation in the eyes can lead to complications such as diabetic retinopathy, glaucoma and cataracts, which can result in partial or total loss of vision. Kidney problems are also common, including kidney failure and diabetic nephropathy, where malfunctioning kidney blood vessels can lead to failures in filtering impurities and retaining nutrients. In terms of neurological manifestations, these can include sensory loss and limb ulcers, while infections are frequent due to reduced immunity, making diabetic patients more susceptible to skin, urinary and mouth infections [5].
Diabetic foot is one of the most serious complications of DM, both because of its direct impact on the patient’s quality of life and because of the high social and economic costs associated with it [6]. It results from a combination of peripheral neuropathy, peripheral vascular disease and biomechanical changes that lead to complications such as ulcers, infections and, in advanced cases, amputations. In addition, factors such as persistent hyperglycaemia, structural deformities, inappropriate footwear, repeated trauma, calluses and dry skin significantly increase the risk of ulceration [7]. Due to the loss of sensation and poor tissue perfusion, patients often do not detect these lesions early on, which allows them to progress to severe infections and, in advanced cases, can culminate in the amputation of the affected limb [8].
Peripheral neuropathy affects the functioning of the peripheral nerves, compromising sensory, motor and autonomic fibers. Sensory dysfunction results in a loss of protective sensitivity, making it difficult to perceive stimuli such as pain, pressure, temperature or injuries [1]. As a result, it becomes difficult to detect damage to the feet at an early stage, allowing small injuries to develop silently into more serious conditions [9]. At the same time, motor neuropathy causes muscle imbalances and structural changes, such as hallux rigidus and hammer toes, which increase pressure points in the foot. Autonomic neuropathy compromises the regulation of sweating, leading to dry skin, cracks and greater vulnerability to infections [9].
In addition, peripheral vascular disease is attributed to increased oxidative stress and atherosclerosis, phenomena that are exacerbated by hyperglycaemia. The accumulation of glucose in the blood results in the weakening, narrowing and obstruction of the arteries. Furthermore, oxidative stress prolongs the inflammatory state in the microcirculation and reduces the elasticity of the capillaries, while atherosclerosis affects the blood vessels associated with the femoral artery and the knee. In this way, blood flow to the lower limbs is compromised, reducing tissue oxygenation, delaying wound healing and significantly increasing the risk of ischaemia. These conditions are further aggravated by biomechanical changes, which can alter the patient’s gait, increasing pressure on certain areas of the foot and raising the risk of ulceration [10]. A summary of the etiology of diabetic foot ulcers is illustrated in Figure 1.
According to the World Health Organization, one of the main indicators of diabetic foot is the appearance of a foot ulcer, usually below the malleolus, associated with the presence of autonomic neuropathy, ischemic disorders and local infections. It is estimated that more than 10% of people with DM are at risk of developing foot ulcers throughout their lives. Furthermore, this predisposition increases the likelihood of injuries caused by peripheral neuropathy in 80 to 90% of cases, as well as peripheral vascular disease and deformities [11].
Ulcers can be classified by their aetiology as neuropathic or ischaemic. Neuropathic ulcers result from peripheral nerve damage, often in diabetic patients, and tend to occur on weight-bearing areas such as the plantar surface or metatarsal heads. They usually have callused borders, a painless presentation, and warm, well-perfused surrounding tissue due to preserved blood flow but lost sensation. Ischaemic ulcers, on the other hand, arise because of insufficient circulation and are often located on the tips of the toes or the lateral foot. They present punched-out margins, minimal bleeding, and surrounding pale or cool skin. They are often painful due to intact nerves [12]. The precise location of the wound and the existence of pain are relevant for diagnosis. Assessing the risk of ulceration is an essential step in avoiding serious complications. This process should be an integral part of the regular physical examination of diabetic patients and should be carried out thoroughly. To identify loss of sensation associated with neuropathy, the Semmes–Weinstein 10 g monofilament is used [13], where the patient’s foot is touched in order to assess its response. This can be complemented with vibration sensitivity tests (128 Hz tuning fork), tactile sensitivity tests and patellar and Achilles reflexes.
As far as ulcer prevention is concerned, the approach should be multidisciplinary, centered on user education and regular foot monitoring. All diabetic patients should undergo an annual foot check-up, which assesses factors such as the use of appropriate footwear, the presence of calluses, the condition of the nails and skin of the feet, and the existence of bone deformities. Wearing appropriate footwear is one of the most important measures to prevent diabetic foot injuries: shoes should be comfortable, well-fitting and made of flexible materials such as leather. Health education plays a fundamental role in prevention, as it promotes the adoption of preventive behaviours. Users and their families should be aware of the daily care required, such as regular foot inspections, proper hygiene, moisturising the skin and cutting nails correctly [14].
Diabetic foot treatment covers the prevention and treatment of ulcers and is divided into care for non-ulcerated feet and treatment of ulcerated lesions. For non-ulcerated feet, treatment focuses on prevention, which includes removing calluses, treating nails and correcting deformities. When ulcers occur, the approach varies: plantar pressure relief and limb immobilisation are crucial, as is infection control, which may require surgical or non-surgical action.

1.2. Problem

Around 422 million people worldwide have diabetes, most of them living in low- and middle-income countries. Both the number of cases and the prevalence of the disease have been steadily increasing over the last few decades [11].
Among the various complications associated with the disease, foot problems stand out as one of the main causes of hospitalisation, especially in developed countries. It is estimated that up to 30% of people with diabetes develop foot ulcers in their lifetime, and approximately 85% of lower limb amputations are directly associated with these injuries. Half of elderly patients with type 2 diabetes have risk factors for foot problems, underlining the importance of regular screening through careful clinical examinations. It has been shown that up to 50% of amputations and foot ulcers in diabetes can be prevented through frequent screening [15], effective identification and education [6,7].
Diabetic foot ulcers (DFUs) have a five-year mortality rate of around 30%, surpassing serious diseases such as breast cancer. These figures emphasise that DFUs are not just a localized complication but a significant risk factor for the health and survival of diabetic patients. When they evolve into amputations, especially major amputations, mortality can be as high as 70%, a very alarming indicator of the impact these complications can have on patients’ lives. In fact, when observing the five-year mortality percentage, minor amputation, chronic lower limb-threatening ischaemia and major amputation are all above the average of cancer-related diseases and are only surpassed by lung cancer [16].

1.3. Objectives

Following on from the problem identified, there is a need to develop an effective method for diagnosing and classifying DFU. Existing methods have notable limitations, making it crucial to refine the approach in order to improve diagnostic accuracy, stratify risks and optimise treatment planning. These improvements aim to reduce complications and improve the care offered to diabetic patients with foot ulcers.
In this sense, the main objective of this study is to compare the use of computer vision and machine learning algorithms with traditional methods of assessment by human experts in patients with DFU. To this end, the aim is to (1) evaluate the effectiveness of machine learning architectures in classifying DFU based on commonly used performance metrics; (2) check how machine learning methods perform in the task of classifying feet with and without ulcers; and (3) identify the technique that offers the best results in terms of accuracy and reliability in identifying DFU.

1.4. Related Studies

The classification of DFUs using automated methods is a critical area of research in medical imaging. These tools have the potential to improve early diagnosis and reduce severe complications like infections or amputations, especially in diabetic patients. While various machine learning approaches, such as attention-based models, Vision Transformers (ViTs) and custom-designed architectures, have been explored, the use of such systems in clinical settings requires robust tools [17]. Standard Convolutional Neural Networks (CNNs) like ResNet and VGG are widely used models that, in addition to their proven performance, have associated explainability algorithms that support trust in the models’ behaviour. For the current system, CNNs were selected for DFU classification also due to their ability to benefit from efficient transfer learning techniques, interpretability and computational feasibility. These advantages are particularly relevant when addressing challenges like skin colour variability, which can alter the visual characteristics of ulcers across diverse patient populations, impacting model generalisability. In fact, according to a review by Das et al. [18], ‘advanced deep learning approaches have proven to be more promising than ML approaches. The CNN-based solutions (…) have dominated the problem domain’. This section reviews related work on CNN-based DFU classification, highlighting the strengths of fine-tuned models, while also considering alternative approaches and their limitations. Additionally, efforts to incorporate skin colour diversity in medical imaging are examined to ensure equitable diagnostic outcomes, with the purpose of setting the foundation for the proposed CNN-based framework tailored for robust and inclusive DFU classification.
When surveying recent and highly cited literature, there are promising early approaches that explore beyond hand-crafted feature-based approaches. However, the lack of publicly available supporting datasets prevents comprehensive comparisons. For example, in the work by Zhao et al. [19], a custom bilinear CNN architecture is used to explore depth and granulation classification, which are crucial indicators of healing. A new dataset was collected without specific ambient control or class balancing. Before being fed into the network, the images underwent preprocessing steps like wound area extraction, sharpening, resizing, and augmentation. Using holdout validation, the proposed approach achieved an accuracy of 84.6%. Also, in Wang et al. [20], a few-shot DFU classification method is proposed. By augmenting a limited dataset of DFU images, specifically collected, and fine-tuning a pre-trained ResNet50 model, the researchers achieved a high average accuracy of 98.67%. In Alzubaidi et al. [21], the authors introduce a new dataset of 754 foot images categorized as healthy or containing DFUs. A novel network, named DFU_QUTNet, focused on increasing network width to improve gradient propagation and combine features effectively. The feature-based approach relied on Support Vector Machine (SVM) and K-Nearest Neighbors (KNN) classifiers that achieved an F1-score of 94.5%, outperforming other CNNs like GoogleNet, VGG16 and AlexNet on this task.
The public release of the DFUC2021 dataset [22] established a common benchmarking reference and increased the quality of existing research. In Yap et al. [22], a DFU classification study has as its main objective the identification of infection and ischaemia. The authors explored several pre-trained deep learning models, such as VGG16, ResNet101 and EfficientNetB0. The best performance was obtained with EfficientNetB0, which, after adjustments and data augmentation, achieved a precision of 0.57, recall of 0.62 and F1-score of 0.55, excelling in the classification of ischaemia. Data augmentation techniques, such as rotation, Gaussian noise and contrast adjustments, were applied to balance the less represented classes, which improved the model’s generalisation capacity, especially for the ischaemia class. To interpret the predictions, the authors used Gradient-Weighted Class Activation Mapping (Grad-CAM) to visualise the areas focused on by the model during prediction, showing that in some cases it focused incorrectly outside the ulcer regions, leading to erroneous classifications. Using the same dataset, Liu et al. [23] explored the EfficientNet model, with the system achieving an accuracy of 99% for ischaemia and 98% for infection. Faster classification times are also reported. Explainability methods are not mentioned, and performance indicators based on skin tone are absent.
To categorize various wound types (diabetic, pressure, surgical and venous ulcers), a multimodal classifier is proposed by Anisuzzaman et al. [24]. Using the Medetec and the AZH datasets, the researchers designed three subsets with the help of wound specialists, each containing both image and location information. The multi-modal network itself was developed by concatenating the outputs of image-based and location-based classifiers, along with other modifications. The model achieved high accuracy, ranging from 82.48% to 100% in mixed-class classifications.
Alqahtani et al. (2023) [25] developed a system for classifying DFUs into normal or abnormal based on DL, using the Adaptive Weighted Sub-Gradient Convolutional Neural Network (AWSg CNN). The dataset used was obtained from the Kaggle platform and was subjected to pre-processing, which included removing missing or inconsistent entries and dividing them into 80 per cent for training and 20 per cent for testing. Two key components are used for classification: random initialisation of weights (RIW), which helps prevent overfitting and ensures effective learning by adapting to high-dimensional data; and the Adaptive Sub-gradient Optimiser (ASGO), which stabilises gradients and optimises convergence rates using a log softmax function. The model achieved an accuracy, F1-score and area under the curve (AUC) of 99%.
Also, with the aim of classifying DFU images into normal and abnormal, Preiya et al. (2023) [26] proposed a different deep learning framework. The DFUC2021 dataset was used, which was pre-processed and segmented. In addition, a deep recurrent neural network (DRNN) was integrated for feature extraction from numeric and text data with a fast convolutional neural network (PFCNN) pre-trained with U++ Net for image classification. The model obtained an accuracy of 99.32%.
Fadhel et al. (2024) [27] proposed real-time classification models to distinguish normal and abnormal DFUs, combining deep learning with hardware acceleration platforms. They used a dataset from the Kaggle repository and applied data augmentation to balance the classes. This study introduced two CNN models, DFU_FNet and DFU_TFNet, each designed to optimise feature extraction and mitigate the limitations of small datasets. DFU_FNet uses a simpler architecture to extract features for training classifiers such as SVM and KNN, while DFU_TFNet uses a deeper architecture reinforced with transfer learning, improving performance in medical imaging tasks. Both models were implemented on FPGAs (field-programmable gate arrays) and graphics processing units (GPUs). DFU_TFNet achieved remarkable results, outperforming models such as AlexNet, VGG16 and GoogleNet. Although FPGAs process a little more slowly than GPUs, their low power consumption makes them especially suitable for portable and real-time diagnostic applications.
A more recent study by Gudivaka et al. (2025) [28] presented a practical machine learning method for DFU classification using reinforcement learning. It integrates compositional pattern production networks (CPPNs) to recognise structured and unstructured images, SVM for classification, hierarchical clustering to group data and ELM with a single hidden layer for fast classification. This research evaluates several deep learning architectures, including AlexNet, VGG16, GoogLeNet and ResNet50, to classify DFUs. The dataset is DFU2020, and ResNet50 achieved the highest classification accuracy (99.49% for ischaemia). To improve performance with limited datasets, affine transformation techniques were used to augment the data. The results show significant improvements in classification efficiency. Analysis of the clustering scenarios reveals that the model’s performance varies according to the severity of the ulcer, managing to effectively distinguish between four groups: Group 1—Mild to moderate localised cellulitis; Group 2—Moderate to severe cellulitis; Group 3—Moderate to severe cellulitis with ischaemia; and Group 4—Life- or limb-threatening infections.
Table 1 summarises the studies found in the literature that use neural networks to classify DFUs, while Table 2 summarises the datasets available for training the models. Despite the reported impact of skin tones in ulcer assessment [29,30], none of the reviewed studies has addressed this factor, neither during dataset collection or analysis, nor during the prediction model development.
In general, VGG16 is the most widely used architecture, while other models such as InceptionV3, ResNet50, InceptionResNetV2, MobileNetV2 and EfficientNet are also explored. Regarding performance, the results are mostly reported in terms of accuracy, precision and F1-score. The EfficientNet model shows the best average results, with accuracy up to 99.39% for ischaemia and 97.92% for infection, while VGG16 has a solid performance, ranging from 83.36% to 100% accuracy in the studies analysed.

2. Materials and Methods

2.1. Datasets

This work made use of two publicly available datasets: the DFU dataset and the DFUC2021 dataset [22]. Both datasets consist of images of diabetic foot ulcers, annotated by medical professionals, and were combined and adapted to create a more comprehensive dataset for model training and evaluation.
The DFU dataset was collected over a period of five years at Lancashire Teaching Hospitals. The images were captured after the debridement process, which involves removing necrotic and devitalized tissue. Three different camera models were used to capture these images: the Kodak DX4530, the Nikon D3300, and the Nikon COOLPIX P100. Each image in the DFU dataset was annotated by two healthcare professionals, a consultant physician and a diabetic foot specialist. In cases where their assessments differed, the more experienced clinician made the final decision [33,34].
The DFU dataset is structured around two binary classification tasks: distinguishing between ulcers with and without infection, and between ulcers with and without ischaemia. To increase the variability and size of the dataset, natural data augmentation techniques were applied. For the infection class, this involved including both the original image and enlarged versions of the same lesion. For the ischaemia class, original, rotated and mirrored versions of the ulcers were included [33,34].
The second dataset used in this study is the DFUC2021 dataset, which is organised into four distinct categories: “non-existent,” which includes images of healthy skin, healing ulcers, or ulcers without infection or ischaemia; “infection,” which includes ulcers showing only signs of infection; “ischaemia,” which includes ulcers showing only signs of ischaemia; and “both,” which includes ulcers exhibiting both infection and ischaemia [22].
Since the two datasets differ in their class structures, it was necessary to adapt the DFU dataset to match the four-class schema of the DFUC2021 dataset. This involved reclassifying and restructuring the DFU data so that all images could be consistently labeled under the categories non-existent, infection, ischaemia or both.
After this harmonization process, the final dataset used to train the model consisted of 1039 (46.7%) images of feet with ulcers and 1185 (53.3%) images of feet with healthy skin or no reported problem.
In addition, the dataset quality was also a matter of concern and observation. As many technologies, particularly those based on computer vision, have shown inconsistent results when applied to individuals with specific skin tones, it is essential to ensure that AI systems perform accurately and fairly across all skin tones. This issue is deeply ethical. Systems that underperform for certain racial or ethnic groups can reinforce inequalities and result in serious real-world consequences, such as misdiagnoses in healthcare [35]. AI independence from skin tone requires deliberate efforts during the design, training, and evaluation phases to ensure consistent performance across all racial and ethnic groups. A representative dataset is one that accurately reflects the diversity of the real-world population or environment where the AI system is meant to operate [36]. Evaluating the datasets’ inclusivity across dimensions such as skin tone, ethnicity, gender and age is a moral and ethical imperative. It demands deliberate data collection practices, transparent reporting and continuous auditing. When AI is built on inclusive foundations it becomes more equitable, more accurate, and more aligned with the values of social justice and human dignity.
For the classification of skin tones, there are two popular options: (a) the Fitzpatrick Skin Type (FST) scale [37], which classifies skin based on its reaction to sun exposure (burning and tanning); and (b) the Individual Typology Angle (ITA) [38], a quantitative measure of skin pigmentation derived from colourimetric measurements. Due to its objective nature, ITA was chosen for the current work. In fact, some studies compare and contrast FST with ITA, highlighting ITA’s superiority for objective measurement [39,40]. ITA is calculated from the L* (lightness) and b* (yellow–blue axis) components of the CIELAB colour space, using the formula
$\mathrm{ITA} = \arctan\left(\frac{L^{*} - 50}{b^{*}}\right) \times \frac{180}{\pi},$
Based on Chardon et al. (1991) [38], the obtained values can then be categorised in six classes, as shown in Table 3.
To estimate the skin type for each image patch in the dataset, the corners of the images were selected as reference areas, based on the assumption that wound regions are typically round and less likely to occupy the corners. Square regions located in each of the four corners were used to sample pixels for calculating the Individual Typology Angle (ITA) of each pixel. The final ITA value for each image was determined as the median of the ITA values across these sampled pixels. This median ITA was then used to assign the corresponding skin type category. The side lengths of the corner squares were set to 3%, 6% and 10% of the image dimensions, and the most voted class was considered. The resulting distribution of skin patches, as shown in Figure 2, is highly imbalanced, with a prevalence of around 66% of brown and dark tones. In Figure 3 it is possible to observe two randomly selected sample images for each ITA skin tone category.
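The corner-sampling procedure can be summarised in a short script. The following is a minimal sketch, assuming RGB inputs and the commonly cited Chardon category cut-offs (which should be matched against Table 3); function names such as estimate_skin_tone are illustrative rather than part of the released code.

```python
import numpy as np
from skimage import color

# ITA class boundaries (degrees), following the commonly cited Chardon categories.
ITA_BINS = [(55, "very light"), (41, "light"), (28, "intermediate"),
            (10, "tan"), (-30, "brown"), (-np.inf, "dark")]

def ita_per_pixel(rgb_patch):
    """Individual Typology Angle for every pixel of an RGB patch (values in [0, 1])."""
    lab = color.rgb2lab(rgb_patch)                # L* in [0, 100], b* signed
    L, b = lab[..., 0], lab[..., 2]
    # arctan2 gives the same angle as arctan((L*-50)/b*) for b*>0 and avoids division by zero
    return np.degrees(np.arctan2(L - 50.0, b))

def classify_ita(ita_value):
    for threshold, label in ITA_BINS:
        if ita_value > threshold:
            return label
    return "dark"

def estimate_skin_tone(image, corner_fraction=0.06):
    """Median ITA over the four corner squares, assumed to contain healthy skin."""
    h, w = image.shape[:2]
    s = max(1, int(round(corner_fraction * min(h, w))))   # side of each corner square
    corners = [image[:s, :s], image[:s, -s:], image[-s:, :s], image[-s:, -s:]]
    itas = np.concatenate([ita_per_pixel(c).ravel() for c in corners])
    return classify_ita(np.median(itas))

# Majority vote over the three corner sizes used in the paper (3%, 6% and 10%):
# votes = [estimate_skin_tone(img, f) for f in (0.03, 0.06, 0.10)]
# skin_tone = max(set(votes), key=votes.count)
```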
The observed imbalance in skin type distribution, particularly the underrepresentation of “very light” and “light” categories (together accounting for around 10%) and the dominance of “dark” (40%) and “brown” (25%) categories, raises important concerns regarding model fairness and generalizability. Machine learning models trained on imbalanced skin tone datasets are prone to performance disparities across population groups, a well-documented issue in dermatology and medical imaging research. Some studies have shown that models trained on disproportionately dark or light skin tone datasets tend to overfit the dominant classes while underperforming on less represented classes. For example, Groh et al. (2021) [41] highlighted that skin condition classifiers exhibit reduced accuracy on darker skin tones when these are not adequately represented in the training data. Similarly, Mendi et al. (2024) [42] demonstrated that skin tone imbalance in medical imaging datasets can lead to systematic underdiagnosis or misclassification for underrepresented groups, especially in automated systems. On the other hand, recent developments in ulcer classification using a dataset similar to the one in the current study fail to address these aspects [43]. In this context, the current imbalance could result in bias against lighter skin tones, compromising fairness and potentially limiting clinical utility across diverse populations.

2.2. Tools

The development was performed using the Python 3.9 programming language. For model construction, training and evaluation, the libraries TensorFlow and Keras were employed. TensorFlow, widely recognised for its ability to perform large-scale operations, was chosen for its flexibility and ability to optimise models in different execution environments. On the other hand, Keras, being a high-level interface for neural networks, facilitated the implementation of complex models with a simple and intuitive syntax [44]. In addition, the Google Colab execution environment was used as the supporting infrastructure, as it provides access to advanced computing resources, such as graphics processing units (GPUs) and tensor processing units (TPUs), which are essential for intensive training of deep neural networks.

2.3. Methods

To provide an overview of the methodological approach, Figure 4 illustrates the complete processing pipeline adopted in the development of this study. The workflow consists of five sequential stages, beginning with the input image and progressing through pre-processing, data augmentation, model training using CNNs, fine-tuning and classification, and concluding with model explainability through Grad-CAM.
This study evaluates three well-established convolutional neural network (CNN) architectures (VGG16, VGG19 and MobileNetV2) for medical imaging classification tasks. These models were chosen for their proven effectiveness in extracting critical visual features, their computational efficiency and their balance between performance and resource demands [17]. Additionally, their compatibility with model explainability techniques was considered an advantage.
The VGG-16 and VGG-19 models are part of the VGGNet family, developed by Simonyan and Zisserman in 2015 [45]. Both architectures are characterised by their use of small 3 × 3 convolutional filters and a deep stack of layers, which promotes a hierarchical and detailed extraction of visual features. The architecture was designed to increase the depth of the network without significantly increasing the number of parameters, and its success lies in the efficient use of convolutional layers followed by pooling to capture progressively more abstract features. VGG-16 is made up of 16 trainable layers, including 13 convolutional layers organised into five blocks, interspersed with MaxPooling layers, which progressively reduce the spatial dimensions of the representations. The last layers are made up of two dense layers with 4096 units each, followed by an output layer with the softmax activation function. This architecture is widely used in image classification tasks due to its simplicity and effectiveness. On the other hand, VGG-19 adds three extra convolutional layers, totalling 19 trainable layers, which increases the network’s ability to capture finer details in complex images. This architecture also maintains the organisation into convolution and pooling blocks, but the extra depth makes it even more suitable for tasks where visual details are critical, such as the case of classifying diabetic foot ulcers.
Despite their effectiveness, both models have a relatively high number of parameters, with approximately 138 million parameters in VGG16 and 143 million in VGG19. This high complexity makes it necessary to use regularisation techniques to mitigate the risk of overfitting, especially with smaller datasets. Figure 5 illustrates the structural differences between VGG16 and VGG19, highlighting the additional convolutional layers in VGG19 and the identical configuration of the dense layers, in later stages, used for classification.
The MobileNetV2 architecture, on the other hand, was designed with low- and medium-resource devices in mind. By using blocks built on depthwise separable convolutions, which combine a depthwise convolution with a point-wise convolution, it reduces the cost of convolution operations to roughly 1/9th of that of standard convolutions. This innovation also reduces memory usage while maintaining good model accuracy. This approach processes the image into a compact, low-dimensional representation, which is then expanded to a higher dimensionality, convolved in a lightweight manner, and finally projected back to a low dimension. MobileNetV2 is composed of two types of blocks, one with stride = [1 1] and another with stride = [2 2], which downsizes the input. This structural efficiency makes the network a popular option for implementation on devices with limited resources while maintaining good performance, as described by Sandler et al. (2018) [46]. Figure 6 illustrates the MobileNetV2 architecture and its building blocks, highlighting its key components.
For the current purposes, both architectures (VGG and MobileNetV2) were pre-trained on the ImageNet dataset and then adjusted for the specific task of binary classification of diabetic ulcers. This methodology is known as ‘transfer learning’, which consists of taking advantage of the knowledge successfully acquired in a previous task (in this case, image classification in ImageNet) to speed up and improve performance in a new task. This approach brings significant advantages in terms of performance, as it allows models to start training with robust representations, requiring less data and time to achieve good results in the new task.
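As an illustration of this transfer learning setup, the snippet below shows one way the pre-trained backbones could be loaded in Keras and extended with a classification head. It is a minimal sketch: the head configuration (global average pooling, a 256-unit dense layer and dropout) and the function name build_classifier are illustrative assumptions, not the exact architecture used.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_classifier(backbone_name="VGG16", input_shape=(224, 224, 3), num_classes=2):
    """Load an ImageNet-pre-trained backbone and attach a small classification head."""
    backbones = {
        "VGG16": tf.keras.applications.VGG16,
        "VGG19": tf.keras.applications.VGG19,
        "MobileNetV2": tf.keras.applications.MobileNetV2,
    }
    base = backbones[backbone_name](weights="imagenet", include_top=False,
                                    input_shape=input_shape)
    base.trainable = False  # start with the convolutional base frozen (first training phase)

    model = models.Sequential([
        base,
        layers.BatchNormalization(),        # stabilise activations after the base output
        layers.GlobalAveragePooling2D(),
        layers.Dense(256, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation="softmax"),
    ])
    return model
```

Freezing the base at this point corresponds to the first training phase described later, before selective unfreezing during fine-tuning.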

2.3.1. Data Preparation and Pre-Processing

Data preparation and pre-processing are crucial stages in any deep learning pipeline, directly influencing the model’s performance and ability to generalise. In this stage the ITA skin tones were estimated for each image. The dataset of diabetic foot ulcer images was then partitioned into training and validation sets using a stratified random split, preserving the original class distribution. A split of 80% for the training set and 20% for the validation set was chosen, using a fixed random seed to ensure reproducibility. This choice was made to ensure that the model had a substantial set of data to learn from, while still allowing an effective assessment of the model’s performance through the validation set. The use of callbacks such as early stopping and model checkpointing helped prevent overfitting and ensured robust model selection.
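A stratified split with a fixed seed can be obtained directly from scikit-learn. The sketch below assumes image_paths and labels are parallel lists covering the merged dataset; the seed value shown is only illustrative.

```python
from sklearn.model_selection import train_test_split

# image_paths and labels are assumed to be parallel lists for the merged dataset
train_paths, val_paths, train_labels, val_labels = train_test_split(
    image_paths, labels,
    test_size=0.20,        # 80% training / 20% validation
    stratify=labels,       # preserve the original class distribution in both splits
    random_state=42,       # fixed seed for reproducibility (illustrative value)
)
```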
Additionally, to ensure that the models could effectively handle the visual diversity of the images, a set of pre-processing operations was performed. Specifically, images were first converted to grayscale and normalised to the range 0–1, helping to stabilise the gradients during training and making the optimisation process smoother, faster and more likely to converge to a good solution.
Data augmentation techniques were also applied to increase dataset variability and improve generalisation. Data augmentation focused on transformations that preserve the general visual characteristics of the ulcers while generating feasible variations. Since skin tones were considered, colour variations were carefully used, as described below:
  • Geometric Transformations
    Image rotation or Flipping: varying rotation angles by up to 30° or applying a vertical or horizontal flip, which helps the model recognise ulcers regardless of image orientation.
    Translation: small positional variations of up to 20% of the image dimensions were applied to simulate different ulcer locations and improve the model’s robustness to spatial displacement.
    Scaling/Zoom: applied with a range of up to 50%, allowing the model to focus on both smaller details and broader aspects of the images.
    Shear: introducing small geometric distortions, which forces the model to learn more complex variations in ulcer shapes.
  • Colour and Photometric Transformations
    Brightness Adjustment: The brightness of the images has been adjusted in a range of 20% up or down, which simulates different lighting conditions and makes the model more robust to these variations. The operation alters luminance but does not significantly shift colour balance (chrominance).
    Gamma adjustment: Very small adjustments to mid-tone brightness can be useful to mimic different camera sensors. It keeps relative colour ratios mostly intact.
The transformations were not cumulative, to better control the potential disturbance introduced during the process. As represented in Figure 7, for each original image of the dataset, eight additional augmented images were created.
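The augmentation scheme can be reproduced with standard Keras utilities. The sketch below builds one generator per transformation so that each augmented copy results from a single, non-cumulative operation, as described above; the parameter values mirror the ranges listed, while the shear angle is an assumption and the gamma step (which has no ImageDataGenerator equivalent) would need a small custom function.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# One generator per transformation, so each augmented image results from a single,
# non-cumulative operation applied to the original image.
augmenters = {
    "rotation":    ImageDataGenerator(rotation_range=30),
    "flip":        ImageDataGenerator(horizontal_flip=True, vertical_flip=True),
    "translation": ImageDataGenerator(width_shift_range=0.2, height_shift_range=0.2),
    "zoom":        ImageDataGenerator(zoom_range=0.5),
    "shear":       ImageDataGenerator(shear_range=10),        # shear angle in degrees (assumed)
    "brightness":  ImageDataGenerator(brightness_range=(0.8, 1.2)),
}

def augment_once(image, generator):
    """Return one randomly transformed copy of `image` (H, W, C float array)."""
    return generator.random_transform(image)
```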

2.3.2. Model Architecture Configuration

In complex ML tasks, it is common to use networks that have been pre-trained on large datasets, such as ImageNet, and then adjust them to the new problem. To do this, techniques such as fine-tuning and Batch Normalisation are used, both of which are essential to avoid problems such as overfitting and instability during training [47,48].
Fine-tuning is a technique in which a model that has already been trained on a large dataset is adapted to a new, more specific one, while maintaining the general characteristics learnt previously. During this approach, parts of the original model are frozen, while other parts are adjusted for the new dataset [2]. In practice, the lower layers of the network are responsible for capturing generic visual features (such as edges and textures), which are useful in almost any computer vision task and can therefore be kept, i.e., they will be the so-called frozen layers. The higher layers of the network, on the other hand, learn more complex and specific patterns from the original dataset and need to be adjusted (unfrozen layers) to adapt to the characteristics of the DFUs [49,50].
In the case of VGG16 and VGG19, the first eight layers were frozen, while the rest were unfrozen. On the other hand, MobileNetV2, being a lighter and more compact network, uses a different approach. As the features learnt by the model in ImageNet may not be sufficient to capture the more subtle nuances of ulcers, all its layers were unfrozen to allow for complete fine-tuning.
Batch Normalisation is a standard technique in deep neural networks, the main aim of which is to stabilise and speed up training. To do this, it normalises the activations of a layer according to the mean and standard deviation of the mini-batch, adjusting these values so that the network has more stable and well-scaled activations. By ensuring that the values are always controlled, it speeds up the convergence of the model and allows the use of higher learning rates, reducing the need for detailed manual adjustment of hyperparameters [48].
In the VGG network models, Batch Normalisation was applied after the output of the convolutional base, stabilising the activations before moving on to the dense layers. This was especially useful when unfreezing the upper layers during fine-tuning, as it prevented changes in these layers from causing large variations in training. In the case of MobileNetV2, it was also applied just after the output of the convolutional base, ensuring that the model maintains stability during training, especially when unfreezing all the layers.
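The layer-freezing policy can be expressed as a small helper. The sketch below assumes the model was assembled as in the earlier build_classifier example, with the convolutional base as its first layer; the helper name is illustrative.

```python
def configure_fine_tuning(model, backbone_name):
    """Apply the freezing policy described above: keep the first eight layers of the
    VGG backbones frozen, and unfreeze every layer of MobileNetV2."""
    base = model.layers[0]            # the pre-trained convolutional base
    base.trainable = True             # unfreeze the base as a whole first
    if backbone_name in ("VGG16", "VGG19"):
        for layer in base.layers[:8]:
            layer.trainable = False   # generic edge/texture filters stay fixed
    return model
```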

2.3.3. Training and Validation

The training and validation process is a critical phase in the development of these models, as it determines the model’s ability to generalise to new data. To optimise this process, various techniques and strategies were implemented, including setting an appropriate learning rate, using callbacks and applying class weights.
For the training sessions, utilising either the VGG or MobileNet architectures, the foundational hyperparameters exhibit considerable overlap, reflecting common practices in DL for image classification. The number of epochs was set at 50, a value often found to be effective when using pre-trained weights through transfer learning. This duration allows the model to adapt to the new dataset while preventing excessive overfitting, a process monitored closely with early stopping criteria. The batch size was initially set to 32 for local training and later increased to 64 for server-side training. The Categorical Cross-Entropy loss function was employed, paired with a Softmax activation in the final layer to output probability distributions over the classes.
The learning rate is a fundamental hyperparameter that controls the magnitude of the updates to the model weights during training. An adequate learning rate is crucial to ensuring efficient and effective convergence. If it is too high, it can make excessive updates, resulting in instability and difficulty in finding the best solution. On the other hand, if it is too low, training will be slow, and the model may settle on a solution that is not the best possible [47]. For all three models, VGG16, VGG19 and MobileNetV2, a learning rate of 1 × 10−5 was used, which was chosen after several tuning attempts to find the best configuration. All the models were adapted to use the Adam optimiser, an algorithm that automatically adjusts the learning rates for each parameter, allowing for faster and more stable convergence during training [51].
In order to improve efficiency and monitor the model’s performance, callbacks were used, functions that allow actions to be carried out at certain points during training. The following functions were applied [52]:
  • Early Stopping: used to stop training when the validation metric has not improved after a certain number of epochs, known as patience. This ensures that the model does not continue to train unnecessarily after reaching its best performance. It was configured with a patience of 10 epochs for both models;
  • Reduce Learning Rate on Plateau: reduces the learning rate when the validation metric stops improving, allowing the model to make finer adjustments as it approaches convergence;
  • Model Checkpoint: saves the model during training whenever the validation metric (in this case, loss) improves. This ensures that the best version of the model is saved.
A fine-tuning strategy was also used, consisting of initially training the model for 25 epochs, with the convolutional base frozen, meaning only the top layers were updated. This allowed the network to learn task-specific features without altering the pre-trained weights. Then, in the following 25 epochs, fine-tuning was enabled by unfreezing some of the deeper layers of the convolutional base, allowing them to be retrained alongside the top layers. This technique helped the model adjust its learned representations more precisely to the task.
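Putting the above together, a possible Keras training loop is sketched below, reusing the build_classifier and configure_fine_tuning helpers from the earlier sketches. The callback settings not stated in the text (ReduceLROnPlateau factor and patience, checkpoint filename) and the array names x_train, y_train, x_val, y_val and class_weights are assumptions; the learning rate, loss, batch size, epoch budget and two-phase schedule follow the description above.

```python
import tensorflow as tf

callbacks = [
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                     restore_best_weights=True),
    tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=3),
    tf.keras.callbacks.ModelCheckpoint("best_model.h5", monitor="val_loss",
                                       save_best_only=True),
]

def compile_model(model):
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
                  loss="categorical_crossentropy", metrics=["accuracy"])

# Phase 1: 25 epochs with the convolutional base frozen (only the top layers learn).
compile_model(model)
model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=25,
          batch_size=64, class_weight=class_weights, callbacks=callbacks)

# Phase 2: unfreeze the deeper layers and continue for another 25 epochs.
configure_fine_tuning(model, backbone_name="VGG19")
compile_model(model)   # re-compile so the new trainable flags take effect
model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=25,
          batch_size=64, class_weight=class_weights, callbacks=callbacks)
```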
In an early development stage, training sessions were run on a Windows/Intel machine equipped with an NVIDIA RTX 3050 and CUDA, resulting in processing times ranging from 40 to 60 min. To accelerate development and parameter tuning iterations, the training processes were migrated to Google Colab Pro. By utilising high-performance GPUs, specifically the T4 and A100, it was possible to significantly reduce training time to just 5 to 10 min. This optimisation provided a substantial gain in efficiency, enabling a much faster development cycle.

2.3.4. Analysis with Grad-CAM

The Grad-CAM technique is an effective tool for interpreting and visualising the decisions made by models, especially in convolutional neural networks. In this study, it made it possible to identify, through heat maps, which parts of an image most influenced the model’s decision in classification tasks, offering a deeper understanding of how the model sees and analyses the data [53].
The process begins by loading and pre-processing the image to suit the model’s input. Next, during backpropagation, the model’s output gradients are calculated in relation to the activations of the chosen convolutional layer. These gradients are grouped, usually by average, to obtain weights that indicate the importance of each channel in that layer. With these weights, the convolutional layer activations are weighted, resulting in a heat map that highlights the regions of the image that contributed most to the model’s decision. Finally, this heat map is normalised and resized to match the dimensions of the original image, making it possible to clearly see the areas that were most significant for the classification [53].
This technique was applied in a similar way to the different models, each using the corresponding convolutional layer.
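A typical Grad-CAM implementation following these steps is sketched below. It assumes a model in which the chosen convolutional layer can be reached by name from the top-level model (for a nested backbone, the inner model would have to be addressed first); the function name grad_cam is illustrative.

```python
import numpy as np
import tensorflow as tf

def grad_cam(model, image, conv_layer_name, class_index=None):
    """Return a [0, 1] heat map of the regions that most influenced the prediction."""
    grad_model = tf.keras.Model(
        model.inputs, [model.get_layer(conv_layer_name).output, model.output])
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[np.newaxis, ...])
        if class_index is None:
            class_index = int(tf.argmax(preds[0]))
        class_score = preds[:, class_index]
    grads = tape.gradient(class_score, conv_out)            # d(score) / d(activations)
    weights = tf.reduce_mean(grads, axis=(0, 1, 2))         # average gradients per channel
    cam = tf.reduce_sum(conv_out[0] * weights, axis=-1)     # channel-weighted activation map
    cam = tf.maximum(cam, 0) / (tf.reduce_max(cam) + 1e-8)  # ReLU and normalise to [0, 1]
    cam = tf.image.resize(cam[..., tf.newaxis], image.shape[:2])
    return cam.numpy().squeeze()
```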

2.3.5. Evaluation Metrics

To evaluate the VGG16, VGG19 and MobileNetV2 models, a separate dataset was used for validation, which was not used during training. This ensured that performance metrics are calculated based on data that the models had not encountered before, providing a more accurate assessment of their capabilities. The models were tested on a significant number of validation images, and the predictions were compared with the actual classes to calculate the metrics.
Accuracy was one of the metrics evaluated. This is one of the main metrics used to assess the performance of classification models. Accuracy was calculated as follows:
$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN},$
where
  • True Positive (TP): images showing ulcers that the model classifies as ‘abnormal’.
  • False Negative (FN): samples with ulcers that were incorrectly classified as ‘normal’, representing a missed diagnosis.
  • False Positive (FP): samples without ulcers that were incorrectly classified as ‘abnormal’.
  • True Negative (TN): samples without ulcers that were correctly classified.
A second metric calculated was precision. Precision is a crucial metric when evaluating classification models, especially in medical contexts where the weight of false positives can be significant. This measures the proportion of TPs out of all instances the model classified as positive, indicating how reliable its positive predictions are. It was calculated as follows:
$\text{Precision} = \frac{TP}{TP + FP},$
Another metric calculated was recall, also known as sensitivity or hit rate, that measures the model’s ability to correctly identify TPs. In simple terms, recall measures how many of the actual positive instances were correctly identified by the model. The following equation is used to calculate it:
$\text{Recall} = \frac{TP}{TP + FN},$
The last metric evaluated was F1-Score. This is a widely used metric in machine learning, especially in scenarios where there is an imbalance between classes or where both precision and recall are important. The F1-score is the harmonic mean between precision and recall, combining these two metrics into a single score that reflects the balance between the accuracy of positive predictions and the model’s ability to correctly identify positive instances:
$F1\text{-}Score = 2 \times \frac{\text{Recall} \times \text{Precision}}{\text{Recall} + \text{Precision}},$
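For completeness, these four metrics (and the confusion matrix discussed next) can be computed directly with scikit-learn; y_true and y_prob below are assumed to be the ground-truth labels and the softmax outputs on the validation set.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

y_pred = np.argmax(y_prob, axis=1)                 # predicted class per validation image

accuracy  = accuracy_score(y_true, y_pred)         # (TP + TN) / all samples
precision = precision_score(y_true, y_pred)        # TP / (TP + FP)
recall    = recall_score(y_true, y_pred)           # TP / (TP + FN)
f1        = f1_score(y_true, y_pred)               # harmonic mean of precision and recall
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
```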
In addition to the traditional metrics, the confusion matrices of each model were calculated. The confusion matrix is an essential tool for evaluating classification models, especially in supervised learning problems such as the detection of diabetic foot ulcers. This matrix allows the model’s performance to be clearly visualised by providing a comparison between predicted and actual values. In binary classifications, such as this work, the matrix is made up of four main elements:
$TP = \text{Recall} \times \text{Total Number of Positives},$
$FN = \text{Total Number of Positives} - TP,$
$FP = \frac{TP}{\text{Precision}} - TP,$
$TN = \text{Total Number of Negatives} - FP$
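The relations above can be turned into a small helper for recovering the four counts from the reported metrics; the values in the commented example are hypothetical and not taken from Table 6.

```python
def reconstruct_confusion(recall, precision, total_positives, total_negatives):
    """Recover TP, FN, FP and TN from recall, precision and the class totals."""
    tp = round(recall * total_positives)       # TP = Recall x Total Positives
    fn = total_positives - tp                  # FN = Total Positives - TP
    fp = round(tp / precision - tp)            # FP = TP / Precision - TP
    tn = total_negatives - fp                  # TN = Total Negatives - FP
    return tp, fn, fp, tn

# Hypothetical example: reconstruct_confusion(0.97, 0.97, 208, 237)
```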

3. Results

This section presents the performance results of the VGG16, VGG19 and MobileNetV2 models on the validation dataset. In Table 4 it is possible to observe a summary of the results of the performance metrics for each model.
The performance metrics calculated are fundamental to evaluating the effectiveness of the models and determining their clinical applicability. Through this evaluation, it is possible to identify not only the overall performance but also the areas in which each model can be improved, contributing to the continuous improvement of the automatic screening system for patients with DFU. Table 5 presents the initial performance metrics for the original dataset in the “raw” column, highlighting the early modest results. The remaining columns show the F1-score for each augmentation technique applied individually. This clearly demonstrates the impact of the data augmentation process, which not only expanded the dataset but also increased its diversity.
To further assess model behaviour, Table 6 presents the confusion matrices for the MobileNetV2, VGG16 and VGG19 models, providing a detailed overview of the performance of the model.
The results of the confusion matrices offer a deeper insight into the performance of each model, complementing the metrics presented above.
The evaluation with Grad-CAM then comes in as a qualitative analysis, helping to visually analyse the previously obtained results. Figure 8 shows the images produced using this tool on VGG16, VGG19 and MobileNetV2, respectively. The heatmaps help to visually explain which regions of an input image the CNN model focuses on when making a particular classification decision.

4. Discussion

4.1. Impact of Image Augmentation Techniques

Each augmentation strategy applied (Rotation, Translation, Scaling, Shearing and Brightness adjustment) led to an improvement in accuracy compared to the “Raw” baseline, where no augmentation was applied. This highlights the importance of data augmentation to improve model generalisation and reduce overfitting.
Among the models evaluated, VGG16 performed best across all conditions. Its “Raw” performance started at 74% and reached a peak accuracy of 80% with brightness augmentation, surpassing the best results of the other models. VGG19, while outperforming MobileNetV2 overall, did not achieve the same level of performance as VGG16.
When looking at the individual impact of each augmentation, Rotation and Scale proved to be particularly effective geometric transformations. Both techniques yielded a substantial improvement of +5% for MobileNetV2 (reaching 72%); for VGG16, Rotation brought +5% (reaching 79%) and Scale +3%, while VGG19 also saw notable gains of +4% for both. This indicates that teaching models invariance to object orientation and size variations within images is crucial. Furthermore, the photometric augmentation, Brightness adjustment, also showed good effectiveness, especially for VGG16, providing its largest individual increase of +6% to reach 80%. MobileNetV2 and VGG19 also benefited significantly from Brightness adjustment, with increases of +4%. Translation and Shear augmentations, while beneficial, offered more modest improvements overall, typically in the range of +2% to +4%.

4.2. Performance Metrics Analysis

In terms of the VGG16 model, the accuracy achieved was 93%, indicating that the model correctly classified 93% of the DFU images in the validation set. The VGG19 model had a higher accuracy, reaching 97%. This result suggests that VGG19, having a deeper architecture, was able to identify relevant patterns in the images more effectively. Despite being optimised for efficiency and use on mobile devices, as mentioned above, MobileNetV2 obtained an accuracy of 80%. This shows that, despite its lightness, the model was still able to carry out effective classification, albeit with a lower performance compared to the VGG models. These results highlight the effectiveness of the models in correctly identifying ulcer characteristics, with VGG19 standing out as the most robust among them. Accuracy, as a metric, offers an initial view of the models’ performance, serving as a starting point for more detailed analyses and the exploration of other performance metrics, such as precision, recall and F1-score, which provide a more comprehensive understanding of the models’ behaviour under different conditions.
In a diabetic foot ulcer classification scenario, it is essential to minimise FPs, as these can lead to incorrect diagnoses and unnecessary treatment for patients. Therefore, high precision indicates that when the model classifies an image as ‘abnormal’, there is a high probability that the image actually shows an ulcer. The per-class results for each model were analysed separately for the ‘abnormal’ and ‘normal’ classes for better insight. The MobileNetV2 model obtained a recall of 74% for the ‘abnormal’ class, indicating that most ulcer images are being detected, while the 79% specificity for the ‘normal’ class suggests that it still incorrectly classifies some normal images as abnormal. VGG16 showed balanced values of 93% for both classes, demonstrating consistent and reliable performance in all classifications. VGG19, on the other hand, stood out with 97% for the ‘abnormal’ class, showing excellent ability in identifying images with ulcers.
A high recall value indicates that the model is identifying most of the positive cases correctly, i.e., there are few false negatives. In this case, a false negative would be for the model to consider an image as normal, but in fact that image has an ulcer on a diabetic foot. In a scenario where a model uses clinical data, such as ulcer detection, the aim is to have a high recall, as this means that the model is identifying the majority of ulcer cases correctly. However, a high recall percentage can come at the expense of an increase in false positives, which can be captured by precision. For the MobileNetV2 model, the recall for the ‘abnormal’ class was 74%, meaning that the model correctly identified 74% of the images with ulcers. VGG19 obtained the best recall value of 97%.
The F1-score can be crucial in clinical applications, such as the classification of diabetic foot ulcers, where both the ability to correctly detect ulcers (recall) and the accuracy of positive predictions (precision) have direct implications for treatment and diagnosis. The MobileNetV2 model obtained an F1-Score of 78%. This result indicates a reasonable balance between precision and recall. VGG16 had an F1-Score of 93%, pointing again to a good compromise between precision and recall. VGG19 obtained an F1-Score of 97%, making it the best performing model.
The results in Table 4 and in Table 6 indicate that VGG19 had the best metrics, suggesting that it is the most robust model for this task. MobileNetV2, although efficient, performed slightly less well, which may be an expected trade-off given its lightweight design.
When performing a fairness analysis of the model, based on Table 6, it can be observed that, despite the high overall classification performance, the analysis across skin tone categories reveals notable disparities, particularly affecting lighter skin tones. While the model performs consistently well on darker skin tones, achieving F1-scores of 98–99% for the "Tan", "Brown" and "Dark" categories, it shows reduced performance on the "Light" and "Intermediate" tones. Most notably, the "Light" group exhibits the lowest F1-score (84%), with comparatively lower precision (81%) and accuracy (84%). This discrepancy is not solely attributable to small sample size, as other minority classes (e.g., "Very Light") perform better (F1-score 97%) despite similarly low support. Instead, the reduced precision for "Light" skin tones, where 11 false positives were observed, suggests systematic misclassification or confusion with neighbouring classes, which may stem from insufficient representation or ambiguous ITA thresholds. The "Very Light" skin tones, on the other hand, showed good results, possibly because of a more obvious contrast between skin tone and ulcer-related colours. It is also interesting that the "Intermediate" class showed lower performance than the "Very Light" class despite its higher representation in the dataset.
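As an illustration of how such a subgroup analysis can be carried out, the sketch below computes the Individual Typology Angle (ITA) from CIELAB values, maps it to the categories of Table 3 and reports precision, recall and F1-score per skin tone group using scikit-learn. The function names and grouping logic are illustrative assumptions, not the exact implementation used in this work.

```python
import math
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

def ita_degrees(L: float, b: float) -> float:
    """Individual Typology Angle: arctan((L* - 50) / b*) in degrees (for b* > 0)."""
    return math.degrees(math.atan2(L - 50.0, b))

def skin_tone_category(ita: float) -> str:
    """Map an ITA value to the skin tone categories of Table 3."""
    if ita > 55:
        return "Very Light"
    if ita > 41:
        return "Light"
    if ita > 28:
        return "Intermediate"
    if ita > 10:
        return "Tan"
    if ita > -30:
        return "Brown"
    return "Dark"

def per_group_report(y_true, y_pred, groups):
    """Precision, recall and F1-score per skin tone group (binary labels assumed)."""
    y_true, y_pred, groups = map(np.asarray, (y_true, y_pred, groups))
    report = {}
    for g in sorted(set(groups)):
        mask = groups == g
        report[g] = {
            "precision": precision_score(y_true[mask], y_pred[mask]),
            "recall": recall_score(y_true[mask], y_pred[mask]),
            "f1": f1_score(y_true[mask], y_pred[mask]),
        }
    return report
```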

4.3. Grad-CAM Analysis

Evaluating the results obtained by applying Grad-CAM, we can say that the VGG16 network largely focuses on the most relevant regions for detecting anomalies, such as areas with lesions, ulcers or visible signs of infection and ischaemia in the feet. The activations shown by Grad-CAM indicate that the model is learning to correctly identify patterns associated with pathological conditions. This is a positive sign that VGG16 is able to correlate the most pertinent visual characteristics of the images with the final classification.
As can be seen in Figure 8, the activations are not fully concentrated in the lesioned areas, but they highlight regions relevant to the decision, including areas that are not obvious from a human perspective. This reflects a good match between the model's decision and the clinical importance of the areas identified.
Overall, the Grad-CAM visualisations provide important visual validation that the VGG16 model is correctly targeting the regions of greatest clinical interest, reinforcing the model’s effectiveness in the task of classifying diabetic feet.
The Grad-CAM results applied to VGG19 showed superior performance compared to VGG16, especially in identifying pathological areas in diabetic feet. VGG19 activations are more concentrated in relevant regions, such as ulcerations and lesions. To summarise, VGG19 shows greater accuracy in identifying lesions, with more focused and less dispersed activations.
The Grad-CAM results applied to MobileNetV2 show consistent performance in identifying pathological areas in diabetic foot ulcers. MobileNetV2 activations are more scattered in some images but still manage to highlight important areas associated with lesions, although with less precision compared to deeper architectures such as VGG19. MobileNetV2 performs satisfactorily, with activations consistent with pathological areas, but with a slight tendency to activate uninjured regions. The focus of the activations is more diffuse, which reflects a lower ability to capture fine clinical detail, especially compared to VGG19.
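For completeness, heatmaps such as those in Figure 8 can be generated with a few lines of TensorFlow. The sketch below is a minimal Grad-CAM implementation for a Keras model, assuming the last convolutional layer is named "block5_conv3" (the default name in the Keras VGG16 application); the visualisation pipeline actually used in this study may differ in its details.

```python
import numpy as np
import tensorflow as tf

def grad_cam(model, image, conv_layer_name="block5_conv3", class_index=None):
    """Minimal Grad-CAM: gradient-weighted average of the last conv feature maps."""
    grad_model = tf.keras.Model(
        inputs=model.inputs,
        outputs=[model.get_layer(conv_layer_name).output, model.output],
    )
    with tf.GradientTape() as tape:
        conv_maps, preds = grad_model(image[np.newaxis, ...])
        if class_index is None:
            class_index = int(tf.argmax(preds[0]))
        class_score = preds[:, class_index]
    grads = tape.gradient(class_score, conv_maps)         # d(score) / d(feature maps)
    weights = tf.reduce_mean(grads, axis=(0, 1, 2))       # global-average-pooled gradients
    cam = tf.reduce_sum(conv_maps[0] * weights, axis=-1)  # weighted sum of feature maps
    cam = tf.nn.relu(cam)                                 # keep only positive influence
    return (cam / (tf.reduce_max(cam) + 1e-8)).numpy()    # normalised to [0, 1]
```

The resulting map is then resized to the input resolution and overlaid on the original image, with warm colours marking the most influential regions, as shown in Figure 8.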

5. Conclusions

5.1. Main Contributions

This study contributes to the advancement of medical image analysis by applying DL to the classification of DFUs, a medical condition whose severe complications, such as infection, amputation and reduced quality of life, can drastically affect a patient's life. An application was presented, demonstrating how artificial intelligence can contribute to clinical decision-making and patient outcomes through the early, accurate and automated diagnosis of ulcers.
One of the main contributions of this work lies in demonstrating that strong classification performance for diabetic foot ulcers can be achieved even with limited medical data, through a comparative analysis of three well-established CNN architectures (MobileNetV2, VGG16 and VGG19) adapted using transfer learning, data augmentation and fine-tuning. This analysis highlights the viability of applying DL in real-world clinical contexts, where data availability is often constrained and data quality is not always high. Moreover, the comparative results underscore the practical trade-offs between the models: while MobileNetV2 offers a lightweight design suitable for deployment on devices with limited resources, the very high performance of VGG16 and VGG19 suggests that they could also be suitable for clinical applications where diagnostic accuracy is the priority.
Furthermore, the results emphasise the importance of architectural decisions and fine-tuning techniques in optimising model performance. MobileNetV2, although offering efficiency and deployability advantages, has lower sensitivity, so compromises must be accepted in clinical practice. On the other hand, the improved performance of VGG16 and VGG19, particularly after the inclusion of dropout and Batch Normalisation layers, indicates that careful model management is necessary to optimise diagnostic accuracy.
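As a hypothetical sketch of the kind of transfer-learning configuration described above, a frozen convolutional base can be combined with a small classification head that includes Batch Normalisation and dropout; the layer sizes, dropout rate and optimiser settings below are illustrative, not the values used in the study.

```python
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

# Frozen VGG16 base (transfer learning) with a small head using
# Batch Normalisation and dropout for a binary normal/abnormal output.
base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # freeze the convolutional base for the initial training phase

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(256, activation="relu"),
    layers.BatchNormalization(),
    layers.Dropout(0.5),
    layers.Dense(1, activation="sigmoid"),
])

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
    loss="binary_crossentropy",
    metrics=["accuracy"],
)
```

Selected layers of the base can subsequently be unfrozen for fine-tuning with a lower learning rate.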
A further contribution of this study is the fairness analysis based on skin tone, aimed at identifying potential disparities in model performance across different skin types. The results revealed lower reliability for individuals with "Light" and possibly "Intermediate" skin tones, a limitation that could lead to unequal diagnostic outcomes in practice. These findings contribute to the field by exposing fairness gaps that are often overlooked in diabetic foot ulcer classification research, and they reinforce the need for more inclusive datasets and evaluation protocols to ensure that AI-based diagnostic systems serve all patient populations equitably.

5.2. Future Work

The availability of larger and more diverse datasets is fundamental to improving the generalisation capacity of models. A greater number of examples, especially in medical contexts, can help mitigate overfitting problems and improve the robustness of predictions. Including more images from different sources, lighting conditions and ulcer variability would help the models learn more comprehensive features, resulting in superior performance in real-world scenarios.
Finally, future research could explore the development of models that not only identify ulcers but also predict the risk of developing new lesions based on historical patient data, such as pre-existing health conditions and treatment history. This proactive approach could be transformative, allowing for more targeted interventions and improving the quality of patient care.

Author Contributions

Conceptualization, L.P.-C.; methodology, M.S. (Miguela Sequeira), L.P.-C.; software, M.S. (Miguela Sequeira); validation, L.P.-C.; formal analysis, L.P.-C. and S.S.R.; investigation, M.S. (Miguela Sequeira); resources, S.S.R.; data curation, M.S. (Miguela Sequeira); writing—original draft preparation, M.S. (Miguela Sequeira), M.C.S. (Maria Carolina Sousa), M.S. (Marta Silva), M.N.; writing—review and editing, M.S. (Miguela Sequeira), M.C.S. (Maria Carolina Sousa), M.S. (Marta Silva), M.N., L.P.-C. and S.S.R.; visualization, M.C.S. (Maria Carolina Sousa), M.S. (Marta Silva), M.N.; supervision, L.P.-C.; project administration, S.S.R.; funding acquisition, L.P.-C. and S.S.R. All authors have read and agreed to the published version of the manuscript.

Funding

This research was partially supported by FCT-UIDB/04730/2020 and FCT-UIDB/50014/2020 projects.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets that were used for the presented developments are publicly available.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Casarin, D.E.; Donadel, G.; Dalmagro, M.; de Oliveira, P.C.; de Cássia Faglioni Boleta-Ceranto, D.; Zardeto, G. Diabetes Mellitus: Causas, Tratamento e Prevenção/Diabetes Mellitus: Causes, Treatment and Prevention. Braz. J. Dev. 2022, 8, 10062–10075. [Google Scholar] [CrossRef]
  2. Banday, M.Z.; Sameer, A.S.; Nissar, S. Pathophysiology of Diabetes: An Overview. Avicenna J. Med. 2020, 10, 174–188. [Google Scholar] [CrossRef]
  3. Atkinson, M.A.; Eisenbarth, G.S.; Michels, A.W. Type 1 Diabetes. Lancet 2014, 383, 69–82. [Google Scholar] [CrossRef]
  4. Perego, C.; Da Dalt, L.; Pirillo, A.; Galli, A.; Catapano, A.L.; Norata, G.D. Cholesterol Metabolism, Pancreatic β-Cell Function and Diabetes. Biochim. Biophys. Acta (BBA)–Mol. Basis Dis. 2019, 1865, 2149–2156. [Google Scholar] [CrossRef]
  5. Cole, J.B.; Florez, J.C. Genetics of Diabetes Mellitus and Diabetes Complications. Nat. Rev. Nephrol. 2020, 16, 377–390. [Google Scholar] [CrossRef]
  6. Boulton, A.J.M.; Armstrong, D.G.; Hardman, M.J.; Malone, M.; Embil, J.M.; Attinger, C.E.; Lipsky, B.A.; Aragón-Sánchez, J.; Li, H.K.; Schultz, G.; et al. Diagnosis and Management of Diabetic Foot Infections; ADA Clinical Compendia Series; American Diabetes Association: Arlington, VA, USA, 2020. [Google Scholar]
  7. Castro-Martins, P.; Marques, A.; Coelho, L.; Vaz, M.; Costa, J.T. Plantar Pressure Thresholds as a Strategy to Prevent Diabetic Foot Ulcers: A Systematic Review. Heliyon 2024, 10, e26161. [Google Scholar] [CrossRef]
  8. In-Shoe Plantar Pressure Measurement Technologies for the Diabetic Foot: A Systematic Review: Heliyon. Available online: https://www.cell.com/heliyon/fulltext/S2405-8440(24)05703-7 (accessed on 22 June 2025).
  9. Ansari, P.; Akther, S.; Khan, J.T.; Islam, S.S.; Masud, M.S.R.; Rahman, A.; Seidel, V.; Abdel-Wahab, Y.H.A. Hyperglycaemia-Linked Diabetic Foot Complications and Their Management Using Conventional and Alternative Therapies. Appl. Sci. 2022, 12, 11777. [Google Scholar] [CrossRef]
  10. Castro-Martins, P.; Marques, A.; Pinto-Coelho, L.; Fonseca, P.; Vaz, M. A Portable Insole System for Actively Controlled Offloading of Plantar Pressure for Diabetic Foot Care. Sensors 2025, 25, 3820. [Google Scholar] [CrossRef]
  11. World Health Organization Diabetes Fact Sheet. Available online: https://www.who.int/news-room/fact-sheets/detail/diabetes (accessed on 22 December 2021).
  12. Monteiro-Soares, M.; Boyko, E.J.; Jeffcoate, W.; Mills, J.L.; Russell, D.; Morbach, S.; Game, F. Diabetic Foot Ulcer Classifications: A Critical Review. Diabetes/Metab. Res. Rev. 2020, 36, e3272. [Google Scholar] [CrossRef]
  13. Castro-Martins, P.; Pinto-Coelho, L.; Campilho, R.D.S.G. Calibration and Modeling of the Semmes–Weinstein Monofilament for Diabetic Foot Management. Bioengineering 2024, 11, 886. [Google Scholar] [CrossRef]
  14. Iraj, B.; Khorvash, F.; Ebneshahidi, A.; Askari, G. Prevention of Diabetic Foot Ulcer. Int. J. Prev. Med. 2013, 4, 373–376. [Google Scholar]
  15. Costa, T.; Coelho, L.; Silva, M.F. Automatic Segmentation of Monofilament Testing Sites in Plantar Images for Diabetic Foot Management. Bioengineering 2022, 9, 86. [Google Scholar] [CrossRef]
  16. Armstrong, D.G.; Swerdlow, M.A.; Armstrong, A.A.; Conte, M.S.; Padula, W.V.; Bus, S.A. Five Year Mortality and Direct Costs of Care for People with Diabetic Foot Complications Are Comparable to Cancer. J. Foot Ankle Res. 2020, 13, 16. [Google Scholar] [CrossRef]
  17. Pinto-Coelho, L. How Artificial Intelligence Is Shaping Medical Imaging Technology: A Survey of Innovations and Applications. Bioengineering 2023, 10, 1435. [Google Scholar] [CrossRef]
  18. Das, S.K.; Roy, P.; Singh, P.; Diwakar, M.; Singh, V.; Maurya, A.; Kumar, S.; Kadry, S.; Kim, J. Diabetic Foot Ulcer Identification: A Review. Diagnostics 2023, 13, 1998. [Google Scholar] [CrossRef]
  19. Zhao, X.; Liu, Z.; Agu, E.; Wagh, A.; Jain, S.; Lindsay, C.; Tulu, B.; Strong, D.; Kan, J. Fine-Grained Diabetic Wound Depth and Granulation Tissue Amount Assessment Using Bilinear Convolutional Neural Network. IEEE Access 2019, 7, 179151–179162. [Google Scholar] [CrossRef]
  20. Wang, C.; Yu, Z.; Long, Z.; Zhao, H.; Wang, Z. A Few-Shot Diabetes Foot Ulcer Image Classification Method Based on Deep ResNet and Transfer Learning. Sci. Rep. 2024, 14, 29877. [Google Scholar] [CrossRef]
  21. Alzubaidi, L.; Fadhel, M.A.; Oleiwi, S.R.; Al-Shamma, O.; Zhang, J. DFU_QUTNet: Diabetic Foot Ulcer Classification Using Novel Deep Convolutional Neural Network. Multimed. Tools Appl. 2020, 79, 15655–15677. [Google Scholar] [CrossRef]
  22. Yap, M.H.; Cassidy, B.; Pappachan, J.M.; O’Shea, C.; Gillespie, D.; Reeves, N.D. Analysis Towards Classification of Infection and Ischaemia of Diabetic Foot Ulcers. In Proceedings of the 2021 IEEE EMBS International Conference on Biomedical and Health Informatics (BHI), Athens, Greece, 27–30 July 2021. [Google Scholar]
  23. Liu, Z.; John, J.; Agu, E. Diabetic Foot Ulcer Ischemia and Infection Classification Using EfficientNet Deep Learning Models. IEEE Open J. Eng. Med. Biol. 2022, 3, 189–201. [Google Scholar] [CrossRef]
  24. Anisuzzaman, D.M.; Patel, Y.; Rostami, B.; Niezgoda, J.; Gopalakrishnan, S.; Yu, Z. Multi-Modal Wound Classification Using Wound Image and Location by Deep Neural Network. Sci. Rep. 2022, 12, 20057. [Google Scholar] [CrossRef]
  25. Alqahtani, A.; Alsubai, S.; Rahamathulla, M.P.; Gumaei, A.; Sha, M.; Zhang, Y.-D.; Khan, M.A. Empowering Foot Health: Harnessing the Adaptive Weighted Sub-Gradient Convolutional Neural Network for Diabetic Foot Ulcer Classification. Diagnostics 2023, 13, 2831. [Google Scholar] [CrossRef]
  26. Sathya Preiya, V.; Kumar, V.D.A. Deep Learning-Based Classification and Feature Extraction for Predicting Pathogenesis of Foot Ulcers in Patients with Diabetes. Diagnostics 2023, 13, 1983. [Google Scholar] [CrossRef]
  27. Fadhel, M.A.; Alzubaidi, L.; Gu, Y.; Santamaría, J.; Duan, Y. Real-Time Diabetic Foot Ulcer Classification Based on Deep Learning & Parallel Hardware Computational Tools. Multimed. Tools Appl. 2024, 83, 70369–70394. [Google Scholar] [CrossRef]
  28. Gudivaka, R.K.; Gudivaka, R.L.; Gudivaka, B.R.; Basani, D.K.R.; Grandhi, S.H.; Khan, F. Diabetic Foot Ulcer Classification Assessment Employing an Improved Machine Learning Algorithm. Technol. Heal. Care 2025, 33, 1645–1660. [Google Scholar] [CrossRef]
  29. Johnson, J.; Johnson, A.R.; Andersen, C.A.; Kelso, M.R.; Oropallo, A.R.; Serena, T.E. Skin Pigmentation Impacts the Clinical Diagnosis of Wound Infection: Imaging of Bacterial Burden to Overcome Diagnostic Limitations. J. Racial Ethn. Heal. Disparities 2024, 11, 1045–1055. [Google Scholar] [CrossRef]
  30. Avsar, P.; Moore, Z.; Patton, D.; O’Connor, T.; Skoubo Bertelsen, L.; Tobin, D.J.; Brunetti, G.; Carville, K.; Iyer, V.; Wilson, H. Exploring Physiological Differences in Injury Response by Skin Tone: A Scoping Review. J. Tissue Viability 2025, 34, 100871. [Google Scholar] [CrossRef]
  31. Cassidy, B.; Reeves, N.D.; Pappachan, J.M.; Gillespie, D.; O’Shea, C.; Rajbhandari, S.; Maiya, A.G.; Frank, E.; Boulton, A.J.; Armstrong, D.G.; et al. The DFUC 2020 Dataset: Analysis Towards Diabetic Foot Ulcer Detection. touchREV Endocrinol. 2021, 17, 5–11. [Google Scholar] [CrossRef]
  32. Pictures of Wounds and Surgical Wound Dressings. Available online: https://www.medetec.co.uk/files/medetec-image-databases.html (accessed on 16 July 2025).
  33. Alzubaidi, L.; Fadhel, M.A.; Al-Shamma, O.; Zhang, J.; Santamaria, J.; Duan, Y. Robust Application of New Deep Learning Tools: An Experimental Study in Medical Imaging. Multimed. Tools Appl. 2022, 81, 13289–13317. [Google Scholar] [CrossRef]
  34. Alzubaidi, L.; Fadhel, M.A.; Al-Shamma, O.; Zhang, J.; Santamaría, J.; Duan, Y.; Oleiwi, S.R. Towards a Better Understanding of Transfer Learning for Medical Imaging: A Case Study. Appl. Sci. 2020, 10, 4523. [Google Scholar] [CrossRef]
  35. Schwartz, R.; Vassilev, A.; Greene, K.K.; Perine, L.; Burt, A.; Hall, P. Towards a Standard for Identifying and Managing Bias in Artificial Intelligence; NIST: Gaithersburg, MD, USA, 2022.
  36. Gong, Y.; Liu, G.; Xue, Y.; Li, R.; Meng, L. A Survey on Dataset Quality in Machine Learning. Inf. Softw. Technol. 2023, 162, 107268. [Google Scholar] [CrossRef]
  37. Fitzpatrick, T.B. The Validity and Practicality of Sun-Reactive Skin Types I through VI. Arch. Dermatol. 1988, 124, 869–871. [Google Scholar] [CrossRef]
  38. Chardon, A.; Cretois, I.; Hourseau, C. Skin Colour Typology and Suntanning Pathways. Int. J. Cosmet. Sci. 1991, 13, 191–208. [Google Scholar] [CrossRef]
  39. Osto, M.; Hamzavi, I.; Lim, H.; Kohli, I. Individual Typology Angle and Fitzpatrick Skin Phototypes Are Not Equivalent in Photodermatology. Photochem. Photobiol. 2021, 98, 127–129. [Google Scholar] [CrossRef]
  40. Fijałkowska, M.; Koziej, M.; Żądzińska, E.; Antoszewski, B.; Sitek, A. Assessment of the Predictive Value of Spectrophotometric Skin Color Parameters and Environmental and Behavioral Factors in Estimating the Risk of Skin Cancer: A Case–Control Study. J. Clin. Med. 2022, 11, 2969. [Google Scholar] [CrossRef]
  41. Groh, M.; Harris, C.; Soenksen, L.; Lau, F.; Han, R.; Kim, A.; Koochek, A.; Badri, O. Evaluating Deep Neural Networks Trained on Clinical Images in Dermatology with the Fitzpatrick 17k Dataset. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Nashville, TN, USA, 19–25 June 2021. [Google Scholar]
  42. İsmail Mendi, B.; Kose, K.; Fleshner, L.; Adam, R.; Safai, B.; Farabi, B.; Atak, M.F. Artificial Intelligence in the Non-Invasive Detection of Melanoma. Life 2024, 14, 1602. [Google Scholar] [CrossRef]
  43. Rathore, P.S.; Kumar, A.; Nandal, A.; Dhaka, A.; Sharma, A.K. A Feature Explainability-Based Deep Learning Technique for Diabetic Foot Ulcer Identification. Sci. Rep. 2025, 15, 6758. [Google Scholar] [CrossRef]
  44. Abadi, M.; Agarwal, A.; Barham, P.; Brevdo, E.; Chen, Z.; Citro, C.; Corrado, G.S.; Davis, A.; Dean, J.; Devin, M.; et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. arXiv 2016, arXiv:1603.04467. [Google Scholar]
  45. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2015, arXiv:1409.1556. [Google Scholar]
  46. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.-C. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
  47. Chassagnon, G.; Vakalopolou, M.; Paragios, N.; Revel, M.-P. Deep Learning: Definition and Perspectives for Thoracic Imaging. Eur. Radiol. 2020, 30, 2021–2030. [Google Scholar] [CrossRef]
  48. Ioffe, S.; Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 6–11 July 2015; Bach, F., Blei, D., Eds.; PMLR: New York, NY, USA, 2015; Volume 37, pp. 448–456. [Google Scholar]
  49. Liu, Y.; Agarwal, S.; Venkataraman, S. Autofreeze: Automatically Freezing Model Blocks to Accelerate Fine-Tuning. arXiv 2021, arXiv:2102.01386. [Google Scholar]
  50. Bhatt, D.; Patel, C.; Talsania, H.; Patel, J.; Vaghela, R.; Pandya, S.; Modi, K.; Ghayvat, H. CNN Variants for Computer Vision: History, Architecture, Application, Challenges and Future Scope. Electronics 2021, 10, 2470. [Google Scholar] [CrossRef]
  51. Nikbakhtsarvestani, F.; Ebrahimi, M.; Rahnamayan, S. Multi-Objective ADAM Optimizer (MAdam). In Proceedings of the 2023 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Maui, HI, USA, 1–4 October 2023; pp. 3860–3867. [Google Scholar]
  52. Tian, J.; Li, H.; Qi, Y.; Wang, X.; Feng, Y. Intelligent Medical Detection and Diagnosis Assisted by Deep Learning. Appl. Comput. Eng. 2024, 64, 121–126. [Google Scholar] [CrossRef]
  53. Zhang, H.; Ogasawara, K. Grad-CAM-Based Explainable Artificial Intelligence Related to Medical Text Processing. Bioengineering 2023, 10, 1070. [Google Scholar] [CrossRef]
Figure 1. Causes of diabetic foot ulcers (adapted from [9]; images from Biorender.com).
Figure 2. Skin tone distribution based on the calculation of ITA classes from images in the dataset.
Figure 3. Randomly selected sample images for each ITA skin tone category.
Figure 4. Development pipeline for DFU classification and explainability.
Figure 5. VGG16 and VGG19 network architectures.
Figure 6. MobileNetV2 network architecture and related Bottleneck Residual Block.
Figure 7. Data augmentation pipeline that was used to expand the original dataset. From each operation, an arrow signals the number of generated images.
Figure 8. Examples of the heatmaps generated using Grad-CAM analysis on the tested CNN models. Red/yellow (warm) colours indicate regions of high importance or strong activation, i.e., pixels that heavily contributed to the model's decision to classify the image as the target class; blue/green (cool) colours indicate regions of low importance or weak activation, with little to no influence on the model's prediction for that class.
Table 1. Summary of studies found in the literature.

| Study | Methodology | # Images | Performance (%) |
| --- | --- | --- | --- |
| Zhao et al. (2019) [19] | VGG16 | 1639 | Accuracy: 83.36 |
| Wang et al. (2020) [20] | MobileNetv2, VGG16 | 1109 | F1-score: 94.05 |
| Alzubaidi et al. (2020) [21] | DFU-QUTNet, SVM, KNN | 754 | Precision: 95.40 |
| Yap et al. (2021) [22] | VGG16, ResNet, EfficientNet | 15,683 | Precision: 57.3; F1-score: 55.2 |
| Liu et al. (2022) [23] | EfficientNet | 58,200 | Accuracy (ischaemia): 99.39; Accuracy (infection): 97.92 |
| Anisuzzaman et al. (2022) [24] | VGG16, VGG19 | 1088 | Precision: 100 |
| Alqahtani et al. (2023) [25] | AWSg CNN | 493 | Accuracy, F1-score, AUC: 99 |
| Preiya et al. (2023) [26] | DRNN, PFCNN | 15,683 | Accuracy: 99.32 |
| Fadhel et al. (2024) [27] | DFU_FNet, DFU_TFNet | 493 | AlexNet: Accuracy 89.11, F1-score 88.1; VGG16: Accuracy 90.37, F1-score 90.9; GoogleNet: Accuracy 91.93, F1-score 92.9; DFU_FNet + SVM: Accuracy 94.71, F1-score 94.5 |
| Gudivaka et al. (2025) [28] | RL, CPPN, SVM, ELM, ResNet50 | 4000 | Classification accuracy: 93.75; Cluster efficiencies: 71–88 (Cluster 1), 85–97 (Cluster 2), 90–98 (Cluster 3), 93.5–98.2 (Cluster 4) |
Table 2. Summary of foot ulcer datasets.

| Name | Images Available | Collection Conditions | Resolution (Pixels) | Open Access |
| --- | --- | --- | --- | --- |
| DFUC 2020 [31] | 4000 (2496 training and 2097 test) | Captured by digital camera; variations in distance, angle, orientation, lighting, focus and presence of background objects | 640 × 480 | No |
| DFUC 2021 [22] | 15,683 (3994 unidentified and 5734 for the test set) | Three cameras used: Kodak DX4530, Nikon D3300 and Nikon COOLPIX P100 | 640 × 480 | Yes |
| Nasiriyah diabetic hospital centre [18] | 754 | Samsung Galaxy Note 8 and an iPad | - | No |
| Medetec Wound Database [32] | 49 (358 wounds) | Not specified | Width: 358 to 560; Height: 371 to 560 | Yes |
| Diabetic Foot Ulcer [33] | 2673 | Captured by digital camera | 224 × 224 | Yes |
Table 3. ITA values and related skin types.

| ITA (°) | Skin Type |
| --- | --- |
| > 55 | Very Light |
| 41–55 | Light |
| 28–41 | Intermediate |
| 10–28 | Tan |
| −30–10 | Brown |
| < −30 | Dark |
Table 4. Performance metrics obtained using the balanced and augmented dataset.

| Model | TP | FN | FP | TN | Accuracy | Precision | Recall | F1-Score |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MobileNetV2 | 769 | 270 | 169 | 1016 | 80% | 82% | 74% | 78% |
| VGG16 | 966 | 73 | 84 | 1101 | 93% | 92% | 93% | 92% |
| VGG19 | 1008 | 31 | 31 | 1154 | 97% | 97% | 97% | 97% |
Table 5. F1-score obtained by training the model with the original dataset (raw) and by applying each augmentation technique individually. The improvement difference is presented in parentheses.

| Model | Raw | Rot./Flip. | Trans. | Scaling | Shear | Bright. |
| --- | --- | --- | --- | --- | --- | --- |
| MobileNetV2 | 67% | 72% (+5%) | 69% (+2%) | 72% (+5%) | 69% (+2%) | 71% (+4%) |
| VGG16 | 74% | 79% (+5%) | 77% (+3%) | 77% (+3%) | 78% (+4%) | 80% (+6%) |
| VGG19 | 70% | 74% (+4%) | 73% (+3%) | 74% (+4%) | 72% (+2%) | 74% (+4%) |
Table 6. Summary of the results for the VGG19 model grouped by ITA skin tone.

| Skin Tone | TP | FN | FP | TN | Accuracy | Precision | Recall | F1-Score |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Very Light | 56 | 2 | 1 | 57 | 97% | 98% | 97% | 97% |
| Light | 48 | 7 | 11 | 50 | 84% | 81% | 87% | 84% |
| Intermediate | 88 | 8 | 7 | 70 | 91% | 93% | 92% | 92% |
| Tan | 139 | 3 | 3 | 209 | 98% | 98% | 98% | 98% |
| Brown | 417 | 8 | 7 | 469 | 98% | 98% | 98% | 98% |
| Dark | 260 | 3 | 2 | 299 | 99% | 99% | 99% | 99% |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
