A Novel Machine Learning Approach for Severity Classification of Diabetic Foot Complications Using Thermogram Images

Diabetes mellitus (DM) is one of the most prevalent diseases in the world and is associated with a high mortality rate. One of its major complications is diabetic foot, leading to plantar ulcers, amputation, and death. Several studies report that a thermogram helps to detect changes in the plantar temperature of the foot, which may indicate a higher risk of ulceration. However, in diabetic patients, the distribution of plantar temperature does not follow a standard pattern, making the changes difficult to quantify. The abnormal temperature distribution in infrared (IR) foot thermogram images can be used for the early detection of diabetic foot before ulceration to avoid complications. No machine learning-based technique has been reported in the literature to classify these thermograms based on the severity of diabetic foot complications. This paper uses an available labeled diabetic thermogram dataset and applies the k-means clustering technique to group the severity risk of diabetic foot ulcers in an unsupervised manner. Using the plantar foot temperature, the newly clustered dataset was verified by expert medical doctors in terms of the risk of developing foot ulcers. The newly labeled dataset was then investigated for how robustly it could be classified by machine learning networks. Classical machine learning algorithms with feature engineering and a convolutional neural network (CNN) with image-enhancement techniques were investigated to identify the best-performing network for classifying thermograms based on severity. The popular VGG19 CNN model showed an accuracy, precision, sensitivity, F1-score, and specificity of 95.08%, 95.08%, 95.09%, 95.08%, and 97.2%, respectively, in the stratification of severity. A stacking classifier is proposed using extracted features of the thermogram, built from trained gradient boost, XGBoost, and random forest classifiers.
This provides a comparable performance of 94.47%, 94.45%, 94.47%, 94.43%, and 93.25% for accuracy, precision, sensitivity, F1-score, and specificity, respectively.


Introduction
Diabetes mellitus (DM) is a chronic medical condition resulting from high levels of sugar in the blood, which often leads to severe health complications such as heart-related disease.

It is possible to calculate and determine an estimation of thermal changes using one foot as a reference [27][28][29][30] through the contralateral comparison of temperatures. However, if both feet have temperature changes, and neither of them shows the butterfly pattern, one foot cannot act as a reference for the other. If both feet present identical spatial distributions, the asymmetry cannot be measured even when there is a large temperature difference. An alternative approach is to calculate the temperature change using a control group butterfly pattern [31][32][33].
Machine learning (ML) techniques are gaining popularity in biomedical applications for assisting medical experts in early diagnosis [34][35][36]. In a previous extensive investigation, the authors developed a trained AdaBoost classifier that achieved an F1-score of 97% in classifying diabetic and healthy patients using thermogram images [37]. However, the temperature distribution of a diabetic foot does not have a specific spatial pattern, and it is, therefore, important to devise a method to distinguish diabetic feet with different temperature distributions that does not depend on a spatial pattern. The spatial distribution may change over time, and the temperature changes are not significant in some cases. Moreover, the detection of irregular temperature rises in the plantar region is important for diabetic patients. Deep learning on thermogram images to classify control and diabetic patients is not a well-studied domain. Moreover, severity grading of the diabetic foot is also lacking in the literature, although it could help establish an early warning tool for diabetic foot ulcer detection. A simple, effective, and accurate machine learning technique for diabetic foot severity stratification before foot ulcer development using thermogram images would therefore be very useful.
Several studies [31,32,[38][39][40][41][42][43][44][45] have attempted to extract features that can be used to identify the hot region in the plantar thermogram, which could be a sign of tissue damage or inflammation. In all of these works, the plantar region was split into six areas and different statistical features were extracted. In order to obtain various coefficients from texture and entropy characteristics, Adam et al. [43] employed a discrete wavelet transformation (DWT) and higher-order spectra (HOS). In another work, the foot image was decomposed using a double-density dual-tree complex wavelet transform (DD-DT-CWT), and many features were retrieved from the decomposed images [42]. To categorize patients as normal or in the ulcer group, Saminathan et al. [41] segmented the plantar area into 11 regions using region growing and retrieved texture characteristics. Maldonado et al. [40] employed a DL approach to segment a visible foot image, which was then used to segment the plantar area of the same patient's thermogram image to identify ulceration or necrosis based on temperature differences. Hernandez et al. [32] presented the thermal change index (TCI) as a quantitative indicator for detecting thermal changes in the plantar region of diabetic patients in comparison to a reference control group. They published a public database of diabetic foot thermogram images, named the "Plantar Thermogram Database", and used the TCI to classify the participants into classes one to five based on the spatial temperature distribution and temperature range. Cruz-Vega et al. 
in [38] proposed a deep learning technique to classify the plantar thermogram database images in a pairwise classification scheme, where the results were obtained by taking two classes at a time and then averaging the results over ten folds of different two-class combinations. For the classification of class three and class four, a new diabetic foot thermogram network (DFTNet) was proposed, with a sensitivity and accuracy of 0.9167 and 0.853, respectively. However, the ground-truth classification of the patients in this database was performed using the TCI score, which uses the butterfly pattern of the control group as a reference against which each affected foot is individually compared. Therefore, the classification relies entirely on the reliability of the TCI scoring technique, which was found questionable in two different aspects. Firstly, it was observed that thermogram images exhibiting a butterfly pattern were present even in higher classes, and the temperature distributions of severe diabetic patients, as reported in other articles, were incorrectly assigned to lower classes based on the TCI score. Secondly, state-of-the-art deep learning techniques failed to reliably classify the thermogram images into the different classes graded based on the TCI score. If a database is publicly available, it is possible to re-evaluate the labels in its datasets if those labels are found questionable [46]. Aradillas et al. [47] mentioned scenarios where they found errors in the labeling of training samples in databases and proposed cross-validation techniques to remove them. Hu et al. [48] mentioned wrong labels in face-recognition databases and proposed a training paradigm that takes the dataset size into account and trains itself by probabilistically removing samples that could have been incorrectly labeled. Ding et al. [49] proposed a semi-supervised two-stage approach to learning from noisy labels.
As per the above discussion, it can be seen that there is a lack of severity grading of the risk of ulcer development. The authors, including a group of expert medical doctors working in a diabetic foot clinic, note that the foot plantar temperature distribution can help in the early detection of foot ulcer development [50][51][52]. The scientific hypothesis of this work is that machine learning approaches can be used to categorize thermograms into severity classes reflecting patients' risk of ulcer development. This severity classification can support early treatment that could prevent diabetic foot ulcers. An improved machine learning model could enable early detection from the convenience of the patient's home and reduce the burden on healthcare systems, considering that diabetic foot complications are very common and lead to expensive follow-up and treatment procedures throughout the world. All the above studies motivated the authors to carry out an extensive investigation on dataset re-labeling, using unsupervised machine learning and a pre-trained convolutional neural network (CNN) to extract features automatically from thermogram images, reduce the feature dimensionality using principal component analysis (PCA), and, finally, classify thermograms using k-means clustering [53] to revise the labels. The new labels identified by the unsupervised clusters were fine-tuned by a pool of medical doctors (MDs). The authors then investigated classical machine learning techniques using feature engineering, and 2D CNNs with image enhancement techniques, to develop the best-performing classification network. The major contributions of this paper can be stated as follows:
The revision of the labels of a publicly available thermogram dataset for diabetic severity grading using unsupervised machine learning (k-means clustering), fine-tuned by medical doctors (MDs).
The extraction and ranking of relevant features from temperature pixels for classifying thermograms into diabetic severity groups.
The exploration of the effect of various image enhancement techniques on thermogram images in improving the performance of 2D CNN models in diabetic severity classification.
This work is the first of its kind to propose a machine learning technique to classify diabetic thermograms into different severity classes: mild, moderate, and severe.
The manuscript is organized into five sections: Section 1 discusses the introduction, the related works, and the key contributions; Section 2 discusses the detailed methodology used in this work; Sections 3 and 4 present the results and discussion; and, finally, Section 5 presents the conclusions.

Methodology
The methodology adopted in this work is presented in Figure 1. In the study, the thermograms were applied to a pre-trained CNN model to extract useful features, the feature dimensionality was reduced using principal component analysis (PCA), and the sparse feature space was then used as input to the k-means clustering algorithm to provide unsupervised revised classes, which were then verified by medical doctors. The newly revised classes (mild, moderate, and severe) were then tested in terms of classification performance by a 2D CNN approach using different image enhancement techniques and by classical machine learning algorithms on the features extracted from the thermograms. The following sections describe the dataset used in the study, the k-means clustering method used for unsupervised clustering, and the investigations performed using (i) thermogram images with 2D CNN networks and various image enhancement techniques, and (ii) classical machine learning algorithms with feature engineering (feature extraction and feature reduction) on the thermogram images. This section also describes the performance indicators that were utilized to choose the best machine learning algorithm.

Dataset
In this study, 167 foot-pair thermograms of DM (122) and control (45) subjects were obtained from a public database collected over 3 years (2012 to 2014) at the General Hospital of the North, the General Hospital of the South, the BIOCARE clinic, and the National Institute of Astrophysics, Optics and Electronics (INAOE), all located in the city of Puebla, Mexico. The dataset had demographic information such as age (control group: 27.76 ± 8.09 years; diabetic group: 55.98 ± 10.57 years), gender, height, and weight. The participants were asked to lie in the supine position for 15 min to reach thermodynamic equilibrium, so that the measured temperature variation reflected diabetic complications, and had avoided prolonged sun exposure, intense physical activity, and any effort that could affect blood pressure [32,39]. The participants were asked to remove their shoes and socks and clean their feet with a damp towel.
The data collection was performed during resting time to avoid exerting any effort that would impact the subjects' blood pressure and, as a result, their plantar temperature. The dataset included segmented thermograms of the medial plantar artery (MPA), lateral plantar artery (LPA), medial calcaneal artery (MCA), and lateral calcaneal artery (LCA), a concept suggested by Taylor and Palmer [54], in addition to the segmented foot thermograms (Figure 2). Because the angiosomes were utilized to compute the local temperature distribution, the information gained related not only to the damage caused by DM in the arteries, but also to the accompanying ulceration risk. The dataset, which is the largest public diabetic thermogram dataset, also included pixelated temperature measurements for the complete feet and the four angiosomes of both feet.


K-Means Clustering Unsupervised Classification
In this part of the investigation, the underlying features of the images extracted using a pre-trained CNN model were used to cluster them into different groups using k-means clustering. K-means is the most popular method for clustering data in an unsupervised manner and was proposed in 1967 [53]. K-means is an unsupervised, non-deterministic, numerical, iterative method of clustering in which each cluster is represented by the mean value of the objects in the cluster. Here, we partitioned a set of n objects into k clusters so that intercluster similarity was low and intracluster similarity was high, with similarity measured in terms of the mean value of the objects in a cluster. A similar concept was used in our investigation and could be divided into four steps:
A. Pre-processing: preparing the image so that it could be fed properly to the CNN model.
B. Feature extraction: using a pre-trained CNN model to extract the underlying features from a specific layer.
C. Dimensionality reduction: using principal component analysis (PCA) [55] to reduce the noise in the feature space and reduce the dimensionality.
D. Clustering: using k-means to cluster the images based on similar features.
In the pre-processing step, each image was first resized to 224 × 224 pixels, because the CNN model available in Keras [56], pre-trained on the ImageNet dataset [57], requires input images of that size. Interpolation was applied to the images before resizing to allow rescaling by a non-integer scaling factor. This step did not change the properties of the image; it only ensured that the image was formatted correctly.
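The resizing-with-interpolation step above can be sketched with a minimal bilinear resize in NumPy; the random array below is only a hypothetical stand-in for a thermogram, and in practice library utilities (e.g., those shipped with Keras or PIL) can be used instead:

```python
import numpy as np

def resize_bilinear(img, out_h, out_w):
    """Resize a 2-D array with bilinear interpolation (non-integer factors OK)."""
    in_h, in_w = img.shape
    # Coordinates of each output pixel mapped back into input space
    ys = np.linspace(0, in_h - 1, out_h)
    xs = np.linspace(0, in_w - 1, out_w)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, in_h - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None]; wx = (xs - x0)[None, :]
    # Blend the four surrounding input pixels for every output pixel
    top = img[np.ix_(y0, x0)] * (1 - wx) + img[np.ix_(y0, x1)] * wx
    bot = img[np.ix_(y1, x0)] * (1 - wx) + img[np.ix_(y1, x1)] * wx
    return top * (1 - wy) + bot * wy

rng = np.random.default_rng(0)
thermo = rng.random((480, 640))              # stand-in for a thermogram
resized = resize_bilinear(thermo, 224, 224)  # VGG-family input size
print(resized.shape)  # (224, 224)
```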
For feature extraction in our study, the popular VGG19 CNN was used [58]. The VGG19 network, developed by the Oxford Visual Geometry Group, is a popular CNN for computer vision tasks because of its high performance and relative simplicity. The detailed framework, along with the network architecture, can be seen in Figure 3. The pre-trained VGG19 model was used to extract 4096 features from the 'FC1' layer of each thermogram image. The network layer used for feature extraction was a hyperparameter in this experiment, tuned to obtain the best result, which is a common exercise. We empirically found that the FC1 layer gave the best result, similar to other studies that investigated k-means clustering [59,60]. The number of features needed to be reduced so that the k-means algorithm did not overfit and remained robust to noise. PCA was used to transform and reduce the features such that only the most important features were retained [55]. Figure 4 shows the cumulative variance versus the number of components after PCA. It can be seen that the first 10 components provided a cumulative variance of 0.5, and after adding 90 more components, the variance rose to 0.9. Thus, significant information was retained in the initial components.
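The PCA step can be sketched with scikit-learn; the synthetic low-rank matrix below is a hypothetical stand-in for the 167 × 4096 FC1 feature matrix, and passing a float to `n_components` keeps just enough components to reach that fraction of cumulative variance:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical stand-in for the FC1 features: 167 thermograms x 4096 dimensions,
# built low-rank plus noise so that a few components carry most of the variance
feats = rng.normal(size=(167, 60)) @ rng.normal(size=(60, 4096))
feats += 0.1 * rng.normal(size=feats.shape)

# n_components=0.75 keeps the smallest number of components whose
# cumulative explained variance reaches 75%, as described in the text
pca = PCA(n_components=0.75, svd_solver="full")
reduced = pca.fit_transform(feats)
cum_var = np.cumsum(pca.explained_variance_ratio_)
print(reduced.shape, cum_var[-1] >= 0.75)
```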

In our study, we used 45 components, which achieved 75% cumulative variance, usually an acceptable threshold in various k-means clustering studies [61][62][63][64]. The reduced and transformed features were separated into different groups using k-means clustering [65]. This is an unsupervised machine learning algorithm that is popular for dealing with non-labeled data [66,67]. The aim of k-means is to group data that are similar in feature space. The algorithm flow can be seen in Algorithm 1 below. In our study, max_iter was set to 500, and the loop terminated early only once the condition in line 4 was fulfilled for all clusters. The algorithm aimed to minimize the inertia, i.e., the sum of squared Euclidean distances from each centroid to its associated data points, given in Equation (1):

Inertia = Σ_{i=1}^{k} Σ_{j=1}^{n} ‖x_j − c_i‖²    (1)
where k is the total number of clusters, n is the number of samples associated with a cluster, x is the position of the sample in the feature space, and c is the position of the centroid in the feature space.
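As a minimal numeric check of this inertia definition, the sketch below evaluates it on a hypothetical toy assignment of four 2-D points to two clusters (all values are illustrative, not from the dataset):

```python
import numpy as np

def inertia(X, labels, centroids):
    """Eq. (1): sum of squared Euclidean distances from each sample
    to the centroid of the cluster it is assigned to."""
    return float(sum(np.sum((X[labels == i] - c) ** 2)
                     for i, c in enumerate(centroids)))

X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[0.0, 1.0], [10.0, 1.0]])
print(inertia(X, labels, centroids))  # 4.0 (each point is 1 unit from its centroid)
```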
K-means++ [68] was used to initialize the centroid position. This method helped to achieve good clustering performance and reduced computational complexity. In this method, the first centroid was selected from the data with uniform probability. The other centroids were selected from the data with a probability proportionate to their distance from the nearest centroid. Hence, the initial centroids were close to the data points but were far apart from each other.
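The clustering setup described above (k-means++ seeding, max_iter = 500) can be sketched with scikit-learn's `KMeans`; the random matrix is a hypothetical stand-in for the 45 PCA components of the 167 thermograms, and k = 3 mirrors the mild/moderate/severe grouping:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Hypothetical stand-in for the 45 PCA components of 167 thermograms
feats = rng.normal(size=(167, 45))

# init="k-means++" and max_iter=500 mirror the settings in the text;
# n_init restarts the algorithm and keeps the lowest-inertia solution
km = KMeans(n_clusters=3, init="k-means++", max_iter=500,
            n_init=10, random_state=0).fit(feats)
print(km.cluster_centers_.shape, sorted(set(km.labels_)))
```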
As stated earlier, the labeled dataset provided by k-means clustering was used for the investigation using (i) 2D CNNs with transfer learning and image enhancement, and (ii) classical machine learning techniques with feature engineering on the features extracted from the thermograms.

Two-Dimensional CNN-Based Classification
Two-dimensional CNNs are widely used in biomedical applications for the automatic and early diagnosis of anomalies such as COVID-19 pneumonia, tuberculosis, and other diseases [69,70]. A labeled dataset can be divided into training and testing datasets, with the training dataset used to train the network and the unseen testing dataset used to verify its performance. During the training process, a portion of the training dataset is used for validation to avoid overfitting. Five-fold cross-validation was employed in this study: the dataset was divided into five folds, and the performance metrics were computed on the testing fold five times. Each time, one of the folds was used as the testing dataset, and the remaining folds were used for training and validation. This approach helped in reporting results over the complete dataset while making sure the test data were always unseen. The final results were the overall, weighted results of the five folds; the detailed performance metrics are shown later. As a larger amount of training data generally helps in obtaining a better-trained model, the authors used popular augmentation techniques (rotation and translation) to increase the training data size. The rotation operation used for image augmentation was performed by rotating the images in the clockwise and counter-clockwise directions with angles from 5° to 30° in increments of 2°. Image translation was conducted by translating images horizontally and vertically by −15% to 15%. The details of the training, validation, and testing datasets for the 2D binary and severity classifiers are shown in Table 1. Transfer learning: As we had a limited dataset (Table 1), we could make use of models pre-trained on the very large ImageNet database [71], which have good classification performance.
These networks can be further trained on another classification problem, which is known as transfer learning. Based on an extensive literature review and previous performance [37], seven well-known pre-trained deep CNN models were used in this study for thermogram classification: ResNet18, ResNet50, ResNet101 [72], DenseNet201 [72], InceptionV3 [73], VGG19 [58], and MobileNetV2 [74].
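The five-fold protocol described above can be sketched with scikit-learn's `StratifiedKFold`; a logistic regression on random stand-in features replaces the CNN here purely to illustrate the fold mechanics (every sample appears exactly once in a test fold), since the actual CNN fine-tuning in Keras follows the same split:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 10))            # hypothetical stand-in features
y = rng.integers(0, 3, size=150)          # stand-in mild/moderate/severe labels

# Each fold serves exactly once as the unseen test set
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_acc = []
for train_idx, test_idx in skf.split(X, y):
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    fold_acc.append(accuracy_score(y[test_idx], clf.predict(X[test_idx])))
print(len(fold_acc))
```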
Image enhancement: It was found that image enhancement techniques such as adaptive histogram equalization (AHE) [75] and gamma correction [70,76] can help a 2D CNN improve its classification performance for thermograms [37]. Some samples of the image enhancement on DM and CG thermograms can be seen in Figure 5. The authors investigated the performance improvement made possible by the different image enhancement techniques, and the results are reported in the Experimental Results section.
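The two enhancement operations can be sketched in NumPy; for brevity the sketch applies global histogram equalization (AHE applies the same mapping within local windows), and the random array is a hypothetical stand-in for a thermogram normalized to [0, 1]:

```python
import numpy as np

def gamma_correct(img, gamma):
    """Power-law enhancement on an image normalized to [0, 1]."""
    return np.clip(img, 0.0, 1.0) ** gamma

def hist_equalize(img, bins=256):
    """Global histogram equalization: remap intensities through the CDF."""
    hist, edges = np.histogram(img.ravel(), bins=bins, range=(0.0, 1.0))
    cdf = hist.cumsum().astype(float)
    cdf /= cdf[-1]                         # normalize CDF to [0, 1]
    return np.interp(img.ravel(), edges[:-1], cdf).reshape(img.shape)

rng = np.random.default_rng(0)
thermo = rng.random((64, 64))              # stand-in thermogram in [0, 1]
enhanced = hist_equalize(gamma_correct(thermo, 0.8))
print(enhanced.shape)
```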

Classical Machine Learning Approach
This section discusses the features extracted for classical ML techniques, feature reduction techniques, feature ranking techniques, machine learning classifiers, and the extensive investigations performed using two approaches.


Extracted Features and Feature Reduction
The authors looked through the literature carefully to summarize the features used in clinical practice and in machine learning approaches to analyze foot thermograms for diabetic foot diagnosis. The details of the final list of features identified by the authors were presented in their previous work [37] and are also given below, where CG_ang and DM_ang are the temperature values of the angiosome for the control group and a DM subject, respectively.
Estimated Temperature (ET) = (a_{j−1}·C_{j−1} + a_j·C_j + a_{j+1}·C_{j+1}) / (a_{j−1} + a_j + a_{j+1})    (3)
Estimated Temperature Difference (ETD) = |ET_left angiosome − ET_right angiosome|    (4)
Hot Spot Estimator (HSE) = |c_l − ET|    (5)
A histogram of the percentage of pixels in the thermogram (either the complete foot or the angiosomes) falling in the different classmark temperatures (C_0 = 26.5 °C, C_1 = 28.5 °C, C_2 = 29.5 °C, C_3 = 30.5 °C, C_4 = 31 °C, C_5 = 32.5 °C, C_6 = 33.5 °C, and C_7 = 34.5 °C) was generated to compute the parameters in Equations (3)-(5). The classmark temperature and the associated percentage of pixels in that region are denoted by C_j and a_j, respectively. The percentages of pixels in the neighboring classmark temperatures C_{j−1} and C_{j+1} are represented by a_{j−1} and a_{j+1}, respectively.
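The histogram-based features can be sketched as follows; note two assumptions in this sketch that the text does not pin down explicitly: j is taken as the classmark bin holding the largest share of pixels, and c_l is taken as the highest classmark temperature. The histogram values are purely hypothetical:

```python
import numpy as np

# Classmark temperatures from the text (°C)
C = np.array([26.5, 28.5, 29.5, 30.5, 31.0, 32.5, 33.5, 34.5])

def estimated_temperature(a, j):
    """ET around classmark j (Eq. (3)); a holds the pixel percentages."""
    num = a[j-1] * C[j-1] + a[j] * C[j] + a[j+1] * C[j+1]
    return num / (a[j-1] + a[j] + a[j+1])

# Hypothetical pixel-percentage histograms for the left and right foot
a_left  = np.array([2, 5, 10, 40, 25, 10, 5, 3], dtype=float)
a_right = np.array([1, 3, 6, 15, 30, 30, 10, 5], dtype=float)

# Assumption: j is the dominant histogram bin (here an interior bin)
et_l = estimated_temperature(a_left,  int(np.argmax(a_left)))
et_r = estimated_temperature(a_right, int(np.argmax(a_right)))
etd = abs(et_l - et_r)       # Eq. (4)
hse = abs(C.max() - et_l)    # Eq. (5), assuming c_l is the hottest classmark
print(round(et_l, 2), round(etd, 2), round(hse, 2))  # 30.53 0.97 3.97
```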
In addition to these parameters, the authors formulated NRTclass j, which was the number of pixels in the class j temperature range over the total number of non-zero pixels, where class j could be class 1 to 5. This parameter is visually very important for distinguishing the variation in the plantar temperature distribution, and was also reported in the authors' previous work [37].
Thus, the TCI, the highest temperature value, NRT (classes 1-5), HSE, ET, ETD, and the mean, median, and SD of temperature for the distinct angiosomes (LPA, LCA, MPA, and MCA) and the full foot were among the 37 features that could be employed for early diabetic foot identification. In their prior study [37], the authors published the statistics of these features for the data provided by the source [39].
By determining the association between the various features, the final list of features was streamlined to eliminate redundant features. Features with a correlation of greater than 95% were deleted, improving overall performance by lowering the number of redundant features and preventing overfitting [77,78].
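The correlation-based pruning can be sketched in NumPy; the data below are hypothetical, with one near-duplicate column appended so that the 95% threshold has something to remove:

```python
import numpy as np

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 5))
# Append a near-duplicate of column 0 so the filter has something to drop
X = np.column_stack([base, base[:, 0] + 1e-3 * rng.normal(size=200)])

corr = np.abs(np.corrcoef(X, rowvar=False))
upper = np.triu(corr, k=1)                      # each feature pair counted once
# For every highly correlated pair, drop the later feature of the pair
drop = sorted({j for i, j in zip(*np.where(upper > 0.95))})
X_kept = np.delete(X, drop, axis=1)
print(drop, X_kept.shape)
```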

Classical Machine Learning Approach 1: Optimal Combination of Feature Ranking and Number of Features
After this optimization, the dataset's shortlisted parameters were evaluated to determine the best features for severity classification. Multi-tree extreme gradient boosting (XGBoost) [89], random forest [90], extra trees [91], chi-squared [92], the Pearson correlation coefficient [93], recursive feature elimination (RFE) [94], logistic regression [95], and LightGBM [96] were used to identify three different sets of feature rankings. A rigorous search over the top-ranked features from the different feature-ranking methodologies identified the combination of features that offered the best performance.
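One simple way to combine several ranking methods, in the spirit described above, is to count how often each feature appears in a method's top-k list; the feature names and rankings below are illustrative, not the study's actual results:

```python
from collections import Counter

def top_k_vote(rankings: dict, k: int):
    """Count, for each feature, how many ranking methods place it in
    their top-k list; return (feature, votes) pairs sorted by votes."""
    votes = Counter()
    for ranked in rankings.values():
        votes.update(ranked[:k])
    return votes.most_common()

# Hypothetical rankings from three of the methods named in the text
rankings = {
    "random_forest": ["TCI", "mean_MPA", "HSE_full", "ET_LPA"],
    "chi_squared":   ["TCI", "ET_LPA", "mean_MPA", "NRT_c4"],
    "rfe":           ["mean_MPA", "TCI", "NRT_c4", "HSE_full"],
}
result = top_k_vote(rankings, k=3)
```

Features ranked highly by every method (here TCI and mean_MPA) are natural candidates for the final shortlist.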

Classical Machine Learning Approach 2: Stacking-Based Classification
We presented a stacking-based classifier that merges several best-performing classifiers created by different learning algorithms L_1, ..., L_N on a single dataset S, consisting of examples s_i = (x_i, y_i), i.e., pairs of feature vectors (x_i) and their classifications (y_i). In the first phase, a collection of base-level classifiers M_1, ..., M_N was trained, where M_i = L_i(S). In the second phase, a meta-level classifier was trained to integrate the outputs of the base-level classifiers (Figure 6). A cross-validation approach was used to build the training set for the meta-level classifier: for five-fold cross-validation, each base-level learning algorithm was applied to four folds of the dataset, leaving one fold for testing. Each base-level classifier predicted a probability distribution over the possible class values. When applied to an example x, the prediction of a base-level classifier M was a probability distribution in which p_M(c | x) signifies the probability that example x belongs to class c, as estimated by M.
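This two-phase scheme maps directly onto scikit-learn's StackingClassifier, which trains the meta-level model on out-of-fold probability outputs of the base-level models. The synthetic data, base learners, and logistic-regression meta-learner below are illustrative stand-ins, not the study's exact configuration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for dataset S of (x_i, y_i) pairs with three classes
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Phase 1: base-level classifiers M_1..M_N.  Phase 2: the meta-level
# classifier is trained on their cross-validated predict_proba outputs.
stack = StackingClassifier(
    estimators=[
        ("gb", GradientBoostingClassifier(n_estimators=50, random_state=0)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5, stack_method="predict_proba")
stack.fit(X_tr, y_tr)
score = stack.score(X_te, y_te)
```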

Performance Evaluation and Classification Scheme
In all of our experiments, we report sensitivity, specificity, precision, accuracy, F1-score, and area under the curve (AUC) over five folds as our evaluation metrics.

Here, TP, FP, TN, and FN denote true positives, false positives, true negatives, and false negatives, respectively. For any class, TP is the number of correctly identified thermograms of that class, TN is the number of correctly identified thermograms of the other classes, FP is the number of thermograms misclassified into that class, and FN is the number of thermograms of that class misclassified into other classes. In this paper, weighted performance metrics with 95% confidence intervals are reported for sensitivity, specificity, precision, and F1-score; for accuracy, the overall accuracy is reported.
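For reference, these metrics reduce to the usual confusion-matrix ratios; a minimal sketch for a single class treated as positive, with illustrative counts:

```python
def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard ratios from one-vs-rest confusion counts."""
    sensitivity = tp / (tp + fn)            # recall of the positive class
    specificity = tn / (tn + fp)            # recall of the negative class
    precision   = tp / (tp + fp)
    accuracy    = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {"sensitivity": sensitivity, "specificity": specificity,
            "precision": precision, "accuracy": accuracy, "f1": f1}

m = binary_metrics(tp=80, fp=5, tn=90, fn=10)
```

For the multi-class setting, the study reports class-weighted averages of these per-class values, with accuracy reported overall.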
All experiments were performed on the Google Colab platform using an NVIDIA (Santa Clara, CA, USA) GeForce RTX 2070 Super GPU. MATLAB (MathWorks, Natick, MA, USA) was also used.

Experimental Results
This section presents the results of the key experiments of the study.

K-Mean Clustering Unsupervised Classification
As stated in the previous section, k-mean clustering was used to classify the thermogram images in an unsupervised fashion. This was conducted using the features extracted from the VGG-19 network, which were reduced and transformed using PCA. K-mean clustering was applied for various cluster numbers, i.e., k in Algorithm 1. The optimal number of clusters would correspond to the correct severity classes of the thermograms, later verified by medical experts. T-distributed stochastic neighbor embedding (t-SNE) plots, which are very useful for visualizing the clustering of data in two dimensions [97,98], are shown for the various cluster counts in Figure 7. t-SNE is a dimensionality reduction method used to visualize high-dimensional data; in this work, the high-dimensional feature space was reduced to two dimensions using t-SNE and plotted as a scatterplot. There was a clear distinction in clustering for cluster sizes two and three compared to a cluster size of five (the original classification performed using TCI scores, i.e., class one to class five). As stated in Figure 4, 45 components were used in the k-mean clustering.
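The pipeline described above (deep features, PCA down to 45 components, then k-mean clustering for several candidate k) can be sketched as follows; the random features stand in for the VGG-19 embeddings and are purely illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Stand-in for VGG-19 features of 244 thermograms (dimension is illustrative)
features = rng.normal(size=(244, 512))
# Shift hidden groups apart so the clustering has structure to find
features[:82, :10] += 4.0
features[82:166, :10] -= 4.0

# Reduce to 45 components, as stated in the text
reduced = PCA(n_components=45, random_state=0).fit_transform(features)

# Try several cluster counts; inertia (within-cluster SSE) guides the choice
inertias = {}
for k in (2, 3, 5):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(reduced)
    inertias[k] = km.inertia_

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(reduced)
```

In the study itself, the choice of three clusters was confirmed by medical experts rather than by inertia alone.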
The clusters found using the k-mean clustering approach were confirmed by experts in the diabetic foot complication domain as thermograms of patients with mild, moderate, and severe severity, as shown in Figure 8. The experts used their experience with the plantar foot temperature distribution of diabetic patients as the reference for confirming the severity classification. The k-mean clustering provided three differentiable sets of thermograms. Plantar foot thermograms with a slightly deviated butterfly pattern were labelled as mild by the medical experts. The second set of thermograms, showing an abnormal, elevated temperature distribution, was labelled as moderate, as the temperatures were still not very high (usually indicated by the red color distribution in thermograms) and could indicate the possibility of ulcer development in the near future. The third set of thermograms showed extremely high temperature (again indicated by the red color distribution) throughout the foot, indicating the possibility of ulcer development in the immediate future, and was labelled as severe. Such clustering was not found in the original classification using the TCI-based method (i.e., class one to class five). The clustering algorithm separated the images into three clusters: the mild class had 82 images, the moderate class had 84 images, and the severe class had 78 images.
The output of the k-mean clustering can be further understood with the help of the pie charts in Figure 9, which show the distribution of the original classes across the clusters found by the k-mean clustering approach. Many class one thermograms were very similar to class five thermograms (see the severe class in the pie chart), and many class three, class four, and class five thermograms were similar to class one and class two (see the mild class in the pie chart).

Figure 9. Class 1, 2, 3, 4, and 5 distribution in the k-mean clustering categories: mild, moderate, and severe.

Classical Machine Learning-Based Classification
Once the new labeled dataset was confirmed using the k-mean clustering approach and verified by the medical experts, the 37 extracted features were reduced, by removing the highly correlated ones, to 27 features: NRT (class one), NRT (class two), NRT (class three), NRT (class four), NRT (class five), the highest temperature, and the TCI; the HSE, ETD, mean, and STD of the MPA angiosome; the HSE, ET, ETD, mean, and STD of the LPA angiosome; the HSE, ET, ETD, mean, and STD of the LCA angiosome; the HSE, ETD, and STD of the MCA angiosome; and the HSE, ETD, and STD of the full foot. The heatmaps of the correlation matrix with all features and after removing the highly correlated features are shown in Supplementary Figure S1. In this experiment, 3 feature selection techniques and 10 machine learning models were investigated with the 27 optimized features, yielding 810 investigations to identify the best combined results. The best-performing combination is presented in Table 2: the XGBoost classifier with the random forest feature selection technique and the top 25 features showed the best performance, with a 92.63% weighted F1-score, in the diabetic severity classification. As stated earlier, we shortlisted the major contributing features from the 29 finalized features after the feature reduction. The shortlisting was performed by finding the features categorized as highly correlated with the output by all of the techniques: Pearson, chi-squared, RFE, logistic regression, random forest, and LightGBM. Table 3 shows the features with a total vote of more than four; only eight features (TCI, NRT (class four), NRT (class three), the mean of MPA, the mean of LPA, the ET of LPA, the mean of LCA, and the highest temperature) were top-ranked by all of the techniques.
These eight features were used to check the performance of different machine learning classifiers, and stacking performed on the top three classifiers provided the best result, as shown in Table 4. The stacking classifier created using the trained gradient boost classifier, XGBoost classifier, and random forest classifier was the best-performing classifier, with 94.47%, 94.45%, 94.47%, 94.43%, and 93.25% for accuracy, precision, sensitivity, F1-score, and specificity, respectively. This approach performed better than Approach 1 (Table 2); the only difference was the inference time, which was higher for the stacking classifier, as expected.

Two-Dimensional CNN-Based Classification
As discussed earlier, the authors investigated state-of-the-art transfer learning networks (ResNet18, ResNet50, VGG19, DenseNet201, InceptionV3, and MobileNetv2) along with popular image enhancement techniques. The best-performing network and enhancement type (AHE) were used for severity classification. Gamma correction did not improve the performance, as it did not help in sharpening the distinguishing features; further discussion of the image enhancement techniques is provided in the Discussion section. Independent foot images were used to check whether the different pre-trained networks could classify them into different severity levels. Table 5 reports the best performance for AHE, with VGG19 showing the best results. It should be noted that the improved performance came with the tradeoff of a slightly higher inference time, caused by the addition of the image enhancement technique as a pre-processing step. However, this increase in inference time was only 1 ms, which would not affect even real-time classification applications; the improved performance is worth this small delay. The ROC curves for the original and AHE thermograms are shown in Figure 10. Figure 10. AUC for the original and best-performing AHE thermogram in severity classification.
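As a simplified illustration of the enhancement step, global histogram equalization, of which AHE is the tile-wise variant, can be sketched in NumPy; the toy low-contrast image is illustrative, not data from the study:

```python
import numpy as np

def equalize_hist(img: np.ndarray, n_bins: int = 256) -> np.ndarray:
    """Global histogram equalization for an 8-bit grayscale image.
    AHE applies the same remapping independently within local tiles,
    which boosts local contrast rather than global contrast."""
    hist, _ = np.histogram(img.ravel(), bins=n_bins, range=(0, 256))
    cdf = hist.cumsum().astype(np.float64)
    cdf = (cdf - cdf.min()) / (cdf.max() - cdf.min())  # normalize to [0, 1]
    lut = np.round(cdf * 255).astype(np.uint8)         # lookup table
    return lut[img]

# Toy thermogram-like image: gradient confined to a narrow intensity band
img = np.tile(np.linspace(40, 90, 64, dtype=np.uint8), (64, 1))
out = equalize_hist(img)   # intensities are stretched toward the full range
```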

Discussion
As mentioned earlier, the authors conducted an extensive investigation and developed a trained AdaBoost classifier that achieved an F1-score of 97% in classifying diabetic and healthy patients using thermogram images [37]. The diabetic foot thermogram network (DFTNet) proposed a deep learning technique to classify the images of the plantar thermogram database using an inconvenient classification scheme, in which results were obtained by taking two classes at a time and then averaging over ten folds of the different two-class combinations [38]. As discussed earlier, the ground-truth classification of the patients in this database was performed using the TCI score and relied entirely on the reliability of the TCI scoring technique. These previous works, together with the lack of severity grading of diabetic foot complications, motivated this work.
Again, to the best of the authors' knowledge, no previous studies have investigated image enhancement techniques for diabetic foot severity classification using thermograms with 2D CNNs. The authors investigated different pre-trained networks and found that the image enhancement techniques improved the classification performance. Figure 11 confirms the improved performance of the image enhancement technique using the F1-score metric.
Even the classical machine learning approach provided a strong classification performance using the TCI, NRT (class four), NRT (class three), mean of MPA, mean of LPA, ET of LPA, mean of LCA, and highest temperature features extracted from the thermogram. The stacked classifier using the popular gradient boost, XGBoost, and random forest classifiers provided a comparable classification performance with a much lower inference time (Tables 4 and 5). The following interesting results can be summarized: AHE, due to its special equalization, helped with the severity classification using the pre-trained VGG 19 network.
A deeper network such as DenseNet201 did not improve the performance with image enhancement, which could be attributed to the simplistic nature of the thermogram images, which do not require a very deep network to extract meaningful features.
The proposed features [37] and the new classification technique helped in classifying the diabetic thermograms into different severity groups.
Figure 11. Comparison of F1-score between original and AHE-enhanced thermogram images using the 2D severity classifier.

Conclusions
Diabetic foot problems have great consequences, not only in terms of mortality, but also in terms of the expense of monitoring and controlling the disease. Thus, early detection and severity classification can help in preventing such complications. The deployment of machine learning in biomedical applications can provide easy-to-use solutions for early detection, saving medical experts' time and serving patients in home settings, especially during pandemic times, when visits to healthcare services should be limited to avoid stress on the healthcare system. The authors proposed a novel framework to cluster diabetic thermograms based on severity, which was not present in the literature, and the labels that were available needed verification. The trained 2D CNN and classical machine learning models can help in severity stratification using foot thermograms, which can be captured using infrared (IR) cameras. To the best of the authors' knowledge, this is the first study to analyze such a diabetic foot severity classifier for the early and reliable stratification of diabetic foot. The machine learning classifier performance was comparable to the 2D CNN performance with image enhancement. In conclusion, such a system could easily be deployed as a web application, and patients could benefit from remote health care using just an infrared camera and a mobile application, which is a future direction of our research.

Supplementary Materials:
The following supporting information can be downloaded at https://www.mdpi.com/article/10.3390/s22114249/s1: Figure S1: heatmap of the correlation matrix with all the features (A) and after removing the highly correlated features (B).