Towards a Better Understanding of Transfer Learning for Medical Imaging: A Case Study

Featured Application: The proposed intelligent medical system is applicable to medical diagnostic systems, especially for the diagnosis of diabetic foot ulcer (DFU).


Introduction
Over the last two decades, cases of diabetes mellitus (DM) have increased noticeably across global public health systems [1,2]. In 1985, 2000, and [...], there were 30, 177, and 185 million cases, respectively [3,4]. Epidemiological studies estimate that the number of patients with DM will exceed 360 million by 2030. Patients with DM can acquire numerous complications such [...] needs to be addressed if performance is to be improved. One of the best solutions to the lack of training data is transfer learning (TL). TL is a technique that stores knowledge obtained while solving one task and applies it to a different task. Most medical imaging classification tasks that have utilized TL transferred it from models trained on the ImageNet dataset, which consists of natural images (such as pens, cars, and animals). This source task is unrelated to medical tasks. To boost performance, TL should come from a related task. For example, the knowledge obtained while learning to classify lung diseases could be applied when trying to recognize COVID-19 in the lung. In this paper, we employ TL to address the lack of training data for DCNN models, and then we investigate the benefit of using same-domain and different-domain sources for TL.
This paper is organized as follows: Section 2 reviews convolutional neural networks (CNNs) in image classification and the state-of-the-art DCNNs models. Section 3 describes the challenges and research problem. Section 4 lists the aims and the contributions of our work. Section 5 explains the methodology. Section 6 presents the experimental results. Finally, the conclusion of the work is drawn in Section 7.

Review of the State-of-the-Art
As there are very limited research papers related to deep learning applications in DFU, we review the role of CNNs in image classification as explained in subsection A. In subsection B, we review several deep convolutional neural networks (DCNNs) architectures and the advantage of each architecture.

CNNs in Image Classification
Image classification is a significant task in computer vision (CV) that has been researched for several years [30,31]. It is used as a primary task in different application areas including event detection [32], scene understanding [33], and object tracking [34]. In terms of approaching human accuracy [35], machine learning is the most promising technique compared to other available approaches [31,36]. As deep learning developed, CNNs were introduced as a new state-of-the-art concept in image classification [30,35,37]. This type of network can overcome several challenging issues in image classification such as occlusion, deformation, background clutter, and changes in scale and viewpoint. The most interesting part of CNNs is that the feature extractor and the classifier are combined in a single model. In contrast, traditional machine learning methods have two separate steps: first, handcrafted techniques extract features; second, the extracted features are used to train a classifier such as K-nearest neighbor (KNN) [36] or a support vector machine (SVM) [38]. Another benefit of CNNs is that they can work with binary or multi-class classification. CNNs have shown extraordinary achievements in several pattern recognition and CV tasks [30,37].
Krizhevsky et al. [30] developed the CNN further when they introduced the AlexNet network. Several architectures were introduced after the achievements of AlexNet, such as VGG-Net [37], GoogLeNet [39], and ResNet [35]. Due to the success of these models, the majority of recently proposed CNNs are based on them and enhance performance by adding extra convolutional layers [37,39]. In general, a CNN architecture for classification tasks involves several convolutional layers (at the beginning) and fully connected layers (on the top) stacked one over the other. These CNNs extract features via the convolutional layers and execute the classification task with the fully connected layers [37,40,41].
The number of layers, or depth, of a CNN plays a critical role in building a superior classification model, as its learning capacity is controlled by changing its depth [42]. The models proposed in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) showed that accuracy increases as the model becomes deeper [30,35,37]. Thus, as the depth is enlarged, accuracy improves up to a saturation level [35]; beyond that point, increasing the depth (by stacking extra layers) does not enable the CNN model to reduce the error [35]. As an alternative to the usual architecture of stacked layers, ResNet introduced a less complicated structure with up to 152 layers (a deeper network), while GoogLeNet introduced a hierarchical structure of convolutional layers for classification. It is therefore important to identify a well-behaved architecture while considering both the form of learning and the model depth. The chosen CNN architecture must also generalize well, that is, avoid overfitting while retaining sufficient learning ability for the existing dataset. Several attempts have been made to classify undersized datasets using CNNs [43]; although these techniques significantly enhanced performance, they faced the problem of overfitting due to direct training from scratch. Therefore, the question of whether shallow architectures have sufficient ability to capture the full set of features for undersized data remains unanswered: how deep must the CNN architecture be to train with undersized data? Current CNN models avoid direct training and enhance performance using TL techniques, which help to address the lack of training data.

Deep Convolutional Neural Networks (DCNNs)
CNNs with a large number of layers are defined as DCNNs [44]. The gain of these DCNNs is better feature extraction for distinguishing between classes [44]. DCNNs attracted great attention following the results of the ImageNet classification challenge [30]. DCNNs (as feature extractors) play an essential part in different CV tasks such as image retrieval [45], object recognition [46], and image recognition [47]. Nevertheless, the architectural development of DCNNs is still an engineering challenge, especially in selecting new configurations of network layers and hyper-parameters [46]. Hence, studying and designing better networks is necessary [35,39]. DCNNs have advanced rapidly in different aspects, including activation functions [48], regularization strategies [49], and optimization techniques [50]. In particular, the latest network architectures [37,51] show that classification performance is greatly enhanced by redesigning the DCNN structure in a way that facilitates deep feature learning. Furthermore, increasing the depth of traditional networks such as AlexNet [30] and VGG [37] to different degrees, together with training on a large dataset, efficiently enhances their representation power. However, a degradation problem can occur when deeper networks start to converge (i.e., accuracy becomes saturated and then degrades quickly as the depth of the network increases) [35]. Hence, increasing the depth by itself is not always a solution. In recent years, GoogLeNet [39] and ResNet [35] have tried to solve the optimization issue of ultra-deep networks by proposing bypass paths or identity connections. In the meantime, numerous alternatives have been developed to improve the ResNet architecture, such as ResNet in ResNet [52] and Wide ResNet [53]. Veit et al. [54] noted that ResNet acts as an exponential ensemble of moderately shallow networks, whereas both GoogLeNet [37,55,56] and ResNet [35,52,53] are a combination of several dependent networks. Veit et al. [54] also indicated that short paths help ResNet to avoid the vanishing gradient issue, which is consistent with the analysis in FractalNet [57] and the deep fused network [58]. Furthermore, the DenseNet network has identity connections that concatenate the layers within it, which allows the network to exploit its potential fully through feature reuse. Wang et al. [59] indicated that networks in which several branches are fused (by either concatenation or summation) in the intermediate layers have several benefits, such as (1) the ability to generate several base networks that share parameters; (2) the ability to optimize the information flow; and (3) an enhanced training process for the deep network. For instance, the inception module in GoogLeNet can be seen as a fusion stage in which several sub-networks of various lengths are concatenated. Its architecture comprises a series of inception modules, which can be considered a type of deep concatenation fusion. Hence, except for single-branch networks like VGG and AlexNet, other networks like GoogLeNet, ResNet, and the recently introduced Highway network are considered deep fused networks. The representative power of these models is effectively enhanced. However, these DCNN models cannot achieve better performance without training on a large amount of data. Thus, addressing the lack of training data is urgently needed. Models like AlexNet, VGG, GoogLeNet, and ResNet have been fine-tuned for various CV tasks using their previous learning from the ImageNet dataset. Although these models have shown good performance for different CV tasks, the images of ImageNet, as a source of TL, are different from medical images and may therefore offer limited benefit to the medical community. To examine this issue, we have implemented several experiments in this paper.
Based on the study of the architectures mentioned above, we have designed our proposed model by combining different architectural ideas inspired by GoogLeNet and ResNet, with some enhancements such as adding a global average pooling layer. We also propose a solution to tackle the lack of training data.

Challenges and Research Gap
In this section, we present the challenges of DFU classification and research problem of employing TL.

Challenges in DFU Classification
The automatic classification of DFU has several challenges including:
• Lack of training data due to costly and time-consuming data collecting and labeling by experts.
• Low contrast between target objects and background.
• Heterogeneous and complex shapes.
• Lack of a robust and effective deep learning model to differentiate between DFU classes.

Research Problem in Transfer Learning
It is difficult to obtain good performance with a deep learning approach due to the massive number of images required for training. In image recognition and classification, a deep convolutional neural network (DCNN) with many layers can achieve excellent results, sometimes better performance than a human, if an enormous volume of data is obtainable [35,60,61]. However, these applications demand large datasets to prevent overfitting and generalize DCNN models properly.
There is no minimum size for the dataset in training a DCNN, but training with small datasets or using a DCNN with fewer layers prevents the model from being highly accurate because of underfitting or overfitting issues. Models with fewer layers are less accurate because they are unable to exploit the hierarchical features of large datasets. Collecting labeled datasets is extremely costly in fields such as environmental science and medical imaging [62].
In particular, in the field of medical image analysis, most crowdsourcing workers do not have the medical or biological knowledge required to annotate medical or biological images accurately. Machine-learning researchers therefore frequently depend on field specialists to label these images, which is a slow and costly process. As a result, producing enough labels to train deep networks well becomes impracticable.
Researchers have used various techniques to overcome the lack of training data. One of the most common techniques is data augmentation, where data is created virtually [63]. Although such techniques enlarge the dataset by creating further images, convolutional neural network (CNN) models still struggle with overfitting due to repeated images in data augmentation. In recent years, many researchers have employed a TL technique where deep learning models are trained on a large dataset, then fine-tuned on a smaller target dataset [27]. Although TL improves performance in many CV and pattern recognition tasks [64,65], it still has a fundamental challenge, which is the type of source data used for TL compared to the target dataset. For example, DCNN models trained on the ImageNet dataset [60], which comprises natural images, are utilized to enhance the performance of medical image classification tasks. The images of the ImageNet dataset are quite different from medical images, so they may not improve the performance of medical image classification. It has been shown that different-domain TL does not significantly affect performance on medical imaging tasks, with lightweight models trained from scratch performing nearly as well as standard ImageNet-transferred models [66].

Aim and Contribution
In this section, we list the aims and the contributions of our work.

Aim
The aims of this paper are:
• To address the issue of lack of training data for DFU classification.
• To test whether the type of images used for TL affects the performance.
• To improve the performance of the DFU classification task.
• To employ DCNN in the task of DFU classification.

Contributions
In this paper, we investigate the issue of same-domain and different-domain TL through several experiments, using DFU image classification as a case study. We first applied same-domain TL to the DFU classification task. Then, we repeated the same procedure with TL from the natural image dataset. The contributions of the paper are manifold:
• A new dataset has been collected containing 1200 images of feet that have been manually labeled by a DFU expert as normal or abnormal.
• A hybrid deep learning model has been designed that combines traditional and parallel convolutional layers along with residual connections.
• Several training scenarios have been performed with the proposed hybrid model.
• Two pre-trained deep learning models (VGG19, ResNet50) have been trained with the target datasets.
• It has been empirically shown that TL from the same domain as the target dataset can significantly improve performance.
• The performance of DFU classification has been improved, attaining an F1-score of 97.6%.

Methodology
This section consists of five parts: datasets, CNNs, transfer learning, the proposed model, and training scenarios.

Datasets
In this paper, we utilized four datasets. Two datasets represented the target datasets while the other two datasets were employed for TL purposes.

Target Datasets
We presented two target datasets from different domains. Both datasets have approximately the same number of images for a fair comparison. The aim of using these datasets is to test the concept of TL from the same and different domain datasets. For both datasets, we divided them into 80% for training and 20% for testing.

DFU Dataset (Dataset A):
The dataset was gathered from Al-Nasiriyah Diabetic and Endocrinology Center in Iraq and we obtained ethical approval and written consent from all relevant persons and patients. The dataset had 1200 images of feet classified as normal and abnormal (DFU). We cropped the region of interest to 224 × 224 as shown in Figure 1. The total number of cropped regions was 1477, with 742 classified as normal and 735 classified as abnormal.

Animal Dataset (Dataset B):
This dataset had 1490 images of cows (766 images) and chickens (724 images) [67]. All images were resized to 224 × 224 to fit the input size of the proposed model. For a fair comparison, we took almost the same amount of data as in the DFU dataset. Figure 2 shows some samples of the dataset.
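The 80%/20% division of the target datasets can be sketched as follows. This is a minimal illustration, assuming a random shuffle with a fixed seed and rounding the training share down; the paper does not state how the split was drawn, and the function name `split_indices` is hypothetical.

```python
import random


def split_indices(n_items, train_fraction=0.8, seed=0):
    """Shuffle the indices 0..n_items-1 and split them into train/test lists."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # fixed seed is an assumption, for repeatability
    n_train = int(n_items * train_fraction)  # training share rounded down (assumption)
    return indices[:n_train], indices[n_train:]


train_a, test_a = split_indices(1477)  # Dataset A: 1477 cropped regions
train_b, test_b = split_indices(1490)  # Dataset B: 1490 images
print(len(train_a), len(test_a), len(train_b), len(test_b))
```

Because every index ends up in exactly one of the two lists, no image is shared between training and testing.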

Pre-Train Datasets
We collected large datasets from different sources. The first dataset (Dataset C) is in the same domain as the first target dataset (Dataset A) while the second dataset (Dataset D) is in the same domain as the second target dataset (Dataset B). The main purpose of these datasets was to pre-train our model for target datasets. Both datasets have images that look similar in features such as color, shape, and size to the target datasets. Dataset C is similar to Dataset A while Dataset D is similar to Dataset B. All images of both datasets were used for training for TL purpose.
Medical Dataset (Dataset C): We collected the dataset from different sources, although all images were in the same domain as the DFU dataset. The first source had 594 images that were classified into 15 wound categories including abdominal wounds, burns, epidermolysis bullosa, extravasation wounds, foot ulcers, hemangioma, leg ulcers, malignant wounds, meningitis, miscellaneous, orthopedic wounds, pilonidal sinus, pressure ulcers (a), pressure ulcers (b), and toes [68]. The second source had 2700 different wound images that were collected from the internet. The third source contained 1000 images of clinical skin diseases that were collected from [69]. The last source had 37,364 images of skin cancer including melanoma, melanocytic nevus, basal cell carcinoma, actinic keratosis, benign keratosis, dermatofibroma, and vascular lesions [70,71]. All images were resized to 224 × 224. Some of the images were divided into two or three sub-images. The final total of images was 50,103. Figure 3 shows some samples of the dataset.

Large Animal Dataset (Dataset D):
This dataset consisted of several classes of animals that were collected from different sources. The first source had 21,215 images in eight classes of animals: dog, cat, horse, spider, butterfly, sheep, squirrel, and elephant [69]. The second source had 8000 images of cats and dogs that were added to those in the first source [72]. The third source had 2099 images of different types of birds [73]. The fourth source had 3000 images of animals such as chimpanzee, ox, and deer [74]. The last source included 16,643 images of wild animals such as panda, zebra, gorilla, giraffe, camel, tiger, bear, lion, elephant, and kangaroo. All images were collected from the internet. The overall number of images in this dataset was 50,957 and they were resized to 224 × 224. Figure 4 shows some samples of the dataset.

Convolutional Neural Networks (CNNs)
Currently, the CNN is considered the best machine-learning (ML) algorithm for analyzing medical images [27,28,29]. The reason is that a CNN preserves spatial relationships after filtering the input images. These relationships are extremely significant in the field of radiology and other medical tasks. A CNN has several types of layers such as convolution, pooling, rectified linear unit (ReLU), and fully connected layers [23]. Generally, its structure consists of a convolutional layer followed by a ReLU layer, a pooling layer, one or more further convolutional layers, and one or more fully connected layers, in that order. The key feature that sets a CNN apart from a normal neural network is that it retains the image structure during processing. The CNN's main layers are described below.

Convolutional Layer
The convolutional layer is named after the convolution operation. In mathematics, convolution is an operation on two functions that yields a third function, which can be viewed as a modified (convolved) form of one of the original functions. The third function is the integral of the pointwise product of the two functions, expressed as a function of the amount by which one of them is translated.
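For reference, the standard textbook definition of convolution is given below, first in continuous form and then in the discrete two-dimensional form commonly used for images. The symbols f, g, I (input image), and K (kernel) are notation assumed here; they do not appear in the original text.

```latex
(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau

(I * K)(i, j) = \sum_{m} \sum_{n} I(i - m,\; j - n)\, K(m, n)
```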
The convolutional layer is composed of groups of neurons that form kernels. Kernels usually have the same depth as the input but a small spatial size. The receptive field is the small input area to which the kernel neurons are connected. In the case of images (high-dimensional inputs), it is impractical to connect every neuron to every output of the previous layer. For instance, if a layer has 100 neurons and each is connected to every pixel of an image of 10,000 pixels (a size of 100 × 100), this yields one million parameters. Thus, a neuron holds weights only for the kernel dimensions, instead of weights for the full input dimension. The kernels slide across the input height and width to extract high-level features, each producing a 2D activation map. The step with which a kernel moves is a parameter called the stride. The resultant activation maps are stacked to form the output of the convolutional layer, which serves as the input to the next layer. For example, applying a convolutional layer to an image of 32 × 32 results in an activation map of 28 × 28 (for instance, with a 5 × 5 kernel and a stride of 1). Applying extra convolutional layers reduces the size further, so the size of the image is significantly reduced. This produces a vanishing gradient problem as well as a loss of information. To overcome this problem, padding is used. Padding enlarges the input by surrounding it with constant values. These constants are usually zero; hence, this operation is called "zero padding". When the spatial dimensions of the output feature map are the same as those of the input feature map, it is called "same padding". Padding is applied equally on the left and right; if the number of added columns is odd, the extra column is added on the right. Using no padding is called "valid padding".
Strides, on the other hand, cause a kernel to skip over some positions, so that not every pixel neighborhood contributes an output value. More specifically, the stride parameter determines the number of positions the kernel moves each time it slides over the input. When larger images and more complex kernels are used, the stride determines how the kernel is applied during convolution. Usually, convolutional layers are followed by a ReLU layer, which increases the nonlinearity of the network without changing the size of the feature maps. It applies the activation function max(0, x).
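The effect of kernel size, stride, and padding on the output size can be illustrated with the standard output-size formula. The sketch below assumes a 5 × 5 kernel for the 32 × 32 example; the kernel size is an assumption, chosen because it reproduces the 28 × 28 result at a stride of 1. The function names are illustrative.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Output length along one spatial dimension of a convolution."""
    return (input_size + 2 * padding - kernel_size) // stride + 1


def relu(x):
    """ReLU activation: max(0, x)."""
    return max(0.0, x)


# Valid padding (no padding): a 32 x 32 input shrinks to 28 x 28 with a 5 x 5 kernel.
print(conv_output_size(32, kernel_size=5))             # 28
# Same padding: two zeros on each side keep the size at 32.
print(conv_output_size(32, kernel_size=5, padding=2))  # 32
# A stride of 2 makes the kernel skip every other position.
print(conv_output_size(32, kernel_size=5, stride=2))   # 14

# Weight counts for the 100 x 100 image example (biases ignored).
dense_weights = 100 * (100 * 100)  # 100 neurons, each connected to every pixel
kernel_weights = 5 * 5             # one 5 x 5 kernel shared across all positions
print(dense_weights, kernel_weights)  # 1000000 25
```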

Pooling Layers
The pooling layer has two primary jobs. The first is reducing the spatial dimensions of the representation and hence the number of computations performed in the network. The second is controlling overfitting. There are three common types of pooling layers. The first type is average pooling, which applies the average operation on a selected window. The second type is max pooling, which takes the maximum value of the selected window. The third type is the global average pooling layer, which reduces each feature map to a single value, thereby reducing a three-dimensional tensor to a one-dimensional tensor. Average and max-pooling layers use a sliding window (such as 2 × 2 or 3 × 3) to reduce the size. The global average pooling layer, however, performs a more extreme dimensionality reduction by collapsing the whole spatial extent to one dimension [75], as illustrated in Figure 5. This layer is more robust to spatial translations and helps avoid overfitting.
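The three pooling types can be sketched with NumPy as follows. This is a minimal illustration, assuming non-overlapping windows (stride equal to the window size) and a height × width × channels layout; the function names are illustrative.

```python
import numpy as np


def pool2d(feature_map, window=2, mode="max"):
    """Non-overlapping pooling (stride == window) over a 2D feature map."""
    h, w = feature_map.shape
    h_out, w_out = h // window, w // window
    trimmed = feature_map[:h_out * window, :w_out * window]
    blocks = trimmed.reshape(h_out, window, w_out, window)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    return blocks.mean(axis=(1, 3))


def global_average_pool(tensor):
    """Reduce an (H, W, C) tensor to a length-C vector, one value per feature map."""
    return tensor.mean(axis=(0, 1))


fmap = np.arange(16, dtype=float).reshape(4, 4)
print(pool2d(fmap, 2, "max"))   # [[ 5.  7.] [13. 15.]]
print(pool2d(fmap, 2, "avg"))   # [[ 2.5  4.5] [10.5 12.5]]
print(global_average_pool(np.ones((4, 4, 3))).shape)  # (3,)
```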

Batch Normalization
During training, updates to the parameters and weights change the distribution of the inputs to every layer in the deep neural network (DNN). The inputs can thus become either too small or too large, which makes the network difficult to train, particularly with saturating nonlinear activation functions such as tanh and sigmoid. In 2015, the concept of batch normalization was proposed by Ioffe and Szegedy [58]. It improves the accuracy of the DNN and reduces the training time. For each mini-batch, batch normalization normalizes the inputs to have zero mean and unit variance.
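The normalization step for one mini-batch can be sketched as follows. The learnable scale `gamma` and shift `beta`, and the small constant `eps`, come from the standard formulation of batch normalization and are not discussed in the text above; the running statistics used at inference time are omitted.

```python
import numpy as np


def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each feature (column) of a mini-batch x of shape (N, D)."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)  # zero mean, (almost) unit variance
    return gamma * x_hat + beta


# Synthetic mini-batch of 64 samples with 4 features, for illustration only.
batch = np.random.default_rng(0).normal(loc=3.0, scale=5.0, size=(64, 4))
normalized = batch_norm(batch)
print(normalized.mean(axis=0), normalized.var(axis=0))
```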

Dropout
A limited number of techniques are used to reduce the risk of overfitting. One of these is the dropout layer [57]. In this layer, units are chosen at random and their outputs are set to zero; therefore, they do not influence the forward pass or backpropagation. Other techniques involve regularization and enlarging the training dataset using label-preserving transformations. Compared to regularization, dropout performs well in accelerating the training process as well as in reducing the risk of overfitting.
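A dropout layer can be sketched as below. This uses the common "inverted dropout" variant, in which the surviving activations are rescaled during training so that nothing needs to change at test time; that rescaling, and the rate of 0.5, are assumptions rather than details given in the paper.

```python
import numpy as np


def dropout(activations, rate=0.5, training=True, rng=None):
    """Zero a random subset of units during training; identity at test time."""
    if not training or rate == 0.0:
        return activations
    if rng is None:
        rng = np.random.default_rng()
    keep = rng.random(activations.shape) >= rate  # True for units that survive
    return activations * keep / (1.0 - rate)      # rescale the surviving units


units = np.ones(10)
print(dropout(units, rate=0.5, rng=np.random.default_rng(0)))
print(dropout(units, training=False))
```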

Fully Connected Layers
These layers are similar to those found in a regular neural network. Each output of the preceding layer is linked to each neuron of the fully connected layer. The computation performed by a fully connected layer is similar to that of a convolutional layer; hence, conversion between the two types of layer is possible.

Loss Layers
These layers form the last layer of the network and penalize any deviation of the prediction from the expected output. Several types of loss layers are available, such as sigmoid cross-entropy and Softmax. Sigmoid is utilized for predicting multiple independent probabilities (each in the interval [0, 1]), while Softmax is employed for predicting a single class from multiple mutually exclusive classes.
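The difference between the two output functions can be illustrated with a short sketch. The logit values below are arbitrary, and the cross-entropy helper is the standard negative log-likelihood, included only to show how a deviation from the expected class is penalized.

```python
import math


def sigmoid(z):
    """Independent probability in [0, 1] for a single output."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(logits):
    """Probabilities over mutually exclusive classes; they sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, true_class):
    """Penalty that grows as the probability of the true class shrinks."""
    return -math.log(probs[true_class])


logits = [2.0, 0.5]  # arbitrary scores for a two-class problem
probs = softmax(logits)
print(probs, sum(probs))
print(cross_entropy(probs, 0), cross_entropy(probs, 1))
print([sigmoid(v) for v in logits])  # independent values, need not sum to 1
```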

Transfer Learning
The lack of training data is a common problem for deep CNNs, and the common solution is transfer learning. More specifically, a model trained for one task captures relationships in the data that can be reused for other tasks in a similar field. By reusing the features of an initially trained model, the learning process starts from parameters that are likely to be close to a good solution for the task under consideration. In other words, TL is the concept of employing knowledge gained on a specific task to solve other related tasks, overcoming the isolated learning paradigm [33].
TL helps improve accuracy for image classification tasks by using a large dataset for learning features, as in [33]. Many researchers have demonstrated that the use of TL in medical image classification tasks is effective and efficient [25,27]. Additionally, training a CNN model from scratch on an insufficient dataset will not achieve significant outcomes; TL offers an alternative for improving outcomes.
Solving complicated issues in deep learning models requires a massive amount of data. Supervised models demand large amounts of labeled data, which is an extremely difficult task due to the effort and time taken to collect and label data. Thus, this issue established the motivational basis for TL and its outstanding performance in the medical sector inspired us to utilize it.
In traditional machine learning, the common learning process is separate and only performed on certain models, datasets, and tasks. Hence, knowledge is neither preserved nor transferred between models. Conversely, in deep learning, TL can employ knowledge such as weights and features of the pre-trained model to train a new model, as well as undertake issues in the novel task that has a smaller amount of data. TL with deep learning models is more rapid, has improved accuracy, and/or needs less training data. The TL concept is to utilize a trained network on different tasks for different source data then adjust it for the target task as explained in Figure 6. There are a series of steps to fine-tune the proposed model and pre-trained models which are:

• The proposed model is trained on the transfer learning datasets (Dataset C in one case, Dataset D in the other) for transfer learning purposes.
• The pre-trained model is loaded.
• The final layers are replaced with new layers to learn features specific to the target task.
• The fine-tuned model is trained with the target dataset.
• The model accuracy is assessed.
• The results are deployed.
The procedure of transfer learning is explained in Figure 7. The pre-trained models (VGG, ResNet) follow the same procedure except for the first point.
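The load, replace, and retrain steps listed above can be sketched on a toy problem. This is not the authors' MATLAB implementation: random weights stand in for a network pre-trained on a source dataset, the data are synthetic, the 15-class source head is hypothetical, and only the new head is updated (the transferred layer is kept frozen as a simplification, whereas fine-tuning may also update earlier layers). The pre-training step itself is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)


def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def features(x, w):
    """Transferred feature extractor: one linear layer followed by ReLU."""
    return np.maximum(0.0, x @ w)


def mean_loss(probs, labels):
    return float(-np.log(probs[np.arange(len(labels)), labels]).mean())


# "Load" a pre-trained model (placeholder weights, hypothetical 15-class head).
pretrained = {
    "feature_w": rng.normal(size=(8, 16)) / np.sqrt(8),
    "head_w": rng.normal(size=(16, 15)),
}

# Keep the feature layer and replace the final layer with a new 2-class head.
model = {"feature_w": pretrained["feature_w"].copy(), "head_w": np.zeros((16, 2))}

# Synthetic stand-in for a small labeled target dataset.
x = rng.normal(size=(200, 8))
y = (x[:, 0] + x[:, 1] > 0).astype(int)
one_hot = np.eye(2)[y]

# Train with the target dataset (gradient descent on the new head only).
f = features(x, model["feature_w"])
initial_loss = mean_loss(softmax(f @ model["head_w"]), y)
for _ in range(300):
    probs = softmax(f @ model["head_w"])
    model["head_w"] -= 0.05 * f.T @ (probs - one_hot) / len(x)

# Assess the model.
probs = softmax(f @ model["head_w"])
final_loss = mean_loss(probs, y)
accuracy = float((probs.argmax(axis=1) == y).mean())
print(initial_loss, final_loss, accuracy)
```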

Proposed Model
We have designed our hybrid model based on the study of previous state-of-the-art architectures and the advantages of each architecture. It integrates three different ideas involving traditional convolutions, parallel convolutions, and residual connections. The total number of layers of the proposed model is 91, which are explained in Table 1 and Figure 8.
At the beginning of the model, we used two traditional convolutional layers with filter sizes of 3 × 3 and 5 × 5 to reduce the size of the input image. We chose these two filters to avoid losing small or large features. Picking a small filter size, such as 1 × 1, would act as a bottleneck that prevents large features, which could help distinguish between classes, from passing through; picking a large filter size, such as 11 × 11, would ignore small details, which could lead to false classification. Therefore, we adopted intermediate filter sizes. All convolutional layers in the model were followed by batch normalization and ReLU. Batch normalization speeds up the training progress, while the ReLU layer aids in reducing the effect of the vanishing gradient problem.
Traditional convolutional layers are followed by five blocks of parallel convolutional layers. Each block consists of four parallel convolutional layers with four different filter sizes (1 × 1, 3 × 3, 5 × 5, 7 × 7). The output of these four layers is combined in the concatenation layer to pass to the next block. The blocks are connected with long and short connections. The benefits of this type of combination are that it can integrate different levels of features and, owing to the variety of filter sizes, learn both small and large details. Furthermore, this structure is very helpful for gradient propagation, as the error can backpropagate along multiple paths.
We employed a global average pooling layer on top of the five blocks of convolutional layers. Subsequently, three fully connected layers with two dropout layers were employed. Lastly, Softmax was adopted to produce the output. The global average pooling and dropout layers were used to overcome the issue of overfitting.
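The idea of one parallel block (four branches with 1 × 1, 3 × 3, 5 × 5, and 7 × 7 filters whose outputs are concatenated) can be sketched as follows. This is a simplified, single-channel illustration with one random filter per branch standing in for learned weights; batch normalization, the residual connections, and the real filter counts of the proposed model are omitted, and the function names are illustrative.

```python
import numpy as np


def conv2d_same(image, kernel):
    """Naive single-channel 2D cross-correlation with zero 'same' padding."""
    k = kernel.shape[0]
    pad = k // 2
    padded = np.pad(image, pad)
    out = np.zeros_like(image, dtype=float)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            out[i, j] = np.sum(padded[i:i + k, j:j + k] * kernel)
    return out


def parallel_block(image, kernel_sizes=(1, 3, 5, 7), rng=None):
    """Apply one filter per branch, then concatenate along the channel axis."""
    if rng is None:
        rng = np.random.default_rng(0)
    branches = []
    for k in kernel_sizes:
        kernel = rng.normal(size=(k, k))  # placeholder for learned weights
        branches.append(np.maximum(0.0, conv2d_same(image, kernel)))  # ReLU
    return np.stack(branches, axis=-1)


x = np.random.default_rng(1).normal(size=(8, 8))
y = parallel_block(x)
print(y.shape)  # (8, 8, 4)
```

Because every branch uses "same" padding, all four outputs keep the input height and width, which is what makes the channel-wise concatenation possible.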

Training Scenarios
We utilized different training scenarios:
• Scenario 1: Training the proposed model from scratch with the target datasets (A and B).
• Scenario 2: Pre-training the proposed model with Dataset C, then training it with A and B.
• Scenario 3: Pre-training the proposed model with Dataset D, then training it with A and B.
• Scenario 4: Fine-tuning two pre-trained state-of-the-art models (VGG19, ResNet50), then training them with A and B from Scenario 1. These two models had previously been trained on the ImageNet dataset, which contains natural images.
For visualization, we show what the first convolutional layer learned in our model trained in Scenario 2 with Dataset A. Figure 9 shows the learned filters of the abnormal class. Figure 10 shows the learned filters of the normal class.
The training process was accomplished using stochastic gradient descent with momentum set to 0.9. The mini-batch size was 64 and MaxEpochs was 100, with an initial learning rate of 0.001. We ran our experiments in MATLAB 2019 on an Intel(R) Core(TM) i7-5829K CPU @ 3.30 GHz with 32 GB RAM and an 8 GB GPU.
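The optimizer settings reported above (momentum 0.9, initial learning rate 0.001) correspond to the following update rule. This is a minimal sketch of the standard momentum formulation on a toy one-parameter problem, not the MATLAB training code; the function name `sgdm_step` and the quadratic objective are illustrative.

```python
def sgdm_step(weights, grads, velocity, lr=0.001, momentum=0.9):
    """One step of stochastic gradient descent with momentum."""
    new_velocity = [momentum * v - lr * g for v, g in zip(velocity, grads)]
    new_weights = [w + v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity


# Toy objective f(w) = w^2, whose gradient is 2w.
w, v = [1.0], [0.0]
for _ in range(100):
    w, v = sgdm_step(w, [2.0 * w[0]], v)
print(w[0])  # moves toward the minimum at 0
```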

Experimental Results
The evaluation stage was achieved by calculating recall, precision, and F1-score, where TP refers to true positives, FP refers to false positives, and FN refers to false negatives. These metrics are defined as:
\[ \text{Recall} = \frac{TP}{TP + FN} \]
\[ \text{Precision} = \frac{TP}{TP + FP} \]
\[ \text{F1-score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \]
We started by evaluating the proposed model performance with target datasets trained with Scenario 1, as reported in Table 2. The results of the proposed model with Dataset A are slightly higher than those with Dataset B, achieving 84.8%, 88.6%, and 86.6% for precision, recall, and F1-score, respectively. The proposed model with Dataset B achieved 82.9% for precision, 87.5% for recall, and 85.1% for the F1-score. Although the performance of the proposed model with Dataset A was higher than that with Dataset B, both had roughly the same performance.
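Precision, recall, and F1-score can be computed from TP, FP, and FN using their standard definitions, as sketched below. The confusion counts are hypothetical and serve only as an illustration; the last lines check that the F1-scores reported for Scenario 1 agree, to within rounding, with the harmonic mean of the reported precision and recall.

```python
def precision(tp, fp):
    return tp / (tp + fp)


def recall(tp, fn):
    return tp / (tp + fn)


def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)


# Hypothetical confusion counts, for illustration only.
tp, fp, fn = 90, 10, 30
p, r = precision(tp, fp), recall(tp, fn)
print(p, r, f1_score(p, r))

# Consistency check on the reported Scenario 1 percentages.
f1_a = f1_score(84.8, 88.6)  # Dataset A, reported F1-score 86.6
f1_b = f1_score(82.9, 87.5)  # Dataset B, reported F1-score 85.1
print(f1_a, f1_b)
```

The functions assume non-zero denominators; a production version would need to handle the case where a class is never predicted.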
In Scenario 2, TL from a medical source (Dataset C) was adopted and the results are listed in Table 3. The results of the proposed model with Dataset A were significantly higher than those with Dataset B, achieving 96.8%, 98.6%, and 97.6% for precision, recall, and F1-score, respectively. Compared with Scenario 1, the performance of the proposed model with Dataset A improved remarkably in Scenario 2 because the TL source was in the same domain as Dataset A. The situation was different for Dataset B. Employing TL from a medical source for animal classification (Dataset B) degraded the performance relative to Scenario 1 because the learned features differ between the two domains. The proposed model with Dataset B achieved 81.8% for precision, 86.7% for recall, and 84.1% for the F1-score. In such situations, training from scratch, as in Scenario 1, is preferable.
In Scenario 3, the source of TL was in the same domain as Dataset B, which boosted the results to 91.6%, 96.9%, and 94.1% for precision, recall, and F1-score, respectively. Comparing the performance of the proposed model with Dataset B in Scenario 3 to that in Scenarios 1 and 2 makes it clear that same-domain transfer played a major role in enhancing performance. Although the performance of the proposed model with Dataset A was lower than that with Dataset B in Scenario 3, the results with Dataset A improved slightly compared to Scenario 1, reaching 86.5% for precision, 92.7% for recall, and 89.4% for the F1-score. These results are still lower than those from Scenario 2. The results of the proposed model with Datasets A and B trained in Scenario 3 are listed in Table 4.
Table 5 lists the results of Scenario 4, in which we employed two state-of-the-art models pre-trained on the ImageNet dataset, which consists of natural images including animals. The ImageNet dataset and Dataset B are therefore in the same domain. For that reason, the results for Dataset B were higher than those for Dataset A, with 89.3% for precision, 95.2% for recall, and 92.1% for the F1-score with VGG19, and 93.7% for precision, 98.9% for recall, and 96.2% for the F1-score with ResNet50. ResNet50 achieved the highest results on Dataset B, owing to the use of ImageNet as the source of TL, and the proposed model in Scenario 3 achieved the second highest. This shows the importance of same-domain transfer learning. On the other hand, the results of these models on Dataset A improved compared to Scenario 1, reaching 86.4% for precision, 90.5% for recall, and 88.4% for the F1-score with VGG19, and 88.2% for precision, 93.1% for recall, and 90.5% for the F1-score with ResNet50.
Although these models (VGG19, ResNet50) were trained with one million images, those images were not from the same domain as the medical images. Therefore, a smaller number of images for TL in the same domain is better than a million images in a different domain. In Scenario 2, TL from the same domain improved performance on Dataset A significantly, using far fewer images than the million that VGG19 and ResNet50 were trained with.
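The comparison across scenarios can be summarized with a short sketch that collects the F1-scores reported in this section and selects the best training setup for each target dataset. The dictionary layout, the labels, and the helper name best_setup are hypothetical; the numbers are the F1-scores (in percent) stated above.

```python
# F1-scores (%) reported in this section, keyed by training setup.
f1_scores = {
    "Scenario 1 (from scratch)": {"A": 86.6, "B": 85.1},
    "Scenario 2 (TL from Dataset C)": {"A": 97.6, "B": 84.1},
    "Scenario 3 (TL from the domain of Dataset B)": {"A": 89.4, "B": 94.1},
    "Scenario 4 (VGG19, ImageNet)": {"A": 88.4, "B": 92.1},
    "Scenario 4 (ResNet50, ImageNet)": {"A": 90.5, "B": 96.2},
}

def best_setup(dataset):
    """Return the setup label with the highest F1-score for a dataset."""
    return max(f1_scores, key=lambda name: f1_scores[name][dataset])

best_for_a = best_setup("A")  # medical target dataset (DFU)
best_for_b = best_setup("B")  # animal target dataset
```

For both target datasets, the best setup is one whose TL source lies in the same domain as the target.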

Conclusions
In summary, this paper has six main highlights: (i) the lack of training data has been tackled using transfer learning; (ii) a hybrid deep learning model has been proposed that combines traditional and parallel convolutional layers with residual links, a structure that gives the model better feature representation; (iii) a DFU dataset was collected, labeled as normal or abnormal by an expert in the field, and used in the experiments; (iv) four training scenarios were designed, covering training from scratch as well as same-domain and different-domain TL for the target datasets, using four datasets in total (two target datasets and two additional datasets for TL purposes); (v) same-domain TL proved more beneficial for addressing the lack of training data, and fewer images in the same domain as the target dataset were found to be better than a large number of images from a different domain; and (vi) the proposed model with the DFU dataset (Dataset A) achieved an F1-score of 86.6% with training from scratch, 89.4% with TL from a domain different from that of the target dataset, and 97.6% with TL from the same domain as the target dataset. Because same-domain TL improved performance, we plan to adopt it in other applications.