2.2. Skin Lesion Classification Model and Dataset
The YOLO model (NNM) used a one-stage architecture, in which object detection (localization) and classification were performed in a dense sampling layer (Figure 1). The proposed model was designed to automatically extract features from input images; then, from these features, the prediction layers determined the location and class of each skin lesion. The structure of the applied model consisted of three main parts: the backbone, the head, and the detection elements. The backbone was used to extract discriminative features from the input image. It was mainly based on a BottleNeckCSP convolutional neural network that aggregated and formed image features at different granularities. BottleNeckCSP models are based on a DenseNet network [29], which is designed to connect neural layers with the goals of avoiding the vanishing gradient problem, bolstering feature propagation, and reducing the number of network parameters. The head component of the YOLO model extracted fusion features and passed them forward to the classification and detection parts. The head element consisted of a series of convolutional layers, such as Conv1 × 1 (convolution using a 1 × 1 filter), Conv3 × 3 (convolution using a 3 × 3 filter), a merging concatenation neural layer, an upscaling layer (UpSample), and the previously described BottleNeckCSP layer. Detection was performed in the last part of the model structure. The detection part analyzed the features by using a 1 × 1 convolutional layer (Conv1 × 1, acting as a fully connected layer) and a sigmoidal transfer function; its output consisted of bounding boxes with prediction values. The detection part employed the total bounding-box loss function and non-maximum suppression [30].
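To illustrate how the head-layer types named above fit together, the following minimal PyTorch sketch composes a Conv1 × 1, an upsampling step, a merging concatenation, and a Conv3 × 3. It is illustrative only; it does not reproduce the actual YOLO head, and the channel sizes are assumptions.

```python
import torch
import torch.nn as nn

# Illustrative composition of the head-layer types described above;
# this is a sketch, not the actual YOLO head used in the study.
class HeadBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv1x1 = nn.Conv2d(channels, channels, kernel_size=1)
        self.upsample = nn.Upsample(scale_factor=2, mode="nearest")
        # After concatenation the channel count doubles
        self.conv3x3 = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor, skip: torch.Tensor) -> torch.Tensor:
        x = self.upsample(self.conv1x1(x))   # Conv1x1 followed by UpSample
        x = torch.cat([x, skip], dim=1)      # merging concatenation layer
        return self.conv3x3(x)               # Conv3x3 fusion
```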
Binary cross-entropy with logits was used as the loss function to evaluate how well the proposed skin lesion detection and classification model was trained. When the predictions of the YOLO model are close to the true annotated values, the selected metric (loss function) approaches its minimum; when the predictions do not correspond to the actual values, the loss function value approaches its maximum. The training parameters of the YOLO model were updated based on the values of the loss function. The binary cross-entropy loss function was used to measure the dissimilarity between the predicted probability distribution and the true labels in the training dataset. The predicted probabilities were compared to the actual class values by calculating a score that penalized the probabilities based on their distance from the expected value. The binary cross-entropy value (Loss) was calculated using the following formula:

$$\mathrm{Loss} = -\frac{1}{N}\sum_{i=1}^{N}\left[\,y_i \log p(y_i) + (1 - y_i)\log\bigl(1 - p(y_i)\bigr)\right], \quad (1)$$

where N is the number of samples in the training dataset (or the output size when training with data batches), y_i is a class label, and p(y_i) is the model output or prediction that the given input corresponds to the actual label.
The given loss function was adapted for the multiclass classification problem, and the loss value was calculated using the following formula:

$$\mathrm{Loss} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{M} y_{i,c}\,\log p(y_{i,c}), \quad (2)$$

where N is the number of samples, M is the number of classes, y_{i,c} is the class label indicating whether class c is correct for sample i, and p(y_{i,c}) is the corresponding predicted probability of the YOLO model.
The average difference between the actual and predicted probability distributions for all classes was calculated using Equation (2).
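To make the loss computation concrete, the following minimal Python sketch evaluates the binary cross-entropy with logits of Equation (1) using PyTorch (which was also used for the implementation); the tensor values are made up for illustration only.

```python
import torch

# Illustrative only: binary cross-entropy with logits, as in Equation (1).
# The logits and labels below are fabricated values, not study data.
criterion = torch.nn.BCEWithLogitsLoss()  # applies sigmoid + BCE in one step
logits = torch.tensor([2.0, -1.5, 0.3])   # raw model outputs (before sigmoid)
labels = torch.tensor([1.0, 0.0, 1.0])    # ground-truth class labels
loss = criterion(logits, labels)          # averaged over N samples
print(loss.item())
```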
The proposed YOLO detection and classification model was trained using the stochastic gradient descent (SGD) optimization algorithm. SGD is one of the most common methods for optimizing deep neural networks, and it is used as a black-box optimizer. Gradient descent minimizes an objective function, defined over the neural network model, by updating the parameters in the direction opposite to the gradient.
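For reference, the standard gradient descent update described above can be written as

$$\theta_{t+1} = \theta_t - \eta\,\nabla_{\theta} J(\theta_t),$$

where θ denotes the network parameters, η the learning rate, and J the objective (loss) function.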
The training hardware environment consisted of a single NVIDIA GeForce RTX 3080 Ti (Santa Clara, CA, USA) graphics card with 12 GB of memory and an Intel i9 (Santa Clara, CA, USA) CPU. The NNM was implemented using the torch 1.8.1 + cu101 CUDA Python library. The following hyperparameters were used in training: 300 epochs, a batch size ranging from 4 to 16, an input image size of 640 × 640 pixels, and initial and final learning rates of l0 = 0.01 and lf = 0.2.
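A minimal sketch of such a training configuration is shown below. It assumes a standard PyTorch script; the stand-in network is a placeholder, and the learning rate schedule assumes lf acts as a final multiplier (a common YOLO convention), which is an interpretation rather than a detail given in the text.

```python
import torch
import torch.nn as nn

# Sketch of the training configuration described above; the model below is a
# placeholder, not the actual YOLO network.
EPOCHS = 300
BATCH_SIZE = 16          # the study used batch sizes from 4 to 16
IMG_SIZE = 640           # input images resized to 640 x 640 pixels
LR0, LRF = 0.01, 0.2     # initial learning rate and final rate (factor)

model = nn.Conv2d(3, 16, kernel_size=3, padding=1)  # placeholder network
optimizer = torch.optim.SGD(model.parameters(), lr=LR0)
# Linearly anneal the learning rate from LR0 toward LR0 * LRF over training
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda epoch: (1 - epoch / EPOCHS) * (1 - LRF) + LRF
)
```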
From a computational point of view, the feasibility of the results depended on the implementation, which involved a complex algorithm (the largest YOLO structure) running on different operating systems. In addition, processing the color images of the lesions required substantial computational resources and time (depending on the CPU or GPU used) to produce accurate results. The computational complexity was evaluated using several important factors, such as the number of queries (the number of images to be processed) and the computational resources available. The dataset was very large; thus, parallel processing and distributed computing were required to achieve feasible results within a reasonable timeframe. We tested several machines: (1) a desktop computer with a 12th Gen Intel i9 CPU, 64 GB of RAM, and an NVIDIA RTX 4090 with 24 GB of RAM; (2) a laptop computer with an Intel i5 1.60 GHz CPU and 16 GB of RAM (no dedicated GPU); (3) an Apple iPhone X (Cupertino, CA, USA); and (4) a Samsung Galaxy A25 (Seoul, Republic of Korea). The fastest image-processing response times (on average, less than 0.15 s) were achieved using the desktop computer. Less than 0.7 s, on average, was needed to process an image using either smartphone. The slowest processing time, 1.5 s, was observed with the laptop computer.
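Per-image inference latency of this kind can be measured with a simple timing loop; the snippet below is a generic sketch (the model here is a placeholder, not the deployed network, and this is not the authors' benchmarking code).

```python
import time
import torch

# Generic latency-measurement sketch; the model is a placeholder,
# not the deployed YOLO network.
model = torch.nn.Conv2d(3, 16, kernel_size=3, padding=1)
image = torch.rand(1, 3, 640, 640)  # one 640 x 640 RGB input

with torch.no_grad():
    start = time.perf_counter()
    model(image)
    print(f"Inference time: {time.perf_counter() - start:.3f} s")
```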
Ultimately, the results were presented in the form of a data vector, which held information about the detected object’s class, score (confidence level ranging from 0.0 to 1.0), location (x, y), and size (width, height). The YOLO model within SmartVisSolution© (Kaunas, Lithuania) is available as a closed beta on personal computers and the Apple Inc. iOS (Cupertino, CA, USA) and Open Handset Alliance Android (Mountain View, CA, USA) smartphone operating systems. SmartVisSolution© was developed by a consulting and software development company named “Dts solutions” (Kaunas, Lithuania) and the Lithuanian University of Health Sciences.
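The output data vector can be pictured as the following structure; the field names are illustrative assumptions, not the application's actual schema.

```python
from dataclasses import dataclass

# Illustrative layout of the output data vector described above;
# field names are assumptions, not the application's actual schema.
@dataclass
class Detection:
    class_id: int   # detected skin lesion class
    score: float    # confidence level, from 0.0 to 1.0
    x: float        # location of the detected object
    y: float
    width: float    # size of the bounding box
    height: float
```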
The NNM training dataset was formed by using dermatoscopic images from the International Skin Imaging Collaboration (ISIC) archive (n = 58,457) [31,32,33], which consisted of 5106 melanomas, 18,068 melanocytic nevi, 1525 seborrheic keratoses, 3323 basal cell carcinomas (BCCs), 628 squamous cell carcinomas, and 29,807 uncategorized benign tumors that were captured by using various dermatoscopic devices. The ISIC archive's MSK [34,35] and UDA [36] sub-databases were excluded from training. The dataset was subsequently expanded with dermatoscopic images (n = 633; 183 melanomas, 68 BCCs, 353 melanocytic nevi, and 29 seborrheic keratoses) collected with a FotoFinder© dermatoscopic device. These dermatoscopic images were retrospectively gathered from 2010 to 2020. All 251 dermatoscopic images of melanomas and BCCs were verified through histopathology. The remaining skin lesions were confirmed as benign by the expert opinion of two experienced dermatologists. Image augmentation techniques, such as image rotation, changes in illumination, and noise correction, were used to increase the size and diversity of the training data.
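As an illustration of these augmentation types, a minimal torchvision-based sketch follows; the specific parameter values are assumptions, not those used in the study.

```python
import torchvision.transforms as T

# Illustrative augmentation pipeline for the techniques named above;
# the parameter values are assumptions, not those used in the study.
augment = T.Compose([
    T.RandomRotation(degrees=15),                  # image rotation
    T.ColorJitter(brightness=0.3, contrast=0.3),   # changes in illumination
    T.GaussianBlur(kernel_size=3),                 # smoothing, e.g., for noise
])
```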
2.3. Test Dataset
The testing of the NNM was performed on 100 dermatoscopic images (Figure 2) of histologically confirmed melanomas (n = 32), melanocytic nevi (n = 35), and seborrheic keratoses (n = 33). The results of the NNM classification were compared with the histologically confirmed diagnoses and with a blinded evaluation by two dermatologists skilled in dermatoscopy and five beginners in dermatoscopy. The dermatoscopic images were randomly selected from the HAM10000 [31], MSK-1, MSK-2, MSK-3, MSK-4, MSK-5 [34,35], and UDA2 [36] databases (Table 1). There was no overlap between the classification model's training and testing datasets. To avoid image duplication within the ISIC datasets, the MSK and UDA databases were exclusively utilized for testing and were not included in the training dataset. From the HAM10000 dataset, 38 images were randomly selected for testing and excluded prior to training.
The chosen dermatoscopic images were uploaded to the smartphone application and cropped by using automatic selection of the skin tumor. The smartphone application resized each image to a resolution of 640 × 640 pixels and output the classification probabilities and locations of four skin lesion classes: melanomas, melanocytic nevi, seborrheic keratoses, and BCCs (Figure 3).
The performance of the raters was evaluated by using a multiclass classification task. Each participant received the dermatoscopic images from the test dataset in a randomized order and was asked to assign one of three diagnoses (melanoma, melanocytic nevus, or seborrheic keratosis) to each of the 100 images. The raters were also required to indicate the number of months of their experience in dermatoscopy for their placement in the “skilled” (>2 years of experience in dermatoscopy) or “beginner” (≤2 years of experience) group.
2.4. Statistical Analysis
The outcomes of interest were the sensitivity and specificity of the NNM and the raters for the classification of melanomas, melanocytic nevi, and seborrheic keratoses. In addition, we computed the NNM’s area under the receiver operating characteristic (ROC) curve (AUC). The raters’ sensitivity and specificity were analyzed within the “skilled” and “beginner” groups, in addition to an analysis of the pooled data presented as “all raters”. The chance-corrected inter-rater agreement was estimated with the Fleiss kappa. Performance was assessed by using the “one-vs-all” multiclass classification approach with absolute probability values.
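For illustration, the Fleiss kappa can be computed as in the following Python sketch (the study's analysis itself was performed in R; the ratings matrix below is fabricated).

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Fabricated ratings for illustration: rows = images, columns = raters,
# values = assigned diagnosis (0 = melanoma, 1 = nevus, 2 = keratosis).
ratings = np.array([
    [0, 0, 1],
    [1, 1, 1],
    [2, 2, 0],
    [0, 0, 0],
])
table, _ = aggregate_raters(ratings)  # counts of each category per image
print(fleiss_kappa(table))            # chance-corrected agreement
```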
To measure the accuracy of the NNM, we used a cutoff for the predicted probability such that the specificity of the model was equal to the mean specificity of the raters for the particular lesion (melanoma, melanocytic nevus, or seborrheic keratosis).
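A hedged sketch of this cutoff rule follows, using a one-vs-all ROC analysis on made-up values (again, the study's analysis was performed in R; labels, probabilities, and the target specificity below are illustrative).

```python
import numpy as np
from sklearn.metrics import roc_curve

# Illustration of the cutoff rule described above: choose the predicted-
# probability threshold at which the model's specificity matches the raters'
# mean specificity. All inputs here are fabricated, not study data.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])          # one-vs-all labels
y_prob = np.array([0.9, 0.2, 0.7, 0.4, 0.3, 0.6, 0.8, 0.1])
target_specificity = 0.85                             # mean rater specificity

fpr, tpr, thresholds = roc_curve(y_true, y_prob)
specificity = 1 - fpr
idx = np.argmin(np.abs(specificity - target_specificity))
print(f"cutoff={thresholds[idx]:.2f}, "
      f"sensitivity={tpr[idx]:.2f}, specificity={specificity[idx]:.2f}")
```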
Point estimates are presented with 95% confidence intervals. The statistical analysis was carried out by using R, version 4.1.1 (R Foundation for Statistical Computing©, Vienna, Austria).