Automated Skin Lesion Classification on Ultrasound Images

The growing incidence of skin cancer makes computer-aided diagnosis tools for this group of diseases increasingly important. The use of ultrasound has the potential to complement information from optical dermoscopy. The current work presents a fully automatic classification framework utilizing fully-automated (FA) segmentation and compares it with classification using two semi-automated (SA) segmentation methods. Ultrasound recordings were taken from a total of 310 lesions (70 melanoma, 130 basal cell carcinoma and 110 benign nevi). A support vector machine (SVM) model was trained on 62 features, with ten-fold cross-validation. Six classification tasks were considered, namely all the possible permutations of one class versus one or two remaining classes. The receiver operating characteristic (ROC) area under the curve (AUC) as well as the accuracy (ACC) were measured. The best classification was obtained for the classification of nevi from cancerous lesions (melanoma, basal cell carcinoma), with AUCs of over 90% and ACCs of over 85% obtained with all segmentation methods. Previous works have either not implemented FA ultrasound-based skin cancer classification (making diagnosis more lengthy and operator-dependent), or are unclear in their classification results. Furthermore, the current work is the first to assess the effect of implementing FA instead of SA classification, with FA classification never degrading performance (in terms of AUC or ACC) by more than 5%.


Motivation
Skin cancer is a growing problem in the developed world. For instance, one in five Americans is expected to get skin cancer during their lifetime, with an estimated 5.8% rise in melanoma cases for 2021 and a 77% rise in the incidence of non-melanoma skin cancer between 1994 and 2014 [1]. While malignant melanoma (MM) is the most deadly form of skin cancer, it is about 20 times less common than other forms of skin cancer, with basal cell carcinoma (BCC) being the most common non-melanoma skin cancer [2]. Due to the relative shortage of dermatologists at a time of increasing skin cancer incidence, computer-aided diagnostic approaches are gaining prominence.
Deep neural network-based optical approaches using tens of thousands of clinical records, including dermoscopy images, have achieved an accuracy of around 94% on automated skin lesion classification [3,4]. Despite the high accuracy of optics-based melanoma detection, the addition of subsurface information from ultrasound imaging can further improve classification accuracy [5].
In the last few decades, there has been increased interest in the use of dermatologic ultrasound for skin lesion diagnosis. The appearance of different cancerous and noncancer-
Regarding the topic of skin cancer, the reader is first directed to two reviews of skin cancer detection methods in general [37,38], followed by two reviews of ultrasound-based skin cancer diagnosis [39,40]. These reviews of ultrasound methods highlight several studies in which quantitative and semi-quantitative parameters (such as echogenicity, homogeneity, shape, margins and location of the lesions, as well as the posterior acoustic shadow and dermal echogenicity ratio) are shown to be promising features in differential diagnosis [41][42][43][44][45][46][47]. These works, however, do not aim to provide automated classification, as they require fully manual segmentation and examine the diagnostic potential of individual features rather than combining them into a classification framework. Therefore, lesion classification accuracy values are either missing, not detailed properly, or do not reach the desired level (60%+) [40].
A number of more recent studies have moved towards providing a skin cancer classification framework. Csabai et al. [48] and Andrékuté et al. [49] combined a semi-automated (SA) segmentation method with a fully-automatic feature extraction and classification method based on acoustical, textural and shape features. Csabai et al. [48] examined three lesion types, namely MMs, BCCs and benign nevi. Using a manually selected seeding region, an active contour model (ACM) was used to segment the lesion. Five shape features and seven first-order texture parameters were defined and their mean and standard deviation were input as features into a support vector machine (SVM) model. In terms of the area under the receiver operating characteristic (ROC) curve (AUC) metric, they reported a classification performance of 86% for the differentiation of nevi from cancerous lesions and 90% for BCC vs. nevi. Andrékuté et al. [49] used a somewhat different approach: following a manual selection of those A-lines that contained the lesion, the lesion boundaries were automatically calculated for each A-line independently (A-lines are one-dimensional sections of the B-mode image in the depth direction, the direction of propagation of the ultrasound pulse). From these A-lines, 29 features were extracted for binary classification between MMs and benign melanocytic skin tumors (MST). An AUC performance of 89.0 ± 0.6% was obtained.
In another strand of research, Kia et al. [50] presented an automatic classification method for differentiating between healthy tissues, benign lesions, BCCs and melanomas. Although 98% sensitivity was attained, this was achieved at the cost of the specificity being only 5%, making the diagnostic value of the algorithm extremely limited. (Although, judging from the context in which the performance values were reported, it is possible the authors may have meant to write a specificity of 95%.) In addition, healthy skin without lesions was included in the testing set, making a comparison with other classification articles difficult, if they do not consider the differentiation of lesion-free skin necessary. A more recent work from the group is based on tissue frequency analysis [51]. It uses a 384-element-long feature vector from frequency space to train the above-mentioned neural network. The algorithm calculated the features using the whole sonograms without applying any kind of segmentation before it. Their work reached an accuracy of 95.9% using 220 malignant and 180 benign lesions for training, testing and evaluation; a very promising result. However, since the report did not specify the four classes of skin type for which differential diagnosis was performed, more reports from their work are needed before the method can be verified.
A final work worthy of mention is that of Tiwari et al. [5], where skin lesion classification is performed based on parameters collected and combined from a number of different imaging modalities, namely ultrasonography, dermatoscopy and spectrophotometry. Although the results are outstanding (with an AUC of 99.9%), purely ultrasound-based classification is unfortunately not proposed or evaluated in the work.
In the table below (Table 1), the performance of the most relevant ultrasound-based classification methods is presented. They have been selected from the literature included above on the basis of having clearly defined and documented performance measures differentiating between skin lesions only.

Aims of Current Work
The current work aims to present a framework for ultrasound-based skin cancer diagnosis that differentiates between three common skin lesion types: benign nevi, BCC and MM. In contrast to techniques that require some form of manual segmentation, the use of an automated segmentation method [52] makes skin cancer detection fully automatic, which, considering the time limits imposed on dermatological visits [53], would significantly improve the utility of the skin cancer detection method. Although many methods exist for optical-based automated skin cancer diagnosis [3,4], as mentioned earlier, ultrasound has the potential to improve on the accuracy of fully optical-based methods [5]. The aim of the current work, therefore, is to assess how a fully-automated (FA) skin lesion detection method compares with reference SA methods and with relevant results in the literature.

Table 1. Comparison of ultrasound-based skin lesion differential diagnosis methods focusing on benign and malignant skin lesions. In contrast to the current proposed method, none of the methods are fully automated since they do not employ fully-automated (FA) segmentation. NA stands for Not Available.

Materials and Methods
In the current section, the data processing pipeline is described from the point of data collection to data processing using feature extraction, classification and, finally, measures to evaluate classification performance. All the code used in the work is available to download on GitHub (Available online: https://github.com/marosanp/skin-lesion-us, accessed on 20 June 2021).

Ultrasound Data Collection
Data were collected at the Department of Dermatology, Venereology and Dermatooncology, Semmelweis University, Budapest, Hungary, as part of an ethically approved study. Informed consent was obtained from the participating patients for the anonymised use of the data for research and publication [52].
The source of the examined dataset was a commercial high-frequency ultrasound imager (HI VISION Preirus with 5-18 MHz EUP-L75 transducer connected to Hitachi Preirus, Hitachi, Tokyo, Japan). The current study involved N = 310 B-mode ultrasound images, containing skin lesions with a thickness of 1-2 mm. Three different types of skin lesions were distinguished, including 110 benign nevi, 130 BCCs and 70 recordings of MMs. Figure 1 illustrates a representative ultrasound image of each examined lesion type.

Segmentation
Three different segmentation techniques were implemented and compared in the current study: one was an FA algorithm to study the accuracy of ultrasound-based FA skin cancer detection, while the other two were SA algorithms used as a reference. Note that the same segmentation method was used for training and testing rather than selecting one as the ground truth during training. The first technique performs FA lesion segmentation based on an initial seeding step and a growing step, described in detail in [52], and described briefly below. The seeding step begins with a pre-processing substep to make ultrasound images from different machines similar to each other. This is followed by layer extraction substeps (above skin, epidermis, and dermis), and a lesion detection substep within the dermis. Each layer extraction substep first performs an intensity-based clustering or multilevel thresholding method combined with prior geometric information to return an initial estimate of the layer region; then performs a refinement of the region estimate based on an ACM and morphological operations. The seeding step is concluded by a lesion detection substep that incorporates information from the layer extraction substeps with prior geometrical assumptions about the arrangement of the layers and lesions. The automated seeding is followed by a growing step, using ACMs to extract the final lesion mask. To the best of our knowledge, this is the only fully automatic segmentation algorithm for ultrasound images of skin cancer-suspicious lesions that works on images from multiple imaging systems [52].
Two SA segmentation algorithms were also used for comparison purposes. Both of them require manual seeding for lesion localization and execute an ACM-based growing step on the initial seed masks for final boundary delimitation. The first, freehand seeding method used a freehand drawing around the lesion borders (using the MATLAB command freehand). This freehand seeding method simulated a careful manual segmentation since it allows any errors to be corrected using an ACM method. The second seeding method generated the largest area rectangle (LAR) from the freehand drawing and used that as the seed. This is similar to someone choosing a rectangular seed, as found in other works [54,55], and preferred in practice to freehand seeding due to the higher selection speed involved. We also have to mention that freehand seeding adds a significant variance and impairs the reproducibility of the algorithm as was demonstrated in [52]. In the current implementation, the difference was that the LAR was derived from the freehand seeding itself to allow meaningful comparison between the two.
The above-defined three segmentation techniques were chosen based on the following considerations. The primary motivation of the work is to compare FA classification with SA classification. To the best of our knowledge, our previous work [52] is the only FA seeding-based method that was applied on more than one type of skin ultrasound image, demonstrating its robustness; therefore, this was chosen as the FA segmentation algorithm. Considering reference SA segmentation methods, freehand-seeding-based SA segmentation was considered a good approximation to ground truth, combining human knowledge with the filling in of spatially fine details. Since the freehand method is highly time-consuming, the LAR-based SA segmentation is considered a good simulation of a rapid human input into the segmentation workflow [52]. Since the freehand method was considered to be the most reliable segmentation, it was treated as the ground truth when evaluating segmentation success. In particular, the success rate of the FA segmentation was defined as the proportion of lesions where the Dice coefficient between the FA and freehand SA segmentations exceeded 10%.
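The success-rate definition above can be illustrated with a minimal sketch. It assumes that masks are stored as equal-sized 2D lists of 0/1 values; the function names and the convention for two empty masks are illustrative assumptions and are not taken from the published code.

```python
def dice_coefficient(mask_a, mask_b):
    """Dice overlap between two binary masks (equal-sized 2D lists of 0/1)."""
    size_a = sum(sum(row) for row in mask_a)
    size_b = sum(sum(row) for row in mask_b)
    overlap = sum(
        a * b
        for row_a, row_b in zip(mask_a, mask_b)
        for a, b in zip(row_a, row_b)
    )
    if size_a + size_b == 0:
        return 1.0  # assumption: two empty masks are treated as identical
    return 2.0 * overlap / (size_a + size_b)


def success_rate(fa_masks, reference_masks, threshold=0.10):
    """Fraction of lesions whose FA/reference Dice exceeds the threshold.

    The default threshold of 0.10 mirrors the 10% value quoted in the text.
    """
    scores = [dice_coefficient(f, r) for f, r in zip(fa_masks, reference_masks)]
    return sum(score > threshold for score in scores) / len(scores)
```

Here `reference_masks` would hold the freehand-based SA masks, which the text treats as the ground truth for this purpose.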

Feature Extraction
Using the segmentation described in the previous subsection, 93 features were extracted from the lesion, lesion boundary, and the area of the dermis under the lesion. Then, feature selection was made by examining the SVM-based weights on the training set. In this way, 62 features were selected from the 93 examined (Table 2).
The 62 features were calculated using the following image regions:
• Lesion region: all the pixels inside the lesion mask;
• Dermis region: pixels of the region of the dermis directly under the lesion mask;
• Lesion boundary: a band of pixels located within a fixed distance from the lesion mask boundary.
The features can be grouped into first-order textural, second-order textural, and shape features. First-order textural features express information about the distribution of individual pixel intensity values, while second-order textural features express spatial correlation between pixel intensities [56], and, in the current work, are based on the gray level co-occurrence matrix (GLCM) [57][58][59]. Lastly, some shape-based features were also extracted. All three groups of features are presented below.

First-Order Textural Features
First-order textural features can be broadly categorized according to the properties or regions concerned, giving the following subgroups: attenuation, lesion contrast, lesion boundary, and statistical.
Attenuation-based features, such as attenuation, contrast of attenuation and heterogeneity of attenuation, examine the lesion region and its shadowing in the dermis region right under the lesion mask. Contrast parameters, such as lesion-contrast-based heterogeneity and the mean lesion contrast, examine the contrast of each boundary line emanating radially from the inner edge of the lesion [48]. The above two subgroups, as well as the mean boundary belonging to the boundary subgroup, were adopted from Csabai et al. [48].
Lesion boundary region-based features, such as mean boundary, boundary heterogeneity, boundary contrast, boundary heterogeneity contrast, boundary-lesion contrast, dermis-lesion heterogeneity contrast and boundary-lesion heterogeneity contrast, are calculated based on the expressions presented in Table 2. Statistical features, such as skewness, kurtosis and entropy, were also selected.
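As an illustration of the statistical subgroup, the sketch below computes skewness, kurtosis and entropy from a flat list of integer gray levels taken from one region. The exact definitions used in the paper are those of Table 2; the population-moment form, the non-excess kurtosis and the base-2 Shannon entropy used here are assumptions chosen for simplicity.

```python
import math
from collections import Counter


def first_order_stats(pixels):
    """Skewness, kurtosis and entropy of a flat list of integer gray levels."""
    n = len(pixels)
    mean = sum(pixels) / n
    # Central moments of order 2, 3 and 4 (population form, divided by n).
    m2 = sum((p - mean) ** 2 for p in pixels) / n
    m3 = sum((p - mean) ** 3 for p in pixels) / n
    m4 = sum((p - mean) ** 4 for p in pixels) / n
    # A constant region has zero variance; both ratios are set to 0 there.
    skewness = m3 / m2 ** 1.5 if m2 > 0 else 0.0
    kurtosis = m4 / m2 ** 2 if m2 > 0 else 0.0
    # Shannon entropy (in bits) of the empirical gray-level distribution.
    counts = Counter(pixels)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return {"skewness": skewness, "kurtosis": kurtosis, "entropy": entropy}
```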

Second-Order Textural Features
For most of the second-order (GLCM) textural descriptors (contrast, correlation I, correlation II, dissimilarity, maximum probability, difference variance, difference entropy and information measure of correlation I), the descriptors were calculated in both the vertical and horizontal directions, for both the lesion region and the dermis region of the images. For some of the second-order descriptors (energy, entropy, homogeneity I and II, and information measure of correlation II), only the vertical GLCMs were calculated for both regions.
Further details and calculations of the above-listed GLCM-based co-occurrence texture statistics can be found in the work of Uppuluri [59].
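To make the notion of horizontal and vertical GLCMs concrete, the following sketch builds a normalised co-occurrence matrix for a given pixel offset and derives two of the descriptors named above. It is an illustration only: the offset distance of one pixel, the non-symmetric counting and the function names are assumptions, and the paper's own descriptors follow [59].

```python
def glcm(image, d_row, d_col, levels):
    """Normalised, non-symmetric GLCM of a 2D list of gray levels in [0, levels)."""
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    total = 0
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + d_row, c + d_col
            if 0 <= r2 < rows and 0 <= c2 < cols:
                counts[image[r][c]][image[r2][c2]] += 1
                total += 1
    if total == 0:
        raise ValueError("offset does not fit inside the image")
    return [[value / total for value in row] for row in counts]


def glcm_contrast(p):
    """Contrast: co-occurrence probabilities weighted by squared level difference."""
    n = len(p)
    return sum(p[i][j] * (i - j) ** 2 for i in range(n) for j in range(n))


def glcm_energy(p):
    """Energy: sum of squared co-occurrence probabilities."""
    return sum(value ** 2 for row in p for value in row)


HORIZONTAL = (0, 1)  # neighbour one pixel to the right
VERTICAL = (1, 0)    # neighbour one pixel below
```

In practice the gray levels would first be quantised to a small number of levels and the matrix restricted to the lesion or dermis region; both steps are omitted here.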

Shape Features
Shape features are derived from parameters describing the shape of the lesion boundary. Some shape parameters, such as standard deviation of curvature and circularity, were adopted from Csabai et al. [48]. Further shape features, such as axis ratio, perimeter-area ratio and compactness (ratio of perimeter and the length of the major axis of the lesion mask), were also extracted.
The feature names with corresponding indices in the feature set (idx) are presented in Table 2, with references (to those taken from the literature) or with a short description (in the cases of newly introduced descriptors).

Classification
To aid comparison, the classification methodology (training and testing) closely resembled that of Andrékuté et al. [49]. Namely, an SVM-based classifier [60] was used, with ten-fold cross-validation implemented in the following manner. First, ten separate groups were selected randomly, with the same ratios of lesion types. Then, each group in turn was selected as the test group, with the remaining nine groups merged into a training set. The classification performance was then averaged over the ten training instances, using accuracy (ACC) and AUC as performance metrics for the binary classification. In addition to binary classification (as implemented by Andrékuté et al. [49]), multiclass classification was also carried out in the current work and evaluated using ACC.
The following classification tasks were distinguished for both binary and multiclass classification: 'Nevus vs. others', 'MM vs. others', 'BCC vs. others', 'Nevus vs. BCC', 'Nevus vs. MM', 'BCC vs. MM'. In the cases of the binary classifications, 'Nevus vs. BCC', 'Nevus vs. MM' and 'BCC vs. MM' classifications were performed on datasets containing only lesions from the relevant two groups, while the corresponding multiclass classification results were obtained using datasets containing all three lesion types. 'Nevus vs. others', 'MM vs. others' and 'BCC vs. others' classifications were performed using datasets containing all three types of lesions in both binary and multiclass cases. Here, the binary classification training was conducted using two classes (a certain type of lesion versus all other lesions), while the multiclass classifications used three lesion type classes for training. The parameter set (Algorithm 1) and the entire workflow (Algorithm 2) of the current work are presented in pseudo-code form at the end of the section, focusing on the details of the classification method. Details of the FA segmentation algorithm are presented in Marosán et al. [52], where figures of the procedure are also provided. Feature extraction is detailed in Table 2 and Section 2.3, Feature extraction.
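The fold construction and the two kinds of binary task described above can be sketched as follows. This is a simplified illustration and not the published implementation: the round-robin assignment, the fixed random seed and all function names are assumptions.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=10, seed=0):
    """Assign each sample a fold index so that folds keep similar class ratios."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    fold_of = [None] * len(labels)
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled samples of one class round-robin over the folds.
        for position, index in enumerate(indices):
            fold_of[index] = position % n_folds
    return fold_of


def one_vs_rest(labels, positive):
    """Binary targets for tasks such as 'Nevus vs. others' (all lesions kept)."""
    return [1 if label == positive else 0 for label in labels]


def pairwise_indices(labels, class_a, class_b):
    """Sample indices for tasks such as 'Nevus vs. MM' (third class dropped)."""
    return [i for i, label in enumerate(labels) if label in (class_a, class_b)]
```

For each fold index k, the samples with `fold_of[i] == k` would form the test group and the remaining samples the training set, with the metrics averaged over the ten folds.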

Performance Metrics
The evaluation of classification methods was carried out using the performance metrics listed below, which have been applied previously in related works [48,49].
The sensitivity Sens calculates the proportion of positive cases that are correctly detected:
$$\mathrm{Sens} = \frac{TP}{TP + FN},$$
with TP, FN denoting the number of true positives and false negatives in the classification, respectively.
The specificity Spec calculates the proportion of negative cases that are correctly detected:
$$\mathrm{Spec} = \frac{TN}{TN + FP},$$
with TN, FP denoting the number of true negatives and false positives in the classification, respectively.
The accuracy ACC describes the classification accuracy, namely, the ratio of the number of correctly detected cases to the number of all examined cases:
$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}.$$
The receiver operating characteristic ROC curve is a graphical plot that displays how Sens varies with 1 − Spec. Considering ROC as a function, its definition can be stated as
$$\mathrm{ROC}(x) = \mathrm{Sens}\,\big|_{\,1 - \mathrm{Spec} = x}.$$
Finally, using the above definition for the ROC, the area under the ROC curve AUC can be defined as:
$$\mathrm{AUC} = \int_0^1 \mathrm{ROC}(x)\,dx,$$
where x = 1 − Spec.
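The metrics defined above can be computed as in the sketch below. It assumes binary labels coded as 1 (positive) and 0 (negative) and a continuous classifier score per sample; the trapezoidal integration of the empirical ROC curve is one common way of approximating the AUC and is not necessarily the routine used in the published code.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FN, TN, FP) for binary labels coded as 1 and 0."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp, fn, tn, fp


def sens_spec_acc(y_true, y_pred):
    """Sensitivity, specificity and accuracy from hard predictions."""
    tp, fn, tn, fp = confusion_counts(y_true, y_pred)
    return tp / (tp + fn), tn / (tn + fp), (tp + tn) / (tp + fn + tn + fp)


def roc_points(y_true, scores):
    """Empirical ROC curve as a list of (1 - Spec, Sens) points."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        j = i
        # Samples with tied scores cross the threshold together.
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / n_neg, tp / n_pos))
        i = j
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2.0
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )
```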

Overview of Classification Performance
The ROC curves of the various binary classifications for all three segmentation methods are presented in Figure 2. From these curves, the AUC can be calculated, the values of which are summarized in Table 3. Although the FA method generally fares worse than the SA methods, the performance is always within 5% of the best performing method.
The ACC values for both binary and multiclass classifications are summarized in Table 4. The table depicts a similar trend to the AUC results in Table 3: the nevi are generally well distinguished from cancerous lesions, with most such classifications reaching accuracies above 80% (highlighted in bold in the table). The notable exception is the multiclass classification of 'Nevus vs. MM', which could be, in part, due to the relatively low number (N = 70) of melanoma recordings.
It should also be noted that multiclass classifications generally show a worse performance than binary classifications. This is arguably because multiclass classification forces the training classes to be smaller, outweighing the advantage that arises from being able to train on adequately distinct classes separately.
As before, the FA method generally shows a poorer performance; however, with one exception ('BCC vs. others', multiclass), the difference from the best method is never worse than 5%.
The current results also compare favourably to the results of Andrékuté et al. [49], where nevi were distinguished from melanoma with an accuracy of 82.4%: with the SA methods, the binary classification achieved an accuracy of 85.0%, while the FA method also achieved a comparable 81.1%.
For a comparison with other relevant works (all using SA segmentation), please see Table 1. Although the comparison is more difficult with other works, since they mostly set the sensitivity to 100% and then observe specificity, the current work nevertheless seems to fare well. For example, for those works where AUC values are provided, the current work is always superior in its corresponding AUC values. The specificity at 100% sensitivity shows more variable results in the current work. Overall, the classification of nevi from cancerous lesions, and in particular from both cancerous lesion types ('Nevus vs. others'), provides the best classification performance. Using multiclass classification, the FA method can distinguish nevi from cancerous lesions with an accuracy of over 85%.

Comparison of FA and SA Classification Performance with Representative Images
A difference in classification performance can be a direct result of differences in segmentation during the classification phase, or it can be an indirect result of training on differently segmented images. For the FA method, 83.5% of the lesions were detected correctly, which makes it somewhat surprising that its classification performance is close to that of the other methods. To offer putative explanations for differences between classification performances, a presentation of images (with overlaid segmentation contours) that led to different classifications could be informative. In order to be consistent, the classification discussed will be the best performing one, namely 'Nevus vs. others'. Before proceeding to discuss the images, it should be noted for context that, for this classification task, all three methods were jointly successful in 73.6% of the images, all three failed together in 5.2% of the cases, while there were differing classifications in 21.3% of the cases. In the following subsections, those cases where differing classifications are given are considered.

Cases When FA Fails While SA Methods Perform Correctly
In most cases of differing classifications, the SA methods succeeded while the FA method failed. Figure 3 presents examples for the three trends observed when analyzing this subset. In some cases, the FA segmentation method detected an image region fully outside the real lesion. As can be seen in Figure 3a, there are cases in which the image structure can be misleading when searching for the lesion location. In other cases, the FA segmentation did not detect parts of the lesion (Figure 3b) or included additional areas that were not part of the lesion (Figure 3c). In the latter case, the example shows a shadowing effect in the dermis region next to the lesion, which potentially misled the FA segmentation method.

Cases When the Two SA Methods Return Different Classifications
The SA segmentation methods, based on freehand and LAR segmentation, produced similar results generally; however, in certain cases, they led to different lesion classification results, such as in the cases presented in Figure 4. In the minority (29%) of such cases (2.5% of all images), the LAR-based method was the one failing the classifications. These were due to the shape of the lesion being such that the LAR segmentation could cover only a small portion of the lesion, which could not be expanded to cover the entire lesion even with the subsequent ACM method (Figure 4a).
In the majority (71%) of the cases discussed here (6.3% of all images), the freehand-based method led to failing classifications while the LAR produced correct results. These were presumably due to human error in the freehand segmentation, since the pixel boundaries of the freehand lesion segmentation are relatively arbitrary in contrast to the LAR-based method, where the ACM model finds the boundaries of the lesion with a higher precision (Figure 4b).

Cases When the SA Methods Both Fail While the FA Method Performs Correctly
In some cases, the 'Nevus vs. others' type classification was correct for the FA segmentation while failing for both SA segmentations. Figure 5 presents notable cases for such segmentations. Figure 5a shows a case in which the FA segmentation detected a similarly shaped and sized but slightly shifted region from that detected by the two SA segmentations. In some cases, such as the one presented in Figure 5b, the FA segmentation detected additional image regions as part of the lesion in comparison to the results of the SA segmentations, leading to a correct classification result.
In the case shown in Figure 5c, all three segmentation results matched closely; however, slight differences in their borders led to different classification results. This example emphasizes the significance of small details in automated ultrasound-image-based lesion classification performance.

Sensitivity of Classification to Changes in Lesion Segmentation
The last example in the previous subsection (Figure 5c) presented an interesting case, since it showed that slight changes in the border of the segmented lesion could lead to different classification results. Similar issues of classification sensitivity have been addressed elsewhere in the computer vision literature [61]; however, it is a challenging topic to address, partly due to the large search space involved in the sensitivity analysis. In the current work, the topic has been partially addressed in the following manner.
The ultrasound image in Figure 5c depicting a nevus was chosen as the target of the investigation, since this is where the phenomenon of changing classification due to a small change in lesion border was observed. Taking the freehand-based SA segmentation as the reference, the region was progressively grown/shrunk at various regions of the border (right edge, bottom edge, right and bottom, entire border), with the classification noted at each step. Figure 6 shows the results. Figure 6b shows that if the segmented region is shrunk only at the right edge, then a modest shrinkage of only 2 pixels' width is able to change the classification from wrong to correct. This is consistent with the behaviour previously noted in Figure 5c, where the correct FA classification has a segmentation that is slightly shrunk at the right edge compared to the other two (Figure 6a). Interestingly, where other sides of the segmented regions are also modified, it becomes more difficult for the classification to be corrected. This could be because a hyperechoic patch on the right edge of the lesion perturbs the classification, and the segmentation is otherwise correct.
Although a more systematic analysis is beyond the scope of the current paper, the above preliminary analysis does show some insight into the relatively small yet significant perturbations in classification that can occur due to variabilities in segmentation.
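The right-edge experiment described above can be mimicked with the following sketch. It is a simplified stand-in for the procedure used in the paper: the mask is a 2D list of 0/1 values, shrinking removes the rightmost foreground pixels of every row (which is only an approximation of a true border shrinkage for non-convex masks), and `classify` is a hypothetical callback representing feature extraction followed by the trained classifier.

```python
def shrink_right_edge(mask, n_pixels):
    """Remove up to n_pixels foreground pixels from the right end of each row."""
    shrunk = [row[:] for row in mask]
    for row in shrunk:
        removed = 0
        for col in range(len(row) - 1, -1, -1):
            if removed == n_pixels:
                break
            if row[col]:
                row[col] = 0
                removed += 1
    return shrunk


def first_flip(mask, classify, max_pixels=10):
    """Smallest right-edge shrinkage that changes the predicted label.

    `classify` is a hypothetical function mapping a mask to a class label.
    Returns None if the label is unchanged up to max_pixels.
    """
    baseline = classify(mask)
    for n_pixels in range(1, max_pixels + 1):
        if classify(shrink_right_edge(mask, n_pixels)) != baseline:
            return n_pixels
    return None
```

Analogous helpers for the bottom edge or the entire border would allow the same search to be repeated for the other border regions considered in the text.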

Feature Performance
The aim of this section is to evaluate the features used in the classification framework by identifying those with the largest contribution to the classification performance. Since in the SVM classifier model the features do not act in isolation but are part of a non-linear system where features support each other, the performance of a feature was deemed to be better evaluated by measuring how much performance degraded when it was left out, rather than the performance it achieves on its own. Thus, the contribution of each of the 62 selected features was examined one-by-one as follows: one feature was removed and binary classification was performed using the remaining 61 features and using LAR-based segmentation. The AUC and ACC performance metrics were computed for all 62 cases for the six types of classification (similarly to Tables 3 and 4 earlier when all features had been used). The top features, the absence of which caused the largest deterioration in classification performance, are shown in Table 5, with the worst feature also shown for reference. By removing one of the top features, the accuracy performance dropped by around 2 to 5%.
Table 5 shows several features that performed well on different classification cases, regarding both AUC and ACC performance measures. Those features that appear at least three times in the top four are in bold and are as follows (with feature indices in parentheses as listed in Section 2.3, Feature extraction): boundary heterogeneity (7); boundary contrast (8); skewness (13); entropy (15); axis ratio (18); compactness (20); and difference variance (26). Of these features, four are first-order textural features (7, 8, 13, 15), two are shape-based features (18, 20), and one is a second-order textural feature (26).
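The leave-one-feature-out procedure described above can be outlined as below. The sketch is generic: `evaluate` is a hypothetical callback that would train and cross-validate the classifier on the given feature subset and return a score such as AUC or ACC, and it is not part of the published code.

```python
def leave_one_out_ranking(feature_names, evaluate):
    """Rank features by the score drop observed when each one is removed.

    `evaluate` is a hypothetical function taking a list of feature names and
    returning a cross-validated score (for example AUC or ACC).
    """
    baseline = evaluate(list(feature_names))
    drops = {}
    for name in feature_names:
        remaining = [f for f in feature_names if f != name]
        drops[name] = baseline - evaluate(remaining)
    # Largest drop first: these features contribute most to performance.
    return sorted(drops.items(), key=lambda item: item[1], reverse=True)
```

With 62 features this requires 63 cross-validated evaluations per classification task, one for the full set and one for each reduced set.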
When considering the possibility of applying the trained classifier to skin ultrasound images from other devices, the two shape-based features are arguably the most transposable; with the other features, some form of domain adaptation may be required [62]. In both the 'Nevus vs. others' and 'Nevus vs. MM' classification tasks, shape-based features were prominent, in agreement with the results of [49], where shape-based features provided the highest classification performance for distinguishing between nevi and melanoma.

Runtime Measurements
While the segmentation seeding algorithm was implemented in Python, the growing step, feature extraction, classification and evaluation were implemented in MATLAB R2018b (MathWorks, Inc., Natick, MA, USA). Table 6 details the computational cost of the proposed algorithm. Each ultrasound image had a size of 900 × 400 pixels and the computer used had an Intel Core i7-7500U CPU (2.70 GHz) processor and 16 GB RAM.

Conclusions
An automated framework for skin lesion classification was presented and an FA and two SA segmentation methods were compared. Ultrasound images of three types of lesions were used for the classifications: MMs, BCCs and benign nevi. Both binary and multiclass classifications were performed and evaluated, using six types of class differentiations, for all three segmentation methods, separately.
The best results were generally obtained for 'Cancerous vs. Non-cancerous' ('Nevus vs. others', 'Nevus vs. BCC', 'Nevus vs. MM') type binary classifications (>90% AUC). The two SA methods generally produced better results than the FA method, but with relatively slight differences (between 0.7% and 3.9%), such that the FA also provided an AUC of around 91% for the binary classification between nevi and the two cancerous lesion types. The achieved accuracies were similar to those obtained by Andrékuté et al. [49] when they differentiated between nevi and melanoma: 85.0% with the SA methods and 81.1% with the FA method, compared with 82.4%. The classification of nevi from cancerous lesions had even higher accuracies of above 85% even with the FA method. The result demonstrates the viability of FA skin cancer classification from ultrasound images.
Since features can be highly dependent on the type of ultrasound image they are applied to, it is worth noting that, in the case of the best performing classification task of distinguishing nevi from cancerous lesions, the top two performing features in terms of accuracy were shape-based features, since such features are more adaptable for different types of ultrasound images. Nevertheless, future work should focus on applying domain adaptation techniques to ensure the classification framework here presented can also be applied to skin ultrasound images produced by other devices.

Institutional Review Board Statement:
The acquisition of the ultrasound images was conducted according to the guidelines of the Declaration of Helsinki, and approved by the competent authority National Institute of Pharmacy and Nutrition, Budapest, Hungary (protocol code OGYÉI/16798/2017).

Informed Consent Statement:
Informed consent was obtained from all subjects involved in the data acquisition.

Acknowledgments:
The authors thank Miklós Sárdy (director) and Sarolta Kárpáti (former director) of the Department of Dermatology, Venereology and Dermatooncology of Semmelweis University for their generous scientific support and encouragement of the current research. The authors also thank Jan D'hooge, from the Department of Cardiovascular Sciences, KU Leuven, for advising on the second-order textural features.

Conflicts of Interest:
Some of the co-authors are employed by and hold financial interest in Dermus Kft. The company may plan to use the scientific results in its own research and development efforts. Other than that, the authors have no potential conflicts of interest to disclose.

Abbreviations
The following abbreviations are used in this manuscript: