High-Frequency Ultrasound Dataset for Deep Learning-Based Image Quality Assessment

This study addresses high-frequency ultrasound image quality assessment for computer-aided diagnosis of the skin. In recent decades, high-frequency ultrasound imaging has opened up new opportunities in dermatology, utilizing the most recent deep learning-based algorithms for automated image analysis. An individual dermatological examination contains either a single image, a few pictures, or an image series acquired during the probe movement. The estimated skin parameters may depend on the probe position, orientation, or acquisition setup; consequently, the more images analyzed, the more precise the obtained measurements. Therefore, for automated measurements, the best choice is to acquire an image series and then analyze its parameters statistically. However, besides the correctly acquired images, the resulting series contains plenty of non-informative data: images with various artifacts, noise, or frames captured at time points when the ultrasound probe had no contact with the patient's skin. All of them influence further analysis, leading to misclassification or incorrect image segmentation. Therefore, an automated image selection step is crucial. To meet this need, we collected and shared 17,425 high-frequency images of the facial skin from 516 measurements of 44 patients. Two experts annotated each image as correct or not. The proposed framework utilizes a deep convolutional neural network followed by a fuzzy reasoning system to automatically assess the quality of the acquired data. Different approaches to binary and multi-class image analysis, based on the VGG-16 model, were developed and compared. The best classification results reach 91.7% accuracy for the binary and 82.3% for the multi-class analysis, respectively.


Introduction
During the last decades, high-frequency ultrasound (HFUS, >20 MHz) has opened up new diagnostic paths in skin analysis, enabling visualization and diagnosis of superficial structures [1,2]. Therefore, it has gained popularity in various areas of medical diagnostics [3,4] and is now commonly used in medical practice [5]. In oncology, it helps in the determination of skin tumor depth, prognosis, and surgical planning [1,6], enabling differentiation between melanoma, benign nevi, and seborrheic keratoses [7]. Heibel et al. [6] presented HFUS as a reliable method with excellent intra- and inter-observer reproducibility for the measurement of melanoma depth in vivo. Sciolla et al. [8] described the spatial extent of basal-cell carcinoma (BCC), provided by HFUS data analysis, as a crucial parameter for surgical excision. Hurnakova et al. [9] investigated the ability of HFUS (22 MHz) in rheumatology to assess cartilage damage in the small joints of the hands in patients with rheumatoid arthritis (RA) and osteoarthritis (OA). In the newest study, Cipolletta et al. [10] describe the usefulness of 22 MHz ultrasound images for hyaline cartilage diagnostics. The skin thickness and stiffness measurements are recognized by Chen et al. [11]. The dataset, as described in [30], can be found in Mendeley Data. We collected and shared the face HFUS image database described in this paper to meet this need.
One possible solution, which can partially mitigate the overfitting problem when training from scratch, is data augmentation. Feasible alternatives are semi-supervised learning, transfer learning (TL), learning from noisy labels, and learning from computer-generated labels [31]. TL in particular is reported as widely applicable in medical image processing tasks [32,33], where limited training data is a common problem. In this approach, knowledge extracted from large, well-annotated, available datasets (e.g., ImageNet [34]) is reused in the task at hand.
Fast and robust classification steps in medical applications are essential for further use in clinical practice. Moreover, a visual explanation of the system decision (such as a Grad-CAM map [35]) supports its recommendation for clinical use ('explainable AI'). Noise or artifacts influencing the geometry of the visualized structures may lead to misclassification, false-positive detections, over- or under-segmentation and, in consequence, inaccurate measurement results. To solve these problems, image quality assessment (IQA) algorithms are developed [36-38]. Very popular, yet poorly correlated with human judgments of image quality, are the mean-squared error (MSE), the related peak signal-to-noise ratio (PSNR), and the somewhat more effective structural similarity index (SSIM) [39]. All of these assume that the original (reference) image is known. According to [40], optical images can be distorted at any stage of their acquisition, processing, compression, etc., and a reliable IQA metric is critical for evaluating them. Distortion-specific BIQA (blind image quality assessment) methods provide high accuracy and robustness for known distortion types or processes. Unlike the previous methods, they do not require the original image. However, since the distortion type is rarely specified, their application scope is limited. Therefore, natural scene statistics (NSS), including local DCT (discrete cosine transform) or wavelet coefficients describing contrast or gradient features, are utilized [41]. A DGR (distortion graph representation) based solution is presented in [40]; it considers the relationship between distortion-related factors and their effects on perceptual quality. Since the blind measures are distortion-specific, general no-reference (NR) IQA methods have been studied in recent years [39]. Both BIQA and NR-IQA have been extended to stereo images [42], VR images [43], and many other currently investigated image types.
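For illustration, MSE and PSNR are straightforward full-reference measures (SSIM requires a windowed computation and is available in, e.g., scikit-image); a minimal numpy sketch:

```python
import numpy as np

def mse(ref, img):
    """Mean-squared error between a reference image and a distorted one."""
    return float(np.mean((ref.astype(np.float64) - img.astype(np.float64)) ** 2))

def psnr(ref, img, max_val=255.0):
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    m = mse(ref, img)
    return float('inf') if m == 0.0 else 10.0 * np.log10(max_val ** 2 / m)
```

Both measures assume the undistorted reference is available, which is exactly why blind (no-reference) methods are needed for acquired ultrasound data.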
As reported in [37], most IQA methods and research studies focus on optical images. Since medical image quality is highly related to the application, and in some cases low-contrast and noisy images can still be acceptable for medical diagnosis, medical image quality assessment differs from the others [36]. Such methods consider multiple expert opinions to label the data and utilize the benefits of AI (artificial intelligence). An application of CNNs to IQA of retina images can be found in [38]; the authors use DenseNet to classify the images into good and bad quality, or into five categories: adequate, just noticeable blur, inappropriate illumination, incomplete optic disc, and opacity. Piccini et al. [44] utilized CNNs to assess the image quality of whole-heart MRI. The only two solutions for ultrasound IQA, both based on CNNs, are given in [37,45]. The chronologically first scheme [45] targets fetal US image quality assessment in clinical obstetric examinations. The second one [37] is claimed to be universal, covering different US image types. In the designed framework, the network is trained on the benchmark dataset LIVE IQ [46] and then fine-tuned using ultrasound images.
As mentioned before, the HFUS image processing algorithms described in the literature [2,14,15,17] assume that the input dataset consists of preselected good-quality image data. Among many possible applications, CNNs were first applied to reduce the analyzed HFUS dataset to its informative part in [10]. In the current work, we decided to follow this path and automatically select the correct frames from the acquired dataset, i.e., to assess the HFUS image quality. This solution enables automated analysis of HFUS records while avoiding the influence of incorrect detections on the analysis results. Due to the absence of such frameworks for HFUS skin images, the two main contributions of our study are as follows. The first is the database of 17,425 HFUS frames of facial skin, annotated by two experts (three times in total) as either noisy (inaccurate for analysis) or correctly acquired [47]. The proportion of correct to incorrect data is about 1:1.3. The data description includes the demographic features of the patient cohort, places of image acquisition on the face, acquisition dates, and system parameters. Second, we present different deep learning-based frameworks, including one followed by a fuzzy inference system, for automatically annotating the frames. The analysis is conducted in two ways: classifying the data into correct and incorrect, and dividing them into four groups, depending on the experts' majority decision.
Our extensive image database includes data acquired during actual dermatological ultrasound examinations. Thus it contains: • images distorted by artifacts from a trembling hand holding the US probe or from impurities contained in the ultrasound gel; • frames captured when the ultrasound probe was not adhered, or incorrectly adhered, to the patient's skin, or when the angle between the ultrasound probe and the skin was too small (the proper angle is crucial for HFUS image acquisition); • images with contrast too low for reliable diagnosis, or captured with too little gel volume, improper for epidermis layer detection; • data with disturbed geometry, as well as HFUS frames with common ultrasound artifacts like acoustic enhancement, acoustic shadowing, beam width artifact, etc.
Due to the image variety, the high number of possible distortions, and the subjective expert opinion, which is not always connected with them, the application of IQA methods dedicated to optical images is impossible (Zhang et al. strongly underline this in [37]). A portion of the images is hard to decide on (even for the experts, see Figure 1): they can be useful in diagnosis, but due to some artifacts, their analysis might be error-prone. Therefore, following the works in medical IQA [37,38,44] and image selection [10], we propose a CNN-based framework, a combination of the two, which enables HFUS skin image analysis. The images selected by our methods are of high quality, or informative, and accurate for diagnosis and processing. Depending on the application and user needs, the obtained results can be utilized in two ways. First, when a high number of frames is acquired, only those classified as definitely good (label 4 in Table 1) should be considered. Second, for US records with a limited number of frames, the image data labeled as 2 and 3 (in Table 1) can also be taken into account, yet the results of their further automated analysis (segmentation or classification) should be treated as less trustworthy (assuming two trust levels, higher and lower, connected with labels 2 and 3, respectively). This is the first application of a CNN to this task in HFUS images and the first combining a CNN with a fuzzy inference system.
The dataset developed in this study is described in detail in Section 2. The description is followed by a numerical analysis of the data and expert annotations. The classification steps are presented in Section 3, including two-class (Section 3.1) and multi-class (Section 3.2) analysis. The model assessment and other results are given in Section 4. The study is discussed and concluded in Section 5.

Materials
The dataset includes high-frequency images (image sequences) of female facial skin. The data were collected during 4 sessions (the session dates are given as data IDs in the format [day month year]) with 44 healthy Caucasian subjects aged between 56 and 67 years (mean = 60.64, std = 2.61), all postmenopausal. As an anti-aging skin therapy, the patients were treated with a trichloroacetic acid (TCA) chemical peel. The first image data were acquired before the first acid application, and the patients were divided into a treated (23) and a placebo (21) group. The data were registered at three different locations on the patient's face. The locations and ultrasound probe movement directions are visualized in Figure 1 by three arrows superimposed on a facial model; the image acquisition starts where the arrow begins and ends at the arrowhead. At each patient visit, three HFUS series were registered, with several dozen (about 40) HFUS images collected in a single series for each location. The original image resolution was equal to 1386 × 3466 [pix], and the pixel size is equal to 0.0093 × 0.0023 [mm/pix] (axial × lateral). The analyzed HFUS image data were acquired using a DUB SkinScanner75 with a 24 MHz transducer (B-mode frequency, 8 mm depth, and acoustic intensity level 40 dB). Each series includes both image data suitable for further diagnosis (technical, i.e., using CAD software, or medical) and data that are not. The second group includes, for example, ultrasound frames captured when the ultrasound probe was not adhered, or incorrectly adhered, to the patient's skin, and when the angle between the ultrasound probe and the skin was <70 degrees. Exemplary HFUS images annotated as suitable ('ok') or not ('no ok') for further analysis are given in Figure 1.
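As a quick sanity check of the stated geometry (which pixel dimension corresponds to the 8 mm depth is our reading, not stated explicitly), the physical image extents follow directly from the pixel counts and pixel sizes:

```python
# Stated acquisition geometry: 1386 x 3466 px at 0.0093 x 0.0023 mm/px
axial_px, lateral_px = 1386, 3466
axial_mm_per_px, lateral_mm_per_px = 0.0093, 0.0023

axial_extent_mm = axial_px * axial_mm_per_px        # ~12.89 mm
lateral_extent_mm = lateral_px * lateral_mm_per_px  # ~7.97 mm

print(axial_extent_mm, lateral_extent_mm)
```

The 3466-pixel dimension times 0.0023 mm/px gives roughly 8.0 mm, consistent with the stated 8 mm acquisition depth.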
The HFUS examinations were performed by a beginner sonographer (without any experience in HFUS image acquisition and analysis, but working with conventional US in his scientific practice): IDs 15022021 and 12042021, and by an experienced one (a graduate of the Euroson School Sono-Derm, working with HFUS image analysis for 3 years): IDs 08032021 and 07062021. In total, 17,425 HFUS images were acquired.
After the data collection step, the complete dataset was labeled by two experts in HFUS data analysis. One of them annotated the data twice with an interval of one week.
Hence, the further description includes three annotations denoted as Expert 1, Expert 2, and Expert 3; however, the labels Expert 1 and Expert 2 refer to the same person (the annotations of the first expert, made a week apart). The agreement between all the experts in useful image selection was analyzed statistically using both confusion matrices (given in Figure 2) and the unweighted Cohen's kappa [10], interpreted according to Cipolletta et al. [10] and Landis and Koch [48] (see Figure 3). The analysis was performed using a Matlab library [49]. The agreement between the experts was substantial to perfect, with no difference between the intra- and inter-observer results.
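Unweighted Cohen's kappa compares the observed agreement between two annotators with the agreement expected by chance. The paper's analysis used a Matlab library [49]; the snippet below is an equivalent illustrative Python sketch:

```python
import numpy as np

def cohens_kappa(a, b):
    """Unweighted Cohen's kappa between two annotators' label vectors."""
    a, b = np.asarray(a), np.asarray(b)
    labels = np.union1d(a, b)
    po = float(np.mean(a == b))                       # observed agreement
    pe = float(sum(np.mean(a == c) * np.mean(b == c)  # chance agreement
                   for c in labels))
    return (po - pe) / (1.0 - pe)
```

A kappa of 1.0 corresponds to perfect agreement and 0.0 to chance-level agreement; the interpretation bands (slight to almost perfect) follow Landis and Koch [48].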

Methods
There are different ways of conducting ultrasound-based diagnostic procedures. Depending on the application, the sonographer acquires either a single image or an image series. The second approach is better when a further automated image processing step is introduced, as simultaneous analysis of multiple data provides reliable results less prone to artifacts and outliers. At the same time, the analysis of the whole recording might be disturbed by strongly distorted data or by artifacts influencing the geometry of the visualized structures, appearing on a part of the frames. Consequently, this leads to misclassification, false-positive detections, and, finally, inaccurate measurement results. Therefore, the overall goal of this study was to develop and evaluate a classification framework that enables robust and fast HFUS series analysis.
Numerical analysis of the image annotations provided by the experts, described in Section 2, shows that manual image labeling is a nontrivial issue. While most of the images were unambiguously annotated as correct or not, there are image data (in our case, 15%) on which the experts disagree. This group contains partially disturbed images that still have diagnostic potential. Considering this, we first divide the data into unambiguous and ambiguous, which enables the selection of a CNN model suitable for further analysis. Then, the developed methods follow two directions: binary classification and multi-class analysis. The first one divides the image data into two groups, denoted as correct and incorrect; the second divides them into four groups, according to the labels included in Table 1.

Binary Classification
The first goal of this step is the selection of the CNN model providing the most reliable classification results. Based on our previous experience [4] and the recent papers in medical IQA [38] and informative HFUS image selection [10], we consider the two most promising architectures. The first one is DenseNet-201 [50] and the second is VGG16 [51]. Both were pre-trained on the ImageNet dataset and then used for transfer learning. DenseNet uses features of all complexity levels, giving smooth decision boundaries and performing well when the training data is insufficient, whereas VGG16 is described as suitable for small training sets with low image variability [10]. Both architectures were adapted to the binary classification problem. The DenseNet-201 architecture was trained with the first 140 convolution layers frozen (as in [4]) and the remaining ones tuned, whereas in the VGG16 model, following [10], 10 convolution layers were frozen.
Prior to training, RGB US frames were resized to 224 × 224 × 3 pixels. The stochastic gradient descent optimizer with a momentum of 0.9, categorical cross-entropy as the loss function, a batch size of 64, and an initial learning rate of 0.0001 were chosen as the most efficient in a series of experiments [4,10]. The authors of [10] suggested 100 epochs for training the VGG16 model; however, due to the observed overfitting (the validation accuracy did not change while the validation loss increased), we shortened the training to 50 epochs. In further iterations, no significant improvements in the training curves were visible, and the validation loss tended to increase. The same training parameters were applied for the binary and multi-class models.
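The optimizer implied by these settings is classical SGD with momentum; a minimal numpy sketch of a single parameter update with the reported learning rate and momentum (the velocity convention below is one common formulation, e.g., as in Keras and Matlab):

```python
import numpy as np

# Hyperparameters reported in the text
LR, MOMENTUM, BATCH_SIZE, INPUT_SHAPE = 1e-4, 0.9, 64, (224, 224, 3)

def sgd_momentum_step(w, grad, v, lr=LR, momentum=MOMENTUM):
    """One SGD-with-momentum update: the velocity accumulates past gradients."""
    v = momentum * v - lr * grad
    return w + v, v
```

With a momentum of 0.9, each past gradient keeps influencing roughly the next ten updates, which smooths the optimization trajectory over noisy mini-batches.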
For the binary classification, the models are trained five times (see the annotations 'CNN training' in Figures 5 and 6). Three of these runs correspond to the three separate expert annotations (Expert 1 labels, Expert 2 labels, Expert 3 labels). The fourth one considers only the part of the data on which the experts agreed (labels 1 and 4), whereas the fifth one (in path2) utilizes the labels resulting from the previous voting step, i.e., selecting the most frequently indicated label. These models are utilized in the four processing paths shown in Figures 5 and 6 and described below.
The voting step utilized in binary classification calculates a binary output based on three labels, provided either by the experts or resulting from the analysis. The first variant is applied in path2, where the binary labels required for model training are calculated from the expert annotations: a US frame indicated at least twice as 'ok' is considered 'ok', and a US frame indicated at least twice as 'no ok' is considered 'no ok'. This corresponds to the group labels in Table 1: 4 and 2 for 'ok', and 1 and 3 for 'no ok', respectively. In path3, three separate models (one for each expert) are trained and tested, and the final binary classification result is calculated as previously: the label occurring at least twice determines the output. The binary output selection used in path4 is described in detail in Section 3.1.4.
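The majority vote over three binary labels is trivial to implement; a minimal sketch, with 'ok' encoded as 1 and 'no ok' as 0:

```python
def majority_vote(labels):
    """Return the label indicated by at least two of the three annotations."""
    assert len(labels) == 3
    return int(sum(labels) >= 2)
```

The same function serves both variants: applied to expert annotations it produces training labels (path2), and applied to per-expert model outputs it produces the final classification (path3).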

Path1
This scheme (Figure 5, left) starts from the reliable image selection step, based on the annotations provided by all the experts. By reliable images, we mean the part of the input data for which all the experts agreed (labels 1 and 4 from Table 1). The CNN model is trained on these and then applied to all the image data (labels 1 to 4).

Path2
In this processing path (Figure 5, right), the CNN model is trained on all the input data, and the binary input labels are calculated in the voting step (v1), which selects the most frequently indicated label among the three expert annotations.

Path3
This framework (Figure 6, v1) trains a CNN and classifies the data independently for each expert's annotations. The obtained results are then combined by voting (v1), i.e., selecting the most frequent resulting label.

Path4
This path (Figure 6, v2) follows the same framework as path3, with the difference that the voting step utilizes a Mamdani Fuzzy Inference System (FIS) [52], followed by uniform output thresholding (t ∈ {0.25, 0.5, 0.75}) for the final decision (see Figure 7). The membership functions of the fuzzy sets in the rule premises and conclusions are the same for the inputs and the output and are presented in Figure 7. As the FIS inputs, we introduce the CNN-predicted class scores. The FIS output can also be used as a confidence measure for further analysis, where images classified as 'definitely' correct are rewarded.
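The actual membership functions and rule base are those shown in Figure 7; the sketch below is only an illustrative Mamdani pipeline with assumed linear input fuzzifiers, triangular output sets, and majority-style rules, showing how three CNN class scores yield a defuzzified confidence value that is then thresholded:

```python
import numpy as np

u = np.linspace(0.0, 1.0, 101)  # discretized output universe

def tri(x, a, b, c):
    """Triangular membership function with feet a, c and peak b."""
    return np.clip(np.minimum((x - a) / (b - a), (c - x) / (c - b)), 0.0, 1.0)

mf_no = tri(u, -1.0, 0.0, 1.0)  # output set 'no ok' (peak at 0)
mf_ok = tri(u, 0.0, 1.0, 2.0)   # output set 'ok' (peak at 1)

def fis_vote(scores, t=0.5):
    """scores: 'ok' class scores from the three per-expert CNN models."""
    ok = np.asarray(scores, dtype=float)  # assumed linear fuzzifier
    no = 1.0 - ok
    pairs = [(0, 1), (0, 2), (1, 2)]
    # Majority-style rules: "if model i says ok AND model j says ok, then ok"
    w_ok = max(min(ok[i], ok[j]) for i, j in pairs)
    w_no = max(min(no[i], no[j]) for i, j in pairs)
    # Mamdani implication (clipping) and max aggregation
    agg = np.maximum(np.minimum(w_ok, mf_ok), np.minimum(w_no, mf_no))
    crisp = float(np.sum(u * agg) / np.sum(agg))  # centroid defuzzification
    return crisp, crisp >= t
```

The crisp output in [0, 1] can serve directly as the per-image confidence measure mentioned above, while the threshold t reproduces the binary decision.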

Multi-Class Analysis
The multi-class analysis is performed in two ways. In the first solution, the previously obtained binary classification results are combined to provide the final result. In the second one, the model is adapted to 4-group classification and trained again. As before, different processing paths are introduced to obtain the final classification results (see Figures 8 and 9).

Path5
In the first experiment, the Group labels defined in Table 1 are used for 4-group CNN model training. The trained model is then directly used for data classification.

Path6
The second processing path here (path6) refers to path1 in the binary classification. The CNN model is trained on the reliable image data and then used to classify all the data, and the predicted class score is uniformly thresholded to obtain the final classification result.
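A minimal sketch of this uniform thresholding, assuming the binary model's 'ok' class score is mapped to the four groups of Table 1 ordered by increasing confidence as 1, 3, 2, 4 (this ordering is our reading of the label correspondence given for the voting step, not stated explicitly here):

```python
def score_to_group(p_ok, thresholds=(0.25, 0.5, 0.75)):
    """Map the predicted 'ok' class score in [0, 1] to a four-group label."""
    # Assumed order: definitely 'no ok' (1), probably 'no ok' (3),
    # probably 'ok' (2), definitely 'ok' (4)
    groups = (1, 3, 2, 4)
    return groups[sum(p_ok >= t for t in thresholds)]
```

The uniform thresholds match the t values used for the FIS output in the binary path4.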

Path7
Path7 refers to path3 in the binary classification. Three CNN models are trained separately, and the final labeling is based on the scheme given in Table 1, with the difference that we take into account not the expert annotations but the outputs of the three models.
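With the expert annotations replaced by the three models' binary outputs, the Table 1 scheme reduces to counting 'ok' votes (our reading, following the label correspondence given for the binary voting step: 4 and 2 for 'ok', 1 and 3 for 'no ok'):

```python
def predictions_to_group(preds):
    """preds: binary outputs of the three per-expert models (1 = 'ok').

    0 'ok' votes -> group 1, 1 vote -> group 3,
    2 votes -> group 2, 3 votes -> group 4.
    """
    return {0: 1, 1: 3, 2: 2, 3: 4}[sum(preds)]
```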

Experiments and Results
To assess all the experiments, we used external 5-fold cross-validation, with the remaining non-test data divided into training and validation subsets (4:1 ratio). All the experiments are marked on the classification schemes using red arrows and 'Evaluation #nb' tags. To measure the performance of all the introduced approaches, we compute the accuracy (ACC), precision, recall, and f1-score. Additionally, due to the class imbalance, we use confusion matrices to capture all the classified and misclassified records class-wise (see Figures 10 and 11). Finally, to measure the agreement between the automatic algorithms and the experts, we utilize the unweighted Cohen's kappa.
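For reference, the reported binary measures follow directly from the confusion matrix; a self-contained numpy sketch (equivalent functions are available in scikit-learn):

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, f1, and the 2x2 confusion matrix."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    acc = (tp + tn) / len(y_true)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    cm = np.array([[tn, fp], [fn, tp]])  # rows: true class, cols: predicted
    return acc, prec, rec, f1, cm
```

Under class imbalance, the confusion matrix and f1-score are more informative than accuracy alone, which motivates reporting all of them.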
The analysis starts from the CNN model selection. Based on the literature review [2,10,38], the most recent studies in HFUS classification [2], ultrasound IQA [38], and informative HFUS frame selection [10] favor two CNN models: DenseNet and VGG16. The most promising model is then utilized in the following experiments. For this, we trained and tested both considered architectures, DenseNet-201 and VGG16, for each expert separately (Evaluation #4). The obtained performance measures are gathered in Table 3. On this basis, we selected the VGG16 model for further analysis. Since it is used in the subsequent processing steps, we first evaluated the classification performance of the selected VGG16 model on the reliable labels only (Evaluation #1). According to the Cohen's kappa analysis, we obtained perfect agreement with the experts (kappa = 0.9177) and a classification accuracy of 0.9595. Because the image set was reduced to the reliable labels, these results cannot be compared with any of the following ones. However, they prove that, for the collection of images unequivocally classified by the experts, the ability of the VGG16 model to indicate the correct data is good (as we expected from [10]).
Next, we analyzed the developed extensions of the direct CNN-based technique (see Figure 10). For the binary classification, the best results were obtained using path4, utilizing the combination of the CNN with the FIS (Evaluation #6): ACC equal to 0.9170 and f1-score equal to 0.9076. Slightly worse performance measures (ACC equal to 0.9158 and f1-score equal to 0.9074), yet a higher recall of 0.9266, resulted from the classical CNN-based approach, path1 (Evaluation #2). According to the Cohen's kappa analysis, both of them, as well as path2 (Evaluation #3), provided perfect agreement (see Table 4). The combination of three separately trained models followed by the selection of the most frequent label performed worst in this case.
Finally, we evaluated the abilities of the multi-class classification. Following Table 1, we considered four groups and four different processing frameworks, given as path5 to path8. The obtained results are collected in Figure 11. For this analysis, the best evaluation results were provided by the classical CNN-based version (path5), without any modification. However, the same as for all of the others (path6 to path8), the Cohen's kappa analysis indicates only substantial agreement. Moreover, according to the confusion matrices, the best-recognized class in all the experiments is 1 (all experts labeled the image 'no ok'), the second is 4 (all experts labeled the image 'ok'), and classes 2 and 3 are hard for the algorithms to distinguish.

Discussion and Conclusions
Since the correct acquisition of US and HFUS images is essential for further accurate data analysis, in this study, we describe possible solutions aiming at 'correct' image identification. We believe this step increases the reliability of HFUS image processing. The obtained results can be used in two ways. First, image data classified by the software as incorrect can be excluded from further automated analysis. Second, the analysis of the remaining data can be weighted based on the system output for the kept samples. Our work is the first application in this area, i.e., HFUS images of facial skin, and the first to apply AI to this task.
The first contribution of our study is the database of 17,425 HFUS images of facial skin [47] registered by two sonographers. Two experts annotated all the image data (one of them annotated it twice), and a detailed analysis of this expertise is provided in this work. On this basis, we can first conclude that the proportion of correct to incorrect images decreases from 1:1.3 to 1:2 if a less experienced person performs the examination. Image analysis and classification methods would provide worse and less reliable measurements in this case. Next, there exists a group of images that the experts cannot unambiguously annotate (see Figures 2 and 3), and their automated classification by the system is also problematic. They can be considered together (labels 2 and 3), and during further numerical analysis, we can treat them as having less impact on the processing results.
The second contribution includes the different frameworks, developed, introduced, or verified, for the automated classification of HFUS images as correct (sufficient for further analysis) or not. We analyzed two CNN models previously applied to similar problems [4,10], DenseNet201 and VGG16, as having potential for HFUS frame selection. The numerical analysis favors the latter. Using the VGG16 model as a base for further modifications, and as the best among the state of the art in HFUS image analysis, we proposed different frameworks to classify the image data into two or four groups. From our observations, the binary classification results are more accurate than the multi-class analysis and can be applied in other HFUS image processing techniques. The best results were obtained for the developed combination of the CNN model and the FIS; in this case, the FIS-based extension outperforms the plain VGG16 model. However, the limitation of the binary solutions is that they are trained and verified using the labels resulting from the voting step. This means that the 'correct' group includes image data labeled as 'ok' both by all the experts and by only two of them; the same applies to the 'incorrect' group. This solution assumes that the data annotated as 'ok' by most of the experts can be considered in the other processing steps (i.e., segmentation or further classification). To reduce the influence of the two middle labels (2 and 3) on image analysis, we suggest assigning a confidence level to each analyzed image, utilizing the FIS outputs. The histograms of the FIS outputs for the binary classification are given in Figure 12. It is worth mentioning that both the analyzed models and the FIS systems are made available in [47].
To reduce the imbalance of the group sizes, especially in the four-class analysis, it is possible to introduce an augmentation step during training. However, based on our previous experience, the augmentation procedures should be selected carefully to avoid additional artifacts due to the specific data appearance. Besides this, future improvements can include a three-class analysis, other body parts and diseases, and a broader range of frequencies and HFUS machines commonly used in dermatological practice, like 33, 50, or 75 MHz. Additionally, we plan to introduce FIS output weights as a pre-processing step for the previously described segmentation [17] and classification [4] frameworks to evaluate their influence on the obtained results. Moreover, the approach needs to be validated in clinical practice.
In conclusion, this study describes the first step of HFUS image analysis. The developed algorithm makes it possible to automatically select correctly acquired US frames among all the images collected during a US examination. Applied as a pre-processing step, this method will decrease the influence of misclassifications or over-/under-segmentation and improve the reliability of the measurements. Furthermore, it can be used instead of pre-processing steps targeting artifact reduction. The frame selection step is crucial since the proportion of correct to incorrect scans is about 1:1.5. On the other hand, due to the high number of images acquired during a single examination, manual data selection is time- and cost-consuming, and the developed technique solves this problem.