Application of Deep Learning Model in the Sonographic Diagnosis of Uterine Adenomyosis

Background: This study aimed to evaluate the diagnostic performance of a deep learning (DL) model for the detection of adenomyosis on uterine ultrasonographic images and to compare it with that of intermediate ultrasound-skilled trainees. Methods: A prospective observational study was conducted between 1 and 30 April 2022. Transvaginal ultrasound (TVUS) diagnosis of adenomyosis was investigated by an experienced sonographer in 100 fertile-age patients. Video clips of the uterine corpus were recorded and sequential ultrasound images were extracted. Intermediate ultrasound-skilled trainees and the DL model were asked to make a diagnosis by reviewing the uterine images. We evaluated and compared the accuracy, sensitivity, positive predictive value, F1-score, specificity and negative predictive value of the DL model and the trainees for adenomyosis diagnosis. Results: The accuracy of the DL model and of the intermediate ultrasound-skilled trainees for the diagnosis of adenomyosis was 0.51 (95% CI, 0.48–0.54) and 0.70 (95% CI, 0.60–0.79), respectively. Sensitivity, specificity and F1-score of the DL model were 0.43 (95% CI, 0.38–0.48), 0.82 (95% CI, 0.79–0.85) and 0.46 (95% CI, 0.42–0.50), respectively, whereas the intermediate ultrasound-skilled trainees had a sensitivity of 0.72 (95% CI, 0.52–0.86), a specificity of 0.69 (95% CI, 0.58–0.79) and an F1-score of 0.55 (95% CI, 0.43–0.66). Conclusions: In this preliminary study, the DL model showed a lower accuracy but a higher specificity than intermediate-skilled trainees in diagnosing adenomyosis on ultrasonographic images.


Introduction
Adenomyosis is a benign gynecological disease characterized by the presence of endometrial glands and stroma within the myometrium, together with reactive hyperplasia and hypertrophy of the muscular layer [1]. Adenomyosis can cause symptoms such as heavy menstrual bleeding, dysmenorrhea and infertility [2-6].

Pathological examination of myometrial specimens remains the gold standard for the diagnosis of adenomyosis, and its estimated prevalence ranges from 21% to 36% among hysterectomized women. However, only a small, selected percentage of symptomatic women with adenomyosis undergo hysterectomy, and the real prevalence of the disease is underestimated [3,7-9].
Transvaginal ultrasound (TVUS) represents the method of choice for the non-invasive diagnosis of adenomyosis, with adequate sensitivity and specificity [10-12]. Standardization of the terminology used to describe the myometrium with the Morphological Uterus Sonographic Assessment (MUSA) allowed universal recognition and assessment of typical adenomyotic ultrasound features [2,3,13].
Despite being low cost and easily accessible, TVUS has some limitations for the diagnosis of adenomyosis. It is an operator-dependent technique, with adequate diagnostic performance and inter-operator reproducibility only if performed by expert sonographers [10,14,15]. In particular, Rasmussen et al. observed a moderate intra-operator agreement and a poor inter-operator agreement among medium-experienced raters for the diagnosis of adenomyosis [14]. Therefore, expert sonographers in dedicated centers are recommended for the diagnosis of adenomyosis [9], with an overall TVUS sensitivity and specificity of 81% and 87%, respectively [12].
Recently, the need to improve efficiency in all clinical settings through technological advances has led to the development of powerful instruments such as artificial intelligence (AI) [16]. AI is defined as the use of several complex algorithm-based applications that can solve problems by simulating human cognitive functions, including data learning and processing, problem solving and decision making [17]. Machine learning (ML) and deep learning (DL) are among the most recently developed technologies in this area. DL is a subfield of machine learning that can continuously incorporate new data through self-learning, thus increasing the performance of the application itself, and that can find correlations that humans cannot [18].
AI applicability for healthcare purposes has already been partially investigated and has shown promising results in several medical fields, including gynecology. In particular, AI in gynecological studies was tested for several tasks on medical images, including discriminating malignancy or benignity of ovarian masses, diagnosing cervical cancer, staging endometrial cancer, or diagnosing rectosigmoid endometriosis [17,19,20].
To the best of our knowledge, no study has ever assessed the accuracy of DL in the diagnosis of adenomyosis using TVUS. Therefore, the aims of this study were to evaluate the diagnostic performance of DL in the diagnosis of adenomyosis on uterine ultrasonographic images and to compare it with that of intermediate ultrasound-skilled trainees.

Study Protocol and Selection Criteria
This was a proof-of-concept, monocentric, observational, cross-sectional study, conducted in a tertiary academic centre. The whole study followed an a priori protocol previously drawn up according to STROBE guidelines and checklist [21].
Exclusion criteria were as follows: age less than 18 years, virgo intacta status, ongoing or recent (less than 6 months) pregnancy, suspicion of gynecological malignancy, previous hysterectomy, menopausal status, and coexistence of adenomyosis and fibroids at TVUS. From 1 to 30 April 2022, all eligible women referred to our tertiary gynecological ultrasound clinic were consecutively asked to participate in the study.

Study Outcomes
Accuracy was used as the primary evaluation metric for the diagnostic performance of DL machine in the diagnosis of adenomyosis at 2D gray scale mode TVUS images. Accuracy is a statistical measure of how well a binary classification test correctly identifies or excludes a condition, that is, the proportion of correct predictions (both true positives and true negatives) among the total number of patients examined.
Other metrics used to measure accuracy from different perspectives were the following: recall (sensitivity), precision (positive predictive value, PPV), F1-score (harmonic mean of precision and recall), specificity and negative predictive value (NPV). The same metrics were used to assess the performance of intermediate skilled trainees for the diagnosis of adenomyosis and were informally compared with those obtained with DL machine.
Secondarily, all the measures listed above were calculated using the diagnoses of fibroids and homogeneous echogenicity as reference.
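As a minimal sketch of how the metrics defined above follow from the four cells of a binary confusion table, the Python function below computes them from the counts of true positives, false positives, false negatives and true negatives. The function name and the counts in the usage lines are hypothetical and are not taken from the study data.

```python
def _ratio(numerator, denominator):
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def diagnostic_metrics(tp, fp, fn, tn):
    """Compute the binary metrics named in the text from confusion counts."""
    sensitivity = _ratio(tp, tp + fn)        # recall
    specificity = _ratio(tn, tn + fp)
    ppv = _ratio(tp, tp + fp)                # precision
    npv = _ratio(tn, tn + fn)
    accuracy = _ratio(tp + tn, tp + fp + fn + tn)
    # F1-score: harmonic mean of precision (PPV) and recall (sensitivity).
    f1 = _ratio(2 * ppv * sensitivity, ppv + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "f1": f1,
    }


# Hypothetical counts, for illustration only (not study data).
metrics = diagnostic_metrics(tp=18, fp=12, fn=7, tn=63)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

For a three-class problem such as the one studied here, the same function can be applied once per class by treating that class as positive and the other two as negative.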

Patient Assessment
For each patient, anamnestic and clinical data were acquired as follows: age, body mass index (BMI), parity, history of infertility, previous endometriosis surgery, moderate-to-severe pain symptoms defined as a numerical rating scale (NRS) score of 5 or more [22], heavy menstrual bleeding defined as a pictorial blood loss analysis chart score ≥100 [23] and use of hormonal therapy.

Ultrasound Details
A Voluson E8 ultrasound machine (GE Healthcare, Zipf, Austria) with a 4-9 MHz volumetric vaginal probe was used for all acquisitions. Ultrasound scans were obtained with patients in a modified lithotomy position. During the 2D gray scale mode TVUS examination, an expert sonographer classified uteruses into three groups: homogeneous myometrial echogenicity, fibroids or adenomyosis.
Adenomyosis was diagnosed when two or more of the following sonographic criteria were present: globular uterus appearance, asymmetrical thickening, hypoechogenic myometrial cysts, hyperechoic islands, fan-shaped shadowing, echogenic subendometrial lines and buds, and junctional zone irregularities [2,3,8]. Uterine fibroids, in contrast, were diagnosed as well-defined round lesions of the myometrium, frequently with shadows at the edge or an internal fan-shaped shadow [2,24].
For each patient, presence of deep endometriotic lesions and endometrioma was also investigated according to IDEA consensus [25,26].

Deep Learning (DL)
An end-to-end DL model was developed for the classification of uterine images. Sequential ultrasound images including uterine corpus and cervix were extracted from ultrasound video clips by an automatic system. Manual segmentation was performed by the experienced sonographer. Uterine boundaries were manually traced in the sagittal scan including a region of interest (ROI) that clearly highlighted ultrasound features of adenomyosis or fibroids, according to literature. These ultrasound images were used for the construction, validation and testing of the DL system.
The available dataset of 100 ultrasound video clips was divided in a random and balanced way into three parts: training (n = 30), validation (n = 30) and testing set (n = 40). The training set was used to train the network, that is, to learn the model parameters. Two architectures were considered: ResNet and VGG. Among these, the VGG13, VGG19, ResNet18 and ResNet34 models were used.
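A random and balanced division of this kind can be sketched as a class-by-class (stratified) split of the clip indices. The sketch below is only an illustration under stated assumptions: the function name, the seed and the integer rounding rule are hypothetical, and the label counts are the expert diagnoses reported in the Results. Because of rounding within each class, the training and validation subsets come out close to, but not exactly at, 30 clips each.

```python
import random
from collections import defaultdict


def balanced_split(labels, percentages=(30, 30, 40), seed=0):
    """Split clip indices into train/val/test, preserving class proportions.

    `percentages` are integer percentages for train, validation and test.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    splits = {"train": [], "val": [], "test": []}
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        n = len(members)
        # Integer arithmetic avoids floating-point rounding surprises.
        cut_train = n * percentages[0] // 100
        cut_val = n * (percentages[0] + percentages[1]) // 100
        splits["train"] += members[:cut_train]
        splits["val"] += members[cut_train:cut_val]
        splits["test"] += members[cut_val:]
    return splits


# One label per video clip (45 / 30 / 25 as diagnosed by the expert operator).
labels = ["homogeneous"] * 45 + ["fibroids"] * 30 + ["adenomyosis"] * 25
splits = balanced_split(labels)
print({name: len(indices) for name, indices in splits.items()})
```

Splitting at the level of the video clip, before frames are extracted, keeps all images from one patient in the same subset.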
The validation set was used for early stopping, which saves the network weights at the point of best performance, and for optimizing the hyper-parameters. The hyper-parameters used were as follows: To find the best combination of hyper-parameters, the Tree Parzen Estimator (TPE) was used as a sampler and the Successive Halving Pruner (SHP) as a pruner. To reduce over-fitting, data augmentation was also applied, which generates additional training samples using random image transformations. To this end, the captured images were extracted with the resolution reduced from 300 × 300 pixels to 224 × 224 pixels, random horizontal and vertical flips were employed and Gaussian blur was applied.
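The early-stopping rule can be illustrated with a short sketch. It assumes that "best performance" is measured as the lowest validation loss and that training stops after a fixed number of epochs without improvement (the `patience` parameter); neither detail is specified in the text, and the loss values are invented for illustration.

```python
def early_stopping(val_losses, patience=3):
    """Return (best_epoch, best_loss, last_epoch_run) for a loss sequence.

    In real training, the model weights would be saved at `best_epoch`
    and restored once training stops.
    """
    best_loss = float("inf")
    best_epoch = None
    last_epoch = None
    epochs_without_improvement = 0

    for epoch, loss in enumerate(val_losses):
        last_epoch = epoch
        if loss < best_loss:
            # New best: this is where a checkpoint would be written.
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # no improvement for `patience` consecutive epochs
    return best_epoch, best_loss, last_epoch


# Invented validation losses, one per epoch.
losses = [0.90, 0.70, 0.65, 0.66, 0.68, 0.67, 0.60]
print(early_stopping(losses, patience=3))
```

With these values, training halts at epoch 5 and the weights from epoch 2 are kept, so the later, lower loss at epoch 6 is never reached. This is the usual trade-off of a small patience value.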
The test set was used to independently assess the generalization error for the final models chosen.
The diagnostic performance of each DL model was recorded.

Diagnostic Performance of Trainees
For each patient, uterine images were acquired for storage as short video clips (8-10 s). The uterus (cervix and corpus) was filmed in a sagittal plane with a lateral left-to-right movement of the probe, using grayscale mode. Video clips were downloaded in MP4 format from the hospital image database system and then de-identified prior to being reanalyzed by three intermediate ultrasound-skilled trainees. These trainees were 4th year residents in O&G with intermediate ultrasound skills (consisting of more than 500 gynecologic ultrasound cases) doing their postgraduate studies in endometriosis management [27]. The trainees, blinded to clinical data, were each asked independently to make their own diagnosis by reviewing the uterine images of the testing set. The diagnostic performance of each trainee was recorded.

Statistical Analysis
Numerical variables were summarized as mean ± SD or median (95% CI); categorical variables were summarized as counts and percentages. Chi-squared test, Fisher's exact test and variance analysis were used for comparison of categorical and numerical variables, where appropriate.
To compare the performance of the best DL model with that of the best trainee in diagnosing adenomyosis, accuracy, sensitivity, positive predictive value (PPV), F1-score (harmonic mean of positive predictive value and sensitivity), specificity and negative predictive value (NPV) were calculated.
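Each of these metrics is a proportion, so a 95% confidence interval can be attached to it. The method used to compute the intervals is not stated here; the sketch below uses the Wilson score interval purely as one common choice, with a hypothetical function name and illustrative counts.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denominator = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denominator
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denominator
    return center - half_width, center + half_width


# Illustrative counts only: 70 correct diagnoses out of 100 cases.
low, high = wilson_interval(70, 100)
print(f"accuracy 0.70, 95% CI {low:.2f} to {high:.2f}")
```

Other interval methods (for example, exact binomial intervals) give slightly different bounds, especially for small samples or proportions near 0 or 1.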

Ethical Statement and Informed Consent
The study protocol received approval by the local Ethics Committee (114/2022/Oss/AOUBo). All patients signed an informed consent before entering the study, and all data were anonymized.

Results
During the study period, 100 eligible patients were enrolled. Ultrasound diagnoses by the expert operator were as follows: 45 patients with homogeneous echogenicity of the myometrium, 30 with fibroids and 25 with adenomyosis.
Baseline and clinical characteristics are summarized in Table 1. Mean (±SD) age and BMI of the study sample were 35.4 ± 8.0 years and 22.5 ± 2.5 kg/m², respectively. There was no significant difference in terms of baseline data among the three study groups, except for age, rate of spontaneous delivery and heavy menstrual bleeding, which were higher in the fibroids group, while previous surgery for endometriosis was more frequent in the adenomyosis group. Sonographic signs suggestive of adenomyosis in the "adenomyosis group" diagnosed by the expert operator are reported in Table 2. "Globular uterus" was the most frequent sonographic sign (72%), followed by "asymmetrical thickening" and "fan shaped shadowing" (60%). After the application of data augmentation, the numbers of uterine images were as follows:
- Training set: n = 1645 homogeneous echogenicity, n = 1071 fibroids, n = 836 adenomyosis;
- Validation set: n = 481 homogeneous echogenicity, n = 336 fibroids, n = 252 adenomyosis;
- Testing set: n = 495 homogeneous echogenicity, n = 359 fibroids, n = 336 adenomyosis.
The confusion matrix of the DL model for the testing set is shown in Figure 1. The matrix highlights where the model fails. Rows show the diagnosis made by the experienced sonographer (true label), while columns show the predictions made by the machine (predicted label). Diagonal elements are the number of images for which the predicted label was the same as the true label, while the off-diagonal ones were misclassified by the model. Further results are reported in Table 3.
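The layout described above (rows as true labels, columns as predicted labels) can be sketched in a few lines of Python. The class order, function names and toy labels are hypothetical and do not reproduce Figure 1. The second function collapses the three-class matrix into the one-vs-rest counts needed to compute sensitivity or specificity for a single class such as adenomyosis.

```python
CLASSES = ["homogeneous", "fibroids", "adenomyosis"]


def confusion_matrix(true_labels, predicted_labels, classes=CLASSES):
    """Build a matrix with rows = true label and columns = predicted label."""
    index = {name: i for i, name in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for true, predicted in zip(true_labels, predicted_labels):
        matrix[index[true]][index[predicted]] += 1
    return matrix


def one_vs_rest_counts(matrix, k):
    """Return (tp, fp, fn, tn) treating class k as positive."""
    tp = matrix[k][k]                          # diagonal cell
    fn = sum(matrix[k]) - tp                   # rest of the true-label row
    fp = sum(row[k] for row in matrix) - tp    # rest of the predicted column
    tn = sum(sum(row) for row in matrix) - tp - fn - fp
    return tp, fp, fn, tn


# Toy labels for illustration only (not study data).
true = ["homogeneous", "homogeneous", "fibroids",
        "fibroids", "adenomyosis", "adenomyosis"]
pred = ["homogeneous", "adenomyosis", "fibroids",
        "homogeneous", "adenomyosis", "fibroids"]

matrix = confusion_matrix(true, pred)
for name, row in zip(CLASSES, matrix):
    print(f"{name:>12}: {row}")
print(one_vs_rest_counts(matrix, CLASSES.index("adenomyosis")))
```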

Discussion
Although AI has recently gained popularity in the field of medical imaging and its applications in gynecology have increased, no study has ever used this tool in the diagnosis of uterine adenomyosis. Therefore, this study can be considered a proof-of-concept for this issue.
Recently, DL based on artificial neural networks with representation learning has been adopted to help operators discriminate among differential diagnoses.
In the present study, the DL model showed a low accuracy in the diagnosis of uterine adenomyosis (51%). This observation may reflect the complexity of the disease. Indeed, adenomyosis is a heterogeneous disease that can have several phenotypes, varying in extent (diffuse, focal or adenomyoma) and location (internal myometrial or junctional invasion) within the myometrium [4,11].
As a secondary finding, the accuracy of the intermediate ultrasound-skilled trainees (70%) was higher than that of the DL model. Moreover, these trainees showed a higher sensitivity (72%) but a lower specificity (69%) than the DL model. This over-diagnosis could be explained by the tertiary center setting, in which the frequency of adenomyosis is estimated to be higher than in the general population, and by the offline assessment of uterine images instead of personal execution of TVUS. Conversely, the DL model showed a higher specificity, being more effective in identifying healthy uteruses, with few false positives. Indeed, the DL model could be a useful tool to exclude adenomyosis where it is not present and to counter the over-diagnosis of less experienced operators, avoiding unnecessary second-level examinations or over-treatment.
Limitations of our study are the small sample size and the monocentric design, reducing the generalizability of our results. Larger multicentric studies are needed to better evaluate the potential clinical aid of AI in the diagnosis of adenomyosis. Although the lack of histological confirmation of adenomyosis may be considered another limitation of the study, pathological examination is unethical in patients without any indication for surgery. On the other hand, to date, an experienced sonographer must be considered an adequate alternative to histological diagnosis in women who are asymptomatic or have not completed their reproductive plan.
The inability to fully investigate the junctional zone (JZ) by 3D-TVUS examination and to evaluate translesional vascularity through Power Doppler mode may have influenced the diagnostic performance of the expert sonographer first, and then that of the trainees and the DL model.
In order to improve the diagnostic performance of DL in diagnosing adenomyosis, future research could focus on specific training of the DL model to recognize each of the ultrasound criteria suggestive of adenomyosis. Moreover, more studies are needed to evaluate any improvement of DL performance when adding other sonographic signs (i.e., translesional vascularity using Power Doppler and junctional zone thickness or irregularities at 3D TVUS) and/or clinical data (i.e., presence and severity of pain symptoms and uterine tenderness).

Conclusions
In this proof-of-concept study, the DL model achieved a low diagnostic performance for the detection of adenomyosis, with an accuracy of 51%, lower than that of intermediate-skilled trainees. Sensitivity and F1-score of the intermediate-skilled trainees were also higher than those of the DL model. However, the DL model showed potential for excluding adenomyotic uteri, with higher specificity and NPV than those of intermediate-skilled trainees.
Larger multicentric studies with adjuvant investigation of JZ by 3D-TVUS and translesional vascularity through Power Doppler are needed to better evaluate the potential clinical application of AI in the diagnosis of adenomyosis.