Detection and Classification of Hysteroscopic Images Using Deep Learning

Simple Summary
This article discusses the potential of deep learning (DL) models in aiding the diagnosis of endometrial pathologies through hysteroscopic images. While hysteroscopy with endometrial biopsy is currently the gold standard for diagnosis, it relies heavily on the expertise of gynecologists. The study aims to develop a DL model for automated detection and classification of endometrial pathologies. Conducted as a monocentric observational retrospective cohort study, it reviewed records and videos of hysteroscopies from patients with confirmed intrauterine lesions. The DL model was trained using these images, with or without incorporating clinical factors. The DL model showed some promise, but its diagnostic performance remained relatively low, even with the inclusion of clinical data. The best performance was achieved when clinical factors were included, with precision, recall, specificity, and F1 scores ranging from 80 to 90% for classification and 85 to 93% for identification tasks. Despite the slight improvement obtained with clinical data, further refinement of DL models is warranted for more accurate diagnosis of endometrial pathologies.

Abstract
Background: Although hysteroscopy with endometrial biopsy is the gold standard in the diagnosis of endometrial pathology, the gynecologist's experience is crucial for a correct diagnosis. Deep learning (DL), as an artificial intelligence method, might help to overcome this limitation. Unfortunately, only preliminary findings are available, and no studies have evaluated the performance of DL models in identifying intrauterine lesions or the possible aid provided by including clinical factors in the model.
Aim: To develop a DL model as an automated tool for detecting and classifying endometrial pathologies from hysteroscopic images.
Methods: A monocentric observational retrospective cohort study was performed by reviewing clinical records, electronic databases, and stored videos of hysteroscopies from consecutive patients with pathologically confirmed intrauterine lesions at our Center from January 2021 to May 2021. Retrieved hysteroscopic images were used to build a DL model for the classification and identification of intracavitary uterine lesions with or without the aid of clinical factors. Study outcomes were DL model diagnostic metrics in the classification and identification of intracavitary uterine lesions with and without the aid of clinical factors.
Results: We reviewed 1500 images from 266 patients: 186 patients had benign focal lesions, 25 benign diffuse lesions, and 55 preneoplastic/neoplastic lesions. For both the classification and identification tasks, the best performance was achieved with the aid of clinical factors, with an overall precision of 80.11%, recall of 80.11%, specificity of 90.06%, F1 score of 80.11%, and accuracy of 86.74% for the classification task, and overall detection of 85.82%, precision of 93.12%, recall of 91.63%, and an F1 score of 92.37% for the identification task.
Conclusion: Our DL model achieved a low diagnostic performance in the detection and classification of intracavitary uterine lesions from hysteroscopic images. Although the best diagnostic performance was obtained with the aid of clinical data, such an improvement was slight.


Introduction
Hysteroscopy with endometrial biopsy is an endoscopic tool that can be considered the gold standard in the diagnosis of abnormal uterine bleeding (AUB) and endometrial pathology, as it allows the direct visual assessment of endometrium and subsequent histopathological examination [1][2][3][4]. AUB can be caused by benign lesions, such as endometrial polyps, intracavitary myomas, and endometrial hyperplasia without atypias [5][6][7], or pre-malignant and malignant lesions, such as atypical endometrial hyperplasia and endometrial carcinomas [8]. Unfortunately, the experience of the gynecologist plays a crucial role in identifying suspicious areas to be sampled and distinguishing between several endometrial pathologies, with the possibility of failing the correct diagnosis [9].
A valuable help to overcome this limitation could be provided by deep learning (DL), an artificial intelligence (AI) method. AI has recently been introduced in medicine, particularly in disciplines based on the analysis of images, such as pathology, ultrasound, and radiology [10]. For example, AI has shown interesting results in many medical image analysis tasks, such as screening for breast cancer and prediction of lymph node metastasis in cervical cancer [11,12]. Among AI techniques, the use of DL for processing and analyzing medical images appears highly promising. Deep Convolutional Neural Networks are the prevalent DL method for pattern identification in images and videos. They are able to automatically learn a set of feature detectors, usually over a number of layers (making the model "deep"), from a labeled dataset that "trains" the model to recognize pathologies through image analysis [13,14]. To prepare a DL model for operation, the main dataset is typically divided into two subsets: a training set and a test set. The training set consists of data that are fed into the deep learning network during the iterative training process, in repeated passes known as epochs. Throughout these epochs, the network's parameters are adjusted to improve the desired outcome. Following the completion of training, the test set is used to evaluate the performance of the finalized model [15]. DL applications for these tasks may represent a useful tool for clinicians in decision-making and treatment planning [16].
To the best of our knowledge, only two preliminary studies evaluated the performance of DL using hysteroscopy images for diagnosis of benign and malignant endometrial lesions, with favorable results [17,18]. However, none of these studies assessed the performance of DL models in the identification task of intrauterine lesions, as they only reported accuracy in classifying intrauterine pathologies. In addition, no study evaluated the inclusion of specific clinical factors in the DL model to improve the performance. Moreover, preliminary data on DL performance must be confirmed by different studies before accepting it as a potential clinical aid [19].
In the present study, we aimed to develop a DL model to provide an automated tool for detecting endometrial pathologies and classifying them as benign or malignant intrauterine lesions using hysteroscopic images from a consecutive series of women with pathologically confirmed endometrial lesions.

Study Protocol and Selection Criteria
The study followed an a priori-defined study protocol and was reported according to the Standards for Reporting of Diagnostic Accuracy (STARD) [20]. The study was designed as a monocentric observational retrospective cohort study.
We reviewed clinical records, electronic databases, and stored videos of hysteroscopies from all consecutive patients with pathological confirmation of intracavitary uterine lesions at IRCCS Azienda Ospedaliero-Universitaria di Bologna, Bologna, Italy, from January 2021 to May 2021. Retrieved hysteroscopic images were used to build a DL model for the classification and identification of intracavitary uterine lesions with and without the aid of clinical factors.
Intracavitary uterine lesions included endometrial polyps, fibroids, endometrial hyperplasia with and without atypia, and endometrial cancer diagnosed at histological examination of hysteroscopic specimens.
The exclusion criteria were the absence of adequate histological examination, absence of iconographic documentation, presence of uterine dysmorphism, and absence of intrauterine pathology.

Study Outcomes
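The primary outcome was the accuracy of the DL model in the classification of intracavitary uterine lesions (overall and by category of lesion) without the aid of specific clinical factors.
The secondary outcomes were the following:
• accuracy of the DL model in the classification of intracavitary uterine lesions (overall and by category of lesion) with the aid of specific clinical factors;
• precision, sensitivity, specificity, and F1 score (i.e., the harmonic mean of precision and sensitivity) of the DL model in the classification of intracavitary uterine lesions (overall and by category of lesion), with and without the aid of specific clinical factors;
• precision, sensitivity, and F1 score of the DL model in the identification of intracavitary uterine lesions, with and without the aid of specific clinical factors;
• the best performance of the DL model during testing in the identification and classification of intracavitary uterine lesions (overall and by category of lesion).
Classification referred to the discrimination between three categories of intracavitary uterine lesions: benign focal lesions (i.e., polyps and myomas), benign diffuse lesions (i.e., non-atypical endometrial hyperplasia), and pre-neoplastic/neoplastic lesions (i.e., atypical endometrial hyperplasia and endometrial cancer). Identification, instead, referred to the detection of intracavitary uterine lesions. Given the inclusion of only patients with intracavitary uterine lesions diagnosed at histological examination, true negatives were absent for identification metrics. On the other hand, intracavitary uterine lesions of other categories were considered as false negatives for classification metrics.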
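The classification metrics named above can be illustrated with a minimal sketch. It assumes a standard one-vs-rest reading of a three-class confusion matrix and pools the counts across classes (micro-averaging); the source does not state which averaging scheme or bookkeeping the authors used, and the confusion matrix below is made up for illustration only.

```python
from typing import Dict, List

# Hypothetical class order; cm[i][j] counts images of true class i predicted as class j.
CLASSES = ["benign focal", "benign diffuse", "preneoplastic/neoplastic"]

def one_vs_rest_counts(cm: List[List[int]], k: int) -> Dict[str, int]:
    """Return TP/FP/FN/TN for class k treated as 'positive' against all other classes."""
    total = sum(sum(row) for row in cm)
    tp = cm[k][k]
    fn = sum(cm[k]) - tp                      # class k images assigned elsewhere
    fp = sum(row[k] for row in cm) - tp       # other images assigned to class k
    tn = total - tp - fn - fp
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn}

def metrics(c: Dict[str, int]) -> Dict[str, float]:
    """Precision, recall (sensitivity), specificity, F1, and accuracy from counts."""
    precision = c["tp"] / (c["tp"] + c["fp"])
    recall = c["tp"] / (c["tp"] + c["fn"])
    specificity = c["tn"] / (c["tn"] + c["fp"])
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (c["tp"] + c["tn"]) / sum(c.values())
    return {"precision": precision, "recall": recall,
            "specificity": specificity, "f1": f1, "accuracy": accuracy}

def micro_average(cm: List[List[int]]) -> Dict[str, float]:
    """Pool one-vs-rest counts over all classes, then compute the metrics once."""
    pooled = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for k in range(len(cm)):
        for key, value in one_vs_rest_counts(cm, k).items():
            pooled[key] += value
    return metrics(pooled)

# Made-up counts, NOT data from the study.
cm = [[80, 10, 10],
      [5, 30, 5],
      [10, 0, 50]]
overall = micro_average(cm)
per_class = {name: metrics(one_vs_rest_counts(cm, k)) for k, name in enumerate(CLASSES)}
```

With pooled counts, every misclassified image is one false positive and one false negative, so overall precision, recall, and F1 coincide. The study's overall classification figures show the same pattern, although the averaging scheme is not stated in the source.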
Clinical factors assessed for aiding DL model performance were age, menopausal status, AUB, hormonal therapy, and tamoxifen use.
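The source states that clinical factors were concatenated to features extracted from the ROI Pooler. The sketch below only illustrates the general idea of appending a small clinical vector to an image feature vector. The encoding (age divided by 100, binary flags as 0/1) and the function names are hypothetical and are not taken from the study.

```python
from typing import List

def encode_clinical(age_years: float, postmenopausal: bool, aub: bool,
                    hormonal_therapy: bool, tamoxifen: bool) -> List[float]:
    """Encode the five clinical factors as numbers (hypothetical encoding)."""
    return [
        age_years / 100.0,        # assumed scaling to keep age roughly in [0, 1]
        float(postmenopausal),
        float(aub),
        float(hormonal_therapy),
        float(tamoxifen),
    ]

def concat_features(image_features: List[float], clinical: List[float]) -> List[float]:
    """Append the clinical vector to the image feature vector."""
    return list(image_features) + list(clinical)

# Toy 3-dimensional image feature vector; real ROI features would be much larger.
fused = concat_features([0.1, 0.2, 0.3],
                        encode_clinical(62, True, True, False, False))
```

The fused vector would then feed the downstream classification layers, whose input size grows by the number of clinical factors.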

Hysteroscopy and Image Processing
Hysteroscopy with targeted biopsies of intracavitary uterine lesions through 5 French instruments was performed in outpatient settings using 0.9% saline solution distension and a Bettocchi hysteroscope (Karl Storz, Tuttlingen, Germany). Stills and images from hysteroscopic videos of eligible patients were processed for DL model building. Images and videos were captured with two different hysteroscopic systems, one high-definition system and one standard-definition system. Features were extracted from the original image. The system extracted the region of interest around the detected lesion at 224 × 224 pixels, the input size required for the classification task. Manual segmentation was performed by an experienced hysteroscopist.

Deep Learning
We developed an end-to-end DL model for intracavitary uterine lesion identification and classification. The deep learning process comprises three parts: training, validation, and testing. The dataset was divided into three groups at random with a ratio of 60:20:20. Two groups were used for training and validation, and the remaining group was used for testing.
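A random 60:20:20 split can be sketched as follows. The sketch assumes the split is made per patient, so that images from one patient never appear in more than one subset. The seed, the rounding rule, and the patient identifiers are hypothetical, so the resulting subset sizes need not match those reported in the study.

```python
import random
from typing import Dict, List, Sequence, Tuple

def split_patients(patient_ids: Sequence[str], seed: int = 0,
                   ratios: Tuple[float, float, float] = (0.6, 0.2, 0.2)
                   ) -> Dict[str, List[str]]:
    """Shuffle patient identifiers and cut them into train/val/test subsets."""
    ids = list(patient_ids)
    random.Random(seed).shuffle(ids)          # seeded for reproducibility
    n_train = round(ratios[0] * len(ids))
    n_val = round(ratios[1] * len(ids))
    return {
        "train": ids[:n_train],
        "val": ids[n_train:n_train + n_val],
        "test": ids[n_train + n_val:],        # remainder goes to the test subset
    }

# Placeholder identifiers for a cohort of 266 patients.
patients = [f"patient_{i:03d}" for i in range(266)]
splits = split_patients(patients)
```

Splitting by patient rather than by image avoids near-identical frames from the same examination ending up in both training and test data.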
ResNet50 was used as the deep learning model since it can achieve relatively high accuracy with smaller datasets and lower training costs. ResNet50 was pretrained on a million natural images from the Microsoft Common Objects in Context dataset and was fine-tuned using images from the training and validation dataset.
We used established techniques to reduce over-fitting during the validation process with an iterative method: (a) data augmentation, which is a process of synthetically generating additional training examples by using random image transformations; (b) "early stopping", by which the weights of the network at the point of best performance are saved, as opposed to the weights obtained at the end of training. The performance of the DL model was evaluated using a balanced sampler on image units.
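The early-stopping idea described above, keeping the weights from the epoch with the best validation performance, can be sketched in a few lines. The patience value and the list of validation scores are hypothetical. The source does not report a patience setting, so this shows one common variant rather than the authors' procedure.

```python
from typing import List, Tuple

def early_stopping(val_scores: List[float], patience: int = 3) -> Tuple[int, float]:
    """Return (best_epoch, best_score), stopping after `patience` epochs without improvement."""
    best_epoch, best_score = -1, float("-inf")
    for epoch, score in enumerate(val_scores):
        if score > best_score:
            # In a real training loop the model weights would be checkpointed here.
            best_epoch, best_score = epoch, score
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` consecutive epochs
    return best_epoch, best_score

# Made-up validation F1 scores per epoch.
scores = [0.61, 0.70, 0.74, 0.73, 0.72, 0.71, 0.80]
best_epoch, best_score = early_stopping(scores, patience=3)
```

In this toy run training stops at epoch 5, so the later score of 0.80 is never reached. This is the usual trade-off of a short patience window.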
In our methodology, data augmentation was implemented online, meaning it was applied in real-time during the training of the model. This approach differs significantly from the traditional offline augmentation, where an augmented dataset is prepared in advance before the training process begins. Each training batch underwent a unique set of random transformations, ensuring that the model encountered a diverse range of variations in the training images. This dynamic approach to augmentation is crucial in preventing the model from overfitting, as it learns to generalize better from a constantly varying dataset. The specific augmentation steps included in our process were as follows:

• Random Vertical and Horizontal Flipping: each image in the training batch had a chance of being flipped either vertically or horizontally. This step introduces a variety of orientations, helping the model to learn features that are orientation-invariant.
• Random Brightness Adjustment: the brightness of each image was altered using a random factor ranging from 0.8 to 1.2. This variance in brightness ensures the model's robustness against different lighting conditions.
• Random Contrast Adjustment: similarly, the contrast of each image was modified with a random factor within the same range (0.8 to 1.2). This step helps in training the model to identify features under various contrast levels.

By incorporating these random transformations, our DL model benefits from a more comprehensive and challenging training environment. This online method of data augmentation plays a significant role in enhancing the model's ability to accurately classify and identify lesions under diverse imaging conditions, ultimately improving its diagnostic efficacy.
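The three augmentation steps can be sketched in plain Python on a small grayscale image. The sketch makes several assumptions: a flip probability of 0.5 for each direction (the source only says each image "had a chance" of being flipped), independent vertical and horizontal flips, and contrast scaled around the image mean. Only the 0.8 to 1.2 factor range comes from the source.

```python
import random
from typing import List

Image = List[List[float]]  # grayscale pixel intensities in [0, 1]

def clip(value: float) -> float:
    """Keep a pixel intensity inside the valid [0, 1] range."""
    return min(1.0, max(0.0, value))

def augment(img: Image, rng: random.Random, flip_p: float = 0.5) -> Image:
    """Apply random flips, brightness, and contrast to a copy of `img`."""
    out = [row[:] for row in img]                 # never modify the input image
    if rng.random() < flip_p:                     # vertical flip: reverse row order
        out = out[::-1]
    if rng.random() < flip_p:                     # horizontal flip: reverse each row
        out = [row[::-1] for row in out]
    brightness = rng.uniform(0.8, 1.2)            # factor range stated in the text
    out = [[clip(p * brightness) for p in row] for row in out]
    contrast = rng.uniform(0.8, 1.2)              # same range, applied around the mean
    mean = sum(sum(row) for row in out) / (len(out) * len(out[0]))
    return [[clip(mean + contrast * (p - mean)) for p in row] for row in out]

rng = random.Random(42)
image = [[0.1, 0.5],
         [0.9, 0.3]]
# "Online" augmentation: a fresh random variant is drawn every time a batch is built.
batch = [augment(image, rng) for _ in range(4)]
```

Because the transformations are drawn anew for each batch, no augmented dataset has to be stored on disk.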
Hyperparameters were optimized using TPESampler as the sampler and SuccessiveHalvingPruner as the pruner, and training for each set of hyperparameters was replicated 3 times. We used RepeatFactorTrainingSampler with the threshold chosen by hyperparameter optimization. The average F1 score on the validation set was the optimization metric. The hyperparameters are shown in Table 1. Table 2 shows the optimal hyperparameters. Clinical factors were incorporated into the Region Proposal Network (RPN) and Classification Head and were concatenated to features extracted from the ROI Pooler (Figure 1).
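Repeat-factor sampling oversamples images of rare classes. The sketch below uses the square-root rule commonly associated with this technique, r(c) = max(1, sqrt(t / f(c))), where f(c) is the fraction of images in class c and t is the threshold. It assumes one label per image. The class counts and the threshold of 0.4 are hypothetical, and whether the authors' sampler follows exactly this rule is an assumption.

```python
import math
from collections import Counter
from typing import List

def repeat_factors(image_labels: List[str], threshold: float) -> List[float]:
    """Per-image repeat factor: rare classes (frequency below threshold) get > 1."""
    n = len(image_labels)
    freq = {label: count / n for label, count in Counter(image_labels).items()}
    return [max(1.0, math.sqrt(threshold / freq[label])) for label in image_labels]

# Made-up label distribution, NOT the study's image counts.
labels = ["focal"] * 70 + ["diffuse"] * 10 + ["neoplastic"] * 20
factors = repeat_factors(labels, threshold=0.4)
```

Here the majority class keeps a factor of 1, while the two minority classes are repeated about 2 and 1.4 times per epoch on average. In samplers of this kind the fractional part is usually handled by stochastic rounding.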

Study Population and Dataset
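During the study period, 703 patients underwent hysteroscopy in our center. Four hundred and thirty-seven were excluded from analysis due to lack of imaging or histological examination or both. The remaining 266 patients were included: 186 had benign focal lesions, 25 had benign diffuse lesions, and 55 had preneoplastic/neoplastic lesions.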
Out of benign focal lesions, 21 were myomas, and 165 were polyps; out of benign diffuse lesions, 19 were polypoid endometrium, and 6 were endometrial hyperplasia without atypia; out of preneoplastic and neoplastic lesions, 7 were atypical endometrial hyperplasia, 12 were endometrial intraepithelial neoplasia, and 36 were endometrial cancers.
Clinical data about the whole study population and by category of intracavitary uterine lesions are summarized in Table 3. Patients were randomly included in the training (n = 157), validation (n = 54), and testing (n = 55) cohorts (Table 4).

Model Performance
Overall, the accuracy of the model in classifying uterine intracavitary lesions without the aid of specific clinical factors was 85.09 ± 1.18%. Specifically, such accuracy was 79.55 ± 1.29% for benign focal lesions, 90.1 ± 0.91% for benign diffuse lesions, and 85.63 ± 1.16% for malignant lesions.
Tables 5 and 6 show the accuracy, precision, sensitivity, specificity, and F1 score of the DL model in the classification of intracavitary uterine lesions without and with the aid of specific clinical factors, respectively. Table 7 shows the precision, sensitivity, and F1 score of the DL model in the identification of intracavitary uterine lesions, with and without the aid of specific clinical factors. For the classification task, the best performance was achieved in all the categories with the aid of clinical factors, as shown in Table 8. For the identification task, the best performance was achieved with the aid of clinical factors, with detection of 85.82%, precision of 93.12%, recall of 91.63%, and an F1 score of 92.37%.
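As a consistency check on the identification figures, the F1 score can be recomputed from the reported precision and recall. The source does not define "detection". The sketch assumes it is TP / (TP + FP + FN), a natural choice when true negatives are absent, which can be written in terms of precision and recall alone. That reading is an assumption, not a statement from the authors.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def detection_without_tn(precision: float, recall: float) -> float:
    """TP / (TP + FP + FN), using 1/P = (TP+FP)/TP and 1/R = (TP+FN)/TP."""
    return 1.0 / (1.0 / precision + 1.0 / recall - 1.0)

# Reported identification metrics with clinical factors.
precision, recall = 0.9312, 0.9163
f1 = f1_score(precision, recall)
detection = detection_without_tn(precision, recall)
print(f"F1 = {100 * f1:.2f}%, detection = {100 * detection:.2f}%")
```

Under these assumptions both recomputed values agree with the reported 92.37% and 85.82% to two decimal places.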

Discussion
This study showed that the DL model had low overall accuracy in the detection and classification of uterine intracavitary diseases. The best performance of the DL model was obtained with the aid of clinical factors for both tasks. However, such an improvement was slight.
Although hysteroscopy with endometrial biopsy is regarded as the gold standard diagnostic tool for AUB and uterine intracavitary diseases [22], it is affected by operator experience in detecting suspicious areas to be sampled and distinguishing between several diseases. Moreover, hysteroscopic diagnosis of uterine intracavitary diseases can be challenging even when performed by expert operators [23]. Hysteroscopy has shown a low sensitivity especially for endometrial hyperplasia, since this disease may not show evident hysteroscopic signs and may simulate a second-phase or dysfunctional endometrium, or endometrial polyps [1][2][3][4]9,24].
Recently, some studies have attempted to build DL models to overcome these limitations. Takahashi et al. employed DL models on 177 patients with AUB in order to increase the accuracy of hysteroscopy in cancer diagnosis [17]. In detail, the Takahashi DL model distinguished atypical endometrial hyperplasia and endometrial cancer from polyps, fibroids, or normal endometrium with a 90% accuracy. However, this study might be affected by several limitations: (i) it did not evaluate the ability of the DL model in detection of endometrial lesions; (ii) it did not assess the possible aid of clinical factors on machine learning performance; (iii) it did not include cases with non-atypical endometrial hyperplasia; (iv) it did not evaluate histology as a reference standard for all cases; (v) it used a dataset with images from only one hysteroscopic system, limiting the generalizability of the findings.
Zhang et al. built a DL model on 454 patients with histologically confirmed intracavitary lesions, showing an overall accuracy of up to 80.8% and 90% in correctly classifying lesions as benign or premalignant/malignant, respectively. However, this study also did not evaluate DL model accuracy in the detection of endometrial lesions or the possible improvement in accuracy with the aid of clinical factors [18].
Zhao et al. developed a DL model to automatically detect only endometrial polyps in real-time hysteroscopic videos with an accuracy of up to 95%; unfortunately, they did not perform any classification of the lesions [25].
None of these studies used a DL model to identify and classify intracavitary uterine lesions at the same time. Therefore, we built a DL model for these purposes and evaluated its diagnostic performance (identification and classification) on hysteroscopy images from women undergoing the exam for AUB or sonographic suspicion of an intrauterine lesion, then confirmed at pathological examination [26]. To the best of our knowledge, our study may be the first study with these aims and study population in the literature. Furthermore, our DL model may be the first one to include the aid of clinical factors in the field.
As previously stated, in the present study, our DL model showed a low accuracy in the detection and classification of intracavitary diseases. This observation may reflect the heterogeneity of uterine intracavitary pathology and the small size and heterogeneity of the dataset. Moreover, the lack of images of normal cavities and the small number of patients led to a dataset imbalance problem.
Nevertheless, the best performance of our DL model is close to that of the above-mentioned larger studies. Our DL model might be an updated starting point for future improved DL models in the field.
In order to improve the diagnostic performance of the DL model in the detection and classification of intrauterine lesions, future research should focus on specific training of the DL model to discriminate between normal and abnormal cavities and to recognize each category, using a larger and balanced dataset including high-definition images and videos. After DL model building, the model should undergo external validation and improvement, with the inclusion of further images and videos from other centers. When a high DL model performance is obtained, the inclusion of cases with other rarer intrauterine pathologies (e.g., Mullerian malformations, atypical polypoid adenomyoma, PEComa, sarcoma, trophoblastic disease, retained products of conception) [27][28][29][30] might make the DL model testable in clinical practice through comparison of its diagnostic performance with that of expert endoscopists.

Conclusions
In this study, our DL model achieved a low diagnostic performance in the detection and classification of intracavitary uterine lesions from hysteroscopic images. Although the best diagnostic performance was obtained with the aid of clinical data, such an improvement was slight. However, our DL model might be an updated starting point for future improved DL models in the field based on larger datasets.
Our study underscores the importance of continued research in refining DL models for uterine lesion detection and classification. Future efforts should prioritize the expansion of datasets with high-definition images, the inclusion of diverse uterine pathologies, and external validation across multiple centers. Moreover, the addition of normal uterine cavity images and rarer intrauterine lesions to the training set might enhance the DL model's diagnostic accuracy.
In conclusion, while our DL model represents a promising step towards automated uterine lesion diagnosis, further refinement and validation are needed before its integration

• accuracy of the DL model in the classification of intracavitary uterine lesions (overall and by category of lesion) with the aid of specific clinical factors to DL model performance;
• precision, sensitivity, specificity, and F1 score (i.e., the harmonic mean of precision and sensitivity) of the DL model in the classification of intracavitary uterine lesions (overall and by category of lesion), with and without the aid of specific clinical factors to DL model performance;
• precision, sensitivity, and F1 score of the DL model in the identification of intracavitary uterine lesions, with and without the aid of specific clinical factors to DL model performance;
• the best performance of the DL model during testing in the identification and classification of intracavitary uterine lesions (overall and by category of lesion).
* related to data augmentation. ** minority class repetition factor.

Table 3. Clinical data on the whole study population and category of intracavitary uterine lesions.

Table 4. Characteristics of the dataset.

Table 5. Accuracy, precision, sensitivity, specificity, and F1 score of the DL model in the classification of intracavitary uterine lesions without clinical data. Values are expressed as % (95% CI).

Table 6. Accuracy, precision, sensitivity, specificity, and F1 score of the DL model in the classification of intracavitary uterine lesions with clinical data. Values are expressed as % (95% CI).

Table 7. Precision, sensitivity, and F1 score of the DL model in the identification of intracavitary uterine lesions, with and without the aid of specific clinical factors. Values are expressed as % (95% CI).

Table 8. Best performance of the DL model in the classification task. Values are expressed as %.