Using Deep Learning with Convolutional Neural Network Approach to Identify the Invasion Depth of Endometrial Cancer in Myometrium Using MR Images: A Pilot Study

Myometrial invasion affects the prognosis of endometrial cancer. However, discrepancies exist between pre-operative magnetic resonance imaging staging and post-operative pathological staging. This study aims to validate the accuracy of artificial intelligence (AI) for detecting the depth of myometrial invasion using a deep learning technique on magnetic resonance images. We obtained 4896 contrast-enhanced T1-weighted images (T1w) and T2-weighted images (T2w) from 72 patients who were diagnosed with surgico-pathological stage I endometrial carcinoma. We used the images from 24 patients (33.3%) to train the AI. The images from the remaining 48 patients (66.7%) were used to evaluate the accuracy of the model. The AI then interpreted each of the cases and sorted them into stage IA or IB. Compared with the accuracy rate of radiologists’ diagnoses (77.8%), the accuracy rate of AI interpretation in contrast-enhanced T1w was higher (79.2%), whereas that in T2w was lower (70.8%). The diagnostic accuracy was not significantly different between radiologists and AI for both T1w and T2w. However, AI was more likely to provide incorrect interpretations in patients with coexisting benign leiomyomas or polypoid tumors. Currently, the ability of this AI technology to make an accurate diagnosis has limitations. However, in hospitals with limited resources, AI may be able to assist in reading magnetic resonance images. We believe that AI has the potential to assist radiologists or serve as a reasonable alternative for pre-operative evaluation of the myometrial invasion depth of stage I endometrial cancers.


Introduction
Endometrial cancer is one of the leading gynecologic malignancies in industrialized countries. The incidence of endometrial cancer has increased significantly worldwide over the past 10 years [1,2]. The myometrium, the middle layer of the uterine wall, serves as a barrier that prevents further expansion of endometrial cancer [3,4]. When the disease is diagnosed at an advanced stage, a poor prognosis can be expected. One of the key parameters used to determine the stage is the depth of myometrial invasion. This is a prognostic factor used to categorize patients into high or low-intermediate risk categories, leading to different post-operative treatment approaches [5]. Therefore, accurate pre-operative assessment of the depth of myometrial invasion is essential.

Study Population
This is a retrospective study examining data from 72 endometrial cancer patients who received surgical treatment. Originally, there were 262 patients who underwent surgery at the Tri-Service General Hospital in Taipei, Taiwan from January 2014 to September 2018. However, since we were interested in examining the ability of AI to validate myometrial invasion in early-stage cancer, we excluded patients whose endometrial cancer was staged based merely on post-operative pathology without pre-operative MRI scans, as well as patients with stage II, III, or IV cancer. Eventually, 72 patients qualified for this study (see Table 1). Among these, 53 were diagnosed with stage IA cancer and 19 with stage IB based on permanent pathology. The average age of the patients was 59.7 years (range, 39 to 85). In terms of menopausal status, 63 (87.5%) were postmenopausal. In terms of histology grade, 27, 32, and 13 patients belonged to grades 1, 2, and 3, respectively (5 were serous carcinomas, 1 was clear cell carcinoma, and 3 were mixed types). Among all the patients, 29 (40.3%) had uterine leiomyomas, and 43 (59.7%) did not. A total of 4896 MRI slices (3456 slices of contrast-enhanced T1w and 1440 slices of T2w) with detailed pre-operative radiology reports were collected from these 72 patients. Patients were divided into training, validation, and testing groups. The training group comprised patients whose radiologists' diagnoses were compatible with the pathology reports. One third of the patients (24 patients) were selected as the training group, which was used to train the DL model and generate the model parameters. Then, the performance of the model was checked by evaluating the error function on an independent validation group (6 patients). The model that generated the smallest error was selected as the final model.
Finally, the test group comprised a dataset independent of the training group (42 patients plus the 6 patients in the validation group), and this group was used to appraise the accuracy rate of the novel AI-based system (see Figure 1). Two gynecologic oncologists with 25 and 14 years of clinical experience, respectively, were recruited to label the MR images of each patient, including the contours of the uterus, the lesion of the endometrium, and the lining of the endometrium. Our research team then double-checked their work. Contrast-enhanced T1w and T2w images were both labeled to provide the AI model with appropriate information for image segmentation and training. Results of the histopathological report were used as the reference to calculate the accuracy rates.
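The patient-level split described above can be sketched as follows (a minimal illustration, not the authors' actual code). Note that the split is done per patient, never per slice, so no patient's MRI slices appear in more than one group, and the 48-patient test group includes the 6 validation patients:

```python
import random

def split_patients(patient_ids, n_train=24, n_val=6, seed=0):
    """Split at the patient level so that no patient's MRI slices
    leak across the training, validation, and test groups."""
    ids = list(patient_ids)
    random.Random(seed).shuffle(ids)
    train = ids[:n_train]               # 24 patients used for training
    val = ids[n_train:n_train + n_val]  # 6 patients for model selection
    test = ids[n_train:]                # 48 evaluated (42 + the 6 validation patients)
    return train, val, test

train, val, test = split_patients(range(72))
```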

Artificial Intelligence Systems Selection
CNN is an efficient recognition algorithm that is frequently used in image processing and pattern recognition [21,22]. In this study, we used a deep neural network architecture known as U-Net as the model for segmentation of MR images. U-Net consists of equal numbers of down-sampling and up-sampling layers, connected by so-called skip connections between opposing convolution and deconvolution layers. It concatenates a contracting path with an expansive path, and the large number of feature channels during up-sampling allows the propagation of context information to higher-resolution layers [17,23]. The U-Net architecture has proven useful for biomedical segmentation applications and medical image analyses [24][25][26][27]. Furthermore, we applied fine tuning, initializing the weights of the network's encoder from pre-trained models: VGG11, VGG16, and ResNet34 pre-trained encoders, which converge considerably faster to a steady value and reduce training time in comparison with non-pre-trained networks, were used [28,29]. Based on the findings from previous research [30], the training settings of the aforementioned three architectures included hyperparameters such as batch size, epoch, learning rate, and optimizer that can be adjusted to enhance recognition accuracy. The training results of the AI models were evaluated using data from the validation group [31].
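The contracting path, expansive path, and skip connections described above can be illustrated with a minimal NumPy sketch (toy feature-map shapes only; the actual model uses VGG/ResNet convolutional encoders rather than these bare pooling and upsampling operations):

```python
import numpy as np

def max_pool2x2(x):
    """Contracting path step: downsample a (C, H, W) feature map
    by taking 2x2 block maxima."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))

def upsample2x2(x):
    """Expansive path step: nearest-neighbour upsampling back to
    the encoder's spatial resolution."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def skip_connect(decoder_feat, encoder_feat):
    """U-Net skip connection: concatenate encoder features onto the
    upsampled decoder features along the channel axis."""
    return np.concatenate([decoder_feat, encoder_feat], axis=0)

# Toy walk-through of one encoder level and its matching decoder level.
enc = np.random.rand(8, 64, 64)   # encoder output at full resolution
bottom = max_pool2x2(enc)         # contracting path: (8, 32, 32)
up = upsample2x2(bottom)          # expansive path: back to (8, 64, 64)
merged = skip_connect(up, enc)    # (16, 64, 64), fed to the next conv block
```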

Images Processing and Analysis
In this study, MR images were obtained using 1.5T (Optima MR450W, GE Healthcare, Chicago, IL, USA) and 3T superconducting units (Discovery MR 750, GE Healthcare, Chicago, IL, USA). The imaging protocol typically included sagittal contrast-enhanced T1w (TR/TE, 501/minimum ms; section thickness, 4 mm; gap, 1.2 mm; matrix, 288 × 192; and FOV, 280 mm) and sagittal T2w (TR/TE, 5000/90 ms; section thickness, 5 mm; gap, 1 mm; echo-train length, 19; matrix, 320 × 224; and FOV, 240 mm). Since the MR images have different formats and resolutions, initial quality control is important to filter out images with improper formats and low resolutions. We cropped all raw images to 896 × 896 pixel resolution and standardized the pixel intensities using P′i = (Pi − Pmean)/Pstd, where Pi represents each pixel, Pmean and Pstd are the mean and standard deviation of all pixels, respectively, and P′i is the resulting standardized pixel. Moreover, to improve DL efficiency, data augmentation was conducted by multiplying, horizontally flipping, vertically flipping, and applying affine transformations to all MR images. Thus, multiple images were derived from the original ones used for AI model training. The augmented dataset was used only for training and not for validation or testing [32]. Thereafter, segmentation was carried out, which involves assigning a label to every pixel in an image such that pixels with the same label share certain characteristics [33]. After processing the MR images, the training and validation groups were used to establish and validate the models, respectively. We used Intersection over Union (IoU) to evaluate the performance of the AI model. IoU is an evaluation metric that measures the accuracy of an object detector on a particular dataset and is often used to evaluate the performance of CNN detectors.
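The standardization formula and the IoU metric are straightforward to express in code (a hedged NumPy sketch for illustration, not the authors' implementation):

```python
import numpy as np

def normalize(img):
    """Per-image standardization: P'_i = (P_i - P_mean) / P_std."""
    return (img - img.mean()) / img.std()

def iou(pred_mask, true_mask):
    """Intersection over Union between two binary segmentation masks."""
    pred, true = pred_mask.astype(bool), true_mask.astype(bool)
    inter = np.logical_and(pred, true).sum()
    union = np.logical_or(pred, true).sum()
    return inter / union if union else 1.0

pred = np.array([[1, 1, 0], [0, 1, 0]])
true = np.array([[1, 0, 0], [0, 1, 1]])
# intersection = 2 pixels, union = 4 pixels
```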

Establishing AI Models
After testing several different models, U-Net with ResNet34, VGG16, and VGG11 encoders pre-trained on ImageNet was used to establish the CNN-based AI models. The models were established using a QNAP TS-2888X Linux-based server with an Intel Xeon CPU, four GPU cards, and 512 GB of available RAM for training and validation. During the training process, the original MR and mask images were first adjusted to the same size and resolution as the training input images, and the MR images of the endometrium and uterus were learned through AI training. The layers were trained by stochastic gradient descent with a relatively small batch size (16 images), because of the variations in the size and shape of the uterus and endometrial lesions, and a learning rate of 0.001. To determine the best model, training for all categories was performed for 150 epochs, and the loss was calculated using the Dice-coefficient loss function rather than the cross-entropy loss function because of its advantage in handling the disparity between the sizes of the endometrial and non-endometrial areas in the image (Figure A1). After adjusting the parameters, the U-Net with VGG11 model exhibited superior performance for segmentation of the uterus and endometrial lesions in contrast-enhanced T1w, achieving mean IoU values of 94.20% and 79.16%, respectively (Tables A1, A4 and A5). However, the U-Net with ResNet34 model exhibited better performance than the other models for segmentation training on T2w, achieving mean IoU values of 91.66% and 79.31% for the uterus and endometrial lesion, respectively (Tables A2, A6 and A7). The parameters of the best model were selected (see Figure 2) and used for validation; these are listed in Table A3.
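The Dice-coefficient loss mentioned above can be sketched as follows (a NumPy illustration with the stated hyperparameters; the authors trained with stochastic gradient descent inside their deep learning framework):

```python
import numpy as np

def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss: 1 - 2|P∩T| / (|P| + |T|). Unlike cross-entropy,
    it is far less sensitive to the size disparity between the small
    endometrial region and the large background."""
    pred, target = pred.ravel(), target.ravel()
    inter = (pred * target).sum()
    return 1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

# Training settings reported in the text.
config = {"batch_size": 16, "learning_rate": 0.001,
          "epochs": 150, "optimizer": "SGD"}
```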

Statistical Analysis
To determine the differences between the radiologists' diagnoses and the AI interpretations, Chi-square tests were used to identify whether two categorical variables were independent of each other, including "accuracy," "over-staged/under-staged," and "whether concomitant conditions (coexisting uterine leiomyoma, different histology types) affect the diagnoses." For continuous variables, Pearson correlation was applied to examine the positive or negative relationship between the degree (depth) of myometrial invasion from the pathology reports and that interpreted by AI. These two variables were both measured as fractions (myometrial invasion over myometrial thickness). Box and whisker plots were used to compare the accuracy of diagnoses made by radiologists and those made by AI (the center, spread, and overall range of the depth of myometrial invasion were used as determinants), and one-way analysis of variance (ANOVA) was used to generate the F-statistics and p-values (α = 0.05 was the standard used to determine significant differences). This study used STATA 14 software (StataCorp LLC, College Station, TX, USA) for statistical analyses.
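As an illustration, the 2 × 2 chi-square statistic and the Pearson correlation used here can be computed as follows (a NumPy sketch for exposition; the study itself used STATA 14):

```python
import numpy as np

def chi_square_2x2(table):
    """Pearson chi-square statistic for a 2x2 contingency table,
    e.g. correct/incorrect diagnosis vs. radiologist/AI."""
    table = np.asarray(table, dtype=float)
    row = table.sum(axis=1, keepdims=True)
    col = table.sum(axis=0, keepdims=True)
    expected = row * col / table.sum()
    return ((table - expected) ** 2 / expected).sum()

def pearson_r(x, y):
    """Pearson correlation between pathological and AI-estimated
    invasion fractions."""
    return np.corrcoef(x, y)[0, 1]
```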

Ethical Approval
Our research was a retrospective study using the MRI images, pathological reports, and other demographic information of patients. All data/samples were fully anonymized before we accessed them. Only serial numbers were associated with the collected data/samples. We could not identify any individual based on the serial numbers. The data collection period was between January 2019 to November 2019. The IRB approved our study as "Low Risk" (IRB No.: 1-107-05-165).

Verification of the Final Model
To verify the established AI model, a total of 48 patients were used; the AI calculated the depth of myometrial invasion by endometrial cancer and then classified each case as stage IA or IB. The architecture of our AI model was adopted from Iglovikov and Shvets [24] and modified from Shvets, Iglovikov, Rakhlin, and Kalinin [25] (see Figure 3). The results were compared with the surgico-pathological findings. The accuracy rates for contrast-enhanced T1w, T2w, and the radiologists were 79.2%, 70.8%, and 77.8%, respectively. Chi-square tests showed no significant differences between the AI interpretations for either T1w or T2w and the radiologists' diagnoses (p = 0.856 and p = 0.392, respectively) (see Table 2). However, we noticed a relatively higher "over-diagnosis rate" among the radiologists. Among the incompatible cases in T1w, 7 out of 10 were over-diagnosed (over-diagnosis rate: 70.0%). Among the incompatible cases in T2w, 9 out of 14 were over-diagnosed (over-diagnosis rate: 64.3%). In contrast, among the incompatible cases in the radiologists' diagnoses, 14 out of 16 were over-diagnosed (over-diagnosis rate: 87.5%).
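The final IA/IB classification reduces to thresholding the invasion fraction at 50%, which can be sketched as follows (an illustrative simplification of the pipeline's last step, with a hypothetical accuracy helper):

```python
def stage_from_invasion(depth_fraction):
    """FIGO sub-stage of stage I endometrial cancer from the invasion
    fraction (invasion depth / myometrial thickness):
    < 50% -> IA, >= 50% -> IB."""
    return "IA" if depth_fraction < 0.5 else "IB"

def accuracy(predicted, truth):
    """Fraction of cases where the predicted sub-stage matches the
    surgico-pathological reference."""
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)
```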

Effects of Concomitant Conditions on MR Image Interpretation
In addition to the aforementioned findings, we found that the MR images interpreted by AI were more likely to be inaccurate when the patients had coexisting uterine leiomyoma (p = 0.027 for contrast-enhanced T1w and p = 0.12 for T2w). In contrast, coexisting leiomyoma usually did not affect the radiologists' MRI interpretations (p = 0.140). Other than uterine leiomyoma, different histological subtypes did not affect the accuracy of the radiologists (p = 0.413) or the AI (p = 0.549 for contrast-enhanced T1w; p = 0.727 for T2w) (see Table 3). Table 3. Influence of concomitant conditions on the accuracy rates of AI and radiologists. * "compatible" between pathology report and AI interpretation; ** "compatible" between pathology report and radiologists' diagnoses.
In addition, we found a positive correlation between the depth of myometrial invasion (the percentage of myometrial invasion over myometrial thickness) from the pathology report and that interpreted by AI (r = 0.54 for contrast-enhanced T1w, p = 0.026; r = 0.52 for T2w, p = 0.004) (see Figure A2). We also found that the distribution of the percentage of myometrial invasion may lead to discrepancies between the stages diagnosed by radiologists or AI. The results are presented in Tables 4 and 5. Table 4 and Figure 4 show the results of the radiologists' diagnoses. As shown in the table, when the degree of myometrial invasion was relatively small or large, radiologists were less likely to make incorrect decisions. However, when the depth of myometrial invasion was around 50%, the radiologists' diagnoses tended to be incompatible with the pathology reports. The box and whisker plots display the distribution of myometrial invasion. Table 5 and Figure 5 show the results of the AI interpretation. Similarly, when the degree of myometrial invasion was at either extreme, AI was more likely to generate a correct answer; however, when the depth of myometrial invasion was in the middle range (around 50%), AI also generated incompatible results. Notably, the range over which AI produced discrepant results for T2w (from 1.5% to 86.7%) was significantly wider than that for T1w or for the radiologists. Results of the ANOVA showed that the closer the depth of myometrial invasion was to 50%, the easier it was for both radiologists (F-value = 99.06, p < 0.001) and AI (F-value = 44.46, p < 0.001 for contrast-enhanced T1w; F-value = 17.68, p < 0.001 for T2w) to provide incorrect diagnoses. We found these results reasonable, since the ability to determine pre-operative MRI stage depends mainly on personal expertise and experience, which vary from person to person.
On the other hand, using AI to determine the depth of myometrial invasion may also be affected by various pathological factors, such as an irregular endomyometrial junction within a single uterus, hematometra, endometrial polyps, exophytic tumor growth, adenomyosis, or extensive leiomyomas. These factors may inevitably lead both humans and AI to incorrect diagnoses, especially near the clear cut-off value of 50% myometrial invasion.

Discussion
We compared the accuracy rates of the radiologists' diagnoses and the AI interpretations based on the depth of myometrial invasion. The results indicated that the AI interpretations for both contrast-enhanced T1w and T2w were similar to the radiologists' diagnoses. Although small differences exist, they were not statistically significant. However, we found that the closer the depth of myometrial invasion was to 50%, the easier it was for both the radiologists and AI to provide incorrect judgements. Compared with AI, radiologists were more likely to "over-stage" the results from IA to IB. We believe these findings shed light on the fact that human beings tend to act conservatively when making critical decisions. In clinical practice, when facing a situation where it is necessary to choose the lesser of two evils, radiologists would rather let the patients receive more evaluations or treatments than insufficient ones. More treatment usually includes more extensive surgery (lymph node dissection), radiation therapy, and/or chemotherapy. However, receiving more treatment is not always beneficial, and it also comes with additional risks. Patients are more likely to suffer from surgical or therapy-related complications.
In addition, we found that when patients had coexisting leiomyoma, AI was more likely to provide incorrect interpretations, whereas coexisting leiomyoma did not affect the radiologists' judgements. We believe this is because myometrial compression from a leiomyoma or bulky polypoid tumor leads to unclear boundaries between the tumor and myometrium, making it difficult for the AI to calculate the depth. The histological type of endometrial carcinoma, by contrast, affected neither the radiologists nor the AI. These findings suggest that the primary role of MRI in gynecologic oncology is to delineate the extent of the disease, not to analyze morphological features or histological types [34].
Still, AI technology is potentially useful, especially in hospitals without radiologists specializing in gynecology. There are several benefits of using AI to predict myometrial invasion before surgery. First, it can affect the choice of surgical approach, and second, it can be used to determine whether lymphadenectomy is necessary. In the early stages, patients have the chance to choose either exploratory laparotomy or minimally invasive laparoscopic surgery. The Gynecologic Oncology Group LAP2 trial for early-stage endometrial cancer reported favorable recurrence and survival outcomes of laparoscopic surgical staging [35]. In a Cochrane review published in 2018 [36], the key results revealed no difference in perioperative mortality risk, overall survival, or disease-free survival between laparoscopy and laparotomy. Furthermore, laparoscopy is associated with significantly shorter hospital stays. In hospitals without radiologists specializing in gynecology, AI technology could help identify low-risk patients with stage IA disease, and therefore the gynecological oncologist may feel more comfortable performing laparoscopic surgery.
In addition, the necessity of routine lymphadenectomy in staging surgery has been widely debated. The Mayo group described the criteria for patients with a low risk of nodal disease spread and a high disease-free survival rate: grade 1 to 2 tumors, less than 50% myometrial invasion, and tumor size less than 2 cm [37]. In the GOG LAP2 trial, among 971 patients with type 1 endometrioid carcinoma, 40% met the Mayo low-risk criteria, and only 0.8% (3/389) had positive nodes. Therefore, with careful selection of low-risk patients, lymphadenectomy may be safely avoided, which reduces surgery-related morbidity such as lower-limb lymphedema or intra-abdominal lymphocele formation. In hospitals without radiologists specializing in gynecology or gynecologists specializing in oncologic surgery, with the help of AI-based pre-operative diagnosis, a general gynecological surgeon could perform a simple hysterectomy and bilateral salpingo-oophorectomy without lymphadenectomy for selected low-risk patients. In remote areas with limited medical resources, AI-based pre-operative diagnosis would reduce the need to transfer low-risk patients to a tertiary medical center.
We noted a few potential limitations of our study. First, our datasets were built on cases from a single institution. The diagnoses made using MRI and pathology were based on personal expertise and experience, which exhibit individual differences. Although the Tri-Service General Hospital is a medical center in Taipei, we still cannot eliminate the possibility of bias in the data. Second, the key imaging sequence used to assess uterine cavity tumors is sagittal T2w. However, a multiparametric approach that combines T2w and contrast-enhanced T1w along different planes may represent the most comprehensive way to assess tumor spread. Moreover, our study did not analyze diffusion-weighted imaging or MR spectroscopy, which could serve as potential research directions in diagnosing endometrial cancer. Moving AI training from a two-dimensional approach (largest cross-sectional tumor area) to a three-dimensional approach (whole tumor) could lead to differences in establishing the AI-based model or affect the accuracy of the interpretation of the results, especially for results obtained by assessing only the sagittal plane of T2w and contrast-enhanced T1w [38,39]. Third, different CNN architectures might be suitable for interpreting images of different diseases, and thus selecting and establishing appropriate architectures might provide slightly different results [23][24][25][26]. Fourth, the impact of ethnic differences was not examined because all the patients in this study were Taiwanese. Therefore, using AI to determine the depth of myometrial invasion still has weaknesses, given the limited number of endometrial cancer cases used for AI training. The AI's decision cannot be final at this point.
However, if more diverse cases can be used for deep learning in the future, or if we can develop a multicenter database for this purpose, we may further enhance the validity of the AI and improve the quality of our health care. Alternatively, we could design a prospective randomized study to identify a population of patients with endometrial cancer to examine the efficacy of the AI-assisted method. Our paper, as a pilot study in this uncharted territory, shows that using AI as an aid in interpreting the depth of myometrial invasion on MRI is indeed advantageous in reducing the high inter-observer variability among radiologists [18].

Conclusions
In summary, although current AI technology may not be able to replace the expertise and experience of physicians, AI could be used as an auxiliary resource. From the perspective of balancing human proactive errors and passive errors, it could be beneficial for physicians to have a "second opinion" from AI technology before making critical judgement calls on endometrial cancer. Our research is the first attempt to use AI technology to evaluate the invasion depth of the myometrium in early-stage endometrial cancers; to date, there have been no publications about AI applications in endometrial cancer. In the future, refinement of model selection and establishment of a deep learning model with a larger image database are essential to improve accuracy. We believe artificial intelligence has the potential to assist radiologists or serve as a reasonable alternative for pre-operative evaluation of the myometrial invasion depth of stage I endometrial cancers.

Figure A2. (A) Relationship between AI interpretation (contrast-enhanced T1w) and pathology report results; (B) relationship between AI interpretation (T2w) and pathology report results.