A Deep Learning Model System for Diagnosis and Management of Adnexal Masses

Simple Summary
This multicenter study developed a deep learning (DL) model system to diagnose adnexal masses on ultrasound images. The work has three main innovations. First, the DL system contained five models: a detector, a mass segmentor, a papillary segmentor, a type classifier, and a pathological subtype classifier. The system could therefore complete the entire diagnostic process for adnexal masses on ultrasound images. Second, the DL system could discriminate borderline tumors from benign and malignant tumors with the assistance of annotations for papillary projections (a significant morphological feature of borderline tumors). Third, the benign tumors were classified into five pathological subtypes with different risks of clinical complication and accurate disease.

Abstract
Appropriate clinical management of adnexal masses requires a detailed diagnosis. We retrospectively collected ultrasound images of 1559 cases from the first Center of Chinese PLA General Hospital and developed a fully automatic deep learning (DL) model system to diagnose adnexal masses. The DL system contained five models: a detector, a mass segmentor, a papillary segmentor, a type classifier, and a pathological subtype classifier. To test the DL system, 462 cases from another two hospitals were recruited. The DL system identified benign, borderline, and malignant tumors with macro-F1 scores that varied from 0.684 to 0.791, which may help prevent both delayed and excessive treatment. The macro-F1 scores of the pathological subtype classifier for categorizing the benign masses varied from 0.714 to 0.831. The detailed classification can inform clinicians of the complications associated with each pathological subtype of benign tumors. The discrimination between borderline and malignant tumors, and of inflammation from other subtypes of benign tumors, needs further study. The accuracy and sensitivity of the DL system were comparable to those of the expert and intermediate sonographers and exceeded those of the junior sonographer.


Introduction
Adnexal masses are widespread in women. Most masses are benign and some may disappear spontaneously [1]. Conservative management is often advised [2]. However, some adnexal masses classified as borderline or malignant may pose a significant risk.

Ethical Approval
In this retrospective study, the use of previously obtained sonographic images was approved by the ethical committees of all participating centers (the first Center of Chinese PLA General Hospital, the 7th Center of Chinese PLA General Hospital, and Hengshui People's Hospital) with a waiver of patient informed consent.

Participants and Datasets
We retrospectively reviewed transvaginal and abdominal ultrasound images taken at the first Center of Chinese PLA General Hospital from 2015 to 2021. The data from 2015 to 2018 were used to create a training dataset while the data from 2019 to 2021 were used as the internal validation dataset. Patients were retrospectively recruited in the 7th Center of Chinese PLA General Hospital from 2020 to 2022 for external test dataset 1, and patients in Hengshui People's Hospital from 2021 to 2022 were used for external test dataset 2.
Consecutive images from patients with at least one adnexal mass who underwent surgery (providing histological diagnoses for ground truth) and had normal ovaries or ovaries that had been resected previously were eligible for inclusion. Pregnant patients were excluded. Grayscale and color Doppler ultrasound images were acquired with Mindray Resona8T, Mindray Resona7, GE VolusonE8, EPIQ7, SAMSUNG WS80A, HITACHI, and SIEMENS machines equipped with transvaginal and transabdominal probes. The study included only one adnexal mass per patient. If more than one mass was detected, the mass with the most complex morphology or, in the case of similar morphology, the largest diameter was used [16,17].
The pathological results provided the definitive diagnosis. The final pathological diagnosis results were classified into three types: benign, borderline, and malignant tumors. Benign masses were further categorized into five pathological subtypes: endometriomas, other epithelial tumors except endometriomas, germ cell tumors, sex cord-stromal tumors, and inflammation.

Annotation and Framework
A sonographer annotated the adnexal mass and papillary areas on the ultrasound images. The annotations were checked by two expert sonographers and finalized upon their confirmation. The morphological feature of "papillary projection" was defined as any solid projection into the cyst cavity from the cyst wall with a height ≥ 3 mm. The hyper-reflective area in dermoid cysts and "sludge" or blood clots in endometriomas were not regarded as papillary projections [29]. A cyst with papillary projections is the most significant ultrasound characteristic of borderline tumors [30]. We thus included the additional annotation of papillary projections to discriminate borderline adnexal masses from benign and malignant tumors [26,31,32].
We designed a deep learning (DL) system containing five models to complete the diagnosis of adnexal masses: a detector, a mass segmentor, a papillary segmentor, a type classifier, and a pathological subtype classifier (Figure 1). First, the detector aimed to find the ovaries with adnexal masses while discarding normal or resected ovaries. The detector focused on two-dimensional grayscale images because color Doppler flow images are not available in normal ovaries. Second, the mass segmentor located the area of the tumor. Third, for masses with papillary projections, the papillary segmentor located the area of papillary projections based on the mass area. Fourth, the type classifier predicted the type (benign, borderline, or malignant) of the adnexal masses based on the mass area, the papillary area, and the original images. Finally, if the type was benign, the pathological subtype classifier inferred the detailed pathology of the adnexal masses. The borderline and malignant types were output directly as the final result. No human intervention was performed in the entire process. All DL models except the detector were compatible with both two-dimensional grayscale images and color Doppler flow images; the detector used only two-dimensional grayscale images.

Figure 1. Flowchart of the framework of the DL model system. Images imported into the system are indicated in dotted boxes. Grey boxes show the five parts of the DL model system. Diagnosis results are shown in solid boxes. Images were input into the DL system, and those with adnexal masses were picked out by the detector. The area of masses was recognized by the mass segmentor and papillary projections were located by the papillary segmentor if they existed. The results of these two segmentors and the original images were taken into consideration to make the classification. The type classifier assigned the adnexal masses to one of three types: benign, borderline, and malignant. The borderline and malignant tumors were directly output as the final result. Benign tumors were further distinguished as endometriomas, other epithelial tumors except endometriomas, germ cell tumors, sex cord-stromal tumors, and inflammation.
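The control flow of the five-model system can be sketched as follows. This is only an illustrative outline under stated assumptions: the five models are passed in as plain callables, and the names `diagnose` and `Diagnosis` and the argument order of each callable are hypothetical, not taken from the study's code.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional

BENIGN = "benign"  # the only type that is passed on to the subtype classifier


@dataclass
class Diagnosis:
    has_mass: bool
    mass_type: Optional[str] = None       # "benign", "borderline", or "malignant"
    benign_subtype: Optional[str] = None  # set only when mass_type is benign


def diagnose(
    image: Any,
    detector: Callable[[Any], bool],
    mass_segmentor: Callable[[Any], Any],
    papillary_segmentor: Callable[[Any, Any], Any],
    type_classifier: Callable[[Any, Any, Any], str],
    subtype_classifier: Callable[[Any, Any, Any], str],
) -> Diagnosis:
    """Run the five stages in order, with no human intervention in between."""
    # Stage 1: discard images without an adnexal mass.
    if not detector(image):
        return Diagnosis(has_mass=False)
    # Stage 2: locate the mass area.
    mass_mask = mass_segmentor(image)
    # Stage 3: locate papillary projections inside the mass area (may be empty).
    papillary_mask = papillary_segmentor(image, mass_mask)
    # Stage 4: benign / borderline / malignant.
    mass_type = type_classifier(image, mass_mask, papillary_mask)
    if mass_type != BENIGN:
        # Borderline and malignant types are output directly.
        return Diagnosis(has_mass=True, mass_type=mass_type)
    # Stage 5: pathological subtype, for benign masses only.
    subtype = subtype_classifier(image, mass_mask, papillary_mask)
    return Diagnosis(has_mass=True, mass_type=mass_type, benign_subtype=subtype)
```

Passing the models as callables keeps the sketch independent of any particular DL framework; each callable would wrap the corresponding trained network.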

Model Architecture
The backbone of the detector was LKResnet-18, a combination of ResNet-18 [33] and LKNet [34]. ResNet-18 consists of eight residual blocks, each of which contains two convolutional layers and a residual link. Every convolutional layer of ResNet-18 except the first has a kernel size of 3 × 3, and such small kernels mainly capture the texture features of the image. Larger kernels, in contrast, can capture the shape information of the image, and human recognition of objects is based more on shape cues than on texture cues [34]. We therefore used a kernel size of 15 × 15 instead of 3 × 3 and used depthwise convolution to balance the kernel size against GPU memory. In addition, we replaced the ReLU activation function and batch normalization with the more efficient and robust GELU and layer normalization. We adopted LKResnet-18 for two reasons: first, a small convolutional kernel might lead to extracting texture-based features rather than structure-based features, while large convolutional kernels, such as 15 × 15 and 31 × 31 kernels, can achieve better performance [32]; second, in ultrasound images, texture provides limited information, whereas structure and contour information is much more important. Therefore, the large kernel convolutional network (LKNet) was a suitable solution.
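The trade-off between kernel size and memory can be made concrete with a simple weight count. The sketch below assumes a layer with 64 input and 64 output channels (the width of the first stage in the standard ResNet-18) and assumes that each depthwise convolution is followed by a 1 × 1 pointwise convolution; the helper names are hypothetical and the numbers illustrate the general idea rather than the exact LKResnet-18 configuration.

```python
def conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a standard k x k convolution (bias omitted)."""
    return c_in * c_out * k * k


def depthwise_pointwise_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a k x k depthwise convolution followed by a 1 x 1 pointwise one."""
    depthwise = c_in * k * k   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


channels = 64  # assumed layer width, for illustration only

standard_3x3 = conv_params(channels, channels, 3)                   # 36,864
standard_15x15 = conv_params(channels, channels, 15)                # 921,600
separable_15x15 = depthwise_pointwise_params(channels, channels, 15)  # 18,496

print(standard_3x3, standard_15x15, separable_15x15)
```

Under these assumptions, a standard 15 × 15 layer needs 25 times the weights of the 3 × 3 layer, whereas the depthwise-pointwise version needs about half of them.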
The convolutional layers extracted 512 latent features from each image; a fully connected classifier then estimated the probability of a tumor being present in the image based on the pattern of latent features. In the training stage, we adopted the following data augmentations to improve the robustness of the model: random shift, random scale, and random rotation. The model parameters were updated over 20,000 iterations with a batch size of eight. The loss function was cross-entropy, which is defined as:

$$L(\theta) = -\frac{1}{N}\sum_{i=1}^{N} y^{(i)} \log p\left(x^{(i)} \mid \theta\right) \qquad (1)$$

where N is the number of samples, θ denotes the parameters of the detector model, y^(i) represents the tumor label of sample i, and p(x^(i) | θ) is the probability of the tumor prediction x^(i) of sample i.

The purpose of the mass segmentor was to segment the specific location of the tumor in the image. The segmentor model structure was U-net [35] and the backbone was LKResnet-18. As shown in Table S1, we compared the performances of different segmentation models, and U-net with LKResnet-18 outperformed the others. U-net contains an encoder and a decoder; the former extracted high-level semantic features of the image and reduced the resolution, while the latter fused high-level semantic features with low-level texture features and restored the resolution. In the training stage, we adopted the following data augmentations: random shift, random scale, random rotation, horizontal flip, and vertical flip. The model parameters were updated over 20,000 iterations with a batch size of four. The loss function combined cross-entropy loss and boundary loss [36]. Boundary loss used boundary matching to supervise the segmentor. The settings for the mass segmentor were also used for the papillary segmentor.
The model structure of the type classifier and pathological subtype classifier was LKResnet-18, as shown in Figure 2. The input to the classifier was the combination of the outputs of the mass segmentor and papillary segmentor. We adopted the following data augmentations for training: random shift, random scale, random rotation, Gaussian blur, random brightness, random contrast, horizontal flip, and vertical flip. The model parameters were updated in 10,000 iterations with a batch size of four. The loss function adopted cross-entropy loss as given in Equation (1).
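For readers who want to see the loss spelled out, the following minimal sketch computes a cross-entropy loss for a three-class classifier. It assumes the standard form (the mean negative log-probability assigned to the true class); the function name, the clipping constant, and the example probabilities are illustrative assumptions, not values from the study.

```python
import math
from typing import Sequence


def cross_entropy(probabilities: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Mean negative log-probability that the model assigns to the true class.

    probabilities[i][k] is the predicted probability of class k for sample i;
    labels[i] is the index of the true class of sample i.
    """
    eps = 1e-12  # avoids log(0) when a true class receives zero probability
    total = 0.0
    for probs, label in zip(probabilities, labels):
        total -= math.log(max(probs[label], eps))
    return total / len(labels)


# Made-up predictions for two images; classes: 0 = benign, 1 = borderline, 2 = malignant.
example_probs = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
example_labels = [0, 2]
print(cross_entropy(example_probs, example_labels))
```

The loss is zero only when every true class receives probability 1, and it grows as probability mass moves to the wrong classes.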
Each case contained multiple two-dimensional sonographic images and color doppler flow images. The type and pathological subtype classifiers scored each image, which provided multiple predictions for each case. As shown in Figure 2, the model used majority voting, a rule-based case-wise strategy, to generate the case-wise prediction.
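Majority voting over the image-wise predictions of one case can be sketched as below. The text does not say how ties were resolved, so the tie-break used here (the tied label that appears first in image order) is purely an assumption, as is the function name.

```python
from collections import Counter
from typing import Sequence


def case_prediction(image_predictions: Sequence[str]) -> str:
    """Return the label predicted for most images of a case."""
    if not image_predictions:
        raise ValueError("a case needs at least one image prediction")
    counts = Counter(image_predictions)
    top = max(counts.values())
    # Assumed tie-break: among the most frequent labels, take the one seen first.
    for label in image_predictions:
        if counts[label] == top:
            return label
    raise AssertionError("unreachable: some label always has the top count")


# Made-up image-wise predictions for a single case.
print(case_prediction(["benign", "malignant", "benign", "borderline"]))
```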
The five models used the AdamW [37,38] optimizer with a learning rate of 3 × 10⁻⁵ and a weight L2 regularization of 3 × 10⁻⁴. All experiments were performed on two NVIDIA GeForce RTX 3090 graphics processing units.

Figure 2. The framework of the type and pathological subtype classifiers. "LKB (c,s)" refers to a basic large kernel block containing two depthwise-pointwise convolutions with output channels of c and strides of s. For each image, including color Doppler flow images and two-dimensional sonographic images, the mass segmentor was used to locate the tumor region. Then the image was cropped to just the tumor region. The papillary segmentor was used to locate the papillary region in the tumor area. The papillary region and tumor region were concatenated as the input into the classifier.

Evaluation and Comparison with Sonographers
The diagnostic performance of the DL system containing five models was evaluated using the internal validation dataset, external test dataset 1, and external test dataset 2. Three reviewers independently assessed all images in external test datasets 1 and 2. All reviewers were certified sonographers. Reviewer A was an expert gynecological sonographer with 37 years of clinical experience. Reviewer B was an intermediate sonographer with 16 years of experience. Reviewer C was a junior sonographer with 2 years of clinical experience. All reviewers were blinded to the clinical information, original ultrasound reports, and pathological results. The diagnostic performance of the sonographers was evaluated and compared with that of the DL model system in the external test datasets.

Statistical Analysis
Continuous measures are presented as mean. Categorical measures are presented as proportion with 95% CIs and were compared using the chi-squared test. The frequency of the morphological characteristic of papillary projection for different adnexal mass types was compared using pairwise differences.
We evaluated the diagnostic efficiency of the DL system based on the discrimination and calibration performance. For the detector, accuracy in identifying adnexal masses was measured. The dice score was computed to estimate the performance of the mass and papillary segmentors. We constructed the confusion matrix for the type classifier and the pathological subtype classifier. The accuracy, sensitivity, specificity, positive predictive value, negative predictive value, and macro-F1 scores were calculated for both classifiers [39]. The calibration plot and the Brier score were used to estimate the calibration of the type classifier and the pathological subtype classifier.
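As an illustration of how a macro-F1 score follows from a confusion matrix, the sketch below computes the F1 score of each class in a one-vs-rest manner and averages them with equal weight. The function name and the example matrix are made up for illustration and are not results from this study.

```python
from typing import Sequence


def macro_f1(confusion: Sequence[Sequence[int]]) -> float:
    """Unweighted mean of per-class F1 scores.

    confusion[i][j] is the number of cases of true class i predicted as class j.
    """
    n_classes = len(confusion)
    f1_scores = []
    for k in range(n_classes):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp                              # missed cases of class k
        fp = sum(confusion[i][k] for i in range(n_classes)) - tp  # wrongly assigned to k
        denominator = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(f1_scores) / n_classes


# Hypothetical matrix; rows and columns ordered benign, borderline, malignant.
example_confusion = [
    [8, 2, 0],
    [1, 3, 1],
    [0, 1, 4],
]
print(round(macro_f1(example_confusion), 3))
```

Because every class contributes equally, a rare class such as borderline tumors affects the macro-F1 score as much as a common one, which is why the metric suits imbalanced datasets.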
The accuracy, sensitivity, specificity, positive predictive value, negative predictive value, and macro-F1 scores of the sonographers were also computed for comparison with the DL system. Statistical analysis was done using Python; p < 0.05 was considered to indicate a statistically significant difference.

Data and Patients
A total of 1099 cases (4497 images) were included in the training dataset and 460 cases (1217 images) in the internal validation set. External test set 1 contained 490 images of 198 cases; external test set 2 had 761 images of 264 cases. The baseline characteristics of each dataset are shown in Table 1.

Papillary Projections
Papillary projections are more frequent in borderline than in benign and malignant tumors. Benign and malignant tumors do not differ in papillary projection frequency.
In the training dataset, the frequency of papillary projection was 5.61% (21/374) in benign adnexal masses, 54.00% (27/50) in borderline tumors, and 8.98% (15/167) in malignant tumors. Borderline tumors had a higher rate of papillary projections than benign tumors (p < 0.001) and malignant tumors (p < 0.001). There was no statistical difference in the presence of papillary projections between benign and malignant tumors (p = 0.147).
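A pairwise comparison of these frequencies can be sketched with a 2 × 2 chi-squared test. The sketch assumes a Pearson chi-squared test without continuity correction, which the text does not state explicitly; under that assumption the benign versus malignant comparison gives a p value close to the reported 0.147. The function name is hypothetical.

```python
import math
from typing import Tuple


def chi_squared_2x2(a_pos: int, a_total: int, b_pos: int, b_total: int) -> Tuple[float, float]:
    """Pearson chi-squared statistic and p value for two proportions.

    No continuity correction is applied (an assumption of this sketch).
    """
    a_neg = a_total - a_pos
    b_neg = b_total - b_pos
    n = a_total + b_total
    col_pos = a_pos + b_pos
    col_neg = a_neg + b_neg
    statistic = n * (a_pos * b_neg - a_neg * b_pos) ** 2 / (
        a_total * b_total * col_pos * col_neg
    )
    # Survival function of the chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Counts from the training dataset reported above.
borderline_vs_benign = chi_squared_2x2(27, 50, 21, 374)
benign_vs_malignant = chi_squared_2x2(21, 374, 15, 167)
print(borderline_vs_benign, benign_vs_malignant)
```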
In external test dataset 1, 5.88% (3/51) of benign tumors and 50% (4/8) of borderline tumors showed the morphological characteristic of papillary projection, while no malignant tumors presented with that ultrasound feature. There was a higher rate of papillary projections in borderline tumors than benign (p < 0.001) or malignant tumors (p < 0.001). No statistical difference was detected between benign and malignant tumors (p = 0.106).

Diagnostic Performance for the DL Model System
The performance of each of the five models in the DL system was considered in turn. The detector was implemented in the training test to distinguish sonographic images with adnexal masses from those without lesions. The detector performed very well, achieving an accuracy of 0.965 (95% CI: 0.933-0.975) on the internal validation dataset. For external test dataset 1, the accuracy was 0.943 (0.922-0.951); for external dataset 2, the accuracy was 0.931 (0.919-0.952).
The mass segmentor was created to ascertain the location of the adnexal masses. It performed well, achieving a dice score of 0.945 in the internal validation dataset. It remained stable in the multicenter test datasets. The dice score for the mass segmentor was 0.923 for external test dataset 1 and 0.912 for external test dataset 2.
The papillary projection segmentor located the projections accurately. The dice score was 0.864 for the internal validation dataset, 0.852 for external test dataset 1, and 0.855 for external test dataset 2. Figure 3 shows the results of four cases of the mass segmentor and the papillary segmentor.
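The Dice score quoted for the two segmentors measures the overlap between a predicted mask and the annotated mask. A minimal sketch on binary masks stored as nested lists is given below; the function name, the toy masks, and the convention of returning 1.0 when both masks are empty are assumptions of this example.

```python
from typing import Sequence


def dice_score(predicted: Sequence[Sequence[int]], annotated: Sequence[Sequence[int]]) -> float:
    """Dice = 2 * |P ∩ A| / (|P| + |A|) for two equally sized binary masks."""
    intersection = 0
    predicted_area = 0
    annotated_area = 0
    for predicted_row, annotated_row in zip(predicted, annotated):
        for p, a in zip(predicted_row, annotated_row):
            intersection += p * a
            predicted_area += p
            annotated_area += a
    if predicted_area + annotated_area == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2 * intersection / (predicted_area + annotated_area)


# Toy 2 x 3 masks (1 = pixel inside the segmented region).
predicted_mask = [[1, 1, 0], [0, 1, 0]]
annotated_mask = [[1, 0, 0], [0, 1, 1]]
print(dice_score(predicted_mask, annotated_mask))
```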
The type classifier was used to classify adnexal masses as benign, borderline, or malignant. According to the confusion matrices of the type classifier in Figure 4, most predicted results for the adnexal masses were in accordance with the ground truth. For the internal validation dataset, the macro-F1 score was 0.791. For external test dataset 1, the classifier had a macro-F1 score of 0.749. For external test dataset 2, the macro-F1 score was 0.684.
The pathological subtype classifier could discriminate among the five pathological subtypes of benign adnexal tumors. The results were visualized in the confusion matrix shown in Figure 4. The macro-F1 score in the internal validation dataset to distinguish among endometriomas, other epithelial tumors except endometriomas, germ cell tumors, sex cord-stromal tumors, and inflammation was 0.831. For external dataset 1, the macro-F1 score was 0.826. For external dataset 2, the macro-F1 score of the pathological subtype classifier was 0.714.
Comprehensive diagnostic performance parameters of the type classifier and the pathological subtype classifier are presented in Tables 2 and 3. The calibration plots further indicated that the type classifier and the pathological subtype classifier fit the data very well ( Figure 5). The Brier score of the type classifier was 0.090 for the internal validation dataset, 0.102 for external test dataset 1, and 0.082 for external test dataset 2. The Brier score for the pathological subtype classifier was 0.056 for the internal validation dataset, 0.064 for external test dataset 1, and 0.058 for external test dataset 2. Figure 6 shows the final classification of the type classifier and the pathological subtype classifier.
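The Brier score summarizes calibration as the mean squared difference between predicted probabilities and the one-hot ground truth, with lower values indicating better calibration. Several multi-class variants exist and the text does not specify which one was used; the sketch below assumes the variant that averages over both samples and classes. The function name and the example probabilities are hypothetical.

```python
from typing import Sequence


def brier_score(probabilities: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Mean squared error between predicted probabilities and one-hot labels.

    Averaged over samples and classes (an assumed variant).
    """
    n_classes = len(probabilities[0])
    total = 0.0
    for probs, label in zip(probabilities, labels):
        for k in range(n_classes):
            target = 1.0 if k == label else 0.0
            total += (probs[k] - target) ** 2
    return total / (len(labels) * n_classes)


# Made-up three-class predictions (benign, borderline, malignant) for two cases.
example_probs = [[0.8, 0.1, 0.1], [0.2, 0.2, 0.6]]
example_labels = [0, 2]
print(brier_score(example_probs, example_labels))
```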

Comparison with Sonographers
The diagnostic performance parameters of three reviewers with different experience were evaluated using the external test datasets (Table 4).
For discrimination of benign tumors, the DL system had a higher accuracy (0.863 vs. 0.578, p < 0.001), sensitivity (0.824 vs. 0.451, p < 0.001), and specificity (0.902 vs. 0.706, p = 0.013) than the junior reviewer for external test dataset 1. For external test dataset 2, the specificity of the DL system for benign tumors was higher than that of the expert (0.939 vs. 0.606, p = 0.001) and the junior sonographer (0.939 vs. 0.697, p = 0.011) but the sensitivity was lower than that of the expert sonographer (0.825 vs. 0.921, p = 0.023).
For classification of borderline tumors, the DL system achieved a higher sensitivity (0.625 vs. 0.000, p = 0.026) than that of the junior reviewer in external test dataset 2.
For classifying malignant tumors, the accuracy, sensitivity, and specificity of the DL system also exceeded those of the junior reviewer (accuracy: 0.843 vs. 0.559, p < 0.001; sensitivity: 0.907 vs. 0.698, p = 0.015; specificity: 0.797 vs. 0.458, p < 0.001) in external test dataset 1. The specificity of the DL system was lower than that of the expert sonographer when discriminating malignant tumors in both external test dataset 1 (0.797 vs. 0.932, p = 0.031) and external test dataset 2 (0.843 vs. 0.940, p = 0.011).
In the external test datasets, the diagnostic performance of the DL system was comparable to that of the intermediate sonographer.

Discussion
We established a DL system that comprised five models (a detector, a mass segmentor, a papillary segmentor, a type classifier, and a pathological subtype classifier) to automatically diagnose adnexal masses in sonographic images. The DL model system could identify the existence of the tumor and recognize the area of the mass and papillary projection precisely. Masses were correctly classified into benign, borderline, and malignant tumors; benign tumors were further categorized into one of five pathological subtypes. Our DL system had the ability to complete multiple tasks. This allowed the DL model system to automatically implement the complete diagnostic process for adnexal masses and provide abundant information for clinical therapy [31].
The DL system showed good discrimination in the internal validation dataset and external test datasets 1 and 2, with macro-F1 scores of 0.791, 0.749, and 0.684, respectively. According to the confusion matrices, the DL system was able to distinguish benign and malignant adnexal tumors precisely. Discrimination between benign and malignant tumors has also been achieved through deep learning in other studies [22]. However, previous DL models for adnexal masses did not discriminate borderline tumors from benign and malignant tumors. The ability to discriminate borderline tumors is essential in choosing the appropriate treatment for patients with adnexal masses. Fertility-sparing surgery is the gold standard therapeutic modality for women with borderline tumors who wish to maintain their fertility without impacting overall survival [28,40]. If borderline tumors are misdiagnosed as benign tumors, clinical treatment may be delayed [30]. The radical surgery commonly used for patients with malignant tumors may be excessive given that borderline tumors have a better overall 5-year survival than malignant tumors [4,28]. Therefore, we tried to discern borderline tumors from benign and malignant tumors by annotating morphological features of the masses, which could direct the DL models to extract more valuable imaging information for medical diagnosis. The papillary projection was a morphological characteristic that appeared more frequently in borderline tumors than in benign or malignant tumors (p < 0.001). In addition, papillary projections could be segmented successfully in our study. Accurate information on papillary projections might allow the DL system to improve its ability to distinguish borderline tumors. In this study, we correctly classified 11 of 15 borderline tumors in the internal validation dataset. However, the type classifier was not reliable in the external test datasets. According to the confusion matrices, only three of eight and five of eight borderline tumors were diagnosed correctly in the two respective external datasets. Moreover, most of the misdiagnosed borderline tumors were predicted as malignant tumors. In previous studies in which borderline adnexal tumors were difficult to discriminate, it was recognized that borderline tumors should be classified as malignant tumors to improve the survival rates. In this context, the poor performance in classifying borderline tumors may be acceptable. We reviewed all the borderline tumors that were misdiagnosed in the external datasets. All nine cases diagnosed as malignant tumors presented as multilocular cysts with or without a solid component. Therefore, further study of adnexal masses with multiple septations may improve the discrimination between borderline and malignant tumors.
Observation is preferred for patients with benign adnexal tumors to minimize the potential risks and complications of surgery [2]. During observation, 20.2-39% of benign tumors spontaneously resolve and the risk of malignancy is lower than 0.5% [1,41,42]. However, the risk of potential complications such as cyst rupture and torsion and the presence of clinical symptoms remain the primary reasons for surgery [1]. Different subtypes of benign adnexal masses carry different complications, and doctors should pay attention to the complications associated with each subtype. Thus, distinguishing among the pathological subtypes of benign tumors is clinically important. Morphological features of benign tumors were previously used only for excluding malignant tumors, but they can also help with classification into detailed pathological subtypes. The subtype classifier of the DL system was able to discern most subtypes of benign tumors, with macro-F1 scores of 0.831 in the internal validation dataset, 0.826 in external test dataset 1, and 0.714 in external test dataset 2. The pathological subtype classifier performed well when distinguishing endometriomas, other epithelial tumors except endometriomas, germ cell tumors, and sex cord-stromal tumors: the diagnostic parameters were satisfactory, as shown in Table 3. An accurate diagnosis may assist with the decision on whether an operation is necessary. When endometriomas are diagnosed, intervention is advised for those with unrelieved pelvic pain and desire for fertility, and follow-up is required to survey for recurrence [7]. Dermoid cysts (benign germ cell tumors) are the most frequent adnexal mass to twist. Accurate discrimination can therefore prevent potential ovarian torsion via a timely operation, especially for tumors with a larger diameter. In contrast, torsion does not often occur in endometriomas or malignant tumors [8]. The precise diagnosis of sex cord-stromal tumors may allow clinical signs of hormonal production such as virilization, precocious puberty, and menstrual changes to be explained and settled [9]. When inflammation is diagnosed, the proper anti-inflammatory drug therapy can effectively relieve symptoms, preventing unnecessary surgery. However, the ultrasound manifestations of inflammation vary across the stages of the disease. In addition, the low rate of inflammation among benign tumors increases the difficulty of discriminating inflammation reliably. The sensitivity was only 0.400 in the internal validation dataset and 0.167 in external test dataset 2. In real clinical work, experienced sonographers would take the patients' clinical symptoms into consideration to assist in the diagnosis of inflammation.
The DL system had an excellent diagnostic performance, exceeding the accuracy and sensitivity of the junior sonographer and matching that of the intermediate and expert sonographers. The sensitivity of the DL system to benign and malignant tumors was higher than that of the junior sonographer in external test dataset 1 and the sensitivity to borderline tumors was higher for the DL system in external test dataset 2. Using the DL system would thus be practical in clinical work and improve the ability of a sonographer lacking experience. Moreover, the stability of the DL system was proved by testing it using external datasets from different centers.
Chen et al. established DL algorithms to distinguish malignant from benign adnexal tumors with a diagnostic performance comparable to expert subjective assessment, similar to our study [26]. However, Chen's study did not involve the discrimination of borderline tumors or a comparison with expert assessments.
Our DL system had some limitations. First, we hoped to separate primary adnexal cancer from secondary metastatic cancer, but our DL model was unsuccessful at achieving this. The insufficient number of cases of secondary metastatic cancer was likely the key reason. In addition, more morphological features need to be found to distinguish primary from metastatic adnexal cancer. Second, the performance of the DL system in discerning borderline tumors and inflammation was not satisfactory. More valuable ultrasound characteristics for borderline tumors and a larger sample may improve this. Third, this was a retrospective study, so prospective external validation needs to be carried out in the future.

Conclusions
In this multicenter study, we implemented a DL model system to perform the complete diagnostic process for adnexal masses. The masses were detected and segmented automatically and classified into benign, borderline, and malignant types. The benign tumors were then further classified into different pathological subtypes. This DL system matched the abilities of expert and intermediate sonographers and outperformed the junior sonographer.