A Comparative Study of Deep Neural Networks for Real-Time Semantic Segmentation during the Transurethral Resection of Bladder Tumors

Bladder cancer is a common and often fatal disease. Papillary bladder tumors are well detectable using cystoscopic imaging, but small or flat lesions are frequently overlooked by urologists. However, detection accuracy can be improved if the images from the cystoscope are segmented in real time by a deep neural network (DNN). In this paper, we compare eight state-of-the-art DNNs for the semantic segmentation of white-light cystoscopy images: U-Net, UNet++, MA-Net, LinkNet, FPN, PAN, DeepLabv3, and DeepLabv3+. The evaluation includes per-image classification accuracy, per-pixel localization accuracy, prediction speed, and model size. Results show that the best F-score for bladder cancer (91%), the best segmentation map precision (92.91%), and the lowest size (7.93 MB) are also achieved by the PAN model, while the highest speed (6.73 ms) is obtained by DeepLabv3+. These results indicate better tumor localization accuracy than reported in previous studies. It can be concluded that deep neural networks may be extremely useful in the real-time diagnosis and therapy of bladder cancer, and among the eight investigated models, PAN shows the most promising results.


Introduction
Bladder cancer is one of the ten most common types of cancer and is also reported to be a leading cause of death in Western countries [1]. The diagnosis is most often based on white-light cystoscopy imaging, which is excellent for the detection of papillary bladder tumors, but small or flat lesions may not be visible clearly enough [2]. Furthermore, the interpretation of cystoscopy findings can vary from examiner to examiner, since no two physicians have the same skills and experience [3]. Overall, it is estimated that up to 20% of tumors are overlooked during white-light cystoscopy [4].
Deep learning (DL) methods are developing rapidly, and as a result, neural networkbased bladder cancer diagnosis is becoming increasingly popular [4][5][6]. DL solutions can not only achieve diagnostic accuracy similar to that of experienced specialists [2,7], but their results are also objective and reproducible [3], as they are based on mathematical operations. Using neural networks to help diagnose bladder cancer could result in fewer tumors going undiagnosed, which could save lives. In addition, fewer healthy patients may need to undergo unnecessary biopsies, which is a process that is invasive, painful, and expensive.
One of the first studies to examine deep learning-based bladder cancer diagnosis is the work of Ikeda et al. [8], who used the GoogLeNet model [9] to classify cystoscopic images. Their network was pre-trained on the ImageNet database [10] of 1.2 million natural images. Subsequently, using the obtained weights as initial values, the model was finetuned to classify white-light cystoscopic images as cancerous or normal. Their training data consisted of 1336 normal and 344 tumor images, but the tumor images were augmented network training, and the measurement setup. The results of the evaluation are presented in Section 3 and discussed in Section 4. Finally, the paper is concluded in Section 5.

Materials and Methods
The task is to produce a high-precision semantic segmentation map that locates bladder cancer as well as other lesions and artifacts on white-light cystoscopic images. The obtained segmentation map can be overlaid on the original image ( Figure 1) to assist the transurethral resection of bladder tumors in real time. In this section, we present the DNN models to be evaluated (Section 2.1), the image processing pipeline (Section 2.2), the network training (Section 2.3), and the measurement setup that was used for the evaluations (Section 2.4).
(a) input image (b) output image Figure 1. An example of a segmentation outline overlay. The input image shows a grade 2 bladder tumor that has been well localized by the DeepLabv3+ [29] neural network.

The Deep Neural Networks to Be Compared
In the field of image analysis, convolutional neural networks (CNNs) are the most widely used deep learning solutions [30]. They take advantage of the spatial arrangement of images, i.e., the fact that information is usually local in images [5], so it may be sufficient to examine only a part of the image content at a time. CNNs do this by running convolutional filters over the input image, which share the learnable weights between image parts, thereby significantly reducing the number of parameters to be optimized. These networks are able to model complex hierarchical functions and train deep models efficiently [31].
An encoder-decoder architecture is a special type of model design that consists of two parts: an encoder that compresses the input into a latent space vector, and a decoder that predicts the output from this intermediate representation. Encoder-decoder CNNs are particularly suited for the task of image segmentation, as they are able to model the complex underlying features of the input image in latent space and then produce an output image of the same size as the input, i.e., a segmentation map containing the classification of each pixel. In the following, we review the encoder-decoder CNNs that we evaluate in this paper for cystoscopy image segmentation. U-Net [22] is a convolutional neural network designed specifically for biomedical image segmentation. It gets its name from the fact that the model consists of a downsampling part (encoder) and an upsampling part (decoder), which form a U-shaped architecture. The encoder of U-Net applies successive 3 × 3 convolutions, each followed by ReLU activation [32] and maximum pooling. While the spatial resolution is reduced, the feature dimension of the image is simultaneously increased to aggregate high-level semantic information. The resulting latent representation is then upsampled in the decoder to obtain a precise segmentation map. To improve performance, skip connections are introduced between encoder blocks and their corresponding decoder parts. This helps to propagate information from early layers that still retain fine input details to deeper layers that lose these details due to successive downsampling.
UNet++ [23] is a new variant of the U-Net architecture. It is based on the hypothesis that when the encoder and decoder feature maps are semantically similar, the optimizer has an easier optimization problem to solve. Therefore, skip connections are redesigned to include a dense convolution block in which the number of convolutional layers depends on the pyramid level. Basically, this dense convolution block brings the semantic level of the feature maps of the encoder closer to that of the decoder. UNet++ was reported to give better accuracy than U-Net in multiple medical image segmentation tasks, such as nuclei segmentation in microscopy images, liver segmentation in abdominal CT scans, and polyp segmentation in colonoscopy videos [23].
MA-Net (Multi-scale Attention Network) [24] is a variant of U-Net that introduces self-attention to its architecture. It uses two blocks: the Position-wise Attention Block (PAB) and the Multi-scale Fusion Attention Block (MFAB). The PAB extracts spatial dependencies between pixels by a self-attention mechanism, while the MFAB captures channel dependencies between any feature map by attention-based multi-scale semantic feature fusion. MA-Net was originally designed for liver and tumor segmentation and has been shown to perform superiorly to U-Net and UNet++ in the segmentation of liver CT scans [24].
LinkNet [25] is an encoder-decoder type CNN that aims to achieve not only good segmentation accuracy but also a small size, i.e., few learnable parameters. This is obtained by introducing residual skip connections that bypass information from the input of each encoder block to the output of its corresponding decoder block. Unlike U-Net skip connections, where layers are concatenated, here, the features are summed, which does not increase the number of input channels and thus the number of parameters needed in the following decoding blocks [33]. This results in a small and efficient neural network.
FPN (Feature Pyramid Network) [26] is a feature extractor that builds high-level semantic feature maps at multiple scales, i.e., a feature pyramid [34]. This process does not depend on the backbone CNN, so it can be considered a generic solution. The construction of the feature pyramid involves a bottom-up pathway and a top-down pathway. The bottom-up pathway is the feed-forward computation of the backbone CNN, which creates the feature map hierarchy with a scaling step of 2. The top-down pathway downpropagates the lower spatial resolution but semantically higher level information from the upper levels of the pyramid. The features of the two pathways are fused via lateral connections. When the task is image segmentation, the different levels of the feature pyramid are used for predicting masks at different scales.
PAN (Pyramid Attention Network) [27] is a convolutional neural network designed to exploit global contextual information in semantic segmentation. First, a feature pyramid is generated using ResNet-101 [12] (encoding). Then, a Feature Pyramid Attention (FPA) module is used to gather dense pixel-level attention information. FPA is a center block between the encoder and decoder that fuses features from under three different pyramid scales and combines them with global average pooling to improve performance. Finally, a Global Attention Upsample (GAU) module on each decoder layer provides high-level features as guidance for low-level information to generate the final segmentation map.
DeepLabv3 [28] is an improvement upon the DeepLabv2 semantic segmentation architecture [35] (which itself can be traced back to the original DeepLab model [36]). To address the problem of segmenting objects at multiple scales, modules are designed to perform atrous convolutions in cascade or in parallel, capturing multi-scale context using multiple atrous rates. In addition, the Atrous Spatial Pyramid Pooling (ASPP) module of DeepLabv2 has been augmented with image-level features that encode global context and increase performance. Unlike DeepLabv2, the new model no longer needs DenseCRF post-processing to segment objects precisely.
DeepLabv3+ [29] extends DeepLabv3 by a simple yet effective decoder module that improves segmentation results, especially at object boundaries. This architecture is the first to combine spatial pyramid pooling with the encoder-decoder structure. Furthermore, inspired by Xception [37], depthwise separable convolution is applied to both the ASPP module and the decoder, resulting in an even faster and higher-precision model.

The Image Processing Pipeline
The goal is real-time assistance in the transurethral resection of bladder tumors. The camera stream from the cystoscope is usually of poor quality, which can hinder the performance of the neural network and the operating urologist. Therefore, we execute image enhancement as a pre-processing step, which may be configured or turned off by the urologist. After the neural network predicts the segmentation map, the outline polygons of the detected objects are extracted and overlaid on the original image to draw the doctor's attention to possible lesions. Overall, our image processing pipeline consists of three phases:

2.
Segmentation: a deep neural network predicts the segmentation map.

3.
Post-processing: outline polygons are extracted from the segmentation map and overlaid on the original image to show doctors the detected lesions.
A flowchart of the segmentation pipeline is displayed by Figure 2.

Pre-Processing
Cystoscopic images may be blurred and have low contrast depending on the actual cystoscopic camera and the circumstances of the image capture. To obtain a solution that is robust to these phenomena, image pre-processing is needed during the training and recognition, which prepares the input images to allow the neural network for efficient lesion detection. For this reason, noise reduction and contrast enhancement have been incorporated into the training and detection pipelines. Noise reduction is performed by guided filtering [38], since it provides high-quality edge-preserving smoothing at real-time rates. The guide for filtering was the input image itself. Contrast enhancement is achieved by unsharp masking [39,40], the essence of which is to sharpen the input image by adding its estimated high-frequency components, i.e., its unsharp mask to itself: where R is the output image, P is the input image, λ is the degree of enhancement, and U is the unsharp mask, which is obtained by subtracting the low-pass filtered version of the input image from itself, i.e., U = P − P filtered . In our solution, the low-pass filtered version could simply be set to the output of the guided filter. In this case, however, the color channels would be sharpened separately, causing the hues of the pixels to change, which results in color noise. To preserve the hues of the pixels, we apply unsharp masking only to the luminance channel, i.e., our enhancement formula is where lum(P i ) is the luminance of pixel P i . For simple calculations, we approximate the luminance as the average of the red, green, and blue intensities.
In the last stage of pre-processing, the enhanced image is cropped to remove any margins and then resized to 160 × 160 pixels, which was found to be the optimal input size for the neural network. Afterwards, data are standardized to have a mean of zero and a standard deviation of one, which improves the stability and accuracy of the predictions. Figure 3 displays the output of pre-processing using different guided filter and unsharp masking parameters. In our system, both the guided filter radius r and the unsharp masking strength λ are adjustable real numbers between 0 and 5 (0: no effect, 5: strong effect, i.e., strong blurring in the case of filtering or strong sharpening in the case of unsharp masking).

Segmentation
The pre-processed image is fed into a deep neural network that predicts the class of each pixel. Eight models were compared for the segmentation task at hand, which are listed in Section 2.1. All models used the MobileNetV3-Large network [41] as an encoder, which is an efficient, lightweight CNN designed to perform well with limited computing resources or strict time constraints. Since we aim to provide a real-time tool to assist bladder tumor resection, the small size and low latency of the model are key factors.
A total of 7 classes are distinguished in the image: Ureteric orifice; 6.
The output of the neural network is a seven-element vector for each pixel of the input image, where the ith element corresponds to the confidence level of the network in that the given pixel belongs to the ith class. Pixels are classified in the category with the highest confidence level.

Post-Processing
In the post-processing phase, the polygon outlines of the detected lesions are extracted from the segmentation map and drawn on the original cystoscopic image ( Figure 1). The color of the polygon indicates the class of the lesion; e.g., bladder cancer is marked by cyan, while artifacts are marked by green.
To obtain the outlines of the detected lesions, the boundaries of the homogeneous regions of the segmentation map must be determined. A pixel is at a boundary if one of its neighbors belongs to a different class than itself. However, merely collecting the boundary pixels can result in a noisy and complicated outline. To simplify the boundary, pixels where the curvature would be too high are removed. The easiest way to achieve this is to count how many pixels belong to another class in the eight-neighborhood of the given pixel, and if we obtain a number greater than four, the pixel is removed. Note that this procedure reduces the area of the polygon; therefore, the number of its repeated applications should be limited.
The polygon outlines of the regions of a given class are formed according to the following iterative algorithm. First, an arbitrary pixel is selected from the boundaries of that class, i.e., we find a pixel that belongs to the given class but at least one of its neighbors does not. In each subsequent step, we search for the next boundary pixel until we return to the starting point, thus closing the loop. During the search, we first examine the immediate neighborhood of the currently selected pixel, beginning with the direction taken in the previous step. If we find a boundary pixel of the same class, the pixel is added to the polygon as a vertex, and is also removed from the segmentation map so that it is not re-selected later. If no other boundary pixels are found in the immediate neighborhood of the current one, which happens if there is a loop in the outline, we start examining the pixels that are farther away, i.e., at a distance of two pixels. For performance reasons, the search distance is maximized to five pixels. If there are no other boundary pixels in the five-pixel-wide neighborhood, the polygon is considered to be noise and discarded.
The presented algorithm produces a single, closed outline. In cases where multiple regions belong to the given class, the procedure should be repeated until all boundary pixels are removed.
Prediction errors can result in unrealistically small polygons that need to be identified and discarded. We calculate the area of the polygon of vertices r 1 , . . . , r n as where r n+1 = r 1 . If the resulting area A is small, e.g., smaller than 0.25% of the total area of the image (64 pixels in the case of a 160 × 160 pixel input), then the polygon is discarded.

Network Training
To improve segmentation accuracy, transfer learning [42] is used. That is, the neural networks are pre-trained on the ImageNet database [10] before being trained on cystoscopic images. This way, the networks learn a general model of the visual world that can facilitate the interpretation of cystoscopic images.
There are relatively few samples in our dataset for three of the seven classes (ureteric orifice, TURBT scar, BCG cystitis), as a result of which the networks have difficulty learning them. To address this class imbalance, we use focal loss [43], which is a modification of the standard cross-entropy loss that focuses training on hard samples. The focal loss dynamically scales the cross-entropy loss by a scaling factor that decays to zero as confidence in the correct class increases. This scaling factor automatically downweights the contribution of easy samples during training and focuses the model on samples it finds harder to classify.
To present the formal definition of the focal loss, let us start from the cross-entropy (CE) loss for binary classification (For simplicity, we only discuss binary classification, but extension to the multiclass case is straightforward [43]): where y ∈ {±1} denotes the ground truth class and p ∈ [0, 1] is the probability estimated for the class with label y = 1. To simplify the notation, let us introduce Focal loss adds a modulating factor (1 − p t ) γ to the cross-entropy loss, where γ ≥ 0 is a tunable focusing parameter. That is, focal loss is defined as It can be observed that as the confidence in the correct class increases, i.e., p t → 1, the modulating factor converges to 0, and the loss for well-classified samples is downweighted. In contrast, if a sample is misclassified and p t is small, the factor is approximately equal to 1, and thus, the loss is unaffected.
The focal loss has an α-balanced version, which was reported to give slightly better results than the original form [43]. In this version, a hyperparameter α ∈ [0, 1] is introduced as a weighting factor for the positive class, while (1 − α) is used to weight the negative class. Defining α t analogously to p t , the α-balanced version of the focal loss is calculated as In our solution, the α for a given class is inversely proportional to the relative frequency of occurrence of the class in the training set. That is, the more pixels are labeled by that class in the training set, the lower value its α weight will have.

Data
We used a private dataset consisting of 2578 cystoscopy images manually annotated by medical professionals. Data were split into train, validation, and test sets with a ratio of 90:5:5, i.e., 2320 of the images were put in the train set, 129 were put in the validation set, and 129 were put in the test set. The number of image samples was increased by augmenting the dataset with the two pre-processing techniques presented in Section 2.2.1: noise reduction by guided filtering and contrast enhancement by unsharp masking. In total, each image was saved in four different versions (Figure 4):
Enhanced: both guided filter and unsharp masking.
This way, the number of image samples was quadrupled, so the total training set consisted of 9280 images, the validation set consisted of 516 images, and the test set consisted of 516 images.  . Four different pre-processing outputs of the same input image. In (a), the image was only cropped. In (b), guided filtering was applied to reduce noise, with radius r set to 5. In (c), unsharp masking was applied to sharpen the image, with strength λ set to 5. In (d), both guided filtering and unsharp masking were applied (r = 5, λ = 5).
The training set was also augmented on-the-fly using the following transformations: These augmentations were different in every epoch; therefore, the network was never trained on the exact same image twice.

Implementation
The neural networks were implemented in Python, using the PyTorch framework [45] and the Segmentation Models library [46]. Training was performed for 1000 epochs by the Adam optimizer [47] with a learning rate of 7 · 10 −5 and a batch size of 64. The loss function was the α-balanced focal loss described in Section 2.3, with γ = 2 and the α of each class pre-calculated from the statistics of the dataset. The parameters of the segmentation models were not changed from the default settings of the Segmentation Models library. All calculations were run on an NVIDIA GeForce RTX 2080 SUPER GPU [48].

Evaluation Metrics
Per-pixel Dice coefficient The segmentation maps generated by the neural networks were evaluated using the Dice similarity coefficient [49]. Given the set T of pixels belonging to a given class c, and the set P of pixels predicted by the network to belong to class c, the Dice coefficient is calculated as where | · | denotes the cardinality of a set and ∩ denotes the intersection of two sets. This calculation is performed for each class, and the results are averaged. The higher the Dice coefficient, the closer the predicted segmentation map is to the real one, i.e., the better the localization accuracy of the lesions is.

Per-image F-score
The main hazards are producing a false negative or false positive classification for the input image. For example, if an image is declared cancer-free while in truth it shows a cancerous region, the patient may go undiagnosed and not receive the necessary treatment. Conversely, if an image is declared to show cancer when it is in truth either fully normal or has only benign features, the patient may receive unnecessary treatment, which is expensive and could pose health risks. Therefore, we group classes as 1.
Healthy: normal, ureteric orifice, air bubble, image artifact; and counted images in the following categories: From these categories, the following metrics can be derived to describe the per-image classification accuracy:

1.
Precision: the fraction of patients with a true lesion (malignant or benign) among those who were identified to have the lesion:

2.
Recall (= Sensitivity): the fraction of patients identified with a lesion (malignant or benign) among those who truly have a lesion: 3. F-score: a measure combining the precision and recall of the classification system. It is defined as the harmonic mean of the model's precision and recall: .
The primary objective is to maximize the F-score for the malignant category, as a misclassification of bladder cancer can have serious consequences. In contrast, benign features usually do not require treatment; at most, they are monitored for a certain period of time. Therefore, the classification accuracy of benign features is only secondary. The per-pixel localization accuracy (i.e., the Dice coefficient) serves as a tertiary aspect.

Per-Pixel Dice Coefficient
The Dice coefficient was evaluated for each image in the test set. Table 1 presents the mean Dice scores averaged on the test images. The results show that all models were able to achieve a 91-92% mean Dice score, which is very promising, especially given that our dataset was relatively poorly balanced. The highest mean Dice score (92.91%) was obtained by the PAN [27] architecture.  Figure 5 displays the segmentation maps generated by the different models for a grade 2 papillary bladder cancer. All models were able to detect the tumor, although the resulting shapes were generally simpler and more ellipsoidal than the zigzag-edged ground truth mask. The highest Dice score for this image (98.67%) was achieved by the U-Net model.  Tables 2 and 3 present the per-image classification accuracy of the trained models evaluated on the test set. The highest F-score for the malignant category was achieved by the PAN model, which was closely followed by the U-Net and the UNet++ networks. For the benign category, the highest F-score was achieved by DeepLabv3, which was closely followed by PAN.

Per-Image F-Score
F-scores were also evaluated separately on the image subsets of the four different preprocessed versions (original, filtered, sharpened, enhanced) in order to investigate which pre-processing method improves the classification accuracy the most. This measurement was performed using the PAN model. Results are shown in Tables 4 and 5. It can be observed that the enhanced pre-processing method leads to the best lesion detection ability for both the malignant and benign categories.

Speed
The CNN should predict segmentations fast enough to be applied in real time on video streams during the transurethral resection of bladder tumors. To see if this is accomplished, we have measured the time required to calculate a single prediction, averaged from 100,000 runs on a NVIDIA GeForce RTX 2080 SUPER GPU [48]. The frame loading time or the time needed to fetch the image from the camera is not included, since it depends on the actual format, disk, and camera used. Measurement results are presented in Table 6.
The highest prediction speed (6.72 ms) was achieved by the DeepLabv3+ model. Note that even the slowest model (MA-Net) took less than 9 ms to make a prediction.
The pre-and post-processing of images requires some additional time. Guided filtering takes on average 2 ms and unsharp masking takes another 0.02 ms to perform on an NVIDIA GeForce GTX 960M GPU [50]. The extraction of polygon contours from the segmentation map takes an average of 0.06 ms, but the exact speed depends on the number of detected lesions and the complexity of their contours. Note that even with pre-and post-processing, a frame rate of 93-114 fps is attainable.

Model Size
The neural network is preferred to have a small size so that it does not occupy much storage space, it is fast to load (requires low bandwidth and has low memory latency), and it does not pose a problem to keep the model in the RAM or cache. Table 7 summarizes the sizes of the trained models. It can be observed that the model sizes range from 7 to 69 MB, which should pose no problem to store and utilize even on a clinical infrastructure with limited computing resources. Furthermore, the PAN model that achieved the highest prediction performance occupies only 7.93 MB, which is practically the smallest size of all. A summary of the evaluation results is displayed in Table 8.

Discussion
The segmentation of cystoscopic images is challenging for several reasons. Firstly, the images are often blurry and have low contrast due to difficult shooting conditions. Secondly, the appearance of bladder lesions shows a large variance, even within a single class. Thirdly, for CNNs that generally work with fixed receptive fields, the presence of objects at multiple scales causes difficulties in classification [27,28]. This cannot be avoided in cystoscopy, as lesions have different physical sizes and, in addition, their dimensions in image space depend on the current position and orientation of the cystoscope.
Image quality can be improved with appropriate pre-processing, while the problem of high variance within classes can be mitigated by providing a sufficiently large and varied training dataset. However, the effective recognition of objects at multiple scales must be achieved by the neural network architecture.
U-Net, UNet++, and LinkNet do not give a direct solution for multi-scale object segmentation. Nevertheless, information about objects at different scales is implicitly encoded at the different levels of the U-shaped architecture. Skip connections also help deliver this information to the decoder where segmentation takes place.
U-Net/UNet++ and LinkNet differ in the way their decoder blocks combine the features coming from the skip connection and the previous decoder block. In U-Net/UNet++, these two feature maps are concatenated, while in LinkNet, they are summed to reduce the number of parameters. Our results show that it is not always advantageous to save on the parameters: both the per-image classification accuracy and the per-pixel localization accuracy of LinkNet were significantly lower than that of the U-Net models. This may indicate that semantic information can be lost during the summation of feature maps.
A technique widely used in segmentation models to achieve multi-scale object recognition is the feature pyramid method. Image features are extracted in a pyramidal fashion at incremental scales, and different levels are used to predict masks at different scales. In gen-eral, the lower levels of the pyramid carry semantically less accurate, but spatially higher resolution information, whereas the higher levels of the pyramid carry semantically more accurate but spatially lower resolution information. In order to combine the advantages of the two types, it is common to merge the features of different pyramid levels.
Among the neural networks we investigated, FPN, PAN, DeepLabv3, and DeepLabv3+ use feature pyramids to provide multi-scale object detection capability. However, their methods of creating the feature pyramid differ.
DeepLabv3 and DeepLabv3+ use dilated convolutions with an increasing dilation rate to generate the feature pyramid. MA-Net also uses dilated convolutions to obtain multi-scale context. However, this type of convolution has the inherent problem that when the dilation is large, the convolutional kernel becomes too sparse to cover any local information [51]. This may impair the local consistency of feature maps [27].
FPN uses simple convolutions and spatial upscaling to generate the feature pyramid, which has a total of four levels. The final segmentation map is then obtained as the sum of all pyramid level predictions, which are weighted by 1/4, 1/8, 1/16, and 1/32, respectively. However, FPN did not perform well in our evaluations compared to the other models. We argue that this is due to the fixed weights assigned to the pyramid levels. Depending on how close the cystoscopic camera is to the examined lesions, different object scales, i.e., different pyramid levels may provide the most important semantic information.
PAN uses a pre-trained ResNet-101 to create the feature pyramid and then applies a Feature Pyramid Attention (FPA) module to fuse the features of different levels. FPA has a U-shaped structure that integrates attention-inspired information from different scales stepby-step in a precise and efficient way. In addition, PAN uses global context information to guide segmentation via Global Attention Upsampling modules. This increases the receptive field of the model and enhances the consistency of pixel-wise classification.
We believe that PAN was able to achieve the best accuracy in cystoscopic image segmentation because it efficiently combines the feature pyramid method, the attention mechanism, and also makes use of global context information. However, these operations entail a significant computational cost, making PAN one of our slowest networks. In contrast, the dilated convolutions used by the DeepLabv3 models are very fast to compute.
The 92.91% Dice score achieved by our PAN model indicates better tumor localization ability than reported in previous studies. As listed in Section 1, the attention mechanismbased U-Net of Zhang et al. [19] achieved a Dice score of 82.5%, the mask region-based CNN of Yoo et al. [6] obtained a score of 74.7%, while the CystoNet of Shkolyar et al. [4] did not report the Dice score result, but their per-frame sensitivity of 90.9% is lower than the 94.79% of our PAN model. This improvement in accuracy may be the result of more appropriate pre-processing of the input image, a better choice of neural network architecture, or a difference in the dataset.
In contrast to previous studies, we distinguish several types of lesions in addition to bladder cancer. Results show that the neural network can detect and localize them all with good accuracy. This can not only reduce the chance of any abnormalities going unnoticed but may also help urologists differentiate between benign and malignant lesions.
We also observed that the proposed pre-processing techniques (noise reduction, contrast enhancement) improved the accuracy of the subsequent segmentation step compared to the case when the raw input image is fed into the neural network. In the future, we hope to investigate more image enhancement algorithms to see if further improvements can be made. Moreover, with the rapid development of segmentation neural networks, we believe that architectures more efficient than the current ones may appear within a few years, which should be examined for cystoscopy.
We consider the greatest advantages of our solution to be the advanced and configurable image enhancement framework as well as the possibility of segmenting multiple types of bladder lesions. Disadvantages include the scaling of the input images to a fixed size of 160 × 160 pixels as well as the fact that it is not possible to know in advance what settings should be used for the pre-processing of the individual image frames in order to achieve the best possible segmentation accuracy.

Conclusions
In this paper, we examined a deep learning-based solution for the automatic segmentation of cystoscopy images. As a pre-processing, noise was reduced by guided filtering, and contrast enhancement was performed by unsharp masking. Then, the image was fed into a CNN that predicted a high-accuracy segmentation map, marking bladder cancer, two types of non-cancerous lesions, and four types of healthy tissue. This map was post-processed to find the boundary polygons and overlay the input image with their outlines, assisting doctors in cancer diagnosis and treatment. We examined eight popular neural network architectures for the task at hand, and we found that the PAN model achieves the best results with a mean Dice score of 92.91% and a malignant F-score of 91%. The small model size and the few millisecond prediction time of the neural network allow the solution to be applied in real time on video streams during the transurethral resection of bladder tumors.