Trichomonas vaginalis Detection Using Two Convolutional Neural Networks with Encoder-Decoder Architecture

Abstract: Diagnosis of Trichomonas vaginalis infection is one of the most important factors in the routine examination of leucorrhea. According to the motion characteristics of Trichomonas vaginalis, a viable detection method is the use of a microscopic camera to record videos of leucorrhea samples and video object detection algorithms for detection. Most Trichomonas vaginalis is defocused and displays as shadow regions on microscopic images, and it is hard to recognize the movement of shadow regions using traditional video object detection algorithms. In order to solve this problem, we propose two convolutional neural networks based on an encoder-decoder architecture. The first network has the ability to learn the difference between frames and utilizes the image and optical flow information of three consecutive frames as the input to perform rough detection. The second network corrects the coarse contours and uses the image information and the rough detection result of the current frame as the input to perform fine detection. With these two networks applied, the metric value of the mean intersection over union of Trichomonas vaginalis achieves 72.09% on test videos. The proposed networks can effectively detect defocused Trichomonas vaginalis and suppress false alarms caused by the motion of formed elements or impurities.


Introduction
Diagnosis of Trichomonas vaginalis (TV) infection is one of the most important factors in the routine examination of leucorrhea. The traditional manual microscopic examination method has the advantage of high detection rates, but its low efficiency means it cannot meet the need for daily examination of a large number of leucorrhea samples. Therefore, fully automated leucorrhea examination equipment with an intelligent algorithm for TV detection is urgently needed. Staining TV is a commonly used method of leucorrhea sample pretreatment. The advantages of this method are that the contours of TV after staining are clear and that TV is easy to distinguish from other formed elements or impurities. However, the staining process is complicated and time-consuming, so it is not suitable for integration into fully automated leucorrhea examination equipment. According to the motion characteristics of TV, a feasible detection method is to use a microscopic camera to record videos of leucorrhea samples and to apply video object detection algorithms to identify TV [1,2].
Traditional video object detection algorithms include frame difference methods, background difference methods, and optical flow methods. Frame difference methods determine the moving foreground object by comparing the difference between adjacent frames or three frames [3,4]. Background difference methods utilize the image information of previous frames of the video to establish a background model and then judge the foreground or background by comparing the difference between the current frame and the background model [5,6]. The background model is updated according to foreground and background results of the current frame. Optical flow methods match the pixels in adjacent frames to obtain the motion direction and step of each pixel in the current frame [7].
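To make the threshold dependence of frame difference methods concrete, the following is a minimal sketch of one common three-frame difference variant, not the exact algorithm of [3,4]. The function name, the AND combination of the two difference images, and the threshold values are assumptions chosen for illustration.

```python
import numpy as np

def three_frame_difference(prev_gray, cur_gray, next_gray, threshold=15):
    """Mark a pixel as foreground if it differs from both neighboring frames.

    The AND combination and the default threshold are illustrative assumptions.
    """
    cur = cur_gray.astype(np.int32)
    d1 = np.abs(cur - prev_gray.astype(np.int32)) > threshold
    d2 = np.abs(next_gray.astype(np.int32) - cur) > threshold
    return d1 & d2

# A faint object (intensity change of 10) that appears only in the middle frame.
prev_f = np.zeros((5, 5), dtype=np.uint8)
next_f = np.zeros((5, 5), dtype=np.uint8)
cur_f = np.zeros((5, 5), dtype=np.uint8)
cur_f[2, 2] = 10

missed = three_frame_difference(prev_f, cur_f, next_f, threshold=15)
found = three_frame_difference(prev_f, cur_f, next_f, threshold=5)
```

With the higher threshold the faint object is missed, and with the lower one it is found. On real images a lower threshold also admits more background noise, which is the trade-off that matters for low-contrast moving regions.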
Video object detection algorithms based on deep learning mainly include flow-based methods [8][9][10]. The principle of these methods is that in the feature extraction stage (encoder), only the feature maps of the key frames in the video are extracted. For non-key frames, the feature maps are generated by the feature maps of the key frames via the optical flow field [8,10]. In the classifier stage (decoder), the feature map of a single frame [8] or feature aggregation information of multiple frames [9] is used as the input to detect the moving object in the current frame. These flow-based methods improve the detection speed while ensuring the detection rate.
In the vertical direction of the microscope stage, TV is located in different planes from the formed elements or impurities, due to its motion characteristics. Figure 1 shows one frame of a video with clear formed elements (epithelial cells and white blood cells). There are nine TV in this frame. The TV and some background regions are marked with red and green rectangles, respectively. Some enlarged image regions are shown in Figure 1a-d. In the vertical direction, the TV in (b) is close to the current plane and its outline can still be observed; the TV in (a) and (c) are far from the current plane and appear as shadow regions in the image. By observing successive frames of the video, it can be seen that (d) is not TV but a defocused formed element or impurity. It is not possible to accurately determine whether a defocused shadow region is TV or background from a single frame. For frame difference methods or background difference methods, the foreground judgment threshold must be lowered to recognize the shadow regions where the defocused TV is located. However, some background regions will then be mistakenly detected as foreground, leading to false alarms. Manual identification of defocused TV relies mainly on the characteristics of the shadow area and the continuous movement of TV. Therefore, image and optical flow information jointly determine the features of defocused TV. In flow-based methods [8][9][10], optical flow is mainly used for propagating feature maps between key frames and non-key frames, rather than as input information. The extracted features therefore contain only image information, and the trained network may mistakenly identify some stationary shadow regions in the background as TV.
In order to solve the above problems, we propose two convolutional neural networks based on an encoder-decoder architecture. The first network has the ability to learn the difference between frames and utilizes the image and optical flow information of three consecutive frames as the input to perform rough detection of the TV. The second network corrects the coarse contours and uses the image information and the rough detection result of the current frame as the input to perform fine detection of the TV. By combining these two networks, the defocused TV can be detected effectively and the false alarms caused by the motion of formed elements or impurities can be suppressed.
This article is organized as follows: Section 2 introduces our previous works on TV detection. The details of two convolutional neural networks with encoder-decoder architecture are described in Section 3. Section 4 introduces the dataset we used and the experimental results. Section 5 presents the discussion. Conclusions are provided in Section 6.

Related Works
At present, the detection of TV is mostly focused on biochemical and staining detection methods, whereas detection based on images or videos is used less. In our previous works, we proposed two TV detection methods based on a traditional background difference model.
In [1], we proposed an improved Kalman background reconstruction algorithm to detect TV automatically. The first frame of a video is used to build the background, and an additional top-hat transformation eliminates the phenomena of tailing and ghosting. By introducing time information and neighborhood judgment into the background updating mechanism, this method can deal with the problems of falsely detected static areas and missing motion areas. The results show that this method can effectively suppress the noise caused by illumination mutation, lens shift, and focal length variation, providing strong adaptability and good robustness.
When the movement speed of the moving target is slow or the movement frequency is low, the performance of this Kalman background reconstruction algorithm can decline, resulting in a high omission ratio. In order to address the above limitations, we proposed an improved VIBE background reconstruction algorithm [2]. The background model adopts three main update strategies: the memoryless update, the time subsampling of the model and the update of the spatial domain. In order to simplify the judgment, the foreground image is extracted by the frame difference method. Similarly, time information is introduced to eliminate false alarms from impurities or formed elements. This improved method can effectively suppress false alarms caused by formed elements and missed detections caused by the background model updating during the movement.
TV is defocused due to its motion characteristics, and in most cases it appears as flat shadow regions with little difference between adjacent frames. To detect moving shadow regions, it is necessary to reduce the judgment threshold of the foreground, but this can result in false alarms where some background regions are detected as TV. The above two methods mainly focus on the recognition of clear TV but fail under defocused conditions. Therefore, a detection method based on deep learning is proposed in this paper to solve the above problems.

Convolutional Neural Network Based on Encoder-Decoder Architecture for Rough Detection
A single frame cannot effectively distinguish TV from backgrounds, so the first convolutional neural network for rough detection needs to have the ability to learn the differences between adjacent frames. Dosovitskiy et al. [11] proposed two encoder-decoder network architectures (FlowNetSimple and FlowNetCorr) to calculate the optical flow between adjacent frames by deep learning methods. The calculation of optical flow only depends on the difference between frames rather than the image content of one single frame. This property is appropriate for the defocused TV detection problem, so the first convolutional neural network we propose uses an encoder-decoder architecture similar to FlowNetSimple [11]. Figure 2 shows the architecture of the rough detection network. The encoder and decoder are shown in the red dashed box on the left and the green dashed box on the right, respectively.

Encoder
We concatenated the current frame and its preceding and following frames (RGB images, 3 channels each) with the optical flow from the previous frame to the current frame and from the current frame to the next frame (pixel movement in the x and y directions, 2 channels each) along the channel axis, then fed the result into the encoder. The input shape was 512 × 512 × 13. Through image and optical flow information, the rough detection network can learn the motion and morphological features of the TV, preventing moving formed elements or impurities from being mistakenly detected. In order to reduce the missed detection of TV, we use information from three frames rather than two. Unlike FlowNetSimple [11], we use the layers 'conv1_1' to 'conv5_3' from VGG-16 [12] as the basic architecture of the encoder. The weights of 'conv1_2' to 'conv5_3' are initialized from the ImageNet pre-trained model. The stride of the 'pool4' layer is set to (1, 1) and the 'conv5_1' to 'conv5_3' layers use dilated convolution with a dilation rate of (2, 2). The optical flow information between adjacent frames is calculated by the deep learning method proposed in [13].
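The channel count of the encoder input follows from 3 × 3 image channels plus 2 × 2 flow channels. The sketch below assembles such an input with NumPy. The function name and the channel order (frames first, then flows) are assumptions, since the text does not specify the order.

```python
import numpy as np

def build_rough_input(prev_rgb, cur_rgb, next_rgb, flow_prev_to_cur, flow_cur_to_next):
    """Stack three RGB frames (3 channels each) and two flow fields (2 channels each).

    The channel order used here is an assumption made for illustration.
    """
    parts = [prev_rgb, cur_rgb, next_rgb, flow_prev_to_cur, flow_cur_to_next]
    return np.concatenate(parts, axis=-1).astype(np.float32)

# Dummy 512 x 512 patches stand in for real frames and flow fields.
frames = [np.zeros((512, 512, 3), dtype=np.float32) for _ in range(3)]
flows = [np.zeros((512, 512, 2), dtype=np.float32) for _ in range(2)]
x = build_rough_input(frames[0], frames[1], frames[2], flows[0], flows[1])
print(x.shape)  # (512, 512, 13)
```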

Decoder
Based on the decoder architecture of FlowNetSimple [11], we added an attention block with a squeeze-and-excitation (SE) module [14] before each network output. As shown in Figure 3, the SE module contains spatial and channel attention modules, which enable the network to learn 'what' and 'where' to attend to along the channel and spatial axes, respectively. The spatial attention module generates a spatial attention map by utilizing the inter-spatial relationship of the features. The input x (H × W × C_1) passes through a convolution (kernel: [1,1], stride: [1,1], channels: 1) to obtain x_s (H × W × 1). A simple gating mechanism with a sigmoid activation is then applied to x_s to obtain x_s' (H × W × 1). The spatial attention map x_spatial (H × W × C_1) is generated by spatial-wise multiplication between x_s' and the input x. The channel attention module selectively enhances useful features, suppresses invalid ones and produces a channel attention map. First, x_c (1 × 1 × C_1) is generated by applying global average pooling to the input x. A full convolution (channels: C_3, C_3 = C_1/4) followed by ReLU is applied to x_c to obtain x_c' (1 × 1 × C_3). A second full convolution (channels: C_1) followed by a sigmoid activation is then applied to x_c' to obtain x_c'' (1 × 1 × C_1). The channel attention map x_channel (H × W × C_1) is generated by channel-wise multiplication between x_c'' and the input x. To obtain the output of the attention block, the two attention maps are added together and then passed successively through a convolution (kernel: [3,3], stride: [1,1]), batch normalization and ReLU.
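The two attention branches can be written compactly as array operations. The following NumPy sketch covers only the spatial and channel gating and their sum. It omits biases and the trailing 3 × 3 convolution, batch normalization and ReLU, and it treats each 1 × 1 convolution as a matrix product. The function and weight names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def attention_maps(x, w_spatial, w_reduce, w_expand):
    """Sum of spatial and channel attention maps for one feature map.

    x: (H, W, C1); w_spatial: (C1, 1); w_reduce: (C1, C3); w_expand: (C3, C1).
    Biases are omitted, which is a simplifying assumption.
    """
    # Spatial branch: 1x1 convolution to one channel, sigmoid gate, multiply with x.
    gate_s = sigmoid(x @ w_spatial)               # (H, W, 1)
    x_spatial = x * gate_s                        # (H, W, C1)
    # Channel branch: global average pooling, reduce + ReLU, expand + sigmoid.
    pooled = x.mean(axis=(0, 1))                  # (C1,)
    hidden = np.maximum(pooled @ w_reduce, 0.0)   # (C3,)
    gate_c = sigmoid(hidden @ w_expand)           # (C1,)
    x_channel = x * gate_c                        # (H, W, C1)
    return x_spatial + x_channel

rng = np.random.default_rng(0)
c1, c3 = 8, 2  # C3 = C1 / 4
feat = rng.normal(size=(16, 16, c1))
out = attention_maps(feat, rng.normal(size=(c1, 1)),
                     rng.normal(size=(c1, c3)), rng.normal(size=(c3, c1)))
```

With all weights set to zero, both gates equal 0.5 and the block returns its input unchanged, which is a convenient sanity check.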

Training
For the training phase, Adam [15] was selected as the optimizer. The learning rate was set to 10^−5 and focal loss [16] (gamma: 2.0, alpha: 0.7) was the loss function. The loss weights for 'output1' to 'output5' were 1.0, 0.8, 0.8, 0.6, and 0.6. The images and optical flow were rescaled to 1536 × 1024 and regions with a fixed size of 512 × 512 were randomly cropped. The data augmentation methods included horizontal and vertical flip, rotation within [−5°, 5°], translation within [−10, 10] for x and y, and scaling within [0.8, 1.2]. The attention blocks introduce a large number of trainable parameters, which makes the network difficult to train. To solve this problem, we first trained the network without the attention mechanism, and then used the model that performed best on the validation set as the pretrained model for transfer learning. Finally, we added the attention blocks before the five outputs and fine-tuned the network.
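As an illustration of the loss configuration, the sketch below evaluates a per-pixel binary focal loss with gamma = 2.0 and alpha = 0.7 and combines five per-output losses with the stated weights. Applying alpha to the TV class and 1 − alpha to the background is an assumption, as are the function names; the text does not state how alpha was assigned.

```python
import math

GAMMA, ALPHA = 2.0, 0.7
OUTPUT_WEIGHTS = [1.0, 0.8, 0.8, 0.6, 0.6]  # 'output1' ... 'output5'

def focal_loss(p_tv, is_tv, gamma=GAMMA, alpha=ALPHA, eps=1e-7):
    """Focal loss for one pixel; p_tv is the predicted probability of TV.

    Weighting the TV class by alpha is an assumption made for this sketch.
    """
    p_t = p_tv if is_tv else 1.0 - p_tv
    alpha_t = alpha if is_tv else 1.0 - alpha
    p_t = min(max(p_t, eps), 1.0 - eps)  # avoid log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

def total_loss(per_output_losses, weights=OUTPUT_WEIGHTS):
    """Weighted sum of the losses of the five network outputs."""
    return sum(w * l for w, l in zip(weights, per_output_losses))

easy = focal_loss(0.9, True)   # well-classified TV pixel, strongly down-weighted
hard = focal_loss(0.5, True)   # uncertain TV pixel, larger loss
```

The factor (1 − p_t)^gamma reduces the contribution of well-classified pixels, so most of the gradient comes from harder pixels.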

Inference
In the inference phase, inputs were rescaled to 1536 × 1024 and cropped into image patches with a fixed size of 512 × 512 in the x and y directions with a step size of 256. We created an array I_out with a size of 1536 × 1024 × 2 to save the results. Only 'output1' was used as the output, and it was written into the corresponding region of I_out. Overlapping regions took the maximum value. Finally, the category corresponding to the maximum of the two channels was the predicted result for each pixel in I_out.
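The patch layout and merging rule can be sketched as follows. With a 512 × 512 window and a step of 256, a 1536 × 1024 image yields 5 × 3 = 15 patches. The function names are hypothetical, the array is stored row-first as (1024, 1536, 2), and initializing I_out with zeros assumes non-negative network scores such as softmax probabilities.

```python
import numpy as np

PATCH, STEP = 512, 256

def patch_origins(length, patch=PATCH, step=STEP):
    """Start coordinates of the sliding window along one axis."""
    return list(range(0, length - patch + 1, step))

def merge_patches(patch_outputs, origins, height=1024, width=1536):
    """Write (512, 512, 2) score maps into I_out, keeping the maximum on overlaps.

    Returns the per-pixel class index (assumed: 0 = background, 1 = TV).
    """
    i_out = np.zeros((height, width, 2), dtype=np.float32)
    for scores, (y, x) in zip(patch_outputs, origins):
        region = i_out[y:y + PATCH, x:x + PATCH]
        np.maximum(region, scores, out=region)  # in-place maximum on the view
    return i_out.argmax(axis=-1)

origins = [(y, x) for y in patch_origins(1024) for x in patch_origins(1536)]
print(len(origins))  # 15
```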

Convolutional Neural Network Based on Encoder-Decoder Architecture for Fine Detection
The outlines of the TV extracted by the rough detection network are coarse, so the second network, for fine detection, needs to be able to correct the contours. To solve the problem of video object segmentation, Perazzi et al. [17] proposed the MaskTrack method, which adds the predicted mask image of the previous frame to the input of the network. The extra mask channel is meant to provide an estimate of the visible area of the object in the current frame, its approximate location and its shape. This inspired the design of the fine detection network, which refines the rough detection result.

Encoder and Decoder
The fine detection network adopts the same architecture as the rough detection network but removes the extra attention blocks before the five outputs. In addition, we stacked the current frame (RGB image, 3 channels) and the rough detection result for the current frame (binary image, 1 channel) together as the input of the encoder. The input shape was 512 × 512 × 4.

Training
In the training phase, instead of using the outputs of the rough detection network as the training set, we constructed random rough detection results artificially, because the outputs of the rough detection network contain false alarms and missed detections. To obtain random rough detection results, we used affine transformations and non-rigid deformations via thin-plate splines [18] to deform the ground truth images. Because each TV moves independently, we deformed each TV randomly.
In order to prevent large distortion, for non-rigid deformations we only kept a deformed result if its intersection over union value with the original region was larger than 10%. Affine transformations include rotation within [−15°, 15°], translation within [−20, 20] for the x and y directions, and scaling within [0.5, 2.0]. A morphological dilation operation with a disc structuring element (15 pixels in diameter) was applied to remove the details of TV contours after the transformations. Examples of the generated rough detection results are shown in Figure 4. The optimization algorithm, loss function and other parameters used in the training phase were the same as those of the rough detection network. Since the TV regions had already been randomly deformed, we only used horizontal and vertical flips for data augmentation.
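The parameter ranges and the acceptance rule can be sketched as below. The thin-plate spline deformation itself is not implemented here, and the function names are hypothetical. The example uses a simple shift as a stand-in for a deformed mask, only to exercise the IoU check.

```python
import random
import numpy as np

def sample_affine_params(rng):
    """Draw one set of affine parameters from the ranges given in the text."""
    return {
        "angle_deg": rng.uniform(-15.0, 15.0),
        "shift_x": rng.uniform(-20.0, 20.0),
        "shift_y": rng.uniform(-20.0, 20.0),
        "scale": rng.uniform(0.5, 2.0),
    }

def mask_iou(a, b):
    """Intersection over union of two binary masks."""
    a, b = a.astype(bool), b.astype(bool)
    union = np.logical_or(a, b).sum()
    return float(np.logical_and(a, b).sum() / union) if union else 0.0

def keep_deformation(original, deformed, min_iou=0.10):
    """Accept a deformed TV region only if it still overlaps the original enough."""
    return mask_iou(original, deformed) > min_iou

params = sample_affine_params(random.Random(0))

original = np.zeros((40, 40), dtype=bool)
original[10:20, 10:20] = True
slightly_moved = np.roll(original, 5, axis=1)   # IoU = 50 / 150
far_moved = np.roll(original, 10, axis=1)       # no overlap, IoU = 0
```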

Inference
In order to reduce false alarms, the extra input mask of the fine detection network is obtained as follows. First, we apply a morphological dilation operation with a disc structuring element (15 pixels in diameter) to the rough detection results of the previous and current frames. Then the two dilated binary images are combined with a pixel-wise AND operation. Figure 5 shows the inference phase of the fine detection network. If there is no rough detection result for the previous frame, the dilated rough detection result of the current frame is used as the input mask. The outputs are saved in the same way as for the rough detection network.
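A minimal NumPy sketch of this mask construction is given below. The dilation is implemented directly as an OR over shifted copies so that the example has no dependency beyond NumPy. The function names and the choice of a radius of 7 pixels for the 15-pixel disc are assumptions.

```python
import numpy as np

def disc_offsets(diameter=15):
    """Pixel offsets inside a disc structuring element (radius = diameter // 2)."""
    r = diameter // 2
    return [(dy, dx) for dy in range(-r, r + 1) for dx in range(-r, r + 1)
            if dy * dy + dx * dx <= r * r]

def dilate(mask, diameter=15):
    """Binary dilation with a disc, computed as an OR over shifted copies."""
    r = diameter // 2
    padded = np.pad(mask.astype(bool), r)  # zero padding outside the image
    h, w = mask.shape
    out = np.zeros((h, w), dtype=bool)
    for dy, dx in disc_offsets(diameter):
        out |= padded[r + dy:r + dy + h, r + dx:r + dx + w]
    return out

def fine_input_mask(rough_cur, rough_prev=None, diameter=15):
    """AND of the dilated current and previous rough results.

    Falls back to the dilated current result when no previous result exists.
    """
    cur = dilate(rough_cur, diameter)
    if rough_prev is None:
        return cur
    return cur & dilate(rough_prev, diameter)

prev_mask = np.zeros((41, 41), dtype=bool)
cur_mask = np.zeros((41, 41), dtype=bool)
prev_mask[20, 20] = True   # detection in the previous frame
cur_mask[20, 30] = True    # the same object, 10 pixels to the right
mask = fine_input_mask(cur_mask, prev_mask)
```

A detection that appears in only one of the two frames, and far from any detection in the other, is removed by the AND operation, which is how the mask suppresses isolated false alarms.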

Dataset and Optical System
There were six videos containing TV in our dataset and the number of frames in each video is shown in Table 1. The image size of each frame was 1920 × 1200. All videos were shot under the same condition: the z position of the microscope stage was adjusted to make the formed elements clearest. The positions and shapes of the TV were constantly changing due to its motion characteristics and most of the time it appeared as shadow regions in the videos. For the convenience of comparison, we manually labeled the TV in all video frames, obtaining a total of 2520 annotated images for analysis (ground truth of TV, pixel value 0 for background regions, pixel value 1 for TV regions). For the two convolutional neural networks, we used video1 and video2 as the training set, video3 as the validation set, and video4 to video6 as the test set. As shown in Figure 6, the optical system for capturing TV videos contains a biological microscope and a charge-coupled device (CCD) camera. We used a CX31 biological microscope (Olympus, Tokyo, Japan) equipped with a 40× objective lens (CFI BE2 Plan Achromat, Nikon, Tokyo, Japan) which has a numerical aperture (NA) of 0.65. An EXCCD01400KMA CCD camera (Motic, Xiamen, China) with a pixel size of 6.45 µm × 6.45 µm was used for image capture, with an exposure time of 40 ms. The field of view (FOV) was 0.41 mm × 0.26 mm.

Metric
In this study, we verified the effectiveness of the proposed detection networks by calculating the intersection over union (IoU) metric of TV. The calculation formula was as follows:
$$\mathrm{IoU} = \frac{TP}{TP + FP + FN}$$
where True Positive (TP) is the number of correctly detected TV pixels; False Positive (FP) is the number of background pixels incorrectly classified as TV; and False Negative (FN) is the number of TV pixels incorrectly classified as background. In addition, we calculated the precision and recall metrics to evaluate the degree of false alarm and missed detection of TV. The calculation formulas were as follows:
$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}$$
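The three metrics can be computed from a predicted mask and its ground truth as in the following sketch. The function name and the convention of returning 0 when a denominator is zero are assumptions.

```python
import numpy as np

def tv_metrics(pred, truth):
    """Pixel-wise IoU, precision and recall for binary TV masks (1 = TV)."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    tp = int(np.sum(pred & truth))    # TV pixels detected correctly
    fp = int(np.sum(pred & ~truth))   # background pixels detected as TV
    fn = int(np.sum(~pred & truth))   # TV pixels detected as background
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return iou, precision, recall

pred = np.array([[1, 1, 0, 0]])
truth = np.array([[0, 1, 1, 0]])
iou, precision, recall = tv_metrics(pred, truth)  # TP = 1, FP = 1, FN = 1
```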

Results
The results of the rough detection network for the training, validation and test sets are shown in Table 2. It can be seen that for the test set, the rough detection network has a higher mean recall, but the increased number of falsely detected regions leads to a decrease in the mean precision and mean IoU values. The results of the fine detection network are shown in Table 3. With the fine detection network applied to the validation and test sets, the mean recall decreases slightly, the mean precision and mean IoU values are improved, and the average of the mean IoU values over the three test videos reaches 72.09%. The experimental results indicate that the proposed fine detection network can correct the boundary of TV. In the training set, the mean IoU value is already high and the boundaries of the TV are very close to the ground truth, so the correction effect for TV is limited; moreover, the shadow regions of false alarms are enlarged after correction, which reduces the mean IoU value of video2. The specific analysis is discussed in Section 5. Figure 7 shows the results of the two detection networks for one frame in video4. In this image, there are three TV adjacent to each other. In the rough detection result, the predicted regions of the three TV are stuck together. After using the fine detection network, some of the adhesion areas are eliminated. We uploaded the results of the fine detection network for the six videos online and the details can be found in the Supplementary Information.

The Operating Environment and Running Times
We used the TensorFlow 2 framework to build our algorithm. The operating system was Ubuntu and we ran the algorithm on a GTX TITAN Xp GPU. The code for calculating optical flow is from LiteFlowNet2 [13]. The overall detection starts from the second frame of the video. One calculation process includes image reading and scaling (1920 × 1200 × 3 -> 1536 × 1024 × 3), optical flow calculation, slicing (15 × 512 × 512 × 13 or 15 × 512 × 512 × 4), rough detection and fine detection. During inference, the batch size for both the rough and fine detection networks was 8. The average running times for optical flow calculation, slicing, rough detection and fine detection are shown in Table 4.

Selection of the Optical Flow Calculation Method
In this section, we discuss the choice of optical flow calculation method. Traditional optical flow calculation methods have a high calculation accuracy, but their high computation cost makes them unsuitable for real-time detection. Optical flow calculation methods based on deep learning have the advantages of fast calculation speed and acceptable precision, and have been an important research subject in deep learning in recent years.
As shown in Figure 8, we compared four optical flow calculation methods based on deep learning. Figure 8a,b show the two adjacent frames, (c) shows the TV ground truth of frame (a), and (d) to (g) are the visualized images of optical flow calculated using the four methods. It can be seen from Figure 8d,e that FlowNet2.0 [19] and LiteFlowNet2 [13] can effectively capture the motion of the shadow regions where the defocused TV is located. Finally, we chose LiteFlowNet2 [13], which is the faster of these two methods.

Figure 8. (a,b) show the two adjacent frames; (c) shows the TV ground truth of frame (a); (d-g) are the visualized images of optical flow calculated by FlowNet2.0 [19], LiteFlowNet2 [13], PWC-Net [20], and SPyNet [21].

Ablation Study
In this section, we investigate the effects of each component in the model framework of the rough detection network. Table 5 summarizes the performance of rough detection when using different encoders or adding attention blocks. We chose the original FlowNetSimple network [11] as the baseline model. After changing the encoder to VGG16 [12], the best mean IoU value for the validation set improved significantly, to 71.50%. Compared with the original decoder structure, the attention block with the SE module [14] is able to recover fine details of TV and improves the mean IoU value to 72.08%. Finally, VGG16 [12] was used as the encoder and attention blocks with an SE module were adopted in the decoder.
Table 5. Performance of rough detection using different encoders or adding attention blocks.

Attention Module
The influence of different attention modules on the performance of the rough detection network is shown in Table 6. We replaced the SE [14] module in the attention block with other classical attention modules such as non-local [24], CBAM [25], and dual attention (DA) [26]. Due to limitations in the memory size of the graphics boards, we deleted the attention blocks before 'output1', 'output2' and 'output3'. Similar to the training phase of the rough detection network, we used the trained network without an attention mechanism as the pretrained model for transfer learning. The data in Table 6 indicate that the SE module identifies the information pertinent to the TV with better performance than the other modules.
Table 7 summarizes how the input information affects the performance of the rough detection network. In order to simplify the comparison process, we used VGG16 [12] as the encoder and removed the attention blocks in the decoder. The results demonstrate that optical flow information is necessary and that using three consecutive frames gives a better result than using two. We also tested inputs of five consecutive frames with optical flow, but the network performance for the test set decreased. Finally, three consecutive frames with optical flow were adopted as the input of the rough detection network.

Comparison with Traditional Video Object Detection Methods
In this section, we compare our method with some traditional video object detection methods. For convenience, we compare only the mean IoU metric. To reduce the interference of background noise, we first applied a median filter to the images (11 × 11 kernel for the three-frame difference method [4] and 29 × 29 kernel for the Gaussian mixture model (GMM) [5]). As shown in Table 8, the mean IoU of our method is much higher than those of the traditional algorithms. Since most of the TV in our dataset is defocused, it appears as flat shadow regions with little difference between adjacent frames. Therefore, with either the frame difference method [4] or the background difference method [5], the foreground decision threshold must be lowered, which results in false alarms where some background regions are detected as TV. The same problem exists with the improved Kalman [1] and improved VIBE [2] methods that we proposed. These two methods can recognize clear instances of TV but fail on defocused ones.
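The threshold trade-off described above can be reproduced with a minimal three-frame difference sketch. This is a simplified illustration rather than the exact baseline evaluated in Table 8: the median filtering step is omitted, and the synthetic frames, intensity values, and thresholds are hypothetical.

```python
import numpy as np

def three_frame_difference(prev, curr, nxt, threshold):
    """Foreground mask: pixels that differ from BOTH neighbouring frames."""
    d1 = np.abs(curr.astype(np.float32) - prev.astype(np.float32))
    d2 = np.abs(nxt.astype(np.float32) - curr.astype(np.float32))
    return (d1 > threshold) & (d2 > threshold)

def make_frame(sharp_col, faint_col, shadow_value):
    """Synthetic 6 x 8 frame with background intensity 100."""
    frame = np.full((6, 8), 100.0)
    frame[1, sharp_col] = 160.0   # in-focus object, strong contrast
    frame[3, faint_col] = 104.0   # defocused object, weak contrast
    frame[5, 7] = shadow_value    # background shadow that flickers slightly
    return frame

# Both objects move one pixel per frame; the shadow brightens in the middle frame.
frames = [make_frame(1, 1, 100.0), make_frame(2, 2, 105.0), make_frame(3, 3, 100.0)]

strict = three_frame_difference(*frames, threshold=20.0)
loose = three_frame_difference(*frames, threshold=3.0)
print("strict detections:", np.argwhere(strict).tolist())
print("loose detections:", np.argwhere(loose).tolist())
```

With the high threshold only the in-focus object is found. Lowering the threshold recovers the weak-contrast object, but the flickering background pixel is then also reported as foreground.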

The Performance of the Rough Detection Network Using Different Outputs
Although the rough detection network has five outputs, we used only 'output1' as the final result. We therefore compared the impact of the different outputs on the performance of the rough detection network. For 'output2' to 'output5', we enlarged the image size to 512 × 512 by bilinear interpolation. The mean IoU values of 'output1' to 'output5' are shown in Table 9. The results of 'output1', 'output2', and 'output3' are similar. To improve the detection speed, the results of 'output3' to 'output5' were calculated alone while discarding the subsequent network structure. Furthermore, owing to the reduction in network size, we stacked the cropped patch images together so that a rescaled image with the shape of 1536 × 1024 could be detected by the rough detection network at once.
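The evaluation of a lower-resolution output can be sketched as follows: the coarse map is enlarged to 512 × 512 by bilinear interpolation, thresholded, and compared with the ground truth by IoU. The 128 × 128 input resolution, the 0.5 threshold, the half-pixel sampling convention, and the synthetic square target are assumptions made for this example only.

```python
import numpy as np

def bilinear_resize(img, out_h, out_w):
    """Resize a 2-D array with bilinear interpolation (half-pixel centres)."""
    in_h, in_w = img.shape
    # Map each output pixel centre back to source coordinates, clamped to the image.
    ys = np.clip((np.arange(out_h) + 0.5) * in_h / out_h - 0.5, 0, in_h - 1)
    xs = np.clip((np.arange(out_w) + 0.5) * in_w / out_w - 0.5, 0, in_w - 1)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, in_h - 1)
    x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None]   # vertical interpolation weights
    wx = (xs - x0)[None, :]   # horizontal interpolation weights
    top = img[y0][:, x0] * (1 - wx) + img[y0][:, x1] * wx
    bottom = img[y1][:, x0] * (1 - wx) + img[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy

def iou(pred, target):
    """Intersection over union of two boolean masks."""
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    return inter / union if union else 1.0

# Hypothetical coarse output (128 x 128) containing one square region.
coarse = np.zeros((128, 128))
coarse[32:64, 32:64] = 1.0
target = np.zeros((512, 512), dtype=bool)
target[128:256, 128:256] = True

upsampled = bilinear_resize(coarse, 512, 512)
score = iou(upsampled > 0.5, target)
print(upsampled.shape, round(float(score), 4))
```

Interpolation smooths the region boundary, so the thresholded mask can differ from the target by a few pixels at corners and edges even when the coarse prediction is otherwise correct.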

Limitations of Our Trichomonas Vaginalis Detection Method
Most of the TV in the training set videos is defocused, and the rough detection network is therefore sensitive to shadow regions between frames, which often leads to false alarms. As shown in Figure 9a, the background shadow region marked in blue moves slightly with the sample liquid and is mistakenly detected by the rough detection network at the bottom right of the image. In addition, the rough detection network relies mainly on the difference between frames. As shown in Figure 9b, the TV marked in green resembles white blood cells and its position is basically unchanged in this video, which leads to little difference between frames and a missed detection.

The main function of the fine detection network we proposed is to correct the contours of TV and eliminate short-term false alarms. Therefore, if the rough detection network falsely detects or misses TV, the result will not improve, and may even get worse, with the fine detection network applied. For example, the fine detection network could not detect the missed TV in Figure 9b. Figure 10 shows the rough and fine detection results for one frame in video2. The blue-marked shadow region at the bottom right of Figure 10a is falsely detected by the rough detection network. After adopting the fine detection network, as shown in Figure 10b, this region is not eliminated but is instead enlarged by the correction function of the fine detection network. In future work, we need to address the above limitations and further study the problem of TV recognition in flowing liquid samples.

Conclusions
In this paper, we proposed two convolutional neural networks based on an encoder-decoder architecture to solve the problem of recognizing defocused TV in videos shot by microscopic cameras. The first network, the rough detection network, performs coarse detection of TV by learning the difference between adjacent frames. The second network, the fine detection network, corrects the contours of TV in the rough detection results. By combining these two networks, the mean IoU of TV reached 72.09% on our test videos. The experimental results show that our proposed networks can effectively detect defocused TV and suppress the false alarms caused by the motion of formed elements or impurities.

Informed Consent Statement:
Written informed consent has been obtained from the patients to publish this paper. All samples were anonymized.

Data Availability Statement:
The algorithm code and our dataset will be released online at www.github.com/wxz92/Trichomonas-Vaginalis-Detection (accessed on 24 February 2021).