Detecting Endotracheal Tube and Carina on Portable Supine Chest Radiographs Using One-Stage Detector with a Coarse-to-Fine Attention

In intensive care units (ICUs), the position of the endotracheal tube (ETT) should be checked after endotracheal intubation to avoid complications. Malposition can be detected from the distance between the ETT tip and the Carina (the ETT–Carina distance). However, detection performance is limited by two major problems: occlusion by external devices, and the patient posture and portable machine used to take chest radiographs. While previous studies addressed these problems, they required manual intervention. The purpose of this paper is therefore to locate the ETT tip and the Carina more accurately, and thereby detect malposition, without manual intervention. The proposed architecture is composed of FCOS (Fully Convolutional One-Stage Object Detection), an attention mechanism named Coarse-to-Fine Attention (CTFA), and a segmentation branch. Moreover, a post-process algorithm is adopted to select the final location of the ETT tip and the Carina. Three metrics were used to evaluate the performance of the proposed method. On the dataset provided by National Cheng Kung University Hospital, the proposed method detects malposition with 88.82% accuracy, and the ETT–Carina distance error is 5.333 ± 6.240 mm (mean ± standard deviation).


Introduction
In intensive care units (ICUs), endotracheal intubation is a common medical procedure when a patient cannot spontaneously breathe. When intubating, the position of an endotracheal tube (ETT) needs to be taken into account because some complications can be caused by its malposition. A deep position can cause tachycardia, hypertension, and/or bronchospasm. On the other hand, a shallow position can cause inadvertent extubation or larynx damage [1]. Extensive research has shown that the appropriate depth of an ETT can be determined by the distance between the ETT and Carina [2]. Therefore, to reduce the likelihood of serious complications, it is essential to develop a method to accurately detect the distance between ETT and Carina.
Recent evidence [3] suggests that chest radiography should be performed after endotracheal intubation. However, in ICUs, chest radiographs (CXRs) are usually obtained in supine anteroposterior (AP) view by a portable X-ray machine. In this situation, several challenges limit the detection accuracy; for example, external monitoring devices, tubes, or catheters can cause ambiguity in the position of the ETT and Carina. Additionally, the image quality of a CXR obtained in the supine AP view is lower than that of the standard posteroanterior (PA) view. Therefore, over the past decade, researchers have paid increasing attention to computer-aided detection (CAD) methods in an attempt to improve the detection of unsuitable endotracheal intubation and reduce the burden on doctors.
The main contributions of this paper are as follows:
1. A new spatial attention (SA) module was proposed, which considers the feature distribution in the spatial and channel domains simultaneously. Unlike previous studies, SA employs the channel attention weights to refine the spatial attention weights, which addresses the problem in [15] of overlooking the synergy between channel and spatial attention. Moreover, SA outperformed other attention modules.
2. A new method for merging attention modules was proposed, which fuses global modelling attention and scale attention. Instead of stacking the same type of attention together, this method extends the capacity for feature extraction in a coarse-to-fine way and obtained better results.
3. The architecture proposed by this paper improved the outcomes of object detection by adding a segmentation branch, which preserves more useful information during feature extraction.
4. The experimental results demonstrated that the proposed architecture achieves outstanding performance on the chest radiograph datasets provided by the ICUs of National Cheng Kung University Hospital and Tainan Hospital.
The rest of this paper is organized as follows. Section 2 introduces previous research about object detection and attention module. Section 3 explains the proposed method for detecting endotracheal intubation malposition. Section 4 shows and discusses the experimental results. Section 5 summarizes this paper.

Object Detection
Object detection is a subfield of computer vision. The purpose of object detection is to localize and recognize objects with bounding boxes; this is different from classification, which aims to assign a whole image to a class. Generally, detectors can be separated into one-stage detectors and two-stage detectors, based on whether region proposals are adopted or not. Moreover, object detection can also be separated into anchor-based detectors and anchor-free detectors, based on whether predefined anchors are employed or not. This paper used a one-stage anchor-free architecture named FCOS [13] as the base.

Anchor-Based Approach
Anchor-based detectors set anchor boxes at each position on the feature map and predict the probability of an object being present in each anchor box. This approach can be separated into two classes by whether region proposals are used, i.e., two-stage detectors and one-stage detectors. Two-stage detectors generate region proposals and then classify and refine them. For instance, Ren et al. [16] improved Fast R-CNN [17] by proposing a region proposal network (RPN), which generated high-quality region proposals more efficiently than the Selective Search applied in Fast R-CNN. After finding the proposals, they used region of interest (ROI) pooling to produce feature maps with the same resolution for further refinement. Cai et al. [18] noticed the impact of the intersection over union (IoU) threshold when training object detectors. Therefore, they proposed Cascade R-CNN, which consists of a sequence of detectors trained with increasing IoU thresholds. He et al. proposed Mask R-CNN [12], which is based on Faster R-CNN [16] and the Feature Pyramid Network (FPN) [19], to perform object detection and segmentation at the same time.
One-stage detectors classify and regress locations directly. These detectors are more efficient and lightweight but have lower performance compared to two-stage detectors. For instance, SSD [20] detects objects of different scales with different layers of VGG-16 [21], i.e., small objects are detected on high-resolution feature maps, and large objects are detected on low-resolution feature maps. Liu et al. [22] proposed RFBNet to strengthen the receptive fields (RFs), based on SSD and the structure of the RFs in the human visual system. The RF Block (RFB) in RFBNet is similar to a mix of the inception block [10,23] and atrous spatial pyramid pooling (ASPP) [24]. Lin et al. proposed Focal Loss to solve the extreme imbalance between positive and negative samples during training [25]. Focal Loss, which is similar to a weighted cross-entropy (CE) loss, provides larger weights for positive samples and lower weights for negative samples. Based on this loss function, Lin et al. also proposed a one-stage detector named RetinaNet, which used ResNet [26] and FPN [19] as the backbone, to demonstrate the capability of Focal Loss.

Anchor-Free Approach
Anchor-free detectors do not need to design anchor boxes, avoiding the complicated computations related to them. FCOS [13,27] transformed object detection into a per-pixel prediction task and used multi-level prediction to improve the recall. In their approach, they regressed the distances between a location and the four sides of the bounding box. Furthermore, they also proposed the center-ness branch to suppress low-quality detected bounding boxes. FoveaBox [28] is similar to FCOS. FoveaBox also applied FPN to solve the intractable ambiguity caused by overlaps in ground truth boxes. Additionally, they also regressed the offsets between a location and the four sides of the bounding box. However, the definition of positive samples in FoveaBox differs from FCOS: they used a formula with a shrink factor to form shrunk ground truth boxes. If a sample falls into a shrunk ground truth box, it is positive; otherwise, it is negative. Moreover, their formula for offset computation also differs from that of FCOS.
CornerNet [29] detected an object as a pair of top-left and bottom-right corners of a bounding box with heatmaps and then grouped the pairs of corners belonging to the same instance with embedding vectors. Furthermore, to solve the problem that a corner outside the object cannot be localized based on local evidence, they proposed corner pooling, which takes in two feature maps to encode the locations of the corner. During training, they used an unnormalized 2D Gaussian to ease the difficulty of detecting the exact position of a corner. Duan et al. noticed that CornerNet had a high false-positive rate. They observed that the higher the IoU of a bounding box with the ground truth, the higher the probability that the center keypoint in its central region is predicted as the same class [30]. Therefore, they improved CornerNet with two strategies named center pooling and cascade corner pooling, and represented each object by a center point and a pair of corners. They defined a scale-aware central region for each bounding box, and a bounding box is preserved if the center keypoint detected in its central region has the same class as the bounding box.

Attention Mechanism
Attention mechanisms, which originated in Natural Language Processing (NLP), have made a huge impact on vision tasks. Generally, attention mechanisms are adopted to determine where to focus. They take feature maps as input and produce weights for each value on the feature maps. The weights can be formed by the relationships between two pixels, the relationships between channels, spatial relationships, or other relationships. Attention mechanisms can be separated into dot-product attention, channel attention, spatial attention, level attention, and some variants, according to the relationships they adopt. Moreover, attention mechanisms can also be separated into global modelling attention and scale attention according to their purpose [31,32]. Therefore, this paper classifies attention mechanisms from this point of view.

Global Modelling Attention
Global modelling attention is effective for modelling the long-range global contextual information. Wang et al. proposed a non-local module to capture long-range dependencies with deep neural networks in computer vision [33]. Fu et al. proposed a Dual Attention Network (DANet) to capture feature dependencies in the spatial and channel dimensions with two parallel attention modules [34]. One is a position attention module and the other is a channel attention module. Both of these modules are similar to the non-local module. Miao et al. proposed CCAGNet [14] that considered multi-region context information simultaneously by combining Cross-context Attention Mechanism (CCAM), Receptive Field Attention Mechanism (RFAM), and Semantic Fusion Attention Mechanism (SFAM). The CCAM modeled the long-range and adjacent relationships among the feature values by two parallel non-linear functions. Moreover, the RFAM was proposed to improve the RFB in RFBNet [22] by inserting channel attention and spatial attention implemented with CCAM. Finally, SFAM was proposed to improve traditional up-sampling by Sub-Pixel convolution up-sampling [35] followed by CCAM.

Scale Attention
Scale attention is adopted to determine where to focus and to suppress uninformative features. This attention mechanism is widely used in computer vision. Hu et al. proposed Squeeze-and-Excitation Networks (SENet) [36], which consist of a squeezing process and an excitation process. The squeezing process uses global average pooling to generate the relationships between channels. Then, in the excitation process, these relationships pass through a bottleneck with two fully connected layers to fully capture channel-wise dependencies. Woo et al. proposed CBAM to capture channel-wise and spatial-wise relationships in sequence [15]. Their channel attention module includes two parallel SE blocks with max pooling and average pooling. Their spatial attention module also uses max pooling and average pooling to grab the relationships in the spatial domain. Gu et al. aggregated spatial attention, channel attention, and level attention together and proposed Comprehensive Attention Neural Networks (CA-Net) [37]. Their spatial attention includes a non-local block and two-pathway attention blocks which refine the deep feature maps with the shallow ones. Moreover, their level attention consists of channel attention and a refinement process, which integrates each scale's output in the decoder.
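As a concrete illustration of the squeeze-and-excitation process described above, the following is a minimal PyTorch sketch; the reduction ratio r = 16 is the default from the SENet paper, not a value given in this text:

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Minimal squeeze-and-excitation sketch: global average pooling (squeeze),
    a two-layer FC bottleneck (excitation), then channel-wise rescaling."""
    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // r),   # squeeze bottleneck
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),   # excitation back to C channels
            nn.Sigmoid(),                         # channel weights in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = x.mean(dim=(2, 3))            # global average pooling -> (B, C)
        w = self.fc(w).view(b, c, 1, 1)   # per-channel weights
        return x * w                      # rescale each channel of the input
```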

Overview
The proposed overall architecture is shown in Figure 1. In general, the proposed method embedded CTFA and a mask branch into FCOS [13] and employed ResNet50 [26] as the backbone. This architecture takes chest radiographs as input and locates the ETT tip and the Carina. After finding the candidates for the ETT tip and the Carina, a post-process algorithm was applied to select the best ETT tip and Carina. ResNet50 [26] without the last down-sampling operation was applied to extract semantic features, and the feature maps from high resolution to low resolution are denoted as {C2−5}. Following C5, CTFA was applied to refine the feature representation before passing it into the neck. The neck was implemented with FPN [19], which up-sampled the low-resolution feature maps and merged them with the high-resolution feature maps by element-wise summation. The feature maps produced by FPN are denoted as {P2−5} from high resolution to low resolution. Additionally, P5 passed through a series of convolutions with stride 2 and kernel size 3 × 3 to form P6 and P7. Afterwards, {P3−7} were fed into the FCOS detection head, and {P3−5} were also fed, together with P2, into a segmentation head. After detecting where the ETT tip and Carina might be, the post-process algorithm chose the best ETT tip and Carina feature points through Gaussian masks generated from the centers of the predicted bounding boxes of the ETT and the tracheal bifurcation.
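The neck described above (top-down FPN fusion plus stride-2 3 × 3 convolutions deriving P6 and P7 from P5) can be sketched as follows; the 256-channel width and the ReLU before P7 are assumptions borrowed from the FPN/FCOS papers rather than details stated in the text:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FPNTopDown(nn.Module):
    """Sketch of the neck: lateral 1x1 convs, top-down nearest up-sampling with
    element-wise summation (P2-P5), then stride-2 3x3 convs to form P6 and P7."""
    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList(
            [nn.Conv2d(c, out_channels, 1) for c in in_channels])
        self.p6 = nn.Conv2d(out_channels, out_channels, 3, stride=2, padding=1)
        self.p7 = nn.Conv2d(out_channels, out_channels, 3, stride=2, padding=1)

    def forward(self, feats):  # feats = [C2, C3, C4, C5], high to low resolution
        laterals = [l(c) for l, c in zip(self.lateral, feats)]
        # top-down pathway: up-sample the coarser map and add it element-wise
        for i in range(len(laterals) - 1, 0, -1):
            laterals[i - 1] = laterals[i - 1] + F.interpolate(
                laterals[i], size=laterals[i - 1].shape[-2:], mode="nearest")
        p2, p3, p4, p5 = laterals
        p6 = self.p6(p5)
        p7 = self.p7(F.relu(p6))  # ReLU between P6 and P7 as in FCOS (assumed)
        return p2, p3, p4, p5, p6, p7
```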

Coarse-to-Fine Attention (CTFA)
CTFA fused global-modelling attention (GA) and scale attention (SA) with a point-wise convolution, as shown in Figure 2. Concretely, the feature map produced by GA passed through a point-wise convolution to smooth the features. Then, these smoothed feature maps were fed into SA to rescale the features. We found that the Cross-context Attention Mechanism (CCAM) in [14] can capture long-range relationships and that its performance was good, as shown in Section 4.5.3 of the ablation study. Therefore, we adopted CCAM as our GA in this paper. Moreover, we noticed that CBAM can capture local relationships; however, some shortcomings limit its performance. Therefore, we addressed these shortcomings and propose a new SA based on CBAM. The CCAM in [14] encoded the cross-region and adjacent-region relationships by two branches. Then, it fused the relationships by concatenation and adopted a series of convolutions followed by a softmax activation function to obtain the attention weights. However, we noticed that the performance of CCAM without a softmax activation function was better than with one, as shown in Section 4.5.1 of the ablation study. Therefore, GA did not adopt the softmax activation function and the steps that follow it in CCAM. The architecture of GA is shown in Figure 3 and can be summarized as:

F_cr = Conv_1(DiConv_{3,3}(F_in)), F_ar = DiConv_{3,3}(Conv_1(F_in)) (1)
F_GA = DiConv_{3,2}(Conv_1(F_cr ⌢ F_ar)) (2)

where the ReLU and batch normalization are neglected, F_in ∈ R^{C×H×W} denotes the input feature map, DiConv denotes dilated convolution whose parameters are the kernel size and the dilation rate, respectively, Conv denotes traditional convolution whose parameter is the kernel size, F_cr and F_ar denote the feature maps generated from the cross-region branch and the adjacent-region branch, ⌢ denotes the concatenation operation, and F_GA represents the output of the GA module. Specifically, GA generated long-range relationships through two branches.

The upper branch F_cr captured long-range context information with a dilated convolution whose kernel size was 3 × 3 and dilation rate was 3, and smoothed the features with a point-wise convolution (Equation (1)). The lower branch F_ar grabbed local context information with a point-wise convolution and expanded the influence of the features with a dilated convolution whose kernel size was 3 × 3 and dilation rate was 3 (Equation (1)). These two feature maps were concatenated to form F_ga = F_cr ⌢ F_ar ∈ R^{2C×H×W}. Then, F_ga passed through a point-wise convolution and a dilated convolution whose kernel size was 3 × 3 and dilation rate was 2 to produce the final output of GA, F_GA (Equation (2)). The point-wise convolution here decreases the number of channels, and the dilated convolution expands the influence of the features again.
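The two branches of GA and the fusion step can be sketched in PyTorch as follows; batch normalization and ReLU are omitted, as in the equations above, since the text does not specify their placement:

```python
import torch
import torch.nn as nn

class GlobalModellingAttention(nn.Module):
    """Sketch of GA: a cross-region branch (dilated 3x3, rate 3, then point-wise),
    an adjacent-region branch (point-wise, then dilated 3x3, rate 3), and a fusion
    step (point-wise to reduce 2C->C, then dilated 3x3, rate 2)."""
    def __init__(self, c: int):
        super().__init__()
        # upper (cross-region) branch: dilated conv then point-wise smoothing
        self.cross = nn.Sequential(
            nn.Conv2d(c, c, 3, padding=3, dilation=3),
            nn.Conv2d(c, c, 1),
        )
        # lower (adjacent-region) branch: point-wise then dilated conv
        self.adjacent = nn.Sequential(
            nn.Conv2d(c, c, 1),
            nn.Conv2d(c, c, 3, padding=3, dilation=3),
        )
        # fusion: reduce channels, then expand influence with dilation rate 2
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * c, c, 1),
            nn.Conv2d(c, c, 3, padding=2, dilation=2),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f_cr = self.cross(x)                              # Equation (1), upper
        f_ar = self.adjacent(x)                           # Equation (1), lower
        return self.fuse(torch.cat([f_cr, f_ar], dim=1))  # Equation (2)
```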

Scale Attention (SA)
There were certain problems with the use of CBAM [15]. One was that CBAM ignored the interaction between channel attention and spatial attention. Another was that the channel-wise pooling operations overlooked the feature distribution of the feature maps. Based on the above analysis, SA was proposed to smooth out both defects at the same time. The construction of SA is shown in Figure 4 and its overall process can be summarized as:

F_s = M_cavg(F_x) ⌢ M_cmax(F_x) (3)
F_SE1 = f^c_ex(f^c_sq(Conv_1(M_aavg(F_s)))), F_SE2 = f^c_ex(f^c_sq(Conv_1(M_amax(F_s)))) (4)
F_sa = Conv_1((F_SE1 ⊕ F_SE2) ⊗ F_s) (5)
F_SA = F_x ⊗ σ(F_sa) (6)

where the ReLU and batch normalization are neglected, F_x denotes the input feature map, M_cavg and M_cmax denote channel-wise average pooling and channel-wise max pooling, ⌢ denotes the concatenation operation, F_s denotes the feature map after channel-wise pooling, M_aavg and M_amax denote adaptive average pooling and adaptive max pooling, Conv denotes the convolutional layer whose subscript represents the kernel size, f^c_sq and f^c_ex are two fully connected layers, F_SE1 and F_SE2 denote the feature maps in the two branches after the two fully connected layers, ⊕ denotes element-wise summation, ⊗ denotes element-wise multiplication, F_sa represents the spatial weight, σ represents the softmax function, and F_SA represents the output of the SA module. Specifically, SA can be separated into four steps. First, the input feature map F_x ∈ R^{C×H×W} was fed into two channel-wise poolings, generating two feature maps F_cavg = M_cavg(F_x) ∈ R^{8×H×W} and F_cmax = M_cmax(F_x) ∈ R^{8×H×W}, respectively. Afterwards, F_cavg was concatenated with F_cmax to generate F_s ∈ R^{16×H×W}, as in Equation (3). By preserving more pooled channels, SA kept more of the channel information of the feature maps. Second, F_s passed into adaptive max pooling and adaptive average pooling simultaneously to create F_amax = M_amax(F_s) ∈ R^{16×3×3} and F_aavg = M_aavg(F_s) ∈ R^{16×3×3}, respectively. Then, F_amax and F_aavg passed through one shared-weight convolution layer and two shared-weight fully connected layers to generate F_SE1 and F_SE2, as in Equation (4).

By retaining a higher resolution, SA saved more spatial information than the traditional pooling method. Third, SA directly fused F_SE1 and F_SE2 with an element-wise summation. Thereafter, the mixed feature was element-wise multiplied with F_s, followed by a point-wise convolution layer, as in Equation (5). This step integrated the channel attention weights into the spatial attention process and reduced the channels of the merged features with a point-wise convolution layer. Finally, the spatial attention feature F_sa ∈ R^{1×H×W} passed through a softmax to obtain the spatial attention weights. Then, the output of SA, F_SA ∈ R^{C×H×W}, was generated by element-wise multiplication of F_x with the spatial attention weights, as in Equation (6).
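The four steps of SA can be sketched in PyTorch as follows. The grouping scheme used by the channel-wise poolings (splitting C channels into 8 groups and pooling within each group) and the width of the FC bottleneck are assumptions; the text specifies only the 8 pooled channels, the 3 × 3 adaptive-pooled resolution, one shared convolution, and two shared fully connected layers:

```python
import torch
import torch.nn as nn

class ScaleAttention(nn.Module):
    """Sketch of SA: channel-wise avg/max pooling to 8 channels each (concat -> 16),
    adaptive avg/max pooling to 3x3 with a shared conv and two shared FC layers,
    channel-weight injection into the spatial path, then a spatial softmax."""
    def __init__(self, channels: int, pooled: int = 8, r: int = 2):
        super().__init__()
        assert channels % pooled == 0
        self.pooled = pooled
        self.shared_conv = nn.Conv2d(2 * pooled, 2 * pooled, 1)  # shared conv
        flat = 2 * pooled * 3 * 3
        self.fc_sq = nn.Linear(flat, flat // r)        # shared squeeze FC
        self.fc_ex = nn.Linear(flat // r, 2 * pooled)  # shared excitation FC
        self.reduce = nn.Conv2d(2 * pooled, 1, 1)      # point-wise conv -> 1 map

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        g = x.view(b, self.pooled, c // self.pooled, h, w)
        # Step 1 (Eq. 3): channel-wise average/max pooling, concatenated -> 16ch
        f_s = torch.cat([g.mean(dim=2), g.amax(dim=2)], dim=1)
        # Step 2 (Eq. 4): adaptive poolings to 3x3, shared conv + two shared FCs
        branches = []
        for pool in (nn.functional.adaptive_avg_pool2d,
                     nn.functional.adaptive_max_pool2d):
            f = self.shared_conv(pool(f_s, 3)).flatten(1)
            branches.append(self.fc_ex(torch.relu(self.fc_sq(f))))
        # Step 3 (Eq. 5): fuse channel weights, inject into the spatial path
        wgt = (branches[0] + branches[1]).view(b, -1, 1, 1)
        f_sa = self.reduce(f_s * wgt)                  # (B, 1, H, W)
        # Step 4 (Eq. 6): softmax over spatial positions, rescale the input
        attn = torch.softmax(f_sa.flatten(2), dim=-1).view(b, 1, h, w)
        return x * attn
```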

Mask Branch
We noticed that the Mask R-CNN with a special feature extraction method proposed by Chen et al. [11] refined the feature points of the ETT tip and the Carina with the masks of the ETT and the tracheal bifurcation, and achieved state-of-the-art performance. Inspired by their feature extraction algorithm and considering the characteristics of FCOS [13], this paper added a mask branch to the neck to refine the feature points of the ETT tip and the Carina in the same way. The architecture of the mask branch is shown in Figure 1 and its operation can be summarized as: where the ReLU and batch normalization are neglected, Upsample_i denotes enlarging the resolution i times with nearest interpolation, ⊕ denotes element-wise summation, F_seg denotes the feature map after fusion, Conv^i_j denotes stacking convolutions with kernel size j, i times, and F_SEG denotes the final output of the segmentation head.
Considering that the features of an object might be spread across the feature pyramid, and that a large magnification may cause problems, this paper only fused P3−5 together (Equation (7)). Explicitly, P3−5 were enlarged to the same resolution as P2 and then summed together. Afterwards, the fused feature maps were fed into the segmentation head, which includes a series of convolutions and up-sampling operations, to generate the segmentation mask (Equation (8)).
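The fusion step described above (Equation (7)) can be sketched as follows, assuming P3−5 are at 1/2, 1/4, and 1/8 of P2's resolution, as in a standard FPN:

```python
import torch
import torch.nn.functional as F

def fuse_pyramid_levels(p2, p3, p4, p5):
    """Enlarge P3-P5 to P2's resolution with nearest-neighbour interpolation and
    merge all four levels by element-wise summation (sketch of Equation (7))."""
    size = p2.shape[-2:]
    fused = p2
    for p in (p3, p4, p5):
        fused = fused + F.interpolate(p, size=size, mode="nearest")
    return fused
```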

Post-Process Algorithm
The process of the post-process algorithm is shown in Figure 5. In general, the algorithm has three steps. First, it found the highest confidence bounding box (bbox) of the ETT/tracheal bifurcation. Second, it applied a Gaussian mask based on the center of the ETT/tracheal bifurcation bbox or the center of the image to determine the best ETT tip/Carina. Finally, if the confidence of the best ETT tip/Carina was less than a threshold, then the best ETT tip/Carina would be generated after comparing with the confidence of the ETT/tracheal bifurcation bbox.
Specifically, in the first step, the algorithm kept the first bbox of the ETT/tracheal bifurcation as the best one and enumerated all bboxes classified as ETT/tracheal bifurcation. If the confidence of a bbox was larger than that of the best one, the best bbox was replaced. As for the second step, if a bbox of the ETT/tracheal bifurcation existed, the algorithm produced a Gaussian mask centered on that bbox; otherwise, the Gaussian mask was generated from the center of the input image. Afterwards, the confidence of each bbox classified as ETT tip/Carina was multiplied by the value of the Gaussian mask at the position of that bbox's center point. The algorithm then chose the best ETT tip/Carina bbox, as in the first step, based on this weighted value. After checking whether the ETT tip/Carina bbox existed, the algorithm adopted a threshold to filter out the ETT tip/Carina bboxes that did not need to be refined. Finally, the remaining ETT tip/Carina bbox was compared with the corresponding ETT/tracheal bifurcation bbox. If the confidence of the ETT tip bbox was lower than that of the ETT bbox, the ETT tip was replaced by the middle point of the lowest two points of the ETT bbox; otherwise, the ETT tip was the center point of the ETT tip bbox. As for the Carina, if the confidence of the Carina bbox was lower than that of the tracheal bifurcation bbox, the Carina was replaced by the center point of the tracheal bifurcation bbox; otherwise, the Carina was the center point of the Carina bbox.
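The three steps above can be sketched as follows for one target (e.g., the ETT tip guided by the ETT bbox). The Gaussian sigma, the confidence threshold, and the ETT-specific fallback point are assumptions for illustration; the paper does not state these values:

```python
import math

def gaussian_weight(point, center, sigma=200.0):
    """Value of a 2D Gaussian mask at `point`; sigma is an assumed hyperparameter."""
    (x, y), (cx, cy) = point, center
    return math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))

def select_best_point(point_boxes, point_scores, guide_box, guide_score,
                      image_size, threshold=0.5):
    """Sketch of the post-process: pick the Gaussian-weighted best point bbox,
    falling back to the guide bbox when the kept box is weak.
    Boxes are (x1, y1, x2, y2); threshold and sigma are assumed values."""
    w, h = image_size
    # Step 2: centre the Gaussian on the guide bbox if it exists, else the image
    if guide_box is not None:
        cx, cy = (guide_box[0] + guide_box[2]) / 2, (guide_box[1] + guide_box[3]) / 2
    else:
        cx, cy = w / 2, h / 2
    best, best_score = None, -1.0
    # Step 1 + weighting: keep the bbox with the highest Gaussian-weighted score
    for box, score in zip(point_boxes, point_scores):
        bx, by = (box[0] + box[2]) / 2, (box[1] + box[3]) / 2
        weighted = score * gaussian_weight((bx, by), (cx, cy))
        if weighted > best_score:
            best, best_score = box, weighted
    if best is None:
        return None
    # Step 3: fall back to a point derived from the guide bbox when confidence is low
    if best_score < threshold and guide_box is not None and best_score < guide_score:
        return ((guide_box[0] + guide_box[2]) / 2, guide_box[3])  # e.g. bottom-centre
    return ((best[0] + best[2]) / 2, (best[1] + best[3]) / 2)
```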

Loss Function
The loss function was based on FCOS [13], appended with a segmentation loss. The loss function can be summarized as:

L = L_cls + L_reg + L_centerness + L_seg (9)

where L_cls is the classification loss implemented with the focal loss [25], L_reg is the regression loss implemented with the GIoU loss [38], L_centerness is the center-ness loss implemented with the binary cross-entropy loss [13], and L_seg is the segmentation loss implemented with the cross-entropy loss plus the Dice loss, as in [39].
The formula of the focal loss is shown in Equation (10):

FL(p_t) = −a_t (1 − p_t)^γ log(p_t), where p_t = p if y = 1 and p_t = 1 − p otherwise (10)

where y ∈ {±1} denotes the ground-truth class, p ∈ [0, 1] denotes the predicted probability for the class with label y = 1, a ∈ [0, 1] denotes a weighting factor for the class with label y = 1 (with a_t defined analogously to p_t). Additionally, γ is set to 2, as in [13]. This loss function mitigates the class imbalance problem by down-weighting easy samples.
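A minimal scalar implementation of this loss; α = 0.25 is the default from the RetinaNet paper and is an assumption here, since the text only fixes γ = 2:

```python
import math

def focal_loss(p: float, y: int, alpha: float = 0.25, gamma: float = 2.0) -> float:
    """Focal loss for one prediction. y is in {+1, -1}; p is the predicted
    probability for the class y = 1."""
    p_t = p if y == 1 else 1.0 - p           # probability of the true class
    a_t = alpha if y == 1 else 1.0 - alpha   # class weighting factor
    return -a_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

Note how the (1 − p_t)^γ term shrinks the loss of well-classified (easy) samples, which is the down-weighting described above.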
The formula of the GIoU loss is shown in Equation (11):

L_GIoU = 1 − GIoU = 1 − (IoU − |C \ (A ∪ B)| / |C|) (11)

where A, B denote two arbitrary shapes and C denotes the smallest convex shape enclosing both A and B. GIoU solves the problem that, if |A ∩ B| = 0, IoU cannot reflect the relationship between A and B. The center-ness loss is L_centerness = L_BCE(centerness, centerness*), where L_BCE is the binary cross-entropy loss, centerness* ∈ [0, 1] denotes the ground-truth value of the center-ness, and l*, r*, t*, b* denote the distances from a feature point to the left, right, top, and bottom of the corresponding ground-truth bbox. FCOS down-weighted the confidence of low-quality predicted bboxes by adding this loss function. Concretely, during the inference phase, the classification score is multiplied by the corresponding predicted center-ness score; thus, the low-quality bboxes can be filtered out by non-maximum suppression (NMS).
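For axis-aligned boxes, Equation (11) reduces to the following; this is a direct transcription for illustration, assuming non-degenerate, non-disjoint-area boxes:

```python
def giou_loss(a, b):
    """GIoU loss for two axis-aligned boxes (x1, y1, x2, y2):
    1 - (IoU - |C \\ (A ∪ B)| / |C|), where C is the smallest enclosing box."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    inter = (max(0.0, min(ax2, bx2) - max(ax1, bx1))
             * max(0.0, min(ay2, by2) - max(ay1, by1)))
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    iou = inter / union
    # smallest enclosing box C
    c_area = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    giou = iou - (c_area - union) / c_area
    return 1.0 - giou
```

For disjoint boxes the loss exceeds 1 and grows with the enclosing area, which is exactly the signal plain IoU cannot provide when |A ∩ B| = 0.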
The λ is a trade-off between the Dice loss and the cross-entropy loss, and is set to 1 in all our experiments; L_Dice is the Dice loss; A, B denote two arbitrary shapes as in Equation (11); L_CE denotes the cross-entropy loss; and p ∈ [0, 1] denotes the predicted probability for the class with label y = 1, as in Equation (10).
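A minimal sketch of this combined segmentation loss on flat lists of per-pixel probabilities and binary targets; the soft-Dice form with an epsilon smoothing term is an assumption, since the text does not give the exact Dice formulation:

```python
import math

def seg_loss(pred, target, lam=1.0, eps=1e-6):
    """Cross-entropy plus lambda * Dice loss (lambda = 1 as in the paper).
    pred: per-pixel probabilities; target: binary ground-truth mask."""
    ce = -sum(t * math.log(max(p, eps)) + (1 - t) * math.log(max(1 - p, eps))
              for p, t in zip(pred, target)) / len(pred)
    inter = sum(p * t for p, t in zip(pred, target))
    dice = 1 - (2 * inter + eps) / (sum(pred) + sum(target) + eps)
    return ce + lam * dice
```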

Dataset and Evaluation Metrics
This paper was approved by the institutional review board (IRB) of the National Cheng Kung University (NCKU) Hospital (IRB number: A-ER-108-305). The chest radiograph dataset provided by NCKU Hospital includes 1,870 portable chest radiographs of intubated ICU patients in DICOM format, and the ground truth (GT) annotations were labeled by two board-certified intensivists. The GT ETT was labelled by four points (P1−4), and the GT tracheal bifurcation was labelled by nine points (P5−13), as shown in Figure 6a. The purpose of this paper was to detect malposition by locating the ETT tip and the Carina. Therefore, this paper adopted two boxes with a size of 300 × 300 to label the feature point of the ETT tip, which was the middle point of P2 and P3, and the feature point of the Carina, which was P9 (the feature points of the ETT tip and the Carina are at the centers of the boxes). Furthermore, the 13 points were reordered to be sequential for generating the GT mask. In summary, the GT became as shown in Figure 6b. In Figure 6b, the green nodes denote the original points labeled by the intensivists, the blue nodes denote the ETT tip and the Carina, the green boxes denote the GT bboxes of the ETT and the tracheal bifurcation, the blue boxes denote the GT bboxes of the ETT tip and the Carina, and the red polygons denote the GT masks of the ETT and the tracheal bifurcation. Finally, this paper used an extra 150 chest radiographs to validate the proposed approach.

This paper applied the object error, the distance error, recall, precision, and accuracy to evaluate the performance. The object error and the distance error were implemented with the Euclidean distance. The object error of the ETT tip was defined as the distance between the center of the predicted ETT tip bbox and the center of the GT ETT tip bbox; the object error of the Carina was defined analogously for the Carina bboxes.
Moreover, the distance between the ETT tip and the Carina was named the ETT–Carina distance. The distance error was defined as the absolute difference between the GT ETT–Carina distance and the predicted ETT–Carina distance. As for recall and precision, this paper defined a successful detection (true positive) as an object error of no more than 10 mm [40]; a false positive indicated that a region was reported as a specific object but did not actually include the object, and a false negative meant that the model did not indicate the GT object. The recall and precision are defined accordingly. Moreover, by Goodman's criteria [41,42], the ideal ETT–Carina distance was in the range of [30, 70] mm. However, another paper [43] recommended that the average correct tube insertion depth was 21 cm in women and 23 cm in men, in which case the ETT–Carina distance could be less than 25 mm. Taking this into consideration, this paper defined the endotracheal intubation as suitable if the ETT–Carina distance was in the range of [20, 70] mm, as in [9].
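The decision rule and metrics above can be expressed compactly; the function names are illustrative only:

```python
def is_suitable(ett_carina_mm, low=20.0, high=70.0):
    """Intubation is suitable if the ETT-Carina distance lies in [20, 70] mm,
    per the definition adopted above."""
    return low <= ett_carina_mm <= high

def recall_precision(tp, fp, fn):
    """Recall = TP / (TP + FN); Precision = TP / (TP + FP). A detection counts
    as a true positive when its object error is no more than 10 mm."""
    return tp / (tp + fn), tp / (tp + fp)
```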

Implementation Details
The architecture was implemented in PyTorch. During training, Adam [44] with an initial learning rate of 10^−4 was adopted. The batch size was set to 2 and the total number of epochs to 120. Moreover, we employed a warm-up policy for the first five epochs and a cosine learning-rate decay policy for the remaining epochs. Additionally, the input images were resized so that their shorter side was 800 pixels and their longer side at most 1333 pixels. In addition, random color jitter, random rotation within the range of [−10, 10] degrees, and random cropping were used during training. Furthermore, the dataset provided by NCKU Hospital was equally divided into five folds for 5-fold cross-validation. Finally, all of the experiments were executed on an Nvidia GeForce RTX 2080 Ti GPU.

As shown in Table 1, by the definition of "suitable position", this paper achieved 88.82% accuracy on the NCKU dataset and 90.67% on the external validation, based on the annotations from board-certified intensivists. Moreover, the mean of the ETT–Carina distance error was 5.333 mm and its standard deviation was 6.240 mm; in [9], the mean and standard deviation of the ETT–Carina distance error were 6.9 mm and 7.0 mm, respectively. Table 2 shows that the distance error was less than 10 mm for 85.83% of images on the NCKU dataset and for 84.00% of images on the external validation. Tables 3 and 4 show the confusion matrices of diagnosis on the NCKU dataset and the external validation.
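The optimizer schedule described in the implementation details (warm-up for five epochs, then cosine decay over the remaining 115) can be sketched as follows; a linear warm-up shape is an assumption, since the text does not specify it:

```python
import math

def lr_at_epoch(epoch, base_lr=1e-4, warmup_epochs=5, total_epochs=120):
    """Learning rate at a given epoch: linear warm-up for the first
    `warmup_epochs`, then cosine decay towards zero for the rest."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs          # linear ramp-up
    t = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1 + math.cos(math.pi * t))        # cosine decay
```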

For the ETT tip and the Carina, Table 5 shows that the recall and precision of the ETT tip were 90.96% and 91.60%, respectively, and that the recall and precision of the Carina were 93.90% and 94.10%, respectively. Table 6 shows that the mean and standard deviation of the object error on the ETT tip were 4.304 mm and 5.526 mm, respectively, and that the mean and standard deviation of the object error on the Carina were 4.118 mm and 3.655 mm, respectively. Table 7 shows that the object error of the ETT tip was no longer than 10 mm for 90.96% of images, and Table 8 shows that the object error of the Carina was no longer than 10 mm for 93.90% of images. Although most of the performance metrics dropped on the external validation, the performance on the ETT–Carina distance still showed the robustness of our architecture in detecting ETT malposition.

Comparison with the SOTA
This section compares the proposed method with the state-of-the-art (SOTA) method proposed by Chen et al. [11]. Both methods were evaluated on the same datasets. Table 9 presents the accuracy of malposition detection and the ETT–Carina distance error, and Table 10 presents the error distribution of the ETT–Carina distance. The percentages in brackets denote the rate of improvement or degradation in performance. Based on the tables, most of the results of the proposed method were better than the SOTA. It is worth mentioning that the proposed method had higher accuracy on both datasets, which indicates that it can detect ETT malposition more accurately. Tables 11 and 12 present the object error and the error distribution for the ETT tip, and Tables 13 and 14 present the object error and the error distribution for the Carina. In general, our method was worse at detecting the ETT tip but better at detecting the Carina compared with the SOTA; however, the improvement on the Carina outweighed the degradation on the ETT tip. Therefore, our method can detect malposition more effectively.

Ablation Study
The ablation studies were performed on the NCKU dataset with fold 5 as the testing data to verify the effectiveness of our method. This section focuses only on the accuracy, the mean distance error, and the standard deviation. Table 15 compares the GA with and without a softmax function. These results showed that the GA without a softmax activation function achieved a better outcome; therefore, CTFA employed GA without a softmax function.

In Table 16, the c and k in brackets after SA denote the pooled channel number and the kernel size of the following convolution, and SE in brackets indicates whether the SA adopts an SE block as the channel attention. From the first six rows, we observed that the performance was best when the output channel number of the channel-wise max pooling and average pooling was 8 and the following kernel size was 1. Therefore, CTFA employed SA with 8 pooled channels followed by a point-wise convolution. Furthermore, comparing the fourth row with the last row, we noticed that employing the SE block in SA increased the performance.

Tables 17 and 18 display the results of adopting existing attention modules in FCOS [13]. We calculated the number of parameters and giga floating-point operations (GFLOPs) for each attention module with an image size of 224 × 224. The percentages in brackets in the "Parameter (M)" and "GFLOPs" columns are the increments relative to FCOS. Table 17 shows that the SE block [36], nonlocal block [33], nonlocal with CSP block [45], and CBAM block [15] brought no benefit to the accuracy of ETT-Carina, whereas the CCAM block [14] and the SA block proposed in this paper improved the accuracy. Moreover, the SA block was more lightweight than the CCAM block and achieved higher accuracy, as shown in Table 18. These results illustrate the efficiency and effectiveness of the SA block.
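The SA configuration selected above (channel-wise max/average pooling to 8 channels, a point-wise convolution, and an optional SE channel attention) can be sketched in PyTorch as follows. The grouping strategy, layer names, and wiring are our assumptions for illustration, not the authors' exact implementation.

```python
import torch
import torch.nn as nn


class ScaleAttention(nn.Module):
    """Sketch of the SA block described in the ablation: channel-wise
    max/average pooling down to `pooled` channels, a point-wise (1x1)
    convolution producing a spatial attention map, and an optional
    SE-style channel attention (which Table 16 suggests helps)."""

    def __init__(self, channels, pooled=8, reduction=16, use_se=True):
        super().__init__()
        assert channels % pooled == 0
        self.pooled = pooled
        # point-wise conv fuses the 2*pooled pooled maps into one attention map
        self.fuse = nn.Conv2d(2 * pooled, 1, kernel_size=1)
        self.use_se = use_se
        if use_se:
            self.se = nn.Sequential(
                nn.AdaptiveAvgPool2d(1),
                nn.Conv2d(channels, channels // reduction, 1),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels // reduction, channels, 1),
                nn.Sigmoid(),
            )

    def forward(self, x):
        b, c, h, w = x.shape
        # channel-wise pooling: split the C channels into `pooled` groups
        # and take max / mean within each group (one possible reading of
        # "pooled channel number")
        g = x.view(b, self.pooled, c // self.pooled, h, w)
        stats = torch.cat([g.max(dim=2).values, g.mean(dim=2)], dim=1)
        spatial = torch.sigmoid(self.fuse(stats))  # (B, 1, H, W)
        out = x * spatial
        if self.use_se:
            out = out * self.se(out)  # SE channel attention
        return out
```

The attention map rescales the input feature map spatially, while the SE branch rescales it channel-wise, matching the two kinds of refinement compared in Table 16.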

Fusion Method
This part explores the fusion method. The second row in Table 19 denotes that the channel-reduced C5 passes directly through SA, followed by GA. The third row denotes that the channel-reduced C5 goes through SA and GA in parallel, after which the output feature maps are concatenated and integrated by a point-wise convolution, batch normalization, and ReLU. The last row denotes that the channel-reduced C5 passes directly through GA, followed by SA. As shown in Table 19, the last row demonstrated that rescaling the long-range relationships by the local relationships improved the performance. Moreover, this was the best connection method in our ablation studies; thus, CTFA adopted this method to extract better features in this paper.

Table 20 presents the results of fusing a global-modelling attention and a scale attention. Here, *2 denotes stacking two copies of the same attention module. We observed that directly stacking two copies of the same attention module reduced the accuracy compared with the performance in Table 17. Specifically, stacking two CSPnonlocal blocks degraded the accuracy from 86.10% to 85.56%, and stacking two CCAM blocks degraded it from 86.90% to 86.10%. Stacking two SA blocks also yielded a worse result. However, stacking a global-modelling attention with a scale attention increased the performance, as shown in the last two rows. This result showed that the global-modelling attention captures long-range relationships while the scale attention refines the attention map effectively; therefore, stacking these two kinds of attention modules achieved higher performance.

In Table 21, the second row denotes that CTFA is adopted in FCOS, and "Seg" denotes that CTFA and the mask branch are adopted in FCOS at the same time.
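The best-performing wiring in Table 19 (GA first, then SA) can be sketched as a small wrapper module. Here `ga` and `sa` are treated as interchangeable black-box attention modules, and the channel-reduction convolution is an assumption about how C5 is reduced; only the sequential coarse-to-fine connection is taken from the text.

```python
import torch
import torch.nn as nn


class CoarseToFineAttention(nn.Module):
    """Sketch of the best connection in Table 19: the channel-reduced C5
    feature first passes through global-modelling attention (GA) to capture
    long-range (coarse) relationships, and the result is then rescaled by
    scale attention (SA) using local (fine) relationships."""

    def __init__(self, in_channels, reduced=256, ga=None, sa=None):
        super().__init__()
        # point-wise conv for channel reduction (assumed detail)
        self.reduce = nn.Conv2d(in_channels, reduced, kernel_size=1)
        self.ga = ga if ga is not None else nn.Identity()
        self.sa = sa if sa is not None else nn.Identity()

    def forward(self, c5):
        x = self.reduce(c5)
        coarse = self.ga(x)   # long-range relationships
        fine = self.sa(coarse)  # local rescaling of the coarse attention
        return fine
```

Swapping the order of `self.ga` and `self.sa`, or running them in parallel and concatenating, reproduces the alternative wirings compared in Table 19.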
Moreover, the text in brackets after "Seg" in the "Method" column denotes which mask annotation our approach employed at the mask branch, and "Fusion" in the "Method" column denotes that we directly fuse the feature maps from P2 to P5. Comparing the second and third rows, we noticed that the mask branch did not bring advantages to the detection model. However, when we used only the annotations of the ETT and the tracheal bifurcation, the performance improved. This phenomenon might be caused by the occlusion of the annotations. Concretely, the annotations of the ETT and the tracheal bifurcation were always occluded by the annotations of the ETT tip and the Carina. Therefore, the mask branch might focus on the ETT tip and the Carina, which is similar to the original detection model (FCOS + CTFA). However, if we adopted only the annotations of the ETT and the tracheal bifurcation, the detection model might pay more attention to the ETT and the tracheal bifurcation and thus capture useful context information for detecting the ETT tip and the Carina. Comparing the last two rows, we found that the fusion policy further improved the performance. Considering the accuracy and the object error, the method in the last row achieved the best result on the internal dataset with fold 5 as the testing data; it was therefore the final architecture employed in this paper for detecting malposition of the ETT.

The red bboxes and points in these figures are the GT ETT/bifurcation bboxes and the positions of the GT ETT tip/Carina, respectively. The green polygon is the GT mask of the ETT and the bifurcation. The blue bbox and point are the predicted ETT bbox and ETT tip, respectively. The yellow bbox and point are the predicted bifurcation and Carina, respectively. Specifically, without the post-process, the model might leave more than one predicted ETT tip/Carina, such as where the red arrow points in Figure 7a.
However, with the post-process, the extra points are removed, as shown in Figure 7b. Moreover, with the refinement step of the post-process, the feature point of the ETT tip/Carina can be further refined, as shown in Figure 8. Concretely, the object error of the Carina was corrected from 8.469 mm to 1.319 mm.
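The selection step of the post-process (keeping at most one prediction per class before refinement) can be sketched as follows. The `(class_id, score, box)` tuple format is illustrative, not the paper's exact data structure, and the bbox-based refinement step is omitted.

```python
def keep_best_per_class(detections):
    """Sketch of the post-process selection: the detector may emit several
    ETT tip / Carina candidates per image, so keep only the highest-scoring
    detection for each class, removing the extra points."""
    best = {}
    for cls, score, box in detections:
        if cls not in best or score > best[cls][1]:
            best[cls] = (cls, score, box)
    return list(best.values())
```

For example, an image with two ETT-tip candidates and one Carina candidate would be reduced to one detection per class, mirroring the extra-point removal illustrated in Figure 7.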
We observed that the segmentation results of the ETT tip were always near the GT ETT tip. However, although the mask branch improved the detection result, the segmentation result was not good enough to replace the detection result. As shown in Table 22, if we corrected the detection result of the ETT tip with the segmentation result whenever the distance between them exceeded 100 pixels, the performance on the ETT decreased, and thus the performance on the ETT-Carina also declined. Therefore, in the post-process, we used only the bbox results of the ETT and the tracheal bifurcation to refine the positions of the ETT tip and the Carina when these points were not good enough.

In this part, the red bboxes and points denote the GT ETT/bifurcation bboxes and the positions of the GT ETT tip/Carina, respectively. The blue bbox and point are the predicted ETT bbox and ETT tip, respectively. The green bbox and point are the predicted bifurcation and Carina, respectively. The light blue point is the position of the ETT tip produced by the mask branch. In Table 23, the first row demonstrates the good results, the second row the medium results, and the third row the bad results. We noticed that the performance was better when an image had clear locations of the ETT tip and the Carina. Moreover, a life-supporting device might blur the location of the ETT tip, and the shadow of the heart might occlude the position of the Carina, as shown in the medium case. Apart from these problems, the angle of the CXR might also degrade the performance of the proposed method, as shown in the bad cases.

Conclusions
This paper proposed an end-to-end architecture that improves FCOS [13] for detecting the positions of the ETT tip and the Carina. First, a Coarse-to-Fine Attention (CTFA) module was designed to capture long-range relationships with global-modelling attention (GA) and rescale the feature values with local relationships captured by scale attention (SA). Then, a mask branch was adopted to enhance the feature representation of the backbone and neck. Finally, a post-process algorithm was employed to ensure that only one detection result per class remains in an image and to further refine the feature points of the ETT tip and the Carina. Experiments on the chest radiograph datasets provided by National Cheng Kung University Hospital demonstrated that the proposed architecture effectively improves the performance of FCOS: the mean malposition accuracy reached 88.82% and the ETT-Carina distance error was less than 5.333 ± 6.240 mm on the internal dataset. Furthermore, our method indicates the positions of the ETT, ETT tip, tracheal bifurcation, and Carina on an image; it is therefore more interpretable for detecting malposition of the ETT than classification-based methods. This end-to-end deep-learning-based method could assist intensivists in monitoring the position of the ETT in ICU patients.
Although our proposed method achieved promising results, two shortcomings remain to be overcome. First, the annotation of the ETT tip is labelled on the slanted edge at the bottom of the ETT; however, a feature point on an edge is not as distinctive as one on a corner, so a more accurate ETT tip annotation might further improve the performance. Second, the fusion method adopted for GA and SA is simple, as is the additional mask branch, so further investigation of the interaction between them might improve the performance. In the future, we will focus on these shortcomings, aiming to provide a more reliable model for intensivists.

Informed Consent Statement: Consent from each patient was waived because this clinical study is a retrospective review of medical records produced during the patients' treatment.

Data Availability Statement:
The data that support the findings of this study are not publicly available because they contain information that could compromise the privacy of the research participants.