Tubular Structure Segmentation via Multi-Scale Reverse Attention Sparse Convolution

Cerebrovascular and airway structures are tubular structures used for transporting blood and gases, respectively, providing essential support for the normal activities of the human body. Accurately segmenting these tubular structures is the basis of morphology research and pathological detection. Nevertheless, accurately segmenting these structures from images presents great challenges due to their complex morphological and topological characteristics. To address this challenge, this paper proposes a framework UARAI based on the U-Net multi-scale reverse attention network and sparse convolution network. The framework utilizes a multi-scale structure to effectively extract the global and deep detail features of vessels and airways. Further, it enhances the extraction ability of fine-edged features by a joint reverse attention module. In addition, the sparse convolution structure is introduced to improve the features’ expression ability without increasing the model’s complexity. Finally, the proposed training sample cropping strategy reduces the influence of block boundaries on the accuracy of tubular structure segmentation. The experimental findings demonstrate that the UARAI-based metrics, namely Dice and IoU, achieve impressive scores of 90.31% and 82.33% for cerebrovascular segmentation and 93.34% and 87.51% for airway segmentation, respectively. Compared to commonly employed segmentation techniques, the proposed method exhibits remarkable accuracy and robustness in delineating tubular structures such as cerebrovascular and airway structures. These results hold significant promise in facilitating medical image analysis and clinical diagnosis, offering invaluable support to healthcare professionals.


Introduction
Cerebrovascular and airway structures are vital tubular structures in the human body that play key roles in brain blood transport and respiratory gas exchange, respectively. very challenging [6,18]. Improving edge segmentation of such tubular structures is key to subsequent quantification of vessels and airway structures. In summary, improving the segmentation accuracy of branch-like tubular structures such as brain vessels and airways is very important, and enhancing their branch edge segmentation accuracy is crucial to improving the segmentation of tubular structures.
To address this limitation, this paper proposes a deep learning network UARAI (U-Net multi-scale feature aggregation reverse attention sparse convolution model) for segmenting complex tubular structures. The model comprehensively considers the structural characteristics of tubular structures at different scales, enhances the learning of features such as edge details, micro-vessels, and micro-airways, and aims to improve the segmentation accuracy. The main contributions of this work are summarized as follows: (a) In this paper, a multi-scale feature aggregation method is proposed and validated, which can fully extract and fuse the cerebrovascular and airway features with different thicknesses at different scales. The proposed method effectively solves the problem of differences in feature expression at the same scale, thus improving the segmentation accuracy. (b) Our paper introduces a novel reverse attention module combined with sparse convolution to guide the network effectively. By leveraging reverse attention mechanisms, this module enhances foreground detection by emphasizing the background and excluding areas of prediction. Moreover, it allocates reverse attention weights to extracted features, thereby improving the representation of micro-airways, microvessels, and image edges. The utilization of sparse convolution further improves overall feature representation and segmentation accuracy. (c) Through extensive experimental validation, we investigate the impact of sliding window sequencing and input image dimensions on the segmentation of tubular structures, including cerebral blood vessels and airways. The insights gained from this study contribute to the advancement of artificial intelligence techniques in medical image analysis, specifically focusing on enhancing the segmentation of tubular structures.

Related Work
In this section, we briefly review the related work and state-of-the-art approaches for tubular structure segmentation, feature fusion, and 3D attention mechanisms for medical images.

Tubular Structure Segmentation
In medical image processing, methods for segmenting tubular structures fall mainly into two categories: traditional methods and deep-learning-based methods.
Traditional methods: Traditional medical image segmentation methods originate in traditional imaging techniques, primarily relying on image gray features for segmentation. For cerebrovascular segmentation, Park et al. [19] proposed a connectivity-based local adaptive threshold algorithm for carotid artery segmentation. The algorithm adaptively segments the cerebrovascular structures based on the connectivity preserved between consecutive slices of the image and the local threshold set on each slice. Wang et al. [20–22] used Otsu's threshold to classify MRA images into foreground and background and then compared the statistical distributions of foreground and background to extract cerebrovascular structures from the foreground. Neumann et al. [23] combined vessel-enhanced filtering with subsequent level set segmentation, where level set segmentation was implemented using gradient descent and local minimum energy functions. Subsequent studies have also proposed other level set segmentation methods. However, because these methods are sensitive to grayscale values, which strongly affect the algorithm's convergence, segmentation remains difficult [24]. Subsequently, Frangi et al. [25] proposed a Hessian-matrix-based method, known as the Frangi algorithm, which calculates the local Hessian matrix of each pixel in an image to determine the vascular structure's location precisely. This approach has been shown to significantly enhance the performance of vessel segmentation compared to traditional segmentation methods.
For airway segmentation, early works by Mori and Sonka et al. [26,27] used the difference in grayscale intensity between airway lumen and wall, combined with region-growing algorithms, for airway lumen segmentation. Tschirren et al. [28] proposed a fuzzy connectivity-based airway segmentation method that uses small adaptive regions to follow the airway branching. Duan et al. [29] proposed combining a dual-channel region-growing algorithm, grayscale morphological reconstruction, and leakage elimination. The method first performs region-growing on one channel to obtain a rough airway tree, then performs region-growing and grayscale morphological reconstruction on another channel to detect distant airways, and finally refines the airway tree by removing holes and leaks using the leakage detection method.
While traditional methods can segment tubular structures to some extent, image quality and differences in imaging parameters often influence their performance. For instance, threshold segmentation algorithms can efficiently separate foreground and background but have difficulty finding a threshold that separates the target object from noise with comparable grayscale values [30]. Additionally, the grayscale intensity of cerebrovascular and airway branches resembles the background, and their peripheral structures are intricate and complicated. As a result, traditional threshold segmentation and region-growing methods often struggle to achieve precise segmentation.
Deep-learning-based methods: In recent years, medical image segmentation has benefited from applying artificial intelligence (AI) technologies. Among these, deep-learning techniques are considered the most sophisticated and commonly used [31]. The neuro-heuristic analysis algorithm [32] has made significant advances in the field of medical image segmentation by providing a deeper analysis of images for classification, segmentation, and recognition. However, it requires a large volume of high-quality image data for medical image segmentation tasks. Additionally, the complex network design of neuro-heuristic analysis algorithms and their empirical nature result in lower interpretability compared to traditional machine-learning algorithms. The Fox algorithm [33] has performed well in lung segmentation of medical images by automatically learning specific features such as lung position and shape for more accurate and efficient segmentation results. However, it requires a large amount of training data and consumes considerable computational resources and time, resulting in lower segmentation efficiency. Among existing deep learning networks, U-Net is widely used in medical image segmentation tasks with scarce labeled data due to its small data requirement and fast training speed [12].
In the cerebrovascular segmentation task, Tetteh et al. [14] provided synthesized brain vessel tree data and used it for transfer learning to achieve efficient, robust, and universal vessel segmentation. Livne et al. [15] used a 2D U-Net to segment cerebrovascular structures and compared it with Half-U-Net, which uses half the number of channels, finding that Half-U-Net performed as well as U-Net on the evaluation metrics. Lee et al. [16] proposed the Spider U-Net, which is based on the U-Net structure and enhances the connectivity of blood vessels between axial slices by inserting long short-term memory (LSTM) into the baseline model. At the same time, using the striding stencil (SS) data transfer strategy greatly improved the brain vessel segmentation effect. Guo et al. [11] proposed the M-U-Net model, which consists of three 2D U-Nets and fuses image features in three directions, inheriting the excellent performance of 2D U-Net in image segmentation and making up for the deficiency of a single U-Net in extracting 3D image axial features. Cicek et al. [13] designed a 3D U-Net segmentation network based on 2D U-Net, incorporating image z-axis information to improve segmentation accuracy. Hilbert et al. [8] proposed a high-performance, fully automatic segmentation framework, BRAVE-NET, combining deep supervised networks and aggregating rough and low-resolution feature maps into the final convolution layer, effectively fusing multi-scale features. Min et al. [34] introduced multi-scale inputs and residual mechanisms into the U-Net network to improve the model's performance while maintaining generalization ability. Oktay et al. [17] introduced a novel module known as the Self-Attention Gate module, which enhances the significance of local regions and improves the model's sensitivity to the foreground, ultimately enhancing segmentation accuracy. Mou et al. [35] introduced the CS2-Net network structure for automatic detection of curved structures in medical and biomedical images. They incorporated self-attention mechanisms in both the encoder and decoder to enhance the features of curved structures. Xia et al. [36] proposed a reverse edge attention module and an edge-enhanced optimized loss to emphasize the importance of voxels along 3D body edges. Their approach aimed to better capture and preserve spatial edge information. Chen et al. [37] developed an attention and generative adversarial network model for brain vessel segmentation. They utilized multilevel features and dense connections to establish local and global associations. Additionally, they incorporated attention mechanisms in the discriminator to filter low-level features, balance the proportion of vessel class, and improve segmentation performance. Banerjee et al. [38] introduced the multi-task deep CNN (MSD-CNN) approach, which learns the voxel-wise centrality of the surface of cerebral vessels. This method adds additional regularization to the segmentation task. Jiang et al. [39] proposed the Axis-Projection Attention Network (APA U-Net) for 3D medical image segmentation, with a specific focus on small-object segmentation. The network employs a projection strategy that projects 3D features onto three orthogonal 2D planes to capture contextual attention from different viewpoints. This enables the network to filter out redundant feature information and retain crucial details of small lesions in 3D scans.
For the airway segmentation task, Meng et al. [40] presented a method that combines 3D deep learning with image-based tracking to automatically extract airways. They employed adaptive cube volume analysis based on 3D U-Net models, where the 3D U-Net is used to extract the airway region within the volume of interest (VOI) for precise airway segmentation. Garcia Uceda et al. [41] used various data augmentation methods based on the 3D U-Net network to achieve accurate airway segmentation, and Garcia Uceda et al. [42] proposed another method combining 3D U-Net with graph neural networks, which uses graph convolution layers instead of ordinary convolution layers, achieving accurate airway tree segmentation with fewer training parameters. Wang et al. [43] used U-Net with spatial recurrent convolutional layers and a radial distance loss function (RD Loss) to better segment tubular structures. Tan et al. [44] compared the methods of 12 teams in the airway segmentation challenge task at the 4th International Symposium on Image Computing and Digital Medicine (ISICDM 2020). They found that nine teams adopted U-Net networks or other forms of U-Net, including the forward attention mechanism, reverse attention mechanism, and multi-scale feature information fusion structure, and analyzed the effect of different networks on airway segmentation.

Multi-Scale Feature Fusion and Attention Mechanism
In medical image segmentation, feature fusion combines multiple heterogeneous features into a feature with high discriminative ability, improving the segmentation accuracy. In medical image segmentation networks, low-level feature layers have a high resolution and contain rich primary features, such as position, shape, and texture information. High-level feature layers have strong semantic information and a large receptive field, but low resolution and poor perception of details. Lin et al. [45] proposed the feature pyramid network (FPN), which generates a feature map that simultaneously possesses high resolution and deep-level information. FPN allows the various levels of feature maps to complement each other, forming a multi-scale feature map by adding special lateral connections during the up-sampling and down-sampling process. This feature map can be used to detect objects of different sizes, which addresses the difficulty of multi-scale object detection and improves detection accuracy. He et al. [46] proposed a Spatial Pyramid Pooling (SPP) structure that can handle input images at different scales, effectively addressing the problem of varying input image sizes and improving the network's classification performance. Zhao et al. [47] proposed the Pyramid Scene Parsing Network (PSP-Net), which uses dilated convolution to process context features of different regions to obtain global context information features, addressing the utilization of global context features and multi-scale feature processing in semantic segmentation. Chen et al. [48] proposed the Atrous Spatial Pyramid Pooling (ASPP) module, which combines multiple feature maps with different resolutions obtained by dilated convolutions to obtain a feature map with a global receptive field, significantly improving the performance of image segmentation.
In summary, fusing features at different scales is an important means of improving segmentation performance.
The introduction of the attention mechanism has greatly improved feature selection ability in many computer vision tasks. Similarly, it has been widely applied in medical image segmentation. Tran introduced spatial attention in convolutional neural networks, allowing the network model to learn features from different regions of the image more accurately. Later, the spatial attention mechanism was gradually applied to medical image analysis, achieving good results. Hu et al. [49] proposed an SE (Squeeze-and-Excitation) mechanism to weight and rescale feature maps using the importance ratio of each channel, which is widely used for medical image segmentation and classification. Woo further improved the model performance by adding the spatial attention mechanism based on SE attention. In early CT abdominal vessel segmentation, Oktay et al. [17] proposed a network structure with attention gates, enabling the network to automatically focus on organ structures in the image. Fan et al. [50] proposed a reverse attention U-Net structure for polyp segmentation, in which the reverse attention (RA) module implicitly erases the predicted region and highlights the background, guiding the network to gradually explore the polyp region and enhance the edge feature learning, improving segmentation accuracy. Mou et al. [35] designed a CS2-Net for detecting curved structures in medical images, such as blood vessels, by introducing self-attention, spatial attention (SAB) and channel attention (CAB). Xia et al. [36] proposed a reverse attention mechanism for edge enhancement features and introduced an edge-reinforced loss for vascular shape segmentation. While various attention mechanisms can effectively enhance feature representativeness, challenges still exist in edge segmentation of complex structures and microstructures.

Datasets
This article demonstrates the wide applicability of UARAI in 3D tubular structure segmentation by validating it on public cerebrovascular data and on airway tree data provided by cooperating organizations (as shown in Table 1). The cerebrovascular dataset comes from the open dataset MIDAS [51], which contains MRA images of 109 healthy volunteers aged 18 to over 60. These images were acquired with a standardized protocol on a 3T MRI scanner with a voxel size of 0.5 mm × 0.5 mm × 0.8 mm and a uniform sampling resolution of 448 × 448 × 128. The segmentation labels were initially annotated using 3D Slicer and ITK-SNAP software (version 3.8.0). Subsequently, two professional doctors manually corrected and labeled each piece of cerebrovascular data to create a binary label image in which the background is represented as 0 and the blood vessels as 1. The airway dataset consists of 400 samples obtained from lung CT data provided by Guangzhou Medical University. After excluding images of poor quality, a total of 380 samples were used for experimentation. The voxel size of the images was 0.67 mm × 0.67 mm × 1 mm, and the scanning resolution was uniformly resampled to 512 × 512 × 320. The labeled images were generated through interactive annotation conducted by three professional radiologists.

Data Pre-Processing and Sample Cropping
Data pre-processing: TOF-MRA images collected by hospitals typically include the skull. Since the grayscale values of the skull and blood vessels are similar, the neural network may extract interfering features from the skull when extracting cerebrovascular features, so the skull must be removed. This study used the FSL [52] and HD-BET [53] tools to extract the brain region, as depicted in Figure 1b. To diversify the cerebrovascular samples, data augmentation methods such as random flipping, random affine transformation, and elastic deformation were employed [42]. For the lung CT dataset, non-pulmonary regions were removed by cropping, and augmentation such as random flipping and rotation was applied, as shown in Figure 1d. Since the intensity values of the MRA and lung CT images are large and widely scattered, Z-score normalization was used:
$$x_{out} = \frac{x_{in} - \mathrm{mean}(x_{in})}{\mathrm{std}(x_{in})}$$
Here, $x_{in}$, $\mathrm{mean}(x_{in})$, $\mathrm{std}(x_{in})$, and $x_{out}$ respectively represent the input image, the mean of the input image grayscale, the standard deviation of the input image grayscale, and the normalized output image. The advantages include a reduction of computational complexity, increased utilization of computer resources, and improved convergence rate and efficiency of the network.
Training sample cropping: In medical imaging, images such as MRI, pathological images, and 3D CT images often have large file sizes. Directly training models on these images can be impractical and inefficient [54,55]. Therefore, this study trained the model on image patches cropped from the high-resolution 3D TOF-MRA and 3D lung CT images. This increases the number of training samples and reduces the GPU memory cost of model training. To extract the patches, we used a sliding window approach combined with random cropping. In addition, the size of the patch is an important factor affecting the model's performance [56]. We set the patch size to 64 × 64 × 32, with a cross-sectional size of 64 × 64. Since the MRA image has a small extent along the z-axis, the patch size along the z-axis was set to 32. The patch z-axis for lung airway data was also set to 32 to ensure consistent training parameters. During the prediction phase, we also used a sliding window strategy, predicting individual patches one by one and then stitching the predicted results back to the original image size to obtain the segmentation results.
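The normalization and patch-based prediction described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names (`zscore`, `window_starts`, `sliding_window_predict`), the stride, the averaging of overlapping predictions, and the thresholding stand-in for the trained network are all assumptions made for the example. Only the patch size of 64 × 64 × 32 comes from the text.

```python
import itertools
import numpy as np


def zscore(volume):
    """Z-score normalization: (x - mean) / std, computed over the whole volume."""
    return (volume - volume.mean()) / (volume.std() + 1e-8)


def window_starts(size, patch, stride):
    """Start indices along one axis; the last window is shifted to end at the border."""
    if size <= patch:
        return [0]
    starts = list(range(0, size - patch + 1, stride))
    if starts[-1] != size - patch:
        starts.append(size - patch)
    return starts


def sliding_window_predict(volume, patch_size, stride, predict_patch):
    """Predict patch by patch, then stitch by averaging overlapping predictions."""
    prob_sum = np.zeros(volume.shape, dtype=np.float32)
    count = np.zeros(volume.shape, dtype=np.float32)
    axes = [window_starts(s, p, st)
            for s, p, st in zip(volume.shape, patch_size, stride)]
    for x, y, z in itertools.product(*axes):
        region = (slice(x, x + patch_size[0]),
                  slice(y, y + patch_size[1]),
                  slice(z, z + patch_size[2]))
        prob_sum[region] += predict_patch(volume[region])
        count[region] += 1.0
    return prob_sum / count


# Toy volume (much smaller than 448 x 448 x 128) and a stand-in for the network.
rng = np.random.default_rng(0)
volume = zscore(rng.normal(100.0, 20.0, size=(96, 96, 48)).astype(np.float32))
stitched = sliding_window_predict(
    volume, patch_size=(64, 64, 32), stride=(32, 32, 16),
    predict_patch=lambda p: (p > 0).astype(np.float32),  # hypothetical model
)
print(stitched.shape)
```

Shifting the last window so that it ends exactly at the volume border avoids padding. Averaging overlapping predictions is one common way to soften block-boundary effects, though the text does not state which stitching rule was used.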

UARAI Overall Framework
The U-Net framework has been widely applied in medical image segmentation and is considered one of the most promising frameworks [12]. In this paper, we propose a novel network framework, UARAI, which is based on the 3D U-Net architecture and integrates advanced techniques such as multi-scale feature aggregation, reverse attention, and an inception sparse convolution structure. This framework can achieve high-precision automatic segmentation of cerebrovascular and airway structures. The network input is a cerebrovascular patch $x \in \mathbb{R}^{1 \times H \times W \times D}$, where H, W, and D represent length, width, and depth, respectively. UARAI outputs foreground and background segmentation probability maps $y \in \mathbb{R}^{2 \times H \times W \times D}$; the specific network structure is illustrated in Figure 2.
Figure 2. The overall network framework of the UARAI network. The overall architecture is constructed based on the 3D U-Net. Firstly, at the encoding stage, the multi-scale feature aggregation module (MSFA) is applied to integrate features from different scales. In addition, a reverse attention module is incorporated after the skip connection to calculate the reverse attention coefficients. The coefficients are then used to re-weight the foreground and thus enhance the feature expression ability.

The proposed UARAI segmentation network is based on the 3D U-Net framework. The encoder performs image down-sampling and multi-scale feature extraction, while the decoder reconstructs high-resolution feature maps through up-sampling and skip connections. Each layer in the encoder path consists of multiple convolutional layers for feature extraction. Instead of a pooling layer, the network uses a stride-2 convolutional layer with learnable kernels for dimensionality reduction, which increases the network's representation ability. However, some low-level features are lost during dimensionality reduction in the encoder path, and the lack of these critical shallow features can degrade the segmentation results. Considering the uneven thickness of blood vessels and airways, and the differences in the expression of coarse tube-like structures at different scales, this paper adds a multi-scale feature aggregation module (MSFA) to the encoder path. This module aggregates shallow and deep features at different scales to help the network learn features better at different scales and improve feature extraction ability.
In the decoder path, the integration of low-level and high-level features is achieved by utilizing skip connections to combine the encoded feature map with the decoded feature map. This process ensures a comprehensive integration of information at different levels. Moreover, for accurate segmentation of small and intricate target branches such as blood vessels and airways, the inclusion of edge information is crucial. To address this, a reverse attention module (RAM) is incorporated into the decoder path. The RAM enhances the extraction and expression of edge features in the terminal branches. By multiplying the reverse attention coefficient with the feature map after the skip connection, the RAM dynamically adjusts the weight of the edge features. This adaptation aims to improve the accuracy of edge segmentation.
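The re-weighting step can be illustrated with a small NumPy sketch. It assumes the reverse attention coefficient is computed as one minus the sigmoid of a coarse foreground prediction, as in the reverse attention design of Fan et al. [50]; whether the RAM in UARAI uses exactly this form is an assumption, and the names `reverse_attention` and `coarse_logits` are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def reverse_attention(features, coarse_logits):
    """Re-weight a feature map with reverse attention coefficients.

    features:      (C, H, W, D) feature map after the skip connection
    coarse_logits: (1, H, W, D) foreground logits from a deeper decoder stage
    """
    # Coefficients are small where foreground is already confidently predicted
    # and larger on uncertain voxels and background.
    weights = 1.0 - sigmoid(coarse_logits)
    return features * weights  # broadcast over the channel axis


features = np.ones((4, 8, 8, 4), dtype=np.float32)
coarse_logits = np.full((1, 8, 8, 4), -6.0, dtype=np.float32)  # background
coarse_logits[:, 2:6, 2:6, :] = 6.0   # confidently predicted vessel core
coarse_logits[:, 2, 2:6, :] = 0.0     # uncertain voxels on one edge of the core
refined = reverse_attention(features, coarse_logits)
print(refined[0, 4, 4, 0], refined[0, 2, 4, 0], refined[0, 0, 0, 0])
```

In this toy case the confidently predicted core is suppressed, while uncertain edge voxels keep a substantially larger weight, which is how the mechanism steers later layers toward boundaries and thin branches.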
During the network output stage, the final three layers of the decoded output undergo operations such as up-sampling and convolutional fusion. These operations refine the feature maps and ultimately generate the segmentation results, which include both foreground and background information as $y \in \mathbb{R}^{2 \times H \times W \times D}$.
The overall implementation process of the segmentation model is as follows. In the encoding phase, the input is a batch of patches. Each layer first extracts patch features through two convolutional modules and then, instead of a pooling operation, reduces the dimension of the features through a learnable convolutional layer with a kernel size of 3 × 3 × 3 and a stride of 2. The features are then normalized with InstanceNorm3d and activated with LeakyReLU, producing non-linear features. Residual connections are also added in each layer to prevent excessive feature loss and vanishing gradients. In the decoding phase, skip connections are first used to concatenate the encoding layer features with the decoding layer features. Then, reverse attention modules are used to reassign feature weights, adaptively enhance edge features, and obtain decoding layer features through convolution. Up-sampling is performed through interpolation to reach the next decoding stage. The encoding and decoding operations are repeated four times each, resulting in segmentation results of the same size as the original image. Finally, the soft-max function normalizes the probability of foreground and background in the output.
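As a worked check of the sizes involved, the sketch below traces a 64 × 64 × 32 patch through four stride-2 convolutions with a 3 × 3 × 3 kernel and then applies a soft-max to the two output channels of a single voxel. The padding of 1 is an assumption (the text gives only the kernel size and stride), and the helper names and example logits are hypothetical.

```python
import math


def conv_out(size, kernel=3, stride=2, padding=1):
    """Output length of a convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def encoder_shapes(patch=(64, 64, 32), stages=4):
    """Spatial size after each stride-2 down-sampling convolution."""
    shapes = [patch]
    for _ in range(stages):
        shapes.append(tuple(conv_out(s) for s in shapes[-1]))
    return shapes


def softmax(logits):
    """Numerically stable soft-max over a list of class logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


for shape in encoder_shapes():
    print(shape)

# One voxel with two output channels (background, foreground).
background, foreground = softmax([0.3, 2.1])
print(background < foreground)
```

Under these assumptions each stage halves every axis, so the deepest feature map has a spatial size of 4 × 4 × 2, and four up-sampling steps in the decoder restore the original patch size.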

(A) Multi-Scale Feature Aggregation
Feature aggregation is commonly used in the field of computer vision. With the development of medical imaging, multi-scale feature aggregation has also been widely used in deep learning for medical image processing [57]. In the feature aggregation process, convolution, up-sampling, concatenation, and addition operations are used to fuse shallow and deep features, resulting in deep features that contain both strong expressions of high-level features with large receptive fields and rough features that represent edges and shapes in shallow layers. For example, the path aggregation network for instance segmentation proposed by Liu et al. [58] has demonstrated the advantages of aggregating features at multiple levels for accurate prediction. In our approach, multi-scale feature aggregation is used to aggregate features of different scales obtained during the down-sampling process into the deep layers of the network to achieve full integration of high-level and shallow features.
Cerebrovascular and airway structures both have the anatomical characteristics of complex branching and uneven thickness. Accurately segmenting tube-like structures of varying thickness within a single scale is challenging. In general networks, features of different scales have different expression abilities for structures of different sizes and shapes. In our segmentation task, both the large targets (such as major vessels and main airways) and small targets (such as peripheral branches of vessels and airways) are equally important, and the absence of any feature can significantly impact segmentation accuracy and clinical diagnosis. Therefore, multi-scale feature aggregation is used to avoid the loss of these features. As shown in Figure 3, in the multi-scale feature aggregation framework, the input features represent the low- and high-level output features of the encoding layers of U-Net. The low-level feature maps mainly contain edge and texture information of the image, while the high-level features represent the semantic features with strong expression characteristics of the image. The lower-level features are down-sampled to half their size through dimensionality reduction. The down-sampled feature is multiplied with $f_2 \in \mathbb{R}^{8C \times H \times \dots}$, and the product is concatenated channel-wise with another down-sampled feature and fused. Finally, $x_{3\_1}$ and $x_{3\_2}$ are concatenated channel-wise and passed through two Conv 3 × 3 × 3 layers to extract features. A Conv 1 × 1 × 1 layer then performs the final feature fusion and outputs $f_{mf}$. This multi-scale feature fusion enhances both global and fine-detail features. By combining features from different levels, a more comprehensive and expressive feature representation is achieved, leading to notable improvements in segmentation accuracy.
Here, image, *, E, D, C, f_i, and f_mf represent the input image patch, matrix multiplication, the encoder, down-sampling, concatenation, the encoder layer feature, and the fused feature, respectively.

The complex shapes and varied branching structures of normal and abnormal cerebrovascular and airway structures, inconsistent imaging intensity, and substantial inter-individual differences affect the segmentation of tubular structures, especially the extraction of peripheral, edge, and detail features. The reverse attention mechanism proposed by Fan et al. [50] performs well in segmenting the edges of polyps. By repeatedly applying the Reverse Attention (Rattention) module, a relationship between regional and boundary cues can be established to extract edge features from the fused high-level features. Through continuous training iterations, the model can correct partially inconsistent areas in the predicted results, improving the segmentation accuracy.
In the network architecture proposed in this paper, the encoded features obtained are fused by skip connections to combine low-level and high-level features. However, the fused feature maps are not sensitive to edge details and edge features, which are difficult to extract from vessels and airways due to their rich branching structures and fine peripheral features. By multiplying the reverse attention coefficient matrix with the input features, the fused feature maps can be adaptively assigned with corresponding reverse attention weights, enhancing the expression of edge features.
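The re-weighting described above can be sketched in a few lines. This is a minimal illustration, not the UARAI implementation: it assumes the reverse attention coefficient is computed as 1 − sigmoid of a coarse single-channel prediction (the formulation used by Fan et al. [50]), and it flattens the volume to a list of voxels. The function and variable names are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function used to normalize the coarse prediction."""
    return 1.0 / (1.0 + math.exp(-x))


def reverse_attention(feature, coarse_logits):
    """Apply reverse attention to a multi-channel feature.

    feature:       list of channels, each a flat list of voxel values
    coarse_logits: flat list of per-voxel logits (single channel)
    Returns the coefficient map and the re-weighted feature.
    """
    # Inverting the sigmoid output suppresses confident foreground voxels
    # and keeps uncertain or background voxels, where edges tend to lie.
    coeff = [1.0 - sigmoid(z) for z in coarse_logits]
    # The single-channel coefficient is shared by every channel, which is
    # the channel-wise expansion followed by voxel-wise multiplication.
    weighted = [[v * r for v, r in zip(channel, coeff)] for channel in feature]
    return coeff, weighted


# Three voxels: confident foreground, uncertain, confident background.
logits = [6.0, 0.0, -6.0]
feature = [[1.0, 1.0, 1.0], [2.0, 4.0, 8.0]]
coeff, weighted = reverse_attention(feature, logits)
```

In this toy example the confidently predicted foreground voxel is almost erased, while the uncertain voxel keeps half of its response, which is the behavior the module relies on to emphasize boundaries.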
As shown in Figure 4, the reverse attention module obtains adaptive reverse attention coefficients through feature manipulation and uses them to re-weight the input features, which enhances the expression of edge features and emphasizes boundaries. Specifically, the multi-scale aggregated feature f_mf ∈ P^(1×4×4×2) is first input into the inception sparse convolution module, which includes multiple dilated convolution structures that further fuse multi-scale features and enhance feature expression. The output is then passed through the sigmoid function and inverted, and the result is used as the reverse attention coefficient R_1 ∈ P^(1×4×4×2) to erase foreground features. R_1 is expanded along the channel dimension to obtain M_R1 ∈ P^(128×4×4×2), which is multiplied pixel-wise with the input encoded feature f_1 ∈ P^(128×4×4×2) to assign a new weight to each pixel. Subsequently, the re-weighted feature is passed through a Conv3 × 3 × 3 convolution and up-sampled to obtain the decoded feature Decoder1. The left image in Figure 5 describes the process of reverse attention propagation. Finally, R_(i+1) ∈ P^(1×2h×2w×2d) is obtained through Conv1 × 1 × 1 convolution and up-sampling and is input into the inception structure. The reverse attention mechanism can therefore further enhance feature expression and improve segmentation accuracy.
Here, *, f_mf, R_i, M_Ri, Decoder1, and F represent matrix multiplication, the fused feature, the reverse attention coefficient, the reverse attention coefficient matrix, the decoder feature, and convolution followed by up-sampling, respectively.

(C) Inception Block
In our reverse attention module, we incorporated the Inception structure as a sparse network to efficiently use computational resources and improve the network's performance. Since high-precision multi-scale and edge detail segmentation is crucial in tubular structure segmentation, we fused the Inception structure into the UARAI network architecture to effectively combine multi-scale features. This improved the feature expression without increasing the number of parameters and expanded the network's receptive field.
In the UARAI network structure, the Inception structure is mainly used in the further comprehensive fusion of multi-scale features after multi-scale feature aggregation and the sparse propagation path of reverse attention. As illustrated in Figure 5 (right), the specific structure contains four branches, each consisting of two layers. Each branch layer undergoes processing using convolutions and dilated convolutions with different kernel sizes, followed by spatial and channel-wise fusion of the branch's results.
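The statement that dilated branches enlarge the receptive field without adding parameters follows from simple arithmetic: a kernel of size k with dilation d spans k + (k − 1)(d − 1) voxels along each axis but still holds k³ weights per channel pair. The sketch below illustrates this; the kernel sizes and dilation rates of the four branches are not given in the text, so the branch configurations here are hypothetical and chosen only for illustration.

```python
def effective_kernel(kernel_size, dilation):
    """Extent along one axis covered by a dilated convolution kernel."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def conv3d_weights(in_channels, out_channels, kernel_size):
    """Weight count of a 3D convolution (bias ignored); dilation plays no role."""
    return in_channels * out_channels * kernel_size ** 3


def stacked_receptive_field(layers):
    """Receptive field along one axis of stride-1 layers given as (kernel, dilation)."""
    field = 1
    for kernel_size, dilation in layers:
        field += effective_kernel(kernel_size, dilation) - 1
    return field


# Hypothetical four-branch, two-layer configuration (assumed values,
# not taken from the paper).
branches = {
    "branch_1": [(1, 1), (1, 1)],
    "branch_2": [(1, 1), (3, 1)],
    "branch_3": [(3, 1), (3, 2)],
    "branch_4": [(3, 1), (3, 3)],
}
fields = {name: stacked_receptive_field(layers) for name, layers in branches.items()}
```

With these assumed settings the branches see 1, 3, 7, and 9 voxels per axis, while every 3 × 3 × 3 layer keeps the same weight count regardless of its dilation.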

(D) Loss function
In this study, tubular structure segmentation suffers from an imbalance between positive and negative samples: the number of foreground pixels belonging to cerebrovascular and airway structures is far smaller than the number of background pixels. Using Dice loss as the loss function mitigates this problem and improves segmentation accuracy. Dice loss is based on a similarity measure between the predicted and labeled foreground regions, which makes it robust to class imbalance. The formula for the Dice loss is as follows: Here, gt, pred, p_i, and g_i represent the ground truth, the prediction, a predicted image pixel, and a labeled image pixel, respectively.
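As an illustration of why Dice loss copes with class imbalance, the sketch below assumes the common soft Dice form 1 − 2Σp_i g_i / (Σp_i + Σg_i), which is consistent with the symbols defined above. The small smoothing constant is an assumption added to avoid division by zero and is not taken from the paper.

```python
def dice_loss(pred, gt, smooth=1e-6):
    """Soft Dice loss for flat lists of predictions p_i and labels g_i.

    pred:   predicted foreground probabilities in [0, 1]
    gt:     ground-truth labels (0 or 1)
    smooth: assumed stabilizer, not specified in the source text
    """
    intersection = sum(p * g for p, g in zip(pred, gt))
    denominator = sum(pred) + sum(gt)
    return 1.0 - (2.0 * intersection + smooth) / (denominator + smooth)


def voxel_accuracy(pred, gt):
    """Fraction of voxels whose thresholded prediction matches the label."""
    hits = sum(1 for p, g in zip(pred, gt) if (p >= 0.5) == (g == 1))
    return hits / len(gt)


# One foreground voxel among 1000: predicting pure background is 99.9%
# accurate voxel-wise, yet the Dice loss stays close to its maximum.
sparse_gt = [1] + [0] * 999
all_background = [0.0] * 1000
imbalanced_loss = dice_loss(all_background, sparse_gt)
imbalanced_accuracy = voxel_accuracy(all_background, sparse_gt)
perfect_loss = dice_loss([1.0, 0.0, 0.0, 1.0], [1, 0, 0, 1])
```

Because only foreground overlap enters the numerator, a network cannot lower this loss by labeling everything as background, which is the failure mode that voxel-wise accuracy would reward on sparse vessel and airway masks.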

Experimental and Parameter Settings
The experiment is primarily based on MRA and CT images. It aims to validate the effectiveness of our method's data pre-processing, network model framework (including multi-scale feature fusion, reverse attention, and sparse convolution), and segmentation results' post-processing. We have conducted a large number of comparative experiments.
All experiments were performed on an A100 GPU with 40 GB of memory, using CUDA 11.4 and Python 3.9. First, 61,864 MRA image patches and 98,852 CT image patches were obtained by combining sliding window sequential cropping with random cropping. Second, the training parameters were set as follows: the batch size was 100, the number of epochs was 200, and the Adam optimizer was used with an initial learning rate of 0.001. Moreover, dropout = 0.3 was set in the network so that some nodes are randomly discarded during training, which helps avoid overfitting and improves generalization. An early stopping strategy (counters = 50) was adopted during training to further prevent overfitting and enhance model robustness and generalization.
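The early stopping rule can be sketched as a small counter. This is an assumption-laden illustration: the text only states "counters = 50", which is read here as a patience of 50 epochs, and the monitored quantity is assumed to be the validation loss. The class and function names are hypothetical.

```python
class EarlyStopping:
    """Stop training after `patience` consecutive epochs without improvement."""

    def __init__(self, patience=50, min_delta=0.0):
        self.patience = patience      # assumed meaning of "counters = 50"
        self.min_delta = min_delta    # minimum decrease that counts as progress
        self.best = float("inf")
        self.counter = 0

    def step(self, val_loss):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.counter = 0
        else:
            self.counter += 1
        return self.counter >= self.patience


def first_stop_epoch(losses, patience):
    """Return the epoch index at which training would stop, or None."""
    stopper = EarlyStopping(patience=patience)
    for epoch, loss in enumerate(losses):
        if stopper.step(loss):
            return epoch
    return None


# Short patience so the behavior is visible in a few epochs.
stopped_at = first_stop_epoch([0.9, 0.8, 0.85, 0.81, 0.82, 0.7], patience=3)
```

With a patience of 3, the toy run stops at epoch 4, after three epochs that fail to beat the best loss of 0.8; the later improvement to 0.7 is never reached, which shows the trade-off a large patience such as 50 is meant to soften.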

Comparative Experiment
To obtain a more objective and reliable tubular structure segmentation model, this study designed comparative experiments along three dimensions on the cerebrovascular and airway datasets: network-based, patch-cropping-method-based, and patch-size-based comparisons. These three dimensions are not completely independent but are interrelated, as described below:
(a) Network dimension-based comparative experiments: Based on commonly used medical image segmentation networks, this experiment compared and analyzed the performance of VoxResnet [59], Resnet [60], 3D U-Net [13], Attention U-Net [17], Rattention U-Net [50], CS2-Net [35], ER-Net [36], APA U-Net [39], and the UARAI network proposed in this study. Both vessel and airway segmentation are evaluated to thoroughly validate the proposed model's segmentation effect.
(b) Patch-cropping method-based comparative experiments: To verify the influence of different patch acquisition methods on model performance, two comparative experiments were designed. One method is random patch cropping, and the other combines sequential sliding window cropping with random patch cropping. For random patch cropping, the cropping condition was set as a block threshold greater than 0.01 (as shown in Equation (11)), and a total of 150 patches were cropped for each image. This patch type mainly includes coarse tubular structures with fewer vessels and airways in peripheral areas. The combined method first crops samples sequentially with a window size of 64 × 64 × 32 and a step size of 32; then 30 samples are randomly cropped from each image with the threshold set to 0.001 (a strict threshold is not needed). This strategy can obtain all the feature information of the image quickly and increase sample diversity.
(c) Patch-size-based comparative experiments: Cerebrovascular structures are distributed very sparsely in the brain: the volume fraction of physiological brain arterial vessels is 1.5%, and the arterial vessels that TOF-MRA can resolve may account for as little as 0.3% of all voxels in the brain [8].
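The combined cropping strategy in item (b) can be sketched as follows. This is a minimal illustration on nested lists rather than the authors' pipeline: the per-axis step tuple, the shifting of the last window so that it stays inside the volume, the retry limit, and all function names are assumptions. A random patch is accepted when its fraction of labeled voxels exceeds the threshold.

```python
import random


def sliding_window_starts(length, window, step):
    """Window start indices along one axis, covering the axis completely."""
    if length <= window:
        return [0]
    starts = list(range(0, length - window + 1, step))
    if starts[-1] != length - window:
        starts.append(length - window)  # assumed handling of the remainder
    return starts


def label_fraction(label, origin, size):
    """Ratio of labeled voxels to all voxels inside one patch."""
    x0, y0, z0 = origin
    sx, sy, sz = size
    total = sum(label[x][y][z]
                for x in range(x0, x0 + sx)
                for y in range(y0, y0 + sy)
                for z in range(z0, z0 + sz))
    return total / (sx * sy * sz)


def combined_crops(label, size, step, n_random, threshold, rng, max_tries=1000):
    """Origins from sequential sliding windows plus thresholded random crops."""
    shape = (len(label), len(label[0]), len(label[0][0]))
    axes = [sliding_window_starts(n, w, s) for n, w, s in zip(shape, size, step)]
    origins = [(x, y, z) for x in axes[0] for y in axes[1] for z in axes[2]]
    accepted, tries = 0, 0
    while accepted < n_random and tries < max_tries:
        tries += 1
        origin = tuple(rng.randint(0, n - w) for n, w in zip(shape, size))
        if label_fraction(label, origin, size) > threshold:
            origins.append(origin)
            accepted += 1
    return origins


# Toy 8 x 8 x 4 label volume containing one straight "vessel" along x.
label = [[[0] * 4 for _ in range(8)] for _ in range(8)]
for x in range(8):
    label[x][4][2] = 1

origins = combined_crops(label, size=(4, 4, 2), step=(2, 2, 2),
                         n_random=5, threshold=0.001, rng=random.Random(0))
sequential, sampled = origins[:18], origins[18:]
```

The sequential windows cover the whole volume, including pure background, while the loosely thresholded random crops add extra views of the structure; this mirrors the motivation given for combining the two.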

Evaluation Metrics
Common semantic segmentation metrics were used in the experiment to evaluate the quality of the segmentation results, including Recall (also known as sensitivity), Precision (also known as positive predictive value or PPV), Dice score, and IoU (intersection over union). The calculation formulas for each metric are as follows: Here, TP, FP, TN, FN, gt, and pred represent true positive, false positive, true negative, false negative, ground truth, and prediction, respectively.
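For reference, the four metrics can be computed from voxel counts as sketched below, using their standard definitions: Recall = TP / (TP + FN), Precision = TP / (TP + FP), Dice = 2TP / (2TP + FP + FN), and IoU = TP / (TP + FP + FN). The function names are hypothetical, and the sketch assumes that both masks contain at least one foreground voxel.

```python
def segmentation_metrics(pred, gt):
    """Recall, Precision, Dice, and IoU for flat binary masks."""
    tp = sum(1 for p, g in zip(pred, gt) if p == 1 and g == 1)
    fp = sum(1 for p, g in zip(pred, gt) if p == 1 and g == 0)
    fn = sum(1 for p, g in zip(pred, gt) if p == 0 and g == 1)
    return {
        "recall": tp / (tp + fn),
        "precision": tp / (tp + fp),
        "dice": 2 * tp / (2 * tp + fp + fn),
        "iou": tp / (tp + fp + fn),
    }


def dice_to_iou(dice):
    """IoU implied by a Dice score for the same pair of masks."""
    return dice / (2.0 - dice)


pred = [1, 1, 1, 0, 0, 0, 0, 0]
gt = [1, 1, 0, 1, 0, 0, 0, 0]
metrics = segmentation_metrics(pred, gt)

# Dice values reported for cerebrovascular and airway segmentation.
iou_vessel = round(dice_to_iou(0.9031), 4)
iou_airway = round(dice_to_iou(0.9334), 4)
```

Since Dice and IoU are monotonically related through IoU = Dice / (2 − Dice), the two metrics always rank a pair of masks identically; the reported pairs (90.31% with 82.33%, and 93.34% with 87.51%) are consistent with this identity.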

Results
This study conducted several comparative experiments on cerebrovascular MRA and lung CT image datasets to verify the effectiveness of our proposed method. To ensure fairness, we randomly partitioned the training, validation, and testing data in the same hardware environment and used consistent evaluation metrics and post-processing methods for comparative analysis. Precision (Pre), Recall (Re), Dice score (Di), and IoU were used as the evaluation metrics for segmentation effectiveness.

Cerebrovascular and Airway Segmentation Results
Comparison experiment of network: All networks were compared using the same post-processing method (Table 2 and Figure 6); the comparison results without post-processing are shown in Figures 7 and 8. In the task of cerebrovascular segmentation, our proposed method outperformed U-Net by 2.29%, 1.36%, and 2.23% in Precision, Dice, and IoU, respectively. It also achieved higher performance than VoxResnet, with 8.09%, 5.07%, and 8.05% improvements in Precision, Dice, and IoU, respectively, as well as Resnet, with 2.77%, 0.68%, and 1.11% improvements; Attention U-Net, with 3.91%, 2.14%, and 3.49% improvements; Rattention U-Net, with 3.72%, 1.10%, and 1.81% improvements; CS2-Net, with 0.74%, 2.41%, and 3.91% improvements; ER-Net, with 1.91%, 2.14%, and 3.49% improvements; and APA U-Net, by significant margins of 19.48%, 11.09%, and 16.71% in Precision, Dice, and IoU, respectively.

Similarly, in the task of airway segmentation, our proposed method achieved the best performance and outperformed U-Net, with 1.07%, 0.09%, and 0.16% improvements in Precision, Dice, and IoU, respectively, as well as VoxResnet, with 3.49%, 1.02%, and 1.77% improvements; Resnet, with 0.99%, 0.07%, and 0.12% improvements; Attention U-Net, with 0.37%, 0.47%, and 0.82% improvements; Rattention U-Net, with 2.06%, 0.54%, and 0.94% improvements; CS2-Net, with 5.99%, 0.80%, and 1.39% improvements; ER-Net, with 2.62%, 0.59%, and 1.02% improvements; and APA U-Net, with 6.76%, 1.76%, and 3.03% improvements.

Table 2. Segmentation results of cerebrovascular and airway structures by different networks (including post-processing); the evaluation index is the average value and variance of the prediction results of the test set; MSFA means multi-scale feature aggregation, and red bold marks the optimal result (post-processing).
In summary, our proposed UARAI model achieved superior segmentation performance in terms of Precision, Dice, and IoU compared to other models, particularly in the task of cerebral vascular segmentation. It also exhibited some improvement in the task of airway segmentation.

Comparison of cropping methods: This experiment uses the UARAI structure with a training sample size of 64 × 64 × 32. Different cropping methods are compared, and the segmentation results are presented in Figure 9. On the cerebrovascular dataset, the combined 'sliding window sequence + random' cropping method yields the highest Precision, Dice, and IoU scores, at 93.63%, 90.10%, and 81.98%, respectively. Compared to the 'random' cropping method, Precision, Dice, and IoU improve by 3.25%, 1.10%, and 1.80%, respectively, while Recall decreases slightly, by 1.03%. On the airway dataset, the combined method yields the highest Precision, Recall, Dice, and IoU scores, at 97.41%, 89.67%, 93.34%, and 87.51%, respectively, which are improvements of 3.54%, 0.9%, 3.26%, and 5.56% over the 'random' cropping method.

Comparison experiment of patch size: We compared the segmentation results of the model on different cross-sectional sizes of cerebrovascular and airway samples, as shown in Tables 3 and 4. Regarding segmentation performance, the training sample size of 64 × 64 × 32 demonstrates the best results, disregarding the z-axis dimension. Specifically, in the cerebrovascular dataset, the size of 64 × 64 × 32 yields the highest Precision, Recall, Dice, and IoU scores, which are 93.63%, 89.29%, 90.10%, and 81.98%, respectively. For the airway dataset, the size of 64 × 64 × 32 achieves the best Dice and IoU scores of 93.20% and 87.27%, respectively. However, Precision peaks at the size of 128 × 128 × 32, at 97.07%, while the highest Recall is attained at the size of 96 × 96 × 32, at 91.55%. Figure 10 illustrates the segmentation outcomes of the cerebrovascular and airway datasets obtained through the UARAI network with the different patch sizes employed in this experiment. These results support the choice of 64 × 64 × 32 as the training sample size for the cerebrovascular and airway datasets used in this study.

Ablation Studies
To assess the efficacy of each module, this study conducted ablation experiments on the multi-scale feature aggregation module (MSFA) and the reverse attention sparse convolution module (Ra + Icp) within the cerebrovascular and airway segmentation models. The standard model, built upon the 3D U-Net baseline network framework, comprised the modules 'Baseline + MSFA + Ra + Icp'. The ablation experiments were carried out as follows:

Ablation studies of MSFA: This study compared the effectiveness of the MSFA module in Attention and Reverse Attention network structures and verified the consistency of multi-scale feature aggregation in improving segmentation accuracy for different tubular objects. In the ablation experiments, the MSFA module was integrated into 'Baseline', 'Baseline + Attention', 'Baseline + Rattention', and the standard model. Table 5 presents the segmentation results for the various models in both cerebrovascular and airway segmentation. In Figure 11, the models are depicted in red and blue, representing those with and without MSFA (multi-scale feature aggregation). The results demonstrated that the MSFA module could enhance the network's representation ability for different scales and improve segmentation accuracy for cerebrovascular and airway segmentation tasks.

Figure 11. Results of the ablation experiment on MSFA. In cerebrovascular segmentation, (a-c) correspond to the comparative outcomes of 'MSFA' across U-Net, Attention U-Net, and Rattention U-Net, respectively. Similarly, in airway segmentation, (d-f) represent the comparison results of 'MSFA' within U-Net, Attention U-Net, and Rattention U-Net, respectively.

Figure caption (partial): (a, b) showcase the comparative outcomes of 'Ra + Icp'.

Ablation studies of Post-processing: Post-processing techniques play a crucial role in enhancing the outcomes of medical image segmentation. This study employed two post-processing strategies to refine the results. The first strategy applied adaptive filtering to false positive regions based on the region of interest (ROI) of the original image. This approach effectively mitigated isolated pixel areas that tend to emerge within cerebrovascular and airway regions. The second strategy removed isolated pixel points by keeping the maximum connected domain. With these strategies, the segmentation results became more accurate because false positive areas outside the brain tissue and lung parenchyma were handled effectively, as depicted in Figure 13. To quantify the impact of post-processing, Table 2 shows the results obtained by applying identical post-processing techniques across different network training sessions with a patch size of 64 × 64 × 32. Furthermore, the measurements in Table 2 improve markedly on those in Tables 4 and 5, with Resnet displaying outstanding performance.

Figure 13. Comparison between pre- and post-processing results. (a) shows the refined cerebrovascular prediction outcomes achieved through connected domain processing, while (b) shows the improved airway prediction results obtained by using the region of interest (ROI) encompassing the lung parenchyma, including the airways. False positives are highlighted by yellow circles.
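The maximum connected domain step can be sketched as a breadth-first search that keeps only the largest foreground component. This is an illustrative sketch on nested lists, not the authors' code; the use of 6-connectivity (face neighbors only) is an assumption, since the connectivity is not stated in the text.

```python
from collections import deque

# Face-adjacent neighbors in 3D (assumed 6-connectivity).
NEIGHBOURS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def largest_connected_component(mask):
    """Return a copy of a binary mask [x][y][z] keeping its largest component."""
    nx, ny, nz = len(mask), len(mask[0]), len(mask[0][0])
    seen = set()
    best = []
    for x in range(nx):
        for y in range(ny):
            for z in range(nz):
                if mask[x][y][z] != 1 or (x, y, z) in seen:
                    continue
                # Flood-fill one component starting from this voxel.
                component = []
                queue = deque([(x, y, z)])
                seen.add((x, y, z))
                while queue:
                    cx, cy, cz = queue.popleft()
                    component.append((cx, cy, cz))
                    for dx, dy, dz in NEIGHBOURS:
                        px, py, pz = cx + dx, cy + dy, cz + dz
                        inside = 0 <= px < nx and 0 <= py < ny and 0 <= pz < nz
                        if inside and mask[px][py][pz] == 1 and (px, py, pz) not in seen:
                            seen.add((px, py, pz))
                            queue.append((px, py, pz))
                if len(component) > len(best):
                    best = component
    cleaned = [[[0] * nz for _ in range(ny)] for _ in range(nx)]
    for x, y, z in best:
        cleaned[x][y][z] = 1
    return cleaned


# A three-voxel segment and one isolated voxel along the z-axis.
prediction = [[[1, 1, 1, 0, 1, 0]]]
cleaned = largest_connected_component(prediction)
```

Keeping only the largest component suits a connected vascular or airway tree, but it would also discard true branches that appear disconnected in the prediction, so it trades a few false negatives for fewer isolated false positives.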

Discussion
Cerebrovascular and airway segmentation has always been a significant clinical concern. To address the challenge of low segmentation accuracy due to the complexity of the cerebrovascular and airway structures and the difficulty in extracting features from end and edge regions, we suggest a multi-scale feature aggregation reverse attention sparse convolution network architecture that can enhance feature extraction for tubular structures with varying thicknesses and complex shapes. As a result, this method can enhance the expression ability of edge features, leading to high-precision segmentation of cerebrovascular and airway structures. The proposed network structure achieved Dice and IoU scores of 90.31% and 82.33%, respectively, in cerebrovascular segmentation. In airway segmentation, the Dice and IoU scores were 93.34% and 87.51%, respectively. The results suggest that the approach surpasses the commonly used segmentation networks. Furthermore, the findings indicate that the proposed method can accurately segment tubular structures, which is crucial in clinical diagnosis, preoperative planning, and prognosis analysis.
The primary objective of this study is to tackle the challenge of accurately segmenting tubular structures despite the limited availability of medical imaging data. To overcome this challenge, we propose a cropping strategy that combines sliding window sequential cropping with random cropping, which generates a large and diverse set of training samples. With a patch size of 16 × 16 × 32, a sliding window step of 16, and random cropping of 30 samples, we obtained 223,896 training samples. Similarly, with a patch size of 64 × 64 × 32 and a sliding window step of 32, we acquired 46,056 training samples. With a patch size of 128, a sliding window step of 64, and random cropping of 30 samples, we obtained 9880 training samples. These counts show that the proposed method generates a considerably larger sample pool than conventional random cropping.
We integrated multiple image-enhancement techniques into the training process to further enrich the training samples and enhance the model's generality. These techniques played a crucial role in augmenting the training samples and boosting their representativeness. Experimental outcomes based on different patch sizes indicated that the optimal segmentation performance was achieved at a resolution of 64 × 64, irrespective of the layer thickness along the z-axis.
We conducted a comparative analysis of two patch extraction techniques: random cropping and random cropping combined with sliding window sequential cropping. In the case of random cropping, patches were extracted by determining a threshold based on the ratio of label pixels to the total number of pixels within each patch (as shown in Equation (11)). The choice of the threshold value directly influenced the accuracy of the segmentation. We found that extremely low or high values had a negative impact on the experimental results. If the threshold value was set too low, the resulting patches mainly consisted of background regions, lacking sufficient image feature information for effective training. On the other hand, an excessively high threshold value led to longer cropping times, reducing the efficiency of training. Additionally, since the background area is a significant component of the segmentation task, the random cropping approach often overlooked the background area, resulting in inconsistencies between the training patches and the actual image features. As a result, this approach led to decreased prediction accuracy.
To address these challenges, we adopted a sliding window sequential cropping approach and a non-strict threshold random cropping strategy when extracting patches for cerebrovascular and airway segmentation. Initially, the sliding window technique was employed to extract patches, ensuring the comprehensive inclusion of image feature information pertaining to the tubular structures of interest. Additionally, we incorporated a limited amount of random cropping to introduce diversity among the samples. This combined approach effectively captured all relevant image features. By adopting a more lenient threshold in random cropping, we successfully mitigated the issues mentioned above, leading to improved segmentation accuracy and preserving the necessary diversity in the training data.
$$\mathrm{Threshold} = \frac{\sum_{i,j,k} V_{patch}(i, j, k)}{V_{crop}} \qquad (11)$$

Here, Threshold, V_patch(i, j, k), and V_crop represent the threshold value set for random patch cropping, the label value of each pixel in the patch, and the size of the patch, respectively.

Figure 9 shows that combining sliding window sequential cropping with random cropping improved cerebrovascular segmentation on most evaluation metrics compared with random cropping alone. Specifically, the Dice score increased by 1.1%, Precision by 3.25%, and IoU by 1.8%, whereas Recall decreased marginally, by 1.03%.
In airway segmentation, the model trained with the combination of sliding window sequential cropping and random cropping improved on all measures relative to random cropping alone: the Dice score increased by 3.26%, Precision by 3.54%, IoU by 5.56%, and Recall by 0.9%. These findings indicate that the combined cropping strategy improves segmentation accuracy on both the cerebrovascular and airway datasets.
Our experimental findings shed light on the significant impact of patch size selection on the sensitivity of cerebrovascular and airway segmentation. Previous research [24] has emphasized that a smaller cropping size prompts the network to focus predominantly on local features. In comparison, a larger cropping size enables the network to capture more global features, albeit at the potential cost of requiring additional max-pooling layers. In our study, we conducted extensive comparative experiments on brain vasculature and airway datasets to determine the optimal cropping size for these specific domains.
The results indicate that a model with a patch size of 64 × 64 × 32 achieves the best segmentation accuracy by capturing global and fine features in a well-balanced manner. This conclusion is supported by Tables 3 and 4, in which the 64 × 64 × 32 size gives the best Dice and IoU scores on both datasets. Moreover, Figure 14 illustrates the segmentation outcomes achieved by models trained with different patch sizes. In Figure 14, the yellow circle represents false positives, while the green circle signifies false negatives. A cropping size of 64 × 64 × 32 yields the most favorable segmentation results, with the fewest false positives and false negatives. Small patches may lead to more false positives, primarily because the network has limited ability to use contextual cues from such small patches; neighboring background regions may therefore be erroneously classified as tubular structures. Conversely, larger patches contain more background interference, which impedes the network's ability to discern the finer details of the tubular structures; cerebrovascular and airway regions then tend to be misidentified as background, leading to higher false negative rates. The selection of an appropriate cropping size is therefore crucial, with 64 × 64 × 32 proving optimal for accurate and reliable segmentation on the datasets in this study.
In addition, to comprehensively validate the effectiveness of the proposed method in this paper, the proposed network was compared with existing segmentation methods. As shown in Figures 15 and 16, there were differences in the false positive and false negative cases among different networks. In Figure 15, we present two sets of cerebrovascular image segmentation results. The first column shows the maximum intensity projection (MIP) image of brain vasculature, which displays the distribution of blood vessels in the brain. The second column shows the ground truth labels and the subsequent columns show the segmentation results of various networks. Specifically, the U-Net model performs well in medical image segmentation and has good overall segmentation results but performs slightly worse in edge segmentation, small blood vessel segmentation, and airway segmentation. Although the U-Net model performs well in segmenting the primary vascular branches and airways, its ability to segment tubular structures near the edges is suboptimal.
Figure 15. The results obtained from various models in cerebrovascular segmentation revealed distinct patterns. Blue and green boxes depict false positive and false negative areas, respectively, providing a visual representation of the discrepancies among the models.
False positive and false negative areas, particularly at the edges, are highlighted by blue and dark green boxes, respectively.
The U-Net and APA U-Net models exhibited limited discrimination ability when segmenting the vascular region at the arteriovenous junction during cerebrovascular segmentation. This limitation led to a higher occurrence of false positives in the results. The VoxResNet model, on the other hand, produced better segmentation outcomes than the U-Net model and reduced the occurrence of false positives. This improvement can be attributed to the residual connections in its architecture, which compensate for the loss of shallow feature information and enhance the segmentation accuracy. Increasing the depth of the ResNet model with residual connectivity further reduced the incidence of false positives in the predicted outcomes. However, a larger false positive region emerged in the non-brain and non-airway regions, possibly due to the increased complexity of deeper network layers and the imbalanced ratio of positive and negative samples.
In Figure 15, two sets of three-dimensional vessel segmentation results are presented. In the CS2-Net model, the network addresses the weak segmentation ability of U-Net and APA U-Net at the intersection of arteries and veins by utilizing both spatial and channel attention, significantly reducing false positive cases and improving segmentation accuracy. In ER-Net, the use of reverse attention enhances the edge feature module, further improving the segmentation ability for edge blood vessels and reducing false positive cases, but there still exist some false negative cases. Examining the segmentation results of the Attention U-Net and Rattention U-Net models, noticeable enhancements were observed in the segmentation accuracy of edge details, accompanied by a significant reduction in the false positive rate. In the case of the UARAI model segmentation results, a substantial decrease in the number of isolated false positive areas was evident. Moreover, the segmentation of small blood vessels became more delicate and accurate, and the segmented blood vessels exhibited improved continuity aligned with the anatomical structure characteristics. However, a few false negative cases persisted, which could be attributed to the challenge of differentiating arterial and venous image features that share similarities.

Moving to Figure 16, two sets of three-dimensional airway segmentation results are presented. Predominantly, the airway segmentation outcomes exhibit more false negatives and fewer false positives. Overall, all networks' main airway segmentation results demonstrate improved accuracy, although the segmentation of small airways falls short of ideal performance. In the APA U-Net and ResNet networks, there are many false positive regions outside the airway, which greatly affect segmentation performance. After post-processing, the accuracy is greatly improved. The false positive cases in the segmentation results of the U-Net and VoxResNet models are greatly improved, but the performance of edge segmentation still needs to be improved. The CS2-Net, ER-Net, Attention U-Net, and Rattention U-Net models introduced different attention mechanisms, which improved overall performance compared with U-Net. Particularly, in the ER-Net and Rattention U-Net models, the edge segmentation accuracy is significantly improved, further confirming the reusability of reverse attention in complex tube-like structures and edge detail segmentation. Notably, the UARAI model demonstrated exceptional performance in edge detail segmentation and the segmentation of small airways, as depicted in the yellow box area. Additionally, the false positive rate in the segmentation results was notably low, as indicated by the blue box area, resulting in highly accurate segmentation outcomes.
Under the UARAI framework, we conducted comparative experiments on diverse network models. The results indicate a noteworthy advancement in Precision, Dice, and IoU scores; however, we observed a minor decline in Recall as compared to other networks. As previously mentioned, an improvement in Precision indicates more accurate true-positive predictions or fewer false positives, with the model being more focused on predicting positive samples and making stricter judgments, thereby reducing misjudgments. Dice and IoU scores mainly focus on the overlapping area between the model's prediction results and the ground truth labels. Recall and Precision differ because Recall is more concerned about false-negative areas, with slightly lower values indicating that the model missed several positive samples and suffered from slight under-segmentation.
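The relationship between these four metrics can be made concrete with a short sketch that computes them from binary masks using their standard definitions. The function name and the small stabilizing constant are illustrative assumptions; the toy masks simulate slight under-segmentation.

```python
import numpy as np


def segmentation_metrics(pred, truth, eps=1e-8):
    """Precision, Recall, Dice and IoU for a pair of binary masks.

    eps is an assumed small constant that avoids division by zero.
    """
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = np.count_nonzero(pred & truth)    # correctly predicted foreground
    fp = np.count_nonzero(pred & ~truth)   # background predicted as foreground
    fn = np.count_nonzero(~pred & truth)   # foreground that was missed
    return {
        "precision": tp / (tp + fp + eps),
        "recall": tp / (tp + fn + eps),
        "dice": 2 * tp / (2 * tp + fp + fn + eps),
        "iou": tp / (tp + fp + fn + eps),
    }


# Toy example of under-segmentation: one of four foreground voxels is missed
# and no background voxel is mislabelled.
truth = np.array([1, 1, 1, 1, 0, 0, 0, 0])
pred = np.array([1, 1, 1, 0, 0, 0, 0, 0])
metrics = segmentation_metrics(pred, truth)
```

In this toy case Precision stays at 1 while Recall drops to 0.75, which mirrors the pattern described above: stricter positive predictions remove false positives at the cost of some missed voxels. For a single pair of masks, Dice and IoU are tied by IoU = Dice / (2 − Dice).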
Low image resolution and large pixel spacing in cerebrovascular and airway datasets may cause discontinuities in the annotations of peripheral branches. This leads the model to ignore positive areas that lack annotations and treat them as background, which in turn lowers the Recall value and the segmentation accuracy of tubular structures. Future work needs to address these challenges to achieve more accurate segmentation of tubular structures. To that end, we will focus on semi-supervised methods that primarily tackle issues relating to image quality and labeling limitations. For instance, self-training can be used to repeatedly generate high-confidence pseudo-labels. Alternatively, perturbation-consistency semi-supervised training methods can be used to address such issues and improve segmentation accuracy.
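One possible form of the pseudo-labeling step in self-training is sketched below: voxels whose predicted foreground probability is confidently high or low receive a pseudo-label, and the remaining voxels are masked out of the loss. The threshold of 0.9, the function names, and the masked cross-entropy are assumptions made for illustration; this is not a procedure reported in this work.

```python
import numpy as np


def pseudo_labels(prob, threshold=0.9):
    """Turn foreground probabilities into pseudo-labels plus a confidence mask.

    prob      : array of foreground probabilities in [0, 1]
    threshold : assumed confidence cut-off (hypothetical value)
    Returns (labels, mask); mask is True where the prediction is confident.
    """
    prob = np.asarray(prob, dtype=float)
    confidence = np.maximum(prob, 1.0 - prob)  # confidence in the chosen class
    mask = confidence >= threshold
    labels = (prob >= 0.5).astype(np.uint8)
    return labels, mask


def masked_bce(prob, labels, mask, eps=1e-7):
    """Binary cross-entropy averaged over confident voxels only."""
    p = np.clip(np.asarray(prob, dtype=float), eps, 1.0 - eps)
    loss = -(labels * np.log(p) + (1 - labels) * np.log(1.0 - p))
    return float(loss[mask].mean()) if mask.any() else 0.0


# Predictions on an unlabeled image: only the first and last voxel are confident.
teacher_prob = np.array([0.95, 0.60, 0.40, 0.02])
labels, mask = pseudo_labels(teacher_prob)
```

In a self-training loop, the model would be retrained on the labeled data together with these masked pseudo-labels, and the labels would be regenerated after each round.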

Conclusions
This research paper introduces a novel approach for accurately segmenting tubular structures such as cerebrovascular and airway structures. To address the challenges posed by complex tubular objects, we employed a combination of sliding window sequential cropping and random cropping strategies to increase the number of training samples and leverage the available image features effectively. Additionally, we proposed a unique U-Net-based framework that incorporates multi-scale feature aggregation, reverse attention, and sparse convolution. A comprehensive experimental analysis was conducted to evaluate the efficacy of different components, including data pre-processing, model framework, and post-processing techniques.
The introduction of multi-scale feature aggregation enables the network to learn and adapt to different shapes and thicknesses of tubular structures at varying scales, enhancing the overall feature learning process. Incorporating reverse attention allows the model to dynamically emphasize edge features, improving the extraction of positive samples and edge details. Furthermore, integrating Inception sparse convolution enhances the network's receptive field and feature representation without significantly increasing model complexity.
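For readers unfamiliar with reverse attention, the sketch below shows one common formulation: a coarse prediction is inverted with 1 − sigmoid(·) and used to reweight the feature map, so that regions already predicted confidently as foreground are suppressed and the remaining regions, such as uncertain edges, are emphasized. The array shapes and function names are assumptions, and the module used in UARAI may differ in its details.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def reverse_attention(features, coarse_logits):
    """Reweight features by the complement of a coarse foreground prediction.

    features      : array of shape (C, D, H, W)
    coarse_logits : array of shape (D, H, W), pre-sigmoid coarse prediction
    """
    weight = 1.0 - sigmoid(coarse_logits)   # high where foreground is NOT yet confident
    return features * weight[None, ...]     # broadcast the weight over channels


# Three voxels: confident foreground, uncertain (edge-like), confident background.
features = np.ones((2, 1, 1, 3))
coarse_logits = np.array([[[10.0, 0.0, -10.0]]])
refined = reverse_attention(features, coarse_logits)
```

The confident-foreground voxel is scaled to nearly zero, while the uncertain voxel keeps half of its activation, which is how the mechanism steers later layers toward details that the coarse prediction has not yet captured.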
Extensive experiments were conducted on cerebrovascular and airway datasets, demonstrating promising results. The proposed UARAI model achieved impressive Dice and IoU scores of 90.31% and 82.35% (cerebrovascular) and 93.34% and 87.60% (airways), respectively. Comparative analysis with existing advanced methods showcased the superior segmentation accuracy of our proposed model. Consequently, our proposed method can be regarded as an effective approach for tubular structure segmentation, offering advancements in accuracy and paving the way for improved medical image analysis and diagnosis.
Institutional Review Board Statement: All patients provided written informed consent, and this study was approved by the Guangzhou Medical University Ethics Committee (grant number: 2009-09, Approval Date: 5 August 2009) and registered at http://www.chictr.org.cn (registration number: ChiCTR2000034586, accessed on 11 July 2020).
Informed Consent Statement: Patient consent was waived due to the retrospective design of data collection.
Data Availability Statement: Some data comes from public datasets: https://data.kitware.com/#collection/591086ee8d777f16d01e0724, accessed on 20 September 2021, and private data is not applicable.