DEHA-Net: A Dual-Encoder-Based Hard Attention Network with an Adaptive ROI Mechanism for Lung Nodule Segmentation

Measuring pulmonary nodules accurately can aid the early diagnosis of lung cancer, which can increase the survival rate among patients. Numerous techniques for lung nodule segmentation have been developed; however, most of them either rely on a 3D volumetric region of interest (VOI) input by radiologists or use a fixed 2D region of interest (ROI) for all slices of a computed tomography (CT) scan. These methods only consider the presence of nodules within the given VOI, which limits the networks’ ability to detect nodules outside the VOI; the VOI can also encompass unnecessary structures, leading to potentially inaccurate segmentation. In this work, we propose a novel approach for 3D lung nodule segmentation that utilizes a single 2D ROI provided by a radiologist or a computer-aided detection (CADe) system. Concretely, we developed a two-stage lung nodule segmentation technique. First, we designed a dual-encoder-based hard attention network (DEHA-Net), in which the full axial slice of a thoracic CT scan, along with an ROI mask, is taken as input to segment the lung nodule in the given slice. The output of DEHA-Net, the segmentation mask of the lung nodule, is passed to the adaptive region of interest (A-ROI) algorithm to automatically generate the ROI masks for the surrounding slices, which eliminates the need for any further input from radiologists. After extracting the segmentation along the axial axis, in the second stage, we further investigate the lung nodule along the sagittal and coronal views by employing DEHA-Net. All the estimated masks are fed into a consensus module to obtain the final volumetric segmentation of the nodule. The proposed scheme was rigorously evaluated on the Lung Image Database Consortium and Image Database Resource Initiative (LIDC-IDRI) dataset, and an extensive analysis of the results was performed.
The quantitative analysis showed that the proposed method not only improved upon the existing state-of-the-art methods in terms of the dice score but also showed significant robustness against different types, shapes, and dimensions of lung nodules. The proposed framework achieved an average dice score, sensitivity, and positive predictive value of 87.91%, 90.84%, and 89.56%, respectively.


Introduction
Lung cancer is the deadliest cancer type, and early detection is crucial for potentially life-saving treatment. The accurate quantification of pulmonary nodules, which may be associated with various conditions but are often indicative of lung cancer, is essential for the continuous monitoring of lung nodule volume to assess the malignancy and predict the likelihood of lung cancer [1,2]. However, the manual segmentation of nodules, which represents a necessary step in calculating their volume, is a laborious and time-consuming process that can also introduce variability between and within observers [3].
Computer-aided diagnosis (CAD) systems can significantly enhance the productivity of radiologists by assisting them in overcoming the challenges associated with the manual segmentation of pulmonary nodules. CAD systems consist of two subsystems, i.e., computer-aided detection (CADe) [4] and computer-aided diagnosis (CADx) [5]. The CADe system aims to distinguish between nodules and other structures, such as tissues and blood vessels. The CADx system then evaluates the detected nodules and determines whether they are benign or malignant tumors. The primary goal of these CAD systems is to improve the accuracy and efficiency of cancer diagnosis by radiologists. They are designed to assist decision making by providing additional information and reducing the time needed to interpret CT images. This work is focused on developing the CADx system for accurate lung nodule segmentation, which is challenging due to variable shapes, different sizes, and complicated surrounding tissues in the lung region. Various automatic segmentation frameworks for nodule quantification have been devised; such techniques consist of traditional image-processing-based methods and deep-learning-based approaches [6]. However, the significant heterogeneity of lung nodules, particularly the variations in the shape and contrast of lung nodules, hinders the development of a robust nodule segmentation framework. These variations, both within and between nodules, as well as the visual similarity between nodules and their surrounding non-nodule tissue, necessitate the use of a 3D volume of interest (VOI) as input to estimate the shape of the nodule accurately. Figure 1 demonstrates the intra-nodule and inter-nodule variations, showcasing the diversity between the forms of different nodules and the variations present within individual nodules. Providing a 3D VOI is quite a time-consuming and laborious task, as the radiologist has to specify the region of interest at each slice containing the nodule. 
A few studies have resolved this issue by utilizing a fixed ROI for all the slices; this approach requires only one ROI input from the user, which significantly reduces the time and hassle. However, employing a fixed ROI adds redundant non-nodular regions to the input ROI, leading to poor segmentation performance.
To address the issues related to using 3D VOIs as input and fixed ROIs, in our previous work [5], we proposed a novel approach using dynamic ROIs for the accurate volumetric segmentation of pulmonary nodules. To determine the dynamic ROIs, we proposed an adaptive region of interest (A-ROI) algorithm that utilizes a single 2D ROI provided by radiologists [5] to estimate the dynamic ROIs in the surrounding slices. This approach begins by segmenting the nodule in the initially provided ROI by employing residual-UNet and then utilizes the segmentation mask to determine the ROIs for the surrounding slices to extend the nodule segmentation in both directions. Concretely, the A-ROI algorithm dynamically adjusts the position and size of the bounding box for the adjacent slices to investigate the penetration of the nodule in the other slices.
The technique demonstrated exceptional performance and outperformed the previous state-of-the-art methods. However, this previous approach required cropping the ROI, which can cause problems due to the inconsistent size of nodules, for instance, if the ROIs are too small or larger than the normalized dimensions used to input into the network. Similarly, the mask obtained after inference must be resized to match the original cropped ROI size, introducing error when interpolation is used to achieve the target dimensions. To address these issues, we propose a dual-encoder-based architecture that takes two inputs: the original slices and the ROI mask, eliminating the need to rescale the ROIs before and after inference. The A-ROI algorithm was then used to further produce ROI masks for surrounding slices for which ROIs were not provided. Specifically, the A-ROI algorithm was applied along the axial plane to provide an initial estimation of nodule shape, which was then used to extract a 3D VOI from the scan automatically. The extracted VOI was further utilized to create the coronal and sagittal views of the nodule, and the slices from these views were analyzed using two different dual-encoder-based architectures. Finally, a consensus module was employed to ensemble the three predictions from axial, coronal, and sagittal view models. Several experiments were performed on the LIDC dataset [7] to demonstrate the effectiveness of the proposed technique in terms of overall performance and robustness relative to the variations in the type and size of lung nodules.

Related Work
An accurate assessment of lung nodules is essential for evaluating their potential malignancy and the likelihood of being indicative of lung cancer. Subsequently, numerous researchers have made extensive efforts to devise an efficient nodule segmentation framework to assist radiologists. These studies can be classified into two categories, i.e., conventional image-processing-based methods and advanced deep-learning-based techniques [6].
Jamshid et al. [8] proposed a framework that segmented the nodule by employing region-growing techniques, such as contrast-based region growing and fuzzy connectivity region growing, and created a volumetric mask using a local adaptive segmentation algorithm that distinguishes between foreground and background regions within a specified window size. While the algorithm demonstrated good performance for isolated nodules, it could not effectively segment attached ones. Using geodesic influence zones in a multi-threshold image representation, Stefano et al. [9] offered a user-interactive algorithm that meets the fusion-segregation criterion based on both gray-level similarity and object shape. They extended their work in another study [10] by eliminating the need for user interaction: a correction procedure based on a 3D local shape analysis allowed for the refinement of an initial nodule segmentation to distinguish possible vessels from the nodule itself without requiring input from the user. Rendon et al. [11] used morphological and threshold approaches to eliminate extraneous structures from a given ROI. The last step was to use a support vector machine (SVM) to classify each pixel in the resulting space.
Although classical image-processing-based techniques achieve accurate lung nodule segmentation, such techniques are susceptible to the types of nodules. In contrast, recent deep-learning-based methods have made vast inroads into many medical imaging applications, such as disease classification [12] and segmentation [13,14], including lung nodule segmentation [15]. The introduction of the UNet [16] architecture for medical image segmentation, in particular, has dramatically enhanced the performance of various crucial tasks, such as tumor segmentation [17]. As a result, there has been an increased focus on using deep learning for lung nodule segmentation. In [18], Tyagi et al. proposed a 3D conditional generative adversarial network (GAN) for lung nodule segmentation. They utilized the UNet architecture as the backbone of the GAN and employed a simple classification network as a discriminator, incorporating spatial squeeze and channel excitation modules to differentiate between real and fake segmentations. Similarly, Wang et al. [19] developed a method for nodule segmentation called central-focused convolutional neural networks (CF-CNNs). This approach uses a volumetric patch centered around the voxel of interest as input to the model. In addition, the same team [20] also published a multi-view CNN that can perform nodule segmentation using input from different views (axial, coronal, and sagittal) of the same voxel. One potential limitation of this method is that the patch extraction process is the same for all nodules, which could lead to incorrect segmentation if the nodule is larger than the patch. By using skip connections in the encoder and decoder paths, Tong et al. [21] enhanced the performance of UNet for nodule segmentation; however, the model was only intended for 2D segmentation. Hancock et al. [22] put forth a variation on the standard level-set image segmentation technique in which, as opposed to being manually created, the velocity function is learned from data using regression machine learning techniques. They reported slightly improved performance when applying this approach to the segmentation of lung nodules. Chen et al. [23] proposed an end-to-end multi-task learning framework that consists of joint classification and multi-channel segmentation networks. Both networks utilize the same latent representation learned by the common encoder branch, improving lung segmentation performance. The study also incorporated an enhanced version of patches generated using the OTSU and SLIC methods. To extract local characteristics and detailed contextual information from lung nodules, Liu et al. [24] used a residual-block-based dual-path network, which significantly improved performance. However, they employed a fixed VOI, which restricts the nodule search and lowers 3D segmentation performance. To avoid this issue, Chen et al. [25] proposed a fast multi-crop guided attention (FMGA) network for lung nodule segmentation by incorporating 2D- and 3D-cropped ROIs. They applied a greedy search algorithm to explore the penetration of lung nodules into the surrounding slices. Their framework also exploited a customized loss function, enabling the network to focus on improving the segmentation of nodule borders. Their results demonstrated the robustness of the proposed framework; however, the scheme failed to improve on state-of-the-art methods in terms of the overall dice score.
In our previous work [5], we addressed the limitations of a fixed volume of interest (VOI) by introducing the concept of an adaptive 2D region of interest (ROI) in each slice, which significantly improved the ability to utilize deep learning. Most notably, cropped ROIs were fed to a deep residual UNet [26], which demonstrated promising performance along with several limitations. In particular, due to the heterogeneity of lung nodules, numerous variations in dimensions are possible, which makes it impossible to find optimal input dimensions for the network. Consequently, the cropped ROI has to be resized, by upsampling or downsampling, which affects the performance of the framework. One possible alternative is to train various models with different input dimensions; however, this comes with an immense increase in computational cost, which hinders real-time clinical application. For instance, Zhang et al. [27] proposed a multi-scale segmentation squeeze-and-excitation UNet with a conditional random field to segment the nodule in a given volume of interest. They extracted VOIs at four different scales, trained four different networks, and finally applied a conditional random field to merge the four predictions. Their framework increased the computational complexity and only covered four scales defined according to the dimensions available in the given dataset, which is insufficient to cover the possible diversity in the size of lung nodules in real-time clinical applications. To overcome the aforementioned issue, in this work, we propose a dual-encoder-based architecture that incorporates the ROI mask as a hard attention input, which enables the framework to avoid pre- and post-inference resizing and leads to improved performance.

Dataset
In this work, we used the Lung Image Database Consortium and Image Database Resource Initiative (LIDC-IDRI) database [7,28], which is the largest publicly available resource for lung CT scans. This dataset was created to facilitate the development of computer-aided systems for evaluating lung nodule identification, categorization, and quantification. The LIDC-IDRI comprises 1018 diagnostic and screening thoracic CT scans for lung cancer from 1010 individuals with annotated lesions. Each thoracic CT scan underwent a two-phase annotation process performed by four qualified radiologists. As in earlier studies [5,29,30], in this work, we also considered nodules with a minimum diameter of 3 mm and annotations from all four radiologists. The ground-truth border for pulmonary nodule segmentation was created using a 50% consensus criterion [31], owing to the variability among the four radiologists, and the Python module pylidc was employed. A total of 893 nodules from the LIDC dataset were selected and randomly distributed into 40%, 5%, and 55% sets, which were, respectively, used as training, validation, and test sets.
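The 50% consensus criterion can be sketched as follows. This is a minimal illustration, not the exact pipeline: `consensus_mask` is a hypothetical helper, and in practice the per-radiologist boolean masks would be rasterized from the pylidc annotation contours.

```python
import numpy as np

def consensus_mask(annotation_masks, threshold=0.5):
    """Combine per-radiologist boolean masks into one ground-truth mask.

    A voxel is kept as foreground when at least `threshold` (here 50%)
    of the radiologists who annotated the nodule marked it as nodule.
    """
    stacked = np.stack([m.astype(np.float32) for m in annotation_masks])
    agreement = stacked.mean(axis=0)      # fraction of raters per voxel
    return agreement >= threshold
```

For example, a voxel marked by two of four radiologists reaches exactly 50% agreement and is therefore included in the ground truth.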

Data Pre-processing
The pre-processing of CT images can significantly improve the network's performance by reducing the influence of noise and irrelevant tissues. Normalizing the image can reduce the network's dependence on parameter initialization, smooth the optimization process, and, subsequently, enhance the probability of convergence. Concretely, grayscale thresholding was applied to normalize the intensity range, which helped to suppress irrelevant, redundant information. This enabled the network to attend to the relevant tissue and reduced the complexity of the input data, making the network's training more efficient and effective.
We also normalized the intensity values to the range from 0 to 1 by using the window center and window width tags from the corresponding DICOM files [32]. The normalization can be defined as follows:

I_n = clip((I − (WC − WW/2)) / WW, 0, 1),

where I, I_n, WC, and WW represent the original image, normalized image, window center, and window width, respectively. The values of the window center and window width are extracted from the DICOM tags [32]. The LIDC collection includes scans obtained from numerous sites and scanners; consequently, it has a variety of pixel spacings and slice thicknesses. These variables are crucial for nodule appearance. In particular, slice thickness significantly impacts the coronal and sagittal views. The slice thickness in most LIDC scans, which spans from 0.45 mm to 5.0 mm, is larger than the pixel spacing. Therefore, to enhance the visibility of nodules in all three views, the slice thickness was reduced to the corresponding pixel spacing by upsampling the scan along the z-axis. The pixel spacing remained unchanged, as it was less than one for each scan, producing an axial view of the nodules at a reasonable resolution.
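The windowing normalization can be sketched in a few lines; the function name is our own, and clipping values outside the window to [0, 1] follows the standard DICOM windowing convention.

```python
import numpy as np

def window_normalize(image, wc, ww):
    """Normalize CT intensities to [0, 1] using the DICOM window
    center (WC) and window width (WW) tags.

    The window [WC - WW/2, WC + WW/2] is mapped linearly onto [0, 1];
    values outside the window are clipped.
    """
    lower = wc - ww / 2.0
    normalized = (image.astype(np.float32) - lower) / ww
    return np.clip(normalized, 0.0, 1.0)
```

For instance, `window_normalize(hu, wc=-600, ww=1500)` would apply a typical lung window to a Hounsfield-unit slice.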
In contrast to previous studies [19–21,33], which produced the training samples by employing a constant margin scheme, in this work, we utilized ROIs with random margins on each side, as in [5]. To train our DEHA-Net architecture, we generated ROI masks by using the ground-truth nodule masks. To force our model to learn about the absence of lung nodules in a given slice, we also included two non-nodular slices from both sides of each nodule.
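The random-margin ROI generation can be sketched as below; the margin range and the helper name are illustrative assumptions, not the authors' exact settings.

```python
import numpy as np

def random_margin_roi(nodule_mask, max_margin=10, rng=None):
    """Build a rectangular ROI mask around a ground-truth nodule mask,
    padding each side by an independent random margin (in pixels).
    """
    rng = np.random.default_rng() if rng is None else rng
    ys, xs = np.nonzero(nodule_mask)
    y0, y1 = ys.min(), ys.max()
    x0, x1 = xs.min(), xs.max()
    h, w = nodule_mask.shape
    m = rng.integers(0, max_margin + 1, size=4)   # one margin per side
    roi = np.zeros_like(nodule_mask, dtype=bool)
    roi[max(y0 - m[0], 0):min(y1 + m[1] + 1, h),
        max(x0 - m[2], 0):min(x1 + m[3] + 1, w)] = True
    return roi
```

By varying the margins independently, the network never sees the nodule at a fixed position or scale inside the ROI.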

Dual-Encoder-Based Hard Attention Network with Adaptive ROI Mechanism
The proposed framework utilized a novel dual-encoder-based hard attention network (DEHA-Net) with an adaptive ROI (A-ROI) mechanism. The overall framework is illustrated in Figure 2. In the first stage, the 2D ROI, provided manually by a radiologist or by a computer-aided detection (CADe) system, was processed by DEHA-Net along the axial axis. The A-ROI algorithm was applied to generate the ROIs for the remaining surrounding slices, which enabled the investigation of the nodules along the axial view in order to reconstruct the 3D mask of the nodule. In the second stage, the 3D mask constructed after the axial analysis was exploited to generate the ROIs along the sagittal and coronal views. Then, we applied the proposed DEHA-Net along the sagittal and coronal views with predefined ROIs generated from the 3D mask obtained at the first stage. Finally, a consensus module was utilized to produce the final 3D segmentation mask of the nodule. It is important to note that in the whole pipeline, no resizing was performed, thus eliminating the issues associated with the rescaling of a given input and network output. This enabled our network to achieve improved performance and made it more robust to size variations in various nodules. The following subsections describe the details of the proposed DEHA-Net and the A-ROI algorithm.

Figure 2. The proposed framework consists of two stages. In the first stage, the user or CADe system provides the ROI along the axial axis, and the DEHA-Net (dual-encoder-based hard attention network with self-hard attention) and adaptive ROI algorithm are used to determine the ROIs in the surrounding slices to perform the 3D segmentation. In the second stage, the sagittal and coronal views are created to segment the nodule. Finally, three segmentation predictions are fed into the consensus module to produce the final 3D segmentation mask.


Dual-Encoder-Based Hard Attention Network
Lung nodules vary in shape and dimension, making it impossible to set a single suitable input dimension for the network. To overcome this issue, we designed a dual-encoder-based hard attention network (DEHA-Net) that incorporated two inputs, i.e., the slice containing the nodule and the ROI mask, to accurately segment the nodule in the given slice. Specifically, the ROI mask provided hard attention, which enabled the network to focus on only the provided region of interest. The proposed DEHA-Net consisted of two encoders and one decoder branch, as demonstrated in Figure 3. Each encoder was connected to the decoder with residual connections from four different levels. Both encoders had identical architectures, consisting of four levels. At the nth level, there was a convolution layer of 32 × n² filters with a kernel size of 3 × 3, followed by rectified linear (ReLU) activation to add non-linearity. After the ReLU, there was a batch normalization layer and then a max-pooling layer, which compressed the information. These four layers made up a single level of an encoder.
The first encoder extracted the features from the CT scan images, while the second encoder enforced the hard attention learned from the ROI mask of the nodule. Its primary purpose was to maintain the network's focus on the nodule's location. The decoder output the segmentation mask of the nodule for the current slice, along with the ROI for the next and previous slices. The decoder likewise consisted of four levels, each consisting of a concatenation layer followed by a convolution layer. After that, rectified linear activation was applied for non-linearity, followed by a batch normalization layer, and finally, the features were upsampled. In the last level of the decoder, the upsampling layer was replaced by a convolution layer with a single filter and SoftMax activation. Each concatenation layer of the decoder concatenated the features from the corresponding level of both encoders and the previous decoder level before passing them to the succeeding layers.
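The dual-encoder layout described above can be sketched with the Keras functional API. This is a minimal approximation, not the authors' exact implementation: the filter counts follow the "32 × n²" description, the decoder ordering is adapted so that the spatial sizes of the skip connections match, and a sigmoid output is used in place of the stated single-filter SoftMax (which would be constant over one channel).

```python
import tensorflow as tf
from tensorflow.keras import layers

def encoder_level(x, filters):
    # One encoder level: Conv -> ReLU -> BatchNorm, then MaxPool.
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.BatchNormalization()(x)
    skip = x                           # skip connection into the decoder
    x = layers.MaxPooling2D(2)(x)
    return x, skip

def build_deha_net(size=512, levels=4):
    image = tf.keras.Input((size, size, 1))   # full CT slice
    roi = tf.keras.Input((size, size, 1))     # hard-attention ROI mask
    x1, x2, skips = image, roi, []
    for n in range(1, levels + 1):
        f = 32 * n ** 2                       # per-level filter count
        x1, s1 = encoder_level(x1, f)         # image encoder
        x2, s2 = encoder_level(x2, f)         # ROI-mask encoder
        skips.append((s1, s2))
    x = layers.Concatenate()([x1, x2])        # fuse both encoder bottlenecks
    for n in range(levels, 0, -1):
        x = layers.UpSampling2D(2)(x)
        s1, s2 = skips[n - 1]
        x = layers.Concatenate()([x, s1, s2]) # features from both encoders
        x = layers.Conv2D(32 * n ** 2, 3, padding="same", activation="relu")(x)
        x = layers.BatchNormalization()(x)
    out = layers.Conv2D(1, 1, activation="sigmoid")(x)
    return tf.keras.Model([image, roi], out)
```

The model consumes the full slice plus the ROI mask at the original resolution, so no cropping or resizing of the input is required.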

Adaptive ROI Algorithm
The adaptive ROI (A-ROI) algorithm was proposed in [5]; it enables the network to investigate the presence of a nodule in the surrounding slices without requiring ROIs from the user. Concretely, the A-ROI algorithm exploits the segmented mask of the nodule in the current (nth) slice, generated by the network, to estimate the ROI for the successive slices (i.e., n ± 1). In this work, we employed the A-ROI algorithm to complement the proposed DEHA-Net in performing the 3D segmentation of lung nodules. A-ROI utilizes a hyperparameter R_T ∈ (0, 1) to moderate the margins around the nodule in the generated ROI masks.
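The core of the A-ROI mask generation can be approximated as below. The exact margin rule of [5] is not reproduced here, so treat the proportional-margin expansion and the use of R_T as an assumption for illustration.

```python
import numpy as np

def adaptive_roi(seg_mask, r_t=0.3):
    """Estimate the ROI mask for an adjacent slice from the current
    segmentation mask (an approximation of the A-ROI rule in [5]).

    The nodule's bounding box is expanded on every side by a margin
    proportional to the box size, controlled by r_t in (0, 1).
    Returns an all-zero mask when the segmentation is empty, which
    terminates the slice-wise search.
    """
    roi = np.zeros_like(seg_mask, dtype=bool)
    if not seg_mask.any():
        return roi
    ys, xs = np.nonzero(seg_mask)
    y0, y1, x0, x1 = ys.min(), ys.max(), xs.min(), xs.max()
    my = int(np.ceil(r_t * (y1 - y0 + 1)))    # vertical margin
    mx = int(np.ceil(r_t * (x1 - x0 + 1)))    # horizontal margin
    h, w = seg_mask.shape
    roi[max(y0 - my, 0):min(y1 + my + 1, h),
        max(x0 - mx, 0):min(x1 + mx + 1, w)] = True
    return roi
```

Because the box both moves and rescales with the predicted mask, the ROI follows the nodule from slice to slice instead of staying fixed.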
The full impact of the A-ROI algorithm is demonstrated in Figure 4. Two constant ROIs are shown in the first row in red and blue, which remain fixed throughout all the slices: the one with tight margins failed to cover the nodule in the surrounding slices, while the other, with wider margins, added redundant area, thus confusing the network. By contrast, the second row shows the dynamic ROI produced by the A-ROI algorithm. The columns show the different slices; (a) represents the slice where the user provides the first ROI, and (b–f) demonstrate the adjacent slices.
The proposed framework for generating the 3D segmentation mask of lung nodules along the axial view is described in Algorithm 1, which illustrates the steps followed to investigate nodule penetration along the axial axis. The ROI provided by the radiologist or CADe system in the n_i th slice is represented by RoI_{n_i} and is used as RoI_n to initiate the segmentation. The normalized slice, I_n, and the provided ROI are fed to DEHA-Net, denoted by Θ. The segmentation mask of the nodule generated by DEHA-Net is then inputted into the A-ROI algorithm to produce the ROI mask for the next slice. The next slice can be in either direction, i.e., forward or backward. The same cycle is repeated until the next ROI mask becomes blank.
Algorithm 1: The algorithmic steps followed in the proposed framework for nodule investigation along the axial view.
1: n = n_i, RoI_n = RoI_{n_i}
2: while Σ RoI_n > 0 do
3:    Seg_n = Θ(I_n, RoI_n)
4:    n ← n ± 1
5:    RoI_n ← AROI(Seg_n, R_T)
6: end while
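The control flow of Algorithm 1 can be sketched as follows; `model` stands in for DEHA-Net (Θ) and `aroi` for the A-ROI step, both passed in as callables, so the names and signatures here are illustrative.

```python
import numpy as np

def segment_axial(volume, roi_init, n_i, model, aroi, r_t=0.3):
    """Algorithm 1: propagate segmentation from the user-provided slice
    n_i in both directions until the generated ROI mask becomes empty.

    `model(slice_2d, roi_mask)` -> binary mask (DEHA-Net stand-in);
    `aroi(seg_mask, r_t)`       -> ROI mask for the adjacent slice.
    """
    seg3d = np.zeros_like(volume, dtype=bool)
    for step in (+1, -1):                 # forward, then backward
        n, roi = n_i, roi_init
        while roi.sum() > 0 and 0 <= n < volume.shape[0]:
            seg = model(volume[n], roi)   # Seg_n = Theta(I_n, RoI_n)
            seg3d[n] |= seg
            n += step                     # n <- n +/- 1
            roi = aroi(seg, r_t)          # RoI_n <- AROI(Seg_n, R_T)
    return seg3d
```

An empty predicted mask yields an empty ROI, so the loop naturally stops one slice beyond the nodule in each direction.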


Experimental Setup and Implementation Details

Ensembling Mechanism
The proposed framework utilized a consensus module to ensemble the segmentation results obtained from the axial, sagittal, and coronal axes. The consensus value E_i of the ith voxel is calculated as follows:

E_i = 1 if (1/K) Σ_{j=1}^{K} S_ij ≥ τ, and E_i = 0 otherwise,

where S_ij represents the prediction for the ith voxel from the jth model, and K denotes the number of models, which in our case was three, i.e., axial, sagittal, and coronal. τ is a threshold determined on the validation set.
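A sketch of the consensus rule, under the assumption that voting is a thresholded voxel-wise average of the K = 3 binary predictions:

```python
import numpy as np

def consensus(predictions, tau=0.5):
    """Ensemble K binary segmentation volumes (axial, sagittal, coronal)
    by thresholded voxel-wise voting: E_i = 1 when the mean prediction
    over the K models reaches the threshold tau.
    """
    stacked = np.stack(predictions).astype(np.float32)   # shape (K, ...)
    return stacked.mean(axis=0) >= tau
```

With tau = 0.5 and K = 3, this reduces to a simple majority vote; tuning tau on the validation set shifts the rule toward stricter or looser agreement.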

Loss Function
To train the proposed DEHA-Net, we utilized the dice similarity coefficient (DSC) [34] loss, which can be defined as follows:

L(Θ) = 1 − (1/N) Σ_{i=1}^{N} [ 2 |Θ(I_i, RoI_i) ∩ S^g_i| / (|Θ(I_i, RoI_i)| + |S^g_i|) ],

where Θ, S^g_i, and N represent the model, the ground-truth segmentation mask, and the number of samples in the training set, respectively. We used stochastic gradient descent (SGD) to train our network.
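A soft Dice loss consistent with the formulation above can be sketched per sample as follows; the `eps` smoothing term is our addition to avoid division by zero on empty masks.

```python
import numpy as np

def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss for one sample: 1 - DSC between the predicted
    probabilities and the binary ground-truth mask.
    """
    pred = pred.astype(np.float32).ravel()
    target = target.astype(np.float32).ravel()
    intersection = (pred * target).sum()
    return 1.0 - (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)
```

Averaging this quantity over the N training samples gives the batch loss minimized by SGD.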

Implementation Details and Training Strategy
We used the Keras [35] framework to implement the proposed DEHA-Net and minimized the error defined in Equation (6) using an SGD scheme. The model was trained on an Nvidia Tesla V100 Tensor Core GPU with 12,821 images of size 512 × 512 and a batch size of 8. Training was initiated from random weights with an initial learning rate of 0.001 and a momentum of 0.9, with decay of the learning rate. We used early stopping with a patience value of 10 epochs to avoid overfitting.
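The early-stopping policy described above (stop after 10 epochs without validation improvement) can be sketched generically as follows; this is an illustration of the logic, not the Keras callback itself:

```python
class EarlyStopping:
    """Minimal early-stopping monitor: stop training once the validation
    loss has not improved for `patience` consecutive epochs."""

    def __init__(self, patience=10):
        self.patience = patience
        self.best = float("inf")
        self.wait = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True to stop training."""
        if val_loss < self.best:
            self.best, self.wait = val_loss, 0
            return False
        self.wait += 1
        return self.wait >= self.patience
```

In Keras, the equivalent behavior is obtained with the built-in `EarlyStopping` callback configured with `patience=10` on the validation loss.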

Performance Measures
We considered the following three evaluation parameters to rigorously evaluate the performance of the proposed framework.
• Dice Similarity Coefficient: We used the dice similarity coefficient (DSC) [19,36], which measures the degree of overlap between the ground-truth mask and the predicted mask. The DSC values range from 0 to 1, where 1 and 0 indicate complete overlap and no overlap, respectively. It can be defined as follows:

DSC = \frac{2\,|\hat{Y} \cap Y|}{|\hat{Y}| + |Y|}

where Ŷ and Y are the predicted segmentation mask and the reference segmentation mask, respectively.
• Sensitivity: To measure the pixel classification performance of the proposed framework, we used sensitivity (SEN), which can be defined as follows:

SEN = \frac{TP}{TP + FN}

where TP and FN denote the numbers of true-positive and false-negative pixels, respectively.

• Positive Predictive Value (PPV): To measure the correctness of the segmentation area produced by the proposed framework, we used the positive predictive value (PPV), which can be defined as follows:

PPV = \frac{TP}{TP + FP}

where TP and FP denote the numbers of true-positive and false-positive pixels, respectively.
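The three metrics above can be computed directly from flattened binary masks; the helper below is a small illustration (not from the paper), with the degenerate all-empty cases defaulting to 1.0:

```python
def segmentation_metrics(pred, truth):
    """DSC, sensitivity (SEN), and PPV for flattened binary masks (0/1 ints)."""
    tp = sum(p & t for p, t in zip(pred, truth))        # true positives
    fp = sum(p & (1 - t) for p, t in zip(pred, truth))  # false positives
    fn = sum((1 - p) & t for p, t in zip(pred, truth))  # false negatives
    dsc = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 1.0
    sen = tp / (tp + fn) if (tp + fn) else 1.0
    ppv = tp / (tp + fp) if (tp + fp) else 1.0
    return dsc, sen, ppv
```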

Overall Performance Analysis
We evaluated our proposed framework on the parameters described in Section 4.3 and compared its performance with previously published studies. Table 1 summarizes the results achieved by using our framework on the test set along with the reported performance of existing studies. It demonstrates that our proposed architecture outperforms the existing methods in terms of the dice score while also having the lowest standard deviation, which depicts its robustness against the variations in the type and size of lung nodules. In comparison with our previous work [5], which utilized the cropped ROI input, our current approach offers improved performance with a lower standard deviation. This can be attributed to the incorporation of ROI masks into a dual-encoder-based architecture, which eliminates the necessity to crop and normalize the input slice. It also signifies the effectiveness of the incorporation of the A-ROI algorithm in the proposed scheme to estimate the ROI masks for the surrounding slices of a given input slice.

Robustness Analysis
The LIDC dataset includes annotations that describe various characteristics of nodules, such as their subtlety, internal structure, calcification, sphericity, margin, lobulation, speculation, texture, and malignancy. These characteristics represent different levels of difficulty in detecting the boundaries of nodules. To evaluate the effectiveness of our method, we divided the test data into groups based on each characteristic and analyzed the results for each group. Table 2 presents the dice scores for each group, which demonstrates that our framework performs consistently in each group, and promising results can be obtained on all types of lung nodules. This can be attributed to the hard attention mechanism, which enables the proposed DEHA-Net to only focus on the given ROI region while leveraging the surrounding information to distinguish the nodule.
Further, to illustrate the robustness of the proposed method, a histogram of the distribution of the dice scores on the test set of the LIDC dataset is shown in Figure 5. The majority of test instances have a score of over 85%, which demonstrates the strong performance of our proposed method.


Figure 5. Dice similarity score distribution obtained on the LIDC testing set.

Qualitative Analysis
To elaborate on the difference in the performances of this framework and our previous work [5], we performed a visual analysis of the results. Figure 6 shows the visual results with axial views on randomly selected nodules of different sizes and types. The results demonstrate that the incorporation of hard attention with the ROI mask in the model significantly improves the segmentation performance. It can also be observed that the resizing of the cropped slice disturbs the boundary of the segmented nodule, which is critical for determining the exact dimensions of the nodule and, subsequently, its malignancy level. The proposed DEHA-Net enables our framework to utilize the full slice without losing minor details of the given input image, which are crucial for the accurate segmentation of lung nodules.

Conclusions
In this work, we proposed a novel dual-stage-based framework that used a 2D slice along with the seed region of interest (ROI), covering the nodule area, to produce the 3D segmentation of the nodule. To segment the nodule in the given slice, we proposed a novel dual-encoder-based hard attention network (DEHA-Net), which utilized the adaptive region of interest (A-ROI) algorithm for estimating the ROI for the surrounding slices. In contrast to the previous studies in which a cropped patch of a given slice was inputted to the network, the proposed DEHA-Net leverages complete 2D contextual information by taking the entire slice as input. This helps DEHA-Net to learn meaningful features that better distinguish between nodule and non-nodular voxels. In the second stage, after obtaining the 3D segmentation of nodules from axial slices, the framework followed the same segmentation scheme for the sagittal and coronal views. Finally, a consensus module was employed to process the results from all three axes to obtain the refined segmentation mask. An extensive evaluation of the proposed framework was performed on the lung image database consortium and image database resource initiative (LIDC/IDRI) dataset, which is the largest publicly available dataset. The quantitative and qualitative results were presented and analyzed, which demonstrate that the technique shows excellent performance by outperforming the existing state-of-the-art methods in terms of the dice similarity score. Furthermore, our results reveal that the framework is significantly robust to the various types and sizes of nodules. Future work includes reducing the computational complexity of the framework to improve its performance in terms of execution time.

Data Availability Statement:
The data used in this study is a public dataset from The Lung Image Database Consortium (LIDC) and Image Database Resource Initiative (IDRI), which can be accessed at https://wiki.cancerimagingarchive.net/pages/viewpage.action?pageId=1966254 (accessed on 9 February 2023). The architecture source code can be provided on request.