CardioNet: Automatic Semantic Segmentation to Calculate the Cardiothoracic Ratio for Cardiomegaly and Other Chest Diseases

Semantic segmentation for diagnosing chest-related diseases like cardiomegaly, emphysema, pleural effusions, and pneumothorax is a critical yet understudied tool for identifying the chest anatomy. A dangerous disease among these is cardiomegaly, in which sudden death is a high risk. An expert medical practitioner can diagnose cardiomegaly early using a chest radiograph (CXR). Cardiomegaly is a heart enlargement disease that can be analyzed by calculating the transverse cardiac diameter (TCD) and the cardiothoracic ratio (CTR). However, the manual estimation of CTR and other chest-related diseases requires much time from medical experts. Based on their anatomical semantics, artificial intelligence estimates cardiomegaly and related diseases by segmenting CXRs. Unfortunately, due to poor-quality images and variations in intensity, the automatic segmentation of the lungs and heart with CXRs is challenging. Deep learning-based methods are being used to identify the chest anatomy segmentation, but most of them only consider the lung segmentation, requiring a great deal of training. This work is based on a multiclass concatenation-based automatic semantic segmentation network, CardioNet, that was explicitly designed to perform fine segmentation using fewer parameters than a conventional deep learning scheme. Furthermore, the semantic segmentation of other chest-related diseases is diagnosed using CardioNet. CardioNet is evaluated using the JSRT dataset (Japanese Society of Radiological Technology). The JSRT dataset is publicly available and contains multiclass segmentation of the heart, lungs, and clavicle bones. In addition, our study examined lung segmentation using another publicly available dataset, Montgomery County (MC). The experimental results of the proposed CardioNet model achieved acceptable accuracy and competitive results across all datasets.


Introduction
The most used and evaluated method of diagnosing chest-related pathologies such as pneumothorax, pulmonary cancer, congestive heart failure, lung nodule, and heart enlargement is the chest X-ray (CXRs) [1]. Heart enlargement is classified as cardiomegaly, one of the serious cardiovascular diseases among the general public [2]. Cardiomegaly can happen from different conditions such as cardiac insufficiency, blood pressure, hypertension, and coronary artery disease. These cardiac concerns affect patients' health ranging from a high risk of heart failure to immediate death [3]. Therefore, the early diagnosis of 2 of 21 cardiomegaly is critical; the disease can be diagnosed with edge detection of the size and shape of the chest and heart from posterior-anterior (PA) CXR. The cardiothoracic ratio (CTR) is a quantitative measure of heart enlargements in CXRs to detect cardiomegaly and the boundaries of other chest organs [4,5].
The CTR is the ratio between the maximum horizontal cardiac diameter and the maximum horizontal thoracic diameter, and the normal range is between 0.42 and 0.50. A value higher than normal (>0.50) is considered cardiomegaly [5]. The manual measurement of the CTR from CXR performed by medical experts requires the domain knowledge of chest physiologies. Cardiologists detect cardiomegaly by measuring the heart's left distance, D L , and right distance, D R , boundaries from the central vertical line of the chest. M is the maximum horizontal distance between the left-and right-side boundaries of the respective lungs, as shown in Figure 1. The method of calculating the CTR for cardiomegaly is expressed in Equation (1). There is the possibility of observational error, and the process is time-consuming. This problem has motivated researchers to develop a computer-aided diagnosis-(CAD) based CTR measurement to diagnose cardiomegaly. Several researchers have automatically measured the CTR and other heart diseases [6,7]. Most of the techniques used to detect the boundaries and size of lungs and heart require the accurate segmentation of anatomical organs. CTR = (D L + D R )/M (1) can happen from different conditions such as cardiac insufficiency, blood pressure, hypertension, and coronary artery disease. These cardiac concerns affect patients' health ranging from a high risk of heart failure to immediate death [3]. Therefore, the early diagnosis of cardiomegaly is critical; the disease can be diagnosed with edge detection of the size and shape of the chest and heart from posterior-anterior (PA) CXR. The cardiothoracic ratio (CTR) is a quantitative measure of heart enlargements in CXRs to detect cardiomegaly and the boundaries of other chest organs [4,5]. The CTR is the ratio between the maximum horizontal cardiac diameter and the maximum horizontal thoracic diameter, and the normal range is between 0.42 and 0.50. A value higher than normal (>0.50) is considered cardiomegaly [5]. The manual measurement of the CTR from CXR performed by medical experts requires the domain knowledge of chest physiologies. Cardiologists detect cardiomegaly by measuring the heart's left distance, D , and right distance, D , boundaries from the central vertical line of the chest. M is the maximum horizontal distance between the left-and right-side boundaries of the respective lungs, as shown in Figure 1. The method of calculating the CTR for cardiomegaly is expressed in Equation (1). There is the possibility of observational error, and the process is time-consuming. This problem has motivated researchers to develop a computer-aided diagnosis-(CAD) based CTR measurement to diagnose cardiomegaly. Several researchers have automatically measured the CTR and other heart diseases [6,7]. Most of the techniques used to detect the boundaries and size of lungs and heart require the accurate segmentation of anatomical organs. In medical imaging, segmentation is extracted with similar properties from the images. The areas of interest, like the lung and heart, are segmented with automated deep learning segmentation from the CXRs. The advancements in convolutional neural networks (CNN) have led to better performance in the image segmentation domain [8]. Multilayer CNNs [9] are used to detect different types of chest diseases and to segment medical tasks. A CNN-based automatic brain segmentation is performed on the MRI brain images to detect the tissues [10]. The authors of [11] proposed a semantic segmentation EG-CNN deep model to accurately detect the edges and boundaries of organs. Semantic segmentation is a pixel-wise classification that labels each pixel of a given class and groups the similar features. Medical images are complex, and this pixel-based semantic segmentation help with efficiently locating the infected areas in an image [12]. Due to low quality and pixel variations in the CXRs, the automatic semantic segmentation of the heart and other chest organs (lung and clavicle bones) is a challenging task. Previous studies addressed this issue and solved it by developing complex neural networks with higher computational resources [13]. In medical imaging, segmentation is extracted with similar properties from the images. The areas of interest, like the lung and heart, are segmented with automated deep learning segmentation from the CXRs. The advancements in convolutional neural networks (CNN) have led to better performance in the image segmentation domain [8]. Multilayer CNNs [9] are used to detect different types of chest diseases and to segment medical tasks. A CNNbased automatic brain segmentation is performed on the MRI brain images to detect the tissues [10]. The authors of [11] proposed a semantic segmentation EG-CNN deep model to accurately detect the edges and boundaries of organs. Semantic segmentation is a pixel-wise classification that labels each pixel of a given class and groups the similar features. Medical images are complex, and this pixel-based semantic segmentation help with efficiently locating the infected areas in an image [12]. Due to low quality and pixel variations in the CXRs, the automatic semantic segmentation of the heart and other chest organs (lung and clavicle bones) is a challenging task. Previous studies addressed this issue and solved it by developing complex neural networks with higher computational resources [13].
In this study, we addressed this challenging issue and proposed a learning-based solution that performs accurate segmentation of lungs and heart to measure CTR from chest PA CXRs. CardioNet is the method proposed and used to determine the presence of cardiomegaly, and the graphical representation is discussed in Section 3. CardioNet is a semantic segmentation network that uses the dense identity features in the architecture to detect the edge properly within a few pixels. This deep model provides fine segmentation within fewer trainable parameters than the CNNs. CardioNet was trained on chest PA CXRs and provided the binary masks to compute the pixels and the positions of the anatomical organs. The semantic segmentation output and the calculated CTR decide the performance evaluation and make the final decision.
The paper's organization is as follows: We review the related work about cardiomegaly and the concept of chest anatomy segmentation based on the conventional handcrafted and deep features in Section 2. In Section 3, we propose a deep methodology for performing the semantic segmentation. Section 4 provides the CXRs data for the experiment and the data generation approach; the training and the testing of the proposed CardioNet model are performed in this section, followed by the experimental evaluation results. The work concludes with challenges in Section 5.

Related Works
There are two methods of segmenting the chest-related organs: handmade and deep features. The handmade feature-based methods utilize various image-processing schemes, where in-depth features are the learned features that depend on deep learning-based semantic segmentation.

Chest Anatomy Segmentation Using Conventional Handmade Features
Handmade features are based on general image-processing approaches to segmenting the chest anatomy from the background. Most of the local feature-based methods do not consider multiclass segmentation of CXRs; rather, they focus on lung region segmentation. Peng et al. presented the Hull-CPLM method to detect the lung region of interest (ROI). The segmentation requires prior preprocessing for coarse segmentation, where the principle curve method is used to refine segmentation [14]. Candemir et al. proposed a nongrid registration-based lung segmentation method that performs the task in three steps: content-based image retrieval, sift-flow modeling for deformable registration, and graph cut optimization for boundary refinement [15]. Jaeger et al. computed three masks of lung segmentation, a probabilistic lung shape model and a Log Gabor mask, where segmentation was obtained by averaging and thresholding for the diagnostic purpose [16].
Jangam et al. presented a hybrid segmentation scheme that utilized an optimized clustering approach to exclude the lung field from the background in CXR images [17]. Vital et al. introduced an automatic system for lung field segmentation. A wavelet enhances the CXR images, and in the second step, OTSU thresholding is combined with mathematical morphology. For the third step, the active contour method improves the performance [18]. Ahmad et al. proposed a Gaussian derivative filter that considered seven different orientations of top segment fields. The segmentation task combines the Gaussian derivative filter with fuzzy c-mean clustering and thresholding [19]. Pattrapisetwong et al. used an unsupervised method for lung region exclusion from the background. They used preprocessing to enhance the image contrast, and then a shadow filter was applied to enhance the outline of the lungs; finally, multilevel thresholding segmented the resultant lung region [20].
Li et al. used graph-based lung segmentation. In detail, the CXR images are divided into several subregions and each region's saliency value, where the cubic spline interpolation is used to obtain fine, smoother boundaries of the lung region [21]. Chen et al. proposed a system that estimates the effusion volume. The segmentation was performed using a 2-D image processing scheme, similar to the Harris corner detector, for enhancement using a convolutional process with a 2 × 2 mask to detect the lung contour [22]. Dawoud presented an iterative framework for lung field segmentation. The method calculates the intensity and shape information, where the main segmentation is handled with iterative thresholding [23]. Saad et al. showed that the edge detection from Sobel, Prewitt, and Laplacian could segment the lung fields; however, the accuracy decreases owing to the image noise. Therefore, combining these edge detectors with morphological operators can produce better results for lung segmentation in CXRs [24]. Chondro et al. proposed a low-order adaptive lung segmentation method based on the growing region. The ROI was obtained by brick-block binarization and morphology, where the boundary refinement was measured by statistical region growing and graph cutting [25]. Chung et al. considered lung segmentation a prerequisite task for diagnosis. The segmentation of the lungs performed by the Bayesian active contour model is an iterative process for segmentation [26].

Chest Anatomy Segmentation Using Deep Feature (CNN)
Conventional handcrafted features are based on specific local intensities; thus, these methods cannot perform multiclass segmentation at once. The performance parameters change from image to image. Therefore, automatic deep feature methods are an alternative to handmade features for evaluation. Long et al. used the first learning method for the multiclass chest anatomy segmentation. The encode-decoder is used for critical learning of higher-order structures. The encoder part extracts the features, and the decoder performs upsampling operations to obtain the final segmented output [27]. An adversarial network is proposed by Dong et al. to estimate the CTR. The produced model creates the predicted domain-independent output mask [28]. Tang et al. developed a transfer learning approach for lung segmentation, and it consists of two main modules. The first module is crisscross attention-wise responsible for the enriched global contextual information, whereas the second module, image-to-image translation, is used for data augmentation [29].
Souza et al. used a learning method to segment the lung regions. The original images were divided into patches, and those patches were classified into lung and non-lung classes by a neural network [30]. Kalinovsky et al. modified the encoder-decoder-based SegNet model for lung segmentation and achieved 96.2% accuracy [31]. This method worked on the limitation of the original SegNet model. It used the max-pooling features to upsample the features maps in the decoding layers, the LF-SegNet method developed by Mittal et al. [13] to perform lung segmentation. The lung segmentation was performed and the proposed model was evaluated on two famous chest X-ray datasets, JSRT and MC, and achieved 98.73% and 95.10% accuracy, respectively. Liu et al. [32] proposed a U-Net segmentation model on the JSRT dataset to extract the lung regions, and DenseNet is used to segment the lungs.
Venkataramani et al. developed ContextNets for semantic segmentation to adapt the target domain with fewer images [33]. Frid-Adar et al. considered an important application of semantic segmentation to detect the clavicle bone positioning using Chest X-rays. The modified architecture used to segment the clavicle bones and the weights of VGG16 used in the encoder [34]. Oliveira et al. presented a transfer learning-based approach f chestrelated organs segmentation. The approach consists of pre-trained networks for semantic segmentation as; U-Net, fully connected network, and SegNet [35]. Wang et al. considered the instance segmentation to segment multiclass chest organs using chest X-ray images with Mask-RCNN [36]. Dong et al. presented a generative adversarial network (GAN) for a semantic segmentation purpose using CXRs [37]. Jiang et al. developed a CNN-based VGG16 segmentation model with prior weight initialization with fewer data. [38]. Most of the semantic segmentation from the radiographic images is performed by the UNet and encoder-decoder architectures or by the different variants of the followings; however, currently, the recurrent neural network (RNN) is also used for the segmentation purpose in radiology. Stollenga et al. [39] presented a segmentation 3D LSTM-RNN deep model to extract the brain features automatically. Chen et al. [40] used a semantic segmentation approach to 3D images by combining the LSTM-RNN and U-Net architecture.

Proposed CardioNet
The flowchart of the proposed CardioNet model for the automatic semantic segmentation of chest-related organs is presented in Figure 2. An input chest of CXR images without any preprocessing is given to the CardioNet. CardioNet uses pixel-wise classification to segment the heart, lungs, and clavicle bones. The output of the model is the segmented masks for each class. The heart and lungs masks are utilized to calculate CTR to detect cardiomegaly, make accurate diagnostic decisions, and measure the model's performance. The CardioNet is based on the dense connections between the downsample block, upsample block, and features boost block. Dense concatenation paths are used for the subsequent layers in the downsample and upsample block of the CardioNet and between the downsample and upsample block. The specific paths are introduced for the flow of information within the network to provide the edge information to the subsequent layers and the upsample block. Feature boost block is introduced to preserve minor features in the segmented image. It applies continuous convolutions without resizing the feature map size, which helps to premaintain needful spatial information to boost the segmentation performance.
tation of chest-related organs is presented in Figure 2. An input chest of CXR images without any preprocessing is given to the CardioNet. CardioNet uses pixel-wise classification to segment the heart, lungs, and clavicle bones. The output of the model is the segmented masks for each class. The heart and lungs masks are utilized to calculate CTR to detect cardiomegaly, make accurate diagnostic decisions, and measure the model's performance. The CardioNet is based on the dense connections between the downsample block, upsample block, and features boost block. Dense concatenation paths are used for the subsequent layers in the downsample and upsample block of the CardioNet and between the downsample and upsample block. The specific paths are introduced for the flow of information within the network to provide the edge information to the subsequent layers and the upsample block. Feature boost block is introduced to preserve minor features in the segmented image. It applies continuous convolutions without resizing the feature map size, which helps to premaintain needful spatial information to boost the segmentation performance.

Chest Anatomy Segmentation Using CardioNet Architecture
Semantic segmentation in CXRs is not an easy task. X-ray images have inferior quality, whereas the correct CTR computation is based on the true lungs and heart boundaries. Moreover, the conventional semantic segmentation architectures [27,41] lack spatial information loss compensation as they are not using any residual or dense connectivity to empower the feature after the continuous convolutional operation. The proposed CardioNet is a robust network that incorporates three principal parts: a downsample block (DSB), an upsample block (USB), and a features boost block (FBB). Figure 3 shows the proposed CardioNet architecture for the CXR semantic segmentation. The DSB is an encoder part of the network that squeezes the vital information from the input images in combination with dense connectivity benefits. It consists of five 3 × 3 general convolutions and three depth-wise separable convolutions. The convolutional layers with many channels consume more trainable parameters. Therefore, the convolution in the deeper side of the DSB is replaced by depth-wise convolutions. The edge features in the CXR image can be very small, and those small features can be eliminated if the feature map size is greatly reduced within the network. It can be noticed from Table 1 that the smallest feature map size in the DSB is 21 × 21, which is not enough to represent the minor information available in the CXR; the minor features can be eliminated, and therefore, the FBB is used to apply continuous convolutions without resizing the feature map. The FBB retains the feature map on a flat feature map size of 350 × 350, which can

Chest Anatomy Segmentation Using CardioNet Architecture
Semantic segmentation in CXRs is not an easy task. X-ray images have inferior quality, whereas the correct CTR computation is based on the true lungs and heart boundaries. Moreover, the conventional semantic segmentation architectures [27,41] lack spatial information loss compensation as they are not using any residual or dense connectivity to empower the feature after the continuous convolutional operation. The proposed Car-dioNet is a robust network that incorporates three principal parts: a downsample block (DSB), an upsample block (USB), and a features boost block (FBB). Figure 3 shows the proposed CardioNet architecture for the CXR semantic segmentation. The DSB is an encoder part of the network that squeezes the vital information from the input images in combination with dense connectivity benefits. It consists of five 3 × 3 general convolutions and three depth-wise separable convolutions. The convolutional layers with many channels consume more trainable parameters. Therefore, the convolution in the deeper side of the DSB is replaced by depth-wise convolutions. The edge features in the CXR image can be very small, and those small features can be eliminated if the feature map size is greatly reduced within the network. It can be noticed from Table 1 that the smallest feature map size in the DSB is 21 × 21, which is not enough to represent the minor information available in the CXR; the minor features can be eliminated, and therefore, the FBB is used to apply continuous convolutions without resizing the feature map. The FBB retains the feature map on a flat feature map size of 350 × 350, which can contain the smaller and most valuable features. The USB is mainly based on three depth-wise separable convolutions combined with five 3 × 3 general convolutions and batch normalization. As mentioned previously, in the case of DSB, general convolutions with more channels are costly, so depth-wise separable convolutions are replacing those of the general convolution layers to reduce cost. Furthermore, softmax and pixel classification layers are used in the USB; where softmax layer works as the activation function and pixel classification layer provides a categorical label for each pixel in the image.
contain the smaller and most valuable features. The USB is mainly based on three depthwise separable convolutions combined with five 3 × 3 general convolutions and batch normalization. As mentioned previously, in the case of DSB, general convolutions with more channels are costly, so depth-wise separable convolutions are replacing those of the general convolution layers to reduce cost. Furthermore, softmax and pixel classification layers are used in the USB; where softmax layer works as the activation function and pixel classification layer provides a categorical label for each pixel in the image.
Moreover, the residual connections from the ResNet model were used to empower the features that degraded the contextual image information during the downsampling [41]. This problem is termed a vanishing gradient problem, and researchers attempted to address it using residual skip connections, but it still faces the information flow latency problem. We used the dense connectivity function from the DenseNet deep architecture to overcome this problem and concatenated the in-depth features [42]. DenseNet is a famous classification model that outperforms the previous networks while using fewer trainable parameters. The proposed architecture is entirely different from conventional networks such as the segmentation network (SegNet) [43], outer residual skip network (OR-Skip-Net) [44], and U-shaped network (U-Net) [45], which are deep neural networks with larger number of trainable parameters. The shallow architecture of CardioNet exhibits 1.72 million (M) trainable parameters, while the conventional networks such as SegNet, OR-Skip-Net, and U-Net, have 29.46 M, 09.70 M, and 31.03 M trainable parameters, respectively. The key architectural differences of the CardioNet with the conventional segmentation networks such as Seg-Net, OR-Skip-Net, and U-Net are listed in Table 2. OR-Skip-Net [44] There is no internal connectivity between the convolutional layers in the encoder and decoder.
Both internal and external connectivities are used.
Residual connectivity is used. Dense connectivity is used. 16 Figure 4 illustrates the in-depth view of the proposed CardioNet with dense connectivity feature concatenation. CardioNet accomplishes the chest X-ray image; the spatial loss is compensated by concatenating deep, dense features. The dense concatenation method transfers these enriched features from the DSB to the USB, as shown in the figure.
As shown in Figure 4, the first convolution layer in the downsample block, i.e., Conv − A, receives the D i feature as input, and after applying the convolution operation, it outputs the F i feature. This F i feature is rich with low-level spatial information, and it flows in two directions for concatenation. First, F i is provided to the second convolution (Conv-B), which converts into K i . K i is the output achieved after the two convolution layers, and spatial loss of these convolutional operations is recovered by dense feature concatenation. The first aggregated rich, dense concatenated feature, A 1 Concat. , is the output of F i and K i given by Equation (2): As shown in Figure 4, the first convolution layer in the downsample block, i.e., Conv-A, receives the D feature as input, and after applying the convolution operation, it outputs the F feature. This F feature is rich with low-level spatial information, and it flows in two directions for concatenation. First, F is provided to the second convolution (Conv-B), which converts into K . K is the output achieved after the two convolution layers, and spatial loss of these convolutional operations is recovered by dense feature concatenation. The first aggregated rich, dense concatenated feature, A . , is the output of F and K given by Equation (2): Here © shows the depth-wise concatenation of F and K . F is also directly transferred to the corresponding USB to transfer the low-level spatial information to the last layers of the network for better performance. A is the resulting dense feature after the depth-wise concatenation. This process increases the number of filters for the dense features, the parameters, and the training time. To overcome this problem, we introduced the Bottlenecki layer to decrease the number of filters with the combination of the BN and ReLU layers. The obtained feature after the Bottlenecki block is ΔA and given by Equation (3): Here, Δ represents the BN and ReLU operations in combination with the Bottlenecki layer that limit the filters. Similarly, Figure 4 shows that each USB receives the U feature as input, and after applying the first convolution (Conv-A), it outputs the J feature. J has less spatial loss than the next convolution layer, and it flows in two directions for concatenation. First, J is provided to the second convolution (Conv-B), producing L . L is the output obtained after two convolution operations that concatenate with J from the first convolution (Conv-A) and F from the corresponding DSB, creating a second aggregated rich feature A . given by Equation (4): Here, © shows the depth-wise concatenation of J , L , and F . A . is a powerful feature that contains rich information from the initial layers, resulting in better segmentation of the lung and heart region pixels. Again, to overcome the filter size problem, the Bottleneckj block was used with a combination of BN and ReLU operations. ΔA . is a bottleneck feature and is given by Equation (5): Here © shows the depth-wise concatenation of F i and K i . F i is also directly transferred to the corresponding USB to transfer the low-level spatial information to the last layers of the network for better performance. A 1 Concat is the resulting dense feature after the depthwise concatenation. This process increases the number of filters for the dense features, the parameters, and the training time. To overcome this problem, we introduced the Bottleneck i layer to decrease the number of filters with the combination of the BN and ReLU layers. The obtained feature after the Bottleneck i block is ∆A 1 Concat and given by Equation (3): Here, ∆ represents the BN and ReLU operations in combination with the Bottleneck i layer that limit the filters. Similarly, Figure 4 shows that each USB receives the U i feature as input, and after applying the first convolution (Conv-A), it outputs the J i feature. J i has less spatial loss than the next convolution layer, and it flows in two directions for concatenation. First, J i is provided to the second convolution (Conv-B), producing L i . L i is the output obtained after two convolution operations that concatenate with J i from the first convolution (Conv-A) and F i from the corresponding DSB, creating a second aggregated rich feature A 2 Concat. given by Equation (4): Here, © shows the depth-wise concatenation of J i , L i , and F i . A 2 Concat. is a powerful feature that contains rich information from the initial layers, resulting in better segmentation of the lung and heart region pixels. Again, to overcome the filter size problem, the Bottleneck j block was used with a combination of BN and ReLU operations. ∆A 2 Concat. is a bottleneck feature and is given by Equation (5): Here, ∆ represents the combination of BN and ReLU operations with the Bottleneck j layer limiting the filter size. Both the bottleneck features (∆A 1 Concat. , ∆A 2 Concat. ) achieved from the downsample block or upsample block empowered the dense connectivity. The output A 2 Concat. of the last USB provided the Softmax and pixel classification. The layer details in the booster block are shown in Table 3. The connectivity of the booster block in CardioNet and the dense feature aggregation principle are shown in Figure 5. The downsample-upsample block (DUB) takes the input feature, and this feature passes through several convolutional layers to extract the meaningful features for chest anatomy segmentation. This DUB provides the F d feature. F d is densely aggregated with the rich F b from the feature booster block (FBB). The features in FBB are without an extensive pooling operation and therefore contain the features that represent the low-level boundary information. Both F d and F b are aggregated with depth-wise concatenation to create S 1 , given by Equation (6):  Table 3. The con booster block in CardioNet and the dense feature aggregation principle are ure 5. The downsample-upsample block (DUB) takes the input feature, a passes through several convolutional layers to extract the meaningful fea anatomy segmentation. This DUB provides the F feature. F is dens with the rich F from the feature booster block (FBB). The features in FBB extensive pooling operation and therefore contain the features that represen boundary information. Both F and F are aggregated with depth-wise co create S , given by Equation (6): Here, S is the densely aggregated feature created by the depth-wise of F (feature coming from the DUB block) and F (coming from FBB), re depth-wise concatenation. Table 3. Feature map size details for features boost block. Batch normalization and used with bottleneck convolution layers as a unit and denoted as "**". Stride is 1 features boost block.

Block
Layer Name Layer Size (Height × Width × Number of channels)

Filters/Groups
Bottleneck-C ** 1 × 1 × 8 8 Here, S 1 is the densely aggregated feature created by the depth-wise concatenation of F d (feature coming from the DUB block) and F b (coming from FBB), representing the depth-wise concatenation.

Data Augmentation
We investigated chest anatomy segmentation in multiple classes. To analyze the performance of CardioNet, the Japanese Society of Radiological Technology (JSRT), i.e., a publicly available multiclass dataset [46], is used. In the JSRT dataset, there are 247 chest radiograph images for the lungs, heart, and clavicle, each annotated at the pixel level. A total of 154 of the 247 images show nodules, whereas the remaining 94 do not. The pixel space of each image in the dataset is 0.175 mm, and the image size is 2048 × 2048 pixels.
The provided data are divided into two folds with odd (123) and even (124) numbered images. Training is done one-fold, testing one on the other and vice versa. Considering the two-fold cross-evaluation criteria, the results are obtained by averaging both folds. Figure 6 shows sample chest radiograph images from the JSRT dataset with corresponding ground truth. In detail, the red, green, and blue pixels show the clavicle bone, heart, and lungs, respectively.
We investigated chest anatomy segmentation in multiple classes. To analyze the performance of CardioNet, the Japanese Society of Radiological Technology (JSRT), i.e., a publicly available multiclass dataset [46], is used. In the JSRT dataset, there are 247 chest radiograph images for the lungs, heart, and clavicle, each annotated at the pixel level. A total of 154 of the 247 images show nodules, whereas the remaining 94 do not. The pixel space of each image in the dataset is 0.175 mm, and the image size is 2048 × 2048 pixels. The provided data are divided into two folds with odd (123) and even (124) numbered images. Training is done one-fold, testing one on the other and vice versa. Considering the two-fold cross-evaluation criteria, the results are obtained by averaging both folds. Figure 6 shows sample chest radiograph images from the JSRT dataset with corresponding ground truth. In detail, the red, green, and blue pixels show the clavicle bone, heart, and lungs, respectively. CardioNet is a pixel-wise classification network that requires large training data to classify the multiple classes. The original image size and labels of 2048 × 2048 pixels of JSRT are resized to 350 × 350 pixels to reduce the memory usage of the graphic processing unit (GPU) and the training time. Different data generation approaches are applied to increase the data size so the model will learn accurately. Acquiring enough medically labeled data is difficult, requiring an expert to label the data. The training of CardioNet is performed with various images, and images are generated with data augmentation. The image transformations create artificial images by cropping, horizontal flipping, translation, and horizontal flipping with translation. Figure 7 represents the proposed data augmentation for this task. The first step of augmentation, horizontal flipping (H-Flip), is applied to 124 original images, which results in 248 images. The 248 images are translated (X = 4, Y = −4) to produce 496 images. In the third step, horizontally flipping the 496 images makes 992 images. Using the 992 images from the previous step, the next step applies the translation of (X = 8, Y = 8) with an H-Flip and the translation of (X = −12, Y = −12) with an H-Flip, resulting in 1984 and 1984 images, respectively. Consequently, 3968 images (1984 CardioNet is a pixel-wise classification network that requires large training data to classify the multiple classes. The original image size and labels of 2048 × 2048 pixels of JSRT are resized to 350 × 350 pixels to reduce the memory usage of the graphic processing unit (GPU) and the training time. Different data generation approaches are applied to increase the data size so the model will learn accurately. Acquiring enough medically labeled data is difficult, requiring an expert to label the data. The training of CardioNet is performed with various images, and images are generated with data augmentation. The image transformations create artificial images by cropping, horizontal flipping, translation, and horizontal flipping with translation. Figure 7 represents the proposed data augmentation for this task. The first step of augmentation, horizontal flipping (H-Flip), is applied to 124 original images, which results in 248 images. The 248 images are translated (X = 4, Y = −4) to produce 496 images. In the third step, horizontally flipping the 496 images makes 992 images. Using the 992 images from the previous step, the next step applies the translation of (X = 8, Y = 8) with an H-Flip and the translation of (X = −12, Y = −12) with an H-Flip, resulting in 1984 and 1984 images, respectively. Consequently, 3968 images (1984 + 1984) are obtained and used as training images by combining transformational images from step four. + 1984) are obtained and used as training images by combining transformational images from step four. In our experiments, CardioNet is trained from scratch using MATLAB R2021b [47] (without fine-tuning or a pre-trained model, such as DenseNet, ResNet, Inception, or GoogleNet). CardioNet is trained and tested on a desktop computer, i.e., Intel ®® Core™ i7-8700 CPU. The system clock speed is 3.20 GHz, memory (RAM) is 16 GB, and NVIDIA GPU GeForce RTX 2060 (graphics memory of 8 GB) [48].

CardioNet Training
CardioNet contains several dense paths that allow the network to provide encoderdecoder connections both internally and externally, which is a useful tool to help the network converge quickly. As a result of the spatial edge information from the dense connectivity concatenation feature, the edges can be segmented finely without preprocessing. CardioNet training takes place from scratch, without any weight transfer or initialization. Since we designed CardioNet, fine-tuning is not required when training it.
The training of CardioNet involves the following considerations. The gradient of the network is optimized with a well-known training optimizer called Adam. Adam can efficiently scale gradients diagonally, is suitable for large datasets, and can even handle moving object classification problems [49]. Adam is adopted as an optimizer for CardioNet training due to its benefits. Considering all other parameters, the learning rate of 0.001 for 30 epochs (51,660 iterations) is used throughout the CardioNet training. CardioNet requires high memory, so a small batch size with 4 images is considered. To maintain the gradient threshold, the global L2 normalization method with an epsilon of 0.000001 is used during the training. The cross-entropy loss function with median frequency is used to make training faster and similar to frequency balancing and cross-entropy [43,44,50,51].  In our experiments, CardioNet is trained from scratch using MATLAB R2021b [47] (without fine-tuning or a pre-trained model, such as DenseNet, ResNet, Inception, or GoogleNet). CardioNet is trained and tested on a desktop computer, i.e., Intel ®® Core™ i7-8700 CPU. The system clock speed is 3.20 GHz, memory (RAM) is 16 GB, and NVIDIA GPU GeForce RTX 2060 (graphics memory of 8 GB) [48].

CardioNet Training
CardioNet contains several dense paths that allow the network to provide encoderdecoder connections both internally and externally, which is a useful tool to help the network converge quickly. As a result of the spatial edge information from the dense connectivity concatenation feature, the edges can be segmented finely without preprocessing. CardioNet training takes place from scratch, without any weight transfer or initialization. Since we designed CardioNet, fine-tuning is not required when training it.
The training of CardioNet involves the following considerations. The gradient of the network is optimized with a well-known training optimizer called Adam. Adam can efficiently scale gradients diagonally, is suitable for large datasets, and can even handle moving object classification problems [49]. Adam is adopted as an optimizer for CardioNet training due to its benefits. Considering all other parameters, the learning rate of 0.001 for 30 epochs (51,660 iterations) is used throughout the CardioNet training. CardioNet requires high memory, so a small batch size with 4 images is considered. To maintain the gradient threshold, the global L2 normalization method with an epsilon of 0.000001 is used during the training. The cross-entropy loss function with median frequency is used to make training faster and similar to frequency balancing and cross-entropy [43,44,50,51].

Chest Anatomy Segmentation Testing using CardioNet
A CXR image without data preprocessing is given to CardioNet as an input image. This input image is passed through the CardioNet downsample block and upsample block to acquire the segmented output. The final output of the proposed network is a multiclass segmentation mask. These segmented masks further evaluate the final output by generating the chest-related automatic semantic segmentation results. The performance of CardioNet is evaluated and measured using different metrics. The following segmentation metrics are calculated: Accuracy (Acc), Jaccard index (J), and Dice coefficient(D). J is a mean intersection over union (IoU), and D is calculated as [1,36,52] for a JSRT dataset. Equations (7)-(9) give the formulas for evaluation metrics: Here, TP = true positives, FP = false positives, and FN = false negatives. We are considering multiclass segmentation, considering lung class as an example: TP and FP are predicted as lung pixels and non-lung pixels in the ground truth, respectively. TP pixels

Chest Anatomy Segmentation Testing Using CardioNet
A CXR image without data preprocessing is given to CardioNet as an input image. This input image is passed through the CardioNet downsample block and upsample block to acquire the segmented output. The final output of the proposed network is a multiclass segmentation mask. These segmented masks further evaluate the final output by generating the chest-related automatic semantic segmentation results. The performance of CardioNet is evaluated and measured using different metrics. The following segmentation metrics are calculated: Accuracy (Acc), Jaccard index (J), and Dice coefficient(D). J is a mean intersection over union (IoU), and D is calculated as [1,36,52] for a JSRT dataset. Equations (7)-(9) give the formulas for evaluation metrics: Here, TP = true positives, FP = false positives, and FN = false negatives. We are considering multiclass segmentation, considering lung class as an example: TP and FP are predicted as lung pixels and non-lung pixels in the ground truth, respectively. TP pixels are the pixels that are predicted as lung pixels and listed as lung pixels in the ground truth. FN are those lung pixels with ground truth predicted as non-lung pixels by our CardioNet model.
The segmented chest X-ray images with CardioNet using the JSRT dataset are shown in Figure 9. The FP convention is black; the FN convention is yellow; and the TP convention is shown in green, blue, and red for the heart, lung, and clavicle bone classes, respectively. While there are some bad segmentation cases, there is no significant segmentation error using our method for the test images.
FOR PEER REVIEW 14 of 21 are the pixels that are predicted as lung pixels and listed as lung pixels in the ground truth. FN are those lung pixels with ground truth predicted as non-lung pixels by our CardioNet model. The segmented chest X-ray images with CardioNet using the JSRT dataset are shown in Figure 9. The FP convention is black; the FN convention is yellow; and the TP convention is shown in green, blue, and red for the heart, lung, and clavicle bone classes, respectively. While there are some bad segmentation cases, there is no significant segmentation error using our method for the test images.

Ablation Study
An ablation study of our proposed method was conducted to prove the efficiency of CardioNet. We considered two variants of CardioNet, one with the booster block, i.e., CardioNet-B (mentioned as CardioNet throughout the whole manuscript), and the second one without the booster block, referred to as CardioNet-X. Here, we compared CardioNet-B with CardioNet-X. The experiments showed that the effectiveness of the booster block in CardioNet-B provides superior results to those of CardioNet-X (without booster block). Table 4 shows that CardioNet-B with the booster block provides higher Acc, J, and D values than CardioNet-X with a minor increase in the number of trainable parameters. Booster block creates considerable performance differences owing to preserving spatial information to boost the segmentation performance.

Ablation Study
An ablation study of our proposed method was conducted to prove the efficiency of CardioNet. We considered two variants of CardioNet, one with the booster block, i.e., CardioNet-B (mentioned as CardioNet throughout the whole manuscript), and the second one without the booster block, referred to as CardioNet-X. Here, we compared CardioNet-B with CardioNet-X. The experiments showed that the effectiveness of the booster block in CardioNet-B provides superior results to those of CardioNet-X (without booster block). Table 4 shows that CardioNet-B with the booster block provides higher Acc, J, and D values than CardioNet-X with a minor increase in the number of trainable parameters. Booster block creates considerable performance differences owing to preserving spatial information to boost the segmentation performance. This section compares CardioNet segmentation performance with other methods. The segmentation performance of the existing state-of-the-art methods compared with CardioNet is shown in Table 5. This comparison is based on the JSRT dataset. The local feature-based methods and the learned feature-based methods are presented separately in Table 5. Based on J and D, the study results demonstrate that CardioNet outperforms other studies for chest anatomy segmentation. dataset. MC is a publicly available dataset published by the famous Montgomery County tuberculosis program, one of the most famous in the USA. It has 138 chest PA X-ray images with 80 normal and 58 TB cases. Figure 10 shows sample images from the MC data with lung contour binary masks, and the format of the X-ray images format is PNG. We divided the MC dataset into training and testing following [60], where the training set has 80 images, the validation set has 20, and the testing set consists of 38 images. The MC dataset is also augmented, and the image size was artificially increased using the same augmentation approach used for the JSRT dataset (Section 4.1). We only provided the ground truth masks for lungs. The CardioNet results for the MC dataset are shown in Figure 11. However, the experimental comparison of CardioNet with other existing models is shown in Table 6. The experimental results clearly show that CardioNet outperformed for lung segmentation and can be used in real-world medical applications for diagnostic purposes. ground truth masks for lungs. The CardioNet results for the MC dataset are shown in Figure 11. However, the experimental comparison of CardioNet with other existing models is shown in Table 6. The experimental results clearly show that CardioNet outperformed for lung segmentation and can be used in real-world medical applications for diagnostic purposes.  ground truth masks for lungs. The CardioNet results for the MC dataset are shown in Figure 11. However, the experimental comparison of CardioNet with other existing models is shown in Table 6. The experimental results clearly show that CardioNet outperformed for lung segmentation and can be used in real-world medical applications for diagnostic purposes.

Automated Computation of CTR by the Proposed Method
As explained in the introduction section, medical practitioners can use CTR in diagnosing cardiomegaly, a heart enlargement disease. CTR is a quantitative measure of heart enlargements in CXRs to detect cardiomegaly and boundaries of other chest organs. CTR estimates the size of the heart. The cardiac health experts manually calculate it. Our proposed semantic segmentation network (CardioNet) can automate the computation of CTR by accurately segmenting the lungs and heart boundaries. As shown in Figure 12a, the heart boundary is critical in X-ray images due to a change in pixel values. The CTR can only be calculated by the proposed method once the heart and lungs are segmented properly. With our feature-boosting mechanism, we can achieve accurate boundaries of the heart and lungs even with minor changes in pixel values.

Automated Computation of CTR by the Proposed Method
As explained in the introduction section, medical practitioners can use CTR in diagnosing cardiomegaly, a heart enlargement disease. CTR is a quantitative measure of heart enlargements in CXRs to detect cardiomegaly and boundaries of other chest organs. CTR estimates the size of the heart. The cardiac health experts manually calculate it. Our proposed semantic segmentation network (CardioNet) can automate the computation of CTR by accurately segmenting the lungs and heart boundaries. As shown in Figure 12a, the heart boundary is critical in X-ray images due to a change in pixel values. The CTR can only be calculated by the proposed method once the heart and lungs are segmented properly. With our feature-boosting mechanism, we can achieve accurate boundaries of the heart and lungs even with minor changes in pixel values.
An example image from the JSRT dataset for calculating the CTR of the ground truth and the predicted mask (by CardioNet) using equation (1) is shown in Figure 12b,c, respectively. For calculating the CTR, we need the ratio of distance D + D and M. Here, D + D is the distance between two extreme points of the heart. M is the maximum horizontal distance between two extreme outer points for both lungs. The distance D + D by the predicted mask is 126 pixels, and the distance M is 304. Hence, using equation (1), the predicted CTR (P-CTR) will be 0.4145. On the other hand, the distance DL + DR by the ground truth mask is 123 pixels, and the distance M is 303. Therefore, the ground truth CTR (G-CTR) calculated by Equation (1) is 0.4045. In JSRT, masks are prepared under the supervision of an expert radiologist. It can be determined from the obtained results that the proposed automatic method has predicted the CTR very efficiently. Hence, it can aid the medical practitioner in diagnosing using CTR and other chest anatomy segmentation as an alternative system.  An example image from the JSRT dataset for calculating the CTR of the ground truth and the predicted mask (by CardioNet) using Equation (1) is shown in Figure 12b,c, respectively. For calculating the CTR, we need the ratio of distance D L + D R and M. Here, D L + D R is the distance between two extreme points of the heart. M is the maximum horizontal distance between two extreme outer points for both lungs. The distance D L + D R by the predicted mask is 126 pixels, and the distance M is 304. Hence, using Equation (1), the predicted CTR (P-CTR) will be 0.4145. On the other hand, the distance D L + D R by the ground truth mask is 123 pixels, and the distance M is 303. Therefore, the ground truth CTR (G-CTR) calculated by Equation (1) is 0.4045. In JSRT, masks are prepared under the supervision of an expert radiologist. It can be determined from the obtained results that the proposed automatic method has predicted the CTR very efficiently. Hence, it can aid the medical practitioner in diagnosing using CTR and other chest anatomy segmentation as an alternative system.

Conclusions
Anatomical chest structures (lungs, hearts, and clavicle bones) were segmented using a residual connection-based semantic segmentation network (CardioNet) for diagnostic purposes. Even under nonideal situations and multiple classes, the method provides excellent segmentation accuracy. Since the pixel value of the heart is low and its boundary edges blend with the lungs, segmenting the heart is essential. CardioNet can segment the heart accurately, even in X-ray images of inferior quality. The segmentation accuracy of the heart and lung regions is directly related to the CTR. Conventional CNNs reduce the feature map size to classify classes, resulting in the depreciation of the minor information (clavicle bone and small-sized heart) due to the overuse of the max-pooling layers. The feature map size is not reduced in the network for classification; fewer pooling layers are used, and a large feature size helps restore the minor classes' information. The direct, outer, dense-connectivity feature concatenation causes direct information transfer, enabling CardioNet to converge faster, as seen in the training accuracy and loss curves. In our work, the proposed CardioNet automatically detects the boundaries of the lungs and heart to accurately calculate the CTR. Multiple cardiac and lung diseases can be diagnosed using the CTR.