1. Introduction
Pulmonary fibrosis is a progressive lung disease characterized by interstitial tissue scarring. Following lung tissue injury, the disease triggers persistent fibrosis and remodeling. Clinical evidence highlights that early detection and timely treatment are critical for disease control; however, current medical technologies cannot cure or reverse the progression. At present, only antifibrotic medications are available to slow disease progression and improve patients’ quality of life [1].
In clinical practice, chest X-ray imaging remains the primary diagnostic tool. Radiologists assess lesion patterns by adjusting slice thickness and analyzing imaging features. Nevertheless, traditional manual interpretation faces significant challenges in identifying fibrotic lesions and quantifying their severity. The complexity of lung anatomy, with its overlapping vessels, bronchi, and other structures, together with inter-patient variability in lesion presentation, often complicates feature recognition and hinders early-stage disease detection.
Accurate diagnosis of pulmonary fibrosis in chest X-rays typically requires experienced radiologists, but inter-observer variability remains common. Even senior specialists may introduce subjectivity when diagnosing and localizing fibrotic regions. Several factors contribute to these challenges:
Pulmonary fibrosis may exhibit imaging features similar to other lung diseases (e.g., interstitial lung disease or autoimmune-related disorders).
Early-stage fibrosis often shows subtle or indistinct imaging features.
The degree of fibrosis can vary significantly across patients and within different lung regions.
Coexisting conditions (e.g., infections, tumors, pleural diseases) may obscure or alter fibrotic features.
Image quality may be affected by equipment performance, scanning protocols, or patient positioning.
Different subtypes of fibrosis (e.g., idiopathic pulmonary fibrosis, connective tissue disease-associated fibrosis) present with distinct radiographic patterns.
These factors collectively reduce diagnostic accuracy, introduce variability, and may delay timely detection, thereby increasing the workload and difficulty for radiologists. Consequently, developing a high-sensitivity, low-false-positive computer-aided diagnosis (CAD) system to assist clinicians in chest X-ray screening and diagnosis of pulmonary fibrosis represents a critical clinical priority.
With the rapid advancement of artificial intelligence (AI), deep learning has shown great potential in medical image analysis for early disease detection, personalized treatment, and improved diagnostic efficiency. However, accurate pulmonary fibrosis identification and quantification in chest X-ray images remain challenging due to complex anatomical structures and inter-patient variability. To address this, we propose an Attention Feature Transformation Network (AFTNet) for automated detection and precise quantification of pulmonary fibrosis regions. AFTNet integrates spatial- and channel-based multi-head self-attention modules with a skip connection mechanism to enhance feature representation and boundary delineation. The key contributions of this work are threefold: (1) the proposed method achieves high accuracy while enabling rapid deployment across different clinical domains; (2) the developed pulmonary fibrosis detection software has been clinically implemented at Taichung Veterans General Hospital, Taiwan, assisting radiologists in interpreting chest X-ray images; and (3) the system enables reliable fibrosis detection and assessment without requiring costly imaging equipment, providing a clinically practical and scalable solution.
The remainder of this paper is structured as follows. Section 2 provides an overview of related research, while Section 3 details the proposed method. Section 4 presents the experimental results along with analysis and discussion. Finally, Section 5 concludes the paper.
2. Related Work
Early detection of pulmonary lesions through chest X-ray imaging is crucial for curative treatment and disease management. Computer-aided diagnostic (CAD) systems can improve lesion detection accuracy and reduce the risk of missed diagnoses, thereby offering significant clinical value.
Frix et al. [2] conducted a comprehensive study on traditional image processing techniques for pulmonary disease detection. They concluded that computer-based classification and detection of lung images typically involve five key stages: preprocessing, segmentation, feature selection, feature extraction, and classification.
Similarly, Abed [3] proposed an X-ray-based lung tumor detection system that integrates Principal Component Analysis (PCA) with a Back-Propagation Neural Network (BPNN). Using PCA for feature extraction provides notable advantages, including reducing the dimensionality of training images, enhancing the recognition performance of the artificial neural network, and shortening computational time.
In recent years, integrating artificial intelligence and deep learning technologies has emerged as one of the most effective approaches in medical image analysis. Yahyatabar et al. [4] proposed a deep convolutional neural network, Dense-UNet, for lung region segmentation. By introducing dense connections between layers, their model enhanced information flow throughout the network, reduced the number of trainable parameters, and preserved segmentation robustness, thereby improving efficiency and accuracy.
Suryani et al. [5] employed Segmentation-based Deep Fusion Networks (SDFN) in combination with Squeeze-and-Excitation (SE) modules for model training. They also incorporated Seg-Grad-CAM (Semantic Segmentation via Gradient-Weighted Class Activation Mapping) to introduce semantic information, thereby improving the localization of pulmonary tumors.
In a subsequent study, Suryani et al. [6] proposed a novel CAT-UNet model for lung mass segmentation in chest X-ray images. CAT-UNet is built upon the TransUNet architecture, integrating convolutional neural networks (CNNs) and Transformers as the encoder to enhance feature representation. Within the decoder, a Convolutional Block Attention Module (CBAM) is embedded during the upsampling process, enabling more refined optimization of segmentation details.
Pulmonary fibrosis is a progressive interstitial lung disease characterized by irreversible scarring and structural remodeling. Due to its accessibility and cost-effectiveness, chest X-ray imaging remains the most widely used diagnostic tool. It is often complemented by clinical symptoms and pulmonary function test results. Although lung ultrasound has shown promise for pulmonary disease assessment [7], X-ray imaging continues to play a central role in clinical diagnosis.
The radiographic manifestations of pulmonary fibrosis, however, are heterogeneous and complex. Typical findings include ground-glass opacities (GGO), reticular patterns, honeycombing, consolidation, and emphysema. GGO appears as hazy areas of increased attenuation without obscuring vascular structures, while reticular opacities reflect fibrotic distortion of the interstitium. Honeycombing represents advanced fibrosis, with clustered cystic air spaces typically located subpleurally. Consolidation obscures vascular markings with homogeneous density, whereas emphysema is characterized by permanent enlargement of distal air spaces with alveolar wall destruction.
Although lung ultrasound has demonstrated potential in diagnosing pulmonary diseases, its inherent physical limitations and operator dependency restrict its use as an independent or preferred diagnostic tool. In contrast, recent advances in deep learning-based segmentation models have shown considerable promise in lung image analysis. Nevertheless, traditional approaches still face challenges in both accuracy and reliability for pulmonary fibrosis prediction.
For example, the FibroRegNet multimodal system [8] employs a convolutional spatial transformation to predict the progression of idiopathic pulmonary fibrosis. While effective, its relatively high computational complexity limits feasibility for real-time clinical applications. This highlights the need for semantic segmentation models that balance accuracy with efficiency, ensuring applicability in practical medical environments. Specifically, reducing parameter count and computational burden without sacrificing predictive performance is critical for deployment in real-world settings where rapid results are essential.
To address this, lightweight backbone networks such as MobileNetV3 [9], EfficientNet-Lite [10], or ShuffleNet [11] can be adopted as feature extractors, in combination with optimized decoder architectures. Such designs can substantially reduce model complexity and resource demands, allowing accurate pulmonary fibrosis segmentation on standard clinical hardware.
Given the limitations of existing approaches, there is a critical need for models that combine accuracy with computational efficiency to enable real-time clinical deployment. To address this, we propose the Attention Feature Transformation Network (AFTNet), a lightweight yet effective framework that integrates spatial- and channel-based attention mechanisms with an optimized architecture. AFTNet is designed to achieve precise segmentation and quantification of pulmonary fibrosis in chest X-ray images while maintaining efficiency suitable for routine clinical use.
3. Materials and Methods
The pulmonary fibrosis proportion segmentation model proposed in this paper is developed based on the U-Net architecture [12], as illustrated in Figure 1. Since this is a supervised learning network, input images and corresponding annotation files are required during training to guide the model in locating regions of interest for segmentation. The model is designed to process high-resolution chest X-ray images of size 1760 × 2140, and its architecture integrates the strengths of traditional convolutional neural networks with modern Transformer-based approaches [13]. The workflow consists of three main components: the encoder path, the decoder path, and skip connections.
As shown in Figure 2 and Figure 3, the encoder partitions each high-resolution chest X-ray image (1760 × 2140 × 3) into a 5 × 5 grid, generating 75 non-overlapping blocks that preserve local details while reducing computational complexity. Each block is projected into a latent embedding space through a linear embedding layer, where pixel values are linearly mapped to feature vectors of dimension C. This process functions as an information compression and feature abstraction mechanism, filtering redundant low-level data while retaining discriminative texture patterns relevant to pulmonary fibrosis. By converting spatial data into fixed-length feature tokens, the embedding layer establishes a consistent interface for attention-based processing, enhancing the network’s capacity to capture both fine local structures and global contextual dependencies in subsequent AFTNet stages.
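The partition-and-embed step can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the toy image size, and the random-free weights are assumptions for illustration. A 5 × 5 grid yields 25 spatial blocks in this sketch; the figure of 75 blocks quoted above may count the three color channels separately, which is an assumption the text does not confirm.

```python
import numpy as np

def partition_into_blocks(image, grid=(5, 5)):
    """Split an (H, W, C) image into grid[0] x grid[1] non-overlapping blocks.

    Returns an array of shape (num_blocks, block_h * block_w * C), one
    flattened block per row, ordered row by row over the grid.
    """
    h, w, c = image.shape
    gh, gw = grid
    if h % gh or w % gw:
        raise ValueError("image size must be divisible by the grid")
    bh, bw = h // gh, w // gw
    # (gh, bh, gw, bw, c) -> (gh, gw, bh, bw, c) -> (gh * gw, bh * bw * c)
    blocks = image.reshape(gh, bh, gw, bw, c).transpose(0, 2, 1, 3, 4)
    return blocks.reshape(gh * gw, bh * bw * c)

def linear_embedding(blocks, weight, bias):
    """Project each flattened block to a C-dimensional token (hypothetical weights)."""
    return blocks @ weight + bias

# Toy example: a 10 x 15 RGB image instead of a full-resolution radiograph.
image = np.arange(10 * 15 * 3, dtype=float).reshape(10, 15, 3)
blocks = partition_into_blocks(image)
tokens = linear_embedding(blocks, np.ones((blocks.shape[1], 8)), np.zeros(8))
```

With the toy input, `blocks` has shape (25, 18) and `tokens` has shape (25, 8); the embedding dimension of 8 is chosen only to keep the example small.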
The features pass through the Attention Feature Transformation (AFT) block, depicted in Figure 4, which leverages optimized convolutional operations for enhanced feature representation. In the first processing unit, the input features are initially normalized using Layer Normalization (LayerNorm) to stabilize training and accelerate convergence. The normalized features are then processed by the Spatial-Based Multi-Head Self-Attention (SB-MSA) module. This module models spatial relationships and captures long-range dependencies across pulmonary structures, a capability essential for addressing the diffuse and heterogeneous distribution of fibrosis. The SB-MSA output is combined with the original input via residual connections (Equation (1)), producing preliminary fused features. This residual design alleviates the vanishing gradient problem while preserving essential information. Finally, the fused features undergo another LayerNorm operation and are passed through a Multilayer Perceptron (MLP) for non-linear transformation (Equation (2)), generating the updated feature representation for the subsequent stage.
This design aligns with theoretical insights from optimization theory, as residual mappings simplify the learning objective by directly enabling the network to approximate residual functions rather than complex transformations.
The second processing unit follows a structure similar to the first but introduces a key modification. Using the output of the first unit as input, the features are first normalized and then processed through a Channel-Based Multi-Head Self-Attention (CB-MSA) module. Unlike SB-MSA, which captures spatial dependencies, CB-MSA focuses on relationships across feature channels, effectively modeling interdependencies between them. Combining SB-MSA and CB-MSA allows the model to attend to both spatial and channel dimensions, thereby enriching feature representation.
As in the first unit, the CB-MSA output is fused with the input via residual connections (Equation (3)), followed by Layer Normalization and transformation through a Multilayer Perceptron (MLP), generating the updated representation (Equation (4)). In parallel, block merging operations are employed for downsampling, gradually reducing spatial resolution while increasing the number of feature channels. This hierarchical design enables the network to evolve from capturing low-level visual elements such as edges and textures in shallower layers to encoding higher-level semantic and contextual information in deeper layers, thereby constructing a rich feature hierarchy.
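The four updates described for the two processing units are consistent with the following form. The notation is assumed, since the original symbols are not reproduced in this text: $z^{l-1}$ is the block input, $\hat{z}$ denotes the attention-fused features, and LN is Layer Normalization. The residual term around the MLP in the second and fourth lines follows the usual Transformer block layout and is likewise an assumption.

```latex
\begin{align}
\hat{z}^{l}   &= \text{SB-MSA}\big(\text{LN}(z^{l-1})\big) + z^{l-1} \tag{1}\\
z^{l}         &= \text{MLP}\big(\text{LN}(\hat{z}^{l})\big) + \hat{z}^{l} \tag{2}\\
\hat{z}^{l+1} &= \text{CB-MSA}\big(\text{LN}(z^{l})\big) + z^{l} \tag{3}\\
z^{l+1}       &= \text{MLP}\big(\text{LN}(\hat{z}^{l+1})\big) + \hat{z}^{l+1} \tag{4}
\end{align}
```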
From a theoretical perspective, self-attention approximates a non-local operator capable of dynamically reweighting contextual information, thereby overcoming the locality constraints of convolutional filters. Conversely, CB-MSA explicitly models inter-channel relationships, allowing the network to adaptively recalibrate feature maps according to the relative importance of channels, a principle conceptually related to information bottleneck theory, where irrelevant features are suppressed in favor of more discriminative representations.
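The difference between the spatial and channel attention units can be made concrete with a small NumPy sketch. It is a single-head simplification with randomly initialized weights, not the authors' multi-head implementation; all function and parameter names are hypothetical. The key point is the shape of the attention map: tokens by tokens for the spatial unit, channels by channels for the channel unit.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def spatial_attention(x, wq, wk, wv):
    """Tokens attend to tokens: the attention map has shape (N, N)."""
    q, k, v = x @ wq, x @ wk, x @ wv
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def channel_attention(x, wq, wk, wv):
    """Channels attend to channels: the attention map has shape (C, C)."""
    q, k, v = x @ wq, x @ wk, x @ wv
    attn = softmax(q.T @ k / np.sqrt(q.shape[0]))
    return v @ attn.T

def mlp(x, w1, w2):
    return np.maximum(x @ w1, 0.0) @ w2  # ReLU is an assumption

def aft_block(x, p):
    """Two pre-norm residual units: spatial attention, then channel attention."""
    h = spatial_attention(layer_norm(x), *p["sb"]) + x      # residual fusion
    x1 = mlp(layer_norm(h), *p["mlp1"]) + h
    h = channel_attention(layer_norm(x1), *p["cb"]) + x1    # residual fusion
    return mlp(layer_norm(h), *p["mlp2"]) + h

def init_params(c, hidden, rng):
    def mats(*shapes):
        return tuple(rng.normal(0.0, 0.02, s) for s in shapes)
    return {"sb": mats((c, c), (c, c), (c, c)), "mlp1": mats((c, hidden), (hidden, c)),
            "cb": mats((c, c), (c, c), (c, c)), "mlp2": mats((c, hidden), (hidden, c))}

rng = np.random.default_rng(0)
tokens = rng.normal(size=(25, 16))          # 25 tokens, 16 channels
out = aft_block(tokens, init_params(16, 32, rng))
```

The output keeps the input shape (25, 16), which is what allows the blocks to be stacked and the residual additions to be well defined.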
In the proposed model, features from each encoder stage are directly transmitted to the corresponding decoder levels via skip connections. These connections mitigate the vanishing gradient problem frequently encountered in deep networks while preserving fine-grained details from early layers, which is essential for high-precision segmentation. At the bottleneck, multi-scale convergence modules aggregate hierarchical features based on scale-space theory, enabling the capture of both coarse and fine patterns. This is particularly beneficial in pulmonary fibrosis, where pathological changes range from delicate reticular opacities to extensive honeycombing.
The decoder path adopts a symmetrical design relative to the encoder and is composed primarily of Attention Feature Transformation (AFT) blocks and block expansion operations. Starting from highly abstracted features produced by the encoder, the decoder progressively restores spatial detail through block expansion. Each expansion stage increases spatial resolution and fuses features from the corresponding encoder level via skip connections, ensuring that fine structural information is preserved.
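The text does not spell out how block merging and block expansion rearrange the feature map. A minimal sketch, assuming a Swin-style 2 × 2 rearrangement (an assumption, not the authors' stated method), is shown below. Merging halves the spatial resolution and stacks each 2 × 2 neighbourhood along the channel axis; expansion is the inverse rearrangement. In a trained network, a learned linear projection would normally follow each step to set the final channel count; it is omitted here.

```python
import numpy as np

def block_merge(x):
    """(H, W, C) -> (H/2, W/2, 4C): stack each 2x2 neighbourhood along channels."""
    h, w, c = x.shape
    if h % 2 or w % 2:
        raise ValueError("height and width must be even")
    x = x.reshape(h // 2, 2, w // 2, 2, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(h // 2, w // 2, 4 * c)

def block_expand(x):
    """(H, W, 4C) -> (2H, 2W, C): inverse rearrangement of block_merge."""
    h, w, c4 = x.shape
    if c4 % 4:
        raise ValueError("channel count must be divisible by 4")
    c = c4 // 4
    x = x.reshape(h, w, 2, 2, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(2 * h, 2 * w, c)

features = np.arange(4 * 6 * 3).reshape(4, 6, 3)
merged = block_merge(features)      # shape (2, 3, 12)
restored = block_expand(merged)     # shape (4, 6, 3), identical to features
```

Because the two rearrangements are exact inverses, any information lost between encoder and decoder comes from the learned projections, which is one motivation for the skip connections described above.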
As illustrated in Figure 3, the input RGB images containing rich visual information are first partitioned into 75 equally sized patches. This partitioning strategy reduces computational complexity while maintaining spatial structure, allowing the model to focus on local features without losing global context. The patches are then transformed by a linear embedding layer, which projects them into a latent feature space. This step reduces dimensionality and encodes the original visual information into more abstract feature representations, making them suitable for subsequent Transformer-based processing. Although these embedded features differ formally from the raw input, they retain essential semantic content, providing a structured representation that forms the foundation for advanced vision tasks such as classification, detection, and segmentation.
The proposed architecture is based on the encoder–decoder framework and skip connection concept of U-Net. However, it introduces novel Attention Feature Transformation (AFT) blocks that reduce computational cost and improve feature representation. Unlike standard convolutional modules, the AFT block draws inspiration from MobileNet [14] and EfficientNet [10], adopting techniques such as separable convolutions and parameter optimization, while being specifically redesigned to balance efficiency and accuracy for pulmonary fibrosis segmentation.
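The saving from separable convolutions can be checked with simple arithmetic. The sketch below compares the weight count of a standard convolution with that of a depthwise separable one (a depthwise k × k filter per input channel followed by a 1 × 1 pointwise convolution). Biases are ignored, and the layer sizes are illustrative assumptions rather than values taken from AFTNet.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out

def separable_conv_params(k, c_in, c_out):
    """Weights in a depthwise k x k convolution plus a 1 x 1 pointwise convolution."""
    return k * k * c_in + c_in * c_out

# Illustrative layer: 3 x 3 kernel, 64 input channels, 128 output channels.
standard = standard_conv_params(3, 64, 128)    # 73728
separable = separable_conv_params(3, 64, 128)  # 8768
reduction = standard / separable
```

For this hypothetical layer, the separable form needs roughly 8.4 times fewer weights, which illustrates why such layers are attractive for lightweight designs.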
A key contribution of this work is the block partitioning and merging mechanism. This design adapts hierarchical concepts from the Vision Transformer [13] and Swin Transformer [15], but modifies them to better preserve semantic consistency and spatial detail in high-resolution chest X-ray images. In addition, the architecture employs a multi-level skip connection strategy that combines the residual learning concept of ResNet [16] with the multi-scale feature aggregation of FPN [17]. This hybrid design ensures smooth information flow between the encoder and decoder while enabling effective multi-scale feature integration, which is crucial for capturing the diverse and complex patterns of pulmonary fibrosis.
At the shallow encoder stages, the model primarily responds to fine-grained structural details, such as vascular textures and subtle reticular patterns, which are essential for early fibrosis detection. In contrast, deeper layers progressively aggregate broader contextual information, enabling the identification of coarse structural patterns such as honeycombing and regional fibrosis distribution. The combination of spatial-based attention, which refines local contextual focus, and channel-based attention, which adaptively reweights feature importance, allows AFTNet to balance detailed boundary preservation with high-level semantic understanding. This hierarchical attention mechanism effectively integrates both global and local representations, thereby enhancing the model’s interpretability and segmentation precision in assessing pulmonary fibrosis.
These innovations differentiate our model from prior architectures by combining lightweight design, attention-based feature refinement, and task-specific optimization, offering a solution tailored for accurate and efficient clinical deployment.
4. Experimental Results and Discussion
4.1. Experimental Environment
Experiments were conducted under a controlled computing environment to evaluate the proposed method rigorously. The experimental platform was configured with Windows 10 as the operating system, an Intel Core i9-11900 CPU, 32 GB RAM, and an NVIDIA GeForce RTX 3090 GPU with 24 GB of memory, ensuring adequate computational capacity for high-resolution image processing. Model development and training were performed using the PyTorch 2.1.1 framework with CUDA 11.8 acceleration, which provided optimized performance and efficient GPU utilization.
The training and evaluation of the proposed architecture were carefully designed to ensure effective learning and generalization to the target task. A total of 200 high-quality chest X-ray (CXR) images were collected from Taichung Veterans General Hospital (TCVGH), Taiwan. These images were obtained from patients who had been clinically diagnosed with idiopathic pulmonary fibrosis (IPF) or interstitial lung disease (ILD) by certified pulmonologists. Since there is no universally accepted staging system for ILD/IPF, disease severity in these patients was assessed based on the results of high-resolution computed tomography (HRCT) and pulmonary function tests (PFTs) performed at the time of diagnosis.
All CXRs were retrospectively retrieved following confirmed diagnosis, and the fibrotic regions were manually interpreted and annotated by experienced pulmonologists to ensure diagnostic precision and clinical reliability. The dataset, therefore, represents a clinically heterogeneous distribution of fibrosis severity, reflecting variations in disease manifestation and imaging characteristics commonly observed in real-world hospital settings. Prior to inclusion in the study, all images underwent complete de-identification and anonymization in compliance with institutional ethical standards and medical data protection regulations, ensuring both patient confidentiality and data integrity.
Although this study employed a single clinically annotated dataset, several measures were implemented to enhance the model’s robustness and generalizability. Specifically, a series of data augmentation techniques—including random rotation, horizontal flipping, brightness and contrast adjustment, and elastic transformation—were applied during training to simulate diverse imaging conditions commonly encountered in real-world clinical practice. These operations effectively increased dataset variability and reduced the risk of overfitting associated with limited data. Furthermore, the dataset was divided into training, validation, and test subsets following an 8:1:1 ratio, where the training set was used for parameter learning, the validation set for hyperparameter tuning and early stopping, and the test set for unbiased final performance evaluation. Collectively, these strategies enhanced the reproducibility and generalization capability of AFTNet, enabling it to maintain consistent accuracy across heterogeneous chest X-ray images.
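The 8:1:1 partition can be reproduced with a short, seeded shuffle. This is a minimal sketch using only the standard library; the function name and the fixed seed are assumptions, and the sketch splits by image because the text does not state whether the split was grouped by patient.

```python
import random

def split_dataset(items, ratios=(8, 1, 1), seed=0):
    """Shuffle reproducibly and split into train/validation/test subsets."""
    items = list(items)
    random.Random(seed).shuffle(items)
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]   # remainder goes to the test subset
    return train, val, test

# 200 image indices, matching the dataset size reported above.
train_ids, val_ids, test_ids = split_dataset(range(200))
```

With 200 images this yields 160 training, 20 validation, and 20 test images. If several radiographs come from the same patient, splitting by patient identifier rather than by image would avoid leakage between subsets.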
To preserve fine-grained detail critical for pulmonary fibrosis detection, all images were retained at their original high resolution (1760 × 2140 pixels) during training. The training schedule was set to 200 epochs, allowing sufficient iterations for the model to converge toward an optimal solution. Prolonged training enabled the network to capture complex structural and textural patterns within the data, while regularization strategies such as early stopping and weight decay were applied to prevent overfitting. The hyperparameters and training configurations are summarized in Table 1.
4.2. Evaluation Metrics
In medical image segmentation, multiple evaluation metrics are employed to comprehensively assess model performance, as no single measure can fully capture the diverse aspects of segmentation quality. Region-overlap–based metrics, such as the Dice Similarity Coefficient (DSC) and the Intersection over Union (IoU), are among the most commonly used. Both quantify the degree of overlap between predicted and ground-truth masks, with higher values indicating better agreement. The DSC is particularly sensitive to the balance between false positives and false negatives, while IoU offers a stricter measure of overlap, making them complementary in reflecting volumetric accuracy.
However, overlap-based metrics alone may not sufficiently characterize boundary precision, which is paramount in clinical contexts such as surgical navigation and radiotherapy planning. To address this limitation, the Hausdorff Distance (HD) is often employed as a boundary-based metric. HD measures the maximum surface-to-surface deviation between the predicted and reference contours, thereby capturing the worst-case error in boundary localization. A smaller HD value indicates closer geometric alignment and greater clinical reliability. To evaluate the proposed method’s effectiveness comprehensively, we adopted these three widely recognized core metrics in medical image segmentation.
Intersection over Union (IoU)
As defined in Equation (5), IoU measures the overlap between the ground-truth region, annotated as pulmonary fibrosis by radiologists, and the predicted region generated by the model. The denominator is the union of labeled and predicted pixels, while the numerator is their intersection. By calculating the ratio of overlap to union, IoU quantifies segmentation accuracy. This metric is crucial in medical imaging, as precise delineation of lesion boundaries directly impacts diagnostic reliability. The IoU score ranges from 0 to 1, with higher values indicating closer alignment to expert annotations, making it a key determinant of a model’s clinical applicability.
Dice Similarity Coefficient (DSC)
As shown in Equation (6), the Dice coefficient evaluates the similarity between two sets, making it particularly valuable for medical image segmentation tasks with class imbalance. Compared to IoU, DSC is more sensitive to small structures, thus providing a better assessment of the model’s ability to detect subtle lesions. This sensitivity is critical in pulmonary fibrosis analysis, where early-stage abnormalities often appear as small, irregular regions.
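Both overlap metrics can be computed from the same pixel counts. The sketch below uses the standard definitions, IoU = TP / (TP + FP + FN) and DSC = 2·TP / (2·TP + FP + FN), on binary masks given as nested lists. The function names and the convention of returning 1.0 when both masks are empty are assumptions for illustration.

```python
def confusion_counts(truth, pred):
    """Count true positives, false positives, and false negatives over two binary masks."""
    tp = fp = fn = 0
    for truth_row, pred_row in zip(truth, pred):
        for t, p in zip(truth_row, pred_row):
            if t and p:
                tp += 1
            elif p:
                fp += 1
            elif t:
                fn += 1
    return tp, fp, fn

def iou(truth, pred):
    tp, fp, fn = confusion_counts(truth, pred)
    union = tp + fp + fn
    return tp / union if union else 1.0   # both masks empty: treated as perfect

def dice(truth, pred):
    tp, fp, fn = confusion_counts(truth, pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0

truth_mask = [[1, 1, 0],
              [0, 1, 0]]
pred_mask = [[1, 0, 0],
             [0, 1, 1]]
iou_score = iou(truth_mask, pred_mask)    # 2 / 4 = 0.5
dice_score = dice(truth_mask, pred_mask)  # 4 / 6, about 0.667
```

The two scores are monotonically related by DSC = 2·IoU / (1 + IoU), so they always rank a pair of predictions on the same image identically; they differ in how strongly they penalize partial overlap.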
The Hausdorff Distance (HD) is a widely adopted metric for evaluating the geometric accuracy of segmentation results, particularly in medical image analysis. Formally, for two point sets $X$ (predicted boundary) and $Y$ (ground-truth boundary), the directed Hausdorff distance is defined as:
$$h(X, Y) = \max_{x \in X} \min_{y \in Y} \lVert x - y \rVert \tag{7}$$
which represents the maximum distance from a point in set $X$ to its nearest neighbor in set $Y$. The Hausdorff Distance is then defined symmetrically as:
$$\mathrm{HD}(X, Y) = \max\{h(X, Y),\ h(Y, X)\} \tag{8}$$
As illustrated in Figure 5, HD quantifies the maximum boundary deviation between predicted and ground-truth segmentations. Unlike region-overlap metrics such as the Dice Similarity Coefficient (DSC) or Jaccard Index, HD is particularly sensitive to boundary errors and outliers.
A smaller HD value indicates closer alignment between predicted and reference boundaries, implying more accurate localization of pathological regions. This boundary-level precision is essential for minimizing treatment margins, improving radiation dose targeting, and reducing the risk of damaging adjacent healthy tissue. Consequently, HD complements overlap-based indices by providing additional information on worst-case boundary discrepancies, offering a more comprehensive evaluation of segmentation quality.
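The directed and symmetric distances can be computed directly from their definitions. The brute-force sketch below uses Euclidean distance between boundary points given as coordinate tuples; it is meant for illustration only, since its cost grows with the product of the two set sizes. The function names are assumptions.

```python
import math

def directed_hausdorff(xs, ys):
    """Largest distance from a point in xs to its nearest neighbour in ys."""
    return max(min(math.dist(x, y) for y in ys) for x in xs)

def hausdorff(xs, ys):
    """Symmetric Hausdorff distance: the larger of the two directed distances."""
    return max(directed_hausdorff(xs, ys), directed_hausdorff(ys, xs))

predicted = [(0, 0), (0, 1)]
reference = [(0, 0), (3, 4)]
forward = directed_hausdorff(predicted, reference)   # 1.0
backward = directed_hausdorff(reference, predicted)  # about 4.243
hd = hausdorff(predicted, reference)                 # about 4.243
```

The example shows why the symmetric form is needed: every predicted point lies within 1 unit of the reference, yet the reference point (3, 4) is about 4.24 units from the nearest predicted point, and only the symmetric distance reports that outlier.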
The severity of pulmonary fibrosis was quantified using the ratio of the fibrotic region area to the total lung area, as defined in Equation (9). Practically, this corresponds to the proportion of fibrotic pixels relative to all lung-field pixels. This quantitative indicator enables both severity grading and longitudinal assessment of disease progression.
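The ratio reduces to a pixel count over two binary masks. In the minimal sketch below, the function name and the choice to count only fibrotic pixels that fall inside the lung mask are assumptions; the text does not say how predictions outside the lung field are handled.

```python
def fibrosis_ratio(lung_mask, fibrosis_mask):
    """Fraction of lung-field pixels that are labelled fibrotic."""
    lung_pixels = fibrotic_pixels = 0
    for lung_row, fib_row in zip(lung_mask, fibrosis_mask):
        for in_lung, is_fibrotic in zip(lung_row, fib_row):
            if in_lung:
                lung_pixels += 1
                if is_fibrotic:
                    fibrotic_pixels += 1
    if lung_pixels == 0:
        raise ValueError("lung mask is empty")
    return fibrotic_pixels / lung_pixels

lung = [[0, 1, 1, 0],
        [0, 1, 1, 0],
        [0, 1, 1, 0]]
fibrosis = [[0, 0, 0, 0],
            [0, 0, 1, 1],   # the last pixel lies outside the lung and is ignored
            [0, 1, 0, 0]]
ratio = fibrosis_ratio(lung, fibrosis)   # 2 fibrotic pixels out of 6 lung pixels
```

Comparing this ratio across follow-up radiographs of the same patient gives the longitudinal measure described above, provided the lung segmentation is consistent between examinations.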
To achieve stable convergence and accurate region-level segmentation, the proposed model was trained using a hybrid loss function combining Cross-Entropy Loss and Dice Loss. Both terms are computed from the ground-truth and predicted pixel values over all pixels in the image, and a small constant ε prevents division by zero.
The Cross-Entropy Loss focuses on pixel-level classification accuracy, while the Dice Loss ensures optimal region overlap, effectively handling class imbalance. Their combination allows the network to maintain both local precision and global consistency in pulmonary fibrosis segmentation.
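A minimal sketch of such a hybrid loss is given below, assuming the commonly used forms: binary cross-entropy averaged over pixels, and a Dice loss of 1 − (2·Σ y·p + ε) / (Σ y + Σ p + ε). The equal weighting of the two terms, the value of ε, and all names are assumptions, since the exact formulation is not reproduced in this text.

```python
import math

def cross_entropy_loss(truth, prob, eps=1e-7):
    """Mean binary cross-entropy over flattened pixels; eps keeps log() finite."""
    total = 0.0
    for y, p in zip(truth, prob):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(truth)

def dice_loss(truth, prob, eps=1e-7):
    """One minus the soft Dice coefficient; eps prevents division by zero."""
    intersection = sum(y * p for y, p in zip(truth, prob))
    return 1.0 - (2.0 * intersection + eps) / (sum(truth) + sum(prob) + eps)

def hybrid_loss(truth, prob, ce_weight=1.0, dice_weight=1.0):
    """Weighted sum of the two terms (equal weights are an assumption)."""
    return ce_weight * cross_entropy_loss(truth, prob) + dice_weight * dice_loss(truth, prob)

labels = [1, 0, 1, 0]
uncertain = hybrid_loss(labels, [0.5, 0.5, 0.5, 0.5])   # about ln(2) + 0.5
confident = hybrid_loss(labels, [0.9, 0.1, 0.9, 0.1])   # lower loss
```

The cross-entropy term treats every pixel equally, so on images where fibrosis covers a small area it is dominated by background pixels; the Dice term depends only on the overlap with the labelled region, which is why the combination is often used under class imbalance.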
4.3. Comparison with Different Segmentation Models
To rigorously assess the usability and effectiveness of the proposed method, we conducted comparative experiments against several state-of-the-art semantic segmentation models, including UNet [12], UNet++ [18], DeepLabV3+ [19], Swin-Unet [20], Segformer [21], HPA-UNet [22], and DRD-UNet [23].
The experimental results are summarized in Table 2. As the results indicate, our proposed AFTNet achieves superior performance across all key evaluation metrics while maintaining a comparable model complexity without significantly increasing parameters. Specifically, AFTNet attains an IoU of 80.34%, a DSC of 81.76%, and a Hausdorff Distance of 21.54, all of which outperform the results obtained by the competing models. These findings underscore the robustness, efficiency, and clinical applicability of AFTNet in medical image segmentation tasks, offering both high accuracy and computational practicality.
To ensure a fair and consistent comparison among different models, all baseline networks presented in Table 2 were trained and evaluated under identical experimental conditions. Each model utilized the same clinically annotated chest X-ray dataset, with an input resolution of 1760 × 2140 pixels and an 8:1:1 split for training, validation, and testing. The same data augmentation strategies (random rotation, horizontal flipping, brightness and contrast adjustment, and elastic transformation) were applied to all models during training.
For optimization, all experiments employed the Adam optimizer with an initial learning rate of 1 × 10⁻⁴, a batch size of 4, and a maximum of 200 training epochs. In addition, a cosine learning rate schedule and early stopping based on validation loss were used uniformly across all models to prevent overfitting and ensure stable convergence. These consistent settings ensure that the performance differences in Table 2 accurately reflect the inherent design and representational advantages of each architecture, rather than discrepancies in training configurations.
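The schedule and stopping rule can be sketched in a few lines of standard-library Python. The base learning rate and the 200-epoch horizon come from the settings above; the minimum learning rate, the patience value, and the class and function names are assumptions, as the text does not report them.

```python
import math

def cosine_lr(epoch, max_epochs=200, base_lr=1e-4, min_lr=0.0):
    """Cosine decay from base_lr at epoch 0 to min_lr at max_epochs."""
    progress = epoch / max_epochs
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

class EarlyStopping:
    """Stop when validation loss has not improved for `patience` epochs in a row."""

    def __init__(self, patience=10):
        self.patience = patience
        self.best = math.inf
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Toy run with a hypothetical validation curve and a short patience.
stopper = EarlyStopping(patience=2)
decisions = [stopper.step(loss) for loss in [1.0, 0.9, 0.95, 0.97]]
```

In the toy run the loss stops improving after the second epoch, so the stopper signals a halt on the fourth call. The learning rate under this schedule is 1 × 10⁻⁴ at epoch 0 and half that value at epoch 100.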
From the experimental results, it can be observed that the proposed method achieves improvements in IoU and DSC, along with a reduction in Hausdorff Distance. This performance gain can be attributed to the design of the SB-MSA and CB-MSA modules. The SB-MSA module focuses on modeling spatial dependencies across distant pulmonary regions, enabling the network to capture global contextual information and subtle structural correlations in chest X-ray images. This capability is significant for recognizing diffuse and irregular fibrotic patterns that often appear across large areas of the lung field.
In contrast, the CB-MSA module emphasizes inter-channel relationships, allowing the network to adaptively reweight feature responses according to their diagnostic relevance. By enhancing the contrast between fibrosis-related and non-fibrotic features, CB-MSA strengthens texture-level discrimination and improves boundary precision. When these two attention mechanisms operate jointly within the Attention Feature Transformation (AFT) blocks, they form a complementary relationship: SB-MSA provides a broad semantic understanding of structural context, while CB-MSA refines feature saliency and local detail. This synergy effectively enhances both global interpretability and fine-grained segmentation accuracy, contributing to the superior performance demonstrated by AFTNet in pulmonary fibrosis detection and quantification tasks.
In addition to segmentation accuracy, the efficiency of the proposed model was further evaluated in terms of parameter size and inference time. As summarized in Table 2, AFTNet achieves superior performance while maintaining a moderate parameter count of 39.6 M, which is lower than several state-of-the-art models, such as Segformer (88.3 M), HPA-UNet (44.9 M), UNet++ (44.8 M), and DRD-UNet (48.6 M). This demonstrates that the proposed network attains a favorable balance between accuracy and model complexity. The efficiency can be attributed to the use of separable convolutions and the optimized Attention Feature Transformation (AFT) blocks, which effectively reduce redundant operations while enhancing feature expressiveness. Consequently, AFTNet provides a lightweight yet powerful solution that enables rapid and accurate pulmonary fibrosis segmentation, making it suitable for real-time clinical applications and deployment in resource-constrained environments.
Figure 6a,b illustrate the primary outcomes of the proposed framework at different processing stages.
Figure 6a depicts the lung region extracted from the original chest X-ray image during the preprocessing stage. This step effectively isolates the lung fields from surrounding anatomical structures, such as the rib cage and soft tissues, thereby reducing background noise and providing a clean input for subsequent analysis. The extracted lung region preserves essential anatomical details, including bronchial branches, vascular patterns, and variations in tissue density, which serve as the foundation for assessing pulmonary fibrosis.
Figure 6b shows the results produced by the proposed semantic segmentation model. In contrast to the preprocessing step, the model performs selective segmentation, focusing on regions of lung tissue with a higher likelihood of fibrosis occurrence. As illustrated, it successfully highlights fibrotic areas predominantly in the lower lung zones, which are clinically recognized as common sites of fibrosis manifestation. This selective segmentation underscores the model’s ability to emphasize clinically relevant features, supporting accurate lesion localization, early detection, and longitudinal disease monitoring.
To quantitatively evaluate fibrosis severity, we computed the fibrosis ratio, defined as the proportion of the fibrotic region area relative to the total lung area. The segmented lung region from Figure 6a serves as the denominator (total lung area), while the fibrosis-positive regions identified in Figure 6b constitute the numerator. This ratio provides an intuitive and precise metric for assessing both the extent and progression of pulmonary fibrosis.
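The ratio defined above can be sketched in a few lines, assuming both segmentation outputs are available as binary masks of identical shape. The function name `fibrosis_ratio` and the decision to count only fibrotic pixels that fall inside the lung mask are assumptions made for this illustration, not details confirmed by the method description.

```python
import numpy as np

def fibrosis_ratio(lung_mask, fibrosis_mask):
    """Return fibrotic area divided by total lung area (both in pixels)."""
    lung = np.asarray(lung_mask, dtype=bool)
    fibrosis = np.asarray(fibrosis_mask, dtype=bool)
    if lung.shape != fibrosis.shape:
        raise ValueError("masks must have the same shape")
    lung_area = int(lung.sum())
    if lung_area == 0:
        raise ValueError("lung mask is empty")
    # Assumption: fibrotic pixels outside the lung field are ignored.
    fibrosis_area = int((fibrosis & lung).sum())
    return fibrosis_area / lung_area

# Toy 4x4 example: 8 lung pixels, 2 of which are fibrotic.
lung = np.array([[0, 1, 1, 0],
                 [0, 1, 1, 0],
                 [0, 1, 1, 0],
                 [0, 1, 1, 0]])
fibrosis = np.array([[0, 0, 0, 0],
                     [0, 0, 0, 0],
                     [0, 0, 0, 0],
                     [0, 1, 1, 0]])
print(fibrosis_ratio(lung, fibrosis))  # 0.25
```

Restricting the numerator to the lung mask keeps the ratio within [0, 1] even if the segmentation model marks a few pixels outside the extracted lung field.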
As demonstrated in Figure 7, the segmented fibrosis regions are highlighted in semi-transparent pink for a more intuitive clinical visualization. The corresponding fibrosis ratio is then calculated to assist physicians in evaluating the extent of pulmonary fibrosis. This quantitative measure enables clinicians to monitor disease progression across follow-up examinations and provides a reliable reference for determining the most appropriate therapeutic strategy.
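A semi-transparent highlight of this kind can be produced by alpha blending. The sketch below is one possible way to do it, assuming an 8-bit grayscale radiograph and a binary fibrosis mask; the specific pink value and the opacity of 0.4 are illustrative choices, since the exact rendering settings are not stated.

```python
import numpy as np

def overlay_mask(gray_image, mask, color=(255, 105, 180), alpha=0.4):
    """Blend `color` into the masked pixels of a grayscale image.

    Returns an RGB uint8 array of shape (H, W, 3).
    """
    gray = np.asarray(gray_image, dtype=np.float64)
    selected = np.asarray(mask, dtype=bool)
    # Replicate the grayscale channel so a colored tint can be applied.
    rgb = np.stack([gray, gray, gray], axis=-1)
    tint = np.array(color, dtype=np.float64)
    # Alpha blending, applied only where the mask is set.
    rgb[selected] = (1.0 - alpha) * rgb[selected] + alpha * tint
    return np.clip(np.rint(rgb), 0, 255).astype(np.uint8)

# Toy example: uniform gray image with a single highlighted pixel.
image = np.full((2, 2), 100, dtype=np.uint8)
mask = np.array([[1, 0],
                 [0, 0]])
blended = overlay_mask(image, mask)
print(blended[0, 0], blended[1, 1])
```

Pixels outside the mask keep their original intensity, so the underlying anatomy stays visible beneath the highlight.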
4.4. Discussion
Although the proposed AFTNet is inspired by hybrid CNN-Transformer architectures such as TransUNet, Swin-UNet, and CAT-UNet, it introduces several important innovations that clearly differentiate it from these existing frameworks. The following aspects highlight the architectural and practical distinctions that contribute to AFTNet’s superior performance and clinical applicability.
AFTNet incorporates a novel dual-attention Attention Feature Transformation (AFT) block that unifies Spatial-Based Multi-Head Self-Attention (SB-MSA) and Channel-Based Multi-Head Self-Attention (CB-MSA) within a residual framework. This design enables the network to concurrently capture both spatial and inter-channel dependencies, enhancing its ability to model complex pulmonary structures and heterogeneous fibrotic textures in chest X-ray images. The dual-attention mechanism strengthens the contextual representation of subtle pathological features that are often overlooked in single-attention or convolution-only architectures.
AFTNet emphasizes computational efficiency while maintaining high segmentation accuracy. Through a hierarchical block partitioning and merging strategy, the model reduces parameter complexity and memory cost, requiring only 39.6 million parameters. This is less than half the size of the Transformer-based SegFormer (88.3 M) and slightly smaller than the convolutional DeepLabV3+ (42.1 M). This compact design enables real-time inference on standard medical hardware, facilitating seamless integration into clinical diagnostic systems without the need for specialized computational resources.
Unlike general-purpose segmentation networks, AFTNet is specifically optimized for the detection and quantitative assessment of pulmonary fibrosis. The architecture incorporates a dedicated preprocessing module to isolate lung fields and eliminate non-lung regions, effectively reducing background interference. Furthermore, a fibrosis-ratio quantification module translates segmentation outputs into clinically interpretable metrics that reflect disease severity. This end-to-end design bridges the gap between algorithmic segmentation and medical interpretation, facilitating clinical adoption and decision support.
These structural and functional innovations distinguish AFTNet from prior hybrid CNN-Transformer models. By combining dual attention mechanisms, lightweight optimization, and fibrosis-specific adaptation, AFTNet provides a more accurate, efficient, and clinically relevant framework for automated pulmonary fibrosis analysis in chest X-ray imaging. The model’s strong empirical performance and its validation in collaboration with Taichung Veterans General Hospital further confirm its potential for real-world deployment in intelligent diagnostic systems.
5. Conclusions
This paper proposed an Attention Feature Transformation Network (AFTNet) for automated pulmonary fibrosis identification and quantitative assessment. By integrating convolutional neural networks with Transformer-based architectures and incorporating dual attention mechanisms across spatial and channel dimensions, AFTNet achieves enhanced feature representation and superior segmentation accuracy. Experimental evaluations demonstrated that AFTNet consistently outperformed state-of-the-art models, including UNet, UNet++, DeepLabV3+, Swin-Unet, SegFormer, HPA-UNet, and DRD-UNet, in terms of IoU, DSC, and Hausdorff metrics, while maintaining a balanced model complexity (39.6 M parameters).
Moreover, incorporating a pulmonary segmentation preprocessing step effectively eliminated interference from non-lung tissues, thereby improving fibrosis detection accuracy by focusing on lung parenchymal regions. The subsequent fibrosis proportion calculation provided a reliable quantitative indicator of disease severity, offering clinicians an objective reference for evaluating progression and therapeutic outcomes.
The proposed AFTNet provides an efficient and accurate framework for the clinical diagnosis and longitudinal monitoring of pulmonary fibrosis. Its robustness and computational efficiency make it suitable for deployment in resource-limited healthcare settings, advancing intelligent diagnostic systems for pulmonary diseases. Although AFTNet demonstrated strong segmentation performance, this study did not include a detailed ablation analysis to quantify the contribution of each attention module. Future work will conduct a comprehensive ablation study and evaluate AFTNet on multi-center datasets to verify its generalizability. Further extensions will explore multimodal data integration, automated classification of diverse pulmonary pathologies, and temporal disease modeling to enhance both clinical applicability and interpretability. These efforts will deepen understanding of how attention-based mechanisms contribute to feature learning and strengthen the robustness and scalability of AFTNet in real-world medical imaging applications.
In addition, although we have proposed a quantitative measure of pulmonary fibrosis severity based on chest X-ray segmentation, this study did not include a correlation analysis with established radiological scoring systems or physiological parameters such as pulmonary function indices or arterial blood gas results. In future work, we aim to integrate these clinical and physiological measurements with the automatically computed fibrosis ratio to validate its clinical relevance and strengthen its diagnostic interpretability. Establishing such correlations may enable the fibrosis ratio to serve as a noninvasive biomarker for early fibrosis screening, disease progression monitoring, and treatment response evaluation, thereby enhancing the practical value of AFTNet in real-world clinical applications.