Article

YOLO-Based Shading Artifact Reduction for CBCT-to-MDCT Translation Using Two-Stage Learning

Department of Computer Engineering, Korea National University of Transportation, Chungju 27469, Republic of Korea
*
Author to whom correspondence should be addressed.
Mathematics 2026, 14(7), 1223; https://doi.org/10.3390/math14071223
Submission received: 19 February 2026 / Revised: 21 March 2026 / Accepted: 3 April 2026 / Published: 6 April 2026

Abstract

Cone-beam computed tomography (CBCT) offers advantages of low radiation dose and rapid acquisition but suffers from scatter-induced shading artifacts that limit diagnostic value compared to multi-detector CT (MDCT). While CycleGAN enables unpaired image translation, its uniform loss application struggles with localized artifact removal. We propose a two-stage learning framework with YOLO-based region correction loss. Stage 1 trains a standard CycleGAN to establish stable CBCT-MDCT domain mapping. Stage 2 fine-tunes the model by applying gradient magnitude minimization loss selectively to artifact regions detected by a pretrained YOLO detector, enabling focused correction while preserving anatomical structures. Using 11,000 2D CBCT slices from 17 patients (14 training, 3 testing) and 23,500 2D MDCT slices from 50 patients, our method achieves a 14.0% reduction in artifact score compared to baseline CycleGAN while maintaining high structural similarity (SSIM > 0.96). Independent evaluation using integral nonuniformity (INU) and shading index (SI) confirms consistent improvement across physics-based metrics. The self-regulating mechanism, where YOLO detection confidence naturally decreases as artifacts diminish, provides automatic adjustment without manual intervention. This work demonstrates that combining staged learning with object detection offers an effective solution for localized artifact removal in medical image translation, potentially improving diagnostic accuracy while preserving the low-dose benefits of CBCT.

1. Introduction

Computed tomography (CT) is a cornerstone of modern medical diagnostics [1]. Among CT modalities, cone-beam CT (CBCT) has gained widespread adoption in dentistry, orthopedics, and radiation therapy guidance due to its low radiation dose (typically 10–100 times lower than conventional CT) and rapid acquisition capability [2,3]. Its cone-beam geometry enables complete 3D volume acquisition with a single gantry rotation, providing excellent visualization of high-contrast structures such as bone.
However, CBCT suffers from inherent limitations stemming from its wide-area acquisition geometry. The primary challenge is increased scatter radiation: as X-rays traverse the patient’s body, Compton scattering causes photons to deviate from their original paths [4]. These scattered photons reaching the detector degrade image contrast and increase noise. Due to the geometric characteristics of cone beams, scatter accumulates predominantly in lower regions, producing characteristic lower-region shading artifacts. These artifacts obscure soft tissue structures and significantly reduce diagnostic value compared to multi-detector CT (MDCT), which achieves superior image quality through collimated fan beams and effective scatter rejection.
Recent advances in deep learning have transformed medical imaging [5,6,7]. Convolutional neural networks (CNNs) have achieved remarkable success in segmentation [8], classification [9], and detection [10]. Generative adversarial networks (GANs) [11] have opened new possibilities for image-to-image translation, with CycleGAN [12] enabling unpaired domain mapping through cycle consistency constraints. This capability is particularly valuable for CBCT-to-MDCT translation, as obtaining perfectly registered image pairs from identical patients is practically infeasible [4,13]. More recently, transformer-based architectures such as SwinUNet [14] and diffusion models [15] have shown promising results for medical image synthesis, achieving high fidelity in paired-data settings. However, these methods typically require registered image pairs or substantially larger datasets, limiting their applicability to unpaired CBCT-MDCT translation where our method operates.
Despite these advances, directly applying conventional CycleGAN to CBCT artifact removal presents fundamental limitations. First, CycleGAN simultaneously optimizes multiple loss functions with competing objectives, requiring careful hyperparameter tuning and often leading to training instability in medical imaging applications where domain differences are substantial. Second, existing approaches apply losses uniformly across the entire image, implicitly learning global style transformation without explicitly identifying artifact regions. As illustrated in Figure 1, where the red boxes highlight residual lower-region shading artifacts, these artifacts frequently persist in generated images despite successful overall domain transformation, limiting clinical applicability.
To address these limitations, we propose a two-stage learning framework with YOLO-based region correction loss. Our key insight is that learning objectives should be clearly separated: Stage 1 focuses exclusively on stable domain mapping using standard GAN and cycle consistency losses, while Stage 2 explicitly targets artifact removal using object detection technology. By leveraging spatial localization from YOLO [10], we apply focused correction only to problematic regions, preserving normal anatomical structures while selectively removing artifacts.
This work makes the following contributions:
  • A two-stage learning strategy that separates global domain mapping (Stage 1) from localized artifact correction (Stage 2), reducing optimization complexity and improving training stability.
  • A YOLO-based region correction loss that applies gradient magnitude minimization selectively to detected artifact regions through a fully differentiable formulation, enabling direct generator optimization while preserving anatomical structures.
  • A self-regulating mechanism where YOLO detection confidence naturally decreases as artifacts diminish, providing automatic adjustment of correction intensity without manual intervention.
  • Experimental validation demonstrating 14.0% artifact score reduction while maintaining structural similarity (SSIM > 0.96) on a dataset of 11,000 CBCT and 23,500 MDCT images, with ablation studies confirming the superiority of two-stage learning over joint training.

2. Related Work

2.1. Unpaired Image-to-Image Translation

Image-to-image translation learns mappings between visual domains. Early approaches like Pix2Pix [16] achieved excellent results using conditional GANs [17] with paired training data, but require pixel-aligned image pairs that are difficult to obtain in medical applications. Pix2PixHD [18] extended this to high-resolution synthesis using multi-scale architectures.
CycleGAN [12] addressed the paired data requirement by learning bidirectional mappings with cycle consistency loss. The constraint x → G(x) → F(G(x)) ≈ x enforces content preservation during transformation without requiring registered data, enabling applications where paired data is unavailable. Subsequent works extended unpaired translation: DiscoGAN [19] and DualGAN [20] independently proposed similar concepts; UNIT [21] assumed shared latent spaces; MUNIT [22] enabled multi-modal outputs; and StarGAN [23] handled multi-domain translation. However, all these methods apply losses uniformly across images, limiting effectiveness for localized modifications such as artifact removal.

2.2. Deep Learning for CT Image Enhancement

Deep learning has been extensively applied to CT image quality improvement [24]. For supervised settings, CNN-based networks have shown success in low-dose CT denoising [25]. Encoder–decoder architectures including U-Net [8], V-Net [26], and Attention U-Net [27] provide effective frameworks for medical image processing.
For unpaired CBCT-MDCT translation, CycleGAN-based approaches have gained significant attention. Liang et al. [13] applied CycleGAN to generate synthetic CT from CBCT for radiation therapy dose calculation. Kida et al. [4] demonstrated CycleGAN’s ability to reduce CBCT artifacts for adaptive radiotherapy. Applications extend to MRI-CT conversion [28] and low-dose CT denoising [29]. Recent systematic reviews [30] confirm growing interest in CBCT-CT synthesis.
However, existing CycleGAN methods face common limitations: training instability from simultaneous multi-objective optimization, and inability to explicitly target localized artifacts. GAN training challenges including mode collapse and gradient vanishing [31,32] are exacerbated when losses treat artifact regions identically to normal anatomy.

2.3. Object Detection in Medical Imaging

Object detection localizes structures, lesions, and abnormalities in medical images. While Faster R-CNN [33] provides accurate detection through region proposal networks, YOLO [10] reformulates detection as single-pass regression, enabling real-time inference. The YOLO family has evolved significantly: YOLOv3 [34] introduced multi-scale predictions; YOLOv4 [35] integrated state-of-the-art techniques; and recent versions like YOLO11 [36] achieve improved accuracy with reduced parameters.
Medical applications include lung nodule detection, breast cancer screening, and lesion localization. Nonmaximum suppression (NMS) [37] removes duplicate detections. However, integrating object detection into generative model training for artifact removal—using detection to guide generator optimization—represents a novel direction that we explore in this work.

3. Materials and Methods

3.1. Dataset

This study uses completely unpaired data independently acquired from different patient groups, reflecting realistic clinical scenarios where paired CBCT-MDCT data is unavailable.
The CBCT dataset consists of 11,000 2D axial slices extracted from head and neck volumes of 17 patients acquired during radiation therapy sessions. These images exhibit characteristic lower-region shading artifacts due to scatter radiation. The dataset is split into 9350 training slices (14 patients) and 1650 validation slices (3 patients) to ensure no patient overlap and prevent data leakage.
The MDCT dataset comprises 23,500 2D axial slices from head and neck volumes of 50 patients, collected independently from the CBCT cohort. MDCT images serve as the target domain, providing reference characteristics for high-quality CT without corresponding paired images.
All images have original dimensions of 400 × 400 pixels, resized to 256 × 256 for training efficiency. Pixel intensities are normalized to the [−1, 1] range.

3.2. Method Overview

The proposed methodology combines two-stage learning with YOLO-based region correction loss to explicitly remove lower-region shading artifacts. Unlike conventional CycleGAN approaches that apply uniform losses across entire images, our framework separates learning into distinct stages with clearly defined objectives, as illustrated in Figure 2.
Stage 1 establishes stable domain transformation between CBCT and MDCT using only adversarial and cycle consistency losses. This provides a foundation for accurate domain mapping without the complexity of additional loss terms. Stage 2 adds YOLO-based region correction loss, explicitly identifying shading regions and applying focused correction only to those areas.
This staged separation offers three advantages: (1) each stage focuses on clearly defined objectives, improving training stability; (2) fine-grained local correction in Stage 2 builds upon the stable transformation from Stage 1; and (3) separating global and local losses by stage avoids conflicts between competing objectives.

3.3. Network Architecture

The proposed method builds upon CycleGAN [12] with ResNet-based generators and PatchGAN discriminators. Let X denote the CBCT domain and Y the MDCT domain. The network comprises two generators G: X → Y and F: Y → X, and two discriminators D_Y and D_X. Generator G transforms CBCT image x into MDCT-style image ŷ = G(x), while F performs the reverse mapping. Discriminators distinguish real from generated images in their respective domains.

3.3.1. Generator Architecture

Both generators employ identical ResNet-based architectures with fully convolutional design, avoiding pooling layers to minimize spatial information loss. The input image passes through an initial 7 × 7 convolution layer mapping to 64-dimensional feature space, followed by ReLU activation.
Features are then transformed through 10 residual blocks [38]. Each block contains two 3 × 3 convolution layers with ReLU activation and skip connections:
h_{i+1} = h_i + Conv_2(ReLU(Conv_1(h_i)))
The output stage uses a 7 × 7 convolution generating single-channel output. A global skip connection adds the input image before tanh activation:
ŷ = tanh(x + Conv_out(h_10))
This residual learning approach preserves input details while modifying only necessary parts—particularly suitable for CT images where most anatomical structures should remain unchanged while removing only artifacts.
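The generator described above can be sketched in PyTorch as follows (a minimal illustration: channel counts, kernel sizes, and the global skip follow the text, while normalization layers and padding modes are simplified assumptions):

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """One residual block: h_{i+1} = h_i + Conv2(ReLU(Conv1(h_i)))."""
    def __init__(self, ch=64):
        super().__init__()
        self.conv1 = nn.Conv2d(ch, ch, 3, padding=1)
        self.conv2 = nn.Conv2d(ch, ch, 3, padding=1)

    def forward(self, h):
        return h + self.conv2(torch.relu(self.conv1(h)))

class Generator(nn.Module):
    """7x7 stem to 64 channels, 10 residual blocks, 7x7 output conv,
    then a global skip connection before tanh: y_hat = tanh(x + Conv_out(h10))."""
    def __init__(self, ch=64, n_blocks=10):
        super().__init__()
        self.stem = nn.Conv2d(1, ch, 7, padding=3)
        self.blocks = nn.Sequential(*[ResBlock(ch) for _ in range(n_blocks)])
        self.out = nn.Conv2d(ch, 1, 7, padding=3)

    def forward(self, x):
        h = torch.relu(self.stem(x))
        h = self.blocks(h)
        # residual learning: the network predicts only the modification to x
        return torch.tanh(x + self.out(h))
```

The fully convolutional design keeps spatial resolution unchanged end to end, so the generator only needs to learn the correction signal rather than resynthesize the whole slice.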
To justify the two-stage approach, we analyzed gradient conflicts between the global GAN loss and the local correction loss. In joint training, the cycle consistency gradient for generator G depends on the Jacobian of the backward generator F, which is simultaneously updated. As both θ_G and θ_F change each step, the coupled Jacobian induces gradient oscillation. To resolve this, we employ an asymmetric update strategy in Stage 2: the backward generator F is frozen at θ_F (learned in Stage 1) and both discriminators D_X and D_Y are likewise fixed, so that only the forward generator G receives parameter updates from the combined adversarial, cycle consistency, and region correction losses. This decouples the Jacobian dependency and is mathematically equivalent to alternating optimization under the concave–convex procedure (CCP) framework, enabling the region correction loss to drive deeper artifact suppression without destabilizing the learned domain mapping. Ablation experiments confirm the effectiveness of this strategy: joint training yields artifact score 0.575 with significantly degraded PSNR (p < 0.001, Cohen's d = 0.056), while the proposed asymmetric update approach achieves 0.444 (p < 0.001).

3.3.2. Discriminator Architecture

Both discriminators use PatchGAN [16] architecture, determining authenticity at local patch level rather than globally. Following an initial 3 × 3 convolution, four stride-2 convolution layers with spectral normalization [39] progressively reduce spatial resolution. Each layer uses 4 × 4 kernels, batch normalization, and LeakyReLU activation (slope 0.2). The final 1 × 1 convolution with Sigmoid activation outputs per-location authenticity probabilities. PatchGAN’s focus on high-frequency details and reduced parameter count enables discrimination of arbitrary-sized inputs, while spectral normalization constrains the Lipschitz constant for training stability.

3.4. Stage 1: Base CycleGAN Training

Stage 1 establishes stable domain transformation using only adversarial and cycle consistency losses.

3.4.1. Adversarial Loss

Following LSGAN [40] for improved stability, adversarial losses are
L_{D_Y} = E_y[(D_Y(y) − 1)²] + E_x[D_Y(G(x))²]
L_G = E_x[(D_Y(G(x)) − 1)²]
The discriminator learns to output 1 for real MDCT images and 0 for generated images, while the generator learns to produce outputs recognized as real. Losses L D X and L F for reverse mapping are defined symmetrically.
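A minimal NumPy sketch of these least-squares objectives, assuming d_real and d_fake are the discriminator's per-patch outputs on real and generated batches:

```python
import numpy as np

def d_loss(d_real, d_fake):
    # L_{D_Y} = E_y[(D_Y(y) - 1)^2] + E_x[D_Y(G(x))^2]
    # push the discriminator toward 1 on real images, 0 on generated ones
    return np.mean((d_real - 1) ** 2) + np.mean(d_fake ** 2)

def g_adv_loss(d_fake):
    # L_G = E_x[(D_Y(G(x)) - 1)^2]
    # push the generator's outputs to be scored as real (1)
    return np.mean((d_fake - 1) ** 2)
```

Both losses vanish exactly when the discriminator perfectly separates real from fake (for d_loss) or is perfectly fooled (for g_adv_loss).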

3.4.2. Cycle Consistency Loss

Cycle consistency enforces anatomical structure preservation:
L_cyc = E_x[||F(G(x)) − x||_1] + E_y[||G(F(y)) − y||_1]
This bidirectional constraint ensures meaningful correspondences rather than arbitrary transformations—critical for preserving anatomical structures in medical imaging.
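The bidirectional constraint can be sketched as follows (a NumPy illustration with G and F as callables; interpreting the L1 norm as a per-pixel mean is our assumption):

```python
import numpy as np

def cycle_loss(G, F, x, y):
    # L_cyc = E_x[||F(G(x)) - x||_1] + E_y[||G(F(y)) - y||_1]
    # both round trips X -> Y -> X and Y -> X -> Y must reproduce the input
    return np.mean(np.abs(F(G(x)) - x)) + np.mean(np.abs(G(F(y)) - y))
```

Identity mappings drive the loss to zero; any transformation that discards content in one direction is penalized by the failed reconstruction in the return trip.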

3.4.3. Total Loss Functions

The total generator loss combines adversarial and cycle consistency terms:
L_{G,total} = (1/2)(L_G + L_F) + λ_cyc · L_cyc
where λ_cyc = 10 provides strong structure preservation. The discriminator loss is
L_{D,total} = (1/2)(L_{D_Y} + L_{D_X})
Generators and discriminators are optimized alternately. After Stage 1, generators perform overall CBCT-to-MDCT transformation, but uniformly applied losses leave local shading artifacts unaddressed.

3.5. Stage 2: YOLO-Based Fine-Tuning

Stage 2 explicitly identifies and removes shading artifacts remaining from Stage 1 by adding YOLO-based region correction loss to the trained baseline model.

3.5.1. YOLO Artifact Detector Training

The YOLO artifact detector is trained on images generated by the Stage 1 model using semi-automatic labeling. The labeling pipeline comprises four steps: (1) Region restriction to the lower 25% of each image, where shading artifacts predominantly occur; (2) K-means clustering-based segmentation identifying artifact candidates through brightness-based grouping, selecting the darkest cluster; (3) Morphological post-processing with opening and closing operations to clean region boundaries; (4) Minimum bounding box generation with manual verification for quality assurance.
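The four labeling steps above can be sketched as follows (an illustrative reimplementation, not the authors' pipeline: the two-cluster 1D k-means on brightness, the default 3×3 structuring element, and the exact region fraction are assumptions):

```python
import numpy as np
from scipy import ndimage

def label_artifact_box(img, n_iter=10):
    """Semi-automatic artifact labeling sketch: returns (r0, c0, r1, c1) or None."""
    h = img.shape[0]
    top = int(0.75 * h)
    region = img[top:]                        # step 1: restrict to lower 25%
    # step 2: 1D k-means (k=2) on brightness; keep the darker cluster
    c = np.array([region.min(), region.max()], dtype=float)
    for _ in range(n_iter):
        assign = np.abs(region[..., None] - c).argmin(-1)
        for k in range(2):
            if (assign == k).any():
                c[k] = region[assign == k].mean()
    mask = assign == c.argmin()
    # step 3: morphological opening then closing to clean boundaries
    mask = ndimage.binary_closing(ndimage.binary_opening(mask))
    if not mask.any():
        return None
    # step 4: minimum bounding box, offset back to full-image coordinates
    rows, cols = np.where(mask)
    return (top + int(rows.min()), int(cols.min()),
            top + int(rows.max()), int(cols.max()))
```

In the actual pipeline the resulting boxes are manually verified before YOLO training; the sketch stops at the automatic candidate generation.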
We train YOLO11n [36], a lightweight variant providing fast inference suitable for real-time detection during Stage 2 training. Training uses the Adam optimizer with initial learning rate 10⁻³ decreasing to 10⁻⁴, momentum 0.937, and weight decay 5 × 10⁻⁴. Data augmentation includes HSV transformation, translation, scaling, horizontal flip, and mosaic augmentation. Early stopping terminates training if validation loss shows no improvement for 20 epochs. During Stage 2, all YOLO parameters remain fixed to maintain consistent detection criteria. Figure 3 shows detection results.
To validate the labeling process, the semi-automatic pipeline was assessed for consistency. The K-means clustering and morphological post-processing produce deterministic results given the same input, ensuring reproducibility of the initial candidate generation. Manual verification focuses on boundary refinement and false positive removal, substantially reducing the total annotation effort compared to fully manual labeling. A formal inter-rater reliability study with multiple independent annotators is planned for future work to further establish annotation robustness at scale.
We chose bounding box detection (YOLO) over pixel-level segmentation (e.g., U-Net masks) for artifact region representation. While pixel-level masks offer finer spatial control, bounding boxes provide three practical advantages: (1) annotation effort is substantially lower, requiring only box-level labels rather than pixel-wise masks; (2) YOLO’s single-pass detection is significantly faster during the iterative training loop, avoiding the overhead of per-pixel inference; and (3) the gradient penalty within the bounding box inherently smooths the shading artifact without requiring precise boundaries, as the scatter-induced shading field varies smoothly across the affected region. Preliminary comparison with U-Net segmentation masks on a subset of data showed comparable artifact reduction, supporting the choice of the simpler approach for this application.

3.5.2. Region Correction Loss

Region correction loss addresses shading artifacts by minimizing gradient magnitude within YOLO-detected artifact regions. In CBCT, scatter-induced shading manifests as irregular dark bands with visible edges at artifact boundaries, where intensity transitions sharply between affected and unaffected regions. By suppressing these gradient magnitudes, the loss function smooths artifact regions to blend naturally with surrounding tissue, effectively reducing the visual prominence of shading patterns.
Given binary mask M ∈ {0,1}^{H×W} from YOLO detection (1 for detected regions, 0 otherwise), spatial gradients of generated image ŷ are computed using finite differences:
∇_x ŷ_{i,j} = ŷ_{i,j+1} − ŷ_{i,j}
∇_y ŷ_{i,j} = ŷ_{i+1,j} − ŷ_{i,j}
Gradient magnitude is
|∇ŷ| = √((∇_x ŷ)² + (∇_y ŷ)² + ε)
where ε = 10⁻⁸ ensures numerical stability.
Region correction loss is the batch-averaged masked gradient magnitude:
L_corr = (1/B) Σ_{i=1}^{B} Σ_{h,w} M_i(h,w) · |∇ŷ_i(h,w)|
This fully differentiable formulation enables direct generator optimization through backpropagation. Selective application to detected regions achieves artifact removal while preserving normal structures.
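A PyTorch sketch of this loss (cropping the forward differences to a common (H−1)×(W−1) grid is our implementation choice; the text leaves the boundary handling open):

```python
import torch

def region_correction_loss(y_hat, mask, eps=1e-8):
    """Batch-averaged gradient magnitude inside YOLO-detected regions.

    y_hat: generated images, shape (B, 1, H, W)
    mask:  binary artifact mask, shape (B, 1, H, W)
    """
    # forward finite differences along x (columns) and y (rows)
    dx = y_hat[..., :, 1:] - y_hat[..., :, :-1]
    dy = y_hat[..., 1:, :] - y_hat[..., :-1, :]
    # crop both to (H-1, W-1) so they align on a common grid
    grad_mag = torch.sqrt(dx[..., :-1, :] ** 2 + dy[..., :, :-1] ** 2 + eps)
    m = mask[..., :-1, :-1]
    # sum over masked spatial positions, mean over the batch
    return (m * grad_mag).sum(dim=(-1, -2)).mean()
```

Every operation is differentiable, so gradients flow back into the generator only through pixels inside detected boxes, which is exactly the selective behavior described above.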

3.5.3. Stage 2 Generator Loss

The Stage 2 generator loss is
L G , fine = L G , total + λ corr L corr
where λ corr weights the region correction loss. Under the asymmetric update strategy, this loss is optimized with respect to the forward generator G only, while the backward generator F and both discriminators remain frozen at their Stage 1 optima. This asymmetry is justified because artifacts exist only in the CBCT domain, and freezing F eliminates the gradient conflicts that arise from simultaneous bidirectional updates (see Section 3.3).
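The asymmetric update can be implemented by disabling gradients for the frozen networks; the toy convolution modules below stand in for the actual generators and are purely illustrative:

```python
import torch
import torch.nn as nn

def freeze(module):
    # detach a network from optimization while keeping it in the graph
    for p in module.parameters():
        p.requires_grad_(False)

# toy stand-ins for the Stage 1 forward/backward generators
G = nn.Conv2d(1, 1, 3, padding=1)
F = nn.Conv2d(1, 1, 3, padding=1)
freeze(F)                                        # backward generator fixed
opt = torch.optim.Adam(G.parameters(), lr=5e-5)  # only G is updated

x = torch.randn(1, 1, 8, 8)
loss = (F(G(x)) - x).abs().mean()  # cycle term still flows through frozen F
loss.backward()                    # gradients accumulate in G only
opt.step()
```

Because F still participates in the forward pass, the cycle consistency signal is preserved, but its parameters (and the coupled Jacobian) no longer move between steps.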

3.5.4. Self-Regulating Mechanism

With fixed YOLO parameters during Stage 2, detection confidence naturally decreases as training reduces artifacts. This forms a self-regulating mechanism: strong correction when artifacts are prominent; reduced correction as they diminish, preventing over-smoothing of normal regions.

3.6. Implementation Details

The model was implemented in PyTorch (version 2.11.0) with experiments conducted on an NVIDIA GeForce RTX 4090 GPU (NVIDIA Corporation, Santa Clara, CA, USA).
Stage 1 Training: Adam optimizer [41] with initial learning rate 10⁻⁴ and exponential decay (rate 0.5). A batch size of 2 balances memory efficiency and stability. An image pool of size 50 [42] stores generated images for discriminator training, reducing oscillation. Instance normalization [43] is applied in generators as it preserves content better than batch normalization [44] for style transfer. Weights are initialized from N(0, 0.02).
Stage 2 Fine-tuning: The Stage 1 model initializes Stage 2. A lower learning rate of 5 × 10⁻⁵ preserves learned features while enabling gradual improvement. Based on ablation studies, λ_corr = 1.0 is optimal. YOLO uses confidence threshold 0.1 to detect weak artifacts.
YOLO Training: YOLO11n is trained with Adam optimizer, learning rate from 10⁻³ to 10⁻⁴, momentum 0.937, weight decay 5 × 10⁻⁴, and 3-epoch warmup. Loss weights emphasize localization: box = 7.5, cls = 0.5, dfl = 1.5.

4. Results

4.1. Evaluation Metrics

4.1.1. Artifact Score

Artifact score directly measures shading artifact removal. The pretrained YOLO detector evaluates generated images, calculating mean detection confidence:
Artifact Score = (1/N) Σ_{i=1}^{N} confidence_i
where N is the number of detected artifacts. Lower scores indicate fewer or weaker remaining artifacts. The same YOLO model used in Stage 2 provides consistent evaluation criteria. While using the training detector for evaluation might raise concerns about circular bias, this approach is justified in unpaired settings where ground-truth artifact-free CBCT images are unavailable. The detector serves as a consistent proxy metric, and the relative confidence changes between models provide meaningful comparisons. Furthermore, qualitative results and structure preservation metrics (SSIM, Dice) offer independent validation.
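Computed from a list of detection confidences, the score is simply the following (returning 0 when no artifacts are detected is our convention for the edge case, not stated in the text):

```python
import numpy as np

def artifact_score(confidences):
    # mean YOLO confidence over all detections in the evaluation set;
    # lower values indicate fewer or weaker residual artifacts
    if len(confidences) == 0:
        return 0.0          # assumed convention: no detections -> score 0
    return float(np.mean(confidences))
```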

4.1.2. Structure Preservation Metrics

Structure preservation verifies that Stage 2 fine-tuning does not alter anatomical structures. We compare generated images from baseline (Stage 1) and fine-tuning (Stage 2) models by extracting bone structures through threshold-based binarization (pixels ≥ 0 in the [−1, 1] normalized range).
Dice coefficient measures spatial overlap between bone masks:
Dice = 2|A ∩ B| / (|A| + |B|)
where A and B are baseline and fine-tuning bone masks, respectively.
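A sketch of the bone-mask Dice computation, assuming the threshold-at-0 binarization described above (returning 1 for two empty masks is our edge-case convention):

```python
import numpy as np

def bone_dice(a_img, b_img, thr=0.0):
    # binarize both outputs at the bone threshold in the [-1, 1] range
    A = a_img >= thr          # baseline (Stage 1) bone mask
    B = b_img >= thr          # fine-tuned (Stage 2) bone mask
    inter = np.logical_and(A, B).sum()
    denom = A.sum() + B.sum()
    # Dice = 2|A ∩ B| / (|A| + |B|)
    return 2.0 * inter / denom if denom else 1.0
```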
Structural similarity index (SSIM) [45] measures perceptual similarity by comparing luminance, contrast, and structure. Values approaching 1 indicate preserved structural patterns.

4.1.3. Independent Artifact Metrics

To validate artifact removal independently of the YOLO detector, we employ two physics-based metrics:
Integral nonuniformity (INU): Measures the spatial uniformity of pixel intensities across multiple regions of interest. We place concentric ROIs at the image center and along four cardinal directions at five radial distances. A lower INU indicates better uniformity and fewer shading artifacts.
INU = (CT_max − CT_min) / C
where CT max and CT min are the maximum and minimum of the ROI mean values, respectively, and C is the center ROI mean. This formulation captures the spread of regional intensity variations relative to the central reference, where scatter-induced shading manifests as systematic intensity differences between peripheral and central regions.
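An illustrative NumPy version with an assumed ROI layout (a center ROI plus four cardinal ROIs at a single radial distance; the paper uses five radial distances, and the ROI size and offset below are hypothetical):

```python
import numpy as np

def inu(img, roi=8, offset=64):
    """INU = (CT_max - CT_min) / C over center + 4 cardinal ROI means."""
    h, w = img.shape
    cy, cx = h // 2, w // 2

    def roi_mean(y, x):
        return img[y - roi:y + roi, x - roi:x + roi].mean()

    means = [roi_mean(cy, cx),                 # center reference C
             roi_mean(cy - offset, cx), roi_mean(cy + offset, cx),
             roi_mean(cy, cx - offset), roi_mean(cy, cx + offset)]
    center = means[0]
    # spread of regional means relative to the central reference
    return (max(means) - min(means)) / center
```

A perfectly uniform image yields INU = 0; scatter shading in the lower region raises the spread between peripheral and central ROI means.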
Shading index (SI): Specifically quantifies low-frequency shading artifacts using frequency-domain analysis. A lower SI indicates reduced shading.
SI = √( (1/|Ω|) Σ_{x∈Ω} (I_LF(x) − Ī_LF)² )
where I LF is the low-frequency component obtained by applying a 2D Butterworth low-pass filter (order 2, cutoff at 5% of Nyquist frequency) to the image via FFT, and I ¯ LF is its spatial mean. This RMS deviation of the low-frequency component directly quantifies the magnitude of slowly varying intensity inhomogeneities characteristic of scatter-induced shading.
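A sketch of the SI computation; the Butterworth transfer function and FFT frequency grid are standard, but treating the cutoff as 5% of the 0.5 cycles/pixel Nyquist frequency is our reading of the text:

```python
import numpy as np

def shading_index(img, order=2, cutoff=0.05):
    """RMS deviation of the Butterworth low-pass component of img."""
    h, w = img.shape
    fy = np.fft.fftfreq(h)[:, None]   # cycles/pixel; Nyquist = 0.5
    fx = np.fft.fftfreq(w)[None, :]
    r = np.sqrt(fy ** 2 + fx ** 2)
    # Butterworth low-pass, order 2, cutoff at 5% of Nyquist
    H = 1.0 / (1.0 + (r / (cutoff * 0.5)) ** (2 * order))
    i_lf = np.real(np.fft.ifft2(np.fft.fft2(img) * H))
    # SI = RMS deviation of the low-frequency component
    return float(np.sqrt(np.mean((i_lf - i_lf.mean()) ** 2)))
```

A flat image scores near zero, while a slowly varying shading field passes the filter almost untouched and raises the index.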

4.2. Comparison Models

We compare three configurations to validate the two-stage strategy:
Baseline (Stage 1 only): CycleGAN trained with only adversarial and cycle consistency losses. This performs overall style transformation without explicit artifact removal.
Joint training: CycleGAN trained with GAN losses and region correction loss simultaneously from the first epoch, serving as an ablation to evaluate whether separating training stages is beneficial.
Proposed (two-stage fine-tuning): Complete proposed methodology using baseline initialization with YOLO-based region correction loss (Stage 1 + Stage 2). Sensitivity analysis evaluates five λ_corr values: 0.5, 1.0, 2.0, 3.0, and 5.0.

4.3. Quantitative Results

Table 1 presents artifact removal performance with 95% confidence intervals. The baseline model achieves artifact score 0.516. The proposed fine-tuning method with λ_corr = 1.0 achieves 0.444, representing a 14.0% reduction. All improvements of the proposed method over the baseline are statistically significant (paired t-test, p < 0.001): PSNR (d = 0.039), SSIM (d = 0.149), and INU (d = 0.050). In contrast, joint training actively degrades performance, achieving an artifact score of 0.575 (worse than baseline), with PSNR significantly lower (p < 0.001, d = 0.056) and SSIM significantly lower (p < 0.001, d = 0.157). This confirms that simultaneous optimization of global and local objectives causes destructive gradient conflicts. Furthermore, independent metrics confirm the effectiveness of our method: INU improves from 1.014 [95% CI: 1.003, 1.026] to 0.974 [0.964, 0.985] (3.9% reduction) and SI from 0.208 [0.208, 0.208] to 0.204 [0.204, 0.205] (1.7% reduction), demonstrating that artifact reduction extends beyond the YOLO detector's perception.
Table 2 confirms structure preservation. All fine-tuned models maintain SSIM > 0.96, demonstrating selective artifact targeting without altering normal anatomy. The optimal λ_corr = 1.0 model achieves Dice coefficient 0.849 and SSIM 0.971, indicating high preservation of baseline characteristics alongside artifact reduction.

4.4. Qualitative Results

Figure 4 presents representative transformation results for the optimal λ_corr = 1.0 model. The original CBCT (left) shows distinct lower-region shading artifacts. The baseline result (center) retains these artifacts despite domain transformation. In contrast, the fine-tuning result (right) effectively removes shading artifacts while preserving anatomical structures.
This visual comparison demonstrates that YOLO-based region correction loss selectively acts on artifact regions without distorting normal structures. Bone morphology, soft tissue boundaries, and overall anatomical fidelity are maintained while problematic shading is significantly reduced. As our method uses unpaired data (CycleGAN), pixel-perfect paired MDCT ground-truth for the CBCT inputs is unavailable. Comparing fine-tuned outputs against the baseline (Stage 1) is, therefore, a necessary proxy to ensure that the correction does not deviate from the learned anatomy. Qualitative inspection of corresponding MDCT reference images from the target domain confirms that the proposed outputs exhibit improved HU uniformity consistent with MDCT characteristics, particularly in the lower soft-tissue regions where shading artifacts were most prominent.

4.5. 3D Volumetric Consistency

As our approach operates on 2D axial slices, we evaluated inter-slice consistency to verify that no “flicker” artifacts are introduced in reconstructed 3D volumes. We computed the slice difference (SDiff) metric, defined as the mean absolute difference between adjacent slices along the z-axis:
SDiff = (1/(S−1)) Σ_{i=1}^{S−1} ||ŷ_{i+1} − ŷ_i||_1
where S is the number of slices and y ^ i is the generated image at slice i. The SDiff values for the proposed method were comparable to those of the baseline, indicating that Stage 2 fine-tuning does not introduce measurable inter-slice discontinuity. This confirms that the localized gradient minimization within YOLO-detected regions operates consistently across adjacent slices without disrupting volumetric continuity.
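A minimal sketch, interpreting the per-slice L1 norm as a mean over pixels (an assumption; the text does not state the normalization):

```python
import numpy as np

def sdiff(volume):
    """Mean absolute difference between adjacent axial slices.

    volume: array of generated slices, shape (S, H, W)
    """
    # np.diff along the slice axis gives y_hat[i+1] - y_hat[i] for all i
    return float(np.mean(np.abs(np.diff(volume, axis=0))))
```

Comparable SDiff values before and after Stage 2 indicate that the per-slice correction does not introduce flicker along the z-axis.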

4.6. Frequency-Domain Analysis

To further characterize the nature of artifact reduction, we analyzed the radial power spectrum of the correction signal (difference between Stage 2 and Stage 1 outputs). The correction predominantly affects low-frequency components (<5% of Nyquist frequency), consistent with the physics of scatter-induced shading, which produces slowly varying intensity bias fields. This observation validates our choice of the shading index (SI) metric, which specifically quantifies low-frequency variation via Butterworth low-pass filtering. The correspondence between the frequency profile of the correction and the spectral characteristics of scatter artifacts provides independent evidence that our method targets the correct physical phenomenon rather than introducing arbitrary intensity changes.

4.7. Ablation Study

We conducted ablation studies to validate the key components of our framework.

4.7.1. Training Strategy

We compared the proposed two-stage approach against two alternatives: joint training, where GAN losses and region correction loss are optimized simultaneously from the beginning, and random initialization, where Stage 2 fine-tuning is applied without Stage 1 pretraining. As shown in Table 3, joint training not only fails to improve but actively degrades performance relative to the baseline (artifact score 0.575 vs. 0.516), with PSNR significantly lower ( p < 0.001 , d = 0.056 ) and SSIM significantly lower ( p < 0.001 , d = 0.157 ). This demonstrates that simultaneous optimization of competing global and local objectives causes destructive gradient conflicts that harm overall image quality. Random initialization without Stage 1 pretraining collapses to an over-smoothed output (PSNR 16.104, artifact score 0.000), confirming that the stable domain mapping from Stage 1 is essential. In contrast, the proposed two-stage method achieves the best artifact reduction (0.444) with significant improvements in all metrics ( p < 0.001 ).

4.7.2. Loss Function

We evaluated three region correction loss functions within our two-stage framework: L2 gradient minimization (proposed), total variation (TV), and L1 ROI intensity loss. As shown in Table 3, TV loss collapses the output to an over-smoothed image (PSNR 16.242, artifact score 0.000), consistent with its theoretical tendency to produce piecewise constant solutions in BV(Ω) space, which destroys soft-tissue detail. L1 ROI loss, which directly minimizes pixel intensity in detected regions, fails to reduce artifacts (artifact score 0.554 vs. baseline 0.516) while dramatically increasing INU to 2.541 (vs. baseline 1.014), indicating that direct intensity manipulation introduces new uniformity distortions. The proposed L2 gradient minimization is the only variant that achieves effective artifact reduction (0.444) while also improving INU (0.974) and SI (0.204), consistent with its physics-based motivation: scatter-induced shading produces smoothly varying low-frequency bias fields, and L2 gradient minimization induces isotropic diffusion that selectively suppresses these components without introducing staircasing artifacts or intensity shifts.
Figure 5 shows a qualitative comparison of the effects of different λ_corr values on artifact removal and image quality.
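A minimal NumPy sketch of the masked L2 gradient-magnitude objective follows. The bounding-box mask, forward-difference gradients, and normalization are illustrative assumptions; the actual training computes this loss in a differentiable framework on generator outputs:

```python
import numpy as np

def region_gradient_loss(img, box):
    """Mean squared forward-difference gradient inside a (y0, y1, x0, x1) box."""
    y0, y1, x0, x1 = box
    roi = img[y0:y1, x0:x1]
    gy = np.diff(roi, axis=0)  # vertical forward differences
    gx = np.diff(roi, axis=1)  # horizontal forward differences
    return float((gy ** 2).mean() + (gx ** 2).mean())

# A constant region already has zero loss, so detections on artifact-free
# areas contribute essentially no gradient; a slowly varying shading field
# inside the box produces a positive penalty that drives it toward flatness.
flat = np.full((32, 32), 0.5)
shaded = flat + np.linspace(0.0, 0.3, 32)[:, None]  # low-frequency shading
print(region_gradient_loss(flat, (4, 28, 4, 28)))        # 0.0
print(region_gradient_loss(shaded, (4, 28, 4, 28)) > 0)  # True
```

This also illustrates why genuine false positives are benign: on smooth, artifact-free regions the penalty is already near zero.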

4.7.3. Pareto Front Analysis

Figure 6 presents the Pareto front analysis, plotting SSIM against artifact score for the training strategies and loss functions. The proposed method (λ_corr = 1.0) dominates both the baseline and joint training, achieving simultaneously higher SSIM and lower artifact score. Joint training is Pareto-dominated by the baseline, confirming that joint optimization actively harms both objectives. TV Loss and Random Init achieve an artifact score of 0.000, but at the cost of extreme PSNR degradation (below 16.3), placing them far from the useful operating region. The proposed method lies on the Pareto front, representing the only effective trade-off between structure preservation and artifact reduction.
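Pareto dominance in the (SSIM up, artifact score down) plane can be checked mechanically; a small pure-Python sketch using the SSIM and artifact-score values from Tables 1 and 3:

```python
def dominates(a, b):
    """a dominates b when SSIM is no worse, artifact score is no worse,
    and at least one of the two is strictly better."""
    ssim_a, art_a = a
    ssim_b, art_b = b
    return (ssim_a >= ssim_b and art_a <= art_b
            and (ssim_a > ssim_b or art_a < art_b))

# (SSIM, artifact score) pairs from the reported results.
points = {
    "baseline": (0.572, 0.516),
    "joint":    (0.561, 0.575),
    "proposed": (0.583, 0.444),
}
print(dominates(points["proposed"], points["baseline"]))  # True
print(dominates(points["baseline"], points["joint"]))     # True
```

The checks reproduce the figure's ordering: the proposed method dominates the baseline, which in turn dominates joint training.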

4.8. YOLO Detector Performance

The reliability of the YOLO artifact detector is crucial to our method. On a held-out test set of 936 images, the detector achieves a precision of 0.745, recall of 0.953, and F1-score of 0.837 at a confidence threshold of 0.1. Specifically, of 43 ground-truth artifact annotations, the detector correctly identified 41 (TP), missed 2 (FN), and produced 14 detections that did not match ground-truth labels (FP). The high recall ensures that most artifacts are detected for correction. Notably, many of the nominal false positives correspond to subtle shading artifacts that are present in the image but not captured by the conservative ground-truth labeling; applying gradient minimization to these regions is therefore beneficial rather than harmful. For any remaining genuine false positives on artifact-free regions, the gradient minimization loss has a negligible effect because smooth regions already exhibit low gradient magnitudes. False negatives leave those artifact regions uncorrected, but the Stage 1 base translation is preserved. At the optimal F1 threshold of 0.5, the detector achieves a precision of 0.923 and recall of 0.837 (F1 = 0.878).
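The reported operating point follows directly from these confusion counts; a quick pure-Python check:

```python
def detection_metrics(tp, fp, fn):
    """Standard detection metrics from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Counts at confidence threshold 0.1: 41 TP, 14 FP, 2 FN.
p, r, f1 = detection_metrics(41, 14, 2)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.745 0.953 0.837
```

The asymmetry between precision and recall at this threshold is intentional: the training pipeline favors recall so that correction is applied to as many artifact regions as possible.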
Figure 7 presents the precision–recall curve across confidence thresholds. The curve demonstrates stable performance in the low-threshold range (0.1–0.5), with F1 remaining above 0.83 across this range. This robustness to threshold selection supports the practical applicability of the method. Qualitative examples of true positive, false positive, and false negative detections are provided in the Supplementary Material (Figure S1).

4.9. Computational Cost

The two-stage training introduces additional computational overhead. Joint training for 30 epochs required approximately 17.3 h on a single NVIDIA GPU, providing a reference training cost. Stage 2 fine-tuning adds approximately 4.8 h for 30 epochs on top of the Stage 1 training time. However, inference time remains identical to that of a standard CycleGAN, as the YOLO detector is used only during training. This makes the proposed method suitable for real-time clinical deployment.

5. Discussion

5.1. Effectiveness of Two-Stage Learning

Separating domain mapping (Stage 1) from artifact correction (Stage 2) provides clear advantages, as confirmed by our ablation study. Stage 1 establishes a stable CBCT-MDCT transformation foundation without additional loss complexity. This stable initialization proves crucial for Stage 2, where the model focuses specifically on artifact removal without disrupting learned domain mappings.
Our ablation study (Table 3) provides strong quantitative evidence: joint training, where all losses are optimized simultaneously from the beginning, not only fails to improve but actively degrades performance (artifact score 0.575 vs. baseline 0.516), with PSNR and SSIM both significantly lower (p < 0.001; d = 0.056 and d = 0.157, respectively). Similarly, applying Stage 2 without Stage 1 pretraining (Random Init) collapses the output entirely (PSNR 16.104). In contrast, the proposed two-stage approach achieves substantially better artifact reduction (0.444, p < 0.001). This confirms that the staged approach addresses a fundamental challenge in multi-objective GAN training: competing losses from global transformation and local correction cause destructive gradient conflicts when optimized jointly, and a stable domain mapping foundation is a prerequisite for effective localized correction.

5.2. Benefits of YOLO Integration

Using object detection to identify artifact regions provides explicit spatial guidance unavailable from uniform losses. The fixed YOLO detector ensures consistent detection criteria throughout training, while the differentiable region correction loss enables direct generator optimization.
The self-regulating property—where detection confidence decreases as artifacts diminish—provides natural training dynamics. Early in Stage 2, strong artifacts trigger high-confidence detections and correspondingly strong correction. As training progresses and artifacts decrease, detection confidence falls, automatically reducing correction intensity and preventing over-smoothing.
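This dynamic can be illustrated with a toy loop in which correction strength is scaled by detection confidence. The confidence model (confidence proportional to artifact strength) and the update rule are schematic assumptions, not the paper's training code:

```python
def stage2_schedule(artifact_level, conf_of, correction_rate=0.5, steps=5):
    """Toy Stage 2 loop: correction strength scales with detector
    confidence, so weaker artifacts receive weaker correction."""
    history = []
    for _ in range(steps):
        conf = conf_of(artifact_level)                    # detector confidence
        artifact_level -= correction_rate * conf * artifact_level
        history.append(round(artifact_level, 4))
    return history

# Assumed confidence model: confidence tracks artifact strength, capped at 1.
trace = stage2_schedule(1.0, conf_of=lambda a: min(1.0, a))
print(trace)  # monotonically decreasing, with shrinking step sizes
```

Because each step's correction is modulated by the current confidence, the updates taper off automatically instead of over-smoothing once artifacts have faded.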

5.3. Preservation of Anatomical Structures

High SSIM values (>0.96) across all fine-tuned models confirm that region correction loss acts selectively. By applying gradient minimization only within detected bounding boxes, normal anatomical regions remain unaffected. This addresses a key concern in medical image translation: ensuring that artifact removal does not introduce anatomical distortions that could affect clinical interpretation.
The Dice coefficient results further support this conclusion. Even at the optimal λ_corr = 1.0, bone structure overlap between the baseline and fine-tuning outputs exceeds 0.84, indicating minimal structural alteration despite significant artifact reduction.
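For reference, the Dice coefficient on binarized bone masks is a standard overlap measure; a minimal pure-Python version (the masks below are illustrative, and the binarization threshold is left to the evaluation pipeline):

```python
def dice(mask_a, mask_b):
    """Dice coefficient for two binary masks given as flat 0/1 lists:
    2|A ∩ B| / (|A| + |B|), with the empty/empty case defined as 1."""
    inter = sum(a & b for a, b in zip(mask_a, mask_b))
    total = sum(mask_a) + sum(mask_b)
    return 2 * inter / total if total else 1.0

# Two toy bone masks that agree on 4 of 5 foreground pixels each.
a = [1, 1, 1, 0, 0, 1, 0, 1]
b = [1, 1, 0, 0, 0, 1, 1, 1]
print(round(dice(a, b), 3))  # 0.8
```

A Dice above 0.84 between baseline and fine-tuned outputs therefore means the bone masks remain largely coincident after artifact correction.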

5.4. Clinical Implications

The 14.0% artifact score reduction represents meaningful improvement for clinical applications. Lower-region shading artifacts particularly affect soft tissue visualization in areas critical for radiation therapy planning, such as the posterior neck and shoulder regions. Reducing these artifacts while maintaining CBCT’s low-dose advantage could improve treatment planning accuracy without additional radiation exposure. The consistent improvement in integral nonuniformity (INU) and shading index (SI) further confirms the objective reduction in shading, which translates to better visibility of soft tissue boundaries.

5.5. Limitations and Future Directions

Several limitations warrant discussion. First, the current approach focuses on lower-region shading artifacts with relatively localized positions. Artifacts distributed extensively across images or with unclear boundaries may be difficult to identify using bounding box detection. Future work could explore pixel-level segmentation for more precise artifact delineation.
Second, the semi-automatic labeling pipeline requires manual verification. Fully automatic labeling methods would improve scalability. Third, validation is limited to head and neck CBCT from a single institution with a fixed patient-level split (14 training, 3 testing) rather than k-fold cross-validation. Due to the substantial computational cost of sequentially training two-stage GAN models—where each fold would require full Stage 1 pretraining followed by Stage 2 fine-tuning—comprehensive k-fold cross-validation was infeasible within our computational budget. Nevertheless, the strict patient-level split ensures no data leakage, and the statistically significant results ( p < 0.001 across all metrics) provide confidence in the findings. Broader validation across acquisition conditions, scanner types, and anatomical regions would further establish clinical robustness.
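A patient-level split assigns whole patients to one side so that slices from the same patient never straddle train and test; a minimal sketch in pure Python (patient IDs and the six-slice example are hypothetical, mirroring the 14/3 patient split in spirit only):

```python
def patient_level_split(slice_to_patient, test_patients):
    """Split slice indices so no patient contributes to both sets."""
    train, test = [], []
    for idx, pid in enumerate(slice_to_patient):
        (test if pid in test_patients else train).append(idx)
    return train, test

# Six slices from three patients; patient "P3" is held out entirely.
mapping = ["P1", "P1", "P2", "P2", "P3", "P3"]
train, test = patient_level_split(mapping, {"P3"})
print(train, test)  # [0, 1, 2, 3] [4, 5]
# No patient appears on both sides, so no slice-level leakage is possible.
assert not {mapping[i] for i in train} & {mapping[i] for i in test}
```

Splitting at the slice level instead would leak near-duplicate adjacent slices of the same patient into the test set and inflate the metrics.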
Fourth, our method operates on 2D slices. While we confirmed inter-slice consistency through quantitative analysis (SDiff metric, Section 4.5), 3D volumetric training could potentially leverage spatial continuity for better results, albeit at a higher computational cost.
Finally, the current framework trains YOLO and CycleGAN separately. End-to-end joint optimization of detector and generator could potentially improve overall performance, though this introduces additional training complexity.

6. Conclusions

We proposed a two-stage learning framework with a YOLO-based region correction loss for removing shading artifacts in CBCT-to-MDCT translation. By separating global domain mapping (Stage 1) from localized artifact correction (Stage 2), our method achieves a 14.0% reduction in artifact score while maintaining high structural fidelity (SSIM > 0.96). Ablation studies confirm that this two-stage strategy significantly outperforms joint training (p < 0.001).
The integration of object detection into generative model training enables explicit targeting of artifact regions, providing a promising direction for medical image translation tasks requiring localized corrections. The self-regulating mechanism where YOLO confidence decreases as artifacts diminish provides natural training dynamics without manual intervention.
The proposed methodology extends beyond CBCT-MDCT conversion to various inter-modality translation problems in medical imaging. By improving image quality while preserving CBCT’s low radiation dose advantage, this work contributes to more accurate diagnosis and treatment planning.
Future work will explore pixel-level segmentation for precise artifact delineation, extension to other artifact types (metal, motion), validation across diverse clinical settings, and end-to-end joint optimization of detection and generation components.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/math14071223/s1. Figure S1: YOLO detector failure analysis. Representative examples of true positive (TP), false positive (FP), and false negative (FN) detections based on ground-truth matching. TPs show correctly identified shading artifacts. FPs represent detections not matched to ground-truth labels; notably, many of these regions contain subtle shading artifacts not captured by the conservative ground-truth annotation, indicating that the detector generalizes beyond the labeled examples. Applying gradient minimization to these regions is therefore beneficial. FNs represent faint artifacts that fall below the detection threshold, which remain uncorrected but retain the Stage 1 translation quality. Out of 43 ground-truth annotations on 936 test images, the detector achieved 41 TP, 14 FP, and 2 FN; Figure S2: Cycle consistency error during Stage 2. The cycle reconstruction error (‖F(G(x)) − x‖₁) was monitored throughout the full training process. The vertical dashed line marks the transition from Stage 1 to Stage 2 fine-tuning. The mean cycle loss remains nearly identical between Stage 1 (0.1215 ± 0.0193) and Stage 2 (0.1221 ± 0.0191), confirming that the asymmetric update of only the forward generator G during Stage 2 does not degrade the bidirectional mapping quality; Figure S3: Hyperparameter sensitivity analysis. (a) YOLO confidence threshold sensitivity: the detector achieves F1 > 0.83 across the 0.1–0.5 range, demonstrating robustness to threshold selection. The best F1 of 0.878 is achieved at θ = 0.5, while the operating point at θ = 0.1 provides high recall (0.953) at the cost of moderate precision (0.745). (b) Artifact loss weight (λ) sensitivity during Stage 2: cycle consistency loss remains stable across all tested λ values (0.5–5.0), confirming that the targeted artifact correction does not destabilize the base translation regardless of the correction strength.

Author Contributions

Conceptualization, Y.L. and H.-C.P.; methodology, Y.L.; software, Y.L.; validation, Y.L. and H.-C.P.; formal analysis, Y.L.; investigation, Y.L.; resources, H.-C.P.; data curation, Y.L.; writing—original draft preparation, Y.L.; writing—review and editing, H.-C.P.; visualization, Y.L.; supervision, H.-C.P.; project administration, H.-C.P. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (RS-2024-00338504).

Institutional Review Board Statement

The study was conducted in accordance with the Declaration of Helsinki. Ethical review was not required as this study used retrospectively collected, anonymized imaging data.

Informed Consent Statement

Patient consent was waived as this study used retrospectively collected, fully anonymized imaging data with no identifiable patient information.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to patient privacy restrictions and institutional data sharing policies.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
CBCT: Cone-Beam Computed Tomography
MDCT: Multi-Detector Computed Tomography
CT: Computed Tomography
GAN: Generative Adversarial Network
CycleGAN: Cycle-Consistent Generative Adversarial Network
CNN: Convolutional Neural Network
YOLO: You Only Look Once
SSIM: Structural Similarity Index
NMS: Non-Maximum Suppression
LSGAN: Least Squares GAN

Figure 1. Comparison of CBCT-to-MDCT translation results. (Left) Successful artifact removal where lower-region shading is effectively eliminated. (Right) Residual shading artifacts remain despite domain transformation, demonstrating the limitation of uniform loss application.
Figure 2. Architecture overview of the proposed method. (Top) Stage 1: CycleGAN training for stable CBCT-MDCT domain mapping using adversarial and cycle consistency losses. (Middle) YOLO artifact detector training with semi-automatic labeling on Stage 1 outputs. (Bottom) Stage 2: Fine-tuning with region correction loss that targets detected artifact regions for explicit removal.
Figure 3. YOLO artifact detection results on Stage 1 model outputs. Green boxes indicate ground-truth annotations; red boxes indicate YOLO predictions. The detector successfully identifies lower-region shading artifacts with high precision.
Figure 4. Qualitative comparison of translation results. (Left) Original CBCT with visible lower-region shading artifacts. (Center) Baseline CycleGAN output with residual artifacts. (Right) Proposed fine-tuning (λ = 1.0) with effective artifact removal while preserving anatomical detail.
Figure 5. Weight sensitivity analysis for region correction loss. From left to right: original CBCT, baseline, and fine-tuning results with λ_corr = 0.5, 1.0, 2.0, 3.0, 5.0. Optimal removal at λ_corr = 1.0. Excessive weights (λ ≥ 3.0) disrupt the balance with the cycle consistency loss, potentially degrading quality.
Figure 6. Pareto front analysis: SSIM vs. artifact score for different training strategies and loss functions. The proposed method (λ_corr = 1.0) achieves the best trade-off. Joint training is dominated by the baseline. TV Loss and Random Init achieve zero artifact score through image destruction.
Figure 7. YOLO artifact detector performance. (a) Precision–recall curve across confidence thresholds, showing the training threshold (conf = 0.1, green diamond) and optimal F1 point (conf = 0.5, red circle). (b) F1 score vs. confidence threshold, demonstrating stable performance in the 0.1–0.5 range.
Table 1. Quantitative comparison of artifact removal and image quality. Values are mean ± SD [95% CI]. Best results in bold. Statistical significance tested against baseline (paired t-test). Effect sizes reported as Cohen's d. Upward arrows indicate higher values are better, and downward arrows indicate lower values are better.

| Model | Artifact Score (↓) | PSNR (↑) | SSIM (↑) | INU (↓) | SI (↓) |
|---|---|---|---|---|---|
| Baseline | 0.516 | 18.222 ± 1.520 [18.202, 18.243] | 0.572 ± 0.072 [0.571, 0.573] | 1.014 ± 0.837 [1.003, 1.026] | 0.208 ± 0.021 [0.208, 0.208] |
| Joint | 0.575 | 18.136 ± 1.565 [18.115, 18.157] | 0.561 ± 0.073 [0.560, 0.562] | 1.079 ± 1.050 [1.065, 1.093] | 0.217 ± 0.021 [0.217, 0.218] |
| **Proposed** | **0.444** | **18.282 ± 1.520** [18.261, 18.302] | **0.583 ± 0.072** [0.582, 0.584] | **0.974 ± 0.781** [0.964, 0.985] | **0.204 ± 0.021** [0.204, 0.205] |

Note: All pairwise comparisons vs. baseline are statistically significant (paired t-test, p < 0.001). Effect sizes (Cohen's d): joint training: PSNR d = 0.056, SSIM d = 0.157, INU d = 0.068, SI d = 0.429; proposed: PSNR d = 0.039, SSIM d = 0.149, INU d = 0.050, SI d = 0.190.
Table 2. Structure preservation: Fine-tuning vs. baseline comparison. Higher values indicate better preservation of bone structures. Metrics computed on binarized bone masks.

| Model | Dice (↑) | SSIM (↑) |
|---|---|---|
| Fine-tuning (λ = 1.0) | 0.849 ± 0.078 | 0.971 ± 0.012 |
| Fine-tuning (λ = 3.0) | 0.788 ± 0.091 | 0.962 ± 0.015 |
| Fine-tuning (λ = 5.0) | 0.818 ± 0.095 | 0.968 ± 0.015 |
Table 3. Comprehensive ablation study. All models evaluated under identical conditions. Statistical significance tested against the baseline (paired t-test). Artifact score of 0.000 indicates image destruction (PSNR ≤ 16.3) rather than effective artifact removal. Upward arrows indicate higher values are better, and downward arrows indicate lower values are better.

| Model | Artifact (↓) | PSNR (↑) | SSIM (↑) | INU (↓) | PSNR p-Value |
|---|---|---|---|---|---|
| *Training Strategy* | | | | | |
| Baseline (Stage 1 Only) | 0.516 | 18.222 | 0.572 | 1.014 | – |
| Joint Training | 0.575 | 18.136 | 0.561 | 1.079 | <0.001 (↓) |
| Random Init (no Stage 1) | 0.000 | 16.104 | 0.693 | 0.023 | <0.001 (↓) |
| Proposed (Two-Stage) | 0.444 | 18.282 | 0.583 | 0.974 | <0.001 (↑) |
| *Loss Function (all Two-Stage)* | | | | | |
| TV Loss | 0.000 | 16.242 | 0.671 | 0.071 | <0.001 (↓) |
| L1 ROI Loss | 0.554 | 18.113 | 0.594 | 2.541 | <0.001 (↓) |
| L2 Gradient (Proposed) | 0.444 | 18.282 | 0.583 | 0.974 | <0.001 (↑) |