Article

CycleGAN-Based Translation of Digital Camera Images into Confocal-like Representations for Paper Fiber Imaging: Quantitative and Grad-CAM Analysis

by Naoki Kamiya 1,*,†, Kosuke Ashino 2,*,†, Yuto Hosokawa 1 and Koji Shibazaki 3

1 School of Information Science and Technology, Aichi Prefectural University, Nagakute 480-1198, Japan
2 Graduate School of Information Science and Technology, Aichi Prefectural University, Nagakute 480-1198, Japan
3 Faculty of Fine Arts, Aichi University of the Arts, Nagakute 480-1194, Japan
* Authors to whom correspondence should be addressed.
† These authors contributed equally to this work.
Appl. Sci. 2026, 16(2), 814; https://doi.org/10.3390/app16020814
Submission received: 24 December 2025 / Revised: 9 January 2026 / Accepted: 12 January 2026 / Published: 13 January 2026

Abstract

The structural analysis of paper fibers is vital for the noninvasive classification and conservation of traditional handmade paper in cultural heritage. Although digital still cameras (DSCs) offer a low-cost and noninvasive imaging solution, their inferior image quality compared to white-light confocal microscopy (WCM) limits their effectiveness in fiber classification. To address this modality gap, we propose an unpaired image-to-image translation approach using cycle-consistent adversarial networks (CycleGANs). Our study targets a multifiber setting involving kozo, mitsumata, and gampi, using publicly available domain-specific datasets. Generated WCM-style images were quantitatively evaluated using peak signal-to-noise ratio, structural similarity index measure, mean absolute error, and Fréchet inception distance, achieving 8.24 dB, 0.28, 172.50, and 197.39, respectively. Classification performance was tested using EfficientNet-B0 and Inception-ResNet-v2, with F1-scores reaching 94.66% and 98.61%, respectively, approaching the performance of real WCM images (99.50% and 98.86%) and surpassing previous results obtained directly from DSC inputs (80.76% and 84.19%). Furthermore, Grad-CAM visualization confirmed that the translated images retained class-discriminative features aligned with those of the actual WCM inputs. Thus, the proposed CycleGAN-based image conversion effectively bridges the modality gap, enabling DSC images to approximate WCM characteristics and support high-accuracy paper fiber classification, offering a practical alternative for noninvasive material analysis.


1. Introduction

The structural analysis of paper fibers plays a crucial role in the classification, conservation, and authentication of traditional handmade paper used in cultural heritage artifacts. Standardized methods for fiber identification, such as ISO 9184-1 [1] and JIS P 8120 [2], typically rely on chemical maceration and optical microscopy, which are inherently destructive. Although accurate, these procedures are unsuitable for fragile or irreplaceable materials. Consequently, digital image-based, noninvasive alternatives have garnered increasing interest [3]. For instance, artificial intelligence (AI)-driven approaches have been increasingly applied in the field of cultural heritage for tasks such as classification, restoration, and analysis [3,4,5]. Specific applications include dating ancient manuscripts [3] and estimating fiber composition in historical paper from macro images captured with consumer-grade digital still cameras (DSCs) [6]. These studies extracted morphological features using convolutional neural networks (CNNs) and showed promising performance in classifying fiber species. However, the image quality of DSCs is generally inferior to that of white-light confocal microscopes (WCMs), particularly in terms of resolution and contrast, which limits classification performance. Therefore, bridging the domain gap between low-cost, noninvasive imaging and high-precision microscopic imaging has emerged as a novel and significant direction, one that has received limited attention thus far despite its practical potential in the field of cultural heritage preservation.
Recent advances in deep learning, particularly Generative Adversarial Networks (GANs) [7], have enabled image-to-image translation across different imaging modalities. Early approaches, such as Pix2Pix [8], required paired training data, which is often difficult to obtain in practical scenarios. To address this limitation, cycle-consistent adversarial networks (CycleGANs) [9] were proposed for translating unpaired datasets, showing notable success in tasks such as transferring styles between photographs and paintings. In medical and biological imaging, CycleGANs and similar deep learning frameworks have been extensively applied to generate synthetic images across modalities. Applications range from translating between magnetic resonance imaging and computed tomography [10,11] to virtual staining, which transforms label-free microscopy images into equivalent chemically stained versions [12,13]. Furthermore, cross-modality super-resolution approaches have been proposed to bridge the gap between low- and high-resolution microscopy [14]. These applications demonstrate the potential of CycleGAN to improve the quality of low-resolution or low-contrast images in the absence of paired training data. In our context, although DSC and WCM images are not paired on a per-sample basis, they depict the same types of paper fibers and share underlying structural patterns, making CycleGAN [9] a suitable approach for learning domain mappings. We employ CycleGAN as the core architecture because of its established stability and effectiveness in unpaired translation tasks, making it an ideal baseline to validate the proposed modality-bridging concept. Our prior study, consumer-grade optics to microscopic imaging conversion (COMIC) [15], demonstrated the visual feasibility of such unpaired translation on kozo fibers using basic similarity metrics. However, it was limited to a single fiber type and did not evaluate downstream tasks. Previous research has also shown that classification models trained on WCM images outperform those using DSC inputs due to the superior resolution and contrast of microscope imagery [16].
To address this modality gap and build upon our previous work, this study expands the scope to a multiclass setting involving kozo, mitsumata, and gampi fibers, using domain-specific datasets [17,18]. We propose a CycleGAN-based method to convert DSC macro-images into WCM-like images and systematically investigate whether these synthetic images can support fiber species classification. The generated images are evaluated using conventional metrics such as peak signal-to-noise ratio (PSNR), structural similarity index measure (SSIM) [19], and mean absolute error (MAE), while further incorporating the Fréchet inception distance (FID) [20] to assess perceptual realism. Additionally, we apply gradient-weighted class activation mapping (Grad-CAM) [21] to explore classification decision regions and enhance interpretability.
Through this multifaceted evaluation, we aim to demonstrate that AI-driven modality conversion enables practical, non-invasive fiber analysis using accessible imaging tools. By comparing the classification accuracy of original DSC images, CycleGAN-converted images, and true WCM images, we provide empirical evidence that deep learning-based image translation can serve as a viable surrogate for high-resolution microscopy in heritage science.

2. Materials and Methods

2.1. Image Details

This study targets three types of traditional Japanese paper fibers: kozo, mitsumata, and gampi. Two publicly available image datasets were acquired using different modalities: consumer DSC macro photography and WCM. The DSC dataset, KoMiGaPf2025_DSC [17], was obtained using an Olympus Tough TG-5 (Olympus, Tokyo, Japan) in microscope mode under controlled illumination. The WCM dataset, KoMiGaPf2024_WCM [18], was captured using an OPTELICS HYBRID+ system (Lasertec Corporation, Yokohama, Japan) with a 20× objective lens.
To prepare inputs for DSC-to-WCM image translation and subsequent evaluation, all images were converted into non-overlapping patches. Each DSC image (4000 × 3000 pixels) was center-cropped to 698 × 704 pixels and then divided into four patches of 352 × 349 pixels. Each patch was resized to 1024 × 1024 pixels using nearest-neighbor interpolation to match the WCM resolution. This interpolation method was selected to preserve the high-frequency pixel information of the raw DSC inputs without introducing smoothing artifacts that could hinder the learning of texture mapping. For the WCM images (2048 × 2048 pixels), each image was split into four nonoverlapping patches of 1024 × 1024 pixels. No overlap was used at any step.
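For concreteness, the patch-preparation step can be summarized by the following sketch. It is a minimal illustration only: file handling and function names are assumptions rather than the released dataset tooling, and PIL is used here simply as a convenient image library.

```python
from PIL import Image

def dsc_to_patches(path):
    """Center-crop a 4000 x 3000 DSC frame to 698 x 704 pixels, split it into
    four non-overlapping quadrants, and upsample each quadrant to 1024 x 1024
    with nearest-neighbor interpolation (no smoothing of the raw pixel texture)."""
    img = Image.open(path)
    w, h = img.size
    cw, ch = 698, 704                                    # crop width and height
    left, top = (w - cw) // 2, (h - ch) // 2
    crop = img.crop((left, top, left + cw, top + ch))
    patches = []
    for ty in (0, ch // 2):                              # two rows of quadrants
        for tx in (0, cw // 2):                          # two columns of quadrants
            quad = crop.crop((tx, ty, tx + cw // 2, ty + ch // 2))  # one quadrant
            patches.append(quad.resize((1024, 1024), Image.NEAREST))
    return patches

def wcm_to_patches(path):
    """Split a 2048 x 2048 WCM image into four non-overlapping 1024 x 1024 patches."""
    img = Image.open(path)
    return [img.crop((tx, ty, tx + 1024, ty + 1024))
            for ty in (0, 1024) for tx in (0, 1024)]
```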
For the experiments, we used 900 original images per modality, resulting in 3600 patches per modality (1200 patches per fiber type) for a total of 7200 patches. A three-fold cross-validation strategy was adopted, where data splitting was performed at the original image capture level. Since the fiber arrangement in handmade paper is random and chaotic, non-overlapping images captured from different regions of the sheet exhibit structurally distinct patterns. Therefore, we ensured that patches in the training and test sets originated from different capture regions to effectively prevent data leakage. It is important to note that while the DSC and WCM images capture the same fiber species, they were acquired from different physical regions of the samples. Thus, there is no pixel-wise spatial correspondence (pairing) between the DSC and WCM datasets.
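A capture-level split of this kind can be expressed, for example, with a grouped K-fold splitter. The sketch below is illustrative only; the variable names (patch_paths, labels, capture_ids) are assumptions, and the paper does not specify the exact splitting utility used.

```python
from sklearn.model_selection import GroupKFold

# Each patch carries the ID of the photograph it was cut from, so all four
# patches from one capture always land on the same side of the split.
gkf = GroupKFold(n_splits=3)
for fold, (train_idx, test_idx) in enumerate(
        gkf.split(patch_paths, labels, groups=capture_ids)):
    train_patches = [patch_paths[i] for i in train_idx]
    test_patches = [patch_paths[i] for i in test_idx]
    # ... train and evaluate within this fold ...
```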

2.2. CycleGAN-Based DSC-to-WCM Conversion

To perform unpaired image-to-image translation between DSC macro and WCM images, we adopted the CycleGAN architecture [9]. This framework is particularly well-suited for our application, as the two domains (DSC and WCM) do not have pixel-wise paired correspondence but share underlying structural features of the same fiber types.
CycleGAN comprises two generators, $G_{D \to W}: \mathrm{DSC} \rightarrow \mathrm{WCM}$ and $G_{W \to D}: \mathrm{WCM} \rightarrow \mathrm{DSC}$, which learn bidirectional mappings between the two domains, and two discriminators, $D_W$ and $D_D$, which evaluate whether the generated images are indistinguishable from real samples in their respective domains. The model is trained to minimize a composite loss combining an adversarial term and a cycle-consistency term:
$$\mathcal{L} = \mathcal{L}_{\mathrm{GAN}} + \lambda \, \mathcal{L}_{\mathrm{cyc}},$$
where $\mathcal{L}_{\mathrm{GAN}}$ is the adversarial loss that encourages the generated images to be indistinguishable from real images in the target domain, and $\mathcal{L}_{\mathrm{cyc}}$ is the cycle-consistency loss that ensures the round-trip translation ($\mathrm{DSC} \rightarrow \mathrm{WCM} \rightarrow \mathrm{DSC}$) reconstructs the original content structure [9]. The hyperparameter $\lambda$ balances the contribution of the two terms and was set to 10, following the original implementation.
Training was conducted using the Adam optimizer [22] with a learning rate of $2 \times 10^{-4}$ and $\beta_1 = 0.5$, over 200 epochs. The learning rate was held constant for the first 100 epochs and then linearly decayed to zero during the remaining 100 epochs. We used ResNet-based generators [23] and 70 × 70 PatchGAN discriminators, as originally proposed by Zhu et al. [9].
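A compact PyTorch sketch of the generator objective described above is given below. The module names G_dw, G_wd, D_w, and D_d are placeholders for $G_{D \to W}$, $G_{W \to D}$, $D_W$, and $D_D$ (assumed to be defined elsewhere), and the least-squares adversarial formulation follows Table 1; discriminator updates and the linear learning-rate decay are omitted for brevity.

```python
import torch
import torch.nn as nn

lsgan = nn.MSELoss()   # least-squares adversarial loss
l1 = nn.L1Loss()       # cycle-consistency loss
lam = 10.0             # weight of the cycle term (lambda)

def generator_loss(real_dsc, real_wcm):
    fake_wcm = G_dw(real_dsc)        # DSC -> WCM
    fake_dsc = G_wd(real_wcm)        # WCM -> DSC
    # Adversarial terms: each generator tries to fool its discriminator.
    adv = (lsgan(D_w(fake_wcm), torch.ones_like(D_w(fake_wcm))) +
           lsgan(D_d(fake_dsc), torch.ones_like(D_d(fake_dsc))))
    # Cycle-consistency terms: round trips should reconstruct the inputs.
    cyc = l1(G_wd(fake_wcm), real_dsc) + l1(G_dw(fake_dsc), real_wcm)
    return adv + lam * cyc

opt_g = torch.optim.Adam(list(G_dw.parameters()) + list(G_wd.parameters()),
                         lr=2e-4, betas=(0.5, 0.999))
```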
After training, generator $G_{D \to W}$ was used to translate all DSC patches into synthetic WCM-style images. These translated outputs were then subjected to quantitative and classification-based evaluations to assess their utility in downstream fiber analysis.
The detailed implementation parameters, including hardware specifications and hyperparameters, are summarized in Table 1.

2.3. Experiments

2.3.1. Quantitative Image Quality and Similarity Evaluation

We employed PSNR, SSIM, MAE, and FID to assess the visual and statistical quality of the converted images. These metrics were selected to provide both pixel- and distribution-level comparisons between domains.
First, PSNR, SSIM, and MAE were calculated between the original DSC patches and their CycleGAN-converted counterparts (DSC–converted), capturing direct similarity in brightness, structure, and overall reconstruction error. For baseline comparison, the same metrics were also computed between the unpaired DSC and WCM patches (DSC–WCM), which represent the raw cross-domain gap without any translation. These comparisons allow us to evaluate whether the converted images are closer to their source inputs (i.e., DSC) or to the target WCM domain.
In addition, we computed the FID [20] between the converted and real WCM images to quantify the statistical similarity of their feature distributions in a deep feature space. FID is widely used in generative image evaluation because of its sensitivity to visual fidelity and diversity, and it captures domain alignment at a more semantic level than pixel-based metrics. The FID was computed using the standard Inception-v3 network pretrained on ImageNet. All image patches were resized to 299 × 299 pixels and normalized to the range $[-1, 1]$ to match the network’s input specifications. For reference, the FID between the original DSC and WCM patches was computed to establish the initial domain discrepancy prior to translation.
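The metric computation can be reproduced roughly as follows. The exact libraries used are not stated in the paper, so scikit-image and torchmetrics are shown here as illustrative stand-ins under that assumption.

```python
import numpy as np
import torch
from skimage.metrics import peak_signal_noise_ratio, structural_similarity
from torchmetrics.image.fid import FrechetInceptionDistance

def pixel_metrics(img_a, img_b):
    """PSNR, SSIM, and MAE between two uint8 RGB arrays of identical size."""
    psnr = peak_signal_noise_ratio(img_a, img_b, data_range=255)
    ssim = structural_similarity(img_a, img_b, channel_axis=-1, data_range=255)
    mae = np.mean(np.abs(img_a.astype(np.float64) - img_b.astype(np.float64)))
    return psnr, ssim, mae

def fid_score(converted, wcm):
    """FID between two uint8 image tensors shaped (N, 3, 299, 299); the metric
    uses an ImageNet-pretrained Inception-v3 feature extractor internally."""
    fid = FrechetInceptionDistance(feature=2048)
    fid.update(wcm, real=True)
    fid.update(converted, real=False)
    return fid.compute().item()
```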
Together, these metrics provide a comprehensive assessment of how well the CycleGAN-generated images approximate the WCM domain, both locally and globally.

2.3.2. Classification-Based Evaluation Using WCM-Trained Models

To evaluate the downstream utility of the converted images, we conducted a fiber-type classification experiment using CNNs. Specifically, we adopted EfficientNet-B0 [24] and Inception-ResNet-v2 [25] as classification backbones, based on their superior performance in a prior comparative study on DSC-based paper fiber classification across multiple architectures [6], and their verified effectiveness on WCM images in subsequent work [16]. Each model was trained exclusively on WCM patches under a three-class setting (Kozo, Mitsumata, and Gampi). The evaluation was conducted using two test sets in each fold of the three-fold cross-validation: (i) the converted patches generated from the DSC inputs using CycleGAN and (ii) the original WCM patches from the same test fold. All splits were defined at the original image level to avoid information leakage between training and evaluation.
Classification performance was assessed using standard metrics (accuracy, precision, recall, and F1-score) calculated per class and as macro-averages. These metrics quantify how well the models trained on high-resolution WCM data generalize to converted images originating from consumer-grade sources.
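The fold-wise evaluation loop can be sketched as follows, assuming a WCM-trained model and a DataLoader over the converted (or original WCM) test patches; the helper names below are illustrative rather than the authors' actual code.

```python
import torch
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

@torch.no_grad()
def evaluate(model, loader, device="cuda"):
    """Apply a WCM-trained classifier to a test set and return macro-averaged metrics."""
    model.eval()
    y_true, y_pred = [], []
    for images, labels in loader:                       # images: (B, 3, H, W)
        logits = model(images.to(device))
        y_pred += logits.argmax(dim=1).cpu().tolist()
        y_true += labels.tolist()
    acc = accuracy_score(y_true, y_pred)
    prec, rec, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="macro")
    return acc, prec, rec, f1
```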

2.3.3. Grad-CAM-Based Attention Analysis

To investigate whether the conversion process shifts model attention toward WCM-like discriminative regions, we conducted an interpretability analysis using Grad-CAM [21]. Grad-CAM heatmaps were generated for each input type (DSC, WCM, and converted images) using both EfficientNet-B0 and Inception-ResNet-v2 classifiers.
Let $\tilde{g}_i$ denote the raw Grad-CAM score at pixel $i$. Each map is normalized to the range [0, 100] via min–max normalization, as follows:
$$g_i = \frac{\tilde{g}_i - \tilde{g}_{\min}}{\tilde{g}_{\max} - \tilde{g}_{\min}} \times 100,$$
where $\tilde{g}_{\min}$ and $\tilde{g}_{\max}$ are the minimum and maximum Grad-CAM values within the image, respectively.
To quantify the concentration of model attention, we computed the proportion of pixels whose normalized Grad-CAM scores exceed each threshold $T \in \{0, 10, \ldots, 100\}$, as follows:
$$R(T) = \frac{\left|\{\, i \mid g_i \geq T \,\}\right|}{N} \times 100,$$
where $N$ is the total number of pixels in the image and $|\cdot|$ denotes set cardinality. Higher $R(T)$ values at larger thresholds indicate more focused activation.
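A minimal sketch of this normalization and thresholding is shown below. The raw map cam is assumed to come from any standard Grad-CAM implementation, upsampled to the input resolution.

```python
import numpy as np

def attention_profile(cam):
    """Min-max normalize a raw Grad-CAM map to [0, 100] and return R(T)
    (in percent) for thresholds T = 0, 10, ..., 100."""
    g = (cam - cam.min()) / (cam.max() - cam.min() + 1e-12) * 100.0
    thresholds = np.arange(0, 101, 10)
    r = np.array([(g >= t).mean() * 100.0 for t in thresholds])
    return thresholds, r
```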
We compared $R(T)$ profiles across the three input domains to assess changes in attention sharpness and alignment. Furthermore, for the converted images, the results were stratified into correctly classified and misclassified subsets to explore the relationship between attention localization and classification success.

3. Results

This section presents the quantitative and qualitative results of the proposed DSC-to-WCM image translation framework. We begin by evaluating the fidelity and structural similarity of the converted images using PSNR, SSIM, MAE, and FID. The results are summarized in Table 2. We then provide representative examples from both high- and low-SSIM cases to qualitatively examine the visual consistency between the original and converted images (Figure 1). Finally, we evaluate the effectiveness of the converted images in fiber classification by applying two WCM-trained classifiers to the converted and original WCM images (Table 3).
Table 2 summarizes the quantitative evaluation of the translated images using PSNR, SSIM, MAE, and FID. Table 2a lists the pixel-wise similarity between the original DSC images and their converted counterparts. The DSC–converted pairs achieved an average PSNR of 8.24 dB and an average SSIM of 0.28, compared to 6.81 dB and 0.17, respectively, for the DSC–WCM pairs. These results indicate that the converted images retain more structural features of the original DSC inputs than the true WCM images. However, the MAE for the DSC–converted pairs (172.50) was slightly higher than that of the DSC–WCM pairs (157.82), suggesting the introduction of noticeable pixel-wise changes through the translation process. Table 2b lists the FID, which assesses the perceptual similarity to the WCM domain. The FID between the converted and WCM images was 197.39, which was substantially lower than the FID of 381.39 measured between the original DSC and WCM images, indicating that the translated images more closely matched the perceptual distribution of the true WCM images.
Figure 1 shows visual examples of representative patches from the best- and worst-case SSIM scenarios. High-SSIM samples display a strong alignment of structural features such as fiber contours, whereas low-SSIM samples often exhibit deformation or incomplete translation artifacts. These examples help to characterize the range of translation quality achieved by the model.
To evaluate the effectiveness of the converted images in downstream fiber classification, we applied two CNN classifiers, EfficientNet-B0 and Inception-ResNet-v2, which were previously identified as high-performing architectures for patch-based paper-fiber classification tasks [6]. Both classifiers were trained exclusively on WCM patches labeled with Kozo, Mitsumata, and Gampi. The trained models were then used to classify (i) the converted images produced from DSC inputs by our CycleGAN model (Proposed), and (ii) the original WCM test patches (Reference) in each cross-validation fold. Table 3 summarizes the classification performance based on the accuracy, precision, recall, and F1-score for each fiber class and their macro-averaged values.
As shown in Table 3a, EfficientNet-B0 achieved macro-averaged accuracy, precision, recall, and F1-score of 96.46%, 94.69%, 94.69%, and 94.66%, respectively, when classifying the converted images. These values are close to those obtained for the original WCM images (99.67%, 99.50%, 99.50%, and 99.50%, respectively), indicating that the converted images preserved most of the discriminative features necessary for classification. Similarly, Table 3b shows that Inception-ResNet-v2 attained an F1-score of 98.61% on the converted images, which is nearly identical to the 98.86% achieved on the original WCM images. The corresponding accuracy gap was only 0.17%, further supporting the viability of the conversion approach.
These findings compare favorably with our earlier benchmark study [16], which reported lower accuracies when classifying the original DSC images directly: 87.28% for EfficientNet-B0 and 89.46% for Inception-ResNet-v2. By contrast, classifying the converted images improved the accuracy by 9.18 and 9.61 percentage points, respectively. These results demonstrate that the CycleGAN-based conversion effectively transforms consumer-grade macro images into representations that are more aligned with microscope-level imagery, thereby enhancing classification performance without requiring destructive imaging techniques.

4. Discussion

This study explored the feasibility of using CycleGAN to translate macro images acquired with a consumer-grade DSC into WCM-like images, with the goal of supporting noninvasive paper-fiber classification. The experimental results consistently demonstrate that the proposed conversion framework narrows the modality gap between the DSC and WCM images, as evidenced by the improved distributional similarity (lower FID scores), classification performance approaching that of the original WCM images, and more WCM-like attention patterns observed in Grad-CAM analyses. These findings suggest that the generated images not only resemble WCM images visually but also retain class-discriminative features that are critical for CNN-based classification.

4.1. Image-Level Evaluation: Fidelity, Structure, and Domain Alignment

To assess the effectiveness of the proposed CycleGAN-based conversion, we first examined the fidelity and structural similarity of the translated images compared with both their DSC origins and the target WCM domain. As shown in Table 2a, the converted images exhibited higher PSNR (8.24 dB) and SSIM (0.28) than the original DSC–WCM pairs (PSNR: 6.81 dB, SSIM: 0.17), indicating that the transformation preserved substantial structural information from the input while introducing appropriate modality-specific changes. However, the MAE for the DSC–converted pairs (172.50) was slightly higher than that for the DSC–WCM pairs (157.82), suggesting that the conversion introduces meaningful pixel-level changes. This aligns with the objective of shifting the domain appearance rather than performing an identity mapping.
From a distributional perspective, Table 2b shows that the converted images achieved a substantially smaller FID to the WCM images (197.39) than the original DSC images (381.39). This result suggests that the proposed translation method effectively narrows the domain gap in the perceptual feature space, moving the translated images closer to the WCM distribution.
The qualitative examples in Figure 1 further illustrate this trend. High-SSIM cases show strong preservation of fiber structures, such as contours and textures, closely resembling their DSC inputs. Even in low-SSIM examples, essential fiber-related features remain visible, suggesting that the model avoids overfitting to domain-specific textures while maintaining the relevant structural content. This visual consistency across a range of similarity levels reinforces the quantitative findings, highlighting the ability of the model to perform domain translation while preserving critical information.

4.2. Task-Level Evaluation: Classification and Attention Alignment

To evaluate the practical utility of the translated images, we assessed their impact on downstream fiber-type classification tasks using EfficientNet-B0 and Inception-ResNet-v2. Both models were trained exclusively on the WCM patches and then evaluated on the converted images and original WCM test sets. As summarized in Table 3, EfficientNet-B0 achieved an average F1-score of 94.66% on the converted images, compared to 99.50% on the true WCM images. Inception-ResNet-v2 yielded an even higher F1-score of 98.61% on the converted images, nearly matching its performance on the WCM inputs (98.86%). These results demonstrate that the conversion enables classification performance approaching that of true WCM imagery, despite the source being low-cost DSC data.
Importantly, compared with prior works that directly classified DSC images using the same network architectures, yielding F1-scores of 80.76% (EfficientNet-B0) and 84.19% (Inception-ResNet-v2) [16], the proposed conversion offers substantial gains (+13.90 and +14.42 percentage points, respectively). This performance improvement confirms that the CycleGAN-translated images are more suitable for deep learning-based fiber classification than raw DSC inputs, thereby supporting the hypothesis that domain translation improves task relevance.
To further understand the mechanism underlying the observed classification improvements, we examined the classifier attention patterns using Grad-CAM visualizations. As shown in Figure 2a, the R(T) curves, representing the proportion of high-activation pixels, reveal that across all samples, the converted images exhibit spatial attention distributions that are more closely aligned with the WCM images than with the original DSC images. This suggests that the proposed conversion not only enhances visual similarity, but also encourages attention behavior that is more consistent with the WCM-trained models.
A closer look at the correctly classified cases (Figure 2b) shows that the R(T) curves of the converted images closely track those of the WCM images, whereas the DSC inputs yield clearly different patterns. This implies that successful conversion supports feature localization consistent with WCM-based decision-making. In contrast, misclassified cases (Figure 2c) showed larger discrepancies between the converted and WCM attention profiles, suggesting that insufficient or inconsistent translation may lead to classifier confusion. The observed discrepancy in attention patterns for misclassified cases suggests that the translation process occasionally failed to reconstruct the specific discriminative features (e.g., clear fiber edges or textures) expected by the WCM-trained classifiers. When these key features are absent or insufficiently translated, the classifier fails to focus on the salient regions, leading to both attention misalignment and incorrect predictions.
These trends are qualitatively supported by the Grad-CAM heatmaps in Figure 3. Although DSC images often show scattered or modality-specific activations, the converted images tend to redirect attention toward fiber-relevant structures, better resembling the heatmaps generated from WCM inputs. This attention-level alignment provides complementary evidence that the proposed CycleGAN-based conversion not only improves visual fidelity but also enhances the semantic interpretability and decision reliability of downstream models.

4.3. Limitations and Future Directions

Although the results of this study are promising, there are some limitations. First, the Grad-CAM-based attention analysis inherently depends on the architecture and training of the classifier. Although the observed alignment of attention maps between the converted and WCM images provides useful interpretability, it does not guarantee structural fidelity in a physical sense. We also acknowledge that the Grad-CAM-based metric is an indirect measure and can be sensitive to hyperparameters such as heatmap normalization and smoothing. Furthermore, similar attention patterns do not necessarily guarantee that the captured features are morphologically identical to those in real microscopy. Future studies should incorporate additional explanatory techniques to reinforce the robustness of attention-level comparisons.
Second, this study focused on three types of traditional Japanese paper fibers and used imaging data collected under controlled conditions. While this allows for controlled experimentation, broader validation is needed to assess generalizability to other fiber types, mixed-fiber papers, or images acquired under different hardware and lighting conditions. Expanding the dataset to include these variations would support the applicability of this method to more diverse heritage contexts. It should be noted that the high classification accuracy observed (~99% for WCM) reflects the controlled acquisition environment of the current dataset. While this validates that the proposed translation effectively recovers class-discriminative features in a controlled setting, further validation on datasets with more variable lighting and diverse conditions is necessary to assess generalization in uncontrolled environments.
Nevertheless, the present results establish that when the fiber type is known and high-quality WCM images are available for training, unpaired image translation from the DSC to WCM domains is feasible and beneficial. This provides a practical pathway for noninvasive fiber analysis in cases where conventional microscopy is not available, offering a promising tool for digital heritage science.

5. Conclusions

This study proposed a CycleGAN-based framework to translate consumer-grade DSC images into WCM-like images for noninvasive paper-fiber classification. Using publicly available datasets of traditional Japanese paper fibers, we demonstrated that the translated images significantly reduce the modality gap in terms of both pixel-level and distributional similarity. When applied to downstream classification using WCM-trained CNN models, the converted images achieved accuracy and F1-scores comparable to those obtained from the true WCM images. Furthermore, Grad-CAM-based attention analysis revealed that the attention patterns of the classifiers on the converted images resemble those on the WCM images, offering further evidence of modality alignment at the feature level.
Importantly, while this study assumes prior knowledge of fiber species for training and evaluation, the demonstrated improvements in visual fidelity and classification suggest that the proposed conversion framework could serve as a foundation for broader applications. In particular, generating WCM-like representations from noninvasive DSC images may support downstream tasks, such as fiber-type inference in unknown samples, specifically when combined with existing methods for coarse-grained fiber estimation. Therefore, this work offers a promising direction for extending high-quality analysis to fragile or irreplaceable materials where destructive testing is not feasible. For practical application, this methodology offers a significant advantage for smaller institutions, such as local museums and libraries, which often lack the financial resources for high-end microscopic equipment. By enabling high-accuracy analysis using widely available consumer-grade cameras, our approach contributes to the democratization of digital archiving and non-invasive preservation of cultural heritage.

Author Contributions

Conceptualization, N.K. and K.A.; methodology, N.K. and K.A.; software, K.A.; validation, N.K., K.A. and Y.H.; formal analysis, K.A. and Y.H.; investigation, K.A. and Y.H.; resources, N.K.; data curation, K.A. and Y.H.; writing—original draft preparation, K.A. and N.K.; writing—review and editing, N.K.; visualization, K.A. and Y.H.; supervision, N.K. and K.S.; project administration, N.K. and K.S.; funding acquisition, N.K. and K.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by JSPS KAKENHI, grant number JP22H00003, and a research grant from the Naito Science & Engineering Foundation.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets used in this study—KoMiGaPf2025_DSC (Zenodo, DOI: 10.5281/zenodo.15274772) and KoMiGaPf2024_WCM (Zenodo, DOI: 10.5281/zenodo.14550144)—are archived on Zenodo with restricted access. While the data are not openly downloadable, access may be granted upon reasonable request for academic and collaborative research purposes. Interested researchers are encouraged to contact the authors via the contact information provided in the corresponding Zenodo records. Access will be granted on a case-by-case basis in accordance with data usage policies. This controlled distribution ensures both the integrity of the datasets and the promotion of further research in the field of paper fiber analysis.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
DSC: Digital Still Camera
WCM: White-light Confocal Microscope
FID: Fréchet Inception Distance
SSIM: Structural Similarity Index Measure
MAE: Mean Absolute Error
PSNR: Peak Signal-to-Noise Ratio
CNN: Convolutional Neural Network
Grad-CAM: Gradient-weighted Class Activation Mapping

References

  1. ISO 9184-1; Paper, Board and Pulps—Fibre Furnish Analysis. International Organization for Standardization: Geneva, Switzerland, 2022.
  2. JIS P8120; Paper, Board and Pulps—Fibre Furnish Analysis. Japanese Industrial Standards Committee: Tokyo, Japan, 1994.
  3. Popović, M.; Dhali, M.A.; Schomaker, L.; van der Plicht, J.; Lund Rasmussen, K.; La Nasa, J.; Degano, I.; Perla Colombini, M.; Tigchelaar, E. Dating ancient manuscripts using radiocarbon and AI-based writing style analysis. PLoS ONE 2025, 20, e0323185. [Google Scholar] [CrossRef] [PubMed]
  4. Fiorucci, M.; Khoroshiltseva, M.; Pontil, M.; Traviglia, A.; Del Bue, A.; James, S. Machine Learning for Cultural Heritage: A Survey. Pattern Recognit. Lett. 2020, 133, 102–108. [Google Scholar] [CrossRef]
  5. Grimaude, S.; Remenyi, R.; Gabor, A. Deep learning for historical document analysis and recognition: A survey. J. Imaging 2022, 8, 280. [Google Scholar]
  6. Kamiya, N.; Ashino, K.; Sakai, Y.; Zhou, Y.; Ohyanagi, Y.; Shibazaki, K. Non-destructive estimation of paper fiber using macro images: A comparative evaluation of network architectures and patch sizes for patch-based classification. NDT 2024, 2, 487–503. [Google Scholar] [CrossRef]
  7. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Nets. In Advances in Neural Information Processing Systems; Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N.D., Weinberger, K.Q., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2014; Volume 27. [Google Scholar]
  8. Isola, P.; Zhu, J.-Y.; Zhou, T.; Efros, A.A. Image-to-Image Translation with Conditional Adversarial Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1125–1134. [Google Scholar]
  9. Zhu, J.Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2223–2232. [Google Scholar] [CrossRef]
  10. Wolterink, J.M.; Dinkla, A.M.; Savenije, M.H.F.; Seevinck, P.R.; van den Berg, C.A.T.; Išgum, I. Deep MR to CT Synthesis Using Unpaired Data. In Simulation and Synthesis in Medical Imaging; Tsaftaris, S.A., Gooya, A., Frangi, A.F., Prince, J.L., Eds.; Springer: Cham, Switzerland, 2017; pp. 14–23. [Google Scholar] [CrossRef]
  11. Gonzalez, Y.; Shen, C.; Jung, H.; Nguyen, D.; Jiang, S.B.; Albuquerque, K.; Jia, X. Semi-automatic sigmoid colon segmentation in CT for radiation therapy treatment planning via an iterative 2.5-D deep learning approach. Med. Image Anal. 2020, 68, 101896. [Google Scholar] [CrossRef]
  12. Rivenson, Y.; Wang, H.; Wei, Z.; de Haan, K.; Zhang, Y.; Wu, Y.; Günaydın, H.; Zuckerman, J.E.; Chong, T.; Sisk, A.E.; et al. Virtual histological staining of unlabelled tissue-autofluorescence images via deep learning. Nat. Biomed. Eng. 2019, 3, 466–477. [Google Scholar] [CrossRef]
  13. Christiansen, E.M.; Yang, S.J.; Ando, D.M.; Javaherian, A.; Skibinski, G.; Lipnick, S.; Mount, E.; O’Neil, A.; Shah, K.; Lee, A.K.; et al. In silico labeling: Predicting fluorescent labels in unlabeled images. Cell 2018, 173, 792–803. [Google Scholar] [CrossRef]
  14. Wang, H.; Rivenson, Y.; Jin, Y.; Wei, Z.; Gao, R.; Günaydın, H.; Bentolila, L.A.; Kural, C.; Ozcan, A. Deep learning enables cross-modality super-resolution in fluorescence microscopy. Nat. Methods 2019, 16, 103–110. [Google Scholar] [CrossRef] [PubMed]
  15. Ashino, K.; Zhou, Y.; Ohyanagi, Y.; Shibazaki, K.; Kamiya, N. COMIC: Consumer-Grade Optics to Microscopic Imaging Conversion for Non-Destructive Paper Fiber Analysis. In Proceedings of the Computers and the Humanities Symposium 2024, Sendai, Japan, 7–8 December 2024; Volume 2024, pp. 139–144. [Google Scholar]
  16. Hosokawa, Y.; Ashino, K.; Shibazaki, K.; Kamiya, N. Comparison of Digital Camera and Confocal Microscope Images for Fiber Type Estimation in Traditional Japanese Paper Using Patch-Based Classification. In Proceedings of the Media Computing Conference, St. Petersburg, Russia, 15–17 October 2025; pp. 1–2. [Google Scholar]
  17. Inayoshi, T.; Ashino, K.; Kamiya, N. KoMiGaPf2025_DSC: Digital Still Camera (DSC) Macro Image Dataset of Kozo, Mitsumata, and Gampi Fibers, Captured with Olympus Tough TG-5; Zenodo: Geneva, Switzerland, 2025. [Google Scholar] [CrossRef]
  18. Kamiya, N.; Ashino, K. KoMiGaPf2024_WCM: White Light Confocal Microscope Image Dataset of Kozo, Mitsumata, and Gampi Fibers (20×), Captured with Optelics Hybrid; Zenodo: Geneva, Switzerland, 2024. [Google Scholar] [CrossRef]
  19. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef] [PubMed]
  20. Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; Hochreiter, S. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. arXiv 2018, arXiv:1706.08500. [Google Scholar] [CrossRef]
  21. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual explanations from deep networks via gradient-based localization. Int. J. Comput. Vis. 2019, 128, 336–359. [Google Scholar] [CrossRef]
  22. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  23. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  24. Tan, M.; Le, Q. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In Proceedings of the International Conference on Machine Learning (ICML), Long Beach, CA, USA, 9–15 June 2019; pp. 6105–6114. [Google Scholar]
  25. Szegedy, C.; Ioffe, S.; Vanhoucke, V.; Alemi, A.A. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. In Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017. [Google Scholar]
Figure 1. Examples of patches with the highest and lowest structural similarity index measure (SSIM) values between the DSC and converted images. The top row shows the original DSC images and the bottom row shows the corresponding converted images. The left three columns present, for each paper type (Kozo, Mitsumata, Gampi), the patch with the highest SSIM, whereas the right three columns present the patch with the lowest SSIM. The SSIM value for each patch is indicated in parentheses. Even in the lowest-SSIM cases, the fiber-like structures observed in the original DSC images are visually preserved in the converted images.
Figure 2. Comparison of the proportion $R(T)$ of pixels with normalized Grad-CAM values $\geq T$ among the converted, WCM, and DSC images. (a) All converted images. (b) Correctly classified cases. (c) Misclassified cases. Left: EfficientNet-B0; right: Inception-ResNet-v2.
Figure 3. Representative Grad-CAM visualizations for the DSC, WCM, and converted images used for the analysis in Figure 2. For each paper type (Kozo, Mitsumata, and Gampi), Grad-CAM heatmaps are shown for the two classifiers (left: EfficientNet-B0; right: Inception-ResNet-v2).
Table 1. Implementation details and hyperparameters for CycleGAN training.
Category | Parameter | Description
Hardware | CPU | AMD Ryzen Threadripper PRO 5965WX
Hardware | GPU | 3 × NVIDIA RTX A6000 (48 GB)
Hardware | RAM | 256 GB (32 GB × 8, 3200 MHz)
Software | OS | Ubuntu 22.04 LTS
Software | Framework | PyTorch 1.9.0 (NGC Container: PyTorch 21.06)
Software | CUDA/Python | CUDA 11.3.1 / Python 3.8
Model Config | Generator | ResNet-based (9 blocks)
Model Config | Input Patch Size | 1024 × 1024 pixels
Model Config | Batch Size | 1
Training | Total Epochs | 200 (100 constant + 100 linear decay)
Training | Learning Rate | $2 \times 10^{-4}$ (Adam optimizer, $\beta_1 = 0.5$)
Training | Loss Function | Least Squares GAN Loss + Cycle Consistency Loss
Performance | Inference Time | approx. 1.99 s/image (per patch)
Table 2. Quantitative evaluation of the proposed digital still camera (DSC)-to-white-light confocal microscopy (WCM) conversion using four metrics. Note: The metrics for ‘DSC vs. WCM’ are provided solely as a reference to quantify the initial domain gap between the source and target modalities, given the unpaired nature of the datasets. (a) Image-quality evaluation based on PSNR, SSIM, and MAE. The left block shows the metrics between the original DSC and converted images (proposed method). For comparison, the right block shows the metrics between the DSC and WCM images (reference). (b) Similarity evaluation based on FID. The left block shows the FID between the converted and WCM images (proposed method), whereas the right block shows the FID between the original DSC and WCM images (reference).
(a) DSC vs. converted images (Proposed)
Metric | Kozo | Mitsumata | Gampi | Average
PSNR (↑) | 7.39 | 7.76 | 9.56 | 8.24
SSIM (↑) | 0.32 | 0.30 | 0.23 | 0.28
MAE (↓) | 182.44 | 181.65 | 153.41 | 172.50

(a) DSC vs. WCM images (Reference)
Metric | Kozo | Mitsumata | Gampi | Average
PSNR (↑) | 6.14 | 6.58 | 7.71 | 6.81
SSIM (↑) | 0.20 | 0.18 | 0.12 | 0.17
MAE (↓) | 165.85 | 165.21 | 142.39 | 157.82

(b) Converted vs. WCM images (Proposed)
Metric | Kozo | Mitsumata | Gampi | Average
FID (↓) | 212.39 | 244.30 | 135.48 | 197.39

(b) DSC vs. WCM images (Reference)
Metric | Kozo | Mitsumata | Gampi | Average
FID (↓) | 381.03 | 399.14 | 364.00 | 381.39

DSC: digital still camera images; WCM: white-light confocal microscope images. Higher PSNR and SSIM and lower MAE and FID indicate better agreement.
Table 3. Classification performance on the converted and original WCM images. (a) EfficientNet-B0. The left block shows the results for the converted images (proposed), and the right block shows the results for the original WCM images (reference). (b) Inception-ResNet-v2. The left block shows the results for the converted images (proposed), and the right block shows the results for the original WCM images (reference).
(a) EfficientNet-B0, converted images (Proposed)
Metric | Kozo | Mitsumata | Gampi | Average
Accuracy | 96.06 | 95.11 | 98.22 | 96.46
Precision | 94.91 | 94.21 | 94.94 | 94.69
Recall | 93.17 | 90.92 | 100.00 | 94.69
F1-score | 94.03 | 92.54 | 97.40 | 94.66

(a) EfficientNet-B0, original WCM images (Reference)
Metric | Kozo | Mitsumata | Gampi | Average
Accuracy | 99.78 | 99.50 | 99.72 | 99.67
Precision | 99.34 | 100.00 | 99.17 | 99.50
Recall | 100.00 | 98.50 | 100.00 | 99.50
F1-score | 99.67 | 99.24 | 99.59 | 99.50

(b) Inception-ResNet-v2, converted images (Proposed)
Metric | Kozo | Mitsumata | Gampi | Average
Accuracy | 98.64 | 98.64 | 99.94 | 99.07
Precision | 98.81 | 97.21 | 99.83 | 98.62
Recall | 97.08 | 98.75 | 100.00 | 98.61
F1-score | 97.94 | 97.97 | 99.92 | 98.61

(b) Inception-ResNet-v2, original WCM images (Reference)
Metric | Kozo | Mitsumata | Gampi | Average
Accuracy | 99.56 | 98.86 | 99.31 | 99.24
Precision | 98.76 | 99.91 | 97.76 | 98.88
Recall | 99.92 | 96.67 | 100.00 | 98.86
F1-score | 99.34 | 98.26 | 98.97 | 98.86

DSC: digital still camera images; WCM: white-light confocal microscope images. All metrics are reported in percent (%).