Article

Hyperspectral Imaging Combined with Deep Learning for the Detection of Mold Diseases on Paper Cultural Relics

1 School of Economics and Management, Chongqing Jiaotong University, Chongqing 400074, China
2 Department of Tourism and Service Management, Chongqing University of Education, Chongqing 400065, China
3 Chongqing Key Laboratory of Optical Fiber Sensing and Photoelectric Detection, Chongqing University of Technology, Chongqing 400054, China
* Authors to whom correspondence should be addressed.
Heritage 2025, 8(12), 495; https://doi.org/10.3390/heritage8120495
Submission received: 11 September 2025 / Revised: 15 November 2025 / Accepted: 21 November 2025 / Published: 23 November 2025
(This article belongs to the Section Cultural Heritage)

Abstract

Mold contamination is one of the critical factors threatening the safety of paper-based cultural relics. Current detection methods rely predominantly on offline analysis and face challenges such as low efficiency and limited real-time accuracy, which hinder their effectiveness in meeting the technical requirements of preventive conservation of cultural heritage. This study proposes a fungal segmentation framework for deterioration detection in paper-based artifacts that integrates hyperspectral imaging (HSI) with deep learning. First, the HSI data were reduced to three dimensions via Locally Linear Embedding (LLE) manifold learning to construct 3D pseudo-color imagery, effectively preserving discriminative spectral features between fungal colonies and substrates while eliminating spectral redundancy. Second, a hybrid architecture synergizing Feature Pyramid Networks (FPN) with Vision Transformers was developed for semantic segmentation, leveraging the CNN's local feature extraction and the Transformer's global context modeling to enhance fungal signature saliency and suppress background interference. Third, a dynamic sparse attention mechanism is introduced that optimizes attention allocation through the TOP-K algorithm, screening regions richer in mold information both spatially and spectrally and thereby improving segmentation accuracy. Semantic segmentation experiments were conducted on papers infected with different molds. The results demonstrate that the proposed method achieves excellent performance in mold segmentation, providing technical support for mold detection and preventive conservation of cultural relics.

1. Introduction

Due to the limitations of current monitoring and preservation technologies, the deterioration of cultural relics caused by biological factors (such as pests and fungi) has become increasingly prominent, posing a major threat to the long-term preservation, display, and utilization of cultural relics. This issue is particularly severe for paper-based artifacts. Fungi, characterized by their widespread distribution, rapid growth, resilience, and ease of airborne transmission [1], can proliferate rapidly and erode the surface of artifacts under suitable conditions. The secretion of metabolites and enzymes by molds directly degrades the fibrous structure of paper artifacts, significantly diminishing their historical and artistic worth.
Currently, traditional fungal detection methods primarily rely on visual inspection and microscopic examination. However, these methods are inadequate for early detection of fungal infection and accurate assessment of fungal distribution and infection extent. Although offline detection techniques such as fluorescence spectroscopy [2] and scanning electron microscopy [3] can effectively detect fungi, the equipment required for these techniques is complex, precluding online automated detection. Since the 1990s, imaging spectroscopy technology has been introduced into the field of cultural relic preservation. Although traditional spectral analysis [4,5] can be used for the detection of microbial pests such as fungi, it has significant shortcomings in terms of precise identification, detection of minute areas, and quantitative analysis.
With the evolution of spectral technology from multispectral to hyperspectral imaging, the quality of spectral information obtained has significantly improved. Hyperspectral technology has been widely applied in food safety inspection. For example, Farrugia J et al. [6] used hyperspectral imaging combined with PCA to detect fungi in milk, while Long Y et al. [7] employed Raman hyperspectral imaging technology for the detection and identification of fungi in corn. In the field of cultural relic inspection, hyperspectral imaging technology has also demonstrated tremendous potential. Lan R et al. utilized hyperspectral imaging technology to detect contaminants on the surface of cultural relics and analyze their materials, providing in situ analysis and assessment methods for the aging degree of paper and textile fibers in painting and calligraphy artifacts [8]. Wang S et al. successfully achieved texture, image, and color restoration of artifacts by acquiring and processing hyperspectral images of paintings contaminated with fungi [9].
Analyzing the distribution and species of mold from hyperspectral data of cultural relics is similar to object recognition in other fields. Currently, object recognition primarily employs deep learning algorithms based on CNNs, including target detection methods represented by the YOLO series [10] and semantic segmentation algorithms represented by the U-Net series [11]. Semantic segmentation, being pixel-level detection, delivers finer recognition precision. To enhance the accuracy of semantic segmentation, many methods adopt skip connections [12] to integrate low-level and high-level features. Simultaneously, attention mechanisms [13] are introduced during the network’s feature extraction stage to focus on information-rich regions and improve segmentation performance. However, the surface of mold on paper artifacts often contains indistinguishable impurities, ink marks, and other elements that share similar characteristics with the mold of interest. Due to the inherent local nature of CNN-based methods, they have limitations in capturing global information from images, which can easily lead to misidentification and omission of mold. The Transformer model has demonstrated exceptional performance in natural language processing tasks, and its variant, the Vision Transformer (ViT) [14,15], has achieved state-of-the-art results in image classification tasks, fully demonstrating the significant potential of Transformer in computer vision tasks. Consequently, numerous ViT-based methods have been applied to the semantic segmentation of objects. The SETR model, which employs ViT as its encoder, validated the capability of a pure Transformer in semantic segmentation, though its segmentation accuracy still requires improvement. This occurs because the self-attention mechanism in Transformer inherently functions as a low-pass filter, which tends to suppress high-frequency information. When targets are indistinct, this may lead to the loss of mold-related information.
In the field of semantic segmentation, several studies [16,17] have attempted to integrate the strengths of CNNs and Transformers into a unified model by incorporating the local properties of convolution into Transformers. This integration aims to simultaneously capture local contextual information and build global dependencies. Some approaches introduce the local characteristics of convolution [18] into Transformers to enhance their ability to capture local information, thereby retaining more detailed features. However, under conditions of complex background interference and subtle mold characteristics, these methods still face significant challenges.
Based on the above analysis and research, this manuscript designs a mold segmentation network for paper artifacts that combines a CNN and Transformer. The overall structure of the network consists of two major parts: an encoder and a decoder. Inspired by HiFormer [19], this paper employs a Feature Pyramid Network (FPN) [20,21] to generate multi-level features at different resolutions, which are then unidirectionally fed into a multi-stage Transformer for fusion. Multi-level features of different resolutions contain different emphases of mold information, and they serve as inputs to various stages of the Transformer in the encoder, enabling the entire encoder to extract more effective feature information. Furthermore, this paper introduces the TOP-K algorithm into the self-attention mechanism and proposes a dynamic sparse attention module. This module performs variance sorting on the channels and spatial dimensions of the feature maps to select channels and regions with richer mold information, while reducing the interference of background noise, which is beneficial for mold segmentation.
In summary, the main contributions of this paper are as follows:
(1) A mold segmentation network that integrates multi-level features is proposed. Multi-level CNN features generated by the FPN are fused in the Transformer layers, combining the ability of CNNs to capture detailed features with the ability of Transformers to suppress background noise, thereby enhancing mold feature information.
(2) A dynamic selection sparse self-attention module is proposed, which selects important mold features from both spatial and channel dimensions and suppresses redundant background noise.

2. Related Work

2.1. Dimension Reduction

In the field of data dimensionality reduction, traditional methods such as Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) struggle to reveal the intrinsic manifold structure of hyperspectral data, thereby limiting their practical application performance. In recent years, manifold learning algorithms have emerged as effective approaches for discovering low-dimensional manifold structures within high-dimensional data. Among these, Locally Linear Embedding (LLE) [22] operates under the assumption that data exhibits linear characteristics locally and preserves this local structure during dimensionality reduction; Laplacian Eigenmaps [23] maintain local neighborhood relationships by constructing adjacency graphs; while Isometric Mapping (ISOMAP) [24] preserves the global manifold structure using geodesic distances.
In comparison, methods like t-SNE [25], despite offering excellent visualization, often fail to maintain global structures, making them unsuitable as feature inputs for downstream tasks. This study ultimately selects the Locally Linear Embedding algorithm due to its theoretical simplicity, strong reproducibility, and the close alignment of its core concept with our focus on the “local similarity” hypothesis. Moreover, as a classical algorithm, it provides a solid theoretical foundation for our research.

2.2. Vision Transformer

The Transformer was originally designed for natural language processing tasks, and its outstanding performance has demonstrated its tremendous potential. Unlike the local convolution operations of Convolutional Neural Networks (CNNs), the built-in self-attention mechanism of the Transformer can effectively capture global dependencies, as shown in Figure 1.
In recent years, Transformers have been widely applied to vision tasks. Dosovitskiy et al. [26] proposed the Vision Transformer (ViT), which, for the first time, treats an image as a sequence of position-encoded patch vectors fed into a Transformer for image classification tasks. Given the excellent performance of ViT, Ramachandran et al. [27] introduced a stand-alone self-attention network that replaces convolutions with an attention mechanism. However, due to the lack of translation equivariance and locality in the Transformer architecture, it performs poorly on smaller datasets. To enhance the generalization capability of Transformers, Liu et al. [28] proposed a novel local attention mechanism that establishes global relationships by shifting windows along spatial dimensions and dynamically attends to neighboring elements, thereby improving the ability to extract local information. Nevertheless, the self-attention mechanism inherently exhibits low-pass filtering characteristics, making it difficult to recognize small or inconspicuous molds.

2.3. Segmentation Network Based on the Combination of ViT and CNN

CNNs excel at extracting local features, but they are prone to introducing noise when extracting feature information in convolutional layers. On the other hand, the inherent self-attention mechanism of Transformers can model global relationships and effectively suppress background noise interference. Therefore, some approaches attempt to combine CNNs with Transformers to equip the network with the ability to model long-range dependencies and focus on local information simultaneously.
Wang et al. [29] proposed a multi-granularity hybrid CNN-ViT model based on external tokens and cross-attention for PolSAR image classification: a CNN-based external feature extractor captures local features from the PolSAR image, while the ViT focuses on global features. A segmentation method fusing CNN and ViT [30] was proposed to address the difficulty of segmenting tumor regions in breast ultrasound images, whose shapes and sizes vary widely, as well as the limitations of convolutional neural networks (CNNs) in modeling long-range dependency and spatial correlation and the huge amount of data required by the Vision Transformer (ViT). In that method, global and local detail features were extracted using a modified Swin Transformer module and a CNN encoder module based on deformable convolution, respectively.
Although these methods inherit the local characteristics of CNNs while maintaining the global advantages of Transformers, they still face challenges from inconspicuous features and complex backgrounds. Unlike the aforementioned methods, this paper employs a CNN feature pyramid to generate feature maps at different scales, preserving more local information through a top-down and bottom-up pyramid structure. Then, feature maps of different resolutions are used as inputs to different Transformer layers, achieving the fusion of CNN features within the Transformer. Furthermore, this paper adopts a dynamic sparse attention module to select regions richer in mold information both spatially and across channels, while suppressing background interference information.

3. Methodology

3.1. Dimensionality Reduction with Locally Linear Embedding (LLE)

The basic assumption of the LLE algorithm is that data is linear within small local regions, meaning that a data point can be linearly reconstructed by its nearest neighbors. For example, given a sample $x_1$, we use the K-nearest-neighbors approach to find the $k$ closest samples $x_2, \ldots, x_{k+1}$ in its original high-dimensional neighborhood. It is assumed that $x_1$ can be linearly represented by $x_2, \ldots, x_{k+1}$, as shown in Equation (1):
$$x_1 = q_{12} x_2 + q_{13} x_3 + \cdots + q_{1,k+1} x_{k+1} \tag{1}$$
where $q_{12}, q_{13}, \ldots, q_{1,k+1}$ are the weight coefficients. After dimensionality reduction using LLE, it is hoped that the projection of $x_1$ in the low-dimensional space, denoted $y_1$, will maintain the same linear relationship with the projections $y_2, y_3, \ldots, y_{k+1}$ of $x_2, x_3, \ldots, x_{k+1}$, as shown in Equation (2):
$$y_1 = q_{12} y_2 + q_{13} y_3 + \cdots + q_{1,k+1} y_{k+1} \tag{2}$$
Each original high-dimensional sample thus has a corresponding low-dimensional projection, meaning that the LLE algorithm reduces the vector dimension while preserving local relationships unchanged. The main parameters of the LLE algorithm are the target dimension and the neighborhood size; the specific principles and computational process of the algorithm can be found in reference [22]. In this paper, we use LLE to reduce the high-dimensional spectra to a three-dimensional space, resulting in the vectors shown in Equation (3):
$$U = \mathrm{LLE}(W, d, k) = \{\bar{w}_1, \bar{w}_2, \ldots, \bar{w}_C\} \in \mathbb{R}^{C \times 3} \tag{3}$$
where $\mathrm{LLE}(W, d, k)$ denotes the LLE dimensionality reduction operation, $d$ is the target dimension (3 in this case), and $k$ is the neighborhood size. Each row of $U$ is the three-dimensional projection of one high-dimensional spectral vector.
Figure 2 shows the spectral curves of Aspergillus niger, Penicillium citrinum, ink, and paper background extracted from the spectral data of cultural relics. Each spectral curve contains 300-dimensional spectral features. As shown in Figure 2a, there are significant differences in spectral features among different targets and backgrounds, but the high-dimensional features are complex and contain considerable redundancy. After LLE dimensionality reduction, the resulting three-dimensional spectral features are shown in Figure 2b: in the three-dimensional space, the differences in spectral features among different targets and backgrounds are effectively highlighted. Additionally, these three-dimensional features can be used as the red, green, and blue color channels to construct false-color spectral images of paper cultural relics.
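As an illustrative sketch of this preprocessing step (not the authors' code), the per-pixel spectra can be reduced to three dimensions with scikit-learn's LLE implementation and rendered as a false-color image; the cube shape and neighborhood size below are assumptions:

```python
# Sketch: reduce per-pixel spectra (300 bands) to 3 dimensions with LLE and
# map the embedding axes to the R, G, B channels of a false-color image.
# The neighborhood size (n_neighbors) and cube shape are illustrative.
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding

def spectra_to_false_color(cube, n_neighbors=10):
    """cube: hyperspectral data (H, W, bands) -> (H, W, 3) uint8 false-color image."""
    h, w, bands = cube.shape
    spectra = cube.reshape(-1, bands)                 # one spectrum per pixel
    lle = LocallyLinearEmbedding(n_neighbors=n_neighbors, n_components=3)
    low = lle.fit_transform(spectra)                  # (H*W, 3) embedding
    # Min-max normalize each embedding axis to [0, 255] for an RGB rendering.
    low -= low.min(axis=0)
    low /= low.max(axis=0) + 1e-12
    return (low.reshape(h, w, 3) * 255).astype(np.uint8)

# Tiny synthetic cube just to exercise the pipeline.
rng = np.random.default_rng(0)
img = spectra_to_false_color(rng.random((8, 8, 300)))
print(img.shape, img.dtype)
```

On real data, the three embedding axes play the role of the red, green, and blue channels described above.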

3.2. Mold Semantic Segmentation

3.2.1. Overall Network Architecture

Inspired by the respective outstanding performances of CNNs and Transformers in semantic segmentation tasks, we have designed a mold detection semantic segmentation network that unifies the ideal attributes of both into a single architecture. The overall architecture is shown in Figure 3.
The network adopts an encoder-decoder approach. The encoder is mainly divided into two parts: the CNN module and the Transformer module. ResNet50 and MaxViT [31] are used as the backbone networks for the CNN and Transformer, respectively. ResNet50 is a milestone architecture widely adopted in computer vision. Its residual connections effectively alleviate the gradient vanishing problem in deep networks, making it a benchmark model for numerous tasks. Leveraging pre-trained weights from large-scale datasets like ImageNet, ResNet50 can extract hierarchical features with strong generalization capabilities, laying a solid foundation for downstream tasks. As an innovative hybrid architecture, MaxViT maintains local feature extraction through convolutional operations while incorporating Transformer modules to establish global dependencies, effectively overcoming the limitations of traditional CNNs in long-range modeling. During the encoding phase, feature maps of different scales and dimensions are first generated through four levels of convolutional layers, each focusing on different local information, thereby forming hierarchical representations. Subsequently, these hierarchical representations are fed into different Transformer layers within the encoder to complete the network’s encoding operation and achieve the fusion of multi-level features.
In the decoding phase, the network consists of four levels of Transformer layers. Each Transformer layer is composed of a different number of MaxViT modules stacked together, with the specific configuration parameters shown in Table 1.
The last Transformer layer of the encoder contains rich contextual information, serving as a bridge between the encoder and the decoder. Meanwhile, during the downsampling process in the encoder, some spatial information of the mold may be lost. To address these issues, the outputs of the four encoding layers are skip-connected to the four decoding layers in the network structure, forming a complete mold detection network.

3.2.2. Feature Pyramid Module

The pyramid structure has demonstrated excellent performance in improving model capabilities and is therefore widely used in tasks such as object detection and semantic segmentation. Figure 4a employs the last layer for prediction. However, due to the convolutional downsampling operation, information about small mold regions may be lost, resulting in poor segmentation performance for inconspicuous mold. Figure 4b utilizes high-level feature maps for multi-scale prediction. A single high-level feature lacks both sufficient semantic information and adequate utilization of spatial information from low-level feature maps, which is crucial for detecting small mold regions.
Based on ResNet50, this paper designs a Feature Pyramid Network (FPN) structure, as shown in Figure 4c. Unlike traditional pyramids, FPN fuses multi-scale information in a top-down and bottom-up manner, allowing both large and small objects to have appropriate feature representations at corresponding scales. The original image A undergoes top-down convolution to generate multi-layer features. The bottom-layer features are upsampled and then fused with other features to generate hierarchical CNN features.
In the Transformer layers of the encoder, the convolutional features R1 are used as input instead of the original image, while R2, R3, and R4 are respectively fed into different levels of the Transformer network, providing detailed features of the mold for the Transformer layers. The mathematical process is shown in Equation (4):
$$R_{i+1} = \mathrm{Conv}\big(\mathrm{Concat}(\mathrm{UpSampling}(L_i),\; L_{i-1})\big), \quad i = 0, 1, 2, 3 \tag{4}$$
where $\mathrm{UpSampling}(\cdot)$ denotes upsampling, $\mathrm{Conv}(\cdot)$ convolution, and $\mathrm{Concat}(\cdot)$ concatenation.
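A minimal sketch of the fusion step in Equation (4), assuming PyTorch and illustrative channel counts (the paper's exact FPN configuration is not reproduced here): upsample the coarser map, concatenate it with the finer map, and fuse with a convolution.

```python
# Sketch of one FPN fusion step: R = Conv(Concat(UpSampling(L_coarse), L_fine)).
# Channel counts and spatial sizes are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FuseStep(nn.Module):
    def __init__(self, c_coarse, c_fine, c_out):
        super().__init__()
        self.conv = nn.Conv2d(c_coarse + c_fine, c_out, kernel_size=3, padding=1)

    def forward(self, l_coarse, l_fine):
        up = F.interpolate(l_coarse, size=l_fine.shape[-2:],
                           mode="bilinear", align_corners=False)   # UpSampling(L_i)
        return self.conv(torch.cat([up, l_fine], dim=1))           # Conv(Concat(...))

fuse = FuseStep(c_coarse=256, c_fine=128, c_out=128)
r = fuse(torch.randn(1, 256, 16, 16), torch.randn(1, 128, 32, 32))
print(r.shape)
```

The fused map keeps the finer map's resolution while inheriting semantic context from the coarser level.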

3.2.3. MaxViT Module

The MaxViT module is a core component of the Transformer network, with its structure illustrated in Figure 5. It is mainly divided into two parts: the MBConv module and the Multi-Axis Self-Attention (MaxSA) module. Research has shown that combining local and global approaches helps improve the model’s generalization ability. Therefore, before applying the MaxSA module, we preprocess the image information using the MBConv module.
The MBConv module introduces local characteristics through convolution operations and utilizes the Squeeze and Excitation (SE) network to assign higher weights to mold feature information. Additionally, the MBConv module employs DepthWise Conv (DWC) instead of traditional positional encoding, eliminating the need for explicit positional encoding in the network, reducing the number of network parameters, and improving the network’s training speed.
$C$ is the feature tensor that serves as the input to the MaxViT module; its transformation by the MBConv stage is given in Equation (5). In the second stage, the MaxSA module is applied to extract global information. The MaxSA module decomposes the traditional self-attention mechanism into two sparse forms, Block Self-Attention (BSA) and Grid Self-Attention (GSA), connected serially with the BSA module followed by the GSA module. This reduces the quadratic complexity of self-attention computation to linear without sacrificing globality. In Equations (5)–(7), $C_1$ is the output of the MBConv module, $\mathrm{BSA}(\cdot)$ denotes block self-attention, $\mathrm{FFN}(\cdot)$ the feedforward network, $C_2$ the output after the block self-attention module, $\mathrm{GSA}(\cdot)$ grid self-attention, and $C_3$ the output of the MaxViT module:
$$C_1 = C + \mathrm{PROJ}\big(\mathrm{SE}(\mathrm{DWC}(\mathrm{Conv}(\mathrm{BN}(C))))\big) \tag{5}$$
$$C_2 = C_1 + \mathrm{BSA}(C_1) + \mathrm{FFN}\big(C_1 + \mathrm{BSA}(C_1)\big) \tag{6}$$
$$C_3 = C_2 + \mathrm{GSA}(C_2) + \mathrm{FFN}\big(C_2 + \mathrm{GSA}(C_2)\big) \tag{7}$$
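To illustrate how the BSA and GSA terms in Equations (6) and (7) achieve sparsity, the sketch below shows only the window partitioning that precedes attention, following the common MaxViT implementation pattern; the attention layer itself (which would act over the second dimension of each output) is omitted, and all shapes are illustrative assumptions.

```python
# Sketch of the two sparse attention patterns: block attention attends within
# each non-overlapping P x P local window, while grid attention attends across
# a dilated P x P grid spanning the whole image. Shapes are illustrative.
import torch

def block_partition(x, p):
    """(B, H, W, C) -> (B*H/p*W/p, p*p, C): tokens grouped inside each local window."""
    b, h, w, c = x.shape
    x = x.view(b, h // p, p, w // p, p, c)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, p * p, c)

def grid_partition(x, p):
    """(B, H, W, C) -> (B*H/p*W/p, p*p, C): each group is a dilated global grid."""
    b, h, w, c = x.shape
    x = x.view(b, p, h // p, p, w // p, c)
    return x.permute(0, 2, 4, 1, 3, 5).reshape(-1, p * p, c)

x = torch.randn(1, 16, 16, 8)          # B=1, H=W=16, C=8 feature map
print(block_partition(x, 8).shape, grid_partition(x, 8).shape)
```

Both partitions produce sequences of fixed length $P^2$, which is what keeps the attention cost linear in the image size.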

3.2.4. Dynamic Sparse Attention Module

Traditional self-attention calculates the relevance of all position vectors and extracts important information based on the computed attention scores, capturing global dependencies between position vectors. However, this mechanism entails an excessively large computational load. The sparse attention mechanism serves as a key optimization strategy to address the prohibitively high computational complexity and substantial memory consumption of the standard Transformer model’s self-attention. Sparse attention computes correlations only between a pre-selected, limited number of positional pairs. This significantly reduces the number of attention pairs that need to be calculated, enabling the model to handle longer sequences previously beyond its reach.
The sparse attention model proposed by Jiang M et al. [32] sparsifies the feature block Query and feature block Key defined in the self-attention module through a credit matrix generated by the pre-output layer, followed by similarity modeling between the two sparsified feature blocks, thereby significantly reducing computational resource consumption. The method for selecting sparse positional pairs is crucial for sparse attention mechanisms. Sun Z et al. [33] defined an Adaptive-Reciprocal Nearest Neighbors (A-RNN) relationship, which achieves flexibility through adaptive threshold learning during neighbor identification while ensuring reliability of neighbor relations via a reciprocity mechanism. Sun B et al. [34] integrated dynamic sparse attention through a bi-level routing mechanism, achieving context-aware computational resource allocation with enhanced adaptability. This approach not only optimizes semantic feature extraction but also ensures feature quality.
In this study, most position vectors primarily contain background information rather than mold information, leading to attention maps generated by self-attention focusing more on background regions, with small and weak mold features being overwhelmed by redundant background information. To address these issues, we introduce the TOP-K algorithm into the multi-axis attention mechanism and design an effective Dynamic Sparse Self-Attention module. The dynamic sparse attention mechanism selects K position vectors with higher scores in both spatial and channel dimensions to compute attention, filtering out some redundant background information and highlighting important mold features, while also reducing the computational load of self-attention. The module details are shown in Figure 6, where low-scoring gray areas are filtered out, and high-scoring bright areas are retained.
Unlike the traditional attention mechanism, the number of position vectors involved in the dynamic sparse attention computation is limited by the window size of the MaxSA mechanism. In our experiments, the window size is set to $8 \times 8$. The output of the MBConv module, $C_1 \in \mathbb{R}^{B \times C \times W \times H}$, is flattened into windows $X \in \mathbb{R}^{B \times \frac{W}{8}\frac{H}{8} \times (8 \times 8) \times C}$, which serve as the input to the sparse attention module. $X$ is then linearly mapped to the query $Q$, key $K$, and value $V$ through three matrices $W_q, W_k, W_v \in \mathbb{R}^{C \times C}$, as shown in Equation (8):
$$Q = X W_q, \quad K = X W_k, \quad V = X W_v \tag{8}$$
Based on the Q, K, and V obtained from the linear mapping, combined with the Top-K algorithm, we retain position vectors and channels rich in mold information while suppressing the interference of background information. Given the nature of discrete data, variance can reflect the degree of dispersion of the data; the larger the variance, the greater the data fluctuation, and the richer the mold information it contains. Therefore, we calculate the variance along the channel dimension, select K position vectors with the largest variance values based on their ranking, and simultaneously calculate the variance along the spatial dimension to select K most important channels. In the dynamic sparse attention module, we first calculate the channel variance and spatial variance on the query Q to obtain the coordinate indices of K position vectors and channels. Using these coordinate indices, we select the most important position vectors and channels from Q and K to generate Q1 and K1, and select important position vectors from V to generate V1. To ensure that Q, K, and V correspond to the same position vectors and channels at the same positions, we calculate the variance values only once in Q to determine the position vector indices and channel indices. We then use the obtained Q1, K1, and V1 to compute the attention. SoftMax(.) represents the SoftMax function, and C represents the channel dimension of Q1 and K1. The formula is shown in Equation (9):
$$\mathrm{Attention}(Q_1, K_1, V_1) = \mathrm{SoftMax}\!\left(\frac{Q_1 K_1^{T}}{\sqrt{C}}\right) V_1 \tag{9}$$
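The selection procedure above can be sketched as follows, assuming flattened windows of shape (B, N, C) and illustrative values of K; as described in the text, the token and channel indices are computed once from the variances of Q and reused for K and V (V keeps all of its channels).

```python
# Sketch of variance-based Top-K dynamic sparse attention: rank Q's tokens by
# channel-wise variance and Q's channels by spatial variance, keep the top K of
# each, and run scaled dot-product attention on the reduced Q1, K1, V1.
# The K values and tensor shapes are illustrative assumptions.
import torch
import torch.nn.functional as F

def dynamic_sparse_attention(q, k, v, k_tokens, k_channels):
    """q, k, v: (B, N, C). Returns attention over the selected sub-tensors."""
    tok_idx = q.var(dim=2).topk(k_tokens, dim=1).indices     # (B, k_tokens)
    ch_idx = q.var(dim=1).topk(k_channels, dim=1).indices    # (B, k_channels)

    def take(x, keep_channels):
        # Gather the selected tokens (and optionally channels) from x.
        x = torch.gather(x, 1, tok_idx.unsqueeze(-1).expand(-1, -1, x.size(2)))
        if keep_channels:
            x = torch.gather(x, 2, ch_idx.unsqueeze(1).expand(-1, x.size(1), -1))
        return x

    q1, k1 = take(q, True), take(k, True)    # tokens and channels reduced
    v1 = take(v, False)                      # V keeps all channels
    attn = F.softmax(q1 @ k1.transpose(-2, -1) / q1.size(-1) ** 0.5, dim=-1)
    return attn @ v1                         # (B, k_tokens, C)

out = dynamic_sparse_attention(torch.randn(2, 64, 32), torch.randn(2, 64, 32),
                               torch.randn(2, 64, 32), k_tokens=16, k_channels=8)
print(out.shape)
```

Because attention is computed over only `k_tokens` positions and `k_channels` channels, low-variance background regions are excluded before any similarity scores are formed.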

3.3. Experimental Design

3.3.1. Experimental Procedure

The overall experimental process is shown in Figure 7. A hyperspectral camera is used to scan paper containing mold to obtain three-dimensional spectral data. The Locally Linear Embedding (LLE) manifold learning algorithm is employed to reduce the high-dimensional spectral data to 3 dimensions. These three-dimensional features are then used as the red, green, and blue channels of the image to synthesize a false-color image. Subsequently, the semantic segmentation algorithm proposed in this paper is applied to segment the mold in the false-color image.

3.3.2. Spectral Data Acquisition

In this experiment, the portable hyperspectral imaging system iSpecHyper-VS1000 from Lesen Optics was used to collect hyperspectral data of mold spots on paper artifacts in reflection mode. The acquisition setup is shown in Figure 8. The setup consists of a computer (equipped with acquisition software developed by Lesen Optics), a portable hyperspectral imaging system, two 1000 W halogen light sources, and a Canon RF24 lens (Canon Inc., Tokyo, Japan). The spectral range is 400–1000 nm, with a resolution of 2.8 nm. Parameters are listed in Table 2.
Since deep learning-based methods require a substantial number of samples while cultural artifacts are extremely precious, the mold-infected paper samples used in this study were artificially cultivated by staff at the Chongqing Three Gorges Museum based on mold types commonly found infesting paper artifacts in their collection. The study selected Xuan paper with material properties similar to those of cultural artifacts, and employed common mold species including Aspergillus niger, Penicillium citrinum, and Cladosporium cladosporioides. The specific procedure involved using a disposable sterile swab to perform unidirectional gentle swabbing of pre-cultured molds and transferring the contaminated swab to a glucose culture medium for incubation. After three days of cultivation, the culture broth was diluted with sterile deionized water and inoculated onto the sample paper. For each type of mold, 30 samples were prepared. Following hyperspectral scanning, each original sample was cropped into 12 sub-samples, yielding 360 sub-samples for each of the three molds. Each sub-sample image has a resolution of 256 × 256 pixels, and the images were split into training, validation, and test sets in a ratio of 7:2:1. The ground truth of mold distribution areas was annotated by museum staff through visual inspection combined with microscopic observation. Segmentation of the three mold types was then performed and compared against other models in a comparative experimental analysis. Finally, an ablation study was conducted on one of the datasets to verify the effectiveness of the dynamic sparse attention module.
Each pixel in a hyperspectral image contains a complete set of spectral data, which includes the absorption or reflection information at 300 different and continuous wavelengths at that pixel location, as shown in Figure 9. In this hyperspectral image, the spectral data of pixels belonging to the same type exhibit great similarity, while the spectral data of pixels belonging to different types show significant differences. Therefore, using spectral data provides a scientific basis for classifying hyperspectral images into different categories.

3.3.3. Experimental Configuration

The network model proposed in this paper is implemented based on the PyTorch 2.3.1 framework, and all training processes are completed on an AutoDL server equipped with NVIDIA 3090 series GPUs. During training, the initial learning rate of the model is set to 0.01, and the model is trained using an SGD optimizer with a decay rate of 0.001 and a momentum of 0.9. Additionally, the learning rate is decayed according to Equation (10) during the training process.
lr = base_lr × (1.0 − iter_num / max_iter)^0.9        (10)
The base learning rate (base_lr) is set to 0.01, iter_num represents the current training iteration, and max_iter is the maximum number of training iterations. To address the issues of class imbalance among mold categories and foreground-background imbalance in the dataset, the model’s loss function adopts a combined form, as shown in Equation (11):
Loss = 0.4 × Loss_ce + 0.6 × Loss_dice        (11)
where Loss_ce is the cross-entropy loss function, and Loss_dice is the Dice similarity coefficient loss function.
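Equations (10) and (11) can be written out directly. The sketch below implements the polynomial learning-rate decay and the 0.4/0.6 cross-entropy plus Dice combination in plain numpy for clarity; the paper's actual training uses PyTorch's built-in losses, and the function names here are hypothetical.

```python
import numpy as np

def poly_lr(base_lr, iter_num, max_iter, power=0.9):
    """Equation (10): lr = base_lr * (1 - iter_num / max_iter)^power."""
    return base_lr * (1.0 - iter_num / max_iter) ** power

def softmax(x, axis=1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def combined_loss(logits, target, n_classes, eps=1e-6):
    """Equation (11): 0.4 * cross-entropy + 0.6 * Dice loss.

    logits: (N, C, H, W) class scores; target: (N, H, W) integer labels.
    """
    probs = softmax(logits, axis=1)
    onehot = np.eye(n_classes)[target].transpose(0, 3, 1, 2)
    # Pixel-wise cross-entropy, averaged over all pixels.
    ce = -(onehot * np.log(probs + eps)).sum(axis=1).mean()
    # Soft Dice over each class, then averaged.
    inter = (probs * onehot).sum(axis=(0, 2, 3))
    denom = probs.sum(axis=(0, 2, 3)) + onehot.sum(axis=(0, 2, 3))
    dice = 1.0 - ((2 * inter + eps) / (denom + eps)).mean()
    return 0.4 * ce + 0.6 * dice
```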
For each dataset, the images in both the training and test sets are resized to a uniform resolution of 512 × 512. The model is trained for 150 epochs on each of the three datasets, with a batch size of 24. The mean Intersection over Union (mIoU) metric is used to evaluate the model’s segmentation performance.
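The mIoU metric can be computed as follows; this is the standard per-class IoU averaged over the classes present in either the prediction or the ground truth, and the function name is hypothetical.

```python
import numpy as np

def mean_iou(pred, gt, n_classes):
    """Mean Intersection over Union for integer label maps of equal shape."""
    ious = []
    for c in range(n_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:                      # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))
```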

4. Experiments and Result Analysis

4.1. Experimental Comparative Analysis

To verify the superiority of the proposed model, comparative analyses were conducted against state-of-the-art segmentation networks including HiFormer [19], MaxViT [31], ResNet50, DeepLabV3 [35], and TransUnet [36]. As quantitatively demonstrated in Table 3, the proposed architecture achieves significant improvements in fungal segmentation accuracy. The hybrid architectures (TransUnet, HiFormer, and ours) consistently outperform pure CNN or pure Transformer models across all three fungal species, demonstrating the efficacy of integrating multi-scale CNN features with global Transformer attention. For Penicillium citrinum, HiFormer achieves the highest mIoU (90.3), closely followed by ours (90.1), both significantly surpassing pure CNN models such as ResNet50 (83.3) and the pure Transformer MaxViT (80.6). On Aspergillus niger, our model achieves the best result (96.8 mIoU), outperforming even TransUnet (96.3) and MaxViT (94.3). For Cladosporium cladosporioides, our model again leads (89.3), with MaxViT (88.4) and HiFormer (87.9) trailing slightly.
The visual results for the semantic segmentation of the different mold types are shown in Figure 10. The quantitative data and segmentation results demonstrate that all of the networks can accurately identify mold and background regions in the Aspergillus niger dataset, as shown in rows 1 to 4 of Figure 10; overall, our method achieves the best performance, with an mIoU at least 0.5% higher than the other methods. In the Penicillium citrinum dataset, many molds share similar spectral characteristics with the paper background. CNN-based models such as ResNet50 and DeepLabV3 tend to misidentify mold-like background regions as actual mold and fail to detect low-contrast mold. In contrast, models incorporating global features (MaxViT, HiFormer, TransUnet, and our proposed network) effectively suppress background interference and accurately identify molds by establishing long-range dependencies. HiFormer uses only the two feature maps with the richest semantic and positional information during decoding, whereas our model employs multi-scale feature fusion; although this preserves more mold detail, it also introduces some noise, so HiFormer surpasses our model by 0.2% in segmentation accuracy, with results illustrated in rows 5 to 8 of Figure 10. Cladosporium molds are sparsely distributed with varying infection depths, making detection challenging. By retaining multi-scale information, our model demonstrates exceptional detection capability on the Cladosporium dataset, achieving an mIoU 1.5% higher than TransUnet and 1.4% higher than HiFormer; the segmentation results are shown in rows 9 to 10 of Figure 10.

4.2. Ablation Experiment

To further demonstrate the effectiveness of the proposed dynamic sparse attention module in our network, we conducted an ablation experiment on the Penicillium citrinum dataset focusing on this module.
Firstly, we divided the encoding layers of the network into four levels, numbered 1 to 4 in sequence. We then replaced the standard self-attention module with the dynamic sparse attention module in reverse order, level by level (from level 4 to level 1). The experimental results are shown in Table 4. As the table indicates, the network incorporating the dynamic sparse attention mechanism achieves better segmentation performance than the network based on standard self-attention. Introducing the dynamic sparse attention module at the fourth, third, second, and first levels improves performance by 2.6%, 3.2%, 3.1%, and 1.8%, respectively, demonstrating the module's effectiveness. An analysis of the results reveals that introducing the module at the third and fourth levels yields the best results. This is because the feature maps generated in the early stages of the network contain rich detailed information about the mold, and introducing the dynamic module there may discard some important details, reducing the accuracy of weak mold recognition. In contrast, the feature maps generated in the later stages encode semantic information about the mold, which is easily interfered with by background noise; applying dynamic sparse attention at these deeper levels filters out the noisy positions while retaining the dominant semantic cues.
Meanwhile, the most informative position vectors are selected in both the channel and spatial dimensions to compute attention, reducing the model's complexity. Finally, we investigated the impact of the hyperparameter K on the network's performance, setting K to 24, 32, 40, 48, 56, and 64 in turn. As shown in Table 5, our model achieves the best performance when K is 56. If K is too large, weak, low-contrast mold information is easily swamped by background information; conversely, if K is too small, position vectors containing useful information are discarded, reducing the model's ability to recognize mold.
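The TOP-K screening described above can be illustrated with a single-head sketch: each key/value position is scored (here simply by its feature variance, a stand-in for the paper's combined channel and spatial statistics), only the K highest-scoring positions are kept, and scaled dot-product attention is computed from every query to that subset, so the attention cost scales with K rather than with the full token count. The function name and scoring rule are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def topk_sparse_attention(q, k, v, K):
    """TOP-K dynamic sparse attention sketch.

    q, k, v: (N, d) token matrices; returns (N, d). Only the K key/value
    positions with the highest feature variance (a proxy for informative,
    mold-rich regions) participate in attention.
    """
    scores = k.var(axis=1)                       # per-position saliency score
    keep = np.argsort(scores)[-K:]               # indices of the top-K positions
    k_s, v_s = k[keep], v[keep]
    att = q @ k_s.T / np.sqrt(q.shape[1])        # (N, K) scaled dot products
    att = np.exp(att - att.max(axis=1, keepdims=True))
    att /= att.sum(axis=1, keepdims=True)        # row-wise softmax over K keys
    return att @ v_s                             # weighted sum of K values
```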

4.3. Application Testing

To validate the effectiveness of the proposed method, a set of paper artifacts infected with Cladosporium and their corresponding spectral data, provided by the Chongqing Three Gorges Museum, were collected and subjected to mold segmentation testing using our approach. Due to the inability to conduct prolonged testing on cultural artifacts, fine-grained annotations were not available, making it difficult to compute quantitative metrics. However, based on the visual segmentation results, the primary areas of mold distribution were largely accurately identified, as shown in Figure 11b. That said, a small number of mold instances located within ink stains were missed, likely because the semantic segmentation process, which relies on contextual information, was interfered with by local ink patterns. Examples of such cases are illustrated in Figure 11f,g. Overall, the method demonstrated satisfactory performance in identifying mold categories, their distribution on paper artifacts, and the extent of infection, suggesting its potential as a technical tool for monitoring museum collections.

5. Conclusions

In response to the early detection needs of mold infestation on paper artifacts, this paper proposes a semantic segmentation method that integrates hyperspectral imaging and deep learning. Through manifold learning for dimensionality reduction, multi-level feature fusion, and optimization with a dynamic sparse attention mechanism, precise identification of mold regions against complex backgrounds is achieved. By reducing the high-dimensional spectral data to a three-dimensional space, the redundancy of hyperspectral data is effectively addressed, preserving the spectral differences between the various molds and the background while providing a suitable input format for subsequent deep learning models. The designed CNN-Transformer hybrid network fully integrates the local detail capturing capability of CNNs with the global context modeling advantage of Transformers. Experiments demonstrate that the model achieves mean Intersection over Union (mIoU) scores of 96.8%, 90.1%, and 89.3% on the Aspergillus niger, Penicillium citrinum, and Cladosporium cladosporioides datasets, respectively, representing significant improvements over single CNN or Transformer models, especially in the recognition of weak and irregularly shaped molds. The introduced TOP-K dynamic sparse attention module screens key feature regions through channel and spatial variance, effectively suppressing background noise interference and reducing the false detection rate in complex backgrounds by approximately 3.5%. This study provides efficient technical support for the early monitoring and preventive conservation of mold on artifacts and can be applied to scenarios such as environmental monitoring in artifact storage facilities and pre-restoration damage assessment.
However, the current method still has limitations: firstly, the model’s generalization ability in areas heavily covered with ink needs to be verified; secondly, the training data is concentrated on laboratory-cultured samples, and multi-source data from real artifact scenarios needs to be further expanded to improve robustness.

Author Contributions

Y.Z.: Conceptualization, Methodology, Writing—original draft. Q.S.: Conceptualization, Funding acquisition, Resources, Supervision. T.S.: Conceptualization, Funding acquisition, Writing—review and editing. S.D.: Software, Resources. Q.W.: Visualization, Data curation. Z.L.: Visualization, Data curation. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by Chongqing Talent Program of the Bureau of Science and Technology of Chongqing Municipality (Grant No. CSTC2021YCJH-BGZXM0287) and the Scientific Research Innovation Team of Chongqing University of Technology (Grant No. 2023TDZ014).

Data Availability Statement

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Chen, K.; Guo, M.; Shi, S. Nondestructive Detection of Trichoderma on Surface of Paper Cultural Relics by Reflective Fiber Optic Spectroscopy. Acta Opt. Sin. 2024, 44, 2006004. [Google Scholar]
  2. Durmuş, E.; Güneş, A.; Kalkan, H. Detection of aflatoxin and surface mould contaminated figs by using Fourier transform near-infrared reflectance spectroscopy. J. Sci. Food Agric. 2017, 97, 317–323. [Google Scholar] [CrossRef] [PubMed]
  3. De Silveira, G.; Forsberg, P.; Conners, T.E. Scanning Electron Microscopy: A Tool for the Analysis of Wood Pulp Fibers and Paper. Surface Analysis of Paper; CRC Press: Boca Raton, FL, USA, 2020; pp. 41–71. [Google Scholar]
  4. Wang, R.; Yan, M.; Jiang, M.; Li, Y.; Kang, X.; Hu, M.; Liu, B.; He, Z.; Kong, D. Label-Free and Selective Cholesterol Detection Based on Multilayer Functional Structure Coated Fiber Fabry-Perot Interferometer Probe. Anal. Chim. Acta 2023, 1252, 341051. [Google Scholar] [CrossRef]
  5. Xue, Z.; Yu, Q.; Zhong, N.; Zeng, T.; Tang, H.; Zhao, M.; Zhao, Y.; Tang, B. Fiber Optic Sensor for Nondestructive Detection of Microbial Growth on a Silk Surface. Appl. Opt. 2022, 61, 4463–4470. [Google Scholar] [CrossRef]
  6. Farrugia, J.; Griffin, S.; Valdramidis, V.P.; Camilleri, K.; Falzon, O. Principal Component Analysis of Hyperspectral Data for Early Detection of Mould in Cheeselets. Curr. Res. Food Sci. 2021, 4, 18–27. [Google Scholar] [CrossRef]
  7. Long, Y.; Tang, X.; Fan, S.; Zhang, C.; Zhang, B.; Huang, W. Identification of Mould Varieties Infecting Maize Kernels Based on Raman Hyperspectral Imaging Technique Combined with Multi-Channel Residual Module Convolutional Neural Network. J. Food Compos. Anal. 2024, 125, 324–351. [Google Scholar] [CrossRef]
  8. Lan, R.; Li, Z.; Liu, Z.; Gu, T.; Luo, X. Hyperspectral Image Classification Using K-Sparse Denoising Autoencoder and Spectral-Restricted Spatial Characteristics. Appl. Soft Comput. 2019, 74, 693–708. [Google Scholar] [CrossRef]
  9. Wang, S.; Cen, Y.; Qu, L.; Li, G.; Chen, Y.; Zhang, L. Virtual Restoration of Ancient Mold-Damaged Painting Based on 3D Convolutional Neural Network for Hyperspectral Image. Remote Sens. 2024, 16, 2882. [Google Scholar] [CrossRef]
  10. Liu, Y.; Li, Y.; Zhao, T.; Kang, H. Vision Sensing-Driven Tunnel Crack Detection Method Using Particle Filtering-Integrated YOLOv5 Model. J. Circuits Syst. Comput. 2025, 34, 2550181. [Google Scholar] [CrossRef]
  11. Liu, F.; Wang, L. UNet-based model for crack detection integrating visual explanations. Constr. Build. Mater. 2022, 322, 126265. [Google Scholar] [CrossRef]
  12. Shafiq, M.; Gu, Z. Deep Residual Learning for Image Recognition: A Survey. Appl. Sci. 2022, 12, 2076–3417. [Google Scholar] [CrossRef]
  13. Kang, D.; Han, Y.; Zhu, J.; Lai, J. An Axially Decomposed Self-Attention Network for the Precise Segmentation of Surface Defects on Printed Circuit Boards. Neural Comput. Appl. 2022, 34, 13697–13712. [Google Scholar] [CrossRef]
  14. Wang, J.; Lu, S.-Y.; Wang, S.-H.; Zhang, Y.-D. RanMerFormer: Randomized vision transformer with token merging for brain tumor classification. Neurocomputing 2024, 573, 127216. [Google Scholar] [CrossRef]
  15. d’Ascoli, S.; Touvron, H.; Leavitt, M.; Morcos, A.; Biroli, G.; Sagun, L. ConViT: Improving vision transformers with soft convolutional inductive biases. J. Stat. Mech. Theory Exp. 2022, 114005, 1–27. [Google Scholar]
  16. Guo, X.; Lin, X.; Yang, X.; Yu, L.; Cheng, K.-T.; Yan, Z. UCTNet: Uncertainty-guided CNN-Transformer hybrid networks for medical image segmentation. Pattern Recognit. J. Pattern Recognit. Soc. 2024, 152, 110491. [Google Scholar] [CrossRef]
  17. Yuan, J.; Zhu, A.; Xu, Q.; Wattanachote, K.; Gong, Y. CTIF-Net: A CNN-Transformer Iterative Fusion Network for Salient Object Detection. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 3795–3805. [Google Scholar] [CrossRef]
  18. Wu, S.; Hadachi, A.; Lu, C.; Vivet, D. Transformer for multiple object tracking: Exploring locality to vision. Pattern Recognit. Lett. 2023, 170, 70–76. [Google Scholar] [CrossRef]
  19. Wu, X.; Lu, H.; Li, K.; Wu, Z.; Liu, X.; Meng, H. Hiformer: Sequence Modeling Networks With Hierarchical Attention Mechanisms. IEEE/ACM Trans. Audio Speech Lang. Process. 2023, 31, 3993–4003. [Google Scholar] [CrossRef]
  20. She, Y. Feature Pyramid Networks and Long Short-Term Memory for EEG Feature Map-Based Emotion Recognition. Sensors 2023, 23, 1622. [Google Scholar]
  21. Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  22. Yu, W.; Zhang, M.; Shen, Y. Learning a Local Manifold Representation Based on Improved Neighborhood Rough Set and LLE for Hyperspectral Dimensionality Reduction. Signal Process. 2019, 164, 20–29. [Google Scholar] [CrossRef]
  23. Tu, S.T.; Chen, J.Y.; Yang, W.; Sun, H. Laplacian Eigenmaps-Based Polarimetric Dimensionality Reduction for SAR Image Classification. IEEE Trans. Geosci. Remote Sens. 2012, 50, 170–179. [Google Scholar] [CrossRef]
  24. Li, W.; Zhang, L.; Zhang, L.; Du, B. GPU Parallel Implementation of Isometric Mapping for Hyperspectral Classification. IEEE Geosci. Remote Sens. Lett. 2017, 14, 1532–1536. [Google Scholar] [CrossRef]
  25. Xia, L.; Lee, C.; Li, J.J. Statistical method scDEED for detecting dubious 2D single-cell embeddings and optimizing t-SNE and UMAP hyperparameters. Nat. Commun. 2024, 15, 1753. [Google Scholar] [CrossRef]
  26. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zha, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  27. Ramachandran, P.; Parmar, N.; Vaswani, A.; Bello, I.; Levskaya, A.; Shlens, J. Stand-Alone Self-Attention in Vision Models. In Advances in Neural Information Processing Systems 32 (NeurIPS 2019), Vancouver, BC, Canada, 8–14 December 2019; Curran Associates, Inc.: Red Hook, NY, USA, 2019. [Google Scholar]
  28. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. arXiv 2021, arXiv:2103.14030. [Google Scholar] [CrossRef]
  29. Wang, W.; Wang, J.; Quan, D.; Yang, M.; Sun, J.; Lu, B. PolSAR Image Classification Via a Multigranularity Hybrid CNN-ViT Model With External Tokens and Cross-Attention. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 8003–8019. [Google Scholar] [CrossRef]
  30. Peng, Y.; Liang, F. Tumor segmentation method for breast ultrasound images incorporating CNN and ViT. CAAI Trans. Intell. Syst. 2024, 19, 556–564. [Google Scholar]
  31. Nguyen-Tat, T.B.; Vo, H.-A.; Dang, P.-S. QMaxViT-Unet+: A query-based MaxViT-Unet with edge enhancement for scribble-supervised segmentation of medical images. Comput. Biol. Med. 2025, 187, 586–598. [Google Scholar] [CrossRef]
  32. Jiang, M.; Zhai, F.; Kong, J. Sparse Attention Module for optimizing semantic segmentation performance combined with a multi-task feature extraction network. Vis. Comput. 2022, 38, 2473–2488. [Google Scholar] [CrossRef]
  33. Sun, Z.; Zhang, C.; Zhang, M. Adaptive sparse attention module based on reciprocal nearest neighbors. J. Electron. Imaging 2024, 33, 033038. [Google Scholar] [CrossRef]
  34. Sun, B.; Liu, C.; Wang, Q.; Bi, K.; Zhang, W. MFFBi-Unet: Merging Dynamic Sparse Attention and Multi-scale Feature Fusion for Medical Image Segmentation. Interdiscip. Sci. Comput. Life Sci 2025. Early Access. [Google Scholar] [CrossRef] [PubMed]
  35. Zeng, W.; He, M. Rice disease segmentation method based on CBAM-CARAFE-DeepLabv3+. Crop Prot. 2024, 180, 106665. [Google Scholar] [CrossRef]
  36. Lin, A.; Chen, B.; Xu, J.; Zhang, Z.; Lu, G.; Zhang, D. DS-TransUNet: Dual Swin Transformer U-Net for Medical Image Segmentation. IEEE Trans. Instrum. Meas. 2022, 71, 3178991. [Google Scholar] [CrossRef]
Figure 1. Illustration of Vision Transformer.
Figure 2. Illustration of Spectral Feature Dimensionality Reduction. (a) High-dimensional spectral curve, (b) Three-dimensional spatial distribution of spectral features.
Figure 3. The network of mold semantic segmentation.
Figure 4. Three types of pyramid structures.
Figure 5. Structure of the MAXViT Module.
Figure 6. Dynamic Sparse Attention Module.
Figure 7. Experimental Process.
Figure 8. Mold Collection Device for Cultural Relics.
Figure 9. Spectral Curves of Different Components in Paper Artifacts.
Figure 10. The segmentation results of different network models.
Figure 11. Mold Detection Application Results. (a) Hyperspectral image. (b) Segmentation result. (c) Mold regions. (d) Local region 1. (e) Local region 2. (f) Segmentation result of Local region 1. (g) Segmentation result of Local region 2.
Table 1. MaxViT Parameters for Each Transformer Layer.
| Layer | Configuration Parameters | Output Size |
| --- | --- | --- |
| Input | – | 3 × W × H |
| Encoder S1 | [MBConv(E=4, R=4); Window MSA(W=8, H=2); Grid MSA(G=8, H=2)] × 2 | 64 × W/4 × H/4 |
| Encoder S2 | [MBConv(E=4, R=4); Window MSA(W=8, H=2); Grid MSA(G=8, H=2)] × 2 | 128 × W/8 × H/8 |
| Encoder S3 | [MBConv(E=4, R=4); Window MSA(W=8, H=2); Dynamic MSA(G=8, H=2)] × 2 | 256 × W/16 × H/16 |
| Encoder S4 | [MBConv(E=4, R=4); Window MSA(W=8, H=2); Dynamic MSA(G=8, H=2)] × 2 | 512 × W/32 × H/32 |
| Decoder Z1 | [MBConv(E=4, R=4); Window MSA(W=8, H=2); Grid MSA(G=8, H=2)] × 1 | 256 × W/16 × H/16 |
| Decoder Z2 | [MBConv(E=4, R=4); Window MSA(W=8, H=2); Grid MSA(G=8, H=2)] × 2 | 128 × W/8 × H/8 |
| Decoder Z3 | [MBConv(E=4, R=4); Window MSA(W=8, H=2); Grid MSA(G=8, H=2)] × 2 | 64 × W/4 × H/4 |
| Decoder Z4 | [MBConv(E=4, R=4); Window MSA(W=8, H=2); Grid MSA(G=8, H=2)] × 2 | 64 × W/4 × H/4 |
Table 2. Parameters of the iSpecHyper-VS1000 Hyperspectral Imaging System.
| Main Technical Indicators | Indicator Parameters | Main Technical Indicators | Indicator Parameters |
| --- | --- | --- | --- |
| Spectral Range | 400 nm–1000 nm | AD Dynamic Range | 12 bits |
| Spectral Resolution | <3 nm | Image Resolution | 1920 × 1200 |
| Number of Spectral Channels | 300 | Lens | Canon 24 mm |
| Detector | CMOS/InGaAs (TE Cooled) | Pixel Size | 5.86 μm × 5.86 μm |
Table 3. Comparison of different models on the three mold datasets.
| Models | Parameter Count (M) | FLOPs | Aspergillus niger mIoU (%) | Penicillium citrinum mIoU (%) | Cladosporium cladosporioides mIoU (%) |
| --- | --- | --- | --- | --- | --- |
| ResNet50 | 35.3 | 36.3 G | 87.7 | 83.3 | 85.4 |
| DeepLabV3 | 39.6 | 40 G | 92.4 | 82.5 | 86.3 |
| MaxViT | 26.8 | 9.2 G | 94.3 | 80.6 | 88.4 |
| TransUnet | 93.2 | 31.5 G | 96.3 | 89.6 | 87.8 |
| HiFormer | 34.2 | 17.4 G | 96.1 | 90.3 | 87.9 |
| Ours | 42.8 | 15.7 G | 96.8 | 90.1 | 89.3 |
Table 4. Comparison of dynamic sparse attention modules using different levels on Penicillium citrinum.
| Model | Stages with Dynamic Sparse Attention | Param | FLOPs | Penicillium citrinum mIoU |
| --- | --- | --- | --- | --- |
| Ours | – | 43 M | 15.8 G | 86.9 |
| Ours | 4 | 43 M | 15.7 G | 89.5 |
| Ours | 3, 4 | 43 M | 15.7 G | 90.1 |
| Ours | 2, 3, 4 | 42 M | 15.6 G | 90.0 |
| Ours | 1, 2, 3, 4 | 42 M | 15.6 G | 88.7 |
| MaxUnet | – | 26 M | 9.2 G | 82.4 |
| MaxUnet | 4 | 26 M | 9.1 G | 83.9 |
| MaxUnet | 3, 4 | 26 M | 9.0 G | 87.3 |
| MaxUnet | 2, 3, 4 | 25 M | 8.9 G | 86.3 |
| MaxUnet | 1, 2, 3, 4 | 25 M | 8.9 G | 84.2 |
Note: the listed stages are those in which the self-attention module is replaced by the dynamic sparse attention module; replacements are applied cumulatively in reverse order, from stage 4 to stage 1.
Table 5. Comparison of different K values of dynamic sparse attention modules on Penicillium citrinum.
| Model | Parameters | Complexity | K | Penicillium citrinum mIoU |
| --- | --- | --- | --- | --- |
| Ours | 41 M | 15.5 G | 24 | 87.6 |
| Ours | 42 M | 15.6 G | 32 | 88.9 |
| Ours | 42 M | 15.6 G | 40 | 89.6 |
| Ours | 43 M | 15.7 G | 48 | 89.2 |
| Ours | 43 M | 15.7 G | 56 | 90.1 |
| Ours | 43 M | 15.8 G | 64 | 87.1 |
| MaxUnet | 24 M | 8.8 G | 24 | 85.1 |
| MaxUnet | 24 M | 8.9 G | 32 | 84.8 |
| MaxUnet | 25 M | 8.9 G | 40 | 83.9 |
| MaxUnet | 25 M | 9.0 G | 48 | 84.7 |
| MaxUnet | 26 M | 9.0 G | 56 | 85.8 |
| MaxUnet | 26 M | 9.1 G | 64 | 85.2 |

Share and Cite

MDPI and ACS Style

Zhao, Y.; Song, Q.; Song, T.; Dong, S.; Wu, Q.; Long, Z. Hyperspectral Imaging Combined with Deep Learning for the Detection of Mold Diseases on Paper Cultural Relics. Heritage 2025, 8, 495. https://doi.org/10.3390/heritage8120495
