Article

Semantic Segmentation-Based Identification and Quantitative Analysis of Cross-Sectional Quality Features in Luzhou-Flavor Liquor Daqu

1 School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214122, China
2 Luzhou Laojiao Group Co., Ltd., Luzhou 646000, China
3 National Engineering Research Center of Huangjiu, Zhejiang Guyuelongshan Shaoxing Wine Co., Ltd., Shaoxing 312000, China
4 School of Biotechnology, Jiangnan University, Wuxi 214122, China
5 State Key Laboratory of Food Science and Technology, National Engineering Research Center of Cereal Fermentation and Food Biomanufacturing, School of Food Science and Technology, Jiangnan University, Wuxi 214122, China
6 Shaoxing Key Laboratory of Traditional Fermentation Food and Human Health, Jiangnan University (Shaoxing) Industrial Technology Research Institute, Shaoxing 312000, China
* Authors to whom correspondence should be addressed.
Computers 2026, 15(5), 307; https://doi.org/10.3390/computers15050307
Submission received: 9 March 2026 / Revised: 20 April 2026 / Accepted: 21 April 2026 / Published: 12 May 2026

Abstract

The objective evaluation of Daqu cross-sectional quality is challenging due to its heterogeneous structure, small features, and low contrast. This study proposes a semantic-segmentation-based framework for the automated identification and quantitative analysis of Luzhou-flavor Daqu cross-sections. Four representative architectures—including three convolutional neural network (CNN)-based models (U-Net, U-Net++, and U2-Net) and one Transformer-based model (SegFormer)—were systematically benchmarked. To address severe class imbalance and enhance model robustness, a task-specific data augmentation pipeline was implemented. With these optimized augmentation strategies, the U2-Net model demonstrated the best performance, with a peak mean Intersection over Union (mIoU) of 87.54% and a Dice score of 98.30%. Based on the predicted masks, quantitative indicators such as plaque area ratio, pizhang thickness, and fissure length were precisely extracted. The proposed framework provides an objective and scalable solution for Daqu quality inspection, offering significant practical value for industrial scenarios involving complex materials and fine-grained defect patterns.

Graphical Abstract

1. Introduction

Chinese Baijiu is an integral component of traditional Chinese culture. As a fermented alcoholic beverage with a history spanning thousands of years, its brewing process embodies profound historical heritage and distinct regional characteristics [1]. During Baijiu production, Daqu serves as the core fermentation starter and plays an indispensable role. Typically produced from cereal grains and shaped into brick-like blocks, Daqu is enriched with diverse microbial communities and complex enzyme systems [2]. These microorganisms and enzymes not only drive saccharification and alcoholic fermentation, but also participate in the formation and transformation of flavor compounds, thereby fundamentally determining Baijiu quality [3]. Consequently, Daqu is often referred to as “the backbone of Baijiu”.
Owing to the complexity of Daqu fermentation and the lack of unified quantitative evaluation standards, industrial assessment of Daqu quality still relies heavily on empirical judgment. Experienced craftsmen typically evaluate Daqu quality by visually inspecting appearance, color distribution, aroma, and cross-sectional structural characteristics [4]. However, this reliance on sensory evaluation creates a significant industrial bottleneck. The lack of precise, localized data on fermentation maturity and microbial health makes fine-grained process control difficult to implement. In modern smart distilleries, the absence of objective morphological metrics results in high inter-batch variability and prevents the integration of Daqu quality data into automated production management systems. Moreover, such experience-based approaches are inherently subjective and suffer from limited repeatability and consistency, making them increasingly inadequate for objective, standardized, and digitalized quality control in modern Baijiu production.
With the rapid advancement of computer technology and artificial intelligence, computer vision has become an important tool in food quality inspection due to its non-contact, non-destructive, and high-efficiency characteristics [5]. Deep learning- and computer-vision-based methods have demonstrated strong performance in food inspection tasks, including meat-state classification, fruit surface defect detection, and microbe-related quality or safety evaluation in fresh meat [6,7,8]. Nevertheless, compared with many other food materials, Daqu exhibits highly heterogeneous and multi-scale structural features arising from complex fermentation and microbial dynamics, which pose substantial challenges for automated quality evaluation [9]. Existing studies on Daqu quality assessment have mainly focused on hyperspectral imaging for physicochemical parameter prediction [10,11] or visible-light image-based grading [12]. However, hyperspectral systems are costly and complex [13,14], while visible-light approaches still lack sufficient capability for fine-grained structural characterization and quantitative analysis.
Semantic segmentation, which assigns semantic labels at the pixel level, provides an effective means for precise localization and delineation of complex structures [15]. Deep learning-based semantic segmentation models, such as U-Net, U-Net++, and the DeepLab series, have achieved remarkable success in medical imaging, remote sensing, and segmentation of surface defects in manufactured products [16,17,18], and have shown increasing potential in food-related applications [19,20,21]. Beyond recognition and localization, recent food-vision studies have increasingly emphasized quantitative estimation tasks, such as food weight and content estimation, highlighting the practical importance of converting image understanding into measurable indicators [22].
To address the challenges of Daqu quality evaluation, this study proposes a reproducible segmentation-to-measurement framework for quantitative analysis of Luzhou-flavor Daqu cross-sectional images under visible-light imaging. By focusing on key fermentation-related structures, including fire cycles, fissures, and microbial plaques, four representative segmentation models (U-Net, U-Net++, U2-Net, and SegFormer) are systematically benchmarked under identical training settings, and a sample-wise (Daqu block-wise) split protocol is adopted to avoid potential data leakage and to evaluate generalization on unseen physical samples. Based on the predicted masks, we further developed a mask-driven quantification module to extract interpretable sample-level indicators, including microbial plaque area proportion, average pizhang thickness, and fissure length, together with visualizations for qualitative verification. Here, pizhang refers to the characteristic peripheral layer of the Daqu cross-section. Relying solely on low-cost visible-light imaging, the proposed approach provides a cost-effective and scalable basis for objective Daqu quality assessment and can be readily extended to other industrial scenarios involving heterogeneous cross-sectional structures and fine-grained, imbalanced defect patterns.

2. Materials and Methods

2.1. Sample Collection

The data acquisition in this study aimed to construct a high-quality, feature-balanced, and representative image dataset of Daqu cross-sections, providing a reliable foundation for subsequent semantic segmentation model training and feature extraction analyses. As no publicly available Daqu cross-sectional image datasets currently exist, all samples were collected on-site in the Daqu production workshop at the Huangyi Ecological Brewing Park of Luzhou Laojiao, Sichuan Province, China. All images were captured using a high-resolution smartphone imaging system equipped with a 64-megapixel (MP) primary sensor and an f/1.8 large aperture lens. The optical system features a 24 mm equivalent focal length, providing a wide field of view that ensures the entire cross-section is captured while maintaining edge-to-edge sharpness. To balance storage efficiency and feature granularity, raw images were recorded at a resolution of 2608 × 4640 pixels. The final dataset comprises 660 such high-resolution images derived from 350 independent physical Daqu blocks.
To comprehensively capture the diverse morphologies and quality features of Daqu during storage, the acquisition protocol emphasized both sample representativeness and feature diversity. Up to two typical cross-sectional images were captured per sample, yielding 660 base images in total. For samples exhibiting prominent quality features, such as microbial plaques, fire cycles, or fissures, close-up images were captured to enhance the visibility of small-scale features and to augment the data for underrepresented categories. For samples with complex feature distributions or unusual morphologies, multi-angle imaging was performed to record structural variations and surface texture information from different perspectives. This strategy ensured comprehensive coverage of samples while enhancing the morphological diversity and detail integrity of the dataset.
To improve feature class balance within the dataset, the distribution of samples across different categories was strictly controlled during acquisition. As summarized in Table 1, the dataset contains a total of 1624 manually annotated instances across four categories. On-site screening and feature annotation ensured that key visual features, including microbial plaques, fire cycles, and fissures, each had no fewer than 200 independent samples, effectively mitigating performance bias due to class imbalance during model training. This deliberate distribution design achieved a favorable balance between structural complexity and sample representativeness within the dataset.
All images were acquired under strictly controlled indoor conditions to minimize the influence of external illumination, reflections, and environmental variability on the visual characteristics of the Daqu samples [23]. The sampling device for capturing Daqu cross-sectional images is illustrated in Figure 1. During acquisition, a high-resolution smartphone camera was used as the imaging device, with automatic high dynamic range (HDR) and artificial intelligence (AI) enhancement features disabled to preserve the authentic texture and color information of the samples. Illumination was provided by LED bar lights with a correlated color temperature of 6500 K and an approximate illuminance of 30,000 lux, simulating standardized daylight conditions. During imaging, the camera lens was positioned perpendicular to the sample surface and maintained at a fixed distance of approximately 30 mm from the center of the cross-section.

2.2. Image Annotation and Preprocessing

To ensure data consistency and improve model training performance, the collected images underwent systematic annotation and preprocessing before model development.
During the annotation stage, pixel-level manual labeling was performed using the Labelme tool. To ensure high annotation quality and cross-batch consistency, a three-step protocol was implemented: (1) Standardization: Four semantic categories were defined based on expert consensus—normal Daqu, fire cycle, fissures, and microbial plaques—with explicit visual boundary criteria established for each category. (2) Manual Annotation: Trained personnel performed the precise segmentation, specifically focusing on capturing the intricate and irregular contours of small-scale features such as fissures and plaques. (3) Quality Verification: All annotations were stored in JavaScript Object Notation (JSON) format, and the derived masks underwent a mandatory manual cross-checking process. Additionally, each mask was visualized using colormap techniques to detect and correct potential errors, such as boundary omissions or category mismatches, ensuring a rigorous ground truth for model training.
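As a concrete illustration of this annotation workflow, the sketch below rasterizes Labelme polygon annotations into single-channel class-id masks. It relies only on the standard Labelme JSON fields; the label names and integer ids in LABEL_TO_ID are hypothetical placeholders that would need to match the project’s actual annotation scheme.

```python
import json

import numpy as np
from PIL import Image, ImageDraw

# Hypothetical label-to-id map; 0 is reserved for background.
LABEL_TO_ID = {"daqu": 1, "fire_cycle": 2, "fissure": 3, "plaque": 4}

def labelme_json_to_mask(json_path: str) -> np.ndarray:
    """Rasterize Labelme polygon annotations into a class-id mask (H x W, uint8)."""
    with open(json_path, "r", encoding="utf-8") as f:
        ann = json.load(f)
    mask = Image.new("L", (ann["imageWidth"], ann["imageHeight"]), 0)
    draw = ImageDraw.Draw(mask)
    for shape in ann["shapes"]:
        cls = LABEL_TO_ID.get(shape["label"])
        if cls is None or shape.get("shape_type", "polygon") != "polygon":
            continue  # skip unknown labels and non-polygon shapes
        draw.polygon([tuple(p) for p in shape["points"]], fill=cls)
    return np.array(mask, dtype=np.uint8)
```

Masks produced this way can then be colormapped (e.g., with matplotlib) for the visual cross-checking step described above.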
To align the raw data with the network’s input requirements and optimize parameter initialization, a standardized preprocessing and augmentation pipeline was implemented. During preprocessing, two primary operations were performed: image size standardization and data augmentation. Due to the high resolution of the original images, feeding them directly into the model would result in excessive computational cost and memory usage. Therefore, to maintain structural integrity while improving efficiency, all images were uniformly resized to 512 × 512 pixels prior to model input [24].
In deep learning workflows, the generalization capability of a model depends heavily on the diversity and representativeness of the training data. When data volume is limited or distribution is narrow, overfitting becomes likely, leading to performance degradation on unseen samples. Data augmentation provides an effective strategy to enrich data variability by applying geometric transformations, lighting perturbations, or noise simulation. Prior studies have demonstrated that appropriate augmentation helps simulate real-world imaging conditions and encourages the model to learn more invariant and robust feature representations [25].
For this study, a task-specific augmentation pipeline was designed and implemented (a code sketch is given after the list), which includes [26]:
(i)
Geometric transformations: random horizontal and vertical flipping (p = 0.5), rotation (±15°), scaling (0.8–1.2), and random cropping;
(ii)
Photometric and color perturbations: controlled adjustments of brightness, contrast, and saturation to emulate changes in illumination and color cast, enhancing model adaptability to varying lighting conditions;
(iii)
Spatial composite augmentation: inspired by the Mosaic strategy, combining patches from multiple images to increase structural diversity and background complexity;
(iv)
Small-object targeted augmentation: designed specifically for low-proportion, fine-grained targets such as fissures and plaques, involving mask-guided local magnification and random cropping to increase the model’s exposure to small objects and improve segmentation precision.
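The geometric and photometric components of this pipeline map naturally onto the albumentations library; a minimal sketch is shown below. The flip, rotation, and scaling ranges mirror the values listed above, while the Mosaic-style composite (iii) and the mask-guided small-object augmentation (iv) are custom steps and are omitted here. The PadIfNeeded guard and the ColorJitter magnitudes are added assumptions, not values from the paper.

```python
import albumentations as A

# Sketch of items (i)-(ii) of the augmentation pipeline (albumentations assumed).
train_aug = A.Compose([
    A.HorizontalFlip(p=0.5),                        # (i) geometric transformations
    A.VerticalFlip(p=0.5),
    A.Rotate(limit=15, p=0.5),                      # rotation within ±15°
    A.RandomScale(scale_limit=0.2, p=0.5),          # scaling in 0.8-1.2
    A.PadIfNeeded(min_height=512, min_width=512),   # keep the crop size valid
    A.RandomCrop(height=512, width=512),
    A.ColorJitter(brightness=0.2, contrast=0.2,     # (ii) photometric perturbations
                  saturation=0.2, hue=0.0, p=0.5),  # (magnitudes illustrative)
])

# Image and mask are transformed jointly so labels stay aligned:
# out = train_aug(image=image, mask=mask); image, mask = out["image"], out["mask"]
```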
All augmentation parameters were refined through iterative experimentation to ensure sample diversity without introducing distortion or semantic shift. Ultimately, this study constructed a well-structured Daqu cross-sectional dataset comprising 3828 images, with balanced category distribution and standardized formatting, providing a solid foundation for subsequent semantic segmentation model training and evaluation. Based on this standardized data foundation, the core training hyperparameters were configured as follows: the AdamW optimizer was employed with an initial learning rate of $1 \times 10^{-4}$ and a weight decay of $1 \times 10^{-4}$. Due to the high-resolution input (512 × 512), the batch size was set to 4 to balance memory usage and gradient stability. The model was trained for 100 epochs, utilizing a multi-step learning rate decay (factor 0.5) at the 40th, 80th, and 100th epochs to ensure fine-grained convergence.
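The stated optimizer and schedule translate directly into PyTorch, as in the sketch below; model, train_loader, and train_one_epoch are assumed placeholders for the network and training loop, not code from the paper.

```python
import torch

# AdamW with lr = 1e-4, weight decay = 1e-4, and multi-step decay (factor 0.5).
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[40, 80, 100], gamma=0.5)  # halve the lr at these epochs

for epoch in range(100):
    train_one_epoch(model, train_loader, optimizer)  # assumed training loop
    scheduler.step()
```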

2.3. Construction of Image Segmentation Models

To enable automated and fine-grained segmentation of key structural regions in Daqu cross-section images—including the normal Daqu body, fire cycle, fissures, and microbial plaques—this study establishes a unified evaluation framework based on deep-learning semantic segmentation models. Daqu cross-section images exhibit pronounced structural heterogeneity, characterized by the coexistence of large homogeneous regions (e.g., the Daqu body) and small, irregularly shaped structures (e.g., elongated fissures and scattered microbial plaques), which poses significant challenges for accurate pixel-level segmentation.
To systematically investigate the suitability of different network architectures for this task, four representative segmentation models were selected for comparison: three CNN-based architectures of increasing structural complexity (U-Net as a classical baseline, U-Net++ as a refined architecture emphasizing dense feature fusion, and U2-Net as the primary model featuring nested multi-scale representations), together with the Transformer-based SegFormer, which couples a hierarchical Transformer encoder with a lightweight multilayer perceptron (MLP) decoder. This design allows for a comprehensive analysis of how different architectural strategies handle complex industrial images with severe class imbalance and weak-texture targets.
U-Net serves as a stable and widely adopted baseline due to its symmetric encoder–decoder structure and skip connections, which effectively preserve spatial information during downsampling and upsampling operations [27]. In this study, U-Net is used to quantify the performance improvements achieved by more advanced architectures when applied to Daqu cross-section segmentation. In recent years, U-Net has rarely been used as a standalone research entry point; instead, it is commonly integrated with various enhancement mechanisms to further improve model performance. Representative examples include enhanced U-Net variants that increase network depth by introducing additional convolutional, max-pooling, and transposed convolution layers [28], as well as Swin-UNet models that incorporate attention mechanisms or Transformer-based architectures [29]. These improvements have demonstrated superior feature representation capabilities across diverse application domains.
U-Net++ [30] enhances the original U-Net by introducing densely connected skip pathways between encoder and decoder stages, enabling multi-scale feature aggregation and improved gradient flow. This design facilitates better boundary delineation and feature reuse, making U-Net++ particularly suitable for structures with ambiguous contours, such as the fire cycle [31]. The overall network architecture is illustrated in Figure 2a.
U2-Net [32] introduces a more fundamental architectural innovation by embedding U-shaped structures within residual U-blocks (RSUs) throughout the network. Each RSU integrates standard and dilated convolutions to jointly capture fine-grained local details and broader contextual information. This nested multi-scale design allows U2-Net to maintain high feature fidelity for small-scale structures such as fissures and microbial plaques, without relying on pretrained backbones. A schematic overview of the U2-Net architecture is shown in Figure 2b.
For experimental consistency and reproducibility, all models were trained under identical settings. The dataset comprises 3828 annotated and augmented Daqu cross-sectional images. To avoid potential data leakage caused by multiple images captured from the same physical Daqu block, we performed a sample-wise split at the Daqu block level: all images belonging to the same block ID were assigned to the same subset, and the dataset was divided into training and validation sets with an 8:2 ratio. Data augmentation was applied only to the training set, while the validation set was kept unchanged to provide an unbiased evaluation on unseen Daqu blocks. All models were trained on the same NVIDIA GPU platform with a batch size of 4 for 200 epochs. The optimization employed AdamW with an initial learning rate of $1 \times 10^{-4}$ and a polynomial learning-rate decay schedule to improve training stability and convergence.
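A block-level split of this kind can be expressed with scikit-learn’s GroupShuffleSplit, which guarantees that all images sharing a block ID land on the same side of the split; image_paths and block_ids (one block ID per image) are assumed inputs.

```python
from sklearn.model_selection import GroupShuffleSplit

# 8:2 train/validation split grouped by physical Daqu block to prevent leakage.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, val_idx = next(splitter.split(image_paths, groups=block_ids))
train_files = [image_paths[i] for i in train_idx]
val_files = [image_paths[i] for i in val_idx]
```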
Given the substantial class imbalance in Daqu cross-sectional images—particularly for plaque regions, fissures, and fire cycles, which occupy only a small proportion of the image and contain far fewer pixels than background and normal Daqu regions—these small target regions have a disproportionate impact on overall quality assessment. Therefore, we incorporate targeted optimizations for the segmentation of these small and underrepresented regions, including an enhanced composite loss, data augmentation, and feature-level improvements. Specifically, we adopted an enhanced composite loss to address the severe class imbalance between dominant classes (background and normal Daqu) and minority classes (fire cycle, fissure, and plaque). The overall objective is defined as:
$$ L = 0.4\,L_{\text{Lovász}} + 0.3\,L_{\text{GDice}} + 0.3\,L_{\text{WCE}} \tag{1} $$
where $L_{\text{Lovász}}$ (Lovász-Softmax) encourages IoU-consistent region overlap, $L_{\text{GDice}}$ (generalized Dice) alleviates class-size bias, and $L_{\text{WCE}}$ (weighted cross-entropy) provides stable pixel-wise supervision. The cross-entropy term uses class weights computed from training-set pixel statistics via median frequency balancing, $w_c = \mathrm{median}(\{f_k\}) / (f_c + \epsilon)$, and pixels labeled as 255 are ignored. This design reduces the dominance of majority classes and improves robustness and sensitivity for fine-grained segmentation of minority structures.
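A minimal PyTorch sketch of this objective is given below. The Lovász-Softmax term is assumed to come from its public reference implementation (Berman et al.) and is injected as a callable; the generalized Dice term follows the usual inverse-squared-volume class weighting, and the median-frequency helper implements the $w_c$ definition above. This is a sketch under those assumptions, not the paper’s released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def median_frequency_weights(pixel_freqs, eps=1e-8):
    """w_c = median({f_k}) / (f_c + eps), from training-set pixel frequencies."""
    f = torch.as_tensor(pixel_freqs, dtype=torch.float32)
    return torch.median(f) / (f + eps)

class CompositeSegLoss(nn.Module):
    """Eq. (1): L = 0.4 * Lovász + 0.3 * GDice + 0.3 * WCE.

    `lovasz_softmax` is assumed to be the public reference implementation,
    called as lovasz_softmax(probas, labels, ignore=...)."""

    def __init__(self, class_weights, lovasz_softmax, ignore_index=255, eps=1e-6):
        super().__init__()
        self.wce = nn.CrossEntropyLoss(weight=class_weights, ignore_index=ignore_index)
        self.lovasz = lovasz_softmax
        self.ignore_index = ignore_index
        self.eps = eps

    def generalized_dice(self, logits, target):
        """Generalized Dice with 1/volume^2 class weighting; ignored pixels masked out."""
        n_cls = logits.shape[1]
        probs = F.softmax(logits, dim=1)
        valid = (target != self.ignore_index).unsqueeze(1).float()
        t = target.clamp(0, n_cls - 1)                      # keep one_hot in range
        onehot = F.one_hot(t, n_cls).permute(0, 3, 1, 2).float() * valid
        probs = probs * valid                               # ignored pixels contribute 0
        vol = onehot.sum(dim=(0, 2, 3))
        w = 1.0 / (vol * vol + self.eps)
        inter = (w * (probs * onehot).sum(dim=(0, 2, 3))).sum()
        union = (w * (probs + onehot).sum(dim=(0, 2, 3))).sum()
        return 1.0 - 2.0 * inter / (union + self.eps)

    def forward(self, logits, target):
        probs = F.softmax(logits, dim=1)
        return (0.4 * self.lovasz(probs, target, ignore=self.ignore_index)
                + 0.3 * self.generalized_dice(logits, target)
                + 0.3 * self.wce(logits, target))
```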

2.4. Evaluation Metrics

To comprehensively assess the performance of the segmentation models on Daqu cross-sectional images, several widely adopted semantic segmentation metrics were employed for quantitative evaluation. These include the mean Intersection over Union (mIoU), which measures the accuracy of regional overlap; the Dice Similarity Coefficient, which emphasizes the segmentation quality of foreground targets; and Pixel Accuracy (PA), which evaluates the proportion of correctly classified pixels. Together, these metrics provide a multi-dimensional and objective evaluation of model effectiveness and robustness [33,34].
Intersection over Union (IoU) quantifies the overlap between the predicted region and the ground truth, and is defined as:
$$ \mathrm{IoU} = \frac{TP}{TP + FP + FN} \tag{2} $$
where $TP$ represents true positives (correctly predicted target pixels), $FP$ denotes false positives (pixels incorrectly predicted as target), and $FN$ refers to false negatives (missed target pixels).
In this study, class-wise IoU for the three minority features (fire cycles, fissures, and microbial plaques) was treated as a key evaluation metric. The prioritization of these small-target metrics is essential because the ultimate objective of the framework is to isolate these quality features for subsequent parameterized extraction. Only by ensuring high IoU at the pixel level can the integrity of the extracted morphological parameters be guaranteed for reliable industrial grading.
For multi-class segmentation tasks, the mean IoU (mIoU) is computed as the average IoU across all classes:
$$ \mathrm{mIoU} = \frac{1}{K} \sum_{k=1}^{K} \mathrm{IoU}_k \tag{3} $$
where $K$ is the total number of classes and $\mathrm{IoU}_k$ is the IoU of the $k$-th class. As one of the most commonly used metrics in semantic segmentation, mIoU provides a comprehensive evaluation of model performance across different categories. In this study, it was also used as a global indicator of segmentation stability across different fermentation batches.
The Dice Similarity Coefficient (Dice) measures the similarity between the predicted region and the ground truth, and is defined as:
$$ \mathrm{Dice} = \frac{2\,TP}{2\,TP + FP + FN} \tag{4} $$
Compared with IoU, Dice is more sensitive to small-scale structures such as fissures and plaque regions, making it particularly useful for evaluating fine-grained feature segmentation. Because the quality-critical targets (fissures and plaques) occupy only a small proportion of the image, the Dice coefficient is treated here as the stricter metric: it penalizes losses of structural integrity in these small targets far more heavily than standard pixel accuracy does.
Pixel Accuracy (PA) evaluates the proportion of pixels that are correctly classified, and is calculated as:
$$ \mathrm{PA} = \frac{TP + TN}{TP + TN + FP + FN} \tag{5} $$
where $TN$ denotes true negatives, i.e., pixels correctly identified as non-target regions. This metric reflects the overall classification accuracy of the model.
The selection of these metrics provides a robust evaluation framework tailored to the specific challenges of Daqu quality inspection. By jointly employing the above metrics, model performance can be assessed from both a macroscopic perspective—overall segmentation quality—and a microscopic perspective—its effectiveness in detecting small or subtle target structures. These metrics collectively provide objective and reliable support for model comparison and performance optimization.
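For concreteness, all four quantities in Equations (2)–(5) can be derived from a single confusion matrix accumulated over the validation set. The NumPy sketch below assumes integer label maps with 255 as the ignore value; PA is computed as the overall fraction of correctly classified pixels.

```python
import numpy as np

def confusion_matrix(pred, gt, num_classes, ignore_index=255):
    """Accumulate a K x K confusion matrix (rows: ground truth, cols: prediction)."""
    valid = gt != ignore_index
    idx = num_classes * gt[valid].astype(np.int64) + pred[valid].astype(np.int64)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def segmentation_metrics(cm):
    """Per-class IoU and Dice, plus mIoU and PA, following Eqs. (2)-(5)."""
    tp = np.diag(cm).astype(np.float64)
    fp = cm.sum(axis=0) - tp                  # predicted as class c, actually another class
    fn = cm.sum(axis=1) - tp                  # class c pixels missed by the prediction
    iou = tp / np.maximum(tp + fp + fn, 1)
    dice = 2 * tp / np.maximum(2 * tp + fp + fn, 1)
    pa = tp.sum() / max(cm.sum(), 1)          # overall pixel accuracy
    return {"IoU": iou, "mIoU": iou.mean(), "Dice": dice, "PA": pa}
```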

3. Results

3.1. Selection of Cross-Sectional Features and Parameter Extraction

The semantic segmentation categories in this study were determined based on the sensory evaluation standards used by a large Chinese Baijiu producer in routine industrial practice. According to long-term production experience and previous studies, characteristics such as cross-sectional color, structural morphology, and plaque distribution have consistently served as key indicators for assessing Daqu quality [35]. In the traditional grading system, sensory attributes typically account for 60% of the total score, among which cross-sectional features and pizhang (outer-layer) thickness contribute approximately 25% and 12%, respectively, highlighting their critical importance in quality assessment.
Based on enterprise quality standards and expert recommendations, four sensory features were finally selected as the primary targets for semantic segmentation in this study, namely the normal Daqu body region, fire cycle, fissure structures, and plaque regions, as illustrated in Figure 3. These features not only visually reflect the physical and biological states of Daqu, but also contain internal information closely associated with fermentation performance. Parameter extraction based on the segmentation results is described as follows, with a consolidated code sketch given after the list:
(1)
Plaque Area Ratio: Microbial plaques are contamination regions caused by undesired microorganisms and represent an important factor in evaluating the sanitary condition of Daqu. Common contaminants include Penicillium, Monascus, Aspergillus flavus, and Aspergillus niger [36]. These microorganisms often proliferate when the Daqu block cools to room temperature while retaining high internal moisture, forming dominant colonies. In industrial evaluation, the proportion of plaque area in the cross-section is routinely used as an indicator of Daqu quality and potential microbial risk. The plaque area ratio is defined as:
$$ R_{\text{plaque}} = \frac{N_{\text{plaque}}}{N_{\text{Daqu}}} \tag{6} $$
where $N_{\text{plaque}}$ denotes the number of pixels belonging to plaque regions, and $N_{\text{Daqu}}$ denotes the total number of pixels within the Daqu cross-section.
(2)
Pizhang Thickness: The pizhang refers to the outer layer of Daqu formed during fermentation and storage, consisting primarily of partially ungelatinized starch on the Daqu surface [12]. Its thickness is strongly associated with ventilation properties, moisture retention capacity, and fermentation stability. The fire cycle is a ring-like structure formed by Maillard reactions between the surface and middle layers during temperature rise [37], and is commonly used as a practical visual boundary to distinguish the pizhang from the inner Daqu region. In this study, the thickness of the pizhang is computed by measuring the pixel-wise minimum distance between the outer boundary of the fire cycle and the external contour of the Daqu, using the following definition:
$$ T_{\text{pizhang}} = \frac{1}{|P|} \sum_{p \in P} \min_{e \in E} d(p, e) \tag{7} $$
where $P$ is the set of boundary pixels of the fire cycle region, $|P|$ is the number of such pixels, $E$ is the set of outer contour pixels of the Daqu, and $d(p, e)$ denotes the Euclidean distance between pixel $p$ and pixel $e$. This averaging avoids the influence of extreme points and provides a robust measure of pizhang thickness.
(3)
Fissure Length: Fissures are common structural defects formed during the production and storage of Daqu [38]. Their number and length reflect the structural stability of the block. Fissure formation is influenced by moisture content, drying conditions, and internal stress distribution. Excessive fissuring may compromise mechanical strength and disrupt the diffusion of gases and moisture within the Daqu, thereby affecting microbial community dynamics. Total fissure length is computed as:
$$ L_{\text{fissure}} = \sum_{i=1}^{n} L_i \tag{8} $$
where $L_i$ is the length of the $i$-th fissure, obtained via connected-component analysis and skeletonization of the fissure region, and $n$ is the total number of fissures. This approach reduces measurement variability caused by irregular fissure boundaries.
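To make the three definitions concrete, the consolidated sketch below implements Equations (6)–(8) on a predicted class-id mask using OpenCV, SciPy, and scikit-image. The class ids are hypothetical, the distance transform approximates the pixel-wise minimum distance to the outer contour (assuming background occurs only outside the cross-section), and the skeleton pixel count serves as a length proxy.

```python
import cv2
import numpy as np
from scipy.ndimage import distance_transform_edt
from skimage.morphology import skeletonize

# Hypothetical class ids, assumed to match the annotation scheme.
BG, DAQU, FIRE_CYCLE, FISSURE, PLAQUE = 0, 1, 2, 3, 4

def plaque_area_ratio(mask):
    """Eq. (6): plaque pixels over all cross-section (non-background) pixels."""
    daqu_px = np.count_nonzero(mask != BG)
    return np.count_nonzero(mask == PLAQUE) / max(daqu_px, 1)

def pizhang_thickness_px(mask):
    """Eq. (7), approximated: mean distance from fire-cycle boundary pixels to the
    outer Daqu contour, read off a Euclidean distance transform of the foreground."""
    dist_to_outer = distance_transform_edt(mask != BG)     # px distance to background
    fire = (mask == FIRE_CYCLE).astype(np.uint8)
    boundary = fire - cv2.erode(fire, np.ones((3, 3), np.uint8))  # 1-px boundary ring
    pts = boundary.astype(bool)
    return float(dist_to_outer[pts].mean()) if pts.any() else 0.0

def total_fissure_length_px(mask):
    """Eq. (8): total skeleton pixel count over all fissure components, in px."""
    skel = skeletonize(mask == FISSURE)
    return int(skel.sum())
```

Note that pizhang_thickness_px samples the full fire-cycle boundary ring rather than only its outer side; restricting the computation to the outer boundary, as in the formal definition, would require an additional contour-orientation step.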

3.2. Model Training Performance

To establish a rigorous and systematic comparative framework, this study implemented a multi-tiered evaluation scheme involving four representative architectures: the Transformer-based SegFormer and three convolutional neural network (CNN)-based models (U-Net, U-Net++, and U2-Net). Under identical experimental settings, the performance analysis was conducted at two granularities—overall metrics for global stability and class-wise IoU for small-target precision. This hierarchical approach ensures that the superiority of U2-Net is verified through both cross-paradigm benchmarking and fine-grained feature validation, providing a more solid foundation for industrial quality assessment.
As illustrated in Figure 4, all four models achieved stable convergence within approximately 100 epochs; however, notable differences were observed in convergence speed and training stability. The SegFormer model exhibited a relatively slow but steady decline in loss, benefiting from the global receptive field of the attention mechanism, yet it required more epochs to reach a plateau compared to nested CNN structures. U-Net and U-Net++ exhibited similar decreasing trends in both training and validation losses and maintained relatively stable convergence during the later training stages, although moderate fluctuations were still present. This behavior indicates a certain sensitivity to local perturbations when modeling complex structural features. In contrast, U2-Net demonstrated the fastest reduction in validation loss and reached a lower loss level at an early stage of training. Its loss curve was smoother with no pronounced oscillations, reflecting clear advantages in multiscale feature extraction, contextual modeling, and gradient propagation, which enable more efficient learning of complex Daqu cross-sectional structures.
To evaluate the stability of the performance gains, all results in Table 2 and Table 3 are reported as Mean ± Standard Deviation (SD) from five independent trials with different random seeds. The standard deviations of U2-Net remained small, and the performance margins over the baseline models were larger than these variance ranges.
In terms of quantitative evaluation (see Figure 5 and Table 2), U2-Net achieved the best overall performance across multiple key metrics related to regional segmentation accuracy. On the validation set, its mIoU reached 85.48%, exceeding SegFormer, U-Net, and U-Net++ by 6.25, 7.16, and 9.00 percentage points, respectively. The Dice coefficient also showed a notable improvement, reaching 98.09%, an increase of 1.88 to 2.74 percentage points over the other models. This indicates that U2-Net is exceptionally robust for identifying small-scale targets such as the fire cycle, fissures, and microbial plaques.
Interestingly, while the Transformer-based SegFormer achieved the highest PA (98.91%), its mIoU was substantially lower than that of U2-Net. This discrepancy is mainly attributable to the pronounced class imbalance in Daqu cross-sectional images, where the background and main Daqu body dominate the pixel count. As a result, PA can remain high even when small structural features are misclassified. Therefore, mIoU and Dice are more informative indicators for evaluating critical sparse regions.
Furthermore, the analysis of model complexity in Table 2 reveals that SegFormer is the most computationally efficient, with only 7.03 M parameters and 5.21 G FLOPs. However, for Daqu quality evaluation where precision is paramount, the superior feature representation capability of U2-Net justifies its higher computational cost (74.38 G FLOPs).
In the context of real-world implementation, inference time and hardware requirements are critical factors. Although U2-Net involves higher FLOPs, its inference speed on modern GPU-accelerated edge devices remains sufficient for the throughput requirements of Daqu production lines, where quality inspection typically occurs at a frequency of a few samples per second. The superior feature representation of U2-Net—especially for capturing subtle fissures and microbial plaques—justifies the additional computational cost, as it minimizes the risk of misgrading high-value Daqu blocks. Conversely, SegFormer presents a viable alternative for mobile-based inspection or resource-constrained scenarios where inference latency and energy consumption are more prioritized than extreme segmentation precision.
Therefore, the overall superiority of U2-Net in balancing global context and local detail remains statistically and practically significant for standardized industrial quality assessment.
Further analysis of per-class IoU trends for small-scale structural features (see Figure 6 and Table 3) reveals the distinct performance of each architecture when handling fine-grained categories. As shown in Figure 6A–C, U2-Net demonstrates the best predictive stability and the highest final IoU values across all three challenging targets—fire cycles, fissures, and microbial plaques—with IoU values consistently exceeding 83% in the later training stages. This stability suggests that the nested U-structure (RSU) is particularly effective at maintaining feature fidelity for targets with weak textures and ambiguous boundaries by integrating multi-scale receptive fields.
The Transformer-based SegFormer, while being the most lightweight model (7.03 M Params, 5.21 G FLOPs), exhibited a polarized performance profile in small-object segmentation. Specifically, SegFormer achieved highly competitive results for fire cycles and microbial plaques (Figure 6A,C), benefiting from its global attention mechanism that captures complex semantic dependencies. However, for the fissure target (Figure 6B), its IoU remained lower than those of the other models. This indicates that while attention-based models excel at identifying textured colonies, they may lack the specific inductive bias required for the precise delineation of extremely slender, line-like structures such as fissures.
By comparison, U-Net exhibited slightly better overall performance than U-Net++, achieving more stable segmentation results for the fire cycle and the main Daqu body. Although U-Net++ yielded lower final segmentation accuracy than U-Net and U2-Net, it benefits from channel compression via 1 × 1 convolutions, resulting in a reduced parameter count and lower computational complexity (Table 2). For real-world implementation, SegFormer and U-Net++ offer advantages in inference latency and may be suitable for mobile-based inspection. However, U2-Net remains more appropriate for high-precision quality assessment. Its stronger segmentation performance may justify the additional computational cost in standardized industrial production lines.
To further illustrate the segmentation performance of the four models, representative Daqu cross-sectional images not involved in training were selected for visual comparison (Figure 7). Overall, SegFormer and U2-Net achieved the most favorable results, with all semantic regions being effectively identified and delineated. In contrast, U-Net and U-Net++ exhibited noticeable limitations, particularly in Sample 1 and Sample 5, where discontinuities were observed in the segmented regions. This fragmentation indicates a reduced capability of these earlier CNN architectures to maintain global spatial coherence when dealing with complex structural features.
A detailed inspection of Sample 1 reveals that while both SegFormer and U2-Net successfully captured the slender microbial plaques, the masks generated by U2-Net were more closely aligned with the ground truth. This region is characterized by weak texture, low contrast, and an elongated morphology, representing a challenging small-object case. The superior fidelity of U2-Net in this scenario demonstrates that its nested U-structure is more adept at preserving fine-grained details and structural integrity than the attention-based mechanism of SegFormer under extreme scale variations.
Furthermore, in the third-row example (Sample 5), the fire cycle boundaries predicted by U-Net and U-Net++ suffered from severe edge fragmentation due to blurred boundaries and color similarity. While SegFormer provided a more continuous region, U2-Net maintained the most accurate and coherent contour. These visual comparisons are consistent with the quantitative results. They indicate that U2-Net is the most reliable architecture among the tested models for capturing the intricate and heterogeneous morphological features of Daqu cross-sections. This provides a solid foundation for objective quality assessment.

3.3. Ablation Study and Performance Analysis

To further validate the internal consistency and rationality of the proposed framework, a comprehensive ablation study was performed as an extension of the comparative analysis. This stage focuses on verifying the task-specific data augmentation strategies and the composite loss function ($L_{\text{WCE}} + L_{\text{GDice}} + L_{\text{Lovász}}$). The results of the ablation study, summarized in Table 4 and Table 5, demonstrate how these specific components address the class imbalance and structural complexity identified in the cross-model comparison, thereby completing the logic of the evaluation framework.
The transition from single-component losses (M1–M3) to the proposed composite loss function (M7) led to a substantial improvement in segmentation stability and accuracy. While standard weighted cross-entropy (M1) provides stable pixel-wise supervision, it often struggles with the severe class imbalance inherent in Daqu cross-sections, where small features like fissures and plaques are easily suppressed.
  • By integrating $L_{\text{GDice}}$ and $L_{\text{Lovász}}$, the model’s mIoU increased from 84.01% (M1) to 85.48% (M7).
  • More importantly, the Dice coefficient rose to 98.09%, indicating that the composite loss effectively encourages the model to focus on regional overlap and the structural integrity of minority classes.
The introduction of the multi-strategy data augmentation pipeline (M8) provided the most significant boost to the model’s overall performance.
  • Comparing M7 and M8, the addition of data augmentation further elevated the mIoU by 2.06 percentage points, reaching a peak performance of 87.54%.
  • The enhancement is particularly pronounced for the “Fissure” category, where the IoU rose from 74.81% (M7) to 79.46% (M8).
  • This improvement validates that geometric and photometric perturbations, especially the small-object targeted augmentation, effectively enriched the morphological diversity of the training set. This allowed U2-Net to learn more robust and invariant features, mitigating overfitting on the limited samples of fine-grained defects.
In summary, the combination of the composite loss function and the data augmentation strategy improved both pixel-level accuracy and sensitivity to complex weak-texture targets. The full configuration (M8) represents the optimal balance for the objective quantitative assessment of Daqu quality features, providing a reliable data foundation for subsequent industrial grading.
The effectiveness of the proposed strategy in mitigating class imbalance is particularly evident when comparing the performance of minority classes against the majority classes. In the Daqu dataset, the “Background” and “Daqu body” occupy over 90% of the total pixels, while “Fissure” and “Plaque” represent less than 1% and 8%, respectively. Without specific handling (M1), the model tends to exhibit high pixel accuracy but fails to capture the precise morphology of these sparse features.
Overall, all evaluated models demonstrated stable performance in segmenting the background and major quality-related features of Daqu cross-sections, with U2-Net achieving the most favorable results across all evaluation metrics. Nevertheless, consistent performance bottlenecks remain for extremely small targets (e.g., sparse microbial plaques) and regions with complex or discontinuous boundaries (e.g., non-continuous fire cycles). These findings suggest that enhancing model sensitivity to weak textures and fine-scale structures remains a critical direction for future research.

3.4. Extraction and Visualization of Key Morphological Parameters

To translate pixel-level segmentation into sample-level indicators, we developed a Mask-driven Quantification Module (MQM). This post-processing framework systematically processes predicted masks through three operations: geometric reconstruction (e.g., fissure skeletonization), topological analysis (e.g., boundary distance calculation), and statistical aggregation of area ratios. As illustrated in Figure 8, the MQM overlays these masks onto original images to generate quantifiable visual representations, enabling precise discrimination of structural differences and the objective extraction of quality parameters.
(a)
Visualization of pizhang thickness: The average pizhang thickness was computed as the mean of the minimum Euclidean distances from each point on the outer boundary of the fire-cycle region (P) to the Daqu outer contour (E), following Equation (7). The measured distances were visualized using line markers in Figure 8a, and all values were reported in pixel units. This visualization highlights spatial variations in pizhang thickness, providing insights into the thermal response intensity and the maturity uniformity of the Daqu body.
(b)
Visualization of plaque area ratio: A pseudo-color mapping was applied to highlight plaque regions, with color gradients indicating their spatial distribution (Figure 8b). The model accurately identified plaque regions, and their area ratios were presented in percentage form, allowing an intuitive assessment of contamination severity.
(c)
Fissure morphology and length analysis: Skeletonization was applied to the fissure regions to extract their main structural lines (Figure 8c). Fissure areas were marked with green lines, while the skeletons were indicated by white solid lines. The corresponding length-related parameters were also recorded in pixel units.
To further connect the segmentation outputs with quantitative evaluation, we present representative samples to visualize the complete post-processing pipeline. As shown in Figure 9, the first column displays the input Daqu cross-sectional images, the second column shows the corresponding predicted semantic masks, and the third column illustrates the parameter extraction visualization derived from the masks. Specifically, the segmented regions are used to compute pixel-level quality descriptors, including the average pizhang thickness (estimated from the geometric relationship between the Daqu outer contour and the fire-cycle boundary), the total fissure length (measured from the fissure mask after skeletonization), and plaque-related area ratios. The resulting numerical metrics are summarized in Table 6, which provides a direct correspondence between the predicted masks and the final parameterized indicators. Overall, the combined visual and quantitative results demonstrate that the proposed framework can translate semantic segmentation into interpretable, sample-level quality metrics for consistent assessment of Daqu cross-sections.
All morphological parameters extracted in this section, such as pizhang thickness ($T_{\text{pizhang}}$) and total fissure length ($L_{\text{fissure}}$), are quantified in pixel units (px). To ensure the framework’s industrial applicability, these pixel-based measurements can be converted to real-world physical dimensions (e.g., millimeters) by applying a spatial resolution factor, expressed as mm/px. This calibration is typically achieved by placing a standard reference object within the imaging field or through pre-defined camera intrinsic and extrinsic calibration, ensuring that the semantic outputs can be translated into standardized quality metrics for production control.
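Under such a calibration, the conversion itself is a single multiplication; the sketch below is illustrative, with the reference lengths as assumed inputs rather than measured values.

```python
def to_millimeters(value_px: float, ref_len_mm: float, ref_len_px: float) -> float:
    """Convert a pixel-unit length to millimeters via the mm/px resolution factor."""
    return value_px * (ref_len_mm / ref_len_px)

# Example: a 120 px pizhang thickness with a 50 mm reference spanning 830 px
thickness_mm = to_millimeters(120.0, ref_len_mm=50.0, ref_len_px=830.0)  # ≈ 7.2 mm
```

Area-based indicators such as the plaque area ratio are dimensionless and need no conversion.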
In summary, the proposed feature parameter extraction and visualization strategy based on U2-Net segmentation effectively bridges pixel-level classification and quantitative quality assessment of Daqu cross-sections. By automatically extracting and visualizing key parameters—such as pizhang thickness, plaque area ratio, and fissure morphology and length—the method enhances the interpretability of complex structural features and provides a unified, objective quantitative description of Daqu quality characteristics. Compared with traditional experience-driven evaluations, the proposed approach reduces subjectivity and improves reproducibility, offering reliable technical support for standardized quality assessment and digital analysis of Daqu, as well as a solid data foundation for subsequent quality grading and process control.

4. Discussion and Conclusions

Daqu quality strongly affects fermentation efficiency, flavor formation, and final product consistency in Baijiu production. However, traditional evaluation still relies heavily on manual sensory judgment, which is subjective and poorly reproducible. To address this limitation, we developed a deep semantic segmentation-based framework for the objective evaluation of Luzhou-flavor Daqu. By integrating pixel-level annotated data with optimized deep neural networks, the proposed method establishes a comprehensive technical pipeline for the automated recognition and quantitative analysis of quality-related features, including fire cycles, fissures, and microbial plaques. This framework facilitates a critical transition from qualitative sensory interpretation to a robust, data-driven pathway for intelligent quality assessment in Baijiu fermentation.
Experimental results confirm that U2-Net, with the optimal M8 configuration, achieves superior performance with an mIoU of 87.54% and a Dice score of 98.30%. Compared to SegFormer, U-Net, and U-Net++, U2-Net demonstrates enhanced multi-scale feature fusion, maintaining IoU values above 80% for small and complex targets. The narrow standard deviation intervals (±0.17% for mIoU) across independent trials underscore the statistical robustness of these performance gains.
A critical challenge addressed is class imbalance, as background pixels exceed 90% while features like fissures represent less than 1%. By integrating $L_{\text{WCE}}$ for reweighting and $L_{\text{Lovász}}$ for IoU optimization, the framework effectively penalizes minority-class misclassifications. This is evidenced by the “Fissure” IoU increasing from 73.16% to 79.46%, proving that the composite loss shifts the model’s focus toward quality-critical structural integrity rather than mere pixel accuracy.
This research is significant because it shifts Daqu evaluation from subjective sensory interpretation to objective data-driven analysis. By introducing semantic segmentation, the proposed framework moves beyond the black-box grading behavior of previous classification models. It enables pixel-level isolation of critical features and converts morphological traits into measurable parameters, such as pizhang thickness and fissure density. This quantitative description improves inspection efficiency and consistency compared with manual grading. Furthermore, such high-fidelity data serves as a core sensing component for smart manufacturing, supporting potential real-time feedback for environmental adjustments in intelligent fermentation systems. The modular nature of the proposed MQM architecture also ensures that this pipeline can be extended to other Baijiu flavors, laying a technical foundation for the digital transformation of the brewing industry.
Despite these outcomes, several limitations remain. First, the current morphological parameters are measured in pixel units (px). While this ensures internal consistency, they lack direct correspondence with physical dimensions. Future work will focus on integrating camera calibration and reference scale objects to establish a spatial resolution factor (mm/px) [39,40]. This mapping, potentially combined with 3D reconstruction techniques, will enable precise millimeter-level quantification for practical industrial deployment and standardized Daqu grading [41]. Second, this study did not include paired traditional quality-evaluation records for the same Daqu samples. Therefore, a direct quantitative comparison between the proposed model-based measurements and conventional expert-based evaluation methods could not be conducted at this stage. In future studies, paired datasets containing both image-based morphological parameters and traditional sensory or grading results will be collected to further validate the practical applicability of the proposed framework in real production scenarios.

Author Contributions

Z.S.: Writing—original draft, Methodology, Software, Formal analysis, Data curation, Visualization. Y.D., C.W., X.Z., C.Y. and A.S.: Validation, Resources. J.M.: Writing—review & editing, Conceptualization, Validation, Resources, Funding acquisition. S.L.: Writing—review & editing, Conceptualization, Validation, Investigation, Project administration, Funding acquisition, Resources, Supervision. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by National Key Research and Development Program of China (2022YFD2101204); National Natural Science Foundation of China (22138004, 22422807); the Zhejiang Province and Local Collaborative Project (2024SDXT001-4), and the National Engineering Research Center of Solid-state Brewing.

Data Availability Statement

The code used in this study is publicly available at https://github.com/Adli1219/Luzhou-Flavor-Liquor-Daqu-Cross-section-Semantic-Segmentation (accessed on 1 March 2026). The data that support the findings of this study are not publicly available due to confidentiality reasons but are available from the corresponding author upon reasonable request.

Conflicts of Interest

Authors Yi Dong, Chao Wang, and Xiu Zhang were employed by the company Luzhou Laojiao Group Co., Ltd., and author Aibao Sun was employed by the company Zhejiang Guyuelongshan Shaoxing Wine Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Jin, G.Y.; Zhu, Y.; Xu, Y. Mystery behind Chinese liquor fermentation. Trends Food Sci. Technol. 2017, 63, 18–28. [Google Scholar] [CrossRef]
  2. He, G.Q.; Huang, J.; Wu, C.D.; Jin, Y.; Zhou, R.Q. Bioturbation effect of fortified Daqu on microbial community and flavor metabolite in Chinese strong-flavor liquor brewing microecosystem. Food Res. Int. 2020, 129, 108851. [Google Scholar] [CrossRef]
  3. Hou, Q.; Wang, Y.; Qu, D.; Zhao, H.; Tian, L.; Zhou, J.; Liu, J.; Guo, Z. Microbial communities, functional, and flavor differences among three different-colored high-temperature Daqu: A comprehensive metagenomic, physicochemical, and electronic sensory analysis. Food Res. Int. 2024, 184, 114257. [Google Scholar] [CrossRef]
  4. Zhang, Y.D.; Shen, Y.; Niu, J.; Ding, F.; Ren, Y.; Chen, X.X.; Han, B.Z. Bacteria-induced amino acid metabolism involved in appearance characteristics of high-temperature Daqu. J. Sci. Food Agric. 2023, 103, 243–254. [Google Scholar] [CrossRef]
  5. Song, Z.L.; Li, Y.B.; Zhao, H.Y.; Liu, X.G.; Ding, H.L.; Ding, Q.S.; Ma, D.N.; Liu, S.P.; Mao, J. Application of computer vision techniques to fermented foods: An overview. Trends Food Sci. Technol. 2025, 160, 104982. [Google Scholar] [CrossRef]
  6. Sardari, H.; Firouz, M.S.; Hosseinpour, S.; Bohlol, P. Image-based discrimination of ultrasound-assisted frozen meat using deep learning. Future Foods 2025, 12, 100822. [Google Scholar] [CrossRef]
  7. Fan, S.X.; Li, J.B.; Zhang, Y.H.; Tian, X.; Wang, Q.Y.; He, X.; Zhang, C.; Huang, W.Q. On line detection of defective apples using computer vision system combined with deep learning methods. J. Food Eng. 2020, 286, 110102. [Google Scholar] [CrossRef]
  8. Davis, K.A.; Harris, C. Microbial evaluation of inoculated fresh meat utilizing imaging and computer vision. J. Anim. Sci. 2025, 103, 510–511. [Google Scholar] [CrossRef]
  9. Xia, Y.Y.; Tian, J.P.; Huang, D.; Wang, J.; He, K.L.; Xie, L.L.; Hu, X.J.; Yang, H.L. Hyperspectral Reconstruction in Combination With a Crown Porcupine Optimization-Optimized Support Vector Regression (CPO-SVR) Machine Learning Model for Predicting the Total Acid Content of Daqu. J. Food Process Eng. 2025, 48, e70172. [Google Scholar] [CrossRef]
  10. Huang, H.P.; Hu, X.J.; Tian, J.P.; Chen, P.; Huang, D. Multigranularity cascade forest algorithm based on hyperspectral imaging to detect moisture content in Daqu. J. Food Process Eng. 2021, 44, e13633. [Google Scholar] [CrossRef]
  11. Jiang, X.N.; Hu, X.J.; Huang, H.P.; Tian, J.P.; Bu, Y.H.; Huang, D.; Luo, H.B. Detecting total acid content quickly and accurately by combining hyperspectral imaging and an optimized algorithm method. J. Food Process Eng. 2021, 44, e13844. [Google Scholar] [CrossRef]
  12. Zhao, M.K.; Han, C.Y.; Xue, T.H.; Ren, C.; Nie, X.; Jing, X.; Hao, H.Y.; Liu, Q.F.; Jia, L.Y. Establishment of a Daqu Grade Classification Model Based on Computer Vision and Machine Learning. Foods 2025, 14, 668. [Google Scholar] [CrossRef]
  13. Wang, T.; Huang, R.L.; Wei, Z.; Xie, M.Z.; Chen, L.P.; Wang, W.T.; Yun, Y.H. Multimodal nondestructive testing of tender coconut water quality using spectroscopy, computer vision, and deep learning. Food Control 2026, 180, 111660. [Google Scholar] [CrossRef]
  14. Song, W.R.; Wang, H.; Yun, Y.H. Smartphone video imaging: A versatile, low-cost technology for food authentication. Food Chem. 2025, 462, 140911. [Google Scholar] [CrossRef]
  15. Ghosh, S.; Das, N.; Das, I.; Maulik, U. Understanding Deep Learning Techniques for Image Segmentation. ACM Comput. Surv. 2019, 52, 81. [Google Scholar] [CrossRef]
  16. Isensee, F.; Jaeger, P.F.; Kohl, S.A.A.; Petersen, J.; Maier-Hein, K.H. nnU-Net: A self-configuring method for deep learning-based biomedical image segmentation. Nat. Methods 2021, 18, 203–211. [Google Scholar] [CrossRef] [PubMed]
  17. Tian, L.; Zhong, X.R.; Chen, M. Semantic Segmentation of Remote Sensing Image Based on GAN and FCN Network Model. Sci. Program. 2021, 2021, 9491376. [Google Scholar] [CrossRef]
  18. Lukács, H.I.; Beregi, B.Z.; Porteleki, B.; Fischl, T.; Botzheim, J. Attention U-Net-based semantic segmentation for welding line detection. Sci. Rep. 2025, 15, 1314. [Google Scholar] [CrossRef]
  19. Bastian, M.; Angela, M.-A.; Constanza, A.; Eduardo, A. Lightweight DeepLabv3+ for Semantic Food Segmentation. Foods 2025, 14, 1306. [Google Scholar]
  20. Cao, M.Y.; Tang, F.F.; Ji, P.; Ma, F.Y. Improved real-time semantic segmentation network model for crop vision navigation line detection. Front. Plant Sci. 2023, 14, 1140560. [Google Scholar] [CrossRef]
  21. Kong, X.Y.; Sun, X.H.; Wang, Y.Z.; Peng, R.Y.; Li, X.Y.; Yang, Y.H.; Lv, Y.R.; Tseng, S.P. Food Calorie Estimation System Based on Semantic Segmentation Network. Sens. Mater. 2023, 35, 2013–2033. [Google Scholar] [CrossRef]
  22. Gonzalez, B.; Garcia, G.; Velastin, S.A.; Gholamhosseini, H.; Tejeda, L.; Farias, G. Automated Food Weight and Content Estimation Using Computer Vision and AI Algorithms. Sensors 2024, 24, 7660. [Google Scholar] [CrossRef]
  23. Wen, Y.; Xue, J.L.; Sun, H.; Song, Y.; Lv, P.F.; Liu, S.H.; Chu, Y.Y.; Zhang, T.Y. High-precision target ranging in complex orchard scenes by utilizing semantic segmentation results and binocular vision. Comput. Electron. Agric. 2023, 215, 108440. [Google Scholar] [CrossRef]
  24. Nima, N.; Ali, R.; Hamed, S.; Soleiman, H. Detection of external defects of tomato crop using appearance parameters by convolutional neural networks. Future Foods 2025, 11, 100611. [Google Scholar] [CrossRef]
  25. Casado-García, A.; Domínguez, C.; García-Domínguez, M.; Heras, J.; Inés, A.; Mata, E.; Pascual, V. CLoDSA: A tool for augmentation in classification, localization, detection, semantic segmentation and instance segmentation tasks. BMC Bioinform. 2019, 20, 323. [Google Scholar] [CrossRef] [PubMed]
  26. Abu Alhaija, H.; Mustikovela, S.K.; Mescheder, L.; Geiger, A.; Rother, C. Augmented Reality Meets Computer Vision: Efficient Data Generation for Urban Driving Scenes. Int. J. Comput. Vis. 2018, 126, 961–972. [Google Scholar] [CrossRef]
  27. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015; Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F., Eds.; Springer International Publishing: Cham, Switzerland, 2015; pp. 234–241. [Google Scholar]
  28. Roy, K.; Chaudhuri, S.S.; Pramanik, S. Deep learning based real-time Industrial framework for rotten and fresh fruit detection using semantic segmentation. Microsyst. Technol. 2021, 27, 3365–3375. [Google Scholar] [CrossRef]
  29. Zang, H.C.; Wang, C.S.; Zhao, Q.; Zhang, J.; Wang, J.M.; Zheng, G.Q.; Li, G.Q. Constructing segmentation method for wheat powdery mildew using deep learning. Front. Plant Sci. 2025, 16, 1524283. [Google Scholar] [CrossRef]
30. Zhou, Z.W.; Siddiquee, M.M.R.; Tajbakhsh, N.; Liang, J.M. UNet++: A Nested U-Net Architecture for Medical Image Segmentation. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support; Springer: Cham, Switzerland, 2018; Volume 11045, pp. 3–11. [Google Scholar]
  31. Zhou, Z.W.; Siddiquee, M.M.R.; Tajbakhsh, N.; Liang, J.M. UNet++: Redesigning Skip Connections to Exploit Multiscale Features in Image Segmentation. IEEE Trans. Med. Imaging 2020, 39, 1856–1867. [Google Scholar] [CrossRef] [PubMed]
  32. Qin, X.B.; Zhang, Z.C.; Huang, C.Y.; Dehghan, M.; Zaiane, O.R.; Jagersand, M. U2-Net: Going deeper with nested U-structure for salient object detection. Pattern Recognit. 2020, 106, 107404. [Google Scholar] [CrossRef]
  33. Liu, Y.H.; Yao, J.; Lu, X.H.; Xie, R.P.; Li, L. DeepCrack: A deep hierarchical feature learning architecture for crack segmentation. Neurocomputing 2019, 338, 139–153. [Google Scholar] [CrossRef]
  34. Shelhamer, E.; Long, J.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 640–651. [Google Scholar] [CrossRef]
35. Can, D.; Ruijie, G.; Yongwei, Z.; Lihong, M.; Miankun, W.; Pulin, L.; Jiahui, C.; Peiwen, F. Relationship between sensory indexes, physicochemical indexes, microbial community and volatile compounds in high-temperature Daqu. Food Ferment. Ind. 2022, 23, 78–85. [Google Scholar]
  36. Wang, X.; Deng, X.; Wan, Q.; Luo, W.; Zhao, R.; Gao, Z.; Zhang, Y.; Zhang, D.; Yang, J. Isolation, identification and control of molds on the surface of medium-high-temperature Daqu. China Brew. 2024, 43, 90–94. [Google Scholar]
  37. Xu, B.Y.; Xu, S.S.; Cai, J.; Sun, W.; Mu, D.D.; Wu, X.F.; Li, X.J. Analysis of the microbial community and the metabolic profile in medium-temperature Daqu after inoculation with Bacillus licheniformis and Bacillus velezensis. LWT 2022, 160, 113214. [Google Scholar] [CrossRef]
  38. Xin-hao, Z.; Dan-ping, H.; Jian-ping, T.; Dan, H. Research on the Daqu quality detection system based on machine vision. Food Mach. 2018, 4, 80–84. [Google Scholar]
  39. Qilin, S.; Ruirui, Z.; Liping, C.; Linhuan, Z.; Hongming, Z.; Chunjiang, Z. Semantic segmentation and path planning for orchards based on UAV images. Comput. Electron. Agric. 2022, 200, 107222. [Google Scholar] [CrossRef]
  40. Chai, D.F.; Newsam, S.; Huang, J.F. Aerial image semantic segmentation using DCNN predicted distance maps. ISPRS J. Photogramm. Remote Sens. 2020, 161, 309–322. [Google Scholar] [CrossRef]
  41. Hu, Q.; Wang, K.K.; Ren, F.S.; Wang, Z.Y. Research on underwater robot ranging technology based on semantic segmentation and binocular vision. Sci. Rep. 2024, 14, 11844. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Architecture of the imaging system.
Figure 2. Overall network architecture of the proposed framework. (a) Network model diagram of U-Net++. The black color indicates the original U-Net, while the green and blue colors represent the dense convolutional blocks on the skip connections [30]. (b) Network model diagram of U2-Net. The network consists of a six-level encoder, a five-level decoder, and a saliency map fusion module [32].
Figure 3. Illustration of key quality features in Daqu cross-sections.
Figure 4. Comparison of training and validation loss curves for SegFormer, U-Net, U-Net++, and U2-Net models.
Figure 5. Dynamic changes of segmentation metrics on the validation set during the training process. (A) Mean Intersection over Union (mIoU). (B) Dice coefficient. (C) Pixel Accuracy (PA).
Figure 6. Validation IoU curves for three fine-grained quality features: (A) Fire cycle. (B) Fissure. (C) Microbial plaque.
Figure 7. Visual comparison of segmentation masks generated by SegFormer, U-Net, U-Net++, and U2-Net on representative Daqu cross-sections.
Figure 8. Visualization process for extracting quality features from Daqu cross-sections.
Figure 9. From semantic segmentation to quantitative assessment: representative samples with input images, predicted masks, and extracted quality parameters.
Table 1. Statistical distribution of semantic categories and quality features in the Daqu cross-sectional image dataset (660 images).
| Class | Total Instances ¹ | Images ² | Area Ratio (%) ³ |
|---|---|---|---|
| Daqu Body | 759 | 659 | 82.55 |
| Fire Cycle | 603 | 367 | 5.98 |
| Fissure | 497 | 329 | 1.05 |
| Plaque | 629 | 275 | 7.24 |
| Total | 2488 | 660 | 100.00 |
¹ Total Instances: total number of manually annotated regions per class; ² Images: number of unique images containing the class (the sum exceeds 660 because several features can co-occur in one image); ³ Area Ratio: average pixel occupancy relative to the total Daqu area.
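For reference, the statistics in Table 1 can be recomputed directly from the annotation masks. The sketch below is a minimal example rather than the authors' released tooling: it assumes single-channel masks stored under a hypothetical masks/ directory, with pixel values 1–4 encoding Daqu body, fire cycle, fissure, and plaque, and it counts connected components as instances.

```python
import glob
import numpy as np
from PIL import Image
from scipy import ndimage

# Hypothetical label encoding and mask location; the released dataset may differ.
CLASSES = {1: "Daqu Body", 2: "Fire Cycle", 3: "Fissure", 4: "Plaque"}

instances = {c: 0 for c in CLASSES}     # annotated regions per class
images_with = {c: 0 for c in CLASSES}   # images containing the class
class_pixels = {c: 0 for c in CLASSES}  # total pixels per class
daqu_pixels = 0                         # total annotated cross-section area

for path in glob.glob("masks/*.png"):
    mask = np.array(Image.open(path))
    daqu_pixels += int((mask > 0).sum())        # all non-background pixels
    for c in CLASSES:
        binary = mask == c
        if binary.any():
            images_with[c] += 1
            _, n = ndimage.label(binary)        # connected components = instances
            instances[c] += n
            class_pixels[c] += int(binary.sum())

for c, name in CLASSES.items():
    ratio = 100.0 * class_pixels[c] / max(daqu_pixels, 1)
    print(f"{name}: {instances[c]} instances, {images_with[c]} images, "
          f"{ratio:.2f}% of the total Daqu area")
```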
Table 2. Comparison of overall performance and model complexity (Mean ± SD).
| Model | mIoU (%) | Dice (%) | PA (%) | Params (M) | FLOPs (G) |
|---|---|---|---|---|---|
| SegFormer | 79.23 ± 0.68 | 96.21 ± 0.44 | 98.91 ± 0.08 | 7.03 | 5.21 |
| U-Net | 78.32 ± 1.73 | 95.43 ± 0.57 | 96.49 ± 0.24 | 13.40 | 48.65 |
| U-Net++ | 76.48 ± 3.23 | 95.35 ± 0.58 | 96.25 ± 0.57 | 9.16 | 54.55 |
| U2-Net | 85.48 ± 0.39 | 98.09 ± 0.13 | 97.27 ± 0.10 | 44.07 | 74.38 |
Note: All metrics are evaluated on the validation set except for the model complexity figures. mIoU: mean Intersection over Union; Dice: Dice coefficient reflecting spatial overlap; PA: pixel accuracy; Params: total number of trainable parameters; FLOPs: total number of floating-point operations for one forward pass, indicating computational cost.
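The overall metrics in Table 2 follow standard confusion-matrix definitions; the sketch below recomputes them from integer label maps over the five classes (background, Daqu body, fire cycle, fissure, plaque). How the Dice score is aggregated in the paper (per class versus overall foreground) is not restated here, so the example simply averages per-class values.

```python
import numpy as np

NUM_CLASSES = 5  # background, Daqu body, fire cycle, fissure, plaque

def confusion_matrix(pred: np.ndarray, gt: np.ndarray,
                     k: int = NUM_CLASSES) -> np.ndarray:
    """k x k confusion matrix; rows index ground truth, columns index prediction."""
    idx = k * gt.reshape(-1).astype(np.int64) + pred.reshape(-1).astype(np.int64)
    return np.bincount(idx, minlength=k * k).reshape(k, k)

def overall_metrics(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-7):
    cm = confusion_matrix(pred, gt).astype(np.float64)
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp
    iou = tp / (tp + fp + fn + eps)           # per-class IoU (as in Table 3)
    dice = 2 * tp / (2 * tp + fp + fn + eps)  # per-class Dice
    pa = tp.sum() / (cm.sum() + eps)          # pixel accuracy
    return iou.mean(), dice.mean(), pa        # mIoU, mean Dice, PA
```

Since Dice = 2·IoU/(1 + IoU) for each class, Dice scores necessarily sit above the corresponding IoU values, which is consistent with the gap between the mIoU and Dice columns in Table 2.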
Table 3. Comparison of class-wise IoU performance (Mean ± SD).
| Model | Background (%) | Daqu (%) | Fire Cycle (%) | Fissure (%) | Plaque (%) |
|---|---|---|---|---|---|
| SegFormer | 94.28 ± 0.73 | 98.49 ± 0.04 | 91.10 ± 0.06 | 69.70 ± 1.61 | 61.18 ± 2.50 |
| U-Net | 97.68 ± 0.46 | 90.90 ± 0.46 | 70.37 ± 6.45 | 63.71 ± 4.04 | 68.94 ± 1.87 |
| U-Net++ | 97.30 ± 0.98 | 90.51 ± 1.23 | 67.25 ± 8.10 | 62.55 ± 4.83 | 64.82 ± 11.40 |
| U2-Net | 98.53 ± 0.18 | 92.62 ± 0.21 | 82.51 ± 0.65 | 74.81 ± 1.10 | 78.92 ± 0.62 |
Note: Values represent the validation Intersection over Union (Val_IoU) for each specific category, presented as Mean ± SD (%).
Table 4. Ablation study of overall performance metrics.
| ID | Components (Aug / WCE / Dice / Lovasz) | mIoU (%) | Dice (%) | PA (%) |
|---|---|---|---|---|
| M1 | – – – | 84.01 ± 0.34 | 83.76 ± 0.78 | 97.35 ± 0.04 |
| M2 | – – – | 83.59 ± 0.48 | 80.67 ± 0.63 | 96.86 ± 0.39 |
| M3 | – – – | 83.17 ± 0.94 | 80.01 ± 1.19 | 97.27 ± 0.09 |
| M4 | – – | 85.99 ± 0.45 | 97.70 ± 0.16 | 97.72 ± 0.02 |
| M5 | – – | 84.32 ± 0.34 | 98.18 ± 0.09 | 97.37 ± 0.04 |
| M6 | – | 87.05 ± 0.30 | 87.76 ± 0.23 | 97.66 ± 0.02 |
| M7 | WCE + Dice + Lovasz | 85.48 ± 0.39 | 98.09 ± 0.13 | 97.27 ± 0.10 |
| M8 | Aug + WCE + Dice + Lovasz | 87.54 ± 0.17 | 98.30 ± 0.10 | 97.68 ± 0.03 |
Note: Aug: data augmentation; WCE: weighted cross-entropy loss; Dice: Dice loss; Lovasz: Lovasz-Softmax loss; each dash marks a disabled component. M7 uses the combined loss function without augmentation; M8 enables all components, including data augmentation.
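A minimal PyTorch sketch of the combined objective used in M7/M8 is given below. The equal 1:1:1 weighting of the three terms and the per-class weights are illustrative assumptions (the exact coefficients are not reproduced here); the third term follows the standard Lovasz-Softmax formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def lovasz_grad(gt_sorted: torch.Tensor) -> torch.Tensor:
    """Gradient of the Lovasz extension w.r.t. sorted errors (Jaccard surrogate)."""
    gts = gt_sorted.sum()
    intersection = gts - gt_sorted.cumsum(0)
    union = gts + (1.0 - gt_sorted).cumsum(0)
    jaccard = 1.0 - intersection / union
    jaccard[1:] = jaccard[1:] - jaccard[:-1]
    return jaccard

def lovasz_softmax(probs: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """probs: (N, C, H, W) softmax outputs; labels: (N, H, W) integer maps."""
    C = probs.size(1)
    probs = probs.permute(0, 2, 3, 1).reshape(-1, C)
    labels = labels.reshape(-1)
    losses = []
    for c in range(C):
        fg = (labels == c).float()
        if fg.sum() == 0:              # skip classes absent from the batch
            continue
        errors = (fg - probs[:, c]).abs()
        errors_sorted, perm = torch.sort(errors, dim=0, descending=True)
        losses.append(torch.dot(errors_sorted, lovasz_grad(fg[perm])))
    return torch.stack(losses).mean()

def dice_loss(probs: torch.Tensor, labels: torch.Tensor, eps: float = 1e-7):
    one_hot = F.one_hot(labels, probs.size(1)).permute(0, 3, 1, 2).float()
    dims = (0, 2, 3)
    inter = (probs * one_hot).sum(dims)
    card = probs.sum(dims) + one_hot.sum(dims)
    return (1.0 - (2.0 * inter + eps) / (card + eps)).mean()

class CombinedLoss(nn.Module):
    """WCE + Dice + Lovasz-Softmax; the (1, 1, 1) term weights are an assumption."""
    def __init__(self, class_weights: torch.Tensor):
        super().__init__()
        self.wce = nn.CrossEntropyLoss(weight=class_weights)

    def forward(self, logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        probs = torch.softmax(logits, dim=1)
        return (self.wce(logits, labels)
                + dice_loss(probs, labels)
                + lovasz_softmax(probs, labels))
```

For the five classes here, criterion = CombinedLoss(torch.ones(5)) reduces the first term to unweighted cross-entropy; in practice the class weights would up-weight the minority fissure and plaque classes.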
Table 5. Ablation study of class-wise IoU performance.
| ID | Components (Aug / WCE / Dice / Lovasz) | Background (%) | Daqu (%) | Fire Cycle (%) | Fissure (%) | Plaque (%) |
|---|---|---|---|---|---|---|
| M1 | – – – | 98.86 ± 0.11 | 92.76 ± 0.08 | 75.53 ± 1.59 | 73.16 ± 1.44 | 79.75 ± 0.46 |
| M2 | – – – | 98.08 ± 0.71 | 91.74 ± 0.78 | 77.82 ± 1.04 | 72.23 ± 1.22 | 78.06 ± 1.55 |
| M3 | – – – | 98.67 ± 0.19 | 92.57 ± 0.15 | 76.76 ± 1.56 | 76.17 ± 1.16 | 71.66 ± 3.49 |
| M4 | – – | 99.12 ± 0.02 | 93.60 ± 0.07 | 81.75 ± 1.83 | 73.41 ± 1.71 | 82.07 ± 0.54 |
| M5 | – – | 98.68 ± 0.08 | 92.94 ± 0.12 | 76.74 ± 1.31 | 74.01 ± 0.71 | 79.22 ± 0.70 |
| M6 | – | 98.91 ± 0.05 | 93.45 ± 0.04 | 82.72 ± 1.33 | 77.56 ± 0.54 | 82.63 ± 0.30 |
| M7 | WCE + Dice + Lovasz | 98.53 ± 0.18 | 92.62 ± 0.21 | 82.51 ± 0.65 | 74.81 ± 1.10 | 78.92 ± 0.62 |
| M8 | Aug + WCE + Dice + Lovasz | 99.02 ± 0.03 | 93.56 ± 0.09 | 82.95 ± 0.44 | 79.46 ± 0.84 | 82.72 ± 0.50 |
Note: Values represent the validation Intersection over Union (Val_IoU) for each category, presented as Mean ± SD (%). M7 and M8 reflect the configurations without and with data augmentation, respectively, under the combined loss function.
Table 6. Representative quantitative outputs derived from the predicted masks.
| Sample ID | T_P (px) | L_F (px) | Daqu (%) | Fire Cycle (%) | Plaque (%) | Ratio (%) |
|---|---|---|---|---|---|---|
| Daqu_5 | 0.00 | 1066 | 33.78 | 0.00 | 0.00 | 0.00 |
| Daqu_136 | 287.12 | 0 | 29.81 | 6.48 | 0.00 | 0.00 |
| Daqu_210 | 197.65 | 728 | 35.82 | 1.62 | 3.09 | 8.62 |
| Daqu_214 | 0.00 | 0 | 34.24 | 0.00 | 5.17 | 15.10 |
Note: T_P: average thickness of the pizhang (skin layer); L_F: total length of fissures; Ratio: percentage of plaque area relative to the Daqu area. All other area-based metrics are expressed as a percentage of the total cross-section area.
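The indicators in Table 6 can be derived from a predicted label map with standard morphological operations. The sketch below is illustrative rather than the exact pipeline of Figure 8: it reuses the hypothetical 1–4 label encoding from the earlier sketches, approximates the fissure length L_F as the pixel count of the fissure skeleton, and estimates the pizhang thickness T_P as the fire-cycle ring area divided by its centerline length, one plausible estimator among several.

```python
import numpy as np
from skimage.morphology import skeletonize

# Hypothetical label encoding, matching the earlier sketches.
DAQU, FIRE_CYCLE, FISSURE, PLAQUE = 1, 2, 3, 4

def quality_indicators(mask: np.ndarray) -> dict:
    """Derive Table-6-style indicators from a predicted integer label map."""
    total = mask.size
    daqu_area = float((mask > 0).sum())          # whole cross-section
    plaque_area = float((mask == PLAQUE).sum())
    ring = mask == FIRE_CYCLE
    fissure = mask == FISSURE

    # Fissure length: skeletonize the (possibly branched) fissure mask and
    # count skeleton pixels as an approximate length in px.
    l_f = float(skeletonize(fissure).sum()) if fissure.any() else 0.0

    # Skin-layer (pizhang) thickness: approximate the ring's mean width as
    # area / centerline length; zero when no fire cycle is detected.
    if ring.any():
        centerline = float(skeletonize(ring).sum())
        t_p = float(ring.sum()) / max(centerline, 1.0)
    else:
        t_p = 0.0

    return {
        "T_P (px)": t_p,
        "L_F (px)": l_f,
        "Daqu (%)": 100.0 * daqu_area / total,
        "Fire Cycle (%)": 100.0 * float(ring.sum()) / total,
        "Plaque (%)": 100.0 * plaque_area / total,
        "Ratio (%)": 100.0 * plaque_area / max(daqu_area, 1.0),
    }
```

With this reading, the Ratio column is consistent with Table 6 (e.g., for Daqu_210, 3.09/35.82 ≈ 8.62%).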