4.1. Data Acquisition and Evaluation Metrics
Image data acquisition in the underground pipeline corridor is performed by an unmanned vehicle, which is shown in
Figure 11. The configuration of the unmanned vehicle is shown in
Table 3. The camera is started by the corresponding camera driver in the ROS system, and the image stream from the camera is published and recorded in real time through ROS topic communication.
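As an illustration of this acquisition pipeline, the following is a minimal sketch of a ROS (Python/rospy) node that subscribes to the camera image topic and writes the incoming frames to a rosbag in real time. The topic name /camera/image_raw, the node name, and the bag file name are assumptions for illustration; the actual names depend on the camera driver configured on the unmanned vehicle.

```python
# Minimal sketch: record images published by the camera driver into a rosbag.
import rospy
import rosbag
from sensor_msgs.msg import Image


class ImageRecorder:
    def __init__(self, topic="/camera/image_raw", bag_path="corridor_run.bag"):
        # Assumed topic and bag path; adjust to the actual camera driver setup.
        self.topic = topic
        self.bag = rosbag.Bag(bag_path, "w")
        rospy.Subscriber(topic, Image, self.callback, queue_size=10)

    def callback(self, msg):
        # Write each published frame to the bag as it arrives.
        self.bag.write(self.topic, msg, msg.header.stamp)

    def close(self):
        self.bag.close()


if __name__ == "__main__":
    rospy.init_node("corridor_image_recorder")
    recorder = ImageRecorder()
    rospy.on_shutdown(recorder.close)
    rospy.spin()
```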
Our data originates from 15 irregular spaces within 4 utility tunnel compartments, with each compartment and irregular space containing a specific amount of data.
The commonly used evaluation metric for semantic segmentation is Mean Intersection over Union (mIoU), which is employed as the performance indicator for this experiment; it is calculated as follows:
$$\mathrm{mIoU}=\frac{1}{n_{cl}}\sum_{i}\frac{n_{ii}}{t_{i}+\sum_{j}n_{ji}-n_{ii}}$$
where $n_{ij}$ represents the number of pixels of class $i$ predicted as class $j$, $n_{cl}$ is the total number of classes, and $t_{i}=\sum_{j}n_{ij}$ is the total number of pixels of the target class $i$.
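For concreteness, the metric can be computed from the pixel-level confusion matrix as in the following sketch (NumPy; the function name is illustrative):

```python
import numpy as np


def mean_iou(conf_matrix: np.ndarray) -> float:
    """Compute mIoU from an (n_cl x n_cl) pixel confusion matrix,
    where conf_matrix[i, j] is the number of pixels of class i predicted as class j."""
    n_ii = np.diag(conf_matrix).astype(np.float64)       # correctly predicted pixels per class
    t_i = conf_matrix.sum(axis=1).astype(np.float64)     # total ground-truth pixels of class i
    pred_i = conf_matrix.sum(axis=0).astype(np.float64)  # total pixels predicted as class i
    union = t_i + pred_i - n_ii
    iou = np.where(union > 0, n_ii / np.maximum(union, 1e-12), 0.0)
    return float(iou.mean())
```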
The evaluation metrics used to measure text generation are mainly Bilingual Evaluation Understudy (BLEU) and Recall-Oriented Understudy for Gisting Evaluation-Longest Common Subsequence (Rouge-L), which are used to measure the accuracy and recall of text generation, respectively, and the specific formulas are shown as follows:
$$\mathrm{BLEU}=BP\cdot\exp\left(\sum_{n=1}^{N}w_{n}\log p_{n}\right),\qquad \mathrm{Rouge\text{-}L}=\frac{(1+\beta^{2})R_{lcs}P_{lcs}}{R_{lcs}+\beta^{2}P_{lcs}}$$
where, in the formula for BLEU, $p_{n}$ denotes the n-gram matching degree, $w_{n}$ is the weight of each n-gram (equal weights by default), and $BP$ is a penalty factor. In the formula for Rouge-L, $R_{lcs}$ and $P_{lcs}$ denote the recall and precision based on the longest common subsequence, respectively, and $\beta$ is set to a very large number so that Rouge-L considers $R_{lcs}$ almost exclusively.
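As a concrete reference for the Rouge-L formula above, a minimal pure-Python sketch is given below; the token lists and the large default value of beta are illustrative, not the exact settings of this work.

```python
def lcs_length(x, y):
    """Length of the longest common subsequence of token lists x and y."""
    dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, xi in enumerate(x, 1):
        for j, yj in enumerate(y, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if xi == yj else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(x)][len(y)]


def rouge_l(reference_tokens, candidate_tokens, beta=1e6):
    """Rouge-L F-score; with a very large beta the score is dominated by recall."""
    lcs = lcs_length(reference_tokens, candidate_tokens)
    if lcs == 0:
        return 0.0
    r = lcs / len(reference_tokens)   # recall  R_lcs
    p = lcs / len(candidate_tokens)   # precision P_lcs
    return (1 + beta ** 2) * r * p / (r + beta ** 2 * p)
```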
4.2. Experiments
In terms of semantic segmentation, this article sets up comparison experiments to compare the image quality of the enhanced images with that of the original images, and to determine whether semantic segmentation accuracy improves after data enhancement and the addition of a channel attention module. The comparison experiments are shown in
Table 4.
After each complete training iteration, the model is evaluated on the validation set to identify the iteration that achieves optimal performance and to prevent overfitting.
In terms of text generation, comparison experiments are set up to determine the effect of fine-tuning the model on the underground pipeline corridors dataset. The comparison experiments are shown in
Table 5. After training is completed, the two performance metrics are computed on the test set.
After low-light enhancement, the enhanced images are compared with the original images using three metrics: Peak Signal-to-Noise Ratio (PSNR), Structural Similarity (SSIM), and Mean Absolute Error (MAE), as shown in
Table 6. Larger values of PSNR and SSIM and a smaller value of MAE indicate a higher quality of the enhanced image. From the table, it can be seen that, for data enhancement of underground pipeline corridors, the images enhanced by the Zero-DCE network have higher image quality and are more suitable for the subsequent semantic segmentation task. The original underground pipeline corridor data and the data after low-light enhancement are shown in
Figure 12.
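The three image-quality metrics reported in Table 6 can be computed as in the following sketch (scikit-image ≥ 0.19 and NumPy are assumed; the file paths are placeholders):

```python
import numpy as np
from skimage import io
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

# Placeholder paths for an original/enhanced image pair.
original = io.imread("corridor_original.png")
enhanced = io.imread("corridor_enhanced.png")

psnr = peak_signal_noise_ratio(original, enhanced, data_range=255)
ssim = structural_similarity(original, enhanced, channel_axis=-1, data_range=255)
mae = np.mean(np.abs(original.astype(np.float64) - enhanced.astype(np.float64)))

print(f"PSNR: {psnr:.2f} dB, SSIM: {ssim:.4f}, MAE: {mae:.2f}")
```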
As can be seen in
Figure 12, the visibility of the underground pipeline corridor images enhanced by the low-light enhancement algorithm is higher, and more feature information can be extracted from the same photo after low-light enhancement, which lays a good foundation for the subsequent semantic segmentation task.
As can be seen in
Figure 13, in the enhanced image, the red (noise) in the flame area not only covers a larger area but also exhibits a denser texture. This directly indicates that after enhancement, the number of noisy pixels in the flame area has increased, and the noise fluctuations have become more intense.
The total loss of the semantic segmentation model during training is shown in
Figure 14. After 160,000 iterations, the loss of the model is minimized and the performance is optimal.
Comparing the semantic segmentation accuracies of the low-light-enhanced data and the original data, the segmentation accuracies of the two batches of data are shown in
Table 7.
From
Table 7, it can be seen that after enhancement, except for a very few objects whose segmentation accuracy decreases, the segmentation accuracy of most objects is improved, and the overall segmentation accuracy is also improved. Low-light enhancement effectively increases the feature information in the image and allows the model to extract more of it, which in turn improves the semantic segmentation accuracy of the model. With the addition of the channel attention module, the model is able to focus its attention on the channels that are more important for distinguishing the categories, which further improves the semantic segmentation accuracy. The original network achieves the highest segmentation accuracy on the enhanced images after adding the channel attention module. The per-category fluctuations are all below the 10% engineering threshold and represent local trade-offs resulting from module optimization rather than model instability.
The negative growth in a few categories is minimal, stems from reasonable model optimization trade-offs, and does not impact core functionality (overall performance improvement). This falls entirely within acceptable engineering parameters, requiring no additional adjustments to the model architecture. The current performance already meets the practical requirements for utility tunnel inspections.
As shown in the experimental results of
Table 7, the Zero-DCE + DCNv4 + SE fusion model exhibits a certain degree of decline in IoU for segmentation of two target categories: “Flame” and “Gas Detector.” Specifically, its IoU for the flame category is 79.47%, lower than the 81.7% achieved by the original DCNv4 model, while its IoU for gas detectors is 92.59%, below the 93.27% achieved by the Zero-DCE + DCNv4 model. This performance fluctuation stems not from a single module failure, but from the combined effects of insufficient feature adaptability and inadequate category-specific matching during fusion. The specific causes and potential improvements are analyzed below.
One core contributing factor is the imbalanced feature weight distribution within the SE module. The SE module employs a “Squeeze-Excitation” mechanism to adaptively calibrate channel feature weights, whose performance relies on accurately identifying and amplifying key target features. However, in this experiment, it exhibited an adaptation bias toward the category features of flames and gas detectors. Flame targets exhibit blurred edges, irregular textures, and dispersed pixel distributions, while gas detectors typically appear as small objects with regular shapes and uniform textures. Both are easily obscured in images by more dominant background features. During global feature compression and activation, the SE module may disproportionately allocate weights to background or dominant categories, thereby suppressing channel responses for critical features like flame edges and gas detector contours. This reduces the model’s sensitivity to extracting these target features, ultimately lowering the IoU.
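For reference, a standard Squeeze-and-Excitation block of the kind discussed here can be sketched in PyTorch as follows; the reduction ratio of 16 is a common default, not a value reported in this work.

```python
import torch
import torch.nn as nn


class SEBlock(nn.Module):
    """Standard Squeeze-and-Excitation: global pooling -> two FC layers -> channel re-weighting."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels, bias=False),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = x.mean(dim=(2, 3))            # squeeze: global average pooling -> (b, c)
        w = self.fc(w).view(b, c, 1, 1)   # excitation: per-channel weights in (0, 1)
        return x * w                      # re-weight the feature channels
```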
To address these issues, the model presents clear room for improvement across three dimensions: module optimization, collaborative strategies, and data-loss adaptation. First, the SE module’s category adaptability can be optimized by replacing the generic SE module with a category-aware variant: channel excitation weights are trained separately for flames and gas detectors, while the fully connected layer structure is simplified and Dropout is introduced to suppress noise, preventing imbalanced feature weight distribution. Second, a feature adaptation bridge can be established between Zero-DCE and DCNv4 by inserting a 1 × 1 convolution + Batch Normalization layer, which maps the enhanced features to DCNv4’s feature space while an illumination intensity branch adaptively enables and adjusts the Zero-DCE enhancement strength, preventing ineffective enhancement. Third, data and loss specificity can be enhanced by oversampling both target and non-target samples and combining edge enhancement and random scaling strategies to expand feature diversity; a Dice-IoU hybrid loss function increases the model’s focus on edges and small target regions, mitigating segmentation boundary inaccuracies.
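As an illustration of the proposed Dice-IoU hybrid loss, a minimal PyTorch sketch for binary masks is given below; the mixing weight alpha = 0.5 is an assumed, tunable hyperparameter rather than a value reported in this work.

```python
import torch


def dice_iou_loss(pred_logits: torch.Tensor, target: torch.Tensor,
                  alpha: float = 0.5, eps: float = 1e-6) -> torch.Tensor:
    """Hybrid of soft Dice loss and soft IoU (Jaccard) loss for binary segmentation.
    pred_logits and target share shape (N, H, W); alpha balances the two terms."""
    prob = torch.sigmoid(pred_logits)
    target = target.float()
    inter = (prob * target).sum(dim=(1, 2))
    p_sum = prob.sum(dim=(1, 2))
    t_sum = target.sum(dim=(1, 2))
    dice = (2 * inter + eps) / (p_sum + t_sum + eps)
    iou = (inter + eps) / (p_sum + t_sum - inter + eps)
    return alpha * (1 - dice).mean() + (1 - alpha) * (1 - iou).mean()
```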
When training with the LoRA fine-tuning method, the number of trainable parameters accounts for less than 1% of the parameters of the whole model, which makes it possible to adapt large models to a specific task even with limited computational resources.
The LoRA rank is set to 8, and the adapters are injected into all core network layers of the Qwen2-VL-7B model that support LoRA adaptation, such as the attention layers and feedforward network layers.
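A minimal sketch of such a configuration with the Hugging Face peft library is shown below. The listed target module names are assumptions based on typical Qwen2-VL layer naming, and the lora_alpha and lora_dropout values are illustrative defaults rather than the exact settings used in this work; a recent transformers version with Qwen2-VL support is assumed.

```python
from peft import LoraConfig, get_peft_model
from transformers import Qwen2VLForConditionalGeneration

# Load the base vision-language model.
model = Qwen2VLForConditionalGeneration.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

# Rank-8 LoRA injected into attention and feed-forward projection layers.
# target_modules names are assumptions based on common Qwen2-VL layer naming.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,        # illustrative scaling factor
    lora_dropout=0.05,    # illustrative dropout
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically reports < 1% trainable parameters
```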
The training loss for fine-tuning the visual–linguistic model in this article is shown in
Figure 15. The training loss curve shows that the model fully converges after about 200 training steps; the performance metrics computed on the test set with these weights and with the original model are shown in
Table 8.
As can be seen from
Table 8, the fine-tuned visual–linguistic model performs much better than the original model on the test set, with a large improvement in both BLEU-4 (the weighted average of BLEU scores computed over 1- to 4-grams) and Rouge-L. This suggests that by fine-tuning the visual–linguistic model, we have improved the model’s comprehension accuracy for underground pipeline corridor scene understanding, allowing it to generate more accurate and complete scene understanding text for our specific task.
Table 9 presents comparative experiments of different fine-tuning strategies across parameter count, training duration, and generation metrics. As shown, the LoRA method achieves high performance across all metrics, demonstrating the rationale for selecting this approach. Within the text templates, most content is predefined, while sections describing object interactions remain open-ended—reflecting a balance between fixed paradigms and diversity.
The original model achieved a BLEU-4 score of 1.08, which primarily measures the n-gram alignment between its generated utility tunnel descriptions and the human-annotated reference texts. This extremely low score reflects the poor performance of a general-purpose model in this specific context and shows that the original model cannot be directly applied to text generation for underground utility tunnel scenarios. The score surged after fine-tuning, demonstrating that through LoRA fine-tuning the model learned to recognize utility tunnel-specific objects and comprehend scene logic; the generated text achieved high alignment with the human-annotated references, significantly improving adaptability and accuracy.
Our scene understanding system achieves single-frame inference in just 0.0673 s, fully meeting the real-time inspection requirements for utility tunnels. Regarding core specifications, equipment image capture typically operates at 30 frames per second. With an inference speed of 0.0673 s per frame, it precisely matches the capture rhythm, preventing image backlogs or critical frame omissions. For pipeline tunnel inspections, the end-to-end latency requirement for “detection → alert” of critical hazards like flames is less than or equal to 1 s. This inference speed occupies only 6.7% of the latency threshold, leaving ample time for subsequent alert triggering and manual response. It significantly outpaces the reaction speed of manual hazard inspections, fully meeting the response demands of emergency scenarios.
The visualization interface for scene understanding is shown in
Figure 16. In the scene understanding visualization interface, the leftmost panel is the original image of the current frame, the middle panel is the semantic segmentation result of the image, and the rightmost panel is the scene understanding text generated for the image. The figure shows the scene understanding results of three frames extracted from a scene video after frame-splitting. During the scene understanding of the first frame, the visual–linguistic model generates erroneous text in which the worker’s location is described incorrectly; such erroneous descriptions can be corrected efficiently by utilizing the category and location information of the objects in the segmentation result map. The colored parts of the scene understanding text are the parts that differ across the three frames: the red parts are the incorrect text generated by the visual–linguistic model, and the green parts are the correct text. The bottom section displays scene understanding visualizations for different compartments.
This study offers broad and practical applications, with its core value lying in providing an integrated intelligent solution for underground utility tunnel operation and maintenance. This solution combines “low-light adaptation + precise segmentation + standardized text generation.” This scenario-customized technical framework and low-cost model adaptation scheme can be seamlessly deployed to underground infrastructure such as cable tunnels and subway corridors, as well as high-risk confined spaces like chemical industrial parks and mine tunnels. It provides critical support for digital management of urban underground spaces, upgrading intelligent inspection equipment, and implementing multimodal perception technologies.