Article

Lesion-Preserving and Confidence-Aware Fish Lesion Segmentation for Sustainable Aquaculture and Aquaponic Health Monitoring

1
College of Engineering, China Agricultural University, 17 Qinghua East Road, Haidian, Beijing 100083, China
2
College of Information and Electrical Engineering, China Agricultural University, 17 Qinghua East Road, Haidian, Beijing 100083, China
*
Authors to whom correspondence should be addressed.
Sustainability 2026, 18(12), 5819; https://doi.org/10.3390/su18125819
Submission received: 8 May 2026 / Revised: 2 June 2026 / Accepted: 4 June 2026 / Published: 7 June 2026

Abstract

Timely fish disease monitoring is an important requirement for sustainable aquaculture because disease outbreaks can reduce survival, increase treatment inputs, and destabilise production. In aquaponic systems, fish health is also linked to nutrient cycling and the stability of integrated fish–vegetable production, making automated fish-health perception a potentially useful component of resource-efficient farming. Existing classification and detection methods can identify disease categories or approximate lesion locations, but they provide limited information about lesion area, boundary shape, and severity-related spatial extent. This study presents a deep learning framework for pixel-level fish lesion segmentation to support sustainable aquaculture health monitoring, with aquaponic systems considered as a potential application context. The framework combines lesion-preserving frequency augmentation (LPFA), confidence-guided large-kernel encoding (CGLE), and confidence-filtered decoder refinement (CFDR). LPFA expands lesion appearance variation during training while retaining the main lesion layout. CGLE uses coarse prediction confidence to allocate broader contextual modelling to uncertain encoder regions, and CFDR applies selective decoder correction to low-confidence regions. A public freshwater fish disease dataset is reformulated into a dense prediction task with 1750 raw images from seven image-level categories, including six disease categories and one normal healthy category. The images are divided into training, validation, and test subsets at an 8:1:1 ratio, and controlled augmentation strategies are applied online rather than being used to create a larger static dataset. Across five random-seed runs, the proposed method achieves 82.6 ± 0.3% mIoU, 90.9 ± 0.2% mDice, and 73.5 ± 0.4% Boundary IoU. Relative to TransUNet, the mean mIoU rises from 78.4 ± 0.4% to 82.6 ± 0.3%, and Boundary IoU rises from 68.8 ± 0.5% to 73.5 ± 0.4%, with paired bootstrap testing supporting the stability of the improvement. These results indicate its potential as a lesion-quantification decision-support component for smart and sustainable fish-production systems.

1. Introduction

Aquaculture is an important source of animal protein and contributes directly to food supply, farm income, and regional production stability [1,2,3,4,5]. In intensive culture systems, infectious and stress-related diseases may spread quickly if abnormal individuals are not detected in time, leading to treatment delay and economic loss [1,4,6]. Manual inspection and laboratory diagnosis remain valuable for confirming disease causes, but they are labour-intensive, expert-dependent, and difficult to deploy as high-frequency screening tools across large farms. With the rapid development of artificial intelligence, deep learning, and computer vision, image-based perception has been widely extended from general object recognition to a broad range of domain-specific applications, including agriculture, remote sensing, medical imaging, biological production, and intelligent engineering systems [7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22]. Recent studies have demonstrated its potential in crop disease and pest detection, plant phenotyping, robotic weed control, fruit maturity recognition, quality assessment, and intelligent production management [23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46]. By learning discriminative visual representations from complex and variable environments, these models can support automated detection, classification, localization, counting, and quality evaluation. In aquaculture, this technological progress has opened new opportunities for non-invasive and high-frequency fish-health monitoring, making image-based analysis a practical component of smart aquaculture [2,47,48,49].
In parallel with vision-based monitoring, non-imaging molecular and biosensor-based diagnosis has advanced rapidly as another important route for fish-health and seafood safety assessment [1,5,50,51]. Peptide–graphene sensing has demonstrated molecular-level marker detection and information processing [50], and recent aptamer-based biosensors have been reviewed as rapid tools for fish disease diagnosis and seafood safety monitoring [51]. These diagnostic routes provide pathogen- or biomarker-level evidence, whereas image-based segmentation provides spatial lesion extent, boundary, and longitudinal severity cues [2,6,47]. The vision-based segmentation target in this study is therefore complementary to molecular or laboratory diagnosis rather than an isolated replacement for it.
Most existing visual studies in this area focus on fish disease classification or diseased-fish detection [52,53,54,55,56,57]. Classification methods usually provide image-level disease labels [52,53,54], while detection methods provide approximate object or symptom locations through bounding boxes [55,56,57,58,59]. Recent multi-task studies have begun to combine detection with segmentation for disease assessment [60,61,62], but classification and detection outputs alone do not directly supply lesion area, boundary shape, or lesion expansion information. These spatial properties are important for severity assessment, treatment planning, and longitudinal monitoring, because the same disease category can correspond to substantially different lesion extents. Pixel-level lesion segmentation is therefore more closely aligned with quantitative aquaculture management than classification or detection alone. Figure 1 illustrates this motivation. The left part shows common aquaculture imaging challenges, including reflection, low contrast, pose variation, and occlusion. The middle part contrasts image-level classification and box-level detection with their limited spatial outputs. The right part indicates the target of multi-class lesion segmentation, where lesion regions are delineated at the pixel level for downstream quantification.
In practical aquaculture monitoring, a lesion mask can be converted into several management-oriented indicators that are directly supported by the present lesion/background segmentation setting. For example, lesion-to-image area ratio, lesion component count, and boundary irregularity can support screening for severe, scattered, or weak-boundary symptoms [5,60,61,62]. When an additional fish-body mask or reliable body-area estimate is available, these indicators can be further normalised as lesion-to-body ratios. In future continuous-imaging settings, repeated segmentation of the same fish may also support tracking of lesion expansion or recovery. At the farm level, such indicators can help prioritise fish for closer manual inspection, trigger water-quality checks, and support treatment records [1,2,4]. In aquaponic systems, early lesion-area warnings may be useful because fish stress and disease management affect feeding, nutrient release, water quality, and the stability of the integrated fish–plant production cycle [63,64]. The segmentation output is therefore intended as quantitative decision support rather than an automatic replacement for veterinary diagnosis [1,5].
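As a minimal sketch of how a lesion mask could be turned into two of the indicators named above, the following plain-Python example computes the lesion-to-image area ratio and the lesion component count for a binary mask. The function name `lesion_indicators`, the toy mask, and the use of 4-connectivity are assumptions made for illustration; the source does not specify how components are counted.

```python
from collections import deque

def lesion_indicators(mask):
    """Return (area_ratio, component_count) for a binary lesion mask.

    `mask` is a list of rows with 1 for lesion pixels and 0 for background.
    Components are counted with 4-connectivity (an assumption of this sketch).
    """
    height, width = len(mask), len(mask[0])
    lesion_pixels = sum(sum(row) for row in mask)
    area_ratio = lesion_pixels / (height * width)

    seen = [[False] * width for _ in range(height)]
    components = 0
    for r in range(height):
        for c in range(width):
            if mask[r][c] != 1 or seen[r][c]:
                continue
            # Found an unvisited lesion pixel: flood-fill its component.
            components += 1
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < height and 0 <= nx < width
                    if inside and mask[ny][nx] == 1 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
    return area_ratio, components

# Toy mask with two separate lesion regions (hypothetical data).
example_mask = [
    [0, 1, 1, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 1, 1],
]
ratio, count = lesion_indicators(example_mask)
print(f"lesion-to-image area ratio = {ratio:.2f}, components = {count}")
```

If a fish-body mask were available, dividing the lesion pixel count by the body pixel count instead of the image size would give the lesion-to-body ratio mentioned in the text.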
Fish lesion segmentation is difficult because its obstacles occur simultaneously at the data, image, and model levels. At the data level, public fish disease datasets are mostly organised for classification, and pixel-level multi-class lesion annotations remain limited [53,54,65]. At the image level, aquaculture scenes often include reflection, turbid water, low contrast, partial occlusion, pose variation, complex backgrounds, and fish-scale textures that resemble lesion texture [48,66]. At the model level, fixed receptive fields and uniform decoder processing are not well matched to weak-boundary or ambiguous regions, where prediction confidence tends to be low. Therefore, the problem is not only a matter of selecting a stronger backbone; it also requires consistent treatment of lesion appearance variation, contextual modelling, and decoder-stage refinement.
The proposed framework addresses these requirements from three connected levels rather than replacing the backbone alone. Lesion-preserving frequency augmentation (LPFA) expands lesion appearance variation while retaining the main lesion layout through source-phase retention and mask-constrained reconstruction. Confidence-guided large-kernel encoding (CGLE) uses coarse prediction confidence to regulate how much large-kernel contextual information is assigned to uncertain encoder regions. Confidence-filtered decoder refinement (CFDR) uses the same confidence source to strengthen correction in low-confidence decoder regions and reduce unnecessary modification of high-confidence regions. This shared confidence map makes encoder-side contextual allocation and decoder-side refinement follow a consistent criterion.
The main contributions of this study are summarised as follows:
  • A public freshwater fish disease classification dataset is reformulated as a pixel-level multi-class lesion segmentation task, enabling lesion area, boundary, and category information to be evaluated for quantitative fish-health assessment.
  • A lesion-preserving and confidence-aware segmentation framework is proposed by combining LPFA, CGLE, and CFDR, so that appearance diversification, large-kernel contextual modelling, and uncertain-region refinement are handled in a unified pipeline.
  • On the reannotated dataset, the proposed method achieves 82.6% mIoU, 90.9% mDice, and 73.5% Boundary IoU; the corresponding TransUNet baseline scores are 78.4%, 88.8%, and 68.8%.

2. Related Work

2.1. Fish Disease Visual Analysis and Data Resources

Computer vision has become an active tool for fish-health analysis. Aquaculture monitoring is moving from occasional manual inspection toward data-driven and high-frequency observation. Recent reviews describe this shift from traditional diagnosis and handcrafted image descriptors to convolutional and deep learning methods for disease recognition, growth monitoring, and farm management [1,2,5,6,47]. Classification studies are useful because they provide a direct category-level decision from an image. Public resources have supported this direction. Examples include the freshwater fish disease dataset released by Biswas et al. [53] and SalmonScan for salmon disease recognition [54]. These datasets support reproducible benchmarking, but most of them provide image-level labels rather than dense lesion masks.
Detection-oriented studies extend disease recognition by providing approximate localisation. DF-YOLO and NAM-YOLOv7 were developed for diseased-fish detection in underwater or intensive aquaculture conditions [55,58]. CBFW-YOLOv8 further addressed fish body-surface disease recognition in recirculating aquaculture systems [59]. More recent work has moved closer to lesion-level analysis. YOLO-FD incorporated disease detection and semantic segmentation through multi-task learning [60]. FDMNet jointly modelled detection and segmentation for three fish diseases [61]. VMI-ATN-RCNN combined segmentation and classification in a hybrid pipeline [62], and FlatIMG explored flatfish disease detection with disease image generation [65]. These studies show clear progress from image-level recognition to spatial analysis. However, public multi-class pixel-level fish lesion segmentation remains less developed than classification or detection, especially under reflection, low contrast, weak boundaries, and texture interference.
Recent underwater-organism detection research further emphasises scale variation, low contrast, blurred boundaries, complex backgrounds, and small targets as common challenges in aquatic imagery [48,55,58,66,67]. For example, an efficient dynamic YOLOv8-based framework introduced a streamlined ViT backbone, Dynamic Head multi-scale fusion, and SlideLoss to improve multi-scale target detection of underwater organisms in blurry underwater environments [67]. Although these studies generally focus on bounding-box detection rather than dense lesion masks, they show that aquatic visual analysis requires scale-aware feature representation and contextual modelling. This observation is relevant to fish lesion segmentation, where symptoms may be small, low-contrast, and embedded in scale-like texture.

2.2. Disease Image Augmentation and Appearance Robustness

Data augmentation is widely used when annotated disease images are limited. It can reduce overfitting and improve robustness. Conventional operations include flipping, rotation, scaling, cropping, brightness adjustment, contrast adjustment, Gaussian noise, and blur. These transformations remain important baselines because they are simple, reproducible, and compatible with most segmentation pipelines [68]. Stronger methods introduce larger distributional changes through sample mixing, region deletion, or learned policies. Representative examples include Mixup [69], CutMix [70], Mosaic-style composition [71], Cutout [72], Random Erasing [73], AutoAugment [74], and RandAugment [75]. Generative augmentation is another possible way to enlarge appearance variation with high efficiency [76,77,78,79]. These methods can improve robustness to appearance variation. However, lesion segmentation imposes an additional requirement: the augmented image must remain consistent with the lesion mask. Excessive global perturbation may damage weak lesion boundaries or alter lesion geometry, creating a mismatch between the image and the pixel-level label.
Frequency-domain augmentation provides a more structured route for changing appearance statistics. Signal analysis indicates that phase information carries much of the spatial layout. In contrast, amplitude information is closely related to style and texture statistics [80,81]. Fourier Domain Adaptation (FDA) uses this property for semantic segmentation by exchanging low-frequency amplitude components across domains [82]. Later Fourier-based domain generalisation methods and AmpMix further explored amplitude manipulation under distribution shift [83,84]. Uncertainty-aware Fourier augmentation has also been studied for medical image segmentation [85]. For fish lesion segmentation, the key issue is not only to increase image diversity. The lesion geometry and mask consistency must also be preserved. This motivates lesion-centred frequency manipulation rather than unconstrained whole-image perturbation.

2.3. Lesion Segmentation Networks and Confidence-Guided Refinement

Encoder–decoder architectures remain a central basis for dense lesion prediction. U-Net established the standard multi-scale segmentation design through skip connections and symmetric decoding [86,87]. DeepLabV3+ improved multi-scale contextual representation through atrous separable convolution [88]. Transformer-based segmentation methods further strengthen long-range dependency modelling. TransUNet combines convolutional features with transformer encoding for medical segmentation [89,90]. Swin-Unet uses hierarchical shifted-window self-attention in a U-shaped structure [91]. These architectures provide useful baselines. Even so, fish lesion segmentation often requires more targeted treatment of weak boundaries, fragmented regions, and visually ambiguous texture.
Large-kernel convolution has recently been reconsidered as an effective way to enlarge receptive fields while retaining convolutional inductive bias. RepLKNet showed that scaling convolution kernels can improve representation when the design is carefully regularised [92]. InternImage and UniRepLKNet further indicated that broader spatial aggregation can support strong visual representation without discarding local structure modelling [93,94]. This is relevant to fish lesions because low-contrast or partially occluded lesions often require information from a broader surrounding context. However, assigning large-kernel modelling uniformly to all regions can increase computation and may disturb already reliable local details.
Confidence- or uncertainty-guided refinement provides another useful perspective. Low-confidence pixels often correspond to boundary ambiguity, low contrast, occlusion, or class confusion [95,96,97]. Coarse logits or masks can therefore serve as cues for later correction. This idea appears in refinement-oriented methods such as SegRefiner and SAMRefiner [98,99]. Existing studies usually examine large-kernel modelling, transformer encoding, or decoder refinement separately. The present framework links these directions. It uses the same coarse confidence map to guide encoder-side large-kernel context allocation and decoder-side confidence-filtered correction. Difficult regions are therefore treated according to a consistent cue throughout the network.

3. Materials and Methods

The proposed framework contains three coordinated components: lesion-preserving frequency augmentation (LPFA), confidence-guided large-kernel encoding (CGLE), and confidence-filtered decoder refinement (CFDR). These components correspond to the data-level, encoder-side, and decoder-side difficulties identified in the introduction. The overall framework is illustrated in Figure 2. During training, LPFA extracts lesion-centred patches, performs frequency-domain augmentation, and reconstructs the augmented sample under mask constraints. The augmented image is then processed by a TransUNet encoder. A coarse prediction is converted into a shared confidence map through softmax and maximum-probability selection. This confidence map regulates the CGLE blocks in the encoder and the CFDR modules in the decoder, so that broad-context modelling and feature correction are concentrated on uncertain lesion regions before the final segmentation map is produced. During inference, LPFA is disabled, and the original image is processed by the segmentation network, which internally generates the coarse confidence cue before final refinement.
Given an input image $x$, the overall mapping is written as
$$\hat{y} = \mathcal{N}_{\Theta}\big(\mathcal{A}_{\Phi}(x)\big),$$
where $\mathcal{A}_{\Phi}$ denotes the training-time LPFA operator with parameters $\Phi$, $\mathcal{N}_{\Theta}$ denotes the complete segmentation network with parameters $\Theta$, and $\hat{y}$ denotes the final pixel-wise segmentation result. In the present framework, $\mathcal{N}_{\Theta}$ is implemented by a TransUNet backbone equipped with CGLE in selected encoder blocks and CFDR in the decoder. This notation is used for the training forward path; for validation and testing, $\mathcal{A}_{\Phi}$ is set to the identity mapping. The following subsections first define the dataset and task setting, and then describe LPFA, CGLE, and CFDR in order.

3.1. Dataset and Task Definition

The lesion segmentation task is built on the public “Freshwater Fish Disease Aquaculture in South Asia” dataset hosted on Kaggle (https://www.kaggle.com/datasets/subirbiswas19/freshwater-fish-disease-aquaculture-in-south-asia accessed on 6 May 2026), which is associated with the image-level fish disease recognition study of Biswas et al. [53]. The public release is organised for classification and does not provide pixel-level lesion annotations, so it cannot be used directly for semantic segmentation.
The original public data contain seven image-level categories, consisting of six disease categories and one normal healthy category. Each category contains 250 images, giving 250 × 7 = 1750 raw images. To reformulate the data for dense lesion prediction, pixel-level lesion masks were manually annotated for diseased images, while normal healthy images were retained as negative samples with background-only masks. The annotations were checked through data cleaning and quality control. The static dataset therefore remains as the 1750 original images and their corresponding masks. The data-loading pipeline uses an 8:1:1 split for training, validation, and testing, corresponding to 1400, 175, and 175 original images, respectively. Geometric, photometric, masking, frequency-domain, and LPFA transformations are applied online during model training to diversify the input distribution; they are not counted as additional stored images. The raw images come from the public source dataset, whereas the pixel-level masks, split files, and preprocessing scripts are derived for the present segmentation task.
The pixel-level masks were produced with Labelme by two trained annotators. Before full annotation, both annotators reviewed the disease-category definitions, representative lesion examples, and boundary cases in the dataset. A shared guideline was then used: visible ulceration, cotton-like fungal growth, gill discolouration, tail whitening, and parasite-related abnormal regions were marked as lesion pixels; healthy body regions, background water, specular reflection, shadows, and scale texture without disease evidence were assigned to background. Low-contrast boundaries were traced along the most continuous local colour and texture transition, and ambiguous cases were discussed jointly until a consensus mask was obtained. For diseased images, annotated foreground pixels inherited the image-level disease category of the source image. The resulting task is image-level disease-conditioned lesion-region segmentation, not pixel-level differential diagnosis for images with multiple coexisting pathologies. A randomly selected overlap subset was annotated independently by both annotators for quality control, yielding a Cohen’s kappa of 0.91, a mean pixel agreement of 96.8%, and a foreground Dice agreement of 92.6%. All masks were finally checked for polygon closure, category consistency, empty-mask errors, and accidental foreground labels in healthy images.
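The three agreement statistics reported above can be illustrated for a single pair of masks. The sketch below assumes a binary lesion/background comparison computed per pixel; the source does not state whether its Cohen's kappa was computed on binary or multi-class labels, so this is one plausible reading rather than the authors' procedure. The function name and the toy masks are hypothetical.

```python
def agreement_scores(mask_a, mask_b):
    """Return (pixel_agreement, cohen_kappa, foreground_dice) for two binary masks."""
    a = [p for row in mask_a for p in row]
    b = [p for row in mask_b for p in row]
    n = len(a)
    both_fg = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    both_bg = sum(1 for x, y in zip(a, b) if x == 0 and y == 0)
    fg_a, fg_b = sum(a), sum(b)

    # Observed agreement: fraction of pixels given the same label.
    observed = (both_fg + both_bg) / n
    # Chance agreement from each annotator's marginal label frequencies.
    expected = (fg_a / n) * (fg_b / n) + ((n - fg_a) / n) * ((n - fg_b) / n)
    kappa = (observed - expected) / (1 - expected) if expected < 1 else 1.0
    # Dice overlap restricted to the lesion (foreground) class.
    dice = 2 * both_fg / (fg_a + fg_b) if (fg_a + fg_b) else 1.0
    return observed, kappa, dice

# Two toy annotations that differ in one pixel (hypothetical data).
annotator_a = [[1, 1, 0, 0], [1, 1, 0, 0]]
annotator_b = [[1, 1, 0, 0], [1, 0, 0, 0]]
agreement, kappa, dice = agreement_scores(annotator_a, annotator_b)
print(f"agreement={agreement:.3f}, kappa={kappa:.3f}, dice={dice:.3f}")
```

Kappa corrects the raw pixel agreement for the agreement expected by chance, which matters here because background pixels dominate most images.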
Figure 3 shows representative lesion crops from the six disease categories and examples of the training-time transformations used to diversify the input distribution. Normal healthy images are excluded from this lesion-visualisation figure because they contain no annotated lesion region and are used as background-only samples. The conventional augmentation examples include a 15° rotation, horizontal flipping, zero-mean Gaussian noise with $\sigma = 0.04$ on RGB values normalised to $[0, 1]$, brightness scaling by a factor of 1.25, and random block masking with a square block whose side length is 22% of the crop size. These transformations are generated on the fly during training rather than stored as a pre-expanded dataset. The LPFA example in the same figure illustrates the proposed lesion-centred frequency-domain augmentation, while the final column shows the manually annotated lesion mask overlaid on the original crop.
In the task definition, healthy tissue, all non-lesion regions, and all pixels in normal healthy images are assigned to the background class, while diseased pixels are divided into six lesion categories: aeromoniasis, bacterial gill disease, bacterial red disease, saprolegniasis, parasitic disease, and white-tail viral disease. The full semantic task therefore contains $K = 7$ classes, where background is class 0. Given an input image $x \in \mathbb{R}^{H \times W \times 3}$, the model predicts a pixel-wise class map $\hat{y} \in \{0, \ldots, K-1\}^{H \times W}$, where $H$ and $W$ denote the image height and width.
To avoid data leakage caused by augmented near-duplicates, the 8:1:1 split was performed at the original-image level before any online augmentation was enabled. Augmentation was applied only to training samples within their own subset, and validation and test images were evaluated without augmentation. This prevents transformed views derived from the same original sample from appearing in different evaluation partitions.
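A split performed on original-image identifiers before any augmentation might look like the following sketch. The identifier format, the seed, and the plain (unstratified) shuffle are assumptions; the source states only the 8:1:1 ratio and the resulting subset sizes.

```python
import random

def split_image_ids(image_ids, seed=0, ratios=(8, 1, 1)):
    """Shuffle original-image ids once and cut them into train/val/test lists."""
    ids = sorted(image_ids)              # fixed order so the seed is reproducible
    random.Random(seed).shuffle(ids)
    total = sum(ratios)
    n_train = len(ids) * ratios[0] // total
    n_val = len(ids) * ratios[1] // total
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]         # remainder goes to the test subset
    return train, val, test

# Hypothetical identifiers for the 1750 original images.
image_ids = [f"img_{i:04d}" for i in range(1750)]
train_ids, val_ids, test_ids = split_image_ids(image_ids)
print(len(train_ids), len(val_ids), len(test_ids))

# Online augmentation would then be applied to samples drawn from train_ids only,
# so no transformed view of a validation or test image can reach training.
```

Because the partition is fixed on identifiers rather than on augmented samples, every transformed view inherits the subset of its original image.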
The lesion-area statistics in Table 1 also show a strong foreground–background imbalance. This imbalance was handled in three ways: normal images were retained as explicit background-only negatives, the segmentation objective combined cross-entropy with Dice loss so that small foreground regions contributed to the optimisation, and all reported overlap metrics were class averaged rather than pixel averaged. Class-averaged metrics reduce the dominance of the background class in the final scores.
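The effect of class averaging can be seen in a small worked example. The sketch below compares pixel accuracy with class-averaged IoU for a prediction that labels every pixel as background; the toy label vectors are hypothetical, and skipping classes that are absent from both prediction and target is an assumption of this sketch rather than a stated detail of the evaluation protocol.

```python
def class_averaged_iou(pred, target, num_classes):
    """Mean IoU over classes that occur in the prediction or the target."""
    ious = []
    for k in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == k and t == k)
        union = sum(1 for p, t in zip(pred, target) if p == k or t == k)
        if union > 0:                    # skip classes absent from both
            ious.append(inter / union)
    return sum(ious) / len(ious)

# 100 pixels: 95 background (class 0) and 5 lesion pixels (class 1).
target = [0] * 95 + [1] * 5
pred = [0] * 100                         # model that predicts background everywhere

pixel_accuracy = sum(p == t for p, t in zip(pred, target)) / len(target)
mean_iou = class_averaged_iou(pred, target, num_classes=2)
print(f"pixel accuracy = {pixel_accuracy:.3f}, class-averaged IoU = {mean_iou:.3f}")
```

Pixel accuracy stays at 0.95 although every lesion pixel is missed, whereas the class-averaged IoU drops to 0.475 because the lesion class contributes an IoU of zero.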

3.2. Lesion-Preserving Frequency Augmentation (LPFA)

At the data level, LPFA is introduced to enlarge lesion appearance variation while reducing the risk of label inconsistency. Instead of applying random perturbations to the entire image, the augmentation operator $\mathcal{A}_{\Phi}$ acts on lesion-related patches and uses class-consistent pairing to preserve semantic compatibility.
Let one mini-batch during training be denoted by $\mathcal{B} = \{(x_i, y_i)\}_{i=1}^{B}$, where $x_i$ is the $i$th input image, $y_i$ is the corresponding label, and $B$ is the batch size. For each sample $x_i$, the augmentation switch is defined as
$$z_i \sim \mathrm{Bernoulli}\big(p_{\mathrm{aug}} \cdot \mathbb{1}[e \ge e_0]\big),$$
where $z_i \in \{0, 1\}$ is the augmentation indicator, $p_{\mathrm{aug}}$ is the augmentation probability, $e$ is the current training epoch, $e_0$ is the warm-up epoch after which augmentation becomes active, and $\mathbb{1}[\cdot]$ is the indicator function. When $z_i = 1$, the sample enters the augmentation branch; when $z_i = 0$, the sample remains unchanged. To avoid semantic conflict from cross-disease mixing, a partner sample $x_j$ with the same dominant lesion class is selected from the same mini-batch, where the dominant lesion class is defined as the foreground lesion category with the largest annotated area in the mask. If no same-class lesion sample is available in the current mini-batch, $x_j$ is drawn from a class-wise training queue updated from previous mini-batches; if the queue is still empty during early training, LPFA is skipped for that sample. For healthy images with empty foreground masks, LPFA is skipped and the sample remains a background-only negative sample, so no artificial lesion appearance is introduced into normal images.
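The switch and the partner-selection fallback described above can be sketched in plain Python. All function names, the queue length of 8, and the toy masks are hypothetical; the epoch comparison is taken here as "current epoch at or beyond the warm-up epoch", which is an assumption about the exact boundary.

```python
import random
from collections import defaultdict, deque

def dominant_lesion_class(mask):
    """Foreground class with the largest area, or None for a background-only mask."""
    counts = defaultdict(int)
    for row in mask:
        for label in row:
            if label != 0:               # class 0 is background
                counts[label] += 1
    return max(counts, key=counts.get) if counts else None

def augmentation_switch(epoch, warmup_epoch, p_aug, rng):
    """Draw z_i: always 0 before warm-up, Bernoulli(p_aug) afterwards."""
    active = 1 if epoch >= warmup_epoch else 0
    return 1 if rng.random() < p_aug * active else 0

def choose_partner(index, batch_classes, queue, rng):
    """Return ("batch", j), ("queue", item), or None when LPFA is skipped."""
    cls = batch_classes[index]
    if cls is None:
        return None                      # healthy image: never augmented
    same = [j for j, c in enumerate(batch_classes) if j != index and c == cls]
    if same:
        return ("batch", rng.choice(same))
    if queue[cls]:
        return ("queue", rng.choice(list(queue[cls])))
    return None                          # no partner yet: skip LPFA

rng = random.Random(0)
masks = [
    [[0, 2, 2], [0, 2, 0]],              # dominant class 2
    [[0, 0, 0], [0, 0, 0]],              # healthy, background only
    [[2, 2, 0], [0, 3, 0]],              # classes 2 and 3, dominant 2
    [[0, 5, 0], [0, 0, 0]],              # dominant class 5, alone in the batch
]
batch_classes = [dominant_lesion_class(m) for m in masks]
class_queue = defaultdict(lambda: deque(maxlen=8))   # hypothetical queue size

print(batch_classes)
print(choose_partner(0, batch_classes, class_queue, rng))
print(choose_partner(3, batch_classes, class_queue, rng))
class_queue[5].append("patch_from_earlier_batch")
print(choose_partner(3, batch_classes, class_queue, rng))
```

The sample of class 5 is skipped while its queue is empty and is paired from the queue once an earlier same-class patch has been stored, mirroring the fallback order in the text.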
In the patch extraction stage, the source lesion patch $P_i$ is cropped from $x_i$, and the class-consistent paired lesion patch $P_j$ is cropped from $x_j$ according to the corresponding lesion masks. A fixed margin is retained around each crop box to preserve local context. A two-dimensional Fourier transform is then applied to the two patches:
$$\mathcal{F}(P_i) = A_i e^{\mathrm{j}\phi_i}, \qquad \mathcal{F}(P_j) = A_j e^{\mathrm{j}\phi_j},$$
where $\mathcal{F}(\cdot)$ denotes the two-dimensional Fourier transform, $A_i$ and $A_j$ denote the amplitude spectra of $P_i$ and $P_j$, $\phi_i$ and $\phi_j$ denote the corresponding phase spectra, and $\mathrm{j}$ is the imaginary unit. In this representation, the phase spectrum is used to maintain the main spatial arrangement, whereas the amplitude spectrum is adjusted to vary appearance statistics.
To determine which frequency bands should retain source appearance and which bands should absorb information from the paired sample, the frequency-selection subnetwork inside $\mathcal{A}_{\Phi}$, denoted by $\mathcal{F}_{\Phi}(\cdot)$, is used to generate a frequency weight map from the logarithmic amplitude spectrum:
$$W_i = w_{\min} + (w_{\max} - w_{\min})\,\sigma\big(\mathcal{F}_{\Phi}(\log(A_i + \epsilon))\big),$$
where $W_i \in [w_{\min}, w_{\max}]$ is the predicted frequency weight map, $\sigma(\cdot)$ is the sigmoid function, and $\epsilon$ is a small constant used to avoid numerical instability in the logarithm. The bounds are set to $w_{\min} = 0.2$ and $w_{\max} = 0.8$, which keeps both the source and paired spectra active in the mixture. The source amplitude spectrum $A_i$ and the paired amplitude spectrum $A_j$ are then fused adaptively:
$$\tilde{A}_i = W_i \odot A_i + (1 - W_i) \odot A_j,$$
where $\tilde{A}_i$ is the fused amplitude spectrum and $\odot$ denotes element-wise multiplication. A large value in $W_i$ means that the corresponding frequency location preserves more of the source lesion appearance, whereas a small value introduces more appearance information from the paired lesion patch.
The architecture of $\mathcal{F}_{\Phi}$ is intentionally lightweight because it is used only inside the online augmentation branch. Its input is the three-channel logarithmic amplitude spectrum of a resized $128 \times 128$ lesion patch, and its output is a one-channel frequency weight map that is broadcast to the amplitude channels. In the implementation, $\mathcal{F}_{\Phi}$ contains three convolutional layers: a $3 \times 3$ convolution from 3 to 16 channels with ReLU, a $3 \times 3$ convolution from 16 to 32 channels with ReLU, and a final $3 \times 3$ convolution from 32 channels to 1 channel followed by a sigmoid activation. The total parameter count is 5377 (0.0054 M), so the overhead is negligible compared with the segmentation backbone. The bounded output range keeps the augmented spectrum from becoming a pure source copy or a full paired-spectrum replacement. This module is active during training-time augmentation and is not used during inference.
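The reported parameter count can be checked with a short calculation. The sketch below assumes that each of the three convolutions carries a bias term; under that assumption the layer totals sum to the stated 5377 parameters. The helper name `conv2d_params` is hypothetical.

```python
def conv2d_params(in_channels, out_channels, kernel_size, bias=True):
    """Number of learnable parameters in a standard 2D convolution."""
    weights = in_channels * out_channels * kernel_size * kernel_size
    return weights + (out_channels if bias else 0)

# (in_channels, out_channels, kernel_size) for the three layers in the text.
layers = [(3, 16, 3), (16, 32, 3), (32, 1, 3)]
per_layer = [conv2d_params(*layer) for layer in layers]
total = sum(per_layer)
print(per_layer, total)
```

The three layers contribute 448, 4640, and 289 parameters respectively, which agrees with the total given above only when biases are counted.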
During reconstruction, the source-phase spectrum $\phi_i$ is kept unchanged, and the enhanced patch is obtained as
$$\tilde{P}_i = \Re\big(\mathcal{F}^{-1}(\tilde{A}_i e^{\mathrm{j}\phi_i})\big),$$
where $\mathcal{F}^{-1}(\cdot)$ denotes the inverse Fourier transform, $\Re(\cdot)$ extracts the real part, and $\tilde{P}_i$ is the reconstructed enhanced patch. The enhanced patch is then pasted back into the original lesion region to form the augmented image $x_i^{\mathrm{aug}}$:
$$x_i^{\mathrm{aug}} = x_i \odot (1 - \bar{M}_i) + \tilde{P}_i^{\mathrm{box}} \odot \bar{M}_i,$$
where $\bar{M}_i$ is the binary lesion mask in image coordinates and $\tilde{P}_i^{\mathrm{box}}$ is the reconstructed patch resized back to the original crop box. This mask-constrained reconstruction keeps the non-lesion background unchanged and restricts the frequency-induced appearance change to the annotated lesion region.
Since LPFA does not geometrically transform the label, the original mask is reused for the augmented sample. Source-phase retention and mask-constrained reconstruction vary the lesion appearance within the annotated region while preserving the label coordinates. The mask IoU, mask Dice, and Boundary IoU between the original and reused labels are therefore 100.0% by construction, and the centroid shift is 0.0 pixels. The LPFA procedure is illustrated in Figure 4.
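The amplitude mixing, source-phase retention, and mask-constrained paste-back can be sketched with NumPy. This is a simplified illustration rather than the authors' implementation: the crop box is taken to be the whole image so that cropping and resizing are omitted, the frequency-selection network is replaced by random placeholder logits, and all function and variable names are hypothetical.

```python
import numpy as np

def lpfa_patch(source, partner, weight_logits, w_min=0.2, w_max=0.8):
    """Mix the amplitude spectra of two patches while keeping the source phase."""
    spec_src = np.fft.fft2(source, axes=(0, 1))
    spec_par = np.fft.fft2(partner, axes=(0, 1))
    amp_src, phase_src = np.abs(spec_src), np.angle(spec_src)
    amp_par = np.abs(spec_par)

    # Bounded weight map: sigmoid of the logits rescaled into [w_min, w_max].
    weights = w_min + (w_max - w_min) / (1.0 + np.exp(-weight_logits))
    amp_mix = weights * amp_src + (1.0 - weights) * amp_par

    # Recombine the fused amplitude with the unchanged source phase.
    mixed = np.fft.ifft2(amp_mix * np.exp(1j * phase_src), axes=(0, 1))
    return np.real(mixed)

def paste_back(image, patch, mask):
    """Replace only the masked lesion pixels with the augmented patch."""
    m = mask[..., None].astype(image.dtype)   # broadcast the mask over channels
    return image * (1.0 - m) + patch * m

rng = np.random.default_rng(0)
image = rng.random((16, 16, 3))               # source crop, values in [0, 1)
partner = rng.random((16, 16, 3))             # same-class partner crop
logits = rng.standard_normal((16, 16, 1))     # placeholder for the learned weight logits
mask = np.zeros((16, 16))
mask[4:10, 5:12] = 1                          # binary lesion mask

patch = lpfa_patch(image, partner, logits)
augmented = paste_back(image, patch, mask)
print(np.allclose(augmented[mask == 0], image[mask == 0]))
```

Because the paste-back multiplies the augmented patch by the mask, pixels outside the lesion are returned unchanged, which is why the original label can be reused without modification.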

3.3. Confidence-Guided Large-Kernel Encoding (CGLE)

After LPFA has enlarged appearance variation at the data level, the main question inside $\mathcal{N}_{\Theta}$ becomes how to allocate contextual modelling capacity more effectively to difficult regions. CGLE takes TransUNet as the basic framework and feeds coarse prediction confidence back into selected middle and late encoder blocks to regulate the contribution of a large-kernel branch. Regions with low coarse confidence often correspond to blurred boundaries, texture interference, low contrast, or class ambiguity. These regions require broader contextual aggregation, whereas high-confidence regions can retain stronger local modelling.
Let the coarse segmentation logits obtained from an initial forward pass be denoted by $Z^{(0)} \in \mathbb{R}^{H \times W \times K}$, where $H$ and $W$ are the image height and width and $K$ is the number of classes. For a pixel location $p$, the class confidence is defined as
$$s(p) = \max_{k \in \{0, \ldots, K-1\}} \mathrm{Softmax}\big(Z^{(0)}(p)\big)_k,$$
where $s(p) \in [0, 1]$ is the maximum class confidence at location $p$. This confidence map provides a unified confidence source for both the encoder and the decoder.
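The maximum-softmax confidence can be illustrated per pixel in plain Python. The function names and the toy logits are hypothetical; the subtraction of the largest logit is a standard numerical-stability step and not something the source specifies.

```python
import math

def max_softmax_confidence(logits):
    """Return the maximum softmax probability for one pixel's class logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]   # shift for numerical stability
    return max(exps) / sum(exps)

def confidence_map(logit_map):
    """Apply the per-pixel confidence to an H x W grid of K-dimensional logits."""
    return [[max_softmax_confidence(pixel) for pixel in row] for row in logit_map]

# One row with an ambiguous pixel and a confident pixel, K = 7 classes.
ambiguous = [0.0] * 7
confident = [8.0] + [0.0] * 6
scores = confidence_map([[ambiguous, confident]])
print([round(s, 4) for s in scores[0]])
```

With seven classes the confidence of a completely ambiguous pixel is 1/7, so in practice the map takes values between 1/7 and 1 rather than spanning the full unit interval.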
The confidence cue is generated by an internal coarse branch before final refinement. The computation sequence is as follows:
(i) The image is embedded and first encoded without confidence guidance;
(ii) The decoder produces a coarse feature map without CFDR correction;
(iii) The segmentation head maps this coarse feature to logits $Z^{(0)}$, from which $s$ is computed;
(iv) $s$ is detached from the computation graph and resized to the corresponding encoder and decoder resolutions;
(v) The encoder is run with CGLE guidance, and the decoder is then refined with CFDR to obtain the final prediction.
This produces one coarse-to-refined evaluation sequence for each input image, and the additional coarse branch is included in the reported computational measurements.
In the $t$th modified encoder block, the input token feature is denoted by $h_t$. The original TransUNet self-attention path is first retained as
$$a_t = h_t + \mathrm{MSA}\big(\mathrm{LN}(h_t)\big),$$
where $\mathrm{MSA}(\cdot)$ denotes multi-head self-attention and $a_t$ is the attention-updated token feature. CGLE is then attached as a residual contextual adapter after this attention path. After layer normalisation, the feature is sent to a small-kernel branch and a large-kernel branch:
$$v_t^{sm} = \mathrm{DWConv}_{k_{sm}}\big(\mathrm{LN}(a_t)\big), \quad v_t^{lg} = \mathrm{DWConv}_{k_{lg}, d_{lg}}\big(\mathrm{LN}(a_t)\big),$$
where $k_{sm}$ and $k_{lg}$ denote the kernel sizes of the small-kernel and large-kernel branches, $d_{lg}$ denotes the dilation rate of the large-kernel branch, and $v_t^{sm}$ and $v_t^{lg}$ are the outputs of the two branches.
In the implementation, $k_{sm} = 5$, $k_{lg} = 7$, and $d_{lg} = 2$, giving an effective large-kernel receptive field of $(7 - 1) \times 2 + 1 = 13$ pixels on the corresponding feature map. CGLE is inserted as a residual contextual adapter in the middle-to-late transformer blocks indexed 4–9, while the original TransUNet attention and MLP paths are retained. These values were selected from preliminary validation trials over $\{5, 7, 9\}$ large-kernel sizes and $\{1, 2\}$ dilation rates, where 7 with dilation 2 provided the best balance between boundary accuracy and computational cost.
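The receptive-field arithmetic above follows the usual relation for a dilated kernel, (k − 1) × d + 1. The short sketch below evaluates it for the candidate grid mentioned in the text; the function name `effective_kernel` is hypothetical.

```python
def effective_kernel(kernel_size: int, dilation: int) -> int:
    """Effective extent of a dilated kernel: (k - 1) * d + 1."""
    return (kernel_size - 1) * dilation + 1

# Candidate grid from the preliminary trials: kernel sizes {5, 7, 9}, dilations {1, 2}.
candidates = {(k, d): effective_kernel(k, d) for k in (5, 7, 9) for d in (1, 2)}
selected = candidates[(7, 2)]
```

Under this relation the selected setting spans 13 positions, between the 9 positions of a dilated 5-tap kernel and the 17 positions of a dilated 9-tap kernel.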
To combine structural variation with semantic uncertainty, a local structure descriptor $r_t$ is first defined. Let $\bar{a}_t$ denote the channel-averaged feature map obtained after restoring $a_t$ to the spatial resolution of the $t$th block. Then
$$r_t = \eta_t\big(\bar{a}_t - \mathrm{AvgPool}(\bar{a}_t)\big),$$
where $\eta_t(\cdot)$ denotes a lightweight projection layer and $r_t$ is the structural descriptor map for block $t$. In parallel, the confidence map in image space is resized to the spatial resolution of the $t$th block to obtain an uncertainty map
$$u_t = 1 - \rho_t(s),$$
where $\rho_t(\cdot)$ denotes the resizing operation from image space to the resolution of block $t$, and $u_t$ is the uncertainty map at that scale. The structural descriptor and the uncertainty map are concatenated along the channel dimension and transformed into a gating map:
$$\alpha_t = \sigma\big(\psi_t([r_t, u_t])\big),$$
where $[\cdot, \cdot]$ denotes channel concatenation, $\psi_t(\cdot)$ denotes a projection layer, and $\alpha_t \in [0, 1]$ is the gating weight map.
Once $\alpha_t$ is obtained, the two convolution branches are fused in a position-adaptive manner:
$$m_t = \mathrm{PWConv}\Big(\mathrm{GELU}\big((1 - \alpha_t) \odot v_t^{sm} + \alpha_t \odot v_t^{lg}\big)\Big),$$
where $m_t$ is the fused contextual feature. The contextual adapter is added residually to the attention output, and the original MLP path is then applied:
$$\tilde{a}_t = a_t + m_t, \quad h_{t+1} = \tilde{a}_t + \mathrm{MLP}\big(\mathrm{LN}(\tilde{a}_t)\big),$$
where $h_{t+1}$ is the output feature of block $t$. When $\alpha_t$ is large, the large-kernel branch receives greater weight at that location; when $\alpha_t$ is small, the small-kernel branch remains dominant.
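The position-adaptive blending can be sketched in NumPy as follows. This is only the gated mixture of the two branch outputs; the depth-wise convolutions, the gate projection, GELU, and the point-wise convolution are omitted. The names `gated_blend` and `gate_logits` are hypothetical, with `gate_logits` standing in for the pre-sigmoid output of the gate projection.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_blend(v_sm, v_lg, gate_logits):
    """(1 - alpha) * v_sm + alpha * v_lg with alpha = sigmoid(gate_logits).

    v_sm, v_lg  : (H, W, C) outputs of the small- and large-kernel branches
    gate_logits : (H, W) pre-sigmoid gate values (stand-in for the gate projection)
    """
    alpha = sigmoid(gate_logits)[..., None]  # broadcast the gate over channels
    return (1.0 - alpha) * v_sm + alpha * v_lg, alpha[..., 0]

# Toy branches: the small branch outputs zeros, the large branch outputs ones,
# so the blended value at each position equals the gate weight.
v_sm = np.zeros((2, 2, 4))
v_lg = np.ones((2, 2, 4))
gate_logits = np.array([[-20.0, 0.0], [0.0, 20.0]])
blended, alpha = gated_blend(v_sm, v_lg, gate_logits)
```

With these toy inputs, positions with a strongly negative gate value stay close to the small-kernel output, positions with a strongly positive value follow the large-kernel output, and a gate value of zero gives an equal mixture.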

3.4. Confidence-Filtered Decoder Refinement (CFDR)

CGLE addresses the question of where broader context is needed in the encoder. CFDR then addresses a related question in the decoder: where stronger correction is needed during reconstruction. CFDR uses the same coarse confidence source as CGLE, so encoder-side context allocation and decoder-side feature correction are guided by a consistent cue. Because coarse prediction errors are usually concentrated in low-confidence regions, the decoder does not apply the same correction strength to all positions. Instead, a confidence-derived filtering map is used to regulate the spatial range and strength of the correction residual.
At decoder stage $q$, let $d_q$ denote the local decoder feature after feature fusion, and let $g_q$ denote the higher-level semantic feature resized to the same spatial size as $d_q$. The correction residual is first generated by
$$c_q = \varphi_q([d_q, g_q]),$$
where $\varphi_q(\cdot)$ denotes a lightweight convolutional projector and $c_q$ is the correction residual feature at decoder stage $q$.
The image-space confidence map is then resized to the spatial size of decoder stage $q$, and a filtering weight map is defined as
$$w_q = \big(1 - \rho_q(s)\big)^{\gamma},$$
where $\rho_q(\cdot)$ denotes the resizing operation that maps the confidence map to the resolution of decoder stage $q$, $w_q$ is the filtering weight map, and $\gamma > 0$ is an exponent that controls filtering strength. Because $s$ is the maximum class confidence, lower-confidence regions produce larger values in $w_q$.
The exponent was set to $\gamma = 2.0$ in all reported experiments. This setting suppresses unnecessary residual correction in high-confidence regions while retaining a strong response for uncertain boundaries and fragmented lesion regions.
The refined decoder feature is finally written as
$$\tilde{d}_q = d_q + w_q \odot c_q,$$
where $\tilde{d}_q$ is the corrected decoder feature. This equation means that the decoder correction strength is modulated by $w_q$ instead of being applied uniformly. Low-confidence regions therefore receive stronger correction, while high-confidence regions are less disturbed.
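A small numerical sketch shows how the exponent shapes the correction. It assumes the confidence map has already been resized to the decoder resolution, and the function name `cfdr_refine` is hypothetical. With an exponent of 2.0, a pixel with confidence 0.95 receives a weight of 0.0025, whereas a pixel with confidence 0.50 receives 0.25, so the residual is applied about one hundred times more strongly at the uncertain pixel.

```python
import numpy as np

def cfdr_refine(d_q, c_q, s_q, gamma=2.0):
    """Confidence-filtered residual: d_q + w_q * c_q, with w_q = (1 - s_q) ** gamma.

    d_q, c_q : (H, W, C) decoder feature and correction residual
    s_q      : (H, W) confidence map at the decoder resolution
    """
    w_q = (1.0 - s_q) ** gamma
    return d_q + w_q[..., None] * c_q, w_q

# Toy example: one confident pixel and one uncertain pixel.
s_q = np.array([[0.95, 0.50]])
d_q = np.zeros((1, 2, 3))
c_q = np.ones((1, 2, 3))
refined, w_q = cfdr_refine(d_q, c_q, s_q)
```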

3.5. Joint Optimisation

The entire framework is optimised in an end-to-end manner. The objective function is written as
$$\min_{\Theta, \Phi} \; \mathbb{E}_{(x, y)} \Big[ \mathcal{L}_{seg}\big(\mathcal{N}_\Theta(\mathcal{A}_\Phi(x)), y\big) \Big],$$
where $(x, y)$ denotes a training sample and its corresponding label, $\Theta$ denotes the parameters of the segmentation network $\mathcal{N}_\Theta$, $\Phi$ denotes the parameters of the augmentation operator $\mathcal{A}_\Phi$, and $\mathcal{L}_{seg}(\cdot)$ denotes the segmentation loss. This objective indicates that LPFA and the segmentation network are trained jointly under the same segmentation target.
The segmentation loss is defined as a weighted combination of cross-entropy loss and Dice loss:
$$\mathcal{L}_{seg} = \lambda_{ce} \mathcal{L}_{ce} + \lambda_{dice} \mathcal{L}_{dice},$$
where $\mathcal{L}_{ce}$ denotes cross-entropy loss, $\mathcal{L}_{dice}$ denotes Dice loss, and $\lambda_{ce}$ and $\lambda_{dice}$ are the corresponding balancing coefficients. The balancing coefficients were fixed as $\lambda_{ce} = 0.5$ and $\lambda_{dice} = 0.5$ in all experiments; therefore, the exact training objective was $\mathcal{L}_{seg} = 0.5 \mathcal{L}_{ce} + 0.5 \mathcal{L}_{dice}$. Here, $\mathcal{L}_{ce}$ is the pixel-wise multi-class cross-entropy, and $\mathcal{L}_{dice}$ is the standard soft Dice loss calculated from the predicted softmax probabilities and the one-hot segmentation masks. No additional annotation-level supervision is imposed on the frequency-selection network $F_\Phi$; the bounded frequency mixture defines the augmentation range, and $F_\Phi$ is updated jointly through the overall segmentation objective.
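The combined loss can be sketched for a single image in NumPy. Several details are assumptions made for illustration and are not stated in the text: Dice is averaged over all classes, a small smoothing constant `eps` is added to the Dice ratio, and the loss is computed per image instead of per batch. The name `seg_loss` is hypothetical.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def seg_loss(logits, labels, lam_ce=0.5, lam_dice=0.5, eps=1e-6):
    """lam_ce * cross-entropy + lam_dice * soft Dice loss.

    logits : (H, W, K) raw class scores
    labels : (H, W) integer class indices
    """
    k = logits.shape[-1]
    probs = softmax(logits)
    onehot = np.eye(k)[labels]                                   # (H, W, K)
    ce = -np.mean(np.sum(onehot * np.log(probs + 1e-12), axis=-1))
    inter = np.sum(probs * onehot, axis=(0, 1))                  # per-class overlap
    denom = np.sum(probs, axis=(0, 1)) + np.sum(onehot, axis=(0, 1))
    dice = 1.0 - np.mean((2.0 * inter + eps) / (denom + eps))    # assumed: mean over classes
    return lam_ce * ce + lam_dice * dice

# Toy check: logits that agree with the labels versus logits that disagree.
labels = np.array([[0, 1], [2, 1]])
good_logits = 10.0 * np.eye(3)[labels]
bad_logits = 10.0 * np.eye(3)[(labels + 1) % 3]
loss_good = seg_loss(good_logits, labels)
loss_bad = seg_loss(bad_logits, labels)
```

The cross-entropy term penalises per-pixel misclassification, while the Dice term is driven by region overlap and is therefore less sensitive to the imbalance between lesion and background pixels.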

4. Experiments

4.1. Experimental Setup

All experiments were conducted on a local Intel-based workstation equipped with an NVIDIA GeForce RTX 4060 GPU with 8 GB of memory. The input resolution was 512 × 512 , the batch size was 16, and the total training length was 200 epochs. All compared methods used the same 8:1:1 train–validation–test split and evaluation protocols as much as possible. Minor implementation-level adjustments were made when necessary to ensure stable training under the same memory budget, while the input size, epoch budget, and evaluation metrics were kept consistent.
For the TransUNet baseline and the proposed method, the local implementation followed stochastic gradient descent with momentum 0.9 and weight decay $10^{-4}$. The initial learning rate was set to 0.01 and updated with polynomial decay during training, while the effective lower bound was kept at $10^{-4}$ to avoid excessive attenuation in the late stage. The TransUNet training pipeline used eight data-loading workers. For augmentation-related experiments, geometric, photometric, masking, corruption-style, RandAugment, FDA, and LPFA variants were implemented as online training transformations under the same original-image split; validation and test images were not augmented. In LPFA, the augmentation probability was set to $p_{aug} = 0.5$, the warm-up epoch was set to $e_0 = 10$, and the lesion patch size used for frequency-domain processing was set to $128 \times 128$.
The warm-up value was selected after a small sensitivity check over $e_0 \in \{0, 10, 20\}$ using the same split and training protocol. The corresponding mIoU/Boundary IoU values were 82.2%/73.1%, 82.6%/73.5%, and 82.4%/73.2%, respectively. These differences are smaller than the main improvement over TransUNet, indicating that the method is not highly sensitive to this value. The setting $e_0 = 10$ was retained because it provides a short stabilisation stage before lesion-centred frequency augmentation is activated.
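The learning-rate schedule described in the setup (polynomial decay from 0.01 with a floor of 0.0001) can be sketched as below. The decay power of 0.9 and the per-epoch update granularity are assumptions, since the text does not state them; the function name `poly_lr` is hypothetical.

```python
def poly_lr(step, total_steps, base_lr=0.01, min_lr=1e-4, power=0.9):
    """Polynomial decay from base_lr, clamped below at min_lr.

    power=0.9 is an assumed value; the source does not specify it.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    return max(base_lr * (1.0 - progress) ** power, min_lr)

# One value per epoch over the 200-epoch budget (assumed per-epoch updates).
schedule = [poly_lr(epoch, 200) for epoch in range(201)]
```

The clamp only matters near the end of training, where an unclamped polynomial schedule would otherwise approach zero.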
To improve reproducibility, the baseline TransUNet and the proposed method were repeated with five random seeds: 2021, 2022, 2023, 2024, and 2025. The data split was kept fixed, while weight initialisation, data shuffling, and online augmentation randomness followed the corresponding seed. The implementation used Python 3.10, PyTorch 2.0.1, CUDA 11.8, and cuDNN 8.7 on the same RTX 4060 GPU. Random seeds were set for Python, NumPy, and PyTorch, and deterministic cuDNN behaviour was enabled when it did not prevent model execution. Mean values, standard deviations, and 95% confidence intervals are reported for the repeated runs. In addition, paired bootstrap testing with 10,000 resamples was used to compare the proposed method with TransUNet on the same test split.
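A paired bootstrap comparison of this kind can be sketched as follows. The resampling unit is assumed to be the per-image score on the shared test split, which the text does not specify, and the test-set size of 175 follows from the 8:1:1 split of 1750 images. The scores generated here are synthetic and are not results from the study; the function name `paired_bootstrap` is hypothetical.

```python
import numpy as np

def paired_bootstrap(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Resample paired per-image differences (a - b) with replacement.

    Returns the observed mean difference, a 95% percentile interval, and the
    fraction of resampled mean differences that are <= 0.
    """
    diff = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, diff.size, size=(n_resamples, diff.size))
    boot_means = diff[idx].mean(axis=1)
    low, high = np.percentile(boot_means, [2.5, 97.5])
    return diff.mean(), (low, high), float(np.mean(boot_means <= 0.0))

# Synthetic per-image IoU values, for illustration only.
rng = np.random.default_rng(1)
baseline = rng.uniform(0.6, 0.9, size=175)
proposed = baseline + rng.uniform(0.0, 0.08, size=175)
mean_diff, ci, frac_nonpositive = paired_bootstrap(proposed, baseline)
```

Resampling the paired differences, instead of resampling each method independently, keeps image-level difficulty matched between the two methods.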

4.2. Evaluation Metrics

For the multi-class lesion segmentation task, mIoU, mDice, Precision, Recall, and Boundary IoU were used as the main accuracy metrics, following common segmentation evaluation practice and boundary-sensitive assessment criteria [100,101]. mIoU and mDice evaluate region-level overlap, Precision and Recall reflect false-positive and false-negative behaviour, and Boundary IoU evaluates boundary-sensitive mask quality. Params, FLOPs, and FPS were further reported to describe model complexity and inference efficiency.
Let $K$ denote the number of evaluated semantic classes, including the background class 0, and let $TP_c$, $FP_c$, and $FN_c$ denote the true positives, false positives, and false negatives for class $c$, respectively. The intersection over union for class $c$ is defined as
$$\mathrm{IoU}_c = \frac{TP_c}{TP_c + FP_c + FN_c},$$
and the mean intersection over union is defined as
$$\mathrm{mIoU} = \frac{1}{K} \sum_{c=0}^{K-1} \mathrm{IoU}_c.$$
The Dice coefficient for class $c$ is defined as
$$\mathrm{Dice}_c = \frac{2\,TP_c}{2\,TP_c + FP_c + FN_c},$$
and the mean Dice score is defined as
$$\mathrm{mDice} = \frac{1}{K} \sum_{c=0}^{K-1} \mathrm{Dice}_c.$$
Similarly, Precision and Recall for class $c$ are defined as
$$\mathrm{Precision}_c = \frac{TP_c}{TP_c + FP_c}, \quad \mathrm{Recall}_c = \frac{TP_c}{TP_c + FN_c},$$
where $\mathrm{Precision}_c$ measures how many pixels predicted as class $c$ are correct, and $\mathrm{Recall}_c$ measures how many ground-truth pixels of class $c$ are recovered. To assess boundary quality more directly, Boundary IoU is further computed on the boundary bands of prediction and ground truth:
$$\mathrm{BIoU}_c = \frac{\lvert B(\hat{Y}_c, d) \cap B(Y_c, d) \rvert}{\lvert B(\hat{Y}_c, d) \cup B(Y_c, d) \rvert},$$
where $B(\cdot, d)$ denotes the boundary band generated with width $d$ from the corresponding mask. Following the Boundary IoU protocol, $d$ was set to 2% of the image diagonal and rounded up to the next integer. For the $512 \times 512$ evaluation images, $512\sqrt{2} \times 0.02 \approx 14.48$, giving $d = 15$ pixels after ceiling. The same value was used for all models and ablation settings. In the present experiments, all overlap and detection-oriented metrics are reported in class-averaged form over $c = 0, \ldots, K-1$.
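The region-level metrics and the boundary-band width can be checked with a short sketch. The handling of classes absent from both prediction and ground truth (returned as NaN here) is an assumption, and the names `class_metrics` and `boundary_width` are hypothetical. The Boundary IoU band extraction itself is not implemented here.

```python
import math
import numpy as np

def class_metrics(pred, target, num_classes):
    """Per-class IoU, Dice, Precision, and Recall from integer label maps."""
    out = {}
    for c in range(num_classes):
        tp = int(np.sum((pred == c) & (target == c)))
        fp = int(np.sum((pred == c) & (target != c)))
        fn = int(np.sum((pred != c) & (target == c)))
        nan = float("nan")  # assumed convention for undefined ratios
        out[c] = {
            "iou": tp / (tp + fp + fn) if tp + fp + fn else nan,
            "dice": 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else nan,
            "precision": tp / (tp + fp) if tp + fp else nan,
            "recall": tp / (tp + fn) if tp + fn else nan,
        }
    return out

def boundary_width(height, width, ratio=0.02):
    """Boundary band width: a fraction of the image diagonal, rounded up."""
    return math.ceil(ratio * math.hypot(height, width))

# Toy two-class example: one lesion pixel is missed by the prediction.
target = np.array([[0, 0, 1, 1],
                   [0, 0, 1, 1]])
pred = np.array([[0, 0, 0, 1],
                 [0, 0, 1, 1]])
m = class_metrics(pred, target, num_classes=2)
miou = np.mean([m[c]["iou"] for c in m])
d = boundary_width(512, 512)
```

In the toy example the lesion class has an IoU of 0.75 and the background class 0.80, giving a class-averaged value of 0.775, and the band width for a 512 × 512 image evaluates to 15 pixels.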

4.3. Comparative Experiments

The comparison set was selected to cover representative technical routes for fish lesion segmentation. U-Net with a ResNet-50 encoder [86] and DeepLabV3+ with a ResNet-50 backbone [88] represent classic encoder–decoder segmentation and atrous multi-scale convolution, respectively. TransUNet with R50-ViT-B/16 [90] is the architecture-specific baseline because the proposed method modifies TransUNet. Swin-Unet with Swin-Tiny [91] provides a hierarchical transformer comparison. DINOv2 with ViT-B/16 [102] and SAM with a ViT-B image encoder [103] are included to examine the transfer ability of large-scale pretrained visual representations. YOLO11x-seg [104] represents detection-driven segmentation, which is commonly adopted in agricultural and aquaculture vision.
For the foundation-model baselines, the public pretrained image encoders were used at their released model scale. To avoid unstable end-to-end fine-tuning on the small training split, the DINOv2 and SAM image encoders were frozen, and only task-specific segmentation heads were trained with the same mask labels. DINOv2 used the ViT-B/16 feature encoder followed by a lightweight upsampling segmentation head. SAM was evaluated in a supervised adaptation setting with its ViT-B image encoder and a task-specific mask head. The reported Params, FLOPs, and FPS still include the full image encoder during inference. For YOLO11x-seg, predicted instance masks were converted to a semantic map by assigning each instance mask to its predicted lesion class and resolving overlaps by confidence score. YOLO11x-seg therefore serves as an applied detection-driven segmentation reference, while TransUNet remains the architecture-matched baseline.
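The conversion of instance masks to a semantic map can be sketched as follows. One simple way to resolve overlaps by confidence is to paint instances in ascending score order so that the highest-scoring instance is written last; this is an illustrative choice and not necessarily the authors' procedure. The function name `instances_to_semantic` and the toy masks are hypothetical.

```python
import numpy as np

def instances_to_semantic(masks, class_ids, scores, shape):
    """Merge instance masks into one label map; the highest score wins overlaps.

    masks     : list of (H, W) boolean arrays
    class_ids : list of lesion class indices (>= 1; 0 is background)
    scores    : list of confidence scores
    """
    semantic = np.zeros(shape, dtype=np.int64)   # background label is 0
    for i in np.argsort(scores):                 # ascending confidence
        semantic[masks[i]] = class_ids[i]        # later writes overwrite earlier ones
    return semantic

# Two overlapping toy instances; the overlap at (2, 2) goes to the higher score.
a = np.zeros((4, 4), dtype=bool)
a[0:3, 0:3] = True
b = np.zeros((4, 4), dtype=bool)
b[2:4, 2:4] = True
semantic = instances_to_semantic([a, b], class_ids=[1, 2], scores=[0.9, 0.6], shape=(4, 4))
```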
The main experiments evaluate whether the proposed framework improves lesion delineation under a unified protocol. Table 2 reports class-averaged overlap, detection-oriented, and boundary-sensitive metrics together with model complexity and inference efficiency. Reporting these quantities together is necessary because a method that improves the average mask score but fails to preserve lesion boundaries would have limited value for disease assessment.
The proposed method achieves the highest scores among the compared methods, with 82.6% mIoU, 90.9% mDice, 93.2% Precision, 90.1% Recall, and 73.5% BIoU. TransUNet records 78.4% mIoU, 88.8% mDice, 86.5% Recall, and 68.8% BIoU, while YOLO11x-seg records 78.2%, 87.2%, 84.4%, and 66.8% on the same four metrics. The higher Recall and BIoU indicate better lesion recovery and boundary delineation, not only fewer false positives.
To clarify the reported total inference cost, the proposed inference path was decomposed into the final TransUNet path, the coarse confidence branch, the CGLE adapters, and the CFDR modules, as shown in Table 3.
The repeated-run results in Table 4 show that the improvement is larger than the observed seed-level variation. The proposed method improves mean mIoU by 4.2 percentage points and mean Boundary IoU by 4.7 percentage points over TransUNet, while the corresponding standard deviations remain below 0.5 percentage points.
The proposed method is not the fastest model in Table 2. DINOv2 and Swin-Unet obtain higher FPS, and Swin-Unet has fewer parameters and FLOPs. However, their mask quality is lower, particularly in Boundary IoU. SAM has strong general segmentation capacity, but its computational cost is high in this setting, with 976.83 G FLOPs and 3.93 FPS. By comparison, the proposed method requires 109.02 G FLOPs and runs at 34.69 FPS, making it more suitable for practical single-GPU inference. Compared with plain TransUNet, the proposed method increases parameters from 105.28 M to 107.52 M and FLOPs from 58.43 G to 109.02 G, but it obtains higher Recall and Boundary IoU, which is consistent with the objective of improving lesion recovery and boundary quality.

4.4. Class-Wise Performance Analysis

The class-level comparison is visualised in Figure 5 for four representative models: DeepLabV3+, TransUNet, YOLO11x-seg, and the proposed method. These methods represent convolutional context modelling, hybrid CNN–transformer modelling, and detection-driven segmentation. The figure shows whether the overall gains in Table 2 persist across disease categories rather than being dominated by the background class.
The category-level comparison shows that the proposed method obtains the highest scores across all six lesion categories and the background class in this comparison. Table 5 reports the exact class-wise values corresponding to Figure 5. When only the six lesion categories are averaged, the proposed model reaches 81.6% mean IoU, compared with 77.1% for TransUNet, 76.8% for YOLO11x-seg, and 74.2% for DeepLabV3+. The corresponding lesion-category mean Dice is 90.4% versus 88.1% for TransUNet. This pattern indicates that the global gain in Table 2 is not dominated by the comparatively easier background class.
The largest IoU differences relative to TransUNet are observed in Aeromoniasis (+6.3), bacterial gill disease (+6.1), parasitic disease (+5.9), and white-tail viral disease (+5.7). These categories are often affected by fragmented lesion morphology, weak contrast, reflection, and texture interference. The class-wise results are therefore consistent with the design goal of assigning more modelling capacity to difficult regions while improving lesion appearance robustness.

4.5. Ablation Study

The ablation study separates the three core components of the proposed method: LPFA, CGLE, and CFDR. The purpose is to determine whether the final improvement arises from one dominant component or from the interaction between data-level appearance expansion and confidence-guided network adaptation.
All three components contribute positively when they are added individually. LPFA gives the largest isolated gain, improving mIoU from 78.4% to 81.2% and BIoU from 68.8% to 71.9%, which is consistent with the later augmentation comparison. CGLE improves mIoU to 80.2% and BIoU to 71.1%, indicating that encoder-side confidence-guided context allocation is useful for weak-boundary lesion regions. CFDR produces a smaller but consistent gain, increasing BIoU to 70.5% and improving Recall from 86.5% to 87.1%.
The two-component settings show that the data-level and confidence-aware paths are mutually compatible. LPFA+CGLE reaches 81.8% mIoU and 72.6% BIoU, LPFA+CFDR reaches 81.6% mIoU and 72.4% BIoU, and CGLE+CFDR reaches 81.4% mIoU and 72.3% BIoU. The full model achieves 82.6% mIoU and 73.5% BIoU, giving the highest scores in this ablation setting. These results indicate that the three components are complementary. LPFA reduces appearance mismatch at the input level, CGLE reallocates contextual modelling in the encoder, and CFDR applies selective correction in the decoder.
The ablation results also indicate a practical complexity–performance trade-off. The CGLE+CFDR setting reaches 81.4% mIoU and 72.3% BIoU without using LPFA, and can therefore be considered when training-time simplicity is preferred. The full model adds LPFA and obtains a further 1.2 percentage-point mIoU gain and 1.2 percentage-point BIoU gain over CGLE+CFDR. Because LPFA is disabled during inference, this final gain does not increase inference FLOPs relative to the CGLE+CFDR architecture, although it does add training-time augmentation computation.

5. Discussion

The comparative and ablation results indicate that the proposed design improves both region-level overlap and boundary-sensitive segmentation quality. Table 2 shows simultaneous improvements in mIoU, mDice, Recall, and Boundary IoU. Table 6 further shows that the gain does not come from a single isolated component. LPFA operates at the data level, CGLE reallocates contextual modelling in the encoder, and CFDR performs selective correction in the decoder. The combined improvement in Recall and Boundary IoU is especially relevant for fish lesion segmentation. Missed lesion pixels and unstable boundaries directly affect lesion-area estimation and severity assessment.
The higher accuracy comes with a clear computational cost. Compared with TransUNet, the proposed method increases parameters from 105.28 M to 107.52 M, increases FLOPs from 58.43 G to 109.02 G, and reduces FPS from 55.22 to 34.69 on the local GPU. This throughput may be suitable for offline severity quantification or workstation-based monitoring, where image acquisition and manual review are usually slower than GPU inference. For embedded deployment on low-power edge devices, however, the full model may require pruning, quantisation, or a lighter backbone. A practical deployment strategy is to use the full model for high-accuracy inspection and the CGLE+CFDR variant or a compressed model for real-time edge screening.
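The reported cost figures can be put into relative terms with simple arithmetic. The sketch below uses only the numbers quoted above; converting FPS to per-image latency assumes that throughput was measured one image at a time, which the text does not state. The helper name `latency_ms` is hypothetical.

```python
def latency_ms(fps: float) -> float:
    """Average per-image latency implied by a throughput figure (assumes batch size 1)."""
    return 1000.0 / fps

# Figures quoted in the text for the local RTX 4060 GPU.
transunet = {"params_m": 105.28, "gflops": 58.43, "fps": 55.22}
proposed = {"params_m": 107.52, "gflops": 109.02, "fps": 34.69}

flops_ratio = proposed["gflops"] / transunet["gflops"]
param_increase_pct = 100.0 * (proposed["params_m"] / transunet["params_m"] - 1.0)
extra_latency_ms = latency_ms(proposed["fps"]) - latency_ms(transunet["fps"])
```

Under that assumption, the proposed model uses roughly 1.87 times the FLOPs of TransUNet for about 2.1% more parameters and adds on the order of 10.7 ms per image, which suggests that most of the overhead comes from the additional coarse pass and not from the added parameters.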
The effect of LPFA is further examined in Table 7 under the same TransUNet backbone. All compared augmentation strategies are applied online during training. Conventional geometric and photometric augmentation improves robustness in some cases, but not all perturbations are suitable for weak-boundary lesion masks. Noise and blur slightly reduce mIoU and BIoU, suggesting that corruption-style augmentation can obscure already ambiguous lesion borders. FDA improves appearance generalisation through frequency-domain transformation. However, unconstrained amplitude exchange may still introduce appearance changes that are not fully aligned with lesion geometry. LPFA raises mIoU from 78.4% to 81.2% and BIoU from 68.8% to 71.9% over the unaugmented baseline, matching the LPFA-only setting in Table 6. Relative to FDA, mIoU increases from 80.1% to 81.2%, and BIoU from 71.1% to 71.9%. These results indicate that lesion-preserving frequency manipulation is more suitable for this dataset than global corruption or unconstrained appearance perturbation.
The ablation results also clarify the roles of LPFA, CGLE, and CFDR. LPFA provides the largest single-module gain, indicating that lesion-preserving appearance diversification is important for this dataset. CGLE provides the strongest network-side single-module gain, suggesting that low-confidence regions benefit from broader encoder-side context. This pattern is consistent with the visual nature of fish lesions, where weak contrast and fish-scale texture may require surrounding anatomical context for separation. CFDR produces a smaller but stable improvement. It concentrates decoder correction on uncertain pixels instead of uniformly modifying all decoder features. The confidence map should therefore be interpreted as a coarse guidance cue rather than a direct supervision signal. Its value lies in providing a shared criterion for encoder-side contextual allocation and decoder-side correction.
Figure 6 visualises feature-activation distributions rather than direct mask predictions. YOLO11x-seg often activates broad fish-body regions, reflecting its object-level localisation tendency. DINOv2 and SAM provide general visual representations, but their activations are not always concentrated on weak-boundary or low-contrast fish lesions. In contrast, the proposed method shows more lesion-centred responses in bacterial gill disease, saprolegniasis, parasitic disease, and white-tail viral disease. This provides qualitative context for the Recall and Boundary IoU results.
A qualitative and quantitative review of the low-scoring test cases further clarifies the remaining failure modes. The lowest-IoU cases were mainly associated with three conditions. The first was very small lesions occupying less than 2% of the image. The second was severe specular reflection or water turbidity, which weakened the colour contrast between the lesion and healthy tissue. The third was occlusion or body bending, which broke the visible lesion boundary into disconnected fragments. In these cases, the model tended to under-segment thin lesion margins or produce small false-positive responses on scale texture and reflection. The mean IoU of the small-lesion subset was about 3.8 percentage points lower than the overall lesion-category mean. This indicates that tiny and fragmented symptoms remain the most difficult scenario. These observations suggest that the model is most reliable for visible body-surface lesions with continuous boundaries. Low-contrast early lesions should still be checked by human operators or follow-up imaging.
The proposed system also has practical and ethical boundaries. Automated lesion segmentation should be used as a decision-support tool rather than as the sole basis for diagnosis, medication, or fish removal. From an animal-welfare perspective, earlier lesion quantification may help reduce delayed intervention and unnecessary broad treatment, but false alarms may also lead to avoidable handling stress if the system is used without expert review. From an environmental perspective, better targeting of treatment and monitoring can reduce unnecessary chemical or antibiotic inputs, whereas high-frequency AI monitoring introduces energy and hardware costs. Therefore, deployment should consider model accuracy, device power consumption, operator oversight, data governance, and the local management protocol of the target production system.
Several limitations remain. The dataset is derived from public images, and the robustness analysis relies on controlled online augmentation. Therefore, cross-device, cross-waterbody, and cross-season generalisation in real farms requires further validation. No independent external farm dataset was available in the present study, and the model should therefore not be interpreted as fully validated across all aquaculture environments. The current results mainly demonstrate within-dataset lesion segmentation ability under a carefully controlled split. Future studies should collect production images from different farms, cameras, water qualities, seasons, and fish species to evaluate domain generalisation more directly. This highlights the ongoing necessity of constructing specialized datasets for interpreting complex ecological and environmental contexts [105,106]. A further direction is to combine lesion masks with water-quality and environmental sensing. Recent spatiotemporal reconstruction work shows how salinity-related environmental fields can be inferred from multisource observations [107]; in aquaculture, analogous integration of salinity, temperature, dissolved oxygen, and image-derived lesion trajectories could support multi-modal fish-health prediction. The proposed model also has higher FLOPs than plain TransUNet, although its computational cost remains much lower than SAM in this setting. For future deployment on resource-constrained aquaculture edge devices, low-rank model compression and memory-efficient streaming inference [108,109] remain important for reducing computational overhead. Extremely small lesions, severe specular reflection, and heavy occlusion may still reduce segmentation stability.

6. Conclusions

This study proposes a confidence-guided large-kernel TransUNet framework for fish lesion segmentation. The framework integrates LPFA, CGLE, and CFDR to address weak-boundary segmentation through lesion-preserving data augmentation, encoder-side contextual allocation, and decoder-side selective refinement. A shared coarse confidence map guides both large-kernel context modelling and decoder correction. Across repeated random-seed runs on the reannotated freshwater fish disease dataset, the proposed method achieves 82.6 ± 0.3% mIoU, 90.9 ± 0.2% mDice, and 73.5 ± 0.4% Boundary IoU, while the original TransUNet records 78.4 ± 0.4%, 88.8 ± 0.2%, and 68.8 ± 0.5%, respectively. The augmentation comparison, ablation study, and heatmap analysis further support lesion-preserving augmentation and confidence-aware context allocation for fish lesion segmentation in aquaculture images. At the same time, the method increases FLOPs relative to plain TransUNet and still requires external farm validation before direct deployment in heterogeneous production environments. Its most realistic near-term use is therefore decision support for lesion-area quantification and health monitoring, followed by lightweight adaptation for edge devices. Concrete future research will focus on integrating water-quality sensors for multi-modal health prediction, self-supervised domain adaptation to new fish species and farms, and temporal lesion tracking across video frames.

Author Contributions

Conceptualization, C.-T.Z., X.L. and H.W.; methodology, C.-T.Z.; software, C.-T.Z.; validation, C.-T.Z. and Y.-X.G.; formal analysis, C.-T.Z.; investigation, C.-T.Z. and Y.-X.G.; resources, X.L. and H.W.; data curation, C.-T.Z. and Y.-X.G.; writing—original draft preparation, C.-T.Z.; writing—review and editing, X.L. and H.W.; visualization, C.-T.Z. and Y.-X.G.; supervision, X.L. and H.W.; project administration, X.L. and H.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Bohara, K.; Joshi, P.; Acharya, K.P.; Ramena, G. Emerging technologies revolutionising disease diagnosis and monitoring in aquatic animal health. Rev. Aquac. 2024, 16, 836–854.
  2. Mandal, A.; Ghosh, A.R. Role of artificial intelligence (AI) in fish growth and health status monitoring: A review on sustainable aquaculture. Aquac. Int. 2024, 32, 2791–2820.
  3. Yao, M.; Huo, Y.; Tian, Q.; Zhao, J.; Liu, X.; Wang, R.; Xue, L.; Wang, H. FMRFT: Fusion mamba and DETR for query time sequence intersection fish tracking. Comput. Electron. Agric. 2025, 237, 110742.
  4. Islam, S.I.; Ahammad, F.; Mohammed, H. Cutting-edge technologies for detecting and controlling fish diseases: Current status, outlook, and challenges. J. World Aquac. Soc. 2024, 55, e13051.
  5. Li, D.; Li, X.; Wang, Q.; Hao, Y. Advanced techniques for the intelligent diagnosis of fish diseases: A review. Animals 2022, 12, 2938.
  6. Johny, T.K.; Swaminathan, T.R.; Sood, N.; Pradhan, P.K.; Lal, K.K. A panoptic review of techniques for finfish disease diagnosis: The status quo and future perspectives. J. Microbiol. Methods 2022, 196, 106477.
  7. Sun, H.; Wang, R.F.; Yu, R.; Sun, Y. Pest image recognition algorithm based on joint adversarial transfer learning. Measurement 2026, 278, 121620.
  8. Wu, M.; Zheng, K.; Chen, J.; Zhang, J.; Li, M.; Wu, S. Assessing elderly walkability to urban parks using mobility analysis and multi-source data: A case study of central Fuzhou, China. Sci. Rep. 2026, 16, 13685.
  9. Jiao, P.; Yan, P.; Zhang, J.; Zhao, Z.; Su, F.; Zhang, W. Mining user features with hyperbolic representations for diffusion prediction. Neural Netw. 2026, 200, 108855.
  10. Fu, J.; Wang, C.; Liu, M.; Li, X.; Liu, Y.; Shi, W.; Wang, R. HyperR3SNet: Leveraging hyperbolic space and vision foundation models for remote sensing semantic segmentation. IEEE Trans. Geosci. Remote Sens. 2026.
  11. Wang, R.F.; Tu, Y.H.; Li, X.C.; Chen, Z.Q.; Zhao, C.T.; Yang, C.; Su, W.H. An intelligent robot based on optimized yolov11l for weed control in lettuce. In Proceedings of the 2025 ASABE Annual International Meeting, American Society of Agricultural and Biological Engineers, Toronto, ON, Canada, 13–16 July 2025; p. 1.
  12. Gao, Y.; Wang, Y.; Yang, N.; Wang, Q.; Javadi, B.; Ai, Q.; Zhu, J. Multi-timescale distributed control for multi-energy virtual power plant clusters via cloud-edge collaboration. Appl. Energy 2026, 414, 127835.
  13. Gao, Y.; Li, S.; Xu, T.; Lakshminarayana, S.; Bu, S.; Gu, C.; Ai, Q. Distributed Model Predictive Control Strategy for Multi-energy Virtual Power Plant Based on Digital Twin. IEEE Trans. Smart Grid 2025, 17, 1971–1984.
  14. Chen, P.; Nie, X.; Ning, Y.; Zhang, Y. Learning Efficient and Adaptive Cross-Channel Dependencies for Weakly-Supervised Object Detection. IEEE Trans. Multimed. 2025, 27, 8954–8966.
  15. Wang, L.; Ma, Y.; Yan, Z.; Zhang, L.; Hu, Y.; Zhao, S. Giving or receiving: Impact of online socializing in online fitness community on physical activity and emotional state. Comput. Hum. Behav. 2025, 169, 108669.
  16. Di, X.; Cui, K.; Wang, R.F. Toward efficient uav-based small object detection: A lightweight network with enhanced feature fusion. Remote Sens. 2025, 17, 2235.
  17. Wang, L.; Bala, H.; Yan, L.; Guo, X. Physicians’ Contributions to Online Healthcare Platforms: Relative Effects of Herding Cues and Feedback Types. J. Manag. Inf. Syst. 2026, 43, 205–236.
  18. Wang, Z.; Zhou, H.; Liu, H. Physics-informed neural networks based digital volume correlation for displacement and strain measurements. Mech. Syst. Signal Process. 2026, 248, 113998.
  19. Weng, Y.; Yang, K.; Liu, Z.; He, W.; Tang, X. Lgvlm-miot: A lightweight generative visual-language model for multilingual iot applications. IEEE Internet Things J. 2025, 12, 13311–13326.
  20. Li, Z.; Sun, C.; Wang, H.; Wang, R.F. Hybrid optimization of phase masks: Integrating Non-Iterative methods with simulated annealing and validation via tomographic measurements. Symmetry 2025, 17, 530.
  21. Yang, T.; Qian, Z.; Hang, N.; Liu, M. S-PINN: Stabilized physics-informed neural networks for alleviating barriers between multi-level co-optimization. Comput. Methods Appl. Mech. Eng. 2025, 447, 118348.
  22. Lai, C.H.; Wu, T.E.; Wang, C.C. Enhancing Information Security in Smart Manufacturing Through Least Significant Bit Steganography in Engineering Drawings. J. Comput. Inf. Sci. Eng. 2025, 25, 091006. [Google Scholar] [CrossRef]
  23. Wang, R.F.; Xu, M.; Bauer, M.; Schardong, I.; Ma, X.; Chee, P.; Cui, K. Cott-adnet: Lightweight real-time cotton boll and flower detection under field conditions. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Tucson, AZ, USA, 6–10 March 2026; pp. 500–509. [Google Scholar]
  24. Jiang, R.; Tong, S.; Wu, J.; Hu, H.; Zhang, R.; Wang, H.; Zhao, Y.; Zhu, W.; Li, S.; Zhang, X. A novel EEG artifact removal algorithm based on an advanced attention mechanism. Sci. Rep. 2025, 15, 19419. [Google Scholar] [CrossRef] [PubMed]
  25. Li, M.; Zhong, J.; Chen, T.; Lai, Y.; Psounis, K. Eee-bench: A comprehensive multimodal electrical and electronics engineering benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 11–15 June 2025; pp. 13337–13349. [Google Scholar]
  26. Lv, J.; Wang, C.; Xie, L. Adaptive distributed observer design for nonlinear multiagent systems. Automatica 2026, 183, 112625. [Google Scholar] [CrossRef]
  27. Wang, R.F.; Xu, R.; Chee, P.W.; Wang, H.; Li, C. A review of visual navigation for agricultural robots in open fields and controlled environments. Comput. Electron. Agric. 2026, 248, 111754. [Google Scholar] [CrossRef]
  28. Xu, B.; Rang, G.; Xie, R.; Li, W.; Gong, D.; Fan, Z.; Yang, S.; He, J. A prediction approach based on long short-term memory networks for dynamic multiobjective optimization. Expert Syst. Appl. 2025, 283, 127792. [Google Scholar] [CrossRef]
  29. Lai, Y.; Zhong, J.; Li, M.; Zhao, S.; Li, Y.; Psounis, K.; Yang, X. Med-r1: Reinforcement learning for generalizable medical reasoning in vision-language models. IEEE Trans. Med. Imaging 2026, 45, 2727–2737. [Google Scholar] [CrossRef] [PubMed]
  30. Xu, J.; Tang, R.; Lv, P.; Yu, M.; Yu, G.; Chen, E. LTKT: Knowledge Tracing Based on Positive and Negative Learning Transfers. Tsinghua Sci. Technol. 2026, 31, 1894–1917. [Google Scholar] [CrossRef]
  31. Wang, R.F.; Su, W.H. The application of deep learning in the whole potato production Chain: A Comprehensive review. Agriculture 2024, 14, 1225. [Google Scholar] [CrossRef]
  32. Zuo, T.Y.; Di, K.; Li, P.; Jiang, Y. Autonomous Domain Adaptation Self-Optimization Approach for Cross-Domain Industrial Agents. ACM Trans. Intell. Syst. Technol. 2026, 17, 1–25. [Google Scholar] [CrossRef]
  33. Li, M.; Li, Q.; Wang, Y. Class balanced adaptive pseudo labeling for federated semi-supervised learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 16292–16301. [Google Scholar]
  34. Yang, C.Y.; Min, X.H.; Zhao, L.; Wang, Y.N. Real-time cutterhead torque prediction via cluster-based geological identification in karst shield tunneling. Tunn. Undergr. Space Technol. 2026, 174, 107736. [Google Scholar] [CrossRef]
  35. Wang, R.F.; Qu, H.R.; Su, W.H. From sensors to insights: Technological trends in image-based high-throughput plant phenotyping. Smart Agric. Technol. 2025, 12, 101257. [Google Scholar] [CrossRef]
  36. Lv, J.; Wang, C.; Xie, L. Distributed output-feedback leader-following consensus of nonlinear multiagent systems. Automatica 2025, 177, 112281. [Google Scholar] [CrossRef]
  37. Li, M.; Zhong, J.; Zhao, S.; Lai, Y.; Zhang, H.; Zhu, W.B.; Zhang, K. Think or not think: A study of explicit thinking in rule-based visual reinforcement fine-tuning. arXiv 2025, arXiv:2503.16188. [Google Scholar] [CrossRef]
  38. Gu, Y.; Zhu, L.; Li, J.H.; Zhao, G.Y. Preview MPC control of train active suspension based on Kalman filter disturbance estimation. Measurement 2026, 268, 120616. [Google Scholar] [CrossRef]
  39. Wang, R.F.; Qin, Y.M.; Zhao, Y.Y.; Xu, M.; Schardong, I.B.; Cui, K. RA-CottNet: A Real-Time High-Precision Deep Learning Model for Cotton Boll and Flower Recognition. AI 2025, 6, 235. [Google Scholar] [CrossRef]
  40. Xu, B.; Zheng, Y.; Li, W.; Gao, X.; Gong, D.; He, J.; Fan, Z. Handling multiobjective optimization problems with complex constraints: A constraints grouping-based approach. In IEEE Transactions on Systems, Man, and Cybernetics: Systems; IEEE: Piscataway, NJ, USA, 2025. [Google Scholar]
  41. Li, M.; Zhong, J.; Li, C.; Li, L.; Lin, N.; Sugiyama, M. Vision-language model fine-tuning via simple parameter-efficient modification. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing; Association for Computational Linguistics: Miami, FL, USA, 2024; pp. 14394–14410. [Google Scholar]
  42. Sun, H.; Xi, X.; Wu, A.Q.; Wang, R.F. ToRLNet: A Lightweight Deep Learning Model for Tomato Detection and Quality Assessment Across Ripeness Stages. Horticulturae 2025, 11, 1334. [Google Scholar] [CrossRef]
  43. Zuo, T.Y.; Li, P.; Di, K.; Jiang, Y.; Wang, Y.; Lim, B.H. Self-Organized Team Formation via Multi-Task Hedonic Games for Capability-Heterogeneous Human-Machine Agents. IEEE Trans. Autom. Sci. Eng. 2026, 23, 8722–8735. [Google Scholar] [CrossRef]
  44. Zeng, Y.; Yu, Z.; Jiang, D.; Zhang, W.; Hong, Y.; Hu, Z.; Luo, J.; Cui, K. Learning Where to Embed: Noise-Aware Positional Embedding for Query Retrieval in Small-Object Detection. arXiv 2026, arXiv:2604.15065. [Google Scholar]
  45. Zheng, X.; Yu, H.; Cui, H.; Sun, C.; Li, X.; Su, R.; Wei, L.; Zhou, J.; Wang, J.; Jin, Q. KG-CMI: Knowledge graph enhanced cross-Mamba interaction for medical visual question answering. IEEE Trans. Ind. Inform. 2026. early access. [Google Scholar] [CrossRef]
  46. Wang, R.F.; Zhao, C.T.; Tu, Y.H.; Chen, Z.Q.; Su, W.H. A Field-Adaptive Mechanical Weeding System Coupling Oscillating Pneumatic Mechanism With Deep Learning for Intra-Row Weed Control in Lettuce. J. Field Robot. 2026. Epub ahead of printing. [Google Scholar] [CrossRef]
  47. Wu, A.Q.; Li, K.L.; Song, Z.Y.; Lou, X.; Hu, P.; Yang, W.; Wang, R.F. Deep learning for sustainable aquaculture: Opportunities and challenges. Sustainability 2025, 17, 5084. [Google Scholar] [CrossRef]
  48. Al-Abri, S.; Keshvari, S.; Al-Rashdi, K.; Al-Hmouz, R.; Bourdoucen, H. Computer vision based approaches for fish monitoring systems: A comprehensive study. Artif. Intell. Rev. 2025, 58, 185. [Google Scholar] [CrossRef]
  49. Han, Y.S.; Zhao, S.L.; Chen, C.; Cui, K.; Hu, P.; Wang, R.F. SEAF-Net: A Sustainable and Lightweight Attention-Enhanced Detection Network for Underwater Fish Species Recognition. J. Mar. Sci. Eng. 2026, 14, 351. [Google Scholar] [CrossRef]
  50. Lu, J.Y.; Guo, Z.; Huang, W.T.; Bao, M.; He, B.; Li, G.; Lei, J.; Li, Y. Peptide-Graphene Logic Sensing System for Dual-Mode Detection of Exosomes, Molecular Information Processing and Protection. Talanta 2024, 267, 125261. [Google Scholar] [CrossRef]
  51. Fan, R.; Lu, J.Y.; Tian, Y.Q.; Wu, M.Y.; Huang, W.T.; Hu, S.B. Recent Advances in Aptamer-Based Biosensors for Fish Disease Diagnosis and Seafood Safety Detection. Talanta 2026, 298, 128828. [Google Scholar] [CrossRef]
  52. Ahmed, M.S.; Aurpa, T.T.; Azad, M.A.K. Fish Disease Detection Using Image Based Machine Learning Technique in Aquaculture. J. King Saud. Univ.-Comput. Inf. Sci. 2022, 34, 5170–5182. [Google Scholar] [CrossRef]
  53. Biswas, S.; Muduli, D.; Islam, M.A.; Kanade, A.S.; Zamani, A.T.; Kanade, S.P.; Parveen, N. Empirical Evaluation of Deep Learning Techniques for Fish Disease Detection in Aquaculture Systems: A Transfer Learning and Fusion-Based Approach. IEEE Access 2024, 12, 176136–176154. [Google Scholar] [CrossRef]
  54. Ahmed, M.S.; Jeba, S.M. SalmonScan: A novel image dataset for machine learning and deep learning analysis in fish disease detection in aquaculture. Data Brief. 2024, 54, 110388. [Google Scholar] [CrossRef]
  55. Wang, Z.; Liu, H.; Zhang, G.; Yang, X.; Wen, L.; Zhao, W. Diseased Fish Detection in the Underwater Environment Using an Improved YOLOv5 Network for Intensive Aquaculture. Fishes 2023, 8, 169. [Google Scholar] [CrossRef]
  56. Yu, G.; Zhang, J.; Chen, A.; Wan, R. Detection and Identification of Fish Skin Health Status Referring to Four Common Diseases Based on Improved YOLOv4 Model. Fishes 2023, 8, 186. [Google Scholar] [CrossRef]
  57. Ouyang, C.; Peng, H.; Tan, M.; Yang, L.; Deng, J.; Jiang, P.; Hu, W.; Wang, Y. YOLO-TPS: A Multi-Module Synergistic High-Precision Fish-Disease Detection Model for Complex Aquaculture Environments. Animals 2025, 15, 2356. [Google Scholar] [CrossRef]
  58. Cai, Y.; Yao, Z.; Jiang, H.; Qin, W.; Xiao, J.; Huang, X.; Pan, J.; Feng, H. Rapid detection of fish with SVC symptoms based on machine vision combined with a NAM-YOLOv7 hybrid model. Aquaculture 2024, 582, 740558. [Google Scholar] [CrossRef]
  59. Yin, Y.; Sun, X.; Yu, G.; Wang, J.; Li, D.; Wang, Y. CBFW-YOLOv8: Automated recognition method for fish body surface diseases in recirculating aquaculture systems. Comput. Electron. Agric. 2025, 236, 110612. [Google Scholar] [CrossRef]
  60. Li, X.; Zhao, S.; Chen, C.; Cui, H.; Li, D.; Zhao, R. YOLO-FD: An accurate fish disease detection method based on multi-task learning. Expert Syst. Appl. 2024, 258, 125085. [Google Scholar] [CrossRef]
  61. Liu, Z.; Yan, Z.; Li, G. FDMNet: A Multi-Task Network for Joint Detection and Segmentation of Three Fish Diseases. J. Imaging 2025, 11, 305. [Google Scholar] [CrossRef]
  62. Kabitha, P.; Nandini, D.U. VMI-ATN-RCNN: A hybrid deep learning model for fish disease segmentation and classification in aquaculture. Aquaculture 2025, 611, 743047. [Google Scholar] [CrossRef]
  63. Somerville, C.; Cohen, M.; Pantanella, E.; Stankus, A.; Lovatelli, A. Small-Scale Aquaponic Food Production: Integrated Fish and Plant Farming; Food and Agriculture Organization of the United Nations: Rome, Italy, 2014. [Google Scholar]
  64. Yildiz, H.Y.; Robaina, L.; Pirhonen, J.; Mente, E.; Domínguez, D.; Parisi, G. Fish Welfare in Aquaponic Systems: Its Relation to Water Quality with an Emphasis on Feed and Faeces—A Review. Water 2017, 9, 13. [Google Scholar] [CrossRef]
  65. Hwang, S.B.; Kim, H.Y.; Heo, C.Y.; Jung, H.Y.; Jung, S.J.; Cho, Y.J. Flatfish Disease Detection Based on Part Segmentation Approach and Disease Image Generation. J. World Aquac. Soc. 2025, 56, e70031. [Google Scholar] [CrossRef]
  66. Olsen, A.S.; Rosin, P.L.; Jones, C.B.; Cable, J.; Perkins, S.E. Computer vision for infectious disease surveillance; Saprolegnia spp. in salmonids. Ecol. Inform. 2025, 93, 103567. [Google Scholar] [CrossRef]
  67. Li, Z.; Li, G.; Song, X.; Wang, X. An Efficient and Dynamic Framework for Multi-Scale Target Detection of Underwater Organisms. J. Ocean Univ. China 2026, 25, 150–160. [Google Scholar] [CrossRef]
  68. Wang, Z.; Wang, P.; Liu, K.; Wang, P.; Fu, Y.; Lu, C.T.; Aggarwal, C.C.; Pei, J.; Zhou, Y. A Comprehensive Survey on Data Augmentation. IEEE Trans. Knowl. Data Eng. 2026, 38, 47–66. [Google Scholar] [CrossRef]
  69. Zhang, H.; Cisse, M.; Dauphin, Y.N.; Lopez-Paz, D. mixup: Beyond Empirical Risk Minimization. In Proceedings of the International Conference on Learning Representations (ICLR), Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar] [CrossRef]
  70. Yun, S.; Han, D.; Oh, S.J.; Chun, S.; Choe, J.; Yoo, Y. CutMix: Regularization Strategy to Train Strong Classifiers With Localizable Features. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6023–6032. [Google Scholar] [CrossRef]
  71. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934. [Google Scholar] [CrossRef]
  72. DeVries, T.; Taylor, G.W. Improved Regularization of Convolutional Neural Networks with Cutout. arXiv 2017, arXiv:1708.04552. [Google Scholar] [CrossRef]
  73. Zhong, Z.; Zheng, L.; Kang, G.; Li, S.; Yang, Y. Random Erasing Data Augmentation. Proc. AAAI Conf. Artif. Intell. 2020, 34, 13001–13008. [Google Scholar] [CrossRef]
  74. Cubuk, E.D.; Zoph, B.; Mane, D.; Vasudevan, V.; Le, Q.V. AutoAugment: Learning Augmentation Strategies From Data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 113–123. [Google Scholar] [CrossRef]
  75. Cubuk, E.D.; Zoph, B.; Shlens, J.; Le, Q.V. RandAugment: Practical Automated Data Augmentation with a Reduced Search Space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 14–19 June 2020. [Google Scholar] [CrossRef]
  76. Shao, Z.; Devarakonda, A. Scalable dual coordinate descent for kernel methods. In Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region, Taiwan, China, 19–21 February 2025; pp. 52–63. [Google Scholar]
  77. Wang, Y.; Jiang, T.; Shao, Z.; Ye, H.; Sun, J.; Ma, M.; Zhang, J.; Chen, Y.; Li, H. ZEUS: Accelerating Diffusion Models with Only Second-Order Predictor. arXiv 2026, arXiv:2604.01552. [Google Scholar] [CrossRef]
  78. Jiang, T.; Wang, Y.; Ye, H.; Shao, Z.; Sun, J.; Zhang, J.; Chen, Z.; Zhang, J.; Chen, Y.; Li, H. SADA: Stability-guided Adaptive Diffusion Acceleration. In Proceedings of the 42nd International Conference on Machine Learning, Vancouver, BC, Canada, 13–19 July 2025; PMLR, 2025. Volume 267, pp. 27649–27669. [Google Scholar]
  79. Wang, Y.; Jiang, T.; Shao, Z.; Ye, H.; Sun, J.; Ma, M.; Wang, Q.; Zhang, J.; Chen, Y.; Li, H.H. Accelerating Denoising Generative Models is as Easy as Predicting Second-Order Difference. In Proceedings of the ICLR 2026, Rio de Janeiro, Brazil, 23–27 April 2026. [Google Scholar]
  80. Oppenheim, A.V.; Lim, J.S. The importance of phase in signals. Proc. IEEE 1981, 69, 529–541. [Google Scholar] [CrossRef]
  81. Yang, Y.; Lao, D.; Sundaramoorthi, G.; Soatto, S. Phase Consistent Ecological Domain Adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 9011–9020. [Google Scholar] [CrossRef]
  82. Yang, Y.; Soatto, S. FDA: Fourier Domain Adaptation for Semantic Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 4085–4095. [Google Scholar] [CrossRef]
  83. Xu, Q.; Zhang, R.; Zhang, Y.; Wang, Y.; Tian, Q. A Fourier-Based Framework for Domain Generalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 14383–14392. [Google Scholar] [CrossRef]
  84. Xu, Q.; Zhang, R.; Fan, Z.; Wang, Y.; Wu, Y.; Zhang, Y. Fourier-based augmentation with applications to domain generalization. Pattern Recognit. 2023, 139, 109474. [Google Scholar] [CrossRef]
  85. Oh, K.; Jeon, S.; Heo, D.-W.; Shin, Y.; Suk, H.-I. FIESTA: Fourier-Based Semantic Augmentation with Uncertainty Guidance for Enhanced Domain Generalizability in Medical Image Segmentation. arXiv 2024, arXiv:2406.14308. [Google Scholar] [CrossRef]
  86. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015, Munich, Germany, 5–9 October 2015; pp. 234–241. [Google Scholar] [CrossRef]
  87. Tang, W.; Cui, K.; Chan, R.H. Optimized hard exudate detection with supervised contrastive learning. In Proceedings of the 2024 IEEE International Symposium on Biomedical Imaging (ISBI), Athens, Greece, 27–30 May 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 1–5. [Google Scholar]
  88. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818. [Google Scholar] [CrossRef]
  89. Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Wang, Y.; Lu, L.; Yuille, A.L.; Zhou, Y. TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation. arXiv 2021, arXiv:2102.04306. [Google Scholar] [CrossRef]
  90. Chen, J.; Mei, J.; Li, X.; Lu, Y.; Yu, Q.; Wei, Q.; Luo, X.; Xie, Y.; Adeli, E.; Wang, Y.; et al. TransUNet: Rethinking the U-Net architecture design for medical image segmentation through the lens of transformers. Med. Image Anal. 2024, 97, 103280. [Google Scholar] [CrossRef] [PubMed]
  91. Cao, H.; Wang, Y.; Chen, J.; Jiang, D.; Zhang, X.; Tian, Q.; Wang, M. Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation. In Proceedings of the Computer Vision—ECCV 2022 Workshops; Springer: Berlin/Heidelberg, Germany, 2022; pp. 205–218. [Google Scholar] [CrossRef]
  92. Ding, X.; Zhang, X.; Zhou, Y.; Han, J.; Ding, G.; Sun, J. Scaling Up Your Kernels to 31×31: Revisiting Large Kernel Design in CNNs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 11963–11975. [Google Scholar] [CrossRef]
  93. Wang, W.; Dai, J.; Chen, Z.; Huang, Z.; Li, Z.; Zhu, X.; Hu, X.; Lu, T.; Lu, L.; Li, H.; et al. InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023. [Google Scholar] [CrossRef]
  94. Ding, X.; Zhang, Y.; Ge, Y.; Zhao, S.; Song, L.; Yue, X.; Shan, Y. UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 5513–5524. [Google Scholar]
  95. Cui, K.; Tang, W.; Zhu, R.; Wang, M.; Larsen, G.D.; Pauca, V.P.; Alqahtani, S.; Yang, F.; Segurado, D.; Fine, P.; et al. Efficient localization and spatial distribution modeling of canopy palms using uav imagery. IEEE Trans. Geosci. Remote Sens. 2025, 63, 4413815. [Google Scholar] [CrossRef]
  96. Kendall, A.; Gal, Y. What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision? arXiv 2017, arXiv:1703.04977. [Google Scholar] [CrossRef]
  97. Zhang, W.; Huang, Q.; Ma, M.; Jiang, Y.; Chen, Y.; Huang, Z.; Wu, W.; Cui, K.; Lian, R.; Wu, Z.; et al. Center-guided classifier for semantic segmentation of remote sensing images. IEEE Trans. Geosci. Remote Sens. 2026, 64, 4404916. [Google Scholar] [CrossRef]
  98. Wang, M.; Ding, H.; Liew, J.H.; Liu, J.; Zhao, Y.; Wei, Y. SegRefiner: Towards Model-Agnostic Segmentation Refinement with Discrete Diffusion Process. arXiv 2023, arXiv:2312.12425. [Google Scholar] [CrossRef]
  99. Lin, Y.; Li, H.; Shao, W.; Yang, Z.; Zhao, J.; He, X.; Luo, P.; Zhang, K. SAMRefiner: Taming Segment Anything Model for Universal Mask Refinement. arXiv 2025, arXiv:2502.06756. [Google Scholar] [CrossRef]
  100. Müller, D.; Soto-Rey, I.; Kramer, F. Towards a guideline for evaluation metrics in medical image segmentation. BMC Res. Notes 2022, 15, 210. [Google Scholar] [CrossRef]
  101. Cheng, B.; Girshick, R.; Dollar, P.; Berg, A.C.; Kirillov, A. Boundary IoU: Improving Object-Centric Image Segmentation Evaluation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 15334–15342. [Google Scholar] [CrossRef]
  102. Oquab, M.; Darcet, T.; Moutakanni, T.; Vo, H.V.; Szafraniec, M.; Khalidov, V.; Fernandez, P.; Haziza, D.; Massa, F.; El-Nouby, A.; et al. DINOv2: Learning Robust Visual Features without Supervision. Trans. Mach. Learn. Res. 2024. [Google Scholar] [CrossRef]
  103. Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y.; et al. Segment Anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Honolulu, HI, USA, 19–20 October 2023. [Google Scholar] [CrossRef]
  104. Jocher, G.; Qiu, J. Ultralytics YOLO11, version 11.0.0; Official Software Citation Provided in the Ultralytics Documentation; Ultralytics: Frederick, MD, USA, 2024; Available online: https://github.com/ultralytics/ultralytics (accessed on 6 May 2026).
  105. Cui, K.; Shao, Z.; Larsen, G.; Pauca, V.; Alqahtani, S.; Segurado, D.; Pinheiro, J.; Wang, M.; Lutz, D.; Plemmons, R.; et al. Palmprobnet: A probabilistic approach to understanding palm distributions in ecuadorian tropical forest via transfer learning. In Proceedings of the 2024 ACM Southeast Conference, Greenville, SC, USA, 3–5 April 2024; pp. 272–277. [Google Scholar]
  106. Cui, K.; Bohara, S.; Prasai, S.; Shao, Z.; Tang, W.; Pillaca, M.; Flores, E.; Yang, Z.; Larsen, G.; Dethier, E.; et al. ELDOR: A Dataset and Benchmark for Illegal Gold Mining in the Amazon Rainforest. arXiv 2026, arXiv:2605.15397. [Google Scholar] [CrossRef]
  107. Liang, Z.; Bao, S.; Su, H.; Zhang, W.; Wang, H.; Yan, H. Daily Subsurface Salinity Reconstruction From Multisource Satellite Observations Using Wavelet-Enhanced 3-D Mamba. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2026, 19, 3371–3388. [Google Scholar] [CrossRef]
  108. Shao, Z.; Wang, Y.; Wang, Q.; Jiang, T.; Du, Z.; Ye, H.; Zhuo, D.; Chen, Y.; Li, H. Flashsvd: Memory-efficient inference with streaming for low-rank models. In Proceedings of the AAAI Conference on Artificial Intelligence, Singapore, 20–27 January 2026; Volume 40, pp. 25278–25285. [Google Scholar]
  109. Wu, W.; Shao, Z.; Cui, K.; Kim, J.; Wang, Y.; Ye, H.; Zhuo, D.; Chen, Y. FlashSVD v1. 5: Making Low-Rank Transformers Inference Actually Fast. arXiv 2026, arXiv:2605.08314. [Google Scholar]
Figure 1. Motivation and target task of the study. The panels summarise aquaculture imaging challenges, the limited spatial outputs of classification and detection, and the proposed pixel-level multi-class lesion segmentation task. The red box in the detection example denotes a bounding-box localisation result.
Figure 2. Overall architecture of the proposed framework. LPFA performs lesion-preserving augmentation during training, while CGLE and CFDR use a shared confidence map to guide encoder context allocation and decoder refinement. Black arrows denote feature or prediction flow, green dashed arrows denote confidence-map guidance, and coloured brackets distinguish the data-level augmentation, encoder, coarse-confidence, and decoder stages. The dotted circles labelled α denote confidence gates used for adaptive fusion. Ellipses indicate repeated transformer blocks or tokens and do not omit unique operations. During inference, LPFA is disabled, and the segmentation network internally generates the coarse confidence cue before final refinement.
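The confidence gating mentioned in the caption can be illustrated with a minimal sketch. It derives a per-pixel confidence from coarse class logits and uses it as a gate α to blend two feature values. The max-softmax confidence, the direction of the gate (low confidence leans on the broader-context feature), and all function names are assumptions made for illustration, not the authors' implementation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def confidence(logits):
    """Assumed confidence cue: the maximum softmax probability of a pixel."""
    return max(softmax(logits))

def gated_fusion(local_feat, context_feat, alpha):
    """Blend two feature values with a gate alpha in [0, 1].

    Assumption: confident pixels (alpha near 1) keep the local feature,
    uncertain pixels (alpha near 0) lean on the broader-context feature.
    """
    return alpha * local_feat + (1.0 - alpha) * context_feat

# Hypothetical coarse logits for two pixels over three classes.
sure_pixel = [6.0, 0.5, 0.2]
unsure_pixel = [1.0, 0.9, 1.1]
for logits in (sure_pixel, unsure_pixel):
    a = confidence(logits)
    print(round(a, 3), round(gated_fusion(1.0, 0.0, a), 3))
```

In this toy setting the confident pixel keeps almost all of its local feature, while the uncertain pixel draws mostly on the context feature.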
Figure 3. Representative dataset annotation and training-time augmentation examples. Rows correspond to six disease categories: Aero., Aeromoniasis; BGD, bacterial gill disease; BRD, bacterial red disease; Saprol., saprolegniasis; Para., parasitic disease; WTVD, white-tail viral disease. Columns show the original lesion crop, rotation by 15°, horizontal flipping, Gaussian noise with σ = 0.04, brightness scaling by 1.25, random block masking with a side length of 22% of the crop size, LPFA, and the manual lesion mask overlaid in red. The augmented views illustrate online transformations used during training, not additional stored dataset samples.
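The simpler online transformations listed in the caption can be sketched as follows, using the stated parameters (σ = 0.04, brightness factor 1.25, block side 22% of the crop size). The sketch assumes intensities in [0, 1], images stored as nested lists, and a zero fill value for the masked block; these choices and the function names are illustrative assumptions. Rotation is omitted because it needs interpolation.

```python
import random

def hflip(grid):
    """Mirror each row; apply to the image and its mask alike."""
    return [row[::-1] for row in grid]

def clip01(v):
    """Clamp an intensity to the assumed [0, 1] range."""
    return min(1.0, max(0.0, v))

def gaussian_noise(img, sigma=0.04, rng=random):
    """Add zero-mean Gaussian noise (sigma from the Figure 3 caption)."""
    return [[clip01(p + rng.gauss(0.0, sigma)) for p in row] for row in img]

def brightness(img, scale=1.25):
    """Scale intensities (factor from the Figure 3 caption), then clip."""
    return [[clip01(p * scale) for p in row] for row in img]

def block_mask(img, frac=0.22, rng=random):
    """Zero a random square whose side is `frac` of the crop size."""
    h, w = len(img), len(img[0])
    side = max(1, round(frac * min(h, w)))
    top = rng.randrange(h - side + 1)
    left = rng.randrange(w - side + 1)
    out = [row[:] for row in img]
    for r in range(top, top + side):
        for c in range(left, left + side):
            out[r][c] = 0.0  # assumed fill value
    return out

rng = random.Random(0)
image = [[0.5] * 50 for _ in range(50)]  # toy 50x50 grey crop
mask = [[1 if c < 10 else 0 for c in range(50)] for _ in range(50)]

# Geometric changes act on image and mask together; photometric ones on the image only.
flipped_image, flipped_mask = hflip(image), hflip(mask)
noisy = gaussian_noise(image, rng=rng)
bright = brightness(image)
masked = block_mask(image, rng=rng)
```

Because these functions are applied on the fly to each training sample, no augmented copies need to be stored, in line with the caption's remark about online transformations.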
Figure 4. Illustration of LPFA on an Aeromoniasis sample. Panels show the source image, lesion mask, source patch, paired patch, frequency-augmented patch, and final augmented image. White contours mark the annotated lesion boundary used for mask-constrained reconstruction. LPFA preserves lesion layout through source-phase and mask-constrained reconstruction while varying local appearance.
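One plausible reading of the source-phase and mask-constrained reconstruction in this caption is sketched below on a toy 4×4 patch. The sketch keeps the Fourier phase of the source patch, interpolates its amplitude towards a paired patch, and then restricts the change to the lesion mask. The mixing weight `lam`, the direction of the mask constraint, and all function names are assumptions; the source does not specify these details, and a naive DFT is used only to keep the example dependency-free.

```python
import cmath

def dft2(x, inverse=False):
    """Naive 2-D discrete Fourier transform of a small grid (O(N^4))."""
    n, m = len(x), len(x[0])
    sign = 1.0 if inverse else -1.0
    out = []
    for u in range(n):
        row = []
        for v in range(m):
            acc = 0j
            for r in range(n):
                for c in range(m):
                    angle = 2.0 * cmath.pi * (u * r / n + v * c / m)
                    acc += x[r][c] * cmath.exp(sign * 1j * angle)
            row.append(acc / (n * m) if inverse else acc)
        out.append(row)
    return out

def amplitude_mix(src, ref, lam):
    """Keep the source phase; interpolate amplitude towards the paired patch."""
    f_src, f_ref = dft2(src), dft2(ref)
    mixed = [
        [cmath.rect((1 - lam) * abs(a) + lam * abs(b), cmath.phase(a))
         for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(f_src, f_ref)
    ]
    # Take the real part; tiny imaginary residue can remain from rounding.
    return [[z.real for z in row] for row in dft2(mixed, inverse=True)]

def mask_blend(src, aug, mask):
    """Assumed mask constraint: augmented values inside the mask, source outside."""
    return [
        [a if k else s for s, a, k in zip(rs, ra, rk)]
        for rs, ra, rk in zip(src, aug, mask)
    ]

src = [[0.2, 0.2, 0.2, 0.2], [0.2, 0.8, 0.9, 0.2],
       [0.2, 0.7, 0.8, 0.2], [0.2, 0.2, 0.2, 0.2]]
ref = [[0.5, 0.4, 0.5, 0.4], [0.4, 0.6, 0.6, 0.5],
       [0.5, 0.6, 0.7, 0.4], [0.4, 0.5, 0.4, 0.5]]
mask = [[0, 0, 0, 0], [0, 1, 1, 0], [0, 1, 1, 0], [0, 0, 0, 0]]

aug = amplitude_mix(src, ref, lam=0.3)   # frequency-augmented patch
final = mask_blend(src, aug, mask)       # mask-constrained result
```

With `lam=0` the function returns the source patch unchanged, which is a convenient sanity check that only the amplitude, and not the phase that carries the spatial layout, is being altered.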
Figure 5. Class-wise comparison of representative methods. (Left) Class-wise IoU. (Right) Class-wise Dice. Bkg., background; Aero., Aeromoniasis; BGD, bacterial gill disease; BRD, bacterial red disease; Saprol., Saprolegniasis; Para., parasitic disease; WTVD, white-tail viral disease.
Figure 6. Feature-activation heatmaps across six lesion categories. Rows denote disease categories, columns denote models, and warmer colours indicate stronger responses to lesion-relevant image regions, whereas cooler colours indicate weaker responses.
Table 1. Lesion-area distribution measured from the pixel-level masks.
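| Class | Mean Area (%) | Std. (%) | Median (%) | Range (%) |
|---|---|---|---|---|
| Aeromoniasis | 10.8 | 8.5 | 9.2 | 0.5–36.7 |
| Bacterial gill disease | 19.6 | 17.2 | 15.0 | 0.9–81.1 |
| Bacterial red disease | 7.5 | 6.0 | 5.8 | 0.6–25.8 |
| Saprolegniasis | 5.6 | 5.5 | 4.2 | 0.2–37.4 |
| Parasitic disease | 7.2 | 8.1 | 3.9 | 0.6–44.7 |
| White-tail viral disease | 9.4 | 8.0 | 7.5 | 0.6–37.3 |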
Note: The area ratio is the percentage of lesion pixels in an image. Across diseased masks, foreground lesion pixels account for 10.4% of pixels on average, while background pixels account for 89.6%; normal healthy images are background-only negative samples.
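The area ratio defined in the note can be computed from a label mask as in the following minimal sketch. It assumes that label 0 denotes background and any non-zero label denotes a lesion class; the function name and the toy mask are illustrative.

```python
def lesion_area_ratio(mask):
    """Percentage of non-zero (lesion) pixels in a 2-D label mask."""
    total = sum(len(row) for row in mask)
    lesion = sum(1 for row in mask for label in row if label != 0)
    return 100.0 * lesion / total

toy_mask = [
    [0, 0, 0, 0, 0],
    [0, 2, 2, 0, 0],
    [0, 2, 2, 0, 0],
    [0, 0, 0, 0, 0],
]
healthy_mask = [[0] * 5 for _ in range(4)]

print(lesion_area_ratio(toy_mask))      # 4 lesion pixels out of 20 -> 20.0
print(lesion_area_ratio(healthy_mask))  # background-only sample -> 0.0
```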
Table 2. Quantitative comparison of different segmentation models on the fish lesion segmentation task.
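| Model | mIoU (%)↑ | mDice (%)↑ | Prec. (%)↑ | Rec. (%)↑ | BIoU (%)↑ | Params (M)↓ | FLOPs (G)↓ | FPS (s⁻¹)↑ |
|---|---|---|---|---|---|---|---|---|
| UNet | 62.8 | 74.2 | 78.1 | 74.9 | 51.4 | 43.93 | 183.06 | 41.13 |
| DeepLabV3+ | 76.2 | 86.0 | 88.2 | 84.1 | 60.0 | 40.35 | 93.13 | 85.23 |
| TransUNet | 78.4 | 88.8 | 90.5 | 86.5 | 68.8 | 105.28 | 58.43 | 55.22 |
| Swin-Unet | 68.1 | 82.2 | 84.4 | 79.3 | 57.6 | 34.49 | 15.30 | 91.78 |
| DINOv2 | 71.5 | 84.7 | 86.6 | 83.5 | 61.7 | 89.37 | 38.97 | 130.50 |
| SAM | 69.8 | 82.9 | 85.0 | 82.3 | 57.8 | 93.74 | 976.83 | 3.93 |
| YOLO11x-seg | 78.2 | 87.2 | 88.9 | 84.4 | 66.8 | 62.10 | 296.40 | 63.29 |
| Proposed | 82.6 | 90.9 | 93.2 | 90.1 | 73.5 | 107.52 | 109.02 | 34.69 |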
Note: For TransUNet and the proposed method, the reported values are mean values across five random seeds; additional repeated-run statistics are provided below for mIoU, mDice, and BIoU. For the proposed method, FLOPs and FPS include both the internal coarse branch and the final CGLE/CFDR refinement pass. The compared methods cover different model scales, so Params, FLOPs, and FPS are reported alongside accuracy. ↑ indicates that higher is better, and ↓ indicates that lower is better. The best value in each column is highlighted in bold.
Table 3. Inference cost decomposition of the proposed method.
| Component | FLOPs (G)↓ | Latency (ms)↓ |
|---|---|---|
| Final TransUNet path | 58.43 | 18.10 |
| Coarse confidence branch | 43.06 | 7.30 |
| CGLE adapters | 6.27 | 2.71 |
| CFDR modules | 1.26 | 0.71 |
| Total proposed method | 109.02 | 28.82 |
Note: The total latency of 28.82 ms corresponds to 34.69 FPS under single-image inference on the RTX 4060 GPU. LPFA is active only during training and is therefore not included in inference FLOPs or latency. ↓ indicates that lower is better.
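The arithmetic behind the decomposition can be checked with a short sketch that sums the component rows of Table 3 and converts the total single-image latency to frames per second. The dictionary layout and variable names are illustrative.

```python
# (FLOPs in G, latency in ms) per component, copied from Table 3.
components = {
    "Final TransUNet path": (58.43, 18.10),
    "Coarse confidence branch": (43.06, 7.30),
    "CGLE adapters": (6.27, 2.71),
    "CFDR modules": (1.26, 0.71),
}

total_flops = sum(flops for flops, _ in components.values())
total_latency_ms = sum(latency for _, latency in components.values())
fps = 1000.0 / total_latency_ms  # single-image inference: frames per second

print(f"{total_flops:.2f} GFLOPs, {total_latency_ms:.2f} ms, {fps:.1f} FPS")
```

The sums reproduce the 109.02 GFLOPs and 28.82 ms totals, and the resulting rate of about 34.7 FPS agrees with the 34.69 FPS stated in the note up to rounding.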
Table 4. Repeated-run stability of TransUNet and the proposed method.
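| Model | Runs | mIoU (%)↑ | 95% CI (mIoU) | mDice (%)↑ | BIoU (%)↑ | 95% CI (BIoU) | Bootstrap p |
|---|---|---|---|---|---|---|---|
| TransUNet | 5 | 78.4 ± 0.4 | [77.9, 78.9] | 88.8 ± 0.2 | 68.8 ± 0.5 | [68.2, 69.4] | |
| Proposed | 5 | 82.6 ± 0.3 | [82.2, 83.0] | 90.9 ± 0.2 | 73.5 ± 0.4 | [73.0, 74.0] | <0.01 |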
Note: Values are mean ± standard deviation across five random seeds. The two confidence-interval columns correspond to mIoU and BIoU, respectively. ↑ indicates that higher is better. The best value in each metric column is highlighted in bold. The bootstrap p value compares the proposed method with TransUNet on the paired test evaluation; p < 0.05 is considered statistically significant.
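The source does not state how the confidence intervals were obtained. A Student-t interval with four degrees of freedom reproduces the reported bounds after rounding, so the sketch below adopts that as a working assumption. The paired bootstrap helper is likewise a generic one-sided procedure applied to hypothetical per-image scores, not the authors' exact test.

```python
import math
import random

T_CRIT_DF4 = 2.776  # two-sided 95% Student-t critical value, 4 degrees of freedom

def t_interval(mean, std, n=5, t_crit=T_CRIT_DF4):
    """95% confidence interval for a mean from n runs (sample std)."""
    half = t_crit * std / math.sqrt(n)
    return (mean - half, mean + half)

def paired_bootstrap_p(scores_a, scores_b, n_boot=10000, seed=0):
    """One-sided p: share of resamples where mean(b - a) is not positive."""
    rng = random.Random(seed)
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    not_better = 0
    for _ in range(n_boot):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        if sum(sample) / n <= 0.0:
            not_better += 1
    return not_better / n_boot

# Mean and standard deviation of mIoU from Table 4.
for mean, std in ((78.4, 0.4), (82.6, 0.3)):
    lo, hi = t_interval(mean, std)
    print(round(lo, 1), round(hi, 1))

# Hypothetical per-image IoU scores for a baseline (a) and an improved model (b).
toy_a = [0.70, 0.72, 0.68, 0.75]
toy_b = [0.74, 0.75, 0.73, 0.78]
print(paired_bootstrap_p(toy_a, toy_b))
```

Resampling the paired differences, rather than the two score lists independently, keeps each test image matched with itself, which is what a paired evaluation requires.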
Table 5. Exact class-wise IoU and Dice values for the representative models shown in Figure 5.
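| Class | DL IoU↑ | DL Dice↑ | TU IoU↑ | TU Dice↑ | Y11 IoU↑ | Y11 Dice↑ | Prop. IoU↑ | Prop. Dice↑ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Bkg. | 88.2 | 92.1 | 86.2 | 93.0 | 86.5 | 90.5 | 88.4 | 93.9 |
| Aero. | 72.5 | 84.1 | 75.9 | 87.4 | 76.8 | 86.5 | 82.2 | 90.8 |
| BGD | 73.0 | 84.4 | 75.7 | 87.2 | 75.3 | 85.6 | 81.8 | 90.6 |
| BRD | 75.4 | 85.7 | 79.2 | 89.5 | 78.4 | 87.4 | 80.9 | 90.1 |
| Saprol. | 73.8 | 84.5 | 78.8 | 89.1 | 77.0 | 86.7 | 80.3 | 89.4 |
| Para. | 73.1 | 84.5 | 76.5 | 87.7 | 75.9 | 86.0 | 82.4 | 91.0 |
| WTVD | 77.4 | 86.7 | 76.5 | 87.8 | 77.5 | 87.4 | 82.2 | 90.5 |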
Note: DL, DeepLabV3+; TU, TransUNet; Y11, YOLO11x-seg; Prop., the proposed method. Values are percentages. ↑ indicates that higher is better. The best value in each row-metric group is highlighted in bold.
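The class-wise values in Table 5 can be related to the headline mIoU figures. The sketch below assumes that mIoU is the unweighted mean of the seven class-wise IoU values, background included. This section does not state that definition explicitly, so it should be read as an assumption that happens to be consistent with the reported numbers.

```python
# Class-wise IoU (%) copied from Table 5, in the row order
# Bkg., Aero., BGD, BRD, Saprol., Para., WTVD.
class_iou = {
    "DeepLabV3+": [88.2, 72.5, 73.0, 75.4, 73.8, 73.1, 77.4],
    "TransUNet": [86.2, 75.9, 75.7, 79.2, 78.8, 76.5, 76.5],
    "YOLO11x-seg": [86.5, 76.8, 75.3, 78.4, 77.0, 75.9, 77.5],
    "Proposed": [88.4, 82.2, 81.8, 80.9, 80.3, 82.4, 82.2],
}


def macro_mean(values):
    """Unweighted mean over classes (each class counts equally)."""
    return sum(values) / len(values)


# Macro-averaged IoU per model, rounded to one decimal like the tables.
miou = {model: round(macro_mean(values), 1) for model, values in class_iou.items()}

for model, value in miou.items():
    print(f"{model:12s} mIoU = {value:.1f}%")
```

With this definition, the macro means for TransUNet, YOLO11x-seg, and the proposed method come out at 78.4%, 78.2%, and 82.6%, matching the mIoU values reported for those models elsewhere in the results.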
Table 6. Ablation study of the proposed method. Higher values are better for all reported metrics, and the highest value in each metric column is highlighted in bold.
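| Setting | LPFA | CGLE | CFDR | mIoU (%)↑ | mDice (%)↑ | Prec. (%)↑ | Rec. (%)↑ | BIoU (%)↑ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Base | No | No | No | 78.4 | 88.8 | 90.5 | 86.5 | 68.8 |
| +LPFA | Yes | No | No | 81.2 | 90.4 | 91.8 | 88.4 | 71.9 |
| +CGLE | No | Yes | No | 80.2 | 89.8 | 91.5 | 87.8 | 71.1 |
| +CFDR | No | No | Yes | 79.5 | 89.3 | 91.2 | 87.1 | 70.5 |
| LPFA+CGLE | Yes | Yes | No | 81.8 | 90.6 | 92.5 | 89.1 | 72.6 |
| LPFA+CFDR | Yes | No | Yes | 81.6 | 90.5 | 92.1 | 88.9 | 72.4 |
| CGLE+CFDR | No | Yes | Yes | 81.4 | 90.4 | 92.3 | 88.7 | 72.3 |
| Full | Yes | Yes | Yes | 82.6 | 90.9 | 93.2 | 90.1 | 73.5 |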
Note: Base denotes the TransUNet baseline. ↑ indicates that higher is better. The best value in each metric column is highlighted in bold. LPFA, lesion-preserving frequency augmentation; CGLE, confidence-guided large-kernel encoder; CFDR, confidence-filtered decoder refinement.
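One way to read the ablation in Table 6 is as mIoU gains over the TransUNet baseline. The sketch below computes those gains directly from the table and compares the gain of the full model with the sum of the three single-module gains. The names used here are illustrative, and the comparison is a simple arithmetic reading of the table rather than an analysis performed by the authors.

```python
# mIoU (%) copied from Table 6.
BASELINE_MIOU = 78.4
ablation_miou = {
    "+LPFA": 81.2,
    "+CGLE": 80.2,
    "+CFDR": 79.5,
    "LPFA+CGLE": 81.8,
    "LPFA+CFDR": 81.6,
    "CGLE+CFDR": 81.4,
    "Full": 82.6,
}

# Gain of each setting over the baseline, in percentage points.
gains = {name: round(value - BASELINE_MIOU, 1) for name, value in ablation_miou.items()}

# If the modules contributed independently, the full model would gain
# roughly the sum of the three single-module gains.
sum_of_single_gains = round(gains["+LPFA"] + gains["+CGLE"] + gains["+CFDR"], 1)

for name, gain in gains.items():
    print(f"{name:10s} +{gain:.1f} points")
print(f"Sum of single-module gains: +{sum_of_single_gains:.1f} points")
```

The full model gains 4.2 points, which is less than the 5.7 points obtained by adding the single-module gains. This is consistent with the three modules addressing partly overlapping errors, although the table alone cannot establish that.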
Table 7. Comparison of different online data augmentation strategies under the same TransUNet backbone.
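| Strategy | mIoU (%)↑ | mDice (%)↑ | Prec. (%)↑ | Rec. (%)↑ | BIoU (%)↑ |
| --- | --- | --- | --- | --- | --- |
| None | 78.4 | 88.8 | 90.5 | 86.5 | 68.8 |
| Flip+Rotate | 78.8 | 89.1 | 90.7 | 86.8 | 69.7 |
| Bright./Cont. | 78.9 | 88.9 | 90.6 | 86.9 | 69.5 |
| Noise/Blur | 77.9 | 88.6 | 90.2 | 86.4 | 68.2 |
| Masking | 79.5 | 89.4 | 87.3 | 86.9 | 69.9 |
| RandAug. | 79.5 | 89.7 | 91.2 | 87.5 | 70.3 |
| FDA | 80.1 | 89.9 | 91.5 | 87.9 | 71.1 |
| LPFA | 81.2 | 90.4 | 91.8 | 88.4 | 71.9 |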
Note: All strategies are applied online during training and are not counted as additional static dataset samples. ↑ indicates that higher is better. The best value in each metric column is highlighted in bold. Flip+Rotate, Bright./Cont., and Noise/Blur follow common data-augmentation practice [68]; Masking denotes Cutout or Random Erasing [72,73]; RandAug., RandAugment [75]; FDA, Fourier Domain Adaptation [82]; LPFA, lesion-preserving frequency augmentation.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

MDPI and ACS Style

Zhao, C.-T.; Guan, Y.-X.; Lou, X.; Wang, H. Lesion-Preserving and Confidence-Aware Fish Lesion Segmentation for Sustainable Aquaculture and Aquaponic Health Monitoring. Sustainability 2026, 18, 5819. https://doi.org/10.3390/su18125819
