The proposed framework contains three coordinated components: lesion-preserving frequency augmentation (LPFA), confidence-guided large-kernel encoding (CGLE), and confidence-filtered decoder refinement (CFDR). These components correspond to the data-level, encoder-side, and decoder-side difficulties identified in the introduction. The overall framework is illustrated in
Figure 2. During training, LPFA extracts lesion-centred patches, performs frequency-domain augmentation, and reconstructs the augmented sample under mask constraints. The augmented image is then processed by a TransUNet encoder. A coarse prediction is converted into a shared confidence map through softmax and maximum-probability selection. This confidence map regulates the CGLE blocks in the encoder and the CFDR modules in the decoder, so that broad-context modelling and feature correction are concentrated on uncertain lesion regions before the final segmentation map is produced. During inference, LPFA is disabled, and the original image is processed by the segmentation network, which internally generates the coarse confidence cue before final refinement.
Given an input image $x$, the overall mapping is written as
$$\hat{y} = f_{\theta}\big(\mathcal{A}_{\phi}(x)\big),$$
where $\mathcal{A}_{\phi}$ denotes the training-time LPFA operator with parameters $\phi$, $f_{\theta}$ denotes the complete segmentation network with parameters $\theta$, and $\hat{y}$ denotes the final pixel-wise segmentation result. In the present framework, $f_{\theta}$ is implemented by a TransUNet backbone equipped with CGLE in selected encoder blocks and CFDR in the decoder. This notation is used for the training forward path; for validation and testing, $\mathcal{A}_{\phi}$ is set to the identity mapping. The following subsections first define the dataset and task setting, and then describe LPFA, CGLE, and CFDR in order.
3.1. Dataset and Task Definition
The lesion segmentation task is built on the public “Freshwater Fish Disease Aquaculture in South Asia” dataset hosted on Kaggle (https://www.kaggle.com/datasets/subirbiswas19/freshwater-fish-disease-aquaculture-in-south-asia accessed on 6 May 2026), which is associated with the image-level fish disease recognition study of Biswas et al. [53]. The public release is organised for classification and does not provide pixel-level lesion annotations, so it cannot be used directly for semantic segmentation.
The original public data contain seven image-level categories, consisting of six disease categories and one normal healthy category. Each category contains 250 images, giving $7 \times 250 = 1750$ raw images. To reformulate the data for dense lesion prediction, pixel-level lesion masks were manually annotated for diseased images, while normal healthy images were retained as negative samples with background-only masks. The annotations were checked through data cleaning and quality control. The static dataset therefore consists of the 1750 original images and their corresponding masks. The data-loading pipeline uses an 8:1:1 split for training, validation, and testing, corresponding to 1400, 175, and 175 original images, respectively. Geometric, photometric, masking, frequency-domain, and LPFA transformations are applied online during model training to diversify the input distribution; they are not counted as additional stored images. The raw images come from the public source dataset, whereas the pixel-level masks, split files, and preprocessing scripts are derived for the present segmentation task.
The pixel-level masks were produced with Labelme by two trained annotators. Before full annotation, both annotators reviewed the disease-category definitions, representative lesion examples, and boundary cases in the dataset. A shared guideline was then used: visible ulceration, cotton-like fungal growth, gill discolouration, tail whitening, and parasite-related abnormal regions were marked as lesion pixels; healthy body regions, background water, specular reflection, shadows, and scale texture without disease evidence were assigned to background. Low-contrast boundaries were traced along the most continuous local colour and texture transition, and ambiguous cases were discussed jointly until a consensus mask was obtained. For diseased images, annotated foreground pixels inherited the image-level disease category of the source image. The resulting task is image-level disease-conditioned lesion-region segmentation, not pixel-level differential diagnosis for images with multiple coexisting pathologies. A randomly selected overlap subset was annotated independently by both annotators for quality control, yielding a Cohen’s kappa of 0.91, a mean pixel agreement of 96.8%, and a foreground Dice agreement of 92.6%. All masks were finally checked for polygon closure, category consistency, empty-mask errors, and accidental foreground labels in healthy images.
Figure 3 shows representative lesion crops from the six disease categories and examples of the training-time transformations used to diversify the input distribution. Normal healthy images are excluded from this lesion-visualisation figure because they contain no annotated lesion region and are used as background-only samples. The conventional augmentation examples include a 15° rotation, horizontal flipping, zero-mean Gaussian noise with
on RGB values normalised to
, brightness scaling by a factor of
, and random block masking with a square block whose side length is 22% of the crop size. These transformations are generated on the fly during training rather than stored as a pre-expanded dataset. The LPFA example in the same figure illustrates the proposed lesion-centred frequency-domain augmentation, while the final column shows the manually annotated lesion mask overlaid on the original crop.
In the task definition, healthy tissue, all non-lesion regions, and all pixels in normal healthy images are assigned to the background class, while diseased pixels are divided into six lesion categories: Aeromoniasis, bacterial gill disease, bacterial red disease, saprolegniasis, parasitic disease, and white-tail viral disease. The full semantic task therefore contains $K = 7$ classes, where background is class 0. Given an input image $x \in \mathbb{R}^{H \times W \times 3}$, the model predicts a pixel-wise class map $\hat{y} \in \{0, 1, \ldots, 6\}^{H \times W}$, where $H$ and $W$ denote the image height and width.
To avoid data leakage caused by augmented near-duplicates, the 8:1:1 split was performed at the original-image level before any online augmentation was enabled. Augmentation was applied only to training samples within their own subset, and validation and test images were evaluated without augmentation. This prevents transformed views derived from the same original sample from appearing in different evaluation partitions.
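The split-before-augmentation rule can be illustrated with a minimal sketch. The identifiers `img_0000` to `img_1749`, the seed, and the plain (non-stratified) shuffle are assumptions made for illustration; the text does not state whether the split was stratified by category.

```python
import random

def split_original_images(image_ids, seed=0, ratios=(8, 1, 1)):
    """Shuffle original image ids once and cut them into train/val/test."""
    ids = sorted(image_ids)            # deterministic starting order
    random.Random(seed).shuffle(ids)   # assumption: simple random split
    total = sum(ratios)
    n_train = len(ids) * ratios[0] // total
    n_val = len(ids) * ratios[1] // total
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]
    return train, val, test

# Placeholder ids standing in for the 1750 original images.
image_ids = [f"img_{i:04d}" for i in range(1750)]
train_ids, val_ids, test_ids = split_original_images(image_ids)

# Online augmentation is attached to the training subset only.
augment_enabled = {"train": True, "val": False, "test": False}
print(len(train_ids), len(val_ids), len(test_ids))  # 1400 175 175
```

Because the partition is fixed on original image ids, any view generated online from a training image can only ever appear in the training subset.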
The lesion-area statistics in Table 1 also show a strong foreground–background imbalance. This imbalance was handled in three ways: normal images were retained as explicit background-only negatives, the segmentation objective combined cross-entropy with Dice loss so that small foreground regions contributed to the optimisation, and all reported overlap metrics were class averaged rather than pixel averaged. Class-averaged metrics reduce the dominance of the background class in the final scores.
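The effect of class averaging can be seen in a small toy case. The sketch below is illustrative only: the 10 × 10 mask, the function names, and the choice to skip classes absent from both prediction and target are assumptions, not the evaluation code used in the experiments.

```python
import numpy as np

def pixel_accuracy(pred, target):
    """Fraction of pixels whose predicted class matches the label."""
    return float((pred == target).mean())

def class_averaged_dice(pred, target, num_classes):
    """Mean Dice over classes that occur in the prediction or the target."""
    scores = []
    for c in range(num_classes):
        p, t = pred == c, target == c
        denom = p.sum() + t.sum()
        if denom == 0:
            continue  # assumption: classes absent from both maps are skipped
        scores.append(2.0 * np.logical_and(p, t).sum() / denom)
    return float(np.mean(scores))

# Toy label with a 2 x 2 lesion of class 1; the prediction misses it entirely.
target = np.zeros((10, 10), dtype=int)
target[4:6, 4:6] = 1
pred = np.zeros_like(target)

acc = pixel_accuracy(pred, target)                       # 0.96
dice = class_averaged_dice(pred, target, num_classes=7)  # about 0.49
print(acc, dice)
```

The pixel-averaged score stays high because background dominates, whereas the class-averaged score drops sharply once the lesion class is missed.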
3.2. Lesion-Preserving Frequency Augmentation (LPFA)
At the data level, LPFA is introduced to enlarge lesion appearance variation while reducing the risk of label inconsistency. Instead of applying random perturbations to the entire image, the augmentation operator acts on lesion-related patches and uses class-consistent pairing to preserve semantic compatibility.
Let one mini-batch during training be denoted by $\{(x_i, y_i)\}_{i=1}^{B}$, where $x_i$ is the $i$th input image, $y_i$ is the corresponding label, and $B$ is the batch size. For each sample $x_i$, the augmentation switch is defined as
$$a_i \sim \mathrm{Bernoulli}\big(p_a \cdot \mathbb{1}(e > e_w)\big),$$
where $a_i \in \{0, 1\}$ is the augmentation indicator, $p_a$ is the augmentation probability, $e$ is the current training epoch, $e_w$ is the warm-up epoch after which augmentation becomes active, and $\mathbb{1}(\cdot)$ is the indicator function. When $a_i = 1$, the sample enters the augmentation branch; when $a_i = 0$, the sample remains unchanged. To avoid semantic conflict from cross-disease mixing, a partner sample $x_r$ with the same dominant lesion class is selected from the same mini-batch, where the dominant lesion class is defined as the foreground lesion category with the largest annotated area in the mask. If no same-class lesion sample is available in the current mini-batch, $x_r$ is drawn from a class-wise training queue updated from previous mini-batches; if the queue is still empty during early training, LPFA is skipped for that sample. For healthy images with empty foreground masks, LPFA is skipped and the sample remains a background-only negative sample, so no artificial lesion appearance is introduced into normal images.
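The switch and the partner-selection fallbacks described above can be sketched as follows. This is a hedged illustration: the function names, the strict `epoch > warmup_epoch` comparison, the queue length of 32, and the use of per-class pixel areas as input are all assumptions rather than details given in the text.

```python
import random
from collections import defaultdict, deque

def augmentation_switch(epoch, warmup_epoch, prob, rng):
    """Return 1 if LPFA is applied to a sample, else 0 (comparison is assumed strict)."""
    return int(epoch > warmup_epoch and rng.random() < prob)

def dominant_class(class_areas):
    """Lesion class with the largest annotated area; None for background-only masks."""
    lesion = {c: a for c, a in class_areas.items() if c != 0 and a > 0}
    return max(lesion, key=lesion.get) if lesion else None

def pick_partner(i, batch_classes, queue, rng):
    """Prefer a same-class sample in the batch, then the class-wise queue, else skip."""
    cls = batch_classes[i]
    if cls is None:
        return None                                   # healthy image: LPFA skipped
    same = [j for j, c in enumerate(batch_classes) if j != i and c == cls]
    if same:
        return ("batch", rng.choice(same))
    if queue[cls]:
        return ("queue", rng.choice(list(queue[cls])))
    return None                                       # queue still empty: LPFA skipped

rng = random.Random(0)
queue = defaultdict(lambda: deque(maxlen=32))         # hypothetical queue length
batch_classes = [5, 2, 5, None]                       # dominant class per sample
print(pick_partner(0, batch_classes, queue, rng))     # ('batch', 2)
print(pick_partner(1, batch_classes, queue, rng))     # None
```

Returning `None` in the skip cases keeps healthy images and early-training samples on the unaugmented path, matching the behaviour described in the paragraph above.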
In the patch extraction stage, the source lesion patch $P_i$ is cropped from $x_i$, and the class-consistent paired lesion patch $P_r$ is cropped from $x_r$ according to the corresponding lesion masks. A fixed margin is retained around each crop box to preserve local context. A two-dimensional Fourier transform is then applied to the two patches:
$$\mathcal{F}(P_i) = A_i \odot e^{j\Phi_i}, \qquad \mathcal{F}(P_r) = A_r \odot e^{j\Phi_r},$$
where $\mathcal{F}(\cdot)$ denotes the two-dimensional Fourier transform, $A_i$ and $A_r$ denote the amplitude spectra of $P_i$ and $P_r$, $\Phi_i$ and $\Phi_r$ denote the corresponding phase spectra, and $j$ is the imaginary unit. In this representation, the phase spectrum is used to maintain the main spatial arrangement, whereas the amplitude spectrum is adjusted to vary appearance statistics.
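To determine which frequency bands should retain source appearance and which bands should absorb information from the paired sample, the frequency-selection subnetwork inside $\mathcal{A}_{\phi}$, denoted by $g_{\phi}$, is used to generate a frequency weight map from the logarithmic amplitude spectrum:
$$M = m_{\min} + (m_{\max} - m_{\min})\,\sigma\big(g_{\phi}(\log(A_i + \epsilon))\big),$$
where $M$ is the predicted frequency weight map, $\sigma(\cdot)$ is the sigmoid function, $m_{\min}$ and $m_{\max}$ are the lower and upper bounds of the weight map, and $\epsilon$ is a small constant used to avoid numerical instability in the logarithm. The bounds are set to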
and
, which keeps both the source and paired spectra active in the mixture. The source amplitude spectrum
$A_i$ and the paired amplitude spectrum $A_r$ are then fused adaptively:
$$\tilde{A} = M \odot A_i + (1 - M) \odot A_r,$$
where $\tilde{A}$ is the fused amplitude spectrum, $M$ is the frequency weight map, and $\odot$ denotes element-wise multiplication. A large value in $M$ means that the corresponding frequency location preserves more of the source lesion appearance, whereas a small value introduces more appearance information from the paired lesion patch.
The architecture of the frequency-selection subnetwork is intentionally lightweight because it is used only inside the online augmentation branch. Its input is the three-channel logarithmic amplitude spectrum of a resized lesion patch, and its output is a one-channel frequency weight map that is broadcast to the amplitude channels. In the implementation, the subnetwork contains three convolutional layers: a $3 \times 3$ convolution from 3 to 16 channels with ReLU, a $3 \times 3$ convolution from 16 to 32 channels with ReLU, and a final $3 \times 3$ convolution from 32 channels to 1 channel followed by a sigmoid activation. The total parameter count is 5377 (approximately 0.005 M), so the overhead is negligible compared with the segmentation backbone. The bounded output range keeps the augmented spectrum from becoming a pure source copy or a full paired-spectrum replacement. This module is active during training-time augmentation and is not used during inference.
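The reported parameter count can be checked with simple arithmetic. The sketch below assumes that each of the three convolutions uses a 3 × 3 kernel with a bias term; under that assumption the layer sizes listed above sum to exactly 5377.

```python
def conv2d_params(c_in, c_out, kernel, bias=True):
    """Parameter count of a standard 2D convolution layer."""
    return c_in * c_out * kernel * kernel + (c_out if bias else 0)

# Channel configuration from the text; the 3 x 3 kernel size is an assumption.
layers = [(3, 16), (16, 32), (32, 1)]
per_layer = [conv2d_params(c_in, c_out, kernel=3) for c_in, c_out in layers]
total = sum(per_layer)
print(per_layer, total)  # [448, 4640, 289] 5377
```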
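During reconstruction, the source-phase spectrum $\Phi_i$ is kept unchanged, and the enhanced patch is obtained as
$$\tilde{P}_i = \Re\big(\mathcal{F}^{-1}(\tilde{A} \odot e^{j\Phi_i})\big),$$
where $\tilde{A}$ is the fused amplitude spectrum, $\mathcal{F}^{-1}(\cdot)$ denotes the inverse Fourier transform, $\Re(\cdot)$ extracts the real part, and $\tilde{P}_i$ is the reconstructed enhanced patch. The enhanced patch is then pasted back into the original lesion region to form the augmented image $\tilde{x}_i$:
$$\tilde{x}_i = (1 - \Omega_i) \odot x_i + \Omega_i \odot \tilde{P}_i^{\uparrow},$$
where $\Omega_i$ is the binary lesion mask in image coordinates and $\tilde{P}_i^{\uparrow}$ is the reconstructed patch resized back to the original crop box. This mask-constrained reconstruction keeps the non-lesion background unchanged and restricts the frequency-induced appearance change to the annotated lesion region.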
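The amplitude mixing, source-phase reconstruction, and mask-constrained paste-back can be sketched in NumPy. This is a simplified illustration, not the training implementation: a constant weight map stands in for the learned frequency-selection subnetwork, both patches are assumed to have the same size already, the resize step is omitted, and the names `lpfa_patch` and `paste_back` are hypothetical.

```python
import numpy as np

def lpfa_patch(src_patch, ref_patch, weight):
    """Mix the amplitude spectra of two equally sized patches and keep the source phase."""
    f_src = np.fft.fft2(src_patch, axes=(0, 1))
    f_ref = np.fft.fft2(ref_patch, axes=(0, 1))
    amp_src, phase_src = np.abs(f_src), np.angle(f_src)
    amp_ref = np.abs(f_ref)
    amp_mix = weight * amp_src + (1.0 - weight) * amp_ref   # bounded convex mixture
    mixed = amp_mix * np.exp(1j * phase_src)                # source phase retained
    return np.real(np.fft.ifft2(mixed, axes=(0, 1)))

def paste_back(image, patch, mask, box):
    """Replace only the masked lesion pixels inside the crop box."""
    y0, y1, x0, x1 = box
    out = image.copy()
    m = mask[y0:y1, x0:x1, None].astype(image.dtype)
    out[y0:y1, x0:x1] = (1.0 - m) * image[y0:y1, x0:x1] + m * patch
    return out

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))            # stand-in source image
ref_patch = rng.random((16, 16, 3))        # stand-in paired lesion patch
box = (8, 24, 8, 24)                       # crop box with margin (y0, y1, x0, x1)
mask = np.zeros((32, 32), dtype=bool)
mask[12:20, 12:20] = True                  # lesion pixels inside the box

src_patch = image[8:24, 8:24]
weight = np.full((16, 16, 1), 0.7)         # hypothetical constant weight map
augmented = paste_back(image, lpfa_patch(src_patch, ref_patch, weight), mask, box)
```

Since the mask is reused unchanged and pixels outside it are copied from the source image, the label coordinates of the augmented sample are identical to those of the original.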
Since LPFA does not geometrically transform the label, the original mask is reused for the augmented sample. Source-phase retention and mask-constrained reconstruction vary the lesion appearance within the annotated region while preserving the label coordinates. The mask IoU, mask Dice, and Boundary IoU between the original and reused labels are therefore 100.0% by construction, and the centroid shift is 0.0 pixels. The LPFA procedure is illustrated in Figure 4.
3.3. Confidence-Guided Large-Kernel Encoding (CGLE)
After LPFA has enlarged appearance variation at the data level, the main question inside the segmentation network becomes how to allocate contextual modelling capacity more effectively to difficult regions. CGLE takes TransUNet as the basic framework and feeds coarse prediction confidence back into selected middle and late encoder blocks to regulate the contribution of a large-kernel branch. Regions with low coarse confidence often correspond to blurred boundaries, texture interference, low contrast, or class ambiguity. These regions require broader contextual aggregation, whereas high-confidence regions can retain stronger local modelling.
Let the coarse segmentation logits obtained from an initial forward pass be denoted by $Z \in \mathbb{R}^{H \times W \times K}$, where $H$ and $W$ are the image height and width and $K$ is the number of classes. For a pixel location $p$, the class confidence is defined as
$$s(p) = \max_{k} \frac{\exp\big(Z_k(p)\big)}{\sum_{k'} \exp\big(Z_{k'}(p)\big)},$$
where $s(p)$ is the maximum class confidence at location $p$. This confidence map provides a unified confidence source for both the encoder and the decoder.
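The softmax and maximum-probability selection can be written in a few lines. The sketch below assumes channel-last logits of shape `(H, W, K)`; the function name `confidence_map` is hypothetical.

```python
import numpy as np

def confidence_map(logits):
    """Maximum softmax probability per pixel for logits of shape (H, W, K)."""
    shifted = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    probs = np.exp(shifted)
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs.max(axis=-1)

logits = np.zeros((2, 3, 7))      # seven classes, all equally likely
logits[0, 0, 4] = 10.0            # one confidently classified pixel
s = confidence_map(logits)
print(s.shape)                    # (2, 3)
```

With equal logits the confidence falls to its minimum of $1/K$, which is the situation in which the later modules assign the most context and correction.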
The confidence cue is generated by an internal coarse branch before final refinement. The computation sequence is as follows:
- (i) The image is embedded and first encoded without confidence guidance;
- (ii) The decoder produces a coarse feature map without CFDR correction;
- (iii) The segmentation head maps this coarse feature to the coarse logits, from which the confidence map $s$ is computed;
- (iv) $s$ is detached from the computation graph and resized to the corresponding encoder and decoder resolutions;
- (v) The encoder is run with CGLE guidance, and the decoder is then refined with CFDR to obtain the final prediction.
This produces one coarse-to-refined evaluation sequence for each input image, and the additional coarse branch is included in the reported computational measurements.
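In the $t$th modified encoder block, the input token feature is denoted by $X_t$. The original TransUNet self-attention path is first retained as
$$\tilde{X}_t = X_t + \mathrm{MSA}\big(\mathrm{LN}(X_t)\big),$$
where $\mathrm{MSA}(\cdot)$ denotes multi-head self-attention, $\mathrm{LN}(\cdot)$ denotes layer normalisation, and $\tilde{X}_t$ is the attention-updated token feature. CGLE is then attached as a residual contextual adapter after this attention path. After layer normalisation, the feature is sent to a small-kernel branch and a large-kernel branch:
$$Y_t^{s} = \mathrm{Conv}_{k_s}\big(\mathrm{LN}(\tilde{X}_t)\big), \qquad Y_t^{l} = \mathrm{Conv}_{k_l, d}\big(\mathrm{LN}(\tilde{X}_t)\big),$$
where $k_s$ and $k_l$ denote the kernel sizes of the small-kernel and large-kernel branches, $d$ denotes the dilation rate of the large-kernel branch, and $Y_t^{s}$ and $Y_t^{l}$ are the outputs of the two branches.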
In the implementation, the large-kernel branch uses a kernel size of 7 with a dilation rate of 2, giving an effective large-kernel receptive field of $13 \times 13$ pixels on the corresponding feature map. CGLE is inserted as a residual contextual adapter in the middle-to-late transformer blocks indexed 4–9, while the original TransUNet attention and MLP paths are retained. These values were selected from preliminary validation trials over large-kernel sizes and dilation rates, where a kernel size of 7 with dilation 2 provided the best balance between boundary accuracy and computational cost.
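The effective extent of a dilated kernel follows the standard relation $k_{\mathrm{eff}} = k + (k - 1)(d - 1)$. The short check below applies it to the selected large-kernel setting; the function name is hypothetical.

```python
def effective_kernel_size(kernel, dilation):
    """Spatial extent covered by a dilated convolution kernel along one axis."""
    return kernel + (kernel - 1) * (dilation - 1)

k_eff = effective_kernel_size(kernel=7, dilation=2)
print(k_eff)  # 13
```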
To combine structural variation with semantic uncertainty, a local structure descriptor
is first defined. Let
denote the channel-averaged feature map obtained after restoring
to the spatial resolution of the
tth block. Then
where
denotes a lightweight projection layer and
is the structural descriptor map for block
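$t$. In parallel, the confidence map in image space is resized to the spatial resolution of the $t$th block to obtain an uncertainty map
$$U_t = 1 - \mathrm{Resize}_t(s),$$
where $\mathrm{Resize}_t(\cdot)$ denotes the resizing operation from image space to the resolution of block $t$, and $U_t$ is the uncertainty map at that scale. The structural descriptor $D_t$ and the uncertainty map $U_t$ are concatenated along the channel dimension and transformed into a gating map:
$$G_t = \sigma\big(\mathrm{Proj}_g([D_t \,\Vert\, U_t])\big),$$
where $[\cdot \,\Vert\, \cdot]$ denotes channel concatenation, $\mathrm{Proj}_g(\cdot)$ denotes a projection layer, $\sigma(\cdot)$ is the sigmoid function, and $G_t$ is the gating weight map.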
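Once the gating weight map $G_t$ is obtained, the two convolution branches are fused in a position-adaptive manner:
$$C_t = (1 - G_t) \odot Y_t^{s} + G_t \odot Y_t^{l},$$
where $Y_t^{s}$ and $Y_t^{l}$ are the small-kernel and large-kernel branch outputs and $C_t$ is the fused contextual feature. The contextual adapter is added residually to the attention output, and the original MLP path is then applied:
$$\hat{X}_t = \tilde{X}_t + C_t, \qquad X_t^{\mathrm{out}} = \hat{X}_t + \mathrm{MLP}\big(\mathrm{LN}(\hat{X}_t)\big),$$
where $\tilde{X}_t$ is the attention-updated token feature, $\hat{X}_t$ is the adapter-augmented feature, and $X_t^{\mathrm{out}}$ is the output feature of block $t$. When $G_t$ is large, the large-kernel branch receives greater weight at that location; when $G_t$ is small, the small-kernel branch remains dominant.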
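The position-adaptive fusion can be sketched with a toy gate. In the illustration below, a per-pixel linear projection with hand-picked weights stands in for the learned projection layer, and the structural descriptor is set to zero; all names and numeric values are assumptions chosen only to show the behaviour of the gate.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gate_from_cues(structure, uncertainty, w_structure, w_uncertainty, bias):
    """Toy stand-in for the projection layer: linear map over two cues, then sigmoid."""
    return sigmoid(w_structure * structure + w_uncertainty * uncertainty + bias)

def fuse_branches(small, large, gate):
    """Convex combination of the two branch outputs, weighted per spatial position."""
    g = gate[..., None]                       # broadcast over channels
    return (1.0 - g) * small + g * large

structure = np.zeros((4, 4))                  # placeholder structural descriptor
uncertainty = np.full((4, 4), 0.1)
uncertainty[1:3, 1:3] = 0.9                   # an uncertain central region
gate = gate_from_cues(structure, uncertainty, 1.0, 4.0, -2.0)  # hypothetical weights

small = np.zeros((4, 4, 8))                   # stand-in small-kernel output
large = np.ones((4, 4, 8))                    # stand-in large-kernel output
fused = fuse_branches(small, large, gate)
print(fused[0, 0, 0] < fused[1, 1, 0])        # True: uncertain pixels lean on the large kernel
```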
3.4. Confidence-Filtered Decoder Refinement (CFDR)
CGLE addresses the question of where broader context is needed in the encoder. CFDR then addresses a related question in the decoder: where stronger correction is needed during reconstruction. CFDR uses the same coarse confidence source as CGLE, so encoder-side context allocation and decoder-side feature correction are guided by a consistent cue. Because coarse prediction errors are usually concentrated in low-confidence regions, the decoder does not apply the same correction strength to all positions. Instead, a confidence-derived filtering map is used to regulate the spatial range and strength of the correction residual.
At decoder stage $q$, let $F_q$ denote the local decoder feature after feature fusion, and let $S_q$ denote the higher-level semantic feature resized to the same spatial size as $F_q$. The correction residual is first generated by
$$R_q = \rho_q(F_q, S_q),$$
where $\rho_q(\cdot)$ denotes a lightweight convolutional projector and $R_q$ is the correction residual feature at decoder stage $q$.
The image-space confidence map is then resized to the spatial size of decoder stage $q$, and a filtering weight map is defined as
$$W_q = \big(1 - \mathrm{Resize}_q(s)\big)^{\gamma},$$
where $\mathrm{Resize}_q(\cdot)$ denotes the resizing operation that maps the confidence map to the resolution of decoder stage $q$, $W_q$ is the filtering weight map, and $\gamma$ is an exponent that controls filtering strength. Because $s$ is the maximum class confidence, lower-confidence regions produce larger values in $W_q$.
The exponent was set to in all reported experiments. This setting suppresses unnecessary residual correction in high-confidence regions while retaining a strong response for uncertain boundaries and fragmented lesion regions.
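The refined decoder feature is finally written as
$$\hat{F}_q = F_q + W_q \odot R_q,$$
where $F_q$ is the fused decoder feature, $R_q$ is the correction residual, $W_q$ is the filtering weight map, and $\hat{F}_q$ is the corrected decoder feature. This equation means that the decoder correction strength is modulated by $W_q$ instead of being applied uniformly. Low-confidence regions therefore receive stronger correction, while high-confidence regions are less disturbed.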
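A minimal numerical sketch of the confidence-filtered correction is given below. It assumes that the filtering weight is the complement of the resized confidence raised to the exponent, that the confidence map has already been resized to the decoder resolution, and that the exponent value 2.0 is purely illustrative.

```python
import numpy as np

def cfdr_refine(feature, residual, confidence, gamma):
    """Add the correction residual, scaled per pixel by (1 - confidence) ** gamma."""
    weight = (1.0 - confidence) ** gamma          # larger where confidence is low
    return feature + weight[..., None] * residual

feature = np.ones((2, 2, 4))                      # stand-in decoder feature
residual = np.full((2, 2, 4), 0.5)                # stand-in correction residual
confidence = np.array([[1.0, 0.0],
                       [0.5, 0.9]])               # already at decoder resolution
refined = cfdr_refine(feature, residual, confidence, gamma=2.0)  # hypothetical exponent
print(refined[..., 0])
```

A fully confident pixel keeps its feature unchanged, while a pixel with zero confidence receives the full residual; intermediate confidences are attenuated more strongly as the exponent grows.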
3.5. Joint Optimisation
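The entire framework is optimised in an end-to-end manner. The objective function is written as
$$\min_{\theta, \phi} \; \mathbb{E}_{(x, y)} \Big[ \mathcal{L}_{\mathrm{seg}}\big(f_{\theta}(\mathcal{A}_{\phi}(x)),\, y\big) \Big],$$
where $(x, y)$ denotes a training sample and its corresponding label, $\theta$ denotes the parameters of the segmentation network $f_{\theta}$, $\phi$ denotes the parameters of the augmentation operator $\mathcal{A}_{\phi}$, and $\mathcal{L}_{\mathrm{seg}}$ denotes the segmentation loss. This objective indicates that LPFA and the segmentation network are trained jointly under the same segmentation target.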
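The segmentation loss is defined as a weighted combination of cross-entropy loss and Dice loss:
$$\mathcal{L}_{\mathrm{seg}} = \lambda_{\mathrm{ce}}\,\mathcal{L}_{\mathrm{ce}} + \lambda_{\mathrm{dice}}\,\mathcal{L}_{\mathrm{dice}},$$
where $\mathcal{L}_{\mathrm{ce}}$ denotes cross-entropy loss, $\mathcal{L}_{\mathrm{dice}}$ denotes Dice loss, and $\lambda_{\mathrm{ce}}$ and $\lambda_{\mathrm{dice}}$ are the corresponding balancing coefficients. The balancing coefficients were fixed as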
and
in all experiments; therefore, the exact training objective was
. Here,
$\mathcal{L}_{\mathrm{ce}}$ is the pixel-wise multi-class cross-entropy, and $\mathcal{L}_{\mathrm{dice}}$ is the standard soft Dice loss calculated from the predicted softmax probabilities and the one-hot segmentation masks. No additional annotation-level supervision is imposed on the frequency-selection network $g_{\phi}$; the bounded frequency mixture defines the augmentation range, and $g_{\phi}$ is updated jointly through the overall segmentation objective.
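The combined segmentation loss can be sketched for a single image as follows. This is an illustrative NumPy version under stated assumptions: the soft Dice term is averaged over classes with a small smoothing constant, and the coefficients 0.5 and 0.5 in the example are placeholders, not the values used in the experiments.

```python
import numpy as np

def segmentation_loss(probs, onehot, w_ce, w_dice, eps=1e-6):
    """Weighted sum of pixel-wise cross-entropy and class-averaged soft Dice loss.

    probs, onehot: arrays of shape (H, W, K) holding softmax outputs and one-hot labels.
    """
    ce = -np.mean(np.sum(onehot * np.log(np.clip(probs, eps, 1.0)), axis=-1))
    inter = np.sum(probs * onehot, axis=(0, 1))
    denom = np.sum(probs, axis=(0, 1)) + np.sum(onehot, axis=(0, 1))
    dice = 1.0 - np.mean((2.0 * inter + eps) / (denom + eps))
    return w_ce * ce + w_dice * dice

labels = np.array([[0, 1], [2, 0]])
onehot = np.eye(3)[labels]                         # (2, 2, 3) one-hot mask
uniform = np.full((2, 2, 3), 1.0 / 3.0)            # uninformative prediction

loss_perfect = segmentation_loss(onehot, onehot, w_ce=0.5, w_dice=0.5)
loss_uniform = segmentation_loss(uniform, onehot, w_ce=0.5, w_dice=0.5)
print(loss_perfect < loss_uniform)                 # True
```

The Dice term is computed per class before averaging, so a small lesion class contributes as much to that term as the background class.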