We propose BiNeXt-SMSMVL (Bilateral ConvNeXt-based Structure-aware Multi-scale Multi-view Learning Network), a multi-task framework for joint diagnosis of retinal diseases through bilateral image analysis. As illustrated in
Figure 1, the architecture employs a dual-stream heterogeneous design that simultaneously processes left and right fundus images to extract segmentation features for vasculature and optic disc/cup structures. Cross-attention mechanisms enable inter-ocular feature interaction and spatial alignment. The system integrates multi-scale representations spanning local lesions to global retinal context, with Squeeze-and-Excitation (SE) modules dynamically enhancing discriminative pathological features. Multi-level feature fusion is achieved through an enhanced ConvNeXt-Tiny backbone. By jointly optimizing disease classification and bilateral symmetry analysis, our framework concurrently diagnoses seven retinal pathologies (including diabetic retinopathy and glaucoma) while performing comparative biomarker analysis, significantly improving early lesion detection accuracy. Image preprocessing applies contrast-limited adaptive histogram equalization (CLAHE) exclusively to the luminance channel, enhancing local contrast while mitigating illumination inhomogeneity. A lightweight W-Net architecture segments critical anatomical structures (e.g., retinal vasculature and optic disc/cup), enabling computation of clinical biomarkers: cup-to-disc ratio (CDR), vascular fractal dimension (FD), vessel density (VD), and arteriole-to-venule ratio (AVR). Quantitative biomarkers are integrated into the BiNeXt-SMSMVL framework to establish mapping relationships between structural metrics and diagnostic predictions. The bilateral collaborative learning mechanism captures structural similarities and pathological differences between eyes. Also, a penalty loss function enforces consistency between bilateral overall labels and individual eye labels, thereby enhancing sensitivity to early-stage and mild lesions.
3.1. Data Pre-Processing and Augmentation
In fundus image analysis, distinguishing tissue structures and vascular morphology is critical for disease diagnosis. However, lighting and environmental conditions during imaging can degrade image quality, obscuring fine details of microvessels and retinal tissues [6,7]. Among these issues, brightness imbalance is the most common. To improve fundus image quality, we applied contrast-limited adaptive histogram equalization (CLAHE) [37] to the luminance channel, enhancing contrast and clarity to better highlight local details and address brightness unevenness.
As shown in
Figure 2, the process begins by converting the image from the RGB color space to the LAB color space. Next, the luminance channel (L) and the chrominance channels (A, B) are separated, and the CLAHE algorithm is applied only to the luminance channel (L) for contrast enhancement. The processed luminance channel is then recombined with the original chrominance channels, and the image is finally converted back to the RGB color space. This method significantly improves the quality of fundus images, making them more suitable for subsequent diagnostic analysis. In addition, we employed further image enhancement techniques, such as Ben’s color enhancement method, to improve image quality. To standardize the numerical range of the images, we performed normalization, optionally applied across the entire dataset, and used gamma correction to adjust image brightness and contrast.
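A minimal sketch of this preprocessing step, assuming OpenCV; the clip limit, tile grid size, and the parameters of Ben’s enhancement shown here are illustrative defaults rather than the exact values used in our experiments:

```python
import cv2
import numpy as np

def clahe_luminance(bgr_image: np.ndarray, clip_limit: float = 2.0,
                    tile_grid_size: tuple = (8, 8)) -> np.ndarray:
    """Apply CLAHE to the L channel of the LAB color space only."""
    # Convert from BGR (OpenCV's default ordering) to LAB and split channels.
    lab = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2LAB)
    l_channel, a_channel, b_channel = cv2.split(lab)

    # Contrast-limited adaptive histogram equalization on the luminance channel,
    # leaving the chrominance (A, B) channels untouched.
    clahe = cv2.createCLAHE(clipLimit=clip_limit, tileGridSize=tile_grid_size)
    l_enhanced = clahe.apply(l_channel)

    # Recombine the channels and convert back to BGR color space.
    lab_enhanced = cv2.merge((l_enhanced, a_channel, b_channel))
    return cv2.cvtColor(lab_enhanced, cv2.COLOR_LAB2BGR)

def ben_graham_enhance(bgr_image: np.ndarray, sigma: float = 10.0) -> np.ndarray:
    """Ben's color enhancement: subtract a heavily blurred copy to normalize illumination."""
    blurred = cv2.GaussianBlur(bgr_image, (0, 0), sigma)
    return cv2.addWeighted(bgr_image, 4, blurred, -4, 128)
```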
We implemented a series of data augmentation operations to enhance the model’s generalization capability, including random horizontal and vertical flipping. These techniques help the model learn the symmetrical features of the images and reduce dependency on image orientation [
38]. In addition, random affine transformations, such as rotation, translation, scaling, and shearing, enable the model to adapt to images from different perspectives and scales. Random color transformations, including brightness, contrast, saturation, and hue adjustments, simulate images under varying lighting conditions, enhancing color robustness. Gaussian blurring reduces fine image details, encouraging the model to focus on the overall structure of the image, while random erasing simulates potential occlusions, improving the model’s robustness to them. The comparison between original and augmented images is shown in
Figure 3.
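An illustrative torchvision composition of these augmentations; the probabilities and magnitudes shown are assumptions, not the exact values used in training:

```python
import torchvision.transforms as T

# Illustrative augmentation pipeline mirroring the operations described above.
train_transforms = T.Compose([
    T.RandomHorizontalFlip(p=0.5),
    T.RandomVerticalFlip(p=0.5),
    # Random affine: rotation, translation, scaling, and shearing.
    T.RandomAffine(degrees=15, translate=(0.05, 0.05), scale=(0.9, 1.1), shear=5),
    # Random color jitter: brightness, contrast, saturation, and hue.
    T.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.05),
    # Gaussian blurring to suppress fine detail and emphasize global structure.
    T.GaussianBlur(kernel_size=5, sigma=(0.1, 1.5)),
    T.ToTensor(),
    # Random erasing (applied on tensors) to simulate occlusions.
    T.RandomErasing(p=0.25, scale=(0.02, 0.1)),
])
```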
3.2. Vessel and Optic Disc/Cup Segmentation
Ophthalmic disease diagnosis models must embed specific medical knowledge frameworks to ensure alignment with clinical diagnostic pathways and enhance model interpretability. This study focuses on three key medical indicators for fundus diseases: the Cup-to-Disc Ratio (CDR) as a core metric for glaucoma screening, vessel tortuosity as a critical feature for diabetic retinopathy assessment, and the Arteriole-to-Venule Ratio (AVR) as a primary marker for hypertensive retinopathy [27,28]. We establish a bidirectional interpretability link between pathological mechanisms and computational models by explicitly extracting these clinically significant anatomical features. For instance, an increased CDR indicates the degree of optic nerve fiber atrophy. This medical knowledge-driven feature extraction strategy significantly improves the model’s clinical acceptability and provides a robust data foundation for subsequent multi-disease classification tasks.
Although many lightweight retinal vessel segmentation models outperform more complex architectures on specific datasets, their cross-domain generalization remains challenging: when test data deviate significantly from the training distribution (e.g., images captured by different fundus camera models), segmentation accuracy often degrades substantially. To address this, we designed an Unsharp Masking (USM) module [
39] that enhances high-frequency components in input images, which can be expressed as follows:
$$I_{USM} = I + \lambda\,\big(I - G_{\sigma}(I)\big) + \beta,$$
where $\lambda$ controls sharpness, $\beta$ provides brightness compensation, $G_{\sigma}(I)$ denotes the result of Gaussian blurring applied to the input image $I$, and $\sigma$ denotes the smoothing level.
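A minimal sketch of the USM operation, assuming OpenCV and the parameter names introduced above; the default values are illustrative:

```python
import cv2
import numpy as np

def unsharp_masking(image: np.ndarray, lam: float = 1.5, beta: float = 0.0,
                    sigma: float = 3.0) -> np.ndarray:
    """I_USM = I + lam * (I - G_sigma(I)) + beta."""
    img = image.astype(np.float32)
    blurred = cv2.GaussianBlur(img, (0, 0), sigma)      # G_sigma(I)
    high_freq = img - blurred                           # high-frequency detail
    enhanced = img + lam * high_freq + beta             # sharpen + brightness offset
    return np.clip(enhanced, 0, 255).astype(np.uint8)
```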
Figure 4 shows the comparison of images before and after USM processing. By observing the images, it is evident that USM processing effectively enhances the local contrast at vessel-background boundaries, enabling the encoder to more accurately capture texture features of key anatomical structures such as the optic disc contour and micro-vascular branches. This module achieves three key optimizations:
- (a) Explicit Feature Enhancement: USM enhances high-frequency components of the image through Laplacian operator-equivalent operations, making it easier for W-Net to capture fine structures such as vessel boundaries and optic disc contours during the encoding stage.
- (b) Improved Domain Adaptability: Brightness/color standardization reduces distribution shifts caused by imaging devices (e.g., different models of fundus cameras) or acquisition conditions (e.g., pupil dilation levels).
- (c) Enhanced Model Robustness: By suppressing illumination inhomogeneity (e.g., central reflection artifacts) while enhancing anatomically relevant features, the module reduces the risk of model overfitting to irrelevant artifacts.
Figure 4. Comparison before and after USM processing.
This module mitigates the risk of overfitting specific imaging patterns in the training data, providing reliable cross-domain adaptability for lesion screening and analysis. For vessel and optic disc/cup segmentation, we propose W-Net+, a multi-task segmentation model that integrates USM enhancement and attention-guided mechanisms, as illustrated in
Figure 5. The model employs a dual-branch cross-entropy loss combined with cosine annealing learning rate scheduling (decaying from an initial to a minimum learning rate) and multi-scale data augmentation, further strengthening its generalization capability under varying imaging conditions. Such a USM-enhanced W-Net+ framework improves segmentation accuracy and significantly enhances model stability in cross-dataset validation, providing a more reliable anatomical foundation for quantitative analysis of fundus lesions.
Specifically, the model consists of two cascaded U-Nets. The first U-Net performs initial localization of blood vessels in the input fundus image, extracting multi-scale features through successive downsampling and upsampling operations. Each convolutional block consists of two convolutional layers, with the number of channels starting from 32 and doubling progressively to 128, then gradually decreasing back to 32 along the upsampling path. Notably, the feature maps generated by the first U-Net are used directly for coarse vessel segmentation and also serve as a spatial attention signal that guides the second U-Net to focus on key anatomical regions. The second network adopts a similar encoder-decoder structure but adds skip connections at each decoding stage to integrate shallow and deep features from the encoding path. By concatenating the original fundus image with the first-stage feature map along the channel dimension, an information-enhanced input is formed, effectively preserving both the texture details of the original image and the vessel features extracted by the first network. This cascaded design enables the second U-Net to further refine vessel boundaries based on the coarse segmentation of the first stage. The entire model achieves high-precision segmentation of vessel boundaries with only 68,000 parameters (one to three orders of magnitude fewer than traditional models), featuring low computational complexity and fast inference. It is particularly suitable for clinical quantitative analyses, such as the vessel tortuosity calculations needed for retinopathy screening, and also provides practical algorithmic support for resource-constrained portable fundus examination devices.
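A simplified PyTorch sketch of this cascaded design with the 32-64-128 channel progression; layer details, normalization choices, and the exact attention wiring are illustrative and will not reproduce the reported 68,000-parameter count:

```python
import torch
import torch.nn as nn

def conv_block(in_ch: int, out_ch: int) -> nn.Sequential:
    """Two 3x3 convolutions, each followed by BatchNorm and ReLU."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
    )

class TinyUNet(nn.Module):
    """Minimal U-Net with a 32 -> 64 -> 128 -> 64 -> 32 channel path."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.enc1 = conv_block(in_ch, 32)
        self.enc2 = conv_block(32, 64)
        self.bottleneck = conv_block(64, 128)
        self.pool = nn.MaxPool2d(2)
        self.up2 = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.dec2 = conv_block(128, 64)           # skip connection doubles input channels
        self.up1 = nn.ConvTranspose2d(64, 32, 2, stride=2)
        self.dec1 = conv_block(64, 32)
        self.head = nn.Conv2d(32, out_ch, 1)

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        b = self.bottleneck(self.pool(e2))
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))
        return self.head(d1)

class CascadedWNet(nn.Module):
    """Two cascaded U-Nets: the first produces a coarse vessel map that is
    concatenated with the original image to guide the second, refining network."""
    def __init__(self, in_ch: int = 3, out_ch: int = 1):
        super().__init__()
        self.unet1 = TinyUNet(in_ch, out_ch)
        self.unet2 = TinyUNet(in_ch + out_ch, out_ch)

    def forward(self, x):
        coarse = torch.sigmoid(self.unet1(x))              # coarse segmentation / spatial guidance
        refined = self.unet2(torch.cat([x, coarse], dim=1))
        return coarse, refined
```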
We chose a decoupled architecture over multi-task learning (MTL) [40] because it is both necessary, since the complex ODIR-5K dataset lacks segmentation masks for its dozens of rare pathologies, and strategically advantageous, allowing us to rely on robust, disease-neutral anatomical priors (e.g., CDR) rather than disease-specific lesion annotations.
3.3. Structure-Aware Bilateral Multi-Scale Disease Classification
While W-Net+ (
Section 3.2) provides precise anatomical segmentation and clinically interpretable metrics (CDR, vessel tortuosity, AVR), preliminary experiments showed that using only these geometric features with traditional classifiers like AutoGluon achieved limited performance (60-65% accuracy) on hypertensive retinopathy and macular diseases. This is because geometric metrics alone cannot capture subtle color changes (hemorrhages, exudates) and texture patterns (drusen, edema) critical for these conditions. Therefore, we propose BiNeXt-SMSMVL, which integrates structural guidance from W-Net+ segmentation with multi-scale visual features and bilateral information to address these limitations. As shown in
Figure 6, this approach combines the interpretability of medical metrics with the discriminative power of deep visual features for comprehensive disease classification. In practice, we adopt ConvNeXt-Tiny as the backbone, as shown in
Figure 7.
The ConvNeXt-Tiny network architecture includes an image preprocessing module (stem) and four consecutive feature extraction stages. In these stages, the spatial dimensions of feature maps gradually decrease while the number of channels progressively increases, enabling the feature maps to capture increasingly broad feature receptive fields. Each stage consists of a downsampling layer shown in
Figure 7c and ConvNeXt blocks shown in
Figure 7b, where the downsampling layer is responsible for reducing image resolution and the ConvNeXt blocks handle feature extraction. Compared to ResNet50, ConvNeXt-Tiny adjusts the stacking ratio of its stages (from ResNet50’s 3, 4, 6, 3 to ConvNeXt-Tiny’s 3, 3, 9, 3), recombining previously extracted features so that the final representations are richer and more expressive, which is highly advantageous for image classification tasks. Additionally, the ConvNeXt block draws on the grouped convolution idea from ResNeXt, achieving a better balance between model complexity and accuracy. The shortcut connections in the network help gradients propagate backward during training, easing optimization. ConvNeXt-Tiny also reduces the number of normalization layers to accelerate training and adopts layer normalization to decrease the model’s sensitivity to parameter initialization. Except for the first stage, each stage is preceded by a downsampling layer that changes the dimensions of the feature maps, enabling the network to extract progressively more complex features.
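For reference, the backbone can be instantiated and inspected with torchvision, whose ConvNeXt-Tiny implementation uses stage depths of 3, 3, 9, 3; this sketch simply prints how spatial resolution shrinks while the channel count grows:

```python
import torch
from torchvision.models import convnext_tiny, ConvNeXt_Tiny_Weights

# Load an ImageNet-pretrained ConvNeXt-Tiny backbone.
backbone = convnext_tiny(weights=ConvNeXt_Tiny_Weights.DEFAULT)

# backbone.features is a sequence of: stem, stage 1, downsample, stage 2,
# downsample, stage 3, downsample, stage 4.
x = torch.randn(1, 3, 224, 224)
for i, layer in enumerate(backbone.features):
    x = layer(x)
    print(i, tuple(x.shape))   # spatial size decreases while channels increase
```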
Objects and structures in images exhibit different characteristics at various scales. For instance, small-scale features may contain detailed information (edges and textures), while large-scale features may contain overall structural information (such as shapes and contours). Combining features from different scales allows the model to simultaneously utilize both detailed and overall information, thereby comprehending image content more comprehensively. As shown in
Figure 8, we applied a pre-trained ConvNeXt-Tiny model to extract multi-scale features. The model’s first four stages are extracted separately to form different feature extraction modules, and four convolutional layers are defined as a set of adapters to modify the output channel number of each stage for subsequent processing. This way, we can obtain feature maps of different scales from each stage when the input image passes through all four stages sequentially. Due to the adapters, the output channel numbers of the four stages are adapted to 64, 128, 256, and 576, respectively. Subsequently, adaptive average pooling is applied to each feature map, and finally, all pooled feature maps are concatenated along the channel dimension to obtain the final multi-scale feature map. To further enhance the network’s performance, we introduce the attention mechanism network SENet, as shown in
Figure 9.
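Before turning to the SE layer, the multi-scale extraction and adapter scheme described above can be sketched as follows, assuming the torchvision ConvNeXt-Tiny stage layout (96/192/384/768 channels); the 1x1 adapter kernels and the 7x7 pooling resolution are illustrative assumptions:

```python
import torch
import torch.nn as nn
from torchvision.models import convnext_tiny, ConvNeXt_Tiny_Weights

class MultiScaleExtractor(nn.Module):
    """Collect features from all four ConvNeXt-Tiny stages, remap channels with
    adapters (64/128/256/576), pool to a common size, and concatenate along channels."""
    def __init__(self, out_channels=(64, 128, 256, 576), pool_size: int = 7):
        super().__init__()
        feats = convnext_tiny(weights=ConvNeXt_Tiny_Weights.DEFAULT).features
        # Group (downsample, stage) pairs so each module yields one stage output.
        self.stages = nn.ModuleList([feats[0:2], feats[2:4], feats[4:6], feats[6:8]])
        stage_channels = (96, 192, 384, 768)
        self.adapters = nn.ModuleList(
            nn.Conv2d(c_in, c_out, kernel_size=1)
            for c_in, c_out in zip(stage_channels, out_channels)
        )
        self.pool = nn.AdaptiveAvgPool2d(pool_size)

    def forward(self, x):
        pooled = []
        for stage, adapter in zip(self.stages, self.adapters):
            x = stage(x)                            # stage output at its own scale
            pooled.append(self.pool(adapter(x)))    # remap channels, unify spatial size
        return torch.cat(pooled, dim=1)             # (B, 64+128+256+576, pool, pool)

features = MultiScaleExtractor()(torch.randn(1, 3, 224, 224))   # -> (1, 1024, 7, 7)
```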
The SE layer explicitly models interdependencies between channels to recalibrate channel-wise features. Its working principle can be divided into three phases. First, the squeeze phase compresses the spatial dimensions of each channel’s feature map into a single value through Global Average Pooling (GAP), so that each channel yields a descriptor representing its global information. Second, the excitation phase uses a fully connected layer (typically followed by a ReLU activation and another fully connected layer with a sigmoid gate) to learn dependencies between channels, generating weights that represent each channel’s importance. Third, the re-scaling phase multiplies the generated weights by the original channel feature maps to re-scale them, thereby enhancing important features and suppressing unimportant ones.
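A minimal sketch of this SE layer; the reduction ratio of 16 is an assumed default:

```python
import torch
import torch.nn as nn

class SELayer(nn.Module):
    """Squeeze-and-Excitation: squeeze (GAP) -> excitation (FC-ReLU-FC-Sigmoid) -> re-scale."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)             # global average pooling
        self.excitation = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                                   # per-channel importance in [0, 1]
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.squeeze(x).view(b, c)                      # (B, C) channel descriptors
        w = self.excitation(w).view(b, c, 1, 1)             # (B, C, 1, 1) channel weights
        return x * w                                        # re-scale the original feature maps
```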
This multi-scale feature map (labeled ‘Image Features’ in
Figure 6) serves as the primary visual input to our core fusion module. As detailed in
Figure 6, our “structure-aware” framework processes anatomical information in parallel. The segmentation masks from W-Net+ are reshaped to create ‘Biometric-based Features’, which act as a spatial-morphological feature map. Both the ‘Image Features’ and the ‘Biometric-based Features’ are then processed by their own independent SE layer (Squeeze-and-Excitation layer), referencing the mechanism described in
Figure 9, to enhance their respective channel-wise features. Concurrently, both feature sets are fed into a Cross-Attention module to model their interactions. In this module, we use the biometric features as the Query (Q) and the image features as the Key (K) and Value (V), guiding the model to learn the correlation between anatomical regions and visual pathologies. The outputs of these three modules (SE-Image, SE-Biometric, and Cross-Attention) are then combined via element-wise addition.
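A simplified sketch of this fusion step; a compact SE block is repeated here for self-containment, and flattening the feature maps into token sequences for nn.MultiheadAttention, as well as the head count, are our assumptions:

```python
import torch
import torch.nn as nn

class SELayer(nn.Module):
    """Compact Squeeze-and-Excitation block (same idea as the sketch above)."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):
        b, c = x.shape[:2]
        w = self.fc(x.mean(dim=(2, 3))).view(b, c, 1, 1)    # squeeze + excite
        return x * w                                         # re-scale

class StructureAwareFusion(nn.Module):
    """SE on image and biometric feature maps, cross-attention with biometric
    features as queries and image features as keys/values, then element-wise addition."""
    def __init__(self, channels: int, num_heads: int = 8):
        super().__init__()
        self.se_image = SELayer(channels)
        self.se_biometric = SELayer(channels)
        self.cross_attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, image_feat, biometric_feat):
        # Both inputs: (B, C, H, W) with matching shapes.
        b, c, h, w = image_feat.shape
        se_img = self.se_image(image_feat)
        se_bio = self.se_biometric(biometric_feat)

        # Flatten the spatial grid into token sequences of shape (B, H*W, C).
        q = biometric_feat.flatten(2).transpose(1, 2)        # queries from biometric features
        kv = image_feat.flatten(2).transpose(1, 2)           # keys/values from image features
        attn_out, _ = self.cross_attn(q, kv, kv)
        attn_out = attn_out.transpose(1, 2).reshape(b, c, h, w)

        return se_img + se_bio + attn_out                    # element-wise addition
```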
We constructed the multi-task classification head based on this fused representation, as shown in
Figure 6. As a final fusion step, the retipy library extracts explicit Numerical Biomarkers (e.g., CDR, VD, FD) from the original masks. This 1D vector is concatenated (indicated by the ‘C’ icon in
Figure 6) with the feature vector resulting from the element-wise addition. This final, comprehensive vector is then fed into three distinct classifiers for left-eye, right-eye, and binocular (L&R) classification. During training, we freeze the parameters of the shared feature extraction modules to ensure both left and right eye images benefit from a common feature extractor. This architecture allows the model to learn shared underlying patterns while the independent attention components and final classifier paths capture eye-specific diagnostic information.
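A sketch of this final fusion and the three classification heads; the feature dimension, the number of numerical biomarkers, and the eight-way label layout are illustrative assumptions:

```python
import torch
import torch.nn as nn

class MultiTaskHead(nn.Module):
    """Pool the fused feature map, concatenate the 1-D numerical biomarker vector,
    and classify with three independent heads (left eye, right eye, binocular)."""
    def __init__(self, feat_channels: int = 1024, num_biomarkers: int = 8,
                 num_classes: int = 8):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)
        in_dim = feat_channels + num_biomarkers
        self.left_head = nn.Linear(in_dim, num_classes)
        self.right_head = nn.Linear(in_dim, num_classes)
        self.both_head = nn.Linear(in_dim, num_classes)

    def forward(self, fused_feat, biomarkers):
        # fused_feat: (B, C, H, W) from the SE / cross-attention addition.
        # biomarkers: (B, num_biomarkers), e.g. CDR, VD, FD extracted from the masks.
        v = self.gap(fused_feat).flatten(1)                  # (B, C)
        v = torch.cat([v, biomarkers], dim=1)                # concatenation ('C' in Figure 6)
        return self.left_head(v), self.right_head(v), self.both_head(v)
```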
3.4. Loss Function Design
Constructing or optimizing the loss function is a crucial component in model optimization. The loss function defines the difference or error between the model’s predicted and true values and is the objective that needs to be minimized during model training. A good loss function should accurately reflect the deficiencies in model performance and guide the model to learn the correct features. The binary cross-entropy loss function we use is a common loss function for binary classification problems. It measures the difference between the probability distribution predicted by the model and the true label distribution and is widely used in various classification tasks, such as image recognition. The formula for the binary cross-entropy loss function is as follows:
$$\mathcal{L}_{BCE} = -\frac{1}{N}\sum_{i=1}^{N}\left[\,y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\,\right],$$
where $y_i$ represents the true label (0 or 1) and $\hat{y}_i$ represents the probability predicted by the model. The binary cross-entropy loss function calculates the logarithmic difference between true and predicted labels. When the model’s predictions perfectly match the true labels, the cross-entropy loss is 0; the greater the discrepancy between the predicted probabilities and the true labels, the higher the loss value, which motivates the model to adjust its parameters during training to reduce prediction errors. The multi-task loss adopted in our project combines the three branch losses, where $\mathcal{L}_{left}$, $\mathcal{L}_{right}$, and $\mathcal{L}_{both}$ represent the binary cross-entropy losses for the left eye, right eye, and binocular branches, respectively, and $\mathcal{L}_{total}$ represents the overall model loss function:
$$\mathcal{L}_{total} = \mathcal{L}_{left} + \mathcal{L}_{right} + \mathcal{L}_{both}.$$
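Under these definitions, a minimal sketch of the multi-task loss; the equal branch weighting and mean reduction are assumptions:

```python
import torch
import torch.nn.functional as F

def multitask_bce_loss(left_logits, right_logits, both_logits,
                       left_labels, right_labels, both_labels):
    """Sum of per-branch binary cross-entropy losses (multi-label, one logit per class).

    All label tensors are expected as float tensors of 0s and 1s.
    """
    loss_left = F.binary_cross_entropy_with_logits(left_logits, left_labels)
    loss_right = F.binary_cross_entropy_with_logits(right_logits, right_labels)
    loss_both = F.binary_cross_entropy_with_logits(both_logits, both_labels)
    return loss_left + loss_right + loss_both
```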
To better optimize the model, we added a penalty term to the original binary cross-entropy loss, resulting in the final loss function shown below:
$$\mathcal{L}_{final} = \mathcal{L}_{total} + \lambda\,\mathcal{L}_{penalty},$$
where $\lambda$ is the penalty weight.
The left and right eye labels should satisfy the following rules: when the first position (i.e., Normal) of both eye labels is 1, the first position of the binocular total label should also be 1. For the other seven positions (seven types of diseases), if either eye has a position value of 1, the corresponding position in the binocular total label should be 1. To encourage the binocular total label calculation to meet the above rules, we define a penalty term as follows:
$$\mathcal{L}_{penalty} = \frac{1}{N}\sum_{i=1}^{N}\Big( \mathbb{1}[\,i = 1\,]\,\big|\hat{y}^{B}_{i} - \big(y^{L}_{i} \wedge y^{R}_{i}\big)\big| + \mathbb{1}[\,i > 1\,]\,\big|\hat{y}^{B}_{i} - \max\big(y^{L}_{i},\, y^{R}_{i}\big)\big| \Big),$$
where $N$ is the total number of labels, $\hat{y}^{B}_{i}$ represents the $i$-th binocular total label, and $y^{L}_{i}$ and $y^{R}_{i}$ represent the $i$-th left and right eye labels, respectively. Specifically, $\mathbb{1}[\cdot]$ is the indicator function, which evaluates to 1 if the condition in brackets is true and 0 otherwise.
The calculation method is as follows: for the first position of the label, if the binocular total label is inconsistent with the logical AND of the two eyes’ labels, the absolute difference is accumulated; for the remaining seven positions, if the binocular total label is inconsistent with the maximum (logical OR) of the two eyes’ labels, the difference is likewise accumulated. This design of the penalty loss ensures that the model correctly reflects the label relationship between binocular images during prediction, thereby improving accuracy and consistency. By adding this penalty term to the total loss, the model is guided to follow these label rules during training. We chose the L1-like absolute difference (Equation (7)) because it directly enforces our deterministic logical ‘AND’/‘OR’ rules, unlike metrics such as KL divergence, which are unsuited to this task. This penalty is intentionally designed as a soft constraint (its weight $\lambda$ is optimized in the ablation study), guiding the model toward common bilateral symmetries while crucially retaining the flexibility to learn valid asymmetric disease presentations.
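A sketch of this penalty under the rules above, assuming eight-dimensional multi-label vectors with position 0 corresponding to Normal; the batch-mean reduction is an assumption:

```python
import torch

def bilateral_consistency_penalty(both_probs, left_labels, right_labels):
    """Penalize binocular predictions that violate the bilateral label rules.

    both_probs:   (B, 8) predicted binocular probabilities.
    left_labels:  (B, 8) binary left-eye labels (float).
    right_labels: (B, 8) binary right-eye labels (float).
    """
    # Position 0 ("Normal"): the binocular label should equal the logical AND of both eyes.
    target_normal = left_labels[:, 0] * right_labels[:, 0]
    penalty_normal = (both_probs[:, 0] - target_normal).abs()

    # Positions 1-7 (diseases): the binocular label should equal the element-wise maximum (OR).
    target_disease = torch.maximum(left_labels[:, 1:], right_labels[:, 1:])
    penalty_disease = (both_probs[:, 1:] - target_disease).abs().sum(dim=1)

    return (penalty_normal + penalty_disease).mean()

# total = multitask_bce_loss(...) + lam * bilateral_consistency_penalty(...)
```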