2.1. Datasets
Primary data. All experiments use the RETA–IDRiD vascular tree subset, released with the RETA benchmark [12,24]. It contains 81 color fundus images (45° field of view (FOV), 4288 × 2848 px, saved as 1024 × 1024 crops in our pipeline) with dense annotations for vessels, artery/vein identity, bifurcations, and skeletons. We follow the official split: 54 images for training/validation and 27 for hold-out testing. IDRiD images are deidentified and publicly licensed for research; no additional ethical approval is required.
Of the 81 labeled RETA–IDRiD images, 27 are reserved by the organizers as a challenge test set: their masks are not publicly released, and evaluation is possible only by blind submission to the leaderboard. To keep our study fully reproducible, we therefore work solely with the 54 accessible images. Ten (about 18%) are held out as an internal test fold, stratified to match the prevalence of lesions in the entire set (two heavily diseased eyes, eight mild/normal). The remaining 44 form the training pool. This roughly 80/20 split is standard practice for small medical datasets and avoids data leakage while preserving enough samples for model fitting and augmentation.
Why RETA? Compared with legacy sets such as DRIVE [25], STARE [26] or CHASE_DB1 [27], RETA offers (i) cleaner masks (semi-automated, adjudicated), (ii) lesion variability (exudates, hemorrhages) that stresses false positive control, and (iii) structural labels enabling future benchmarks at the artery/vein or tree level. For video-based generalization, we point the reader to the recent RVD dataset [28], left for future work.
2.3. Image Processing Strategy: Column-Based Approach
As mentioned above, the distinctive aspect of our segmentation approach lies in the processing of retinal images column by column. This involves dividing each image into column vectors and, for each of these vectors, using a neural network to determine which pixels correspond to the retinal background and which correspond to blood vessels. In our case, the images have a resolution of 1024 × 1024 pixels, meaning that we process 1024 columns per image. These columns are subsequently reassembled in the correct order to reconstruct the segmentation of the original image.
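The decomposition and reassembly described above can be sketched with NumPy as follows. This is a minimal illustration, not the authors' implementation: the function names, the toy image size, and the placeholder per-column classifier (a simple threshold standing in for the neural network) are all assumptions made for the example.

```python
import numpy as np

def image_to_columns(image):
    """Split an (H, W, C) image into W sequences of shape (H, C), one per column."""
    # Move the column axis to the front: (H, W, C) -> (W, H, C).
    return np.transpose(image, (1, 0, 2))

def columns_to_mask(column_predictions):
    """Reassemble W per-column predictions of length H into an (H, W) mask."""
    # Prediction j becomes column j of the mask, so stack along axis 1.
    return np.stack(column_predictions, axis=1)

# Toy example: a small random array stands in for a 1024 x 1024 x 3 fundus crop.
rng = np.random.default_rng(0)
image = rng.random((8, 6, 3))
columns = image_to_columns(image)  # shape (6, 8, 3): 6 columns of 8 pixels

# Hypothetical stand-in for the network: threshold the first channel.
predictions = [(col[:, 0] > 0.5).astype(np.uint8) for col in columns]
mask = columns_to_mask(predictions)  # shape (8, 6), same layout as the image
```

Because each column keeps its index through both steps, the reassembled mask is pixel-aligned with the input image.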
Deep learning-based networks require large amounts of data for effective training, and the annotation process is time consuming and must be carried out by domain experts. The annotated dataset available for training is therefore limited. By changing the training unit from full images to individual image columns, we effectively increase the number of training samples by a factor of 1024. This significantly expands the annotated dataset and provides sufficient data to train deep learning models.
A key step in our approach is interpreting each column as a time series and processing it using neural networks based on LSTM layers, which are particularly well suited for handling sequential data.
The image processing workflow is illustrated in Figure 1, which shows how the image is decomposed into individual columns, how an estimate is obtained for each column, and how the final vessel-background segmentation is reconstructed by combining the estimates from all columns.
2.4. Long Short-Term Memory (LSTM) Networks
LSTM networks operate based on two core state vectors. The first, the hidden state vector $h_t$, represents both the internal state and the output of the LSTM layer at time step $t$. The second, the cell state vector $c_t$, is designed to preserve longer-term dependencies across the sequence. At each time step, the network selectively updates $c_t$ by adding or removing information through gated mechanisms.
An LSTM cell typically consists of four gates: (1) the input gate, denoted by the vector $i_t$, which controls how much new information is incorporated into the cell state; (2) the forget gate, $f_t$, which modulates the removal of past information from the cell state $c_{t-1}$; (3) the cell candidate gate, $g_t$, which provides new candidate values to be added to the state; and (4) the output gate, $o_t$, which determines the extent of information to be passed to the hidden state. These gates operate in conjunction to maintain and update the internal memory of the LSTM.
Figure 2a illustrates the internal architecture of an LSTM cell and the four gate outputs at time $t$: $i_t$, $f_t$, $g_t$, and $o_t$. The figure also displays the input vector $x_t$ and the computations that update both $c_t$ and $h_t$ using their respective values $c_{t-1}$ and $h_{t-1}$ from the previous time step ($t-1$).
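The LSTM network has three types of parameters: the input weights in the matrix $W$, the recurrent weights in the matrix $R$, and the biases in the vector $b$. The matrix $W$ combines information from the input $x_t$, $R$ controls the contribution of the previous hidden state $h_{t-1}$, and $b$ adds a bias. Each of these parameters is partitioned into subcomponents corresponding to the four gates:

$$W = \begin{bmatrix} W_i \\ W_f \\ W_g \\ W_o \end{bmatrix}, \qquad R = \begin{bmatrix} R_i \\ R_f \\ R_g \\ R_o \end{bmatrix}, \qquad b = \begin{bmatrix} b_i \\ b_f \\ b_g \\ b_o \end{bmatrix} \tag{1}$$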
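Each gate computes its output based on the current input $x_t$ and the hidden state from the previous time step $h_{t-1}$ according to the following:

$$i_t = \sigma\left(W_i x_t + R_i h_{t-1} + b_i\right) \tag{2}$$
$$f_t = \sigma\left(W_f x_t + R_f h_{t-1} + b_f\right) \tag{3}$$
$$g_t = \tanh\left(W_g x_t + R_g h_{t-1} + b_g\right) \tag{4}$$
$$o_t = \sigma\left(W_o x_t + R_o h_{t-1} + b_o\right) \tag{5}$$

Notice, for instance, that the input gate vector $i_t$ is calculated as the weighted input $W_i x_t$ plus the recurrent contribution $R_i h_{t-1}$ and the bias $b_i$. The activation function $\sigma$, a logistic sigmoid defined as $\sigma(x) = 1/(1 + e^{-x})$, is applied element-wise to the resulting vector. The remaining gate vectors are computed analogously, with the particularity that in Equation (4) the activation function is the hyperbolic tangent, $\tanh$, which is also applied element-wise.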
The gate outputs obtained in Equations (2)–(5), together with the cell state $c_{t-1}$, are used to update the cell and hidden state vectors $c_t$ and $h_t$ as follows:

$$c_t = f_t \odot c_{t-1} + i_t \odot g_t \tag{6}$$
$$h_t = o_t \odot \tanh(c_t) \tag{7}$$

where ⊙ denotes the Hadamard (element-wise) product. Notice that the cell state $c_t$ is updated by selectively retaining prior information ($f_t \odot c_{t-1}$) and incorporating new candidate content ($i_t \odot g_t$). The hidden state $h_t$, which serves as the LSTM output at time $t$, is derived from the cell state in the form $\tanh(c_t)$, weighted by the output gate vector $o_t$.
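The gate computations and state updates described above can be written as a single time step in NumPy. The sketch below is illustrative only: the stacking order of the gate blocks inside `W`, `R`, and `b` (input, forget, cell candidate, output), the hidden size, and the random initialization are assumptions made for the example, not values taken from the experiments.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, R, b):
    """One LSTM time step.

    W: (4H, D), R: (4H, H), b: (4H,). Rows are assumed to be stacked in the
    gate order input, forget, cell candidate, output.
    """
    H = h_prev.shape[0]
    z = W @ x_t + R @ h_prev + b        # pre-activations of all four gates
    i_t = sigmoid(z[0:H])               # input gate
    f_t = sigmoid(z[H:2 * H])           # forget gate
    g_t = np.tanh(z[2 * H:3 * H])       # cell candidate
    o_t = sigmoid(z[3 * H:4 * H])       # output gate
    c_t = f_t * c_prev + i_t * g_t      # keep part of the old state, add new content
    h_t = o_t * np.tanh(c_t)            # gated view of the cell state
    return h_t, c_t

# Toy run: D = 3 features per pixel (one Lab triplet); H and T are toy sizes.
rng = np.random.default_rng(0)
D, H, T = 3, 4, 5
W = rng.normal(scale=0.1, size=(4 * H, D))
R = rng.normal(scale=0.1, size=(4 * H, H))
b = np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)         # zero initial states
column = rng.random((T, D))             # stand-in for T pixels of one column
for x_t in column:
    h, c = lstm_step(x_t, h, c, W, R, b)
```

Since $h_t$ is a product of a sigmoid output and a hyperbolic tangent, every entry of the hidden state stays strictly inside (−1, 1).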
As mentioned, the LSTM network processes image columns, where each column is represented as a sequence of vectors of size 1 × 3 over 1024 time steps. Each vector $x_t$, with $t$ ranging from 1 to 1024 (since the input images are of size 1024 × 1024), contains the features corresponding to the pixel at position $t$ in the image column, expressed in the CIE–Lab color space. Consequently, all input sequences have a fixed length of 1024, and each input vector encodes the color information of a single pixel in that column. The LSTM network produces a binary output vector of dimensions 1024 × 1, with each entry indicating whether the corresponding input pixel is classified as vessel or background. The initial hidden and cell states, denoted by $h_0$ and $c_0$, define the starting conditions of the network and can be initialized as zero vectors.
In the LSTM networks, the information is processed sequentially along the column. At each step $t$, the network leverages the information accumulated from previous steps, capturing both short-term and long-term dependencies in the sequence. As the input sequence consists of the complete set of pixels in an image column, and there is no temporal constraint on data availability, we propose using a bidirectional LSTM (Bi-LSTM) structure, which enables the model to learn dependencies in both directions along the sequence, thus improving its capacity to model contextual relationships across the entire column. The output of the Bi-LSTM is computed by combining the outputs of the two LSTM layers through an expression of the following type:

$$y_t = \sigma\left(W_{\overrightarrow{h}}\,\overrightarrow{h}_t + W_{\overleftarrow{h}}\,\overleftarrow{h}_t + b_y\right) \tag{8}$$

where $\overrightarrow{h}_t$ are the outputs of the LSTM cell that processes the column in the forward direction and $\overleftarrow{h}_t$ are the outputs of the LSTM cell that processes the image column in the backward direction. The symbol $\sigma$ stands for the activation function. The matrices $W_{\overrightarrow{h}}$ and $W_{\overleftarrow{h}}$ contain the weights used to combine the outputs of the forward and backward cells in the Bi-LSTM layer, and $b_y$ contains the biases.
Figure 2b provides a graphical representation of the information flow within a bidirectional LSTM (Bi-LSTM) layer.
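A bidirectional pass over one column can be sketched as two independent LSTM runs, one over the column as given and one over its reversal, followed by a per-pixel combination of the two hidden states. The code below is a minimal sketch under assumptions: the helper names, the gate stacking order (input, forget, cell candidate, output), the use of a sigmoid as the combining activation, the toy sizes, and the random weights are hypothetical and not taken from the trained model.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def run_lstm(seq, W, R, b):
    """Run an LSTM over seq (T, D) from zero states; return hidden states (T, H)."""
    H = R.shape[1]
    h, c = np.zeros(H), np.zeros(H)
    outputs = []
    for x_t in seq:
        z = W @ x_t + R @ h + b
        i, f = sigmoid(z[:H]), sigmoid(z[H:2 * H])
        g, o = np.tanh(z[2 * H:3 * H]), sigmoid(z[3 * H:])
        c = f * c + i * g
        h = o * np.tanh(c)
        outputs.append(h)
    return np.stack(outputs)

def bilstm_column(seq, fwd, bwd, V_f, V_b, b_y):
    """Per-pixel vessel probability for one column, shape (T,)."""
    h_fwd = run_lstm(seq, *fwd)
    # Backward pass: feed the reversed column, then flip the outputs back so
    # that row t of h_bwd lines up with pixel t of the original column.
    h_bwd = run_lstm(seq[::-1], *bwd)[::-1]
    # Combine both directions with a linear map followed by a sigmoid.
    return sigmoid(h_fwd @ V_f + h_bwd @ V_b + b_y)

rng = np.random.default_rng(1)
D, H, T = 3, 4, 16  # toy sizes; the paper's columns have T = 1024, D = 3

def init_lstm():
    return (rng.normal(scale=0.1, size=(4 * H, D)),
            rng.normal(scale=0.1, size=(4 * H, H)),
            np.zeros(4 * H))

fwd, bwd = init_lstm(), init_lstm()
V_f, V_b, b_y = rng.normal(size=H), rng.normal(size=H), 0.0
column = rng.random((T, D))
probs = bilstm_column(column, fwd, bwd, V_f, V_b, b_y)
labels = (probs > 0.5).astype(np.uint8)  # 1 = vessel, 0 = background
```

Flipping the backward outputs before combining them is the step that lets each pixel see context from both above and below it in the column.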
2.6. Data Augmentation, Training and Hyperparameter Selection
To train and evaluate the models, the dataset is divided into two subsets: one for model training and one for testing. In all experiments, the first 44 images were used for training, while the remaining 10 were used for testing. This deterministic split allows for a consistent comparison across experiments and different architectures by ensuring that all models are trained and evaluated on exactly the same data. This avoids the variability that could be introduced by random partitioning, which might otherwise affect model training and test results.
2.6.1. Bi-LSTM Training
The network was trained using the ADAM optimizer with a gradient decay factor of 0.9 and a squared gradient decay factor of 0.999. The initial learning rate was set to and remained constant throughout training, as no learning rate schedule was applied. Training was performed for a maximum of 8 epochs with a mini-batch size of 2048, and data was shuffled once at the beginning of training. The L2 regularization coefficient was set to to mitigate overfitting, and gradient clipping was applied using the L2 norm method with a threshold of 1. The training was executed on a CPU environment.
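For readers unfamiliar with how these settings interact, the sketch below shows one parameter update that combines L2 regularization, L2-norm gradient clipping with a threshold of 1, and Adam with decay factors 0.9 and 0.999. It is an illustration under assumptions, not the training code used in the experiments: it uses the standard bias-corrected Adam formulation, clips each parameter array separately, adds the regularization term before clipping, and uses placeholder values for the learning rate, the L2 coefficient, and epsilon. The exact variant and ordering in a given framework may differ.

```python
import numpy as np

def clip_by_l2_norm(grad, threshold=1.0):
    """Rescale grad so that its L2 norm does not exceed threshold."""
    norm = np.linalg.norm(grad)
    return grad * (threshold / norm) if norm > threshold else grad

def adam_step(w, grad, m, v, t, lr, beta1=0.9, beta2=0.999, eps=1e-8, l2=0.0):
    """One Adam update (t starts at 1). eps and the step order are assumptions."""
    grad = grad + l2 * w                      # L2 regularization term
    grad = clip_by_l2_norm(grad, 1.0)         # gradient clipping, threshold 1
    m = beta1 * m + (1.0 - beta1) * grad      # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * grad**2   # second-moment estimate
    m_hat = m / (1.0 - beta1**t)              # bias correction
    v_hat = v / (1.0 - beta2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Toy problem: minimize 0.5 * ||w - target||^2, whose gradient is w - target.
lr = 0.05    # placeholder learning rate, not the value used in the paper
l2 = 1e-4    # placeholder L2 coefficient, not the value used in the paper
target = np.array([1.0, -2.0, 0.5])
w, m, v = np.zeros(3), np.zeros(3), np.zeros(3)
for t in range(1, 201):
    grad = w - target
    w, m, v = adam_step(w, grad, m, v, t, lr=lr, l2=l2)
```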
Figure 4 depicts the training workflow of the proposed approach, which relies on column-based image processing. Once the model is trained, it is applied to the test set. The testing procedure consists of processing the labeled images reserved for testing, as described in Figure 1, and comparing the network outputs with the corresponding target values. This comparison includes both a quantitative evaluation and a qualitative assessment, which together are used to interpret the results and validate the model's performance.
2.6.2. U-Net Training
The U-Net model was trained using the ADAM optimizer. The initial learning rate was set to , and no learning rate scheduling was applied. The optimizer’s hyperparameters included a gradient decay factor of 0.9 and a squared gradient decay factor of 0.999, with a small epsilon value () to ensure numerical stability. L2 regularization was applied with a weight of to reduce overfitting. Training was performed over a maximum of 50 epochs with a mini-batch size of 4. The training data was shuffled at the beginning of each epoch. The training was executed on a CPU environment.
2.6.3. DeepLab v3+ Training
The network was trained using the Stochastic Gradient Descent with Momentum (SGDM) optimizer. The training configuration included a momentum of 0.9 and an initial learning rate of 0.001, with a total of 60 epochs. Mini-batches of size 8 were used, and the data were shuffled at every epoch. L2 regularization was set to to prevent overfitting. The training was executed on a CPU environment.
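The SGDM update rule with the stated momentum (0.9) and initial learning rate (0.001) can be sketched as follows. This is a generic velocity-form formulation given for illustration; the function name is hypothetical, and the L2 coefficient is left at a placeholder default because it is not a value taken from the experiments.

```python
import numpy as np

def sgdm_step(w, grad, velocity, lr=0.001, momentum=0.9, l2=0.0):
    """One SGD-with-momentum update; l2 is a placeholder regularization weight."""
    grad = grad + l2 * w                        # L2 regularization term
    velocity = momentum * velocity - lr * grad  # accumulate a decaying step history
    return w + velocity, velocity

# Two steps with a constant gradient show how momentum enlarges the step.
w = np.array([1.0])
velocity = np.zeros(1)
grad = np.array([2.0])
w1, velocity = sgdm_step(w, grad, velocity)   # step of -0.002
w2, velocity = sgdm_step(w1, grad, velocity)  # step of -0.0038
```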
2.6.4. Vision Transformer Training
All hyperparameters reflect recurrent experiments: class-weighted cross-entropy, ADAMW (, weight decay ) and the deterministic 44/10 RETA–IDRiD split. A single epoch sweeps ≈0.5 M patches—∼12× more gradient steps than one full-image epoch of the Bi-LSTM. During inference, softmax probabilities are stitched back to a single mask and thresholded at 0.5, exactly as for the Bi-LSTM pipeline.
For training, each image in the training set and its corresponding target mask are decomposed into columns, and all the resulting column vectors and target vectors are aggregated to train the model. This operation yields 44 × 1024 = 45,056 labeled instances.
As a data augmentation strategy, given the circular shape of the retinas centered within square images, the most natural approach is to apply image rotations. By simply rotating the images by 90°, 180°, and 270°, the number of labeled instances can be increased by a factor of four without the need for additional image processing. This augmentation helps reduce overfitting by exposing the model to different orientations of the same anatomical structures, which improves its ability to generalize to unseen data. If further (non-right-angle) rotations are introduced, minimal additional processing of both the images and their corresponding target masks will be required to preserve alignment. However, in all the experiments, the data augmentation strategy consisted of applying the same set of rotations, specifically at angles , , , , , , , , , , and . As a result, a total of labeled instances were generated.
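The right-angle case discussed at the start of this paragraph can be sketched with `np.rot90`, which rotates an image and its mask without interpolation and therefore keeps them pixel-aligned. The example covers only the 0°, 90°, 180°, and 270° rotations, not the full set of angles used in the experiments; the function names and the toy image are assumptions made for illustration.

```python
import numpy as np

def rotate_pair(image, mask, k):
    """Rotate an image and its mask together by k * 90 degrees."""
    return np.rot90(image, k, axes=(0, 1)), np.rot90(mask, k, axes=(0, 1))

def right_angle_augment(image, mask):
    """Return the original pair plus its 90, 180 and 270 degree rotations."""
    return [rotate_pair(image, mask, k) for k in range(4)]

# Toy square image; the mask is derived from it so alignment can be checked.
rng = np.random.default_rng(0)
image = rng.random((8, 8, 3))
mask = (image[:, :, 0] > 0.5).astype(np.uint8)
pairs = right_angle_augment(image, mask)

# Column counts for the training pool under right-angle rotations only.
n_images, n_columns = 44, 1024
base_instances = n_images * n_columns      # 45,056 columns without augmentation
augmented_instances = 4 * base_instances   # 180,224 columns with four orientations
```

Because the images are square, every rotated copy still yields 1024 columns of 1024 pixels, so the column count scales linearly with the number of orientations.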
2.7. Performance Evaluation
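In the present case, we compare the pixel classification results of each reconstructed image, after it has been processed column by column, with its corresponding target. To evaluate binary classification performance, the confusion matrix, a 2 × 2 contingency table, is used. In this matrix, pixels that are correctly classified as positive are referred to as true positives (TPs), while negative pixels incorrectly classified as positive are called false positives (FPs). Similarly, pixels that are correctly classified as negative are termed true negatives (TNs), and positive pixels misclassified as negative are known as false negatives (FNs). In addition to evaluating the results using TPs, TNs, FPs, and FNs, we use these values to compute the following performance metrics, given here in their standard definitions:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\text{Recall} = \frac{TP}{TP + FN}$$
$$\text{Specificity} = \frac{TN}{TN + FP}$$
$$\text{Precision} = \frac{TP}{TP + FP}$$
$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
$$\text{FPR} = \frac{FP}{FP + TN}$$
$$\text{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$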
These metrics provide a comprehensive evaluation of binary classification performance, especially in contexts where class imbalance may affect simple accuracy-based assessments. While accuracy gives an overall sense of correctness, it may be misleading when one class dominates. In such cases, recall and specificity become particularly important, as they reflect the model’s ability to correctly identify positive and negative instances, respectively.
Precision complements recall by indicating the reliability of positive predictions, which is essential in applications where false positives carry a significant cost. The F1-score, as the harmonic mean of precision and recall, offers a single metric that balances both aspects and is especially useful when the dataset has an uneven class distribution.
The false-positive rate (FPR) highlights the proportion of negative cases incorrectly labeled as positive and is a key component in Receiver Operating Characteristic (ROC) analysis. Finally, the Matthews Correlation Coefficient (MCC) provides a robust evaluation that considers all four components of the confusion matrix (TPs, TNs, FPs, FNs), making it a particularly valuable metric for assessing model performance on imbalanced datasets.
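The metrics discussed in this section can be computed directly from the four confusion-matrix counts. The sketch below uses their standard definitions; the function names, the convention of returning 0 when a denominator is zero, and the eight-pixel toy example are assumptions made for illustration.

```python
import math

def confusion_counts(pred, target):
    """Count TP, FP, TN, FN for two equal-length binary sequences (1 = vessel)."""
    tp = fp = tn = fn = 0
    for p, t in zip(pred, target):
        if p and t:
            tp += 1
        elif p and not t:
            fp += 1
        elif not p and t:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn

def safe_div(a, b):
    """Return a / b, or 0.0 when the denominator is zero (a convention chosen here)."""
    return a / b if b else 0.0

def metrics(tp, fp, tn, fn):
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": safe_div(tp + tn, tp + fp + tn + fn),
        "recall": recall,
        "specificity": safe_div(tn, tn + fp),
        "precision": precision,
        "f1": safe_div(2 * precision * recall, precision + recall),
        "fpr": safe_div(fp, fp + tn),
        "mcc": safe_div(tp * tn - fp * fn, mcc_den),
    }

# Toy example with eight pixels: TP = 2, FP = 1, TN = 4, FN = 1.
pred = [1, 1, 0, 0, 1, 0, 0, 0]
target = [1, 0, 0, 0, 1, 1, 0, 0]
counts = confusion_counts(pred, target)
scores = metrics(*counts)
```

In this toy case the accuracy is 0.75 while the MCC is 7/15 ≈ 0.47, which illustrates how accuracy can look more favorable than a metric that weighs all four counts.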