A Fully Automated Pipeline for a Robust Conjunctival Hyperemia Estimation

Purpose: Many semi-automated and fully-automated approaches have been proposed in the literature to improve the objectivity of conjunctival hyperemia estimation, based on image processing analysis of eye photographs. The purpose is to improve its evaluation using faster, fully-automated systems that are independent of human subjectivity. Methods: In this work, we introduce a fully-automated analysis of the redness grading scales able to completely automate the clinical procedure, from the acquired image to the redness estimation. In particular, we introduce a neural network model for the conjunctival segmentation followed by an image processing pipeline for the vessels network segmentation. From these steps, we extract some features already known in the literature, whose correlation with the conjunctival redness has already been proved. Lastly, we implement a predictive model for the conjunctival hyperemia using these features. Results: We used a dataset of images acquired during clinical practice. We trained a neural network model for the conjunctival segmentation, obtaining an average accuracy of 0.94 and a corresponding IoU score of 0.88 on a test set of images. The set of features extracted on these ROIs is able to correctly predict the Efron scale values with a Spearman's correlation coefficient of 0.701 on a set of not previously used samples. Conclusions: The robustness of our pipeline supports its possible usage in clinical practice as a viable decision support system for ophthalmologists.


Introduction
The estimation of conjunctival hyperemia is a standard procedure during clinical evaluation in ophthalmology. Conjunctival redness, or hyperemia, is evaluated on the dilation of blood vessels in the conjunctiva area, and it can be symptomatic of different kinds of inflammations or infections. It is commonly investigated for the effects of contact lenses [1,2] and for the so-called dry-eye syndrome [3]. The hyperemia estimation is performed on images of the patient's eye acquired using a slit lamp. The slit lamp, also called slit bio-microscope, is an optical instrument used in ophthalmology for the observation of eye tissues. It allows for visualizing the bulb and the ocular annexes, the corneal layers, the vitreous and the anterior chamber, the crystalline lens and the iris. All of this information can be acquired with no risk for the patient.
Hyperemia is commonly evaluated using qualitative grading scales, where the conjunctival hyperemia is compared to a set of standardized pictures (template images or references). These reference pictures are computer-generated images, paintings, or real photographs. The clinician establishes the score by matching the patient's conjunctiva to the reference that they consider the most similar. This measure is performed for each eye independently.
There are many grading scales proposed in the literature for the hyperemia estimation. Davies (1978), Mandell (1989), and Woods (1989) proposed descriptive grading scales; Koch et al. (1984) and Schnider (1990) proposed art illustrations for the evaluation of single condition grading scales; Annunziato et al. (1992) and Efron (1999) introduced the use of photographic references. The latest proposed reference images were generated using computer graphics (Jenvis, 2009). Each scale is associated with a different set of reference images and each of them is focused on a different aspect of the hyperemia estimation; for instance, the conjunctival redness in the Efron Grading Scales for Contact Lens Complications [1] is expressed through five images depicting grades 1-5, ranging from normal to severe.
A critical issue of these assessments is their reproducibility: the template matching evaluation is an intrinsically operator-dependent procedure and thus it is affected by subjectivity. The reproducibility of the image acquisition and, moreover, of the hyperemia evaluation is determined by the experience of the operating clinician, and it can be affected by several experimental conditions, leading to imprecisions and biases [4,5]. Generally, the automation of the clinical exam supports the clinical laboratory or clinical practice assessment through faster and more reliable evaluation, reducing the reliance on the operator's expertise [6].
Several authors have already proposed semi-automated grading systems for the hyperemia quantification, offering semi-supervised pipelines to process the patient's images [7]. All of these methods facilitate the clinician's operations by providing a series of automated processing steps, which, however, require human intervention (e.g., a manual selection of the region of interest and color values) [3,[8][9][10][11][12][13][14]. At the same time, they introduce new quantitative grading scales that are difficult to relate to the standard clinical assessments. These grading scales can provide a viable medical alternative to standard grading scales, ensuring greater accuracy and different information for the clinician, but they cannot be used as standalone results prior to a lengthy medical certification trial. Thus, despite the efficiency of the automated methods, their practical usage by clinicians remains limited.
The creation of a fully automated pipeline which goes from the acquired image to the hyperemia score prediction is still an open problem [15]. Some authors have already proposed interesting results on this problem. Sánchez Brea et al. [16] implemented a fully automated pipeline for the segmentation of the conjunctiva, but the predictions of their model were obtained only on a small part of the dataset on which there was a reasonably good agreement between experts' evaluations, discarding any source of issues that might instead be encountered during clinical practice. Derakhshani et al. [17] tried to predict the assessment of the vascularity of the conjunctiva using a neural network approach: the images were rescaled to improve the computational efficiency, losing a great part of the information; moreover, the results reported in their work are not easily interpretable, since the final scores are obtained without a subdivision of the samples into train and test sets, alongside a best-model selection procedure on the same data.
All of these methods introduced the usage of modern machine learning and deep learning frameworks for the processing of the features extracted from the original images, after a semi-automated segmentation of the regions of interest (ROIs). Deep learning convolutional neural network (CNN) image segmentation models have shown promising results in medical applications in the last few years. We can find classical or tailored deep learning architectures in many medical research fields, including the ophthalmologic one [18]. Despite the growth in available computational power, which allows for applying ever more complex deep learning models, their usage necessarily requires manual annotation performed by experts.
In this work, we propose a fully-automated pipeline for the processing of slit lamp images and hyperemia quantification. The most informative portion of these images for the hyperemia estimation is given by the vessels' network structure [11], and thus it is crucial to extract this component from the whole image. Our pipeline includes the segmentation of the conjunctival area using a CNN, the segmentation of the vessels network using image processing algorithms, the extraction of a set of features related to the redness, and the prediction of the Efron scale grading value [1] using a Ridge regression model. The full automation of the pipeline allows for including features such as the vessels network structure that would normally be hard to quantify due to the time required by manual segmentation. In this work, we evaluated the performance of this pipeline using a set of features tailored for the Efron scale quantification, as already discussed in the literature.

Patient Selection
The analyzed images had been obtained during routine ophthalmological examinations in the Ophthalmological Unit at IRCCS S. Orsola University Hospital of Bologna. Images were retrieved from charts of subjects who gave their voluntary consent to research. The study was approved by the Local Ethics Committee, and carried out in accordance with the Declaration of Helsinki.
We included 70 patients and, for each of them, images of both eyes were acquired (Grading Dataset, GD). Using a slit lamp, we obtained 2 images for each eye, the first taken from the nasal part of the eye and the second one from the temporal bulbar conjunctiva. In this way, we obtained a full set of 280 images for our analysis. Individuals between 20 and 50 years of age were selected for the current study. The patients were selected from a heterogeneous population, and thus the dataset includes samples with high and low redness levels. A global description of the dataset is shown in Table 1.

Slit Lamp Images
The photos had been taken by two trained clinicians with a digital slit lamp microscope (Topcon SL-D4 slit-lamp biomicroscope). For all the participants, the following protocol was used for photography: 1. We illuminated with a halogen lamp, with a 7.5 KLux illumination input to the slit lamp (both red-free and diffuser filters excluded) and with a wide diffuse illumination of the slit lamp (8 mm circle); 2. We adopted a magnification of 16×; 3. We used a 14 mm diaphragm aperture; 4. We set the sensitivity of the digital image to 100 ISO; 5. We set the acquisition time to 1/80 s. All the photographs were acquired under similar room illumination.
For taking images of the nasal and temporal bulbar conjunctiva, each participant was instructed to look horizontally left and right. Gentle pressure was applied to open the lids in order to ensure that they did not obstruct the conjunctiva during the photography. Photographs were taken without flash and quickly to avoid dry eye and irritation.
The images of the GD were collected during clinical practice by the clinicians without standardized parameters of acquisition (such as brightness): each clinician acquired the image according to their best judgment, as is standard clinical procedure.
All of the images were captured in a raw format, i.e., RGB 8-bit, and saved in JPEG format (2576 × 1934, 150 dpi, 24 bit).

Clinical Scoring of Images
Three trained clinicians independently performed the evaluation of the full set of 280 images. The clinicians scored each image according to the Efron grading scale: although the Efron scale has 5 possible values (from normal, 1, to severe, 5), we allowed the usage of intermediate values in case of doubt. We chose the Efron scale since it is a standard reference in ophthalmology and its automation can encourage the clinical community to use our method.
All of the clinicians scored the images in the same physical space, with the same source of illumination and without time limits. Two computer monitors (HP Z27 UHD 4K, 27-inch, 3840 × 2160 resolution) were used: the first displayed the grading scale images (reference) and the second displayed the clinical images. On both monitors, the same screen color and brightness were used for all three clinical evaluations.
We collected the evaluation of the three trained clinicians on the full set of images, and we kept the median value as ground truth score for the analysis.
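As a minimal illustration of the ground truth definition above, the median of the three independent scores can be computed as follows. The function name and the example scores are hypothetical, not taken from the study data.

```python
from statistics import median


def ground_truth_score(scores):
    """Return the median of the independent clinician scores for one image."""
    return median(scores)


# Hypothetical Efron scores from three graders (intermediate values allowed).
example_scores = [2, 2.5, 3]
print(ground_truth_score(example_scores))
```

With three graders, the median is simply the middle score, so a single outlying evaluation does not shift the ground truth.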

Image Processing Pipeline
Our image processing pipeline is composed of a series of independent and fully automated steps: conjunctiva segmentation, vessels network segmentation, feature extraction, and Efron scale values prediction. The first step of processing involves the segmentation of the conjunctiva area from the background, i.e., skin, eyelids, eyelashes, and caruncle. For each image of the datasets, we collected a manual annotation of the conjunctiva area performed and validated by the 3 experts. The manual annotation set includes a binary mask of the original image in which only the conjunctiva area is highlighted. The conjunctiva segmentation restricts the region of interest of our analysis to the most informative area for the redness evaluation. Several already published automated grading systems for the hyperemia quantification perform this step using semi-supervised methods, leaving the manual selection of the region of interest (ROI) to the final user [3,[8][9][10]. Our method automates this task using a semantic segmentation neural network model trained on a set of images manually annotated and validated by experts.
In the second step, the pipeline performs a second segmentation for the vessels' network identification, starting from the segmented conjunctiva. The vessels network is automatically segmented using a customized version of the NEFI [19] algorithm. The vessels network segmentation provides a mask to apply to the original image.
In step three, the pipeline extracts features from the selected areas, based on quantities proposed in the literature and on different color spaces (RGB and HSV, i.e., Hue, Saturation, and Value). In step four, the extracted set of features is used to feed a penalized regression model for the prediction of the final Efron scale value.

Step 1-Conjunctiva Segmentation
In the literature, there are several deep learning models proposed to automate the segmentation of eye images (e.g., optical coherence tomography [20], sclera segmentation [21][22][23], retinal vessels [24]); nevertheless, we could not find any model focusing on slit lamp images. These images are theoretically easy to segment, since the purpose is to isolate the white-like part of the image (conjunctiva) from the red-like background (skin, eyelids, eyelashes, and caruncle), but in reality the differences between these two regions are not uniform across the different parts of the same image and not as well defined in the color space. An image thresholding [10] or an image quantization [9] could solve this task in the simplest cases, but they cannot handle the extreme variability of the samples provided by a clinical acquisition. There is no rigid standardized procedure for image acquisition by a slit lamp during clinical practice, and the "elaboration" of the images is left to the experience of the clinicians.
In our work, we divided the available image samples into a training set (42 patients, 60% of the full set of images, i.e., 168 images) and a validation set (28 patients, 40% of the full set of images, i.e., 112 images). From the training set, we excluded a subset of 18 images, i.e., 9 patients, as a test set for the evaluation of the model performance over the training procedure. For each image, we collected a manual annotation of the conjunctiva area performed and validated by 3 experts. The manual annotation set includes a binary mask of the original image in which only the conjunctiva area is highlighted; this set of image-mask pairs was used as ground truth for our deep learning model.
The number of available samples does not justify the usage of a complex deep learning architecture with a large number of parameters to tune. During the research exploration, we tried several CNN architectures commonly used in segmentation tasks, from DenseNet CNNs to the lighter U-Net variants [25,26]. The evaluation of model performance has to balance a good performance on the validation set against a greater ability to extrapolate to new possible samples. We would like to stress that, although these requirements are commonly sought in any deep learning application, they are essential for any clinical application, in which the variability of the samples is extremely high. We remark that even high accuracy scores could still include distortions and artifacts of clinical significance: the most important result is the clinician's evaluation, i.e., the visual segmentation accuracy estimated by experts. All of the predicted images were carefully evaluated by the experts of the Ophthalmology research group of the IRCCS S. Orsola University Hospital Ophthalmic Unit, Laboratory for Ocular Surface Analysis of the University of Bologna, and their agreement, jointly with the numerical training performance, led us to choose a Mobile U-Net model with skip connections as the model best able to balance our needs.
We implemented the Mobile U-Net model using the Tensorflow Python library. The model was trained for 100 epochs with an RMSProp optimizer (learning rate of 10^-4 and decay of 0.995). For each epoch, we monitored, on each image of the test set, the accuracy score, i.e., the average pixel-level agreement between the mask produced by the model and the ground truth, and the Intersection-over-Union (IoU) score, i.e., the area of the intersection between the mask produced by the model and the ground truth divided by the area of their union. Since the portion of the image occupied by the conjunctiva area could be small compared to the full image size, the IoU score is a more informative metric for the evaluation of the model performance: unlike the accuracy score, it is robust to class imbalance in the sample. Our training set includes both nasal and temporal images for both of the patient's eyes. This means that we have an intrinsic vertical flip of the training images. Nevertheless, we performed an extensive data augmentation procedure to build a more robust model and to account for the possible variability of the test set. For each image, we performed a vertical and/or a horizontal flip jointly with a random rotation.
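The two monitored metrics can be sketched on binary masks as follows. This is an illustrative pure-Python version operating on nested lists; the function names and the toy masks are assumptions for the example, and it is not the code used during training.

```python
def _pairs(pred, truth):
    """Flatten two equally sized binary masks into (pred, truth) pixel pairs."""
    return [(p, t) for row_p, row_t in zip(pred, truth)
            for p, t in zip(row_p, row_t)]


def pixel_accuracy(pred, truth):
    """Fraction of pixels on which the prediction matches the ground truth."""
    pairs = _pairs(pred, truth)
    return sum(1 for p, t in pairs if p == t) / len(pairs)


def iou(pred, truth):
    """Intersection area divided by union area of the two foreground masks."""
    pairs = _pairs(pred, truth)
    intersection = sum(1 for p, t in pairs if p and t)
    union = sum(1 for p, t in pairs if p or t)
    # Two empty masks agree perfectly by convention.
    return intersection / union if union else 1.0


# Toy example: a small foreground region, half of which is missed.
truth = [[0, 0, 0, 0],
         [0, 1, 1, 0],
         [0, 0, 0, 0]]
pred = [[0, 0, 0, 0],
        [0, 1, 0, 0],
        [0, 0, 0, 0]]
print(pixel_accuracy(pred, truth), iou(pred, truth))
```

In this toy case, the accuracy stays high because the background dominates, while the IoU drops to one half, which is the imbalance effect described above.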

Step 2-Vessels Network Segmentation
The conjunctiva segmentation restricts the region of interest of our analysis to the most informative area for the redness evaluation. The first step of processing involves a standardization of the images and the correction of the light dependence due to uneven illumination of the conjunctiva. In the same image, different parts of the conjunctiva could have different brightness, depending on the angle of the incident light on the conjunctiva and the aperture of the slit lamp diaphragm. This different brightness produces images with a shadow component or with a flash glare. The vessels network in the conjunctiva can thus appear as a brighter or a darker red component, and this can drastically affect the redness evaluation. The images, in fact, tend to appear with a heterogeneous red component when the acquisition is performed in low-light situations. At the same time, the light standardization removes possible shadows or hyper-intense areas due to the effect of the flash over the patient's eyes. This standardization is performed independently on each channel of the image (R, G, and B), and it involves a median blurring of the channels followed by a normalization of the difference between the blurred and the original channel. The resulting standardized image is converted into grayscale before the application of the next steps.
Starting from the standardized masks, the pipeline performs the vessel segmentation. The vessels' network segmentation is a standard task in ophthalmology image processing, and many studies have already published promising results on this topic [3,9,10,19,27,28]. In our work, we decided to use a customized version of the method proposed in [19], using a watershed adaptive filter to highlight the vessels network component and to segment it from the background. After removing the smaller connected components (up to a predetermined size) to suppress noise artifacts, the resulting segmentation is refined using a Guo-Hall skeletonizer [29]. In this way, only the backbone (only 1 pixel wide for each branch) of the vessels network is selected. In the next steps, we use the backbone network as a mask for the extraction of the redness features, focusing our analysis on the smallest but most informative region of the image.
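The per-channel standardization described above can be sketched as follows. This is a minimal pure-Python illustration on a single channel stored as a nested list; the window radius, the sign of the difference (blurred minus original), and the min-max normalization are assumptions of this sketch, since the text does not specify them.

```python
from statistics import median


def median_blur(channel, radius=1):
    """Median filter with a (2*radius+1) square window, clipped at the borders."""
    h, w = len(channel), len(channel[0])
    blurred = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [channel[j][i]
                      for j in range(max(0, y - radius), min(h, y + radius + 1))
                      for i in range(max(0, x - radius), min(w, x + radius + 1))]
            row.append(median(window))
        blurred.append(row)
    return blurred


def standardize_channel(channel, radius=1):
    """Rescale (blurred - original) to [0, 1]; details darker than the local
    background, such as vessels, end up with high values (assumed convention)."""
    blurred = median_blur(channel, radius)
    diff = [[b - c for b, c in zip(row_b, row_c)]
            for row_b, row_c in zip(blurred, channel)]
    lo = min(min(row) for row in diff)
    hi = max(max(row) for row in diff)
    if hi == lo:
        # Perfectly uniform channel: nothing to highlight.
        return [[0.0] * len(row) for row in diff]
    return [[(d - lo) / (hi - lo) for d in row] for row in diff]


# Toy channel: one dark pixel on a bright, uniform background.
channel = [[100, 100, 100],
           [100, 20, 100],
           [100, 100, 100]]
standardized = standardize_channel(channel)
```

Because the median blur estimates the slowly varying background, subtracting it removes smooth illumination gradients while keeping thin structures; on real images the window would need to be much larger than the vessel width.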
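The removal of small connected components mentioned above can be sketched with a breadth-first search over a binary mask. The function name, the 8-connectivity, and the threshold value are illustrative assumptions; the watershed filter and the Guo-Hall skeletonization are not reproduced here.

```python
from collections import deque


def remove_small_components(mask, max_size):
    """Drop 8-connected foreground components with at most max_size pixels."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    cleaned = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if not mask[y][x] or seen[y][x]:
                continue
            # Collect the whole component starting from this seed pixel.
            queue, component = deque([(y, x)]), []
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                component.append((cy, cx))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            # Keep only components larger than the threshold.
            if len(component) > max_size:
                for cy, cx in component:
                    cleaned[cy][cx] = 1
    return cleaned


# Toy mask: a 3-pixel segment plus one isolated noise pixel.
noisy = [[1, 1, 1, 0, 0],
         [0, 0, 0, 0, 0],
         [0, 0, 0, 0, 1]]
cleaned = remove_small_components(noisy, max_size=1)
```

The threshold trades noise suppression against the loss of short vessel fragments, so in practice it would be tuned on the image resolution.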

Step 3-Redness Features
The redness measurements are usually performed on the whole area occupied by the conjunctiva, although the most informative section is given by the vessels network component alone. Therefore, for all the features, only the pixels belonging to the vessels network were included.
Park et al. proposed in [9] a redness measure given by a combination of the RGB channels. The authors extract this score on the manually segmented conjunctival area without any preprocessing step, obtaining promising results on the prediction of the conjunctival redness. The core assumption behind the relation between this feature and the conjunctival redness is the perception of the red color in the conjunctiva area: the mathematical formula tends to emphasize the R intensity using a weighted combination of the three channels. This assumption generally holds for the majority of slit lamp images, but it can suffer from the image light exposure: the RGB channels do not respond in the same way to the image brightness, and this behavior can affect the robustness of the measure.
In our work, we evaluated the same score using the mask provided by the vessels network segmentation: in this way, we can focus our evaluation of the redness only on the most informative area of the conjunctiva, minimizing possible light exposure artifacts. We also improved this measure by applying the preprocessing light standardization. Thus, the first feature extracted is the redness score of [9] (score 1 in the following), computed from the red, green, and blue channels (R, G, B) of the standardized image; the n value represents the number of pixels in the mask produced by the conjunctival segmentation step.
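A redness score of this kind can be sketched as below. The formula used here, the average over the n masked pixels of (2R − G − B) / (2(R + G + B)), is an assumption of this sketch: it is the form commonly attributed to Park et al., but it should be checked against [9] before being relied upon. The function name and the toy pixels are hypothetical.

```python
def redness_score(pixels):
    """Average RGB redness over masked pixels.

    pixels: list of (R, G, B) tuples for the pixels inside the mask.
    ASSUMED formula: mean of (2R - G - B) / (2 * (R + G + B)).
    """
    if not pixels:
        return 0.0
    total = 0.0
    for r, g, b in pixels:
        intensity = r + g + b
        # Black pixels carry no color information and contribute zero.
        if intensity:
            total += (2 * r - g - b) / (2 * intensity)
    return total / len(pixels)


# Toy pixels: one pure red and one neutral gray.
print(redness_score([(255, 0, 0), (100, 100, 100)]))
```

Under this assumed form, a pure red pixel scores 1 and any gray pixel scores 0, which matches the described intent of emphasizing the R intensity relative to the other two channels.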

HSV Redness
A second interesting feature was proposed by Amparo et al. in [8]. The authors suggest moving from the RGB color space to the HSV one, i.e., hue, saturation, and value. This space is more informative than the RGB one, since it explicitly accounts for the different light exposure (saturation). The authors also suggested the usage of a preprocessing step for the slit lamp images, including a white-balance correction to overcome possible light exposure issues. Despite their interesting results, they did not show further results on how their feature varies when the same image is sampled in different light conditions. Their measure uses a combination of saturation and hue for the redness estimation; thus, we can preliminarily expect a behavior more robust to image brightness. Also in this case, we applied the same computation for the extraction of a second score (score 2 in the following) on our standardized image after the vessels network segmentation, computed from H and S, the hue and saturation intensities of the standardized image, respectively.

Fractal Analysis
A third set of measurements was proposed by Schulze et al. in [10], who performed a fractal analysis of the vessels segmented in the conjunctiva. Our technique for the vessels network segmentation is quite different from theirs, but we can still apply fractal analysis to the study of the vessels' topology. The vessels network can be compared to a fractal structure and thus analyzed using standard measurements of complex systems analysis [30][31][32]. Their evaluation can be very informative from a medical point of view, and it fills an open gap in clinical practice. The estimation of the fractal dimension of the vessels network could be informative of multiple pathologies, since it quantifies the neo-genesis and the ramification of the vessels in the area of interest. Their relation to the hyperemia is straightforward, since the greater the area occupied by the vessels, the greater the conjunctival redness. The core assumption in this case is the quality of the segmentation algorithm used for the vessels network extraction: the algorithm should be able to identify the vessels network under any light exposure, otherwise the measures can suffer from false positives or missed detections. In particular, we extracted as putative features the fractal dimension of the vessels network using the box-counting and pixel-counting algorithms. Both algorithms are commonly used for the analysis of fractal structures in images. In our work, we chose a logarithmic set of sizes for the box-counting evaluation and, for each image, the coefficients estimated from the linear interpolation between the box sizes and the counts were used as features (the slope of this line is usually referred to as the fractal dimension).
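The box-counting estimate described above can be sketched as follows: count the boxes touched by the skeleton at several box sizes, then fit a line in log-log space. The function names and the power-of-two sizes are illustrative assumptions, and the sketch does not cover the pixel-counting variant.

```python
import math


def box_count(points, size):
    """Number of size x size boxes containing at least one foreground pixel."""
    return len({(y // size, x // size) for y, x in points})


def box_counting_fit(points, sizes):
    """Least-squares line of log(count) against log(1/size).

    Returns (slope, intercept); the slope estimates the fractal dimension.
    """
    xs = [math.log(1.0 / s) for s in sizes]
    ys = [math.log(box_count(points, s)) for s in sizes]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x


# Sanity checks on shapes with known dimension.
sizes = [1, 2, 4, 8]  # logarithmically spaced box sizes (assumed choice)
line = [(0, x) for x in range(16)]
square = [(y, x) for y in range(16) for x in range(16)]
line_dim, _ = box_counting_fit(line, sizes)
square_dim, _ = box_counting_fit(square, sizes)
```

A straight segment yields a dimension close to 1 and a filled square close to 2; a branching vessel skeleton is expected to fall between these two extremes.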

Color Measures
We also computed the averages of the RGB {µ_R, µ_G, µ_B} and HSV {µ_H, µ_S, µ_V} channels of the standardized conjunctival networks, reaching a total of 10 putative features for the redness estimation. There are intrinsic correlations between these features that need to be taken into consideration for the subsequent analyses.
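The six color averages can be sketched with the standard-library `colorsys` module. The function name, the 8-bit input range, and the plain arithmetic mean of the hue (which ignores its circular nature) are simplifying assumptions of this example.

```python
import colorsys


def channel_means(pixels):
    """Mean of each RGB and HSV channel over the masked pixels.

    pixels: non-empty list of 8-bit (R, G, B) tuples inside the vessels mask.
    All returned means are expressed in the [0, 1] range.
    """
    n = len(pixels)
    rgb = [(r / 255.0, g / 255.0, b / 255.0) for r, g, b in pixels]
    hsv = [colorsys.rgb_to_hsv(r, g, b) for r, g, b in rgb]
    names = ("mu_R", "mu_G", "mu_B", "mu_H", "mu_S", "mu_V")
    # Transpose the pixel lists into per-channel columns.
    columns = list(zip(*rgb)) + list(zip(*hsv))
    return {name: sum(column) / n for name, column in zip(names, columns)}


# Toy pixels: one pure red and one pure white.
means = channel_means([(255, 0, 0), (255, 255, 255)])
print(means)
```

Since saturation and value are derived from the same RGB triplets, these averages are correlated by construction, which is the intrinsic correlation noted above.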

Step 4-Regression Pipeline
The initial step of our regression analysis consists of the standardization of the extracted features. Each feature belongs to a different space/range of values, so, to combine them, we have to rescale all of them into a common range. We rescaled all the features by centering them on their median values and normalizing according to the 1st and 3rd quartiles, i.e., a robust scaling algorithm: in this way, we minimize the dependency on possible outliers. Medians and quartiles were estimated on the training set and then applied to the test set to avoid cross-contamination.
The processed set of features is then used in a penalized regression model. We used a penalized ridge regression (or Tikhonov regularization) for the Efron prediction. Ridge regression is a regularized version of linear regression in which an extra regularization term is added to the cost function, penalizing high values of the regression coefficients. In our simulations, we used a penalization coefficient equal to 2.5.
The full set of data was divided into train and test sets using a shuffled 10-fold cross validation. In each fold, the model was trained on a subset (90%) of the available samples, and its predictions were compared to the ground truth of the corresponding test set (10%).
We want to note that, although the Efron scale admits only integer values (from 1 to 5), 8% of the scores in our dataset are non-integer: the experts introduce intermediate values when there is no exact concordance between the patient image and the template reference images. We also stress that the agreement between multiple experts in the Efron scale evaluation is generally low, since the measure has an intrinsic subjectivity, and this bounds the best possible predictions of a quantitative model.
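The robust scaling step can be sketched for a single feature as follows. The helper names are hypothetical, and dividing by the interquartile range (Q3 − Q1) is an assumption borrowed from common robust scalers; the key point illustrated is that the statistics are fitted on the training values only.

```python
from statistics import median, quantiles


def fit_robust_scaler(train_values):
    """Estimate the median and interquartile range on the training set only."""
    q1, _, q3 = quantiles(train_values, n=4, method="inclusive")
    # Fall back to 1.0 for constant features to avoid dividing by zero.
    return median(train_values), (q3 - q1) or 1.0


def apply_robust_scaler(values, center, scale):
    """Rescale values with statistics estimated elsewhere (no refitting)."""
    return [(v - center) / scale for v in values]


# Toy feature values: statistics come from the train split only.
train = [1, 2, 3, 4, 5]
test = [7]
center, scale = fit_robust_scaler(train)
scaled_train = apply_robust_scaler(train, center, scale)
scaled_test = apply_robust_scaler(test, center, scale)
```

Reusing the training median and quartiles on the test split keeps the test samples from influencing the preprocessing, which is the cross-contamination concern raised above.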
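For illustration, ridge regression admits the closed-form solution w = (XᵀX + αI)⁻¹Xᵀy on centered data. The sketch below implements it in plain Python with the penalization coefficient 2.5 quoted above as the default; the function names, the unpenalized intercept, and the toy data are assumptions, and the study may have relied on a library implementation instead.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ridge_fit(X, y, alpha=2.5):
    """Closed-form ridge fit; the intercept is left unpenalized via centering."""
    n, d = len(X), len(X[0])
    mean_x = [sum(row[j] for row in X) / n for j in range(d)]
    mean_y = sum(y) / n
    Xc = [[row[j] - mean_x[j] for j in range(d)] for row in X]
    yc = [v - mean_y for v in y]
    # Normal equations with the penalty added on the diagonal.
    gram = [[sum(r[i] * r[j] for r in Xc) + (alpha if i == j else 0.0)
             for j in range(d)] for i in range(d)]
    rhs = [sum(r[i] * t for r, t in zip(Xc, yc)) for i in range(d)]
    weights = solve_linear_system(gram, rhs)
    intercept = mean_y - sum(w * m for w, m in zip(weights, mean_x))
    return weights, intercept


# Toy data following y = 2x + 1: the penalty shrinks the slope below 2.
X = [[0.0], [1.0], [2.0], [3.0], [4.0]]
y = [1.0, 3.0, 5.0, 7.0, 9.0]
weights, intercept = ridge_fit(X, y)
```

The shrinkage of the fitted slope relative to the unpenalized value shows how the penalty trades a small bias for stability when features are correlated.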
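A shuffled 10-fold split of the sample indices can be sketched as below. The function name, the seed, and the sample count in the example are hypothetical; the sketch only illustrates that every sample is used for testing exactly once and never appears in the train and test parts of the same fold.

```python
import random


def shuffled_kfold(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for a shuffled k-fold split."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Strided slices give k disjoint folds of (almost) equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test


# Hypothetical number of samples, used only for the example.
splits = list(shuffled_kfold(50, k=10, seed=42))
```

Changing the seed produces a different partition, which is how repeated cross validations with different training sets can be generated.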

Conjunctival Segmentation
The results obtained by our training on the test images are shown in Figure 2a. The masks obtained by the conjunctival segmentation can be applied to the original images to select only the region of interest (ROI) for the following analysis. The results shown in Figure 2b highlight the efficiency of the model in the conjunctiva detection and segmentation. In 100 epochs of training, the model reaches an average accuracy of 0.94 and a corresponding IoU score of 0.88 on the test set.
The visualization of the results confirms the quality of the training, showing a good agreement with the ground truth masks segmented by the experts. Beyond the performance scores, the visualization of the produced masks also allows an empirical interpretation of what the model has learned during the training: in many cases, the ground truth produced by the experts is "rough", and it does not account for the superposition of the eyelashes on the conjunctival surface. The manual segmentation performed by the experts was also not pixel-perfect, being a polygon composed of straight segments, and thus the model could not reach a perfect learning (nor would this be desirable in terms of model generalization). The developed model is able to better discriminate between the eyelashes and the conjunctiva, showing more accurate masks.
Figure 2. (a) Accuracy and IoU scores during the 100 epochs of training. The IoU score is more informative given that, unlike the accuracy score, it is robust to imbalance in the sample; (b) example of the results obtained on a test image. Top-left: the ground truth. Top-right: the predicted segmentation. Bottom-left: the raw (input) image. Bottom-right: the resulting ROI of the conjunctival area.

Vessels Network Segmentation
An example of this processing is provided in Figure 3. In this case, the red component of the image tends to be uniform along the entire conjunctiva area, making it difficult to visualize the network of vessels. The image standardization algorithm allows for better discriminating the vessels from the background, removing the false positive redness component from the conjunctiva. We use the result of this processing step both for the vessels network extraction and for the subsequent feature evaluation, since the color ranges produced by this step are more robust to different image exposures. A preliminary analysis of the results obtained by this approach confirms its efficiency: this processing, applied to the standardized image as discussed above, is powerful enough to extract the most informative vessels from the image. An example of the results obtained by our preprocessing is shown in Figure 3f.
We performed our analyses on a server grade machine (64 GB RAM memory and 1 CPU i9-9900K, with eight cores) and the proposed pipeline, from the conjunctiva segmentation to the vessels network extraction, took less than 2 min per image.

Efron Scale Prediction
We processed the dataset of 280 samples using our automated pipeline, extracting the full set of features. In this dataset, we have two images for each eye (nasal and temporal); thus, we use the average of each feature extracted on the two samples. The full set of 10 features was used for the ridge regression model, estimating the correlation between the ground truth Efron values and the predicted ones. We trained the regression model using a 10-fold cross validation: in this way, we can ensure that the model predicts the outcome on a completely novel set of samples. We tuned the pipeline parameters using a grid search algorithm to better fit the available data. The best model found is able to predict the correct Efron scale values with a Spearman's rank correlation coefficient of 0.701 and a corresponding p-value of 10^-22 (ref. Figure 4a). An example of the prediction obtained on the test set is shown in Figure 5.
We applied the same pipeline for 100 different cross validations to test the robustness of our model. In each iteration, a different 10-fold cross validation is provided to the ridge regression: in this way, we test the sensitivity of the model to different training sets. The resulting distribution of Spearman's rank correlation coefficient (ref. Figure 4b) shows the robustness of the developed model: the estimated coefficients are centered around a value of 0.69 with a spread of 0.008.
The developed model uses a combination of the extracted features to find the best parameters for the regression model. Beyond the resulting overall performance, we also evaluated the informative power of each feature independently. We applied the same cross validation procedure using each feature individually: in this way, we can assign to each of them its informative power, given by the associated Spearman's correlation coefficient. As expected, not all the features are equally informative: only three of them are associated with a Spearman's correlation coefficient greater than 0.55 (in absolute value). We found that the most informative feature is score 1 (ref. Table 2), with a correlation coefficient of 0.64, immediately followed by the average saturation and the average blue channel, with 0.61 and −0.56, respectively. Thus, the results obtained by our predictive model are related to the informative power of these three features.
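Spearman's rank correlation, used above to compare predicted and ground truth Efron values, is the Pearson correlation of the ranks. The sketch below uses average ranks for ties, which matter here because the Efron scores take few distinct values; the function names and the toy scores are illustrative, and the p-value computation is not covered.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def spearman(a, b):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(a), average_ranks(b))


# Hypothetical ground truth and predicted scores.
truth_scores = [1, 2, 2, 3, 4]
predicted_scores = [1.2, 1.9, 2.4, 2.8, 3.9]
print(spearman(truth_scores, predicted_scores))
```

Being rank-based, the coefficient rewards predictions that preserve the ordering of the grades even when their absolute values are offset, which suits an ordinal scale such as Efron's.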

Discussion
The results obtained on the Efron scale predictions highlight a statistical agreement between only a small set of extracted features and the grading scores.In particular, we show the efficiency of some RGB measurements which have statistical significant correlations with the Efron grading scale values.Their efficiency is justified by their robustness to the most common issues in the image acquisition given by the light exposition and aperture of the slit lamp diaphragm.
The best correlation performance is achieved by the measure proposed by Park et al. (score 1): this measure is already known in the literature as a feature related to redness, as is the blue channel (the blue channel retains the most informative component for discriminating between the white-like conjunctival area and the red-like vessels network). It is, however, interesting to note that the saturation of the segmented vessels network area shows a non-negligible correlation with the redness measure of the Efron scale. Moreover, the correlation between the image saturation and the Efron values is positive, implying that the more the image is exposed to light, and thus the brighter its red color, the higher the clinicians' evaluation tends to be. The quality of the image certainly affects the evaluation of the clinicians: in a low-contrast image, the evaluation of the vessels is harder for the human eye.
The analysis of the single-feature correlations also explains the apparently low efficiency of score 2: although the HSV color space is theoretically more robust than the RGB one, and although saturation correlates strongly with the Efron values, the hue channel (µ_H) tends to anti-correlate with the Efron scale. The combination of the HSV channels proposed by Amparo et al. does not account for this opposite behavior, and thus their score is penalized relative to the others.
The lowest predictive power is achieved by the fractal features: the fractal analysis of the vessels network seems to be unrelated to the Efron values, despite its theoretical informative power in describing the neo-genesis and ramification of the vessels. The fractal measures, which are by definition independent of the extension of the vessels network, can be seen as informative of the dynamics of the neo-genesis process. Therefore, this lack of correlation seems to imply that the vessel growth is not different from that of the physiological condition.
Methods for estimating conjunctival redness are usually assessed by evaluating their agreement with the estimates of a panel of experts. The same set of samples is shown to a pool of (at least two) experts, who perform the grading under the same experimental conditions. In this way, one can double check the obtained results, comparing the efficiency of the model against the expert evaluation and using the agreement between the experts as the maximal obtainable correlation: the rationale is that an automated method cannot agree with the opinion of any single expert more than two experts agree with each other. In our work, each image was independently evaluated by three experts. It is important to note that, although our automated method does not reach a perfect prediction, the internal agreement between the experts' evaluations is at most 0.84 (evaluated as Spearman's correlation coefficient). Therefore, an automated pipeline trained on this dataset could not achieve a predictive power greater than this value, and any higher result should be regarded as overfitting.
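The agreement ceiling discussed above can be computed as the largest pairwise Spearman's correlation among the raters. The sketch below is only illustrative: the three rating vectors are invented for the example and are not the gradings collected in this work, and the names `average_ranks`, `spearman`, `ratings`, and `ceiling` are hypothetical.

```python
from itertools import combinations

import numpy as np

def average_ranks(x):
    """1-based ranks; tied grades share the mean of their rank range."""
    _, inverse, counts = np.unique(x, return_inverse=True, return_counts=True)
    ends = np.cumsum(counts)
    starts = ends - counts + 1
    return ((starts + ends) / 2.0)[inverse]

def spearman(a, b):
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    return float(np.corrcoef(average_ranks(a), average_ranks(b))[0, 1])

# Invented Efron-like grades (0-4) given by three raters to ten images.
ratings = {
    "expert_A": [0, 1, 1, 2, 3, 4, 2, 1, 3, 0],
    "expert_B": [0, 1, 2, 2, 3, 3, 2, 0, 4, 1],
    "expert_C": [1, 1, 1, 3, 2, 4, 2, 1, 3, 0],
}

# Agreement of every pair of raters.
pairwise = {
    (a, b): spearman(ratings[a], ratings[b])
    for a, b in combinations(ratings, 2)
}

# The best pairwise agreement is the reference ceiling for any model.
ceiling = max(pairwise.values())
for pair, rho in pairwise.items():
    print(pair, round(rho, 3))
print("ceiling:", round(ceiling, 3))
```

Because the grades are integers on a short scale, ties are frequent, which is why the ranks are averaged within tied groups before computing the correlation.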
Our results also highlight a non-negligible correlation between the saturation of the image and the Efron scale values. It is not surprising that a brighter image is more easily analyzed by the human eye, but the correlation between the brightness of the image color and the Efron scale values leads us to hypothesize a possible critical issue in the clinical evaluation, which could lead the clinicians to overestimate the hyperemia severity. The proposed pipeline tries to overcome the problems related to image exposure using the standardization algorithm described above, but further analysis will be required to completely remove the light dependence from the feature extraction procedure.
The robustness of our pipeline on a set of images acquired with a non-rigid acquisition protocol confirms its possible usage in clinical practice as a viable decision support system for ophthalmologists.

Figure 1. Schematic representation of the pipeline. (Step 1) The image acquired by the slit lamp is used as input for the Neural Network (Mobile U-Net) model for the segmentation of the conjunctival area. The model was trained using manually annotated images validated by experts. (Step 2(a)) Focusing on the conjunctival area, the image is standardized using a brightness-correction algorithm given by a combination of median filters and background subtraction. (Step 2(b)) The image standardization helps us to remove possible artifacts and color issues related to the non-rigid image acquisition procedure. The processed image is used as input for the vessels network segmentation algorithm, given by a tuned implementation of the NEFI algorithm. (Steps 3, 4) Starting from the vessels network image, a set of features for the quantification of the conjunctival redness is extracted and used for the development of a predictive model. The full set of steps is performed automatically, without any human intervention.

Figure 2. Results obtained by the trained Mobile U-Net model on the 18 test images. (a) Evolution of the average accuracy and average IoU along the training epochs (100). The IoU score is more informative because, unlike the accuracy score, it is robust to class imbalance in the sample; (b) example of the results obtained on a test image. Top-left: the ground truth. Top-right: the predicted segmentation. Bottom-left: the raw (input) image. Bottom-right: the resulting ROI of the conjunctival area.
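The remark on accuracy versus IoU can be illustrated with a small synthetic case. In this sketch (not taken from the dataset; the mask size, the 5% foreground fraction, and the function names `pixel_accuracy` and `iou` are assumptions made for the example), a prediction that misses the foreground entirely still obtains a high pixel accuracy, while its IoU is zero.

```python
import numpy as np

def pixel_accuracy(gt, pred):
    """Fraction of pixels whose predicted label matches the ground truth."""
    return float(np.mean(gt == pred))

def iou(gt, pred):
    """Intersection over Union of the foreground class."""
    intersection = np.logical_and(gt, pred).sum()
    union = np.logical_or(gt, pred).sum()
    # Two empty masks are treated as a perfect match.
    return float(intersection / union) if union > 0 else 1.0

# Synthetic ground truth: 500 foreground pixels out of 10000 (5%).
gt = np.zeros((100, 100), dtype=bool)
gt[:10, :50] = True

# Degenerate prediction: everything is labeled as background.
empty_pred = np.zeros_like(gt)

print(pixel_accuracy(gt, empty_pred))  # 0.95
print(iou(gt, empty_pred))             # 0.0
```

The accuracy is dominated by the background pixels, whereas the IoU only rewards overlap on the class of interest.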

Figure 3. Comparison between the raw image and the result of our standardization algorithm. (a,c) The original image and the corresponding enlarged area. In this case, the image acquisition was performed with an incorrect light exposure, which leads to an unbalanced redness component along the entire conjunctival area. The enlarged section shows that the vessel color tends to be very close to the background, making the evaluation of the vessels difficult and creating a false positive redness component in the conjunctiva. (b,d) The standardized image and the corresponding enlarged area. The standardization algorithm allows for better discrimination of the vessels network component from the red-like background. In the enlarged portion, we can see how the algorithm is able to separate the vessels from the uniform background, despite the latter having a non-negligible redness component; (e) the vessels network segmentation estimated on the raw image; (f) the vessels network segmentation estimated on the standardized image. The standardization processing removes the artifacts (false positive branches) related to the red-like background component.

Figure 4. Results of the regression model for the prediction of the Efron scale values, developed starting from the set of extracted features. The correlation between the ground truth and the predicted values is estimated using the Spearman's rank correlation coefficient (ref. plot legend); (a) results of a single cross validation of the model. The dashed line marks the axes bisector, which corresponds to a perfect prediction. Despite the significance of the correlation found, the model struggles with low Efron scale values. We note that the predictions are performed on a set of data completely independent of the training set; (b) results obtained by the same pipeline on 100 different cross validations. In each iteration, a different 10-fold cross validation was applied for the estimation of the Spearman's rank correlation coefficients. The average Spearman's rank coefficient found is 0.69 with a standard deviation of 0.008.

Figure 5. Example of the predictions obtained by the regression model on three samples.For each image, we report the assigned Efron score and our method's predicted score.We would like to stress that the predicted scores are floating point numbers and therefore they are more descriptive for the redness evaluation, ensuring a finer redness grading scale of values.

Table 1. Descriptive statistics of the analyzed samples in the dataset. Age and Efron grading values are

Table 2. Spearman's correlation coefficient scores of single features in relation to the Efron grading scale values, ranked from the highest to the lowest. The analysis of the single feature correlation highlights an unbalanced informative power in the prediction of the Efron scale values. Only three features have a correlation coefficient greater than 0.55 (in absolute value).