Evaluation of Erythema Severity in Dermatoscopic Images of Canine Skin: Erythema Index Assessment and Image Sampling Reliability

The regular monitoring of erythema, one of the most important skin lesions in atopic (allergic) dogs, is essential for successful anti-allergic therapy. Smartphone-based dermatoscopy offers a convenient way to acquire high-quality images of erythematous skin. However, the image sampling needed to evaluate erythema severity is still done manually, which introduces variability into the results. In this study, we investigated the correlation between the most popular erythema indices (EIs) and dermatologists’ erythema perception, and we measured the intra- and inter-rater variability of the currently used manual image-sampling methods (ISMs). We showed that EI_BRG, based on all three RGB (red, green, and blue) channels, performed the best, with an average Spearman coefficient of 0.75 and a typical absolute disagreement of less than 14% with the erythema assessed by clinicians. Two image-sampling methods, based on selecting either specific pixels or small skin areas, performed similarly well. They achieved high intra- and inter-rater reliability, with the intraclass correlation coefficient (ICC) and Krippendorff’s alpha well above 0.90. These results indicate that smartphone-based dermatoscopy could be a convenient and precise way to evaluate skin erythema severity. However, better-outlined or even automated ISMs are likely to improve the intra- and inter-rater reliability in severely erythematous cases.


Introduction
Canine atopic dermatitis (AD) is a chronic allergic and inflammatory skin disease with characteristic clinical features [1]. It is one of the most common skin diseases in dogs, with a prevalence of 3-15% [2]. Environmental and food allergens trigger the allergic reaction, which manifests as pruritus (i.e., itch) and skin lesions that include erythema (redness), hyperpigmentation (increased pigmentation), and excoriations (scratched lesions) (Figure 1). In most dogs, AD is a lifelong condition that requires long-term management, including the administration of antipruritic and anti-inflammatory drugs, allergen immunotherapy, and good hygiene of the coat and skin [3]. Since the treatment response is highly individual, precise tracking of the evolution of clinical signs is crucial for selecting the proper anti-allergic therapy. A few disease severity scales have been developed to grade the clinical signs of canine AD. First, pruritus is estimated by the owner with a 10-point Pruritus Visual Analog Scale (PVAS) [4]. Skin lesions are most often evaluated with the fourth iteration of the Canine Atopic Dermatitis Extent and Severity Index (CADESI4) [1]. The CADESI4 is based on the grading of different lesion types at 20 locations, leading to 60 assessments across several AD-related body sites; its execution takes approximately 4 min. Since the CADESI4 is not very sensitive to short-term changes of chronic skin lesions, such as hair loss or increased skin thickness, a derivative of the scale based only on erythema evaluation has been proposed [5].
Various custom-made or commercial imaging or spectroscopic devices (e.g., Mexameter MX 18, Courage-Khazaka Electronic, Köln, Germany) have been used to overcome the subjectivity of erythema evaluation [6,14-17] by calculating an erythema index (EI), which is a ratio between those spectral or imaging components that correlate well with skin redness. A quotient between the red and green colors is the most common, since the skin pigment melanin has a smaller impact on these two channels. In contrast, some authors demonstrated the benefits of an added blue channel [18-20]. In general, most studies [6,19,21-24] presented a high correlation between the visual and optical erythema estimations, with correlation coefficients from 0.69 to 0.91. Furthermore, there was a good to excellent agreement between various optical devices, with a Pearson's correlation coefficient (r_p) of 0.76-0.81 [14] and a coefficient of determination (r^2) between 0.82 and 0.99 [25,26]. On the other hand, two facial devices for skin analysis (VISIA, Canfield, Parsippany, NJ, USA and CSKIN, source unknown) achieved a poor correlation with visual scores or between themselves (r_p = 0.21-0.49) [27].
Discrepancies among devices probably appear due to manual skin sampling. Many spectroscopic systems with single-point probes average the erythema intensity in a small area, ranging between 0.2 and 0.5 cm² for commercial devices like the Mexameter MX 18 (Courage-Khazaka Electronic, Köln, Germany), DermaSpectrometer (Cortex Technology, Hadsund, Denmark), and Chromameter CR 200 (Minolta, Osaka, Japan) [14]. Most imaging systems have a large sampling area (e.g., 3.0 and 7.1 cm² for the dermatoscope DermLite DL1 (3Gen, San Juan Capistrano, CA, USA) and the custom-made Skimager, respectively [19,21]), which can produce faulty EI readings due to the inclusion of hair and pigment. Therefore, only a suitable skin area should be selected from the acquired images. Currently, specialists sample skin images manually by selecting specific pixels or small areas [17,19,21]. There are also a few semi-automated approaches [6,28], in which the user selects erythematous and native skin areas, and the representative erythematous skin is then determined by a redness gradient- or fuzzy entropy-based algorithm.
To date, erythema indices (EIs) and manual image-sampling methods (ISMs), which can significantly impact erythema estimation, have not been studied and compared thoroughly. With this study, we wanted to demonstrate that smartphone-based dermatoscopy, relying on certain EIs and ISMs, could be a convenient and reliable method for evaluating the severity of skin erythema in dogs with AD. Therefore, we first investigated the correlation between the most common EIs, including the a* dimension of the CIELAB color space, and visual erythema scores. Secondly, we applied three different ISMs to erythematous skin images, which served to estimate the intra- and inter-rater variability of the proposed optical system.

Materials and Methods
The Latvian Food and Veterinary Service approved this study under the reference number 1.1-13E/20/865. We enrolled 43 purebred or crossbred client-owned dogs that were presented to the dermatology service with a diagnosis of AD during a three-week period. The average age was 6.8 years (range, 0.4-18.5 years). The most common breeds were American Staffordshire terrier (n = 5); Shih Tzu and Boston terrier (n = 3 each); and Labrador retriever, pointer, West Highland white terrier, English bulldog, and French bulldog (n = 2 each). We evaluated erythema in the inguinal region, which had to exhibit a low amount of hair and no secondary lesions (e.g., lichenification, excoriation, and hyperpigmentation). On the day of measurement, we made sure that the measurement site had not been washed or treated (e.g., with lotions or shampoos).
Two different erythema evaluations were made: (1) Visual; according to the continuous erythema scale (Figure 2). Three dermatology residents were involved in marking a spot on the scale corresponding to the severity of skin erythema; however, for each patient, only the resident who was the patient's clinician performed the assessment. (2) Optical; based on the smartphone-based dermatoscopic images.
After the acquisition, RGB (red, green, blue color space) images were normalized against the white standard. Three new raters (veterinarians, different from the dermatologists in the visual erythema evaluations) performed three manual image-sampling methods (ISMs) in order to select a representative portion of erythematous skin without depigmented or pigmented spots and hair (Figure 4). Each rater executed the image sampling twice, with at least one month between the original and the repeated sampling. The first ISM consisted of manually selecting 60 representative pixels (PT). In the other two ISMs, pixels from two small squares (SQ2) or one large square (SQ1) were considered.
Typical blue (B), green (G), and red (R) values were calculated as the average over all the selected pixels. From these typical values, four EIs were estimated: EI_RG, EI_GR, EI_BRG, and EI_BG. The pixels' RGB values were additionally used to calculate the dimension a* of the CIELAB color space according to a regression model whose coefficients a_0 to a_3 were retrieved from the calibration procedure on all patches of the ColorChecker Classic [29].
First, and separately for each EI and ISM, we evaluated the relationship between visual and optical erythema estimation by calculating Spearman's rank correlation coefficient (r_s) and the residuals of linear regression. Based on the six ISM executions (three raters, two executions each), we estimated the means and standard deviations, which served to select the best-performing EI and ISM.
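The sampling arithmetic described above can be illustrated with a minimal Python sketch; it is not the authors' implementation. It normalizes pixels against a white standard, averages R, G, and B over the sampled pixels, and forms a red-to-green quotient. The function names, the per-channel division used for the white normalization, and the plain R/G quotient are assumptions made for illustration only; the study's exact EI formulas are not reproduced here.

```python
from statistics import mean


def normalize_to_white(pixel, white):
    """Scale one (R, G, B) pixel by the white standard, channel by channel.

    Assumption: normalization is a simple per-channel division.
    """
    return tuple(value / reference for value, reference in zip(pixel, white))


def square_pixels(image, top, left, size):
    """Collect all pixels of a size x size square (SQ1- or SQ2-style sampling)."""
    return [image[row][col]
            for row in range(top, top + size)
            for col in range(left, left + size)]


def typical_rgb(pixels):
    """Average R, G, and B over all sampled pixels."""
    reds, greens, blues = zip(*pixels)
    return mean(reds), mean(greens), mean(blues)


def red_green_quotient(rgb):
    """Hypothetical erythema index: quotient of the red and green averages."""
    red, green, _ = rgb
    return red / green


# Toy 4 x 4 image of uniform reddish skin and a hypothetical white standard.
white = (250.0, 250.0, 250.0)
raw = [[(200.0, 100.0, 50.0)] * 4 for _ in range(4)]
image = [[normalize_to_white(pixel, white) for pixel in row] for row in raw]

# SQ2-style sampling: pool the pixels of two small squares.
sample = square_pixels(image, 0, 0, 2) + square_pixels(image, 2, 2, 2)
rgb = typical_rgb(sample)
index = red_green_quotient(rgb)
```

A PT-style sample would simply be a list of 60 individually chosen pixels passed to `typical_rgb` in place of the pooled squares.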
Finally, we studied the intra- and inter-rater reliability among all three raters (veterinarians) by calculating Spearman's rank correlation coefficient (r_s), the intraclass correlation coefficient ICC(2,k) (two-way random effects, absolute agreement, multiple raters [30]), and Krippendorff's alpha (α_k). Additionally, we studied the absolute intra- and inter-rater agreements for EI_BRG and a*.
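For readers unfamiliar with ICC(2,k), the following sketch shows how the coefficient can be computed from the mean squares of a two-way layout (subjects in rows, raters in columns). It uses the standard two-way random-effects, absolute-agreement, average-measures formula; the function name and the toy ratings are hypothetical, and the sketch is not the software used in the study.

```python
def icc_2k(ratings):
    """ICC(2,k): two-way random effects, absolute agreement, mean of k raters.

    ratings: list of rows, one row per subject and one column per rater.
    """
    n = len(ratings)          # number of subjects (images)
    k = len(ratings[0])       # number of raters (or repetitions)
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Sums of squares of the two-way ANOVA decomposition.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in ratings for v in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (ms_rows + (ms_cols - ms_error) / n)


# Hypothetical EI readings: three raters in perfect agreement ...
identical = [[1.2, 1.2, 1.2], [1.5, 1.5, 1.5], [1.9, 1.9, 1.9]]
# ... and two raters whose readings differ by a constant offset.
shifted = [[1.0, 2.0], [2.0, 3.0], [3.0, 4.0]]
```

Because the coefficient measures absolute agreement, a constant offset between raters lowers it even when the ranking of the images is identical, which is why it complements Spearman's r_s.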

Results
We optically and visually estimated the erythema severity in 43 dogs. We excluded two measurements from further analysis due to extreme skin thinness. For most of the EIs and ISMs, we found a strong correlation between optical and visual assessments (Table 1). As shown in our preliminary study [21], EI_BRG achieved the best performance, with an average Spearman's correlation coefficient (r_s) of 0.74. Due to the decreasing, negative value of the numerator in Equation (2), EI_GR exhibited a negative correlation. The single best and worst correlations between optical and visual erythema assessment resulted in an r_s of 0.83 (rater 1, based on EI_BRG and SQ2) and 0.55 (rater 2, EI_BG, SQ1), respectively. Selecting specific representative image pixels (method PT, Figure 4) turned out to be the best-performing ISM, with a mean r_s of 0.71. The SQ2 method, which is based on two small squares, exhibited a slightly lower correlation strength but a faster mean execution (3.7 ± 0.3 vs. 31.2 ± 5.9 s).
Table 1. Mean and standard deviation (SD) of the Spearman's correlation coefficients (r_s) between visual and optical erythema evaluation, relying on different erythema indices (EIs: EI_RG, EI_GR, EI_BRG, EI_BG, and a*) and image-sampling methods (ISMs: PT, 60 pixels; SQ2, two small squares; SQ1, one big square).
EI_BRG and the PT ISM also produced the smallest mean residuals (i.e., errors) in the linear regression analysis between visual and optical erythema evaluation (Table 2). The single best and worst models resulted in a mean fitting error of 10.3% (rater 1, EI RG , PT) and 14.0% (rater 2, EI_BG, SQ1), respectively. For the aforementioned best model, the residual values ranged between −29.3% and 31.3%, with a standard deviation of 13.2%.
Table 2. Mean and SD of the residuals (absolute values, in %) of linear regression models between visual and optical erythema evaluation, relying on different erythema indices (EIs: EI_RG, EI_GR, EI_BRG, EI_BG, and a*) and image-sampling methods (ISMs: PT, 60 pixels; SQ2, two small squares; SQ1, one big square).

Mean: 11.9 ± 0.8 (EI_RG), 12.0 ± 0.6 (EI_GR), 11.7 ± 0.6 (EI_BRG), 12.7 ± 0.7 (EI_BG), 12.3 ± 0.4 (a*)

The agreement between raters when applying ISMs was strong, since the mean values of all the studied parameters (Spearman's rank correlation r_s, ICC, and Krippendorff's alpha α_k) were above 0.90 (Table 3). The best and worst ISM applications resulted in an α_k of 0.99 (rater 2, a*, PT) and 0.91 (rater 2, EI_BG, SQ1) for intra-rater reliability, and 0.98 (raters 1-2, a*, PT) and 0.70 (raters 1-3, EI_RG, PT) for inter-rater reliability, respectively. When investigating individual ISMs, PT and SQ2 appeared to be more reliable for a single rater (Table 4). On the other hand, both of these ISMs exhibited a higher inter-rater variability than SQ1.
Table 3. Mean and SD of the intra- and inter-rater reliability coefficients (Spearman's r_s, intra-class correlation (ICC), and α_k). Absolute agreement (∆) is listed for two EIs: EI_BRG and a* of the CIELAB color space.
Although the mean intra- and inter-rater misestimates in EI_BRG were small (i.e., up to 0.05; Table 5), further analysis revealed that the differences increased with the severity of erythema (Table 5, Figure 5). Evaluating native or mildly erythematous skin by a single rater resulted in minor EI_BRG misestimates of up to 0.07 (Figure 5a). On the other hand, the maximal disagreement between multiple raters was −0.62 (Figure 5b), representing an error between ~30 and 60%. This phenomenon occurred on skin with severe and patchy erythema (Figure 5c), where the raters weighted severely and mildly erythematous skin differently when sampling.
The study of intra-rater disagreement in a* revealed that most of the differences were below the so-called just-noticeable difference (JND, the range of 2.3-5.0), which corresponds to the human eye's capability to spot a difference between two colors (Figure 6a). Similarly, the human eye would not have detected most of the misestimates between raters (Figure 6b). However, 4% of the ratings resulted in an a* difference larger than 5.0.
Figure 6. A difference (Bland-Altman) plot of a* (CIELAB color space) misestimates between (a) a single rater (intra-rater) and (b) multiple raters (inter-rater absolute agreement). The full and two dashed lines represent the mean difference and the 95% limits of agreement (i.e., mean ± 1.96 × SD), respectively. Gradients of gray mark the just-noticeable difference (JND, the range of 2.3-5.0), which corresponds to the human eye's capability to spot a difference between two colors.
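The quantities behind a difference (Bland-Altman) plot can be sketched as follows. This is an illustrative computation only: the a* values are hypothetical, the function names are not from the study, and the use of the sample standard deviation is an assumption.

```python
from statistics import mean, stdev


def limits_of_agreement(first, second):
    """Return the mean difference and the 95% limits of agreement.

    Assumption: the limits are mean ± 1.96 × sample SD of the differences.
    """
    differences = [a - b for a, b in zip(first, second)]
    bias = mean(differences)
    half_width = 1.96 * stdev(differences)
    return bias, bias - half_width, bias + half_width


def share_above(first, second, threshold):
    """Fraction of paired readings whose absolute difference exceeds threshold."""
    differences = [abs(a - b) for a, b in zip(first, second)]
    return sum(d > threshold for d in differences) / len(differences)


# Hypothetical a* readings of the same five images in two sampling sessions.
session_1 = [10.0, 12.0, 14.0, 16.0, 20.0]
session_2 = [9.0, 13.0, 13.0, 17.0, 14.0]

bias, lower, upper = limits_of_agreement(session_1, session_2)
# Share of differences beyond the upper JND bound of 5.0 quoted in the text.
noticeable = share_above(session_1, session_2, 5.0)
```

In this toy example, one of the five pairs differs by more than 5.0, so `noticeable` is 0.2; the study reports the corresponding share for its own data as 4%.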

Discussion
This study, which focused on the feasibility of smartphone-based dermatoscopy, is one of the first of its kind in veterinary dermatology. Our results confirmed the findings of our preliminary study [21] that the proposed system can provide an objective and reliable method for evaluating the severity of skin erythema in dogs with AD.
We found a strong correlation between the optical (EI_BRG and PT) and the dermatologists' visual erythema severity estimates. On average, the Spearman coefficient (r_s) was 0.75 (Table 1), with a range between 0.70 and 0.83. Our results are comparable to the studies on human erythematous skin, where the obtained correlation coefficients were between 0.69 and 0.91 [6,22,24]. However, Frew et al., who reported the highest r_p of 0.91, differentiated only between a few erythematous categories without including the native skin color [6]. We should also point out that, in this study, only one dermatologist performed each visual erythema estimation, without repetitions. As a result, our conclusions do not consider any possible intra- and inter-rater variability in the visual evaluation of skin erythema.
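The two statistics used to compare optical and visual estimates, Spearman's rank correlation and the absolute residuals of a linear regression, can be computed as in the sketch below. The data are hypothetical, the function names are illustrative, and regressing the visual score on the EI (rather than the reverse) is an assumption, since the direction of the regression is not stated in the text.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(x), mean(y)
    covariance = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return covariance / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's r_s: Pearson correlation of the (tie-averaged) ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def mean_abs_residual(x, y):
    """Mean absolute residual of the least-squares line y = a + b * x."""
    mx, my = mean(x), mean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    intercept = my - slope * mx
    return mean([abs(b - (intercept + slope * a)) for a, b in zip(x, y)])


# Hypothetical optical indices and visual scores (in % of the scale length).
ei_values = [1.0, 1.2, 1.5, 1.9]
visual_scores = [5.0, 20.0, 45.0, 90.0]
r_s = spearman(ei_values, visual_scores)
error = mean_abs_residual(ei_values, visual_scores)
```

Since r_s depends only on ranks, any strictly increasing relationship yields a coefficient of 1, whereas the regression residuals additionally capture how far the relationship departs from a straight line.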
Compared to the rest of the EIs, including the dimension a* of the CIELAB color space, the performance of EI_BRG in terms of correlation and absolute fitting residuals was superior by more than 3%. Still, some authors discourage using EIs such as EI_BRG, which rely on the blue channel, due to the strong absorption of blue light by melanin [6]. This factor is probably the main reason why Saknite et al. [18] indicated that EI_BRG is the most suitable for detecting contrast between pigmented and non-pigmented skin. Generally, pigment seems to have a significant impact on the visual perception of erythema. In a study on the inter-rater reliability of evaluating erythema visually, Zhao et al. showed that the ICC dropped from 0.41-0.78 for non-pigmented to only 0.06-0.23 for pigmented erythematous skin [12]. All these observations could discourage us from promoting EI_BRG. However, the inguinal region of the dogs in our study was never completely pigmented, enabling us to find and sample non-pigmented, erythematous skin.
Although the coordinate a* corresponds to the red color in the CIELAB color space, a* did not correlate better than other EIs with the visual perception of erythema. In fact, its mean correlation coefficient was 0.04 lower and its mean fitting residual 0.6 p.p. higher than those of EI_BRG (Tables 1 and 2). Other studies reported even more discouraging results on the CIELAB performance. Logger et al. found a weak correlation between a visually determined erythema score and a*, with an r_s of 0.37 [25]. Similarly, a* retrieved from RGB images exhibited a limited capability to differentiate between erythema categories in canine skin [31].
For a single rater, the ISM with two small squares (SQ2) performed the best [19,21]. The PT method (selecting specific pixels) achieved slightly better results when adding extra raters and ISM repetitions, but the differences were negligible (Tables 1, 2 and 4). Collectively, the intra- and inter-rater reliabilities of the ISMs were very high, with the parameter values (r_s, ICC, and α_k) above 0.90 (Table 3). Such reliability is superior to that in the studies on the visual or optical evaluations of erythema in human skin, in which reliability parameters were usually well below 0.90 (see Introduction) [1,7-13].
Among the ISMs, SQ1's general performance was the worst. Previously, we speculated that the SQ1 method also samples hair and pigment, which would negatively affect the correlation with the visual evaluation of erythema. However, SQ1 exhibited a significantly higher inter-rater reliability, with an α_k of 0.93, compared to the 0.88 achieved by PT and SQ2 (Table 4). As expected, with smaller skin areas or even single pixels to choose from, there is a higher probability that raters would sample markedly different skin with various erythema severities. As Figure 5c demonstrates, the first two raters (blue and green crosses) evenly included mild and severe erythema, with a mean inter-rater EI_BRG misestimate of 0.08. On the other hand, the third rater (black crosses) focused only on the severe erythema patches, resulting in a mean inter-rater EI_BRG misestimate of 0.53, which represents an immense, more than sixfold increase in error. Of course, such extreme inter-rater variability can be expected only in severe cases, where the erythema distribution is patchy [32]. In contrast, EI_BRG misestimates were merely around 0.01 (an error of up to 1%) on native or mildly erythematous skin (Table 5). In our previous report [19], readers can find a further discussion on the clinical relevance and limitations of the proposed dermatoscopic system in dogs.
Altogether, most of the misestimates in EI among raters were below 0.14, representing an error of up to 14% (canine EI_BRG generally ranges from 1 to 2). In the previous study [29], we showed that different smartphone-based dermatoscopic systems had an absolute disagreement in EI_BRG of around 3%, significantly less than our veterinary raters. In the worst-case scenario, we could expect combined errors of up to 17%. Still, these errors were mostly not big enough to be detected by the human eye (Figure 6). Assuming that the two other CIELAB dimensions (L*, b*) do not change, misestimates in a* were generally below 3.0, which lies in the lower part of the JND range, the color detection threshold for the human eye.

Conclusions
Our study showed that smartphone-based dermatoscopy is a convenient and reliable way to calculate EI and thereby evaluate the severity of skin erythema in dogs with AD. We demonstrated a high correlation between the optical (EI_BRG) and the dermatologists' visual erythema evaluations, with an average Spearman coefficient (r_s) of 0.75. However, the proposed dermatoscopic approach should be applied to non-pigmented skin only, since melanin can influence the EI_BRG level.
The tested manual ISMs exhibited high intra- and inter-rater reliabilities. Although the method based on selecting individual pixels (PT) achieved slightly better performance, we recommend selecting two small skin areas (SQ2 method) due to its speed. As with any ISM, there could be a significant inter-rater erythema misestimation on severely erythematous skin. In these cases, better-outlined or automated ISMs should be tested to improve the erythema assessment.
Institutional Review Board Statement: The study was conducted according to the guidelines of the Declaration of Helsinki and approved by the Latvian Food and Veterinary Service (1.1-13E/20/865, 9.6.2020).
Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.
Data Availability Statement: Not applicable.