Comparison of Semi-Quantitative Scoring and Artificial Intelligence Aided Digital Image Analysis of Chromogenic Immunohistochemistry

Semi-quantitative scoring is a method that is widely used to estimate the quantity of proteins on chromogen-labelled immunohistochemical (IHC) tissue sections. However, it suffers from several disadvantages, including its lack of objectivity and the fact that it is a time-consuming process. Our aim was to test a recently established artificial intelligence (AI)-aided digital image analysis platform, Pathronus, and to compare it to conventional scoring by five observers on chromogenic IHC-stained slides belonging to three experimental groups. Because Pathronus operates on grayscale 0-255 values, we transformed the data to a seven-point scale for use by pathologists and scientists. The accuracy of these methods was evaluated by comparing statistical significance among groups with quantitative fluorescent IHC reference data on subsequent tissue sections. The pairwise inter-rater reliability of the scoring and converted Pathronus data varied from poor to moderate with Cohen’s kappa, and overall agreement was poor within every experimental group using Fleiss’ kappa. Only the original and converted data obtained from Pathronus were able to reproduce the statistical significance among the groups that was determined by the reference method. In this study, we present an AI-aided software that can identify cells of interest and differentiate among organelles, protein-specific chromogenic labelling, and nuclear counterstaining after an initial training period, providing a feasible and more accurate alternative to semi-quantitative scoring.


Introduction
Digital technology is an organic part of our daily lives and has a huge impact on our profession, social network, and leisure activities. It has had an impact on nearly every aspect of modern medicine, but the degree of digitalization is highly uneven among medical fields [1]. While conventional light microscopy is still the gold standard method for investigations in the pathological workup, in radiology, a new discipline called radiomics has recently evolved, integrating the work of radiologists, software engineers, and data scientists [2]. In pathology, however, the application of digital image analysis, especially aided by artificial intelligence (AI), is still relatively rare [3]. Immunohistochemistry (IHC) is a fundamental technique that is used to identify certain antigens in tissue sections with diagnostic, differential diagnostic, and prognostic value [4][5][6]. The chromogen 3,3′-diaminobenzidine (DAB) is widely used to visualize proteins of interest. Although there is no stoichiometric relationship between the chromogen's intensity and the quantity of antigens, in a standardized experiment, a stronger DAB intensity indicates a higher protein level in the tissue and vice versa [7]. IHC intensity scoring is a method that is widely used for the assessment of protein quantity and is usually ranked on a four-point scale (0, 1, 2, 3) in pathological diagnostics and research [8][9][10][11][12]. However, this semi-quantitative technique is subjective and highly inaccurate and demonstrates significant intra- and inter-observer variability [13]. One possible solution is the application of machine learning, which is valuable in automating workflows where repetitive, lengthy, and monotonous tasks are encountered [14]. In histopathology, the task at hand is the qualitative and quantitative assessment of the differentially stained cellular morphology of tissue sections.
On the differentially stained samples, the major organelles of the cells can be distinguished by their different colours, which result from their unique chemical interactions with the dye molecules. Parameters describing the shape of cells, optical densities, etc., differ from type to type. Consequently, cells (e.g., neurons) can be identified in the 'old-fashioned' manual way of looking into the microscope or via the highly automated processing of digital images (taken by microscope cameras or slide scanners) using different analytical algorithms. Moreover, AI-aided platforms can not only recognize different tissue structures, but they can also measure staining intensity more accurately than semi-quantitative scoring systems can [15]. In this study, we examine the usability of AI-aided digital image analysis to estimate the protein levels on DAB-labelled IHC slides. Furthermore, we investigated the reliability of the software by comparing its intensity results to the semi-quantitative IHC scores of three experimental groups that were assessed by five scientists. The inter-observer variability of the conventional scoring method was also evaluated. The current research focused strictly on the comparison of the two methods. The biological significance of the labelled protein, lemur tyrosine kinase 2 (LMTK2), was previously published in an immunofluorescent IHC study [16]. Furthermore, we declare that this is the first time that chromogenic (CHR)-IHC intensity scoring results and this type of investigation have been published.

Sample Selection and Processing
Post-mortem formalin-fixed paraffin-embedded (FFPE) human brain samples were obtained from the Medical Research Council (MRC) London Neurodegenerative Diseases Brain Bank at the Institute of Psychiatry, Psychology and Neuroscience, King's College London. All procedures were conducted under the ethical approval of the Institutional Ethics Committee of the MRC London Neurodegenerative Diseases Brain Bank (18/WA/0206) at the Institute of Psychiatry, Psychology and Neuroscience, King's College London, and the Brains for Dementia Research Project (08/H0704/128+5). Informed consent for autopsy, neuropathological assessment, and research participation was obtained for all subjects, and data were anonymized. Block taking, immunohistochemical labelling, and neuropathological assessment for neurodegenerative diseases were carried out in accordance with standard protocols as described in detail in earlier studies [6,16,17].
We established three groups, including two different neurodegenerative dementias (Alzheimer's disease (AD) and dementia with Lewy bodies (DLB)) with severe neuropathological stages and age-matched controls (CNT) with no known neurodegenerative condition. The assessed brain region was the middle frontal gyrus (Brodmann area 9). Each of the three experimental groups contained six samples (n = 18 in total). CHR-IHC labelling of LMTK2 was performed according to a standardized protocol that had been published earlier [16].
The slides were scanned with a Virtual Slide Microscope VS120 (Olympus Corp., Tokyo, Japan) with the same illumination intensities, exposure times, and camera settings. Focusing on the cortex, 39 photos/case were taken from the WSI files at medium magnification (200×), covering the whole cortical area on each slide. Both digital image analysis and semi-quantitative scoring were performed on these photos to guarantee that the software and the observers evaluated the same images and to avoid discrepancies resulting from the different settings of the light microscope and the slide scanner as well as the digital display.

IHC Intensity Scoring
Semi-quantitative cellular scoring was carried out using a four-point scale based on the IHC intensity of the cells: negative (0), mild positivity (1+), moderate positivity (2+), and strong positivity (3+) (Figure 1). A few reference images were analysed quantitatively with ImageJ software (National Institutes of Health, Bethesda, MD, USA), using the Cell Counter module to calculate the exact average IHC intensity score of the given images. These were used as reference images and allowed the investigators to execute a more accurate semi-quantitative scoring methodology such as the one described in our previous work [4]. Then, the observers determined the mean IHC intensity scores for each image. To do this, we extended the original scores with values of 0.5, 1.5, and 2.5; e.g., if an image contained mild (1+) and moderate (2+) positive cells in a ratio that was approximately 1:1, then we assigned a score of 1.5. Thus, the final scores of the images were determined on a 7-point scale.

Digital Image Analysis
The core of the image analysis software of the Pathronus platform (V1.2, Vitrolink Kft., Debrecen, Hungary) is a Convolutional Neural Network (CNN) [18]. Pathronus was developed as an AI-assisted image analysis platform intended solely for research and development purposes. It combines the concept of a pathology-focused online shared workspace with state-of-the-art CNN techniques. In general, the two aspects of the application work to improve each other in a feedback-loop system: the users upload new images, which are analysed to determine whether the objects were detected correctly by the AI or whether points of interest should be annotated manually, and these images act as new training data for the network. This improves the capabilities and accuracy of the CNN, allowing it to make better predictions when a new image is uploaded later on. Since the platform draws on diverse user images (and equipment), most of the biases that a network might naturally develop when trained using data from a handful of sources can be eliminated. With a sufficient quantity of training data, such a system might be capable of identifying any desirable features on any type of pathological imagery.
In this particular case, the constructed convolutional network was trained on thousands of neurons that had been gathered into two morphology classes to detect specific IHC patterns. As the starting step, the recognition criteria were first determined, which were the abundant cytoplasm and the large, well-recognized nucleus that was visible in the given section plane as the types of accepted neurons (Class 1, Figure 2C). Any other object not falling into the previously mentioned categories was labeled as a rejected item (Class 0, Figure 2A) for the purposes of the training process. A confusion matrix analysis was performed to test the accuracy of the system.
Figure 2. Class 0 represents an example of a misidentified item (vessels which mimic the shape of a neuron). Class 1 depicts an ideal neuron that could be used for the intensity measurements, which has a large amount of cytoplasm and an easily observed nucleus. H = nuclear counterstain (haematoxylin).
By design, the Pathronus platform runs an object detection algorithm, which feeds images to the CNN and marks the areas of the morphology class as a Region of Interest (ROI) in the images (i.e., it specifies the exact number, coordinates, and extent of the neurons in the images of the tissue sections). The AI model system that was used in this research is based on the Keras-RetinaNet module. It implements deep learning and uses an algorithm called RetinaNet, which is one of the most advanced object recognition algorithms that is used to detect features in images. The code itself is based on the Keras deep learning framework, which is written for the Python programming language [19]. As a result, it successfully recognized and cropped Class 1 DAB-stained neurons in the images (Figure 2D). The detected neurons were then checked manually. The accepted neurons were found on the same set of images that the pathologist used for scoring and were processed for DAB intensity signal levels by the platform. It used colour deconvolution with the same parameters applied to all ROIs to separate the nuclear counterstain haematoxylin and the cytoplasmic DAB signals in all of the cropped neurons (see Figure 2). The platform only kept the inverted DAB signal and measured the average intensity (8-bit grayscale) of all of the neurons with the same settings (Figure 2D). In an 8-bit image, 0 represents black and 255 represents white; thus, for a better graphical presentation and to avoid misunderstandings, we used an inverse grayscale, where the more intense (darker) DAB signals have higher inverse grayscale values (i.e., darkest value = 255). The individual intensity values of the images corresponding to the CNT, AD, and DLB cases were then summarized. Finally, we determined the mean inverse gray intensities of the experimental groups.
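The colour deconvolution and inverse grayscale steps described above can be illustrated for a single RGB pixel. This is a minimal sketch and not the Pathronus implementation: the stain vectors are approximate haematoxylin and DAB optical-density values commonly quoted in the literature, and the function names as well as the mapping from the DAB amount to an 8-bit inverse gray value are assumptions made purely for illustration.

```python
import math

# Assumed optical-density (OD) stain vectors for haematoxylin and DAB.
# Approximate literature values; NOT the parameters used by Pathronus.
H_VEC = (0.650, 0.704, 0.286)
DAB_VEC = (0.268, 0.570, 0.776)


def normalize(v):
    """Scale a 3-vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return tuple(x / norm for x in v)


def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def triple(a, b, c):
    """Scalar triple product a . (b x c)."""
    return sum(x * y for x, y in zip(a, cross(b, c)))


def rgb_to_od(rgb):
    """Convert 8-bit RGB to per-channel optical density.

    Colour deconvolution assumes this logarithmic relation; channel
    values are clipped at 1 to avoid log(0).
    """
    return tuple(-math.log10(max(float(v), 1.0) / 255.0) for v in rgb)


def deconvolve(rgb):
    """Return (haematoxylin, DAB, residual) stain amounts for one pixel."""
    h = normalize(H_VEC)
    d = normalize(DAB_VEC)
    r = normalize(cross(h, d))  # third vector, orthogonal to both stains
    od = rgb_to_od(rgb)
    det = triple(h, d, r)
    # Cramer's rule for od = c_h * h + c_d * d + c_r * r
    return (triple(od, d, r) / det,
            triple(h, od, r) / det,
            triple(h, d, od) / det)


def dab_inverse_gray(rgb):
    """Hypothetical mapping of the DAB amount to an inverse 8-bit gray value."""
    c_dab = max(deconvolve(rgb)[1], 0.0)
    gray = 255.0 * 10.0 ** (-c_dab)  # 255 = white, i.e., no DAB signal
    return 255.0 - gray              # inverse scale: darker DAB gives higher values
```

Averaging `dab_inverse_gray` over all pixels of a cropped neuron would give one intensity value per cell on the inverse scale, where 0 corresponds to no DAB signal.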
Please note that the analysis focused on the intensity of neurons that had been identified by the software and further evaluated by the pathologists participating in this study; therefore, superior performance of the neural network module was not a fundamental requirement, as the neurons that were subjected to intensity analysis were manually checked and filtered prior to assessment. It was not within the scope of this study to create a fully fledged neuron classifier software.

Comparison between Semi-Quantitative Scoring and Digital Image Analysis
Because the original semi-quantitative scoring is based on a four-point scale, while the software uses a grayscale that ranges from 0 to 255, it was not possible to compare the methods directly. Therefore, we converted the inverse grayscale values of the neurons into IHC intensity scores according to the following formula: 0-33→0; 34-107→1; 108-181→2; 182-255→3, where the ranges refer to the inverse grayscale values and the numbers after the arrows are the converted scores. The first inverse grayscale range (0-33) was derived from the digital image analysis of the IHC-negative, haematoxylin-only slide, where the maximum measured value was 33, and the rest of the conversion resulted from the division of the grayscale range of 34-255 into three equal parts. Then, we calculated the mean scores of the images from the cell-level data. In order to perform a valid inter-rater reliability analysis, a second conversion to the 7-point scale was applied to the mean values based on the following formula: 0-0.25→0; 0.26-0.75→0.5; 0.76-1.25→1; 1.26-1.75→1.5; 1.76-2.25→2; 2.26-2.75→2.5; 2.76-3→3. Through this formula, the IHC intensity data that were measured by the Pathronus platform and the semi-quantitative scores that were determined by the observers become directly comparable. Nevertheless, it should be noted that presenting the digital image analysis results on a seven-point scale after this double conversion decreases the evaluation accuracy, because a significant amount of resolution is lost.
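The double conversion described above can be sketched in a few lines. The function names are hypothetical, and the handling of values that fall between the listed bounds (e.g., an inverse grayscale value of 33.5 or a mean score of 0.255) is an assumption, since the text only specifies integer and two-decimal limits.

```python
def gray_to_score(inv_gray):
    """Map an inverse grayscale value (0-255) to a four-point IHC score."""
    if not 0 <= inv_gray <= 255:
        raise ValueError("inverse grayscale value must lie in 0-255")
    if inv_gray < 34:    # 0-33: range of the haematoxylin-only slide
        return 0
    if inv_gray < 108:   # 34-107
        return 1
    if inv_gray < 182:   # 108-181
        return 2
    return 3             # 182-255


def mean_to_seven_point(mean_score):
    """Map a mean score (0-3) to the seven-point scale in steps of 0.5."""
    if not 0 <= mean_score <= 3:
        raise ValueError("mean score must lie in 0-3")
    upper_bounds = (0.25, 0.75, 1.25, 1.75, 2.25, 2.75)
    for step, bound in enumerate(upper_bounds):
        if mean_score <= bound:
            return step * 0.5
    return 3.0


def image_score(neuron_values):
    """Convert the per-neuron inverse grayscale values of one image."""
    scores = [gray_to_score(v) for v in neuron_values]
    return mean_to_seven_point(sum(scores) / len(scores))
```

For instance, inverse grayscale values of 110.02 and 123.03 both map to a score of 2, which illustrates how much resolution the conversion discards.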

Statistical Analysis
Inter-observer reliability for the experimental groups was investigated by comparing the intensity scores of the images (n = 234/group) given by the observers and Pathronus using Fleiss' kappa. Pairwise comparisons of the observers (including Pathronus) were also performed with Cohen's kappa. Strength of agreement was adopted from the study by Landis and Koch [20] as follows: κ ≤ 0.20→Poor; 0.21-0.40→Fair; 0.41-0.60→Moderate; 0.61-0.80→Good; 0.81-1.00→Very good. Statistical tests were executed with IBM SPSS Statistics for Windows, Version 25.0 (IBM Corp., Armonk, NY, USA).
To determine the accuracy of the semi-quantitative scoring and the original Pathronus data, validation was required. The statistical significance of the differences in CHR-IHC intensities among the three experimental groups was calculated for each observer. The Shapiro-Wilk normality test, equal variance test, one-way analysis of variance (ANOVA), and all pairwise comparisons (Holm-Sidak method) were carried out using the SigmaPlot 12.0 software (Systat Software Inc., San Jose, CA, USA). We used the previously published quantitative fluorescent IHC analysis on the same disease groups as a reference [16].
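For readers unfamiliar with the two agreement statistics, the following sketch shows how Cohen's and Fleiss' kappa are computed from raw scores. It is an illustration using the textbook formulas, not the SPSS procedure used in the study; the function names and the handling of values that fall between the listed bands are assumptions.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who scored the same images."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement from the marginal score frequencies of each rater
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical category
    return (observed - expected) / (1.0 - expected)


def fleiss_kappa(ratings):
    """Fleiss' kappa; `ratings` holds one list of rater scores per image."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(row) != n_raters for row in ratings):
        raise ValueError("every image needs the same number (>= 2) of raters")
    totals = Counter()
    agreement_sum = 0.0
    for row in ratings:
        counts = Counter(row)
        totals.update(counts)
        squares = sum(c * c for c in counts.values())
        agreement_sum += (squares - n_raters) / (n_raters * (n_raters - 1))
    p_bar = agreement_sum / n_items
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in totals.values())
    if p_e == 1.0:
        return 1.0  # all raters used a single identical category
    return (p_bar - p_e) / (1.0 - p_e)


def strength(kappa):
    """Translate a kappa value into the agreement bands listed above."""
    for bound, label in ((0.20, "Poor"), (0.40, "Fair"),
                         (0.60, "Moderate"), (0.80, "Good")):
        if kappa <= bound:
            return label
    return "Very good"
```

Because the scores are treated as unordered categories, a disagreement of 0.5 points counts the same as one of 2 points; weighted variants of kappa exist but are not shown here.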
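The group comparison can be illustrated with a short sketch of the one-way ANOVA F statistic and a step-down Sidak (Holm-Sidak) adjustment of pairwise p-values. This is not the SigmaPlot procedure itself: the normality and equal-variance tests and the conversion of F to a p-value are omitted, the function names are hypothetical, and the adjustment shown is the standard textbook form, assumed here to correspond to the method named above.

```python
def anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    # Variation of the observations around their own group mean
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


def holm_sidak(p_values):
    """Step-down Sidak adjustment; returns adjusted p-values in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is corrected for m tests, the next for m - 1, ...
        adj = 1.0 - (1.0 - p_values[idx]) ** (m - rank)
        running_max = max(running_max, adj)  # keep adjusted values monotone
        adjusted[idx] = min(running_max, 1.0)
    return adjusted
```

With three experimental groups there are three pairwise comparisons, so the smallest raw p-value is adjusted as 1 - (1 - p)^3, the next as 1 - (1 - p)^2, and the largest is left unchanged unless monotonicity requires raising it.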

Results
The two classes (Figure 2) were populated randomly and represented all cases (CNT, DLB, and AD) after annotation was completed utilizing the Pathronus platform. Class 0 consisted of 4061 samples, while Class 1 had 5009 neurons. A total of 70% of the 9070 annotated objects were used as the training dataset, 20% as the validation dataset, and 10% as the test dataset. The confusion matrix that was generated on the test dataset can be seen in Table 1. The number of neurons on the CHR images that were analysed by Pathronus and manually checked by pathologists was 12,516. Supplementary Table S1 contains the summarized inverse mean gray intensities of the images after colour deconvolution (see Methods) of the individually processed IHC-stained neurons. The mean intensity values of the individual cases in the experimental groups ranged from 113.58 to 123.22, 100.21 to 114, and 107.76 to 122.8 in CNT, AD, and DLB, respectively. Semi-quantitative scoring was performed by five observers on 39 images/case (n = 234/group). The scores that were given for each group ranged between one and three (Supplementary Table S1). CNT achieved the highest rating, and AD received the lowest scores from the majority of the observers. Double-converted Pathronus values ranged from 1 to 2. Table 2 contains the inter-rater reliability among the observers and Pathronus. Sub-tables A, B, and C show the Cohen's kappa values and the strength of the agreements that were determined in the pairwise comparisons in the CNT, DLB, and AD groups, respectively. Sub-table D includes the overall comparison among the five human observers plus Pathronus using Fleiss' kappa. Although Cohen's kappa values were highly variable among the observers in different groups, poor agreement dominated, while moderate agreement was the least common. Overall agreement with statistically significant (p < 0.005) Fleiss' kappa values was poor in every experimental group.
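The 70/20/10 partition of the 9070 annotated objects reported above can be reproduced with a simple shuffled split. This is a sketch under stated assumptions: the function name and seed are hypothetical, and whether the original split was stratified by class or by case is not stated, so a plain random split is shown.

```python
import random


def split_dataset(items, seed=0):
    """Shuffle items and split them 70/20/10 into train, validation, test."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = n * 70 // 100  # integer arithmetic avoids float rounding
    n_val = n * 20 // 100
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


# Labels only: 4061 rejected objects (Class 0) and 5009 accepted neurons (Class 1)
labels = [0] * 4061 + [1] * 5009
train, validation, held_out = split_dataset(labels)
print(len(train), len(validation), len(held_out))  # 6349 1814 907
```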
The statistically significant differences among the experimental groups varied among the observers and between methods. Certain observers (#1 and #4) found statistically significant differences among all groups, while others (#2, #3, and #5) did not. However, only the Pathronus analysis (original and converted) was able to reproduce the reference data; specifically, CNT had the strongest and AD had the weakest immunopositivity, and statistically significant differences were revealed between the CNT versus (vs.) AD groups and the DLB vs. AD groups (Figure 3).

Discussion
A major difficulty in biological research is the transformation of qualitative data to quantitative data [21]. Semi-quantitative scoring is a widely used method that addresses this problem [13,22]. However, we must be aware of its limitations. Subjectivity is a major issue in the scoring process, which is highly influenced by histological expertise [23,24]. In many fields of translational research, tissue scoring is delegated to biomedical personnel (including senior researchers, post-docs, and even students) who do not have the same amount of experience as board-certified pathologists, who receive many years of tissue interpretation training. Studies following a 'do-it-yourself' pathology approach may suffer from Type I (false positivity) and Type II (false negativity) errors [21]. Although board-certified pathologists are highly skilled in recognizing patterns in morphological changes, the human visual system has a limited ability to detect subtle changes in tissues, especially with respect to spatial and intensity assessments [22]. A major shortcoming of conventional scoring and the necessity of a better, higher resolution method are exemplified in Figure 4. Both images were rated with a score of 2 by all observers, but digital image analysis by Pathronus revealed that the inverse grayscale (0-255 = light to dark) value of the first image is 110.02 (Panel A), while that of the second is 123.03 (Panel B). This discrepancy arose from the significantly smaller evaluation range (7 vs. 256), resulting in difficulties detecting subtle differences in the labelling intensities with the naked eye, and from various human physiological factors such as fatigue and eyestrain, which may occur during the monotonous process of assessing a large number of images [21,25,26]. Although the overall inter-rater agreement was poor, the ranking of mean intensities given to each of the experimental groups was the same for all but one of the observers (Table 3).
Better inter-rater agreement might have been reached with a longer pre-training period that was restricted to the evaluation of neuronal cells using this scoring system or by applying cut-offs (e.g., size of neuron or visibility of nuclei). However, it is known from the literature that while high inter-observer agreement is achievable in qualitative scoring (e.g., existence of IHC labelled structures or percentage of positive cells) [27][28][29][30], often only fair or poor overall agreement is achieved in the semi-quantitative scoring of staining intensity, even among experts with decades of practice [31]. Generally, results are influenced by study design, the type of tissue being investigated, and how the observers use a specific scoring system [21,22]. Consequently, tissue-specific training with an established scoring system probably improves inter-observer agreement, but it still cannot eliminate the problem of intra-observer variability, such as that arising from the physiological factors detailed above [22].
Besides inter- and intra-observer reproducibility, it is also essential that the results can be validated. However, as semi-quantitative scoring is the gold standard, digital image analysis of datasets is rarely performed. Studies are not consistent regarding the practicality of the methods that are used. Some authors have reported unequivocal advantages of digital analysis [32], while others did not find any analytical benefits other than time-efficiency [33]. Fortunately, a previous quantitative immunofluorescent IHC analysis was carried out on the same cases that were used in the present study, allowing us to make a trustworthy comparison with our current results. Reasonably good agreement was found in the ranking of the groups (CNT > DLB > AD), except in the case of one investigator. However, the statistically significant differences that were observed among the groups were highly variable, with only the original Pathronus analysis and, perhaps more surprisingly, the converted analysis being able to reproduce the reference data (Table 3). Questions may arise as to why we do not interpret the original Pathronus evaluation as a quantitative technique. CHR-IHC quantification is very difficult due to the numerous variables that are involved from the pre-analytical phase to the post-processing steps, resulting in doubts and inconsistencies in the literature [34][35][36]. Moreover, DAB chromogen does not follow the Beer-Lambert law; the reaction is not stoichiometric, and consequently, the staining intensity is not directly proportional to the number of antigens [37]. Nonetheless, DAB-based CHR-IHC is the primary choice in diagnostic pathology because it is an easily accessible, relatively cheap, and fast technique compared to quantitative molecular biological methods (e.g., Western blot, qPCR), which may not be feasible in the first-line pathological workup.
Furthermore, CHR-IHC intensity-based semi-quantitative evaluation is an organic part of several widely used scoring systems with therapeutic relevance (e.g., breast cancer) [8,38]. Originally, the Pathronus platform was developed to support pathologists in routine diagnostic procedures that predominantly required the assessment of CHR-IHC slides. It is an online forum where pathologists and histologists can upload images of difficult or interesting cases and can share and discuss them with other experts from all over the world. In addition, they can teach and train the platform by annotating disease-specific or diagnostically relevant structures on the images. Pathronus may synthesize these data, and the next time somebody uploads a similar image in association with the same disease, the platform may be able to pre-analyze it and label the previously learnt pathologically important ROIs. The software may eliminate inter- and intra-observer bias in the future because the number of investigated neurons is practically unlimited, as the method is able to analyse thousands of cells and can cover whole slides very quickly. However, DAB labelling is still not a quantitative technique, even though, in a standardized experiment (reagents, incubation times, etc.), digital image analysis yields comparable results and provides a better estimate of protein expression (in accordance with reference datasets) than the commonly used semi-quantitative visual scoring (Table 3) [16]. Based on the findings outlined here, digital image analysis is undoubtedly the future of histology. Although a comprehensive investigation into the differences between semi-quantitative scoring and digital image analysis was beyond the scope of our current work, a growing number of publications within this field shed light on a considerable number of features of the two methods.
Despite its numerous advantages, such as speed, objectivity with good predictive value, and the ability to handle large datasets, digital image analysis also has several disadvantages, namely the cost, equipment requirements, and the level of acceptance by some scientific communities and regulators (Table 4) [3,21,22,25,26,33,34,37,39,40,41,42,43,44].
Figure 4. A notable limitation of semi-quantitative scoring compared to digital image analysis is the significantly smaller evaluation range (7 vs. 256) due to the difficulties that are experienced when attempting to detect subtle differences in the labelling intensities using the human eye alone. The observers allocated a score of 2 to both of the images, whereas the original Pathronus method revealed that the intensity of Panel A was 110.02, while that of Panel B was 123.03 on the grayscale (range 0-255). Although the human eye is capable of perceiving small differences, the objective and reproducible categorization of hundreds of images on an extended scale is not possible for human observers, whereas it is possible for digital image analysis software. This shortcoming of semi-quantitative scoring may result in statistical bias compared to software-based results.

Conclusions
Semi-quantitative scoring is still widely used as the gold standard for evaluating CHR-IHC samples. However, it has obvious limitations that need to be addressed. AI-aided software (e.g., Pathronus) can identify cells of interest and differentiate among organelles, protein-specific chromogenic labelling, and nuclear counterstaining after an initial training period. This provides a real alternative to semi-quantitative scoring, allowing robust and fast data processing with better predictive value.

Informed Consent Statement:
Informed consent for autopsy, neuropathological assessment, and research participation was obtained for all subjects, and data were anonymized.

Data Availability Statement:
The datasets used and/or analysed during the current study are available from the corresponding author upon reasonable request.