Inter-observer Variability in the Analysis of CO-RADS Classification for COVID-19 Patients

During the early stages of the pandemic, computed tomography (CT) of the chest, along with serological and clinical data, was frequently used to diagnose COVID-19, particularly in regions facing shortages of PCR kits. In these circumstances, CT scans played a crucial role in diagnosing COVID-19 and guiding patient management. The COVID-19 Reporting and Data System (CO-RADS) was established as a standardized reporting system for cases of COVID-19 pneumonia. Its implementation requires a high level of agreement among observers to prevent confusion. This study aimed to assess the inter-observer agreement between physicians from different specialties with variable levels of experience in their CO-RADS scoring of chest CT scans of confirmed COVID-19 patients, and to assess the feasibility of applying this reporting system by those with little experience of it. All chest CT images of patients with positive RT-PCR tests for COVID-19 were retrospectively reviewed by seven observers. The observers were divided into three groups according to specialty (three radiologists, three house officers, and one pulmonologist). The observers assessed each image and categorized the patients into five CO-RADS groups. A total of 630 participants were included in this study. The inter-observer agreement was almost perfect among the radiologists, substantial between the pulmonologist and the house officers, and moderate to substantial among the radiologists, the pulmonologist, and the house officers. There was substantial to almost perfect inter-observer agreement when reporting using the CO-RADS among observers with different experience levels. Although the inter-observer agreement among the radiologists was high, it was lower when the radiologists were compared with the pulmonologist and house officers. Radiologists, house officers, and pulmonologists applying the CO-RADS can accurately and promptly identify typical CT imaging features of lung involvement in COVID-19.


Introduction
The novel coronavirus (SARS-CoV-2) originating from Wuhan, China, has emerged as a significant global threat [1][2][3][4]. The reverse transcription-polymerase chain reaction (RT-PCR) assay is the gold standard for diagnosing COVID-19, but its sensitivity can vary between 42% and 83%, depending on the viral load [5]. Furthermore, the limited availability of RT-PCR tests and delayed delivery of results in some developing countries posed challenges during the initial wave of the pandemic [5,6].
Understanding the extent of inter-observer variability in the CO-RADS classification is crucial for ensuring consistent and reliable application of this reporting system. Therefore, in this study, we aimed to assess the inter-observer agreement among physicians from different specialties with varying levels of experience in their CO-RADS scoring of chest CT scans from confirmed COVID-19 patients. Furthermore, we evaluated the feasibility of implementing this reporting system among physicians with limited knowledge of it.

Study Participants
All chest CT images of patients with positive RT-PCR tests for COVID-19 from the medical records of Suez Canal University Hospitals from August 2020 to June 2021 were eligible for this study. Exclusion criteria included patients with at least one negative RT-PCR test (n = 5), patients with unavailable RT-PCR results (n = 14), and CT scans with major artefacts such as respiratory motion or incomplete scanning that affected the accuracy of the CT image interpretation (n = 8). After applying the inclusion and exclusion criteria, 630 patients were included in this study.
The study was approved by the Institutional Review Board, and consent was waived due to the retrospective nature of the study.

CT Chest Imaging
Patients underwent CT without contrast using the same protocol with a single 64-slice CT scanner (LightSpeed VCT; GE Healthcare; Chicago; United States). The acquisition parameters at our hospital were as follows: 120 kV tube voltage with automatic tube current modulation (150 mAs); tube rotation time of 0.28 s; beam collimation of 128 ch × 0.6 mm; and beam pitch of 1.5. By default, chest CT images were reconstructed at a slice thickness of 2.0 mm without an interslice gap, using a sharp tissue kernel (Bl57) with the filtered back-projection technique. The slice thickness of the reconstructed images ranged from 1.25 to 5 mm at other institutions.

Imaging Analysis
Each chest CT image was reviewed by seven observers who were divided into three groups according to specialty, as follows: The radiologist group included observer 1 (a chest consultant radiologist with 20 years of experience), observer 2 (a radiologist consultant with ten years of experience), and observer 3 (a radiologist with five years of experience). The chest physician group included observer 4 (a chest consultant with 15 years of experience). The foundation-year physician group included observers 5, 6, and 7 (foundation-year physicians with little experience). The foundation-year physician group was the only group that received a training session by the main researcher, which included one hour of conceptual lectures followed by practical application on 30 CTs performed on COVID-19 patients with findings corresponding to each CO-RADS category. These thirty training cases were excluded from the study sample.
For each patient, the chest CT scans were evaluated for the following characteristics: presence, amount, and distribution pattern of ground-glass opacities; the presence of consolidation; the presence of air bronchograms; the number of lobes affected where ground-glass or consolidative opacities were present; the presence of nodules; the presence of pleural effusion; the presence of thoracic lymphadenopathy (defined as lymph node size > 10 mm in short axis); airway abnormalities (including airway wall thickening, bronchiectasis, and endoluminal secretions); and the presence of underlying lung diseases such as emphysema or fibrosis. Opacities with a crazy-paving pattern, a reverse halo sign, rounded morphology, intralesional cavitation, and linear opacities were noted. The observers assessed each image and categorized the patients according to the CO-RADS classification system (Table 1) [15]. The extracted chest CT images were anonymized and the observers were blinded to all clinical data of the patients, including the PCR results, except for their age and sex. The radiologists recorded the detailed descriptive data for each patient and identified the CO-RADS classes in the tables. However, the other groups recorded the cases according to the CO-RADS classification without detailed descriptive data.

Statistical Analysis
Statistical analyses were performed using SPSS version 26 (IBM, Armonk, NY, USA). The data are presented in tables and figures. Qualitative data are presented as frequencies and percentages. To determine the inter-observer agreement, the Fleiss κ value was determined across the observers. The κ values were obtained by comparing the CO-RADS scores of each observer with the median scores of the remaining six observers. Inter-observer agreement was considered slight for a κ value of 0.01-0.20, fair for a κ value of 0.21-0.40, moderate for a κ value of 0.41-0.60, substantial for a κ value of 0.61-0.80, and almost perfect for a κ value of 0.81-1.00 [17]. A probability value of less than 0.05 was considered statistically significant for all tests.
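As an illustration of the agreement statistic named above, the following is a minimal Python sketch of Fleiss' κ together with the interpretation bands listed in this section. It is not the SPSS procedure used in the study. The function names `fleiss_kappa` and `agreement_label`, the toy ratings, and the label returned for κ ≤ 0 are assumptions made for the example only.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for a list of subjects, each rated by the same number of raters.

    ratings: list of per-subject lists of category labels (hypothetical toy data below).
    categories: all possible category labels, e.g. CO-RADS 1 to 5.
    """
    n_subjects = len(ratings)
    n_raters = len(ratings[0])
    if any(len(row) != n_raters for row in ratings):
        raise ValueError("every subject needs the same number of ratings")

    # counts[i][j] = number of raters who put subject i into category j
    counts = []
    for row in ratings:
        tally = Counter(row)
        counts.append([tally[c] for c in categories])

    # Observed agreement: mean over subjects of the per-subject agreement P_i
    p_i = [
        (sum(n * n for n in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_subjects

    # Chance agreement: sum of squared overall category proportions
    total = n_subjects * n_raters
    p_j = [sum(row[j] for row in counts) / total for j in range(len(categories))]
    p_e = sum(p * p for p in p_j)

    if p_e == 1.0:  # every rating fell in one category; kappa is undefined
        return 1.0  # assumption: treat as complete agreement
    return (p_bar - p_e) / (1.0 - p_e)


def agreement_label(kappa):
    """Map kappa to the bands quoted in the text (label for kappa <= 0 is an assumption)."""
    if kappa <= 0.0:
        return "no agreement beyond chance"
    for upper, label in ((0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial")):
        if kappa <= upper:
            return label
    return "almost perfect"


# Toy example: four scans, three raters, CO-RADS categories 1 to 5
toy = [[5, 5, 5], [1, 1, 1], [4, 5, 5], [3, 3, 4]]
k = fleiss_kappa(toy, categories=[1, 2, 3, 4, 5])
print(round(k, 3), agreement_label(k))
```

With the toy data, two scans have full agreement and two have a single dissenting rater, which places κ in the moderate band. The same `agreement_label` function can be applied to any κ value reported in the Results.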

Results
A total of 630 participants were included in this study. Table 2 provides an overview of the basic characteristics of the study participants, including their gender and nationality. Among the participants, the majority were male, accounting for 61.9% (390 patients), while the remaining 38.1% were female (240 patients). Regarding nationality, the study primarily included Egyptian participants, who accounted for 98.7% (622) of the patients. There were also a small number of participants of other nationalities, including three patients (0.5%) from Italy, one patient (0.2%) from India, two patients (0.3%) from Germany, one patient (0.2%) from the United States, and one patient (0.2%) from Ukraine.
All of the chest CT scans were assessed by the three radiologists, and Table 3 provides an overview of their findings. The CT scans were assessed as normal for 107, 104, and 103 patients by observers 1, 2, and 3, respectively. Emphysema was found in 16, 13, and 13 patients by observers 1, 2, and 3, respectively. In addition, lung masses were found in 12, 6, and 6 patients by observers 1, 2, and 3, respectively. This table provides additional details on the number of CT scans showing peri-fissural nodules, tree-in-bud, centrilobular nodules, consolidation, cavitation, and smooth septal thickening with pleural effusion, as identified by each observer.
Table 4 focuses on ground-glass opacity (GGO) characteristics observed by the radiologists. The GGOs were perihilar in 349, 334, and 319; single foci in 20, 403, and 13; centrilobular in 405, 14, and 371; and homogeneous extensive in 18, 109, and 11 by observers 1, 2, and 3, respectively. GGOs with smooth septal thickening were found in 113, 2, and 114 patients, and smooth septal thickening and effusion were found in 5, 149, and 2 by observers 1, 2, and 3, respectively. Small GGOs, not centrilobular, and not close to the pleura were found in 152, 4, and 143 by observers 1, 2, and 3, respectively. Organizing (scarring) pneumonia patterns without typical features were found in 9, 366, and 4 by observers 1, 2, and 3, respectively. Multifocal bilateral GGOs were found in 375, 396, and 370 by observers 1, 2, and 3, respectively. Multifocal unilateral GGOs close to the pleural surface or fissure were found in 404, 404, and 409 by observers 1, 2, and 3, respectively. GGOs with typical features on one side and unifocal on the other were found in 15, 15, and 14 by observers 1, 2, and 3, respectively. Unifocal bilateral GGOs were found in 6, 6, and 3 by observers 1, 2, and 3, respectively.
Table 5 highlights the typical features identified by the radiologists, including multifocal bilateral GGOs with consolidation, organizing (scarring) pneumonia, crazy-paving signs, thickened vessels, and reversed halo signs. Multifocal bilateral GGOs with consolidation close to the pleural surface or fissure and pleural sparing were found in 385, 382, and 369 by observers 1, 2, and 3, respectively. The typical features of organizing (scarring) pneumonia patterns were found in 291, 283, and 292 by observers 1, 2, and 3, respectively. Typical features with crazy paving were found in 108, 97, and 109 by observers 1, 2, and 3, respectively. Typical features with thickened vessels were found in 355, 308, and 362 by observers 1, 2, and 3, respectively. Typical features with reversed halos were found in 39, 35, and 34 by observers 1, 2, and 3, respectively.
Table 6 shows the results of applying the CO-RADS classification system to the CT scans by the three radiologists. Observer 1 classified 21.1% as CO-RADS 1, 2.4% as CO-RADS 2, 5.7% as CO-RADS 3, and 10.5% as CO-RADS 4. Observer 2 recorded very similar results, with 21.1% CO-RADS 1, 2.4% CO-RADS 2, 5.7% CO-RADS 3, and 10.6% CO-RADS 4. Observer 3 classified 21.6% as CO-RADS 1, 2.2% as CO-RADS 2, 5.1% as CO-RADS 3, and 8.3% as CO-RADS 4. All three observers classified the majority of the cases as CO-RADS 5, with percentages ranging from 60.2% to 62.9%. The level of agreement between these observers was very high, based on the Fleiss kappa values. The Fleiss kappa value was 0.997 between observers 1 and 2, 0.921 between observers 2 and 3, and 0.924 between observers 1 and 3. This indicates an almost perfect consensus between the observers when applying the CO-RADS classification to the scans.
κ, Cohen's kappa coefficient (95% confidence interval); a, observers 1 and 2; b, observers 2 and 3; c, observers 1 and 3.
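The pairwise comparisons above (observers 1 and 2, 2 and 3, 1 and 3) are described in the table note as Cohen's kappa. As a hedged illustration of how such a pairwise value is computed, the sketch below implements the unweighted two-rater statistic in plain Python. The function name `cohen_kappa` and the toy scores are hypothetical; the study's own values were produced in SPSS, and no confidence interval is computed here.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters scoring the same cases."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of cases")
    n = len(rater_a)

    # Observed agreement: fraction of cases given the same category
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: product of each rater's marginal proportions, summed
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)

    if p_e == 1.0:  # both raters used one identical category throughout
        return 1.0  # assumption: treat as complete agreement
    return (p_o - p_e) / (1.0 - p_e)


# Toy CO-RADS scores for ten scans from three hypothetical observers
scores = {
    "observer_1": [5, 5, 5, 1, 1, 4, 3, 2, 5, 5],
    "observer_2": [5, 5, 5, 1, 1, 5, 3, 2, 5, 4],
    "observer_3": [5, 5, 5, 1, 1, 4, 3, 2, 5, 5],
}

# One kappa per observer pair, mirroring the pairwise layout of the table note
for first, second in combinations(scores, 2):
    print(first, second, round(cohen_kappa(scores[first], scores[second]), 3))
```

Because CO-RADS is an ordinal scale, a weighted kappa that penalizes a 4-versus-5 disagreement less than a 1-versus-5 disagreement is a common alternative; the text does not state which variant was used, so the unweighted form is shown.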
Table 8 provides an overview of the inter-observer agreement among radiologists, the pulmonologist, and house officers regarding the CO-RADS classification. Among the radiologists and the pulmonologist, there was a substantial inter-observer agreement, indicated by κ values ranging from 0.613 to 0.661. In the case of the radiologists and house officers, the inter-observer agreement was moderate to substantial, with κ values ranging between 0.503 and 0.692. Notably, the agreement between radiologists and observer 6 was almost perfect, with κ values ranging from 0.900 to 0.903. Overall, the table demonstrates moderate to substantial inter-observer agreement on CO-RADS classifications among the different medical professionals. The values represent Cohen's kappa coefficient (95% confidence interval).
Representative cases from this study are shown in Figures 1-3.

Discussion
CT imaging is widely used as a diagnostic method for COVID-19 pneumonia. Radiological differential diagnosis and isolation of other viral agents causing pneumonia in patients have gained importance, especially during pandemics [18]. In many countries, CT scans, together with serological and clinical data, are commonly used to diagnose COVID-19. Therefore, a CT imaging protocol is required to enhance radiation protection and adhere to the ALARA (as low as reasonably achievable) principle [19]. Although chest CT findings may partially overlap with other diseases, particularly other types of viral infections, COVID-19 may have specific CT characteristics that are less common in other conditions [20].
The current study assessed the inter-observer agreement in applying the CO-RADS classification to interpret the chest CT scans of 630 patients. Our results demonstrated substantial to almost perfect agreement between radiologists, a pulmonologist, and house officers in classifying COVID-19 severity using the CO-RADS. Specifically, there was an almost perfect agreement between the three radiologists (κ = 0.921-0.997). This indicates that the CO-RADS allows radiologists to consistently classify COVID-19 severity on CT scans. Agreement was slightly lower but still substantial between the pulmonologist and the three house officers (κ = 0.584-0.736). This suggests that pulmonologists and house officers can also reliably apply the CO-RADS, although there is more variability compared to specialized radiologists. These results are consistent with those of previous studies [3,5,13,15,16,[21][22][23][24][25][26]].
Fonseca et al. [3] emphasized the substantial inter-observer agreement among the three readers for CO-RADS classifications, even three months after the initial case analysis, and without any additional training (κ = 0.642). Özdemir et al. [5] reported good to almost perfect inter-observer agreement among their four readers (κ = 0.79-0.86). Prokop et al. [15] reported an overall moderate reliability among their eight readers (κ = 0.47). Bellini et al. [16] registered an overall moderate inter-observer agreement for CO-RADS ratings among 12 readers (κ = 0.43). Fujioka et al. [21] reported substantial to almost perfect levels of inter-observer agreement for the CO-RADS (ICC = 0.800-0.874). Sheha et al. [22] found that CO-RADS reporting exhibited good inter-rater agreement (ICC = 0.75). Abdel-Tawab et al. [23] reported an overall excellent inter-reviewer agreement among their three readers for the CO-RADS (κ = 0.801). Nair et al. [24] reported an overall moderate inter-observer agreement for CO-RADS categories among the six readers (κ = 0.548). Atta et al. [25] reported an overall substantial agreement among three readers (κ = 0.78). Sushentsev et al. [26] demonstrated moderate inter-observer agreement among the three readers for the CO-RADS, with a κ value of 0.51.
In the present study, assessing the agreement between radiologists, a pulmonologist, and house officers indicated moderate to substantial agreement (κ = 0.503-0.692). The highest agreement was observed between the radiologists and one house officer (observer 6). Nonetheless, it is worth noting that, overall, there was a reasonable level of agreement among reviewers with varying levels of experience. The radiologists demonstrated almost perfect inter-observer agreement, while the less experienced house officers showed moderate agreement with the radiologists. These findings suggest that while experience may contribute to higher agreement levels, clinicians with varying levels of experience can still provide meaningful assessments of COVID-19 CT images. This aligns with the findings of Fonseca et al. [3].
Although the present study demonstrated good inter-observer agreement among radiologists, a pulmonologist, and house officers, it is crucial to address the factors contributing to inter-observer variability. Inter-observer variability in the CO-RADS may be attributed to several factors. Firstly, the imaging features of COVID-19 have a wide spectrum, and may overlap with other diagnoses [27]. Secondly, CO-RADS descriptors are subjective and qualitative [13]. Thirdly, many radiologists were initially unfamiliar with CO-RADS [9]. Finally, inherent differences exist among radiologists in diagnostic reasoning and image interpretation [28]. Standardized training in the CO-RADS, calibration exercises, and prudent use of its ordinal scale may help improve agreement and consistency. Nonetheless, inter-observer variability underscores the complexity and challenges of classifying COVID-19 pneumonia [27].
Similar to our findings, many studies agree that the most frequently observed characteristic results of COVID-19 are multifocal bilateral, peripherally located ground-glass appearance, and peripheral consolidation close to the pleural surface or fissure with pleural sparing [5,11,18,29].
A notable finding in our study was the high disagreement between radiologists in fundamental radiological findings. These discrepancies may be attributed to several factors, including varying definitions and interpretation criteria, subjective interpretations, varying experience levels among radiologists, and radiologists potentially being influenced by fatigue when reading large numbers of chest CT scans. Additional factors that may contribute to disagreements include the quality of the scan itself and the possibility of overlooking subtle abnormalities. Going forward, steps could be taken to standardize the evaluation criteria, provide more training opportunities to improve consistency, implement quality checks to catch discrepancies, and ensure radiologists take breaks during long review sessions.
Regarding the CO-RADS, higher proportions of COVID-19 patients were CO-RADS 5 and 1. Sheha et al. [22] reported similar results for RT-PCR-confirmed cases, with CO-RADS categories 1 and 5 showing a higher proportion of positive cases than the CO-RADS 2 category. This also agrees with the results reported by De Jaegere et al. [30].
In summary, the results of this study provide insights into the characteristics observed in chest CT scans, the distribution of CO-RADS categories, and the inter-observer agreement among radiologists, pulmonologists, and house officers. The high inter-observer agreement supports the use of the CO-RADS classification for systematically assessing and communicating the spectrum of COVID-19 findings on CT scans. The system allows for consistent interpretation between radiologists, as well as substantial agreement between specialties. Moreover, continuous refinement and validation of the CO-RADS methodology and descriptors are essential to improve its accuracy and reliability. As new knowledge and evidence emerge regarding the imaging features of COVID-19, updates to the classification system can be made to ensure its relevance and effectiveness.
The current study had certain limitations. Firstly, it focused exclusively on patients with PCR-confirmed COVID-19, and lacked a control group with alternative respiratory diagnoses. Without such a group, we were unable to fully determine the accuracy of the CO-RADS in distinguishing COVID-19 from other respiratory conditions. Further research should evaluate the specificity and predictive values of CO-RADS scoring by including patients with confirmed negative COVID-19 test results. This will allow for a more rigorous assessment of the classification system's ability to correctly diagnose COVID-19 and avoid false positive errors. Secondly, the study was constrained by its retrospective nature, and there was a lack of clinical data regarding the duration of symptoms at the time of CT scanning. Thirdly, the small number of observers in our study may potentially impact the generalizability and reproducibility of our findings. Therefore, future studies with a larger number of observers are required to validate and strengthen our findings. Fourthly, observer experience could be a potential confounding factor in interpreting our study results. However, we attempted to diminish this issue through rigorous training and standardization of data collection procedures for all observers. This included providing clear CO-RADS classification guidelines, establishing criteria for making scoring judgements, and promoting consistency in data collection techniques across radiologists of varying seniority levels. However, some residual variability due to subjective interpretations still likely remained. Further research quantifying the impact of radiologist credentials and experience on CO-RADS scoring reliability would provide additional clarity. Finally, the time interval between the initial PCR tests and CT scans was not strictly defined, which could contribute to discrepancies between PCR-based diagnoses and the observed imaging patterns, particularly at different stages of the disease course. Addressing this limitation in future research is crucial for obtaining a more comprehensive understanding of the relationship between CT imaging and COVID-19 diagnosis.

Conclusions
The present study revealed almost perfect inter-observer agreement when reporting using the CO-RADS among radiologists with varying levels of experience. Although the inter-observer agreement for the CO-RADS classification system for COVID-19 among radiologists was high, it was lower when the radiologists were compared with the pulmonologist and house officers. Radiologists, house officers, and pulmonologists can apply the CO-RADS accurately to promptly identify typical CT imaging features of lung involvement in COVID-19.

Figure 1. Multiple chest CT images of COVID-19 patients. All observers agreed to calculate the CO-RADS for this case as CO-RADS 5. The agreement between radiologists and pulmonologist was based on the following features: typical features of multifocal bilateral GGO and consolidation, peri-fissural nodules, perihilar GGO, and centrilobular GGO. In addition to the previous findings, all radiologists agreed on the typical features of multifocal bilateral GGO, consolidation, organizing pneumonia, and thickened vessels.

Figure 2. Multiple chest CT images of COVID-19 patients. Two senior radiologists, a pulmonologist, and a house officer agreed to calculate the CO-RADS for this case as CO-RADS 4. Agreement was based on the following features: multifocal unilateral GGO and consolidation close to the pleura. One junior radiologist with another house officer agreed to calculate the CO-RADS for this case as CO-RADS 5. The agreement was based on the following features: multifocal unilateral GGO, consolidation close to the pleura, and other side unifocal GGO.

Figure 3. Multiple chest CT images of COVID-19 patients. Senior radiologists and the pulmonologist agreed to calculate the CO-RADS for this case as CO-RADS 4. The agreement was based on the following features: typical features of multifocal unilateral GGO, consolidation, unifocal GGO on the other side, organizing pneumonia without typical features, multiple unilateral centrilobular GGO, and focal unilateral GGO. The junior radiologist and house officers agreed to calculate the CO-RADS for this patient as CO-RADS 5. The agreement was based on the following features: multifocal bilateral GGO and consolidation with organizing pneumonia.

Table 2. Basic characteristics of the participants.

Table 3. Characteristics of chest CT performed by radiologists.

Table 4. Ground glass opacity characteristics of radiologists.

Table 5. Typical features and associated characteristics.

Table 7. CO-RADS among house officers and pulmonologist.

Table 8. Agreement variability between radiologists, pulmonologist, and house officers regarding the CO-RADS.