Reliability and Validity of Single Axial Slice vs. Multiple Slice Quantitative Measurement of the Volume of Effusion-Synovitis on 3T Knee MRI in Knees with Osteoarthritis

Effusion-synovitis (ES) is recognized as a component of osteoarthritis, creating a need for rapid methods to assess ES on MRI. We describe the development and reliability of an efficient single-slice semi-automated quantitative approach to measure ES. We used two samples from the Osteoarthritis Initiative (OAI): 50 randomly selected OAI participants with radiographic osteoarthritis (i.e., Kellgren–Lawrence (KL) grade 2 or 3) and a subset from the Foundation for the National Institutes of Health Osteoarthritis Biomarker study. An experienced musculoskeletal radiologist trained four non-expert readers to use custom semi-automated software to measure ES on a single axial slice and then read scans blinded to prior assessments. The estimated intraclass correlation coefficient (ICC) for intra-reader reliability of the single-slice ES method in the KL 2–3 sample was 0.96 (95% CI: 0.93, 0.97), and for inter-reader reliability, the ICC was 0.90 (95% CI: 0.87, 0.95). The intra-reader mean absolute difference (MAD) was 35 mm³ (95% CI: 28, 44), and the inter-reader MAD was 61 mm³ (95% CI: 48, 76). Our single-slice quantitative knee ES measurement offers a reliable, valid, and efficient surrogate for multi-slice quantitative and semi-quantitative assessment.


Introduction
Osteoarthritis (OA), particularly knee OA, is a highly prevalent joint disease and a leading source of chronic pain, disability, and economic burden [1][2][3]. With the increasing age, obesity, and sedentary lifestyle of the population, the incidence and burden of knee OA will continue to grow [4][5][6]. No disease-modifying treatments for OA have been approved by the US Food and Drug Administration (FDA) or European Medicines Agency (EMA) [7]. There is a critical need to develop and evaluate biomarkers of knee OA for the purposes of identifying appropriate participants for clinical trial enrollment and ascertaining outcomes during trial follow-up.
The development of osteophytes and the degradation of articular cartilage traditionally characterized the pathogenesis of OA, which is now known to be a disease of the whole

Study Design, Setting, Participants
The Osteoarthritis Initiative (OAI) is a longitudinal cohort study of 4796 individuals, aged 45- For the current reliability study, we used two samples from the OAI: (1) 50 randomly selected OAI participants with radiographic osteoarthritis in at least one knee at baseline, identified as Kellgren-Lawrence (KL) grade 2 or 3 [36], with an existing semi-quantitative assessment of effusion-synovitis on MRI from ancillary studies performed within the OAI. This sample included one knee per participant: the knee with the higher KL grade, or a randomly chosen knee if both had the same KL grade. (2) A subset of 301 knees drawn from the 600 participants in the Foundation for the National Institutes of Health (FNIH) Osteoarthritis Biomarkers Consortium, which was designed as a nested case-control study within the OAI [37]. The subset was chosen randomly for prior MRI measurement methodology development and was designed to follow the case and control distributions of the full FNIH sample [38]. The FNIH subset included radiographic and pain progressors (n = 97), radiographic-only progressors (n = 52), pain-only progressors (n = 52), and non-progressors (n = 100) (we had access to the raw data from this study) [38,39].
Four non-expert readers (G.G., C.C., A.V., D.R.) were trained to use custom semi-automated software to measure ES on a single axial slice under the direction of a musculoskeletal (MSK) radiologist with 30 years of experience in musculoskeletal MRI. They then read scans from the KL 2-3 sample (n = 50), blinded to prior assessments (two replicates for three readers, one replicate for one reader). Training a non-expert reader took approximately 2 to 4 h with a training set of 50 scans. One non-expert reader also measured ES on a single axial slice in the FNIH subset (n = 301) to enable comparison with the multi-slice methodology [38].

Radiography Acquisition and Kellgren-Lawrence Grade Assessment
Bilateral posteroanterior fixed-flexion weight-bearing radiographic views were obtained using a SynaFlexer (Synarc, San Francisco, CA, USA), as described in the radiographic procedure manual (https://nda.nih.gov/oai, accessed on 23 November 2022). Expert readers centrally scored the images using the KL grading system [36], with adjudication by an MSK radiologist (more details can be found from the OAI documentation [40]).

MRI Acquisition and Semi-Quantitative Scoring Assessment of Effusion-Synovitis
Non-contrast-enhanced MRI scans were acquired from the 4 OAI sites on identical 3 Tesla (T) systems (Siemens Trio MR, Erlangen, Germany). MSK radiologists centrally reviewed the scans and graded ES using the MRI Osteoarthritis Knee Score (MOAKS) (more details can be found in the OAI documentation [41]). The MOAKS ES grade is a whole-scan, semi-quantitative assessment based on hyperintensity within the articular cavity that represents a composite of effusion and synovial thickening (0: physiologic amount; 1: small, with fluid continuous in the retropatellar space; 2: medium, with slight convexity of the suprapatellar bursa; 3: large, with evidence of capsular distention). The reported intra-rater and inter-rater reliability of MOAKS ES, scored by MSK radiologists, was 0.90 (95% CI: 0.78, 1.00) and 0.72 (95% CI: 0.52, 0.92), respectively, calculated using linear weighted kappa [42].

Semi-Automated Quantitative Assessment of Effusion-Synovitis
The sagittal 3D dual-echo steady-state (DESS) sequence with water excitation was reformatted to axial 3T DESS images that were assessed for ES. A customizable software platform that incorporated and revised aspects of a previous semi-automated approach was developed [38]. The team used commercially available hardware and software tools (i.e., i7 desktop, DICOM-rated 27" 4K monitor, and gaming mice) to build an optimized workspace that could be rapidly deployed for use at remote sites. A single-slice approach was chosen to reduce the time it took to assess each knee for ES. The process was further streamlined by programming a multi-button mouse to perform all 20 functions required by the semi-automated method, so that a reader could work with one hand without looking away from the area of focus (see Figure 1). Our team confined the choice of a single axial 3T DESS slice to the region between the superior and inferior poles of the patella, as previous literature has demonstrated that ES in knee OA is primarily present within this region [43]. Additionally, this kept the methodology consistent with a previously reported semi-automated multi-slice method for segmenting ES [38]. The final system employed several desktop automations that included lossless magnifiers (toggling/adjusting full-picture zoom; continuous focal magnification) and local image capturing to compare original images with saved mappings while revising as needed.
The measurement procedure began with displaying all 60 axial slices in a 10 × 6 tile on a single 27" screen with a 178° viewing angle. After loading all of the axial slices, the superior and inferior poles of the patella were identified and marked on their respective slices. Within the range of the patellar groove, the slice judged to contain the largest area of ES was selected. Leveraging the region-growing algorithms and dynamic, variable grayscale thresholding of the program, the reader would typically settle on a threshold encompassing potential pixels that were consistent with the presence of ES. After selecting the more prominent regions of effusion, the reader would sequentially toggle the sensitivity and adjust finer details. Fluid was captured in the patellofemoral joint (PFJ), along the trochlea and extending into the lateral and medial recesses and posteriorly around the femoral condyles. Lastly, subregional divisions that were anchored to the femur using the MOAKS delineation (anterior vs. posterior; medial vs. lateral) were drawn to segment the ES into one of four quadrants (AM, AL, PM, PL) [42]. After a final review of the enlarged segmented slice, the reader would proceed to the next scan (see Figure 2).
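The thresholded region-growing and quadrant split described above can be illustrated with a minimal sketch. This is not the custom software used in the study: the function names, the 4-connected neighborhood, the toy intensity grid, the quadrant orientation (rows above the center treated as anterior, columns left of the center as medial), and the voxel volume are all assumptions chosen for illustration.

```python
from collections import deque

def region_grow(image, seed, threshold):
    """Return the set of 4-connected pixels reachable from `seed`
    whose intensity is at or above `threshold` (hypothetical helper)."""
    rows, cols = len(image), len(image[0])
    if image[seed[0]][seed[1]] < threshold:
        return set()
    region = {seed}
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and (nr, nc) not in region
                    and image[nr][nc] >= threshold):
                region.add((nr, nc))
                queue.append((nr, nc))
    return region

def quadrant_volumes(region, center, voxel_volume_mm3):
    """Assign each segmented pixel to AM, AL, PM, or PL relative to
    `center` (row, col). Orientation is an assumption: smaller row
    index = anterior, smaller column index = medial."""
    cr, cc = center
    volumes = {"AM": 0.0, "AL": 0.0, "PM": 0.0, "PL": 0.0}
    for r, c in region:
        key = ("A" if r < cr else "P") + ("M" if c < cc else "L")
        volumes[key] += voxel_volume_mm3
    return volumes

# Toy slice: bright values stand in for hyperintense fluid.
image = [
    [0, 0, 0, 0, 0, 0],
    [0, 9, 8, 0, 7, 0],
    [0, 9, 3, 0, 7, 0],
    [0, 8, 9, 9, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
region = region_grow(image, seed=(1, 1), threshold=7)
volumes = quadrant_volumes(region, center=(2, 2), voxel_volume_mm3=0.25)
print(len(region), volumes)
```

Lowering the threshold to 3 in this toy grid adds the dimmer pixel at (2, 2) to the region, which mirrors how a reader adjusting the sensitivity changes what is captured; the bright pixels in column 4 stay excluded because they are not connected to the seed.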

Statistical Analysis
Participant and knee-level characteristics from the OAI baseline visit were summarized for the KL 2-3 sample (n = 50) and the FNIH subset (n = 301). Distributions of single-slice measurements of ES were plotted and summarized in each sample separately.
Reliability was evaluated based on the intraclass correlation coefficient (ICC), defined as the proportion of the total variance in the measurements due to "true" differences between subjects, where the "true" value is the average that would be obtained if measured an infinite number of times. This value reflects the consistency of the measurement, not the accuracy. While the ICC reflects how well subjects can be distinguished from each other despite the presence of measurement error, it is a relative measure that depends on the heterogeneity of the sample; subjects in a heterogeneous population are easier to distinguish than subjects who are similar in terms of the feature being measured. The standard error of measurement (SEM) is an absolute measure of how far apart repeated measurements are for a single subject, expressed in the unit of measurement [44,45]. Intra- and inter-reader ICC, as well as the SEM, were estimated from a linear mixed model, with random effects for knee, reader, and the interaction between knee and reader (see Supplementary Materials for model and the ICC and SEM formulas). Bias-corrected and acceleration-adjusted (BCa) nonparametric bootstrap 95% confidence intervals (95% CI) were generated from 20,000 replicates, with resampling at the knee level [46,47].
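For orientation, one common way to write a model of this kind and the resulting reliability quantities is sketched below. The exact formulas used in the study are given in the Supplementary Materials and are not reproduced here, so the notation and the choice of numerators are assumptions and may differ from the authors' definitions.

```latex
% Assumed notation: knee i, reader j, replicate k.
y_{ijk} = \mu + a_i + b_j + (ab)_{ij} + \varepsilon_{ijk}, \qquad
a_i \sim N(0, \sigma_a^2),\;
b_j \sim N(0, \sigma_b^2),\;
(ab)_{ij} \sim N(0, \sigma_{ab}^2),\;
\varepsilon_{ijk} \sim N(0, \sigma_\varepsilon^2)

% Total variance
\sigma_T^2 = \sigma_a^2 + \sigma_b^2 + \sigma_{ab}^2 + \sigma_\varepsilon^2

% Same reader, repeated reads: only the residual differs between reads
\mathrm{ICC}_{\text{intra}} = \frac{\sigma_a^2 + \sigma_b^2 + \sigma_{ab}^2}{\sigma_T^2},
\qquad
\mathrm{SEM}_{\text{intra}} = \sqrt{\sigma_\varepsilon^2}

% Different readers: reader and interaction terms also differ
\mathrm{ICC}_{\text{inter}} = \frac{\sigma_a^2}{\sigma_T^2},
\qquad
\mathrm{SEM}_{\text{inter}} = \sqrt{\sigma_b^2 + \sigma_{ab}^2 + \sigma_\varepsilon^2}
```

Under this formulation the inter-reader ICC can never exceed the intra-reader ICC, which is consistent with the ordering of the estimates reported in this paper (0.90 vs. 0.96).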

We estimated the intra-reader mean absolute difference (MAD), defined as the mean of the absolute differences between two measurements by the same reader, and the inter-reader MAD, defined as the mean of the absolute differences between any two measurements by different readers. The 95% CIs were generated from bootstrap nonparametric percentiles from 20,000 replicates [48,49].
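The MAD definitions and a percentile bootstrap interval can be sketched as follows. The data layout (one dictionary of reader replicates per knee), the function names, the toy values, and the choice to resample whole knees are assumptions for illustration; the study used 20,000 replicates, whereas the sketch uses fewer to keep it quick.

```python
import random
from itertools import combinations

def intra_reader_mad(knees):
    """Mean absolute difference between replicate reads by the same reader."""
    diffs = [abs(a - b)
             for readers in knees
             for reps in readers.values()
             for a, b in combinations(reps, 2)]
    return sum(diffs) / len(diffs)

def inter_reader_mad(knees):
    """Mean absolute difference between reads of the same knee by different readers."""
    diffs = [abs(a - b)
             for readers in knees
             for v1, v2 in combinations(readers.values(), 2)
             for a in v1 for b in v2]
    return sum(diffs) / len(diffs)

def percentile_ci(knees, stat, n_boot=2000, alpha=0.05, seed=0):
    """Nonparametric percentile bootstrap CI, resampling knees with replacement."""
    rng = random.Random(seed)
    boots = sorted(stat([rng.choice(knees) for _ in knees])
                   for _ in range(n_boot))
    lo = boots[int(n_boot * alpha / 2)]
    hi = boots[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi

# Hypothetical ES values in mm^3; each knee maps reader -> replicate reads.
knees = [
    {"A": [100, 110], "B": [120, 115]},
    {"A": [300, 290], "B": [310]},
    {"A": [50, 55], "B": [70, 60]},
]
print(intra_reader_mad(knees))  # 8.0
print(inter_reader_mad(knees))  # 13.0
print(percentile_ci(knees, inter_reader_mad))
```

Resampling at the knee level keeps all reads of a knee together, so the correlation between readers on the same knee is preserved in each bootstrap sample.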
Concurrent criterion validity of the single-slice method was evaluated based on comparison to the multi-slice method, as well as ES grading by MSK radiologists. We estimated the Spearman correlation between total ES measured on a single axial slice and total ES volume measured with the multi-slice methodology [38] in the FNIH subset (n = 301), as well as the correlation between the single-slice ES measurement and MOAKS ES in the FNIH subset (n = 301) and the KL 2-3 sample (n = 50). BCa bootstrap 95% CIs were generated from 20,000 replicates [50].
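As an illustration of the rank correlation used here, the sketch below computes a Spearman coefficient as the Pearson correlation of average ranks, which handles the ties that an ordinal grade such as MOAKS ES (0 to 3) inevitably produces. The function names and the five toy observations are hypothetical, and the BCa bootstrap interval is not implemented.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman(x, y):
    """Spearman rank correlation with average ranks for ties."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical single-slice ES values (mm^3) and MOAKS ES grades.
single_slice = [120, 300, 80, 450, 200]
moaks = [1, 2, 0, 3, 1]
print(round(spearman(single_slice, moaks), 3))  # 0.975
```

A rank-based coefficient suits this comparison because it only assumes a monotone relationship between a continuous volume and an ordinal grade, not a linear one.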
We compared how much information MOAKS ES, quantitative multi-slice ES measurement, and quantitative single-slice ES measurement each contributed to FNIH case status [37]. We fit a logistic regression model for radiographic and pain progression case status (97 cases vs. 204 controls) with the following predictors: KL grade, BMI, sex, age, MOAKS ES, multi-slice ES, and single-slice ES (Model XABC). To test the null hypothesis that MOAKS ES provides no additional information beyond the two quantitative methods (multi-slice and single-slice), we compared the full model to a reduced model that did not include the MOAKS ES variable (Model XBC) with a likelihood ratio test. To test the null hypothesis that quantitative ES measurement (multi-slice and single-slice) provides no additional information beyond MOAKS ES, we compared the full model to a reduced model that did not include the quantitative ES variables (Model XA). Finally, to compare information provided by the two quantitative ES methods, we compared a model that included multi-slice and single-slice ES measurements (Model XBC) to a reduced model that did not include the multi-slice ES predictor (Model XC) and a reduced model that did not include single-slice ES (Model XB). Using the same approach, we compared the contributions of MOAKS ES and the two quantitative ES methods to radiographic progression case status (149 cases vs. 152 controls) and to pain progression case status (149 cases vs. 152 controls).
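The nested-model comparison can be sketched as a likelihood ratio test: twice the difference in log-likelihood between the full and reduced models is referred to a chi-square distribution with degrees of freedom equal to the number of parameters dropped. The helper names and the log-likelihood values below are hypothetical. The degrees of freedom are also an assumption, not stated in the text: one parameter per quantitative measure and three for MOAKS ES entered as a four-level categorical variable.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with positive
    integer degrees of freedom, using closed-form series."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Q = exp(-x/2) * sum_{k=0}^{df/2-1} (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd df: Q = erfc(sqrt(x/2)) + exp(-x/2) * sum_k (x/2)^(k-1/2) / Gamma(k+1/2)
    term = math.sqrt(half) / math.gamma(1.5)
    total = 0.0
    for k in range(1, (df - 1) // 2 + 1):
        total += term
        term *= half / (k + 0.5)
    return math.erfc(math.sqrt(half)) + math.exp(-half) * total

def lr_test(loglik_full, loglik_reduced, df):
    """Return the LR statistic and its p-value for nested models."""
    stat = 2.0 * (loglik_full - loglik_reduced)
    return stat, chi2_sf(stat, df)

# Hypothetical log-likelihoods: full model XABC vs. reduced model XA,
# dropping the two quantitative ES terms (assumed 2 df).
stat, p = lr_test(loglik_full=-180.0, loglik_reduced=-184.3, df=2)
print(round(stat, 1), round(p, 3))
```

Under these assumed degrees of freedom, an LR statistic of 8.6 on 2 df gives p of about 0.014, and 2.3 gives about 0.51 on 3 df and about 0.13 on 1 df, which is in line with the p-values reported in this paper for statistics of those sizes.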

Results
Subregional and total single-slice ES measurements in the KL 2-3 sample, as well as single-slice measurement of Baker's cysts, averaged across the readers, are shown in Figure 3. Single-slice ES measurements in the FNIH subset are shown by case-control status in Figure S1. A comparison between total ES measured on a single axial slice and total ES volume measured with the multi-slice methodology in the FNIH subset is shown in Figure 4, with an estimated correlation of 0.75 (95% CI: 0.68, 0.81). The single-slice ES measurements are compared with MOAKS ES in Figure 5, with an estimated correlation of 0.62 (95% CI: 0.39, 0.79) in the KL 2-3 sample, and 0.67 (95% CI: 0.59, 0.73) in the FNIH subset. The multi-slice ES volume measurement was compared to MOAKS ES in the FNIH subset previously, with an estimated correlation of 0.74 (95% CI: 0.68, 0.79) (Figure S2).
In the FNIH case-control subset, we found that both quantitative ES methods, multi-slice and single-slice ES measurement, provided information beyond MOAKS ES for radiographic and pain progression case status (LR 8.6, p = 0.01). MOAKS ES did not significantly improve the model fit for radiographic and pain progression case status beyond the quantitative ES methods (LR 2.3, p = 0.51). The single-slice ES measurement provided information beyond the multi-slice measurement (LR 6.9, p < 0.01), while the multi-slice ES measurement was not significant in a model that already included the single-slice ES measurement (LR 2.3, p = 0.13). For radiographic progression case status, we similarly found that the quantitative ES methods provided additional information beyond MOAKS ES (LR 10.6, p < 0.01), with the single-slice ES measurement providing more information than the multi-slice measurement (LR 7.4, p < 0.01). When considering pain progression case status, both the multi-slice and single-slice measurements provided added information (LR 10.0, p < 0.01 and LR 9.90, p < 0.01) (Table 3).
Table 3 notes: All models include the following covariates: baseline KL grade, BMI, sex, and age, indicated as 'X'. * Adequacy index is defined as LR_s/LR_f, where LR_f is the −2 log likelihood ratio statistic for the full set of predictors in model XABC, and LR_s is the −2 log likelihood ratio statistic for the subset of predictors. MOAKS: MRI Osteoarthritis Knee Score.

Discussion
We generated rapid, reproducible ES calculations with strong intra- and inter-reader reliabilities. The single-slice ES measurement had high correlation with the quantitative multi-slice method and with semi-quantitative MOAKS assessed by MSK radiologists in the KL 2-3 and FNIH samples, supporting the concurrent criterion validity.
The reliability of different methods of assessing ES on NCE-MRI has been reported for semi-quantitative and quantitative approaches. The reliability of MOAKS ES readings has been reported as weighted kappa and percent agreement, 0.90 (95% CI: 0.78, 1.00) and 0.90, respectively, for intra-rater reliability and 0.72 (95% CI: 0.52, 0.92) and 0.70, respectively, for inter-rater reliability [42]. Our intra-rater and inter-rater reliability were excellent compared to those reported from the results of random-effects pooling of intra-reader ICCs and inter-reader ICCs from a systematic review and those reported by Maksymowych for KIMRISS and MOAKS status scores [51,52]. Our ICCs were also similar to those reported by Wang et al. for their method of quantitative measurement of ES [53]. Li et al. reported that their semi-automated method of measuring ES had similarly moderate correlation (r = 0.77) with a semi-quantitative method of assessing ES (i.e., WORMS) [34]. Our single-slice ES segmentation methodology had excellent intra-rater and inter-rater reliability across four non-expert readers and was as good as or better than the published reliability results for semi-quantitative and quantitative methods that have been developed to assess ES.
By utilizing non-expert readers, our method draws upon a larger pool of potential readers. Its direct costs were low despite its technological and practical advantages. Moreover, our approach offers technical advancements over previous methods of assessing ES that are constrained to a single threshold and are generally more cumbersome to navigate [38]. The most experienced non-expert reader in this project commented that a single threshold was not sufficient for any slice and, thus, was unlikely to suffice for all of the slices in a knee scan. This sentiment, shared by other readers, indicated that the signal from an individual pixel might not definitively identify its content without contextual information (i.e., the surrounding pixels' intensities and composition of tissues). Two illustrative cases (see Figure S3 in the Supplementary Materials) demonstrate the benefit of not using a set threshold when segmenting NCE-MRIs for ES.
Our single-slice method of assessing ES on MRI was rapid and efficient. Because the readers lacked prior expertise in reading MRIs for ES and in using these novel software tools, there was an associated learning curve. Reading time decreased with experience, to less than two minutes per scan after a brief period of training and practice. This represented a major efficiency improvement when compared to our previous experience of approximately 10 min per scan for a multi-slice approach [38] or 25 min per scan if the reader were able to adjust the threshold in the multi-slice approach while measuring ES across the entire region of interest. The relatively extended amount of time for the multi-slice method was due to the need to selectively and sequentially apply different thresholds to appropriately shade regions of ES across each slice. This issue helped motivate the decision to choose a single slice with the largest area of ES in the PFJ. The move to the single-slice method was also facilitated by the ability to display all 60 uncompressed slices in a 10 × 6 grid on a single 4K monitor screen with a 178° viewing angle. Efficiency was further improved via a suite of additional software functions, automations, and associated programmable mouse button mappings. This allowed the user to rapidly select the maximum ES slice, dynamically increase and decrease the threshold and shade ES with pixel-by-pixel control, draw subregional divisions, review and revise as necessary, and advance between scans. Our approach provided the readers with enough control and flexibility to ensure reliable and valid assessment. Our results suggest that it may be possible to assess effusion volume with fewer images. This potential time-saving approach requires further study [38]. Future considerations include investigating the performance characteristics of MOAKS readings performed on a single slice, which would also lead to significant time savings.
We utilized the FNIH case-control study to compare the semi-automated single-slice and multi-slice measurement of ES, as well as the MOAKS semi-quantitative measure of ES. The two quantitative ES methods, multi-slice and single-slice ES measurement, contributed information beyond MOAKS ES for radiographic and/or pain progression case status. Further, the single-slice measurement was more informative than the multi-slice measurement when considering radiographic progression case status, though not necessarily for pain progression case status. The relationships between inflammation and radiographic and/or pain progression are complex. It may be that certain areas with inflammation, such as within the suprapatellar bursa, that are not captured as well by our single-slice or the multi-slice methods contribute relatively less to the pathogenesis of pain or radiographic progression. The MOAKS semi-quantitative measure of ES takes ES in all articular subregions into account. It yields only one semi-quantitative score for the whole knee and results in a potential diminution of the contribution of ES that is assessed only between the poles of the patella. Roemer et al. found that 11 articular subregions consistently exhibited definite synovitis in a population with mixed radiographic OA severity, and the suprapatellar bursa was the second most common synovitis site (59.5% of knees) [14]. Alternatively, since on NCE-MRI synovitis cannot be distinguished from effusion, the relative contributions of synovitis and effusion in the infrapatellar region as segmented by the single-slice and the multi-slice methods may be different when compared to synovitis and effusion in other subregions of the knee that are encompassed by the MOAKS methodology. Calculation of MOAKS ES scores for individual regions within the knee may be needed to further understand the comparisons between semi-quantitative MOAKS scoring and the quantitative single-slice method. Further work is needed to better understand the contribution of ES in different parts of the knee to the pain and/or radiographic progression of knee OA.
This study had several limitations. ES is distributed heterogeneously within the knee, resulting in different subregional distributions of ES, whereas our proposed methodology focuses on segmenting a single slice, which has the potential to overestimate or underestimate the amount of ES in the whole knee. Moreover, readers varied in which axial MRI slice they selected as having the largest area of ES. The slice selected differed by an average standard deviation of 1.89 slices among the four different readers. Furthermore, our method was semi-automated and performed by non-expert readers, which resulted in a learning curve in terms of efficiency and reader competence. There are various causes of effusion that may not be easily distinguished from synovitis due to OA on NCE-MRI. Our results may not be generalizable to images that are from different vendors, machines, sequences, etc. Further work is needed to determine whether the single-slice method is responsive to change in ES in longitudinal studies.

Conclusions
In conclusion, our more efficient single-slice quantitative measurement of ES using non-expert readers had excellent intra- and inter-reader reliability and good correlation with the quantitative multi-slice measurements and semi-quantitative ES graded by MSK radiologists. The proposed quantitative single-slice method is an inexpensive and valid assessment of ES that can be used to segment large sets of MRI images.
Supplementary Materials: The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/jcm12072691/s1, Equation S1: Linear mixed model with random effects for knee, reader, and the interaction between knee and reader, Formula S1: Intra-reader intraclass correlation coefficient (ICC), Formula S2: Intra-reader standard error of measurement (SEM), Formula S3: Inter-reader intraclass correlation coefficient (ICC), Formula S4: Inter-reader standard error of measurement (SEM), Figure S1: Single-slice ES volume in FNIH sample: Distribution summary (n = 301), Figure S2
Funding: This research was funded by the NIH/National Institute of Arthritis and Musculoskeletal and Skin Diseases (NIAMS) (grant R01-AR071409: Tracking Treatable Tissues: Change in qMRI Biomarkers and Future Cartilage Loss). The OAI is a public-private partnership comprising five contracts (N01-AR-2-2258; N01-AR-2-2259; N01-AR-2-2260; N01-AR-2-2261; N01-AR-2-2262) funded by the National Institutes of Health, a branch of the Department of Health and Human Services, and conducted by the OAI Study Investigators. Private funding partners include Merck Research Laboratories; Novartis Pharmaceuticals Corporation; GlaxoSmithKline; and Pfizer, Inc. Private sector funding for the OAI is managed by the Foundation for the National Institutes of Health. The private funding partners had no role in the study design or in the collection, analysis, or interpretation of the data, the writing of the manuscript, or the decision to submit the manuscript for publication. Publication of this article was not contingent upon approval by the private funding partners. This manuscript was prepared using an OAI public use data set and does not necessarily reflect the opinions or views of the OAI investigators, the NIH, or the private funding partners.
Institutional Review Board Statement: Institutional review board (IRB) approval for the OAI was obtained from all OAI sites. We did not obtain specific IRB approval for our study since the parent study already had approval.

Informed Consent Statement: All OAI study participants provided informed consent.
Data Availability Statement: Data used in the preparation of this article were obtained from the Osteoarthritis Initiative (OAI) database, which is available for public access at https://nda.nih.gov/oai/ (accessed on 23 November 2022). Additional data analyzed during the current study are available from the corresponding author upon reasonable request.