Muscle Sonography in Inclusion Body Myositis: A Systematic Review and Meta-Analysis of 944 Measurements

Inclusion body myositis (IBM) is a slowly progressive myopathy that weakens distal and proximal muscles and is diagnosed by clinical and histopathological criteria. Imaging biomarkers are inconsistently used and do not follow internationally standardized criteria. We conducted a systematic review and meta-analysis to investigate the diagnostic value of muscle ultrasound (US) in IBM compared to healthy controls. A systematic search of PubMed/MEDLINE, Scopus and Web of Science was performed. Articles reporting the use of muscle ultrasound in IBM and published in peer-reviewed journals until 11 September 2021 were included. Seven studies were included, with a total of 108 IBM patients and 171 healthy controls. Echogenicity, which was assessed by three studies, was significantly higher in IBM patients than in healthy controls, with a mean difference of 36.55 grey scale values (GSV) in the flexor digitorum profundus (FDP) muscle (95% CI 28.65–44.45, p < 0.001) and of 27.90 GSV in the gastrocnemius (GC) (95% CI 16.32–39.48, p < 0.001). Muscle thickness in the FDP showed no significant difference between the groups. The pooled sensitivity and specificity of US in the differentiation between IBM and the controls were 82% and 98%, respectively, and the area under the curve was 0.612. IBM is a rare disease, which is reflected in the low numbers of patients included in each of the studies, and heterogeneity in the results was high. Nevertheless, the selected studies demonstrated significant differences in echogenicity of the FDP and GC in IBM compared to controls. Further high-quality studies using standardized operating procedures are needed to implement muscle ultrasound in the diagnostic criteria.


Introduction
Inclusion body myositis (IBM) is one of the most common subtypes of idiopathic inflammatory myopathies (IIM) [1], primarily affecting those above 45 years of age. It has a progressive course and affects skeletal muscles with a distinct pattern [2], causing asymmetric muscle weakness [3-5], mainly in the finger flexors and/or quadriceps muscles [6]. This combination of weakness often results in loss of ambulation and independence, as well as the need for assistive devices and increased supportive care over the duration of the disease. Dysphagia is also frequent in patients and increases the risk of aspiration pneumonia [7,8]. The prevalence of IBM has been estimated at 24.8 to 46 patients per 1,000,000, with a male to female ratio of 2:1 [9,10]. Overall life expectancy is still shorter in IBM patients compared to the general population [11]. Although the pathophysiology of IBM has not yet been clearly elucidated [12], several factors, e.g., genetic, aging, immunologic and mitochondrial dysfunction, have been suggested to play a role [13-20].
Nevertheless, the diagnosis of IBM may be challenging, as its presentation may overlap with that of other subtypes of myopathies [21]. Hence, developing alternative and efficient diagnostic techniques has become necessary to ensure early diagnosis of the disease. Muscle sonography has been described as an emerging tool for the diagnosis of many muscle disorders and their characteristic patterns [25,26], as well as for evaluating several inflammatory neuropathies [27,28].
Therefore, the aim of this study was to evaluate the current evidence for muscle ultrasound (US) as a reliable method to investigate and diagnose IBM, and to assess its sensitivity and specificity with a view to improving patient management.

Materials and Methods
We performed a systematic search of the following databases: Medline (through PubMed), Scopus and Web of Science, up until 11 September 2021. We searched these databases using the "title and abstract" domain in order to reach all the studies discussing the use of ultrasound to measure muscle parameters in inclusion body myositis (IBM). The synonyms in our search strategy were retrieved from the MeSH database and were revised by a senior author (R.A). The terms were combined by "OR" and "AND" Boolean operators according to the method described in the Cochrane Handbook for Systematic Reviews of Interventions (Chapter 4.4.4) [29] as follows: (Ultrasonography OR ultrasound OR Ultrasonic OR Echotomography OR Sonography OR Sonographic OR Ultrasonographic OR Echography OR Ultrasonic) AND (Inclusion Body Myositis OR Inclusion Body Myositis OR Inclusion Body Myopathy). Our study was performed in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines [30]. The details of the search process and the included studies are shown in the PRISMA diagram (Figure 1).
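As an illustration only, the Boolean combination described above can be sketched in Python. The helper name `build_query` is hypothetical, the sketch removes repeated synonyms (an assumption that does not change the logic of an OR group), and database-specific field tags for the "title and abstract" domain are deliberately not shown.

```python
def build_query(*concepts):
    """Join synonyms with OR inside each concept, then join concepts with AND."""
    groups = []
    for synonyms in concepts:
        # dict.fromkeys drops repeated terms while keeping the original order
        unique = list(dict.fromkeys(synonyms))
        groups.append("(" + " OR ".join(unique) + ")")
    return " AND ".join(groups)


# Synonym lists taken from the search strategy stated in the text
imaging_terms = [
    "Ultrasonography", "ultrasound", "Ultrasonic", "Echotomography",
    "Sonography", "Sonographic", "Ultrasonographic", "Echography", "Ultrasonic",
]
disease_terms = [
    "Inclusion Body Myositis", "Inclusion Body Myositis", "Inclusion Body Myopathy",
]

query = build_query(imaging_terms, disease_terms)
print(query)
```

Each database applies its own syntax for restricting a query to titles and abstracts, so the resulting string would still need to be adapted per database.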

Selection Criteria
We included the full-text articles published in peer-reviewed journals that measured muscle parameters, such as muscle thickness and muscle echo intensity by ultrasound, in IBM patients diagnosed by muscle biopsy, and compared them to healthy controls. We also included studies that calculated the sensitivity and specificity of ultrasound in the diagnosis of IBM for the diagnostic test accuracy analysis. We excluded case reports, letters to the editor and studies that did not provide numerical data for muscle echo intensity or muscle thickness.

Selection and Screening
Four authors (A.E., A.M., M.B., Y.T.S.) screened the articles by title and abstract. Two authors (A.M., M.B.) then independently screened the full texts, and a third author (Y.T.S.) was consulted in case of disagreement. Two authors (A.M. and M.B.) extracted data regarding patient characteristics, US devices used and outcome measures (muscle echo intensity and muscle thickness).

Quality Assessment
Two authors (M.B. and Y.T.S.) evaluated the quality of the included studies using the National Institutes of Health (NIH) quality assessment tool for observational cohort and cross-sectional studies [31]. This tool consists of 14 questions regarding sample size and selection, and exposure and outcome assessment. Studies scoring ≥9 points were considered of good quality, 5-8 points of fair quality and 1-4 points of poor quality. Diagnostic accuracy studies were assessed using the QUADAS-2 tool.
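The scoring bands above can be written as a small Python sketch. The function name `nih_quality_rating` is hypothetical, and treating a score of 0 as "poor" is an assumption, since the text only defines the bands from 1 point upwards.

```python
def nih_quality_rating(score):
    """Map an NIH tool score (0-14 points) to the quality band used in this review."""
    if not 0 <= score <= 14:
        raise ValueError("the NIH tool has 14 questions, so the score must be 0-14")
    if score >= 9:
        return "good"
    if score >= 5:
        return "fair"
    # 1-4 points per the text; a score of 0 is also mapped here (assumption)
    return "poor"


for s in (9, 7, 3):
    print(s, nih_quality_rating(s))
```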

Statistical Analysis
We conducted two types of analyses: a double-arm meta-analysis and a diagnostic test accuracy analysis. We used Review Manager software version 5.4 (The Cochrane Collaboration, UK) [32] and OpenMetaAnalyst software [33] for the double-arm analysis. This analysis pooled the mean difference between IBM patients and healthy controls in terms of muscle echo intensity and muscle thickness. The DerSimonian-Laird random-effects model was applied to account for heterogeneity between the studies [34]; p < 0.05 was considered statistically significant. Heterogeneity was assessed using the Chi-squared test and the I² statistic; p < 0.05 indicated significant heterogeneity and I² > 50% indicated substantial heterogeneity. Leave-one-out sensitivity analysis was carried out to examine the effect of omitting each study on the overall result. Publication bias could not be assessed using funnel plots due to the small number of included studies [29].
We used Meta-DiSc software [35] for the diagnostic test accuracy analysis to pool the sensitivity and specificity, as well as the likelihood ratios, of ultrasonography in the diagnosis of IBM compared to the other reference tests (muscle biopsy and magnetic resonance imaging (MRI)). A receiver operating characteristic (ROC) curve was created and the area under the curve (AUC) was calculated to evaluate the performance of US as a diagnostic test for IBM.
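For readers unfamiliar with the pooling step, the following is a minimal Python sketch of DerSimonian-Laird random-effects pooling of mean differences, together with Cochran's Q and I². It is not the Review Manager or OpenMetaAnalyst implementation; the function names and the three example studies are hypothetical, and a normal-approximation 95% confidence interval is assumed.

```python
import math


def mean_difference(m1, sd1, n1, m2, sd2, n2):
    """Mean difference between two groups and its variance."""
    return m1 - m2, sd1 ** 2 / n1 + sd2 ** 2 / n2


def dersimonian_laird(effects, variances):
    """Pool per-study effects with the DerSimonian-Laird random-effects model."""
    k = len(effects)
    w = [1.0 / v for v in variances]                       # inverse-variance weights
    sw = sum(w)
    fixed = sum(wi * y for wi, y in zip(w, effects)) / sw  # fixed-effect estimate
    q = sum(wi * (y - fixed) ** 2 for wi, y in zip(w, effects))  # Cochran's Q
    df = k - 1
    c = sw - sum(wi ** 2 for wi in w) / sw
    # Between-study variance, truncated at zero
    tau2 = max(0.0, (q - df) / c) if k > 1 and c > 0 else 0.0
    w_re = [1.0 / (v + tau2) for v in variances]           # random-effects weights
    pooled = sum(wi * y for wi, y in zip(w_re, effects)) / sum(w_re)
    se = math.sqrt(1.0 / sum(w_re))
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0  # I² in percent
    return {
        "pooled": pooled,
        "ci": (pooled - 1.96 * se, pooled + 1.96 * se),
        "tau2": tau2,
        "Q": q,
        "I2": i2,
    }


# Hypothetical per-study mean differences (GSV) and their variances
example = dersimonian_laird([10.0, 20.0, 30.0], [4.0, 4.0, 4.0])
print(example)
```

With these illustrative inputs the studies disagree strongly relative to their variances, so the estimated between-study variance is large and I² exceeds the 50% threshold used above.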
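The quantities pooled in the diagnostic test accuracy analysis can be illustrated with a short Python sketch. It pools by summing the 2×2 cells across studies, which is a simplifying assumption and not necessarily how Meta-DiSc weights studies; the function name and the counts are hypothetical.

```python
def pooled_accuracy(tables):
    """Pool 2x2 tables given as (TP, FP, FN, TN) by summing cells across studies."""
    tp = sum(t[0] for t in tables)
    fp = sum(t[1] for t in tables)
    fn = sum(t[2] for t in tables)
    tn = sum(t[3] for t in tables)
    sens = tp / (tp + fn)                      # proportion of patients detected
    spec = tn / (tn + fp)                      # proportion of controls correctly negative
    lr_pos = sens / (1 - spec) if spec < 1 else float("inf")
    lr_neg = (1 - sens) / spec
    # Diagnostic odds ratio from counts; zero cells would need a continuity correction
    dor = (tp * tn) / (fp * fn) if fp * fn > 0 else float("inf")
    return {"sensitivity": sens, "specificity": spec,
            "LR+": lr_pos, "LR-": lr_neg, "DOR": dor}


# Hypothetical counts for two studies: (TP, FP, FN, TN)
result = pooled_accuracy([(40, 1, 10, 49), (45, 1, 5, 49)])
print(result)
```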

Quality Assessment
The quality of the included studies was assessed using the NIH scale and the QUADAS-2 tool. Using the NIH scale, one study scored 9 and was considered of good quality, while four studies were considered to be of fair quality (score 5-8) (Table 2). For diagnostic test accuracy studies, some concerns regarding the patient selection methodology were found for two studies [36,37], while no concern regarding applicability was found for other domains (Figure 2).

Echogenicity of GC
Ultrasound analysis of 70 IBM patients (129 US measurements) and 98 controls (185 US measurements) in three studies found a mean difference in echogenicity of 27.90 GSV (95% CI 16.32-39.48, p < 0.001) in the GC, denoting a significantly higher value in IBM patients than in healthy controls. Significant heterogeneity was found between study data (I² = 84%, p < 0.001) (Figure 4). One study [38] showed a marked increase in echogenicity in IBM patients.

Muscle Thickness of FDP
Analysis of two studies between 46 IBM cases (87 US measurements) and 82 controls (135 US measurements) revealed no significant difference in muscle thickness between IBM patients and healthy controls (MD = −1.75, 95% CI −5.01 to 1.51, p = 0.29). Heterogeneity was significant and high (Figure S1).
Regarding the muscle thickness of the FDP, the study from Johns Hopkins presented a significant increase in thickness compared to the control group. However, the confidence interval of Radboudumc's data included the point of no difference, denoting no statistically significant difference. Paramalingam et al. compared a group of five IBM patients to a group of age- and sex-matched healthy controls. The results showed no significant association between the muscle thickness of the FDP in IBM patients and in controls [39]. Noto et al., in 2013, also assessed muscle atrophy by measuring the muscle cross-sectional area (CSA). The mean CSA (range) was 80.5 mm² (63.0-117.4) for the FDP muscle and 131.8 mm² (109.0-149.5) for the FCU muscle.

Ultrasound Diagnostic Accuracy
Three studies assessed the diagnostic performance of ultrasound, differentiating between IBM (65 US measurements) and healthy controls (41 US measurements). The pooled sensitivity and specificity of US were 0.82 (95% CI 0.75-0.88) and 0.98 (95% CI 0.89-1.00), respectively (Figures 5 and 6). Between-study heterogeneity was not significant for either measurement. The pooled positive likelihood ratio, negative likelihood ratio and diagnostic odds ratio were 16 (Figures 7-9). As for the summary ROC curve (SROC), the area under the curve (AUC) was 0.612 and the Q* index was 0.5848 (Figure S2).

Sensitivity and Subgroup Analysis
According to the leave-one-out sensitivity analysis, none of the outcomes were affected by the removal of a single study (Figures 10-12). A relatively large, but non-significant, change was found in the muscle thickness difference following removal of the Leeuwenberg (Johns Hopkins) population [38]; the overall mean difference between patients and controls remained non-significant after its removal (Figure 12). We also performed sensitivity analyses to examine the effect of the study design (retrospective or prospective) on the heterogeneity in the meta-analyses. The study by Leeuwenberg et al. [38] was found to have a high impact on heterogeneity, which dropped substantially after eliminating the Leeuwenberg (Radboudumc) population from the FDP echogenicity analysis (Figure S3) or the Leeuwenberg (Johns Hopkins) population from the muscle thickness meta-analysis (Figure S4), with no major impact on the pooled effect after their elimination. However, leaving that study out of the GC echogenicity meta-analysis caused only a minimal change in the pooled effect and heterogeneity [38].
To assess the impact of disease duration on the overall effect and heterogeneity, we conducted a subgroup analysis of the FDP and GC echogenicity meta-analyses by mean duration of disease, either <70 months or ≥70 months. The subgroup analysis showed no statistically significant difference in overall FDP and GC echogenicity between the subgroups (p = 0.77, I² = 0%; p = 0.59, I² = 0%, respectively), and the pooled effects of the FDP and GC echogenicity analyses for each subgroup were consistent with the primary analysis (Figures S5 and S6).
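The leave-one-out procedure can be sketched in Python as follows: each study is omitted in turn and the remaining studies are re-pooled. This is an illustration under stated assumptions rather than the software output reported here; the function names, the study labels and the numerical inputs are hypothetical, and a DerSimonian-Laird random-effects model with a normal-approximation confidence interval is assumed.

```python
import math


def pool_random_effects(effects, variances):
    """DerSimonian-Laird pooled effect with a 95% confidence interval."""
    w = [1.0 / v for v in variances]
    sw = sum(w)
    fixed = sum(wi * y for wi, y in zip(w, effects)) / sw
    q = sum(wi * (y - fixed) ** 2 for wi, y in zip(w, effects))
    c = sw - sum(wi ** 2 for wi in w) / sw
    k = len(effects)
    tau2 = max(0.0, (q - (k - 1)) / c) if k > 1 and c > 0 else 0.0
    w_re = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * y for wi, y in zip(w_re, effects)) / sum(w_re)
    se = math.sqrt(1.0 / sum(w_re))
    return pooled, pooled - 1.96 * se, pooled + 1.96 * se


def leave_one_out(labels, effects, variances):
    """Re-pool the meta-analysis once per study, omitting that study each time."""
    results = {}
    for i, label in enumerate(labels):
        rest_effects = effects[:i] + effects[i + 1:]
        rest_variances = variances[:i] + variances[i + 1:]
        results[label] = pool_random_effects(rest_effects, rest_variances)
    return results


# Hypothetical studies: mean differences and their variances
loo = leave_one_out(["A", "B", "C"], [10.0, 20.0, 30.0], [4.0, 4.0, 4.0])
for omitted, (est, low, high) in loo.items():
    print(f"omitting {omitted}: {est:.2f} (95% CI {low:.2f} to {high:.2f})")
```

An outcome is usually regarded as robust when the pooled estimate and its significance do not change materially whichever study is omitted.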

Heterogeneity
Substantial heterogeneity was observed among the studies with regard to the echogenicity of the FDP and GC muscles (I² = 70% and I² = 84%, respectively). Moreover, muscle thickness demonstrated higher heterogeneity (I² = 91%). Regarding the diagnostic test accuracy meta-analysis, the heterogeneity tests for sensitivity and specificity yielded χ² = 1.7, p = 0.42 and χ² = 2.42, p = 0.49, respectively. A threshold analysis for different cut-off values was performed using Spearman's test in an attempt to identify the reason for heterogeneity, and showed no threshold effect. Only three [36,37,39] of the seven studies [25,36-41] evaluated the inter-rater or intra-rater reliability of the measurements. Publication bias assessment could not be performed due to the small number of included studies.
Following the strategies for addressing heterogeneity described in Part Two of the Cochrane Handbook (Section 9.5.3), we performed a meta-regression analysis for covariates that may have caused heterogeneity. The meta-regression showed that the US device could be a source of heterogeneity in the meta-analysis comparing FDP muscle echogenicity between IBM patients and controls (p-values 0.01 and 0.03), as shown in Table S1. However, the meta-regressions for the other analyses showed that the device was not a source of heterogeneity. Meta-regression may not be conclusive when few studies are included in the meta-analysis.
In terms of the laterality of muscle evaluation, four studies [25,37,38,40] used the average of the bilateral examination. It is worth noting that Leeuwenberg (Johns Hopkins) [38] used the average of the bilateral examination, while Leeuwenberg (Radboudumc) [38] mostly used the average of the bilateral examination and, in cases of unilateral examination, used these results as representative of both sides. Five studies [25,36,38,39,41] used a quantitative scoring method. Three of them [36,39,41] also reported using the semi-quantitative Heckmatt rating scale, one of them in a modified form [36], whereas two studies [37,40] used only the Heckmatt rating scale.
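A threshold effect is commonly examined by correlating, across studies, the logit of sensitivity with the logit of (1 − specificity); a strong positive Spearman correlation suggests that differing cut-off values drive the heterogeneity. The Python sketch below illustrates this under that assumption. The function names and the per-study values are hypothetical and do not reproduce the analysis reported here.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def logit(p):
    return math.log(p / (1 - p))


def threshold_effect_rho(sensitivities, specificities):
    """Spearman correlation between logit(TPR) and logit(FPR) across studies."""
    tpr = [logit(s) for s in sensitivities]
    fpr = [logit(1 - s) for s in specificities]
    return pearson(average_ranks(tpr), average_ranks(fpr))


# Hypothetical studies in which sensitivity rises as specificity falls
rho = threshold_effect_rho([0.70, 0.80, 0.90], [0.95, 0.90, 0.85])
print(rho)
```

Proportions of exactly 0 or 1 would need a continuity correction before the logit transform, and with only a handful of studies the correlation is very imprecise.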

Discussion
The results of our meta-analyses demonstrated increased echogenicity in certain muscle groups of IBM patients compared to controls; echogenicity could therefore be used as a supportive criterion for the early diagnosis of IBM. In detail, the echogenicity in IBM patients was increased by 36.55 GSV (95% CI 28.65-44.45) and 27.90 GSV (95% CI 16.32-39.48) for the FDP and GC muscle groups, respectively, and the test of overall effect was significant for both measurements (p < 0.001). In contrast, muscle thickness in IBM patients showed no difference (−1.75, 95% CI −5.01 to 1.51). No potential outliers were identified by the leave-one-out sensitivity analysis. Heterogeneity testing was significant for echogenicity in the FDP and GC muscle groups and for muscle thickness in the FDP, but not for diagnostic test accuracy. The diagnostic test accuracy meta-analysis revealed a pooled specificity of 0.98 (95% CI 0.89-1.00) and a pooled sensitivity of 0.82 (95% CI 0.75-0.88). This agreed with the data of a previous study that showed significant accuracy of US compared to MRI in detecting muscle abnormalities and fat infiltration in IBM patients [40]. On the basis of a literature comparison, US offers higher sensitivity than anti-cN1A antibody testing, is less invasive than electromyography (EMG) and muscle biopsy, and is less expensive than MRI.
Needle EMG is a well-established and historically used neurophysiological test, for which a sensitivity of 89% was determined for IIM in the work of Bohan and Peter [45]; the EMG abnormalities of IIM have been described in more detail elsewhere [46]. For IBM, the data are scarcer and showed, e.g., spontaneous activity in 62.5% [47] or classical myopathic changes [48] in EMG examinations. Recently, diagnostic accuracy has been evaluated for IIM, demonstrating a sensitivity of 85.2% in IBM [49].
Muscle biopsy is the diagnostic gold standard for identifying the typical pathological features, but it has limited sensitivity in early disease phases, which may necessitate multiple muscle biopsies to establish a diagnosis [50]. However, the full pathological picture (rimmed vacuoles, p62 aggregates, increased major histocompatibility complex class I expression and endomysial T cells) reached a sensitivity of 93% and a specificity of 100% [50].
To increase diagnostic certainty as early as possible, a non-invasive imaging technique could be used as a supportive diagnostic measure that can be applied repeatedly and without potential harm. Both MRI and US fulfil these criteria. Until now, MRI has remained the most widely used technique for muscle imaging, as it can visualize in detail the distribution of affected muscles and surrounding tissues (fascia and skin), disease activity and permanent muscle damage, such as muscle atrophy and/or replacement by fatty tissue [51]. For IBM, the sensitivity of thigh muscle MRI was evaluated in 2002 and found to be 72% [52]. Whole-body MRI also contributed to specific recognition patterns in IBM [10], e.g., with a sensitivity of 80% and specificity of 100% for morphological/degenerative features of the quadriceps femoris muscle [53]. Thus, these two statistical parameters are comparable to those of US. MRI has the advantage of less operator dependency but also some potential drawbacks, such as its high cost, time-consuming nature (whole-body MRI requires approximately 1 h), lack of widespread availability and exclusion of subjects with metal implants, pacemakers or claustrophobia. US is the more cost-effective, faster and more widely available alternative (e.g., in rural areas and for patients with restricted mobility) that can be performed at the bedside, with excellent image resolution and the ability to detect morphological changes in the muscle, such as edema, atrophy and muscle replacement by fibroadipose tissue. The visualization of these muscle characteristics by the US parameters of muscle echogenicity and muscle thickness [36] has led to this method also being used for other myopathies and as a follow-up assessment of disease severity and residual muscle damage [40].
In detail, increased echogenicity is a hallmark feature of muscle replacement by fatty tissue and fibrosis [54] and is widely accepted as an ultrasound parameter in chronic muscle pathologies, such as myositis [55-57]. Three main methods are available to assess muscle echogenicity: (1) a visual qualitative method to determine echogenicity in relation to other tissues; (2) semiquantitative evaluation by a scale [58]; and (3) grayscale-based quantitative measurement. Although these methods have been available for 40 years, different evaluation methods for echogenicity are still used in clinical trials, indicating a further need to harmonize and combine them.
Standardization of muscle thickness is another issue to be solved, as various factors, such as body size of the patient, disease duration, sex and exact anatomical localization, might affect measurements [38,39].
Besides the patient-dependent factors leading to the heterogeneity in our meta-analysis, another major source is the dependency on the ultrasound operator, compounded by the anisotropic nature of muscle, in which even a slight change in the viewing angle alters the echo intensity. To overcome this, Scott et al. recommended taking extra care to reduce probe tilt, especially while evaluating echo-intensity values [57].
Other probable causes of heterogeneity include the variation in sampling, clinical characteristics of IBM patients and the differences in the control groups.
To our knowledge, this is the first systematic review and meta-analysis to assess echogenicity and muscle thickness in IBM and to evaluate the sensitivity and specificity of ultrasonography in supporting the diagnosis of IBM by conducting a diagnostic test accuracy meta-analysis. Although our literature searches were thorough and data extraction was careful, limitations still exist in this review, such as the small number of published papers on IBM. However, as suggested by Callan et al., periodical evaluation of the current evidence is mandatory to increase awareness and improve research methodology [9]. A further meta-analysis should be considered once additional homogeneous studies become available.
In addition, upcoming studies on the diagnostic value of muscle ultrasonography should include larger samples of patients with IBM, as the current evidence on sonography is not yet sufficiently standardized for the diagnosis of IBM. Nevertheless, we propose that the deployment of US, in combination with the standard clinical, histological and serological assessment, and with the long-term goal of its implementation, adds value to education and clinical practice.
Finally, the development of international standard operating procedures, e.g., as developed for measures of disease activity in myositis by iMACS [58], would harmonize the imaging evaluation and provide the basis for longitudinal studies. Advances in artificial intelligence will further facilitate US as a useful diagnostic and follow-up/longitudinal technique [59], which will foster its use in clinical trials [60].
Supplementary Materials: The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/cells11040600/s1: Figure S1: Forest plot of the muscle thickness of the FDP in mm of the included studies; Figure S2: SROC curve showing the AUC was 0.612 and the value with the highest sensitivity and specificity (Q* index) was 0.5848; Figure S3: Sensitivity analysis for FDP echogenicity meta-analysis; Figure S4: Sensitivity analysis for muscle thickness meta-analysis; Figure S5: Subgroup analysis for FDP echogenicity meta-analysis by disease duration; Figure S6: Subgroup analysis for GC echogenicity meta-analysis by disease duration; Table S1: Meta-regression for US device.