Inappropriate Evaluation of Effect Modifications Based on Categorical Outcomes: A Systematic Review of Randomized Controlled Trials

Our meta-epidemiological study aimed to describe the prevalence of reporting effect modifications only on relative scale outcomes and of inappropriately interpreting the coefficients of interaction terms in nonlinear models on categorical outcomes. We targeted articles published in the top 10 high-impact-factor journals between 1 January and 31 December 2021. We included two-arm, parallel-group, interventional superiority randomized controlled trials that evaluated effect modifications on categorical outcomes. The primary outcomes were the prevalence of reporting effect modifications only on relative scale outcomes and that of inappropriately interpreting the coefficients of interaction terms in nonlinear models on categorical outcomes. We included 52 articles, of which 41 (79%) used nonlinear regression to evaluate effect modifications. At least 45/52 articles (87%) reported effect modifications based only on relative scale outcomes, and at least 39/41 (95%) articles inappropriately interpreted the coefficients of interaction terms merely as indices of effect modifications. The quality of the evaluations of effect modifications in nonlinear models on categorical outcomes was relatively low, even in randomized controlled trials published in medical journals with high impact factors. Researchers should report effect modifications on both absolute and relative scale outcomes and avoid interpreting the coefficients of interaction terms in nonlinear regression analyses.


Introduction
Int. J. Environ. Res. Public Health 2022, 19, 15262
In clinical medicine, treatment effects can vary among patients; thus, precision or personalized medicine based on patient characteristics has been advocated [1]. Therefore, the evaluation of effect modifications (i.e., treatment effects that differ depending on other variables) has been used as a means to identify specific patient subgroups that may respond to treatment [2]. However, when evaluating effect modifications, researchers should be cautious about the outcome scales used, as the directions of the results on different scales may not necessarily match. Two types of scales are typically used, namely, the absolute scale (i.e., absolute difference) and the relative scale (i.e., relative difference). As an effect modification on an absolute scale implies a net benefit of different treatment effects within a subgroup, evaluating effect modifications based not only on a relative scale but also on an absolute scale would be better for identifying specific patient subgroups [2,3,4].
For instance, let us consider a situation in which we evaluate an effect modification between a treatment and an independent variable, X, based on the treatment success rate (Supplementary Table S1). The absolute differences in treatment success rates between the treatment and non-treatment groups are 16% − 8% = 8% in the X+ group and 8% − 4% = 4% in the X− group. These results suggest the presence of an effect modification. In contrast, the relative differences are 16%/8% = 2 times in the X+ group and 8%/4% = 2 times in the X− group, which suggests no effect modification. If researchers evaluate effect modifications based only on a relative scale and not on an absolute scale, they cannot correctly identify subgroups.
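The arithmetic of this example can be reproduced with a short Python sketch. The success rates are those quoted above; the variable and function names are our own illustrative choices, not part of the original analysis.

```python
# Treatment success rates from the worked example (treated vs. untreated).
rates = {
    "X+": {"treated": 0.16, "untreated": 0.08},
    "X-": {"treated": 0.08, "untreated": 0.04},
}


def risk_difference(group):
    """Absolute scale: difference in success rates."""
    return group["treated"] - group["untreated"]


def risk_ratio(group):
    """Relative scale: ratio of success rates."""
    return group["treated"] / group["untreated"]


rd = {name: risk_difference(g) for name, g in rates.items()}
rr = {name: risk_ratio(g) for name, g in rates.items()}

# Effect modification is the contrast of the subgroup-specific effects.
rd_contrast = rd["X+"] - rd["X-"]  # additive (absolute) scale
rr_contrast = rr["X+"] / rr["X-"]  # multiplicative (relative) scale

print(f"risk differences: X+={rd['X+']:.2f}, X-={rd['X-']:.2f}, contrast={rd_contrast:.2f}")
print(f"risk ratios: X+={rr['X+']:.2f}, X-={rr['X-']:.2f}, ratio of ratios={rr_contrast:.2f}")
```

A nonzero contrast of risk differences together with a ratio of risk ratios equal to one is exactly the situation in which a relative-scale-only report would miss the subgroup difference.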
The same logic applies to regression. In linear regression analyses, the coefficients are on an absolute scale, whereas in nonlinear regression analyses, they are on a relative scale. For example, a coefficient in a logistic regression indicates the change in the log-odds of the outcome per one-unit change in the corresponding covariate. This follows from the link function (e.g., logit for logistic regression and log for Poisson regression). Furthermore, because of the link function, nonlinear regression analyses show an inherent interaction, i.e., treatment effects are constant on a relative scale (e.g., the log-odds scale for logistic regression), whereas the probability of the outcome changes depending on other variables even without interaction terms [5,6]. For instance, as illustrated in Supplementary Figure S1, a logistic curve shifts upwards when only the intercept increases. The resulting change in probability is small at both extremes of the curve and large in the middle (i.e., compression). This phenomenon is inherently interactive because the difference between the two logistic curves is contingent on the value of an independent variable. It therefore produces interactions between one independent variable and another without any interaction terms. In other words, simply including another independent variable without an interaction term changes the residual variation of the underlying model and shifts the logistic curve in a similar way. This inherent interactive nature was summarized by Mize et al. [7]. Thus, the coefficients of the interaction terms do not necessarily indicate any effect modification. The American Sociological Association does not recommend using the coefficient of interaction terms to evaluate effect modifications in nonlinear regression analyses of categorical outcomes [8].
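The inherent interaction can be illustrated with a minimal Python sketch. It assumes a logistic model with a treatment indicator and one covariate x and deliberately no interaction term; the coefficient values are hypothetical and chosen only for illustration.

```python
import math


def expit(z):
    """Inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients on the log-odds scale; there is NO interaction term.
B0, B_TREAT, B_X = -3.0, 1.0, 1.5


def prob(treat, x):
    """Predicted probability of the outcome."""
    return expit(B0 + B_TREAT * treat + B_X * x)


def odds(p):
    return p / (1.0 - p)


rows = []
for x in (0, 1, 2, 3, 4):
    p0, p1 = prob(0, x), prob(1, x)
    rows.append((x, p1 - p0, odds(p1) / odds(p0)))

for x, risk_diff, odds_ratio in rows:
    print(f"x={x}: risk difference={risk_diff:.3f}, odds ratio={odds_ratio:.3f}")
```

In this sketch the odds ratio for treatment is the same at every value of x, whereas the risk difference is small at the extremes of the curve and largest near the middle, which is the compression described above.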
In economics and sociology, inappropriate evaluations of effect modifications based on categorical outcomes have been demonstrated [5,8]. Most previous studies have not correctly interpreted the coefficients of the interaction terms in nonlinear regression analyses [5,7]. Moreover, in clinical medicine, researchers utilize many categorical outcomes, including hard outcomes (e.g., 30-day mortality rate and readmission) and soft outcomes (e.g., number of readmissions and exacerbations). However, to the best of our knowledge, no study has evaluated whether researchers in clinical medicine appropriately evaluate effect modifications based on categorical outcomes. In this study, we aimed to describe the prevalence of reporting effect modifications based only on relative scale outcomes and that of inappropriately interpreting the coefficients of interaction terms in nonlinear models on categorical outcomes. We targeted randomized controlled trials (RCTs) reported in high-impact-factor medical journals, which are expected to be of high quality.

Study Design
This was a meta-epidemiological study that used previously published RCTs. As it used only open data, informed consent from patients was not required, and there were no ethical concerns. The study protocol was pre-registered on an open platform (https://osf.io/snpj7/; registration date: 22 May 2022). Additionally, we have reported the study according to the guidelines for meta-epidemiological studies [9].

Eligibility Criteria
The inclusion criteria were full-text articles of two-arm, parallel-group, interventional superiority RCTs that conducted statistical analysis for evaluating subgroup-specific effect modifications between treatments and independent variables based on categorical outcomes. We included studies with factorial or cluster designs. In our analysis, we defined nonlinear regressions for categorical outcomes, regardless of the frequentist or Bayesian framework, as follows: (1) generalized linear models such as logistic, Poisson, negative binomial, ordinal, and multinomial logistic regression; (2) generalized linear mixed-effects models such as mixed-effects logistic, Poisson, negative binomial, ordinal, and multinomial logistic regression; and (3) generalized estimating equations. We did not restrict the types of outcomes, such as primary, secondary, and exploratory outcomes. The exclusion criteria were articles of other types of RCTs, such as equivalence, non-inferiority, and cross-over trials and trials with more than two treatment groups, as well as study protocols and animal studies. Moreover, when only a two-by-two table of outcomes was created and researchers did not analyze the effect modification, we excluded the related articles. Additionally, when effect modifications were planned in the study protocols but the results were not described in the main text or supplementary materials, the related articles were excluded. Finally, articles published in non-English languages were excluded.

Search Strategy
We searched for potential RCTs published in high-impact-factor medical journals between 1 January and 31 December 2021. The target journals were the major clinical journals (a category of clinical medicine) ranked by journal impact factor in the 2020 Journal Citation Reports (Table S2).

Study Selection
For screening, A.S. reviewed the titles and abstracts of the selected articles and checked whether they met the inclusion criteria. A.S. then reviewed the full text and online appendixes to determine whether the articles would be finally included. One of the remaining authors (N.Y., N.S., M.O., or H.S.) confirmed the articles, and the two authors made the final decision on inclusion through discussion.

Outcomes
In this study, we set the prevalence of reporting effect modifications based only on relative scale outcomes and that of inappropriately interpreting the coefficient of interaction terms in nonlinear models on a categorical outcome as the primary outcomes. Regarding outcome scale, we assessed whether researchers reported only a difference in relative scale outcomes as an index of an effect modification, or used the difference in absolute scale outcomes as an index of an effect modification. Regarding the interpretation of interaction terms, we assessed whether researchers interpreted the coefficient of interaction terms in nonlinear models only as an index of an effect modification, or interpreted the coefficient as an index of model fit and evaluated effect modification using other metrics, such as marginal effects. Moreover, when researchers evaluated an effect modification, but the methodology was not sufficiently described in the main text, supplementary materials, or the study protocol, we considered the study as reporting an "unclear description." A.S. and one of the coauthors (N.Y., N.S., M.O., or H.S.) assessed the studies independently, and any disagreement was resolved through discussion. When there were multiple articles on a single trial (e.g., salami slicing), we counted the number of articles rather than the number of trials.

Data Items
A.S. recorded the funding sources and the number of citations from the Web of Science on 19 June 2022. A.S. confirmed the presence of funding by for-profit organizations through an internet search, and the specialties (e.g., infectious disease, neurology, and cardiology) and types of intervention (e.g., behavioral intervention, device, medication, and surgery/procedure) through full-text reviews. One of the coauthors (N.Y., N.S., M.O., or H.S.) extracted the following information from the main text, cited protocols, and cited trial registration, and then A.S. confirmed it: methodologies used for evaluating effect modifications; whether any statisticians were included as coauthors; whether the CONSORT statement was cited for reporting; the number of effect modifications evaluated; whether all analyses for effect modifications were pre-specified and described in the main text or the cited protocol; whether multiplicity adjustments (e.g., Bonferroni, Holm, or Hochberg methods) were used to evaluate effect modifications; the presence of spin in the abstract or main text based on the results of the effect modifications; and whether statistically significant results of any effect modifications were reported [11]. "Spin" was defined as authors highlighting the results of secondary or exploratory analyses despite a non-significant result for primary outcomes. Unlike in previous literature, our definition of "spin" did not include a reporting strategy intended to distract the reader from a non-significant result [12]. A.S. and one of the coauthors (N.Y., N.S., M.O., or H.S.) independently assessed the presence of spin and reached a consensus through discussion.

Statistical Analysis
We summarized the study characteristics as the median and interquartile range for continuous variables and as a percentage for categorical variables. We preliminarily determined whether the following factors were associated with insufficient reporting of effect modifications based only on relative scale outcomes: the number of participants, coauthorship of a statistician, citation of the CONSORT statement, the presence of for-profit organizations, the number of evaluated effect modifications, pre-registration of all analyses for effect modifications, the use of multiplicity adjustment, the presence of statistical significance of any effect modification, and spin. A.S. performed the statistical analyses using R software version 4.0.2 (R Foundation for Statistical Computing, Vienna, Austria).

Results
Figure 1 shows the study selection flowchart. After title and abstract screening and 575 subsequent full-text reviews, we finally included 52 articles describing the analyses of effect modifications based on a categorical outcome (Supplementary Tables S3 and S4). The study characteristics are summarized in Table 1. We included articles from various specialties, such as anesthesiology, cardiovascular surgery, cardiology, critical care, emergency medicine, gastroenterology, general surgery, hematology, immunology, infectious disease, internal medicine, nephrology, neurology, obstetrics and gynecology, oncology, otolaryngology, pediatrics, psychology, pulmonology, rheumatology, and urology.

Primary Outcomes
With respect to outcome scale, in 45/52 (87%) of the included articles and 39/41 (95%) of the articles using nonlinear models, researchers reported effect modifications based only on relative scale outcomes to identify patient subgroups. In 4/52 (8%) articles, researchers evaluated effect modifications based on relative and absolute scale outcomes or visual inspection of forest plots [18,20,22,23]. With respect to interpretations of interaction terms, at least 39/41 (95%) articles inappropriately interpreted the coefficient of interaction terms merely as an index of effect modifications. We could not assess the methodology in 2/41 (5%) articles because of unclear descriptions. The individual outcome assessments are summarized in Supplementary Table S3.

Exploratory Analyses
The exploratory analyses are summarized in Table 2. Citation of the CONSORT statement, statistical significance of any effect modification, and spin were observed more often in articles reporting effect modifications based only on relative scale outcomes than in articles that did not rely on relative scale outcomes alone.

Difference between Protocol and Review
In the protocol, we planned to assess the appropriateness based on an outcome scale used for effect modifications. We changed it to describe the methodologies used in evaluating effect modifications in detail to ensure that readers can easily understand how to improve their methodologies. To avoid mis-specifying the study characteristics, we changed the data extraction and exploratory analyses of the protocol. According to the protocol, A.S. solely decided which articles to include. We modified this protocol to ensure that the decision was made by two authors after discussion. We changed the data source of the number of citations from PubMed to Web of Science and that of specialties from Web of Science to full-text reviews. Although we had planned exploratory statistical tests to evaluate the association between the study characteristics and inappropriateness, we did not conduct these statistical tests owing to the small sample size. Instead, we conducted only descriptive analyses.

Discussion
Our study had the following two aims: (1) evaluating the prevalence of reporting effect modifications based only on relative scale outcomes and (2) describing the prevalence of inappropriate interpretation of the coefficients of interaction terms in nonlinear models. To the best of our knowledge, this is the first systematic review to comprehensively analyze the prevalence of reporting effect modifications based only on relative scale outcomes and of inappropriately interpreting the coefficient of interaction terms in nonlinear models on categorical outcomes in medical journals with high impact factors. Over 80% of the RCTs reported effect modifications based only on relative scale outcomes and inappropriately interpreted the coefficients of the interaction terms as indices of effect modifications. Our findings convey two important messages: (1) it would be better for researchers to report effect modifications on both absolute and relative scale outcomes, and (2) researchers should not interpret the coefficient of interaction terms in nonlinear regression for categorical outcomes.
Although absolute scale outcomes, such as absolute risk difference, have been recommended for identifying patient subgroups, most RCTs included in this study used nonlinear regression and evaluated effect modifications based only on relative scale outcomes [2,3]. A previous meta-epidemiological study reviewed subgroup analyses of binary outcomes in articles published in The New England Journal of Medicine and found that only 40% of these articles reported an absolute risk difference [4]. We summarized the detailed methodologies used for evaluating effect modifications, and most subgroup analyses used interaction terms in nonlinear regression. This was the main reason for reporting effect modifications based only on relative scale outcomes. Although interaction terms can improve model fit in nonlinear regression, researchers should be aware that the coefficients of nonlinear regression indicate only relative scale outcomes and that absolute scale outcomes should be reported as well [24].
In addition, a substantial proportion of RCTs inappropriately interpreted interaction terms in nonlinear regression analyses to evaluate effect modifications in the natural metric (e.g., probability). When researchers use regression for categorical outcomes, it may be difficult to use a single absolute scale measure, such as the absolute risk difference, to represent the entire population. Although researchers can evaluate interaction terms as absolute scale outcomes using linear probability models, in many cases, these models do not fit the data. They can produce predicted probabilities outside the range of zero to one, and such predictions are meaningless [25]. The predicted probability of a linear probability model can differ considerably from the true value when the true value is close to zero or one. As our study showed, interaction terms in nonlinear regression analyses have been inappropriately evaluated in high-impact-factor medical journals. Instead of interpreting the coefficient of interaction terms, some researchers have proposed recommendations for the appropriate evaluation of effect modifications based on categorical outcomes in nonlinear models [5,8].
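The out-of-range behavior of a linear probability model can be seen in a minimal Python sketch. The toy data below are hypothetical and are not taken from any of the reviewed trials; the model is an ordinary least squares fit of a binary outcome on a single covariate, computed in closed form.

```python
# Hypothetical toy data: a binary outcome that becomes more likely as x grows.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Closed-form ordinary least squares for one covariate.
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
s_xx = sum((x - mean_x) ** 2 for x in xs)
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x

# Fitted "probabilities" from the linear probability model.
fitted = [intercept + slope * x for x in xs]
out_of_range = [(x, round(p, 3)) for x, p in zip(xs, fitted) if p < 0 or p > 1]

print(f"slope={slope:.3f}, intercept={intercept:.3f}")
print("fitted values outside [0, 1]:", out_of_range)
```

With these toy data, the fitted line falls below zero at the smallest x and above one at the largest x, which is the kind of prediction that has no interpretation as a probability.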
One of them used a marginal effect or difference of two predicted outcomes (treatment vs. not) as an absolute scale outcome [7]. Testing the second difference in the two marginal effects across a subgroup could be an appropriate way to evaluate an effect modification. Researchers should appropriately analyze effect modifications, and reviewers should carefully evaluate them.
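The idea of a second difference can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the procedure of the cited work: the logistic model contains only a treatment indicator, a binary subgroup indicator, and their product; the coefficients are hypothetical; and only point estimates are computed, whereas a formal test would also need a standard error (e.g., via the delta method or bootstrapping).

```python
import math


def expit(z):
    """Inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-z))


def second_difference(b0, b_treat, b_sub, b_inter):
    """Return the treatment effect on the probability scale in each subgroup
    and the difference between those two effects (the second difference)."""

    def p(treat, sub):
        return expit(b0 + b_treat * treat + b_sub * sub + b_inter * treat * sub)

    effect_sub1 = p(1, 1) - p(0, 1)  # first difference in subgroup X+
    effect_sub0 = p(1, 0) - p(0, 0)  # first difference in subgroup X-
    return effect_sub1, effect_sub0, effect_sub1 - effect_sub0


# Hypothetical coefficients on the log-odds scale.
no_interaction = second_difference(-3.0, 1.0, 2.0, 0.0)
negative_interaction = second_difference(-3.0, 1.0, 2.0, -0.5)

for label, (e1, e0, sd) in (
    ("interaction coefficient  0.0", no_interaction),
    ("interaction coefficient -0.5", negative_interaction),
):
    print(f"{label}: effect in X+={e1:.3f}, effect in X-={e0:.3f}, second difference={sd:.3f}")
```

In this sketch the second difference is positive in both scenarios, even though the interaction coefficient is zero in the first and negative in the second, which illustrates why the coefficient alone is a poor index of effect modification on the probability scale.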
Our recommendations can be generalized to other medical journals and study designs, such as observational studies and meta-analyses. Our study involved only articles published in high-impact-factor medical journals. However, these journals employ rigorous statistical reviews of submitted manuscripts, and we expect the prevalence of inappropriate evaluations to be higher in other medical journals. Thus, appropriate methodologies for evaluating effect modifications could be a major statistical issue in clinical journals.
Our descriptive and exploratory analyses suggest that there might be room for improvement in the design and reporting of many RCTs. During trial design, researchers should clearly describe the methodology for subgroup analyses in the protocol, especially how they will evaluate effect modifications, and should use multiplicity adjustment to limit type I (alpha) errors. When reporting the trial, researchers should avoid overstatement of the secondary or exploratory analyses and cite the CONSORT statement to maintain the quality of reporting. Although spin was detected in only a small number of articles in our study, this may be due to the low number of positive results of subgroup analyses. Our study found that these practices have not been fully adopted in RCTs.
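As an illustration of multiplicity adjustment, the following Python sketch implements the Bonferroni and Holm adjustments for a set of interaction p-values. The p-values are hypothetical, and the function names are our own; in practice, an established statistical package would normally be used.

```python
def bonferroni(p_values):
    """Bonferroni-adjusted p-values: multiply each by the number of tests."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def holm(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: an adjusted p-value cannot be smaller than
        # the adjusted value of any smaller raw p-value.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical interaction p-values from four subgroup analyses.
p_values = [0.012, 0.030, 0.040, 0.500]
bonferroni_adjusted = bonferroni(p_values)
holm_adjusted = holm(p_values)

print("Bonferroni:", [round(p, 3) for p in bonferroni_adjusted])
print("Holm:      ", [round(p, 3) for p in holm_adjusted])
```

Holm's method controls the same familywise error rate as Bonferroni while never yielding larger adjusted p-values, which is one reason it is often preferred when several subgroup analyses are planned.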
Our study had several limitations. First, because we could not collect individual patient data, it was not possible to evaluate whether an appropriate evaluation would change the direction of the study results. However, a previous study used three-patient cohort data and showed that different interpretations between absolute and relative scale outcomes are possible [26]. Second, the initial screening of titles and abstracts was conducted by just one author; therefore, selection errors may have occurred. Although we reviewed the full text of 578/686 (84%) articles, the number of articles included in the review may be underestimated. Third, owing to the small number of included studies, we could not identify the factors associated with reporting effect modifications based only on relative scale outcomes. Further meta-epidemiological studies are needed to identify modifiable factors associated with the appropriate evaluation of effect modifications.

Conclusions
In our study, the prevalence of reporting effect modification based only on relative scale outcomes, as well as that of inappropriately interpreting the coefficient of interaction terms in nonlinear models, was quite high in high-impact-factor medical journals evaluating