Adverse Event Burden Score—A Versatile Summary Measure for Cancer Clinical Trials

Simple Summary: In cancer clinical trials, adverse event data are collected after every treatment cycle using the Common Terminology Criteria for Adverse Events, which includes 837 terms. The vast number of potentially reportable adverse events over multiple treatment cycles makes summarizing and analyzing adverse event data challenging. The current standard reporting of adverse event data includes the frequency of the maximum (worst) grade of commonly occurring adverse events. In this article, we propose a single quantitative summary measure that incorporates both the frequency and the severity of multiple adverse events over time: the adverse event burden score. This score is a well-defined measure that enables statistical comparisons analogous to other quantitative endpoints in clinical trials. The adverse event burden score can readily accommodate different trial settings, diseases, and treatments with diverse safety profiles.

Abstract: This article introduces the adverse event (AE) burden score. The AE burden by treatment cycle is a weighted sum of all grades and AEs that the patient experienced in a cycle. The overall AE burden score is the total AE burden the patient experienced across all treatment cycles. AE data from two completed Alliance multi-center, randomized, double-blind, placebo-controlled trials with different AE profiles (NCCTG 97-24-51: 176 patients, and A091105: 83 patients) were utilized for illustration. Results of the AE burden score analyses corroborated the trials' primary results. In 97-24-51, the overall AE burden for patients on the treatment arm was 2.2 points higher than for those on the placebo arm, with a higher AE burden for patients who went off treatment early due to AE. Similarly, in A091105, the overall AE burden was 1.6 points higher on the treatment arm. On the placebo arms, the AE burden remained constant over time in 97-24-51 and increased in later cycles in A091105, the latter likely attributable to the increase in disease morbidity. The AE burden score enables statistical comparisons analogous to other quantitative endpoints in clinical trials, and can readily accommodate different trial settings, diseases, and treatments with diverse AE profiles.


Introduction
The collection of adverse events (AEs) is an important aspect of cancer clinical trials, with the goal of capturing "untoward medical events" that cancer patients experience while enrolled on trials. The reporting of such AEs is often mandated by the National Cancer Institute (NCI) and the US Food

Methods
The primary goal of the methodology that follows is to describe the single quantitative variable that summarizes the frequencies and severities of multiple AEs that may occur over time in cancer patients undergoing therapy. Below, we provide a general definition of AE burden score. A list of all symbols used as mathematical notation in this section is included at the end of the paper for reference.

Defining AE Burden Measures and Associated Analysis Methods
Let $Y_{kg}(t)$ be the indicator that a patient experiences an AE of type $k$ ($k = 1, \ldots, K$) and grade $g$ ($g = 1, \ldots, 5$), where $g = 5$ indicates death due to AE type $k$, at time (or treatment cycle) $t$. That is,

$$Y_{kg}(t) = \begin{cases} 1, & \text{if the patient experiences event } k \text{ of grade } g \text{ at time } t, \\ 0, & \text{otherwise.} \end{cases}$$
Using these indicator variables, multiple measures of AE burden can be defined. Specific examples, including the current convention of occurrence of grade 3 or worse AEs, are given in Appendix A. All of these approaches reduce the multi-dimensional AE information into a categorical summary measure.
We now introduce the framework for computing an AE burden at each time point by computing a weighted sum of $Y_{kg}(t)$ values over all grades and AEs of interest. The definition requires a pre-specified severity weight, $w_{kg}$, for each combination of adverse event $k$ and grade $g$. The values of ($w_{kg}$, $k = 1, \ldots, K$, $g = 1, \ldots, 5$) are subjective, and they serve the dual purposes of quantifying how bad one considers a combination $(k, g)$ to be, and putting the severities of qualitatively different toxicities on the same numerical domain. For example, all values of $w_{kg}$ may be specified on a domain of 0 to 10 or 0 to 100 for convenience. In practice, the severity weights may be elicited from oncologists familiar with the treatments and AEs for a particular type of cancer being studied, and if desired, consensus values may be obtained from a group of stakeholders.

Cancers 2020, 12, 3251
We define the patient AE burden at time $t$ as

$$B(t) = \sum_{k=1}^{K} \sum_{g=1}^{5} w_{kg}\, Y_{kg}(t).$$

The AE burden $B(t)$ is thus a quantitative variable. This is similar to expressions (1) and (2) of Hobbs et al. [5] for quantifying possibly recurrent toxicities. The burden at each time point $t$ can be summarized using the mean (standard deviation) and/or median (range); compared between the treatment arms using a t-test or the Mann-Whitney test; expressed as a function of treatment/dose and patient covariates using a regression model for each $t$; or, more generally, treated as a repeated measure across all time points $t$ and analyzed with a longitudinal regression model.
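As a minimal sketch of the weighted sum described above, the following Python snippet computes the burden by cycle for one patient. It assumes the simple weight $w_{kg} = g$ for illustration; the function names, the record layout `(cycle, ae_type, grade)`, and the AE names and grades are hypothetical and are not taken from the trials.

```python
from collections import defaultdict


def simple_weight(ae_type: str, grade: int) -> float:
    """Severity weight w_kg. Assumption: w_kg = g, independent of AE type k."""
    if grade not in (1, 2, 3, 4, 5):
        raise ValueError(f"grade must be 1-5, got {grade}")
    return float(grade)


def burden_by_cycle(records, weight=simple_weight):
    """Return {cycle t: B(t)} for one patient.

    Each record (cycle, ae_type, grade) marks one Y_kg(t) = 1; all other
    indicators are 0 and contribute nothing to the sum.
    """
    burden = defaultdict(float)
    for cycle, ae_type, grade in records:
        burden[cycle] += weight(ae_type, grade)
    return dict(burden)


# Hypothetical patient: AE names and grades are made up for illustration.
example_records = [
    (1, "nausea", 1),
    (1, "fatigue", 2),
    (2, "fatigue", 1),
    (3, "neutropenia", 3),
    (3, "nausea", 1),
]
example_burden = burden_by_cycle(example_records)
print(example_burden)  # {1: 3.0, 2: 1.0, 3: 4.0}
```

A cycle in which the patient reports no AE does not appear in the result, which corresponds to a burden of 0 for that cycle.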
An overall AE burden score across treatment cycles/times can be defined by summing over $t$ as

$$TB = \sum_{t} u_t\, B(t),$$

where $u_t$ is a pre-specified weight for time $t$ and the sum runs over the treatment cycles. Note that TB is a single quantitative summary reflecting the overall AE burden that a patient experiences across all treatment cycles, henceforth referred to as the overall AE burden score, in contrast with $B(t)$, which is the AE burden at a single treatment cycle $t$. The overall AE burden, TB, can be compared between treatment arms and modeled using parametric or non-parametric methods for continuous outcomes.
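The overall score can be sketched in the same way. The snippet below is illustrative only: it assumes the two time weights discussed in this section (a weight of 1 for a total, or 1/c for an average over the c cycles received), and the function name, argument names, and example values are hypothetical.

```python
def overall_burden(burden_by_cycle: dict, n_cycles: int, average: bool = False) -> float:
    """Overall AE burden TB as a time-weighted sum of the per-cycle burdens B(t).

    burden_by_cycle maps cycle t to B(t); cycles without AEs may be omitted.
    n_cycles is c, the number of treatment cycles the patient received.
    average=False uses u_t = 1 (total); average=True uses u_t = 1/c.
    """
    if n_cycles <= 0:
        raise ValueError("n_cycles must be a positive integer")
    u_t = 1.0 / n_cycles if average else 1.0
    return sum(u_t * b for b in burden_by_cycle.values())


# Hypothetical patient who received 4 cycles and reported no AE in cycle 4.
b_t = {1: 3.0, 2: 1.0, 3: 4.0}
total_tb = overall_burden(b_t, n_cycles=4)
average_tb = overall_burden(b_t, n_cycles=4, average=True)
print(total_tb)    # 8.0
print(average_tb)  # 2.0
```

Passing the number of cycles separately, rather than counting the keys of the dictionary, keeps AE-free cycles in the denominator of the average.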
Some considerations of the AE burden score:

A. Weight functions: The weight functions $u_t$ (weight for time $t$) and $w_{kg}$ (weight for AE type and grade) should be defined a priori in a manner that is relevant to the disease, treatments, and study objectives. Interpretation of the AE burden at time $t$, $B(t)$, and the overall AE burden score, TB, depends on these weight functions. Specifically:
- A simple weight function for $w_{kg}$ can be $w_{kg} = g$, i.e., the weight of an AE equals the grade of the event regardless of $k$. In this case, $B(t)$ is the total of all the grades across all adverse events that a patient experienced at time $t$. A limitation of this definition is that, for example, grade 3 events of two qualitatively different AEs $k$ and $r$ have the same weight, $w_{k3} = w_{r3} = 3$; nevertheless, the definition is easily interpretable and informative in most settings. This weight function also takes advantage of the work that has already gone into the development of the CTCAE, where the AE grades reflect similar severity from one AE type to another. A more complex weight function, if needed, may be defined a priori with consensus from stakeholders, including clinicians, patients, and others.
- For $u_t$, a TB with the weight function $u_t = 1$ equates to the total grades of AEs across all treatment cycles that a patient experienced. A TB with the weight function $u_t = 1/c$, where $c$ is the number of treatment cycles the patient received, equates to the overall AE burden a patient experienced averaged across all treatment cycles.

B. Grade 5 events: With grade 5 indicating the worst outcome of death from the AE, one may argue that a grade 5 event is much more burdensome than lower grade events. Therefore, the weight for grade 5 events can be inflated relative to the weights of grade 1-4 events as deemed appropriate for a specific tumor type and/or trial setting. However, when comparing AE burdens across trials, it is important that the same weight function be used to ensure comparability. In our analyses of the two completed trials presented in the Results section below, we assigned a weight of 10 to grade 5 events, using the weight domain from 0 to 10. This choice of weight for grade 5 events was intended to reflect the increased burden of death due to the AE (it counts as twice its severity), while not being so large that it overshadows the burden of lower grade events.

Table 1 shows the proposed burden scores alongside the maximum grade (current standard reporting of AEs) for 9 patients who had different AE profiles. Although Patients 1-3 experienced only mild AEs (grade 1), their AE burdens, $B(t)$, vary slightly by treatment cycle, and the total overall burdens (TB) for these patients range from 2 to 7 (average overall burden 1 to 1.4). Using the current reporting standard, all 3 patients would have the same maximum AE grade of 1. More noticeably, Patients 4-6 had some moderate AEs (grade 2) with varying $B(t)$ from cycle to cycle, and the total overall burden, TB, for these patients ranges from 12 to 59 (average overall burden 2.4 to 19.7). Using the current reporting standard, these 3 patients would have the same maximum AE grade of 2. Thus, the burden scores incorporate information on the number and the severity of the AEs that a patient reported and provide a much more informative summary of the patient's experience than the simple measure of maximum grade during treatment.
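To illustrate how the choice of grade 5 weight changes a cycle's burden, the sketch below contrasts the plain weight (weight equal to grade for all grades) with the inflated weight of 10 for grade 5 used in the analyses. The helper `make_grade_weight` and the example grades are hypothetical.

```python
def make_grade_weight(grade5_weight: float = 10.0):
    """Build a grade-only weight function: grades 1-4 keep their grade as weight,
    and grade 5 receives grade5_weight (10 by default, on a 0-10 domain)."""

    def weight(grade: int) -> float:
        if grade not in (1, 2, 3, 4, 5):
            raise ValueError(f"grade must be 1-5, got {grade}")
        return float(grade5_weight) if grade == 5 else float(grade)

    return weight


plain = make_grade_weight(grade5_weight=5.0)  # weight equals grade throughout
inflated = make_grade_weight()                # grade 5 counts as 10

# Hypothetical cycle in which a patient had grade 1, 2, and 5 events.
grades_in_cycle = [1, 2, 5]
plain_burden = sum(plain(g) for g in grades_in_cycle)
inflated_burden = sum(inflated(g) for g in grades_in_cycle)
print(plain_burden)     # 8.0
print(inflated_burden)  # 13.0
```

Whichever variant is chosen, it would need to be fixed before the analysis and reused unchanged when burdens are compared across trials.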

The Utility of AE Burden Scores
Note that in Table 1, the overall AE burden averaged across cycles as well as the total overall AE burden are presented. Both measures provide a summary of the overall AE burden. In studies where most patients are expected to stay on treatment for a similar duration, and early dropout is not a major concern, analyses using either the average or the total score should yield similar conclusions. However, in situations where patients are expected to go off treatment at different time points, the choice between the average and the total overall burden requires careful consideration, as they have different implications. The total overall AE burden does not adjust for the number of treatment cycles a patient received, or for the important possibility that treatment was discontinued due to AEs. If the number of treatment cycles is predefined and it is likely that all patients will have the same number of treatment cycles, then the total overall AE burden might be an appropriate measure to use. In situations where patients are expected to go off treatment early if a severe AE occurs, or where treatment is given until disease progression, it is important to account for the number of treatment cycles a patient received. In this case, the average overall AE burden would be a more appropriate measure to use. For example, consider Patients 7-9 (Table 1). Although the total overall AE burden of Patient 7 (27) is only slightly higher than those of Patients 8 and 9 (26 for both), the average overall AE burden of Patient 7, who went off treatment at cycle 2, was 13.5, compared to 6.5 and 5.2 for Patients 8 and 9, who went off treatment at cycles 4 and 5, respectively. Thus, although the total scores were comparable, Patient 7 experienced a much higher average overall AE burden than Patients 8 and 9.
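The arithmetic behind the Patients 7-9 comparison can be reproduced with a few lines of Python. The totals and averages are those quoted in the text; the number of cycles received (2, 4, and 5) is assumed to equal the cycle at which each patient went off treatment, which is consistent with the stated averages. The variable and function names are illustrative.

```python
# (total overall AE burden, number of cycles received) as quoted in the text.
patients = {
    "Patient 7": (27, 2),
    "Patient 8": (26, 4),
    "Patient 9": (26, 5),
}


def average_burden(total: float, n_cycles: int) -> float:
    """Average overall AE burden: the total burden divided by cycles received."""
    return total / n_cycles


averages = {name: average_burden(total, c) for name, (total, c) in patients.items()}
for name, (total, _) in patients.items():
    print(f"{name}: total TB = {total}, average TB = {averages[name]:.1f}")
# Patient 7: total TB = 27, average TB = 13.5
# Patient 8: total TB = 26, average TB = 6.5
# Patient 9: total TB = 26, average TB = 5.2
```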

Application to Clinical Trials
The proposed AE measures were applied to data from two completed clinical trials, NCCTG 97-24-51 [7] and A091105 [8], conducted through the Alliance for Clinical Trials in Oncology, an NCI-funded national clinical trials network. It is important to emphasize that the AE burden analyses of these trials and the results presented here are intended only for illustration of the proposed measure. These results are not intended to replace the results reported in the primary trial publications. The analyses conducted for this manuscript used de-identified data and did not require ethical approval or informed consent. The trials included in this paper were chosen for two reasons. First, both trials were randomized, double-blind, and placebo-controlled, and thus the placebo arms serve as references to evaluate toxicity of the active treatments by comparing the AE burden of the treatment arms to that of the placebo arms. The placebo arms also serve as potential benchmarks when evaluating the AE burden scores for future trials in the same disease and setting. Thus, our first hypothesis was that the AE burden score would be higher in the active treatment arm compared to the placebo arm in each trial. Second, we wanted to illustrate application of the AE burden to trials in different diseases. This diversity of disease types and disease behavior (one an aggressive malignancy and the other a non-malignant, slow-growing cancer) allowed us to assess AE burden when the diseases themselves were generating AEs. Thus, our second hypothesis was that the AE burden score would capture the disease-induced AEs, both in patients with an aggressive malignancy (non-small cell lung cancer) and in patients who had less-aggressive desmoid tumors.

Overview of the Two Selected Trials
The two trials selected for illustrating this AE measure were NCCTG 97-24-51 and A091105. The first trial included 176 patients, with 85 randomized to active agent as part of maintenance therapy with carboxyaminotriazole (CAI), which was proven to be therapeutically ineffective, and 91 to placebo [7]. The second trial included 83 patients, with 47 randomized to sorafenib, which significantly prolonged progression-free survival, and 36 to placebo [8].

Understanding the Overall AE Burden Score, TB, in the Context of Placebo Arms
The AE burden by treatment cycle, $B(t)$, with weight $w_{kg}$ equal to the AE grade for grade 1-4 events and a weight of 10 for grade 5 events, as well as the overall burden score across treatment cycles (TB with weight $u_t = 1/c$), were computed for each patient, as described earlier. To benchmark AE burden scores based on the placebo arm, we assessed scores within each trial by treatment arm. For NCCTG 97-24-51, the overall AE burden score for patients who received CAI was higher than for those who received placebo (median TB: 5 versus 2.8, Wilcoxon p-value <0.0001; Table 2). Of note, patients on the CAI arm went off treatment more quickly than patients on the placebo arm (the number of patients on treatment at each cycle is shown at the bottom of Figure 1A), especially those with high AE burden in the early cycles (as shown in Figure 1B). Nonetheless, the overall AE burden score for patients on the CAI arm was 2.2 points higher than for patients on the placebo arm. Specifically, Figure 1A shows that the AE burden by cycle in the CAI arm was high in the early period (cycle 1 mean = 6.2, 95% confidence interval (CI): 5.2-7.2, n = 85) and decreased over time (cycle 2 mean = 4.8, 95% CI: 3.6-6.1, n = 52; cycle 5 mean = 1.7, 95% CI: 0.84-2.5, n = 16), whereas the AE burden by cycle of patients on the placebo arm remained relatively constant over time (cycle 1 mean = 3.7, 95% CI: 2.7-4.7, n = 91; cycle 2 mean = 3.8, 95% CI: 2.9-4.7, n = 75; cycle 5 mean = 3.6, 95% CI: 2.0-5.1, n = 28).
Similarly, for A091105, the AE burden at each cycle, $B(t)$, and the overall AE burden score, TB, using the same weight functions as described for NCCTG 97-24-51, were computed for each patient. As the majority of patients stayed on treatment for a long time (48% received 15 cycles of treatment), we did not evaluate the pattern of AE burden scores by duration of treatment. Figure 2 shows that the AE burden by cycle was consistently higher in the sorafenib arm compared to the placebo arm; the overall AE burden score across treatment cycles was 1.6 points higher with sorafenib compared to placebo (median TB: 3.6 versus 2.0, Wilcoxon p-value = 0.0042; Table 2). The AE burden by cycle of patients on the placebo arm was relatively flat in the early treatment cycles, but showed an increase in later cycles (Figure 2), with these observations likely attributable to the disease morbidity experienced by patients on the placebo arm. A higher proportion of patients on the sorafenib arm remained on treatment by cycle 15 compared to the placebo arm (53% versus 42%) despite the higher overall AE burden score on the sorafenib arm; this pattern appears to reflect improved disease control.
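The between-arm comparison of TB reported above rests on a rank-based (Wilcoxon/Mann-Whitney) test. As a rough sketch of the statistic underlying such a test, the snippet below computes the Mann-Whitney U statistic by direct pairwise comparison, together with the median TB per arm. The TB values are invented for illustration and are not trial data; the function name is hypothetical; and a p-value would additionally require the exact or approximate null distribution of U, which statistical software (for example `scipy.stats.mannwhitneyu`) provides and which is not reimplemented here.

```python
from statistics import median


def mann_whitney_u(x, y):
    """Mann-Whitney U for sample x versus sample y.

    Counts the pairs (x_i, y_j) with x_i > y_j, with ties counted as 1/2.
    """
    if not x or not y:
        raise ValueError("both samples must be non-empty")
    u = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u += 1.0
            elif xi == yj:
                u += 0.5
    return u


# Invented overall AE burden scores (TB) per patient, for illustration only.
treatment_tb = [5.0, 6.5, 4.0, 7.2, 3.9]
placebo_tb = [2.8, 3.0, 4.0, 1.5]

u_stat = mann_whitney_u(treatment_tb, placebo_tb)
print(u_stat)                # 18.5 out of a maximum of 5 * 4 = 20
print(median(treatment_tb))  # 5.0
print(median(placebo_tb))
```

A value of U close to the product of the two sample sizes indicates that TB on the first arm tends to exceed TB on the second arm.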

Clinically Logical Patterns of the AE Burden by Treatment Cycle, B(t)
Given the high rate of patients going off treatment in NCCTG 97-24-51, we sought to understand the patterns of association between the AE burden by cycle and patients' duration on treatment (as shown in Figure 1B). Only 16 patients remained on the CAI arm by cycle 5. Figure 1B shows a higher AE burden by cycle experienced by patients who went off treatment early, as expected. The AE burden by cycle for the 49 patients who received only one cycle of CAI treatment was higher than for those who went off treatment in subsequent cycles (mean B(t): 7.2 versus 2.6 to 5.2). These findings of high AE burden followed by a rapid withdrawal from the trial are in keeping with what one would expect to see in clinical practice, and they illustrate how the AE burden may be used to quantify this pattern. Furthermore, Figure 3 shows the AE burden by cycle grouped by treatment arm, and by whether patients went off treatment due to AE or by choice versus due to disease progression or other reasons. The small number of patients who went off treatment for other reasons were combined with disease progression. The AE burden by cycle for patients who went off treatment due to AE or by choice in the CAI arm (mean B(t) between 8 and 9) was much higher than the AE burden of patients who went off treatment due to disease progression or other reasons (mean B(t) 2 to 5 for the CAI arm, and mean B(t) 3 to 4 for the placebo arm).

Figure 3. AE burden by treatment cycle, B(t), grouped by treatment arm and by reason patient went off treatment for NCCTG 97-24-51. The AE burden of CAI patients who went off treatment due to adverse events (solid line) was higher than that of patients on either arm who went off for disease progression (dashed lines).

Similar patterns were observed in A091105. Figure 4 shows the AE burden by cycle, grouped by arm and by the reason patients went off treatment. In the sorafenib arm, patients who went off treatment due to AE or by choice had higher AE burden compared to those who remained on treatment. The AE burden of these patients on the sorafenib arm was higher than the AE burden of patients on the placebo arm who went off treatment due to disease progression or other reasons. Again, these analyses illustrate how the AE burden by cycle captures patterns of study withdrawal within the context of desmoid tumors, a non-malignant, less-aggressive disease entity.

Discussion
This paper describes a framework to define an AE burden measure that is simple, yet flexible enough to accommodate different trial settings, diseases, and treatments with diverse adverse event safety profiles. It uses all of the information collected on AEs during the trial, unlike other descriptive summaries that provide only the maximum (worst) grade over time and that include only adverse events relevant to specific categories of AE. In essence, the overall AE burden score provides a comprehensive picture of the AE burden experienced by patients, appears to be more informative than more commonly used approaches, and yet manages to distill complex data into a single score.
Of note, other tools have been devised to summarize adverse event data. For example, Toxicity over Time (ToxT) [9,10] is one such tool, which uses a combination of statistical techniques, ranging from graphical summaries to repeated measures models to survival analysis, to summarize the adverse event profile over the entire course of study. The ToxT provides a major improvement over more conventional adverse event reporting. However, in contrast to the AE burden score, the ToxT requires more extensive synthesis of data and more extensive explanation of the methodology that led to the stated conclusions. Such long explanations preclude succinct reporting of clinical trial data, which typically also needs to cover multiple endpoints, not toxicity alone. Again, an advantage of the AE burden score is its succinctness.
A major advantage of the proposed AE burden score appears to be its functionality within a specific clinical trial. First, because it is quantitative, it allows for comparisons between trial arms, thus allowing investigators to make inferences about safety and tolerability of an investigational intervention based on a single well-defined statistic, as opposed to multiple verbal descriptors of adverse events by category. The use of a single overall AE burden score, as opposed to sorting through several verbal categories of adverse events, facilitates formal comparison between study arms. Second, the AE burden score readily captures AE severity. The score quantifies the magnitude of the difference in AE burden between the arms of the trial, making it an ideal measure of the severity of treatment toxicity. Third, the AE burden score is well defined and easy to understand. It incorporates all reported AE data and distills a sizable amount of data into a single score. Thus, this aspect of the reporting of AE differences between arms can be reduced to a statistical comparison in a manner analogous to comparisons of survival or other similar endpoints in clinical trials [11][12][13].
In addition to within-trial comparisons, the AE burden score could also facilitate comparisons of AEs across trials for the same disease and patient population, as long as the same weighting schemes are used for AEs. Furthermore, a baseline AE burden can be established as an anchor, for example, for defining a threshold for acceptable toxicity for various diseases or for various patient populations. With the establishment of such anchors, the AE burden score can be applied to single arm trials, where the AE burden of an investigational agent can be compared against the established anchor for that disease setting and patient population, again using the same weighting schemes. This approach might also lead to greater safety through real-time monitoring of clinical trials, as adverse event data that approach an a priori established anchor threshold could prompt a more timely review of a trial for safety purposes.
As with any method, appropriate use of AE burden score requires care and consideration. First, although the proposed measure is flexible, it should be clearly defined a priori to ensure its objectivity and its comparability across trials. Since the purpose of this manuscript is to introduce the AE burden score and to illustrate its application, we did not compare the use of different weight functions. The weight function used in our examples is simple and readily interpretable, and it may be applied quite generally in many different settings. However, if a more complex weight function is preferable, the weight functions should be considered carefully and agreed upon by all stakeholders, to provide a clinically meaningful interpretation. Of note, the same weight function should be used if AE burden scores are compared across trials. Second, it is important to note that AE burden score cannot ameliorate the problem of poor quality data. More specifically, with more than 800 AE terms and multiple treatment cycles, not all AEs are reported at every cancer treatment cycle. In well-designed trials, AEs that are clinically relevant are required to be evaluated and reported, commonly called solicited AEs or AEs of special interest. However, missing data or the sporadic capture of AE data can lead to compromised study conclusions, as is the case with any endpoint that relies on only partially collected data. Thus, it remains prudent to closely evaluate the patterns and mechanism of missing data to ensure that appropriate statistical methods are used when analyzing AE burden scores. This is a complex issue that may be explored in future applications.

Conclusions
With all of the foregoing considerations in mind, the proposed AE burden score provides a simple and objective approach to the current reporting and analysis of AE data. It quantifies the magnitude of the AE burden that patients experience during their cancer treatment and should be considered a safety endpoint in cancer clinical trials. This measurement merits further research to determine optimal weight functions, ongoing testing, and further integration into cancer clinical trials over time.

Conflicts of Interest:
The authors have no conflict of interest to disclose.

Appendix A
Given the indicator variable

$$Y_{kg}(t) = \begin{cases} 1, & \text{if the patient experiences event } k \text{ of grade } g \text{ at time } t, \\ 0, & \text{otherwise.} \end{cases}$$

1. The simplest case is to use each $Y_{kg}(t)$ itself as the measure of interest to: (a) Estimate $\pi_{kg}(t) = \Pr(Y_{kg}(t) = 1)$ for each $k$, $g$, and $t$. The probability $\pi_{kg}(t)$ can be estimated using simple proportions, compared between treatment arms or doses using a Chi-square or Fisher's exact test, as appropriate, and can be modelled as a function of treatment/dose and patient covariates using logistic or probit regression models. (b) Consider sequences of $Y_{kg}(t)$ values as $t$ varies for each patient to assess the likelihood that a patient experiences event $k$ of grade $g$ over time, using longitudinal analysis methods, such as marginal logistic regression models with generalized estimating equations to account for within-patient association, [

2. Another possibility is to define $Y_{kg+}(t)$ as the indicator that a patient experiences event $k$ of grade $g$ or worse at time $t$. That is,

$$Y_{kg+}(t) = \begin{cases} 1, & \text{if the patient experiences event } k \text{ of grade} \geq g \text{ at time } t, \\ 0, & \text{otherwise.} \end{cases}$$

Then, $Y_{kg+}(t)$ can be used as an endpoint measure by itself using the same methods listed in (1a) or as a repeated measure using methods listed in (1b). Using $g = 3$ would reflect the convention of considering grade 3 or worse toxicity as nominally "severe".
Note that with a slight abuse of notation, k may be used to represent a set of all adverse events occurring in a body system rather than a single AE type, e.g., k = H for hematologic AEs and k = NH for non-hematologic AEs. Then AE burden measures similar to Examples 1 and 2 above can be defined for the body system.
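As a small illustration of the categorical measures in this appendix, the sketch below computes the "grade g or worse" indicator for each patient at one time point and estimates the corresponding probability by a simple proportion. The function names and the example grades are hypothetical.

```python
def grade_or_worse(grades_at_t, g: int = 3) -> int:
    """Indicator that a patient had event k at grade >= g at time t.

    grades_at_t lists the grades of event k reported for one patient at time t;
    an empty list means the event was not reported.
    """
    return int(any(grade >= g for grade in grades_at_t))


def proportion(indicators) -> float:
    """Simple-proportion estimate of the probability that the indicator is 1."""
    if not indicators:
        raise ValueError("need at least one patient")
    return sum(indicators) / len(indicators)


# Hypothetical grades of one AE type at cycle 1 for five patients.
grades_by_patient = [[1], [3], [], [2, 4], [2]]
indicators = [grade_or_worse(grades, g=3) for grades in grades_by_patient]
print(indicators)              # [0, 1, 0, 1, 0]
print(proportion(indicators))  # 0.4
```

Passing the grades of every AE in a body system instead of a single AE type gives the body-system version of the same indicator.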