1. Introduction
Immune checkpoint inhibitors (ICIs) have significantly transformed the treatment of metastatic melanoma, offering markedly improved survival compared to conventional chemotherapy, which provides a median overall survival (OS) of 4–12 months [1,2,3]. Multiple clinical trials have confirmed the benefit of both monotherapy and combination therapy, with durable outcomes over extended follow-up periods [1,2,4]. However, survival outcomes vary across studies, influenced by differences in patient populations, treatment protocols, and follow-up durations [2]. For example, the CheckMate 066 study reported a 5-year OS rate of 39% for nivolumab versus 17% for dacarbazine [4]. The CheckMate 067 trial showed that combination therapy outperformed monotherapy, with a median OS of 72.1 months, compared to 36.9 months for nivolumab and 19.9 months for ipilimumab; progression-free survival (PFS) followed a similar pattern: 11.5, 6.9, and 2.9 months, respectively [2]. Notably, 77% of patients treated with the combination and 69% treated with nivolumab were treatment-free at 6.5 years [2], underscoring the potential for durable remission.
Despite these encouraging outcomes, ICIs are associated with a substantial risk of immune-related adverse events (irAEs), where immune responses target healthy tissue. Toxicities such as pneumonitis, colitis, thyroiditis, and pancreatitis may significantly impact quality of life [5,6,7]. Hribernik et al. [8] observed irAEs in 63% of patients on pembrolizumab, with 8.7% experiencing grade 3 or 4 events and 11.6% discontinuing treatment. Similarly, Robert et al. reported an 80.7% irAE rate, including 17.7% with grade 3 or 4 events and 8.8% with discontinuations [9]. In a pivotal trial, 86% of patients receiving monotherapy and 96% receiving combination therapy experienced irAEs [1,2], with grade 3 or 4 events occurring in 28%, 21%, and 59% of patients on ipilimumab, nivolumab, and their combination, respectively; the discontinuation rates were 16%, 12%, and 39%. Such variability highlights the challenges in synthesizing safety data across treatment regimens [8,9,10].
Balancing the therapeutic efficacy of ICIs with their risk of severe side effects requires a systematic approach. A meta-analysis offers a rigorous framework for integrating evidence from diverse clinical trials to support informed decision-making. By combining results from independent studies, a meta-analysis enhances statistical power and precision in estimating treatment effects [11,12,13], which is particularly valuable in immunotherapy research, where study outcomes often differ. It also allows for estimating pooled confidence intervals (CIs) and identifying heterogeneity in treatment effects [13].
The robustness of a meta-analysis depends on how it addresses between-study differences. Fixed-effects models (FEMs) and random-effects models (REMs) are commonly used [11,12,14], with FEMs assuming a shared treatment effect and REMs allowing variability. However, these models may struggle with complex data typical of immunotherapy studies, such as uneven follow-ups or non-normally distributed outcomes. To overcome such limitations, generalized linear mixed models (GLMMs) have been introduced [15,16,17]. GLMMs are especially effective in modeling binary outcomes, survival data, and clustered structures, accommodating study design diversity and providing more robust, clinically relevant summaries in immunotherapy research.
Recent meta-analyses have investigated moderators of immune checkpoint inhibitor efficacy, including sex-based differences [18], the circadian timing of ICI administration [19], clinical variables in resectable non-small-cell lung cancer [20,21], and survival outcomes reconstructed from individual patient data in advanced mucosal melanoma [22]. While these studies have substantially advanced our understanding of treatment-specific effects within defined populations, a comprehensive, model-based comparison of survival, risk, and treatment efficacy outcomes in advanced melanoma remains lacking. The present study aims to address this gap.
In this study, our primary objective is to synthesize and compare the therapeutic profiles of immune checkpoint inhibitor (ICI) monotherapy and combination therapy in patients with advanced melanoma while properly addressing differences across clinical trials. Our analysis focuses on two key outcome categories relevant to treatment decisions: adverse events and survival.
First, for binary outcomes, such as the frequency of immune-related adverse events and response rates, we estimate pooled effect sizes using three statistical approaches: fixed-effects, random-effects, and generalized linear mixed models. This enables a direct comparison of the risks and benefits across treatment types while quantifying uncertainty within each model framework.
Second, for time-dependent outcomes, specifically progression-free survival (PFS) and overall survival (OS), we extract survival data from published Kaplan–Meier curves at multiple time points. These data are analyzed using a generalized linear mixed model approach, which allows us to evaluate how survival trends evolve over time while accounting for variation both within and between studies.
Rather than comparing statistical models directly, we match each modeling approach to the structure of the data, prevalence versus survival, ensuring both statistical validity and clinical interpretability. This strategy provides a more nuanced understanding of treatment effects and supports evidence-based decision-making in immunotherapy for melanoma. To the best of our knowledge, this is the first meta-analysis of metastatic melanoma to jointly model efficacy and toxicity using complementary statistical frameworks and time-resolved survival data. By integrating heterogeneous evidence through robust and clinically aligned modeling strategies, our study establishes a generalizable template for future analyses in oncology where time-to-event outcomes and between-study variation are central.
The remainder of this paper is organized as follows: 
Section 2 details the materials and methods, including study identification, data extraction, and the modeling strategies applied to both binary and longitudinal outcomes, as well as the influence of model choice and the rationale behind model selection. 
Section 3 presents the meta-analytic results for adverse events, therapeutic benefits, and survival outcomes, with emphasis on the influence of model choice. 
Section 4 discusses the methodological and clinical implications of our findings, including limitations and future directions for research in immunotherapy evidence synthesis. 
Section 5 includes the final remarks. Additional information on the included trials and extracted variables is provided in Appendix A.
  2. Materials and Methods
  2.1. Study Identification and Data Extraction
This meta-analysis systematically evaluated both the benefits and risks of immunotherapy in patients with advanced or metastatic melanoma. A systematic literature search was conducted in PubMed, Web of Science, Google Scholar, and ClinicalTrials.gov, covering publications up to December 2024. The search strategy combined terms such as "melanoma", "immune checkpoint inhibitors", "monotherapy", "combination therapy", "overall survival", "progression-free survival", and "immune-related adverse events". A flowchart illustrating the systematic selection process of the studies included in the meta-analysis is presented in Figure 1.
Studies were eligible if they met the following criteria:
- Involved adult patients with advanced or metastatic melanoma; 
- Reported results for ICI monotherapy or a combination therapy; 
- Provided extractable data on adverse events, clinical responses, or survival outcomes; 
- Clearly separated results for monotherapy and combination therapy groups. 
A total of 14 studies were included, contributing 22 different treatment arms (16 monotherapy and 6 combination therapy treatment arms). In this paper, monotherapy refers to treatment with a single ICI agent, nivolumab, ipilimumab, or pembrolizumab, whereas combination therapy involves regimens that include nivolumab plus ipilimumab; the nivolumab–relatlimab combination was not included. The authors participated in the literature search in an organized manner, following the same search principles to ensure consistency. The variables extracted included the study design, sample size, treatment details, adverse event rates, clinical response, and survival data.
Therapeutic benefits were assessed through variables including complete response (CR), partial response (PR), overall response (OR), and stable disease (SD). Risks were examined by analyzing the incidence of any grade of immune-related adverse events (irAEs), with a focus on pneumonitis, colitis, diarrhea, increased aspartate aminotransferase (AST), increased alanine aminotransferase (ALT), and two forms of thyroiditis. As increased AST and ALT levels may indicate underlying immune-mediated hepatitis, a clinically relevant irAE, they were included as separate outcomes in the analysis. Survival outcomes were evaluated using overall survival (OS) and progression-free survival (PFS), extracted across time points from Kaplan–Meier curves. All analyzed data were obtained from the references listed in the bibliography [2,4,8,23,24,25,26,27,28,29,30,31,32,33,34] and are briefly summarized in Appendix A, Table A1 and Table A2.
When survival data were not explicitly reported, values were digitized from Kaplan–Meier curves using WebPlotDigitizer software, v4.5 (Ankit Rohatgi, Pacifica, CA, USA) [35]. All proportions were recalculated if needed, and binomial approximations were used to derive 95% confidence intervals. This systematic review has not been registered.
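As an illustration of the binomial approximation mentioned above, the following minimal sketch computes a normal-approximation (Wald) 95% confidence interval for a single proportion. The Wald form, the function name `wald_interval`, and the example counts are assumptions made for illustration; the text does not specify which binomial approximation was used.

```python
import math


def wald_interval(events, n, z=1.96):
    """Return (proportion, lower, upper) using the normal approximation."""
    if n <= 0 or not 0 <= events <= n:
        raise ValueError("need n > 0 and 0 <= events <= n")
    p = events / n
    se = math.sqrt(p * (1 - p) / n)  # binomial standard error
    # Clamp to [0, 1]: the Wald interval can overshoot for extreme proportions.
    return p, max(0.0, p - z * se), min(1.0, p + z * se)


# Hypothetical arm: 30 events among 100 patients (not data from the included trials).
p, lower, upper = wald_interval(30, 100)
print(f"{p:.3f} ({lower:.3f}, {upper:.3f})")
```

For proportions close to 0 or 1, or for small arms, the Wald interval is known to behave poorly, so an alternative interval may be preferable in those cases.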
  2.2. Meta-Analysis of Adverse Events and Benefits
A central challenge in synthesizing findings from multiple clinical studies is addressing the heterogeneity arising from differences in study populations, treatment protocols, and methodological designs. To account for this variability, we employed a stepwise meta-analytic framework incorporating a fixed-effects model (FEM), a random-effects model (REM), and a generalized linear mixed model (GLMM), allowing us to assess treatment effects under varying assumptions about inter-study variability. All outcomes—including the pooled prevalence of immune-related adverse events (irAEs) and clinical benefit rates—were evaluated separately for monotherapy and combination therapy arms, using the model best suited to the observed heterogeneity.
Heterogeneity was assessed using Cochran’s $Q$ statistic and the $I^2$ statistic, which jointly capture the presence and magnitude of variability beyond chance. Between-study variance ($\tau^2$) was estimated using the DerSimonian–Laird (DL) method, a widely applied, non-iterative approach valued for its simplicity and robustness in accounting for inter-study differences [13,14].
  2.2.1. Heterogeneity Assessment and Model Selection
Cochran’s $Q$ statistic serves as a diagnostic test for the presence of heterogeneity between study arms. It is defined as follows:
$$Q = \sum_{i=1}^{k} w_i \left(p_i - \hat{\theta}\right)^2, \tag{1}$$
where $k$ is the number of study arms, $p_i$ is the proportion of events observed in study arm $i$, and
$$w_i = \frac{1}{v_i} \tag{2}$$
represents the inverse-variance weight. Here, $v_i$ denotes the estimated sampling variance for study arm $i$, which is commonly approximated under a binomial model as follows:
$$v_i = \frac{p_i \left(1 - p_i\right)}{n_i}, \tag{3}$$
where $p_i$ is the observed event proportion, and $n_i$ is the sample size in study arm $i$. The estimate of the pooled fixed effects, denoted by $\hat{\theta}$, is as follows:
$$\hat{\theta} = \frac{\sum_{i=1}^{k} w_i p_i}{\sum_{i=1}^{k} w_i}. \tag{4}$$
To interpret the degree of heterogeneity, we calculated the $I^2$ statistic:
$$I^2 = \max\left(0, \frac{Q - (k - 1)}{Q}\right) \times 100\%. \tag{5}$$
The $I^2$ statistic quantifies the proportion of total variation across study arms that is due to heterogeneity rather than chance, offering a practical measure for evaluating consistency in meta-analysis results. According to [36], $I^2$ values lower than 25% may be interpreted as indicating low heterogeneity, values around 50% as moderate, and values of 75% or higher as substantial to considerable heterogeneity. Following this framework, we adopted a conservative modeling strategy based on the heterogeneity magnitude.
Specifically, an FEM was applied when $I^2 < 25\%$, indicating low variability among studies. For $25\% \le I^2 \le 90\%$, an REM was employed to account for moderate to substantial between-study variance. When heterogeneity reached extreme levels ($I^2 > 90\%$), we utilized a GLMM, which has been recommended in high-variance contexts such as prevalence and proportion meta-analyses [37]. This structured model selection aimed to better align statistical complexity with the level of heterogeneity.
While $I^2$ values exceeding 75% are considered to reflect considerable heterogeneity [36], we adopted a more stringent threshold of 90% when switching to a GLMM. This decision reflects emerging evidence from recent methodological studies indicating that conventional random-effects models often become unstable when $I^2$ surpasses 90%, particularly in the context of meta-analyses of proportions and prevalence. In such scenarios, a GLMM offers increased modeling robustness by working on transformed (e.g., logit) scales and by explicitly accounting for random intercepts and non-constant variance structures.
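The heterogeneity diagnostics and the threshold-based model choice described in this subsection can be sketched as follows. This is a minimal illustration under stated assumptions rather than the implementation used in the analysis: the function names, the example inputs, and the 25% lower cutoff (taken here from the low-heterogeneity band of [36]) are hypothetical, while the 90% cutoff for the GLMM follows the text.

```python
def heterogeneity(props, sizes):
    """Return Cochran's Q and I^2 (in percent) for arm-level proportions."""
    if any(not 0 < p < 1 for p in props):
        raise ValueError("proportions must lie strictly between 0 and 1")
    # Inverse-variance weights under the binomial variance p(1 - p) / n.
    weights = [n / (p * (1 - p)) for p, n in zip(props, sizes)]
    pooled = sum(w * p for w, p in zip(weights, props)) / sum(weights)
    q = sum(w * (p - pooled) ** 2 for w, p in zip(weights, props))
    k = len(props)
    i2 = max(0.0, (q - (k - 1)) / q) * 100 if q > 0 else 0.0
    return q, i2


def choose_model(i2, low=25.0, extreme=90.0):
    """Map I^2 (percent) to a model label; the `low` cutoff is an assumption."""
    if i2 < low:
        return "FEM"
    if i2 <= extreme:
        return "REM"
    return "GLMM"


# Hypothetical arms with very different event proportions.
q, i2 = heterogeneity([0.10, 0.50], [200, 200])
print(round(q, 2), round(i2, 1), choose_model(i2))
```

Proportions of exactly 0 or 1 give a zero binomial variance and therefore an undefined weight, which is why the sketch rejects them instead of applying a continuity correction.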
  2.2.2. Fixed-Effects Model (FEM)
Under the FEM framework, it is assumed that all included studies estimate the same underlying effect size, meaning that the true proportion of events is identical across all studies.
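The pooled estimate of the proportion $\hat{p}_{\mathrm{FEM}}$ is calculated as a weighted average of study-specific estimates, with weights determined by the inverse of the variance. The formula for the pooled estimate of the proportion is as follows:
$$\hat{p}_{\mathrm{FEM}} = \frac{\sum_{i=1}^{k} w_i p_i}{\sum_{i=1}^{k} w_i}, \tag{6}$$
which corresponds to $\hat{\theta}$ from Equation (4).
The variance of the pooled estimate of the proportion is given by
$$\mathrm{Var}\left(\hat{p}_{\mathrm{FEM}}\right) = \frac{1}{\sum_{i=1}^{k} w_i}, \tag{7}$$
and the standard error is
$$\mathrm{SE}\left(\hat{p}_{\mathrm{FEM}}\right) = \sqrt{\frac{1}{\sum_{i=1}^{k} w_i}}. \tag{8}$$
Based on this standard error, a 95% confidence interval (CI) for the pooled estimate of the proportion under the FEM is constructed as follows:
$$\hat{p}_{\mathrm{FEM}} \pm 1.96 \times \mathrm{SE}\left(\hat{p}_{\mathrm{FEM}}\right). \tag{9}$$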
An FEM is most suitable when clinical, methodological, and statistical heterogeneity are minimal or absent. In such settings, an FEM offers more precise estimates due to narrower confidence intervals and is often favored in meta-analyses where study protocols and populations are tightly controlled, such as in oncology. Its simplicity and efficiency make it a logical choice when study conditions are standardized, particularly in early-phase or protocol-driven trials. However, an FEM assumes a common true effect and ignores between-study variability, which can lead to underestimated uncertainty when heterogeneity is present.
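The inverse-variance pooling under the FEM can be sketched in a few lines. The function name `fixed_effect_pool` and the two example arms are hypothetical, and the weights assume the binomial variance $p_i(1 - p_i)/n_i$ described in Section 2.2.1.

```python
import math


def fixed_effect_pool(props, sizes, z=1.96):
    """Return (pooled proportion, standard error, lower, upper) under the FEM."""
    # Each weight is the reciprocal of the binomial variance p(1 - p) / n.
    weights = [n / (p * (1 - p)) for p, n in zip(props, sizes)]
    total = sum(weights)
    pooled = sum(w * p for w, p in zip(weights, props)) / total
    se = math.sqrt(1 / total)
    return pooled, se, pooled - z * se, pooled + z * se


# Two hypothetical arms of 100 patients each.
pooled, se, lower, upper = fixed_effect_pool([0.2, 0.3], [100, 100])
print(f"{pooled:.4f} (SE {se:.4f})")
```

Because the weight grows as the binomial variance shrinks, arms with proportions near 0 or 1 dominate the pooled estimate, which is one reason transformed scales are often preferred for extreme proportions.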
  2.2.3. Random-Effects Model (REM)
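When moderate heterogeneity was present, which means $25\% \le I^2 \le 90\%$, we proceeded to estimate the variance $\tau^2$ between the study arms using the DerSimonian–Laird method:
$$\tau^2 = \max\left(0, \frac{Q - (k - 1)}{C}\right), \tag{10}$$
where
$$C = \sum_{i=1}^{k} w_i - \frac{\sum_{i=1}^{k} w_i^2}{\sum_{i=1}^{k} w_i}. \tag{11}$$
The value of $\tau^2$ quantifies the variance in the true effect sizes between study arms.
The REM accounts for this by modifying the weight of each study arm:
$$w_i^{*} = \frac{1}{v_i + \tau^2}. \tag{12}$$
The pooled estimate under the REM is then
$$\hat{p}_{\mathrm{REM}} = \frac{\sum_{i=1}^{k} w_i^{*} p_i}{\sum_{i=1}^{k} w_i^{*}}. \tag{13}$$
The variance of the pooled estimate of the proportion is given by
$$\mathrm{Var}\left(\hat{p}_{\mathrm{REM}}\right) = \frac{1}{\sum_{i=1}^{k} w_i^{*}}, \tag{14}$$
and the standard error is
$$\mathrm{SE}\left(\hat{p}_{\mathrm{REM}}\right) = \sqrt{\frac{1}{\sum_{i=1}^{k} w_i^{*}}}. \tag{15}$$
Based on this standard error, a 95% CI for the pooled estimate of the proportion under the REM is constructed as follows:
$$\hat{p}_{\mathrm{REM}} \pm 1.96 \times \mathrm{SE}\left(\hat{p}_{\mathrm{REM}}\right). \tag{16}$$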
This methodology provides a flexible and statistically grounded approach to synthesizing evidence when treatment effects vary between study arms, a common occurrence in immunotherapy trials due to diverse patient populations, study protocols, and endpoints.
The decision to employ an REM is guided by both statistical diagnostics and a clinical understanding of immune-related variability. In the context of ICI treatments, individual studies often report wide-ranging outcomes due to differences in tumor burden, immune profiles, and previous therapies. Consequently, an REM not only offers statistical rigor but also better reflects real-world clinical diversity.
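A compact sketch of the DerSimonian–Laird procedure for pooled proportions is given below. It follows the standard moment estimator; the function name `dersimonian_laird_pool` and the example arms are hypothetical and are not taken from the included trials.

```python
import math


def dersimonian_laird_pool(props, sizes, z=1.96):
    """Return (tau2, pooled proportion, standard error, lower, upper)."""
    variances = [p * (1 - p) / n for p, n in zip(props, sizes)]
    w = [1 / v for v in variances]
    sum_w = sum(w)
    fixed = sum(wi * p for wi, p in zip(w, props)) / sum_w
    q = sum(wi * (p - fixed) ** 2 for wi, p in zip(w, props))
    k = len(props)
    # Scaling constant of the moment estimator; zero when there is one arm.
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    # Random-effects weights add tau2 to every within-arm variance.
    w_star = [1 / (v + tau2) for v in variances]
    sum_w_star = sum(w_star)
    pooled = sum(wi * p for wi, p in zip(w_star, props)) / sum_w_star
    se = math.sqrt(1 / sum_w_star)
    return tau2, pooled, se, pooled - z * se, pooled + z * se


# Two hypothetical, strongly discordant arms.
tau2, pooled, se, lower, upper = dersimonian_laird_pool([0.10, 0.50], [200, 200])
print(f"tau2={tau2:.4f}, pooled={pooled:.3f}, SE={se:.3f}")
```

When $\tau^2$ is large relative to the within-arm variances, the random-effects weights become nearly equal, so the pooled estimate approaches the unweighted mean of the arm proportions and the interval widens accordingly.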
  2.2.4. Generalized Linear Mixed Model (GLMM)
To capture between-study arm variability in cases of extreme heterogeneity, we implemented a GLMM applied to logit-transformed event proportions $p_i$ for study arm $i$. In cases of extreme heterogeneity ($I^2 > 90\%$), a GLMM provides a more robust alternative to conventional random-effects models [16]. The transformed variable is
$$y_i = \operatorname{logit}\left(p_i\right) = \ln\left(\frac{p_i}{1 - p_i}\right), \tag{17}$$
and the model equation is
$$y_i = \beta_0 + u_i, \tag{18}$$
where $\beta_0$ is the intercept coefficient, and $u_i \sim N\left(0, \tau^2\right)$ is the random intercept for study arm $i$, where $\tau^2$ is the between-study arm variance from Equation (10).
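After fitting the model parameters with a logit link, predicted values $\hat{y}$ are obtained on the logit scale. These values are transformed back to the original proportion scale using the inverse logit function:
$$\hat{p}_{\mathrm{GLMM}} = \frac{e^{\hat{y}}}{1 + e^{\hat{y}}}. \tag{19}$$
The same inverse logit transformation is applied to the lower and upper bounds of the 95% confidence intervals on the logit scale, denoted as $\hat{y}_{L}$ and $\hat{y}_{U}$, respectively:
$$\hat{p}_{L} = \frac{e^{\hat{y}_{L}}}{1 + e^{\hat{y}_{L}}}, \qquad \hat{p}_{U} = \frac{e^{\hat{y}_{U}}}{1 + e^{\hat{y}_{U}}}. \tag{20}$$
This ensures that both point estimates and confidence intervals are interpretable on the original proportion scale.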
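The back-transformation step can be illustrated with a short sketch. The logit-scale estimate and standard error used here are hypothetical placeholders for values that would come from a fitted model, and the function names are chosen for illustration only.

```python
import math


def logit(p):
    """Log-odds of a proportion strictly between 0 and 1."""
    return math.log(p / (1 - p))


def inv_logit(y):
    """Inverse logit; algebraically equal to exp(y) / (1 + exp(y))."""
    return 1 / (1 + math.exp(-y))


def back_transform(estimate, se, z=1.96):
    """Map a logit-scale estimate and its Wald bounds to the proportion scale."""
    lower, upper = estimate - z * se, estimate + z * se
    return inv_logit(estimate), inv_logit(lower), inv_logit(upper)


# Hypothetical pooled logit of 0.0 with a standard error of 0.5.
p_hat, p_lower, p_upper = back_transform(0.0, 0.5)
print(f"{p_hat:.3f} ({p_lower:.3f}, {p_upper:.3f})")
```

Since the inverse logit is monotone, transforming the interval endpoints preserves coverage, and the resulting interval always lies inside (0, 1), although it is generally asymmetric around the point estimate.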
The model used in this analysis relies on the Lmer function from the pymer4 Python package, which provides access to linear mixed models (LMMs) and generalized linear mixed models (GLMMs) by interfacing with the lme4 package implemented in R.
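For each pooled proportion estimate derived from the respective models (Equations (6), (13), and (19)), we performed a two-proportion Z-test to assess the differences between monotherapy and combination therapy:
$$Z = \frac{\hat{p}_{\mathrm{mono}} - \hat{p}_{\mathrm{comb}}}{\sqrt{\mathrm{SE}_{\mathrm{mono}}^2 + \mathrm{SE}_{\mathrm{comb}}^2}}, \tag{21}$$
where $\hat{p}_{\mathrm{mono}}$ and $\hat{p}_{\mathrm{comb}}$ are the pooled estimates for each group, and $\mathrm{SE}_{\mathrm{mono}}$ and $\mathrm{SE}_{\mathrm{comb}}$ are their standard errors, respectively.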
Forest plots and summary charts were used to visualize the pooled proportions and their respective confidence intervals for both risk and benefit outcomes.
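The comparison of two pooled proportions by a Z-test can be sketched as follows. The two-sided p-value and the example inputs are assumptions for illustration; the text does not state whether one-sided or two-sided tests were used.

```python
import math


def two_proportion_z(p1, se1, p2, se2):
    """Return (Z, two-sided p-value) for the difference of two pooled proportions."""
    z = (p1 - p2) / math.sqrt(se1 ** 2 + se2 ** 2)
    # For a standard normal variable, P(|Z| > z) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical pooled estimates for combination therapy and monotherapy.
z, p_value = two_proportion_z(0.50, 0.03, 0.40, 0.04)
print(f"Z = {z:.2f}, p = {p_value:.4f}")
```

The test treats the two pooled estimates as independent, which is a simplification when the same trial contributes arms to both groups.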
  2.3. Modeling of Overall and Progression-Free Survival
The analysis included 14 published studies, from which 22 distinct treatment arms were defined based on monotherapy or combination immunotherapy groups. Of these, 19 arms reported Kaplan–Meier (KM) survival data suitable for extraction. Time-specific overall survival (OS) proportions were extracted from the published KM curves using WebPlotDigitizer software, v4.5 [35]. For each treatment arm, survival percentages were manually digitized at standard time intervals (typically 12, 24, 36, 48, and 60 months), depending on the reporting granularity of each study. The corresponding number of patients at risk at baseline was obtained from study tables or figure captions. This process yielded a total of 66 time point-level OS observations across the 19 KM-reported arms.
In parallel, progression-free survival (PFS) data were extracted from 16 KM-reported arms using the same methodology, yielding 60 time point-level PFS observations. Both OS and PFS datasets were used to construct a GLMM designed to evaluate longitudinal trends in survival outcomes over time. The model included time (in months), therapy type (monotherapy or combination therapy), and their interaction as fixed effects, with treatment arm specified as the grouping factor and modeled as a random intercept.
The extracted OS and PFS percentages were converted into proportions $p_{ij}$, representing the estimated probability of survival at time $t_j$ for study arm $i$. To stabilize variance and allow for linear modeling, the proportions were transformed using the logit function:
$$y_{ij} = \ln\left(\frac{p_{ij}}{1 - p_{ij}}\right). \tag{22}$$
The binomial standard error for each transformed proportion was approximated as follows:
$$\mathrm{SE}\left(y_{ij}\right) = \sqrt{\frac{1}{n_i \, p_{ij} \left(1 - p_{ij}\right)}}, \tag{23}$$
where $n_i$ is the number of patients in study $i$ contributing to the estimate at time $t_j$. The proportions were truncated to the [0.01, 0.99] interval to avoid infinite logits.
A GLMM was then fitted to the logit-transformed OS and PFS proportions using restricted maximum likelihood estimation (REML). The model included the following:
- A fixed effect for time (in months); 
- A fixed effect for therapy type (coded as 0 = combination and 1 = monotherapy); 
- An interaction term between time and therapy type; 
- A random intercept for study ID to account for between-study variability. 
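The model equation is
$$y_{ij} = \beta_0 + \beta_1 t_j + \beta_2 T_i + \beta_3 \left(t_j \times T_i\right) + u_i + \varepsilon_{ij}, \tag{24}$$
where:
- $y_{ij}$ is the logit-transformed OS or PFS at time $t_j$ for study arm $i$;
- $T_i$ indicates therapy type (0 = combination therapy, 1 = monotherapy);
- $u_i$ is the zero-mean normally distributed random intercept per study arm;
- $\varepsilon_{ij}$ is the zero-mean normally distributed residual error.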
Model parameters were estimated using the statsmodels library, v0.14, in Python (Python Software Foundation, Wilmington, DE, USA). Predictions were back-transformed using the inverse logit function to provide interpretable survival percentages at selected time points. Confidence intervals were calculated on the logit scale using the delta method, based on the full covariance matrix of the fixed effects, and they were subsequently transformed to the probability scale. The GLMM framework was selected based on established methods for modeling proportion data with within-study clustering [15,16,17].
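The prediction step can be sketched as follows: given the four fixed-effect coefficients and their covariance matrix, the linear predictor and its variance $x^{\top} \Sigma x$ are computed on the logit scale and then back-transformed. The coefficient values and covariance matrix below are invented for illustration and are not estimates from this analysis; the function name `predict_survival` is likewise hypothetical.

```python
import math


def inv_logit(y):
    """Inverse logit, mapping the real line to (0, 1)."""
    return 1 / (1 + math.exp(-y))


def predict_survival(beta, cov, months, monotherapy, z=1.96):
    """Return (survival, lower, upper) from fixed effects and their covariance.

    beta: intercept, time slope, therapy effect, time-by-therapy interaction.
    cov:  4 x 4 covariance matrix of beta (nested lists).
    """
    x = [1.0, float(months), float(monotherapy), float(months * monotherapy)]
    eta = sum(b * xi for b, xi in zip(beta, x))
    # Variance of the linear predictor: x' * cov * x.
    var = sum(x[i] * cov[i][j] * x[j] for i in range(4) for j in range(4))
    se = math.sqrt(var)
    return inv_logit(eta), inv_logit(eta - z * se), inv_logit(eta + z * se)


# Hypothetical coefficients and a diagonal covariance matrix.
beta = [1.0, -0.02, -0.5, -0.01]
cov = [[0.04, 0, 0, 0], [0, 0.0001, 0, 0], [0, 0, 0.09, 0], [0, 0, 0, 0.0001]]
for therapy in (0, 1):  # 0 = combination therapy, 1 = monotherapy
    print(therapy, predict_survival(beta, cov, 12, therapy))
```

Only the fixed effects enter this prediction, so the result describes an average arm; the random-intercept variance would have to be added to obtain a prediction interval for a new study arm.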
It is important to note that our GLMM-based survival analysis does not model time-to-event data in the traditional sense, nor does it incorporate censoring. Instead, we extract discrete survival proportions from published KM curves at multiple clinically relevant time points, and we model these logit-transformed values using generalized linear mixed models. This approach enables a meta-analysis of longitudinal survival trends while accounting for study-level heterogeneity via random intercepts. While it cannot reconstruct full survival functions or hazard rates, this method allows for consistent comparisons across treatment arms in the absence of individual patient data.
  2.4. Influence of Model Choice and Rationale for Model Selection
Here, we provide a detailed examination of the three statistical models used in the analysis, aiming to clarify the rationale behind their selection, as well as the specific advantages and limitations that each model presents in different contexts. Particular attention is given to the assumptions that these models make regarding heterogeneity across studies and their ability to accommodate data structures typical of clinical research. One of the central challenges in synthesizing evidence from multiple clinical studies lies in managing heterogeneity stemming from differences in study populations, treatment protocols, and methodological approaches. To address this, we applied a stepwise meta-analytic strategy that integrates three models: an FEM, an REM, and a GLMM. This framework enabled us to explore treatment effects under different assumptions regarding variability between studies and the structure of the underlying data.
The FEM assumes that all included studies estimate the same underlying treatment effect and is most appropriate when heterogeneity is minimal. In contrast, the REM introduces a between-study variance component, yielding more conservative confidence intervals and making it a better fit in the presence of moderate heterogeneity. However, it has been shown that traditional inverse-variance estimators of $\tau^2$ and the overall effect can be biased, particularly in the context of odds ratios. The study in [38] proposed improved methods for $\tau^2$ estimation and overall effect inference, demonstrating in simulation studies that standard REM approaches often yield suboptimal coverage and biased estimates. The GLMM builds on these approaches by incorporating study-specific random effects and directly modeling binary outcomes within a hierarchical structure. As such, the GLMM is particularly well suited for meta-analyses that involve unbalanced study designs, variations in sample sizes, or differing event rates across trials [39].
We assessed all outcomes, including the pooled prevalence of irAEs and clinical benefit rates, separately for monotherapy and combination therapy groups using the most appropriate model, according to the heterogeneity analysis. While the FEM produced the narrowest confidence intervals, it tended to underestimate uncertainty when heterogeneity was present. The REM offered improved interval coverage but was more sensitive to small-study effects. Among the three, the GLMM showed the greatest robustness in estimating both effect sizes and variance components, particularly in settings with substantial design and outcome variability.
In practical terms, the FEM may be appropriate when studies are highly comparable in design and population characteristics. The REM is more suitable when moderate heterogeneity is expected and a sufficient number of studies is available. However, in fields such as oncology where heterogeneity in patient populations, study designs, and outcome measures is common, the GLMM provides a more flexible and reliable framework for inference and is generally the preferred choice. Nevertheless, even the GLMM relies on assumptions that may be restrictive in some applications. For example, time-invariant latent variables may influence outcomes differently across time or interact variably with observed covariates.
  4. Discussion
Our results underscore the methodological value of adapting model complexity to data heterogeneity. For example, PFS and OS were best modeled using a GLMM, as they involve longitudinal proportions influenced by latent clinical and temporal factors. Likewise, pooled estimates of adverse events and clinical benefits demonstrated that precise estimation and group comparison are possible even in highly variable datasets, provided that the correct statistical model is applied. While our findings confirm known clinical trends, the primary aim of this work was to highlight the analytical pathway that supports such inferences. All conclusions were derived via clearly defined probabilistic models, which were fit to appropriately transformed data and subjected to inferential rigor.
Numerous recent meta-analyses have examined the efficacy and safety of immune checkpoint inhibitors (ICIs) in oncology, applying diverse analytical frameworks and focusing on different moderators of treatment response. For instance, Conforti et al. [18] explored sex-based differences in ICI efficacy and found a significantly greater overall survival benefit in male patients (HR 0.72, 95% CI 0.65–0.79) than in females (HR 0.86, 95% CI 0.79–0.93). Landre et al. [19] examined the time of day of ICI infusion as a novel moderator; their meta-analysis showed superior outcomes when ICIs were administered earlier in the day (OS HR 0.50, 95% CI 0.42–0.58; PFS HR 0.51, 95% CI 0.42–0.61), suggesting that biological timing may modulate treatment response, possibly through circadian influences on immune function. Moreover, a comprehensive meta-analysis [20] in resectable non-small-cell lung cancer (NSCLC) assessed the impact of neoadjuvant chemoimmunotherapy versus chemotherapy across surgical, pathological, and efficacy endpoints. Importantly, even patients with low PD-L1 expression (<1%) demonstrated a significant benefit in event-free survival (HR 0.74, 95% CI 0.62–0.89), although no difference was observed in overall survival. Recently, Patel et al. [21] conducted a comparative meta-analysis of randomized trials reporting overall survival in resectable NSCLC to assess the timing of immunotherapy (neoadjuvant, peri-operative, and postoperative), finding no statistically significant OS difference between timing groups, although neoadjuvant chemo-immunotherapy approaches appeared preferable due to a shorter treatment duration and lower costs. These findings underscore the need to account for multiple biological and treatment-related modifiers when assessing the efficacy of immunotherapy.
In contrast to these studies, which focused on specific populations or treatment conditions, our meta-analysis addresses survival and risk outcomes in metastatic melanoma across a range of clinical trials, applying a comparative model-based framework. We account for inter-study heterogeneity and assess the impact of model selection on pooled estimates. By situating our results alongside prior findings that examine sex, tumor biology, timing, and PD-L1 status as potential moderators, we extend the literature by offering a comprehensive, model-driven meta-analytic comparison focused on survival and risk outcomes within a single tumor type. This methodology not only enables a transparent synthesis of heterogeneous evidence but also offers a replicable analytic structure for future comparative meta-analyses in immuno-oncology. A key methodological contribution of our study is the application of a longitudinal generalized linear mixed model framework, which enables time-resolved modeling of survival outcomes. In contrast, previous meta-analyses primarily relied on point estimates or aggregate-level comparisons. For example, the Bayesian network meta-analysis by Silveira Nogueira Lima et al. [40] integrated evidence from immunotherapy and targeted therapy trials to estimate relative treatment rankings but did not capture survival dynamics over time. Teo et al. [22] reconstructed individual patient data from Kaplan–Meier curves, yet their analyses were based on a restricted mean survival time and a restricted mean time lost, without evaluating treatment trajectories. Elias et al. [41] performed subgroup analyses by age using aggregate data but did not incorporate longitudinal modeling. In contrast, our GLMM-based approach allows for the continuous modeling of OS and PFS over multiple time points, offering a more detailed view of treatment efficacy over time. Additionally, by jointly assessing both efficacy and toxicity using an FEM, an REM, and a GLMM, our analysis provides a more comprehensive assessment of the benefit–risk profile of immune checkpoint inhibitors in metastatic melanoma.
To enable a consistent comparison with the results of Teo et al.’s study [22], we focus here on the 12-month OS and PFS rates reported in both studies. In our analysis, combination ICI therapy achieved a 12-month OS of 71.5% and a PFS of 47.9%, compared to 55.7% and 28.3% for monotherapy. These values closely align with those reported in a recent individual patient data meta-analysis of mucosal melanoma, which showed 12-month OS rates of 71.8% for combination therapy and 64.0% for monotherapy, as well as PFS rates of 35.1% and 28.3%, respectively. However, unlike [22], which was limited to short-term outcomes, our work also included long-term follow-up, demonstrating a 5-year OS of 55.7% and 34.3% and a PFS of 39% and 17.2% for combination therapy and monotherapy, respectively. These findings underscore the sustained benefit of combination immunotherapy in advanced melanoma and highlight the importance of long-term survival analyses in evaluating treatment efficacy. The observed differences may also be influenced by biological variation between cutaneous and mucosal melanoma subtypes, with cutaneous forms generally exhibiting greater immunogenicity and response to ICIs.
  Limitations and Future Directions
Several limitations of this meta-analysis should be acknowledged. First, the available data exhibited a notable imbalance in study representation between study arms, with a greater number of trials reporting outcomes for monotherapy compared to combination immunotherapy. This discrepancy may introduce asymmetry in the precision of pooled estimates and may affect the robustness of comparative analyses. Second, PFS and OS were not reported uniformly across studies, either in terms of the follow-up duration or the specific time points at which survival probabilities were extracted. While we addressed this by harmonizing data as proportions at available time points and applying a GLMM to account for study-level heterogeneity, the lack of standardized survival intervals limits the temporal comparability across studies. Third, the analysis of adverse events was constrained by the absence of temporal data on irAE onset. Reported AE rates were cumulative and did not indicate whether events occurred before, during, or after the time points used to assess OS. This temporal ambiguity constrains the interpretability of any analysis attempting to correlate toxicity with survival probability. Fourth, we grouped anti-PD-1 and anti-CTLA-4 agents under a single “monotherapy” category despite known mechanistic and toxicity differences. This decision was based on consistent adverse event patterns across monotherapy studies in our dataset, as well as prior literature that treated single-agent immune checkpoint inhibitors collectively [29,30]. While this grouping enabled broader comparisons with combination therapy, it may have obscured agent-specific safety signals and should be interpreted with caution. Fifth, we did not stratify analyses based on treatment line, as this information was inconsistently reported across studies. Although treatment setting can influence both efficacy and toxicity, introducing this criterion would have significantly reduced the number of eligible studies and compromised the statistical power of our comparisons. This limitation reflects the trade-off between methodological granularity and dataset comprehensiveness.
Finally, a methodological consideration in our survival analysis is the use of a GLMM to model survival proportions over time. This approach allowed us to perform a longitudinal meta-analysis across heterogeneous studies using aggregate data extracted from KM curves. While a GLMM does not incorporate right-censoring and relies on logit-transformed survival proportions rather than individual time-to-event data, it offers a flexible framework for capturing between-study variation and temporal trends. These features made it particularly suitable for the structure of our dataset, where individual-level data were unavailable. We acknowledge that a GLMM assumes smooth survival trajectories between time points and may be less suited to settings with strong time-varying hazards. However, within the context of our objectives and data availability, this method provided a robust and interpretable model for comparing survival outcomes.
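To make the idea of modeling logit-transformed survival proportions over time more concrete, the following is a minimal sketch and not the model fitted in this study. It swaps the GLMM for a deliberately simplified two-stage stand-in: an ordinary least-squares line is fitted to logit(S(t)) within each study, and the study-level intercepts and slopes are then averaged without weighting. The study names, time points, and survival proportions are hypothetical values invented for illustration, and, like the approach described above, the sketch ignores right-censoring.

```python
import math
from statistics import mean, variance


def logit(p):
    """Map a survival proportion in (0, 1) to the log-odds scale."""
    return math.log(p / (1.0 - p))


def inv_logit(x):
    """Map a log-odds value back to a proportion."""
    return 1.0 / (1.0 + math.exp(-x))


def fit_line(times, survival):
    """Least-squares fit of logit(S) on time for one study arm.

    Returns (intercept, slope) on the logit scale.
    """
    y = [logit(s) for s in survival]
    t_bar, y_bar = mean(times), mean(y)
    sxx = sum((t - t_bar) ** 2 for t in times)
    slope = sum((t - t_bar) * (yi - y_bar) for t, yi in zip(times, y)) / sxx
    return y_bar - slope * t_bar, slope


# HYPOTHETICAL Kaplan-Meier read-offs: (months, survival proportions).
# These numbers are invented for illustration only.
studies = {
    "study_A": ([12, 24, 36, 60], [0.73, 0.64, 0.58, 0.52]),
    "study_B": ([12, 24, 48], [0.70, 0.59, 0.50]),
    "study_C": ([6, 12, 24, 36], [0.85, 0.76, 0.66, 0.60]),
}

# Stage 1: one trajectory per study on the logit scale.
fits = {name: fit_line(t, s) for name, (t, s) in studies.items()}

# Stage 2: unweighted averages across studies. The spread of the
# intercepts is only a crude analogue of a random-intercept variance.
mean_intercept = mean(a for a, _ in fits.values())
mean_slope = mean(b for _, b in fits.values())
between_study_var = variance([a for a, _ in fits.values()])


def pooled_survival(months):
    """Back-transformed average trajectory at a given follow-up time."""
    return inv_logit(mean_intercept + mean_slope * months)
```

A mixed model would estimate the fixed effects and the between-study variance jointly and would weight studies by their precision; the two-stage version only shows how the logit scale keeps predicted proportions between 0 and 1 and why a smooth trajectory between time points is assumed.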
Future work should prioritize access to individual patient-level data, which would enable time-dependent modeling strategies. Joint models linking longitudinal AE occurrence with survival endpoints could provide more accurate insights into whether early irAEs can serve as surrogate markers for therapeutic efficacy. This line of research is increasingly supported in the literature, with recent studies using real-world data to examine the temporal association between immune-related toxicity and survival outcomes in melanoma immunotherapy [
42,
43,
44]. In parallel, Bayesian frameworks may offer a flexible alternative to frequentist GLMMs for synthesizing sparse or heterogeneous clinical data [
40]. These directions represent promising avenues for strengthening both the accuracy and interpretability of evidence in immunotherapy and beyond.
  5. Final Remarks
This study demonstrates the application of formal meta-analytical modeling to assess the risk–benefit profile of immunotherapy regimens in melanoma treatment. By integrating multiple modeling strategies, ranging from classical fixed- and random-effects models to generalized linear mixed models, we were able to appropriately handle heterogeneous data structures and varying effect sizes across clinical endpoints. The use of a model selection procedure driven by heterogeneity (quantified via the I² statistic) enabled flexible and robust estimation across distinct outcome types. Specifically, an FEM was employed in homogeneous contexts, while REM and GLMM approaches captured both moderate and high between-study arm variance. This framework was essential in analyzing adverse event proportions, response rates, and survival outcomes, each of which exhibited different degrees of variability.
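The heterogeneity-driven choice between a fixed-effect and a random-effects model can be sketched as follows. This is an illustrative outline under stated assumptions rather than the exact procedure used here: proportions are pooled on the logit scale with inverse-variance weights, heterogeneity is summarized with Cochran's Q and the I² statistic, and the random-effects branch uses the DerSimonian–Laird estimator. The 25% cut-off, the 0.5 continuity correction, and the function names are assumptions introduced for the example, and the GLMM branch is omitted.

```python
import math


def logit_and_variance(events, total):
    """Logit of a proportion and its approximate (delta-method) variance.

    A 0.5 continuity correction is always applied (an assumption made
    here to avoid division by zero when a cell count is 0).
    """
    e = events + 0.5
    n = total - events + 0.5
    return math.log(e / n), 1.0 / e + 1.0 / n


def inv_logit(x):
    """Back-transform a log-odds value to a proportion."""
    return 1.0 / (1.0 + math.exp(-x))


def pool_proportions(arms, i2_threshold=25.0):
    """Pool (events, total) pairs; pick FEM or REM from I^2.

    `i2_threshold` is a hypothetical cut-off, not a value from the study.
    """
    effects, variances = zip(*(logit_and_variance(e, n) for e, n in arms))

    # Fixed-effect (inverse-variance) estimate and Cochran's Q.
    weights = [1.0 / v for v in variances]
    fixed = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    q = sum(w * (y - fixed) ** 2 for w, y in zip(weights, effects))
    df = len(arms) - 1

    # I^2: share of total variability attributed to heterogeneity, in percent.
    i2 = 100.0 * max(0.0, (q - df) / q) if q > 0 else 0.0

    if i2 < i2_threshold:
        return {"model": "FEM", "proportion": inv_logit(fixed),
                "i2": i2, "tau2": 0.0}

    # DerSimonian-Laird estimate of the between-study variance tau^2.
    c = sum(weights) - sum(w * w for w in weights) / sum(weights)
    tau2 = max(0.0, (q - df) / c)
    re_weights = [1.0 / (v + tau2) for v in variances]
    pooled = sum(w * y for w, y in zip(re_weights, effects)) / sum(re_weights)
    return {"model": "REM", "proportion": inv_logit(pooled),
            "i2": i2, "tau2": tau2}
```

For example, `pool_proportions([(20, 100), (40, 200)])` pools two arms with nearly identical event rates and stays in the fixed-effect branch, whereas arms with widely differing rates, such as `[(10, 100), (60, 100), (30, 100)]`, are routed to the random-effects branch.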
This meta-analysis advances the methodological rigor and clinical relevance of evidence synthesis in metastatic melanoma. While our central finding, namely, that combination immune checkpoint inhibitor therapy offers superior efficacy at the cost of increased toxicity, aligns with prior clinical trials, the principal contribution of our study lies in its integrative and comprehensive analytic framework. We jointly modeled treatment efficacy and adverse events across heterogeneous trials using an FEM, an REM, and a GLMM, thereby enhancing generalizability beyond the constraints of individual studies. Notably, our application of time-resolved modeling strategies to OS and PFS, based on data extracted from Kaplan–Meier curves, represents a novel methodological contribution. This allowed us to model survival outcomes as continuous trajectories rather than rely solely on aggregated endpoints, offering a more nuanced picture of treatment benefit over time.
By aligning statistical techniques with the structural complexity of clinical outcomes, our approach bridges the gap between inference and decision-making in oncology. The application of GLMMs to logit-transformed survival proportions, with the explicit modeling of between-study variability, marks a significant innovation in the field. This framework, emphasizing transparency, flexibility, and robustness, not only strengthens the reliability of our conclusions but also provides a generalizable template for future meta-analyses where time-to-event outcomes and heterogeneous data sources are central. As immunotherapy continues to evolve, such methodologies can inform treatment selection, patient counseling, and the design of future clinical trials.