Achilles and the Tortoise: Rethinking Evidence Generation in Cardiovascular Surgery and Interventional Cardiology
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The authors offer a novel, thought-provoking framework—Zeno's paradox—to illustrate the misalignment between rapid TAVI device iteration and evidence maturation. Its comprehensive chronological mapping of 270 studies, juxtaposed with CE/FDA approval dates, provides a unique snapshot of the "leapfrogging" phenomenon and underscores the oft-neglected value of well-conducted observational data in rapidly evolving device fields. However, the following concerns should be addressed:
- Inclusion/exclusion criteria were created post hoc, and there was no dual independent screening or data extraction.
- The "major RCT" subset (Analysis 1) was selected by the author's judgment, suggesting a high risk of selection bias.
- Time-to-approval calculated from the first posted date rather than first-in-man or CE/FDA submission dates may misestimate the real-world lag.
- The statistical comparisons (t-test/Mann–Whitney) were performed on non-random convenience samples and were not adjusted for clustering by valve generation or sponsor.
- Overlapping coloured bars in Figure 1 make the exact overlap of trials and valve launches impossible to quantify; there is no Gantt-style alignment metric (e.g., Spearman correlation between launch and trial-start dates).
- There are no Kaplan–Meier or cumulative incidence curves for "evidence maturity" (time from CE mark to first peer-reviewed report).
- There is no forest plot or meta-analytic summary of treatment effects to show whether successive generations truly outperform their predecessors.
- Patient numbers in Table 1 (R 814 ± a very wide range; NR 689 ± an even wider one) are presented as means without interquartile ranges, exaggerating central tendency.
- There is no quantitative estimate of how often a new valve reaches market before ≥50% of prior-generation RCT follow-up is published (this could be extracted from Figure 1 but is not).
- There is no citation of existing adaptive or Bayesian platform trials in TAVI (e.g., UK TAVI, PARTNER II nested registries) that already shorten evidence cycles.
- Equivalence of observational and RCT effect sizes is invoked from 1985–1998 literature, which ignores TAVI-specific examples where observational data over-estimate benefit (e.g., early STS/ACC TVT vs. PARTNER III mortality rates).
- The authors do not discuss device-specific learning curves, centre volume, or operator heterogeneity—factors that amplify between-generation performance gaps.
- Safety signals (permanent pacemaker, paravalvular leak) are mentioned qualitatively but are not integrated into a quantitative benefit-harm model across generations.
- The health-economic dimension is absent: there is no analysis of cost-effectiveness decay when evidence becomes obsolete.
- Patient-centred outcomes (quality of life, valve durability >10 years) are not linked to iterative technology; hence the "paradox" is asserted, not demonstrated, with hard clinical data.
Author Response
Comments and Suggestions for Authors
The authors offer a novel, thought-provoking framework—Zeno's paradox—to illustrate the misalignment between rapid TAVI device iteration and evidence maturation. Its comprehensive chronological mapping of 270 studies, juxtaposed with CE/FDA approval dates, provides a unique snapshot of the "leapfrogging" phenomenon and underscores the oft-neglected value of well-conducted observational data in rapidly evolving device fields. However, the following concerns should be addressed:
- Inclusion/exclusion criteria were created post hoc, and there was no dual independent screening or data extraction.
Response: The manuscript is a registry-based descriptive review rather than a systematic review. As a narrative review based on publicly available data from ClinicalTrials.gov, our aim was to provide a broad and representative overview of the TAVI evidence landscape, rather than to perform a systematic review or meta-analysis. Therefore, formal dual screening or independent data extraction—though ideal for systematic reviews—was beyond the scope of this work.
Correction: “This manuscript is a registry-based descriptive review and not a meta-analysis; therefore, independent data extraction was beyond the scope of this work. Data extraction was single-author, with verification of key trial entries against registry pages and primary publications where available”, added in Methods and in a new paragraph (Limitations).
- The "major RCT" subset (Analysis 1) was selected by the author's judgment, suggesting a high risk of selection bias.
Response: I agree that the selection of "major RCTs" was based on my judgment, but as a cardiac surgeon working in a centre where TAVIs are currently performed, I deal with this matter every day and am well placed to judge which studies have had the greatest weight in clinical practice. These trials were chosen because they are widely recognized in the field and have shaped clinical practice and guidelines. We have clarified in the text that this selection was narrative and illustrative, not systematic.
Correction: “Analysis 1 used a subset of randomized trials chosen because they were pivotal, multicenter, or explicitly designed as head-to-head or industry-independent trials; the objective was illustrative chronology rather than exhaustive trial weighting”, added in Methods.
- Time-to-approval calculated from the first posted date rather than first-in-man or CE/FDA submission dates may misestimate the real-world lag.
Response: We agree this may affect fine-grained lag estimates. We used the first posted date on ClinicalTrials.gov as a consistent and publicly accessible reference. While first-in-man or regulatory submission dates could offer additional insight, they are not uniformly available.
Correction: “Timing markers used were CE mark and FDA public approval dates and registry posting dates as recorded on ClinicalTrials.gov; submission dates and first-in-man dates are frequently unavailable or inconsistently reported across manufacturers and registries, and therefore were not used”, added in Limitations.
- The statistical comparisons (t-test/Mann–Whitney) were performed on non-random convenience samples and were not adjusted for clustering by valve generation or sponsor.
Response: The statistical tests were used descriptively and exploratorily to illustrate general trends between randomized (R) and non-randomized (NR) studies, not to infer causality or adjust for confounders. We acknowledge the limitations of unadjusted comparisons and have added a note to this effect.
Correction: “The t-test/Mann–Whitney comparisons were exploratory and descriptive and were not adjusted for clustering by valve generation or sponsor”, added in Limitations and in paragraph 3.2.
- Overlapping coloured bars in Figure 1 make the exact overlap of trials and valve launches impossible to quantify; there is no Gantt-style alignment metric (e.g., Spearman correlation between launch and trial-start dates).
Response: We agree that Figure 1 is primarily visual and qualitative, intended to illustrate the “leapfrogging” phenomenon. A quantitative alignment metric (e.g., Spearman correlation) is an excellent suggestion for future research.
Correction: “This is a schematic chronology to visually illustrate the "leapfrogging" phenomenon cited in the text.”, amended in the Figure 1 caption;
“Figure 1 does not report an exact quantification of overlap, which was not attempted in the present paper; Spearman correlation or Gantt-derived metrics are proposed as a solution for future quantitative work”, added in Limitations.
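Purely as an illustration of the alignment metric proposed for future quantitative work, a minimal sketch could look as follows (hypothetical dates, assuming SciPy is available; this is not an analysis performed in the manuscript):

```python
# Minimal sketch of a launch/trial-start alignment metric.
# Dates are invented placeholders, not values extracted from the manuscript.
from datetime import date
from scipy.stats import spearmanr

# Hypothetical (device CE-mark date, earliest trial-start date) pairs
pairs = [
    (date(2007, 9, 1), date(2007, 4, 1)),
    (date(2010, 3, 1), date(2009, 11, 1)),
    (date(2014, 1, 1), date(2013, 6, 1)),
    (date(2019, 8, 1), date(2016, 5, 1)),
]

origin = date(2002, 1, 1)
launch_days = [(launch - origin).days for launch, _ in pairs]
start_days = [(start - origin).days for _, start in pairs]

rho, p_value = spearmanr(launch_days, start_days)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")
```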
- There are no Kaplan–Meier or cumulative incidence curves for "evidence maturity" (time from CE mark to first peer-reviewed report).
Response: I agree that such analyses would be valuable in a dedicated longitudinal study; as a narrative review, our goal was to map the broad chronological relationship between device approval and evidence generation, not to perform a time-to-event analysis.
Correction: “As a narrative review, time-to-evidence maturity (CE mark to peer-reviewed publication) is an important quantitative metric not calculated here; it could be an explicit next step for future research, using Kaplan–Meier or cumulative incidence estimates”, added in Limitations.
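Purely as an illustration of this proposed future step, a minimal time-to-evidence-maturity sketch could be built with a standard survival package (invented durations, assuming the lifelines library is available; not an analysis performed here):

```python
# Illustrative Kaplan-Meier sketch for "evidence maturity":
# time from CE mark to first peer-reviewed report, in months.
# Values are invented placeholders, not extracted registry data.
import pandas as pd
from lifelines import KaplanMeierFitter

df = pd.DataFrame({
    "months_to_first_report": [14, 22, 9, 31, 48, 27],
    "report_published": [1, 1, 1, 1, 0, 1],  # 0 = no report yet (censored)
})

kmf = KaplanMeierFitter()
kmf.fit(df["months_to_first_report"], event_observed=df["report_published"])
print(kmf.median_survival_time_)  # median months from CE mark to first report
```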
- There is no forest plot or meta-analytic summary of treatment effects to show whether successive generations truly outperform their predecessors.
Response: Our review did not aim to compare treatment effects across valve generations, but rather to highlight the structural misalignment between innovation and validation; as stated, this paper is a descriptive review and not a meta-analysis.
Correction: The descriptive-review character of the paper has already been underlined above.
- Patient numbers in Table 1 (R 814 ± a very wide range; NR 689 ± an even wider one) are presented as means without interquartile ranges, exaggerating central tendency.
Response: The statistical tests were used descriptively and exploratorily to illustrate general trends between randomized (R) and non-randomized (NR) studies.
Correction: Same as for Comment no. 4; the Table 1 legend has been amended.
- There is no quantitative estimate of how often a new valve reaches market before ≥50% of prior-generation RCT follow-up is published (this could be extracted from Figure 1 but is not).
Response: This is a valuable point. We have added a qualitative statement based on Figure 1.
Correction: “As a general summary, it is possible to estimate that in approximately 70–75% of cases, a new transcatheter valve model reached the market before ≥50% of the follow-up data from the prior-generation RCT had reached their primary endpoints”, added in the Analysis 1 results.
- There is no citation of existing adaptive or Bayesian platform trials in TAVI (e.g., UK TAVI, PARTNER II nested registries) that already shorten evidence cycles.
Response: I fully agree that citations in this regard are lacking; I have added the bibliographic references and expanded the relevant paragraph, also adding a critical review of such studies.
Correction: paragraph 4.3 added and modified in Discussion
References added: nos. 52, 53 and 54; the reference for NOTION is already present [6].
[52] Ludman PF; UK TAVI Trial Investigators. The UK Transcatheter Aortic Valve Implantation Registry
[53] Carroll JD, Mack MJ, Vemulapalli S, et al. STS-ACC TVT Registry of Transcatheter Aortic Valve Replacement.
[54] Hamm CW; GARY-Executive Board. The German Aortic Valve Registry (GARY).
- Equivalence of observational and RCT effect sizes is invoked from 1985–1998 literature, which ignores TAVI-specific examples where observational data over-estimate benefit (e.g., early STS/ACC TVT vs. PARTNER III mortality rates).
Response: I agree with the observation and have added a paragraph of clarification.
Correction: “The cited literature was meant only to illustrate the long-standing debate on RCTs vs. observational studies, not to claim equivalence in TAVI”, added in Discussion.
- The authors do not discuss device-specific learning curves, centre volume, or operator heterogeneity—factors that amplify between-generation performance gaps.
Response: This is a crucial point of this article: the data in question are almost never available (not reported), especially those relating to the experience of individual operators. In a few countries they are part of a transparency framework supported by various rationales, but in most they are unobtainable. Since they should be introduced as variables, "artificial intelligence", intended as a higher level of integrative analysis, could play an important role by cross-referencing subjective data in a scientific context and weighting them appropriately.
Correction: “Randomized studies do not include data such as device-specific learning curves, center volume, or heterogeneity in operator experience, all of which exacerbate the performance gap between generations. Since it would be useful to introduce these variables, artificial intelligence, intended as a higher level of integrative analysis, could play an important role in this analysis, cross-referencing subjective data within a scientific framework and weighting them appropriately”, added in the Discussion, paragraph 4.3.
- Safety signals (permanent pacemaker, paravalvular leak) are mentioned qualitatively but are not integrated into a quantitative benefit-harm model across generations.
Response: We agree that a quantitative benefit-harm model would be insightful, but it was beyond the scope of this narrative review.
Correction: The descriptive-review character of the paper has already been underlined above.
- The health-economic dimension is absent: there is no analysis of cost-effectiveness decay when evidence becomes obsolete.
Response: I absolutely agree with the validity of an economic study, but I deliberately avoided it: the healthcare costs of a topic like transcatheter valves (the costs of all the trials, the industrial costs, the impact on DRGs, the costs of reoperations for failed prostheses, the cost of adverse events such as pacemakers) are a difficult field to approach and resolve, even for many high-level healthcare systems.
Correction: I therefore take the liberty of not introducing this topic in this article, citing it only in the Limitations.
- Patient-centred outcomes (quality of life, valve durability >10 years) are not linked to iterative technology; hence the "paradox" is asserted, not demonstrated, with hard clinical data.
Response: I agree with this concept, but these are precisely the data that are most affected by the rapid cycle of technological innovations.
Correction: “Data on long-term endpoints (durability, quality of life at >10 years) require extensive follow-up and therefore create a discrepancy with frequent device changes, highlighting that these are precisely the endpoints most compromised by the rapid innovation cycle. This "paradox" extends over many years, preventing the collection of comparable data. Added to this is the recent lowering of the recommended age threshold for TAVI (to 70 years): quality-of-life data, for example, will be much less comparable with those of patients implanted with previous models”, added in Discussion, lines 347–353.
Reviewer 2 Report
Comments and Suggestions for Authors
The introduction is not very well structured. The aim is not clear. I cannot understand from the introduction what the aim of the paper is. The same problem applies to the abstract.
There is one author, but several times it says "we".
The materials and methods section is unstructured, and I still do not understand what the point of this paper is.
The results are not relevant and do not prove anything new. There is a mix of topics that is hard to follow.
The same applies to the discussion.
There is no conclusion.
The paper and the research should be rewritten/redone from scratch.
Author Response
Comments and Suggestions for Authors
The introduction is not very well structured. The aim is not clear. I cannot understand from the introduction what the aim of the paper is. The same problem applies to the abstract.
There is one author, but several times it says "we".
The materials and methods section is unstructured, and I still do not understand what the point of this paper is.
The results are not relevant and do not prove anything new. There is a mix of topics that is hard to follow.
The same applies to the discussion.
There is no conclusion.
The paper and the research should be rewritten/redone from scratch.
Response: I thank the reviewer for his direct evaluation. The comment helped me identify important deficiencies in clarity and structure, which were also addressed in more detail by other reviewers.
This manuscript is a descriptive review of a registry, rather than an original randomized trial or a pooled meta-analysis, and its purpose is to provide a transparent snapshot of the evidence landscape related to TAVI device iterations and study activity between 2002 and May 2025.
Since it is difficult to provide detailed responses to a generalized comment, I confirm that I have corrected the inconsistencies in authorial pronouns and have revised all subsections of the article with clarifying rewording or new paragraphs. I added a Limitations paragraph to summarize the major concerns. I have also added a brief and explicit Conclusion that defines the scope of the article and the practical next steps for more analytical work.
We hope these changes address the Reviewer's concerns and improve the article's value to readers.
Reviewer 3 Report
Comments and Suggestions for Authors
Hello,
Thank you for submitting this manuscript to this journal. It is an interesting read; however, it suffers from major flaws in scientific discourse as we know it in the era of evidence synthesis.
1: Misinterpretation of Zeno's Paradox – The paper uses Zeno's paradox to argue that RCTs cannot keep pace with device innovation, likening RCT evidence to "Achilles" and technological innovation to the "tortoise." Zeno's paradox is about temporal impossibility of motion, not about practical validation lag. In Zeno's paradox, Achilles theoretically never catches the tortoise because infinite steps are required in finite time—but this is proven false by mathematical analysis (convergent infinite series) and hence the analogy doesn't work. RCTs aren't trying to "catch a moving target" in the mathematical sense. The problem isn't that validation is theoretically impossible; it's that industry releases new devices before prior ones are validated. This is a market timing issue, not a mathematical paradox. Better framing could be about "Regulatory capture" or "the innovation-evidence gap" rather than invoking Zeno's paradox inappropriately.
2. Analysis 1 – Timeline Visualization is Descriptive, Not Analytical. The issue: Figure 1 simply overlays RCT timelines with device approval dates but does not quantify the actual clinical impact of this lag. Showing devices approved while prior RCTs were ongoing is a true observation, but it draws no causal conclusions about patient harm. Missing analytical elements: Did the evidence lag result in worse clinical outcomes? If Sapien 3 was approved while Sapien XT trials were ongoing, did this harm patients? (No data provided.) Were device approvals evidence-based or not? FDA approval is based on bench testing plus limited clinical data (typically <100 patients), and the CE Mark European pathway is an even lighter touch than the FDA's; but does this matter if outcomes are similar? What the paper does not address is post-market surveillance data showing whether devices approved during lag periods had inferior outcomes, and whether the "fragmented evidence base" (claimed in the Discussion) actually prevented optimal clinical decisions. The cost of delayed innovation vs. the cost of inadequate pre-market validation also needs to be discussed. The timeline is interesting historically but does not support claims about clinical harm.
3: Analysis 2 – Statistical Analysis is Unclear. Issues with the statistical approach:
3.1 – The comparison of R vs. NR studies lacks clinical context: it shows that R studies are longer (mean 87.7 months vs. NR 66.1 months) and recruit more patients (median 460 vs. 150). However, this is expected and appropriate, as RCTs need long follow-up to detect mortality/morbidity differences, whereas NR studies are registries/pilots designed for feasibility, not hard endpoints. Longer duration and larger samples are not the problem; they are the correct methodology.
3.2 – There is no correlation between duration and sample size (Figure 4): the finding of "no strong linear correlation between enrolled patients and study duration" actually suggests appropriate tailoring of study design. Short studies with few patients are pilot feasibility studies (appropriate), while long studies with many patients are definitive RCTs (appropriate). The lack of correlation is not a design flaw; it is methodological diversity.
3.3 – The trend analysis (Figures 3 and 5) lacks statistical power: linear regression on chronological counts is not rigorous, and no p-values, confidence intervals, or formal trend testing are reported. The observation that "both R and NR studies are increasing over time" is expected as the field matures and is not problematic. The data do not support the paper's implicit argument that study designs are dysfunctional; the patterns shown reflect the appropriate evolution of trial methodology.
4: Core Argument Poorly Structured – The paper conflates three distinct problems:
- Industry release timing issue: devices approved before prior RCTs conclude (REAL problem).
- Evidence fragmentation: each device iteration requires new studies (EXPECTED, not inherently problematic).
- Long-term durability requiring extended follow-up (CONSTRAINT, not a design flaw).
- Problem #1 requires regulatory action (mandate longer post-market surveillance before approval).
- Problem #2 is inherent to iterative device development (not solvable without halting innovation).
- Problem #3 is a fundamental requirement for bioprosthesis validation (not optional).
5: The paper recommends:
- Adaptive trial designs – mentioned but not explained or justified; it needs to be further elaborated how the issue can be addressed.
- Real-world data (RWD) – referenced but not operationalized.
- AI-based "randomization via biological variability" – this is the critical flaw: AI-based pseudo-randomization via "biological variability".
The proposal (Discussion & NYC simulation): "The population's biological variability and all its related characteristics could act, if carefully analyzed, as a sort of randomization." This is scientifically unsound because it conflates heterogeneity with randomization: randomization eliminates selection bias by distributing measured and unmeasured confounders equally, whereas heterogeneity (biological variability) creates exactly the confounding that randomization prevents. For example, older, frail patients are "naturally selected" for TAVI and younger low-risk patients for surgery. This is selection bias, not substitute randomization, and no amount of ML can recover causal inference from such biased selection.
5.2 – The paper invokes an "expanded propensity score" approach via AI, but propensity scores require known confounders to be measured and rest on the strong assumption of no unmeasured confounding. These conditions can never be guaranteed in observational data, as TAVI selection is often determined by anatomy, physician preference, institutional protocols (unmeasured), and patient preference (largely unmeasured); ML cannot recover unmeasured confounding (a fundamental theorem of causal inference).
5.3 – The NYC simulation is illustrative, not proof. The ChatGPT simulation produces results "consistent with main studies" (as claimed in the paper), but simulations by definition cannot validate a methodology; if a simulation matches prior RCTs, it is because the AI was trained on or aware of the RCT results. This is circular reasoning, not validation, and generative AI cannot perform novel causal discovery.
5.4 – The paper proposes using AI on observational data as an RCT substitute, but the real bottleneck is not the statistical method; it is regulatory approval authority. Devices are already approved with limited evidence (FDA 510(k) pathway), and more "AI-based analysis" of post-market data will not change this; what is needed is a regulatory mandate for longer surveillance before approval.
6. Adaptive trial designs (mentioned but underdeveloped): the TAVI field has published adaptive trials (e.g., platform trials), and the author could reference ADAPT, SMART trial designs, etc. Registry-based RCTs: multiple papers show how to conduct RCTs embedded in clinical registries, which solves the "data collection lag" problem. Regulatory pathways for innovation: the FDA already has "accelerated approval" plus post-market surveillance requirements; hence the problem is not the absence of regulation but inadequate post-market enforcement. RCT alternatives are already published: N-of-1 trials, the target trial framework, causal forest methods; the paper cites none of these.
7: Mischaracterization of RCTs vs. Observational Studies – The paper's claim that "RCTs and observational studies often share more similarities than traditionally assumed; neither is inherently superior" is misleading in context. For assessing device safety (the primary concern in TAVI), RCTs and observational data have fundamentally different value: RCTs detect rare adverse events if powered appropriately, while observational studies are good for frequent events and poor for rare complications. Observational data provide correlations, not causation (unless unmeasured confounding is absent, which is rarely true). The paper cites the Benson & Hartz (2000) meta-analysis showing that observational effects fell within the RCT confidence interval in 17/19 cases, but this does not mean observational studies and RCTs are "equivalent"; it means effects were similar in direction for that particular meta-analysis, and such a small sample does NOT validate using observational studies to replace RCTs for causal inference.
8: No Discussion of Potential Harms from Accelerating Device Approval – The paper implicitly advocates a faster innovation cycle but does not address: Durability – if Sapien 3 is approved before the long-term durability of Sapien XT is established, comparison data are lost, and patients receiving newer devices may have unknown failure modes. Device recalls – the paper notes two devices recalled (Lotus Edge 2021, Acurate neo2 2025) but does not discuss whether faster approval contributed to inadequate pre-market testing. Real-world harms are undocumented – there is no comparison of complication rates in patients implanted during "lag periods" vs. those implanted after RCT completion; if outcomes were worse, this would suggest that the current "rapid innovation" approach may be harmful.
9: Overstates the AI Solution in its current form – From the Discussion/Perspective: "By integrating variables often overlooked or analyzed in isolation—such as phenotype, race, comorbidities, genetic background and operator experience—AI can help stratify outcomes with a level of granularity unattainable through conventional study designs." However, we need to emphasise that "granularity" ≠ validity: more subgroups analyzed means more p-values and a larger multiple-comparison problem. Stratified analysis is already done in RCTs: modern trials (e.g., PARTNER III) report extensive subgroup analyses. And there is no evidence that AI outperforms existing methods: the paper provides no citations showing that AI-based confounding adjustment is superior to regression/matching for TAVI outcomes.
POSITIVE ASPECTS of the paper:
- Core observation valid: innovation does outpace validation—this is real.
- Comprehensive data collection: 270 studies analyzed; the effort is substantial.
- Raises an important question: how should regulatory/research frameworks adapt to rapid device iteration?
- Acknowledges limitations: the author notes that "high variability" and "multiple factors" influence study design.
In light of the above, I recommend a major revision of the manuscript.
Thanks
Author Response
Thank you for submitting this manuscript to this journal. It is an interesting read; however, it suffers from major flaws in scientific discourse as we know it in the era of evidence synthesis.
1: Misinterpretation of Zeno's Paradox – The paper uses Zeno's paradox to argue that RCTs cannot keep pace with device innovation, likening RCT evidence to "Achilles" and technological innovation to the "tortoise." Zeno's paradox is about temporal impossibility of motion, not about practical validation lag. In Zeno's paradox, Achilles theoretically never catches the tortoise because infinite steps are required in finite time—but this is proven false by mathematical analysis (convergent infinite series) and hence the analogy doesn't work. RCTs aren't trying to "catch a moving target" in the mathematical sense. The problem isn't that validation is theoretically impossible; it's that industry releases new devices before prior ones are validated. This is a market timing issue, not a mathematical paradox. Better framing could be about "Regulatory capture" or "the innovation-evidence gap" rather than invoking Zeno's paradox inappropriately.
Response: I agree that the original mathematical framing risked misinterpretation. However, like its use in ancient Greek philosophy, the paradox is cited here for purely rhetorical purposes to illustrate the recurring temporal discrepancy; I have used Zeno's paradox as a metaphorical and illustrative device to vividly describe the constant "chase" between evidence generation and technological iteration, not as a literal mathematical statement.
Correction: “Zeno's paradox is cited as a conceptual metaphor for the "gap between innovation and scientific evidence" and is not a literal mathematical statement. Its use captures the essence of the purpose of the Greek paradoxes, which is to serve as part of a method of investigation [2]. The true paradox lies in the difficulty of applying, in daily clinical practice, validated clinical data to a constantly evolving reality, which renders them obsolete as soon as they become definitive”, added in the Introduction.
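For reference, the convergent-series resolution the reviewer invokes can be written out explicitly; this is a standard textbook identity, added here purely as an illustration (with a head start $d_0$ and constant speeds $v_A > v_T$ for Achilles and the tortoise, respectively):
\[
t_{\text{catch}} = \frac{d_0}{v_A}\sum_{k=0}^{\infty}\left(\frac{v_T}{v_A}\right)^{k} = \frac{d_0}{v_A - v_T},
\]
which is finite whenever $v_A > v_T$, so Achilles does overtake the tortoise; the manuscript's use of the paradox is therefore metaphorical rather than mathematical, as stated in the added text.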
- Analysis 1 – Timeline Visualization is Descriptive, Not Analytical. The issue: Figure 1 simply overlays RCT timelines with device approval dates but does not quantify the actual clinical impact of this lag. Showing devices approved while prior RCTs were ongoing is a true observation, but it draws no causal conclusions about patient harm. Missing analytical elements: Did the evidence lag result in worse clinical outcomes? If Sapien 3 was approved while Sapien XT trials were ongoing, did this harm patients? (No data provided.) Were device approvals evidence-based or not? FDA approval is based on bench testing plus limited clinical data (typically <100 patients), and the CE Mark European pathway is an even lighter touch than the FDA's; but does this matter if outcomes are similar? What the paper does not address is post-market surveillance data showing whether devices approved during lag periods had inferior outcomes, and whether the "fragmented evidence base" (claimed in the Discussion) actually prevented optimal clinical decisions. The cost of delayed innovation vs. the cost of inadequate pre-market validation also needs to be discussed. The timeline is interesting historically but does not support claims about clinical harm.
Response: We agree that Figure 1 is primarily descriptive and historical. As a narrative review, its purpose was to map and visualize the phenomenon of "leapfrogging," not to perform a causal analysis of patient harm. Quantifying the direct clinical impact of this lag—while a crucial question—would require a different study design with linked patient-level outcome data, which was beyond the scope of this paper. We have clarified in the text that the figure serves to illustrate the structural problem, not to prove clinical harm. Furthermore, such clinical harm would emerge from the preliminary, intermediate results of the studies themselves and would be reported on ClinicalTrials.gov; it is therefore a topic already under close observation within the trials themselves, without the need for the "external supervision" that would otherwise fall to the present article.
Correction: “Figure 1 is primarily descriptive and historical and serves to illustrate the structural problem, not to prove any clinical harm, which, if present, is already contemplated and reported in the intermediate results of the studies themselves (on ClinicalTrials.gov)”, added in Limitations.
3: Analysis 2 – Statistical Analysis is Unclear. Issues with the statistical approach:
3.1 – The comparison of R vs. NR studies lacks clinical context: it shows that R studies are longer (mean 87.7 months vs. NR 66.1 months) and recruit more patients (median 460 vs. 150). However, this is expected and appropriate, as RCTs need long follow-up to detect mortality/morbidity differences, whereas NR studies are registries/pilots designed for feasibility, not hard endpoints. Longer duration and larger samples are not the problem; they are the correct methodology.
3.2 – There is no correlation between duration and sample size (Figure 4): the finding of "no strong linear correlation between enrolled patients and study duration" actually suggests appropriate tailoring of study design. Short studies with few patients are pilot feasibility studies (appropriate), while long studies with many patients are definitive RCTs (appropriate). The lack of correlation is not a design flaw; it is methodological diversity.
3.3 – The trend analysis (Figures 3 and 5) lacks statistical power: linear regression on chronological counts is not rigorous, and no p-values, confidence intervals, or formal trend testing are reported. The observation that "both R and NR studies are increasing over time" is expected as the field matures and is not problematic. The data do not support the paper's implicit argument that study designs are dysfunctional; the patterns shown reflect the appropriate evolution of trial methodology.
Response: I agree with the consideration common to the three points of comment #3: the statistical analysis as presented can be misleading. My intention was not to frame these findings as a "problem", but to empirically describe the evidence landscape on TAVI. I have reworded the relevant sections to avoid implying that these intrinsic methodological differences are inherently dysfunctional.
I emphasized the descriptive nature of the statistical analysis and removed the trend lines.
Correction: “The statistical comparisons between randomized and non-randomized studies were exploratory and descriptive in nature; they were not adjusted for confounding factors such as valve generation or sponsor, and thus are not intended for causal inference”, added in Methods; changes also made in the Analysis 2 results, paragraph 3.2.
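Purely as an illustrative sketch of the kind of exploratory, unadjusted comparison described in this correction (synthetic enrollment figures, assuming SciPy is available; not data from the manuscript):

```python
# Exploratory, unadjusted comparison of enrollment in randomized (R) vs.
# non-randomized (NR) studies; the numbers below are synthetic placeholders.
import numpy as np
from scipy.stats import mannwhitneyu

r_enrollment = np.array([950, 1000, 1468, 725, 280, 460])    # hypothetical R trials
nr_enrollment = np.array([100, 150, 60, 250, 500, 120, 90])  # hypothetical NR studies

stat, p_value = mannwhitneyu(r_enrollment, nr_enrollment, alternative="two-sided")
print(f"Mann-Whitney U = {stat:.1f}, p = {p_value:.3f}")
# Descriptive only: no adjustment for clustering by valve generation or sponsor,
# so this comparison cannot support causal claims.
```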
4: Core Argument Poorly Structured – The paper conflates three distinct problems:
- Industry release timing issue: devices approved before prior RCTs conclude (REAL problem).
- Evidence fragmentation: each device iteration requires new studies (EXPECTED, not inherently problematic).
- Long-term durability requiring extended follow-up (CONSTRAINT, not a design flaw).
- Problem #1 requires regulatory action (mandate longer post-market surveillance before approval).
- Problem #2 is inherent to iterative device development (not solvable without halting innovation).
- Problem #3 is a fundamental requirement for bioprosthesis validation (not optional).
Response: Agree they are distinct issues with different solutions.
Correction: subsection added in Discussion, 4.1
5: The paper recommends:
- Adaptive trial designs – mentioned but not explained or justified; it needs to be further elaborated how the issue can be addressed.
- Real-world data (RWD) – referenced but not operationalized.
- AI-based "randomization via biological variability" – this is the critical flaw: AI-based pseudo-randomization via "biological variability".
The proposal (Discussion & NYC simulation): "The population's biological variability and all its related characteristics could act, if carefully analyzed, as a sort of randomization." This is scientifically unsound because it conflates heterogeneity with randomization: randomization eliminates selection bias by distributing measured and unmeasured confounders equally, whereas heterogeneity (biological variability) creates exactly the confounding that randomization prevents. For example, older, frail patients are "naturally selected" for TAVI and younger low-risk patients for surgery. This is selection bias, not substitute randomization, and no amount of ML can recover causal inference from such biased selection.
5.2 – The paper invokes an "expanded propensity score" approach via AI, but propensity scores require known confounders to be measured and rest on the strong assumption of no unmeasured confounding. These conditions can never be guaranteed in observational data, as TAVI selection is often determined by anatomy, physician preference, institutional protocols (unmeasured), and patient preference (largely unmeasured); ML cannot recover unmeasured confounding (a fundamental theorem of causal inference).
5.3 – The NYC simulation is illustrative, not proof. The ChatGPT simulation produces results "consistent with main studies" (as claimed in the paper), but simulations by definition cannot validate a methodology; if a simulation matches prior RCTs, it is because the AI was trained on or aware of the RCT results. This is circular reasoning, not validation, and generative AI cannot perform novel causal discovery.
5.4 – The paper proposes using AI on observational data as an RCT substitute, but the real bottleneck is not the statistical method; it is regulatory approval authority. Devices are already approved with limited evidence (FDA 510(k) pathway), and more "AI-based analysis" of post-market data will not change this; what is needed is a regulatory mandate for longer surveillance before approval.
Response: I recognize that the way this concept was presented can be misleading; I do not believe AI can replace randomized trials. Rather, I believe a logical superstructure can integrate a wealth of data that has previously been difficult to integrate, thus refining the planning of specific randomized trials. This could save resources and patients and provide more rapid clinical answers. I prefer, if you agree, to completely rewrite this section, eliminating the "AI study on the NY population" and adding the well-known problem of drug resistance, where an "intelligent" system (whether natural or artificial), integrating knowledge from different fields, could have prevented adverse effects on patients. In this way I hope to resolve and clarify together the two concepts I had not clearly expressed: that of biological variability and that of the use of (highly integrative) artificial intelligence.
Correction: subsection added in Discussion, 4.3
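To make the limitation discussed above concrete, a minimal propensity-score sketch on synthetic data (assuming scikit-learn is available; purely illustrative and not part of the manuscript's analysis) shows that the score is estimated only from measured covariates, so any unmeasured confounder remains unaddressed:

```python
# Minimal propensity-score sketch on synthetic data: the model can only
# balance covariates it is given (age, STS score); an unmeasured factor
# (here "frailty") keeps confounding treatment assignment.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 500
age = rng.normal(80, 7, n)
sts_score = rng.normal(5, 2, n)
frailty = rng.normal(0, 1, n)  # unmeasured confounder, not given to the model

# Treatment assignment (TAVI vs. SAVR) depends on measured AND unmeasured factors
logit = 0.08 * (age - 80) + 0.3 * sts_score + 0.8 * frailty - 1.5
tavi = rng.binomial(1, 1 / (1 + np.exp(-logit)))

X_measured = np.column_stack([age, sts_score])  # frailty deliberately omitted
ps_model = LogisticRegression().fit(X_measured, tavi)
propensity = ps_model.predict_proba(X_measured)[:, 1]
print("First five estimated propensity scores (frailty ignored):", propensity[:5])
```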
- Adaptive trial designs (mentioned but underdeveloped): the TAVI field has published adaptive trials (e.g., platform trials), and the author could reference ADAPT, SMART trial designs, etc. Registry-based RCTs: multiple papers show how to conduct RCTs embedded in clinical registries, which solves the "data collection lag" problem. Regulatory pathways for innovation: the FDA already has "accelerated approval" plus post-market surveillance requirements; hence the problem is not the absence of regulation but inadequate post-market enforcement. RCT alternatives are already published: N-of-1 trials, the target trial framework, causal forest methods; the paper cites none of these.
Response: I agree with this comment
Correction: References added:
[52] Ludman PF; UK TAVI Trial Investigators. The UK Transcatheter Aortic Valve Implantation Registry
[53] Carroll JD, Mack MJ, Vemulapalli S, et al. STS-ACC TVT Registry of Transcatheter Aortic Valve Replacement.
[54] Beckmann A, Funkat AK, Lewandowski J, et al. Cardiac Surgery in Germany during 2021: The GARY registry.
7: Mischaracterization of RCTs vs. Observational Studies – The paper's claim that "RCTs and observational studies often share more similarities than traditionally assumed; neither is inherently superior" is misleading in context. For assessing device safety (the primary concern in TAVI), RCTs and observational data have fundamentally different value: RCTs detect rare adverse events if powered appropriately, while observational studies are good for frequent events and poor for rare complications. Observational data provide correlations, not causation (unless unmeasured confounding is absent, which is rarely true). The paper cites the Benson & Hartz (2000) meta-analysis showing that observational effects fell within the RCT confidence interval in 17/19 cases, but this does not mean observational studies and RCTs are "equivalent"; it means effects were similar in direction for that particular meta-analysis, and such a small sample does NOT validate using observational studies to replace RCTs for causal inference.
Response: I have revised this section to provide a more balanced and accurate portrayal. It's still commonly believed that RCTs are the gold standard, but in practice some limitations of such studies are significant. Some examples from my clinical practice:
PARTNER 1B (Inoperable): Subjective "inoperable" definition limiting generalizability; Early learning curve with 1st-generation device; Unblinded design;
NOTION 1 (Low-Risk): Underpowered due to slow enrollment; Necessary composite endpoint masked individual outcome differences;
SURTAVI (Intermediate-Risk): Major selection bias (only ~40% of screened patients enrolled); High crossover from SAVR to TAVI (~10%);
PARTNER III (Low-Risk): Critically short 1-year follow-up for a low-risk population; Highly selected centers/operators (sponsor bias), and so on.
Correction: “The cited literature was intended solely to illustrate the long-standing debate between RCTs and observational studies, not to claim equivalence in TAVI. RCTs remain the gold standard for causal inference, and observational studies are subject to confounding factors that are not yet clearly measurable, but they can sometimes produce similar results despite their different study designs. This likely happens when a study identifies more of the relevant confounding factors than others do”, added in Discussion.
8: No Discussion of Potential Harms from Accelerating Device Approval – The paper implicitly advocates a faster innovation cycle but does not address: Durability – if Sapien 3 is approved before the long-term durability of Sapien XT is established, comparison data are lost, and patients receiving newer devices may have unknown failure modes. Device recalls – the paper notes two devices recalled (Lotus Edge 2021, Acurate neo2 2025) but does not discuss whether faster approval contributed to inadequate pre-market testing. Real-world harms are undocumented – there is no comparison of complication rates in patients implanted during "lag periods" vs. those implanted after RCT completion; if outcomes were worse, this would suggest that the current "rapid innovation" approach may be harmful.
Response: We agree that a quantitative benefit-harm model would be insightful, but it was beyond the scope of this narrative review. The implications for patient safety of this current "transition" between transcatheter and surgical procedures are a difficult topic to address, even for experienced healthcare systems.
Correction: The descriptive-review character of the paper has already been underlined above.
9: Overstates the AI Solution in its current form – From the Discussion/Perspective: "By integrating variables often overlooked or analyzed in isolation—such as phenotype, race, comorbidities, genetic background and operator experience—AI can help stratify outcomes with a level of granularity unattainable through conventional study designs." However, we need to emphasise that "granularity" ≠ validity: more subgroups analyzed means more p-values and a larger multiple-comparison problem. Stratified analysis is already done in RCTs: modern trials (e.g., PARTNER III) report extensive subgroup analyses. And there is no evidence that AI outperforms existing methods: the paper provides no citations showing that AI-based confounding adjustment is superior to regression/matching for TAVI outcomes.
Response: I have revised this section to provide a more balanced and clear portrayal.
Correction: subsection added in Discussion, 4.3 and final consideration in Conclusions
Round 2
Reviewer 2 Report
Comments and Suggestions for Authors
The paper has improved.
Author Response
I sincerely thank you for your appreciation of my revision of the article.
Reviewer 3 Report
Comments and Suggestions for Authors
The author's detailed and thoughtful response, coupled with substantial manuscript revision, resolves the most critical methodological and conceptual flaws initially present. Revisions have improved clarity, scientific rigor, and argumentative structure, and demonstrate responsiveness to reviewer guidance. The manuscript now represents a balanced, well-organized, and informative narrative review suitable for publication. Further analytic work on harm quantification, as suggested, will be valuable for the field, but is appropriately beyond the reviewed scope at present.
Author Response
I sincerely thank you for your appreciation of my revision of the article, your suggestions were very helpful.

