A Simple Yet Powerful Hybrid Machine Learning Approach to Aid Decision-Making in Laboratory Experiments
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
This study introduces a hybrid machine-learning (ML) framework that combines Ordinary Least Squares (OLS) for global surface estimation, Gaussian Process (GP) regression for uncertainty modeling, Expected Improvement (EI) for active learning, and K-means clustering for diversifying conditions. The approach is applied to published growth-rate data of the diatom Thalassiosira pseudonana.
The paper presents a novel approach with good theoretical and practical value. However, there is almost no detailed description of the specific methods, making it difficult for readers to learn how to integrate multiple methods together.
Author Response
Thank you for your valuable feedback. In response to the request for a clearer description of how OLS, GP, EI, and K-means are integrated, we have substantially expanded the Methods & Materials section. Specifically, the new text now reads as follows (a brief illustrative code sketch of how these steps fit together is also provided after the list):
- Ordinary Least Squares (OLS) regression model fitted to the observed growth-rate data using a second-order polynomial that includes both quadratic and interaction terms (Phosphate², Temperature², and Phosphate×Temperature) to capture non-linear relationships between the predictor variables and the response variable;
- Gaussian Process (GP) regression model with a Matern (ν=2.5) kernel trained on the OLS residuals to capture uncertainty across the parameter space. The combined predictive mean at a new point x is ŷ(x) = ŷ_OLS(x) + μ_GP(x), and the predictive variance is σ̂²(x) = σ²_GP(x), where μ_GP and σ²_GP are the GP posterior mean and variance of the residuals. Hyperparameters for the GP model were optimised using maximum likelihood estimation via scikit-learn's internal routines. A small noise term (alpha=1e-6) was added to improve numerical stability;
- Expected Improvement (EI) implemented with a small exploration parameter (ξ = 0.01) and evaluated on a uniform 20×20 grid spanning the phosphate and temperature ranges to rank untested conditions by potential gain; this served as the decision criterion for identifying which untested conditions are most likely to yield new insights; and
- K-means clustering applied to the top candidate points (ranked by Expected Improvement), selecting one representative from each cluster so that each experimental cycle explores diverse yet high-potential regions of the parameter space. The number of clusters was set to match the cycle batch size (5 experiments), and the algorithm was run with 10 initializations (n_init=10) to ensure convergence. All implementation details, including complete Python code and example notebooks, are publicly available at https://github.com/benocd/ml-experiment-optimizer.
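To illustrate how these four components connect in code, the following is a minimal, self-contained sketch of a single optimization cycle. It is an illustrative reconstruction rather than the exact repository code: the example data values, the phosphate and temperature ranges, and the use of statsmodels for the OLS step are assumptions made purely for demonstration.

# Minimal sketch of one cycle of the hybrid OLS + GP + EI + K-means loop.
# Illustrative only: data values and parameter ranges are placeholders.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import norm
from sklearn.cluster import KMeans
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

# Conditions tested so far (placeholder values, one growth rate per condition).
observed = pd.DataFrame({
    "Phosphate":   [0.5, 0.5, 2.0, 2.0, 3.5, 3.5, 5.0, 5.0],
    "Temperature": [10.0, 25.0, 10.0, 25.0, 10.0, 25.0, 10.0, 25.0],
    "Growth":      [0.35, 0.55, 0.80, 1.25, 0.90, 1.10, 0.60, 0.75],
})

# 1) Global trend: second-order OLS with quadratic and interaction terms.
ols = smf.ols("Growth ~ Phosphate + Temperature + I(Phosphate**2)"
              " + I(Temperature**2) + Phosphate:Temperature",
              data=observed).fit()

# 2) Local uncertainty: GP (Matern, nu = 2.5) trained on the OLS residuals.
X = observed[["Phosphate", "Temperature"]].to_numpy()
residuals = observed["Growth"].to_numpy() - np.asarray(ols.predict(observed))
gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=1e-6).fit(X, residuals)

# 3) Candidate conditions: uniform 20 x 20 grid over the two parameter ranges.
pp, tt = np.meshgrid(np.linspace(0.1, 5.0, 20), np.linspace(5.0, 30.0, 20))
grid = pd.DataFrame({"Phosphate": pp.ravel(), "Temperature": tt.ravel()})
mu_res, sigma = gp.predict(grid.to_numpy(), return_std=True)
mu = np.asarray(ols.predict(grid)) + mu_res   # combined predictive mean

# 4) Expected Improvement with a small exploration parameter xi.
xi, best = 0.01, observed["Growth"].max()
z = np.where(sigma > 0, (mu - best - xi) / sigma, 0.0)
ei = np.where(sigma > 0, (mu - best - xi) * norm.cdf(z) + sigma * norm.pdf(z), 0.0)

# 5) K-means over the top EI-ranked candidates; keep the best point per cluster.
top = grid.assign(EI=ei).nlargest(25, "EI")
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(
    top[["Phosphate", "Temperature"]].to_numpy())
batch = (top.assign(cluster=labels)
            .sort_values("EI", ascending=False)
            .drop_duplicates("cluster"))
print(batch[["Phosphate", "Temperature"]])    # next 5 conditions to test

In the full workflow, the five selected conditions would be evaluated against the published growth-rate data, appended to the observed set, and the cycle repeated.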
These added sentences clarify:
- That the GP is trained on OLS residuals rather than raw data.
- How the combined prediction and variance are computed.
- That EI is evaluated on a 20×20 grid.
- That K-means is run on the top EI-ranked candidates (with n_init=10) to select a batch of five diverse points.
- That this entire procedure is executed iteratively.
For readers who wish to examine every implementation detail—data loading, feature standardization, exact scikit-learn calls, pseudocode, and notebook outputs—we have made the complete Jupyter notebook available online at: https://github.com/benocd/ml-experiment-optimizer
We trust that these additions and the publicly accessible code repository fully address your concern about reproducibility and clarity. Please let us know if you would like any further elaboration on a specific substep.
Reviewer 2 Report
Comments and Suggestions for Authors
The manuscript introduces a hybrid machine learning framework combining OLS, Gaussian Processes, Expected Improvement, and K-means clustering to optimize experimental design. The approach is well explained and satisfactorily simulated on a validated biological dataset. The results are effectively presented and show potential utility in reducing experimental burden.
To further enhance the manuscript, the following changes are suggested:
(1) Please include a brief comparison with other optimization methods (e.g., Bayesian Optimization, TPE, or BOHB) to better place the suggested approach.
(2) Demonstrate or model scalability to experimental spaces of higher dimensions, or clearly indicate this as an area of future work.
(3) Sketch out possible challenges in applying such a framework in wet-lab contexts, such as noise, data variability, and system constraints.
In addition, the claim of expert-level decision-making must be made with a measure of caution against the backdrop of an absence of experimental validation. Overall, this is a relevant and timely contribution. The manuscript is publishable after moderate revision with focus on methodological positioning and pragmatic scope.
Author Response
(1) Please include a brief comparison with other optimization methods (e.g., Bayesian Optimization, TPE, or BOHB) to better place the suggested approach.
Response:
Thank you for suggesting that we contextualize our hybrid framework by comparing it with other popular black‐box optimization methods. In the revised manuscript, we have added a new paragraph (see Results, pp. 9–10) summarizing direct comparisons to both a Tree-structured Parzen Estimator (TPE) and a “vanilla” Bayesian Optimization (BO) loop.
To benchmark our hybrid OLS + GP + EI + K-means approach, we tested two other optimizers on the same Thalassiosira pseudonana growth-rate data and identical experimental budget (i.e., 4–5 cycles of 5 experiments each). First, we ran a Tree-structured Parzen Estimator (TPE) optimizer, sampling directly from the same phosphate–temperature search space. Under identical constraints, TPE required on average 16.8 ± 4.44 cycles (i.e., ~84 ± 22 experiments) to converge on the same optimum growth region—roughly four times more iterations than our hybrid method (Table 1). Second, we tested a “pure” Gaussian-Process-based Bayesian Optimization (BO) loop (i.e., GP + EI only, without the OLS prefit or K-means diversification). Although this simplified BO produced a similar overall convergence trend to our hybrid strategy, it was less precise: its best‐found condition deviated from the true optimum by 3.9 ± 0.7 (relative units), whereas our hybrid approach remained within 2.7 ± 2.2 of the true maximum under noise-free conditions.
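For transparency about how such a TPE baseline can be configured, the sketch below uses Optuna's TPESampler; the choice of Optuna, the parameter ranges, and the simulate_growth_rate placeholder are illustrative assumptions rather than the exact benchmark setup.

# Illustrative TPE baseline over the same phosphate-temperature search space.
import numpy as np
import optuna

def simulate_growth_rate(phosphate, temperature):
    # Placeholder surrogate with a single optimum; the real benchmark queried
    # the published T. pseudonana growth-rate data instead.
    return float(np.exp(-((phosphate - 2.0) ** 2) / 2.0
                        - ((temperature - 20.0) ** 2) / 50.0))

def objective(trial):
    phosphate = trial.suggest_float("phosphate", 0.1, 5.0)
    temperature = trial.suggest_float("temperature", 5.0, 30.0)
    return simulate_growth_rate(phosphate, temperature)

study = optuna.create_study(direction="maximize",
                            sampler=optuna.samplers.TPESampler(seed=0))
study.optimize(objective, n_trials=85)   # roughly 17 cycles of 5 experiments
print(study.best_params, study.best_value)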
We also attempted to incorporate the BOHB framework to leverage multi-fidelity sampling, but we were unable to adapt it to our setting in a way that remained strictly comparable (i.e., BOHB typically requires hierarchical fidelity levels or early-stopping criteria that do not map directly to single‐point, batch-size constraints). For this reason, we have omitted BOHB results.
Together, these comparisons demonstrate that our hybrid OLS + GP + EI + K-means method can reach the same optimum with far fewer experiments than TPE, and with higher accuracy than a standard GP-only BO loop (Table 1).
(2) Demonstrate or model scalability to experimental spaces of higher dimensions, or clearly indicate this as an area of future work.
Response:
Thank you for highlighting the need to clarify how our method scales to higher-dimensional experimental spaces. We agree that, while the underlying components (OLS, GP, EI, K-means) can in principle be applied to more than two variables, a full treatment of high-dimensional settings requires additional investigation. Accordingly, we have revised the original paragraph in the Perspective section to acknowledge this limitation and frame it as future work. The new text reads:
“In this paper, we demonstrated our ML strategy on two-dimensional data (phosphate and temperature). In principle, the same algorithmic pipeline (OLS → GP → EI → K-means) can be extended to higher-dimensional search spaces (e.g., temperature, light intensity, pH, phosphate, nitrogen, macronutrients, micronutrients, CO₂, total nutrients, salinity). However, practical application in 10 or more dimensions will require additional considerations—such as feature selection, kernel scaling, and computational optimizations—to maintain efficiency and interpretability. We therefore consider large-scale, multi-parameter optimization as an important area of future work: verifying performance across diverse biological datasets, assessing computational cost as dimensionality grows, and exploring strategies (e.g., sparse kernels or sequential variable grouping) to keep model inference and decision-making tractable when human intuition alone is insufficient.”
We commit to treating this as future work, explicitly noting that additional algorithmic refinements (e.g., dimensionality reduction, sparse (additive) GP kernels, or adaptive candidate generation) will be necessary to keep the approach both computationally feasible and interpretable.
We hope this revision addresses your concern by transparently stating the limitations of our two-dimensional proof-of-concept and by underscoring the need for further development before routine application in very high-dimensional experimental designs.
(3) Sketch out possible challenges in applying such a framework in wet-lab contexts, such as noise, data variability, and system constraints.
Response:
Thank you for raising the important point that real wet‐lab implementation will introduce sources of noise, variability, and practical constraints. We have added a new paragraph to the Perspective section to outline these challenges. The inserted text reads as follows:
“Practical Wet‐Lab Considerations – Although our simulations assumed ideal (noise‐free, or predictable noise) measurements, several challenges arise when deploying this hybrid framework in a real laboratory setting. First, measurement noise—stemming from instrument precision limits (e.g., plate reader fluctuations), pipetting errors, and biological heterogeneity—can obscure true growth‐rate signals. Without accounting for heteroscedastic noise, the OLS fit may be biased and the GP’s uncertainty estimates may be over-confident in regions with high variability. Second, data variability between batches (batch‐to‐batch differences in media preparation, cell inoculum density, or environmental conditions such as ambient temperature or humidity) can introduce systematic shifts that a purely data‐driven model might misinterpret as genuine trends. Third, system constraints—such as limited throughput (e.g., only 5–10 assays per day), reagent cost, and turnaround time for readouts—restrict how many conditions can be tested per cycle and how rapidly new data become available for retraining. Fourth, equipment calibration and drift (e.g., changes in light‐intensity outputs of incubators or gradual sensor degradation) can violate the assumption of a static response surface, requiring periodic recalibration or the inclusion of time‐dependent covariates.
To mitigate these issues, future work should incorporate noise‐aware GP kernels (e.g., heteroscedastic or Student‐t kernels) and replicate measurements to empirically estimate observation variance. Domain‐informed priors in the OLS step (for example, known saturating behaviour at extreme nutrient levels) can help stabilize global fits in the presence of noisy data. Automated experimentation platforms [28] and real‐time data pipelines can minimize human‐induced variability and accelerate throughput, but they also require validation to ensure that mechanical errors are within acceptable bounds. Finally, adaptive stopping criteria based on confidence intervals—rather than a fixed cycle count—may be necessary when experimental resources are scarce or when early‐stage measurements exhibit high uncertainty. Addressing these practical considerations is an important direction for translating our algorithm from in silico validation to robust, reproducible wet‐lab workflows.”
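To make the noise-aware modelling suggestion above more concrete, the sketch below (our illustration, not manuscript text) shows how replicate measurements can supply per-condition noise variances to a scikit-learn GP through the alpha argument, alongside a WhiteKernel whose noise level is learned from the data; all numeric values are placeholders.

# Sketch: GP whose uncertainty reflects measurement noise estimated from replicates.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern, WhiteKernel

# Three replicate growth-rate measurements per tested condition (placeholders).
X = np.array([[0.5, 10.0], [1.0, 15.0], [2.0, 20.0], [4.0, 25.0]])
replicates = np.array([[0.38, 0.42, 0.40],
                       [0.85, 0.95, 0.90],
                       [1.25, 1.35, 1.30],
                       [0.65, 0.75, 0.70]])
y = replicates.mean(axis=1)
per_point_var = replicates.var(axis=1, ddof=1) / replicates.shape[1]  # SEM squared

# Matern kernel for the response surface plus a WhiteKernel for residual noise;
# per-condition replicate variances enter via the alpha argument.
kernel = Matern(nu=2.5) + WhiteKernel(noise_level=1e-2)
gp = GaussianProcessRegressor(kernel=kernel, alpha=per_point_var).fit(X, y)

mu, sigma = gp.predict(np.array([[1.5, 18.0]]), return_std=True)
print(mu, sigma)   # predictive mean and std that account for observation noise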
We trust that this addition clearly sketches out the key challenges (noise, data variability, system constraints) and points toward strategies for addressing them in future wet‐lab implementations.
In addition, the claim of expert-level decision-making must be made with a measure of caution against the backdrop of an absence of experimental validation. Overall, this is a relevant and timely contribution.
Response
Thank you for highlighting the need to temper our claims and acknowledge the absence of wet-lab validation. We have revised the Conclusions section accordingly. The new text now reads:
“However, because these results are based on in silico simulations rather than side-by-side wet-lab trials, the assertion that the machine learning–driven procedure can match the decision-making capabilities of expert scientists should be interpreted cautiously. In a real laboratory setting—where noise, measurement variability, and logistical constraints come into play—further validation will be required. Nonetheless, the ability of ML to reach equivalent conclusions under ideal (noise-free) conditions highlights its transformative potential and motivates future work to test and refine this framework in true wet-lab environments.”
This revision explicitly notes that, without direct head-to-head experiments against human experts, we cannot definitively claim “expert-level” decision-making. We also signal that deployment in practical laboratory workflows will require accounting for noise, variability, and resource constraints. We trust this addresses the raised concern and provides a clearer, more balanced conclusion.
Reviewer 3 Report
Comments and Suggestions for Authors
Regarding Q1: Does the introduction provide sufficient background and include all relevant references?
The introduction of the paper offers a compelling and well-articulated historical narrative, tracing the evolution of the scientific method from philosophical foundations to modern empirical practices. This reads well and is interesting and draws on a rich body of philosophical and historical literature, creating a thoughtful context for the integration of machine learning into experimental science. This framing is intellectually engaging and provides a solid conceptual grounding; however, the paper lacks engagement with cutting-edge literature in contemporary machine learning. While the authors briefly mention Gaussian Processes, OLS, and Expected Improvement, they do not adequately situate their work within the broader context of recent developments in active learning, surrogate modelling, or hybrid modelling strategies in experimental design. Recent advancements in Bayesian optimization, deep kernel learning, and probabilistic programming, among others, are absent, and key references are missing; To strengthen the introduction, the authors should integrate a discussion of state-of-the-art ML techniques relevant to experimental optimization and draw connections between their hybrid approach and ongoing debates or innovations in the field. This would position the work more clearly within the current research frontier in machine learning and knowledge extraction and appeal more directly to interdisciplinary audiences. Moreover, at the end of the paper the authors argue for autonomously suggesting and prioritizing experiments; here the authors need to include the current debate about fully autonomous systems and human oversight, which is a current hot debate in the machine learning community: Human oversight in ML/AI refers to the design and implementation of systems that remain understandable, steerable, and responsible to human direction throughout their operation. These systems are expected to align with human ethical goals, values, and constraints, allowing humans to monitor, guide, or intervene as needed to ensure appropriate functioning. Effective oversight requires that the ML/AI behaves predictably and transparently, enabling users to interpret its decisions and influence its actions when necessary. In the context of machine learning and autonomous systems, human oversight involves mechanisms that allow users to correct, stop, or redirect the system’s behavior, particularly in uncertain or high-stakes scenarios. This includes preventing harmful outcomes, avoiding reward misspecification, and ensuring that the AI adapts appropriately to feedback. Human oversight is a foundational principle in the development of trustworthy and accountable AI, especially in domains where safety, ethics, and social impact are critical, this should at least be mentioned and here is a very recent reference: Holzinger, A., Zatloukal, K. & Müller, H. (2025). Is Human Oversight to AI Systems still possible? New Biotechnology, 85, 59--62, https://doi.org/10.1016/j.nbt.2024.12.003; and this brings this reviewer directly to a further issue: The authors mention, e.g. biology, agriculture, medicine directly in the abstract, but do not elaborate further. Here the authors need to provide at least an outlook and context - and here the authors can directly connect to the comment above: In high-stakes decision in e.g. 
agriculture - which affects all people on our planet - a human-centered approach is needed, which refers to the design and deployment of systems that explicitly account for the needs, goals, and expertise of human stakeholders across the domain value chain. Rather than replacing human decision-makers, such systems aim to augment human capabilities by providing interpretable, context-aware, and adaptive support for complex tasks in farming, resource management, and food production. This approach recognizes the diversity of users, including farmers, agronomists, and rural communities, and emphasizes usability, transparency, and trust. In practical terms, human-centered AI in agriculture may support site-specific crop recommendations, precision irrigation, yield forecasting, or early detection of pests and diseases, while allowing users to understand and influence the reasoning behind the system’s suggestions.
Regarding Q2: Is the research design appropriate?
The research design presented in the manuscript is generally appropriate for the stated aims, particularly in demonstrating the feasibility and efficiency of a hybrid machine learning framework for guiding experimental decisions. The use of a well-characterized biological dataset as a simulator allows for controlled benchmarking of the approach without introducing confounding variability from real-world lab conditions. The iterative simulation setup, employing Ordinary Least Squares regression for global trends and Gaussian Processes for local uncertainty modeling, is methodologically sound and well-aligned with the problem of optimizing experimental conditions in high-dimensional spaces. However, from a critical perspective, the design could be strengthened by including comparisons with more advanced or diverse machine learning baselines beyond OLS and GP, such as Bayesian neural networks or recent advances in meta-learning. Moreover, the validity of the conclusions would benefit from empirical validation using new experimental data rather than solely relying on simulations. Including ablation studies to isolate the contribution of each algorithmic component (e.g., Expected Improvement, K-means selection) would also enhance the interpretability and robustness of the results. Overall, the research design is coherent and conceptually justified, though it would benefit from additional methodological depth and empirical grounding.
Regarding Q3: Are the methods adequately described?
The methods are described with sufficient clarity to understand the overall structure and operation of the proposed hybrid machine learning framework. The authors explain the integration of Ordinary Least Squares regression, Gaussian Process regression, Expected Improvement as an acquisition function, and K-means clustering for diversity in sampling. The simulation-based experimental loop is laid out in a stepwise fashion, and the rationale for each component is briefly justified. Nonetheless, the methodological description would benefit from greater precision in several areas. The parameterization of the Gaussian Process model, including kernel choice and hyperparameter optimization, is not specified. The details of the Expected Improvement calculation, such as handling of noise or batch acquisition adjustments, are omitted. Similarly, the implementation of K-means clustering lacks clarity regarding the choice of the number of clusters and initialization. The stopping criterion based on EI thresholds is reasonable but could be contextualized further in terms of its sensitivity or theoretical grounding. Overall, while the methods are presented in a conceptually sound and readable manner, they fall short of full reproducibility. A more formal and detailed exposition would strengthen the methodological rigour and make the work more accessible to readers from the machine learning community.
Regarding Q4: Are the results clearly presented? Generally yes, but the clarity of the results would be strengthened by a more explicit summary of key findings in relation to the original experimental design, and by clearly separating empirical findings from speculative interpretations. Overall, while the results are comprehensible and effectively illustrated, a more rigorous and quantitatively detailed analysis would enhance their scientific value.
Generally, this reviewer is positive and has read this paper with pleasure, but for inclusion in this journal the machine learning background must be improved; a future outlook section would also be beneficial.
Author Response
Reviewer comment:
The introduction of the paper offers a compelling and well-articulated historical narrative, tracing the evolution of the scientific method from philosophical foundations to modern empirical practices. This reads well and is interesting and draws on a rich body of philosophical and historical literature, creating a thoughtful context for the integration of machine learning into experimental science. This framing is intellectually engaging and provides a solid conceptual grounding; however, the paper lacks engagement with cutting-edge literature in contemporary machine learning. While the authors briefly mention Gaussian Processes, OLS, and Expected Improvement, they do not adequately situate their work within the broader context of recent developments in active learning, surrogate modelling, or hybrid modelling strategies in experimental design. Recent advancements in Bayesian optimization, deep kernel learning, and probabilistic programming, among others, are absent, and key references are missing; To strengthen the introduction, the authors should integrate a discussion of state-of-the-art ML techniques relevant to experimental optimization and draw connections between their hybrid approach and ongoing debates or innovations in the field. This would position the work more clearly within the current research frontier in machine learning and knowledge extraction and appeal more directly to interdisciplinary audiences. Moreover, at the end of the paper the authors argue for autonomously suggesting and prioritizing experiments; here the authors need to include the current debate about fully autonomous systems and human oversight, which is a current hot debate in the machine learning community: Human oversight in ML/AI refers to the design and implementation of systems that remain understandable, steerable, and responsible to human direction throughout their operation. These systems are expected to align with human ethical goals, values, and constraints, allowing humans to monitor, guide, or intervene as needed to ensure appropriate functioning. Effective oversight requires that the ML/AI behaves predictably and transparently, enabling users to interpret its decisions and influence its actions when necessary. In the context of machine learning and autonomous systems, human oversight involves mechanisms that allow users to correct, stop, or redirect the system’s behavior, particularly in uncertain or high-stakes scenarios. This includes preventing harmful outcomes, avoiding reward misspecification, and ensuring that the AI adapts appropriately to feedback. Human oversight is a foundational principle in the development of trustworthy and accountable AI, especially in domains where safety, ethics, and social impact are critical, this should at least be mentioned and here is a very recent reference: Holzinger, A., Zatloukal, K. & Müller, H. (2025). Is Human Oversight to AI Systems still possible? New Biotechnology, 85, 59--62, https://doi.org/10.1016/j.nbt.2024.12.003; and this brings this reviewer directly to a further issue: The authors mention, e.g. biology, agriculture, medicine directly in the abstract, but do not elaborate further. Here the authors need to provide at least an outlook and context - and here the authors can directly connect to the comment above: In high-stakes decision in e.g. 
agriculture - which affects all people on our planet - a human-centered approach is needed, which refers to the design and deployment of systems that explicitly account for the needs, goals, and expertise of human stakeholders across the domain value chain. Rather than replacing human decision-makers, such systems aim to augment human capabilities by providing interpretable, context-aware, and adaptive support for complex tasks in farming, resource management, and food production. This approach recognizes the diversity of users, including farmers, agronomists, and rural communities, and emphasizes usability, transparency, and trust. In practical terms, human-centered AI in agriculture may support site-specific crop recommendations, precision irrigation, yield forecasting, or early detection of pests and diseases, while allowing users to understand and influence the reasoning behind the system’s suggestions.
Response:
Thank you for your insightful suggestions regarding (1) situating our work within cutting‐edge ML advances (e.g., active learning, deep kernels, probabilistic programming, hybrid strategies) and (2) addressing the debate around fully autonomous systems and human oversight—particularly in high-stakes domains such as agriculture, medicine, and environmental biology. Below, we present the two blocks of text that have been added to the Introduction and Perspective sections, along with a brief explanation of why each addition was made and how it addresses your comments. In the Introduction, we have added references to active learning, deep kernel learning, and probabilistic programming, and we have linked these state-of-the-art advances to our hybrid approach. We also explicitly mention “lab-in-the-loop” to underscore human oversight from the very beginning. In the Perspective section, we have rewritten and expanded the discussion of applications (agriculture, medicine, AI assistants, sustainability) to explicitly incorporate “human-in-the-loop” and “human-centered AI” principles. We have cited Holzinger et al. (2025) to root our argument in the latest debate about autonomy versus oversight in ML systems.
- Additions to the Introduction (pp. 2–4, lines 20–55)
Added Text (Introduction)
“Active learning methods—such as uncertainty sampling, query-by-committee, and margin-based strategies—have been shown to iteratively focus experimental efforts on the most informative perturbations, thereby dramatically reducing the total number of wet-lab assays required to uncover complex biological networks [20, 21]. Furthermore, embedding a lab-in-the-loop paradigm—where ML-generated hypotheses guide wet-lab experiments and their outcomes are subsequently used to retrain and refine the surrogate model—ensures continual human oversight and robustness, which is critical for high-stakes applications in biology, agriculture, and medicine [22].
Gaussian Process Regression (GP) is a powerful foundational ML model with great potential for active learning. Recent work in Bayesian optimization has introduced several techniques that extend beyond classic Gaussian Processes. For example, deep Gaussian processes (deep GPs) [23] and deep kernel learning embed neural networks within GP covariance functions to capture complex, nonlinear structure in high-dimensional spaces, enabling scalable surrogate modelling and richer uncertainty estimates [24]. At the same time, probabilistic programming frameworks have made it possible to specify highly expressive priors and bespoke likelihoods—allowing surrogate models to incorporate domain knowledge (e.g., mechanistic constraints or structured noise models) directly into the optimization loop [25, 26]. By leveraging these advances (including entropy-based acquisition functions and multi-fidelity or multi-task extensions), our hybrid approach can be seen as one instantiation within a rapidly evolving toolbox of methods for data-efficient, robust experimental design.
Building on these state-of-the-art active-learning and surrogate-modelling techniques, we focus here on two foundational methods—Gaussian Process Regression (GP) and Ordinary Least Squares (OLS)—that serve complementary roles in our hybrid framework. These methods can quantify uncertainty, make precise predictions, and systematically explore parameter spaces [27]. The key benefit of integrating ML into experimental design is its ability to shift from exhaustive or arbitrary sampling toward targeted, knowledge-driven exploration. OLS modeling has been widely used due to its computationally inexpensive nature, interpretability, and robustness in capturing global trends within experimental data [28]. However, while OLS excels at identifying broad patterns, it assumes linear relationships or polynomial transformations, potentially missing complex local interactions [29]. Complementary to OLS, GP regression addresses this limitation by modeling data through flexible, probabilistic functions that explicitly quantify uncertainty [18]. By combining both methods—using OLS for global approximation and GP for local exploration—experimenters can significantly enhance the precision and efficiency of the discovery process. This fusion of foundational algorithms underpins our unique hybrid machine-learning approach, which we demonstrate can guide experimentation as effectively as senior scientists.”
Why These Additions Were Made:
- Citing Cutting-Edge ML Techniques:
We acknowledge “deep Gaussian processes” and “deep kernel learning” as state-of-the-art advancements in surrogate modelling, directly responding to the request to discuss recent developments beyond classic GPs.
We highlight “probabilistic programming frameworks” (e.g., PyMC3, Stan) so that our surrogate models can incorporate domain knowledge (mechanistic or structured noise models).
- Positioning Our Hybrid Approach:
By explicitly stating that our OLS+GP strategy is “one instantiation within a rapidly evolving toolbox,” we situate the hybrid pipeline in the broader context of active learning, multi-fidelity methods, and multi-task Bayesian optimization.
We keep the narrative self-contained by explaining why OLS remains valuable (interpretability, computational efficiency) and how GP complements it (flexible uncertainty quantification).
- Linking Lab-in-the-Loop & Human Oversight:
The reference to “embedding a lab-in-the-loop paradigm” and “continual human oversight” shows that we have incorporated the notion of human-in-the-loop, which directly addresses the point regarding oversight and steerability.
- Additions to the Perspective Section (pp. 13–14, lines 330–350)
Added Text (Perspective)
“Agricultural Innovation – In agriculture, this approach could be used to identify optimal growing conditions for different crop varieties, balancing multiple interacting factors such as soil composition, irrigation, fertilization, and climate variables. A purely autonomous system might propose nutrient and water levels without accounting for farmers’ local knowledge or infrastructure constraints. To address this, a human-in-the-loop approach is essential [35]: agronomists and farmers review model suggestions, verify feasibility given resource limitations, and provide feedback (e.g., adjusting target ranges for fertilizer or adjusting for unexpected weather events). This combination of algorithmic exploration and domain expertise can drastically reduce the time and resources needed for crop optimization and precision farming, while ensuring transparency and trust in high-stakes decisions that affect food security and rural livelihoods.
Drug Discovery and Personalized Medicine – In medicine, the methodology could accelerate the optimization of drug combinations, dosages, and treatment schedules tailored to individual patients. By efficiently exploring the multidimensional space of treatment parameters, machine learning can support adaptive clinical trials and personalized therapeutic strategies. However, fully autonomous dose-finding without clinician oversight risks patient safety, especially when complex interactions produce unexpected side effects. Embedding human oversight—as championed in recent debates on trustworthy AI [19]—means that oncologists or clinical pharmacologists examine uncertainty estimates, validate model-driven treatment recommendations, and intervene if the system’s predictions conflict with ethical or physiological considerations. This ensures that model outputs remain interpretable, steerable, and aligned with human values.
AI Assistant for Scientists – These findings point to the realistic potential for developing AI-powered laboratory assistants. Such systems could work alongside human researchers, autonomously suggesting and prioritizing experiments, analyzing interim results, and refining hypotheses. The idea of an AI scientist has been extensively discussed in the literature [36, 37]. Yet, as the community increasingly stresses the limits of autonomy and the necessity of human-centered design, we emphasize that such assistants should remain both understandable and steerable. Effective oversight mechanisms allow users to correct, stop, or redirect the AI’s behavior in uncertain or high-stakes scenarios. In practice, a lab assistant might flag low-confidence suggestions (e.g., “phosphate level at 1.8 mM yields high variance”) for expert review, ensuring that final experimental decisions remain under human supervision.
Sustainable Research Practices – By reducing the number of experiments needed to reach statistically meaningful conclusions, ML-guided strategies promote resource-efficient and environmentally responsible science. This is particularly important in contexts where reagents, time, or biological samples are limiting. Nevertheless, lab workflows must also consider logistical constraints—scheduling of shared instruments, maintenance of sterility, and avoidance of sampling biases. A human-centered AI approach acknowledges these factors by maintaining transparency: researchers can inspect which variables most influenced the surrogate model’s uncertainty estimate, adjust batch sizes to match instrument throughput, and ensure that AI-recommended protocols conform to ethical and safety standards.
Human-Centered AI in High-Stakes Domains – In high-stakes fields such as agriculture, medicine, and environmental biology, decisions can have far-reaching ethical, economic, and social impacts. A human-centered AI philosophy—rooted in participatory design and interpretability—seeks to augment rather than replace human decision-makers. For example, in precision irrigation, an AI model might suggest water volumes that maximize yield, but a farmer can override recommendations based on impending drought forecasts or local water restrictions. Similarly, in yield forecasting, model predictions should be accompanied by confidence intervals and explanations (e.g., “soil nitrogen levels contributed 40% to yield variance”), so that stakeholders understand and trust the system’s suggestions. By explicitly accounting for the needs, goals, and expertise of diverse users—farmers, agronomists, clinicians, and rural communities—these human-centered systems emphasize usability, transparency, and trust, ensuring that AI remains aligned with human values and constraints even as automation increases.”
Why These Additions Were Made:
- Explicit Human-in-the-Loop Discussion:
We elaborated on the “Agricultural Innovation” and “Drug Discovery and Personalized Medicine” subsections to exemplify how human oversight is critical in high-stakes domains. Recognizing that fully autonomous recommendations can be dangerous, we show how domain experts (farmers, clinicians) must validate and, if necessary, override model suggestions.
- Reference to “Trustworthy AI” Debate:
We anchor our discussion of human oversight within a current debate on AI transparency and accountability. This directly addresses the request to engage with literature on “Is human oversight still possible?” and “human-centred design.”
- Human-Centred AI Philosophy:
We added a dedicated “Human-Centered AI in High-Stakes Domains” subsection to elaborate on how our hybrid framework can be deployed responsibly—e.g., by providing confidence intervals and explanations so that farmers and clinicians can understand and trust the AI’s reasoning.
- Maintaining Practical Concerns:
In “Sustainable Research Practices,” we remind readers that lab workflows have logistical constraints (instrument scheduling, sterility), and that a human-centered approach allows experimentalists to adjust batch sizes or protocols to meet those constraints.
Reviewer:
The research design presented in the manuscript is generally appropriate for the stated aims, particularly in demonstrating the feasibility and efficiency of a hybrid machine learning framework for guiding experimental decisions. The use of a well-characterized biological dataset as a simulator allows for controlled benchmarking of the approach without introducing confounding variability from real-world lab conditions. The iterative simulation setup, employing Ordinary Least Squares regression for global trends and Gaussian Processes for local uncertainty modeling, is methodologically sound and well-aligned with the problem of optimizing experimental conditions in high-dimensional spaces. However, from a critical perspective, the design could be strengthened by including comparisons with more advanced or diverse machine learning baselines beyond OLS and GP, such as Bayesian neural networks or recent advances in meta-learning. Moreover, the validity of the conclusions would benefit from empirical validation using new experimental data rather than solely relying on simulations. Including ablation studies to isolate the contribution of each algorithmic component (e.g., Expected Improvement, K-means selection) would also enhance the interpretability and robustness of the results. Overall, the research design is coherent and conceptually justified, though it would benefit from additional methodological depth and empirical grounding.
Answer:
Thank you for recognizing that our iterative OLS + GP framework is well aligned with the goal of efficiently guiding experimental decisions. We appreciate your suggestions to strengthen the methodological rigor by (1) comparing against more advanced ML baselines (e.g., Bayesian neural networks or meta‐learning), (2) performing empirical validation with new wet‐lab data, and (3) including ablation studies to isolate the impact of each component (EI, K-means, etc.).
- Implementing Bayesian neural networks (BNNs) or meta-learning algorithms requires (a) substantial reengineering to adapt them to our confound-free, two-dimensional diatom dataset (Phosphate × Temperature), and (b) careful hyperparameter tuning and all-at-once training on very limited data. In practice, BNNs tend to demand far more data to calibrate their uncertainty estimates effectively; our small, simulation-based dataset does not provide enough “real” variability to train or meaningfully compare a meta-learner. As a result, we found that attempting to add BNN or meta-learning baselines risked producing unstable or misleading results, which could obscure rather than clarify the relative benefits of our simpler hybrid pipeline.
Nonetheless, we have introduced a direct comparison to a Tree-structured Parzen Estimator (TPE) optimization (Table 1). TPE is a widely used, flexible Bayesian optimization method that is conceptually closer to our GP + EI loop (but uses kernel density estimators rather than a GP). Under the same simulated budget, TPE required on average 16.8 ± 4.44 cycles (≈ 84 experiments) to identify the optimum—roughly four times more than our hybrid approach (4–5 cycles, 20–25 experiments).
- Our current work deliberately focuses on a well-characterized, publicly available diatom growth‐rate dataset to (a) eliminate confounding noise from unmeasured lab variables, and (b) allow reproducible benchmarking via simulation. Setting up a new series of wet-lab experiments—culturing Thalassiosira pseudonana across fresh parameter grids—would require weeks or months to complete (including culture maintenance, instrumentation calibration, and data collection). While we recognize the importance of real‐world validation, such an endeavour exceeds the timeline and resources available for the present manuscript.
- While not an exhaustive ablation study, we do include a small ablation comparing our full hybrid method to a “pure” GP + EI Bayesian optimization (i.e., without OLS and K-means). That comparison shows a noticeable drop in accuracy—our hybrid approach remains closer to the true optimum, whereas GP + EI alone deviates to a larger extent. This demonstrates that including OLS for a global trend and K-means for batch diversity significantly improves performance under our simulated conditions.
While this single ablation (Hybrid vs. GP + EI) provides strong initial evidence of each component’s value, a more exhaustive suite of ablations (e.g., OLS + EI without GP, EI + K-means without OLS, random clustering instead of K-means) would further clarify how each piece contributes. Performing those additional ablation runs would multiply our simulation experiments and risk diluting the manuscript’s focus. We therefore consider full ablation experiments—ideally on higher-dimensional or noisy datasets—to be an important direction for future work.
Reviewer:
The methods are described with sufficient clarity to understand the overall structure and operation of the proposed hybrid machine learning framework. The authors explain the integration of Ordinary Least Squares regression, Gaussian Process regression, Expected Improvement as an acquisition function, and K-means clustering for diversity in sampling. The simulation-based experimental loop is laid out in a stepwise fashion, and the rationale for each component is briefly justified. Nonetheless, the methodological description would benefit from greater precision in several areas. The parameterization of the Gaussian Process model, including kernel choice and hyperparameter optimization, is not specified. The details of the Expected Improvement calculation, such as handling of noise or batch acquisition adjustments, are omitted. Similarly, the implementation of K-means clustering lacks clarity regarding the choice of the number of clusters and initialization. The stopping criterion based on EI thresholds is reasonable but could be contextualized further in terms of its sensitivity or theoretical grounding. Overall, while the methods are presented in a conceptually sound and readable manner, they fall short of full reproducibility. A more formal and detailed exposition would strengthen the methodological rigour and make the work more accessible to readers from the machine learning community.
Response:
Thank you for pointing out the need for greater precision in our methodological description. We have added a more detailed, self-contained paragraph to the Methods section (Section 2, lines 134–154) that specifies the GP parameterization (kernel choice, hyperparameter optimization, noise term), the EI calculation (exploration parameter, grid evaluation), the K-means implementation (number of clusters, initialization), and the stopping criterion (fixed batch cycles rather than an EI threshold).
Below is the exact text we have inserted; it should enable full reproducibility (and is accompanied by a link to our GitHub repository for all Python code and example notebooks).
- “Ordinary Least Squares (OLS) regression model fitted to the observed growth-rate data using a second-order polynomial that includes both quadratic and interaction terms (Phosphate², Temperature², and Phosphate×Temperature) to capture non-linear relationships between the predictor variables and the response variable;
- Gaussian Process (GP) regression model with a Matern (ν=2.5) kernel trained on the OLS residuals to capture uncertainty across the parameter space. The combined predictive mean at a new point x is ŷ(x) = ŷ_OLS(x) + μ_GP(x), and the predictive variance is σ̂²(x) = σ²_GP(x), where μ_GP and σ²_GP are the GP posterior mean and variance of the residuals. Hyperparameters for the GP model were optimised using maximum likelihood estimation via scikit-learn's internal routines. A small noise term (alpha=1e-6) was added to improve numerical stability;
- Expected Improvement (EI) implemented with a small exploration parameter (ξ = 0.01) and evaluated on a uniform 20×20 grid spanning the phosphate and temperature ranges to rank untested conditions by potential gain; this served as the decision criterion for identifying which untested conditions are most likely to yield new insights; and
- K-means clustering applied to the top candidate points (ranked by Expected Improvement), selecting one representative from each cluster so that each experimental cycle explores diverse yet high-potential regions of the parameter space. The number of clusters was set to match the cycle batch size (5 experiments), and the algorithm was run with 10 initializations (n_init=10) to ensure convergence. All implementation details, including complete Python code and example notebooks, are publicly available at https://github.com/benocd/ml-experiment-optimizer.”
In this updated text we specify the Matern kernel (ν = 2.5), the fact that the GP is trained on the OLS residuals, and that hyperparameters are found via maximum-likelihood in scikit-learn (with alpha = 1e-6). We give the explicit EI formula, note that ξ = 0.01, and explain that EI is evaluated on a uniform 20×20 grid of phosphate vs. temperature. We clarify that K-means is run on the top 25 EI points, with n_init = 10, and that the number of clusters equals the batch size (5). Instead of an EI threshold, we use a fixed batch size of 5 per cycle and run exactly 6 cycles. This decision reflects practical experimental limitations (e.g., throughput, resource constraints) and makes the algorithm easier to reproduce.
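For reference, the standard closed-form Expected Improvement expression for maximization with an exploration parameter ξ is EI(x) = (μ(x) − f_best − ξ)·Φ(Z) + σ(x)·φ(Z), with Z = (μ(x) − f_best − ξ)/σ(x) when σ(x) > 0 and EI(x) = 0 otherwise, where μ(x) and σ(x) are the combined predictive mean and standard deviation at candidate x, f_best is the best growth rate observed so far, and Φ and φ denote the standard normal CDF and PDF.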
We provide a link to the full code repository, and we believe the methods section now contains sufficient precision for reproducibility without overwhelming the reader with every line of code.
Reviewer:
Are the results clearly presented? Generally yes, but the clarity of the results would be strengthened by a more explicit summary of key findings in relation to the original experimental design, and by clearly separating empirical findings from speculative interpretations. Overall, while the results are comprehensible and effectively illustrated, a more rigorous and quantitatively detailed analysis would enhance their scientific value.
Answer:
Thank you for your feedback. We have added a new paragraph to the Discussion (immediately after the sensitivity‐analysis results) that explicitly summarizes our key findings relative to the original 75‐experiment full‐factorial design, and clearly distinguishes simulation‐based results from broader interpretation. The inserted text reads:
“By comparison, the original full-factorial experiment used 75 measurements, 25 experimental conditions each with 3 biological replicates, to locate the growth optimum. In contrast, our hybrid OLS + GP + EI + K-Means pipeline converged on that same optimum with only 20–25 simulated experiments (4–5 cycles), a ~66 % reduction in experimental load while maintaining ± 10 % accuracy under ideal, noise-free conditions. The TPE benchmark needed on average 16.8 ± 4.44 cycles (≈ 84 experiments) to reach the same target. These quantitative results are strictly from in silico simulations; implementing this algorithm in a real laboratory—where measurement noise, batch-to-batch variability, and logistical constraints arise—could alter convergence rates and accuracy. Empirical validation and more extensive ablation studies (e.g., testing OLS + EI, GP + EI only, or varying batch sizes) are therefore necessary next steps, but these findings establish a clear proof of concept that combining OLS and K-Means with GP + EI is substantially more efficient than conventional or GP-only strategies.”
This paragraph directly contrasts our simulated workload and accuracy with that of the original design and other baselines, and then explicitly notes that these are in silico results—separating empirical findings from speculative considerations. We believe this addition addresses your request for a more rigorous, quantitatively detailed summary.
We would like to thank the reviewer for their constructive feedback and valuable suggestions.
Round 2
Reviewer 3 Report
Comments and Suggestions for Authors
In the opinion of this reviewer, the authors have adequately addressed all the reviewers' comments; consequently, this reviewer would now argue to accept this paper.