1. Introduction
Spectroscopic techniques are widely recognized as cornerstones of Process Analytical Technology (PAT), a framework established for the real-time monitoring and control of industrial processes [1]. Raman spectroscopy is seeing growing adoption in the healthcare industry, as the associated sensors become increasingly fit for practical use [2,3]. In the context of biomanufacturing, these techniques are pivotal for gaining a deeper understanding of production dynamics and are fundamental to the development of new products [4,5,6,7]. The goal of PAT is to build quality into products by design, moving away from reliance on end-product testing towards a more proactive, science-based approach to process control [8,9].
However, new product development remains a notoriously lengthy and resource-intensive endeavor. The adoption of PAT-based tools, such as in-line spectroscopic probes, provides a powerful way to better manage these efforts [4]. A key advantage lies in moving beyond simple monitoring towards advanced strategies like Model Predictive Control (MPC), which uses process models to anticipate future outcomes [10,11,12,13]. Recent work on data-driven predictive modeling for commercial cell culture processes further demonstrates how at-line and on-line measurements can be used to predict critical performance attributes several days in advance, supporting proactive decision-making in biotherapeutic manufacturing [14]. For instance, predicting the final state of fermentation allows for a wider range of conditions to be tested more efficiently. This enables the early termination of unpromising culture batches, freeing up valuable equipment faster and significantly accelerating project timelines.
This predictive capability is particularly crucial in the early stages of product development, where historical data is limited and only a few batches have been completed. In such scenarios, building robust, traditional chemometric models is often not feasible due to the scarcity of comprehensive datasets [15] and the cost of generating large calibration sets [1,16,17]. Therefore, a significant advantage would come from a predictive method that is quick to implement and does not require specialized expertise in machine learning or an extensive historical database. Such an approach would unlock early process insights and accelerate optimization efforts at the most critical phase of development.
This paper introduces a method that directly addresses this challenge, enabling early and reliable predictions in cell culture monitoring without the need for a pre-existing model. The proposed method relies neither on the long and complex establishment of a mechanistic model, nor on machine learning-based techniques, which generally require more pre-existing, annotated batches than the context of process development allows. Instead, it relies solely on the spectra gathered during the early hours of the culture and extrapolates future spectra within a projected, simplified mathematical space before back-projecting them into the original space of the spectra. This is explained in Section 2, which also introduces the evaluation metrics and the actual data used to illustrate and validate the method. Section 3 puts the method into practice on actual vaccine manufacturing data. Section 4 discusses the soundness and usefulness of the method, as well as perspectives for its extension.
4. Discussion
Section 3 measures whether anticipated spectra are, on average, closer to the actual spectra at the considered future time point than to the actual spectra before and after that time point. This evaluation rigorously avoids overfitting, as the anticipated spectrum is compared to a spectrum that is not part of the window of spectra used to compute the projection, extrapolate to a future time point, and back-project into the original space. The method thus never has access to the actual spectrum it has to anticipate. This evaluation of the method itself, together with its comparison to a more naive approach, confirms the soundness of the proposed anticipation method. Still, it does not allow us to draw any conclusions on its usefulness in practice, because the distance between spectra remains an abstract notion. To shed additional light on this usefulness, we feed anticipated spectra to a (confidential) ML-based model of biomass for an actual biotech product at GSK [18,19]. Figure 5 graphically compares actual optical density (obtained by wet-lab measurements), spectrum-based predicted optical density (spectra + ML-model), and anticipated spectrum-based predicted optical density (spectra + anticipation + ML-model) for representative batches at multiple fermentation stages.
Across batches, consistent trends were observed, with predictions generated using a 10 h anticipation horizon showing a strong agreement with experimental measurements. As the prediction horizon increased, prediction accuracy progressively decreased. In particular, predictions generated 30 h in advance exhibited larger discrepancies between estimated and measured spectra at later fermentation stages. Nevertheless, the corresponding predicted optical density values remained in reasonable agreement with experimental measurements, indicating that longer-horizon predictions still provide a meaningful approximation of the system behavior. Similar results have been obtained by applying the same methodology to another platform based on CHO cells.
Regarding the choice of linear fitting for the extrapolation steps, the selection was driven by predictive performance rather than the visual quality of the fit alone. Although some accumulated trajectories may appear slightly non-linear over short intervals, the linear model consistently provided the most robust extrapolation behavior. In contrast, quadratic or cubic models could better adapt to local fluctuations in the observed spectra, but this also increased the risk of overfitting and reduced stability in the future domain. The objective of the method is therefore not to maximize the fit on the accumulated spectra themselves, but to ensure reliable anticipation of unseen spectra. For this reason, the linear configuration was retained because it provided the most reliable anticipation performance across batches, even if it did not always appear optimal from a visual standpoint.
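The selection criterion described above (fit quality on unseen points rather than on the accumulated ones) can be illustrated with a minimal sketch. It is not the code used in this work: the function name `extrapolation_error`, the split between fitted and held-out points, and the synthetic trajectory are all assumptions made for illustration.

```python
import numpy as np

def extrapolation_error(times, values, n_fit, degree):
    """Fit a polynomial on the first n_fit points and score it on the rest.

    Returns the RMSE between the extrapolated and the held-out values,
    i.e. the error in the "future domain", not the error of the fit itself.
    """
    times = np.asarray(times, dtype=float)
    values = np.asarray(values, dtype=float)
    # Fit only on the accumulated (already observed) part of the trajectory.
    coeffs = np.polyfit(times[:n_fit], values[:n_fit], deg=degree)
    # Evaluate the fitted polynomial on the unseen, later time points.
    predicted = np.polyval(coeffs, times[n_fit:])
    return float(np.sqrt(np.mean((predicted - values[n_fit:]) ** 2)))

# Hypothetical, synthetic trajectory: a noisy, slightly curved trend.
rng = np.random.default_rng(0)
t = np.arange(120) * (5 / 60)  # 120 points, one every 5 min, in hours
y = 0.8 * t + 0.01 * t**2 + rng.normal(scale=0.05, size=t.size)
for degree in (1, 2, 3):
    print(degree, extrapolation_error(t, y, n_fit=60, degree=degree))
```

Comparing degrees on held-out points in this way, batch by batch, is one possible way to make the trade-off between local fit and stability of the extrapolation explicit.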
To provide further perspective on these results, Appendix A also reports, for comparison, the performance (in RMSE) of the baseline naive approach and of the proposed approach (Table A4 and Table A5). The reported RMSE values are calculated relative to the OD values predicted by the chemometrics model. As the anticipation horizon increases, both approaches exhibit increasing RMSE values.
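For reference, the RMSE computation described above can be sketched as follows. This is a minimal illustration, not the code used in this work; the function name `rmse` and the argument names are assumptions. The reference is taken to be the OD predicted by the chemometrics model from the actual spectra, and the estimate the OD predicted from the anticipated spectra.

```python
import numpy as np

def rmse(reference, estimate):
    """Root-mean-square error between two series of OD values.

    reference : OD predicted by the chemometrics model from actual spectra
    estimate  : OD predicted by the same model from anticipated spectra
    """
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    # Square the pointwise differences, average them, take the square root.
    return float(np.sqrt(np.mean((estimate - reference) ** 2)))
```

Computing this value separately for each anticipation horizon would give one RMSE per horizon and per approach, which is the layout assumed here for the appendix tables.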
Based on these results, subject matter experts for this particular application state that a 20 h prediction remains useful. For a fermentation process that typically runs for approximately 90 h, a 20 h prediction horizon offers a full day’s foresight into the culture’s trajectory. This timeframe is particularly valuable for two main reasons:
Early Assessment of Experimental Runs: In a development setting, many runs are experimental, testing new process conditions. A 20 h prediction allows scientists to establish early assessments of whether a run is performing as expected. If the anticipated trajectory for key metrics (such as biomass) is highly unfavorable, it indicates that the experimental condition is not viable.
Resource Optimization: Based on this predictive insight, a data-driven “go/no-go” decision can be made. By identifying and terminating unpromising fermentation runs approximately 24 h after their start, one can save significant resources—valuable operator time, media, and, most importantly, bioreactor capacity. This allows us to reallocate these resources to a new, more promising experimental run much sooner than if we had waited for the full 90 h process to complete. This holds for both development and commercial manufacturing contexts.
Therefore, even with a slight increase in prediction error (as indicated by ), the ability to make a well-informed decision to stop a failing batch a day in advance provides a substantial benefit, significantly accelerating the overall process development timeline and optimizing resources at several process lifecycle stages.
Another attractive element is that the method is designed to be highly responsive. A new spectrum is acquired every 5 min, feeding the model with near-real-time information. The prediction is generated based on a moving window of the 60 most recent spectra (5 h). In the event of a sudden process deviation or equipment failure, the model’s reaction would be as follows:
1. Initial Detection: The first spectrum acquired after the event (within 5 min) will be different from the previous ones. When this new spectrum enters the moving window, the prediction will immediately start to diverge from its previous trajectory, reflecting the change.
2. Full Adaptation Time: The prediction will become fully representative of the new process state once the moving window is completely refreshed with data post-event. Given that a spectrum is taken every 5 min and the window requires 60 spectra, the model will have fully adapted to the new conditions after 5 h (60 spectra × 5 min/spectrum).
Therefore, while there is a lag, the prediction is not static. It begins to adjust within minutes of a deviation. The “refresh rate” is inherently tied to the data acquisition frequency and the window size, resulting in a full system adaptation within approximately 5 h of a major, sustained event. This timeframe is considered acceptable for common monitoring purposes, as it still allows for timely intervention.
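The arithmetic behind this adaptation time can be made explicit with a short sketch. The function names and the simplifying assumption that the event occurs just after an acquisition are illustrative only; the window size (60 spectra) and acquisition interval (5 min) are the values stated above.

```python
def full_adaptation_hours(window_size=60, interval_min=5):
    """Time, in hours, needed to refresh the whole moving window."""
    return window_size * interval_min / 60

def post_event_fraction(minutes_since_event, window_size=60, interval_min=5):
    """Fraction of the moving window acquired after a process event.

    Assumes (for illustration) that the event happens just after an
    acquisition, so the first post-event spectrum arrives interval_min later.
    """
    # Number of spectra acquired since the event, capped at the window size.
    n_new = min(minutes_since_event // interval_min, window_size)
    return n_new / window_size

print(full_adaptation_hours())    # 5.0
print(post_event_fraction(150))   # 0.5
```

Under these assumptions, half of the window reflects the new process state 2.5 h after the event, and the whole window after 5 h.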
It is worth emphasizing that the limitation related to non-monotonic behaviors mainly becomes critical when considering long anticipation horizons. In practice, for short-term predictions (e.g., on the order of one hour), the method was observed to behave robustly, including for metabolites exhibiting non-monotonic dynamics. Such short-term anticipation already provides valuable operational insight, as it enables early detection of upcoming trends and supports timely decision-making during the fermentation process.
By combining runs with different process parameters (, feeding), we ensured that the final dataset was representative of a wide range of potential real-world conditions. The strong performance of our method across this entire, varied collection of lots confirms its robustness and its ability to generalize beyond a single, idealized process.
These results demonstrate the interest of the method in practice when combined with chemometrics models. Another practical benefit of the method is that it does not require any training or reference dataset to put the spectral anticipation methodology into practice. Each application just requires the first few hours of spectra to be produced for a given cell culture batch. This also offers a modular approach, where spectra anticipation is independent of the specificities of models, mechanistic or not, that could be used: any valid model that takes spectra as input may benefit from this methodology, in a “plug-and-play” spirit. Another appealing aspect related to the simplicity of the method is that, contrary to deep learning methods, it does not require high-performance computational hardware (no large CPU, and no GPU at all), making it fit for application in practice in regular industrial contexts.
The method nevertheless has limitations, which point to directions for future work. Although this work shows that, in most cases, the first two components of the PCA are sufficient to capture around 99% of the variance, the method could be adapted to include a third component if presented with more challenging datasets. In such a case, a similar investigation should be carried out to study the nature of the relation between PC1 + PC2 on the one hand and PC3 on the other. As was done for the first component (with respect to time) and the second component (with respect to the first) in a cross-validated framework, several degrees of complexity could be studied.
Another limitation is that, by nature, the method copes better with nearly monotonic behaviors, as it relies on extrapolation. Biomass, for example, shows such nearly monotonic behavior: it increases, at least roughly, over the whole course of a cell culture run. Biomass is a central metric to monitor, but some metabolites, such as ethanol, may increase up to a peak and then decrease. The current spectral anticipation method will be limited in managing the switch from increase to decrease, while it will perform well overall on the rest of the curve (both before and after this peak). This kind of non-monotonic behavior could be anticipated by other methods which, contrary to the one introduced in this paper, rely on historical reference datasets containing spectra from such courses and could therefore anticipate the occurrence of this switch.
5. Conclusions
This paper introduces and evaluates a method for Raman spectra anticipation in the context of bioproduct cell culture. The method relies on the extrapolation of future spectra from observed spectra in a lower-dimensional space (two dimensions), obtained by a PCA. The extrapolated data is then back-projected to the original space of the spectra.
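The three steps summarized above (projection, extrapolation, back-projection) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation evaluated in this paper: the function name `anticipate_spectrum` is hypothetical, the PCA is computed through an SVD of the mean-centred window, and the extrapolation uses a linear fit of PC1 against time and of PC2 against PC1, following the description given in the Discussion.

```python
import numpy as np

def anticipate_spectrum(window, times, horizon):
    """Extrapolate one future spectrum from a window of observed spectra.

    window  : (n_spectra, n_wavenumbers) array of observed spectra
    times   : (n_spectra,) acquisition times, in hours
    horizon : how far past the last acquisition to anticipate, in hours
    """
    window = np.asarray(window, dtype=float)
    times = np.asarray(times, dtype=float)

    # PCA through an SVD of the mean-centred window; keep two components.
    mean = window.mean(axis=0)
    _, _, vt = np.linalg.svd(window - mean, full_matrices=False)
    components = vt[:2]                      # shape (2, n_wavenumbers)
    scores = (window - mean) @ components.T  # shape (n_spectra, 2)

    # Linear fit of PC1 against time, then of PC2 against PC1.
    pc1_slope, pc1_intercept = np.polyfit(times, scores[:, 0], deg=1)
    pc2_slope, pc2_intercept = np.polyfit(scores[:, 0], scores[:, 1], deg=1)

    # Extrapolate both scores to the future time point.
    future_time = times[-1] + horizon
    future_pc1 = pc1_slope * future_time + pc1_intercept
    future_pc2 = pc2_slope * future_pc1 + pc2_intercept

    # Back-project from the two-dimensional space to the spectral space.
    return mean + np.array([future_pc1, future_pc2]) @ components
```

The returned array has the same length as one input spectrum, so it can be passed to any model that takes a spectrum as input.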
The method thus positions itself for use in contexts where chemometric models are available to perform monitoring based on concomitant spectra [10], as an extension providing them with the capability of predicting the future course of a cell culture. Those models may or may not, in turn, be used in conjunction with mechanistic models [20].
The method is evaluated qualitatively and quantitatively. It is also put in a concrete biotech context. These evaluations support the attractiveness of the method, which appears suitable for anticipation horizons of around 20 h in a real industrial context. Another very appealing feature is that the method does not require a historical dataset to be put into practice: there is no model “learning”/fitting required.
Spectra anticipation implements predictive monitoring and even paves the way for prescriptive monitoring. Systems implementing spectra anticipation, chemometrics monitoring models, and an extra regulation feedback loop (on feeding, typically), could achieve self-regulation, course correction and even yield optimization.