Observability- and Identifiability-Guided Sensor-Set Design for Digital-Twin-Assisted Consolidated Bioprocessing

Korang Yeboah, Mark; Asiedu, Nana Yaw; Addo, Ahmad

doi:10.3390/s26123948

Open AccessArticle

Observability- and Identifiability-Guided Sensor-Set Design for Digital-Twin-Assisted Consolidated Bioprocessing

by

Mark Korang Yeboah

^1,2,*

,

Nana Yaw Asiedu

²

and

Ahmad Addo

²

¹

Chair of Dynamics and Control, University of Duisburg–Essen, Lotharstraße, 47057 Duisburg, Germany

²

Faculty of Mechanical and Chemical Engineering, Kwame Nkrumah University of Science and Technology, PMB, University Post Office, Kumasi 00233, Ghana

^*

Author to whom correspondence should be addressed.

Sensors 2026, 26(12), 3948; https://doi.org/10.3390/s26123948 (registering DOI)

Submission received: 21 May 2026 / Revised: 15 June 2026 / Accepted: 19 June 2026 / Published: 21 June 2026

(This article belongs to the Special Issue Soft Sensors and Sensing Techniques (2nd Edition))

Download

Browse Figures

Versions Notes

Abstract

Consolidated bioprocessing (CBP) is difficult to monitor because enzyme production, lignocellulose degradation, sugar release, and fermentation occur simultaneously under sparse measurement, feedstock variability, and plant–model mismatch conditions. This study proposes a computational sensor-set design framework for digital-twin-assisted CBP monitoring. A five-state virtual plant, consisting of active biomass, cellulolytic enzyme activity, residual insoluble substrate, soluble sugar, and ethanol, was used to evaluate all 16 ethanol-mandatory measurement packages formed from ethanol, sugar, biomass, enzyme, and residual-substrate proxy channels. Candidate sensor sets were assessed using finite-difference output sensitivities, Fisher-information-based state-observability and parameter-identifiability analyses, eigenvalue and parameter-correlation diagnostics, and paired Monte Carlo unscented Kalman filter soft-sensing reconstruction. Within the tested five-state virtual-plant benchmark and with the specified excitation schedule, noise assumptions, burden indices, and scoring objective, ethanol-only sensing provided the weakest support for state-aware CBP digital-twin reconstruction. At a

6 h

sampling interval, the state-observability log-pseudodeterminant increased from 4.18 with ethanol-only sensing to 8.56 after adding soluble sugar and to 16.42 with full-proxy monitoring. The ethanol–sugar–biomass–substrate package also gave strong reduced state-observability performance, with log-pseudodeterminants of 15.12, 13.76, and 12.51 at 6, 12, and

24 h

, respectively. Biomass and enzyme proxies contributed strongly to parameter learning, and the ethanol–sugar–biomass–enzyme package gave the strongest active parameter-identifiability performance, with log-pseudodeterminants of 10.82, 9.06, and 6.67 at 6, 12, and

24 h

, respectively. In the paired soft-sensing analysis, full-proxy monitoring reduced the mean latent-state RMSE from 1.1899 to 0.3756, followed by ethanol–biomass–enzyme–substrate with 0.3843 and ethanol–sugar–biomass–substrate with 0.4121. The primary aggregate ranking identified ethanol–sugar–biomass–substrate as the best overall package, with a sensor-value score of 0.8432 and a burden index of 7.0, followed by full-proxy monitoring with a score of 0.8173 and a burden index of 10.0. Robustness tests showed that ethanol–sugar–biomass–substrate remained top-ranked under uniform noise scaling, full UKF missingness, delay and bias stress test conditions, most scoring-weight scenarios, and all tested sensor-specific burden workflows. Full-proxy monitoring remained a close competitor under independent sensor-specific noise variation conditions and became top-ranked for some alternative operating trajectories. The proposed framework provides a simulation-based method for prioritizing informative measurement packages before implementing CBP digital twins in laboratory and pilot-plant settings.

Keywords:

consolidated bioprocessing; digital twin; sensor-set design; observability analysis; parameter identifiability; fisher information matrix; soft sensing; unscented Kalman filter; hybrid gray-box model; measurement uncertainty

1. Introduction

Consolidated bioprocessing (CBP) is a process-intensified route for producing ethanol and other biochemical products through the simultaneous biological integration of enzyme production, biomass deconstruction, and fermentation [1,2]. Its appeal lies in the potential to reduce dependence on externally supplied cellulases and to simplify the biomass-to-products chain compared with separated enzyme production, hydrolysis, and fermentation schemes. However, CBP remains technically challenging because its performance depends on host engineering, cellulosome biosynthesis, microbial consortia design, enzyme delivery, substrate accessibility, and feedstock deconstruction. Recent studies have therefore focused on CBP strains and genetic manipulation, synthetic cellulosomes and extracellular polymeric substances, microbial communities and co-cultures, enzyme delivery strategies, substrate accessibility, and process intensification [3,4,5,6,7]. From a monitoring perspective, CBP is not only a biochemical conversion process whose product yield should be maximized; it is also a nonlinear, partially observed dynamic system in which growth, enzyme synthesis, insoluble-substrate deconstruction, sugar release, sugar consumption, fermentation, and inhibition-related effects evolve on different time scales.

A central difficulty in developing digital-twin-assisted CBP is the limited availability of informative online measurements. Ethanol concentration is often one of the most accessible measurements, but it is a delayed product signal and cannot directly reveal whether poor batch performance originates from weak biomass growth, insufficient enzyme production, poor substrate accessibility, slow hydrolysis, sugar limitation, or product inhibition. In contrast, states that are more useful for process decision-making, such as living biomass concentration, active enzyme concentration, residual insoluble substrate, and soluble sugar concentration, are difficult to measure directly, continuously, or non-invasively. Similar measurement limitations have motivated the use of hybrid models, soft sensors, and online state-estimation methods in bioprocess monitoring and control [8,9,10,11]. Recent studies have also shown the value of soft-sensor recalibration, metabolic-heat-based soft sensing, spectroscopic monitoring, real-time biomass estimation, and Kalman-filter-based state–parameter estimation for improving the information content of bioprocess measurements [12,13,14,15,16,17,18]. In parallel, digital bioprocessing and digital chemical engineering studies emphasize accurate process measurements, model integration, predictive modeling, enabling digital technologies, and the progressive introduction of process analytical technology tools during process development [19,20,21,22,23,24,25]. For nonlinear systems, unscented Kalman filtering provides a convenient way to propagate uncertainty through nonlinear dynamics without local linearization [26,27]; therefore, the quality of soft sensing depends strongly on the informativeness of the available measurements.

The problem of choosing measurements for CBP cannot be reduced to the selection of a state-estimation algorithm. It is also necessary to determine whether the available measurements contain enough information for state reconstruction and parameter learning. State observability describes the extent to which hidden process states can be reconstructed from measured outputs, whereas parameter identifiability describes the extent to which model parameters can be estimated from available data. These concepts are especially important in partially observed biochemical systems, where unmeasured states and uncertain parameters can compensate for one another and produce similar measured trajectories [28,29]. Fisher-information- and sensitivity-based metrics provide practical tools for quantifying measurement informativeness, observability, and identifiability. This issue is particularly important for CBP because candidate measurements differ substantially in measurement burden, cost, delay, and online implementation difficulty. For example, ethanol, soluble sugar, biomass proxies, enzyme-activity proxies, and residual-substrate proxies are not equally easy to obtain; therefore, their value for digital-twin deployment should be evaluated before laboratory or pilot-scale implementation. Recent literature-derived CBP modeling has also shown that product prediction is strongly affected by heterogeneous feedstock–pretreatment–microbial descriptors, sparse product reporting, and missing-label structure [30].

Although CBP has been widely studied from biological, biochemical, and process-intensification perspectives [3,6,7,31,32], systematic evaluation of measurement sets for CBP digital twins remains limited. Existing soft-sensing and bioprocess-control literature demonstrates that software sensors and nonlinear state-estimation methods can improve process monitoring [8,9,10,11]. More recent work has further highlighted the importance of soft-sensor generalizability, sensor recalibration, biomass monitoring, spectroscopic data streams, and joint state–parameter estimation for reliable model-assisted monitoring [12,13,14,16,17,18]. However, there is still no systematic comparison of CBP measurement packages with respect to state observability, parameter identifiability, soft-sensor reconstruction performance, measurement burden, and robustness to measurement uncertainty and alternative scoring priorities. In particular, it remains unclear whether ethanol-only sensing is sufficient for constructing a state-aware CBP digital twin, whether ethanol and sugar measurements provide an adequate minimal package, or whether additional biomass, enzyme, and substrate proxies are required.

To address this gap, this study proposes a computational framework for evaluating CBP measurement packages according to state observability, parameter identifiability, soft-sensor reconstruction performance, measurement burden, and robustness to practical measurement imperfections. A compact hybrid gray-box CBP model was used as a virtual plant to generate finite-difference output sensitivities, Fisher-information-based observability and identifiability metrics, parameter-correlation diagnostics, eigenvalue spectra, condition-number diagnostics, and approximate uncertainty measures. Rather than restricting the analysis to a small preselected list, the workflow evaluated all ethanol-mandatory combinations of the five modeled measurement channels: ethanol, soluble sugar, biomass proxy, enzyme-activity proxy, and residual-substrate proxy. These candidate sensor sets were then tested using a Monte Carlo unscented Kalman filter reconstruction experiment under model–plant mismatch and measurement noise conditions, with common plant-mismatch and initial-estimate realizations paired across sensor sets within each replicate. Additional analyses assessed the sensitivity of the sensor-set ranking to uniform and sensor-specific noise changes, alternative operating trajectories, missing measurements, assay delay, systematic measurement bias, alternative scoring weights, and alternative sensor-specific measurement-burden scenarios.

The contribution of this paper is threefold. First, it provides a pre-experimental computational pipeline for ranking CBP measurement candidates before wet-lab or pilot-plant implementation while making the candidate-set dependence of the ranking explicit. Second, it combines state observability, parameter identifiability, and nonlinear UKF reconstruction performance rather than relying only on endpoint prediction or product monitoring. Third, it evaluates whether the resulting ranking remains defensible under common measurement and implementation stress conditions, including sensor-specific noise, missingness, delay, bias, operating-trajectory variation, and measurement-burden assumptions. The results are intended to support digital-twin readiness assessment and experimental planning for CBP, not to claim experimental validation of a specific organism, sensor platform, or pilot-scale process.

2. Model, Sensor Sets, and Information-Based Analysis

2.1. Hybrid CBP Digital-Twin Model

A compact hybrid gray-box model serves as the computational digital-twin core for consolidated bioprocessing (CBP). The model was formulated to include the most significant dynamic couplings inherent to CBP, namely, growth of microorganism biomass, synthesis of cellulolytic enzymes, hydrolysis of insoluble substrate, accumulation and consumption of soluble sugars, and ethanol synthesis. Simplified mechanistic and hybrid models are common choices for bioprocess monitoring, state estimation, and control due to their interpretation potential and computationally-efficient structures [8,9,10]. In addition, modern bioprocess digital twins and hybrid models place particular emphasis on model compactness required for online state estimation and uncertainty analysis [11,19,20,33,34,35]. The state vector is defined as

x (t) = {[\begin{matrix} X (t) & E (t) & B (t) & C (t) & P (t) \end{matrix}]}^{⊤},

(1)

where X is biomass activity, E is enzymatic activity, B is residual insoluble substrate, C is soluble sugars concentration, and P is ethanol concentration. The operating variables included temperature and pH as

u (t) = {[\begin{matrix} T (t) & pH (t) \end{matrix}]}^{⊤} .

(2)

The model describes CBP as three continuously varying regimes. The first regime represents growth and enzyme production, the second represents substrate hydrolysis, and the third represents ethanol fermentation. This phase-based structure corresponds to the well-known view of CBP as a process in which cellulase production, cellulose deconstruction, sugar release, and ethanol fermentation are coupled within a single operation [31,32]. Recent studies on CBP emphasize the role of this coupling in lignocellulosic conversion processes [5,6,7]. Related reduced-order CBP modeling and dynamic temperature–pH policy analyses have also shown that operating trajectories can strongly affect ethanol formation, conversion, productivity, and operating severity [36]. The phase weights are calculated using logistic functions as

\begin{matrix} ϕ_{1} (t) & = \frac{1}{1 + exp {k (t - t_{1})}}, \end{matrix}

(3)

\begin{matrix} ϕ_{3} (t) & = \frac{1}{1 + exp {- k (t - t_{2})}}, \end{matrix}

(4)

\begin{matrix} ϕ_{2} (t) & = max (0, 1 - ϕ_{1} (t) - ϕ_{3} (t)), \end{matrix}

(5)

where

k = 0.28

,

t_{1} = 18

h, and

t_{2} = 44

h. The weight functions were normalized so that their sum was equal to unity. This formulation does not cover all the regulatory mechanisms within the process as it was intended for use as a controlled dynamic benchmark in which the latent states were characterized by different time scales and levels of measurement relevance for observability and identifiability assessment.

The equations describing the nominal behavior of the system are written as

\begin{matrix} \frac{d X}{d t} & = ϕ_{1} [μ X (1 - \frac{X}{K}) - d X] - 0.0020 ϕ_{3} X, \end{matrix}

(6)

\begin{matrix} \frac{d E}{d t} & = ϕ_{1} [Y_{E} X - k_{\deg} E] - 0.0040 ϕ_{3} E, \end{matrix}

(7)

\begin{matrix} \frac{d B}{d t} & = - ϕ_{2} v_{hyd}, \end{matrix}

(8)

\begin{matrix} \frac{d C}{d t} & = ϕ_{2} (v_{hyd} - 0.10 C) - ϕ_{3} v_{ferm}, \end{matrix}

(9)

\begin{matrix} \frac{d P}{d t} & = ϕ_{3} v_{ferm}, \end{matrix}

(10)

where

K = 7.5

is the biomass carrying capacity constant. All growth, enzyme, hydrolysis, and fermentation dynamics depended on temperature and pH activity profiles. Hydrolysis and fermentation rates were computed as

\begin{matrix} v_{hyd} & = V_{max} (\frac{B}{K_{m} + B + ϵ}) tanh (E), \end{matrix}

(11)

\begin{matrix} v_{ferm} & = Y_{P} \frac{C}{1 + k_{inh} P}, \end{matrix}

(12)

where

ϵ

is a small constant ensuring division by a non-zero number. Hydrolysis rate increased with residual substrate amount and enzymatic activity but was also limited by the

tanh (E)

factor. Fermentation rate transformed soluble sugar into ethanol with a feedback-dependent effect on product-inhibition dynamics. Despite its simplicity, the model preserved the core of the monitoring problem related to CBP: ethanol is a late product, and the reasons for poor batch performance could be traced in biomass, enzymes, substrate, and sugar states.

The nominal kinetic constants, activity functions, initial values, and numerical bounds used in the hybrid CBP virtual plant are summarized in Table 1.

Using this parameterization, the time-varying kinetic terms in Equations (6)–(12) were defined as

\begin{matrix} μ & = μ_{0} f_{T, g} (T) f_{pH, g} (pH) θ_{μ}, \end{matrix}

(13)

\begin{matrix} Y_{E} & = Y_{E, 0} f_{T, g} (T) f_{pH, g} (pH) f_{pret} θ_{Y_{E}}, \end{matrix}

(14)

\begin{matrix} V_{max} & = V_{max, 0} f_{T, h} (T) f_{pH, h} (pH) f_{pret} θ_{V_{max}}, \end{matrix}

(15)

\begin{matrix} Y_{P} & = Y_{P, 0} f_{T, f} (T) f_{pH, f} (pH) θ_{Y_{P}}, \end{matrix}

(16)

\begin{matrix} k_{inh} & = k_{inh, 0} θ_{inh} . \end{matrix}

(17)

The temperature and pH activity functions were defined as clipped Gaussian factors as

\begin{matrix} f_{T, g} (T) & = clip \{exp [- {(\frac{T - 48}{12.5})}^{2}], 0.15, 1.25\}, \end{matrix}

(18)

\begin{matrix} f_{T, h} (T) & = clip \{exp [- {(\frac{T - 50}{13.0})}^{2}], 0.15, 1.25\}, \end{matrix}

(19)

\begin{matrix} f_{T, f} (T) & = clip \{exp [- {(\frac{T - 42}{11.0})}^{2}], 0.12, 1.20\}, \end{matrix}

(20)

\begin{matrix} f_{pH, g} (pH) & = clip \{exp [- {(\frac{pH - 5.6}{0.85})}^{2}], 0.15, 1.25\}, \end{matrix}

(21)

\begin{matrix} f_{pH, h} (pH) & = clip \{exp [- {(\frac{pH - 5.2}{0.75})}^{2}], 0.15, 1.25\}, \end{matrix}

(22)

\begin{matrix} f_{pH, f} (pH) & = clip \{exp [- {(\frac{pH - 6.0}{0.90})}^{2}], 0.15, 1.20\} . \end{matrix}

(23)

Seven log-multiplicative uncertainty factors were used to facilitate the parameter identifiability analysis as

θ = {[\begin{matrix} θ_{μ} & θ_{Y_{E}} & θ_{V_{max}} & θ_{Y_{P}} & θ_{d} & θ_{inh} & θ_{feed} \end{matrix}]}^{⊤} .

(24)

They scaled growth rate, enzyme yield coefficient, hydrolysis capacity, ethanol yield, decay rate, inhibition rate, and feedstock accessibility, respectively. The logarithmic parameter transformation was appropriate because the kinetic uncertainties were multiplicative and because simultaneous biochemical effects can create practical identifiability degeneracies [28,29].

The assumed initial condition was

x_{0} = {[\begin{matrix} 0.10 & 0 & B_{0} & 0 & 0 \end{matrix}]}^{⊤}, B_{0} = \frac{S_{0}}{5},

(25)

where

S_{0} = 100

and thus

B_{0} = 20

in the present study. The batch cycle duration was 96 h, while numerical integration was done by means of the fourth-order Runge–Kutta method with a 2 h internal time step. In order to conduct the observability and identifiability analysis, the open-loop control input

(T, pH)

was scheduled in such a way that all growth/enzyme synthesis, hydrolysis, and fermentation modes were activated as

(T, pH) = \{\begin{matrix} (48 ° C, 5.60), & t < 24 h, \\ (50 ° C, 5.20), & 24 \leq t < 54 h, \\ (42 ° C, 6.00), & t \geq 54 h . \end{matrix}

(26)

This profile was not designed for optimal production of ethanol. The objective was to create a sufficiently representative trajectory to compare potential sensor suites. Excitation is vital in this respect as information-theoretic and sensitivity measures of identifiability are trajectory-dependent [37,38]. In other words, the model can be considered an experimental template for sensor suite design but not as a fully developed organism-specific CBP model.

2.2. Candidate Sensor Sets and Measurement Assumptions

The sensor library includes commonly used process measurements and potential proxy measurements for latent CBP states. Five modeled measurement channels are considered as

M = {P, C, X, E, B},

(27)

where P, C, X, E, and B denote ethanol, soluble sugar, a biomass proxy, an enzyme-activity proxy, and a residual insoluble-substrate proxy, respectively. This choice reflects a common monitoring challenge in CBP: product concentration is comparatively straightforward to monitor, whereas biological, enzymatic, hydrolysis-related, and intermediate-substrate states are sparse, delayed, or difficult to measure online. Soft-sensing approaches and process analytical technologies have been proposed to address similar monitoring limitations in bioprocess development [8,9,10,11,19,24].

The nominal measurement standard deviations and relative measurement-burden indices are summarized in Table 2. The burden index is dimensionless and represents relative sampling effort, assay latency, calibration burden, and online implementation difficulty rather than direct monetary cost. The nominal workflow assumes that ethanol and soluble sugar can be measured through routine at-line product and sugar analytics or calibrated spectroscopic/biosensor routes; biomass can be approximated using an optical, dielectric, or dry-weight-calibrated proxy; enzyme activity usually requires a higher-burden at-line or offline assay; and residual insoluble substrate requires a solids-related proxy or offline/at-line solids measurement. These values are therefore computational design assumptions for pre-experimental sensor-set comparison, not platform-calibrated constants. Sensitivity to alternative sensor-specific burden assumptions is examined later in the robustness analysis.

Ethanol is treated as the mandatory baseline measurement because it is the most direct product signal and represents the minimum product-monitoring configuration against which additional measurements are compared. To avoid limiting the recommendation to a small preselected candidate list, the analysis evaluates all ethanol-mandatory combinations formed by adding any subset of the remaining four measurement channels as

S_{P} = \{{P} \cup A : A \subseteq {C, X, E, B}\} .

(28)

This formulation gives

2^{4} = 16

candidate sensor packages. The full candidate set is defined as

\begin{matrix} S_{1} & = {P}, & S_{2} & = {P, C}, & S_{3} & = {P, X}, & S_{4} & = {P, E}, \\ S_{5} & = {P, B}, & S_{6} & = {P, C, X}, & S_{7} & = {P, C, E}, & S_{8} & = {P, C, B}, \\ S_{9} & = {P, X, E}, & S_{10} & = {P, X, B}, & S_{11} & = {P, E, B}, & S_{12} & = {P, C, X, E}, \\ S_{13} & = {P, C, X, B}, & S_{14} & = {P, C, E, B}, & S_{15} & = {P, X, E, B}, & S_{16} & = {P, C, X, E, B} . \end{matrix}

(29)

The ethanol-only set serves as the baseline because it represents the most straightforward product-monitoring configuration. The two- and three-channel sets test whether a small number of biochemical, biological, enzymatic, or hydrolysis-related measurements can substantially improve observability, identifiability, and state reconstruction. The four-channel sets test whether near-complete monitoring can provide most of the value of the full-proxy package while omitting one high-burden measurement. The full-proxy set represents the upper information benchmark within the modeled sensor library. Recent advances in soft sensing and state estimation in lignocellulosic and fermentation processes suggest that such measurement-set comparisons are necessary because hybrid sensors and state estimators depend strongly on which state variables are available for correction [39,40,41].

Measurement sensitivities are computed for sampling periods of 6, 12, and

24 h

. These periods represent increasingly sparse sampling scenarios in laboratory and pilot-scale operation. For a given sensor set

S

and sampling period

Δ t_{s}

, the model outputs are obtained by concatenating the measurements at each sample point into a single vector as

y_{S} = {[\begin{matrix} h_{S} {(x (t_{0}))}^{⊤} & h_{S} {(x (t_{1}))}^{⊤} & \dots & h_{S} {(x (t_{N}))}^{⊤} \end{matrix}]}^{⊤},

(30)

where

h_{S} (\cdot)

selects the state components corresponding to the sensors in

S

. Each output component is weighted by its corresponding nominal measurement standard deviation, so that more precise measurements contribute more strongly to the weighted sensitivity and Fisher-information calculations.

The modeled measurement channels represent idealized scalar outputs that could be obtained using different online, at-line, or offline measurement technologies. Examples of practical process analytical technology (PAT), assay, and soft-sensor routes are summarized in Table 3. This distinction is important because a single process variable can be measured with different levels of delay, calibration burden, matrix sensitivity, drift behavior, detection-limit constraint, and online feasibility [39,40,41,42].

2.3. State Observability Analysis

State observability is assessed by evaluating the sensitivity of the weighted output vector to perturbations in the initial state. Given a sensor set

S

, the finite-difference sensitivity with respect to the jth initial state is computed from perturbed initial conditions as

S_{x, j} = \frac{y_{S} (x_{0} + h_{j}^{+} e_{j}) - y_{S} (x_{0} - h_{j}^{-} e_{j})}{h_{j}^{+} + h_{j}^{-}},

(31)

where

e_{j}

is the jth unit vector. The nominal perturbation magnitude is defined as

ε_{j} = 10^{- 3} max (| x_{0, j} |, 1) + 10^{- 4} .

(32)

For states whose negative perturbation does not violate nonnegativity,

h_{j}^{+} = h_{j}^{-} = ε_{j}

, which gives the usual symmetric finite difference. When the negative perturbation would produce a negative initial state, the lower perturbed value is clipped to the feasible nonnegative bound and the actual perturbation distance is used in the denominator. Thus,

h_{j}^{+} = ε_{j}, h_{j}^{-} = x_{0, j} - max (0, x_{0, j} - ε_{j}) .

(33)

This definition avoids retaining a denominator of

2 ε_{j}

when the negative perturbation is clipped. For example, when

x_{0, j} = 0

, the calculation becomes a one-sided finite difference with denominator

ε_{j}

. The actual perturbation distances used in the computation are recorded as part of the reproducibility output.

The weighted state-sensitivity matrix is then calculated as

{\tilde{S}}_{x} = W_{S}^{- 1} S_{x},

(34)

where

W_{S}

is the diagonal matrix of measurement standard deviations repeated over all sampling times. More precise measurements therefore receive larger weights in the information calculation. The state-information matrix is approximated as

F_{x} = {\tilde{S}}_{x}^{⊤} {\tilde{S}}_{x} .

(35)

This matrix is not the structural observability matrix in the differential geometric sense. Rather, it is a finite-horizon Fisher-information approximation that measures how much information the sampled outputs contain about perturbations in the initial states. This type of sensitivity-based information analysis is widely used for practical observability assessment in complex biological and nonlinear dynamical systems [29,37,38].

Let

λ_{i} (F_{x})

denote the eigenvalues of

F_{x}

. The numerical eigenvalue threshold is defined explicitly as

τ_{x} = max (10^{- 8} λ_{max} (F_{x}), 10^{- 12}) .

(36)

The numerical rank is then computed as

r_{x} = # \{i : λ_{i} (F_{x}) > τ_{x}\} .

(37)

The log-pseudodeterminant is calculated over the active eigenvalues as

{log}_{10} pdet (F_{x}) = \sum_{λ_{i} (F_{x}) > τ_{x}} {log}_{10} (λ_{i} (F_{x})) .

(38)

The minimum active eigenvalue is defined as

λ_{min, x}^{+} = min \{λ_{i} (F_{x}) : λ_{i} (F_{x}) > τ_{x}\},

(39)

and the corresponding condition number is defined as

κ (F_{x}) = \frac{λ_{max} (F_{x})}{λ_{min, x}^{+}} .

(40)

If no eigenvalue exceeds the numerical threshold, the rank is set to zero, and the log-pseudodeterminant and condition-number diagnostics are treated as undefined or non-informative.

Approximate state uncertainty is estimated using a ridge-regularized pseudoinverse as

Σ_{x} \approx {(F_{x} + ρ I)}^{†}, ρ = 10^{- 10} .

(41)

Using these criteria, sensor ensembles can be compared not only by the total amount of information they provide but also by whether weakly informed state directions remain unresolved.

2.4. Parameter Identifiability Analysis

Parameter identifiability is analyzed using the same information-based framework, but with sensitivities computed with respect to logarithmically scaled model parameters instead of initial states. The parameter vector is defined as

θ = {[\begin{matrix} θ_{μ} & θ_{Y_{E}} & θ_{V_{max}} & θ_{Y_{P}} & θ_{d} & θ_{inh} & θ_{feed} \end{matrix}]}^{⊤},

(42)

where all nominal log-parameters are zero, corresponding to multiplicative scale factors of one. The finite-difference derivative of the output vector with respect to the jth log-parameter is computed as

S_{θ, j} = \frac{y_{S} (exp (θ + δ e_{j})) - y_{S} (exp (θ - δ e_{j}))}{2 δ}, δ = 10^{- 3} .

(43)

The exponential mapping ensures that the perturbations correspond to relative changes in the kinetic and feedstock-accessibility factors. The weighted parameter-sensitivity matrix is then calculated as

{\tilde{S}}_{θ} = W_{S}^{- 1} S_{θ},

(44)

and the corresponding Fisher-information approximation is

F_{θ} = {\tilde{S}}_{θ}^{⊤} {\tilde{S}}_{θ} .

(45)

The same active-eigenvalue diagnostics used for state observability are computed for parameter identifiability. Specifically, the numerical threshold for the parameter-information matrix is defined as

τ_{θ} = max (10^{- 8} λ_{max} (F_{θ}), 10^{- 12}),

(46)

the numerical rank is computed as

r_{θ} = # \{i : λ_{i} (F_{θ}) > τ_{θ}\},

(47)

and the active log-pseudodeterminant is calculated as

{log}_{10} pdet (F_{θ}) = \sum_{λ_{i} (F_{θ}) > τ_{θ}} {log}_{10} (λ_{i} (F_{θ})) .

(48)

The minimum active eigenvalue and condition number are computed from the same active-eigenvalue set. These diagnostics report how much information is available in the identifiable parameter directions and whether some parameter directions remain weakly informed.

Because the active log-pseudodeterminant can change when the numerical rank changes, a fixed-dimension regularized determinant is also computed for the seven-dimensional parameter vector as

D_{θ, reg} = {log}_{10} det (F_{θ} + ρ I_{7}) = \sum_{i = 1}^{7} {log}_{10} (λ_{i} (F_{θ}) + ρ), ρ = 10^{- 10} .

(49)

This fixed-dimension metric provides a complementary D-optimality-style comparison that does not discard low-information parameter directions. The reported identifiability results therefore distinguish between active information volume, numerical-rank coverage, minimum eigenvalue behavior, and fixed-dimension regularized information.

Approximate parameter uncertainty is derived from the same ridge-regularized pseudoinverse as

Σ_{θ} \approx {(F_{θ} + ρ I_{7})}^{†}, ρ = 10^{- 10},

(50)

with the standard error of the jth log-parameter estimated by

SE (θ_{j}) = \sqrt{{[Σ_{θ}]}_{j j}} .

(51)

The associated approximate multiplicative 95% uncertainty factor is defined as

exp (1.96 SE (θ_{j})) .

(52)

Correlations between parameters are defined as

R_{i j} = \frac{{[Σ_{θ}]}_{i j}}{\sqrt{{[Σ_{θ}]}_{i i} {[Σ_{θ}]}_{j j}}} .

(53)

Large off-diagonal correlation factors indicate parameter pairs that are difficult to estimate separately using the proposed measurements. This issue is important for partially observed bioprocesses because alternative kinetic parameter combinations can generate nearly identical ethanol or sugar time courses. When data are sparse, noisy, or insufficiently diverse, identifiability analysis provides a computational filter for selecting CBP measurement packages with stronger potential for useful parameter inference [28,29].

3. Soft-Sensor Evaluation, Sensor Ranking, and Robustness Assessment

3.1. Soft-Sensor Reconstruction Test

After the observability and identifiability analyses, each candidate sensor set was evaluated for nonlinear soft-sensing reconstruction under model–plant mismatch, initial-state uncertainty, and measurement noise conditions. An unscented Kalman filter (UKF) was used because it propagates mean and covariance information through nonlinear and phase-dependent dynamics without local linearization, which is important when some CBP states are only partially measured. Recent bioprocess monitoring studies emphasize that soft sensors, hybrid models, and model-based state-estimation methods are essential for digital bioprocessing because key physiological states are often unavailable from direct online measurements [11,40,41,43]. Related work on sensor-assisted bioprocess monitoring has also demonstrated the value of metabolic-heat-based soft sensing, spectroscopic monitoring, biomass estimation, soft-sensor recalibration, and joint state–parameter estimation for improving process-state reconstruction [12,13,14,15,16,17,18].

For the five-state CBP model, the UKF state estimate and covariance matrix at time

t_{k}

are denoted by

{\hat{x}}_{k | k}

and

P_{k | k}

, respectively. The sigma points were generated as

\begin{matrix} χ_{k | k}^{(0)} & = {\hat{x}}_{k | k}, \end{matrix}

(54)

\begin{matrix} χ_{k | k}^{(i)} & = {\hat{x}}_{k | k} + {[\sqrt{(n + λ) P_{k | k}}]}_{i}, i = 1, \dots, n, \end{matrix}

(55)

\begin{matrix} χ_{k | k}^{(i + n)} & = {\hat{x}}_{k | k} - {[\sqrt{(n + λ) P_{k | k}}]}_{i}, i = 1, \dots, n, \end{matrix}

(56)

with

n = 5

and

λ = α^{2} (n + κ) - n .

(57)

The UKF parameters were

α = 0.35

,

β = 2

, and

κ = 0

. Each sigma point was propagated over one plant step using the same fourth-order Runge–Kutta integration scheme as the virtual plant:

χ_{k + 1 | k}^{(i)} = f_{Δ t} (χ_{k | k}^{(i)}, u_{k}, θ_{model}),

(58)

where

f_{Δ t} (\cdot)

denotes the CBP model integrated over one internal time step. The predicted mean and covariance were calculated as

\begin{matrix} {\hat{x}}_{k + 1 | k} & = \sum_{i = 0}^{2 n} W_{i}^{(m)} χ_{k + 1 | k}^{(i)}, \end{matrix}

(59)

\begin{matrix} P_{k + 1 | k} & = Q + \sum_{i = 0}^{2 n} W_{i}^{(c)} (χ_{k + 1 | k}^{(i)} - {\hat{x}}_{k + 1 | k}) {(χ_{k + 1 | k}^{(i)} - {\hat{x}}_{k + 1 | k})}^{⊤}, \end{matrix}

(60)

where Q is the process-noise covariance matrix, and

W_{i}^{(m)}

and

W_{i}^{(c)}

are the conventional UKF mean and covariance weights.

At sampling instants, the measurement equation for sensor set

S

was

y_{k} = h_{S} (x_{k}) + v_{k}, v_{k} \sim N (0, R_{S}),

(61)

where

R_{S}

is the diagonal measurement-noise covariance matrix for the channels included in

S

. The predicted measurement sigma points and predicted measurement mean were calculated as

\begin{matrix} z_{k + 1 | k}^{(i)} & = h_{S} (χ_{k + 1 | k}^{(i)}), \end{matrix}

(62)

\begin{matrix} {\hat{y}}_{k + 1 | k} & = \sum_{i = 0}^{2 n} W_{i}^{(m)} z_{k + 1 | k}^{(i)} . \end{matrix}

(63)

The innovation covariance matrix and state–measurement cross-covariance matrix were calculated as

\begin{matrix} S_{k + 1} & = R_{S} + \sum_{i = 0}^{2 n} W_{i}^{(c)} (z_{k + 1 | k}^{(i)} - {\hat{y}}_{k + 1 | k}) {(z_{k + 1 | k}^{(i)} - {\hat{y}}_{k + 1 | k})}^{⊤}, \end{matrix}

(64)

\begin{matrix} C_{x y, k + 1} & = \sum_{i = 0}^{2 n} W_{i}^{(c)} (χ_{k + 1 | k}^{(i)} - {\hat{x}}_{k + 1 | k}) {(z_{k + 1 | k}^{(i)} - {\hat{y}}_{k + 1 | k})}^{⊤} . \end{matrix}

(65)

The Kalman gain and measurement-update step were computed as

\begin{matrix} K_{k + 1} & = C_{x y, k + 1} S_{k + 1}^{†}, \end{matrix}

(66)

\begin{matrix} {\hat{x}}_{k + 1 | k + 1} & = {\hat{x}}_{k + 1 | k} + K_{k + 1} (y_{k + 1} - {\hat{y}}_{k + 1 | k}), \end{matrix}

(67)

\begin{matrix} P_{k + 1 | k + 1} & = P_{k + 1 | k} - K_{k + 1} S_{k + 1} K_{k + 1}^{⊤} . \end{matrix}

(68)

The pseudoinverse was used to improve numerical stability when innovation covariance matrices were close to singular.

Estimation accuracy was evaluated using a Monte Carlo simulation experiment. For each of the 16 ethanol-mandatory sensor-set configurations,

N_{MC} = 100

simulation replicates were performed. In each replicate, multiplicative plant–model mismatch was imposed on the growth, enzyme-yield, hydrolysis-capacity, ethanol-yield, decay, inhibition, and feedstock-accessibility factors. The UKF uses the nominal model and therefore does not have access to the replicate-specific plant perturbation. Initial-state uncertainty was introduced by perturbing the initial state used by the estimator relative to the true plant initial state while enforcing physical nonnegativity bounds.

To ensure a fair paired comparison among sensor sets, the Monte Carlo design uses common plant and estimator realizations across all sensor packages. For a given replicate r, the same plant-parameter mismatch and the same initial-estimate perturbation were used for every candidate sensor set. Thus, differences in reconstruction error between two sensor sets are attributable to the measurement package rather than to different simulated plants. Measurement noise was generated consistently by channel: for each replicate, time point, and measurement channel, a channel-specific random-noise stream was used, so sensor sets sharing a channel received the same noise realization for that channel. Additional sensors introduced additional channel-specific noise streams without changing the underlying plant or initial-condition realization. This common-random-number design provides paired replicate-wise RMSE differences for the statistical comparisons.

For replicate r, the reconstruction error of state j is defined as

{RMSE}_{j, r} = \sqrt{\frac{1}{N_{t}} \sum_{k = 1}^{N_{t}} {({\hat{x}}_{j, k}^{(r)} - x_{j, k}^{(r)})}^{2}},

(69)

where

N_{t}

is the number of simulated time points. The latent-state RMSE was computed across the four non-product states:

{RMSE}_{latent, r} = \frac{1}{4} ({RMSE}_{X, r} + {RMSE}_{E, r} + {RMSE}_{B, r} + {RMSE}_{C, r}) .

(70)

Ethanol is excluded from this latent-state average because ethanol is measured in every ethanol-mandatory candidate set and therefore does not represent a hidden-state reconstruction challenge. Additional reported statistics include the final absolute estimation error:

e_{j, r}^{final} = |{\hat{x}}_{j, N_{t}}^{(r)} - x_{j, N_{t}}^{(r)}|,

(71)

the mean absolute error, and the final covariance trace:

tr (P_{N_{t} | N_{t}}^{(r)}) .

(72)

Each non-baseline sensor set was compared with ethanol-only monitoring using the paired replicate-specific latent-state RMSE differences:

d_{r} = {RMSE}_{latent, r}^{Ethanol only} - {RMSE}_{latent, r}^{S} .

(73)

A positive value of

d_{r}

indicates that sensor set

S

reduces latent-state reconstruction error relative to ethanol-only monitoring for the same replicate. Because the candidate space contains 16 ethanol-mandatory packages, there are 15 ethanol-only contrasts. The paired Wilcoxon signed-rank test was used as a secondary nonparametric comparison [44]. Bootstrap confidence intervals were also computed for the paired RMSE reduction, and false-discovery-rate-adjusted p-values are reported to account for the multiple ethanol-only contrasts. These statistical tests are not used to define the sensor ranking; instead, ranking is based on the combined observability, identifiability, UKF reconstruction, and measurement-burden scoring framework described below.

3.2. Scoring Sensor Values and Rankings

The final ranking step combines four criteria: state observability, parameter identifiability, UKF reconstruction accuracy, and measurement burden. Because these quantities have different units and numerical ranges, each metric is normalized to the interval

[0, 1]

across the evaluated ethanol-mandatory candidate set. For a metric

m_{i}

for which larger values are desirable, the normalized score is defined as

η_{i}^{+} = \frac{m_{i} - {min}_{j} (m_{j})}{{max}_{j} (m_{j}) - {min}_{j} (m_{j})} .

(74)

For a metric for which smaller values are desirable, the normalized score is defined as

η_{i}^{-} = 1 - \frac{m_{i} - {min}_{j} (m_{j})}{{max}_{j} (m_{j}) - {min}_{j} (m_{j})} .

(75)

If the metric range is zero, all normalized scores for that metric are set to 0.5. This min–max normalization places information value, reconstruction accuracy, and measurement burden on a common scale. However, because min–max normalization is candidate-list-dependent, the ranking should be interpreted as a ranking within the explicitly evaluated candidate space rather than as an absolute sensor value. The primary analysis therefore evaluated all 16 ethanol-mandatory combinations, and additional candidate-list, Pareto, and measurement-burden sensitivity outputs were generated to assess whether the recommendation depends on the set of candidates included in the comparison.

To score state observability, both total information volume and the weakest resolved active state direction were used:

S_{obs} = 0.60 η^{+} ({log}_{10} pdet (F_{x})) + 0.40 η^{+} ({log}_{10} λ_{min, x}^{+}),

(76)

where

λ_{min, x}^{+}

is the minimum eigenvalue of

F_{x}

above the numerical threshold defined in Section 2.3. Parameter identifiability is scored in the same way:

S_{id} = 0.60 η^{+} ({log}_{10} pdet (F_{θ})) + 0.40 η^{+} ({log}_{10} λ_{min, θ}^{+}) .

(77)

The pseudodeterminant term rewards total active information volume, whereas the minimum-eigenvalue term penalizes sensor sets that leave at least one active state or parameter direction weakly informed. This avoids assigning a high score to a configuration that performs well only in a small number of dominant directions.

The UKF reconstruction score is derived from the Monte Carlo mean latent-state RMSE:

S_{ukf} = η^{-} ({\bar{RMSE}}_{latent}),

(78)

where

{\bar{RMSE}}_{latent}

is the mean latent-state RMSE across the 100 paired Monte Carlo replicates. The measurement-burden score is defined as

S_{burden} = η^{-} (c_{S}),

(79)

where

c_{S}

is the total dimensionless measurement-burden index of sensor set

S

.

The primary aggregate sensor-value score is then calculated as

S_{total} = 0.30 S_{obs} + 0.35 S_{id} + 0.25 S_{ukf} + 0.10 S_{burden} .

(80)

The identifiability term is assigned a slightly larger weight than the observability term because the envisioned digital-twin use case includes both state reconstruction and model learning through parameter refinement. The burden term is included to discourage automatically selecting the most measurement-intensive package when a reduced package gives comparable information and reconstruction value.

A secondary value-per-burden diagnostic is also calculated as

S_{value / burden} = \frac{S_{total}}{c_{S}} .

(81)

This quantity is not used as the primary ranking criterion. Instead, it is used to interpret trade-offs between information gain and implementation burden. In addition, Pareto screening was performed using information, reconstruction accuracy, and measurement burden so that candidate packages can be identified as dominated or non-dominated. A sensor set is considered dominated if another evaluated package provides no worse information and reconstruction performance while having no greater burden. These diagnostics help distinguish the best aggregate package from lower-burden alternatives and from full-proxy monitoring, which represents the upper measurement-completeness benchmark.

3.3. Noise, Operating-Trajectory, and Scoring Robustness Analyses

Robustness analyses were performed to assess whether the sensor-set hierarchy depends strongly on measurement-quality assumptions, operating trajectory, measurement imperfections, scoring weights, or measurement-burden assumptions. This step is important because laboratories may differ in sensor calibration, assay availability, online implementation difficulty, and the relative priority placed on observability, identifiability, reconstruction accuracy, and burden.

First, a uniform measurement-noise sensitivity analysis was performed using three noise multipliers as

γ_{σ} \in {0.5, 1.0, 2.0} .

(82)

In this analysis, all nominal sensor standard deviations are scaled as

σ_{S, i}^{(γ)} = γ_{σ} σ_{S, i} .

(83)

The weighted sensitivity matrices, Fisher-information metrics, and aggregate scores were then recomputed. Because Fisher-information matrices depend on the inverse measurement variance, this analysis tested whether the ranking was preserved when all measurements were assumed to be uniformly more accurate or less accurate.

Second, an independent sensor-specific noise analysis was performed so that sensor channels could improve or degrade independently. For each scenario, the standard deviation of each sensor

k \in {P, C, X, E, B}

was multiplied by an independently sampled factor as

σ_{k}^{(m)} = m_{k} σ_{k}, m_{k} \in [0.5, 2.0] .

(84)

A total of 200 independent sensor-noise scenarios were evaluated. For each scenario, the observability and identifiability matrices were reweighted, the aggregate score was recomputed across the 16 ethanol-mandatory candidates, and the rank distribution was recorded. This analysis tested whether the ranking was stable when one sensor channel became noisier or more accurate relative to the others.

Third, the effect of operating trajectory was examined because sensitivity-based observability and identifiability metrics are trajectory-dependent. In addition to the nominal temperature–pH schedule, four alternative feasible schedules were tested: a milder excitation profile, an extended hydrolysis profile, an earlier fermentation profile, and a shifted feasible profile in which temperature–pH levels were moved away from the nominal values while keeping all process phases active. For each trajectory, the observability, identifiability, scoring, and ranking calculations were repeated for all candidate sensor sets and sampling intervals. The resulting rankings were compared with the nominal ranking using rank-correlation and maximum-rank-shift diagnostics.

Fourth, full UKF stress tests were performed for practical measurement imperfections. Unlike an information-only approximation, these tests reran the complete reconstruction workflow under stressed measurement conditions and then recomputed the corresponding ranking. Random missingness is represented by dropping measurement updates according to specified channel-availability probabilities. Two missingness cases were considered: 20% missing observations for all sensors and higher missingness for biomass, enzyme, and substrate proxy channels. Assay delay was tested by delaying measurement availability by

6 h

before UKF correction. Systematic measurement bias was injected directly as an additive offset in the affected measurement channels rather than being treated only as zero-mean variance. Two bias cases are considered: a moderate all-sensor bias of

0.5 σ_{k}

and a proxy-bias case in which biomass, enzyme, and substrate proxy channels carry larger bias than ethanol and sugar measurements.

Fifth, the aggregate score was recalculated using alternative weighting schemes. For each weighting scheme,

w = [\begin{matrix} w_{obs} & w_{id} & w_{ukf} & w_{burden} \end{matrix}], \sum_{i} w_{i} = 1,

(85)

and

S_{total}^{(w)} = w_{obs} S_{obs} + w_{id} S_{id} + w_{ukf} S_{ukf} + w_{burden} S_{burden} .

(86)

The following weight vectors were evaluated as

\begin{matrix} w_{primary} & = [\begin{matrix} 0.30 & 0.35 & 0.25 & 0.10 \end{matrix}], \end{matrix}

(87)

\begin{matrix} w_{equal} & = [\begin{matrix} 0.25 & 0.25 & 0.25 & 0.25 \end{matrix}], \end{matrix}

(88)

\begin{matrix} w_{obs} & = [\begin{matrix} 0.55 & 0.20 & 0.15 & 0.10 \end{matrix}], \end{matrix}

(89)

\begin{matrix} w_{id} & = [\begin{matrix} 0.15 & 0.60 & 0.15 & 0.10 \end{matrix}], \end{matrix}

(90)

\begin{matrix} w_{ukf} & = [\begin{matrix} 0.20 & 0.20 & 0.50 & 0.10 \end{matrix}], \end{matrix}

(91)

\begin{matrix} w_{burden - sensitive} & = [\begin{matrix} 0.25 & 0.30 & 0.20 & 0.25 \end{matrix}], \end{matrix}

(92)

\begin{matrix} w_{burden - averse} & = [\begin{matrix} 0.20 & 0.20 & 0.15 & 0.45 \end{matrix}] . \end{matrix}

(93)

These cases represent balanced performance, observation-oriented design, identification-oriented design, reconstruction-oriented design, burden-sensitive preference, and strongly burden-averse preference.

Finally, alternative sensor-specific measurement-burden scenarios were tested. These scenarios changed the channel-specific burden indices to represent different practical workflows, such as a spectroscopy-assisted workflow with lower burden for calibrated optical measurements, an offline-assay workflow with higher burden for enzyme and residual-substrate measurements, and a solids-intensive workflow with elevated burden for residual insoluble-substrate monitoring. The aggregate rankings were recomputed in each workflow-specific burden scenario. This analysis separated the effect of changing score weights from the effect of changing the assumed practical cost of individual sensor channels.

Together, these robustness analyses tested whether a sensor package remained attractive when measurement noise, operating policy, data missingness, assay delay, systematic bias, scoring priorities, and measurement-burden assumptions were varied. A package that maintains a high rank across these cases is more defensible for pre-experimental CBP digital-twin planning, whereas a package preferred only in one weighting or workflow scenario should be interpreted as objective specific rather than universally optimal.

The workflow used for pre-ranking CBP sensor packages before detailed soft-sensor evaluation and digital-twin deployment is shown in Figure 1.

3.4. Computational Reproducibility

All simulations, state-estimation routines, sensitivity analyses, statistical comparisons, tables, and figures were implemented in Python 3.13.5. The final production run was executed with the base random seed 42. The production run used the ethanol-mandatory candidate-set mode, giving 16 candidate sensor packages, and used

N_{MC} = 100

Monte Carlo replicates for each candidate sensor set.

The hybrid CBP virtual plant was simulated using a fixed-step fourth-order Runge–Kutta scheme. The same numerical integration approach was used for nominal trajectory simulation, finite-difference sensitivity analysis, and UKF prediction. The production run used GPU acceleration for vectorized finite-difference sensitivity batches when available, with automatic CPU fallback. The final run used an NVIDIA GeForce RTX 2070 for the GPU-enabled sensitivity calculations. The model equations, sensor definitions, observability and identifiability metrics, UKF update equations, scoring procedure, and robustness-test definitions are provided in Section 2 and Section 3. Additional UKF Monte Carlo settings required to reproduce the soft-sensing RMSE values and pairwise statistical comparisons are summarized in Table 4.

The numerical values in this study were treated as computational design assumptions for comparing candidate sensor packages, not as platform-calibrated experimental constants. The assumptions are consistent with the use of Fisher-information-based experimental design, nonlinear state estimation, and soft-sensing analysis in partially observed bioprocess systems [9,11,27,28,37,38]. The main assumptions are summarized in Table 5.

4. Results and Discussion

4.1. Nominal CBP Trajectory with the Excitation Schedule

The nominal CBP trajectory with the temperature–pH excitation profile showed the expected phase-dependent behavior. Biomass and enzyme activity increased mainly during the early phase, hydrolysis increased soluble sugar during the intermediate phase, and ethanol accumulation became dominant during the later phase, as shown in Figure 2. This behavior is consistent with CBP as a coupled process involving growth, enzyme production, substrate deconstruction, sugar release, and product formation on different time scales [6,7,31,32]. The delayed ethanol response also illustrates why ethanol-only monitoring is insufficient for diagnosing earlier causes of poor conversion, such as weak growth, insufficient enzyme activity, limited hydrolysis, or sugar limitation.

4.2. State-Observability Enhancement with Increasingly Informative Sensors

State-observability information increased substantially when additional measurements were added to ethanol-only monitoring. Across the 16 ethanol-mandatory candidate packages, ethanol-only monitoring provided the lowest state-information content, as shown in Figure 3. At a

6 h

sampling interval, the state-observability log-pseudodeterminant increased from 4.18 with ethanol-only monitoring to 8.56 after soluble sugar was added. Full-proxy monitoring gave the largest state-observability information, with values of 16.42, 15.06, and 13.81 at sampling intervals of 6, 12, and

24 h

, respectively.

The all-combination analysis also identified strong reduced four-channel packages. The ethanol–sugar–biomass–substrate package reached state-observability log-pseudodeterminants of 15.12, 13.76, and 12.51 at 6, 12, and

24 h

, respectively, while the ethanol–sugar–enzyme–substrate package gave very similar values of 14.94, 13.65, and 12.51. These results show that residual-substrate information can provide a major state-observability gain when combined with ethanol, sugar, and a biological or enzymatic proxy. The previously emphasized ethanol–sugar–biomass–enzyme package also remained informative, with values of 14.39, 13.30, and 12.31, but it no longer represented the strongest reduced observability option once all ethanol-mandatory combinations were included.

Overall, the observability analysis confirms that product-only sensing is weak for state-aware CBP digital twins. Intermediate and latent-state proxy measurements provide much stronger information about the initial-state directions that drive the batch trajectory. This finding agrees with broader bioprocess-monitoring experience, where endpoint or product signals alone are often insufficient for reconstructing hidden physiological states, whereas intermediate measurements and proxy variables can substantially improve soft-sensor performance [8,9,10,11]. Within the tested candidate set, full-proxy monitoring remained the maximum state-observability benchmark, while ethanol–sugar–biomass–substrate and ethanol–sugar–enzyme–substrate provided strong reduced alternatives.

4.3. Parameter-Identifiability Improvement with Biomass and Enzyme Sensors

The parameter-identifiability results were more nuanced than the state-observability results because different metrics emphasize different properties of the Fisher information matrix. For the active log-pseudodeterminant criterion, the ethanol–sugar–biomass–enzyme package gave the strongest parameter-identifiability performance among the evaluated packages, as shown in Figure 4. Its log-pseudodeterminants were 10.82, 9.06, and 6.67 at sampling intervals of 6, 12, and

24 h

, respectively. The ethanol–biomass–enzyme–substrate package was close behind, with values of 10.68, 8.93, and 6.59. These results show that biomass and enzyme proxies are especially informative for separating growth, enzyme-production, and hydrolysis-related parameter effects.

For comparison, full-proxy monitoring gave active parameter log-pseudodeterminants of 8.69, 6.63, and 3.94 at 6, 12, and

24 h

, respectively. The lower active pseudodeterminant for the full-proxy set does not mean that the full-proxy set is less informative in an absolute sense. Rather, it reflects the behavior of the active pseudodeterminant when additional weak eigenvalue directions are retained. The pseudodeterminant sums only eigenvalues above the numerical threshold, so it can favor a reduced package whose information is concentrated in fewer active directions.

The numerical-rank results clarify this interpretation, as shown in Figure 5. Although the ethanol–sugar–biomass–enzyme package had the largest active-information volume, its parameter-information matrix had numerical rank 6 in the tested sampling cases. In contrast, full-proxy monitoring retained full numerical rank 7 across the tested sampling intervals. Therefore, the ethanol–sugar–biomass–enzyme package is best viewed as the strongest active-information package for parameter learning, whereas full-proxy monitoring provides the most complete coverage of all seven parameter directions.

The eigenvalue spectra further illustrate the distinction between active information volume and full-dimensional coverage, as shown in Figure 6. In the

12 h

case, the ethanol–sugar–biomass–enzyme package produced a larger active pseudodeterminant than full-proxy monitoring, but the full-proxy set preserved an additional weak parameter direction and gave the larger fixed-dimension regularized determinant. The fixed-dimension regularized log determinant was 6.94 for full-proxy monitoring and 6.36 for the ethanol–sugar–biomass–enzyme package. Thus, full-proxy monitoring remains the most complete parameter-information configuration, while ethanol–sugar–biomass–enzyme is the strongest reduced package for the active-information criterion.

The difference between observability and identifiability is therefore important. A sensor configuration that is strong for state reconstruction is not necessarily the same configuration that is most efficient for parameter identification. Identifiability depends on whether parameter perturbations produce sufficiently distinct output responses. In the present model, biomass and enzyme measurements complement ethanol and sugar measurements by helping to separate growth, enzyme-yield, and hydrolysis effects from ethanol-yield, inhibition, and feedstock-accessibility effects. Residual-substrate information, by contrast, was especially valuable for state observability and the aggregate ranking, but it did not replace the parameter-learning value of the enzyme proxy.

The inclusion of biomass and enzyme surrogates lowered the uncertainty of parameters associated with growth, enzyme production, and hydrolysis, as shown in Figure 7. However, strong correlations were still observed for selected parameter pairs, especially growth and decay, ethanol yield and inhibition, and hydrolysis capacity and feedstock accessibility, as shown in Figure 8. These correlations show that some biological mechanisms can still produce similar measured trajectories even when additional proxy measurements are available. Therefore, Fisher-information metrics, eigenvalue spectra, fixed-dimension regularized determinants, and parameter-correlation diagnostics should be used together when judging practical identifiability and selecting sensor packages before experimental implementation [28,37,38].

4.4. Impact of the Sensor Set on UKF Reconstruction Quality

More informative sensor sets improved latent-state reconstruction under model–plant mismatch conditions. Ethanol-only monitoring gave the weakest reconstruction performance, with a mean latent-state RMSE of 1.1899. Adding soluble sugar alone reduced the mean latent-state RMSE only slightly to 1.1398, confirming that a minimal product–sugar package is still insufficient for reconstructing hidden biomass, enzyme, and residual-substrate dynamics. In contrast, packages that included residual-substrate measurements strongly reduced substrate RMSE, while packages that included biomass and enzyme proxies improved the corresponding biological and enzymatic state estimates. These trends are consistent with the observability results and with previous bioprocess soft-sensing studies, where estimator performance depends strongly on whether the available measurements excite the dominant latent-state directions [8,9,10,11].

Among all 16 ethanol-mandatory packages, full-proxy monitoring gave the lowest mean latent-state RMSE, 0.3756. The ethanol–biomass–enzyme–substrate package was close behind with 0.3843, followed by the top aggregate-scoring ethanol–sugar–biomass–substrate package with 0.4121. Thus, full-proxy monitoring remained the best reconstruction benchmark, but several four-channel packages achieved similar UKF accuracy with lower measurement burden. The Monte Carlo distributions of latent-state RMSE and representative state-wise errors are shown in Figure 9 and Table 6.

Paired Wilcoxon tests were used to compare each non-baseline sensor set with ethanol-only monitoring because the replicate-wise RMSE differences were not assumed to be normally distributed [44]. The revised Monte Carlo design used common plant mismatch and initial-estimate perturbations across all sensor sets, so the replicate-wise RMSE differences were paired observations. In addition to the raw test statistics, bootstrap confidence intervals were calculated for mean RMSE reductions, and the Wilcoxon p-values were adjusted across the 15 ethanol-only contrasts.

All non-baseline sensor sets reduced mean latent-state RMSE relative to ethanol-only monitoring. The largest reconstruction improvement was obtained with full-proxy monitoring, which reduced mean latent-state RMSE from 1.1899 to 0.3756. The corresponding mean reduction was 0.8143, with a bootstrap 95% interval of 0.7398–0.8894. The ethanol–biomass–enzyme–substrate package provided a nearly equivalent reconstruction reduction of 0.8056, while the ethanol–sugar–biomass–substrate package reduced RMSE by 0.7778. These results show that residual-substrate information is especially valuable for UKF state reconstruction when combined with biomass or enzyme proxies. The paired comparison results are summarized in Table 7.

4.5. Recommended Sensor-Set Ranking and Measurement-Burden Trade-Off

The aggregate sensor-value ranking changed when all 16 ethanol-mandatory combinations were evaluated instead of only the original seven preselected packages. The ethanol–sugar–biomass–substrate package achieved the highest overall score, followed by full-proxy monitoring and the ethanol–biomass–enzyme–substrate package, as shown in Table 8. Ethanol-only monitoring remained the least effective option, confirming that product-only sensing is inadequate when state observability, parameter identifiability, nonlinear reconstruction accuracy, and measurement burden are considered together.

This ordering clarifies the trade-off among information value, reconstruction accuracy, and measurement burden. Full-proxy monitoring gave the best UKF reconstruction accuracy and the most complete measurement coverage, but it also had the highest burden index. The ethanol–sugar–biomass–substrate package ranked first overall because it combined strong state observability, high UKF reconstruction accuracy, competitive identifiability, and lower burden than the full-proxy package. The ethanol–biomass–enzyme–substrate package gave nearly full-proxy reconstruction performance and a strong parameter-identifiability score, but its higher burden and lack of a sugar measurement placed it below the top-ranked ethanol–sugar–biomass–substrate package in the aggregate ranking.

The previously emphasized ethanol–sugar–biomass–enzyme package remained important for parameter learning and ranked sixth overall, but it was no longer the best reduced aggregate package once residual-substrate-containing combinations were included. Therefore, the practical recommendation is conditional on the design objective: full-proxy monitoring is preferred when maximum reconstruction and completeness are required; ethanol–sugar–biomass–substrate is preferred for the primary aggregate score; and ethanol–sugar–biomass–enzyme remains a strong reduced option when parameter identifiability is prioritized. This supports the broader digital bioprocessing view that process analytical measurements, state-estimation methods, and uncertainty-aware decision support should be integrated before closed-loop digital-twin deployment [11,19,20,24].

4.6. Robustness to Measurement Noise, Operating Trajectories, Measurement Imperfections, and Scoring Weights

The robustness analyses showed that the main recommendation was generally stable with changes in measurement quality, operating trajectory, measurement imperfections, scoring weights, and sensor-specific burden assumptions. With uniform measurement-noise scaling, the ethanol–sugar–biomass–substrate package remained top-ranked in the low-noise, nominal-noise, and high-noise cases, as shown in Figure 10 and Table 9. Spearman rank correlations relative to the nominal-noise ranking were close to one, and the maximum rank shift was at most two. This indicates that the primary ranking was not an artifact of a single assumed global measurement-noise level. Such robustness is important because sensor quality, assay uncertainty, and data availability can differ substantially across laboratories and development stages [11,19,24].

When individual sensor-noise levels were varied independently, the competition between the top reduced package and full-proxy monitoring became clearer, as shown in Figure 11. Across 200 sensor-specific noise scenarios, full-proxy monitoring had the best mean rank, 1.57, and was ranked first in 99 scenarios. The ethanol–sugar–biomass–substrate package had a mean rank of 1.83 and was ranked first in 96 scenarios. The ethanol–biomass–enzyme–substrate package ranked first in five scenarios. Thus, the full-proxy package is slightly more robust when individual sensor-noise assumptions vary, whereas ethanol–sugar–biomass–substrate remains the strongest lower-burden aggregate recommendation.

The operating-trajectory analysis confirmed that the ranking was not driven only by the nominal temperature–pH schedule, as shown in Figure 12 and Table 10. Full-proxy monitoring provided the highest state-observability score for all tested trajectories. The top aggregate package was ethanol–sugar–biomass–substrate with the nominal and hydrolysis-extended trajectories, while full-proxy monitoring became the top aggregate package for the mild, fermentation-early, and shifted-feasible trajectories. The Spearman correlation relative to the nominal trajectory ranged from 0.9647 to 1.0000, with a maximum rank shift of three. Therefore, the reduced-set recommendation should be interpreted in relation to the operating regime, whereas full-proxy monitoring is the most trajectory-robust information-complete configuration.

Full UKF stress tests were then used to examine practical measurement imperfections, including missing observations, assay delay, and systematic measurement bias. In these tests, the complete UKF reconstruction and ranking workflow was rerun under each stress condition. The ethanol–sugar–biomass–substrate package remained top-ranked in all stress scenarios, as shown in Figure 13 and Table 11. The largest degradation in the top-package reconstruction error occurred in the

6 h

assay-delay case, where the mean latent-state RMSE of the top package increased to 0.5953. Systematic bias was injected as an additive measurement offset rather than treated only as zero-mean variance. These results show that the main aggregate recommendation remained stable when measurements were missing, delayed, or biased.

The weight-sensitivity analysis showed that the preferred package was also stable across most scoring priorities, as shown in Figure 14 and Table 12. The ethanol–sugar–biomass–substrate package remained top-ranked with the primary, equal-weight, identifiability-focused, reconstruction-focused, burden-sensitive, and burden-averse formulations. Full-proxy monitoring became top-ranked only when the observability term was given dominant weight. This confirms that the top-ranked reduced package is not merely a consequence of a single arbitrary weight choice, although full-proxy monitoring remains the preferred option when the main objective is maximum state-observability coverage.

Finally, alternative sensor-specific measurement-burden scenarios were tested to distinguish weight sensitivity from changes in the assumed practical workflow. The ethanol–sugar–biomass–substrate package remained top-ranked in all tested burden workflows, as summarized in Table 13. This suggests that the primary recommendation is not solely caused by the nominal burden values, although the relative ranks of nearby packages still changed when biomass, enzyme, spectroscopy, or solids-measurement assumptions were altered.

4.7. Implications for Practical CBP Digital-Twin Development

Within the tested five-state virtual-plant benchmark, the results support a hierarchical approach to measurement selection for CBP digital-twin development. Ethanol-only monitoring was consistently the weakest configuration because ethanol is a delayed product signal and cannot indicate whether limited batch performance originates from biomass limitation, enzyme insufficiency, substrate scarcity, slow hydrolysis, sugar accumulation, or product inhibition. This interpretation is consistent with bioprocess soft-sensing studies showing that delayed quality indicators and hidden physiological states can limit real-time monitoring and control [8,9,10,11,42].

Adding soluble sugar provided a small, low-burden improvement because sugar links substrate hydrolysis to ethanol formation. However, the ethanol–sugar package should be viewed as a minimal measurement configuration rather than a complete digital-twin sensor package. It does not directly capture biomass growth, enzyme activity, or residual substrate availability. This agrees with process analytical technology and digital-bioprocessing approaches, where intermediate measurements are informative but often need to be combined with soft sensors, model-based estimation, and digital-twin architectures [11,19,22,24,45,46].

The all-combination analysis changed the practical recommendation relative to the original restricted seven-package comparison. Across all 16 ethanol-mandatory combinations, ethanol–sugar–biomass–substrate achieved the highest primary aggregate score. This package combines a product signal, a fermentable-intermediate signal, a biological-state proxy, and a hydrolysis or solids-state proxy. It therefore provides strong state observability and UKF reconstruction while avoiding the highest burden of full-proxy monitoring. Full-proxy monitoring remains preferred when maximum reconstruction accuracy, state-observability coverage, and information completeness are required. The ethanol–sugar–biomass–enzyme package remains important for parameter learning because biomass and enzyme measurements provide strong information about growth, enzyme-production, and hydrolysis-related parameter directions, but it is no longer the strongest aggregate reduced package once residual-substrate-containing combinations are included.

In practice, low-burden screening experiments could begin with ethanol, sugar, and biomass measurements. Experiments focused on overall digital-twin readiness should add a residual-substrate or solids-related proxy, giving the ethanol–sugar–biomass–substrate package. Model-refinement experiments that prioritize kinetic identifiability should include enzyme activity, especially when growth and hydrolysis parameters must be separated. High-quality benchmark experiments should use full-proxy monitoring when the measurement burden is acceptable. This staged development route is consistent with digital bioprocessing and digital chemical engineering roadmaps, in which digital twins evolve through improved modeling, state estimation, process analytics, enabling digital technologies, and control-oriented decision support [19,20,21,23,24,25,45,46].

These practical implications should be interpreted as design guidance for the tested virtual-plant benchmark, not as a universal sensor prescription. Platform-specific sensor errors, costs, delays, organisms, feedstocks, product portfolios, missing-data patterns, and operating regimes should be incorporated before implementation. This is especially important for literature-derived CBP datasets, where product prediction can be affected by heterogeneous feedstock–pretreatment–microbial descriptors, sparse reporting, and missing-label structure [30].

4.8. Limitations

Several limitations apply to this study. First, a computational virtual plant was used to generate observations and evaluate sensor-set performance. The results therefore provide guidance for sensor prioritization and digital-twin readiness assessment, but they should not be interpreted as experimental validation. Real CBP systems may include organism-specific regulation, feedstock heterogeneity, mass-transfer limitations, inhibitor formation, contamination, evaporation, sensor drift, and assay-specific bias. Future work should test the proposed hierarchy using synchronized experimental measurements of ethanol, soluble sugars, biomass proxies, enzyme activity, and residual solids.

Second, the five-state hybrid model is a compact representation of lignocellulosic CBP. It captures the main information pathways needed to study state observability, parameter identifiability, and soft-sensor reconstruction, but it does not resolve all biochemical details. More detailed models could separate cellulose, hemicellulose, glucose, xylose, cellobiose, individual enzyme classes, inhibitor species, co-products, and organism-specific metabolic states. This extension is relevant because literature-derived CBP datasets show uneven product support and missing-label structure across ethanol and co-products [30]. Such refinements may change the relative importance of candidate sensors and may introduce additional identifiability challenges unless richer measurements are available [28,29].

Third, the observability and identifiability criteria are local and depend on the operating trajectory around which sensitivities are evaluated. This study examined nominal, mild, hydrolysis-extended, fermentation-early, and shifted-feasible temperature–pH trajectories. The overall ranking was highly correlated across these cases, but the top aggregate package changed for some trajectories. The ethanol–sugar–biomass–substrate package was preferred in the nominal and hydrolysis-extended cases, whereas full-proxy monitoring became preferred for the mild, fermentation-early, and shifted-feasible profiles. Therefore, the recommendation should be interpreted as conditional on the explored operating region. Other feedstocks, organisms, pretreatment severities, solids loadings, batch durations, or control policies may generate different sensitivity patterns. This is a general limitation of sensitivity-based experimental design and practical identifiability analysis [37,38].

Fourth, the measurement-noise levels, missingness assumptions, bias levels, assay-delay approximation, and measurement-burden indices were chosen as computational design scenarios rather than calibrated values for a specific experimental platform. The ranking was stable with uniform noise scaling, missing observations, assay delay, systematic bias, alternative scoring weights, and alternative burden workflows. However, the sensor-specific noise Monte Carlo analysis showed that full-proxy monitoring and the ethanol–sugar–biomass–substrate package can exchange the top rank when individual sensor uncertainties vary independently. Actual measurements can also have platform-specific error structures, detection limits, sampling losses, maintenance requirements, operator-time costs, and latency constraints. Future studies should replace the abstract burden and error assumptions with experimentally measured values for the intended laboratory or pilot-plant platform [21,23,42].

Fifth, the expanded analysis evaluated all ethanol-mandatory combinations of the five modeled measurement channels, but it did not evaluate sensor packages that omit ethanol. Ethanol was treated as mandatory because it is the direct product signal and the practical product-monitoring baseline. Therefore, the ranking should be interpreted within the ethanol-mandatory design space. Other objectives, such as early-stage hydrolysis diagnostics before ethanol formation, could motivate a different candidate space.

Sixth, the UKF case study evaluated latent-state reconstruction under model–plant mismatch conditions, but it did not test closed-loop control performance in an experimental process. Good estimator performance is only one requirement for digital-twin deployment. Practical implementation also requires suitable actuators, acceptable measurement latency, robust controller design, reliable data transfer, and safe interaction among the physical process, model updates, operators, and controllers. The present framework should therefore be viewed as a sensor-prioritization and soft-sensing readiness tool, not as complete closed-loop digital-twin validation [45,46].

Finally, the aggregate ranking depends on the normalized multi-criteria scoring scheme. Although the weight-sensitivity analysis showed that the ethanol–sugar–biomass–substrate package remained preferred under most weighting schemes, full-proxy monitoring became preferred when state observability was given dominant weight. The preferred package therefore depends on whether the digital twin is intended for maximum state-information coverage, parameter learning, nonlinear reconstruction, or lower-burden screening. The ranking should be used as a decision-support tool rather than a universal sensor prescription.

5. Summary and Conclusions

This paper presented a computational methodology for selecting informative measurement packages for digital-twin-assisted consolidated bioprocessing (CBP). The framework combines state-observability analysis, parameter-identifiability analysis, UKF-based soft-sensor reconstruction, measurement-burden assessment, and robustness testing with changes in measurement noise, operating trajectory, measurement imperfections, scoring weights, and sensor-specific burden assumptions. The objective was to support pre-experimental sensor-set design before laboratory or pilot-scale digital-twin validation.

Within the tested five-state virtual-plant benchmark, ethanol-only sensing was inadequate for state-aware CBP digital-twin reconstruction because ethanol is a delayed product signal. At a

6 h

sampling interval, the state-observability log-pseudodeterminant increased from 4.18 with ethanol-only sensing to 8.56 after adding soluble sugar and to 16.42 with full-proxy monitoring. The ethanol–sugar–biomass–substrate package also provided strong reduced state-observability performance, with log-pseudodeterminants of 15.12, 13.76, and 12.51 at 6, 12, and

24 h

, respectively. Parameter-identifiability analysis showed that biomass and enzyme proxies were especially valuable for model learning: the ethanol–sugar–biomass–enzyme package gave the strongest active-information performance, with log-pseudodeterminants of 10.82, 9.06, and 6.67 at 6, 12, and

24 h

, respectively. Full-proxy monitoring provided the most complete all-parameter information coverage.

The paired UKF Monte Carlo reconstruction test showed that additional measurements substantially improved latent-state estimation under model–plant mismatch conditions. Ethanol-only monitoring gave a mean latent-state RMSE of 1.1899, whereas full-proxy monitoring gave the lowest RMSE, 0.3756, followed by ethanol–biomass–enzyme–substrate at 0.3843 and ethanol–sugar–biomass–substrate at 0.4121. After evaluating all 16 ethanol-mandatory candidate packages, the aggregate ranking changed relative to the original seven-package comparison. Ethanol–sugar–biomass–substrate achieved the highest primary aggregate sensor-value score, 0.8432, with a burden index of 7.0. Full-proxy monitoring ranked second, with a score of 0.8173 and a burden index of 10.0, while ethanol–biomass–enzyme–substrate ranked third, with a score of 0.8086. The previously emphasized ethanol–sugar–biomass–enzyme package remained important for parameter learning but ranked sixth overall once residual-substrate-containing combinations were included.

The robustness analyses supported the main recommendation while clarifying its conditional nature. Ethanol–sugar–biomass–substrate remained top-ranked with uniform noise scaling, full UKF missingness, delay and bias stress tests, most scoring-weight scenarios, and all tested sensor-specific burden workflows. For independent sensor-specific noise variation, full-proxy monitoring and ethanol–sugar–biomass–substrate had similar top-rank frequencies, and for some alternative operating trajectories full-proxy monitoring became top-ranked. Overall, ethanol-only monitoring is suitable only as a minimal product baseline; ethanol–sugar–biomass sensing can support lower-burden screening; ethanol–sugar–biomass–substrate sensing is recommended for the primary aggregate digital-twin readiness score; ethanol–sugar–biomass–enzyme sensing remains attractive for parameter learning; and full-proxy monitoring is recommended for benchmark experiments when maximum reconstruction accuracy and information completeness are required. Because the results were obtained from a computational benchmark, the hierarchy should be validated with platform-specific experimental measurements, sensor errors, delays, costs, organisms, feedstocks, and operating regimes before practical deployment.

Author Contributions

M.K.Y.: Conceptualization, Methodology, Validation, Formal analysis, Investigation, Software, Funding acquisition, Data curation, Visualization, Writing—original draft, Writing—review and editing. A.A.: Validation, Writing—review and editing, Supervision. N.Y.A.: Investigation, Writing—review and editing, Supervision. All authors have read and agreed to the published version of the manuscript.

Funding

This work was partly supported by the German Academic Exchange Service (DAAD) under the program Research Grants—Bi-nationally Supervised Doctoral Degrees/Cotutelle (Grant No. 57693451).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The simulation code and numerical output files used to generate the figures and tables are available and have been archived in Zenodo at https://doi.org/10.5281/zenodo.20560602.

Acknowledgments

The authors acknowledge support from the Open Access Publication Fund of the University of Duisburg-Essen, the German Academic Exchange Service (DAAD), and the KNUST Engineering Education Project (KEEP).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this study:

CBP	Consolidated bioprocessing
DT	Digital twin
FIM	Fisher information matrix
GC	Gas chromatography
HPLC	High-performance liquid chromatography
MAE	Mean absolute error
MC	Monte Carlo
NIR	Near-infrared spectroscopy
PAT	Process analytical technology
pdet	Pseudo-determinant
RK4	Fourth-order Runge–Kutta method
RMSE	Root mean squared error
UKF	Unscented Kalman filter

References

Yeboah, M.K.; Asiedu, N.Y.; Dogbe, S.; Addo, A. Performance of Machine Learning Based-Modelling Approach in Consolidated Bioprocessing with Microbial Consortium for Bioethanol Production. Ind. Biotechnol. 2024, 20, 77–97. [Google Scholar] [CrossRef]
Yeboah, M.K.; Söffker, D. Consolidated Bioprocessing of Lignocellulosic Biomass: A Review of Experimental Advances and Modeling Approaches. Bioresour. Bioprod. 2026, 2, 4. [Google Scholar] [CrossRef]
Singhania, R.R.; Patel, A.K.; Singh, A.; Haldar, D.; Soam, S.; Chen, C.W.; Tsai, M.L.; Dong, C.D. Consolidated bioprocessing of lignocellulosic biomass: Technological advances and challenges. Bioresour. Technol. 2022, 354, 127153. [Google Scholar] [CrossRef] [PubMed]
Tsai, S.L.; Sun, Q.; Chen, W. Advances in consolidated bioprocessing using synthetic cellulosomes. Curr. Opin. Biotechnol. 2022, 78, 102840. [Google Scholar] [CrossRef] [PubMed]
Sharma, J.; Kumar, V.; Prasad, R.; Gaur, N.A. Engineering of Saccharomyces cerevisiae as a consolidated bioprocessing host to produce cellulosic ethanol: Recent advancements and current challenges. Biotechnol. Adv. 2022, 56, 107925. [Google Scholar] [CrossRef] [PubMed]
Periyasamy, S.; Beula Isabel, J.; Kavitha, S.; Karthik, V.; Mohamed, B.A.; Gizaw, D.G.; Sivashanmugam, P.; Aminabhavi, T.M. Recent advances in consolidated bioprocessing for conversion of lignocellulosic biomass into bioethanol—A review. Chem. Eng. J. 2023, 453, 139783. [Google Scholar] [CrossRef]
Li, Z.; Waghmare, P.R.; Dijkhuizen, L.; Meng, X.; Liu, W. Research advances on the consolidated bioprocessing of lignocellulosic biomass. Eng. Microbiol. 2024, 4, 100139. [Google Scholar] [CrossRef] [PubMed]
Bastin, G.; Dochain, D. On-Line Estimation and Adaptive Control of Bioreactors; Elsevier: Amsterdam, The Netherlands, 1990. [Google Scholar]
Kadlec, P.; Gabrys, B.; Strandt, S. Data-driven soft sensors in the process industry. Comput. Chem. Eng. 2009, 33, 795–814. [Google Scholar] [CrossRef]
Randek, J.; Mandenius, C.F. On-line soft sensing in upstream bioprocessing. Crit. Rev. Biotechnol. 2018, 38, 106–121. [Google Scholar] [CrossRef] [PubMed]
Brunner, V.; Siegl, M.; Geier, D.; Becker, T. Challenges in the development of soft sensors for bioprocesses: A critical review. Front. Bioeng. Biotechnol. 2021, 9, 722202. [Google Scholar] [CrossRef] [PubMed]
Paulsson, D.; Gustavsson, R.; Mandenius, C.F. A Soft Sensor for Bioprocess Control Based on Sequential Filtering of Metabolic Heat Signals. Sensors 2014, 14, 17864–17882. [Google Scholar] [CrossRef] [PubMed]
Tamburini, E.; Marchetti, M.G.; Pedrini, P. Monitoring Key Parameters in Bioprocesses Using Near-Infrared Technology. Sensors 2014, 14, 18941–18959. [Google Scholar] [CrossRef] [PubMed]
Faassen, S.M.; Hitzmann, B. Fluorescence Spectroscopy and Chemometric Modeling for Bioprocess Monitoring. Sensors 2015, 15, 10271–10291. [Google Scholar] [CrossRef] [PubMed]
Konakovsky, V.; Yagtu, A.C.; Clemens, C.; Müller, M.M.; Berger, M.; Schlatter, S.; Herwig, C. Universal Capacitance Model for Real-Time Biomass in Cell Culture. Sensors 2015, 15, 22128–22150. [Google Scholar] [CrossRef] [PubMed]
Grigs, O.; Bolmanis, E.; Galvanauskas, V. Application of In-Situ and Soft-Sensors for Estimation of Recombinant P. pastoris GS115 Biomass Concentration: A Case Analysis of HBcAg (Mut+) and HBsAg (MutS) Production Processes under Varying Conditions. Sensors 2021, 21, 1268. [Google Scholar] [CrossRef] [PubMed]
Siegl, M.; Kämpf, M.; Geier, D.; Andreeßen, B.; Max, S.; Zavrel, M.; Becker, T. Generalizability of Soft Sensors for Bioprocesses through Similarity Analysis and Phase-Dependent Recalibration. Sensors 2023, 23, 2178. [Google Scholar] [CrossRef] [PubMed]
Iglesias, C.F.; Bolic, M. How Not to Make the Joint Extended Kalman Filter Fail with Unstructured Mechanistic Models. Sensors 2024, 24, 653. [Google Scholar] [CrossRef] [PubMed]
Gargalo, C.L.; de las Heras, S.C.; Jones, M.N.; Udugama, I.; Mansouri, S.S.; Krühne, U.; Gernaey, K.V. Towards the development of digital twins for the bio-manufacturing industry. In Digital Twins: Tools and Concepts for Smart Biomanufacturing; Springer: Berlin/Heidelberg, Germany, 2020; pp. 1–34. [Google Scholar]
Sinner, P.; Daume, S.; Herwig, C.; Kager, J. Usage of digital twins along a typical process development cycle. In Digital Twins: Tools and Concepts for Smart Biomanufacturing; Springer: Berlin/Heidelberg, Germany, 2020; pp. 55–73. [Google Scholar]
Udugama, I.A.; Kelton, W.; Bayer, C. Digital twins in food processing: A conceptual approach to developing multi-layer digital models. Digit. Chem. Eng. 2023, 7, 100087. [Google Scholar] [CrossRef]
Yatipanthalawa, B.S.; Gras, S.L. Predictive Models for Upstream Mammalian Cell Culture Development: A Review. Digit. Chem. Eng. 2024, 10, 100137. [Google Scholar] [CrossRef]
Pietrasik, M.; Wilbik, A.; Grefen, P. The enabling technologies for digitalization in the chemical process industry. Digit. Chem. Eng. 2024, 12, 100161. [Google Scholar] [CrossRef]
Isoko, K.; Cordiner, J.L.; Kis, Z.; Moghadam, P.Z. Bioprocessing 4.0: A Pragmatic Review and Future Perspectives. Digit. Discov. 2024, 3, 1662–1681. [Google Scholar] [CrossRef]
Mu’azzam, K.; da Silva, F.V.S.; Murtagh, J.; Gallagher, M.J.S. A Roadmap for Model-Based Bioprocess Development. Biotechnol. Adv. 2024, 73, 108378. [Google Scholar] [CrossRef] [PubMed]
Julier, S.J.; Uhlmann, J.K. A new extension of the Kalman filter to nonlinear systems. In Signal Processing, Sensor Fusion, and Target Recognition VI; SPIE: Bellingham, WA, USA, 1997; Volume 3068, pp. 182–193. [Google Scholar] [CrossRef]
Julier, S.J.; Uhlmann, J.K. Unscented filtering and nonlinear estimation. Proc. IEEE 2004, 92, 401–422. [Google Scholar] [CrossRef]
Raue, A.; Kreutz, C.; Maiwald, T.; Bachmann, J.; Schilling, M.; Klingmüller, U.; Timmer, J. Structural and practical identifiability analysis of partially observed dynamical models by exploiting the profile likelihood. Bioinformatics 2009, 25, 1923–1929. [Google Scholar] [CrossRef] [PubMed]
Villaverde, A.F. Observability and structural identifiability of nonlinear biological systems. Complexity 2019, 2019, 8497093. [Google Scholar] [CrossRef]
Yeboah, M.K.; Addo, A.; Asiedu, N.Y. Multi-Product Modeling of Consolidated Bioprocessing Using a Literature-Derived Dataset: A Multi-Output Learning Framework for Ethanol and Co-Products. Fermentation 2026, 12, 224. [Google Scholar] [CrossRef]
Lynd, L.R.; van Zyl, W.H.; McBride, J.E.; Laser, M. Consolidated bioprocessing of cellulosic biomass: An update. Curr. Opin. Biotechnol. 2005, 16, 577–583. [Google Scholar] [CrossRef] [PubMed]
Olson, D.G.; McBride, J.E.; Shaw, A.J.; Lynd, L.R. Recent progress in consolidated bioprocessing. Curr. Opin. Biotechnol. 2012, 23, 396–405. [Google Scholar] [CrossRef] [PubMed]
Agharafeie, R.; Ramos, J.R.C.; Mendes, J.M.; Oliveira, R. From Shallow to Deep Bioprocess Hybrid Modeling: Advances and Future Perspectives. Fermentation 2023, 9, 922. [Google Scholar] [CrossRef]
Moser, A.; Appl, C.; Pörtner, R.; Baganz, F.; Hass, V.C. A New Concept for the Rapid Development of Digital Twin Core Models for Bioprocesses in Various Reactor Designs. Fermentation 2024, 10, 463. [Google Scholar] [CrossRef]
Herrera-Ruiz, J.F.; Fontalvo, J.; Prado-Rubio, O.A. Hybrid Modeling for Bioprocesses: Architectures, Applications, and Perspectives. Eng. Rep. 2025, 7, e70502. [Google Scholar] [CrossRef]
Yeboah, M.K.; Asiedu, N.Y.; Addo, A. Dynamic Pareto Optimization of Consolidated Bioprocessing for Ethanol Titer, Productivity, Conversion, and Operating Severity. Bioengineering 2026, 13, 605. [Google Scholar] [CrossRef]
Walter, É.; Pronzato, L. Qualitative and quantitative experiment design for phenomenological models—A survey. Automatica 1990, 26, 195–213. [Google Scholar] [CrossRef]
Franceschini, G.; Macchietto, S. Model-based design of experiments for parameter precision: State of the art. Chem. Eng. Sci. 2008, 63, 4846–4872. [Google Scholar] [CrossRef]
Zhu, X.; Rehman, K.U.; Wang, B.; Shahzad, M. Modern Soft-Sensing Modeling Methods for Fermentation Processes. Sensors 2020, 20, 1771. [Google Scholar] [CrossRef] [PubMed]
Hermann, L.; Kremling, A. A Hybrid Soft Sensor Approach Combining Partial Least-Squares Regression and an Unscented Kalman Filter for State Estimation in Bioprocesses. Bioengineering 2025, 12, 654. [Google Scholar] [CrossRef] [PubMed]
Tuveri, A.; Pérez-García, F.; Lira-Parada, P.A.; Imsland, L.; Bar, N. Sensor fusion based on Extended and Unscented Kalman Filter for bioprocess monitoring. J. Process Control 2021, 106, 195–207. [Google Scholar] [CrossRef]
Reyes, S.J.; Durocher, Y.; Pham, P.L.; Henry, O. Modern Sensor Tools and Techniques for Monitoring, Controlling, and Improving Cell Culture Processes. Processes 2022, 10, 189. [Google Scholar] [CrossRef]
Pérez, P.A.L.; Lopez, R.A.; Femat, R. Control in Bioengineering and Bioprocessing: Modeling, Estimation and the Use of Sensors; John Wiley & Sons: Chichester, UK, 2020. [Google Scholar]
Wilcoxon, F. Individual comparisons by ranking methods. Biom. Bull. 1945, 1, 80–83. [Google Scholar] [CrossRef]
Chen, Y.; Yang, O.; Sampat, C.; Bhalode, P.; Ramachandran, R.; Ierapetritou, M. Digital Twins in Pharmaceutical and Biopharmaceutical Manufacturing: A Literature Review. Processes 2020, 8, 1088. [Google Scholar] [CrossRef]
Zhao, B.; Li, X.; Sun, W.; Qian, J.; Liu, J.; Gao, M.; Guan, X.; Ma, Z.; Li, J. BioDT: An Integrated Digital-Twin-Based Framework for Intelligent Biomanufacturing. Processes 2023, 11, 1213. [Google Scholar] [CrossRef]

Figure 1. Workflow for the proposed sensor-set design framework for digital-twin-assisted consolidated bioprocessing. Starting from a hybrid gray-box CBP virtual plant, all ethanol-mandatory measurement packages are defined and evaluated through state-observability analysis, parameter-identifiability analysis, and paired UKF-based Monte Carlo soft-sensing reconstruction. The candidate packages are then examined using robustness analyses involving alternative operating trajectories, measurement-noise assumptions, missing-data scenarios, assay delay, systematic measurement bias, alternative scoring weights, and alternative sensor-specific measurement-burden scenarios. The workflow yields a ranked and Pareto-screened list of sensor packages for assessing CBP digital-twin readiness before experimental or pilot-scale implementation.

Figure 2. Nominal CBP trajectory with the temperature–pH excitation schedule, showing biomass growth, enzyme formation, residual insoluble-substrate conversion, soluble sugar accumulation, and ethanol formation.

Figure 3. State-observability information for the 16 ethanol-mandatory sensor sets and sampling intervals, measured using the log-pseudodeterminant of the Fisher-information-type observability matrix.

Figure 4. Parameter-identifiability information for the 16 ethanol-mandatory sensor sets and sampling intervals, measured using the active log-pseudodeterminant of the parameter Fisher information matrix.

Figure 5. Numerical parameter-identifiability rank for the 16 ethanol-mandatory sensor sets and sampling intervals. Higher rank indicates that more independent parameter directions are practically resolved by the available measurements.

Figure 6. Eigenvalue spectra of selected parameter Fisher information matrices at the

12 h

sampling interval. Larger eigenvalues indicate stronger parameter-information directions, while very small eigenvalues indicate weakly resolved or practically non-identifiable parameter combinations.

Figure 6. Eigenvalue spectra of selected parameter Fisher information matrices at the

12 h

sampling interval. Larger eigenvalues indicate stronger parameter-information directions, while very small eigenvalues indicate weakly resolved or practically non-identifiable parameter combinations.

Figure 7. Approximate parameter uncertainty estimated from the inverse Fisher information matrix. Lower values indicate better parameter precision.

Figure 8. Parameter correlation matrix for the full-proxy monitoring case. Strong positive or negative off-diagonal values indicate parameter pairs that remain difficult to distinguish.

Figure 9. Monte Carlo distribution of mean UKF latent-state reconstruction RMSE by sensor set. Lower RMSE indicates better reconstruction of the unmeasured or partially measured CBP states across paired model–plant mismatch simulations.

Figure 10. Sensitivity of sensor-set ranking to uniform measurement-noise scaling. Rank 1 denotes the best-performing sensor set under the corresponding noise assumption.

Figure 11. Rank distribution under 200 independent sensor-specific noise scenarios. Each sensor standard deviation was independently varied before recalculating the sensor-set scores.

Figure 12. Sensor-set ranking for alternative feasible operating trajectories. Rank 1 denotes the highest aggregate sensor-value score for each temperature–pH trajectory.

Figure 13. Sensor-set ranking for full UKF missingness, assay-delay, and measurement-bias stress tests. Rank 1 denotes the highest aggregate sensor-value score in each measurement-imperfection scenario.

Figure 14. Sensor-set ranking under alternative scoring-weight assumptions. Rank 1 denotes the preferred sensor package for the corresponding design objective.

Table 1. Nominal parameterization of the hybrid CBP virtual plant used for trajectory simulation, finite-difference sensitivity analysis, Fisher-information calculations, and UKF prediction. The values are computational benchmark assumptions, not organism-calibrated kinetic constants.

Quantity	Symbol or Expression	Nominal Value/Definition	Role
Batch duration	$t_{f}$	$96 h$	Simulation horizon
Internal integration step	$Δ t$	$2 h$	RK4 integration step
Initial biomass	$X_{0}$	$0.10$	Initial active biomass state
Initial enzyme activity	$E_{0}$	0	Initial enzyme-activity state
Initial feedstock loading	$S_{0}$	100	Feedstock loading index
Initial insoluble substrate	$B_{0} = S_{0} / 5$	20	Initial residual insoluble substrate
Initial soluble sugar	$C_{0}$	0	Initial soluble-sugar state
Initial ethanol	$P_{0}$	0	Initial ethanol state
Biomass carrying capacity	K	$7.5$	Logistic biomass-growth limit
Base growth-rate coefficient	$μ_{0}$	$0.215 h^{- 1}$	Multiplies temperature, pH, and growth-scale factors
Base decay coefficient	$d_{0}$	$0.009 h^{- 1}$	Basal biomass-decay coefficient
Product-dependent decay factor	–	$1 + 0.35 max (0, P - 6) / 6$	Increases biomass decay at elevated ethanol
Base enzyme-yield coefficient	$Y_{E, 0}$	$0.070$	Enzyme-production coefficient
Enzyme degradation coefficient	$k_{\deg}$	$0.010 + 0.010 max {[0, (T - 50) / 10]}^{2}$	Temperature-dependent enzyme degradation
Base hydrolysis capacity	$V_{max, 0}$	$0.235$	Hydrolysis-capacity coefficient
Hydrolysis half-saturation term	$K_{m}$	$0.85 / max (0.35, f_{pret} θ_{feed})$	Feedstock-accessibility-dependent hydrolysis term
Base ethanol-yield coefficient	$Y_{P, 0}$	$0.215$	Fermentation coefficient
Base inhibition coefficient	$k_{inh, 0}$	$0.070$	Ethanol-inhibition coefficient
Soluble-sugar loss term	–	$0.10 C$	Non-fermentative soluble-sugar loss in the hydrolysis phase
Pretreatment factor	$f_{pret}$	$1.10$ for nominal pretreatment index $\geq 0.5$	Multiplies enzyme production and hydrolysis capacity
Nominal feedstock-accessibility scale	$θ_{feed}$	$1.0$	Multiplicative feedstock-accessibility factor
Multiplicative-factor bounds	–	$0.35 \leq θ_{i} \leq 2.75$	Numerical bounds for plant-scale factors
State bounds	–	$0 \leq X, E \leq 20$ , $0 \leq B, C \leq 40$ , $0 \leq P \leq 30$	Non-negativity and numerical clipping bounds
Operating bounds	–	$30 ° C \leq T \leq 55 ° C$ , $5.0 \leq pH \leq 8.0$	Feasible input domain

Table 2. Sensor library for CBP sensor-set design. The burden index is a dimensionless relative measure of sampling effort, assay delay, calibration burden, and implementation difficulty.

Key	Measurement	$σ$	Burden Index
P	Ethanol	0.18	1.0
C	Sugar	0.20	1.5
X	Biomass proxy	0.05	2.0
E	Enzyme-activity proxy	0.06	3.0
B	Residual insoluble-substrate proxy	0.40	2.5

Table 3. Practical interpretation of the modeled CBP measurement channels.

Modeled Output	Possible Measurement Routes	Typical Implementation Mode	Main Practical Limitations
P Ethanol	HPLC, GC, enzymatic ethanol assay, NIR/Raman spectroscopy, ethanol biosensor	At-line or offline for chromatographic and enzymatic assays; potentially online for spectroscopy or biosensors	Assay or sampling delay, calibration transfer, spectral interference, sensor drift, and matrix effects in lignocellulosic broth.
C Soluble sugar	HPLC, enzymatic glucose/xylose assays, refractive-index methods, NIR/Raman spectroscopy	Mostly at-line or offline; spectroscopy may be online or in situ after calibration	Overlap among multiple sugars, matrix effects from hydrolysate components, calibration burden, and limited specificity for total fermentable sugar.
X Biomass proxy	Optical density, dry cell weight, capacitance/dielectric spectroscopy, turbidity, image analysis, soft-sensor estimate	Offline or at-line for dry weight and optical density; online possible for dielectric or turbidity probes	Interference from insoluble solids, bubbles, cell-morphology changes, fouling, and calibration dependence on organism and feedstock.
E Enzyme-activity proxy	Cellulase activity assay, fluorogenic or colorimetric enzyme assay, protein assay, soft-sensor estimate from hydrolysis response	Primarily at-line or offline	Assay latency, substrate specificity, enzyme-mixture complexity, inhibition effects, temperature and pH dependence, and limited online availability.
B Residual insoluble-substrate proxy	Gravimetric residual solids, total suspended solids, near-infrared spectroscopy, image analysis, solids-balance estimate	Mostly offline or at-line; online estimation possible through calibrated spectroscopy or soft sensing	Sampling heterogeneity, solids settling, poor representativeness, matrix effects, pretreatment-dependent calibration, and high measurement burden.

The table links the scalar measurement channels used in the computational model to plausible experimental or PAT implementations. The listed methods are not unique choices; they illustrate how each modeled output could be approximated in laboratory or pilot-scale CBP monitoring. The numerical noise and burden values used in the computational ranking are nominal workflow assumptions and should be replaced by platform-specific values when experimental sensor data become available.

Table 4. UKF Monte Carlo settings for reproducing the soft-sensing experiment.

Item	Setting
Candidate sets	16 ethanol-mandatory packages: ${P} \cup A$ , where $A \subseteq {C, X, E, B}$ .
Base seed and pairing	Base seed $= 42$ . Plant mismatch and initial-estimate perturbations were indexed by replicate only, giving paired comparisons across sensor sets.
Simulation grid	$t_{f} = 96 h$ , RK4 step $= 2 h$ ; errors evaluated at $t = 0, 2, \dots, 96 h$ .
UKF sampling	Measurements every $12 h$ , with updates at $t = 12, 24, \dots, 96 h$ .
Initial covariance	$P_{0} = diag {(0.06, 0.08, 3.00, 0.60, 0.40)}^{2}$ , state order $[X, E, B, C, P]$ .
Process noise	$Q = diag {(0.0099, 0.0126, 0.9000, 0.03375, 0.02925)}^{2}$ .
Measurement noise	$R_{S} = diag (σ_{i}^{2})$ , using the sensor noise values in Table 2.
Measurement-noise streams	Noise was generated by replicate, time, and channel; shared channels used the same noise draw across sensor sets.
Plant–model mismatch	Plant scales $s_{j} = exp (η_{j})$ . For growth, enzyme yield, hydrolysis, ethanol yield, and feedstock: $η_{j} \sim N (0, 0 . 22^{2})$ . For decay and inhibition: $η_{j} \sim N (0, {(0.65 \times 0.22)}^{2})$ . Scales were clipped to $[0.35, 2.75]$ .
Initial estimate	${\hat{x}}_{0} = x_{0} ⊙ (1 + ϵ)$ , $ϵ_{j} \sim N (0, 0 . 18^{2})$ , followed by state-bound clipping and ${\hat{B}}_{0} \geq 2$ .
Measurement clipping	Noisy measurements were clipped to zero; bias offsets were added in bias-stress scenarios.
State bounds	$0 \leq X, E \leq 20$ , $0 \leq B, C \leq 40$ , $0 \leq P \leq 30$ .
RMSE metric	Latent RMSE used $X, E, B, C$ ; ethanol was excluded because it was measured in all candidate sets.
Statistical comparison	2000 bootstrap resamples and 15 paired ethanol-only contrasts; Wilcoxon p-values were Holm- and Benjamini–Hochberg-adjusted.

Table 5. Main computational assumptions used in the sensor-set comparison.

Category	Values Used	Purpose
Sensor noise	$σ_{P} = 0.18$ , $σ_{C} = 0.20$ , $σ_{X} = 0.05$ , $σ_{E} = 0.06$ , $σ_{B} = 0.40$	Weights FIM calculations and defines UKF measurement covariance.
Measurement burden	$P = 1.0$ , $C = 1.5$ , $X = 2.0$ , $E = 3.0$ , $B = 2.5$	Represents relative sampling, delay, calibration, and implementation burden.
Candidate space	All 16 ethanol-mandatory combinations	Avoids ranking only a restricted seven-package list; ethanol is the product baseline.
Sampling intervals	6, 12, and $24 h$	Tests dense-to-sparse laboratory or pilot-scale sampling.
Finite differences	$ε_{j} = 10^{- 3} max (\| x_{0, j} \|, 1) + 10^{- 4}$ ; $δ = 10^{- 3}$	Computes state and log-parameter sensitivities; clipped perturbations use the actual denominator.
FIM threshold	$τ = max (10^{- 8} λ_{max}, 10^{- 12})$	Defines numerical rank and active eigenvalues.
Regularization	$ρ = max (10^{- 8} λ_{max}, 10^{- 10})$ ; $10^{- 10} I$ for covariance pseudoinverse	Stabilizes determinant and uncertainty diagnostics.
Primary weights	$(w_{obs}, w_{id}, w_{ukf}, w_{burden}) = (0.30, 0.35, 0.25, 0.10)$	Balances observability, identifiability, reconstruction accuracy, and burden.
Alternative weights	Equal, observability-focused, identifiability-focused, reconstruction-focused, burden-sensitive, burden-averse	Tests dependence on design priorities.
UKF settings	$α = 0.35$ , $β = 2$ , $κ = 0$ ; RK4 propagation	Common nonlinear filtering setup for all sensor sets.
Monte Carlo size	$N_{MC} = 100$ per candidate set	Estimates reconstruction-error distributions under mismatch and noise conditions.
Mismatch design	Multiplicative perturbations in growth, enzyme yield, hydrolysis, ethanol yield, decay, inhibition, and feedstock accessibility	Tests reconstruction when the estimator uses the nominal model but the virtual plant differs.
Measurement stresses	Missingness, $6 h$ delay, moderate all-sensor bias, larger proxy bias	Reruns the complete UKF workflow under imperfect practical measurement conditions.
Noise robustness	200 sensor-specific noise scenarios with $m_{k} \in [0.5, 2.0]$	Tests nonuniform changes in sensor accuracy.
Trajectory robustness	Nominal, mild, hydrolysis-extended, fermentation-early, shifted-feasible	Tests dependence on the temperature–pH excitation schedule.
Burden workflows	Nominal, spectroscopy/biosensor, offline assay, dielectric-biomass, high-solids burden	Tests sensitivity to sensor-specific implementation assumptions.

Table 6. Representative mean UKF RMSE values by sensor set.

Sensor Set	X	E	B	C	P
EtOH	1.01	0.45	3.02	0.28	0.16
EtOH–C	0.97	0.43	2.98	0.19	0.14
EtOH–B	0.99	0.44	1.13	0.27	0.16
EtOH–C–B	0.97	0.43	1.11	0.18	0.14
EtOH–C–X–E	0.12	0.08	3.43	0.19	0.16
EtOH–C–X–B	0.11	0.22	1.11	0.20	0.16
EtOH–X–E–B	0.12	0.08	1.11	0.23	0.17
Full	0.12	0.08	1.11	0.19	0.16

X: biomass, E: enzyme activity, B: substrate, C: sugar, P: ethanol. The table reports representative packages; the full 16-set RMSE distributions are shown in Figure 9.

Table 7. Paired comparison of latent-state UKF RMSE against ethanol-only monitoring.

Sensor Set	RMSE	Mean Red.	Bootstrap 95% CI	Holm p
EtOH–C–X–B	0.4121	0.7778	$[0.7022, 0.8558]$	$5.84 \times 10^{- 17}$
Full proxy	0.3756	0.8143	$[0.7398, 0.8894]$	$5.84 \times 10^{- 17}$
EtOH–X–E–B	0.3843	0.8056	$[0.7305, 0.8824]$	$5.84 \times 10^{- 17}$
EtOH–C–E–B	0.4634	0.7265	$[0.6516, 0.8032]$	$5.84 \times 10^{- 17}$
EtOH–X–B	0.4243	0.7656	$[0.6883, 0.8380]$	$5.84 \times 10^{- 17}$
EtOH–C–X–E	0.9554	0.2345	$[0.1619, 0.3070]$	$4.02 \times 10^{- 7}$
EtOH–E–B	0.4718	0.7181	$[0.6434, 0.7888]$	$5.84 \times 10^{- 17}$
EtOH–C–X	1.0093	0.1806	$[0.0997, 0.2603]$	$5.94 \times 10^{- 5}$
EtOH–X	0.9396	0.2504	$[0.1884, 0.3122]$	$2.33 \times 10^{- 10}$
EtOH–X–E	0.8869	0.3030	$[0.2446, 0.3641]$	$2.62 \times 10^{- 14}$
EtOH–E	0.9732	0.2167	$[0.1574, 0.2811]$	$6.45 \times 10^{- 9}$
EtOH–C–B	0.6737	0.5162	$[0.4509, 0.5826]$	$6.82 \times 10^{- 17}$
EtOH–C–E	1.0392	0.1507	$[0.0727, 0.2224]$	$3.82 \times 10^{- 4}$
EtOH–B	0.7051	0.4848	$[0.4174, 0.5561]$	$5.84 \times 10^{- 17}$
EtOH–C	1.1398	0.0501	$[0.0144, 0.0878]$	$2.30 \times 10^{- 2}$

Baseline ethanol-only mean latent-state RMSE = 1.1899. Mean red. denotes the mean paired reduction in latent-state RMSE relative to ethanol-only monitoring. Bootstrap intervals used 2000 resamples. Holm-adjusted p-values were calculated over 15 ethanol-only contrasts. C: sugar, X: biomass, E: enzyme, B: substrate.

Table 8. Primary sensor-set ranking across all 16 ethanol-mandatory candidates.

Rank	Sensor Set	Burden	Score	Value/Burden
1	EtOH–C–X–B	7.0	0.8432	0.1205
2	Full proxy	10.0	0.8173	0.0817
3	EtOH–X–E–B	8.5	0.8086	0.0951
4	EtOH–C–E–B	8.0	0.7557	0.0945
5	EtOH–X–B	5.5	0.7395	0.1344
6	EtOH–C–X–E	7.5	0.7228	0.0964
7	EtOH–E–B	6.5	0.6573	0.1011
8	EtOH–C–X	4.5	0.6531	0.1451
9	EtOH–X	3.0	0.6465	0.2155
10	EtOH–X–E	6.0	0.6457	0.1076
11	EtOH–E	4.0	0.5846	0.1461
12	EtOH–C–B	5.0	0.4665	0.0933
13	EtOH–C–E	5.5	0.3892	0.0708
14	EtOH–B	3.5	0.3540	0.1012
15	EtOH–C	2.5	0.2894	0.1157
16	EtOH	1.0	0.1440	0.1440

EtOH: ethanol, C: sugar, X: biomass, E: enzyme, B: substrate. Burden is the dimensionless measurement-burden index. The rank is based on the primary aggregate sensor-value score. Value/burden is reported only as a secondary diagnostic and was not used to determine the rank. Latent-state RMSE values are reported separately in Table 6.

Table 9. Uniform noise-sensitivity summary. The top-ranked set was EtOH–C–X–B in all noise scenarios.

Noise Scenario	Multiplier	Top Score	Spearman $ρ$
Low noise	0.5	0.8450	0.9971
Nominal noise	1.0	0.8432	1.0000
High noise	2.0	0.8409	0.9882

EtOH–C–X–B denotes ethanol–sugar–biomass–substrate. Spearman

ρ

is relative to the nominal-noise ranking. Scores are normalized within each noise scenario.

Table 10. Operating-trajectory robustness summary.

Trajectory	Top Sensor Set	Top Score	Spearman $ρ$
Nominal	EtOH–C–X–B	0.8432	1.0000
Mild profile	Full proxy	0.8994	0.9941
Hydrolysis extended	EtOH–C–X–B	0.8465	0.9971
Fermentation early	Full proxy	0.8993	0.9912
Shifted feasible	Full proxy	0.8991	0.9647

Spearman

ρ

is relative to the nominal-trajectory ranking. Scores are normalized within each trajectory scenario.

Table 11. Full UKF missingness, delay, and bias stress-test summary. The top-ranked set was EtOH–C–X–B in all scenarios.

Stress Scenario	Top RMSE	Top Score	Spearman $ρ$
Nominal	0.4121	0.8432	1.0000
$20 %$ missing, all sensors	0.4683	0.8410	0.9941
Proxy missingness	0.5495	0.8321	0.9794
$6 h$ assay delay	0.5953	0.8483	0.9559
Moderate bias	0.4210	0.8430	0.9971
Proxy bias	0.4396	0.8506	0.9529

EtOH–C–X–B denotes ethanol–sugar–biomass–substrate. Spearman

ρ

is relative to the nominal stress-test ranking. Scores are normalized within each stress scenario.

Table 12. Top-ranked sensor sets under alternative scoring-weight assumptions.

Weighting Scheme	Top Sensor Set	Top Score	Burden
Primary	EtOH–C–X–B	0.8432	7.0
Equal weights	EtOH–C–X–B	0.7621	7.0
Observability focused	Full proxy	0.8520	10.0
Identifiability focused	EtOH–C–X–B	0.8352	7.0
Reconstruction focused	EtOH–C–X–B	0.8629	7.0
Burden sensitive	EtOH–C–X–B	0.7310	7.0
Burden averse	EtOH–C–X–B	0.6763	7.0

EtOH: ethanol, C: sugar, X: biomass, E: enzyme, B: substrate. Scores are normalized within each weighting scenario.

Table 13. Sensor-specific burden-workflow sensitivity summary. The top-ranked set was EtOH–C–X–B in all burden workflows.

Burden Workflow	Top Score	Spearman $ρ$
Nominal	0.8432	1.0000
Online spectroscopy/biosensor	0.8471	0.9912
Offline HPLC/assay workflow	0.8420	0.9971
Dielectric biomass available	0.8471	0.9824
High-solids burden	0.8367	0.9765

EtOH–C–X–B denotes ethanol–sugar–biomass–substrate. Spearman

ρ

is relative to the nominal burden-workflow ranking.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Korang Yeboah, M.; Asiedu, N.Y.; Addo, A. Observability- and Identifiability-Guided Sensor-Set Design for Digital-Twin-Assisted Consolidated Bioprocessing. Sensors 2026, 26, 3948. https://doi.org/10.3390/s26123948

AMA Style

Korang Yeboah M, Asiedu NY, Addo A. Observability- and Identifiability-Guided Sensor-Set Design for Digital-Twin-Assisted Consolidated Bioprocessing. Sensors. 2026; 26(12):3948. https://doi.org/10.3390/s26123948

Chicago/Turabian Style

Korang Yeboah, Mark, Nana Yaw Asiedu, and Ahmad Addo. 2026. "Observability- and Identifiability-Guided Sensor-Set Design for Digital-Twin-Assisted Consolidated Bioprocessing" Sensors 26, no. 12: 3948. https://doi.org/10.3390/s26123948

APA Style

Korang Yeboah, M., Asiedu, N. Y., & Addo, A. (2026). Observability- and Identifiability-Guided Sensor-Set Design for Digital-Twin-Assisted Consolidated Bioprocessing. Sensors, 26(12), 3948. https://doi.org/10.3390/s26123948

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Observability- and Identifiability-Guided Sensor-Set Design for Digital-Twin-Assisted Consolidated Bioprocessing

Abstract

1. Introduction

2. Model, Sensor Sets, and Information-Based Analysis

2.1. Hybrid CBP Digital-Twin Model

2.2. Candidate Sensor Sets and Measurement Assumptions

2.3. State Observability Analysis

2.4. Parameter Identifiability Analysis

3. Soft-Sensor Evaluation, Sensor Ranking, and Robustness Assessment

3.1. Soft-Sensor Reconstruction Test

3.2. Scoring Sensor Values and Rankings

3.3. Noise, Operating-Trajectory, and Scoring Robustness Analyses

3.4. Computational Reproducibility

4. Results and Discussion

4.1. Nominal CBP Trajectory with the Excitation Schedule

4.2. State-Observability Enhancement with Increasingly Informative Sensors

4.3. Parameter-Identifiability Improvement with Biomass and Enzyme Sensors

4.4. Impact of the Sensor Set on UKF Reconstruction Quality

4.5. Recommended Sensor-Set Ranking and Measurement-Burden Trade-Off

4.6. Robustness to Measurement Noise, Operating Trajectories, Measurement Imperfections, and Scoring Weights

4.7. Implications for Practical CBP Digital-Twin Development

4.8. Limitations

5. Summary and Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Acknowledgments

Conflicts of Interest

Abbreviations

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI