#### 3.1. Case Study I: Specification Bias

In this case study, we considered the CHO metabolic model in Figure 1 with the measured exchange flux values and standard deviations reported previously [16]. We first employed the GLS regression to obtain the estimate of ${v}_{I}$, denoted by ${\widehat{v}}_{I,GLS}$ (see Supplementary Table S2). Figure 1 further depicts the flux distribution according to ${\widehat{v}}_{I,GLS}$. Below, we evaluated the impact of omitting a single reaction from the CHO metabolic network in terms of the bias in the estimated flux values and the significance of the linear regression. We computed the specification bias using Equation (15) and reported the bias in relative (percent) values with respect to ${\widehat{v}}_{I,GLS}$. Meanwhile, we employed ANOVA (analysis of variance) to establish the statistical significance of the linear regression [29].
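The GLS estimate referenced above can be sketched as follows. This is a generic reconstruction, not the authors' code: the small stoichiometric matrix, flux vector, and variance values are illustrative placeholders standing in for the CHO model quantities.

```python
import numpy as np

def gls_estimate(S, y, Sigma):
    """Generalized least squares estimate of the fluxes v in y ≈ S v,
    weighted by the inverse of the measurement covariance Sigma."""
    W = np.linalg.inv(Sigma)                 # weight matrix
    A = S.T @ W @ S                          # normal-equation matrix
    return np.linalg.solve(A, S.T @ W @ y)   # GLS solution

# Illustrative overdetermined system: 3 measurements, 2 unknown fluxes
S = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
v_true = np.array([2.0, 3.0])
Sigma = np.diag([0.1, 0.1, 0.2])             # measurement variances
y = S @ v_true                               # noise-free data for the demo
print(gls_estimate(S, y, Sigma))             # recovers [2. 3.]
```

With noise-free data, the GLS solution reproduces the true fluxes exactly; with noisy data, the covariance weighting downweights the less precise measurements.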

Table 1 gives the minimum, median, mean, and maximum absolute specification bias for the omission of single reactions, one at a time, from the stoichiometric matrix ${S}_{I}$. Here, we only removed reactions that would not create an orphan species, i.e., a species that does not participate in any reaction. For each reaction removal, we also generated 10,000 vectors of in silico data of $y={S}_{I}{v}_{I}$ using the full ${S}_{I}$ matrix and contaminated the data with independent Gaussian random noise with the variance-covariance matrix constructed from the reported standard deviations [16]. For each data vector, we evaluated the significance of regression by ANOVA using the reduced ${S}_{I}$ matrix, i.e., the matrix ${S}_{I}$ with a missing column (reaction). The averages of the p values from the ANOVA are given in Table 1. Here, we took a p value of 0.05 as the threshold to reject the GLS regression; any p value higher than the threshold indicates a poor regression outcome.
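The in silico noise-contamination loop described above can be sketched as below. This is a simplified reconstruction: the ANOVA here is the standard overall F-test of regression significance for a no-intercept model (regression mean square over residual mean square), and the matrix sizes, flux values, noise level, and replicate count are placeholders, not the paper's settings.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def regression_f_pvalue(S, y):
    """Overall ANOVA-style F-test of the no-intercept fit y ≈ S v:
    regression mean square over residual mean square."""
    n, p = S.shape
    v_hat, *_ = np.linalg.lstsq(S, y, rcond=None)
    y_fit = S @ v_hat
    ss_reg = np.sum(y_fit ** 2)          # uncentered regression SS (no intercept)
    ss_res = np.sum((y - y_fit) ** 2)    # residual SS
    F = (ss_reg / p) / (ss_res / (n - p))
    return stats.f.sf(F, p, n - p)

# Placeholder system standing in for S_I: 12 measurements, 4 fluxes
S_full = rng.normal(size=(12, 4))
v_true = np.array([5.0, -2.0, 3.0, 1.0])
S_red = S_full[:, :3]                    # drop one reaction (column)

pvals = [regression_f_pvalue(S_red, S_full @ v_true + rng.normal(scale=0.5, size=12))
         for _ in range(1000)]
print(np.mean(pvals))                    # average p value over the replicates
```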

The individual removal of roughly three-quarters of the reactions (26 out of 36) still produced a significant regression with p < 0.05. On average, the median, mean, and maximum specification biases in the flux estimates were higher for the removal of reactions that caused a poor regression (p > 0.05). The two highest p values came, as expected, from the removal of the reactions with the two largest fluxes, and each removal was accompanied by large specification biases. There were nevertheless exceptions where a poor regression resulted from removing a reaction with a moderately low flux value (e.g., reactions 21 and 22). On the other hand, many of the cases with a significant regression (p < 0.05) were associated with high maximum specification biases. In fact, several of the cases with the lowest p values (i.e., the most significant regressions) had a mean bias of >30% and a maximum bias above 800%. Equally important, the removal of several reactions with a low flux magnitude led to large mean and maximum flux biases (mean bias >150%), as highlighted in Table 1 and by thin red arrows in Figure 1. Therefore, while a poor regression generally points to a model misspecification problem or a violation of the assumption on the measurement noise, a statistically significant regression does not guarantee a small specification bias in the flux estimates. In addition, removing reactions with a low flux magnitude can cause disproportionately large specification biases in the flux estimates. These observations clearly motivate a more systematic assessment of the model misspecification issue in the overdetermined MFA.

#### 3.2. Case Study II: Stoichiometric Model Misspecification Tests

We evaluated the ability of the Ramsey RESET test, F-test, and LM test to detect the issue of stoichiometric matrix misspecification in the overdetermined MFA, particularly the existence of omitted or missing reactions from ${S}_{I}$. As outlined in Materials and Methods, we determined the rates of TP, TN, FP, and FN using randomly generated pairs of data $y=-{S}_{E}{v}_{E}$ and stoichiometric matrices ${S}_{I}$ with missing reactions. For the F- and LM tests, we used the information on the actual missing reactions, as well as a distinct set of reactions, as the design matrix of the missing variables $Z$. The results for the baseline MFA problems with 100 metabolites ($m$), 60 unknown internal reactions ($n_{v_I}$), 50 measured exchange reactions ($n_{v_E}$), and two, five, or 10 missing reactions ($n_{v_O}$) for different noise levels (1 to 20% CoV) are summarized in Table 2. Note that, by definition, the TP and FN rates sum to 1, and so do the FP and TN rates.

In general, the results in Table 2 showed that the F-test consistently outperformed the RESET and LM tests, providing high TP rates at moderately low FP rates across all noise levels and numbers of missing reactions. Further evaluations of the F-test for metabolic networks of different sizes ($m$ = 50 and 200 metabolites and $n$ = 55 and 220 reactions, respectively) in Table 3 confirmed its robust performance. As the level of measurement noise increased (higher CoV), the TP rates of the F-test expectedly dropped. We also observed that the smaller the number of missing reactions, the poorer the TP rates of the F-test. This trend was also expected since, with fewer missing reactions, the reduced ${S}_{I}$ was closer to the true system and could more accurately capture the flux balance. Therefore, for the F-test to correctly detect a misspecification of ${S}_{I}$, the missing reactions would need to cause a significant deterioration in the data fitting, a scenario that became less likely as the number of missing reactions decreased. Meanwhile, the FP rates were not a strong function of the noise level. The FP rates improved with a lower number of missing reactions, albeit only slightly.
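The comparison underlying the F-test is a standard nested-model test: fit the reduced model alone and then augmented with the candidate missing-reaction columns $Z$, and test whether the drop in residual sum of squares is significant. A minimal sketch, with illustrative matrices rather than the paper's networks:

```python
import numpy as np
from scipy import stats

def rss(X, y):
    """Residual sum of squares of the OLS fit y ≈ X b."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.sum((y - X @ b) ** 2))

def nested_f_test(X_red, Z, y):
    """F-test of the reduced model against the model augmented with
    the candidate missing-reaction columns Z."""
    X_full = np.hstack([X_red, Z])
    n, q = y.size, Z.shape[1]
    df_full = n - X_full.shape[1]
    F = ((rss(X_red, y) - rss(X_full, y)) / q) / (rss(X_full, y) / df_full)
    return stats.f.sf(F, q, df_full)

rng = np.random.default_rng(1)
X_red = rng.normal(size=(40, 3))
Z = rng.normal(size=(40, 2))                 # truly missing columns
y = X_red @ [1.0, 2.0, -1.0] + Z @ [3.0, -2.0] + rng.normal(scale=0.1, size=40)
print(nested_f_test(X_red, Z, y) < 0.05)     # misspecification detected
```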

With larger networks, detecting the same number of missing reactions using the F-test became more difficult, as expected. At the largest network size ($m$ = 200), the rate of correctly detecting a misspecification with two missing reactions was slightly below 60%. Fortunately, the TP rates for detecting five or more missing reactions remained high (>88%), and the FP rates depended only weakly on the network size and the number of missing reactions, remaining relatively low, between 10% and 15%, in most of the cases in our study (see also Supplementary Table S3).

On the other hand, the RESET test performed very poorly in this case study, with FP rates consistently higher than the TP rates. The general trends observed for the F-test also did not apply to the RESET test. We note that the RESET test is derived under the assumption that the data error has a constant variance (i.e., the data noise is homoscedastic) [20]. Since the data noise in this case study has a standard deviation that scales linearly with the mean flux value, this assumption was violated. Upon repeating the RESET test using homoscedastic in silico flux data, the RESET test performed much better, with much lower FP rates (see Supplementary Table S4). In addition, the trends of lower TP rates with increasing noise levels and with fewer missing reactions applied to the RESET test results when the data noise was homoscedastic.
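For reference, the Ramsey RESET test augments the original regression with powers of the fitted values and F-tests whether those extra terms improve the fit; under the homoscedasticity assumption noted above, a significant result signals misspecification. A generic sketch with illustrative data (not the paper's networks):

```python
import numpy as np
from scipy import stats

def reset_test(X, y, max_power=3):
    """Ramsey RESET: refit with powers of the fitted values added and
    F-test whether those extra terms improve the fit."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    y_fit = X @ b
    Z = np.column_stack([y_fit ** k for k in range(2, max_power + 1)])
    X_aug = np.hstack([X, Z])
    rss_r = np.sum((y - y_fit) ** 2)             # restricted RSS
    b_aug, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
    rss_f = np.sum((y - X_aug @ b_aug) ** 2)     # augmented RSS
    q, df = Z.shape[1], y.size - X_aug.shape[1]
    F = ((rss_r - rss_f) / q) / (rss_f / df)
    return stats.f.sf(F, q, df)

rng = np.random.default_rng(2)
x = rng.uniform(1, 3, size=60)
X = np.column_stack([np.ones(60), x])
y_bad = 1.0 + x ** 2 + rng.normal(scale=0.05, size=60)  # true model needs x^2
print(reset_test(X, y_bad) < 0.05)                      # nonlinearity flagged
```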

The LM test performed better than the RESET test but produced lower TP rates than the F-test, particularly at the highest number of missing reactions ($n_{v_O}=10$). The LM test can handle heteroscedastic data through the use of heteroscedasticity-consistent (HC) standard errors in the matrix $\widehat{\mathsf{\Omega}}$. The results in Table 2 showed that, like the F-test, the TP rates of the LM test decreased with increasing noise levels. Also, as with the F-test, the TP rates of the LM test increased upon increasing the number of missing reactions from two to five, but decreased upon increasing the number of missing reactions further. With an increasing number of missing reactions, the magnitude of the residuals from the OLS estimation using the misspecified model, and therefore the diagonal elements of the HC matrix $\widehat{\mathsf{\Omega}}$, would become larger. Since the rejection rates of the null hypothesis decreased with larger $\widehat{\mathsf{\Omega}}$, the FP rates tended to decrease with more missing reactions. For the same reason, the TP rates of the LM test dropped at the highest number of missing reactions.
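The LM test can be sketched in its textbook $n R^2$ form: regress the OLS residuals of the reduced model on the full set of regressors and compare $n R^2$ to a $\chi^2$ distribution with as many degrees of freedom as there are candidate columns. Note that the paper's variant additionally uses the HC covariance matrix $\widehat{\mathsf{\Omega}}$; this plain version, with illustrative data, is for intuition only.

```python
import numpy as np
from scipy import stats

def lm_test(X_red, Z, y):
    """Lagrange multiplier test for omitted variables Z: regress the
    reduced-model residuals on [X_red, Z]; LM = n * R^2 ~ chi2(q)."""
    b, *_ = np.linalg.lstsq(X_red, y, rcond=None)
    e = y - X_red @ b                          # residuals under H0
    X_full = np.hstack([X_red, Z])
    g, *_ = np.linalg.lstsq(X_full, e, rcond=None)
    e_aux = e - X_full @ g
    r2 = 1.0 - np.sum(e_aux ** 2) / np.sum(e ** 2)  # uncentered R^2 (no intercept)
    lm = y.size * r2
    return stats.chi2.sf(lm, Z.shape[1])

rng = np.random.default_rng(3)
X_red = rng.normal(size=(50, 3))
Z = rng.normal(size=(50, 2))
y = X_red @ [1.0, -1.0, 2.0] + Z @ [2.5, -2.5] + rng.normal(scale=0.1, size=50)
print(lm_test(X_red, Z, y) < 0.05)             # omitted variables detected
```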

Considering its robust performance in this case study, we therefore recommend the F-test for detecting model misspecification in the overdetermined MFA. The F-test requires the stoichiometry of the candidate missing reactions as an input. With the extensive knowledge of metabolic reactions available in the literature and in online public databases, such a requirement may not be overly limiting.

#### 3.3. Case Study III: Resolving Model Misspecification

In the last case study, we evaluated the performance of the proposed iterative procedure for resolving stoichiometric matrix misspecifications in the overdetermined MFA (see Materials and Methods). Here, we returned to the flux analysis of the CHO metabolic network in Figure 1. For the performance assessment, we created 100 different stoichiometric matrices ${S}_{I,true}$ by randomly removing a number $n_{extra}$ of columns from the stoichiometric matrix ${S}_{I}$ of the CHO model. For each ${S}_{I,true}$, we generated an artificial data vector $y={S}_{I,true}{v}_{I,true}$ using the GLS flux estimate (see Supplementary Table S2) and contaminated the data vector with independent Gaussian noise with zero mean and the variance-covariance matrix as in Case Study I. The data generation procedure was repeated 100 times. For each data vector, we then created a reduced matrix ${S}_{I,red}$ by randomly removing a number $n_{omit}$ of reactions from ${S}_{I,true}$. The reactions removed in the creation of ${S}_{I,true}$ and ${S}_{I,red}$ were subsequently combined in the matrix ${S}_{A}$. In other words, the set of candidate missing reactions ${S}_{A}$ had equal fractions of the actual omitted reactions and the extra reactions that were not used in the in silico data generation. Finally, we applied the strategy for resolving model misspecification to each data vector using the matrix ${S}_{I,red}$ as the reduced stoichiometric matrix and the matrix ${S}_{A}$ as the candidate missing reaction matrix. The strategy was implemented using two settings: (1) $k$ = 1 and (2) $k$ = 1 followed by $k$ = 2.
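A hedged sketch of how such an iterative repair loop could look for $k$ = 1: at each pass, every remaining candidate column of ${S}_{A}$ is F-tested one at a time against the current model, the most significant passing candidate is moved into the model, and the loop repeats until no candidate passes. This is a plausible reconstruction of the procedure described in Materials and Methods, not the authors' exact algorithm, and all matrices and values below are illustrative.

```python
import numpy as np
from scipy import stats

def add_one_pvalue(X, z, y):
    """p value of the F-test for adding the single candidate column z to X."""
    def rss(M):
        b, *_ = np.linalg.lstsq(M, y, rcond=None)
        return np.sum((y - M @ b) ** 2)
    X_aug = np.hstack([X, z[:, None]])
    df = y.size - X_aug.shape[1]
    F = (rss(X) - rss(X_aug)) / (rss(X_aug) / df)
    return stats.f.sf(F, 1, df)

def resolve_misspecification(S_red, S_A, y, alpha=0.05):
    """Greedy k = 1 loop: repeatedly add the most significant candidate."""
    S, candidates, included = S_red.copy(), list(range(S_A.shape[1])), []
    while candidates:
        pvals = [add_one_pvalue(S, S_A[:, j], y) for j in candidates]
        if min(pvals) >= alpha:
            break                              # no candidate improves the fit
        j_best = candidates[int(np.argmin(pvals))]
        S = np.hstack([S, S_A[:, [j_best]]])
        included.append(j_best)
        candidates.remove(j_best)
    return S, included

rng = np.random.default_rng(4)
S_red = rng.normal(size=(30, 3))
S_A = rng.normal(size=(30, 4))                 # columns 0, 1 truly missing; 2, 3 extra
y = S_red @ [1.0, 2.0, -1.0] + S_A[:, :2] @ [3.0, -3.0] + rng.normal(scale=0.1, size=30)
S_fixed, included = resolve_misspecification(S_red, S_A, y)
print(sorted(included))                        # should pick up columns 0 and 1
```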

Table 4 gives the number of reactions in ${S}_{A}$ that were not positively identified by the iterative procedure for inclusion in the stoichiometric matrix. As an indication of good performance, the number of remaining omitted reactions should be low, while the number of remaining extra reactions should be high. The results in Table 4 demonstrated that the proposed procedure with $k$ = 1 was able to correctly detect and incorporate almost all of the omitted reactions, while keeping the incorrect inclusion of extra reactions low. As expected, performing an additional run with $k$ = 2 after finishing the procedure with $k$ = 1 led to a higher incorporation rate of the omitted reactions, but this strategy came at the cost of a higher rate of incorrect addition of the extra reactions. Due to the small size of the CHO model and the number of missing reactions considered, a higher $k$ (e.g., $k$ = 2) led to the incorporation of all omitted and extra reactions (see Supplementary Table S5). Considering the trade-off above, we thus recommend a simple implementation with $k$ = 1 to resolve the issue of stoichiometric matrix misspecifications in the overdetermined MFA.