Missing Data Calculation Using the Antioxidant Activity in Selected Herbs

In this paper, a model has been developed that can estimate the composition of the phenol compounds, based on censored data and the total equivalent antioxidant capacity (TEAC) measured by three different methods. A contingency of 32 plants was analyzed: total phenolic content (TPC), caffeic acid, p-coumaric acid, ferulic acid, neochlorogenic acid and TEAC. They were measured by three different methods: ABTS (2,20-azinobis-(3-ethylbenzthiazoline6-sulfonic acid)), DPPH (1,1-diphenyl-2-picrylhydrazyl radical) and FRAP (ferric reducing/antioxidant power). Five values of caffeic-, thirteen of p-coumaric-, seven of ferulic-, and nineteen neochlorogenic acids were missing. Due to the complexity of the compounds, data mining and computational methods are required to determine the missing data. The method developed for independent variables was used to estimate the missing data. The contingency was filled with the calculated values obtained with all alternatives. The performance of each approach is shown in the estimation and/or prediction of the phenolic composition compared to the approaches used. The results indicated a strong correlation and mutual influence between the data analyzed.

These natural products [4] are used to provide proactive support to various physiological systems. They can cure or prevent diseases (atherosclerosis, cancer, diabetes, degenerative diseases) [5,6].
Natural antioxidants can help preventing oxidation and regulate the immune function [7]. Overproduction of free radicals can cause oxidative damage to biomolecules. This can ultimately lead to many chronic and acute illnesses. Nowadays, examining the antioxidant capacity of plants has become an important subject [8].
Phenolic compounds are the subject to various studies because of their antioxidant activity. These chemical compounds have a complex structure, with multiple aromatic rings in the same molecule. Antioxidant activity can be measured by different methods. The most commonly used methods are ABTS (2,20-azinobis-(3-ethylbenzthiazoline-6-sulfonic acid)) and DPPH (1,1-diphenyl-2-picrylhydrazyl radical). Several studies compare the methods of determining the antioxidant effect to find the most accurate one [9,10].
Managing missing data is a common challenge in all areas of science. In recent years, serious research has been carried out on the analysis of censored data. To make the best use of existing data without adding or deleting, methods of data mining based on the missing dataset have also been proposed [11]. For example, after reviewing 100 articles from cancer magazines, 81 of them included analyzes with missing covariate data [12]. There are many possibilities to estimate missing values. Popular approaches include regression (conditional mean) and variants of nearest neighbor matching. The imputation methods confirm the initial estimation equation with an imputation model. This way they predict the value of missing data. After replacing the missing data with estimated values, the original model is assessed based on the total sample.
Regression imputation receives predicted values for missing data based on total observations. The estimated regression function is calculated using the assumed conditional average. Nearest neighbor matching is done by replacing the missing data based on "closest" values [13].
The χ 2 test was introduced by Karl Pearson [14]. It is used to determine whether the analyzed data is identically distributed across different populations/sub-groups of the same population [15,16]. Two significant modifications to Pearson's test were introduced by Ronald Aylmer Fisher [17]. First, the degree of freedom was decreased by one unit when applied to a contingency table. The other considered the number of unknown parameters associated with the theoretical distribution. The parameters are estimated from central moments [18]. The Chi-square test has been applied in all research areas, its main uses being goodness-of-fit [19], association/independence [20], homogeneity [21] or classification [22,23].
The correlation coefficient is a statistical measure of the relationship between one dependent and one or more independent variables [24]. Several correlation coefficients based on different statistical hypothesis are known and frequently used: Pearson correlation coefficient, Spearman rank correlation coefficient and semi-quantitative correlation coefficient, Kendall tau-a, -b and -c correlation coefficients, Gamma correlation coefficient [25].
There is no guidance on managing censored values or missing data methods. In this paper, we have developed an iterative algorithm that can identify the most likely missing values of the phenolic content (predictive power).
The rest of the paper is organized as follows. In Section 2, the proposed methods are explained. Section 3 covers the obtained results and their discussion. Finally, Section 4 draws general conclusions.

Measurement Methods
The antioxidant activity of the 32 selected herbs was reported by Wojdyło et al. 2007 [26]. All plant materials (0.5 kg each sample) were collected in 2005 by Polish Pharmaceutical Enterprises "JURBOAGRO" from Grabowo Wielkie. We used this research as the input data for our study (Table S1-Supplementary Material).
Phenolic acids were the subject of many studies because of their antioxidant activity [27][28][29]. The total antioxidant capacity expressed as TEAC was estimated by the following methods: ABTS, DPPH and ferric reducing/antioxidant power (FRAP) [30]. DPPH radical-scavenging activity was determined using the method [31]. The free radical-scavenging activity was determined by ABTS radical cation decolorization assay [32].
The DPPH free radical (DPPH) does not require any special preparation, while the ABTS radical cation (ABTS •+ ) must be generated by enzymes or chemical reactions [33]. The FRAP method is based on the reduction of a ferroin analog. Total polyphenol content was measured using Folin-Ciocalteu colorimetric method [34]. The results were corrected for dilution and expressed in µM Trolox per 100 g dry weight.
Four phenolic acid compounds were identified in the PubChem Compound Database. Further, the compounds are associated with their CID (the permanent compound identifier for a unique chemical Symmetry 2019, 11, 779 3 of 10 structure) number for more simple use, as: caffeic acid-689043 [35], p-coumaric acid-637542 [36], ferulic acid-445858 [37] and neochlorogenic acid-5280633 [38]. Further analysis can be conducted strictly with the chemical structure of the compounds, but they require more investigations.

Primary Data
The results [26], our input data, with the missing places (in red) are presented in Table S1 (see Supplementary Material). Table S1 lists the results of a quantitative analysis of the major phenolic compounds identified in the 32 herbs. The mean values are associated with the standard deviations calculated after the analysis.

Evaluation Methods
The method reported by Jäntschi [39] was adapted and modified to our experimental values. Homemade *.php programs were used to do all the calculations. The relationship between the four chemical compounds and their antioxidant activity (Table 1) using the Chi-square (χ 2 ) test was investigated. The values regarding the antioxidant activity and the phenolic compounds were transformed to logarithmic ones. The χ 2 method requires normally distributed data. This method is preferred because it is a natural transformation that also appears in the environment (e.g., pH determination). The nature of the plants, the compounds structure and the antioxidant activity are influencing each other.
The steps applied in our missing data analysis were as follows ( Figure 1): • Check that the linearity between antioxidant activity and phenol content is true of the experimental data in the analysis.

•
Three alternatives were taken into consideration. The experimental values were introduced in the algorithm in the first step.

•
Obtaining the coefficients using linear regression analysis (Equations (1) and (2)); using these to make estimations in the first cycle. • Fill in the missing places with estimated values.

•
Repeat: Obtain (new) expected values (Equation (3)) Calculate χ 2 using observed and expected values Insert in the missing places the (new) expected values Until the value of χ 2 is not significantly changed (e.g., convergence) In step 1, using linear regression analysis model function (Equation (1)), the linearity is verified; the regression coefficients are calculated (Equation (2)): where, β is the slope of the line and α is the intercept. The value of α and β can be defined as: where, and y as the average of the x i and y i ; n is the number of points taken into consideration, {(x i , y i ), i = 1, . . . , n}.
The steps applied in our missing data analysis were as follows ( Figure 1): • Check that the linearity between antioxidant activity and phenol content is true of the experimental data in the analysis. • Three alternatives were taken into consideration. The experimental values were introduced in the algorithm in the first step.

•
Obtaining the coefficients using linear regression analysis (Equations (1) and (2)); using these to make estimations in the first cycle.  Table S1 (see the Supplementary material) censored data.
In step 1, using linear regression analysis model function (Equation (1)), the linearity is verified; the regression coefficients are calculated (Equation (2)): where, β is the slope of the line and α is the intercept. The value of α and β can be defined as: After filling the missing places with values (step 2), the next mathematical model (Equation (3)) was applied to estimate the phenolic content. This was calculated based on available experimental data and estimated values (step 3).
where, O i,j = mean of the observed value for (i,j) pair of factors; E i,j = mean of the expected value for (i,j) pair of factors; X 2 = the value of Chi-square statistic; m = number of rows; n = number of columns ; 1 ≤ i ≤ m = indices of observations associated to the first factor; 1 ≤ j ≤ n = indices of observations associated to the second factor. One hypothesis is that the O i,j observations are the result of multiplying two factors (repeated observations approximate better the effect of multiplication) [40].
Three mathematical assumptions could be formulated in terms of square error ((O i,j − E i,j ) 2 ) of observation. The measurement is affected by chance errors: • on a scale with values (X 2 , Equation (4)) between absolute and relative errors (step 4); • absolute values (S 2 , Equation (5)); • relative values (Cv 2 , Equation (5)); In our study, the first hypothesis (minimization of the χ 2 statistic) (Equation (4)) was used: Two other hypotheses exist: the minimization of the variance (S 2 ) obtained between the model and the observation; and minimization of the squared coefficient of variation (Cv 2 ): where 1 ≤ i ≤ r = contribution of the first factor to the expected value E i,j ; 1 ≤ j ≤ c = contribution of the second factor to the expected value E i,j ; A graphical representation of the algorithm cycles is presented in Figure 2. The variation of the colors suggests the modification of the values in the missing places (in red). Each cycle also includes a whole set of estimates and expected values, X 2 calculations. Different colors indicate changes in data in missing places; these reach their final values in the Cycle n, after X 2 is not significantly changed compared to Cycle n − 1.
Two other hypotheses exist: the minimization of the variance (S 2 ) obtained between the model and the observation; and minimization of the squared coefficient of variation (Cv 2 ): where 1 ≤ i ≤ r = contribution of the first factor to the expected value Ei,j; 1 ≤ j ≤ c = contribution of the second factor to the expected value Ei,j; A graphical representation of the algorithm cycles is presented in Figure 2. The variation of the colors suggests the modification of the values in the missing places (in red). Each cycle also includes a whole set of estimates and expected values, X 2 calculations. Different colors indicate changes in data in missing places; these reach their final values in the Cycle n, after X 2 is not significantly changed compared to Cycle n-1.  Using this method, three alternatives were applied to complete the missing places in Table S1 (See Supplementary Material). The first option is to use the ABTS values, the second the DPPH values and the third the FRAP values to estimate the phenolic acid content. The hypothesis lay on the presumption that each value from the columns influences each value from the rows.
After fulfilling the missing places from Table 1, correlation coefficients were calculated (Pearson, Spearman, semi-quantitative, see below).
The following formula was used to calculate Pearson's quantitative correlation: where, r Prs = the Pearson correlation coefficient; n = number of samples; datasets {y 1 . . . y n } containing n values and another dataset {ŷ 1 . . .ŷ n } containing n values. Using Equation (6), the strength and direction of the linear relationship between two quantitative variables are measured. This way we described the direction and degree to which one variable is linearly related to another [41]. Student t-test was used to determine if the value of the Pearson correlation coefficient is statistically significant, at a significance level of 5%.
Spearman's qualitative rank correlation (Equation (7)) explains a non-parametric measure of the correlation between variables. It represents how well an arbitrary monotonic function could describe the relationship between two variables, without making any assumptions about their frequency distribution [42].
Spearman's coefficient is appropriate for both continuous, ordinal and discrete variables. Also, semi-quantitative coefficient (Equation (8)) was calculated: where r Sq = semi-quantitative correlation coefficient. The semi-quantitative coefficient is a combination between Pearson's quantitative and Spearman's rank qualitative coefficient.
These coefficients were calculated to quantify the degree of the monotonic non-linear inference (such as sigmoidal extremal deviations).

Results and Discussion
The TEAC experimental values were excluded from further calculations because their χ 2 were outliers. The experimental data from the plant Acorus calamus was excluded due to its outlier χ 2 results (Table 1 see below). The four phenolic acids relationship with the antioxidant capacity in the remained plants were analyzed. Table S1 (see Supplementary Material) shows the observed experimental values (obs.), the estimated, which fill the missing places (in bold), and the expected values (exp.) calculated from the regression. The values are the results of the analysis following the logarithmic transformation of the data.
Note that the values of each plant are representative because the experimental data and the set values are evolving in the same direction. The results do not indicate whether phenolic acid accurately determines the antioxidant activity. Each compound has a role in the delivery of its effect.
After collecting the results, correlation analysis was made to decide which phenolic compound influence the antioxidant activity. The statistical analysis was necessary because each phenolic acid contributes to the antioxidant capacity in a different measure.
The correlation coefficients calculated based on Equations (6)- (8), are included in Table 2 (see below). They revealed that the difference between Pearson's and Spearman's correlation coefficient was not notable. Both almost have the same significance. The only difference is that Spearman's correlation uses ranks instead of x and y values that are used by Pearson. In Table 2, the correlation coefficients indicated a strong correlation between the results. The coefficients can take values between −1 and +1. In this case, the average value of 0.75 showed that the variables are linearly related by an increasing relationship. The Student's t-test revealed that there is a statistically significant linear relationship between the variables. Each correlation coefficient explains a measure of association between variables. Also describes the relationship between the two of these. When the difference between them is not significant, each coefficient evolves in the same direction.
Symmetry 2019, 11, 779 7 of 10 X 2 as a function of iteration was also investigated after running the algorithm. This showed, as the statistical analysis, that the variables are close to each other. The evolution of X 2 as a function of iteration is presented in the next Figure 3.  Figure 3. Evolution of X 2 statistic as a function of iteration.
Other possibilities to fill the contingency table such as Monte Carlo methods were reported in the scientific literature. The basic idea of most Monte Carlo simulations is to iteratively propose a small random change in a configuration [42]. One of the Monte Carlo methods would fit our data set, the Bootstrap method. This method gives the estimates from a repeated resampling with replacement [43,44]. Taking into consideration the Monte Carlo methods, the results from these would not give additional information according to the variables. Also, a disadvantage of the Bootstrap method is that ignores the known variables, even if a linear relationship exists between them. The contingency table with our presented method is filled once, and after that, the X 2 operates on the variables. With the Monte Carlo methods, the table is filled several times, modifying the original variables.
Multiple studies [1,3,4,26,27] provided complex and diverse results on the same topic. The determined compounds have different characteristics, geometry and symmetry. The key to resolve the censored data problem is to find the most fitting one. After all, we used our methods to fill the contingency table, because gives the fastest way to the solution of the problem.
The application of the given algorithm is found in several scientific areas where missing data after the experiments are displayed. Censored data are widespread in epidemiological studies and clinical trials. They occur in large-scale studies in which the event typically is associated with an infection, disease or some other failure. The novelty of our study relies on the promptness and fastness of the algorithm.

Conclusions
Our algorithm proved able to operate on a contingency table with gaps (missing data). In the process, the X 2 statistic is minimized. The results indicate that phenolic compounds influence each other. A linear relationship between all analyzed datasets can be observed. All the three methods (ABTS, DPPH, FRAP) which were used to determine the antioxidant capacity gave comparable results. This is expected based on the experimental data and literature reviews. The estimated values for missing places fit into experimental data. The correlation between the numbers is clearly visible.
Further research of other chemical or biological substances may reveal if there is a relationship between antioxidant activity and the structure of the compounds.  After implying the algorithm (Figure 1) on the experimental data set, the resulted values of the X 2 fast converged to a minimum after different numbers of cycles. The minimum was reached within a few iterations.
A non-significant difference between consecutive X 2 values led to the stop of the algorithm. 12 iteration in case of ABTS, 10 iterations in case of DPPH, and 11 iterations in case of FRAP values were observed.
Other possibilities to fill the contingency table such as Monte Carlo methods were reported in the scientific literature. The basic idea of most Monte Carlo simulations is to iteratively propose a small random change in a configuration [42]. One of the Monte Carlo methods would fit our data set, the Bootstrap method. This method gives the estimates from a repeated resampling with replacement [43,44]. Taking into consideration the Monte Carlo methods, the results from these would not give additional information according to the variables. Also, a disadvantage of the Bootstrap method is that ignores the known variables, even if a linear relationship exists between them. The contingency table with our presented method is filled once, and after that, the X 2 operates on the variables. With the Monte Carlo methods, the table is filled several times, modifying the original variables.
Multiple studies [1,3,4,26,27] provided complex and diverse results on the same topic. The determined compounds have different characteristics, geometry and symmetry. The key to resolve the censored data problem is to find the most fitting one. After all, we used our methods to fill the contingency table, because gives the fastest way to the solution of the problem.
The application of the given algorithm is found in several scientific areas where missing data after the experiments are displayed. Censored data are widespread in epidemiological studies and clinical trials. They occur in large-scale studies in which the event typically is associated with an infection, disease or some other failure. The novelty of our study relies on the promptness and fastness of the algorithm.