A Voting-Based Ensemble Deep Learning Method Focused on Multi-Step Prediction of Food Safety Risk Levels: Applications in Hazard Analysis of Heavy Metals in Grain Processing Products

Grain processing products constitute an essential component of the human diet and are among the main sources of heavy metal intake. Therefore, a systematic assessment of risk factors and early-warning systems are vital to control heavy metal hazards in grain processing products. In this study, we established a risk assessment model to systematically analyze heavy metal hazards and combined the model with the K-means++ algorithm to perform risk level classification. We then employed deep learning models to conduct a multi-step prediction of risk levels, providing an early warning of food safety risks. By introducing a voting-ensemble technique, the accuracy of the prediction model was improved. The results indicated that the proposed model was superior to other models, exhibiting the overall accuracy of 90.47% in the 7-day prediction and thus satisfying the basic requirement of the food supervision department. This study provides a novel early-warning model for the systematic assessment of the risk level and further allows the development of targeted regulatory strategies to improve supervision efficiency.


Introduction
Food quality and safety issues have drawn wide interest with the continuous development of the social economy [1]. Governments worldwide have exerted considerable efforts to improve food safety [2]. In China, more stringent regulatory measures have been implemented by the government. Despite these efforts, food safety incidents still arise [3]. Food safety concerns challenge the food safety oversight system of the country and pose an economic threat [4]. One reason is that most supervision measures and methods rely on manpower, and a severe shortage of qualified professional supervision talents is a current concern [5]. Meanwhile, heavy metal deposition in agroecosystems has recently increased, particularly in northern China, posing serious risks to crop safety and human health via the food chain [6]. The quality and safety of grain processing products such as wheat and its products, which are vital food and feed ingredients in China, have gained interest [7]. Meanwhile, grain processing products have become the main source of heavy metal intake among residents in China [8]. To reduce the likelihood of heavy metal contamination at a reduced cost, big data and artificial intelligence methods need to be applied for efficient monitoring of safety issues. Moreover, appropriate food safety assessment methods have to be implemented to determine the effect of heavy metal contamination on the safety of grain processing products.
The actual situation in China indicates that heavy metals are likely to increase noncarcinogenic and carcinogenic health risks [9]. For instance, exposure to cadmium has been associated with numerous adverse health effects, including atherosclerosis, cancer, and possibly melanoma [10]; we established a dietary exposure assessment model to analyze the food safety risks of heavy metals attributed to environmental factors, carcinogenic, and noncarcinogenic health factors in grain processing products. A comprehensive risk assessment and a hazard analysis of heavy metals were conducted using the analytic hierarchy process based on entropy weight (AHP-EW), combined with K-means++ clustering. Accordingly, we proposed an improved early-warning model based on a voting-ensemble method, which integrates deep learning models in the multi-step prediction. The effectiveness of the proposed early-warning approach was validated by grain processing product detection data from the National Food Safety Sampling Inspection Information System and then the proposed model was compared with current models. This approach benefits food safety supervision departments by reducing manpower supervision costs and can effectively predict food safety risks.

Food Safety Risk Assessment and Classification
Food quality and safety are closely related to the health and living standard of individuals, and the risk assessment of food quality and safety bears considerable social significance [11]. A food safety risk assessment and early-warning analysis have recently been conducted. Several studies on risk assessment [11][12][13] have applied AHP-EW to determine objective food safety risk values as inputs in early-warning models. However, studies use single risk values as the assessment index and thus lack a comprehensive risk assessment. A food safety assessment has to consider the effect of food pollutant exposure on human beings. Accordingly, B. Niu et al. [14] established dietary exposure models, which are typically used to assess the carcinogenic and non-carcinogenic risks of children and adults after metal exposure [15], allowing for a comprehensive assessment of the health risks in vegetables and providing scientific and comprehensive support for risk assessments.
For a comprehensive assessment of food safety risks, many dietary exposure assessment models have been explored, including the target hazard quotient (THQ) and target cancer risk (TCR) established by the United States Environmental Protection Agency (US EPA) in 2000. THQ is the pollutant exposure dose and reference dose to characterize the non-carcinogenic risk of pollutants [16]. TCR is based on the pollutant exposure dose and carcinogenic intensity index. The index reflects the possible type of carcinogenic risk [17]. Moreover, the Nemerow integrated pollution index (NIPI) is a water pollution index used to evaluate heavy metal pollution in soil or sediment [18]. Considering the need to integrate the heavy metal hazards of grain processing products with environmental and health factors, we introduced food safety risk assessment indexes-TCR, THQ, and NIPI-to comprehensively measure the heavy metal hazard in grain processing products and used them as inputs in early-warning models.
With regard to risk classification, Geng et al. and Ma et al. used the interval distribution or risk matrix of the risk value, respectively [13,19], to establish food safety risk levels. However, risk level classification based on risk values, rather than assessment values, fails to comprehensively reflect the risks associated with heavy metals, and risk level classification based on the interval distribution or risk matrix is subjective [19]. As a machine learning method, the clustering algorithm classifies samples based on sample similarity in a data-driven manner [20,21]. Thus, the influence of subjective factors is effectively reduced, and index prediction is converted to level prediction. Therefore, we decided to combine risk assessment indexes with K-means++ clustering (an improved clustering algorithm) to realize a comprehensive assessment and objective classification of heavy metal hazards in grain processing products.

Early-Warning Models of Food Safety Risk
In early research, common early-warning models of food safety risk, including models based on a grey relational analysis (GRA) [22] and artificial neural networks (ANNs) [23], were applied to food safety prediction problems. Han et al. [11] developed multiple GRA models to forecast food quality and safety. Lin et al. [24] adopted a GRA model that integrates interpretative structural modeling to analyze the factors influencing food safety.
In relatively recent studies, neural networks were applied in food safety early-warning models. Geng et al. [25] used the radial basis function (RBF) as the element to construct a deep RBF model for early-warning modeling of complex food detection data. Geng et al. also used the agglomerative hierarchical clustering radial basis function (AHC-RBF) neural network to adaptively obtain the central position of the hidden layer nodes of the RBF, thus improving the prediction precision of the RBF [13].
However, typical shallow neural network methods, such as ANN, back propagation (BP), and RBF, may not be able to extract and use deep features [26]. However, deep learning methods such as long short-term memory (LSTM), gated recurrent units (GRUs), and recurrent neural networks (RNNs) can suitably capture high-dimensional features and exhibit temporal dynamic behavior [27]. These approaches have been employed in weather forecasting or travel-time prediction, achieving enhanced accuracy [28]. Meanwhile, the voting ensemble method can overcome limitations to several algorithms [29] and decrease the variance in a trained model on the validation set [27].
The majority of current prediction methods used in food safety are single-step prediction or fitting prediction techniques, which cannot predict unknown data, that is, future occurrences. As a significant research area in data analysis, time series forecasting plays an important role in the processing industry [30], clinical medicine [31], and other sectors [32] because of its capability to analyze the historical data of a dynamic system and predict future operation patterns [33]. This feature is consistent with the requirement of food safety risk prediction. Therefore, a multi-step time series prediction is more valuable than a single-step prediction [34], and the same is true for food safety. Considering the actual requirement of the food supervision department, we proposed a voting-ensemble technique that integrates deep learning models to grasp the long-term change trend of food safety risks, realizing a more accurate multi-step prediction of food safety risk than that of shallow NNs.
The specific food safety risk assessment and early-warning model proposed in this study is presented in Figure 1. As shown in Figure 1, the newly proposed method of classifying and predicting food safety risk levels, integrated with assessment indexes, mainly consists of three blocks. In the assessment blocks, the dietary assessment of heavy metal hazards is conducted. In the classification block, a clustering algorithm is employed to determine the risk level. In the  As shown in Figure 1, the newly proposed method of classifying and predicting food safety risk levels, integrated with assessment indexes, mainly consists of three blocks. In the assessment blocks, the dietary assessment of heavy metal hazards is conducted. In the classification block, a clustering algorithm is employed to determine the risk level. In the prediction block, we applied a voting-based ensemble deep learning method to implement the multi-step prediction.

Data Sources
In this study, three heavy metals that cause heavy metal pollution in grain processing products are selected as the research objects: chromium [35], cadmium [14], and arsenic [36]. A total of 65,302 samples from the 2020 National Food Safety Sampling Inspection Information System are included: chromium (12,501), cadmium (29,456), and arsenic (23,795). Descriptions of several detection data are shown in Table 1. The detection data cover the 20 provinces of China from March to December 2020 and are characterized by high-dimensional attributes, complexity, discreteness, and nonlinearity, which are reflected in the distribution of the mean values of the three heavy metals' daily detection data in Figure 2 and the description of several observations in Table 1 [12]. The detection data cover the 20 provinces of China from March to December 2020 and are characterized by high-dimensional attributes, complexity, discreteness, and nonlinearity, which are reflected in the distribution of the mean values of the three heavy metals' daily detection data in Figure 2 and the description of several observations in Table 1 [12]. To establish the subsequent risk assessment model, we collect the resident consumption data and related toxicology data to calculate the assessment indexes. The data on the resident consumption of grain processing products in the 20 provinces shown in the Table  2 come from the Fifth Chinese Total Diet Study [8], which adopts stratification based on population proportions and multi-stage cluster random sampling to conduct a dietary questionnaire survey on the main foods consumed by residents. To establish the subsequent risk assessment model, we collect the resident consumption data and related toxicology data to calculate the assessment indexes. The data on the resident consumption of grain processing products in the 20 provinces shown in the Table 2 come from the Fifth Chinese Total Diet Study [8], which adopts stratification based on population proportions and multi-stage cluster random sampling to conduct a dietary questionnaire survey on the main foods consumed by residents. Moreover, related toxicology data are acquired from reports or bibliographic searches of international organizations, such as the Food and Agriculture Organization of the United Nations, the World Health Organization (WHO) Joint Expert Committee on Food Additives, and the United States EPA. The reference doses of chromium, cadmium, and arsenic are 0.003 (trivalent chromium) µg/(kg d), 0.001 µg/(kg d), and 0.0003 µg/(kg d), respectively [37]. The cancer slope factor (CSF) of chromium, cadmium, and arsenic are 0.5 (kg d)/mg, 6.3 (kg d)/mg, and 1.5 (kg d)/mg, respectively [38].

Data Preprocessing
Some detection results are recorded as "not detected" in the original data. With the requirement to predict the levels of food safety risk considered, these results are replaced by half of the metal detection line instead of being directly replaced with zero [39] in accordance with the principle of the credible evaluation of low pollutant levels proposed at the second meeting of the WHO Global Environment Monitoring System/Food [14]. For results with an extra symbol such as "<" the symbol is deleted, and the value is retained [12]. Moreover, the detection results for total arsenic in food are converted to inorganic arsenic at a ratio of 70% to calculate the exposure amount [40].

Assessment Indexes
To improve the accuracy of predicting the food safety risk level and measure the precise effect of heavy metal hazards in grain processing products on the human body, the following safety indexes are selected in this study to classify the daily risk levels.
The NIPI, which reflects the characteristics of food pollution, is used to evaluate heavy metal pollution in rice [16], air [41], and water [16]. The NIPI P c(i,j) of the heavy metal j in grain processing products i is given by where P max(i,j) is the maximum value of the heavy metal j pollution index in grain processing products, and P avg(i,j) is the average value. The specific expression of the pollution index is expressed as where P i,j , X i,j are the pollution index and detection value of heavy metal j in food i, respectively, and S i,j is the national limit standard for heavy metal j in food i. In this study, food i denotes the grain processing, and j represents chromium, cadmium, and arsenic.
The detection values of the grain processing products with the same data are substituted into Equations (1) and (2) to calculate the NIPI of the three heavy metals on a certain day.
Considering the carcinogenicity of heavy metals, we use the TCR to measure the carcinogenic risk. Meanwhile, the non-carcinogenic risk is given by THQ, which is based on the pollutant exposure dose and reference dose [42]. The TCR is based on pollutant exposure dose, and the carcinogenic intensity index of the product reflects the possible type of carcinogenic risk [43].
The specific expression of TCR is given by where EF is the exposure frequency (365 days/year); ED is the exposure period (70 years in the current study); CSF j denotes the carcinogenic intensity index of the heavy metal j (kg·d/mg); AT C is the duration of the carcinogenic effect (365 days/years*exposure period, assumed to be 70 years in this study). EDI 50 j is calculated using where FC j is the per capita daily consumption of China's grain processing products i (kg/d); X 50 i,j is the 50th quantile (mg/kg) of the heavy metal j detected on a certain day; and W is the average body mass of the residents (60 kg in this study).
Similarly, THQ is expressed as where R f D j (reference dose) is the oral reference dose of the heavy metal j (kg·d/mg). EDI 95 is calculated as where X 95 i,j is the 95th quantile (mg/kg) of heavy metal j detected on a certain day.

Analytic Hierarchy Process Base on Entropy Weight
To reduce the scale of data and comprehensively measure the risk of heavy metal contamination in grain processing products in China on a certain day, this study determines the comprehensive assessment indexes by using the AHP-EW [12]. The assessment indexes of the three heavy metals are combined using the weight vector W = [w 1 , w 2 , . . . , w n ] T . The fusion data point Y is calculated using Equation (7) where X is the value matrix of the assessment indexes. The food safety risk assessment index is calculated based on AHP-EW ( Figure 3). Therefore, on the basis of the AHP-EW method, this study integrates the indexes NIPI, TCR, and THQ of the three heavy metals into comprehensive NIPI, comprehensive TCR, and comprehensive THQ, namely, the food safety risk assessment index. These indexes are then used as the basis of risk classification and the input vector of the subsequent prediction model.  Therefore, on the basis of the AHP-EW method, this study integrates the indexes NIPI, TCR, and THQ of the three heavy metals into comprehensive NIPI, comprehensive TCR, and comprehensive THQ, namely, the food safety risk assessment index. These indexes are then used as the basis of risk classification and the input vector of the subsequent prediction model.

Clustering Risk Classification
The K-means++ clustering algorithm, which exhibits low complexity, rapid computational capability, the ability to handle large data sets, and the flexibility to adjust the cluster number [44], is used to determine the risk level of a heavy metal hazard on the basis of assessment indexes. The specific process of K-means++ is as follows [45]: 1. (a) Take one center 1 μ as the initial cluster center, chosen uniformly from the samples.

Clustering Risk Classification
The K-means++ clustering algorithm, which exhibits low complexity, rapid computational capability, the ability to handle large data sets, and the flexibility to adjust the cluster number [44], is used to determine the risk level of a heavy metal hazard on the basis of assessment indexes. The specific process of K-means++ is as follows [45]: 1.
(a) Take one center µ 1 as the initial cluster center, chosen uniformly from the samples. 1.
(b) For each sample x i , calculate d(x i ), i.e., the shortest distance between sample x i to the closest center which has already been selected. 1.
(c) Choose one of the samples as the new cluster center µ 1 according to the weighted probability: 1.
(d) Repeat steps 1 (b) and 1 (c) until cluster centers n have been chosen.

4.
Repeat step 2 and step 3 until the convergence has been reached.
Alternatively, by calculating the silhouette coefficient, which takes both cohesion and separation into account, one can determine the best cluster number [46]. The silhouette coefficient can be expressed as follows [47]: where, a i is the average distance of the jth sample to all other samples in the same cluster; b i is the average distance of the jth sample to all other provinces in different clusters. Then, the optimal cluster number can be obtained by calculating the average silhouette coefficient of all the samples. For one clustering with k categories, the average silhouette coefficient refers to the average of silhouette coefficients of samples belonging to the cluster and given as follows: where, n is the total number of samples in the data set. Besides, a higher value represents better clustering quality. Thus, the optimal clustering results are obtained. With regard to the risk threshold, the risk level based on a clustering algorithm is calculated as follows: where d k , k = 1, 2, · · · , n represents the distance between the jth sample and the center of the kth class of the clusters; d level denotes the minimum value of the distance between that jth sample and each cluster center. If the d level is equal to d k , then the jth sample is labeled as the kth level. The clustering centers are obtained in a data-driven manner on the basis of the similarity between the data, and risk classification is conducted based on the distance of the samples from each clustering center, reducing the subjectivity of the classification. After risk assessment and classification are performed via a data-driven approach, risk prediction models need to be established to identify the hazards of heavy metals in grain processing products at an early stage and consequently address them before they become real risks.

Multi-Step Prediction
After risk assessment and classification are performed via a data-driven approach, we select the multi-step prediction method to forecast the risk level based on the past data.
i , · · · , x n i ∈ R p×n denoting that the instance x i has the length n, dimension p, and d labels as   That is, in the proposed prediction method, the model input is given by the values of the previous n days, and the output is the predicted value of the subsequent day. By constantly adding the value of the prediction day, the value of the next n days can be predicted. However, if the predicted number of days exceeds n days (n-step), the model input no longer contains the true value. Thus, the sequence length used to evaluate the model in this study is the same as the selected step size.  That is, in the proposed prediction method, the model input is given by the values of the previous n days, and the output is the predicted value of the subsequent day. By constantly adding the value of the prediction day, the value of the next n days can be predicted. However, if the predicted number of days exceeds n days (n-step), the model input no longer contains the true value. Thus, the sequence length used to evaluate the model in this study is the same as the selected step size.

Deep Neural Network Model
RNNs have gained popularity in deep learning because of their ability to overcome the limitation of shallow neural network architectures in learning long sequences [48]. LSTM models and their variant GRU based on RNN have been built for a time series prediction. Owing to its distinct gate structure, the LSTM neural network is highly suitable for processing time series data [49]. A simplified version of LSTM, the GRU combines the forget and input gates to form an "update" gate. Thus, it has fewer parameters but less complexity, compared with LSTM. The internal structures of the RNN, LSTM, and GRU neuron modules are shown in Figure 5. That is, in the proposed prediction method, the model input is given by the values of the previous n days, and the output is the predicted value of the subsequent day. By constantly adding the value of the prediction day, the value of the next n days can be predicted. However, if the predicted number of days exceeds n days (n-step), the model input no longer contains the true value. Thus, the sequence length used to evaluate the model in this study is the same as the selected step size.

Deep Neural Network Model
RNNs have gained popularity in deep learning because of their ability to overcome the limitation of shallow neural network architectures in learning long sequences [48]. LSTM models and their variant GRU based on RNN have been built for a time series prediction. Owing to its distinct gate structure, the LSTM neural network is highly suitable for processing time series data [49]. A simplified version of LSTM, the GRU combines the forget and input gates to form an "update" gate. Thus, it has fewer parameters but less complexity, compared with LSTM. The internal structures of the RNN, LSTM, and GRU neuron modules are shown in Figure 5.

Voting-Ensemble Method
The voting-ensemble method is primarily aimed at overcoming the limitations of various algorithms [29]. For instance, LSTM handles information at different distances from time points in various patterns. By contrast, RNNs treat each time point equally, and the GRU is located between the points; the information at certain distances from certain time points that deserves emphasis is unknown. Thus, the voting-ensemble method is applied for integrated prediction. The integrated workflow of the voting-ensemble method is shown in Figure 6.
The voting-ensemble method is used to separately integrate multiple sub-models; the obtained sub-models are arranged and combined by voting integration, which combines the final prediction results of sub-models. On the basis of this technique, the sub-model prediction results are statistically compared and analyzed, and the model with the highest prediction accuracy and overall balance is selected.
The voting-ensemble method is primarily aimed at overcoming the limitations of various algorithms [29]. For instance, LSTM handles information at different distances from time points in various patterns. By contrast, RNNs treat each time point equally, and the GRU is located between the points; the information at certain distances from certain time points that deserves emphasis is unknown. Thus, the voting-ensemble method is applied for integrated prediction. The integrated workflow of the voting-ensemble method is shown in Figure 6. The voting-ensemble method is used to separately integrate multiple sub-models; the obtained sub-models are arranged and combined by voting integration, which combines the final prediction results of sub-models. On the basis of this technique, the sub-model prediction results are statistically compared and analyzed, and the model with the highest prediction accuracy and overall balance is selected.
We use i ζ to represent the sub-model i in the voting-ensemble algorithm; K is the K-means++ algorithm; i w denotes the weight assigned to each sub-model, and the sum of the weights of all sub-models should be 1; X denotes the assessment indexes.
Thus, the output t h of the voting-ensemble method under certain days t is given by Thus, through the K-means++ algorithm, the final risk level output is determined as This study uses the following function to indicate whether the model correctly predicts the risk level: Figure 6. Schematic of the voting-ensemble method.
We use ζ i to represent the sub-model i in the voting-ensemble algorithm; K is the K-means++ algorithm; w i denotes the weight assigned to each sub-model, and the sum of the weights of all sub-models should be 1; X denotes the assessment indexes. Thus, the output h t of the voting-ensemble method under certain days t is given by Thus, through the K-means++ algorithm, the final risk level output is determined as K(h t ). This study uses the following function to indicate whether the model correctly predicts the risk level: where K(y t ) is the real risk level. Therefore, the final output model meets the following requirements: That is, the optimal prediction model is given. Finally, with GRU, LSTM, and RNN as sub-models, the overall architecture of the proposed voting ensemble method integrated with deep learning models is used to calculate NIPI, as shown in Figure 7.
With the assessment index time series as the input of the voting-ensemble method, the predicted risk assessment index as the output is obtained. This output, combined with the clustering algorithm, can result in a risk level prediction.
That is, the optimal prediction model is given. Finally, with GRU, LSTM, and RNN as sub-models, the overall architecture of the proposed voting ensemble method integrated with deep learning models is used to calculate NIPI, as shown in Figure 7. With the assessment index time series as the input of the voting-ensemble method, the predicted risk assessment index as the output is obtained. This output, combined with the clustering algorithm, can result in a risk level prediction.

Prediction Performance Evaluation Index
To evaluate the prediction efficiency of the proposed multi-step food safety risk assessment index prediction method, we use the root mean square error (RMSE) and the mean absolute error (MAE). These two indicators are calculated as follows: where i x represents the actual value of the assessment indexes at day i , and ˆi x is the predicted value.

Prediction Performance Evaluation Index
To evaluate the prediction efficiency of the proposed multi-step food safety risk assessment index prediction method, we use the root mean square error (RMSE) and the mean absolute error (MAE). These two indicators are calculated as follows: where x i represents the actual value of the assessment indexes at day i, andx i is the predicted value. However, the prediction of the final risk level is influenced by the combination of the three assessment indexes; thus, the performance of the single assessment index, as well as the accuracy of the final prediction risk level determined by the three indexes, needs to be assessed.

Prediction Accuracy Evaluation Index
We use the correct rate of predicting the risk level to measure the accuracy of the model, thereby predicting the risk level. When the food safety risk level as the model output is the same as the actual food safety risk level, the food safety risk level is recorded as 1; otherwise, it is 0. The level of predictive accuracy is thus calculated as follows: where t denotes the number of days predicted.

Comprehensive Assessment Indexes
To comprehensively evaluate the heavy metal hazard in grain processing products, we first calculate the assessment indexes and used the AHP-EW method to reduce the dimensionality of the data. The EW weights of NIPI, TCR, and THQ in the AHP-EW method for three heavy metals are listed in Table 3. Table 3. Entropy weights of NIPI, TCR, and THQ for chromium, cadmium, and arsenic.

Heavy Metal Chromium Cadmium Arsenic
Assessment Index  NIPI  TCR  THQ  NIPI  TCR  THQ  NIPI  TCR  THQ

Comprehensive Assessment Indexes
To comprehensively evaluate the heavy metal hazard in grain processing products, we first calculate the assessment indexes and used the AHP-EW method to reduce the dimensionality of the data. The EW weights of NIPI, TCR, and THQ in the AHP-EW method for three heavy metals are listed in Table 3. Table 3. Entropy weights of NIPI, TCR, and THQ for chromium, cadmium, and arsenic. Assessment Index NIPI TCR THQ NIPI TCR THQ NIPI TCR THQ  The final sets of comprehensive heavy metal indexes from March to December 2020 are shown in Figure 8.   Figure 8 presents the assessment indexes exhibiting similar high-dimensional attributes, complexity, discreteness, and nonlinearity, compared with the inspection data. Therefore, the deep learning method is more suitable for assessment index prediction.

Determination of the Risk Level
After the assessment indexes are determined, the characteristics of food safety data become a complex nonlinear time series, including abnormal data. Therefore, data normalization is necessary [14]. In the current study, NIPI, TCR, and THQ are selected as features based on K-means++ clustering in a data-driven manner. Table 4 lists the scores of clustering categories from 3 to 7 through the silhouette coefficient. As listed in Table 4, the silhouette coefficient of Category 4 is the largest, indicating a maximum improvement in the clustering effect, which also allows risk management to perform targeted risk supervision and control. Therefore, the normalized dataset is divided into 4 categories by using the K-means++ algorithm, and the results of corresponding risk factors (i.e., NIPI, TCR, and THQ) values of each cluster center are listed in Table 5. Additionally, the risk level was determined based on the Euclidean distance between each cluster center and the origin, with a longer distance indicating a higher integrated risk. Table 5. Clustering center of the assessment indexes (normalized); based on the Euclidean distance between each cluster center and the origin, the cluster centers are marked as different risk levels. K-means++ clustering results for the risk level are shown in Figure 9.
perform targeted risk supervision and control. Therefore, the normalized dataset is divided into 4 categories by using the K-means++ algorithm, and the results of corresponding risk factors (i.e., NIPI, TCR, and THQ) values of each cluster center are listed in Table  5. Additionally, the risk level was determined based on the Euclidean distance between each cluster center and the origin, with a longer distance indicating a higher integrated risk.  Figure 9.  Subsequently, by using the K-means++ clustering algorithm to select NIPI, TCR, and THQ as the risk characteristics, this study can determine the risk level for each day and the cluster center of each risk level. In the following text, future food safety risk assessment indexes will be classified into specific risk levels based on clustering centers.

Analysis of Heavy Metal Hazard
The risk values of the detection samples from March 2020 to December 2020 are analyzed to illustrate the advantage of risk classification based on clustering. The distribution of the risk level and clustering center is presented in Figure 10.
As shown in Figure 10, the low-and medium-risk levels comprise 87.91% of the total, and the second-highest and high-risk levels comprise 12.09% for 2020. The TCR is higher for the second-highest risk level, and the THQ is higher for the high-risk level, indicating the carcinogenic and non-carcinogenic risks in heavy metals, respectively. Moreover, in the second-highest risk and high-risk levels, three assessment indexes are higher than those of the low-and medium-risk levels. Therefore, we use the second-highest and high-risk levels as early-warning thresholds.
Considering the weekly report requirement, we perform a risk assessment of the detection results from September 9 to October 6 ( Figures 11 and 12) to illustrate the clustering process, combined with the assessment indexes method, to determine the risk level.
indexes will be classified into specific risk levels based on clustering centers.

Analysis of Heavy Metal Hazard
The risk values of the detection samples from March 2020 to December 2020 are analyzed to illustrate the advantage of risk classification based on clustering. The distribution of the risk level and clustering center is presented in Figure 10. As shown in Figure 10, the low-and medium-risk levels comprise 87.91% of the total, and the second-highest and high-risk levels comprise 12.09% for 2020. The TCR is higher for the second-highest risk level, and the THQ is higher for the high-risk level, indicating the carcinogenic and non-carcinogenic risks in heavy metals, respectively. Moreover, in the second-highest risk and high-risk levels, three assessment indexes are higher than those of the low-and medium-risk levels. Therefore, we use the second-highest and highrisk levels as early-warning thresholds.
Considering the weekly report requirement, we perform a risk assessment of the detection results from September 9 to October 6 ( Figures 11 and 12) to illustrate the clustering process, combined with the assessment indexes method, to determine the risk level.   Figure 11. Threshold of risk classification determined by the distance to the clustering center. Figure 11. Threshold of risk classification determined by the distance to the clustering center.
This study proposes a dynamic threshold classification method for determining the objective risk level for each day by calculating and comparing the distances (similarity) of the assessment indexes between each day and each clustering center and then selecting the class with the smallest distance as the risk classification result (Figure 11). For instance, the risk level on September 26 is assessed as high-risk because the distance is shorter to the high-risk clustering center than to other centers. We can also identify to a certain extent the causes of different risk levels through assessment indexes ( Figure 12).
As shown in Figure 12, most samples are low-and medium-risk. However, the risk level corresponding to 3 October is the second-highest, which is mainly attributable to the high TCR. The highest risk level recorded, which corresponds to 17 September, is mainly attributed to the high TCR and THQ, with TCR contributing more. The high-risk classification for 26 September is caused by THQ exceeding the mean high risk. Therefore, targeted policies in risk management can be implemented to tackle different situations. The establishment of the risk classification model can identify the heavy metal hazards and interpret the factors underlying the risks. To resolve these hazards before they develop into real risks, we establish a risk level prediction model.  This study proposes a dynamic threshold classification method for determining the objective risk level for each day by calculating and comparing the distances (similarity) of the assessment indexes between each day and each clustering center and then selecting the class with the smallest distance as the risk classification result (Figure 11). For instance, the risk level on September 26 is assessed as high-risk because the distance is shorter to the high-risk clustering center than to other centers. We can also identify to a certain extent the causes of different risk levels through assessment indexes ( Figure 12).
As shown in Figure 12, most samples are low-and medium-risk. However, the risk level corresponding to October 3 is the second-highest, which is mainly attributable to the high TCR. The highest risk level recorded, which corresponds to September 17, is mainly attributed to the high TCR and THQ, with TCR contributing more. The high-risk classification for September 26 is caused by THQ exceeding the mean high risk. Therefore, Figure 12. Risk levels and assessment indexes for certain days.

Dataset Division and Implementation Environments
To evaluate the performance and generalizing capability of the proposed method, we select three datasets from different time periods, and each dataset is divided into training and test sets. In Dataset1 and Dataset2, we select the same dataset split ratio with different test sets to verify the generalizing capability of the model on different datasets; in Dataset2 and Dataset3, we used the same test set with different training sets and test set ratios to verify the effect of the data split ratio on the model ( Figure 13). Their generalizing capabilities and performance in three datasets are ultimately measured. To resolve these hazards before they develop into real risks, we establish a risk level prediction model.

Dataset Division and Implementation Environments
To evaluate the performance and generalizing capability of the proposed method, we select three datasets from different time periods, and each dataset is divided into training and test sets. In Dataset1 and Dataset2, we select the same dataset split ratio with different test sets to verify the generalizing capability of the model on different datasets; in Dataset2 and Dataset3, we used the same test set with different training sets and test set ratios to verify the effect of the data split ratio on the model ( Figure 13). Their generalizing capabilities and performance in three datasets are ultimately measured. In this paper, we deploy deep learning models like RNN, GRU, and LSTM with Tensorflow 2.0.0 using the Keras package. All the models were programmed by Python 3.6 and trained on a laptop computer (Intel i5-1035G1 CPU, without GPU used as the data accelerator).

Performance of the Sub-Model
Food regulatory agencies require weekly detection reports. Considering this requirement, together with the limitations of the multi-step time series prediction of long-term In this paper, we deploy deep learning models like RNN, GRU, and LSTM with Tensorflow 2.0.0 using the Keras package. All the models were programmed by Python 3.6 and trained on a laptop computer (Intel i5-1035G1 CPU, without GPU used as the data accelerator).

Performance of the Sub-Model
Food regulatory agencies require weekly detection reports. Considering this requirement, together with the limitations of the multi-step time series prediction of long-term error accumulation, this study chooses a time step of 21 to compare the sub-models, which include several existing typical machine learning or deep learning models, focusing primarily on the predictive efficiency of the 7-day model. The 14-and 21-day models are used for an auxiliary comparison via MAE and RMSE indicators. The predictive accuracy rates of the sub-models are also compared.
To improve the comparison of the prediction performances of the different models and select the appropriate sub-models, we visualize the RMSE and MAE of sub-models for the 7, 14, and 21-day prediction results on a heatmap ( Figure 14). As shown in Figure 14, as the number of prediction steps increases, the MAE and RMSE values of each model increase as well. The reason is that the larger the number of prediction steps, the more information is missing and the lower the prediction accuracy. Notably, the following instance is also observed: in the NIPI of Dataset2, the 7-day MAE of the RNN model is 0.1263, and the 14-day MAE is 0.0814. The reason is that in the multi- As shown in Figure 14, as the number of prediction steps increases, the MAE and RMSE values of each model increase as well. The reason is that the larger the number of prediction steps, the more information is missing and the lower the prediction accuracy. Notably, the following instance is also observed: in the NIPI of Dataset2, the 7-day MAE of the RNN model is 0.1263, and the 14-day MAE is 0.0814. The reason is that in the multi-step prediction, positive and negative errors are offset as errors accumulate, hence the decrease in RMSE and MAE values over time. Figure 14 also shows that RNN, GRU, and LSTM perform better than the other models, but the parts of the models only slightly vary. Therefore, we determine the final sub-model portfolio on the basis of the accuracy of the risk level prediction.

Comparison of Different Sub-Models
The correct accuracy rate of each sub-model in risk level prediction is shown in Figure 15.  In Figure 15, the effects of eXtreme Gradient Boosting (XGBoost) and BP models are worse than those of the other models, and the accuracy rates in the subsequent 7-, 14-, and 21-day periods are lower than the average; by contrast, the GRU model performs more efficiently in the subsequent 7-, 14-, and 21-day periods. The predictive efficiency of the RNN model is poor in the subsequent 7-day period; however, this characteristic improves in the 14-and 21-day periods, potentially reaching a relatively satisfactory level. When LSTM predicts the subsequent 14-and 21-day periods, the predictive efficiency is low, but when it predicts the subsequent 7-day period, the predictive efficiency is high, reaching 85.71%. Although the AHC-RBF model outperforms LSTM in predictive accuracy in the 14-day period, a large discrepancy in the predictive efficiency of LSTM is observed in the 7-day period, which is more important than the 14-and 21-day because those periods have the presence of cumulative errors and the requirement of a weekly report. Meanwhile, the proposed approach requires that the sub-models exhibit satisfactory and similar performances. Therefore, the combination of the RNN, GRU, and LSTM models is superior to other models, and these three models possess different accuracy attributes at different prediction steps. This study then integrates the three models into the voting-ensemble method discussed in the subsequent sub-section. In Figure 15, the effects of eXtreme Gradient Boosting (XGBoost) and BP models are worse than those of the other models, and the accuracy rates in the subsequent 7-, 14-, and 21-day periods are lower than the average; by contrast, the GRU model performs more efficiently in the subsequent 7-, 14-, and 21-day periods. The predictive efficiency of the RNN model is poor in the subsequent 7-day period; however, this characteristic improves in the 14-and 21-day periods, potentially reaching a relatively satisfactory level. When LSTM predicts the subsequent 14-and 21-day periods, the predictive efficiency is low, but when it predicts the subsequent 7-day period, the predictive efficiency is high, reaching 85.71%. Although the AHC-RBF model outperforms LSTM in predictive accuracy in the 14-day period, a large discrepancy in the predictive efficiency of LSTM is observed in the 7-day period, which is more important than the 14-and 21-day because those periods have the presence of cumulative errors and the requirement of a weekly report. Meanwhile, the proposed approach requires that the sub-models exhibit satisfactory and similar performances. Therefore, the combination of the RNN, GRU, and LSTM models is superior to other models, and these three models possess different accuracy attributes at different prediction steps. This study then integrates the three models into the votingensemble method discussed in the subsequent sub-section.

Performance of the Voting-Ensemble Model
To verify the efficiency of the proposed model, we compare sub-models and existing models by using the proposed voting-ensemble method on the same detection data. The prediction performance of the sub-models and the voting-ensemble method is summarized in Table 6. As shown in Table 6, in these three datasets, the proposed voting-ensemble method has the smallest RMSE and MAE values on almost every indicator, performing better than the sub-models. Similarly, on different datasets, the predictive efficiency of each model declines as the number of prediction steps increases.
In terms of the time required for the training process, as illustrated in Table 7, RNN and GRU are significantly faster; they are about two times faster than the LSTM, and about four times faster than the proposed voting-ensemble model due to its requirement of waiting until the end of the training of the sub-models. Thus, when the proposed voting-ensemble method is used to make an early-warning analysis of food safety, the training time is extended. The prediction and actual curves generated using the RNN, GRU, LSTM, and the proposed method in each dataset are presented in Figures 16-18. The prediction and actual curves generated using the RNN, GRU, LSTM, and the proposed method in each dataset are presented in Figures 16-18.    The prediction and actual curves generated using the RNN, GRU, LSTM, and the proposed method in each dataset are presented in Figures 16-18.    As shown in Figures 16-18, the voting-ensemble model has a higher degree of coincidence between the actual value curve and the predicted value curve on the test set than those of the sub-models. This result is similar to the outcome summarized in Table 6, indicating that the voting-ensemble model exhibits powerful performance and predictive capabilities.
In Figure 19, the proposed voting-ensemble model exhibits the highest accuracy. This finding, combined with the results in Figure 15, indicates that the accuracy rates of the voting-ensemble method for 7-, 14-, and 21-day periods exceed those of each sub-model. The average predictive accuracy of the three datasets in the 7-day period reaches 90.47%, which is higher than those of the sub-models and existing models. According to the results in Table 6, the RMSE and MAE of the proposed method are less than the values obtained using other methods. Thus, compared with other methods, the proposed method has a better predictive performance.  As shown in Figures 16-18, the voting-ensemble model has a higher degree of coincidence between the actual value curve and the predicted value curve on the test set than those of the sub-models. This result is similar to the outcome summarized in Table 6, indicating that the voting-ensemble model exhibits powerful performance and predictive capabilities.
In Figure 19, the proposed voting-ensemble model exhibits the highest accuracy. This finding, combined with the results in Figure 15, indicates that the accuracy rates of the voting-ensemble method for 7-, 14-, and 21-day periods exceed those of each sub-model. The average predictive accuracy of the three datasets in the 7-day period reaches 90.47%, which is higher than those of the sub-models and existing models. According to the results in Table 6, the RMSE and MAE of the proposed method are less than the values obtained using other methods. Thus, compared with other methods, the proposed method has a better predictive performance. As shown in Figures 16-18, the voting-ensemble model has a higher degree of coincidence between the actual value curve and the predicted value curve on the test set than those of the sub-models. This result is similar to the outcome summarized in Table 6, indicating that the voting-ensemble model exhibits powerful performance and predictive capabilities.
In Figure 19, the proposed voting-ensemble model exhibits the highest accuracy. This finding, combined with the results in Figure 15, indicates that the accuracy rates of the voting-ensemble method for 7-, 14-, and 21-day periods exceed those of each sub-model. The average predictive accuracy of the three datasets in the 7-day period reaches 90.47%, which is higher than those of the sub-models and existing models. According to the results in Table 6, the RMSE and MAE of the proposed method are less than the values obtained using other methods. Thus, compared with other methods, the proposed method has a better predictive performance.

Discussion
In this study, we established a novel time series multi-step prediction model for classifying and assessing the risk levels of heavy metal hazards in grain processing products. Food safety assessment indexes were introduced to explain the heavy metal hazards. The data-driven clustering algorithm reduced the subjectivity of threshold determination. We then introduced the deep learning method in early-warning systems in the food industry to implement a multi-step time series prediction and validate its efficiency by comparing it with existing models.

Risk Assessment and Classification
Recent studies have focused on conducting a risk assessment of food contaminants, in addition to a food safety risk evaluation based on the calculation results for the detection samples via AHP-EW, in fields implementing early-warning systems. The traditional approach is based on risk values for establishing early-warning models, which lack the systematic measurement of food contaminant hazards [14]. Alternatively, we established a risk assessment model by using the NIPI, TCR, and THQ to satisfy the comprehensive evaluation requirements of risk management to a certain extent. Therefore, this study realized a systematic dietary analysis of food safety risks by introducing assessment indexes ( Figure 8). Meanwhile, to reduce regulatory costs, risk levels need to be assessed and different risk levels have to be prioritized differently. However, risk level classification by setting absolute thresholds is subjective [19].
Therefore, regarding the risk classification and analysis of heavy metal hazards, an assessment index-based risk classification by cluster analysis uses silhouette coefficients to determine optimal and risk level classification (Table 4) and obtain clustering results (Figures 9 and 10). With this approach, we can determine the relative threshold for comprehensive indexes in a data-driven manner and objectively analyze heavy metal hazards ( Figure 11). We can also identify the causes of each risk level and evaluate the effect of each index on the classification so that risk management can achieve a retrospective analysis of food safety risks and develop targeted strategies ( Figure 12).
Compared with existing risk assessment and classification methods enabled by earlywarning models, the proposed risk level framework in this study provides an interpretable risk assessment, in addition to data-driven and objective risk classification based on a dietary exposure assessment and K-means++ clustering algorithms. It can be used by risk management departments in assessing the comprehensive relative risk of heavy metal hazards and determining risk levels. With this tool, measures and policies may be implemented to address and retrace the factors that contribute to different risk levels for efficient food safety risk management.

Multi-Step Time Series Prediction of Risk Levels
In fields using food safety early-warning systems, most studies use single-step risk prediction. In the current study, we employed multi-step prediction, as opposed to singlestep prediction, for the assessment index time series, which can predict data that have not occurred. However, a multi-step or long-term prediction is difficult and challenging due to the lack of information and uncertainty [50] or error accumulation ( Figure 14, Table 6). Therefore, models with a satisfactory performance need to be developed to improve the accuracy of a multi-step prediction.
Research on time series predictions in the food early-warning field using the machine learning method or ANNs because of the nonlinear characteristics (Figures 2 and 8) of food safety time series in practice is required [12]. Compared with traditional ANNs, deep learning methods such as RNN, GRU, and LSTM, can capture long-term time series data [33]; this finding is consistent with the performance of the sub-models ( Figure 14). To further improve the accuracy of a multi-step prediction, we proposed the voting-ensemble method to integrate the advantages of different models and select the sub-models with satisfactory performances (Figure 15) to establish the voting-ensemble based deep learning method [29]. The final result suggests that the accuracy of the proposed method reaches 90.47% in 7 days, which meets the weekly report requirement set by the risk management department.

Limitations
The voting-ensemble method was selected in this study to integrate various deep learning models and thereby achieve an improved accuracy rate; however, the training time is extended (Table 7). Despite the high time complexity of the proposed model, it is not unacceptable, as computing power continues to increase. Meanwhile, the outcomes of data grading are acquired in a data-driven way; the greater the amount of data collected, the higher the accuracy of risk grading. Therefore, we will continue to track food safety data to obtain risk classification results with increased accuracy.

Conclusions
To establish an early-warning model that can systematically assess the risk of heavy metals in grain processing products, this study proposed a novel multi-step time series prediction model based on a deep learning method. By adopting a voting-ensemble method, this study increased the accuracy of the prediction model. The final results also indicate that the proposed model achieves an accuracy of 90.47%, which meets the weekly food sampling report requirement for risk management. Meanwhile, risk classification based on system assessment allows food regulatory authorities to objectively prioritize and identify the causes of risk, thus enhancing the early control of food safety risks and reducing the costs of risk management. Moreover, an early-warning system based on deep learning models in a multi-step time series prediction, instead of the existing single-step or fitting prediction machine learning model, can more efficiently capture the dynamic operation pattern of a food safety time series. It can also further enable operators to detect food safety risks promptly, as well as improve early-warning systems for food safety, allowing for a continuous and interactive process to address future problems [51]. The food safety supervision departments can strengthen the supervision of heavy metal hazards based on the proposed early-warning model.
In future research, we intend to track food safety data to obtain risk classification results with increased accuracy and attempt different voting approaches to achieve enhanced multi-step prediction accuracy.