A Gas Concentration Prediction Method Driven by a Spark Streaming Framework

Abstract: In the traditional coal-mine gas-concentration prediction process, problems such as low timeliness of data and low efficiency of the prediction model in learning data features result in low accuracy of the final prediction. To solve these problems, a gas-concentration prediction method driven by the Spark Streaming framework is proposed. In this research study, the Spark Streaming framework, autoregressive integrated moving average (ARIMA) model and support vector machine (SVM) model are used to construct a new prediction model called the SPARS model. The Spark Streaming framework is used to process large batches of real-time streaming data in a short period of time, and the model can be used to intermittently update and optimize the prediction model so that the model can fully learn the characteristics of the data. At the same time, the advantages of the ARIMA model and SVM model for processing linear data and nonlinear data are combined to improve the model's prediction efficiency and fully reflect the timeliness of gas prediction. Finally, the proposed prediction model is verified using gas data collected on site. The optimal learning time for the SPARS model in predicting this set of data is determined, and a comparative analysis of the prediction results obtained from the ARIMA, SVM and other models fully confirms that high-precision prediction results can be obtained using the SPARS model. The proposed model can be used to realize scientific and accurate real-time prediction and analyses of coal-mine gas concentrations and provides a new idea for realizing real-time and accurate gas prediction in coal mines.


Introduction
Coal has long been China's main and basic energy source. Experts predict that China's coal production capacity will reach 4 billion tons in 2030 and 3.4 billion tons in 2050. Therefore, the dominant position of coal in China's energy consumption structure will not change for a long time [1,2]. With the continuous depletion of shallow coal resources in China, mining operations have had to extend ever deeper underground, and the risk of gas disasters has increased significantly, which seriously affects the safety of coal mining. Gas accidents have long been among the deadliest accidents in coal mines. According to statistics, in the eight years from 2014 to 2021, there were 1947 accidents in China and a total of 3472 fatalities. Among them, there were 189 gas accidents, accounting for 10% of all accidents, but their death toll reached 989, accounting for 28% of all fatalities [3]. Therefore, accurate and efficient gas prediction and early warning are of great significance for the safe mining of coal.
In recent years, with the rapid development of information technology and automatic control, safety monitoring in coal mining has gradually become more intelligent. Traditional monitoring systems can no longer meet the needs of coal mine development, and considerable progress has been made in research on mine gas prediction and early warning [4,5]. Zhang et al. [6] constructed a gas outburst early warning system based on an analysis of the abnormal characteristics of gas emissions; the variance, peak difference and slope fluctuation of the gas emission data for the excavation face were determined by analysing gas emission amounts. The early warning level was determined based on these indicators, which ensured safe operation at the excavation face of the coal mine. Huang et al. [7] established a multifactor coupling relationship analysis model for the outliers discarded from coal mining face gas data during the daily gas-prediction process and established an early warning level for gas anomaly analysis by analysing the association rules for multidimensional gas outliers, which improved the effectiveness of coal-mine gas early warning systems. Liang et al. [8] constructed a bidirectional gated recurrent unit neural network gas concentration prediction model based on the adaptive estimated maximum (Adamax) optimization algorithm for predicting coal-mine gas concentrations. The results show that this algorithm has higher accuracy and a better prediction effect than other prediction algorithms. Zhang et al. [9] constructed a coal mine ventilation system (CVS) safety prediction and early warning system. This system has high prediction accuracy, can accurately reflect the rationality of the underground mining process and has good practical value. Xu et al. [10] constructed a gas concentration prediction algorithm based on the superposition model and determined the optimal position for the model by optimizing the parameters. Experimental simulations verified that the algorithm improved the prediction accuracy. Jia et al. [11] proposed a prediction model for coal-mine gas concentrations based on gated recurrent units (GRUs). This model can make full use of the time series characteristics of gas data to predict gas concentrations with high accuracy, which improves the validity and accuracy of gas prediction.
In the process of gas prediction, the training data need to be highly timely. If the data's timeliness is insufficient, the accuracy of the prediction results will be reduced, thus affecting safe coal mining [12-14]. The methods reviewed above still predict from data that lag behind the current state of the mine, and there is still room for improvement in the operational efficiency of the related algorithms. Ensuring both the timeliness of data extraction and the timeliness with which the algorithm learns the rules in the data is a research direction that needs urgent attention. In this paper, a gas prediction and early warning model is proposed based on the Spark Streaming framework. This model is capable of processing large batches of real-time streaming data in a short period of time. The autoregressive integrated moving average (ARIMA) model and support vector machine (SVM) model are combined to process linear and nonlinear data, respectively. While improving the efficiency of the forecast model, this model fully reflects the timeliness of the forecast data. This model lays a foundation for the real-time monitoring and prediction of gas in the intelligent construction of coal mines.

Spark Streaming Framework
Spark Streaming was developed on the basis of the Spark platform. It is a distributed stream-computing data-processing framework based on the discretized stream (DStream) model that can process massive data in batches in a short period of time and has the advantages of high fault tolerance, scalability, high throughput and low latency [15,16]. Spark Streaming splits the real-time streaming data according to a certain time interval, passes the pieces to the Spark Engine and finally obtains batches of results. The essence of Spark Streaming is to divide the collected data into DStreams and convert each DStream into resilient distributed datasets (RDDs), which are stored in memory and finally written to external storage. A DStream represents a continuous data stream and is part of the Spark Streaming framework. It consists of a continuous sequence of RDDs, and each RDD contains the data that arrived within a certain time interval. The Spark Streaming flowchart is shown in Figure 1, and the Spark Streaming architecture diagram is shown in Figure 2.
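The splitting of a stream into time-sliced batches can be illustrated without a Spark cluster. The following is a minimal plain-Python sketch, not Spark code: the function name `micro_batches`, the `(timestamp, value)` record layout and the sample readings are all assumptions made for illustration. Each returned batch plays the role of one RDD in a DStream.

```python
from collections import defaultdict


def micro_batches(records, batch_interval):
    """Group (timestamp, value) records into consecutive fixed-length batches.

    Batch k holds the values whose timestamps fall in
    [k * batch_interval, (k + 1) * batch_interval), mimicking how a
    DStream is cut into one RDD per batch interval.
    """
    batches = defaultdict(list)
    for timestamp, value in records:
        batches[int(timestamp // batch_interval)].append(value)
    # Return the batches in time order; intervals without data are skipped.
    return [batches[k] for k in sorted(batches)]


# Hypothetical readings: one gas value every 5 s for 30 s.
stream = [(t, round(0.10 + 0.01 * i, 2)) for i, t in enumerate(range(0, 30, 5))]
for batch in micro_batches(stream, batch_interval=10):
    print(batch)
```

With a 10 s batch interval and one reading every 5 s, each batch in this sketch holds two readings.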

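ARIMA-SVM Gas Prediction Model
The ARIMA(p,d,q) model is one of the methods used for time-series forecasting. Its main idea is to describe the internal structure of a series with a model built from the observations at past time points and to use that model to predict future values. A future value is expressed as a linear function of past values and past errors [17-19]. Assuming that $X = \{x_i, i = 1, 2, \ldots, N\}$ is a time series, ARIMA(p,d,q) can be described in the standard form as follows.
$$\hat{l}_t = \varphi_1 x_{t-1} + \varphi_2 x_{t-2} + \cdots + \varphi_p x_{t-p} + \varepsilon_t - \theta_1 \varepsilon_{t-1} - \theta_2 \varepsilon_{t-2} - \cdots - \theta_q \varepsilon_{t-q}$$
In this formula, p, d and q are nonnegative integers and represent the autoregressive order, the difference order and the moving-average order, respectively. $x_t$ represents the true value. $\hat{l}_t$ represents the predicted value of $x_t$. $\varepsilon_t$ represents the error value of the prediction. $\varphi$ and $\theta$ represent the parameter values to be estimated.
After the d-th order difference is applied to the series, the ARIMA model satisfies the following.
$$\Phi(B)\nabla^d x_t = \Theta(B)\varepsilon_t$$
Here, $B$ is the backshift operator, $\nabla^d = (1 - B)^d$ is the d-th order difference, $\Phi(B) = 1 - \varphi_1 B - \cdots - \varphi_p B^p$ is the autoregressive coefficient polynomial of the stationary invertible ARMA(p,q) model, and $\Theta(B) = 1 - \theta_1 B - \cdots - \theta_q B^q$ is the moving-average coefficient polynomial of the stationary invertible ARMA(p,q) model. The essence of ARIMA is the combination of the difference operation and the ARMA model, which has the properties of stationarity and homogeneity of variance [20,21].
SVM is a machine learning method based on statistical learning theory. Proposed by Vapnik in 1995, SVM was originally used to solve linearly separable problems and later extended to regression problems [22,23]. Assuming the training set $\{(x_i, y_i)\}_{i=1}^{l} \in (\mathcal{X} \times \mathcal{Y})^l$, where $x_i \in \mathcal{X} = \mathbb{R}^n$ is the input and $y_i \in \mathcal{Y} = \mathbb{R}$ is the output, the SVM model can be expressed in the standard regression form as follows.
$$\min_{w,b,\xi}\ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{l}(\xi_i + \xi_i^*)$$
$$\text{s.t.}\quad y_i - \langle w, \psi(x_i)\rangle - b \le \varepsilon + \xi_i,\quad \langle w, \psi(x_i)\rangle + b - y_i \le \varepsilon + \xi_i^*,\quad \xi_i \ge 0,\ \xi_i^* \ge 0,\ i = 1, \cdots, l$$
Here, $\psi$ is the feature map induced by the kernel function and $\varepsilon$ (without a subscript) is the width of the insensitive band.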

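Energies 2022, 15, 5335
Among these expressions, the dual problem is described in the standard form as follows.
$$\max_{\alpha,\alpha^*}\ -\frac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l}(\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*)K(x_i, x_j) - \varepsilon\sum_{i=1}^{l}(\alpha_i + \alpha_i^*) + \sum_{i=1}^{l} y_i(\alpha_i - \alpha_i^*)$$
$$\text{s.t.}\quad \sum_{i=1}^{l}(\alpha_i - \alpha_i^*) = 0,\quad 0 \le \alpha_i, \alpha_i^* \le C,\quad i = 1, \cdots, l$$
In the above formulas, $\xi_i$ and $\xi_i^*$ are the slack variables ($\xi_i \ge 0$, $\xi_i^* \ge 0$, $i = 1, \cdots, l$), $C$ is the penalty parameter, $\alpha_i$ and $\alpha_i^*$ are the Lagrange multipliers, $\varepsilon$ is the width of the insensitive band and $K(x_i, x_j)$ is the kernel function.
Then, the solution is given by the following.
$$f(x) = \sum_{i=1}^{l}(\alpha_i - \alpha_i^*)K(x_i, x) + b$$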
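In the standard support vector regression solution, the prediction is a kernel expansion over the training points plus a bias term. The following is a minimal sketch of evaluating such a solution once the coefficients are known. The choice of a Gaussian (RBF) kernel, the parameter `gamma`, the function names and all numeric values are assumptions for illustration; solving the optimization problem itself is not shown.

```python
import math


def rbf_kernel(u, v, gamma=1.0):
    """Gaussian (RBF) kernel K(u, v) = exp(-gamma * ||u - v||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def svr_predict(x, support_vectors, dual_coefs, bias, gamma=1.0):
    """Evaluate f(x) = sum_i coef_i * K(x_i, x) + b.

    Each entry of `dual_coefs` stands for the difference of the two
    Lagrange multipliers attached to one training point.
    """
    return bias + sum(
        coef * rbf_kernel(sv, x, gamma)
        for sv, coef in zip(support_vectors, dual_coefs)
    )


def eps_insensitive_loss(y_true, y_pred, eps):
    """Errors inside the band of half-width eps cost nothing."""
    return max(0.0, abs(y_true - y_pred) - eps)


# Hypothetical two-point model on one-dimensional inputs.
vectors = [(0.10,), (0.30,)]
coefs = [0.5, -0.2]
print(svr_predict((0.20,), vectors, coefs, bias=0.15, gamma=10.0))
```

Only training points with a nonzero coefficient contribute to the sum, which is why they are called support vectors.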
Both linear and nonlinear trends are observed in gas-concentration time-series data [24,25]. The ARIMA model has unique advantages when dealing with linear data and can fully capture the linear part of the time series, whereas SVM performs well when analysing and predicting nonlinear data [26,27]. Therefore, the ARIMA model is used to process the historical data for the one-dimensional gas time series and obtain the corresponding linear prediction results and residual series. Then, SVM is used to further analyse and predict the nonlinear factors affecting the gas time series that remain in the residual series. Finally, the analysis and prediction results for the two models are combined to obtain the final prediction result for the target gas time-series data. The principle of the gas concentration prediction framework is shown in Figure 3. The time series $Y = \{y_k, k = 1, 2, \cdots, N\}$ consists of two parts: a linear part and a nonlinear part, i.e., $y_k = l_k + nl_k$. First, the one-dimensional gas data are processed by the ARIMA model, and the linear prediction series $\hat{l}_k$ and the residual series $\delta_k = y_k - \hat{l}_k$ are obtained. Second, by further processing the residual series, the nonlinear prediction series $\hat{nl}_k$ is obtained. Finally, the linear and nonlinear results are combined to give the final time-series forecast value $\hat{y}_k = \hat{l}_k + \hat{nl}_k$.
The gas-concentration prediction model based on the combined ARIMA and SVM model can predict gas concentrations more accurately. However, its disadvantage is that the training dataset needs to be provided in advance. The historical data saved in the monitoring system can be used as the training dataset, and so can the data transmitted in real time. However, the real-time training data from the monitoring system are streaming data, and continuous change in the data stream requires the prediction model to be updated continuously. Streaming data are large, fast and consecutively arriving sequences of data [28,29]. Therefore, conventional model prediction leads to long modelling times, which in turn lead to poor timeliness of the forecast data and a low utilization value.
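The decomposition into a linear forecast, a residual series and a nonlinear correction can be sketched in a few lines. The code below is only a structural illustration and deliberately swaps in simpler stand-ins: a least-squares AR(1) fit replaces ARIMA, and an average of recent residuals (`recent_mean`) replaces the SVM step. All function names and the sample readings are hypothetical.

```python
def fit_ar1(series):
    """Least-squares fit of x_t ~ a + b * x_(t-1); a stand-in for ARIMA."""
    prev, curr = series[:-1], series[1:]
    n = len(prev)
    mean_prev, mean_curr = sum(prev) / n, sum(curr) / n
    var_prev = sum((x - mean_prev) ** 2 for x in prev)
    cov = sum((x - mean_prev) * (y - mean_curr) for x, y in zip(prev, curr))
    b = cov / var_prev if var_prev else 0.0
    a = mean_curr - b * mean_prev
    return a, b


def recent_mean(residuals, k=3):
    """Placeholder for the SVM step: average of the last k residuals."""
    tail = residuals[-k:]
    return sum(tail) / len(tail)


def hybrid_forecast(series, residual_model=recent_mean):
    """One-step forecast: linear prediction plus residual correction."""
    a, b = fit_ar1(series)
    # Linear in-sample predictions and the residual series delta_k.
    linear_fit = [a + b * x for x in series[:-1]]
    residuals = [y - l for y, l in zip(series[1:], linear_fit)]
    linear_next = a + b * series[-1]             # linear part of the forecast
    nonlinear_next = residual_model(residuals)   # nonlinear part of the forecast
    return linear_next + nonlinear_next


gas = [0.10, 0.12, 0.11, 0.15, 0.14, 0.18, 0.17]  # hypothetical readings, % CH4
print(round(hybrid_forecast(gas), 4))
```

Passing a different `residual_model` callable is where a trained nonlinear regressor would plug in; the surrounding structure stays the same.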

SPARS Model
In this section, a parallel prediction method combining Spark Streaming with the ARIMA-SVM combination algorithm is proposed and is named the SPARS algorithm. It can be used to quickly analyse and predict streaming data. At the same time, it combines the ARIMA model and SVM model to process linear and nonlinear data, respectively, which improves both the timeliness and the accuracy of traditional model prediction.
To perform predictive modelling for the gas concentration data stream, a distributed stream processing framework called Spark Streaming is used to build a real-time gas-concentration prediction system based on the ARIMA-SVM model. The real-time data generated by the gas monitoring source are sent to Spark Streaming through the stream generator. In the sliding window calculation provided by Spark Streaming, the time window is used to divide the original DStream into data RDDs with specified time slices. DStream, which is a part of Spark Streaming, can perform stream data processing and batch processing at the same time, meeting the requirements of various processing types such as dataset extraction, machine learning model training and model application. The RDD defined by the window length is the basic unit of the prediction model, and the window length can be determined according to the rate of the streaming data and the modelling complexity. The structure of the real-time gas-concentration prediction system based on Spark Streaming is shown in Figure 4. The Hadoop distributed file system (HDFS) is used for data storage. Alternatively, the data can be sent directly to Spark Streaming [30-32]. Coal-mine gas and related sensors transmit real-time monitoring data to the system through the network, and data streams can be written to distributed storage and combined with Spark MLlib to build predictive models, which can be dynamically updated with the data streams. Finally, the real-time prediction of gas concentration is carried out by using the constructed ARIMA-SVM prediction algorithm.
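The sliding window calculation can be pictured as regrouping consecutive batches. The sketch below is plain Python rather than Spark code, and the function name and sample data are assumptions. It expresses the window length and slide interval as counts of batches, in the same spirit as Spark Streaming's requirement that both be multiples of the batch interval.

```python
def sliding_windows(batches, window_length, slide_interval):
    """Regroup batches into overlapping windows.

    A window is emitted after every `slide_interval` batches and
    contains the values of the most recent `window_length` batches
    (fewer at the start of the stream).
    """
    windows = []
    for end in range(slide_interval, len(batches) + 1, slide_interval):
        start = max(0, end - window_length)
        windows.append([v for batch in batches[start:end] for v in batch])
    return windows


# Hypothetical stream of six single-value batches.
batches = [[1], [2], [3], [4], [5], [6]]
for window in sliding_windows(batches, window_length=3, slide_interval=2):
    print(window)
```

A short slide interval with a long window means consecutive training sets overlap heavily, so each model update sees mostly familiar data plus a few new readings.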

Data Sources
Gas data were collected from the 802 working face of a mine in Shaanxi for experimental analysis. A KG9001C sensor was used to collect the gas concentration, and its measurement range was (0-100)% CH4. The sensor measurement error depended on the concentration range: ≤0.1% CH4 for (0-1)% CH4, ≤0.2% CH4 for (1-2)% CH4, ≤0.3% CH4 for (2-4)% CH4, ±1% CH4 for (4-10)% CH4 and ±10% CH4 for (10-100)% CH4. Gas data were sampled at a rate of 0.2/s, i.e., one reading every 5 s. A total of 2880 gas readings were collected over a 4 h measurement sequence. Some of the original data are provided in Table 1. The original gas sequence is shown in Figure 5. In the collected data, the maximum value is 0.82%, the minimum value is 0.02%, the mean value is 0.17% and the standard deviation is 0.14.
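The sample count follows directly from the sampling rate and the recording time, and summary statistics of the kind quoted above can be computed as in the following sketch. The function name `summarize` and the sample readings are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which estimator was used.

```python
import statistics

SAMPLING_RATE_HZ = 0.2   # readings per second, from the text
DURATION_H = 4           # hours of recording, from the text

# 0.2 readings/s * 4 h * 3600 s/h
n_samples = round(SAMPLING_RATE_HZ * DURATION_H * 3600)
print(n_samples)


def summarize(values):
    """Maximum, minimum, mean and population standard deviation."""
    return {
        "max": max(values),
        "min": min(values),
        "mean": statistics.fmean(values),
        "std": statistics.pstdev(values),
    }


readings = [0.02, 0.10, 0.17, 0.25, 0.82]  # hypothetical values, % CH4
print(summarize(readings))
```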

Prediction of the Gas Concentration by the SPARS Model
The dataset collected for the first 3 h is used as the training dataset for the estimation of the model parameters. The dataset for the final 1 h is used as the test set to test the fitness of the prediction model. To simulate the dynamic gas data flow monitored by the coal-mine gas sensor, the training set's gas concentration sequence data are sent to Spark Streaming at a rate of 0.2/s through a transmission control protocol (TCP) socket. Spark Streaming uses the built-in data stream source socketTextStream to receive the gas concentration stream data as the input source. When predicting the gas concentration, the length and sliding distance of the Spark Streaming stream window can be specified according to the requirements of the actual application, which determine the training data's length and the update cycle for the stream regression model. To verify the real-time performance and accuracy of the flow regression, the gas concentration data are predicted and analysed by setting different model update cycles. In the experiment, the length of the flow regression window is 1 min, and the model update period is set to 10 s, 20 s, 30 s, 40 s, 50 s and 60 s for the real-time prediction of the gas concentration. The experiment was carried out six times to simulate the prediction accuracy for gas concentration data under different model update cycles. The prediction results are shown in Figures 6-11. In these figures, real represents the real value of the gas concentration, and predict represents the predicted value. Ranked from closest to farthest, the predicted mean approaches the true mean in the order of update times 60 s, 50 s, 40 s, 30 s, 20 s and 10 s. Figure 12 shows a comparison of the fit between the actual value and the predicted value. It can be observed from the figure that when the update time is 50 s or 60 s, the fitting degrees of the predicted value and the real value are higher. In general, the longer the model update time, the larger the training set data flow, the more fully the model is trained, and the higher the prediction accuracy. In contrast, the shorter the model update time, the smaller the training set data flow and the less sufficient the model training process, resulting in lower prediction accuracy.
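The experimental settings imply some simple bookkeeping that helps when interpreting the update cycles. The following sketch derives the split sizes, the number of readings in one regression window, and the number of model updates and newly arrived readings per update cycle. The variable names are hypothetical; the inputs (5 s sampling period, 3 h and 1 h split, 1 min window, six update periods) are taken from the text.

```python
SAMPLE_PERIOD_S = 5                    # one reading every 5 s (0.2 readings per second)
TRAIN_S, TEST_S = 3 * 3600, 1 * 3600   # 3 h training span, 1 h test span
WINDOW_S = 60                          # 1 min regression window

train_samples = TRAIN_S // SAMPLE_PERIOD_S
test_samples = TEST_S // SAMPLE_PERIOD_S
window_samples = WINDOW_S // SAMPLE_PERIOD_S

update_periods = (10, 20, 30, 40, 50, 60)  # seconds
# Number of model updates during the 1 h test span.
updates_per_test_hour = {p: TEST_S // p for p in update_periods}
# Number of new readings that arrive between two updates.
new_samples_per_update = {p: p // SAMPLE_PERIOD_S for p in update_periods}

print(train_samples, test_samples, window_samples)
for p in update_periods:
    print(p, updates_per_test_hour[p], new_samples_per_update[p])
```

Under these settings a 10 s update cycle brings in only two new readings per update, while a 60 s cycle replaces the whole 12-reading window.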

Discussion
From the above results, it can be observed that as the model update period lengthens, the model's ability to predict gas is gradually enhanced, the fit between the predicted value and the actual value improves, and the predicted value becomes more accurate. However, if the update time of the model becomes infinitely long, it cannot meet the requirements of real-time processing of data streams. Based on the principle of real-time data-stream processing, the prediction accuracy should be relatively high when the model's update time is relatively short [33,34]. Therefore, the above results need to be further discussed. To further determine the relationship between the model update period and the prediction results, the root mean square error (RMSE) is used to evaluate the model [35,36]. The RMSE is the square root of the mean of the squared errors, and its formula is expressed as follows.
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$$
Here, $y_i$ is the true value, $\hat{y}_i$ is the predicted value and $n$ is the number of samples.
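The RMSE defined in words above, the square root of the mean of the squared errors, can be computed as in this short sketch. The function name and the sample values are illustrative only.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


real = [0.10, 0.12, 0.15, 0.14]      # hypothetical true values, % CH4
predict = [0.11, 0.12, 0.13, 0.15]   # hypothetical predictions, % CH4
print(rmse(real, predict))
```

Because the errors are squared before averaging, a few large misses raise the RMSE more than many small ones.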
In the above prediction process for the test set, when the model update time is 60 s, the model is updated 60 times in total. By analogy, when the model update time is 50 s, 40 s, 30 s, 20 s and 10 s, the model is updated 72 times, 90 times, 120 times, 180 times and 360 times, respectively. The individual RMSE results for each model update are shown in Table 2, and the changes in the RMSE for the model under different update cycles are shown in Figure 13. The lower abscissa in this figure represents the number of model updates. The ordinate is divided into six groups, all of which represent the RMSE value, and the upper abscissa represents the update time for the six groups of models. The lower abscissa and ordinate give the change trend of the model RMSE value under different update periods, and the upper abscissa and ordinate give the overall change in the RMSE value for each model during the 1 h prediction process. It can be observed from the figure that the RMSE reaches its minimum value of 0.01211 when the model's update time is 50 s and its maximum value of 0.02617 when the update time is 10 s. It can be concluded that, over the whole prediction process, the prediction accuracy is highest when the model update time is 50 s.
To further explore the superiority of SPARS model prediction, the above data are also predicted by the ARIMA model, the SVM model and the ARIMA-SVM-combined model. The first 3 h of the gas data are set as the training set, and the final 1 h of the gas data are set as the test set. The prediction results are compared with the real value data series and the prediction results from the SPARS model, and the results are shown in Figure 14. It can be observed from this figure that the prediction effect of the ARIMA model and the SVM model is relatively poor. The ARIMA model fits the linear part of the data well but predicts the nonlinear part poorly, whereas the SVM model fits the nonlinear part well but predicts the linear part relatively poorly. The combined ARIMA-SVM model shows relatively good prediction results. In this model, the advantages of the above two models in linear and nonlinear data prediction are integrated; thus, a good prediction trend is achieved. However, the prediction trend for the SPARS model is the closest to the real value sequence because the model is updated every 50 s according to the time series' characteristics. Compared with the non-real-time prediction models, the SPARS model can more accurately capture the time series characteristics of each time period to achieve accurate prediction. The prediction accuracy of the above four prediction models is quantitatively analysed to obtain the maximum error, minimum error and average error of the four prediction models, as shown in Table 3. The error value for the SPARS model is the smallest; thus, it has the highest prediction accuracy.
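A comparison by maximum, minimum and average error can be tabulated as in the sketch below. It assumes that the errors are absolute differences between true and predicted values, which the text does not state explicitly. The function names, the model labels and all numbers are hypothetical and are not the values in Table 3.

```python
def error_summary(actual, predicted):
    """Maximum, minimum and mean absolute error of one model."""
    errors = [abs(a - p) for a, p in zip(actual, predicted)]
    return {"max": max(errors), "min": min(errors), "mean": sum(errors) / len(errors)}


def best_model(actual, predictions_by_model):
    """Return the name of the model with the smallest mean error, plus all summaries."""
    summaries = {
        name: error_summary(actual, predicted)
        for name, predicted in predictions_by_model.items()
    }
    best = min(summaries, key=lambda name: summaries[name]["mean"])
    return best, summaries


real = [0.10, 0.12, 0.15, 0.14]  # hypothetical true values, % CH4
candidates = {
    "model_a": [0.13, 0.10, 0.18, 0.11],
    "model_b": [0.11, 0.12, 0.14, 0.15],
}
winner, table = best_model(real, candidates)
print(winner)
for name, row in table.items():
    print(name, row)
```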

Conclusions
In this paper, a new method for predicting coal-mine gas concentrations is studied. The method mainly addresses two problems of traditional algorithms in the gas prediction process: the lag of the gas concentration data and the low efficiency of the prediction model when learning data features. In this research, Spark Streaming and the ARIMA-SVM algorithm were combined into a parallel prediction method called the SPARS model. This model can be used to quickly process daily monitoring gas concentration flow data, and the advantages of the ARIMA model and the SVM model in predicting linear and nonlinear data, respectively, are combined in the model. While improving the timeliness of traditional model prediction, the model also improves the prediction accuracy. The prediction performance of the SPARS model is validated by using examples. First, the model update cycle is analysed, and the RMSE is used to evaluate the prediction performance of the model under different update cycles. It is found that the prediction accuracy of the SPARS model in the process of predicting gas concentrations does not keep improving as the model's update cycle lengthens. It is concluded that the model achieves its best prediction effect, with the smallest RMSE value, when the model update period is 50 s. Then, the accuracy of the SPARS model prediction is verified. Compared with the prediction results of the ARIMA model, SVM model and ARIMA-SVM-combined model, the SPARS model achieves the smallest prediction error and the highest prediction accuracy. The proposed model can be used to realize high-precision real-time predictions of coal-mine gas concentrations. This model can help provide a safe and reliable working environment for underground operators while ensuring the safe, stable and efficient mining of coal mines. It also provides a new idea for coal mines with respect to achieving real-time and accurate gas prediction and lays a foundation for intelligent coal-mine gas prediction and early warning. In the next step of this research, the real-time prediction and early warning analysis of gas concentrations under the influence of various factors will be considered to further improve the timeliness and accuracy of coal-mine gas-concentration prediction.