Forecasts of the Amount Purchase Pork Meat by Using Structured and Unstructured Big Data

It is believed that the huge amount of information delivered to consumers through mass media, including television and social networks, may affect consumers' behavior. The purpose of this study was to forecast the amount required to purchase pork belly meat by using unstructured data, such as broadcast news, TV programs/shows and social networks, as well as structured data, such as consumer panel data, retail and wholesale prices and production outputs, in order to show that mass media data can precede actual economic activities and that consumer behavior can be predicted from these data. By using structured and unstructured data from 2010 to 2016 and five forecasting algorithms (autoregressive exogenous model and vector error correction model for time series, gradient boosting and random forest for machine learning, and long short-term memory for recurrent neural network), the amounts required to purchase pork belly meat in 2017 were forecasted and compared with the actual amounts to validate model accuracy. Our findings suggest that when unstructured data were combined with structured data, the forecast pattern improved. To date, our study is the first report that forecasts the demand for pork meat by using structured and unstructured data.


Introduction
Recently, there was an outbreak of African swine fever in South Korea, which severely affected pork meat consumption and price [1]. It is believed that news on agri-food is delivered to consumers through mass media, including television and social networks, and that this information may affect the behavior of the consumers who are exposed to it [2][3][4]. The purpose of this study was to forecast pork consumption in terms of the amount required to purchase pork belly meat by using unstructured data, such as broadcast news, TV programs/shows and social networks, as well as structured data, such as consumer panel data, retail and wholesale prices and production outputs, in order to show that mass media data, including various unstructured data, can precede actual economic activities and that consumer behavior can be predicted from these data.
Prediction of economic activities ahead of the actual activities by using social network data or internet search data has been reported in the stock market, marketing and tourism [5][6][7]. Recently, such predictions have also been reported in agriculture [4,8-11]. However, data from broadcast news and TV programs/shows, including popular cooking shows that provide recipes, together with social network data, have never been applied to forecast either demands or prices of agri-foods. In this study, we aimed to demonstrate that broadcast news, TV programs/shows and social networks, as unstructured data combined with structured data, can be used to forecast the demand for agri-food, taking pork belly meat as an example. This study could eventually help to clarify the effects of broadcast news, TV programs/shows and social networks on agri-food consumption.

Data
We previously developed a prediction model of agri-food demand based on unstructured and structured bigdata, in which structured data on agri-food production and sales and unstructured data from mass media, including broadcasting programs and social networks, were collected and saved in a Mongo database, as seen in Figure 1 [2,12]. These bigdata were used to predict the demand for agricultural products in Korea by analyzing structured and unstructured data together.
In this paper, Agri-food consumers panel data provided by the Rural Development Administration (RDA), wholesale market data from the Outlook and Agricultural Statistics Information System (OASIS) of the Korea Rural Economic Institute, retail price data from the Korea Agricultural Marketing Information Service (KAMIS), and pork production data from the Korean Statistical Information System (KOSIS) were used as structured data, whereas data from broadcasting programs and blogs were used as unstructured data [2,12-15].
Structured data on the production and sales of pork belly meat from 2010 to 2017 were extracted from the Mongo database and prepared for analysis in order to predict the demand for pork belly meat. These data include the amounts required to purchase pork belly meat, daily retail prices of pork belly meat, daily wholesale prices of pork carcass, and others, as seen in Table 1. The Agri-food consumers panel data came from Korean consumer panels and provide the amount required to purchase from 2010 to 2017. The frequencies of the structured data were daily, monthly, quarterly or yearly depending on the data source, as seen in Table 1, except for the Agri-food consumers panel data, whose transaction records were aggregated into a daily average per consumer panel.
Unstructured data that matched a keyword search were collected from broadcasting programs and blogs. Speech from broadcasting programs was converted into text, or the transcripts were collected. The unstructured data identify the broadcasting programs and social network posts in which the term "pork belly meat" was mentioned, such as broadcast news, television programs/shows, and blogs in Korea, as seen in Table 2. The unstructured data were transaction-level records, which were summed to a daily frequency to match the structured data.
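The daily aggregation of keyword mentions described above can be sketched as follows. This is a minimal illustration in Python, not the pipeline used in the study: the record layout, the source labels and the sample texts are hypothetical, and simple substring matching stands in for whatever keyword search was actually applied.

```python
from collections import Counter
from datetime import date

# Hypothetical records: (publication date, source type, text).
records = [
    (date(2016, 3, 1), "news", "Pork belly meat prices rose this week."),
    (date(2016, 3, 1), "blog", "Grilled pork belly meat recipe"),
    (date(2016, 3, 1), "tv", "Today's show covers kimchi stew."),
    (date(2016, 3, 2), "blog", "Pork belly meat with garlic"),
]

KEYWORD = "pork belly meat"  # search term quoted in the text


def daily_mention_counts(records, keyword):
    """Count, per (day, source), the items whose text mentions the keyword."""
    counts = Counter()
    for day, source, text in records:
        # Assumption: case-insensitive substring match.
        if keyword in text.lower():
            counts[(day, source)] += 1
    return counts


counts = daily_mention_counts(records, KEYWORD)
```

Each (day, source) count can then be joined to the structured series on the date key.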

Figure 1. Agri-food related structured and unstructured bigdata, modified from [2]. SNS: social network service; DB: database.

Forecasting Methodology
Forecasting models were developed to forecast the daily average amount required to purchase pork belly meat in 2017 by using data from 2010 to 2016 (in-sample period) as the training data set and data from 2017 (out-of-sample period) as the test data set. Structured and unstructured data together were used for training and testing, and structured data alone were also used in order to check whether unstructured data improve the models' forecasts. Five algorithms were used to develop the forecasting models: the autoregressive exogenous model and the vector error correction model as traditional time-series algorithms, gradient boosting and random forest as machine learning algorithms, and long short-term memory as a neural network algorithm. These were chosen because time series models are mainly used for price prediction, while analyses using machine learning, artificial neural network, or deep learning models have recently been attempted, and machine learning models have been reported to show better predictive power than regression analysis models [16]. The forecasted amounts required to purchase pork meat in 2017 were compared with the actual amounts in 2017 in terms of mean absolute percentage error (MAPE) and mean absolute error (MAE) in order to compare the accuracy of the forecasting models. The Diebold-Mariano test was used for forecast comparison by using the DM.test function in the multDM package in R [17,18].
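For reference, the two accuracy measures can be computed as in the sketch below. The function names and the toy numbers are illustrative only, and MAPE is expressed in percent under the assumption that no actual value is zero.

```python
def mae(actual, forecast):
    """Mean absolute error, in the units of the series."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


def mape(actual, forecast):
    """Mean absolute percentage error in percent (assumes no actual value is 0)."""
    total = sum(abs((a - f) / a) for a, f in zip(actual, forecast))
    return 100.0 * total / len(actual)


# Toy values, not data from the study.
actual = [100.0, 200.0, 400.0]
forecast = [110.0, 180.0, 400.0]

example_mae = mae(actual, forecast)
example_mape = mape(actual, forecast)
```

MAE keeps the units of the purchase amount, whereas MAPE is scale-free, which is why the two measures can rank models differently.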

Time Series: Autoregressive Exogenous Model and Vector Error Correction Model
In order to forecast the amounts required to purchase pork belly meat in 2017, two time series methods were used: the autoregressive exogenous model (ARX) and the vector error correction model (VECM). Both are multivariate time series models. When there is more than one observed or predicted variable, the multivariate time series approach is more appropriate [19]. The ARX model is an autoregressive model with exogenous variables and is a representative quantitative dynamics modeling approach that has often been used in time series analysis [20]. In order to forecast the daily amounts required to purchase pork belly meat in 2017 by using the ARX model, data from 2010 to 2016 and the arx function in the 'gets' package in R were used [17]. In order to forecast the weekly amounts required to purchase pork belly meat in 2017 for comparison, the daily amounts were averaged on a weekly basis.
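For orientation, a generic ARX specification can be written as follows. The notation is an assumption introduced here for illustration and is not taken from the paper: $y_t$ is the daily amount required to purchase pork belly meat, $x_{j,t}$ are the exogenous structured or unstructured variables, $p$ is the autoregressive order, $k$ is the number of exogenous variables, and $\varepsilon_t$ is the error term.

```latex
y_t = c + \sum_{i=1}^{p} \phi_i \, y_{t-i} + \sum_{j=1}^{k} \beta_j \, x_{j,t} + \varepsilon_t
```

Lagged values of the exogenous variables can be added in the same way; the lag structure actually used in the study is not stated in this section.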
The vector error correction model (VECM) was developed by Engle and Granger and is designed to accommodate short-term adjustments in the presence of cointegration [19,21]. In VECM as well as VAR (vector autoregressive) models, more than one variable can be predicted because the interrelations between variables are modeled [19]. In order to forecast the daily amounts required to purchase pork belly meat in 2017 by using the VECM, a lag of 1 was selected based on the information criteria, after which Granger causality and the degree of cointegration were tested in EViews 10.

Machine Learning: Random Forest and Gradient Boosting
Machine learning traces back to Samuel's observation that a computer able to learn from experience would remove much of the need for detailed programming [22]. Machine learning can therefore be defined as the creation of a computer program that solves a problem with high performance by using data or experience from a specific area [23]. Machine learning methods include supervised learning, unsupervised learning, semi-supervised learning and reinforcement learning. Supervised learning is appropriate for consumption prediction. In this paper, we chose random forest and gradient boosting, which are among the most widely used supervised learning algorithms.
Random forest is a representative ensemble model that combines tree models grown independently from the same distribution [24]. It is an ensemble algorithm based on decision trees. Each tree is built by randomly selecting a subset of the attributes rather than using all of them, and this process is repeated to create many slightly different trees. The predictions of the trees in the forest for new data are then combined by voting to choose the final prediction. In order to forecast the daily amounts required to purchase pork belly meat in 2017 by using the random forest model, the randomForest package in R was used. When structured and unstructured data were used, the parameters were ntree = 1500 and mtry = 13, whereas when structured data alone were used, the parameters were ntree = 1000 and mtry = 5.
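The idea of growing many slightly different trees and combining their predictions can be illustrated with a deliberately small sketch. It is written in Python rather than with the randomForest package in R used in the study, each tree is limited to a single split, and the predictions are averaged, which is the usual way of combining trees for a numeric target. The parameter names n_trees and mtry only mirror the roles of ntree and mtry, and all data are toy values.

```python
import random


def fit_stump(rows, targets, feature_ids):
    """Best one-split regression tree over the candidate features."""
    mean_all = sum(targets) / len(targets)
    best = (float("inf"), None, None, mean_all, mean_all)
    for f in feature_ids:
        # Every distinct value except the largest is a valid threshold.
        for threshold in sorted({row[f] for row in rows})[:-1]:
            left = [t for row, t in zip(rows, targets) if row[f] <= threshold]
            right = [t for row, t in zip(rows, targets) if row[f] > threshold]
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = sum((t - lm) ** 2 for t in left) + sum((t - rm) ** 2 for t in right)
            if sse < best[0]:
                best = (sse, f, threshold, lm, rm)
    return best[1:]


def predict_stump(stump, row):
    f, threshold, lm, rm = stump
    if f is None:  # no split was possible; fall back to the sample mean
        return lm
    return lm if row[f] <= threshold else rm


def fit_forest(rows, targets, n_trees=25, mtry=1, seed=0):
    """Each tree sees a bootstrap sample and a random subset of mtry features."""
    rng = random.Random(seed)
    n, n_features = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]        # bootstrap sample
        feats = rng.sample(range(n_features), mtry)       # random attributes
        forest.append(fit_stump([rows[i] for i in idx],
                                [targets[i] for i in idx], feats))
    return forest


def predict_forest(forest, row):
    """Average the predictions of all trees."""
    return sum(predict_stump(s, row) for s in forest) / len(forest)


# Toy data: two features per row, one numeric target.
rows = [[1, 0], [2, 1], [3, 0], [4, 1], [5, 0], [6, 1]]
targets = [10.0, 11.0, 10.0, 20.0, 21.0, 19.0]
forest = fit_forest(rows, targets, n_trees=50, mtry=1, seed=0)
```

Because each tree here has only one split, drawing the feature subset once per tree is the same as drawing it at every split; full implementations repeat the draw at each node of deeper trees.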
Like random forest, boosting is one of the most popular ensemble models. However, unlike random forest, which generates trees randomly, boosting creates trees sequentially, with each tree compensating for the errors of the previous one. The key feature of boosting is that weak learners can be combined into an excellent ensemble learner. Boosting gives each learner a weight based on past performance, so a good model has a greater impact on the final prediction of the ensemble [25]. Gradient boosting is a boosting algorithm that improves the model by correcting the residual errors of the previous model. In order to forecast the daily amounts required to purchase pork belly meat in 2017 by using the gradient boosting model, the xgboost package in R was used. When structured and unstructured data were used, the parameters were max.depth = 15 and eta = 0.03, whereas when structured data alone were used, the parameters were max.depth = 8 and eta = 0.03.
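The residual-correcting loop of gradient boosting can be sketched in a few lines. This is an illustration of the principle with squared-error loss, one feature and depth-1 trees; it is not xgboost, which adds regularization and deeper trees (max.depth in the study). Only the learning rate name eta is borrowed from the text, and the data are toy values. The sketch assumes the feature has at least two distinct values.

```python
def fit_stump(xs, residuals):
    """Best single-threshold split on one feature, minimizing squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, lm, rm)
    return best[1:]


def predict_stump(stump, x):
    threshold, lm, rm = stump
    return lm if x <= threshold else rm


def fit_boosting(xs, ys, n_rounds=50, eta=0.03):
    """Start from the mean, then repeatedly fit a stump to the residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Shrink each correction by the learning rate eta.
        preds = [p + eta * predict_stump(stump, x) for p, x in zip(preds, xs)]
    return base, eta, stumps


def predict_boosting(model, x):
    base, eta, stumps = model
    return base + eta * sum(predict_stump(s, x) for s in stumps)


# Toy data, not data from the study.
xs = [1, 2, 3, 4, 5, 6]
ys = [10.0, 11.0, 10.0, 20.0, 21.0, 19.0]
model = fit_boosting(xs, ys, n_rounds=200, eta=0.03)
```

A small eta such as 0.03 means each tree corrects only a small fraction of the remaining error, so many rounds are needed, but the fit is less sensitive to any single tree.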

Long Short-Term Memory
Long short-term memory (LSTM) is a modification of the recurrent neural network (RNN) model and a typical improvement that mitigates the vanishing gradient problem of RNNs [26,27], as seen in Figure 2. LSTM is an artificial neural network commonly used for time-series data analysis. The model includes a memory cell in the hidden node, which stores and outputs values and controls how much of the stored value is forgotten. The LSTM consists of an input gate for the input value, an output gate for the output value, and a forget gate for the value to be forgotten. The learning algorithm of the LSTM uses backpropagation in the same way as the RNN; the input is sequence data, and the output is the output of the LSTM. Because the LSTM model includes three gates, the number of weights and biases is about four times that of a typical RNN, which means that the execution and learning times of the LSTM model are longer than those of RNN models. Despite this, the vanishing gradient problem is mitigated, so more accurate results can be obtained.
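The gate structure can be made concrete with the standard LSTM cell equations, given here in a common textbook notation that is not taken from the paper: $x_t$ is the input at time $t$, $h_t$ the hidden state, $c_t$ the memory cell, $\sigma$ the logistic function and $\odot$ element-wise multiplication.

```latex
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)} \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)} \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(candidate cell value)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```

The three gates plus the candidate cell value each carry their own weights and bias, which accounts for the roughly fourfold parameter count relative to a plain RNN layer of the same size.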
In this study, the panel purchase amount was forecasted by using a basic LSTM model and two kinds of data sets for comparison: structured and unstructured data together, and structured data alone. The data learning and prediction method is shown in Figure 3. First, the data are log-transformed and then normalized by min-max scaling. After that, the correlation coefficient between each column and the panel purchase amount, which is the forecast target, is estimated through correlation analysis, and a weight for each column is calculated based on its correlation coefficient. To learn and predict time-series data, the LSTM model is created in many-to-one form, with multiple inputs and one output. Then, the sequence length and hidden dimension of the LSTM model were specified. Once the LSTM model was determined, it was trained on the data from 2010 to 2016 for each sequence in order to forecast the daily average amount required to purchase pork belly meat in 2017. In this method, a total of 48 cases were generated to find the parameter values of the LSTM model with the best forecasting accuracy. Each case was made by changing the parameters that affect the model during learning (e.g., sequence length, hidden dimension, and stacked layers). Details of the selected experimental cases are listed in Table 3, and details of all 48 experimental cases are listed in Table S1.
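The preprocessing and the many-to-one framing can be sketched as follows. This is a minimal Python illustration with hypothetical function names: it assumes strictly positive values for the log transform, a non-constant column for the scaling, and that each window of seq_length consecutive days is paired with the target on the following day. The exact alignment used in the study is not stated in this section.

```python
import math


def log_minmax(values):
    """Log-transform, then rescale to [0, 1] (assumes positive, non-constant values)."""
    logged = [math.log(v) for v in values]
    lo, hi = min(logged), max(logged)
    return [(v - lo) / (hi - lo) for v in logged]


def make_windows(features, target, seq_length):
    """Many-to-one samples: seq_length consecutive feature rows -> one target value."""
    samples = []
    for end in range(seq_length, len(target)):
        window = features[end - seq_length:end]  # rows end-seq_length .. end-1
        samples.append((window, target[end]))    # target on the following day
    return samples


# Toy values, not data from the study.
scaled = log_minmax([1.0, 10.0, 100.0])
features = [[float(i)] for i in range(5)]
target = [0.0, 1.0, 2.0, 3.0, 4.0]
samples = make_windows(features, target, seq_length=3)
```

In practice the minimum and maximum would be taken from the training period only and reused for the test period, so that no information from 2017 leaks into the scaling.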

Forecasted Daily Amounts Required to Purchase Pork Belly Meat
Five algorithms were used to forecast the daily amounts required to purchase pork belly meat in 2017, and model accuracy was compared by using MAPE and MAE. Two different data sets were used to develop the forecast models, structured data alone and structured and unstructured data together, as seen in Table 4. Among the ten forecast results, LSTM with structured data showed the lowest MAPE, whereas ARX with structured and unstructured data showed the lowest MAE, and the forecast comparison by the Diebold-Mariano test showed no statistically significant difference (DM statistic = 0.7041, p-value = 0.4814). Four forecasts were plotted against the actual amounts: ARX with structured and unstructured data, LSTM with structured and unstructured data, ARX with structured data, and LSTM with structured data, as seen in Figure 4. In Figure 4, the patterns of ARX with structured and unstructured data and ARX with structured data alone mimic the pattern of the actual daily amounts, whereas the patterns of LSTM with structured and unstructured data and LSTM with structured data stay close to the mean. ARX with structured and unstructured data follows the actual pattern more closely than ARX with structured data alone in terms of the height of the peaks and the depth of the troughs.
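As a rough guide to how such a test statistic is read, the sketch below computes a basic one-step-ahead Diebold-Mariano statistic with squared-error loss and a two-sided p-value from the standard normal distribution. It is a simplified Python illustration, not the DM.test function of the multDM package: it omits any autocorrelation correction, and the choice of loss function and reference distribution are assumptions. The reported p-values are nevertheless consistent with a two-sided normal reference.

```python
import math


def dm_statistic(actual, forecast_a, forecast_b):
    """Basic DM statistic on squared-error loss differentials (no HAC correction)."""
    d = [(y - a) ** 2 - (y - b) ** 2
         for y, a, b in zip(actual, forecast_a, forecast_b)]
    n = len(d)
    mean_d = sum(d) / n
    var_d = sum((x - mean_d) ** 2 for x in d) / n
    return mean_d / math.sqrt(var_d / n)


def two_sided_normal_p(stat):
    """Two-sided p-value under a standard normal reference distribution."""
    return math.erfc(abs(stat) / math.sqrt(2.0))


# Toy series, not data from the study: forecast_b is perfect, forecast_a is not.
toy_actual = [1.0, 2.0, 3.0, 4.0]
toy_a = [2.0, 2.0, 4.0, 4.0]
toy_b = [1.0, 2.0, 3.0, 4.0]
toy_dm = dm_statistic(toy_actual, toy_a, toy_b)

# p-values implied by the DM statistics reported in the text.
p_daily = two_sided_normal_p(0.7041)
p_weekly = two_sided_normal_p(1.3432)
```

A positive statistic indicates larger squared errors for the first forecast; a p-value above the chosen significance level means the difference in accuracy cannot be distinguished from zero.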

Forecasted Weekly Amounts Required to Purchase Pork Belly Meat
The daily forecasted amounts required to purchase pork belly meat were averaged on a weekly basis in order to see whether forecasting errors could be reduced, as seen in Table 5. Among the ten forecast results, ARX with structured and unstructured data showed the lowest MAPE and MAE; however, it did not show a statistically significant difference compared to ARX with structured data in the forecast comparison by the Diebold-Mariano test (DM statistic = 1.3432, p-value = 0.1792). The same four models were compared with the actual amounts, as seen in Figure 5. The patterns of ARX with structured and unstructured data and ARX with structured data alone, which stay close to each other, mimic the pattern of the actual weekly amounts better than the patterns of LSTM with structured and unstructured data and LSTM with structured data.
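The weekly averaging step can be sketched as below. The sketch assumes that a week is a block of seven consecutive days starting from the first observation and that a final partial block is averaged over the days available; whether the study used calendar weeks instead is not stated.

```python
def weekly_means(daily_values, days_per_week=7):
    """Average consecutive blocks of days; a final partial block uses the days it has."""
    weeks = []
    for start in range(0, len(daily_values), days_per_week):
        block = daily_values[start:start + days_per_week]
        weeks.append(sum(block) / len(block))
    return weeks


# Toy values: days 1..14 give two full weeks.
two_weeks = weekly_means([float(d) for d in range(1, 15)])
# Seven days of 1.0 followed by a single extra day of 3.0.
with_partial = weekly_means([1.0] * 7 + [3.0])
```

Averaging smooths day-to-day noise, which is one reason weekly errors can be lower than daily errors for the same underlying forecasts.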

Forecasting Errors in LSTM When Structured and Unstructured Data Were Used Compared with Structured Data Alone
The forecasting results of the various LSTM cases are shown in Figure 6 in terms of MAPE. Most of the daily and weekly cases show a lower MAPE when both structured and unstructured data were used than when structured data alone were used. For the daily forecasts, the lowest MAPE was 14.59 (case 27) when structured and unstructured data were used and 14.51 (case 36) when structured data alone were used. For the weekly forecasts, the lowest MAPE was 6.5 (case 33) when structured and unstructured data were used and 7.25 (case 18) when structured data alone were used. In both cases, the forecasts were compared with the actual values on a daily and weekly basis. Across most cases, the forecasting errors were lower when both structured and unstructured data were used, although the best daily case with structured data alone had a marginally lower MAPE than the best daily case with both kinds of data.
Figure legend (daily amounts to purchase pork belly meat in 2017): actual amounts; forecast by ARX with structured and unstructured data; forecast by LSTM with structured and unstructured data; forecast by ARX with structured data; forecast by LSTM with structured data.

Discussion
In this study, we aimed to demonstrate that broadcast news, TV programs/shows and social networks can be used to forecast the demand for agri-food by using data on pork belly meat, one of the most popular meat products among Korean consumers. Our findings suggest that the forecast pattern improves when broadcast news, TV programs/shows and social networks, grouped as unstructured data, are combined with structured data that include consumer panel data, retail and wholesale prices and production outputs.
There have been a few reports in which social network or internet search data were used to predict the price of agri-food; however, no paper had used data from broadcast news and TV programs/shows to predict either prices or demands of agri-food until a very recent one on paprika consumption prediction [4,8-11,28]. To date, our study is the first report that predicts the demand for livestock products by using unstructured data from broadcast news, TV programs/shows and social networks as well as conventional structured data. Planning agri-food production with better forecasts of prices and demand, based on structured and unstructured data, could contribute to a stable supply of agri-food.
One limitation is that the amounts required to purchase pork meat may follow trends shared with other features (data not shown), which requires further study so that the effects of broadcast news, TV programs/shows and social networks on the consumption of agri-food can be clearly revealed. Recently, there was an outbreak of African swine fever in the Korean peninsula, which severely affected pork meat consumption and price. For future research, it is important to analyze how an outbreak of infectious disease among livestock affects meat consumption, in order to predict the demand for agri-food by using unstructured data. Furthermore, the impacts of the various kinds of unstructured data on consumer demand should be prioritized so that the findings can be provided to policy makers to facilitate the consumption of agri-food when overproduction occurs. These topics are not covered in our current study and constitute areas of future research.