Data Mining Based on Chinese Traditional Calendar in the Han Dynasty Yang Mausoleum Museum

Abstract: The Outer Burial Pits of the Han Dynasty Yang Mausoleum is the first fully enclosed site museum in China. The Internet of Things (IoT) sensors installed in the pavilion have accumulated more than 7,000,000 heterogeneous data records. Traditional algorithms, such as statistical temperature prediction models, use only statistical values to predict the trend of temperature change, so they make poor use of the data and their prediction accuracy is relatively low. The extreme learning machine (ELM) is a learning method for single-hidden-layer feedforward neural networks, and analyzing and modeling the data with it is an effective prediction approach for preventive protection. However, because solar calendar features do not match temperature changes well, the accuracy of the resulting prediction model is unsatisfactory. To solve this problem, this paper proposes a temperature prediction model based on lunar calendar features. Tests on a large number of measured data show that the model predicts both the future trend and the detailed characteristics of temperature variation more accurately than the model based on solar calendar features. The advantages of the lunar calendar in recording and processing natural data are thus preliminarily verified, which provides a reference for label selection in follow-up monitoring data.


Introduction
Built in 2006, the outer burial pits of the Han Dynasty Yang Mausoleum is the first fully enclosed museum in China. To evaluate the effect of the closed protection mode, 260 Internet of Things (IoT) sensors have been installed at the site since 2009. They mainly monitor heterogeneous data such as temperature, humidity and carbon dioxide concentration. In the museum, these data can be monitored remotely and in real time by the IoT sensors [1][2][3][4] and used to study the evolution law of the site and predict its development trend, so as to better guide its protection. The sensors also communicate autonomously with the base station to report the current condition of the soil surface, which greatly reduces human interference and damage. Although a large number of IoT sensors have been deployed in the museum, the utilization rate of the large-scale monitoring data remains low. A deeper understanding of the data is needed to detect or determine anomalies. The processing of monitoring data has mainly gone through the following two stages.
In the first stage, the processing of monitoring data relied on traditional statistical models. These studies cover indoor museums and outdoor soil sites and involve museum air quality [5], temperature and humidity [6], microclimate models [7,8], etc. There are also many studies on the outer burial pits of the Han Dynasty Yang Mausoleum, including the influence of indoor temperature change on cultural relics [9], prediction models [10,11], weathering of the soil sites [12], the change rules of indoor air temperature and relative humidity [13,14], and the indoor air environment at the Hanyang Mausoleum museum [15]. These studies are based on traditional statistical algorithms, which can only obtain the statistical rules of the data over a short time and do not mine the detailed characteristics of the data. The algorithms are relatively simple and use only a small fraction of the large volume of available data, so they cannot show the detailed characteristics of temperature change.
In the second stage, machine learning was introduced. Traditional machine learning is an attempt to predict the future behavior or trend of data based only on observed samples. Modern machine learning [16,17] studies how computers simulate or implement human learning behavior in order to acquire new knowledge or skills. Semi-supervised learning [18] can use big data to reorganize the existing knowledge structure and continuously improve its performance. Reference [19] proposes a time-frequency representation based on the kernel extreme learning machine (ELM), which handles non-uniform sampling. ELM is a learning method for single-hidden-layer feedforward neural networks (SLFNN); its main advantages are a simple structure and fast learning. Reference [20] proposes a temperature data modeling and prediction algorithm using ELM, which establishes a prediction model based on the solar calendar label. That model was an initial introduction of intelligent technology into the field of cultural relic protection, and it also realized detailed prediction of temperature change. The traditional Chinese lunar calendar carries information from both the sun and the moon. Because natural data tend to be influenced by the sun and the moon at the same time, using the lunar calendar as a time label should be more consistent with the changing law of the data itself.
Based on the previous work, this paper uses the ELM to establish a temperature prediction model based on the lunar calendar label. Compared with processing based on solar calendar labels, the accuracy of this method is higher. In addition, the superiority of the lunar calendar in recording natural data is preliminarily verified. Experiments show that, compared with the solar calendar model, the lunar calendar model can better predict the temperature, humidity, and other environmental factors. Modeling and analyzing temperature data based on the lunar calendar not only improves the accuracy of temperature prediction but also effectively mines the overall law of temperature.
The structure of this paper is as follows: the characteristics and processing of the site's monitoring data are introduced in Section 2; the data prediction model based on the lunar calendar is given in Section 3; the results of tests on real data are presented in Section 4; and concluding remarks are given in Section 5.

Introduction to Solar Calendar and Lunar Calendar
The Gregorian calendar, the solar calendar used throughout the world today, derives from the Roman calendar and follows the revolution of the earth around the sun. The average length of its calendar year is 365 days, 5 h, 49 min and 12 s, while the length of the tropical year is 365 days, 5 h, 48 min and 46 s. The difference between a Gregorian year and a tropical year is therefore only 26 s. The month and date of each calendar year in the Gregorian calendar also coincide well with the position of the sun on the ecliptic. The Gregorian year is divided into 12 months, each averaging about one twelfth of the tropical year (30.4368 days). However, the "month" in question has nothing to do with the phase of the moon. In other words, the solar calendar can only show information about the sun.
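The figures quoted above can be checked with a few lines of arithmetic. The sketch below assumes the standard Gregorian leap-year rule (97 leap years in every 400-year cycle), which the text does not state explicitly; the helper name `to_seconds` is illustrative only.

```python
from fractions import Fraction

# Gregorian leap rule (assumed, not stated in the text): every 4th year is a
# leap year, except century years not divisible by 400 -> 97 per 400 years.
leap_years_per_cycle = 400 // 4 - 400 // 100 + 400 // 400
mean_year_days = 365 + Fraction(leap_years_per_cycle, 400)  # 365.2425 days


def to_seconds(days=0, hours=0, minutes=0, seconds=0):
    """Convert a duration given in days/hours/minutes/seconds to seconds."""
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


gregorian_year_s = mean_year_days * 86400
tropical_year_s = to_seconds(365, 5, 48, 46)  # value quoted in the text
difference_s = gregorian_year_s - tropical_year_s

# One twelfth of the tropical year, in days.
mean_month_days = tropical_year_s / 86400 / 12

print(float(mean_year_days))      # mean Gregorian year in days
print(difference_s)               # Gregorian minus tropical year, in seconds
print(round(mean_month_days, 4))  # average month length in days
```

Exact fractions are used for the calendar year so that the 26 s difference is not blurred by floating-point rounding.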
The lunar calendar is based on both the earth's revolution and the moon's revolution. Its history can be traced back to the Qin and Han dynasties. The lunar calendar takes the tropical year as one year and the lunar month as one month. However, a tropical year is about 11 days longer than 12 lunar months, so if the lunar year always had 12 months, the dates would drift away from the seasons. Through practice, the ancients first stipulated that an extra month be added about every three years; a year with an extra month is called a leap year. Later calculation led to the rule of selecting seven leap years in every 19 calendar years. The 3rd, 6th, 9th, 11th, 14th, 17th, and 19th years are generally selected as leap years, and the remaining years are ordinary years. A leap year has 13 months, while an ordinary year has only 12 months. Which month becomes the leap month is determined by the Chinese solar terms. This rule leaves a mere 0.09 days (about 2 h, 9 min and 36 s) between 19 tropical years and 235 lunar months (12 common years and seven leap years). During the Southern and Northern Dynasties, Zu Chongzhi created a more accurate method, which selected 144 of 391 calendar years as leap years and the rest as common years. The lunar calendar remains in use today. Its uniqueness lies in two aspects. On the one hand, the date represents a certain phase of the moon: the first day of a month falls on the new moon and the full moon falls in the middle of the month. On the other hand, the calendar is coordinated with the four seasons (spring, summer, autumn and winter). In conclusion, the lunar calendar carries information from both the sun and the moon. Because natural data are often influenced by both the sun and the moon, using the lunar calendar as a time label should be more consistent with the changing pattern of the data itself.
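The intercalation arithmetic above can be reproduced with a short sketch. It assumes a mean lunar (synodic) month of 29.530588 days, a value that is not given in the text; the function name `cycle_drift` is hypothetical.

```python
# Tropical year in days, from the value quoted in the text (365 d 5 h 48 min 46 s).
TROPICAL_YEAR = 365 + 5 / 24 + 48 / 1440 + 46 / 86400
# Mean synodic month in days: an assumed value, not given in the text.
SYNODIC_MONTH = 29.530588


def cycle_drift(years, leap_years):
    """Return (lunar months in the cycle, drift in days) for a leap-year rule.

    Each ordinary year has 12 lunar months and each leap year has 13.
    The drift is the length of those months minus the same number of tropical years.
    """
    months = 12 * years + leap_years
    drift = months * SYNODIC_MONTH - years * TROPICAL_YEAR
    return months, drift


# How much 12 lunar months fall short of one tropical year.
shortfall = TROPICAL_YEAR - 12 * SYNODIC_MONTH

months_19, drift_19 = cycle_drift(19, 7)        # seven leap years in 19 years
months_391, drift_391 = cycle_drift(391, 144)   # Zu Chongzhi's rule

print(round(shortfall, 2))
print(months_19, round(drift_19, 2))
print(drift_19 / 19, drift_391 / 391)  # average drift per year for each rule
```

Under these assumptions the 19-year rule yields 235 months and a drift that rounds to 0.09 days, and the 391-year rule has a smaller drift per year, which is consistent with the description of it as the more accurate method.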

Monitoring Data
The outer burial pits of the Han Dynasty Yang Mausoleum is the first fully enclosed museum in China. As shown in Figure 1, a large number of IoT sensors were set up at the site, where each green rectangle represents an IoT sensor. A large amount of heterogeneous data has been recorded over the years, including temperature, humidity, frost point, dew point, and carbon dioxide concentration. Although the outer burial pits form a completely enclosed environment, the enclosure only isolates the pit environment from the atmospheric environment. The site itself is still directly connected with the earth, so the impact of the underground environment on the site cannot be avoided and monitoring of the site itself is indispensable. Therefore, in addition to the existing monitoring points, soil temperature, water content, and electrical conductivity are monitored on the soil partition beam, as shown in Figure 2a,b, which makes the monitoring of environmental factors more comprehensive. Monitoring points are also set up in the outer pit, outside the glass cover, and on the site itself, as shown in Figure 3 (round, diamond, and rectangle markers represent temperature and humidity, crack, and soil monitoring points, respectively). Together with the original monitoring points, these make the monitored area more comprehensive.
In this paper, the temperature monitoring data of the 110,120 sensor, which is placed in the middle hall of the outer burial pits, are selected as the analysis object. Figure 4 shows the monitoring instrument, an indoor atmospheric temperature and humidity sensor (model: MW301GA). Its measurement accuracy and range are ±0.3 °C and 20-80 °C, respectively. The monitoring period was from 1 January 2011 to 31 December 2011. Because the museum staff adjusted the sampling frequency, different sampling frequencies occurred during the year.
The change of sampling frequency makes it impossible to use the monitoring data directly to build the lunar calendar model; otherwise the accuracy of the model would be reduced. The monitoring frequency was as follows: sampling every 20 min from the 1st to the 50th day and every 30 min from the 51st to the 365th day. In total, 18,720 data points should have been recorded, but the actual number of measured data points is 15,007.
Year of monitoring data: 2011
Time period: 1 January 2011 to 31 December 2011
Sampling frequency: every 20 min from the 1st to the 50th day and every 30 min from the 51st to the 365th day
Number of data: 15,017

The structure of this kind of data is relatively complex, as shown in Figures 5 and 6. This is mainly reflected in two aspects. Firstly, the data are unevenly distributed. The adjustment of the sensor sampling frequency, together with data recording, data transmission and other emergencies, resulted in relatively more data being recorded from the 1st to the 50th day. Because of the maintenance of monitoring equipment and the renovation of the museum's power supply system, the recorded data are seriously insufficient from the 51st to the 99th day. If these data were used directly to calculate the daily average temperature, the error would be large, and a temperature prediction model built on such statistics would predict inaccurately. Secondly, data are missing, in two forms:
(1) Short-term missing data. From the 1st to the 50th day, the sensor collected data about once every 20 min, so about 72 records were collected per day. Later, after the sampling interval was adjusted to about 30 min, only about 48 records could be collected per day from the 100th to the 365th day. However, fewer data may be collected on a day such as the 100th day due to emergencies, which affects the later machine learning.
(2) Long-term missing data. Data are missing for whole days; for example, there are no monitoring data between the 51st and the 99th day. The long-term absence of data affects the processing of data details.
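The expected record count and the two kinds of gaps described above can be illustrated with a small sketch. The function names and the `min_fraction` threshold are hypothetical, and the per-day counts are synthetic values that only mimic the described pattern; they are not the museum's data.

```python
def expected_per_day(day):
    """Nominal records per day: every 20 min up to day 50, every 30 min afterwards."""
    return 72 if day <= 50 else 48


def classify_days(counts, min_fraction=0.8):
    """Split days 1..365 into long-term gaps (no data) and short-term gaps.

    counts maps day-of-year -> number of recorded samples. min_fraction is a
    hypothetical threshold: a day with fewer than this fraction of its nominal
    records is treated as partially missing.
    """
    long_term, short_term = [], []
    for day in range(1, 366):
        n = counts.get(day, 0)
        if n == 0:
            long_term.append(day)
        elif n < min_fraction * expected_per_day(day):
            short_term.append(day)
    return long_term, short_term


# Nominal total for the year: 50 days * 72 + 315 days * 48.
expected_total = sum(expected_per_day(d) for d in range(1, 366))

# Synthetic counts: full days everywhere except a gap on days 51-99
# and one interrupted day (day 100).
counts = {d: expected_per_day(d) for d in range(1, 366) if not 51 <= d <= 99}
counts[100] = 20

long_term, short_term = classify_days(counts)
print(expected_total)                 # 18720
print(long_term[0], long_term[-1])    # 51 99
print(short_term)                     # [100]
```

Separating the two cases matters because a partially recorded day can still contribute a few samples to training, whereas a fully missing day contributes none.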

Data Preprocessing
The ELM [21,22] is a learning method for single-hidden-layer feedforward neural networks proposed by Huang et al. Its main characteristic is that it adapts to non-linear structures and imprecise rules and can optimize the calculation through independent learning. The optimal solution can be generated by setting a reasonable number of hidden layer nodes before training and assigning appropriate values to the input weights and hidden layer biases during execution. The ELM is also used as a classifier in many fields, including image processing, signal processing, and data classification and prediction.
The entire process of handling the monitoring data is shown in Figure 7. Because the monitoring data are unevenly distributed and partly missing, they cannot be used directly for calculation and prediction. Firstly, the monitoring data are preprocessed (for example, normalized). Secondly, time characteristics, including hours (accurate to seconds), dates, months and years, are extracted from the normalized data. Finally, the time characteristics are used as the input of the ELM for training, and a temperature prediction model based on time characteristics is built by the ELM to predict the temperature.
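The normalization step mentioned above is not specified further in the text. As a minimal sketch, assuming min-max scaling to [0, 1] (one common choice, not necessarily the one used by the authors), it could look as follows; all function names are illustrative.

```python
def fit_min_max(values):
    """Return the (min, max) of the training data used for scaling."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("a constant series cannot be min-max normalized")
    return lo, hi


def normalize(values, lo, hi):
    """Map values linearly so that lo -> 0.0 and hi -> 1.0."""
    return [(v - lo) / (hi - lo) for v in values]


def denormalize(scaled, lo, hi):
    """Invert normalize(), e.g. to turn model outputs back into degrees Celsius."""
    return [s * (hi - lo) + lo for s in scaled]


temps = [10.0, 15.0, 20.0]  # illustrative temperatures, not measured data
lo, hi = fit_min_max(temps)
scaled = normalize(temps, lo, hi)
restored = denormalize(scaled, lo, hi)
print(scaled)    # [0.0, 0.5, 1.0]
print(restored)  # [10.0, 15.0, 20.0]
```

Keeping the fitted `lo` and `hi` from the training period makes it possible to apply the same scaling to later data and to convert predictions back to physical units.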

Structure of ELM
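Given $N$ training samples $(x_i, t_i)$, $x_i = [x_{i1}, x_{i2}, \ldots, x_{in}]^T$ represents the i-th feature vector with $n$ dimensions, and $t_i = [t_{i1}, t_{i2}, \ldots, t_{im}]^T$ is the target data vector with $m$ dimensions. $g_i(\cdot)$ is the activation function. The sigmoid function is used because it is a common activation function in neural networks that is infinitely differentiable and easy to implement. In addition, it maps $(-\infty, +\infty)$ to $(0, 1)$, which accords with a probability distribution. The output of a standard SLFNN with $\tilde{N}$ hidden layer nodes is defined as follows:

$$\sum_{i=1}^{\tilde{N}} \beta_i \, g_i(w_i \cdot x_j + b_i) = t_j, \quad j = 1, 2, \ldots, N,$$

where $w_i = [w_{i1}, w_{i2}, \ldots, w_{in}]^T$ is the input weight vector connecting the input layer to the i-th hidden layer node, the output weight vector $\beta_i = [\beta_{i1}, \beta_{i2}, \ldots, \beta_{im}]^T$ connects the i-th hidden node to the output layer, and $b_i$ is the bias of the i-th hidden node. The left side of the equation is the actual output of the model. The right side is the ideal output of the model. The network structure is shown in Figure 8. The equation can be rewritten as:

$$H\beta = T,$$

where $H$, $\beta$ and $T$ can be written as:

$$H = \begin{bmatrix} g(w_1 \cdot x_1 + b_1) & \cdots & g(w_{\tilde{N}} \cdot x_1 + b_{\tilde{N}}) \\ \vdots & \ddots & \vdots \\ g(w_1 \cdot x_N + b_1) & \cdots & g(w_{\tilde{N}} \cdot x_N + b_{\tilde{N}}) \end{bmatrix}_{N \times \tilde{N}}, \quad \beta = \begin{bmatrix} \beta_1^T \\ \vdots \\ \beta_{\tilde{N}}^T \end{bmatrix}_{\tilde{N} \times m}, \quad T = \begin{bmatrix} t_1^T \\ \vdots \\ t_N^T \end{bmatrix}_{N \times m}.$$

To find the $\beta$ with the smallest norm by the least-squares method, the training error is minimized:

$$\| H\hat{\beta} - T \| = \min_{\beta} \| H\beta - T \|.$$

Under this constraint, the least-squares solution can be derived from the previous equation:

$$\hat{\beta} = H^{\dagger} T,$$

where $H^{\dagger}$ represents the Moore-Penrose generalized inverse of $H$. $H^{\dagger}$ is often used to find the minimum-norm least-squares solution of inhomogeneous linear equations and to simplify the form of the solution.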
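The training procedure described above reduces to building the hidden layer output matrix H and solving for the output weights with the Moore-Penrose inverse. The sketch below is a minimal illustration using NumPy, not the authors' implementation: the input weights and biases are drawn at random (as in the standard ELM formulation), and the uniform range, the number of hidden nodes, and the toy sine data are all assumptions made for the example.

```python
import numpy as np


def sigmoid(z):
    """Sigmoid activation, mapping (-inf, +inf) to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def elm_fit(X, T, n_hidden, rng):
    """Fit a single-hidden-layer ELM.

    X has shape (N, n), T has shape (N, m). The input weights W and biases b
    are random and stay fixed; only the output weights beta are solved for.
    """
    n_features = X.shape[1]
    W = rng.uniform(-1.0, 1.0, size=(n_features, n_hidden))  # assumed range
    b = rng.uniform(-1.0, 1.0, size=n_hidden)                # assumed range
    H = sigmoid(X @ W + b)            # hidden layer output matrix, (N, n_hidden)
    beta = np.linalg.pinv(H) @ T      # minimum-norm least-squares solution
    return W, b, beta


def elm_predict(X, W, b, beta):
    """Evaluate the trained network on new inputs."""
    return sigmoid(X @ W + b) @ beta


# Toy regression problem (illustrative only): one period of a sine curve.
rng = np.random.default_rng(0)
X = np.linspace(0.0, 1.0, 200).reshape(-1, 1)
T = np.sin(2 * np.pi * X)

W, b, beta = elm_fit(X, T, n_hidden=30, rng=rng)
pred = elm_predict(X, W, b, beta)
mse = float(np.mean((pred - T) ** 2))
print(beta.shape, mse)
```

Because only `beta` is learned and it has a closed-form solution, training involves no iterative gradient descent, which is the source of the fast learning attributed to the ELM.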

Feature Extraction of Time
In the ELM, the time information includes hours, dates, months, and years. All of this information needs to be extracted as time features of the machine learning model. The time features are defined as a weighted combination of these components, where $x_h$, $x_d$, $x_m$ and $x_y$ are the hours, dates, months and years, respectively, and $\omega_h$, $\omega_d$, $\omega_m$ and $\omega_y$ are the weights corresponding to $x_h$, $x_d$, $x_m$ and $x_y$. Each weight is an empirical value that needs to be determined through a large number of experiments. For a given set of data, the weights are constants.
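As a hedged illustration of the time features, the sketch below assumes that each time component is multiplied by its own weight and that the results form a four-element feature vector; the exact form of the definition is not recoverable from the text, so this is one plausible reading. The default weights are the values reported in the experiments of this paper, and the function name `time_features` is hypothetical.

```python
from datetime import datetime

# Feature weights reported in the experiments of this paper.
WEIGHTS = {"hour": 5.5, "day": 0.98, "month": 1 / 30, "year": 1 / 365}


def time_features(ts, weights=WEIGHTS):
    """Build a weighted time feature vector [hour, day, month, year].

    Assumption: each component is scaled by its own weight. The hour is
    expressed as a fraction so that it is accurate to the second.
    """
    hour = ts.hour + ts.minute / 60 + ts.second / 3600
    return [
        weights["hour"] * hour,
        weights["day"] * ts.day,
        weights["month"] * ts.month,
        weights["year"] * ts.year,
    ]


features = time_features(datetime(2011, 1, 1, 12, 30, 0))
print(features)
```

For the lunar calendar model, the same function could be applied after replacing the day, month and year with their lunar calendar counterparts.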

Establishing the Solar Calendar Model and the Lunar Model
(1) Establishment of the solar calendar model: the solar time information, including year, month, date, and hour (accurate to the second), together with the temperature monitored at each time point, is input into the ELM. The temperature prediction model based on the solar calendar label is then established by the ELM.
(2) Establishment of the lunar calendar model: first, each solar date is converted to the corresponding lunar date. Second, the lunar time information and the temperature monitored at each time point are input into the ELM. Finally, the temperature prediction model based on the lunar calendar label is established.

Regression and Prediction
In the simulation experiments, the feature weights ($\omega_h = 5.5$, $\omega_d = 0.98$, $\omega_m = 1/30$, and $\omega_y = 1/365$) corresponded to the four models (the hour model, the daily model, the monthly model, and the annual model), respectively. The training period was 10 days and the number of training times was set to 100. The mean square error (MSE) was $2.5824 \times 10^{-4}$. In Figure 9a,b, the unit of time is the day. Figure 9a shows that the training results follow the changes of the training data in real time, which indicates that the lunar model can reflect the detailed characteristics of data changes. As Figure 9b shows, the original data and the predicted results coincide well, which indicates that the lunar calendar model fits the data accurately.

Comparison of Prediction Results of the Two Models
In the first part of the experiments, the training period was set to 10 days. Table 1 shows that the MSE of a single model did not change much with the number of training times. Comparing the two models, when the number of training times was fixed, the MSE of the solar calendar model was always higher than that of the lunar calendar model. The minimum ratio was 1.27 and the maximum ratio reached 1.42. The experiments indicate that the prediction model based on the lunar calendar better reflects the change of temperature data, and the accuracy of temperature prediction is improved.
In the second part of the experiments, the training period was 30 days and the temperature was predicted for the next 10 days. Table 2 illustrates that the MSE of the lunar calendar model is always much lower than that of the solar calendar model. The prediction model based on the lunar calendar performs best when the number of training times is 50, where the MSE ratio of the two models reaches 3.84. As the number of training times increases further, overfitting occurs and the MSE of both models increases. These results indicate that the lunar calendar model performs better than the solar calendar model.
To analyze the data further, four prediction periods (30 days, 60 days, 90 days and 120 days) were tested, with the training period set to 50 days. As Table 3 shows, the MSE of the lunar calendar model was lower than that of the solar calendar model in every period, which verifies the validity of the lunar calendar model.
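The comparison metric used in these experiments, the ratio of the solar model's MSE to the lunar model's MSE, can be computed as in the sketch below. The observed and predicted values are made-up toy numbers chosen only to show the calculation; they are not results from Tables 1 to 3.

```python
def mse(observed, predicted):
    """Mean square error between two equally long sequences."""
    if len(observed) != len(predicted):
        raise ValueError("sequences must have the same length")
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)


# Toy values for illustration only (not data from this paper).
observed = [12.0, 12.5, 13.0, 12.8]
solar_pred = [12.4, 12.1, 13.4, 12.4]   # every prediction is off by 0.4
lunar_pred = [12.2, 12.3, 13.2, 12.6]   # every prediction is off by 0.2

solar_mse = mse(observed, solar_pred)
lunar_mse = mse(observed, lunar_pred)
ratio = solar_mse / lunar_mse
print(solar_mse, lunar_mse, ratio)
```

A ratio above 1 means the lunar calendar model has the lower error; since the MSE squares the errors, halving every error quarters the MSE.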

Modify the Weight of the Month Model
In this part of the experiment, the empirical value of the monthly model weight was examined. The training period was set to 30 days, which is close to the length of one cycle of lunar phases. In each test, the number of training times was 50. The feature weight vector was [5.5, 0.98, X, 1/365], where X represents the weight of the monthly model. When X = 0, the monthly model was disabled. Table 4 shows that the MSE with a monthly model weight of 0 was three times the MSE with a monthly model weight of 1. This indicates that increasing the monthly model weight can improve the accuracy of the lunar calendar model.

Conclusions
In this paper, to address the practical problems of the environmental monitoring data of the outer burial pits of the Han Dynasty Yang Mausoleum, the ELM is used to establish a temperature prediction model based on the lunar calendar to predict the temperature of the site. Experiments verify that the lunar calendar model is more accurate than the solar calendar model in predicting temperature and that it better expresses the temperature characteristics. The accuracy of the lunar calendar model can be further improved by properly increasing the weight of the monthly model. The lunar calendar model also has good extensibility: it can be used not only to predict temperature but also to predict other environmental data such as precipitation, humidity, soil temperature, and dew point temperature. Furthermore, the temperature prediction studied in this paper is not limited to providing guidance for site protection; it can also be applied to fields such as weather forecasting and environmental forecasting.