1. Introduction
In smart mariculture, aquaculture is inevitably becoming smarter, more accurate, and more ecological. However, under the influence of climate, typhoons, rain, and changes in culture density, the balance of algae and bacteria in an aquaculture environment can easily be destroyed, which weakens the stress tolerance and disease resistance of farmed fish [1, 2, 3, 4]. Furthermore, in traditional mariculture, water quality is judged by breeding workers from experience, and empirical judgment alone often cannot capture the changing trend of water quality in a timely and accurate manner. Precise prediction of water quality parameters can help farmers anticipate future trends in those parameters and adopt countermeasures. Therefore, an accurate prediction method for dynamic changes in water quality factors is needed, one that considers the system dynamics relationships between water quality parameters.
The dynamic changes of water quality parameters involved in different stages of mariculture are extremely complicated. The parameters frequently considered in mariculture include salinity, water temperature, pH, dissolved oxygen, and hydrochemical factors [5]. In this paper, the prediction of pH and water temperature is studied.
Seawater is the main medium for the habitat and material exchange of aquaculture organisms in mariculture, while aquaculture organisms are sensitive to changes in physical and chemical factors of water. Firstly, large changes in dissolved oxygen, temperature, pH, and other water quality factors may directly lead to the death of aquaculture organisms. Even small fluctuations outside the optimal conditions may cause physiological stress on organisms, such as reduced food intake, increased energy consumption, and susceptibility to infectious diseases. Additionally, in mariculture the aquaculture environment is also an artificial ecosystem. Changes in water quality parameters also affect zooplankton, phytoplankton, and various bacteria in the water environment, which may lead to the deterioration of the aquaculture ecological environment, such as an outbreak of red tide algae, bacteria, and parasites.
The above-mentioned changes in water quality will result in a decrease in aquaculture production efficiency. Therefore, if the changing trend in water quality can be predicted, we can take countermeasures in advance through technical means to prevent serious imbalances in the water quality environment. It can be seen that an accurate prediction of water quality can greatly improve the production efficiency of mariculture.
Since water quality data are usually preprocessed before the water quality parameters are predicted, this section reviews related work in two stages.
The first stage is the pretreatment of water quality data and correlation analysis between different water quality parameters. In the field of data restoration, scholars have done a lot of research. Jin et al. [6] designed a new data repairing algorithm based on functional dependency and conditional constraints, which improved the efficiency of data restoration to a certain extent. Zhang et al. [7] constructed an observation matrix based on compressed sensing and a dictionary matrix combined with the prior knowledge to sparsely represent data. The data is then accurately repaired from lossy data, the observation matrix, and data repaired by the dictionary matrix. Xu et al. [8] proposed a fault data identification method based on a smoothing estimation threshold method and a fault data repair method based on statistical correlation analysis, which improved the repairing accuracy. Singh et al. [9] proposed a data recovery algorithm based on an orthogonal matching pursuit algorithm for infinite sensor networks, which significantly reduced the number of iterations needed to repair the original data within a small interval. Jakóbczak et al. [10] presented a probabilistic feature combination method, which uses a set of high-dimensional feature vectors for multi-dimensional data modeling, extrapolation, and interpolation; this method combines the numerical method with the probabilistic method and achieves high accuracy. A great deal of effort has also been made by researchers in the field of data noise reduction. Zhang et al. [11] put forward a data denoising method based on an improved pulse coupled neural network, which is more suitable for high-dimensional data denoising. Pan et al. [12] used wavelet technology to denoise, which not only achieved denoising, but also retained the characteristics of the data itself. Wang et al. [13], Li et al. [14], and Perelli et al. [15] have also used the same method. In our research, because the sampling interval is short and the amount of erroneous data is small, linear interpolation and a smoothing method are used for interpolation and error correction. In addition, since the noise in water quality data is mainly caused by water fluctuation, the moving average method is used to filter noise.
For correlation analysis, the correlation of two random variables can be well measured by Pearson’s correlation coefficient, which is the covariance of the two variables divided by the product of their standard deviations. However, when the amount of data to be analyzed is insufficient, the analysis results are unreliable. Recently, some new methods of correlation analysis have been proposed, such as the spatial cross-correlations method [16]. This method combines spatial distribution information and uses semivariance and experimental variograms to calculate the spatial correlation of water quality. Then, cross-correlations are analyzed using experimental cross-variograms. The water quality in a small lake can be described well using this method.
The second stage is water quality prediction using the pre-processed data. Traditional water quality prediction methods mainly include the Grey Markov chain model method [17, 18], the fuzzy-set theory-based Markov model [19], the regression prediction method [20, 21, 22, 23], the time series method [24, 25], and the water quality model prediction method [26, 27]. The water quality model prediction method has poor self-adaptability, and the other traditional prediction methods also have many shortcomings, such as low prediction accuracy, poor stability, and single-factor prediction that ignores dynamic characteristics. With the development of computational intelligence and bionic technology, many novel prediction methods based on artificial intelligence have emerged. The new water quality prediction methods mainly include the grey theory method [28], the artificial neural network method [29, 30, 31], the least squares support vector regression prediction method [32, 33], and the combination prediction method [34]. These methods provide effective solutions for water quality prediction, but they remain imperfect because mariculture environments are affected by many factors. These factors are mainly reflected in the complex interaction mechanism between water environment parameters, the non-linearity of water quality changes, and the long delay, which lead to low calculation efficiency and poor generalization performance in the mentioned prediction methods.
To address the problems mentioned above, prediction models for the key water quality parameters (pH and water temperature) are trained on an LSTM (long short-term memory) [35] deep neural network, combined with preprocessing and correlation analysis. For correlation analysis, Pearson’s correlation coefficient method is adopted because we have obtained sufficient data and have only one sensor deployment location. LSTM is an improvement on the RNN (recurrent neural network) [36], which is usually used to process time-sequential data [36], i.e., data that reflect how the state of an object or phenomenon changes over time. The trained prediction models are then used to predict pH and water temperature. Finally, the prediction models are evaluated.
Our main contributions can be summarized as follows:
The linear interpolation method and smoothing method are used to fill and correct the data sampled by the sensors, respectively. The moving average filter is used to denoise the data after filling and correcting.
The factors influencing pH and water temperature are analyzed comprehensively. The correlation between water temperature, pH, and other water quality parameters is obtained by Pearson’s correlation coefficient method, and the result can be used to choose the input parameters for model training.
Based on the pre-processed data and the correlation analysis results, a water quality prediction model based on a deep LSTM learning network is trained. Compared with the RNN-based prediction model, the proposed prediction method obtains higher prediction accuracy in less time.
The rest of this paper is organized as follows. Section 2 gives the methods of data acquisition and data analysis, and presents the water quality prediction model based on LSTM. In Section 3, we analyze and discuss the experimental results of pretreatment, and evaluate the accuracy and time complexity of the proposed prediction methods for pH and water temperature. Finally, Section 4 concludes this paper.
3. Results and Discussions
The experimental data were collected from the mariculture cages with the sensor devices, and then transmitted to the data server by means of a wireless bridge for storage. For the short-term predictions, the data were sampled once every 5 min. Water quality data of 610 groups (about 51 h) including temperature, conductivity, chlorophyll, salinity, turbidity, pH, and dissolved oxygen parameters were used as experimental data for model training, and another 100 sets of water quality data (about 8.3 h) were used to verify the prediction effect. In addition, for the long-term predictions, the data sampling interval and data collection quantity are described in Section 3.4.
The experimental environment is as follows: Intel(R) Core(TM) i7-8550 CPU @ 2 GHz processor, 8 GB memory, Windows 10 (64-bit) operating system, Anaconda3 experimental platform, and PyCharm 3.3 IDE (Integrated Development Environment). The neural network model is built with Python 3.6 and the TensorFlow 1.6.0 package. The accuracy and range of the sensors are shown in Table 3. F.S. is the abbreviation for “Full Scale”, NTU is the abbreviation for “Nephelometric Turbidity Unit”, and PSU is the abbreviation for “Practical Salinity Unit”.
3.1. Experiments and Analysis of Data Preprocessing
A comparison with spline interpolation, nearest-neighbor interpolation, and cubic interpolation showed that linear interpolation performs similarly to nearest-neighbor interpolation and is superior to spline interpolation and cubic interpolation. Therefore, in this experiment, we used the improved method mentioned in Section 2.2.1 for data filling.
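To illustrate the basic idea of gap filling by linear interpolation, a minimal sketch follows. It implements plain linear interpolation between the nearest valid neighbors, not the improved method of Section 2.2.1; the function name `fill_gaps_linear`, the use of `None` to mark missing samples, and the sample values are assumptions made for illustration.

```python
def fill_gaps_linear(values):
    """Fill None entries by linear interpolation between valid neighbors.

    Gaps at either end are filled with the nearest valid value, because
    there is no second neighbor to interpolate towards.
    """
    filled = list(values)
    known = [i for i, v in enumerate(filled) if v is not None]
    if not known:
        raise ValueError("at least one valid sample is required")
    # Leading and trailing gaps: hold the nearest valid value.
    for i in range(known[0]):
        filled[i] = filled[known[0]]
    for i in range(known[-1] + 1, len(filled)):
        filled[i] = filled[known[-1]]
    # Interior gaps: interpolate on the sample index.
    for left, right in zip(known, known[1:]):
        span = right - left
        for i in range(left + 1, right):
            fraction = (i - left) / span
            filled[i] = filled[left] + fraction * (filled[right] - filled[left])
    return filled


# Hypothetical temperature series (deg C) with two missing samples.
series = [26.0, None, None, 26.3, 26.4]
repaired = fill_gaps_linear(series)
```

Interpolating on the sample index is equivalent to interpolating on time only when the sampling interval is constant, which is the case assumed here.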
In the process of data mending, taking water temperature data collected at depth of 3.26 m as an example, in order to determine the optimal value of
and
in (1), the relative error between the original data and filling data obtained by the linear interpolation method is calculated when
and
are both positive integers in interval
. The variation is shown in
Figure 9.
As shown in
Figure 9, in the deep orange area—i.e.,
,
, or
—the relative errors between the original data and filling data are nearly 0, while, in the red area, blue area, and the area near them—i.e.,
—the relative errors are close to 0.04. Furthermore, as can be seen from
Figure 9, the relative errors can be minimized when
.
In terms of data correction, since water quality data have a time correlation,
and
in Equation (2) can be determined using the relative difference between two adjacent historical water quality data as a constraint of the current relative difference. Take the average value of the relative difference between two adjacent data of the previous day as the value of
and
, i.e.,
, the relative differences of pH and water temperature before and after data correction are shown in
Figure 10.
As shown in Figure 10, the red and blue dots overlap, and the relative differences of temperature and pH before and after data correction do not differ greatly, which indicates that the collected data contain relatively few erroneous values.
In the process of data denoising, Equation (3) is used in the experiment. The window size was set to 4, and the water temperature and pH data were smoothed and denoised. Comparisons of water quality data before and after denoising are shown in Figure 11.
From Figure 11, it can be seen that the moving average filter can effectively reduce the data noise, restore the original data affected by waves and transmission, and smooth the water quality parameter curve.
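A moving average filter with a window of 4 can be sketched as follows. Equation (3) is not reproduced in this section, so the exact window alignment is an assumption: this sketch uses a trailing window that shrinks at the start of the series. The function name `moving_average` and the sample values are illustrative.

```python
def moving_average(values, window=4):
    """Trailing moving average; the window shrinks for the first samples."""
    if window < 1:
        raise ValueError("window must be at least 1")
    smoothed = []
    running_sum = 0.0
    for i, value in enumerate(values):
        running_sum += value
        if i >= window:
            # Drop the sample that has just left the window.
            running_sum -= values[i - window]
        count = min(i + 1, window)
        smoothed.append(running_sum / count)
    return smoothed


# Hypothetical noisy pH readings, for illustration only.
noisy_ph = [8.10, 8.16, 8.08, 8.14, 8.11, 8.17, 8.09]
denoised_ph = moving_average(noisy_ph, window=4)
```

A larger window smooths more strongly but also delays and flattens genuine changes, so the window size trades noise suppression against responsiveness.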
3.2. Experiments and Analysis of Water Temperature Short-Term Prediction
3.2.1. The Prediction of Water Temperature
Two prediction models, one based on LSTM and one based on RNN, were used to predict the future trend of water temperature. One hundred values were predicted, and the comparison between predicted values and real values is shown in Figure 12.
The water temperature values predicted by the two models do not completely match the real values, but the values predicted by the LSTM-based model are closer to them. The values predicted by the RNN-based model fluctuate greatly, and their errors relative to the real values are also large.
Table 4 shows the relative deviations between the predicted values and the real values of the two models. The unit of the deviation in LSTM or RNN is degrees Celsius. To facilitate typesetting, the 100 deviation values between the predicted data and the real data are divided into four columns from left to right, with each column showing 25 values.
In Table 4, the relative deviations between the predicted values and the real values using the LSTM-based model are mostly less than 1 °C, with an average of 1.03 °C, while the deviations using the RNN-based model are mostly more than 1 °C, with an average of 1.37 °C. As a result, the LSTM-based model predicts water temperature more accurately.
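The averages quoted from Table 4 can be understood as a mean of per-sample deviations. The sketch below assumes that each deviation is the absolute difference between the predicted and real temperature in °C; the paper's exact definition is not given in this section. The function name and the sample values are hypothetical.

```python
def mean_absolute_deviation(predicted, actual):
    """Average of |predicted - actual| over all samples."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(p - a) for p, a in zip(predicted, actual))
    return total / len(predicted)


# Hypothetical predicted and measured temperatures (deg C).
predicted_temp = [27.0, 28.0, 26.5, 27.5]
measured_temp = [26.0, 28.5, 26.5, 27.0]
average_deviation = mean_absolute_deviation(predicted_temp, measured_temp)
```

Because the mean is sensitive to a few large errors, it can exceed 1 °C even when most individual deviations are below 1 °C.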
3.2.2. Time Complexity Analysis
The duration of each training iteration and the total time of 10,000 training iterations were recorded for the two neural networks. Figure 13 shows a comparison of the time spent on 10,000 training iterations by the two methods.
From Figure 13, it can be seen that the training time of the LSTM-based prediction model is shorter and more stable, while the training time of the RNN-based model is longer and increases sharply between the 7000th and 9000th iterations. The average training time of the LSTM neural network is 0.257 s, and the total time of 10,000 training iterations is 2567.06 s; the average training time of the RNN is 0.259 s, and its total training time is 2591.95 s. Therefore, the training time of the LSTM-based prediction model is shorter than that of the RNN-based model. In other words, the LSTM-based prediction model is constructed more efficiently.
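Per-iteration timings of this kind can be recorded with a simple wall-clock harness. The sketch below is an assumption about how such measurements might be taken, not the instrumentation used in the experiments; `dummy_train_step` is a placeholder standing in for one real training iteration.

```python
import time


def time_training_steps(train_step, n_steps):
    """Run train_step n_steps times and return each duration in seconds."""
    durations = []
    for _ in range(n_steps):
        start = time.perf_counter()
        train_step()
        durations.append(time.perf_counter() - start)
    return durations


def dummy_train_step():
    # Placeholder workload; a real step would run one optimizer update.
    sum(i * i for i in range(1000))


durations = time_training_steps(dummy_train_step, 100)
total_time = sum(durations)
average_time = total_time / len(durations)
```

The average is the total divided by the number of iterations, which matches the figures above: 2567.06 s over 10,000 iterations gives about 0.257 s per iteration.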
3.3. Experiments and Analysis of pH Short-Term Prediction
3.3.1. Prediction of pH Values
The future pH data are predicted using the two trained models. One hundred values are predicted, and the comparison between the predicted values and real values is shown in Figure 14.
Figure 14 shows the comparison with the scale of the vertical axis enlarged. In fact, the relative errors are no more than 5%, and the future trend of pH can be judged from Figure 14. The predictions based on the LSTM deep network are closer to the real values.
Table 5 shows the relative deviations between the predicted values and the real values under the two models. To facilitate typesetting, the 100 deviation values between the predicted data and the real data are divided into four columns from left to right, with each column showing 25 values.
As shown in Table 5, the average relative deviation between the predicted values and the real values using the RNN-based prediction model is 1.579, while that of the LSTM-based model is 1.439. Therefore, the predicted results of the LSTM-based model are closer to the real values.
3.3.2. Time Complexity Analysis
The experiments recorded the time of each training iteration and the total time of 10,000 training iterations on the pH data for the two prediction models. Figure 15 shows the comparison of the 10,000 training times between the two methods.
As can be seen from Figure 15, the training time of the LSTM network is shorter than that of the RNN, and its variation is also smaller, so training with the LSTM network is more stable. The average training time of the RNN is 0.298 s, and the total time of 10,000 training iterations is 2968.568 s, while the average training time of the LSTM network is 0.273 s, and its total training time is 2734.118 s. Therefore, the LSTM network takes less time and is more efficient in constructing the pH prediction model.
3.4. Long-Term Prediction of Water Temperature and pH
To further verify the practicability and robustness of the prediction model, a longer training data set was collected for model training. Then, we used the trained model to predict the next 83 h (about 3.5 days) of water quality data. The data were sampled once every 1 min. A total of 30,000 groups (about 21 days) of data were collected for training. An additional 5000 sets of data (83 h in total) were used for comparison.
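The durations quoted for the long-term and short-term data sets follow directly from the sample counts and sampling intervals stated in the text. The short check below reproduces that arithmetic; the helper name `duration_hours` is illustrative.

```python
def duration_hours(n_samples, interval_minutes):
    """Time span covered by n_samples taken every interval_minutes."""
    return n_samples * interval_minutes / 60.0


# Short-term experiment: one sample every 5 min.
short_train_hours = duration_hours(610, 5)        # about 51 h
short_test_hours = duration_hours(100, 5)         # about 8.3 h

# Long-term experiment: one sample every 1 min.
long_train_days = duration_hours(30000, 1) / 24   # about 21 days
long_test_hours = duration_hours(5000, 1)         # about 83 h
```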
The experiment was carried out under the same conditions as the short-term prediction, and the number of training iterations was 500 and 1000, respectively. The results of the three evaluation indicators obtained during each training process are shown in Table 6.
For the water temperature prediction, the comparison of RMSE between the LSTM-based prediction model and the RNN-based prediction model is shown in Figure 16. According to Equation (5), the closer the RMSE is to 0, the smaller the prediction error of the water quality parameters. For water temperature, the unit of the RMSE is degrees Celsius (°C).
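For reference, RMSE can be computed as in the minimal sketch below. Equation (5) is not reproduced in this section, so the standard definition (the square root of the mean squared difference between predicted and real values) is assumed; the function name and sample values are illustrative.

```python
import math


def rmse(predicted, actual):
    """Root mean square error between two equal-length sequences."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical predicted and measured temperatures (deg C).
predicted_temp = [26.4, 26.9, 27.3]
measured_temp = [26.5, 26.7, 27.6]
temperature_rmse = rmse(predicted_temp, measured_temp)
```

The RMSE carries the same unit as the predicted quantity, and it is 0 only when every prediction equals the real value.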
For the pH prediction, the comparison of RMSE between the LSTM-based prediction model and the RNN-based prediction model is shown in Figure 17.
The trained model was used to predict the water temperature and pH values for a total of 5000 sets of data. The comparison of the long-term prediction effect between the proposed scheme and the RNN is shown in Figure 18 and Figure 19. Predicting the 5000 values with the trained model takes 66 s in total, i.e., an average of 13.2 ms per value.
Different kinds of fish have different tolerances to water quality parameters. The saddle-spotted grouper (Epinephelus lanceolatus), for example, generally has a pH tolerance range of 7.5–9.2, and the range suitable for growth is 7.9–8.4. The water temperature suitable for growth is 22–30 °C, with a minimum tolerance of 15 °C and a maximum tolerance of 35 °C. Because cultured fish are sensitive to changes in key water quality parameters, water quality prediction allows countermeasures to be taken in advance to keep the parameters within the tolerance range.
From Figure 18 and Figure 19, we can see some spikes. Since these spikes do not last very long, we treat them as abnormal predicted data that require no intervention. However, if such spikes last longer (i.e., more than 15 min) and are outside the tolerance threshold, the farmer needs to pay close attention and take countermeasures in advance.
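The rule described above, ignoring short spikes but flagging excursions that persist for more than 15 min outside the tolerance range, can be sketched as follows. The sketch assumes one predicted value per minute, so that 15 min corresponds to 15 consecutive samples; the function name, parameter names, and sample series are hypothetical.

```python
def sustained_excursions(values, low, high, max_spike_samples=15):
    """Return (start, end) index pairs of out-of-range runs that last
    longer than max_spike_samples consecutive samples."""
    runs = []
    start = None
    for i, value in enumerate(values):
        outside = value < low or value > high
        if outside and start is None:
            start = i                      # a new excursion begins
        elif not outside and start is not None:
            if i - start > max_spike_samples:
                runs.append((start, i - 1))
            start = None                   # the excursion has ended
    # Handle an excursion that is still ongoing at the end of the series.
    if start is not None and len(values) - start > max_spike_samples:
        runs.append((start, len(values) - 1))
    return runs


# Hypothetical predicted pH series: a 3-sample spike, then a 16-sample excursion.
predicted_ph = [8.0] * 5 + [9.5] * 3 + [8.0] * 5 + [9.5] * 16 + [8.0] * 2
alerts = sustained_excursions(predicted_ph, low=7.5, high=9.2)
```

With the pH tolerance range of 7.5–9.2 quoted above, only the second excursion in this series is flagged, because the first lasts just three samples.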
3.5. Discussions
From the above experimental analysis, the proposed scheme can achieve better results in long-term and short-term prediction. Using the proposed scheme, the short-term prediction accuracy can reach 98.56% and 98.97% for pH and water temperature, respectively, while the long-term prediction accuracy can reach 95.76% and 96.88% for pH and water temperature, respectively. In addition, the average prediction time for short-term predictions is 12.5 ms, and the average time for long-term predictions is 13.2 ms. Therefore, based on the trained model, the proposed scheme can realize fast and accurate predictions.
However, the proposed scheme still incurs a considerable computational cost in data set processing. Moreover, compared with the real data, the overall prediction results fluctuate strongly. In the future, we will focus on optimizing the deep learning network structure so that, while maintaining prediction accuracy, the computational complexity of model training is reduced. Meanwhile, in order to make the water quality prediction model more robust and practical, the deep neural network structure will incorporate more relevant prior knowledge (such as precipitation and climate factors) for prediction. In addition, the proposed method also has some limitations in data preprocessing. According to the SVD (Singular Value Decomposition) theory [40, 41], the original signal contributes little to the tail singular values, and the signal energy is mainly concentrated in the first several singular values, while the tail singular values are mainly determined by noise. In future work, we will develop more effective noise reduction methods based on this conclusion. Meanwhile, for some of the meaningless spikes that appear in Figure 18 and Figure 19, we will consider how to conduct reasonable and safe post-processing in our future work to make the prediction curve smoother.