Developing an Ensembled Machine Learning Prediction Model for Marine Fish and Aquaculture Production

Abstract: The fishing industry is identified as a strategic sector for raising domestic protein production and supply in Malaysia. Global changes in climatic variables have impacted and continue to impact marine fish and aquaculture production, yet machine learning (ML) methods have not been extensively used to study aquatic systems in Malaysia. ML-based algorithms can be paired with feature importance (i.e., identifying the features that have the most predictive power) to achieve better prediction accuracy and provide new insights into fish production. This research aims to develop an ML-based prediction of marine fish and aquaculture production. Based on the feature importance scores, we select the group of climatic variables for three different ML models: linear, gradient boosting, and random forest regression. Climatic variables and fish production data from the past 20 years (2000–2019) were used to train and test the ML models. Finally, an ensemble approach named voting regression combines those three ML models. Performance metrics are generated, and the results show that the ensembled ML model obtains R² values of 0.75, 0.81, and 0.55 for marine water, freshwater, and brackish water, respectively, which outperforms the single ML models in predicting all three types of fish production (in tons) in Malaysia.


Introduction
Fish is a vital source of animal protein and micronutrients, and it plays a significant part in meeting the food security needs of Malaysia, where per capita fish consumption is currently at least 56.5 kg per year [1,2]. However, fish stocks are declining because of increasing demand and consumption [3]. Global per capita fish consumption is expected to increase from 20.5 kg in 2018 to 21.5 kg by 2030 [4]. In addition, demand for aquaculture is increasing along with the growing global population, and total worldwide production is projected to rise by 62% by 2030 [5].
Malaysia is the highest per capita seafood consumer in Southeast Asia; fish is the primary source of animal protein and accounts for 60–70% of total national protein intake [6]. The Malaysian Adult Nutrition Survey (MANS) also reports that the percentage of people eating fish at least once a day among the rural and urban populations of Malaysia is 51.3% and 33.6%, respectively [7]. Consequently, the area for fish landings off the coasts of Malaysia has expanded to about 548,800 km² within Malaysia's Exclusive Economic Zone (EEZ), which was enforced in 1981 [8]. The EEZ was implemented in Malaysia to support national development by decreasing unemployment via the export of fish. The declaration of the EEZ also helped to make fish the most important source of food for Malaysians [9].
[...] quality prediction, hydrometeorological forecasting for agricultural decision support, etc., around the world [28–30]. Although several studies have estimated fish landings from ecological variables [31,32], as far as we are aware, no one has applied ML-based methods to predict and analyze fish landings in the coastal areas of Malaysia. Linear regression (LR) is commonly used to predict time-series datasets. If the variables in the dataset are strongly correlated, LR-based ML models produce accurate predictions. If the data are only weakly correlated, however, different ML models are needed for accurate predictions. Moreover, no single ML model performs best across different time-series prediction tasks. In this case, a good choice is to use ensemble-based ML models to enhance prediction accuracy [33,34], since this approach can produce better predictive performance than a single ML model.
In this research, the climatic data of five major states of Malaysia are used to measure feature importance for predicting three different types of fish production, i.e., marine landing, brackish water aquaculture, and freshwater aquaculture. The states included in the study are Kedah (KD), Pahang (PH), Perak (PR), Selangor (SL), and Terengganu (TG). Three ML approaches, linear regression (LR), random forest (RF), and gradient boosting (GB), are applied to predict fish production. Finally, we apply an ensemble technique, voting regression (VR), consisting of these three ML models to demonstrate the accuracy of our approach in terms of quality metrics.
This research presents a novel approach to predicting fish production in Malaysia's five major states. The article is organized as follows: Section 2 presents the Materials and Methods, which consist of variable selection, study area, data source, ML regression method, and error quality metric identification. Section 3 details the Results based on the correlation matrix, feature importance identification, trend line analysis, and ML-based prediction. Section 4 analyzes the generated outputs of this research and Section 5 concludes the paper.

Variable Selection
We have selected climatic variables through an extensive literature review. Several climatic variables are recorded as influencing factors in the fisheries sector, including temperature, rainfall, salinity, sea level, etc. [35–38]. The selection of indicators for regional analysis is subject to constraints, assumptions, and data availability. Previous studies demonstrated that rainfall and temperature impact fish landings in the focus country [39]. Similarly, sea surface temperature (SST) has been considered an essential indicator of the coastal upwelling events that influence fish production in the region [40]. Prior studies showed that relative humidity is a significant climatic factor in fisheries studies due to its indirect impact on some environmental stressors [41–43]. Therefore, we have collected data on maximum and minimum air temperature, sea surface temperature, rainfall, rainfall duration, and humidity for building models using the different ML approaches. In addition, we considered fish production data for marine landings, brackish water aquaculture, and freshwater aquaculture during data compilation.

Study Area
Our study site covered five major states of Malaysia: Selangor, Terengganu, Kedah, Pahang, and Perak. Among these states, Pahang has the largest land area while Selangor has the smallest. All these states border a long coastal area, which helps them to produce more marine fish over the year. We have collected data covering the period from 2000 to 2019 from various organizations in Malaysia that work with marine water, brackish water, and freshwater fish production. Different states show different fish production quantities and climatic variable statistics, which are shown in Figure 1.
From Figure 1 we can see that the highest temperature was 33.9 °C in Perak, whereas the lowest temperature was 22.2 °C in Terengganu over these 20 years of data. However, the average SST was about 29 °C for almost all the states. According to Figure 1, Kedah and Terengganu have a slightly higher average rainfall per day (~16 mm), whereas in Pahang and Perak the value is moderate (~13 mm), and in Selangor the value is the lowest, at 11.88 mm per day. Kedah recorded the highest humidity at 84.27%, whereas Selangor recorded the lowest at 79.54%.
Among these five major states, Perak has the highest fish landings in marine (~270K tons), fresh (~35K tons), and brackish water (~40K tons). Although Selangor has relatively high brackish water (~17K tons) and freshwater (~13K tons) fish landings, its marine water fish landings are moderate (~118K tons) compared to the other states. Terengganu has the lowest fish landings among all states.

Data Source
We have compiled a data set of annual fish production data along with statistics for various climatic variables. We have collected the "Rainfall, and humidity"  46]. After collecting all the statistical and environmental data, we combined them into one dataset for the five major states of Malaysia. We mapped the individual states to numbers for ML model implementation. First, we used ordinal encoding to label the states: Perak as 0, Selangor as 1, Terengganu as 2, Pahang as 3, and Kedah as 4. Then we converted these numbers using one-hot encoding to remove bias from the data. We analyzed this combined, processed dataset from 2000 to 2019 using different ML-based regressions. Additionally, we used this combined dataset to generate correlation matrices, feature importance scores, and trend line analyses.
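The two-step state encoding described above can be sketched as follows. This is a minimal illustration using pandas rather than the authors' own code; the column names `state` and `state_id` and the sample rows are assumptions made for the example, while the integer labels follow the mapping given in the text.

```python
import pandas as pd

# Ordinal labels as listed in the text.
STATE_TO_ID = {"Perak": 0, "Selangor": 1, "Terengganu": 2, "Pahang": 3, "Kedah": 4}

# Hypothetical rows; the real dataset would hold one row per state and year.
df = pd.DataFrame({"state": ["Perak", "Kedah", "Selangor", "Perak"]})

# Step 1: ordinal encoding (state name -> integer label).
df["state_id"] = df["state"].map(STATE_TO_ID)

# Step 2: one-hot encoding, so that a model cannot read the integers as an ordering.
one_hot = pd.get_dummies(df["state_id"], prefix="state").astype(int)
encoded = pd.concat([df, one_hot], axis=1)
print(encoded)
```

Each row ends up with exactly one indicator column set to 1, which is presumably what is meant by removing the bias of the ordinal labels.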

ML-Based Regression
We have applied different ML models, namely linear regression (LR), gradient boosting regression (GB), and random forest regression (RF), based on the feature importance scores. Tree-based regression techniques help handle data from various measurement scales. These algorithms are fairly robust to outliers and missing values, and they simplify building rules for predictions about individual cases and complex relationships [34]. Because the dataset contains data of different dimensions, we chose the RF and GB algorithms. We used the first 17 years of data (2000–2016) for model training and the last 3 years (2017–2019) for testing the accuracy of these models. We optimized the hyperparameters of the ML models based on the cross-validation score. The hyperparameters adjusted for the RF and GB algorithms were the number of trees, the minimum number of samples required to split an internal node, the minimum number of samples required at a leaf node, and, most importantly, the maximum depth of each tree. We set the maximum depth of each tree to 7 to reduce overfitting [38]. We observed that different ML approaches were suitable for different datasets. Therefore, we combined the three ML models in the ensembled VR process, with a different weight assigned to each model, as shown in Table 1. We tried different combinations of weights and compared the cross-validation scores of the resulting VR models. We then chose the weights that gave the highest mean cross-validation score, with the weights always summing to 1.
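The chronological split and the weight selection for the voting regressor could look roughly like the sketch below, which uses scikit-learn's `VotingRegressor`. The data are synthetic stand-ins, and the candidate weight tuples, the number of trees, and the 3-fold cross-validation are assumptions; only the 2000–2016 / 2017–2019 split, the maximum depth of 7, and the constraint that weights sum to 1 come from the text.

```python
import numpy as np
from sklearn.ensemble import (GradientBoostingRegressor, RandomForestRegressor,
                              VotingRegressor)
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in data: 5 states x 20 years, 3 made-up features.
years = np.repeat(np.arange(2000, 2020), 5)
X = rng.normal(size=(years.size, 3))
y = 50.0 + 10.0 * X[:, 0] + 5.0 * X[:, 1] ** 2 + rng.normal(size=years.size)

# Chronological split: 2000-2016 for training, 2017-2019 for testing.
train, test = years <= 2016, years >= 2017
X_train, y_train, X_test, y_test = X[train], y[train], X[test], y[test]


def make_vr(weights):
    """Build a weighted voting regressor over LR, GB and RF."""
    return VotingRegressor(
        estimators=[
            ("lr", LinearRegression()),
            ("gb", GradientBoostingRegressor(n_estimators=50, max_depth=7, random_state=0)),
            ("rf", RandomForestRegressor(n_estimators=50, max_depth=7, random_state=0)),
        ],
        weights=list(weights),
    )


# Hypothetical candidate weights (LR, GB, RF); each tuple sums to 1.
candidates = [
    (1 / 3, 1 / 3, 1 / 3),
    (0.2, 0.4, 0.4),
    (0.1, 0.5, 0.4),
    (0.1, 0.4, 0.5),
    (0.5, 0.25, 0.25),
]

# Mean cross-validated R^2 on the training years for each weight combination.
scores = {
    w: cross_val_score(make_vr(w), X_train, y_train, cv=3, scoring="r2").mean()
    for w in candidates
}
best_weights = max(scores, key=scores.get)

# Refit the best ensemble on all training years and predict the test years.
vr = make_vr(best_weights).fit(X_train, y_train)
predictions = vr.predict(X_test)
print(best_weights, round(scores[best_weights], 3))
```

The paper does not state which cross-validation scheme was used; for time-ordered data a forward-chaining splitter such as scikit-learn's `TimeSeriesSplit` may be a more conservative choice than plain k-fold.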

Error Quality Matrices
We have used Python scikit-learn for the ML implementation and measured different error quality metrics to determine the accuracy of the prediction [31]. We used five error quality metrics to assess the models quantitatively: the coefficient of determination (R²), root mean square error (RMSE), mean absolute error (MAE), mean absolute percentage error (MAPE), and percentage of bias (PBIAS). These metrics were calculated from the measured and predicted fish landing (FL) values and their respective means. The smaller the MAPE value, the higher the efficiency of the model. The most common performance measure is the R² value, which usually lies between 0 and 1 [40]. We have compared the predicted fish production values with the observed values to measure the efficiency of the ML model output.
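For reference, the five metrics can be computed as in the sketch below. The formulas are the commonly used definitions and are an assumption here, since the paper's own equations are not reproduced in this text; in particular, MAPE is returned as a fraction, PBIAS as a percentage, and the PBIAS sign is chosen so that positive values mean underestimation, matching the interpretation given in the Discussion. The function name `error_metrics` and the sample values are illustrative.

```python
import numpy as np


def error_metrics(measured, predicted):
    """Return R2, RMSE, MAE, MAPE and PBIAS for two equally long sequences."""
    measured = np.asarray(measured, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    residual = measured - predicted
    ss_res = np.sum(residual ** 2)                       # residual sum of squares
    ss_tot = np.sum((measured - measured.mean()) ** 2)   # total sum of squares
    return {
        "R2": 1.0 - ss_res / ss_tot,
        "RMSE": np.sqrt(np.mean(residual ** 2)),
        "MAE": np.mean(np.abs(residual)),
        "MAPE": np.mean(np.abs(residual / measured)),        # fraction, not percent
        "PBIAS": 100.0 * residual.sum() / measured.sum(),    # > 0: underestimation
    }


# Made-up fish landing values (tons).
metrics = error_metrics([100.0, 200.0, 300.0], [110.0, 190.0, 310.0])
print(metrics)
```

With this definition R² becomes negative whenever a model fits worse than simply predicting the mean of the measured values, which is how negative R² values such as those reported for brackish water can arise.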

Correlation Matrix
A correlation matrix is needed to describe the relationships between the climatic variables and locations, which is illustrated in Table 2. [47]. According to the PCI matrix, a negative value indicates an inverse relation between two parameters: when one parameter decreases, the other increases. For example, in Table 2, a moderate positive correlation is observed between rainfall and rainfall duration (r = 0.61; p < 0.01). This relation implies that an increase in rainfall duration is associated with an increase in the total amount of rainfall. Additionally, rainfall and rainfall duration are positively related with states, which means that state-wise rainfall and rainfall duration have a weak impact on fish landings. The correlation data in Table 2 also show that state-wise maximum temperature has a weak positive correlation with the state (r = 0.44; p < 0.01), and that humidity has a weak positive correlation with rainfall (r = 0.41; p < 0.01). On the other hand, humidity has a weak negative correlation with maximum temperature (r = −0.49; p < 0.01) and minimum temperature (r = −0.47; p < 0.01). This negative correlation implies that if the humidity of a state increases, then its maximum or minimum temperature could decrease. Generally, SST rises due to the absorption of more heat by the sea, which changes the ocean circulation patterns [48]. Moreover, the duration of rainfall, temperature variability, and humidity may affect fish landings [49]. As no significant correlation is observed among the climatic variables and location, feature importance parameters have been taken into consideration for further investigation of the fish landing datasets for the five major states of Malaysia.
Here, correlation analysis has also been performed between the three types of water and the climatic variables, as shown in Table 3.
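A correlation matrix with accompanying p-values, of the kind summarized in Tables 2 and 3, might be produced along the following lines. The sketch assumes Pearson correlation and uses synthetic data with made-up column names; it is not the authors' dataset or code.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

rng = np.random.default_rng(1)
n = 100

# Synthetic stand-in variables: rainfall is constructed to depend on duration.
duration = rng.uniform(1, 10, n)
df = pd.DataFrame({
    "rainfall_duration": duration,
    "rainfall": 2.0 * duration + rng.normal(scale=4.0, size=n),
    "max_temp": rng.normal(32.0, 1.0, n),
})

# Full pairwise Pearson correlation matrix.
corr = df.corr(method="pearson")

# Coefficient and two-sided p-value for a single pair of variables.
r, p = pearsonr(df["rainfall"], df["rainfall_duration"])
print(corr.round(2))
print(f"r = {r:.2f}, p = {p:.3g}")
```

`DataFrame.corr` gives only the coefficients, so the p-values quoted alongside each r have to be computed pair by pair, for instance with `scipy.stats.pearsonr` as shown.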

Feature Importance Analysis
According to the correlation table, only limited insight can be gained into which climatic variables affect the catch of different fish types. Therefore, feature importance analysis has been included to select the inputs of the various ML models. Dimension reduction through the selection of climatic variables is necessary to improve the efficiency and effectiveness of any predictive model. Figure 2 shows the feature importance graph of the climatic variables considered for marine water, brackish water, and freshwater fish production. The ML-based feature importance model gives an importance score for each variable, where a larger score implies that the variable is more important [50]. The figure shows that, no matter what type of fish we considered, the location always plays the most important role. Therefore, the ML models also included the state as one of the major predictor variables for fish production in Malaysia. Moreover, Figure 2a,b shows that, for marine water, rainfall, rainfall duration, and minimum temperature are less significant. Therefore, these climatic variables were not used in the ML models. Figure 2c, in turn, shows that for freshwater fish landings, rainfall, rainfall duration, and SST are less significant. Figure 2 also shows that humidity is negatively scored in the feature importance analysis for all three categories of fish production, which is consistent with the correlation table (Table 2). However, SST is identified as an important feature for both marine water and brackish water, and it should be taken into consideration during prediction modeling.
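The text does not say which feature importance method was used. One option whose scores can be negative, as reported for humidity, is permutation importance; the sketch below shows it with scikit-learn on synthetic data. The feature names, the coefficients used to generate the target, and the choice to keep the top three features are all assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(2)

# Synthetic stand-in features; the target depends mostly on the first two.
names = ["state", "sst", "max_temp", "rainfall"]
X = rng.normal(size=(200, 4))
y = 5.0 * X[:, 0] + 2.0 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.5, size=200)

model = RandomForestRegressor(n_estimators=100, max_depth=7, random_state=0).fit(X, y)

# Score drop when each feature is shuffled; larger means more important.
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)

ranking = sorted(zip(names, result.importances_mean), key=lambda kv: kv[1], reverse=True)
selected = [name for name, _ in ranking[:3]]
for name, score in ranking:
    print(f"{name:10s} {score: .3f}")
```

Features near the bottom of such a ranking would be the candidates for exclusion from the models, in the way rainfall and rainfall duration are dropped in the text.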

Trend Line Analysis
Several climatic variables, such as maximum and minimum air temperature, SST, and humidity, have been considered for building the different models. Using the LR model, Figure 3 shows the trend lines of maximum and minimum air temperatures in Malaysia's five major states from 2000 to 2019. We can observe an upward trend in the maximum temperature, as shown in Figure 3, for all states except Pahang, which shows a downward trend. In addition, we observed a sharp fall in temperature in 2017 in all states except Pahang, where the temperature fell rapidly in 2018. The temperature started to rise again in 2019. A slightly increasing trend line was also observed in the minimum temperature over the period (2000 to 2019). We can see a sharp fall in minimum temperature in Pahang from 2017 to 2018.
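A linear trend line of the kind described here can be fitted with ordinary least squares on the year. The sketch below uses `numpy.polyfit` on an artificial temperature series for a single state; the series and the variable names are assumptions, not values from Figure 3.

```python
import numpy as np

years = np.arange(2000, 2020)
t = years - 2000  # centre the time axis for a well-conditioned fit

# Artificial maximum-temperature series (deg C): slow warming plus a wiggle.
max_temp = 32.0 + 0.03 * t + 0.1 * np.sin(years)

# Degree-1 least-squares fit: slope in deg C per year, intercept at year 2000.
slope, intercept = np.polyfit(t, max_temp, deg=1)
trend = slope * t + intercept

direction = "upward" if slope > 0 else "downward"
print(f"slope = {slope:.4f} deg C/year ({direction} trend)")
```

The sign of the fitted slope is what distinguishes an upward from a downward trend line; a sharp drop in a single year shows up as a residual from the line rather than a change in the trend itself.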

ML-Based Prediction
Four ML-based prediction models, LR, GB, RF, and VR, are implemented to predict marine, fresh, and brackish water fish landings (in tons). The comparison of the four ML algorithms is shown in Figure 5. The VR ensembled ML model shows a better result than the individual ML models. In Figure 5, the X-axis shows the output of all five states for 2018 and 2019. The Y-axis shows the comparison of the predicted values based on the four ML models. According to Figure 5, the VR model output is closest to the observed data. Additionally, this figure indicates that linear regression has a high bias, whereas RF and GB give comparatively improved prediction results with low bias. Moreover, for the 2019 freshwater data, LR produced negative values, which indicates that the LR model has low prediction accuracy.

Discussion
By comparing all the figures above, we can see that a single ML model does not give the best prediction every time. Therefore, we have applied the VR technique to ensemble the three different ML models. The VR model averages the predictions of multiple ML models with customized weight values to improve accuracy. Table 4 presents a comparative study of the four regression models used in this research, with the error quality metrics for marine water, brackish water, and freshwater fish production prediction. The prediction performance of the ML models is measured in terms of four error objective functions, R², MAPE, MAE, and PBIAS, as we are dealing with time-series data.
MAPE indicates how large the prediction error is compared to the measured value in the series. MAPE is also used to compare the precision of the same or different methods on two different series, and it measures the accuracy of the model's estimates as the average absolute percentage error [41]. The smaller the MAPE value, the higher the efficiency of the model. From Table 4, the MAPE values of LR, GB, RF, and VR are 0.41, 0.38, 0.41, and 0.38, respectively, for marine water fish. These values indicate that the GB and VR models are the most efficient among these models. For freshwater and brackish water fish landings, on the other hand, the MAPE values are quite high compared to marine water fish landings. However, MAPE results can be skewed if there are zero or close-to-zero values in the dataset, which is a disadvantage of this error function [42].
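The sensitivity of MAPE to near-zero measured values can be seen in a small numeric example. The numbers below are invented; both series have the same absolute error of 10 on every point, and only the size of the last measured value differs. MAPE is computed as a fraction, which is an assumption consistent with the magnitudes quoted from Table 4.

```python
def mape(measured, predicted):
    """Mean absolute percentage error, returned as a fraction."""
    return sum(abs((m - p) / m) for m, p in zip(measured, predicted)) / len(measured)


# Same absolute error (10) on every point in both series.
regular = mape([100.0, 200.0, 400.0], [110.0, 190.0, 390.0])
near_zero = mape([100.0, 200.0, 0.5], [110.0, 190.0, 10.5])

print(f"regular series:   {regular:.3f}")
print(f"near-zero series: {near_zero:.3f}")
```

A single measured value close to zero dominates the average, which may partly explain why MAPE is much higher for the smaller freshwater and brackish water landings than for marine landings.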
Generally, the optimal value of the MAE error index is zero, and the smaller the value, the more accurate the simulation [42]. From Table 4, the MAE value of the VR model is the smallest at 42,160.472, which indicates that the ensembled VR model is the most accurate of these ML models for predicting marine water fish landings. Similarly, for brackish water fish landings, the MAE value of the VR model is the smallest at 5412.247, which also shows the best accuracy among the predictive models. However, for freshwater fish landings, the GB model gave the best accuracy, with the smallest MAE value of 5354.083. PBIAS, on the other hand, measures the average tendency of the simulated values to be larger or smaller than the observed ones. Positive values indicate model underestimation bias, and negative values indicate model overestimation bias [51]. From Table 4, the best PBIAS values for both marine water and freshwater fish landings are obtained from the LR model, at 0.076 and 0.046, respectively. However, for brackish water fish landings, negative values are observed.
Most importantly, researchers measure ML model performance in terms of the R² value. In the case of marine fish, the R² values of LR, GB, and RF are 0.64, 0.74, and 0.71, respectively. When the VR technique is applied to ensemble these three ML models, the R² value becomes 0.75, which is higher in accuracy. For freshwater fish, the R² values of LR, GB, and RF are 0.38, 0.81, and 0.79, respectively. After applying the ensemble approach, the R² value of the VR model is 0.81, which is equivalent to the highest value obtained by an individual model. The VR model boosts the performance in the brackish water fish landing prediction, where the R² values of LR, GB, and RF are 0.44, −0.57, and −0.073, respectively. When the VR ensemble is applied, the accuracy of the prediction improves and R² increases to 0.55. Hence, based on the R² values, the VR-based ensembled ML model is the best ML-based prediction model for fish landings in Malaysia.
Based on the scores obtained from the feature importance analysis, it is evident that climatic variables have an impact on marine water, brackish water, and freshwater fish production. ML model performance also depends greatly on the variables that have more predictive power. We determined the predictive power of individual climatic variables using the feature importance method and found that rainfall (rf) has the least impact among all variables, while temperature has more influence on fish production. This pre-processing of the data improved the ML model performance.
The fisheries industry is an important sector in sustaining the economy of Malaysia, yet changes in climatic variables or overfishing may alter fish production. Therefore, the prediction of annual fish production in Malaysia is important for sustainable development along with a continuous fish catch. In this regard, this research has implemented the feature importance method to determine the impact score of the climatic variables on different types of fish landings. Based on the scores, the annual climatic data are selected and divided into training and testing sets to predict the annual fish landings of all five major states. These variables have been used to build models with the different ML approaches: LR, RF, GB, and ensembled VR. In addition, fish production data from 2000 to 2019 for three different types of water (marine, brackish, and freshwater) are used for developing these ML models. Generally, it is stated that changes in fish landings are consistent with temperature, and that higher temperature means higher fish production [52]. In addition, fish landings may decline before the temperature decreases, which suggests that air temperature may influence the total fish landings by altering habitat availability and quality.
We see in this study that when the R², MAPE, MAE, and PBIAS values were used for selecting the best model, the ML models emerged as the better models for predicting fish production in Malaysia. However, if the R² value is selected as the forecast accuracy measure for model selection, then the ML models can be used as alternative models for predicting demersal marine fish and freshwater fish production in Malaysia. In this research, the dataset contained all five major states in both the validation and testing phases. Thus, the ML models used in this research can predict fish production in the five major states of Malaysia. The fishery industry's decision-makers usually plan according to the fishing market's resource requirements, which depend heavily on accurate forecasts of fish landings one or two years in advance [53]. Therefore, this predictive model can be a valuable component for building future decision support systems (DSS) for Malaysia's fishing industry.

Conclusions
Machine learning algorithms are efficient at handling complex time-series data and have been widely used in the environmental field. Applied ML models sometimes face the problem of many potential inputs but limited available data. Here, we applied a hybrid approach that combines feature importance measures with ML models. We have implemented these models to extract valuable and meaningful information from the selected climatic variables and to explore their impacts on marine fish, brackish water aquaculture, and freshwater aquaculture production. We have used the past 20 years of data from five major states of Malaysia to build the ML models. Performance metrics evaluate the ML models in terms of four error objective functions. The ML-based feature importance method first identified the variables with the most predictive power. Based on those scores, we chose three climatic variables and the location as inputs to the ML models. We implemented three ML models and one ensembled approach (VR). The results show that the ensembled VR model achieves R² values of 0.75, 0.81, and 0.55 for marine water, freshwater, and brackish water, respectively, which is better than the single ML models. Hence, we can conclude that an ensemble approach outperforms a single ML model in predicting all three types of fish production (in tons). This research provides an ML model-based framework that delivers a reliable and accurate estimation of fish production in the study area. This ensembled ML approach with the feature importance method will be a valuable tool for predicting fish production and can help Malaysian stakeholders and policymakers guide fish production strategies in Malaysia.