Relationship between Engraulis japonicus Resources and Environmental Factors Based on Multi-Model Comparison in Offshore Waters of Southern Zhejiang, China

: In order to accurately explore the relationship between the density of Engraulis japonicus and environmental factors, ﬁve types of models, including Tweedie-Generalized Additive Model (GAM), two-stage GAM, Ad hoc-GAM, and Generalized Additive Mixing Model (GAMM), were compared based on the survey data in offshore waters of southern Zhejiang, China from 2015 to 2021 in this study. The results showed the best goodness of ﬁt for two-stage GAM when processing the data of E. japonicus resource density. The deviance explained of GAM1 and GAM2 were 19.9 and 53.8%, respectively. According to this study, water temperature and salinity are important environmental factors affecting the distribution of E. japonicus , which are also closely related to latitude. In general, the resource density of E. japonicus decreases gradually with the increase in water temperature. When the salinity was between 26 ppt and 34 ppt, the resource density was higher. Also, there were some differences in the spatial distribution of E. japonicus in different seasons. The relationship between the resource density of E. japonicus and environmental factors was analyzed through various models to provide a scientiﬁc basis for the conservation management of E. japonicus in offshore waters of southern Zhejiang, China.


Introduction
The relationship between fishery resources and the marine environment is very complex, with non-linear and non-additive relationships [1]. Therefore, it is important to select appropriate methods for quantitative analysis of the relationship between fishery resources and the marine environment. Previous studies [2][3][4] indicated that changes in the stock spatial structure might cause due to changes in environmental conditions. Thus, understanding the relationship between species distribution and habitat can provide necessary information for predicting the impact of climate change and formulating relevant management strategies [5].
Species Distribution Models (SDMs) are effective tools to study the relationship between research objects and habitat environmental factors as well as to explore the spatiotemporal distribution of fish [6]. SDMs can be divided into two types [7]: (1) "only presence" models that ignore the missing values and zero values, such as ecological niche factor analysis (ENFA) [8] or maximum entropy model (MaxEnt) [9], which can find habitats similar to species record sites based on the environmental conditions, and (2) "presence-absence" models, which need to record resources and efforts in the survey, such as Generalized Additive Model (GAM), Generalized Linear Model (GLM), or Regression Trees Analyses. Many scholars use the presence-absence models to explore the relationship between fishery resource density and environmental factors [10][11][12][13]. As a kind of SDM, GAM can deal with the non-linear relationship between response variables and explanatory variables, such as spatial, temporal, and environmental variables [14], which comes with good stability and more flexibility for exploring the relationship between fishery resource density and environmental factors [15].
In the actual sampling process, due to sampling methods, selection of fishing gear, and small stock size, there may be a large number of zero values in fishery resource data causing certain difficulties in estimation of the model with log-normal distribution error [16]. Bouska et al. [17] showed that model selection was the main uncertainty factor in the establishment of SDMs. Therefore, selecting an appropriate SDM based on the actual survey data can effectively improve the accuracy and reliability of the model. In relevant modeling research in the fishery field, several methods are usually used to deal with a large number of zero values, such as Ad hoc-GAM, Generalized Additive Mixed Models (GAMM), two-stage GAM, and Tweedie-GAM. Ad hoc-GAM applies the model to data with zero values by adding a constant (the link function is a natural logarithm) [18]. GAMMs are very suitable for processing time-series data and have good performance in processing fishery data [4]. Two-stage GAM is a widely used data survey with a large number of zero values that ensures good results [3,10,19]. To better process data with a large number of zero values, Tweedie [20] also proposed a special distribution law called Tweedie distribution, which is suitable for processing non-negative data with a large number of zero values [16].
The offshore waters of southern Zhejiang are located in the East China Sea. Under the influence of water mass, such as coastal waters of Fujian and Zhejiang as well as Taiwan Warm Current, there are adequate nutrients and bait organisms supporting abundant fish resources [21]. However, after the 1990s, due to overfishing and water pollution, the main economic fish in the offshore waters rapidly declined, and fish species, such as Engraulis japonicus, gradually became the fishing target [21,22]. E. japonicus is a pelagic migratory fish with a strong clustering pattern. It is widely distributed in the East China Sea and Yellow Sea of China, and is the prey of many higher trophic level species [23], thus playing an important role in the ecosystem. In recent years, under the influence of water pollution and overfishing, the sampling zero value of E. japonicus frequently appeared in the resource survey and monitoring stations. Therefore, how to select an appropriate model method to explore the distribution mechanism of E. japonicus resources in this area has become an urgent scientific problem to be solved. Based on the independent survey data of offshore fisheries in southern Zhejiang from 2015 to 2021, this study explored the goodness of fit and prediction performance of Tweedie-GAM, two-stage GAM, Ad-hoc GAM, and GAMM in processing large amounts of zero value data and analyzed the relationship between E. japonicus and environmental factors. This gave a further understanding of the distribution pattern and the latest dynamics of E. japonicus resources that enhanced our understanding of the ecological mechanism of species distribution, which acted as the research reference for the conservation management and sustainable utilization of fishery resources.

Data Sources
From November 2015 to February 2021, a seasonal survey of fishery resources was conducted in the offshore waters of southern Zhejiang. Since no E. japonicus was found in the survey in summer (August) every year, we only used survey data in spring (May), autumn (November), and winter (February). The survey area ranged from 120.93 • E to 122.95 • E, 27.21 • N to 28.97 • N ( Figure 1). The field survey was conducted during the day. E. japonicus mainly lives in the middle and lower to bottom waters during the day [24]. So, we used the bottom trawl for fishing. The total length of gear was 95 m. The fishing gear was 40 m wide and 7.5 m high. The length of the bottom and floating substrates was 80 m. The mesh of the net capsule was 2 cm, and the towing speed was 2-4 kn. The operation time of each survey station was about 1 h. At each survey station, the water quality analyzer WTW-Multi 3430 was used to collect environmental data, such as water temperature and salinity. The collection, determination, and analysis of water quality samples were carried out in accordance with the Specifications for Oceanographic Survey (GB/T 12763) [25] and Specification for Marine Monitoring (GB 17378) [26]. According to the total catch and proportion of each station, the catch data were standardized according to trawl time (1 h) and trawl speed (3 kn).

Selection of Model Explanatory Variables
Given the obvious seasonal migration of E. japonicus during spawning, feeding, and winter migration [27], longitude, latitude, and offshore distance were selected as the influencing factors of the spatial distribution of E. japonicus. The offshore distance was the shortest distance from the survey station to the shore, which was calculated in the Sp package in R [28]. Water temperature, salinity, and water depth are closely related to the resource distribution of E. Japonicus, as it is a pelagic fish species [27,29]. A total of 7 factors (surface water temperature, surface salinity, water depth, month, longitude, latitude, and offshore distance) were selected to explore the relationship between E. japonicus resource distribution and environmental influencing factors. Before modeling, variance inflation factor (VIF) was used to test the multicollinearity of variables while excluding highly correlated explanatory variables with a VIF value greater than 5 [3,4].

Model Theory
In this study, five types of models (two-stage GAM, Tweedie-GAM, GAMM, and Ad hoc-GAM) were used to explore the relationship between the presence probability, resource density of E. japonicus, and environmental factors. In addition, the goodness of fit of the model to the data was measured by deviance explained (the proportion of the null deviance explained by the model).
GAM can fit the non-linear relationship between response variables and explanatory variables, and the expression is as follows: where Y denotes the resource density of E. japonicus (g/h); α denotes the intercept of the fitting model; refers to the smoothing function; a spline smoothing function was used in this study; is an independent variable; residual ε = σ 2 and E(ε) = 0.

Two-Stage GAM
The model has two stages: the first stage of the GAM estimates the presence probability (P) of E. japonicus with a binomial error distribution, and the second stage of the GAM estimates the log transformation abundance of the species a with a Gaussian error distribution [30]. The formula is given as follows: The second stage GAM (GAM2), estimates the log-transformed E. japonicus resource density using the identity link function [3,10]. The two-stage GAM model formula is as follows: Combined with the results of GAM1 and GAM2, the final log-transformed E. japonicus density was estimated [19].
where month denotes the month; Lat denotes latitude; Lon denotes longitude; T denotes water temperature; S denotes salinity; depth denotes water depth; Dis denotes offshore distance; density denotes E. japonicus resource density; P denotes the occurrence probability of E. japonicus; ε denotes a random error.

Tweedie GAM
Tweedie distribution was first proposed by a British statistician in 1984 [20], which is a special probability distribution in exponential dispersion distribution. It is usually expressed as Tw p (θ, ϕ) and is determined by variance function V(µ) = µ P , where θ is a standard parameter; ϕ is a dispersion parameter; p ∈ (−∞, 0) ∪ [1, +∞). Tweedie distribution includes several common important distributions. When p = 0, 1, 2, 3, it corresponds to a normal distribution, Poisson distribution, Gamma distribution, and inverse Gaussian distribution. When 1 < p < 2, corresponding Tw p (θ, ϕ) is a composite distribution between Poisson distribution and Gamma distribution [20]. The probability density equation is as follows [16]: where θ is the location parameter; ϕ is the diffusion parameter; p is the energy parameter; and d(y : θ, p) is the unit deviation. The Tweedie distribution was used to establish the relationship between E. japonicus and environmental factors. The Tweedie-GAM expression is as follows: where month denotes month; Lat denotes latitude; Lon denotes longitude; T denotes water temperature; S denotes salinity; depth denotes water depth; Dis denotes offshore distance; density denotes the E. japonicus resource density; ε denotes a random error.

GAMM
GAMM is an extension of GAM, including fixed and random effects, which is very suitable for processing time series and autocorrelation data [31]. In the southern sea area of Zhejiang, the distribution of E. japonicus changes with season [27]. In this study, season (month) was used as a random error term of GAMM. GAMM expression is as follows: where month denotes month; Lat denotes latitude; Lon denotes longitude; T denotes water temperature; S denotes salinity; depth denotes water depth; Dis denotes offshore distance; density denotes the E. japonicus resource density; ε denotes a random error.

Ad Hoc-GAM
Ad hoc-GAM refers to the addition of a constant c to the resource density so that the model can process data containing zero values. According to Tian et al. [18], constant c is usually 1, but some relevant studies [32] have stated that a constant selection of 10% of the average resource density can reduce the error. Therefore, 1 and 10% of the average resource density were selected as the constant c to compare which effect was better (named as Ad + 1 GAM and Ad hoc-mean GAM). After the addition of the constant, we carried out a logarithmicization process for the dependent variable, and the expression is as follows: where month denotes month; Lat denotes latitude; Lon denotes longitude; T denotes water temperature; S denotes salinity; depth denotes water depth; Dis denotes offshore distance; density denotes the E. japonicus resource density; ε denotes a random error.

Model Selection
Akaike Information Criterion (AIC) can be used to measure the goodness of fit of multiple models [33]. The smaller the AIC value, the better the fitting degree of the model. In this study, the GAM between environmental factors and resource density was established by permutation and a combination of variable factors after screening. The model with the smallest AIC value in each method was considered the optimal model. The calculation method of AIC is as follows: where k is the number of parameters, and L is the likelihood function.

Cross-Validation
In this study, 80% of the data were randomly selected as the training set and the remaining 20% as the test set. The above process was run 1000 times to verify the prediction performance of different models. During each cross-validation, a linear relationship between the model predicted value and the actual observed value was built, and the root mean square error (RMSE) and mean absolute error (MAE) between the predicted and observed values were calculated. Closer the value to zero, the better the model goodness of fit [34,35]. The linear relationship between the predicted and observed values can be expressed as follows: where y denotes the predicted value of the model; Y denotes the actual observed value of the model. When a = 0 and b = 1, it means that the predicted resource density and the actual observed resource density (i.e., test data) have a similar spatial pattern. The model has a good prediction performance [19]. R 2 was used to express the prediction effect of the model. When R 2 is closer to 1, the prediction effect of the model becomes better [3]. The calculation equation of RMSE is as follows [34]: The calculation equation of MAE is as follows [35]: where n denotes the observation frequency; O i denotes the ith observed value; P i denotes the ith predicted value.

Comparison of Models
Based on the results of cross-validation, the prediction effects of the 5 models were compared to select the optimal model for processing zero value data.
All statistical analyses were carried out in R (V3.6.0), and the model was implemented through the "mgcv" package. The distribution of E. japonicus resource density and stations were drawn in Arcmap 10.8.

Zero Value Ratio of E. japonicus
From 2015 to 2021, the sampling zero-value of E. japonicus accounted for the largest proportion (90.5%) in autumn and the smallest proportion (46.1%) in spring. In spring, the zero value ratio of E. japonicus was mainly concentrated in 40-60%, with an average of 44.6 ± 23.7%. In autumn, the zero value ratio of E. japonicus was mainly concentrated within 75-100%, with an average of 90.6 ± 12.7%. In winter, the zero value ratio of E. japonicus was mainly concentrated within 60-80%, with an average of 74.6 ± 17.7%. The sampling zero value ratio in three seasons was ranked from highest to lowest as follows: autumn > winter > spring (Figure 2).

Spatial and Temporal Distribution of E. japonicus Resource Density
The distribution of E. japonicus resources in offshore southern Zhejiang showed obvious spatial differences in different seasons (Figure 3). The distribution characteristics of resource density were opposite to the zero value ratio of E. japonicus (spring > winter > autumn). In spring, the E. japonicus in offshore waters was higher than that of the open waters, while it was the opposite in autumn, presenting a distribution pattern of higher resource density in open waters than in offshore waters. In winter, there was a significant north-south difference in the study waters, and E. japonicus mainly concentrated in the northern waters of 28 • N.

Results of Different Models
VIF test results showed that VIFs of water temperature, water depth, salinity, and offshore distance were all less than 5. The VIFs of longitude and latitude were greater than 5, and the VIF of longitude was the largest. After removing the longitude, the collinearity test was conducted again for the influencing factors, and VIFs were less than 5 (Table 1). Therefore, six influencing factors, including month, water temperature, salinity, water depth, offshore distance, and latitude, were adopted to establish the model. In terms of two-stage GAM, GAM1 consisting of month, latitude, water temperature, salinity, and water depth was the optimal model at this stage, with a deviance explained of 19.9%. Latitude and water temperature were significantly correlated with the occurrence probability of E. japonicus (p < 0.05, Table 2). GAM2 consisting of month, latitude, water temperature, salinity, and water depth, was the optimal model at this stage, with a deviance explained of 53.8%. Water temperature was significantly correlated with the occurrence probability of E. japonicus (p < 0.05, Table 2).
The optimal variable combination of Tweedie-GAM was month, water temperature, and salinity, and the model deviance explained was 46.7%, among which salinity was found to be a significant influencing factor (p < 0.001, Table 2).
All explanatory variables were included in the optimal GAMM, and the deviance explained of the model was 73.2%. All five influencing factors were significantly correlated with the resource density of E. japonicus (p < 0.001, Table 2). The variable combination of optimal Ad + 1 GAM included all factors except water depth and offshore distance, and the deviance explained of the model was 30.0%. Latitude and water temperature were significantly correlated with the resource density of E. japonicus (p < 0.01 and p < 0.05, respectively; Table 2). For Ad hoc-mean GAM, the optimal variable combination was month, water temperature, salinity, latitude, and water depth, with a deviance explained of 29.6%. Water temperature and latitude were significantly correlated with the resource density of E. japonicus (p < 0.05, Table 2).

Relationship between E. japonicus Resource Density and Environmental Factors
There was a non-linear relationship between water temperature and resource density as well as occurrence probability (Figures 4a, 5a, 6a, 7a and 8a). In GAM1 (Figure 4a), Ad + 1 GAM (Figure 7a) and Ad hoc-mean GAM (Figure 8a) showed a negative correlation between water temperature and occurrence probability as well as resource density of E. japonicus. In GAM2, the relationship between water temperature and resource density was non-linear, with multiple peaks, showing an overall negative correlation (Figure 5a). Compared to other models, the relationship between water temperature and E. japonicus resource density was significantly different in GAMM. The resource density of E. japonicus was the lowest at 16.5 • C and the highest at 25 • C (Figure 6a). When the salinity ranged from 24 ppt to 34.5 ppt, the resource density of E. japonicus increased significantly with the increase in salinity (Figures 6b and 9b). When salinity reached 26 ppt, it showed a multi-wave non-linear relationship, which was at a high level.
According to GAM1 (Figure 4b), Ad + 1 GAM (Figure 7b), and Ad hoc-mean GAM (Figure 8b), there was a positive linear correlation between latitude and resource density as well as the occurrence probability of E. japonicus. However, in GAMM (Figure 6d), the relationship between latitude and E. japonicus became very complicated.

Prediction Performance of the Model
The results of 1000 cross-validation showed that the R 2 of Ad + 1 GAM fitting curve was the largest, and that of GAMM was the smallest among the five models. Two-stage GAM had the smallest RMSE and MAE and a relatively large R 2 of 0.18 (Table 3). Therefore, comprehensively considering MAE, RMSE, and R 2 , two-stage GAM should be selected as the optimal model to predict the resource density of E. japonicus.

Comparison between Different Models
Due to the decline in fishery resources, the patch aggregation of fish, and the selection of fishing gear for survey [36], it is impossible to effectively capture the target species in the fishery resource survey, thus resulting in many zero values in the survey data. In this study, zero value data accounted for 70% of the total data, which did not follow the general data distribution pattern (positively skewed distribution). Five models commonly used to deal with zero value data were used to establish the relationship between the resource density data of E. japonicus and environmental factors. It was found that two-stage GAM had more advantages in processing the data of E. japonicus resources. There is a certain difference in deviance explained of GAM1 and GAM2. The deviance explained of GAM1 was 19.2%, which was relatively lower, while that of GAM2 was 53.4%, and the model goodness of fit was good. However, there might be other important factors affecting the presence of E. japonicus. E. japonicus has obvious clustering characteristics, and waters with higher resource density represent the suitable living environment for this species to some extent, which makes the deviance explained of GAM2 relatively higher. In future studies, if biological factors (e.g., bait organisms) [37] and species interactions (e.g., predation and prey) [38] can be included in the model as explanatory variables, it will be conducive to model goodness of fit. The results of 1000 cross-validation showed that R 2 of the fitting curve of Ad + 1 GAM predicted and measured which values were the largest. While two-stage GAM had the smallest RMSE and MAE, the R 2 was relatively large, indicating that it had a better effect on the processing data of E. japonicus resource data in the waters of southern Zhejiang. Although two-stage GAM is considered more suitable for processing the density data of E. japonicus resources as compared to other models, its prediction effect is not particularly good, which may be due to fewer influencing factors discussed in this study and the formation of fishing ground associated with the spatio-temporal distribution structure of environmental factors. It is difficult to dynamically parameterize such spatio-temporal correlations [39]. Therefore, the survey frequency of E. japonicus migration during flood season and the collection of marine environmental data can be increased in future studies to better match the data collection with the living habitats of E. japonicus and improve the prediction ability of the model.

Relationship between E. japonicus Resource Density and Environmental Factors
The current study analyzed the relationship between different environmental factors and E. japonicus resources using different models. Although the goodness of fits and prediction performance of each model was different in comparative analysis, except for the Tweedie-GAM excluding the latitude, water temperature, salinity, and latitude, all existed in these models. Therefore, water temperature, salinity, and latitude may play an important role in the distribution of E. japonicus in southern Zhejiang.
Relevant studies have shown that water temperature can dominate the growth, development, and reproduction of fish [21] and affect the entire food web structure of fish by participating in the regulation of primary viability [40]. In this study, the relationship between water temperature and E. japonicus resources was similar, but there were slight differences. After careful consideration, 10-11 • C is considered the optimum water temperature range for E. japonicus. The range for optimum water temperature of E. japonicus is different from other relevant studies [27,41], where 8-11 • C is the optimum temperature range for E. japonicus. Niu et al. [42] have reported that the optimum temperature is different due to the different temperature ranges of fish in their different life stages, and the stock size, age structure, and fishing situation also affect the distribution of E. japonicus.
Salinity is one of the important environmental factors for fish development and distribution. It can change the fish stock by changing the osmotic pressure of fish eggs and affecting the development of embryos [14,43,44]. However, few studies have focused on how salinity affects the presence and distribution of E. japonicus. In this study, the effect of salinity on the resource distribution of E. japonicus showed a multi-wave non-linear relationship (Figure 5c). Previous studies have shown that the salinity of offshore waters of southern Zhejiang was due to the influence of the coastal current with cold water and low salinity, and Taiwan warm current with warm water that shows higher distribution characteristics in the east and lower distribution characteristics in the west [45,46]. As a kind of migratory fish [27], there is a big difference in the spatial distribution characteristics of E. japonicus in different seasons, which even showed a contrary distribution pattern in this study. Therefore, the relationship between salinity and E. japonicus is complex and variable, which may explain the multi-wave non-linear relationship between the resource density of E. japonicus and salinity.
Latitude, as a spatial factor, plays an important role in the resource density of E. japonicus in the offshore waters of southern Zhejiang. It has an indirect effect on the distribution of E. japonicus, by changing other environmental factors, such as temperature and salinity [47]. In this study, GAM1 ( Figure 4b) showed a positive linear correlation, while GAM2 showed a negative linear relationship (Figure 5b). Every spring, with the gradual increase in water temperature, the wintering stock located at lower latitudes gradually leaves the wintering grounds for waters at higher latitudes for breeding migration under the effect of gonad maturation. In autumn, influenced by the decreasing water temperature, E. japonicus began to migrate to the southern waters with higher water temperature [45], which migrates between high and low latitudes. Latitude represents water temperature to a certain extent. Thus, it can affect the resource density of E. japonicus stock.

Importance of Sampling
The relationship between fishery resource density and environmental factors and the spatio-temporal distribution of target species are affected by time, space, and fishing methods. E. japonicus is a small pelagic fish [37]. However, bottom trawl nets were used to investigate the depth of the water layer, which was not fully matched with the habitat water layer of E. japonicus, leading to the occurrence of a large number of zero values. In addition, spawning migration and wintering migration of E. japonicus are found in different seasons [42]. In this study, samples were taken in quarters with a longer time scale, thus weakening the importance of the E. japonicus migration process. In recent years, under the influence of overfishing, the number of resources was found to be at its lowest value [21]. Thus, E. japonicus was not captured in many stations, leading to zero value data.
The distribution of E. japonicus is closely related to spatio-temporal and hydrologic environmental factors [27,42], and the prediction of its resource distribution using fewer environmental factors may lead to some deviation. In this study, due to the selection of fixed stations with equal spacing for sampling, there is a high correlation between longitude and latitude. So, longitude was excluded in modeling, which might limit the prediction effect of the model to some extent. Li et al. [19] showed that random station sampling was superior to fixed station sampling, and the fixed station sampling might underestimate the true value of resource density. Therefore, in the follow-up studies, it is necessary to optimize the sampling scheme to obtain more accurate data and improve the fitting ability of the model. Several models commonly used to deal with zero value data were compared to select a more suitable model to explore the relationship between the E. japonicus resource density data and environmental factors in the present study. At the same time, researchers thought of selecting suitable SDMs for other fish species. However, the influence of biological factors should be properly considered in future research to have a more comprehensive understanding of the habitat change mechanism and resource density in this sea area.