Predicting Potential Spawning Habitat by Ensemble Species Distribution Models: The Case Study of European Anchovy ( Engraulis encrasicolus ) in the Strait of Sicily

: Species distribution models (SDMs) are important tools for exploring the complex association between species and habitats. Here, we applied six SDMs combining 1946 pieces of presence/absence data regarding European anchovy eggs with environmental parameters from surveys conducted in the Strait of Sicily from 1998 to 2016. We aimed to investigate the mechanisms inﬂuencing spawning habitat suitability for anchovy ( Engraulis encrasicolus ). The dataset was split into a training subset (75%) and a test subset (25%) for evaluating the predictive performance of the models. The results suggested the role of environmental parameters in explaining egg occurrence, model accuracy and spatial predictions. Bottom depth consistently had the highest importance, followed by absolute dynamic topography, which gives insights about local mesoscale oceanographic features. Each modelling method, except the linear model, produced successful performance for both the training and the test datasets. The spatial predictions were estimated as weighted averages of single-model predictions, with weights based on discriminatory power measured by the area under the receiver operating characteristic curve (AUC). This ensemble approach often provided more robust predictions than a single model. The coastal waters were identiﬁed as the most favorable for anchovy spawning, especially the south-central sector and the area around the southern-most tip of Sicily.


Introduction
The growing need for quantitative information on species habitat during all life stages, for resource management and conservation planning, has caused a fast development in the Species Distribution Models (SDMs) in recent years [1][2][3][4]. Their wide use is due mainly to their flexibility and ability to quantify highly complex associations that characterize species and their habitat and to estimate spatially explicit predictions of environmental suitability [5,6].
Of particular interest in marine ecosystem-based management is the distribution of pelagic species, whose spawning strategy, offspring fate and standing stock biomass are highly affected by variability of environmental conditions, especially in the very oligotrophic Mediterranean Sea [7][8][9][10]. Specifically, the upwelling off the southern coast of Sicily [11] drives a highly productive assemblage of pelagic fish including the European anchovy (Engraulis encrasicolus, [12,13]). It is the target species for local fisheries and an intermediate link in the flow of energy from lower to higher trophic levels. Therefore, it substantially contributes to the diet of numerous predators and to the total catches from fisheries [14,15]. The short lifespan (2-3 years), reproductive strategy with large quantities of pelagic eggs and extended spawning season [16,17] make its biomass very susceptible to environmental variability. In the Strait of Sicily, the anchovy spawning period of this species occurs during the summer period, when the temperature increases and the water column is stratified [18], favoring the aggregation of prey in the surface layer, where most of the eggs and larvae are typically concentrated [19]. Therefore, habitat conditions have a high impact on survival in early life stages (eggs and larvae) and are considered more important than the effects of fishing effort in many Mediterranean regions [9,[20][21][22].
Recently, the preferential spawning habitat of European anchovy in relation to environmental parameters has been examined for the Mediterranean Sea [23]. Its spawning distribution is mainly driven by depth gradient, which limits the potential spawning habitat to the continental shelf, within the 200 m isobaths [19,[24][25][26][27][28]. Other important factors affecting anchovy spawning behavior are variations in primary production, temperature and stability of water column [18,[29][30][31][32][33]. Several studies showed that the fate of recruitment is manly linked to local surface circulation during the spawning season [9,[34][35][36][37]. In the Strait of Sicily, the surface circulation is dominated by the flow of the Modified Atlantic Waters (MAW; [38][39][40][41][42]), locally called Atlantic Ionian Stream (AIS). The AIS fuels high variability in both hydrological and physico-chemical conditions that may favor the local retention of eggs and larvae and the enrichment of primary production in the upper water layers, creating favorable feeding conditions for anchovy early life history stages [13,34,43,44].
However, most of the studies carried out until now have examined the relationship between the species distribution and a single environmental parameter. When a multivariate approach was applied, predictions of spawning habitat distribution were mainly based on a single statistical technique. Since the selection of the best a priori model is not straightforward [45], an approach based on the testing of multiple models can provide a more robust estimation of the occurrence of anchovy spawning [46][47][48]. In this paper, six different SDMs were applied combining a long presence/absence time series of European anchovy eggs in the Strait of Sicily and co-occurring environmental information. The approach included regression-based methods (Generalized Linear Model (GLM), Generalized Additive Model (GAM) and Multivariate Adaptive Regression Spline (MARS)) and machine-learning methods (Random Forest (RF), Boosted Regression Trees (BRT) and Support Vector Machines (SVM)).

The Biological Dataset
Anchovy egg data used in this study are from nineteen summer oceanographic surveys carried out in the Strait of Sicily from 1998 to 2016 (Table 1), during the peak of anchovy spawning period in the study area [18]. Because spawning of the European anchovy occurs on the continental shelf [9], only stations with depths less than 200 m were included in the analysis, with a sampling grid of 6 × 6 nautical miles. Plankton samples were collected with a Bongo40 net towed obliquely from the surface to a 100 m depth, wherever possible. The Bongo40 net is composed of two coupled nets with a mouth diameter of 40 cm and mesh size of 200 µ. Samples were stored in 4% buffered formaldehyde (and/or 70% alcohol) sea-water solution for identification of anchovy eggs in the laboratory by stereomicroscopy. A presence/absence data matrix was finally obtained and used as response variable in use the SDMs.

The Environmental Dataset
A set of environmental parameters was selected based on available knowledge about their direct influence on the spawning behavior of the adult population and on the survival of anchovy offspring. First, the bottom depth was included as an explanatory variable in estimated models, as this parameter represents a discriminant factor for all life stages of the European anchovy [31]. Continuous vertical profiles of temperature were acquired in all Bongo40 stations from the surface to the bottom by means of a SBE911 plus CTD probe (Sea-Bird Inc mod. 9/11 plus). The collected downcast data were quality-checked and processed in accordance with the instructions of the Mediterranean and Ocean Data Base [49], using the Seasoft-Win32 software (Washington, DC, USA). As anchovy eggs in the Mediterranean Sea are mainly found in the surface layer [50,51], the temperature data were averaged by station from surface to 10 m depth.
The mixed layer depth (MLD in m) was derived from each CTD profile using the algorithm described in [52], since it is an important driver for the definition of the potential spawning habitat [53].
The relevant impact of mesoscale oceanographic features such as upwelling, cold filaments and fronts on the spatial distribution of the planktonic early life stages of anchovy [9,54] was also taken into account. Oceanographic structures can influence the distribution of chemical and physical properties of the water column with a potential effect on anchovy egg survival and larval development [55]. Therefore, surface circulation features were evaluated by means of Absolute Dynamic Topography (ADT in cm) (daily data; spatial resolution: 0.125 × 0.125 degree) and the derived u and v components of geostrophic currents, produced by Copernicus Marine Environment Monitoring Service (CMEMS, http://marine.copernicus.eu/ (accessed on 15 July 2021)). ADT represents important oceanographic features such as mesoscale eddies and meanders [56], which affect primary production and can act as physical barriers for eggs or be responsible for their dispersal offshore. In addition, absolute speed (parameter speed.gos (cm s −1 )) was derived from the zonal (u) and meridional (v) components of surface current and used as predictors in the modelling.
Finally, sea surface chlorophyll-a concentration (chl-a in mg m −3 ) was included in the modelling as an indicator of primary productivity [57] and a proxy for the presence of favorable feeding conditions for anchovy. Daily high resolution (1 × 1 km) satellite data from CMEMS were used for this purpose. The satellite-based information was extracted for each planktonic station subject to the analysis, based on their spatial and temporal location.

Species Distribution Modelling
All SDMs were performed on the presence/absence of eggs and environmental parameters using the statistical software R 3.6.3 (Vienna, Austria) [58].
The available dataset was checked for the presence of any outliers, the collinearity among independent variables and the distribution of the input parameters. MLD and the concentration of chl-a were transformed via natural logarithm to achieve uniform distribution, useful for improving the model application.
The survey data were randomly split into two subsets, the first to train the models (training dataset: 75% of total number of stations) and the second one to validate the predictive performance of the estimated models (test dataset: 25% of available stations). All species distribution models were fitted using the "sdm" library [59] of the R environment [58]. For each model applied on the training set, the AUC-metric was used to select the most significant input variables useful for discrimination of stations with presence or absence of anchovy eggs. The AUC-metric calculates the improvement in model performance with the inclusion of each variable, compared to when the variable is excluded. Partial dependence plots for each significant predictor were used to describe the role of predictors in the explanation of the species distribution. The response curves were generated following the procedure proposed by [60].

Regression-Based Models
Three regression models were fitted to presence/absence data of anchovy eggs and environmental parameters: GLM, GAM and MARS. The assumption of binomial error distribution and logit link function were chosen for all modelling approaches.
GLM is a mathematical extensions of linear models that do not force data into nonlinear scales, and thereby allow for non-linearity and non-constant variance structures in the data. Model is fit using maximum likelihood and by allowing the linear model to be related to the response variable via a link function and the magnitude of the variance of each measurement to be a function of its predicted value [61].
GAM is a semi-parametric extensions of GLM, where the only underlying assumption made is that the functions are additive and the linear predictor is the sum of smoothing functions. This makes GAM very flexible, and adaptable to very complex functions [62].
MARS is a non-parametric regression model used to estimate complex nonlinear relationships by a series of truncated spline functions of the predictors. It estimates a set of functions called "basic functions" (BF), which represent the information included in one or more independent variables. The BF are combined to obtain a sequence of candidate models with different BF numbers, from which MARS selects the optimal model [63].

Machine-Learning Models
Three machine-learning models were fitted: RF, BRT and SVM. An RF is an ensemble classifier consisting of many decision trees, where the final predicted class is obtained by combining the predictions of all individual trees. A training set for each individual tree is constructed by randomly re-sampling data with replacement from available data in the original dataset. The number of trees was determine based on lowest error rate on validation data [64].
BRT combines the concept of regression trees with the concept of boosting. Boosting is an ensemble method that derives an answer from a large pool of models. In this case, this large pool of models is generated by creating trees that attempt to explain the variation in the data that is currently unexplained by existing trees [65]. The function assesses the optimal number of boosting trees using k-fold cross validation, described in [66].
SVM applies an algorithm with polynomial kernels able to estimate an optimal hyperplane with the maximal margin of separation between positive and negative stations. The Kernel parameter selection and Cost parameter were defined minimizing error rate on validation data [67].

Model Validation and Mapping
The output values from each species distribution model ranged from 0 to 1, representing the probability of the occurrence of anchovy eggs at a particular sampling station. For each SDM, an optimal threshold was estimated to transform the continuous output into a binary factor: positive stations (representing stations where the presence of anchovy eggs is expected to occur) with predicted values greater than the threshold and negative stations otherwise. Various measurements of accuracy were applied to each SDM for both the training and the independent test data to evaluate discriminating ability and reliability, including threshold-dependent statistics (Sensitivity and Specificity) and threshold-independent statistics (AUC) [68].
The ability of models to correctly predict species presence and absence was examined with sensitivity and specificity, respectively [69]. The maximum sum of sensitivity and specificity was used in SDM as criteria for optimizing the threshold.
In addition, a threshold-independent measure was estimated: the AUC [70]. A high AUC indicates that sites with high predicted suitability values tend to be areas of known presence and locations with lower model prediction values tend to be areas where the species presence is unknown. An AUC lower than 0.5 means that the model is not able to discriminate, 0.7 indicates satisfactory discrimination, 0.8 good discrimination, and 0.9 or higher very good discrimination [71].
Finally, we also estimated the probability of occurrence based on an ensemble of all models, because these predictions are often more robust than predictions derived from a single model [46][47][48]. Ensemble predictions were calculated as weighted averages of single-model predictions, with weights assigned to each modelling technique based on its discriminatory power as measured by the AUC. The Kriging-based spatial interpolation was used to map the results from ensemble approach within the study area.

Regression-Based Outputs
The regression models included bottom depth as the primary effect on spawning, followed by ADT, speed.gos and MLD in all cases. GAM and MARS models also included chl-a and temperature (Figures 1a, 2a and 3a). In GLM, the linear relationship between egg occurrence and both bottom depth and ADT is negative, while speed.gos and MLD have a positive effect on egg presence (Figure 1b). A similar pattern of response was found in MARS and GAM results for all selected variables. The partial plot of bottom depth indicated an optimal range for the spawning grounds between 50 and 100 m. Some particular environmental conditions (ADT values > 10 cm in absolute value; speed.gos around 30 cm s −1 ; temperature < 25 • C; chl-a > 0.05 mg m −3 and 10 < MLD < 15 m) are linked to a higher probability of the presence of anchovy eggs (Figures 2 and 3b).

Machine-Learning Outputs
Five-hundred decision trees were used to select the best RF model and 450 boosting trees for the BRT model. A cost parameter equal to a one-and-a-third degree polynomial Kernel characterized the SVM model with the best performance.
As already observed for regression models, the machine-learning models also showed that anchovy egg distribution was principally driven by changes in bottom depth and ADT. Surface currents in terms of velocity, temperature and chl-a were moderately important for determining egg presence (Figures 4a, 5a and 6a).

Machine-Learning Outputs
Five-hundred decision trees were used to select the best RF model and 450 boosting trees for the BRT model. A cost parameter equal to a one-and-a-third degree polynomial Kernel characterized the SVM model with the best performance.
As already observed for regression models, the machine-learning models also showed that anchovy egg distribution was principally driven by changes in bottom depth and ADT. Surface currents in terms of velocity, temperature and chl-a were moderately important for determining egg presence (Figures 4a, 5a and 6a).
The partial dependence plots (Figures 4b, 5b and 6b) show similar patterns to the regression plots. Specifically, the probability of egg occurrence increases in correspondence with an optimal range for bottom depth (between 50 and 100 m) and speed.gos (about 30 cm s −1 ). BRT model resulted in patterns similar to regression models for temperature and chl-a (temperature < 25 °C and chl-a > 0.05 mg m −3 ). Otherwise, for RF and SVM models, the higher probability of the presence of eggs is linked to temperature values between 20 and 25 °C, chl-a values between 0.05 and 0.1 mg m −3 and negative ADT values between −10 cm and 0.

Evaluation of Model Results
With the exception of the GLM, which showed a relatively weaker performance, the predictions obtained from the training data proved to be reliable for the modelling techniques. According to the classification by [72], the RF model had a good discrimination The partial dependence plots (Figures 4b, 5b and 6b) show similar patterns to the regression plots. Specifically, the probability of egg occurrence increases in correspondence with an optimal range for bottom depth (between 50 and 100 m) and speed.gos (about 30 cm s −1 ). BRT model resulted in patterns similar to regression models for temperature and chl-a (temperature < 25 • C and chl-a > 0.05 mg m −3 ). Otherwise, for RF and SVM models, the higher probability of the presence of eggs is linked to temperature values between 20 and 25 • C, chl-a values between 0.05 and 0.1 mg m −3 and negative ADT values between −10 cm and 0.

Evaluation of Model Results
With the exception of the GLM, which showed a relatively weaker performance, the predictions obtained from the training data proved to be reliable for the modelling techniques. According to the classification by [72], the RF model had a good discrimination power (AUC = 0.81) and the other models assessed in this study, with AUC values ranging between 0.74 (GAM and MARS models) and 0.78 (BRT and SVM models), provided satisfactory discrimination ( Table 2). The estimated threshold was about 0.40 for all models, with the exception of RF (0.52). These values, combined with sensitivity values higher than specificity, indicated that generally all models performed well in correctly identifying hauls with the presence of anchovy eggs. However, the machine-learning models outperformed regression predicting the reproductive habitat of anchovies (Table 2). The modelling techniques also showed a satisfactory performance when applied to the test dataset. RF was the most accurate, with an AUC of 0.76. The estimated threshold was about 0.40 for all models, except for GAM (threshold = 0.51). All models (excluding GAM) performed best when they classified positive stations compared to negative (Table 3).

Displaying the Anchovy Spawning Habitat
The ensemble model identified the coastal waters as the most favorable for anchovy spawning, with the differences between years limited to areal extent. The areas associated with high probability of eggs (red areas in Figure 7) extended more offshore between 1998 and 2002 and between 2004 and 2006. In other years the preferential area was restricted to shallow waters near the coastline. This was observed in particular in 2011 and 2015, when only two restricted areas exhibited conditions favorable to spawning: a narrow band in the south-central sector of Sicilian coast and the area around the southern-most tip of Sicily (Cape Passero). In general, the unfavorable areas (blue areas in Figure 7) were associated with depths greater than 100 m.
The ensemble model identified the coastal waters as the most favorable for anchovy spawning, with the differences between years limited to areal extent. The areas associated with high probability of eggs (red areas in Figure 7) extended more offshore between 1998 and 2002 and between 2004 and 2006. In other years the preferential area was restricted to shallow waters near the coastline. This was observed in particular in 2011 and 2015, when only two restricted areas exhibited conditions favorable to spawning: a narrow band in the south-central sector of Sicilian coast and the area around the southern-most tip of Sicily (Cape Passero). In general, the unfavorable areas (blue areas in Figure 7) were associated with depths greater than 100 m.

Discussion
SDMs are a valuable statistical approach able to provide important information during the decision-making processes aimed at sustainable fishing and species conservation [5,6,72]. Methodological advances have enhanced SDM capabilities, allowing a better knowledge of species distributions and relationships with environmental conditions. However, these approaches could be affected by limitations due to the a priori choice of

Discussion
SDMs are a valuable statistical approach able to provide important information during the decision-making processes aimed at sustainable fishing and species conservation [5,6,72]. Methodological advances have enhanced SDM capabilities, allowing a better knowledge of species distributions and relationships with environmental conditions. However, these approaches could be affected by limitations due to the a priori choice of an optimal model [45]. An ensemble of SDMs provides a method for resolving differences between individual models and estimating more robustly the relationship between species and habitats [46,47].
The Strait of Sicily is a particularly interesting area because it is characterized by highly variable oceanographic conditions among years [73] that strongly affect the spawning and survival of first stages of fish with severe consequences on their stock biomasses. This is the case for the European anchovy, a short-lived species with a demographic structure locally dominated by age classes 1-2 years [74]. This paper represents the first attempt to evaluate the performance of different statistical modelling approaches applied to the investigation of anchovy spawning habitat in the Strait of Sicily, using the longest time series of egg presence/absence data available for the study area. Specifically, spawning habitat suitability was assessed by averaging predictions resulting from six different SDMs. The multiple approach included three regression models and three machine-learning models, based on data collected during 19 ichthyoplankton surveys (years 1998-2016) in the Strait of Sicily. The methodology offered a promising adaptive tool for selecting realistic conservation areas for this species as function of current environmental conditions. Validation of model outputs demonstrated the great efficacy and ecological relevance of the selected environmental parameters in predicting spawning habitat distribution, with the exception of GLM. In general, non-linear techniques, with complex ecological response, provided more realistic results compared with classical linear techniques. In addition, machine-learning models, particularly RF, performed better than regression models. The high performance of machine-learning algorithms is well-documented in SDM applications [47,[75][76][77], consistently outperforming regression-based methods. This result could suggest a potential overfitting issue [66,78]. However, this was not the case for this study, because the models applied on the test dataset showed an equally good performance in the predictions. The higher performance of RF is attributed to its ability to combine the most popular class from several individual trees [64,79], which makes it a widely applied machine-learning algorithm for various field of studies [80,81].
Although the correct-classification rates were quite similar in all tested models, differences were found in the selection of significant variables and evaluation of their importance, as indicated also by [45,82]. However, despite the heterogeneity in the relative importance of the covariates among models, the interpretation of several SDM outcomes collectively helped to identify the optimal environmental windows for anchovy egg distribution in the study area. For instance, all approaches showed the major role played by bottom depth and ADT in affecting egg occurrence, with small differences among them. Specifically, egg occurrence was primarily related to bottom depth in the 50-100 m range, confirming a pattern of distribution linked to the coastal zone [9,29,44,83]. Moreover, egg distribution was influenced by spatial patterns of ADT, which depends mainly on the mesoscale oceanographic structure occurring in the surface layers. In the Strait of Sicily, the surface hydrodynamics is dominated by the AIS, which flows south-eastwards along a meandering path that varies over the years [73]. In terms of ADT, this current is characterized by values close to zero in its core, and its flow is characterized by negative values in the coastal (upwelling) area and positive values offshore. In this context, egg occurrence is mainly associated with the negative ADT values along the southern coast of Sicily, in relation to the AIS motion [32]. The significance of ADT reinforces the important role played by hydrography in affecting the anchovy spawning behavior in the Strait of Sicily, in agreement with other studies [9,54,84].
Conversely, the relative contribution of the remaining parameters, especially chlorophylla concentration, temperature and surface current, in predicting species occurrence was variable, suggesting a secondary but significant impact in affecting egg distribution during the peak of spawning. This study identified an optimal temperature window between 20 and 24 • C, not surprisingly higher than the range of 18-19 • C reported in [29], where temperature was depth averaged in the uppermost 40 m layer of the water column, and within the range observed for this species in the Mediterranean Sea [85]. The higher chlorophyll-a concentration associated with the egg occurrence confirms the findings of previous studies [29,84] and reinforces the hypothesis of a reproductive strategy aimed at maximizing the reproductive success of the species by assuring food availability during the larval development period [86,87]. The observed significant contribution of the current speed in explaining the egg spatial patterns in most models suggests the occurrence of advection from spawning areas to retention areas, as highlighted by other studies using Lagrangian dispersion models [9,54,88,89]. Conversely, the extent of mixed layer depth showed limited positive impact, highlighting an association between the presence of eggs and well-stratified waters typically characterizing the summer period (e.g., outputs of GLM and GAM models).
Despite some differences, it is also worth noting that most of the results of previous studies on the reproductive habitat of anchovies in the Strait of Sicily, even if obtained with different single modeling approaches and on shorter biological and environmental time series [9,[29][30][31][32][33][34] are generally confirmed here. However, the added value of this study is its accurate comparisons of different modelling approaches, contributing to the development of more robust predictive models [46,47,90,91]. In addition, in this study, we took advantage of the extensive literature on the relationships between the environment and the reproductive biology of the European anchovy to validate the results of our methodological approach from an ecological point of view. In particular, the variability in the outcomes of the individual models and the consistency with previous studies underlined the need for a comprehensive use of an ensemble of SDMs to detect, and understand the complexity of relationships in the marine environment.
In this context, the understanding of the effects related to environmental fluctuations can create the basis for defining reliable future previsions about the state of the natural resource, significantly supporting sustainable fishery management actions concerning one of the most exploited natural resources in the Mediterranean Sea [92]. The definition of the spawning areas as well as the understanding of the environmental effect on recruitment represent fundamental knowledge for the management of the fishing effort aimed at guaranteeing the conservation of the resource in line with the priority objectives of current international strategies. This work proposes an analytical approach that can be exported to other contexts and sheds light on the main drivers that regulate the reproduction of the European anchovy in a crucial area for the fishing of small pelagics, highlighting the importance of monitoring programs for the conservation of the natural capital.

Conclusions
This study represents the first attempt to identify potential spawning habitat of the European anchovy in the Strait of Sicily with different modelling approaches applied on a long-term time series of environmental and biological information. The results have important implications for species management and conservation planning, as they provide an exhaustive characterization of the anchovy spawning habitat in the study area. Moreover, it also provides an innovative comprehensive modeling approach aimed at delivering a more effective predictive technique, first applied in the field of fisheries ecology.