Species Distribution Modelling via Feature Engineering and Machine Learning for Pelagic Fishes in the Mediterranean Sea

: In this work a ﬁsh species distribution model (SDM) was developed, by merging species occurrence data with environmental layers, with the scope to produce high resolution habitability maps for the whole Mediterranean Sea. The ﬁnal model is capable to predict the probability of occurrence of each ﬁsh species at any location in the Mediterranean Sea. Eight pelagic, commercial ﬁsh species were selected for this study namely Engraulis encrasicolus , Sardina pilchardus , Sardinella aurita , Scomber colias , Scomber scombrus , Spicara smaris , Thunnus thynnus and Xiphias gladius . The SDM environmental predictors were obtained from the databases of Copernicus Marine Environmental Service (CMEMS) and the European Marine Observation and Data Network (EMODnet). The probabilities of ﬁsh occurrence data in low resolution and with several gaps were obtained from Aquamaps (FAO Fishbase). Data pre-processing involved feature engineering to construct 6830 features, representing the distribution of several mean-monthly environmental variables, covering a time-span of 10 years. Feature selection with the ensemble Reciprocal Ranking method was used to rank the features according to their relative importance. This technique increased model’s performance by 34%. Ten machine learning algorithms were then applied and tested based on their overall performance per species. The XGBoost algorithm performed better and was used as the ﬁnal model. Feature categories were explored, with neighbor-based, extreme values, monthly and surface ones contributing most to the model. Environmental variables like salinity, temperature, distance to coast, dissolved oxygen and nitrate were found the strongest ones in predicting the probability of occurrence for the above eight species.


Introduction
The marine world is rapidly changing as humans perform a number of activities, such as fish stocking, shipping, aquaculture, pollution and habitat modification, which result in ecological and economic damage. Species distribution models (SDMs) provide a measure of species occupancy in response to the local/regional oceanographic and environmental conditions and habitat [1]. Such models combine occurrence locations of known species with a series of environmental layers, by developing a statistical inference system which unveils the impact of environmental parameters on specific species distribution patterns and by expanding the species distribution layer towards The machine learning algorithms that have been used in these fishery SDMs are Logistic Regression [17], other Generalized Linear Models (GLM) [19,22,25], Support Vector Machines (SVM) [17,22], Gradient Boosting Models (GBM) [22,25], Decision Trees [18], Genetic Algorithms for Rule Set Production (GARP) [22,23], Random Forests [19,22,25], Multivariate Adaptive Regression Splines (MARS) [22,25], Maximum Entropy (MaxEnt) [19,22,25,29] and Artificial Neural Network (ANN) [17,18]. Some studies use ensembles [17], while others test existing environmental models [17,19,22] and Favourability Functions [30].
Regarding the number of predictors, to our knowledge, all studies except one [29] used about 5 to 20 features. Very few took advantage of feature selection and variable importance pre-processing, and the ones that did used correlation [19,28], some filter methods [18], permutation importance [25], and embedded MaxEnt variable importance [29]. Finally, there are various SDM approaches [24], that model the distribution of multiple species and their interrelations simultaneously. These approaches are out of scope of the current work.
As it can be concluded from the previous works and according to Leidenberger et al. [20] and Elith et al. [31], there are some limitations to SDMs and most of them can be improved in a number of ways. Typically, there are insufficient observations and a limited number of environmental variables available to train effective machine learning models, presence-absence data are unbalanced, absence data are often not available or artificially created, resolution is poor, and there is collinearity in the data. Therefore, the principal aim of the present work is to overcome these challenges.
More specifically, the herein developed model was trained with roughly ×3.5 times more observations than the average related work (improvement over observations). With the use of feature engineering, 6830 features were extracted and subsequently, feature selection was performed, leading to the selection of the most important ones (improvement over features and collinearity). Finally, a machine learning algorithm that has never been applied in previous SDM literature, namely, XGBoost [32] was used. XGBoost offers regularization preventing the model from overfitting, it can handle missing values and it provides the optimum number of boosting iterations in a single run, minimizing the time needed for its performance. Instead of using presence-absence data for binary classification, like most studies do, the problem was transformed to a regression one (improving the unbalance of presence-absence datasets) and predict the probability of occurrence based on data provided by the AquaMaps database. Furthermore, fish species distributions were interrelated to oceanographic and environmental conditions, utilizing the high resolution Copernicus Marine Environmental Service (CMEMS) products [33] and the European Marine Observation and Data Network (EMODnet) [34] databases for the whole Mediterranean Sea. The best feature categories and environmental predictors for each species and overall were analyzed. Finally, with the use of the trained SDM, AquaMaps spatial resolution was improved (×8 higher) and high resolution maps (0.0625 arc-minutes) without any spatial gaps were constructed, covering the whole Mediterranean Sea, for the eight commercial pelagic fish species.

Study Area
All species observations and environmental variables are situated in the Mediterranean Sea. With regard to environmental conditions, Mediterranean Sea exhibits a unique structure as evaporation exceeds river flux and precipitation, leading to the intrusion of Atlantic water masses. The limited river discharge makes the Sea predominantly oligotrophic to ultra oligotrophic. Salinity decreases from east to west; temperature decreases from north-west to south-east; thermal boundaries in deep waters are absent [35], and areas near the coast are affected locally by human activities [36]. The current oceanographic, environmental, and climatic challenges of the Med, together with the presently operating Observing and Forecasting Systems and existing platforms which offer downstreaming services for users and stakeholders, are extensively reviewed by Tintoré et al. [37]. According to Reference [38] there are approximately 17,000 marine species occurring in the Mediterranean Sea, while AquaMaps [39] provides distribution ranges for 1971 fish species and invertebrates.

Species Data
In this study, eight commercial fish species were chosen for analysis and prediction. This selection was based on the following criteria: (i) these species represent the most common commercial fish in the Mediterranean Sea, occurring in a plethora of locations, (ii) they are pelagic species living in different depth ranges, and (iii) they demonstrate a variety of preferred environmental conditions, especially at the upper parts of the water column. It is a challenging task to create a single generic model that performs the best for all species. Most of these pelagic fish species are small and medium-sized, and the depth range of each appears in Table 1: European anchovy (Engraulis encrasicolus), European pilchard or sardine (Sardina pilchardus), round sardinella (Sardinella aurita), Atlantic chub mackerel (Scomber colias), Atlantic mackerel (Scomber scombrus), picarel (Spicara smaris), Atlantic bluefin tuna (Thunnus thynnus) and Swordfish (Xiphias gladius). Anchovy is a summer spawning (April to September in the Mediterranean), fast growing, short living, and early maturing small pelagic fish [40,41]. It is a low trophic level planktivorous species with preference for warm water temperatures [9]. Together with sardine, they constitute around 40% of the total marine fishery landings in the Mediterranean, a percentage that fluctuates depending on the country and which is highest along the northern Mediterranean coastline [9].
Sardine is a winter spawning (October to March in the Mediterranean), fast growing, and early maturing small pelagic fish with a short lifespan [41] that primarily feeds on zooplankton and is preyed upon by large pelagic fish, marine mammals, and seabirds. It generally prefers low water temperatures and constitutes around 20% of the marine fishery landings in the Mediterranean Sea, especially along the northern coastline [9]. Sardine and anchovy alternate in high abundances and landings [42].
Round sardinella is also a fast growing and early maturing small pelagic fish with short lifespan, which spawns over a narrower (May to July in the Mediterranean Sea) time period [43]. It feeds mainly on zooplankton [44] and has a strong preference for warm waters [9]. Although its abundance is increasing across the Mediterranean Sea as a result of sea warming (western Mediterranean: [45]; eastern Mediterranean: [46]), round sardinella's contribution to marine fishery landings is confined to the southern Mediterranean coastline.
Atlantic chub mackerel is a medium sized pelagic fish, which spawns over late-spring/early -summer months (May to June) in the Mediterranean Sea [47]. Its growth rate is medium and its lifespan is around 7 years [48]. It is a high trophic level species as it feeds on fish and zooplankton [48] and prefers warm waters [9]. Atlantic chub mackerel is a commercial species in many Mediterranean countries.
Atlantic mackerel is a medium sized pelagic fish, which spawns during winter (December to March) in the Mediterranean Sea [47]. It grows at a medium rate and may reach 17 years of age [48]. It is positioned high in the food web and prefers cold waters [9]. Similar to chub mackerel, Atlantic mackerel is a commercial species in most Mediterranean countries.
Picarel is a small-sized hermaphroditic spring spawning species (April to May), which spawns on sandy bottoms where nests are excavated [49]. It grows fast, has a short lifespan, generally prefers cold water temperature, and is positioned low in the food web [48]. In some countries of the Mediterranean, picarel contributes to the total landings.
Bluefin tuna is a highly migratory large pelagic fish with a very high commercial value. Its eastern Atlantic population spawns in the Mediterranean Sea [50] at the age of 3 years and at a size exceeding 100 cm [51]. Its stock, which is subject to dynamic management with a total allowable catch (TAC) allocated each year, collapsed in the late 2000s [52] but appears to be recovering after drastic decrease of TAC.
Swordfish is also a highly migratory large pelagic fish with a very high commercial value, which spawns in the eastern Mediterranean Sea [53] in May and June [54]. Its stock is subject to dynamic management with a total allowable catch (TAC) allocated each year.
Species probability of occurrence data were retrieved from AquaMaps. AquaMaps utilizes large sets of occurrence data (Figure 1), derived from a mixture of online, freely available collection databases (such as GBIF and OBIS), independent knowledge on species distribution, and fishery habitats (from FishBase). These occurrence data points are combined to species-specific envelopes of environmental preferences, such as minimum, mean and maximum depth, depth standard deviation, distance from land, ocean area, annual mean ice concentration, annual mean primary production, annual minimum, mean and maximum sea surface temperature, sea surface temperature standard deviation, sea surface temperature range, annual mean sea bottom temperature, annual minimum, mean and maximum salinity, and annual mean bottom salinity, to determine the suitability of a given area in the ocean for a particular species [55]. Predictions of relative probabilities of species occurrence are provided by AquaMaps, illustrated as color-coded species range maps in a global grid of half-degree latitude and longitude cell dimensions. AquaMaps predictions have been validated using independent and effort-corrected survey data [56]. AquaMaps was based on global, coarse environmental databases (e.g., NCEP, World Ocean Atlas, World Ocean Atlas Bottom Source Information, SeaAroundUs and more), and as a result the relative "probability of occurrence" per species is also coarse. Thus, in this work all Meditterranean Sea-filtered locality records, each with probability of occurrence, were collected from the expert-reviewed AquaMaps datasets. These datasets were the latest available (2016 and later), and were referring to the same time-frames as the environmental variables selected in Section 2.3. These initial datasets include gridded coordinates, as well as the overall probability of the species existing in that location. In contrast to other related works [17,18,22,23], the datasets used in this study include a larger number of observations, ranging from 596 for sardine to 1237 for bluefin tuna.

Environmental Variables
Species Distribution Modelling involves the merging of species occurrence data with environmental layers collected from other sources. Each longitude-latitude occurrence pair must be matched with the environmental conditions taking place in that exact or closest location. Environmental variables are known in the bibliography with various names such as variables, parameters, indicators, predictors, and features. In order to choose the most appropriate parameters, a domain knowledge from experts is needed, and depending on the individual problem to solve, researchers select the most appropriate set of variables for their experiments. Each environmental problem is unique and requires to be treated differently in terms of parameter selection. However, some variables are extensively used in most related works, such as sea surface temperature, salinity, bathymetry, and distance to coast. In this work, fifteen initial predictors were selected based on the variables that have been reported to affect fish distribution. The use of these particular environmental variables is based on data availability and ecological degree of relevance for fish species.
There are several online resources that give access to marine environmental variables. Most species distribution models use low resolutions for their models. Having high resolution, spatially explicit, continuous data is rare in marine environments [21]. However, for the Mediterranean Sea there are two platforms that provide high resolution envelopes for environmental predictors describing marine conditions. These are the Copernicus Marine Environmental Service (CMEMS) database [33], which was used for the extraction of variables changing in time, and the European Marine Observation and Data Network (EMODnet) [34], which was used for the bathymetry and substrate datasets.
Variables chosen for this work are divided into two categories, temporal and static. The temporal include temperature, salinity, dissolved oxygen, meridional current, zonal current, chlorophyll-α, euphotic depth, secchi disk depth, wave height, nitrate, and phosphate. Static include distance to coast, distance to major river mouths, bathymetry, and substrate. Temporal variables are extended into time and vertical space. In particular, monthly-mean data with a timespan of 10 years, from January 2008 to December 2017 were collected resulting in 120 values for each variable at each location. Also, because the selected species are pelagic, 2 depth levels were considered. These include values for sea surface and 100 to 300 m mean. Table 2 summarizes the environmental variables used. More information about such variables can be found in Reference [27]. Table 2. Environmental variables. * These three spatial resolutions are in kilometers (1 × 1, 4 × 4, and 4 × 4 km, respectively), and they are converted in arc-minutes for comparability; one arc-degree at the Mediterranean is roughly 111 km.

Models and Evaluation
A number of machine learning models and feature selection techniques were used in this work. First of all, 10 regression algorithms were tested during the initial experiments, both linear and non-linear, in order to pick the overall best performing one among species. The algorithms used are the following: Linear Regression (LR), Least Angle Regression with Lasso (LARS), Support Vector Machine Regression (SVR), Decision Tree (DT), Random Forest (RF), Extremely Randomized Trees (ET), Adaptive Boosting (AdaBoost), Gradient Boosting (GBM), Extremely Gradient Boosting (XGBoost) and Light Gradient Boosting (LightGBM). 5 feature selection techniques and an ensemble one were used to rank the features according to their importance in determining pelagic fish species distribution over the Med. The goal of feature selection was to eliminate irrelevant and noisy features, by keeping the ones that can predict the target best. This technique decreases the computational time, improves the performance of the algorithm, avoids over-fitting, and creates better general models.
There are three types of feature selection methods: filters, wrappers, and embedders. Filters are used before machine learning, and they rank the features based on their intrinsic properties, like variance and correlation. Wrappers use a machine learning algorithm as a black box evaluator to rank the features. Embedded methods combine filters and wrappers by first computing statistical measurements on the data and then using an algorithm. We used two filters, Mutual Information and Fisher Score, one wrapper namely SHAP and two embedders using Random Forest and XGBoost. The ensemble technique followed is called Reciprocal Ranking and computes the final rank r( f ) of a feature f as follows: where j = 1, 2, . . . , N are the feature selection methods, and r j ( f ) is the rank of the feature according to the jth method.
We split the datasets in train (80%) and test (20%) set. For training and validation we used 10-fold cross validation. Validating a model with cross validation reduces the danger of over-fitting and results in a better general model that can have high performance on an unknown dataset. Root mean square error (RMSE), representing the mean standard deviation of prediction errors, was chosen as the metric for all experiments, since the target is a continuous variable.

Results
This section describes all the actions taken for the completion of this work. Figure 2 serves as a graphical abstract of the whole process. It shows the collection of data from AquaMaps and the merging with the environmental envelopes, the creation of the panel dataset, its transformation with feature engineering to the final dataset, the selection of the top variables, and the prediction with XGBoost on the whole Mediterranean Sea with high resolution. Graphical representation of the proposed procedure. First, data are collected from AquaMaps, which make the target variable. For these points, environmental parameters that serve as dependent variables are extracted from platforms such as CMES and EMODnet. Data are spatially filtered in the Mediterranean Sea. Next, feature engineering is performed, and after that, feature selection to reduce the number of variables to the optimum. Finally, using the model that was trained, predictions for the probability of the species are performed in every point of the Mediterranean at a fine resolution. Example values are for Engraulis encrasicolus.

Dataset Creation
Several steps were taken to create the dataset. Firstly, the AquaMaps website was accessed to download the initial data for each species. These data include the coordinates of the observation and the probability of occurrence in that location. A spatial filter was applied to remove all points outside the Mediterranean Sea. Having the locations, third party datasets that provide environmental variables were merged with the probability of occurrence data. The environmental variables can be seen in Table 2 and the publicly available sources that give access to them are CMEMS and EMODnet. The predictors include the coordinates and the values of the measured variable. A Python code was developed to compare the coordinates between the initial dataset and each environmental variable in order to find the point with the nearest distance. That point's value was added to the new dataset. Also, as CMEMS provides water column data and the studied species are pelagic, for each predictor, the mean from 100 to 300 m depth per parameter was also computed as extra feature to test its strength. The output of this procedure was the full dataset, with the species' coordinates, the probability of occurrence reported in AquaMaps, and the environmental variables describing each observation from third party datasets.
Another important part of this work is that variables were not limited in only a single value per observation, but instead temporal data were also considered. In particular, monthly values for a period of 10 years from 2008 to 2017 were extracted, resulting in 120 values per observation. This data format is called panel data and includes several rows in the dataset for each observation. It is not possible to perform machine learning in such data, because each observation should have a unique row to train. Thus, the next step is to transform this panel dataset into the final dataset for machine learning.

Feature Engineering
Feature engineering is the process of creating new features from the existing ones. Some of these features might prove very efficient for the regression algorithm and can help the prediction of the target variable. Since time series data were considered, a very large number of features was created. These features include descriptive statistics like mean, median, standard deviation, maximum, minimum, maximum to minimum difference, skewness, kyrtosis, mean absolute deviation, and several quantiles. All of them were computed at annual and monthly basis. For example, a feature can be 'temperatureSurface_April_mean'. Time series and signal related features were also computed, like moving and expanding averages, maximums, minimums, medians, quantiles, skewness, mean change rate, short and long term averages, Hilbert transformation, Hann window, spline interpolation, Gaussian spline interpolation, trend, median-filtered mean, Wiener-filtered mean, Savitzky-Golay mean, downsample, detrend, relative minima and maxima, local minima and maxima and features that take into account spatial neighbors of each observation.
The above-described feature engineering procedure created a total of 6830 features. This is a huge number compared to the number of observations (≈902 on average). Having seven times more features than samples can cause overfitting and it is computationally inefficient. Thus, the next vital step is to perform feature selection to get the feature importance and select the very top performing ones. The reason for creating so many features is because it is impossible to know a priori which features are the best, and to have a diversity in the feature selection process. But before proceeding to feature selection, a machine learning algorithm has to be chosen.
As stated in Section 2.4, ten regression algorithms were compared using all features (6830) in the test set. The test set is consisted of randomly picked areas from the dataset (20% of initial dataset). The Python language was used, along with the libraries scikit-learn, xgboost and lightgbm. The parameters of the algorithms were left to their default values. This is the baseline experiment and its outcome is shown in Table 3. The overall best algorithm among species is the XGBoost and is the one that will be used for the rest of the paper. From these results we can observe that the boosting models perform similarly best, while simpler models like Decision Tree and the linear models have the lowest performance. Species Distribution Modeling data seem to favor nonlinear models. Nonlinear regression is much more flexible in the shapes of the curves that it can fit and this kind of data cannot be easily modeled with simple linear models. Based on the above results, the XGBoost, Gradient Boosting or LightGBM are recommended for similar problems.

Feature Selection
The goal of feature selection is to rank the features according to their importance. First, we used five selection algorithms for this purpose and ensembled their rankings using Reciprocal Rank as stated in Section 2.4. The ranking of the latter is usually superior of the individual ones, and is the one that was used in this work. The reason of its success is firstly that it takes as inputs rankings of diverse feature selection techniques, where the disadvantages of one might be compensated by another, and secondly that it is a harmonic mean formula. This means that if one of the five algorithms ranks a feature among the top and the other four rank it lower, then it will remain in a high position on the final ensemble ranking since the harmonic mean is biased towards the smaller of the numbers. Figure 3 gives a graphical insight of the exceptional strength of feature selection and the boost that it offers to the regression problem. It is a backward stepwise selection process, starting with all the features and eliminating them according to their rank of diminishing importance. All runs used 10-fold cross validation for training and evaluation of the model using XGBoost. A clear pattern to each species seems present. Using the baseline of all features (leftmost side of plots), the Root Mean Square Error is relatively high and stays there up to around 70% of features removed (0.3 feature cutoff). From that point it decreases, until it reaches its lowest value at a percentage of around 2% features retained (i.e., around 136 features). The 2% percentage slightly varies between species (from 1.5% to 2.5%), but as a general model is constructed, the, 2% cutoff will be used. If even less features are used, then performance drops (RMSE increases again, drastically). This behavior is similar for all species, regardless of the actual RMSE values, which vary depending on the species. A reason for this outcome is that all datasets were created in a similar manner. The feature engineering procedure created the same features for each species. This does not mean that the top 2% of them are the same. Each species, varies on the predictors that are affecting its distribution. The baseline runs for all species using the XGBoost algorithm are improved, as it can be seen in Table 4, from 26.3% for aurita to even 45.5% for thynnus. The average performance boost for the eight species is about 34% on the test set. As it can be seen with this evaluation, the results on the validation and the test set are similar, indicating lack of overfitting. As a final step in the modeling procedure, the initial classifiers from Section 3.2 are again deployed, this time using the top features as defined by the process of feature selection. The test set is the same as before (20%), when XGBoost was used. The results are available on Tables 5 and 6. It is clear that feature selection improves all algorithms except Extra Trees Regression (improvement in only 3 species), Support Vector Regression (improvement in 6 species), and Linear Regression (improvement in 7 species). The improvement in all algorithms was of a considerable degree. Some of them had an exceptional increase in performance, like Linear Regression (42% improvement over baseline). The best models are again the same as in the initial comparison. These are XGBoost, Gradient Boosting and LightGBM, with XGBoost performing again slightly better on average. The results indicate that the ensemble feature selection used in this work is a consistent method, that improves almost all algorithms and by a high margin. Finally, a model has to be chosen to predict the probability of occurrence of each species in the whole Mediterranean, and that model will be XGBoost.

Feature and Predictor Importance
Having the model trained and all features ranked with Reciprocal Ranking, valuable information about which features are meaningful for such problems can be extracted. Moreover, the variable that refers to a predictor may be derived using a mathematical formula. For example, 'temperature Surface April mean' is a feature that refers to the predictor 'temperature'. Our feature engineering procedure constructed various features for each predictor. The herein produced features belong to several categories. As an example, the aforementioned feature belongs to the categories 'mean', 'month', and 'surface.' It is difficult to know a priori which categories perform better, so through feature engineering all possible features were created, knowing that feature selection will be able to select the top ones. However, after the experiment an importance analysis for each feature category may be performed, based on its Reciprocal Ranking score. In order to do this, all features that fall within each category were collected, and a descriptive statistic was calculated. One option for this statistic would be their mean score. However, the mean is prone to outliers and is not representative. Another option would be to pick the single best (minimum score) feature for each category. This is also not a good choice, because some feature categories might have more features than others. So, the 2% quantile was selected as the most appropriate descriptive statistic to extract a metric for the overall performance of each category. The 2% quantile is compatible with the best model which uses the top 2% of features. The feature relative strength analysis is depicted in Figure 4. Lower numbers have greener colors and represent the feature categories with the lower 2% quantile of Reciprocal Ranking. In addition, Figure 5 depicts the strengths of the top-10 predictors for each species.

Improving Aquamap's Resolution And Coverage
After feature selection, our data are ready to be fitted in the XGBoost model. For each species, the data fitted consisted of all the observations (train + test) and the top 136 features. A prediction/projection dataset was prepared with 67, 000 points covering the whole Mediterranean Sea. The resolution of this dataset is 0.0625 • × 0.0625 • , as it is the lowest available resolution from our data sources (CMEMS, EMODnet). For all these coordinates, the environmental variables were extracted in the same manner as the in training datasets. Finally, the probability of occurrence for each species in this dataset was predicted. The results can be seen in Figure 6. Through the above described machine learning process, we managed to create a single generic model and produce high resolution maps covering the whole Mediterranean basin for the eight most commercial pelagic fish species.

Discussion
Understanding the environmental processes influencing fish species abundance is important in order to manage the fishing grounds and offer advice on the exploitation of this important resource. It is evident that out of the 6830 features that our feature engineering process generated, about 136 (or 2%) of these are the most important and are advised to be used to reach the lowest possible prediction error. Most of the other species distribution studies, use about 5 to 20 features for their models, which are an order fewer than the 136 used in the present model. This happens because (a) very few studies (if any of them) use time series features, (b) they do not generate such a huge number of features to choose from, (c) they empirically pick the features depending on their domain knowledge and data availability. The present approach is different from the existing studies in this respect. It is not limited to basic features like maximum sea surface temperature or mean salinity, but it expands upon very detailed time series features. It suggests that about 2% of such features are appropriate for all our datasets, and if less predictors are used, then performance worsens. For overfitting and generalization purposes the exact number of features that minimizes their RMSE for each species is not used, but rather a better bias-variance trade-off feature cutoff is preferred.
Having very few features, the model's reliability diminishes, since these features are very specialized, and case only the very top-ranked ones are kept, whole predictors like chlorophyll or nitrate cease to exist in the model. Thus, having only variations of temperature is not enough. The whole feature selection process not only boosts execution time and performance, but also gives valuable insights into which variables are most important for each species. These insights are depicted in Figure 4, where the importance of feature categories generated by the feature engineering process is visible.
The figure clearly shows which feature categories are worth investing time in feature engineering and which are not. The neighbor-based features seem to be the strongest ones. This is the only hand-crafted feature and it computes the minimum, mean, maximum and standard deviation of the eight cells that surround the given one on all predictors. For example, the feature 'temperatureSurface_neighbor_mean' calculates for every observation the mean surface temperature of its eight neighboring observations. Surrounding cells have been mentioned before [28] that might influence the distribution of species. It is evident that the probability of species presence is greatly affected by the conditions of the neighboring environmental parameters. This means that the species live in large areas next to each other and not in small isolated ones with different conditions. The importance of neighboring features comes in agreement with the fact that longitude and latitude predictors are also vital as it will be demonstrated later on. In the same manner, features with extreme values depict strong importance. Minima, maxima and quantiles are much stronger in terms of importance than means and medians. These results are in line with the bibliography, as the authors of Reference [57] state that quantiles near the maximum create good features when other variables are not limiting, and Reference [58] claims that low quantiles are relevant to estimate the lowest recruitment level for a species. Another feature of descriptive statistics, which is of major importance is skewness.
It is very common for researchers to use mean-monthly data [16,59] and only rarely mean-annual data. In this study, it was demonstrated that both feature categories are strong, with the monthly features being slightly better. This happens because some stages in fish life, such as breeding and migration, occur in certain months. By comparing the surface values to the mean of 100 to 300 m, it becomes clear that the latter falls behind. Conditions found below the surface are rarely accurately known [60], thus it is advised to use surface values. All feature categories from the signal processing area related to filters are the worst performing ones. The mean expanding (exponential window) and classical sta-lta (Short Time Average over Long Time Average), which is used in seismic events, obtained very high importance. One would expect that relative minima and maxima and peaks would be good feature categories, like the minimums and maximums of the descriptive statistics. This is not true, as all the temporal environmental variables have high seasonality and the peaks are insignificant.
After having investigated the features and their respective predictors that were considered as most valuable by the Reciprocal Ranking technique, in Figure 5 the top 10 predictors for each species are depicted in alphabetical order. The plots are qualitative and normalized so that the 10th predictor would be at the center of the plot. Every decagon represents one unit in terms of Reciprocal Ranking importance. For example, anchovy zonal current is five times less important than distance from the coast. Longitude is among the top predictors for every species, in agreement with the western-to-eastern gradient in the trophic conditions of the Mediterranean Sea. This longitudinal gradient represents changes in water temperature, chlorophyll concentration and mixed layer depth over the epipelagic layer [61]. The Anchovy is mostly affected by the distance from the coast, the temperature and the euphotic depth; pilchard by water temperature, salinity, secchi disk depth and distance from the major rivers; round sardinella by salinity, temperature, distance from the coast and dissolved oxygen; atlantic chub mackerel by salinity, temperature, latitude, nitrate and distance to from the coast; atlantic mackerel by salinity, temperature and chlorophyll-α; picarel by temperature, zonal current, salinity and latitude; bluefin tuna by salinity, nitrate and latitude, and swordfish by salinity, temperature and latitude.
Because of their fast life history strategy (rapid growth, early maturation, short lifespan), small pelagic fishes and especially their recruitment success [10], hence their distribution, abundance and fishery catches, are vulnerable to climate and environmental forcing [12,13,42]. However, the decline in landings of most small and medium pelagic species is only partially attributed to climate and environmental factors because, especially in the Mediterranean Sea, over-exploitation still remains the main driving force of their populations [9,62]. Indeed, overfishing has been reported to modify the abundance, composition, and distribution of pelagic species, but also to induce drastic changes of state [63].
Following the above-described west-to-east gradient, the probability of occurrence of small pelagic fishes was higher in the western Mediterranean Sea and declined eastwards ( Figure 6) with the exception of picarel, which was abundant throughout the basin. For medium and large-sized pelagic fishes, the probability of occurrence was higher in certain areas of the western and northern Mediterranean, while they were completely absent from the southeastern part of the basin, as a results of the raised temperature and salinity and the low chlorophyll levels at the surface layer ( Figure 6). Small and medium pelagic species are generally concentrated in areas of high productivity because most of them are mainly plankton feeders [64]. These areas are associated with cooler and fresher nutrient rich water masses that could be either upwelling areas (e.g., west African coast: [63]), coastal areas affected by riverine input (e.g. NW Mediterranean: [65]) or areas affected by both riverine input and other water masses (e.g. Black Sea Water influx in the northern Aegean Sea: [66]). These conditions are typical for the northern Mediterranean coastline and create a northwestern-to-southeastern gradient in productivity. Indeed, biological productivity in the Mediterranean basin has been reported to decrease from north to south and from west to east and it is inversely related to the increase in temperature and salinity [67] indicating that the Mediterranean Sea is highly heterogeneous between its basins [68]. This gradient makes longitude appear vital in the distribution of pelagic species and even latitude seems significant, despite the narrow latitudinal axis of the Mediterranean due to riverine input that mostly affects the northern coastline. Changes in primary productivity, the composition of plankton community and the abundance of key plankton species, that may be climate-driven [69], directly affect the distribution of small pelagic fishes, which preferentially feed on zooplankton, such us anchovy and sardine [70], but they could also indirectly affect their somatic condition [10]. Enhanced primary and secondary productivity benefits these planktivorous species by increasing the availability of their prey (bottom-up control), but at the same time improves their somatic condition [71]. Although NW Mediterranean is richer in pelagic fishes, sharp regional gradients and gaps in the probability of occurrence exist, mostly in anchovy and round sardinella distribution (Figure 6a,c). Strong currents and meso-scale eddies in Alboran Sea, the Tyrrhenian Current and the Liguro-Provençal Current favor the presence of these species. In the Gulf of Lions the freshwater impact of Rhone River and the up-and downwelling events explain the sharp differences in fish species occurrence.
According to the findings of the present work ( Figure 5, Table 7), temperature and salinity are the main drivers of the distribution of most small and medium pelagic fishes in the Mediterranean. The spatial distribution and the abundance of small and medium pelagic species may be directly affected by sea surface temperature (SST) [72], but the effect of SST can be also indirect through changes in the planktonic components of the food webs [13] that constitute the main prey for small pelagic fishes [64]. The effect of SST, however, is not uniform across species and it depends on their thermal preferences [7], which may vary among the pelagic fishes [9]. Sardine, for example shows preference for colder waters compared to round sardinella and anchovy, and appears to confine its distribution and shrink its spawning grounds to colder waters when SST increases [46], a condition that affects its fisheries. This negative relationship between sardine landings and SST, indicates that the long-term temperature changes in the Mediterranean could have a negative impact on sardine abundance [72]. In contrast, there is a positive relationship between sardine landings and chlorophyll concentration in the Alboran Sea [72] which has also been related to other areas of the Mediterranean Sea [73].
The distribution of large pelagic fishes may also be associated with various environmental conditions, despite their highly migratory activity. Several populations of swordfish have shifted latitudinally, whereas the Mediterranean population has shifted longitudinally towards the west, as a result of climate change [74]. Local conditions, such as clusters of higher density occurring near converging fronts and strong thermoclines may also affect swordfish distribution at a more local scale [75]. The regional distribution and abundance of Atlantic bluefin tuna has also been recently reported to be affected by climatic oscillation and water temperature [76]. Again, overfishing plays a crucial role in distribution and abundance patterns of large pelagic species, since the stocks of these two species are among the most valued and commercially exploited globally [76]. Salinity may also play a role in the distribution of small and medium pelagic fishes, yet along the northern Mediterranean coastline, where the pelagic fishes are mostly abundant, salinity is greatly influenced by precipitation and riverine input, as well as by the inflow of Black Sea water in the Aegean Sea [66]. Thus, other processes including precipitation and runoff are also involved affecting salinity and in turn the distribution of small pelagic species [72]. Anchovy larvae have been reported to preferentially occupy coastal areas, which are areas that are often influenced by river plumes [77]. The effect of precipitation is stronger for pelagic fishes that are distributed along the coast (such as sardine and anchovy [78,79]) and may also affect their catches [14]. Species with a more oceanic distribution such as the scombrids and swordfish are less impacted.
In the NW Mediterranean, sardine and anchovy have been fluctuating in synchrony for over 30 years [70,80] rather than alternating in high abundances, as globally observed for anchovy-sardine coexisting populations [81,82]. Local environmental conditions, including river runoff, wind mixing, sea surface temperature and chlorophyll concentrations, influenced by climatic oscillations [80,83] have been reported to control the fluctuations in abundance of these species in this area [14,65,84] and probably explain the high probability of occurrence for both species in the western Mediterranean. A regional index, the Western Mediterranean Oscillation index, which has been developed to explain the precipitation variability of the Iberian Peninsula [85], seems to represent well the suitable environmental conditions for sardine and anchovy in the west Mediterranean [80].
Furthermore, in the case of small pelagic fishes, the spawning areas and larval distributions are also highly related to environment and recruitment success, and they may determine adult abundance and affect spatial distribution. At the same time, inter-specific competition for resources may provide advantage for the species that has spawned earlier or is more abundant [86]. For example, when outnumbered, round sardinella larvae are concentrated in areas where competition is minimized because the food availability would be higher [87,88]. This behaviour is a characteristic of opportunistic and easy to adapt species [89], such as round sardinella. Similar results have been reported for the NW Mediterranean coast [87], with the less abundant round sardinella larvae occupying the less favourable for survival areas, in order to avoid potential competition with anchovy. Round sardinella larvae may be disadvantageous compared to anchovy , because their bathymetric distribution is limited to the upper 50m of the water column. Thus, they cannot feed on the deep chlorophyll maximum layer probably due to their inability to tolerate lower temperatures [90].
Finally, it is interesting to see what percentage of the Mediterranean Sea is covered by high probability occurrence for each species. Considering 80% and greater as high occurrence probability, following Figure 6, it was computed that anchovy covers 14.7% of the Mediterranean, sardine 17.8%, round sardinella 30.7%, Atlantic chub mackerel 2%, Atlantic mackerel 0.7%, picarel 82.3%, bluefin tuna 4% and swordfish 1.8%. Locations with constant species presence are the Adriatic Sea, the North Aegean Sea, the Alboran Sean, and the sea surrounding the coasts from South France to East Tunisia.

Conclusions
Throughout the present study we have presented a comprehensive Species Distribution Model (SDM) for eight commercial pelagic fish species in the Mediterranean Sea. Initial datasets were created by merging several external databases (Aquamaps, CMEMS, EMODnet) in order to produce a model capable of predicting the probability of occurrence in every coordinate pair in the Mediterranean. Ten regression algorithms were compared in order to identify the best performing one, namely XGBoost, which had never previously been used for SDM. With feature engineering, we explored several aspects of the time series data resulting in a large number of features (6830), followed by feature selection with the use of the ensemble Reciprocal Ranking method. After ranking the features, it was found that approximately the top 2% of features were constructing the best model with a 34% performance boosting for the eight species, on average. Feature selection, in combination with the evaluation schema of 10-fold cross validation, reduced the danger of over-fitting and created a robust general model with a cross-validated RMSE ranging from 0.22 to 0.07 depending on the species. After creating the model, the importance of feature categories and environmental predictors were further explored and discussed. The strongest feature categories are the neighbor-based ones, the features with extreme values (minima, maxima, quantiles), and the monthly and surface features, while the overall strongest predictors are salinity, temperature, distance to coast, dissolved oxygen and nitrate. Finally, using Aquamaps probabilities as target variable, it was improved in both coverage and resolution. High resolution habitat maps were created (×8 higher) for the whole basin (67,000 points instead of 902) which can be used by researchers to understand the link between marine fishes and environmental factors.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.