A Comparative Machine Learning Study Identifies Light Gradient Boosting Machine (LightGBM) as the Optimal Model for Unveiling the Environmental Drivers of Yellowfin Tuna (Thunnus albacares) Distribution Using SHapley Additive exPlanations (SHAP) Analysis

Ling Yang; Weifeng Zhou; Cong Zhang; Fenghua Tang

doi:10.3390/biology14111567

,

and

¹

East China Sea Fisheries Research Institute, Chinese Academy of Fishery Sciences, Shanghai 200090, China

²

College of Information Engineering, Zhejiang Ocean University, Zhoushan 316022, China

³

Graduate School of Chinese Academy of Agricultural Sciences, Beijing 100081, China

^*

Author to whom correspondence should be addressed.

Biology2025, 14(11), 1567;https://doi.org/10.3390/biology14111567
(registering DOI)

Version Notes

Order Reprints

Review Reports

Simple Summary

Tuna fisheries are a vital source of global protein, making it important to understand the key environmental factors that influence their distribution. This study aimed to identify which environmental conditions most affect where yellowfin tuna gather in the western and central Pacific Ocean. Using integrated fishing log data and 24 multi-source environmental variables, we applied and compared 16 machine learning regression models. The Light Gradient Boosting Machine (LightGBM) performed best and was selected to evaluate the influence of key environmental drivers. The results highlight that spatiotemporal and thermal factors are the most important predictors of tuna distribution. This research provides a reliable, data-driven framework to support sustainable fishery management, resource assessment, and operational forecasting.

Abstract

Fishery resources of tuna serve as a vital source of global protein. This study investigates the key environmental drivers influencing the spatial distribution of yellowfin tuna (Thunnus albacares) in the western tropical Pacific Ocean. A comprehensive dataset was constructed by linking the catch per unit effort (CPUE) from 43 Chinese longline fishing vessels (2008–2019) with 24 multi-source environmental variables. To accurately model this complex relationship, a total of 16 machine learning regression models, including advanced ensemble methods like Light Gradient Boosting Machine (LightGBM), Random Forest, and Categorical Boosting Regressor (CatBoost), were evaluated and compared using multiple performance metrics (e.g., Coefficient of Determination [R²], Root Mean Squared Error [RMSE]). The results indicated that the Light Gradient Boosting Machine (LightGBM) model achieved superior performance, demonstrating excellent nonlinear fitting capabilities and generalization ability. For robust feature interpretation, the study employed both the model’s internal feature importance metrics and the SHapley Additive exPlanations (SHAP) method. Both approaches yielded highly consistent results, identifying temporal (month), spatial (longitude, latitude), and key seawater temperature indicators at intermediate depths (T450, T300, T150) as the most critical predictors. This highlights significant spatiotemporal heterogeneity in the distribution of Thunnus albacares. The analysis suggests that mid-layer ocean temperatures directly influence catch rates by governing the species’ vertical and horizontal movements. In contrast, large-scale climate indices such as the Oceanic Niño Index (ONI) exert indirect effects by modulating ocean thermal structures. This research confirms the dominance of spatiotemporal and thermal variables in predicting yellowfin tuna distribution and provides a reliable, data-driven framework for supporting sustainable fishery management, resource assessment, and operational forecasting.

Keywords:

yellowfin tuna; catch per unit effort (CPUE); environmental drivers; machine learning; feature importance analysis; SHapley Additive exPlanations (SHAP); explainable machine learning; comparative analysis; Light Gradient Boosting Machine (LightGBM); ensemble learning

1. Introduction

Fishery resources of tuna serve as a vital source of global protein and a key pillar supporting the economic development of coastal nations. Their sustainable exploitation and refined management have long been the focus of international attention []. Within the practices of resource assessment and fisheries management, fishery forecasting stands out as a core technological approach. Its primary objective is to dynamically predict the distribution of target species by modeling the response relationships between environmental factors and fishery resource abundance. This process involves not only multivariate and multi-scale ecological modeling but also demands that models possess robust spatiotemporal adaptability and explanatory power.

Yellowfin tuna is a highly migratory, large-bodied pelagic species of significant commercial value, widely distributed across tropical and subtropical oceans. It exhibits high sensitivity to environmental variability []. Previous studies have demonstrated that the habitat selection of Thunnus albacares is shaped by complex, nonlinear responses to a range of environmental variables, including sea surface temperature (SST), chlorophyll-a concentration (Chl-a), and large-scale climate indices []. However, traditional statistical models based on linear assumptions often fail to capture the synergistic interactions among multidimensional environmental variables and the inherent spatiotemporal heterogeneity of marine ecosystems []. As a result, these models exhibit limited predictive capacity and ecological interpretability, leading to substantial uncertainty in fishery forecasting outcomes.

Catch per Unit Effort (CPUE) is a key indicator for assessing fishery resource abundance. CPUE refers to the amount of fish caught per standardized unit of fishing effort. For tuna longline fisheries, the fishing effort is typically measured in terms of the number of hooks, usually in thousands of hooks. It not only reflects the dynamic changes in the target population but also indirectly captures the regulatory effects of environmental factors on fishing efficiency []. Although CPUE has been widely applied in fisheries science and resource assessment, current research on feature selection faces two primary challenges. First, modeling approaches remain largely dominated by traditional linear regression techniques, with insufficient systematic evaluation of emerging machine learning algorithms such as ensemble learning. In particular, there is a lack of comparative studies on their abilities to capture nonlinear responses and interactions among features []; Second, the selection of predictor variables often relies on expert-driven or empirical screening, with limited use of model-based variable importance metrics. This may result in the underestimation of key drivers or the overestimation of redundant variables, ultimately compromising the ecological interpretability and predictive accuracy of the results [].

To address these issues, this study focuses on yellowfin tuna in the western and central Pacific Ocean and establishes a model comparison framework encompassing 16 regression algorithms to systematically evaluate the applicability and predictive performance of different approaches in CPUE modeling. The modeling suite includes a range of methodologies, covering linear regression, decision tree models, ensemble learning techniques, and multilayer perceptron (MLP) neural networks. Particular emphasis is placed on the Light Gradient Boosting Machine (LightGBM) algorithm, which effectively addresses common challenges in fishery datasets, such as sample imbalance, multicollinearity among variables, and spatiotemporal dependencies. In addition, SHapley Additive Explanations (SHAP) values and the built-in feature importance metrics of the models are used to analyze the nonlinear influence mechanisms and spatial heterogeneity of environmental factors affecting CPUE, thereby enhancing the ecological interpretability of model outcomes.

2. Materials and Methods

2.1. Data Processing

2.1.1. Marine Environmental Data

This study focuses on the primary fishing grounds of longline fleets targeting Thunnus albacares in the western and central Pacific Ocean, specifically within the area bounded by 110° E to 170° W longitude and 0°to 30° S latitude. The fishery production data were obtained from the fishing logbooks of 43 distant-water longline vessels operated by the China National Fisheries Corporation (2008–2019). These logbooks contain key operational information, including vessel name, fishing date (year/month), fishing location (latitude and longitude), species composition, catch weight, number of individuals caught, and number of hooks deployed []. This dataset provides the foundational basis for constructing the CPUE indicator and analyzing its relationship with environmental variables. The spatial distribution of CPUE is shown in Figure 1.

Figure 1. CPUE distribution in the western and central Pacific Ocean.

The temporal dynamics of Catch Per Unit Effort (CPUE), plotted as monthly means with standard deviation ranges, revealed significant seasonal fluctuations (Figure 2). Specifically, CPUE values showed pronounced variability from May to August alongside rising temperatures, peaking in June and demonstrating a trend of higher CPUE during the warm spring and summer months.

Figure 2. CPUE monthly means data change line chart.

The environmental variables used in this study were obtained from the following sources: Chlorophyll-a (Chl-a) data were retrieved from NASA’s Ocean Color remote sensing platform (https://oceancolor.gsfc.nasa.gov, accessed on 6 November 2025); Sea Level Anomaly (SLA) data were provided by AVISO (https://www.aviso.altimetry.fr, accessed on 6 November 2025); Eddy Kinetic Energy (EKE) and temperature–salinity profiles from 0 to 500 m were obtained from the Copernicus Marine Environment Monitoring Service (https://dataspace.copernicus.eu, accessed on 6 November 2025). For the climate indices, the Southern Oscillation Index (SOI) and Arctic Oscillation Index (AOI) were sourced from NOAA’s Climate Prediction Center (https://www.cpc.ncep.noaa.gov, accessed on 6 November 2025), the Pacific Decadal Oscillation Index (PDOI) was sourced from the National Centers for Environmental Information (NCEI) at NOAA (https://www.ncei.noaa.gov/access/monitoring/pdo/, the Pacific Decadal Oscillation Index (PDOI) was obtained from the the National Centers for Environmental Information (NCEI) at NOAA (https://www.ncei.noaa.gov/access/monitoring/pdo/, accessed on 6 November 2025), and the North Pacific Gyre Oscillation Index (NPGOI) was published via the Copernicus data platform (https://data.marine.copernicus.eu, accessed on 6 November 2025).

All environmental variables used in this study have a temporal resolution of one month. In terms of spatial resolution, sea level anomaly (SLA), eddy kinetic energy (EKE), and temperature–salinity profile data were provided at a 0.25° × 0.25° grid, while chlorophyll-a (Chl-a) data were available at a spatial resolution of 4 km. To ensure consistency in analytical scale, all environmental variables were resampled to a standardized grid of 0.5° × 0.5° using Python-based spatial processing tools (v3.10.2). These resampled datasets were then spatially matched and integrated with the fishing location data, enabling a joint analysis of catch variability and environmental drivers.

2.1.2. Fishery Resource Abundance

Catch per unit effort (CPUE) is a key metric for evaluating fishing efficiency and resource abundance, and has been widely applied in fishery stock assessment and management studies []. To systematically analyze the spatiotemporal distribution of Thunnus albacares and the variability of its CPUE, the study area was divided into spatial grids of 0.5° × 0.5°. Based on longline logbook records, monthly statistics were compiled for each grid cell, including fishing effort (number of hooks deployed) and the number of individuals caught. These values were used to compute CPUE for each grid cell on a monthly basis. The CPUE was calculated using the following formula: where

C P U E_{(i, j)}

,

F_{f i s h (i, j)}

,

H_{h o o k (i, j)}

represent, respectively, the monthly average CPUE (number of individuals per thousand hooks), the total number of fish caught, and the total number of hooks deployed in the grid cell located at the i-th longitude and the j-th latitude.

C P U E_{(i, j)} = \frac{F_{f i s h (i, j)} \times 1000}{H_{h o o k (i, j)}}

(1)

2.1.3. Data Preprocessing

To analyze the relationship between catch per unit effort (CPUE) of Thunnus albacares and environmental variables, a comprehensive integration of multi-source datasets was first carried out. By spatially and temporally matching longline logbook records with environmental observation data, a unified dataset was constructed to ensure that each feature variable corresponded precisely to the observed CPUE at the same spatiotemporal coordinates. This process resulted in a curated dataset of 18,029 valid CPUE records. The CPUE values were then calculated, and relevant variables were screened accordingly. The final feature set consisted of 25 variables, encompassing three main categories: fishing operation parameters, oceanographic environmental factors, and climate anomaly indices. Specifically, these included:

(1) Fishing operation parameters: Catch per Unit Effort (CPUE) as a direct proxy for relative fish abundance, year (to account for long-term trends), month (to capture seasonal cycles), and latitude/longitude (to define the spatial context and static habitat features), formed the foundational data layer.

(2) Oceanographic environmental variables: Chlorophyll-a concentration (Chl-a), chlorophyll-a in the previous month (Chl_bf), chlorophyll-a in the following month (Chl_af); sea surface temperature in the previous month (SST_bf), and in the following month (SST_af); chlorophyll anomaly (Chldt), sea surface temperature anomaly (SSTdt), sea surface temperature gradient (SSTgrad), chlorophyll gradient (Chlgrad); sea level anomaly (SLA); eddy kinetic energy (EKE); and temperature at various depths including the surface (T0), 150 m (T150), 300 m (T300), and 450 m (T450).

Chlorophyll-a concentration (Chl-a) and its temporal lags (Chl_bf, Chl_af) served as indicators of primary production and the base of the food web. Sea surface temperature (SST_bf, SST_af) was included for its fundamental influence on physiological processes and thermal habitat suitability. Anomalies (Chldt, SSTdt) and horizontal gradients (Chlgrad, SSTgrad) of these two parameters were used to identify environmentally anomalous areas and productive frontal zones, which are known foraging hotspots. Furthermore, sea level anomaly (SLA) and eddy kinetic energy (EKE) were utilized as proxies for mesoscale ocean dynamics, such as eddies and currents, which affect prey aggregation and retention. Finally, temperature at various depths (T0, T150, T300, T450) characterized the vertical thermal structure, which is critical for defining the vertical habitat range and thermocline depth for pelagic species.

(3) Climate indices: Pacific Decadal Oscillation Index (PDOI), Southern Oscillation Index (SOI), Arctic Oscillation Index (AOI), North Pacific Gyre Oscillation Index (NPGIO), and the Oceanic Niño Index (ONI), which represents El Niño–Southern Oscillation (ENSO) phases. These indices modulate local oceanographic conditions, thereby exerting a bottom-up control on ecosystem productivity and species distributions over interannual to decadal timescales.

These features were categorized based on their environmental relevance to temperature, chlorophyll concentration, and oceanographic phenomena. As illustrated in Figure 3, the variables were grouped into six thematic categories: fishery indicators, ocean dynamics, thermal structure, temperature gradients, chlorophyll-related variables, and climate indices. Each subplot integrates kernel density estimation (KDE) curves with semi-transparent histograms, thereby simultaneously presenting the smoothed distribution trend and the frequency of the raw data.

Figure 3. Distribution of environmental feature data: (a) CPUE and Month; (b) SLA and EKE; (c) Temperature profiles; (d) SST gradients; (e) Chlorophyll-a metrics; (f) Climate indices.

The overlapping density curves allow for a visual comparison of intra-group parameter distributions, facilitating the identification of potential multicollinearity issues among variables. Meanwhile, the histogram frequency data provides insight into the spatiotemporal completeness of environmental sampling, particularly highlighting the prevalence of extreme values.

2.2. Research Methods

In this study, the catch per unit effort (CPUE) of Thunnus albacares was used as the response variable. Relevant fishery production records and environmental feature variables potentially influencing CPUE were collected and integrated into a comprehensive dataset. This dataset includes not only the observed CPUE values of Thunnus albacares but also a wide range of multidimensional environmental factors and climate indices that may affect fishing success. To ensure the robustness and validity of model training and evaluation, the complete dataset was randomly split into a training set (80%) and a testing set (20%) following an 8:2 ratio [].

To comprehensively evaluate and identify the key factors influencing CPUE, this study employed a comparative framework involving 16 representative regression models. These included linear regression, ridge regression, Lasso regression, elastic net, random forest regression, extreme gradient boosting (XGBoost), and Light Gradient Boosting Machine (LightGBM), among others. Based on multi-model comparisons that integrated considerations of fitting accuracy, generalization capability, and feature interpretability, LightGBM was ultimately identified as the best-performing regression model.

LightGBM exhibits strong feature selection capabilities by automatically shrinking the coefficients of features that are irrelevant or weakly correlated with the response variable to zero. This effectively eliminates redundant variables and enhances overall model performance. Moreover, the algorithm demonstrates robustness in handling high-dimensional features, nonlinear relationships, and multicollinearity—making it particularly well-suited for modeling complex environmental datasets in this study.

To further optimize model performance, cross-validation was employed to tune hyperparameters such as regularization strength. The selection of the optimal model was based on the quantitative comparison of multiple error evaluation metrics, including mean squared error (MSE) and the coefficient of determination (R²). The workflow of the modeling process is illustrated in Figure 4.

Figure 4. Flowchart of research methods.

2.2.1. Introduction to Regression Algorithms

As a fundamental statistical tool, regression modeling plays a crucial role in uncovering relationships among variables and in predictive analytics. By establishing mathematical functions that link a dependent variable to one or more independent variables, regression models not only quantify the contribution of each factor to the response variable but also help to reveal potential causal mechanisms through parameter estimation techniques [].

In this study, the marine environmental variables involved exhibit high dimensionality, heterogeneity, and complex nonlinear characteristics, making it challenging to comprehensively predict their effects on the catch per unit effort (CPUE) of Thunnus albacares using a single modeling approach. Therefore, to thoroughly assess the predictive performance of various regression models, a diverse set of representative supervised learning algorithms was employed. These models span several methodological categories, including linear models, instance-based learning methods, decision tree models, boosted tree models, ensemble learning techniques, and neural network-based approaches.

Linear models represent the most fundamental class of regression methods, based on the assumption of a linear relationship between input features and the target variable. These models are characterized by strong interpretability and computational efficiency. Among them, Linear Regression models the linear association between predictors and the response variable, featuring a simple structure and fast fitting and prediction speeds, making it suitable for linearly separable datasets [].

Ridge Regression introduces an L2 regularization term to address multicollinearity issues, improving model stability and enhancing generalization performance by penalizing large coefficients []. Lasso Regression, incorporating L1 regularization in the loss function, effectively shrinks less relevant coefficients to zero, thereby performing automatic feature selection and providing resistance to overfitting [].

The ElasticNet Regressor combines both L1 and L2 penalties, balancing feature selection and model robustness, and is particularly suitable for high-dimensional and sparse datasets []. Huber Regression, which employs the Huber loss function, provides robustness to outliers by reducing their influence while maintaining the linear structure of the model [].

Neighbor-based learning methods construct predictions based on distance metrics among samples without assuming an explicit functional form, making them suitable for small-scale datasets. The K-Neighbors Regressor predicts target values by referencing the distance to neighboring training samples. This approach is simple and intuitive, requiring no prior assumptions about data distribution. However, its computational efficiency significantly decreases in large datasets or high-dimensional feature spaces [].

Tree-based models operate by recursively partitioning the feature space to generate predictive rules, offering strong interpretability and the capability to model nonlinear relationships. Among them, the Decision Tree Regressor builds a hierarchical structure that splits the feature space into decision paths based on conditional rules. This method is easy to interpret and visualize, but is prone to overfitting, especially when dealing with noisy data [].

The Extreme Gradient Boosting Regressor (XGBoost Regressor) improves upon traditional boosting techniques by incorporating regularization and parallel optimization strategies within a gradient boosting framework. These enhancements significantly increase both model accuracy and computational efficiency, making XGBoost widely adopted in structured data modeling tasks [].

Boosted Tree Models utilize the boosting mechanism, which enhances overall predictive performance by iteratively combining multiple weak learners. The Light Gradient Boosting Machine Regressor (LightGBM Regressor) employs a histogram-based optimization strategy within the gradient boosting framework, substantially reducing memory usage and training time. This makes it particularly suitable for large-scale datasets []. The Categorical Boosting Regressor (CatBoost Regressor) is specifically optimized for handling categorical features by automatically performing encoding transformations, effectively mitigating overfitting issues. It is well-suited for modeling tasks involving a large number of categorical variables [].

Ensemble Learning Methods improve overall model performance by aggregating the predictions of multiple base learners, thereby effectively reducing both variance and bias. The Random Forest Regressor builds an ensemble of decision trees using randomly sampled training subsets and aggregates their outputs via averaging or voting. It is known for its strong robustness and resistance to overfitting []. The Adaptive Boosting Regressor (AdaBoost Regressor) iteratively trains weak learners by reweighting samples, making it suitable for capturing complex nonlinear relationships, although it tends to be sensitive to outliers []. The Gradient Boosting Regressor optimizes the model through residual learning, achieving high predictive accuracy; however, it requires careful hyperparameter tuning and incurs relatively high computational cost during training [].

Extreme Ensemble Methods enhance model diversity and generalization by introducing greater randomness on top of standard ensemble strategies. The Extremely Randomized Trees Regressor (ExtraTrees Regressor) increases diversity and robustness by randomly selecting both features and split points during tree construction []. The Bagging Regressor trains multiple base learners independently on different bootstrapped subsets of the training data and aggregates their predictions, effectively reducing variance and improving model stability [].

Neural Network Models possess strong nonlinear modeling capabilities, making them suitable for handling complex or high-dimensional data structures. The Multilayer Perceptron Regressor (MLP Regressor) constructs a deep feedforward neural network and uses nonlinear activation functions to capture intricate relationships between inputs and outputs. However, its performance can be highly sensitive to training data quality and hyperparameter settings [].

2.2.2. Model Performance Rating

(1) The Pearson correlation coefficient is a statistical measure used to quantify the strength and direction of the linear relationship between two continuous variables. Its value ranges from −1 to 1. As one of the most commonly used correlation coefficients, it is widely applied in scientific research, data analysis, social sciences, and engineering domains []. The calculation formula is given below, where N denotes the number of data points; x_i and y_i represent the i-th observations of variables x and y, respectively; i is the sample index;

\bar{x}

and

\bar{y}

denote their corresponding sample means:

ρ_{X Y} = \frac{\sum_{i = 1}^{N} (x_{i} - \bar{x}) (y_{i} - \bar{y})}{\sqrt{\sum_{i = 1}^{N} {(x_{i} - \bar{x})}^{2}} \cdot \sqrt{\sum_{i = 1}^{N} {(y_{i} - \bar{y})}^{2}}}

(2)

(2) The Jensen–Shannon Divergence (JSD) is a symmetric metric used to quantify the similarity between two probability distributions, P and Q. Unlike the Kullback–Leibler divergence, JSD is bounded and more stable, with values ranging from 0 to 1 []. The formula is defined as follows, where P and Q represent the two probability distributions, and H(P) denotes the entropy of P:

J S D (P ∥ Q) = H (\frac{P + Q}{2}) - \frac{1}{2} [H (P) + H (Q)]

(3)

(3) The Mean Absolute Error (MAE) is used to quantify the average magnitude of deviations between predicted values and actual observations, offering an intuitive interpretation of prediction accuracy []. The formula for MAE is defined as follows, where n represents the number of samples, i is the sample index, f_i denotes the predicted value, and y_i is the corresponding observed value:

M A E = \frac{1}{n} \sum_{i = 1}^{n} |f_{i} - y_{i}|

(4)

(4) The Mean Squared Error (MSE) measures the average of the squared differences between predicted values and actual observations. It penalizes larger errors more heavily, making it more sensitive to significant deviations in prediction []. A lower MSE indicates that the predicted values are closer to the actual values, implying better model fit. The formula for MSE is given as follows: where n represents the number of samples, i is the sample index, f_i denotes the predicted value, and y_i is the corresponding observed value:

M S E = \frac{1}{n} \sum_{i = 1}^{n} {(f_{i} - y_{i})}^{2}

(5)

(5) The Root Mean Squared Error (RMSE) extends the concept of MSE by taking the square root of the average squared errors. It is particularly sensitive to large deviations, making it effective at highlighting the impact of outliers in prediction performance. The formula for RMSE is as follows, where RMSE is the square root of the MSE:

R M S E = \sqrt{\frac{1}{n} \sum_{i = 1}^{n} {(f_{i} - y_{i})}^{2}}

(6)

(6) The Explained Variance Score (EVS) evaluates the proportion of the variance in the observed data that is captured by the predictive model, indicating the model’s effectiveness in explaining data variability []. The formula for EVS is as follows, where y denotes the observed values, f the corresponding predicted values, and Var represents the variance of y:

E V S = 1 - \frac{V a r (y - f)}{V a r (y)}

(7)

(7) The Coefficient of Determination (R²) is another key goodness-of-fit metric that quantifies the proportion of variance in the dependent variable that is explained by the regression model. It is commonly used to assess the overall performance of predictive []. In the context of regression analysis, it reflects the degree to which the independent variables account for the variability in the dependent variable. The formula for R² is as follows: where n is the number of samples, i is the sample index, y_i is the actual observed value, f_i is the predicted value, and

\bar{y}

is the mean of the actual observed values.

R^{2} = 1 - \frac{\sum_{i = 1}^{n} {(y_{i} - f_{i})}^{2}}{\sum_{i = 1}^{n} {(y_{i} - \bar{y})}^{2}}

(8)

(8) To comprehensively evaluate the performance of the regression models, this study employed five key evaluation metrics: MAE, EVS, MSE, RMSE, and R², and further derived a Composite Score. First, all metric values were normalized to ensure consistency in scale. For MAE, MSE, and RMSE, where lower values indicate better model performance, reverse normalization was applied. For EVS and R², where higher values indicate better performance, direct normalization was used. Next, weighted aggregation was performed based on the relative importance of each metric. MAE, MSE, and RMSE were each assigned a weight of 0.5/3, while EVS and R² were each assigned a weight of 0.25, summing to a total weight of 1 []. The Composite Score was calculated using the following formula, where

X_{i}

represents the normalized score of the i-th model for a given metric.

Score = (\sum_{i \in {M A E, M S E, R M S E}} (1 - \frac{x_{i} - \min (x_{i})}{\max (x_{i}) - \min (x_{i})}) \times \frac{0.5}{3}) + (\sum_{i \in {E V S, R^{2}}} \frac{x_{i} - \min (x_{i})}{\max (x_{i}) - \min (x_{i})} \times 0.25)

(9)

2.2.3. Variable Screening and Feature Selection

In regression modeling, feature selection is a critical step for enhancing model performance, improving interpretability, reducing computational complexity, and preventing overfitting. This process not only influences the predictive accuracy of the final model but also sheds light on the relative importance of environmental drivers influencing the abundance of Thunnus albacares. In this study, feature selection was conducted using machine learning-based regression models, integrating Catch Per Unit Effort (CPUE) data of Thunnus albacares with 25 environmental and climatic variables. The feature selection procedures were implemented in Python (v3.10.2), utilizing packages such as Pandas (v2.2.2), NumPy (v1.24.4), Matplotlib (v3.5.3), and the machine learning library scikit-learn (v1.1.2). The overall workflow was structured into the following four stages:

(1) Data Preprocessing Stage: In the initial phase, 25 environmental features potentially associated with CPUE variations in Thunnus albacares were extracted from multiple data sources. These variables included spatial coordinates (latitude and longitude), temporal variables (year and month), remote sensing variables (e.g., SST, Chl-a, SLA, EKE), and major climatic indices (e.g., PDOI, SOI, AOI, ONI). Data were imported, integrated, and preliminarily cleaned using the Pandas library in Python. Subsequently, the 24 environmental variables were designated as independent variables, and the CPUE of Thunnus albacares was defined as the dependent variable. The full dataset was randomly divided into training (80%) and testing (20%) subsets to support the modeling and feature importance analysis.

(2) Model Parameter Optimization Stage: Model performance in regression tasks is highly dependent on appropriate hyperparameter settings. To improve predictive capability, the Light Gradient Boosting Machine (LightGBM) was selected as the core modeling algorithm, and parameter tuning was conducted using a randomized search strategy (RandomizedSearchCV). The optimization covered 15 key parameters involving aspects of tree structure (e.g., maximum depth and number of leaves, which control model complexity), regularization strength (e.g., L1 and L2 penalties to prevent overfitting), and training control (e.g., learning rate and number of estimators to balance training speed and accuracy). A five-fold cross-validation (5-fold CV) approach was used to evaluate the generalization performance of each parameter combination, and the one yielding the lowest average error was selected as the optimal setting.

(3) Feature Importance Analysis Stage: Upon completing model training, the intrinsic feature importance rankings generated by the LightGBM model were extracted. These rankings, derived from the model’s internal decision tree structure, indicate the relative contribution of each feature to the overall prediction accuracy. By analyzing the feature importance scores, key predictors with high explanatory power were identified, while weakly correlated or redundant variables were excluded through dimensionality reduction. This process not only improved model efficiency and generalization capacity, but also enhanced interpretability, offering valuable insights for subsequent ecological mechanism analysis.

(4) SHAP (SHapley Additive exPlanations) Analysis Stage: To further enhance model interpretability and transparency, this study incorporated SHAP, a game-theoretic feature attribution method. Originating from Shapley values in cooperative game theory, SHAP aims to calculate the marginal contribution of each feature to the model output across all possible feature combinations, averaging these contributions to yield an accurate and fair representation of each variable’s influence []. It has been widely applied in ecological model interpretation, environmental risk assessment, and policy response analysis. Unlike LightGBM’s feature importance ranking, based on split frequency and split gain within the tree structure and reflecting structural feature contributions, SHAP values quantify each feature’s direct contribution to individual predictions. This facilitates the generation of more informative and interpretable feature importance rankings, offering insights into how each environmental factor influences the predicted CPUE values for Thunnus albacares [].

3. Environmental Drivers Identification

This section presents the results of the model evaluation and selection process. The performance of 16 machine learning models is systematically compared through visual assessments and quantitative metrics to identify the optimal model for predicting Thunnus albacares CPUE. Subsequently, feature selection and an explanation of the final model’s predictions are provided.

3.1. Comparison Between Actual and Predicted Values

When performing predictive tasks, selecting an appropriate machine learning model is crucial. To this end, the study compared the prediction results of 16 models using kernel density estimation (KDE) plots, as illustrated in Figure 5. The figure presents the performance of various machine learning models, where each subplot depicts the relationship between the predicted and observed values. Color gradients, ranging from blue (low density) to red (high density), represent the density distribution of data points and provide an intuitive understanding of how each model fits the data across different regions.

Figure 5. Comparison between different model predictions and actual values.

The diagonal line in each subplot indicates the ideal condition where predicted values equal observed values, serving as a visual benchmark for assessing model performance. This comparative visualization provides both qualitative and quantitative insights into the models’ fitting capabilities, supporting the identification of the most suitable algorithm for subsequent analyses related to Thunnus albacares catch per unit effort (CPUE) prediction.

The study evaluated model performance using three metrics: the coefficient of determination (R²), Pearson correlation coefficient, and Jensen–Shannon divergence (JSD). Among them, R² and Pearson’s coefficient are the most commonly used indicators in regression analysis, which assess model goodness-of-fit and the linear correlation between predicted and observed values, respectively. In contrast, JSD measures the similarity between the distributions of predicted and actual values; a smaller JSD indicates a closer match between the two distributions, reflecting better agreement at the population level.

As shown in Figure 5, some models exhibit significant deviations from the diagonal line, indicating systematic errors in specific value ranges. In comparison, ensemble learning methods such as LightGBM and ExtraTrees demonstrate superior performance in terms of prediction concentration and alignment along the diagonal. The high density of scatter points near the diagonal line suggests strong predictive accuracy and consistency in distribution, which is essential for reliable modeling of Thunnus albacares catch per unit effort (CPUE).

3.2. Comparison of Regression Models’ Performance

To further visualize and compare the performance of different models, this study employed a radar chart based on the standardized R² scores, enabling an intuitive comparison of various machine learning algorithms and providing data-driven support for subsequent model selection and optimization. As shown in Figure 6, the values along the axes represent the performance scores of each model. The color gradient indicates the magnitude of the score, with higher scores represented by a deeper purple hue. The radar chart clearly illustrates the performance distribution across different evaluation metrics for each algorithm, highlighting that LightGBM demonstrates the most outstanding performance among all evaluated models.

Figure 6. Radar charts of different model performance.

To comprehensively evaluate model performance, we applied a multi-dimensional visualization strategy, as illustrated in Figure 7. Panel (a) shows R² values along with 95% confidence intervals obtained through bootstrap resampling (n = 200), demonstrating considerable performance variation across the 16 machine learning algorithms examined. The horizontal layout enables direct comparison with the baseline model, indicated by a red dashed line. Panel (b) presents corresponding RMSE values and associated confidence intervals, offering complementary insight into predictive accuracy. In Panel (c), the improvement in R² relative to the top-performing baseline model is quantified, with color coding used to distinguish positive (green) and negative (red) performance differences. Finally, Panel (d) summarizes the overall performance distribution via a graded classification system based on R² improvement (Excellent: >0.1, Good: >0.05, Fair: >0.01, Poor: ≤0.01), providing an intuitive assessment of methodological effectiveness.

Figure 7. Model performance evaluation: (a) R² scores and confidence intervals; (b) RMSE values and confidence intervals; (c) Relative R² improvement; (d) Performance classification.

3.3. Comprehensive Scores

The evaluation of model performance typically relies on multiple metrics. To comprehensively compare the advantages and disadvantages of different models, this study employed five key evaluation indicators: Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Explained Variance Score (EVS), and the Coefficient of Determination (R²). A detailed comparison of the scores across all 16 regression models and their performance under different evaluation criteria is presented in Table 1.

Table 1. Comparison table of MAE, MSE, RMSE, EVS, and R² scores for Regression Models.

To facilitate a more intuitive and integrated assessment of overall performance, a composite score (Score) was introduced. This score was calculated by applying weighted aggregation to the five-evaluation metrics. The model with the highest composite score was selected as the optimal model and subsequently used for the feature selection process. The results, as illustrated in Figure 8, present a ranked summary of the overall scores for each regression model.

Figure 8. Comprehensive score of the regression model.

The findings indicate that the Light Gradient Boosting Machine (LightGBM) outperformed all other models, achieving the highest composite score. LightGBM demonstrated a superior capacity to handle large-scale datasets and effectively manage feature learning processes, particularly when dealing with complex data structures. By employing a composite scoring approach, this study not only revealed the strengths and weaknesses of each algorithm based on individual evaluation metrics but also highlighted their overall performance under multidimensional assessment criteria. Consequently, LightGBM was selected as the optimal model for subsequent feature selection and analysis.

3.4. Feature Selection of LightGBM

The strength and direction of linear relationships between numerical variables were assessed using Pearson correlation coefficients. These coefficients range from −1 to +1, representing a spectrum from perfect negative to perfect positive linear correlation. A correlation matrix was constructed and visualized as a heatmap via the Seaborn (v0.13.2) heatmap function. The matrix is displayed as a color-coded grid where the color intensity corresponds to the correlation strength, following a divergent color scheme (blue for negative and red for positive values) for clarity. The results of the correlation analysis between CPUE and environmental factors are shown in Figure 9.

Figure 9. Pearson correlation coefficients between CPUE and environmental factors.

In machine learning regression models, the objective of feature selection using LightGBM is to achieve triple optimization by enhancing predictive accuracy, improving model interpretability, and reducing computational complexity through the elimination of redundant features and irrelevant variables. This study employs a hierarchical control strategy to accomplish this goal. First, the growth of decision trees is constrained by setting thresholds for the number of leaf nodes and maximum depth, which helps suppress overfitting while increasing the model’s sensitivity to key features. Second, the use of random features and sample subset sampling introduces probabilistic filtering to mitigate noise interference in the training data, thereby enhancing the model’s generalization capability. Finally, the inclusion of regularization constraints effectively suppresses the weights of irrelevant features, improving model robustness and its capacity to manage complex feature interactions.

The prediction results demonstrate that the use of a “low learning rate with deep ensemble” strategy (learning rate = 0.01, number of trees = 800) effectively balances training stability and generalization capacity. The dual regularization framework (L1 = 0.1, L2 = 10) successfully suppresses the risk of overfitting in complex marine environments. In feature engineering, 90% dynamic feature sampling combined with gain-based evaluation enhances model robustness while preserving the representational power of the variables. The chosen tree structure parameters (number of bins = 700, number of leaves = 35) optimize computational accuracy and efficiency. Strict control of randomness (random seed = 42) ensures model reproducibility and verifiability in aquaculture-related environmental data analysis. As a result of comprehensive optimization, the model’s validation performance improved to 0.255. The detailed configuration of optimal LightGBM parameters is presented in Table 2.

Table 2. LightGBM Optimal Parameter Table.

LightGBM provides a direct and efficient approach for feature selection, making it well-suited for applications requiring rapid processing of large-scale datasets and accurate prediction. The results of feature selection using the optimal LightGBM parameters are presented in Figure 10, where the importance scores reflect the contribution of each key variable to the prediction of Catch Per Unit Effort (CPUE) for Thunnus albacares.

Figure 10. LightGBM important feature value ranking.

As shown in the bar chart, the importance of features varies significantly, with the variable “month” exhibiting the highest importance, far exceeding other features. To quantitatively assess the model’s predictive performance on CPUE under varying numbers of selected features, and thereby achieve a scientific and reasonable balance between feature reduction and model simplification, a line graph displays the R² value and its 95% confidence interval. The R² increases rapidly as more features are added, indicating that the inclusion of new variables significantly enhances model performance. However, the performance gain plateaus after the first 13 features. A vertical dashed line in the figure marks the top 13 features, indicating that this subset was selected as the most informative group based on both feature importance and model accuracy.

The final ranking of feature importance identified by LightGBM is as follows: month, lat, T450, T150, T300, NPGIO, year, SLA, lon, PDOI, SSTgrad, ONI, SST_bf, Chldt, AOI, SSTdt, Chl_bf, Chl-a, SST_af, Chl_af, EKE, T0, SOI, and Chlgrad.

LightGBM provides a direct and efficient approach for feature selection, making it well-suited for applications requiring rapid processing of large-scale datasets and accurate prediction. The results of feature selection using the optimal LightGBM parameters are presented in Figure 6, where the importance scores reflect the contribution of each key variable to the prediction of Catch Per Unit Effort (CPUE) for Thunnus albacares.

As shown in the bar chart, the importance of features varies significantly, with the variable “month” exhibiting the highest importance, far exceeding other features. To quantitatively assess the model’s predictive performance on CPUE under varying numbers of selected features, and thereby achieve a scientific and reasonable balance between feature reduction and model simplification, a line graph displays the R² value and its 95% confidence interval. The R² increases rapidly as more features are added, indicating that the inclusion of new variables significantly enhances model performance. However, the performance gain plateaus after the first 13 features. A vertical dashed line in the figure marks the top 13 features, indicating that this subset was selected as the most informative group based on both feature importance and model accuracy.

The final ranking of feature importance identified by LightGBM is as follows: month, lat, T450, T150, T300, NPGIO, year, SLA, lon, PDOI, SSTgrad, ONI, SST_bf, Chldt, AOI, SSTdt, Chl_bf, Chl-a, SST_af, Chl_af, EKE, T0, SOI, and Chlgrad.

Model validation is essential for assessing the robustness of machine learning approaches like LightGBM. We systematically evaluate the key statistical assumptions using diagnostic plots in Figure 11. The Q-Q plot (a) and residual distribution (b) reveal severe non-normality, characterized by heavy tails (Shapiro–Wilk W = 0.658, p < 0.001), high skewness (4.92), and kurtosis (52.09). In contrast, the ACF plot (c) and Ljung–Box test (p = 0.307) confirm error independence, a finding supported by the random, unsystematic pattern in the residual sequence (d). Despite a well-calibrated mean residual near zero (−0.0101), the presence of extreme values (e) suggests potential outliers. These results validate the model’s error independence but highlight a significant deviation from normality, guiding future refinements.

Figure 11. Model assumption validation results: (a) Q-Q plot for normality test; (b) Residual distribution histogram; (c) Residuals vs. predicted values scatter plot; (d) Autocorrelation function plot; (e) Residual sequence plot.

Through integrated visualization techniques including Q-Q plots, residual distribution analysis, autocorrelation function plots, and residual sequence examination, we provide a comprehensive assessment of normality, independence, homoscedasticity, and autocorrelation assumptions. The findings demonstrate both strengths and limitations in the current modeling approach, offering valuable insights for methodological refinement.

3.5. SHAP Analysis

To elucidate sample-level decision mechanics of the LightGBM model, Figure 12 employs SHAP (SHapley Additive exPlanations) decomposition to visualize prediction pathways for Thunnus albacares catch per unit effort (CPUE). The horizontal axis traces cumulative changes in predicted CPUE values from the baseline expectation to final outputs, while the vertical axis ranks features by their global contribution magnitude.

Figure 12. SHAP decision plot for the top 200 testing samples.

Each line in the plot corresponds to a single sample and starts from the expected value (i.e., the mean predicted value across the training dataset). Along the path, SHAP values for each feature are added sequentially, ultimately reaching the model’s final prediction output. The horizontal axis shows the cumulative contribution of SHAP values from left to right; the vertical axis displays feature names, automatically ordered by their overall contribution to the prediction. Each bend or inflection point in a line indicates the extent to which a specific feature has a positive or negative influence on the predicted CPUE for that particular sample. The color intensity of each line reflects the density of overlapping samples, allowing for the visual identification of local patterns.

This visualization demonstrates the model’s step-by-step inference process for individual predictions. It can be used to identify consistent decision-making patterns and to uncover the underlying causes of prediction anomalies, serving as a critical link between global feature importance and local interpretability in ecological modeling.

To further analyze the SHAP-based feature importance ranking, this study employed a dual-axis visualization approach that combines the SHAP beeswarm plot with a bar chart of feature importance, allowing for an intuitive representation of each feature’s contribution and impact on the model’s prediction of Thunnus albacares CPUE.

As shown in Figure 13, the variable month exhibited the highest explanatory power, with most of its SHAP values being positive. This indicates that the month generally has a positive influence on the predicted CPUE. The variables latitude (lat) and longitude (lon) followed in importance, also demonstrating a strong contribution to model predictions. Several oceanographic and climate-related variables also showed notable explanatory capability. In contrast, variables such as year, NPGIO, and T150 displayed more dispersed SHAP value distributions, with wide variation in the direction and magnitude of their effects across different samples. This suggests a degree of uncertainty or inconsistency in their influence on CPUE. Variables such as EKE, T0, and chlgrad exhibited low average SHAP values, indicating weak influence on the model’s predictions, with some contributing negligibly across the dataset.

Figure 13. Honeycomb diagram of important features of biaxial SHAP.

4. Discussion

4.1. Comparative Analysis of Different Regression Models

A comparative analysis of the 16 regression models revealed that the boosted tree model LightGBM and the extreme ensemble method ExtraTrees demonstrated excellent performance across all evaluation metrics. In particular, both models significantly outperformed others in terms of Mean Squared Error (MSE) and Mean Absolute Error (MAE), while also ranking among the top in Explained Variance Score (EVS) and the Coefficient of Determination (R²), showing superior overall predictive capabilities.

LightGBM, through its gradient-based histogram optimization algorithm, leaf-wise growth strategy, and native handling of categorical variables, significantly reduces the time complexity of model training and effectively mitigates overfitting, making it especially suitable for data scenarios involving large sample sizes, high-dimensional features, and strong multicollinearity [].

In comparison, ExtraTrees improves model diversity and generalization by introducing an extremely randomized splitting strategy. It demonstrates greater robustness against common issues in fishery observational datasets, such as measurement errors, outliers, and spatial sampling imbalances []. Both models achieve a favorable balance between accuracy and stability, highlighting their high practical value in modeling the catch per unit effort (CPUE) of Thunnus albacares.

In addition, CatBoost maintains stable performance even under significant data heterogeneity or complex variable interactions by incorporating target encoding for categorical variables and an ordered boosting strategy. This makes it particularly suitable for fishery datasets where categorical variables are frequently coupled with temporal information [].

Random Forest improves model robustness and resistance to noise by integrating multiple low-correlation decision trees. It maintains good predictive accuracy even under conditions of feature redundancy or suboptimal data quality [], making it well-suited for ecology-oriented modeling tasks and the generation of policy recommendations where model interpretability is a priority [].

Finally, in this study, models such as Bagging, Gradient Boosting, K Nearest Neighbors, and XGBoost demonstrated moderate performance, suggesting their potential applicability under specific conditions. In contrast, traditional linear models, including Lasso, Ridge, ElasticNet, and standard Linear Regression, exhibited relatively low overall performance. These models showed evident underfitting when faced with nonlinear ecological response mechanisms commonly observed in fishery systems. Their modeling capacity remains constrained by the linear framework, indicating structural limitations in capturing the complex relationships between oceanographic environmental factors and catch per unit effort (CPUE) of species such as Thunnus albacares []. Models ranked at the lower end of performance included MLP Regressor, Huber, AdaBoost, and Decision Tree. The MLP Regressor’s performance was likely constrained by the limited dataset size, which may be insufficient for optimizing its numerous parameters and complex architecture, leading to suboptimal convergence. The Huber regressor, while robust to outliers, may lack the flexibility to capture the full spectrum of nonlinear relationships present in the ecological data. Both AdaBoost and the standalone Decision Tree are particularly prone to overfitting and sensitivity to noise in the dataset; AdaBoost can amplify errors from weak learners, while the Decision Tree model easily learns spurious patterns in the training data, resulting in poor generalization to unseen data.

4.2. Comparative Analysis of Feature Selection

Analysis of the CPUE prediction model for Thunnus albacares identified a set of key spatiotemporal and environmental drivers. The results show a strong consistency between the feature ranking derived from LightGBM and the SHAP-based importance order. Temporal (month) and spatial features (latitude and longitude) exerted the greatest influence on the model’s predictive accuracy, followed by water column temperature features (T450, T300) and large-scale indices such as ONI and PDOI. In addition, variables such as year and NPGIO exhibited both positive and negative effects on the model predictions depending on context, while features like EKE, T0, and chlgrad had relatively minor impacts. Among all predictors, temperature-related environmental factors accounted for the largest proportion and consistently ranked among the top in terms of importance during the feature selection process.

As shown in the results, the month regulates sea temperature dynamics while latitude determines thermal gradients, creating combined effects on water mass distribution that exceed their individual impacts. Furthermore, large-scale climate indices (ONI, PDOI) interact with local thermal conditions in ways that substantially modify fish habitat suitability. For instance, the effect of mid-layer temperatures (T150, T300, T450) on vertical fish distribution is modulated by these climate oscillations, which alter thermal stratification patterns. These interaction mechanisms primarily function through their collective impact on ocean temperature—climate indices shape large-scale thermal regimes while spatial and temporal factors determine how these thermal patterns translate into local habitat conditions, explaining why variables like month and latitude consistently rank high in feature importance as they represent the spatiotemporal integration of these complex interactions.

In the multivariable interaction analysis, month and latitude consistently ranked among the top features in terms of importance. The variable month indirectly influences CPUE by regulating sea temperature and the spatial dynamics of water masses. This finding aligns with the study by Lan et al., which identified seasonal variation and latitudinal thermal gradients as key factors affecting fluctuations in Thunnus albacares catch rates [], suggesting that seasonal changes significantly impact ocean temperature variability. Moreover, this finding aligns with the pattern demonstrated in Figure 2, where higher CPUE values coincide with the elevated temperatures of spring and summer months. Matsubara et al. reported that thermal differences across latitudes directly affect the geographic distribution of Thunnus albacares []. The coupling between latitude and large-scale oceanic thermal structures reflects a stable association with fishing ground spatial patterns []. These results are consistent with the known behavioral response of Thunnus albacares to thermal gradients, supporting the ecological hypothesis that this species exhibits pronounced seasonal migratory behavior and adapts to tropical and subtropical water masses []. Therefore, the high importance of these features underscores the central role of temperature-related factors in the prediction of marine fishery resources [], consistent with numerous previous studies.

The study also revealed that mid to deep-layer temperature variables, such as T150, T300, and T450, exhibited significantly higher importance to model performance compared to surface-layer variables. These factors directly influence the vertical distribution and migratory pathways of fish species [], and Thunnus albacares is known to exhibit a specific preference for certain depths within the vertical water column []. This depth preference likely reflects its adaptive response to optimal temperature layers, thermocline structures, and the distribution of midwater prey resources. These findings are consistent with the observational research by Song et al. on the thermocline-associated behavioral patterns of Thunnus albacares [].

Oceanic anomaly indicators such as the North Pacific Gyre Oscillation Index (NPGIO), Oceanic Niño Index (ONI), and Pacific Decadal Oscillation Index (PDOI) are, to some extent, influenced by temperature fluctuations. In terms of climatic drivers, ONI, PDOI, and NPGIO exhibited relatively high average SHAP values, indicating a strong influence on the model output. This suggests that catch per unit effort (CPUE) is substantially regulated by large-scale climate systems, which may indirectly affect the spatial distribution and catchability of Thunnus albacares by altering thermal stratification, primary productivity, and current patterns []. Therefore, incorporating multi-scale climatic indices into CPUE modeling is essential, a conclusion supported by several regional fishery studies [,].

In contrast, some variables, such as the sea surface chlorophyll gradient (chlgrad), eddy kinetic energy (eke), and sea surface temperature anomalies (sst_af) ranked lower in the feature importance analysis. Variables including sst_bf, sstdt, sst_af, and T0 reflect horizontal variations in sea surface temperature, which are known to affect phytoplankton growth and, through their influence on water mass movement, indirectly alter nutrient availability and fish foraging behavior []. Thunnus albacares is known to exhibit specific ecological temperature thresholds. Given the relatively high variability and low stability of surface-layer temperatures, their influence on CPUE is limited. In contrast, temperature variations provide more stable signals that align better with the habitat preferences of Thunnus albacares, thereby demonstrating stronger explanatory power for CPUE, consistent with the findings of Wright et al. [].

In summary, the prediction of Thunnus albacares catch per unit effort (CPUE) is not solely dependent on individual ecological factors, but rather results from the coupling of multi-scale temporal and spatial drivers. Temporal variables, latitude, and mid-layer ocean temperature emerge as high-frequency driving forces, while large-scale climatic oscillations provide low-frequency background disturbances. These findings further support the application value of machine learning models that integrate nonlinear predictive capacity with interpretability mechanisms in fishery resource assessments. Such approaches enhance the precision of ecological forecasting and contribute to more scientifically grounded policy formulation.

5. Conclusions

This study developed an analytical framework integrating multiple machine learning regression algorithms to systematically evaluate the performance of sixteen mainstream models in predicting the catch per unit effort (CPUE) of Thunnus albacares. The results demonstrated that the Light Gradient Boosting Machine (LightGBM) and Extremely Randomized Trees (ExtraTrees) models achieved superior performance across all regression metrics, exhibiting excellent capabilities in nonlinear fitting and capturing complex interactions among ecological drivers, thus showing strong application potential. CatBoost and Random Forest also displayed robust performance and high interpretability, making them well-suited for fisheries prediction scenarios involving heterogeneous ecological variables or requiring clear ecological inference. Based on feature importance rankings and SHapley Additive exPlanations (SHAP) analysis, the study further identified month, latitude, and multi-depth temperature variables as key factors driving the spatiotemporal variability of CPUE, highlighting the high sensitivity of Thunnus albacares to thermal and seasonal environmental fluctuations. These findings contribute to a deeper understanding of the relationships between environmental forcing mechanisms and the spatial dynamics of fishery resources.

While this study provides a robust modeling framework for understanding the environmental drivers of yellowfin tuna CPUE for the Chinese distant-water longline fishery, several limitations should be considered when interpreting the results. The primary limitation stems from our reliance on a single fleet source (Chinese longliners). Although this ensures internal consistency by controlling for vessel type, gear, and broad operational strategies, it may limit the immediate generalizability of our specific model predictions to other fleets (e.g., purse seiners, gillnetters) or different maritime regions. It is plausible that our model has learned relationships influenced by the particular operational preferences of this fleet, which might not fully transfer to other systems with distinct fishing tactics and target species compositions. Furthermore, the non-stationary nature of marine ecosystems under a changing climate also implies that the identified relationships between environmental drivers and CPUE may evolve over time.

These limitations define clear pathways for future research. First and foremost, a critical next step is to conduct a comparative analysis by applying the same modeling framework to datasets from diverse fleets and regions. Investigating how the importance and functional response of environmental drivers (e.g., sea surface temperature, chlorophyll-a) vary across different vessel types (e.g., longliners vs. purse seiners) and operational protocols would significantly enhance our understanding of the universality or context-dependence of these relationships.

Secondly, integrating higher-resolution oceanographic physical and chemical environmental data, along with individual behavioral datasets, could enhance the model’s ability to disentangle complex multi-factor coupling mechanisms and deepen the identification of ecological drivers. In addition, emerging modeling approaches such as transfer learning and federated learning hold significant potential in data-sparse regions, offering improved model robustness under extreme climate event conditions and enhancing the capacity to capture ecological responses at global scales. These advancements are expected to provide more scientific and reliable technical support for sustainable fisheries management and resource assessment related to Thunnus albacares and other marine species.

Author Contributions

Conceptualization, W.Z.; methodology, L.Y. and C.Z.; validation, W.Z., L.Y., C.Z. and F.T.; formal analysis, W.Z. and L.Y.; data curation, C.Z.; resources, F.T.; writing—original draft preparation, L.Y.; writing—review and editing, W.Z.; project administration and funding acquisition, W.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key R&D Program of China (2023YFD2401303).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

No new data were created or analyzed in this study.

Conflicts of Interest

The authors declare no conflict of interest.

References

Boyd, C.E.; McNevin, A.A.; Davis, R.P. The contribution of fisheries and aquaculture to the global protein supply. Food Secur. 2022, 14, 805–827. [Google Scholar] [CrossRef]
Skirtun, M.; Pilling, G.M.; Reid, C.; Hampton, J.J.M.P. Trade-offs for the southern longline fishery in achieving a candidate South Pacific albacore target reference point. Mar. Policy 2019, 100, 66–75. [Google Scholar] [CrossRef]
Lan, K.-W.; Evans, K.; Lee, M.-A. Effects of climate variability on the distribution and fishing conditions of yellowfin tuna (Thunnus albacares) in the western Indian Ocean. Clim. Change 2013, 119, 63–77. [Google Scholar] [CrossRef]
Wang, W.; Fan, W.; Yu, L.; Wang, F.; Wu, Z.; Shi, J.; Cui, X.; Cheng, T.; Jin, W.; Wang, G. Analysis of multi-scale effects and spatial heterogeneity of environmental factors influencing purse seine tuna fishing activities in the Western and Central Pacific Ocean. Heliyon 2024, 10, e38099. [Google Scholar] [CrossRef]
Feng, Y.; Chen, X.; Gao, F.; Liu, Y. Impacts of changing scale on Getis-Ord Gi* hotspots of CPUE: A case study of the neon flying squid (Ommastrephes bartramii) in the northwest Pacific Ocean. Acta Oceanol. Sin. 2018, 37, 67–76. [Google Scholar] [CrossRef]
Yaseen, Z.M. A new benchmark on machine learning methodologies for hydrological processes modelling: A comprehensive review for limitations and future research directions. Knowl. Based Eng. Sci. 2023, 4, 65–103. [Google Scholar] [CrossRef]
Liu, W.; Li, R. Variable selection and feature screening. In Macroeconomic Forecasting in the Era of Big Data: Theory and Practice; Fuleky, P., Ed.; Springer: Cham, Switzerland, 2020; Volume 52, pp. 293–326. [Google Scholar] [CrossRef]
Zhang, C.; Zhou, W.; Tang, F.; Shi, Y.; Fan, W. Prediction Model of Yellowfin Tuna Fishing Ground in the Central and Western Pacific Based on Machine Learning. Trans. Chin. Soc. Agric. Eng. 2022, 38, 330–338. [Google Scholar]
Zagaglia, C.R.; Lorenzzetti, J.A.; Stech, J.L. Remote sensing data and longline catches of yellowfin tuna (Thunnus albacares) in the equatorial Atlantic. Remote Sens. Environ. 2004, 93, 267–281. [Google Scholar] [CrossRef]
Yang, L.; Zhou, W. Feature Selection for Explaining Yellowfin Tuna Catch per Unit Effort Using Least Absolute Shrinkage and Selection Operator Regression. Fishes 2024, 9, 204. [Google Scholar] [CrossRef]
Braun, M.T.; Oswald, F.L. Exploratory regression analysis: A tool for selecting models and determining predictor importance. Behav. Res. Methods 2011, 43, 331–339. [Google Scholar] [CrossRef]
James, G.; Witten, D.; Hastie, T.; Tibshirani, R.; Taylor, J. Linear regression. In An Introduction to Statistical Learning: With Applications in Python; Springer: Cham, Switzerland, 2023; pp. 69–134. [Google Scholar]
McDonald, G.C. Ridge regression. Wiley Interdiscip. Rev. Comput. Stat. 2009, 1, 93–100. [Google Scholar] [CrossRef]
Ranstam, J.; Cook, J.A. LASSO regression. J. Br. Surg. 2018, 105, 1348. [Google Scholar] [CrossRef]
Zou, H.; Hastie, T. Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B Stat. Methodol. 2005, 67, 301–320. [Google Scholar] [CrossRef]
Sun, Q.; Zhou, W.-X.; Fan, J. Adaptive huber regression. J. Am. Stat. Assoc. 2020, 115, 254–265. [Google Scholar] [CrossRef]
Sitienei, M.; Otieno, A.; Anapapa, A. An application of K-nearest-neighbor regression in maize yield prediction. Asian J. Probab. Stat. 2023, 24, 1–10. [Google Scholar] [CrossRef]
Czajkowski, M.; Kretowski, M. The role of decision tree representation in regression problems–An evolutionary perspective. Appl. Soft Comput. 2016, 48, 458–475. [Google Scholar] [CrossRef]
Dong, J.; Chen, Y.; Yao, B.; Zhang, X.; Zeng, N. A neural network boosting regression model based on XGBoost. Appl. Soft Comput. 2022, 125, 109067. [Google Scholar] [CrossRef]
Truong, V.-H.; Tangaramvong, S.; Papazafeiropoulos, G. An efficient LightGBM-based differential evolution method for nonlinear inelastic truss optimization. Expert Syst. Appl. 2024, 237, 121530. [Google Scholar] [CrossRef]
Ibrahim, A.A.; Ridwan, R.L.; Muhammed, M.M.; Abdulaziz, R.O.; Saheed, G.A. Comparison of the CatBoost classifier with other machine learning methods. Int. J. Adv. Comput. Sci. Appl. 2020, 11, 738–748. [Google Scholar] [CrossRef]
Rigatti, S.J. Random forest. J. Insur. Med. 2017, 47, 31–39. [Google Scholar] [CrossRef]
Shrestha, D.L.; Solomatine, D.P. Experiments with AdaBoost. RT, an improved boosting scheme for regression. Neural Comput. 2006, 18, 1678–1710. [Google Scholar] [CrossRef]
Natekin, A.; Knoll, A. Gradient boosting machines, a tutorial. Front. Neurorobotics 2013, 7, 21. [Google Scholar] [CrossRef]
Geurts, P.; Ernst, D.; Wehenkel, L. Extremely randomized trees. Mach. Learn. 2006, 63, 3–42. [Google Scholar] [CrossRef]
Breiman, L. Bagging predictors. Mach. Learn. 1996, 24, 123–140. [Google Scholar] [CrossRef]
Maqbool, J.; Aggarwal, P.; Kaur, R.; Mittal, A.; Ganaie, I.A. Stock prediction by integrating sentiment scores of financial news and MLP-regressor: A machine learning approach. Procedia Comput. Sci. 2023, 218, 1067–1078. [Google Scholar] [CrossRef]
Benesty, J.; Chen, J.; Huang, Y.; Cohen, I. Pearson correlation coefficient. In Noise Reduction in Speech Processing; Springer: Berlin/Heidelberg, Germany, 2009; Volume 2, pp. 1–4. [Google Scholar] [CrossRef]
Hoyos-Osorio, J.K.; Sanchez-Giraldo, L.G. The representation jensen-shannon divergence. arXiv 2023, arXiv:2305.16446. [Google Scholar] [CrossRef]
Chicco, D.; Warrens, M.J.; Jurman, G. The coefficient of determination R-squared is more informative than SMAPE, MAE, MAPE, MSE and RMSE in regression analysis evaluation. Peerj Comput. Sci. 2021, 7, e623. [Google Scholar] [CrossRef]
Das, K.; Jiang, J.; Rao, J. Mean squared error of empirical predictor. Ann. Statist. 2004, 32, 818–840. [Google Scholar] [CrossRef]
LaHuis, D.M.; Hartman, M.J.; Hakoyama, S.; Clark, P.C. Explained variance measures for multilevel models. Organ. Res. Methods 2014, 17, 433–451. [Google Scholar] [CrossRef]
Di Bucchianico, A. Coefficient of determination (R²). In Encyclopedia of Statistics in Quality and Reliability; John Wiley & Sons, Ltd.: Hoboken, NJ, USA, 2008. [Google Scholar] [CrossRef]
Van den Broeck, G.; Lykov, A.; Schleich, M.; Suciu, D. On the tractability of SHAP explanations. J. Artif. Intell. Res. 2022, 74, 851–886. [Google Scholar] [CrossRef]
Wen, X.; Xie, Y.; Wu, L.; Jiang, L. Quantifying and comparing the effects of key risk factors on various types of roadway segment crashes with LightGBM and SHAP. Accid. Anal. Prev. 2021, 159, 106261. [Google Scholar] [CrossRef]
Khan, A.A.; Chaudhari, O.; Chandra, R. A review of ensemble learning and data augmentation models for class imbalanced problems: Combination, implementation and evaluation. Expert Syst. Appl. 2024, 244, 122778. [Google Scholar] [CrossRef]
Rani, S.J.; Ioannou, I.; Swetha, R.; Lakshmi, R.D.; Vassiliou, V. A novel automated approach for fish biomass estimation in turbid environments through deep learning, object detection, and regression. Ecol. Inform. 2024, 81, 102663. [Google Scholar] [CrossRef]
Xu, S.; Wang, J.; Chen, X.; Zhu, J. Identifying optimal variables for machine-learning-based fish distribution modeling. Can. J. Fish. Aquat. Sci. 2024, 81, 687–698. [Google Scholar] [CrossRef]
Hou, J.; Zhou, W.; Fan, W.; Zhang, H. Research on fishing grounds forecasting models of albacore tuna based on ensemble learning in South Pacific. South China Fish. Sci. 2020, 16, 42–50. [Google Scholar] [CrossRef]
Salih, A.K.; Hussein, H.A.A. Lost circulation prediction using decision tree, random forest, and extra trees algorithms for an Iraqi oil field. Iraqi Geol. J. 2022, 55, 111–127. [Google Scholar] [CrossRef]
Altelbany, S. Evaluation of ridge, elastic net and lasso regression methods in precedence of multicollinearity problem: A simulation study. J. Appl. Econ. Bus. Stud. 2021, 5, 131–142. [Google Scholar] [CrossRef]
Lan, K.-W.; Lee, M.-A.; Lu, H.-J.; Shieh, W.-J.; Lin, W.-K.; Kao, S.-C. Ocean variations associated with fishing conditions for yellowfin tuna (Thunnus albacares) in the equatorial Atlantic Ocean. ICES J. Mar. Sci. 2011, 68, 1063–1071. [Google Scholar] [CrossRef]
Matsubara, N.; Aoki, Y.; Aoki, A.; Kiyofuji, H. Lower thermal tolerance restricts vertical distributions for juvenile albacore tuna (Thunnus alalunga) in the northern limit of their habitats. Front. Mar. Sci. 2024, 11, 1353918. [Google Scholar] [CrossRef]
Erauskin-Extramiana, M.; Arrizabalaga, H.; Hobday, A.J.; Cabré, A.; Ibaibarriaga, L.; Arregui, I.; Murua, H.; Chust, G. Large-scale distribution of tuna species in a warming ocean. Glob. Change Biol. 2019, 25, 2043–2060. [Google Scholar] [CrossRef]
Nimit, K.; Masuluri, N.K.; Berger, A.M.; Bright, R.P.; Prakash, S.; TVS, U.; Rohit, P.; Ghosh, S.; Varghese, S.P. Oceanographic preferences of yellowfin tuna (Thunnus albacares) in warm stratified oceans: A remote sensing approach. Int. J. Remote Sens. 2020, 41, 5785–5805. [Google Scholar] [CrossRef]
Cai, L.; Xu, L.; Tang, D.; Shao, W.; Liu, Y.; Zuo, J.; Ji, Q. The effects of ocean temperature gradients on bigeye tuna (Thunnus obesus) distribution in the equatorial eastern Pacific Ocean. Adv. Space Res. 2020, 65, 2749–2760. [Google Scholar] [CrossRef]
Alvarez, I.; Rasmuson, L.K.; Gerard, T.; Laiz-Carrion, R.; Hidalgo, M.; Lamkin, J.T.; Malca, E.; Ferra, C.; Torres, A.P.; Alvarez-Berastegui, D. Influence of the seasonal thermocline on the vertical distribution of larval fish assemblages associated with Atlantic bluefin tuna spawning grounds. Oceans 2021, 2, 64–83. [Google Scholar] [CrossRef]
Song, L.M.; Zhang, Y.; Xu, L.X.; Jiang, W.X.; Wang, J.Q. Environmental preferences of longlining for yellowfin tuna (Thunnus albacares) in the tropical high seas of the Indian Ocean. Fish. Oceanogr. 2008, 17, 239–253. [Google Scholar] [CrossRef]
Wu, Y.-L.; Lan, K.-W.; Evans, K.; Chang, Y.-J.; Chan, J.-W. Effects of decadal climate variability on spatiotemporal distribution of Indo-Pacific yellowfin tuna population. Sci. Rep. 2022, 12, 13715. [Google Scholar] [CrossRef]
Zhou, W.; Hu, H.; Fan, W.; Jin, S. Impact of abnormal climatic events on the CPUE of yellowfin tuna fishing in the central and western Pacific. Sustainability 2022, 14, 1217. [Google Scholar] [CrossRef]
Sebastian, P.; Stibor, H.; Berger, S.; Diehl, S. Effects of water temperature and mixed layer depth on zooplankton body size. Mar. Biol. 2012, 159, 2431–2440. [Google Scholar] [CrossRef]
Wright, S.R.; Righton, D.; Naulaerts, J.; Schallert, R.J.; Griffiths, C.A.; Chapple, T.; Madigan, D.; Laptikhovsky, V.; Bendall, V.; Hobbs, R. Yellowfin tuna behavioural ecology and catchability in the South Atlantic: The right place at the right time (and depth). Front. Mar. Sci. 2021, 8, 664593. [Google Scholar] [CrossRef]

Figure 1. CPUE distribution in the western and central Pacific Ocean.

Figure 2. CPUE monthly means data change line chart.

Figure 3. Distribution of environmental feature data: (a) CPUE and Month; (b) SLA and EKE; (c) Temperature profiles; (d) SST gradients; (e) Chlorophyll-a metrics; (f) Climate indices.

Figure 4. Flowchart of research methods.

Figure 5. Comparison between different model predictions and actual values.

Figure 6. Radar charts of different model performance.

Figure 7. Model performance evaluation: (a) R² scores and confidence intervals; (b) RMSE values and confidence intervals; (c) Relative R² improvement; (d) Performance classification.

Figure 8. Comprehensive score of the regression model.

Figure 9. Pearson correlation coefficients between CPUE and environmental factors.

Figure 10. LightGBM important feature value ranking.

Figure 11. Model assumption validation results: (a) Q-Q plot for normality test; (b) Residual distribution histogram; (c) Residuals vs. predicted values scatter plot; (d) Autocorrelation function plot; (e) Residual sequence plot.

Figure 12. SHAP decision plot for the top 200 testing samples.

Figure 13. Honeycomb diagram of important features of biaxial SHAP.

Table 1. Comparison table of MAE, MSE, RMSE, EVS, and R² scores for Regression Models.

Model\Score	MAE	MSE	RMSE	EVS	R²
LightGBM	1.611484	11.643110	3.412200	0.221116	0.220319
ExtraTrees	1.610001	11.887523	3.447829	0.204035	0.203952
CatBoost	1.677910	12.004651	3.464773	0.197216	0.196108
RandomForest	1.634359	12.091773	3.477323	0.190333	0.190274
Bagging	1.636302	12.130093	3.482828	0.187753	0.187708
GradientBoosting	1.706000	12.256261	3.500894	0.180118	0.179259
KNeighbors	1.709640	12.541822	3.541443	0.161668	0.160137
XGBoost	1.690276	13.205482	3.633935	0.116058	0.115695
ElasticNet	1.911213	13.881904	3.725843	0.072130	0.070398
Lasso	1.911046	13.894650	3.727553	0.071288	0.069545
Ridge	1.911232	13.899188	3.728161	0.071003	0.069241
Linear	1.911564	13.906065	3.729084	0.070548	0.068780
MLP	1.872767	14.365285	3.790156	0.054812	0.038029
Huber	1.802355	14.773407	3.843619	0.057400	0.010699
AdaBoost	3.057544	18.747417	4.329829	−0.064498	−0.255421
DecisionTree	2.131442	20.764380	4.556795	−0.390487	−0.390487

Table 2. LightGBM Optimal Parameter Table.

Parameter Category	Parameter Name	Optimal Value	Description
Basic Parameters	learning_rate	0.01	Controls the learning step size; smaller values improve model stability.
	n_estimators	800	Number of decision trees; a higher number may enhance model performance.
	max_depth	10	Maximum depth of each tree; limits complexity to reduce overfitting.
Regularization	reg_alpha	0.1	L1 regularization coefficient; encourages sparsity in the model.
Regularization	reg_lambda	10	L2 regularization coefficient; prevents excessively large weights.
Feature and Sampling	colsample_bytree	0.9	Proportion of features sampled per tree (90%).
	subsample	0.9	Proportion of samples used for training (90%), enhancing generalization.
	importance_type	gain	Feature importance evaluation method.
Tree Structure	max_bin	700	Number of bins for continuous features; higher values improve precision.
	num_leaves	35	Maximum number of leaves per tree; balances complexity and performance.
	min_child_samples	20	Minimum number of samples in a leaf to prevent overfitting.
	min_split_gain	0.2	Minimum gain required to make a split.
Randomness Control	bagging_freq	10	Performs bagging every 10 iterations.
	random_state	42	Global random seed to ensure reproducibility.
	bagging_seed	42	Random seed for the bagging process.
Computational Efficiency	n_jobs	−1	Uses all available CPU cores to accelerate training.
Evaluation Metric	Best Score	0.255	Best model performance on the validation set.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Article Metrics

Citations

Article Access Statistics

Journal Statistics

Article metric data becomes available approximately 24 hours after publication online.