1. Introduction
The flotation process is one of the most complex stages in mineral processing, consisting of multiple strongly coupled sub-processes. In the flotation process, it is of vital importance to accurately predict the concentrate grade. On the one hand, the concentrate grade is the key performance indicator that reflects both separation efficiency and product quality, determining the metal recovery rate and the economic value of the concentrate. A concentrate grade below the contractual specifications may result in price reductions, significantly affecting a plant’s economic performance and its ability to meet its contractual obligations. On the other hand, real-time prediction of the concentrate grade enables early detection of deviations during production, allowing operators to effectively tune flotation parameters, such as collector dosage, air flow, slurry density, and froth level, in order to suppress process fluctuations, reduce reagent consumption, improve process responsiveness, and enhance resource-use efficiency. Therefore, an effective concentrate grade prediction method is crucial for ensuring stable flotation operations, maintaining concentrate quality, and maximizing profitability.
However, it is challenging to build an accurate model for concentrate grade prediction. Firstly, the flotation production environment is complex, and the data collection is difficult. The limitations of data collection sensors lead to outliers and missing values in the production database. Secondly, flotation is a highly nonlinear and multi-coupled process. The relationships among production indicators are impacted by complex physical and chemical reactions, which cannot be adequately captured by simple mechanistic equations. The flotation process exhibits intricate mechanisms, with multiple variables interdependently influencing the process [
1]. Although data-driven methods have been applied to grade prediction, most existing methods depend on a single prediction model [
2,
3], which may lead to limited generalization capabilities and unstable outputs due to the strong variability and noise in flotation data.
To address these challenges, we propose a copper concentrate grade prediction method based on an improved eXtreme Gradient Boosting (XGBoost) model, using a real flotation dataset collected from practical production scenarios. Firstly, a box-plot analysis is employed to detect outliers in the dataset, and both the outliers and the missing values are imputed using the MissForest (MF) algorithm, a nonparametric method based on random forests that is well suited to the complex nonlinear relationships and mixed variable types in the collected dataset. Then, the XGBoost model is trained using key indicators, including feed grade, feed rate, reagent concentrations, slurry and air flow rates, froth level, and pH value. By integrating multiple decision trees through boosting, the model effectively captures the complex dependencies among these indicators to ensure more accurate grade predictions. Hyper-parameter tuning is performed using a Tree-Structured Parzen Estimator (TPE), a Bayesian optimization technique that probabilistically navigates the hyper-parameter space and thus avoids the inefficiency and sub-optimality of manual tuning, which is particularly important because the flotation process involves many coupled parameters. By combining IQR/MissForest data processing with TPE-optimized XGBoost, an end-to-end prediction pipeline for the copper concentrate grade in the flotation process is achieved, covering both data processing and model training. It ensures the accuracy of outlier detection and data imputation as well as the effectiveness of the trained prediction model.
Compared with AutoML, LightGBM, CatBoost, and Gradient Boosting, XGBoost offers better interpretability, robustness, and real-time performance; it is less prone to overfitting and does not impose strict requirements on the types of training data. Because AutoML automatically searches across multiple models and hyper-parameter combinations, the resulting model tends to be complex and harder to interpret. Moreover, AutoML’s search paths can differ greatly under different settings, so the final models obtained may vary significantly, which compromises stability and robustness. In the flotation process, data acquisition is low-frequency and the sample size is small. LightGBM is prone to overfitting when the sample size is limited; the trees it learns may be highly unbalanced or irregular, which reduces model interpretability; and during prediction, traversing complex tree structures may lengthen inference times and thus weaken real-time performance. CatBoost uses symmetric trees and built-in categorical feature encoding, making its structure more complex than that of standard tree models, which leads to longer prediction times and poorer real-time performance. Furthermore, CatBoost’s strength lies in handling categorical features, but the data collected in flotation production are purely numerical, so CatBoost cannot fully exploit its advantages. Gradient Boosting is also prone to overfitting when the sample size is small, and it does not optimize how tree structures are stored, resulting in long prediction times and poor real-time performance. Because the flotation process is costly and complex, high demands are placed on model interpretability and robustness. XGBoost trains relatively regular (balanced) tree structures, and its regularization parameters and feature importance can be controlled and inspected, yielding strong interpretability. Its prediction results tend to be stable, with high robustness, and its relatively simple structure leads to short prediction times and thus good real-time performance, while being less prone to overfitting. In addition, XGBoost places no stringent demands on the training data types, so it can readily use the collected numerical flotation production data for training. To demonstrate the superiority of our method, we compare our approach with LightGBM, CatBoost, AutoML, and Gradient Boosting in terms of interpretability, robustness, real-time performance, susceptibility to overfitting, and requirements on the training data in
Table 1.
The contributions of our method are as follows:
Our proposed method addresses the issues of outliers and missing values in the collected dataset by identifying outliers based on the Inter-Quartile Range (IQR) method with a box-plot analysis and imputing the outliers and missing values using the MissForest (MF) algorithm.
Aiming at the shortcomings of mechanistic, linear, and single prediction models, we employ the eXtreme Gradient Boosting (XGBoost) model to capture the high nonlinearity and strong coupling between flotation indicators.
To overcome the inefficiencies of manual hyper-parameter tuning, we integrate the Tree-Structured Parzen Estimator (TPE). Compared with traditional methods like grid or random search, it has superior tuning outcomes.
The proposed method is evaluated using a real flotation dataset. The experimental results show that our method achieves high accuracies in predicting the copper concentrate grade.
The structure of our paper is as follows: In
Section 2, we introduce some related work on product quality prediction across various areas, especially the steelmaking industry. In
Section 3, we provide an overview of our method.
Section 4 and
Section 5 present detailed descriptions of our proposed prediction methods.
Section 6 shows the experimental results.
Section 7 summarizes our work.
3. Overall Framework
In this section, we will introduce the overall framework of our proposed copper flotation concentrate grade prediction method.
In the data processing stage, the method employs the Inter-Quartile Range (IQR) technique to detect outliers in the collected production dataset and marks them as missing values. Because the flotation process is a complex chemical and physical process involving multiple variables, the relationships between these variables are not linear but are influenced by the interaction of various factors, resulting in a non-normal distribution of the collected data. A box-plot is used for outlier detection since it does not make any assumptions about the data distribution. Subsequently, the original missing values and the detected outliers are imputed using the MissForest (MF) algorithm, which is derived from the random forest approach, to construct a complete dataset suitable for model training. The processed dataset is then normalized, scaling the input data to a fixed range to mitigate the effects of different magnitudes of attribute values.
In the model training stage, a copper flotation concentrate grade prediction method is developed based on an improved Extreme Gradient Boosting (XGBoost) algorithm, which utilizes ensemble learning to integrate multiple decision tree models and thus accurately predict the copper concentrate grade. Meanwhile, the hyper-parameter tuning process of XGBoost is optimized using the Tree-Structured Parzen Estimator (TPE) algorithm. The proposed method is capable of aligning with real flotation conditions and effectively addresses issues such as data anomalies and low prediction accuracy encountered during the flotation production. The overall framework of the algorithm is depicted in
Figure 1.
4. IQR-MF-Based Copper Concentrate Production Data Processing Method
In actual flotation production, sensor failures, manual recording errors, and process fluctuations often lead to outliers and missing values in production data. Such outliers and missing values severely compromise the training of the prediction models. Therefore, it is essential to detect and correct or impute these irregularities to preserve the model accuracy. The paper employs the Inter-Quartile Range (IQR) method to identify outliers in the collected production dataset. In this way, the outliers for each production indicator are detected via the box-plot analysis. Since the collected production dataset tends to contain a large number of outliers, simply removing them or replacing them with means or medians may discard too much information, thereby undermining the prediction accuracy. Consequently, our method marks the detected outliers as missing values and then applies the MissForest (MF) algorithm to impute both originally missing values and these newly marked outliers. Finally, all imputed values are normalized to eliminate the effects of different magnitudes of attribute values. In this section, we will introduce the proposed IQR-based outlier detection method and the MF-based data imputation method.
4.1. Outlier Detection Based on IQR
Detecting the outliers in the collected copper concentrate production data is an important step in the data processing stage. In the paper, the actual flotation production data collected from a mining company do not follow a normal distribution, and many extreme values appear among the outliers. Since the box-plot method does not rely on any specific data distribution and is less affected by extreme values, we employ the box-plot method for outlier detection. Unlike conventional methods that either remove outliers directly or impute them with mean or median values, our method marks the identified outliers as missing values and imputes them using our proposed missing value imputation method. In this way, the information in the dataset is effectively preserved, which helps maintain the accuracy of the prediction model.
In addition, before applying the box-plot method for outlier detection, the production data need to be screened. Because all flotation production indicators are positive, any negative values in the collected production dataset are initially treated as missing values and are imputed using our proposed missing value imputation method. Similarly, for missing values that already exist in the raw dataset, the same imputation method is applied.
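To make the procedure concrete, the following is a minimal sketch of the IQR-based screening described above, assuming the production indicators are held in a pandas DataFrame; the column names and the multiplier value are illustrative only.

```python
import numpy as np
import pandas as pd

def mark_outliers_iqr(df: pd.DataFrame, m: float = 1.5) -> pd.DataFrame:
    """Mark values outside [Q1 - m*IQR, Q3 + m*IQR], and negative values, as NaN."""
    out = df.copy()
    for col in out.columns:
        q1, q3 = out[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        lower, upper = q1 - m * iqr, q3 + m * iqr
        # Negative readings are physically impossible for flotation indicators,
        # so they are treated as missing values as well.
        mask = (out[col] < lower) | (out[col] > upper) | (out[col] < 0)
        out.loc[mask, col] = np.nan
    return out

# Example usage with hypothetical indicator columns:
# df_marked = mark_outliers_iqr(df[["concentration", "flow_rate", "air_flow",
#                                   "liquid_level", "pH", "feed_grade", "ore_rate"]])
```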
4.2. Data Imputation Based on MF
Because the values of production indicators, such as concentration, flow rate, pneumatic flow, liquid level, and pH, contain numerous outliers, directly discarding them would result in significant information loss. Therefore, the detected outliers need to be imputed. Given that production indicators in the flotation process exhibit complex, nonlinear coupling, traditional statistical imputation methods struggle to accurately replace these outliers with reliable values. Consequently, in the paper, the MissForest (MF) algorithm is employed to impute the missing data.
The MissForest (MF) algorithm is a nonparametric imputation method based on random forest models. It treats known variables as predictors and variables with missing values as targets, building several random forest models to predict and impute the missing values. This algorithm does not require assumptions such as a normal data distribution and is thus broadly applicable, especially in the presence of nonlinear interactions.
In the MF-based data imputation method, the missing values are first filled temporarily using simple statistics, that is, the mean for continuous variables and the mode for categorical ones. Next, on the preliminarily imputed dataset, the missing rate of each attribute is computed, and the missing values are imputed iteratively in ascending order of missing rate. For each attribute, a random forest model is trained on the preliminarily imputed data and then used to predict reliable values to replace the missing entries.
During the model training process, the random forest model is iteratively updated based on the predictions in each round until either a predefined maximum number of iterations is reached or a stopping criterion is satisfied. The stopping criterion is

$$\Delta^{(t)} = \frac{\sum_{j=1}^{N}\left(\mathbf{X}_{\mathrm{imp},j}^{(t)} - \mathbf{X}_{\mathrm{imp},j}^{(t-1)}\right)^{2}}{\sum_{j=1}^{N}\left(\mathbf{X}_{\mathrm{imp},j}^{(t)}\right)^{2}} < \varepsilon,$$

where $\varepsilon$, $\mathbf{X}_{\mathrm{imp}}$, $t$, $j$, and $N$ represent the threshold, the imputed dataset, the iteration count, the attribute index, and the number of attributes, respectively. If the gap between the imputed datasets of two consecutive iterations is less than $\varepsilon$ after a single iteration, the iteration stops.
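As an illustration of this imputation loop, the sketch below approximates the MF procedure with scikit-learn's IterativeImputer driven by a random-forest estimator; the package and parameter choices are assumptions for illustration, not the exact implementation used in this work.

```python
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

# Random-forest-driven iterative imputation, a close analogue of MissForest:
# each attribute is re-estimated in ascending order of missing rate until the
# change between successive imputed datasets is small enough (tol, playing the
# role of the threshold epsilon above) or max_iter is reached.
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=100, random_state=0),
    initial_strategy="mean",        # provisional fill before the first round
    imputation_order="ascending",   # attributes with fewest missing values first
    max_iter=10,
    tol=1e-3,
    random_state=0,
)
X_imputed = imputer.fit_transform(df_marked)  # df_marked: output of the IQR step
```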
In addition, to mitigate the impact of different magnitudes of attribute values and improve the convergence speed, the input flotation production data are normalized to a fixed range prior to model training.
5. Copper Flotation Concentrate Grade Prediction Method Based on TPE-Optimized XGBoost Algorithm
Copper concentrate grade prediction is influenced by multiple production indicators during flotation operations, and traditional mechanistic models struggle to accurately capture the nonlinear relationships among these variables. To address this issue, an XGBoost model is trained on our collected real-world production data to predict the copper concentrate grade, and a TPE-optimized parameter tuning strategy is proposed to enhance the tuning efficiency. In this section, we will introduce the construction of the XGBoost model and the proposed TPE-optimized parameter tuning strategy.
5.1. XGBoost Model Construction
As an ensemble learning algorithm grounded in gradient-boosted trees, XGBoost integrates multiple decision tree models sequentially, overcoming the limitations of single decision trees, such as poor generalization, high fluctuations in predictions, and low model stability, thereby enhancing the prediction accuracy.
The trained XGBoost model consists of multiple decision tree models. Given an input dataset $D = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$, these individual decision tree models are combined via the additive mechanism of ensemble learning. Suppose there are $K$ decision tree models, each decision tree model is denoted as $f_k$, and the set of models is denoted as $\mathcal{F}$. The model can be represented as

$$\hat{y}_i = \sum_{k=1}^{K} f_k(\mathbf{x}_i), \quad f_k \in \mathcal{F}.$$

The objective function during the model training is defined as

$$\mathcal{L} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i\right) + \sum_{k=1}^{K} \Omega\left(f_k\right).$$

In the objective, $l(y_i, \hat{y}_i)$ denotes the loss function, which measures the discrepancy between the predictions and the true values, and $\Omega(f_k)$ represents the regularization term, defined to quantify the complexity of the decision tree and thus mitigate the risk of model overfitting. Its formulation is as follows:

$$\Omega(f) = \gamma T + \frac{1}{2} \lambda \lVert w \rVert^{2},$$

where $T$ denotes the number of leaves in each decision tree, $w$ denotes the weight of each leaf, and $\gamma$ and $\lambda$ are hyper-parameters.

Due to the additive nature of ensemble learning models, the prediction $\hat{y}_i^{(t)}$ of the strong learner at iteration $t$ is the sum of the prediction from the previous trees $\hat{y}_i^{(t-1)}$ and the new tree’s prediction $f_t(\mathbf{x}_i)$. Similarly, in the regularization term, the total complexity is the sum of the complexities of all previous trees plus that of the current tree. Substituting this mechanism into the objective function, it can be expressed as

$$\mathcal{L}^{(t)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t-1)} + f_t(\mathbf{x}_i)\right) + \Omega\left(f_t\right) + \mathrm{const}.$$

Performing a second-order Taylor expansion on the objective function incorporating the additive mechanism, it can be expressed as

$$\mathcal{L}^{(t)} \approx \sum_{i=1}^{n}\left[l\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(\mathbf{x}_i) + \frac{1}{2} h_i f_t^{2}(\mathbf{x}_i)\right] + \Omega\left(f_t\right) + \mathrm{const},$$

where $g_i = \partial_{\hat{y}^{(t-1)}}\, l\left(y_i, \hat{y}^{(t-1)}\right)$ denotes the first derivative of the loss function, and $h_i = \partial^{2}_{\hat{y}^{(t-1)}}\, l\left(y_i, \hat{y}^{(t-1)}\right)$ denotes the second derivative. Since the constant term and the structures of the previous $t-1$ decision trees are already fixed, they are also constant terms. Thus, during the model optimization, their effects can be ignored. Therefore, by simplifying and removing these constant terms, the objective function simplifies into

$$\tilde{\mathcal{L}}^{(t)} = \sum_{i=1}^{n}\left[g_i f_t(\mathbf{x}_i) + \frac{1}{2} h_i f_t^{2}(\mathbf{x}_i)\right] + \Omega\left(f_t\right).$$

Since the tree complexity comprises both the number of leaf nodes and the $L_2$ norm of the leaf node weight vector, the complexity function can be expressed as

$$\Omega\left(f_t\right) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^{2},$$

where $T$ represents the number of leaf nodes, and $w_j$ denotes the weight of the $j$-th leaf node.

A mapping function $q: \mathbb{R}^{d} \rightarrow \{1, \dots, T\}$ is defined from the input samples to the leaf nodes, where all samples $\mathbf{x}_i$ that belong to the $j$-th leaf node are grouped into a set

$$I_j = \{\, i \mid q(\mathbf{x}_i) = j \,\},$$

and we can obtain $f_t(\mathbf{x}_i) = w_{q(\mathbf{x}_i)}$, where $w$ denotes a one-dimensional vector of length $T$, representing the weights of the tree’s leaf nodes. Consequently, the objective function can be rewritten as

$$\tilde{\mathcal{L}}^{(t)} = \sum_{j=1}^{T}\left[\left(\sum_{i \in I_j} g_i\right) w_j + \frac{1}{2}\left(\sum_{i \in I_j} h_i + \lambda\right) w_j^{2}\right] + \gamma T.$$

The coefficients of the linear term and the quadratic term in the objective function are merged:

$$G_j = \sum_{i \in I_j} g_i \quad \text{and} \quad H_j = \sum_{i \in I_j} h_i,$$

where $G_j$ denotes the sum of the first-order derivatives of the loss over all samples that fall into leaf node $j$, and $H_j$ denotes the corresponding sum of second-order derivatives. They are both constants. We partition the training data by leaf node, grouping all samples that belong to leaf $j$ into the set $I_j$. Substituting the aggregated coefficients for each group into the objective function leads to its simplified form:

$$\tilde{\mathcal{L}}^{(t)} = \sum_{j=1}^{T}\left[G_j w_j + \frac{1}{2}\left(H_j + \lambda\right) w_j^{2}\right] + \gamma T.$$

From the above formula, we can derive that the objective function for each leaf node is given by

$$\mathcal{L}_j = G_j w_j + \frac{1}{2}\left(H_j + \lambda\right) w_j^{2}.$$

From the objective function, we can derive both the minimum value of the objective function for each leaf node and the corresponding optimal weight value for the leaf node when the objective function reaches its minimum:

$$w_j^{*} = -\frac{G_j}{H_j + \lambda} \quad \text{and} \quad \mathcal{L}_j^{*} = -\frac{1}{2}\,\frac{G_j^{2}}{H_j + \lambda}.$$

At this point, the objective function of the entire model also reaches its minimum:

$$\tilde{\mathcal{L}}^{(t)*} = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^{2}}{H_j + \lambda} + \gamma T.$$
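For reference, the quantities in this derivation correspond directly to the regularization controls exposed by the XGBoost library; a minimal training sketch is shown below, where the hyper-parameter values are placeholders that are later replaced by the TPE search in Section 5.2, and X_imputed and y are the processed flotation indicators and the concentrate grade.

```python
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# X_imputed and y denote the processed flotation indicators and the
# copper concentrate grade, respectively.
X_train, X_test, y_train, y_test = train_test_split(
    X_imputed, y, test_size=0.2, random_state=0)

model = xgb.XGBRegressor(
    n_estimators=300,          # K, the number of boosted trees
    max_depth=5,
    learning_rate=0.05,
    gamma=0.1,                 # gamma: per-leaf penalty (the T term in Omega)
    reg_lambda=1.0,            # lambda: L2 penalty on the leaf weights w
    subsample=0.8,
    colsample_bytree=0.8,
    objective="reg:squarederror",
)
model.fit(X_train, y_train)
rmse = mean_squared_error(y_test, model.predict(X_test)) ** 0.5
```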
5.2. TPE-Optimized Parameter Tuning Strategy
Since many hyper-parameters, which greatly affect the model’s prediction performance, are involved in training the XGBoost model, manual hyper-parameter tuning is not only time-consuming but also unlikely to yield optimal results. The flotation process involves many parameters, and XGBoost has numerous hyper-parameters as well, with coupling relationships among them, so the combined search space becomes very large. A common grid search is inefficient, and random search yields unstable results. Intelligent optimization algorithms such as particle swarm optimization converge slowly and require tuning even more parameters of their own. In this paper, to optimize XGBoost’s hyper-parameters, we use the TPE algorithm, a sequential model-based Bayesian method that effectively leverages historical trial information and does not require adjusting its own parameter settings, thereby improving the hyper-parameter tuning efficiency.
First, according to Bayes’ theorem, we decompose $p(y \mid x)$ into $p(x \mid y)$ and $p(y)$. Then, we partition $p(x \mid y)$ into two regions:

$$p(x \mid y) = \begin{cases} \ell(x), & y < y^{*}, \\ g(x), & y \geq y^{*}. \end{cases}$$

In the equation, according to the setting of the hyper-parameter $\gamma$, we define $y^{*}$ as the quantile of $y$ at level $\gamma$, i.e., $p(y < y^{*}) = \gamma$. Consequently, we can derive

$$p(x) = \int p(x \mid y)\, p(y)\, \mathrm{d}y = \gamma\, \ell(x) + (1 - \gamma)\, g(x).$$

By substituting the above formulas into the expected improvement (EI) expression and simplifying, we obtain the following result:

$$\mathrm{EI}_{y^{*}}(x) = \int_{-\infty}^{y^{*}} \left(y^{*} - y\right) p(y \mid x)\, \mathrm{d}y \;\propto\; \left(\gamma + \frac{g(x)}{\ell(x)}\,(1 - \gamma)\right)^{-1}.$$

From the above formula, the EI is inversely proportional to the denominator $\gamma + (1-\gamma)\, g(x)/\ell(x)$. Therefore, once the hyper-parameter $\gamma$ is fixed, the value of $x$ that minimizes the ratio $g(x)/\ell(x)$, which corresponds to maximizing $\ell(x)$ relative to $g(x)$, maximizes the EI and is the sought optimal solution.
The search space is as follows: The number of trees lies in the interval ; maximum tree depth lies in the interval ; the learning rate lies in the interval ; the row subsample ratio lies in the interval ; the column subsample ratio lies in the interval ; the minimum sum of the instance weight in the leaf lies in the interval ; the minimum loss reduction to make a split lies in the interval ; the L1 regularization term lies in the interval ; and the L2 regularization term lies in the interval .
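A sketch of how such a TPE search could be wired up with the hyperopt library is given below; the numeric bounds are illustrative assumptions standing in for the intervals listed above, and the objective minimizes the cross-validated RMSE on the training split defined earlier.

```python
import numpy as np
import xgboost as xgb
from hyperopt import fmin, tpe, hp, Trials
from sklearn.model_selection import cross_val_score

# Illustrative search space; the bounds are placeholders, not the exact
# intervals used in this paper.
space = {
    "n_estimators":     hp.quniform("n_estimators", 100, 500, 10),
    "max_depth":        hp.quniform("max_depth", 3, 10, 1),
    "learning_rate":    hp.loguniform("learning_rate", np.log(0.01), np.log(0.3)),
    "subsample":        hp.uniform("subsample", 0.6, 1.0),
    "colsample_bytree": hp.uniform("colsample_bytree", 0.6, 1.0),
    "min_child_weight": hp.quniform("min_child_weight", 1, 10, 1),
    "gamma":            hp.uniform("gamma", 0.0, 1.0),
    "reg_alpha":        hp.uniform("reg_alpha", 0.0, 1.0),
    "reg_lambda":       hp.uniform("reg_lambda", 0.0, 2.0),
}

def objective(params):
    # Cast the integer-valued hyper-parameters sampled by quniform.
    for key in ("n_estimators", "max_depth", "min_child_weight"):
        params[key] = int(params[key])
    model = xgb.XGBRegressor(objective="reg:squarederror", **params)
    # 5-fold cross-validated RMSE; hyperopt minimizes the returned value.
    return -cross_val_score(model, X_train, y_train, cv=5,
                            scoring="neg_root_mean_squared_error").mean()

trials = Trials()
best = fmin(fn=objective, space=space, algo=tpe.suggest,
            max_evals=100, trials=trials)
```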
Regarding convergence, the TPE reformulates hyper-parameter optimization as a sequential model-based optimization problem. It builds a model sequentially from historical trial results to estimate hyper-parameter performance and uses this model to choose which hyper-parameters to evaluate next. Meanwhile, the TPE employs an adaptive probabilistic model to generate two probability density functions: one based on hyper-parameter trials with good performance and the other based on those with poorer performance. The search converges by continuously sampling new hyper-parameter combinations from the “good” density function, because these combinations are more likely to improve the model’s performance. The TPE uses the expected improvement as the criterion to select the next hyper-parameter combination; it always prefers the point with the greatest expected improvement over the currently known best result, thus balancing exploration and exploitation.
6. Experimental Evaluation
6.1. Dataset
To verify the effectiveness of the proposed copper concentrate grade prediction method in practical production scenarios, in this paper, we use an actual production dataset from a mining company for experiments.
The dataset consists of flotation process data from the mining company’s actual production, which is used for model training and validation in our proposed copper flotation concentrate grade prediction method. The dataset contains 365 records of actual production batches. To build and evaluate the prediction model, the dataset is divided into a training set and a testing set. The training set consists of 292 samples (approximately 80% of the total dataset), used to train the model, and the remaining 73 production batch records are used as the testing set to evaluate the performance of our proposed models. The input variables include concentration, flow rate, airflow rate, liquid level, pH value, ore grade, and original ore grade, which are represented as $x_1$ to $x_7$. The output variable is the copper concentrate grade, denoted as $y$. The input variables are sampled once per hour, while the output variable is sampled once per day. Due to the different time scales of data collection, the input variable data collected within one day are aggregated and averaged so that the input features and the target variable share the same time scale. In the preprocessing step, to prevent the effects of different data scales and accelerate convergence, we apply Z-score normalization to the input data before the model training. To ensure robustness against overfitting, we conduct the experimental evaluations using k-fold cross-validation, where k is set to 5.
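For clarity, the hourly-to-daily alignment described above can be expressed as follows, assuming the raw records are loaded into pandas DataFrames; the file and column names are hypothetical.

```python
import pandas as pd

# Hourly input indicators and daily concentrate grade assays
# (file and column names are hypothetical).
hourly = pd.read_csv("flotation_inputs.csv",
                     parse_dates=["timestamp"], index_col="timestamp")
daily_grade = pd.read_csv("concentrate_grade.csv",
                          parse_dates=["date"], index_col="date")

# Average the hourly readings of each indicator over each day, then align
# the daily features with the daily grade label on the date index.
daily_inputs = hourly.resample("D").mean()
dataset = daily_inputs.join(daily_grade, how="inner")
```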
6.2. Experimental Environment
The experiment is conducted on a Windows operating system with an Intel Core i5-12600KF processor. This processor has 10 cores and 16 threads, with a base clock speed of 3.7 GHz. The system is equipped with 32 GB of DDR4 RAM and a 512 GB SSD. The experiments are implemented in Python 3.12.4.
6.3. Evaluation Metrics
To evaluate the effectiveness of the copper concentrate grade prediction method, we select the Root Mean Square Error (RMSE) and the Mean Absolute Percentage Error (MAPE) as the performance evaluation metrics. These two metrics are commonly used to assess the degree of deviation between the actual values and the predicted values. The metrics are computed as follows:
$$\mathrm{RMSE} = \sqrt{\frac{1}{m} \sum_{i=1}^{m}\left(y_i - \hat{y}_i\right)^{2}}$$

and

$$\mathrm{MAPE} = \frac{1}{m} \sum_{i=1}^{m}\left|\frac{y_i - \hat{y}_i}{y_i}\right| \times 100\%,$$

where $y_i$ represents the actual values of the concentrate grade, $\hat{y}_i$ represents the predicted values, and $m$ is the number of samples. The MAPE and RMSE represent the deviation between the predicted and actual values, and the closer their values are to 0, the better the prediction performance.
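The two metrics can be computed directly, for example with a short NumPy implementation such as the following:

```python
import numpy as np

def rmse(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def mape(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100.0
```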
6.4. Evaluation of IQR-MF-Based Data Processing Method
We conduct two sensitivity analyses to validate the IQR and MissForest methods. On the one hand, we vary the multiplier $m$ in the box-plot (IQR) rule, which flags values outside $[Q_1 - m \cdot \mathrm{IQR},\ Q_3 + m \cdot \mathrm{IQR}]$ as outliers. For each value of $m$, outliers are detected and then set to null (missing). We then impute the missing values using the MissForest algorithm. After the data processing, we apply the TPE-XGBoost algorithm and evaluate the prediction performance via the RMSE and MAPE, thereby assessing how different parameter settings affect the prediction results. On the other hand, to validate the effectiveness of the integration of the IQR and MissForest, we conduct the data processing under the following four scenarios: A. Using the IQR and MissForest for both outlier detection and imputation; B. Using only the IQR for outlier detection and deleting the outliers directly; C. Using only MissForest for imputation; D. Using neither outlier detection nor imputation. After the data processing, in all cases, we use TPE-XGBoost to test the prediction performance via the RMSE and MAPE to assess how the IQR and MissForest affect the prediction results. The experimental results are shown in
Table 2 and
Table 3.
As the value of $m$ increases, both the number and proportion of points identified as outliers decline. For a small value of $m$, both the RMSE and MAPE reach their minimum; however, under this setting, the outlier proportion is very high, which suggests that too many valid samples might have been excluded. Although the prediction performance is still acceptable, this setting carries a risk of overfitting and may harm the model’s generalization ability. When $m$ is set between 1.5 and 2.0, the fluctuations in the RMSE and MAPE are relatively small, indicating that the model’s prediction performance changes within a moderate range and is fairly stable. To ensure reasonable outlier detection and maintain performance stability, we set $m = 1.5$ according to convention. Under this setting, the outlier proportion is controlled at a reasonable level while achieving a stable and excellent prediction performance.
Meanwhile, the results under different data processing scenarios show that Scheme A achieves the best RMSE and MAPE, significantly outperforming the other schemes and demonstrating the superiority of combining the IQR and MissForest for outlier detection and missing value imputation. Comparing Scheme A and Scheme B, we can see that Scheme B, which directly deletes the outlier samples, results in a large reduction in the training sample size, and its RMSE increases compared to that under Scheme A. This indicates that using MissForest to impute outliers rather than simply deleting them makes better use of the data and yields a more accurate model. The performances of Scheme C and Scheme D are both significantly worse than that of Scheme A. Scheme C fails to improve the performance effectively, which shows that although MissForest can handle missing values, its robustness to outliers is insufficient to replace outlier detection. IQR-based outlier identification is therefore an indispensable step for improving model accuracy.
6.5. Evaluation of TPE-XGBoost-Based Prediction Method
To verify the effectiveness of the proposed method, in this section, we select random forest (RF), Ordinary Least Squares (OLS), a Back-Propagation Neural Network (BPNN), a support vector machine (SVM), and a convolutional neural network (CNN) as benchmarks and compare the proposed prediction method with them. Our method is abbreviated as TPE-XGBoost. The comparison results are shown in
Table 4. In the table, the mean values, standard deviations (SDs), 95% CIs, and
p-values are presented.
As shown in
Table 4, the prediction method proposed in this paper achieves the best results for both the RMSE and MAPE among all of the comparison benchmarks, indicating that the proposed method has the highest prediction accuracy and the smallest prediction error. Compared with the benchmarks, the RMSE is reduced by 1.53–35.21%, and the MAPE is reduced by 2.60–40.75%, which fully validates the superiority of the improved XGBoost algorithm proposed in this paper. The CNN requires a large amount of training data to ensure its effectiveness. Due to the low data collection frequency in the flotation process, there is limited data available for training the model. Therefore, the CNN shows the worst prediction performance, with a MAPE exceeding 10%, indicating a large prediction error. This suggests that the CNN is not well suited to copper concentrate grade prediction under such limited-data conditions.
We also conduct statistical analyses on the experimental results. The p-values of RF, OLS, the BPNN, the SVM, and the CNN are 0.328, 0.034, 0.016, 0.007, and 0.044, respectively. The p-values of OLS, the BPNN, the SVM, and the CNN are all less than 0.05, indicating that the improvements over these benchmarks are statistically significant. Although the p-value of RF is greater than 0.05, repeated trials show that our proposed method outperforms RF across all metrics, which still indicates an advantage of our approach.
To more intuitively demonstrate the accuracy and stability of the prediction methods, the grade values are plotted on the vertical axis and the sample index on the horizontal axis. Curves in different colors represent different prediction methods. The relative prediction error curves are shown in
Figure 2.
Our method uses data to train a model that captures the relationships between the variables in the flotation process, without explicitly modeling the physical process. Nevertheless, the prediction results show that the trained model performs well, indicating that it accurately captures the input-output relationships governed by the underlying physical process.
6.6. Ablation Experiments
In the proposed method, outlier detection based on the box-plot method is first performed on the samples, and the detected outliers are imputed using the MF algorithm. After the data processing, the XGBoost model is used to predict the copper concentrate grade, and the hyper-parameters of the XGBoost model are tuned using the TPE algorithm. To verify the effectiveness of the outlier imputation method and the hyper-parameter tuning method, ablation experiments are conducted by varying whether the MF algorithm is used to handle missing values and whether the TPE algorithm is used for hyper-parameter tuning. The experimental schemes are as follows:
Impute the outliers using the mean, and then use the XGBoost model to predict the copper concentrate grade. This is denoted as XGBoost.
Perform outlier imputation based on the MissForest method, and then use the XGBoost model to predict the copper concentrate grade, without using the TPE algorithm for hyper-parameter tuning. This is denoted as MF-XGBoost.
Impute the outliers using the mean, and use the TPE algorithm to tune the hyper-parameters of the XGBoost model before predicting the copper concentrate grade. This is denoted as TPE-XGBoost.
Perform the experiment using our proposed prediction method completely to validate the effectiveness. This is denoted as MF-TPE-XGBoost.
From the experimental results, it can be concluded that the RMSE and MAPE values obtained from the other three experimental schemes are all higher than those obtained using the method proposed in this paper. This demonstrates that outlier imputation based on the MF algorithm and the hyper-parameter tuning method using the TPE algorithm are crucial for improving the prediction accuracy. Using the MF algorithm to handle missing values can improve data quality, while using the TPE algorithm can obtain better hyper-parameters efficiently for the XGBoost model, thereby enhancing prediction accuracy.
6.7. Analysis of Feature Importance
In this paper, we apply the SHAP (SHapley Additive exPlanations) method to quantitatively analyze the key process variables affecting concentrate grade during flotation. By computing the SHAP values of seven feature variables, including feed grade, ore throughput, concentration, flow rate, air flow, liquid level, and pH value, we uncover the contribution magnitude and interaction mechanism of each process parameter in the prediction of the concentrate grade. The experimental results are shown in
Figure 4.
According to the experimental results, the SHAP importance of reagent concentration, liquid level, air flow rate, pulp flow rate, feed grade, pH value, and ore throughput is 0.2784, 0.2394, 0.2326, 0.2098, 0.1852, 0.1496, and 0.1165, accounting for 19.7%, 17.0%, 16.5%, 14.9%, 13.1%, 10.6%, and 8.3% of the total importance, respectively. Reagent concentration, liquid level, and air flow constitute the core combination of features affecting concentrate grade. The sum of their importances reaches 53.2%, which exceeds half of the total importance. This aligns closely with flotation theory. Reagent concentration, as a core chemical control parameter, directly influences changes in mineral surface hydrophobicity. Liquid level governs the flotation residence time and slurry retention, thereby affecting separation efficiency. And air flow determines the rate and size distribution of bubble generation, influencing the flotation probability for mineral particles.
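The per-feature importance reported above is commonly taken as the mean absolute SHAP value over the evaluated samples; a minimal sketch of how it can be obtained from the trained model with the shap package is shown below, where the variable and feature names are assumptions for illustration.

```python
import numpy as np
import shap

# TreeExplainer operates directly on the trained XGBoost model.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)      # shape: (n_samples, n_features)

# Mean absolute SHAP value per feature, i.e., a global importance ranking.
importance = np.abs(shap_values).mean(axis=0)
feature_names = ["reagent_concentration", "liquid_level", "air_flow",
                 "pulp_flow", "feed_grade", "pH", "ore_throughput"]
ranking = sorted(zip(feature_names, importance), key=lambda item: -item[1])
```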
The heatmap of the feature interaction strengths reveals coupling relationships among the features. Reagent concentration and liquid level show the strongest coupling and exhibit significant synergistic effects. Air flow rate and pulp flow rate have a certain coupling effect. And there is an adaptation relationship between feed grade and reagent concentration, linking the raw material characteristics to the process parameter settings.