1. Introduction
The flotation process is one of the most complex stages in mineral processing, consisting of multiple strongly coupled sub-processes. In the flotation process, it is of vital importance to accurately predict the concentrate grade. On the one hand, the concentrate grade is the key performance indicator that reflects both separation efficiency and product quality, determining the metal recovery rate and the economic value of the concentrate. A concentrate grade below the contractual specifications may result in price reductions, significantly affecting a plant’s economic performance and its ability to meet its contractual obligations. On the other hand, real-time prediction of the concentrate grade enables early detection of deviations during production, allowing operators to effectively tune flotation parameters, such as collector dosage, air flow, slurry density, and froth level, in order to suppress process fluctuations, reduce reagent consumption, improve process responsiveness, and enhance resource-use efficiency. Therefore, an effective concentrate grade prediction method is crucial for ensuring stable flotation operations, maintaining concentrate quality, and maximizing profitability.
However, it is challenging to build an accurate model for concentrate grade prediction. Firstly, the flotation production environment is complex, and the data collection is difficult. The limitations of data collection sensors lead to outliers and missing values in the production database. Secondly, flotation is a highly nonlinear and multi-coupled process. The relationships among production indicators are impacted by complex physical and chemical reactions, which cannot be adequately captured by simple mechanistic equations. The flotation process exhibits intricate mechanisms, with multiple variables interdependently influencing the process [
1]. Although data-driven methods have been applied to grade prediction, most existing methods depend on a single prediction model [
2,
3], which may lead to limited generalization capabilities and unstable outputs due to the strong variability and noise in flotation data.
To address these challenges, we propose a copper concentrate grade prediction method based on an improved eXtreme Gradient Boosting (XGBoost) model, using a real flotation dataset collected from practical production scenarios. Firstly, a box-plot analysis is employed to detect outliers in the dataset, and both the outliers and the missing values are imputed using the MissForest (MF) algorithm, a nonparametric method based on random forests that is well suited to the complex nonlinear relationships and mixed variable types in the collected dataset. Then, the XGBoost model is trained using key indicators, including feed grade, feed rate, reagent concentrations, slurry and air flow rates, froth level, and pH value. By integrating multiple decision trees through boosting, the model effectively captures the complex dependencies among these indicators to ensure more accurate grade predictions. Hyper-parameter tuning is performed using a Tree-Structured Parzen Estimator (TPE), a Bayesian optimization technique that probabilistically navigates the hyper-parameter space and thus avoids the inefficiency and sub-optimality of manual tuning, which is particularly important because the flotation process involves many coupled parameters. By combining IQR/MissForest data processing with TPE-optimized XGBoost, an end-to-end prediction pipeline for the copper concentrate grade in the flotation process is achieved, covering both data processing and model training. It ensures the accuracy of outlier detection and data imputation as well as the effectiveness of the trained prediction model.
Compared with AutoML, LightGBM, CatBoost, and Gradient Boosting, XGBoost offers better interpretability, robustness, and real-time performance; it is less prone to overfitting and does not impose strict requirements on the types of training data. Because AutoML automatically searches across multiple models and hyper-parameter combinations, the resulting model tends to be complex and harder to interpret. Moreover, AutoML’s search paths can differ greatly under different settings, so the final models obtained may vary significantly, which compromises stability and robustness. In the flotation process, data acquisition is low-frequency and the sample size is small. LightGBM is prone to overfitting when the sample size is limited; the trees it learns may be highly unbalanced or irregular, which reduces model interpretability; and during prediction, traversing complex tree structures may lengthen inference times and thus weaken real-time performance. CatBoost uses symmetric trees and built-in categorical feature encoding, making its structure more complex than that of standard tree models, which leads to longer prediction times and poorer real-time performance. Furthermore, CatBoost’s strength lies in handling categorical features, but the data collected in flotation production are purely numerical, so CatBoost cannot fully exploit its advantages. Gradient Boosting is also prone to overfitting when the sample size is small, and it does not optimize how tree structures are stored, resulting in long prediction times and poor real-time performance. Because the flotation process is costly and complex, high demands are placed on model interpretability and robustness. XGBoost trains relatively regular (balanced) tree structures, and its regularization parameters and feature importance can be controlled and inspected, yielding strong interpretability. Its prediction results tend to be stable, with high robustness, and its relatively simple structure leads to short prediction times and thus good real-time performance, while being less prone to overfitting. In addition, XGBoost places no stringent demands on the training data types, so it can readily use the collected numerical flotation production data for training. To demonstrate the superiority of our method, we compare our approach with LightGBM, CatBoost, AutoML, and Gradient Boosting in terms of interpretability, robustness, real-time performance, susceptibility to overfitting, and requirements on the training data in
Table 1.
The contributions of our method are as follows:
Our proposed method addresses the issues of outliers and missing values in the collected dataset by identifying outliers based on the Inter-Quartile Range (IQR) method with a box-plot analysis and imputing the outliers and missing values using the MissForest (MF) algorithm.
Aiming at the shortcomings of mechanistic, linear, and single prediction models, we employ the eXtreme Gradient Boosting (XGBoost) model to capture the high nonlinearity and strong coupling between flotation indicators.
To overcome the inefficiencies of manual hyper-parameter tuning, we integrate the Tree-Structured Parzen Estimator (TPE). Compared with traditional methods like grid or random search, it has superior tuning outcomes.
The proposed method is evaluated using a real flotation dataset. The experimental results show that our method achieves high accuracies in predicting the copper concentrate grade.
The structure of our paper is as follows: In
Section 2, we introduce some related work on product quality prediction across various areas, especially the steelmaking industry. In
Section 3, we provide an overview of our method.
Section 4 and
Section 5 present detailed descriptions of our proposed prediction methods.
Section 6 shows the experimental results.
Section 7 summarizes our work.
3. Overall Framework
In this section, we will introduce the overall framework of our proposed copper flotation concentrate grade prediction method.
In the data processing stage, the method employs the Inter-Quartile Range (IQR) technique to detect outliers in the collected production dataset and marks them as missing values. Because the flotation process is a complex chemical and physical process involving multiple variables, the relationships between these variables are not linear but are influenced by the interaction of various factors, resulting in a non-normal distribution of the collected data. A box-plot is used for outlier detection since it does not make any assumptions about the data distribution. Subsequently, the original missing values and the detected outliers are imputed using the MissForest (MF) algorithm, which is derived from the random forest approach, to construct a complete dataset suitable for model training. The processed dataset is then normalized, scaling the input data to a fixed range to mitigate the effects of different magnitudes of attribute values.
In the model training stage, a copper flotation concentrate grade prediction method is developed based on an improved Extreme Gradient Boosting (XGBoost) algorithm, which utilizes ensemble learning to integrate multiple decision tree models and thus accurately predict the copper concentrate grade. Meanwhile, the hyper-parameter tuning process of XGBoost is optimized using the Tree-Structured Parzen Estimator (TPE) algorithm. The proposed method is capable of aligning with real flotation conditions and effectively addresses issues such as data anomalies and low prediction accuracy encountered during the flotation production. The overall framework of the algorithm is depicted in
Figure 1.
4. IQR-MF-Based Copper Concentrate Production Data Processing Method
In actual flotation production, sensor failures, manual recording errors, and process fluctuations often lead to outliers and missing values in production data. Such outliers and missing values severely compromise the training of the prediction models. Therefore, it is essential to detect and correct or impute these irregularities to preserve the model accuracy. The paper employs the Inter-Quartile Range (IQR) method to identify outliers in the collected production dataset. In this way, the outliers for each production indicator are detected via the box-plot analysis. Since the collected production dataset tends to contain a large number of outliers, simply removing them or replacing them with means or medians may discard too much information, thereby undermining the prediction accuracy. Consequently, our method marks the detected outliers as missing values and then applies the MissForest (MF) algorithm to impute both originally missing values and these newly marked outliers. Finally, all imputed values are normalized to eliminate the effects of different magnitudes of attribute values. In this section, we will introduce the proposed IQR-based outlier detection method and the MF-based data imputation method.
4.1. Outlier Detection Based on IQR
Detecting the outliers in the collected copper concentrate production data is an important step in the data processing stage. In the paper, the actual flotation production data collected from a mining company do not follow a normal distribution, and many extreme values appear among the outliers. Since the box-plot method does not rely on any specific data distribution and is less affected by extreme values, we employ the box-plot method for outlier detection. Unlike conventional methods that either remove outliers directly or impute them with mean or median values, our method marks the identified outliers as missing values and imputes them using our proposed missing value imputation method. In this way, the information in the dataset is effectively preserved, which helps maintain the accuracy of the prediction model.
In addition, before applying the box-plot method for outlier detection, the production data need to be screened. Because all flotation production indicators are positive, any negative values in the collected production dataset are initially treated as missing values and are imputed using our proposed missing value imputation method. Similarly, for missing values that already exist in the raw dataset, the same imputation method is applied.
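To make the procedure concrete, the following is a minimal sketch of the IQR-based screening described above, assuming the production indicators are held in a pandas DataFrame; the column names and the multiplier value are illustrative only.

```python
import numpy as np
import pandas as pd

def mark_outliers_iqr(df: pd.DataFrame, m: float = 1.5) -> pd.DataFrame:
    """Mark values outside [Q1 - m*IQR, Q3 + m*IQR], and negative values, as NaN."""
    out = df.copy()
    for col in out.columns:
        q1, q3 = out[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        lower, upper = q1 - m * iqr, q3 + m * iqr
        # Negative readings are physically impossible for flotation indicators,
        # so they are treated as missing values as well.
        mask = (out[col] < lower) | (out[col] > upper) | (out[col] < 0)
        out.loc[mask, col] = np.nan
    return out

# Example usage with hypothetical indicator columns:
# df_marked = mark_outliers_iqr(df[["concentration", "flow_rate", "air_flow",
#                                   "liquid_level", "pH", "feed_grade", "ore_rate"]])
```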
4.2. Data Imputation Based on MF
Because the values of production indicators, such as concentration, flow rate, pneumatic flow, liquid level, and pH, contain numerous outliers, directly discarding them would result in significant information loss. Therefore, the detected outliers need to be imputed. Given that production indicators in the flotation process exhibit complex, nonlinear coupling, traditional statistical imputation methods struggle to accurately replace these outliers with reliable values. Consequently, in the paper, the MissForest (MF) algorithm is employed to impute the missing data.
The MissForest (MF) algorithm is a nonparametric imputation method based on random forest models. It treats known variables as predictors and variables with missing values as targets, building several random forest models to predict and impute the missing values. This algorithm does not require assumptions such as a normal data distribution and is thus broadly applicable, especially in the presence of nonlinear interactions.
In the MF-based data imputation method, the missing values are first filled temporarily using simple statistics, that is, the mean for continuous variables and the mode for categorical ones. Next, on the preliminarily imputed dataset, the missing rate of each attribute is computed, and the missing values are imputed iteratively in ascending order of missing rate. For each attribute, a random forest model is trained on the preliminarily imputed data and then used to predict reliable values to replace the missing entries.
During the model training process, the random forest model is iteratively updated based on the predictions in each round until either a predefined maximum number of iterations is reached or a stopping criterion is satisfied. The stopping criterion is

$$\Delta^{(t)} = \frac{\sum_{j=1}^{N}\left(\mathbf{X}_{\mathrm{imp},j}^{(t)} - \mathbf{X}_{\mathrm{imp},j}^{(t-1)}\right)^{2}}{\sum_{j=1}^{N}\left(\mathbf{X}_{\mathrm{imp},j}^{(t)}\right)^{2}} < \varepsilon,$$

where $\varepsilon$, $\mathbf{X}_{\mathrm{imp}}$, $t$, $j$, and $N$ represent the threshold, the imputed dataset, the iteration count, the attribute index, and the number of attributes, respectively. If the gap between the imputed datasets of two consecutive iterations is less than $\varepsilon$ after a single iteration, the iteration stops.
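As an illustration of this imputation loop, the sketch below approximates the MF procedure with scikit-learn's IterativeImputer driven by a random-forest estimator; the package and parameter choices are assumptions for illustration, not the exact implementation used in this work.

```python
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

# Random-forest-driven iterative imputation, a close analogue of MissForest:
# each attribute is re-estimated in ascending order of missing rate until the
# change between successive imputed datasets is small enough (tol, playing the
# role of the threshold epsilon above) or max_iter is reached.
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=100, random_state=0),
    initial_strategy="mean",        # provisional fill before the first round
    imputation_order="ascending",   # attributes with fewest missing values first
    max_iter=10,
    tol=1e-3,
    random_state=0,
)
X_imputed = imputer.fit_transform(df_marked)  # df_marked: output of the IQR step
```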
In addition, to mitigate the impact of different magnitudes of attribute values and improve the convergence speed, the input flotation production data are normalized to a fixed range prior to model training.
5. Copper Flotation Concentrate Grade Prediction Method Based on TPE-Optimized XGBoost Algorithm
Copper concentrate grade prediction is influenced by multiple production indicators during flotation operations, and traditional mechanistic models struggle to accurately capture the nonlinear relationships among these variables. To address this issue, an XGBoost model is trained on our collected real-world production data to predict the copper concentrate grade, and a TPE-optimized parameter tuning strategy is proposed to enhance the tuning efficiency. In this section, we will introduce the construction of the XGBoost model and the proposed TPE-optimized parameter tuning strategy.
5.1. XGBoost Model Construction
As an ensemble learning algorithm grounded in gradient-boosted trees, XGBoost integrates multiple decision tree models sequentially, overcoming the limitations of single decision trees, such as poor generalization, high fluctuations in predictions, and low model stability, thereby enhancing the prediction accuracy.
The trained XGBoost model consists of multiple decision tree models. Given an input dataset $D = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$, these individual decision tree models are combined via the additive mechanism of ensemble learning. Suppose there are $K$ decision tree models, each decision tree model is denoted as $f_k$, and the set of models is denoted as $\mathcal{F}$. The model can be represented as

$$\hat{y}_i = \sum_{k=1}^{K} f_k(\mathbf{x}_i), \quad f_k \in \mathcal{F}.$$

The objective function during the model training is defined as

$$\mathcal{L} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i\right) + \sum_{k=1}^{K} \Omega\left(f_k\right).$$

In the objective, $l(y_i, \hat{y}_i)$ denotes the loss function, which measures the discrepancy between the predictions and the true values, and $\Omega(f_k)$ represents the regularization term, defined to quantify the complexity of the decision tree and thus mitigate the risk of model overfitting. Its formulation is as follows:

$$\Omega(f) = \gamma T + \frac{1}{2} \lambda \lVert w \rVert^{2},$$

where $T$ denotes the number of leaves in each decision tree, $w$ denotes the weight of each leaf, and $\gamma$ and $\lambda$ are hyper-parameters.

Due to the additive nature of ensemble learning models, the prediction $\hat{y}_i^{(t)}$ of the strong learner at iteration $t$ is the sum of the prediction from the previous trees $\hat{y}_i^{(t-1)}$ and the new tree’s prediction $f_t(\mathbf{x}_i)$. Similarly, in the regularization term, the total complexity is the sum of the complexities of all previous trees plus that of the current tree. Substituting this mechanism into the objective function, it can be expressed as

$$\mathcal{L}^{(t)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t-1)} + f_t(\mathbf{x}_i)\right) + \Omega\left(f_t\right) + \mathrm{const}.$$

Performing a second-order Taylor expansion on the objective function incorporating the additive mechanism, it can be expressed as

$$\mathcal{L}^{(t)} \approx \sum_{i=1}^{n}\left[l\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(\mathbf{x}_i) + \frac{1}{2} h_i f_t^{2}(\mathbf{x}_i)\right] + \Omega\left(f_t\right) + \mathrm{const},$$

where $g_i = \partial_{\hat{y}^{(t-1)}}\, l\left(y_i, \hat{y}^{(t-1)}\right)$ denotes the first derivative of the loss function, and $h_i = \partial^{2}_{\hat{y}^{(t-1)}}\, l\left(y_i, \hat{y}^{(t-1)}\right)$ denotes the second derivative. Since the constant term and the structures of the previous $t-1$ decision trees are already fixed, they are also constant terms. Thus, during the model optimization, their effects can be ignored. Therefore, by simplifying and removing these constant terms, the objective function simplifies into

$$\tilde{\mathcal{L}}^{(t)} = \sum_{i=1}^{n}\left[g_i f_t(\mathbf{x}_i) + \frac{1}{2} h_i f_t^{2}(\mathbf{x}_i)\right] + \Omega\left(f_t\right).$$

Since the tree complexity comprises both the number of leaf nodes and the $L_2$ norm of the leaf node weight vector, the complexity function can be expressed as

$$\Omega\left(f_t\right) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^{2},$$

where $T$ represents the number of leaf nodes, and $w_j$ denotes the weight of the $j$-th leaf node.

A mapping function $q: \mathbb{R}^{d} \rightarrow \{1, \dots, T\}$ is defined from the input samples to the leaf nodes, where all samples $\mathbf{x}_i$ that belong to the $j$-th leaf node are grouped into a set

$$I_j = \{\, i \mid q(\mathbf{x}_i) = j \,\},$$

and we can obtain $f_t(\mathbf{x}_i) = w_{q(\mathbf{x}_i)}$, where $w$ denotes a one-dimensional vector of length $T$, representing the weights of the tree’s leaf nodes. Consequently, the objective function can be rewritten as

$$\tilde{\mathcal{L}}^{(t)} = \sum_{j=1}^{T}\left[\left(\sum_{i \in I_j} g_i\right) w_j + \frac{1}{2}\left(\sum_{i \in I_j} h_i + \lambda\right) w_j^{2}\right] + \gamma T.$$

The coefficients of the linear term and the quadratic term in the objective function are merged:

$$G_j = \sum_{i \in I_j} g_i \quad \text{and} \quad H_j = \sum_{i \in I_j} h_i,$$

where $G_j$ denotes the sum of the first-order derivatives of the loss over all samples that fall into leaf node $j$, and $H_j$ denotes the corresponding sum of second-order derivatives. They are both constants. We partition the training data by leaf node, grouping all samples that belong to leaf $j$ into the set $I_j$. Substituting the aggregated coefficients for each group into the objective function leads to its simplified form:

$$\tilde{\mathcal{L}}^{(t)} = \sum_{j=1}^{T}\left[G_j w_j + \frac{1}{2}\left(H_j + \lambda\right) w_j^{2}\right] + \gamma T.$$

From the above formula, we can derive that the objective function for each leaf node is given by

$$\mathcal{L}_j = G_j w_j + \frac{1}{2}\left(H_j + \lambda\right) w_j^{2}.$$

From the objective function, we can derive both the minimum value of the objective function for each leaf node and the corresponding optimal weight value for the leaf node when the objective function reaches its minimum:

$$w_j^{*} = -\frac{G_j}{H_j + \lambda} \quad \text{and} \quad \mathcal{L}_j^{*} = -\frac{1}{2}\,\frac{G_j^{2}}{H_j + \lambda}.$$

At this point, the objective function of the entire model also reaches its minimum:

$$\tilde{\mathcal{L}}^{(t)*} = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^{2}}{H_j + \lambda} + \gamma T.$$
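For reference, the quantities in this derivation correspond directly to the regularization controls exposed by the XGBoost library; a minimal training sketch is shown below, where the hyper-parameter values are placeholders that are later replaced by the TPE search in Section 5.2, and X_imputed and y are the processed flotation indicators and the concentrate grade.

```python
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# X_imputed and y denote the processed flotation indicators and the
# copper concentrate grade, respectively.
X_train, X_test, y_train, y_test = train_test_split(
    X_imputed, y, test_size=0.2, random_state=0)

model = xgb.XGBRegressor(
    n_estimators=300,          # K, the number of boosted trees
    max_depth=5,
    learning_rate=0.05,
    gamma=0.1,                 # gamma: per-leaf penalty (the T term in Omega)
    reg_lambda=1.0,            # lambda: L2 penalty on the leaf weights w
    subsample=0.8,
    colsample_bytree=0.8,
    objective="reg:squarederror",
)
model.fit(X_train, y_train)
rmse = mean_squared_error(y_test, model.predict(X_test)) ** 0.5
```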
5.2. TPE-Optimized Parameter Tuning Strategy
Since many hyper-parameters, which greatly affect the model’s prediction performance, are involved in training the XGBoost model, manual hyper-parameter tuning is not only time-consuming but also unlikely to yield optimal results. The flotation process involves many parameters, and XGBoost has numerous hyper-parameters as well, with coupling relationships among them, so the combined search space becomes very large. A common grid search is inefficient, and random search yields unstable results. Intelligent optimization algorithms such as particle swarm optimization converge slowly and require tuning even more parameters of their own. In this paper, to optimize XGBoost’s hyper-parameters, we use the TPE algorithm, a sequential model-based Bayesian method that effectively leverages historical trial information and does not require adjusting its own parameter settings, thereby improving the hyper-parameter tuning efficiency.
First, according to Bayes’ theorem, we decompose $p(y \mid x)$ into $p(x \mid y)$ and $p(y)$. Then, we partition $p(x \mid y)$ into two regions:

$$p(x \mid y) = \begin{cases} \ell(x), & y < y^{*}, \\ g(x), & y \geq y^{*}. \end{cases}$$

In the equation, according to the setting of the hyper-parameter $\gamma$, we define $y^{*}$ as the quantile of $y$ at level $\gamma$, i.e., $p(y < y^{*}) = \gamma$. Consequently, we can derive

$$p(x) = \int p(x \mid y)\, p(y)\, \mathrm{d}y = \gamma\, \ell(x) + (1 - \gamma)\, g(x).$$

By substituting the above formulas into the expected improvement (EI) expression and simplifying, we obtain the following result:

$$\mathrm{EI}_{y^{*}}(x) = \int_{-\infty}^{y^{*}} \left(y^{*} - y\right) p(y \mid x)\, \mathrm{d}y \;\propto\; \left(\gamma + \frac{g(x)}{\ell(x)}\,(1 - \gamma)\right)^{-1}.$$

From the above formula, the EI is inversely proportional to the denominator $\gamma + (1-\gamma)\, g(x)/\ell(x)$. Therefore, once the hyper-parameter $\gamma$ is fixed, the value of $x$ that minimizes the ratio $g(x)/\ell(x)$, which corresponds to maximizing $\ell(x)$ relative to $g(x)$, maximizes the EI and is the sought optimal solution.
The search space is as follows: The number of trees lies in the interval ; maximum tree depth lies in the interval ; the learning rate lies in the interval ; the row subsample ratio lies in the interval ; the column subsample ratio lies in the interval ; the minimum sum of the instance weight in the leaf lies in the interval ; the minimum loss reduction to make a split lies in the interval ; the L1 regularization term lies in the interval ; and the L2 regularization term lies in the interval .
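A sketch of how such a TPE search could be wired up with the hyperopt library is given below; the numeric bounds are illustrative assumptions standing in for the intervals listed above, and the objective minimizes the cross-validated RMSE on the training split defined earlier.

```python
import numpy as np
import xgboost as xgb
from hyperopt import fmin, tpe, hp, Trials
from sklearn.model_selection import cross_val_score

# Illustrative search space; the bounds are placeholders, not the exact
# intervals used in this paper.
space = {
    "n_estimators":     hp.quniform("n_estimators", 100, 500, 10),
    "max_depth":        hp.quniform("max_depth", 3, 10, 1),
    "learning_rate":    hp.loguniform("learning_rate", np.log(0.01), np.log(0.3)),
    "subsample":        hp.uniform("subsample", 0.6, 1.0),
    "colsample_bytree": hp.uniform("colsample_bytree", 0.6, 1.0),
    "min_child_weight": hp.quniform("min_child_weight", 1, 10, 1),
    "gamma":            hp.uniform("gamma", 0.0, 1.0),
    "reg_alpha":        hp.uniform("reg_alpha", 0.0, 1.0),
    "reg_lambda":       hp.uniform("reg_lambda", 0.0, 2.0),
}

def objective(params):
    # Cast the integer-valued hyper-parameters sampled by quniform.
    for key in ("n_estimators", "max_depth", "min_child_weight"):
        params[key] = int(params[key])
    model = xgb.XGBRegressor(objective="reg:squarederror", **params)
    # 5-fold cross-validated RMSE; hyperopt minimizes the returned value.
    return -cross_val_score(model, X_train, y_train, cv=5,
                            scoring="neg_root_mean_squared_error").mean()

trials = Trials()
best = fmin(fn=objective, space=space, algo=tpe.suggest,
            max_evals=100, trials=trials)
```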
Regarding convergence, the TPE reformulates hyper-parameter optimization as a sequential model-based optimization problem. It builds a model sequentially from historical trial results to estimate hyper-parameter performance and uses this model to choose which hyper-parameters to evaluate next. Meanwhile, the TPE employs an adaptive probabilistic model to generate two probability density functions: one based on hyper-parameter trials with good performance and the other based on those with poorer performance. The search converges by continuously sampling new hyper-parameter combinations from the “good” density function, because these combinations are more likely to improve the model’s performance. The TPE uses the expected improvement as the criterion to select the next hyper-parameter combination; it always prefers the point with the greatest expected improvement over the currently known best result, thus balancing exploration and exploitation.
6. Experimental Evaluation
6.1. Dataset
To verify the effectiveness of the proposed copper concentrate grade prediction method in practical production scenarios, in this paper, we use an actual production dataset from a mining company for experiments.
The dataset consists of flotation process data from the mining company’s actual production, which is used for model training and validation in our proposed copper flotation concentrate grade prediction method. The dataset contains 365 records of actual production batches. To build and evaluate the prediction model, the dataset is divided into a training set and a testing set. The training set consists of 292 samples (approximately 80% of the total dataset), used to train the model, and the remaining 73 production batch records are used as the testing set to evaluate the performance of our proposed models. The input variables include concentration, flow rate, airflow rate, liquid level, pH value, ore grade, and original ore grade, which are represented as $x_1$ to $x_7$. The output variable is the copper concentrate grade, denoted as $y$. The input variables are sampled once per hour, while the output variable is sampled once per day. Due to the different time scales of data collection, the input variable data collected within one day are aggregated and averaged so that the input features and the target variable share the same time scale. In the preprocessing step, to prevent the effects of different data scales and accelerate convergence, we apply Z-score normalization to the input data before the model training. To ensure robustness against overfitting, we conduct the experimental evaluations using k-fold cross-validation, where k is set to 5.
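For clarity, the hourly-to-daily alignment described above can be expressed as follows, assuming the raw records are loaded into pandas DataFrames; the file and column names are hypothetical.

```python
import pandas as pd

# Hourly input indicators and daily concentrate grade assays
# (file and column names are hypothetical).
hourly = pd.read_csv("flotation_inputs.csv",
                     parse_dates=["timestamp"], index_col="timestamp")
daily_grade = pd.read_csv("concentrate_grade.csv",
                          parse_dates=["date"], index_col="date")

# Average the hourly readings of each indicator over each day, then align
# the daily features with the daily grade label on the date index.
daily_inputs = hourly.resample("D").mean()
dataset = daily_inputs.join(daily_grade, how="inner")
```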
6.2. Experimental Environment
The experiment is conducted on a Windows operating system with an Intel Core i5-12600KF processor. This processor has 10 cores and 16 threads, with a base clock speed of 3.7 GHz. The system is equipped with 32 GB of DDR4 RAM and a 512 GB SSD. The experiments are implemented in Python 3.12.4.
6.3. Evaluation Metrics
To evaluate the effectiveness of the copper concentrate grade prediction method, we select the Root Mean Square Error (RMSE) and the Mean Absolute Percentage Error (MAPE) as the performance evaluation metrics. These two metrics are commonly used to assess the degree of deviation between the actual values and the predicted values. The metrics are computed as follows:
$$\mathrm{RMSE} = \sqrt{\frac{1}{m} \sum_{i=1}^{m}\left(y_i - \hat{y}_i\right)^{2}}$$

and

$$\mathrm{MAPE} = \frac{1}{m} \sum_{i=1}^{m}\left|\frac{y_i - \hat{y}_i}{y_i}\right| \times 100\%,$$

where $y_i$ represents the actual values of the concentrate grade, $\hat{y}_i$ represents the predicted values, and $m$ is the number of samples. The MAPE and RMSE represent the deviation between the predicted and actual values, and the closer their values are to 0, the better the prediction performance.
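The two metrics can be computed directly, for example with a short NumPy implementation such as the following:

```python
import numpy as np

def rmse(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def mape(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100.0
```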
6.4. Evaluation of IQR-MF-Based Data Processing Method
We conduct two sensitivity analyses to validate the IQR and MissForest methods. On the one hand, we vary the multiplier $m$ in the box-plot (IQR) rule, which flags values outside $[Q_1 - m \cdot \mathrm{IQR},\ Q_3 + m \cdot \mathrm{IQR}]$ as outliers. For each value of $m$, outliers are detected and then set to null (missing). We then impute the missing values using the MissForest algorithm. After the data processing, we apply the TPE-XGBoost algorithm and evaluate the prediction performance via the RMSE and MAPE, thereby assessing how different parameter settings affect the prediction results. On the other hand, to validate the effectiveness of the integration of the IQR and MissForest, we conduct the data processing under the following four scenarios: A. Using the IQR and MissForest for both outlier detection and imputation; B. Using only the IQR for outlier detection and deleting the outliers directly; C. Using only MissForest for imputation; D. Using neither outlier detection nor imputation. After the data processing, in all cases, we use TPE-XGBoost to test the prediction performance via the RMSE and MAPE to assess how the IQR and MissForest affect the prediction results. The experimental results are shown in
Table 2 and
Table 3.
As the value of $m$ increases, both the number and proportion of points identified as outliers decline. For a small value of $m$, both the RMSE and MAPE reach their minimum; however, under this setting, the outlier proportion is very high, which suggests that too many valid samples might have been excluded. Although the prediction performance is still acceptable, this setting carries a risk of overfitting and may harm the model’s generalization ability. When $m$ is set between 1.5 and 2.0, the fluctuations in the RMSE and MAPE are relatively small, indicating that the model’s prediction performance changes within a moderate range and is fairly stable. To ensure reasonable outlier detection and maintain performance stability, we set $m = 1.5$ according to convention. Under this setting, the outlier proportion is controlled at a reasonable level while achieving a stable and excellent prediction performance.
Meanwhile, the results under different data processing scenarios show that Scheme A achieves the best RMSE and MAPE, significantly outperforming the other schemes and demonstrating the superiority of combining the IQR and MissForest for outlier detection and missing value imputation. Comparing Scheme A and Scheme B, we can see that Scheme B, which directly deletes the outlier samples, results in a large reduction in the training sample size, and its RMSE increases compared to that under Scheme A. This indicates that using MissForest to impute outliers rather than simply deleting them makes better use of the data and yields a more accurate model. The performances of Scheme C and Scheme D are both significantly worse than that of Scheme A. Scheme C fails to improve the performance effectively, which shows that although MissForest can handle missing values, its robustness to outliers is insufficient to replace outlier detection. IQR-based outlier identification is therefore an indispensable step for improving model accuracy.
6.5. Evaluation of TPE-XGBoost-Based Prediction Method
To verify the effectiveness of the proposed method, in this section, we select random forest (RF), Ordinary Least Squares (OLS), a Back-Propagation Neural Network (BPNN), a support vector machine (SVM), and a convolutional neural network (CNN) as benchmarks and compare the proposed prediction method with them. Our method is abbreviated as TPE-XGBoost. The comparison results are shown in
Table 4. In the table, the mean values, standard deviations (SDs), 95% CIs, and
p-values are presented.
As shown in
Table 4, the prediction method proposed in this paper achieves the best results for both the RMSE and MAPE among all of the comparison benchmarks, indicating that the proposed method has the highest prediction accuracy and the smallest prediction error. Compared with the benchmarks, the RMSE is reduced by 1.53–35.21%, and the MAPE is reduced by 2.60–40.75%, which fully validates the superiority of the improved XGBoost algorithm proposed in this paper. The CNN requires a large amount of training data to ensure its effectiveness. Due to the low data collection frequency in the flotation process, there is limited data available for training the model. Therefore, the CNN shows the worst prediction performance, with a MAPE exceeding 10%, indicating a large prediction error. This suggests that the CNN is not well suited to copper concentrate grade prediction under such limited-data conditions.
We also conduct statistical analyses on the experimental results. The p-values of RF, OLS, the BPNN, the SVM, and the CNN are 0.328, 0.034, 0.016, 0.007, and 0.044, respectively. The p-values of OLS, the BPNN, the SVM, and the CNN are all less than 0.05, indicating that the improvements over these benchmarks are statistically significant. Although the p-value of RF is greater than 0.05, repeated trials show that our proposed method outperforms RF across all metrics, which still indicates an advantage of our approach.
To more intuitively demonstrate the accuracy and stability of the prediction methods, the grade values are plotted on the vertical axis and the sample index on the horizontal axis. Curves in different colors represent different prediction methods. The relative prediction error curves are shown in
Figure 2.
Our method uses data to train a model that captures the relationships between the variables in the flotation process, without explicitly modeling the physical process. Nevertheless, the prediction results show that the trained model performs well, indicating that it accurately captures the input-output relationships governed by the underlying physical process.
6.6. Ablation Experiments
In the proposed method, outlier detection based on the box-plot method is first performed on the samples, and the detected outliers are imputed using the MF algorithm. After the data processing, the XGBoost model is used to predict the copper concentrate grade, and the hyper-parameters of the XGBoost model are tuned using the TPE algorithm. To verify the effectiveness of the outlier imputation method and the hyper-parameter tuning method, ablation experiments are conducted by varying whether the MF algorithm is used to handle missing values and whether the TPE algorithm is used for hyper-parameter tuning. The experimental schemes are as follows:
Impute the outliers using the mean, and then use the XGBoost model to predict the copper concentrate grade. This is denoted as XGBoost.
Perform outlier imputation based on the MissForest method, and then use the XGBoost model to predict the copper concentrate grade, without using the TPE algorithm for hyper-parameter tuning. This is denoted as MF-XGBoost.
Impute the outliers using the mean, and use the TPE algorithm to tune the hyper-parameters of the XGBoost model before predicting the copper concentrate grade. This is denoted as TPE-XGBoost.
Perform the experiment using our proposed prediction method completely to validate the effectiveness. This is denoted as MF-TPE-XGBoost.
From the experimental results, it can be concluded that the RMSE and MAPE values obtained from the other three experimental schemes are all higher than those obtained using the method proposed in this paper. This demonstrates that outlier imputation based on the MF algorithm and the hyper-parameter tuning method using the TPE algorithm are crucial for improving the prediction accuracy. Using the MF algorithm to handle missing values can improve data quality, while using the TPE algorithm can obtain better hyper-parameters efficiently for the XGBoost model, thereby enhancing prediction accuracy.
6.7. Analysis of Feature Importance
In this paper, we apply the SHAP (SHapley Additive exPlanations) method to quantitatively analyze the key process variables affecting concentrate grade during flotation. By computing the SHAP values of seven feature variables, including feed grade, ore throughput, concentration, flow rate, air flow, liquid level, and pH value, we uncover the contribution magnitude and interaction mechanism of each process parameter in the prediction of the concentrate grade. The experimental results are shown in
Figure 4.
According to the experimental results, the SHAP importance of reagent concentration, liquid level, air flow rate, pulp flow rate, feed grade, pH value, and ore throughput is 0.2784, 0.2394, 0.2326, 0.2098, 0.1852, 0.1496, and 0.1165, accounting for 19.7%, 17.0%, 16.5%, 14.9%, 13.1%, 10.6%, and 8.3% of the total importance, respectively. Reagent concentration, liquid level, and air flow constitute the core combination of features affecting concentrate grade. The sum of their importances reaches 53.2%, which exceeds half of the total importance. This aligns closely with flotation theory. Reagent concentration, as a core chemical control parameter, directly influences changes in mineral surface hydrophobicity. Liquid level governs the flotation residence time and slurry retention, thereby affecting separation efficiency. And air flow determines the rate and size distribution of bubble generation, influencing the flotation probability for mineral particles.
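The per-feature importance reported above is commonly taken as the mean absolute SHAP value over the evaluated samples; a minimal sketch of how it can be obtained from the trained model with the shap package is shown below, where the variable and feature names are assumptions for illustration.

```python
import numpy as np
import shap

# TreeExplainer operates directly on the trained XGBoost model.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)      # shape: (n_samples, n_features)

# Mean absolute SHAP value per feature, i.e., a global importance ranking.
importance = np.abs(shap_values).mean(axis=0)
feature_names = ["reagent_concentration", "liquid_level", "air_flow",
                 "pulp_flow", "feed_grade", "pH", "ore_throughput"]
ranking = sorted(zip(feature_names, importance), key=lambda item: -item[1])
```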
The heatmap of the feature interaction strengths reveals coupling relationships among the features. Reagent concentration and liquid level show the strongest coupling and exhibit significant synergistic effects. Air flow rate and pulp flow rate have a certain coupling effect. And there is an adaptation relationship between feed grade and reagent concentration, linking the raw material characteristics to the process parameter settings.