Breast Cancer Prediction Based on Differential Privacy and Logistic Regression Optimization Model

Abstract: In order to improve the classification effect of the logistic regression (LR) model for breast cancer prediction, a new hybrid feature selection method is proposed to process the data: the Pearson correlation test and an iterative random forest algorithm based on out-of-bag estimation (RF-OOB) are used to screen the optimal 17 features as inputs to the model. The LR model is then optimized with the batch gradient descent (BGD) algorithm, which trains the loss function of the model to minimize the loss (BGD-LR). In order to protect the privacy of breast cancer patients, differential privacy protection technology is added to the BGD-LR model, and an LR optimization model based on differential privacy with batch gradient descent (BDP-LR) is constructed. Finally, experiments are carried out on the Wisconsin Diagnostic Breast Cancer (WDBC) dataset. Accuracy, precision, recall, and F1-score are selected as the four main evaluation indicators, and the hyperparameters of each model are determined by the grid search and cross-validation methods. The experimental results show that, after hybrid feature selection, the optimal results of the four main evaluation indicators of the BGD-LR model are 0.9912, 1, 0.9886, and 0.9943, in which the accuracy, recall, and F1-score are increased by 2.63%, 3.41%, and 1.76%, respectively. For the BDP-LR model, when the privacy budget ε is taken as 0.8, the classification performance and privacy protection effect of the model reach an effective balance; the four main evaluation indicators of the model are 0.9721, 0.9975, 0.9664, and 0.9816, which are improved by 1.58%, 0.26%, 1.81%, and 1.07%, respectively. Comparative analysis shows that the BGD-LR and BDP-LR models constructed in this paper perform better than other classification models.


Introduction
Cancer is the leading cause of human mortality worldwide, and its treatment consumes a great deal of medical resources and increases the burden on society, so that cancer has become a common social issue all over the world [1]. Breast cancer is one of the malignancies with the highest morbidity and mortality in women [2]. The symptoms of early-stage breast cancer are not obvious, while advanced cancer cells metastasize rapidly, leading to systemic multi-organ lesions that directly threaten the lives of patients; early diagnosis is therefore the key to improving the survival rate of breast cancer patients [3].
At present, there are three common methods for the early diagnosis of breast cancer: clinical evaluation [4], imaging evaluation [5], and tissue biopsy. Applying machine learning methods to these detection data for analysis and data mining can assist doctors in reducing the misdiagnoses and missed diagnoses caused by subjective factors and improve the detection rate of breast cancer [6]. Machine learning is at the core of artificial intelligence and data science, and as machine learning methods continue to be optimized, cancer prediction accuracy continues to improve [7].
Machine learning methods have contributed to the early diagnosis of breast cancer. Improving the classification model and processing the data with feature selection methods can improve breast cancer classification, but the classification results in existing research are still not optimal. In addition, there is a risk of leaking private information contained in the training data to attackers through the structure of the model [31]. In recent years, leaks of patients' private information have occurred frequently. With the advancement of cloud technology and big data, it is easier for attackers to collect patients' private information and to infer sensitive information through correlation and other means [32]. Therefore, while combining machine learning models with cancer diagnosis, it is also necessary to pay attention to the privacy protection of the data.
Commonly used privacy protection techniques are anonymity-based, encryption-based, and noise-based privacy protection [33]. Noise-based differential privacy technology was proposed by Dwork [34] of Harvard University in 2006: by adding a series of "noise" to the original data, it becomes difficult for attackers to accurately infer an individual user's private data, so that the efficiency of data sharing and use is improved under the premise of protecting data security. The application of privacy protection technology in data mining has become a research hotspot in the field of artificial intelligence [35][36][37], and how to combine it with machine learning at a smaller cost in accuracy is still an urgent problem to be solved [38].
In summary, there are two main problems in the study of breast cancer prediction: (1) how to improve the prediction effect through feature selection and model improvement; (2) how to improve the classification effect while giving the model a privacy protection function. To solve these two problems, this paper proposes a new hybrid feature selection method to process the data and combines differential privacy technology with a logistic regression algorithm to construct a breast cancer classification model with higher classification performance and data privacy protection. The main process is shown in Figure 1, and the main contributions are as follows: (1) Improving the effect of breast cancer prediction. First, a new hybrid feature selection method is proposed to eliminate weakly correlated and redundant features, which is divided into two steps: in the first step, the features whose absolute value of the Pearson correlation coefficient is greater than or equal to 0.3 are screened out; in the second step, the optimal combination of features is found by the iterative RF-OOB algorithm. Then, the BGD algorithm is used to optimize the LR model, and the loss function of the model is trained to minimize the loss and improve the classification effect. In order to verify the effectiveness of hybrid feature selection, a control group experiment is set up to compare the results. (2) Adding differential privacy protection to the process of breast cancer prediction. In the BGD algorithm, Gaussian noise is added at each gradient descent step, which gives the model accurate classification performance while protecting data privacy. Finally, the optimal results of the model in this paper are compared with the results in other papers.


Methods and Materials
This section introduces the basic theoretical concepts of differential privacy protection techniques, feature selection methods, the LR algorithm, and the batch gradient descent algorithm. In particular, the feature selection methods comprise the Pearson correlation test and the random forest algorithm based on out-of-bag estimation.

Definition 1 (differential privacy).
If there is a mechanism F, let S be any set of its outputs and Pr[·] be the probability of an output. If, for all adjacent datasets A and A′,

Pr[F(A) ∈ S] ≤ e^ε · Pr[F(A′) ∈ S], (1)

then mechanism F is said to satisfy differential privacy [39], where ε is the privacy budget. When ε is smaller, F must give very similar outputs on adjacent datasets and therefore provides higher privacy. Conversely, a larger ε allows F to give less similar outputs, providing less privacy. A mechanism satisfying Equation (1) is said to strictly satisfy the ε-differential privacy definition.

Definition 2 (approximate differential privacy).
In the course of experiments, too-strict protection seriously affects the availability of the data. In order to solve this problem, Dwork et al. [40] gave the concept of approximate differential privacy: for all adjacent datasets A and A′,

Pr[F(A) ∈ S] ≤ e^ε · Pr[F(A′) ∈ S] + δ, (2)

where the privacy parameter δ is a small constant indicating the "probability of failure" with which this definition may not hold; we set δ to 0.00001. Like strict differential privacy, approximate differential privacy, known as (ε, δ)-differential privacy, also satisfies sequential composition and parallel composition. Satisfying approximate differential privacy means that if one element of the database is changed, the probability of any output should be close to that on the original data, thus protecting the original data from leakage [41].

Definition 3 (global sensitivity).
The sensitivity of a function reflects the degree to which its output changes when its input changes. For a query function f : D → R^k and a norm function ‖·‖, the global sensitivity is

Δf = max over adjacent datasets A, A′ of ‖f(A) − f(A′)‖. (3)

The norm function is usually the L1 or L2 norm [42]. For a vector ν of length k, the L1 norm is defined as ‖ν‖₁ = Σ_{i=1}^{k} |ν_i|, i.e., the sum of the absolute values of the elements, and the L2 norm is defined as ‖ν‖₂ = (Σ_{i=1}^{k} ν_i²)^{1/2}. The L2 norm is always less than or equal to the L1 norm.

Definition 4 (Gaussian noise mechanism).
The Gaussian mechanism cannot satisfy strict ε-differential privacy, but it can satisfy (ε, δ)-differential privacy. For the function f : D → R^k, the Gaussian mechanism defined below is applied to obtain an F(A) satisfying approximate differential privacy:

F(A) = f(A) + N(0, σ²), (4)

where N(0, σ²) represents Gaussian (normal) noise with mean 0 and variance σ², σ = Δ₂f · √(2 ln(1.25/δ)) / ε, and Δ₂f is the L2 global sensitivity of f.
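As an illustration, the Gaussian mechanism of Definition 4 can be sketched in a few lines of Python. The calibration σ = Δ₂f·√(2 ln(1.25/δ))/ε is the standard analytic bound for (ε, δ)-differential privacy; the function name and the example query are our own, not the paper's:

```python
import numpy as np

def gaussian_mechanism(value, l2_sensitivity, epsilon, delta, rng=None):
    """Add Gaussian noise calibrated to (epsilon, delta)-differential privacy.

    Uses sigma = l2_sensitivity * sqrt(2 * ln(1.25 / delta)) / epsilon,
    the standard calibration for the Gaussian mechanism.
    """
    rng = np.random.default_rng() if rng is None else rng
    sigma = l2_sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon
    return value + rng.normal(0.0, sigma, size=np.shape(value)), sigma

# Example: privatize a count query with L2 sensitivity 1,
# using the paper's settings epsilon = 0.8 and delta = 1e-5.
noisy, sigma = gaussian_mechanism(np.array([42.0]), l2_sensitivity=1.0,
                                  epsilon=0.8, delta=1e-5,
                                  rng=np.random.default_rng(0))
```

With these settings σ ≈ 6.06, i.e., the noise is large relative to a single count, which is what makes an individual record hard to infer.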

Pearson Correlation Coefficient Test
The Pearson correlation coefficient [43] is used to test the correlation of each feature with the target variable. The Pearson correlation coefficient formula is

ρ_{XmXn} = Cov(X_m, X_n) / (√(DX_m) · √(DX_n)), (5)

where ρ_{XmXn} indicates the correlation coefficient between the two variables, Cov(X_m, X_n) = E[(X_m − EX_m)(X_n − EX_n)] indicates their covariance, EX_m indicates the expectation of a variable, and DX_m represents its variance. According to Equation (5), the correlation coefficient between each feature and the target variable is calculated. Based on the threshold set in this paper, features whose absolute correlation coefficients are greater than the threshold are retained, weeding out the weakly correlated variables. The filtered feature variables are used as candidate features for the secondary feature screening.
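The first screening step can be sketched as follows. The 0.3 threshold matches the paper; the toy data and function name are illustrative:

```python
import numpy as np

def pearson_filter(X, y, threshold=0.3):
    """Keep features whose |Pearson correlation| with the target >= threshold.

    X: (n_samples, n_features) array; y: (n_samples,) binary target.
    Returns the indices of the retained candidate features.
    """
    keep = []
    for j in range(X.shape[1]):
        # np.corrcoef returns the 2x2 correlation matrix; [0, 1] is rho(X_j, y)
        rho = np.corrcoef(X[:, j], y)[0, 1]
        if abs(rho) >= threshold:
            keep.append(j)
    return keep

# Toy example: feature 0 tracks the label, feature 1 is pure noise.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 200)
X = np.column_stack([y + 0.1 * rng.normal(size=200), rng.normal(size=200)])
selected = pearson_filter(X, y, threshold=0.3)
```

Only the informative feature survives the filter; the noise feature's correlation with the target hovers near zero and falls below the threshold.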

Random Forest Algorithm Based on Out-of-Bag Estimation (RF-OOB)
Because the candidate subset obtained by the correlation coefficient screening may still contain highly correlated features, and redundancy between such features will affect the classification results of the model, secondary feature screening of the candidate subset with the out-of-bag-estimation random forest is also required. About 36.8% of the sample data are not drawn by bootstrap sampling when building each tree of the RF model [44]; these form the out-of-bag data of that decision tree. Out-of-bag estimation uses these data to test the model: the ratio of misclassified data to the total number of out-of-bag data is the out-of-bag error, which is an unbiased estimate of the generalization error of the ensemble classifier. Because of the presence of out-of-bag samples, cross-validation testing is not required for random forest out-of-bag estimation.
As shown in Equation (6), the sum of the out-of-bag score (oob-score) and the out-of-bag error is 1:

oob-score = 1 − oob-error. (6)

For a single decision tree T_i trained on a bootstrap sample, evaluating on its out-of-bag data produces one oob-score, so for T decision trees there are T oob-scores. Finally, their mean is taken to obtain the oob-score of the whole random forest.
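The 36.8% figure can be checked numerically: the probability that a given row never appears in a bootstrap sample of size n is (1 − 1/n)^n ≈ e⁻¹. A small simulation (our own sketch, not the paper's code):

```python
import numpy as np

# Each bootstrap sample draws n indices with replacement; a given row is
# left out ("out-of-bag") with probability (1 - 1/n)^n -> 1/e ~ 36.8%.
rng = np.random.default_rng(42)
n, trees = 569, 2000          # 569 rows, as in WDBC; 2000 simulated trees
oob_fractions = []
for _ in range(trees):
    drawn = rng.integers(0, n, size=n)   # one bootstrap sample of row indices
    oob = n - np.unique(drawn).size      # rows never drawn for this tree
    oob_fractions.append(oob / n)
mean_oob_fraction = float(np.mean(oob_fractions))
```

The simulated mean lands very close to 1/e ≈ 0.368, matching the figure quoted in the text.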

Logistic Regression (LR)
LR is a classification algorithm based on the logarithmic probability function. Its core idea is to nest an S-shaped sigmoid function on top of linear regression, so as to convert the output of the linear regression into a value between 0 and 1. The sigmoid function is

g(z) = 1 / (1 + e^(−z)), (7)

where z = w^T · x, w is the weight vector to be learned, and x is the sample feature vector. g(z) represents the predicted probability of the event given the sample. The fitting function H_θ(x) for LR is

H_θ(x) = g(θ^T x) = 1 / (1 + e^(−θ^T x)). (8)

The loss function for LR is

J(θ) = −(1/m) Σ_{i=1}^{m} [ y^(i) log H_θ(x^(i)) + (1 − y^(i)) log(1 − H_θ(x^(i))) ]. (9)

In the LR model, the parameters are generally estimated by the maximum likelihood method [45]. The loss function measures the gap between the actual variable values and the predicted values: the smaller the loss, the more accurate the predictions. In general, if the loss values of the training set and the test set are both low and the difference between them is very small, the model can be considered to perform well on both sets with a good fit [46].
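The sigmoid and log-loss formulas above translate directly into code; a minimal sketch (variable names are ours):

```python
import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(theta, X, y):
    """Cross-entropy loss J(theta) of logistic regression."""
    h = sigmoid(X @ theta)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

# Sanity check: with theta = 0, every prediction is 0.5,
# so the loss is exactly -log(0.5) = ln 2 regardless of the labels.
X = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]])
y = np.array([1.0, 0.0, 1.0])
loss_at_zero = log_loss(np.zeros(2), X, y)
```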

Batch Gradient Descent (BGD)
Gradient descent is a commonly used optimization algorithm. Its core idea is to gradually adjust the parameters through iteration so that the loss function of the model reaches its minimum. The BGD algorithm is a variant of the gradient descent algorithm, with the update rule

θ := θ − η · (1/m) Σ_{i=1}^{m} ∇l(θ; x^(i), y^(i)), (10)

where ∇l(θ; x^(i), y^(i)) denotes the gradient of the loss l with respect to the parameter θ on sample (x^(i), y^(i)) and η is the learning rate. The BGD algorithm uses the entire training set in each iteration: it computes the gradient of the error function with respect to the parameter vector θ and steps in the direction of gradient descent, until the algorithm converges to a minimum.
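A minimal batch-gradient-descent loop for logistic regression, following the update rule above (learning rate, iteration count, and toy data are our own choices):

```python
import numpy as np

def bgd_step(theta, X, y, lr=0.1):
    """One batch-gradient-descent step for logistic regression.

    The gradient of the log loss is (1/m) * X^T (sigmoid(X theta) - y),
    computed over the full training batch.
    """
    m = X.shape[0]
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))
    grad = X.T @ (h - y) / m
    return theta - lr * grad

# A few hundred iterations on a linearly separable toy set
# (first column is the intercept) drive the loss down.
X = np.array([[1.0, 2.0], [1.0, 1.5], [1.0, -1.0], [1.0, -2.0]])
y = np.array([1.0, 1.0, 0.0, 0.0])
theta = np.zeros(2)
for _ in range(500):
    theta = bgd_step(theta, X, y, lr=0.5)
preds = (1.0 / (1.0 + np.exp(-(X @ theta))) >= 0.5).astype(float)
```

After training, the learned weight on the informative feature is positive and the model classifies all four toy samples correctly.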

Selection of Indicators for the Evaluation
Evaluation indicators are quantitative measures for evaluating the performance of a model; if the selected evaluation indicators are not reasonable, they will bias the analysis of the results. Therefore, different evaluation indicators should be selected for specific data and models.
In this paper, breast cancer prediction is a binary classification problem, so the classification results can be summarized in a confusion matrix, as shown in Table 1, where TP is the number of positive samples predicted as positive (true positives); TN is the number of negative samples predicted as negative (true negatives); FP is the number of negative samples predicted as positive, referred to as type I errors; and FN is the number of positive samples predicted as negative, referred to as type II errors.

Table 1. Confusion matrix.

                             Predicted Positive    Predicted Negative
Real situation   Positive    TP                    FN
                 Negative    FP                    TN
In order to measure the classification effect of the model, four perspectives are taken into account: the overall accuracy of the model, the accuracy of positive-class prediction, the coverage of the positive class, and the comprehensive performance. Accuracy, precision, recall, and F1-score are selected as the four main evaluation indicators. These indicators take values between 0 and 1; the closer a value is to 1, the better the classification effect of the model. The receiver operating characteristic (ROC) curve is also selected as an auxiliary indicator to compare the classification effect among the models. The specific meanings are as follows: (1) Accuracy: the proportion of all predictions (positive and negative classes) that are correct, expressed as accuracy = (TP + TN) / (TP + TN + FP + FN). For breast cancer prediction, a high accuracy indicates that the model correctly classifies both malignant and benign tumors. Accuracy is justified because it provides an assessment of overall classification correctness and helps determine the model's ability to discriminate between the two types of tumors.
(2) Precision: indicates how many of the samples predicted as positive are truly positive, expressed as precision = TP / (TP + FP), also known as PPV. (3) Recall: indicates how many of the positive samples are predicted correctly, expressed as recall = TP / (TP + FN), also known as TPR. (4) F1-score: expressed as F1 = 2 · PPV · TPR / (PPV + TPR). The F1-score is a comprehensive evaluation indicator. For breast cancer prediction, the F1-score is reasonable because it balances the model's ability to correctly classify malignant and benign tumors; it comprehensively evaluates the precision and recall of the model and is one of the most important evaluation indicators.
(5) The receiver operating characteristic (ROC) curve: in the ROC curve, the horizontal axis is the false positive rate (FPR) and the vertical axis is the true positive rate (TPR).
Points closer to (0, 1) correspond to better classification performance. The AUC is the area under the ROC curve, between 0 and 1; as a single number it can be used to evaluate the classifier, and the larger the value, the better. When AUC = 1, the classifier is perfect. When 0.5 < AUC < 1, it is better than random guessing. When AUC = 0.5, the model is like random guessing and has no predictive value. When AUC < 0.5, the model is worse than random guessing.
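The four main indicators can be computed directly from the confusion-matrix counts; the counts in the example below are made up for illustration:

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the four main indicators from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)       # PPV
    recall = tp / (tp + fn)          # TPR
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Illustrative counts: 87 true positives, 52 true negatives,
# 2 false positives, 1 false negative.
acc, prec, rec, f1 = classification_metrics(tp=87, tn=52, fp=2, fn=1)
```

Note that F1 simplifies algebraically to 2·TP / (2·TP + FP + FN), which is a handy cross-check.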

Data Preprocessing
Data preprocessing helps to improve the accuracy of the analysis results, and different datasets and tasks call for different preprocessing methods. In this paper, the WDBC dataset is first introduced in detail, followed by Z-score standardization of the data according to the characteristics of the dataset.

Introduction to Data
The WDBC dataset used in this paper was provided by Dr. William H. Wolberg of the University of Wisconsin [47]; the feature values were computed from digitized images of fine needle aspirates (FNA) of breast masses. The dataset contains 569 experimental samples. Ten characteristics of the cell nuclei are collected for each subject: radius, perimeter, smoothness, area, compactness, concavity, symmetry, texture, concave points, and fractal dimension. Of the samples, 357 are benign and 212 are malignant. The dataset has one sample label (benign or malignant) and 30 features: the first 10 features are the mean values of the nuclear feature values in the sample images, the 11th to 20th features are their standard deviations, and the 21st to 30th features are their maximum values. The classification label represents the type of breast cancer.

Data Standardization
Some of the feature data of the WDBC dataset are shown in Table 2, from which it can be seen that the features differ in magnitude; without standardization, direct experiments would not reflect the real behavior of the research object. In order to reduce the impact of the data dimensions on the model, the data need to be made dimensionless. Commonly used dimensionless processing methods are min-max normalization and Z-score standardization [48]. In this paper, based on the characteristics of the WDBC dataset, Z-score standardization is applied to the data.
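Z-score standardization subtracts each column's mean and divides by its standard deviation; a short sketch (the example values loosely mimic the very different scales of WDBC's "area" and "smoothness" columns):

```python
import numpy as np

def z_score(X):
    """Z-score standardization: (x - mean) / std, per feature column."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma

# Columns with very different magnitudes end up on a common scale
# with mean 0 and standard deviation 1.
X = np.array([[1001.0, 0.118],
              [1326.0, 0.085],
              [1203.0, 0.110],
              [386.1, 0.142]])
Xs = z_score(X)
```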

A Logistic Regression Optimization Model Based on Hybrid Feature Selection and Differential Privacy
In order to improve the effectiveness of the breast cancer classification model and protect patients' privacy, first, a hybrid feature selection method that can effectively eliminate redundant variables and select the optimal features is proposed. Second, the LR model is optimized using the BGD algorithm to minimize its loss function. Finally, a logistic regression optimization model based on hybrid feature selection and differential privacy is proposed by adding the Gaussian noise mechanism on this basis.

Hybrid Feature Selection
In order to improve the model's accuracy, this paper proposes a new hybrid feature selection method, which combines the Pearson correlation test and the RF-OOB algorithm to effectively eliminate irrelevant and redundant features. The method is divided into two parts, as shown in Figure 2: in the first part, the Pearson correlation coefficient is used to measure the correlation between each feature and the target variable, and the k features whose absolute correlation with the target variable is greater than or equal to 0.3 are screened out from the sample training set D(X_m^n, Y_m). In the second part, the out-of-bag-estimation random forest algorithm is used to calculate the feature importance of the remaining k features; feature combinations are then formed in order of feature score from high to low, and the combination with the highest score is iteratively selected to obtain k′ features, realizing redundant feature removal.
Using Equation (5), the Pearson correlation coefficients between each feature and the target variable (benign or malignant) are calculated, and the feature variables whose absolute correlation coefficients are greater than or equal to 0.3 are retained as candidate features. In order to avoid redundancy among the candidate subset, secondary feature screening with RF-OOB is carried out on the candidate subset; the specific steps are shown in Figure 3: (1) First, RF-OOB is applied to calculate the feature importance of each feature, and the features are ranked by importance. The feature with the highest importance is used as the initial feature combination, and RF-OOB is applied to calculate the model score. (2) The feature with the second-highest importance is added to form a new feature combination, which is input into the RF-OOB algorithm to calculate a new model score. (3) Features are added one by one in order of importance, and a new model classification score is computed for each combination. This is iterated until all features have been traversed. Finally, the feature combination with the highest model classification score is selected as the optimal feature set.
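The iterative screening just described amounts to a greedy forward sweep over importance-ranked features. The sketch below abstracts the RF-OOB score into a callable, which is an assumption for illustration rather than the paper's exact implementation:

```python
def forward_select(ranked_features, score_fn):
    """Greedy sweep over importance-ranked features.

    ranked_features: feature indices sorted by importance (highest first).
    score_fn: callable mapping a feature combination to a model score
              (a stand-in for the RF-OOB oob-score).
    Returns the prefix combination with the highest score.
    """
    best_combo, best_score = None, float("-inf")
    combo = []
    for f in ranked_features:
        combo.append(f)                      # add next-most-important feature
        s = score_fn(tuple(combo))
        if s > best_score:
            best_combo, best_score = list(combo), s
    return best_combo, best_score

# Toy score: peaks once the top three features are included, then declines.
scores_by_size = {1: 0.90, 2: 0.95, 3: 0.99, 4: 0.97, 5: 0.96}
combo, score = forward_select([7, 2, 9, 4, 0],
                              lambda c: scores_by_size[len(c)])
```

The sweep returns the top-3 prefix, mirroring how the paper keeps the combination with the highest classification score rather than all candidates.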

Logistic Regression Optimization Model Based on Batch Gradient Descent (BGD-LR)
The smaller the loss function, the better the prediction effect of the model. In order to solve the problem of the poor classification effect of the traditional logistic regression model on the WDBC dataset, this paper uses the BGD algorithm to optimize the LR model so that the loss function reaches its minimum. The specific steps of Algorithm 1 are as follows:

Algorithm 1 (BGD-LR).
Input: the training set obtained by hybrid feature selection; initialized θ.
Output: prediction results.
1. Take the partial derivative of the loss function J(θ) and compute the gradient using the full training set of samples.
2. Update the model parameters θ according to Equation (10).
3. Repeat steps 1 and 2 until the specified number of iterations is reached, and return θ.
4. Calculate the predicted classification results: compute the predicted values according to the updated θ and Equation (8), and output the classification results.

Figure 3. Specific steps of hybrid feature selection: in the first step, the Pearson correlation coefficient method is used; in the second step, the RF-OOB algorithm is used.

Logistic Regression Optimization Model for Batch Gradient Descent with Differential Privacy (BDP-LR)
In order to solve the problem that traditional LR cannot protect data privacy, this paper uses the BGD algorithm to optimize the loss function of the LR model. At the same time, Gaussian noise is added at each gradient descent step, which enables the model to maintain accurate classification performance while protecting data privacy.
Adding Gaussian noise to BGD-LR is the core idea of the BDP-LR algorithm. Since the loss function of the LR model is Lipschitz continuous and bounded [49], the global sensitivities of its gradient functions are all bounded.
Thus, for a BGD-LR model, if the gradient can be guaranteed to be bounded, noise can be added directly by taking the sensitivity as b, where the upper bound b on the L2 sensitivity is obtained by the gradient clipping technique. The clipped per-sample gradient in this paper is

ĝ^(i) = ∇l(θ; x^(i), y^(i)) / max(1, ‖∇l(θ; x^(i), y^(i))‖₂ / b), (11)

so that ‖ĝ^(i)‖₂ ≤ b. The BGD update after adding noise [50] is

θ := θ − η · (1/m) ( Σ_{i=1}^{m} ĝ^(i) + N(0, σ²b²I) ). (12)

The specific steps of Algorithm 2 are as follows:

Algorithm 2 (BDP-LR).
Input: the training set obtained by hybrid feature selection; initialized θ; privacy parameters ε and δ; clipping bound b.
Output: prediction results.
1. Compute the per-sample gradients of the loss function J(θ) over the full training set.
2. Clip each per-sample gradient to L2 norm b according to Equation (11).
3. Add Gaussian noise to the sum of the clipped gradients and update θ according to Equation (12).
4. Repeat steps 1 through 3 until the specified number of iterations is reached, and return θ.
5. Calculate the predicted classification results according to the updated θ and Equation (8), and output the classification results.

In this paper, six sets of experiments are designed: the first set of experiments standardizes the data by Z-score standardization; the second set of experiments performs hybrid feature selection on the data; the third set of experiments verifies the effectiveness of the BGD algorithm for LR model optimization; the fourth set of experiments verifies the effect of hybrid feature selection on the model performance; the fifth set of experiments compares the experimental results of the BGD-LR model with the experimental results of other papers without considering privacy protection; and the sixth set of experiments compares the performance of the BDP-LR model with other differential-privacy-based machine learning models while considering privacy preservation. Finally, the results are analyzed.
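A hedged sketch of one noisy update in the spirit of BDP-LR: per-sample gradients are clipped to an L2 bound and Gaussian noise is added before the batch step. The function name, clipping constant, and noise scale are our illustrative choices, not the paper's exact implementation:

```python
import numpy as np

def bdp_lr_step(theta, X, y, lr=0.1, clip=1.0, sigma=1.0, rng=None):
    """One DP-SGD-style batch update for logistic regression.

    Per-sample gradients are clipped to L2 norm `clip`, summed, and
    perturbed with Gaussian noise of scale sigma * clip before averaging.
    """
    rng = np.random.default_rng() if rng is None else rng
    m = X.shape[0]
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))
    per_sample = (h - y)[:, None] * X                     # per-sample gradients
    norms = np.linalg.norm(per_sample, axis=1)
    per_sample /= np.maximum(1.0, norms / clip)[:, None]  # clip to norm <= clip
    noise = rng.normal(0.0, sigma * clip, size=theta.shape)
    grad = (per_sample.sum(axis=0) + noise) / m           # noisy mean gradient
    return theta - lr * grad

rng = np.random.default_rng(1)
X = np.array([[1.0, 2.0], [1.0, -1.5], [1.0, 0.5], [1.0, -2.0]])
y = np.array([1.0, 0.0, 1.0, 0.0])
theta = bdp_lr_step(np.zeros(2), X, y, clip=1.0, sigma=1.0, rng=rng)
```

Clipping bounds each sample's influence on the update, which is what lets the Gaussian mechanism's sensitivity argument go through.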

Experimental Environment and Model Hyperparameters
The operating system used for the experiments is Windows 11, the environment is Python 3.9.7, the processor is an Intel(R) Core(TM) m3-6Y30 CPU @ 0.90 GHz (up to 1.51 GHz), and the RAM is 4.00 GB. The experiments are conducted on the WDBC dataset, and the data are divided into a training set and a test set in a ratio of 8:2. Both the grid search method and the cross-validation method are used to improve the accuracy of the model. The grid search method is a parameter-tuning method that finds the best combination of hyperparameters by trying all possible combinations. To avoid overfitting, the cross-validation method is used to assess the generalization ability of the model, which reduces the impact of differences between the training set and the test set. The commonly used cross-validation methods are K-fold cross-validation and leave-one-out cross-validation. In this paper, 5-fold cross-validation is used: the dataset is divided into five parts; each time, four parts are used as the training set and the remaining one as the validation set; this is repeated five times, and the average value is taken as the final result. The specific steps are as follows:
1. First, select a range of candidate values for each hyperparameter.
2. Then, evaluate the performance of the model under each combination by the cross-validation method.
3. Finally, select the combination with the best performance as the best hyperparameter combination.
The optimal parameter combinations of the models determined by the above method are shown in Table 3.
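The steps above can be sketched without any ML library; `fit_score` below stands in for training a model and scoring it on the held-out fold and is an assumption of this sketch:

```python
import itertools
import numpy as np

def grid_search_cv(X, y, param_grid, fit_score, k=5, seed=0):
    """Exhaustive grid search with k-fold cross-validation.

    param_grid: dict of hyperparameter name -> list of candidate values.
    fit_score: callable(params, X_tr, y_tr, X_va, y_va) -> validation score.
    Returns the best parameter combination and its mean CV score.
    """
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*param_grid.values()):
        params = dict(zip(param_grid.keys(), values))
        scores = []
        for i in range(k):
            va = folds[i]
            tr = np.concatenate([folds[j] for j in range(k) if j != i])
            scores.append(fit_score(params, X[tr], y[tr], X[va], y[va]))
        mean = float(np.mean(scores))
        if mean > best_score:
            best_params, best_score = params, mean
    return best_params, best_score

# Toy score that simply prefers lr = 0.1 regardless of the data split.
X, y = np.zeros((50, 2)), np.zeros(50)
best, cv_score = grid_search_cv(
    X, y, {"lr": [0.01, 0.1, 1.0], "iters": [100]},
    lambda p, *_: 1.0 - abs(p["lr"] - 0.1))
```

In practice the same loop structure is what library routines such as scikit-learn's `GridSearchCV` implement, with the model fit inside `fit_score`.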

Experimental Design
In this paper, six groups of experiments are designed. For the model with added differential privacy, the average of one hundred experimental runs is taken as the final result, owing to the randomness of the added noise. The six groups of experiments are as follows:

1. In order to reduce the influence of data magnitude on the model, the data are subjected to Z-score standardization.

2. In order to eliminate weakly correlated variables and redundant features from the breast cancer data, hybrid feature selection is performed on the data.

3. In order to test the optimization effect of the BGD algorithm on the LR model, the loss function graph of the BGD-LR model is analyzed.

4. In order to verify the impact of the hybrid feature selection algorithm on model performance, a control group experiment is set up and the results are analyzed using the four main evaluation indicators: accuracy, precision, recall, and F1-score.

5. The breast cancer classification model proposed in this paper is compared with existing research results without considering privacy protection.

6. The prediction results of the BDP-LR model are compared with other machine learning models based on differential privacy when privacy protection is considered.
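The Z-score standardization used in the first group of experiments transforms each feature to zero mean and unit variance; a minimal sketch follows (the sample values are illustrative, not the entries of Table 4):

```python
import numpy as np

def zscore(X):
    """Z-score standardization: subtract the per-feature mean and
    divide by the per-feature standard deviation."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma

# Illustrative values in the style of WDBC radius/texture measurements
X = np.array([[17.99, 10.38],
              [20.57, 17.77],
              [19.69, 21.25]])
Xs = zscore(X)
```

After the transform, every column has mean 0 and standard deviation 1, so no single feature dominates the gradient updates purely because of its scale.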

Results of Data Standardization
In order to reduce the influence of data magnitude on the model, Z-score standardization is applied to the WDBC dataset, and some of the results are shown in Table 4. First, the Pearson correlation coefficients between each feature and the target variable are calculated, and features whose correlation coefficients have an absolute value greater than or equal to 0.3 are retained. Then, the feature importance of the candidate subset is computed using the out-of-bag-estimation random forest algorithm, and the features are ranked by importance from highest to lowest. The final results are shown in Table 5, where ρ_{X1X2} denotes the Pearson correlation coefficient and "number" is the rank order.
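The two-step hybrid selection described above (a Pearson filter followed by random-forest importance ranking) can be sketched as follows. Note that scikit-learn's impurity-based `feature_importances_` is used here as a stand-in for the paper's out-of-bag importance estimates; only the 0.3 threshold mirrors the text.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()          # WDBC as shipped with scikit-learn
X, y = data.data, data.target
names = data.feature_names

# Step 1: Pearson filter -- keep features with |rho(feature, target)| >= 0.3
rho = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
candidates = np.where(np.abs(rho) >= 0.3)[0]

# Step 2: rank the candidate subset with a random forest trained with
# out-of-bag scoring enabled (importances here are impurity-based)
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X[:, candidates], y)
order = candidates[np.argsort(rf.feature_importances_)[::-1]]

print(len(candidates), round(rf.oob_score_, 4))
print([names[j] for j in order[:5]])     # top-ranked candidate features
```

An iterative version, as in the paper, would then evaluate growing prefixes of this ranking and keep the prefix with the best classification score.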
According to Table 5, there are 23 candidate features with an absolute Pearson correlation coefficient of at least 0.3. These 23 features are screened using the iterative RF-OOB algorithm, and the results of each feature combination are shown in Table 6. The model classification score of the optimal feature combination is 0.96837, and that combination contains 17 features. Therefore, these 17 features are used as the final input values of the breast cancer classification model. The loss function for training the LR model using the BGD algorithm is shown in Figure 4. As can be seen from Figure 4, the loss function values of the training and test sets decrease with each iteration of gradient descent and, as the number of iterations increases, gradually converge to their respective minimum values. The loss function values of the training and test sets are both low, and the difference between them lies between 0.023 and 0.0402, which is small. Therefore, from the point of view of the loss function, the difference between the predictions of the BGD-LR model and the true labels is relatively minor; in other words, the model fits well.

Impact of Hybrid Feature Selection Algorithms on Model Performance
In order to verify the effectiveness of hybrid feature selection, we set up a control group and an experimental group for comparison. The data in the control group are screened only by the Pearson correlation coefficient before prediction, while the data in the experimental group are screened by the hybrid feature selection method proposed in this paper before the prediction of breast cancer is carried out. The comparison results are shown in Table 7.

According to Table 7 and Figure 5, compared with the control group, the accuracy, recall, and F1-score of the experimental group of the BGD-LR model improve by 2.63%, 3.41%, and 1.76%, respectively. For the BDP-LR model, when the privacy budget ε added in each iteration is 0.2, the four main evaluation indicators of the model increase by 2.34%, 0.09%, 3.00%, and 1.75%, respectively. When the privacy budget ε is 0.4, the four main evaluation indicators improve by 3.12%, 0.3%, 3.78%, and 2.16%, respectively. When the privacy budget ε is 0.6, the indicators increase by 1.75%, 0.31%, 1.98%, and 1.19%, respectively. When the privacy budget ε is 0.8, the four main evaluation indicators improve by 1.58%, 0.26%, 1.81%, and 1.07%, respectively. And when the privacy budget ε is 1, the four main evaluation indicators increase by 2.11%, 0.27%, 2.49%, and 1.42%, respectively. Evidently, after the breast cancer data are processed by hybrid feature selection, the classification results of the experimental group are all better than those of the control group; therefore, hybrid feature selection can effectively improve the classification performance of the model.

After comparative analysis, the optimal accuracy, precision, recall, and F1-score of the BGD-LR model are 0.9912, 1, 0.9886, and 0.9943, respectively. When the added privacy budget ε is 1, the BDP-LR model has the best combined classification results, with accuracy, precision, recall, and F1-score of 0.9777, 0.9981, 0.9731, and 0.9853, respectively.

Comparative Analysis with Previous Studies
In order to further verify the effectiveness of the breast cancer classification model developed in this paper, its classification results are compared with those of other studies. First, the proposed model is compared with existing studies without considering privacy protection, and the results are shown in Table 8. As can be seen from Table 8, the breast cancer classification method proposed in this paper outperforms previous research results, with an accuracy of 0.9912. Therefore, the hybrid feature selection method and the BGD-LR model used in this paper provide the best classification results.

Comparative Analysis of BDP-LR Model Results with Other Models
The prediction effect of the BDP-LR model in this paper is compared with the DP-NB [35], DP-RF [36], DP-DT [36], and GDP-EBM [37] models under the consideration of privacy preservation.The variation of the four main evaluation indicators with ε for each model is shown in Figure 6.
The results show that, as the privacy budget ε increases from 0.001 to 2 on the WDBC data, the four main evaluation indicators of each model gradually increase and then fluctuate within a certain range. According to Formula (2), the smaller ε is, the better the privacy protection effect. Therefore, a critical value of ε needs to be selected to balance the model's classification performance against the privacy protection effect. Based on the trend of the four main evaluation indicators in Figure 6, 0.8 is chosen as the privacy budget of the BDP-LR model, at which the model achieves both good classification performance and a strong privacy protection effect.
At a privacy budget ε of 0.8, the experimental results of the BDP-LR model are compared with other machine learning models based on differential privacy.The performance of each model is evaluated using the ROC curve with AUC, as shown in Figure 7.In addition, the average values of the four main evaluation indicators obtained by running 100 experiments are shown in Table 9.
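To illustrate how the ROC/AUC comparison in Figure 7 and the indicators in Table 9 are computed, here is a sketch using a plain (non-private) logistic regression baseline on the WDBC data as shipped with scikit-learn; it is not the BDP-LR model itself, and the split seed is arbitrary.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Fit the scaler on the training set only, then train the classifier
scaler = StandardScaler().fit(X_train)
clf = LogisticRegression(max_iter=5000).fit(scaler.transform(X_train), y_train)

# AUC needs the predicted probability of the positive class,
# while the four main indicators use the thresholded labels
proba = clf.predict_proba(scaler.transform(X_test))[:, 1]
pred = (proba >= 0.5).astype(int)

print("AUC:", round(roc_auc_score(y_test, proba), 4))
print("accuracy:", round(accuracy_score(y_test, pred), 4),
      "precision:", round(precision_score(y_test, pred), 4),
      "recall:", round(recall_score(y_test, pred), 4),
      "F1:", round(f1_score(y_test, pred), 4))
```

For the DP models, the same metric calls would be applied to each noisy model's predictions and averaged over repeated runs, since every run draws fresh noise.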

As shown in Figure 7, the AUC value of the BDP-LR model is 0.9974. The AUC values of the GDP-EBM and DP-NB models are 0.9694 and 0.9663, respectively, while the AUC values of the DP-DT and DP-RF models are 0.7535 and 0.8684. The closer the AUC value is to 1, the higher the prediction accuracy. Therefore, based on the ROC curves, the BDP-LR model has the highest classification accuracy, followed by the GDP-EBM and DP-NB models.
The experimental results show that, when the privacy budget is 0.8, the four main evaluation indicators of the BDP-LR model are better than those of the other models, at 0.9721, 0.9975, 0.9664, and 0.9816, respectively. Therefore, the logistic regression optimization model based on hybrid feature selection and differential privacy proposed in this paper not only strongly protects patients' privacy but also delivers superior classification results.

Conclusions
Early diagnosis of breast cancer is significant. Applying machine learning to the prediction of breast cancer can assist doctors in reducing the rates of missed diagnosis and misdiagnosis. However, at this stage, there are still problems of low prediction accuracy and patient privacy leakage. In order to improve the accuracy of breast cancer diagnosis, this paper proposes a breast cancer classification method with higher classification performance, which first combines the Pearson correlation test and the RF-OOB algorithm to construct a new hybrid feature selection strategy and then optimizes the LR model using the BGD algorithm. In order to give the model the ability to protect patients' privacy, Gaussian noise is added to the BGD algorithm to build the BDP-LR model. Accuracy, precision, recall, and F1-score are selected as the four main evaluation indicators, and the hyperparameters of each model are determined using the grid search method and the cross-validation method. Experiments on the WDBC dataset show that the hybrid feature selection method proposed in this paper can improve the prediction performance of each model, and comparative analysis shows that the BGD-LR and BDP-LR models constructed in this paper outperform other classification models.

However, the hybrid feature selection method used in this paper has a long computation time, and this work is limited to combining differential privacy techniques with machine learning models. In the future, further research will be carried out on local differential privacy techniques, deep learning, and related methods, and these studies will be applied to the classification and prediction of breast cancer, contributing to the early diagnosis of breast cancer and the protection of patients' privacy.

Figure 1. Flowchart of breast cancer classification model with data privacy protection.

Definition 3 (global sensitivity). The sensitivity of a function reflects the degree to which its output changes when its input changes. For a query function f: D → R^k and a norm ‖·‖, the global sensitivity is Δf = max_{D,D′} ‖f(D) − f(D′)‖, where D and D′ are adjacent datasets differing in at most one record.

4. Calculate the predicted classification results: calculate the predicted values according to the updated θ and Formula (8) in Step 2, and output the classification results.

Figure 3. Specific steps of hybrid feature selection: in the first step, the Pearson correlation coefficient method is used; in the second step, the RF-OOB algorithm is used.

Figure 4. Loss function in the BGD-LR model.

Figure 6. Comparison of the results of the BDP-LR model with other machine learning models based on differential privacy: (a) accuracy; (b) precision; (c) recall; (d) F1-score.

Figure 7. ROC curve for each model when ε is 0.8: the closer the AUC value is to 1, the higher the prediction accuracy.

Table 2. Partial sample characteristic data.

Table 3. Optimal hyperparameters of the model.

Table 4. Results of data standardization.

Table 5. Features ranked in order of feature importance.

Table 7. Comparison of results before and after hybrid feature selection.

Table 8. Comparison of the results of the BGD-LR model with other breast cancer classification models.

Table 9. Classification effect of the BDP-LR model compared with other models when ε is 0.8.
