Performance Analysis of University Collaborative Innovation Center Based on BPNN-Dominated K-Means–Random Forest Unsupervised Factor Importance Analysis Model

The collaborative innovation plan for colleges and universities is one of the important plans for the construction of high-level universities in Jiangsu Province. A key aspect of this plan is the development of collaborative innovation centers in colleges and universities. Based on the second-phase construction of collaborative innovation centers in 76 colleges and universities in Jiangsu Province, this paper constructs performance evaluation indicators and proposes an unsupervised factor importance analysis model based on Back Propagation Neural Network (BPNN)-dominated K-means and random forests. According to the analysis results, suggestions for further promoting the development of high-quality collaborative innovation centers in colleges and universities are provided.


Introduction
Jiangsu Province began to implement a collaborative innovation plan for colleges and universities in 2012. Currently, based on annual selection and supplementation, there are five national "2011 Collaborative Innovation Centers" in colleges and universities in the province, ranking second in China. Moreover, the Ministry of Education has identified 10 provincial and ministerial collaborative innovation centers in the province, the highest total in the country. The universities in Jiangsu are home to 76 collaborative innovation centers. As part of the construction of high-level universities in Jiangsu, the Jiangsu university collaborative innovation plan has been listed as one of the four major projects and is continually being implemented in depth.
The purpose of this type of performance evaluation is to evaluate and promote construction, gain a comprehensive understanding of project development and management, and determine the output benefits obtained. In addition, it aims to establish a more standardized evaluation system for the performance of the collaborative innovation center project. It serves as a decision-making reference for improving fund management in the third phase of the collaborative innovation plan and for advancing the collaborative innovation projects. How scientific and reasonable the performance evaluation is determines the effectiveness and quality of the implementation of the entire university collaborative innovation plan.
Performance evaluation holds significant practical value in various industries. The selection of appropriate performance evaluation indicators and algorithms further contributes to enhancing industry standards [1][2][3]. During performance evaluation, the weight of each characteristic factor needs to be determined. This can be done using the analytic hierarchy process (AHP), which is a systematic and hierarchical analysis method that combines qualitative and quantitative analysis. It serves as a model and decision-making method for complex systems that are difficult to quantify completely [4]. Fashoto et al. employed AHP in a business employee performance evaluation system to determine employee performance based on individual goals required by the organization [5]. Anto et al. integrated the AHP algorithm with the TOPSIS algorithm to construct a performance evaluation system. The authors determined weights using the AHP algorithm and implemented the TOPSIS algorithm for ranking [6]. Rajabpour et al. used the fuzzy analytic hierarchy process to translate expert opinions into factors and assess the relationships between them [7]. However, the AHP algorithm requires some manual intervention because different weights may be set for varying situations.
Principal Component Analysis (PCA) is another widely used technique in performance evaluation, particularly in scenarios with a large number of features. The PCA algorithm maps a high-dimensional feature space to lower dimensions to achieve the purpose of simplification [8]. Lv et al. developed a performance evaluation system tailored to the circumstances of Higher Vocational College teachers, using a combination of the PCA and AHP algorithms for evaluation [9]. Wu et al. used the PCA algorithm and data envelopment analysis to determine the energy security performance of each country in their research on the trend of energy security performance [10]. However, although the PCA algorithm is widely used in data dimension reduction, the resulting principal components are harder to interpret than the original features.
Multiple criteria decision-making (MCDM) techniques can provide optimal solutions and help find alternatives suitable for specific complex situations. The analytic hierarchy process (AHP), a globally recognized methodology, is one of the successes of MCDM technology. In particular, Kumar et al. reviewed the application of AHP to various agriculture-related problems [11]. Rawat et al. gave an overview of the emergence of AHP applications, which were widely used between 2011 and 2022. The disciplines covered include renewable energy, sustainable manufacturing, natural disasters, environmental pollution, landfill waste management, and many other issues that fall explicitly or implicitly under the theme of sustainable development [12].
The entropy method, rooted in information theory, is an unsupervised weight determination method that is widely used in performance evaluation [13]. Archer et al. used entropy analysis and the TOPSIS method for employee performance evaluation, which improved the interpretability of the evaluation and helped to identify the strengths and weaknesses of each employee, thus furthering the development of the employees [14]. Chen et al. used an evaluation model that combined entropy weight with improved D-S evidence theory to evaluate the performance of large infrastructure projects in terms of their operation and maintenance [15]. Feng et al. applied the entropy method to the fuzzy comprehensive evaluation of the investment performance of electric power companies, which provided a theoretical basis for government performance evaluation [16].
The regression model based on machine learning algorithms has an absolute effectiveness frontier, surpassing traditional packet analysis methods in terms of analysis efficacy. Zhong et al. integrated the neural network model with the performance evaluation method to reduce the impact of statistical noise in the data and address the inherent issues with the data packets [17]. Deng et al. improved the importance-performance analysis (IPA) algorithm by integrating the back-propagation neural network (BPNN) and the three-factor theory for effective performance analysis [18].
Therefore, this paper presents a BPNN-dominated K-means-random forest unsupervised factor importance analysis model for the performance analysis of university collaborative innovation centers. The study covers the 76 collaborative innovation centers in Jiangsu Province, constructs evaluation indicators based on the performance statistics of the second phase, and assigns an importance index to each characteristic factor based on the model. Based on this importance index, this paper provides suggestions on how to promote the high-quality connotative development of the collaborative innovation centers and further improve construction efficiency, acting as a reference for the future management and construction of the collaborative innovation centers.

Building an Evaluation Index System for the Construction of Universities' Collaborative Innovation Center
Based on the second-phase performance statistics of the Jiangsu university collaborative innovation centers, we give the main factors that affect the construction of these institutions: operation guarantee ability, capital input and output, scientific research and innovation ability, social services and contributions, personnel training and team building, and international cooperation and exchange. This set of evaluation indicators is rich, cutting-edge, and instructive. It is rich in that it covers all the indicators required for the effective operation of a collaborative innovation center in colleges and universities, so the evaluation results of these indicators can objectively reflect the operation of an efficient collaborative innovation center. It is cutting-edge in that the indicators place certain requirements on the international cooperation and communication ability of the university collaborative innovation center, which requires an international perspective at an advanced level. It is instructive in that the indicators play an important role in directing the future development focus of each university's collaborative innovation center. For the purpose of this study, under each first-level index, we refine the second- and third-level indicators, for a total of 61 third-level indicators, as shown in Table 1.

Unsupervised Factor Importance Analysis Model for the Construction of Universities' Collaborative Innovation Centers
The evaluation index system for the construction of collaborative innovation centers in colleges and universities in Jiangsu Province is composed of six first-level indicators, 19 second-level indicators, and 61 third-level indicators. The sample data covers 76 collaborative innovation centers in colleges and universities in Jiangsu Province. To analyze the importance of the different indicators in the evaluation index system, we use a random forest model to classify the collaborative innovation centers and evaluate each of the third-level indicators. However, the random forest model is a supervised method, and the sample data carries no prior supervision information (class labels). Therefore, we first use the K-means unsupervised clustering algorithm to construct the prior clustering information needed for random forests, after which we use the random forest model to analyze the importance of each third-level indicator. The structure of our proposed method is shown in Figure 1.

Data Preprocessing Based on BPNN

Principles of Neural Network and BPNN
The neural network is a computing model that is composed of a large number of interconnected neurons. In addition to the input neurons, each neuron represents a distinct output function, which is also known as the excitation function. The connection between each pair of nodes represents a weighted value of the signal passing through the connection, which is called the weight and is equivalent to the memory of the artificial neural network. The output of the network varies with the connection mode, weight value, and excitation function of the network. In general, neural networks can be classified into the forward type, feedback type, random type, and competitive type.
BPNN is a kind of feedforward neural network. It includes a back-propagation algorithm in addition to the standard structure of the feedforward neural network, which is used to adjust the weights and thresholds of the network during training.


Optimize Data Validity Using the BPNN Training Model
The existence of noise in the training sample data may cause unpredictable deviations in the importance assessment. The training model based on BPNN can check and fill the data and simultaneously classify and rearrange the effective data against the evaluation index system to improve the validity of the data. The output data is then normalized to minimize the error between data and improve the accuracy of the results.

K-Means and K-Means++ Clustering Algorithms

K-Means Algorithm
The K-means algorithm is an unsupervised clustering algorithm [19]. It divides the data into k clusters, where the centroid (the center of each cluster) is the mean of the data in that cluster. Let the dataset to be clustered be $X = \{x_1, x_2, \ldots, x_n\}$ and the ith cluster be $D_i$; the steps of the K-means algorithm are then as follows:
1. Randomly select k centroids $u_1, u_2, \ldots, u_k$ in $X$.
2. For each $x_i$, calculate the distance $\|x_i - u_j\|^2$ to each centroid $u_j$, select the centroid $u_p$ with the minimum distance from $x_i$, and assign $x_i$ to the cluster $D_p$.
3. Calculate the mean within each cluster and update the centroids.
4. Repeat steps 2 and 3 until the assignments no longer change, to obtain the clusters $\{D_i\}_{i=1}^{k}$.
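The four K-means steps can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the function names (`k_means`, `squared_distance`, `mean_point`), the stopping rule (stop when the assignments no longer change), the handling of empty clusters, and the toy data are all hypothetical choices made for the example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(data, k, max_iter=100, seed=0):
    """Plain K-means; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Step 1: pick k distinct samples as the initial centroids.
    centroids = [tuple(p) for p in rng.sample(data, k)]
    labels = [None] * len(data)
    for _ in range(max_iter):
        # Step 2: assign every sample to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(x, centroids[j]))
            for x in data
        ]
        if new_labels == labels:
            break  # Step 4: assignments are stable, so stop (assumed rule).
        labels = new_labels
        # Step 3: move each centroid to the mean of its cluster.
        for j in range(k):
            members = [x for x, lab in zip(data, labels) if lab == j]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[j] = mean_point(members)
    return centroids, labels


# Hypothetical toy data: two visually separated groups of points.
toy = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
centroids, labels = k_means(toy, k=2)
print(centroids, labels)
```

Because step 1 is random, different seeds can lead to different intermediate assignments, which is the sensitivity that motivates K-means++.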

K-Means++ Algorithm
The K-means algorithm has been proven to be convergent. However, the random selection of the initial centroids in this algorithm has a great impact on the final result and running speed. Therefore, the K-means++ algorithm has been designed to optimize the selection of the initial centroids [20]. The steps of the K-means++ algorithm are as follows:
1. Randomly choose an initial centroid $u_1$.
2. For each $x_i$, calculate the minimum distance $d_i$ from $x_i$ to the centroids selected so far.
3. Select the next centroid according to a probability; a sample with a larger $d_i$ has a greater probability of being selected.
4. Repeat steps 2 and 3 until k centroids are selected.
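The seeding procedure above can be sketched as follows. The text only states that a larger $d_i$ means a higher selection probability; the sketch assumes the usual K-means++ rule, in which the probability is proportional to $d_i^2$. The function name `kmeans_pp_init` and the toy data are hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def kmeans_pp_init(data, k, seed=0):
    """Choose k initial centroids from data with K-means++ style seeding."""
    rng = random.Random(seed)
    # Step 1: the first centroid is a uniformly random sample.
    centroids = [rng.choice(data)]
    while len(centroids) < k:
        # Step 2: squared distance from each sample to its nearest chosen centroid.
        d2 = [min(squared_distance(x, c) for c in centroids) for x in data]
        # Step 3: draw the next centroid with probability proportional to d_i^2
        # (assumed weighting; already chosen samples have weight 0).
        centroids.append(rng.choices(data, weights=d2, k=1)[0])
    # Step 4 is the loop condition: stop once k centroids are selected.
    return centroids


# Hypothetical toy data with distinct points.
toy = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
seeds = kmeans_pp_init(toy, k=2)
print(seeds)
```

The sketch assumes the data contains at least k distinct points; otherwise all weights become zero and `random.choices` raises an error.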
Method for Determining Cluster Number K
Within the K-means series of algorithms, the number of clusters, denoted by k, is a very important parameter that plays a crucial role in the division of the data into multiple clusters. The selection of k typically involves two methods: the elbow method and the silhouette coefficient method.
The core idea of the elbow method is to evaluate the sum of squared errors (SSE):
$$\mathrm{SSE} = \sum_{p=1}^{k} \sum_{x_i \in D_p} \|x_i - u_p\|^2$$
As the number of clusters k increases, the degree of aggregation in each cluster increases as well, and the SSE consequently becomes smaller. When the decline in the SSE is no longer significant, the benefits of increasing k are minimal. This value of k appears as the inflection point (the "elbow") of the SSE curve.
The elbow method usually requires manual observation of the location of the inflection point to select the k value, while the silhouette coefficient method determines the rationality of the clustering by calculating the silhouette coefficient of the clustering result. The steps of the silhouette coefficient method are as follows:
1. For $x_i$ in cluster $D_p$, calculate the dissimilarity $a(i)$ within the cluster, that is, the average distance from $x_i$ to the other samples in $D_p$:
$$a(i) = \frac{1}{|D_p| - 1} \sum_{x_j \in D_p,\, j \neq i} \|x_i - x_j\|$$
2. Calculate the inter-cluster dissimilarity $b(i)$ of $x_i$, that is, the smallest average distance from $x_i$ to the samples of any other cluster:
$$b(i) = \min_{q \neq p} \frac{1}{|D_q|} \sum_{x_j \in D_q} \|x_i - x_j\|$$
3. Calculate the silhouette coefficient $s(i)$ of $x_i$:
$$s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}$$
The average of the silhouette coefficients $s(i)$ of all samples is taken as the silhouette coefficient of the clustering result. A larger silhouette coefficient indicates a more reasonable clustering result.
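To make the silhouette coefficient concrete, the sketch below computes the average $s(i)$ for a given labeling, using the standard definitions of $a(i)$, $b(i)$, and $s(i)$. The function name `silhouette_coefficient`, the convention of scoring singleton clusters as 0, and the toy data are assumptions made for illustration.

```python
import math


def silhouette_coefficient(data, labels):
    """Mean silhouette coefficient s(i) over all samples (needs >= 2 clusters)."""
    clusters = {}
    for x, lab in zip(data, labels):
        clusters.setdefault(lab, []).append(x)
    scores = []
    for x, lab in zip(data, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # assumed convention for singleton clusters
            continue
        # a(i): mean distance to the other members of the same cluster
        # (the distance from x to itself is 0, so it does not affect the sum).
        a = sum(math.dist(x, y) for y in own) / (len(own) - 1)
        # b(i): smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(x, y) for y in other) / len(other)
            for key, other in clusters.items()
            if key != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Hypothetical toy data: a labeling that follows the two groups,
# and a labeling that mixes them.
toy = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
good = silhouette_coefficient(toy, [0, 0, 0, 1, 1, 1])
mixed = silhouette_coefficient(toy, [0, 1, 0, 1, 0, 1])
print(good, mixed)
```

Repeating this calculation for several candidate values of k and keeping the one with the largest coefficient is one way to apply the method; the labeling that matches the two groups scores much higher than the mixed one.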

Random Forest Algorithm

Decision Tree
The decision tree is a tree-like prediction model in machine learning that is widely used in various fields because its output results are easy to understand [21]. It has a flowchart-like structure in which each branch represents a choice. Moreover, each leaf node corresponds to a classification and produces a rule that consists of the conditions along the path from the root node to the leaf node. The conclusions presented on the leaf node represent the conclusions derived from the corresponding rules. In machine learning, the decision tree is a prediction model that represents a mapping relationship between the attributes and values of an object.

Random Forest
Random forests are classifiers that use multiple decision trees to train and predict samples [22]. Each classifier in a random forest is a decision tree. Each decision tree is generated by splitting every node on randomly selected attributes. Subsequently, when performing classification, the outcome is determined through a voting process involving the multiple decision trees. Based on the above mechanism, random forests are less prone to overfitting and are more stable when dealing with error points and outliers.
The random forest algorithm builds each tree according to the following steps:
1. Let N represent the number of training cases (samples) and M represent the number of features. The number of input features m is used to determine the decision result of a node on the decision tree, where m should be much smaller than M.
2. Sample N times with replacement from the N training samples to form a training set, and use the samples that were not drawn to evaluate the prediction error.
3. For each node, randomly select m features and calculate the optimal split based on these m features; the decision of each node on the decision tree is determined by this split.
4. Each tree grows fully without pruning.
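Step 2 can be illustrated with a short sketch of sampling with replacement. It draws N indices out of N, collects the samples that were never drawn, and compares the empirical share of such samples with the probability $(1 - 1/N)^N$ that a given sample is never drawn. The function name `bootstrap_split`, the number of repetitions, and the use of N = 76 (the number of centers in this study) are illustrative assumptions.

```python
import random


def bootstrap_split(n_samples, rng):
    """Draw n_samples indices with replacement; return (in_bag, out_of_bag)."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    chosen = set(in_bag)
    out_of_bag = [i for i in range(n_samples) if i not in chosen]
    return in_bag, out_of_bag


rng = random.Random(0)
n = 76  # number of collaborative innovation centers in the sample
fractions = []
for _ in range(1000):
    in_bag, oob = bootstrap_split(n, rng)
    fractions.append(len(oob) / n)

mean_oob_fraction = sum(fractions) / len(fractions)
# Probability that a given sample is never drawn in n draws with replacement.
expected = (1 - 1 / n) ** n
print(round(mean_oob_fraction, 3), round(expected, 3))
```

Both values are a little above one-third, so each tree has a sizeable set of unsampled data available for error estimation.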

Feature Importance Evaluation Based on Random Forest Algorithm
Since the random forest algorithm uses random sampling with replacement, about one-third of the data is not used for a given tree and does not participate in the establishment of that decision tree. This part of the data can be used to evaluate the performance of the decision tree and calculate the prediction error rate of the model, which is called the out-of-bag (OOB) error. The importance of a characteristic factor can be judged by calculating the change in OOB.
To study the importance of the ith feature, we essentially study the influence of this feature on the overall classification effect. Specifically, the importance of a feature in the random forest is calculated as follows:
1. Train a random forest classifier $y = f(x)$ and calculate $\mathrm{OOB}_1$.
2. Apply a random perturbation $\varepsilon$ to the ith feature of each sample to obtain the perturbed sample $x'$.
3. Train a random forest classifier $y = g(x')$ and calculate $\mathrm{OOB}_2$.
The importance of the ith feature is then given by the change in the out-of-bag error between $\mathrm{OOB}_1$ and $\mathrm{OOB}_2$. After calculating the importance of each feature, all feature factors can be ranked.
The complete algorithm flow is shown in Algorithm 1.
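The perturbation-based importance calculation can be sketched end to end in plain Python. To keep the example short, the "forest" here is a bagged ensemble of one-split decision stumps rather than fully grown trees, the perturbation is assumed to be Gaussian noise, and the importance is taken as OOB_2 minus OOB_1; all three are simplifying assumptions, and every function name and the synthetic data are hypothetical.

```python
import random


def fit_stump(X, y, feature):
    """Best single-threshold split on one feature; returns (errors, stump)."""
    best = None
    for t in sorted({row[feature] for row in X}):
        left = [lab for row, lab in zip(X, y) if row[feature] <= t]
        right = [lab for row, lab in zip(X, y) if row[feature] > t]
        l_lab = max(set(left), key=left.count)
        r_lab = max(set(right), key=right.count) if right else l_lab
        errors = sum(lab != l_lab for lab in left) + sum(lab != r_lab for lab in right)
        if best is None or errors < best[0]:
            best = (errors, (feature, t, l_lab, r_lab))
    return best


def predict(stump, row):
    feature, t, l_lab, r_lab = stump
    return l_lab if row[feature] <= t else r_lab


def oob_error(X, y, n_trees=50, m=1, seed=0):
    """OOB error of bagged stumps (a toy stand-in for a random forest)."""
    rng = random.Random(seed)
    n, n_features = len(X), len(X[0])
    votes = [{} for _ in range(n)]
    for _ in range(n_trees):
        bag = [rng.randrange(n) for _ in range(n)]  # N draws with replacement
        Xb, yb = [X[i] for i in bag], [y[i] for i in bag]
        feats = rng.sample(range(n_features), m)    # m random candidate features
        _, stump = min((fit_stump(Xb, yb, f) for f in feats), key=lambda r: r[0])
        for i in set(range(n)) - set(bag):          # vote only on unsampled data
            label = predict(stump, X[i])
            votes[i][label] = votes[i].get(label, 0) + 1
    judged = [i for i in range(n) if votes[i]]
    wrong = sum(max(votes[i], key=votes[i].get) != y[i] for i in judged)
    return wrong / len(judged) if judged else 0.0


def perturbation_importance(X, y, feature, scale=1.0, seed=0):
    """Change in OOB error after adding noise to one feature (OOB_2 - OOB_1)."""
    rng = random.Random(seed + 1)
    oob_1 = oob_error(X, y, seed=seed)
    X_pert = [list(row) for row in X]
    for row in X_pert:
        row[feature] += rng.gauss(0.0, scale)  # assumed Gaussian perturbation
    oob_2 = oob_error(X_pert, y, seed=seed)
    return oob_2 - oob_1


# Synthetic data: feature 0 follows the label, feature 1 is pure noise.
data_rng = random.Random(1)
y = [i % 2 for i in range(40)]
X = [[lab + data_rng.uniform(-0.3, 0.3), data_rng.uniform(0.0, 1.0)] for lab in y]
importances = [perturbation_importance(X, y, f) for f in range(2)]
print(importances)
```

On data like this, one would expect the informative feature to show the larger increase in OOB error, although the exact values depend on the random seeds and the noise scale.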

Major Limitations of the Model and Wider Applicability

Major Limitations of the Model
The main limitations of the model are reflected in the following three aspects. For high-dimensional data, the results of factor importance analysis may be affected by the curse of dimensionality, leading to inaccurate analysis results. For nonlinear data, the expression ability of the BPNN model may be insufficient, leading to inaccurate factor importance analysis results. For large-scale data, the computational complexity of the K-means algorithm may be very high, resulting in low analysis efficiency.

The Wide Applicability of the Model
Although the model has some limitations, it also has a wide range of applicability. For data sets whose data distribution is not obvious or irregular, the K-means algorithm can effectively perform clustering, thus improving the accuracy of factor importance analysis. The BPNN model can model nonlinear data to improve the accuracy of factor importance analysis. The random forest algorithm can effectively avoid overfitting and improve the accuracy of factor importance analysis.

Data Source and Preprocessing
This paper uses the second-phase performance statistics of 76 university collaborative innovation centers in Jiangsu Province as sample data. According to the evaluation index system for the construction of a university collaborative innovation center, the sample data is first validated; the data points that do not conform to the statistical characteristics and are unreasonable are eliminated, and the missing values are filled using the median method.
In order to eliminate the feature deviation caused by different data dimensions and ranges, we normalize the data in the preprocessing stage. Let $X = (x_{ij})_{m \times n}$ represent the index value matrix of all collaborative innovation centers; the normalized data matrix $Z$ is then obtained by normalizing each indicator (column) of $X$.
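The two preprocessing steps described here, median filling and normalization, can be sketched as below. The median filling follows the text; the normalization is shown as column-wise min-max scaling, which is only one common choice and is an assumption of this sketch, since the text does not fix a formula at this point. The function names and the small matrix are hypothetical.

```python
from statistics import median


def fill_missing_with_median(X):
    """Replace None entries with the median of the observed values in the column."""
    columns = list(zip(*X))
    medians = [median(v for v in col if v is not None) for col in columns]
    return [[m if v is None else v for v, m in zip(row, medians)] for row in X]


def min_max_normalize(X):
    """Scale every column to [0, 1] (assumed normalization, one of several options)."""
    columns = list(zip(*X))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0 for v, lo, hi in zip(row, lows, highs)]
        for row in X
    ]


# Hypothetical index matrix: rows are centers, columns are indicators.
raw = [[1.0, 200.0], [None, 400.0], [3.0, 300.0]]
filled = fill_missing_with_median(raw)
Z = min_max_normalize(filled)
print(filled)
print(Z)
```

Scaling each column separately keeps indicators with large raw ranges, such as funding amounts, from dominating the distance calculations used in clustering.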

Unsupervised Data Clustering with K-Means++
In this paper, we use K-means++ to perform unsupervised clustering on standardized data to identify the category of each collaborative innovation center. The clustering results are transferred to a random forest model for feature importance analysis.
To determine the number of clusters k, we combine the elbow method and the silhouette coefficient method. Figure 2 shows the result obtained from the elbow method, and Figure 3 shows the result of the silhouette coefficient method. Combining the results of these two methods, we choose k = 2 in this paper, that is, we divide all the data into two categories. After the clustering is complete, the result $Y = (y_1, y_2, \ldots, y_m)$ corresponding to each data point is provided to the random forest algorithm.

Random Forest Algorithm Feature Importance Analysis
We use the normalized data Z and the unsupervised clustering result Y to construct the classification training data (Z, Y) and input the training data into the random forest model. In random forest models, the number of estimators is an important parameter. Based on the specific situation of the data, we adopt 5-fold cross-validation, traverse all the estimator numbers in the set $\{e \mid e = 100 + 50p,\ p \in \mathbb{N},\ 0 \le p \le 18\}$, and select the one with the best effect. Figure 4 shows the results of the experiment that considered the estimator number as a parameter. Based on the experimental results, the number of estimators we choose is 600. In addition, the results of cross-validation are good, indicating that the random forest model has good interpretability for the clustering results of the K-means algorithm.
Once the random forest model completes the classification, the importance results for each feature are obtained, as shown in Figure 5.
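The search over estimator numbers can be sketched as a grid plus a 5-fold split. The grid is the set 100 + 50p for p from 0 to 18 given in the text; everything else is an assumption for illustration: the function names, the index-based fold construction, and in particular `placeholder_score`, which is an arbitrary stand-in for "train a random forest with this many estimators on the training fold and score it on the test fold". It is not an experimental result.

```python
import random


def estimator_grid():
    """Candidate estimator numbers: 100 + 50p for p = 0, ..., 18."""
    return [100 + 50 * p for p in range(19)]


def k_fold_indices(n_samples, n_folds=5, seed=0):
    """Shuffle sample indices and return one (train, test) index pair per fold."""
    order = list(range(n_samples))
    random.Random(seed).shuffle(order)
    folds = [order[i::n_folds] for i in range(n_folds)]
    splits = []
    for held_out in range(n_folds):
        train = [j for f, fold in enumerate(folds) if f != held_out for j in fold]
        splits.append((train, folds[held_out]))
    return splits


def select_n_estimators(score_fn, n_samples=76):
    """Return the grid value with the best mean cross-validation score."""
    splits = k_fold_indices(n_samples)

    def mean_score(n_estimators):
        return sum(score_fn(n_estimators, tr, te) for tr, te in splits) / len(splits)

    return max(estimator_grid(), key=mean_score)


def placeholder_score(n_estimators, train_idx, test_idx):
    """Arbitrary stand-in for a real model score; peaks at 600 by construction."""
    return -abs(n_estimators - 600)


grid = estimator_grid()
splits = k_fold_indices(76)
best = select_n_estimators(placeholder_score)
print(grid[0], grid[-1], len(grid), best)
```

In a real experiment, `score_fn` would fit the classifier on the rows in `train_idx` and return its accuracy on `test_idx`.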

Analysis of the Construction of Universities' Collaborative Innovation Center in Jiangsu
According to the analysis results of the model, we can draw the following conclusions:
1. Scientific research innovation and output are important first-level indicators for evaluating collaborative innovation centers. The results of the analysis indicate that the importance levels of multiple third-level indicators under this indicator are relatively high. In particular, all the third-level indicators under the second-level indicators of scientific research projects are higher than the average level. These findings suggest that in the later construction process, the collaborative innovation center should further improve the quantity and quality of scientific research projects. Moreover, the government should continue to promote research development as well as provide financial support for the project. In addition, the importance of the scientific research awards of the leading universities is relatively high, indicating that these universities should maintain their leading role, ensure collaborative work with the member universities, and continue to produce high-quality output. This output should constitute not only academic papers but also independent intellectual property rights.
2. Funding input and output are crucial to the operation of any collaborative innovation center. The support of industries and local governments constitutes an important funding source. To ensure the development of an innovative country, industries and local governments should establish a reasonable funding scale and cycle for the project based on current reality. Additionally, they should strengthen the macro-control of the project discipline layout. It is essential to provide guidance for the unpopular, weak, and "shrinking" disciplines and areas that are significant in the context of long-term economic and social development. For the collaborative innovation center, establishing a long-term mechanism and a relatively clear policy funding period scheme can create a stable and predictable environment, which is more conducive to the strategic design of collaborative innovation and the selection of a long-term roadmap.
3. In terms of talent training and team building, each collaborative innovation center should pay attention to the talent plan at the provincial, ministerial, and higher levels, as well as focus on the talents that are the source of continuous innovation within the collaborative innovation center. This is because the essence of the collaborative innovation drive is talent. Each collaborative innovation center should take responsibility for the introduction and cultivation of talents, break the constraints of the original system, continuously enhance the vitality and competitiveness of the center in terms of scientific research, condense the research direction, and form a high-level team.
4. A modern university collaborative innovation center should have an international vision. Aiming for high-level results on an international scale can improve the quality and global recognition of the research at the collaborative innovation center. Moreover, actively organizing and conducting major international collaborative research projects and obtaining advanced experience through exchange programs can further improve the quality of the collaborative innovation center.

Conclusions
This study covers the 76 collaborative innovation centers in Jiangsu Province and constructs evaluation indicators based on the performance statistics of the second phase. Specifically, this paper presents a BPNN-dominated K-means-random forest unsupervised factor importance analysis model for the performance analysis of university collaborative innovation centers and assigns an importance index to each characteristic factor based on the model. The research shows that the feature analysis method based on this model does not require manual intervention, and the obtained results have good interpretability, which helps to provide a policy reference for government departments to improve performance management and evaluation methods. Additionally, it provides decision support for collaborative innovation centers to improve performance. Finally, the proposed model has certain limitations for high-dimensional data, nonlinear data, and large-scale data. Subsequent work will focus on model optimization in this area.

First
Overseas High-Level Talent Program
    C51: Talent plan for provincial and ministerial levels and above
  B14: Personnel training
    C52: Train students with master's or above degrees
  B15: New Innovation Team
    C53: New Innovation Team
    C54: Innovation Teams at and above the provincial level
A6: International Cooperation and Exchange
  B16: New Major International Cooperation Studies
    C55: New Major International Cooperation Studies
  B17: To host (undertake) international academic conferences
    C56: To host (undertake) international academic conferences
  B18: New positions in international academic institutions and international academic journals
    C57: Total numbers
    C58: New positions in international academic institutions
    C59: New positions in international academic journals
  B19: International exchange and mutual visits of personnel
    C60: Dispatch personnel
    C61: Visitors

Figure 1. Structure of the unsupervised factor importance analysis model.
Let $x = [x_i]_{i=1}^{n}$ be the training dataset, where $x_p = a$

Algorithm 1: BPNN-dominated K-means-Random Forest Unsupervised Factor Importance Analysis Model
Input: Sample data and evaluation indicators.
Output: The importance results of each feature.
1. Input sample data and evaluation indicators into the BPNN model for data preprocessing.
2. Obtain normalized data Z.
3. Input normalized Z into the K-means++ algorithm model (use the elbow method and the silhouette coefficient method to determine the k value).
4. Obtain unsupervised results Y.
5. Use Z, Y to construct classified training data (Z, Y).
6. Input the training data into the random forest model (determine the number of estimators using the 5-fold cross-validation method).
7. The importance results of each feature are obtained after the classification of random forests is completed.

Figure 2. Results of determining the k value by the elbow method.

Figure 3. Results of determining the k value by the silhouette coefficient method.

Figure 4. Experimental results for different numbers of estimators.

Figure 5. Results of the factor importance analysis of the construction of the universities' collaborative innovation center. The figure shows the third-level indicators; detailed information can be found in Table 1.
