Career Choice Prediction Based on Campus Big Data—Mining the Potential Behavior of College Students

Abstract: Career choice has a pivotal role in college students’ life planning. In the past, professional career appraisers used questionnaires or diagnoses to quantify the factors potentially influencing career choices. However, due to the complexity of each person’s goals and ideas, it is difficult to properly forecast their career choices. Recent evidence suggests that we could use students’ behavioral data to predict their career choices. Based on the simple premise that the most remarkable characteristics of classes are reflected by the main samples of a category, we propose a model called the Approach Cluster Centers Based On XGBOOST (ACCBOX) model to predict students’ career choices. Evaluated on 13 million behavioral records of over four thousand students, our method predicts students’ career choices more accurately than the existing state-of-the-art techniques we compare against.


Introduction
According to Erikson's theory [1], identity development primarily relates to career identity, which is mainly developed during adolescence. A student's career identity is probably shaped by adequate career exploration and consecutive commitment at school [2]. Therefore, career counseling services at universities are significant in helping students find their career goals, which is the reason for many special job counseling centers having been established. The major challenge is to reveal important factors that affect students' career planning. From the psychological point of view, collecting, screening, and evaluating relevant personal information is a cognition-based approach to providing career counseling service [3]. Specifically, students are supposed to develop abilities and skills in understanding themselves to be able to participate in occupational decision-making. However, due to the complexity of each person's goals and ideas, it is difficult for students to clearly determine their postgraduation destinations. In contrast, from an empirical point of view, the students' inner interests and future postgraduation destinations can be effectively ascertained by exploring behavioral data of students at school, which makes students' behavioral data essential for their career planning.
The self-perception theory presumes that human behavior can be used to infer a person's goals and intrinsic motivation [4]. Due to the development of information technology, in modern universities there is a growing trend to augment physical facilities with sensing, computing, and communication capabilities. This means that all behavioral data of students on campus can be recorded in real time through the campus information system. Such behavioral data can reflect the students' unique habits, abilities, preferences, and state of mind [5]. Furthermore, accumulating such data continually provides a way for students to better understand themselves by using data-mining techniques [6]. In contemporary research, differences and regularities of behaviors of various types of graduates of a school have been analyzed by using a data-mining classification algorithm [7]. Additionally, theories can be applied in practice. For example, we can not only establish a set of teaching approaches according to the actual circumstances of students at school but also ensure that students can be better educated according to their own personal conditions. As a result, students can plan their own careers based on their actual personal circumstances to effectively alleviate the problem of difficulty finding employment [8].
Using behavior data to predict students' career choices is a challenging task. Although existing studies use various machine learning algorithms, they suffer from low precision and poor model performance. Hence, motivated by the social influence theory [9], we further analyze the correlation of each student's career choice with the choices of behaviorally similar students. There are three challenges in this process. First, career choices can be divided into four major categories: employment, postgraduate studies, further studies abroad, and others. Thus, this process is a multiclass learning task. To enhance the performance of multiclass learning, prototypical cluster centers are calculated as prior information for each college. In this paper, the benefit of prototypical cluster centers for multiclass learning is verified by experiments.
Second, as there is aggregation in student groups [10], cluster centers are used to help the model capture information in behavioral data. Prototypes are widely used in machine learning to exploit prior knowledge and thereby achieve better results. In this paper, we propose a new regularization method to compensate for the gap between the examples of students and prototypical cluster centers. More specifically, the model's output for each instance and for its corresponding cluster center should be similar. Such regularization naturally encourages local smoothness of the learned function and hence improves accuracy over the original model.
Lastly, behavior data is massive and mixed, including completely different types of data, such as library records, dormitory entries and exits, consumption at campus locations, book borrowing, and academic achievements [11]. In order to predict students' career choices based on behavior data, data mining approaches such as feature engineering [12] are introduced. Before training our model, we cluster students according to their college information. Afterwards, we establish many new behavior-based representative factors that affect a student's career selection. Inspired by Reference [13], a behavioral entropy index is established to measure the regularity of student behavior. Later, we will discuss in detail how we establish new factors.
Specifically, in this paper we collect (mainly through campus smart cards) 4634 students' longitudinal behavioral data spanning almost three years. Based on the assumption that clustering the data by the feature "college" can capture the connections between students, cluster centers are calculated as prototypes for each college [14,15]. For all instances in a cluster, the cluster center, which is a feature vector, represents their average. In other words, the cluster center can be treated as a new instance with the average label. Such a prototype approach brings multiple important advantages for multiclass learning.
The framework of our model is shown in Figure 1. First, four types of behavioral features are generated based on campus behavioral data: mastery of professional skills, behavior regularity, reading interest, and family economic status. Next, the Approach Cluster Centers Based on XGBOOST (ACCBOX) model can be obtained by our prototypical cluster center generation method and a novel regularization item. Finally, the resulting predicted career choice is presented. We use actual career choice data to evaluate our model.
In summary, we make the following contributions: (1) We collect behavioral data of students at school. Using the location and timestamp of each student's card use at the school, the designed algorithm constructs a behavioral entropy index that measures the regularity of student behavior at school, and the differences in the behavioral patterns of students in different graduating classes are analyzed. Finally, a feature that considers data with and without labels measures the data shift with respect to different years.
(2) Based on student behavior, we propose four factors that reflect student traits. Our study shows that these factors are significantly correlated with students' achievements and career choices.
(3) We perform an extensive evaluation on a real-world dataset covering 4634 students. To make our results credible, we perform numerous experiments. A methodology called the ACCBOX model is proposed to model the behavioral information of students belonging to different clusters. We verify the effectiveness of our method of career choice prediction through experiments on the students' behavior dataset.
Figure 1. Framework of the proposed Approach Cluster Centers Based On XGBOOST (ACCBOX) model.

Analysis of Factors Influencing Career Choices
Many existing studies focus on factors that influence college students' career choices. In Reference [16], the influence of parents' careers on students' choices was studied. In Reference [17], it was discovered that salary and quality of career advice were the most common factors influencing career choice. In addition, the influence of peers, gender, print media, and interests on career choices has been investigated [18]. One drawback, however, is that all of these studies used questionnaires to study the influencing factors, and ignored students' daily behaviors. In this study, students' daily behavioral data are used to analyze factors that influence career choice.

Career Choice Prediction
With the rapid development of science and technology, many studies use data generated by online platforms to predict the performance of students based on behavior. For example, in Reference [19], students' academic performance was analyzed by studying students' behavioral data generated on an online platform, namely, a course learning management system (LMS). Additionally, transfer learning was used to predict individuals' professional expertise through online behavioral data [20]. However, in this study, offline behavioral data is our main focus in predicting students' career choices.
Several studies used offline behavioral features to predict students' future. However, all of them focused on predicting academic performance. The authors of Reference [21] compared the effectiveness of multiple linear regression, a multilayer perceptron network, a radial basis function network, and a support vector machine in predicting academic performance, and the support vector machine was observed to attain the best predictive performance. A questionnaire was utilized to collect data about students' social media use for collaboration and communication, and that data was subsequently used to analyze the influence of social media use on students' academic performance [22]. In addition, multilevel regression based on LMS data was used in predicting academic performance [23]. Though the existing methods use mainstream machine learning models to predict students' future, they all ignore the distribution differences between student clusters, for example, between students in different colleges, which have important implications for predicting students' future. In our study, prototypical cluster centers are generated to capture this information, thus improving the performance of our learner.

Behavioral Factors
To mine information correlated with career choices from students' behavioral data, four types of behavioral features are generated as follows.

Analysis of Learning the Level of Mastery of Professional Skills
In this paper, students' professional skills are extracted from the course scores they have earned during their studies. However, doing so raises several problems. First, there are thousands of courses in a university. If the score of each course is taken as a feature, the feature representation will be sparse. In addition, if a student performs well in several courses, we believe that he or she has effectively attained some professional skills. Extracting these skills directly is, however, difficult. Therefore, to extract such features, we use a dimensionality reduction algorithm based on matrix factorization. We denote the matrix of the grades of students by $R \in \mathbb{R}^{M \times E}$, where each element $r_{i,j}$ of $R$ represents the grade of student $u_i$ in class $c_j$. We factorize the sparse matrix $R$ into two matrices, denoted by $P_{M \times K}$ and $Q_{K \times E}$:

$$R_{M \times E} \approx \hat{R}_{M \times E} = P_{M \times K} \, Q_{K \times E}.$$

In these two matrices, $M$ represents the number of students, $E$ represents the number of classes, and $K$ refers to $K$ kinds of potential professional skills. In this case, matrix $P_{M \times K}$ represents the professional skills of each student, and $Q_{K \times E}$ embodies the relationships between these $K$ kinds of skills and each class in this college.
To ensure that the product of the two matrices $P_{M \times K}$ and $Q_{K \times E}$ approximates $R_{M \times E}$, we transform this question into a regression problem. We define the squared error between the original matrix $R_{M \times E}$ and the reconstructed matrix $\hat{R}_{M \times E}$ as the loss function. Additionally, to increase our model's generalizability, we add the $L_2$ norms of the two matrices to the function as a regularization item. In the resulting objective function, $I_{i,j}$ indicates whether student $u_i$ has taken class $c_j$, so that only observed grades contribute to the error. Vector $p_i \in \mathbb{R}^K$ is the $i$th row vector of matrix $P_{M \times K}$, while $q_j \in \mathbb{R}^K$ is the $j$th column vector of matrix $Q_{K \times E}$.
Next, we solve for $P_{M \times K}$ and $Q_{K \times E}$ in the objective function by stochastic gradient descent. We optimize the loss function with the corresponding gradient updates and obtain the professional skill vector $p_{u,k}$; afterwards, we add this vector into the forecasting model as a feature factor. As a given course may be taught by a different teacher every time, the final grades may not be comparable, and the standard of evaluation may vary. Hence, we need to normalize the grades to make them comparable. Suppose that both teachers A and B teach course C; teacher A gave student $u_i$ a grade of $g_{u_i}$, and student $u_j$, taught by teacher B, earned a grade of $\hat{g}_{u_j}$. In the normalization, $H$ and $L$ represent the numbers of students in the classes of teacher A and teacher B, respectively.
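The factorization described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: grades are assumed to be scaled to [0, 1], the function names `factorize_grades` and `reconstruction_error` are hypothetical, and the learning rate `lr`, the regularization weight `reg`, and the number of epochs are illustrative values that do not come from the text.

```python
import random


def factorize_grades(grades, n_students, n_courses, k=2, lr=0.05, reg=0.02,
                     epochs=500, seed=0):
    """Factorize observed grades into P (students x k) and Q (k x courses).

    grades maps (student_index, course_index) -> grade. Only courses that a
    student actually took appear, which plays the role of the indicator I_ij.
    """
    rng = random.Random(seed)
    P = [[rng.uniform(0.0, 0.1) for _ in range(k)] for _ in range(n_students)]
    Q = [[rng.uniform(0.0, 0.1) for _ in range(n_courses)] for _ in range(k)]
    observed = list(grades.items())
    for _ in range(epochs):
        rng.shuffle(observed)
        for (i, j), r in observed:
            pred = sum(P[i][f] * Q[f][j] for f in range(k))
            err = r - pred
            for f in range(k):
                p_if, q_fj = P[i][f], Q[f][j]
                # Stochastic gradient step on squared error plus L2 penalty.
                P[i][f] += lr * (err * q_fj - reg * p_if)
                Q[f][j] += lr * (err * p_if - reg * q_fj)
    return P, Q


def reconstruction_error(grades, P, Q):
    """Mean squared error over the observed entries only."""
    k = len(Q)
    total = 0.0
    for (i, j), r in grades.items():
        pred = sum(P[i][f] * Q[f][j] for f in range(k))
        total += (r - pred) ** 2
    return total / len(grades)
```

Each row of `P` is then the skill vector of one student and can be appended to that student's feature vector.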

Analysis of Behavior Regularity
According to the Big Five personality traits (namely, openness, conscientiousness, extraversion, agreeableness, and neuroticism), conscientiousness plays an important role in job and academic performance [24]. The factors we consider are eating breakfast, going to the library, and bathing for the first time every day. Specifically, the entropy of the probability that a behavior occurs within specific time intervals can measure the regularity of daily behaviors. Assume that period $T$ is divided into $n$ time intervals: $T = \{t_1, \ldots, t_n\}$.
For each student, the probability of behavior $v \in V$ = {"eating breakfast", "going to the library", "bathing"} occurring within a given time interval $t_i$ can be calculated as

$$P_v(t_i) = \frac{n_v(t_i)}{\sum_{k=1}^{n} n_v(t_k)},$$

where $n_v(t_i)$ refers to the frequency of occurrence of behavior $v$ within time interval $t_i$. The entropy of behavior $v$ is then calculated as

$$E_v = -\sum_{i=1}^{n} P_v(t_i) \log P_v(t_i).$$

When the probability of a behavior is distributed uniformly over the time intervals, its entropy is high and its regularity is low. In the computation of entropy, time periods and time interval spans can differ according to the characteristics of each behavior. For example, the time period of breakfast behavior is set from 6 a.m. to 10 a.m., and its time interval is half an hour.
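The entropy computation can be illustrated with a short sketch. It assumes that event times are given as minutes since midnight and that the natural logarithm is used; the function name `behavior_entropy` and its parameters are hypothetical and chosen only for this example.

```python
import math
from collections import Counter


def behavior_entropy(times_in_minutes, start, end, width):
    """Entropy of one behavior over equal-width intervals in [start, end).

    Events outside the period are ignored. A low value means that the
    behavior tends to happen in the same interval, i.e. it is regular.
    """
    counts = Counter()
    for t in times_in_minutes:
        if start <= t < end:
            counts[(t - start) // width] += 1
    total = sum(counts.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for c in counts.values():
        p = c / total
        entropy -= p * math.log(p)
    return entropy


# Breakfast window from the text: 6 a.m. to 10 a.m. in half-hour intervals.
regular = behavior_entropy([425, 430, 435, 440], start=360, end=600, width=30)
irregular = behavior_entropy([365 + 30 * i for i in range(8)], 360, 600, 30)
```

Here `regular` is zero because every breakfast falls into the same half-hour interval, while `irregular` reaches the maximum for eight intervals because the events are spread evenly.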

Analysis of Book Reading Interests
We can learn what a student is interested in from library borrowing records, which partly correlate with future vocational choices. There are millions of books in the library, and each student borrows only a few of them. Directly counting each student's borrowings per book would therefore produce very sparse vectors. However, each book has several rich attributes, such as its classification. Therefore, we use the Chinese Library Classification to define the category of each book. Considering the accuracy of the final partition and the sparsity of the vector, the second-level partition of the Chinese Library Classification is used as the criterion, which gives approximately 250 dimensions in total. We compute how often each student borrows books in each of these categories and define the frequencies as a feature vector that characterizes the student's personal interests; that is,

$$S = (G_1, G_2, G_3, \ldots, G_\zeta).$$

In this equation, $S$ denotes the feature vector of a student's interests, $G_j$ represents the number of books borrowed by the student in category $j$, and there are $\zeta$ categories of books in total. Considering that the frequency of borrowing varies across students, it is necessary to normalize the frequencies to make each element of the feature vector comparable. The feature vector after normalization again has the form $S = (G_1, G_2, G_3, \ldots, G_\zeta)$, now with normalized entries; we then add the new feature vector to the prediction model to improve the accuracy and interpretability of the model.
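As a small illustration of the interest vector, the sketch below counts borrowings per category and normalizes them. Dividing by the student's total number of borrowings is an assumption made here for illustration, since the normalization formula is not given in this text. The function name `interest_vector` and the category codes in the usage example are likewise illustrative.

```python
from collections import Counter


def interest_vector(borrowed_categories, all_categories):
    """Share of a student's borrowings that falls into each book category.

    borrowed_categories holds one category code per borrowed book;
    all_categories fixes the order of the vector's dimensions.
    """
    counts = Counter(borrowed_categories)
    total = sum(counts[c] for c in all_categories)
    if total == 0:
        # A student without borrowings gets an all-zero vector.
        return [0.0] * len(all_categories)
    return [counts[c] / total for c in all_categories]


# Illustrative second-level category codes; real data would use about 250.
categories = ["I2", "O1", "TP3", "K8"]
vector = interest_vector(["TP3", "TP3", "O1", "I2"], categories)
```

With this normalization, students who borrow many books and students who borrow few are described on the same scale.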

Analysis of Family Economic Status
We could assess students' economic conditions by using a questionnaire, but because students may not be able to estimate their family economic conditions well, and because of geographical differences, it is difficult to unify the criteria. Therefore, we calculate daily expenditures from students' consumption in the cafeteria and supermarkets. First, we use descriptive statistics, including skewness, dispersion, median, mean, quartile range, standard deviation, and kurtosis values, to assess each family's economic conditions. Second, we calculate the ratio of transactions on weekends and weekdays, and subsequently perform a fast Fourier transform (FFT); we characterize the consumption cycle by the total energy, computed as the sum of squared magnitudes of the FFT components, which provides more information about families' economic status. To eliminate the influence of consumption level, the mean value of the sequence is subtracted from each value of sequence $[x_1, x_2, \cdots, x_n]$. We define the energy based on the converted sequence $[\tilde{x}_1, \tilde{x}_2, \cdots, \tilde{x}_n]$ as follows:

$$E = \sum_{k=1}^{n} \left| \sum_{i=1}^{n} \tilde{x}_i \, e^{-j 2 \pi k i / n} \right|^2,$$

where $\tilde{x}_i = x_i - \frac{1}{n} \sum_{m=1}^{n} x_m$, and symbol $j$ denotes the imaginary unit.
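The energy feature can be sketched with a direct discrete Fourier transform in plain Python. This is an illustrative version under the assumption that no extra scaling factor is applied to the transform; the name `spectral_energy` is hypothetical. A production version would more likely call an FFT routine from a numerical library, which yields the same coefficients faster.

```python
import cmath


def spectral_energy(daily_spend):
    """Sum of squared magnitudes of the DFT of the mean-centered sequence."""
    n = len(daily_spend)
    mean = sum(daily_spend) / n
    # Subtract the mean so that the overall spending level does not count.
    centered = [x - mean for x in daily_spend]
    energy = 0.0
    for k in range(n):
        coeff = sum(x * cmath.exp(-2j * cmath.pi * k * i / n)
                    for i, x in enumerate(centered))
        energy += abs(coeff) ** 2
    return energy


flat = spectral_energy([5.0, 5.0, 5.0])            # constant spending
alternating = spectral_energy([1.0, 3.0, 1.0, 3.0])  # strong two-day cycle
```

A constant sequence has no energy after centering, whereas a sequence that alternates between low and high spending concentrates its energy in a single frequency component.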

Model Introduction
At present, the traditional machine learning method is limited in solving the problems of students' postgraduation plans. For a group of students, there must be a certain connection between them, and this problem could be understood better by mining the relationship between students. First, we find the connections among students through the clustering method; however, large differences between students will affect the performance of the clustering algorithm. Fortunately, we can obtain this a priori information from each college. By building information bridges, we naturally connect related students. In this section, we present our approach called the "Approach Cluster Centers Based On XGBOOST" (ACCBOX). ACCBOX proceeds by taking three elementary steps: prototypical cluster center generation, model training, and optimization.

Problem Statement
In this subsection, we introduce some notation and then formally define the prediction task of this paper. In a university, let $U = \{U_1, U_2, U_3, \ldots, U_C\}$ denote the set of colleges. For every student $i$, we denote the feature vector and the student's career choice by $x_i \in \mathbb{R}^{1 \times p}$ and $y_i$, respectively. Parameter $p$ represents the total number of dimensions of students' features after using the methods of feature generation mentioned in Section 3. Let $x = [x_1, x_2, \ldots, x_N] \in \mathbb{R}^{N \times p}$ and $y = [y_1, y_2, \ldots, y_N] \in \mathbb{R}^{N}$ denote the feature matrix and career choices of all students, where $N$ is the total number of all students. $D = \{x_i, y_i\}_{i=1}^{n}$ denotes students' behavioral data and their career choices, where $n$ is the number of students. The details of the features will be covered in the next section. We then formally define our method of career choice prediction as follows.
Career Choice Prediction: Given the feature vector $x_i$ of every student, we predict the corresponding career choice $y_i$. To build our ACCBOX model, we first introduce the prototypical cluster center. Then the optimization method for training the model parameters is expounded.

Prototypical Cluster Center
As shown in Figure 2, based on the premise that the most remarkable characteristics of a class are reflected by the main samples of a category, we determine a cluster center for each college. The set $U$ contains $C$ colleges. Next, we calculate a main sample for each cluster. For clusters $U = \{U_1, U_2, \ldots, U_C\}$, cluster center $z_j$ is defined as

$$z_j = \frac{\sum_{i=1}^{n} I(x_i \in U_j) \, x_i}{\sum_{i=1}^{n} I(x_i \in U_j)},$$

where $I(\cdot)$ is an indicator function, that is, $I(x_i \in U_j)$ equals 1 if $x_i \in U_j$ is true and equals 0 otherwise. Similarly, let $t_j$ denote the labeling information of $U_j$; we treat the average vector $t_j$ of students' career choices from college $U_j$ as the career choice prototype of that college. Accordingly, the prototypes of colleges are defined as $D' = \{z_j, t_j\}_{j=1}^{C}$. The students from the same college usually have similar career choices. The prototype of each college can reveal prior information about the career choices of students from the corresponding college. Hence, we also require the model's prediction of a student's career choice to approach that of the prototype. Experiments in Section 5.2 show that this effectively improves the learning ability of our model.
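The prototype computation amounts to a per-college average, as the following sketch shows. It represents the soft label $t_j$ as the vector of class proportions within the college, which is one possible reading of "average vector" and an assumption of this example; Algorithm 2 in this paper instead describes $t_j$ as a value in [0, 3]. The function name `college_prototypes` is hypothetical.

```python
from collections import defaultdict


def college_prototypes(features, labels, colleges, n_classes):
    """Return {college: (z, t)} for every college in the data.

    z is the mean feature vector of the college's students, and t is the
    share of its students in each career choice class (a soft label).
    """
    grouped = defaultdict(list)
    for x, y, c in zip(features, labels, colleges):
        grouped[c].append((x, y))
    prototypes = {}
    for c, members in grouped.items():
        m = len(members)
        dim = len(members[0][0])
        z = [sum(x[d] for x, _ in members) / m for d in range(dim)]
        t = [sum(1 for _, y in members if y == k) / m
             for k in range(n_classes)]
        prototypes[c] = (z, t)
    return prototypes


protos = college_prototypes(
    features=[[1.0, 2.0], [3.0, 4.0], [10.0, 0.0]],
    labels=[0, 1, 2],
    colleges=["A", "A", "B"],
    n_classes=4,
)
```

Each pair `(z, t)` can then be treated as one additional labeled instance per college.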

Model Training
According to [25], the XGBOOST model $\phi(x_i)$ can be described as

$$\hat{y}_i = \phi(x_i) = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \mathcal{F},$$

where $\hat{y}_i$ is the prediction of the student's career choice, $K$ is the number of decision trees in XGBOOST, and $f_k$ is the $k$-th decision tree in the CART tree set $\mathcal{F}$. Following XGBOOST [25], the objective function is designed as

$$\mathcal{L}(\phi) = \sum_{i} l(\hat{y}_i, y_i) + \alpha \sum_{k} \Omega(f_k),$$

where the first item is the conventional loss function and the second item is the regularization item to avoid overfitting. The trade-off hyperparameter $\alpha$ reduces the model capacity. $\Omega(f)$ of each tree can be described as

$$\Omega(f) = \gamma T + \frac{1}{2} \lambda \lVert w \rVert^2,$$

where $T$ is the number of leaves in the tree, $w$ is the vector of leaf scores, and both $\gamma$ and $\lambda$ are trade-off parameters. We use the soft labels of each college, which are treated as career choice prototypes, to regularize the model's predictions for the students of each college; this yields the novel regularization item of Equation (14). Combining Equations (13) and (14) leads to the final objective function, in which $\beta$ is a trade-off parameter that controls the importance of the regularization item.
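As a hedged illustration of how the prototype-based item could enter the objective, the form below assumes that the same per-example loss $l$ is evaluated at each cluster center $z_j$ against its soft label $t_j$. This choice and the symbol $\mathcal{R}$ are assumptions made for illustration; they are one form consistent with the description above and are not taken from the text.

```latex
\mathcal{R}(\phi) = \sum_{j=1}^{C} l\bigl(\phi(z_j),\, t_j\bigr),
\qquad
\mathcal{L}_{\mathrm{final}}(\phi)
  = \sum_{i=1}^{n} l\bigl(\hat{y}_i,\, y_i\bigr)
  + \alpha \sum_{k=1}^{K} \Omega(f_k)
  + \beta\, \mathcal{R}(\phi)
```

With $\beta = 0$ this reduces to the regularized XGBOOST objective, which matches the role of $\beta$ as the weight of the prototype item.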

Optimization
XGBOOST is a commonly used algorithm in machine learning that performs very well on most classification tasks. Unfortunately, it is not very good at the task of predicting students' postgraduation plans. However, applying the method we propose to XGBOOST can improve its effectiveness.
Similarly to other boosting algorithms, XGBOOST is an iterative decision tree algorithm; its base learner is a classification and regression tree (CART), and it constructs an ensemble model in a stage-wise iterative manner. The ACCBOX algorithm's iterative update process is shown in Algorithm 1.

Algorithm 1: Iterative update of ACCBOX
Input: the students' behavior dataset $D$, the colleges' prototype dataset $D'$, the college clusters $U$, and the loss function
Output: CART tree ensemble $\phi(x, z)$
1 Initialize the model
2 k = 1
3 while (k <= K)
4     Calculate the residuals
5     Fit a CART tree $f_k$ to the above three residuals
6     Find the optimal weight for $f_k$ by minimizing the loss function
7     Update the model, where $\alpha$ refers to the learning rate of our model
8     k = k + 1
At each iteration, ACCBOX uses the ensemble model obtained at that stage to calculate the residual between the model's predicted values and the true values. There are two parts in the model: the residual between the student's career choice and the predicted value, and the residual between the prototypical cluster centers' data and the predicted value, as shown in Equations (19) and (20).
To update the existing ensemble model, it is necessary to train a CART tree in each iteration to fit the above three residuals and add that tree to the ensemble. To ensure that the addition of that tree benefits our model, we need to continuously optimize the parameters of that tree. The optimal parameters of that tree can be obtained by minimizing the loss function as shown in Equation (21). As a result, we obtain the optimal tree ensemble after $K$ rounds of iteration, as shown in Equation (23).
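The iterative update can be illustrated with a deliberately simplified sketch. It is plain gradient boosting on regression stumps with a squared loss, a single numeric feature, and scalar targets. It is not XGBOOST and omits the second-order terms and tree regularization. Prototypes enter as extra training examples with weight `beta`, which for a squared loss is equivalent to adding a beta-weighted penalty on the prediction error at the cluster centers. All names and default values are hypothetical.

```python
def fit_stump(xs, targets, weights):
    """Weighted least-squares regression stump on one numeric feature."""
    def weighted_mean(rows):
        total = sum(w for _, w in rows)
        return sum(t * w for t, w in rows) / total if total else 0.0

    best = None
    for threshold in sorted(set(xs)):
        left = [(t, w) for x, t, w in zip(xs, targets, weights)
                if x <= threshold]
        right = [(t, w) for x, t, w in zip(xs, targets, weights)
                 if x > threshold]
        lv, rv = weighted_mean(left), weighted_mean(right)
        loss = (sum(w * (t - lv) ** 2 for t, w in left)
                + sum(w * (t - rv) ** 2 for t, w in right))
        if best is None or loss < best[0]:
            best = (loss, threshold, lv, rv)
    _, threshold, lv, rv = best
    return lambda x: lv if x <= threshold else rv


def boost_with_prototypes(xs, ys, zs, ts, beta=0.5, rounds=50, lr=0.1):
    """Fit stumps to the residuals of students and weighted prototypes."""
    all_x = list(xs) + list(zs)
    all_y = list(ys) + list(ts)
    weights = [1.0] * len(xs) + [beta] * len(zs)
    preds = [0.0] * len(all_x)
    trees = []
    for _ in range(rounds):
        # Residuals of both parts: student examples and cluster centers.
        residuals = [y - p for y, p in zip(all_y, preds)]
        tree = fit_stump(all_x, residuals, weights)
        trees.append(tree)
        preds = [p + lr * tree(x) for p, x in zip(preds, all_x)]
    return lambda x: sum(lr * tree(x) for tree in trees)


model = boost_with_prototypes(
    xs=[1.0, 2.0, 3.0, 10.0, 11.0, 12.0], ys=[0, 0, 0, 1, 1, 1],
    zs=[2.0, 11.0], ts=[0.0, 1.0],
)
```

In this toy setting the two prototypes agree with the student labels, so they only reinforce the split; with conflicting data, `beta` controls how strongly predictions are pulled towards the college averages.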
In summary, we have fully introduced our ACCBOX model through three steps: prototypical cluster center generation, model training, and optimization. The pseudocode of the model algorithm is shown in Algorithm 2.

Algorithm 2: The Approach Cluster Centers Based On XGBOOST (ACCBOX) algorithm.
Input: $D = \{x_i, y_i\}_{i=1}^{n}$: the students' behavior dataset
       $\alpha, \beta$: the regularization hyperparameters
       $U$: the college clusters
       $\hat{x}$: the student test set
Output: $\hat{y}$: the prediction of student's career choice
1 divide $x$ into different colleges' clusters;
2 calculate college cluster centers' prototypes $\{z_j\} \in \mathbb{R}^{c \times d}$ according to Equation (10);
3 calculate prototypes' career choice labels $\{t_j\} \in [0, 3]^{c \times 1}$ according to Equation (11);
4 train the best tree ensemble model structure according to Algorithm 1;
5 use the tree ensemble model to predict student's career choice $\hat{y}$;

Experiments
In this section, we first introduce the dataset and the settings used for evaluation. Afterwards, we report experimental results and discuss them.

Dataset and Settings
The evaluation uses a dataset of smart card data of 4634 students of the same grade from 16 colleges. The total number of consumption records is 13,122,696. We collected the data at one university from 2010/09/01 to 2014/06/30. This dataset consists of four types of data: academic performance data, basic information data of students, behavior data, and career choice data. The students have borrowed 95,493 books, resulting in 391,637 book loan records. They have taken 1358 courses and generated 336,353 course grade records. The numbers of library records and dormitory entries and exits are 1,048,576 and 727,260, respectively. Students' behaviors include library and dormitory entries and exits, consumption at campus locations (e.g., a canteen and a supermarket), book borrowing, and academic achievements.
The experimental training group and the test group are divided according to a 70%-30% ratio. All hyperparameters are tuned based on accuracy in 5-fold cross validation.
In the experiment, we divide the training group into 16 colleges, and the cluster center of each college is the mean feature vector of the students in that college. The final objective function of our proposed approach, which incorporates prototypical cluster center generation and a novel regularization item for career choice prediction, is shown in Equation (16). We search for the regularization hyperparameters $\alpha$ and $\beta$ in the interval $[10^{-4}, 1]$ with a multiplicative step of 10.
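The search space described above is small enough to enumerate. The sketch below builds the grid and picks the best pair with a scoring callback; the names `log_grid` and `best_setting` and the callback `cv_accuracy`, which stands in for the 5-fold cross-validated accuracy of a model trained with a given pair, are assumptions of this example.

```python
from itertools import product


def log_grid(low_exp=-4, high_exp=0):
    """Powers of ten from 10**low_exp to 10**high_exp, inclusive."""
    return [10.0 ** e for e in range(low_exp, high_exp + 1)]


def best_setting(candidates, cv_accuracy):
    """Return the (alpha, beta) pair with the highest validation score."""
    return max(candidates, key=lambda pair: cv_accuracy(*pair))


alphas = log_grid()
betas = log_grid()
candidates = list(product(alphas, betas))  # every (alpha, beta) combination
```

Five values per hyperparameter give 25 combinations, so an exhaustive search with cross-validation remains cheap.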
In our experiments, we use the accuracy, recall, precision, and micro-F1 as our evaluation metrics.

Feature Importance
We combine the four types of features constructed in Section 3 (Sections 3.1 to 3.4) in a full combination way, which gives 16 feature combinations. To compare the model performance under these 16 feature combinations, we first input each of the four types of features into the classification algorithm separately so that we can obtain the performance of each type of feature. It is clear that each type of feature helps refine the prediction of students' postgraduation destinations, with mastery of professional skills having the greatest impact. This is quite intuitive, as the main goal of college students is to learn professional skills that will help them in their future careers. Additionally, a person's interest in reading reflects that person's preferences in learning and life, while behavioral regularity and family economic conditions affect lifestyle, which in turn affects career choice.
• ACCBOLR is a new method we propose based on logistic regression, incorporating prototypical cluster center generation and a novel regularization item.
• ACCBOX represents the application of our proposed method based on XGBOOST, incorporating prototypical cluster center generation and a novel regularization item.
As shown in Table 2, adding the regularization item and prototype examples improves the predictive accuracy of the logistic regression and XGBOOST models. With our method applied to XGBOOST (ACCBOX), the accuracy increases from 0.604 to 0.638, the Micro-F1 score rises from 0.622 to 0.647, Micro-Precision increases from 0.617 to 0.629, and Micro-Recall rises from 0.627 to 0.666. With our method applied to logistic regression (ACCBOLR), the accuracy increases from 0.605 to 0.623, the Micro-F1 score rises from 0.607 to 0.624, Micro-Precision increases from 0.601 to 0.626, and Micro-Recall rises from 0.615 to 0.623. The other baseline models do not reach the accuracy of ACCBOX. The accuracy of SVM is 0.606, the value of Micro-F1 is 0.621, Micro-Precision is 0.610, and Micro-Recall is 0.632. The accuracy of the random forest method is 0.623, the value of Micro-F1 is 0.633, Micro-Precision is 0.620, and Micro-Recall is 0.647. The performance of the decision tree is the worst: its accuracy is 0.532, the Micro-F1 score is 0.562, Micro-Precision is 0.541, and Micro-Recall is 0.584.
As displayed in Table 3, we perform a further ablation study to demonstrate the contribution of each design in ACCBOX. The second row represents the XGBOOST method trained with reducing the model capacity. In the third row, only the novel regularization approach is added in XGBOOST. It is clear that the novel regularization term makes a good contribution to regularizing the learner.
The experimental results indicate that our approach outperforms the compared state-of-the-art counterparts. Both the prototypical cluster center generation approach and the novel regularization item contribute to the performance of the model.

Effect of α and β
As Figure 3 shows, as $\alpha$ increases in the ACCBOX model, the model attains its highest accuracy at $\alpha = 10^{-2}$, after which the accuracy decreases gradually. As $\beta$ increases, the accuracy of the model increases gradually, reaches its highest level at $\beta = 10^{-2}$, and subsequently decreases gradually.

Conclusions
In this paper, we have studied college students' career choices based on their professional skills, behavior regularity, and other related behaviors. Additionally, the study has offered several important insights into improving the model.
We have proposed a prototypical cluster center generation approach to use the prior information from each college. Motivated by the cluster assumption that examples in the same cluster should have the same label, we have introduced a novel regularization item to bridge the gap between the real-world examples and prototypical cluster centers. The results of multiple experiments demonstrate that our approach is superior to the other approaches to career choice prediction that we compared.
In future studies, three directions are of interest. First, cluster centers can be discovered with a more precise method. In addition, our model can be extended from using only behavioral data to using multimodal data, such as adding school achievement and questionnaire data. Furthermore, it is meaningful to improve our model to not only predict career choices but also advise on career planning, such as advising on the courses required.