Analysis of Risk Factors and Symptoms of Burnout Syndrome in Colombian School Teachers under Statutes 2277 and 1278 Using Machine Learning Interpretation

In 2002, the Colombian Ministry of Education released statute 1278, for teaching professionalization, superseding statute 2277 of 1977. Although statute 1278 was intended to increase the quality of the education service and teachers’ remuneration, there is evidence that the abundant evaluations and hindered promotion system introduced by statute 1278 impaired the quality of life of the teachers and led to a higher incidence of burnout syndrome. We used two techniques for machine learning interpretability, SHapley Additive exPlanation (SHAP) summary plots and predictor importance, to interpret support vector machine and decision tree machine learning models, respectively, to better understand the differences in risk factors and symptoms of burnout syndrome between school teachers under statutes 2277 and 1278. We surveyed 54 school teachers between August and October 2018, 17 under statute 2277 and 37 under statute 1278. Among the risk factors and symptoms of burnout syndrome considered in this study, we found that satisfaction with earned income was the most relevant risk factor, followed by overtime work and the perceived severity of the sanctions for low performance. The most relevant symptoms of burnout were fatigue at the end of the day and frequent headaches. This methodology can potentially be used in other contexts and social groups, allowing institutional authorities and policy makers to allocate resources to specific issues affecting a particular group of workers.


Introduction
Stressful working environments affect the mental health of teachers, and produce disorders like burnout syndrome (Delgado 1995;Extremera et al. 2003;Redó 2009;León et al. 2008). Contract conditions contribute to teachers' working environment, as they set the rules for remuneration, incentives, promotion, and training. In Colombia, two statutes coexist to rule public schools' teacher contracts, statute 2277 of 1977, and statute 1278 of 2002. Both statutes regulate the teaching profession in terms of functions and requirements, such as the selection, permanence and promotion processes. In the case of the new statute (1278), appointments are based on merit and a competitive selection, which requires reaching certain threshold scores in the selection exam, and competing with other applicants for admission to the teaching staff. In contrast, statute 2277 allowed the direct assignment of teaching appointments. Overall, it is said that statute 1278 has "established new rules for training, entry, evaluation, promotion and permanence in the teaching profession, which seek to influence students' results through the improvement of teaching quality" (Bautista Macia 2009).
With regard to permanence, statute 2277 established it as an acquired right. In statute 1278, permanence is determined by assessing teachers' suitability, and evaluating functions and tasks related to teaching performance. Those conditions were not considered in the previous statute (2277). With respect to remuneration, incentives, promotion, and training, the functions assigned specifically to teachers under statute 1278 are used in the process to evaluate teaching performance and skills (Cifuentes Cubillos 2013). It has been observed in many cases that teachers under statute 1278 express demotivation, because they cannot obtain promotions or improve their salary, despite the multiple evaluations that they undergo.
Although statute 1278 has been found to positively affect the quality of education, it does not guarantee an improvement of the quality of life or working conditions of teachers (Bautista Macia 2009). A recent analysis of differences in risk factors and symptoms of burnout between teachers under statutes 2277 and 1278 has shown a higher prevalence among teachers under statute 1278 (Posada Quintero et al. 2018). However, traditional statistical analysis is limited to determining which risk factors and symptoms differ between the groups, and cannot draw conclusions on how important those differences are. Identifying the relative importance of risk factors and symptoms could enable a more effective prevention of burnout in teachers under statute 1278.
Artificial intelligence can help one to understand more clearly the impact of statute 1278 on teachers' working conditions. Artificial intelligence is defined as "a branch of computer science dealing with the simulation of intelligent behavior in computers", or "the capability of a machine to imitate intelligent human behavior" (Definition of ARTIFICIAL INTELLIGENCE 2020). In other words, artificial intelligence is a field concerned with reproducing human intelligence in machines, especially computer systems, through learning, reasoning and self-correction (Egbuna 2019). Artificial intelligence is a vast topic, and machine learning is one subset of it. The machine learning concept comes from the learning aspect of the definition of artificial intelligence provided above. Simply put, machine learning is a set of statistical tools to learn from a given dataset. The purpose of machine learning is to teach computers how to learn and make predictions from data, without explicit instructions to do so.
In this paper, we advocate for the use of machine learning to provide policy makers and institutions with tools to prevent burnout in teachers. We propose a novel methodology based on machine learning interpretation for the analysis of the importance of risk factors and symptoms of burnout in two different populations that coexist in the same workplace (different statutes govern their contracts), and that are known to have different levels of prevalence of burnout. First, we collected data from both populations regarding the risk factors and symptoms of burnout. Second, we performed a statistical analysis to identify differences in risk factors and symptoms, as a preliminary way to assess relevancy. Then, we trained machine learning algorithms to classify the two groups, using risk factors and symptoms separately, to identify the most sensitive features and the best-performing model. Finally, we trained a model with the identified risk factors and symptoms and applied a machine learning interpretation technique, in order to rank the factors and symptoms by their importance.

Materials and Methods
In this study, we analyzed the risk factors and symptoms of burnout syndrome in teachers under two different statutes, 2277 and 1278. The sample was made up of teachers from the San José Technical Educational Institution (Fresno, Tolima, Colombia), who voluntarily consented to participate in the study. The information was obtained from a survey applied to all teachers (N = 54), with ages between 22 and 63 years (mean = 43 years, standard deviation = 12). The highest percentage of respondents corresponded to women (76%). For the purposes of the analysis, the sample was divided into two groups: teachers under statute 2277 (N = 17; 31.5%, mean = 55 years) and under statute 1278 (N = 37; 68.5%; mean = 37 years). Note that all teachers belonged to the same school, and this constitutes a single case study (Stake 1994;1995;2005). This is a widely used methodology in education research (Tight 2010;2012).
For the purpose of the analysis, sociodemographic characteristics were initially collected (statute of employment, age and sex). Then, 26 items corresponding to 13 risk factors and 13 symptoms of burnout syndrome were established from previous studies (Minprotección Publica Instrumentos Para Evaluar Factores de Riesgo Psicosocial 2011). The risk factors studied were communication with supervisors; sufficient salary; working overtime; social recognition; absences and penalties; support from supervisors; breaks during the day; valuation of work; tasks according to the teaching profession; relationship with peers; working hours; union membership; and training opportunities. The symptoms analyzed through the instrument were loss of appetite; difficulty in communication; headache; gastric problems; feeling of depression; medical consultation for injuries; despair; feeling of fatigue; consultation for mental health; irregular sleep; feeling of sadness; irritability; and consultation for voice injuries. The risk factors and symptoms considered in this study have been previously associated with burnout syndrome, and were taken from the specifications of the Battery of instruments for the Evaluation of psychosocial risk factors of the Colombian Ministry of Social Protection (Minprotección Publica Instrumentos Para Evaluar Factores de Riesgo Psicosocial 2011). The questionnaire was applied in Spanish, and a translation is included in Table 1. The coding for the responses was: 0, Never; 1, Almost never; 2, Sometimes; 3, Always.

Statistical Analysis
Normality of the indices was tested using the one-sample Kolmogorov-Smirnov test (Massey 1951;Miller 1956;Wang et al. 2003). Given that the data of all the risk factors and symptoms were found to be non-normally distributed, we used the two-sided Wilcoxon rank sum test (Gibbons and Chakraborti 2011), a non-parametric statistical technique, for the comparison of risk factors and symptoms between the two groups. A p-value < 0.05 was considered significant. The statistical analysis was conducted in MATLAB (Mathworks, Inc., Natick, MA 01760-2098, USA).
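To illustrate the group comparison described above, the following minimal Python sketch implements a two-sided rank sum test using the normal approximation with a correction for ties. It is not the MATLAB routine used in the study: the function names (midranks, rank_sum_test) and the example responses are hypothetical, and for small samples an exact test may give somewhat different p-values.

```python
from collections import Counter
from statistics import NormalDist


def midranks(values):
    """Rank values from 1 to N; tied values receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = average_rank
        start = end + 1
    return ranks


def rank_sum_test(x, y):
    """Two-sided rank sum test (normal approximation, tie-corrected variance).

    Assumes the pooled responses are not all identical (variance would be zero).
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = list(x) + list(y)
    ranks = midranks(pooled)
    w = sum(ranks[:n1])                      # rank sum of the first group
    mean_w = n1 * (n + 1) / 2                # expected rank sum under H0
    tie_term = sum(t**3 - t for t in Counter(pooled).values()) / (n * (n - 1))
    var_w = n1 * n2 / 12 * ((n + 1) - tie_term)
    z = (w - mean_w) / var_w**0.5
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p


# Hypothetical responses on the 0-3 scale for one item in two groups.
group_a = [3, 3, 2, 3, 2, 3]
group_b = [1, 0, 1, 2, 1, 0, 1]
z, p = rank_sum_test(group_a, group_b)
print(f"z = {z:.2f}, p = {p:.4f}")
```

Because the responses take only four values, ties are frequent, which is why the sketch uses midranks and a tie-corrected variance.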

Machine Learning Classification Analysis
We aimed to determine the risk factors and symptoms of burnout that truly differentiate teachers under statutes 1278 and 2277. We hypothesize that a more detailed analysis beyond the traditional statistically-significant differences analysis can provide useful information, to prevent the consequences of burnout. For this purpose, we used machine learning analysis with risk factors and symptoms of burnout as "features" (each one of the risk factors and symptoms) to classify teachers under statutes 1278 and 2277.
In machine learning, classification comprises two steps, "learning" and "predicting", also referred to as training and testing. In the learning step, the machine learning model is developed (trained) based on given training data. In the prediction step, the machine learning model is used to predict the response for the given testing data (Roman 2019). We tested several machine learning models for the classification of teachers under statutes 1278 and 2277, using only risk factors of burnout in one analysis and only symptoms of burnout in the other. To compensate for the imbalance of the classes (37 teachers of statute 1278 vs. 17 teachers of statute 2277), the prior probabilities of the classes were set uniform in the training process.
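One common way to realize uniform class priors is to weight each sample so that every class carries the same total weight. The sketch below shows that arithmetic for the class sizes reported above; the function name uniform_prior_weights is hypothetical, and whether this matches the toolbox's internal handling of priors exactly is an assumption.

```python
def uniform_prior_weights(labels):
    """Per-sample weights such that every class carries the same total weight."""
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    prior = 1.0 / len(counts)  # uniform prior: 1 / number of classes
    return [prior / counts[label] for label in labels]


# Class sizes from the study: 37 teachers under statute 1278, 17 under 2277.
labels = ["1278"] * 37 + ["2277"] * 17
weights = uniform_prior_weights(labels)
print(weights[0], weights[-1])  # weight of one 1278 sample, one 2277 sample
```

Each teacher in the smaller group receives a larger weight, so misclassifying a minority-class sample costs more during training than misclassifying a majority-class sample.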
In this study, we used decision trees (DT) (Friedman et al. 1976) and support vector machines (SVM) (Shawe-Taylor and Cristianini 2000) for the classification of teachers under statutes 2277 and 1278 using risk factors and symptoms of burnout. DT is one of the most popular classification algorithms, and one of the easiest to understand and interpret (Breiman 2017). For its part, SVM is one of the most widely used classification techniques, because it aims to minimize the number of misclassification errors directly. This is especially useful when the data is not linearly separable. The machine learning classification analysis was performed in MATLAB (Mathworks, Inc., Natick, MA 01760-2098, USA) and Python.

Decision Trees (DT)
DT builds the classification (or regression, for other applications) models using a tree structure. The goal of using a DT is to create a model that predicts the class to which a given sample belongs, by learning conditional decision rules inferred from previously known data, called training data. The deeper branches of the tree create more complex decision rules. Each branch splits the training data into smaller subsets (the deeper the tree, the smaller the subset). The resulting model is a tree with decision nodes and leaf nodes (outputs). A decision node can have two or more branches. A leaf node represents the point where the decision of the classification is made.
A DT requires several steps to be built. The first step is called splitting. In this step, the DT partitions the data into subsets based on the values of a particular feature, chosen so that the resulting subsets better separate the classes that the DT is trying to predict. The second step is called pruning, where the branches of the tree are shortened by turning some branch nodes into leaf nodes. This pruning process is important for the DT, because an unpruned DT tends to be very successful on the training dataset but performs poorly when classifying new values not included in training. This is called over-fitting, and a simpler, pruned tree usually avoids it better than a more complex one (Sehra 2018;Chauhan 2020).
In this study, we evaluated three DT configurations. The first model, named Single DT, was designed with a maximum number of 20 splits. The latter two are ensembles of classification DTs, one using bootstrap aggregation, termed Bag DT (30 learning cycles) (Breiman 1996), and the other using random undersampling boosting (30 learning cycles), termed RUSBoost DT (Seiffert et al. 2008). An ensemble essentially trains many "weak" models (low performance), in which the classification is conducted by getting votes from the ensemble models.

Support Vector Machines
The objective of the SVM algorithm is to find a hyperplane in an N-dimensional space (N is the number of features) that distinctly separates the classes. In a two-class problem, there are many possible hyperplanes that can be selected to separate the two classes. SVM specifically aims to determine which hyperplane has the maximum distance to the nearest samples of both classes (the margin). The idea behind SVM is that maximizing the margin distance allows future samples to be classified with more confidence. The support vectors are the samples closest to the hyperplane, and they influence the position and orientation of the selected separation hyperplane. Adding or removing a given support vector changes the hyperplane, and such a process is used to maximize the margin of the classifier. Once the margin is maximized, the included support vectors are used to build the SVM model. SVM uses kernel functions that have the capability of measuring similarity in higher dimensions. In many cases, there is no linear decision boundary that could perfectly separate the samples of the different classes. A nonlinear boundary (circular or quadratic, for instance) might be able to provide an optimal decision boundary that a linear classifier is unable to provide. Furthermore, the kernel functions do not increase the computational costs significantly. In this study, we tested linear (Linear SVM) and Gaussian kernel (Gaussian SVM) SVM models.
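The margin maximization described above is commonly written as the following optimization problem. This is the standard textbook hard-margin form, shown only as a sketch: the symbols (w, b, x_i, y_i, gamma) are notation introduced here rather than taken from the study, and practical implementations usually solve a soft-margin variant that tolerates some misclassified samples.

```latex
% Hard-margin SVM: labels y_i in {-1, +1}, feature vectors x_i, i = 1..n.
\[
\min_{\mathbf{w},\, b} \ \frac{1}{2}\lVert \mathbf{w} \rVert^{2}
\quad \text{subject to} \quad
y_i\left(\mathbf{w}^{\top}\mathbf{x}_i + b\right) \ge 1,
\qquad i = 1, \dots, n .
\]
% The margin width is 2 / ||w||, so minimizing ||w|| maximizes the margin.

% Gaussian kernel, with width parameter gamma > 0:
\[
k(\mathbf{x}, \mathbf{x}') = \exp\!\left(-\gamma \lVert \mathbf{x} - \mathbf{x}' \rVert^{2}\right).
\]
```

With a linear kernel the decision boundary is the hyperplane itself; with the Gaussian kernel the same problem is solved implicitly in a higher-dimensional space, which yields a nonlinear boundary in the original features.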

Performance Evaluation of Machine Learning Models
Overfitting occurs when the machine learning model fits the data too well, even capturing the noise of the data. Overfitting a model can result in good training accuracy (accuracy in the training dataset), however it typically results in poor accuracy on new data sets (data not used for training). Such a model is not able to predict outcomes for new cases, and for this reason it is not usable in the real world (Shaikh 2018). A key challenge with machine learning is that we cannot know how well our model will perform on new data, or how overfitted it is, until we actually test it. To address this, the dataset is split into separate training and test subsets. This can be done once, or several times for a given dataset. This process is called cross validation (James et al. 2017). A common cross validation procedure is to split the data into k subsets. For each fold, one group is used as a test dataset and the remaining groups are used as training datasets. A model is fitted on the training set and evaluated in the test set. The evaluation score (e.g., accuracy) for each fold is retained and the model is discarded. At the end of the process, the performance of the model is summarized using the average of the evaluation scores. The parameter k provides the name of the process (for k = 10 the process is called 10-fold cross-validation).
In clinical applications of machine learning, when more than one sample is obtained from the same subject, it is important to perform a cross-validation subject wise (k is a division of the number of subjects, instead of the number of samples) to evaluate the performance of the machine learning model (Saeb et al. 2017). In this study, we have used leave-one-subject-out cross-validation (Saeb et al. 2017) for the evaluation of the models. We chose the best machine learning models using cross-validation accuracy.
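The leave-one-out procedure can be sketched in a few lines of Python. The nearest-centroid classifier below is only a stand-in chosen to keep the example self-contained; it is not one of the DT or SVM models evaluated in the study, and the function names and toy data are hypothetical. Since each teacher contributes a single sample, leaving one subject out is the same as leaving one sample out.

```python
def nearest_centroid_fit(X, y):
    """Stand-in classifier: store the mean feature vector of each class."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def nearest_centroid_predict(model, row):
    """Predict the class whose centroid is closest (squared Euclidean distance)."""
    def sqdist(centroid):
        return sum((a - b) ** 2 for a, b in zip(row, centroid))
    return min(model, key=lambda label: sqdist(model[label]))


def leave_one_out_accuracy(X, y, fit, predict):
    """Hold out each sample once, train on the rest, and average the hits."""
    hits = 0
    for i in range(len(X)):
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        model = fit(X_train, y_train)   # a fresh model for every fold
        hits += predict(model, X[i]) == y[i]
    return hits / len(X)


# Toy data: two hypothetical item scores (0-3) per teacher.
X = [[3, 0], [3, 1], [2, 0], [0, 3], [1, 3], [0, 2]]
y = ["2277", "2277", "2277", "1278", "1278", "1278"]
accuracy = leave_one_out_accuracy(X, y, nearest_centroid_fit, nearest_centroid_predict)
print(accuracy)
```

Any classifier exposing a fit and a predict function could be passed in place of the stand-in, which is how several candidate models can be compared on the same folds.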

Machine Learning Interpretability
The need for interpretability arises from an incompleteness in problem formalization (Doshi-Velez and Kim 2017). In other words, for certain tasks it is not enough to get the prediction (the what); the model must also explain how it made the prediction (the why), because the original problem is only partially solved with a correct prediction. The need for scientific understanding of the phenomena, safety, and ethics, among other reasons, calls for machine learning interpretability techniques. In this study, once the models that achieved the best performance (i.e., highest cross-validation accuracy) were identified, we used the whole dataset to train a general machine learning model. The resulting models were analyzed using machine learning interpretability to learn about the importance of the risk factors and symptoms of burnout in the prediction. In this study, we used two techniques for machine learning interpretability: SHapley Additive exPlanation (SHAP) summary plots (Lundberg and Lee 2017), and predictor importance (PI).

SHAP Summary Plots
The SHAP summary plots allow one to break down a prediction model to extract the impact of each feature. They are based on SHAP values, which build on the Shapley values used in game theory to assess how much each player in a collaborative game has contributed to the success (Shapley 1952). The idea is to assess how much a feature value has contributed to the prediction compared to the average prediction. This is simple to assess for a linear regression model, as the effect of each feature is the weight of the feature times the feature value. However, complex models require a different analysis. Considering the prediction as a game, the actual prediction as a gain, and the features as players working together to receive the gain, the Shapley value is the average marginal contribution of a feature value across all possible coalitions (Molnar 2019).
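The definition of the Shapley value as an average marginal contribution across coalitions can be checked on a toy linear model with a brute-force computation. The sketch below is not the SHAP library's algorithm, which relies on approximations for larger models; the function names, weights, and feature values are hypothetical, and features outside a coalition are replaced by a baseline (here, assumed feature means).

```python
from itertools import combinations
from math import factorial


def shapley_values(value, n_players):
    """Exact Shapley values by enumerating every coalition (small n only)."""
    phi = [0.0] * n_players
    players = range(n_players)
    for i in players:
        others = [p for p in players if p != i]
        for size in range(len(others) + 1):
            # Probability weight of coalitions of this size joining before player i.
            weight = (factorial(size) * factorial(n_players - size - 1)
                      / factorial(n_players))
            for coalition in combinations(others, size):
                s = frozenset(coalition)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def make_linear_value(weights, x, baseline):
    """Gain of a coalition: prediction with only its features set to x,
    minus the prediction at the baseline (assumed average feature values)."""
    def predict(v):
        return sum(w * a for w, a in zip(weights, v))
    base_prediction = predict(baseline)

    def value(coalition):
        mixed = [x[j] if j in coalition else baseline[j] for j in range(len(x))]
        return predict(mixed) - base_prediction
    return value


weights = [2.0, -1.0, 0.5]   # hypothetical linear model coefficients
x = [3.0, 1.0, 2.0]          # sample to explain
baseline = [1.0, 2.0, 2.0]   # assumed feature means
phi = shapley_values(make_linear_value(weights, x, baseline), len(x))
print(phi)
```

For this linear model each Shapley value reduces to the weight times the deviation of the feature from its baseline, and the values sum to the difference between the prediction for the sample and the baseline prediction.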

Predictor importance in Decision Trees
There are several criteria for splitting data in the decision nodes. In this study, we used Gini's diversity index (Jost 2006). It gives the probability that two randomly chosen entities represent different types, and is computed by subtracting the sum of the squared probabilities of each class from one. To interpret the DT models, we computed the predictor importance (PI) for each feature, which represents an estimate of the importance of each of the features of a DT model (Estimates of Predictor Importance for Classification Ensemble of Decision Trees-MATLAB 2020). PI is computed by summing the changes in the risk due to splits on every feature and dividing the sum by the number of branch nodes.
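The two quantities described above can be sketched in Python as follows. This is a simplified illustration, not the MATLAB implementation: the function names and the example splits are hypothetical, the change in risk is taken here as the Gini decrease within a node, and any weighting by the fraction of samples reaching the node is omitted.

```python
from collections import Counter


def gini(labels):
    """Gini's diversity index: one minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_decrease(parent, left, right):
    """Impurity of the parent minus the size-weighted impurity of its children."""
    n = len(parent)
    children = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - children


def predictor_importance(splits, n_features):
    """Sum the decrease per feature and divide by the number of branch nodes.

    `splits` holds one (feature_index, decrease) pair per branch node and is
    assumed to be non-empty.
    """
    totals = [0.0] * n_features
    for feature, decrease in splits:
        totals[feature] += decrease
    n_branch_nodes = len(splits)
    return [total / n_branch_nodes for total in totals]


parent = ["1278", "1278", "2277", "2277"]
print(gini(parent))
print(gini_decrease(parent, ["1278", "1278"], ["2277", "2277"]))

# Hypothetical tree with three branch nodes splitting on features 0, 1 and 0.
print(predictor_importance([(0, 0.5), (1, 0.1), (0, 0.2)], n_features=3))
```

A feature that is never used in a split receives an importance of zero, which is why PI values are best read relative to one another.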

Results
First, we have included the results of the statistical analysis of the risk factors and symptoms of burnout syndrome, followed by the results of the performance evaluation of the different machine learning models using cross-validation. At the end of the section, we have included the results of the machine learning interpretation analysis.
Table 2 shows the statistical analysis of the risk factors and symptoms of burnout between the two groups (teachers under statutes 2277 and 1278). Data are represented as mean ± standard error. Obtained p-values from the Wilcoxon rank sum test are also shown. We found five risk factors to exhibit significant differences between groups, APPROPRIATE_COMMUNICATION, HEAD_SUPPORT, APPROPRIATE_SALARY, SANCTIONS_TO_FAILS, and OVERTIME. The first three were higher for statute 2277, and the latter two were lower for statute 2277. Only one symptom of burnout, HEADACHE, was significantly lower in statute 2277. None of the other symptoms exhibited significant differences between groups.
Table 3 includes the cross-validation accuracy of the DT and SVM models. For classification using risk factors only, Linear SVM obtained the best cross-validation accuracy (96%). In the case of using only symptoms of burnout, the RUSBoost DT model achieved the highest accuracy (82%). Overall, the classification task was more difficult (lower accuracy) using symptoms compared to risk factors.
The models with best performance (for risk factors and symptoms, respectively) are analyzed to extract information on the importance of the risk factors and symptoms. First, SHAP summary plot was used to analyze the importance of the risk factors in the Linear SVM model (Figure 1). SHAP values can be used to analyze how much a given feature changes the prediction model. The summary plot tells which features are more important than others, and their range of effects over the dataset.
Vertical location shows what feature it is depicting, and the features are sorted from top to bottom in order of importance. The color shows whether that feature was high or low for that row of the dataset, and the amount of samples. Horizontal location shows whether the effect of that value caused a higher or lower prediction. The more spread the SHAP summary plot is for a feature, the more important and relevant is the feature for the model.
By plotting the SHAP values for every sample of every feature, we provide a detailed look at which features are most important for the Linear SVM model. APPROPRIATE_SALARY is the most relevant feature, followed by OVERTIME, SANCTIONS_TO_FAILS, and HEAD_SUPPORT. Starting with the feature FULFILL_WORK_SCHEDULE, the remaining features have relatively low impact on the model.
For the RUSBoost DT model to classify teachers under statutes 1278 and 2277 using only symptoms of burnout, we used the predictor importance (PI) analysis technique, described in the Methods section. The PI values must be evaluated relatively (Table 4). In other words, we can conclude on what features are more important than others, but not as an absolute measure of importance. The feature with the highest PI was FATIGUE, followed by HEADACHE, COMMUNICATION_DIFFICULTY, and LOSS_OF_APPETITE. The remaining features exhibited lower relative importance compared with the aforementioned features, as they achieved lower PI values.

Discussion
For the first time, we have used artificial intelligence, specifically machine learning interpretation techniques, to analyze the importance of risk factors and symptoms of burnout that differentiate two groups. We found that the level of satisfaction with earned income was the most relevant risk factor, followed by overtime work and the perceived severity of sanctions on lower performance. As for the symptoms, we found that fatigue at the end of the day, and frequent headaches were the most relevant. This observation can enable institutional authorities and policy makers to allocate resources to specific issues related to burnout in this specific group of workers.
Using statistical analysis, we found a total of six features (risk factors and symptoms) to be significantly different between teachers under statutes 2277 and 1278. Five risk factors were the most important differentiators between teachers under statutes 2277 and 1278: the salary; the channels and spaces of communication with supervisors; the feeling of support from those supervisors; the set of sanctions established for low performance evaluations; and the overtime work. The occurrence of frequent headaches was the only symptom of burnout syndrome that was significantly different between the two groups. The statistical analysis of the features (risk factors and symptoms) provides an overall exploration of the differences, but is only able to analyze one feature at a time, and the nonlinear relationships between burnout and the features cannot be considered. The p-value of the statistical analysis is not an appropriate measure of importance. Although the level of satisfaction with the earned income (APPROPRIATE_SALARY) was the feature that exhibited the largest difference between groups, the p-values of the statistical analysis ranked communication issues (APPROPRIATE_COMMUNICATION) and the perception of support from supervisors (HEAD_SUPPORT) next. In contrast, the machine learning interpretation analysis identified overtime work and the perceived severity of the sanctions for low performance to be more important than the communication issues and lack of support from supervisors.
In the education field, some studies have shown a high vulnerability to burnout in teachers, due to the simultaneous realization of multiple activities and functions (Bambula et al. 2012). Teacher work regularly implies not only ensuring that students receive a quality education, but also other time-demanding and conflicting tasks, imposed not only by the work with students, but also by families, social networks and administration. In the same way, burnout syndrome is becoming a serious problem, not only for the teaching group, but also for the education system in general. This is due to its direct consequences on the quality of teaching and its corresponding increasing negative effect on attrition, labor rotation, absenteeism and the decrease in the productivity of education (Delgado 1995;Extremera et al. 2003;Redó 2009). León et al. (2008) conclude that the mental health of teachers is a current problem that must be considered and that may be associated with burnout syndrome.
Reviewing the research at the national and global levels, it is evident that various studies have demonstrated the presence of burnout syndrome in teachers and have analyzed its causal factors and degree of prevalence. However, work comparing the degree of prevalence of burnout in teachers governed by statute 1278 against the group of teachers governed by statute 2277 remains scarce.
Artificial intelligence has been widely used for some years, but it is only recently that researchers started to incorporate it in the psychological and behavioral sciences (DelPozo-Banos et al. 2018;Posada-Quintero and Bolkhovsky 2019;Arnold et al. 2012). For instance, various tools of artificial intelligence, including machine learning, have been used to analyze electronic health records (DelPozo-Banos et al. 2018). Other studies have used machine learning to improve the diagnosis of mood disorders (like depression or bipolar disorder) and suicidality (Pandey et al. 2012;Arnold et al. 2012;Ramasubbu et al. 2016). Yet, the potential of artificial intelligence and machine learning is immense for the analysis of data in the social sciences.
As for the limitations of the study, we have compared two groups of teachers under the assumption that differences between the groups are related only to the different statutes under which they were employed. However, the mean age of the group of teachers under statute 2277 is 18 years higher than the mean age of the group of teachers under statute 1278. This means they have been working longer and their salary is significantly higher. Although this is a potential limitation, we believe that to some extent the perception of the appropriateness of the earned salary is not only dependent on the amount of money earned, but also on the social benefits, working conditions, family size, and age, among other circumstances. In other words, although the groups expectedly have significantly different salaries, they could be equally satisfied with them if their salary levels were fair. Another potential limitation of the study is the use of teachers from a single school; as such, this is considered a case study and therefore cannot be generalized. However, this also allows one to partially avoid the differences in social and economic circumstances that would arise if the teachers belonged to different schools.

Conclusions
For the first time, we have used machine learning interpretation analyses to identify the most important risk factors and symptoms of burnout between two groups of Colombian teachers working under statutes 2277 and 1278. The analysis techniques allowed us to conclude that the level of satisfaction with earned income was the most relevant feature, followed by overtime work and the perceived severity of the sanctions for low performance. The most relevant symptoms of burnout syndrome were fatigue at the end of the day and frequent headaches. In particular, with the caveat that this was for a single group of teachers, our results suggest that strategies to level the salary between the two statutes should be implemented, and that the overtime work and the severity of the evaluation and sanctions should be reduced to diminish the risk of burnout in teachers under statute 1278. Furthermore, occupational health providers should encourage the teachers under statute 1278 to perform activities aimed at managing excessive fatigue and headaches. More broadly, we have developed a machine learning based analysis framework to identify the most relevant risk factors and symptoms of a given issue. The framework includes the cross-validation of machine learning models to identify the most accurate one, and the application of machine learning interpretability techniques to extract information on the relevancy of the features included in the model. This analysis framework enables the identification of relevant conditions that elicit different issues in a given group of workers. Providing information on the importance of factors and symptoms can enable institutional authorities and policy makers to allocate resources to alleviate burnout or other conditions by attending to those identified as most relevant.