Combining Machine Learning and Social Network Analysis to Reveal the Organizational Structures

Formation of a hierarchy within an organization is a natural way of optimizing duties, responsibilities and the flow of information. Only the smallest organizations can function without a hierarchy, and as they grow, its appearance becomes inevitable. Most often, its existence results in a different nature of the tasks and duties of members located at different organizational levels. On the other hand, employees often send dozens of emails each day, and by doing so, and by being engaged in other activities, they naturally form an informal social network where nodes are individuals and edges are the actions linking them. At first, such a social network may seem distinct from the organizational one. However, the analysis of this network may lead to reproducing the organizational hierarchy of companies. This is because people holding a similar position in the hierarchy may also share a similar way of behaving and communicating that is attributable to their role. The key concept of this work is to evaluate how well social network measures, when combined with other features obtained through feature engineering, align with the classification of the members of an organizational social network. Machine learning was employed to answer this research question. For the classification task, Decision Tree and Random Forest algorithms were used, as well as a simple collective classification algorithm, which is also proposed in this paper. This approach made it possible to compare how traditional machine learning classification methods, supported by social network analysis, performed against a typical graph algorithm.


Introduction
People around the world send hundreds of emails to exchange information within organizations. As an implicit result, each of these interactions forms a link in a social network. This network can be a valuable source of knowledge about human behavior; moreover, its analysis can reveal groups of employees with similar communication patterns. These groups usually coincide with different levels of the organization's hierarchy, and employees who work in the same position generally have a comparable scope of duties. Organizations commonly maintain some kind of hierarchy because a formally organized structure helps with the management of employees and with gaining an advantage in the market. Therefore, the analysis of a network created from a set of emails could retrieve valuable data about internal corporate processes and recreate the organizational structure. Combining network measures with additional features extracted from messages for classification tasks appears to be an interesting and promising idea. Social network analysis (SNA) has the potential to boost machine learning algorithms in the field of organizational structure detection, because capturing relations between data points seems to be very important in this kind of dataset.
The reverse engineering of an organization's corporate structure can be perceived in two ways. On the one hand, if successful, it could reveal the company structure from metadata alone, which poses a risk when the structure is intentionally kept secret, e.g., to keep a competitive edge or to protect employees from being poached by other companies. On the other hand, it could help reconstruct the structure of malicious organizations when only partial information about them is available.
Several works in the literature describe the detection of organizational structures; however, most of them use the Enron dataset or focus on a network approach and omit standard (supervised) classification algorithms. It should be noted that each organization is managed in a slightly different way, which means that communication patterns could differ among them. These differences imply that some solutions may give better or worse results depending on the network's specificity, so studies on organizational hierarchy should not be limited to a single dataset.
The authors of [1] and [2] introduced the concept of matching a formal organizational structure with a social network created from email communication. Experiments were carried out on messages coming from a manufacturing company located in Poland as well as on the well-known Enron dataset. The results showed that in both cases, some network metrics were able to reveal the organizational hierarchy better than others. This work also touched on the problem that a formal structure may sometimes differ significantly from daily reality.
The idea of combining network metrics and other features extracted from an email dataset to reveal corporate hierarchy was introduced in [3]. The authors presented their own metric named "social score", which defines the importance of each employee in the network. This metric is defined as a weighted average of all features and is used in a grouping algorithm. The grouping method is a simple straight scale level division algorithm which assigns employees to defined intervals according to their social score.
A study on the use of network measures as input features for classification algorithms was presented in [4]. The basic concept of this work focused on retrieving company hierarchy based on the network created from social media accounts of the employees. The authors showed that centrality measures and clustering coefficients, in combination with other features extracted from social media, can detect leaders in a corporate structure. However, this research used individual features of a person, such as gender, hometown or number of friends, instead of features obtained from job activities and interactions among employees. Other articles describing the combination of SNA and standard classification methods are [5] and [6]. Both use the Enron dataset and features based on the number of sent and received messages. The use of some network metrics as input for classification and clustering algorithms was described in [7]; furthermore, the results were compared to a novel measure called Human Rank (an improvement of PageRank).
Many solutions in this problem area also concentrate mainly on classification from a graph perspective, for instance, identifying key players in a social network based on entropy [8], applying graphical models [9], or using factor graph models [10].
The following work focuses on organizational structure detection based on nine months of email communication between employees of a manufacturing company located in Poland, as well as on the Enron dataset. The research used decision tree and random forest algorithms for classification; moreover, the influence of minimum employee activity was examined. The obtained results were compared with a simple graph algorithm of collective classification, also proposed in this paper. The weakness of this approach is that the independent and identically distributed (IID) condition is difficult to meet, because the network measures were calculated once, before splitting the data into training and test sets. In social network analysis, full satisfaction of the IID condition is hard to achieve: if independent networks were built for the training and test data, the network measures would be totally different and the importance of each node could be biased. Nevertheless, network measures can be valuable features for machine learning algorithms because they capture connections between data points. The results showed that the combination of a classification algorithm and social network analysis can reveal organizational structures; however, small changes in the network can change the efficiency of the algorithms. Furthermore, a graph approach such as collective classification is able to classify well even with limited knowledge about node labels.

Materials and Methods
In this section, the proposed solution and the datasets used are described in detail. The solution was implemented in Python, using the NetworkX library for social network creation and scikit-learn for all machine learning tasks.

Manufacturing company
The analyzed dataset contains a nine-month exchange of messages among clerical employees of a manufacturing company located in Poland [2]. The dataset consists of two files: the first contains the company hierarchy and the second stores the communication history. The analyzed company has a three-level hierarchy: the first management level, the second management level and standard employees. The file with emails consists of senders and recipients, as well as the date and time of sent messages. Emails from former employees and technical accounts are also included in this file. Due to the lack of data about their supervisors, former employees and technical accounts were removed from further research. The final organizational structure to be analyzed is shown in Figure 1 and in Table 1. It is important to note that the dataset does not contain any correspondence with anyone outside the company; moreover, the company structure was stable throughout the period considered and did not undergo any changes.

Enron
The second analyzed dataset comes from the Enron company [9]. Enron was a large American energy company founded in 1985 that became famous at the end of 2001 due to financial fraud. During the investigation, the dataset was made public; however, the organizational hierarchy has never been officially confirmed. Despite this limitation, the Enron email corpus has become the subject of many studies, which made it possible to partially reconstruct the company's structure. The authors of this paper decided to use a processed version of this dataset which already includes positions assigned to the employees. There is a seven-level hierarchy in this dataset; however, to reduce the complexity of this structure the authors proposed a more generic three-level hierarchy, shown in Table 2, the same as in the manufacturing company dataset. This approach allowed for a better distinction of managerial and executive positions from regular employees. The analyzed period contains messages from over 3 years, and due to limited knowledge about internal company processes the authors assumed that the organizational structure was stable during that period.

Network
The network was built from the email exchanges of the organization's members, where the nodes were employees and the edges were the messages. A directed graph was used, with edge weights defined according to the following formula:

$$w_{ij} = \frac{e_{ij}}{e_i}$$

where $e_{ij}$ is the sum of messages sent from node $i$ to node $j$ and $e_i$ is the total sum of messages sent from node $i$. All self-loops were removed.
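The weighting scheme can be illustrated with a minimal sketch in plain Python. It assumes that the weight of an edge is the ratio of $e_{ij}$ to $e_i$ implied by the definitions above, and that the per-sender total $e_i$ is counted after self-loops are dropped; the function name `build_weighted_edges` and the toy message log are hypothetical.

```python
from collections import Counter

def build_weighted_edges(messages):
    """messages: iterable of (sender, recipient) pairs, one pair per email.

    Returns {(i, j): w_ij} where w_ij = e_ij / e_i.
    Assumption: self-loops are removed before the totals e_i are computed.
    """
    # e_ij: number of messages sent from i to j (self-loops skipped)
    pair_counts = Counter((s, r) for s, r in messages if s != r)
    # e_i: total number of messages sent by i
    sent_totals = Counter()
    for (sender, _), count in pair_counts.items():
        sent_totals[sender] += count
    return {(s, r): count / sent_totals[s]
            for (s, r), count in pair_counts.items()}

# Hypothetical toy log: ann writes twice to bob and once to eve.
messages = [("ann", "bob"), ("ann", "bob"), ("ann", "eve"),
            ("bob", "ann"), ("eve", "eve")]
weights = build_weighted_edges(messages)
```

With this normalization the outgoing weights of every sender sum to 1. The resulting dictionary could be loaded into a NetworkX `DiGraph` as weighted edges.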

Features
From the created social network, the following centrality measures were calculated as input features for the classification algorithms:
• indegree centrality
• outdegree centrality
• betweenness
• closeness
• eigenvector
• PageRank
• hubs
• authorities
Moreover, a local clustering coefficient was calculated for each node, which captures the density of connections between neighbors, as well as two additional features related to cliques. A clique is defined as a fully connected subgraph, which means that each node has directed links to all other nodes in the clique. The first feature is the total number of cliques to which an employee belongs, and the second is the size of the biggest clique containing the specific node.
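The two clique features can be sketched as follows. This is an illustration under stated assumptions, not the implementation used in the paper: it assumes that a clique requires reciprocated links between every pair of its members, that only maximal cliques are counted, and it uses a basic Bron-Kerbosch enumeration in plain Python. The function name `clique_features` is hypothetical; NetworkX provides `find_cliques` for undirected graphs, which could serve the same purpose.

```python
def clique_features(edges):
    """edges: iterable of directed (i, j) pairs.

    Returns {node: (number_of_maximal_cliques, size_of_largest_clique)}.
    Assumption: two nodes are adjacent only if links exist in both directions.
    """
    edge_set = {(i, j) for i, j in edges if i != j}
    nodes = {n for edge in edge_set for n in edge}
    adj = {n: set() for n in nodes}
    for i, j in edge_set:
        if (j, i) in edge_set:  # keep reciprocated links only
            adj[i].add(j)
            adj[j].add(i)

    cliques = []

    def expand(current, candidates, excluded):
        # Bron-Kerbosch without pivoting: report a clique when it cannot grow
        if not candidates and not excluded:
            cliques.append(current)
            return
        for v in list(candidates):
            expand(current | {v}, candidates & adj[v], excluded & adj[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(nodes), set())

    features = {n: [0, 0] for n in nodes}
    for clique in cliques:
        for n in clique:
            features[n][0] += 1
            features[n][1] = max(features[n][1], len(clique))
    return {n: tuple(values) for n, values in features.items()}

# Hypothetical toy graph: a, b, c write to each other; c and d too; d -> e is one-way.
edges = [("a", "b"), ("b", "a"), ("b", "c"), ("c", "b"), ("a", "c"), ("c", "a"),
         ("c", "d"), ("d", "c"), ("d", "e")]
features = clique_features(edges)
```

In this sketch a node without any reciprocated link forms a clique of size one on its own, which mirrors how maximal-clique enumeration treats isolated nodes.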
The next features were based on neighborhood variability, which is determined in three ways: sent neighborhood variability, received neighborhood variability and general neighborhood variability. Overall, neighborhood variability is defined as the difference between the sets of neighbors with which a specific node communicates in the previous and the next month. Sent neighborhood variability considers the set of neighbors to which the given node sent messages. Received neighborhood variability looks at the set of neighbors from which the given node received messages. General neighborhood variability uses the set of neighbors with which the node communicates, without distinguishing between sending and receiving messages. The Jaccard coefficient was used to compare the sets, so the coefficient takes values between 0 and 1, where 0 means totally different sets and 1 means identical sets. The Jaccard coefficient was calculated for each pair of consecutive months; moreover, if the employee was not active in the directly following month, the nearest next active month was considered. For example, if an employee was active in January, inactive in February and active again in March, the coefficient was calculated between the sets of neighbors in January and March. Finally, the neighborhood variability of each node was calculated as the average of these partial Jaccard coefficients.
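The calculation described above can be sketched in a few lines of Python. The sketch assumes that an inactive month is represented by an empty neighbor set and that a node with fewer than two active months receives no value; both conventions, as well as the function names, are illustrative assumptions.

```python
def jaccard(a, b):
    """Jaccard coefficient of two sets: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def neighborhood_variability(monthly_neighbors):
    """monthly_neighbors: list of neighbor sets ordered by month.

    Assumption: an empty set marks a month without activity, which is
    skipped so that each active month is compared with the next active one.
    Returns the mean Jaccard coefficient, or None with under two active months.
    """
    active = [neighbors for neighbors in monthly_neighbors if neighbors]
    if len(active) < 2:
        return None
    scores = [jaccard(prev, nxt) for prev, nxt in zip(active, active[1:])]
    return sum(scores) / len(scores)

# Hypothetical employee: active in January, March and April, silent in February.
months = [{"bob", "cid"}, set(), {"bob", "dan"}, {"bob", "dan"}]
variability = neighborhood_variability(months)
```

Here January is compared with March (coefficient 1/3) and March with April (coefficient 1), so the averaged value is 2/3. The same function can be applied to the sent, received and general neighbor sets to obtain the three variants.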
Furthermore, features such as the number of weekends worked and the amount of overtime taken were used. Work between 4:00 PM and 6:00 AM was considered overtime. It should be mentioned that overtime was only calculated for the manufacturing company dataset. Due to limited knowledge about the Enron dataset, it was impossible to know whether different time zones should be considered, because the dates were given in the POSIX format and Enron had branches located in different time zones.
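A possible way of deriving these two features from message timestamps is sketched below. The text does not specify the exact counting rules, so the sketch assumes that a weekend counts as worked if at least one message was sent on its Saturday or Sunday, and that overtime is measured as the number of messages sent between 4:00 PM and 6:00 AM; the function name and the sample timestamps are hypothetical.

```python
from datetime import datetime

def work_pattern_features(timestamps, overtime_start=16, overtime_end=6):
    """timestamps: iterable of datetime objects for one employee's sent messages.

    Returns (weekends_worked, overtime_messages).
    Assumptions: a weekend is identified by its ISO (year, week) pair, and
    overtime is any message sent at or after 16:00 or before 06:00.
    """
    weekends = set()
    overtime_messages = 0
    for ts in timestamps:
        if ts.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
            weekends.add(tuple(ts.isocalendar())[:2])
        if ts.hour >= overtime_start or ts.hour < overtime_end:
            overtime_messages += 1
    return len(weekends), overtime_messages

# Hypothetical timestamps in local time
sent = [
    datetime(2020, 3, 7, 10, 0),   # Saturday
    datetime(2020, 3, 8, 11, 0),   # Sunday of the same weekend
    datetime(2020, 3, 9, 17, 30),  # Monday evening
    datetime(2020, 3, 10, 9, 0),   # Tuesday morning
    datetime(2020, 3, 14, 5, 0),   # Saturday before 06:00
]
weekends_worked, overtime_messages = work_pattern_features(sent)
```

The sketch works on naive local timestamps, which is why it would not be directly applicable when the time zone of each sender is unknown.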

Classification
The classification task was carried out using the decision tree and random forest algorithms for different settings of the following experiment parameters: the number of recognized employee groups, the minimum number of active months and the percentage of features used.
The first parameter refers to the previously mentioned three-level hierarchy of employees, which can also be flattened to only two levels: the management level and standard employees. The experiment was run with both values of this parameter to see how the performance of the algorithms varies when recognizing two and three groups of employees.
The second parameter was examined to see how a person's activity influences the classification result, and in particular whether a higher minimum number of active months corresponds with better results. The assumption is that some patterns of behavior require more time to be revealed, so the classification was run five times, starting with a minimum activity of one month and ending with a minimum activity of five months. For each value, the network had to be recreated and the features calculated again, as some nodes were eliminated from the network.
The third parameter examines the impact of eliminating the most significant features. For this parameter, the experiment was carried out nine times, starting from all features and ending with only ten percent of the features, in steps of ten percent. The importance of each feature, for both classification algorithms, was determined by the Gini index; features with the highest Gini index were eliminated.
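The elimination step can be sketched as a simple ranking operation. The sketch assumes that importance values are already available as a dictionary (in scikit-learn, fitted tree-based classifiers expose them through the `feature_importances_` attribute) and that the importance ranking is computed once rather than after every removal; the function name and the importance values are hypothetical.

```python
def drop_most_important(importances, keep_fraction):
    """importances: {feature_name: importance}; keep_fraction in (0, 1].

    Returns the names of the retained features, i.e. the least important
    ones, after the most important features have been eliminated.
    Assumption: at least one feature is always kept.
    """
    n_keep = max(1, round(len(importances) * keep_fraction))
    ranked = sorted(importances, key=importances.get)  # ascending importance
    return ranked[:n_keep]

# Hypothetical importance values for five features
importances = {"pagerank": 0.40, "betweenness": 0.30, "indegree": 0.15,
               "closeness": 0.10, "hubs": 0.05}
kept = drop_most_important(importances, keep_fraction=0.6)
```

With 60% of the features retained, the two most important features are removed and the three least important ones remain.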
In the analyzed manufacturing company dataset, there was a problem with unbalanced class sizes, which is common for a company structure where the group of standard employees is the most numerous and the management level has fewer members. The Enron dataset, however, is much more balanced across groups, which may indicate a different management model in this company. To handle this problem, oversampling was used: the SMOTE algorithm matched the size of all minority classes to the size of the majority class of standard employees.
For each combination of the above parameters, 5-fold cross-validation was performed together with oversampling. To prevent data leakage, oversampling was performed only on the training set. Furthermore, a grid search was used to tune hyperparameters such as the maximal tree depth, the number of features considered at each split and the number of estimators in the random forest.
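The leakage-safe ordering of the two steps, splitting first and oversampling afterwards, can be illustrated with a dependency-free sketch. Note that it deliberately substitutes plain random duplication for SMOTE and uses a non-stratified split, so it shows only the order of operations, not the resampling technique used in the experiments; all function names and the toy data are hypothetical.

```python
import random
from collections import defaultdict

def oversample(samples, labels, rng):
    """Random oversampling (a stand-in for SMOTE): duplicate minority
    samples until every class matches the size of the largest class."""
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)
    target = max(len(xs) for xs in by_class.values())
    out_x, out_y = [], []
    for y, xs in by_class.items():
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        out_x.extend(xs + extra)
        out_y.extend([y] * target)
    return out_x, out_y

def folds_with_oversampling(samples, labels, k=5, seed=0):
    """Yield (train_x, train_y, test_x, test_y) for k folds.
    Only the training part is oversampled; the test part stays untouched."""
    rng = random.Random(seed)
    indices = list(range(len(samples)))
    rng.shuffle(indices)
    for fold in range(k):
        test_idx = indices[fold::k]
        held_out = set(test_idx)
        train_idx = [i for i in indices if i not in held_out]
        train_x, train_y = oversample([samples[i] for i in train_idx],
                                      [labels[i] for i in train_idx], rng)
        yield (train_x, train_y,
               [samples[i] for i in test_idx], [labels[i] for i in test_idx])

# Hypothetical data: 4 managers and 16 standard employees, identified by index.
samples = list(range(20))
labels = ["manager"] * 4 + ["employee"] * 16
folds = list(folds_with_oversampling(samples, labels))
```

Because the duplicates are drawn from the training indices only, no test sample can appear in the training set of its own fold.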

Collective Classification
Collective classification is a different way of revealing company hierarchy, from a graph perspective. This approach uses the connections between nodes to propagate labels within the whole network. Loopy belief propagation is an example of collective classification, described in detail in [11] and [12]. In this paper, a simplified version of this algorithm is introduced for comparison with standard classification algorithms such as decision tree and random forest.
The first step of this algorithm is choosing a utility score and sorting all nodes according to it. The utility score can be one of the features calculated in the previous section. The next step is to reveal the labels for the given percentage of nodes of each class $l_i \in L$ with the highest utility score. These nodes are marked as known, $V_k$, and their labels are constant, whereas the other nodes are treated as unlabeled. The propagation of labels then runs in a loop until the stop condition is met or the number of iterations exceeds max_iter. In one iteration, each labeled node sends a message to all of its neighbors $N_{v_i}$, treating edges as undirected; all labels received in a given iteration are saved for each node. The label update begins after all nodes have sent a message to their neighbors, so the sending order does not affect the result. If a node has received one label more often than the others, this label is assigned to it; otherwise, the unchanged counter of this node is increased. If the unchanged counter exceeds max_unchanged, it is reset and the node is assigned the label of its neighbor with the highest utility score. At the end of each iteration the stop condition is checked. It is based on the difference between the sets of previous and current labels: if the Jaccard coefficient is greater than min_jaccard and all nodes have an assigned label, the algorithm ends. Additionally, in the case of unbalanced classes, the algorithm allows a threshold to be defined. During the phase of counting how many times each label was received by a node, the result for the majority class is divided by this threshold to prevent domination of this class. The whole algorithm is shown in Listing 1.
The collective classification algorithm was run with three parameters: the number of recognized employee groups, the minimum number of active months and the percentage of known nodes. The first two are identical to the parameters from the previous section, while the last one determines the percentage of known (labeled) nodes. Nine values of this parameter were used, from 90% to 10% in steps of ten percent. The manufacturing company dataset required setting the threshold, in contrast to the Enron dataset, where it was not necessary. Additionally, to find the best utility score, the experiment was carried out for all calculated features, and the best Jaccard value and threshold were chosen from a range of values.
Listing 1: Collective classification algorithm
// sort all nodes descending by utility score
V = sort(nodes, utility_score)

assign given percentage of top v_i in V to V_k for each label
...
    remove label from v_i
end for

// classification
for i in range(0, max_iter)
    // start label propagation
    ...
    if jaccard(previous_labels, current_labels) > min_jaccard
            and all_nodes_has_label(V)
        break
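The procedure described in this section can be sketched in Python as follows. This is one possible reading of the textual description, not the authors' code: it assumes that the label sets are compared as sets of (node, label) pairs, that at least one node per class is revealed, and that an unlabeled node without any labeled neighbor simply waits for a later iteration. The function signature, the default parameter values and the toy network are hypothetical.

```python
from collections import Counter

def collective_classification(adj, utility, true_labels, known_fraction,
                              max_iter=50, max_unchanged=2, min_jaccard=0.99,
                              majority_label=None, threshold=1.0):
    """adj: {node: set of neighbors}, edges treated as undirected."""
    ranked = sorted(adj, key=utility.get, reverse=True)
    known = set()
    for label in set(true_labels.values()):
        members = [v for v in ranked if true_labels[v] == label]
        known.update(members[:max(1, round(len(members) * known_fraction))])
    labels = {v: true_labels[v] for v in known}  # labels of V_k stay constant
    unchanged = Counter()
    for _ in range(max_iter):
        previous = dict(labels)
        # every labeled node sends its label to all of its neighbors
        inbox = {v: Counter() for v in adj}
        for v, label in previous.items():
            for u in adj[v]:
                inbox[u][label] += 1
        # labels are updated only after all messages have been sent
        for v in adj:
            if v in known:
                continue
            counts = {lab: (c / threshold if lab == majority_label else c)
                      for lab, c in inbox[v].items()}
            top = sorted(counts.values(), reverse=True)
            if top and (len(top) == 1 or top[0] > top[1]):
                labels[v] = max(counts, key=counts.get)
                continue
            unchanged[v] += 1
            if unchanged[v] > max_unchanged:
                unchanged[v] = 0
                labeled = [u for u in adj[v] if u in previous]
                if labeled:  # copy the label of the highest-utility neighbor
                    labels[v] = previous[max(labeled, key=utility.get)]
        before, after = set(previous.items()), set(labels.items())
        union = before | after
        similarity = len(before & after) / len(union) if union else 1.0
        if similarity > min_jaccard and len(labels) == len(adj):
            break
    return labels

# Hypothetical toy network: three managers, four employees, one bridge m3-e1.
edges = [("m1", "m2"), ("m1", "m3"), ("m2", "m3"), ("m3", "e1"), ("e1", "e2"),
         ("e1", "e3"), ("e1", "e4"), ("e2", "e3"), ("e2", "e4"), ("e3", "e4")]
adj = {}
for a, b in edges:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)
utility = {"m1": 0.9, "m2": 0.7, "m3": 0.6, "e1": 0.5, "e2": 0.4, "e3": 0.3, "e4": 0.2}
truth = {v: ("manager" if v.startswith("m") else "employee") for v in adj}
predicted = collective_classification(adj, utility, truth, known_fraction=0.4)
```

In the toy network only m1, e1 and e2 are revealed. Node m3 first receives a tie between the two labels and is resolved in the second iteration, once m2 has been labeled, after which the labels stop changing and the loop ends.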

Results
The problem addressed here can be considered a binary classification for two groups of employees and a multiclass classification for three groups. Therefore, the macro-averaged F-score was used to evaluate the solution, since a single metric was needed to compare both results. This measure can handle both cases; moreover, as noted in [13], it copes well with the problem of unbalanced classes. The biggest advantage of this measure is the equal treatment of all classes, which means that the result is not dominated by the majority class.
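The equal treatment of classes can be seen in a short sketch of the macro-averaged F-score, written here in plain Python for illustration (scikit-learn offers the same measure through `f1_score` with `average="macro"`). The toy labels are hypothetical, and a class with no true or predicted members is assumed to score 0.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        denominator = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)

# Hypothetical unbalanced case: a classifier that always predicts "employee".
y_true = ["employee"] * 8 + ["manager"] * 2
y_pred = ["employee"] * 10
score = macro_f1(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

In this example the accuracy is 0.8, while the macro-averaged F-score is only 4/9, because the completely missed management class contributes a score of 0 with the same weight as the majority class.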
The results for the manufacturing company dataset are shown in Figures 2 and 3; in both cases the presented solution is better than the random baseline. The best score is obtained when classifying two groups, which can be explained by the unbalanced classes. The classification of three groups gave worse results because the small number of samples was insufficient to distinguish between the two levels of management, even when oversampling was used. Furthermore, random forest achieved slightly better results, especially for two groups of employees. A strange phenomenon can be observed: removing the most important features occasionally leads to a better result, which suggests that there could be some noise among the features that affects the decision boundary. A potential explanation of this phenomenon might be related to the problem described in [1]: the hierarchy that arises from daily duties may not match the company structure on paper. This inconsistency could be the source of some noise in the features used, which influences the obtained result; therefore, changing the network structure by eliminating some nodes, as well as removing the most important features, could move the decision boundary. Moreover, the minimum employee activity parameter also has an impact on the classification, but in the majority of cases it is small, especially for 100% of features. The best results for two and three groups of employees were obtained by the collective classification algorithm, which was able to classify nodes even when more than half of the labels were unknown. A graph approach like collective classification appears to be a somewhat more stable solution and more resistant to small changes in the network structure.
Figures 4 and 5 present the results for the Enron dataset, which are similar to the results for the previous dataset. The best score was achieved by the collective classification algorithm for two groups of employees. However, in this case it is visible that excessive reduction of features or known nodes leads to results close to random. Furthermore, the similarity of the results is important because it shows that the presented solution works well for various organizational management models. In the manufacturing company, the majority of the employees are regular clerical workers, in contrast to the small management group. In the Enron dataset the situation is the opposite: the ratio between the first and second management levels and standard employees is balanced.

Discussion
The comparison of the results for decision tree, random forest and collective classification shows that, for the analyzed datasets, the last method is the most efficient, even if a low percentage of nodes is known. However, standard classification algorithms can also retrieve organizational structure, as was shown in this paper and in the related works. As mentioned in the previous section, the decision tree and random forest algorithms were more sensitive to small changes in the network, and this phenomenon could coincide with some network characteristics. Furthermore, it was shown that the minimum length of the period in which an employee was active influences the result; however, the ideal value of this parameter depends on the analyzed dataset.
A network created from email communication may vary from organization to organization, especially with regard to the size of the company and its management model. For example, a big international company with many employees and a complicated hierarchy could create a social network with totally different properties than a small startup with a handful of employees and a simple structure. Moreover, in some companies email could be only one of many ways of passing messages, so a dataset of emails does not necessarily contain full information about the connections between employees in the company. Because of these differences, some patterns of behavior associated with a specific level of hierarchy may not appear in the constructed social network.
Future work should focus on the study of communication coming from different types of companies; moreover, further research should discover which organizational structures provide the best classification results. Better results could be a consequence of some hidden graph properties corresponding to the way in which an organization is managed, so future studies should also focus on examining the network structure and revealing its characteristics. The biggest problem may be obtaining data for research, because the internal communication of a company is confidential and has to be anonymized before being shared.