Email Based Institutional Network Analysis: Applications and Risks

Abstract: Social Network Analysis can be applied to describe the patterns of communication within an organisation. We explore how extending standard methods, by accounting for the direction and volume of emails, can reveal information regarding the roles of individual members. We propose an approach that models certain operational aspects of the organisation, based on directional and weighted indicators. The approach is transferable to other types of social networks with asymmetrical connections among their members. However, its applicability is limited by privacy concerns, the existence of multiple alternative communication channels that evolve over time, the difficulty of establishing clear links between organisational structure and efficiency and, most importantly, the challenge of setting up a system that measures the impact of communication behaviour without influencing the communication behaviour itself.


Introduction
Our world is being transformed into a Digital Society at a fast pace, with an increasing number of human interactions and communications leaving a digital trail. The volume of accessible data on social activity at both personal and professional levels, combined with the explosive growth of data processing technologies, is often seen as an opportunity for the development of applications that measure several aspects of human behaviour. In a management context, the potential of utilizing such data brings promises of improved monitoring, performance measurement and control of organizations, with the objective of better understanding the dynamics of an organization and improving its efficiency.
The research question we address here is how social network analysis based on email traffic can be applied for Human Resources (HR) management. The main hypothesis is that the (email) communication patterns of each member of an organization can reveal information about their role and performance which, consequently, can be used for a variety of management actions, such as promotions, training and internal mobility. We discuss three main dimensions of the research question:
• Suitable indicators that take into account the direction, intensity and frequency of information flow among the organisation's members.
• How such indicators can be connected with human resources and the overall organisation efficiency measurement.
• What are the possible areas of application and the trade-offs when applying such an approach?
The article is structured as follows: Section 2 provides an overview of the literature on Social Network Analysis (SNA) and organizational dynamics, focusing on the use of e-mails and related

Description of the Dataset
The dataset we use here is the Stanford Network Analysis Project (SNAP) "email-Eu-core-temporal" network, a well-known reference for Social Network Analysis (SNA) of email traffic (Yin et al. 2017;Leskovec et al. 2007). The dataset was constructed using real email traffic data from a large European research institute. Anonymized information about all incoming and outgoing emails in the research institute was collected over 18 months. The information retained consists of the (anonymized) sender, the (anonymized) receiver and the email timestamp. To convert the set of email messages into a network, each email address is considered a node. A directed edge between nodes i and j is created if i sent at least one message to j. SNAP also provides an additional dataset, "email-Eu-core-department-labels", which associates each individual e-mail address to one of the 42 departments of the organization. The resulting network consists of 986 nodes (unique email addresses). Since 21 email addresses had only outgoing messages within the organization, and 162 e-mail addresses had only incoming messages from within the organization, there are 824 transmitting nodes and 965 receiving nodes. Membership in a department ranges from 1 to 109, with a mean of 23.93 members and a median of 14.5.
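The conversion from email records to a directed, weighted network described above can be sketched as follows. This is a minimal illustration on hypothetical toy records, assuming each record is a (sender, receiver, timestamp) triple; the names `emails` and `build_network` are illustrative and not part of the SNAP distribution.

```python
from collections import Counter

# Hypothetical toy records in the (sender, receiver, timestamp) layout.
emails = [
    (0, 1, 10),
    (0, 1, 25),
    (1, 0, 40),
    (2, 0, 55),
    (0, 3, 70),
]


def build_network(records):
    """Return edge weights: weights[(i, j)] = number of emails sent from i to j.

    A directed edge (i, j) exists as soon as i sent at least one message to j.
    """
    weights = Counter()
    for sender, receiver, _timestamp in records:
        weights[(sender, receiver)] += 1
    return weights


weights = build_network(emails)
nodes = {n for edge in weights for n in edge}
transmitting = {i for i, _ in weights}  # nodes with at least one outgoing edge
receiving = {j for _, j in weights}     # nodes with at least one incoming edge
```

On the real dataset, the sizes of `transmitting` and `receiving` would correspond to the counts of transmitting and receiving nodes reported above.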
The number of e-mails sent by each individual is highly correlated with the number of e-mails received (Pearson correlation = 0.747). The correlation between the number of sent and received e-mails is even higher when summarized at department level (Pearson correlation = 0.967). E-mail activity appears to be an effect of the individual's role within the department and the organisation at large, rather than an attribute associated with the role that each department has inside the organisation. The number of e-mails sent by a department's members to members of other departments is proportional to the number of e-mails received from other departments (Figure 1 and Table 1). In addition, even though there is a significant variance in the number of e-mails sent or received by each individual (Tables 2 and 3), the aggregate figures at department level are, to a large extent, proportional to the number of individuals in each department, and the e-mail flow between departments is, to a large extent, symmetrical.
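The Pearson correlation between sent and received e-mail counts can be computed as in the sketch below. The `sent` and `received` vectors are hypothetical toy values, not figures from the dataset; on the real data they would hold one entry per individual (or per department).

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical counts of e-mails sent and received by four individuals.
sent = [5, 3, 8, 1]
received = [4, 2, 9, 1]
r = pearson(sent, received)
```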


Modelling Based on Social Network Analysis Indicators
There are a large number of indicators that can be applied in order to measure the various aspects of networks in an organisational context, at either an individual or institutional level. As discussed in the preceding sections, such indicators are often used in SNA, but their application is usually limited to simple rankings of individual actors in the network. What is largely missing is an identification of causality between these SNA indicators, and an objective measure of the performance of the individuals, or the network as a whole. The reason for this is usually a lack of data on system variables that are independent of the network structure itself. For example, data may be available on the follower structure of a Twitter account network, but it is difficult to associate them with data that reflect the real importance of a specific Twitter account. In practice, such analyses would be limited to calculating the total number of followers of an account which, by itself, does not (or should not) constitute a measure of the importance of the account. Moreover, most SNA indicators would be highly correlated with number of followers, creating an additional bias.
We explore here the possibility of using SNA on the internal email communication patterns of an organisation in order to explain certain operational characteristics. In order to do so, we chose a set of independent variables that can be extracted from the available data and constructed a model that predicts two indicators of performance.
On the side of the independent variables, we extract SNA indicators at two levels. At a network level, we use the well-known graph theory concept of closeness centrality, an indicator that reflects how close each individual is to the "centre" of the organisation. In order to capture differences between the role of each individual as a sender or recipient of information, we use the two directional forms of closeness, normalized to account for the network size, adapting from (Freeman 1979):

$$C^{out}_i = \frac{n-1}{\sum_{j \neq i} d_{ij}}, \qquad C^{in}_i = \frac{n-1}{\sum_{j \neq i} d_{ji}}$$

where $n$ is the number of nodes in the network, and $d_{ij}$ and $d_{ji}$ are the distances (number of edges, or "degrees of separation") between nodes i and j in the two directions (Melhorado et al. 2016).

At a second level, we use SNA indicators that measure the role of each individual at "small world" level through the local clustering coefficients proposed by Watts and Strogatz (1998). The clustering coefficient of node i is equal to the number of triangles connected to this node divided by the number of triples (i.e., potential triangles) centred on it:

$$C_i = \frac{\tau_i}{d_i (d_i - 1)/2} = \frac{2 \tau_i}{d_i (d_i - 1)}$$

where $\tau_i$ is the number of triangles formed between node i and its possible neighbours, and $d_i$ is the degree of the node (the number of individual connections).
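As an illustration of the two levels of indicators, the sketch below computes directional closeness as (n - 1) divided by the sum of shortest-path distances, and the undirected local clustering coefficient as triangles divided by potential triangles. It is a pure-Python approximation on a hypothetical toy graph, not the igraph routine used for the results; in particular, returning 0.0 when some node is unreachable is an assumption made here for simplicity.

```python
from collections import deque
from itertools import combinations


def bfs_distances(adj, source):
    """Shortest-path lengths (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def closeness(edges, nodes, i, direction="out"):
    """Normalized closeness of i: (n - 1) / sum of distances.

    direction="out" uses distances from i to j; direction="in" uses
    distances from j to i (computed by reversing every edge).
    Assumption: returns 0.0 if some node cannot be reached.
    """
    adj = {}
    for u, v in edges:
        if direction == "in":
            u, v = v, u
        adj.setdefault(u, set()).add(v)
    dist = bfs_distances(adj, i)
    if len(dist) < len(nodes):
        return 0.0
    total = sum(dist.values())
    return (len(nodes) - 1) / total if total else 0.0


def local_clustering(edges, i):
    """Undirected local clustering: 2 * triangles / (d * (d - 1))."""
    nbrs = {}
    for u, v in edges:
        if u != v:
            nbrs.setdefault(u, set()).add(v)
            nbrs.setdefault(v, set()).add(u)
    neighbours = nbrs.get(i, set())
    d = len(neighbours)
    if d < 2:
        return 0.0
    triangles = sum(1 for j, k in combinations(neighbours, 2) if k in nbrs[j])
    return 2 * triangles / (d * (d - 1))


# Hypothetical toy graph: a directed cycle 0 -> 1 -> 2 -> 0 plus a shortcut 0 -> 2.
edges = [(0, 1), (1, 2), (2, 0), (0, 2)]
nodes = {0, 1, 2}
c_out = closeness(edges, nodes, 0, "out")
c_in = closeness(edges, nodes, 0, "in")
c_local = local_clustering(edges, 0)
```

In this toy graph node 0 reaches both other nodes in one step but is reached from node 1 only in two steps, so its out-closeness exceeds its in-closeness, which is the kind of sender/recipient asymmetry the directional indicators are meant to capture.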
As in the case of closeness centrality, we assume that directed clustering in weighted networks can provide additional insight into the structure and dynamics of a social network. A node can be part of triangles with arcs pointing in different directions. Four types of triangles can be distinguished (Yin et al. 2017; Leskovec et al. 2007):
1. In: a triangle with two arcs incoming to i (j→i, k→i, j→k or k→j) (Figure 2a)
2. Out: a triangle with two arcs coming out of i (i→j, i→k, j→k or k→j) (Figure 2b)
3. Cycle: a triangle whose arcs form a directed cycle through i (i→j, j→k, k→i or vice versa) (Figure 2c)
4. Middleperson: a triangle where the two arcs of i have different directions and there is an arc between j and k (or vice versa), without forming a cycle. There are two arcs incoming to k or j (j→i, i→k, j→k or vice versa) (Figure 2d)
A directed clustering coefficient can be specified for each of the above cases, in order to account for the different patterns. Each coefficient is defined as the number of triangles of i with a specific pattern of arc directions, divided by the number of potential specific triangles of i. If $a_{ij}$ is a binary variable that indicates whether there is a connection between i and j or not, and $w_{ij}$ is the number of e-mails sent from i to j (and, consequently, $w_{ij} > 0$ if and only if $a_{ij} = 1$), the four clustering coefficients can be defined in terms of $a_{ij}$, $w_{ij}$ and $s^{\leftrightarrow}_i$, the strength of the connection between node i and its adjacent nodes j.
The calculations of these standard centrality indicators for the reference email network used here were done with the igraph (Csardi and Nepusz 2006) and DirectedClustering (Clemente and Grassi 2018) software packages in R.
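The four triangle patterns can be made concrete with a small counting routine. The sketch below counts, for a node i, the unweighted triangles of each type by checking arc directions over ordered pairs of neighbours (j, k). It is an illustrative simplification: it ignores the e-mail weights and the normalisation by potential triangles, so it yields the raw pattern counts only, not the weighted coefficients computed with the R packages. The function name `triangle_counts` is hypothetical.

```python
def triangle_counts(edges, i):
    """Count directed triangles through node i, by arc pattern.

    Ordered pairs (j, k) are enumerated, so a pattern that holds for both
    (j, k) and (k, j) is counted twice.
    """
    arcs = {(u, v) for u, v in edges if u != v}
    neighbours = {v for u, v in arcs if u == i} | {u for u, v in arcs if v == i}
    counts = {"in": 0, "out": 0, "cycle": 0, "middleperson": 0}
    for j in neighbours:
        for k in neighbours:
            if j == k:
                continue
            # In: both arcs point into i, plus an arc j -> k.
            if (j, i) in arcs and (k, i) in arcs and (j, k) in arcs:
                counts["in"] += 1
            # Out: both arcs leave i, plus an arc j -> k.
            if (i, j) in arcs and (i, k) in arcs and (j, k) in arcs:
                counts["out"] += 1
            # Cycle: i -> j -> k -> i.
            if (i, j) in arcs and (j, k) in arcs and (k, i) in arcs:
                counts["cycle"] += 1
            # Middleperson: j -> i -> k with a direct arc j -> k (no cycle).
            if (j, i) in arcs and (i, k) in arcs and (j, k) in arcs:
                counts["middleperson"] += 1
    return counts


# Hypothetical toy triangles.
feed_forward = [(1, 0), (2, 0), (1, 2)]
cycle = [(0, 1), (1, 2), (2, 0)]
```

In the `feed_forward` example the same triangle is an "in" triangle for node 0, an "out" triangle for node 1 and a "middleperson" triangle for node 2, which shows why the role of a node depends on the directions of its arcs and not only on the triangle it belongs to.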
To decide on the dependent variables to use for modelling performance, we construct custom features that can be derived from information not already used in the calculation of the independent variables. The dataset we use here is limited in terms of the type of information it includes. For example, there is no information on the job description of each individual, which would allow us to explore whether specific roles in the organisation lead to higher email exchanges. On the other hand, the dataset does offer two types of information that could potentially be useful in creating additional features of higher explanatory value:
a. The intensity of bilateral email exchanges between individuals i and j: the hypothesis is that the total number of e-mails exchanged between two members of the network is a reflection of the strength of their relation. While this number is not used in the calculation of SNA indicators, the hypothesis is that the SNA indicators can (at least partially) explain it, i.e., whether the centrality of two individuals in the network influences the number of e-mails they exchange. The work of Zhuang et al. (2012) suggested that the number of interactions among members of a social network is a predictor of their social ties. The work of Lou et al. (2013) studied reciprocity in social network links, while Wang et al. (2013) explored how the links are related to each user's profile. In both cases, the number of messages exchanged has a strong correlation with the role of each user in the social network.
b. The delay in replying to an email exchange: the dataset provides the timestamp of each email which, in turn, allows the measurement of the time between an email from i to j and the next email from j to i. The data do not indicate whether the email from j to i was an actual reply to the original email, but it is sufficient for this application to assume that the communication pattern is continuous, regardless of whether the email exchanges follow a specific subject. The use of response time as a variable in the analysis of social networks is a growing line of research. The work of Kalman et al. (2006) identified persistent patterns in the latencies of responses in digital communications. The work of Avrahami et al. (2008) suggested that responsiveness, or the time until a person responds to communication, can affect the dynamics of a conversation, as well as participants' perceptions of one another. The work of Kalman et al. (2013) extended the concept of chronemic research (the exploration of the temporal dimension in communication) and associated it with Social Information Processing. They suggested that chronemic variance can be a conduit for important information about the members of a social network, an assertion that we use as a starting point for the definition of the response time variable in our model.
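The two features described in (a) and (b) can be derived from the timestamped records roughly as follows. This is a sketch under stated assumptions: intensity is taken as the total number of e-mails exchanged in both directions between a pair, and the reply to an email from i to j is taken to be the first email from j to i with a strictly later timestamp, whatever its subject. The function names and the toy records are hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict


def bilateral_intensity(records):
    """Total number of emails exchanged between each unordered pair {i, j}."""
    totals = defaultdict(int)
    for sender, receiver, _timestamp in records:
        if sender != receiver:
            totals[frozenset((sender, receiver))] += 1
    return dict(totals)


def reply_delays(records):
    """For each email i -> j at time t, the wait until the next email j -> i.

    Emails that are never followed by a message in the opposite direction
    produce no entry.
    """
    sent_times = defaultdict(list)
    for sender, receiver, timestamp in records:
        sent_times[(sender, receiver)].append(timestamp)
    for times in sent_times.values():
        times.sort()

    delays = []
    for sender, receiver, timestamp in records:
        if sender == receiver:
            continue
        back = sent_times.get((receiver, sender), [])
        pos = bisect_right(back, timestamp)  # first reverse email after t
        if pos < len(back):
            delays.append((sender, receiver, back[pos] - timestamp))
    return delays


# Hypothetical toy records: (sender, receiver, timestamp).
emails = [(0, 1, 10), (0, 1, 25), (1, 0, 40), (2, 0, 55), (0, 3, 70)]
intensity = bilateral_intensity(emails)
delays = reply_delays(emails)
```

Note that under this rule two consecutive emails from 0 to 1 are both matched to the same later email from 1 to 0; whether that is appropriate depends on how the response time variable is defined in a given application.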
Following the example of (Christidis 2019; Christidis and Focas 2019; Focas and Christidis 2017), we construct a regression model for each of those two features as dependent variables, using the SNA indicators described in Section 4 as independent variables. The results of the two models are summarized in Table 4.
The results of the regression suggest that the SNA indicators explain the variation in the intensity of the bilateral flows and the delay in responses sufficiently well. In the first case, $R^2$ is remarkably high at 0.8781. The number of emails sent from i to j appears to be positively correlated with the out-closeness centrality of i, i.e., how close i is to the centre of the network as regards sending emails. However, it is negatively correlated with the in-closeness centrality of i. From the point of view of j, the in- and out-closeness centrality estimates have, as expected, the opposite signs. The correlation with the local clustering coefficients is not as straightforward. The in-, out- and middleperson indicators of i show a positive correlation, while the cycle clustering coefficient has a negative correlation. On the side of j, the correlations are not symmetrical. This may imply that the middleperson role generates more email activity, while the cycle role generates less, for both i and j. It is also interesting to note that the strength of email exchanges is expected to be higher when i and j belong to the same department.
Regarding the time it takes for a reply to be received, $R^2$ is lower (0.5556), but still acceptable. Most probably, the influence of possible weekends between the original email and its reply distorts the results. Even so, the estimates for the independent variables are in the expected direction. Individuals with high out-closeness centrality are expected to reply (as j) and be replied to (as i) faster than the average, while high in-closeness has the opposite effect. The in- and out-clustering coefficients have opposite signs, but each has the same direction for i and j. This suggests that individuals with a high out-clustering indicator reply and are replied to faster. High in-clustering or cycle clustering coefficients suggest higher delays in replies. The middleperson clustering coefficient has a negative estimate for i and a positive one for j. Finally, similarly to the case of intensity, the time to reply is, on average, lower when i and j belong to the same department.
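The modelling step discussed in this section can be illustrated with a small least-squares fit and its $R^2$. The text does not specify the estimator, so ordinary least squares is assumed here; the predictor columns and values below are hypothetical stand-ins for the SNA indicators of i and j, not data from Table 4.

```python
def ols(X, y):
    """Least-squares coefficients for y ~ intercept + X (normal equations).

    Assumption: the design matrix has full column rank, so no pivot is zero.
    Returns [intercept, b1, b2, ...].
    """
    rows = [[1.0] + [float(v) for v in r] for r in X]
    p = len(rows[0])
    # Augmented matrix [X'X | X'y].
    aug = [
        [sum(r[a] * r[b] for r in rows) for b in range(p)]
        + [sum(r[a] * yi for r, yi in zip(rows, y))]
        for a in range(p)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(p):
        pivot = max(range(c, p), key=lambda r: abs(aug[r][c]))
        aug[c], aug[pivot] = aug[pivot], aug[c]
        for r in range(p):
            if r != c:
                factor = aug[r][c] / aug[c][c]
                aug[r] = [x - factor * z for x, z in zip(aug[r], aug[c])]
    return [aug[k][p] / aug[k][k] for k in range(p)]


def r_squared(X, y, beta):
    """Share of the variance of y explained by the fitted model."""
    predicted = [beta[0] + sum(b * x for b, x in zip(beta[1:], r)) for r in X]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Hypothetical predictors (e.g., out-closeness of i, in-closeness of j)
# and a response generated exactly as y = 2 + 3*x1 - 1*x2.
X = [[1, 0], [0, 1], [1, 1], [2, 1], [0, 0]]
y = [5, 1, 4, 7, 2]
beta = ols(X, y)
fit = r_squared(X, y, beta)
```

Because the toy response is an exact linear function of the predictors, the fit is perfect; on real email data the $R^2$ would fall below 1, as with the values reported above.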

Discussion
The methodology and the results presented here suggest that, from a technical point of view, it is feasible to collect data from the email traffic within an organisation and derive indicators that may be useful for the analysis of certain operational characteristics. Given the research questions raised in the introduction, each point can be discussed separately:
Suitable indicators that take into account the direction, intensity and frequency of information flow among the organisation's members: Our results suggest that a minimal set of data, consisting of the (anonymized) sender id, the (anonymized) recipient id and the email's timestamp, can help infer the role of an individual in an organisation. This information is sufficient for the construction of a (graph) network, which in turn allows for the calculation of several indicators and measures. While several options for the choice of indicators are available from the literature, our experiments identified the combination of a measure of each individual's role in the overall network (closeness centrality) with a group of clustering coefficients, which quantify the individual's role at a small world scale, as the most suitable to explain the variation in the dependent variables. The modelling results suggest that it is important to consider the relative roles of both the sender and the recipient of each email communication. We also recommend that the directional version of each indicator is used, since the asymmetry in the intensity and frequency of the communication flow among individuals reveals patterns that can be useful in the interpretation of the results.
How such indicators can be connected with human resource and overall organisational efficiency measurements: We have used two custom measures that allow for quantification of the intensity and speed of email exchanges among members of the organisation. Both are modelled sufficiently well using the independent variables that we selected. Moreover, the results suggest that members belonging to the same department have a more intense and frequent communication pattern than ones who do not. These two measures have a direct physical interpretation and have both been identified in the literature as important predictors of organisational behaviour. Our methodology allows for the explanation of different patterns in email communication as a result of each individual's role, as expressed by the individual's centrality and clustering coefficients. For example, a user closer to the centre, as sender (high $C^{out}_i$), would be expected to send a higher number of messages and to receive responses faster than the average user. Users with a middleperson profile tend to send a high number of emails (positive estimate for $C^{middleperson}_i$) and reply to emails faster (positive estimate for $C^{middleperson}_j$).
What are the possible areas of application and the trade-offs when applying such an approach? Intensity and response time can be measured and monitored at an organisational level and explained by a mix of the various user profiles. This information can be useful from the management point of view as long as certain conditions apply, but there are also several risks as regards its possible misuse. As a starting point, monitoring the number of emails and the speed of response can be an indicator of an organisation's workload, performance and efficiency. Depending on the organisation analysed, a growing number of emails, as a whole or for an individual, may be the result of higher output (positive), higher workload (negative), a shift from other means of communication (neutral), improved communication (positive), worsening real-life communication (negative), or several other reasons that may or may not relate to performance. At an individual level, the number of emails and the speed of responses should not be confused with a measure of efficiency but, instead, should be seen as a gauge of the communication of the individual with the rest of the organisation. Therefore, we recommend that such an approach is used as a system to monitor overall patterns of communication throughout an organisation, using the underlying indicators to identify possible causes of changes in the patterns. A rising number of emails, combined with a rising average middleperson coefficient, probably signifies a less desirable communication structure than one with a falling number of emails and a rising average out-clustering coefficient.
Apart from the caution needed in selecting which application is feasible, two additional, potentially limiting, factors need to be addressed. First, if a system that measures organisational or individual aspects of behaviour uses information that is generated by an individual, there is always a risk that the individual will change their behaviour in order to influence the measurement. If, for example, an organisation monitors the average time to respond to an email as a direct or indirect performance measure, it can be expected that most members of the organisation will react according to the observer effect, and modify their response time according to the expected benchmark. In cases where the benchmark is a goal in itself, such a monitoring system may make sense (e.g., in an IT helpdesk), but in many real-life situations this could be detrimental to the quality of the response sent. Second, the legal and ethical context needs to be clarified. Access to personal data, such as email contacts, may be illegal in certain jurisdictions, or considered unethical in certain cultures. The capacity to derive useful information increases with the level of detail of the data that can be accessed, but so do worries regarding its possible uses and misuses.

Conclusions
The methodology that we have presented here is an application of Social Network Analysis with suitable indicators that take into account the direction and weight of communications among members of the network. The indicators proposed allow for standard graph theory indicators, as well as social network clustering coefficients, to be calculated for real-life email networks and, potentially, for all types of directional social networks.
We also suggest how these indicators can be used to explain two measures of organisational operation, the intensity and speed of bilateral communication, through social network indicators. While this is possible for the specific indicators using the current dataset, data availability may be an obstacle in other applications. It is important to highlight that, especially in cases using social network analysis methods for Human Resources management, special care should be given to the availability of independent and objective data before SNA data are used for measurement purposes.
Here, we demonstrated that efficiency measures and SNA indicators can be correlated, but there are limited examples of real-life data combining these two aspects. The sensitive and private nature of efficiency and communication data may limit the applicability of the approach, most probably to controlled environments within specific organisations.
SNA can be useful when combined with additional data on the network and its individual members. However, there are privacy issues that need to be respected. There is a trade-off between the scope of measurements that increase the usefulness of such approaches, and the sensitivity of the information collected. For example, applying text analytics on the contents of the emails could allow measurement of the sentiment of the emails, but would obviously require that those contents be accessible. Several company policies state that employees' email contents can be accessed (particularly when the information is stored in a device owned by the employer), but these provisions are normally applied for monitoring and investigating specific cases, often in a legal context. Applying them for analyzing, or even improving, working conditions can face strong reactions from the ones monitored.
Even when privacy or other ethical concerns are not relevant, it is well known that people who are aware they are being monitored may not behave in a normal way. The literature has long proposed that 'unobtrusive' measurements should be carried out, whereby people are not aware they are being monitored (Webb et al. 1966). This observer effect (also known as the Hawthorne effect) may distort the results and applications of SNA in an organisational context. If individual members of the network know that, for example, the number of emails sent is positively correlated with high efficiency in an efficiency measurement system, they may be inclined to send more (possibly unnecessary) emails, so that their model-based efficiency indicators are higher.
The dataset we used here solely covers internal email traffic and is only suitable for the analysis of the internal patterns of email communication within an organisation. A similar analysis of external email communication would probably provide more insight, and possibly be more valuable, from a business perspective. Performing such an analysis for external communication flows would have, however, its own limitations, such as a lack of information on the communication patterns between actors outside of the organization itself. Without such information, a measurement of the role of the specific organization within its network of external communication would be incomplete.
Communication, in a business or private setting, is becoming increasingly multi-channelled. As a consequence, it may not be enough to analyse a single communication channel, such as email. One needs to also explore the evolution and patterns of other modes of communication, either conventional (telephone, regular mail) or digital (instant messaging, Skype, etc.). In a similar fashion, other factors that may modify email and overall communication patterns should be explored. For example, changes in office distribution are often mentioned as a driver for increased digital communication (to the detriment of physical communication). In such a case, while exploring patterns of communication using emails alone would lead to distorted results, a comparison of the evolution of all possible communication channels, using a long enough observation period, could potentially provide useful results.
To summarize, the work presented here demonstrates that it is technically feasible to analyse email traffic within an organisation and derive information that can be used for organisational management purposes. Technical feasibility does not, however, translate directly to practical feasibility. Specific care should be taken to ensure that the conceptual link between the measured indicators and the management objectives is robust. Such an approach can be applied to a variety of social networks, but its applicability is limited by privacy concerns, the existence of multiple alternative communication channels that evolve over time, the difficulty of establishing clear links between organisational structure and efficiency and, most importantly, the challenge of setting up a system that measures the impact of communication behaviour without influencing the communication behaviour itself.
Funding: This research received no external funding.