Temporal Statistical Analysis of Degree Distributions in an Undirected Landline Phone Call Network Graph Series

This article aims to provide new results about the intraday degree sequence distribution considering phone call network graph evolution in time. More specifically, it tackles the following problem. Given a large amount of landline phone call data records, what is the best way to summarize the distinct number of calling partners per client per day? In order to answer this question, a series of undirected phone call network graphs is constructed based on data from a local telecommunication source in Albania. All network graphs of the series are simplified. Further, a longitudinal temporal study is made on this network graphs series related to the degree distributions. Power law and log-normal distribution fittings on the degree sequence are compared on each of the network graphs of the series. The maximum likelihood method is used to estimate the parameters of the distributions, and a Kolmogorov–Smirnov test associated with a p-value is used to define the plausible models. A direct distribution comparison is made through a Vuong test in the case that both distributions are plausible. Another goal was to describe the parameters’ distributions’ shape. A Shapiro-Wilk test is used to test the normality of the data, and measures of shape are used to define the distributions’ shape. Study findings suggested that log-normal distribution models better the intraday degree sequence data of the network graphs. It is not possible to say that the distributions of log-normal parameters are normal.


Introduction
Most studies related to phone call network graphs are based on mobile call data [1][2][3][4][5][6][7] rather than on landline phone call data [8][9][10][11].Network graphs are seen as static, and rarely [11,12] are they pursued in temporal studies.A local telecommunicating data set from Albania was used in [10] to construct a static network graph.The tails of the empirical distributions were analyzed on the greatest connected component related to: the number of phone calls per client, the total duration of calls per client in seconds, and the distinct number of calling partners per client.The network graph was considered in both cases, directed and not directed.A comparison between power law (PL) and log-normal (LN) fit was made in the tail of the distributions, but it could not be concluded which of them had a determinate dominance over the other.Tail analysis in vertex degree or vertex strength distribution in communication network graphs is important because it gives information about hubs and rare events.Hubs are highly connected vertices, which are hypothesized to act as focal points for the convergence or divergence of information.
Considering the network graph as static, and the concentration only at its greatest connected component may have influenced our findings in [10].This study aims to provide new results about the intraday degree sequence distribution considering phone call network graph evolution in time.
Phone call communication relations have a survival time.Network graph evolution in time is related to the network graph's topology state, which is in a continuous change.A day's snapshot is used to show the topology state of the network graph in a time point.
This article tackles the following problem.Given a large amount of landline phone call data records, what is the best way to summarize the distinct number of calling partners per client per day?
In order to answer this question, I construct an undirected phone call network graphs series, with all network graphs of the series simplified.Further, a longitudinal temporal study is made on this network graph series related to the degree distributions.PL and LN are compared on each of the degree sequences of the network graph series.
The vertex degree is related to the number of distinct callers and the number of distinct subjects that are called by an active phone client.This relation is conditioned by the fact that the network graphs are undirected and simplified.The analysis aimed to determine the distribution that yielded a better fit to model the data related to the degree sequence in each time step.It is shown that the LN model is better, mainly because it covers a large amount of data, and it was determined by the tests to be more reliable than the PL model.I also considered the distributions' shape of the LN parameters and described them.The results show that the distributions of LN parameters were not normal.

Data Preparation
The data set is provided by a local telecommunicating operator positioned in the south of Albania, which covers approximately 4% of the landline market in the country.Clients' identities were substituted with numbers to conserve privacy (see Supplementary Material).The study is based only on phone calls inside the operator's client network, and not outside it.The reason for this restriction is based on the evidence that phone number data which did not belong to the operator would be incomplete.
Phone calls took place in November 2014.On 28 November, Albania celebrates Independence Day, and on the 29th, Liberation Day.From a total of 81,591 phone calls, 41, which were without call durations, and 7442, which lasted less than 10 s, were excluded from the study.The reason for this exclusion is that these calls were lost calls or wrong numbers and might have affected the accuracy of the results.Thus, the total data set used for the study was 90.83% of the initial data set.Active clients are considered only those that were engaged in at least in one phone call (made or received) that lasted at least 10 s, amounting to a total number of 3287.Multiple phone call relations between any two clients were treated as single phone call relations.This statistical technique, about filtering and extracting the best sample that would reflect the global calling patterns related to the number of calling partners per client, has been applied by other authors in telecommunication data [1,2].
Degree distribution in the communication system was studied by observing 30 network graphs, which were constructed by splitting the data set for each day of the month.The network graphs are denoted by G i = (V i , E i ).The vertex set (active phone clients) is V i , and the edge set is E i (G 1 is the network graph of the first day of the month, G 2 for the second day, and so on).Each edge represents a communication relation between two phone clients.Thus, if v 1 and v 2 are vertices, then an undirected edge (v 1 , v 2 ) is between them only if v 1 has made or received at least one phone call from v 2 or the reverse.Multiple relations between two vertices are simplified as only one edge.In Table 1, the topology techniques various authors have used are mentioned.The table includes the following information: the type of telecommunication data, the time interval, the relation's direction, the relation's mutuality, the relation's simplification, and the relation's weight.There is no precise topology technique on how to treat mobile or landline data.Variability depends on the goal of the scientific research.
G 1 , G 2 , . . ., G 30 is defined as the temporal network graph series.The network graph G i is constructed based only on the data of the i-th day.Vertex degree [13] in a network graph is defined as the number of edges incident on that vertex.Let d v denote the degree of the vertex v and, with d i v v∈V i , the vertex degree sequence of G i .The fraction of vertices v that have d v = x is denoted by p x .This can also be interpreted as-the probability that a vertex chosen uniformly at random has a degree equal to x.The set of {p x } x≥0 defines the degree distribution of the network graph.

Temporal Statistical Analysis
At first, for each of the network graphs of the series {G i } 30 1 , the vertex degree sequence d i v v∈V i was computed.The normality of d i v v∈V i was controlled.Thus, a histogram and Q-Q plot were constructed.The Shapiro-Wilk test was performed on the degree sequence, and the basic statistics were calculated.If the p-value of the test [14][15][16] was less than chosen alpha level 0.05, it was considered as evidence that the data did not come from a normally distributed population.
Skewness [17] and kurtosis were used to determine whether the empirical distribution was heavy-tailed.Increasing kurtosis was associated with the "movement of probability mass from the shoulders of a distribution into its centre and tails" [18].Leptokurtic distributions (kurtosis values are greater than 3) partly comprise heavy-tailed distributions [19].Probability distribution functions that decay slower than an exponential are called heavy-tailed distributions.According to [20], a distribution is heavy-tailed if and only if its tail function is a heavy-tailed function.A non-negative function is said to be heavy-tailed if it fails to be bound by a decreasing exponential function.
PL and LN distributions are heavy-tailed.These distributions are chosen to be fitted on data for x ≥ x min , because it is not always possible to get a good fitting for all the data.A random variable X follows a PL distribution for X ≥ x min if its probability mass function µ and σ are parameters of the distribution.The estimation procedure is based on the maximum likelihood method [21,22].This technique is also applied by other authors [23].
The Kolmogorov-Smirnov statistic (KS) is used to determine goodness-of-fit, and the p-value based on 2500 instances of bootstrapping is computed for each of the fittings.Small KS values, and p > 0.1 suggest that the fitted distribution is a plausible one for the set of the data, such that x ≥ x min .If p ≤ 0.1, then it is said that the data does not come from either a PL or an LN distribution.A reliable p-value is obtained when the number of data in the tail of the distribution, n tail , is greater than 100 for PL and greater than 300 for LN [21,22].
When both PL and LN are plausible models for the data, a Vuong log likelihood test [24] between them is computed.The sign of the log likelihood ratio, R, can be reliably used to determine which of the models is better than the other if the p-value is less than 0.1.Otherwise, both models are considered equally plausible.
After that, a box plot description of temporal change on the estimated parameters of the distributions α i , µ i , σ i and their estimated x i min for i = 1, 30 is constructed.Three cases are considered: • Case 1: the fitting made from 1; • Case 2: the fitting made from the estimated x min of each distribution; • Case 3: the fitting made from the min x PL min , x LN min , where both distributions are plausible.
Furthermore, a shape description of parameter distributions of the best-fitted degree distribution models is made.A visualization of the log-log plots of the complementary cumulative distribution function (CCDF) (P(d v ≥ x)) is provided for Case 1, 2, and 3 at G 1 and G 30 .

Results
After constructing the undirected landline phone call network graphs series, each of the network graphs of the series is simplified.Further, the data set I analyze here is the degree sequence d i v v∈V i of each G i where i = 1, 30.In this section, the results of the study are presented.They are divided into two subsections.

Descriptive Analysis of Degree Values
In Figure 1 and Table 2, an illustration for the case of G 1 and G 30 related to the histogram and Q-Q plot and the basic statistics for the set of degree values are shown.For all G i , based on the Q-Q plots, a strong deviation from the straight line can be seen.Moreover, after running the Shapiro-Wilk normality test in each of them, the p-values were always less than 2.2 × 10 −16 < 0.05.This means that the data did not come from a normal distribution.Furthermore, for all   ,  the degree sequence is unimodal;  the mode is 1;  mean > median > mode;  This means that, in all degree sequences d i v v∈V i , a highly right (positively) skewed distribution is present.They are leptokurtic and heavy-tailed.

Statistical Analysis of Fitted Distributions
The results of the estimated α and x min for the PL distribution are given in Table 3. Information about the total number of data n in d i v v∈V i , as well as the quantity of the data that are in the tail of the distribution n tail , is also Based on p-values where p ≤ 0.10, PL is rejected only once out of 30 (G 20 ).The p-value is not reliable in six cases (G 7 , G 10 , G 12 , G 15 , G 16 , G 26 ), since n tail < 100.
The results for the estimated parameters, σ, and x min for the LN, information about the total number of data n in d i v v∈V i , and the quantity of data in the tail of the distribution n tail is given in Table 4. Based on p-values (p ≤ 0.10), LN is rejected three times out of 30 (G 2 , G 12 , G 21 ).The p-value is always reliable, since n tail > 300 in all the cases.For each G i , the KS value of LN is always lower than the corresponding KS value of PL.Information about the LN and PL distribution, for cases where both are plausible, is shown in Table 5.Since min x PL min , x LN min = x PL min , it can be said that LN is conditioned by PL.The sign of the log likelihood ratio, R does not reliably determine which of the models is better than the other, because p-value is not less than 0.1.In this way, both models are considered equally plausible."-" denotes cases where no comparison between LN and PL can be made, because they are defined as not plausible when x ≥ min x PL min , x LN min (G 2 , G 12 , G 20 , G 21 ).It was found that between weekdays, weekends, and holidays, there was no substantial change related to the degree sequence of the network graph.The PL was not rejected in either of the weekends or holidays, but it was not reliable for one of the weekends (G 16 , G 17 ).LN was rejected for one of the weekend days (G 2 ) but was otherwise always reliable.For weekdays, PL and LN models were both rejected and accepted.
Some statistics about the temporal change of LN parameters are given in Table 6.A shape description of the LN parameter distribution is given as follows: 1.
µ: In all three cases, µ does not come from a normal distribution, and its shape is described as follows: • Case 1: approximately symmetric and platykurtic; • Case 2: highly negative skewed and leptokurtic; • Case 3: highly negative skewed and platykurtic.

σ:
In Cases 1 and 2, based on the Shapiro-Wilk test, the normal distribution was not rejected, but it was in Case 3. The shape of the σ distribution is described as follows: • Case 1: approximately symmetric and platykurtic; • Case 2: moderately skewed and platykurtic; • Case 3: moderately skewed and platykurtic.Figure 3 shows a visualization of the log-log plots of the CCDF (P(d v ≥ x)) of the three cases at G 1 and G 30 .In Case 1, the PL model, when it is fitted from x min = 1, is a bad fit for the data; in Case 2, the range of LN fitting is greater than the range of PL fitting; in Case 3, there is no significance difference between the two models.Figure 3 shows a visualization of the log-log plots of the CCDF ((  ≥ )) of the three cases at  1 and  30 .In Case 1, the PL model, when it is fitted from   = 1, is a bad fit for the data; in Case 2, the range of LN fitting is greater than the range of PL fitting; in Case 3, there is no significance difference between the two models.

Discussion
The best way to summarize the distinct number of calling partners per client per day, when considering the evolution of the network graph in time, is via an LN model, even though it was rejected 2 more times than the PL model was.This is based on the evidence that p-values, which are used to define the LN model as plausible or not, were always reliable.In the PL model fittings, six times out of 30, the p-values were not reliable.Furthermore, the range of data that is modeled via LN was always greater than that modeled via PL; furthermore, when the tails of each distribution were compared, no significance differences were found.Therefore, it cannot be said that the distributions of the LN parameters are normal.It would be interesting to define a distribution that models the parameters of LN as the best-fitted degree sequence distribution of the network graphs series.

Data 2017, 2 , 33 5 of 11 Figure 1 .
Figure 1.These are the histograms and the Q-Q plots of degree values of  1 and G 30 .Histogram axes are logarithmic.

Figure 1 .
Figure 1.These are the histograms and the Q-Q plots of degree values of G 1 and G 30 .Histogram axes are logarithmic.
Box plots of the temporal change on the estimated parameters of the distributions α i , µ i , σ i and their estimated x i min for i = 1, 30 are shown in Figure 2. Case 1 is based on box plots LN 1 and PL 1 ; Case 2 is based on box plots LN x min and PL x min ; Case 3 is based on box plots LN x min−PL .

Figure 2 .
Figure 2. Box plots of temporal changes of x min and parameters α, µ, σ of the

Figure 3 .
Figure 3. Visualization of the log-log plots of the complementary cumulative distribution function (CCDF).The dashed lines refer to the CCDF distributions of the LN model and the solid line refers to the PL model.

3 Figure 3 .
Figure 3. Visualization of the log-log plots of the complementary cumulative distribution function (CCDF).The dashed lines refer to the CCDF distributions of the LN model and the solid line refers to the PL model.

Table 1 .
A general overview of topology statistical techniques used to analyze phone call data.Abbreviations: m: months; w: weeks; d: days; -: not applicable.

Table 2 .
A summary of some basic statistics for the set of degree values of G 1 and G 30 .

Table 3 .
Parameter estimates and the goodness-of-fit (KS) for the power law (PL) distribution.

Table 4 .
Parameter estimates and the goodness-of-fit (KS) for the log-normal distribution.

Table 5 .
Results of the Vuong log likelihood ratio test R for LN and PL.

Table 6 .
A summary of statistics related on the temporal change of LN parameters.

Table 6 .
A summary of statistics related on the temporal change of LN parameters.