2.1. Data Preparation
The data set was provided by a local telecommunications operator based in the south of Albania, which covers approximately 4% of the country's landline market. Clients’ identities were replaced with numbers to preserve privacy (see Supplementary Material). The study considers only phone calls within the operator’s client network. Calls involving numbers outside the network were left out because the records for phone numbers that did not belong to the operator would be incomplete.
The phone calls took place in November 2014. Albania celebrates Independence Day on 28 November and Liberation Day on 29 November. Of a total of 81,591 phone calls, 41 that had no recorded duration and 7442 that lasted less than 10 s were excluded from the study, because such calls were likely lost calls or wrong numbers and might have affected the accuracy of the results. The data set used for the study thus comprised 74,108 calls, or 90.83% of the initial data set. A client is considered active only if engaged in at least one phone call (made or received) that lasted at least 10 s; there were 3287 active clients in total. Multiple phone call relations between any two clients were treated as a single relation. This filtering technique, which aims to extract the sample that best reflects the global calling patterns in terms of the number of calling partners per client, has been applied by other authors to telecommunication data [1,2].
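The filtering rules above can be sketched as follows. This is only an illustration in Python, not the code used in the study; the record layout `(caller, callee, duration)` and the function names are assumptions made for the example.

```python
from typing import Iterable, Optional

# Hypothetical record layout: (caller_id, callee_id, duration in seconds or None).
Call = tuple[int, int, Optional[float]]

MIN_DURATION_S = 10  # shorter calls are treated as lost calls or wrong numbers


def filter_calls(calls: Iterable[Call]) -> list[Call]:
    """Keep only calls with a recorded duration of at least MIN_DURATION_S."""
    return [c for c in calls if c[2] is not None and c[2] >= MIN_DURATION_S]


def active_clients(calls: Iterable[Call]) -> set[int]:
    """Clients that made or received at least one retained call."""
    clients: set[int] = set()
    for caller, callee, _ in calls:
        clients.update((caller, callee))
    return clients


def unique_relations(calls: Iterable[Call]) -> set[frozenset[int]]:
    """Collapse repeated calls between the same two clients into one relation.

    Assumption: a call from a client to itself carries no relation and is skipped.
    """
    return {frozenset((a, b)) for a, b, _ in calls if a != b}


# Retained share reported in the text: (81,591 - 41 - 7442) / 81,591
retained_share = (81_591 - 41 - 7_442) / 81_591
print(f"{retained_share:.2%}")  # 90.83%
```

Applying `filter_calls` first and then `active_clients` and `unique_relations` to its output mirrors the order of the steps described in the text.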
The degree distribution of the communication system was studied through 30 network graphs, constructed by splitting the data set by day of the month. The network graphs are denoted by $G_i = (V_i, E_i)$, $i = 1, \dots, 30$, where $V_i$ is the vertex set (active phone clients) and $E_i$ is the edge set ($G_1$ is the network graph of the first day of the month, $G_2$ that of the second day, and so on). Each edge represents a communication relation between two phone clients. Thus, if $u$ and $v$ are vertices, there is an undirected edge $\{u, v\}$ between them only if $u$ has made at least one phone call to $v$ or received at least one from $v$. Multiple relations between two vertices are simplified to a single edge. Table 1 lists the network construction (topology) techniques that various authors have used. The table includes the following information: the type of telecommunication data, the time interval, the relation’s direction, the relation’s mutuality, the relation’s simplification, and the relation’s weight. There is no single established technique for building a network from mobile or landline data; the choices vary with the goal of the research.
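A minimal sketch of this construction is given below, assuming the filtered calls are available as `(day, caller, callee)` tuples. The record layout and function names are hypothetical, and a graph is stored as a plain dictionary of neighbour sets rather than with a graph library.

```python
from collections import defaultdict


def build_daily_graphs(calls):
    """Return {day: {vertex: set(neighbours)}}, one undirected simple graph per day.

    `calls` is an iterable of hypothetical (day_of_month, caller_id, callee_id) tuples.
    """
    graphs = defaultdict(lambda: defaultdict(set))
    for day, u, v in calls:
        if u == v:
            continue  # assumption: a self-call defines no relation
        # Direction is ignored, and repeated calls collapse because neighbours are sets.
        graphs[day][u].add(v)
        graphs[day][v].add(u)
    return {day: dict(adj) for day, adj in graphs.items()}


def edge_count(adj):
    """Number of undirected edges; each edge is stored once per endpoint."""
    return sum(len(nbrs) for nbrs in adj.values()) // 2
```

Because only clients that appear in a retained call are inserted, every vertex of a daily graph has degree at least 1.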
The sequence $(G_i)_{i=1}^{30}$ is defined as the temporal network graph series. The network graph $G_i$ is constructed based only on the data of the $i$-th day. The degree of a vertex [13] in a network graph is defined as the number of edges incident on that vertex. Let $d(v)$ denote the degree of the vertex $v$, and let $D_i$ denote the vertex degree sequence of $G_i$. The fraction of vertices of $G_i$ that have degree $d(v) = k$ is denoted by $p_k$. This can also be interpreted as the probability that a vertex chosen uniformly at random has a degree equal to $k$. The set of $p_k$ defines the degree distribution of the network graph.
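The definitions of the degree sequence and the degree distribution can be illustrated with a short Python sketch. The adjacency-set representation and the function names are assumptions made for the example.

```python
from collections import Counter


def degree_sequence(adj):
    """Degrees of all vertices of a simple undirected graph {vertex: set(neighbours)}."""
    return [len(nbrs) for nbrs in adj.values()]


def degree_distribution(degrees):
    """Map each observed degree k to p_k, the fraction of vertices with degree k."""
    n = len(degrees)
    counts = Counter(degrees)
    return {k: counts[k] / n for k in sorted(counts)}


# Toy graph: "a" is connected to "b" and "c".
adj = {"a": {"b", "c"}, "b": {"a"}, "c": {"a"}}
dist = degree_distribution(degree_sequence(adj))
```

In the toy graph, two of the three vertices have degree 1 and one has degree 2, so `dist` assigns 2/3 to degree 1 and 1/3 to degree 2, and the values sum to 1.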
2.2. Temporal Statistical Analysis
First, for each network graph of the series $(G_i)$, the vertex degree sequence $D_i$ was computed. The normality of $D_i$ was then checked: a histogram and a Q–Q plot were constructed, the Shapiro–Wilk test was performed on the degree sequence, and the basic statistics were calculated. If the $p$-value of the test [14,15,16] was less than the chosen significance level of 0.05, this was taken as evidence that the data did not come from a normally distributed population.
Skewness [17] and kurtosis were used to determine whether the empirical distribution was heavy-tailed. Increasing kurtosis is associated with the “movement of probability mass from the shoulders of a distribution into its centre and tails” [18]. Leptokurtic distributions (those with kurtosis greater than 3) partly comprise heavy-tailed distributions [19]. Probability distributions whose tails decay more slowly than an exponential are called heavy-tailed. According to [20], a distribution is heavy-tailed if and only if its tail function is a heavy-tailed function, and a non-negative function is said to be heavy-tailed if it fails to be bounded by a decreasing exponential function.
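As an illustration of the two shape statistics, the sketch below computes moment-based skewness and (non-excess) kurtosis, for which a normal distribution gives 0 and 3, respectively. These are the plain population-moment formulas; statistical packages may apply bias corrections or report excess kurtosis, so values can differ slightly from theirs.

```python
def central_moment(xs, r):
    """r-th central moment of the sample xs (divisor n)."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** r for x in xs) / len(xs)


def skewness(xs):
    """Moment-based skewness m3 / m2^(3/2)."""
    return central_moment(xs, 3) / central_moment(xs, 2) ** 1.5


def kurtosis(xs):
    """Non-excess kurtosis m4 / m2^2; values above 3 indicate a leptokurtic sample."""
    return central_moment(xs, 4) / central_moment(xs, 2) ** 2


# Toy degree sequence with one hub: strongly right-skewed and leptokurtic.
degrees = [1, 1, 1, 1, 1, 1, 1, 1, 1, 11]
print(skewness(degrees) > 0, kurtosis(degrees) > 3)
```

A degree sequence in which most vertices have few partners and a few have many produces exactly this pattern of positive skewness and kurtosis above 3.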
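Power-law (PL) and log-normal (LN) distributions are heavy-tailed. These distributions are fitted to the data for $x \ge x_{\min}$, because it is not always possible to obtain a good fit over all the data. A random variable $X$ follows a PL distribution for $x \ge x_{\min}$ if its probability mass function $p(x)$ is

$$p(x) = \frac{x^{-\alpha}}{\zeta(\alpha, x_{\min})}, \qquad \zeta(\alpha, x_{\min}) = \sum_{n=0}^{\infty} (n + x_{\min})^{-\alpha},$$

where $\zeta(\alpha, x_{\min})$ is the generalized (Hurwitz) zeta function and $\alpha$ is the scaling parameter of the distribution. A random variable $X$ follows an LN distribution for $x \ge x_{\min}$ if its probability mass function has the log-normal form on that range, where $\mu$ and $\sigma$ are the parameters of the distribution. The estimation procedure is based on the maximum likelihood method [21,22]. This technique has also been applied by other authors [23].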
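To make the maximum likelihood step concrete for the PL case, the sketch below evaluates the discrete power-law log-likelihood for a fixed $x_{\min}$ and maximises it numerically over the scaling parameter. It is an illustration only, not the routine used in the study: the Hurwitz zeta function is approximated by direct summation with an Euler–Maclaurin tail correction, the search interval `[1.05, 6]` is an arbitrary assumption, and the estimation of $x_{\min}$ itself is not shown.

```python
import math


def hurwitz_zeta(s, q, n_terms=1000):
    """Approximate sum_{n>=0} (n + q)^(-s) for s > 1 and q > 0.

    The first n_terms terms are summed directly; the remainder is
    approximated by an Euler-Maclaurin tail correction.
    """
    head = sum((n + q) ** (-s) for n in range(n_terms))
    a = n_terms + q
    tail = a ** (1 - s) / (s - 1) + 0.5 * a ** (-s) + s * a ** (-s - 1) / 12
    return head + tail


def pl_log_likelihood(alpha, n, sum_log_x, x_min):
    """Log-likelihood of n tail observations under p(x) = x^(-alpha) / zeta(alpha, x_min)."""
    return -n * math.log(hurwitz_zeta(alpha, x_min)) - alpha * sum_log_x


def fit_alpha(data, x_min, lo=1.05, hi=6.0, tol=1e-6):
    """Maximise the (concave) log-likelihood over alpha by ternary search on [lo, hi]."""
    tail = [x for x in data if x >= x_min]
    n = len(tail)
    sum_log_x = sum(math.log(x) for x in tail)
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if pl_log_likelihood(m1, n, sum_log_x, x_min) < pl_log_likelihood(m2, n, sum_log_x, x_min):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2
```

For example, `fit_alpha(degrees, x_min=1)` would return the estimate of the scaling parameter for a degree sequence `degrees` fitted from $x_{\min} = 1$.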
The Kolmogorov–Smirnov (KS) statistic is used to assess goodness-of-fit, and a $p$-value based on 2500 bootstrap replicates is computed for each fit. A small KS value together with a sufficiently large $p$-value suggests that the fitted distribution is a plausible model for the data with $x \ge x_{\min}$. If the $p$-value is too small, the data are taken not to come from the fitted PL or LN distribution. A reliable $p$-value is obtained when the number of data points in the tail of the distribution, $n_{\text{tail}}$, is greater than 100 for PL and greater than 300 for LN [21,22].
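The two quantities involved can be sketched as follows. The KS distance is taken here as the largest absolute difference between the empirical and the fitted cumulative distribution functions over the integer values in the tail. The bootstrap $p$-value is assumed to be the fraction of synthetic data sets whose KS distance is at least as large as the observed one, which is the usual convention for this kind of test. The function names and the `model_cdf` argument are hypothetical, and the generation of synthetic data sets is not shown.

```python
from collections import Counter


def ks_distance(tail, model_cdf, x_min):
    """Largest |S(x) - P(x)| over integers x from x_min to max(tail).

    S is the empirical CDF of the tail data and P = model_cdf is the fitted
    CDF, i.e. model_cdf(x) = P(X <= x | X >= x_min).
    """
    n = len(tail)
    counts = Counter(tail)
    cumulative = 0
    distance = 0.0
    for x in range(x_min, max(tail) + 1):
        cumulative += counts.get(x, 0)
        distance = max(distance, abs(cumulative / n - model_cdf(x)))
    return distance


def bootstrap_p_value(observed_distance, simulated_distances):
    """Fraction of simulated KS distances that are at least the observed one."""
    hits = sum(d >= observed_distance for d in simulated_distances)
    return hits / len(simulated_distances)
```

With 2500 replicates, `simulated_distances` would hold 2500 KS distances, each obtained by refitting the model to one synthetic data set.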
When both PL and LN are plausible models for the data, Vuong's log-likelihood ratio test [24] is applied to compare them. The sign of the log-likelihood ratio, $R$, can be reliably used to determine which of the two models is better if the $p$-value is less than 0.1. Otherwise, both models are considered equally plausible.
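A minimal sketch of the comparison is given below, assuming the pointwise log-likelihoods of the same tail data under the two fitted models are already available as two lists. It uses the normalised form of the test commonly applied in power-law fitting, with the variance of the pointwise differences computed with divisor $n$; the function name is hypothetical and this is not necessarily the exact routine used in the study.

```python
import math


def vuong_test(loglik_a, loglik_b):
    """Return (R, z, p) for two non-nested models fitted to the same data.

    R is the log-likelihood ratio (positive values favour model A),
    z is R normalised by sigma * sqrt(n), and p is the two-sided p-value.
    The pointwise differences must not all be identical (sigma > 0).
    """
    n = len(loglik_a)
    diffs = [a - b for a, b in zip(loglik_a, loglik_b)]
    r = sum(diffs)
    mean = r / n
    sigma = math.sqrt(sum((d - mean) ** 2 for d in diffs) / n)
    z = r / (sigma * math.sqrt(n))
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return r, z, p_two_sided
```

If the returned `p` is below 0.1, the sign of `R` indicates the better model; swapping the two arguments flips the sign of `R` and leaves `p` unchanged.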
After that, box plots are constructed to describe the temporal change in the estimated parameters of the distributions and in their estimated $x_{\min}$ across the network graphs of the series. Three cases are considered:
Case 1: the fit made from $x_{\min} = 1$;
Case 2: the fit made from the estimated $x_{\min}$ of each distribution;
Case 3: the fit made from the $x_{\min}$ where both distributions are plausible.
Furthermore, a shape description of the parameter distributions of the best-fitted degree distribution models is made. A visualization of the log–log plots of the complementary cumulative distribution function (CCDF) is provided for Cases 1, 2, and 3 at
The statistical computations related to these distributions were carried out with the following packages for the R statistical computing environment [25]: poweRlaw [26], fBasics [27], igraphdata [28], and igraph [29].