Comparing Static and Dynamic Weighted Software Coupling Metrics

Coupling metrics are an established way to measure software architecture quality with respect to modularity. Static coupling metrics are obtained from the source or compiled code of a program, while dynamic metrics use runtime data gathered, e.g., by monitoring a system in production. We study weighted dynamic coupling that takes into account how often a connection is executed during a system's run. We investigate the correlation between dynamic weighted metrics and their static counterparts. We use data collected from four different experiments, each monitoring production use of a commercial software system over a period of four weeks. We observe an unexpected level of correlation between the static and the weighted dynamic case as well as revealing differences between class- and package-level analyses.


Introduction
Coupling [14,31], the number of inter-module connections in software systems, has long been identified as a software architecture quality metric for modularity [29]. Taking coupling metrics into account during development of a software system can help to increase the system's maintainability and understandability [7], in particular for microservice architectures [24]. As a consequence, aiming for high cohesion and low coupling is accepted as a design guideline in software engineering [11].
In the literature, there exists a wide range of different approaches to measuring coupling. Usually, the coupling degree of a module (class or package) indicates the number of "connections" it has to different system modules. A "connection" between modules A and B can be, among others, a method call from A to B or an exception of type B thrown by A. Many notions of coupling can be measured statically, based on either source code or compiled code.
Static analysis is attractive since it can be performed immediately on source code or on a compiled program. However, it has been observed [5,12,15] that for object-oriented software, static analysis does not suffice, as it often fails to account for effects of inheritance with polymorphism and dynamic binding. This is addressed by dynamic analysis, where monitoring logs are generated while running the software.
The results obtained by dynamic analysis depend on the workload used for the run of the system yielding the monitoring data. Hence, the availability of a representative workload for the system under test is crucial for dynamic analysis. As a consequence, dynamic analysis is more expensive than static analysis.
Dynamic analysis is often used to improve upon the accuracy of static coupling analysis [16]. Dynamic analysis uses monitoring data to find, e.g., all classes B whose methods are called by the class A. In this case, the individual relationship between two classes A and B is qualitative: The analysis only determines whether there is a connection between A and B, and does not take its strength (e.g., number of calls during a system's run) into account. In contrast, a quantitative coupling measurement quantifies the strength of the connection between A and B by assigning it a concrete number.
The coupling metrics we consider in this paper are defined using a dependency graph. The nodes of such a graph are program modules (classes or packages). Edges between modules express call relationships. They can be labelled with weights, which are integers denoting the number of occurrences of the call represented by the edge. Depending on whether coupling metrics take these weights into account or not, we call the metrics weighted or unweighted. The main two metrics we consider are the following:
1. Unweighted static coupling, where an edge from A to B is present in the dependency graph if some method from B is called from A in the (source or compiled) program code.
2. Weighted dynamic coupling, where an edge from A to B is present in the graph if such a call actually occurs during the monitored run of the system, and is attributed with the number of such calls observed.
Dynamic weighted coupling measures cannot replace their static counterparts in their role as, e.g., indicators of the maintainability of software projects. However, we expect dynamic weighted coupling measures to be highly relevant for software restructuring: In contrast to static coupling measures, weighted dynamic measures can reflect the runtime communication "hot spots" within a system, and therefore may be helpful in establishing performance predictions of restructuring steps. For example, method calls that happen infrequently may be replaced by a sequence of nested calls or by a network query without relevant performance impact. Since static coupling measures are often used as a basis for restructuring decisions [11,26], dynamic weighted coupling measures can potentially complement their static counterparts in the restructuring process. This possible application leads to the following question: Do dynamic coupling measures yield additional information beyond what we can obtain from static analysis?
Initially, we expected static and dynamic coupling degrees to be almost unrelated: A module A has high static coupling degree if there are many method calls from A to methods outside of A or vice versa in the program code. On the other hand, A has high dynamic weighted coupling degree if during the observed run of the system, there are many runtime method calls between A and other parts of the system. Since a single occurrence of a method call in the code can be executed millions of times (or not at all) during a run of the program, static and weighted dynamic coupling degrees do not need to correlate. Thus, our initial hypothesis was that we would not observe a high correlation between static and weighted dynamic metrics.
Our main research question is: Are static coupling degrees and dynamic weighted coupling degrees statistically independent? If we observe correlation, can we quantify the correlation?
To answer these questions, we compare the two coupling measures. We use dynamically collected data to compute weighted metrics that take into account the number of function calls during the system's run. We obtained the data from a series of four experiments. Each experiment consists of monitoring real production usage of a commercial software system (Atlassian Jira [6]) over a period of four weeks. Our monitoring data contains more than three billion method calls. We compare the results from our dynamic analysis to computations of static coupling degrees.
Directly comparing static and weighted dynamic coupling degrees is of little value, as these are fundamentally different measurements: For instance, the absolute value of dynamic weighted degrees depends on the duration of the monitored program run, which clearly is not the case for the static measures. We therefore instead compare coupling orders, i.e., the rankings obtained by ordering all program modules by their coupling degree, using the Kendall Tau metric [21] (see footnote 1). This also allows us to quantify the difference between such orders.
Our answer to the research questions stated above is that static and (weighted) dynamic coupling degrees are not statistically independent. A possible interpretation of this result is that dynamic weighted coupling degrees give additional, but related information compared to the static case. In addition to this result, we observe insightful differences between class- and package-level analyses.

Contributions
The results and contributions of this paper are (see footnote 2):
- Using a unified framework, we introduce precise definitions of static and dynamic coupling measures.
- To investigate our main research question, we performed four experiments involving real users of a commercial software product (the Atlassian Jira project and issue tracking tool [6]) over a period of four weeks each. The software was instrumented via the dynamic monitoring framework Kieker [20] based on AspectJ [22]. From the collected data, we computed our dynamic coupling measures. We compared the obtained results, using the Kendall-Tau metric [8], to coupling measures we obtained by static analysis.
- The results show that all coupling metrics we investigate are correlated, but there are also significant differences. In particular, when considering package-level coupling, the correlation is significantly stronger than for class-level coupling. We assume the reason is that effects like polymorphism and dynamic binding often do not cross package boundaries.
Footnote 1: See [19] for a discussion of the relationship between this metric and Spearman's correlation.
Footnote 2: A replication package including the collected data of our experiments will soon be published on Zenodo, to allow other researchers to repeat and extend our work.
Finally, we note that this paper is an extension of a previous short poster paper [30] in which a high-level overview of the research approach and the first data set are presented. The current paper extends the previous short poster paper (2 pages in length) as follows:
- This paper contains an in-depth explanation of the research approach, including a precise definition of our coupling metrics.
- We report on the statistical properties of the data collected during the experiments.
- We report on the findings of four experiments, whereas the short paper only discusses the first of our four data sets.

Paper Organization
The remainder of the paper is organized as follows: In Section 2, we discuss related work. Section 3 provides our definition of weighted dynamic coupling. In Section 4, we explain our approach to static and dynamic analysis. Section 5 then describes the setting of our experiment. The results are presented and discussed in Section 6. In Section 7, we discuss threats to validity and conclude in Section 8 with a discussion of possible future work.

Related Work
There is extensive literature on using coupling metrics to analyse software quality, see, e.g., Fregnan et al. [17] for an overview. Briand et al. [10] propose a repeatable analysis procedure to investigate coupling relationships. Nagappan et al. [27] show correlation between metrics and external code quality (failure prediction). They argue that no single metric provides enough information (see also Voas and Kuhn [32]), but that for each project a specific set of metrics can be found that can then be used in this project to predict failures for new or changed classes. Misra et al. [25] propose a framework for the evaluation and validation of software complexity measures. Briand and Wüst [9] study the relationship between software quality models and external qualities like reliability and maintainability. They conclude that, among others, import and export coupling appear to be useful predictors of fault-proneness. Static weighted coupling measures have been considered by Offutt et al. [28]. Allier et al. [2] compare static and unweighted dynamic metrics. Our approach is different: We do not study correlation between software metrics and software quality, but correlation between different software metrics.
Dynamic (unweighted) metrics have been investigated in numerous papers (see, e.g., Arisholm et al. [5] as a starting point, also the surveys by Chhabra and Gupta [13] and Geetika and Singh [18]). None of these approaches considers dynamic weighted metrics, as we do.
Dynamic analysis is often used to complement static analysis. As a notable exception, Yacoub et al. [33] use weighted metrics. However, to obtain the data, they do not use runtime instrumentation, as we do, but "early-stage executable models." They also assume a fixed number of objects during the software's runtime.
Arisholm et al. [5] study dynamic metrics for object-oriented software. Our dynamic coupling metrics are based on their dynamic messages metric. The difference is as follows: Their metric counts only distinct messages, i.e., each method call is only counted once, even if it appears many times in a concrete run of the system. The main feature of our weighted metrics is that the number of occurrences of each call during the run of a system is counted. The dynamic messages metric from [5] corresponds to our unweighted dynamic coupling metrics (see below).

Dependency Graphs
We performed our analyses with two different levels of granularity: on the (Java) class and package levels. In the following we use the term module for either class or package, depending on the granularity of the analysis. The output of either type of analysis (dynamic and static) is a labeled, directed graph G, where the nodes represent program modules (i.e., classes or packages), and the labels are integers which we refer to as weights of the edges. An edge from A to B has label (weight) n_{A,B}; this denotes that the number of directed interactions between A and B occurring in the analysis is n_{A,B}.
In the case of a static analysis, this means that there are n_{A,B} places in the code of A where some method from B is called. For dynamic analysis, this means that during the monitored run of the system, there were n_{A,B} run-time invocations of methods from B by methods from A.
Our graph G is a weighted dependency graph, hence we call the coupling metrics we define below weighted metrics. When we disregard the numbers n_{A,B}, the graph G is a plain dependency graph, i.e., a directed graph where the edges reflect function calls between the modules. We refer to metrics defined on the unweighted dependency graph (i.e., metrics that do not take the weights n_{A,B} into account) as unweighted metrics. We study the following three conceptually different approaches to measure coupling dependency between program modules:
1. The first approach is static analysis, which identifies method calls by analyzing the compiled code (we used BCEL to analyze Java .class and .jar files). Here we do not take weights into account. We therefore compute our static coupling measures from an unweighted dependency graph.
2. Our second approach is unweighted dynamic analysis. This analysis identifies method calls between modules as they appear in an actual run of the system (the data is obtained by monitoring), but does not take the weights n_{A,B} into account. It therefore does not distinguish between cases where a module A calls another module B a million times or just once. This metric is essentially the dynamic messages metric from [5].
3. Our third approach is weighted dynamic analysis, which differs from its unweighted counterpart only by taking the weights n_{A,B} into account.
The distinctions between static/dynamic analyses and unweighted/weighted analyses are orthogonal choices. In particular, we omit in the present paper a weighted, static analysis, since our main motivation is the comparison of dynamic, weighted metrics with unweighted, static metrics.
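The construction of the dependency graph described in this section can be illustrated with a minimal Python sketch. It is not the tooling used in the experiments: the function names, the sample class names, and the representation of a monitoring record as a (caller class, callee class) pair are assumptions made for illustration, as is the choice to ignore calls within a single module.

```python
from collections import Counter


def package_of(class_name):
    """Return the package part of a fully qualified Java class name."""
    return class_name.rpartition(".")[0]


def build_weighted_graph(calls, level="class"):
    """Map each edge (caller module, callee module) to its weight n_{A,B}.

    `calls` is an iterable of (caller_class, callee_class) pairs, one pair
    per observed call. `level` is "class" or "package".
    """
    graph = Counter()
    for caller, callee in calls:
        if level == "package":
            caller, callee = package_of(caller), package_of(callee)
        # Assumption: calls that stay inside one module are not coupling.
        if caller != callee:
            graph[(caller, callee)] += 1
    return graph


def unweighted(graph):
    """Disregard the weights: every existing edge gets weight 1."""
    return {edge: 1 for edge in graph}


# Hypothetical monitoring records (names are invented for this example).
calls = [
    ("app.ui.Board", "app.core.Issue"),
    ("app.ui.Board", "app.core.Issue"),
    ("app.ui.Board", "app.core.User"),
    ("app.core.Issue", "app.core.User"),
]
class_graph = build_weighted_graph(calls)
package_graph = build_weighted_graph(calls, level="package")
```

In this toy log, the class-level graph has three edges, while the package-level graph has only the single edge from app.ui to app.core, since the call from Issue to User stays inside one package.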

Definition of Coupling Metrics
We now define the coupling measures we study. Our measures assign a coupling degree to a program module (i.e., a class or a package). We consider 18 different ways to measure coupling, resulting from the following three orthogonal choices:
1. The first choice is between class-level and package-level granularity. Depending on this choice, a module is either a (Java) class or a (Java) package.
2. The second choice is between one of our three basic measurement approaches: static, dynamic unweighted, or dynamic weighted analysis.
3. The third choice is to measure import, export, or combined coupling.
To distinguish these 18 types of measurement, we use triples (α, β, γ), where α is c or p and indicates the granularity, β is s, u, or w and indicates the basic measurement approach, and γ is i, e, or c, indicating the direction of couplings taken into account. Figure 1 illustrates these three orthogonal choices: The example triple (p, u, i) denotes an analysis with package-level granularity, using dynamic unweighted analysis, and considering coupling in the import direction.
Our coupling measures can be computed from the two dependency graphs resulting from our two analyses (static and dynamic). For a module A and a choice of measure (α, β, γ), the (α, β, γ)-coupling degree of A, denoted by coupdeg_{α,β,γ}(A), is computed as follows:
- We compute G_{α,β}. This is the weighted dependency graph between classes (if α = c) or packages (if α = p) obtained by static analysis (if β = s) or dynamic analysis (if β = u or β = w), where each weight n_{A,B} is replaced with 1 if the analysis is static or (dynamic) unweighted (i.e., if β ∈ {s, u}).
- Then, coupdeg_{α,β,γ}(A) is the out-degree of A, the in-degree of A, or the sum of these, depending on whether γ = i, γ = e, or γ = c. The in-degree (out-degree) of A is the sum of the weights of its incoming (outgoing) edges in the graph.
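The two steps of this definition can be sketched in a few lines of Python. The sketch assumes that the graph for the chosen granularity α is given as a dictionary mapping edges (A, B) to weights; the function name and the example graph are hypothetical.

```python
def coupling_degree(graph, module, beta, gamma):
    """Compute coupdeg_{alpha,beta,gamma}(module).

    `graph` maps edges (A, B) to weights n_{A,B} at the chosen granularity.
    `beta` is "s", "u", or "w"; `gamma` is "i", "e", or "c".
    """
    def weight(n):
        # Step 1: for static or unweighted analysis, each weight becomes 1.
        return n if beta == "w" else 1

    # Step 2: sum the weights of outgoing and incoming edges.
    out_degree = sum(weight(n) for (a, _), n in graph.items() if a == module)
    in_degree = sum(weight(n) for (_, b), n in graph.items() if b == module)
    return {"i": out_degree, "e": in_degree, "c": out_degree + in_degree}[gamma]


# Invented example graph: A calls B five times and C once; C calls A twice.
example = {("A", "B"): 5, ("A", "C"): 1, ("C", "A"): 2}
weighted_import = coupling_degree(example, "A", "w", "i")
unweighted_import = coupling_degree(example, "A", "u", "i")
weighted_combined = coupling_degree(example, "A", "w", "c")
```

For module A in this example, the weighted import degree is 6 and the unweighted import degree is 2, which shows how a single frequently executed edge dominates the weighted measure.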

Static and Dynamic Analysis
We perform our static analysis on the compiled code, using Apache BCEL [4]. This also implies that some optimizations, such as the removal of dead code, have already been performed by the compiler. Therefore, our static and dynamic analyses are performed on the exact same code, without differences introduced in the compilation process. For the dynamic analysis, we use the Kieker framework [20], which allows us to record every method call. Kieker uses AspectJ's [22] load-time weaver to instrument the analyzed software automatically at load-time. In order to reduce the performance impact of monitoring, we restricted the monitoring to a subset of the system, and adjusted the static analysis accordingly.

Experiment Design
We analyzed the software Atlassian Jira, versions 7.3.0, 7.4.3, and 7.7.1 [6]. The system was instrumented using AspectJ technology. For each method call, we recorded the time stamp and the class names of the caller and the callee.
To perform our analysis with a realistic workload, we conducted four experiments with real users using a software system (Atlassian Jira [6]) in production. Jira was used by students participating in a mandatory programming course of our computer science curriculum. In the course, the students develop software using the Kanban process management method [1]. The time span of the project is four weeks, with full-time participation by the students.
We report on four experiment runs, from February and September of 2017 and 2018. Each time, the software ran for a four-week period. The collected monitoring data from each run includes the startup sequence, basic configuration such as database access, initial tasks such as user registration and setup of the Kanban boards, and day-to-day usage. No person-related data is used for our analysis. In Table 1, we list the number of method calls recorded as well as the number of users of our Jira installation in each of the four experiment runs.
Obviously, there are differences between the four runs of the software that we analyze. For example, different students took part in the course each time, the focus of the project required using different features of the Jira software in each iteration, and we also instructed them to use more features of the tool in the later iterations (this is one reason why the number of method calls per student is higher in the later runs of the experiment). Therefore, our four experiments, even though they are conducted using the same software system, give us slightly more variation in the data than running the exact same software with the exact same group of users. However, our main analysis results do not vary significantly between the different runs of the experiment, indicating that our findings are invariant under small changes of the experiment setup.

Compared Measures
We compare the coupling degrees computed by these different approaches. Comparing the actual "raw" values of coupdeg_{α,β,γ}(A) for different combinations of α, β, γ and some class or package A does not make much sense: The weighted values depend on the length of the measurement run of the system, whereas the static analysis does not. However, the absolute coupling values are usually not the most interesting results of such an analysis. For a developer, the identification of the modules with the highest coupling degree is among the most interesting results of applying a software metric. Therefore, a useful approach is to study the relationship between the orders among the modules in the different analyses: Each analysis yields an ordering of the classes or packages from the ones with the highest coupling degree to the ones with the lowest one; we call these orders coupling orders. These orders can be compared between different analyses of varying measurement durations.
Given our coupling measure definitions, we have the following choices for a left-hand-side (LHS) and a right-hand-side (RHS) analysis:
- The first choice is whether to consider class or package analyses (both the LHS and the RHS should consider the same type of module).
- The second choice is which two of our three basic measurement approaches (see Section 3.1) we intend to compare: static analysis, (dynamic) unweighted analysis, and (dynamic) weighted analysis. There are three possible choices: s vs. u, s vs. w, and u vs. w.
- For each combination, we consider import, export, and combined coupling.
Hence, there are 18 comparisons we can perform in each of our four data sets, leading to 72 different comparisons.
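The count of comparisons can be checked by enumerating the three choices above. The following Python sketch does this; the variable names are chosen for illustration only.

```python
from itertools import product

granularities = ["c", "p"]                              # class or package
approach_pairs = [("s", "u"), ("s", "w"), ("u", "w")]   # LHS vs. RHS approach
directions = ["i", "e", "c"]                            # import, export, combined
number_of_data_sets = 4

# One comparison per combination of granularity, approach pair, and direction.
comparisons = list(product(granularities, approach_pairs, directions))
per_data_set = len(comparisons)                  # 2 * 3 * 3
total = per_data_set * number_of_data_sets
```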
Kendall-Tau distance. To study the difference between our different basic measurement approaches, we compare the coupling orders of the analyses using the Kendall-Tau distance [8]: For a finite base set S of size n, the metric compares two linear orders <_1 and <_2 on S. The Kendall-Tau distance τ(<_1, <_2) is the number of swaps of adjacent elements needed to obtain the order <_1 from <_2, normalized by dividing by the number of possible swaps, n(n−1)/2. Hence τ(<_1, <_2) is always between 0 (if <_1 and <_2 are identical) and 1 (if <_1 is the reverse of <_2). Values smaller than 0.5 indicate that the orders are closer together than expected from two random orders, while values larger than 0.5 indicate the opposite.
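As an illustration of this definition, the following Python sketch derives a coupling order from a table of coupling degrees and computes the normalized Kendall-Tau distance by counting the pairs of modules that appear in opposite relative order, which equals the number of adjacent swaps. The function names are hypothetical, breaking ties by module name is an assumption made only to obtain a linear order, and the quadratic pair count is chosen for readability rather than efficiency.

```python
from itertools import combinations


def coupling_order(degrees):
    """Order modules from highest to lowest coupling degree.

    Assumption: ties are broken alphabetically to obtain a linear order.
    """
    return sorted(degrees, key=lambda module: (-degrees[module], module))


def kendall_tau_distance(order1, order2):
    """Normalized Kendall-Tau distance between two orders of the same set."""
    if len(set(order1)) != len(order1) or set(order1) != set(order2):
        raise ValueError("both orders must list the same distinct modules")
    n = len(order1)
    if n < 2:
        return 0.0
    position2 = {module: index for index, module in enumerate(order2)}
    # x precedes y in order1; the pair is discordant if y precedes x in order2.
    discordant = sum(
        1 for x, y in combinations(order1, 2) if position2[x] > position2[y]
    )
    return discordant / (n * (n - 1) / 2)


# Invented coupling degrees for three modules.
static_order = coupling_order({"A": 3, "B": 10, "C": 1})
dynamic_order = coupling_order({"A": 900, "B": 40, "C": 7})
distance = kendall_tau_distance(static_order, dynamic_order)
```

In this example the two orders differ only in the relative position of A and B, so one of the three pairs is discordant and the distance is 1/3.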
Distance Values. To present our results, we use the following notation to specify the LHS and RHS analyses: We use a triple α : β_1 ↔ β_2, where
- α is c or p, expressing class or package coupling,
- β_1 is s or u, expressing whether the LHS analysis is static or (dynamic) unweighted,
- β_2 is u or w, expressing whether the RHS analysis is (dynamic) unweighted or (dynamic) weighted.
For each of these combinations, we consider export, import, and combined coupling analyses. This results in 18 comparisons for each data set, which are presented in Tables 2, 3, 4, and 5 for our four experiments.
Statistical Significance. To measure statistical significance, we computed the absolute z-scores of our experiments. The smallest observed absolute z-score among all our experiments is 9.41, and all but two absolute values are above 10. As a point of reference, the corresponding likelihood for z-score 10 is 7.6 · 10^{-24}; this is the probability of observing the amount of correlation seen in our dataset under the assumption that the compared orders are in fact independent. This indicates a very high degree of statistical significance, which is due to the large number of program units appearing in our analysis.
Discussion. The first obvious take-away from the values presented in Tables 2-5 is that all 72 reported distances (and of course also the average values) are below 0.5, many of them significantly so. This indicates that there is a significant similarity between the coupling orders of the static and the two dynamic analyses. This was not to be expected: While in small runs of a system, one could possibly conjecture that there might not be a large difference between the static and dynamic notions of coupling, this changes when we analyze longer system runs: In our longest experiment, we analyzed more than 2.4 billion method calls. The dynamic, weighted coupling degree of a class A is the number of calls from or to methods from A among these 2.4 billion calls, while its static, unweighted coupling degree is the number of classes B such that the compiled code of the software contains a call from A to B or vice versa. A single method call in the code is only counted once in an unweighted analysis, but this call can be executed millions of times during the experiment, and each of these executions is counted in the weighted, dynamic coupling analysis. Therefore, it was not necessarily to be expected that we observe correlation between unweighted static and weighted dynamic coupling degrees.
However, our results suggest that all of the three types of analyses that we performed are correlated, with different degrees of significance. In particular, dynamic weighted coupling degrees seem to give additional, but not unrelated information compared to the static case.
The static coupling order is closer to the dynamic unweighted than to the dynamic weighted order in almost all cases. This was expected: In a hypothetical "complete run" of a system, and in the absence of issues resulting from object-oriented features, these measures would coincide. On the other hand, the dynamic weighted analysis is very different from the static one by design.
A very interesting observation concerns the difference between the class and the package case. In all 36 cases except for 3 cases involving import coupling in our first two data sets, comparing c : β_1 ↔ β_2 to p : β_1 ↔ β_2 for the same pair of measurement approaches and the same coupling direction γ ∈ {i, e, c} shows that the distance in the package case is smaller than the corresponding distance in the class case, sometimes significantly so. A possible explanation is that in the package case, the object-oriented effects that are often cited as the main reasons for performing dynamic analysis are less present, as, e.g., inheritance relationships often hold between classes residing in the same package.

Threats to Validity
Concerning external validity, our analysis is limited by the fact that we covered only four runs, each lasting four weeks, of only one software system (Atlassian Jira). To address this threat, we plan to monitor additional software tools such as Jenkins and Tomcat (which are also used in the course). Concerning internal validity, our dynamic analysis omits some of Jira's classes in order to maintain sufficient performance of the system. To ensure that our comparisons in Section 6 are conclusive, we only considered the classes and packages covered by both the static and dynamic analysis in the computation of the Kendall-Tau distances. Additionally, different interpretations of what is considered as coupling in the static and in the dynamic analyses are always possible. However, since our notion of coupling is rather simple (method calls between different classes), we are confident that our static and dynamic analyses in fact use the same notion of coupling. Finally, as discussed in Section 4, we examine compiled code, not source code. When performing a similar analysis on source code, the differences between the static and the dynamic analyses would likely increase, as the dynamic analysis of course also uses compiled code. However, this can also be seen as an advantage, since this allows us to focus on the differences between static code and a running system, which is the goal of this study.

Conclusions and Future Work
We studied three different basic measurement approaches: static coupling, unweighted dynamic coupling, and weighted dynamic coupling. We performed four runs of an experiment that allows us to compare the dynamic metrics with static coupling measurements. Our results, as discussed in Section 6, suggest that dynamic coupling metrics complement their static counterparts: Despite the large (and expected) difference, there is also a statistically significant correlation. This suggests that further study of dynamic weighted coupling and its relationship with other coupling metrics is an interesting line of research.
A key question is how the additional information given by weighted dynamic coupling measurements can be used to evaluate the architectural quality of software systems, or more generally, to assist a software engineer in her design decisions. Coupling metrics can be used as recommenders for restructuring [11], and for static coupling measures, correlation between coupling and external quality has been observed [23]. A study of the relationship between static coupling measures and changeability and code comprehension has been performed in [33]. In [3], it is argued that unweighted dynamic metrics can be used for maintenance prediction. Since dynamic weighted metrics contain additional information compared to their unweighted counterparts, it will be interesting to study whether and how this additional information can be used in these contexts.