Multi-Level Formation of Complex Software Systems

Abstract: We present a multi-level formation model for complex software systems. Previous works extract software systems as software networks for further study, but usually investigate these networks only at the class level. In contrast to these works, our treatment of software systems as multi-level networks is more realistic. In particular, the software networks are organized by three levels of granularity, which represent the modularity and hierarchy in the formation process of real-world software systems. More importantly, simulations based on this model generate more realistic structural properties of software networks, such as power-law degree distributions, clustering and modularization. On the basis of this model, the relationship between the structure of software systems and software design principles is then explored, which could be helpful for understanding software evolution and software engineering practices.


Introduction
Many systems in nature and society reveal network organizations. These networks, such as biological protein networks [1], science collaborations [2,3], social networks [4] and the Internet [5], have been found to exhibit attributes such as scale-free and small-world structure. These discoveries emerge from the science of complex networks. Recent studies have revealed that object-oriented software systems share some structural attributes with these complex networks. Specifically, the networks of software systems are characterized by a scale-free degree distribution [6][7][8][9][10], a small-world structure (short average path length and high clustering) [11,12] and some other features [13][14][15][16][17][18]. This has motivated the study of software networks in recent years.
Software systems consist of many interacting units at several levels of granularity, such as methods, classes and subsystems [19]. The collaborations of these units in a software system can therefore be extracted and defined as a software network. Figure 1 shows a simple example of the extraction from a software system to a software network, in which the classes in the left figure are nodes and the collaborations of such nodes are edges. For software systems with a more complex structure, the corresponding software networks are organized to be highly functional, modularized [19] and evolvable [15]. This has led to further studies on software networks, such as community detection [20,21], quality assessment [10], important unit identification [22], bug classification [23] and developer social collaboration [24,25], which are helpful to various phases in software engineering practices.
During the whole production process of a software project, the design phase is the most critical stage because the structure of the units at different levels and the collaborations of such units are explicitly described in this process. These collaborations enable the detailed functional tasks to be integrated by many reusable basic units in a modular and hierarchical fashion [11]. However, some other crucial and persistent actions in the lifecycle of software systems, such as software maintenance, refactoring and adaptation, cannot be carried out in the task of software design, but lie in the formation of software systems. The goal of a software project is not only to build a software system that satisfies the functional requirements, but also to make the system convenient and economical to upgrade to new versions. This makes the evolution of software networks an increasingly important issue. However, while active research has been undertaken and many solid results have been obtained for understanding the formation mechanisms of natural and man-made systems, the same work has been only very sparsely performed on software systems, and little has been achieved about the cause-effect relationship between software engineering practices and the structure of these systems.
From the view of software engineering, software evolution is a process of meeting the dynamical requirement changes of the users. From the view of entropy, software evolution is a process of network structural change from chaos to order. As a typical open system, a software system has a structure that changes dynamically under an external driving force, and the changes reveal the status of system designing and coding [26]. Therefore, studying how software systems evolve can help in a number of areas, including software testing, software maintenance and program comprehension [27], and can further help to evaluate the system robustness and its ability to tolerate changes [28].
Based on various empirical studies [15][29][30][31][32][33][34][35], a few models of software network evolution, which describe the evolutionary mechanism from different perspectives, have been proposed. The models reported in [11][36][37][38][39] are respectively based on refactoring processes, the node aging effect, weighted networks and software patterns. These models perform well in some aspects of software network evolution, and the model in [39] can forecast evolution trends. However, they do not explicitly include some important qualities of software networks, e.g., modularity and hierarchy. In reality, software systems are characterized by high modularity [11,21,40], which corresponds to the important principle of high cohesion and low coupling in software design [41], and the practice of software architecture design assures that software networks are hierarchical and multigranular by nature [13][42][43][44]. In addition, most of these models represent software networks as an undirected graph, which is appropriate for some types of networks, such as the Internet and social networks. However, software networks are directed because dependence relations between basic units in software systems are unidirectional, and the direction is designated when the node is added to the network [45].
Here, we stress another salient fact about the evolution of software systems, which is missing in most of the existing models. With respect to evolution, these systems lie somewhere between natural systems, which are characterized by a bottom-up self-organizing process, and conventional engineering systems, whose upgrades are governed by a top-down central design, because software design is a social activity in which many designers and developers work together to carry out the task [24]. On the one hand, software architecture supports more autonomous attachments in comparison to other types of architecture, because objects are loosely coupled, and adding or removing non-core objects usually does not significantly affect the rest of the system. On the other hand, in enhancing software systems, some issues should be taken into consideration, such as reuse, maintenance, performance and optimization. They all call for a comprehensive viewpoint and overall design.
In view of the current situation discussed above, we aim at more accurately understanding the general mechanism that governs the evolution of software systems and exploring the cause-effect relationship between a variety of software development principles and the structure of software systems. To accomplish this goal, we have developed a multi-level model of software evolution, which represents software systems as directed networks and adopts a modular binding process for new component attachments.
The rest of this paper is structured as follows. In Section 2, we describe the multi-level model in terms of its structure and evolutionary mechanism. Section 3 exhibits simulation results based on the model and compares them to empirical data. In Section 4, we explore the implications of various software design principles for the structure of software systems. Section 5 concludes this work and presents possible future work.

Levels of Software Systems
Software systems are multi-level systems by nature. In this work, we consider three typical levels of software systems, which are outlined in Table 1.
We represent a software system on each level as a directed graph $G = \{V, E\}$. Here, $V$ is a set of elements that are termed nodes; $E$ is a set of ordered node pairs, each of which implies that the first node depends on the second one, and this relation of dependence is depicted as an edge that leaves the first node and enters the second. The sets of elements on Levels I, II and III are denoted respectively by $\{v^{I}_{i,j,k}\}$, $\{v^{II}_{j,k}\}$ and $\{v^{III}_{k}\}$. Therefore, the element $v^{I}_{i,j,k}$ on Level I is the i-th element of $v^{II}_{j,k}$ on Level II, and $v^{II}_{j,k}$ in turn is the j-th element of $v^{III}_{k}$ on Level III, which is the k-th element on this largest scale. In the following part of this section, we give more detailed descriptions of the three levels.

Level I
In software systems, some basic units, such as classes, encapsulate fundamental functions for constructing more complex and application-specific elements on larger scales. We term the scale corresponding to these units Level I.
Collaborations among these basic elements form the microstructure of a software system. Specifically, there are two types of dependencies between these elements: inheritance implies a relationship of "is a", and association corresponds to "has a". We follow the convention of software engineering in depicting a dependence: an edge is directed from Element B to Element A if B, in its definition, makes reference to (or is dependent on) A. In our analysis, repeated links are not considered.

Level II
In software practices, some combinations of classes or other basic units appear with much higher frequencies than would be expected by pure chance [42], although some do not work in modern software engineering [46]. These patterns, such as motifs, are usually composed of a few basic (Level I) elements. They are general repeatable solutions to some commonly-occurring problems in software design or have been reused over time in different systems to perform various information processing functions. Being building blocks of more complex software structures, they constitute a natural level between the basic units, such as classes, and the entire software system. We name it Level II.
In reality, Level II elements are usually composed of three or four Level I elements. An important fact is that Level II elements with few internal links appear more frequently than those with many internal connections [42]. The high probability of these sparse graphs is caused by the software engineering principle that coupling should be minimized [47]. The most commonly-used Level II elements are displayed in Figure 2.

Level III
Component-based software engineering has become a widely-adopted reuse-oriented approach to software development, and software evolution usually involves adding new components to existing systems. In comparison with single classes, which can be used only if detailed knowledge about them is available, components are more encapsulated, abstract and easy to use.
In our model, components lie on Level III. They contain different numbers of Level II elements and consequently different numbers of Level I elements, conforming to the empirical fact that the sizes of components vary from a few objects to whole applications.

The Mechanism of Software Evolution
Software networks keep growing in response to changing conditions and new requirements, as empirical studies on a large number of real software systems show [48,49]. Consequently, new functional modules are continually added into software systems, and these additions far outnumber the elements that are removed. The work in [44] further reported that, in real software systems, both elements and edges tend to grow on different levels simultaneously.
There is an empirical fact that, although newly-added elements have different functions and sizes, the numbers of the edges between them and the existing elements are quite close to one another. For simplicity, we consider these numbers as equal and treat them as a constant that is denoted by $R_e$. Its value can be obtained by averaging the corresponding values of the elements in different systems.
There are a few parameters in our model: the reuse probability Γ, common to elements on all three levels, expresses the general degree of reuse; the coupling ratio Λ is the ratio of the number of Level II edges that link the Level II elements of a Level III element to Level II elements in other Level III elements to the number of all Level II edges related to that Level III element; $N_I$ is the total size of the whole software system at the end of the evolution, in terms of the number of Level I elements; and $N^{III}_{I,\min}$ and $N^{III}_{I,\max}$ are the minimum and maximum sizes of the Level III elements, in terms of the number of Level I elements. The mechanism of evolution is depicted in Figure 3. One should note that an edge between two Level II elements is established because two Level I elements separately belonging to the two Level II elements are connected. Likewise, an edge between two Level III elements is established because two Level II elements separately belonging to the two Level III elements are connected. In the rest of this section, we present a more detailed explanation of the mechanism.
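The rule that edges on the higher levels follow from connections on the lower levels can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that each Level I node is stored as an index tuple (i, j, k), so that (j, k) identifies its Level II element and (k,) its Level III element; the function name `lift_edges` and the sample edges are hypothetical and not part of the model description.

```python
def lift_edges(level1_edges, level):
    """Project directed Level I edges onto Level II (level=2) or Level III (level=3).

    Each node is an index tuple (i, j, k). Dropping the leading indices
    yields the identifier of the enclosing element on the higher level.
    """
    if level not in (1, 2, 3):
        raise ValueError("level must be 1, 2 or 3")
    drop = level - 1
    lifted = set()
    for src, dst in level1_edges:
        a, b = src[drop:], dst[drop:]
        if a != b:  # edges inside one element do not appear on the higher level
            lifted.add((a, b))
    return lifted


# Hypothetical example: three Level I dependencies.
level1_edges = {
    ((0, 0, 0), (1, 0, 0)),  # inside one Level II element
    ((0, 0, 0), (0, 1, 0)),  # between Level II elements of the same component
    ((0, 1, 0), (0, 0, 1)),  # between two different components
}
level2_edges = lift_edges(level1_edges, 2)
level3_edges = lift_edges(level1_edges, 3)
```

Using a set also discards repeated links, which matches the convention that repeated links are not considered.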

Direction of Attachment
The direction of the edges that attach a new element is determined in the following manner: (1) with reuse probability Γ (0 ≤ Γ ≤ 1), the new element reuses an existing module and establishes an outgoing edge; (2) with probability 1 − Γ, an existing module depends on the new element, which receives an incoming edge. Γ is positively related to the general degree of reuse. We adopt a large value for Γ on account of the fact that, in software development practice, there is a strong inclination to reuse.
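As a minimal sketch of this rule, the direction of a new edge can be drawn as a single Bernoulli trial with parameter Γ. The function name `choose_direction` and the labels "out" and "in" are illustrative assumptions.

```python
import random


def choose_direction(gamma, rng):
    """Return "out" with probability gamma, otherwise "in".

    "out": the new element reuses an existing one (outgoing edge).
    "in":  the new element receives an incoming edge.
    """
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    return "out" if rng.random() < gamma else "in"


rng = random.Random(42)  # fixed seed only to make the sketch reproducible
draws = [choose_direction(0.8, rng) for _ in range(10000)]
share_out = draws.count("out") / len(draws)  # close to 0.8
```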

Probability of Attachment
In software engineering practices, the elements with high incoming dependencies usually have a simple structure and perform some fundamental functions. These elements have a greater probability of being reused and of receiving incoming links. In contrast, the elements with more outgoing dependencies, such as modules of user interfaces, usually have a more complex internal structure. They are more likely to depend on other elements and establish outgoing links. Due to their complexity, it is dangerous for the system if other elements depend on them.
Consequently, elements with larger in-degrees are more likely to receive incoming edges, while those with larger out-degrees are more likely to link to other elements with outgoing edges [45]. Therefore, we consider that the probability $P_{in}(D_{in})$ that an element receives an incoming edge is proportional to its in-degree $D_{in}$, and that the probability $P_{out}(D_{out})$ that an element establishes an outgoing edge is proportional to its out-degree $D_{out}$, i.e., $P_{in}(D_{in}) \sim D_{in}$ and $P_{out}(D_{out}) \sim D_{out}$.
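A degree-proportional choice of this kind can be sketched with weighted sampling from the Python standard library. The `offset` parameter is an assumption added here so that elements whose degree is still zero can be selected at all; the text does not state how such elements are handled, and `offset=0` gives the strictly proportional rule.

```python
import random


def pick_by_degree(degrees, rng, offset=1.0):
    """Pick one element with probability proportional to degree + offset.

    degrees: dict mapping element -> in-degree (to choose the receiver of an
             incoming edge) or out-degree (to choose the source of an
             outgoing edge).
    offset:  illustrative smoothing term, not part of the model description.
    """
    elements = list(degrees)
    weights = [degrees[e] + offset for e in elements]
    return rng.choices(elements, weights=weights, k=1)[0]


rng = random.Random(7)
in_degrees = {"util": 99, "gui": 0}  # hypothetical elements
picks = [pick_by_degree(in_degrees, rng) for _ in range(5000)]
share_util = picks.count("util") / len(picks)  # close to 100 / 101
```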

Level III
We assume that $N_I$ Level I elements will be added to the existing network. New Level III elements $\{v^{III}_{k}\}$ with a random number of Level I elements will be generated and added, one by one, to the existing network, until the total number of Level I elements of the whole system reaches $N_I$. The size of each Level III element $v^{III}_{k}$ is between $N^{III}_{I,\min}$ and $N^{III}_{I,\max}$, and there are $R_e$ edges that connect this element to other existing Level III elements.
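The growth loop on Level III can be sketched as follows. Two details are assumptions made only for this illustration: component sizes are drawn uniformly between the minimum and maximum size, and the last component is truncated so that the total matches the target exactly. The text specifies neither the size distribution nor the stopping detail.

```python
import random


def generate_component_sizes(n_total, size_min, size_max, rng):
    """Draw Level III component sizes until n_total Level I elements exist."""
    if not 1 <= size_min <= size_max:
        raise ValueError("need 1 <= size_min <= size_max")
    sizes = []
    remaining = n_total
    while remaining > 0:
        size = rng.randint(size_min, size_max)  # assumption: uniform sizes
        size = min(size, remaining)             # assumption: truncate the last one
        sizes.append(size)
        remaining -= size
    return sizes


sizes = generate_component_sizes(1000, 32, 64, random.Random(1))
```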

Level II
The edges of each Level II element are of two types: internal edges connecting it to other elements in the same Level III element and edges that link to elements in other Level III elements. We use the parameter named coupling ratio Λ for the proportion of the second type of edges. A high (low) value of Λ therefore corresponds to high coupling and low cohesion (low coupling and high cohesion).
When an edge is added between the new Level III element v III k and an existing Level III element, R e N III II Λ pairs of Level II elements between the two Level III elements are linked through Level II edges with the same direction as the Level III edge. The probabilities for each pair of Level II elements to get an incoming edge and an outgoing edge are P in (D in ) and P out (D out ), respectively.
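For a given network, the coupling ratio of one Level III element can be measured directly from its Level II edges. The sketch below assumes that Level II elements are identified by index pairs (j, k), where k is the index of the enclosing Level III element; the function name and sample data are hypothetical.

```python
def coupling_ratio(level2_edges, component):
    """Share of the Level II edges related to `component` that leave it.

    level2_edges: iterable of ((j, k), (j2, k2)) directed Level II edges.
    component:    index k of the Level III element of interest.
    """
    related = [e for e in level2_edges
               if e[0][1] == component or e[1][1] == component]
    if not related:
        return 0.0
    external = [e for e in related if e[0][1] != e[1][1]]
    return len(external) / len(related)


level2_edges = [
    ((0, 0), (1, 0)),  # internal to component 0
    ((1, 0), (2, 0)),  # internal to component 0
    ((0, 0), (0, 1)),  # links component 0 to component 1
    ((0, 1), (1, 1)),  # internal to component 1
]
ratio_0 = coupling_ratio(level2_edges, 0)  # 1 of 3 related edges is external
ratio_1 = coupling_ratio(level2_edges, 1)  # 1 of 2 related edges is external
```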

Level I
Empirically, a Level II element is composed of three or four Level I elements and belongs to one of 14 types (see Figure 2). When a Level II element is generated, the number of internal Level I edges depends on the type of the Level II element. For simplicity, we assume that the 14 types of Level II elements appear with equal probability. Consequently, we can get the average number of Level I elements $\bar{N}^{II}_{I} = \frac{1}{n} \sum_{j=1}^{n} N^{II}_{I,j}$ and the average number of Level I edges $\bar{E}^{II}_{I} = \frac{1}{n} \sum_{j=1}^{n} E^{II}_{I,j}$. We assume that there are $N^{III}_{II}$ Level II elements in the current Level III element, and that $E^{II,I}$ interacting Level I edges are added when a Level II edge is added. The total number of Level I edges is equal to the sum of the total number of internal edges and the total number of interacting edges. We can therefore simply consider that each Level II element contains $\bar{N}^{II}_{I}$ Level I elements and $\bar{E}^{II}_{I}$ Level I edges. When an edge is added between the new Level II element $v^{II}_{j,k}$ and an existing Level II element, $E^{II,I}$ pairs of Level I elements between the two Level II elements are linked by Level I edges with the same direction as that of the Level II edge. The probabilities for each pair of elements to get the incoming edge and the outgoing edge are $P_{in}(D_{in})$ and $P_{out}(D_{out})$, respectively.

Simulation Results
In this section, some essential results of simulations based on our multi-level model are displayed. The simulations were undertaken with respect to the structural properties of our simulated software network. We explored the influences of four parameters: coupling ratio Λ, reuse probability Γ and the minimum size $N^{III}_{I,\min}$ and the maximum size $N^{III}_{I,\max}$ of the Level III elements. The values for $E^{II,I}$ and $R_e$ are set to the average values of their corresponding empirical observations. To validate the model, the simulation results are compared to data from some real-world software systems, such as Blender, Doxygen, Eclipse, etc. More importantly, these simulations enable a more comprehensive understanding of the evolutionary mechanisms under study.

Degree Distributions
The degree of an element, $D_{i,j,k}$, is the number of edges attached to it. Correspondingly, the in-degree $D^{in}_{i,j,k}$ and the out-degree $D^{out}_{i,j,k}$ are respectively the number of links that enter it and the number of links that exit it.
In this study, the p-value and $x_{\min}$ are used to assess the goodness-of-fit for degree distributions [50] (the code can be found in [51]). The first metric, the p-value, represents the mathematical "distance" between the power-law distribution and the distribution of the actual network. The previous study reports that a power-law distribution is plausible for the data if p-value ≥ 0.1; otherwise, the power-law hypothesis is rejected [50]. Moreover, the degree distribution has some non-power-law behavior at the lower end; thus, we use the metric $x_{\min}$ to delimit the part of the degree distribution that represents power-law behavior. It is reported that the power-law distribution is more plausible if the value of $x_{\min}$ is smaller [50].
Figure 4 shows the simulated distribution of the degrees of the Level I elements, and the corresponding exponents γ, p-values and $x_{\min}$ values can be found in Table 2. For comparison, in the same figure, we also plotted the degree distribution of the real software systems. It can be seen that both the simulations and the real software networks show a power-law feature, and the degree distributions of the simulations are close to those of real software networks because the values of the exponents γ are close to the values of real software networks.
The power-law degree distribution is an important feature of complex networks. It indicates that the degrees of most of the nodes are small, while a small number of nodes have large degrees [52]. In software networks, the elements with a small degree can benefit from the function decomposition [53]. In contrast, the nodes with a large degree are crucial to achieving complex tasks and frequently interact and exchange data with other nodes. Therefore, failures of these nodes with a large degree could greatly affect the system.
On the other hand, we know that software networks are directed; thus, the in-degree and out-degree distributions can also represent the interaction characteristics of the nodes. Figure 5 and Table 3 show the simulated distribution of the in-degrees of the Level I elements, in comparison with the in-degree distribution of the real software systems. Similar to the degree distributions, the in-degree distributions of the simulations and real software networks follow a power-law, and the differences between them are small according to the exponents γ and the goodness-of-fit p-values. Additionally, Figure 6 and Table 4 show the out-degree distributions of these networks. Though the distributions are also close between the simulations and real software networks, the goodness-of-fit of the power-law is not sufficient for some networks. Therefore, the out-degree distributions of only some software networks follow a power-law.
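To make the role of the exponent γ and of the lower cutoff concrete, the following sketch estimates γ for a list of degrees with a widely used approximate maximum-likelihood formula for discrete power-law data, $\hat{\gamma} \approx 1 + n \left[ \sum_{i} \ln \frac{x_i}{x_{\min} - 1/2} \right]^{-1}$, where the sum runs over the n values with $x_i \geq x_{\min}$. This is not necessarily the procedure implemented in the code of [51]; the approximation is usually regarded as reliable only when the cutoff is not too small, and the sketch computes neither the p-value nor an optimal cutoff. The function name is hypothetical.

```python
import math


def estimate_exponent(degrees, x_min):
    """Approximate maximum-likelihood estimate of a discrete power-law exponent.

    Only degrees >= x_min (the power-law tail) enter the estimate.
    """
    if x_min < 1:
        raise ValueError("x_min must be at least 1")
    tail = [d for d in degrees if d >= x_min]
    if not tail:
        raise ValueError("no degrees at or above x_min")
    log_sum = sum(math.log(d / (x_min - 0.5)) for d in tail)
    return 1.0 + len(tail) / log_sum


# Hypothetical degree sequence; values below the cutoff are ignored.
gamma_hat = estimate_exponent([1, 2, 3, 10, 12, 15, 40], x_min=10)
```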

Correlation between In-Degree and Out-Degree
In comparison with some other complex networks, software networks display an important characteristic: in-degrees and out-degrees of elements are negatively correlated [11].
Figure 7 is a scatter plot of the simulated Level I in-degrees against the corresponding out-degrees of all Level I elements. For comparison, in the same figure, we also plotted the same types of data obtained from the real software systems. This figure shows that the results generated by our model are in line with empirical data. We can see that the elements with larger in-degrees have smaller out-degrees, while the nodes with large out-degrees have smaller in-degrees. To quantify this, we use the correlation coefficient $R(D_{in}, D_{out})$ between in-degrees and out-degrees. The correlation coefficients of the in-degree set and the out-degree set (respectively for all nodes and for the elements with $d_{in} \geq 10$ or $d_{out} \geq 10$) for simulations of the multi-level model and real software networks are shown in Table 5. For all nodes, most of the coefficients for simulations (−0.2 ∼ −0.05) are close to those of the real software networks (−0.2 ∼ 0), though the negative correlations are not obvious. For the elements with $d_{in} \geq 10$ or $d_{out} \geq 10$, however, the coefficients for simulations (−0.6 ∼ −0.4) are also close to those of the real software networks (−0.6 ∼ −0.3) and are clearly negative.
This negative correlation can be accounted for by some principles of software development. In a software system, elements with a large in-degree usually perform fundamental or commonly-used functions. These elements are therefore more likely to be reused. Conversely, elements with a large out-degree usually accomplish specific tasks. Therefore, they are less likely to be aggregated by other elements.
As shown in Figure 7, our model also reproduced another feature of real software systems, i.e., the largest out-degrees of the nodes are always much smaller than the largest in-degrees. In contrast, the BA model is unable to generate this attribute.
The reason for this feature is that elements that have a larger probability of being reused tend to have a higher in-degree, while existing elements are not easy to aggregate into new elements. Additionally, a new element is more likely to reuse an element with many incoming links than a complex element with many outgoing links. Software engineering practice encourages reuse, which leads to large in-degrees. Conversely, an element is discouraged from having a large out-degree, because this leads to highly complicated structures and hinders maintenance.
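A correlation of this kind can be computed from an edge list as sketched below. The text does not state which correlation coefficient is used, so the Pearson coefficient is an assumption here; the `min_degree` argument mirrors the restriction to elements whose in-degree or out-degree reaches a threshold, and all names are hypothetical.

```python
import math
from collections import Counter


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / math.sqrt(vx * vy)


def degree_correlation(edges, min_degree=0):
    """Correlation between in-degree and out-degree over the selected nodes.

    edges: iterable of (source, target) pairs of a directed graph.
    Nodes are kept if their in-degree or out-degree is >= min_degree.
    """
    edges = list(edges)
    in_deg = Counter(dst for _, dst in edges)
    out_deg = Counter(src for src, _ in edges)
    nodes = [v for v in set(in_deg) | set(out_deg)
             if in_deg[v] >= min_degree or out_deg[v] >= min_degree]
    return pearson([in_deg[v] for v in nodes], [out_deg[v] for v in nodes])


# Hypothetical star: five elements reuse one fundamental element.
star = [("leaf%d" % i, "hub") for i in range(5)]
r_star = degree_correlation(star)  # perfectly negative correlation
```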

Level of Clustering and Modularity
In software design, cohesion and coupling reflect the interactions between modules of software systems. Cohesion is a property of a single module and represents the degree to which the units within the module are related, while coupling is a property of a pair of modules and represents the degree of relationship between such modules [41]. It is well known that modularized software systems are much easier to develop and maintain, and a well-modularized software system usually has a high degree of cohesion and a low degree of coupling [19,20].
According to previous studies, the metrics of the clustering coefficient and modularity are used to represent the degree of cohesion and coupling for software networks [11,21]. The clustering coefficient of the entire network is a measure of the degree to which nodes in the network tend to cluster together, and it represents the tendency of the nodes' neighbors to be their common neighbors in a network [11]. Modularity measures how well a network is divided into modules, and a good division has more edges within modules and fewer edges between them [54]. Comparatively speaking, the clustering coefficient tends to describe the clustering of a node and its neighbors, while modularity emphasizes the goodness of the module division.
The clustering coefficient C of a network is the average of the clustering coefficients of all of its nodes [55]:

$$C = \frac{1}{N_I} \sum_{i,j,k} \frac{2\, t_{i,j,k}}{q_{i,j,k}\,(q_{i,j,k} - 1)},$$

in which $N_I$ is the number of Level I nodes, $q_{i,j,k}$ is the number of nearest neighbors of node $v_{i,j,k}$ and $t_{i,j,k}$ is the number of connections between them. If the value of C is larger, the network tends to have a higher degree of cohesion and a lower degree of coupling. Real software systems are modular, and the clusters represent units that collaborate to carry out the same task [56]. We choose the sample software systems Blender (written in C++) and Eclipse (written in Java) and generate 10 simulated networks for each of them for comparison; the results are shown in Tables 6 and 7. Table 6 shows that the C value from our model is close to the value of the real-world software system. The reason is that the networks generated by the model are modular and have high cohesion.
The work in [21] shows by empirical studies that software networks have a community structure, and thus, it is verified that software networks are modularized and that each consists of a network of interdependent parts [57]. Therefore, we use the metric of modularity Q, which is defined as the fraction of the edges within the divided groups minus the expected fraction of such edges in a network formed in a random way [54], to measure the modularity of software networks. The mathematical definition of modularity Q [58] is:

$$Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j),$$

where $A_{ij}$ denotes the weight of an edge between a node $v_i$ and a node $v_j$ (the weight is one in this case) in the graph, $k_i = \sum_j A_{ij}$ is the sum of the weights of the edges attached to the node $v_i$, $c_i$ is the community to which the node $v_i$ is assigned, the function $\delta(u, v)$ is one if $u = v$ and zero otherwise and $m = \frac{1}{2} \sum_{i,j} A_{ij}$. The larger the value of Q, the higher the degree of cohesion of a network.
The examples of Blender and Eclipse are also used here to study the modularity of the simulations, and the results can be found in Table 7. We can see that the values of modularity are close between the real software networks and simulations. Moreover, the results also demonstrate that the model can produce software networks following the principle of high cohesion and low coupling.
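Both measures can be illustrated on a small example with the Python standard library alone. The sketch rests on two assumptions that are not stated in the text: the directed software network is treated as undirected and unweighted for both measures, and nodes with fewer than two neighbors contribute a clustering coefficient of zero. The modularity function evaluates the standard definition of Q community by community, which is algebraically the same as the double sum over node pairs; all function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def undirected(edges):
    """Drop direction, self-loops and repeated links."""
    return {frozenset(e) for e in edges if e[0] != e[1]}


def clustering_coefficient(edges):
    """Average over nodes of 2t / (q (q - 1)); nodes with q < 2 count as 0."""
    nbrs = defaultdict(set)
    for pair in undirected(edges):
        u, v = tuple(pair)
        nbrs[u].add(v)
        nbrs[v].add(u)
    if not nbrs:
        return 0.0
    total = 0.0
    for node, ns in nbrs.items():
        q = len(ns)
        if q < 2:
            continue
        t = sum(1 for a, b in combinations(ns, 2) if b in nbrs[a])
        total += 2.0 * t / (q * (q - 1))
    return total / len(nbrs)


def modularity(edges, community):
    """Q = sum over communities of L_c / m - (d_c / 2m)^2.

    L_c: number of edges inside community c; d_c: total degree of its nodes.
    """
    pairs = undirected(edges)
    m = len(pairs)
    if m == 0:
        return 0.0
    inside = defaultdict(int)
    degree_sum = defaultdict(int)
    for pair in pairs:
        u, v = tuple(pair)
        degree_sum[community[u]] += 1
        degree_sum[community[v]] += 1
        if community[u] == community[v]:
            inside[community[u]] += 1
    return sum(inside[c] / m - (degree_sum[c] / (2.0 * m)) ** 2
               for c in degree_sum)


# Two triangles joined by a single edge, one community per triangle.
edges = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (2, 3)]
community = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
c_value = clustering_coefficient(edges)  # 7/9
q_value = modularity(edges, community)   # 5/14
```

In this example, four nodes have a clustering coefficient of 1 and the two bridge nodes have 1/3, so C = 7/9; each community holds 3 of the 7 edges and half of the total degree, so Q = 2(3/7 − 1/4) = 5/14.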

Cohesion and Coupling
In the model, the coupling ratio Λ represents the probability that a new edge connects two nodes in different modules when new nodes are added to the existing network. In particular, a larger value of Λ means a larger proportion of edges between nodes in different modules, which indicates that the nodes are more likely to connect to nodes in other modules. Conversely, a smaller value of Λ means a smaller proportion of edges between nodes in different modules, which indicates that the nodes are more likely to connect to nodes in the same module.
We generate three groups of evolving networks (Γ = 0.8, $N^{III}_{I,\min}$ = 32, $E^{II,I}$ = 48 and $R_e$ = 2) with three different values of Λ, in order to discuss the clustering and modularity of the software networks for different coupling ratios. Figure 8 shows the negative correlation between the two measurements, clustering coefficient C and modularity Q, and the coupling ratio Λ. We can see that the values of C lie in the range of 0.1 ∼ 0.11, and the values of Q are around 0.85, when Λ = 0.1. However, the values of C decrease to the range of 0.05 ∼ 0.055, and the values of Q decrease to around 0.65, when the value of Λ rises to 0.5. Therefore, the clustering coefficient and modularity decrease as the coupling ratio becomes larger. Moreover, the above correlation is confirmed by the topological structure of the networks. For the sake of better visualization, we generate three simulated networks (Γ = 0.8, $N^{III}_{I,\min}$ = 32, $E^{II,I}$ = 48 and $R_e$ = 2) with 1000 nodes for different values of the coupling ratio Λ and display their topological structure with Gephi (version 0.9) [59] in Figure 9. It can be seen clearly that the network shows obvious modularization when Λ = 0.1: the nodes within the same modules connect more tightly, while the connections between the nodes in different modules are sparse. When the value of Λ becomes 0.3, the network still displays the feature of modularization, and the nodes interact more tightly with their neighbors within the modules than with the nodes in other modules, although the modularization is clearly reduced. When the value of Λ reaches 0.5, a modular structure can no longer be found in the generated network.

Reuse and Modularity
Software engineering practice encourages code reuse. However, reuse could lead to over-coupling. In the case of a fixed amount of interactions across modules, if the value of Γ is large, more elements are reused as new elements are added. Conversely, if the value of Γ is small, existing modules aggregate more new elements.
Table 8 shows the top five largest out-degrees of the Level I network for different values of the reuse probability Γ and the corresponding numbers of the Level I elements. We can see that more elements with large out-degrees appear as the value of Γ decreases. This poses a serious risk to global functions and decreases the evolvability of the software system, due to the complicated structure caused by many outgoing edges. Figure 9 also shows that the level of modularity is negatively related to the coupling ratio Λ. In software engineering practice, the degree of modularity is governed by the trade-off between reuse and maintainability. In order to promote reuse, fine-grained and self-contained components should be used. If maintainability is a critical requirement, the coupling among components should be minimized by adopting relatively large-grained, highly cohesive components.

Influence of Motifs on Software Structure
In addition to modularity, software networks share another important feature with many other types of systems, such as biological networks. They show recurring patterns on a small scale, i.e., motifs. It has been conjectured that the abundance of motifs in software networks relates to universal mechanisms underlying software evolution [42].
Based on our multi-level modeling, we have explored the impact of the existence of motifs on the structure of software systems. We undertake this exploration by comparing the simulated network with 10,000 nodes produced by our model in the normal setting (Λ = 0.35, Γ = 0.8, $N^{III}_{I,\min}$ = 32, $E^{II,I}$ = 64 and $R_e$ = 2) with the networks generated by two special settings.
In the first special setting, all of the Level I elements within each Level II element are connected through the same mechanism as the Level II elements in each Level III element. The network generated by this rule also shows clustering, though the clustering coefficient C is smaller than in the simulation with motifs, as shown in Figure 10. However, this formation rule does not conform to software engineering practices. In practice, existing classes and components are usually reused, which is the case of basic code reuse in software design. Besides, local structures are also duplicated in some scenarios, such as design patterns. If the latter is ignored in software design, well-designed micro-structures (such as design patterns) will not be widely used, and the designers have to design many repeated scenarios and workflows. Thus, the motifs represent micro-structure reuse in software system evolution.
In the second special setting, random sub-graphs are used to replace the motifs. The result shows that the simulated network has much stronger clustering than the simulation using motifs, as shown in Figure 10. The motifs used in software network formation are usually sparse sub-graphs, which are composed of three or four nodes and two or three edges. Conversely, the random sub-graphs include some dense sub-graphs, which increase the clustering in the network formation. Most of the edges in these sub-graphs are redundant connections, and this could lead to unnecessary cost. Even worse, the McCabe cycles, which result in increasing complexity and decreasing stability, will propagate in the evolution of the software systems. This discussion therefore indicates that the motifs keep the overall software system in reasonable cohesion and with structural stability.

Conclusions
The main contribution of this paper is a multi-level model for software network evolution. In this model, three levels of elements, namely the class level, the design pattern level and the framework level, are used to describe the organization of software systems. Comparisons with real software networks from several aspects show that the model closely describes the formation process of real software systems. Furthermore, with the help of this model, we discuss some principles of software engineering practice, such as the relation between cohesion and coupling, code reuse and modularity, and the influence of motifs on software structure. This model could help us to understand the formation of complex software systems and potentially to forecast changes in software structure.
However, some limitations may restrict the use of the model. The parameters used in this model are obtained from the historical data of the source code. This means that the model may not correctly describe structural changes caused by dramatic modifications of the software architecture. In addition, empirical studies show that the numbers of nodes and edges usually keep increasing in most software projects, but sudden reductions in the numbers of nodes and edges cannot be ruled out in some projects for unpredictable reasons. Besides, some large-scale software systems may be organized not into three levels but into four or more, so how to dynamically describe the levels of the software network structure remains an open question.
Thus, further studies are still needed. Firstly, more software projects should be investigated, especially software systems written in the C language. Secondly, many projects have been terminated for various reasons; studying the structural changes of these software systems may be worthwhile and could lead to improvements of the model. Finally, the model may potentially be used to describe the formation of other complex systems (such as the Internet, social networks and biological systems), and thus, it is worth generalizing the model to multi-level complex systems.

Figure 1. An example of the extraction from a software system to a software network. (a) The Unified Modeling Language (UML) graph of a simple software program; (b) the software network extracted from the software program in (a).

Figure 2. Most commonly used motifs in software networks. (a-d) are 3-node motifs; (e-n) are 4-node motifs. All of the motifs are sparse graphs and contain no directed cycles.

Figure 3. Structure and evolutionary mechanism of a multi-level software system.

Figure 8. Negative correlation between two measurements and the coupling ratio Λ. (a) The correlation between the clustering coefficient C and the coupling ratio Λ; (b) the correlation between the modularity Q and the coupling ratio Λ.

Figure 10. The influence of motifs on the clustering of Level I networks.

Table 1. Levels of a software network and their elements.

1
Determine the values of $N_I$, $N^{III}_{I,\min}$, $N^{III}_{I,\max}$, Λ and Γ. The existing element is selected with probability $P_{in}(D_{in})$ or $P_{out}(D_{out})$. Attach $v^{III}_k$ to an existing Level III element with directions depending on Γ. It is selected with probability $P_{in}(D_{in})$ or $P_{out}(D_{out})$. $v^{III}_k$, and link to an existing Level II element in the existing Level III element with probability $P_{in}(D_{in})$ or $P_{out}(D_{out})$.
the evolutionary process, when a new Level II element $v^{II}_{j,k}$ is added into a Level III element $v^{III}_k$, $R_e \Lambda$ internal edges form between $v^{II}_{j,k}$ and other Level II elements in $v^{III}_k$. Since $v^{III}_k$ contains $N^{III}_{II}$ Level II elements, there will be $N^{III}_{II} R_e (1 - \Lambda)$ internal edges and $N^{III}_{II} R_e \Lambda$ edges connecting different Level II elements within $v^{III}_k$. When an edge is added between the new Level III element $v^{III}$

Table 2. Correlation coefficients of the degree distributions produced by our multi-level model and real software systems, with the corresponding p-value and $x_{\min}$.

Table 3. Correlation coefficients of the in-degree distributions produced by our multi-level model and real software systems, with the corresponding p-value and $x_{\min}$.

Table 4. Correlation coefficients of the out-degree distributions produced by our multi-level model and real software systems, with the corresponding p-value and $x_{\min}$.

Table 5. Correlation coefficients of the in-degree set and the out-degree set (for all nodes and for the elements with $d_{in}$ ≥ 10 or $d_{out}$ ≥ 10, respectively) for simulations of the multi-level model and real software networks. Sub-Network_1 and Sub-Network_2 denote the sub-networks of elements with $d_{in}$ ≥ 10 or $d_{out}$ ≥ 10 in the simulations of the multi-level model and in the real software networks, respectively.

Table 6. Clustering coefficients of Blender and Eclipse, with the corresponding simulations produced by the model, for 8 versions.

Table 7. The values of modularity of real software networks (Blender and Eclipse), with the corresponding simulations produced by the model, for 8 versions.

Table 8. Numbers of Level I elements with large out-degrees for different values of the reuse probability Γ. The numbers 1 to 5 correspond to the largest to the fifth-largest out-degrees, respectively.