Article

Package Network Model: A Way to Capture Holistic Structural Features of Open-Source Operating Systems

College of Computer, National University of Defense Technology, Changsha 410073, China
*
Author to whom correspondence should be addressed.
Symmetry 2019, 11(2), 172; https://doi.org/10.3390/sym11020172
Submission received: 11 December 2018 / Revised: 16 January 2019 / Accepted: 29 January 2019 / Published: 1 February 2019
(This article belongs to the Special Issue Information Technology and Its Applications 2021)

Abstract

Open-source software has become a powerful engine for the development of the software industry. Its production mode, which is based on large-scale group collaboration, allows for the rapid and continuous evolution of open-source software on demand. As an important branch of open-source software, open-source operating systems are commonly used in modern service industries such as finance, logistics, education, medical care, e-commerce, and tourism, and their reliability is increasingly valued. However, a self-organizing and loosely coupled development approach complicates the structural analysis of open-source operating system software. Traditional methods focus on analysis at the local level, and there is little research on the relationship between internal attributes and external overall characteristics. Consequently, conventional methods are difficult to apply to complex software systems, especially to the structural analysis of open-source operating system software. It is therefore of great significance to capture the holistic structure and behavior of the software system. Complex network theory is well suited to this task and can make up for the deficiency of traditional software structure evaluation methods that focus only on local structure. In this paper, we propose a package network model, a directed graph structure, to describe the dependencies among open-source operating system software packages. Based on the Ubuntu Kylin Linux operating system, we construct a software package dependency network for each released version and analyze its structural evolution through the dimensions of scale, density, connectivity, cohesion, and heterogeneity.

1. Introduction

With the extensive application of software in daily production and life, software complexity has increased dramatically. Complexity has become one of the elementary attributes of software systems and runs through the whole life cycle of software analysis, design, development, testing, and maintenance. It is difficult to analyze the structure of a sophisticated software product and the relationships between its components, which further increases the difficulty of developing, maintaining, extending, and upgrading it. Researchers have studied a large number of open-source software systems and found that the complexity of a software system is one of the primary factors leading to software errors. Thus, exploring the internal rules of complex software systems and identifying the key nodes that lead to vital problems will play an important part in the software development process [1,2,3].
The complexity of software systems stems from manifold aspects [4,5]. First, subjects described by software systems are increasingly complicated. Moreover, diverse application fields lead to completely different internal structures and functional logic. Secondly, software systems themselves are progressively intricate. Systems generated by various platforms retain their particular architecture by virtue of the unique compilation mechanism of each platform. On the other hand, comprehension differences exist among developers towards the same software requirements. A variety of program languages, computation modes, and application modes can be selected for software implementation. Furthermore, the complexity of computer hardware results in a consideration of hardware resources during the development of software systems. Thirdly, the complexity of a software system is derived from continuous updating and upgrading, namely, the process of software evolution.
Software structural change is a common phenomenon, and measuring and controlling software structure has always been a goal of software designers and developers. Therefore, how to effectively analyze and control the internal structure of software is pivotal to understanding and measuring the complexity of software as well as its evolution. Current software complexity metrics generally focus either on the program text, such as the number of code lines and blank lines, or on the control and data flow diagrams of software modules, such as cyclomatic complexity, essential complexity, and the number of code paths. Prevailing approaches include McCabe, Halstead, C&K, and MOOD. McCabe transformed the control flow of a program into a directed graph and measured complexity by counting the number of linearly independent directed loops. Different from McCabe, Halstead proposed a data flow-based method and evaluated software complexity by counting the operators and operands in a program. With the development of object-oriented technology, Chidamber and Kemerer put forward an inheritance tree-based approach that estimates the complexity of object-oriented software at the granularity of classes using six indicators: the number of weighted methods in a class, the depth of the inheritance tree, the number of subclasses, the degree of coupling between objects, the number of external function calls in a class, and the degree of cohesion among the methods inside a class. Brito introduced the MOOD approach, which measures the encapsulation, inheritance, coupling, and polymorphism of object-oriented software to reflect its complexity. The above traditional measurement methods describe the complexity of software from different aspects. All of them focus on analyzing the local structure and characteristics of functional individuals in the software system, such as classes and methods, and lack a global measurement of software structure [6,7,8,9,10].
The emergence of complex systems and complex networks [10,11,12,13], which emphasize a holistic approach to the system rather than focusing on local aspects, has provided a valuable perspective and a unique research dimension for understanding a software system. Unlike the traditional “reduction method” used in software development, the complex system theory emphasizes the global features of a system. Generally speaking, complex systems tend to give rise to new features that are not intentionally implemented by the system developer, and these features exist only at the system level. These emergent properties cannot be observed at the lower levels and in the local parts of a system. There are many real examples of this phenomenon in nature [11], such as the social activities of ants and geese, which demonstrate abilities that one or several ants or geese cannot achieve on their own. The same is also true of software systems; while a single class or module can accomplish only a limited amount of functionality, all classes or modules interact cooperatively within the system to achieve the desired functionality of the user. Therefore, studying these emergent characteristics can provide valuable perspectives and different research dimensions for understanding software systems [14,15,16,17].
In 2002, Valverde et al. first introduced the complex network method to study the structure of software systems and found that the system structure of JDK 1.2 and Ubi Soft Pro Rally2002 both exhibited obvious “small world” and “scale-free” characteristics. Studies have been conducted into selected software systems written in Java and research carried out at the class level as well [18,19,20,21,22,23,24,25].
Open-source operating systems are obviously dynamic and interdependent. The development of an open-source operating system involves more uncertainty and complexity than other engineering projects. The internal structure of the system and its intricate interactions have gradually exceeded the comprehension of software developers. On the one hand, bug corrections and new function additions to the open-source project are in continuous iteration, so the version is in a state of constant change. On the other hand, open-source platforms emphasize software reuse, which is not restricted to internal projects [24,25,26,27,28]. This reusing causes the dependencies between applications, and between applications and operating systems, to become more complex. Currently, mainstream open-source operating systems abstract installable software units into software packages. However, the system not only involves the stacking of software packages, but also the orderly combination of these packages. It is worth exploring further so as to describe and study the internal structure of open-source operating systems.
This paper takes open-source operating system software as the object of analysis and proposes a network model of the software package. We extract the dependencies of software packages and describe the internal structure of open-source operating systems by treating software packages as nodes and dependencies as edges. In brief, this paper makes three main contributions:
This paper constructs a software package dependency network for an open-source operating system, which captures the overall structure of the system.
This paper takes Ubuntu Kylin Linux [29] as an example and analyzes the evolution of the software package dependency network through the dimensions of the scale, density, connectivity, cohesion, and heterogeneity of each distributed version.
This paper proposes a betweenness-based method in order to exploit the key nodes of an open-source operating system software package dependency network.
The remaining parts of this paper are structured as follows. Section 2 describes the construction of the software package network. Section 3 gives a detailed analysis of the software package dependency network evolution of Ubuntu Kylin Linux. Section 4 proposes a betweenness-based method to mine the key nodes of the above networks. Finally, conclusions are given in Section 5.

2. Software Package Dependency Network

2.1. Software Packages

At present, mainstream open-source operating systems abstract installable software units into software packages and provide a corresponding software package management and distribution system to manage various interdependent software packages, as well as assisting users to obtain, install, delete or update required software packages [30,31,32,33,34,35]. A software package contains the program, data, and corresponding configuration files of the published software, along with some metadata that describes the name, version, dependency and other information about the software package that can be used by the software package management and distribution system. The software package acts as an independent module in the operating system platform to achieve a comprehensive function. Developers develop packages through a front-end text editor, and it is up to the operating system distributors to decide which packages can be integrated into the corresponding version of their operating system.
A complete software package management system includes the distribution and management of software packages. Software package distribution is maintained by means of open-source distribution platforms such as Debian and Gentoo. According to its specific characteristics, each platform imports/updates the source code of open-source projects by its maintainer. The resulting software packages are kept in the storage pool and marked according to their architecture and version number. The distribution part is the bridge that connects the open-source software project to the end users, providing the service of obtaining and downloading the software packages through the network. Package management involves parsing the package format and content on the client side, as well as implementing the specific installation, update and deletion of packages. When the package management software is dealing with the dependency of newly installed software packages, it can obtain the required software packages from the distribution storage pool with the help of the services provided by the distribution part, so as to realize the automation of client operation. Figure 1 presents the management structure of open-source packages [34,35].

2.2. Dependency of Software Packages

The main body of a software package is a set of functions. Because packages are developed in an open-source environment, their source code can be reused and modified. Functions in an existing software package can either be called directly or modified into new functions. In software package systems, the former is defined as a dependency, while the latter is not defined or documented. Dependencies must be stated in the description file of the new package so that the software management system can automatically load the packages it depends on when the new package is running. This dependency is the lifeblood of software package expansion.
One notable aspect of dependency is the dependency on a function library. As a vital component of the operating system, a function library is used to realize various specific functions. Function libraries exist on computers in the form of library files, and different operating systems have different ways of organizing these files. An operating system’s standard function libraries usually achieve relatively basic functions, which are developed and maintained by professional technicians. These function libraries can be repeatedly called up by other programs, and the operation of each application depends on a variety of elementary function libraries.
An operating system is a complex set of elements that interact, correlate, or depend on each other. In this paper, the dependencies of software packages can be regarded somewhat similarly to references in academic papers. More specifically, if package A relies on package B, this means that package A directly calls the functions in package B, and the operating system platform automatically loads package B when package A is running.

2.3. Complex Network

A complex network aims to express complicated systems in the real world through mathematical concepts. Nodes in a complex network represent individuals in real life, while edges between nodes represent relationships between individuals. A comprehensive study of complex networks can help to understand their structural composition, evolutionary dynamics, and other characteristics, and provide a basis for other disciplines. A complex network is a network that possesses some or all of the features of self-organization, small world, scale-free, and self-similarity. Degree distribution, clustering coefficient, and average path length are three basic static structural characteristics of complex networks. The number of edges connected to a node is the degree of that node, and it represents the influence of the node: the more edges that are connected to a node, the more relationships it has with other nodes, and thus the higher its importance in the network. The clustering coefficient is another important parameter, which measures how strongly the nodes of a network group together. A path is a sequence of edges through which two nodes can be connected, and its length is the number of edges it contains. The average path length reflects the overall structural characteristics of a complex network.
There are generally four models of complex network: regular network, random network, WS small-world network, and BA scale-free network. A regular network is the simplest model in complex network theory; all nodes have the same degree, and both the average path length and the clustering coefficient are relatively high. The random network, which marks the beginning of the systematic study of complex network theory in mathematics, is generated in two steps. First, set the size of the network, that is, the number of nodes, to $N$. Then, connect each pair of nodes in the network with probability $p$. The result is a random network with an expected $pN(N-1)/2$ edges. The degree distribution of a random network is approximately a Poisson distribution. The average degree increases with a rise in $N$. However, the clustering coefficient decreases as $N$ increases, while the average path length is proportional to $\ln N$, which is obviously different from regular networks. A WS small-world network, which describes a transition from a completely regular network to a completely random network, typically has a short average path length as well as a large clustering coefficient. A common property of many large networks is that the vertex degrees follow a scale-free power-law distribution. The BA model explains this with two generic mechanisms: network growth and preferential attachment. Growth reflects the open nature of a network, in which new nodes are added continually, so that the network scale keeps increasing. Preferential attachment describes how new nodes connect: they prefer to establish relationships with nodes that already have more connections. Figure 2 delineates the above four models at the same scale, that is, with the same number of nodes.
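The two-step construction of the random network can be illustrated with a short Python sketch. It is a minimal example under stated assumptions rather than the procedure used to draw Figure 2: the function names `random_network` and `expected_edges`, the seed, and the values of `n` and `p` are all chosen here for illustration.

```python
import random
from itertools import combinations


def random_network(n, p, seed=None):
    """Return the edge list of an undirected random network on nodes 0..n-1."""
    rng = random.Random(seed)
    # Visit each of the n(n-1)/2 unordered node pairs once and
    # keep the pair as an edge with probability p.
    return [(u, v) for u, v in combinations(range(n), 2) if rng.random() < p]


def expected_edges(n, p):
    """Expected number of edges, p * N * (N - 1) / 2."""
    return p * n * (n - 1) / 2


# Illustrative values only; they are not taken from the paper.
n, p = 200, 0.05
edges = random_network(n, p, seed=1)
print(len(edges), expected_edges(n, p))
```

For a single run, the number of generated edges fluctuates around the expected value and is not exactly equal to it.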

2.4. Construction of the Software Package Dependency Network

When considering the internal structure of open-source operating system software, software packages can be regarded as nodes of a network, while their relationship can be seen as the edge of the network. Thus, the internal structure of open-source operating system software can be characterized by a network of software packages. Edges in the network are directed owing to the directional property of dependency.
In our research, we define the software package dependency relationship as an unweighted and directed graph $G = (V, E)$. $V$ represents the set of nodes, $V = \{v_1, v_2, \ldots, v_n\}$, where each node represents a software package. $E$ represents the set of edges, each of which connects two dependent packages. When a vertex $v_j$ depends on another vertex $v_i$, there is an edge from node $v_i$ to node $v_j$; in other words, package $j$ is derived from package $i$. The establishment of a dependency network is described below using the example of the Sudo software package. The dependencies of a software package can be queried from the command line. The Sudo software package depends on the following packages:
$ apt-cache depends sudo
sudo
  depends: libaudit1
  depends: libc6
  depends: libpam0g
  depends: libselinux1
  depends: libpam-modules
  depends: lsb-base
In this example, we only build the first-level dependency network, since there may be nested dependencies of other packages that Sudo is dependent on. Results are presented in Figure 3.
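A first-level network of this kind can be sketched in Python by parsing the listing above. The function `parse_depends` is a hypothetical helper introduced here, and it assumes input shaped exactly like the listing: a package name followed by indented `depends:` lines. The output of a real package manager may contain further relation types and markers that this sketch simply skips. Following the edge definition of this section, each edge points from the dependency to the dependent package.

```python
def parse_depends(output):
    """Parse a dependency listing into (dependency, dependent) edges."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    package = lines[0]                  # first non-empty line names the package
    edges = []
    for line in lines[1:]:
        relation, _, target = line.partition(":")
        # Compare case-insensitively so the capitalization of the label does not matter.
        if relation.lower() == "depends" and target.strip():
            edges.append((target.strip(), package))
    return package, edges


sample = """sudo
  depends: libaudit1
  depends: libc6
  depends: libpam0g
  depends: libselinux1
  depends: libpam-modules
  depends: lsb-base
"""

package, edges = parse_depends(sample)
for dependency, dependent in edges:
    print(dependency, "->", dependent)
```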
The following algorithm is used to extract the overall software package dependency network.
Algorithm 1. Package dependency extraction
1. initialization phase
   for i = 1, 2, …, n
     v_i ← name of package i
2. find dependency relationships and construct the graph
   for i = 1, 2, …, n
     for j = 1, 2, …, i−1, i+1, …, n
       scan the dependency list of v_i
       add an edge when a dependency exists between v_i and v_j
3. delete redundant edges
4. store the graph as a table
5. visualization
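Steps 1 to 4 of Algorithm 1 can be sketched in Python as follows. This is an illustration under assumptions, not the implementation used in the experiments: the input dictionary `depends_on`, the function names, and the toy data are hypothetical; "redundant edges" is interpreted as duplicate edges; dependencies on packages outside the node set are dropped; and the `Source`/`Target` column names are one common convention for edge tables. The sketch looks up each listed dependency directly instead of looping over every other node, which yields the same edges.

```python
import csv
import io


def build_dependency_network(depends_on):
    """Build (nodes, edges) from a mapping: package name -> names it depends on."""
    nodes = sorted(depends_on)                      # step 1: one node per package
    known = set(nodes)
    edges = set()                                   # a set discards duplicate edges (step 3)
    for package in nodes:                           # step 2: scan each dependency list
        for dependency in depends_on[package]:
            if dependency in known and dependency != package:
                # Edge direction follows Section 2.4: dependency -> dependent.
                edges.add((dependency, package))
    return nodes, sorted(edges)


def edge_table(edges):
    """Step 4: serialize the edge list as a CSV table."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Source", "Target"])
    writer.writerows(edges)
    return buffer.getvalue()


# Hypothetical toy metadata; note the duplicated libc6 entry.
depends_on = {
    "sudo": ["libc6", "libpam0g", "libc6"],
    "libpam0g": ["libc6"],
    "libc6": [],
}
nodes, edges = build_dependency_network(depends_on)
table = edge_table(edges)
print(table)
```

Step 5 is left out; a table in this form can be loaded into a graph visualization tool.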

3. Results and Analysis

In this paper, we took the Ubuntu Kylin Linux operating system [29], an official Chinese distribution of Ubuntu Linux maintained by our university, as the research object. Ubuntu Kylin released its first version, named 13.04, in 2013. Since then, two operating system versions have been released each year. The beta version, with the suffix ‘.04’, is released in the first half of the year, while the official version, whose version number carries the extension ‘.10’, is released in the second half. Our experiments follow Algorithm 1 and extract software package dependency relationships from the package head files. This paper regards the package dependency network as the holistic structure of the operating system. By abstracting packages as nodes and dependency relationships as edges, the complex internal structure of an operating system is transformed into a network model, namely, a graph. Thus, the measurement of complexity becomes an investigation of topological structure features. Degree distribution, clustering coefficient, and average path length are three basic static structural characteristics of complex networks. By combining complex network theory with knowledge of software engineering, we can obtain a better understanding of the topological structure and dynamic characteristics of the system [36,37,38,39].
Our experiments portray the software package dependencies of six versions of the Ubuntu Kylin operating system, namely the official versions from 13 to 18. In daily use, the quantity and variety of installed software packages vary from user to user. Therefore, the six operating system versions tested in this experiment are all original versions of the system, that is, the system as it was at the time of release. Figure 4 presents the overall structure of the software package dependency network for the six operating system versions. All networks are visualized with Gephi, an open-source visualization tool, and are partitioned by modularity. Nodes inside a module are closely connected, while few connections exist between modules. In the figure, each color indicates a different module.
Changes in software structure are a common phenomenon as software evolves. Kernel upgrades, the replacement of desktop environment technology, and the addition of new functions are the primary factors that lead to structural variation in operating systems.
Degree distribution, clustering coefficient, and average shortest path length are general measures used in complex network theory to estimate the structural characteristics of a network. For a comprehensive investigation of our network characteristics, we discuss the above six networks in terms of network scale, density, connectivity, cohesion, and heterogeneity.

3.1. Network Scale

The scale of a network can usually be expressed in terms of the number of nodes in the network. The scale of an actual network is almost always changing. In fact, in the case of the Internet and online social networks, the number of nodes and edges have been increasing for quite a long time [27,28]. Figure 5 presents the trend of network scale variation during the version evolution of the Ubuntu Kylin operating system.
According to the definition of network construction, a node represents a software package. Any increase or decrease in the number of nodes stands for a corresponding change in the number of packages. Therefore, the growth of nodes means the emergence of new software packages. On the one hand, this may mean the addition of new technology; on the other hand, it may indicate the enrichment and expansion of peripheral applications. By contrast, a decrease in nodes indicates the replacement of technology or the obsolescence of software packages. Figure 5 shows that the number of nodes and the number of edges follow the same evolutionary trend and are positively correlated. However, the growth rate of edges is larger, since each node can connect to more nodes as the number of nodes increases.

3.2. Network Degree and Its Distribution

The degree of connectivity of nodes in a software network determines the importance of the nodes in the network in a certain sense, reflecting the uneven degree of energy distribution. If a network is randomly connected, the importance of each node is roughly the same, and the energy distribution is uniform, such that the structural formula of the software can be considered “disordered.” On the contrary, if the network is asymmetric—that is, there are a small number of “core nodes” and a large number of “end nodes” (with small node degree) in the network—and there are also differences in the importance of nodes, resulting in an uneven distribution of energy, then the structural formula of the software can be considered “orderly” and “heterogeneous.” Figure 6 presents the distribution of network node degrees across the six versions of the software package network.
As can be seen from the above figure, the node degree distribution of the six networks generally follows the power law distribution [32], and the distribution of the system structure presents as uneven, showing that the whole structure is “orderly.” The node degree distribution curve of the six versions of the software networks displays the “long tail” feature; that is, most nodes in the network have only a small degree of connectivity, while a few nodes have a large degree of connectivity and become the central nodes.
The software package dependency network is a directed graph, and the in-degree and out-degree of a node have different meanings. The differences between them are linked to design rules and decisions made during development. In this paper, we take a higher in-degree to mean that a node depends more heavily on other nodes and, by contrast, a higher out-degree to mean that a node is reused more widely. This indirectly reflects the complexity of the operating system design.
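The in-degree, the out-degree, and an empirical degree distribution can be computed from an edge list as in the following Python sketch. The function names and the three-package edge list are assumptions made for illustration; edges point from the dependency to the dependent package, as defined in Section 2.4.

```python
from collections import Counter


def degree_tables(edges):
    """Return (in_degree, out_degree) counters for a directed edge list."""
    in_degree, out_degree = Counter(), Counter()
    for source, target in edges:
        out_degree[source] += 1     # source is used by one more package
        in_degree[target] += 1      # target relies on one more package
    return in_degree, out_degree


def degree_distribution(degree, nodes):
    """Fraction of nodes having each degree value; absent nodes count as degree 0."""
    counts = Counter(degree[node] for node in nodes)
    return {k: counts[k] / len(nodes) for k in sorted(counts)}


# Hypothetical toy network: libc6 is reused by both other packages.
nodes = ["libc6", "libpam0g", "sudo"]
edges = [("libc6", "sudo"), ("libc6", "libpam0g"), ("libpam0g", "sudo")]

in_degree, out_degree = degree_tables(edges)
out_distribution = degree_distribution(out_degree, nodes)
print(in_degree["sudo"], out_degree["libc6"], out_distribution)
```

Plotting such a distribution on logarithmic axes is a common way to inspect whether it has the long tail described above.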

3.3. Network Connectivity

Network connectivity is an indicator that measures whether the nodes of a network are connected as a whole. In an undirected network, if there is a sequence of nodes $i_1, i_2, \ldots, i_n$ in which each pair of adjacent nodes is connected by an edge, there is a path between nodes $i_1$ and $i_n$. If a path exists between each pair of nodes in an undirected network, the network is referred to as connected. The edges of a directed network have direction, and so do its paths. If there is a node sequence $i_1, i_2, \ldots, i_n$ in which each pair of adjacent nodes has an edge pointing from the former to the latter, there is a path from node $i_1$ to node $i_n$. A path from node $i_1$ to node $i_n$ does not imply a path from node $i_n$ to node $i_1$. The connectivity of directed networks can be classified into two types: strong connectivity and weak connectivity. Strong connectivity means that, for any pair of nodes $i$ and $j$ in the network, there is a path from node $i$ to node $j$ and a path from node $j$ to node $i$. Weak connectivity means that the network is connected when its directed edges are regarded as undirected edges. Experiments have shown that the Ubuntu Kylin software package dependency network is not strongly connected. Figure 7 presents the number of strongly connected modules in each version of Ubuntu Kylin.
Next, we discuss the weakly connected property of the software package dependency network. Figure 8 presents the number of weakly connected modules as well as the proportion of nodes in the largest connected module of the Ubuntu Kylin software package dependency network. It can be seen from Figure 8 that the number of weakly connected modules in versions 13 to 18 stays at 30 or above, while the largest connected module contains over 95% of the nodes. This means that the Ubuntu Kylin operating system package dependency network has good connectivity. The figure also shows that connectivity develops in a relatively stable way from version to version. The proportion of nodes in the largest connected module has been maintained above 95%, indicating that the package network is distinct from other networks, such as citation networks or dependency networks, and grows as a whole.
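The two notions of connectivity can be illustrated with a Python sketch that counts weakly and strongly connected modules. It is not the procedure used to produce Figures 7 and 8: the function names and the five-node toy graph are assumptions, and the strongly connected modules are found here with a two-pass depth-first search (Kosaraju's algorithm), written iteratively to avoid deep recursion.

```python
from collections import defaultdict


def weak_components(nodes, edges):
    """Connected components after treating every directed edge as undirected."""
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        component, stack = [], [start]
        while stack:
            node = stack.pop()
            component.append(node)
            for nxt in neighbors[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        components.append(component)
    return components


def strong_components(nodes, edges):
    """Strongly connected components via two depth-first passes."""
    forward, backward = defaultdict(list), defaultdict(list)
    for u, v in edges:
        forward[u].append(v)
        backward[v].append(u)
    # Pass 1: record nodes in order of DFS completion on the original graph.
    order, seen = [], set()
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack = [(start, iter(forward[start]))]
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(forward[nxt])))
                    break
            else:                       # all successors explored: node is finished
                order.append(node)
                stack.pop()
    # Pass 2: sweep the reversed graph in decreasing completion order.
    assigned, components = set(), []
    for start in reversed(order):
        if start in assigned:
            continue
        assigned.add(start)
        component, stack = [], [start]
        while stack:
            node = stack.pop()
            component.append(node)
            for nxt in backward[node]:
                if nxt not in assigned:
                    assigned.add(nxt)
                    stack.append(nxt)
        components.append(component)
    return components


# Hypothetical toy graph: a and b depend on each other, c and e hang off them.
nodes = ["a", "b", "c", "d", "e"]
edges = [("a", "b"), ("b", "a"), ("b", "c"), ("d", "e")]
weak = weak_components(nodes, edges)
strong = strong_components(nodes, edges)
largest_share = max(len(component) for component in weak) / len(nodes)
print(len(weak), len(strong), largest_share)
```

In the toy graph, only the mutually dependent pair forms a strongly connected module of more than one node, which mirrors why a dependency network with few cyclic dependencies is far from strongly connected.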

3.4. Network Density

The denseness of a network refers to the number of edges relative to the network size. Network density and average degree are the two parameters predominantly adopted for measuring network denseness. The former measures the relative denseness of a network, while the latter measures its absolute denseness. The network density of a directed graph, denoted $\rho$, is defined in Equation (1), and the average degree, denoted $k$, is defined in Equation (2). In both equations, $M$ is the number of edges in the network and $N$ is the number of nodes.
$$\rho = \frac{M}{N(N-1)} \qquad (1)$$
$$k = \frac{M}{N} \qquad (2)$$
Figure 9 presents the graph density of each version of the software package dependency network. The experimental results demonstrate that the overall network density exhibits a downward trend and the network becomes sparse. The average density of the graph of the dependency network is 0.0025, while the average of the average degree is only 4.616, indicating that the software package dependency network is very sparse.
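Equations (1) and (2) translate directly into code. The Python sketch below uses hypothetical function names and a toy network of five nodes and four edges; these numbers are not taken from the experiments.

```python
def network_density(num_nodes, num_edges):
    """Equation (1): density of a directed graph, M / (N * (N - 1))."""
    return num_edges / (num_nodes * (num_nodes - 1))


def average_degree(num_nodes, num_edges):
    """Equation (2): average degree, M / N."""
    return num_edges / num_nodes


# Toy values for illustration: N = 5 nodes, M = 4 edges.
N, M = 5, 4
rho = network_density(N, M)
k = average_degree(N, M)
print(rho, k)
```

The two quantities are linked by $\rho = k/(N-1)$, so a network whose average degree stays roughly constant becomes sparser as the number of nodes grows.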
As the software version evolves, the degree of connection between software packages decreases, and so does the complexity of the internal structure. The growth in software scale, that is, the addition of new packages, does not result in an increase in complexity. This indicates that each operating system version considers and designs only the necessary features so as to avoid over-design.

3.5. Network Diameter

In the evolution of open-source operating systems, do software packages become more closely associated or less? The average path length and network diameter can be used to assess the cohesion of a network. The shortest path length between nodes $i$ and $j$, denoted $d_{ij}$, is defined as the smallest number of edges that connect the two nodes. The average path length of a network, denoted $L$, is the average distance between any two nodes. For directed networks, the average path length is defined in Equation (3) [1]:
$$L = \frac{1}{N(N-1)} \sum_{i \neq j} d_{ij}. \qquad (3)$$
Many actual networks are large but have a small average path length. This is called the “small world” phenomenon. The network diameter is defined in Equation (4); in other words, it is the maximum length among all shortest paths:
$$D = \max_{i,j} d_{ij}. \qquad (4)$$
The average path length and diameter of each version of the software package dependency network are presented in Figure 10. We study average path length to identify what level should be maintained in a software system so as to better realize the extensibility and maintainability of the software system and control the cost of software development. If the average shortest path of a network is too large, it may be because of a loose organization and a low degree of reuse. On the other hand, a small average shortest path indicates a high coupling degree between packages as well as an unclear system design responsibility. A small average is not conducive to software maintenance and modification.
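The average path length and the diameter can be computed with one breadth-first search per node, as in the Python sketch below. The function names and the three-node cycle are assumptions for illustration. One further assumption deserves emphasis: in a directed network that is not strongly connected, some node pairs are unreachable, so the sketch averages over reachable ordered pairs only. This coincides with Equation (3) when the network is strongly connected, as in the toy example, but it may differ from the convention used in the experiments.

```python
from collections import defaultdict, deque


def shortest_path_lengths(nodes, edges):
    """Map each reachable ordered pair (i, j), i != j, to its distance d_ij."""
    successors = defaultdict(list)
    for u, v in edges:
        successors[u].append(v)
    lengths = {}
    for source in nodes:
        distance = {source: 0}
        queue = deque([source])
        while queue:                        # breadth-first search from source
            node = queue.popleft()
            for nxt in successors[node]:
                if nxt not in distance:
                    distance[nxt] = distance[node] + 1
                    queue.append(nxt)
        for target, d in distance.items():
            if target != source:
                lengths[(source, target)] = d
    return lengths


def average_path_length(lengths):
    """Mean of d_ij over the reachable ordered pairs."""
    return sum(lengths.values()) / len(lengths)


def diameter(lengths):
    """Equation (4): the largest shortest-path length."""
    return max(lengths.values())


# Toy strongly connected network: a directed cycle a -> b -> c -> a.
nodes = ["a", "b", "c"]
edges = [("a", "b"), ("b", "c"), ("c", "a")]
lengths = shortest_path_lengths(nodes, edges)
L = average_path_length(lengths)
D = diameter(lengths)
print(L, D)
```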

3.6. Average Clustering Coefficient

The clustering coefficient, which originates from the social networking field, is a measure of the rate at which nodes in a network tend to cluster together. Nodes in real-world networks tend to create tightly knit groups with a relatively high density. Two versions of clustering coefficient exist in a network: the local clustering coefficient and its global alternative. The local clustering coefficient quantifies how close its neighbors are to being a clique. To wit, it calculates the proportion of neighbors directly adjacent between nodes to account for the maximum possible neighbors. Its global version, the average clustering coefficient, gives an indication of the overall clustering in a network. In our experiments, the local clustering coefficient is given as in Equation (5), where e i refers to the actual connection number of a node and k i is the degree of the node:
$$C_i = \frac{2 e_i}{k_i (k_i - 1)}. \tag{5}$$
The average clustering coefficient is the mean of the local clustering coefficients of all vertices, as given in Equation (6), where $n$ is the number of vertices:
$$\bar{C} = \frac{1}{n} \sum_{i=1}^{n} C_i. \tag{6}$$
The average clustering coefficient describes the clustering of nodes in a network; in other words, how tightly knit the network is. The average clustering coefficient of the six operating system versions tested in this paper ranges from 0.196 to 0.214, while that of a corresponding random graph is 0.003. That is, the average clustering coefficient of our operating system package dependency network is more than 65 times that of a random network. This indicates that the dependency network is a high-clustering network, and that the packages in the network are closely related and cluster together. The average clustering coefficient of each version of the software package dependency network is presented in Figure 11.
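Equations (5) and (6) can be sketched in a few lines of Python. The sketch below assumes that edge direction is ignored when collecting neighbors, since Equation (5) has the undirected form; whether the experiments treated direction this way is not stated here. The edge list `toy_edges` is hypothetical, and nodes with fewer than two neighbors are assigned a coefficient of 0 by convention.

```python
def undirected_view(edges):
    """Build neighbor sets, ignoring edge direction and self-loops (an assumption)."""
    neighbors = {}
    for u, v in edges:
        if u == v:
            continue
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    return neighbors


def local_clustering(neighbors, node):
    """Equation (5): C_i = 2 * e_i / (k_i * (k_i - 1))."""
    nbrs = sorted(neighbors[node])
    k = len(nbrs)
    if k < 2:
        return 0.0  # convention for nodes with fewer than two neighbors
    # e_i: number of edges that exist among the neighbors of the node
    e = sum(
        1
        for idx, u in enumerate(nbrs)
        for w in nbrs[idx + 1:]
        if w in neighbors[u]
    )
    return 2 * e / (k * (k - 1))


def average_clustering(neighbors):
    """Equation (6): mean of the local coefficients over all nodes."""
    return sum(local_clustering(neighbors, n) for n in neighbors) / len(neighbors)


# Hypothetical toy edge list: a triangle a-b-c plus a pendant node d.
toy_edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
nb = undirected_view(toy_edges)
avg_c = average_clustering(nb)
print(avg_c)
```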
Table 1 summarizes the descriptive statistics for the six versions of the software package dependency network [30,31].
It can be seen from the statistics above that, although the size of the network varies, the average path length of these networks is small relative to their size. For example, version 17.10 contains more than 2000 software packages, yet its average path length is less than 4. At the same time, the average clustering coefficient is much higher than that of a random network of the same size [28]. Together, these results suggest that the networks have “small world” characteristics. During the evolution of the operating system, the average path length remains at a steady level, which indicates a proper degree of coupling between packages. The degree of connection between software packages decreases, which reduces the complexity of the internal structure. As the software grows and new function packages are added, the complexity of the software structure does not increase accordingly.
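The two small-world indicators discussed above can be checked directly against the values reported in Table 1. The short Python sketch below is only a worked piece of arithmetic: it computes the ratio of each version's average clustering coefficient to that of the random network and tests whether every average path length is below 4. The variable names are illustrative.

```python
# (version, vertices, average path length, clustering, random clustering),
# copied from Table 1 of this paper.
table1 = [
    ("13.10", 1546, 3.434, 0.213, 0.003),
    ("14.10", 1908, 3.35, 0.207, 0.003),
    ("15.10", 1826, 3.249, 0.214, 0.003),
    ("16.10", 1983, 3.295, 0.2, 0.003),
    ("17.10", 2098, 3.585, 0.211, 0.003),
    ("18.10", 1925, 3.214, 0.196, 0.003),
]

# Ratio of observed clustering to random-network clustering, per version.
ratios = {version: c / c_rand for version, _, _, c, c_rand in table1}

# True if every version has an average path length below 4.
short_paths = all(apl < 4 for _, _, apl, _, _ in table1)

for version, ratio in ratios.items():
    print(version, round(ratio, 1))
print("all average path lengths below 4:", short_paths)
```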
There are isolated nodes in all versions of the software dependency networks. These are simple, functionally independent packages, such as the rar package.

4. Analysis of Key Nodes in the Software Package Dependency Network

Mining the key nodes in the complex network and evaluating their importance can improve the overall performance and robustness of the system. In this paper, key nodes of the software package dependency network are defined as those software package nodes that can affect the stability of the entire network structure. The evaluation of key nodes in the network should be conducted via analysis of the local connection characteristics and the overall influence degree of the nodes. In this paper, node degree and betweenness centrality of the node are used to identify the key nodes of each version of the software package dependency network [40,41,42,43,44,45].
Table 2 presents the top 10 nodes with the highest out-degree for each version of the software package dependency network. As can be seen from the table, the software packages with a high degree of reuse remain largely the same across versions of the operating system; all of them are operating system software libraries or software packages that provide graphical interfaces. C, C++, and Python are extensively used in writing package source code. Moreover, the increased out-degree of Perl packages in version 18.10 suggests that the Perl language has become more widely used.
The in-degree of a node indicates the extent to which the node depends on other nodes; the higher the in-degree, the greater the dependence on other nodes. Table 3 lists the 10 nodes with the highest in-degree for each version of the software package dependency network. As can be observed in the table, most of the packages that rely heavily on other nodes are related to the desktop environment. The changes in the top 10 in-degree nodes also reflect the evolution of the Ubuntu desktop environment, from GNOME to Unity to UKUI.
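Rankings such as those in Tables 2 and 3 can be produced with a simple counting pass over the edge list. The Python sketch below assumes that an edge (u, v) is read as "v depends on u", which matches how the text interprets out-degree (reuse) and in-degree (dependence); the edge list `toy_edges` and the function name `degree_rankings` are hypothetical.

```python
from collections import Counter


def degree_rankings(edges, top=10):
    """Return the `top` nodes by out-degree and by in-degree.

    Assumption: an edge (u, v) means that package v depends on package u,
    so out-degree counts dependents and in-degree counts dependencies.
    """
    out_degree = Counter()
    in_degree = Counter()
    for source, target in edges:
        out_degree[source] += 1
        in_degree[target] += 1
    return out_degree.most_common(top), in_degree.most_common(top)


# Hypothetical toy edge list, not taken from the Ubuntu Kylin data.
toy_edges = [
    ("libc6", "dpkg"),
    ("libc6", "perl"),
    ("libc6", "app"),
    ("perl", "app"),
    ("dpkg", "app"),
]
top_out, top_in = degree_rankings(toy_edges, top=3)
print(top_out)
print(top_in)
```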
In-degree and out-degree are basic properties of nodes in a network and can be used to explore key nodes from the point of view of connections. In this paper, key nodes are also analyzed from another perspective, namely the role of a node in the network and the extent of its impact on the network. We use the betweenness of a node for this key node mining. Betweenness identifies the nodes in the network that carry a heavy information load: the more shortest paths pass through a node, the higher its betweenness value. If such a node fails, the whole software system suffers a significant negative impact. Thus, we can analyze the influence of a node's failure on the whole system according to its betweenness value, which provides guidance for system reconstruction and optimization. This kind of result is precisely what traditional software measurement methods cannot achieve.
For two nodes $A$ and $B$ in the network, there may be many shortest paths between them. The betweenness of a node is considered to be high if many of the shortest paths between pairs of nodes in the network pass through it. Suppose $\sigma_{st}$ represents the number of shortest paths between vertex $s$ and vertex $t$, while $\sigma_{st}(v)$ represents the number of those shortest paths that pass through $v$. Accordingly, betweenness is defined as in Equation (7):
$$B(v) = \sum_{s \neq t \neq v \in V} \frac{\sigma_{st}(v)}{\sigma_{st}}. \tag{7}$$
Table 4 lists the top 10 nodes with the highest betweenness for each version of the software package dependency network.
During the evolution of the operating system, libgtk-3-0 and dpkg appear among the top 10 betweenness nodes in every version, and libgtk-3-0 ranks first throughout. The greater the betweenness of a node, the greater the responsibility it bears; in other words, the greater the impact of its failure on the system. Hence, failures of these software packages must be taken seriously; otherwise, they are likely to cause large-scale system failure.
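Equation (7) can be evaluated without enumerating every shortest path. The Python sketch below uses Brandes' accumulation scheme for unweighted directed graphs, which is a standard technique and not necessarily the tool used to produce Table 4. It assumes adjacency lists without duplicate edges and sums over ordered pairs without normalization; the graph `toy` is hypothetical.

```python
from collections import deque


def betweenness(adj):
    """Unnormalized betweenness on a directed, unweighted graph (Brandes' scheme).

    Assumption: each adjacency list contains a given target at most once.
    """
    nodes = set(adj) | {v for targets in adj.values() for v in targets}
    score = {v: 0.0 for v in nodes}
    for s in nodes:
        sigma = {v: 0 for v in nodes}  # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        preds = {v: [] for v in nodes}  # predecessors on shortest paths
        order = []  # nodes in non-decreasing distance from s
        queue = deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for w in adj.get(u, ()):
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
                if dist[w] == dist[u] + 1:
                    sigma[w] += sigma[u]
                    preds[w].append(u)
        # Back-propagate pair dependencies from the farthest nodes to s.
        delta = {v: 0.0 for v in nodes}
        for w in reversed(order):
            for u in preds[w]:
                delta[u] += sigma[u] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    return score


# Hypothetical toy graph: two parallel routes from a to d, then d -> e.
toy = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"], "e": []}
scores = betweenness(toy)
print(sorted(scores.items()))
```

In this toy graph, node d lies on every shortest path that ends at e, so it receives the largest value, which mirrors the role that the text attributes to packages with high betweenness.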

5. Conclusions

This paper studies the software package dependency network of open-source operating systems from the perspective of complex networks. First, a directed software package dependency network model is proposed to describe the structure of the open-source operating system. Through a study of the Ubuntu Kylin operating system, versions 13.10 to 18.10, we find that the software package dependency network has “small world” and “scale-free” structural characteristics. Moreover, the network structure develops in an “orderly” manner during its evolution. Network density decreases as scale increases, as does network cohesion. The network connectivity is very good: the largest connected component contains more than 95% of the nodes. The network has a small number of nodes with large degree values and a large number of nodes with small degree values. Finally, betweenness centrality is used as a measure to identify key nodes in the open-source operating system software package network.
Software evolution, the process of updating and changing software, is one of the essential characteristics of software. By observing structural characteristics during evolution, we can learn how different structural characteristics affect the quality of a new software version, as well as the rules by which these characteristics evolve. This information is useful for understanding the evolving nature of software and provides a reference for software version upgrades, so as to support proper iterative development and quality control. In addition, this study provides guidance for designing software structures with higher fault tolerance and robustness, which helps avoid a premature end to the software life cycle.

Author Contributions

Conceptualization, J.W., Y.T. and Q.W. (Qingbo Wu); Data curation, J.W., K.Z. and X.S.; Funding acquisition, Y.T.; Supervision, Q.W. (Qingbo Wu) and Q.W. (Quanyuan Wu); Writing—original draft, J.W.; Writing—review & editing, K.Z., X.S., Y.T., Q.W. (Qingbo Wu) and Q.W. (Quanyuan Wu)

Funding

This work is funded by the National Key Research and Development Program of China under grant No. 2018YFB1003602 and the National Natural Science Foundation of China under grant No. 61872444.

Acknowledgments

We would like to thank Professor Ji Wang for his kind suggestions on this paper and the anonymous reviewers for their insightful suggestions for improving this paper.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Santiago, E.; Velascohernández, J.X.; Romerosalcedo, M. A descriptive study of fracture networks in rocks using complex network metrics. Comput. Geosci. 2016, 88, 97–114.
  2. Tekinerdogan, B.; Ali, N.; Grundy, J.; Mistrik, I.; Soley, R. Quality concerns in large-scale and complex software-intensive systems. In Software Quality Assurance; Elsevier: New York, NY, USA, 2016; pp. 1–17.
  3. Seol, K.; Kim, J.D.; Baik, D.K. Common neighbor similarity-based approach to support intimacy measurement in social networks. J. Inf. Sci. 2016, 42, 128–137.
  4. Rodríguez, M.A. Graphicality conditions for general scale-free complex networks and their application to visibility graphs. Phys. Rev. E 2016, 94, 012314.
  5. Zhang, J.; Zhang, C.; Xuan, J.F.; Xiong, Y.F.; Wang, Q.X.; Liang, B.; Li, L.; Dou, W.S.; Chen, Z.B.; Chen, L.Q.; et al. Recent progress in program analysis. Ruan Jian Xue Bao/J. Softw. 2019, 30, 80–109.
  6. Ma, X.; Liu, X.; Yu, P.; Zhang, T.; Bu, L.; Xie, B.; Jin, Z.; Li, X. Software development methods: Review and outlook. J. Softw. 2019, 30, 3–21.
  7. He, J.; Shan, Z.; Wang, J.; Pu, G.; Fang, Y.; Liu, K.; Zhao, R.; Zhang, Z. Review of the achievements of major research plan on “Trustworthy Software”. Sci. Found. China 2018, 32, 291–296.
  8. Aguirre, A.; Barthe, G.; Gaboardi, M.; Garg, D.; Strub, P.Y. A relational logic for higher-order programs. Proc. ACM Program. Lang. 2017, 1, 21.
  9. Chen, H.; Chajed, T.; Konradi, A.; Wang, S.; İleri, A.; Chlipala, A.; Kaashoek, M.F.; Zeldovich, N. Verifying a high-performance crash-safe file system using a tree specification. In Proceedings of the 26th Symposium on Operating Systems Principles, Shanghai, China, 28 October 2017; ACM: New York, NY, USA, 2017; pp. 270–286.
  10. Yin, L.; Dong, W.; Liu, W.; Wang, J. On scheduling constraint abstraction for multi-threaded program verification. IEEE Trans. Softw. Eng. 2018.
  11. Albert, R.; Jeong, H.; Barabási, A.L. Internet: Diameter of the World-Wide Web. Nature 1999, 401, 130–131.
  12. Halstead, M.H. Elements of Software Science; Elsevier: New York, NY, USA, 1977; pp. 23–26.
  13. Mehndiratta, B.; Grover, P.S. Software metrics—An experimental analysis. ACM SIGPLAN Not. 1990, 25, 35–41.
  14. Araújo, C.W.; Nunes, I.; Nunes, D. On the Effectiveness of Bug Predictors with Procedural Systems: A Quantitative Study. In Proceedings of the 20th International Conference on Fundamental Approaches to Software Engineering, Uppsala, Sweden, 22–29 April 2017.
  15. Akour, M.; Alsmadi, I.; Alazzam, I. Software fault proneness prediction: A comparative study between bagging, boosting, and stacking ensemble and base learner methods. Int. J. Data Anal. Tech. Strateg. 2017, 9, 1–16.
  16. Coskun, E.; Grabowski, M. Complexity in embedded intelligent real time systems. In Proceedings of the 20th International Conference on Information Systems, Charlotte, NC, USA, 12–15 December 1999; pp. 434–439.
  17. Potier, D.; Albin, J.L.; Ferreol, R.; Bilodeau, A. Experiments with computer software complexity and reliability. In Proceedings of the 6th International Conference on Software Engineering, Tokyo, Japan, 13–16 September 1982; pp. 94–103.
  18. Martins, P.; Fernandes, J.P.; Saraiva, J. A Web Portal for the Certification of Open Source Software. In Proceedings of the Revised Selected Papers of the SEFM 2012 Satellite Events on Information Technology and Open Source: Applications for Education, Innovation, and Sustainability, Thessaloniki, Greece, 1–2 October 2012.
  19. Kumar, L.; Misra, S.; Rath, S.K. An empirical analysis of the effectiveness of software metrics and fault prediction model for identifying faulty classes. Comput. Stand. Interfaces 2017, 53, 1–32.
  20. Barabási, A.L. Scale-free networks: A decade and beyond. Science 2009, 325, 412–413.
  21. Barabási, A.L.; Albert, R. Emergence of scaling in random networks. Science 1999, 286, 509–512.
  22. Barabasi, A.L.; Oltvai, Z.N. Network biology: Understanding the cell’s functional organization. Nat. Rev. Genet. 2004, 5, 101–113.
  23. Garlaschelli, D.; Ruzzenenti, F.; Basosi, R. Complex networks and symmetry I: A review. Symmetry 2010, 2, 1683–1709.
  24. Barcellini, F.; Détienne, F.; Burkhardt, J.M. Participation in online interaction spaces: Design-use mediation in an Open Source Software community. Int. J. Ind. Ergon. 2009, 39, 533–540.
  25. Basset, T. Coordination and social structures in an open source project: Videolan. In Open Source Software Development; O’Reilly: Sebastopol, CA, USA, 2004; pp. 125–151.
  26. Benkler, Y. Coase’s Penguin, or, Linux and the Nature of the Firm. Yale Law J. 2002, 112, 367–445.
  27. Bollobás, B.E.; Riordan, O.; Spencer, J.; Tusnády, G. The degree sequence of a scale-free random graph process. Random Struct. Algorithms 2001, 18, 279–290.
  28. Ball, F.; Geyer-Schulz, A. How symmetric are real-world graphs? A large-scale study. Symmetry 2018, 10, 29.
  29. Ubuntu Kylin. Available online: https://www.ubuntukylin.com/ (accessed on 1 November 2018).
  30. Broder, A.; Kumar, R.; Maghoul, F.; Raghavan, P.; Rajagopalan, S.; Stata, R.; Tomkins, A.; Wiener, J. Graph structure in the web. Comput. Netw. 2000, 33, 309–320.
  31. Carley, K.M.; Skillicorn, D. Special issue on analyzing large scale networks: The Enron corpus. Comput. Math. Organ. Theory 2005, 11, 179–181.
  32. Clauset, A.; Shalizi, C.R.; Newman, M.E. Power-law distributions in empirical data. SIAM Rev. 2009, 51, 661–703.
  33. Cohendet, P.; Creplet, F.; Dupouet, O. Organisational innovation, communities of practice and epistemic communities: The case of Linux. In Economics with Heterogeneous Interacting Agents; Springer: Berlin/Heidelberg, Germany, 2001; pp. 303–326.
  34. Crowston, K.; Howison, J. The social structure of free and open source software development. First Monday 2005, 10, 405–411.
  35. Crowston, K.; Howison, J. Hierarchy and centralization in free and open source software team communications. Knowl. Technol. Policy 2006, 18, 65–85.
  36. Dorogovtsev, S.N.; Goltsev, A.V.; Mendes, J.F.F. Critical phenomena in complex networks. Rev. Mod. Phys. 2008, 80, 1275–1335.
  37. Érdi, P.; Makovi, K.; Somogyvári, Z.; Strandburg, K.; Tobochnik, J.; Volf, P.; Zalányi, L. Prediction of emerging technologies based on analysis of the US patent citation network. Scientometrics 2013, 95, 225–242.
  38. Erdos, P.; Rényi, A. On the evolution of random graphs. Publ. Math. Inst. Hung. Acad. Sci. 1960, 5, 17–61.
  39. Faloutsos, M.; Faloutsos, P.; Faloutsos, C. On power-law relationships of the internet topology. In Proceedings of the ACM SIGCOMM Computer Communication Review, Cambridge, MA, USA, 30 August–3 September 1999; ACM: New York, NY, USA, 1999.
  40. Garrido, A. Symmetry in complex networks. Symmetry 2011, 3, 1–15.
  41. Mistrík, I.; Soley, R.M.; Ali, N.; Grundy, J.; Tekinerdogan, B. Software Quality Assurance: In Large Scale and Complex Software-Intensive Systems; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 2015.
  42. Fang, Y.; Neufeld, D. Understanding sustained participation in open source software projects. J. Manag. Inf. Syst. 2009, 25, 9–50.
  43. Kitsak, M.; Gallos, L.K.; Havlin, S.; Liljeros, F.; Muchnik, L.; Stanley, H.E.; Makse, H.A. Identification of influential spreaders in complex networks. Nat. Phys. 2010, 6, 888–893.
  44. Šubelj, L.; Bajec, M. Software systems through complex networks science: Review, analysis, and applications. In Proceedings of the KDD Workshop on Software Mining, Beijing, China, 8 August 2012; pp. 8, 9–16.
  45. Budimac, Z.; Rakić, G. Consistent Static Analysis in Multilingual Software Products Development. In Proceedings of the 7th Balkan Conference on Informatics Conference, Craiova, Romania, 2–4 September 2015.
Figure 1. Structure of open-source platform software package management.
Figure 2. (a) A regular network with 20 nodes; every node has three neighbors. (b) A random network has 20 nodes with a connection probability of 0.2. (c) A WS small-world network has 20 nodes and each node has four neighbors, with a connection probability of 0.3. (d) A BA scale-free network with 20 nodes.
Figure 3. Dependency relationship of the Sudo package.
Figure 4. The package dependency network from version 13.10 to 18.10: (a) 13.10, (b) 14.10, (c) 15.10, (d) 16.10, (e) 17.10, (f) 18.10.
Figure 5. Network scale variation during the version evolution of the Ubuntu Kylin operating system. In our paper, the number of vertices and edges is used to measure the network scale. The double y axis presents the number of vertices and edges accordingly.
Figure 6. Degree distribution of the software package dependency network from version 13.10 to 18.10. The degree distribution $p(k)$ gives the probability that a node has degree $k$. In our experiments, the fitting curves in the figures illustrate that the degree distribution follows a power-law distribution $p(k) \sim k^{-b}$.
Figure 7. Number of strongly connected modules in each version of Ubuntu Kylin.
Figure 8. (a) Comparison of weakly connected modules between each version of the software package dependency network while vertical axis presents the number of weakly connected modules; (b) node portion in the maximum connected graph between each version of the software package dependency network.
Figure 9. (a) Network density of each version of the software package dependency network; (b) average degree of each version of the software package dependency network.
Figure 10. (a) Average path length of each version of the software package dependency network; (b) diameter of each version of the software package dependency network.
Figure 11. Average clustering coefficient of each version of software package dependency network.
Table 1. Descriptive statistics for the six versions of the software package dependency network.
| Version Number | Number of Vertices | Number of Edges | Average Path Length | Average Clustering Coefficient | Average Clustering Coefficient of a Random Network of the Same Scale | Average Degree |
|---|---|---|---|---|---|---|
| 13.10 | 1546 | 7809 | 3.434 | 0.213 | 0.003 | 5.051 |
| 14.10 | 1908 | 9517 | 3.35 | 0.207 | 0.003 | 4.988 |
| 15.10 | 1826 | 9197 | 3.249 | 0.214 | 0.003 | 5.037 |
| 16.10 | 1983 | 9102 | 3.295 | 0.2 | 0.003 | 4.59 |
| 17.10 | 2098 | 10102 | 3.585 | 0.211 | 0.003 | 4.815 |
| 18.10 | 1925 | 8206 | 3.214 | 0.196 | 0.003 | 3.214 |
Table 2. Top 10 out-degree nodes of each version of the software package dependency network.
| Version Number | Node Name | Out-Degree | Version Number | Node Name | Out-Degree |
|---|---|---|---|---|---|
| 13.10 | libc6 | 1096 | 16.10 | libc6 | 1335 |
| | multiarch-support | 421 | | libglib2.0-0 | 386 |
| | libglib2.0-0 | 380 | | libstdc++6 | 251 |
| | libstdc++6 | 156 | | libgcc1 | 220 |
| | libgcc1 | 138 | | libx11-6 | 148 |
| | libx11-6 | 129 | | zlib1g | 126 |
| | libgtk-3-0 | 120 | | libgtk-3-0 | 118 |
| | dpkg | 98 | | libcairo2 | 93 |
| | libgdk-pixbuf2.0-0 | 94 | | multiarch-support | 93 |
| | zlib1g | 94 | | libgdk-pixbuf2.0-0 | 91 |
| 14.10 | libc6 | 1414 | 17.10 | libc6 | 1465 |
| | multiarch-support | 659 | | libglib2.0-0 | 447 |
| | libglib2.0-0 | 441 | | libstdc++6 | 297 |
| | libstdc++6 | 262 | | libgcc1 | 263 |
| | libgcc1 | 216 | | libx11-6 | 167 |
| | libx11-6 | 153 | | libgtk-3-0 | 140 |
| | zlib1g | 128 | | zlib1g | 125 |
| | libgtk-3-0 | 116 | | libcairo2 | 116 |
| | python | 115 | | libgdk-pixbuf2.0-0 | 112 |
| | libgdk-pixbuf2.0-0 | 105 | | libpango-1.0-0 | 108 |
| 15.10 | libc6 | 1269 | 18.10 | libc6 | 1237 |
| | multiarch-support | 601 | | libglib2.0-0 | 334 |
| | libglib2.0-0 | 400 | | libstdc++6 | 194 |
| | libstdc++6 | 218 | | libgcc1 | 176 |
| | libgcc1 | 182 | | perl | 129 |
| | libx11-6 | 135 | | libx11-6 | 128 |
| | libgtk-3-0 | 120 | | zlib1g | 128 |
| | zlib1g | 105 | | libgtk-3-0 | 109 |
| | libcairo2 | 92 | | libcairo2 | 85 |
| | libgdk-pixbuf2.0-0 | 91 | | libgdk-pixbuf2.0-0 | 79 |
Table 3. Top 10 in-degree nodes of each version of the software package dependency network.
| Version Number | Node Name | In-Degree | Version Number | Node Name | In-Degree |
|---|---|---|---|---|---|
| 13.10 | ubuntukylin-desktop | 81 | 16.10 | ubuntukylin-desktop | 64 |
| | libreoffice-core | 54 | | unity-control-center | 61 |
| | gnome-control-center | 48 | | unity | 46 |
| | empathy | 47 | | mpv | 45 |
| | unity | 40 | | libqt5gui5 | 42 |
| | gstreamer0.1-plugings-good | 38 | | unity-setting-daemon | 40 |
| | ubuntu-minimal | 38 | | libwebkit2gtk | 39 |
| | gstreamer1.0-plugings-good | 35 | | gstreamer1.0-plugings-good | 36 |
| | gnome-setting-daemon | 32 | | libwebkitgtk | 36 |
| | libwebkitgtk | 32 | | ubuntu-minimal | 36 |
| 14.10 | gstreamer1.0-plugings-bad | 62 | 17.10 | gstreamer1.0-plugings-bad | 74 |
| | libreoffice-core | 62 | | ubuntukylin-desktop | 71 |
| | unity-control-center | 58 | | unity-control-center | 61 |
| | gstreamer0.1-plugings-bad | 52 | | libreoffice-core | 51 |
| | empathy | 51 | | mpv | 47 |
| | mplayer2 | 49 | | unity | 46 |
| | unity | 44 | | libqt5gui5 | 42 |
| | libqt5gui5 | 43 | | unity-setting-daemon | 39 |
| | gimp | 40 | | chromium-browser | 38 |
| | gstreamer0.1-plugings-good | 38 | | gimp | 38 |
| 15.10 | ubuntukylin-desktop | 80 | 18.10 | ubuntukylin-desktop | 70 |
| | unity-control-center | 61 | | mplayer | 60 |
| | libreoffice-core | 54 | | mpv | 49 |
| | empathy | 52 | | libukwm-1-0 | 49 |
| | unity | 44 | | libqt5gui5 | 42 |
| | libqt5gui5 | 43 | | libwebkit2gtk | 41 |
| | unity-setting-daemon | 40 | | gstreamer1.0-plugings-good | 39 |
| | gstreamer0.1-plugings-good | 38 | | ukui-control-center | 38 |
| | ubuntu-minimal | 38 | | chromium-browser | 34 |
| | libwebkitgtk | 36 | | ubuntu-minimal | 34 |
Table 4. Top 10 betweenness nodes of each version of the software package dependency network.
| Version Number | Node Name | Betweenness | Version Number | Node Name | Betweenness |
|---|---|---|---|---|---|
| 13.10 | libgtk-3-0 | 11364.22 | 16.10 | libgtk-3-0 | 18781.11 |
| | libc6 | 8357.45 | | libqt5gui5 | 5207.38 |
| | udev | 6556.18 | | libgl1-mesa | 4180.14 |
| | upstart | 5226.74 | | dpkg | 3981.82 |
| | debconf | 4945.29 | | libcups2 | 3910.03 |
| | dpkg | 4936.39 | | passwd | 3651.25 |
| | dbus | 4153.23 | | libwayland-egl1-mesa | 3308.32 |
| | passwd | 3899.30 | | libgl1-mesa-dri | 3108.68 |
| | perl-base | 3836.72 | | libfontconfig1 | 3095.37 |
| | libcups2 | 3539.74 | | fontconfig-config | 2766.81 |
| 14.10 | libgtk-3-0 | 10191.41 | 17.10 | libgtk-3-0 | 19640.03 |
| | passwd | 5286.504 | | libglib2.0 | 13467.23 |
| | dbus | 4654.707 | | passwd | 12441.35 |
| | libgtk2.0-0 | 4635.568 | | libuuid1 | 11585.17 |
| | libcups2 | 4614.537 | | libmount1 | 9959.20 |
| | dpkg | 4527.79 | | libblkid1 | 9350.53 |
| | libfontconfig1 | 4256.135 | | libqt5gui5 | 5493.75 |
| | libuuid1 | 4114.621 | | dpkg | 5298.70 |
| | fontconfig-config | 4069.921 | | libegl1-mesa | 4867.30 |
| | libqt5gui5 | 3788.161 | | libcups2 | 4153.26 |
| 15.10 | libgtk-3-0 | 13379.02 | 18.10 | libgtk-3-0 | 14339.41 |
| | libqt5gui5 | 4564.72 | | dpkg | 4585.28 |
| | libcups2 | 4246.68 | | libcups2 | 3302.04 |
| | dpkg | 4079.47 | | libglib2 | 3272.01 |
| | passwd | 3882.18 | | libfontconfig1 | 2694.25 |
| | libc6 | 2787.44 | | libc6 | 2642.52 |
| | python3 | 2775.04 | | fontconfig-config | 2554.11 |
| | libgtk2.0 | 2741.31 | | libegl1 | 2447.56 |
| | libuuid1 | 2699.08 | | libcairo2 | 2329.06 |
| | xserver-xor-core | 2636.12 | | perl | 2298.56 |

Share and Cite

MDPI and ACS Style

Wang, J.; Zhang, K.; Sun, X.; Tan, Y.; Wu, Q.; Wu, Q. Package Network Model: A Way to Capture Holistic Structural Features of Open-Source Operating Systems. Symmetry 2019, 11, 172. https://doi.org/10.3390/sym11020172
