Large Scale Fault Data Analysis and OSS Reliability Assessment Based on Quantification Method of the First Type

Abstract: Various big data sets are recorded on the server side of computer systems. Big data are commonly characterized by the volume, variety, and velocity (3V) model, which was first proposed in a press release by Gartner, Inc.; well-balanced big data exhibit all three properties. Big data fall into many categories, e.g., sensor data, log data, customer data, financial data, weather data, picture data, and movie data. In particular, fault big data are well known as characteristic log data in software engineering. In this paper, we analyze fault big data, considering the unique features that arise from big data under the operation of open source software. In addition, we analyze actual data to show numerical examples of reliability assessment based on the results of multiple regression analysis, well known as the quantification method of the first type.


Introduction
The waterfall development model is well known as the traditional software development style. At present, software development has shifted to various paradigms. In particular, open source software (OSS) is developed in a unique style, the OSS project, whose development cycle flows through development, version release, use by users, bug reporting, checking and modification of the OSS, and release of a new version. Recently, OSS with network connection services has been increasing. Thus, network-oriented OSS such as cloud services, servers, and IoT (Internet of Things) device software has been increasing alongside standalone software.
In the past, various methods based on software reliability growth models have been proposed by several research groups [1,2]. On the other hand, several research papers for OSS reliability assessment have been published [3].
There are many OSS reliability assessment methods based on stochastic models. In addition, there are several methods based on empirical data analysis [4,5]. In particular, it is very useful for OSS developers to understand the trend of the fault big data recorded on the OSS bug tracking system from a bird's-eye view. The organization of this paper is as follows. Section 3 proposes multiple regression analysis to solve the degree-of-freedom problem for large scale fault data. Section 4 describes the forward-backward stepwise selection method and applies it to the fault big data. Section 5 discusses the upper and lower confidence limits based on the typical hazard rate model. Section 6 discusses the characteristics of the proposed method.
Many software reliability assessment methods based on stochastic models have been proposed by several researchers [6][7][8]. Recently, it has become difficult to assess software reliability because of the variety of software development styles. Historically, fault data sets have been used for software reliability assessment. In addition, software reliability assessment methods based on measurements of software metrics have been proposed in the past [9,10]. At present, many kinds of fault data are recorded on the bug tracking system in the case of OSS. In particular, the various categorized fault data sets of OSS will be useful for reliability assessment. If we can assess the fault big data from the standpoint of statistical analysis, we will be able to propose a highly accurate method that integrates stochastic models and statistical analysis.
As related work, several research papers have proposed methods for the upper and lower limits based on software reliability growth models [1,2], and the empirical approach for OSS [3]. However, it is difficult to obtain the upper and lower bounds of a stochastic model for big data because of the degree-of-freedom problem. Generally, the degree of freedom is given by the number of data points. However, it is difficult to use the number of data points as the degree of freedom for big data, because the data set is large scale. Instead, we will be able to use the number of explanatory variables in place of the number of data points. In this paper, we propose a data analysis method based on the quantification method of the first type. We focus on analyzing the fault big data with a simpler method, because analyses of fault big data require a lot of time for calculation. Multiple regression-based models for analyzing financial data have been proposed in the financial research area [11,12]. Moreover, multiple regression analysis is used in network research [13]. In this way, statistical methods such as multiple regression analysis have been applied to various research areas. This paper proposes a method based on statistical analysis and a typical hazard rate model for large scale fault data analysis and OSS reliability assessment. Furthermore, we show several analysis examples based on the proposed method by using actual fault big data. Table 1 presents part of the raw fault big data. We can use the data in terms of time and categories as shown in Table 1. However, it is difficult to use the categorical data directly for reliability assessment. Historically, data sets in terms of the number of faults and the time between software failures have been widely used for software reliability assessment.
Therefore, we convert the categorical data sets into the number of software faults. For example, Table 1 can be converted into Table 2. Each line in Table 1 represents one fault; e.g., Table 1 contains 5 faults. In addition, the unit of "Opened" is "day". Many software reliability growth models have been proposed by several researchers.

Fault Data Analysis
Therefore, it stands to reason that the categorical data sets are converted into the number of faults and the time between software failures from the standpoint of software reliability engineering. We define the data sets in Table 2 as dummy variables for the multiple regression analysis.
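As an illustrative sketch (not taken from the original study), the conversion from Table 1-style categorical records to the dummy variables of Table 2 can be written with pandas; the column names and values here are hypothetical stand-ins for the bug-tracking categories:

```python
import pandas as pd

# Hypothetical miniature of Table 1: each row is one reported fault.
raw = pd.DataFrame({
    "Opened":   [1, 3, 3, 7, 10],                        # elapsed days
    "Product":  ["Core", "Core", "UI", "UI", "Core"],
    "Hardware": ["PC", "PC", "Mac", "PC", "Other"],
    "Status":   ["NEW", "FIXED", "NEW", "FIXED", "NEW"],
})

# Convert the categorical columns into 0/1 dummy variables (Table 2 analogue).
dummies = pd.get_dummies(raw, columns=["Product", "Hardware", "Status"])

# Counting rows per day recovers the familiar number-of-faults data.
faults_per_day = raw.groupby("Opened").size()
```

Each original row remains one fault; only its categorical attributes are re-expressed as indicator columns such as `Product_Core`.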

Multiple Regression Analysis
Generally, the number of data points is used as the degree of freedom in statistics. In the case of big data, it is very difficult to estimate the upper and lower limits of the stochastic models from the number of data points, because the volume of data is huge. Many methods of empirical OSS assessment have been proposed [4,5,14,15]. However, the fault data of OSS are large in size, so it is difficult to assess the fault big data. We therefore focus on the number of explanatory variables: the upper and lower bounds can be estimated by using the number of explanatory variables as the degree of freedom. Multiple regression analysis is well known as a method for understanding the relationship between the objective variable and the explanatory ones. The analysis steps in this paper are as follows:
Step 1: Pairplots for each factor are used to take an overview of the fault big data.
Step 2: We apply the heatmap to decide the objective variable.
Step 3: The explanatory variables are narrowed down by the forward-backward stepwise selection method. Then, the degree of freedom is given by the number of selected explanatory variables.
Generally, the multiple regression equation is given as follows:

F = α_0 + α_1 x_1 + α_2 x_2 + ··· + α_n x_n,

where F is the objective variable, α_i the i-th partial regression coefficient (α_0 the intercept), and x_i the i-th explanatory variable (here, a dummy variable). After deciding the objective variable, we discuss the estimation results by heatmap analysis. The heatmap for the actual fault big data is shown in Figure 5. From Figure 5, we find that the weight parameters of "Hardware", "OS", "Changed", and "Status" are large. Therefore, we focus on these four factors, "Hardware", "OS", "Changed", and "Status", as the objective variable, respectively. We analyze the data in the period from January 2001 to May 2020. The x and y axes of Figures 1-5 are based on the values of Table 2.
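As a minimal sketch of Steps 1-2 (on synthetic data; the study itself draws pairplots and a heatmap over the real bug-tracking categories), the quantity underlying such a heatmap is simply the correlation matrix of the dummy-coded factors, and picking the factor most correlated with the others is one hypothetical way to shortlist an objective variable:

```python
import numpy as np
import pandas as pd

# Synthetic 0/1 dummy data standing in for four bug-tracking factors.
rng = np.random.default_rng(1)
data = pd.DataFrame(
    rng.integers(0, 2, size=(200, 4)).astype(float),
    columns=["Hardware", "OS", "Changed", "Status"],
)

# The heatmap visualises this pairwise correlation matrix.
corr = data.corr()

# Heuristic (our assumption, not the paper's rule): rank factors by the sum
# of absolute correlations with the others, excluding the diagonal.
strength = corr.abs().sum() - 1.0
candidate = strength.idxmax()
```

With real data, the columns of `corr` would be rendered as the Figure 5 heatmap rather than inspected numerically.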
Moreover, we analyze all categories of the OSS fault big data by multiple regression. The estimation results based on multiple regression analysis with Hardware, OS, Changed, and Status as objective variables are shown in Table 3. For example, the top of Table 3 denotes that the objective variable is Hardware; the other categories then serve as the explanatory variables. From Table 3, the multivariate regression models are obtained as follows:
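The fitting step above can be sketched with ordinary least squares on synthetic dummy data (the coefficients and noise level below are illustrative assumptions, not the paper's estimates):

```python
import numpy as np

# Synthetic data: four 0/1 dummy explanatory variables and an objective
# variable generated as F = 3.0 + 2.0*x1 - 1.0*x2 + 0.5*x3 + 0.0*x4 + noise.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(100, 4)).astype(float)
true_coef = np.array([2.0, -1.0, 0.5, 0.0])
y = 3.0 + X @ true_coef + rng.normal(0, 0.1, size=100)

# Prepend an intercept column and solve the least-squares problem,
# recovering alpha_0 (intercept) and alpha_1..alpha_4.
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
```

The fitted `coef` approximates the partial regression coefficients α_0, …, α_n of the regression equation.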

Forward-Backward Stepwise Selection Method
The forward-backward stepwise selection method is well known as a method for selecting explanatory variables in multiple regression analysis. We use it in the multiple regression analysis of the OSS fault big data. In particular, we apply the backward stepwise selection method, with the following steps:
Step 1: All explanatory variables are analyzed by multiple regression.
Step 2: From the results of Step 1, an explanatory variable is removed if its p-value is larger than 0.01.
Step 3: The selected explanatory variables are analyzed by multiple regression again.
Step 4: The above steps are repeated until no explanatory variable has a p-value larger than 0.01.
There are many forward-backward stepwise selection methods. However, it would be difficult to analyze the fault big data sets with more complex analysis methods. Therefore, this paper uses only the simple steps above, because the fault big data sets have many factors and many lines of bugs. In the case of big data, it is very important to consider the calculation time and complexity of the estimation.
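The backward elimination loop of Steps 1-4 can be sketched as follows (a minimal implementation on synthetic data; the p-values are computed from t-statistics, which we assume matches the paper's procedure):

```python
import numpy as np
from scipy import stats

def backward_select(X, y, names, alpha=0.01):
    """Repeatedly refit OLS and drop the explanatory variable with the
    largest p-value above alpha, until all remaining p-values <= alpha."""
    names = list(names)
    while True:
        n, k = X.shape
        A = np.column_stack([np.ones(n), X])
        coef = np.linalg.lstsq(A, y, rcond=None)[0]
        resid = y - A @ coef
        dof = n - (k + 1)
        s2 = resid @ resid / dof                      # residual variance
        cov = s2 * np.linalg.inv(A.T @ A)             # coefficient covariance
        t = coef / np.sqrt(np.diag(cov))
        p = 2 * stats.t.sf(np.abs(t), dof)[1:]        # skip the intercept
        worst = int(np.argmax(p))
        if p[worst] <= alpha or k == 1:
            return names, coef
        X = np.delete(X, worst, axis=1)               # Step 2: remove it
        del names[worst]                              # Step 3: refit in loop

# Synthetic example: the third factor has no real effect on y.
rng = np.random.default_rng(2)
X = rng.integers(0, 2, size=(300, 3)).astype(float)
y = 1.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(0, 0.5, 300)
kept, coef = backward_select(X, y, ["Product", "Version", "Opened"])
```

On such data, the informative factors survive the elimination while a pure-noise factor is typically dropped.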
From Table 4, the multivariate regression models based on the backward stepwise selection method are obtained. The variables selected from the explanatory variables by the backward stepwise selection method are shown in Table 5. From Table 5, "Product", "Version", and "Assignee" are included as common factors for all objective variables. This means that these three factors are very important for detecting and fixing the faults recorded on the bug tracking system. From these estimation results, we consider that OSS developers can manage appropriately by using the information obtained from "Product", "Version", and "Assignee". On the other hand, "Opened" and "Reporter" have been removed from the explanatory variables of all models by the backward stepwise selection method. In other words, "Opened" and "Reporter" may not be important from the standpoint of OSS quality control.

Multiple Regression Analysis with Application to Reliability Assessment
Many software reliability assessment models have been proposed in the past [17][18][19][20][21]. In particular, the hazard rate model is well known as a typical software reliability model. We apply the hazard rate model to the time-interval between fault corrections. The distribution function of X_k (k = 1, 2, ···), representing the time-interval between the (k − 1)-th and k-th fault corrections, is defined as

F_k(x) ≡ Pr{X_k ≤ x},  (10)

where Pr{Φ} represents the occurrence probability of event Φ. Therefore, the following derived function is the probability density function of X_k:

f_k(x) ≡ dF_k(x)/dx.  (11)

From Equations (10) and (11), the hazard rate is given by

z_k(x) ≡ f_k(x) / {1 − F_k(x)},  (12)

where the hazard rate means [1,22] the software correction rate when no software correction occurs during the time-interval (0, x]. Therefore, the software reliability assessment measures are obtained from the typical hazard rate model based on Equation (12), i.e.,

z_k(x) = φ [N − (k − 1)].

The probability density function can then be derived as

f_k(x) = φ [N − (k − 1)] exp{−φ [N − (k − 1)] x},

where N is the number of latent faults in the OSS and φ the hazard rate per inherent fault. Then, the mean time between software fault corrections (MTBF_c) is given as follows:

E[X_k] = 1 / {φ [N − (k − 1)]}.

It is important to assess the upper and lower bounds of MTBF_c, E[X_k], because the difficulty of fault correction varies continuously. The upper and lower confidence limits for the MTBF_c can be estimated from the chi-squared distribution. The 100(1 − α)% upper and lower confidence limits for the MTBF_c are given by

E[X_k]_L = 2m E[X_k] / χ²_{1−α/2}(2m),  E[X_k]_U = 2m E[X_k] / χ²_{α/2}(2m),

where χ²_p(2m) denotes the 100p percentage point of the chi-squared distribution with 2m degrees of freedom, and m is the statistical degrees of freedom for the objective variable of the regression equation. We then consider the regression equation of Equation (8). The selected explanatory variables are shown in Table 4. From Equation (8) and Table 4, the statistical degrees of freedom for the regression equation is 9 in the case of "Changed".
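The confidence-limit computation can be sketched numerically as follows (a minimal sketch: the parameter values N and φ are illustrative assumptions, and only m = 9 is taken from the "Changed" case in the text):

```python
from scipy import stats

def mtbf_jm(N, phi, k):
    """MTBF_c of the typical (Jelinski-Moranda type) hazard rate model:
    E[X_k] = 1 / (phi * (N - k + 1))."""
    return 1.0 / (phi * (N - k + 1))

def mtbf_confidence_limits(mtbf, m, alpha=0.10):
    """Chi-squared 100(1-alpha)% limits for the MTBF, using the number of
    selected explanatory variables m as the degrees of freedom."""
    lower = 2 * m * mtbf / stats.chi2.ppf(1 - alpha / 2, 2 * m)
    upper = 2 * m * mtbf / stats.chi2.ppf(alpha / 2, 2 * m)
    return lower, upper

# Illustrative values: N = 100 latent faults, phi = 0.01, the 20th fault,
# and m = 9 explanatory variables as in the "Changed" case.
m = 9
mtbf = mtbf_jm(N=100, phi=0.01, k=20)
lo, hi = mtbf_confidence_limits(mtbf, m, alpha=0.10)
```

Because m stays small (the number of selected explanatory variables rather than the number of fault records), the interval does not collapse even for very large data sets.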
In the case of "Changed", we estimate the 90% upper and lower confidence limits for MTBF_c. As an example, the estimated upper and lower confidence limits for MTBF_c are shown in Figure 6. As shown in Figure 6, we can assess the upper and lower confidence limits for MTBF_c. By using the estimation results in Equation (8) and Table 4, we can consider the degrees of influence of the explanatory variables on MTBF_c through the upper and lower confidence limits. The upper and lower bounds in Figure 6 reflect the influences of "Product", "Component", "Version", "Assignee", "Status", "Resolution", "Hardware", "OS", and "Summary" as the main factors. Several research papers have proposed methods for the upper and lower limits based on software reliability growth models, and the empirical approach for OSS [23,24]. As a comparison with the conventional method, we show the estimated upper and lower confidence limits based on the conventional method in Figure 7, where the number of fault data points is used as the degree of freedom. In Figure 7, we find that the conventional method cannot estimate the upper and lower bounds accurately because the value of the degree of freedom is too large. On the other hand, the proposed method can appropriately estimate the upper and lower confidence limits for the actual fault big data, because the degree of freedom is properly given by the proposed method. As a comparison with another model, we compare the following Moranda model with the Jelinski-Moranda model.
z_k(x) = D c^(k−1)  (k = 1, 2, ···; 0 < c < 1),

where c is the decreasing rate of the hazard rate and D the hazard rate per inherent fault. Similarly, we show the estimated upper and lower confidence limits for the MTBF_c in the actual fault big data in the case of the Moranda model in Figure 8. Moreover, we show the estimated upper and lower confidence limits based on the conventional method in the case of the Moranda model in Figure 9. As our main contribution, we have proposed an estimation method for the upper and lower confidence limits based on the typical hazard rate model. The conventional models cannot estimate the upper and lower confidence limits because the degree of freedom is very large. In contrast, the proposed method can estimate the upper and lower confidence limits based on the typical hazard rate model even for large scale fault data sets.
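The two hazard rate models compared above can be sketched side by side (the parameter values N, φ, D, and c below are illustrative assumptions): the Jelinski-Moranda hazard rate decreases linearly in k, while the Moranda hazard rate decreases geometrically.

```python
import numpy as np

# Illustrative parameters for the two models.
N, phi = 100, 0.01   # Jelinski-Moranda: latent faults, per-fault hazard rate
D, c = 1.0, 0.95     # Moranda: initial hazard rate, decreasing rate

k = np.arange(1, 51)                 # fault index k = 1..50
z_jm = phi * (N - k + 1)             # linear decrease
z_moranda = D * c ** (k - 1)         # geometric decrease

# The corresponding MTBF_c is the reciprocal of the hazard rate.
mtbf_jm_vals = 1.0 / z_jm
mtbf_moranda_vals = 1.0 / z_moranda
```

Both hazard rates shrink as faults are corrected, so both MTBF_c sequences grow, reflecting reliability growth; the chi-squared interval of the previous section applies to either model's E[X_k].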

Conclusions
We have discussed the quantification method of the first type for the faults recorded on the bug tracking system of OSS, applying multiple regression analysis. We have found that the proposed method can identify the factors that are important for OSS quality control.
It is difficult for OSS developers to make assessments from the bug tracking system because the fault big data are large scale. The proposed method has a simple structure based on the traditional stepwise selection method; therefore, it can easily be applied to other OSS. The proposed method can find the main factors, as explanatory variables, affecting quality management. Thereby, OSS developers will be able to easily assess the quality from the standpoint of the conditions recorded in the actual fault big data.
In particular, we have applied the estimation results of the multiple regression analysis to reliability assessment. In the big data situation, the objective variable will depend on various explanatory variables. We have proposed a reliability assessment method based on multiple regression analysis and a stochastic model for the OSS fault big data. With this method, OSS managers can assess the upper and lower limits of the software reliability models for the fault big data, and thereby comprehend the stability of OSS development and operation.

Conflicts of Interest:
The authors declare no conflict of interest.