Performance Analysis of a Clustering Model for QoS-Aware Service Recommendation

The number of web services has grown rapidly in recent years. One of the most challenging issues in service computing is the personalized recommendation of Web services. Most current research recommends services based on Quality of Service (QoS) data, with little consideration of service-side factors such as service functions. In this paper, a new QoS-aware Web service recommendation model based on user and service clustering (RMUSC) is proposed to improve recommendation accuracy. Firstly, similar users are clustered together by a Top-N similarity algorithm applied to the user QoS records. Secondly, service clusters are established by a K-means++ based filtering method. Finally, a collaborative scheme over users and services is exploited to obtain potential user QoS preferences and generate recommendations. The experimental results show that when the density of the service invocation matrix is 5%, 10% and 20%, the mean absolute error (MAE) and root mean square error (RMSE) of RMUSC are lower than those of other methods.


Introduction
Powered by advances in Internet technology, web services with various functions improve the lives of ordinary people [1,2]. However, it is difficult for users to select appropriate web services themselves, because they lack professional knowledge and the number of web services is large. Therefore, how to efficiently and accurately recommend services based on user preference has become a challenging issue for both industry and academia [3][4][5].
The basic assumptions of web service recommendation methods are [6]: (1) users prefer a service and services similar to it; (2) users prefer services that are used by other users with similar backgrounds and preferences; (3) users prefer a service with certain characteristics as well as other services with similar characteristics. At present, web service recommendation methods mainly include recommendation based on IF/THEN rules [7], content recommendation [8], collaborative filtering recommendation [9] and mixed recommendation [10]. Figure 1 shows the general structure of web service recommendation. Web service recommendation is widely used in office automation (OA), the internet of vehicles, and tourism services. The service recommendation management platform is used for web service customization and publishing. Business, vehicle and tourist scenarios are the main applications of service recommendation.
The collaborative filtering recommendation (CFR) is a mainstream recommendation method [11,12]. Compared with other recommendation approaches, CFR has two advantages. First, there is no special requirement on the recommended objects, and recommendations can be generated for complex and abstract resources. Second, only explicit or implicit historical user evaluation data is needed; no prior knowledge of the user's own attributes is required. Although collaborative filtering recommendation has achieved many important research results, there are still key issues to be solved, including data sparsity, cold start and scalability [13]. The data sparsity problem is that users in the current dataset have few ratings on related web services, which makes it difficult to consider the influence of latent factors on users' preferences for web services. Since each latent factor has a great impact on the accuracy of recommendation, this greatly reduces the accuracy of web service recommendations.
The advantage of CFR is that it can effectively handle complex unstructured objects without special requirements for recommendation objects [12]. However, the CFR-based service recommendation methods are significantly affected by diverse factors, such as server-side features and the sparsity of the user information matrix.
Most web service recommendation approaches build a recommendation model based on users' information, history records, user preferences and Quality of Service (QoS). Other factors, such as geographical factors or service functional factors, can also be considered to improve recommendation accuracy. However, given the sparseness of the QoS matrix, the user neighborhood and the attributes of service functions are rarely considered when improving the accuracy of web service recommendations. To effectively improve recommendation performance, this paper proposes a new QoS-aware Web service recommendation model based on user and service clustering (RMUSC), which combines user and service factors to obtain more accurate Web service recommendations. The WSDream project, hosted on the www.github.com code hosting site, uses Planet-Lab to collect QoS datasets covering calls to 5825 web services by 339 users in 73 countries, and we take our data from this open source project [14][15][16][17][18]. The main contributions of this paper are as follows.


We build a new service feature extraction system that extracts the characteristics of services from WSDL files, and we cluster users based on the QoS records of the services they invoke.

We exploit similar user clusters and service clusters through collaborative filtering with matrix factorization, and obtain potential user QoS preferences to generate recommendations.

We develop a novel recommendation method that jointly considers user QoS clustering and service clustering, thereby improving the accuracy of the recommendation.

We perform experiments on a real data set, WSDream, and compare with other methods on the basis of MAE and RMSE. Our approach achieves better performance than other mainstream approaches.
The remainder of this paper is organized as follows: Section 2 introduces the related work about services recommendation. Section 3 depicts the principles of the RMUSC model, and combines our similar user cluster and service cluster in the recommendation model based on user and service clustering (RMUSC) to predict potential QoS values. In Section 4, experiments are performed to evaluate the approach. Section 5 concludes this paper with potential future directions.

Related Works
Many researchers have tried to improve the accuracy and performance of web service recommendation algorithms. Zheng Cheng et al. [19] introduced the geographical factor, relied on matrix decomposition to alleviate the sparsity of the prediction matrix, and used a stochastic gradient descent algorithm for fast convergence; the resulting collaborative filtering algorithm gives more accurate predictions than other algorithms. Deng Ailin et al. [20] developed a collaborative filtering algorithm (IPR) based on item rating prediction. The algorithm uses an item-based collaborative filtering method to fill in the null values among the items rated by users, and the similarity between users is calculated on the union of their rated items. To address the loss of recommendation accuracy caused by neglecting the sparsity of QoS between Mashups and services and the multi-dimensional information, Cao et al. [21] proposed a QoS-aware service recommendation method for IoT Mashup applications based on a relational topic model and a factorization machine. Shun Li et al. [22] pointed out that in the Web 2.0 scenario, the WSDL file of a web service has fields describing the service function. By extracting the relevant fields, services are classified by function using statistical methods, which addresses the problem of functionally mismatched recommendation results.
To solve the QoS-aware ASC problem with multiple QoS criteria constraints, Wang [23] proposed an extended version of the classical graphplan and backward A* search algorithm. Pappalardo et al. [24] proposed a reputation-based model that can support the composition of complex cloud services. To address the difficulty traditional Web service recommendation methods have with large amounts of service data, Zhang [25] proposed a CA-QGS algorithm based on Spark's quotient space granularity analysis, which takes into account both cost and QoS measures. Traditional item-based collaborative filtering (ICF) involves privacy issues. In response to this shortcoming, Yan et al. [26] improved the traditional method by integrating locality-sensitive hashing (LSH). Zhang et al. [27] proposed a new framework that combines web service grouping, distance estimation, service utilization level estimation and item-to-item comparison (Pearson correlation coefficient, PCC). Kang et al. [28] proposed a diversity-aware Web service ranking algorithm that uses the user's interest and QoS preferences, derived from the web service usage history, together with the diversity of candidate services, to calculate the ranking score of Web service candidates. A Web service map is also constructed based on the functional similarity between Web services, and candidate entries are evaluated by their scores and the degree of diversity derived from the Web service map. Yan Hu [29] proposed a time-aware collaborative filtering method for high-quality Web service recommendation, which integrates time information into similarity measurements and QoS predictions. In addition, to alleviate the problem of data sparseness, a hybrid personalized random walk algorithm is designed to infer the similarity of indirectly associated users and the similarity of services.
The aforementioned studies achieve good performance in predicting and recommending related services and variables based on QoS. However, most of them ignore the significance of service functions and user similarity. References [19] and [28] consider the contextual characteristics of services without the service function characteristics. Reference [22] considers the similarity of the service geographical level without the functional attributes of services and the similarity of the users. References [29] and [25] indicated that, by leveraging the collaborative filtering algorithm [30], many recommendation models are deeply affected by the sparseness of QoS matrix data, which may cause the low similarity of recommended results. Therefore, this paper proposes a clustering model that combines user QoS and services, which can solve the above problems and thus improve the accuracy of recommendations.

RMUSC Architecture
The RMUSC recommendation model proposed in this paper is shown in Figure 2. First, context features are extracted from the WSDL document of each Web service to obtain its functional description and clustering features. Then, users are clustered based on the QoS of the services they have requested in the past. A matrix decomposition model is used to predict user QoS and generate service recommendations, thereby alleviating the data sparsity problem of the traditional CFR algorithm. Finally, the RMUSC model is tested and verified: 70% of the data set is used for model training to obtain the best parameters, and the remaining data is used for testing recommendation performance. The generated recommendation results are compared with other Web service recommendation algorithms [30][31][32][33][34].

WSDL Service Description Files
Generally, a web service is provided to the user on the client side; for example, in a browser the user can directly browse and use the service provided by the current web page. Web services are generally described using a WSDL file [35][36][37][38]. The contextual characteristics of the WSDL file describe the functional categories to which the service belongs. We select the five most representative contextual features from the WSDL file: WSDL text, WSDL type, WSDL prompt message, WSDL port, and web service name. These features expose the functionality of web services and form the basis for service function clustering. Following the service APIs in the dataset, the open-source web crawler Heritrix is used to collect the WSDL text data of each service. The text is filtered by WSDL tag, and the required text data is stored in the database. The data is then formatted to obtain the standard dataset used as input for the subsequent feature extraction [39,40]. Table 1 shows the characteristics of a WSDL service description document.
TF (Term Frequency) [41] indicates the number of times a word appears in the entire text; the more often a word appears in the text, the more it reflects the theme of the text. IDF (Inverse Document Frequency) [42] indicates how often a word appears across multiple texts, and hence how important the word is to the text topic. A word that appears in many texts is not distinctive and is weaker in reflecting the subject of a given text [43].
The web service description text has many compound words, such as housework, classroom, and football. We apply stemming to the text to obtain a content vector [44,45]. The word frequency $F_w$ of each word $w$ in the content vector is defined as follows:
$$F_w = \frac{TF_w}{TF} \qquad (1)$$
where $TF$ is the total number of words in the sample document and $TF_w$ is how often the word $w$ appears in the document. The larger $F_w$ is, the more likely the word $w$ is to be a content descriptor. In this paper, a threshold is set, and if $F_w$ exceeds this threshold, the word is taken as a content descriptor. The inverse document frequency IDF of each content descriptor is defined as follows:
where $N$ is the total number of documents. TF-IDF is used to assess how important a word is to a text topic.
In this paper, a threshold is set for the result of the above formula, which is calculated for each content descriptor. Words above the threshold become the feature words of the current text; they reflect the theme of the text, that is, the web service function it describes. The feature words of each service text are merged into the feature word set of the web service, denoted as $FV_{s_i}$.
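The two-stage filtering described above (word frequency first, then TF-IDF) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the IDF is taken in its common form, the logarithm of the number of documents divided by the number of documents containing the word, and the function name `feature_words` and both threshold values are hypothetical rather than taken from the paper.

```python
import math
from collections import Counter


def feature_words(docs, freq_threshold, tfidf_threshold):
    """Return one set of feature words per tokenised document.

    docs: list of token lists (assumed to be stemmed already).
    freq_threshold, tfidf_threshold: hypothetical cut-offs standing in
    for the two thresholds mentioned in the text.
    """
    n_docs = len(docs)
    # Number of documents in which each word occurs at least once.
    doc_freq = Counter(word for doc in docs for word in set(doc))

    result = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        selected = set()
        for word, count in counts.items():
            f_w = count / total                      # word frequency F_w
            if f_w <= freq_threshold:
                continue                             # not a content descriptor
            idf = math.log(n_docs / doc_freq[word])  # assumed standard IDF
            if f_w * idf > tfidf_threshold:
                selected.add(word)
        result.append(selected)
    return result


docs = [
    ["weather", "forecast", "city", "weather"],
    ["stock", "quote", "price", "stock"],
    ["weather", "city", "map", "route"],
]
fv = feature_words(docs, freq_threshold=0.3, tfidf_threshold=0.1)
```

In this toy corpus only "weather" and "stock" pass both thresholds for the first two documents, and the third document yields no feature words because none of its words is frequent enough.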
NGD (Normalized Google Distance) [46] is a measure of the relatedness of two words, obtained by a normalized calculation on hit counts returned by the Google search engine. The calculation is as follows:
$$NGD(x, y) = \frac{\max\{\log f(x), \log f(y)\} - \log f(x, y)}{\log M - \min\{\log f(x), \log f(y)\}}$$
where $M$ is the total number of web pages searched by Google, $f(x)$ and $f(y)$ are the numbers of hits for the feature words $x$ and $y$ respectively, and $f(x, y)$ is the number of web pages on which $x$ and $y$ appear together.
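As a small worked illustration, the normalized Google distance can be computed directly from hit counts. The sketch below assumes the standard definition of NGD (difference of the larger log hit count and the joint log hit count, divided by the log index size minus the smaller log hit count); the function name `ngd` and the sample counts are hypothetical.

```python
import math


def ngd(f_x, f_y, f_xy, m):
    """Normalized Google Distance from hit counts.

    f_x, f_y: hits for each word; f_xy: pages containing both; m: index size.
    All counts are assumed to be positive; a real system would have to
    decide how to treat words with zero hits.
    """
    log_x, log_y, log_xy = math.log(f_x), math.log(f_y), math.log(f_xy)
    return (max(log_x, log_y) - log_xy) / (math.log(m) - min(log_x, log_y))


# Two words that always occur together are at distance 0.
same = ngd(1000, 1000, 1000, 10**6)
# Two words that rarely co-occur are further apart.
apart = ngd(1000, 1000, 10, 10**6)
```

With these made-up counts the second distance works out to log 100 / log 1000, that is 2/3, so smaller joint hit counts translate into larger distances.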
This paper uses the normalized Google distance between the feature words of web services to calculate the similarity of two web services. The calculation formula is as follows:

Web Service Clustering Decision
Because the initial cluster centers of the classical K-means clustering algorithm are selected at random, the centers may lie too close together, which greatly affects the classification results [47]. The K-means++ algorithm instead selects initial centers that are as far apart as possible, which can significantly reduce the final error of the classification result. Based on the service feature word sets and their Google distances, the K-means++ algorithm is used to cluster the web services [48]. Algorithm 1, which we develop, is expressed as follows:
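The seeding idea behind K-means++ can be illustrated with a short NumPy sketch. It is not a reproduction of Algorithm 1: it assumes that services have already been embedded as points in Euclidean space, whereas the paper clusters on feature word sets and Google distances, and the function name `kmeans_pp_init` and the sample points are hypothetical. The sketch also assumes the points are not all identical, otherwise the sampling weights would be undefined.

```python
import numpy as np


def kmeans_pp_init(points, k, rng):
    """Pick k initial centres from `points` using K-means++ seeding."""
    n = points.shape[0]
    centres = [points[rng.integers(n)]]          # first centre: uniform at random
    for _ in range(k - 1):
        # Squared distance from every point to its nearest chosen centre.
        diffs = points[:, None, :] - np.asarray(centres)[None, :, :]
        d2 = (diffs ** 2).sum(axis=2).min(axis=1)
        # Sample the next centre with probability proportional to d2, so
        # points far from all current centres are favoured.
        centres.append(points[rng.choice(n, p=d2 / d2.sum())])
    return np.asarray(centres)


rng = np.random.default_rng(0)
points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0],
                   [10.0, 11.0], [20.0, 0.0], [21.0, 0.0]])
centres = kmeans_pp_init(points, k=3, rng=rng)
```

Because an already chosen point has distance zero to itself, it can never be drawn again, which is what keeps the initial centres from collapsing onto each other.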

User Clustering Algorithm
User similarity can be calculated from the QoS values observed by different users who invoke the same Web service. In some cases, a user's QoS value may be missing. Missing values can be predicted from the QoS values observed by similar users [49].
Since the cosine similarity measure only considers the similarity between two vector directions, it does not account for differences in rating scale between users, whose scores for the same items can therefore differ [50]. The modified cosine similarity mitigates the effect of this difference by subtracting each user's average score from that user's ratings. We use the modified cosine similarity to measure the similarity, as follows:
$$sim(i, j) = \frac{\sum_{c \in I_{ij}} (r_{i,c} - \bar{r}_i)(r_{j,c} - \bar{r}_j)}{\sqrt{\sum_{c \in I_i} (r_{i,c} - \bar{r}_i)^2}\,\sqrt{\sum_{c \in I_j} (r_{j,c} - \bar{r}_j)^2}}$$
where $I_{ij}$ is the set of items that users $i$ and $j$ have both scored, $I_i$ is the set of items scored by user $i$, $I_j$ is the set of items scored by user $j$, $r_{i,c}$ is the score of user $i$ for item $c$, and $\bar{r}_i$ is the average score of user $i$. We take the QoS data of web services from the users' invocation history and use the modified cosine similarity to group similar users. For each user, the N nearest users are taken as the neighbors of the current user. Algorithm 2, which groups the N most similar users for each user, is as follows:
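A minimal sketch of the neighbor selection described above, assuming the usual form of the modified (mean-centered) cosine similarity and a QoS matrix in which missing records are stored as NaN. It is not the paper's Algorithm 2; the function names `modified_cosine` and `top_n_neighbours` and the sample matrix are hypothetical, and every user is assumed to have at least one record.

```python
import numpy as np


def modified_cosine(qos, i, j):
    """Modified cosine similarity between users i and j.

    qos: users x services array with np.nan for missing records.
    """
    seen_i, seen_j = ~np.isnan(qos[i]), ~np.isnan(qos[j])
    both = seen_i & seen_j
    # Centre each user's records on that user's own mean.
    dev_i = qos[i] - qos[i, seen_i].mean()
    dev_j = qos[j] - qos[j, seen_j].mean()
    num = (dev_i[both] * dev_j[both]).sum()
    den = np.sqrt((dev_i[seen_i] ** 2).sum()) * np.sqrt((dev_j[seen_j] ** 2).sum())
    return num / den if den > 0 else 0.0


def top_n_neighbours(qos, n):
    """For every user, return the indices of the n most similar other users."""
    users = qos.shape[0]
    sim = np.array([[modified_cosine(qos, a, b) for b in range(users)]
                    for a in range(users)])
    np.fill_diagonal(sim, -np.inf)               # a user is not their own neighbour
    return np.argsort(-sim, axis=1)[:, :n]


qos = np.array([[1.0, 2.0, 3.0, np.nan],
                [2.0, 4.0, 6.0, np.nan],
                [3.0, 2.0, 1.0, 5.0],
                [np.nan, 1.0, 2.0, 3.0]])
neighbours = top_n_neighbours(qos, n=1)
```

Users 0 and 1 differ only in scale, so after mean-centering their similarity is 1 and each is the other's nearest neighbor, which is the effect the mean subtraction is meant to achieve.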

User QoS Prediction Algorithm
In wide-ranging internet interactions, the web services each user calls are specific to that user, which leads to a sparse user invocation matrix. In addition, many services may never have been invoked by a given user, so there is no relevant data to support prediction, which leads to the cold start problem. For the service scoring matrix, some latent factors have a significant impact on users' preferences for web services. Under this premise, matrix decomposition is widely used to decompose the user-service invocation matrix into low-rank factors, and the inner product of the factors is used to predict missing values in the user score matrix [51,52].
The user's scoring matrix for the services is denoted $R$, $U$ is the user feature matrix, and $S \in \mathbb{R}^{l \times m}$ is the service feature matrix. The feature vector in the user feature matrix is represented as $p_i$ and the service feature vector as $q_j$. Then the missing value $r_{i,j}$ in the scoring matrix of user $i$ for service $j$ is predicted as follows:
$$\hat{r}_{i,j} = p_i^{T} q_j$$
Collaborative filtering methods are widely used in many studies to predict QoS, usually in the following form:
However, due to the sparseness of QoS data, the traditional collaborative filtering algorithm has major defects in predicting QoS values. We can instead predict a user's QoS preferences for services collaboratively, based on user clustering and service clustering. User preferences based on user similarity clustering are defined as follows:
The goal of this optimization problem is to find users with similar preferences based on the web service feature matrix. On the other side, we predict the potential web service preferences of related users based on similar user clustering:
The goal of this optimization problem is to find web services matching potential user preferences based on the user feature matrix.
We merge these two optimization sub-problems to combine the user features and service features, and obtain the missing values in the user-service invocation matrix. This model is described as follows:
where the two weight coefficients control the influence of the user feature and the service feature respectively. A larger user-side weight indicates that adjacent users have a greater influence on the predicted QoS, and a larger service-side weight indicates that the service features have a greater influence. The gradient descent algorithm is used to find the optimal solution of Equation (11). The factors $p_i$ and $q_j$ of the target feature vectors are updated iteratively by the following method:
where the iteration factor controls the number and speed of iterations.
We use a new gradient descent algorithm to calculate the missing QoS values in the service invocation matrix. The flow of Algorithm 3 is listed hereafter:
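To make the role of the feature vectors and the iteration factor concrete, the following sketch trains a plain regularized matrix factorization by stochastic gradient descent and predicts each missing value as the inner product of a user vector and a service vector. It deliberately omits the user-cluster and service-cluster terms of the RMUSC objective, so it should be read as a baseline illustration rather than Algorithm 3; the function name `factorise` and the values of `rank`, `step`, `reg` and `epochs` are assumptions, with `step` playing the part of the iteration factor.

```python
import numpy as np


def factorise(qos, rank=2, step=0.01, reg=0.01, epochs=500, seed=0):
    """Fill missing QoS values with a low-rank factorisation trained by SGD.

    qos: users x services array with np.nan for missing records.
    rank, step, reg, epochs: hypothetical settings, not the paper's values.
    """
    rng = np.random.default_rng(seed)
    n_users, n_services = qos.shape
    p = rng.normal(scale=0.1, size=(n_users, rank))     # user feature vectors p_i
    q = rng.normal(scale=0.1, size=(n_services, rank))  # service feature vectors q_j
    observed = np.argwhere(~np.isnan(qos))
    for _ in range(epochs):
        rng.shuffle(observed)                           # visit records in random order
        for i, j in observed:
            err = qos[i, j] - p[i] @ q[j]
            p_old = p[i].copy()
            # Move each factor against the gradient of the squared error
            # plus an L2 penalty, scaled by the step size.
            p[i] += step * (err * q[j] - reg * p[i])
            q[j] += step * (err * p_old - reg * q[j])
    return p @ q.T                                      # predicted matrix


qos = np.array([[0.5, 1.0, 1.5],
                [1.0, 2.0, np.nan],
                [1.5, np.nan, 4.5]])
predicted = factorise(qos)
```

The two NaN entries of `qos` are filled by `predicted`; extending the update with neighbor-based terms would be the natural place to add the clustering information the model relies on.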

Simulation Results and Analysis
In this paper, a typical data set is selected for the performance analysis of the recommendation algorithm. The data set contains invocations by 339 users of 5825 real-world web services, with more than 1.5 million invocation records [17]. We used 70% of the user invocation records to train the algorithm and obtain the optimal values for the relevant parameters. The remaining 30% of the user data is used to validate our model. We obtain a sparse matrix by randomly deleting some records in the user invocation matrix.
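The random deletion used to control matrix density can be sketched as follows. This is an illustration only: it assumes a fully observed floating-point matrix as input and marks deleted records as NaN, and the function name `sparsify` and the toy matrix are hypothetical.

```python
import numpy as np


def sparsify(qos, density, rng):
    """Keep a random `density` fraction of the records, mark the rest missing."""
    flat = np.full(qos.size, np.nan)
    n_keep = int(round(density * qos.size))
    kept = rng.choice(qos.size, size=n_keep, replace=False)  # records that survive
    flat[kept] = qos.ravel()[kept]
    return flat.reshape(qos.shape)


rng = np.random.default_rng(0)
full = np.arange(200.0).reshape(10, 20)       # stand-in for a 10 x 20 QoS matrix
sparse = sparsify(full, density=0.1, rng=rng)
```

Sampling indices without replacement gives exactly the requested number of surviving records, which makes repeated runs at the same density directly comparable.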
Through the recommendation model and experiments, the QoS value $\hat{r}_{i,j}$ of the recommended service is obtained. We use the mean absolute error (MAE) and the root mean square error (RMSE) to evaluate the accuracy of the experimental results. The accuracy increases as both values decrease. MAE is calculated as below:
$$MAE = \frac{1}{N} \sum_{i,j} \left| r_{i,j} - \hat{r}_{i,j} \right|$$
where $r_{i,j}$ is the QoS value that user $i$ actually observes when invoking web service $j$, $\hat{r}_{i,j}$ is the QoS value predicted by the model, and $N$ is the number of predicted values. MAE measures the overall deviation of the predicted values from the true values. RMSE emphasizes relatively large errors and is defined as follows:
$$RMSE = \sqrt{\frac{1}{N} \sum_{i,j} \left( r_{i,j} - \hat{r}_{i,j} \right)^2}$$

User Services Clustering Analysis
Figure 3 shows three original sets of user services. Each presents the initial center of each original set. In practice, the location of user service sets can be obtained automatically. K similar service clusters (denoted by Si) can be derived by Algorithm 1 from the user service sets. In Figure 4, the X axis represents the offset of "portType" and the Y axis is the offset of "service". "portType" and "service" are the characteristic words shown in Table 1. Each point set with a different color represents a similar distribution. Each "*", denoted by Ci, is the center of a service cluster. Ci is determined by the proposed algorithm. We can observe in Figure 4 that there are K (K = 6) service clusters after applying the proposed algorithm. Furthermore, the original sets are almost evenly divided by the service clustering, which verifies the effectiveness of the algorithm. The accuracy of clustering (precision) is an important indicator of the effect of clustering. This article uses the same data set for clustering and analysis. The calculation formula of precision is:
$$precision = \frac{A}{A + B}$$
where A is the number of points correctly placed in the category, and B is the number of points not in the category but recorded as the category. The accuracy of clustering directly reflects the quality of the clustering. This paper uses the density-based clustering algorithm (DBSCAN) and the traditional K-means algorithm for clustering on the same data set, and uses the clustering results to calculate the clustering accuracy. From Figure 5, the service clustering algorithm in this paper has obvious advantages over the other two clustering algorithms across the number of clusters, and maintains good accuracy as the number of clusters grows.
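The evaluation measures used in this section (MAE, RMSE and clustering precision) are simple enough to state as code. The sketch below assumes the standard definitions of the two error measures and reads precision as A divided by A plus B; the function names and sample values are made up for illustration.

```python
import math


def mae(actual, predicted):
    """Mean absolute error over paired actual and predicted QoS values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean square error; squaring makes large deviations weigh more."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def precision(a, b):
    """a: points correctly in the category; b: points wrongly recorded in it."""
    return a / (a + b)


actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 8.0]     # a single large miss
print(mae(actual, predicted))        # 1.0
print(rmse(actual, predicted))       # 2.0
print(precision(8, 2))               # 0.8
```

The single large miss doubles RMSE relative to MAE in this example, which is why the two measures are reported together.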

Effects of the Weight Coefficients and Density on Service Recommendation
In Figure 6, we can conclude that the proposed model attains its best result when the user-side weight is 0.4 and the service-side weight is 0.5. Below these values, MAE and RMSE decrease as each parameter increases; beyond them, MAE and RMSE increase as the parameter increases. This shows that fusing the user neighborhood and service function characteristics into the recommendation model can improve the accuracy of recommendation. Using only one of the two reduces the accuracy, and the clustering weight of the service is greater than the user's clustering weight.
Therefore, the weights are set to 0.4 and 0.5 respectively. To illustrate the generality of the two parameters at different matrix densities, we discuss them in the following figure. The basic method is to fix one of the values and plot a cross-section over the other parameter.

Service Recommendation Analysis
To show that the model we develop has higher accuracy on the two evaluation metrics above, we compare it with the following mainstream collaborative filtering algorithms: (1) IPCC, which recommends on the basis of similar services [18]; (2) UPCC, which recommends on the basis of similar behavior between users [18]; (3) NIMF [19], which merges similar users with the matrix factorization model for recommendation; (4) LoNMF [9], which uses a local similar-neighbor matrix factorization model for recommendation. Table 2 shows that the proposed RMUSC considers both user-side and service-side factors.
This paper randomly deletes some QoS data in the data set to simulate the data sparsity of the user service invocation matrix, so that the density of the invocation matrix R can be controlled. A larger matrix density means that more data is available. To verify the reliability of the experiment, we repeat the experiment ten times at each matrix density. After repeatedly verifying and iterating the parameters of the recommendation model during the experiment, we set the user-side weight to 0.4, the service-side weight to 0.5, the remaining model parameter to 0.013, and N = 10. The comparison results of the service recommendation algorithms are shown in Figure 8. In Figure 8, we can see that our method obtains smaller MAE and RMSE values than the other four mainstream recommendation algorithms, which indicates that our approach has better recommendation accuracy. It also shows that applying service function features and adjacent user features to the model-based collaborative filtering recommendation algorithm gives a better recommendation result. It can also be seen that as the density of the matrix increases, the values of MAE and RMSE become smaller, indicating that more available data increases the accuracy of the recommendation model. Figure 9 compares the recommendations of our method with those of other mainstream methods. It can be seen that when the number of recommended results is set to 10, our method performs better than the second-ranked LoNMF method, and the recommendation precision is improved by about 19%.
This shows that the recommendation results obtained by our recommendation method are highly recognized.

Conclusions
In this paper, a new recommendation model is developed by jointly considering the impact of service function characteristics and similar user preferences. In the proposed model, this information is merged with the matrix factorization model to predict the missing QoS values. The experimental results show that the proposed model outperforms the other mainstream recommendation algorithms in terms of recommendation efficiency and accuracy. In the near future, Web services can be tagged to enhance the performance of the proposed recommendation model. Moreover, users in the same region generally have similar service usage features. Therefore, user location can also be considered as a factor in classifying users to improve the accuracy of Web service recommendation.
In the future, we will work on optimizing the algorithms to reduce complexity and on optimizing the framework to improve efficiency. A limitation of the method proposed in this article is that it relies on the QoS records of users' historical service calls, so processing lags behind new data. We hope to try online real-time processing, further optimize and improve the architecture, and improve data acquisition and analytical processing capacity.