Emerging Pattern-Based Clustering of Web Users Utilizing a Simple Page-Linked Graph

Web usage mining is a popular research area in data mining. With the extensive use of the Internet, it is essential to learn about the favorite web pages of its users and to cluster web users in order to understand the structural patterns of their usage behavior. In this paper, we propose an efficient approach to determining favorite web pages by generating large web pages and by mining emerging patterns from the generated simple page-linked graphs. We identify the favorite web pages of each user by eliminating the noise due to overall popular pages, and we cluster web users according to the generated emerging patterns. Afterwards, we label the clusters using Term Frequency-Inverse Document Frequency (TF-IDF). In the experiments, we evaluate the parameters used in our proposed approach, discuss their effect on generating emerging patterns, and analyze the results of clustering web users. The results of the experiments show that the exact patterns generated in the emerging-pattern step eliminate the need to consider noise pages; consequently, this step can improve the efficiency of subsequent mining tasks. Our proposed approach is thus capable of clustering web users from web log data.


Introduction
With the rapid growth of the Internet, research on the Internet has converged on several very active topics, such as social networks [1,2], web mining, and so on. Web mining comprises three categories: web content mining, web structure mining, and web usage mining. In Web Usage Mining (WUM), also known as web access pattern tracking, the access history of web pages is analyzed; the mining task is the process of extracting interesting patterns from web access logs. Web usage mining remains a popular research area in data mining. As the Internet grows, more and more useful information is hidden in web log data, so it is essential to learn about the favorite web pages of web users and to cluster web users in order to understand the structure of their usage behavior.
Many techniques in web usage mining have been proposed [3-8], and this field is still a hot research topic in data mining. Most existing web mining techniques are based on association rule mining or frequent-pattern mining, and they aim to find relationships among web pages or to predict the behavior of web users; it is difficult for them to find groups of web users with similar favorite web pages. Furthermore, several articles on clustering web users have been published recently [6,9,10], using different clustering algorithms. However, all of these articles cluster web users based on frequent-pattern mining of topics in common. Generating frequent patterns with a user-specified minimum support threshold yields the frequent web pages of all web users; that is, if some web pages are frequently accessed by one web user, then they are, with high probability, also accessed by other web users. Such commonly visited web pages provide no discrimination when clustering web users; they act as noise pages in clustering.
The discovery of class comparison or discrimination information is an important problem in the field of data mining. Emerging Patterns (EPs) [11,12], defined as multivariate features whose supports change significantly from one class to another, are very useful for discovering distinctions between different classes of data. Using the emerging-pattern mining technique, we can find emerging web pages in web log data. This technique has been detailed in many articles [13,14], and it is still a hot topic in computer science. Jumping Emerging Patterns (JEPs) [15] are a special kind of EP, describing discriminating features that occur in one class but do not occur in the other classes at all.
Term Frequency-Inverse Document Frequency (TF-IDF) [16] is a weighting technique commonly used in information retrieval and information mining, where it serves as a measure of the importance of a term to a document. It is widely used in the classification of literature, text mining, and other related fields.
Clustering algorithms are categorized based on their cluster model, and many clustering algorithms have been proposed, such as the K-means algorithm [17,18], the Self-Organizing Map (SOM) algorithm [19,20], Adaptive Resonance Theory (ART1) [21,22], and K-means & TF-IDF [23,24]. In this paper, the web pages accessed by each user are collected in text form, which can be regarded as the identification of that web user. The K-means & TF-IDF approach is used to cluster web users because of the advantages of TF-IDF in text mining.
Folksonomies [25] have been proposed as a collaborative way to classify online items, where the classification is determined by the labeling frequency of user groups. Many studies based on Folksonomies [26,27] have since been proposed. In this paper, we label the clusters based on the concept of Folksonomies.
Many approaches have been proposed for finding the interests of users [1,28,29]. The first proposed a linear regression-based method to evaluate user interest, calculate a similarity matrix, and cluster web users by thresholding the generated matrix. The second proposed an entropy-based approach to obtain user interests. The third proposed a community-based algorithm to retrieve user interests. These approaches ignore the characteristic of specific web pages that are frequently accessed by one user but barely accessed by others, which is precisely the kind of pattern that should be considered.
In this paper, we aim to cluster web users based on the user interests found in web log data. We propose an efficient approach built on emerging-pattern mining. In the mining task, the emerging patterns of each web user are used to define the interests that are frequently accessed by that user and barely accessed by others. By finding the emerging patterns of all web users, we can discard the noise (nonessential) web pages of each web user and cluster web users according to the generated typical web pages.
This paper is organized as follows: The next section introduces our proposed approach; then, we implement our proposed approach with a file of web log data for evaluation; finally, we discuss our conclusions and suggest future work.

Proposed Approach
In this section, we generate large web pages from the processed web log data, then scan and transform the clean data set into Simple Page-Linked Graphs (SPLGs), and generate emerging patterns from the generated SPLGs. We cluster web users based on the generated emerging patterns, and finally label the clusters with typical web pages. Our workflow is shown in Figure 1.

Preprocessing of Data Set
Web log data is automatically recorded in web log files on web servers when web users access the web server through their browsers. Not all of the records stored in the web log files have the right format or are necessary for the mining task, so a data cleaning phase needs to be implemented before analyzing the web log data.

Removing Records with Missing Value Data
Some of the records stored in the web log file will not be complete, because some of their fields were lost. For example, if a click-through to a web page was executed while the web server was shut down, then only the IP address, user ID, and access time will be recorded in the log file; the method, URL, referrer, and agent are lost. This kind of record cannot be used for our mining task, so these records must be removed.

Removing Records with Exception Status Numbers
Some records are caused by errors in the requests or by the server. Even if those records are intact, the activity did not execute normally. For example, records with status numbers 400 or 404 are caused by HTTP client errors: a bad request, or a requested page that was not found. Records with status numbers 500 or 505 are caused by HTTP server errors: an internal server error, or an unsupported HTTP version. These kinds of data are not needed for our task, so the records must be removed.

Selecting the Essential Attributes
As shown in the common log format of the web log data in Figure 2, there are many attributes in one record, but not all of them are necessary for web usage mining. In this paper, the attributes IP address, time, URL, and referrer are essential to our task, so they remain; the rest are discarded.
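The three cleaning steps above can be sketched as follows. This is a minimal illustration, assuming records in the Apache common/combined log format; the regular expression and field names are our own, not taken from the paper.

```python
import re

# Assumed record layout: common/combined log format, e.g.
# 1.2.3.4 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326 "ref" "agent"
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) \S+" (?P<status>\d{3}) \S+'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
)

def clean_record(line):
    """Return the essential attributes (IP, time, URL, referrer),
    or None if the record is malformed or has an HTTP error status."""
    m = LOG_PATTERN.match(line)
    if m is None:                      # missing-value / malformed record
        return None
    if m.group('status')[0] in '45':   # 4xx client error or 5xx server error
        return None
    # keep only the attributes needed for the mining task
    return (m.group('ip'), m.group('time'), m.group('url'), m.group('referrer'))
```

In practice, the cleaned tuples would then be sorted by IP address and split into user sessions, as described in the next section.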

Generation of Large Web Pages
A large page set is a set of frequent web pages. We define frequent web pages as those whose supports are greater than, or equal to, a user-specified minimum support threshold.
In this paper, a web log file denotes a data set, and Large Web Pages (LWPs) denote the set of web pages that are accessed by web users with sufficient frequency over a period of time. A special period of time called a user session is an important definition for generating LWPs. Generally, the length of a user session is defined by web designers according to the desired level of security. For websites with high security, the user session is kept short, such as 15 min or less: a web user who is inactive for a long time may have left the computer, and if someone else then uses his account without the original user's knowledge, it is not safe. For other, general websites, the user session can be longer, such as half an hour or one hour, or even indefinite. If, for simplicity, the user session is defined as one hour, then in the process of generating large web pages, we group the experimental data into periods of one hour for each web user.
After cleaning the data set, the data are sorted by the values of their IP address field and split by user session. As a result, a session-based data set is obtained, which serves as the input for our proposal. An example of a session-based data set is shown in Table 1, where the user session is defined as one hour. From the session-based data, the candidate large web pages of each web user are extracted and their supports are calculated. To calculate the support of each candidate, we count the visit times of each web page accessed in the different user sessions of each web user. The support is given by Equation (1), Support(P_ij) = N_P_ij / N_Session_i, where N_P_ij is the number of visits to web page j over all sessions of web user i, and N_Session_i is the number of sessions of web user i. Finally, a user-specified Minimum Support threshold for Large Web Pages (MSLWP) must be defined. The MSLWP denotes an abstraction level, i.e., a degree of generalization: the support value is determined by the proportion of sessions in which a web user accesses a web page. The selection of the MSLWP is very important; if it is low, we obtain information about detailed events, and if it is high, we obtain information about general events. The pseudocode for obtaining large web pages is shown in Algorithm 1.

Considering the session-based data set of Table 1 as the input to Algorithm 1, with the parameter MSLWP set to 0.25, the candidate-generation phase produces the candidate data set shown in Table 2. Afterwards, the support-filtering phase yields the large web pages of user 1: {p6, p12, p14, p19}.

Algorithm 1. getLWPs (List SD, double MSLWP)
Input: A set of session-based web data SD; a user-specified minimum support MSLWP.
Output: A set of large web pages LWP[i] for each web user i.
1. for each web user i
2.    N_Session_i = 0; // initialize the number of sessions for web user i
3.    for each sequence data SD_n of web user i in SD
4.       N_Session_i = N_Session_i + 1;
5.       for each web page P_ij // P_ij is the jth web page for web user i
6.          if (SD_n.URLs contain P_ij) then N_P_ij = N_P_ij + 1;
7.    for each candidate web page P_ij
8.       if (N_P_ij / N_Session_i >= MSLWP) then add P_ij to LWP[i]; // Equation (1)
9. return LWP;
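For concreteness, a minimal Python sketch of the large-web-page generation step; the input representation (a dict mapping each web user to a list of per-session page sets) is our assumption for illustration, not the paper's data structure.

```python
from collections import defaultdict

def get_lwps(sessions_by_user, mslwp):
    """sessions_by_user[user] = [set of pages visited in each session, ...];
    returns the large web pages of each user per Equation (1)."""
    lwps = {}
    for user, sessions in sessions_by_user.items():
        n_sessions = len(sessions)          # N_Session_i
        visits = defaultdict(int)           # N_P_ij: sessions in which page was visited
        for session in sessions:
            for page in session:
                visits[page] += 1
        # keep pages whose support N_P_ij / N_Session_i meets the MSLWP threshold
        lwps[user] = sorted(p for p, n in visits.items()
                            if n / n_sessions >= mslwp)
    return lwps
```

With a toy session list of four sessions, `get_lwps(sessions, 0.5)` keeps only the pages visited in at least half of the sessions.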

Generation of Simple Page Linked-Graph (SPLG)
After generating the large web pages of each web user, all of the large web pages are defined as vertices of the SPLG.
In a regular page-linked graph, an edge connects every two web pages that are contained in one session. An example of a page-linked graph for web user 1 is shown in Figure 3 (left). In an SPLG, however, an edge connects every two large web pages of the web user. Applying the concept of the SPLG to the structure of web page links reduces large and complex regular page-linked graphs to simple ones, thereby reducing noise web pages. In the SPLG, the link between each pair of large web pages must be checked. The direction of a link does not need to be considered: if two vertices are visited by one user in one session, then they are connected. The pseudocode for checking the links is shown in Algorithm 2.

Algorithm 2. checkLinks (List SD, String[][] LWP)
Input: A set of session-based web data SD; a set of large web pages LWP[i][j], where i is the index of web users and j is the index of large web pages for user i.
Output: A set of links with IP addresses.
1. for each web user i
2.    for each pair of large web pages (LWP[i][j], LWP[i][k]), j < k
3.       for each session SD_n of web user i in SD
4.          if (SD_n.URLs contain both LWP[i][j] and LWP[i][k])
5.             add the link (IP_i, [LWP[i][j], LWP[i][k]]); break;
6. return the set of links;
After generating all of the links, the links with the same IP address are grouped and connected at shared vertices; the SPLGs of all web users can then be generated. For example, for the experimental data set in Table 1, web user 1 visited 20 web pages {p1, p2, ..., p20} in 14 user sessions. With the MSLWP defined as 0.25, the large web pages of user 1 are {p6, p12, p14, p19}, as generated in the previous section. After implementing Algorithm 2, the links {(web user 1, [p6, p12]), (web user 1, [p6, p19]), (web user 1, [p12, p14])} are obtained, and the SPLG of web user 1 can be drawn as shown in Figure 3 (right).
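The link-checking step can be sketched in a few lines of Python; the data shapes (dicts of per-session page sets and per-user large-page lists) are our assumption for illustration.

```python
from itertools import combinations

def check_links(sessions_by_user, lwps):
    """Two large web pages of a user are linked (undirected) if they
    co-occur in at least one of that user's sessions."""
    links = {}
    for user, pages in lwps.items():
        edges = set()
        for a, b in combinations(sorted(pages), 2):  # order-free pairs
            if any(a in s and b in s for s in sessions_by_user[user]):
                edges.add((a, b))
        links[user] = edges
    return links
```

On a toy version of the user 1 example, sessions {p6, p12}, {p6, p19}, and {p12, p14} yield exactly the three edges listed above.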

Generation of Emerging Patterns
After generating SPLGs for all web users, we try to find emerging patterns in these SPLGs. Examples of SPLGs for some web users are shown in Figure 4. In the process of emerging-pattern mining, we use the ideas of ρ-EP [30] and JEP [31,32].
For example, we set the SPLG of web user U1 as class 1, and the SPLGs of all other web users as class 2. Table 3 shows the web pages in these two classes. Table 4 lists all the possible EPs, their supports, and their growth rates. The support of a pattern is given by Equation (2), Support(p_i) = N_p_i / N, where N_p_i is the number of occurrences of pattern p_i in the class and N is the number of all patterns in the class; the growth rate is given by Equation (3), GrowthRate(p_i) = Support_Class1(p_i) / Support_Class2(p_i), where Support_Class1(p_i) and Support_Class2(p_i) are the supports of the pattern in class 1 and class 2, respectively. If we set the minimum growth-rate threshold ρ to 1.5, there are nine EPs in class 1: four normal EPs ({p6}, {p12}, {p19} and {p12, p14}), whose growth rates are greater than ρ, and five JEPs ({p6, p12}, {p6, p19}, {p6, p12, p14}, {p6, p12, p19} and {p6, p12, p14, p19}), whose growth rates are infinite.
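A minimal sketch of the EP-selection rule defined by Equations (2) and (3); the pattern supports below are toy values, not the contents of Tables 3 and 4.

```python
def growth_rate(support_c1, support_c2):
    """Equation (3): ratio of class-1 support to class-2 support."""
    if support_c2 == 0:
        return float('inf')        # a Jumping Emerging Pattern (JEP)
    return support_c1 / support_c2

def select_eps(supports_c1, supports_c2, rho):
    """Keep patterns whose growth rate from class 2 to class 1 exceeds rho."""
    eps = {}
    for pattern, s1 in supports_c1.items():
        gr = growth_rate(s1, supports_c2.get(pattern, 0.0))
        if gr > rho:
            eps[pattern] = gr
    return eps
```

A pattern present only in class 1 gets an infinite growth rate (a JEP), while a pattern whose support barely differs between the classes falls below ρ and is discarded as noise.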

Clustering of Web Users
In this paper, we execute the K-means clustering algorithm [24,25] on emerging patterns to cluster the web users. First, we generate a TF-IDF-based weighted matrix that reflects how important a web page is to a web user. In the matrix construction, a TM matrix of size U by P is defined, where U is the number of web users, P is the number of web pages that occur in the emerging patterns of all web users, and TM_ij is the TF-IDF weighted value of web page j visited by web user i, with i ∈ [1, U] and j ∈ [1, P]. According to Equation (4), TM_ij = (n_ij / Σ_k n_kj) × log(|U| / |{u_i ∈ U : p_j ∈ u_i}|), where n_ij is the number of occurrences of web page j for user i, n_kj is the number of occurrences of web page j for web user k (so the sum runs over all web users), |U| is the number of web users, and |{u_i ∈ U : p_j ∈ u_i}| is the number of web users who accessed web page j. We then execute the K-means algorithm on the generated TM with a specified K value to obtain the clusters of web users.
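A minimal sketch of building the TM matrix of Equation (4); the input shape (visit counts per user) and the variable names are our own illustration. K-means would then be run on the rows of TM (e.g. with scikit-learn's KMeans).

```python
import math

def tfidf_matrix(visit_counts):
    """visit_counts[user][page] = n_ij; returns TM as nested dicts,
    TM[user][page] = (n_ij / sum_k n_kj) * log(|U| / users_accessing_page)."""
    users = list(visit_counts)
    pages = sorted({p for counts in visit_counts.values() for p in counts})
    tm = {}
    for u in users:
        tm[u] = {}
        for p in pages:
            total = sum(visit_counts[v].get(p, 0) for v in users)   # sum_k n_kj
            tf = visit_counts[u].get(p, 0) / total
            df = sum(1 for v in users if p in visit_counts[v])      # users who accessed p
            idf = math.log(len(users) / df)
            tm[u][p] = tf * idf
    return tm
```

Note that a page accessed by every web user receives idf = log(1) = 0, so globally popular pages contribute nothing to the distance between users, which is exactly the noise-suppression effect sought in this paper.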

Annotation of Clusters
After clustering, we label the clusters based on the concept of Folksonomies. Each cluster is defined as one user group, and the web pages in each cluster are defined as online items; we use TF-IDF to calculate the weight of each web page in each cluster. According to Equation (5), TFIDF_ij = (n_ij / Σ_k n_kj) × log(|K| / |{c_i ∈ C : p_j ∈ c_i}|), where n_ij is the number of occurrences of web page j in cluster i, n_kj is the number of occurrences of web page j in cluster k (so the sum runs over all clusters), |K| is the number of clusters, and |{c_i ∈ C : p_j ∈ c_i}| is the number of clusters that contain web page j. We then select, as the label of each cluster, the Top-N web pages with the largest TF-IDF values in that cluster (N can be freely chosen by the user, as long as it is smaller than the number of web pages in the cluster).
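The annotation step can be sketched as follows; the per-cluster page counts are toy values, and the function name is our own.

```python
import math

def label_clusters(page_counts, n_top=2):
    """page_counts[cluster][page] = n_ij; returns the Top-N pages per
    cluster ranked by the cluster-level TF-IDF of Equation (5)."""
    clusters = list(page_counts)
    labels = {}
    for c in clusters:
        scores = {}
        for p, n in page_counts[c].items():
            total = sum(page_counts[k].get(p, 0) for k in clusters)  # sum_k n_kj
            df = sum(1 for k in clusters if p in page_counts[k])     # clusters with p
            scores[p] = (n / total) * math.log(len(clusters) / df)
        labels[c] = sorted(scores, key=scores.get, reverse=True)[:n_top]
    return labels
```

As with the user-level matrix, a page present in every cluster scores zero, so cluster labels are drawn from the pages that distinguish one cluster from the others.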

Experiments and Analysis
Based on the proposed approach presented in this paper, we performed experiments on a set of web log data to evaluate its efficiency.

Experimental Data Set
In the experiments, we used a web log file from the web site www.vtsns.edu.rs as the experimental data.There were 5999 records in the raw file.After data cleaning, there were 1222 records left from 243 user sessions.There were 31 different kinds of web pages accessed during the user sessions in these data.

Experimental Results and Discussions
This section presents several experiments based on our proposed approach. In the first subsection, we analyze the effects of the parameters (MSLWP and ρ) used in our proposed approach. We then implement our approach and show the results of generating emerging patterns, clustering, and annotating the clusters.

Analysis of the Parameters
In the process of obtaining large web pages, we extract the events that satisfy a user-defined Minimum Support of Large Web Pages (MSLWP). Discarding infrequent events reduces the size of the experimental database and the search space and time, while maintaining the accuracy of the whole mining task. To evaluate the effect of the MSLWP parameter, we compared the number of web users who have large web pages, and the number of generated large web pages, for different values of MSLWP. The experimental results are shown in Figures 5 and 6. The bigger the MSLWP, the fewer web users and large web pages are generated. There always exists a value of MSLWP beyond which the numbers of web users and large web pages either do not change at all or change by a negligible amount; this value is selected as an empirically suitable value of the MSLWP parameter for the whole approach. In this experiment, when the value of MSLWP is 0.6, the numbers of web users and large web pages show only a small decline, so we choose this value for MSLWP in the performance tests. In addition, from Figure 5, comparing the number of web users after executing the proposed approach with the number of records after data cleaning makes it clear that our proposed approach can greatly reduce the noise pages of web users in the data set, improving efficiency.
In the process of generating emerging patterns, we look for patterns whose growth rates satisfy a user-defined ρ value, which serves as the criterion for selecting emerging patterns. To evaluate the effect of the ρ parameter, we first set the value of MSLWP to 0.6 (or 60%), and then compared the number of emerging patterns for different values of ρ across web users. The experimental results are shown in Figure 7. The bigger the value of ρ, the fewer emerging patterns are generated for each web user. There always exists a value of ρ beyond which the number of emerging patterns does not change, or changes very little; this value is selected as the minimum growth rate in the experiment. From the result, we can also see that the number of emerging patterns declines only slightly as the value of ρ increases.
In the process of generating emerging patterns, we tried to find patterns where growth rates satisfy a user-defined ρ value.This can be a criterion for the selection of emerging patterns.To evaluate the effect of the ρ parameter, we first defined the value of MSLWP as 0.6 (or 60%), then we compared the number of emerging patterns by changing the values of in different web users.The experimental results are shown in Figure 7.We can see that the bigger the value of ρ , the fewer emerging patterns are generated for each web user.There always exists a value of ρ , and from this value, the number of emerging patterns will not change, or will change very little.This value is always selected for use as the value of the minimum growth rate in the experiment.From the result, we can also see that the number of emerging patterns saw a small decline, compared to increases in the value of ρ .Then, we tried to analyze the relationship between MSLWP and the number of emerging patterns of web users.We defined the value of ρ at 1.5.The result is shown in Figure 8. From the result, we can see that the bigger the MSLWP, the fewer generated emerging patterns in web users.There always exists a value of MSLWP, and from this value, the number of web users and large web pages will not change, or will change very little.This value is always selected for use as the value of MSLWP in the whole approach.In addition, the number of emerging patterns saw a big decline compared to increases in the value of MSLWP.From the experiments, we know that MSLWP is the main parameter to control the number of web users with favorite web pages, and ρ is the minor parameter to calibrate.Then, we tried to analyze the relationship between MSLWP and the number of emerging patterns of web users.We defined the value of ρ at 1.5.The result is shown in Figure 8. 
From the result, we can see that the bigger the MSLWP, the fewer generated emerging patterns in web users.There always exists a value of MSLWP, and from this value, the number of web users and large web pages will not change, or will change very little.This value is always selected for use as the value of MSLWP in the whole approach.In addition, the number of emerging patterns saw a big decline compared to increases in the value of MSLWP.From the experiments, we know that MSLWP is the main parameter to control the number of web users with favorite web pages, and ρ is the minor parameter to calibrate.Then, we tried to analyze the relationship between MSLWP and the number of emerging patterns of web users.We defined the value of ρ at 1.5.The result is shown in Figure 8. From the result, we can see that the bigger the MSLWP, the fewer generated emerging patterns in web users.There always exists a value of MSLWP, and from this value, the number of web users and large web pages will not change, or will change very little.This value is always selected for use as the value of MSLWP in the whole approach.In addition, the number of emerging patterns saw a big decline compared to increases in the value of MSLWP.From the experiments, we know that MSLWP is the main parameter to control the number of web users with favorite web pages, and ρ is the minor parameter to calibrate.and /ispit_raspored_akt.php as their favorite web pages.In contrast, the users in cluster 5 frequently visit web pages /oglasna.phpand /raspored_predavanja.php as their favorite web pages.We can design the web page to recommend some favorite web pages to those web users who are in the same clusters.In this section, we executed our proposed approach on the experimental data set with the parameter MSLWP at 0.6 and ρ at 1.5 and compared it with existing approaches of generation of user interests by Zheng and Zhang [28], Xu et al. [29] and Tchuente et al. 
[1]. After executing the approaches, we used Purity to evaluate the clustering. According to Equation (6), we can calculate the Purity value for different numbers of clusters, where Ω = {w1, w2, ..., wk} is the set of clusters and C = {c1, c2, ..., cj} is the set of classes. In this experiment, the results of clustering are taken as the set of clusters Ω, and the results of annotation are structured as the set of classes C, with different kinds of web pages.
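Purity can be computed directly from the cluster and class memberships: each cluster is matched to the class it overlaps most, and the matches are summed over all clusters and divided by the total number of users. The following sketch uses toy user ids rather than the experimental data:

```python
# A minimal sketch of the Purity measure, as in Equation (6).

def purity(clusters, classes):
    """Purity(Omega, C) = (1/N) * sum_k max_j |w_k intersect c_j|,
    where clusters and classes are collections of user-id sets."""
    n = sum(len(w) for w in clusters)  # total number of users N
    return sum(max(len(w & c) for c in classes) for w in clusters) / n

clusters = [{1, 2, 3}, {4, 5}, {6}]   # Omega = {w1, w2, w3}
classes  = [{1, 2, 6}, {3, 4, 5}]     # C = {c1, c2}
print(purity(clusters, classes))      # (2 + 2 + 1) / 6 = 0.8333...
```

A purity of 1.0 means every cluster contains users of only one class; values closer to 0 indicate heavily mixed clusters.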
The comparison is shown in Figure 9. From the results, we can see that our proposed approach performed better than the three existing approaches, achieving greater purity in the clusters. In particular, our proposed approach stands out when the number of clusters is five; we can therefore also say that five clusters is the most appropriate grouping of the web users in this web log data.
We then executed different clustering algorithms on the data set of user interests generated by the emerging pattern mining technique, with the parameter MSLWP at 0.6 and ρ at 1.5. The result is shown in Figure 10, from which we can see that the K-means & TF-IDF clustering approach performed best among the compared algorithms.
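As a rough illustration of the TF-IDF labeling idea, the following sketch weights each cluster's pages by term frequency times inverse (cluster-level) document frequency, so that pages frequent in one cluster but rare across clusters become the cluster's label. The page lists are hypothetical and the exact weighting used in the paper may differ:

```python
import math

def tfidf_labels(cluster_pages, top_n=1):
    """Return the top_n pages with the highest TF-IDF weight per cluster.
    cluster_pages maps a cluster id to the favorite pages of its users
    (with repetition); clusters play the role of documents."""
    n_clusters = len(cluster_pages)
    # Document frequency: in how many clusters each page occurs.
    df = {}
    for pages in cluster_pages.values():
        for p in set(pages):
            df[p] = df.get(p, 0) + 1
    labels = {}
    for cid, pages in cluster_pages.items():
        tf = {p: pages.count(p) / len(pages) for p in set(pages)}
        weight = {p: tf[p] * math.log(n_clusters / df[p]) for p in tf}
        labels[cid] = sorted(weight, key=weight.get, reverse=True)[:top_n]
    return labels

# Hypothetical favorite-page lists for two of the clusters.
clusters = {
    1: ["/ispit_raspored_akt.php", "/index.php", "/ispit_raspored_akt.php"],
    5: ["/oglasna.php", "/raspored_predavanja.php", "/index.php"],
}
print(tfidf_labels(clusters))
```

Note how /index.php, appearing in both clusters, gets an IDF of log(1) = 0 and can never become a label, which mirrors the paper's goal of discounting overall popular pages.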

Conclusions and Future Work
In this study, we proposed a process for obtaining large web pages from processed web log data, defined these large web pages as vertices, transformed the session-based data set into SPLGs, and then found emerging patterns in these SPLGs. Afterwards, we clustered the web users and labeled the clusters using TF-IDF. The main result of this study is the generation of large web pages and emerging patterns to identify the personal favorite web pages of each user, eliminating noise due to overall popular pages. In the experiments, we evaluated the parameters used in our proposed approach and discussed their effect on generating emerging patterns. The results of the experiments have proven that the exact patterns generated in the emerging-pattern step eliminate the need to consider noise pages; consequently, the efficiency of subsequent mining tasks can be improved.

Figure 1 .
Figure 1.Work flow of our proposed approach.


Figure 2 .
Figure 2. Common log format of web log data.

Algorithm 1. getLWPs (List SD, double MSLWP)
Input: A set of session-based web data SD; a user-specified minimum support MSLWP.
Output: A set of large web pages for each web user.
1. Define tmp_IP = SD1.IP;
2. Define i = 1;
3. Define out_LWP[][];
4. Define N_Session_i = 0; // initialize the number of sessions for web user i
5. for each sequence data SDn in SD
6.   if (SDn.IP == tmp_IP)
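A runnable sketch of the idea behind getLWPs is shown below. It assumes the input is a list of (IP, session pages) pairs and keeps, per user, the pages whose session-level support reaches MSLWP; the data layout and names are assumptions for illustration, not the paper's exact code:

```python
def get_lwps(session_data, mslwp):
    """Sketch of getLWPs: for each user (IP), count in how many of the
    user's sessions each page appears, and keep the pages whose support
    (fraction of that user's sessions) meets MSLWP."""
    sessions_per_user = {}
    page_counts = {}
    for ip, pages in session_data:
        sessions_per_user[ip] = sessions_per_user.get(ip, 0) + 1
        counts = page_counts.setdefault(ip, {})
        for p in set(pages):  # count each page at most once per session
            counts[p] = counts.get(p, 0) + 1
    return {ip: [p for p, c in counts.items()
                 if c / sessions_per_user[ip] >= mslwp]
            for ip, counts in page_counts.items()}

# Hypothetical session-based data: (IP, pages visited in the session).
sd = [("1.2.3.4", ["/a.php", "/b.php"]),
      ("1.2.3.4", ["/a.php"]),
      ("5.6.7.8", ["/b.php"])]
print(get_lwps(sd, mslwp=0.6))
# -> {'1.2.3.4': ['/a.php'], '5.6.7.8': ['/b.php']}
```

For user 1.2.3.4, /a.php appears in 2 of 2 sessions (support 1.0 >= 0.6) and is kept, while /b.php appears in only 1 of 2 (0.5 < 0.6) and is dropped.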


Figure 4 .
Figure 4. Example SPLGs of some web users.


Figure 5 .
Figure 5. Effect of parameter MSLWP related to the number of web users.


Figure 6 .
Figure 6.Effect of parameter MSLWP related to the number of large web pages.

Figure 7 .
Figure 7. Effect of parameter ρ in sample web users.



Figure 9 .
Figure 9.Comparison of generation of user interests with existing approaches.


Figure 10 .
Figure 10.Comparison of clustering with existing approaches.

Table 1 .
Example of session-based data set.

Table 2 .
Candidate large web pages in example data set.

Algorithm 2. checkLinks (List SD, String[][] LWP)
Input: A set of session-based web data SD; a set of large web pages LWP[i][j], where i is the index of web users and j is the index of large web pages for user i.
Output: A set of links with IP address.
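A runnable sketch of the idea behind checkLinks follows. It assumes an SPLG edge connects two large web pages visited consecutively within a session; this edge definition, like the data layout, is an illustrative assumption rather than the paper's exact specification:

```python
def check_links(session_data, lwp):
    """Sketch of checkLinks: for each user, collect the links (ordered
    page pairs) between consecutively visited large web pages within a
    session; these pairs become the edges of the user's SPLG.
    session_data: list of (IP, session pages); lwp: IP -> large pages."""
    links = {}
    for ip, pages in session_data:
        # Keep only this user's large web pages, preserving visit order.
        large = [p for p in pages if p in lwp.get(ip, [])]
        for a, b in zip(large, large[1:]):
            links.setdefault(ip, set()).add((a, b))
    return links

# Hypothetical session and large-web-page table for one user.
sd = [("1.2.3.4", ["/a.php", "/x.php", "/b.php", "/a.php"])]
lwp = {"1.2.3.4": ["/a.php", "/b.php"]}
print(check_links(sd, lwp))
# -> {'1.2.3.4': {('/a.php', '/b.php'), ('/b.php', '/a.php')}}
```

Note that /x.php, not being a large web page for this user, contributes no vertex or edge, so the resulting graph stays simple and noise-free.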

Table 3 .
Sample dataset split into two classes.

Table 6 .
Result of clustering of web users.

Table 7 .
Generation of five clusters with their annotation.

Table 8 .
Result of clusters with labels.
