Pattern-Based and Visual Analytics for Visitor Analysis on Websites

In this paper, we present how we combined visualization and machine learning techniques to provide an analytic tool for web log data. We designed a visualization where advertisers can observe the visits to the different pages of a site, common web analytic measures, and individual user navigation on the site. In this visualization, users can gain insights into the data by looking at key elements of the graph. Additionally, we applied pattern mining techniques to observe common trends in user segments of interest.


Introduction
Analyzing and describing the visitor behavior of an e-commerce site is of interest to web marketing teams, especially when assessing ad campaigns. Marketing teams are interested in quantifying their human visitors and characterizing them, for example, to discover the common elements of visitors who made a conversion (an e-commerce goal). Also, knowledge about visitor behavior on a website can benefit IT personnel, for example, as an instrument to identify bot visitors and combat click fraud.
One resource that can be used to analyze visitor behavior is web interaction data, such as mouse and keyboard usage; however, this type of data is not inherently available and requires collecting information on the visitor side (which is often denied for privacy reasons). Conversely, a web log file is a trace available on any server that hosts a website. The requests made by visitors to the site are recorded in this log file, providing website owners with information about the resources requested, including details about the visitor and the resource itself. Thus, the information from web log files is a valuable asset in the analysis of visitor behavior.
In this work, we present a new visualization that allows web marketing teams to understand how visitors navigate their site, which is key to analyzing the success of a campaign or to redesigning a website. Our visualization shows the user a general view of the visitors, pages, and their interactions, apart from some common web analytic measures. Moreover, we include a detailed view of a visit's navigation path, which allows observing individual behavior. Yet, we do not intend to replace other visualization tools; rather, we aim to complement them with interesting features. Furthermore, in order to help the visual model end-user understand what captures a type of visitor, we apply pattern mining techniques, in particular, contrast patterns. In this case, we are interested in seeing what separates two segments of visitors, so we extract contrast patterns from two classes. This yields patterns which outline what differentiates one segment from the other, enriching our visual model. Features can be used to define classes and form different groups of users that are of interest to characterize. For example, we can apply pattern mining to characterize the traffic that comes from a country of interest against that of the others, or to discover what characterizes visits that yield a conversion. As use case scenarios, we present the characterization of human versus bot visitors, and an example of country segmentation.
A summary of our contributions is as follows:
• The design of an interactive visualization that allows users to have a comprehensive snapshot of visitors on a website, but also enables a fine-grained analysis by means of navigation graphs.

• The application of pattern mining techniques to extract patterns that characterize traffic segments of interest. The obtained patterns can aid in the selection of groups of users whose behavior would be interesting to observe. For example, we have been able to discover patterns that capture groups of interest, including (some types of) human and bot traffic. This is of paramount importance, both because marketing can now give attention to clean and crisp segments of traffic, and because IT may block unwanted traffic using, for example, firewall rules.
The paper is organized as follows. First, in Section 2, we present a brief analysis of web analytic tools and summarize research efforts in web visualization. Section 3 describes the data and the pre-processing steps required for the visualization and pattern mining components of our approach. In Section 4, we describe our visualization design. Section 5 describes how this visualization can be enriched by the use of machine learning, specifically of pattern mining techniques. Finally, in Section 6, we discuss the applications of our work and possible extensions.

Related Work
There are many tools available for measuring digital content. Nevertheless, they all display web analytics in a similar manner: goal reports, conversions, and site performance are usually still displayed as tables, big score counters, or line plots. Next, we mention five popular web analytic tools and provide a brief discussion of them. Then, we mention the research proposals related to the visualization of web behavior.
Google Analytics (GA) is Google's main product for getting reports and analyzing the traffic on a website. It can be configured to import and track ad campaigns from AdWords [1] and DoubleClick [2] (Google's web advertising products). It allows segmenting the traffic from many sources and applying several filters. Also, it has the advantage of being widely known by marketing experts and people from other domain areas.
ComScore [3] is an American company which provides services to evaluate media across different platforms; it has a big presence not only on the Internet but also in the TV industry, newspapers, health care, and others. Unfortunately, we could not analyze their tools further because they are paid services. Despite this fact, comScore has been very open with their current research and publishes reports periodically [4,5].
KissMetrics [6] provides analytic reports and email tools to increase user engagement. They provide a more tailored experience, focusing on consulting and teaching their customers on how to configure the tool and interpret the results. It also allows segmenting the traffic using filters. KissMetrics is also a paid service.
Matomo [7], formerly named Piwik, is one of the most popular and robust tools. It can be self-hosted or used as Software as a Service (SaaS) in the cloud. Matomo is a company that focuses on giving its users complete control of everything, meaning that you get full reports (no data sampling, in contrast with GA). It is developed in PHP and also provides an HTTP API for consulting reports such as visitor information, goal and page performance, user segments, and live visit information, among others.
Although the Open Web Analytics (OWA) [8] project has not published any new version since 2014, it is still popular in legacy websites. It was integrated into former versions of Content Management Systems (CMS) like WordPress or MediaWiki. It can be tested only by installing it on a personal server. One of the features included is a heatmap that shows the hottest (most clicked) sections of a website page, which can be used to optimize the placement of page information.
As mentioned before, many of the web analytics reports provided by the solutions described have not changed significantly in the past few years. Such reports do not show, for example, how individual visits interact with the website pages; most often, results are aggregated and spread across multiple reports. Figure 1 shows a common report to display goal performance and conversion counters. Although these kinds of reports provide a quick way to compare time series, they could be improved, for example, by integrating information and adding interactive controls. Regarding visitor navigation, the common report is a table of sequential actions or an aggregated traffic flow (funnels). However, marketers sometimes need to explore individual traces of navigation from their users. Information from individual traces, when available, is represented through simple text reports, as shown in Figure 2. Few platforms have implemented features for automatic customer segmentation. Typically, an expert manually creates filters based on arbitrary parameters such as the visitor's country, user type, or language, among others; although this is not a problem per se, it is not trivial to select a relevant segment. We believe that machine learning can improve this process by suggesting such filtering parameters through the use of pattern mining [9][10][11]. Such patterns represent true segments found in the data itself. One disadvantage of the enterprise solutions (GA, comScore, and KissMetrics) is that they can only be used as a hosted service, which can be inconvenient for companies that need an on-premise solution or must obey certain legal regulations about storing customer data. Another big concern about these solutions is that, usually, the user does not own the data; instead, the only way to access the information is through a third party, unlike open-source solutions such as Matomo, where the user owns 100% of the data.
Apart from the previously mentioned tools, there have been research efforts towards designing new visualizations. Maps and network graphs are common visualizations, given that traffic comes from any part of the world. For example, work by Akamai [12] provides an interactive map of web attacks in real time. Kaspersky [13] offers a similar tool but with a 3D perspective and more integrated features. Logstalgia [14] is another interesting tool to visualize HTTP server logs; inspired by Atari's Pong game, it renders incoming requests as a swarm of pong balls.
In the field of credit fraud, a new method is being employed: the use of graph-based databases. Neo4j [15] and IBM Graph [16] are examples of tools for such purposes. As described in [15,16], the motivation is to find cycles inside graphs, which commonly represent a kind of fraud. Neural networks can also be used as visualization tools: Atienza et al. [17] used Self-Organizing Maps (SOM) to find web traffic attacks.
Chi [18] surveys a couple of visualization tools developed at the User Interface Research Group at Xerox PARC. The surveyed work used visualization tools to improve web usability, to find and predict browsing patterns, and to show web structure and its evolution. Like us, the authors of [18] also implemented a graph-inspired visualization.
Another graph-inspired visualization is Hviz [19], which was used successfully by the InfoSec Institute to explore and summarize HTTP requests to find common malware like Zeus [20], and also as a tool for forensic analysis [21]. Hviz deserves mention for its versatile use cases and also for its heuristic to aggregate HTTP requests using frequent item mining. Hviz is related to ReSurf [22], using it as a benchmark to improve browsing reconstruction. Another work in this field is ClickMiner [23], which also reconstructs browsing paths and provides a tool to visualize them; this is analogous to the Click Path feature we propose, but the context is different: they analyze traffic from a single machine, client by client, whereas ours is server-based and does not need access to individual computers. Blue et al. presented NetGrok [24], which uses a combination of graphs and treemaps to display bandwidth usage from IP hosts in real time. Although they used the tool successfully to detect anomalies, the scope of their analysis does not match ours; they use low-level packet capturing, whereas we use server logs, closer to the web analytics resources available.
We propose new ways to display website traffic by using an interactive tool that provides several ways to arrange visits, conversions, user behavior, click paths, page views, and filtering options. All of them, combined with pattern recognition [9], could help find clusters for new market niches, discover unknown visitor segments, or improve segment analysis through the patterns found in the data. We integrated into the visualization tool a way to introduce a pattern and use it as a query to filter the visits. This allows the expert to create segments automatically (after introducing the pattern) or at least gain some insight into which groups of visitors share common properties or behavior.
Most of these reports and graphs are based on user sessions, which are usually identified by cookies that link a requested resource to a particular visitor. In the case of web log files, cookie information is not available, so other methods are applied to discover user sessions. Such is the case of the work in [25], which assigns requests to sessions following these heuristics, in order: (1) same IP address and user agent; (2) same user agent and common domain (obtained by reverse DNS lookup of the IP address); (3) same user agent and common IP prefix; and (4) same IP address and different user agent. More commonly, a session is defined by joining requests from the same IP address and user agent, as is the case in [26,27]. Additionally, all approaches define a session timeout, typically set to 30 min.
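As an illustration, the common IP + user-agent heuristic with a 30-min timeout can be sketched as follows (a minimal sketch; the request dictionaries and their field names are our own assumptions, not the format of any particular tool):

```python
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)

def sessionize(requests):
    """Group requests into sessions by (IP, user agent) with a 30-minute
    inactivity timeout. Each request is a dict with 'ip', 'useragent',
    and a datetime 'date'; requests are assumed sorted by time."""
    sessions = []   # list of sessions, each a list of requests
    last_seen = {}  # (ip, useragent) -> (last timestamp, session index)
    for req in requests:
        key = (req['ip'], req['useragent'])
        if key in last_seen and req['date'] - last_seen[key][0] <= SESSION_TIMEOUT:
            idx = last_seen[key][1]          # continue the open session
            sessions[idx].append(req)
        else:
            sessions.append([req])           # timeout or new key: new session
            idx = len(sessions) - 1
        last_seen[key] = (req['date'], idx)
    return sessions

base = datetime(2018, 3, 12, 14, 0)
reqs = [
    {'ip': '203.0.113.7', 'useragent': 'UA1', 'date': base},
    {'ip': '203.0.113.7', 'useragent': 'UA1', 'date': base + timedelta(minutes=10)},
    {'ip': '203.0.113.7', 'useragent': 'UA1', 'date': base + timedelta(minutes=50)},
    {'ip': '198.51.100.2', 'useragent': 'UA1', 'date': base + timedelta(minutes=55)},
]
sessions = sessionize(reqs)
```

Here the third request starts a new session because 40 min elapsed since the previous request from the same IP and user agent.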

Dataset
Our visualization has been designed to work with log files from a web server. Unlike most approaches, we do not use data related to the interaction while browsing a page (mouse usage, supported browser features, etc.); we work solely with data directly available from a standard log file. Using client-side data, logged directly through the client's browser (e.g., by JavaScript tracking code), may lead to a richer feature space. However, we have found that server data on its own is enough to provide a general idea of the visitors' behavior on the site. We aim to cover the needs of companies that may be reluctant to add logging scripts due to privacy concerns. A web log file is a trace already available on the server, and it does not require running any additional script.
We analyzed log data recorded by web servers from a commercial website. In total, we examined the log files of one month of interest. The data was collected by company experts, aiming to provide only navigation requests generated by allegedly human visitors. These logs use an extended version of the NCSA Common Log Format, known as the Combined Log Format [28]. Table 1 shows the fields that a web server records in its log files when the Combined Log Format is used. By parsing the log file, we were able to extract these fields for each line in the file (see Figure 3 for a sample line and its fields).
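The parsing step can be sketched with a regular expression over the Combined Log Format (a minimal sketch; the group names mirror the fields in Table 1, and the sample line is fabricated):

```python
import re

# One named group per Combined Log Format field (see Table 1).
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) (?P<rfc931>\S+) (?P<authuser>\S+) '
    r'\[(?P<date>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<useragent>[^"]*)"'
)

def parse_log_line(line):
    """Parse one Combined Log Format entry into a dict of raw fields."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

line = ('203.0.113.7 - - [12/Mar/2018:14:32:10 +0000] '
        '"GET /products?id=3 HTTP/1.1" 200 5120 '
        '"https://example.com/" "Mozilla/5.0"')
entry = parse_log_line(line)
```

Each dict produced this way is the raw input to the feature-extraction steps described next.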

Table 1. Fields recorded by a web server using the Combined Log Format.

Field      Description
rfc931     The remote logname of the user.
authuser   The username that has been used for authentication.
date       Date and time of the request.
request    Resource requested and HTTP version.
status     The HTTP status code returned to the client.
bytes      The content-length of the document transferred.
referrer   The URL which linked the user to the site.
useragent  The Web browser and platform used by the visitor.

Next, we performed a series of pre-processing steps in order to obtain an extended representation of each log entry. Given that log files are not intended to be read by the common user, the information (fields) they provide, while valuable, may not be insightful to web marketing teams. For this reason, we extract features from the log fields and create objects that represent log entries. From these objects, it is possible to obtain contrast patterns that describe the characteristics of a group of users, as described in Section 5. In the next paragraphs, we explain the pre-processing steps we followed to extract the feature vector used in our work.
The first field available is the IP address of the visitor. From this field, we can extract geolocation and contextual features. We used GeoLite databases [29] to extract the City, Country, Subdivision and Organization associated with the IP address. Using these geolocation features allows for a more generalized analysis. For instance, using the log field raw values, it is not possible to identify two visitors from the same city but with different IP addresses; whereas in the proposed feature space, they will have a common value. Additionally, the extracted features are more interpretable. An IP address might not tell much to a user; however, knowing the location or the organization of the visitors provides a better idea of their profile.
We skip rfc931 and authuser because they had the same value in all the log entries in our data. Then, we process the date field. As can be seen in Figure 3, the date is logged using the format [dd/MMM/yyyy:hh:mm:ss +-hhmm]. This format is not convenient for data mining because it is very specific and does not allow generalization. Instead, we extracted two features from this string: the hour (rounded up if the time is closer to the next hour) and the day of the week. We do not take minutes into account because they are too specific for our purposes: while hours form only 24 different groups, hour-and-minute precision would allow too many groups to be created. If more precision is desired, instead of taking the time including minutes directly, we recommend binning times of the day, in order to have more than 24 groups but not as many as 1440. Next, we processed the request to extract the URL and the number of parameters in the URI query, again, with the purpose of getting a more general feature vector.
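The date-field processing above can be sketched as follows (a minimal sketch; the rounding and binning behavior follows the description in the text, and the function name is our own):

```python
from datetime import datetime, timedelta

def date_features(raw, bins_per_day=24):
    """Extract (time bin, weekday) from a Combined Log Format timestamp.
    With the default bins_per_day=24, the bin is the hour rounded to the
    nearest full hour; a larger value gives a finer (but still coarse)
    binning, avoiding the 1440 groups of minute precision."""
    # e.g. "12/Mar/2018:14:32:10 +0000" -> keep only the local date-time part
    dt = datetime.strptime(raw.split()[0], "%d/%b/%Y:%H:%M:%S")
    bin_width = timedelta(days=1) / bins_per_day
    seconds = dt.hour * 3600 + dt.minute * 60 + dt.second
    time_bin = round(seconds / bin_width.total_seconds()) % bins_per_day
    weekday = dt.strftime("%A")
    return time_bin, weekday
```

For instance, 14:32 rounds up to hour 15, while 23:40 wraps around to bin 0.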
We maintained the next three fields (status, bytes, and referrer) as features. Finally, using UAParser [30], we obtained, from the useragent, the operating system, browser, and device used by the host. In total, we have a set of 14 features, which are shown in Table 2. These features are used by our pattern mining approach.

Table 2. Feature Set. The first column lists the features that will be used in our analysis. The second column lists the name used to identify the feature in the patterns. The third column specifies the origin of the feature in the log file.

Additionally, we processed the logs to obtain user visits, which are the main element in our visualization. This processing was made by importing the data into Matomo. Visits include pages requested by the same user (identified by a device fingerprint). When a user requests a page more than 30 min after his last requested page, it is considered a new visit. This allows us to create an extended vector with information regarding the whole visit, instead of one resource request. Table 3 describes extra features that were extracted after processing the data in Matomo. Please note that this is not an exhaustive list of Matomo's database structure. Features like the number of clicks, plugin flags, and page generation time, among others, are only available when page tracking is performed directly through Matomo tracking code (on the client's side); since we imported server logs into Matomo, this data is not available, as it is not possible to infer it from the logs. Additionally, with this processing, we eliminate requests to internal resources which are not required in our current analysis (e.g., JavaScript code, static image/CSS information, etc.); alternatively, a list of resources of interest can be obtained from the site owner and used to filter the data.

Visual Model
In order to aid the task of understanding website traffic, we have designed an interactive visualization tool that provides visual cues indicating the relationship between visitors and pages, along with analytic metrics. Following the analysis, certain marketing strategies can be proposed to improve the page content or to take advantage of the most visited pages, like including ads in them or adding appealing elements to the pages. The site to be analyzed is not one with millions of visits per day; in Section 6 we specify the design changes we suggest for the analysis of sites with a high volume of visitors.
Specifically, we have two visual elements dedicated to this purpose:
Visits view This visual element allows exploring the relation between pages and visits, while allowing observation of analytic metrics, such as page views, bounce rate, and visit duration. The user can also highlight a single page or visit to observe its relations and metrics.
Navigation path This visual element allows exploring individual visits in a graph structure.
In the following sections, we describe the design of each visual element and present examples of how they aid the goal of understanding website traffic.

Visits View
Page and visit reports are common features found in any web analytics platform; often, those reports are provided as tables or line plots (as shown in Figure 1). However, by using a different visualization, we can combine those reports into a single one. Our proposed visualization (Figure 4) displays information about both the pages in the site and the visitors to these pages. The design is based on several concentric circles, formed by three different types of nodes, which connect to each other according to visit and page relationships. We start by describing the three types of nodes:
Visit nodes Visit nodes form the outermost concentric semi-circle; these nodes are represented as circles with a country flag image. Each node represents a visit to the website. The flag indicates from which country the visit came.
Page nodes Page nodes form several concentric semi-circles; these are the blue circles. Each node represents a unique web page of the website.
Objective nodes Objective nodes are pictured as stars. These nodes are web pages that are considered goals of the business, for example: sign-up pages, landing pages, checkout pages, among others. They are distributed along with the page nodes.
As we can see, nodes are arranged in two groups. The first group contains all the visit nodes and is distributed in the outermost semi-circle. The second group contains all the page and objective nodes and is distributed in several semi-circles. The nodes are ordered within the semi-circles. Next, we explain how this ordering was defined, according to visit and page characteristics, and the advantages of this design.
Visit nodes are grouped according to the country of origin. This makes it easy to identify the countries that have more visits, as large groups of nodes with the same flag are easy to spot. An alternative ordering is to sort the nodes according to the visit duration; in this case, the user can look at the nodes at the far right (or left) to observe which visits were the longest (or shortest) and whether they had any similarities in their countries of origin. We kept the ordering according to the country of origin for two reasons: (1) it was preferred by our users, and (2) the visit duration is reflected in the connection between the nodes, as will be explained later in this section.
Page and objective nodes are distributed in k semi-circles, also called levels. Levels are ordered from the center to the outermost level in ascending order. Once the parameter k is established, the elements are distributed into the k levels in the following way: k segments of size max(metric)/k are selected, starting from min(metric). Then, we sort all the nodes into their respective level, in ascending order from left to right. The index page, or home page of the website, is excluded from these computations because its position is fixed at the center of the view.
A metric must be selected to designate the order of the nodes. The selection of this metric determines the sort of information that can be quickly grasped just by looking at the position of the nodes. After continuous iterations with the final users of our tool, the selected metric is page views. Thus, pages with fewer page views are positioned at the leftmost side and increasingly positioned to the right. Pages on the first level (shortest radius) are the ones with fewer page views, whereas pages on the k-th level have the highest page views. This allows the marketing team to quickly identify the most visited pages, and see if their starred pages are truly visited more. Additionally, in the center of the node, the metric is displayed.
The selection of k can reflect a certain Key Performance Indicator (KPI). For example, the business could decide to use k = 5 and define the goal KPI as having all the objective nodes at the fourth level; if this goal is not achieved it is an indicator of bad performance of the business goals. Ideally, objective nodes should appear in the outermost level.
As an example, we have Figures 5 and 6. Page and objective nodes are placed in their corresponding level, depending on how many page views they have. In this example, k = 3, the most visited page had 45 page views, and the least visited page had only ten views. Thus, we end up with three segments of size 15 (SegmentSize = 45/3 = 15), starting at 10: [min(pageviews), 25], (25, 40], and (40, max(pageviews)]. In this case, we have 14 pages in the first level, eight in the second level, and two in the third level. Figure 5 shows a scenario where the objective nodes are among the most viewed pages, which indicates an ideal business scenario. On the contrary, Figure 6 shows a scenario where the objective nodes lie on the innermost semi-circle, i.e., they are among the least viewed pages. This indicates a bad business scenario. Nevertheless, our visualization is useful to spot new opportunities by looking at page nodes in the last semi-circle. These pages are currently not considered as objectives, but they have a lot of views. Thus, we can place some call-to-action elements in them, such as advertising, banners, and promotions, among others.
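The level assignment described above can be sketched as follows (a minimal sketch; the function and variable names are ours, and page views stand in for the configurable metric):

```python
def assign_levels(pageviews, k=3):
    """Assign each page to one of k levels by its page-view count.
    Segments have size max(views)/k, starting from min(views), as
    described in the text; pageviews maps page -> view count."""
    views = pageviews.values()
    lo, seg = min(views), max(views) / k
    levels = {}
    for page, v in pageviews.items():
        # find the first segment boundary lo + level*seg that v does not exceed
        level = 1
        while v > lo + level * seg and level < k:
            level += 1
        levels[page] = level
    return levels

# Hypothetical counts matching the example: min = 10, max = 45, k = 3,
# so the segments are [10, 25], (25, 40], and (40, 45].
pv = {'/a': 10, '/b': 30, '/c': 45, '/d': 25}
levels = assign_levels(pv, k=3)
```

With these counts, '/a' and '/d' fall in level 1, '/b' in level 2, and '/c' in level 3.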
The final element of this visual component is the connections between the nodes. A connection between a visit node and a page or objective node indicates that such page was accessed by that visitor. The width of the connection represents a selected metric; in this case, we chose the inverse of the visit duration. The less time spent visiting the website, the thicker the connection. On top of the visit nodes, the metric used for the width of the connection is displayed; in this case, how many seconds the visit lasted. This visual component not only displays information, but it is also interactive. The user can click on nodes to obtain relevant information about the node. When selecting a visit node, the connections and nodes not related to this visit are dimmed and the sidebar is populated with information about the selected visit, including a button that allows access to the Navigation Path (Section 4.2). This way, the user sees all the pages related to the selected visit and information such as visit duration and location. This behavior is shown in Figure 7. When selecting a page or objective node, the connections and nodes not related to this page are dimmed and the sidebar is populated with information about the selected page, including the URL, number of page views, and average time on the page. This behavior can be seen in Figure 8. If a star is seen in the inner levels of the semi-circle, the user can click on it to show the details of the page and devise new strategies to bring visitors to this page. Likewise, a user can identify a page previously thought unimportant to be, in fact, one of the most visited; in this case, the team can add to this page information that they want visitors to see.
We have shown that by using a single visual report, we can quickly observe page views, with easy access to detail, as desired by the marketing team, including the performance of goal pages, plus the country the visit comes from. Additionally, instead of having text indicators, users can quickly glance at the relationship of visitors and pages, including the visit duration. The selection of these metrics was the result of a continuous feedback process with the marketing team of the analyzed e-commerce site. Choosing page views allows an easy identification of the most popular pages. Choosing the visit duration allows an easy identification of visitors who spend more time on the site and are probably interested in it (their navigation path is interesting to analyze), whereas visitors with a short visit duration might correspond to bots or uninterested visitors.

Navigation Path (F4)
In many platforms, the user navigation path is commonly represented as tables of sequential actions or as aggregated traffic flow (Figure 2). We believe this could be improved using a visual representation. In this section, we describe our proposed visual representation, inspired by network diagrams. Figure 9 shows the proposed visualization for the navigation path. It includes a series of page nodes and objective nodes, connected when the visitor navigated between the corresponding pages. Each node represents a page the user visited; as before, blue circles represent common pages whereas stars represent objective pages. The user can see the Universal Resource Identifier (URI) of each node.
Additionally, each connection has two items of information. The first element shows the order in the chain of viewed pages and the second element is the time spent by the user on the page. Our visualization, unlike that presented in Figure 2, gives a quick understanding of non-linear navigation. For example, the presence of loops and recurrent pages is now straightforward to appreciate. Additionally, our click path visualization enables identifying at which point of the visit the visitor landed on a certain objective page, or whether, in fact, the visitor never reached an objective. However, a limitation of obtaining this path directly from log files arises when the user clicks the browser back button, as there will be no entry in the log for this event.
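The edges of such a navigation path can be derived from a visit's ordered requests as follows (a minimal sketch; the page names and timestamps are fabricated for illustration):

```python
def navigation_path(actions):
    """Build navigation-path edges from a visit's ordered page requests.
    actions is a list of (page, timestamp-in-seconds) pairs; each edge
    carries its order in the click chain and the time spent on the
    source page (unknown for the last page, as the log ends there)."""
    edges = []
    for i in range(len(actions) - 1):
        page, t = actions[i]
        next_page, next_t = actions[i + 1]
        edges.append({'from': page, 'to': next_page,
                      'order': i + 1, 'seconds': next_t - t})
    return edges

# A visit with a loop: the home page is revisited mid-navigation.
actions = [('/home', 0), ('/products', 12), ('/home', 40), ('/checkout', 55)]
edges = navigation_path(actions)
```

Rendering these edges as a graph makes the loop through '/home' immediately visible, which a sequential table would obscure.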

Traffic Segmentation and Characterization
One of the first tasks for analyzing an audience is to create segments, which involves performing a filter process given a series of conditions: e.g., young people (18 < age < 24); men living in New York City (gender = male AND location = NYC); people coming from social networks (referrer IN [Facebook, Twitter, Snapchat]), etc. So, it is very important to have an easy way to perform such filters. In this section we describe how filtering is integrated into our visualization and how a pattern mining approach can aid the characterization of segments. Finally, we present two case study scenarios where we applied our approach.

Filtering Visits View
We incorporated a query console to allow users to select a group of nodes that meet certain criteria (notice the console at the top in Figure 4). The query console is an interface to the open-source project Cytoscape [31], which is responsible for the actual query expression evaluation. The queries that can be performed with the tool support the basic logic operators ∧, ∨, ¬ (AND, OR, NOT), relational operators (=, >, <, ≤, ≥), string matching, and others. The patterns follow a syntax very similar to the Disjunctive Normal Form (DNF, [32]). All capabilities are provided by Cytoscape and the full specification can be found in the documentation [31]. For the visual model, we import the log files into Matomo, so we are able to query visits and pages based on their database structure [33]. The format follows this grammar: group[attr OP val], group ∈ {node, edge, .className}, where:
• className represents a custom class assigned to particular data. In our case, we have three classes: visit, page, and objective. So instead of using node or edge, we use such classes, which are less abstract. For example, .page[attr OP val] will target only web page attributes; similarly, .visit[attr OP val] will query the visit information.
The previous querying system enables us to enter patterns obtained with machine learning algorithms, such as the ones presented in the next section. In Table 4, we provide examples of how to use the query system to select a segment of traffic. In addition, in Figure 10 we can observe an example of the Visits view with a filter applied, highlighting the group of nodes that meet the specified criteria.

Pattern Mining Algorithm
A pattern mining approach can aid traffic segmentation in two aspects: by helping to discover interesting segments and by characterizing them. Contrast pattern-based classifiers are an important family of classifiers that are both understandable and accurate [9]. A pattern is an expression, defined in a certain language, that describes a collection of objects; for example, a conjunction of conditions over visit attributes can describe bot behavior.
A contrast pattern is a pattern that appears significantly more often in one class than in the other classes [9,34-36]. These patterns describe common characteristics of the objects in a class; we propose to define the classes based on the segments of interest. Contrast pattern-based classifiers have been used in several real-world applications, such as characterizing subtypes of leukemia, classifying spatial and image data, deriving structural alerts for computational toxicology, and predicting heart disease, where they have shown good classification results [9,36-40].
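The notion of a contrast pattern can be sketched through its per-class support: the fraction of objects in each class that satisfy the pattern. The data and attribute names below are illustrative only.

```python
# Two illustrative classes of visits (data and field names are assumptions).
human = [{"country": "Mexico", "hour": 14}, {"country": "Mexico", "hour": 9}]
bot   = [{"country": "China",  "hour": 3}]

def support(pattern, objects):
    """Fraction of objects satisfying every condition of the pattern."""
    hits = sum(all(cond(o) for cond in pattern) for o in objects)
    return hits / len(objects)

# Candidate pattern: country = "Mexico" AND hour > 10
pattern = [lambda o: o["country"] == "Mexico", lambda o: o["hour"] > 10]

print(support(pattern, human))  # 0.5
print(support(pattern, bot))    # 0.0 -> a contrast pattern for the human class
```

A pattern with high support in one class and (near) zero support in the other contrasts the two classes, which is exactly the property exploited in the case studies below.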
Mining contrast patterns is challenging because of the exponential number of candidate patterns and the resulting high computational cost [9,41]. Also, some algorithms for mining contrast patterns require an a priori global discretization of the features, which can cause information loss. For these reasons, pattern mining approaches that avoid a global discretization step, keep the computational cost low, and obtain a small collection of high-quality patterns have received special attention from the community; contrast pattern miners based on decision trees are an example [41].
Contrast pattern miners based on decision trees typically build several decision trees and extract patterns from them. For each tree, patterns are extracted from the paths from the root node to the leaves, and each extracted pattern is assigned the class with the highest support [36,40,41].
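The path-to-pattern extraction can be sketched with scikit-learn: each root-to-leaf path of a fitted tree yields one conjunctive pattern, labelled with the leaf's majority class. This is a simplified illustration of the idea, not the miner used in the paper; the toy data and feature names are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy training data (assumed features): [hour, is_known_crawler]
X = np.array([[8, 0], [14, 0], [3, 1], [2, 1]])
y = np.array([0, 0, 1, 1])  # 0 = human, 1 = bot
names = ["hour", "is_known_crawler"]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y).tree_

def extract(node=0, conds=()):
    """Walk the tree; every leaf becomes one pattern (a conjunction)."""
    if tree.children_left[node] == -1:  # leaf node
        label = "bot" if np.argmax(tree.value[node]) else "human"
        return [(" AND ".join(conds), label)]
    f, t = names[tree.feature[node]], tree.threshold[node]
    return (extract(tree.children_left[node],  conds + (f"{f} <= {t:.2f}",))
          + extract(tree.children_right[node], conds + (f"{f} > {t:.2f}",)))

for pattern, label in extract():
    print(pattern, "->", label)
```

In a tree-based miner, this extraction is repeated over many trees and the resulting patterns are filtered down to a small, high-quality set.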
Among the contrast pattern miners based on decision trees, the Random Forest miner has shown better classification results and better diversity of high-quality patterns than other decision tree-based approaches [41]. Moreover, the Random Forest miner obtains better classification results when the Hellinger distance [42] is used, instead of the information gain measure [43], for evaluating each binary candidate split at each tree level. This is because the Hellinger distance is less affected by the class imbalance problem than the information gain measure [36].
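For two classes, the Hellinger distance of a binary split can be computed from the fraction of each class sent to one child. A minimal sketch, under the common two-class formulation (tpr and fpr denote the per-class fractions routed to the left child):

```python
from math import sqrt

def hellinger(tpr, fpr):
    """Hellinger distance between the per-class routing distributions
    of a binary split; higher means the split separates the classes better."""
    return sqrt((sqrt(tpr) - sqrt(fpr)) ** 2
                + (sqrt(1 - tpr) - sqrt(1 - fpr)) ** 2)

print(hellinger(0.9, 0.1))  # well-separating split, high score
print(hellinger(0.5, 0.5))  # uninformative split -> 0.0
```

Because the score depends only on the per-class fractions, not on the absolute class sizes, a rare class weighs as much as a frequent one, which is why this criterion behaves well under class imbalance.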
For all of the above reasons, in this paper we selected the Random Forest miner together with the Hellinger distance as the algorithm for mining contrast patterns. We also used the pattern filtering method introduced in [36] to obtain a small collection of high-quality patterns describing the problem's classes, in this case the segments of interest. These patterns can guide the analysis of the data by pointing to interesting visitor behaviors, steering the user towards creating queries to filter the data instead of guessing which subsets of visitors might be interesting. Additionally, we have noticed that patterns obtained by contrasting two classes can point to other segments that could interest analysts. In the next section, we present a case study where we observe this phenomenon.

Case Study Scenario: Humans vs. Bots
Often, the first step of visitor analysis on websites is a data cleaning process that eliminates bot visits from the analysis. We present the bot versus human analysis as an example of how pattern mining can be applied to characterize two visitor segments. We used data mining techniques to extract patterns from which marketing teams can observe interesting trends for both types of visitors.
We used the dataset described in Section 3. For this scenario, the pattern mining algorithm works with two classes, human and bot, which were obtained using a one-class classifier (BaggingRandomMiner [44]). Next, we show the extracted patterns separating human from bot traffic on a specific day of interest. Note that the analysis does not have to cover a single day; it can cover any desired period. For example, if a marketing campaign lasted four days, the patterns can be extracted from that time span. Alternatively, it may be of interest to feed the system with visits from certain hours; as will be seen shortly, patterns may outline an interesting subset of visits.
Patterns with high support for one class are of interest because they show the general behavior of that class. Patterns without necessarily high support but with zero support for the other class are also interesting because they reveal subgroups of visitors. First, we have a pair of related patterns that characterize the normal class: (A) country = "Mexico" ∧ hour ≤ 10 ∧ agentOS = "Other" (B) country = "Mexico" ∧ hour > 10. Both patterns have a support of zero for the anomalous class, which means that they do not cover bot visitors. Pattern A has a support of 0.1107 and pattern B of 0.8574. Given the nature of these particular patterns (the intersection of the visitors covered by both is empty), together they account for 96.81% of the human visitors. We can conclude that, for this day, at least 96% of the human visitors came from Mexico, and many more visits (85.74%) were registered after 10 a.m. than before (11.07%). This indicates that only a few users visit the page early in the morning; most visit after 10 a.m. Another pattern, C, also has zero support for the bot class and a high support, 0.9618, for the human class. It states that a high percentage of the users (96.18%) come from Mexico, as we already know from patterns A and B, but it is interesting because it adds that these visitors arrive through a referrer.
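The combined coverage of patterns A and B follows from simple arithmetic: since no human visit satisfies both patterns (their covered sets are disjoint), their supports add up.

```python
# Coverage arithmetic for the disjoint patterns A and B on the human class.
support_A = 0.1107  # country = "Mexico" AND hour <= 10 AND agentOS = "Other"
support_B = 0.8574  # country = "Mexico" AND hour > 10
combined = support_A + support_B
print(f"{combined:.2%} of human visitors covered")  # 96.81% of human visitors covered
```

Had the patterns overlapped, the sum would overcount and the intersection's support would have to be subtracted (inclusion-exclusion).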
We found that our one-class classifier was able to find visits from two known crawlers, even though the provided log files were pre-selected as human behavior. Next, we show some patterns with zero support for the human class, which means that only bots follow them: (D) agentBrowser = "Sogouwebspider" (E) country = "Mexico" ∧ agentOS = "Other" ∧ agentBrowser = "BingPreview" (F) city = "Redmond" (G) city = "Redmond" ∧ subdivision = "Beijing" ∧ url = "robots.txt" (H) city = "Sunnyvale" ∧ subdivision = "Beijing" ∧ referer = "?". On the analyzed day, the Sogou web spider (a Chinese search engine crawler) and Bing Preview (used to generate page snapshots) appeared in patterns D and E, with supports of 0.2372 and 0.2146, respectively. Other patterns describing bot visitors, like F, G, and H, leaned towards geolocation features, indicating that a fraction of the bots come from Beijing (pattern G, support 0.2514), Redmond (pattern F, support 0.2571), and Sunnyvale (pattern H, support 0.1356).
In this case, we took the class label provided by the one-class classifier to extract the patterns, but patterns can also be extracted using a different segmentation. Furthermore, the obtained patterns can point towards a new segmentation. For example, pattern C states that most human users in the time of interest arrived through a referrer, which may motivate characterizing users based on their referrer. Another interesting analysis, suggested by patterns A and B, would be to characterize traffic from before and after 10 a.m.

Case Study Scenario: Segmentation by Country
Next, we applied the proposed approach to contrast visitors from Mexico against those from Asia. Again, we aim to extract patterns from which marketing teams can observe interesting trends for both types of visitors.
For this scenario, the pattern mining algorithm works with two classes: Mexico and Asia. The first step is to label the data; we obtain the class labels by manually labelling each visit according to its associated country. If a visit does not belong to either class, it is left out of the analysis. Next, we present the results in the form of a summary of distinctive traits.
Again, we analyzed the patterns with zero support for the opposite class, given that they show characteristics exclusive to the class of interest. In the Mexican class, we observed two interesting patterns related to visit duration. They tell us that half of the visits from Mexico last longer than 100 s and 13 percent last less than five seconds; from this we can infer that the remaining visitors (37 percent) stay between 5 and 100 s. A possible business goal could be to try to engage the 13% of visitors who stay on the site for less than five seconds.
Also, since the patterns have zero support for the Asian class, we can further infer that Asian visitors stay on the site at most 100 s. Again, this can be actionable information: if the business is interested in expanding to a new market, it can aim to increase the duration of Asian visits.

Further Work
New visualization models, complemented with a pattern mining approach, are useful for discovering common characteristics of users and understanding their behavior on a site. We presented an approach aimed at companies that do not yet have thousands or millions of visitors per day, along with two case study scenarios showcasing the capabilities of pattern mining applied to the segmentation of web log data. In the analyzed month, the site received at most hundreds of visits per day; we are currently working on how the model will be adapted as the number of visits increases. In the next paragraphs, we describe opportunities to extend our approach and why they would appeal to the analysts of a more popular site.
The first challenge when the traffic volume increases is the number of nodes visible in the Visits view: either the visualization starts to saturate or only a sample of the nodes can be displayed. To address this, we suggest joining similar nodes into a bigger node, which can then be expanded if that group of nodes interests the analyst. A metric or feature can decide what qualifies as similar; we believe geographical grouping (for example by country, city, or region) can be useful for evaluating the reach of a particular marketing campaign.
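The proposed aggregation can be sketched as collapsing visit nodes that share a feature value into one super-node sized by the number of merged visits. The records and field names are illustrative.

```python
from collections import Counter

# Illustrative visit nodes (field names are assumptions).
visits = [
    {"id": 1, "country": "Mexico"},
    {"id": 2, "country": "Mexico"},
    {"id": 3, "country": "Japan"},
]

def aggregate(nodes, key):
    """Collapse nodes sharing the same value of `key` into super-nodes;
    the size can drive the super-node's visual radius."""
    counts = Counter(n[key] for n in nodes)
    return [{"label": value, "size": size} for value, size in counts.items()]

for super_node in aggregate(visits, "country"):
    print(super_node)
```

Expanding a super-node is then just the inverse operation: restoring the original nodes whose key matches the super-node's label.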
Additionally, our prototype allows the user to observe the navigation paths of individual users, a feature that was particularly well received by the marketing team with whom we collaborated. Other tools do not display such a graph for individual analysis; their visual models show information for aggregate analysis instead. In this scenario, patterns can help to select which of the many website visits may be worth observing individually. The navigation path of a visitor allows the analyst to validate whether visitors really navigate the way they expect, and helps them notice, for example, a page that holds the users' attention (loops created around a particular page node) or one where they leave the site (the last page node in the graph).
There are many opportunities to extend our tool. For example, the automatic analysis of navigation graphs could provide marketing teams with models or patterns that define the behavior of their visitors. The graphs of all visitors could be treated as sequences and mined for interesting insights, such as the most frequent sequences or the most frequent landing page (the first element of a sequence). This is especially useful when the traffic volume increases: with graph analysis, repetitive navigation patterns can be found automatically, so there is no need to inspect individual paths, which would make the task overwhelming. Graphs could also be analyzed to detect subgroups of users or bot patterns, given that bots probably visit the pages of a site in a more structured or linear way, whereas human navigation probably has a more complex structure and contains loops. An analysis of navigation graphs can also identify sequences that lead to conversions, which can help to reconfigure the site.
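The sequence analysis suggested above can be sketched by treating each navigation path as a list of pages and counting landing pages and page-to-page transitions. The paths and page names below are illustrative.

```python
from collections import Counter

# Illustrative navigation paths (page names are assumptions).
paths = [
    ["home", "products", "cart", "checkout"],
    ["home", "products", "products", "cart"],
    ["blog", "home", "products"],
]

# Most frequent landing page: first element of each sequence.
landings = Counter(p[0] for p in paths)

# Most frequent transition: consecutive page pairs across all paths.
transitions = Counter((a, b) for p in paths for a, b in zip(p, p[1:]))

print(landings.most_common(1))     # most frequent landing page
print(transitions.most_common(1))  # most frequent page-to-page transition
```

Longer frequent subsequences (e.g., paths leading to a conversion page) can be mined with the same representation using standard sequential pattern mining algorithms.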
Our prototype is meant to be a complement to, not a replacement of, web analytic tools such as Google Analytics and Matomo. Independently of traffic volume, it can be personalized with different metrics for node ordering and for the width of the connections between nodes; here we presented metrics tailored to our end users' interests. As is, the tool already provides interesting features, and we believe the combination of machine learning and visualization techniques is a promising area. Funding: This work was supported in part by Network Information Center México. Fernando Gómez was partly supported by Consejo Nacional de Ciencia y Tecnología scholarship 470302.