Cluster Analysis of Public Bike Sharing Systems for Categorization

: The world population will reach 9.8 billion by 2050, with increased urbanization. Cycling is one of the fastest developing sustainable transport solutions. With the spread of public bike sharing (PBS) systems, it is very important to understand the di ﬀ erences between systems. This article focuses on the clustering of di ﬀ erent bike sharing systems around the world. The lack of a comprehensive database about PBS systems in the world does not allow comparing or evaluating them. Therefore, the ﬁrst step was to gather data about existing systems. The existing systems could be categorized by grouping criterions, and then typical models can be deﬁned. Our assumption was that 90% of the systems could be classiﬁed into four clusters. We used clustering techniques and statistical analysis to create these clusters. However, our estimation proved to be too optimistic, therefore, we only used four distinct clusters (public, private, mixed, other) and the results were acceptable. The analysis of the di ﬀ erent clusters and the identiﬁcation of their common features is the next step of this line of research; however, some general characteristics of the proposed clusters are described. The result is a general method that could identify the type of a PBS system.


Introduction
According to the UN forecast, the world population in mid-2017 was about 7.6 billion people, and by 2050 it is predicted to reach 9.8 billion. Along with this, urbanization is expected to increase [1,2]. Cycling is one of the fastest developing sustainable transport solutions [3][4][5][6]. Modernized and urban lifestyles have faded away physical activity of everyday life and this has resulted in a threat to population health caused by sedentary lifestyles [7]. It is estimated that physical inactivity causes 21-25% of breast and colon cancer and even greater proportions are estimated for diabetes (27%) and ischemic heart disease (30%) [8].
Public bike sharing (PBS) systems, also known as "Public-Use Bicycles", "Bicycle Transit," "Bikesharing", or "Smart Bikes," can be defined as a short-term urban bicycle rental schemes that allow bicycles to be picked up at any self-service bicycle station and returned to any other bicycle station, consisting in point-to-point trips [9]. Basically, people use bicycles on an "as-needed" basis, without the responsibility of the bicycle ownership [10]. Nowadays, different type of PBS systems start to spread all around the world, which can be operated without the docking stations, hence called dockless systems [11,12]. With the spread of public bike sharing systems, it is very important to understand the differences between systems [10,[13][14][15][16]. Without understanding the differences neither the impact of these systems can be calculated, nor is high-quality decision support possible.
We developed a complete framework during a doctoral research for analyzing, comparing, and categorizing public bike sharing systems, as such a comprehensive system is still missing from the literature [17]. The first level of our framework is to collect data about existing systems and perform a 2 of 15 cluster analysis. Then, a SWOT (strengths, weaknesses, opportunities, and threats) analysis for each cluster is compiled based on the examined systems. The third step is to create a benchmark tool, which supports the evaluation of systems. At the fourth level, impact analysis and impact assessment are carried out [18][19][20][21].
The present article deals with the clustering of different bike sharing systems around the world (i.e., it concerns the first level of our framework). The lack of a comprehensive database about PBS systems in the world does not allow for a simple comparison or evaluation of the systems [22]. Furthermore, the original goal of the creation of a PBS system is quite often unclear [23]. Without knowing the initial goal, the success of the system cannot be evaluated. A systematic literature review and scientometric analysis was conducted by Si et al. [17] from most of the bike-sharing-related articles between 2010 and 2018 from which it is clear that the researchers main focus was not on business models.
Several articles analyze the value creation of a bike sharing system [10,[24][25][26], although all of them start from the assumption that there are several distinct business models for bike sharing. DeMaio [16] introduced several examples of model provision in his article, but there was no clear definition of the different models. Other articles [24,25,[27][28][29] are using the business model canvas [30] approach or at least some of its elements, but these are not provide an easy to use categorization.
Our initial idea was to apply an unsupervised machine learning algorithm to a dataset, which should lead us to findings related to business models. This approach was applied in other industries like the Spanish scientific journals [31] or electric mobility [32] successfully. The cluster analysis methodology was not up to now applied in the field of PBS business models, but we collected a large dataset, which can be used to this purpose.
The goal of the clustering process is to create groups (clusters of objects) of the dataset, in a way that: (i) the objects in a given cluster are similar as much as possible; and (ii) the objects belonging to different clusters are highly different [33].The cluster analysis usually applied in the domain of spatial studies related to public bike sharing (e.g., [34][35][36][37]). In this field, the studies mostly focus on the distribution of bikes or stations.
Our main assumption is that a large proportion (i.e., 90%) of the public bike sharing systems around the world could be classified into one of the four clusters. These clusters are formulated based on the type of the owner and the type of the operator. A SWOT analysis based on this categorization could help PBS project promoters and owners to develop higher-quality systems. The clustering methodology proposed by the authors contributes, among others, to reducing a large number of primary data to several basic categories that can be treated as subjects for further analysis in the public bike sharing domain.

Methodology
Our research followed the steps described in Figure 1. The first step was data collection, which was followed by the initial dataset analysis. Then, the first cluster analysis based on expert opinion was conducted. The statistical tests and regression analysis were applied in order to select the parameters for the second cluster analysis. In the end, both internal and external cluster validation techniques were applied. During the analysis of the results, we compared three scenarios to each other, where different parameters were considered:

•
The entire dataset (all collected 64 parameters), • Selected parameters based on multinomial regression, • Only operator and owner parameters.

Data Collection, Database
The original idea was to collect 80 parameters on 125 systems around the world. The collection of this data was based on open web databases and the webpages of the different bike sharing systems. We assumed that the data of the bike sharing system website are up-to-date and accurate. Our starting point was the collection of systems by Meddin [39]; this database contained 2124 active systems at the beginning of 2019. We selected the 125 systems based on the following criteria: • 50 systems from Europe, 50 systems from Asia, 5 systems from Australia and New Zealand, and 20 systems from the Americas, • 3/4 should be docked systems, while the remaining 1/4 is dockless.
After a 6-month-long collection period, we had to reduce the dataset to 64 systems and 64 parameters. We made the decision that to exclude the dockless systems (n = 31) from the analysis and only focus on docked bike sharing. There were several systems (n = 30) where-despite all effortswe did not reach the minimum viable information. These systems were excluded from the analysis so as to not distort the results. Out of the originally desired 80 parameters, we had to exclude some due to the lack of available data. For example, we intended to gather information about the goals of

External cluster validation
forming "Other" cluster

Dunn index
Average silhouette width

Cluster analysis II.
based on selected parameters

Univariate statistical test
Chi-Square test One-way ANOVA test

Data Collection, Database
The original idea was to collect 80 parameters on 125 systems around the world. The collection of this data was based on open web databases and the webpages of the different bike sharing systems. We assumed that the data of the bike sharing system website are up-to-date and accurate. Our starting point was the collection of systems by Meddin [38]; this database contained 2124 active systems at the beginning of 2019. We selected the 125 systems based on the following criteria: • 50 systems from Europe, 50 systems from Asia, 5 systems from Australia and New Zealand, and 20 systems from the Americas, • 3/4 should be docked systems, while the remaining 1/4 is dockless.
After a 6-month-long collection period, we had to reduce the dataset to 64 systems and 64 parameters. We made the decision that to exclude the dockless systems (n = 31) from the analysis and only focus on docked bike sharing. There were several systems (n = 30) where-despite all efforts-we did not reach the minimum viable information. These systems were excluded from the analysis so as to not distort the results. Out of the originally desired 80 parameters, we had to exclude some due to the lack of available data. For example, we intended to gather information about the goals Sustainability 2020, 12, 5501 4 of 15 of the different systems, although it was not possible since very few system declare their initial goal, as Ricci pointed out earlier [23].
The final database was grouped around the following main topics: • Location of the systems (Continent, Country, City, etc.), • Contextual data (climate, start year of operation, size of the city, size of the service area, population, income, topology of the city, etc.), • Data about the system (owner, operator, number of bikes, number of stations, etc.),

•
Fare system (access fee, usage fee, deposit etc.), • Data related to the system operations (when it is closed, how a bike can be hired, etc.), • Derived data (bike density, station density, etc.).

Dataset Analysis
The first step was to visualize the dataset in a two-dimensional space. As the dataset itself contains several parameters, a principal component analysis (PCA) algorithm was used to reduce the number of dimensions. The algorithm presents the results in a scattered plot diagram, which gives us an easily understandable visual representation of the dataset [33].
There are several methods to calculate the distance between each pair of observations. Gower distance [39] is one of the few measures that are capable of handling both categorical and continuous variables, therefore this method was used for our calculation. The dissimilarity between two variables is the weighted mean of the contributions of each variable. This automatically implies that a particular standardization process is applied to each variable.
The daisy function from the cluster package [40] is suitable for calculating Gower distances in R. The result of computation of these distances is known as a dissimilarity matrix. The Gower distance can be described with the following Equation (1).
ij is the 0-1 weight, which becomes zero when the variable x[, k] is missing in either or both rows (i and j) or when the variable is asymmetric binary and both values are zero and in all other situations it is 1, and d (k) ij − k th variable contribution to the total distance We analyzed the entire dataset from the cluster tendency point of view. During the visual assessment of clustering tendency (VAT approach), we used the following steps:

1.
Compute the dissimilarity matrix for the data set using Gower distance.

2.
Reorder the dissimilarity matrix so that similar objects get close to one another, which results in an ordered dissimilarity matrix.

3.
The ordered dissimilarity matrix is converted into an image for visual inspection.
The color level is proportional to the value of the dissimilarity between observations. The observations in the same cluster are displayed in a consecutive order [41].
After the visual inspection, we also used the statistical method called Hopkins statistic to evaluate clusterability. This method measures the probability if a dataset was generated by a uniform distribution, so it tests the spatial randomness of the data. The calculations are the following: • Get a random sample from the original real dataset. • Compute a distance from each point to each nearest neighbor of the original real dataset.

•
Generate a random dataset based on uniform distribution with the same variation as the original real dataset. • Compute a distance from each point to each nearest neighbor of the random dataset.

•
Calculate the Hopkins statistics (H) as the mean nearest neighbor distance in the random dataset divided by the sum of the mean nearest neighbor distance in the original real and the random dataset.
The formula of Hopkins statistics can be defined as below (2): where H is the Hopkins statistics, y i is the nearest neighbor distance in the random dataset, x i is the nearest neighbor distance in the real dataset, and n is number of sample points in the dataset. The null hypothesis is that the original real dataset is uniformly distributed (i.e., there are no meaningful clusters). The alternative hypothesis is that the dataset is not uniformly distributed. (i.e., there can be find meaningful clusters). If the Hopkins statistics is close to 1, we can reject the null hypothesis and conclude that there is significant clusterability. A higher than 75% value indicates a clusterability at the 90% confidence level.

Clustering Based on Expert Opinion
Our main hypothesis was that most of the PBS systems can be clustered based on the owner type and the operator type. Therefore, during data collection, we used two owner categories: Public and Private, while in the type of operator we created 4 categories: Advertising company, Private Company, Service provider, and Public. Based on these types, we created 4 clusters, which can be seen in Table 1. This categorization was based on the expert opinion of the two authors. An advertising company provides services to a public institution, the income for the advertising company might not be realized from the direct user fees, but from some advertising spaces around the city 3 Public Service provider A service provider operates a system on behalf of the public institution, the income for these service providers can be an availability payment.

Public Public
A public institution operates the system or sets up a public company for operation, the income is directly from the user fees.
2 Private Private company The owner of the system is the same private company as the operator, the incomes are from the user fees and from the advertisements.

Univariate Statistical Tests
In order to determine which of the 64 parameters should be included in a multivariate regression model, some preselection is required [42]. As the dependent variables are both categorical and continuous, while the independent variable is categorical, we had to use two types of statistical tests. We used the SPSS statistical software for these tests.
We used the Pearson's chi-square test to discover whether there is a relationship between two categorical variables. As all the variables were measured at an ordinal or nominal level (i.e., were categorical data) and both variables consist of at least two independent groups, the test was Sustainability 2020, 12, 5501 6 of 15 applicable. The null hypothesis was that Variable 1 (Cluster) is dependent from Variable 2 (all other categorical variables) [43].
We used the one-way ANOVA test to determine if there is a statistical difference between the means of independent groups and the population. The independent variable (cluster in our case) divides the dataset into mutually exclusive groups. We used this test where the dependent variables were continuous. The null hypothesis was that all group means are equal, while the alternative hypothesis was that at least one of the group means is not equal to the others. As the one-way ANOVA is an omnibus test, we do not know which of the groups are different [44].
We selected a higher significance level for both tests not to eliminate the possible candidates from the multivariate regression analysis as it was suggested by Bursac et al. (2008). If the p-value was less than our chosen significance level (α = 0.25), we rejected the null hypothesis, and concluded that there is an association between our two variables, therefore we selected the dependent variable for further tests [42].

Multinomial Regression
We used multinomial logistic regression to predict the nominal dependent variable (cluster of the PBS system) based on the preselected independent variables (both categorical and continuous ones). This also allows to have interaction between the independent variables to predict the dependent one. We used the SPSS statistical software for this.
The applicability of this method is based on the following assumptions: • The dependent variable is a nominal one and should be mutually exclusive.

•
There are two or more independent nominal or continuous variables.

•
There should be no multicollinearity.

•
There need to be a relationship between any continuous independent variable and the logit transformation of the dependent one.

•
There should be no outliers.
We checked the entire dataset for the first 3 assumptions. The multicollinearity assumption was continuously tested for each different model and the rest was automatically tested in SPSS. As the software is not capable of running any automated model selection processes due to categorical variable, we decided to use the backward method and computed each step manually. First, we eliminated those independent parameters where we believed that the relationship to the dependent one would only be statistical, but there is no real reason to be related (e.g., start of operation, country etc.). Then, we added all remaining parameters to the model. We selected the variables with multicollinearity and eliminated one of them based on the significance. We reduced the model until we got a statistically significant one.

Cluster Analysis for Selected Parameters
We used a clustering method for creating associated groups from the dataset. We used the same method with different parameter sets. We decided to use a k-medoids algorithm, which belongs to the k-means clustering approaches. The most commonly used method is the partitioning around medoids (PAM) algorithm [45]. The PAM algorithm is based on the search of k representative medoids in the dataset and then it clusters the remaining dataset around them. As it does not use the means of the cluster, this method is less sensitive to outliers. The method consists of two phases: The build phase and the swap phase. In the build phase, the first step is the selection of k medoids. The second step is the calculation of the dissimilarity matrix, while the third step is the assignment of each observation into the closest medoids (therefore cluster), based on the calculated distance. In the swap phase, the fourth step is to check if swapping the current medoid of the cluster to any other object in the given cluster is reducing the average dissimilarity. If this happens, the cluster medoid should be changed to the new object and we must go back to the third step and start over again. If none of the medoids change in the fourth step the procedure stops. We used the R software [46] and the factoextra package [47] to compute the clustering. We used Gower distance to calculate the dissimilarity of the variables.

Internal Cluster Validation
In order to determine how good the clustering is, we applied internal cluster validation statistics, which uses the internal information of each cluster without external data. All the different statistics measure the compactness, the separation, and the connectedness of the different clusters [40,48].

•
The average distance between clusters measures the separation of clusters; as the average distance increases, so does the separation.

•
The average distance within cluster objects measures the compactness of the clusters; as it decreases, the compactness increases.

•
The average silhouette width also measures the separation between clusters. Each silhouette width coefficient is close to 1 if the object is in the right cluster, 0 means that the object is between clusters and −1 means that the object is entirely in the wrong cluster. So, we want the average to be as close to 1 as possible.

•
The Pearson Gamma or normalized gamma coefficient shows the correlation between distances and a 01-vector where 0 means same cluster and 1 means different clusters.

•
The Dunn index may be calculated in two ways, but in both cases, the Dunn index should be maximized: In the first version, the minimum separation divided by the maximum diameter; in the second way, the minimum average dissimilarity between two cluster divided by the maximum average within cluster dissimilarity.
In addition to the statistical indexes, we can also use visual methods to explore the results of clustering. The first possibility is to visualize the clusters with PCA in a two-dimensional space. The other option is the silhouette plot, where the diagram shows the silhouette coefficient for each object in an ordered way separated for each cluster.

External Cluster Validation
During the external cluster validation, we can compare two cluster validation techniques to each other. As in this research we created an expert based categorization as well as the wider parameter-based cluster using PAM method, we can compare the two categorizations to each other. The external cluster validation parameters measure how the external cluster number is matched to the clustered one.
The Rand index [49] measures the similarity between two clusters; its range is from −1 (no common value) to 1 (completely the same). The Variation Index described by Meila [50] is also a valuable tool to measure the similarity of the two clusters.

Results and Discussion
The initial phase of our research was to collect the necessary data for our clustering analysis. We shared all the data which collected for this purpose online [51].
We presented the results in three different scenarios below. In the first case, we always made the assessment on the entire dataset. The second presented scenario is the one with the selected parameters based on multinomial regression. The third case is when we only use the operator and the owner parameters. We used Gower distance here as the measure of dissimilarity of the different objects.
We visualized the raw dataset in a two-dimensional space using PCA methodology ( Figure 2). The two axes have no specific meaning, they only provide an artificial scale for visualization purposes. Although the scaling and the axes of the figures are not the same, it is viable for comparing the resulting patterns to each other. The dataset with all parameters is less clusterable than the one with selected parameters. In the last one, only four datapoints are visible, since the entire dataset is clustered into these four points. The same conclusion can be drawn from the heatmap resulting from the VAT approach (see We used the SPSS for the preselection of the parameters for the multivariate regression. Table 2 contains those variables whose p-value is lower than the chosen significance level ( = 0.25) in the Chi-square test.  The same conclusion can be drawn from the heatmap resulting from the VAT approach (see Figure 3) as well as from the Hopkins statistics. The factoextra package [47] implements H alt = 1 − H as the definition of H provided in the methodology section. H alt = 0.2661683 proved to be for all parameters (scenario 1), while H alt = 0.1565344 for the selected parameters (scenario 2). We used the seed number 123 for the calculation of Hopkins statistics. The same conclusion can be drawn from the heatmap resulting from the VAT approach (see We used the SPSS for the preselection of the parameters for the multivariate regression. Table 2 contains those variables whose p-value is lower than the chosen significance level ( = 0.25) in the Chi-square test.  We used the SPSS for the preselection of the parameters for the multivariate regression. Table 2 contains those variables whose p-value is lower than the chosen significance level (α = 0.25) in the Chi-square test. Table 3 contains those variables whose p-value is lower than the chosen significance level (α = 0.25) in the one-way ANOVA test. Parameter names can be found next to the dataset description in [51].
We ran the cluster analysis in the R software using the PAM method for all three scenarios with k = 4. As shown in Figure 4, the clustering is better with the selected parameters.  The results of the clustering can be described with the average silhouette width of each cluster (Table 4) and the silhouette plots Figure 5). The average silhouette width increases to 1 in the absolutely clustered scenario. The cluster based on the selected parameters has no negative data in cluster 1 and 2, which means a good clustering result. The results of the clustering can be described with the average silhouette width of each cluster (Table 4) and the silhouette plots Figure 5). The average silhouette width increases to 1 in the absolutely clustered scenario. The cluster based on the selected parameters has no negative data in cluster 1 and 2, which means a good clustering result.  The results of the clustering can be described with the average silhouette width of each cluster (Table 4) and the silhouette plots Figure 5). The average silhouette width increases to 1 in the absolutely clustered scenario. The cluster based on the selected parameters has no negative data in cluster 1 and 2, which means a good clustering result.  The internal cluster validation statistics shows similar results (Table 5). Where it is appropriate, the owner and operator scenario has the theoretical minimum or maximum values. There are some exceptions e.g., the first Dunn index shows worse results for the selected parameter scenario than the all-parameters one.  The internal cluster validation statistics shows similar results (Table 5). Where it is appropriate, the owner and operator scenario has the theoretical minimum or maximum values. There are some exceptions e.g., the first Dunn index shows worse results for the selected parameter scenario than the all-parameters one. External cluster validation was based on the clustering related to the expert opinion. Therefore, the last column of Table 6 is just for reference purposes; obviously, it has no meaning besides that the statistical calculation is working. The results of the clustering algorithm for the selected parameter can be seen in Figure 6. There is a distinct cluster which is clearly different from the others, while the remaining three are somewhat overlapping. The results of the clustering algorithm for the selected parameter can be seen in Figure 6. There is a distinct cluster which is clearly different from the others, while the remaining three are somewhat overlapping. We compared expert-based clustering with parameter-based clustering. Out of 64 systems, 42 clustered into the same cluster as the expert based method suggests and the remaining 22 (35%) were misplaced by the clustering algorithm. Based on our results, we decided that these 22 systems should be categorized as a fifth cluster, called Other. The clustering was correct for the systems, which were owned by a private company: No system with such parameters was missing, and none was misplaced to this cluster. The public systems showed similarly good results, as there were only 3 out of 12 that were misplaced and even these 3 showed some similarity with the selected clusters (e.g., Bixi We compared expert-based clustering with parameter-based clustering. Out of 64 systems, 42 clustered into the same cluster as the expert based method suggests and the remaining 22 (35%) were misplaced by the clustering algorithm. Based on our results, we decided that these 22 systems should be categorized as a fifth cluster, called Other. The clustering was correct for the systems, which were owned by a private company: No system with such parameters was missing, and none was misplaced to this cluster. The public systems showed similarly good results, as there were only 3 out of 12 that were misplaced and even these 3 showed some similarity with the selected clusters (e.g., Bixi Montreal is a public system, but has similarity with a commercial one, the two others were Chinese systems, where the difference between a public or a private company is sometimes hard to spot).

Based on All Parameters
Eleven systems were misplaced into the cluster where the operator is supposed to be an advertising company, although all these systems are operated by a service provider. It was also true for the other way around: Four systems out of five were misplaced into the cluster with a service provider as operator, although they have an advertising company operator. This might indicate that the service provider and the advertising company business model and those system characteristics are not as distinct from each other as the purely public and purely private models.
If we only use three big clusters (purely public, purely private, and mixed), we end up only with the miscategorization of 6 systems out of 64. The analysis of the different clusters and the identification of their common features is the next step of this line of research, however some general characteristics of the proposed four clusters can be described here as a starting point. The basis of these descriptions is both the categorization described in Section 2.3 and the results of the clustering exercise: • Cluster 1 (Public systems): Both the operator and owner of the system are public institutions. The owner is usually a city or one of its companies. The operator can be the same organization or a new one created for this specific purpose. The income is coming directly from the user fees, but usually it requires subsidization. The goal of such a scheme usually is to provide an alternative transport mode or educate the citizens rather than profit making. Typical example: MVG Rad (Munich).
• Cluster 2 (Private systems): Both the operator and owner of the system are private companies, usually the same one. The income is coming from the user fees as well as advertisements.
The goal here is clearly profit making, therefore a more cost-efficient operation is envisaged. Furthermore, some limitation related to the network or the users can be applied. Typical example: NextBike Croatia. • Cluster 3 (Mixed systems): The owner of the system is a public entity (usually a city or a transport operator), while the operator is a private company. The goal of the owner is usually to provide wider transport choices to citizens, while a private company is providing a service. There are two distinct business models for the private company based on the main source of income. In both cases, the user fees are collected for the owner, hence the financial risk from the usage is on the owner side. In the first type, the service provider gets a service fee (availability payment) based on a service level agreement. In the second type, the operator can use different advertising spaces around the city to cover the expenses of the system operation. However, from the system point of view, there are not enough distinct features of these two subtypes to cluster them separately. Typical example: MOL-Bubi (Budapest-subtype 1); Velib (Paris-subtype 2). • Cluster 4 (other systems): There are several systems that can be categorized by an unsupervised algorithm to one of the above clusters based on the expert knowledge of the authors cannot fit well with them. The reasons are usually hard to spot, but for instance, it can be that a public company design a system with clear profit-making goals, or a private company acts similarly towards a public entity. There are especially some outliers in the Chinese systems, due to the specificities of the country political structure.