Utilizing Crowdsourced Data for Studies of Cycling and Air Pollution Exposure: A Case Study Using Strava Data

With the development of information and communications technology, user-generated content and crowdsourced data are playing a large role in studies of transport and public health. Recently, Strava, a popular website and mobile app dedicated to tracking athletic activity (cycling and running), began offering a data service called Strava Metro, designed to help transportation researchers and urban planners to improve infrastructure for cyclists and pedestrians. Strava Metro data has the potential to promote studies of cycling and health by indicating where commuting and non-commuting cycling activities are at a large spatial scale (street level and intersection level). The assessment of spatially varying effects of air pollution during active travel (cycling or walking) might benefit from Strava Metro data, as a variation in air pollution levels within a city would be expected. In this paper, to explore the potential of Strava Metro data in research of active travel and health, we investigate spatial patterns of non-commuting cycling activities and associations between cycling purpose (commuting and non-commuting) and air pollution exposure at a large scale. Additionally, we attempt to estimate the number of non-commuting cycling trips according to environmental characteristics that may help identify cycling behavior. Researchers who are undertaking studies relating to cycling purpose could benefit from this approach in their use of cycling trip data sets that lack trip purpose. We use the Strava Metro Nodes data from Glasgow, United Kingdom in an empirical study. 
Empirical results reveal that (1) compared with commuting cycling activities, non-commuting cycling activities are more likely to be located in the outskirts of the city; (2) spatially speaking, cyclists riding for recreation and other purposes are more likely to be exposed to relatively low levels of air pollution than cyclists riding for commuting; and (3) the method for estimating the number of non-commuting cycling activities works well in this study. The results highlight (1) a need for policymakers to consider how to improve cycling infrastructure and road safety in the outskirts of cities; and (2) a possible way of estimating the number of non-commuting cycling activities when the trip purpose of cycling data is unknown.


Introduction
Over the last two decades, researchers have provided much evidence of the benefits of cycling as a health-enhancing physical activity [1][2][3][4][5][6][7]. Recently, volunteered geographic information (VGI), user-generated content (UGC) and crowdsourced data have become promising data sources for transport and health research [8,9]. Traditional methods of collecting cycling data, including manual counts, stated preference surveys [10,11] and annual average daily bicycle (AADB) volumes [12], are expensive and time-consuming. Each of these methods has its advantages, but each is almost impossible to accomplish over a broad area simultaneously, which is why crowdsourced methods are gaining interest in planning [9]. Through the expansion of Global Positioning Systems (GPS), new methods for collecting detailed cycling route information have emerged [13]. GPS-enabled mobile devices, such as smartphones, allow individuals to track and map their cycling routes [13][14][15][16]. More recently, crowdsourced cycling data have been used to analyze cycling behavior [13,17] and make associations between cycling and health [9]. Strava is a popular website used to track users' cycling and running activity via GPS-enabled devices, such as smartphones and smart watches. Millions of people upload their rides and runs to Strava every week via their smartphones or other GPS devices [18]. Strava launched a data service called Strava Metro that offers aggregated data sets after anonymizing and aggregating individuals' GPS traces. In earlier studies that use traditional data collection methods, research on the role of cycling for health through physical activity has been limited by the lack of information on where bicyclists ride [9]. With a high spatial resolution, Strava Metro data is able to provide new opportunities for research into active travel, sustainable travel and public health.
This could benefit studies of cycling behavior and public health, and further help policymakers in urban planning, especially designing urban infrastructures aiming to make urban residents healthier and cities more sustainable. For instance, knowing where people like to cycle could help policymakers to improve cycling infrastructure more effectively (e.g., availability of cycle parking in areas of high demand) and promote road safety by giving priority to roads where there are more cycling trips. In several recent studies, Strava data has been used to map ridership over a city [13], evaluate the impact of bicycle infrastructure on cycling behavior [17], and investigate impacts of residential and employment density, land use diversity, cycling facilities and terrain on cycling behavior [9].
The impact of outdoor and traffic-related air pollution on health is an important issue in transport and health [19][20][21][22][23][24][25][26][27][28]. Typically, impacts of active travel (cycling and walking) and inactive travel (traveling by car, bus or train) on health are compared [22][23][24][25][26][27]. Although earlier studies offer much evidence on the health benefits of cycling or walking due to increased physical activity [1][2][3][4][5][6][7], some other studies reveal that cycling also carries some potential health risks, including air pollution, accidents and noise [29][30][31]. One of the most important risks is from poor air quality [29,30]. Exposure to air pollution is harmful to human health [32][33][34][35][36][37][38][39][40][41][42][43][44] and more than 80% of people living in urban areas that monitor air pollution are exposed to air quality levels that exceed World Health Organization (WHO) safe limits [32]. Cyclists riding in urban areas are therefore likely to be exposed to high levels of air pollution. Recent studies use health impact modelling (HIM) to estimate the health benefits and risks of active travel (cycling, walking), and reveal that the total benefits of active travel outweigh the risks [28,45,46]. In particular, a very recent study reveals that the benefits of active travel outweigh the harm caused by air pollution in all but the most extreme air pollution concentrations [28]. It is also becoming widely accepted that increasing cycling time tends to increase health improvements. However, many such studies assess cyclists' exposure to air pollution based on city-level air pollution values when, in fact, air pollution levels vary spatially over a city. Relating cycling activities to air pollution at a larger scale (e.g., street-level) could promote assessment of air pollution exposure when it is known where and when cyclists ride in a city.
Ideally, urban planners and policymakers could use this knowledge of where cyclists ride to devise cycling and walking routes that minimize the risks faced by active commuters, and to decrease volume of cyclists riding in the environments that are associated with the highest exposures [29,47].
Moreover, Strava Metro data also indicate the cycling purpose (commuting or non-commuting) of cycling activities. Researchers might make use of this in studies of cycling purpose and health. In this paper, we explore the potential of Strava Metro in research of active travel and health by using the data to investigate spatial patterns of non-commuting cycling activities and associations between cycling purpose (commuting and non-commuting) and air pollution exposure at a large scale. Additionally, as some cycling trip data sets (e.g., crowdsourced GPS trajectories or bike-sharing origin-destination trips) lack trip purpose (commuting or non-commuting), we cannot directly relate cycling purpose to air pollution exposure when utilizing those data sets. However, we might estimate the number of non-commuting cycling trips based on the number of all-purpose trips and environmental characteristics that affect cycling behavior. In this paper, we try to estimate the number of non-commuting cycling trips based on the number of trips for all purposes and environmental characteristics, as the estimation model can be validated with the Strava Metro data. If this estimation method is shown to perform well, we may then use it to estimate the number of non-commuting cycling trips in other cycling trip datasets where trip purpose is unknown.
In this paper, we use the Strava Metro data in Glasgow, UK to carry out an empirical analysis. Firstly, in order to explore spatial patterns of non-commuting cycling activities, we investigate where non-commuting cycling activities are more likely to be than commuting cycling activities by identifying clusters where there are high rates of non-commuting cycling activities. Afterward, to associate cycling purpose with air pollution exposure at a large scale (i.e., the street intersection level), we investigate whether cyclists riding for recreation and other purposes (excluding commuting) are more likely to be exposed to relatively low levels of air pollution than cyclists riding for commuting. Note that levels of air pollution might also vary over time in the study area. We focus on spatial variations of air pollution levels, not spatio-temporal variations, as the temporal resolution of the air pollution data is one year. In this study, we focus on the difference in air pollution exposure during cycling, not the difference in health effects of cycling. Additionally, we strive to improve the estimation of the number of non-commuting cycling activities by using different regression methods (linear and non-linear methods) as the estimation models.

Materials and Methods
In this section, the methods used for spatial analysis of non-commuting cycling activities and air pollution are presented. Section 2.1 introduces the data and study area, Section 2.2 explores spatial patterns of non-commuting cycling activities and associations between cycling purpose and air pollution exposure, and Section 2.3 presents how we estimate the number of non-commuting activities according to total cycling activities and locational characteristics.

Strava Metro Data
Strava (San Francisco, CA, USA) is a popular online social network for cyclists and runners with a user base larger than other similar sites like MapMyRide (Under Armour, Baltimore, MD, USA), MapMyRun (Under Armour, Baltimore, MD, USA) or RideWithGPS (Ride with GPS, Portland, OR, USA). Strava consists of a mobile app and a website, allowing users to track their rides, runs, walks and hikes on a smartphone or another GPS device. The Strava app records distance, time, average speed and route (GPS trajectory) of each activity. Users can also add textual information like titles and tags to describe their trips. The 'commute' flag indicates walking or riding journeys to or from work. Users can upload their GPS-tracked activities recorded by their Strava apps to Strava's website (https://www.strava.com). Strava's database comprises nearly a trillion GPS points globally and is growing by over 8 million activities every week [48]. Strava Metro is a suite of data services that enables cutting-edge views into cycling and pedestrian (running, walking, hiking) patterns [49]. Aiming to produce state-of-the-art spatial data products and services to make cycling, running, and walking in cities better, Metro anonymizes and aggregates activity data from Strava's millions of users and then collaborates with departments of transportation and city planning groups to improve infrastructure for bicyclists and pedestrians [49]. To acquire the number of commuting activities, Metro first counts 'commute' flags. As some users tend not to select the 'commute' flag, Metro further uses two ways to detect commuting activities among activities where 'commute' is not flagged: (1) activities with keywords in the titles such as "To Work" and "Commute To"; and (2) starting and ending points more than 1 km apart within a distance and time threshold (this can be user defined, but 30 miles and 90 min are reported to be a good upper threshold) [48].
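The detection rules described above can be illustrated with a minimal sketch. This is not Strava Metro's implementation: the `Activity` fields, the constant names, and the reading of rule (2) as "start and end more than 1 km apart, with trip distance at most 30 miles and duration at most 90 min" are assumptions made for illustration only.

```python
from dataclasses import dataclass

# Keywords taken from the two examples quoted in the text; the real list is unknown.
COMMUTE_KEYWORDS = ("to work", "commute to")
MIN_SEPARATION_KM = 1.0           # start and end points must be farther apart than this
MAX_DISTANCE_KM = 30 * 1.609344   # 30 miles expressed in km
MAX_DURATION_MIN = 90.0


@dataclass
class Activity:
    """Hypothetical per-activity record (not the Strava data model)."""
    title: str
    commute_flag: bool
    start_end_separation_km: float
    distance_km: float
    duration_min: float


def is_commute(a: Activity) -> bool:
    """Apply the flag first, then the title rule, then the distance/time rule."""
    if a.commute_flag:
        return True
    title = a.title.lower()
    if any(keyword in title for keyword in COMMUTE_KEYWORDS):
        return True
    return (
        a.start_end_separation_km > MIN_SEPARATION_KM
        and a.distance_km <= MAX_DISTANCE_KM
        and a.duration_min <= MAX_DURATION_MIN
    )
```

Under these assumptions, a round trip that starts and ends at the same place is never classified as a commute unless it is flagged or titled as one.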
More than half of the Strava Metro dataset for dense metropolitan areas corresponds to commuting trips [49]. Some studies suggest crowdsourced data may be a good proxy for estimating daily, categorical cycling volumes by comparing cyclist counts between Strava data and manual count data at count stations [13,50]. For example, the Oregon Department of Transportation (ODOT) found that, on the Hawthorne Bridge, the month-to-month correlation between the total number of bicycles counted with a bike counter and the total number of Strava bicycle trips over a one-year period had an adjusted R-squared of 0.91 [50]. Although crowdsourced cyclists represent a small portion of all cyclists, comparison with manual counts revealed a linear relationship between crowdsourced cyclists and total ridership in Victoria, Canada [13]. Owing to their ability to track cycling activities at a high level of spatial granularity and their fairly high representation of the population of cyclists, Strava Metro data seem to have high potential for studies of active travel and health in urban areas.
The Urban Big Data Centre, UK offers a Strava Metro dataset to researchers [51]. This dataset records 287,833 cycling activities (174,758 commuting activities and 113,075 non-commuting activities) contributed by 13,684 users (11,216 males, 1698 females and 770 users with blank gender) within the Glasgow Clyde Valley Planning area (including Glasgow City and seven council areas) in 2015. The dataset contains three subsets in three formats: Streets, Origin-Destination and Nodes (see [49]). In this study, we select the Nodes subset. A node represents a street intersection and an edge represents a street (see Figure 1). The street network is extracted from OpenStreetMap (OpenStreetMap Foundation, West Midlands, UK). The attributes of a node include the count of all-purpose cycling activities and the count of commuting activities at the intersection (node) in 2015. Note that Strava Metro uses the number of all-purpose cycling trips (regardless of unique riders) that meet at the intersection to represent the count of all-purpose cycling activities at the intersection, and the number of commuting cycling trips (regardless of unique riders) that meet at the intersection to represent the count of commuting cycling activities. We can infer that the count of non-commuting activities at the intersection (node) equals the difference between the count of all-purpose cycling activities and the count of commuting activities at that intersection. Additionally, the dataset contains a file that offers summary statistics of the bike trips, including average distance (24 km), median distance (15 km), average time (1.34 h), and median time (0.77 h).
Additionally, we look at the distributions of the numbers of all-purpose cycling activities, inferred non-commuting cycling activities and commuting cycling activities in a node. Figure 2 shows the cumulative distributions of these three counts using the complementary cumulative distribution function (CCDF). The CCDF of a variable X at x represents the probability that X takes a value greater than x, i.e., P(X > x). For instance, the CCDF of the number of all-purpose cycling activities at 2500, i.e., P(X > 2500), equals 0.1. This means that 10% of nodes have more than 2500 all-purpose cycling activities, whilst the other 90% of nodes have 2500 or fewer. Intuitively, the cumulative distributions of the numbers of all-purpose cycling activities, non-commuting cycling activities and commuting cycling activities all seem to approximately follow an exponential law, as they look like straight lines in the log-linear plot (see Figure 2). Note that the log-linear plot is a semi-log plot with a logarithmic scale on the y-axis and a linear scale on the x-axis.
Within the administrative boundaries of Glasgow, there are 59,718 nodes. Among those nodes, a few seem to have null or incorrect records. For instance, in some nodes the count of commuting cycling activities is larger than the count of all-purpose cycling activities, while in some other nodes the count of all-purpose cycling activities is zero. After removing these noise nodes, 50,057 nodes are kept as the data set. In the spatial analysis, we first need to aggregate nodes to areas, as we want to identify clusters consisting of contiguous areas. In this study, the area unit is the census output area. There are 5486 census output areas in Glasgow (see Figure 3). The census output area data are downloaded from DATA.GOV.UK [52].
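The node cleaning rule and the CCDF described above can be sketched in a few lines. The tuple layout `(all_purpose_count, commuting_count)` and the function names are illustrative assumptions, not the Strava Metro file format.

```python
def clean_nodes(nodes):
    """Drop noise nodes and derive the non-commuting count for the rest.

    `nodes` is an iterable of (all_purpose_count, commuting_count) pairs,
    where either value may be None for a null record (assumed layout).
    Returns a list of (all_purpose, commuting, non_commuting) tuples.
    """
    kept = []
    for total, commute in nodes:
        if total is None or commute is None:
            continue  # null record
        if total == 0 or commute > total:
            continue  # zero activity, or more commuting trips than trips overall
        kept.append((total, commute, total - commute))
    return kept


def ccdf(values, x):
    """Empirical CCDF: the share of values strictly greater than x, P(X > x)."""
    values = list(values)
    return sum(1 for v in values if v > x) / len(values)
```

For example, `ccdf(totals, 2500)` returning 0.1 would correspond to the statement that 10% of nodes have more than 2500 all-purpose cycling activities.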

Air Pollution Data
In this paper, particulate matter (PM), comprising PM10 (coarse PM) and PM2.5 (fine PM), is used as the air pollutant to measure levels of air pollution [34][35][36][37][38][39]. PM10 is PM with a diameter of 10 micrometers or less, while PM2.5 is PM with a diameter of 2.5 micrometers or less. Exposure to air pollution tends to increase the risk of disease and mortality [33,44]. For instance, earlier studies provide much evidence that long-term exposure to PM is associated with an increase in cardiovascular and respiratory diseases [33][34][35][36][37][38][39][40], and some recent studies also reveal that short-term exposure to PM is associated with increased mortality risk [41][42][43][44]. According to a report released by WHO [32], Glasgow is one of the worst UK cities for both PM10 and PM2.5. In this study, background maps for PM2.5 and PM10 are downloaded from Air Quality in Scotland [53]. The background pollutant concentration maps are presented in 1 km × 1 km grid squares across Scotland. The background maps (reference year 2013) contain estimates of annual average pollutant concentrations (PM10 and PM2.5) for 2015, 2020, 2025 and 2030 from a base year of 2013 [54]. The 2013 maps are based on ambient monitoring and meteorological data for 2013 [54]. Air pollution background concentration maps for 2015 are used in this study. Specifically, each grid has values representing annual average estimates of PM2.5 and PM10 concentrations in 2015 (see Figure 4). The unit of PM2.5 and PM10 is µg/m³. The modelling methodology of the PM10 background maps is based on Scottish monitoring data and Scottish meteorological data, used to model the annual mean background and roadside concentrations for Scotland, whilst the modelling methodology of the PM2.5 background maps is based on the UK Pollution Climate Mapping (PCM) approach, used to model the annual mean background and roadside concentrations of PM2.5 for the UK as a whole [54].
Only grids that have more than half of their area within the administrative boundaries of Glasgow are included in the study, resulting in a total of 175 grids. In Glasgow, air pollution levels are observed to have been decreasing in recent years. For instance, the PM10 annual mean decreased from 2010 to 2015 [55]. The main source of air pollution is road traffic, while the other sources are of less significance [55]. Accordingly, areas with relatively high levels of PM10 and PM2.5 are in or close to the city center, while levels of PM10 and PM2.5 tend to decrease from the city center to the outskirts of the city (see Figure 4), where the volume of motor vehicles would be expected to be lower. WHO sets 20 µg/m³ as a safety guideline for the annual mean of PM10 and 10 µg/m³ as a safety guideline for the annual mean of PM2.5 [32]. All areas of Glasgow are below the safety guideline for PM10 (20 µg/m³). Annual average PM2.5 for several areas (grids) located in the city center is above the safety guideline (10 µg/m³), while annual average PM2.5 for the other areas (grids) is below it. We mark areas (grids) with an annual average PM2.5 exceeding the safety guideline in red (see Figure 4).
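The grid selection rule and the comparison against the WHO guideline values quoted above can be sketched as follows. The dictionary keys (`share_inside`, `pm10`, `pm25`) are hypothetical field names chosen for this illustration; they are not the schema of the Air Quality in Scotland background maps.

```python
WHO_PM10_ANNUAL = 20.0  # µg/m³, annual mean guideline cited in the text
WHO_PM25_ANNUAL = 10.0  # µg/m³, annual mean guideline cited in the text


def select_grids(grids, min_share=0.5):
    """Keep grids with more than `min_share` of their area inside the city boundary."""
    return [g for g in grids if g["share_inside"] > min_share]


def exceeds_guideline(grid):
    """Report, per pollutant, whether the annual mean is above the WHO guideline."""
    return {
        "pm10": grid["pm10"] > WHO_PM10_ANNUAL,
        "pm25": grid["pm25"] > WHO_PM25_ANNUAL,
    }
```

A grid flagged for `pm25` under this rule corresponds to a grid that would be marked in red in Figure 4.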

Spatial Patterns of Non-Commuting Cycling Activities
Firstly, we define the non-commuting rate to measure the dominance of non-commuting activities within an area (census output area). Suppose i is an area; the non-commuting rate of i is computed as

$$\mathrm{rate}(i) = \frac{\mathrm{num\_non\_act}(i)}{\mathrm{num\_non\_act}(i) + \mathrm{num\_com\_act}(i)} \quad (1)$$

$$\mathrm{num\_non\_act}(i) = \sum_{j \in N_i} \mathrm{num\_non\_act}_{Node}(j) \quad (2)$$

$$\mathrm{num\_com\_act}(i) = \sum_{j \in N_i} \mathrm{num\_com\_act}_{Node}(j) \quad (3)$$

where num_non_act(i) and num_com_act(i) are the numbers of non-commuting and commuting cycling activities in the area i, num_non_act_Node(j) and num_com_act_Node(j) are the numbers of non-commuting and commuting activities in the node j, and $N_i$ is the set of nodes located within the area i.
In this paper, the improved AMOEBA (A Multidirectional Optimum Ecotope-Based Algorithm) developed by [56] is used to identify clusters of high non-commuting rate. As a spatially constrained clustering method, this algorithm is applicable to the classification of a large number of areas and the identification of irregularly shaped clusters. Here we briefly introduce the improved AMOEBA algorithm based on [56]. Essentially, a region or ecotope is a spatially linked group of areas. A region can thus be defined as a spatially contiguous set of areas. The value of the $G_i^*$ statistic is used to measure the level of clustering of an attribute x around an area. Suppose we run AMOEBA on a study region with N areas and an attribute x with elements $x_i$, indicating the value of x at area i. Let us denote this set of areas as M, let $\bar{x}$ and S be the mean and the standard deviation of the attribute x, and let R be a sub-region of M with n areas. Duque et al. [56] rewrite the formulation of $G_i^*$ as follows:

$$G_R^* = \frac{\sum_{i \in R} x_i - n\bar{x}}{S\sqrt{\dfrac{Nn - n^2}{N - 1}}}$$

Basically, $G_R^*$ depends on the areas that are in the region R and on the parameters N, $\bar{x}$ and S, which are obtained from the areas in M.
Accordingly, a positive (negative) and statistically significant value of the $G_i^*$ statistic indicates the presence of a cluster of high (low) values of attribute x around area i. Thus, AMOEBA identifies high-valued, or low-valued, ecotopes (regions) by looking for subsets of spatially connected areas with a high absolute value of the $G_i^*$ statistic. Only one parameter, the significance level threshold, is required to run the AMOEBA algorithm. The significance level threshold was set to 0.01, meaning that only clusters with a p-value less than 0.01 are considered statistically significant.
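To make the two quantities in this subsection concrete, the sketch below computes an area's non-commuting rate from its nodes and the region-level Getis-Ord statistic for a candidate region. It assumes that the rate is the share of non-commuting activities among all activities in the area and that the weights are binary, so the statistic reduces to a sum over the region. It does not perform the AMOEBA region search or the significance test, and all function names are illustrative.

```python
import math


def non_commuting_rate(nodes_in_area):
    """Share of non-commuting activities among all activities in one area.

    `nodes_in_area` is a list of (non_commuting, commuting) counts, one per node.
    """
    non = sum(n for n, _ in nodes_in_area)
    com = sum(c for _, c in nodes_in_area)
    return non / (non + com)


def g_star_region(x, region):
    """Region-level G* with binary weights.

    `x` holds the attribute value of every area in the study region (N values);
    `region` is a collection of indices into `x` (n of them, with 0 < n < N).
    """
    N, n = len(x), len(region)
    if not 0 < n < N:
        raise ValueError("region must be a non-empty proper subset of the areas")
    mean = sum(x) / N
    # Population standard deviation, as used in the Getis-Ord statistic.
    S = math.sqrt(sum(v * v for v in x) / N - mean ** 2)
    numerator = sum(x[i] for i in region) - n * mean
    denominator = S * math.sqrt((N * n - n * n) / (N - 1))
    return numerator / denominator
```

A region made up of areas with above-average rates yields a positive value, and one made up of below-average areas yields a negative value, which matches the interpretation of high-value and low-value clusters given above.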

Comparison of Air Pollution Exposure by Cycling Purpose
We quantitatively investigate whether cyclists riding for non-commuting purposes (recreation and other purposes) are more likely to be exposed to lower levels of PM (PM10 and PM2.5) than cyclists riding for commuting purposes at the node level (the intersection level). First, we compare the means of instantaneous exposure to PM for non-commuting and commuting cycling activities. Second, we compare the percentages of 'high exposure' activities for non-commuting and commuting cycling activities. As some areas exceed the WHO guideline value for the annual PM2.5 level (see Figure 4), we call activities that are located within areas (PM grids) of 'high' PM2.5 levels (annual PM2.5 > the WHO guideline: 10 µg/m³) 'high exposure' activities in this paper.
A more reasonable approach to assessing the exposure of cyclists to PM concentrations should take account of not only where cyclists are riding but also the time spent cycling, in order to calculate cyclists' inhaled dose of PM10 or PM2.5 air pollution [28]. Ideally, the comparison of air pollution exposure by cycling purpose should be based on cyclists' inhaled dose of air pollution when riding for commuting and non-commuting purposes. As the Strava Metro data do not contain time spent cycling, trip distance, or trip speed at the individual level (level of the cyclist), they cannot support assessment of cyclists' intake of PM10 or PM2.5 air pollution. Therefore, in this paper, we use instantaneous exposure to PM10 or PM2.5 air pollution based only on the locations of cycling activities.
Suppose there is one cycling activity at a node, meaning that there is one cyclist at this node at a particular time. We could use the levels of PM10 or PM2.5 at a node (street intersection) to represent the exposure of the cyclist to PM10 or PM2.5 air pollution at the moment when he or she is at that node. In other words, the levels of PM10 or PM2.5 at a node where a cycling activity is located could be used to represent the instantaneous exposure of the cyclist to PM10 or PM2.5 air pollution. We could calculate this instantaneous exposure for each non-commuting or commuting cycling activity. Accordingly, we could calculate the means of instantaneous exposure to PM10 or PM2.5 air pollution for all non-commuting cycling activities and for all commuting cycling activities by

$$\mathrm{Mean}_{NON}(PM_k) = \frac{\sum_{i \in S} \mathrm{Num}_{NON}(i)\, PM_k(i)}{\sum_{i \in S} \mathrm{Num}_{NON}(i)}, \qquad \mathrm{Mean}_{COM}(PM_k) = \frac{\sum_{i \in S} \mathrm{Num}_{COM}(i)\, PM_k(i)}{\sum_{i \in S} \mathrm{Num}_{COM}(i)}, \qquad k \in \{10, 2.5\}$$

where i is a node and S is the set of nodes. Num_NON(i) and Num_COM(i) are the numbers of non-commuting activities and commuting activities in node i. PM10(i) and PM2.5(i) are the PM10 and PM2.5 values in node i. Moreover, the percentages of 'high exposure' activities for non-commuting and commuting cycling activities are calculated by

$$\mathrm{Per}_{NON} = \frac{\sum_{j \in H_{2.5}} \mathrm{Num}_{NON}(j)}{\sum_{i \in S} \mathrm{Num}_{NON}(i)} \times 100\%, \qquad \mathrm{Per}_{COM} = \frac{\sum_{j \in H_{2.5}} \mathrm{Num}_{COM}(j)}{\sum_{i \in S} \mathrm{Num}_{COM}(i)} \times 100\%$$

where $H_{2.5}$ is a subset of S consisting of the nodes that are located in areas (PM grids) with 'high' PM2.5 levels (annual PM2.5 > the WHO guideline: 10 µg/m³), and j is a node in $H_{2.5}$.
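The two comparisons described above amount to an activity-weighted mean of node-level PM values and an activity-weighted share of nodes in 'high' PM2.5 grids. A minimal sketch follows, assuming parallel lists of per-node activity counts and per-node PM values; the function names and the toy numbers in the usage note are illustrative, not results from the study.

```python
def mean_exposure(counts, pm):
    """Activity-weighted mean of node-level PM values.

    `counts[i]` is the number of activities of one purpose at node i and
    `pm[i]` is the annual mean PM concentration of the grid containing node i.
    """
    total = sum(counts)
    return sum(c * p for c, p in zip(counts, pm)) / total


def pct_high_exposure(counts, pm25, guideline=10.0):
    """Percentage of activities at nodes whose PM2.5 is above the guideline."""
    total = sum(counts)
    high = sum(c for c, p in zip(counts, pm25) if p > guideline)
    return 100.0 * high / total
```

With three hypothetical nodes at PM2.5 levels of 11, 9 and 8 µg/m³, non-commuting counts of 10, 30 and 60, and commuting counts of 60, 30 and 10, the mean exposure is lower for non-commuting activities, and 10% versus 60% of activities count as 'high exposure'.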

Estimation of Non-Commuting Cycling Activities
In this paper, we try to estimate the number of non-commuting cycling activities at the node level. This means that (1) the dependent variable is the number of non-commuting cycling activities in a node; and (2) the independent variables are the number of all-purpose cycling activities in the node and the locational characteristics of the node. Earlier studies investigate the impacts of environmental characteristics, including land use mix, cycling facilities, volume or mix of motor vehicles, green space and water, on cycling behavior [57][58][59][60][61]. Table 1 lists the independent variables used in the estimation of non-commuting cycling activities. We select the locational characteristics used as independent variables according to the environmental characteristics from earlier studies and data availability. Specifically, Dis_to_Greenspace and Dis_to_Waterbody, representing the distance from a node to its nearest green space and to its nearest water body, are used to measure the presence or proximity of green space and water. Dis_to_Citycentre, representing the distance from a node to the city center, is used to reflect levels of land use mix, assuming that the closer a node is to the city center, the higher the level of land use mix is likely to be. Num_nearest_busstops, representing the number of bus stops within 100 m of a node, is used to reflect the volume or mix of motor vehicles, assuming that the more bus stops there are nearby, the higher the volume or mix of motor vehicles is likely to be. In addition to ordinary least squares (OLS), the most widely used model, three other regression models, namely multilayer perceptron neural network (MLP), support vector machine (SVM) and random forest (RF), are used to estimate non-commuting cycling activities, in case the independent variables are not linearly related to the dependent variable. We use a cross-validation method to evaluate the performance of the four models.
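The evaluation idea can be sketched with a small k-fold cross-validation harness. Only OLS and a mean-only baseline are implemented here, in plain Python; the MLP, SVM and RF models named above are not reproduced. The use of RMSE as the score, the number of folds, and every function name are assumptions for illustration, since the text does not specify them.

```python
import random


def fit_ols(X, y):
    """Fit OLS with an intercept by solving the normal equations.

    Uses Gauss-Jordan elimination with partial pivoting; it does not guard
    against collinear predictors. Returns a function mapping a row to a prediction.
    """
    rows = [[1.0] + list(r) for r in X]
    k = len(rows[0])
    # Augmented matrix [X'X | X'y].
    A = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * t for r, t in zip(rows, y))]
        for i in range(k)
    ]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        p = A[col][col]
        A[col] = [v / p for v in A[col]]
        for r in range(k):
            if r != col:
                f = A[r][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    beta = [A[i][k] for i in range(k)]
    return lambda row: beta[0] + sum(b * v for b, v in zip(beta[1:], row))


def fit_mean(X, y):
    """Baseline that always predicts the training mean of y."""
    m = sum(y) / len(y)
    return lambda row: m


def k_fold_rmse(X, y, fit, k=5, seed=0):
    """Root mean squared error over k held-out folds."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::k] for f in range(k)]
    sq_err = 0.0
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        model = fit([X[i] for i in train], [y[i] for i in train])
        sq_err += sum((model(X[i]) - y[i]) ** 2 for i in fold)
    return (sq_err / len(X)) ** 0.5
```

Any other model can be compared on the same folds by passing a different `fit` function that returns a prediction callable, which keeps the comparison between models like for like.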

Results and Discussion
This section presents the empirical results for the study area and discusses them.

Spatial Patterns of Non-Commuting Cycling Activities
In this section, we investigate the spatial patterns of non-commuting cycling activities in Glasgow. In the input of the AMOEBA algorithm, an observation is the non-commuting rate of an area (census output area) (see Equations (1)-(3)). As the census output area is the area unit, the entire study region (Glasgow) consists of 5486 areas (census output areas) (see Figure 3). In this paper, AMOEBA is run using ClusterPy, a Python library of spatially constrained clustering algorithms [62].
In the output of AMOEBA, there are 'solution values' representing clusters of high values or low values. Specifically, areas with positive 'solution values' belong to high value clusters; areas with negative 'solution values' belong to low value clusters; and areas with 'solution values' of zero are those outside the clusters. Table 2 shows how we group areas by 'solution value' into three cluster types: cluster of high value, cluster of low value and outside of cluster. In this empirical study, the value is the non-commuting rate of an area. As a result, we map clusters of high and low non-commuting rate in Glasgow (see Figure 5), which reveals that clusters of high non-commuting rate tend to be located in the outskirts of the city, away from the city center. This indicates that non-commuting cycling activities are more likely to be located in the outskirts of the city.

Table 2. Grouping of areas into cluster types by 'solution value'.

Solution value    Cluster type
≥1                Cluster of high value
0                 Outside of cluster
≤−1               Cluster of low value

Figure 5. Clusters of high and low non-commuting rate.
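The grouping of 'solution values' into cluster types can be sketched as a small function. The sketch assumes that the 'solution values' are integers, as the thresholds in Table 2 suggest; the area identifiers and values below are invented for illustration and are not results from the study.

```python
from collections import Counter


def cluster_type(solution_value):
    """Map an integer AMOEBA 'solution value' to one of three cluster types."""
    if solution_value >= 1:
        return "Cluster of high value"
    if solution_value <= -1:
        return "Cluster of low value"
    if solution_value == 0:
        return "Outside of cluster"
    # Non-integer values between -1 and 1 are not covered by the grouping.
    raise ValueError(f"unexpected solution value: {solution_value!r}")


# Hypothetical output: census output area id -> solution value.
solution_values = {"OA1": 3, "OA2": 0, "OA3": -2, "OA4": 1, "OA5": 0}

area_clusters = {area: cluster_type(v) for area, v in solution_values.items()}
cluster_counts = Counter(area_clusters.values())
print(cluster_counts)
```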

Comparison of Air Pollution Exposure by Cycling Purpose
Viewing Figures 4 and 5 in combination, we can infer that, at the census output area level, non-commuting cycling activities are less likely than commuting cycling activities to be in areas with relatively high levels of PM. Table 3 lists the percentages of areas in clusters of high and low non-commuting rate at different levels of PM10 and PM2.5. Of the clusters of high non-commuting rate, 80% are located within grids with a relatively low PM10 level (e.g., below 12 µg/m³), compared with only 30% of the clusters of low non-commuting rate; conversely, 70% of the clusters of low non-commuting rate and only 20% of the clusters of high non-commuting rate are located within grids with a relatively high PM10 level (e.g., 12 µg/m³ and above). Similarly, 96% of the clusters of high non-commuting rate and only 58% of the clusters of low non-commuting rate are located within grids with a relatively low PM2.5 level (e.g., below 9 µg/m³), whilst 42% of the clusters of low non-commuting rate and only 4% of the clusters of high non-commuting rate are located within grids with a relatively high PM2.5 level (e.g., 9 µg/m³ and above).

Table 3. Percentages of areas of clusters of high and low non-commuting rate with different levels of PM10 and PM2.5. Columns: Cluster of High Value; Outside of Cluster; Cluster of Low Value.

Here we quantitatively investigate, at the node level (the intersection level), whether cyclists riding for recreation and other purposes are more likely to be exposed to lower levels of PM (PM10 and PM2.5) than cyclists riding for commuting purposes. Firstly, the means of instantaneous exposure to PM10 and PM2.5 for non-commuting and commuting cycling activities are calculated by Equations (5)-(8). Table 4 lists the means of instantaneous exposure to PM10 and PM2.5 for non-commuting and commuting cycling activities. The results indicate that the means of instantaneous exposure to PM10 and PM2.5 for non-commuting cycling activities are smaller than those for commuting cycling activities.

We use the Wilcoxon test to test whether exposure in one group is significantly larger or smaller than in the other. The Wilcoxon test is used as an alternative to the t-test when the data cannot be assumed to be normally distributed. Table 4 also lists the results of the Wilcoxon test. The p-values corresponding to PM10 and PM2.5 are all less than 0.001. This indicates that the means of instantaneous exposure to PM10 and PM2.5 for non-commuting cycling activities are both statistically significantly smaller than those for commuting cycling activities at the 0.001 level. Spatially speaking, then, cyclists riding for non-commuting purposes tend to be exposed to lower levels of instantaneous PM10 and PM2.5 air pollution than cyclists riding for commuting purposes.

Moreover, we calculate and compare the percentages of 'high exposure' activities for non-commuting and commuting cycling activities by Equations (9) and (10). Table 5 lists these percentages. Per COM 2.5 (the percentage of 'high exposure' commuting activities for PM2.5) is two times larger than Per NON 2.5 (the corresponding percentage for non-commuting activities), indicating that cyclists riding for commuting purposes are more likely to pass through areas of 'high' PM2.5 levels than cyclists riding for non-commuting purposes.
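The two node-level comparisons can be sketched as follows. Because Equations (5)-(10) are not reproduced in this section, the sketch rests on two assumptions: the mean instantaneous exposure is taken as the average of node-level PM concentrations weighted by the number of activities at each node, and a 'high exposure' activity is one recorded at a node whose concentration is at or above a chosen threshold. The threshold of 9 µg/m³ and all data values are illustrative only.

```python
def mean_exposure(counts, concentrations):
    """Activity-weighted mean PM concentration over nodes (assumed form)."""
    total = sum(counts)
    if total == 0:
        raise ValueError("no activities recorded")
    return sum(n * c for n, c in zip(counts, concentrations)) / total


def high_exposure_share(counts, concentrations, threshold):
    """Percentage of activities at nodes with concentration >= threshold."""
    total = sum(counts)
    if total == 0:
        raise ValueError("no activities recorded")
    high = sum(n for n, c in zip(counts, concentrations) if c >= threshold)
    return 100.0 * high / total


# Toy data for three nodes: PM2.5 in µg/m³ and activity counts by purpose.
pm25 = [7.0, 8.0, 10.0]
commuting = [10, 20, 70]
non_commuting = [50, 30, 20]

mean_com = mean_exposure(commuting, pm25)
mean_non = mean_exposure(non_commuting, pm25)
share_com = high_exposure_share(commuting, pm25, threshold=9.0)
share_non = high_exposure_share(non_commuting, pm25, threshold=9.0)
print(mean_com, mean_non, share_com, share_non)
```

In this toy example the commuting activities are concentrated at the most polluted node, so both the weighted mean and the 'high exposure' share are larger for commuting than for non-commuting activities.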
In summary, the empirical results reveal that, spatially speaking, cyclists riding for recreation and other purposes are more likely to be exposed to lower levels of PM than cyclists riding for commuting purposes: the means of instantaneous exposure to PM10 and PM2.5 for non-commuting cycling activities are smaller than those for commuting cycling activities. In addition to encouraging commuters to cycle on working days, encouraging people to cycle for recreation on non-working days and holidays may contribute to urban sustainability. As recreational cyclists are more likely to be in the outskirts of cities, policymakers might consider how to improve cycling infrastructure and road safety in those areas when designing or changing urban infrastructure.

Estimation of Non-Commuting Cycling Activities
The 50,057 nodes used in the spatial analysis also constitute the data set used in the estimation of non-commuting cycling activities. We further calculate the locational characteristics for each node (see Table 1 in Section 2). An exploratory analysis is made to check whether each independent variable is linearly correlated with the dependent variable. Figure 6 shows the scatterplots of each independent variable against the dependent variable. Apart from Num_cycling, the independent variables are only weakly linearly correlated with the dependent variable: the absolute values of the corresponding Pearson correlation coefficients are less than 0.1. Therefore, in addition to OLS, three other models are also used to estimate non-commuting cycling activities.
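The exploratory check described above can be sketched with a plain implementation of the Pearson correlation coefficient. The 0.1 cutoff comes from the text; the helper name weakly_correlated and the toy values are hypothetical and chosen only to show one strongly and one weakly correlated variable.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


def weakly_correlated(columns, target, cutoff=0.1):
    """Names of variables whose |r| with the target is below the cutoff."""
    return [name for name, values in columns.items()
            if abs(pearson_r(values, target)) < cutoff]


# Toy node-level data, invented for illustration.
num_non_commuting = [1, 2, 3, 4, 5]
columns = {
    "Num_cycling": [2, 4, 6, 8, 10],
    "Dis_to_Citycentre": [500, 300, 100, 300, 500],
}
weak = weakly_correlated(columns, num_non_commuting)
print(weak)
```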
We run a 10-fold cross validation to measure the performance of the four estimation models. The correlation between the predicted and actual numbers of non-commuting activities is used to measure estimation performance. Random forest (RF) outperforms the other algorithms, with a correlation coefficient of 0.981 (see Table 6). Figure 7 plots the predicted and actual numbers of non-commuting activities. The results indicate that the estimation of the number of non-commuting cycling activities is fairly good in this study. This suggests that we may be able to estimate the number of non-commuting cycling trips when the trip purpose of cycling data is unknown.
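The evaluation procedure can be sketched as follows: each node is held out exactly once, a model fitted on the remaining folds predicts it, and the out-of-fold predictions are correlated with the actual values. To stay self-contained, the sketch uses a one-variable OLS model as a stand-in for the four models in the paper; the shuffling with a fixed seed, the function names and the synthetic data are all assumptions made for illustration.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def fit_simple_ols(xs, ys):
    """Fit y = a + b*x by least squares and return a prediction function."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    return lambda x: intercept + slope * x


def cross_validated_predictions(xs, ys, fit, k=10, seed=0):
    """Return out-of-fold predictions from k-fold cross validation."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    predictions = [None] * len(xs)
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        for i in held_out:
            predictions[i] = model(xs[i])
    return predictions


# Synthetic data: non-commuting counts roughly proportional to Num_cycling.
num_cycling = list(range(50))
num_non_commuting = [0.4 * x + ((7 * x) % 5 - 2) for x in num_cycling]

predicted = cross_validated_predictions(num_cycling, num_non_commuting,
                                        fit_simple_ols, k=10)
cv_correlation = pearson_r(predicted, num_non_commuting)
print(round(cv_correlation, 3))
```

Any of the four models could be substituted for fit_simple_ols, provided it takes training inputs and targets and returns a prediction function.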

Conclusions
In this study, we investigate spatial patterns of cycling activities and associations between cycling purpose and air pollution exposure in Glasgow, UK, using Strava Metro data. The empirical results show that (1) compared with commuting cycling activities, non-commuting cycling activities are more likely to be located in the outskirts of the city; (2) spatially speaking, cyclists riding for recreation and other purposes are more likely to be exposed to relatively low levels of air pollution than cyclists riding for commuting; and (3) the method for estimating the number of non-commuting cycling activities works well in this study. The results suggest that (1) policymakers might consider how to improve cycling infrastructure and road safety in the outskirts of cities; and (2) we may be able to estimate the number of non-commuting cycling activities when the trip purpose of cycling data is unknown. We conclude that this study is a useful first step in using crowdsourced cycling data for studies of cycling and air pollution exposure.

Limitations
This paper has several limitations. First, the census output area is used as the areal unit in identifying clusters, so the modifiable areal unit problem (MAUP) might influence the cluster identification in this study. Second, although the estimation of non-commuting cycling activities is good at the node level, its performance at the edge (street) level is unknown. Third, although the models work well in estimating non-commuting cycling activities, there is potential to improve the estimation. Ideally, we could do so by incorporating more attributes, such as land use mix, residential density, traffic count, road type and road width, into the models. We are not able to include those attributes due to present data availability. Fourth, instantaneous assessment of air pollution is used in this study, whereas cumulative assessment is more appropriate for studies of the health effects of active travel. Ideally, the inhaled dose of air pollution during a cycling trip should be assessed according to not only where the trip takes place but also the time spent travelling. Furthermore, the long-term air pollution exposure of a cyclist should also take account of the number of his or her commuting and non-commuting trips within a longer period (e.g., one year or more). As Strava Metro does not offer individual-level trips, we know neither how long a commuting or non-commuting cycling trip takes nor how many commuting or non-commuting cycling trips each cyclist takes in one year. Thus, we are not able to assess the cumulative exposure of a cyclist to air pollution. Finally, since volunteered geographic information (VGI), user-generated content (UGC) and crowdsourced data are collected and shared by individuals, there are arguments about the quality and fitness for use of such data in projects [63]. While we are aware of this issue, we do not tackle it in this study, as it requires a separate study.

Future Works
In the future, we will address several aspects to enhance this study. First, as the Strava database offered by the Urban Big Data Centre provides the number of cycling activities in distinct daily time slots, we could model spatio-temporal variations in non-commuting cycling activities according to spatio-temporal characteristics. Second, we will incorporate other air pollutants, such as sulphur dioxide, nitrogen dioxide and ozone, into the analysis. In addition, although Strava Metro does not offer individual-level trips, we may be able to assess cumulative air pollution exposure by using the sub data set 'Streets' in Strava Metro. Based on the number of trips on each street segment, we could use the length of the street segment to represent the length of a sub cycling trip, and then estimate the duration of the sub cycling trip from the average cycling speed for commuting or for recreation and other purposes in Glasgow, a figure we could possibly obtain from Strava Metro or other data sources (e.g., travel surveys). Accordingly, we could estimate the total air pollution exposure of all of Glasgow's Strava cyclists when riding for commuting or for recreation and other purposes during one year, and further estimate the annual average air pollution exposure of one cyclist when riding for each purpose. This would represent a large amount of future work, with the potential to gain a much greater understanding of the impact of city air pollution.
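The proposed cumulative assessment can be sketched under stated assumptions. The sketch takes the exposure on one traversal of a street segment as the PM concentration multiplied by the travel time (segment length divided by an average cycling speed), sums this over all recorded trips, and divides by the number of cyclists. The Segment fields, the speed of 15 km/h, the number of cyclists and all data values are hypothetical; the result is a concentration-time product in µg·h/m³, not an inhaled dose, which would additionally require a ventilation rate.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    trips: int         # number of trips recorded on the segment in one year
    length_km: float   # segment length in kilometres
    pm_ug_m3: float    # PM concentration assigned to the segment, µg/m³


def total_exposure(segments, speed_kmh):
    """Sum of concentration x travel time over all trips, in µg·h/m³."""
    if speed_kmh <= 0:
        raise ValueError("speed must be positive")
    return sum(s.trips * s.pm_ug_m3 * (s.length_km / speed_kmh)
               for s in segments)


def average_exposure_per_cyclist(segments, speed_kmh, n_cyclists):
    """Total exposure divided by the number of cyclists."""
    if n_cyclists <= 0:
        raise ValueError("number of cyclists must be positive")
    return total_exposure(segments, speed_kmh) / n_cyclists


# Toy street segments for one trip purpose, invented for illustration.
segments = [
    Segment(trips=1000, length_km=0.5, pm_ug_m3=12.0),
    Segment(trips=400, length_km=1.5, pm_ug_m3=8.0),
]
total = total_exposure(segments, speed_kmh=15.0)
per_cyclist = average_exposure_per_cyclist(segments, speed_kmh=15.0,
                                           n_cyclists=60)
print(total, per_cyclist)
```

Running the same calculation separately for commuting and non-commuting trips, each with its own average speed, would give the purpose-specific comparison described above.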