#### 3.1. The Definitions

The key idea of the SLCM is finding the centerlines via sufficient neighboring lines from all OD lines. The following definitions and parameters were used in this method.

Definition: centerline and its neighboring lines. Let L be the database of lines, O be the origin points, D be the destination points, and Li be the line connecting the ith points of O and D. If Lj, a line connecting Oj and Dj, falls within the searching radius of both Oi and Di, then Lj is defined as a neighboring line of the centerline Li (Figure 1). A centerline can have more than one neighboring line, and the number of its neighboring lines, denoted by Nls(Li), is defined as:

$$\mathrm{Nls}(L_i) = \left|\{\, L_j \in L : \mathrm{dist}(O_i, O_j) \le D_r \ \text{and} \ \mathrm{dist}(D_i, D_j) \le D_r \,\}\right| \qquad (1)$$

where Dr represents the searching radius, and dist(Oi, Oj) and dist(Di, Dj) represent the Euclidean distances between the corresponding endpoints of the two lines.
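The definition can be illustrated with a minimal sketch. It assumes planar (projected) coordinates, that a line counts as its own neighbor (so Nls is at least 1), and that the radius test is inclusive; the function name `neighbor_counts` and the toy coordinates are hypothetical.

```python
import numpy as np

def neighbor_counts(origins, dests, dr):
    """Nls for every OD line: how many lines have both endpoints within dr.

    Assumption: each line is counted as a neighbor of itself.
    """
    origins = np.asarray(origins, dtype=float)
    dests = np.asarray(dests, dtype=float)
    # Pairwise Euclidean distances between origins and between destinations.
    d_o = np.linalg.norm(origins[:, None, :] - origins[None, :, :], axis=2)
    d_d = np.linalg.norm(dests[:, None, :] - dests[None, :, :], axis=2)
    # Lj neighbors Li only if BOTH endpoint distances are within dr.
    is_neighbor = (d_o <= dr) & (d_d <= dr)
    return is_neighbor.sum(axis=1)

# Toy data: three nearly parallel lines plus one unrelated line.
origins = [(0, 0), (1, 0), (0, 1), (50, 50)]
dests = [(100, 0), (101, 1), (99, 0), (50, 150)]
counts = neighbor_counts(origins, dests, dr=5.0)
```

The pairwise distance matrices need memory quadratic in the number of lines, so a spatial index would likely be preferable for large OD datasets.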

Parameter 1: searching radius Dr. Since the key idea of the SLCM is to find similar lines from the OD database within a given radius, the first parameter that needs to be defined is the searching radius. This parameter directly determines the shape of the clustering results. In general, the larger the searching radius, the more neighboring lines can be found. However, there is a special case that needs attention. When line Lj is too short (e.g., shorter than twice Dr), meeting the definition above does not guarantee spatial resemblance, because Lj could point in a different or even the opposite direction from Li (i.e., the green lines in Figure 2).

Therefore, we added a length limitation for Li to qualify as a centerline: the length of Li must be greater than 2Dr/sin 45° (≈2.83Dr), which ensures that the neighboring lines are not only geographically close to Li but also aligned with its direction. This threshold guarantees an angle of less than 45° between a centerline and its neighboring line, and that their lengths differ by no more than 2Dr. Notice that this definition excludes all OD lines shorter than 0.83Dr from the clustering analysis. A threshold angle smaller than 45° can be used if stricter results are desired or the direction is of greater concern. This definition is illustrated in Figure 3.

Parameter 2: the length limitation of the centerline, Lm, which is derived from the searching radius and is approximately equal to 2.83Dr. Li can be used as a centerline to calculate the neighboring lines in the following step only if Li is longer than 2.83Dr; otherwise, Nls(Li) is set to 1. As a consequence, lines shorter than 0.83Dr cannot participate in the final clustering, so a large search radius results in more information loss.
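As a small worked example of the length limitation, the sketch below computes Lm = 2Dr/sin 45° and checks whether a line may serve as a centerline. Exposing the angle as a parameter (`max_angle_deg`) is an assumption that extends the 45° case in the same form; both function names are hypothetical.

```python
import math

def min_centerline_length(dr, max_angle_deg=45.0):
    """Length limit Lm = 2*Dr / sin(angle); about 2.83*Dr at 45 degrees."""
    return 2.0 * dr / math.sin(math.radians(max_angle_deg))

def eligible_as_centerline(origin, dest, dr, max_angle_deg=45.0):
    """True if the OD line is strictly longer than the length limit Lm."""
    length = math.dist(origin, dest)  # Euclidean length of the line
    return length > min_centerline_length(dr, max_angle_deg)

lm = min_centerline_length(500.0)                               # roughly 1414 m
long_ok = eligible_as_centerline((0, 0), (1500, 0), dr=500.0)   # longer than Lm
short_ok = eligible_as_centerline((0, 0), (1000, 0), dr=500.0)  # shorter than Lm
```

With Dr = 500 m, only lines longer than roughly 1.41 km can act as centerlines, which shows how quickly a large radius removes short trips from consideration.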

Parameter 3: the minimum number of neighboring lines, Minlines. In general, a centerline is considered more representative if it has more neighboring lines. This parameter excludes every centerline whose number of neighboring lines is below the threshold Minlines, since such centerlines are not considered representative in the cluster analysis.

In summary, the principle of the SLCM is not complicated, and the method can be applied with only two pre-defined parameters: the searching radius (Dr) and the minimum number of neighboring lines (Minlines). Notice that (1) a large Dr may lead to information loss, because OD lines shorter than 0.83Dr are excluded from the clustering analysis, and (2) the definition of Minlines by itself lacks statistical meaning. Choosing values of the two parameters that avoid these limitations is critical, and the following strategy is proposed.

#### 3.2. Determining the Parameters

In general, with sufficient knowledge of the OD data source and a clear study objective, Dr and Minlines can be specified directly and subjectively. For example, if we want to find the strongest spatial connections in the job-house OD data to help design a public bus line and stations with an impact area of 500 m, we can set Dr to 500 m and Minlines to the minimum demand of the bus line. However, experience-based parameters may not always be optimal. Therefore, in the absence of prior knowledge, we recommend the following method, which is based on entropy theory and probability distributions.

In information theory, entropy measures the amount of uncertainty of an event associated with a given probability distribution [27]. If all the outcomes are equally likely, the entropy is maximized. In the worst clustering scenarios, Nls tends to be uniform: for a Dr that is too small, Nls becomes 1 for almost all lines, and for a Dr that is too large, it becomes the total number of lines for almost all lines. In both cases, the entropy is maximized. In contrast, in a good clustering scenario, the distribution of Nls tends to be skewed and the entropy is lower. Thus, we use the entropy definition in Formulas (2) and (3) and find the optimal value of Dr that minimizes H(L).

In practice, Dr starts from a small value and is gradually increased, and the entropy of all Nls is calculated at each step. This heuristic method provides a reasonable range in which the optimal value is likely to reside.
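The radius scan can be sketched as follows. Because the entropy formulas are not reproduced in this text, the sketch assumes one plausible reading: each line's share p_i = Nls(Li) / ΣNls(Lj) is treated as a probability and H = -Σ p_i ln p_i, so that uniform Nls values give the maximum entropy, as described above. The function names and the toy Nls values are hypothetical.

```python
import numpy as np

def nls_entropy(nls):
    """Shannon entropy of the normalized Nls values.

    Assumption: p_i = Nls(L_i) / sum_j Nls(L_j); Nls >= 1, so p_i > 0.
    """
    nls = np.asarray(nls, dtype=float)
    p = nls / nls.sum()
    return float(-(p * np.log(p)).sum())

def best_radius(nls_by_radius):
    """Return the radius whose Nls values give the lowest entropy."""
    entropies = {dr: nls_entropy(nls) for dr, nls in nls_by_radius.items()}
    return min(entropies, key=entropies.get), entropies

# Toy Nls values for four lines at three candidate radii (illustrative only).
nls_by_radius = {
    100.0: [1, 1, 1, 1],   # radius too small: every line is isolated
    300.0: [3, 3, 3, 1],   # intermediate radius: skewed counts
    1000.0: [4, 4, 4, 4],  # radius too large: every line neighbors all others
}
dr_opt, entropies = best_radius(nls_by_radius)
```

In this toy case both extreme radii reach the maximum entropy ln 4, and the intermediate radius is selected.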

Another parameter, Minlines, is calculated from probability distribution functions in order to ensure its statistical significance. We first test whether Nls follows a normal distribution, which allows spatial clusters to be extracted with higher confidence. As shown in Figure 4, any data with a distribution similar to the normal distribution can be transformed into a standard normal distribution with z-scores and p-values [28]. A z-score expresses the distance of a value from the mean in units of standard deviation. The p-value is the probability that the observed spatial pattern was created by a random process. A small p-value means it is very unlikely that the observed spatial pattern is the result of a random process. With p < 0.01 as the significance level, Minlines can be calculated as:

where SD is the standard deviation.
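As a hedged illustration of a z-score-based threshold, the sketch below assumes the form mean + z × SD, with z taken as the one-sided standard normal critical value for the chosen p. This form, the one-sided choice, and the use of the sample standard deviation are assumptions made for illustration and may differ from the formula used in the paper; `minlines_normal` is a hypothetical name.

```python
from statistics import NormalDist, mean, stdev

def minlines_normal(nls, p=0.01):
    """Threshold mean + z*SD under a normality assumption.

    Assumption: one-sided critical value, z = Phi^{-1}(1 - p), about 2.33
    for p = 0.01; a two-sided test would use about 2.58 instead.
    """
    z = NormalDist().inv_cdf(1.0 - p)
    return mean(nls) + z * stdev(nls)  # stdev() is the sample SD

threshold = minlines_normal([1, 2, 3, 4, 5], p=0.01)
```

Centerlines whose Nls exceeds this threshold would then be treated as significantly well supported under the normality assumption.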

However, studies have also shown that many human activities follow a power law distribution instead of a normal distribution. For example, the distributions of a wide variety of physical, biological, and man-made phenomena have been found to approximately follow a power law over a wide range of magnitudes [29]. Therefore, if the OD data do not follow a normal distribution, a power law distribution test is recommended. In general, most of the OD lines in space are scattered, and only a few are clustered together. Thus, we suggest testing the OD dataset against the Pareto (type-I) distribution and calculating the cumulative distribution probability (CDP) using Formula (5) [30]:

$$F(x) = \begin{cases} 1 - \left(\dfrac{x_m}{x}\right)^{\alpha}, & x \ge x_m \\ 0, & x < x_m \end{cases} \qquad (5)$$

where x_{m} represents the minimum possible value of x, and α is a positive parameter which can be obtained from the power law model:

$$y = c\,x^{-\alpha} \qquad (6)$$

where c and α are regression coefficients. The Pareto type-I distribution is characterized by the scale parameter x_{m} and the shape parameter α, which is known as the tail index (Figure 5).
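The two ingredients of this step can be sketched as below: the Pareto type-I cumulative distribution, and a least-squares fit of a power law in log-log space to obtain c and α. What is regressed (for example, which quantity plays the role of y) is not specified here, so the fit is written generically; the function names and toy values are hypothetical.

```python
import numpy as np

def pareto_cdf(x, x_m, alpha):
    """Pareto type-I CDF: 1 - (x_m / x)**alpha for x >= x_m, else 0 (x > 0)."""
    x = np.asarray(x, dtype=float)
    # Clipping the ratio at 1 makes the CDF exactly 0 below x_m.
    return 1.0 - np.minimum(1.0, x_m / x) ** alpha

def fit_power_law(x, y):
    """Fit y = c * x**(-alpha) by linear regression on (log x, log y)."""
    slope, intercept = np.polyfit(np.log(x), np.log(y), 1)
    return float(np.exp(intercept)), float(-slope)  # (c, alpha)

# Synthetic check: data generated from y = 3 * x**(-1.5) is recovered.
xs = np.array([1.0, 2.0, 4.0, 8.0])
ys = 3.0 * xs ** -1.5
c_hat, alpha_hat = fit_power_law(xs, ys)
half = float(pareto_cdf(2.0, x_m=1.0, alpha=1.0))
```

Log-log regression is simple but can be sensitive to sparse tail values; maximum-likelihood estimation is a common alternative for estimating α.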

To find the strong spatial connections that are statistically significant, we set Minlines to the Nls values at which the CDP reaches 95% and 99% (i.e., p < 0.05 and p < 0.01). In general, when the radius is very small, Nls would approach 1 in most cases, and α would be large. With increasing radius, the number of lines with an Nls of 1 decreases, and α decreases. However, when α is less than 1, the expected value of a random variable following the Pareto distribution is infinite: the tail is so heavy that the mean does not exist, and a threshold derived from the fitted distribution is no longer meaningful [32]. Therefore, when α is less than 1, Minlines is set to null, and no centerline will be considered.
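Inverting the Pareto type-I CDF gives the value at which a chosen CDP is reached, x = x_m (1 - q)^(-1/α). A minimal sketch, assuming this quantile is used directly as Minlines and that null is represented by `None`; the name `minlines_pareto` is hypothetical.

```python
def minlines_pareto(x_m, alpha, cdp=0.99):
    """Value at which the Pareto type-I CDF reaches `cdp`.

    Returns None when alpha < 1, mirroring the rule that Minlines is null
    and no centerline is considered in that case.
    """
    if alpha < 1.0:
        return None
    return x_m * (1.0 - cdp) ** (-1.0 / alpha)

m99 = minlines_pareto(x_m=1.0, alpha=2.0, cdp=0.99)  # close to 10
m95 = minlines_pareto(x_m=1.0, alpha=2.0, cdp=0.95)  # close to 4.47
m_null = minlines_pareto(x_m=1.0, alpha=0.8)
```

Since Nls is an integer count, the returned value would typically be rounded up before being compared with Nls.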

Of course, other distribution tests can be performed. The key is to find a suitable probability distribution that helps extract spatial clusters with a high significance level.

#### 3.3. Clustering Process Flowchart

Overall, our classification method is simple in principle and the parameter initialization is adaptable. Once the optimal values of Dr and Minlines are set, either subjectively or derived from the data, the clustering process can be implemented following the flowchart in Figure 6.

However, this simple clustering method still cannot solve all the problems mentioned above. Lines shorter than the specified value (i.e., 0.83Dr) are excluded from clustering, which results in a loss of information. Therefore, we propose a more flexible clustering procedure to avoid this drawback. This more complex version of the SLCM clustering process consists of two parts: determining the parameters forward and searching the clusters backward, as shown in Figure 7. In this version, the optimal search radius is not used as the only search radius but as the maximum search radius in the following step. Clustering backward means that clustering starts with the optimal search radius and ends with the minimum radius. With the optimal radius, we first search for the centerline with the maximum Nls. If the maximum Nls is greater than Minlines, we extract the centerline and its neighboring lines into the cluster file and mark them. Nls is then recalculated for the remaining lines to determine the next centerline that satisfies the conditions. When no qualifying centerline remains, we move on to the next smaller search radius and repeat this clustering process until the minimum search radius is reached. This backward clustering process preserves spatially connected clusters with as much significance as possible under different search radii.
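The backward search can be sketched as follows. This is an illustrative reading of the procedure rather than a reference implementation: it assumes planar coordinates, a single Minlines value shared by all radii, the 45° length limit from Section 3.1, and that a line counts as its own neighbor. The function name and toy data are hypothetical.

```python
import numpy as np

def backward_clustering(origins, dests, radii, minlines):
    """Greedy cluster extraction from the largest radius down to the smallest."""
    origins = np.asarray(origins, dtype=float)
    dests = np.asarray(dests, dtype=float)
    lengths = np.linalg.norm(dests - origins, axis=1)
    unassigned = np.ones(len(origins), dtype=bool)
    clusters = []
    for dr in sorted(radii, reverse=True):
        min_len = 2.0 * dr / np.sin(np.radians(45.0))  # length limit Lm
        while unassigned.any():
            idx = np.flatnonzero(unassigned)
            # Recompute neighbor relations among the remaining lines only.
            d_o = np.linalg.norm(origins[idx, None] - origins[None, idx], axis=2)
            d_d = np.linalg.norm(dests[idx, None] - dests[None, idx], axis=2)
            neighbor = (d_o <= dr) & (d_d <= dr)
            nls = neighbor.sum(axis=1)
            nls[lengths[idx] <= min_len] = 1  # too short to act as a centerline
            best = int(np.argmax(nls))
            if nls[best] <= minlines:
                break  # no qualifying centerline left at this radius
            members = idx[neighbor[best]]
            clusters.append({"radius": dr,
                             "centerline": int(idx[best]),
                             "members": members.tolist()})
            unassigned[members] = False  # mark the extracted lines
    return clusters

origins = [(0, 0), (1, 0), (0, 1), (50, 50)]
dests = [(100, 0), (101, 1), (99, 0), (50, 150)]
clusters = backward_clustering(origins, dests, radii=[2.0, 5.0], minlines=2)
```

In the toy data, the three nearly parallel lines are extracted as one cluster at the largest radius, and the isolated line is left unclustered because its Nls never exceeds Minlines.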