Identification and Analysis of Weather-Sensitive Roads Based on Smartphone Sensor Data: A Case Study in Jakarta

Weather changes such as rain are a crucial cause of traffic congestion, especially in metropolises with limited sewer system infrastructure. Identifying the roads that are sensitive to weather changes, defined as weather-sensitive roads (WSR), can facilitate infrastructure development. In the literature, little research has focused on weather factors in developing countries, which might have deficient infrastructure. To fill this gap, this research studied real-world data from Jakarta, Indonesia, to identify WSR based on smartphone sensor data, real-time weather information, and road characteristics datasets. A spatial-temporal congestion speed matrix (STC) was proposed to illustrate traffic speed changes over time. Under the proposed STC, a sequential clustering and classification framework was applied to identify the WSR in terms of traffic speed. The causes of WSR were evaluated based on the variable importance of the classification method. The experimental results show that the proposed method can cluster the roads according to the pattern changes in traffic speed caused by weather change. Based on the results, we found that the distances to shopping malls, mosques, and schools, as well as the roads' altitude, length, width, and number of lanes, are highly correlated with WSR in Jakarta.


Introduction
Weather is a major cause of traffic congestion, especially in metropolises of developing countries. Because of limited infrastructure, such as flawed mass transport, deficient sewer systems, and relatively narrow roads, traffic in developing countries such as Vietnam or Indonesia [1][2][3] is vulnerable when it rains. Meanwhile, the astronomical economic loss caused by congestion highlights the urgency of solutions. For example, traffic congestion leads to annual losses of US$5 billion in Jakarta [4], US$11.4 billion in Dhaka [5], and US$18 billion in Metro Manila [6]. These losses underline the importance of solving traffic congestion in the metropolises of developing countries.
Without a doubt, upgrading the infrastructure to relieve traffic congestion is costly and time-consuming. In the literature, researchers have proposed alternative, short-term solutions to inform drivers about traffic congestion, such as understanding the traffic patterns under different weather conditions [7][8][9]. In intelligent transport system (ITS) research, traffic prediction models try to predict future traffic congestion in terms of location and time. For example, in Taipei City, Taiwan, many traffic bulletin boards were installed on major roads to announce real-time information regarding the roads' traffic situation and prediction under different weather conditions. Based on the real-time announcement, drivers can choose different routes when necessary to avoid traffic congestion [10]. Other ITS applications also include adaptive street lighting control [11] [...] the traffic flow using camera devices installed on Strinella Street in L'Aquila [...]. In this research [11], the traffic prediction model was used to provide information [...] traffic congestion and provide a more efficient energy consumption of traffic lighting [...].
In the ITS research domain, researchers assessed the impact of weather on vehicle speed drop in major metropolises of developed countries which have relatively well-developed infrastructures, such as Paris [12], Chicago [13], London [14], Beijing [15], Shenzhen [16], and Seoul [17]. However, in the literature, few studies have been reported for cities with poor or limited infrastructures, especially in developing countries.
This paper aims to fill the research gap in the traffic study by investigating how weather change affects the traffic condition of Jakarta, Indonesia's largest metropolis, given its limited sewage and mass transportation infrastructure. In 201[...], Jakarta's population reached 11 million, with a population density of 16,882 people/km² [...]. According to the Traffic Index ranking by TomTom®, Jakarta was ranked 10th in [...] worst traffic in 2019 [19]. With the unbalanced annual growth of 8% and 0.01% in cars and road lengths, respectively, Jakarta city expects to face heavier traffic in [...]. On account of the severe traffic condition, the Jakarta Smart City Department defines roads with an average speed of 10 km/h as traffic-free roads, which is equal to the jogging speed for most adults [20].
Figure 1 illustrates how weather change causes traffic congestion at Jalan P[...] Dua, Jakarta, Indonesia. Figure 1a,b show the traffic condition on dry and rainy days, respectively. This road is a representative example of many roads in Jakarta without adequate sewer systems. As shown in Figure 1a, the road has mild traffic congestion on a dry day. However, many puddles are generated when it rains, as shown in Figure 1b. These puddles will eventually develop into potholes and create slippery roads. Consequently, drivers tend to slow down to avoid a potential traffic accident, which ultimately causes traffic congestion on the road.
Figure 1. (a) On a dry day; (b) on a rainy day. To avoid bias, both photos were taken at a similar time. Photo (a) was taken on Thursday, 16 July 2020, at 2:30 p.m., while photo (b) was taken on Thursday, 13 August 2020, at 2:33 p.m., at exactly the same spot. No particular event was hosted nearby. The main difference between (a) and (b) is the weather condition.
In this work, real-world traffic speed data collected from citizens' smartphones in Jakarta and daily weather data, especially rainfall information, were used to study traffic speed. This paper also proposes the spatial-temporal congestion speed change matrix (STC) to identify the weather-sensitive roads (WSR) in Jakarta, whose traffic congestion worsens dramatically under weather changes such as rain. A data analysis framework based on sequential clustering and classification methods [21] was built on the proposed STC to identify and analyze the WSR. First, the K-means algorithm [22] was applied to cluster roads with similar STC patterns. Then, random forests (RF) [23] were used to train the classifier for identifying the STC level, which measures the level of the traffic speed drop over time due to weather change. The best number of clusters K was selected based on the Pareto front [24], and the root causes of WSR were determined by using the RF method.
The experimental results show that using the data analysis framework under the proposed STC can identify the WSR in Jakarta. The results also present the significant driving factors regarding WSR: the spatial information of the road (the distances to public areas, such as shopping malls, mosques, or schools), road altitude, length, width, and the number of lanes. By utilizing the proposed framework, this work demonstrates the capability of tracing or investigating traffic conditions caused by weather change.
The rest of this paper is structured as follows. Section 2 discusses the related literature. Section 3 delivers the methodology. Section 4 shows the experimental results, and Section 5 concludes this study.

Weather Impact on Traffic Congestion
When evaluating the effect of weather change on vehicles' speed drop, the popular method used in the literature is regression and trend analysis [7,15,25,26]. Camacho et al. [7] modeled the traffic speed of 15 major freeways in northern Spain using regression analysis based on several indicators: number of trucks, visual visibility, wind speed, precipitation intensity, and snow thickness. The results showed that the tested indicators were significantly affecting the traffic speed drop, mainly the precipitation intensity and the wind speed. Zhang et al. [15] applied the regression analysis to investigate the impact of rainfall on the traffic flow intensity in major expressways in Beijing, China. They found that different rainfall intensity affects traffic flow, as summarized in Table 1. The regression and analysis approaches were also used by Mitsakis et al. [25] and Stamos et al. [26].
Moreover, Billot et al. [12] proposed a multilevel assessment using microscopic, mesoscopic, and macroscopic approaches. The microscopic approach focused on understanding the drivers' behavior under adverse weather conditions. Based on the insight at the microscopic level, they observed the platooning effect at the mesoscopic level. The global view of the weather effect on traffic density and speed drop was carried out at the macroscopic level. Other researchers also applied the regression analysis method [8,14] and other statistical methods, such as the Gaussian mixture model [27], to study the weather impact on traffic flow or speed. Besides the statistical methods, researchers also applied a variety of machine learning methods for traffic studies, such as the k-means clustering method [28], neurowavelet models [29], long short-term memory networks [30], Bayesian network models [31], deep belief networks [32], and decision trees [9]. The work above demonstrates the promising performance of machine learning methods for the traffic problem.
Table 1 summarizes the relevant studies in the literature regarding the effect of rainy weather on vehicle speed reduction. As can be seen in Table 1, most of the literature studied traffic conditions in Western countries, such as the United States [13,33], France [12], Spain [7], Greece [25,26], Australia [31], and Sweden [34]. A few studies reported on the main metropolises in Asia, such as Hong Kong [8], Beijing [15], Shenzhen [16], and Seoul [17]. All of the mentioned works were conducted on modern cities in developed countries. Very few studies pay attention to the weather impact in developing countries, whose traffic conditions are more vulnerable due to poor infrastructure.
Additionally, in terms of the road type, most of the works in Table 1 studied high-speed vehicular traffic roads, such as freeways [7,12,33], highways [13,31], motorways [14], and expressways [15,16]. The speed of vehicles on high-speed vehicular traffic roads can be easily measured by loop detectors, road-side cameras, and on-board equipment. However, measuring vehicle speed in an urban area might need different technologies, such as wireless sensor networks. Although some researchers also focused on traffic studies on urban roads, all of those works considered road conditions in developed countries, such as South Korea [17], Sweden [33], the United States [35], Athens, Greece [25], and Thessaloniki, Greece [26]. In summary, each work mentioned above shows a different weather impact under different precipitation levels and road characteristics. For example, in Seoul, the study shows that weather change can reduce the traffic speed by up to 50% [17]. Other results of the studies above are summarized in Table 1.
Moreover, on urban roads in developing countries, where various transportation modes (cars, trucks, motorcycles, pedestrians, and even wild animals) can be present, measuring the roads' traffic speeds is extremely difficult. In this research, the traffic data collected from citizens' smartphones in Jakarta, Indonesia, was used to represent the traffic speed of a developing country's urban roads. The weather data associated with Jakarta City was also collected for this research work. More details regarding the collected dataset will be presented in the following sections.

Machine Learning Method in Traffic Studies
In the literature, researchers have applied the k-means clustering algorithm [22] widely, especially in the traffic field. Liu [36] implemented the k-means clustering method to obtain the optimal system design of sensors in a complex coordination control in ITS. The ITS problem studied in [36] combines the issues of coordinating the traffic signal lamp, balancing the traffic flow, and reducing the travel time. Pattanaik et al. [37] used the k-means clustering method to cluster the severity of congestion in a New Delhi study. They found that their methodology can segment the roads according to the congestion severity. Similar to Pattanaik et al.'s work [37], Hongsakham et al. [38] applied the k-means clustering method to cluster the congestion levels on a particular road section in Bangkok, Thailand. Motivated by the studies of Pattanaik et al. [37] and Hongsakham et al. [38], in this work, the k-means clustering method was used to cluster the congestion severity based on the traffic speed data collected from smartphones in Jakarta, Indonesia.
In traffic studies, the RF method has been widely utilized to find the contributing factors of traffic accidents. Essentially, the RF method combines many decision tree predictors, where each tree is built on a random subset of features (the feature bagging technique), to reduce the model variance without increasing the model bias at the same time [39]. Owing to this property, RF is a popular method for ranking the importance of predictive model variables. For instance, Wang et al. [40] identified that the weather and ramp geometry were the significant factors in crashes that happened on expressway ramps in Florida, USA. Similar to Wang et al.'s work, Lee et al. [41] used RF to investigate critical variables of traffic crashes. They found that the speed limit, collisions, and pavement condition were the significant factors influencing Florida's traffic crashes.
Motivated by the studies in the literature, this study utilized RF methods to investigate the important factors regarding the WSR.

Methodology
This research proposed the STC matrix to measure and visualize the severity of the traffic speed drop caused by weather change across different periods. Then, the WSR analyses can be carried out under the STC matrix. Section 3.1 explains the proposed STC calculation, and Section 3.2 presents the proposed framework for analyzing WSR.
Figure 2 illustrates the process of constructing the STC matrix. Essentially, two datasets are required: the traffic speed dataset as in Figure 2a and the weather dataset as in Figure 2b. Both datasets are summarized daily into an $N \times T$ matrix, where $N$ is the number of road IDs and $T$ is the number of time-window slots. The traffic speed and weather matrices contain the average traffic speed and precipitation at the corresponding day and time-window slots, respectively. Figure 2c,d illustrate how to aggregate the traffic speed on dry and rainy days on the same time window based on the summarized traffic speed and weather datasets. Without loss of generality, the number of aggregated matrices can be determined based on the variety of the collected weather data. Let $X_i$ be the aggregated traffic speed matrix with dimension $N \times T$ for the weather state $i$. If more weather states (dry, rainy, snowing, etc.) are defined, more aggregated traffic speed matrices $X_i$ can be obtained. In this work, in order to study the impact of dry and rainy seasons on traffic speed, two weather states, dry and rainy, were used ($i \in \{1, 2\}$). Equation (1) defines the elements of matrix $X_i$.
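The aggregation step can be sketched in a few lines of Python. Equation (1) is not reproduced in this text, so the sketch below assumes that each element of $X_i$ is the mean of the daily average speeds observed for that road and time slot under weather state $i$. The function name, the `rain_threshold` parameter, and the input layout (one $N \times T$ list of lists per day) are hypothetical choices for illustration only.

```python
def aggregate_speed_by_weather(daily_speed, daily_rain, rain_threshold=0.0):
    """Build the dry (i=1) and rainy (i=2) aggregated speed matrices.

    daily_speed: dict mapping day -> N x T list of lists of average speeds
                 (None where no speed was recorded).
    daily_rain:  dict mapping day -> N x T list of lists of precipitation.
    Assumption: a slot counts as rainy when precipitation > rain_threshold.
    """
    days = sorted(daily_speed)
    n_roads = len(daily_speed[days[0]])
    n_slots = len(daily_speed[days[0]][0])
    totals = {s: [[0.0] * n_slots for _ in range(n_roads)] for s in (1, 2)}
    counts = {s: [[0] * n_slots for _ in range(n_roads)] for s in (1, 2)}

    for day in days:
        for n in range(n_roads):
            for t in range(n_slots):
                speed = daily_speed[day][n][t]
                if speed is None:
                    continue  # skip slots without any speed record
                state = 2 if daily_rain[day][n][t] > rain_threshold else 1
                totals[state][n][t] += speed
                counts[state][n][t] += 1

    def mean_matrix(state):
        # Mean speed per (road, slot); None where no day had this state.
        return [[totals[state][n][t] / counts[state][n][t]
                 if counts[state][n][t] else None
                 for t in range(n_slots)]
                for n in range(n_roads)]

    return mean_matrix(1), mean_matrix(2)
```

Cells with no observation under a given weather state are returned as `None` rather than zero, so that a missing measurement is not mistaken for a standstill.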

STC Matrix
Finally, the STC matrix, illustrated in Figure 2e, can be calculated to present the scale of the speed drop caused by weather change. Based on Equation (2), STC is calculated as the scaled difference between $X_1$ and $X_2$, where $X_1$ and $X_2$ are the traffic speed matrices under dry and rainy weather, respectively. STC ranges from 0 to 1 because $X_1 > X_2$ and both are positive values. Here, 0 indicates a 0% traffic speed drop caused by the weather change, meaning the traffic speed is invulnerable to the weather. On the contrary, 1 means the road traffic is completely stuck due to the rainy condition.
Figure 3 shows an example of the STC matrix to illustrate the temporal traffic speed drop caused by weather change on 25 representative roads in Jakarta. The x-axis and y-axis represent the time window and road ID, respectively. The heat map colors, changing from green through yellow to red, indicate the relative minimum, average, and maximum vehicle speed drop caused by the weather difference (dry vs. rainy) at different times. As can be seen in Figure 3, the "green" regions indicate that the change from dry to rainy weather causes a relatively minimal traffic speed drop during the nonworking hours (before 10 a.m. and after 6 p.m.). During the working hours (between 10 a.m. and 6 p.m.), the "yellow-red" regions show the severe traffic speed drop caused by rain. Through the proposed STC matrix, the roads that experience an extended speed drop over a longer time can be detected. For example, road #5528 has a longer traffic speed drop period than the others (10 a.m. to 8 p.m.). Roads #1209 and #1210 experienced almost no speed drop across all time windows. Based on the STC matrix, the impact of weather can be easily analyzed and used for the prediction model.
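As an illustration of the STC calculation, the following Python sketch assumes that the "scaled difference" takes the element-wise form $(X_1 - X_2) / X_1$, which is consistent with the stated range (0 for no drop, 1 when the rainy-day speed is zero). Equation (2) itself is not reproduced in this text, so this form, the function name, and the clipping of negative values to 0 are assumptions.

```python
def stc_matrix(x_dry, x_rainy):
    """Element-wise relative speed drop between dry and rainy matrices.

    x_dry, x_rainy: N x T lists of lists (None marks a missing value).
    Assumed form: STC = (X1 - X2) / X1, clipped to the range [0, 1].
    """
    stc = []
    for dry_row, rainy_row in zip(x_dry, x_rainy):
        row = []
        for dry, rainy in zip(dry_row, rainy_row):
            if dry is None or rainy is None or dry <= 0:
                row.append(None)  # drop is undefined without both speeds
            else:
                drop = (dry - rainy) / dry
                row.append(min(1.0, max(0.0, drop)))
        stc.append(row)
    return stc
```

For example, a road slot with a dry-day speed of 10 km/h and a rainy-day speed of 5 km/h yields an STC value of 0.5, that is, a 50% speed drop.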

The Proposed Framework for Analyzing WSR Based on Sequential Clustering and Classification Analysis
This section explains the proposed framework for analyzing WSR based on sequential clustering and classification analysis. Sequential clustering and classification analysis are often used for data exploration or data analytics tasks, mainly when the classification task labels are unavailable [21]. In such a case, the clustering method is performed to generate the labels for the classification task. Then, the labels obtained from the clustering method might be the "clues" of labeling the classification task's data afterward. Here, the problem of analyzing WSR is mainly the same as the mentioned data analytics task.
First, the clustering method can be performed to cluster the roads based on their speed drop severities. In this research, the STC matrix was used as the dataset for the K-means clustering method. By using K-means, the roads with similar speed drops at similar times will be clustered together. We can then further investigate the factors associated with the traffic speed drop patterns of a particular WSR cluster. Since the STC matrix only considers weather change, another set of data, such as road altitude above sea level, the distance to an important facility nearby, etc., can be considered for the analysis. The feature selection and data classification methods can be applied to the road feature dataset combined with the cluster labels. Figure 4 illustrates the proposed framework of identifying WSR by utilizing the STC mentioned above and the roads' characteristics dataset [42]. As mentioned earlier, the traffic and weather datasets are used to generate the STC data. After performing K-means on STC, the generated cluster labels are appended to the road characteristic dataset. Then, the classification model is built to classify the cluster label by the road characteristics. The details of clustering and classification are given in the following sections.

• Clustering
After obtaining the STC matrix, data clustering was conducted on it. The clustering method can be defined as a function $L, d_r, d_a = \varphi(\mathrm{STC})$, where the clustering function $\varphi$ takes the STC matrix as the input and returns three outputs: (1) the labels for WSR ($L$), (2) the sum of squared distances between all cluster centers ($d_r$), and (3) the sum of squared distances of roads to their cluster center ($d_a$). Given a set of roads $X = \{x_1, x_2, \ldots, x_N\}$, where each road is a $T$-dimensional real vector $x_i = \{x_{i1}, x_{i2}, \ldots, x_{iT}\}$, K-means clustering aims to group the elements of set $X$ into $K$ sets $S = \{S_1, \ldots, S_K\}$, where $K \le N$. The K-means clustering procedure is as follows.
(1) Sample $K$ roads without replacement from set $X$ randomly.
(2) Assign the roads to sets $\{S_1, \ldots, S_K\}$, and set the initial clusters' centroids equal to the assigned roads.
(3) Draw a road $x_n$ without replacement from set $X$ randomly.
(4) Find the nearest set to $x_n$: $c_n = \arg\min_k \lVert x_n - \mu_k \rVert^2$, so that the indicator $\mathbb{1}\{c_n = k\}$ equals 1 if road $x_n$ is assigned to set $S_k$ and 0 otherwise, $\forall k \in \{1, \ldots, K\}$.
(5) Update the cluster centroids, $\mu_k := \frac{\sum_{n=1}^{N} \mathbb{1}\{c_n = k\}\, x_n}{\sum_{n=1}^{N} \mathbb{1}\{c_n = k\}}$.
Repeat steps 3-5 until convergence. Generally, the K-means clustering algorithm aims to minimize $\sum_{k=1}^{K} \sum_{n=1}^{N} \mathbb{1}\{c_n = k\} \lVert x_n - \mu_k \rVert^2$. The K-means method groups the roads based on the predetermined number of clusters $K$. It is worth mentioning that the selection of $K$ can be determined by multiobjective optimization, which will be addressed later.
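The clustering function $\varphi$ can be sketched as follows. This is a minimal pure-Python illustration and not the authors' implementation: it uses the common batch (Lloyd) variant, which reassigns all roads in each pass instead of drawing them one at a time, and it assumes that $d_r$ is the sum of squared distances over all pairs of cluster centers. The function names and the `seed` and `n_iter` parameters are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(rows, k, n_iter=100, seed=0):
    """Cluster the STC rows (one T-dimensional vector per road).

    Returns (labels, d_r, d_a):
      labels: cluster index of each road
      d_r:    sum of squared distances between all pairs of centroids
      d_a:    sum of squared distances of roads to their own centroid
    """
    rng = random.Random(seed)
    # Steps 1-2: K randomly sampled roads become the initial centroids.
    centroids = [list(r) for r in rng.sample(rows, k)]
    labels = None
    for _ in range(n_iter):
        # Step 4: assign every road to its nearest centroid.
        new_labels = [min(range(k), key=lambda j: sq_dist(r, centroids[j]))
                      for r in rows]
        # Step 5: move each centroid to the mean of its member roads.
        for j in range(k):
            members = [r for r, c in zip(rows, new_labels) if c == j]
            if members:
                centroids[j] = [sum(col) / len(members)
                                for col in zip(*members)]
        if new_labels == labels:
            break  # assignments stopped changing: converged
        labels = new_labels
    d_a = sum(sq_dist(r, centroids[c]) for r, c in zip(rows, labels))
    d_r = sum(sq_dist(centroids[i], centroids[j])
              for i in range(k) for j in range(i + 1, k))
    return labels, d_r, d_a
```

A larger $d_r$ indicates better separated clusters and a smaller $d_a$ indicates more compact ones, which is why both values are later used as objectives when choosing $K$.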

• Classification
The classification method can be defined as a function $\hat{L} = \Omega(Z \mid L)$, where $\hat{L}$ are the predicted labels obtained from the classification function $\Omega$ based on the input data $Z$, given the labels $L$ pregenerated by the clustering method. Following the previous work, this study's performance metric is the Hamming loss, as this loss function is standard for the multilabel classification problem [43]. Hamming loss converts the class labels into unique binary strings and calculates the loss based on the exclusive disjunction (XOR) operation between the binary strings of the actual and predicted class labels. The Hamming loss follows Equation (3), where $A$, $N$, $M$, $\otimes$, $y_{ij}$, and $\hat{y}_{ij}$ are the model accuracy, number of instances, length of the binary strings, XOR operation, and actual and predicted class label $j$ in data instance $i$, respectively.
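A small Python sketch of this metric is given below. Equation (3) is not reproduced in this text, so the sketch assumes the standard form, in which the loss is the fraction of mismatched bits, $\frac{1}{NM} \sum_{i=1}^{N} \sum_{j=1}^{M} y_{ij} \otimes \hat{y}_{ij}$, and the accuracy $A$ is one minus that loss. It further assumes that each integer cluster label is encoded as a fixed-width binary number; the paper's actual encoding may differ, and the function names are hypothetical.

```python
def to_bits(label, width):
    """Fixed-width binary encoding of a non-negative integer label."""
    return [(label >> shift) & 1 for shift in reversed(range(width))]


def hamming_loss(actual, predicted, n_classes):
    """Fraction of bit positions where actual and predicted labels differ."""
    width = max(1, (n_classes - 1).bit_length())  # M: bits per label
    mismatches = 0
    for y, y_hat in zip(actual, predicted):
        # XOR of the two binary strings marks the differing bits.
        mismatches += sum(a ^ b for a, b in
                          zip(to_bits(y, width), to_bits(y_hat, width)))
    return mismatches / (len(actual) * width)


def hamming_accuracy(actual, predicted, n_classes):
    """Assumed accuracy A = 1 - Hamming loss."""
    return 1.0 - hamming_loss(actual, predicted, n_classes)
```

With four classes, each label uses two bits, so predicting class 0 for a road that belongs to class 3 contributes two mismatched bits to the loss.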
The proposed framework considers the RF method for the classification task [39]. The RF model is an ensemble of $B$ classification trees (CT). Essentially, the RF method generates its prediction by a majority vote over the predictions of the CT. The training of the RF algorithm for the WSR study proceeds as follows.
(1) Sample B observations from Z with replacement.

Multiobjective Optimization
As mentioned earlier, determining the number of clusters for K-means is a challenging question. A smaller $K$ might not represent the variety of the WSR, and a larger $K$ (a larger number of labels) might escalate the problem complexity, especially for the classification task. In terms of objective functions, different values of $K$ will generate different results for the clustering problem ($d_r$ and $d_a$) and the classification problem ($A$). Therefore, the multiobjective optimization method is used in the proposed framework to find the nondominated $K$ that maximizes $d_r$, the reciprocal of $d_a$, and $A$.
Here, we applied the Pareto front method [24] to solve the above-mentioned multiobjective optimization. Let $S^d$ be the set of possible solutions, each corresponding to a number of clusters (see the dots in the Pareto plot in Figure 4), where $d$ is the number of considered objectives. Given two vectors in the objective space, $s^1 \in S^d$ and $s^2 \in S^d$, the vector $s^1$ is said to dominate $s^2$ if and only if $s^1_i \ge s^2_i$, $\forall i \in \{1, \ldots, d\}$, and $\exists j \in \{1, \ldots, d\}$ such that $s^1_j > s^2_j$. The solutions that are not dominated by any other solution are known as nondominated solutions: they do not dominate each other, and every remaining solution is dominated by at least one of them. Multiple candidate values of $K$ can be selected from the nondominated solutions. The optimal $K$ can be further chosen based on the comparison among the criteria. The experiment on the real-world data in Section 4 provides a detailed example of choosing $K$ from the nondominated solutions.
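The dominance test and the selection of nondominated values of $K$ can be illustrated with a short Python sketch. It assumes that every objective is to be maximized and that each candidate $K$ is described by the tuple ($d_r$, $1/d_a$, $A$). The function names are hypothetical, and the numbers in the usage example are made up for illustration and are not results from this study.

```python
def dominates(a, b):
    """True if a is at least as good as b everywhere and better somewhere."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))


def nondominated(candidates):
    """Return the sorted K values whose objective vectors are nondominated.

    candidates: dict mapping K -> (d_r, 1 / d_a, A), all to be maximized.
    """
    return sorted(
        k for k, objectives in candidates.items()
        if not any(dominates(other, objectives)
                   for other_k, other in candidates.items() if other_k != k)
    )


# Illustrative values only (not taken from the paper).
example = {
    2: (1.0, 0.2, 0.9),
    3: (0.9, 0.1, 0.8),   # worse than K = 2 on every objective
    5: (2.0, 0.5, 0.7),
}
pareto_ks = nondominated(example)
```

In this toy example, K = 3 is dominated by K = 2, while K = 2 and K = 5 trade accuracy against cluster separation and therefore both remain on the Pareto front.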

Dataset
This paper considers three datasets for studying WSR: Dataset #1 is a traffic speed dataset based on smartphone sensor data, Dataset #2 is a weather dataset, and Dataset #3 contains the roads' characteristics.
Dataset #1, provided by the Jakarta Smart City Department, a research unit under the Government of Jakarta, Indonesia, contains the traffic speed data in Jakarta collected from November 2017 to October 2018. The traffic speed data contains the GPS information, which includes the real-time coordinates and vehicle speed recorded by citizens' smartphone sensors in Jakarta while they are traveling by car or motorcycle. The size of Dataset #1 is 600 gigabytes, with more than two billion traffic speed records. Table 2 shows an example of the processed Dataset #1, and more detailed information on the data preprocessing can be found in the previous work [3]. Four main attributes in Dataset #1 were used to build Jakarta's traffic speed: the time information, location latitude, location longitude, and the recorded speeds. The traffic speed used in this case is motor vehicle speed (motorcycles and cars) only. We reconstructed the attributes into a geographic information systems (GIS) data representation by matching them with the GIS information of roads in Jakarta in OpenStreetMap [42].
Dataset #2, collected from WorldWeatherOnline® [44], is the weather information of Jakarta City between November 2017 and October 2018. Since the effect of rain on traffic speed is the main research topic, we consider only two weather states: dry and rainy. Dataset #2 contains the rain intensity and the associated date and time information. Examples of Dataset #2 can be seen in Table 3.
Dataset #3, obtained from OpenStreetMap [42], contains the roads' characteristic information in Jakarta, as shown in Table 4.
The obtained information regarding a particular road and its surroundings includes the length, width, number of lanes, type, and altitude of the street, and the distances to the nearest public areas, such as schools, mosques, and shopping malls. The distance between a particular road segment and a site (shopping mall, mosque, or school) is measured as the Euclidean distance between the corresponding coordinates. We later used the features in Table 4 as the RF method's predictor variables to predict the speed drop. In this paper, the study region, a 5 × 5 km square area of nonresidential roads in west Jakarta, marked with a red box in Figure 5a, is used to represent a case study. Figure 5b is the zoom-in of the red box on the left-hand side. This area was chosen because the selected location can represent the condition of Jakarta's roads in general. In the chosen 5 × 5 km square area, there are 906 roads, three large shopping mall complexes, business districts, three local universities, the entrance and exit gates of a highway toll road, a commuter station, and plenty of wide and narrow streets. In fact, this 5 × 5 km square area is the core area of Jakarta city and, without loss of generality, represents a common urban area in a metropolis of a developing country. The traffic data were aggregated into fifteen-minute intervals from 6 a.m. to 10 p.m.
Table 4. Roads' characteristics in Dataset #3.
Feature            Description
isOneway           1 if the road applies the one-way policy, 0 otherwise
isPrimary          1 if the road is the main road, 0 otherwise
isSecondary        1 if the road is the secondary road, 0 otherwise
isTertiary         1 if the road is the tertiary road, 0 otherwise
Length             Length of the road (meters)
Width              Width of the road (meters)
Number_lanes       Number of lanes
Type_road          Three types: primary, secondary, tertiary
Altitude           The road's altitude above sea level (meters)
Distance_school    Distance (meters) from the road segment to the nearest school
Distance_mosque    Distance (meters) from the road segment to the nearest mosque
Distance_mall      Distance (meters) from the road segment to the nearest mall
Figure 5. (a) Study area in a map of Jakarta; (b) zoom-in area.
Please note that this research particularly emphasizes urban traffic on weekdays; therefore, we omitted the traffic data on holidays and weekends. To account for errors in the GPS readings, following the suggestion in [45], we included all of the traffic data within 10 m in calculating the average traffic speed. In this research, only traffic speeds lower than 10 km/h were used for investigation because speeds greater than 10 km/h are considered normal traffic based on the Jakarta government's definition. We treated all factors that can create traffic congestion but are not listed in Table 4, such as traffic incidents, as implicit factors in the WSR analysis.
We tested the significance of the difference in average traffic speed between dry and rainy days using the paired t-test. Our preliminary results show that, of the 58,890 combinations (906 roads times 65 window slots), 90% are statistically significant at a significance level of 0.05. This means that, for most entries of the STC matrix, the finding that the speed on a dry day is higher than that on a rainy day has been statistically verified. The remaining 10% of nonsignificant results are mainly due to the lack of traffic data during the night time and on less traveled roads.
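For reference, the standard paired t statistic for a single road and window slot is given below. Here $d_i$ denotes the difference between the dry-day and rainy-day average speed for the $i$-th pair and $n$ is the number of pairs. How the pairs are formed is not stated in this section, so the pairing is an assumption of this sketch. The statistic is compared against a t distribution with $n - 1$ degrees of freedom at the 0.05 level.

```latex
\bar{d} = \frac{1}{n}\sum_{i=1}^{n} d_i, \qquad
s_d = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} \left(d_i - \bar{d}\right)^2}, \qquad
t = \frac{\bar{d}}{s_d / \sqrt{n}}
```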

Selection of K for Clustering
This section reports the process of selecting K based on Pareto front optimization of $d_r$, the reciprocal of $d_a$, and A, where the three objectives were obtained from the following experiments. As explained earlier in Section 3, the Pareto front method determines the best K by jointly optimizing the objectives of both the clustering and classification methods: $d_r$, the reciprocal of $d_a$, and A. This paper assumes that choosing the best-fit K from the nondominated solutions of these objective functions leads to a better WSR result.
We ran the K-means clustering algorithm with K from 2 to 30 and recorded the resulting $d_r$ and $d_a$. For each K, the prediction accuracy A of the RF method was stored together with $d_r$ and $d_a$. The best K was then selected using the Pareto front method. Figure 6 visualizes the solutions under the Pareto surface of the reciprocal of $d_a$ (x-axis), $d_r$ (y-axis), and A (z-axis). The grey surface in Figure 6 represents the Pareto frontier area. The nondominated solutions (red dots), none of which dominates another in the same solution set, lie on the edges of the Pareto frontier. The blue dots are the solutions that are dominated by the red dots. From Figure 6, the five nondominated solutions are K = 2, 5, 12, 20, and 30.
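The nondominated filtering step can be sketched as below, assuming all three objectives ($d_r$, the reciprocal of $d_a$, and A) are to be maximized. The objective values are made-up toy numbers chosen only to show the mechanics. They are not the values from this study, and the function names are hypothetical.

```python
from typing import Dict, List, Tuple

# One solution per K: (d_r, 1 / d_a, A), all treated as "higher is better".
Objectives = Tuple[float, float, float]


def dominates(p: Objectives, q: Objectives) -> bool:
    """True if p is at least as good as q in every objective and better in one."""
    return all(a >= b for a, b in zip(p, q)) and any(a > b for a, b in zip(p, q))


def pareto_front(solutions: Dict[int, Objectives]) -> List[int]:
    """Return the K values whose objective vectors no other solution dominates."""
    return sorted(
        k
        for k, p in solutions.items()
        if not any(dominates(q, p) for j, q in solutions.items() if j != k)
    )


# Toy values for illustration only (not results from the paper).
toy = {
    2: (10.0, 0.002, 0.80),
    3: (9.0, 0.001, 0.70),   # dominated by K = 2
    4: (12.0, 0.003, 0.75),
    5: (11.0, 0.003, 0.74),  # dominated by K = 4
}
front = pareto_front(toy)
print(front)
```

A pairwise comparison like this is quadratic in the number of candidates, which is negligible for 29 values of K.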

Table 5 lists K and the associated $d_r$, reciprocal of $d_a$, and A of the nondominated solutions shown in Figure 6. Among the five nondominated solutions, a K with relatively higher A is preferred, because a lower A means the classifier performs poorly in predicting the WSR label from the prediction variables in Table 4. That is, a lower prediction accuracy A implies that the road characteristics do not help predict the speed drop level of WSR. Therefore, K = 2 and K = 5 are the preferred candidates, with relatively high A of 0.822 and 0.817, respectively.
When comparing the solutions with K = 2 and K = 5, the $d_r$ and the reciprocal of $d_a$ of solution K = 5 are 22.000 and 0.004, respectively, which are higher than those of solution K = 2 (higher is better). Based on this comparison, this study considers K = 5 the best-fit solution for representing WSR. Figure 7 shows the clustering results using K = 5. In Figure 7, the blue, yellow, green, red, and black colors indicate the roads in clusters #1, #2, #3, #4, and #5, respectively. The numbers of roads in clusters #1, #2, #3, #4, and #5 were 92, 159, 573, 61, and 21, respectively. Generally, #3 and #5 have relatively fewer roads than clusters #1, #2, and #4, which contain most of the roads in the experimented area.

• Speed Drop Pattern
Figure 8 shows the average speed drop in kilometers per hour (y-axis) of the five clusters of WSR over time (x-axis). Please note that the traffic speed drop is due to weather change (dry vs. rainy). Figure 8 uses the same color indicators shown in Figure 7 to represent the road clusters, with additional shape indicators: blue diamond, orange X, green rectangle, red triangle, and black circle.
In general, Jakarta roads experience at least a 4.7% speed drop when the weather changes from dry to rainy. Each road cluster has a different speed drop pattern over time. For example, the average speed drop in cluster #3 stays close to 5%. Except in cluster #3, we can observe the speed drop patterns of all clusters in four time-frames: morning (6 a.m. to 12 p.m.), early afternoon, late afternoon, and night. In the morning, the speed drop of cluster #2 steadily increases from 5% to 6.5%. The speed drop of cluster #2 is stable at around 6.5% in the early afternoon and declines to 4.9% during the late afternoon and night time.

The speed drop pattern of cluster #1 is distinct from those of clusters #2 and #3. The speed drop of cluster #1 increases from 5.2% to 6.6% in the morning and declines to 5% in the early afternoon. Interestingly, the speed drop increases a second time in the late afternoon, to 5.8%, before it declines again during the night time to 4.8%.
In cluster #4, the speed drop is relatively constant at around 5.1% in the morning. It begins to increase in the early afternoon, reaching 7.1%. During the late afternoon, the speed drop increases quickly to 8.3% at 5.30 p.m. and then declines to just 5.7% at 8 p.m. It keeps decreasing to 4.8% during the night.
Among all clusters, cluster #5 generally has the highest traffic speed drop. The speed drop of cluster #5 increases slowly from 5.2% to 5.7% during the morning. During the early afternoon, the speed drop jumps to 8.2% at 1.30 p.m. and then stabilizes at around 7.5%. It increases to 8.35% at 4.45 p.m. and steadily decreases to 6.1% during the late afternoon. The speed drop continues to decline to 4.75% during the night time. Between 12 p.m. and 6 p.m., the speed drop of cluster #5 is between 7% and 8.35%, the highest among all clusters, whose speed drops are mostly below 7%, except for the anomaly in cluster #4 between 5.30 p.m. and 6.30 p.m.
The following subsection investigates how the characteristics of these road clusters are associated with the corresponding rain-induced speed drop patterns.

• Road Characteristics Associated with WSR
Figure 9 ranks the road characteristics from Table 4, from top to bottom, by their importance as obtained from the RF classifier. As shown in Figure 9, the most crucial factor, shown at the top, is the distance to the nearest shopping mall, with an importance score of 0.175 from the RF method. Other essential variables are the distance to the nearest mosque, the distance to the nearest school, the altitude, and the road length, with importance scores of 0.154, 0.151, 0.153, and 0.152, respectively. The road width and the number of lanes are also important, with importance scores of 0.081 and 0.072, respectively.
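As a small worked example, the importance scores quoted above can be ranked and summed directly. The variable names follow Table 4 and the scores are the ones reported in the text. Reading the remainder as the share of the other Table 4 variables assumes the scores are normalized to sum to 1, which this section does not state.

```python
# Importance scores as reported in the text for Figure 9.
importance = {
    "Distance_mall": 0.175,
    "Distance_mosque": 0.154,
    "Distance_school": 0.151,
    "Altitude": 0.153,
    "Length": 0.152,
    "Width": 0.081,
    "Number_lanes": 0.072,
}

# Rank variables from most to least important.
ranked = sorted(importance.items(), key=lambda item: item[1], reverse=True)
total = sum(importance.values())

for name, score in ranked:
    print(f"{name:16s} {score:.3f}")
print(f"sum of listed scores: {total:.3f}")

# Assumption: if the scores are normalized to 1, this is what is left for the
# remaining Table 4 variables (for example, isOneway or Type_road).
remainder = 1.0 - total
```

The ranking makes the grouping visible: one leading variable, four nearly tied variables between 0.151 and 0.154, and two weaker road-geometry variables.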
It is interesting to associate the variables in Figure 9 with the road clusters to describe the WSR. Table 6 shows the significant road characteristics in Figure 9 against the five WSR clusters. Based on Table 6, cluster #3 has the highest average altitude of all clusters. This may be why the impact of weather changes on cluster #3 is fairly stable, at around 5%, from morning until night. Cluster #3 also has the smallest average road length and width, which means its roads are shorter and narrower than those in other clusters. Intuitively, cluster #3 might carry fewer vehicles than other clusters; thus, the impact of weather change is relatively constant.
Cluster #1 consists of relatively long roads, mostly single-lane and two-lane roads, that are relatively far from schools. Unlike cluster #1, cluster #2 has shorter and broader roads at higher altitudes. Cluster #2 roads are also relatively far from public areas, and their speed drop peaks in the afternoon.
Cluster #4 has relatively low altitude and is very close to mosques and schools, which are essential facilities in the daily life of Indonesian citizens. The highest average speed drop on these roads occurs during the late afternoon (5.30 p.m. to 6.30 p.m.). Based on the Maghrib prayer schedule, the period from 5.30 p.m. to 6.30 p.m. is a typical prayer time, because this prayer can only be performed between sunset and the beginning of the night. As a result, the speed drop in this cluster reaches its peak during this prayer time.
The roads in cluster #5 are closer to shopping malls and farther from mosques and schools than the roads in other clusters. Uniquely in Jakarta, shopping malls are the main tourist sites and the primary destination for family recreation. Furthermore, people in Jakarta also hang out at shopping malls [44]. The shopping malls in Jakarta usually open at 11 a.m. and close at 10 or 11 p.m. Therefore, these roads experience the highest average speed drop during the shopping malls' operating hours (see Figure 8).
In short, based on the proposed data analysis framework, the speed drop between dry and rainy weather conditions was used to represent the traffic condition of roads in Jakarta. Then, the sequential clustering and classification process was conducted to cluster the roads and identify the associated road characteristics. The case study on real-world data from Jakarta shows that the framework can not only identify WSR and the significant factors behind the speed drop but also provide insights useful for city traffic management.

Conclusions
WSR identification and analysis are crucial for city development. Giving WSR a higher maintenance priority than non-WSR can be more cost-effective in reducing traffic congestion. Also, providing the locations of WSR to road users can help drivers bypass WSR when it is about to rain. Especially in developing countries, such as southern Asian countries where the mass transportation system is limited or under construction, millions of motorcyclists can rely on WSR information to improve their mobility on rainy days.
In this research, to fill the research gap, Jakarta's traffic pattern was studied as a representative example of a metropolis in a developing country. Because of the inadequate sewer systems in Jakarta, rain often creates sudden heavy traffic on roads that are traffic-free in dry weather. This study focuses on identifying WSR and analyzing their causes using machine learning methods based on smartphone sensor, weather, and road characteristics datasets. A framework consisting of sequential clustering and classification tasks was proposed. We first introduced the STC matrix to represent the roads' average speed drop caused by the weather changing from dry to rainy. Then, the STC matrix was clustered using the K-means clustering method. The clustering labels were used as the prediction labels for the classification tasks. The RF method was used in the classification tasks to investigate the associated causes of WSR based on the given dataset. The Pareto front method was used to select K, based on the objectives of both the clustering and classification methods.
The experimental results show that K = 5 best represents the WSR in Jakarta. With this choice, distinct speed drop patterns of the road clusters can be observed. For example, roads in cluster #4 face a significant speed drop during the late afternoon, while cluster #1 shows the opposite effect. Using the RF method, seven leading factors of WSR in Jakarta were identified: the distances to (1) shopping malls, (2) mosques, and (3) schools, the roads' (4) altitude, (5) length, and (6) width, and (7) the number of lanes, with importance scores of 0.175, 0.154, 0.151, 0.153, 0.152, 0.081, and 0.072, respectively.
The main contribution of this work is a framework that can be used to assess the impact of weather change on road traffic speed. Without loss of generality, the proposed analysis framework can be applied to many other weather changes, such as fog and snow. Since the current dataset does not contain precipitation information, investigating how precipitation affects the speed drop of WSR could be the next task. Moreover, incorporating more parameters in the WSR study, such as traffic incidents, is worth future study. Last but not least, other clustering and classification methods, such as the fuzzy C-means clustering algorithm, can be integrated to extend the proposed framework for finer clusters.

Data Availability Statement: The data used in this research is owned by Jakarta Smart City. Any request regarding the data should be directed to Jakarta Smart City.