Evaluation of Outlier Filtering Algorithms for Accurate Travel Time Measurement Incorporating Lane-Splitting Situations

Abstract: Malaysia has a high percentage of motorcycles. Because of lane-splitting, the travel times of motorcycles are shorter than those of passenger cars during congestion. As a result, collecting travel times using media access control (MAC) addresses is not straightforward. Many outlier filtering algorithms for travel time datasets have not been evaluated for their capability to filter lane-splitting observations. This study aims to identify the best travel time filtering algorithms for data containing lane-splitting observations and to determine how to use the best algorithm. The study was conducted in two stages. The first stage validates the performance of the previous algorithms, and the second stage checks the sensitivity of the algorithm parameters for different days. The analysis uses the travel time data for three routes in Kuala Lumpur collected by Wi-Fi detectors in May 2018. The results show that the Jang algorithm has the best performance for two of the three routes, and the TransGuide algorithm is the best algorithm for one route. However, the parameters of the Jang and TransGuide algorithms are sensitive for different days and require daily calibration to obtain acceptable results. With proper calibration of the algorithm parameters, the Jang and TransGuide algorithms produced the most accurate filtered travel time datasets compared to the other algorithms.


Introduction
Like most other countries in the world, the urban areas of Malaysia experience a high level of traffic congestion. It is a grave problem that affects everyone in the country because of its economic and health implications. The traffic congestion costs Malaysia an estimated
The aim of measuring travel times using MAC addresses is to obtain accurate results for passenger vehicles. In Malaysia and other ASEAN countries, the errors in the datasets arise from motorcycle data during congestion (lane-splitting data) and from outliers. Because of the significant difference between the travel times of lane-splitting motorcycles and passenger vehicles, it is essential to filter out the lane-splitting data using a filtering algorithm to obtain an accurate travel time pattern for passenger vehicles. Several researchers have developed algorithms for filtering outliers. At present, these algorithms have not been evaluated using travel time datasets that contain lane-splitting observations. Therefore, it is essential to evaluate the well-known filtering algorithms to identify the best algorithm for filtering both outliers and lane-splitting observations. This paper evaluates the previously established filtration algorithms in terms of how accurately they represent the actual situation and how sensitive their parameters are, using travel time data containing lane-splitting observations and outliers. The study used the travel time datasets for three routes in Kuala Lumpur (KL).

Literature Review
Many algorithms and approaches discussed in the literature addressed the filtering of outliers in travel time data. Each algorithm has some assumptions and levels of complexity. The appropriate algorithms for the travel time data collected by MAC addresses are as follows.

Percentile Algorithm
This algorithm uses percentiles to define the validity range using the 10th percentile as the lower limit and the 90th percentile as the upper limit. These limits are applied after dividing the time into small equal time windows, usually 5 or 15 min. This method is simple, easy to perform, and can be used in real-time applications. However, it does not consider the comparison with the previous time window. Therefore, the filter does not work well if the number of outliers is greater than the number of true observations or if all observations are outliers.
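The percentile filter described above can be sketched as follows. This is a minimal illustration, not the implementation used in the study: the function names, the `(timestamp_s, travel_time_s)` tuple layout, and the linear-interpolation percentile rule are assumptions made for the example.

```python
from collections import defaultdict


def percentile(sorted_vals, p):
    """Percentile p (0..100) of an ascending list, by linear interpolation."""
    if len(sorted_vals) == 1:
        return sorted_vals[0]
    rank = (len(sorted_vals) - 1) * p / 100.0
    lo = int(rank)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (rank - lo)


def percentile_filter(observations, window_s=300, lower_p=10, upper_p=90):
    """Keep observations inside the [lower_p, upper_p] percentile band.

    observations: list of (timestamp_s, travel_time_s) tuples (assumed layout).
    The band is computed independently for each fixed window of window_s
    seconds, so no information from the previous window is used.
    """
    windows = defaultdict(list)
    for ts, tt in observations:
        windows[int(ts // window_s)].append((ts, tt))

    valid = []
    for key in sorted(windows):
        obs = windows[key]
        tts = sorted(tt for _, tt in obs)
        lower = percentile(tts, lower_p)
        upper = percentile(tts, upper_p)
        valid.extend((ts, tt) for ts, tt in obs if lower <= tt <= upper)
    return valid
```

Because each window is filtered in isolation, a window dominated by lane-splitting observations would place the band around the motorcycles rather than the passenger vehicles, which is the shortcoming noted above.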

Mean Absolute Deviation Algorithm
The validity range in this algorithm is defined using the median and mean absolute deviation (MAD). The appropriate time window is 5 or 15 min [29].
$$MAD = \frac{1}{n} \sum_{i=1}^{n} \left| JT_i - M_e \right| \quad (1)$$
where $JT_i$ is the travel time of vehicle $i$, $M_e$ is the median of the travel times in the time window, and $n$ is the number of travel time observations in the time window.
The lower limit of the validity range is $M_e - 3\,MAD$, and the upper limit is $M_e + 3\,MAD$.
This algorithm can be used in real-time applications. The method is simple and easy to perform. However, it does not consider the comparison with the previous time window. It also has the same shortcomings as the percentile algorithm if the number of outliers is greater than the number of true observations or if all observations are outliers.
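A minimal sketch of this filter for a single time window is given below. It assumes that MAD is the mean of the absolute deviations from the window median, as the symbol definitions suggest; the function name and the multiplier argument `k` are hypothetical.

```python
from statistics import median


def mad_filter(travel_times, k=3.0):
    """Return the travel times within median +/- k * MAD for one window.

    MAD is taken here as the mean absolute deviation about the median
    (an assumption based on the symbol definitions in the text).
    """
    if not travel_times:
        return []
    m_e = median(travel_times)
    mad = sum(abs(tt - m_e) for tt in travel_times) / len(travel_times)
    lower, upper = m_e - k * mad, m_e + k * mad
    return [tt for tt in travel_times if lower <= tt <= upper]
```

Exposing the multiplier as a parameter makes it easy to tighten the validity range when the default of three proves too permissive for a dataset.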

TransGuide Algorithm
The TransGuide algorithm proposed by the Southwest Research Institute is one of the earliest algorithms for automatic vehicle identification (AVI) data. The algorithm defines a travel time observation as valid if it lies within pre-defined travel time limits based on the previous average travel time. These limits are the validity range [30]. The following equations describe this rolling average algorithm.
$$Stt_{AB_t} = \left\{\, t_{Bi} - t_{Ai} \;\middle|\; t - t_W \le t_{Bi} \le t \ \text{and}\ tt'_{AB_t}(1 - l_{th}) \le t_{Bi} - t_{Ai} \le tt'_{AB_t}(1 + l_{th}) \,\right\} \quad (2)$$
$$tt_{AB_t} = \frac{\sum_{i=1}^{|Stt_{AB_t}|} (t_{Bi} - t_{Ai})}{|Stt_{AB_t}|} \quad (3)$$
Equation (2) defines $Stt_{AB_t}$ as the set of valid travel times from point A to point B at time $t$; it is used in Equation (3) to calculate $tt_{AB_t}$, the average travel time for the corresponding set of observations. $t_{Ai}$ is the detection time of vehicle $i$ at point A, and $t_{Bi}$ is the detection time of vehicle $i$ at point B. $t$ is the time at which the travel time estimation takes place, and $t_W$ is the time window. $tt'_{AB_t}$ is the previous average travel time from A to B, and $l_{th}$ is the link threshold travel time parameter. The time window, $t_W$, and the link threshold travel time, $l_{th}$, are the main parameters in the TransGuide algorithm. $t_W$ defines the period of time that should be considered when estimating the current average travel time, and $l_{th}$ is used to identify and remove outlier observations. The proposed values for these parameters are two minutes for $t_W$ and 0.2 for $l_{th}$. With these values, any travel time between a pair of readers that differs by more than 20% from the average travel time of the observations made within the previous two minutes is an outlier and is not included when calculating the average travel time of the current interval [31]. This algorithm has a low level of complexity and a simple mechanism that divides the time into small windows and compares the observations in the current window with the average travel time of the previous window. This algorithm can be used in real-time applications.
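The windowed mechanism described above can be sketched as follows. This is an illustration under stated assumptions rather than the TransGuide implementation: the function name, the tuple layout, the acceptance of the first window without a reference average, and the carrying forward of the last average through windows with no valid observation are all choices made for the example.

```python
def transguide_filter(observations, t_w=120, l_th=0.2, initial_avg=None):
    """Rolling-average filter over consecutive windows of t_w seconds.

    observations: list of (arrival_time_at_B_s, travel_time_s), any order.
    initial_avg:  optional seed for the "previous average" (assumption).
    Returns (valid_observations, {window_index: average_travel_time}).
    """
    if not observations:
        return [], {}
    obs = sorted(observations)
    prev_avg = initial_avg
    valid, averages = [], {}
    first = int(obs[0][0] // t_w)
    last = int(obs[-1][0] // t_w)
    i = 0
    for w in range(first, last + 1):
        # Collect the observations that arrived at B during window w.
        window = []
        while i < len(obs) and int(obs[i][0] // t_w) == w:
            window.append(obs[i])
            i += 1
        if prev_avg is None:
            # Assumption: with no reference yet, accept the window as-is.
            kept = window
        else:
            # Valid if within +/- l_th (as a fraction) of the previous average.
            kept = [(ts, tt) for ts, tt in window
                    if abs(tt - prev_avg) <= l_th * prev_avg]
        if kept:
            prev_avg = sum(tt for _, tt in kept) / len(kept)
            averages[w] = prev_avg
        # Assumption: if nothing is valid, prev_avg is carried forward.
        valid.extend(kept)
    return valid, averages
```

Because the validity range depends only on the previous average, a sudden onset of congestion can push all new observations outside the range, which is the limitation that later algorithms try to address.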

Dion and Rakha Algorithm
Dion and Rakha [31] argued that the TransGuide algorithm could not track abrupt changes in the observed travel times at a low sampling rate. They proposed an enhanced filtering algorithm to address this shortcoming. This algorithm applies a series of filters to the collected travel times to remove invalid observations by assuming that the travel times in the time window have a lognormal distribution. The algorithm considers any travel time that falls outside the validity range defined using the mean and standard deviation as outliers. The developers of the algorithm proposed the following versions.
Version 1
$$tt_{ABmax_k} = e^{\left[\ln(tts_{AB_k}) + n_\sigma \cdot \sigma_{stt_{AB_k}}\right]} \quad (8)$$
Equations (4) and (5) derive $tts_{AB_k}$ and $\sigma^2_{stt_{AB_k}}$ using the values of the previous sampling interval and α. Equation (6) uses β and $n_{v_k}$ to calculate α. Equations (7) and (8) define the lower and upper limits of the validity range using the results of Equation (4), Equation (5), and $n_\sigma$. Equation (9) uses the results of Equations (7) and (8) to determine the valid travel times. Equations (10) and (11) use the valid travel times to calculate $tt_{AB_k}$ and $\sigma^2_{tt_{AB_k}}$, which are used in the calculations of the next time window. β, $n_\sigma$, and the size of the time window are user-defined parameters. The recommended values for these parameters are 0.2-0.5 for β, two or three for $n_\sigma$, and two minutes for the time window [31].
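The validity range of Equations (7) and (8) can be illustrated with a short sketch. Only the upper limit survives in the text above; the lower limit below is assumed to be its mirror image in log space, which is what a lognormal assumption implies. The function and argument names are hypothetical.

```python
import math


def lognormal_validity_range(tts, sigma, n_sigma=2.0):
    """Validity range around a smoothed travel time under a lognormal model.

    tts:     smoothed travel time for the interval (seconds).
    sigma:   smoothed standard deviation of the log travel times.
    n_sigma: number of standard deviations defining the range.
    The lower limit is an assumed mirror of the upper limit in log space.
    """
    lower = math.exp(math.log(tts) - n_sigma * sigma)
    upper = math.exp(math.log(tts) + n_sigma * sigma)
    return lower, upper
```

Since the limits are symmetric in log space, the range is wider above the smoothed travel time than below it, which suits the right-skewed shape of travel time distributions.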
Version 2
$$\sigma^2_{tt_{AB_k}} = \begin{cases} \ldots, & n_{v_k} = 0 \text{ and } n_a < 3 \text{ and } n_b < 3 \\ \dfrac{\left[\ln(t_{Bi} - t_{Ai})_k - \ln(tts_{AB_k})\right]^2}{n_{v_k}}, & n_{v_k} = 1 \text{ and } n_a < 3 \text{ and } n_b < 3 \\ \ldots, & n_{v_k} \ge 2 \text{ and } n_a < 3 \text{ and } n_b < 3 \\ 0.01 \cdot (tt_{AB_k}), & n_a \ge 3 \text{ or } n_b \ge 3 \end{cases} \quad (19)$$
The difference between version 2 and version 1 is the use of Equations (14) and (19) to calculate α and $\sigma^2_{tt_{AB_k}}$. These changes were proposed to track sudden variations in traffic conditions. In particular, the amendments enable the algorithm to consider the third of three successive observations outside the validity range as valid, provided that the three observations are either all above or all below the validity range. In this version, $n_a$ is the number of consecutive observations above the validity range, and $n_b$ is the number of consecutive observations below the validity range.
This algorithm is more complex than the other algorithms because it has many assumptions, and the complexity makes it harder to understand and apply. In addition, the need to calibrate multiple parameters makes it impractical. However, it can be used in real-time applications.

Jang Algorithm
Jang [32] introduced a new outlier filtering algorithm that comprises two parts that are based on the number of observations in the time window. This algorithm utilizes a validity range from the previous time window if the number of observations is less than three, which is insufficient to generate a best measure of location. However, if there are three or more observations, the algorithm uses the time window of the current observation to determine the validity range. Because the median is the best measure of central tendency for skewed variables, the second part of the algorithm adopts the median as a measure of location instead of the mean. The minimum sample size for generating an effective median is three observations. The median can detect the discordant value if two travel times are true and one is discordant, but the mean may not. The median absolute deviation is utilized to define the validity range. If the valid observations are less than the outliers or all travel times are outliers, the comparison between the median of the current time window and the mean of the previous time window is used to overcome the problem.
The algorithm branches on whether $n < 3$, where $T_w$ is the time window or collection interval, and $n$ is the number of travel time observations in $T_w$. $S_{1AB}(t)$ and $S_{2AB}(t)$ are the sets of valid travel times from A to B at time $t$. $N(S_{1AB}(t))$ and $N(S_{2AB}(t))$ are the numbers of valid travel times from A to B at time $t$. $T_{AB}(t)$ is the average travel time of valid observations from A to B at time $t$, and $T_{AB}(t - T_w)$ is the average travel time of valid observations from A to B at time $(t - T_w)$. Equations (20) and (21) make up the first part of the algorithm, which is employed when the current time window has fewer than three observations. Equation (20) gives the valid travel times by comparing each travel time observation in the time window with the average travel time of the previous time window, $T_{AB}(t - T_w)$. If the absolute difference divided by $T_{AB}(t - T_w)$ exceeds α, the observation is an outlier; otherwise, it is valid. Equation (21) calculates $T_{AB}(t)$, which serves as the average travel time of the previous time window in the calculations for the next time window. It is worth noting that this part is similar to the TransGuide algorithm. The second part of the algorithm comprises Equations (22)-(26). Equations (22) and (23) calculate $M_t$ and $M_{AD}$, respectively. Equation (24) is used for situations where the valid observations are fewer than the outliers or when all the travel times are outliers. Equation (25) uses $M_t$, $M_{AD}$, and β to define the validity range and determine the valid travel times. Equation (26) is used to calculate $T_{AB}(t)$. The recommended values for the parameters are five minutes for $T_w$, 0.35 for α, 3 for β, and 0.3 for γ [32].
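The two-part logic described above can be sketched for a single time window as follows. This is a hedged illustration, not Jang's published equations: the validity range is assumed to be $M_t \pm \beta M_{AD}$, and the handling of the γ check (replacing the window median with the previous average when the two differ by more than γ, after $M_{AD}$ has been computed) is an assumption inferred from the order in which the equations are described. The function name and the treatment of a missing previous average are also hypothetical.

```python
from statistics import median


def jang_window(travel_times, prev_avg, alpha=0.35, beta=3.0, gamma=0.3):
    """Filter one time window; returns (valid_travel_times, new_average).

    prev_avg is the average of valid travel times in the previous window,
    or None if no reference exists yet (assumed handling).
    """
    n = len(travel_times)
    if n == 0:
        return [], prev_avg

    if n < 3:
        # Part 1: compare each observation with the previous average.
        if prev_avg is None:
            valid = list(travel_times)
        else:
            valid = [tt for tt in travel_times
                     if abs(tt - prev_avg) / prev_avg <= alpha]
    else:
        # Part 2: median and median absolute deviation of this window.
        m_t = median(travel_times)
        m_ad = median([abs(tt - m_t) for tt in travel_times])
        # ASSUMPTION: if the median departs from the previous average by
        # more than gamma, the window is taken to be dominated by outliers
        # and the previous average is used as the centre instead.
        if prev_avg is not None and abs(m_t - prev_avg) / prev_avg > gamma:
            m_t = prev_avg
        valid = [tt for tt in travel_times if abs(tt - m_t) <= beta * m_ad]

    new_avg = sum(valid) / len(valid) if valid else prev_avg
    return valid, new_avg
```

For example, under these assumptions a window in which three of five observations are fast lane-splitting motorcycles would have its median pulled toward the motorcycles, and the γ check would move the centre of the validity range back to the previous passenger-vehicle average.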
The Jang algorithm is different from the other filtering algorithms discussed in this paper. The other algorithms depend on the determination of the validity range based on either the previous time window or the current time window, whereas the Jang algorithm relies on both. This algorithm is suitable for real-time applications. However, it has a medium level of complexity. Table 1 summarizes the main characteristics of the abovementioned algorithms.

Methodology
The literature review showed that the main approach to evaluating the performance of filtering algorithms in detecting outliers in travel time datasets is to apply the algorithms to field data and to present their performance using graphs and statistics such as the mean absolute relative error (MARE) [29,31-34].

Research Methodology Flowchart
To achieve the research objective, the methods and analyses were selected after reviewing the literature relevant to the subject and were organized to ensure consistency between them. Figure 1 presents the research flowchart, which comprises five phases. The first phase is the literature review, which clarifies the research problem and identifies the research gap. The second phase is data collection, in which all required data are gathered before the analysis begins. The third phase is the data analysis, which consists of two stages. The first stage validates the previous travel time filtration algorithms by applying them to travel time data for one day. The second stage examines the sensitivity of the algorithm parameters for different days. Together, the two stages evaluate the previously established filtration algorithms and identify the most appropriate algorithm and parameters for filtering lane-splitting observations and outliers. The fourth phase presents the results and discussion. The last phase includes the conclusions and recommendations based on the results of the analysis.

Study Area
The data analyzed in this study is from an urban road network located near the Kuala Lumpur City Center (KLCC). There are skyscrapers in the area, including the Petronas Twin Towers, and many shopping centers, hotels, and business offices. Four MAC address sensors were installed at this road network to measure travel time. Sensor 1 is on the KL-Seremban Highway, Sensor 2 is on Istana Road, Sensor 3 is close to the U.S. Embassy on Tun Razak highway, and Sensor 4 is on Yew Road. These sensors collected the travel time data at three different routes. Figure 2 shows the locations of the sensors and the three routes. To facilitate discussion, Route A is the route between sensors 1 and 2. Route B is between sensors 1 and 4. Route C is between sensors 4 and 3. Table 2 presents the route information. Route A comprises two segments; the first segment is a part of the KL-Seremban Expressway, and the second is Istana Road. Route B has three segments, a portion of the KL-Seremban Expressway, Sungai Besi Road, and Yew Road. Route C is a part of Tun Razak highway.

Data Collection
The traffic data used in this study were collected in 2018 by Integrated Transportation Solutions Sdn. Bhd. (ITSSB) under the Proof-of-Concept (PoC) project of the Advanced Traffic Information System (ATIS). This project is a collaboration with the Integrated Transport Information System (ITIS), DBKL. ITSSB collects the travel time data by developing a system that anonymously detects, transmits, records, matches, and analyzes the MAC address sent out periodically by smartphones via Wi-Fi to measure travel time. This study employed a part of the millions of MAC address data collected during the PoC project.
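The matching step that turns detections into travel times can be sketched as follows. The ITSSB system itself is not described in detail here, so everything in this example is hypothetical: the function name, the `(mac_hash, timestamp_s)` record layout, the rule of pairing each downstream detection with the most recent upstream detection of the same device, and the `max_tt_s` cap.

```python
def match_travel_times(detections_a, detections_b, max_tt_s=3600):
    """Pair detections of the same device at sensors A and B.

    detections_a, detections_b: lists of (mac_hash, timestamp_s) tuples
    (assumed layout; identifiers are assumed to be anonymized hashes).
    Returns a list of (arrival_time_at_B_s, travel_time_s).
    """
    # Merge both streams in time order; A events sort before B events
    # that share the same timestamp.
    events = sorted(
        [(ts, 0, mac) for mac, ts in detections_a]
        + [(ts, 1, mac) for mac, ts in detections_b]
    )
    last_seen_a = {}
    matches = []
    for ts, sensor, mac in events:
        if sensor == 0:
            last_seen_a[mac] = ts  # keep the most recent upstream detection
        elif mac in last_seen_a:
            travel_time = ts - last_seen_a[mac]
            if 0 < travel_time <= max_tt_s:
                matches.append((ts, travel_time))
            del last_seen_a[mac]  # each upstream detection is used once
    return matches
```

The raw matches produced this way still contain outliers and lane-splitting observations, which is why a filtering step is needed afterwards.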

Data Description
MAC address data from an actual setting with a high percentage of motorcycles and frequent lane-splitting have not been described in previous studies. This section presents actual MAC address datasets from Malaysia to demonstrate the impact of lane-splitting on the travel time pattern. The presented datasets contain raw matched MAC address data before applying any filtering algorithm.
The travel time dataset for the study area is categorized into three observations: valid, outliers, and lane-splitting. Figure 3 presents the travel time datasets for Routes A, B, and C. In the figure, the blue points are the valid observations, the gray points are outliers, and the orange points are lane-splitting observations. The data was manually classified based on the authors' experience. Figure 3a shows the travel times for Route A. The differences between the three categories are apparent for this route. Figure 3b shows the datasets for Route B. This route has much fewer outliers and lane-splitting data because it is longer and has many intersections. It is hard to filter this type of travel time dataset. Figure 3c presents the travel time observations for Route C. The figure shows a small gap between the lane-splitting and valid observations during the morning peak hour. However, the evening peak hour exhibits a significant difference between the lane-splitting observations and the valid observations. Based on the comparison between the morning peak hour and evening peak hour, the observations in the morning peak hour that are less than 200 s are considered lane-splitting.
To illustrate the difference in travel time between passenger vehicles and motorcycles, Table 3 presents the average travel times for passenger vehicles and motorcycles during the morning peak period from 8:00 to 9:00. The difference between passenger vehicles and motorcycles is very large for all routes.

Data Analysis
This study used extensive empirical travel time data from three routes to validate the performance of several filtering algorithms in detecting outliers and lane-splitting observations. The evaluated algorithms are the percentile algorithm, mean absolute deviation algorithm, TransGuide algorithm, Dion and Rakha algorithm (version 1 and 2), and Jang algorithm. The literature review section has presented the equations for each algorithm. Evaluation of the filtering algorithms was done in two stages to identify the most appropriate algorithm and the parameters for each route. This study used the R software to analyze the effectiveness of the algorithms and to make the calculations.
Stage 1: Validation of the Previous Filtering Algorithms
The validation of the algorithms used the travel time datasets for the three routes between 00:00 and 23:59 on 28 May 2018. This day was selected because it is a weekday. There are considerable amounts of lane-splitting observations within the datasets of this day for the three routes. The values of the algorithm parameters were calibrated using a trial-and-error method to identify the best performance of each algorithm. The assessment was conducted by observing the performance of each algorithm and comparing it with that of the other algorithms using graphs. In addition, the mean absolute relative error (MARE) was used as a numerical indicator to compare the algorithms' performances.
$$MARE = \frac{1}{n} \sum_{t=1}^{n} \frac{\left| x(t) - y(t) \right|}{x(t)} \quad (27)$$
where $n$ is the number of samples, $x(t)$ is the average travel time from the ground truth data (the valid observations in Figure 3) at collection interval $t$ (five minutes), and $y(t)$ is the average travel time from a filtering algorithm at collection interval $t$ (five minutes). Travel time data collected by MAC addresses can be used as ground truth for intelligent transportation system applications [36]. In this study, the ground truth was extracted manually, as Moghaddam and Hellinga [20] did in their study.
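The MARE indicator can be computed with a few lines of code. The sketch below assumes that the ground-truth and filtered averages are stored as dictionaries keyed by collection interval, and that intervals missing from either series are skipped; the text does not say how such intervals were handled, so this is an assumption, as is the function name.

```python
def mare(ground_truth, filtered):
    """Mean absolute relative error between two series of interval averages.

    ground_truth, filtered: dicts mapping a collection-interval index to the
    average travel time in that interval (assumed layout).
    Intervals absent from either series are skipped (assumption).
    """
    common = [t for t in ground_truth
              if t in filtered and ground_truth[t] > 0]
    if not common:
        raise ValueError("no overlapping intervals to compare")
    total = sum(abs(ground_truth[t] - filtered[t]) / ground_truth[t]
                for t in common)
    return total / len(common)
```

A value of 0 means the filtered averages reproduce the ground truth exactly in every compared interval; larger values indicate that outliers or lane-splitting observations remain, or that valid observations were removed.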
Stage 2: Sensitivity Analysis of the Algorithm Parameters
The best algorithm for each route was applied to datasets from ten days to check the sensitivity of the algorithm parameters on different days. On the days that showed unacceptable results, the parameters were calibrated to determine whether the best algorithm is capable of filtering the data from all days. The parameter calibration was done using a trial-and-error method. The mean absolute relative error (MARE) was used as a numerical indicator to compare the algorithms' performance before and after calibration.
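The study calibrated the parameters by trial and error. One way such a calibration could be automated is an exhaustive search over a small grid of candidate values, keeping the combination with the lowest error. The sketch below shows that substitute technique, not the procedure used in the study; `run_filter`, `score`, and `param_grid` are hypothetical names supplied by the caller.

```python
from itertools import product


def calibrate(run_filter, score, param_grid):
    """Exhaustive grid search for the parameter set with the lowest score.

    run_filter(**params) -> filtered result for one day (caller-supplied).
    score(result)        -> error to minimize, e.g. MARE against ground truth.
    param_grid           -> dict mapping parameter name to candidate values.
    Returns (best_params, best_score).
    """
    best_params, best_score = None, float("inf")
    names = sorted(param_grid)
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        current = score(run_filter(**params))
        if current < best_score:
            best_params, best_score = params, current
    return best_params, best_score
```

A search like this still needs ground-truth averages for the day being calibrated, so it does not remove the need for daily reference data.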

Validation of the Previous Filtering Algorithm
The outlier detection algorithms discussed in the literature review section were applied to the travel time datasets for Routes A, B, and C. The algorithms are the percentile algorithm, mean absolute deviation algorithm, TransGuide algorithm, Dion and Rakha algorithm, and Jang algorithm. The validation used the travel time dataset from 00:00 to 23:59 on 28 May 2018.
Figure 4 shows the results for Route A. Figure 4a presents the valid data after applying the percentile algorithm using the 25th percentile as the lower limit and the 75th percentile as the upper limit. The algorithm detected most of the lane-splitting observations but removed a significant number of valid observations. The limits described in the literature review, the 10th percentile as the lower limit and the 90th percentile as the upper limit, did not result in good performance. The mean absolute deviation algorithm with the validity range of $M_e \pm 3\,MAD$ described in the literature review could not identify the lane-splitting data for the morning and evening peak hours and failed to detect a significant number of outliers. Thus, the validity range was modified to $M_e \pm 0.8\,MAD$. This modification had a positive impact on the performance of the algorithm, as shown in Figure 4b, even though it removed a significant number of valid observations. Figure 4c presents the result of using the TransGuide algorithm with $l_{th} = 0.5$. This algorithm did not detect lane-splitting data at the onset of the morning peak hour and in the middle of the evening peak hour. Figure 4d shows the behavior of the Dion and Rakha version 1 algorithm. The parameter values that produced the best results are β = 0.5, $n_\sigma = 2.5$, and a time window of five minutes. The algorithm did not eliminate several lane-splitting observations in the morning and evening peak periods and failed to detect many outliers in the evening peak period. Figure 4e shows the performance of the Dion and Rakha version 2 algorithm. The adopted parameter values are β = 0.5, $n_\sigma = 2.5$, a time window of five minutes, and $n_{skips} = 10$. Version 2 showed worse performance than version 1. Jang [32] proposed using α = 0.35, β = 3, and γ = 0.3 in his algorithm. However, these values did not produce good results. This study changed the values to α = 1, β = 1.5, and γ = 0.3. The best algorithm for Route A is the Jang algorithm.
Figure 5 shows the results for Route B. Figure 5a shows the result of applying the percentile algorithm using the 25th percentile as the lower limit and the 75th percentile as the upper limit. The result is not satisfactory. Figure 5b shows the result of the mean absolute deviation algorithm. Figure 5c presents the result of using the TransGuide algorithm with $l_{th} = 0.3$. This algorithm is the best for Route B because it efficiently detects the lane-splitting data and outliers. Figure 5d,e show the behavior of the Dion and Rakha version 1 and version 2 algorithms, respectively. Both algorithms showed poor performance in detecting outliers. Figure 5f shows the behavior of the Jang algorithm using the parameter values α = 0.3, β = 1, and γ = 0.3. The algorithm did not remove all lane-splitting observations in the morning peak period, yet it removed a significant number of valid travel time observations in the morning peak hour.
Figure 6 shows the results for Route C. Figure 6a shows the result for the percentile algorithm using the 25th percentile as the lower limit and the 75th percentile as the upper limit. The algorithm performed well in detecting lane-splitting data and outliers but removed a significant number of valid observations. Figure 6b shows that the mean absolute deviation algorithm with a validity range of $M_e \pm 0.5\,MAD$ also performed well in detecting lane-splitting data and outliers but removed a significant number of valid observations. Figure 6c shows that the TransGuide algorithm with $l_{th} = 0.6$ did not detect the lane-splitting data. The Dion and Rakha version 1 and version 2 algorithms failed to remove the lane-splitting data in the morning peak period and did not detect a considerable number of outliers, as shown in Figure 6d,e, respectively. Figure 6f shows the performance of the Jang algorithm with α = 0.5, β = 1, and γ = 0.3. This algorithm showed excellent performance in detecting the lane-splitting observations and outliers. Therefore, the Jang algorithm is the best for Route C.
Table 4 presents the MARE values for the three routes for 28 May. The algorithm with the smallest MARE value is the best because it has the minimum error relative to the ground truth data. The table shows that the Jang algorithm is the best for Routes A and C, while the TransGuide algorithm is the best for Route B. This finding affirms the conclusions drawn from the discussion of Figures 4-6.

Sensitivity Analysis of the Algorithm Parameters
The travel time data from ten days were used to determine whether the parameters of the best algorithm require daily calibration to achieve the best performance or whether the parameters used in the previous section are suitable for all days. The ten days are 2, 5, 8, 11, 14, 17, 20, 23, 26, and 29 May 2018. Of these, 2, 8, 14, 17, and 23 May are weekdays; 5, 20, and 26 May are weekends; 11 May is an election day; and 29 May is a holiday. This section tested only the algorithm that showed the best performance for each route.
The algorithm parameters are considered sensitive for different days if any day shows unacceptable performance, which indicates that the parameters require calibration on those days. The calibration involves modifying the values of the algorithm parameters using a trial-and-error method until the algorithm achieves its best performance. This step checks the ability of the algorithm to filter the data from all days.
The Jang algorithm with α = 1, β = 1.5, γ = 0.3, and $t_W = 5$ showed the best performance for Route A for the 28 May travel time dataset. The Jang algorithm with these parameters was applied to the ten days, and Figure 7 shows the results. Figure 7a,c,e,f,h show the datasets with poor performance, in which considerable amounts of lane-splitting observations during the morning peak period remained after filtration. It is worth noting that these figures are for the weekdays. Therefore, the Jang algorithm parameters for Route A are sensitive for datasets from different days, and daily calibration of the parameters is essential to obtain acceptable results. The Jang algorithm parameters for the datasets that showed unacceptable results in Figure 7 were calibrated using a trial-and-error method. Table 5 presents the MARE values for Route A before and after calibration for the days that needed calibration. The MARE values for the entire day after calibration were lower than before calibration for all days, indicating that the calibration of the parameters improved the performance of the Jang algorithm. For the morning peak period, the amounts of lane-splitting observations before calibration were considerable for all days, as shown in Figure 7a,c,e,f,h. As such, the MARE values for 8:00-9:00 before calibration were high, as presented in Table 5. The MARE values for 8:00-9:00 after calibration were much lower than before calibration for all days. This indicates that the calibration of the parameters greatly improved the performance of the Jang algorithm during this period. The Jang algorithm has four parameters, α, β, γ, and $t_W$. Table 6 shows the values of the Jang algorithm's parameters after calibration for the days that showed unacceptable performance before calibration for Route A. The parameters α, β, and $t_W$ are sensitive, while γ is insensitive. Four of the five days share the same parameters, so there are two new parameter sets after calibration: α = 0.5, β = 0.5, γ = 0.3, and $t_W = 1$ for 8 May, and α = 0.5, β = 1, γ = 0.3, and $t_W = 5$ for 2, 14, 17, and 23 May. $t_W$ differs on just one day, indicating that $t_W$ is less sensitive than α and β.
The TransGuide algorithm with $l_{th} = 0.3$ and $t_W = 5$ is the best algorithm for filtering the datasets for Route B for 28 May. The data for the ten days were filtered using the TransGuide algorithm with these parameters. Figure 8 shows the performance of the TransGuide algorithm for Route B for the ten days. Figure 8a-c,e,h,i show the datasets with poor performance. These datasets represent four weekdays and two weekends. The TransGuide algorithm clearly failed to follow the valid observations at the peak periods. As such, the parameters of the TransGuide algorithm for Route B are sensitive for datasets from different days and require calibration for each day. Table 7 presents the MARE for Route B before and after calibration for the days that needed calibration. The MARE values for all days after calibration were much lower than before calibration, indicating that the calibration of the parameters greatly improved the performance of the TransGuide algorithm. Table 8 shows the values of the TransGuide algorithm's parameters after calibration for the days that showed unacceptable performance before calibration for Route B. Both parameters, $l_{th}$ and $t_W$, are sensitive. There are five new parameter sets after the calibration. The best parameters for 2 and 5 May are $l_{th} = 0.5$ and $t_W = 5$. The remaining days have different parameter sets.
The Jang algorithm with α = 1, β = 1, γ = 0.3, and $t_W = 5$ is the best algorithm for the Route C travel time data for 28 May and was applied to the ten days. Figure 9 shows the Jang algorithm performance for Route C for the ten days. Figure 9h shows the poor performance for the data for 23 May, which is a weekday. The Jang algorithm parameters are sensitive because one dataset for Route C showed poor performance. Therefore, the parameters require daily calibration to ensure good performance. Table 9 presents the MARE for Route C before and after calibration for 23 May. The MARE value after calibration was lower than before calibration, indicating that the calibration of the parameters improved the performance of the Jang algorithm. For the evening peak period, the number of lane-splitting observations before calibration was considerable, as shown in Figure 9h. As such, the MARE value for 17:00-18:00 before calibration (0.544) was very high, as presented in Table 9, and the value after calibration (0.039) was much lower. This indicates that the calibration of the parameters greatly improved the performance of the Jang algorithm during this period. Table 10 shows the values of the Jang algorithm's parameters after calibration for the day that showed unacceptable performance before calibration for Route C. The table shows that α is sensitive, but the other parameters are insensitive. Only one parameter for one day is sensitive for Route C. Therefore, the sensitivity of Route C is lower than that of Routes A and B.
Table 11 summarizes the evaluation of the filtering algorithms and the sensitivity of the algorithm parameters. The Jang algorithm is the best for Routes A and C, while the TransGuide algorithm is the best for Route B.
The number of days with poor performance before calibration and the number of new parameter sets after calibration are used to compare the routes in terms of the sensitivity of the algorithm parameters. Route C is the least sensitive, with only one day of poor performance before calibration. Route B is more sensitive than Routes A and C because it has the highest number of days with poor performance before calibration and the highest number of new parameter sets after calibration. Because Routes A and C are less sensitive than Route B, the Jang algorithm is less sensitive than the TransGuide algorithm.
The route length is the distance between the two Wi-Fi sensors at the start and the end of the route. The Pearson correlation coefficient, effect size, and coefficient of determination are calculated to test the relationship between the distance between the sensors and the number of travel time observations. Table 12 shows that the effect size (d) is consistent with the coefficient of determination (R²), since both R² (0.93) and the magnitude of d (−7.47) are very large based on Cohen's standard [37]. In addition, the slope of the trend line in Figure 10 and the sign of d are both negative. Thus, there is a very strong negative correlation between the distance between the sensors and the number of observations. This indicates that the distance between the sensors has to be shortened to obtain more observations.
Based on Table 11 and Figure 10, it can be concluded that a larger number of observations makes the Jang algorithm the best filtering algorithm: it is the best algorithm for Routes A and C, which have more observations than Route B. Concerning the sensitivity of the algorithm parameters, Route C is the least sensitive and has the highest number of observations, Route B is the most sensitive and has the lowest number of observations, and Route A lies between the two on both counts. Thus, the sensitivity of the algorithm parameters is inversely proportional to the number of observations. This means that the algorithms are highly sensitive when the number of observations is low, which makes filtration difficult. This conclusion agrees with the argument of Dion and Rakha [31] about the difficulty of filtering travel time datasets that have a low sampling rate, especially the difficulty of tracking sudden changes in traffic conditions. Because the distance between sensors (or the route length) is inversely proportional to the number of observations, the sensitivity of the algorithm parameters is directly proportional to the distance between sensors.
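The relationship between the correlation coefficient, R², and the effect size d reported above can be checked with a short sketch. The source does not state which formula produced d, so the common conversion d = 2r / √(1 − r²) is an assumption here; under it, an R² of about 0.933 (which rounds to 0.93) with a negative slope gives d close to −7.47. The function names are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def r_to_cohens_d(r):
    """Assumed conversion d = 2r / sqrt(1 - r^2); undefined for |r| = 1."""
    if abs(r) >= 1:
        raise ValueError("conversion requires |r| < 1")
    return 2 * r / math.sqrt(1 - r * r)


def cohens_d_to_r(d):
    """Inverse of the assumed conversion: r = d / sqrt(d^2 + 4)."""
    return d / math.sqrt(d * d + 4)


# Recover R^2 from the reported effect size under the assumed conversion.
r_from_d = cohens_d_to_r(-7.47)
print(f"r = {r_from_d:.4f}, R^2 = {r_from_d ** 2:.4f}")
```

With only three routes, `pearson_r` would be computed from three (distance, observation count) pairs, which is why the text recommends more routes before relying on these statistics.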

Conclusions
This paper has addressed methods for filtering travel time data for Malaysian roads, where lane-splitting is common. The two main problems with the datasets collected by MAC address are the outliers and the lane-splitting observations. The literature proposes many travel time filtering algorithms for removing outliers from travel time datasets. Evaluating these algorithms with lane-splitting taken into account yielded the following findings.

•
The Jang algorithm and TransGuide algorithm are effective in filtering the outliers and lane-splitting data. The Jang algorithm is the best algorithm for Routes A and C, while the TransGuide algorithm is the best algorithm for Route B.

•
The parameters of the Jang and TransGuide algorithms are sensitive to the day of data collection, so the algorithms should be used with daily parameter calibration.

•
Since Routes C and A are less sensitive than Route B, the Jang algorithm is less sensitive than the TransGuide algorithm.

•
The distance between sensors and the number of observations for the study area are inversely proportional.

•
It can be concluded that an increase in the number of observations makes the Jang algorithm the best filtering algorithm because the Jang algorithm is the best algorithm for Routes A and C, which have more observations than Route B.

•
The sensitivity of the algorithm parameters is inversely proportional to the number of observations. Given the inverse proportion of the distance between sensors and the number of observations, the algorithms' sensitivity is directly proportional to the distance between sensors.
In conclusion, although many filtering algorithms exist, their suitability depends on the characteristics of the travel time data. Two of the five well-known algorithms investigated in this study showed promising results. Although these two algorithms could filter the outliers and lane-splitting observations, they require considerable improvement to resolve the sensitivity issue, which would make the filtration algorithms easier to apply in real-time applications. In addition, it is recommended to use travel time data from more than three routes to investigate the statistical relationships reported in this study.
However, because of Malaysia's full and partial COVID-19 lockdowns, the authors could not collect recent traffic data. Undoubtedly, traffic is highly affected by these lockdowns. This was the main reason why this research used archived data representing the actual traffic conditions of May 2018. Moreover, other traffic parameters for the study area, such as traffic flow, were not collected during May 2018.