Dwell Time Estimation Using Real-Time Train Operation and Smart Card-Based Passenger Data: A Case Study in Seoul, South Korea

Dwell time is a critical factor in constructing and adjusting railway timetables for efficient and accurate railway operation. This paper develops dwell time estimation models for the Shinbundang line (S line) in Seoul, South Korea, using support vector regression (SVR), multiple linear regression (MLR), and random forest (RF) techniques applied to archived real-time metro operation data and smart card-based passenger information. In the first phase of this research, the collected data are processed to extract boarding and alighting passenger counts and observed dwell times of each train at all stations of the S line under the current operational environment. In the second phase, we develop SVR, MLR, and RF-based dwell time estimation models. It is found that the SVR-based model estimates the dwell times to within 10 s of the observed values for 84.4% of the observed data. The results of this paper are especially beneficial for autonomous railway operations, which must construct and maintain dynamic railway timetables and therefore require reliable real-time dwell time predictions.


Introduction
The relatively reliable schedule adherence of railway systems is one of the most attractive merits of the mode for passengers [1]. However, operating railways as reliably as passengers perceive them to be remains a challenge. Railway schedules are constructed based on passenger demand, although the arrival patterns of passengers at stations are neither uniform nor deterministic in real life. If a particular train's arrival at a station is delayed, the additional passengers who accumulate in the meantime will extend the boarding and alighting times when the train dwells at the station, and this event will further propagate delays to subsequent trains upstream.
Railway operators and authorities have considered different strategies when determining train schedules with respect to stations with expected passenger demands. They employ a "running time supplement" [2,3], a buffering time that can help to recover from delays if they occur. To minimize the delay, larger railway operators often control the passenger demand with a more direct approach of employing dedicated personnel on the platforms at stations, who can restrict passengers from boarding trains when they are delayed. There have been various studies in predicting delays and analyzing

Matching the Automatic Fare Collection Data with the Real-Time Train Operation Data
This paper develops a dwell time estimation model using boarding and alighting passenger counts, and onboard passenger counts between stations for each train. The initial dataset for analysis was prepared according to the method suggested by Hong et al. [10] based on the smart card passenger information and archived real-time train operation data. This dataset includes the number of trains operating and their current status with respect to their nearest station as one of 3 stages ("approaching," "arrival," and "departure"), along with timestamps for when the current stage was acquired. The approaching status is assigned as soon as the arriving train passes a balise located 1000 m upstream of the associated station. The arrival status is assigned as soon as the train passes a balise located 400 m upstream. The status changes to departure when the train passes a balise 200 m downstream. The smart card-based passenger information includes the serial number of the card, boarding station, boarding time, alighting station, alighting time, transfer station, and transfer time. It is noted that the time associated with boarding, alighting, and transfer refers to the time when the passenger checks in or out at the station gates.
In this paper, both archived real-time train operation data and smart card-based passenger data from the Shinbundang line (S line) on 31 October 2017 were used. The S line has concentrated commuting passenger demand during both the morning and afternoon peak periods, with a dominant direction for the majority of passengers: towards the CBD in the morning and towards Bundang in the afternoon (see Figure 1). In this study, to show the morning peak clearly with maximum crowdedness, the S line's Central Business District (CBD)-bound direction towards Gangnam station was selected for the analysis. The S line has a total length of 31.29 km and consists of 12 stations, of which 4 allow inter-line transfers. The fleet size is 20 trains, and they run 327 cycles and 271 cycles in total on weekdays and weekend days, respectively, mostly over the entire route, except for the first and last trains each day. There is only 1 class of tickets, and daily ridership on the S line was roughly 247,000 as of 2017. For train operation data, arrival and departure times were identified for each train and each station. However, in cases where departure data were missing, average dwell times based on visual observations with video cameras were added to the arrival times to estimate the missing departure times.
Appl. Sci. 2020, 10, 476
Figure 2 shows the process of matching the smart card passenger information with the archived real-time train operation data. The process starts with identifying a passenger who has alighted at a station. Then, the train that arrived at the station most recently is assumed to be the one that the passenger got off from. The departure time of that train from the station where the passenger had boarded earlier is then found. If the boarding time is earlier than the departure time of the train, the train number is assigned to that passenger, assuming that the passenger boarded and alighted from that train. If the matching process does not succeed, the data from the smart card are not used, since there may be logical errors in the data. This is reasonable, because there is no way to reduce the travel time by overtaking between trains, or transferring through the other lines, since the S line has a single class of train. The process stops when the number of smart card records that have been processed is equal to the total number of smart card records, N.
Figure 3 shows smart card-based passenger data and train operation data from the S07 station to the S11 station in a passenger entry-exit map suggested by Hong et al. [10]. The x-axis represents the time at the S07 station, and the y-axis represents the time at the S11 station. Circles, triangles, and plus signs represent groups of smart card records of individual passengers who completed their trips from S07 to S11, with the associated boarding and alighting times at those stations. Train number 477 arrives at S07 at 7:25:56 and at S11 at 7:38:02. Any potential passengers who were on train 477 during the trip must lie in the second quadrant formed by the horizontal and vertical lines intersecting at the point for train 477. If the passengers used train 477, they must pass the boarding gate at S07 station before train 477 arrives at S07 station, and must also alight after the train arrives at S11 station. Therefore, the group of passengers surrounded by the dotted lines roughly represents all of the potential users of train 477 from S07 to S11 station. It is noted that there were some passengers who boarded and alighted at the same stations, and these were excluded from the analysis.
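The matching procedure described for Figure 2 can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name, the dictionary layout of the trip records, the use of seconds since midnight, and the sample departure times are all assumptions made for the example. Only the arrival of train 477 at S11 (7:38:02) is taken from the text.

```python
from bisect import bisect_right


def hms(hours, minutes, seconds):
    """Convert a clock time to seconds since midnight."""
    return hours * 3600 + minutes * 60 + seconds


def match_passengers_to_trains(trips, arrivals, departures):
    """Assign a train number to each smart card trip, or skip the trip.

    trips:      list of dicts with keys card, board_station, board_time,
                alight_station, alight_time (times in seconds)
    arrivals:   {station: list of (arrival_time, train_no) sorted by time}
    departures: {(train_no, station): departure_time}
    """
    matched = {}
    for trip in trips:
        # Trips that start and end at the same station are excluded.
        if trip["board_station"] == trip["alight_station"]:
            continue
        station_arrivals = arrivals.get(trip["alight_station"], [])
        # Most recent train to arrive at the alighting station before gate-out.
        idx = bisect_right(station_arrivals, (trip["alight_time"], float("inf"))) - 1
        if idx < 0:
            continue
        _, train_no = station_arrivals[idx]
        # The passenger must have passed the gate before that train departed.
        departure = departures.get((train_no, trip["board_station"]))
        if departure is not None and trip["board_time"] < departure:
            matched[trip["card"]] = train_no
        # Otherwise the record is dropped as a possible logical error.
    return matched


if __name__ == "__main__":
    # Hypothetical sample data; only 7:38:02 for train 477 comes from the text.
    arrivals = {"S11": [(hms(7, 34, 0), 475), (hms(7, 38, 2), 477)]}
    departures = {(475, "S07"): hms(7, 22, 30), (477, "S07"): hms(7, 26, 30)}
    trips = [
        {"card": "A", "board_station": "S07", "board_time": hms(7, 24, 0),
         "alight_station": "S11", "alight_time": hms(7, 39, 10)},
    ]
    print(match_passengers_to_trains(trips, arrivals, departures))
```

A sorted arrival list with a binary search keeps the lookup of the most recent train cheap when N is large; a linear scan would give the same result.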
Figure 4 shows the number of passengers boarding, alighting, and arriving onboard at each station on the S line. Trains depart from the right towards the left, and each line in each graph represents a train. Red lines represent trains operated during the morning peak hours from 6:30 a.m. to 10:30 a.m., and blue lines represent trains run during the afternoon peak hours from 6:00 p.m. to 9:00 p.m. Trains run outside of peak hours are represented by gray lines.
When observed from the right in Figure 4, starting with the first station S01, the morning-peak boarding counts from S01 to S08 stations, inclusive, are significantly higher than those of the rest of the stations combined. This is due to the fact that the regions covered by stations S01 to S08 are dedicated residential areas (bed towns) of the Greater Seoul Area (GSA), S09 and S10 are located on the outskirts of Seoul, and S11 and S12 are located in the CBD of the Gangnam area of Seoul. During the evening peak period, boarding counts at the S08 station stand out, as there is also a large concentrated commercial area called "Pangyo Techno Valley". This area is also related to the high number of alighting passengers at S08 in the morning.
During the morning peak-hour period, at stations from S01 to S06, inclusive, and S09, there are nearly zero alighting passengers. At S07 and S08 stations, some passengers alight, while most of the passengers alight at S11 and S12 stations in the CBD. There are relatively high numbers of alighting passengers at S07 and S08 compared to other residential areas because there are concentrations of office buildings near those stations. Note that S07 and S08 also serve as transfer stations connected with other metro lines. In the case of station S09, there are only residential facilities near the station, hence resulting in very low alighting counts. In the afternoon-peak and non-peak periods, most alighting counts occur at S11 and S12 stations.
Due to the various characteristics mentioned earlier, the onboard passenger counts, as shown in Figure 4c, continue to increase as trains approach the final station S12 in the CBD. It is also noted that most trains in the morning-peak period are filled to capacity, and even exceed capacity by a factor of 2, between the S07 and S11 stations.

Estimating Observed Dwell Time of Train When Departure Information Is Missing
This paper utilizes the real-time operation data of the metro trains. The collected data have relatively well-recorded arrival times of trains, while often missing the departure times. Therefore, in cases where the departure times are missing, the observed dwell time for each train at each station was estimated by using typical travel times between stations.
As illustrated in Figure 5, the train arrival times at stations S and S + 1 are easily determined from the train operation data. The arrival time refers to the time when a train passes 400 m upstream of the associated station. The time difference between the 2 stations consists of the braking time at station S (B_S), the dwell time at station S (DW_S), and the travel time between station S and the location 400 m upstream of station S + 1. If we assume the braking times are similar for all stations, the braking time at station S (B_S) can be subtracted from the time difference between the two stations S and S + 1 in order to estimate the actual dwell time. The S line, from which the data were collected, is particularly suitable for such assumptions, because it is autonomously operated, has constant braking times at stations with a large headway of 4 min, and has a relatively low chance of congestion caused by trains downstream. The assumption of constant braking times was verified visually with video cameras installed on all trains. Figure 6 shows the time difference between arrivals of trains at all stations (solid black and red lines), and the average travel times visually observed from trains equipped with video cameras (blue dotted line). The difference between TT_S and TD_S, denoted DW_S, is the estimated dwell time for each train at each station in the case of missing departure data. It is noted that the red lines represent the time difference between arrivals at stations for the first 5 trains in the early morning period, when there are no delays, while the black lines represent all other trains.
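The decomposition above can be written out as a short calculation. The sketch below assumes all times are in seconds; the function and parameter names are hypothetical, and the sample values are invented for illustration rather than taken from the S line data.

```python
def estimate_missing_dwell(arrival_s, arrival_next, braking_time, run_time):
    """Estimate the dwell time at station S when its departure time is missing.

    arrival_s:    time the train passed the balise 400 m upstream of station S
    arrival_next: time the train passed the balise 400 m upstream of station S + 1
    braking_time: assumed constant braking time at station S (B_S)
    run_time:     typical travel time from station S to the balise of S + 1
    """
    # Gap between the two arrival events = braking + dwell + running.
    arrival_gap = arrival_next - arrival_s
    return arrival_gap - braking_time - run_time


if __name__ == "__main__":
    # Hypothetical values: 150 s between arrivals, 25 s braking, 95 s running.
    print(estimate_missing_dwell(0, 150, 25, 95))
```

Because the braking and running components are treated as constants, any variation in them is absorbed into the estimated dwell time, which is why the constant-braking assumption matters.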
Figure 7 shows the relationship between the estimated observed dwell times for all trains at all stations and the sum of boarding and alighting passenger counts. The x-axis represents the sum of boarding and alighting passengers, while the y-axis represents the estimated observed dwell times. Each symbol represents a station, as illustrated in Table 1. The minimum observed dwell time is 16 s. The variability of dwell times is found to be larger when the passenger counts are smaller, and smaller when the passenger counts are larger. In addition, it is found that the minimum dwell time increases as the passenger counts increase. It is interesting to note that dwell times at certain stations appear to be constant regardless of passenger counts. This is a result of enforced dwell times at some major stations, assisted by dedicated employees on the platforms who limit the number of boarding passengers.

Dwell Time Estimation Model
In this second phase of research, we utilize the results from the first phase, including the observed dwell times (which required estimations when the departure information was missing), boarding and alighting passenger counts for all trains at all stations, in order to develop dwell time estimation models. The proposed models estimate dwell times at all stations with given boarding and alighting passenger counts and onboard passenger counts on arriving trains.
Boarding, alighting, and onboard passenger counts for each train at each station are set as the independent variables, and the dwell time is set as the dependent variable. This is intuitive, as dwell times are mainly affected by the boarding and alighting counts, while the onboard crowdedness in trains affects them indirectly. The dependent variable, dwell time, was extracted from the observed data as described in Section 3. Support Vector Regression (SVR), Multiple Linear Regression (MLR), and Random Forest (RF) methods were used to develop 3 different estimation models, and their estimation performances were compared. Seventy percent of the data were used as a training set to develop the models, and 30% of the data were used as a validation set.
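As a rough illustration of this setup, the sketch below performs a 70/30 split and fits only the MLR model by ordinary least squares, using plain Python. It is not the authors' implementation: the source does not state how the split was drawn, so the seeded random shuffle is an assumption, as are the row layout (boarding, alighting, onboard, dwell in seconds) and all function names. The SVR and RF models would normally come from a machine learning library and are not reproduced here.

```python
import random


def train_validation_split(rows, train_share=0.7, seed=0):
    """Shuffle rows reproducibly and split them into training and validation lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_share * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(a[i]) + [b[i]] for i in range(n)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system; features may be collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_mlr(rows):
    """Fit dwell = c0 + c1*boarding + c2*alighting + c3*onboard by least squares."""
    xs = [[1.0, r[0], r[1], r[2]] for r in rows]
    ys = [r[3] for r in rows]
    k = 4
    # Normal equations: (X^T X) c = X^T y
    xtx = [[sum(x[i] * x[j] for x in xs) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * y for x, y in zip(xs, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict_mlr(coef, boarding, alighting, onboard):
    """Estimated dwell time in seconds for one train at one station."""
    return coef[0] + coef[1] * boarding + coef[2] * alighting + coef[3] * onboard
```

The fitted coefficients would be obtained from the training list only, with the validation list held back for the accuracy comparison.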
To compare the different models, performance was evaluated under 3 accuracy scenarios, in which the difference between the actual and estimated dwell times is less than 3, 5, and 10 s. In the validation set, the SVR model estimates 53.6% of cases with less than 3 s of error and 67.2% with less than 5 s of error. On the other hand, 87.5% of the RF model's estimates in the validation set have less than 10 s of error, as shown in Table 2. It is noted that the dwell times used for training are integer values while the dwell times estimated by the models are real numbers, and distinguishing between the 3 scenarios, which are separated by only a few seconds, may be practically insignificant in real-life metro operations.
When compared from the perspective of percentage accuracy, the SVR model performed the best in the case of less than 30% error, followed by the RF and MLR models, as seen in Table 3. Figures 8 and 9 show the comparison of dwell time estimation results derived from each model. The orange squares represent the trains operated during the morning peak hour, the blue circles represent the trains operated during the evening peak hour, and the black triangles represent the trains operated outside the two rush-hour periods.
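The two accuracy measures used in this comparison, the share of estimates within a fixed number of seconds and the share within a percentage of the observed dwell time, can be computed along the following lines. This is a sketch under stated assumptions: the function names are hypothetical, the strict "less than" comparison follows the wording above, and the sample values are invented rather than drawn from Tables 2 and 3.

```python
def share_within_seconds(observed, estimated, threshold_s):
    """Percentage of cases whose absolute error is below threshold_s seconds."""
    hits = sum(1 for o, e in zip(observed, estimated) if abs(o - e) < threshold_s)
    return 100.0 * hits / len(observed)


def share_within_percent(observed, estimated, threshold_pct):
    """Percentage of cases whose error relative to the observed dwell time
    is below threshold_pct percent (observed values must be non-zero)."""
    hits = sum(
        1 for o, e in zip(observed, estimated)
        if abs(o - e) / o * 100.0 < threshold_pct
    )
    return 100.0 * hits / len(observed)


if __name__ == "__main__":
    # Hypothetical observed (integer) and estimated (real-valued) dwell times.
    observed = [20, 30, 40, 25]
    estimated = [21.5, 34.0, 53.0, 25.0]
    for limit in (3, 5, 10):
        print(limit, share_within_seconds(observed, estimated, limit))
    print(share_within_percent(observed, estimated, 30))
```

Reporting both measures is useful because a fixed error in seconds weighs more heavily on short dwell times than on long ones.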

When compared from the perspective of percentage accuracy, the SVR model performed the best in the case of less than 30% errors followed by RF and MLR models as seen in Table 3. Figures 8 and 9 show the comparison of dwell time estimation results derived from each model. The orange square represents the trains operated during the morning peak hour, the blue circle represents the trains operated during the evening peak hour, and the black triangle represents the trains operated outside the two rush-hour periods.
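The two accuracy measures used above, the share of estimates within an absolute error threshold and the share within a percentage error threshold, can be computed as in the sketch below. The function names and sample values are hypothetical, and defining percentage error relative to the observed dwell time is an assumption; the text does not spell out the denominator.

```python
def share_within_seconds(observed, estimated, threshold_s):
    """Fraction of estimates whose absolute error is below threshold_s seconds."""
    errors = [abs(o - e) for o, e in zip(observed, estimated)]
    return sum(err < threshold_s for err in errors) / len(errors)


def share_within_percent(observed, estimated, threshold_pct):
    """Fraction of estimates whose error is below threshold_pct percent.

    Assumption: error is measured relative to the observed dwell time.
    """
    ratios = [abs(o - e) / o for o, e in zip(observed, estimated)]
    return sum(r * 100 < threshold_pct for r in ratios) / len(ratios)


# Made-up values: observed dwell times are integers, estimates are real numbers.
observed = [30, 25, 40, 35, 28]
estimated = [31.2, 29.5, 33.0, 35.4, 40.0]

for threshold in (3, 5, 10):
    share = share_within_seconds(observed, estimated, threshold)
    print(f"< {threshold} s: {share:.1%}")
print(f"< 30 %: {share_within_percent(observed, estimated, 30):.1%}")
```

Because the shares are cumulative, the value for the 10 s scenario can never be lower than the value for the 5 s or 3 s scenario for the same model.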

To further enhance the 3-variable SVR model (with boarding, alighting, and onboard passenger counts as independent variables), which was found to be the most accurate of the 3 models, we developed SVR models with 3 additional input variables: train arrival time, station information, and the time difference between arrivals at 2 consecutive stations. As shown in Tables 4 and 5, the additional input variables further improved the performance of the SVR model. In particular, accuracy in the scenario with less than 5 s of error improved by about 20 percentage points. Most notably, the enhanced SVR model reached 97.2% accuracy in the scenario with less than 10 s of error. Figure 10 compares the estimation performance of the 6-variable model with that of the 3-variable model.
Figure 10. Estimation performance of the 6-variable and the 3-variable SVR model.
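One way to assemble the 6 input variables of the enhanced model is sketched below. The encodings are assumptions made for illustration: the text names the variables but not how they are represented. Here the arrival time is taken as seconds since midnight, the station as an integer index, and the time difference as the gap between the same train's arrivals at the previous and current stations. The function names, timestamps, and values are hypothetical.

```python
from datetime import datetime


def seconds_since_midnight(ts: datetime) -> int:
    """Encode a clock time as an integer number of seconds (assumed encoding)."""
    return ts.hour * 3600 + ts.minute * 60 + ts.second


def six_variable_features(boarding, alighting, onboard,
                          arrival, prev_arrival, station_index):
    """Build the input vector: 3 passenger counts plus 3 additional variables."""
    return [
        boarding,
        alighting,
        onboard,
        seconds_since_midnight(arrival),           # train arrival time
        station_index,                             # station information
        (arrival - prev_arrival).total_seconds(),  # gap between consecutive arrivals
    ]


# Made-up example stop; not an S line observation.
features = six_variable_features(
    boarding=12,
    alighting=8,
    onboard=150,
    arrival=datetime(2019, 1, 7, 8, 15, 30),
    prev_arrival=datetime(2019, 1, 7, 8, 13, 10),
    station_index=11,
)
print(features)
```

An integer station index imposes an ordering on stations; a one-hot encoding would be an alternative if that ordering is not meaningful for the model.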

Conclusions
This paper develops railway dwell time estimation models using SVR, RF, and MLR methods. In the first phase, smart card-based passenger information is matched against the real-time train operation data from the S line of the Seoul Metro in Seoul, South Korea, to extract the boarding, alighting, and onboard passenger counts and the observed dwell times for all trains at all stations. When the train departure times were missing, an assumption of constant braking time was adopted to estimate the observed dwell times, since the S line is autonomously operated with minimal variability in braking times.
In the second phase, the paper develops dwell time estimation models using the information extracted in the first phase. SVR, RF, and MLR-based models were developed with boarding, alighting, and onboard passenger counts as independent variables and the dwell time as the dependent variable. In the comparative scenarios with less than 3, 5, and 10 s of error between the estimated and observed values, the SVR model performed the best, with an accuracy of 67.2% in the scenario with less than 5 s of error. The SVR model was then enhanced with additional independent variables, including arrival times and the time difference between arrivals at 2 consecutive stations, for all trains at all stations. The enhanced SVR model improved the accuracy by about 20 percentage points, from 67.2% to 87.6%, for the scenario with less than 5 s of error. In the scenario with less than 10 s of error, the enhanced SVR model reached 97.2% accuracy.
This research is unique in two respects. First, it extracts boarding, alighting, and onboard passenger counts from real-life metro operation data and smart card-based passenger information for all trains at all stations on an urban metro line. Second, it develops dwell time estimation models with high accuracy that are validated against real-life data. The results of this paper are especially beneficial for autonomous railway operation, in which constructing and maintaining dynamic railway timetables requires reliable real-time dwell time predictions. However, the estimation models may not work well when incidents such as rolling stock failures, or big events (e.g., sports games or exhibitions) near metro stations, increase the demand dramatically and present unseen data to the models. Enhancing the proposed estimation models to cover not only a general commute-based operating environment but also situations affected by such special events is a topic for future study.