Identification of Inelastic Subway Trips Based on Weekly Station Sequence Data: An Example from the Beijing Subway

Abstract: Urban rail transit has become an indispensable option for Beijing residents. Subway inelastic users (SIUs) are the main component among all users. Understanding the proportion of SIUs and their characteristics is important in developing service promotions and helps subway agencies make marketing policies. This paper proposes a novel and simple process for identifying regular subway inelastic trips (SITs) in order to distinguish SITs from non-SITs and extract their characteristics. Weekly station sequence (WSS) is selected as the data format, principles of SIUs are discussed and chosen, and the framework of SIT identification is applied to a large weekly sample from the Beijing Subway. A revealed preference (RP) survey and results analysis are undertaken to estimate the performance of the proposed methods. The RP survey validation shows that accuracy reaches as high as 94%, and the distribution analysis of SITs and their origin-destinations (ODs) indicates that the SIT characteristics extracted are consistent with the situation in Beijing. The proportion of SIUs is stable on workdays and is more than 80% during rush hour. The efforts described in this paper can provide subway managers with a useful and convenient method to understand the characteristics of subway passengers and the performance of a subway system.


Introduction
Urban rail transit has become an indispensable option for daily travel in China, especially for commuters in metropolises such as Beijing and Shanghai [1,2]. In 2015, passenger traffic on the Beijing Subway was 3.32 billion boardings, and the average daily passenger volume reached 9.11 million boardings [3]. With millions of people choosing rail transit as their primary travel mode, congestion occurs during peak hours. The Beijing Subway adopted a low-fare policy in 2007 to attract car commuters and offer commuters a convenient and comfortable means of transportation. However, in a sample survey during 2012, managers of the Beijing Subway found that commuters accounted for only about 60% of peak trips [4], which partly led to a fare increase in 2014 to reduce the number of noncommuters in rush hour. Thus, an awareness of passenger classifications during peak hours, especially commuters, is needed for managers to make informed decisions [5,6].
[...] passenger flow volume [30-32]. Ridership statistics and performance indicators mentioned in these studies offer helpful information to determine the principles of SIUs used in this paper.
Automatic fare collection (AFC) has provided transit agencies with huge amounts of operational data, which are widely recognized as having the potential to serve functions beyond the designated purpose of revenue management [9,16,23]. Based on AFC data mining, it is possible to understand SIUs and their behaviors at lower labor cost, and efforts can be made to enhance the level of service (LOS) of the subway system to retain its major users. For subway agencies, the data mining results can be helpful in marketing, encouraging non-SIUs to shift their travel times to less busy hours, and implementing special fare policies for peak travel. Furthermore, defining regular SIUs provides basic and important data for subway planning programs.
In line with understanding SIUs' travel patterns, the contribution of this work is twofold. From a methodological perspective, we provide a novel and simple framework for classifying regular subway inelastic trips (SITs) based on three principles, which measure the regularity of SITs spatially and temporally. From an empirical perspective, we analyze and expose the nature of SITs in the Beijing Subway system. The data mining framework is used to understand SIT and non-SIT characteristics from a typical week in 2016 and a similar period in 2017.
The remainder of this paper is organized as follows. Section 2 discusses the data foundation of the Beijing Subway. Section 3 provides an overview of the methodology, and Section 4 describes its application and validation in the Beijing Subway. Finally, the conclusions and limitations of the work are discussed in Section 5.

Data Foundation of the Beijing Subway
In the case of the Beijing Subway, an AFC system was adopted around the time of the 2008 Olympics. Subway users need to swipe a smart card both inbound and outbound, and transfers do not require reswiping, with the exception of the airport line [1,9]. Beijing Subway AFC data contain 37 fields, 7 of which are extracted in this paper [22]. Transaction records collected for use in this paper comprise the user's card number, entry and exit times, inbound and outbound lines, and stations. Table 1 shows an example of a transaction record.
As of 2017, the Beijing Subway had a total mileage of 574 km, 19 operating lines, and 276 stations. A subway map is shown in Figure 1. To validate the accuracy of the proposed methods, AFC data were collected in two typical weeks, from 16 October to 22 October in both 2016 and 2017. There are no special holidays within 7 days before or after the selected weeks; therefore, the data are more likely to accurately reflect the daily travel characteristics of users. There are more than 7 million smart cards and more than 32 million transaction records in a week. Figure 2 shows the distribution of Beijing Subway smart cards and transaction records in 2016. The average number of trips (number of transaction records divided by number of smart cards) a subway user makes in a day is about 1.74 on workdays, which is extremely stable from Monday to Thursday and a little higher on Friday [22,32].

Methodology
In addition to the growth in passenger traffic attracted by the construction of new lines, fixed user groups are the key to a stable passenger flow in the Beijing Subway system [3,4]. This paper focuses on regular SIU identification to understand the main composition of passenger types. The data mining process is presented in 3 steps: (1) Section 3.1 presents the data preprocessing, by which the data format used in the study is prepared; (2) Section 3.2 presents the principal characteristics of regular SIUs and SITs, drawing on conclusions from previous studies; and (3) Section 3.3 presents the data processing framework of SITs for detecting every principal characteristic proposed in Section 3.2.

Weekly Station Sequence (WSS)
Weekly station sequence (WSS) is selected as the data format in this paper. It refers to a string of numbers consisting of the station codes (SCs) that a user visits within a week. The WSS records the user's inbound and outbound station numbers, which are assigned by checking whether each station is the same as a previous one. Figure 3 shows the data mining procedure for extracting WSS from weekly transaction records of the AFC data system. The details of the WSS tagging process are shown in Table 2.
[Figure 3 and Table 2: flowchart and steps of the WSS tagging process. Recoverable decision nodes: "Have all records of Card Number i been tagged?" (Yes → output WSS and go to Step 7; No → Step 4) and "Have all AFC records been tagged?" (Yes → End; No → Step 3).]
Taking the transaction records in Table 1 as an example, the data mining results of WSS are computed in the array format shown in Figure 4. A single station sequence (SSS) represents a single trip from origin to destination, and the SSSs of one day constitute a daily station sequence (DSS). Figure 4 shows that the user traveled between the first station and the second station from Day 1 to Day 4, and did not return by subway on Day 5. Furthermore, in WSS format, we adopt an arrow to indicate a trip and a plus sign to distinguish different trips. Therefore, WSS contains information about the selected stations, which is useful for detecting the regular stations and trips.
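The tagging idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the record layout `(day, entry_station, exit_station)`, the function name `build_wss`, the station names, the rule that station codes are assigned in order of first appearance, and the semicolon used to separate days are all assumptions made for illustration.

```python
from collections import defaultdict

def build_wss(records):
    """Build a weekly station sequence (WSS) for one smart card.

    records: iterable of (day, entry_station, exit_station) tuples,
             assumed to be sorted by entry time (hypothetical layout).
    Returns (wss_string, station_codes).
    """
    codes = {}                # station name -> station code (SC)
    days = defaultdict(list)  # day -> list of single station sequences (SSS)
    for day, origin, dest in records:
        for station in (origin, dest):
            # A station seen before keeps its code; a new one gets the next number.
            codes.setdefault(station, len(codes) + 1)
        days[day].append(f"{codes[origin]}→{codes[dest]}")
    # Arrow = one trip, plus sign = separate trips; ";" between days is an assumption.
    wss = ";".join("+".join(days[d]) for d in sorted(days))
    return wss, codes

# Round trips on Days 1-4 and a one-way trip on Day 5, as described for Figure 4.
records = []
for day in range(1, 5):
    records += [(day, "Home", "Work"), (day, "Work", "Home")]
records.append((5, "Home", "Work"))

wss, codes = build_wss(records)
print(wss)
```

Encoding stations by order of first appearance keeps the sequence compact and makes repeated trips visible as repeated substrings, which is what the later principles rely on.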

Subway Inelastic Trip (SIT) Identification Principles
Many studies have discussed how to recognize SIUs' travel patterns by detecting the temporal and spatial characteristics of their trip chains [12,13,22]. Indicators such as traveling days, route sequences, stops, and departure times are defined to measure the commuting intensity of each transit rider [6,11]. The frequency of the indicators or the thresholds of the frequency are the main factors to describe SIUs' behavior. For example, traveling days in a week (≥3) with a time interval (≥7 h) is used as a condition to distinguish public transport commuters [20]. Complex indicators such as the travel space balance or a sequence of intervals are introduced to identify passengers' travel patterns [9,12,13].
Most studies have concluded that (1) SITs repeat stops and lines with high probability, and (2) SIUs tend to choose the same travel route and time every day. These conclusions provide the rationale to identify SITs. If the spatiotemporal characteristics of subway trips can be detected in a proper way, SITs can be recognized and identified [2,8,33]. Hence, we focus on checking the spatiotemporal regularity and repeatability of trips to distinguish SITs from non-SITs in this paper. The repeated routes and stations are identified by examining the WSS, and the repeated travel time is then checked by comparing the entry times of each repeated station sequence (RSS).
In summary, we set the SIT identification principles from 3 aspects: (1) number of weekly traveling days (WTDs), (2) number of repeated station sequences (RSSs), and (3) entry time difference (ETD) of RSSs. The specific principles for SITs are described as follows.

Principle 1: WTDs ≥ 2
Weekly traveling days (WTDs) refers to the number of days on which a user travels by subway in a week. Figure 5 shows the distribution of subway users according to WTDs. It indicates that around 42% of users have a WTD of 1, whether counted over the whole week or over workdays only. About 23.1% of subway users travel by subway 5 or more days a week, of which 7.7% have 6 WTDs and 3.2% have 7.
If a user only takes the subway once a week, the transaction record might not contain enough information to understand the user's travel demand [33]. These are usually called one-way trips, and users are considered random users who might not travel regularly by subway [17]. Hence, the condition WTDs ≥ 2 is employed to exclude random users and improve the efficiency of the SIT identification algorithm. In addition, the authors do not adopt traveling on workdays as a principle, because there is still a small group of SIUs who travel on weekends in Beijing.

Principle 2: RSS ≥ 2
Repeated station sequence (RSS) refers to an SSS that occurs more than once in a week. RSSs show the stations and routes that subway users choose regularly, which indicates their origin and destination. RSS_i denotes the i-th RSS in a WSS. For example, in Figure 4, there are 2 RSSs in the WSS. RSS_1 is 1→2, which appears 5 times, and RSS_2 is 2→1, which appears 4 times. In this case, the first station can be inferred to be the one near home and the second station the one near the workplace.
Principle 2 focuses on the spatial regularity of SITs. The condition RSS ≥ 2 is employed to check that a station sequence occurs at least twice. Though some studies require 3 occurrences as a condition, we set the threshold to 2 to avoid missing some SITs [2,11,33].
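Counting repeated station sequences is straightforward once trips are expressed as origin-destination code pairs. The following sketch is an illustration only; the function name `repeated_station_sequences` and the tuple representation of an SSS are assumptions, not part of the original method description.

```python
from collections import Counter

def repeated_station_sequences(trips, min_repeats=2):
    """Return the SSSs that occur at least `min_repeats` times in a week.

    trips: list of (origin_code, dest_code) pairs for one smart card.
    """
    counts = Counter(trips)
    return {sss: n for sss, n in counts.items() if n >= min_repeats}

# Example following the Figure 4 description: 1→2 five times, 2→1 four times.
week = [(1, 2), (2, 1)] * 4 + [(1, 2)]
rss = repeated_station_sequences(week)
print(rss)
```

Raising `min_repeats` to 3 would reproduce the stricter condition used in some of the cited studies.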

Principle 3: ETD ∈ [0, 30 min]
Entry time difference (ETD) of an RSS refers to the difference in entry time between repeated SSSs. It is calculated by Equation (1):

$$\mathrm{ETD}^{k} = \min_{1 \le j \le n,\; j \ne k} \left| \mathrm{ET}_{i}^{k} - \mathrm{ET}_{i}^{j} \right| \qquad (1)$$

where $\mathrm{ET}_{i}^{k}$ is the entry time of trip $k$ of RSS$_i$; $\mathrm{ETD}^{k}$ is the minimum of the absolute value of the time difference between $\mathrm{ET}_{i}^{k}$ and the other trips of RSS$_i$; and $n$ is the number of times RSS$_i$ is repeated.
Principle 3 aims to measure the temporal regularity of SITs. Previous research has found that commuter travel time is concentrated in the morning and evening rush [2,8]. In Beijing, the average travel time in the morning and evening rush is 62.3 min and 52.8 min, respectively [3,4]. The threshold of 30 min (about 50% of the average travel time) is set because, beyond that threshold, riders might arrive late at their destination.
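A small worked example may help make the ETD definition concrete. The sketch below assumes that entry times are compared as times of day, expressed in minutes after midnight; that representation, the function name `entry_time_differences`, and the sample times are illustrative assumptions.

```python
def entry_time_differences(entry_minutes):
    """ETD for each trip of one RSS.

    For trip k, the ETD is the smallest absolute gap (in minutes) between
    its entry time of day and the entry time of any other trip of the RSS.
    """
    if len(entry_minutes) < 2:
        raise ValueError("an RSS needs at least two trips")
    return [
        min(abs(t - other) for j, other in enumerate(entry_minutes) if j != k)
        for k, t in enumerate(entry_minutes)
    ]

# Entry times of four trips of the same RSS: 07:30, 07:42, 08:20, 07:35.
times = [450, 462, 500, 455]
etds = entry_time_differences(times)
# Principle 3: a trip is temporally regular if its ETD lies in [0, 30] min.
regular = [etd <= 30 for etd in etds]
print(etds, regular)
```

In this example the 08:20 trip is 38 min away from its nearest counterpart, so it fails the 30 min condition while the other three trips satisfy it.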

SIT Identification Framework
Based on the principles mentioned in Section 3.2, the framework of SIT identification is designed as in Figure 6. The input is the foundation database of all trips, and the output is the SIT database, which contains the identified inelastic trips. First, we read the data prepared in Section 3.
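The three principles can be chained into a single pass over one card's weekly trips. The sketch below is one possible reading of the framework, not the procedure of Figure 6 itself: the tuple layout `(day, origin, dest, entry_minute)`, the function name `identify_sits`, the order of the checks, and the treatment of entry times as minutes after midnight are assumptions.

```python
from collections import defaultdict

def identify_sits(trips, max_etd=30):
    """Label each trip of one card as SIT (True) or non-SIT (False).

    trips: list of (day, origin, dest, entry_minute) tuples (hypothetical layout).
    """
    labels = [False] * len(trips)

    # Principle 1: the card must be used on at least 2 days of the week.
    if len({day for day, _, _, _ in trips}) < 2:
        return labels

    # Principle 2: group trips by station sequence and keep repeated ones.
    groups = defaultdict(list)
    for idx, (_, origin, dest, _) in enumerate(trips):
        groups[(origin, dest)].append(idx)

    for indices in groups.values():
        if len(indices) < 2:
            continue
        # Principle 3: entry time within max_etd minutes of another trip of the RSS.
        for k in indices:
            etd = min(abs(trips[k][3] - trips[j][3]) for j in indices if j != k)
            labels[k] = etd <= max_etd
    return labels

trips = [
    (1, "A", "B", 450), (1, "B", "A", 1080),
    (2, "A", "B", 455), (2, "B", "A", 1200),
    (3, "A", "C", 600),
]
print(identify_sits(trips))
```

Here the two A→B trips are labeled as SITs, the B→A trips repeat spatially but differ by 120 min in entry time, and the single A→C trip is never repeated.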

Validation and Discussion
To validate the effectiveness of the proposed SIT identification method, a detailed revealed preference (RP) survey was conducted in May 2017 to collect the travel attributes of smart card holders in Beijing. The RP survey was carried out around five subway stations by asking users to answer a questionnaire on the spot. The questionnaire included questions about personal information such as age, income, and occupation, and travel information such as trip purpose, travel route (origin and destination stations), and travel time. Crucially, every respondent was required to provide his/her smart card number, and each trip was labeled as an SIT or a non-SIT based on this trip information.
The validation results are shown in Table 3. The average accuracy rate for SIUs is as high as 94%, a little lower than the accuracy rate for non-SIUs, which supports the proposed framework for SIT identification. Though the average accuracy rate is not higher than the results from previous studies (94.1% in [11] and 94.5% in [21]), the framework works in a simple way, with less effort spent on mining data and collecting personal information.

Table 3. Results of validation of subway inelastic trips (SITs) and non-SITs.

We analyzed the distribution of entry times of all trips and SITs to assess the classification and the stability of SITs. With 14 distribution figures extracted from two weeks of data from the Beijing Subway, the curves and values of these distributions are found to be extremely similar for workdays and weekends. Hence, only four representative figures are shown in Figure 7. The horizontal axis presents entry times in 5 min intervals from 05:00 to 23:55, and the left ordinate shows the number of trips (transaction records), while the right ordinate presents the ratio of SITs to all trips from 0% to 100%.
Distributions of three indicators are shown in the figure: (1) trips refers to number of all trips, (2) SITs refers to number of SITs, and (3) ratio refers to the ratio of SITs to all trips. Figure 7a,b shows two distributions of trips on Wednesday, for which the SITs curve is close to the trips curve with two peak hours (07:00 to 09:00 and 17:05 to 19:05). Meanwhile, the distribution of ratio has three peak areas (07:00 to 09:00, 17:40 to 19:40, and 23:20 to 23:35). The highest ratio reaches more than 80% in peak hours, and the average of ratio is about 75.6% in morning peak hours and about 64.2% in evening peak hours. The peak area from 23:20 to 23:35 indicates that a small percentage of passengers prefer to regularly travel at night. Figure 7c,d shows two distributions of trips on Saturday. The similarity of the SITs curve and trips curve on weekends is not as high as what appears in Figure 7a,b. The ratio is below the value on workdays; however, there are still about 40% of SITs travelling in the morning and night peak hours.
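The three indicators plotted in Figure 7 can be derived from labeled trips by binning entry times into 5 min intervals. The following sketch is illustrative only; the input layout `(entry_minute, is_sit)`, the function name `sit_ratio_by_bin`, and the sample values are assumptions.

```python
from collections import Counter

def sit_ratio_by_bin(trips, bin_minutes=5):
    """Aggregate trips into entry-time bins.

    trips: iterable of (entry_minute, is_sit) pairs, with entry_minute
           given as minutes after midnight.
    Returns {bin_start_minute: (n_trips, n_sits, ratio)}.
    """
    total, sits = Counter(), Counter()
    for minute, is_sit in trips:
        start = (minute // bin_minutes) * bin_minutes
        total[start] += 1
        if is_sit:
            sits[start] += 1
    return {b: (total[b], sits[b], sits[b] / total[b]) for b in sorted(total)}

# Four trips entering between 07:30 and 07:36; three of them are SITs.
sample = [(450, True), (452, True), (453, False), (456, True)]
profile = sit_ratio_by_bin(sample)
print(profile)
```

Only bins that contain at least one trip appear in the result, so the ratio is always well defined.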

The distributions of trips suggest that SITs extracted by the proposed method are stable in number of SITs, travel time, and ratio of SITs to all trips on workdays, which is consistent with the definition of SITs. Additionally, some special SITs (such as those on late nights and weekends) were detected, which is in line with the diversity of occupations in Beijing [1,18]. In short, the distribution of trips supports the feasibility of the proposed method.
We further analyzed origin-destination (OD) distribution of SITs based on subway stations, and the distributions are also found to be similar on workdays and weekends. An example of OD distributions is shown in Figure 8. First, distributions of origin are matched to distributions of destination, which indicates that most SIUs always prefer the same station as an origin or a destination and return to the origin station. This characteristic of SIUs has been proved and discussed in many studies [1,6,11,21]. Second, high passenger flow volumes of SITs are concentrated at the center of Beijing City, especially around and within the Fourth Ring Road. Third, the highest volumes are focused around two areas, Zhongguancun and Guomao, the two largest central business districts in Beijing, with millions of jobs. To some extent, OD distribution of SITs proves the stability of SIUs in choosing subway stations.
In summary, the proposed SIT identification methods are practical for distinguishing subway users, which is very important to the future of the transit industry. Regular SITs make up a large part of ridership on many transportation systems around the world. By mining extensive AFC data, SITs and inelastic demand can be easily recognized for further infrastructure planning and system optimization.

Conclusions
A data mining methodology is proposed for identification of regular SITs and understanding of regular SIU characteristics of the Beijing Subway. WSS, used in this paper, is a novel and simple data format to enhance the visibility of data mining results. Three specific principles to identify SITs are presented, of which Principle 1 aims to enhance the data mining efficiency, Principle 2 to check the spatial regularity of SITs, and Principle 3 to measure the temporal regularity of SITs. Finally, the framework of the data mining process is proposed and applied to distinguish SITs and non-SITs in the Beijing Subway system. RP survey and analysis are performed to demonstrate the effectiveness of the proposed method. Results confirm that the method proposed in this paper is feasible for SIT identification and is easy to use in practice.
By applying the proposed method, managers of transit agencies can understand the features of SITs and have insight into the elaborate travel demands of passengers only based on AFC data without high cost and the effort of collecting extra information and preprocessing different data. The characteristics of SITs are helpful in assessing the LOS of the subway system and improvements can be proposed in the periods with a high ratio of SITs. The method also can be used to predict the origins and destinations of travelers. However, the principles for identifying SITs in this paper only focus on the repeatability of stations and travel time, and the conditions of SITs might not be as constant as we discussed. SITs might have different manifestations and characteristics when the influencing factors change, such as job and housing locations, congestion, and fare policies, which are not detected in this paper. In addition, the dataset used in this paper does not include the users' personal attributes, and special SITs are hard to detect, which might lead to deviation in identifying SITs. Further studies will focus on examining and tracing the characteristics of SIUs over a longer term and improving the accuracy of the discrimination methods.