Design a Semantic Scale for Passenger Perceived Quality Surveys of Urban Rail Transit: Within Attribute’s Service Condition and Rider’s Experience

A better understanding of passenger perceived quality helps urban rail transit managers adopt better strategies to improve the service quality of urban rail transit, which is beneficial to the sustainable development of an urban rail transit system itself and cities. This paper designs a semantic scale to survey passenger perceived quality of urban rail transit. The methodology is selecting specific features of an attribute and then describing the features to present the attribute’s service condition and the rider’s experience. The scale’s options can reduce cognitive steps and hesitation for riders to answer the survey questionnaire. Furthermore, it enables urban rail transit managers to understand passenger perceived quality more visually. After verifying the reliability and validity of the semantic scale, an empirical study was conducted to compare the evaluation results of the proposed semantic scale, Likert, and numeric scales. Compared to the Likert and numeric scales, the evaluation result of the semantic scale is fairer for attributes with homogeneous service conditions over operation periods from the transit agency perspective. Meanwhile, it is more homogeneous for attributes with homogeneous service conditions and is more heterogeneous for attributes with heterogeneous service conditions.


Introduction
Transit service quality is usually defined as the overall measured or perceived performance of transit service from the passenger's point of view [1] (Chapter 4, p. 6). Improving service quality can help attract more riders, retain the current riders [2], and alleviate excessive use of private cars [3]. It helps to promote the sustainable development of cities. To have more targeted strategies for improving transit service quality and to allocate resources more reasonably, transit managers often need to know the current status of service quality. The primary method is to conduct passenger perceived quality surveys [4].
At present, self-administered questionnaires serve as the primary form of passenger perceived quality surveys [5], and five-point Likert and numeric scales are the most used [6]. The Likert scale options are generally set to very satisfied, satisfied, normal, dissatisfied, and very dissatisfied, as in [7,8]; the numeric scale options are set to one, two, three, four, and five points, as in [4,5].
These two scales are practical tools to measure passenger perceived quality. However, based on the following three reasons, we aim to design a five-point semantic scale for passenger perceived quality surveys of urban rail transit. Compared with Likert and numeric scales, the options of the semantic scale describe the attribute's service condition and rider's experience more directly. set of attributes, until all attributes are covered. Thus, riders did not need to evaluate every attribute, and it saved time. However, this scale still has not been widely used [16].
Some scholars summarized the evaluation characteristics of different scales by comparing their evaluation results on the same object. On the one hand, the evaluation result of the Likert scale shows central tendency bias. Presser and Schuman [17] analyzed five data sets that involve social or political issues and found offering a middle alternative in the Likert scale increased the size of this category by 10-20%. Most of the increase came from declines in polar positions, and the size of "do not know" responses mostly remained the same. Whether a middle position was offered also did not affect univariate distributions. On the other hand, the derived importance of attributes from the best-worst scale matched previous studies better than the Likert scale for bus transit service [16].
Moreover, the semantic-differential scale seems to have higher reliability, internal validity, and model fit of the structural equations model than the Likert scale. Based on four data sets that evaluated stores, Ofir et al. [18] concluded that the semantic-differential and Likert scales were non-interchangeable. In most cases, the semantic-differential scale had higher reliability and internal validity than the Likert scale. Friborg et al. [19] tested human resilience, discovering the structural equations model in the semantic-differential version fit the data better than the Likert version. Bonera et al. [20] used the semantic-differential scale to investigate the factors (e.g., socio-economics) that affect the user's perception of travel experience and the ease of doing several activities on the journey.
Nevertheless, the semantic-differential scale only has descriptive sentences at the two ends of the scale. The categorization of other satisfaction levels is still as abstract as in the Likert and numeric scales. Therefore, the cognitive steps and hesitation of the semantic-differential scale are still the same as those of the Likert and numeric scales. Table 1 summarizes some characteristics of the Likert, numeric, stated preference, best-worst, and semantic-differential scales. The above research and Table 1 suggest at least three points. First, the Likert and numeric scales, the two most used scales in transit passenger perceived quality surveys, have room for improvement, and altering the scale form is a feasible method. Second, the semantic-differential scale offers advantages in data quality, but its cognitive steps and its reflection of real experience can still be improved. Third, using different scales to evaluate the same object may yield different results.

Design Concept and Framework
We aim to design a semantic scale with attributes and descriptive sentences in all options for each attribute. The attributes are arranged in the questionnaire according to the process of a ride, which helps riders recall their riding experiences and makes the questions easier to answer. In each level's option, the sentence describes the attribute's service condition based on the rider's experience. The descriptive subjects of an attribute are defined as features, and the adjectives or rider's experiences used to describe the features are defined as terms. Figure 1 depicts the hierarchical relationship between an attribute and its features, terms, and options. For each level's option, the semantic sentence can be expressed as Equation (1):

The option = using terms to describe feature 1 + using terms to describe feature 2 + ..., (1)
For example, the "Ticket purchase and top-up service" can be described as "the operation is simple, and the number of machines is sufficient". In this manner, "operational simplicity" and "the number of machines" work as features 1 and 2, respectively. Correspondingly, "simple" and "sufficient" are the terms of features 1 and 2, respectively. Furthermore, features remain the same in every level's option of an attribute, while terms are different. This is because terms define various service levels of features. For example, the "Ticket purchase and top-up service" can also be described as "the operation is inconvenient, and the number of machines is insufficient". In this manner, features are still "operational simplicity" and "the number of machines", while the terms have changed to "inconvenient" and "insufficient". Therefore, we need fixed features but multiple terms to form a semantic scale of an attribute.
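The sentence structure of Equation (1) can be illustrated with a minimal sketch. The function name `build_option` and the clause template "the ... is ..." are assumptions made for illustration only; the paper does not prescribe an implementation, and the feature labels are shortened so that they read naturally in a sentence.

```python
def build_option(terms_by_feature):
    """Compose one option sentence in the spirit of Equation (1).

    terms_by_feature maps each feature (the descriptive subject) to the
    term chosen for it at this service level. One clause is produced per
    feature, and the clauses are joined in order.
    """
    clauses = [f"the {feature} is {term}" for feature, term in terms_by_feature.items()]
    return ", and ".join(clauses)


# The features stay fixed across levels; only the terms change.
high = build_option({"operation": "simple", "number of machines": "sufficient"})
low = build_option({"operation": "inconvenient", "number of machines": "insufficient"})
print(high)
print(low)
```

Keeping the features as dictionary keys and the terms as values mirrors the rule that features are fixed for an attribute while terms vary by service level.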
In summary, the establishment of options consists of four steps ( Figure 2). The first and second steps are to identify the features and terms of all attributes, respectively. In the third step, we combine features and their terms to formulate options. Finally, scale levels and scores are assigned to the options.

Figure 2. A four-step methodological framework to form a semantic scale: Step 1, identifying features of attributes; Step 2, identifying terms of features; Step 3, combining features and their terms to form options; Step 4, assigning scale levels and scores to options.

First Step: Identifying Features of Attribute
The first step is to identify the features of all attributes. Attributes are extracted from previous studies based on the findings of attributes' importance [8,21]. For each attribute, features can be obtained through a focus group. The focus group should include 8-10 people [22] (p. 41). During the focus group, a researcher asks riders what affects their perceived quality with attributes, and riders are allowed to discuss. Reasons that affect the rider's perceived quality with attributes are recorded, and they serve as the features of attributes.

Second Step: Identifying Terms of Features
While riders answer the reasons that affect their perceived quality with attributes, some words that riders use to describe a feature would be detected. Those words can be adjectives that define the service condition or riders' experiences, and they are selected as the terms of that feature.
Brace [11] (p. 51) suggested that spontaneity is more critical than prompting, and great care should be taken not to prompt. To capture the most spontaneous reaction from riders, the number of terms for each attribute is not fixed. Otherwise, a fixed number may prompt riders, so that the proposed terms are not entirely consistent with their original perceptions of the features.
Then, the terms are coded to distinguish the service levels of features. This also prepares for translating the options into scale levels and scores in the fourth step. Please note that the codes are only numerical labels of the scale levels of a feature; they are ordinal variables rather than interval variables [22] (p. 105). To make the codes more general, we assumed that a larger number indicates a higher service level and set the codes to "equally spaced" numbers from 0 to 1 (Equation (2)). For instance, a two-level term is coded 1 and 0, and a three-level term is coded 1, 0.5, and 0.
$$\text{The codes of } i\text{-level terms} = 0, \frac{1}{i-1}, \frac{2}{i-1}, \ldots, 1, \quad (2)$$
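The equally spaced coding can be sketched in a few lines. The function name `term_codes` is hypothetical, and the sketch only assumes what the text states: i levels, codes from 0 to 1, a larger code meaning a higher service level.

```python
def term_codes(levels):
    """Return equally spaced codes from 1 down to 0 for an i-level term.

    The codes are ordinal labels only; the spacing carries no interval meaning.
    """
    if levels < 2:
        raise ValueError("a term needs at least two levels to be coded")
    return [k / (levels - 1) for k in range(levels - 1, -1, -1)]


print(term_codes(2))  # two-level term
print(term_codes(3))  # three-level term
```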

Third Step: Combining Features and Their Terms to Form Options
In the third step, we combine features and their terms to form options (Figure 2 is an example). Each level's option is structured based on Equation (1). Since different attributes evaluate different service contents, the number of reasons (i.e., features) that affect the rider's perception of attributes might differ. Meanwhile, as different features describe different aspects of their attribute, the number of terms that riders proposed to distinguish their perception of features may vary. Thus, there may be several kinds of combinations of features and their terms.
To define each kind of combination, we denoted the number of features as the number of digits, the number of terms as the value per digit, and * as the digits' connection. For example, when an attribute has two features, and each feature has three terms, its combination can be denoted as 3 * 3 (Figure 3).
If the number of options is less than the scale's required points, additional options can be added through a Delphi method or a focus group. The added options' orders, which might vary among attributes, are also determined in this process based on the service level.

Fourth Step: Assigning Scale Levels and Scores to Options
Before the assignment, we need to define the option code. In this paper, we assume features of the same attribute have equal weights. Thus, one possible way to define the option code is the sum of the terms' codes in this option (Equation (3)):
The code of the option = the code of the term of feature 1 + the code of the term of feature 2 + ..., (3)
Sustainability 2020, 12, 8626

Figure 3. Options and option codes of a 3 * 3 combination (F = feature, T = term):
Option code 2: the F_1 is T_1, and the F_2 is T_1.
Option code 1.5: the F_1 is T_1, and the F_2 is T_2; the F_1 is T_2, and the F_2 is T_1.
Option code 1: the F_1 is T_1, and the F_2 is T_3; the F_1 is T_2, and the F_2 is T_2; the F_1 is T_3, and the F_2 is T_1.
Option code 0.5: the F_1 is T_2, and the F_2 is T_3; the F_1 is T_3, and the F_2 is T_2.
Option code 0: the F_1 is T_3, and the F_2 is T_3.

As the size of a term's code distinguishes the service level of a feature, the size of the option code naturally represents the service level of the option: the larger the option code, the higher the service level of the option. The scale levels and scores are then assigned to the options according to the size of the option codes. The option with the largest option code is assigned to the highest scale level and score; the option with the smallest option code is assigned to the lowest scale level and score; options with the same option code are assigned to the same scale level and score.
Please note that the mathematical meaning of option codes and term codes is the same; they are ordinal variables instead of interval variables. Both are only numeric labels that represent the service levels. Figure 3 demonstrates the relationships among option codes, scale levels, and scores of a 3 * 3 combination. As the number of option code types is five, this combination corresponds to a five-point scale. The first option is "The feature 1 is term 1, and feature 2 is term 1". The codes of both "term 1" entries are 1. According to Equation (3), the code of this option is 2. This option code is the largest among all options, so the option is assigned to the highest scale level (i.e., S4) and score (i.e., 4).
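The assignment for a 3 * 3 combination can be traced with a short sketch. It assumes equal feature weights (as stated above) and ranks the distinct option codes to obtain scale levels; the function names `term_codes` and `assign_levels` are hypothetical, and exact fractions are used only to avoid floating-point ties.

```python
from fractions import Fraction
from itertools import product


def term_codes(levels):
    """Equally spaced codes from 1 down to 0 (Equation (2)), as exact fractions."""
    return [Fraction(k, levels - 1) for k in range(levels - 1, -1, -1)]


def assign_levels(levels_per_feature):
    """Map every term combination to (option code, scale level, score).

    levels_per_feature lists the number of terms of each feature,
    e.g. [3, 3] for a 3 * 3 combination. Keys of the result are tuples of
    term indices, where index 0 is the best term (T_1).
    """
    codes = [term_codes(n) for n in levels_per_feature]
    option_code = {}
    for combo in product(*(range(n) for n in levels_per_feature)):
        # Equation (3): the option code is the sum of its terms' codes.
        option_code[combo] = sum(codes[f][t] for f, t in enumerate(combo))
    # Rank the distinct codes; equal codes share one scale level and score.
    ranked = sorted(set(option_code.values()))
    score_of = {code: rank for rank, code in enumerate(ranked)}
    return {
        combo: (code, f"S{score_of[code]}", score_of[code])
        for combo, code in option_code.items()
    }


table = assign_levels([3, 3])
for combo, (code, level, score) in sorted(table.items()):
    print(combo, float(code), level, score)
```

For the 3 * 3 case, the nine raw combinations collapse into five distinct option codes, which is why this combination maps directly onto a five-point scale.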

First Step: Identifying Features of Attributes
The semantic scale was set to five points, as five-point scales are the most used in current transit passenger perceived quality surveys [6]. In total, 17 attributes were extracted from the previous studies [23-29] based on the findings of attributes' importance. Table 2 shows the selected attributes, which are arranged based on the process of a ride.

Table 2 (excerpt). Attribute "Escalator and lift": feature "Crowdedness", with terms "no need to wait or wait a moment" (1) and "a long line" (0); feature "Frequency of out of service", with term "never met" (1).
Note:
1. Station accessibility: add "not too close, walking is acceptable" to describe the feature "distance" and then merge the good (1) and bad (0) "walking environment" to serve as the option S2; it belongs to the 2 * 2 combination.
2. In-station guide signs: add the case "guide signs are missing" to serve as the option S2.
3. Fare gate waiting time: add "no need to wait" to describe the feature "length" and serve as the option S4; ignore smooth (1) or stuck (0) due to the marginal effect on the time in this case.
4. Line map info and train arrival info: add the case "no relevant info or the equipment is being repaired" to serve as the option S0.
5. Noise: add "quiet" to describe the feature "level" and serve as the option S4.
6. Staff service: add the case "no staff or their contact information" to serve as the option S0.
The features of attributes were obtained through a focus group. The focus group comprised two researchers and eight riders [22] (p. 41). Table A1 (in Appendix A) shows the socio-economic and travel behavior information of all participants of the focus group. The two researchers served as the host and recorder, respectively. The host asked riders what affected their perceived quality with attributes, and riders were allowed to discuss. For instance, most riders believed that "clarity" and "conspicuousness" were the reasons affecting their perceived quality with "In-station guide signs". Hence, "clarity" and "conspicuousness" were used as the features of this attribute. Since the level of service of attributes from the TCRP Report 165 [1] has already stated the features of "Station crowdedness" (Chapter 10, p. 14), "Train waiting time" (Chapter 5, p. 4), and "Train crowdedness" (Chapter 5, p. 24), we directly utilized them instead of obtaining them from the focus group.

Second Step: Identifying Terms of Features
While riders were answering, terms that defined the service condition or rider's experience of features were collected. For example, when riders were talking about their perceptions of "clarity" of "In-station guide signs", some of them directly used the adjectives "clear" or "a bit unclear" to describe their perception. Meanwhile, others used their specific experiences that the In-station guide signs are hard to understand to show their opinions. Thus, "clear", "a bit unclear", and "hard to understand" became the terms used to describe the feature "clarity". According to Equation (2), "clear", "a bit unclear" and "hard to understand" were coded 1, 0.5, and 0, respectively.
However, riders only used "conspicuous" or "concealed" to talk about their perception of "conspicuousness" of "In-station guide signs". Interestingly, no rider proposed a middle term, such as "a bit conspicuous". Perhaps it is because the service conditions that riders experienced were extreme, or it is natural for them to use such a two-level term to describe their perception of this feature. Thus, "conspicuous" and "concealed" served as the terms of the feature "conspicuousness". According to Equation (2), "conspicuous" and "concealed" were coded 1 and 0, respectively.
Particularly, for the service of "In-station guide signs, Train arrival info, and Staff service", the focus group also mentioned the experience where the corresponding service was missing. Thus, the case "no relevant info or the equipment is being repaired" was added to "Line map info" and "Train arrival info", serving as the lowest service-level term (i.e., coded 0); the case "no staff or their contact information" was added to "Staff service", serving as the lowest service-level term (i.e., coded 0). Table 2 summarizes the features of all attributes, the terms used to describe the features, and the codes of the terms. The features obtained through the focus group are mostly consistent with the service requirements of the attributes stated in [1] (Chapter 4, pp. 17-36; Chapter 10, pp. 10-29).

Third Step: Combining Features and Their Terms to Form Options
We combined the features and terms to form options. Based on Table 2, all combinations can be denoted as 2 * 2, 2 * 3, 3 * 3, and 5 * 5. For the combination of 2 * 2, the number of options is less than five. According to the existing options, the focus group was asked to discuss again to propose more options. The most suitable option was then selected through scoring. Based on the service level, the added option's order was identified by the focus group. The added options and their orders are as follows. The option "not too close, walking is acceptable" for "Station accessibility" was added. It was placed between the options "short walking distance but a bad walking environment" and "walking distance is more suitable for cycling". The option "quiet" for "Noise" was added. It was placed before the option "intermittent small noise". The option "no need to wait" for "Fare gate waiting time" was added. It was placed before the option "wait a moment, and pass the fare gate smoothly". The note row of Table 2 also presents the relevant explanations.
The combination of the attributes "Station crowdedness", "Waiting time", and "Train crowdedness" is 5 * 5. In each attribute, the service conditions of different features affect each other, causing the service levels of all features to change in the same direction. Hence, the number of features can be regarded as one. After the combination of features and their terms, the number of options equals five, which is denoted as the combination of 5.

Fourth Step: Assigning Scale Levels and Scores to Options
The scale levels are denoted as S4, S3, S2, S1, and S0, and their corresponding scores are four, three, two, one, and zero, respectively. The scores range from zero to four points based on [22] (p. 111), as they supposed it assured the effectiveness of the modeling analysis. Based on the option codes, the options were assigned to the corresponding scale levels and scores. In the questionnaire, the terms of attributes are displayed. Figure 4 illustrates the semantic scale designed in this paper.

The Validity and Reliability
We conducted a pilot survey to measure the content validity and reliability of the semantic scale. The content validity and reliability were calculated by two widely used indexes, the Lawshe's content validity ratio (CVR) [30] and Cronbach's α [31], respectively.
The pilot survey comprised two parts. First, it was conducted with a content evaluation panel. Based on [32], a panel of 5-10 experts is suitable. Thus, the panel size was set to 8. The panel comprised four professors who specialize in the quality of urban rail transit service and four urban rail transit managers. The data were used to calculate Lawshe's CVR of every feature in our semantic scale. Equation (4) shows the equation of Lawshe's CVR [33].
$$\mathrm{CVR} = \frac{n_e - N/2}{N/2}, \quad (4)$$
where n_e is the number of panelists identifying the feature as "essential", and N is the total number of panelists. When all panelists think the feature is "essential", Lawshe's CVR is adjusted to 0.99.
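As a quick check of the CVR calculation, the following sketch applies the standard Lawshe formula together with the 0.99 adjustment described above. The function name `lawshe_cvr` and the panel counts in the example are illustrative, not the paper's data.

```python
def lawshe_cvr(n_essential, n_panelists):
    """Lawshe's content validity ratio for one feature.

    n_essential: panelists rating the feature "essential" (n_e).
    n_panelists: total number of panelists (N).
    A unanimous panel is reported as 0.99 instead of 1, as described in the text.
    """
    if n_essential == n_panelists:
        return 0.99
    half = n_panelists / 2
    return (n_essential - half) / half


# With an eight-member panel, seven "essential" ratings give 0.75.
print(lawshe_cvr(7, 8))
print(lawshe_cvr(8, 8))
```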
The second part of the pilot survey was conducted with riders to measure the reliability of the semantic scale. The riders were passengers of Metro Line 1 in Guangzhou, China. According to [8], the sample size was set to 36. The data were utilized to calculate Cronbach's α. Table 3 shows the results. Lawshe's CVR of every feature ranges from 0.75 to 0.99, which meets the threshold of 0.75 calculated by [34]. Furthermore, Cronbach's α is 0.84. Devon et al. [31] stated that Cronbach's α > 0.7 indicates an acceptable internal consistency among attributes for new scales. Therefore, the validity and reliability of the semantic scale are well supported.
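For readers who want to reproduce the reliability index, the sketch below computes Cronbach's α with the usual formula, α = k/(k-1) × (1 - Σ item variances / variance of total scores). The function name `cronbach_alpha` and the response matrix are hypothetical; the pilot data are not reproduced here.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a response matrix.

    responses: one row per respondent, one column per attribute (item),
    each cell holding the score (0-4) chosen for that attribute.
    """
    k = len(responses[0])
    # Sample variance of each item across respondents.
    item_vars = [variance([row[i] for row in responses]) for i in range(k)]
    # Sample variance of each respondent's total score.
    total_var = variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical scores from five riders on three attributes.
demo = [[4, 3, 4], [2, 2, 3], [3, 3, 3], [1, 0, 2], [4, 4, 3]]
print(round(cronbach_alpha(demo), 2))
```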

Empirical Study
We launched an empirical study to test the difference in evaluation results among the semantic, Likert, and numeric scales. The comparison results help us understand the potential characteristics of the semantic scale and assist transit managers to understand the impact of the scale form on the evaluation result. Since transit managers usually refer to the relative frequency distribution, mean, and variance of attribute scores to understand the current passenger perceived service quality of the transit and the heterogeneity of passenger perceived service quality, the difference was analyzed from those three aspects. Moreover, hypothesis tests were conducted to explore whether the differences are accidental or statistically significant. The data collection, data processing, results, and discussion of the empirical study are illustrated from Sections 5.1-5.3, respectively.

Data Collection
The empirical study was conducted using an online survey panel (www.wjx.cn) [35], and Metro Line 1 from Guangzhou, China, was the evaluation object. Riders needed to complete three copies of questionnaires whose attributes are the same, but the scales of attributes are different, which are Likert, numeric, and semantic scales. The Likert scale was set to very satisfied, satisfied, normal, dissatisfied, and very dissatisfied, and they were assigned to four, three, two, one, and zero points, respectively. Meanwhile, the numeric scale was set to four, three, two, one, and zero points. Asking one rider to answer these three copies of questionnaires ensures the differences in evaluation results are not caused by the differences in rider perceptions.
For the answer sequence of questionnaires, the linguistic scale-type questionnaires appeared last because linguistic options that appear first may cause a priming effect that influences how riders answer the rest of the questionnaires [11] (p. 135). Therefore, the numeric scale-type questionnaire appeared first, followed by the Likert scale-type questionnaire, and lastly, the semantic scale-type questionnaire. After completing the three questionnaires in turn, riders filled in information about their socio-economic characteristics and travel habits. Brace [11] (p. 53) noted that questions about rider socio-economics and travel habits might violate riders' privacy. If they are placed at the beginning of the survey, they may irritate riders, which can reduce the data quality or cause riders to withdraw halfway through.
In the sample size calculation, p is generally set to 0.5, the value at which n is maximized; N is the population size; α is the significance level; e is the margin of error; and z_{α/2} is the normal distribution quantile at the α significance level. The passenger flow of Guangzhou Metro Line 1 is about 1.1 million riders per day; hence, N = 1.1 × 10^6. Furthermore, the significance level α was set to 0.05, and the margin of error e was set to 5%, which is consistent with [37-39]. Finally, the calculation result is n ≥ 384.
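The reported bound can be checked with a minimal sketch. It assumes the common Cochran formula with a finite population correction; the exact form used by the authors is not shown in this section, so the formula and the function name `min_sample_size` are assumptions.

```python
from statistics import NormalDist


def min_sample_size(population, margin=0.05, alpha=0.05, p=0.5):
    """Minimum sample size for estimating a proportion (assumed formula).

    n0 = z^2 * p * (1 - p) / e^2 is the infinite-population size, and
    n = n0 / (1 + (n0 - 1) / N) applies the finite population correction.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided normal quantile
    n0 = z ** 2 * p * (1 - p) / margin ** 2
    return n0 / (1 + (n0 - 1) / population)


n = min_sample_size(population=1_100_000)
print(round(n, 2))
```

Because the population is very large relative to the sample, the correction barely changes the result, which is consistent with the reported bound of about 384.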

Data Processing
The data processing incorporates five steps.

• In the first step, we excluded the invalid questionnaires.
Researchers compared the IP addresses of the received questionnaires. For the questionnaires with a repeated IP address, we only kept the first copy and marked the rest as invalid. A repeated IP address probably indicates that a rider submitted the questionnaire more than once. Furthermore, riders could only submit the questionnaires after answering all questions, thanks to the automatic missed-question detection function provided by the online survey platform. Therefore, the received questionnaires have no missed questions. Ultimately, we obtained 408 valid questionnaires. The Cronbach's α of the semantic scale is 0.84. According to [41], Cronbach's α > 0.7 means a good internal consistency and reliability. Table A1 shows the information on the respondent socio-economics and travel habits and the evaluated operation periods. Respondent socio-economics and travel habits have a wide coverage with normal proportions, and the evaluated operation periods cover the peak and non-peak hours of weekdays and weekends, which enhances the representativeness of the sample.

• In the second step, we converted the evaluation result of the semantic scale into scores.
Based on the codes of the terms in Table 2, researchers used Equation (3) to change the evaluation results into option codes (Figure 3 is an example). Then, they converted the option codes to scores as described in Section 3.2.4.

• In the third step, we compared the score's relative frequency distributions in the three scales of each attribute and then conducted hypothesis tests.
As the same rider completed all three questionnaires, the samples are paired. Each scale has more than two levels, so the Bowker test is suitable. Take the comparison between Likert and semantic scales as an example. The null hypothesis, denoted as H_0, is that the score's relative frequency distributions of this attribute on the Likert and semantic scales do not differ, whereas the alternative hypothesis, denoted as H_1, is that they differ. The null and alternative hypotheses of the other comparisons can be obtained similarly. The p-values of the Bowker test are denoted as P_LS and P_SN, where the subscript letters are the initials of the two compared scales. Table 4 illustrates the results.
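As an illustration of the test named above, the sketch below implements Bowker's test of symmetry for paired categorical scores with the standard library only. The helper names, the toy scores, and the convention of skipping empty off-diagonal pairs (and reducing the degrees of freedom accordingly) are assumptions of this sketch, not details taken from the paper.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    # Odd df: closed form built on the complementary error function
    total, term = 0.0, math.sqrt(2.0 * x / math.pi)
    for i in range(1, (df - 1) // 2 + 1):
        if i > 1:
            term *= x / (2 * i - 1)
        total += term
    return math.erfc(math.sqrt(half)) + math.exp(-half) * total


def bowker_test(a, b, levels):
    """Bowker's symmetry test for paired scores a and b on the same levels."""
    idx = {lv: i for i, lv in enumerate(levels)}
    k = len(levels)
    table = [[0] * k for _ in range(k)]
    for x, y in zip(a, b):
        table[idx[x]][idx[y]] += 1
    stat, df = 0.0, 0
    for i in range(k):
        for j in range(i + 1, k):
            n = table[i][j] + table[j][i]
            if n > 0:  # assumption: empty pairs are skipped and not counted in df
                stat += (table[i][j] - table[j][i]) ** 2 / n
                df += 1
    p = chi2_sf(stat, df) if df > 0 else 1.0
    return stat, df, p


# Hypothetical paired scores of eight riders for one attribute
likert = [4, 3, 3, 4, 2, 3, 4, 3]
semantic = [4, 4, 4, 4, 2, 4, 4, 3]
stat, df, p = bowker_test(likert, semantic, levels=[0, 1, 2, 3, 4])
print(f"statistic = {stat:.2f}, df = {df}, p = {p:.3f}")
```

A small p-value indicates that riders shift between score levels asymmetrically across the two scales, which is the sense in which the distributions "differ" in this step.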

Note: P_LS and P_SN are the p-values of an attribute, denoting the test results of the equality of its score's relative frequency distributions between Likert and semantic scales and between semantic and numeric scales, respectively. * p < 0.05; ** p < 0.01; *** p < 0.001.

• In the fourth step, we compared the means in the three scales of each attribute and then conducted hypothesis tests.
If the differences of the paired-sample data follow a normal distribution at a 95% confidence level, the paired-sample t-test is suitable; otherwise, we chose the paired-sample Wilcoxon signed-rank test. The Anderson-Darling test and the Shapiro-Wilk test were selected as the normality tests for the differences of the paired-sample data because the parameters of the hypothesized normal distribution were unknown and the sample size did not exceed 2000. Under these conditions, these two tests are more reliable than other feasible tests [42,43]. Take the comparison between Likert and semantic scales as an example. H_0 states that the means of this attribute on the Likert and semantic scales do not differ, whereas H_1 states that they differ. H_0 and H_1 of other comparisons can be obtained similarly. The hypothesis test results are expressed in the same way as in step three. Figure 5 and Table 5 present the results.
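The non-normal branch of this step can be sketched as follows. This is a minimal paired-sample Wilcoxon signed-rank test using the large-sample normal approximation with a tie correction and no continuity correction; those choices, the function name, and the toy scores are assumptions of the sketch, and the software actually used by the authors is not stated in this passage.

```python
import math


def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed-rank test (normal approximation).

    Zero differences are dropped, tied absolute differences receive
    average ranks, and the variance is corrected for ties.
    Returns (W_plus, p_value).
    """
    d = [a - b for a, b in zip(x, y) if a != b]
    n = len(d)
    if n == 0:
        return 0.0, 1.0
    order = sorted(range(n), key=lambda i: abs(d[i]))
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1          # average of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        t = j - i + 1                       # size of this tie group
        tie_term += t ** 3 - t
        i = j + 1
    w_plus = sum(r for r, v in zip(ranks, d) if v > 0)
    mean = n * (n + 1) / 4
    var = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    z = (w_plus - mean) / math.sqrt(var)
    p = math.erfc(abs(z) / math.sqrt(2))    # two-sided p-value
    return w_plus, p


# Hypothetical paired scores of eight riders for one attribute
likert = [4, 3, 3, 4, 2, 3, 4, 3]
semantic = [4, 4, 4, 4, 2, 4, 4, 3]
w, p = wilcoxon_signed_rank(likert, semantic)
print(f"W+ = {w}, p = {p:.3f}")
```

Because scores on a five-point scale produce many tied differences, the tie correction matters here; with a sample of several hundred riders the normal approximation is usually adequate.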

Attribute                             P_LS    P_SN
Train crowdedness                     0.38    ***
Station accessibility                 ***     ***
Noise                                 ***     0.004 **
Station crowdedness                   ***     0.47
Escalator and lift                    0.43    ***
Fare gate waiting time                0.27    ***
Temperature and ventilation           ***     ***
Safety and security                   ***     0.004 **
Train waiting time                    ***     ***
Ticket purchase and top-up service    ***     ***
Cleanliness                           ***     ***
Staff service                         ***     ***
Service span                          ***     ***
In-station guide signs                ***     ***
Illumination                          ***     ***
Line map info                         ***     ***
Train arrival info                    ***     ***

• In the fifth step, we compared the variances in the three scales of each attribute and then conducted hypothesis tests.
If each set of paired-sample data follows a normal distribution at a 95% confidence level, the paired-sample F-test is suitable; otherwise, we chose the paired-sample Levene's test. The normality test method is the same as in step four. Take the comparison between Likert and semantic scales as an example. H_0 means the variances of this attribute between Likert and semantic scales have no difference, whereas H_1 means the variances of this attribute between Likert and semantic scales are different. H_0 and H_1 of other comparisons can be obtained similarly. The hypothesis test results are expressed in the same way as in step three. Figure 6 and Table 6 show the results.

Table 4 reflects the differences in the distribution of riders' perceived quality caused by the scale form. Most Bowker test results are significant at a significance level of 1% or even 1‰. It indicates that the score's relative frequency distributions of most attributes significantly differ among the three scales, and these differences are less likely to be accidental phenomena.
Such phenomena may be due to the range and content of the scale levels. Firstly, neither the Likert scale nor the semantic scale is an interval scale [22] (p. 103), i.e., the distance between two adjacent levels of the scale varies. In contrast, the numeric scale is an interval scale [22] (p. 103). Secondly, both Likert and numeric scales apply abstract categorizations of scale levels. However, the semantic scale distinguishes scale levels more clearly by defining the service conditions and the rider's experience at each scale level.
Interestingly, the differences exhibit the following rule.

1. On the semantic scale, the four-point frequency of some attributes is around the sum of the three- and four-point frequencies on the other two scales.
For instance, the four-point frequency of "In-station guide signs" is 87.25%, and its corresponding semantic option is "clear and conspicuous", i.e., 87.25% of the respondents believed the guide signs in the stations were clear and conspicuous (Table 2). This frequency is close to the sum of the frequencies of the very satisfied (42.65%) and satisfied (49.51%) levels of the Likert scale, or the sum of the frequencies of four points (57.60%) and three points (37.01%) of the numeric scale. It indicates that about half of the respondents regarded the service of "clear and conspicuous guide signs" provided by this transit agency as satisfying or three points, whereas the rest thought it was very satisfying or four points.
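The comparison in this example is simple arithmetic, and a short sketch makes the quantities explicit. It uses only the percentages quoted above; the variable names are illustrative.

```python
# Frequencies (%) quoted in the text for "In-station guide signs"
semantic_four_point = 87.25
likert = {"very satisfied": 42.65, "satisfied": 49.51}
numeric = {"four points": 57.60, "three points": 37.01}

# Sum of the two highest levels on each of the other two scales
likert_top_two = sum(likert.values())
numeric_top_two = sum(numeric.values())

# Share of those top-two respondents who chose the highest level
likert_top_share = likert["very satisfied"] / likert_top_two
numeric_top_share = numeric["four points"] / numeric_top_two

print(f"Likert top two:  {likert_top_two:.2f}% (highest-level share {likert_top_share:.0%})")
print(f"Numeric top two: {numeric_top_two:.2f}% (highest-level share {numeric_top_share:.0%})")
print(f"Semantic four-point frequency: {semantic_four_point:.2f}%")
```

The two sums are what the text compares with the semantic four-point frequency, and the shares show how the same group of respondents splits between the two adjacent top levels on the Likert and numeric scales.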
This phenomenon not only reflects the rider heterogeneity of perceived quality but also may be related to hesitation in answering. Respondents needed to translate their attitudes in conceptual terms into options when using Likert or numeric scales (Section 1). However, they might not have had a specific or determined attitude towards the service performance of that attribute and felt that the adjacent levels of Likert or numeric scales (e.g., very satisfied and satisfied, four and three points) were similar to their attitudes in conceptual terms, making them hesitant to map their attitudes onto a scale option. Thus, they might have been reluctant or lacked sufficient time to ponder the difference between the adjacent levels in these two scales before answering, especially in a hurry, which is consistent with the satisficing behavior in questionnaires proposed by Krosnick [44].
However, "clear and conspicuous" should already meet the service goal set by transit managers for "In-station guide signs", which is also reasonable. The evaluation results of Likert and numeric scales may therefore underrate the performance of this attribute, which is unfair to the transit agency. If the semantic scale is used, transit managers will understand passenger perceived quality more visually by reading the semantic options. In this example, transit managers can think highly of the performance of "In-station guide signs" and thus allocate resources to improve the service quality of other attributes.
Attributes with a similar phenomenon include "Line map info, Train arrival info, Illumination, Temperature and ventilation, Cleanliness, Staff service, Safety and security, and Service span" (Table 4). The service conditions of these attributes commonly may not change with operation periods (e.g., peak or non-peak periods).

Mean Comparison
Figure 5 and Table 5 reflect the differences in the average of riders' perceived quality caused by the scale form. In Figure 5, the ordinate represents attributes, which are arranged in ascending order according to the mean on the semantic scale; the abscissa represents the mean value, and the red, orange, and blue dots represent the values from Likert, numeric, and semantic scales, respectively. Table 5 uses p-values to show the hypothesis test results of the corresponding phenomena. For instance, P_LS and P_SN of "Train crowdedness" denote the results of its mean equivalence tests in Likert and semantic scales, semantic and numeric scales, and numeric and Likert scales, respectively. Figure 5 indicates the following rule:
The central tendency bias of most attributes is alleviated on the semantic scale. On the Likert scale, the mean of most attributes is the closest to the median of a five-point scale (i.e., two points). This phenomenon agrees with the findings of [13,17,45], which observed the central tendency bias in the Likert scale. However, this phenomenon does not show up on the semantic scale for most attributes (14 out of 17). The reason may be that the Likert scale indicates abstract categories of satisfaction levels, while semantic options are less abstract: they provide more visualized service conditions of attributes. This enables riders to directly select the option that most closely matches their journey experiences.
However, the central tendency bias may not be effectively reduced on "Train crowdedness, Escalator and lift, and Station accessibility". Table 5 shows that the means of "Train crowdedness" and "Escalator and lift" on the semantic scale are not statistically different from the means on the Likert scale (P_LS = 0.43 and 0.38, respectively). Figure 5 shows that the mean of "Station accessibility" on the semantic scale (blue dot) is closer to the median than its mean on the Likert scale (red dot). There may be two reasons. Firstly, the middle options (i.e., S2) of these attributes on the semantic scale match some respondents' perceptions (Table 4). Secondly, the middle options describe a better service condition or rider's experience than "normal" implies.
Finally, this rule is less likely to be an accidental phenomenon, as Table 5 demonstrates that the means of most attributes significantly differ among the three scales (p < 0.05). Thus, we have statistical evidence to believe the semantic scale can usually reduce central tendency bias.

Variance Comparison
Figure 6 and Table 6 reflect the differences in the dispersion of riders' perceived quality caused by the scale form. In Figure 6, the ordinate represents attributes, which are arranged in ascending order according to the variance on the semantic scale; the abscissa represents the variance value, and the red, orange, and blue dots represent the values from Likert, numeric, and semantic scales, respectively. Table 6 uses p-values to show the hypothesis test results of the corresponding phenomena. For instance, P_LS and P_SN of "Illumination" denote the results of its variance equality tests in Likert and semantic scales, semantic and numeric scales, and numeric and Likert scales, respectively. Figure 6 and Table 6 indicate the following rule: On the semantic scale, the variances are, or are close to, the highest or lowest among the three scales.
Most test results are significant at a significance level of 5% or even 1‰ (the first two columns of Table 6). It indicates that the variances of most attributes significantly differ between the semantic scale and the other two scales; these differences are less likely to be accidental phenomena. Thus, we have statistical evidence that the semantic scale form can affect attribute variances, causing this rule.
This phenomenon may arise because semantic scale options leave riders with less room for imagination than Likert and numeric scales do. When using numeric or Likert scales, riders needed to assess their attitudes in conceptual terms (e.g., a retrieved memory that the in-station guide sign is clear) and then find a number or a Likert term that most closely matched their attitudes. Due to this heterogeneity, riders may choose different options for the same service condition (Section 5.3.1 gives related examples); alternatively, they could choose the same option for different service conditions. In contrast, the semantic scale already presents the service conditions or the rider's experience in the options. Riders did not need to translate their attitudes in conceptual terms into options; they could directly select the option that most closely matched their retrieved experience.
Therefore, if an attribute has homogeneous service conditions over periods, riders using the semantic scale are more likely to select the same option, so the evaluation results on the semantic scale are more homogeneous (i.e., smaller variance). "Service span" is an example because it differs little among the stations of the same line. Correspondingly, its variance on the semantic scale is the smallest among the three scales. In contrast, if an attribute has heterogeneous service conditions over periods or individuals, riders using the semantic scale are more likely to select different options. Thus, the evaluation results on the semantic scale are more heterogeneous (i.e., larger variance). "Station accessibility" and "Train crowdedness" serve as examples. The experiences of "Station accessibility" may differ among individuals due to their various origins; "Train crowdedness" may differ between peak and non-peak hours. Correspondingly, their variances on the semantic scale are the biggest among the three scales.
This phenomenon implies that if the evaluated operation period is singular (e.g., only peak hours), the evaluation results of attributes with heterogeneous service conditions are more likely to be incomplete, and their variances may decline. Thus, having data from extensive operation periods would contribute to obtaining a more comprehensive evaluation result.

Conclusions
This research proposes a semantic scale for passenger perceived quality surveys of urban rail transit. The contents of the semantic scale were obtained through a focus group and TCRP 165 [1]. Then, we combined the content to form the options. A pilot survey was conducted to assess the validity and reliability of the semantic scale; the result indicates that the semantic scale meets the requirement. The semantic scale's options contain the attribute's service condition and the rider's experience. It enables urban rail transit managers to understand passenger's perception of the service quality more visually than only knowing the fixed terms "very satisfied, satisfied, normal, dissatisfied, and very dissatisfied" on a Likert scale or numbers on a numeric scale. Therefore, when the number of attributes remains unchanged, urban rail transit managers can formulate more targeted strategies to improve service quality. Furthermore, based on previous studies, the semantic scale can reduce cognitive steps and hesitation for riders when they fill in the questionnaire.
Then, we conducted an empirical study to explore the potential characteristics of the semantic scale by using paired-sample survey data to compare the difference in evaluation results among the semantic, Likert, and numeric scales. The empirical study uncovers the following three insights.

• First, for attributes with homogeneous service conditions over operation periods, the semantic scale offers fairer evaluation results from the transit agency perspective than Likert and numeric scales. This may be because riders hesitate less when answering.

• Second, the semantic scale can usually reduce central tendency bias. This may be because the semantic scale options depict visualized service conditions of attributes or the rider's experience.

• Third, compared to Likert and numeric scales, the evaluation result of the semantic scale is more homogeneous for attributes with homogeneous service conditions and is more heterogeneous for attributes with heterogeneous service conditions. This may be because riders need fewer cognitive steps when answering with the semantic scale.

We propose the following suggestions based on the above findings.

• First, as the scale form can affect the evaluation results, we recommend that transport authorities unify the questionnaire used for passenger perceived quality surveys of urban rail transit in a region or even the whole country. Hence, when the evaluation results of different times (e.g., different years) or spaces (e.g., different cities) are compared, the results are more reliable.

• Second, the collected data should cover operation periods as fully as possible; otherwise, it may increase the measured deviation of riders' perceived quality.
Some researchers have combined transit- and passenger-oriented data to measure the quality of transit service, such as [46,47], which produced less subjective results. For future work, we will apply the analytic hierarchy process in the focus group to select features of each attribute and determine their weights, as the analytic hierarchy process helps improve the capability of the semantic scale to handle uncertainty, ambiguity, and vagueness of passengers' perception. Finally, the concept of the semantic scale can also be applied to different modes of public transit.