Comparing Semantic Differential Methods in Affective Engineering Processes: A Case Study on Vehicle Instrument Panels

Abstract: When developing a user-oriented product, it is crucial to consider users' affective needs. Various semantic differential (SD) methods have been used to identify affect regarding materials, the most important property in products. This study aims to determine which of the three conventional SD methods (absolute evaluation 1 [AE 1], absolute evaluation 2 [AE 2], or relative evaluation [RE]) is most effective for affective evaluation. An affective evaluation of vehicle instrument panels was performed with each of these three SD methods. Two quantitative analysis methods (correlation analysis and repeated-measures ANOVA) were used to examine the performance (sample distinguishability) of each evaluation method, and both AE 2 and RE produced better results than AE 1. The correlation coefficients and p-values in the correlation analysis were slightly better for RE than for AE 2. In conclusion, an affective evaluation produced better results when pairwise samples (especially one sample pair) were presented, indicating that maintaining distinct samples is very important. The clearer the difference between comparison targets, the more accurate the evaluation results.


Introduction
When developing products, the materials, along with shape, size, and other properties, are considered the greatest factors in formation of the overall impression of the product [1]. A product's materials can produce substantial sensory data that encode their own properties of texture, hardness, temperature, and weight [2]. These design variables determine users' affect regarding the product, with the result that sensory triggered attractiveness and the expectation of improved feeling increasingly encourage consumers to buy [3]. It is, therefore, necessary to develop a material that best reflects the physical characteristics and makes a good impression. The representative product development methodology that best reflects this is affective engineering. By this approach, the user's affects such as impressions, feelings, and demands for a product are recognized and embodied in the product's design [4,5]. Furthermore, users' complex affects caused by the physical stimuli of a product can be quantitatively identified and adjusted to make products more affect-friendly [6].
Many statistical tools such as linear regression and neural networks have been developed and used in various contexts [7] as a bridge connecting user's affects for material or product properties. Before the mathematical implementation, affective evaluations can be performed in various ways such as using facial expressions, verbal comments, and semantic differentials (SD) in which meanings of words or attitudes towards something are identified using bipolar scales. Image/video and voice/speech recognition systems can be effective in identifying primary emotions (e.g., anger, joy, and sorrow) with adequate reliability [7][8][9]. SD may, admittedly, influence the attitudes of users by non-conscious emotion processing [10,11]. Nevertheless, it is still a widely applied measurement technique. It is the most appropriate one for evaluating the strength and direction of the meaning of concepts, especially complex and multidimensional concepts [12,13], and it has the great advantages of being easy to manage and enables relatively quick evaluation.
Many studies have been conducted on improving the affective quality of materials by identifying a predominant or preferred affect. Several experiments evaluated various affective properties based on visual and tactile sensation (e.g., gloss, translucence, and softness), describing how humans' affects are formed via sensation. For example, some studies focused more heavily on the influence from visual perception (visual dominance) [14,15] while other studies focused on qualifying touch perception (tactile dominance) by measuring a system of tactile sensation [16,17]. Furthermore, in humans' tactile investigation of surfaces, the four factors of roughness, compliance, coldness, and slipperiness were widely accepted as dimensions underlying tactile judgment [18][19][20]. In addition to visual and tactile sensations, some studies suggested the multi-modal sensory perception as an effective way to derive the final pleasurable sensibility [21,22].
Such studies have also attempted to identify the correlation between perceived and actual physical properties [20,23,24]. In particular, there is increasing interest in mechanical properties (e.g., viscosity and elasticity) transmitted through shape and motion cues [25], which implies an interest in how to make affective products through human-product interactions. For example, Kim et al. (2018) [26] investigated the relationships of design parameters with luxuriousness and naturalness in leather interiors for vehicles. Bahn et al. (2006) [27] proposed a systematic affective design plan to enhance user satisfaction by specifying material characteristics of automobile crash pads with respect to objective sensibility. Shin and Park (2016) [28] examined physical properties of metal, such as reflectivity and color, to increase its affective impact as an interior finishing material. Lee et al. (2000) [29] studied the pattern design and comfort evaluation of seat cover fabrics in trains. Vieira et al. (2017) [30] suggested a new parameter of tactile perception (e.g., "clickiness") for in-vehicle rubber keypads by establishing a very strong relation between seven affects and related design parameters.
Furthermore, most affective engineering studies have focused on developing a satisfaction model based on surface perception, especially for vehicle interiors. You et al. (2006) [31] investigated customer satisfaction with six selected major interior parts of a vehicle and developed satisfaction models to identify relatively important design parameters and preferred design features for those parts. Lee et al. (2005) [32] also conducted a cognitive satisfaction survey for vehicle instrument panels, focusing on design guidelines and a checklist related to fabricating and processing an instrument panel, which impacts cost and quality. In terms of research methods, each study selected various affects that formed a higher-level affect such as satisfaction. However, the higher-level affect was evaluated not only by sensory aspects, which depend on human senses and could yield intuitive and less biased results, but also by functional aspects, which are influenced by individuals' subjective preferences and experiences and could yield diverse and more biased results. For more accurate measurement of both aspects, it has been emphasized that affective evaluations should be carefully designed. However, none of the previous studies took a comprehensive approach to the evaluation methods, and no standard evaluation method was established.
Overall, research on product appreciation has been performed in the automobile industry as well as in other industrial domains such as textiles, electronics, and home appliances to develop new products for their own markets [33]. For example, Nagamachi (1999) [34] reported that Mazda had improved drivers' sense of self-control by evaluating a variety of shift-lever lengths in SD experiments. Mamaghani et al. (2014) [35] suggested a reference for design decisions on a ketchup bottle by determining the relationships between product features and adjectives (affects). Yang and Chang (2012) [36] proposed the representative dimensions of affect for mobile phone design. Recently, the field has expanded into web page design and product-service system design [37]. For example, Lin et al. (2013) [38] improved both user-specific expectations and aesthetic consistency in web pages by suggesting the optimal graphic-to-text ratio for a specific set of design elements and the users' feelings.
In general, the three types of sensory evaluations (vision, tactile, and multi-modal) and affective modeling studies for new product development evaluate various types of samples and compare them via appropriate adjectives, referring to their corresponding affects, which are usually in the SD methods. The SD method is an essential component of an affective evaluation in which participants rate affect at a certain level as they use a product, and SD means "rating of several concepts on a set of bipolar adjective scales by a sample of subjects" ( [39], p. 248). However, researchers differ in their use of the SD methods in affective evaluation because the manner in which samples are presented in the SD method can vary, though the SD adjectives they evaluate are the same. In three studies [40][41][42], participants were asked to answer all questions simultaneously for samples presented in order, which can be referred to as absolute evaluation (AE). In the following two studies [43,44], on the other hand, participants were asked to answer all questions for the sample pairs presented. This is relative evaluation (RE). Wongsriruksa et al. (2012) [24] asked absolute evaluation questions individually for samples presented in order. This evaluation method is a combination of the aforementioned two methods; participants evaluate all samples per adjective before moving on to the next adjective.
These three SD methods should be compared because study results are expected to vary depending on the evaluation method used. Participants who measure samples are generally subject to high variations in environment, over time, and within the sample, and are prone to bias [45]. Results of affective engineering are heavily affected by controllable factors such as the order of sample presentation. For example, a specific affect may be measured better by a specific SD method with respect to sample presentation. Ultimately, across affective engineering studies, more reliable results that exclude the effect of biases can only be expected when the evaluation methods are identical. Kim et al. (2018) [26] and Bahn et al. (2006) [27] asked participants to rate every affective word on bipolar adjective scales for samples presented one by one, in visual and tactile perception of leathers in vehicle interiors. This process was repeated until all the samples were completely evaluated. They also developed preference models and luxuriousness models for leathers in vehicle interiors by suggesting the preferred combination of material properties. Both studies showed visual and tactile relationships with physical measurements. However, they did not consider the effect of choosing another SD method, which could lead to different results because of differences in the number of evaluations or in the way items are presented and data are analyzed. Therefore, the relations between affects and SD methods should be investigated, because the SD method can shape participants' evaluation patterns and thus produce different sensibilities for the same affect.
Since each affect should be carefully considered by its own nature, it is also important to identify how participants differently react to the different samples in terms of each affect independently as well as to determine which sample is the most preferred for constructing an optimal combination of material properties.
The purpose of this study is to propose a new strategy to identify the most appropriate SD method in visual and tactile affective sensory evaluation by increasing the perceived distinguishability among samples depending on the nature of each affect. Through a vehicle instrument panel case study, this study attempts to compare the three SD methods quantitatively by repeating the evaluation, for the same participant group and for the same evaluation samples by all three methods under controlled conditions. Comparing the three SD methods will find a better means of SD sample presentation and corresponding affective adjective pairs and suggest the evaluation method that will afford the most accurate and in-depth evaluation results by identifying affective adjective pairs significantly correlated to expected design parameters.

Participants
The experiment included 24 participants (17 males and 7 females) with a mean age of 27.1 years. All participants had driver's licenses and their own vehicles; they were general consumers who were not experts in vehicle interiors but had sufficient experience with vehicle instrument panels. All participants repeated the same affective evaluation three times, once with each of the three SD methods. They were monetarily compensated after participating in the experiment.

Semantic Pairs
The process of selecting affective words involved four main steps. First, to establish a pool of affective words, words were collected from reviews of previous studies (17 domestic and 15 international) on affective evaluation, especially SD-based sensory tests for vehicle interiors [31,46–50]. As a result, the first level of the affective structure was established with 626 words that had been used in visual and tactile sensory evaluations, textile evaluations, and vehicle interior design evaluations. In addition to the literature review, a total of 180 online expert reviews (by general consumers with good knowledge of vehicle interiors) from vehicle driving tests were collected, of which only the 49 reviews about vehicle instrument panels or vehicle interior design were considered. In the text analysis, 2062 words were extracted and classified into 23 word groups for affective evaluation. As a result, the most frequent affective words in the real world were confirmed (warming, soft, and glossy), and they were included in the pool of affective words, which then comprised 649 words. Second, duplicated words were removed and similar ones were combined based on the literature review and text analysis, leaving 180 of the 649 words. Third, the organized affective words were evaluated for suitability for leathers and instrument panels, and 108 words were selected, including the most important affective words (bright, moist, sticky, and rugged). Fourth, antonyms and synonyms of each word were selected. Ultimately, a total of seven semantic pairs (14 words) were derived as sensory words through several group discussions. Moreover, an expert consultation was conducted to elicit the most appropriate words for the experiment. The process of selecting affective words is presented in Figure 1.

The seven semantic pairs for the affective evaluation in this study were: "Dark-Bright," "Matt-Glossy," "Cooling-Warming," "Dry-Moist," "Slippery-Sticky," "Flat-Rugged," and "Hard-Soft." These semantic pairs were classified as either visual pairs or tactile pairs. Two visual pairs, "Dark-Bright" for brightness and "Matt-Glossy" for glossiness could be perceived by mere observation. The brightness was influenced by contrast of color, and the gloss was influenced by illumination. The five tactile pairs, "Slippery-Sticky," "Flat-Rugged," and "Hard-Soft" for surface, "Cooling-Warming" for temperature, and "Dry-Moist" for humidity could be perceived by movement in contact. In particular, roughness, warmness, and humidity were perceived statically by mere touch whereas slipperiness and softness were perceived dynamically by rubbing, pressing, and tapping [17,51]. The definitions of each semantic pair are presented in Table 1.

Table 1. Semantic pairs and their definitions.

| Evaluated Semantic Pair | Definition |
| --- | --- |
| Dark-Bright | How bright the surface of leather is when you see it. |
| Matt-Glossy | How glossy the surface of leather is when you see it. |
| Cooling-Warming | How warming the leather is when you touch it. |
| Dry-Moist | How moist the leather is when you touch it. |
| Slippery-Sticky | How sticky the surface of leather is when you rub or press it. |
| Flat-Rugged | How rugged the surface of leather is when you touch it. |
| Hard-Soft | How soft the surface of leather is when you tap it. |

Samples
Six vehicle instrument panels were used in the experiment (Figure 2). They were structurally the same but covered with different leathers (Figure 3). The vehicle instrument panels were presented in a counter-balanced order across participants to minimize anchoring effects in the affective evaluations.
Appl. Sci. 2020, 10, x FOR PEER REVIEW 5 of 17
Seven prospective design parameters related to the selected affects were measured in each sample: grayscale, gloss, thermal conductivity, water vapor transmission rate (WVTR), squeak, roughness average (Ra), and softness. Grayscale was selected as a design parameter instead of color because every sample was achromatic and differed only in brightness [31]. Brightness was measured with a Hunter color-difference meter (OLOEY JZ-300): two sides of each sample were measured with hairline directions of 0° and 90° (Figure 4a). Gloss is a physical property in which light is reflected directly from the surface of an object; the degree of gloss is determined by the smoothness of the surface and the reflection angle of the projected light source [31]. In this study, gloss was measured using a gloss meter (HG60): each sample was inserted, and the values at 20°, 60°, and 85° were recorded from the display (Figure 4b).
Thermal conductivity is defined as the degree to which heat is transferred from the front side to the back side of a material; it was measured using an HC-074/Technox in accordance with ASTM-C518/ISO-8301. Each sample was trimmed to 200 mm × 200 mm and inserted between two aluminum plates. The temperatures of the upper aluminum plate (Q_upper) and the lower aluminum plate (Q_lower) were each pre-set. The sample and the aluminum plates were then tested as one unit in the HC-074 (Figure 4c). WVTR is a measure of how much water vapor passes through leather. It was measured following the guidelines of EN13726-2: leather samples 35 mm in diameter were inserted into a cup with 50 g of water and then placed in a constant-temperature air-humidifier for 24 h. Then, the change in weight and the rate of change were calculated (Figure 4d).
Squeak is a newly defined unit-less parameter calculated by normalizing the friction force: the difference between the maximum and the minimum frictional force is divided by the mean frictional force. The frictional values were measured using the Lloyd Instrument [26]. For the measurements, each sample was trimmed into two different sizes. The larger specimen was fixed at the bottom of the Lloyd Instrument while the smaller specimen was attached to the bottom of a weight (4.5 kg); the two specimens were then brought into contact by placing the weight with the smaller specimen on the larger specimen. After the weight was pulled by the Lloyd Instrument, the load was measured as the weight began to move (Figure 4e).
Ra is a physical property describing the roughness of the leather surface. It was measured with a TopoGetter 3D scanner on specimens trimmed to 10 mm × 10 mm (Figure 4f). Softness indicates the compliance of the leather, defined by how much the leather is compressed when pressed with a certain force [26]. It was measured by inserting each sample into a softness tester (ST-300) for one minute (Figure 4g).
These seven design parameter values are the means of three different measurements for each of the six samples and are summarized in Table 2. Based on the literature review and a meeting of experts, design parameters that seemed related were matched to each semantic pair [31,32,52]. Table 3 shows, for each semantic pair, the related design parameter and whether the two have a positive or negative relationship. For example, the lower the grayscale of a sample is, the brighter the sample is; the lower the WVTR of a sample is, the moister the sample is. On the other hand, the higher the gloss of a sample is, the glossier the sample is; the higher the thermal conductivity of a sample is, the more warming the sample is; the higher the squeak of a sample is, the stickier the sample is; the deeper (larger) the Ra of a sample is, the more rugged the sample is; the higher the softness of a sample is, the softer the sample is.
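The squeak parameter defined in this section (maximum minus minimum frictional force, divided by the mean frictional force) can be illustrated with a minimal sketch. The function name `squeak_index` and the force values are hypothetical and serve only as an illustration; the paper does not describe how the friction trace was sampled or processed.

```python
from statistics import mean


def squeak_index(friction_forces):
    """Unit-less squeak: (max - min) / mean of the measured friction forces."""
    forces = list(friction_forces)
    if not forces:
        raise ValueError("at least one friction force value is required")
    avg = mean(forces)
    if avg == 0:
        raise ValueError("mean friction force must be non-zero")
    return (max(forces) - min(forces)) / avg


# Hypothetical friction trace in newtons (not data from the study).
trace = [1.0, 2.0, 3.0]
print(squeak_index(trace))  # (3.0 - 1.0) / 2.0 = 1.0
```

Because the range is divided by the mean, rescaling every force value by the same factor leaves the result unchanged, which is consistent with the parameter being described as unit-less.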

Evaluation Method
The participants evaluated their affects regarding the samples in three different ways. In the first case, samples were presented in a random order, and participants answered all questions about each sample. When they finished evaluating one sample, they evaluated the next sample by answering all the questions again; this SD method is referred to in this study as absolute evaluation type 1 (AE 1). In the second case, all samples were presented to participants, and every sample was evaluated for a single affective word in a random order. After they finished evaluating one affective word, they evaluated the next word in the same order. This process was repeated until the last affective word; this SD method is referred to in this study as absolute evaluation type 2 (AE 2). In the third case, all possible pairs of samples were presented to participants, and all the affective words were evaluated for each sample pair. When the evaluation was completed for one pair, the same process was applied to another pair until all the pairs were evaluated; this SD method is referred to in this study as relative evaluation (RE). Three different questionnaires were developed according to the characteristics of the evaluation methods, and a seven-point Likert scale was used for each semantic pair. For example, for "Dark-Bright," the closer a rating is to one point, the darker the sample feels, and the closer to seven points, the brighter the sample feels. All three questionnaires are presented in the Supplementary Materials, Figures S1-S3. In summary, in both AE 1 and AE 2, each participant evaluated six samples on seven semantic pairs using a seven-point Likert scale, rating every semantic pair as an absolute value for each sample. Six sessions were therefore repeated for AE 1 (one per sample), and seven sessions were repeated for AE 2 (one per semantic pair). In RE, by contrast, each participant evaluated two samples relative to each other, with pairs drawn from the six samples. Because six samples form 15 distinct pairs, 15 sessions were repeated for RE, in which all semantic pairs were rated as relative values of one sample against the other. A flowchart of these three SD methods is presented in Figure 5.
Figure 5. Flowchart of three semantic differential (SD) methods in affective evaluation.
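The session counts for the three methods follow directly from the six samples and seven semantic pairs. The sketch below enumerates the sessions under each method. The sample labels `S1` to `S6` and the function names are hypothetical, and the randomization of presentation order is deliberately left out.

```python
from itertools import combinations

# Hypothetical sample labels; the semantic pairs are those listed in Table 1.
SAMPLES = [f"S{i}" for i in range(1, 7)]
SCALES = ["Dark-Bright", "Matt-Glossy", "Cooling-Warming", "Dry-Moist",
          "Slippery-Sticky", "Flat-Rugged", "Hard-Soft"]


def ae1_sessions(samples, scales):
    """AE 1: one session per sample; every scale is rated for that sample."""
    return [[(s, sc) for sc in scales] for s in samples]


def ae2_sessions(samples, scales):
    """AE 2: one session per scale; every sample is rated on that scale."""
    return [[(s, sc) for s in samples] for sc in scales]


def re_sessions(samples, scales):
    """RE: one session per unordered sample pair; every scale is rated."""
    return [[(a, b, sc) for sc in scales] for a, b in combinations(samples, 2)]


for name, build in [("AE 1", ae1_sessions), ("AE 2", ae2_sessions), ("RE", re_sessions)]:
    sessions = build(SAMPLES, SCALES)
    ratings = sum(len(session) for session in sessions)
    print(f"{name}: {len(sessions)} sessions, {ratings} ratings per participant")
```

Under these assumptions AE 1 and AE 2 each require 42 ratings per participant (in 6 and 7 sessions, respectively), whereas RE requires 105 ratings in 15 sessions, which illustrates the higher participant load of the pairwise method.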

Experimental Procedure
Three evaluation methods were performed over three days considering the cognitive and physical load of the participants. The participants performed the three evaluations in different orders. For example, one participant performed RE on the first day, AE 1 on the second day, and AE 2 on the third day; and another participant performed AE 2 on the first day, RE on the second day, and AE 1 on the third day. For minimizing contextual contamination in a large number of evaluations, some strict guidelines were set up so that the participants should follow. In the beginning of each evaluation, a brief explanation about the purpose and steps of the evaluation was given to all participants to ensure their understanding of the items of the questionnaire. Furthermore, they were given guidance on how to evaluate each semantic pair one by one, and they rated each semantic pair after completing a specific interaction. For example, they immediately rated the semantic pair of "Hard-Soft" on the questionnaires immediately after tapping the samples. In particular, researchers had observed and monitored the whole process of all participants' evaluations, confirming whether they met a minimum number (at least once per semantic pair) of the right interaction (e.g., seeing, touching, rubbing, pressing, and tapping). In terms of the group of participants, two participants were joined in one session of an evaluation, however, samples were given in a counter-balanced order across participants to avoid interrupting each other's evaluations as well as to prevent ordering effects of samples. In order to maintain the tactile sensitivity of the participants over a certain level, a rest period of 5 to 10 min was provided in the middle of the evaluation. In addition, samples which were located in the same place side by side were cleaned before each experiment to eliminate any possible influence of contamination. 
As for the samples, six were recommended by experts in the field as the most popular ones. In AE 1, the samples were presented one at a time in random order; in AE 2, a list of all six samples was provided for each semantic pair; in RE, the samples were presented two at a time so that each of the six samples was paired once with every other sample. After the affective evaluation each day, the participants were asked directly in the questionnaire about the perceived distinguishability of the samples for each semantic pair. They rated how easy or difficult it was to distinguish between the samples in terms of each of the seven semantic pairs on a seven-point Likert scale: one point indicated that the samples were very difficult to distinguish, whereas seven points indicated that they were very easy to distinguish. An example of the questionnaire is presented in Figure S4.
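The RE pairing described above (every sample paired once with every other sample) can be illustrated with a short sketch. The sample labels `S1` to `S6`, the function name, and the seeded shuffle of the presentation order are assumptions made for illustration only; the source does not state how the pair order was generated.

```python
import itertools
import random

# Hypothetical labels for the six samples; the paper does not name them here.
SAMPLES = ["S1", "S2", "S3", "S4", "S5", "S6"]

def re_session_pairs(samples, seed=None):
    """Return every unordered pair of samples once, in shuffled order."""
    pairs = list(itertools.combinations(samples, 2))
    rng = random.Random(seed)  # seeded only so the sketch is reproducible
    rng.shuffle(pairs)
    return pairs

pairs = re_session_pairs(SAMPLES, seed=0)
print(len(pairs))  # 15 RE sessions: C(6, 2) = 15
```

Each sample appears in five of the 15 pairs, which matches the 15 RE sessions per participant reported for this study.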

Analysis
Before the main analyses, the RE dataset was pre-processed. Since the pairwise comparison was continuous in all comparative evaluations of the six samples, each raw datum expressed how much better one sample was than the other in terms of the seven semantic pairs. Therefore, all raw data (15 pairwise evaluations per participant × 24 participants) were transformed into a global value for each sample using the analytic hierarchy process, which determines priorities by comparing variables of the same level in pairs within a matrix. The priority weight of each semantic pair and sample was calculated using the eigenvalue principle. Finally, the consistency of the pairwise comparison matrix was tested.
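As a rough illustration of the eigenvalue principle and the consistency test mentioned above, the following sketch derives priority weights from a reciprocal pairwise comparison matrix and computes Saaty's consistency ratio. It is a generic analytic hierarchy process sketch, not the authors' procedure: the function name, the random-index table, and the example matrix are assumptions, and the source does not say how the SD ratings were mapped onto ratio-scale comparisons.

```python
import numpy as np

# Saaty's random consistency index for matrix sizes 3 to 7.
SAATY_RI = {3: 0.58, 4: 0.90, 5: 1.12, 6: 1.24, 7: 1.32}

def ahp_priorities(matrix):
    """Return (weights, consistency_ratio) for a reciprocal comparison matrix."""
    a = np.asarray(matrix, dtype=float)
    n = a.shape[0]
    eigvals, eigvecs = np.linalg.eig(a)
    k = np.argmax(eigvals.real)             # principal eigenvalue
    lam_max = eigvals[k].real
    w = np.abs(eigvecs[:, k].real)          # principal eigenvector
    w = w / w.sum()                         # normalize weights to sum to 1
    ci = (lam_max - n) / (n - 1)            # consistency index
    cr = ci / SAATY_RI[n]                   # consistency ratio
    return w, cr

# Illustrative, perfectly consistent matrix built from made-up weights:
# entry (i, j) states how strongly sample i is preferred over sample j.
true_w = np.array([6.0, 5.0, 4.0, 3.0, 2.0, 1.0])
example = true_w[:, None] / true_w[None, :]
weights, cr = ahp_priorities(example)
print(np.round(weights, 3), round(abs(cr), 6))
```

A consistency ratio below about 0.1 is the conventional threshold for accepting a comparison matrix; the source does not report which threshold was applied.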
To compare the three SD methods, two analyses were performed from two perspectives: one measured the agreement between the design parameters and the corresponding affects under the three evaluation methods, and the other measured perceived distinguishability from the self-reported responses. Pearson's correlations between the design parameters and the related semantic pairs were analyzed for each affective evaluation result. Twenty-one correlation analyses (3 evaluation methods × 7 semantic pair-design parameter combinations) were performed. That is, the purpose of the correlation analysis was to determine how strongly each affect was correlated with its related design parameter, based on the ratings of the corresponding semantic pair. Then, a repeated-measures analysis of variance (ANOVA) was conducted to identify differences between the evaluation methods in the perceived distinguishability of the samples. If a significant difference between evaluation methods was identified (for a semantic pair that had a significant difference in the distinguishability of samples), subsequent pairwise comparisons between the evaluation methods were made using Bonferroni tests. Mauchly's sphericity test was also conducted. In total, seven repeated-measures ANOVAs were performed because all participants evaluated perceived distinguishability among the six samples in terms of seven semantic pairs under the three evaluation methods. Consequently, post hoc analysis for the seven semantic pairs was applied to determine under which evaluation methods the participants differentiated the samples more or less. Statistical significance was set at 0.05 in all analyses. SPSS 26.0 (IBM Corporation, Armonk, NY, USA) and Microsoft Excel 2016 (Microsoft Corporation, Redmond, WA, USA) were used for all statistical analyses.
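The two analyses described above can be sketched in a few lines of NumPy. This is an illustrative reimplementation under stated assumptions, not the SPSS procedure used in the study: the function names and all numbers are made up, the correlation is computed over one value per sample, and p-values (which would come from the t and F distributions in a statistics package) are omitted.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    return float(np.corrcoef(np.asarray(x, float), np.asarray(y, float))[0, 1])

def rm_anova_f(scores):
    """One-way repeated-measures ANOVA on an (n_participants, k_methods) array.

    Returns (F, df_methods, df_error).
    """
    data = np.asarray(scores, dtype=float)
    n, k = data.shape
    grand = data.mean()
    ss_methods = n * np.sum((data.mean(axis=0) - grand) ** 2)
    ss_subjects = k * np.sum((data.mean(axis=1) - grand) ** 2)
    ss_total = np.sum((data - grand) ** 2)
    ss_error = ss_total - ss_methods - ss_subjects  # residual after removing subjects
    df_methods, df_error = k - 1, (k - 1) * (n - 1)
    f_stat = (ss_methods / df_methods) / (ss_error / df_error)
    return float(f_stat), df_methods, df_error

# Made-up values for six samples: a design parameter and a mean SD rating.
design_param = [0.12, 0.35, 0.41, 0.58, 0.73, 0.90]
mean_rating = [2.1, 2.9, 3.4, 4.0, 5.2, 5.8]
print(round(pearson_r(design_param, mean_rating), 3))

# Made-up distinguishability scores: 24 participants x 3 methods (AE 1, AE 2, RE).
rng = np.random.default_rng(0)
toy = rng.integers(1, 8, size=(24, 3))
f_stat, df1, df2 = rm_anova_f(toy)
print(df1, df2)  # 2 46
```

With 24 participants and three methods the degrees of freedom are (2, 46), the same as in the F statistic reported for "Hard-Soft".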

Summary
This study can be broadly divided into four stages. First, affective words were selected, yielding seven semantic pairs ("Dark-Bright," "Matt-Glossy," "Cooling-Warming," "Dry-Moist," "Slippery-Sticky," "Flat-Rugged," and "Hard-Soft"). Second, seven design parameters (grayscale of color, gloss, thermal conductivity, WVTR, squeak, roughness, and softness) were measured in terms of the seven semantic pairs. Third, experiments on the three affective evaluation SD methods (AE 1, AE 2, and RE) were conducted, and direct questionnaires on perceived distinguishability among the six samples per semantic pair were also used. Lastly, two representative statistical models (correlation analysis and repeated-measures ANOVA) were used to examine the performance of the three affective evaluation SD methods. A schematic representation of the research workflow from Sections 2.1-2.6 is presented in Figure 6.

Analysis on Direct Questionnaires on Perceived Distinguishability of Samples per Semantic Pair
A repeated-measures ANOVA determined that the perceived distinguishability of "Hard-Soft" varied significantly across the evaluation methods [F(2, 46) = 3.345, p < 0.05]. Post-hoc tests using Bonferroni correction revealed that the perceived distinguishability score fell by an average of 0.67 between AE 1 and RE (p = 0.044). Moreover, the assumption of sphericity (Mauchly's test) was met in all repeated-measures ANOVAs (p > 0.05). Table 5 shows all repeated-measures ANOVA results.
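As a quick check on the reported statistic, the degrees of freedom follow from the design, assuming all n = 24 participants contributed scores under each of the k = 3 evaluation methods:

```latex
\begin{align*}
df_{\text{methods}} &= k - 1 = 3 - 1 = 2,\\
df_{\text{error}} &= (k - 1)(n - 1) = 2 \times 23 = 46,\\
\alpha_{\text{Bonferroni}} &= \frac{0.05}{\binom{3}{2}} = \frac{0.05}{3} \approx 0.0167 .
\end{align*}
```

The last line gives the per-comparison threshold when unadjusted p-values are used for the three pairwise comparisons. Some statistical packages instead report Bonferroni-adjusted p-values, which are compared against 0.05 directly; the source does not state which convention the reported p = 0.044 follows.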

Discussion
The three semantic pairs of "Dark-Bright," "Dry-Moist," and "Flat-Rugged" (also called brightness, humidity, and roughness affects) showed significant correlations between the subjective SD ratings and the respective design parameters (grayscale, WVTR, and Ra). The visual semantic pair "Dark-Bright" had the strongest correlation in RE. This indicates that brightness-related affect is most appropriately evaluated in RE, revealing that the visual dominance of surface perception in terms of brightness is significant for identifying more subtle differences between the samples [53]. For the two tactile semantic pairs "Dry-Moist" and "Flat-Rugged", all methods seem applicable, but both correlations were strongest in AE 1. This means that the affect associated with humidity is most appropriately evaluated by AE 1, in which the participant had different types of interactions with a single sample and there was enough time between interactions. For accurate measurement of the first sensation, AE 1 can minimize hand sweating caused by heat exchange between the skin and the samples [54]. Similarly, the affects related to surface roughness seem to be most appropriately evaluated in AE 1, revealing that the tactile dominance of surface roughness is significant for distinguishing samples. The results for these two tactile affects indicate that participants tend to distinguish between samples based on an overall impression, with organized psychological states affected by physical objects under a resting hand, rather than using a moving hand to gain an individual impression of each surface while focusing on each design element [55].
Only in RE did the semantic pair "Slippery-Sticky" show a significant correlation between the subjective SD ratings and the design parameter of squeak. This indicates that affect related to slippery sensation is most appropriately evaluated in RE because participants could distinguish a slippery surface when one comparative sample was given. In general, participants' confidence in the evaluation was shown to be affected by increased availability of tactile inputs (with a reference point) because participants who relied on touch became frustrated by being unable to touch the samples [56]. However, in tactile sensory evaluations such as AE 1 and AE 2, participants' ability to remember prior affects seemed to weigh heavily in evaluating the remaining samples; furthermore, the slippery surface of "leather" was hardly perceived by most participants [43]. In conclusion, participants had difficulty finding a significant difference in the slippery surfaces of the leathers with just a quick and simple touch in both AE 1 and AE 2.
The repeated-measures ANOVA, on the other hand, revealed no significant difference in the perceived distinguishability of samples among the three evaluation methods for the four semantic pairs of "Dark-Bright," "Dry-Moist," "Flat-Rugged," and "Slippery-Sticky," and most perceived distinguishability scores were less than five points out of seven, suggesting that participants' self-confidence in conducting their evaluations was low. These common results, nevertheless, indicate that human observers are skillful at recognizing and classifying materials as evaluation sessions were added. In other words, humans seem to be better at judging material classes and material properties than they expect [57][58][59].
Previous studies have shown that hardness-related tactile sense is well perceived by humans [19,60,61], and the repeated-measures ANOVA and post-hoc comparisons showed significant differences in perceived distinguishability scores between two of the SD methods (AE 1 and RE). RE was found to give more distinctions among samples than AE 1. However, this perceived distinguishability was entirely inconsistent with the results of the affective evaluations performed while interacting with the real samples, as discussed below.
Humans can generally distinguish the hardness of an object by tapping its surface, and the frequency of the vibration caused by tapping provides the cue for the perception of hardness [62]. Although the semantic pair "Hard-Soft" showed a significant correlation between the subjective SD ratings and its design parameter of softness in two evaluation methods (AE 2 and RE), the results were only valid for AE 2. Only in AE 2 did the participants perform correct evaluations consistent with the physical measures; they perceived samples as softer as their softness (as a design parameter) increased. Although RE had the highest absolute correlation, RE is inappropriate for evaluating softness-related affect. In fact, the participants exhibited inverse discrimination of the hardness of the leathers while tapping the surfaces on the vehicle instrument panel; they actually perceived samples as harder as their softness (as a design parameter) increased. Therefore, especially in RE, the origin of the vibration stimuli can be assumed to be the instrument panel itself rather than the leathers.
The two semantic pairs of "Matt-Glossy" and "Cooling-Warming" showed no significant differences among the three types of SD methods. They also showed no significant correlation with the design parameters of gloss and thermal conductivity, respectively. Affects concerning glossiness (e.g., glossy, shiny, and silky) are greatly influenced by the fundamental ambiguity of the visual system under arbitrary viewing conditions, and they can be completely obscured by lighting, geometry, or other factors [63]. Fleming (2014) [51] revealed that inferring true physical parameters such as gloss and viscosity is often impossible. Similarly, the warm-cool feeling is associated with thermal comfort, described as human satisfaction with environmental thermal conditions, which is substantially affected by a combination of physiological, psychological, and physical factors (e.g., air temperature, air velocity, radiant temperature, and relative humidity) [64]. In particular, the tactile affect was heavily influenced by uncontrolled variables such as the condition of the participants and their surroundings. In conclusion, gloss and warmness were found to be affects that could not be distinguished regardless of the evaluation method. Rather than adjustments to the order in which samples are presented, they require more careful attention to the environmental setting. In particular, because the values of these two design parameters (gloss and thermal conductivity) are relatively close across the six samples, insignificant results may have been found in all evaluation methods.
Such environmental thermal conditions and uncontrolled variables (e.g., conditions of the participants and their surroundings) can be called contextual factors. In this comparative study, the data might have been affected by these contextual factors, and additional factors are listed as follows. First, the experimental procedure was designed as a series of evaluations over three days. Second, bipolar scales were used for measurement. Third, the experimental environment was an indoor laboratory. In terms of the experimental procedure, participants' decisions might have been affected after each choice by repeatedly choosing among the samples [65]. In terms of the bipolar scales, participants might have rated on the bipolar scales with inaccurate or opposite concepts in mind [11]. In terms of the experimental environment, the samples could not be evaluated in natural light as in the actual context of use.
To sum up, an affective evaluation produced better results when pairwise samples were presented, meaning that maintaining the distinction between samples is important. This characteristic of both AE 2 and RE reflects the reality that consumers often touch products to compare multiple items in a single category because frequent touches on multiple items provide consumers with a reference point for evaluating relative quality [66]. They can also reduce the effort required to build mental models in material perception, whether extreme or moderate, as participants did in AE 1. Fleming (2017) [25] also claimed that material perception was data-driven because participants estimated material properties by identifying feature dimensions that interpolated between samples with which participants were experienced.

Conclusions
This study attempted to compare the conventional SD methods in affective evaluation. Three statistical methods (repeated-measures ANOVA, post-hoc analysis, and correlation analysis) were used for the analysis. Neither repeated-measures ANOVA nor post-hoc analysis yielded a valid result; in the correlation analysis, on the other hand, both AE 2 and RE were more appropriate for affective evaluation than AE 1, especially with respect to distinguishability among samples. However, a better method can be chosen for a given affect by considering the characteristics of that affect. For example, AE 1 is recommended for affects of humidity and roughness; AE 2 is recommended for affects of hardness; RE is recommended for affects of brightness and slipperiness. Choosing either AE 2 or RE for the evaluation can prevent participants from excessively extrapolating beyond a superficial understanding of the presented information. Developing a "within-sample reference model" is most appropriate for deriving more accurate and insightful results from an SD method in an affective evaluation. A limitation of this study is that the evaluation time of each SD method was not clearly recorded. However, this study proposed a new strategy for determining an appropriate SD method in affective evaluation by increasing the distinguishability among the samples depending on the nature of each affect, revealing that granting increased evaluation time is well worthwhile to achieve reliable results. In further studies comparing the three SD methods, more factors influencing the experiments, such as the number of participants, the number of evaluation samples, and the number of semantic pairs, can be varied to expand on the results of this study with more advanced statistics in the analysis.