Comparison of Home Use Tests with Differing Time and Order Controls

Consumer tests are classified by testing location as laboratory tests or central location tests (CLTs) and home use tests (HUTs). CLT is generally used in sensory tests due to the ease of test control, whereas HUT has higher validity because of real consumption. However, the lack of test control in HUT is a major issue. To investigate error occurrence and the effort required to minimize errors, three groups of tests were designed with differing time and order controls, and evaluation was conducted using six snacks with texture differences. Errors related to time, order, and consumer or sample number were higher for more controlled conditions; however, most errors were recoverable using identification information, except for cases of no response. Additionally, consumers preferred to consume all snacks in the evening at home, which differed from the typical 9 a.m. to 6 p.m. evaluation time in CLT. However, the evaluation timing differed from the consumers' self-reported snacking time. The research title that included the term ‘home’ might have influenced the participants’ choice of location for evaluation. Overall, there was no significant difference between the results of groups despite different time and order controls, which could increase the applicability of HUT.


Introduction
Acceptance tests are crucial for food companies, as acceptability can be used to estimate the possible repurchase of products by customers. These tests are usually conducted at sensory laboratories under a controlled environment using samples and preparations, which have been used to predict potential long-term purchases [1]. Consumer acceptance tests are largely divided into laboratory tests or central location tests (CLT) and home use tests (HUT). In CLT, consumers visit a specific place such as a shopping mall, hospital, or sensory test lab to undergo the test. Most of the external factors, except certain environmental attributes under investigation, can be controlled at these places. Thus, environmental control is important in sensory tests. Even though CLT is widely used, it is unreasonable to evaluate the whole product experience with a small serving presented in a relatively short exposure time. Its real-world validity has therefore been questioned [2,3], because consumers are affected in practice by a variety of environmental factors in day-to-day life. Hence, control elements of sensory tests have been improved and tested for years to reflect real-use environments [4-7]. In contrast, HUT is conducted at home, where consumers can evaluate in natural circumstances; thus, it is one of the most notable methods to measure acceptability in real consumption. Nevertheless, the biggest challenge of HUT is control, as the evaluation is autonomous and is influenced by external factors such as evaluation of an uncertain amount of sample, improper focus, and interference from other family members during evaluation.
In this respect, several comparative studies between CLT and HUT have been conducted, as the context effect suggests that the results would differ depending on the test location [4,5,8-10]. Most studies drew higher scores from HUT as the consumers could [...] Some challenges from the past remain; however, with the development of internet technology, the data can be collected and checked on a real-time basis. Furthermore, HUT is being used more owing to the prohibition of public gatherings due to the COVID-19 pandemic. HUT can be an alternative method of testing in the new normal era. Although sensory evaluation had already started moving out of the controlled laboratory environment in order to reflect real consumption, COVID-19 accelerated the speed of change. HUT is a necessary form of consumer test. However, studies on these tests are lacking; hence, they need to be investigated further for development and standardization. In addition, many studies mention the importance and common limitations of HUT; however, information on methods for overcoming these limitations is insufficient. Therefore, it is imperative to review the kinds of errors that could occur while conducting HUT and to determine the amount of control/effort required to minimize them.
The objectives of this study were as follows: (1) to determine if home-use tests could be an alternative to central location tests or laboratory tests, (2) to study the kind of errors that could occur while conducting home use tests, and (3) to suggest critical control factors.

Participants
A total of 300 Korean participants (218 females and 82 males between 19 and 65 years) were recruited through the online bulletin board of Pusan National University or by word-of-mouth and enrolled using an online survey tool (Survey Monkey, Palo Alto, CA, USA). Participants were asked to use QR codes. The demographic information of the consumers, along with their snacking frequency and usual daily snacking time, is shown in Table 1. Participants were selected based on the frequency of snack consumption (at least once every other day), willingness to consume samples, and absence of food allergies or pregnancy. People were asked to choose, from a list, items that they were not willing to consume as a snack. Those who were unwilling to consume any of the test samples were automatically excluded from the study. The addresses of the participants were collected for shipping samples for testing at home. After the experiment, participants who completed the evaluation received mobile gift cards as compensation. This study was approved by the Institutional Review Board at Pusan National University (PNU IRB/2020_59_HR).

Samples
Jeltema et al. [22] divided individuals into four major groups depending on their mouth behavior: Crunchers, Chewers, Suckers, and Smooshers. Six commercial snacks (Table 2) with a clear difference in texture were selected according to mouth behavior. Samples were individually wrapped in a Kraft bag, labeled with three-digit random codes and consumer numbers. Participants received all samples and instructions at once by postal delivery.

Test Design
Participants were divided into three groups. Each group had 100 participants and their sex, age, and address were considered in order to balance their proportion. If the address provided for delivery of samples was the same, we considered the participants to be living together and assigned them to the same group to minimize confusion during evaluation. Group A was the 'No control' group for time and order, thus consumers were allowed to proceed with the evaluation at any time and in any order they wanted. Group B was the 'Order control' group that consisted of participants who were instructed to evaluate in a preassigned testing order. However, they could evaluate whenever they wanted. Group C was the 'Time and order control' group that had a preassigned order and evaluated one sample every 2 days, three times per week. No instructions were given regarding the minimum amount of consumption and presence of other people with the participants while conducting the test. The instruction manual used pictograms in the description for ease of understanding. Participants were simply referred to quick response (QR) codes on the instruction manual or click links that were sent via text messages. The 'No control' and 'Order control' groups used QR codes provided on the instruction manual, whereas the 'Time and order control' group received messages including the link to the online questionnaire at 10 am on Monday, Wednesday, and Friday for two consecutive weeks. If the participant did not respond to the questionnaire for more than 72 h since the last evaluation, a reminder text message was sent to finish the assigned evaluation. Participants in the 'Time and order control' group were reminded 2 days later to maintain the evaluation intervals with other participants in the same group. The participant evaluation was terminated if they ignored the message three times consecutively. 
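The messaging rules for the 'Time and order control' group can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the system used in the study: the function names, the status strings, and the example start date are hypothetical, and only the 10 am Monday/Wednesday/Friday schedule, the 72 h threshold, and the three-ignored-messages rule come from the text.

```python
from datetime import date, datetime, time, timedelta

REMINDER_AFTER = timedelta(hours=72)   # silence after which a reminder is sent
MAX_IGNORED_REMINDERS = 3              # consecutive ignored messages before termination


def link_schedule(first_monday):
    """Return six send times: 10:00 on Mon, Wed, Fri for two consecutive weeks."""
    if first_monday.weekday() != 0:
        raise ValueError("first_monday must fall on a Monday")
    day_offsets = [0, 2, 4, 7, 9, 11]  # Mon, Wed, Fri of week 1 and week 2
    return [datetime.combine(first_monday + timedelta(days=d), time(10, 0))
            for d in day_offsets]


def next_action(last_evaluation, now, ignored_reminders):
    """Decide what to do for one participant (hypothetical status labels)."""
    if ignored_reminders >= MAX_IGNORED_REMINDERS:
        return "terminate"
    if now - last_evaluation > REMINDER_AFTER:
        return "remind"
    return "wait"


# Hypothetical start date (a Monday), chosen only for illustration.
schedule = link_schedule(date(2020, 5, 4))
for send_time in schedule:
    print(send_time.strftime("%a %Y-%m-%d %H:%M"))
```

Keeping the thresholds as named constants makes it easy to test a stricter or looser reminder policy without touching the decision logic.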
A schematic diagram of the test design for comparing three home use test settings with differing test controls regarding evaluation time and test order is shown in Figure 2. Consumers filled out their identification number and product number each time and selected the time and location of their evaluation. The questionnaire was composed of queries regarding product acceptability, texture characteristics, and prior knowledge of the product. Overall acceptability and liking for the packaging, flavor, and texture were evaluated using the nine-point hedonic scale (1 = "dislike extremely" and 9 = "like extremely"). Flavor intensity, texture intensity, and afterfeel intensity were measured using the nine-point scale (1 = "extremely weak" and 9 = "extremely strong"), whereas the amount of residue was measured on the six-point scale (0 = "None" and 5 = "very much"). Participants checked suitable texture terms using check-all-that-apply (CATA), and the mouth behaviors (cruncher, chewer, sucker, or smoosher) that were relevant to the corresponding sample were determined. Subsequently, they answered yes/no questions about knowledge of samples, brand name, and experience. Additionally, willingness to purchase was measured using the five-point scale (1 = "I would definitely not buy" and 5 = "I would definitely buy").

Supplementary Questionnaire
After all the evaluations were done, participants were questioned about demographics, motivation of snacking, and mouth behavior. The snacking questionnaire consisted of three main parts. The first covered the frequency, time, and reason for snacking. The second covered liking for 11 snacks (snack, cookie or cracker, bread, fruit, chocolate, coffee, ice cream, beverage, jelly, nuts, and rice cake) using the nine-point hedonic scale (1 = "dislike extremely" and 9 = "like extremely"). Lastly, participants were asked to respond to questions about the motivation of snacking using the six-point scale (1 = "never" and 6 = "always") [23].
In the mouth behavior questionnaire, the participants were questioned for preference of mouth behavior. They responded to the degree of preference with the six-point scale (1 = "strongly disagree" and 6 = "strongly agree") using questions that revealed the difference in the texture of food items, and photographs representing each of the four mouth behaviors [24][25][26][27]. They also responded to questions about the condition of their teeth and were asked to allocate the importance of taste, flavor, and texture by percentage.

Data Analysis
Data were divided into two categories: completed without error and error occurred. Their frequency was recorded. Data that did not have any errors was considered complete data. Data having errors related to time delay, evaluation order, entry of wrong sample number, or consumer identification were recovered by tracking personal identification information (last four digits of the phone number). Missing response cases were not followed up due to time passing after consumption. After confirmation, all data, except that of participants who did not respond, were corrected and used for analysis. Demographic information was completed with further requests from participants. Therefore, the number of participants whose data was available was different depending on the samples. The number of days taken for completion of evaluation with all samples was presented as mean, minimum, median, and maximum values of the difference between the start and end date using a data unit. The frequency and percentage for evaluation time and place, knowledge of samples, mouth behavior (MB), and amount of consumption were calculated. Liking and perceived intensity scores, willingness to purchase, and adequate portion size were analyzed using analysis of variance (ANOVA) to determine significant differences among groups and samples within each group. When significance was found, the Fisher's least significant difference (LSD) was conducted as a post-hoc test at a significance level of 0.05. Additionally, the evaluated order from the 'No control' group was counted as the frequency.
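As an illustration of the comparison described above, the following minimal Python sketch runs a one-way ANOVA followed by Fisher's LSD at a significance level of 0.05. The liking scores, sample labels, and function name are made up for the example, and SciPy is assumed to be available; the software actually used in the study is not stated here.

```python
from itertools import combinations
from statistics import mean

from scipy import stats


def anova_with_lsd(groups, alpha=0.05):
    """One-way ANOVA, then Fisher's LSD pairwise comparisons if significant.

    groups: dict mapping a label to a list of scores.
    Returns (F statistic, p-value, list of significantly different pairs).
    """
    names = list(groups)
    samples = [groups[name] for name in names]
    f_stat, p_value = stats.f_oneway(*samples)

    # Pooled within-group (error) mean square and its degrees of freedom.
    n_total = sum(len(s) for s in samples)
    df_error = n_total - len(samples)
    ss_error = sum(sum((x - mean(s)) ** 2 for x in s) for s in samples)
    ms_error = ss_error / df_error
    t_crit = stats.t.ppf(1 - alpha / 2, df_error)

    different = []
    if p_value < alpha:  # LSD is only applied after a significant ANOVA
        for a, b in combinations(names, 2):
            lsd = t_crit * (ms_error * (1 / len(groups[a]) + 1 / len(groups[b]))) ** 0.5
            if abs(mean(groups[a]) - mean(groups[b])) > lsd:
                different.append((a, b))
    return f_stat, p_value, different


# Made-up nine-point liking scores for three samples (illustration only).
liking = {
    "spread wafer": [7, 8, 7, 8, 7],
    "potato chips": [7, 7, 8, 7, 8],
    "apple sauce": [3, 4, 3, 4, 3],
}
f_stat, p_value, different = anova_with_lsd(liking)
print(f"F = {f_stat:.1f}, p = {p_value:.3g}, different pairs: {different}")
```

The same function can be applied either to samples within one group or to the three control groups for a single sample, which mirrors the within-group and between-group comparisons reported in the results.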
Data from CATA was presented in terms of the frequency of selected sensory attributes and used for correspondence analysis (CA). The RV coefficient test was also performed using the results of CA to compare samples and terms between groups.
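For reference, the RV coefficient between two configurations can be computed as sketched below. This is a minimal NumPy illustration assuming each configuration is a matrix with the same rows (samples or terms) and CA dimensions as columns; the coordinates are invented, the function name is hypothetical, and the significance test that accompanies the reported p-values is not covered.

```python
import numpy as np


def rv_coefficient(x, y):
    """RV coefficient between two configurations with matching rows.

    RV = tr(XX'YY') / sqrt(tr(XX'XX') * tr(YY'YY')), with X and Y
    column-centered. It is 1 for configurations that differ only by
    rotation, translation, or uniform scaling.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x = x - x.mean(axis=0)
    y = y - y.mean(axis=0)
    sxx = x @ x.T
    syy = y @ y.T
    return np.trace(sxx @ syy) / np.sqrt(np.trace(sxx @ sxx) * np.trace(syy @ syy))


# Invented two-dimensional coordinates for six samples (illustration only).
config_a = np.array([[1.0, 2.0], [-1.0, 0.5], [0.0, -2.0],
                     [2.0, 1.0], [-2.0, -1.5], [0.0, 0.0]])
quarter_turn = np.array([[0.0, -1.0], [1.0, 0.0]])
config_b = 3.0 * (config_a @ quarter_turn)  # rotated and rescaled copy of config_a

print(round(float(rv_coefficient(config_a, config_b)), 6))
```

Because the coefficient ignores rotation and scale, values close to 1 indicate that two groups arranged the samples (or terms) in essentially the same relative pattern.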

Checking of Errors and Analyzable Data
The frequency of analyzable data acquired from evaluations is shown in Table 3. Complete data indicates data collected correctly for the conditions of each group without any errors, including time delay. Overall completeness was the highest in the 'No control' group and was the lowest in the 'Order control' group. Completion of the survey was influenced by the degree of controls such as preassigned order and evaluation interval for each group. As the 'No control' group had the least control compared to the other two groups, only seven participants exceeded the evaluation period.
Simple errors such as the entry of incorrect consumer numbers and three-digit random codes were detected and the data were saved. In the 'Order control' and 'Time and order control' groups, some evaluations were done without following the test design protocol; however, their sample numbers were identified from the data. When the evaluation was delayed, participants were reminded and the evaluation period was extended. In addition, no response was also counted as an error. The pattern of overall error occurrence was similar to that of completeness, with the 'Order control' and 'Time and order control' groups having nearly twice as many errors as the 'No control' group. The 'Order control' group accounted for most of the preassigned order errors. However, a few also occurred in the 'Time and order control' group despite the notification. Additionally, the completeness of the 'Order control' group was the lowest, although its degree of control was not greater than that of the 'Time and order control' group, because the preassigned order error was counted several times per person. When the preassigned order error was counted once per participant, even if it occurred more than once, the order error occurrence was the highest in the 'Time and order control' group, followed by the 'Order control' and 'No control' groups. Other errors were mostly related to incorrect entry of numbers, such as consumer numbers or three-digit random codes.
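The two ways of counting order errors discussed above, once per occurrence versus once per participant, can give different group rankings. The following minimal Python sketch shows the difference on an invented error log; the participant identifiers, the convention that the first letter of an identifier encodes the group, and the function names are all hypothetical.

```python
from collections import Counter

# Invented log: one (participant_id, error_type) entry per flagged evaluation.
# The leading letter of the hypothetical ID stands for the group (B or C).
error_log = [
    ("B012", "order"), ("B012", "order"), ("B012", "order"),
    ("B047", "order"),
    ("C003", "order"), ("C003", "sample_number"),
    ("C021", "order"),
    ("C030", "order"),
]


def order_errors_per_occurrence(log):
    """Count every order error, even repeated ones by the same participant."""
    return Counter(pid[0] for pid, kind in log if kind == "order")


def order_errors_per_participant(log):
    """Count each participant at most once, however often the error recurred."""
    offenders = {pid for pid, kind in log if kind == "order"}
    return Counter(pid[0] for pid in offenders)


print(order_errors_per_occurrence(error_log))
print(order_errors_per_participant(error_log))
```

In this invented log, group B has more order errors per occurrence, whereas group C has more participants with at least one order error, which is the kind of reversal the text describes.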
Recoverable errors referred to error data that was correctable. Even if the same consumer repeated mistakes more than twice, these were treated as an error. Participants had to enter the last four digits of their phone numbers for each evaluation, hence, their consumer numbers could be traced and errors could be rectified.
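The recovery step can be pictured as a lookup from the phone digits to the assigned consumer number. The sketch below is a minimal Python illustration under stated assumptions: the roster, field names, and status labels are hypothetical, and the handling of two participants who share the same last four digits (flagging for manual follow-up) is an added assumption rather than a procedure described in the study.

```python
def recover_consumer_number(response, roster):
    """Check or correct the consumer number of one response.

    response: dict with hypothetical keys 'phone_last4' and 'consumer_no'.
    roster:   dict mapping last four phone digits to a list of consumer numbers.
    """
    candidates = roster.get(response["phone_last4"], [])
    if response["consumer_no"] in candidates:
        return {**response, "status": "complete"}
    if len(candidates) == 1:
        # Exactly one participant has these digits, so the entry can be corrected.
        return {**response, "consumer_no": candidates[0], "status": "recovered"}
    # Unknown digits, or several participants share them: needs manual follow-up.
    return {**response, "status": "needs_follow_up"}


# Invented roster and responses (illustration only).
roster = {"1234": ["017"], "5678": ["045", "088"]}
responses = [
    {"phone_last4": "1234", "consumer_no": "071"},  # digits transposed
    {"phone_last4": "5678", "consumer_no": "045"},  # consistent entry
    {"phone_last4": "5678", "consumer_no": "054"},  # ambiguous match
]
for r in responses:
    print(recover_consumer_number(r, roster))
```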
The sum of missing data in each group was in the following order: 'No control' group followed by 'Order control' and 'Time and order control' groups. For the 'Order control' group, missing data for apple sauce and potato chips was higher than the total number of dropped consumers. Moreover, participants who did not complete all evaluations were also in the order of 'No control', 'Order control', and 'Time and order control', however, some of their data were included.
Participants who did not complete the evaluation within the preset time period were considered for extended evaluation. The number of participants requiring extended evaluation was considerably greater in the 'Time and order control' group than in the 'No control' and 'Order control' groups. In the 'Time and order control' group, the participants could not proceed to the next step on their own because they were informed of the testing sample within the evaluation intervals. Hence, the 'Time and order control' group had a lower frequency of no response and dropped consumers, although it had the highest number of extended evaluations.
All other errors were recoverable with confirmation, except for non-response data. The response rate for all groups was high, and the frequency of error differed depending on the degree of control; the erroneous data could be converted into complete data using the identification information.

Evaluation Time and Place
Information on evaluation time and place for each group is shown in Table 4. Most samples were evaluated in the evening, except for jelly in the 'No control' group and candy in the 'Time and order control' group. Consumers in the 'No control' and 'Order control' groups also frequently evaluated samples in the afternoon. However, for the 'Time and order control' group, the frequency of evaluation was slightly higher in the morning, probably because the notification was sent in the morning. Samples in each group were evaluated with the least frequency at dawn. Depending on the food type, there may be a more appropriate time of day for consumption. Birch et al. [28] indicated that breakfast food items were preferred more in the morning than in the afternoon, whereas food items associated with dinner were preferred more in the afternoon than in the morning. CLT is normally conducted within a fixed time period, usually from 10 am to 6 pm, whereas participants of HUT choose an appropriate consumption time according to their convenience unless noted otherwise; hence, natural behavior can be practiced. A comparison of the evaluation time for CLT and HUT shows that most HUT participants usually conducted the evaluation in the late afternoon or evening [9]. This led to increased satisfaction due to free conditions [5]. Comparison of liking categories depending on evaluation time showed no difference (p > 0.05). In this study, evaluation time did not influence acceptability.

Foods 2021, 10, 1275

All samples were mostly consumed at home, followed by the workplace. Similarly, home was the most frequent snack consumption location for Canadians and Norwegians, followed by the workplace [29,30]. In addition, participants were informed about the 'Home use test' before the experiment. Most of them provided their home address as the evaluation location, as they might have thought, given the name of the test, that they had to evaluate at home.
More than half of the participants were office workers; therefore, the evaluation location had a greater influence on the snacking time compared to the supplementary questionnaire and real evaluation time. Table 5 shows the number of days taken for HUT with six samples, calculated as the difference between the start and end dates. The 'No control' and 'Order control' groups showed a similar pattern in the number of days taken, including mean, minimum, and median values, whereas the 'Time and order control' group had much higher values. When comparing the maximum number of days taken, the 'No control' and 'Time and order control' groups had a higher value than the 'Order control' group.
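The number of days taken can be derived from each participant's first and last evaluation dates, as in the minimal Python sketch below. The dates and function names are invented for illustration, and counting the plain difference (so that finishing on the start date gives 0 days) is an assumption; an inclusive count would add one day to every value.

```python
from datetime import date
from statistics import mean, median


def days_taken(start, end):
    """Difference between the first and last evaluation date, in days."""
    return (end - start).days


def summarize_days(spans):
    """Mean, minimum, median, and maximum days over (start, end) pairs."""
    days = [days_taken(start, end) for start, end in spans]
    return {"mean": mean(days), "min": min(days),
            "median": median(days), "max": max(days)}


# Invented (start, end) dates for three participants (illustration only).
spans = [
    (date(2020, 5, 4), date(2020, 5, 4)),    # all samples on one day
    (date(2020, 5, 4), date(2020, 5, 13)),
    (date(2020, 5, 5), date(2020, 5, 20)),
]
print(summarize_days(spans))
```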

Number of Days Taken for Home Use Test (HUT)
The testing days were not designated; hence, 'No control' and 'Order control' participants could conduct the test in one day, whereas 'Time and order control' participants could take up to 10 days to complete the evaluation, considering the interval time and the date of sending a text message. Interestingly, some participants from the 'Time and order control' group communicated with acquaintances in a different control group and received the survey link or QR code before their designated evaluation link was sent. However, we could not consider acquaintances when assigning participants to the same or different groups. The instructions have to clearly state that confidentiality should be maintained during and after participation, even if the participants are acquainted with each other.

Table 5. Number of days 1 taken for the home use test (HUT) with six samples.

Group No Control Order Control Time and Order Control
Mean (±SD) 9.0 ± 4.

The results of the maximum days taken by the 'Time and order control' group reflected the effect of the periodic testing intervals and reminders. One consumer from the 'No control' group took 26 days to complete the testing. The missing data were found a few days later and completed. The maximum number of days for completion of evaluation in the 'No control' group was 15 days without this data. However, several participants from the 'Time and order control' group who received the evaluation and supplementary questionnaire on the last day did not evaluate carefully. They completed only one of the questionnaires and the remainder was completed after the reminder. Another downside of sending reminders was that some participants from the 'No control' and 'Order control' groups completed all remaining evaluations at once after receiving the reminder because they thought that they were supposed to finish the questionnaires immediately. Furthermore, some participants lost the testing samples and requested to receive more samples. Therefore, they needed more time for testing.

Table 6 presents mean values for consumer acceptability, perceived intensity, and amount of residue for participants of 'No control', 'Order control', and 'Time and order control' groups. The mean liking scores were generally between 'Neither like nor dislike' and 'Like moderately'. In general, the liking, intensity, and amount of residue scores showed similarity among the three groups. Each liking category indicated very similar results for samples: spread wafer had the highest score in overall, package, and flavor liking category, while potato chips had the highest score in the texture liking category. Candy ranked the highest for after feel and texture intensity. Apple sauce scored the lowest in every liking category and intensity. The amount of residue showed the highest value for spread wafer and the lowest value for candy.
There were no significant differences (p > 0.05) between groups in liking, perceived intensity, and amount of residue, while there were significant differences (p < 0.05) within groups.

Consumer's Liking and Perceived Intensity of Samples
Overall liking was rated positively, ranging from neutral to like moderately, probably because all products were commercially available [31] and the comfortable condition in context could have positively influenced acceptability [5,32]. Apple sauce is not available for sale in Korean markets; therefore, the Korean consumers were not familiar with the product. However, its taste and texture were liked as they are similar to those of other products, such as apple pie [33]. Considering the results, participants did not have much knowledge of apple sauce (Table 7). Additionally, many participants answered in open-ended questions that they would eat the remaining sample with other snacks such as bread or yogurt rather than eating apple sauce by itself.

Abbreviation: LSD, least significant difference. 1 Evaluated using the nine-point scale from 1 (dislike extremely) to 9 (like extremely); the amount of residue was rated using the six-point scale from 0 (none) to 5 (very much). 2 A lower case letter indicates significant differences within each group (α = 0.05). 3 There was no significant difference between groups (α = 0.05).
Although spread wafer was also an unfamiliar product (Table 7), its liking score was relatively high, which could have been affected by brand awareness [34,35] and familiarity with the spread used for filling in the spread wafer. Soerensen et al. [36] indicated there was no dynamic liking when novel flavors were added because of a high perceived familiarity with chocolates. In addition, well-known samples were evaluated first, whereas unfamiliar samples were evaluated later in the 'No control' group (Tables 7 and 8).
Although samples were wrapped in Kraft bags, consumers in the 'No control' group might have opened all bags to choose their evaluation order. More than half of the participants had knowledge of the samples, except for apple sauce. Although the spread wafer had low product awareness and experience, its brand awareness was high. On the other hand, apple sauce was mostly evaluated last and had the lowest product and brand awareness and experience.

Purchase Intent and Price Willing to Pay
The results of purchase intent and price that the consumers were willing to pay are shown in Table 9. There was no significant difference between the groups (p > 0.05) and the samples showed significant difference within each group (p < 0.05). Similar to overall liking, spread wafer rated the highest in purchase intent, while apple sauce rated the lowest. Although apple sauce is similar to baby food, it was the only product not available in Korea, and thus was an unfamiliar product for the participants.
Participants were asked, as an open-ended question, the price in Korean won (KRW) that they were willing to pay for the provided quantity of each sample, and the mean values (and SD) of the responses are shown in Table 9. There was a significant difference (p < 0.05) within groups, while there was no significant difference (p > 0.05) between groups. All samples were rated similarly among the three groups, except potato chips and spread wafer in the 'No control' group. Participants were willing to pay the highest price for jelly and the lowest for candy. One problem was that candy, cereal bar, and spread wafer were provided in a quantity of more than one, which may have confused consumers as to whether the question referred to only one piece or to all pieces provided. Most samples were rated higher than the original price, except for potato chips and spread wafer (Tables 2 and 9). For apple sauce and spread wafer, the difference between the original price and the participants' response was larger than that for the other samples, owing to the participants' unfamiliarity with the products.

Abbreviation: LSD, least significant difference; SD, standard deviation. 1 Purchase intent was measured using the five-point scale from 1 = definitely would not purchase to 5 = definitely would purchase. 2 Appropriate price was asked as an open-ended question in KRW. An exchange rate of 1230 KRW per USD was used to calculate the price in USD (as of May 2020). The converted price was used for the analysis. 3 Sharing the same lower case letter means there is no significant difference between samples (α = 0.05). There was no significant difference between groups (α = 0.05).
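The currency conversion mentioned in the table notes amounts to a single division. The minimal Python sketch below assumes that the stated rate means 1230 KRW per 1 USD; the function name, the rounding to cents, and the example prices are hypothetical.

```python
KRW_PER_USD = 1230  # rate stated for May 2020, read here as KRW per 1 USD


def krw_to_usd(price_krw):
    """Convert a price in Korean won to US dollars, rounded to cents."""
    return round(price_krw / KRW_PER_USD, 2)


# Invented willingness-to-pay answers in KRW (illustration only).
for price in (1000, 1230, 1500):
    print(price, "KRW ->", krw_to_usd(price), "USD")
```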

Mouth Behavior Used during Consumption
Consumers chose all relevant mouth behavior such as cruncher, chewer, sucker, or smoosher during consumption (Table 10). The highest frequency of mouth behavior for each sample was similar between groups. The mouth behavior commonly used for apple sauce and candy was sucker, and that for cereal bar, jelly, potato chips, and spread wafer was chewer, which was slightly different from the expected results (Table 2). Although the smoosher category included soft food, such as ripe bananas and custard, apple sauce was close to liquid, and a large portion of consumers answered that they could not feel any texture and drank it like a juice. Moreover, the chewer category was also selected because of tiny particles. Others had the highest frequency in the chewer category except for candy because its texture changed during eating. Jeltema et al. [22] mentioned that people perceived the overall texture of a food item as the texture that lasts the complete duration, rather than that at the beginning.

Correspondence Analysis of Texture Characteristics
The correspondence analysis (CA) biplots show the relationship between snack samples and the 51 texture attributes evaluated using the CATA method for each group (Figure 3). With Dimension 1 (Dim 1) and Dimension 2 (Dim 2), Figure 3a-c explains 65.15% of the data variance in the 'No control' group, 65.22% in the 'Order control' group, and 66.11% in the 'Time and order control' group. The RV coefficients indicated that the term configurations were similar between the 'No control' and 'Order control' groups (RV = 0.963, p < 0.001), the 'No control' and 'Time and order control' groups (RV = 0.961, p < 0.001), and the 'Order control' and 'Time and order control' groups (RV = 0.955, p < 0.001). The sample configurations were likewise similar between the 'No control' and 'Order control' groups (RV = 0.989, p < 0.001), the 'No control' and 'Time and order control' groups (RV = 0.997, p < 0.001), and the 'Order control' and 'Time and order control' groups (RV = 0.984, p < 0.001). Samples with a difference in texture were dispersed into each quadrant and were explained by nearby texture characteristics.

Analysis of Portion Size by Consumers
The amount of consumption evaluated by consumers is shown in Table 11. All groups showed similar results. More than half of the participants consumed all the provided quantities of cereal bar, potato chips, and spread wafer, whereas more than 40 participants consumed less than 1/3 of the provided quantity of apple sauce, candy, and jelly. Liking has been reported to be positively related to consumption [37,38]. However, in our study, overall liking was high for candy and jelly, while their consumption was low. This might be related to the time required to consume these food items because of their texture attributes (Figure 3). Furthermore, the adequate portion size evaluated by consumers was lower than the provided quantity when the samples provided were considered as 100 percent (Figure 4). There was a significant difference (p < 0.05) within groups, while there was no significant difference between groups (p > 0.05). The quantity of samples provided in CLT is relatively smaller than that in HUT, along with a brief exposure time [39]; hence, the amount of consumption could be mispredicted in a laboratory setting. Gough et al. [40] found that participants might underestimate the portion size consumed in laboratory settings because of a tendency to conceal their eating behavior.

Suggestions for the Home Use Test and Limitations
Unlike in CLT, many external factors influence testing in a real setting; hence, greater effort is required for the evaluation of several samples in HUT. After follow-up, the final number of participants with completed data was high, despite the occurrence of various errors. Most errors were related to incorrect entry of consumer numbers or three-digit random sample codes. Other errors included not following the preassigned order of testing or extending the evaluation period. The consumer and sample numbers were both three-digit numbers. This might have confused the participants, as they had to enter these numbers directly in open-ended questions while handling six samples received at once. Nevertheless, identification information helped in the correction of errors.
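As an illustration of how such entry errors could be flagged during follow-up, the following is a minimal sketch that checks each submitted consumer number and sample code against an assignment table. The function `check_entry`, the table layout, and all numbers are hypothetical and do not describe the procedure used in this study.

```python
def check_entry(assigned, consumer_id, sample_code):
    """Classify one questionnaire entry against the assignment table.

    `assigned` maps each consumer number to the set of three-digit sample
    codes mailed to that consumer (a hypothetical layout).
    """
    consumer_id = consumer_id.strip()
    sample_code = sample_code.strip()
    if consumer_id in assigned and sample_code in assigned[consumer_id]:
        return "ok"
    # Both fields are three-digit numbers, so they are easily swapped.
    if sample_code in assigned and consumer_id in assigned[sample_code]:
        return "swapped_fields"
    if consumer_id not in assigned:
        return "unknown_consumer"
    return "unknown_sample"


# Hypothetical assignment table: consumer number -> mailed sample codes.
assigned = {
    "101": {"482", "917"},
    "102": {"356", "729"},
}

for entry in [("101", "482"), ("482", "101"), ("999", "482"), ("101", "356")]:
    print(entry, check_entry(assigned, *entry))
```

Entries flagged as anything other than "ok" would still need to be resolved manually, for example by using the identification information mentioned above.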
Kraft bags were used to hide the sample packages before evaluation. However, they may not have served the purpose of random sample selection in the 'No control' group, as familiar and/or preferred products were evaluated first. Moreover, a few participants unwrapped the Kraft bags as soon as they received them, thereby mixing up the sample numbers. Repackaging may therefore be required until the effect of packaging is studied, and in such cases consistent food quality should be ensured by avoiding external factors such as temperature, prolonged contact with humid air, or direct sunlight. Moreover, it is important to check whether the shelf life of the product covers an extended evaluation period.
Six samples were provided simultaneously by postal delivery before the start date. However, some participants from the 'No control' and 'Order control' groups conducted the test before the announcement of the start date using the QR code on the instruction manual. On the other hand, the 'Time and order control' group could start testing only on the day the survey link was provided. The survey link should be provided for the first sample evaluation if the start date of the evaluation is to be fixed as it eases the follow-up process.
A new finding of this study was that the job profile of the participants and the time at which messages were sent could influence the evaluation time. The term 'home' used in the research title in the testing information may have influenced the evaluation place. In addition, the QR code allowed more freedom than the link in terms of evaluation time. Furthermore, the access time in the online evaluation did not coincide with the evaluation time reported in the questionnaire, even though this was mentioned in the written instructions. Hence, it is important to state the instructions prominently or to use a video for better understanding.
Social communication was not considered in our study. Snacks are generally eaten alone; however, their acceptability and consumption can be influenced by social interaction. In our study, family members or housemates were classified in the same group. However, this factor was excluded from the analysis due to low occurrences. Furthermore, some participants who were acquaintances contacted each other regarding the testing, although they were not in the same group during the evaluation period. Thus, better instruction is needed to avoid communication among participants and thereby reduce errors, such as participants from the 'Time and order control' group receiving survey links from the 'Order control' group. The recommendations for HUT are summarized in Figure 5.

<Control considerations for HUT>
1. If personal identification information is added to the questionnaire, errors can be tracked and corrected easily. However, the information should be deleted after data acquisition.
2. Repackaging may be required in certain cases until the effect of packaging is studied, and consistent food quality should be ensured in such cases by avoiding exposure to external factors. The shelf life should be checked as well.
3. If the start date is fixed, a survey link is preferable for the first sample evaluation, as it eases follow-up in the evaluation process.
4. The working hours and environment of the participants and the time of sending text messages can influence the evaluation time. The QR code on the package allows more freedom than the link in terms of evaluation time. It should be mentioned on the product packaging and in the instruction manual.
5. Use of the term 'home' in the research title in the testing information may influence the location of evaluation. Additional information regarding location should be provided if the naturalness of the location is important.
6. The instructions should be emphasized properly, or videos should be used to explain the evaluation procedure, for better understanding and to avoid a discrepancy between the online access time and the evaluation time in the questionnaire.
7. Family members or housemates should be classified in the same group because social communication influences product acceptability and consumption.
8. The instruction manual should mention that participants are not allowed to communicate regarding the test, especially if the testing conditions differ between groups.

Conclusions
This study investigated and compared the results of three groups differing in time and order control using six samples with different textures in the home use test. It aimed to determine the amount of control and effort required to handle errors that might occur while conducting HUT. Overall, the results of the evaluation were similar between the groups, regardless of the degree of control. Thus, HUT can be utilized similarly to CLT as a consumer test in terms of the number of samples. HUT allows the evaluation of samples in a real environment and can be designed to evaluate long-term usage. It can be utilized to improve the launch of new products and to evaluate their success.
Little research has been conducted in realistic environments, and there are almost no reports on the errors that may occur while conducting HUT. In addition, HUT has disadvantages such as its cost and high dropout rate. This study evaluated two control factors in HUT: preassigned order and evaluation time. Participants evaluated six snack samples, a number that is normally evaluated in one session in CLT or laboratory evaluation. Had CLT or laboratory tests been included as a control for comparison with HUT, our findings would have been better validated. As only a small number of consumers participated in this study, our findings may not be generalizable. When HUT is conducted with more consumers and a higher number of samples than the one or two samples of traditional HUT, more errors or a higher dropout rate might be observed. More experiments on HUT should be performed for generalization. Furthermore, other control factors, such as sample temperature, should also be considered in the future.