Applying Collective Intelligence in Health Recommender Systems for Smoking Cessation: A Comparison Trial

Background: Health recommender systems (HRSs) are intelligent systems that can be used to tailor digital health interventions. We compared two HRSs to assess their impact when providing smoking cessation support messages. Methods: Smokers who downloaded a mobile app to support smoking abstinence were randomly assigned to two interventions. They received personalized, ratable motivational messages on the app. The first intervention had a knowledge-based HRS (n = 181): it selected random messages from a subset matching the users’ demographics and smoking habits. The second intervention had a hybrid HRS using collective intelligence (n = 190): it selected messages applying the knowledge-based filter first, and then chose the ones with higher ratings provided by other similar users in the system. Both interventions were compared on: (a) message appreciation, (b) engagement with the system, and (c) one’s own self-reported smoking cessation status, as indicated by the last seven-day point prevalence report in different time intervals during a period of six months. Results: Both interventions had similar message appreciation, number of rated messages, and abstinence results. The knowledge-based HRS achieved a significantly higher number of active days, number of abstinence reports, and better abstinence results. The hybrid algorithm led to more quitting attempts in participants who completed their user profiles.


Introduction
For decades, computers have been used to generate health recommendations [1,2]. This usage of computers to adjust health materials to each person, in order to make them relevant and credible for their situation, replicating what an actual human counselor would do, is called computer tailoring [3,4]. Computer tailoring involves the generation of participant-specific recommendations, typically in the form of messages, by computers, which is done after an assessment of each person to match their characteristics, needs, and
In the following sections we explain how this study was performed, including its materials and methodology, and provide an overview of the demographics of the recruited participants, the results of each metric, and a discussion on the findings.
Moreover, traditional computer tailoring was based on 'static' scores for each individual's answers. Researchers started considering the usage of recommender systems in a number of health interventions. A recent well-studied therapeutic area is nutrition [48][49][50][51][52]. For example, Elahi et al. (2015) introduced and evaluated a food recommender system, obtaining user preferences through an interaction design, which showed positive user feedback [53]. Musto et al. (2020) [50] presented a strategy using knowledge-aware recommender systems to recommend an appropriate diet. Gómez-del-Río et al. (2020) described an activity recommender system to promote healthy habits in obese children, while using gamification as a mechanism to enhance personalization and increase user motivation [54]. In addition, several other studies have investigated the usage of recommender systems to help people nourish themselves more healthily [55][56][57][58].
However, there has been a need to optimize content not only towards individual past preferences, but also to support healthy content, thus ensuring optimization to a user's health goals [48].
Other therapeutic areas, such as cervical cancer [59], diabetes [60], and mental health [61], have also been explored. However, these studies did not implement an artificial intelligence-based framework [60], and some even lacked a filtering functionality for the recommended activities [61].
For the specific case of smoking cessation support, a previous study using an HRS which featured a hybrid algorithm (a combination of demographic filtering, content-based, and utility-based approaches) showed positive effects of the HRS that supported behavioral changes [62,63]. This hybrid HRS was embedded into a mobile app incorporated into the routine care workflow of a hospital-based smoking cessation unit, and the app was offered to patients referred from other specialized care units of the hospital. The PERSPeCT study [64] compared rule-based computer tailoring to a hybrid recommender system. However, it did not assess smoking cessation at six months and was limited to shorter-term outcomes. Another study, Smoker2Smoker [65], used a recommender system that combined content-based ranking and collaborative filtering methods to select a message for an individual participant. However, because it was an observational study, its ability to establish causality was limited.
As interest in recommender systems has grown, multiple literature reviews have been conducted in recent years around the HRS topic [28][66][67][68][69][70][71]. Hors-Fraile et al. (2018) suggested improvements in HRSs by utilizing relevant behavior change theories and applying important features of tailored interventions [28,67]. A review by Ferretto et al. (2017) indicated limited use of recommendation systems in mobile healthcare applications, although the use of mobile phones and connection to the internet is ever increasing [69]. Another scoping review by Cheung et al. (2019) illustrated the potential benefits and need for incorporating a collaborative filtering method with demographic filtering as a second step to knowledge-based filtering in HRSs [67]. The authors also demonstrated the potential of this type of hybrid HRS towards enhancing user experience. To the best of our knowledge, there has been no randomized study comparing two smoking cessation HRSs and further analyzing the sub-group differences.

Design
The trial was approved by the Ethical Committee of Taipei Medical University Joint Institutional Review Board (TMU-JIRB) and conducted from 10 November 2017 to 15 January 2020. After voluntarily downloading the mobile app, without prior recommendation from any clinician, and filling out a baseline measurement, respondents were randomly assigned to either the knowledge-based algorithm (KBA) smoking cessation intervention or the hybrid algorithm (HA) smoking cessation intervention. Baseline measurements were taken between 10 November 2017 and 15 July 2019, and post-tests ended 6 months after the baseline. Between the baseline and final follow-up, respondents could freely interact with the system to obtain more motivational message support, as well as provide evaluations on the various messages they received.
The Android version of the app used in this study was launched on 10 November 2017, and the iOS version of the app was launched on 6 August 2018. Different time intervals were defined to perform in-depth analyses to measure the evolution of the two user groups: 0~7, 8~14, 15~21, 22~30, 31~60, 61~120, and 121~180 days. These time intervals were defined to have a good understanding of participants' evolution throughout the study, aiming to distribute the collected data in time periods where none would be without data. As time passes, participants are likely to stop using the app as intensively as in the initial days. The registered information would then be sparser across later months, and therefore the time periods should increase their duration. Grouping the data in longer time intervals allowed more effective analysis of each period, and the intervals are commonly assessed time points for smoking cessation assessments (7 days, 15 days, 1 month, 2 months, and 6 months). The initial time point of these time intervals was the first day with a valid quitting attempt.
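The interval scheme above can be illustrated with a short sketch that maps a day offset to its time interval. The names `INTERVALS` and `interval_label` are hypothetical and not taken from the study software; the sketch assumes that day 0 is the first day with a valid quitting attempt and that both interval bounds are inclusive.

```python
from typing import Optional

# Inclusive day ranges, counted from the first day with a valid quitting attempt (day 0).
INTERVALS = [(0, 7), (8, 14), (15, 21), (22, 30), (31, 60), (61, 120), (121, 180)]


def interval_label(days_since_attempt: int) -> Optional[str]:
    """Return the label of the interval containing the day, or None if outside 0..180."""
    for start, end in INTERVALS:
        if start <= days_since_attempt <= end:
            return f"{start}~{end}"
    return None
```

For instance, a report submitted 45 days after the first valid quitting attempt would fall in the 31~60 interval, and anything after day 180 would be left out of the interval analyses.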

Interventions
Both HRSs selected smoking cessation motivational messages and sent them to a mobile app used in the 3M4Chan study [72], performed within the H2020 Project SmokeFreeBrain [73]. The mobile app interface was the same in both groups (Figure 1), as were the personalized elements included in the motivational messages, such as referencing the name of the user in the message and the initial message delivery frequency between participants.
Following the delivery pattern of Abroms et al. (2015) [74], five messages were sent on the quitting date, one message per day during the first week after the quitting day, and three messages a week after that. This frequency could be changed every 14 days after the second week of the quitting attempt by participants themselves, regardless of their group. Users were given the choice of changing the frequency, i.e., increasing it up to one message per day or decreasing it down to one message per week, by answering a weekly question within the app. The time to send each message was set at random within a time range previously configured by each participant.
In total, the system had 311 different messages. The same message could be sent a maximum of three times to a participant, as explained in the system design description by Hors-Fraile et al. (2019) [26]. Also, users could report their abstinence status in the app by answering the following weekly question: "Are you still resisting the temptation or have you smoked? Please, be honest." The possible answers were (a) I have not smoked; (b) Only one cigarette; (c) Two or three cigarettes; and (d) Four or more cigarettes.

Messages
The messages used in the study were created by a behavioral science researcher in English. They were then translated into Mandarin Chinese, and validated by two Taiwanese doctors specialized in smoking cessation. The messages followed the tailoring recommendations for smoking cessation support made by the World Health Organization (WHO) [75]. These guidelines included case scenarios of behavioral support sessions, reasons, strategies, and tips for people to stop smoking. We reflected that knowledge in the sentences by elaborating on recurrent topics and following suggested approaches to support abstinence.
In addition, we introduced determinants of change, that is, psychological constructs that influence a behavior, as proposed in the I-Change model [29], which was previously used in smoking cessation studies [76][77][78][79]. The chosen determinants were used in previous studies to increase awareness, raise motivation, and change behaviors [78][80][81][82][83]. The included determinants were: attitudes towards stopping smoking (the perceived advantages and disadvantages of quitting), social support to quit, skills to manage situations when feeling tempted to smoke, self-efficacy to quit (the smoker's perceived ability to achieve cessation), and action planning of the tasks needed to be successful (e.g., throwing away all ashtrays at home). All messages were phrased from a positive point of view, addressed the reader with the pronoun 'you', and used the active voice and easy-to-understand vocabulary, avoiding technical terms and complicated words. Messages had a maximum length of 200 words in English, with an average of 85.5 words per message. Ten of the behavior change techniques proposed by Abraham et al. (2011) [84] were included across the messages. The messages also considered health communication methods, such as repeating answers, creating empathy, adding new knowledge, and changing existing misconceptions. Additional examples of the applications of these techniques can be found in Appendices C and D of the study by Hors-Fraile et al. (2019) [26].
As an illustrative example, the following motivational message is intended to enhance the social skills of a participant called 'John', who reported struggling with social pressure to smoke when completing his profile during the enrollment process in the mobile app:
"Hi John. You told us that you cannot refuse a cigarette when someone offers it to you. Well, we understand that you may struggle with it. However, you can practice how to kindly refuse a cigarette. A good reinforcing strategy is to add to the sentence one of the reasons you have to quit smoking. 'No. I am quitting because I want to keep my teeth clean and white', or, 'No. I am quitting to have a longer life and enjoy it with my grandchildren'. In this way, you say no and also remind yourself why you are doing it. If you prefer not to say the reason, you can just think it. To get a natural and almost immediate reaction, you can ask someone you know to role play a person inviting you to smoke, and you have to reject the invitation. Even if you think this exercise is not worth doing, it can help you succeed in a real situation."

Knowledge-Based Algorithm (KBA) System
The KBA system applied an adapted knowledge-based approach over a pool of smoking cessation motivational messages. The adaptation consisted of pre-selecting message characteristics based on five user profile features: age, gender, quitting date, number of smoked cigarettes, and weekly expenditure on tobacco products. If the message dealt with time- or date-specific content (e.g., things to do early in the morning, or related to the weekend), then the context in which the message was going to be sent (the time and day of the week) was also considered, and it was calculated before starting the KBA. Thus, the user requests that are necessary to run knowledge-based algorithms [85] were fixed by design. This was done so that participants would receive only one message at a time, according to the message delivery frequency pattern of Abroms et al. (2015) [74]. In addition, the intended intervention behavior was not to let participants choose among different messages, but to send them one relevant message per delivery. In this way, participants could focus their attention on a single message concept rather than being flooded with several concepts at once, which might hinder recall and could even overwhelm participants, producing anxiety and becoming a trigger to smoke.
Consequently, the KBA retrieved all messages compatible with the given participant profile. A message was compatible with the participant profile if all the meta-features (the defining attributes) of the message were also found in the user profile. If a message did not have a value for a specific meta-feature, that meta-feature was considered a valid match. For example, a message without a gender meta-feature was assumed to be compatible with both male and female users. We also enforced a strict matching process: if a message had a single meta-feature that was not present in the user profile, that message could not be sent to that user. A simple illustrative case is shown in Table 1. The last column indicates whether each message is compatible with the user profile of a participant with the following meta-features: middle-aged male, fewer than 10 cigarettes per day, whose quitting date was 9 days ago, with a low weekly expenditure, and who set no time limit to receive motivational messages during the day. The intermediate columns of the table represent the meta-features of each message. The result of applying the KBA was a list of compatible and potentially relevant messages for that user. Then, as only one message could be sent at a time, the system selected from the list the messages that had been sent the fewest times to that user and picked one at random in case of a tie.
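The matching and selection logic described above can be sketched as follows. This is a minimal illustration under simplifying assumptions, not the study's implementation: each meta-feature is assumed to hold a single categorical value, and the names `is_compatible`, `select_kba_message`, and the dictionary layouts are hypothetical. The cap of three deliveries per message is taken from the system description.

```python
import random
from typing import Dict, Optional

Profile = Dict[str, str]  # meta-feature name -> value (assumed single-valued)


def is_compatible(message_meta: Profile, user_profile: Profile) -> bool:
    """Strict match: every meta-feature set on the message must equal the user's value.

    Meta-features absent from the message are treated as valid matches.
    """
    return all(user_profile.get(name) == value for name, value in message_meta.items())


def select_kba_message(
    messages: Dict[str, Profile],
    user_profile: Profile,
    sent_counts: Dict[str, int],
    rng: Optional[random.Random] = None,
    max_sends: int = 3,
) -> Optional[str]:
    """Pick one compatible message among those sent the fewest times; ties broken at random."""
    rng = rng or random.Random()
    candidates = [
        mid
        for mid, meta in messages.items()
        if is_compatible(meta, user_profile) and sent_counts.get(mid, 0) < max_sends
    ]
    if not candidates:
        return None
    fewest = min(sent_counts.get(mid, 0) for mid in candidates)
    least_sent = [mid for mid in candidates if sent_counts.get(mid, 0) == fewest]
    return rng.choice(least_sent)
```

Passing a seeded `random.Random` makes the tie-breaking reproducible, which is convenient when testing the selection step in isolation.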

Hybrid Algorithm (HA) System
The HA system applied a two-step approach, also known as a cascade [86]. The first step ran the same knowledge-based algorithm as the KBA described above to filter out messages that were not compatible with the target participant. In the second step, a user-based demographic filtering algorithm was applied to the output of the first step (as opposed to the random selection used in the KBA). This second step selected a motivational message by prioritizing those found relevant by other similar participants (called neighbors) [87]. The selection used a score calculated for each message, which represented the probability that the message was relevant for the given user. The algorithm used for this calculation worked as follows.
First, all users and their message ratings were represented in a matrix (excluding the user who was going to receive the message, and that user's ratings). This means that we did not limit the neighborhood size (the number of users used to compute the similarity).
Second, a neighbor similarity index was computed for each user in the matrix following an equation whose terms are defined as follows:
• $A$ and $B$ are two users;
• $F_u$ represents all meta-features completed by user $u$;
• $F_u(x)$ represents the value of the meta-feature $x$ of user $u$;
• $|F_u(x)|$ represents the number of potential valid values of meta-feature $x$ of user $u$;
• $n$ is the total number of meta-features which can have only two values (e.g., yes or no, male or female, etc.);
• $m$ is the total number of meta-features which can have more than two values (e.g., high, medium, or low motivation level);
• $\delta_{y,z}$ represents a function that sums the number of matching meta-features between the lists of meta-features $y$ and $z$.
The system used all available information about user meta-features in the neighbor (with a minimum of nine, i.e., the core questions, and a maximum of 60 variables, after adding 51 voluntary extended-profile questions). The more meta-features in common between participants, the more similar their profiles were and the higher the likelihood of rating messages similarly. Hence, this similarity calculation approach could cover both participants who completed the extended profile and those who did not.
Third, the ratings for a message were multiplied by the corresponding user similarity index and added together. Writing $\mathrm{sim}(u, i)$ for the similarity index between users $u$ and $i$, this sum can be written as
$$\mathrm{score}_{m,u} = \sum_{i=1}^{n} \mathrm{sim}(u, i) \cdot \mathrm{rating}_{m,i}$$
where:
• $\mathrm{rating}_{m,i}$ is the rating given to message $m$ by user $i$;
• $u$ is the user that is going to get the recommendation;
• $i$ is one of the neighbors of $u$;
• $n$ is the total number of neighbors.
If a message had been re-rated by the same user, only the last rating was considered.
Fourth, the message relevance scores were normalized to a range between 0 and 1. Messages which were not rated previously were assigned a final score of 0.5. This represented a 50% chance of being relevant or not.
Fifth, the algorithm selected only the subset of messages which had been sent to the user the fewest times, to maximize message diversity and minimize repetitions. Messages which had already been sent three times to that user were discarded. We chose three as the maximum number because we wanted users to grasp the meaning of the messages and the implications of their arguments, following the recommendations of Cacioppo et al. (1989) [88]. Messages sent once or twice previously were sent with a complementary sentence in the same message acknowledging that the message had already been sent to the participant at some point, and stressing the importance of the message content as the reason why the participant was receiving it again.
Finally, a weighted random selection, based on the relevance score, was run to select the message to be sent. Figure 2 summarizes the message selection process for the two groups; the detailed selection process was previously explained by Hors-Fraile et al. (2019) [26].
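The third to final steps can be sketched in code as follows. This is an illustrative sketch rather than the study's implementation: the neighbor similarity indices are taken as precomputed inputs, the normalization (dividing by the largest raw score) is an assumption because the text only states that scores were scaled to the range 0 to 1, and the names `relevance_scores` and `select_ha_message` are hypothetical.

```python
import random
from typing import Dict, List, Optional

UNRATED_SCORE = 0.5  # final score assigned to messages that nobody has rated
MAX_SENDS = 3        # a message is discarded once it was sent three times to the user


def relevance_scores(
    candidates: List[str],
    ratings: Dict[str, Dict[str, int]],  # message id -> {neighbor id: latest rating, 1..5}
    similarity: Dict[str, float],        # neighbor id -> similarity index (precomputed)
) -> Dict[str, float]:
    """Similarity-weighted rating sums, scaled to [0, 1]; unrated messages get 0.5."""
    raw = {}
    for mid in candidates:
        by_neighbor = ratings.get(mid, {})
        if by_neighbor:
            raw[mid] = sum(similarity.get(n, 0.0) * r for n, r in by_neighbor.items())
    top = max(raw.values(), default=0.0)
    scores = {}
    for mid in candidates:
        if mid in raw:
            # Assumed normalization: divide by the largest raw score.
            scores[mid] = raw[mid] / top if top > 0 else 0.0
        else:
            scores[mid] = UNRATED_SCORE
    return scores


def select_ha_message(
    candidates: List[str],
    ratings: Dict[str, Dict[str, int]],
    similarity: Dict[str, float],
    sent_counts: Dict[str, int],
    rng: Optional[random.Random] = None,
) -> Optional[str]:
    """Keep the least-sent messages, then draw one with probability proportional to its score."""
    rng = rng or random.Random()
    scores = relevance_scores(candidates, ratings, similarity)
    eligible = [m for m in candidates if sent_counts.get(m, 0) < MAX_SENDS]
    if not eligible:
        return None
    fewest = min(sent_counts.get(m, 0) for m in eligible)
    least_sent = [m for m in eligible if sent_counts.get(m, 0) == fewest]
    weights = [scores[m] for m in least_sent]
    if sum(weights) <= 0:
        return rng.choice(least_sent)  # fall back to a uniform draw
    return rng.choices(least_sent, weights=weights, k=1)[0]
```

Because the final draw is weighted rather than greedy, lower-scored messages still have a chance of being delivered, which is consistent with the stated aim of preserving message diversity.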

Participants and Recruitment
Participants considered for this study were smokers over 20 years of age (the legal adult age in Taiwan) who were willing to quit, owned a compatible smartphone (Android or iPhone), downloaded the smoking cessation app "Quit and Return", registered on 15 July 2019 or earlier, accepted the terms and conditions of the mobile app, made a valid quitting attempt, and were able to understand either of the two languages in which the mobile app was offered: Mandarin Chinese or English. A quitting attempt was considered valid if it was not cancelled in the first 24 h after it was initiated, the quitting day was not set in the past, and the quitting day was set at most 90 days in the future with respect to the user registration date in the system. The system used a JavaScript random function to allocate registered users between the two groups, which received motivational messages to stop smoking under different personalization strategies. Participants were blinded to this randomization.
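The validity rule for quitting attempts can be expressed as a small check. The sketch below is an interpretation of the three conditions stated above, not the app's code: the function name and parameters are hypothetical, and "not set in the past" is assumed to be judged against the date on which the attempt was created.

```python
from datetime import date, datetime, timedelta
from typing import Optional


def is_valid_quit_attempt(
    registered_at: datetime,
    created_at: datetime,
    quit_day: date,
    cancelled_at: Optional[datetime] = None,
) -> bool:
    """Apply the three validity conditions to a single quitting attempt."""
    # Cancelled within the first 24 h after being initiated: not valid.
    if cancelled_at is not None and cancelled_at - created_at < timedelta(hours=24):
        return False
    # Assumption: "in the past" is relative to the day the attempt was created.
    if quit_day < created_at.date():
        return False
    # The quitting day may be at most 90 days after the registration date.
    if quit_day > (registered_at + timedelta(days=90)).date():
        return False
    return True
```

Under this reading, an attempt cancelled 30 hours after it was set still counts as valid, whereas one cancelled after 3 hours does not.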

Demographics
All participants had to answer nine core questions on their profile to use the app. These included their gender, age, employment status, date on which they began smoking, quitting date, number of cigarettes smoked weekly, and amount of money spent weekly on tobacco. In addition, they had to complete two standardized questionnaires to determine their nicotine dependence [89] and motivation to quit [90]. They could also voluntarily complete a questionnaire with 51 additional extended-profile features about comorbidities, people they share their house with, educational level, physical activity routines, and questions based on the I-Change behavioral change model [29], such as skills for stopping smoking, attitudes towards quitting, social support, self-efficacy, and action plans. This extended-profile questionnaire was used in the intervention by the HA to create a more comprehensive user profile and calculate similarities in the demographic filtering step.
We considered five user attributes as indicators of potential differences in the study outcome. Four of them were direct variables from the previously described core questions, and the fifth attribute was derived from answering the voluntary questions.
First, we assessed gender (male = 0, female = 1). Second, we assessed nicotine dependence based on the Fagerström test [89] (low = 0, high = 10). Its result was used to classify participants following the categorization included in the Fagerström test: participants with scores of ≤4 were included in the "low-dependence" sub-group (coded as 0), those with scores of 5 or 6 were included in the "medium-dependence" sub-group (coded as 1), and those with scores of ≥7 were included in the "high-dependence" sub-group (coded as 2). Third, we assessed the level of motivation to quit based on the Richmond test [90] on a scale from 0 (low) to 10 (high). These test scores were used to classify participants into the four motivation sub-groups proposed in the Richmond test using specified cutoff points [90]. We merged the low and medium-low sub-groups to avoid having two groups with too few participants. Thus, smokers with scores of ≤5 were assigned to the "low and medium-low motivated" sub-group (coded as 0), those with scores of 6 or 7 were included in the "medium-highly motivated" sub-group (coded as 1), and those with scores of ≥8 were included in the "highly motivated" sub-group (coded as 2).
Fourth, we assessed their age on the starting day of their first valid quitting attempt. To do so, we created two participant age sub-groups (generations), based on their potential familiarity with technology, as proposed by Berkup [91]: "baby boomers or older" (born before 1964) and "generation X or younger" (born after 1965) (baby boomers or older = 0, generation X or younger = 1).
Finally, we assessed participants' completion of the extended-profile questionnaire by checking their profile data registry (incomplete = 0, complete = 1). The latter measurement was only relevant for participants receiving recommendations generated by the HA, because the intervention with the simpler knowledge-based algorithm (KBA group) was not influenced by the extended-profile questionnaire.
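The score-based sub-group coding for nicotine dependence and motivation described above can be summarized in a short sketch. The cutoffs and codes are those stated in the text; the function names are hypothetical, and the range check simply reflects the 0 to 10 scale of both tests.

```python
def dependence_subgroup(fagerstrom_score: int) -> int:
    """Code the Fagerström score as 0 (low), 1 (medium), or 2 (high dependence)."""
    if not 0 <= fagerstrom_score <= 10:
        raise ValueError("Fagerström score must be between 0 and 10")
    if fagerstrom_score <= 4:
        return 0
    if fagerstrom_score <= 6:
        return 1
    return 2


def motivation_subgroup(richmond_score: int) -> int:
    """Code the Richmond score as 0 (low and medium-low), 1 (medium-high), or 2 (high)."""
    if not 0 <= richmond_score <= 10:
        raise ValueError("Richmond score must be between 0 and 10")
    if richmond_score <= 5:
        return 0
    if richmond_score <= 7:
        return 1
    return 2
```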

Outcomes

Message Appreciation
Message appreciation was assessed by comparing the average rating provided to messages rated by users of the KBA and HA groups in each time interval. Rating options ranged from one to five stars. Users could self-determine whether to rate messages or not, and so the frequency of message appreciation differed between persons.

Engagement with the System
We assessed three metrics in each time interval: (1) the number of active days in the system, with an active day defined as any day when a participant opened the app at least once; (2) the number of rated messages that were sent and rated in a given time interval, and (3) the number of abstinence reports submitted in each time interval. A fourth metric, the number of registered quitting attempts that were not cancelled within the following 24 h after being set, was assessed for the entire intervention period, as it did not make sense to subdivide it. We considered those attempts cancelled within 24 h as people just exploring the app with no real commitment to quit smoking.

Smoking Behaviors
The 7D-PP smoking status was assessed using self-reported abstinence reports by participants in each time interval. The 7D-PP metric was defined as a self-report of smoking no cigarettes (not even a puff) in the last 7 days. Smokers had to report whether they were currently smoking (1) or non-smoking (0), based on having relapsed or being abstinent at the time of the report (regardless of the number of cigarettes smoked). A maximum of one abstinence report could be sent each week. As most time intervals lasted several weeks, several self-reported status reports could be sent during a given time interval; we always considered the last one submitted within each time interval. Thus, this metric provides the 7D-PP for the last self-reported smoking status submitted in each time interval. The 7D-PP was identified as an important metric to assess self-reported smoking cessation outcomes [92].
To compare smoking cessation outcomes with other studies, smoking behavior changes were calculated as the proportion of eligible participants who sent positive and negative self-reported abstinence reports. Three analyses were conducted. First, an analysis of available data averaged the values of the last 7D-PP abstinence reports of the intermediate time intervals: we took the last report within each time interval and then used those values to calculate each participant's average across the study. Second, an analysis of 7D-PP abstinence took the very last abstinence report submitted by each participant during the study. Third, a pessimistic version of the second analysis treated participants who did not respond to the abstinence reports within each time interval as non-abstinent (penalized imputation): we again took the very last abstinence report value, but participants who did not submit an abstinence report were considered relapsers. We used this last analysis as a conservative approach to avoid optimistic comparison effects of the intervention [93].
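The three per-participant summaries can be sketched as follows. This is an illustrative reading of the analyses, not the study's analysis code: reports are assumed to be pairs of (days since the first valid quitting attempt, status), with status 1 for smoking and 0 for non-smoking as in the 7D-PP coding, and all function names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Inclusive day ranges, counted from the first day with a valid quitting attempt.
INTERVALS = [(0, 7), (8, 14), (15, 21), (22, 30), (31, 60), (61, 120), (121, 180)]
Report = Tuple[int, int]  # (days since attempt, status: 1 = smoking, 0 = non-smoking)


def last_per_interval(reports: List[Report]) -> Dict[Tuple[int, int], int]:
    """Keep the last status reported within each time interval."""
    last = {}
    for day, status in sorted(reports):
        for start, end in INTERVALS:
            if start <= day <= end:
                last[(start, end)] = status  # later reports overwrite earlier ones
    return last


def mean_interval_status(reports: List[Report]) -> Optional[float]:
    """First analysis: average of the last reports across intervals (None if no data)."""
    values = list(last_per_interval(reports).values())
    return sum(values) / len(values) if values else None


def very_last_status(reports: List[Report]) -> Optional[int]:
    """Second analysis: the very last report within the 180-day window (None if no data)."""
    in_window = [r for r in sorted(reports) if 0 <= r[0] <= 180]
    return in_window[-1][1] if in_window else None


def penalized_status(reports: List[Report]) -> int:
    """Third analysis: as the second, but participants without reports count as smoking."""
    status = very_last_status(reports)
    return 1 if status is None else status
```

A participant who never submitted a report is excluded from the first two summaries (they return `None`) but counted as smoking in the third, which is what makes the penalized analysis the conservative one.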

Statistical Analysis
To analyze the data, we first performed a descriptive analysis of the sample demographics. Secondly, three kinds of dropout were defined:
• dropout in terms of no longer sending message ratings,
• dropout in terms of no longer sending abstinence reports, and
• dropout in terms of no longer being active on the app.
For each of these, logistic regression was performed to identify potential determinants of dropout, which were subsequently used as covariates in the primary analyses on the effects of the type of HRS. The independent variables used in the dropout analysis were:
• gender,
• nicotine dependence levels (low, medium, high),
• motivation level (low and medium-low, medium-high, and high),
• age (born after 1965 versus born on, or before, 1965),
• employment situation (employed versus unemployed), and
• completion of the extended profile (yes versus no).
To account for the dependencies among observations within each subject, mixed models were used to examine the effects of the app. The model depended on the scale level and distribution of the outcome: an ordinal mixed model for message appreciation (after categorizing this variable into 4 levels); a negative binomial regression mixed model for the engagement metrics (number of active days, number of ratings, number of abstinence reports, and number of quitting attempts); and a logistic mixed model for the smoking cessation metric. For each outcome, a suitable model for the (co)variances of the random effects was chosen to adequately capture the dependencies in the outcome across time. Since, for smoking cessation, convergence was not obtained with any of the examined covariance structures of the random effects, the analysis was done by standard logistic regression.
To compare smoking cessation outcomes with other studies, smoking behavior changes were calculated as the proportion of eligible participants that sent abstinence reports in each time interval. Next, both an analysis on available data, according to logistic regression, and a sensitivity analysis were done, assuming a pessimistic scenario (penalized imputation) in which non-respondents within each time interval were considered to be non-abstinent.
For all metrics, we also performed an in-depth analysis of the impact of having completed the extended profile questionnaire, using the same type of regression model for each metric. In statistical testing, a significance level of α = 0.05 was used.

Description of the Sample and Involvement Level
In total, 844 participants downloaded the app and registered. Among these, 371 had a valid quitting attempt and were eligible for our study; 290 (78.16%) were male and 81 (21.83%) were female with a mean age of 36.90 (standard deviation (SD) 10.21, coefficient of variation (CV) 0.28) years. The mean nicotine dependence-level score was 5.13 (SD 2.59, CV 0.50), and the mean motivation-to-quit score was 7.54 (SD 1.91, CV 0.25).
In total, 181 users were allocated to the KBA group and 190 users were allocated to the HA group; these numbers slightly differed because of differences in making a valid quitting attempt. No statistically significant differences between the two groups were found in age, gender, nicotine dependence, motivation to quit, employment status, or completion of the extended-profile questionnaire. Table 2 shows a more detailed analysis. In total, 843 rated messages and 373 abstinence reports were provided by the 371 users of both systems during the 6-month period after making their first valid quitting attempt.
Notes: KBA, knowledge-based algorithm group; HA, hybrid algorithm group; SD, standard deviation; CV, coefficient of variation; * next to the p-value of the asymptotic chi-square test (left), also the p-value of the Fisher exact test is given (right).
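As a quick check of the reported dispersion figures, the coefficient of variation is the standard deviation divided by the mean. The sketch below recomputes it from the means and SDs given in the sample description; the helper name `coefficient_of_variation` is our own.

```python
def coefficient_of_variation(mean: float, sd: float) -> float:
    """CV = SD / mean, a unit-free measure of relative spread."""
    return sd / mean


# (mean, SD) pairs as reported in the description of the sample.
sample = {
    "age (years)": (36.90, 10.21),
    "nicotine dependence": (5.13, 2.59),
    "motivation to quit": (7.54, 1.91),
}
cvs = {name: coefficient_of_variation(m, s) for name, (m, s) in sample.items()}
for name, cv in cvs.items():
    print(f"{name}: CV = {cv:.2f}")
```

The recomputed values agree with the reported CVs of 0.28, 0.50, and 0.25 after rounding to two decimals.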

Dropout Analysis
The dropout analysis for message ratings showed that there was no interaction effect between the app and time period on the dropout rate (p = 0.394), so a potential interaction between the group and time period on the average ratings, or the number of rated messages, could not be due to differences in dropout rate at the different time intervals. There was also no main effect of the app for this type of dropout (p = 0.375). The variables of gender, nicotine dependence, motivation level and completed extended profile turned out to be predictors of this type of dropout and were subsequently used as covariates in the analysis of the outcome variables of message ratings and number of rated messages.
For the number of active days there was also no interaction effect of HRS group and period on dropout rate (p = 0.910). So, the difference between the HRS groups in terms of dropout rate did not differ across time. Averaged across time, there was also no difference in dropout rate between the two HRS groups (p = 0.583). The variables gender, employment situation, nicotine dependence and completed extended profile were significant predictors of this type of dropout and were subsequently used as covariates in the analysis of the outcome variable number of days active.
For dropout based on smoking cessation reports, no significant HRS group by period interaction (p = 0.682), and also no main effect of HRS group (p = 0.158), was found. So, the difference between the groups, in terms of dropout rate, did not differ across time, and there was also, averaged over the time intervals, no difference in dropout rates between the groups. Only completion of the extended profile was a significant predictor of dropout (p = 0.002).

Overview of Outcomes
All outcome parameters, except the number of quitting attempts, were assessed from four perspectives: their evolution across time, the main effect of the type of app, the potential interaction effect between time and type of app, and the possible interaction effect between type of app and having completed the profile or not. The number of quitting attempts, measured over the whole study period, was only studied regarding the main effect of the type of app and the interaction between the type of app and having completed the profile or not.

Message Appreciation Results
First, we assessed the evolution of the mean message appreciation (range 0-5) by each of the 2 HRS groups across the different time intervals (see Figure 3, where the Y axis is extended beyond the maximum mean level to avoid cutting the top part of some error bars). Next to the means, in each figure the 95% confidence intervals (CI) are also displayed, represented by the two error bars around each of the sample means.
Figure 3. Evolution of the message appreciation (a higher score is a higher appreciation).
First, the ordinal mixed model analysis showed that the difference between the two groups, in terms of appreciation, did not develop differently across time (p = 0.897). Second, there was no statistically significant difference between the two groups in terms of average ratings across time (KBA: Mean = 3.835, SD = 1.218, CV = 0.318 vs. HA: Mean = 4.185, SD = 1.205, CV = 0.288, p = 0.353). Hence, there was no difference between the KBA and HA groups in terms of the level of message appreciation.
Third, concerning potential interaction effects with other factors, the analysis on the role of completing the extended profile showed that the effect of having completed the extended profile did not differ between the KBA and HA groups (p = 0.980). Also, for both groups, completers of the extended profile showed the same level of message appreciation as those who did not complete it (p = 0.976). However, as emerged from the dropout analysis, for both HRS groups completers of the extended profile had a higher probability of staying in the study (p < 0.001). No interaction effects were found for the other factors (gender, age, nicotine dependence level, and motivation to quit).

Number of Rated Messages
First, concerning the evolution of the number of rated messages for each of the HRS groups across the different time intervals (as shown in Figure 4), the results of the negative binomial regression mixed model showed that the number of rated messages did not develop differently for the two HRS groups across time (p = 0.920). We chose a negative binomial regression because the variance of the outcome variable was much larger than its mean, indicating overdispersion. Thus, negative binomial regression was preferred over Poisson regression. Second, we did not find an overall main effect of the type of app when considered across the whole 0-180 days' period. Both groups had a similar number of rated messages (KBA: Mean = 3.59, SD = 5.877, CV = 1.637 vs. HA: Mean = 3.31, SD = 5.677, CV = 1.715; p = 0.884). Third, concerning interaction terms, the analysis of the impact of completing the extended profile questionnaire showed that its effect on the number of rated messages did not differ between the KBA and HA groups (p = 0.894) and also, when averaged across both groups, had no significant effect on the number of rated messages (p = 0.156). However, there was an effect on dropout: completers of the extended profile had a significantly lower probability of dropping out of the study (p < 0.001).
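The overdispersion argument can be checked roughly from the reported summary statistics. Under a Poisson model the variance equals the mean, so a variance-to-mean ratio far above 1 points to overdispersion. The sketch below uses the reported group means and SDs; note that it looks at the marginal distribution only, whereas the model assumption concerns the variance conditional on covariates, so it is a plausibility check rather than a formal test.

```python
def dispersion_ratio(mean: float, sd: float) -> float:
    """Variance-to-mean ratio: about 1 under Poisson, larger if overdispersed."""
    return sd ** 2 / mean


# Reported means and SDs of the number of rated messages per group.
kba_ratio = dispersion_ratio(mean=3.59, sd=5.877)
ha_ratio = dispersion_ratio(mean=3.31, sd=5.677)
print(f"KBA: {kba_ratio:.1f}, HA: {ha_ratio:.1f}")
```

In both groups the variance is more than nine times the mean, which is consistent with the choice of a negative binomial model.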

Number of Active Days
First, the mean number of active days on the mobile app developed differently across the time intervals for the HA and KBA groups (see Figure 5), although this interaction was only marginally significant (p = 0.051).
We also found that the covariate employment status of the participants had a marginally significant interaction with the HRS group (p = 0.070). Consequently, the examination of the differences between the two HRS groups at each time interval was done separately for employed and unemployed participants. To control the type I error rate, for each of these groups, Holm correction was applied when testing the app difference at each of the 7 time intervals. Unemployed participants showed no difference between the two app groups for any time interval in terms of active days. However, employed participants had a significant difference in active days between the two app groups for the last two time intervals of 61-120 days (p = 0.007), and 121-180 days (p = 0.008), where the HA led to fewer active days than the KBA.
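The Holm correction mentioned above can be sketched as follows. This is a generic implementation of the Holm step-down procedure, not the authors' code, and the raw p-values in the example are hypothetical rather than the study's values.

```python
from typing import List, Sequence


def holm_adjust(p_values: Sequence[float]) -> List[float]:
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is multiplied by m - k + 1.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical raw p-values for 7 time intervals (not the study's values).
raw = [0.40, 0.25, 0.60, 0.30, 0.20, 0.004, 0.006]
adjusted = holm_adjust(raw)
significant = [p < 0.05 for p in adjusted]
print(significant)
```

Compared with a Bonferroni correction, which multiplies every p-value by 7, the Holm procedure is never less powerful while still controlling the family-wise type I error rate.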
Examining the main effect of type of app over the 0-180 days' time period also showed that the KBA led to a higher number of active days. There was no interaction effect of HRS group and completion of the extended profile on the number of active days (p = 0.815). The difference between the app groups in the number of active days was the same for the completers and non-completers of the extended profile. There was a marginally significant effect of having completed the profile on the number of active days (p = 0.068), in that completers were, on average, active for more days. Also, completers of the extended profile had a significantly lower probability of dropping out of the study (p < 0.001).

Number of Quitting Attempts
An analysis, by the negative binomial regression model, of the number of quitting attempts over the whole study period (0-180 days) showed that there was no significant main effect of the type of app. Testing the interaction between HRS group and having completed the extended profile on the number of quitting attempts showed a significant interaction effect (p = 0.042): the effect of having completed the extended profile on the number of quitting attempts was significantly larger in the HA group. In the HA group there was a highly significant effect of having completed the extended profile on the number of quitting attempts (p < 0.001), with the incidence rate ratio of completers versus non-completers being 1.869. In the KBA group there was also a significant effect of having completed the extended profile on the number of quitting attempts (p = 0.020), although the incidence rate ratio of completers versus non-completers was lower, at 1.342.
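To make the reported incidence rate ratios (IRRs) easier to read, the sketch below converts them into percentage increases in the expected number of quitting attempts. It assumes both IRRs come from a count model with a log link, as is standard for negative binomial regression; the ratio of the two IRRs is our own illustrative quantity and is not reported in the study.

```python
import math

irr_ha = 1.869   # completers vs. non-completers, HA group (reported)
irr_kba = 1.342  # completers vs. non-completers, KBA group (reported)

# An IRR of r multiplies the expected count by r, i.e. (r - 1) * 100 % more.
pct_more_ha = (irr_ha - 1) * 100
pct_more_kba = (irr_kba - 1) * 100

# On the log scale the effects are coefficients log(IRR); their difference
# is an interaction contrast, and its exponent is the ratio of the two IRRs.
ratio_of_irrs = math.exp(math.log(irr_ha) - math.log(irr_kba))

print(f"HA completers: {pct_more_ha:.1f}% more attempts")
print(f"KBA completers: {pct_more_kba:.1f}% more attempts")
print(f"ratio of IRRs: {ratio_of_irrs:.2f}")
```

Read this way, completers made roughly 87% more quitting attempts than non-completers in the HA group, against roughly 34% more in the KBA group.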

Number of Abstinence Reports
The evolution of the number of sent abstinence reports for each of the HRS groups across the different time intervals is shown in Figure 6.
The difference between the two HRS groups in terms of the number of smoking cessation reports did not develop differently across time (p = 0.339). However, when testing the difference between the two HRS groups over the 0-180 days' period, the average amount of cessation reports (KBA: Mean = 1.11, SD = 1.806, CV = 1.627; HA: Mean = 0.7, SD = 1.118, CV = 1.597) was significantly higher for the KBA group (p = 0.001). Note that this cannot be explained by the dropout rate, as there were no significant differences in dropout rate between the two app groups.
There was no significant effect of having completed the extended profile (p = 0.870), implying that completers of the extended profile had the same incidence rate of the number of cessation reports as the non-completers (and that was the case for both app groups). However, completers of the extended profile in both groups had a higher probability to stay in the study (p = 0.002).

Figure 6. Evolution of the number of sent abstinence reports (higher is better).

7D-PP Abstinence: Analysis on Available Data
The evolution of the 7D-PP abstinence for each of the HRS groups across the different time intervals is shown in Figure 7 (the mean was calculated by averaging the last abstinence report of each participant in each group in each period, coded as 0 for being abstinent and 1 for relapse).
The analysis based on the standard logistic regression model revealed no statistically significant interaction between group and time period (p = 0.737). Also, the two HRS groups did not differ significantly (p = 0.123) on 7D-PP, averaged across 0-180 days, with abstinence rates of 48.8% in KBA vs. 37.9% in HA.
An overall analysis, using logistic regression only, on 7-day point prevalence between 0 and 180 days for the effect of the type of HRS revealed that belonging to the HA group led to a lower 7D-PP of abstinence (OR = 0.364; KBA: 54% vs. HA: 30.2%, p = 0.023). The difference of this analysis with the previous analysis of 7D-PP averaged across 0-180 days is that, in this analysis, only each participant's last available abstinence report over the whole time span of 0-180 days was considered, and not for each time interval. In the previous analysis, all participants' intermediate last available abstinence reports were averaged in determining whether there was a difference between the HRS groups.
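The reported odds ratio can be related to the two abstinence percentages with a short calculation. The sketch below computes a crude (unadjusted) odds ratio from the reported rates of 54% and 30.2%; the helper names are our own, and the small gap to the reported OR of 0.364 is presumably due to covariate adjustment and rounding of the percentages, which is an assumption on our part.

```python
def odds(p: float) -> float:
    """Convert a probability into odds."""
    return p / (1.0 - p)


def odds_ratio(p_group: float, p_reference: float) -> float:
    """Odds of the outcome in one group relative to a reference group."""
    return odds(p_group) / odds(p_reference)


# Reported 7D-PP abstinence based on each participant's last available report.
crude_or = odds_ratio(p_group=0.302, p_reference=0.54)  # HA relative to KBA
print(f"crude OR (HA vs. KBA): {crude_or:.3f}")
```

An odds ratio below 1 means that the odds of reporting abstinence were lower in the HA group than in the KBA group.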

The in-depth analysis of the impact of completing the extended profile for the two groups showed a marginally significant interaction between app group and completing the extended profile (p = 0.065). In the HA group, the completers had a lower probability of being 7 days abstinent (OR = 0.566), whereas in the KBA group the completers had a higher probability of being 7 days abstinent (OR = 1.872), but in neither group was the difference significant (p = 0.209 and p = 0.175 respectively).
Also, having completed the extended profile was a significant predictor of dropout (p = 0.002), such that, in both groups, those who completed the extended profile had a lower probability of dropping out.

7D-PP Abstinence: Sensitivity Analysis under a Pessimistic Scenario
The standard logistic regression analysis showed no interaction between group and time period (p = 0.607). Also, no main effect of HRS group on 7-day point prevalence was found (KBA: 3.4% vs. HA: 3.3%, p = 0.847). So, there was no significant difference between the two groups in terms of 7-day point prevalence under a pessimistic scenario. Also, the logistic regression analysis for the 0-180 days' time period, which considered only the last available cessation report, showed no significant effect on the probability of 7-day cessation (p = 0.392).
The results of the main effects for each measure comparing both systems are summarized in Tables 3-5.

Main Findings
The aim of this study was to compare two different health recommender systems for smoking cessation. The first system used a knowledge-based (KBA) algorithm, whereas the second used a hybrid algorithm (HA), employing KBA and demographic filtering. Effects were studied concerning participants' message appreciation, engagement with the system, and self-reported smoking cessation outcomes.

Message Appreciation
Message appreciation was almost always higher across the intermediate time periods for the HA group, as can be seen in the message appreciation evolution chart (see Figure 3). However, this difference was not significant, either for the intermediate time periods or when averaged across the whole 6-month period. This result suggests that, in our study, although the demographic filtering step in the HA may have delivered slightly more relevant recommendations, the difference from a random selection among the messages filtered by the knowledge-based step did not reach statistical significance. This lack of statistical significance contradicts the potential advantage of collaborative filtering that several previous studies showed [94,95], where messages had higher relevance after going through that step, leading to more and better ratings, which would reflect higher message appreciation [96,97]. In our case, the lack of statistical significance could be related to the cold-start problem [98]. During the study design phase, we had expected to recruit more participants, and that those participants would provide more ratings, which would have quickly reduced the impact of the cold-start problem. Another explanation for the lack of statistical significance is the limited sample size: 843 ratings provided by 57 participants, of which 30 were in the HA group. A larger sample would yield more power to test whether the apparent trend of higher appreciation and number of rated messages in the HA is, indeed, due to the demographic filtering step [99,100].

Engagement
Concerning engagement, we found mixed results. For the number of rated messages, similar to message appreciation, we had a higher number of rated messages for all the time intervals in the HA group, but this was not statistically significant. The same reasons given for message appreciation (cold start and sample size) can explain this finding. Lack of power, due to small sample size, may especially be an issue here, as, for the engagement metrics, the coefficient of variation in the large majority of cases is larger than 1, reflecting large variability of the metrics in the sample. In addition, as we did not find statistically significant differences in message appreciation, which was the metric related to some predecessors of behavioral intentions of use [101], it is consistent that no statistical difference in the number of rated messages was found either.
Yet, we found differences in other types of engagement metrics: number of active days and number of abstinence reports, both being higher in the KBA group. These results were unexpected because we had expected, at least, the same level of engagement between KBA and HA groups, since the HA added an extra filtering layer, aiming to provide motivational messages tailored to the preference of the participant rather than at random as in the KBA group. Therefore, in the worst-case scenario, where the demographic filtering step was not working, engagement was expected to be similar but not worse, and more in-depth studies may be needed to further understand this finding. One explanation could be that the factor which made participants engage with the mobile app in terms of active days, and submit abstinence reports, was not related to the algorithms, but to other variables that we did not consider in our study and which were more favorable for the KBA group. For example, perceived trust in the app could be one of these factors, as it played a relevant role in other health apps before [102,103]. This explanation is also in line with the study by Dovaliene et al. (2016) [104], who showed that user satisfaction was not relevant for mobile app engagement. User satisfaction is what we were aiming to maximize through higher message appreciation, as the more a participant appreciates messages, the more he or she may be satisfied with the app. However, as the mobile app interface was the same for both groups, the perceived trust, or any other non-considered variable, would have had to be more prevalent in the messages sent by the KBA. In the case of trust in the recommendations, there are approaches, such as the application of deep learning, that could be applied to enhance trust [105].
Concerning interaction effects, we found that employed participants in the KBA group had a higher number of active days in the last two time periods. A potential explanation for this result could be found in other participant variables. Although we did not measure income and educational levels, they could explain a higher usage of a cessation app in terms of socioeconomic status. Socioeconomic status is a known predictor for smoking prevalence and knowledge of smoking effects [106,107] (higher educational level relates to higher knowledge), and could also be linked to mobile app engagement, as has been shown for other health-related app studies [108][109][110].
Another interaction effect revealed that completers of the extended profile in the HA group made more quitting attempts. This suggested that, when the HA was provided with more information to personalize the recommendations, it may have influenced determination to make a quitting attempt after a relapse. The evidence of the rationale for this effect would be more solid if we also found similar effects on message appreciation for the completion of the extended profile in the HA group, because participants pleased with received messages would be willing to re-engage with the app after a relapse. However, we previously identified the non-significant difference of having completed the extended profile in the appreciation metric. Therefore, although the effect on quitting attempts was significant, it will need to be further explored in future studies to reinforce its evidence.
When testing different approaches to calculate similarity, varying the elements used to compute the similarity also varied the actual outcomes [111,112]. In our case, during the algorithm design phase, we did not know what the optimal number and type of variables to use were, as each dataset would have a different optimal point [113]. It was unclear which are the most important ones and how to deal with dissimilarities in information provided. Ultimately, to calculate the similarity we decided to use all the available information of the 60 questions defining the user profile. However, future studies should carefully reassess what variables may better define user similarities. A lower number of variables may improve user experience because they will complete shorter, but more meaningful, questionnaires. Also, we did not know how the number of user variables for computing similarity impacted on other metrics, such as smoking abstinence. Increasing the number of questions might begin reducing effective similarity among neighbors. This can happen if variables do not really impact smoking and diminish the effect of those which do. Pu et al. (2012) [27] recommended minimizing preference elicitation in profile initialization. Yet, as extended profiles may improve smoking cessation attempts, further experimental research is needed to assess which core questions are needed. We recommend future researchers willing to use demographic filtering in digital health interventions explore approaches that maximize the number of participants completing user profiles. The trade-off between a mandatory questionnaire and total freedom to complete it should be considered to avoid participants having a poor user experience, but also retain the potential engagement benefits found in this study. 
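The concern that extra profile variables may dilute similarity can be illustrated with a small sketch. The similarity measure used here (the share of jointly answered questions with identical answers) is a simple stand-in chosen for illustration; it is not claimed to be the measure used in the study, and the question names and answers are hypothetical.

```python
from typing import Dict, Optional

Profile = Dict[str, Optional[str]]


def simple_matching_similarity(a: Profile, b: Profile) -> float:
    """Share of questions answered by both users on which the answers agree."""
    shared = [q for q in a if q in b and a[q] is not None and b[q] is not None]
    if not shared:
        return 0.0
    return sum(a[q] == b[q] for q in shared) / len(shared)


# Hypothetical smoking-related answers on which two users fully agree.
core_a = {"cigs_per_day": "10-20", "first_cig": "<30min", "prior_attempts": "yes"}
core_b = {"cigs_per_day": "10-20", "first_cig": "<30min", "prior_attempts": "yes"}
# Hypothetical extra questions with little bearing on smoking behavior.
extra_a = {"favorite_color": "red", "pet": "dog", "drinks_coffee": "yes"}
extra_b = {"favorite_color": "blue", "pet": "cat", "drinks_coffee": "yes"}

sim_core = simple_matching_similarity(core_a, core_b)
sim_all = simple_matching_similarity({**core_a, **extra_a}, {**core_b, **extra_b})
print(f"core questions only: {sim_core:.2f}, all questions: {sim_all:.2f}")
```

In this toy case, two users who are identical on the smoking-related questions look less similar once unrelated questions are included, which is the dilution effect discussed above.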
Similarly, regarding demographic user profile variables, users with insufficient ratings may not be getting good enough recommendations, as they were also part of the similarity calculation in our case. This problem could be mitigated in the future by using implicit ratings, as proposed in the study by Ahmaidan et al. [114].

Smoking Cessation
We found that the HA algorithm was unable to provide significantly better smoking cessation outcomes than the KBA in the 7D-PP analysis averaging the intermediate time period abstinence report results. The abstinence measurements averaged across time intervals can be considered an approximation of continuous abstinence, which was a relevant measure included in the Russell 2.0 standard for smoking cessation outcomes [93]. However, the lack of abstinence reports at specific time points (e.g., 6 months) reduces the reliability of this approximation, as we may be considering the continuous prevalence of participants based on only one report made in the first month, for instance.
The analysis for 7D-PP considering the last abstinence report of each participant as the value for the whole study showed that the KBA algorithm performed better. This 7D-PP can detect delayed quitting, and was also included in the Russell 2.0 standard as a relevant smoking outcome measure [92]. However, as our trial was done under real-world conditions, where we could not consistently collect abstinence reports at specific time points, it was the decision of the participant as to whether to voluntarily submit an abstinence report or not. This analysis implicitly assumes that the participants remained within their last reported abstinence status until the end of the study. This assumption may have introduced bias, as we cannot know the actual abstinence status of participants at the end of the study if they did not report it.
To further examine the sensitivity of these results, we also considered a pessimistic scenario analysis in which participants who dropped out were considered as smokers (penalized imputation). In this case, we found no differences between groups. This result differs from the corresponding analysis on available data because in both groups there were participants who never sent an abstinence report and were assumed to be relapses in this analysis, hence reducing the differences between groups. Some authors showed that penalized imputation does not necessarily lead to less biased or more conservative effect estimates than complete case analysis [115], and others suggested that it may be overly critical [116]. Considering all the previous measures, it seems that the overall impact on smoking cessation outcomes for the participants was not significantly different between groups.
In addition, it is conceivable that certain messages could result in effects different to those intended, and thus lower the efficacy of the HA system. For example, for a participant called John, one of the messages could read "Hi John. Did you know that as you are no longer a smoker, you are no longer part of a chain that favors child exploitation? Yes, you've read right. In countries like Pakistan, USA, and Indonesia child labor is used in the tobacco industry! Children and teenagers are exposed to toxic substances and hard work conditions. Stay smoke-free for their sake, and for yours!". This message, which was intended to provide knowledge, may not be positively appreciated because, even phrased in positive terms, thinking about having contributed to child labor is unpleasant. However, the message content might have a deep impact on the participants, leading to higher abstinence motivation that they do not perceive themselves. Therefore, if users did not like the message, the HA would be unlikely to send it again, in favor of other better rated messages, whilst the random selection done by the KBA algorithm would give the same probability of sending the message again to a new user. Unfortunately, in this study we could not trace back, after the study, which specific message was rated with which score. Consequently, we cannot validate whether this type of situation was the reason for the unexpected engagement results.
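The mechanism described above, where a poorly rated but potentially effective message is suppressed by the HA and still sent by the KBA, can be sketched as follows. This is a simplified illustration under our own assumptions: the function names, message identifiers, and ratings are hypothetical, and the real systems apply a knowledge-based filter before this step.

```python
import random
from statistics import mean
from typing import Dict, List, Sequence


def kba_select(candidates: Sequence[str], rng: random.Random) -> str:
    """Knowledge-based step only: uniform random pick among filtered messages."""
    return rng.choice(list(candidates))


def ha_select(candidates: Sequence[str],
              neighbor_ratings: Dict[str, List[int]],
              rng: random.Random) -> str:
    """Hybrid step: prefer the candidate best rated by similar users."""
    rated = [m for m in candidates if neighbor_ratings.get(m)]
    if not rated:
        # Cold start: no ratings yet, so fall back to the random choice.
        return kba_select(candidates, rng)
    return max(rated, key=lambda m: mean(neighbor_ratings[m]))


# Hypothetical messages that already passed the knowledge-based filter.
candidates = ["child_labor_fact", "health_benefit", "money_saved"]
# Hypothetical ratings (0-5) given by similar users.
ratings = {"child_labor_fact": [1, 2], "health_benefit": [4, 5], "money_saved": [4]}

rng = random.Random(0)
ha_choice = ha_select(candidates, ratings, rng)
kba_choice = kba_select(candidates, rng)
print(ha_choice, kba_choice)
```

In this sketch the disliked message is never chosen by the hybrid step once better rated alternatives exist, whereas the random step keeps sending it with the same probability as any other candidate.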
Although the findings of the 7D-PP abstinence, considering only the last abstinence report available, contradict our initial assumptions, the pessimistic scenario and 7D-PP abstinence, considering the averaged abstinence reports, were in line with a study by Westmaas et al. (2018) [117], who compared two tailoring levels in e-mails to support smoking cessation, and found no differences between the basic and advanced versions. This may mean that having a more complex system for tailoring smoking cessation support does not necessarily provide better abstinence results.
To compare our abstinence results to other similar previous studies, a recent review [118] examined SmartQuit [119] and SmokeFree28 [120], which managed to achieve abstinence rates of 13% at 2-month and 21% at 28-day follow-ups, respectively. Mobile-based interventions for smoking cessation, such as Clickotine [121], reported a 7-day abstinence rate of 45.2% and a 30-day abstinence rate of 26.2% in an 8-week study following an intention-to-treat analysis. In more traditional computer-tailored interventions that had follow-up assessments at 6 months, as in our study, abstinence rates were up to 18.3% (10.2% intention to treat) [122] and 20.4% (8.5% following intention to treat) [123]. These results are similar to the results of our study (54.1% in the KBA group and 30.2% in the HA group, following analysis on available data for 7-day point prevalence abstinence, considering the last available abstinence report for each intermediate time period, and 8.4% in the KBA group and 11% in the HA group in the pessimistic scenario).

Additional Considerations
Across all metrics, we identified that the completion of the extended profile was associated with fewer dropouts. As this was consistent for both the HA and KBA groups (in which completion had no effect on the algorithm), we propose that participants who filled out the extended questionnaire may have personalities that make them more curious about, or more willing to interact with, the system. This was consistent with the work of Karumur et al. (2018), who explored how user personality can influence user engagement and activity in recommender systems, based on a collaborative filtering approach [124].
The external validity of mobile apps for health behavior change has been previously criticized and it is currently being studied in other domains, such as physical activity [125]. We conducted this trial under real-world conditions, where we did not reward participants with money, nor stimulate them to complete the abstinence reports; additionally, researchers did not recruit or follow up participants in any way. Hence, the present results of real-world effectiveness, in terms of the effects of such an HRS, increase the external validity of our study [126], compared to efficacy results in previously mentioned studies. Thus, the results of this trial have to be understood within this real-world context and might not be directly comparable to other studies, in which recruitment and follow-up methods may have positively impacted the commitment of participants.
Regarding the recommender system algorithms used in this study, we could not compare their theoretical performances with others found in the literature before running the study because of their differences in design, which make them incompatible with existing datasets, none of which focused on health behavior. For instance, in other contexts (e.g., movie recommendations) recommender systems could use existing datasets as benchmarks for performance assessment, such as MovieLens [127]. Hence, commonly used metrics for recommendation systems, such as the root mean square error (RMSE), mean absolute error (MAE), normalized discounted cumulative gain (NDCG), and precision could not be applied to assess the algorithms before the study, as we lacked an initially rated, compatible dataset.
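For reference, the offline metrics named above can be computed as in the following sketch, which uses their standard textbook definitions. The example ratings are hypothetical; no such rated dataset was available for this study.

```python
import math
from typing import Sequence


def rmse(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Root mean square error between predicted and observed ratings."""
    return math.sqrt(
        sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)
    )


def mae(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Mean absolute error between predicted and observed ratings."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def dcg(relevances: Sequence[float]) -> float:
    """Discounted cumulative gain of a ranked list (position 1 first)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))


def ndcg(relevances: Sequence[float]) -> float:
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


# Hypothetical predicted vs. observed star ratings for two messages.
print(rmse([3, 5], [4, 3]), mae([3, 5], [4, 3]))
# Hypothetical relevance of three messages in the order they were ranked.
print(ndcg([1, 2, 3]))
```

All three metrics require observed ratings for the recommended items, which is why they presuppose an already rated, compatible dataset.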
This study focused on metrics which assessed the impacts of the recommendations on participants' behavior, as Sahoo et al. (2019) [128] proposed for HRSs, not on the technical performance of the recommender system itself. Schäfer et al. (2017) [25] also concluded that HRSs needed multidimensional user satisfaction measures, which covered message appreciation and engagement metrics. Further, message appreciation can be seen as a proxy for Mean Average Precision at 1. This sets our study in line with previous experimental studies for HRS performance analysis based on hits and total numbers of recommendations or users. This was the case in a study by Rivero-Rodriguez et al. (2013) [129], which followed a similar approach to assess their HRS, using the hit rate (no. of hits/no. of users) as a performance metric. Another example can be found in a study by Bocanegra et al. (2017) [130], in which they used precision to assess their recommender system. Despite applying similar approaches, the direct comparison of performance metrics and reflection on the conclusions generated in other studies are still limited in our case because the study designs and research questions differ substantially. A recent scoping review [28] backs the relative scarcity of studies applying HRSs in the health domain and their diversity in therapeutic areas and reported outcomes.
The most similar study we found was the SoLoMo study, which presented results of another hybrid recommender system for smoking cessation [131]. That algorithm was tested in a clinical context where patients were followed up by healthcare professionals for one year [63]. Patients referred to the smoking cessation unit from other specialized care units of the hospital were invited to the study. The precision of that system (which was directly related to appreciation, as previously mentioned) achieved a high score (which would mean high appreciation), with a minimum value of 0.96 over a total maximum of one. There were key differences with our present study, which explain this difference.
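A minimal sketch of these hit-based measures is given below. The hit rate follows the definition quoted above (hits divided by users). Treating a rating of four or more stars as a "hit" is an assumption made here only for illustration and was not defined in the study.

```python
from typing import Sequence

HIT_THRESHOLD = 4  # assumed: a five-star rating of 4 or 5 counts as a hit


def hit_rate(hits: int, users: int) -> float:
    """Hit rate as number of hits divided by number of users."""
    return hits / users if users else 0.0


def precision_at_1(ratings: Sequence[int], threshold: int = HIT_THRESHOLD) -> float:
    """Share of single-message recommendations whose rating counts as a hit."""
    if not ratings:
        return 0.0
    return sum(1 for r in ratings if r >= threshold) / len(ratings)


# Hypothetical ratings, one per recommended message.
print(precision_at_1([5, 4, 2, 3]))
print(hit_rate(3, 4))
```

Since each recommendation consists of a single message, precision at 1 reduces to the share of messages that users appreciated, which is why message appreciation can serve as its proxy.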
First, the end of their messages included the name of the patient's doctor, and this may have increased the perceived quality of the messages, leading to high precision (and we can therefore assume, also appreciation). Second, patients may have thought that their doctors were going to check their ratings and wanted to please them with higher scores. Third, the potential rating options were only positive, negative, or neutral, as opposed to the five-star scale included in our study, which provided greater rating granularity. Fourth, the participants enrolled in SoLoMo were referred from a specialized care unit, which eventually contributed to enhancing their motivation and appreciation of the system, potentially driven by a positive framing bias effect. Despite the apparently better results in the SoLoMo study [63], a smaller range in the rating values (only three options compared to the five we had) limits the opportunity of HRSs to learn from users' opinions in digital interventions for smoking cessation.
Further, in a recent scoping review [28], several gaps in the reviewed HRS studies were found. Our study covered many of them, including: (1) reporting the results of our study based on a large user cohort size (n = 371), compared to previous ones found in the literature; (2) using an HRS which was grounded in a behavioral change theory (the I-Change model [29]), recommending messages with behavioral change techniques (following the guide by Abraham et al. (2011) [84]); (3) using advanced profile adaptation; and (4) having a clear explicit feedback system (in the HA group). Table 6 presents the final classification of this health recommender system following the proposed taxonomy by Hors-Fraile et al. [28].

Limitations
Despite the strengths of this study, such as comparing two different HRSs across subgroups defined by participants' age, gender, nicotine dependence, employment status, motivation to quit, and completion of the extended profile questionnaire, our study was subject to some limitations.

1.
The HRSs considered all users' feedback for computing recommendations. This implies that the feedback provided by one group affected the generation of recommendations for the other. This design decision was taken to reduce the cold start effect.

2.
Between 22 May and 6 June 2018, and between 1 and 6 August 2018, server service interruptions prevented users from registering in the app and receiving messages.

3.
We could not verify the smoking status self-reports. Self-reports may provide a valid estimation of cessation rates and have been used in several previous studies [132]. The Society for Research on Nicotine and Tobacco Subcommittee on Biochemical Verification considered the use of biochemical validation unnecessary in studies with limited face-to-face contact [133]. However, the use of bogus-pipeline procedures [134], or some biochemical verification methods, would have improved the validity of the smoking status reports [135]. In addition, the pessimistic scenario analysis we conducted, intending to follow a conservative approach, may not have accurately reflected the actual behavior of the participants. Finally, this study may have suffered from errors derived from some users' difficulties in accurately recalling their behaviors, as Ramo et al. previously reflected on for an anonymous survey about smoking behaviors [136].

4.
We considered the last status for the abstinence report as the value for each time interval. This way of measuring smoking cessation results hampered direct comparisons with previous studies.

5.
In this effectiveness study, smokers could report at self-chosen times, so we could not assess data for all participants at one specific time (e.g., smoking cessation status after 1 month).

6.
It is conceivable that specific subgroup effects could have occurred in our analyses, requiring more sophisticated models with more two-way (or even three-way) interactions to explain our results. However, due to the sparsity of the collected data, these more sophisticated models could not be applied.

7.
User-experience metrics, such as perceived quality and satisfaction, which are commonly evaluated nowadays in the field of recommender systems [27,137], were not included in this study.

8.
Persuasion profile meta-features to determine what recommendation style (e.g., authority shown in the message, the reflected consensus stated in the message, the message sender liking perception, etc.) [138,139] would persuade participants the most were not considered. Such types of meta-features could have added extra personalization power to the HRS without needing the participants to complete additional questions in their user profile.

Recommendations
Based on our results, we recommend that future studies keep exploring the usage of different types of HRS to support smoking abstinence, as the results of both algorithms exceeded the unassisted six-month cessation success rate, which is around 3-5% [140,141], and were above some nicotine-replacement therapy rates, whose six-month abstinence rate is around 7% [142,143]. More research is needed to explore relationships between message appreciation, engagement, and health outcomes. When using HRS with collaborative filtering, new means to determine recommendations' relevance, other than participants' appreciation, should be considered. For instance, the achieved health outcomes could be included as a complementary rating (e.g., asking how useful previously sent messages were when rating a message about abstinence status). In this way, messages that are well appreciated but do not contribute towards supporting abstinence would not be recommended in the future, and messages which both contribute to supporting abstinence and are highly appreciated would be prioritized by the HRS.
As we did not find differences between HRSs for the gender, age, nicotine dependence level, and motivation to quit subgroups, we encourage future research to consider other variables, such as trust and socioeconomic status, which may help to better understand smokers' behaviors. Still, gender, age, nicotine dependence level, and motivation to quit could be good meta-features for HRS similarity computation and we suggest keeping them as part of the HRS and message design processes.
In addition, future research should consider larger sample sizes with more than six months of follow-up time, as results between the KBA and HA, in terms of message appreciation and number of rated messages, suggest that these differences could become significant. Alternatively, this may be compensated for by increasing the frequency of the sent messages, as the HRS would have more information to process new recommendations faster, and/or have larger sample sizes to be better able to detect such differences. To avoid overloading participants with too many messages, which may be bothersome and negatively impact their user experience, we consider that a suitable solution could be to offer 'on-demand' messages. A larger sample would facilitate applying more complex statistical models, which may help us explain some results about which we could only guess in this study. Also, we suggest pro-actively persuading users to complete their profiles if they do not do so voluntarily during the enrollment phase, to maximize the impact of collaborative filtering, increasing participants' likelihood of making a new quitting attempt in case of relapse. This could be achieved, for instance, by making the digital solution ask the participants to complete one or two unanswered questions of their user profile every day. It would yield a low entry barrier to start using the solution, whilst it would allow computing recommendations with increasing user profile information over time.
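One possible way to combine appreciation with health outcomes is sketched below. This is only an illustration of the recommendation above and was not evaluated in this study: the function name, the linear weighting, and the use of the share of abstinent recipients as the outcome signal are all assumptions.

```python
def combined_score(stars: float, abstinence_fraction: float,
                   weight: float = 0.5) -> float:
    """Blend message appreciation with an outcome signal into one score in [0, 1].

    stars: mean rating on a five-star scale (1 to 5).
    abstinence_fraction: assumed share of recipients who later reported abstinence.
    weight: assumed relative importance of appreciation versus outcome.
    """
    appreciation = (stars - 1) / 4  # rescale 1..5 stars to 0..1
    return weight * appreciation + (1 - weight) * abstinence_fraction


# Hypothetical messages: one is loved but ineffective, one is the opposite.
liked_only = combined_score(stars=5, abstinence_fraction=0.0)
effective = combined_score(stars=3, abstinence_fraction=1.0)
print(liked_only, effective)
```

With equal weights, the moderately rated but effective message outranks the highly rated but ineffective one, which is the prioritization the recommendation aims for.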
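The gradual profile completion suggested above could be sketched as follows. The profile fields and the function name are hypothetical; the sketch simply assumes that unanswered questions are stored as `None` and that at most two are asked per day.

```python
from typing import Dict, List, Optional


def next_profile_questions(profile: Dict[str, Optional[str]],
                           per_day: int = 2) -> List[str]:
    """Return up to `per_day` profile questions that are still unanswered."""
    unanswered = [question for question, answer in profile.items()
                  if answer is None]
    return unanswered[:per_day]


# Hypothetical profile after enrollment: only age has been answered.
profile = {"age": "30", "gender": None, "employment": None, "motivation": None}
print(next_profile_questions(profile))
```

Calling the function once per day and storing the answers would fill the profile within a few days while keeping the initial enrollment short.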

Conclusions
The first goal of this study was to compare the two presented HRSs. We found the KBA-led participants had more active days on the mobile app, completed more abstinence reports, and had better 7D-PP abstinence results, although the abstinence differences were non-significant when averaged across time and in the pessimistic scenario. The additional step of demographic filtering in the HA only improved the number of quitting attempts. However, the HA group seemed to rate a higher number of messages and gave better ratings to the messages, based on trends shown on the evolution line charts; yet neither measure differed statistically between groups. The second goal was to identify potential subgroup differences, and we found that participants who completed their extended profiles were more likely to stay in the study and, among the employed, those who were in the KBA group had higher engagement, in terms of active days, than those in the HA group. No other differences were found for any other subgroups (gender, age, nicotine dependence level, motivation to quit).
Our findings provide insights into the usage of health recommender systems in a real-world setting. We conclude that collective intelligence provided mixed results, some of them unexpected, and more research is needed to fully take advantage of it in the context of smoking cessation support. However, this study showed a promising future for health recommender systems. Combining behavioral change techniques and models in recommender systems, even with simple algorithms, can lead to higher smoking cessation rates than unsupported quitting and than some nicotine replacement therapies.