Predicting Loneliness through Digital Footprints on Google and YouTube

Loneliness is an increasingly prevalent condition with many adverse effects on health and quality of life. Accordingly, there is a growing interest in developing automated or low-cost methods for triaging and supporting individuals encountering psychosocial distress. This study marks an early attempt at building predictive models to detect loneliness automatically using the digital traces of individuals’ online behavior (Google search and YouTube consumption). Based on a longitudinal study with 92 adult participants for eight weeks in 2021, we find that users’ online behavior can help create automated classification tools for loneliness with high accuracy. Furthermore, we observed behavioral differences in digital traces across platforms. The “not lonely” participants had higher aggregated YouTube activity and lower aggregated Google search activity than “lonely” participants. Our results indicate the need for a further platform-aware exploration of technology use for studies interested in developing automated assessment tools for psychological well-being.


Introduction
Loneliness, defined as the "discrepancy between a person's desired and actual social relationships", has been identified as the next critical public health issue [1,2]. Indeed, a recent study reported that 61% of young adults in the United States actively feel lonely [3]. Moreover, influential figures like the United States Surgeon General, Vivek Murthy [4], have also called loneliness an "epidemic". Research provides validity to such claims by showing that loneliness directly affects public health, causing an increased risk of mortality [5], cancer [6], high blood pressure [7], anxiety [8], and depression [9].
Recently, research has called for the analysis of digital traces (e.g., Google search history and YouTube consumption logs) to shed light on factors related to "health and wellbeing" such as loneliness [10,11]. This development is unsurprising, given the amount of time individuals spend online. According to recent findings, more than 90% of Americans are online, and nearly 46% can "no longer imagine everyday life without the Internet" [12]. This development is also reasonable, given that both the theoretical and empirical literature suggest a relationship between technology use and well-being.
Theories within media, communication, psychology, and other fields posit that technology use and well-being are related. For instance, Uses and Gratifications Theory (UGT) proposes that people actively choose the types of media they engage with to satisfy their needs [13,14]. According to UGT, psychological factors motivate individuals to use media [14]. Previous research applied UGT to examine the influence of various constructs like depression as motivators of technology use. For instance, Pittman et al. [15] highlight the role of social media in gratifying users' social, intimacy, and affection needs. On the other hand, Elhai et al. [14] show the role of smartphones in alleviating anxiety.
Empirical evidence also indicates a relationship between psychological well-being and online behavior. For instance, Boursier et al. [16] used structural equation modeling (SEM) to discover that loneliness is positively related to excessive social media use (ESMU) and ESMU is, in turn, positively correlated with other problems, such as anxiety. Meanwhile, Yoder et al. [17] applied multiple linear regression to find that Internet pornography is directly associated with loneliness.
Thus, the literature has discerned meaningful insights into the complex relationship between technology use and factors related to well-being. Still, studies have yet to analyze the degree to which automated technology, such as machine learning models built upon multi-platform online behavior, could be created to help individuals monitor and improve their health. In addition, although research has shown that online behavioral data can be used to predict mental health factors, such as suicidal risk, depression, and anxiety, it has not yet examined, to our knowledge, how individuals' multi-platform digital data coming from Google and YouTube could be used to infer their loneliness scores [11,18-20].
In line with other "AI for health" studies, we aim to fill this gap in the literature by researching whether machine learning models can accurately assess loneliness [21]. Based on theoretical and empirical literature that suggests that online platforms differ in their ability to influence health and psychological well-being, we examine two platforms in this study: Google and YouTube [13,14]. We focus on these platforms since they are very commonly used and are relatively different from one another. For instance, while Google search is a text-based search engine, YouTube is primarily an image and video-based platform. Research finds that image-based platforms are more effective at provoking feelings of social presence and intimacy than text-based platforms [15]. We also examine these platforms since studies show they fulfill different needs and underscore contrasting facets of online behavior. For example, Google is primarily associated with active information seeking, whereas YouTube is more often connected with passive media consumption [22].
Using a combination of self-reported survey data and digital trace data provided by 92 individuals during a period of extended isolation from February to April 2021, we aim to answer the following research questions in this study:
RQ1: Can machine learning models use trace data from online platforms to predict loneliness?
RQ2: Are there systematic differences in terms of the predictive ability of online platforms (Google search, YouTube) for loneliness?
Hence, the key contributions of this study are (a) to propose a novel approach to use digital trace data to predict loneliness and (b) to systematically analyze the differences in user behavior across Google and YouTube based on their loneliness levels. Based on the analysis, we find that digital trace data could be used to create relatively accurate and cost-effective prediction models for individuals to track their loneliness. We also uncover additional support for theories like UGT that posit that technology use may influence wellbeing differently based on how satisfied individuals are with their digital use, both in terms of usage and the model's predictive ability. With refinements, we believe the proposed approach could contribute towards digital health dashboards for individuals, wherein their data, combined with models running on their computers (e.g., as web plugins), could be used for triaging health and for providing support and guidance via awareness material or referrals.

Theoretical Background: Motivations behind Online Media Usage
The effects of media use on users' personal lives, health, and well-being have been studied in media, communication, and psychology. According to the Uses and Gratifications Theory (UGT), people actively choose media and engage in technology to gratify their specific needs [13,15]. Since UGT considers diverse motivations ranging from sociodemographic to psychological characteristics [14], emotions are one of the causes that motivate people to use media. Previous studies have applied UGT to analyze the influence of social media usage on loneliness, happiness, and satisfaction with life [15] and the effect of increased smartphone use on depression severity and emotion regulation of users [9].
A recent study reported that people watch vlogging videos to fulfill informational and entertainment needs [23]. In turn, the motivation they had to watch these videos significantly impacted their level of engagement (emotional and otherwise). Another study found that YouTube was used more for entertainment purposes than information (e.g., to obtain political or medical information) [24]. Further, we note that recommendations, subscriptions, and passive consumption significantly impact YouTube utilization and the associated user experience. Hence, we consider YouTube behavior to be relatively more passive and entertainment-centric than Google search behavior, which is more active and information-centric.

Loneliness and Online Behavior
Similarly, multiple empirical studies have suggested an interconnection between loneliness and online behavior. For instance, Lee et al. [22] used structural equation modeling and found a connection between YouTube use and loneliness. Yoder et al. [17] suggested a link between online porn consumption and loneliness. Haridakis et al. [25] argued that "... while people watch videos on YouTube for some of the same reasons identified in studies of television viewing, there is a distinctly social aspect to YouTube use that reflects its social networking characteristics". Being empirical, these studies did not try to build predictive models for loneliness. This fact is partially surprising, given the studies that report the use of social media to build predictive models for other mental health issues, including depression, anxiety, and suicide risk [11,26]. As closely related efforts, we note the study by Mazuz et al. [27], who used individual Reddit posts to predict loneliness, and Brodeur et al. [3], who used aggregated search behavior, as opposed to individual-level behavior, to predict loneliness. Hence, using individual online data to predict loneliness, especially as a combination of YouTube and Google logs, still needs to be explored.
Our study follows the recommendations from a recent review article on loneliness and social media use by O'day et al., who stated that "Loneliness is a risk factor for problematic SMU" (Social Media Use). They further state that "To date, problematic SMU has been defined in terms of frequency rather than pattern of use. Most research has relied on self-report cross sectional examinations of these constructs. More experimental and longitudinal designs are needed to elucidate potential bidirectional relationships between social anxiety, loneliness, and social media use" [28]. We go beyond self-report and focus on the patterns of use rather than frequency alone to study predictive interconnections between loneliness and online media use. Specifically, we perform this in the context of predictive models created using individual-level Google and YouTube traces since this connection, motivated by the past literature, is yet to be explored systematically. Further, we analyze the differences in use patterns across platforms as they relate to loneliness, and interpret the differences based on potential user motivations.

Data Collection
We collected two types of data, digital traces and self-reported survey responses, from consenting participants over ten weeks between February and April 2021 as part of a project called the "Rutgers Wellness Study" [29]. The data were shared with researchers through a secure mechanism. Meanwhile, behavioral data, including loneliness information, were provided by participants through the completion of an online questionnaire on Qualtrics every week.
For this study, we considered adults over the age of 18 living in the United States. Additionally, we restricted participation to individuals who were active users of Google search, Google Mail, and Google Location Services three months prior to the study. Recruitment efforts focused on using online advertisements, social media, and university mailing lists to enlist subjects. Potentially because of the recruitment process, most participants were affiliated with a large public university in the Northeastern United States. A total of 101 participants signed up for the study and 92 completed the study. The data presented in this article were obtained from these 92 participants.

Ethical Considerations and Permissions
The Rutgers University Institutional Review Board (IRB) reviewed and approved the study. Participants were informed of the study's goals and data collection procedure before involvement. They were also informed that they could withdraw from the study at any point during the ten weeks. All participants were provided consent forms, and only those who agreed to the terms participated in the study. The participants were compensated monetarily for their time.
Several steps were performed before data analysis to protect participants' confidential information. First, Google's Cloud Data Loss Prevention (DLP) API was used to deidentify participants' data (e.g., names, addresses, and phone numbers) before it was shared with the research team. Second, data were stored using a secure and confidential system. Third, a mental health clinician was included in the research team and available to deal with unexpected scenarios and provide referrals to those in need. Finally, findings based on participants' data are only reported as aggregate trends or associations instead of individual results.

Variables of Interest
Loneliness (Target Variable): We measured loneliness using the University of California (UCLA) loneliness scale [1], which contains 20 statements, such as "I am unhappy doing so many things alone", and measures items on a 4-point Likert-type scale ranging from 1 ("I never feel this way") to 4 ("I often feel this way"). The scale, which has been widely validated and used in the literature [30-32], ranges from 20 to 80. Following past research [6], we considered scores between 20 and 34 to denote low degrees of loneliness and higher scores to denote moderate-to-high degrees of loneliness. Accordingly, to convert the modeling into a binary classification problem, we labeled subjects with low degrees of loneliness (<35) as "not lonely" and those with moderate-to-high degrees of loneliness (≥35) as "lonely" on a weekly basis.
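The labeling rule above can be sketched in a few lines of Python. This is a minimal illustration rather than the study's code: the names `ucla_total` and `label_loneliness` are hypothetical, and the sketch assumes each of the 20 item responses has already been coded from 1 to 4.

```python
from typing import Sequence

LONELY_THRESHOLD = 35  # totals of 35 or more are labeled "lonely" (cutoff from the text)


def ucla_total(item_responses: Sequence[int]) -> int:
    """Sum 20 item responses, each coded 1-4, into a total between 20 and 80."""
    if len(item_responses) != 20:
        raise ValueError("expected 20 item responses")
    if any(r < 1 or r > 4 for r in item_responses):
        raise ValueError("each response must be between 1 and 4")
    return sum(item_responses)


def label_loneliness(total_score: int) -> str:
    """Map a weekly total score to the binary label used for classification."""
    if not 20 <= total_score <= 80:
        raise ValueError("total score must be between 20 and 80")
    return "lonely" if total_score >= LONELY_THRESHOLD else "not lonely"
```

Applying `label_loneliness` to each participant's weekly total would yield one binary target per participant-week.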
Demographic Features: We employed 11 sociodemographic features in our research. Participants' age, race, gender, income, household size, living situation, marital status, employment status, pets, veteran status, etc., were measured using multiple-choice and open-ended questions.
Digital Features: We utilized 44 digital trace features in our study. These features were based on individuals' Google search and YouTube engagement data. They contained various temporal aggregate features to measure the immediate and long-term impact of different types of technology use. For example, "num_google_searches" measured the weekly number of times Google search was used and allowed us to analyze weekly longitudinal trends. On the other hand, "url_category_x" and "yt_category_x" underscore the different ways Google and YouTube were used by participants in our study. These categories (e.g., sports, technology, and music) were obtained using third-party APIs. Given the sparsity in these categorical data, we retained only those categories that were utilized by at least half the users at any point during the study period. Table 1 provides a list of the primary digital trace information that was used in this study.
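The category-retention rule described above could be sketched as follows. This is an illustrative reading, not the study's implementation: the record layout `(participant_id, week, category, count)` and the function name `retained_categories` are assumptions, and "utilized at any point" is interpreted as at least one use in any week.

```python
from collections import defaultdict
from typing import Iterable, Set, Tuple

# Assumed record layout: (participant_id, week, category, count)
Record = Tuple[str, int, str, int]


def retained_categories(records: Iterable[Record], min_fraction: float = 0.5) -> Set[str]:
    """Return categories used at least once by min_fraction of all participants."""
    users_by_category = defaultdict(set)
    all_users = set()
    for participant_id, _week, category, count in records:
        all_users.add(participant_id)
        if count > 0:  # a zero count does not qualify as use
            users_by_category[category].add(participant_id)
    if not all_users:
        return set()
    return {
        category
        for category, users in users_by_category.items()
        if len(users) / len(all_users) >= min_fraction
    }
```

Only the retained categories would then be expanded into per-category weekly count features such as "url_category_x" or "yt_category_x".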

Data Preprocessing and Modeling
Weeks 1 and 10 were discarded due to inconsistencies in the data. Week 1 was dropped due to a ramping-up effect: some people signed up on Monday, and others joined on Sunday. Week 10 experienced the reverse scenario: while some participants stopped sharing data on Monday, others waited until Sunday. Hence, we analyzed data from eight weeks out of the ten-week period.
We also implemented a strict "iron curtain" policy for the evaluation of the machine learning models. We used the data from 75% of the participants from weeks 2 to 7 as the training set and tested the model on the remaining 25% of the participants for the remaining two weeks (weeks 8 and 9). Hence, there is no overlap of time and individuals between the training set and the test set.
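A split along both participants and time can be sketched as below. This is a minimal sketch under stated assumptions: rows are dictionaries with hypothetical `participant_id` and `week` keys, and the function name `iron_curtain_split` and the seeded shuffle are illustrative choices rather than details from the study.

```python
import random
from typing import Dict, Iterable, List, Tuple

Row = Dict[str, object]  # assumed to contain "participant_id" and "week"


def iron_curtain_split(
    rows: List[Row],
    train_fraction: float = 0.75,
    train_weeks: Iterable[int] = range(2, 8),  # weeks 2 to 7
    test_weeks: Iterable[int] = (8, 9),
    seed: int = 0,
) -> Tuple[List[Row], List[Row]]:
    """Split rows so that train and test share neither participants nor weeks."""
    train_weeks, test_weeks = set(train_weeks), set(test_weeks)
    participants = sorted({row["participant_id"] for row in rows})
    random.Random(seed).shuffle(participants)
    n_train = round(train_fraction * len(participants))
    train_ids = set(participants[:n_train])
    train = [r for r in rows if r["participant_id"] in train_ids and r["week"] in train_weeks]
    test = [r for r in rows if r["participant_id"] not in train_ids and r["week"] in test_weeks]
    return train, test
```

Note that rows from training participants in weeks 8 and 9, and rows from test participants in weeks 2 to 7, are deliberately left unused under this policy.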
Four machine learning models were used for testing in this classification study: Random Forest (RF), eXtreme Gradient Boosting (XGBoost), Logistic Regression (LR), and Multilayer Perceptron Neural Network (MLP). Random Forest is an ensemble learning technique that, during training, builds many decision trees and outputs the mode of the classes for classification problems or the average prediction for regression tasks. By merging various trees, it reduces overfitting and improves predictive accuracy [35]. XGBoost, or eXtreme Gradient Boosting, is an efficient and scalable gradient-boosting algorithm. It successively constructs a sequence of decision trees, each rectifying the errors of the preceding one, and employs a regularization term to control model complexity [36]. Logistic Regression is a linear model for binary classification that predicts the likelihood of an instance belonging to a specific class. It employs the logistic function to predict the outcomes [37]. A multilayer perceptron is an artificial neural network. It consists of numerous layers of interconnected nodes or neurons. It learns complicated patterns in data using an activation function and backpropagation, making it appropriate for various tasks, including classification and regression in both structured and unstructured datasets [38].
Modeling was performed using Python 3.8 and Python libraries, such as sklearn and XGBoost. Missing values were replaced with median values. We consider the area under the ROC curve (AUC) as the primary evaluation metric in this study because it handles class imbalance gracefully. We also consider standard accuracy and F1 score as supporting metrics to evaluate the models [39]. To better understand the predictive role of sociodemographic factors and different digital platforms, we tested the model on all features and on subsets of the complete data.
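As an illustration of the imputation and evaluation steps, the following sketch uses scikit-learn with a Logistic Regression classifier. It is not the study's pipeline: the data are synthetic, the helper names `make_toy_data` and `fit_and_score` are hypothetical, and fitting the medians on the training rows only is an assumption made here to avoid leakage into the test set.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.pipeline import make_pipeline


def make_toy_data(seed=0, n_rows=200, missing_rate=0.1):
    """Synthetic stand-in for weekly feature rows; not the study's data."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_rows, 5))
    y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
    X[rng.random(X.shape) < missing_rate] = np.nan  # inject missing values
    split = int(0.75 * n_rows)
    return X[:split], y[:split], X[split:], y[split:]


def fit_and_score(X_train, y_train, X_test, y_test):
    """Impute with training medians, fit a classifier, report AUC, accuracy, F1."""
    model = make_pipeline(
        SimpleImputer(strategy="median"),  # medians are learned from X_train only
        LogisticRegression(max_iter=1000),
    )
    model.fit(X_train, y_train)
    prob_positive = model.predict_proba(X_test)[:, 1]  # scores for the AUC
    predicted = model.predict(X_test)  # hard labels for accuracy and F1
    return {
        "auc": roc_auc_score(y_test, prob_positive),
        "accuracy": accuracy_score(y_test, predicted),
        "f1": f1_score(y_test, predicted),
    }
```

Calling `fit_and_score(*make_toy_data())` returns a dictionary with the three metrics; swapping in another estimator only requires changing the last pipeline step.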
Figure 1 demonstrates the steps we have taken to execute our data analysis. We first took all the features we obtained from the dataset and divided them into three different subsets, which are Sociodemographic features, Google features, and YouTube features. Then, we combined these subsets with one another to obtain a total of six combinations. For each combination of the subsets, we trained and tested four model types (Random Forest, Logistic Regression, XGBoost, and Multi-layer Perceptron) a total of 50 times. In each such iteration, a different set of rows was randomly selected to be part of the training and test set, respectively. To ensure a robust design for our analysis, we only tuned the hyper-parameters for each of the machine learning models based on the first training set. For each model, we undertook forward feature selection, i.e., the features were ranked based on their permutation importance [40], and added one by one to the model. The best performing feature selection was retained, and its performance was recorded. The average scores for 50 such iterations are reported in Section 4.
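The forward feature selection step could look roughly like the sketch below, which ranks features by permutation importance and then adds them one at a time. Several details are assumptions rather than facts from the study: the use of a separate validation split for ranking and scoring, Logistic Regression as the model, the synthetic data, and the names `make_toy_split` and `forward_select_by_importance`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def make_toy_split(seed=0):
    """Synthetic data with 3 informative features out of 8; not the study's data."""
    X, y = make_classification(
        n_samples=300, n_features=8, n_informative=3, n_redundant=0,
        n_clusters_per_class=1, class_sep=1.5, random_state=seed,
    )
    return train_test_split(X, y, test_size=0.3, random_state=seed, stratify=y)


def forward_select_by_importance(X_train, y_train, X_val, y_val, seed=0):
    """Rank features by permutation importance, then grow the feature set greedily."""
    full_model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    result = permutation_importance(
        full_model, X_val, y_val, scoring="roc_auc", n_repeats=10, random_state=seed
    )
    ranking = np.argsort(result.importances_mean)[::-1]  # most important first

    best_auc, best_features = -np.inf, []
    for k in range(1, len(ranking) + 1):
        candidate = ranking[:k]  # top-k features in ranked order
        model = LogisticRegression(max_iter=1000).fit(X_train[:, candidate], y_train)
        auc = roc_auc_score(y_val, model.predict_proba(X_val[:, candidate])[:, 1])
        if auc > best_auc:  # keep the best-scoring prefix of the ranking
            best_auc, best_features = auc, candidate.tolist()
    return best_features, best_auc
```

With the toy split, usage would be `X_tr, X_val, y_tr, y_val = make_toy_split()` followed by `forward_select_by_importance(X_tr, y_tr, X_val, y_val)`. Repeating this over many random splits and averaging the recorded scores would mirror the 50-iteration design described above.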

Sample Population
A majority of the 92 participants in the study identified as female (68.48%). Although participants ranged in age from 18 to "65 and older", a significant portion was between the ages of 18 and 21 (43.48%). The two biggest racial groups represented in the study are White (39.13%) and Asian (34.78%), and most of the participants were single (81.52%). Table 2 provides the primary sociodemographic details of this study's sample population.

Loneliness in Participants
There was some variation in the weekly number of lonely participants, as shown in Figure 2. The total number of lonely participants in a week ranged from 36 to 43, with a mean of around 40 lonely participants weekly. We found that 26 out of 92 participants experienced significant shifts in their well-being status, particularly in their classification as "lonely" or "not lonely" throughout the study. We note that the lowest loneliness levels were observed for week 6, corresponding to the students' spring break.

Online Behavior and Loneliness
Figure 3a,b show the variation in weekly aggregate activity in terms of the total number of Google searches and the total number of YouTube videos watched. As demonstrated in Figure 3, "lonely" participants used Google search more than the "not lonely" participants. Interestingly, the trend was inverse in terms of YouTube videos watched. The "not lonely" participants used YouTube more than the "lonely" participants. Figure 3c,d show the trend of the weekly average of pages visited via Google search for the selected 21 categories and the weekly average of YouTube videos watched for 11 selected categories across all participants, which demonstrated similar trends to Figure 3a,b. However, the two groups overlap at week 8 in Figure 3c,d. Similarly, the pattern in week 8 differs from earlier weeks in Figure 3b (relatively smaller gap between the two groups). This suggests that these features are likely to be useful but not absolute predictors of the level of loneliness.

Biggest Differences Observed
The differences between the mean values of the digital trace data of participants from the "lonely" and "not lonely" categories also provide insight into how different individuals use online platforms. Table 3 reports the top three features with the highest positive (similarly negative) difference between the "lonely" and "not lonely" groups (inclusion criteria: minimum one search/watching activity per week on average for that specific feature). It shows that although high usage of "Sports", "Music", and "Education" related YouTube content was much more common for "not lonely" participants, the "lonely" group more frequently utilized Google search to search/browse information related to "Hobbies and Interests", "Miscellaneous", and "COVID". Given the thematic closeness between "Hobbies and Interests" and "Sports", "Music", and "Education", we posit that the platform characteristics (e.g., active Google search vs. passive YouTube) play an important role in the association with loneliness.

Prediction Results
We tested four machine learning models in this study: Random Forest (RF), Logistic Regression (LR), eXtreme Gradient Boosting (XGB), and Multilayer Perceptron Neural Networks (MLP). We divided our features into the following sub-categories: Sociodemographic (Demo), Google-based features including those about aggregated search activity, websites visited via search, and their categorical distribution (Google Features), and YouTube-based features including aggregated activity level and their categorical distribution (YouTube Features). For each model, we computed the performance based on all combinations of the sub-categories mentioned above. Feature subset selection was undertaken in each setting to optimize for the area under the ROC curve (AUC), which we used as the primary comparison metric to find the best-performing machine learning model. The results of the evaluation are shown in Table 4. As can be seen from Table 4, MLP with Demo and Google Features provided the highest AUC for the different settings considered. The MLP model generally outperformed other predictive models. We also notice that while the sociodemographic features yielded strong predictive power, adding digital traces to sociodemographic features showed higher predictive power than using only sociodemographic features.
For completeness and interpretability, we repeat the same process optimizing feature selection for accuracy and F1 score, and report the results in Tables 5 and 6, respectively. The same setting (Multi-Layer Perceptron with Demo + Google features) that obtained the highest score in terms of AUC also recorded the highest scores in terms of accuracy and F1 score (80.17% and 74.49%, respectively). Similarly, we notice that the Multi-Layer Perceptron outperformed other ML models in most settings. Overall, these performance scores are modest but illustrative of the potential of using digital traces like Google and YouTube features for similar tasks in the future. We also note that these results are based on a setting where the test data does not overlap with the training data in terms of participants or time. If specific settings require generalization along only one of those two axes, the model will have opportunities to learn from more data and yield higher performance. For example, when we used the data from the first six weeks to predict values for the next two weeks for the same individuals, the XGBoost model yielded an AUC of 93.45%. For comparison, a baseline model that labels loneliness in week 8 simply as the label from week 2 will obtain an AUC of 79.72%. Finally, if, in specific settings, the application designers want to use only passive digital traces, and not collect self-reported data on demographics, then MLP could be used to yield an AUC of 73.89%.
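To make the carry-forward baseline concrete, the sketch below computes the AUC obtained when an earlier binary label is reused as the prediction for a later week. It relies on a standard identity: for hard 0/1 predictions, the ROC curve has a single interior point, so the AUC equals the mean of sensitivity and specificity. The function name `carry_forward_auc` is hypothetical, and the text does not state how the reported baseline AUC was computed, so this is only one plausible reading.

```python
from typing import Sequence


def carry_forward_auc(earlier_labels: Sequence[int], later_labels: Sequence[int]) -> float:
    """AUC when each earlier label (0 or 1) is used to predict the later label."""
    pairs = list(zip(earlier_labels, later_labels))
    preds_for_positives = [pred for pred, truth in pairs if truth == 1]
    preds_for_negatives = [pred for pred, truth in pairs if truth == 0]
    if not preds_for_positives or not preds_for_negatives:
        raise ValueError("AUC is undefined unless both classes are present")
    sensitivity = sum(preds_for_positives) / len(preds_for_positives)
    specificity = sum(1 - p for p in preds_for_negatives) / len(preds_for_negatives)
    return (sensitivity + specificity) / 2
```

A baseline like this is useful because it shows how much of a model's score could be explained by label stability alone.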

Initial Remarks
In this study, we examined whether users' online behavior could be used to predict, and ultimately help prevent, well-being-related health issues, particularly loneliness. In conjunction with sociodemographic information, we found that Google and YouTube data could infer an individual's loneliness levels with reasonable accuracy (AUC = 84.69%). As a result, machine learning models could be utilized to develop low-cost screening tools to support individual health. Furthermore, our study finds that digital trace information improves loneliness prediction across a variety of machine learning approaches, with MLP performing better than the others in the current study.
Further, we observed systematic differences between online platforms. In terms of aggregate use, "lonely" participants used Google search more than the "not lonely" participants. On the other hand, "not lonely" participants used YouTube more than the "lonely" participants. Different platforms also yielded different degrees of predictive power in terms of the prediction model. As reported in Table 4, Google data had higher predictive power than YouTube data in three of the four settings. These results underscore that different online platforms influence individuals differently, depending on how the participants use them and their motivations. Hence, the results support theories like UGT by showing that technology used for different purposes can influence people differently and can have different predictive ability.

Deployment Scenarios
Online platforms, such as Google and YouTube, have powerful potential and can be used to develop automated tools that rely on machine learning methods to mitigate and prevent serious health problems. For example, loneliness, referred to as an "epidemic," is one of the many facets of mental well-being that involves social stigma, which prevents individuals from seeking help [4]. Digital trace data present a unique opportunity for individuals to utilize their online data for self-evaluation purposes. This is especially pertinent for individuals who experience stigma, do not wish to receive professional help, cannot access professional help, or cannot afford professional help. With refinement and clinical validation, the method illustrated in this study can be used to create a browser plug-in or lightweight computer application to provide periodic tips on mental health or referrals to mental health facilities depending on users' loneliness scores.

Limitations
Our study has a few limitations. First, we acknowledge the privacy and ethical concerns associated with assigning a health score to individuals based on passive data collection, as pointed out by Tufekci [41]. To address these concerns, we recommend that automated tools created based on machine learning models, such as our own, explicitly request permission to access users' data. We also suggest that tools are designed to be self-evaluation guides, and only trained health professionals and physicians can evaluate individuals' circumstances further. Such approaches can play a small role in creating automated tools that can alleviate the burden on individuals and healthcare professionals while reducing costs.
Next, we acknowledge the limitations relating to the findings of our study. Our study relies on findings from a relatively homogeneous sample during the COVID-19 pandemic, a period of increased isolation, social distancing, and loneliness [42]. Thus, while findings could be generalized to a similar population, they may not apply to vastly different populations. They may also differ depending on the time period. Accordingly, we recommend that future studies test objective claims from this study using causal methods that investigate different (non-COVID) periods with various sample populations.

Conclusions
Our work represents the first effort, to our knowledge, to analyze and uncover the ability of multiplatform digital trace data (Google and YouTube) to predict loneliness. Our study also provides additional theoretical and empirical knowledge on how online platforms, such as Google and YouTube, differ from one another and impact individuals differently. The combination of the activity level and category of content utilized allowed us to create algorithms that yielded high accuracy at predicting loneliness. Given the widespread increase in loneliness levels and calls for the early detection of loneliness to undertake counter actions, this study can serve as an important building block for healthcare applications. Our approach can be used to create personal digital health dashboards that use the individuals' data and models running on their own devices (such as web plugins) to triage their health status and offer assistance and guidance through relevant information or referrals.

Figure 1. Process flow chart for the evaluation.

Figure 2. Total number of lonely participants per week.

Figure 3. (a) Average number of weekly Google searches; (b) average weekly YouTube videos watched; (c) average pages visited via Google search each week (for the selected 21 categories); and (d) average YouTube videos watched each week (for the 11 selected categories). All results presented are means across all participants.

Table 1. Digital trace data used in this study and their explanation.
[...] YouTube sessions. Here, two videos belong in a session if they were watched within 60 min of each other.
weekly_use_count_youtube: Weekly number of times YouTube is used (e.g., videos watched and comments).
yt_category_x: Weekly number of videos watched on YouTube per category as defined by the YouTube API [34]. We retain 11 such categories based on active use by the participants (11 different features).
num_comments: Weekly number of YouTube comments.
unique_yt_cat_visited_weekly: Weekly number of unique YouTube categories visited.
total_yt_weekly_top_cats: Weekly sum of YouTube videos watched for the 11 selected categories.
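The session rule stated in Table 1 (two videos belong to the same session if they were watched within 60 min of each other) can be illustrated with a short sketch. The function name `count_sessions` is hypothetical, and treating a gap of exactly 60 minutes as the same session is an assumption, since the table does not specify the boundary case.

```python
from datetime import datetime, timedelta
from typing import Iterable

SESSION_GAP = timedelta(minutes=60)  # threshold taken from Table 1


def count_sessions(watch_times: Iterable[datetime], gap: timedelta = SESSION_GAP) -> int:
    """Count sessions, starting a new one whenever the gap between videos exceeds `gap`."""
    ordered = sorted(watch_times)
    if not ordered:
        return 0
    sessions = 1
    for previous, current in zip(ordered, ordered[1:]):
        if current - previous > gap:  # more than 60 min apart: new session
            sessions += 1
    return sessions
```

Grouping a participant's watch timestamps by week and applying `count_sessions` to each group would give a weekly session count.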

Table 2. Sociodemographic characteristics of participants.

Table 3. Mean differences in online behavior of participants (by category).

Table 4. The area under the ROC curve (AUC) of machine learning models on different feature sets. The best performing models are bolded.

Table 5. The accuracy of machine learning models on different feature sets.

Table 6. The F1 scores of machine learning models on different feature sets.