Recreating the Relationship between Subjective Wellbeing and Personality Using Machine Learning: An Investigation into Facebook Online Behaviours

The twenty-first century has delivered technological advances that allow researchers to utilise social media to predict personal traits and psychological constructs. This article aims to further our understanding of the relationship between subjective wellbeing (SWB) and the Five Factor Model (FFM) of personality by attempting to replicate the relationship using machine learning prediction models. Data from the myPersonality Project was used, with observed SWB scores derived from the Satisfaction With Life Scale (SWLS) and FFM personality profiles generated from responses on the 100-item IPIP proxy of the NEO-PI-R. After data cleaning, FFM personality traits and SWB scores were predicted by reducing Facebook Likes to 50 dimensions using singular value decomposition (SVD) and then running six multiple regressions (fitted via least squares, with the data split via k-folds validation), with the Likes dimensions as predictors and each of the FFM traits and the SWB score as response variables. Standard multiple regression analyses were conducted for the observed and machine learning predicted variables to compare the relationships in the context of previous literature. The results revealed that in the observed model, high SWB was predicted by high extraversion, conscientiousness, and agreeableness, and by low openness to experience and neuroticism, as per previous research. For the machine learning model, high SWB was predicted by high extraversion, openness to experience, conscientiousness, and agreeableness, and by low neuroticism. The relationships between SWB and extraversion, neuroticism, and conscientiousness were successfully replicated in the machine learning model. Openness to experience changed direction in its relationship with SWB from the observed to the machine learning-derived variables due to failure to accurately recreate the variable, and agreeableness was multicollinear with SWB in the machine learning model due to the unknowing use of identical digital behaviours to replicate each construct. Implications of the results and directions for future research are discussed.


Introduction
Fast-paced technological trends demand that research tools in psychology evolve. There has been a historical focus on self-report methods and traditional behaviour analysis due to their ease of use and proliferation in psychological research. Novel approaches such as machine learning and data mining have recently begun to gain traction in psychological research [1]. Data mining allows large, diverse samples to be analysed and utilised in algorithms to predict future outcomes [2]. In particular, the algorithms can be used to predict psychological constructs such as subjective wellbeing (SWB) and the traits of the Five Factor Model (FFM) of personality [3]. Previous research has demonstrated machine learning's ability to recreate psychological constructs from online data. However, the observed relationship between wellbeing and personality has not yet been recreated via machine learning techniques in the extant literature. The current study utilises simple machine learning techniques such as singular value decomposition, k-folds validation, and linear regression. Facebook 'Like' data was first reduced via unsupervised feature extraction using singular value decomposition. The resulting dimensions, together with pre-labelled personality (FFM) and SWB data, were then used in linear regression with k-folds validation to predict participants' personality (FFM) and SWB. These predicted values were then used to recreate the relationship between the FFM and SWB, and to assess the accuracy of the prediction model compared to observed scores.

Subjective Wellbeing
Research surrounding SWB has captivated the field of psychology. Brickman and Campbell [4] coined the term 'hedonic treadmill', in which individuals adapt to improved circumstances, such as wealth and material goods, without becoming happier. They and other researchers found that an increase in income did not increase one's SWB, and that lottery winners were typically less happy and paraplegics happier than one would anticipate [5][6][7]. On the other hand, the most significant influences on SWB have been found to be personality traits, as they predispose an individual to life experiences and behaviours that may positively or negatively affect one's average level of life satisfaction [8][9][10]. A higher level of SWB is associated with frequent positive, pleasant affective experiences. An individual may consider this in the form of cognitions, such as evaluation of marital and career satisfaction, or in the form of affect, such as experiencing certain moods and emotions in reaction to an event [11]. Investigating SWB and how it is best predicted is important in psychology to further our understanding of mental illnesses, such as depression, and researchers' and society's perception of true 'happiness'.

The Five Factor Model of Personality
An individual's personality may be described in layman's terms as 'friendly', 'outgoing', 'loud', or 'shy'. However, the dominant understanding of personality in the academic community is the FFM, which consists of extraversion, neuroticism, openness to experience, conscientiousness, and agreeableness. Personality psychology aims to understand the whole person to comprehend how individuality is organised and integrated [12,13]. The FFM traits have been well established in psychological research as a way to address the underlying variety in human behaviour with a nomenclature that can classify individual personality differences [14]. Additionally, personality has been linked to marital and relationship outcomes, career satisfaction, social adaptation, and cultural differences [15][16][17].

Machine Learning
Machine learning is a relatively new and emerging research tool in psychological study. Due to the evolving nature and skills required in this field, computer scientists and engineers have typically dominated; however, it is becoming a popular tool in social sciences [2,18]. Machine learning involves applying a performance algorithm to a large data set to produce a prediction model and using this model to predict an outcome [19]. Repeating this process iteratively allows for a 'perfected' model and accurate predictions of psychological constructs. In order to use machine learning, data mining is required, in which large data sets can be utilised to create new information or strengthen previous knowledge. Feldman, et al. [20] used Facebook 'status updates' of 73,789 participants recruited through the myPersonality application to determine whether the machine learning condition supported the previously found positive relationship between profanity and honesty. Utilising linguistic inquiry and word count on the large data set replicated this previous relationship, showing that those who used more profanity in their Facebook 'status updates' were more likely to be honest. Data mining also allows for the analysis of large data sets with a higher accuracy and thus additional insights can be determined [21].
Two types of machine learning are commonly used in psychology, which vary through the labelling of data. Supervised machine learning requires prior labelling of the data by the researcher based on theory or previous knowledge, and produces an algorithm that can be used to predict future instances. On the other hand, unsupervised machine learning does not involve prior labelling of the data and instead, the machine applies its own clustering of data to produce an unknown class of items and predictions [22].

Advantages and Disadvantages of Machine Learning
Machine learning brings advantages in terms of large sample sizes, generalisation of research, a considerably lower cost, a higher statistical power, and lower bias in results. Stillwell and Kosinski [23], among other researchers [1,2], have commented on the usefulness of internet-based psychological research and its ability to yield large samples of participants, allowing a larger scale to base research on. Similarly, Kosinski, et al. [24] noted the proficiency of Facebook as a research tool, permitting large and non-WEIRD (western, educated, industrialised, rich, democratic) samples to be inexpensively recruited. They also stated that Facebook provides records of human behaviour expressed in a natural online environment. This reduces reference group effect bias, typically found in self-report questionnaire measures (such as the IPIP; International Personality Item Pool), which refers to describing oneself in relation to others. Human activity monitored by digital services and devices allows behaviours to be digitally mediated, permitting large-scale samples to be obtained with the minimisation of sampling error and reduction of group effect bias [24,25]. Using the Internet as a research tool permits the collection of diverse samples and generally produces better quality data [1].
Although machine learning brings great advantages for research, disadvantages of this tool still exist. Human perception is flexible and can recognise behavioural cues that are not matched by computer-based predictions. Assessing and determining personality not only depends on questionnaire outcomes, but also on how the individual behaves subconsciously and certain cues that may pertain to dishonesty [26]. For Facebook data, users are able to remove online behavioural traces, which may render their profile subject to misinterpretation and social desirability bias when analysed via machine learning techniques [24]. Finally, social media platforms are vulnerable to fake profiles, which could skew and affect research results. However, Kosinski, Matz, Gosling, Popov and Stillwell [24] have argued that these profiles are usually easy to detect. Because these disadvantages can be controlled for, machine learning poses a promising future for psychological research.

Social Media Data in Psychology Research
Researchers have investigated how Facebook data can be used to produce algorithms to predict psychological constructs. Website choice, 'Like'-based, and language-based data have been the most commonly used variables [26,27]. For example, most users on Twitter were classified as emotionally stable and extroverted by using counts of the Twitter information 'following', 'followers', and 'listed' [28]. Additionally, Reece and Danforth [29] successfully identified markers of depression from participants' Instagram photos, which surpassed general practitioners' typical unassisted diagnostic success rate for depression. Quercia, et al. [30] evaluated the efficacy of digital methods to predict the strength of online social relationships on Facebook and their findings were consistent with previous analyses used on Twitter [31]. The researchers concluded that explanations for relationship deterioration did not differ between online and offline social worlds. Kosinski, Stillwell and Graepel [3] praised the high predictive power of Facebook 'Likes' and predicted a range of personal attributes with varying accuracy. A large sample of Facebook information for approximately 58,000 participants allowed for the development of a prediction model in which numeric variables such as age and intelligence were predicted using linear regression. Dichotomous variables such as gender were predicted using logistic regression. Ethnic origin, gender, and age were the most accurately predicted variables. Personality traits were moderately predicted, whereas SWB was weakly predicted. The researchers attributed the low accuracy of prediction for SWB to the basis that Facebook 'likes' accrue over a continuous period and give a long-term score of SWB, which is not reflected in the SWLS (Satisfaction with Life Scale) because that is a snapshot in time. Social media and technology are an important part of society and integrating them into psychological research allows human behaviour to be monitored and analysed in a way that may be beneficial to future research [32].

The Relationship between Subjective Wellbeing and the Five Factor Model of Personality
Previous findings indicate that there is a strong relationship between the FFM of personality and SWB. High levels of SWB are associated with low levels of neuroticism and high levels of conscientiousness, extraversion, agreeableness, and openness to experience [8][9][10]33,34]. Fujita [35] found a strong correlation between neuroticism and negative affect, whilst a meta-analysis determined a moderate correlation between extraversion and positive affect [36]. The consistency of these correlations displays the effectiveness of predicting SWB from the FFM of personality. Correlational studies suggest that individuals are sensitive to certain stimuli and thus will respond to events differently [37][38][39]. Personality type predisposes an individual to experiencing certain life events, which in turn affects an individual's level of SWB [40].

The Current Study
Studies have shown the ability to predict a person's personality at a superficial level using machine learning [2,3,23,25,26,28,41]; however, the relationship between machine learning predicted SWB and personality has not yet been explored. Kosinski, Stillwell and Graepel [3] found the test-retest reliability of the openness to experience FFM trait to approximate the predicted versus observed correlation score, suggesting that the observation of user Facebook 'Likes' is about as informative as that from a personality questionnaire test score. The research question addressed in the current study is whether machine learning and data mining can be used to predict personality (FFM) and SWB, and then whether these outcomes can be used to replicate previously observed relationships between the FFM of personality and SWB. It is predicted that the machine learning model will accurately predict SWB and FFM, resulting in a consistent relationship between observed FFM traits and observed SWB and that of predicted FFM traits and predicted SWB.

Participants
The data was obtained from the "myPersonality Project" (mypersonality.org; [23]). The myPersonality Project contains more than four million individual Facebook profiles. The participants had accessed the myPersonality application through their Facebook profiles during the years 2007 to 2012. The current analysis was conducted in two steps, substantially reducing the number of participants (see bitbucket.org/jakekraska/swlbig5 for data reduction and analysis code).
Step one involved identifying participants that had provided demographic, SWB, and FFM data and merging this together, resulting in 80,628 participants. Participants missing from the myPersonality user-likes file were removed, resulting in a dataset that included 26,573 participants that had complete FFM personality scores and a SWB score. After removing missing data and duplicate data, iteratively removing participants that had fewer than 10 Likes and then removing Likes that had been liked fewer than 50 times, 21,122 participants and 10,377 Likes remained. The singular value decomposition and least squares multiple linear regression analyses were conducted for this data, predicting FFM personality scores and SWB scores from 50 SVD dimensions across 10 folds (predicting values for 10% of the sample from 90% of the sample, iteratively). This format of validation (k-folds validation with 10 folds) was utilised due to the small number of participants available after data cleaning. Available country data (148 countries) for these participants is included in Appendix A (Table A1).
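The iterative trimming described above can be sketched as follows. This is a minimal illustration in Python rather than the study's R code; the function name `trim_user_likes`, the dictionary representation of the user-like data, and the toy thresholds are assumptions made for the example. Only the two cut-offs (10 Likes per user, 50 users per Like) come from the text.

```python
from collections import Counter


def trim_user_likes(user_likes, min_likes_per_user=10, min_users_per_like=50):
    """Repeat both trimming steps until neither removes anything.

    user_likes maps a user id to the set of Like ids for that user.
    The input is not modified; a trimmed copy is returned.
    """
    current = {user: set(likes) for user, likes in user_likes.items()}
    while True:
        # Step 1: drop users with too few remaining Likes.
        users_kept = {
            user: likes
            for user, likes in current.items()
            if len(likes) >= min_likes_per_user
        }
        # Step 2: drop Likes endorsed by too few of the remaining users.
        like_counts = Counter(
            like for likes in users_kept.values() for like in likes
        )
        trimmed = {
            user: {like for like in likes
                   if like_counts[like] >= min_users_per_like}
            for user, likes in users_kept.items()
        }
        if trimmed == current:  # nothing changed, so both thresholds hold
            return trimmed
        current = trimmed


# Toy data with small thresholds, to show why the loop is needed:
# removing u5 makes Like "d" too rare, which in turn leaves u4 too sparse.
toy = {
    "u1": {"a", "b", "c"},
    "u2": {"a", "b"},
    "u3": {"a", "c"},
    "u4": {"a", "d"},
    "u5": {"d"},
}
toy_trimmed = trim_user_likes(toy, min_likes_per_user=2, min_users_per_like=2)
```

A single pass is not enough because removing a rare Like can push a user below the user threshold, and removing that user can in turn make another Like rare.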
For the second stage, the remaining analyses, including correlations between the original and predicted values and a comparison of the multiple linear regression models, were only conducted for those participants that met the inclusion criteria. That is, only participants that were aged between 16 and 90, and those that had provided a gender, were included in the final analysis. The average age of participants in the final sample (n = 13,497) was 24.56 years (SD = 7.08), consisting of fewer male participants (n = 5322, 39.43%) than females (n = 8175, 60.57%). Participants who were aged outside the range of 16 to 90 were omitted due to ethical guideline considerations and false ages given (e.g., 150 years). Due to the anonymity of the data, ethics exemption was granted by the Monash University Human Research Ethics Committee.

Subjective Wellbeing Measure
To measure subjective wellbeing (SWB), the Satisfaction with Life Scale (SWLS) from Diener, et al. [42] was administered via the myPersonality application. The five-item SWLS is a widely used and reliable measure for SWB. A review confirmed the ability of the SWLS to measure SWB as a cognitive judgemental process with a high internal consistency and temporal reliability [43]. Using a Likert scale (ranging from 1 = strongly disagree to 7 = strongly agree), participants are asked to respond to five questions about how they view their own life. For example: 'In most ways my life is close to my ideal'. A low overall score indicates extreme dissatisfaction with life and a high score indicates extreme satisfaction with life.
An internal consistency coefficient of 0.87 and a test-retest correlation coefficient of 0.82 over a two-month period were reported by Diener, Emmons, Larsen and Griffin [42]. A later study found a moderate mean Cronbach's alpha coefficient (α = 0.78) for the SWLS and attributed the moderate score to the small number of items in the scale [44].
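As a small illustration of how an overall SWLS score can be formed from the five responses, the sketch below sums the items, giving a possible range of 5 to 35. Summation is the conventional scoring for the SWLS; whether the myPersonality data stored totals in exactly this form is an assumption, and the function name `swls_total` is hypothetical.

```python
def swls_total(responses):
    """Sum five SWLS item responses, each on the 1-7 agreement scale."""
    if len(responses) != 5:
        raise ValueError("the SWLS has exactly five items")
    if any(r not in range(1, 8) for r in responses):
        raise ValueError("each response must be a whole number from 1 to 7")
    return sum(responses)


# Hypothetical respondent, used only to show the call.
example_total = swls_total([4, 5, 6, 3, 5])
```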

Five Factor Model Personality Measure
The independent variables consisted of the five personality traits from the FFM and were measured using the 100-item IPIP proxy of the NEO-PI-R through the myPersonality application on Facebook [24]. The NEO-PI-R is a widely used and comprehensive measure of an individual's FFM of personality: extraversion (ext), neuroticism (neu), openness to experience (ope), conscientiousness (con), and agreeableness (agr) [45,46]. The NEO-PI-R demonstrates a high internal consistency and stability over a six-year period, showing the reliability and validity of the measure [45,47,48]. The five subscales in the 100-item IPIP proxy have 20 items for which the participants must respond on a five-point Likert scale. For example: responding strongly agree to 'I know how to captivate people' would contribute to a higher score on the extraversion scale.
The reliability coefficients for the 100-item IPIP proxy of the NEO-PI-R range from 0.85 for the agreeableness scale to 0.91 for both the neuroticism and extraversion scales [49].

Data Analysis
Data analysis was conducted using R version 3.5.1 [50] and RStudio version 1.1.453 [51]. The methodology is contained in Figure 1.
The statistical procedure was modelled on 'Mining Big Data to Extract Patterns and Predict Real-Life Outcomes' by Kosinski, Wang, Lakkaraju and Leskovec [2], as well as other research and guidelines in the area of investigation, e.g., [26,52]. Six data sets were utilised that each had specific and not necessarily the same users: a final SWB score, final scores of each FFM trait, Facebook likes of the user, a list of Facebook like ids and their names, demographic (age, gender) data, and location (country).
For the prediction of each variable, a user-like matrix was constructed to match users from the SWLS and FFM data sets and likes. The matrix was trimmed, removing users with fewer than 10 likes and like entities (e.g., "Sleeping Too Much", "Saying I love you", "Jason Mraz", "Bowling", "Talk With a British Accent Day!") with fewer than 50 users. The remaining users were split into 10 folds to reduce overfitting. Fifty SVD dimensions were extracted and underwent Varimax rotation. Figure 2 displays the scree plot for the SVD of the user-like matrix. Multiple linear regression (least squares used for fitting the model), SVD, and k-folds validation (k = 10) were used to predict FFM personality traits and SWB scores for participants (n = 21,122). Participants who had not provided a gender or who fell outside the age range of 16-90 years were then removed. A correlation analysis was performed for each predicted and observed variable to determine whether it had been replicated accurately. A multiple linear regression model (n = 13,497) was built for the observed data (observed SWB as the response variable and each observed FFM personality trait as the predictor variables) and the predicted data (predicted SWB as the response variable and each predicted FFM personality trait as the predictor variables). Correlation, ANOVA, and covariance analyses were run to determine whether the relationship between SWB and personality had been replicated.
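The k-folds prediction step can be sketched as below: each participant's score is predicted from a least squares model fitted on the other folds. This is an illustrative Python sketch, not the authors' R code. The rows of `X` stand in for the 50 rotated SVD dimension scores (the SVD and Varimax steps are omitted here), and the function names, the toy data, the seed, and the use of the normal equations are all assumptions made for the example.

```python
import random


def fit_least_squares(X, y):
    """Return least squares coefficients, intercept first.

    Solves the normal equations by Gauss-Jordan elimination; this assumes
    the predictors are not perfectly collinear.
    """
    rows = [[1.0] + list(x) for x in X]  # prepend the intercept column
    p = len(rows[0])
    # Augmented matrix [X'X | X'y].
    A = [
        [sum(r[i] * r[j] for r in rows) for j in range(p)]
        + [sum(r[i] * t for r, t in zip(rows, y))]
        for i in range(p)
    ]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]


def predict(coefs, x):
    """Apply fitted coefficients to one row of predictors."""
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], x))


def out_of_fold_predictions(X, y, k=10, seed=0):
    """Predict every case from a model fitted on the other k - 1 folds."""
    indices = list(range(len(y)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    predictions = [None] * len(y)
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        coefs = fit_least_squares([X[i] for i in train], [y[i] for i in train])
        for i in held_out:
            predictions[i] = predict(coefs, X[i])
    return predictions


# Toy data with an exact linear relationship, so the held-out predictions
# should match the true scores up to floating point error.
X_toy = [[float(i % 7), float((i * 3) % 5)] for i in range(40)]
y_toy = [2.0 + 3.0 * a - 1.0 * b for a, b in X_toy]
preds_toy = out_of_fold_predictions(X_toy, y_toy, k=10)
```

With k = 10, each model is fitted on roughly 90% of the cases and used to predict the remaining 10%, so no participant's predicted score comes from a model that saw that participant's observed score.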

Results
Descriptive statistics for the observed variables prior to age and gender matching are shown in Table 1. Descriptive statistics for the predicted variables are shown in Table 2. After deriving the predicted scores for each variable through a machine learning algorithm, preliminary analyses were conducted to ensure no violation of normality and homoscedasticity. With the use of a p < 0.001 criterion, Mahalanobis distance and Cook's distance did not suggest the presence of any outliers (maximum Mahalanobis distance, MD(12) = 24.38; maximum Cook's distance, D_i = 0.002 for the observed model and D_i = 0.152 for the machine learning model). The variance inflation factors for the 12 variables in the regression models were less than 10, indicating the absence of collinearity in the sample. The correlation between agreeableness and SWB in the machine learning regression model suggested multicollinearity (r = 0.650). However, as the variance inflation factor did not suggest collinearity and SWB is the dependent variable, both were retained. Correlations of the observed variables are contained in Appendix C Table A4 and for the predicted variables, in Appendix C Table A5.

Singular Value Decomposition Analysis (SVD)
The results from the SVD analysis are presented in Figure 3a-d; r denotes the prediction accuracy and k denotes the number of dimensions. According to Gignac and Szodorai [53], the normative guidelines for small, typical, and large effect sizes are r = 0.10, r = 0.20, and r = 0.30, respectively, which will be followed for this analysis. Overall, neuroticism, extraversion, and openness to experience were predicted with the greatest accuracy. All variables appear to be of a typical to large effect size in replicating the observed variables.

Correlational Analysis (Accuracy of Predictions)
Table 3 displays the correlations between the observed and predicted scores for those participants aged 16-90 years old. Unlike in the SVD analysis above, the predicted data used in this analysis was obtained through training the model on 10 folds, in order to reduce overfitting. According to Gignac and Szodorai [53], there is a moderate relationship between the scores for SWB, extraversion, agreeableness, conscientiousness, and neuroticism, suggesting a relationship between the observed and machine learning-derived scores for these variables. Openness to experience has a large relationship, suggesting a strong relationship between the observed and machine learning-derived scores.
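Prediction accuracy in this section is the Pearson correlation between observed and predicted scores, read against the r = 0.10, 0.20, and 0.30 guidelines cited above. The sketch below shows that calculation in Python. The function names are hypothetical, and treating each guideline value as the lower bound of its band is an assumption made for illustration.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def effect_size_label(r):
    """Label |r| using the 0.10 / 0.20 / 0.30 guideline values."""
    size = abs(r)
    if size >= 0.30:
        return "large"
    if size >= 0.20:
        return "typical"
    if size >= 0.10:
        return "small"
    return "below small"


# Hypothetical observed and predicted scores, used only to show the call.
observed = [20, 25, 31, 18, 27, 22]
predicted = [23, 24, 26, 22, 25, 25]
accuracy = pearson_r(observed, predicted)
label = effect_size_label(accuracy)
```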

Multiple Linear Regression Models for Observed and Predicted Data
The two multiple linear regression model statistics for the models of observed scores for SWB and the FFM personality traits and the machine learning-derived (predicted) scores of the same variables are summarized in Table 4. For the observed scores, a standard multiple regression was performed for SWB and extraversion, openness, agreeableness, conscientiousness, and neuroticism. R was significantly different from zero, F(5, 13,491) = 992.6, p < 0.001, with adjusted R² at 0.269, and therefore approximately a quarter of the variability in SWB is predicted by the FFM personality traits. Extraversion, agreeableness, conscientiousness, and neuroticism significantly predicted SWB. Extraversion uniquely explained 1.07%, openness uniquely explained 0.01%, neuroticism uniquely explained 8.26%, conscientiousness uniquely explained 1.66%, and agreeableness uniquely explained 0.25% of the variance in SWB.
For the machine learning-derived scores, the same process was repeated. Again, R was significantly different from zero, F(5, 13,491) = 3585, p < 0.001, with adjusted R² at 0.570, and therefore more than half of the variability in SWB is predicted by the FFM personality traits when using SWB and FFM variables that were predicted using machine learning techniques. For the derived scores, all independent variables significantly predicted SWB. The machine learning-derived factor extraversion uniquely explained 0.18%, openness uniquely explained 9.96%, neuroticism uniquely explained 0.94%, conscientiousness uniquely explained 2.33%, and agreeableness uniquely explained 17.09% of the variance in SWB. See Appendix B for the covariance matrices produced from the two regression models (see Tables A2 and A3).
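The reported F statistics and adjusted R² values can be cross-checked against each other using the standard formulas for a regression with p predictors and n cases. The Python sketch below recovers R² from adjusted R² and then computes F; the function names are hypothetical. Because the adjusted R² values are reported to three decimals, the results should land close to, but not exactly on, the reported 992.6 and 3585.

```python
def r_squared_from_adjusted(adj_r2, n, p):
    """Invert adjusted R^2 = 1 - (1 - R^2)(n - 1) / (n - p - 1)."""
    return 1 - (1 - adj_r2) * (n - p - 1) / (n - 1)


def f_statistic(r2, n, p):
    """Overall F test for a regression: (R^2 / p) / ((1 - R^2) / (n - p - 1))."""
    return (r2 / p) / ((1 - r2) / (n - p - 1))


n, p = 13497, 5  # final sample size and number of FFM predictors from the text

f_observed = f_statistic(r_squared_from_adjusted(0.269, n, p), n, p)
f_ml = f_statistic(r_squared_from_adjusted(0.570, n, p), n, p)
```

The residual degrees of freedom, n - p - 1 = 13,491, match the value given in the reported F tests.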

Discussion
The current study aimed to replicate the relationship between the FFM of personality and SWB using simple machine learning techniques: singular value decomposition, k-folds validation, and multiple linear regression.
It was hypothesised that the machine learning model would accurately recreate the SWB and FFM variables. The results support this hypothesis to an extent. From the correlation analysis, the observed scores for extraversion (r = 0.21), neuroticism (r = 0.20), and openness (r = 0.31) were most accurately recreated in the machine learning model.
The second hypothesis postulated that the variables predicted through machine learning techniques would be capable of replicating the relationship between observed SWB and the FFM variables. Again, this hypothesis is partially supported. In both models, higher extraversion, agreeableness, and conscientiousness, and lower neuroticism, predicted a higher score for SWB; however, the openness to experience prediction reversed in direction, so that in the machine learning model an increase in openness to experience predicted a higher SWB. This may be attributed to the failure to recreate the variable in the first hypothesis. Based on the multiple regression analyses, openness to experience became positively correlated with SWB in the machine learning model, which could be attributed to reference group bias through the administration of the NEO-PI-R via Facebook [24,25]. A higher correlation was found between agreeableness and SWB, which is due either to the unknowing use of identical digital behaviours to predict the variables or to multicollinearity, which inflated the relationship. This suggests that using machine learning to recreate variables is likely to overestimate the relationship between variables.
Whilst the findings from this study pose additional evidence for the utility of using digital behaviour as data to produce prediction models, the accuracy of predictions for the currently investigated constructs is not high when relying solely on SVD and multiple linear regression. Kosinski, Wang, Lakkaraju and Leskovec [2] seem to inflate the accuracy of the linear regression model to predict psychological constructs; while they achieved a high accuracy when predicting gender, personality was predicted with a relatively lower accuracy. Additionally, Kosinski, Stillwell and Graepel [3] found prediction of personality constructs to range from r = 0.17 to r = 0.43, which are not necessarily high accuracies. The prediction for SWB (r = 0.17) was very low in comparison to what was found for age (r = 0.75) in their study. Further investigation within the psychological literature into more complex machine learning techniques that may increase the accuracy of predictions using social media data is required.

The Relationship between SWB and the FFM of Personality
The correlations between the SWB and FFM variables in both models (observed and machine learning-derived) partly replicated previous literature.A summary of correlations between the FFM traits and SWB from four studies in the literature is displayed in Table 5. Extraversion and neuroticism were replicated with a reasonable accuracy and mirrored the findings of Steel, Schmidt and Shultz [8] and Grant, Langan-Fox and Anglim [33].Openness to experience replicated the findings from Steel, Schmidt and Shultz [8] and Grant, Langan-Fox and Anglim [33] in the original model; however, the machine learning model did not accurately replicate the variable.To an extent, the observed variable of conscientiousness paralleled the findings from Anglim and Grant [34], though the machine learning variable almost doubled in its correlation size and did not represent any previous findings in the literature.Agreeableness in both models did not represent or mirror any previous research.None of the variables in either model replicated the findings from DeNeve and Cooper [9], which could be attributed to the meta-analysis' mean age of 53 years.The current study was predominantly young adult aged and therefore the different life stage may explain the discrepancy [54,55].Only conscientiousness in the original model slightly mirrored Anglim and Grant [34], which could be due to the different personality measure used, the 30-item Facet IPIP.The shorter scale may have exaggerated the scores for neuroticism and extraversion, as they are considerably higher in comparison to the other studies mentioned.
The correlations between the observed and machine learning predicted variables somewhat replicated the findings of Kosinski, Stillwell and Graepel [3].The highest correlation for the current study is for extraversion and the SWB correlation in the current study was almost the same as that found by the researchers (r = 0.17).As Kosinski, Stillwell and Graepel [3] did not specify the correlations for the other FFM variables (r = 0.17 to r = 0.30), conclusions regarding these variables are not complete.However, agreeableness, conscientiousness, and neuroticism were in the range stated by the researchers.These similarities are to be expected given that we have used the same initial dataset, but with different inclusion criteria.
Overall, for the original model with observed variable scores, high SWB was predicted by high extraversion, agreeableness, and conscientiousness, and low openness to experience and neuroticism. For the machine learning model, high SWB was predicted by high extraversion, openness to experience, conscientiousness, and agreeableness, and low neuroticism. Therefore, it could be concluded that high extraversion, conscientiousness, and agreeableness, and low neuroticism, are relatively consistent predictors of high SWB.
The greater prediction accuracy of the machine learning model linear regression compared to the observed data linear regression (Table 4) may be due to the genuine nature of the Facebook 'likes' used to train the machine learning algorithm, and thus their impact on the recreated variables. When the machine learning model was used to predict the variables, most predicted variables recreated the observed variables with relative accuracy, corresponding to a large effect size according to Gignac and Szodorai [53]. Using social media data to predict real-life outcomes presents an important opportunity in psychology to further measure how individuals can be perceived and how they behave in a natural online environment [24].
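The comparison of observed and predicted scores described above can be sketched in a few lines of Python. This is a minimal illustration rather than the analysis code used in the study: the scores are hypothetical, the function names are our own, and the cut-offs of 0.10, 0.20 and 0.30 for relatively small, typical and relatively large correlations are assumed from our reading of Gignac and Szodorai [53]. The "negligible" label for values below 0.10 is added here for completeness only.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def effect_size_label(r):
    """Label |r| using assumed cut-offs of 0.10, 0.20 and 0.30."""
    size = abs(r)
    if size >= 0.30:
        return "large"
    if size >= 0.20:
        return "typical"
    if size >= 0.10:
        return "small"
    return "negligible"  # label introduced for this sketch only


# Hypothetical scores for six participants (not data from the study).
observed = [3.2, 4.1, 2.8, 3.9, 4.5, 3.0]
predicted = [3.4, 3.8, 3.1, 3.6, 4.0, 3.3]

r = pearson_r(observed, predicted)
print(round(r, 2), effect_size_label(r))
```

In practice, the same two functions would be applied once per construct (each FFM trait and SWB) to the observed and machine learning-derived score columns.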

Implications, Limitations and Future Research
The basis of Facebook 'likes' is to record human behaviour through expressing positive opinions regarding online content. Technological advances have allowed big data to be extracted from social media websites, and this data can be manipulated and analysed to further understand human behaviour [32]. The amount of information that can be gathered through social media is significant and generates new areas and possibilities for future research. The current study had a large sample size of 21,112 participants (used to predict FFM traits and SWB for 13,497 participants aged 16-90) from 148 different countries, which demonstrates the advantage of large samples in providing high statistical power [2]. Despite previous authors praising the utility of social media to attract a less western population, the sample for this study was predominantly western, as most participants were from Australia, Canada, the United Kingdom, or the United States, limiting the generalisability of the results. Future research should investigate non-westernised countries and the prediction models based on their Facebook 'likes', as they may be considerably different from the western population.
Although using the Internet and Facebook information in psychological research reduces reference group bias, some bias may be evident. Though the Internet provides a medium to observe human behaviour, individuals can still put on a façade and "fake good". As of 2017, over two million applications exist (Apple and Android) to alter photos (similar to Adobe's 'Photoshop'), access social media sites, locate oneself on a map, order food delivery and transport, track health and exercise, and much more. Holland and Tiggemann [56] systematically reviewed 20 studies and concluded that social networking website use, body image, and disordered eating are related (regardless of gender). In particular, viewing and uploading photos, and attention-seeking 'status updates' that received negative responses, were damaging. Another study found, in a sample of adolescent females, that increased appearance exposure on Facebook, but not overall Facebook usage, was significantly correlated with weight dissatisfaction, thin ideal internalisation, and self-objectification [57]. Our study has avoided many of these issues by utilising Facebook 'likes', rather than status updates or profile pictures. As technology and social media continue to grow, individuals can present as a different person online, both physically and socially, which can impact their cognitive and mental functioning in detrimental ways.
The large increase in agreeableness in the machine learning model, though due to multicollinearity, should be interpreted cautiously. Agreeableness is characterised by positive social relationships, friendliness, compassion, and cooperativeness [3]. On the Internet, individuals can be whoever they want to be and may thus reinvent themselves as a highly agreeable individual. Other traits, such as conscientiousness, refer to an organised, reliable, and consistent individual who enjoys planning, seeking achievements, and pursuing long-term goals [3]. These characteristics may not be evident through Facebook 'likes', as social media websites often do not focus on goals and organisation, but on networking with friends and other individuals. Facebook 'likes' are a basic, discrete digital behaviour that works well with linear regression models. While using natural language processing may allow for a greater understanding of machine learning in social media, it may suffer from significant error due to the complexity of language in statistical analysis [26]. Due to the scope of this study, other aspects of Facebook behaviour were not analysed. Further research into the prediction of traits through machine learning could focus on other aspects of Facebook, such as 'status updates', friendship networks, and past events attended, as these online expressions may explain the variables more accurately.
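The way shared digital behaviours can inflate the relationship between two predicted constructs can be illustrated with a deliberately extreme toy simulation. The sketch below is an assumption-laden illustration, not a reproduction of the study: it uses a single simulated 'likes' dimension rather than 50 SVD dimensions, fits each regression in-sample rather than through k-folds validation, and all variable names and effect sizes are hypothetical.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def fit_line(x, y):
    """Least-squares intercept and slope for y regressed on a single x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


rng = random.Random(0)  # fixed seed so the simulation is repeatable
n = 2000

# One simulated 'likes' dimension shared by both constructs (hypothetical).
likes_dim = [rng.gauss(0, 1) for _ in range(n)]

# Each "observed" construct depends weakly on the shared dimension
# plus its own independent noise.
agreeableness = [0.3 * v + rng.gauss(0, 1) for v in likes_dim]
swb = [0.3 * v + rng.gauss(0, 1) for v in likes_dim]

# Predict both constructs from the same dimension.
a0, a1 = fit_line(likes_dim, agreeableness)
s0, s1 = fit_line(likes_dim, swb)
agree_hat = [a0 + a1 * v for v in likes_dim]
swb_hat = [s0 + s1 * v for v in likes_dim]

r_observed = pearson_r(agreeableness, swb)
r_predicted = pearson_r(agree_hat, swb_hat)
print(round(r_observed, 2), round(r_predicted, 2))
```

Because both predicted scores are linear functions of the same single dimension, they are perfectly collinear even though the simulated observed scores are only weakly related. With many dimensions the inflation need not be this severe, but the same mechanism applies whenever two constructs load on the same predictors.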
As this is a relatively new area of research, ethical considerations must be addressed. No clear guidelines for conduct in online human subjects research currently exist, and thus protocols related to designing online studies, data storage, and analysis of results are scarce, as well as contradictory [24,58]. Using the Internet and Facebook as a research tool poses new ethical dilemmas concerning consent, confidentiality, and competence. The researcher may not be in the same room as the participant, nor have met them, and thus the reliability and validity of results could be diminished. For the American population, the American Psychological Association lists three documents with guidelines governing how to conduct research utilising the Internet, with the most recent from 2003 [59,60]. However, this document is nearly fifteen years old, and with the increase in Web 2.0-type websites (i.e., non-static pages), it may not be relevant or sufficient for the current state of Internet-based research. Recommendations from the Association of Internet Researchers (AoIR) Ethics Working Committee state that although no concrete guidelines have been set for Internet-based research in America, policies and documents such as the UN Declaration of Human Rights, the Nuremberg Code, the Declaration of Helsinki, and the Belmont Report apply to all types of research [61]. The basics of these documents are to respect the dignity, autonomy, and rights of the human population and to avoid any possibility of harm. The Australian Psychological Society (APS) takes these fundamentals into account and addresses the key issues of technology, quality, control, and security when dealing with Internet-based research. In Australia, psychologists must abide by the International Guidelines on Computer-Based and Internet Delivered Testing [62] and also by the APS ethical guidelines [63]. Accordingly, the British Psychological Society has released Ethics Guidelines for Internet-mediated Research that mirror the APS ethical guidelines [64].
Only three western countries' ethical guidelines have been mentioned, as they make up most of the sample for the current study. However, further investigation should inspect the ethical guidelines for other non-westernised countries, as it may be possible to conduct Internet research on these populations. Confidentiality, consent, potential limitations, and security of data collected are perhaps more important in Internet-based research due to the potential for hacking and insecure storage of private information. The ethics behind Internet-based research have not been clear in America due to the dated governing documents. This poses a limitation, as the data collected by the 'myPersonality project' is American-based and consists of predominantly American participants. Future research could limit the sample to Australian participants, as the Australian ethical guidelines are comprehensive, though the ethics would still be questionable due to the American-based project overall.
In terms of implications for SWB and the FFM of personality, this study creates new avenues of measurement for these constructs, as well as an additional understanding of what they constitute in the online world. Future research could alter the methodology to include regression trees, neural networks, or other algorithms in order to further consider the utility of other machine learning algorithms in the computational social sciences. Greater understanding of the variety of techniques available to psychology researchers, as informed by our data science and computer science colleagues, can only enhance research within the field. With the collaboration of researchers from these fields, a greater body of knowledge could be built to evaluate digital behaviours that are related to certain traits, and individuality can be further investigated using digital data and machine learning techniques.

Conclusions
The current study found that, by using a machine learning model of Facebook 'likes', high SWB was predicted by high extraversion, openness to experience, conscientiousness, and agreeableness, and low neuroticism. The results further enhance the understanding of machine learning in behavioural sciences and how psychological constructs can be predicted through non-self-report methods of measurement. However, the issue of multicollinearity remains when attempting to predict relationships between psychological variables, given that the same digital behaviours are utilised for both independent and dependent variables. As technology use in psychological research continues to develop, it is important for researchers to consider how the way individuals portray themselves on social media influences how a machine learning algorithm may predict their SWB and personality. Through continual investigation into social media opinion expressions online and their relationship with individual constructs, researchers may be able to develop methods of targeting those at risk of low SWB. In doing so, other associated problems (e.g., depression, body image issues) may be able to be recognised and early intervention provided. Social media websites already use 'cookies' in the browser history to predict the most successful advertisements and promotions, so by using machine

Figure 1. A graphical representation of the research design.

Figure 2. Scree plot for SVD of the user-like matrix.

Figure 3. Accuracy of: (a) Neuroticism across k dimensions, (b) Extraversion across k dimensions, (c) Conscientiousness across k dimensions, (d) Openness to Experience across k dimensions, (e) Agreeableness across k dimensions, and (f) SWB across k dimensions. It is important to note that the accuracy plots utilised predicted data that was obtained from training the algorithm on 100% of the sample, rather than iteratively through 10 folds.

Table 1. Descriptive statistics for the observed FFM traits and SWB (n = 21,122).

Table 2. Descriptive statistics for the predicted FFM traits and SWB score (n = 21,122).

Table 3. Correlations between observed and machine learning-derived FFM traits and SWB.

Table 4. Multiple Linear Regression Model of observed FFM traits on SWB and of machine learning-derived FFM traits on the machine learning-derived SWB (n = 13,497) after matching predicted values with gender and age.

Table 5. Correlations from the literature: FFM Traits with SWB.