Modelling Social Attachment and Mental States from Facebook Activity with Machine Learning

Kridera, Stavroula; Kanavos, Andreas

doi:10.3390/info16090772

Open AccessArticle

Modelling Social Attachment and Mental States from Facebook Activity with Machine Learning

by

Stavroula Kridera

and

Andreas Kanavos

^*

Department of Informatics, Ionian University, 49100 Corfu, Greece

^*

Author to whom correspondence should be addressed.

Information 2025, 16(9), 772; https://doi.org/10.3390/info16090772

Submission received: 28 July 2025 / Revised: 29 August 2025 / Accepted: 3 September 2025 / Published: 5 September 2025

(This article belongs to the Special Issue Information Extraction and Language Discourse Processing)

Download

Browse Figures

Versions Notes

Abstract

Social networks generate vast amounts of data that can reveal patterns of human behaviour, social attachment, and mental states. This paper explores advanced machine learning techniques to detect and model such patterns, focusing on community structures, influential users, and information diffusion pathways. To address the scale, noise, and heterogeneity of social data, we leverage recent advances in graph theory, natural language processing, and anomaly detection. Our framework combines clustering for community detection, sentiment analysis for emotional state inference, and centrality metrics for influence estimation, while integrating multimodal data—including textual and visual content—for richer behavioural insights. Experimental results demonstrate that the proposed approach effectively extracts actionable knowledge, supporting mental well-being and strengthening digital social ties. Furthermore, we emphasise the role of privacy-preserving methods, such as federated learning, to ensure ethical analysis. These findings lay the groundwork for responsible and effective applications of machine learning in social network analysis.

Keywords:

online social networks; social attachment; mental state modelling; trust prediction; machine learning; graph-based analysis; sentiment analysis; privacy-preserving learning

1. Introduction

Online social networks have become integral components of contemporary digital society, facilitating communication, collaboration, and information exchange across diverse populations [1]. Platforms such as Facebook, Twitter, and Instagram generate vast volumes of interaction data, offering opportunities to study complex behaviours, group dynamics, and emergent societal phenomena [2]. Extracting meaningful insights from such heterogeneous and dynamic datasets requires computational approaches capable of modelling latent structures, behavioural trends, and relational dynamics [3].

Pattern detection in social networks is central to identifying communities, influence propagation, trust links, and information diffusion pathways [3]. These insights support applications such as recommendation systems, misinformation detection, and public health surveillance. However, traditional algorithms often struggle with the scale, noise, and multimodal nature of online data [4], and the temporal evolution of user behaviour demands adaptive and scalable solutions.

Recent advances in graph theory, machine learning, and natural language processing have opened new avenues for analysing large-scale social data [5,6]. Graph Neural Networks (GNNs), attention-based models, and temporal embeddings capture evolving topologies and semantic heterogeneity, while hybrid approaches such as trust propagation models [7], matrix factorisation [8], and neural frameworks [9] further illustrate the utility of machine learning in complex network settings.

Despite these advances, a key gap remains in modelling higher-order constructs such as social attachment and digital trust. While sociological and psychological studies underscore the role of trust in enabling cooperation and community stability [10,11], their insights have yet to be fully operationalised in computational frameworks. For example, Donath [12] highlighted the importance of online identity cues in shaping trustworthiness, but algorithmic implementations remain limited. Moreover, the relationship between affective signals—such as emotional tone, intimacy, and communication frequency—and bond strength has been underexplored. Prior studies also indicate that trust and privacy concerns are linked to risk perception and communication style [13], offering behavioural dimensions that remain underutilised computationally [14].

This paper addresses this gap by introducing a data-driven framework for modelling social attachment using behavioural features extracted from Facebook activity. Building on prior research in trust dynamics [15,16], we examine the predictive power of machine learning models in assessing interpersonal connection strength. Specifically, we investigate how temporal interactions, emotional sentiment, and public communication contribute to perceived closeness, with applications in digital mental health monitoring and social support detection. Our methods combine the expressive power of neural networks with traditional classifiers, comparing their effectiveness in predicting attachment strength from both structured and unstructured interaction data.

Despite the extensive body of research on trust prediction and community detection, existing studies have rarely incorporated higher-order relational constructs such as social attachment, intimacy, and emotional valence into large-scale computational frameworks. This omission limits the ability of prior models to capture the nuanced psychological and affective dimensions of online relationships. Our work directly addresses this gap by introducing attachment-oriented scoring functions that integrate temporal, emotional, and interactional cues, and by demonstrating their utility in both predictive modelling and behavioural segmentation.

The contributions of this work are threefold. First, we propose a dual attachment scoring mechanism that integrates temporal recency, emotional valence, and intimacy features, providing a psychologically grounded measure of tie strength. Second, we design an evaluation pipeline that combines supervised classification with unsupervised clustering, showing how attachment scores enhance predictive accuracy while uncovering latent behavioural segments. Third, we incorporate emotional and temporal signals as first-class features, bridging affective computing with social network analysis. Together, these innovations establish a robust framework for inferring mental state indicators from online behaviour, offering both theoretical insights and practical pathways for digital mental health applications.

The remainder of this paper is organised as follows. Section 2 reviews the existing literature on trust prediction, social attachment, and behavioural analysis. Section 3 introduces the methodological framework, integrating graph-based, machine learning, and deep learning approaches. Section 4 details the implementation environment, feature extraction processes, attachment scoring functions, and model configurations. Section 5 presents the experimental evaluation, followed by discussion. Finally, Section 6 summarises the contributions, highlights applications, and outlines future research directions.

2. Related Work

Research on trust prediction in online social networks has evolved significantly over the past two decades, driven by the proliferation of user-generated content and complex interaction patterns. Early efforts drew on sociological and psychological theories of interpersonal trust [10,11], which laid the foundation for computational approaches. As online platforms gained prominence, trust began to be quantified through behavioural proxies such as interaction frequency, reciprocity, and endorsement patterns [12,17]. These insights motivated diverse modelling strategies, including graph-based inference [18], probabilistic methods [19], and early machine learning classifiers [20,21].

Several studies leveraged machine learning to incorporate structured and unstructured data into trust prediction. For instance, content-based and review-driven features were employed in [22,23,24], while ref. [8,25] integrated temporal patterns and reputation scores. Neural methods have further expanded these capabilities, with ref. [26] combining Dempster–Shafer theory with neural networks and ref. [9] applying artificial neural networks for predictive trust inference. Attention-based mechanisms have been used in context-aware settings [27], enabling adaptation to dynamic behaviours and shifting interaction patterns.

Graph structures remain central to trust modelling, as they capture relational dependencies and propagation dynamics. Notable examples include TrustWalker [8] and CommTrust [28], which leverage both user–item and trust graphs. Other approaches such as that in [29] estimated trust via propagation and similarity measures, while ref. [16,30] refined these models with contextual cues and adaptive weighting. Social influence and homophily effects, examined in [31,32], have also been integrated into recommender systems and friend suggestion mechanisms.

Parallel to trust modelling, tie strength and attachment prediction have been studied through emotional and communicative features. For example, ref. [33,34] demonstrated how tie strength shapes user well-being and interaction intensity. Sentiment analysis, language use, and frequency of communication have been applied as proxies of closeness [35,36]. Privacy-aware frameworks [13,37] further stress ethical considerations in modelling sensitive interpersonal dynamics. Despite these efforts, the integration of affective signals and trust estimation remains limited.

More recent work combines statistical models with deep learning to address complex social behaviours. Hybrid frameworks in [9,38,39] incorporate context-awareness, temporal evolution, and multimodal signals. Studies such as ref. [4,22,40] demonstrate simultaneous mining of trust links and influence patterns, while ref. [41] investigates trust evolution over time. In parallel, diffusion models such as the Independent Cascade [42] and Linear Threshold [43] have been foundational in simulating influence propagation.

Beyond computational models, longitudinal studies emphasise that shifts in engagement reflect deeper psychological transitions [44]. The concept of multiplexity, highlighting overlapping relational contexts, was introduced in [45], further informing tie strength and trust diffusion. Several surveys, including ref. [1,16], have summarised advances in trust modelling while noting challenges such as sparsity, cold-start effects, and the complexity–interpretability trade-off. Deep learning architectures, including GNNs and transformers, have been proposed to address these limitations [46].

Recent work has increasingly explored multimodal approaches for social network analysis and mental health detection. One line of research has systematically reviewed multimodal sensing methods for mental health assessment, highlighting how integrating heterogeneous data sources improves detection accuracy [47]. Complementary studies have emphasised the role of multimodal information, such as combining text, behavioural traces, and physiological signals, in screening for depression and related disorders [48]. Beyond text and activity data, voice-based models have also been shown to provide valuable cues for depression recognition, particularly when pre-training strategies are applied [49]. In parallel, reviews of AI applications on social media have demonstrated the potential of machine learning for analysing mental health conditions at scale, while also stressing ethical and interpretive challenges [50].

Despite progress, many open challenges remain in modelling fine-grained constructs such as social attachment, where emotional, linguistic, and temporal cues must be jointly considered. Existing methods often capture either structural trust or affective signals in isolation. Our study contributes to bridging this gap by fusing trust prediction with attachment estimation on real-world Facebook data, using a diverse set of behavioural features. By building on prior work, we extend trust research into the domain of emotional closeness and social support detection, offering a machine learning framework for inferring mental state indicators in online networks.

3. Methodological Framework for Pattern Detection in Online Social Networks

The detection of meaningful patterns in online social networks presents a multifaceted challenge, requiring a systematic methodological approach that integrates data acquisition, representation, algorithm selection, and feature modelling. As digital interactions generate increasingly complex and large-scale datasets, the need for robust and scalable analytical frameworks has intensified. This section outlines the methodological foundation adopted in this study for detecting behavioural and structural patterns in social network data. Emphasis is placed on the integration of traditional graph-based analysis with modern machine learning and deep learning techniques, aiming to model phenomena such as trust propagation, attachment strength, and community structure. The proposed framework encompasses all stages of the computational pipeline—from ethical data collection and graph construction to feature extraction, algorithmic learning, and visual interpretation—thereby enabling a comprehensive and reproducible approach to social network analysis.

3.1. Problem Formulation and Research Objectives

Effective pattern detection in online social networks begins with precise problem formulation, which serves as the conceptual foundation for the entire analytical pipeline. In the context of this study, the central objective is to infer trust propagation and attachment strength as behavioural indicators of social closeness and mental state. Social networks provide rich, interconnected data that reveal insights into human behaviour and collective dynamics [51]. This stage involves identifying the specific behavioural or structural phenomena under investigation—such as community structures, influential individuals, anomalous activity, or trust dynamics—and aligning them with clearly defined research objectives [52].

A well-formulated problem statement ensures that the scope of analysis is both focused and actionable. For instance, in the context of community detection, researchers may aim to uncover densely connected subgroups that exhibit shared interests or frequent interactions [53]. Conversely, when modelling trust or influence propagation, emphasis shifts to directional edges, interaction frequency, and contextual features that affect social relationships. Similarly, anomaly detection formulations must account for irregular structural changes or deviations from behavioural norms in the network [54,55].

It is also essential to consider the broader characteristics of the data—including scale, density, heterogeneity, and temporal dynamics—which influence the choice of analytical methods. Social networks often contain noise, sparsity, and mixed data types, posing challenges for generalisation and model robustness [56]. Incorporating these contextual constraints at the problem formulation stage helps ensure methodological alignment and interpretability.

In this study, the overarching objective is to uncover latent behavioural patterns that reflect the strength of interpersonal connections and the propagation of trust. These patterns are inferred from both interaction metadata (e.g., messaging frequency, sentiment polarity, intimacy cues) and structural indicators (e.g., network centrality, community affiliation). The methodological framework is therefore tailored to support both supervised and unsupervised learning tasks, with a strong emphasis on ethical data handling, model robustness, and interdisciplinary interpretability.

3.2. Data Collection

Pattern detection relies on access to high-quality, representative social interaction data. In our case, data collection is oriented toward features relevant to trust and attachment inference, such as communication frequency, intimacy cues, and emotional sentiment. Social network data can be sourced from various platforms, including Facebook, Twitter, Reddit, and LinkedIn, each offering different types of interaction records such as likes, comments, shares, private messages, and profile metadata [57,58].

Data can be obtained using several methods. One common approach is through Application Programming Interfaces (APIs) provided by social media platforms. APIs enable researchers to query structured data directly from user profiles, post interactions, and network connections. For instance, Facebook’s Graph API and Twitter’s REST API facilitate controlled access to real-time and historical data, making them ideal for longitudinal studies of social behaviour [59].

In cases where APIs are restricted or insufficient, web scraping is often employed to gather publicly available information from forums, blogs, and other web platforms. Tools such as BeautifulSoup, Scrapy, and Selenium allow for the automated extraction of HTML-based content, which can then be parsed and structured into datasets for analysis. However, scraping introduces additional challenges, such as handling site-specific markup variations, rate limits, and ethical considerations.

An alternative to real-time extraction is the use of publicly available benchmark datasets, such as those hosted on SNAP, Kaggle, or academic repositories. These datasets often contain anonymised interaction logs, preprocessed network graphs, or behavioural metadata curated for research use. Leveraging such repositories can expedite experimental development, provide standard baselines for comparison, and reduce barriers to entry in trust prediction and pattern mining research.

Importantly, all data collection activities must adhere to ethical research standards, including the protection of user privacy and compliance with data governance policies such as the General Data Protection Regulation (GDPR). User consent, anonymisation procedures, and transparency about data use are essential components of responsible research practices.

3.3. General Preprocessing Considerations

Raw social network data are rarely suitable for direct analysis. In our study, preprocessing is guided by the need to preserve behavioural signals relevant to trust, attachment, and emotional inference while ensuring data quality and model reliability. Social media data are often noisy, incomplete, and inconsistent due to user variability, API limitations, and platform-specific formatting [60]. Without careful preprocessing, machine learning and network analysis algorithms may produce misleading or biased results.

A primary task in this phase is data cleaning, which involves removing duplicates, filtering out spam or bot-generated content, and addressing erroneous or irrelevant entries. Another key step is handling missing values—attributes such as sentiment scores, message timestamps, or metadata may be missing due to privacy settings or API throttling. Depending on analysis goals, missing data can be imputed or excluded.

Normalisation and standardisation are essential for ensuring that numerical features such as message frequency or sentiment scores contribute equally during model training. Inconsistent formatting—such as variations in date formats or emoji encoding—must also be resolved. Finally, filtering based on minimum interaction thresholds helps focus the analysis on meaningful ties rather than incidental contacts.

3.4. Network Representation

Network representation transforms raw interaction data into structured graph form, enabling the modelling of trust propagation and attachment signals. Nodes represent users, while edges denote interactions such as messages, likes, or comments [15,61]. Edge weights and attributes—including sentiment, intimacy, or timestamps—add behavioural depth to the graph.

More advanced models use heterogeneous or temporal graphs to capture evolving dynamics and multi-relational features, which are particularly relevant in real-time platforms. Figure 1 illustrates trust-based communities within a network, where colours denote clusters and node size reflects centrality, highlighting how trust patterns emerge and propagate.

3.5. Algorithmic Strategies for Pattern Detection

Once the social network is represented as a graph, the selection of appropriate algorithms becomes a critical step in uncovering meaningful patterns. These algorithmic strategies must align with the specific analytical goals—such as detecting communities, identifying influential users, or uncovering anomalies—as well as with the properties of the underlying data, including network density, node attributes, and temporal dynamics [28,62].

Pattern detection tasks in online social networks are commonly approached using both supervised and unsupervised learning techniques. Supervised models—such as logistic regression, support vector machines, and ensemble classifiers—are well suited to tasks like trust prediction or attachment classification where labelled data are available [8,9]. In contrast, unsupervised methods—including clustering, dimensionality reduction, and graph partitioning—are effective for exploratory analysis when prior knowledge is limited or when latent structures must be inferred directly from raw network topologies [63,64].

Community detection remains one of the most widely studied problems in network science. Algorithms such as the Louvain Method, Label Propagation, and DBSCAN reveal cohesive subgroups by exploiting modularity, density, or iterative label convergence [65,66]. These techniques are particularly valuable for identifying latent social structures, affiliation groups, or clusters of users that exhibit similar interaction behaviours. In our framework, community detection complements attachment estimation by highlighting user clusters with shared intimacy cues or communication frequency.

Identifying influential or central users is another crucial aspect of trust and attachment modelling. Centrality-based measures—including degree, betweenness, and eigenvector centrality—provide baseline indicators of influence and information diffusion. These measures can be extended with learning-based approaches that incorporate node attributes, edge weights, and temporal frequency of interactions, thereby offering more nuanced estimates of user impact on trust propagation and social bonding [15,41].

In dynamic environments, temporal pattern detection algorithms are essential for capturing evolving behaviours. Approaches such as time-aware clustering, sliding-window analysis, and recurrent models (e.g., LSTMs) allow researchers to track behavioural drift, sentiment shifts, and the rise in or decay of trust and attachment over time [34,67]. Such methods are particularly important for monitoring online signals of emotional well-being or relationship closeness, which can fluctuate rapidly in response to social or contextual changes.

In this context, we use the term temporal pattern detection algorithms to refer to computational methods that explicitly capture sequential or time-dependent changes in user behaviour and network structure. Examples include sliding-window analysis, time-aware clustering, and recurrent neural networks such as LSTMs. By focusing on temporal dependencies, these algorithms ensure that shifts in engagement, emotional tone, or relational closeness are incorporated into the analysis rather than treated as static phenomena.

Overall, the choice of algorithmic strategy directly influences the interpretability and robustness of results. By combining statistical methods, graph-based techniques, and machine learning models, it becomes possible to detect patterns that are not only accurate but also actionable for inferring social attachment and mental state indicators in online networks.

3.5.1. Classification Algorithms

Classification algorithms are central to supervised pattern detection tasks, particularly when the objective is to predict categorical outcomes such as levels of trust, social attachment, or user engagement. These models require labelled datasets and are applied when behavioural indicators and relationship labels are available from prior annotations or self-reported data. In our framework, classification serves as the primary means of linking observable interactional features (e.g., intimacy, sentiment, recency) with attachment strength categories:

Logistic Regression: A linear baseline model, useful for interpreting the contribution of individual behavioural features to trust and attachment outcomes.
K-Nearest Neighbor (KNN): Effective for smaller or sparse networks, where relational similarity among neighbouring users informs prediction.
Random Forest and Extra Trees: Ensemble-based classifiers that enhance robustness and provide feature importance scores, highlighting which behavioural features most influence attachment strength.
Support Vector Machine (SVM): Powerful for distinguishing nuanced interpersonal dynamics through non-linear decision boundaries.
Naive Bayes: A probabilistic model well suited for high-dimensional text interactions such as messages and comments in trust inference.
AdaBoost and Gradient Boosting: Boosting algorithms that iteratively refine predictions, useful for ambiguous trust levels or borderline attachment cases.
LightGBM and XGBoost: Efficient Gradient Boosting methods capable of handling large-scale social datasets with minimal preprocessing.
CatBoost: Optimised for categorical interaction data, eliminating the need for manual encoding of message- or user-based features.
Neural Networks and LSTM: Capable of capturing complex behavioural dynamics and temporal changes in trust and attachment across evolving interactions.
BERT: Transformer-based language model for extracting sentiment, emotional tone, and psychological cues from user-generated content [67].

These classifiers collectively provide a balance of interpretability, scalability, and predictive performance, making them suitable for both explanatory and predictive modelling of social attachment and trust in online networks.

3.5.2. Clustering Algorithms

Clustering methods are employed in unsupervised settings where patterns must be inferred without predefined labels. In this study, clustering complements classification by uncovering latent communities, behavioural groupings, and tie-strength clusters that arise naturally from social interaction data. This bottom-up approach provides additional insight into attachment and trust structures beyond what labelled datasets reveal:

K-Means: Partitions users into k clusters based on behavioural or structural similarity; efficient but sensitive to centroid initialisation [62].
Hierarchical Clustering: Builds a tree of nested user groupings, offering flexible resolution levels for community structure analysis [64].
Louvain Method: A modularity-maximisation approach that efficiently uncovers dense, self-contained communities [63].
DBSCAN: Detects arbitrarily shaped clusters and isolates noise, without requiring the number of clusters in advance [66].
Gaussian Mixture Model (GMM): Probabilistically assigns nodes to clusters, allowing overlaps and ambiguity in group memberships [68].
Newman–Girvan: Identifies community structure by iteratively removing high-betweenness edges that act as structural bridges [69].
Label Propagation Algorithm (LPA): Assigns community labels through iterative majority voting, scalable to large graphs but sensitive to initialisation [65].
OPTICS: Extends DBSCAN by revealing multi-density clustering structure, making it suitable for heterogeneous networks [70].

Clustering thus provides a lens for exploring emergent social structures, interaction intensity, and cohesion levels, enabling the detection of tightly bonded groups and peripheral users whose weaker attachments may be associated with vulnerability or disengagement.

3.6. Preprocessing Pipeline for Social Interaction Data

Preparing social network data for machine learning and pattern detection requires a carefully designed preprocessing pipeline. In this study, the pipeline was tailored to preserve the behavioural signals most relevant to trust and attachment while ensuring consistency, comparability, and reproducibility across experiments [60]:

Missing Value Handling: Records with incomplete values in key behavioural or emotional features were excluded to minimise bias and preserve feature distribution integrity.
Feature Extraction: From raw Facebook interaction logs, we derived a composite feature set for each user pair, including
- Total number of private messages exchanged;
- Number of wall posts, comments, and likes;
- Frequency of intimacy-related keywords;
- Average sentiment polarity (via the VADER tool);
- Time elapsed since the last interaction.
Sentiment Scoring: Textual interactions were processed with the VADER sentiment tool [71], producing polarity scores from $- 1$ (negative) to $+ 1$ (positive). These were averaged per dyad to quantify emotional tone.
Standardisation: All numerical features were normalised using z-scores:

$z = \frac{x - μ}{σ}$

(1)

where x is the raw feature value, $μ$ the mean, and $σ$ the standard deviation.
Label Encoding: For supervised experiments, categorical labels were generated from ground-truth indicators of attachment or mental health scores, serving as training targets.
Dimensionality Reduction (for visualisation): PCA was applied to high-dimensional features in clustering tasks, reducing them to two components for interpretable visual exploration.

This pipeline ensures robust and replicable analysis while retaining the psychological and behavioural constructs embedded in interaction data, forming the empirical basis for trust and attachment modelling.

3.7. Ethics and Privacy Considerations

Inferring social attachment and mental states from Facebook activity raises critical ethical concerns regarding privacy, data protection, and responsible AI use. To safeguard participants, all data were fully anonymised prior to analysis, with personally identifiable information (PII) removed to comply with regulations such as the GDPR. Only aggregated, de-identified behavioural features—such as message counts, sentiment scores, and timestamps—were processed in our models.

To further strengthen privacy, future implementations could adopt privacy-preserving techniques such as federated learning (training models locally on user devices without centralising data) and differential privacy (introducing controlled noise into aggregated outputs to prevent re-identification). These strategies would enable scalable deployment while maintaining confidentiality.

From an ethical standpoint, the system’s outputs must be treated as indicative rather than diagnostic. Although the models highlight behavioural patterns associated with attachment and emotional well-being, they are not substitutes for clinical assessment. Instead, they can provide early warning signals to support professionals, moderators, or digital health services in detecting disengagement or distress. Transparency about system limitations, informed user consent, and strict adherence to responsible governance principles are essential to ensure that this research benefits society without compromising ethical integrity.

Anonymisation Procedure

All personal identifiers were permanently removed, and user IDs were replaced with one-way hash codes. Temporal features were converted into relative measures (e.g., days since last interaction), and only aggregated behavioural indicators (counts, frequencies, sentiment scores) were retained. These steps ensure that anonymisation is irreversible and that re-identification is not possible.

4. Implementation and Experimental Setup

This section presents the complete technical framework designed to model social attachment, mental state indicators, and trust dynamics from Facebook user activity. The system integrates natural language processing (NLP), machine learning (ML), graph-based analysis, and time-series modelling to derive insights from users’ behavioural, emotional, and relational patterns. The pipeline comprises five core stages: data preprocessing, behavioural feature extraction, scoring function computation, model-based classification and clustering, and evaluation through multiple metrics. Emphasis was placed on interpretability, robustness, and reproducibility.

4.1. Toolkits and Computational Environment

Implementation was carried out using Python 3.9. Core machine learning algorithms were developed with the scikit-learnlibrary [72], while advanced ensemble methods (XGBoost, LightGBM, CatBoost) were accessed via their dedicated APIs. Deep learning models—including feedforward neural networks and Long Short-Term Memory (LSTM) architectures—were implemented in TensorFlow and Keras. The BERT-based classifier was fine-tuned using HuggingFace’s Transformers library.

Natural language processing (NLP) was central to analysing textual Facebook content. Sentiment analysis was performed with the VADER tool [71], while emotion and psychological signal detection leveraged pre-trained BERT models fine-tuned for classification. In addition to sentiment, the framework supports topic modelling and emotion detection, enabling inference of mental state indicators such as anxiety, stress, or depression [73]. These capabilities enrich behavioural profiling by uncovering latent psychological traits embedded in natural language.

For social network analysis, graph algorithms (e.g., Louvain, centrality measures, PageRank) were implemented with NetworkX and iGraph, while visualisation relied on matplotlib and seaborn. Experiments were executed on a GPU-enabled workstation (AMD Ryzen 9 CPU, 64 GB RAM, NVIDIA RTX 3090 GPU with 24 GB VRAM).

To capture temporal trends in user activity (e.g., message frequency shifts, interaction timing), time-series models such as ARIMA and LSTM were employed. Visual content was also incorporated through CNNs and Vision Transformers (ViTs), which extracted mood-related visual cues from shared images or profile photos [38]. These multimodal extensions provided a more comprehensive picture of user states by integrating textual, temporal, and visual signals.

4.2. Dataset Description

For the experimental evaluation, we used a dataset of anonymised Facebook user interactions collected over a period of twelve months. The dataset contains records from approximately 2500 active users, resulting in more than 450,000 interaction events. These include private messages, wall posts, comments, and likes, which were processed into behavioural and emotional features relevant to trust and attachment inference.

Collection duration: Data were gathered continuously between January 2024 and December 2024, providing sufficient temporal coverage to analyse both short-term and long-term dynamics in user interactions.

Demographic representativeness: The participants primarily represent young to middle-aged adults (ages 18–45), with a balanced gender distribution (51% female, 49% male). Geographically, the majority of users were located in Europe, reflecting the availability and accessibility of the platform within these regions. All demographic data were anonymised and reported only in aggregate form.

Dataset size: The final preprocessed dataset included 2500 user nodes and approximately 14,000 dyadic ties. Each tie was enriched with behavioural features such as frequency of communication, intimacy-related lexical markers, sentiment polarity, and temporal recency of interactions.

Public vs. private: All data were anonymised and de-identified prior to analysis, in compliance with ethical standards and the GDPR. Only behavioural features (counts, frequencies, sentiment scores) were retained, while raw message content and personally identifiable information were excluded. The dataset itself is not publicly released due to privacy constraints, but a small anonymised sample (with placeholder identifiers such as TEMP1, TEMP2) together with the preprocessing pipeline and feature extraction scripts is openly available at https://github.com/stavroulakridera/Modeling-Social-Attachment-and-Mental-States-from-Facebook-Activity-with-Machine-Learning (GitHub repository, providing anonymised sample data, preprocessing pipeline, and documentation for experimental reproducibility, accessed on 4 September 2025).

4.3. Behavioural and Emotional Feature Extraction

User interactions (messages, comments, wall posts) were mined to derive a feature set representing both behavioural intensity and emotional tone. Features were computed at the dyadic level, capturing the strength and nature of interpersonal ties:

Message Count: Total number of exchanged messages, reflecting interaction frequency.
Wall Posts and Comments: Visibility-based interaction measures.
Sentiment Polarity: Average sentiment score of messages, computed with VADER [71]. VADER was chosen because it is tailored to social media language, handling slang and emojis effectively, while offering efficiency as a lightweight baseline.
Words Expressing Intimacy: Count of lexical markers indicating closeness.
Emojis and Punctuation: Presence of affective cues such as emojis or exclamation marks.
Days Since Last Communication: Temporal recency, modelling decay in relational strength.

In addition to text-based features, visual data shared by users were analysed. Image-derived features (e.g., brightness, facial expression, scene context) were extracted with CNN-based models and incorporated into the behavioural profiles. This multimodal integration strengthens the inference of user sentiment and emotional states by combining linguistic and visual cues.

4.4. Attachment Strength Scoring Functions and Comparative Evaluation

A central objective of this study was to quantitatively assess the strength of interpersonal bonds—referred to as attachment strength—based on user interactions on Facebook. Two distinct scoring functions were implemented and evaluated to capture this construct.

In this study, we operationalise attachment strength through what we call a dual attachment scoring mechanism. This term denotes the use of two complementary mathematical formulations—one normalised and one weighted—that combine temporal recency, emotional valence, and intimacy-related interaction features into a single score. The dual design allows us to compare a bounded, probabilistic interpretation of tie strength with a more expressive linear formulation that can take negative values, thereby offering a richer and more discriminative perspective on social bonds.

The first formulation derives from the “team2” study [74], which proposed empirical-based weights for several features indicative of social ties. This model prioritises recent communication, emotional content, and public interaction frequency:

\begin{matrix} Attachment {Strength}_{original} = & - 0.76 \times (Days \sin ce last communication) + \\ + 0.111 \times (Words expressing intimacy) + \\ + 0.135 \times (Degree of positive emotions) + \\ + 0.299 \times (Wall posts) + \\ + 0.299 \times (Messages) + \\ + 0.299 \times (Comments) \end{matrix}

(2)

This equation reflects the psychological assumption that attachment degrades with inactivity but is reinforced by emotionally positive, intimate, and frequent interactions. The symmetric weights (0.299) for wall posts, messages, and comments underscore their combined importance in maintaining visible engagement.

The second formulation is a refined model developed and tested in this study. Based on a regression analysis of a more diverse behavioural dataset, this version increases the penalisation for communication gaps and adjusts feature weights to reflect their updated predictive importance:

\begin{matrix} Attachment {Strength}_{revised} = & - 0.85 \times (Days \sin ce last communication) + \\ + 0.25 \times (Words expressing intimacy) + \\ + 0.25 \times (Degree of positive emotions) + \\ + 0.27 \times (Wall posts) + \\ + 0.27 \times (Messages) + \\ + 0.27 \times (Comments) \end{matrix}

(3)

Compared to Equation (2), the revised formula increases the penalty for communication gaps (from

- 0.76

to

- 0.85

), unifies the coefficients of intimacy and emotional valence at 0.25, and harmonises the weights of interaction types (messages, wall posts, comments) at 0.27. This refinement emphasises that recent, emotionally rich, and frequent interactions are the most reliable indicators of strong attachment.

Both formulations were tested in downstream classification and clustering tasks. The following key differences were observed:

Range and Sensitivity: The revised Equation (3) produced a wider score distribution and stronger contrast between high- and low-attachment users. This increased sensitivity improved discrimination during classification.
Classification Accuracy: Models trained on revised scores consistently outperformed those using the original formulation. For example, XGBoost and BERT achieved F1-score gains of 1.5–2.3% across three mental state classification tasks.
Correlation with Ground-Truth Labels: Revised scores correlated more strongly with human-annotated tie strength (Pearson $r = 0.71$ ) than the original version ( $r = 0.62$ ).
Clustering Cohesion: Clustering with revised scores yielded improved silhouette scores (e.g., K-Means: 0.184 revised vs. 0.153 original), indicating greater intra-cluster consistency.

Based on these results, the revised model was selected as the primary formulation for subsequent experiments, while the original was retained for methodological transparency.

4.5. Classification Models and Evaluation Metrics

To predict users’ attachment levels and mental state indicators, we evaluated fifteen classifiers spanning traditional, ensemble-based, and neural architectures:

Traditional Models: Logistic Regression, Decision Tree, KNN, SVM, Naive Bayes.
Ensemble Methods: Random Forest, Extra Trees, AdaBoost, Gradient Boosting, XGBoost, LightGBM, CatBoost.
Neural Models: Feedforward ANN, LSTM.
Transformer Model: BERT-based classifier fine-tuned on labelled text for emotional and psychological cues, extended in a multimodal setup with CNN-based image encoders.

All models were trained on behavioural features using an 80/20 train–test split with stratified 5-fold cross-validation. Hyperparameter tuning was conducted via grid search. Evaluation was based on accuracy, precision, recall, F1-score, AUC-ROC, and PR-AUC.

Reinforcement learning approaches were also considered for future extension, where user feedback to recommended mental health resources could be used to optimise adaptive interventions.

4.5.1. Neural and Transformer Model Configurations

Artificial Neural Network (ANN)

The ANN was implemented as a feedforward network with two hidden layers (128 and 64 neurons). Hidden layers used ReLU activation, while the output layer used a sigmoid activation. Training employed the Adam optimizer (lr = 0.001), a batch size of 32, and 50 epochs, with binary cross-entropy loss and early stopping.

LSTM

The LSTM comprised a single recurrent layer with 64 units followed by a dense layer with sigmoid activation. Dropout rate was set to 0.2. The model was trained with Adam (lr = 0.001), a batch size of 32, and binary cross-entropy loss, for 50 epochs.

BERT

BERT fine-tuned bert-base-uncased with a maximum sequence length of 128. The classification head was a dense layer with softmax activation. Training used AdamW (lr =

2 \times 10^{- 5}

) and a batch size of 16, for 3 epochs.

Summary of Configurations

Table 1 consolidates the hyperparameters for ANN, LSTM, BERT, and representative ensemble methods.

To prevent overfitting, all neural models (ANN, LSTM, BERT) were trained with early stopping and dropout regularisation, and their performance was validated through stratified cross-validation. Ensemble methods (e.g., Random Forest, XGBoost, LightGBM) inherently mitigate variance, while traditional models (e.g., Logistic Regression, SVM) include built-in regularisation. During training, learning curves were monitored to ensure stable convergence, and no evidence of overfitting was observed.

4.5.2. Evaluation Metrics

To assess model performance, we employed the following standard classifica- tion metrics:

Accuracy: Proportion of correctly predicted attachment levels out of all predictions.
Precision: Ratio of true positives to predicted positives—relevant for detecting high-trust relationships.
Recall: Ratio of true positives to actual positives—measures model sensitivity in identifying strong ties.
F1-Score: Harmonic mean of precision and recall—useful when class imbalance is present.
AUC-ROC: Area under the Receiver Operating Characteristic curve—captures the trade-off between true positive rate and false positive rate across thresholds.
PR-AUC: Area under the precision–recall curve—especially informative for imbalanced datasets where positive instances are rare.

These metrics provide a comprehensive view of each model’s ability to detect, differentiate, and generalise across various attachment levels.

4.6. Unsupervised Clustering of User Behaviour and Evaluation Metrics

Unsupervised clustering was used to explore latent user groupings without labels, providing complementary insights into attachment-based patterns. The algorithms included the following:

K-Means: Partitioned into $k = 3$ clusters based on the elbow method.
Agglomerative Clustering: Hierarchical clustering with Ward linkage.
Gaussian Mixture Model (GMM): Probabilistic soft clustering with overlapping memberships.
DBSCAN: Density-based detection of clusters and noise.

Evaluation Metrics

The performance of each clustering algorithm was assessed using the following metrics:

Silhouette Score: Measures the cohesion and separation of clusters. A higher score indicates better-defined clusters.
Normalised Mutual Information (NMI): Evaluates the agreement between the clustering labels and known attachment categories.

4.7. Graph-Based Algorithms and Social Network Analysis

Given Facebook’s networked nature, graph-theoretic methods were employed to capture structural influences on attachment and trust. Users were represented as nodes and interactions as directed, weighted edges. Centrality measures (betweenness, closeness, eigenvector) identified influential actors [42,75], while Louvain detected cohesive communities [76]. These methods highlighted trust propagation pathways and attachment clusters.

5. Experimental Evaluation

This section presents a comprehensive evaluation of the proposed framework for modelling social attachment and inferring mental state indicators from Facebook activity. The evaluation consists of three components: (i) computation of attachment strength between users based on behavioural and emotional metrics, (ii) classification of users into mental state categories using a wide range of machine learning algorithms, and (iii) unsupervised clustering to reveal latent behavioural groupings. A final discussion interprets the findings in relation to social and psychological theory, compares algorithmic approaches, and reflects on the broader implications. The aim is to provide both quantitative results and qualitative insights into how social media activity reflects mental health patterns.

To ensure that the evaluation of attachment-based features was not biased toward a single model family, we systematically benchmarked a diverse set of machine learning algorithms. This included traditional statistical learners (e.g., Logistic Regression, Naïve Bayes), tree-based ensembles (e.g., Random Forest, XGBoost, LightGBM), and modern neural architectures (e.g., LSTM, BERT). The motivation for this breadth is twofold: (i) to assess the generalizability of the proposed attachment scoring functions across fundamentally different algorithmic paradigms, and (ii) to identify which model families are most sensitive to temporal and emotional cues. Such comprehensive evaluation highlights the robustness of our formulation while providing a fair comparison between interpretable models and deep learning methods.

5.1. Attachment Strength Calculation

Attachment strength was operationalised through two alternative formulations that combine behavioural, emotional, and temporal features. The first approach adopts a normalised scoring function, which rescales interaction variables into the interval

[0, 1]

and aggregates them into a bounded index of relational closeness. This formulation provides a compact representation of tie strength but may limit sensitivity to extreme behaviours.

Table 2 reports the attachment values computed using this normalised model. All scores are constrained between 0 and 1, reflecting relative proportions of engagement across the available interaction features.

The second approach employs a weighted linear combination, in which features retain their raw scale and are assigned empirically derived coefficients. Unlike the normalised variant, this formulation yields unbounded scores—including negative values—and applies stronger penalisation for prolonged inactivity. The intent is to capture a broader dynamic range of tie strength, particularly for distinguishing weak or dormant relationships.

Table 3 presents the attachment values calculated using this weighted function, illustrating the expanded score distribution and increased sensitivity to recency and emotional cues.

The normalised scores in Table 2 indicate that even with minimal interaction, attachment strength remains above a relatively high baseline. For example, ID10—with no recent messages, wall posts, or comments—still received a score of 0.17. Similarly, ID5, despite having no recent communication but a high number of historical comments, achieved a score of 0.66. These results suggest that the normalised formula tends to assign generous baseline values whenever any form of interaction is present, even if outdated or one-dimensional. This inflates weaker ties and reduces the model’s ability to clearly distinguish passive from active relationships.

In contrast, the weighted formulation in Table 3 produces a much wider range of values, from

- 692.509

(ID19) to

- 37.477

(ID11). For instance, ID19 and ID15 both exhibited over 900 days of inactivity, yet differences in intimacy words and emotional content led to variations in their scores by about 15 units. ID18, with only 189 days since last communication and strong intimacy/emotion signals, obtained the least negative score (

- 48.275

). These cases illustrate the weighted model’s heightened sensitivity to both recency and emotional richness.

This numerical contrast is substantial: the normalised formula compresses users into a narrow band of

[0.179, 1.000]

(range ∼0.82), while the weighted formulation spans a much broader interval of

[- 692.509, - 37.477]

(range ∼655). Such a dynamic range provides finer resolution in differentiating tie strengths, which in turn enhances the effectiveness of downstream tasks such as classification and clustering. Models can better exploit the distributional richness of the weighted scores, achieving clearer separation between strong and weak ties.

Overall, the experiments suggest that while both scoring approaches align with observable user behaviour, the linear-weighted function more effectively captures tie strength variation. Its ability to penalise inactivity and amplify emotional signals makes it particularly suitable for modelling subtle psychological traits and social attachment patterns in Facebook interactions.

5.2. Classification Performance Analysis

To evaluate the predictive power of attachment strength as a feature for modelling users’ mental state indicators, we tested a broad set of machine learning models under two scoring schemes: the normalised and the weighted formulations described earlier. The task involved predicting mental well-being labels derived from behavioural and emotional Facebook activity. Models were evaluated with stratified 10-fold cross-validation using six metrics: accuracy, precision, recall, F1-score, AUC-ROC, and PR-AUC. The inclusion of AUC-ROC and PR-AUC is particularly important in the presence of class imbalance, as they provide more robust measures of discriminative ability and precision–recall trade-offs.

The class labels for mental state categories were approximately balanced (positive 54%, at-risk 46%), and stratified cross-validation preserved this distribution. Given the minor imbalance, no resampling techniques (e.g., SMOTE) were required, although class weighting was applied where relevant.

Table 4 reports the results obtained with normalised attachment scores, which constrain outputs to the interval

[0, 1]

and represent a bounded probabilistic measure of tie strength.

Table 5 shows the results with the weighted formulation, which yields real-valued scores (including negatives) and places stronger emphasis on recency, intimacy, and emotional tone.

Across both experiments, BERT consistently delivered the strongest results, achieving an accuracy of 0.95 with normalised attachment scores and 0.96 with weighted scores, alongside an AUC-ROC of up to 0.98 and PR-AUC of 0.97. Its ability to capture contextual and semantic nuances in user-generated content likely explains this performance edge. Ensemble-based learners such as XGBoost and LightGBM also performed robustly, with both models exceeding 0.93 in accuracy and F1-score, while reaching PR-AUC values above 0.95 under the weighted formulation. These findings suggest that advanced models which exploit non-linear feature interactions and contextual embeddings are particularly effective in leveraging attachment-based signals.

The introduction of the weighted scoring function contributed positively to classification discriminability. Compared with the normalised formulation, the weighted version improved recall, F1-score, AUC-ROC, and PR-AUC across most algorithms, reflecting its enhanced ability to separate weak and strong ties through finer granularity. By penalising prolonged inactivity more strongly and amplifying emotionally significant interactions, the weighted formulation provided a richer feature representation that facilitated better generalisation across classifiers. The improved PR-AUC in particular demonstrates its strength in imbalanced settings, where correctly identifying minority cases (e.g., vulnerable users) is essential.

Traditional models such as Logistic Regression and Naive Bayes demonstrated limitations despite occasionally achieving high precision. Their recall values and PR-AUC scores remained low, indicating frequent failure to identify weaker attachment categories. While their performance improved modestly with weighted scores, the gains were not comparable to those observed in more sophisticated learners. In contrast, models sensitive to interaction effects—such as Gradient Boosting and LSTM—benefited noticeably from the revised scoring dynamics, showing improved balance across all six metrics. These improvements confirm that the revised formulation provides features that align better with the inductive biases of non-linear learners.

Taken together, these results confirm that the design of the attachment strength function plays a critical role in downstream prediction tasks. By integrating recency, intimacy, and emotional tone into a weighted formulation, the revised model aligns more closely with ground-truth user states and enhances predictive accuracy. Moreover, the inclusion of AUC-ROC and PR-AUC reveals the robustness of these improvements, particularly under class imbalance. This demonstrates the value of psychologically informed, data-driven indicators when modelling affective states and social attachment from online behaviour.

To confirm that performance improvements were statistically reliable, we conducted paired t-tests across cross-validation folds, comparing the top-performing models (BERT, XGBoost, LightGBM) against a baseline classifier (Logistic Regression). Table 6 reports the p-values, showing that the advanced models achieved significantly higher F1-scores at the

p < 0.05

level.

Feature Relevance Analysis

To provide a data-driven justification for the selected behavioural and emotional characteristics, we conducted a feature importance study using Random Forest and Gradient Boosting models, complemented by Pearson correlation analysis with attachment strength scores. Table 7 presents the relative contributions of each feature, averaged across the two ensemble methods. The analysis confirms that intimacy-related words and sentiment polarity are the strongest predictors, followed by recency of interaction and message frequency. Wall posts and comments contributed moderately, while emojis and punctuation showed negligible importance and were excluded from the final feature set. These findings validate that the chosen features not only align with theoretical constructs of social attachment but also maximise predictive utility.

5.3. Unsupervised Clustering of Behavioural Patterns

To uncover latent behavioural groupings in the user population, we applied unsupervised clustering to the Facebook activity dataset. Four widely used algorithms were evaluated—K-Means, Gaussian Mixture Model (GMM), Agglomerative Clustering, and DBSCAN—over normalised behavioural features (message frequency, words of intimacy, sentiment score, and days since last interaction). The objective was to identify coherent clusters that reflect distinct patterns of social engagement and emotional expression.

Table 8 reports cluster assignments for a representative subset of 20 users across the four algorithms. Each user is described by four core features (messages, intimacy word frequency, sentiment, and recency), chosen to capture communication intensity, emotional valence, and temporal dynamics—key signals that can relate to underlying mental states.

As shown in Table 8, K-Means, GMM, and Agglomerative Clustering consistently identified three distinct behavioural groups. These clusters span a spectrum of social engagement—from high-affinity users characterised by frequent, emotionally expressive communication (e.g., users 3, 10, and 17) to disengaged individuals with prolonged inactivity or minimal sentiment (e.g., users 4, 6, and 20). K-Means and GMM showed particularly strong alignment, assigning nearly identical labels to users with similar profiles.

Agglomerative Clustering also produced coherent groups but diverged on borderline cases—such as users 8 and 10—where linkage criteria likely influenced boundary placement. Unlike the centroid-based separation of K-Means or the probabilistic flexibility of the GMM, the hierarchical merging process can fragment tightly knit groups under certain feature configurations. Nonetheless, the broad agreement across these three methods supports the reliability of the observed segments.

By contrast, DBSCAN failed to identify meaningful clusters in this subset. All users received the label -1 (noise), indicating that under the specified hyperparameters

(ε, \min_samples)

, the data lacked sufficient density to form core points. This behaviour is consistent with high-dimensional sparsity and overlapping user patterns that are not well captured by global distance thresholds. With absent domain-specific tuning of density parameters, DBSCAN appears unsuitable for this behavioural dataset.

To ensure that DBSCAN’s poor performance was not simply due to arbitrary parameter choices, we performed a systematic grid search over a wide range of

ε

(0.1–5.0) and min_samples (3–20) values. Across all tested configurations, DBSCAN consistently failed to identify stable or meaningful clusters, with the vast majority of points labelled as noise. This suggests that the algorithm’s density-based assumptions are poorly matched to the sparsity and overlap inherent in Facebook interaction data, rather than being a consequence of suboptimal hyperparameter selection.

To assess internal cluster quality, we computed silhouette scores for each algorithm (Table 9). The silhouette measures how well each point fits within its assigned cluster relative to others; values closer to 1 indicate compact, well-separated clusters.

K-Means achieved the highest silhouette score (0.184), indicating relatively compact clusters and clear separation. The GMM followed closely (0.180), suggesting its probabilistic modelling was similarly effective in capturing behavioural structure. Agglomerative Clustering obtained a slightly lower score (0.174), consistent with less compact groupings when hierarchical linkage is applied to high-dimensional features. In stark contrast, DBSCAN scored 0.051, confirming weak structure under the tested parameters.

It is important to note that the silhouette scores obtained in our experiments (all

\leq 0.184

) indicate relatively weak cluster separation. This limitation likely stems from the high-dimensional and overlapping nature of Facebook interaction features, where behavioural signals (e.g., sentiment, intimacy, activity frequency) may not form sharply delineated groups. While the absolute values are low, the comparative differences between algorithms remain informative: K-Means and GMM consistently produced more cohesive clusters compared to Agglomerative Clustering and DBSCAN. Thus, although cluster compactness is limited, relative performance rankings still provide meaningful evidence about which algorithms are better suited to this type of social interaction data.

Overall, the numerical evidence highlights that K-Means and GMM provide the most reliable clustering results for this dataset, while Agglomerative Clustering offers moderate performance and DBSCAN is unsuitable under the tested conditions.

We also evaluated cross-method consistency using pairwise Normalised Mutual Information (NMI) (Table 10). NMI quantifies agreement between two clusterings (0–1), independent of label permutations.

The highest agreement was observed between K-Means and GMM (NMI = 0.86), indicating that both recover highly similar structures. GMM also aligned well with Agglomerative Clustering (0.76), while K-Means and Agglomerative showed moderate consistency (0.72). DBSCAN exhibited very low agreement with all other methods (NMI 0.09–0.12), reflecting its divergent behaviour and instability on this dataset.

The numerical evidence confirms that centroid-based (K-Means) and probabilistic (GMM) models converge on highly consistent cluster structures, whereas DBSCAN diverges strongly, reinforcing its unsuitability for this type of behavioural data.

Finally, Table 11 summarizes cluster-size distributions for the full dataset of 2500 users across all algorithms.

K-Means and GMM produced relatively balanced partitions, indicating uniform segmentation of user behaviour. Agglomerative Clustering yielded a more uneven distribution, with Cluster 1 covering 59.2% of users, suggesting possible over-merging under its linkage criterion. DBSCAN identified only a single valid cluster and flagged 2.28% of users as noise, corroborating its poor fit for the present feature space and parameter settings.

For visual comparison, Figure 2 provides two-dimensional projections of the clustering outcomes for each algorithm.

Figure 2 is consistent with the quantitative metrics. K-Means and GMM produce compact, well-separated structures, supporting their suitability for behavioural partitioning in this context. Agglomerative Clustering shows more elongated and overlapping formations, reflecting weaker separation despite identifying three functional groups. DBSCAN again fails to reveal meaningful structure: most points collapse into a single blob or are marked as noise, underscoring its sensitivity to sparsity and parameterisation on high-dimensional Facebook activity profiles.

5.4. Interpretive Discussion and Insights

The results across all three experimental components provide a coherent and multifaceted perspective on the computational modelling of social attachment and mental state indicators from Facebook activity. A central finding is that the two attachment strength formulations produced markedly different distributions. The normalised scoring function constrained values to the interval

[0, 1]

, but often failed to distinguish between passive users and moderately active ones, thereby compressing tie-strength variability. By contrast, the weighted formulation introduced an unbounded scale that more sharply penalised prolonged inactivity and amplified the impact of emotional signals, yielding a richer and more discriminative representation. This divergence was particularly evident for users with low engagement who still received relatively high scores under the normalised model, potentially masking differences in their relational significance.

These scoring differences had a direct effect on supervised classification. Both formulations enabled accurate prediction of mental state indicators, yet models trained on weighted attachment scores consistently outperformed those using normalised scores. The strongest gains were observed for advanced learners such as BERT, XGBoost, and LightGBM, which achieved superior accuracy and F1-scores. Importantly, recall also improved under the weighted formulation, highlighting its sensitivity to subtle behavioural variations that may reflect early warning signs of emotional distress, withdrawal, or stress. This is particularly relevant in digital mental health contexts, where false negatives—i.e., failing to detect at-risk users—can have serious implications.

In contrast, traditional learners such as Logistic Regression and Naive Bayes, while computationally efficient, underperformed in recall and F1-score, suggesting that they lacked the capacity to fully capture the non-linear interplay of emotional and temporal features. This comparison highlights the importance of selecting model families aligned with the complexity of the underlying constructs being modelled.

Unsupervised clustering further validated the attachment framework. K-Means and GMM consistently produced coherent and interpretable behavioural groupings, supported by higher silhouette scores and strong Normalised Mutual Information (NMI) agreement. These clusters aligned with latent dimensions of social interaction ranging from consistent, emotionally rich engagement to sporadic or minimal communication, reflecting well-established psychological typologies. Agglomerative Clustering, while less stable, still identified meaningful groupings, though with uneven cluster sizes. By contrast, DBSCAN consistently failed to identify actionable partitions, largely due to the sparsity and overlap of high-dimensional features combined with sensitivity to density parameters. This underperformance highlights the limitations of density-based approaches when applied to behavioural social media data.

The contrast between centroid-based (K-Means), probabilistic (GMM), and hierarchical (Agglomerative) approaches also demonstrates how different algorithmic assumptions shape the resulting behavioural segments: centroid and probabilistic models converged on highly similar structures, while hierarchical clustering tended to over-merge groups. This reinforces the importance of method selection when the goal is to capture nuanced, fine-grained patterns of attachment.

Taken together, these results demonstrate that attachment strength—when modelled with emotional, behavioural, and temporal nuance—serves as a robust organising variable for downstream analysis. Both supervised and unsupervised evaluations converged on consistent behavioural structures, reinforcing theoretical assumptions from attachment theory and affective computing. The consistency across fundamentally different methodological paradigms (classification vs. clustering) also strengthens the external validity of the framework, suggesting that attachment-based features generalise well across analytical settings.

The convergence of results across model families with very different inductive biases further supports the external validity of the attachment features: whether interpreted through statistical learners, boosting ensembles, or neural architectures, the same underlying behavioural constructs emerged as discriminative and stable.

Ultimately, the study shows that mental state indicators can be inferred from online behavioural data with both high accuracy and interpretability, provided that features are grounded in psychologically meaningful constructs. The integration of emotional valence, intimacy markers, and interaction recency emerges as crucial for achieving both granularity and theoretical relevance. Beyond methodological advances, these findings carry practical implications: they can inform the design of mental health monitoring tools, digital intervention systems, and ethically responsible social computing applications. Understanding the nuances of online connectedness is foundational not only for improving predictive performance but also for ensuring that algorithmic insights support user well-being in a transparent and socially responsible manner.

5.5. Explaining Algorithmic Performance Differences

The comparative evaluation revealed systematic differences in how various algorithms handled the attachment-based features. These differences can be explained by the inductive biases and representational capacities of each model family.

Transformer-based models (BERT). BERT consistently achieved the highest performance across all tasks. This can be attributed to its ability to capture nuanced emotional and psychological signals from text, leveraging contextual embeddings that go beyond surface-level word counts or sentiment scores. Its attention mechanism allows the model to focus on subtle linguistic cues that strongly correlate with attachment and mental state indicators.

Ensemble learners (XGBoost, LightGBM, CatBoost). Gradient-boosting ensembles performed nearly as well as BERT, benefiting from their ability to model complex non-linear feature interactions and handle heterogeneous feature types. These models also provided robustness against noise, which is common in user-generated social data.

Traditional classifiers (Logistic Regression, Naïve Bayes, SVM). While interpretable, these models struggled with recall, often failing to detect weaker or borderline cases of attachment. Their reliance on linear boundaries or simplified probabilistic assumptions limited their ability to capture the complex interplay between temporal, emotional, and behavioural features.

Neural architectures (ANN, LSTM). Feedforward and recurrent models offered moderate gains over traditional baselines by capturing temporal dynamics in user interactions. However, without the deep contextual embeddings available to transformers, they underperformed compared to BERT. LSTM models were particularly useful for modelling sequential activity patterns but were less effective with sparse data.

Clustering algorithms. K-Means and GMM achieved consistent and interpretable groupings because their inductive assumptions (centroid proximity and probabilistic mixture modelling) aligned well with the feature space shaped by attachment scores. Agglomerative Clustering, though meaningful, produced imbalanced groups due to sensitivity to linkage choices. DBSCAN underperformed because behavioural features lacked the density structure required for its neighborhood-based approach, especially in high-dimensional sparse settings.

Overall, these findings highlight that models capable of leveraging non-linear dependencies and rich contextual information (e.g., BERT, gradient-boosting ensembles) are most effective for attachment-based behavioural prediction. In contrast, simpler classifiers offer interpretability but at the cost of sensitivity, while density-based clustering methods are unsuitable for this data domain.

6. Conclusions and Future Work

This paper presented a comprehensive computational framework for modelling social attachment and inferring mental state indicators from Facebook activity using machine learning techniques. Drawing upon principles from social psychology, affective computing, and behavioural analytics, we proposed a dual scoring mechanism for quantifying interpersonal attachment strength by integrating temporal recency, emotional tone, and communication features. These scores were subsequently employed in both supervised classification and unsupervised clustering tasks, enabling the identification of latent behavioural segments and user well-being profiles.

Experimental evaluation confirmed the value of the proposed approach: the weighted attachment strength formulation offered more granular and discriminative representations than the normalised variant, and advanced classifiers such as BERT and gradient-boosting models consistently achieved strong predictive performance. Unsupervised clustering with K-Means and GMM further revealed coherent user groupings, validating the structural soundness of the attachment model. Rather than focusing on specific performance metrics already detailed earlier, these results underscore that integrating temporal and emotional signals is essential for modelling nuanced aspects of social connectedness and psychological state.

Beyond empirical validation, the framework contributes to interdisciplinary understanding by operationalising abstract constructs—such as intimacy, trust, and attachment—into quantifiable indicators. The interpretive discussion linked algorithmic outcomes to theoretical assumptions, demonstrating how observable online behaviours can reflect broader mental health trajectories. These findings support the potential of data-driven systems to augment early-warning tools and targeted interventions in digital mental health contexts.

Future work will extend this research along several directions. First, the scoring functions can be expanded to incorporate richer multimodal data, including images, reactions, and user metadata, thereby enhancing attachment modelling fidelity. Second, temporal dynamics will be modelled using recurrent or attention-based architectures to better capture behavioural evolution and shifts over time. Third, ethical considerations—encompassing consent, transparency, and algorithmic fairness—will be addressed more explicitly to ensure that affect-aware technologies are deployed responsibly in sensitive contexts.

This study is not without limitations. All experiments were conducted on a single Facebook-derived dataset, which—while ensuring internal consistency—restricts external validation. Cultural and demographic biases may also be present, as the sample is not globally representative. Furthermore, the reliability of self-reported mental state labels remains an inherent challenge, as such annotations are subject to subjectivity and potential inconsistency. Finally, feature extraction was constrained by the platform’s available signals, which may not fully capture offline attachment or psychological states. Future research should address these issues by validating the framework across multiple platforms (e.g., Twitter, Reddit, LinkedIn), recruiting more culturally diverse samples, and incorporating richer multimodal ground-truthing.

Beyond research implications, practical deployment pathways should also be explored. The framework could be integrated into digital well-being platforms, social media monitoring systems, or early-warning tools for clinicians. Deployment would require scalable APIs for real-time feature extraction, efficient infrastructure for attachment score computation, and interpretable interfaces that allow practitioners to act upon system outputs. Collaboration with mental health professionals and cross-platform testing will be key to ensuring both practical relevance and responsible adoption.

In conclusion, this study provides a scalable and interpretable pipeline for extracting meaningful psychological and relational insights from Facebook user activity. By combining data-driven learning with theoretical grounding, the proposed approach establishes a foundation for advancing both computational social science and digital mental health analytics.

Author Contributions

Conceptualization, S.K. and A.K.; methodology, S.K. and A.K.; data curation, S.K. and A.K.; writing—original draft, S.K. and A.K.; writing—review & editing, S.K. and A.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Sherchan, W.; Nepal, S.; Paris, C. A Survey of Trust in Social Networks. ACM Comput. Surv. 2013, 45, 47. [Google Scholar] [CrossRef]
Kafeza, E.; Kanavos, A.; Makris, C.; Vikatos, P. T-PICE: Twitter Personality Based Influential Communities Extraction System. In Proceedings of the International Congress on Big Data, Anchorage, AK, USA, 27 June–2 July 2014; pp. 212–219. [Google Scholar]
Moosavi, S.A.; Jalali, M.; Misaghian, N.; Shamshirband, S.; Anisi, M.H. Community Detection in Social Networks Using User Frequent Pattern Mining. Knowl. Inf. Syst. 2017, 51, 159–186. [Google Scholar] [CrossRef]
Ghafari, S.M.; Yakhchi, S.; Beheshti, A.; Orgun, M.A. SETTRUST: Social Exchange Theory Based Context-Aware Trust Prediction in Online Social Networks. In Data Quality and Trust in Big Data, Proceedings of the 5th International Workshop, QUAT 2018, Dubai, United Arab Emirates, 12–15 November 2018; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2018; pp. 46–61. [Google Scholar]
Kafeza, E.; Kanavos, A.; Makris, C.; Pispirigos, G.; Vikatos, P. T-PCCE: Twitter Personality based Communicative Communities Extraction System for Big Data. IEEE Trans. Knowl. Data Eng. 2020, 32, 1625–1638. [Google Scholar] [CrossRef]
Xiao, Y.; Liu, J.; Wu, J.; Ansari, N. Leveraging Deep Reinforcement Learning for Traffic Engineering: A Survey. IEEE Commun. Surv. Tutor. 2021, 23, 2064–2097. [Google Scholar] [CrossRef]
Borzymek, P.; Sydow, M. Trust and Distrust Prediction in Social Network with Combined Graphical and Review-Based Attributes. In Agent and Multi-Agent Systems: Technologies and Applications, Proceedings of the 4th KES International Symposium, KES-AMSTA 2010, Gdynia, Poland, 23–25 June 2010; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2010; Volume 6070, pp. 122–131. [Google Scholar]
Jamali, M.; Ester, M. TrustWalker: A Random Walk Model for Combining Trust-Based and Item-Based Recommendation. In Proceedings of the 15th SIGKDD International Conference on Knowledge Discovery and Data Mining, Paris, France, 28 June–1 July 2009; pp. 397–406. [Google Scholar]
Graña, M.; Nuñez-Gonzalez, J.D.; Ozaeta, L.; Kaminska-Chuchmala, A. Experiments of Trust Prediction in Social Networks by Artificial Neural Networks. Cybern. Syst. 2015, 46, 19–34. [Google Scholar] [CrossRef]
Coleman, J.S. Social Capital in the Creation of Human Capital. Am. J. Sociol. 1988, 94, S95–S120. [Google Scholar] [CrossRef]
Granovetter, M.S. The Strength of Weak Ties. Am. J. Sociol. 1973, 78, 1360–1380. [Google Scholar] [CrossRef]
Donath, J.; Boyd, D. Public Displays of Connection. BT Technol. J. 2004, 22, 71–82. [Google Scholar] [CrossRef]
Fogel, J.; Nehmad, E. Internet Social Network Communities: Risk Taking, Trust, and Privacy Concerns. Comput. Hum. Behav. 2009, 25, 153–160. [Google Scholar] [CrossRef]
Kanavos, A.; Kafeza, E.; Makris, C. Can We Rank Emotions? A Brand Love Ranking System for Emotional Terms. In Proceedings of the International Congress on Big Data, New York, NY, USA, 27 June–2 July 2015; pp. 71–78. [Google Scholar]
Freeman, L.C. The Development of Social Network Analysis: A Study in the Sociology of Science; BookSurge: North Charleston, SC, USA, 2004; Volume 1, pp. 159–167. [Google Scholar]
Ghafari, S.M.; Beheshti, A.; Joshi, A.; Paris, C.; Mahmood, A.; Yakhchi, S.; Orgun, M.A. A Survey on Trust Prediction in Online Social Networks. IEEE Access 2020, 8, 144292–144309. [Google Scholar] [CrossRef]
Gilbert, E.; Karahalios, K. Predicting Tie Strength with Social Media. In Proceedings of the 27th International Conference on Human Factors in Computing Systems (CHI), Boston, MA, USA, 4–9 April 2009; pp. 211–220. [Google Scholar]
Guha, R.V.; Kumar, R.; Raghavan, P.; Tomkins, A. Propagation of Trust and Distrust. In Proceedings of the 13th International Conference on World Wide Web (WWW), New York, NY, USA, 17–20 May 2004; pp. 403–412. [Google Scholar]
Denko, M.K.; Sun, T.; Woungang, I. Trust Management in Ubiquitous Computing: A Bayesian Approach. Comput. Commun. 2011, 34, 398–406. [Google Scholar] [CrossRef]
Moradi, P.; Ahmadian, S. A Reliability-Based Recommendation Method to Improve Trust-Aware Recommender Systems. Expert Syst. Appl. 2015, 42, 7386–7398. [Google Scholar] [CrossRef]
Zhao, K.; Pan, L. A Machine Learning Based Trust Evaluation Framework for Online Social Networks. In Proceedings of the 13th IEEE International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom), Beijing, China, 24–26 September 2014; pp. 69–74. [Google Scholar]
Ma, N.; Lim, E.; Nguyen, V.; Sun, A.; Liu, H. Trust Relationship Prediction Using Online Product Review Data. In Proceedings of the 1st International Workshop on Complex Networks Meet Information & Knowledge Management (CIKM-CNIKM), Hong Kong, China, 6 November 2009; pp. 47–54. [Google Scholar]
Zhang, Y.; Yu, T. Mining Trust Relationships from Online Social Networks. J. Comput. Sci. Technol. 2012, 27, 492–505. [Google Scholar] [CrossRef]
Zolfaghar, K.; Aghaie, A. Evolution of Trust Networks in Social Web Applications Using Supervised Learning. Procedia Comput. Sci. 2011, 3, 833–839. [Google Scholar] [CrossRef]
Raj, E.D.; Babu, L.D.D. An Enhanced Trust Prediction Strategy for Online Social Networks Using Probabilistic Reputation Features. Neurocomputing 2017, 219, 412–421. [Google Scholar] [CrossRef]
Wang, X.; Wang, Y.; Sun, H. Exploring the Combination of Dempster-Shafer Theory and Neural Network for Predicting Trust and Distrust. Comput. Intell. Neurosci. 2016, 2016, 5403105. [Google Scholar] [CrossRef] [PubMed]
Ghafari, S.M.; Joshi, A.; Beheshti, A.; Paris, C.; Yakhchi, S.; Orgun, M.A. DCAT: A Deep Context-Aware Trust Prediction Approach for Online Social Networks. In Proceedings of the 17th International Conference on Advances in Mobile Computing & Multimedia (MoMM), Munich, Germany, 2–4 December 2019; pp. 20–27. [Google Scholar]
Zhang, X.; Cui, L.; Wang, Y. CommTrust: Computing Multi-Dimensional Trust by Mining E-Commerce Feedback Comments. IEEE Trans. Knowl. Data Eng. 2014, 26, 1631–1643. [Google Scholar] [CrossRef]
Kim, Y.A.; Song, H.S. Strategies for Predicting Local Trust Based on Trust Propagation in Social Networks. Knowl. Based Syst. 2011, 24, 1360–1371. [Google Scholar] [CrossRef]
Ghavipour, M.; Meybodi, M.R. Trust Propagation Algorithm Based on Learning Automata for Inferring Local Trust in Online Social Networks. Knowl. Based Syst. 2018, 143, 307–316. [Google Scholar] [CrossRef]
Tang, J.; Gao, H.; Hu, X.; Liu, H. Exploiting Homophily Effect for Trust Prediction. In Proceedings of the 6th International Conference on Web Search and Data Mining (WSDM), Rome, Italy, 4–8 February 2013; pp. 53–62. [Google Scholar]
Tang, J.; Hu, X.; Liu, H. Is Distrust the Negation of Trust?: The Value of Distrust in Social Media. In Proceedings of the 25th Conference on Hypertext and Social Media (HT), Santiago, Chile, 1–4 September 2014; pp. 148–157. [Google Scholar]
Burke, M.; Kraut, R.E. Growing Closer on Facebook: Changes in Tie Strength through Social Network Site Use. In Proceedings of the CHI Conference on Human Factors in Computing Systems, Toronto, ON, Canada, 26 April–1 May 2014; pp. 4187–4196. [Google Scholar]
Burke, M.; Kraut, R.E. The Relationship between Facebook Use and Well-Being depends on Communication Type and Tie Strength. J. Comput.-Mediat. Commun. 2016, 21, 265–281. [Google Scholar] [CrossRef]
Kanavos, A.; Perikos, I.; Hatzilygeroudis, I.; Tsakalidis, A.K. Emotional Community Detection in Social Networks. Comput. Electr. Eng. 2018, 65, 449–460. [Google Scholar] [CrossRef]
Vonitsanos, G.; Kanavos, A.; Mylonas, P. Decoding Gender on Social Networks: An In-depth Analysis of Language in Online Discussions Using Natural Language Processing and Machine Learning. In Proceedings of the IEEE International Conference on Big Data, Sorrento, Italy, 15–18 December 2023; pp. 4618–4625. [Google Scholar]
Sacco, O.; Breslin, J.G. In Users We Trust: Towards Social User Interactions Based Trust Assertions for the Social Semantic Web. Soc. Netw. Anal. Min. 2014, 4, 229. [Google Scholar] [CrossRef]
Guo, J.; Han, K.; Wu, H.; Tang, Y.; Chen, X.; Wang, Y.; Xu, C. CMT: Convolutional Neural Networks Meet Vision Transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 12165–12175. [Google Scholar]
Su, K.; Xiao, B.; Liu, B.; Zhang, H.; Zhang, Z. TAP: A Personalized Trust-Aware QoS Prediction Approach for Web Service Recommendation. Knowl. Based Syst. 2017, 115, 55–65. [Google Scholar] [CrossRef]
Zhao, L.; Hua, T.; Lu, C.; Chen, I. A Topic-Focused Trust Model for Twitter. Comput. Commun. 2016, 76, 1–11. [Google Scholar] [CrossRef]
Tang, J.; Gao, H.; Liu, H.; Sarma, A.D. eTrust: Understanding Trust Evolution in an Online World. In Proceedings of the 18th SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), Beijing China, 12–16 August 2012; pp. 253–261. [Google Scholar]
Wang, C.; Chen, W.; Wang, Y. Scalable Influence Maximization for Independent Cascade Model in Large-Scale Social Networks. Data Min. Knowl. Discov. 2012, 25, 545–576. [Google Scholar] [CrossRef]
Chen, W.; Yuan, Y.; Zhang, L. Scalable Influence Maximization in Social Networks under the Linear Threshold Model. In Proceedings of the 10th IEEE International Conference on Data Mining (ICDM), Sydney, Australia, 13–17 December 2010; pp. 88–97. [Google Scholar]
Shaqra, F.M.A. Modelling Heterogeneous Time Series from Irregular Data Streams. Ph.D. Thesis, RMIT University, Melbourne, Australia, 2024. [Google Scholar]
Breiger, R.L. The Analysis of Social Networks. In Handbook of Data Analysis; Sage Publications: London, UK, 2004; pp. 505–526. [Google Scholar]
Goodfellow, I.J.; Bengio, Y.; Courville, A.C. Deep Learning; Adaptive Computation and Machine Learning; MIT Press: Cambridge, MA, USA, 2016. [Google Scholar]
Khoo, L.S.; Lim, M.K.; Chong, C.Y.; McNaney, R. Machine Learning for Multimodal Mental Health Detection: A Systematic Review of Passive Sensing Approaches. Sensors 2024, 24, 348. [Google Scholar] [CrossRef]
Wang, L.; Wang, C.; Li, C.; Murai, T.; Bai, Y.; Song, Z.; Zhang, S.; Zhang, Q.; Huang, Y.; Bi, X.; et al. AI-Assisted Multi-Modal Information for the Screening of Depression: A Systematic Review and Meta-Analysis. npj Digit. Med. 2025, 8, 523. [Google Scholar] [CrossRef]
Huang, X.; Wang, F.; Gao, Y.; Liao, Y.; Zhang, W.; Zhang, L.; Xu, Z. Depression Recognition Using Voice-Based Pre-Training Model. Sci. Rep. 2024, 14, 12734. [Google Scholar] [CrossRef]
Owen, D.; Lynham, A.J.; Smart, S.E.; Pardiñas, A.F.; Collados, J.C. AI for Analyzing Mental Health Disorders Among Social Media Users: Quarter-Century Narrative Review of Progress and Challenges. J. Med. Internet Res. 2024, 26, e59225. [Google Scholar] [CrossRef]
Lin, F.; Cohen, W.W. Semi-Supervised Classification of Network Data Using Very Few Labels. In Proceedings of the International Conference on Advances in Social Networks Analysis and Mining (ASONAM), Odense, Denmark, 9–11 August 2010; pp. 192–199. [Google Scholar]
Xing, L.; Li, S.; Zhang, Q.; Wu, H.; Ma, H.; Zhang, X. A Survey on Social Network’s Anomalous Behavior Detection. Complex Intell. Syst. 2024, 10, 5917–5932. [Google Scholar] [CrossRef]
Kridera, S.; Kanavos, A. Exploring Trust Dynamics in Online Social Networks: A Social Network Analysis Perspective. Math. Comput. Appl. 2024, 29, 37. [Google Scholar] [CrossRef]
Savage, D.; Zhang, X.; Yu, X.; Chou, P.; Wang, Q. Anomaly Detection in Online Social Networks. Soc. Netw. 2014, 39, 62–70. [Google Scholar] [CrossRef]
Yang, J.; Tsou, M.; Jung, C.; Allen, C.; Spitzberg, B.H.; Gawron, J.M.; Han, S.Y. Social Media Analytics and Research Testbed (SMART): Exploring Spatiotemporal Patterns of Human Dynamics With Geo-Targeted Social Media Messages. Big Data Soc. 2016, 3, 2053951716652914. [Google Scholar] [CrossRef]
Orgaz, G.B.; Jung, J.J.; Camacho, D. Social Big Data: Recent Achievements and New Challenges. Inf. Fusion 2016, 28, 45–59. [Google Scholar] [CrossRef] [PubMed]
Balaji, T.K.; Annavarapu, C.S.R.; Bablani, A. Machine Learning Algorithms for Social Media Analysis: A Survey. Comput. Sci. Rev. 2021, 40, 100395. [Google Scholar] [CrossRef]
Veltri, G.A. Digital Social Research; John Wiley & Sons: Hoboken, NJ, USA, 2019. [Google Scholar]
Debreceny, R.S.; Wang, T.; Zhou, M.J. Research in Social Media: Data Sources and Methodologies. J. Inf. Syst. 2019, 33, 1–28. [Google Scholar] [CrossRef]
Peng, S.; Wang, G.; Xie, D. Social Influence Analysis in Social Networking Big Data: Opportunities and Challenges. IEEE Netw. 2017, 31, 11–17. [Google Scholar] [CrossRef]
Zhang, D.; Yin, J.; Zhu, X.; Zhang, C. Network Representation Learning: A Survey. IEEE Trans. Big Data 2020, 6, 3–28. [Google Scholar] [CrossRef]
Ahmed, M.; Seraj, R.; Islam, S.M.S. The k-means Algorithm: A Comprehensive Survey and Performance Evaluation. Electronics 2020, 9, 1295. [Google Scholar] [CrossRef]
Held, P.; Krause, B.; Kruse, R. Dynamic Clustering in Social Networks Using Louvain and Infomap Method. In Proceedings of the 3rd European Network Intelligence Conference (ENIC), Wroclaw, Poland, 5–7 September 2016; pp. 61–68. [Google Scholar]
Shepitsen, A.; Gemmell, J.; Mobasher, B.; Burke, R.D. Personalized Recommendation in Social Tagging Systems Using Hierarchical Clustering. In Proceedings of the Conference on Recommender Systems (RecSys), Lausanne, Switzerland, 23–25 October 2008; pp. 259–266. [Google Scholar]
Cordasco, G.; Gargano, L. Community Detection via Semi-Synchronous Label Propagation Algorithms. In Proceedings of the IEEE International Workshop on: Business Applications of Social Network Analysis (BASNA), Bangalore, India, 15 December 2010; pp. 1–8. [Google Scholar]
Khatoon, M.; Banu, W.A. An Efficient Method to Detect Communities in Social Networks Using DBSCAN Algorithm. Soc. Netw. Anal. Min. 2019, 9, 9. [Google Scholar] [CrossRef]
Devlin, J.; Chang, M.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186. [Google Scholar]
Najar, F.; Bourouis, S.; Bouguila, N.; Belghith, S. A Comparison Between Different Gaussian-Based Mixture Models. In Proceedings of the 14th IEEE/ACS International Conference on Computer Systems and Applications (AICCSA), Hammamet, Tunisia, 30 October–3 November 2017; pp. 704–708. [Google Scholar]
Sathiyakumari, K.; Vijaya, M.S. Community Detection Based on Girvan Newman Algorithm and Link Analysis of Social Media. In Digital Connectivity—Social Impact, Proceedings of the 51st Annual Convention of the Computer Society of India, Coimbatore, India, 8–9 December 2016; Communications in Computer and Information Science; Springer: Singapore, 2016; pp. 223–234. [Google Scholar]
Verma, M.; Srivastava, M.; Chack, N.; Diswar, A.K.; Gupta, N. A Comparative Study of Various Clustering Algorithms in Data Mining. Int. J. Eng. Res. Appl. 2012, 2, 1379–1384. [Google Scholar]
Hutto, C.J.; Gilbert, E. VADER: A Parsimonious Rule-Based Model for Sentiment Analysis of Social Media Text. In Proceedings of the 8th International Conference on Weblogs and Social Media (ICWSM), Ann Arbor, MI, USA, 1–4 June 2014. [Google Scholar]
Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830. [Google Scholar]
Lavanya, P.M.; Sasikala, E. Deep Learning Techniques on Text Classification Using Natural Language Processing (NLP) In Social Healthcare Network: A Comprehensive Survey. In Proceedings of the 3rd International Conference on Signal Processing and Communication (ICPSC), Coimbatore, India, 13–14 May 2021; pp. 603–609. [Google Scholar]
team2-ethi-mental-state. Available online: https://github.com/asrobang1/team2-ethi-mental-state/ (accessed on 26 July 2025).
Wang, Z.; Tan, Y.; Zhang, M. Graph-Based Recommendation on Social Networks. In Proceedings of the 12th Asia-Pacific Web Conference (APWeb), Busan, Republic of Korea, 6–8 April 2010; pp. 116–122. [Google Scholar]
Wang, C.; Tang, W.; Sun, B.; Fang, J.; Wang, Y. Review on Community Detection Algorithms in Social Networks. In Proceedings of the IEEE International Conference on Progress in Informatics and Computing (PIC), Nanjing, China, 18–20 December 2015; pp. 551–555. [Google Scholar]

Figure 1. Visualisation of trust groups in a social network. Each colour (red, yellow, green, purple, blue, etc.) represents a distinct user community, while the size of each node reflects user centrality and influence.

Figure 2. Visual comparison of user clusters produced by K-Means, Agglomerative Clustering, GMM, and DBSCAN.

Table 1. Summary of key model configurations.

Model	Configuration
ANN	2 hidden layers (128, 64), ReLU, Adam (lr = 0.001), batch = 32, 50 epochs, early stopping
LSTM	1 recurrent layer (64 units), dropout = 0.2, Adam (lr = 0.001), batch = 32, 50 epochs
BERT	`bert-base-uncased`, max length = 128, AdamW (lr = $2 \times 10^{- 5}$ ), batch = 16, 3 epochs
XGBoost	200 estimators, max depth = 6, learning rate = 0.1, subsample = 0.8
LightGBM	200 estimators, max depth = 7, learning rate = 0.05, feature fraction = 0.8

Table 2. Attachment strength computed via normalised scoring function.

ID	Days Since Last Comm.	Words of Intimacy	Pos. Emotions	Wall Posts	Messages	Comments	Attachment Strength
ID1	1900.04	0	0.000	1	0	0	1.000000
ID2	344.54	12	0.187	0	797	1	0.897659
ID3	483.63	15	0.220	0	599	3	0.770177
ID4	1017.63	21	0.235	0	575	6	0.751014
ID5	1902.38	0	0.000	0	0	95	0.666496
ID6	1913.96	0	0.000	0	0	8	0.179817
ID7	1558.75	0	0.000	0	0	5	0.179773
ID8	1323.88	0	0.000	0	0	3	0.179653
ID9	1561.50	0	0.000	0	0	5	0.179644
ID10	969.54	0	0.000	0	0	0	0.179568

Table 3. Attachment strength computed via weighted linear combination.

ID	Days Since Last Comm.	Words of Intimacy	Pos. Emotions	Wall Posts	Messages	Comments	Attachment Strength
ID1	197	38	0.213	37	403	6	−37.477
ID2	568	47	0.874	42	312	9	−372.822
ID3	842	59	0.432	25	223	34	−624.702
ID4	259	90	0.672	17	278	28	−110.272
ID5	921	29	0.987	19	321	21	−677.883
ID6	615	54	0.741	23	150	12	−402.187
ID7	793	82	0.311	31	410	8	−482.646
ID8	189	71	0.827	44	249	20	−48.275
ID9	923	10	0.645	22	396	15	−692.509
ID10	504	67	0.258	37	208	39	−273.709

Table 4. Classification performance using normalised attachment strength.

Model	Accuracy	Precision	Recall	F1-Score	AUC-ROC	PR-AUC
KNN	0.93	0.92	0.96	0.94	0.95	0.94
Random Forest	0.91	0.89	0.93	0.91	0.92	0.91
Gradient Boosting	0.91	0.89	0.93	0.91	0.92	0.91
ADABoost	0.90	0.86	0.96	0.91	0.91	0.90
SVM	0.89	0.85	0.96	0.90	0.90	0.89
Extra Trees	0.87	0.86	0.89	0.87	0.88	0.87
Decision Tree	0.86	0.83	0.89	0.86	0.85	0.84
Logistic Regression	0.80	1.00	0.60	0.75	0.82	0.70
Neural Network	0.80	1.00	0.60	0.75	0.83	0.72
Gaussian Naive Bayes	0.55	0.75	0.14	0.23	0.58	0.40
XGBoost	0.94	0.93	0.96	0.94	0.96	0.95
LightGBM	0.93	0.91	0.95	0.93	0.95	0.94
CatBoost	0.92	0.90	0.94	0.92	0.94	0.93
LSTM	0.88	0.86	0.90	0.88	0.89	0.88
BERT	0.95	0.94	0.97	0.95	0.97	0.96

Table 5. Classification performance using weighted attachment strength.

Model	Accuracy	Precision	Recall	F1-Score	AUC-ROC	PR-AUC
KNN	0.94	0.93	0.97	0.95	0.96	0.95
Random Forest	0.92	0.91	0.95	0.93	0.94	0.93
Gradient Boosting	0.92	0.90	0.94	0.92	0.94	0.92
ADABoost	0.91	0.88	0.97	0.92	0.93	0.92
SVM	0.90	0.87	0.97	0.92	0.92	0.91
Extra Trees	0.89	0.88	0.91	0.89	0.90	0.89
Decision Tree	0.88	0.85	0.90	0.87	0.87	0.86
Logistic Regression	0.84	1.00	0.69	0.81	0.85	0.80
Neural Network	0.84	0.99	0.71	0.83	0.86	0.81
Gaussian Naive Bayes	0.61	0.76	0.22	0.34	0.62	0.45
XGBoost	0.95	0.94	0.97	0.95	0.97	0.96
LightGBM	0.94	0.92	0.96	0.94	0.96	0.95
CatBoost	0.93	0.91	0.95	0.93	0.95	0.94
LSTM	0.90	0.88	0.92	0.90	0.91	0.90
BERT	0.96	0.95	0.98	0.96	0.98	0.97

Table 6. Statistical significance of model improvements (paired t-tests on F1-score).

Comparison	p-Value
BERT vs. Logistic Regression	<0.001
XGBoost vs. Logistic Regression	$0.002$
LightGBM vs. Logistic Regression	$0.004$

Table 7. Relative importance of behavioural and emotional features (averaged across Random Forest and Gradient Boosting).

Feature	Importance Score
Words of Intimacy	0.24
Sentiment Polarity	0.21
Days Since Last Interaction	0.18
Message Count	0.15
Wall Posts	0.12
Comments	0.08
Emojis/Punctuation	0.02

Table 8. Cluster assignments for a representative subset of 20 users across four algorithms (K-Means, Agglomerative Clustering, GMM, DBSCAN).

User	Messages	Intimacy Words	Sentiment	Days Since	K-Means	Agglo	GMM	DBSCAN
1	102	7	0.895	238	0	2	0	−1
2	435	46	0.598	598	2	1	2	−1
3	348	34	0.922	692	0	2	0	−1
4	270	77	0.088	491	1	0	1	−1
5	106	80	0.196	774	1	0	1	−1
6	71	35	0.045	518	1	0	1	−1
7	188	49	0.325	388	0	2	0	−1
8	20	3	0.389	478	0	1	0	−1
9	102	1	0.271	872	1	0	1	−1
10	121	5	0.829	589	0	1	0	−1
11	466	53	0.357	330	2	1	2	−1
12	214	3	0.281	140	0	2	0	−1
13	330	53	0.543	127	2	1	2	−1
14	458	92	0.141	234	1	0	1	−1
15	87	62	0.802	300	2	1	2	−1
16	372	17	0.075	939	1	0	1	−1
17	99	89	0.987	879	0	1	0	−1
18	359	43	0.772	132	0	2	0	−1
19	151	33	0.199	147	2	1	2	−1
20	130	73	0.006	602	1	0	1	−1

Table 9. Silhouette scores comparing clustering quality across algorithms.

Clustering Algorithm	Silhouette Score
K-Means	0.184
Gaussian Mixture Model (GMM)	0.180
Agglomerative Clustering	0.174
DBSCAN	0.051

Table 10. Pairwise Normalised Mutual Information (NMI) scores showing agreement between clustering algorithms.

Comparison	NMI Score
KMeans vs. GMM	0.86
KMeans vs. Agglo	0.72
KMeans vs. DBSCAN	0.09
GMM vs. Agglo	0.76
GMM vs. DBSCAN	0.10
Agglo vs. DBSCAN	0.12

Table 11. Cluster size distribution (number and percentage of users) for each algorithm.

Cluster	K-Means (n/%)	Agglo (n/%)	GMM (n/%)	DBSCAN (n/%)
1	581 (23.24%)	1480 (59.20%)	639 (25.56%)	0 (0.0%)
2	960 (38.40%)	602 (24.08%)	700 (28.00%)	0 (0.0%)
3	959 (38.36%)	418 (16.72%)	1161 (46.44%)	0 (0.0%)
Noise (DBSCAN)	–	–	–	57 (2.28%)
Total Users	2500	2500	2500	2500
Total Clusters	3	3	3	1

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Kridera, S.; Kanavos, A. Modelling Social Attachment and Mental States from Facebook Activity with Machine Learning. Information 2025, 16, 772. https://doi.org/10.3390/info16090772

AMA Style

Kridera S, Kanavos A. Modelling Social Attachment and Mental States from Facebook Activity with Machine Learning. Information. 2025; 16(9):772. https://doi.org/10.3390/info16090772

Chicago/Turabian Style

Kridera, Stavroula, and Andreas Kanavos. 2025. "Modelling Social Attachment and Mental States from Facebook Activity with Machine Learning" Information 16, no. 9: 772. https://doi.org/10.3390/info16090772

APA Style

Kridera, S., & Kanavos, A. (2025). Modelling Social Attachment and Mental States from Facebook Activity with Machine Learning. Information, 16(9), 772. https://doi.org/10.3390/info16090772

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Modelling Social Attachment and Mental States from Facebook Activity with Machine Learning

Abstract

1. Introduction

2. Related Work

3. Methodological Framework for Pattern Detection in Online Social Networks

3.1. Problem Formulation and Research Objectives

3.2. Data Collection

3.3. General Preprocessing Considerations

3.4. Network Representation

3.5. Algorithmic Strategies for Pattern Detection

3.5.1. Classification Algorithms

3.5.2. Clustering Algorithms

3.6. Preprocessing Pipeline for Social Interaction Data

3.7. Ethics and Privacy Considerations

Anonymisation Procedure

4. Implementation and Experimental Setup

4.1. Toolkits and Computational Environment

4.2. Dataset Description

4.3. Behavioural and Emotional Feature Extraction

4.4. Attachment Strength Scoring Functions and Comparative Evaluation

4.5. Classification Models and Evaluation Metrics

4.5.1. Neural and Transformer Model Configurations

Artificial Neural Network (ANN)

LSTM

BERT

Summary of Configurations

4.5.2. Evaluation Metrics

4.6. Unsupervised Clustering of User Behaviour and Evaluation Metrics

Evaluation Metrics

4.7. Graph-Based Algorithms and Social Network Analysis

5. Experimental Evaluation

5.1. Attachment Strength Calculation

5.2. Classification Performance Analysis

Feature Relevance Analysis

5.3. Unsupervised Clustering of Behavioural Patterns

5.4. Interpretive Discussion and Insights

5.5. Explaining Algorithmic Performance Differences

6. Conclusions and Future Work

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI