1. Introduction
Social media platforms have revolutionized the digital marketing landscape and user engagement within the evolving e-commerce environment. The global e-commerce market is characterized by intense competition, advanced technological infrastructure, and rapidly changing social media platforms. These complexities create challenges for marketers, but they also present vital opportunities for data-driven growth and customer-focused strategies [
1]. Consumers demand personalized, real-time solutions and services, and businesses increasingly adopt artificial intelligence (AI) technologies to respond efficiently [
2]. AI-based systems significantly help brands track and analyze user behavior in real-time, target audiences, personalize content, and refine recommendations. These tools shape consumer decisions by offering products and services tailored to users’ needs and preferences [
2,
3]. As a result, AI is used to predict consumer behavior and shape purchasing experiences in data-driven sectors like online retail and e-commerce [
3,
4].
User engagement has become both a key performance metric and a measure of value shared between users and platforms [
4]. The application of AI effectively alters how consumers engage and interact with brands on social media, transforming the structure of digital marketing strategies and enabling decision-makers to understand behavioral factors [
5].
Figure 1 compares the compound annual growth rate (CAGR) of global e-commerce to total retail from 2018 to 2028. It shows that e-commerce consistently surpasses total retail growth, with a significant spike between 2020 and 2024, indicating the pandemic and post-pandemic period (
Figure 1) [
6].
Machine learning, data mining, and predictive analytics tools are increasingly used to model user behaviors, enabling digital marketers to predict consumer actions and improve prediction accuracy [
7].
However, due to the General Data Protection Regulation (GDPR), companies face restrictions and limitations in extracting user data, leading to the development of advanced methods to extract meaningful insights from anonymous or private datasets [
8].
User engagement refers to the time users actively interact with a website or app interface and services, indicating both attention and interest levels [
9,
10,
11].
These behavioral records, such as the time spent scrolling, viewing, or clicking, provide valuable insights for deploying personalized marketing strategies and targeting segmented audiences more effectively [
12,
13]. Digital platforms, including Facebook, Instagram, TikTok, YouTube, and Pinterest, have transformed the way consumers access products and services, creating a comprehensive behavioral data ecosystem [
14].
These platforms enable brands to gather extensive datasets on users’ profiles, choices, interests, behaviors, and preferences, generate insights into user journeys, and support the delivery of personalized content [
15]. Data segmentation strategies enable marketers to align user behavior with their brand identity, thereby enhancing personalization and relevance in consumer interactions [
16].
At the same time, the growth of social media marketing (SMM) has boosted user engagement and increased competition among brands. SMM helps organizations connect with their audiences, enhance online identity and visibility, drive website traffic and interaction, and influence purchasing decisions and loyalty [
17].
Therefore, evaluating the effectiveness of digital content, such as photos and video posts, becomes a crucial part of the marketing strategy. This study explores how users interact with organic images and video posts on a fashion retail brand’s Facebook business page. Three Facebook key performance indicators (KPIs), including “3 s video views from organic posts,” “reach from organic posts,” and “other clicks,” are used as independent variables to predict user engagement, with “engaged users” as the dependent target. The dataset includes 2500 posts published between 2016 and 2024. Extensive data pre-processing and analysis, including descriptive statistics, regression modeling, and data mining classification techniques, have been conducted [
18].
The choice of these specific KPIs (3 s video views, organic reach, and other clicks) is based on their common use as performance benchmarks in both academic research and digital marketing industry practice. While broader metrics, such as Time Spent on Site or Page Views, exist, this study emphasizes interaction-related indicators that most accurately reflect active user engagement within the Facebook platform ecosystem. These KPIs relate to user actions and are recognized in previous engagement and SMM studies as having substantial predictive value for user interaction behavior. The period from 2016 to 2024 was selected to encompass both stable and changing social media usage patterns, including pre-pandemic and post-pandemic shifts in user behavior, as well as Facebook’s platform algorithm changes during this time. Additionally, the data are proprietary assets for which we obtained specific licenses to access.
“Engaged Users” refers to the dependent variable, which shows the number of users who interacted with a post through specific actions, such as clicks, reactions, comments, or shares. This metric reflects active user behavior beyond reach or impressions. In this study, engagement was categorized into three levels (low, medium, high) according to post-specific interaction thresholds generated from the dataset’s distribution.
Engagement Theory, Social Exchange Theory, and digital consumer behavior models provide essential theoretical foundations. Engagement Theory emphasizes how interaction features, such as views and clicks, serve as triggers for user involvement.
Social Exchange Theory views engagement as a mutually beneficial relationship in which users invest time and attention, expecting to receive informational or emotional value in return [
19,
20,
21].
Digital Consumer Behavior Models establish KPIs, like reach and view duration, which are indicators of behavioral intent. These frameworks show the importance of KPIs as predictive factors. The reliability of these KPIs was confirmed through Cronbach’s alpha analysis [
19,
20,
21].
3. Research Methodology
3.1. Research Scope
The methodological design used in this research closely aligns with the theoretical frameworks introduced earlier. The study uses descriptive and predictive statistical models to identify potential patterns based on user behavior data and engagement metrics, including KPIs such as short video views and other clicks. It emphasizes the importance of integrating artificial intelligence (AI) into SMM by analyzing the factors that influence user behavior and applying predictive analytics in marketing.
Figure 2 illustrates the key steps in this research methodology for developing decision-making rules, encompassing problem definition, data collection, data cleaning and labeling, data analysis, regression analysis, model training and selection, evaluation, and deployment.
The study provides marketers with specific guidelines on how to enhance user engagement through organic social media posts. Its primary focus is on understanding the relationships among KPIs and providing recommendations to help decision-makers optimize SMM strategies. The presentation of results provides a clear overview of social media insights, including both descriptive and predictive statistics [
39,
52,
53].
The authors aim to narrow the gap between marketers’ knowledge and actual user behavior on social media. The data collected includes both organic and paid posts, focusing on engagement metrics such as the target variable “engaged users” (dependent variable), as well as “3 s video views from organic posts,” “reach from organic posts,” and “other clicks” (independent variables) (
Table 3).
The analysis comprises linear regression and an assessment of four classifiers: Random Forest (RF), XGBoost, K-Nearest Neighbors (KNN), and Naïve Bayes (NB). Several factors, including the nature of the data, its features, and the type of task (classification, clustering, regression, etc.), influence the performance of the algorithms on the current dataset. Applying these models in a real-world case study gives marketers the opportunity to explore tools that are more likely to yield highly accurate results [
39,
52,
53]. Specific performance metrics, such as classification accuracy, mean squared error (MSE), R-squared (R²), root mean squared error (RMSE), precision, recall, F1-score, and area under the curve (AUC), are used to quantify prediction errors and to assess classification model performance in terms of correctness, completeness, balance, and discrimination ability [
34,
35,
36,
38].
3.2. Research Design and Objectives
The aim is to establish rules for marketers to boost user engagement and generate a cycle of re-engaging clients (
Figure 3) [
54,
55,
56,
57,
58].
Following a comprehensive five-stage breakdown of a flowchart diagram, this study employs descriptive and predictive statistical analysis to better represent the KPIs influencing user engagement on social media business pages. The methodology includes descriptive statistical tests, regression analysis, and performance assessments of predictive models. The data pre-processing stage involves cleansing, normalization, handling missing data, and splitting the data into training sets (70%) and testing sets (30%). The training and test data are processed before being used by data mining classifiers (
Figure 3).
A raw dataset refers to the stage of collecting user engagement data from the Facebook Business Analytics platform over a selected period, including a set of KPIs. The pre-processing phase examines potential correlations among variables. Linear regression shows which factors influence users’ engagement.
RO1 involves identifying KPIs such as 3 s video views, organic reach, and other clicks that influence user engagement. RO2 aims to categorize engagement classes based on post interactions. RO3 evaluates the performance of predictive models, including RF, XGBoost, KNN, and NB. Model selection and data splitting involve choosing data mining models, such as these, which are evaluated using metrics like classification accuracy (ACC), precision, recall, F1-score, and the area under the curve (AUC). RO4 offers strategic insights to enhance digital campaign performance. A summarized data analysis reveals hidden information and behavioral patterns. Predictions are presented in the final section of the study, including a set of extracted rules and a visual representation of the results.
3.3. Data Collection and Pre-Processing
3.3.1. User Profile
The selected fashion retail Facebook page is run by a business targeting Greek tourist spots on the mainland and two islands. The primary audience for this page consists of female users aged 25 to 44 who are very interested in seasonal collections, promotions, and social shopping features. These users typically engage with content such as videos, promotional posts, and click-through offers. Their purchasing decisions are quick and influenced by visual appeal, influencer posts, and community feedback, making them a good target for predictive engagement analysis. This group was chosen because the page sees high organic traffic during tourist seasons and because the brand is associated with quick purchases by tourists. It offers an excellent opportunity to study user behavior in a tourism retail setting.
3.3.2. Data Acquisition
The dataset is collected from the Facebook business page of a fashion retailer that operates both an online and a physical clothing store. The data span nine years, from 1 January 2016 to 31 December 2024, and comprise 2500 instances of Facebook post engagement metrics.
3.3.3. Data Pre-Processing and Preparation
Data is extracted and analyzed using Microsoft Excel (Microsoft, Redmond, WA, USA). The statistical analysis is conducted with SPSS V28 (IBM, Armonk, NY, USA). Descriptive statistics, performance metrics, normality testing, model configuration, and class segmentation are performed to address the formulated hypotheses. Weka3 version 3.9 software and Python libraries (Matplotlib 3.10.0, NumPy v 2.3.0, and Seaborn v.0.13.2) are employed to evaluate data mining models and visually present the results [
59,
60,
61,
62].
Data is collected, cleaned, analyzed, and segmented. All variables are tested for potential relationships with the dependent target variable (i.e., engaged users). According to Facebook Insights, “engaged users” are the total number of unique users who engaged (clicked, liked, shared, or commented) with a post. This metric captures active interactions with content and is widely used as a benchmark for evaluating the effectiveness of posts in academic and business settings.
Only the moderate and strong correlations are retained and interpreted as the best predictors. The descriptive statistics results are presented as follows. The number of sessions is segmented into different levels of user engagement. Count denotes the number of posts, while Mean indicates the average value of each KPI. Standard Deviation reflects the variability in the KPIs. Min/Max represents the minimum and maximum values for each KPI. Missing Values indicates the number of posts with zeros, missing data, or very low variance. Columns without significant data fields have been removed. The descriptive statistics reveal low average values and high variance for all KPIs, particularly “3 s video views” and “reach,” which indicates that the KPIs follow similar distributions (
Table A1).
3.4. Descriptive and Correlation Analysis (RO1)
3.4.1. Normality Test
A normality test was performed to select the appropriate correlation analysis method. A p-value below 0.05 indicates strong evidence against the null hypothesis of normality. It suggests a deviation from a normal distribution, leading to the recommendation of non-parametric methods for the statistical analysis. Shapiro–Wilk tests are carried out to determine whether parametric or non-parametric correlation analysis is suitable for evaluating the relationships between user engagement and other variables. The Shapiro–Wilk normality test is applied to the four KPIs in the study: “engaged users,” “3 s video views from organic posts,” “reach from organic posts,” and “other clicks.” For each variable, the normality test statistic indicates whether the variable follows a normal distribution, while the
p-value indicates the statistical significance of the result. All current KPIs have
p-values below the 0.05 threshold, indicating they do not follow a normal distribution. These results suggest that non-parametric statistical methods should be used in the analysis, including Spearman’s correlation to examine the relationships among the KPIs (
Table 4) [
54].
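Because the analysis relies on Spearman’s rank correlation, a minimal sketch of how the coefficient can be computed is given below. It uses only the Python standard library so that the ranking step is visible; the function names and the sample values are illustrative assumptions and are not taken from the study’s dataset or code. In practice, a library routine such as `scipy.stats.spearmanr` (and `scipy.stats.shapiro` for the normality test) would normally be used instead.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of positions i..j, converted to 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the rank-transformed data."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


# Hypothetical per-post values (assumption, for illustration only).
video_views = [0, 3, 12, 40, 7, 150]
engaged_users = [1, 2, 9, 30, 4, 35]
rho = spearman_rho(video_views, engaged_users)
print(f"Spearman rho = {rho:.3f}")
```

Since the coefficient depends only on ranks, it is insensitive to the heavy skew that caused the KPIs to fail the normality test.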
3.4.2. Cronbach’s Alpha
To examine the internal consistency of the user engagement metrics, a Cronbach’s alpha analysis was conducted using the KPIs of “3 s video views from organic posts,” “reach from organic posts,” and “other clicks”. The value of 0.99 indicates a high degree of reliability for further statistical modeling. A session refers to continuous user activity on the platform for approximately 30 min (
Table 5).
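For readers unfamiliar with the statistic, the following sketch shows one common way to compute Cronbach’s alpha, using the formula α = k/(k−1) · (1 − Σ item variances / variance of the summed score). The function name and the sample values are hypothetical and serve only to illustrate the calculation; they do not reproduce the value reported above.

```python
from statistics import variance


def cronbach_alpha(items):
    """Cronbach's alpha for k items, each given as a list of per-post scores."""
    k = len(items)
    n = len(items[0])
    # Total score per post: the sum of that post's values across all items.
    totals = [sum(item[i] for item in items) for i in range(n)]
    item_variance_sum = sum(variance(item) for item in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


# Hypothetical KPI values for five posts (assumption, not the study's data).
video_views = [0, 2, 5, 9, 14]
organic_reach = [10, 25, 60, 95, 150]
other_clicks = [0, 1, 2, 4, 6]
alpha = cronbach_alpha([video_views, organic_reach, other_clicks])
print(f"Cronbach's alpha = {alpha:.2f}")
```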
3.4.3. Linear Regression
Linear Regression is included in this research to address situations where a linear relationship between factors influencing predictions and user engagement is expected. It provides clear and straightforward estimates of each feature’s impact on the target. Linear Regression models the linear connection between one dependent variable and multiple independent variables. It offers interpretable coefficients and works well when variables have linear relationships. The Shapiro–Wilk test confirmed that the KPI values are not normally distributed; therefore, Spearman’s rank correlation was selected to evaluate the relationships between variables. The current approach emphasizes interpretability and practical explanations for marketers by focusing on direct, multivariate correlations between KPIs and “engaged users.” The selected models serve as a complementary analysis, enabling their application in social media analytics, where behavioral patterns coexist [
55,
56,
57,
58,
63].
3.5. Engagement Classification (RO2)
Users’ engagement is categorized into three classes based on the level of interactions within completed customer sessions created by online users, considering the total recorded instances. These categories are labeled “high engagement,” “medium engagement,” and “low engagement.” The engagement classes were determined using a quantile distribution of the dataset’s “engaged users” metric. Low engagement (0–1) indicates posts with minimal or no user interactions. Medium engagement (2–10) corresponds to posts that elicited limited engagement. High engagement (≥11) signifies posts that generated the highest engagement. This classification provides a straightforward and practical way to categorize posts, enabling valuable insights for more targeted SMM strategies.
The summary shows a distribution where engagement level is defined by the number of sessions per post: “low engagement” (1506 sessions), “medium engagement” (678 sessions), and “high engagement” (316 sessions), out of a total of 2500 (
Table 6).
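The thresholds stated above translate directly into a small labeling function. The sketch below assumes that “engaged users” is a non-negative integer count per post; the function name and the sample counts are illustrative.

```python
from collections import Counter


def engagement_class(engaged_users):
    """Map an 'engaged users' count to the class thresholds stated in the text."""
    if engaged_users < 0:
        raise ValueError("engaged_users must be non-negative")
    if engaged_users <= 1:
        return "low"      # 0-1 engaged users
    if engaged_users <= 10:
        return "medium"   # 2-10 engaged users
    return "high"         # 11 or more engaged users


# Hypothetical counts for a handful of posts (assumption, for illustration).
sample_counts = [0, 1, 2, 10, 11, 57]
labels = [engagement_class(c) for c in sample_counts]
print(Counter(labels))
```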
Figure 4 illustrates how each KPI is distributed across the low, medium, and high engagement classes. The median value rises as engagement level increases, indicating that posts with more video views tend to have higher engagement. The distribution is wider for low-engagement posts and becomes narrower for high-engagement posts, indicating a positive relationship between “3 s video views from organic posts” and increased engagement. The “reach from organic posts by engagement class” plot demonstrates that posts with higher organic reach are more likely to fall into the medium-to-high-engagement categories. The distribution is broad for low-engagement posts and narrow for highly engaging ones, further indicating a positive correlation between organic reach and higher engagement. The “other clicks” values are clustered near zero for low engagement but shift upward as engagement increases. Outliers in the high-engagement group suggest that some posts tend to generate exploratory interest.
3.6. Predictive Modeling (RO3)
The model selection process was based on the dataset’s structure. Therefore, Random Forest and XGBoost were chosen for their ability to handle multicollinearity and imbalanced classes. KNN was selected for its strong performance on small, structured datasets. Naïve Bayes was also chosen for its simplicity and efficiency with nominal data.
Random Forest (RF) is an ensemble learning technique that builds multiple decision trees and combines their outputs to form a single prediction. It efficiently captures non-linear relationships and feature interactions. RF was chosen for its robustness in handling structured social media metrics and class imbalance. It performs exceptionally well with datasets that include numeric and categorical variables and have non-linear predictor relationships. RF also effectively addresses overfitting. This study selected RF due to its minimal assumption requirements, efficiency with imbalanced datasets, ability to generate reliable classification results, and interpretability through variable importance metrics [
55,
56,
57,
58,
64,
65].
Furthermore, XGBoost (Extreme Gradient Boosting) is a scalable boosting model known for its highly predictive performance and ability to identify complex feature interactions. XGBoost provides high accuracy and robustness in user engagement classification. This research also utilizes it as an efficient and scalable application of gradient boosting decision trees. XGBoost excels at predictive tasks, particularly with tabular datasets, where accuracy, computational efficiency, and control over overfitting are essential. It also supports parallelized tree boosting, making it particularly suitable for managing complex datasets. Additionally, XGBoost can detect feature interactions and handle unbalanced classes [
55,
56,
57,
58,
64,
65].
K-Nearest Neighbors (KNN) is utilized for its efficiency in managing local data clusters, which aids in pattern recognition, particularly in small or non-parametric datasets, such as those employed in the study. The KNN classification model is also simple to understand and implement. KNN performs exceptionally well with small datasets where the computational effort to find several k-neighbors remains low. It classifies data based on the most common class among its nearest neighbors, making it a reliable choice for problems where simplicity is key. It works effectively when a straightforward linear or non-linear model cannot sufficiently define the decision boundary. By classifying records according to the direct values of labeled examples, it is easy to deploy. Therefore, it is suitable for datasets where the relationship between features and the target variable is complex and difficult to predict or classify using parametric models [
55,
56,
57,
58,
64,
65].
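The majority-vote rule described above can be sketched in a few lines. This is a simplified illustration with Euclidean distance, not the WEKA configuration used in the study; the function name, the value of k, and the toy posts are assumptions. In practice the features would be normalized first, since raw reach values would otherwise dominate the distance.

```python
from collections import Counter
from math import dist


def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    neighbors = sorted(
        zip(train_X, train_y), key=lambda pair: dist(pair[0], query)
    )[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical posts: (3 s video views, organic reach, other clicks).
train_X = [
    (0, 20, 0), (1, 35, 1), (2, 30, 0),
    (15, 200, 6), (18, 260, 8),
    (90, 900, 40), (120, 1100, 55),
]
train_y = ["low", "low", "low", "medium", "medium", "high", "high"]

print(knn_predict(train_X, train_y, (1, 25, 0)))
print(knn_predict(train_X, train_y, (100, 1000, 45)))
```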
NB is recognized for its computational efficiency and strong performance in high-dimensional spaces. Despite assuming attribute independence, NB performs well in real-world applications where these attributes are partially dependent, and its ability for rapid classification makes it a suitable choice. Naive Bayes is a type of supervised learning, meaning the model is trained with labeled data. NB is a straightforward classification algorithm designed to handle large datasets for real-time prediction scenarios. It also excels in processing high-dimensional data, where the number of features is significantly larger than the number of data instances. NB is also effective for categorical data and is frequently used in text mining, which could be relevant for future research on social media sentiment analysis. [
21,
55,
56,
57,
58,
63,
66,
67,
68].
Data are processed using Python libraries and WEKA 3, ensuring effective data handling, training, and testing of the models. To evaluate the models’ generalizability, the dataset was split into a 70% training set and a 30% testing set using stratified sampling to maintain class proportions. Low-engagement posts account for over 60% of the dataset, creating a class imbalance; thus, performance metrics included precision, recall, and F1-score measures [
55,
56,
57,
58]. Future studies could implement k-fold cross-validation or SMOTE-based balancing to further improve classifier performance.
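A split that maintains class proportions can be sketched as follows. The class counts are those reported for the three engagement classes; the function name, the random seed, and the rounding rule are assumptions made for illustration, and the study’s own tooling may split the data differently. scikit-learn’s `train_test_split` with its `stratify` argument is a common library alternative.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.3, seed=42):
    """Return (train_idx, test_idx) with each class split by the same fraction."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Class sizes as reported in the engagement class summary.
labels = ["low"] * 1506 + ["medium"] * 678 + ["high"] * 316
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Splitting within each class keeps the share of low, medium, and high posts nearly identical in the training and testing sets, which matters here because the minority classes are small.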
Data Pre-Processing and Preparation
Mean squared error (MSE) measures the average of the squared differences between predicted and actual values. Lower values indicate better model performance. RMSE is the square root of MSE, providing the error in the same units as the original variable. Lower values indicate higher accuracy. R² represents the proportion of variance in the target variable explained by the model. Values closer to one indicate a better fit.
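The three regression metrics follow directly from their definitions, as the sketch below shows. The function name and the sample values are hypothetical and unrelated to the study’s results.

```python
from math import sqrt


def regression_metrics(y_true, y_pred):
    """Compute MSE, RMSE, and R-squared for paired actual/predicted values."""
    n = len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    mse = ss_res / n
    return {"mse": mse, "rmse": sqrt(mse), "r2": 1 - ss_res / ss_tot}


# Hypothetical actual and predicted "engaged users" values (assumption).
actual = [2, 4, 6, 8]
predicted = [3, 3, 7, 7]
metrics = regression_metrics(actual, predicted)
print(metrics)
```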
Precision, recall, and F1-score evaluate correctness, completeness, and balance, respectively. Classification accuracy is the proportion of correct predictions among all predictions. Precision is the ratio of correctly predicted positive instances to all instances predicted as positive. Recall is the ratio of correctly predicted user engagement instances to all actual positive engagement instances. F1-score is the harmonic mean of precision and recall. AUC, or area under the curve, assesses a model’s ability to discriminate between classes. The higher these values, the better the model’s performance. Model key characteristics are summarized in
Appendix B (
Table A2) [
6].
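For a three-class problem, precision, recall, and F1-score are typically computed per class by treating one class as “positive” and the rest as “negative.” The sketch below illustrates this one-vs-rest calculation; the function name and the sample labels are assumptions, not the study’s predictions.

```python
def per_class_scores(y_true, y_pred, positive):
    """Precision, recall, and F1 for one class, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1


# Hypothetical actual and predicted engagement classes (assumption).
y_true = ["low", "low", "low", "medium", "medium", "high"]
y_pred = ["low", "low", "medium", "medium", "low", "high"]

for cls in ("low", "medium", "high"):
    print(cls, per_class_scores(y_true, y_pred, cls))
```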
3.7. Insights for Digital Campaign Optimization and Visualization (RO4)
The best-performing classifier was employed to create practical classification rules. The results, presented via box plots, scatterplots, and normality tests, help inform marketing strategies. These insights are designed to aid data-driven, actionable decisions in fashion retail social media marketing. The following section presents the analytical results of correlation testing, regression modeling, and classification models.
4. Results
4.1. Descriptive and Correlation Analysis (RO1)
The following results directly relate to the research objectives and hypotheses outlined earlier. The Spearman correlation test uses the “engaged users” variable as the primary focus, indicating the strength and direction of relationships between selected variables and the “engaged users” group. Each row includes the variable name, the correlation coefficient (ρ), and the p-value indicating statistical significance.
The results for “3 s video views from organic posts,” “reach from organic posts,” and “other clicks” showed strong correlations with “engaged users,” suggesting a statistically significant positive relationship and that [
54]: (1) Exposure to short video content can trigger greater user engagement. (2) Organic reach plays a role in optimizing engagement: organic results attract users’ attention and increase their interaction. (3) User interactions beyond the direct content are predictive of user engagement (
Table 7).
Scatterplots illustrate confirmed links between the “engaged users” and the independent features.
Figure 5 illustrates a positive relationship between short videos, increased organic reach, and clicking on other links, all of which led to higher user engagement. This suggests that as user actions increase, engagement also increases [
64,
65,
69,
70,
71,
72].
Figure 5 presents three scatterplots that visualize the bivariate relationships between each Key Performance Indicator (KPI) and the dependent variable, “Engaged Users.” These visualizations serve to illustrate the correlation strength and direction between independent engagement metrics and actual user interaction levels. Each subplot demonstrates the following relationships:
1. 3 s Video Views vs. Engaged Users
This plot highlights a strong positive linear correlation, implying that as the number of 3 s organic video views increases, the number of engaged users also increases. This supports Hypothesis 1 (H1), which states that short video views lead to increased user engagement, confirming that visual content leads to more user interactions.
2. Organic Reach vs. Engaged Users
Although organic reach is positively correlated with user engagement, the distribution appears more diffuse than that of video views. This implies that reach alone does not necessarily lead to engagement, as some posts may be widely viewed but still fail to receive a sufficient number of interactions. These visual results align with the regression results, which show that organic reach had a relatively weaker predictive impact.
3. Other Clicks vs. Engaged Users
This plot also demonstrates a strong positive correlation, supporting Hypothesis 3 (H3). As the number of other clicks increases, the number of engaged users also increases. This reflects user interest beyond the post itself, implying a deeper level of user interaction with the brand.
The Spearman correlation analysis results in
Table 7 support all three scatterplots. The visual evidence supports the conclusion that active, user-triggered metrics, such as views and clicks, serve as indicators of user engagement rather than passive exposure metrics, like post reach.
Linear Regression Results
Linear Regression predicts the value of a dependent variable based on one or more independent variables. The Mean Squared Error (MSE) calculates the average of the squares of the errors, representing the average squared difference between the predicted and actual values. Root Mean Squared Error (RMSE), or the square root of MSE, adjusts the error metric to the scale of the original values, making it easier to interpret. The coefficient of determination (R²) indicates the extent to which the independent variables explain the variability of the dependent variable (
Table 8) [
59,
60,
61,
62,
64,
65,
69,
70,
71,
72].
Table 9 shows the linear regression performance in predicting “engaged users” from each independent feature. It displays the regression coefficients from the Linear Regression analysis for each factor. Each coefficient indicates the expected change in “engaged users” associated with a one-standard-deviation increase in that variable, assuming all other variables remain constant. Positive coefficients indicate that an increase in these attribute values correlates with a higher user interaction rate.
However, the coefficient for “reach from organic posts” is negative, implying that greater reach without actual user interactions does not necessarily lead to increased engagement. This highlights the difference between passive exposure and active participation. The linear regression model provides decision-making guidelines that can help update content strategies by identifying which behaviors and content types manage to influence engagement.
In
Figure 6, the linear regression bar chart displays the standardized regression coefficients predicting user engagement. Each bar represents an independent variable, including the “3 s video views from organic posts,” “reach from organic posts,” and “other clicks.” The length and direction of each bar indicate the strength and direction of its impact on the value of “engaged users” (
Figure 6).
Positive coefficient values indicate that when an attribute’s value increases, the number of “engaged users” is expected to rise, assuming other variables remain constant. Conversely, negative coefficient values show an inverse relationship, where an increase in the feature’s value is associated with a decrease in user engagement.
The “3 s video views from organic posts” and “other clicks” variables, which have positive coefficients, are the strongest positive predictors of user engagement: the more short-video views and other clicks a post receives, the more user interactions it tends to generate. “Reach from organic posts” is also positively correlated with engagement in the Spearman analysis but is a weaker predictor in the regression model.
4.2. Classifiers Performance Assessment (RO 2,3,4)
4.2.1. Classification Accuracy
Table 10 and
Figure 7 display the performance results of the selected classifiers. They provide a comparative summary of the classification accuracy scores of RF, XGBoost, KNN, and NB, including metrics such as precision, recall, and F1-score, all shown as percentages. Classification accuracy (ACC) indicates the percentage of correctly classified instances out of the total cases. Precision measures the proportion of true positive predictions among all predicted positives, showing how well each classifier avoids false positives. Recall, or sensitivity, evaluates the proportion of correctly predicted positive cases out of all actual positives. F1-score is the harmonic mean of precision and recall, offering a balanced performance measure when both false positives and false negatives matter.
Among the classifiers, XGBoost demonstrated strong performance, with an accuracy nearly equal to RF and higher than KNN and NB. It achieved the highest scores across all metrics, indicating it as one of the most reliable models for predicting user engagement classes. Although RF, KNN, and NB are computationally efficient, their classification accuracies were lower.
Table 10 summarizes the strengths and scores of each model, supporting decision-making in social media strategies (
Figure 7) [
59,
60,
61,
62,
64,
65,
69,
70,
71,
72].
4.2.2. Confusion Matrices
Figure 8 shows the confusion matrices comparing the classification results of the RF, XGBoost, KNN, and NB models. It displays the number of correct and incorrect predictions for each engagement level: low, medium, and high. The cells along the diagonal represent accurate predictions, while the other cells indicate incorrect ones.
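A confusion matrix of the kind described above can be built by counting (actual, predicted) pairs. The sketch below uses hypothetical labels rather than the study’s predictions; rows correspond to actual classes and columns to predicted classes, which is an assumed layout since the figure’s orientation is not stated here.

```python
from collections import Counter

CLASSES = ["low", "medium", "high"]


def confusion_matrix(y_true, y_pred, classes=CLASSES):
    """Rows are actual classes, columns are predicted classes."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(actual, predicted)] for predicted in classes] for actual in classes]


def accuracy(matrix):
    """Share of predictions on the diagonal, i.e., correct predictions."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Hypothetical actual and predicted engagement classes (assumption).
y_true = ["low", "low", "low", "medium", "medium", "high"]
y_pred = ["low", "low", "medium", "medium", "low", "high"]

matrix = confusion_matrix(y_true, y_pred)
for name, row in zip(CLASSES, matrix):
    print(f"{name:>6}: {row}")
print(f"accuracy = {accuracy(matrix):.2f}")
```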
The XGBoost confusion matrix indicates strong overall accuracy, with most of its correct predictions falling in the low-engagement class. This suggests that XGBoost tends to favor the dominant class, which reflects the class imbalance in the data and increases precision for the low-engagement class.
This tendency is a liability when predicting the less common engagement classes. The NB model complements XGBoost by recognizing patterns in the minority classes, although it often produces false positives across most classes.
Although the RF, KNN, and NB confusion matrices show lower classification accuracy, they display higher sensitivity to the medium- and high-engagement classes. This provides a better balance in detecting all classes, albeit with a higher overall error rate. There is thus a trade-off between overall accuracy and per-class sensitivity, and the models can be chosen or combined according to the goals of the social media strategy (e.g., generic prediction versus personalized marketing).
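The trade-off between overall accuracy and per-class sensitivity can be illustrated with a minimal Python sketch. The two confusion matrices below are invented for illustration and are not the matrices shown in Figure 8; the names `MAJORITY_FAVORING` and `BALANCED` are hypothetical stand-ins for a model that leans toward the dominant class and one that is more sensitive to the minority classes.

```python
# Hypothetical confusion matrices (rows = actual class, columns = predicted class).
# The counts are invented for illustration and are NOT the study's results.
LABELS = ["low", "medium", "high"]

MAJORITY_FAVORING = [  # leans toward the dominant low-engagement class
    [79, 1, 0],
    [7, 5, 0],
    [4, 1, 3],
]
BALANCED = [  # more sensitive to the medium and high classes
    [66, 9, 5],
    [2, 9, 1],
    [1, 1, 6],
]


def accuracy(matrix):
    """Share of all posts that lie on the diagonal (correct predictions)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


def recall_per_class(matrix, labels):
    """Recall (sensitivity) per class: diagonal cell divided by its row total."""
    result = {}
    for i, label in enumerate(labels):
        row_total = sum(matrix[i])
        result[label] = matrix[i][i] / row_total if row_total else 0.0
    return result


for name, matrix in [("majority-favoring", MAJORITY_FAVORING), ("balanced", BALANCED)]:
    recalls = {k: round(v, 3) for k, v in recall_per_class(matrix, LABELS).items()}
    print(name, "accuracy:", round(accuracy(matrix), 3), "recall:", recalls)
```

In this invented example the majority-favoring matrix has the higher accuracy (0.87 versus 0.81) but the lower recall for the high-engagement class (0.375 versus 0.75), which is the kind of pattern described above.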
4.3. Rule-Based Suggestions
The following recommendations can help decision-makers and marketers prioritize KPIs that drive user engagement and enhance organic posts and content strategies for improved performance in SMM campaigns.
Table 11 presents a set of recommendations for marketers and decision-makers derived from the linear regression and classification models. Each suggestion emphasizes the KPIs that matter most for tracking the number of engaged users on a Facebook business page.
The Linear Regression model has helped generate the following suggestions:
Short video posts, reach from organic posts, and clicks on content (such as clicks on the page name, profile page, people’s names in comments, the like count, or timestamps) are highly effective in increasing user engagement and can readily be interpreted as signs of it.
The XGBoost classifier yields the following suggestions:
Posts that are similar to successful previous posts tend to generate the same level of user engagement.
Social media strategy should emphasize short videos and interactive call-to-action content (links, clickable text) to boost the chances of being classified as highly engaging.
XGBoost-based models can predict how a scheduled post is likely to perform once it is published.
Reach from organic social media posts does not necessarily lead to increased user engagement, click-through rates, or improved conversion optimization.
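The regression-derived suggestion above rests on a linear model relating engaged users to the three KPIs. As a minimal sketch of that kind of model, the Python example below fits ordinary least squares to a handful of invented posts and reports the in-sample R². The data, the names `POSTS` and `ENGAGED_USERS`, and the hand-written normal-equation solver are assumptions made here to keep the example free of external libraries; they are not the study's dataset or implementation, and the reported R² of the study is not reproduced.

```python
from statistics import mean

# Hypothetical posts: (3 s video views, organic reach, other clicks) -> engaged users.
# All numbers are invented for illustration; they are NOT the study's dataset.
POSTS = [
    (120, 1500, 30), (300, 2200, 45), (80, 900, 12), (450, 4000, 90),
    (200, 1800, 25), (600, 5200, 110), (50, 700, 8), (350, 3100, 60),
]
ENGAGED_USERS = [70, 120, 41, 210, 80, 268, 27, 155]


def fit_ols(rows, targets):
    """Return [intercept, b1, b2, ...] by solving the normal equations (X'X) b = X'y."""
    design = [[1.0, *row] for row in rows]  # constant column for the intercept
    n = len(design[0])
    # Augmented matrix [X'X | X'y]
    aug = [
        [sum(r[i] * r[j] for r in design) for j in range(n)]
        + [sum(r[i] * y for r, y in zip(design, targets))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][n] for i in range(n)]


def predict(coefficients, row):
    """Intercept plus the weighted sum of the KPI values."""
    return coefficients[0] + sum(b * x for b, x in zip(coefficients[1:], row))


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    avg = mean(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - avg) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


coefficients = fit_ols(POSTS, ENGAGED_USERS)
fitted = [predict(coefficients, row) for row in POSTS]
r2 = r_squared(ENGAGED_USERS, fitted)
print("intercept and coefficients:", [round(c, 4) for c in coefficients])
print("in-sample R^2 on the toy data:", round(r2, 4))
```

Because the toy targets were constructed to be nearly linear in the three KPIs, the in-sample R² is close to 1; on real data a held-out evaluation would be the more informative check. In practice a statistics library would normally replace the hand-written solver.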
5. Discussion
The findings build on the engagement research discussed in the Related Works section. Consistent with earlier studies [
23,
24,
26], the current results indicate that content characteristics are key factors in user engagement. Specifically, in fashion social media [
26], short videos (such as 3 s clips) boost a post’s reach or impressions, and click-through interactions are among the most influential factors for user engagement on Facebook [
23,
24]. This consistency with previous research suggests that engagement is more strongly driven by user-triggered interactive content than by reach or impression metrics, especially when targeting young users [
26].
This study, consistent with previous research on social media analytics [
12,
27,
28,
29,
30,
31,
32], shows that the XGBoost classifier achieved the highest classification accuracy for user engagement levels (≈94.73%), demonstrating its efficiency with small, structured datasets. This confirms that gradient-boosted tree ensembles perform well in this domain [
12,
27,
28,
29,
30,
31,
32]. These findings contrast with results from different contexts (e.g., high-dimensional fraud detection) where NB can outperform KNN, highlighting that optimal model choice heavily depends on the specific domain [
12,
27,
28,
29,
30,
31,
32]. The NB model, although less accurate, demonstrated higher sensitivity in detecting the less common (medium- and high-engagement) classes, indicating that the caution expressed in previous research about relying solely on broad performance metrics in complex social media cases is justified [
30,
31,
32,
33,
34].
The current results indicate that specific KPIs, including video views, user reactions, and the type of social media post, have a significant influence on user engagement. XGBoost achieved the highest classification accuracy for user engagement levels due to its ability to learn from small, labeled datasets [
48]. RF also performed well, especially in handling imbalanced classes, consistent with previous research that shows its sensitivity to probabilistic cases of feature distributions [
44,
45]. These findings also agree with earlier studies on social media user engagement and classification accuracy using machine learning models [
46,
47].
The analysis of user engagement revealed that organic reach, 3 s video views, and other clicks have a significant impact on engagement class predictions. The Linear Regression results showed that reach and other clicks possess predictive value, in line with previous studies that emphasize their role in user interaction metrics [
21,
24].
In comparison to related research in fashion retail and tourism marketing, our results confirm and extend the findings of earlier studies. For example, Jankovic and Curovic (2023) noted that digital consumer engagement can be effectively modeled using simple performance indicators, such as views and clicks, especially when personalized content is involved [
42]. Our results complement this by showing how these variables not only relate to engagement but also act as reliable predictors in classification models.
Ensemble models, such as Random Forest and XGBoost, outperformed simpler classifiers (e.g., Naïve Bayes and KNN), in line with earlier studies by Kaur and Kumari (2020), who noted that ensemble methods are more robust in social media environments with non-linear user behavior patterns [
35]. Likewise, Xia et al. (2024) highlighted the adaptive nature of AI in behavior prediction tasks, supporting our use of ensemble methods to manage seasonal fluctuations in engagement [
40]. While earlier works using Naïve Bayes found its performance to be acceptable for high-dimensional but independent features [
12,
33], our results confirm that the feature-independence assumption limits NB’s effectiveness in scenarios involving interdependent KPIs, such as post reach and video views.
Furthermore, our rule extraction for classification, especially from tree-based models, provides a practical link between statistical insight and strategic marketing use, aligning with the work of Magableh et al. (2024), who highlighted data-driven marketing personalization as a key element of sustainable financial performance [
41].
This comparison highlights the study’s contribution to social media analytics by combining real-world Facebook retail data, traditional statistical testing, and advanced machine learning. It demonstrates that model interpretability, accuracy, and alignment with user behavior trends are essential for optimizing marketing campaigns in a digital retail environment. The empirical results support Engagement Theory by showing that user engagement increases when content features interactive elements—such as short videos and clickable elements—designed to capture attention and encourage involvement. Similarly, Social Exchange Theory is confirmed through the observed pattern that users are more likely to interact with content when they receive informative or emotional value in return, highlighting the reciprocal nature of digital engagement.
5.1. Research Limitations
Because Facebook remains the most recognizable platform for e-commerce purchases and B2C engagement, it serves as a good case study for detailed engagement analysis. Although the study offers valuable insights, several limitations constrain its generalizability. First, the dataset was limited to a single Facebook business page in the fashion industry, which reduces the ability to apply the results across other industry sectors or social media platforms. Second, the class distribution was uneven, with many posts showing low engagement; this imbalance affected the classifiers’ ability to predict medium- and high-engagement levels accurately. Third, the analysis is restricted to KPIs and includes no qualitative factors that could help explain increased user interactions.
5.2. Practical Implications
These insights give marketers an opportunity to optimize user engagement by publishing engaging videos, call-to-action content, and clickable content. Additionally, data mining models can be applied to identify high-performing posts and potential user engagement by integrating organic and paid data insights, which can then be categorized and transformed into actionable strategies (e.g., e-commerce firms can predict user engagement before publishing). Based on these predictions, audience targeting and user personalization can be implemented using real-time analytics. Regression rules can also aid campaign planning. Businesses can leverage the current findings to support data-based decision-making for content creation, post scheduling, audience segmentation, and post performance optimization. Focusing on specific KPIs, such as video views and clicks, would enable marketers to create optimized content that increases user engagement. Predictive analytics models provide a framework that enables businesses to evaluate the performance of posts before publication, leading to more effective budget management, optimized marketing strategies, and personalized user experiences. Ultimately, the data insights support a shift from reacting to revenue outcomes toward more proactive SMM tactics.
5.3. Future Research
Future work will expand the current methodology by including more businesses, enlarging the dataset, and applying the same approach to other social media platforms. Additionally, more data mining classifiers will be included in the performance assessment, and both quantitative and qualitative data, including textual and emotional information, will be explored. Combining quantitative and qualitative research, with engagement metrics used alongside content analysis, can lead to further optimization of SMM [
73]. While the current study provides strong quantitative insights into user engagement using KPIs and predictive analytics, it lacks qualitative dimensions such as users’ motivations, emotional responses, and interpretive behaviors. These aspects are essential for a more holistic understanding of user engagement but fall outside the scope of the data used. Future research could employ mixed-method approaches and sentiment analysis techniques to examine how emotional or psychological factors influence user engagement, thereby enabling a more nuanced and contextual interpretation of user behavior.
5.4. Ethical Considerations and Trends
The circular economy policy recommends that user engagement KPIs be viewed as metrics aligned with sustainable digital behavior. Circular e-commerce actions encompass user recommerce, low-waste logistics, and extended digital product lifecycles, which in turn influence user engagement strategies and ethical concerns related to AI and marketing [
73].
6. Conclusions
This study examined how specific key performance indicators (KPIs) influence user engagement on a Facebook business page, using real-world data from a fashion retail brand operating in tourist locations. By applying both regression and supervised machine learning models, including linear regression, Random Forest (RF), Extreme Gradient Boosting (XGBoost), K-nearest neighbors (KNN), and Naïve Bayes (NB), the authors assessed the predictive value of three performance metrics: 3 s video views, organic reach, and other clicks.
The results showed that short video views and other clicks are the most significant predictors of user engagement, aligning with previous findings on user interaction performance metrics in social media marketing [
23,
26,
40]. The XGBoost model achieved the highest classification accuracy (≈94.73%), confirming its suitability for small, labeled datasets with user interaction characteristics [
12,
27,
28,
29,
30,
31,
32]. These findings support the research hypotheses (H1–H3) and are consistent with Engagement Theory and Social Exchange Theory, which indicate that relevance and user engagement actions are key drivers of digital engagement [
19,
20,
21].
The study also supports previous insights that key performance metrics such as reach may not, on their own, be able to predict engagement, as they often lack evidence of behavioral intention [
24]. Linear regression analysis revealed strong predictive power (R² ≈ 0.98) for all three KPIs, highlighting their significance in marketing optimizations.
From a practical perspective, the insights provide data-driven recommendations for marketers and decision-makers, including increasing the use of short-form video content and incorporating interactive content elements, as well as refining audience targeting strategies. These suggestions are particularly relevant in tourism retail, where seasonality, impulsive purchasing, and visual appeal significantly influence user behavior [
42,
48]. Rule-based suggestions derived from the machine learning models can also help predict post performance before publication, enabling businesses to anticipate campaign outcomes and allocate resources dynamically.
However, the study’s scope is limited to a single fashion retailer’s Facebook data, which may affect the generalizability of the results. Consumer behavior across other industries, platforms (e.g., Instagram, TikTok), or regions may exhibit different patterns. Thus, future research should incorporate diverse datasets from multiple sectors and geographical contexts to validate and extend the findings. Future studies could also explore cross-platform user behavior, temporal dynamics of engagement, or the ethical implications of AI-driven personalization in tourism and retail. Expanding the modeling framework to include explainable AI (XAI) techniques may also enhance transparency in decision-making processes for both marketers and consumers [
40,
74].
Integrating behavioral and emotional aspects will help create a more comprehensive understanding of user engagement in AI-based marketing [
41,
47]. In the broader context of tourism marketing, these findings underscore how AI-based engagement models can enhance customer retention, personalization, and sustainability. As the tourism and retail sectors become increasingly digital, predictive analytics will play a crucial role in shaping data-driven strategies aligned with consumer behavior and experience optimization.
In conclusion, this study adds to the existing literature on predictive analytics in social media marketing strategies by focusing on KPIs and demonstrating how machine learning models can predict user engagement. As e-commerce continues to grow, a data-driven approach that combines precision with ethical insight will enhance user experiences and satisfaction while also strengthening brand–consumer interactions.