Decoding the Symmetry of Influence: A Machine Learning Study of Reading Exposure and Social Attitudes Across Social Groups

Wang, Yuanqing; Chen, Hao; Zhao, Wei; Zhang, Qixia

doi:10.3390/sym17060900

¹ School of Educational Science, Jiangsu Second Normal University, Nanjing 211200, China

² IEBIS, Department of High-Tech Business and Entrepreneurship, Faculty of BMS, University of Twente, 7522 NB Enschede, The Netherlands

³ Faculty of Business Administration, Turiba University, LV-1058 Riga, Latvia

⁴ Department of Computer Science, UiT the Arctic University of Norway, 9019 Tromsø, Norway

* Author to whom correspondence should be addressed.

Symmetry 2025, 17(6), 900; https://doi.org/10.3390/sym17060900

Submission received: 29 April 2025 / Revised: 29 May 2025 / Accepted: 3 June 2025 / Published: 6 June 2025

(This article belongs to the Special Issue Applications Based on Symmetry in Machine Learning and Data Mining)

Abstract

The relationship between reading exposure and social attitudes across demographic groups remains a pivotal yet underexplored topic in computational social science. This study adopts a machine learning framework to examine the symmetry of reading’s influence on social attitude formation. Models including Random Forest, XGBoost, LightGBM, and linear regression were employed on data from the 2021 Chinese General Social Survey (CGSS). The results show that reading volume is a key predictor of social attitudes. Moreover, a SHAP-based subgroup analysis revealed that the impact of reading exposure remained stable across gender groups, indicating a symmetric pattern of cognitive influence. This study proposes a methodological pipeline for assessing the symmetry of feature importance in social data, offering actionable insights for researchers and policymakers into the equitable role of media consumption in shaping social cognition.

Keywords:

reading exposure; social attitudes; machine learning; symmetry; SHAP; behavioral prediction; computational social science

1. Introduction

As global tensions rise, environmental crises deepen, and technological change accelerates, social attitudes play a crucial role in shaping our collective future, influencing everything from policy decisions to interpersonal relations [1]. Social attitudes are commonly considered individual evaluations of various events and groups, influencing how people think and act in particular directions [2]. Multiple factors contribute to shaping these attitudes, such as media exposure, educational experiences, and cultural background. Among these factors, reading has gained special attention from researchers. Studies have shown that people who read more tend to have less prejudice and more open worldviews [3]. But not everyone has the same reading opportunities or motivations [4]. This creates different (sometimes opposite) effects across population groups [1].

While the relationship between reading exposure and social attitudes has gained considerable attention, the existing research has predominantly relied on traditional analytical approaches such as linear regression [5] or logistic regression [6]. These methods effectively identify direct associations but often fail to capture non-linear interaction effects and demographic-specific patterns. Machine learning offers promising alternatives for analyzing these complex relationships. Computational social scientists have increasingly employed ensemble techniques to analyze large-scale survey data, such as the Chinese General Social Survey (CGSS) [7]. These methods have performed well in identifying feature importance and detecting implicit interactions [8], but their “black-box” nature limits the interpretability of their results. SHAP (SHapley Additive exPlanations) values provide a solution by enhancing model transparency while maintaining analytical power. This study employs multiple machine learning models (Random Forest, XGBoost, LightGBM) with SHAP values to analyze CGSS data. The impact of reading on social attitudes across diverse social groups is explored, with a particular focus on assessing the potential symmetry or asymmetry in these effects. To guide this exploration, this study addresses the following research questions:

To what extent does reading volume serve as a predictor of social attitudes, after accounting for relevant sociodemographic and behavioral factors?

Do machine learning models (Random Forest, XGBoost, LightGBM) offer an improved predictive performance for social attitudes compared to that of traditional linear regression models?

Does the influence of reading volume on social attitudes exhibit symmetrical or asymmetrical patterns across key demographic subgroups?

Consequently, understanding the nuances of these relationships—specifically, whether reading’s influence is consistent or varies across diverse populations—carries direct practical implications. For instance, if reading volume consistently fosters more inclusive social attitudes, this would bolster arguments for universal reading promotion programs. Conversely, if the effects differ significantly based on demographic factors, this would underscore the need for targeted interventions. This study therefore aims to provide actionable insights into how media consumption, particularly reading, shapes social attitudes within a complex societal fabric, informing both educational policy and social intervention strategies.

This study makes several key contributions. First, it empirically demonstrates the robust predictive power of reading exposure in shaping social cognition. Second, it reveals detailed symmetrical and asymmetrical patterns of reading’s impact across diverse demographic subgroups. Third, it advances an interpretable machine learning pipeline tailored to analyzing large-scale survey data. The remainder of this paper is structured as follows: Section 2 reviews the relevant literature; Section 3 details the data and methodological approach; Section 4 presents the empirical results; Section 5 discusses the findings, their implications, and limitations; and Section 6 provides the conclusion.

2. The Literature Review

2.1. The Link Between Reading and Social Attitudes

The influence of reading on social attitudes has been widely investigated in the social sciences. Studies have shown that individuals who read more exhibit greater empathy and increased tolerance and understanding towards out-group members [9,10]. They also show more prosocial behaviors and higher social engagement [11]. One possible reason is that reading offers knowledge and experiences beyond everyday life. These experiences can affect how people see the world, broaden their horizons, and shape the way they think [12]. Fiction reading, in particular, has been shown to reduce prejudice, even in regions with longstanding socio-cultural inequalities [13]. Similarly, Oľhová et al. [14] found that exposing students to fiction can cultivate more tolerant intergroup attitudes.

The evidence for this relationship mostly comes from studies relying on cross-sectional designs. Surveys and behavioral experiments have been employed to explore the direct impact of reading on social attitudes, including potential variations by reading material type [15]. However, these methods have limitations. Cross-sectional designs and traditional linear statistical methods have a limited capacity to capture complex patterns. For example, they may not capture the non-linear effects of reading and the interactions with complex demographic variables. Thus, the external validity of these findings may be limited [9]. A small number of longitudinal studies have explored the long-term effects of reading [10]. However, these studies usually use small sample sizes, limiting the generalizability of their findings [16]. Moreover, existing studies have insufficiently explored how reading effects might vary across diverse social groups. Although sociodemographic variables were included in some studies as control variables, few studies have examined systematic variations in the impact of reading across various populations [17].

These limitations may lead to an insufficient understanding of the effects of reading on social attitudes, limiting the interpretability and application of the results.

2.2. Symmetry and Asymmetry in Reading’s Effects Across Social Groups

Understanding the consistency or divergence of reading’s effects across different social groups is crucial for obtaining a more comprehensive picture of how reading changes social attitudes. Sociodemographic variables such as age, gender, education level, and cultural background are considered fundamental indicators for characterizing group features and predicting the effects of media exposure [18]. These factors can shape how people acquire information, why people read, and how they think. For example, studies have shown that individuals with higher education levels use more logical thinking when forming social attitudes through reading, exhibiting a 23% greater depth of cognitive processing than those with lower education levels [19]. Additionally, it has been found to be easier for adolescents (aged 13–18) to change their attitudes in response to media messages [20]. Cultural factors and ethnicity also matter. In a cross-cultural study, Suzuki et al. [17] found that reading novels affected people’s stereotypes differently depending on their culture. In the UK group, reading novels was clearly linked to stereotype changes. But this link was not found in the Japanese group.

Besides these studies highlighting its differential impacts (asymmetry), the importance of influence symmetry has also been noted. Influence symmetry refers to a phenomenon where different groups exhibit similar patterns of attitudinal and behavioral changes after exposure to the same media content [21]. Several studies have found that media exposure can affect different cultural or demographic groups in similar ways. For example, Liu [22] found that during the COVID-19 pandemic, mass media exposure significantly increased people’s preventive behaviors in both Wuhan and other regions, demonstrating cross-regional consistency. Similarly, Armutcu et al. [23] observed a cross-national symmetrical effect regarding the positive influence of social media marketing on brand awareness and purchase intentions. These findings suggest that the influence of certain media exposures can go beyond group differences.

Researchers generally recognize that influence symmetry offers valuable perspectives for understanding the cognitive processes in attitude formation. But few studies have explored this within the specific context of reading’s impact on social attitudes. Other media studies that have investigated influence symmetry have mostly relied on traditional statistical methods and have often focused on single or limited demographic variables [24]. Thus, it is difficult to capture the complex interactions among multidimensional population characteristics. Although a few machine learning studies have attempted to explore symmetry using techniques such as stratified sampling [25] and interaction terms [26], research quantifying the degree of influence symmetry in the effects of media across different groups remains limited [27].

These limitations suggest that future research should employ advanced methods, such as machine learning, to quantify and analyze the symmetry and asymmetry of media’s influence across different social groups, thereby offering a more comprehensive explanation of the mechanisms underlying attitude formation.

2.3. Machine Learning in Psychological Research

Machine learning (ML) is now widely used in psychology and other social sciences. For a long time, psychological research relied mainly on traditional statistical methods like multiple linear regression. These methods are simple and efficient but require strict linear relationships between variables, which limits their applicability when dealing with complex, non-linear data [28]. Real-world psychological phenomena are often more complex. Therefore, standard software like SPSS or Mplus might struggle to analyze these complex data structures. Instead, ML techniques offer another option in dealing with large datasets. These methods show great capabilities in modeling and identifying complex non-linear relationships among variables [29]. ML models also show a strong generalization performance. They can make accurate predictions on new, unseen data beyond the specific training set [30]. Therefore, ML can effectively capture complex patterns and dynamic relationships in social science data. The utility of ML in social sciences is increasingly recognized, offering powerful tools for analyzing complex societal data and human behaviors [31], with applications expanding rapidly across diverse fields such as entrepreneurship research [32]. The superiority of ML models has been demonstrated in numerous psychological studies [33,34], offering new analytical insights for psychological research.

ML methods can be categorized based on how they use data labels. The main types are supervised learning, unsupervised learning, and semi-supervised learning [35]. Supervised learning uses labeled data. An algorithm learns from examples where the inputs are matched with the correct outputs [36]. The goal is to train a model that can predict labels for new, unlabeled data. Supervised learning includes classification and regression tasks. Common algorithms include k-Nearest Neighbors (KNN), logistic regression (LR), and Random Forest (RF). Unsupervised learning, conversely, works with unlabeled data. It aims to find hidden structures or patterns within the data itself. The algorithm explores the data without predefined labels [37]. This includes methods for clustering (e.g., K-means), finding unusual data points (e.g., One-Class SVM), and reducing the data’s complexity (e.g., PCA). Semi-supervised learning combines both approaches. It uses a mix of labeled and unlabeled data to improve the learning outcomes. It typically leverages a larger amount of unlabeled data to improve the learning process, potentially leading to a better model performance than that when using the labeled data alone [38]. Semi-supervised learning includes algorithms such as Semi-Supervised Support Vector Machines (S3VMs) and semi-supervised clustering.

Despite its strengths, ML faces a major challenge known as the “black-box” problem [39]. This refers to the difficulty of understanding how complex models arrive at their predictions. High accuracy often comes with low interpretability. This lack of clarity is a critical issue in social science. Additionally, using many related features in training can cause multicollinearity problems. Furthermore, algorithms may inadvertently learn or perpetuate hidden biases present in the training data, which can be difficult to detect [40]. These limitations restrict the adoption of ML models in social science domains requiring clear decision-making, such as educational assessment or clinical diagnosis [41].

To address these challenges, Explainable Artificial Intelligence (XAI) has emerged. XAI provides methods to make ML models more transparent and understandable [42]. One prominent XAI tool is SHAP (SHapley Additive exPlanations) analysis. SHAP values help quantify the contribution of each input feature to a specific prediction. This helps bridge the gap between prediction accuracy and interpretability.

2.4. Shapley Additive Explanations (SHAP) in Social Science Research

SHapley Additive exPlanations (SHAP) is an Explainable Artificial Intelligence (XAI) technique based on cooperative game theory. Lundberg and Lee [43] proposed SHAP to address the limitations of traditional machine learning model interpretation. The core principle of the SHAP method involves treating the prediction outcome as a “game”, where each feature acts as a “player” contributing its value to the final result. It calculates the relative importance of each feature by systematically considering all possible feature coalitions [44]. Mathematically, SHAP values represent a fair way to share the prediction among the features. The calculation follows clear logical steps: starting from an empty coalition, the features are added one at a time in a random order, the marginal contribution of each feature is recorded as it joins, and these marginal contributions are averaged over the orderings [45]. The SHAP scores for each prediction have three important properties: Missingness, Consistency, and Local Accuracy [46]. These properties allow SHAP to provide feature explanations at both the local (instance-specific) and global (model-wide) levels [47]. Compared to other interpretation tools, SHAP has advantages in revealing feature interactions, quantifying the contribution of each feature to the prediction outcome, and presenting complex feature relationships through diverse visualizations [48]. These characteristics also make SHAP a potentially effective tool for addressing issues related to feature multicollinearity [49].
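The coalition-based calculation described above can be illustrated with a minimal sketch. It computes exact Shapley values by brute force over every feature ordering. The toy model, the instance `x`, and the fixed zero baseline are hypothetical choices made for illustration only. SHAP itself defines a missing feature through an expectation over background data, and libraries use far more efficient algorithms than this factorial-time loop.

```python
from itertools import permutations
from math import factorial


def shapley_values(value_fn, n_features):
    """Exact Shapley values: average each feature's marginal
    contribution over all possible orderings of the features."""
    phi = [0.0] * n_features
    for order in permutations(range(n_features)):
        coalition = set()                      # start from the empty coalition
        previous = value_fn(frozenset(coalition))
        for j in order:                        # add features one at a time
            coalition.add(j)
            current = value_fn(frozenset(coalition))
            phi[j] += current - previous       # marginal contribution of j
            previous = current
    n_orderings = factorial(n_features)
    return [p / n_orderings for p in phi]


# Hypothetical toy model with one interaction term (x0 * x2).
def toy_model(z):
    return 2 * z[0] + z[1] + z[0] * z[2]


x = [1.0, 3.0, 2.0]          # instance to explain (illustrative values)
baseline = [0.0, 0.0, 0.0]   # assumption: absent features are set to zero


def value_fn(coalition):
    # Features in the coalition take their real value, the rest the baseline.
    z = [x[i] if i in coalition else baseline[i] for i in range(len(x))]
    return toy_model(z)


phi = shapley_values(value_fn, len(x))
print(phi)
print(sum(phi), toy_model(x) - toy_model(baseline))
```

In this toy case the interaction term is shared equally between the two features involved, and the attributions sum to the difference between the prediction and the baseline prediction, which is the Local Accuracy property mentioned above.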

The development of tools like SHAP marks a new stage in social science research methodologies. These techniques help address the dilemma of the balance between the predictive performance and explanatory power of machine learning methods [50]. For example, in the field of health research, Sun et al. [51] utilized a SHAP value analysis to quantify and reveal the different risk factor weights for disease predictions across age and gender subgroups. This application helped guide more precise medical treatments. In policy evaluation studies, Chatzimparmpas et al. [52] developed an interactive visualization system based on SHAP, showing how the policy effects varied for different groups. This gave policymakers more detailed evidence for decision-making. In the mental health domain, researchers have used a SHAP analysis with ensemble learning models to build interpretable prediction models for suicide attempts. This approach improved prediction accuracy and provided clinicians with clear explanations of contributing risk factors [47].

SHAP shows great promise for social science, but challenges remain. Pessach and Shmueli [53] noted that the current research has paid little attention to the symmetry in feature effects. Future research should explore SHAP’s potential in handling data symmetry and asymmetry across diverse social contexts, assessing model fairness, and supporting the cross-disciplinary integration of knowledge. Using SHAP effectively in these ways will increase its value for studying complex social science problems [45,48].

2.5. The Present Study

In response to the existing limitations, this study employed machine learning models, specifically Random Forest, XGBoost, and LightGBM, alongside traditional linear regression, to analyze the CGSS2021 dataset. This analysis incorporated reading volume and relevant sociodemographic factors to investigate their influence on social attitudes, with particular emphasis on ecological validity.

The aim of this study is to examine the potential symmetry and asymmetry in reading volume’s impact on social attitudes across different population subgroups. To support this examination, we utilized SHAP for model interpretability. This approach allowed for a clear assessment of whether the influence of reading volume was consistent or varied significantly among different groups.

To address the objectives outlined above, this study will test the following hypotheses:

H1:

Reading volume positively predicts social attitudes, with higher reading exposure associated with more open or progressive attitudes.

H2:

Machine learning models (Random Forest, XGBoost, LightGBM) will demonstrate a superior predictive performance in modeling social attitudes compared to that of a traditional multiple linear regression model.

H3:

The influence of reading volume on social attitudes will exhibit varying patterns of symmetry and asymmetry across demographic groups.

Expected symmetrical effects: It is proposed that reading volume will exert a consistent (symmetrical) influence on social attitudes across certain demographic groups. This expectation is based on the premise that the fundamental cognitive mechanisms engaged by reading may operate similarly irrespective of these specific group distinctions, reflecting potentially universal aspects of media’s influence [22].

Expected asymmetrical effects: Conversely, it is proposed that the influence of reading volume on social attitudes will vary (exhibit asymmetry) across other demographic groups. This variability is expected due to moderating factors such as differences in educational attainment [19], distinct life experiences and social roles [4], and diverse cultural backgrounds [17].

3. Methods

3.1. The Data Source

This study utilized data from the 2021 Chinese General Social Survey (CGSS2021). The Chinese General Social Survey (CGSS) is a nationally representative survey launched by Renmin University of China. The CGSS aims to collect quantitative data to measure the growing complexity of society and provide a national resource for policymakers, researchers, educators, and practitioners.

The survey employed a multistage stratified probability sampling method to accurately represent the country’s diverse population and geography. This method allowed the survey to include people from both urban and rural areas, capturing the varied population and geographical differences across the country. The CGSS covers a wide range of topics, such as education, employment, family structure, social attitudes, social trust, and quality of life. The 2021 survey wave, used in this study, encompassed 28 provincial-level administrative regions across China and initially yielded 8148 valid responses.

3.2. The Participants

The CGSS (2021) sampled adults aged 18 and older in mainland China. For the current analysis, specific exclusion criteria were applied based on theoretical and methodological considerations. First, participants aged 70 years or older were excluded. This decision was based on established research indicating an age-related decline in social cognitive processing [54], which suggests potential declines in the abilities relevant to both reading behavior recall and attitude formation. Moreover, this age threshold addresses potential cohort effects, as older Chinese adults who experienced the Cultural Revolution may exhibit fundamentally different relationships with reading and social attitudes [55]. Second, to ensure the robustness of the machine learning models, individuals with missing data on key study variables (namely, reading volume, the social attitude composite, and crucial sociodemographic predictors) were removed from the sample. Missing data included non-responses or responses coded as “refused,” “don’t know,” or “not applicable.”

After applying these criteria, the final sample for the analysis comprised 2698 participants (1258 male). The mean age of the sample was 50.28 years (SD = 13.06).

3.3. Measures

This study selected variables based on their theoretical relevance and empirical evidence from prior research on social attitudes and media exposure [1,4]. The selection methodology involved identifying factors with established links to attitude formation in the literature on social psychology and computational social science, ensuring their alignment with the research goal of examining reading exposure’s influence. Below, the dependent and independent variables are described, along with their relevance to this study.

3.3.1. Dependent Variable: Social Attitudes

The primary dependent variable in this study is social attitudes. To measure this, a composite score was created based on a series of specific items primarily drawn directly from the ‘Social Attitudes’ section of the Chinese General Social Survey (CGSS) 2021 questionnaire. The selection of these items was guided by their relevance in reflecting individuals’ evaluations of social issues, encompassing everyday social understanding and perspective-taking, making them a suitable outcome variable for this research.

The selected items covered three domains:

Attitudes towards gender roles (e.g., “Men should prioritize career, women should prioritize family”; “Household chores should be shared equally”).

Attitudes towards marriage (e.g., “It is not necessary to have children after marriage”; “A bad marriage is better than being single”).

Attitudes towards family (e.g., “A wife helping her husband’s career is more important than pursuing her own”; “Children should do things that bring honor to parents”).

The participants responded to these items using a 5-point Likert scale (1 = strongly disagree, 5 = strongly agree). Several items reflecting traditional views were reverse-scored. This ensured that higher scores consistently indicated less traditional or more open social attitudes. The Cronbach’s alpha coefficient for the combined set of selected items was 0.79, indicating acceptable internal consistency. To validate the social attitudes measure further, a confirmatory factor analysis (CFA) was conducted. The CFA demonstrated an acceptable model fit (CFI = 0.92, TLI = 0.89, RMSEA = 0.067, SRMR = 0.055), indicating that the items effectively represented a unified construct. Therefore, the scores on these items were averaged to create a composite social attitude score. Higher scores on this composite measure were interpreted as reflecting less traditional, more open social attitudes.
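The scoring steps described above (reverse-scoring, averaging into a composite, and checking internal consistency) can be sketched as follows. The responses and the choice of which item is reverse-scored are hypothetical, and the alpha computed here describes only this toy data, not the value reported for the CGSS items.

```python
from statistics import mean, variance


def reverse_score(value, low=1, high=5):
    """Flip a Likert response so that 1 <-> 5, 2 <-> 4, and 3 stays 3."""
    return low + high - value


def cronbach_alpha(rows):
    """Cronbach's alpha from a respondents-by-items table of scores."""
    k = len(rows[0])
    item_variances = [variance([r[j] for r in rows]) for j in range(k)]
    total_variance = variance([sum(r) for r in rows])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical responses: 4 respondents x 3 items on a 1-5 scale.
raw = [
    [5, 1, 2],
    [4, 2, 2],
    [2, 4, 5],
    [1, 5, 4],
]
REVERSED_ITEMS = {0}  # assumption: item 0 is worded in the traditional direction

scored = [
    [reverse_score(v) if j in REVERSED_ITEMS else v for j, v in enumerate(row)]
    for row in raw
]
composite = [mean(row) for row in scored]  # higher = less traditional
alpha = cronbach_alpha(scored)

print(scored)
print(composite)
print(round(alpha, 3))
```

Reverse-scoring must happen before both the alpha calculation and the averaging. Otherwise, items worded in opposite directions cancel each other out.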

3.3.2. Independent Variables: Reading Volume

The primary independent variable was reading volume. To measure this, we selected item A30a from the CGSS questionnaire: “In the past 12 months, including both print and electronic formats, how many books have you read in total?” This specific question was chosen because it directly and explicitly captures the quantity of books read by the participants, providing a clear indicator of their reading volume. The participants provided a specific numerical answer. Following the standard practice in related research, these numbers were log-transformed and then converted into standardized Z-scores for use in the analyses.
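A minimal sketch of this transformation is given below. The text does not state how a report of zero books was handled before taking logarithms, so the use of log(1 + x) here is an assumption, as is the use of the population standard deviation for the Z-score. The book counts are made up for illustration.

```python
from math import log1p
from statistics import mean, pstdev

# Hypothetical answers to the "books read in the past 12 months" item.
books_read = [0, 1, 3, 12, 50]

# Assumption: log(1 + x) keeps respondents who read zero books defined.
logged = [log1p(b) for b in books_read]

# Z-score: subtract the mean and divide by the (population) standard deviation.
mu = mean(logged)
sigma = pstdev(logged)
reading_z = [(v - mu) / sigma for v in logged]

print([round(z, 3) for z in reading_z])
```

The log step compresses the long right tail typical of count data, so a few very heavy readers do not dominate the standardized scale.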

3.3.3. Sociodemographic and Behavioral Variables

The analyses also included several sociodemographic and behavioral variables given their potential influence on social attitudes and their role as important control variables or moderators. These variables were selected based on the established literature on social psychology and media effects [4,18,56], which indicates their relevance in shaping both media exposure patterns and attitudinal outcomes. Including them allowed for a more robust assessment of reading volume’s unique contribution and an examination of the effect symmetry/asymmetry across these demographic lines. These included the following:

Sociodemographic factors: Gender, age, residence (urban or rural), educational level, ethnicity, marital status, and annual income.

3.3.4. Behavioral Factors

Social media browsing time (indicating the time spent searching for information on social media): This variable was selected because the amount of time individuals spend on social media platforms directly reflects their level of exposure to a distinct and increasingly influential information environment. Social media platforms serve as important channels for information acquisition and social interaction. Beyond this role, they also actively shape “information diets” and facilitate unique forms of engagement, which can significantly impact users’ attitudes, perceptions, and beliefs [57]. The natural interactivity of these platforms [58] and the commonness of electronic Word-of-Mouth (eWOM) [59] mean that browsing time serves as an indicator of engagement with content. This engagement, in turn, can bring about emotional responses and influence perceptions of credibility, which are linked to changes in attitude. Furthermore, social media use has been associated with both the potential to build social capital and the risk of encouraging polarization and conflict [57]. Therefore, understanding the duration of exposure through browsing time is crucial for assessing how these complex online dynamics contribute to the formation and evolution of social attitudes, distinguishing its influence from that of traditional media consumption like reading.

Leisure learning frequency (the frequency of self-directed learning during free time, measured on a 5-point scale): This factor was chosen because a higher frequency of self-directed learning in one’s leisure time may indicate greater intellectual curiosity and proactive engagement with diverse knowledge. Such engagement could foster cognitive flexibility and openness to new perspectives, which are considered conducive to developing less traditional or more progressive social attitudes [12].

Self-rated Mandarin fluency (indicating an individual’s proficiency in Mandarin Chinese): This variable was included because proficiency in the national language is linked to an individual’s capacity for information access and comprehension from mainstream sources [60]. Language fluency can also influence cognitive processing of social information and subsequent attitude formation [61]. Proficiency in the common language is thus crucial for shaping social experiences and perspectives.

3.4. Data Preprocessing

Obtaining meaningful insights requires both appropriate model selection and high-quality data. Therefore, we performed initial data preprocessing to improve the quality of the dataset. First, duplicate entries were identified and removed. Second, consistent with the participant selection criteria described earlier, cases with missing values for the primary study variables were excluded. Finally, to ensure that features measured on different scales contributed comparably to the analysis, all continuous predictor variables were standardized using the Z-score method. This standardization prevents variables with naturally larger values from having an undue influence on the model outcomes.
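These preprocessing steps could look roughly like the pandas sketch below. The column names, the sentinel codes standing in for “refused,” “don’t know,” and “not applicable,” and the toy rows are all hypothetical. The actual CGSS variable names and codebook values are not given in the text.

```python
import numpy as np
import pandas as pd

# Hypothetical names and codes, used only for illustration.
MISSING_CODES = [-1, -2, -3]
KEY_COLUMNS = ["reading_volume", "social_attitude", "age"]
CONTINUOUS = ["reading_volume", "age"]


def preprocess(df):
    out = df.drop_duplicates()                  # 1. remove duplicate entries
    out = out.replace(MISSING_CODES, np.nan)    # treat sentinel codes as missing
    out = out.dropna(subset=KEY_COLUMNS)        # 2. drop cases missing key variables
    out = out[out["age"] < 70].copy()           # age criterion from Section 3.2
    for col in CONTINUOUS:                      # 3. Z-score continuous predictors
        out[col] = (out[col] - out[col].mean()) / out[col].std(ddof=0)
    return out.reset_index(drop=True)


raw = pd.DataFrame({
    "reading_volume": [2.0, 2.0, 5.0, -1, 0.0, 10.0],
    "social_attitude": [3.2, 3.2, 4.0, 2.5, -2, 3.8],
    "age": [34, 34, 51, 45, 29, 72],
})
clean = preprocess(raw)
print(clean)
```

Standardization is applied last so that the means and standard deviations are computed on the final analytic sample rather than on rows that are later excluded.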

3.5. Data Analysis Strategy

This study employed a four-step data analysis strategy to investigate the relationship between reading volume and social attitudes and to assess whether this relationship showed symmetry across different population groups.

3.5.1. Step 1: Model Building

We trained three machine learning models (Random Forest, XGBoost, and LightGBM) as well as a standard multiple linear regression (MLR) model for comparison.

Random Forest: An ensemble method that constructs multiple decision trees using bootstrap samples and random feature selection. The final prediction is determined by averaging the outputs of all individual trees, which reduces overfitting and improves the generalization performance. Recent applications in social science research have demonstrated Random Forest’s effectiveness in handling complex behavioral data and providing robust feature importance measures [31].

XGBoost (Extreme Gradient Boosting): A gradient boosting framework that builds trees sequentially, where each tree learns from the errors of previous trees, using regularization to prevent overfitting. Recent studies have shown its superior performance in psychological and social science applications, particularly for prediction tasks with complex feature interactions [60].

LightGBM (Light Gradient Boosting Machine): This gradient boosting framework employs histogram-based algorithms for faster training and uses a leaf-wise tree growth strategy rather than the traditional level-wise approach [62]. LightGBM has gained popularity in recent social science research due to its computational efficiency and ability to handle large-scale survey data effectively.

Multiple linear regression (MLR): A traditional parametric approach assuming linear relationships between the predictors and the outcome, serving as a baseline comparison.

These models used the preprocessed data (reading volume, sociodemographic factors, and behavioral factors) as the input features (predictors) to predict the composite social attitude score (outcome).

3.5.2. Step 2: Hyperparameter Optimization

To optimize the predictive accuracy of the machine learning models, their hyperparameters were tuned using a systematic grid search procedure, implemented with the GridSearchCV utility from the scikit-learn library. This process incorporated a 5-fold cross-validation strategy applied to the training dataset to ensure robust parameter selection and mitigate overfitting. For each model, a predefined grid of hyperparameter values was explored, as follows:

For Random Forest, the search space included n_estimators (number of trees, e.g., values such as 50, 100, 300, 500), max_depth (maximum depth of trees, e.g., values ranging from 2 to 10), and min_samples_split (the minimum samples required to split an internal node, e.g., 2, 5, 10).

For XGBoost and LightGBM, the grids encompassed n_estimators (e.g., 50, 100, 300, 500), max_depth (e.g., values from 2 to 10), and learning_rate (e.g., values such as 0.01, 0.05, 0.1, 0.2).

The performance of each hyperparameter combination was evaluated based on the mean R² (coefficient of determination) score across the five cross-validation folds. The set of hyperparameters yielding the highest mean R² score was selected as the optimal configuration for each model.
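The grid search described above can be sketched with scikit-learn as follows. The synthetic data and the reduced grid (two values per hyperparameter, to keep the run short) are assumptions for illustration; the study's full grids and the CGSS data are not reproduced here.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption): 200 cases, 4 predictors.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 0.5 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.5, size=200)

# Reduced illustrative grid; the article lists wider ranges.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [2, 5],
    "min_samples_split": [2, 5],
}

search = GridSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_grid=param_grid,
    scoring="r2",  # mean R² across folds decides the winner
    cv=5,          # 5-fold cross-validation
)
search.fit(X, y)

best_params = search.best_params_   # combination with the highest mean R²
best_mean_r2 = search.best_score_   # its mean cross-validated R²
```

Estimators that follow the scikit-learn interface, such as the regressors shipped with XGBoost and LightGBM, can be passed as `estimator` in the same way, with `learning_rate` replacing `min_samples_split` in the grid.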

3.5.3. Step 3: Model Validation

We evaluated the final, optimized models’ ability to generalize to new data using K-fold cross-validation (with K = 5). In this process, the dataset was repeatedly split into a training set (4 folds) and a testing set (1 fold). Performance metrics (e.g., R-squared, Mean Absolute Error) were calculated on the testing set in each repetition, and the average metric across all folds was used as a reliable estimate of the model’s expected performance on unseen data.
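A minimal sketch of this validation step is shown below, using a linear model on synthetic data. The data, the shuffling of folds, and the fixed random seed are assumptions; the source only specifies K = 5 and the averaging of fold-level metrics.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

# Synthetic stand-in data (assumption).
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 3))
y = X @ np.array([0.3, -0.2, 0.1]) + rng.normal(scale=1.0, size=300)

# Each fold serves once as the test set; the other four are used for training.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_validate(
    LinearRegression(),
    X,
    y,
    cv=cv,
    scoring={"r2": "r2", "mae": "neg_mean_absolute_error"},
)

mean_r2 = scores["test_r2"].mean()
mean_mae = -scores["test_mae"].mean()  # scikit-learn reports MAE negated
```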

3.5.4. Step 4: Feature Interpretation

We conducted a feature importance analysis to interpret the models and understand the specific influence of reading volume compared to that of the other predictors. We primarily used SHAP (SHapley Additive exPlanations) values for this purpose. This technique allowed us to quantify how much each predictor contributed to the model’s predictions of social attitudes. It also enabled us to examine whether the effect of reading volume was consistent (symmetric) or varied (asymmetric) when comparing different subgroups based on key sociodemographic characteristics (e.g., gender, age groups, education levels).
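To make the idea of a SHAP value concrete, the sketch below computes exact Shapley values for a single observation by enumerating all feature coalitions. It is a brute-force illustration, not the algorithm used by the SHAP library: the toy model, its coefficients, and the use of a single background point for "absent" features are all assumptions.

```python
from itertools import combinations
from math import factorial

def exact_shapley(predict, x, background):
    """Exact Shapley values for one observation x.

    Features outside a coalition are replaced by the background values.
    """
    n = len(x)

    def value(coalition):
        z = [x[i] if i in coalition else background[i] for i in range(n)]
        return predict(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi

# Hypothetical model with an interaction between reading and education.
def toy_model(z):
    reading, education, age = z
    return 0.3 * reading + 0.1 * education + 0.2 * reading * education - 0.05 * age

x = [2.0, 1.0, 0.5]
background = [0.0, 0.0, 0.0]
phi = exact_shapley(toy_model, x, background)
```

The contributions add up to the difference between the prediction for `x` and the prediction for the background point, and the interaction term is split equally between the two interacting features. Averaging the absolute values of such contributions over the observations in a subgroup gives the kind of mean SHAP contribution compared across groups in the subgroup analysis.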

3.6. Software and Implementation Tools

All analyses were conducted using Python 3.7. Machine learning models were implemented using scikit-learn 1.0.2 (Random Forest, Linear Regression), XGBoost 1.6.2, and LightGBM 4.6.0. A SHAP analysis was performed using the SHAP library 0.42.1. The data preprocessing and statistical analyses utilized pandas 1.3.5 and numpy 1.21.6. Hyperparameter optimization employed GridSearchCV with 5-fold cross-validation. Visualizations were created using matplotlib 3.5.3 and seaborn 0.11.2.

A detailed flowchart of the implementation pipeline is provided in Supplementary Materials (uploaded separately) for full transparency and reproducibility.

4. Results

4.1. Model Performance

4.1.1. Hyperparameter Settings

The optimal hyperparameters were identified for the tree-based models to enhance their performance (Table 1). Random Forest performed best with max_depth = 5, min_samples_split = 5, and n_estimators = 300. XGBoost achieved the optimal results with learning_rate = 0.1, max_depth = 3, and n_estimators = 100. LightGBM’s best performance was observed with learning_rate = 0.1, max_depth = 3, and n_estimators = 100. The linear regression model used the standard Ordinary Least Squares (OLS) method without tree-specific hyperparameters.

4.1.2. Comparison of the Regression Metrics

To compare how well the four models captured the relationship between reading volume and social attitudes, standard evaluation metrics were used: the Mean Squared Error (MSE), the Root Mean Squared Error (RMSE), the Mean Absolute Error (MAE), and the coefficient of determination (R²). Table 2 summarizes the performance of each model.
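For reference, the four metrics can be computed from observed and predicted scores as in the following sketch. The function name and the toy values are illustrative assumptions.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAE, and R² for a set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mse = float(np.mean(residuals ** 2))
    ss_res = float(np.sum(residuals ** 2))                  # unexplained variation
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))   # total variation
    return {
        "MSE": mse,
        "RMSE": mse ** 0.5,
        "MAE": float(np.mean(np.abs(residuals))),
        "R2": 1.0 - ss_res / ss_tot,
    }

# Toy values (assumption), not study data.
metrics = regression_metrics([1, 2, 3, 4], [1.5, 2, 2.5, 4])
```

MSE, RMSE, and MAE are error measures (lower is better), whereas R² is the share of outcome variance the model accounts for (higher is better).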

As shown in Table 2, all of the models demonstrated a moderate predictive capability. The LightGBM model achieved the highest R² score (0.36) and the lowest error metrics (MSE, RMSE, MAE), suggesting slightly better predictive accuracy than the other models. Random Forest and XGBoost showed a similar performance. Linear regression performed slightly less well but still achieved acceptable results. This suggests that tree-based algorithms may capture potentially complex, non-linear aspects of the relationship between reading volume and social attitudes in addition to its linear effects.

While LightGBM achieved a marginally better performance (R² = 0.36) compared to that of Random Forest (R² = 0.34) and XGBoost (R² = 0.35), these differences were not practically significant. For the subsequent SHAP (SHapley Additive exPlanations) interpretability analyses, we selected Random Forest based on several methodological considerations. First, Random Forest provides more theoretically grounded and computationally stable feature importance measures, as extensively documented in the literature on machine learning interpretability [63]. Second, Random Forest demonstrated the most pronounced differentiation in the feature importance, with reading volume exhibiting a substantially higher importance (0.42) relative to that of the other predictors, thereby facilitating clearer interpretability of the primary research variables. Third, Random Forest’s ensemble averaging mechanism across multiple decision trees generates more robust and stable SHAP values, which is particularly advantageous for reliable subgroup analyses and cross-population comparisons. Finally, the algorithm’s inherent resistance to overfitting and established track record in social science applications [64] provide additional methodological justification for its selection in interpretability-focused analyses.

4.2. Feature Importance Analysis: The Central Role of Reading Volume

Before interpreting the specific impact of reading volume, we examined the feature importance rankings from the three tree-based models to understand the relative contribution of reading volume among various predictors.

Feature Contributions in Tree-Based Models

The feature importance rankings from the three tree-based models consistently placed reading volume at or near the top. Figure 1 provides a bar chart comparing the importance scores of all predictor variables across the four models: three tree-based models (Random Forest, XGBoost, and LightGBM) and a linear model. In the three tree-based models, ‘reading volume’ consistently appears among the top three most important features.

Random Forest: Reading volume had the highest importance score (0.42), well above that of other variables such as social media browsing time (0.19) and residence (0.12). This clearly indicates its primary role in the model’s predictions.
XGBoost: Education level (0.36) and reading volume (0.18) were the two most important features, with reading volume ranking just behind education level.
LightGBM: Based on the split frequency, age (145 splits) and reading volume (124 splits) were the most crucial features, followed by social media browsing time (82 splits) and annual income (55 splits). Reading volume again emerged as a top contributing feature, comparable to age.

To facilitate a further visual comparison of how these feature importance patterns differ or align across the models, Figure 2 presents the feature importance scores (as shown in Figure 1) in a radar chart format. Each axis corresponds to a predictor variable, and the distance from the center along each axis indicates the feature’s importance score for a given model. This representation allows for a quick assessment of the unique “importance profile” each model assigns to the set of features.

While these initial feature importance metrics provide a general ranking, SHAP values offer a more nuanced understanding by quantifying the marginal contribution of each feature to the individual predictions and the overall model output. Figure 3 displays the mean absolute SHAP values for each feature, grouped by model and averaged across all predictions. Higher values indicate a greater average impact on the model predictions. Visually, ‘reading volume’ and ‘residence’ consistently exhibit high mean absolute SHAP values across the four models, reinforcing their significant roles.

To explore the nature of these contributions more deeply beyond just their average impact, Figure 4 illustrates the distribution of the SHAP values for each feature within each of the four predictive models (Random Forest, XGBoost, LightGBM, and linear). Each colored dot represents the SHAP value for a specific feature from a single observation (prediction), with the y-axis indicating the magnitude and direction of the SHAP value. The x-axis categorizes these distributions by model. For the ‘reading volume’ feature (pink dots), for instance, the SHAP values are predominantly positive across all models, indicating a general tendency for a higher reading volume to contribute positively towards predicting more progressive attitudes. The plot also reveals variations in the spread and central tendency of SHAP values for different features and across different models, highlighting how each model attributes importance and impact.

This consistent finding across different models strongly suggests that reading volume is not merely a significant predictor but arguably one of the most critical factors associated with social attitudes within this dataset.

4.3. Group Analysis and Effect Testing: Exploring Symmetry

Beyond the overall model, we conducted subgroup analyses using the Random Forest model and the SHAP importance values to explore whether the predictive effect of reading volume on social attitudes was consistent (symmetric) or varied across different demographic groups (education level, ethnicity, gender, residence, marital status).
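One simple way to summarize such a comparison is to average the absolute SHAP values of a feature within each subgroup and inspect the gap between groups. The sketch below assumes that SHAP values for reading volume are already available as an array and that groups are coded as integers; the function names, the toy numbers, and the use of an absolute gap as a symmetry summary are illustrative assumptions, not the procedure reported in the study.

```python
import numpy as np

def mean_abs_shap_by_group(shap_values, groups):
    """Mean absolute SHAP value of one feature within each subgroup."""
    shap_values = np.asarray(shap_values, dtype=float)
    groups = np.asarray(groups)
    return {
        int(g): float(np.mean(np.abs(shap_values[groups == g])))
        for g in np.unique(groups)
    }

def symmetry_gap(group_means, a, b):
    """Absolute difference in mean |SHAP| between two groups (0 = symmetric)."""
    return abs(group_means[a] - group_means[b])

# Toy values (assumption), not study data.
shap_reading = [0.2, -0.1, 0.3, 0.1, -0.1, 0.1]
group_codes = [1, 1, 1, 2, 2, 2]

group_means = mean_abs_shap_by_group(shap_reading, group_codes)
gap = symmetry_gap(group_means, 1, 2)
```

A gap near zero indicates that the feature contributes about equally to predictions in both groups, whereas a large gap indicates an asymmetric influence.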

4.3.1. Education Level

Participants were divided into two groups based on their education level (1 = below tertiary education; 2 = tertiary education). For the group with below-tertiary education, the model’s R-squared was 0.22, and the mean SHAP contribution of reading volume was 0.04. For the group with tertiary education, the model’s R-squared was 0.18, and the mean SHAP contribution of reading volume was 0.13, markedly higher than in the below-tertiary group (see Figure 5). A large effect size (Cohen’s d = −0.95) and a significant t-test result (t = −21.01, p < 1 × 10⁻⁸⁴) indicated substantial differences in social attitudes between the two education groups. The influence of reading volume was asymmetric, being considerably more pronounced for the higher-education group.
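The effect size and test statistic used in these group comparisons can be sketched as follows. The pooled-standard-deviation form of Cohen's d and the equal-variance t statistic are assumptions, since the text does not state which variants were used; the toy scores are illustrative.

```python
from math import sqrt
from statistics import mean, variance

def cohens_d(a, b):
    """Cohen's d using the pooled standard deviation (assumed variant)."""
    n1, n2 = len(a), len(b)
    pooled_var = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)

def pooled_t(a, b):
    """Equal-variance two-sample t statistic, expressed through Cohen's d."""
    return cohens_d(a, b) / sqrt(1 / len(a) + 1 / len(b))

# Toy attitude scores (assumption), not study data.
group_1 = [1, 2, 3, 4, 5]
group_2 = [2, 3, 4, 5, 6]

d = cohens_d(group_1, group_2)
t = pooled_t(group_1, group_2)
```

The sign of both statistics depends only on which group is entered first: a negative value means the first group has the lower mean. Because t grows with sample size while d does not, very small p-values can accompany modest effect sizes in large surveys.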

4.3.2. Ethnicity Grouping

For the two ethnic groups (1 = “Han”, 2 = minority), the model’s R² values were similar, at 0.33 and 0.34, respectively. The effect size analysis (Cohen’s d = 0.2127, p = 0.004) revealed significant differences in social attitudes between these groups. However, examining the direct influence of reading volume using the SHAP values revealed a relatively similar impact: the average SHAP contributions were close (0.11 for Han vs. 0.08 for minority) (see Figure 6). This suggests that reading volume remained important for both groups, but ethnic background also contributed to the differences in attitudes.

4.3.3. Gender Grouping

When analyzing by gender (1 = male, 2 = female), the mean SHAP contribution of reading volume was very similar for men (0.10) and women (0.09). The model fit, however, differed considerably (R² = 0.28 for men; R² = 0.45 for women). Although there was a significant difference in the mean social attitudes between the genders (Cohen’s d = −0.1490, t = −3.8452, p < 0.001), the influence of reading volume itself was relatively symmetrical across genders (see Figure 7). This supports the idea of stability in reading’s influence across gender lines.

4.3.4. Residence Grouping

When comparing urban and rural residents, we found large differences in their average social attitudes (Cohen’s d = 0.7964, p < 1 × 10⁻⁷⁸). The model also explained the data much better for urban residents (R² ≈ 0.31) than for rural residents (R² ≈ 0.17). The importance of the other factors to predicting attitudes also differed; for instance, age was a more important factor for the rural group. However, similar to ethnicity, the direct SHAP contribution of reading volume showed partial symmetry (see Figure 8), with the values being somewhat comparable (0.10 for urban vs. 0.08 for rural). This suggests that even though the overall factors influencing attitudes differ significantly depending on whether someone lives in an urban or rural area, the specific influence of reading volume itself remains quite similar between the two groups.

4.3.5. Marital Status Grouping

For marital status groups (1 = never married, 2 = ever married), the results showed differences in their social attitudes (Cohen’s d = −0.6296, p < 1 × 10⁻³⁰). Clear asymmetry was observed in the influence of reading volume: its mean SHAP contribution was substantially higher for the never-married group (SHAP = 0.20) compared to that in the ever-married group (SHAP = 0.05) (see Figure 9). For the ever-married group, other factors like time spent on social media and where they lived (residence) had a greater influence on their attitudes.

4.4. The Linear Model Analysis

To examine the linear influence of reading volume, we analyzed the results from the multiple linear regression model, which included 11 predictors (Table 3). The analysis confirmed that reading volume significantly predicted social attitudes (β = 0.27, t = 11.44, p < 0.001). After accounting for all factors in the model, several other predictors were also significant: gender (β = 0.13, p < 0.001), residence (β = −0.22, p < 0.001), education level (β = 0.11, p < 0.001), and self-rated Mandarin fluency (β = 0.05, p < 0.001). Additionally, while they did not reach the high significance levels (p < 0.001) of some of the other predictors, frequency of learning in leisure time (β = 0.027, p = 0.031) and annual income (β = 0.04, p = 0.012) also emerged as statistically significant predictors (p < 0.05). Together, these variables explained 32.2% of the variance in social attitudes.

Model diagnostics did not show significant problems with the assumption of normality (Omnibus: 3.61, p = 0.17; Jarque–Bera: 3.58, p = 0.17). However, the Durbin–Watson value of 0.74 indicated a positive autocorrelation in the residuals, which may have arisen due to the hierarchical structure of the survey data (individuals nested within regions). To investigate the multicollinearity concerns suggested by the condition number, we calculated the variance inflation factors (VIFs) for all predictors. All of the VIF values ranged between 1.0 and 1.8, well within acceptable limits (VIF < 5) and indicating no severe multicollinearity. Consequently, the linear regression coefficients are unlikely to be unduly influenced by correlated predictors.
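The two diagnostics discussed here can be computed directly from the residuals and the predictor matrix, as in the sketch below. The function names and toy inputs are assumptions, and the statistic values reported in the text are not reproduced by this code.

```python
import numpy as np

def durbin_watson(residuals):
    """Durbin-Watson statistic: about 2 means no lag-1 autocorrelation."""
    e = np.asarray(residuals, dtype=float)
    return float(np.sum(np.diff(e) ** 2) / np.sum(e ** 2))

def variance_inflation_factors(X):
    """VIF_j = 1 / (1 - R²_j), with R²_j from regressing column j on the rest."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vifs = []
    for j in range(p):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - resid @ resid / np.sum((target - target.mean()) ** 2)
        vifs.append(float(1.0 / (1.0 - r2)))
    return vifs

# Toy inputs (assumption): alternating residuals and two uncorrelated predictors.
dw = durbin_watson([1.0, -1.0, 1.0, -1.0])
X_toy = np.array([[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]])
vifs = variance_inflation_factors(X_toy)
```

Because the Durbin–Watson statistic is approximately 2(1 − ρ), where ρ is the lag-1 correlation of the residuals, a value of 0.74 corresponds to ρ of roughly 0.63. The statistic depends on the order of the rows, so in cross-sectional survey data it reflects how the cases happen to be sorted (for example, by region).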

5. Discussion

5.1. The Key Findings

This study examines the influence of reading volume on social attitudes across different demographic groups, with a particular emphasis on effect symmetry. Analyzing the CGSS2021 data through machine learning methods (Random Forest, XGBoost, LightGBM) and linear regression revealed reading volume as a core predictor of social attitudes. This significance was consistently demonstrated across all four models. Specifically, the tree-based feature importance analyses (Random Forest: 0.42; XGBoost: 0.18; LightGBM: 0.11) highlight reading’s significance, further corroborated by its substantial positive coefficient in linear regression (β = 0.27, p < 0.001). The consistency across different modeling approaches highlights the robustness of the association between reading volume and social attitudes. This finding strongly supports Hypothesis 1, which posited that reading volume would positively predict social attitudes, with higher reading exposure associated with more open or progressive attitudes. The consistent positive coefficients and high feature importance scores across diverse models provide robust evidence for this hypothesis.

Additionally, this study’s results indicated that non-linear models demonstrated a slightly better fit, with LightGBM achieving the highest R² value. This suggests that tree ensemble models possess advantages in capturing the complex interplay of factors influencing social attitudes. This result supports Hypothesis 2, which hypothesized that machine learning models (Random Forest, XGBoost, LightGBM) would demonstrate a superior predictive performance in modeling social attitudes compared to that of a traditional multiple linear regression model. Although this improvement was marginal, the higher R² scores and lower error metrics of the machine learning models, particularly LightGBM, indicate their enhanced capability in capturing potential non-linearities in the data.

However, it should be noted that the R² values across all models fall within a moderate range. The observed R² values (ranging from 0.33 to 0.36) require further discussion within the context of social science research. Human attitudes and behaviors are naturally complex phenomena influenced by numerous unmeasured psychological, cultural, and experiential factors. This complexity limits the proportion of variance that can be explained by any measured variable. According to [65], R² values of 0.02, 0.13, and 0.26 represent small, medium, and large effect sizes in behavioral sciences, respectively, suggesting that our results demonstrate practically meaningful relationships. These values are also comparable to or exceed those reported in similar studies examining the effects of media on attitudes, such as Wei et al. [66], who reported an R² of 0.27 using similar large-scale survey data. The consistent performance levels observed across all models indicate that the identified patterns are robust and reliable. While some variance remains unexplained, the models successfully identify key predictor variables and reveal their relative importance in influencing social attitudes, which aligns with the primary objectives of this research.
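The R² thresholds quoted above appear to correspond to the widely used f² conventions for regression effect sizes (0.02, 0.15, and 0.35 for small, medium, and large), where f² = R² / (1 − R²). The sketch below makes the conversion explicit; the function name is an illustrative assumption.

```python
def cohens_f2(r_squared):
    """Convert R² to the f² effect size: f² = R² / (1 - R²)."""
    return r_squared / (1.0 - r_squared)

# R² benchmarks quoted in the text, mapped to f².
benchmarks = {r2: round(cohens_f2(r2), 2) for r2 in (0.02, 0.13, 0.26)}

# The highest R² reported in this study (LightGBM).
f2_best_model = cohens_f2(0.36)
```

On this scale, an R² of 0.36 yields an f² of about 0.56, above the conventional threshold for a large effect.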

Beyond the overall model performance, the SHAP-value-based subgroup analyses provide nuanced insights into the symmetry of reading effects. The results reveal both symmetric and asymmetric patterns across demographic categories. For example, the effect of reading volume on different gender groups exhibited strong symmetry (SHAP values: male = 0.10, female = 0.09). Partially symmetric effects were also observed for ethnicity (Han Chinese = 0.11, ethnic minorities = 0.08) and place of residence (urban = 0.10, rural = 0.08). However, notable asymmetries emerged across education levels and marital status. Reading had a much greater effect on people with more education than on those with less. Similarly, the relationship between reading volume and social attitudes was stronger for never-married individuals than for ever-married individuals. These findings provide support for Hypothesis 3, which proposed that the influence of reading volume on social attitudes would exhibit varying patterns of symmetry and asymmetry across demographic groups. The strong symmetry observed across genders is consistent with the symmetry component of H3. Concurrently, the significant differences in reading’s impact based on education level and marital status, along with the partial symmetries for ethnicity and residence, are consistent with the asymmetry component of H3. This demonstrates that while reading’s core influence may be consistent in some demographic comparisons, its impact is moderated by other sociodemographic factors, leading to varied effects in other contexts.

5.2. The Effects of Reading

A robust association between reading volume and social attitudes was consistently demonstrated across models. In our data, people who read more tended to show greater understanding and respect for individual differences and were more willing to challenge traditional values regarding family, marriage, and gender roles. These findings align with previous research suggesting that reading fosters empathy, reduces prejudice, and broadens worldviews through exposure to diverse perspectives and narratives [11,12]. This study builds on these findings by measuring the importance of reading within a wide range of social and demographic factors. Applying machine learning methods to large-scale, real-world data helped overcome the limitations of earlier lab-based studies, which often relied on smaller groups of participants.

The strength of the connection found (effect size β = 0.27) was similar to or even greater than that found in earlier research. For example, Jiang’s [9] study reported a correlation of 0.21 between reading and concern for others’ feelings (empathic concern). The stronger link identified in this research might have been because social attitudes were measured comprehensively, and the models used were capable of handling complex interactions between different factors.

The mechanisms through which reading volume influences social attitudes are considered to involve both cognitive and affective pathways. Buttrick et al. [12] suggested that reading broadens perspectives by showing individuals viewpoints beyond their usual social circles. This proposition is consistent with the results of this study, as reading volume remained an important predictor even after accounting for other forms of media engagement, such as social media browsing time.

Furthermore, comparing the performance of different models suggests a potential non-linear relationship between reading volume and social attitudes. Specifically, the tree ensemble models (Random Forest, XGBoost, and LightGBM) exhibited slightly better predictive performance than standard linear regression. This pattern is consistent with reading having a stronger effect at lower amounts, with its impact potentially leveling off or becoming more complex as reading volume increases. Machine learning models, through their inherent structures (e.g., decision tree splitting rules), are better suited to capturing these kinds of non-linear patterns [29].

It is also noteworthy that the diagnostic checks for our standard linear regression model highlight the advantages of the machine learning approaches used in this study. While the variance inflation factors (VIFs) were all below 1.8, indicating that multicollinearity was not a significant concern despite the condition number, the Durbin–Watson statistic for the linear model was 0.74. This value suggests potential positive autocorrelation in the residuals, implying that the error terms may not be fully independent. Such autocorrelation can affect the efficiency of Ordinary Least Squares (OLS) estimates and the precision of their standard errors. Consequently, while the β coefficients from the linear regression offer insights into linear trends, they should be interpreted with this potential limitation in mind.

This finding demonstrates a benefit of our primary analytical approach, which centers on machine learning models (Random Forest, XGBoost, LightGBM) and their SHAP-based interpretations. These ensemble methods generally depend less on the strict assumptions of OLS regression, such as the independence of errors, particularly when modeling complex, potentially non-linear relationships and assessing feature importance in a predictive context. The linear model served principally as a conventional benchmark in our broader analytical strategy, and its diagnostic characteristics in this instance further highlighted the value of the more flexible machine learning framework employed in this study.

5.3. The Symmetry and Asymmetry Between Demographic Groups

The results revealed strong symmetry in the effect of reading volume on social attitudes across genders and a partially symmetrical effect across regional and ethnic groups. This suggests that the mental processes through which reading shapes social attitudes may be relatively universal across the studied populations, transcending these specific social boundaries.

Specifically, symmetrical effects were observed across gender groups. This finding may extend and refine perspectives from studies suggesting gender-specific differences in the effects of media. While Valkenburg and Peter [20] found that girls responded more strongly to emotional media content than boys, the symmetry effect observed in this study suggests that reading’s influence works through cognitive processes that are similar for both genders, at least within the context of contemporary Chinese culture.

Regarding the partial symmetry across regions and ethnicities, it was found that although starting attitude levels might differ, both urban and rural residents, as well as Han and ethnic minority groups, showed fairly consistent patterns in the direct effects of reading. While an individual’s residential environment and ethnicity can exert multifaceted influences on social attitudes, the fundamental cognitive impact of exposure to reading seems relatively stable. This highlights the potentially foundational role of reading in shaping social cognition beyond one’s direct environment. These findings resemble Liu’s [22] observation that the effects of mass media on health behaviors during COVID-19 were similar across regions, suggesting that some media effects go beyond regional and cultural boundaries. The findings of this study extend this concept to the domain of reading volume and social attitudes.

In contrast, significant asymmetries were found across education levels. Reading appears to have a greater influence on individuals with higher education. This difference can be attributed to several factors. Higher education typically cultivates critical thinking abilities, enhances cognitive processing capacities, and exposes individuals to a wider range of more complex reading materials [19]. Consequently, highly educated individuals may engage more deeply with the ideas encountered through reading and incorporate them more readily, leading to more pronounced changes in their attitudes. This finding highlights the potential interaction between reading exposure and pre-existing cognitive frameworks in shaping social attitudes.

Similarly, asymmetry was observed in the impact of reading volume across marital status groups. Individuals who had never been married appeared more susceptible to the effects of reading exposure. This may reflect differences in social roles, responsibilities, and life experiences [4]. Never-married individuals may have more free time for reading and exhibit greater openness to novel ideas and perspectives during the processes of identity formation and social integration. In contrast, married individuals may be influenced by family life and established social networks, which could potentially lessen the direct impact of reading on their social attitudes.

These results demonstrate the advantages of interpretable machine learning methods in capturing intricate feature patterns. Such methods can reveal complex patterns in how factors affect outcomes across different groups. Traditional statistical tests comparing group averages might not capture these subtle variations. In contrast, the adoption of the SHAP-value-based analysis in this study enables the decomposition of each factor’s contribution for individuals, revealing more detailed patterns of influence at the group level.

5.4. Implications

This study combines interpretable machine learning with traditional statistical methods to analyze large-scale data. This approach allows the complex relationships in social science research to be captured, improving the understanding of the subtle ways various factors influence social cognition and behavior. The findings support the view of Kyriazos and Poga [67] that machine learning techniques can effectively model non-linear relationships that traditional methods might miss. Such a methodological advance offers policymakers refined tools to evaluate how educational and media interventions affect different social groups in varied ways.

Furthermore, this research contributes to the concept of “influence symmetry” in media exposure studies. While previous work has noted the importance of examining such symmetrical effects [21], empirical investigations in this area have been limited. This study provides a novel methodological framework by quantifying and analyzing the symmetry patterns across different demographic groups, enabling the application of this analytical approach to media influence contexts beyond reading.

Additionally, the findings provide ecologically valid tests of the effects of reading on social cognition, extending previous research and yielding important insights for policymakers and educators. The strong association between reading volume and progressive social attitudes supports the promotion of reading in schools and communities. Importantly, the asymmetries observed across education levels and marital status indicate that reading’s impact is not uniform: it depends on the amount of reading as well as on individual characteristics that influence how information is processed and integrated into existing views. This suggests that efforts to promote reading may need to be adapted to different demographic groups to be most effective.

These nuanced findings carry practical implications. For educators and policymakers, reading’s consistent impact across genders suggests that broad promotion initiatives are often suitable without gender-specific tailoring. Conversely, reading’s impact varies across education levels. For individuals with lower educational attainment, the positive influence of reading programs on social attitudes can be significant, and this effect may be amplified further by integrating these programs with cognitive skill-building activities. Furthermore, the more pronounced impact of reading among never-married individuals indicates a strategic opportunity for intervention: targeting young adults during their formative years, for example through university and community-based programs, may be particularly effective in fostering more open social attitudes. Collectively, these insights underscore the need for context-sensitive interventions that account for the varied effects of reading across different demographic groups.

5.5. Limitations and Future Directions

Several limitations should be noted. First, the cross-sectional nature of the CGSS 2021 data fundamentally limits causal inference. Our findings demonstrate robust associations but cannot definitively determine causality. It is plausible, for example, that individuals with more open social attitudes are more inclined to read, rather than reading directly causing such attitudes, or that a bidirectional relationship exists. Future longitudinal studies tracking changes in reading habits and social attitudes over time are necessary to disentangle these relationships. Alternatively, experimental intervention studies, where participants are assigned different types or volumes of reading materials, followed by an assessment of their attitudinal shifts, could also help disentangle the causal pathways. The use of self-reported reading volume, which may be susceptible to recall bias and social desirability effects, is another limitation. Future research could explore more objective measures of reading behavior, such as book purchase records or library loan data.

Second, while the social attitude measure demonstrated acceptable internal consistency (Cronbach’s α = 0.79), it mainly focused on views about gender roles, marriage, and family. Future research should investigate the generalizability of these findings to other domains of social attitudes, such as those concerning ethnicity, politics, or the environment.

Third, because of limitations in the survey questionnaire, the reading measure pooled all book types without distinguishing by genre. This restricted the ability to interpret the findings in terms of specific reading habits, such as long-term reading of fiction. As demonstrated by Suzuki et al. [17], the content of reading materials can significantly modulate their effects on stereotypes and social attitudes. Future research should incorporate more detailed measures of reading content to examine how different types of reading materials may differentially influence social attitudes.

Fourth, despite controlling for several sociodemographic and behavioral factors, unmeasured variables may have influenced the relationships observed to some extent. For example, personality traits like ‘openness to experience’ could potentially affect both reading habits and social attitudes, possibly confounding the observed connection.

Moreover, cross-cultural replications of this study would be valuable for determining whether the observed patterns of symmetry and asymmetry are specific to China or reflect more generalizable processes in attitude formation.

6. Conclusions

This study examined the relationship between reading volume and social attitudes across demographic groups in China using a machine learning framework with SHAP-based interpretability. Our key findings revealed that reading volume consistently predicts more open social attitudes, with tree-based ensemble models fitting the data slightly better than traditional linear regression (R² = 0.34–0.36 versus 0.33; Table 2), consistent with their capacity to capture non-linear relationships.

The analysis of effect symmetry revealed important patterns in reading’s influence. While the effect was symmetric across gender and partially symmetric across ethnicity and residence, significant asymmetries emerged for education level and marital status. Specifically, reading showed stronger effects among individuals with a tertiary education and those who had never married.

Our methodological contribution lies in combining machine learning with SHAP analysis to assess the symmetry of feature importance, advancing computational approaches in social science research. These findings offer valuable insights for educators and policymakers, suggesting that targeted reading promotion strategies could help foster positive social attitudes across different demographic groups in contemporary society.
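
As a rough illustration of the kind of symmetry check summarized above, the sketch below uses a linear model, for which (assuming independent features) the SHAP value of a feature equals its coefficient times the feature's deviation from its mean. The toy data, the 1-5 reading scale, the group labels, and the `symmetry_gap` helper are all hypothetical. This is not the authors' pipeline, which applies SHAP to tree-based models fitted on CGSS data.

```python
from statistics import mean

def linear_shap(coef, values):
    """SHAP values of one feature in a linear model with independent features:
    phi_i = coef * (x_i - mean(x))."""
    mu = mean(values)
    return [coef * (x - mu) for x in values]

def mean_abs_by_group(shap_values, groups):
    """Mean absolute SHAP value of the feature within each subgroup."""
    buckets = {}
    for phi, g in zip(shap_values, groups):
        buckets.setdefault(g, []).append(abs(phi))
    return {g: mean(vals) for g, vals in buckets.items()}

def symmetry_gap(group_means):
    """Largest minus smallest subgroup mean; values near 0 suggest symmetry."""
    return max(group_means.values()) - min(group_means.values())

# Hypothetical toy data (not CGSS): reading volume on an assumed 1-5 scale.
reading = [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
coef = 0.27  # illustrative only; borrowed from the reading-volume row of Table 3

phi = linear_shap(coef, reading)

# Grouping A: both subgroups have the same reading distribution.
balanced = ["g1"] * 5 + ["g2"] * 5
# Grouping B: subgroup "g1" holds the low readers, "g2" the rest.
skewed = ["g1", "g1", "g2", "g2", "g2"] * 2

gap_balanced = symmetry_gap(mean_abs_by_group(phi, balanced))
gap_skewed = symmetry_gap(mean_abs_by_group(phi, skewed))
print(f"balanced gap = {gap_balanced:.3f}, skewed gap = {gap_skewed:.3f}")
```

In a purely linear model such as this one, a non-zero gap can only arise from differences in how the feature is distributed across subgroups. Tree-based models can additionally reflect interactions between the feature and group membership, which is one reason a subgroup comparison of SHAP values may be more informative there.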

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/sym17060900/s1, File S1: Flowchart of algorithm implementation.

Author Contributions

Conceptualization, Y.W. and H.C.; Methodology, Y.W. and W.Z.; Software, Y.W.; Validation, Y.W. and Q.Z.; Formal analysis, Y.W. and W.Z.; Investigation, Y.W., H.C. and W.Z.; Resources, Q.Z.; Writing—original draft, Y.W. and W.Z.; Writing—review & editing, Y.W., H.C., W.Z. and Q.Z.; Project administration, H.C. and Q.Z.; Funding acquisition, H.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Gawronski, B.; Brannon, S.M. Attitudinal Effects of Stimulus Co-Occurrence and Stimulus Relations: Range and Limits of Intentional Control. Personal. Soc. Psychol. Bull. 2021, 47, 1654–1667.
2. Eagly, A.H.; Chaiken, S. The Advantages of an Inclusive Definition of Attitude. Soc. Cogn. 2007, 25, 582–602.
3. Meleady, R.; Crisp, R.J.; Dhont, K.; Hopthrow, T.; Turner, R.N. Intergroup contact, social dominance, and environmental concern: A test of the cognitive-liberalization hypothesis. J. Personal. Soc. Psychol. 2020, 118, 1146–1164.
4. Smith, E.R.; Mackie, D.M.; Claypool, H.M. Social Psychology; Psychology Press: New York, NY, USA, 2015.
5. Thurstone, L.L. The measurement of social attitudes. J. Abnorm. Soc. Psychol. 1931, 26, 249–269.
6. Wolff, U. Effects of a Randomised Reading Intervention Study: An Application of Structural Equation Modelling. Dyslexia 2011, 17, 295–311.
7. Lyu, Z.; Chai, X. Media Influence on Intergenerational Attitudes toward Non-Conventional Sexual Behaviors in Contemporary China: Evidence from Chinese General Social Survey. Int. J. Sex. Health 2024, 36, 77–99.
8. Huynh-Thu, V.A.; Saeys, Y.; Wehenkel, L.; Geurts, P. Statistical interpretation of machine learning-based feature importance scores for biomarker discovery. Bioinformatics 2012, 28, 1766–1774.
9. Jiang, Y. Tolerance of Ambiguity, Reading Strategies and Foreign Language Anxiety in English Learning. Curric. Teach. Methodol. 2023, 6, 1–9.
10. Karam, K.M.; Elfiel, H. An experimental study of the effect of close reading versus casual reading of social drama on the stimulation of the cognitive capacity of empathy. Sci. Study Lit. 2020, 10, 35–65.
11. Johnson, D.R. Transportation into a story increases empathy, prosocial behavior, and perceptual bias toward fearful expressions. Personal. Individ. Differ. 2012, 52, 150–155.
12. Buttrick, N.; Westgate, E.C.; Oishi, S. Reading literary fiction is associated with a more complex worldview. Personal. Soc. Psychol. Bull. 2022, 49, 1408–1420.
13. Kende, A.; Hadarics, M.; Bigazzi, S.; Boza, M.; Kunst, J.R.; Lantos, N.A.; Lášticová, B.; Minescu, A.; Pivetti, M.; Urbiola, A. The last acceptable prejudice in Europe? Anti-Gypsyism as the obstacle to Roma inclusion. Group Process. Intergroup Relat. 2020, 24, 388–410.
14. Oľhová, S.; Lášticová, B.; Kundrát, J.; Kanovský, M. Using fiction to improve intergroup attitudes: Testing indirect contact interventions in a school context. Soc. Psychol. Educ. Int. J. 2023, 26, 81–105.
15. Kidd, D.; Castano, E. Different stories: How levels of familiarity with literary and genre fiction relate to mentalizing. Psychol. Aesthet. Creat. Arts 2017, 11, 474–486.
16. Zhang, C.; Wu, B. Characterizing gender stereotypes in popular fiction: A machine learning approach. Online J. Commun. Media Technol. 2023, 13, e202349.
17. Suzuki, A.; Osanai, H.; Liu, C.H. Cross-cultural investigation into the associations of fiction reading habits with mentalizing skills and stereotyping among adults in the United Kingdom and Japan. Psychol. Aesthet. Creat. Arts 2024, advance online publication.
18. Rivas-Drake, D.; Saleem, M.; Schaefer, D.R.; Medina, M.; Jagers, R. Intergroup Contact Attitudes Across Peer Networks in School: Selection, Influence, and Implications for Cross-Group Friendships. Child Dev. 2018, 90, 1898–1916.
19. de-la-Peña, C.; Luque-Rojas, M.J. Levels of Reading Comprehension in Higher Education: Systematic Review and Meta-Analysis. Front. Psychol. 2021, 12, 712901.
20. Valkenburg, P.M.; Peter, J. The Differential Susceptibility to Media Effects Model. J. Commun. 2013, 63, 221–243.
21. Fu, J.; Hsiao, C. Decoding intelligence via symmetry and asymmetry. Sci. Rep. 2024, 14, 12525.
22. Liu, P.L. COVID-19 Information Seeking on Digital Media and Preventive Behaviors: The Mediation Role of Worry. Cyberpsychology Behav. Soc. Netw. 2020, 23, 677–682.
23. Armutcu, B.; Zeqiri, J.; Ibahrine, M.; Gleason, K.; Alserhan, B.A. The relationship between digital marketing and product purchase behaviour in Turkey: A structural equations modelling approach. J. Mark. Commun. 2024, 1–31.
24. Xu, C.; Tyreal Yizhou Qian Yang, L.; Liu, D. Tweets, Triumphs, and Tensions: A Machine Learning Approach to Decoding Multi-Tier Thematic Framing of the 2022 Beijing Winter Olympics on Social Media. Commun. Sport 2024.
25. Dijkman, B.; Kooistra, B.; Bhandari, M. How to work with a subgroup analysis. Can. J. Surg. 2009, 52, 515–522.
26. Jakulin, A. Machine Learning Based on Attribute Interactions. Ph.D. Thesis, Univerza v Ljubljani, Ljubljana, Slovenia, 2005.
27. Matthes, J.; Knoll, J.; von Sikorski, C. The “Spiral of Silence” Revisited: A Meta-Analysis on the Relationship Between Perceptions of Opinion Support and Political Opinion Expression. Commun. Res. 2017, 45, 3–33.
28. Kyriazos, T.; Poga, M. Exploring Fuzzy Logic as an Alternative Approach in Psychological Scoring. Open Psychol. J. 2024, 17, e18743501337527.
29. Rogala, J.; Żygierewicz, J.; Malinowska, U.; Cygan, H.; Stawicka, E.; Kobus, A.; Vanrumste, B. Enhancing autism spectrum disorder classification in children through the integration of traditional statistics and classical machine learning techniques in EEG analysis. Sci. Rep. 2023, 13, 21748.
30. Madakkatel, I.; Zhou, A.; McDonnell, M.D.; Hyppönen, E. Combining machine learning and conventional statistical approaches for risk factor discovery in a large cohort study. Sci. Rep. 2021, 11, 22997.
31. Molina, M.; Garip, F. Machine Learning for Sociology. Annu. Rev. Sociol. 2019, 45, 27–45.
32. Obschonka, M.; Audretsch, D.B. Artificial intelligence and big data in entrepreneurship: A new era has begun. Small Bus. Econ. 2020, 55, 529–539.
33. Chowdhury, S.; Dey, P.K.; Rodríguez-Espíndola, O.; Parkes, G.; Tuyet, N.T.A.; Long, D.D.; Ha, T.P. Impact of Organisational Factors on the Circular Economy Practices and Sustainable Performance of Small and Medium-sized Enterprises in Vietnam. J. Bus. Res. 2022, 147, 362–378.
34. Rahal, C.; Verhagen, M.; Kirk, D. The rise of machine learning in the academic social sciences. AI Soc. 2022, 39, 799–801.
35. Linthicum, K.P.; Schafer, K.M.; Ribeiro, J.D. Machine learning in suicide science: Applications and ethics. Behav. Sci. Law 2019, 37, 214–222.
36. Nasteski, V. An overview of the supervised machine learning methods. Horiz. B 2017, 4, 51–62.
37. Luhmann, M. Using Big Data to study subjective well-being. Curr. Opin. Behav. Sci. 2017, 18, 28–33.
38. Li, W.; Wu, C.; Hu, X.; Chen, J.; Fu, S.; Wang, F.; Zhang, D. Quantitative Personality Predictions from a Brief EEG Recording. IEEE Trans. Affect. Comput. 2022, 13, 1514–1527.
39. Petch, J.; Di, S.; Nelson, W. Opening the black box: The promise and limitations of explainable machine learning in cardiology. Can. J. Cardiol. 2021, 38, 204–213.
40. Dwivedi, R.; Dave, D.; Naik, H.; Singhal, S.; Omer, R.; Patel, P.; Qian, B.; Wen, Z.; Shah, T.; Morgan, G.; et al. Explainable AI (XAI): Core Ideas, Techniques and Solutions. ACM Comput. Surv. 2022, 55, 194.
41. Tan, Q.; Liu, Y.; Fan, Z.; Zhang, J.; Cui, Q.; Zhang, M.-X. Effect of processing parameters on the densification of an additively manufactured 2024 Al alloy. J. Mater. Sci. Technol. 2020, 58, 34–45.
42. Lezhnina, O.; Kismihók, G. Combining statistical and machine learning methods to explore German students’ attitudes towards ICT in PISA. Int. J. Res. Method Educ. 2022, 45, 180–199.
43. Lundberg, S.M.; Lee, S.I. A unified approach to interpreting model predictions. arXiv 2017.
44. Maxwell, J.A. Why qualitative methods are necessary for generalization. Qual. Psychol. 2021, 8, 111.
45. Lundberg, S.M.; Nair, B.; Vavilala, M.S.; Horibe, M.; Eisses, M.J.; Adams, T.; Liston, D.E.; Low, D.K.W.; Newman, S.F.; Kim, J.; et al. Explainable machine-learning predictions for the prevention of hypoxaemia during surgery. Nat. Biomed. Eng. 2018, 2, 749–760.
46. Linardatos, P.; Papastefanopoulos, V.; Kotsiantis, S. Explainable AI: A Review of Machine Learning Interpretability Methods. Entropy 2020, 23, 18.
47. Mustafa Abdullah, D.; Mohsin Abdulazeez, A. Machine Learning Applications based on SVM Classification A Review. Qubahan Acad. J. 2021, 1, 81–90.
48. Belle, V.; Papantonis, I. Principles and Practice of Explainable Machine Learning. Front. Big Data 2021, 4, 688969.
49. Al-Najjar, H.A.H.; Pradhan, B.; Beydoun, G.; Sarkar, R.; Park, H.-J.; Alamri, A. A novel method using explainable artificial intelligence (XAI)-based Shapley Additive Explanations for spatial landslide prediction using Time-Series SAR dataset. Gondwana Res. 2022, 123, 107–124.
50. Vishwarupe, V.; Joshi, P.M.; Mathias, N.; Maheshwari, S.; Mhaisalkar, S.; Pawar, V. Explainable AI and Interpretable Machine Learning: A Case Study in Perspective. Procedia Comput. Sci. 2022, 204, 869–876.
51. Sun, J.; Sun, C.-K.; Tang, Y.-X.; Liu, T.-C.; Lu, C.-J. Application of SHAP for Explainable Machine Learning on Age-Based Subgrouping Mammography Questionnaire Data for Positive Mammography Prediction and Risk Factor Identification. Healthcare 2023, 11, 2000.
52. Chatzimparmpas, A.; Martins, R.M.; Jusufi, I.; Kucher, K.; Rossi, F.; Kerren, A. The State of the Art in Enhancing Trust in Machine Learning Models with the Use of Visualizations. Comput. Graph. Forum 2020, 39, 713–756.
53. Pessach, D.; Shmueli, E. A Review on Fairness in Machine Learning. ACM Comput. Surv. 2023, 55, 51.
54. Moran, J.M. Lifespan development: The effects of typical aging on theory of mind. Behav. Brain Res. 2013, 237, 32–40.
55. Zhou, X.; Hou, L. Children of the Cultural Revolution: The State and the Life Course in the People’s Republic of China. Am. Sociol. Rev. 1999, 64, 12–36.
56. Miles, E.; Crisp, R.J. A meta-analytic test of the imagined contact hypothesis. Group Process. Intergroup Relat. 2013, 17, 3–26.
57. González-Bailón, S.; Lelkes, Y. Do social media undermine social cohesion? A critical review. Soc. Issues Policy Rev. 2022, 17, 155–180.
58. Ooi, K.-B.; Lee, V.-H.; Hew, J.-J.; Leong, L.-Y.; Tan, G.W.-H.; Lim, A.-F. Social media influencers: An effective marketing approach? J. Bus. Res. 2023, 160, 113773.
59. Anastasiei, B.; Dospinescu, N.; Dospinescu, O. Beyond credibility: Understanding the mediators between electronic word-of-mouth and purchase intention. arXiv 2025, arXiv:2504.05359.
60. Chen, T.; Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794.
61. Tursunov, Q. The Impact of Digital and Media Literacy on Reading Comprehension Among High School Students. Excell. Int. Multi-Discip. J. Educ. (2994–9521) 2024, 2, 65–69.
62. Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.Y. LightGBM: A highly efficient gradient boosting decision tree. Adv. Neural Inf. Process. Syst. 2017, 30, 3149–3157.
63. Votto, A.; Liu, C.Z. Transparent Artificial Intelligence and Human Resource Management: A Systematic Literature Review. In Proceedings of the Annual Hawaii International Conference on System Sciences, Maui, HI, USA, 3–6 January 2023.
64. Strobl, C.; Boulesteix, A.-L.; Kneib, T.; Augustin, T.; Zeileis, A. Conditional variable importance for random forests. BMC Bioinform. 2008, 9, 307.
65. Cohen, J. A power primer. Psychol. Bull. 1992, 112, 155–159.
66. Wei, X.; Talhelm, T.; Zhang, K.; Wang, F. When Interdependence Backfires: The Coronavirus Infected Three Times More People in Rice-Farming Areas During Chinese New Year. Personal. Soc. Psychol. Bull. 2024, 50, 1471–1486.
67. Kyriazos, T.; Poga, M. Planfulness in Psychological Well-being: Mediating Roles of Self-Efficacy and Presence of Meaning in Life. Appl. Res. Qual. Life 2024, 19, 1927–1950.

Figure 1. Feature importance rankings from tree-based models (bar chart).

Figure 2. Feature importance rankings from tree-based models (radar chart).

Figure 3. Mean absolute SHAP values per feature across models.

Figure 4. SHAP summary plot across models.

Figure 5. SHAP value distribution by education level.

Figure 6. SHAP value distribution by ethnicity.

Figure 7. SHAP value distribution by gender.

Figure 8. SHAP value distribution by residence.

Figure 9. SHAP value distribution by marital status.

Table 1. The optimal hyperparameters for the machine learning models.

Model	Hyperparameter	Value	Description
Random Forest	max_depth	5	Maximum tree depth
	min_samples_split	5	Minimum samples to split node
	n_estimators	300	Number of trees in forest
XGBoost	learning_rate	0.1	Boosting learning rate
	max_depth	3	Maximum tree depth
	n_estimators	100	Number of boosting rounds
LightGBM	learning_rate	0.1	Boosting learning rate
	max_depth	3	Maximum tree depth
	n_estimators	100	Number of boosting rounds
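
The settings in Table 1 can be written down as a small configuration, sketched below. The hyperparameter names match the keyword arguments of scikit-learn's `RandomForestRegressor`, XGBoost's `XGBRegressor`, and LightGBM's `LGBMRegressor`, but the use of these particular classes is an assumption, and `HYPERPARAMS` and `build_models` are hypothetical names introduced here for illustration.

```python
# Hyperparameters transcribed from Table 1.
HYPERPARAMS = {
    "random_forest": {"max_depth": 5, "min_samples_split": 5, "n_estimators": 300},
    "xgboost": {"learning_rate": 0.1, "max_depth": 3, "n_estimators": 100},
    "lightgbm": {"learning_rate": 0.1, "max_depth": 3, "n_estimators": 100},
}

def build_models(params=HYPERPARAMS):
    """Instantiate a regressor for each library that is installed.

    Assumes scikit-learn-style estimator classes; libraries that cannot be
    imported are skipped rather than treated as errors.
    """
    models = {}
    try:
        from sklearn.ensemble import RandomForestRegressor
        models["random_forest"] = RandomForestRegressor(**params["random_forest"])
    except ImportError:
        pass
    try:
        from xgboost import XGBRegressor
        models["xgboost"] = XGBRegressor(**params["xgboost"])
    except ImportError:
        pass
    try:
        from lightgbm import LGBMRegressor
        models["lightgbm"] = LGBMRegressor(**params["lightgbm"])
    except ImportError:
        pass
    return models

# Example use (requires the libraries above):
#   models = build_models()
#   models["random_forest"].fit(X_train, y_train)
```

Keeping the values in one dictionary makes it easy to compare them against the table and to reuse the same settings across cross-validation folds.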

Table 2. Model performance metrics.

Model	R² Score	MSE	RMSE	MAE
Random Forest	0.34	0.27	0.52	0.40
XGBoost	0.35	0.26	0.51	0.40
LightGBM	0.36	0.26	0.51	0.40
Linear Regression	0.33	0.27	0.52	0.40
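
As a quick sanity check on Table 2, the sketch below verifies that each reported RMSE is the square root of the reported MSE within rounding error, and computes the R² difference between the best tree-based model and linear regression. The values are transcribed from the table; the helper name and the 0.01 tolerance are choices made here, not part of the original analysis.

```python
from math import sqrt

# Values transcribed from Table 2: (R2, MSE, RMSE, MAE).
TABLE2 = {
    "Random Forest": (0.34, 0.27, 0.52, 0.40),
    "XGBoost": (0.35, 0.26, 0.51, 0.40),
    "LightGBM": (0.36, 0.26, 0.51, 0.40),
    "Linear Regression": (0.33, 0.27, 0.52, 0.40),
}

def rmse_matches_mse(mse, rmse, tol=0.01):
    """True if sqrt(MSE) agrees with the reported RMSE within rounding tolerance."""
    return abs(sqrt(mse) - rmse) <= tol

consistent = {name: rmse_matches_mse(m[1], m[2]) for name, m in TABLE2.items()}

# Gap in R2 between the best tree-based model and linear regression.
best_tree_r2 = max(TABLE2[m][0] for m in ("Random Forest", "XGBoost", "LightGBM"))
r2_gain = round(best_tree_r2 - TABLE2["Linear Regression"][0], 2)

print(consistent)
print("R2 gain of best tree-based model over linear regression:", r2_gain)
```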

Table 3. Multiple linear regression results predicting social attitudes.

Predictor	B	SE	t	p	95% CI
Constant	3.34	0.11	29.16	<0.001	[3.112, 3.561]
Reading volume	0.27	0.02	11.44	<0.001	[0.220, 0.311]
Gender	0.13	0.02	6.14	<0.001	[0.087, 0.168]
Social media browsing time	0.00	0.00	3.45	0.001	[0.000, 0.001]
Leisure learning	0.027	0.01	2.15	0.031	[0.002, 0.049]
Self-rated Mandarin fluency	0.05	0.01	4.89	<0.001	[0.031, 0.071]
Residence	−0.22	0.02	−9.14	<0.001	[−0.269, −0.174]
Age	−0.01	0.00	−4.53	<0.001	[−0.006, −0.003]
Marital status	−0.01	0.04	−0.21	0.834	[−0.076, 0.061]
Ethnicity	−0.03	0.04	−0.65	0.519	[−0.104, 0.052]
Annual income	0.04	0.02	2.51	0.012	[0.008, 0.068]
Education level	0.11	0.03	3.55	<0.001	[0.048, 0.166]
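
Because B and SE in Table 3 are rounded to two decimals, dividing them does not reproduce the reported t values exactly (for reading volume, 0.27 / 0.02 = 13.5 rather than 11.44). The sketch below instead recovers approximate unrounded estimates from the three-decimal confidence intervals, assuming symmetric intervals of the form B ± z·SE with a normal critical value. That interval form is an assumption about how the table was computed, and `implied_t` and `ROWS` are hypothetical names.

```python
from statistics import NormalDist

# Two-sided 95% critical value under a normal approximation (about 1.96).
Z = NormalDist().inv_cdf(0.975)

def implied_t(ci_low, ci_high, z=Z):
    """Recover B as the CI midpoint and SE as half-width / z, then t = B / SE."""
    b = (ci_low + ci_high) / 2
    se = (ci_high - ci_low) / (2 * z)
    return b / se

# (95% CI, reported t) for three rows transcribed from Table 3.
ROWS = {
    "Reading volume": ((0.220, 0.311), 11.44),
    "Residence": ((-0.269, -0.174), -9.14),
    "Education level": ((0.048, 0.166), 3.55),
}

implied = {name: implied_t(*ci) for name, (ci, _) in ROWS.items()}
for name, (_, reported) in ROWS.items():
    print(f"{name}: implied t = {implied[name]:.2f}, reported t = {reported}")
```

Under these assumptions the implied t values agree closely with the reported ones, which suggests the apparent mismatch between B, SE, and t in the table is a rounding effect rather than an inconsistency.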

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

