Entry

Application of Machine Learning Models in Social Sciences: Managing Nonlinear Relationships

1 Department of Psychology, Panteion University, 17671 Athens, Greece
2 Independent Researcher, 17671 Athens, Greece
* Author to whom correspondence should be addressed.
Encyclopedia 2024, 4(4), 1790-1805; https://doi.org/10.3390/encyclopedia4040118
Submission received: 26 September 2024 / Revised: 16 November 2024 / Accepted: 21 November 2024 / Published: 27 November 2024
(This article belongs to the Collection Encyclopedia of Social Sciences)

Definition:
The increasing complexity of social science data and phenomena necessitates the use of advanced analytical techniques to capture nonlinear relationships that traditional linear models often overlook. This chapter explores the application of machine learning (ML) models in social science research, focusing on their ability to manage nonlinear interactions in multidimensional datasets. Nonlinear relationships are central to understanding social behaviors, socioeconomic factors, and psychological processes. Machine learning models, including decision trees, neural networks, random forests, and support vector machines, provide a flexible framework for capturing these intricate patterns. The chapter begins by examining the limitations of linear models and introduces essential machine learning techniques suited for nonlinear modeling. A discussion follows on how these models automatically detect interactions and threshold effects, offering superior predictive power and robustness against noise compared to traditional methods. The chapter also covers the practical challenges of model evaluation, validation, and handling imbalanced data, emphasizing cross-validation and performance metrics tailored to the nuances of social science datasets. Practical recommendations are offered to researchers, highlighting the balance between predictive accuracy and model interpretability, ethical considerations, and best practices for communicating results to diverse stakeholders. This chapter demonstrates that while machine learning models provide robust solutions for modeling nonlinear relationships, their successful application in social sciences requires careful attention to data quality, model selection, validation, and ethical considerations. Machine learning holds transformative potential for understanding complex social phenomena and informing data-driven policy-making in psychology, sociology, and political science.

1. Introduction

1.1. Overview of Nonlinear Relationships in Social Sciences

Nonlinear relationships are fundamental to understanding the complexity of social phenomena [1]. In much social science research, variables are traditionally assumed to interact in simple, proportional ways. However, this assumption often overlooks the reality that many relationships are inherently nonlinear [2,3]. In nonlinear relationships, changes in one variable do not consistently lead to proportional changes in another. Instead, the effect of a variable may vary based on other factors, leading to curvilinear, threshold, or even chaotic patterns. These patterns are particularly prevalent in psychology, sociology, and demography, where human behavior and social systems exhibit dynamic, context-dependent interactions [4,5].
Linear models assume a direct, proportional relationship between independent variables and a dependent outcome [6]. For example, in a typical linear regression, each additional year of education is expected to result in a uniform increase in income, regardless of baseline education levels [7]. However, this assumption of uniform effects across all values fails to capture complexities. Nonlinear models, by contrast, allow for more flexible relationships, where the impact of a predictor may grow, shrink, or change direction depending on its value or the values of other variables [8]. For example, the effect of education on income might be modest up to a certain point, such as completing high school, but becomes significantly larger with further higher education.
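The education–income example above can be sketched numerically. The figures below (a 12-year threshold, slopes of 1,000 and 4,000 per year of schooling) are invented for illustration; the point is that a single least-squares slope blends the two regimes and misrepresents both:

```python
import numpy as np

# Hypothetical figures: income rises by 1,000 per school year up to year 12
# (roughly high-school completion), then by 4,000 per year beyond it.
years = np.arange(8, 21, dtype=float)
income = np.where(years <= 12,
                  20_000 + 1_000 * years,
                  20_000 + 1_000 * 12 + 4_000 * (years - 12))

# A single OLS line forces one slope onto both regimes.
slope, intercept = np.polyfit(years, income, deg=1)
print(f"fitted slope: {slope:.0f} per year")

# The blended slope understates returns after the threshold
# and overstates them before it.
assert 1_000 < slope < 4_000
```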
Nonlinear models have emerged as essential tools for analyzing the complex interactions that linear models frequently overlook. For example, the effect of education on voting behavior can vary significantly across different socioeconomic groups and regions. Research indicates that higher educational attainment generally correlates with increased political engagement, but this relationship is not uniform. Individuals without a high school diploma exhibit minimal political engagement, but those with a college degree show marked increases in participation [9,10,11]. Similarly, cognitive performance follows a curvilinear trajectory: it tends to improve in adolescence and young adulthood, peaks in midlife, and declines in later years. This pattern exemplifies the limitations of linear models, which assume a constant rate of change and fail to account for varying improvements and declines across the lifespan [12,13].
Nonlinear relationships are also evident in the context of income and health outcomes. While higher income typically leads to better access to healthcare and improved health outcomes, the positive effects diminish beyond a certain income threshold. Once essential healthcare needs are met, further increases in income provide little additional health benefit [14,15]. These examples underscore the necessity of nonlinear models for capturing the intricate and multifaceted nature of social phenomena, providing a more accurate representation of dynamic relationships than linear models. This has important implications for policy-making and interventions addressing complex social issues [10,11].
Traditional linear models have been widely utilized in social science due to their simplicity and ease of interpretation. However, they often fail to capture the complexities of social phenomena, leading to significant limitations. One key drawback is their tendency to oversimplify relationships, such as assuming that social support always has a linear positive effect on mental health. This neglects the possibility of diminishing returns or adverse effects, such as dependency or stress from excessive support [16,17]. Moreover, linear models struggle with threshold effects, where a predictor’s influence only becomes significant after crossing a critical point. For example, the relationship between years of schooling and job satisfaction may only emerge after an individual obtains a formal degree or certification [18,19].
Linear models also face challenges in adequately representing interactions between variables. For instance, the effect of parental involvement on student achievement may vary depending on the school’s quality or the student’s socioeconomic background. While linear models can incorporate interaction terms, this can lead to issues like multicollinearity and over-specification, particularly in high-dimensional datasets [20,21]. Furthermore, the manual specification of interactions increases the likelihood of overlooking essential patterns in the data [22].
Finally, linear models can lead to misleading inferences when ignoring nonlinear relationships. For example, studies on the relationship between income and happiness often reveal that happiness levels off after a certain income threshold, contradicting the linear assumption that happiness increases indefinitely with income [23]. Neglecting this nonlinearity can result in policies that overemphasize income to enhance well-being while neglecting other factors like social relationships and personal fulfillment [24].
In summary, while traditional linear models offer a straightforward approach, they often fail to capture the complexity of social phenomena. Their limitations in simplifying relationships, missing threshold effects, and inadequately representing interactions highlight the need for more flexible models. Nonlinear models, which can better accommodate the intricate dynamics of social behavior and interactions, are crucial for generating meaningful insights in social science research [25]. As social science increasingly relies on large and complex datasets, adopting nonlinear modeling techniques becomes critical to avoid the pitfalls of overly simplistic assumptions.

1.2. Introduction to Machine Learning

Linear models have historically been the cornerstone of social science research due to their simplicity and interpretability. However, as social phenomena are increasingly recognized as complex, the limitations of linear models hinder their ability to capture the intricacies of human behavior, social interactions, and psychological processes. This has led to the adoption of more flexible approaches, such as machine learning (ML), which can handle the complexity of real-world social data [26,27,28,29,30].
Machine learning models differ from linear models in both their assumptions and goals. While linear models focus on estimating specific parameters to describe relationships between variables, ML models prioritize prediction and pattern recognition. This distinction is particularly relevant for social science, where the true relationships between variables are often unknown or highly complex. Machine learning algorithms can learn these relationships directly from the data, making them better suited to modeling nonlinear dynamics that traditional approaches might overlook [31,32,33].
A key strength of ML models is their ability to capture nonlinear relationships. Algorithms such as decision trees, random forests, and neural networks are designed to manage nonlinear interactions. Decision trees, for example, split data into branches based on decision rules, effectively capturing sudden changes or threshold effects. Neural networks can model complex nonlinear interactions through their layered architecture by adjusting connection weights between neurons, offering a more nuanced understanding of variable relationships [34,35,36].
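The threshold-capturing behavior described above can be illustrated with a minimal decision "stump" that searches for the single best split point; the data form a toy step function and are not drawn from any study:

```python
# A decision tree's basic move is a single threshold split. This sketch
# finds the split that minimizes squared error on toy data with a sharp
# threshold effect (the outcome jumps once the predictor passes 5).
def best_split(x, y):
    """Return the threshold whose two-group means best fit y."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    xs = [x[i] for i in order]
    ys = [y[i] for i in order]
    best_t, best_err = None, float("inf")
    for i in range(1, len(xs)):
        t = (xs[i - 1] + xs[i]) / 2          # candidate threshold (midpoint)
        left, right = ys[:i], ys[i:]
        err = sum((v - sum(left) / len(left)) ** 2 for v in left) \
            + sum((v - sum(right) / len(right)) ** 2 for v in right)
        if err < best_err:
            best_t, best_err = t, err
    return best_t

x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]   # step change after x = 5
print(best_split(x, y))  # 5.5: the stump recovers the threshold exactly
```

A full tree simply applies this search recursively within each resulting subset, which is how conditional, layered interactions emerge without manual specification.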
ML models also excel in automatically detecting interactions and threshold effects, which would require manual specification in linear models. Random forests, for example, consist of multiple decision trees, each potentially uncovering different combinations of interacting variables contributing to predicting outcomes. This automatic detection is particularly beneficial in high-dimensional datasets, where the number of possible interactions makes manual specification impractical [37,38,39].
Another advantage of ML models is their robustness to noise and outliers. Ensemble methods like random forests mitigate the influence of outliers by averaging predictions across multiple trees, producing more stable results. Similarly, neural networks employ regularization techniques like dropout to reduce overfitting and increase resilience to noisy data, making them suitable for complex, real-world social science data [40,41,42].
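The stabilizing effect of ensemble averaging can be shown with a toy simulation, under the simplifying assumption that each member's errors are independent (real trees are correlated, so the gain in practice is smaller than the idealized square-root factor):

```python
import numpy as np

rng = np.random.default_rng(0)
truth = 10.0

# 200 repetitions of: one noisy model vs. the mean of 50 noisy models.
single = truth + rng.normal(0.0, 2.0, size=200)
ensemble = (truth + rng.normal(0.0, 2.0, size=(200, 50))).mean(axis=1)

print(f"single-model spread:      {single.std():.2f}")
print(f"50-model ensemble spread: {ensemble.std():.2f}")

# Averaging independent errors shrinks the spread by roughly sqrt(50).
assert ensemble.std() < single.std() / 3
```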
The scalability of ML models for high-dimensional data is another significant advantage. Algorithms like support vector machines (SVMs) and gradient boosting machines (GBMs) are designed to handle large predictor sets and can automatically select the most relevant features. This capability is invaluable in sociology, psychology, and demography, where datasets are increasingly large and complex [43,44].
While linear models often focus on inference, machine learning models emphasize prediction accuracy. This shift is particularly relevant in domains like behavioral psychology and public health, where the primary goal is to predict outcomes—such as mental health disorders or voting behavior—rather than test specific hypotheses [45,46]. Machine learning’s predictive power makes it a valuable tool for understanding and forecasting social phenomena [47,48].
However, despite their strengths in prediction and nonlinear modeling, ML models often sacrifice interpretability, which remains essential in social science research. Understanding the mechanisms driving observed relationships is as crucial as making accurate predictions [49,50]. Hybrid approaches are emerging to address this challenge. These involve using ML to explore nonlinear relationships and identify patterns, followed by applying more interpretable models like generalized additive models or decision trees to understand the nature of these relationships. New tools, such as SHAP values and LIME, enhance the ability to extract interpretable insights from even the most complex ML models [51,52].
Ultimately, while linear models have historically been foundational in social science research, the growing complexity of social data necessitates more flexible, data-driven approaches. With their ability to manage nonlinear dynamics, detect interactions, and scale to high-dimensional data, machine learning models offer a powerful alternative to traditional models. As social science evolves, machine learning will play a central role in uncovering the intricate, nuanced relationships that shape human behavior and societal outcomes [53,54].

2. How to Apply Machine Learning Models in Social Sciences

2.1. Machine Learning Models for Nonlinear Relationships

As social science datasets grow in complexity, incorporating multidimensional, nonlinear relationships, machine learning (ML) models have become essential tools for uncovering these intricate patterns [55]. Unlike traditional statistical models, which rely on predefined assumptions about relationships between variables, ML models can identify hidden patterns without imposing linear constraints. This capacity to manage nonlinearity makes ML models particularly valuable for modeling the complex interactions often found in social science research, where factors such as socioeconomic status, education, and health often interact nonlinearly and context-dependently [56].
This section focuses on machine learning models widely used to handle nonlinear relationships, including decision trees, neural networks, ensemble methods such as random forests and gradient boosting machines, and support vector machines. We will explore how each model operates, its strengths and weaknesses, and its relevance for social science research, especially concerning interpretability and predictive power issues [57].
Decision trees are among the most intuitive ML models for nonlinear relationships, making them attractive for social science researchers [58]. These models work by partitioning the dataset into smaller subsets based on the values of predictor variables [59]. Each split represents a decision rule that maximizes separation between different outcomes, resulting in a tree-like structure. The hierarchical nature of decision trees allows them to capture complex, conditional relationships that would be difficult to model manually in linear terms [60]. For instance, a tree might split data based on socioeconomic status and then subdivide based on education level, revealing how these factors interact to predict outcomes such as job satisfaction or health status. This recursive, conditional splitting is particularly effective for modeling nonlinear interactions, as it captures multiple layers of complexity that traditional linear models might overlook [61].
However, decision trees are not without limitations. One key drawback is their tendency to overfit the data, especially when trees grow too deep [62]. Overfitting occurs when the model is too tailored to the training data, resulting in poor generalization of new, unseen data [63]. Moreover, decision trees are known for their instability—slight variations in the input data can lead to entirely different tree structures, which reduces the model’s reliability in making consistent predictions. Despite these limitations, decision trees remain highly interpretable, offering researchers a visual representation of how predictor variables contribute to outcomes [64].
Neural networks, in contrast, are much less interpretable but offer significantly higher predictive power, especially when working with large, complex datasets [65]. Inspired by the architecture of the human brain, neural networks consist of layers of interconnected nodes, or neurons, that transform input data through a series of weighted sums and activation functions [66]. These activation functions introduce nonlinearity, enabling neural networks to model highly intricate relationships that are difficult for simpler models to capture. For example, a neural network might model the relationship between psychological well-being and a combination of variables such as age, social support, and income, where each variable might exert a nonlinear influence on the outcome [67].
While neural networks excel at detecting nonlinear interactions, their complexity often renders them “black boxes” in interpretability [68]. It is challenging to understand how a neural network arrives at its predictions, which can be a significant drawback in fields like social science, where researchers often seek to explain the relationships between variables, not just predict outcomes [69]. Additionally, neural networks are computationally intensive, requiring significant processing power and large amounts of data to perform optimally [70]. They are also prone to overfitting, especially when the architecture is too complex for the size of the dataset. Regularization techniques such as dropout or early stopping can mitigate this risk, but they add to the complexity of the model-building process [71,72].
Ensemble methods, such as random forests and gradient boosting machines (GBMs), strike a compromise between interpretability and predictive power [73]. Both models rely on combining multiple decision trees to improve the accuracy of predictions, though they differ in their approach. Random forests generate multiple trees, each trained on random subsets of the data, and aggregate their predictions to reduce variance and avoid overfitting [74]. This ensemble approach captures a wide range of nonlinear relationships, as different trees focus on different aspects of the data [75]. For instance, in a study of voting behavior, one tree might model the interaction between education and income, while another captures the influence of political ideology and geographic location. The final prediction reflects the aggregate insight of these varied perspectives [76].
Gradient boosting machines, on the other hand, build trees sequentially, with each tree attempting to correct the errors of the previous one. This iterative learning process allows GBMs to capture more subtle nonlinearities and threshold effects [77]. For example, in a study predicting educational attainment, a GBM might model the nuanced, nonlinear relationship between parental education, income, and school quality [78]. The boosting process enables the model to focus on correcting mistakes made by earlier trees, resulting in a more accurate representation of the underlying data [79]. Despite their high predictive accuracy, random forests and GBMs sacrifice some interpretability. While feature importance scores can help identify the most influential variables, the overall model structure remains opaque compared to a single decision tree [80].
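The sequential error-correction idea behind boosting can be sketched with hand-rolled stumps on invented data containing two thresholds; production GBM implementations add many refinements (shrinkage schedules, subsampling, regularization) omitted here:

```python
import numpy as np

# A minimal boosting loop: each round fits a one-split "stump" to the
# current residuals and adds a damped version of its prediction.
def fit_stump(x, y):
    """Best single-threshold split by squared error; returns a predictor."""
    best = None
    for t in (x[:-1] + x[1:]) / 2:                    # candidate midpoints
        lm, rm = y[x <= t].mean(), y[x > t].mean()
        err = ((y - np.where(x <= t, lm, rm)) ** 2).sum()
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda q: np.where(q <= t, lm, rm)

x = np.arange(1.0, 11.0)
y = np.where(x <= 5, 1.0, 0.0) + np.where(x <= 8, 0.0, 2.0)  # two thresholds

pred = np.zeros_like(y)
for _ in range(50):                                   # sequential correction
    stump = fit_stump(x, y - pred)                    # fit to residuals
    pred += 0.5 * stump(x)                            # learning rate 0.5

mse = ((y - pred) ** 2).mean()
print(f"training MSE after boosting: {mse:.4f}")
assert mse < 0.05   # the stumps jointly recover both thresholds
```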
Support vector machines (SVMs) offer another powerful approach for modeling nonlinear relationships. SVMs use the “kernel trick” to map data into higher-dimensional space, enabling them to find decision boundaries that separate data points in ways impossible in the original feature space [81]. This approach is beneficial for classification tasks in which the relationship between predictors and outcomes is highly complex and nonlinear. For example, an SVM might use a radial basis function (RBF) kernel to model the relationship between demographic variables and political preferences, capturing nonlinear patterns that a traditional linear model would miss [82]. However, SVMs, like neural networks, are computationally intensive and can be challenging to interpret [83].
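The RBF kernel at the heart of a nonlinear SVM can be computed directly; the points and the gamma value below are arbitrary illustrations:

```python
import numpy as np

# RBF kernel: similarity decays smoothly with squared distance, which
# implicitly maps points into a richer, higher-dimensional space.
def rbf_kernel(a, b, gamma=0.5):
    diff = a[:, None, :] - b[None, :, :]
    return np.exp(-gamma * (diff ** 2).sum(axis=-1))

X = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0]])
K = rbf_kernel(X, X)
print(np.round(K, 3))

# A point is maximally similar to itself, and similarity is symmetric.
assert np.allclose(np.diag(K), 1.0)
assert np.allclose(K, K.T)
```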
The relevance of these machine learning models to social science research lies in their ability to capture the complex, nonlinear dynamics that characterize human behavior and societal outcomes [84]. Traditional linear models often fail to explain these complexities, mainly when interactions between variables are context-dependent or exhibit threshold effects [85]. Whether studying how demographic factors influence voting behavior, how economic conditions affect mental health, or how education impacts income, machine learning models offer a flexible and powerful approach to uncovering hidden patterns [86]. Furthermore, the increasing availability of large-scale social science datasets, such as those derived from surveys, administrative records, or social media, makes ML models especially valuable for managing high-dimensional data [87].
As advancements in interpretability tools, such as SHAP (Shapley Additive Explanations) and LIME (Local Interpretable Model-Agnostic Explanations), continue to develop, researchers can bridge the gap between predictive power and the need for explanation [88]. These tools allow for the use of complex models like neural networks and GBMs while still providing insights into the contributions of individual variables. This is crucial for social science, where understanding the “why” behind predictions is as important as the accuracy of the predictions themselves [76].
As a final point, machine learning models offer a robust and flexible framework for modeling nonlinear relationships in social science [79]. Each model comes with its own set of trade-offs: decision trees provide high interpretability but are prone to overfitting; neural networks offer unparalleled predictive power at the cost of interpretability; ensemble methods strike a balance between accuracy and transparency; and SVMs excel in nonlinear classification tasks but can be computationally demanding [80]. By selecting the appropriate model and employing tools for interpretation, social scientists can leverage the full potential of machine learning to gain deeper insights into the complex dynamics that shape human behavior and social outcomes [82].

2.2. Model Evaluation, Validation, and Handling Imbalanced Data

The effectiveness of machine learning models in social science research depends on their ability to capture nonlinear relationships and how well they generalize to new, unseen data. This makes model evaluation and validation crucial [89]. Although machine learning algorithms are powerful, they are also susceptible to overfitting, poor generalization, and biases—especially when dealing with imbalanced or unrepresentative data [90]. Therefore, appropriate evaluation techniques, performance metrics, and strategies for handling imbalanced datasets are necessary to ensure these models’ reliability, fairness, and real-world applicability [91]. This section covers essential evaluation techniques, methods for addressing imbalanced data, and the ethical concerns associated with model validation and application in social science research.
Model validation is critical for ensuring a machine learning model performs well on data beyond the initial training dataset [92]. In social science, where data are often noisy or sparse, proper validation helps prevent the model from overfitting idiosyncrasies of the training data, thereby producing more generalizable predictions [93]. Overfitting occurs when a model becomes too complex and starts capturing noise rather than meaningful patterns. For example, a model designed to predict voter behavior might perfectly capture the nuances of one election cycle but fail to predict behavior in future elections. In contrast, underfitting happens when a model is overly simplistic, failing to capture the essential structure of the data [94]. For instance, using a linear regression model to predict the nonlinear relationship between income and life satisfaction may lead to poor predictions. Proper validation techniques, such as cross-validation, help balance the risks of overfitting and underfitting [95].
Cross-validation is one of the most widely used methods for assessing a model’s performance on unseen data. K-fold cross-validation, a common technique in social science, divides the dataset into k subsets (or folds), with the model being trained on k-1 folds and tested on the remaining fold [96]. This process repeats k times, ensuring the model is evaluated on every part of the data. By averaging the performance across all folds, researchers obtain a more reliable estimate of how well the model will generalize [97]. Stratified k-fold cross-validation, used in cases where the dataset is imbalanced, ensures that each fold maintains the same distribution of class labels, reducing the likelihood of the model favoring the majority class [98]. Leave-one-out cross-validation (LOOCV), another variant, is especially useful for small datasets [99]. In this approach, the model is trained on all but one data point and tested on the remaining point, repeating this process for each observation. While LOOCV optimizes limited data, it is computationally expensive and sensitive to outliers [100].
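The k-fold procedure can be written out by hand to make its guarantee explicit, namely that every observation serves as test data exactly once:

```python
# A hand-rolled k-fold split: every observation lands in the test fold
# exactly once, so each fold's score is an out-of-sample estimate.
def kfold_indices(n, k):
    folds = [list(range(i, n, k)) for i in range(k)]   # round-robin folds
    for test in folds:
        train = [i for i in range(n) if i not in set(test)]
        yield train, test

n, k = 10, 5
tested = []
for train, test in kfold_indices(n, k):
    assert set(train).isdisjoint(test)                 # no leakage
    assert len(train) + len(test) == n
    tested.extend(test)

assert sorted(tested) == list(range(n))                # each point tested once
print("5-fold split over 10 observations verified")
```

Averaging a model's score across the five folds then gives the generalization estimate described above.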
The choice of performance metrics is also critical, particularly when the research question involves different prediction tasks, such as classification or regression [101]. In classification problems—where the goal might be to predict whether an individual will vote—metrics like accuracy, precision, recall, and the F1 score are essential [102]. While accuracy is often the default metric, it can be misleading in imbalanced datasets. For example, if only 10% of individuals in a dataset commit crimes, a model predicting “no crime” for every individual would achieve 90% accuracy yet fail to identify any actual offenders. Precision becomes crucial when the cost of false positives is high, as in cases where mispredicting recidivism could result in wrongful decisions [73].
Conversely, recall is vital when false negatives are costly, such as missing the prediction of a mental health crisis. The F1 score, the harmonic mean of precision and recall, is often employed when balancing false positives and false negatives is essential. For binary classification problems with imbalanced classes, the AUC-ROC (Area Under the Curve—Receiver Operating Characteristic) is a valuable metric, indicating how well the model can distinguish between classes [93].
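The accuracy trap described above can be reproduced directly. The 10% positive rate mirrors the crime example in the text, and the all-negative "model" is the degenerate majority-class baseline:

```python
# With a 10% positive rate, a model that always predicts the negative
# class looks accurate but recalls nothing.
y_true = [1] * 10 + [0] * 90                 # 10% positives
y_all_neg = [0] * 100                        # "no crime" for everyone

def metrics(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return acc, prec, rec, f1

acc, prec, rec, f1 = metrics(y_true, y_all_neg)
print(f"accuracy={acc:.2f}  recall={rec:.2f}  F1={f1:.2f}")
assert acc == 0.90 and rec == 0.0            # high accuracy, zero recall
```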
In regression tasks, where the goal is to predict continuous outcomes like income or mental health scores, metrics such as Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R-squared are frequently used. MSE penalizes larger prediction errors more heavily, while RMSE presents the error in the same units as the dependent variable, making it more interpretable [103]. R-squared indicates the proportion of variance the model explains, though it can sometimes provide overly optimistic results in complex models [104].
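These regression metrics are short enough to compute directly; the four observations below are invented:

```python
import numpy as np

# MSE, RMSE, and R-squared on a toy regression: RMSE is the square root
# of MSE, restoring the units of the outcome variable.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 7.5, 9.0])

mse = ((y_true - y_pred) ** 2).mean()
rmse = np.sqrt(mse)
r2 = 1 - ((y_true - y_pred) ** 2).sum() / ((y_true - y_true.mean()) ** 2).sum()

print(f"MSE={mse:.3f}  RMSE={rmse:.3f}  R2={r2:.3f}")
assert np.isclose(rmse, mse ** 0.5)
assert 0.9 < r2 < 1.0        # most, but not all, variance explained
```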
A particularly challenging issue in social science research is managing imbalanced data, where one class is significantly underrepresented. This problem arises in situations such as predicting rare events like recidivism, extreme poverty, or mental health crises [105]. Without addressing this imbalance, models tend to predict the majority class, yielding poor performance in the minority class [106]. One approach to managing imbalanced data is resampling, which involves adjusting the dataset to balance the representation of classes [107]. Oversampling the minority class through techniques like SMOTE (Synthetic Minority Oversampling Technique) generates synthetic instances to help the model better learn the minority class. While this prevents loss of information from the majority class, it can lead to overfitting [108]. Conversely, undersampling the majority class reduces the imbalance by discarding data points from the majority class, though it risks losing valuable information [109].
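A stripped-down sketch of SMOTE's core move, interpolating between minority-class points (the full algorithm restricts interpolation to each point's k nearest minority neighbors, which is omitted here):

```python
import numpy as np

# SMOTE-style oversampling sketch: each synthetic minority point is placed
# on the segment between a minority example and another minority point.
rng = np.random.default_rng(42)
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0]])

def smote_like(X, n_new, rng):
    samples = []
    for _ in range(n_new):
        i, j = rng.choice(len(X), size=2, replace=False)  # point + "neighbor"
        lam = rng.random()                                # interpolation weight
        samples.append(X[i] + lam * (X[j] - X[i]))
    return np.array(samples)

synthetic = smote_like(minority, 5, rng)
print(synthetic.round(2))

# Interpolated points stay inside the minority class's bounding box.
assert (synthetic >= minority.min(axis=0)).all()
assert (synthetic <= minority.max(axis=0)).all()
```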
Another approach is cost-sensitive learning, where the model assigns higher penalties for misclassifications involving the minority class [95]. This technique avoids modifying the dataset but requires careful tuning of penalty parameters to avoid introducing bias [110]. Ensemble methods, such as random forests and gradient boosting, are naturally robust to class imbalances. These methods combine multiple models, improving performance by leveraging diverse perspectives from different decision trees or boosting iterations. Class-weighted random forests, for example, give higher importance to the minority class during tree construction, improving the model’s ability to classify minority instances correctly [73].
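Cost-sensitive evaluation can be illustrated by re-scoring the all-majority baseline with an asymmetric penalty; the 10:1 cost ratio is an arbitrary illustrative choice:

```python
# Penalize a missed minority case (false negative) ten times more than a
# false alarm. The "always majority" model that looks fine on plain error
# now looks poor.
y_true = [1] * 10 + [0] * 90
y_all_neg = [0] * 100

def weighted_error(y_true, y_pred, fn_cost=10.0, fp_cost=1.0):
    cost = 0.0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 0:
            cost += fn_cost              # missed minority case
        elif t == 0 and p == 1:
            cost += fp_cost              # false alarm
    return cost / len(y_true)

plain_error = sum(t != p for t, p in zip(y_true, y_all_neg)) / len(y_true)
weighted = weighted_error(y_true, y_all_neg)
print(f"plain error={plain_error:.2f}  cost-weighted error={weighted:.2f}")
assert plain_error == 0.10 and weighted == 1.00
```

Class-weighted training applies the same idea inside the loss function during fitting rather than only at evaluation time.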
Ethical considerations must also be paramount when validating and applying machine learning models in social science research. Machine learning models are prone to perpetuating biases in the training data, which can result in unfair or discriminatory outcomes. For instance, models predicting criminal recidivism might disproportionately penalize certain racial or ethnic groups if trained on biased historical data [111]. To mitigate these risks, fairness metrics such as demographic parity and equal opportunity can be employed to evaluate the model’s performance across different demographic groups. Demographic parity ensures that the model’s predictions are not correlated with sensitive attributes like race or gender, while equal opportunity guarantees that qualified individuals from different groups have an equal chance of being selected for a positive outcome [112]. Fairness-through-unawareness, where sensitive attributes are not used in the model, is another approach, though it does not fully address biases introduced by correlated variables [95].
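A demographic parity check reduces to comparing positive-prediction rates across groups; the group labels and predictions below are fabricated for illustration:

```python
# Demographic parity: the rate of positive predictions should not differ
# across sensitive groups. Here it clearly does.
preds = [1, 0, 1, 1, 0, 1, 0, 0, 0, 0]
group = ["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"]

def positive_rate(preds, group, g):
    members = [p for p, gr in zip(preds, group) if gr == g]
    return sum(members) / len(members)

rate_a = positive_rate(preds, group, "A")
rate_b = positive_rate(preds, group, "B")
parity_gap = abs(rate_a - rate_b)
print(f"rate A={rate_a:.1f}  rate B={rate_b:.1f}  gap={parity_gap:.1f}")
assert abs(parity_gap - 0.4) < 1e-9    # 0.6 vs 0.2: parity is violated
```

Equal opportunity is checked the same way, but restricting the comparison to individuals whose true outcome is positive.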
Transparency and accountability are also critical. Models should be interpretable, particularly when used in high-stakes areas like criminal justice or healthcare, where researchers and practitioners must understand how and why the model arrived at its predictions. Tools like SHAP (Shapley Additive Explanations) and LIME (Local Interpretable Model-Agnostic Explanations) provide interpretable insights into complex models, allowing researchers to assess the contribution of specific variables to the model’s output [104].
Model evaluation, validation, and handling of imbalanced data are integral to applying machine learning in social science research. Proper validation techniques, such as cross-validation and selecting appropriate performance metrics, ensure that models generalize well to unseen data. Addressing imbalanced data through resampling techniques, cost-sensitive learning, or ensemble methods ensures that the minority class is not overlooked. Finally, ethical concerns, particularly those related to fairness and bias, must be central to developing and applying machine learning models to ensure that these models produce equitable outcomes in social science contexts [105,106,107]. Table 1 summarizes critical aspects of model evaluation, validation, and imbalanced data handling.

2.3. Practical Recommendations for Applying Machine Learning in Social Science Research

Successfully applying machine learning in social science requires a balanced approach that ensures predictive accuracy and model interpretability. The unique characteristics of social science data—such as smaller sample sizes, noisy data, and high stakes in decision-making—necessitate careful attention to how machine learning models are developed and used [113]. Below are essential best practices that researchers should adopt to enhance the efficacy and ethical application of machine learning in this domain.

2.3.1. Prioritize Data Quality and Preprocessing

The foundation of any machine learning model is the quality of the data it analyzes. In social science, datasets often contain missing values, outliers, and noise that can distort the model’s results. Proper data handling is crucial before applying machine learning algorithms [114].
Dealing with missing data is one of the first challenges. Depending on the extent and nature of the missing data, different imputation techniques can be used, such as multiple imputation or mean imputation. Some models, like random forests, can handle missing data natively, which may be advantageous in specific cases. Additionally, normalizing or standardizing data is essential for algorithms like support vector machines (SVMs) and neural networks to ensure that no feature disproportionately influences the model’s output. However, this step is unnecessary for models like decision trees or random forests [113,114].
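The two steps just described—imputation and standardization—can be sketched in a few lines. The survey variable and values below are hypothetical; in applied work, libraries such as scikit-learn (`SimpleImputer`, `StandardScaler`) handle these steps robustly.

```python
# Sketch of two common preprocessing steps (illustrative, pure Python):
# mean imputation of missing values, then z-score standardization.

def mean_impute(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def standardize(values):
    """Rescale to zero mean and unit variance (population std)."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = var ** 0.5
    return [(v - mean) / std for v in values]

income = [30_000, None, 50_000, 40_000]   # hypothetical survey variable
imputed = mean_impute(income)             # None -> mean of observed values
scaled = standardize(imputed)             # comparable scale for SVMs, nets
print(imputed)
```

Note that mean imputation is only defensible under restrictive missingness assumptions; multiple imputation is usually preferable when data are not missing completely at random.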
Another challenge involves managing outliers. In social science research, outliers can represent meaningful rare events—such as exceptionally high incomes or severe psychological conditions—that should not be dismissed outright. Researchers must carefully decide, drawing on robust statistical techniques, whether to retain or exclude these outliers, depending on the study context [115].
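One simple, transparent screening device is the 1.5 × IQR rule, sketched below on hypothetical income data. Flagging (rather than silently dropping) candidate outliers lets the researcher inspect whether they are data errors or meaningful rare events.

```python
# Illustrative sketch: flag outliers with the 1.5 * IQR rule so they can
# be reviewed in context rather than discarded automatically.

def flag_outliers(values, k=1.5):
    s = sorted(values)
    n = len(s)
    def quartile(q):  # simple linearly interpolated quantile
        pos = q * (n - 1)
        lo, hi = int(pos), min(int(pos) + 1, n - 1)
        return s[lo] + (pos - lo) * (s[hi] - s[lo])
    q1, q3 = quartile(0.25), quartile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

incomes = [28, 31, 30, 29, 33, 32, 30, 250]  # thousands; 250 is a rare event
print(flag_outliers(incomes))  # [250]
```

Whether the flagged value is then retained, winsorized, or modeled separately remains a substantive decision, not a mechanical one.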

2.3.2. Model Selection Based on Research Goals

The choice of machine learning model should align with the research objectives. For instance, neural networks or gradient boosting machines (GBMs) may be suitable for maximizing predictive accuracy. However, if the primary focus is understanding the relationships between variables, simpler and more interpretable models like decision trees, logistic regression, or generalized additive models (GAMs) are often better suited.
Balancing predictive power and interpretability is a critical consideration in social science. High-performing models such as deep neural networks or SVMs often sacrifice transparency for complexity, which may be problematic in research fields where understanding how variables relate to one another is just as crucial as making accurate predictions. Simpler models, while potentially less accurate, allow researchers to maintain interpretability and offer clearer insights into the data [30].

2.3.3. Avoid Overfitting and Ensure Generalization

Overfitting—where a model performs well on training data but fails to generalize to unseen data—is a frequent issue in machine learning, particularly in social science, where datasets are often smaller and noisier. To mitigate overfitting, researchers should employ cross-validation techniques like k-fold cross-validation, which helps assess the model’s ability to generalize by testing it on different subsets of the data [73]. Regularization techniques, such as L1 (Lasso) and L2 (Ridge) regularization, penalize overly complex models and encourage them to focus on the most relevant features, while methods like dropout can prevent overfitting in neural networks [114,116].
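The logic of k-fold cross-validation can be illustrated with a minimal, self-contained sketch: every observation serves exactly once as test data, so the averaged score estimates out-of-sample performance. The toy "majority-class" model below is hypothetical; in practice, scikit-learn's `cross_val_score` performs this with real estimators.

```python
# Sketch of k-fold cross-validation (pure Python, hypothetical model).

def k_fold_indices(n, k):
    """Split indices 0..n-1 into k contiguous, near-equal folds."""
    base, extra = divmod(n, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(X, y, fit, score, k=5):
    scores = []
    for test_idx in k_fold_indices(len(X), k):
        test = set(test_idx)
        X_tr = [x for i, x in enumerate(X) if i not in test]
        y_tr = [t for i, t in enumerate(y) if i not in test]
        model = fit(X_tr, y_tr)  # train only on the other k-1 folds
        scores.append(score(model, [X[i] for i in test_idx],
                            [y[i] for i in test_idx]))
    return sum(scores) / len(scores)

# Toy "model": predict the majority class of the training labels.
fit = lambda X, y: round(sum(y) / len(y))
score = lambda m, X, y: sum(int(m == t) for t in y) / len(y)

X = list(range(10))
y = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
print(cross_validate(X, y, fit, score, k=5))  # 0.7
```

Because the minority class is concentrated in the last folds, the cross-validated accuracy (0.7) is visibly lower than the 70% a naive resubstitution estimate would also give only by coincidence—with real models, stratified folds are usually preferable on imbalanced data.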

2.3.4. Incorporate Ethical Considerations

Machine learning in social science often involves sensitive data related to race, gender, income, or criminal behavior, making ethical considerations essential. Ensuring fairness and addressing bias are crucial steps in model development [117]. Models should be evaluated for fairness across demographic groups using parity and equal opportunity metrics. These measures help ensure that models do not perpetuate or amplify existing biases present in the training data.
It is also critical to identify and address biases within the dataset. Bias in training data—such as underrepresenting certain groups—can result in unfair or discriminatory model predictions, particularly in high-stakes applications like criminal justice or healthcare [118].

2.3.5. Interpreting Complex Machine Learning Models

One of the primary challenges in machine learning, particularly in social science, is interpreting complex models like neural networks or ensemble methods such as random forests and GBMs. While these models offer high predictive accuracy, they are often called “black boxes” due to the difficulty in explaining their internal decision-making processes. This is problematic in social science research, where understanding the relationships between variables is critical [119].
Post hoc interpretability methods offer a way to interpret complex models without sacrificing predictive power. For example, SHAP (Shapley Additive Explanations) assigns contribution scores to each feature for every prediction, providing global and local insights into how the model uses different variables [120]. This method allows researchers to understand how income, social support, and education contribute to mental health outcomes. Similarly, LIME (Local Interpretable Model-Agnostic Explanations) offers interpretable local models to explain individual predictions by approximating the complex model with a simpler one near the prediction [121].
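SHAP approximates Shapley values efficiently for large models, but the underlying definition can be computed exactly when there are only a few features, which is useful for building intuition. The sketch below uses a hypothetical additive model and baseline values (e.g., sample means); it is not the SHAP library's algorithm, only the exact game-theoretic quantity it estimates.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    """Exact Shapley values for one instance of a small model.
    Features outside the coalition S are set to their baseline values."""
    n = len(x)
    def v(S):
        z = [x[i] if i in S else baseline[i] for i in range(n)]
        return predict(z)
    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            for S in combinations(others, size):
                w = factorial(size) * factorial(n - size - 1) / factorial(n)
                total += w * (v(set(S) | {i}) - v(set(S)))
        phi.append(total)
    return phi

# Hypothetical model: outcome = 2*income + 1*support + 0*noise
predict = lambda z: 2 * z[0] + 1 * z[1] + 0 * z[2]
x = [3.0, 2.0, 5.0]          # instance to explain
baseline = [1.0, 1.0, 1.0]   # e.g., sample means

phi = shapley_values(predict, x, baseline)
print([round(p, 2) for p in phi])  # [4.0, 1.0, 0.0]
```

The attributions sum to the difference between the instance's prediction and the baseline prediction—the efficiency property that makes Shapley-based explanations internally consistent.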
Visualizations are another crucial tool for interpreting and communicating the results of complex machine learning models. For instance, feature importance plots rank the predictors based on their contribution to the model’s accuracy, helping researchers identify the most influential variables. Partial dependence plots (PDPs) show the marginal effect of a single predictor on the outcome while holding other variables constant, making it easier to understand nonlinear relationships within the model. For example, a PDP might demonstrate how income affects job satisfaction, showing diminishing returns beyond a certain threshold [122].
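The computation behind a one-dimensional PDP is straightforward and worth seeing directly: fix the feature of interest at each grid value across the whole sample, average the model's predictions, and read off the marginal trend. The model and data below are hypothetical, constructed so income (feature 0) shows diminishing returns beyond a threshold; scikit-learn's `PartialDependenceDisplay` produces the corresponding plots for fitted estimators.

```python
# Sketch of a one-dimensional partial dependence computation.

def partial_dependence(predict, X, feature, grid):
    pd = []
    for value in grid:
        preds = []
        for row in X:
            z = list(row)
            z[feature] = value   # hold the target feature fixed
            preds.append(predict(z))
        pd.append(sum(preds) / len(preds))  # average over the sample
    return pd

# Hypothetical model: the effect of income (feature 0) on job
# satisfaction flattens above income = 3 (diminishing returns).
predict = lambda z: min(z[0], 3.0) + 0.5 * z[1]

X = [[1.0, 0.0], [2.0, 2.0], [5.0, 4.0]]   # toy sample
grid = [1.0, 2.0, 3.0, 4.0, 5.0]
print(partial_dependence(predict, X, 0, grid))  # [2.0, 3.0, 4.0, 4.0, 4.0]
```

The flat segment of the output is exactly the threshold effect a PDP would reveal visually: increases in income beyond the cutoff no longer raise the predicted outcome.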
Simplifying model complexity without significantly reducing accuracy may be possible in some cases. For example, a deep neural network might be replaced by a shallower one or a decision tree, trading off some predictive power for increased interpretability [123].

2.3.6. Communicating Results to Diverse Audiences

Once machine learning models have been evaluated and interpreted, the next challenge is effectively communicating the findings to various audiences, including academic peers, policy-makers, practitioners, and the general public. Different stakeholders require different levels of detail and technical explanation, so communication strategies must be tailored accordingly [124].
For academic peers, maintaining technical rigor is essential. Researchers should clearly explain the model’s architecture, assumptions, and limitations. Transparency regarding the choice of features, regularization techniques, and validation methods is necessary to ensure scientific integrity. Additionally, sharing reproducible code and data is crucial for validating the findings and advancing knowledge within the academic community [120].
For policy-makers and practitioners, the focus should be on actionable insights rather than technical details. Simplifying explanations of the model’s mechanics while emphasizing key findings allows policy-makers to make informed decisions. Ethical considerations should be highlighted, particularly in healthcare or criminal justice areas where model bias or fairness issues can have significant real-world consequences.
For the general public, the emphasis should be on clarity and societal impact. Avoiding technical jargon and focusing on the broader implications of the research will make the findings more accessible to non-expert audiences. Simple visualizations and straightforward explanations help build trust, especially when addressing concerns about fairness or bias in model predictions [122].
Applying machine learning in social science research requires a thoughtful and rigorous approach that balances predictive accuracy with interpretability. Researchers must prioritize data quality, choose appropriate models based on research goals, avoid overfitting, and incorporate ethical considerations. By using interpretability techniques and tailoring communication to different audiences, the potential of machine learning in social science can be fully realized, providing valuable insights into complex societal phenomena [120].

3. Conclusions

Integrating machine learning (ML) into social science research represents a significant methodological advancement, enabling the analysis of complex, nonlinear relationships that traditional linear models often overlook. Social phenomena such as education, income, health, and political engagement involve intricate, context-dependent interactions. ML models—including decision trees, random forests, neural networks, and support vector machines (SVMs)—can reveal these dynamics by uncovering threshold effects and non-proportional outcomes typically missed by linear models [125,126,127,128].
One of the critical strengths of machine learning is its flexibility in modeling nonlinearity without requiring manual specification of interactions. For instance, decision trees partition data based on predictor variables, exposing conditional relationships. Techniques like random forests and gradient boosting machines (GBMs) help to mitigate overfitting while capturing broader interactions [57]. Neural networks, with their multi-layered architectures, are particularly well suited for modeling complex, multidimensional relationships, such as the effects of age, income, and psychological well-being on mental health. However, their complexity can hinder interpretability [66].
Despite the predictive power of these models, particularly deep learning algorithms, they are often seen as “black boxes,” making it difficult to interpret how specific inputs influence outcomes. This opacity is especially problematic in social science, where understanding the relationships between variables is crucial for informing policy and theory [119]. Post hoc interpretability tools, such as SHAP (Shapley Additive Explanations) and LIME (Local Interpretable Model-Agnostic Explanations), offer a partial solution by providing contribution scores for predictions, which enhances transparency in fields like health and criminal justice [120]. Nevertheless, ethical challenges remain, particularly in high-stakes areas where biased training data could result in discriminatory outcomes. Fairness metrics, such as demographic parity and equal opportunity, can help ensure equitable results across demographic groups.
Addressing imbalanced data is another critical issue in social science, where rare events like recidivism or extreme poverty are often underrepresented. SMOTE (Synthetic Minority Oversampling Technique), undersampling, and cost-sensitive learning can improve model performance in minority classes [109]. Furthermore, cross-validation methods like k-fold and stratified k-fold, alongside regularization techniques such as L1 and L2, enhance model generalization, ensuring robust performance on unseen data [95].
Balancing predictive power with interpretability is a central challenge in applying machine learning to social science. High-performing models like deep neural networks and GBMs often provide superior accuracy at the expense of transparency, while simpler models like decision trees offer clearer insights at the cost of some predictive power. Researchers must select models based on their research objectives, sometimes opting for hybrid approaches that combine machine learning with traditional statistical methods to achieve better interpretability [122].
As social science datasets increase in size and complexity, the role of machine learning will continue to grow. Its ability to handle high-dimensional data and identify nonlinear interactions gives social scientists deeper insights into complex societal issues. As interpretability tools improve, the trade-off between predictive power and transparency will diminish, allowing the use of more advanced models without sacrificing clarity [120].
Ultimately, machine learning has the potential to transform social science research by modeling complex, nonlinear relationships and uncovering hidden patterns in large datasets. However, successful implementation depends on balancing interpretability, fairness, and generalization. By adhering to best practices in data quality and model validation, researchers can harness the power of machine learning to generate meaningful insights into social dynamics and human behavior.

Author Contributions

Conceptualization, T.K. and M.P.; methodology, T.K. and M.P.; software, T.K. and M.P.; validation, T.K. and M.P.; formal analysis, T.K. and M.P.; investigation, T.K. and M.P.; resources, T.K. and M.P.; data curation, T.K. and M.P.; writing—original draft preparation, T.K. and M.P.; writing—review and editing, T.K. and M.P.; visualization, T.K. and M.P.; supervision, T.K. and M.P.; project administration, T.K. and M.P.; funding acquisition, T.K. and M.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

No new data were created or analyzed in this study. Therefore, data sharing is not applicable to this article.

Acknowledgments

No administrative or technical support, nor donations of any kind, were provided for this work.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Room, G. The Empirical Investigation of Nonlinear Dynamics in the Social World. Ontology, Methodology and Data. Sociologica 2020, 14, 163–193. [Google Scholar]
  2. Kravchenko, S. The birth of “normal trauma”: The effect of nonlinear development. Econ. Sociol. 2020, 13, 150–159. [Google Scholar] [CrossRef]
  3. Strydom, G.; Ewing, M.T.; Heggen, C. Time lags, nonlinearity and asymmetric effects in an extended service-profit chain. Eur. J. Mark. 2020, 54, 2343–2363. [Google Scholar] [CrossRef]
  4. Girme, Y.U. Step out of line: Modeling nonlinear effects and dynamics in close-relationships research. Curr. Dir. Psychol. Sci. 2020, 29, 351–357. [Google Scholar] [CrossRef]
  5. Sanclemente Ibáñez, F.J.; Gamero Vázquez, N.; Arenas Moreno, A.; Medina Díaz, F.J. Linear and nonlinear relationships between job demands-resources and psychological and physical symptoms of service sector employees. When is the midpoint a good choice? Front. Psychol. 2022, 13, 950908. [Google Scholar]
  6. Hope, T.M. Linear regression. In Machine Learning; Academic Press: Cambridge, MA, USA, 2020; pp. 67–81. [Google Scholar]
  7. Okoye, K.; Hosseini, S. Regression Analysis in R: Linear Regression and Logistic Regression. In R Programming: Statistical Data Analysis in Research; Springer Nature Singapore: Singapore, 2024; pp. 131–158. [Google Scholar]
  8. Munir, K.; Kanwal, A. Impact of educational and gender inequality on income and income inequality in South Asian countries. Int. J. Soc. Econ. 2020, 47, 1043–1062. [Google Scholar] [CrossRef]
  9. Caffrey-Maffei, L. Education, Self-Importance, and the Propensity for Political Participation. Perceptions 2019, 5. [Google Scholar] [CrossRef]
  10. Oser, J.; Hooghe, M. Democratic ideals and levels of political participation: The role of political and social conceptualisations of democracy. Br. J. Politics Int. Relat. 2018, 20, 711–730. [Google Scholar] [CrossRef]
  11. Pellicer, M.; Assaad, R.; Krafft, C.; Salemi, C. Grievances or skills? The effect of education on youth political participation in Egypt and Tunisia. Int. Political Sci. Rev. 2022, 43, 191–208. [Google Scholar] [CrossRef]
  12. Dim, E.E.; Schafer, M.H. Age, Political Participation, and Political Context in Africa. J. Gerontol. Ser. B Psychol. Sci. Soc. Sci. 2024, 79, gbae035. [Google Scholar] [CrossRef]
  13. Pickering, D. Political activation and social movements: Addressing non-participation in Aotearoa New Zealand. Sociol. Compass 2023, 17, e13022. [Google Scholar] [CrossRef]
  14. Džunić, M.; Golubović, N. Civic and Political Participation in Transition Countries: The Case of Serbia. Facta Univ. Ser. Econ. Organ. 2018, 15, 001–013. [Google Scholar] [CrossRef]
  15. Kutuk, Y.; Usturali, A. The nonlinear relationship between political trust and nonelectoral political participation in democratic and nondemocratic regimes. Soc. Sci. Q. 2023, 104, 478–504. [Google Scholar] [CrossRef]
  16. Nickels, S.; Steinhauer, K. Prosody–syntax integration in a second language: Contrasting event-related potentials from German and Chinese learners of English using linear mixed effect models. Second Lang. Res. 2018, 34, 9–37. [Google Scholar] [CrossRef]
  17. Weng, S.F.; Reps, J.; Kai, J.; Garibaldi, J.M.; Qureshi, N. Can machine-learning improve cardiovascular risk prediction using routine clinical data? PLoS ONE 2017, 12, e0174944. [Google Scholar] [CrossRef]
  18. Bone, A.E.; Gomes, B.; Etkind, S.N.; Verne, J.; Murtagh, F.E.; Evans, C.J.; Higginson, I.J. What is the impact of population ageing on the future provision of end-of-life care? Population-based projections of place of death. Palliat. Med. 2018, 32, 329–336. [Google Scholar] [CrossRef]
  19. Guimarães, M.H.; Sousa, C.; Garcia, T.; Dentinho, T.; Boski, T. The value of improved water quality in Guadiana estuary—A transborder application of contingent valuation methodology. Lett. Spat. Resour. Sci. 2011, 4, 31–48. [Google Scholar] [CrossRef]
  20. Laparra, V.; Malo, J. Visual aftereffects and sensory nonlinearities from a single statistical framework. Front. Hum. Neurosci. 2015, 9, 557. [Google Scholar] [CrossRef]
  21. Simpson, A.H.; Richardson, S.J.; Laughlin, D.C. Soil–climate interactions explain variation in foliar, stem, root and reproductive traits across temperate forests. Glob. Ecol. Biogeogr. 2016, 25, 964–978. [Google Scholar] [CrossRef]
  22. Wouters, A.; Pauwels, B.; Lambrechts, H.A.; Pattyn, G.G.; Ides, J.; Baay, M.; Meijnders, P.; Lardon, F.; Vermorken, J.B. Counting clonogenic assays from normoxic and anoxic irradiation experiments manually or by using densitometric software. Phys. Med. Biol. 2010, 55, N167. [Google Scholar] [CrossRef]
  23. Parkes, L.; Kim, J.Z.; Stiso, J.; Calkins, M.E.; Cieslak, M.; Gur, R.E.; Gur, R.C.; Moore, T.M.; Ouellet, M.; Roalf, D.R.; et al. Asymmetric signaling across the hierarchy of cytoarchitecture within the human connectome. Sci. Adv. 2022, 8, eadd2185. [Google Scholar] [CrossRef] [PubMed]
  24. Rørvik, E.; Fjæra, L.F.; Dahle, T.J.; Dale, J.E.; Engeseth, G.M.; Stokkevåg, C.H.; Thörnqvist, S.; Ytre-Hauge, K.S. Exploration and application of phenomenological RBE models for proton therapy. Phys. Med. Biol. 2018, 63, 185013. [Google Scholar] [CrossRef] [PubMed]
  25. Bonnebaigt, R.; Caulfield, C.P.; Linden, P.F. Detrainment of plumes from vertically distributed sources. Environ. Fluid Mech. 2018, 18, 3–25. [Google Scholar] [CrossRef] [PubMed]
  26. Alpaydin, E. Machine Learning; MIT Press: Cambridge, MA, USA, 2021. [Google Scholar]
  27. El Naqa, I.; Murphy, M.J. What Is Machine Learning? Springer International Publishing: Berlin/Heidelberg, Germany, 2015; pp. 3–11. [Google Scholar]
  28. Sammut, C.; Webb, G.I. (Eds.) Encyclopedia of Machine Learning; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2011. [Google Scholar]
  29. Wang, H.; Lei, Z.; Zhang, X.; Zhou, B.; Peng, J. Machine Learning Basics [PowerPoint Slides]. 2016. Available online: http://whdeng.cn/Teaching/PPT_01_Machine%20learning%20Basics.pdf (accessed on 20 November 2024).
  30. Zhou, Z.H. Machine Learning; Springer Nature: Berlin/Heidelberg, Germany, 2021. [Google Scholar]
  31. Elhanashi, A.; Saponara, S.; Dini, P.; Zheng, Q.; Morita, D.; Raytchev, B. An integrated and real-time social distancing, mask detection, and facial temperature video measurement system for pandemic monitoring. J. Real-Time Image Process. 2023, 20, 95. [Google Scholar] [CrossRef]
  32. Levy, J.; Mussack, D.; Brunner, M.; Keller, U.; Cardoso-Leite, P.; Fischbach, A. Contrasting classical and machine learning approaches in the estimation of value-added scores in large-scale educational data. Front. Psychol. 2020, 11, 2190. [Google Scholar] [CrossRef]
  33. Yılmaz, K.; Turanlı, M. A multi-disciplinary investigation of linearization deviations in different regression models. Asian J. Probab. Stat. 2023, 22, 15–19. [Google Scholar] [CrossRef]
  34. Jordan, M.I.; Mitchell, T.M. Machine learning: Trends, perspectives, and prospects. Science 2015, 349, 255–260. [Google Scholar] [CrossRef]
  35. Hainmueller, J.; Mummolo, J.; Xu, Y. How much should we trust estimates from multiplicative interaction models? Simple tools to improve empirical practice. Political Anal. 2019, 27, 163–192. [Google Scholar] [CrossRef]
  36. Wu, J.; Chen, S.; Zhou, W.; Wang, N.; Fan, Z. Evaluation of feature selection methods using bagging and boosting ensemble techniques on high throughput biological data. In Proceedings of the 2020 10th International Conference on Biomedical Engineering and Technology, Tokyo, Japan, 15–18 September 2020; pp. 170–175. [Google Scholar]
  37. Mitchell, T.M. Machine Learning; McGraw-Hill: New York, NY, USA, 1997; Volume 1. [Google Scholar]
  38. Morris, C.; Raman, S.; Seymour, S. Openness to social science knowledges? The politics of disciplinary collaboration within the field of UK food security research. Sociol. Rural. 2019, 59, 23–43. [Google Scholar] [CrossRef]
  39. Ray, L. Explaining Violence-Towards a Critical Friendship with Neuroscience? J. Theory Soc. Behav. 2016, 46, 335–356. [Google Scholar] [CrossRef]
  40. Greener, J.G.; Kandathil, S.M.; Moffat, L.; Jones, D.T. A guide to machine learning for biologists. Nat. Rev. Mol. Cell Biol. 2022, 23, 40–55. [Google Scholar] [CrossRef] [PubMed]
  41. Neuman, Y.; Cohen, Y. AI for identifying social norm violation. Sci. Rep. 2023, 13, 8103. [Google Scholar] [CrossRef] [PubMed]
  42. van Putten, I.; Kelly, R.; Cavanagh, R.D.; Murphy, E.J.; Breckwoldt, A.; Brodie, S.; Cvitanovic, C.; Dickey-Collas, M.; Melbourne-Thomas, J.; et al. A decade of incorporating social sciences in the integrated marine biosphere research project (IMBeR): Much done, much to do? Front. Mar. Sci. 2021, 8, 662350. [Google Scholar] [CrossRef]
  43. Lebaron, F.; Castro, T.A.F. Some contributions from Geometry to linear models’ construction in Social Sciences. Bull. Sociol. Methodol./Bull. Méthodol. Sociol. 2018, 140, 90–109. [Google Scholar] [CrossRef]
  44. Yuan, Y.; Zhu, W. Artificial Intelligence-Enabled Social Science: A Bibliometric Analysis. In Proceedings of the 2022 3rd International Conference on Artificial Intelligence and Education (IC-ICAIE 2022), Chengdu, China, 24–26 June 2022; Atlantis Press: Dordrecht, The Netherlands, 2022; pp. 1602–1608. [Google Scholar]
  45. Leach, M.; Scoones, I. The social and political lives of zoonotic disease models: Narratives, science and policy. Soc. Sci. Med. 2013, 88, 10–17. [Google Scholar] [CrossRef]
  46. Veltri, G.A. Big data is not only about data: The two cultures of modelling. Big Data Soc. 2017, 4, 2053951717703997. [Google Scholar] [CrossRef]
  47. Sarker, I.H. Machine learning: Algorithms, real-world applications and research directions. SN Comput. Sci. 2021, 2, 160. [Google Scholar] [CrossRef]
  48. Janiesch, C.; Zschech, P.; Heinrich, K. Machine learning and deep learning. Electron. Mark. 2021, 31, 685–695. [Google Scholar] [CrossRef]
  49. Edelmann, A.; Wolff, T.; Montagne, D.; Bail, C.A. Computational social science and sociology. Annu. Rev. Sociol. 2020, 46, 61–81. [Google Scholar] [CrossRef]
  50. Li, Y.; Wang, S.; Song PX, K.; Wang, N.; Zhou, L.; Zhu, J. Doubly regularized estimation and selection in linear mixed-effects models for high-dimensional longitudinal data. Stat. Its Interface 2018, 11, 721. [Google Scholar] [CrossRef]
  51. Ahearn, C.; Brand, J.E. Predicting layoff among fragile families. Socius Sociol. Res. Dyn. World 2019, 5, 237802311880975. [Google Scholar] [CrossRef]
  52. Nakagawa, S.; Schielzeth, H. A general and simple method for obtaining R2 from generalized linear mixed-effects models. Methods Ecol. Evol. 2013, 4, 133–142. [Google Scholar] [CrossRef]
  53. Kong, D.; Zhu, J.; Duan, C.; Lu, L.; Chen, D. Bayesian linear regression for surface roughness prediction. Mech. Syst. Signal Process. 2020, 142, 106770. [Google Scholar] [CrossRef]
  54. Playford, C.J.; Gayle, V.; Connelly, R.; Gray, A.J. Administrative Social Science Data: The Challenge of Reproducible Research. Big Data Soc. 2016, 3, 2053951716684143. [Google Scholar] [CrossRef]
  55. Molina, M.; Garip, F. Machine learning for sociology. Annu. Rev. Sociol. 2019, 45, 27–45. [Google Scholar] [CrossRef]
  56. Di Franco, G.; Santurro, M. From big data to machine learning: An empirical application for social sciences. Athens J. Soc. Sci. 2023, 2, 79–100. [Google Scholar] [CrossRef]
  57. Lo-Thong-Viramoutou, O.; Charton, P.; Cadet, X.F.; Grondin-Perez, B.; Saavedra, E.; Damour, C.; Cadet, F. Nonlinearity of Metabolic Pathways Critically Influences the Choice of Machine Learning Model. Front. Artif. Intell. 2022, 5, 744755. [Google Scholar] [CrossRef]
  58. Hilbert, S.; Coors, S.; Kraus, E.; Bischl, B.; Lindl, A.; Frei, M.; Wild, J.; Krauss, S.; Goretzko, D.; Stachl, C. Machine learning for the educational sciences. Rev. Educ. 2021, 9, e3310. [Google Scholar] [CrossRef]
  59. Wu, P.; Jiang, J. Robust estimation of mean squared prediction error in small-area estimation. Can. J. Stat. 2021, 49, 362–396. [Google Scholar] [CrossRef]
  60. Freeman, K. Text as Data: A New Framework for Machine Learning and the Social Sciences; Princeton University Press: Princeton, NJ, USA, 2023. [Google Scholar]
  61. Kern, C.; Klausch, T.; Kreuter, F. Tree-based machine learning methods for survey research. In Survey Research Methods; NIH Public Access: Bethesda, MD, USA, 2019; Volume 13, p. 73. [Google Scholar]
  62. Wu, C.; Wang, G.; Hu, S.; Liu, Y.; Mi, H.; Zhou, Y.; Guo, Y.-K.; Song, T. A data driven methodology for social science research with left-behind children as a case study. PLoS ONE 2020, 15, e0242483. [Google Scholar] [CrossRef]
  63. Gibson, W.J.; Nafee, T.; Travis, R.; Yee, M.; Kerneis, M.; Ohman, M.; Gibson, C.M. Machine learning versus traditional risk stratification methods in acute coronary syndrome: A pooled randomized clinical trial analysis. J. Thromb. Thrombolysis 2020, 4, 1–9. [Google Scholar] [CrossRef]
  64. Zhong, S.; Zhang, K.; Bagheri, M.; Burken, J.G.; Gu, A.; Li, B.; Ma, X.; Marrone, B.L.; Ren, Z.J.; Schrier, J.; et al. Machine learning: New ideas and tools in environmental science and engineering. Environ. Sci. Technol. 2021, 55, 12741–12754. [Google Scholar] [CrossRef]
  65. Pukelis, L.; Stančiauskas, V. The opportunities and limitations of using artificial neural networks in social science research. Politologija 2019, 94, 56–80. [Google Scholar] [CrossRef]
  66. Chen, Y.; Gao, Q.; Liang, F.; Wang, X. Nonlinear variable selection via deep neural networks. J. Comput. Graph. Stat. 2021, 30, 484–492. [Google Scholar] [CrossRef]
  67. Cleophas, T.J.; Zwinderman, A.H.; Cleophas, T.J.; Zwinderman, A.H. Neural Networks for Assessing Relationships that are Typically Nonlinear (90 Patients). In Machine Learning in Medicine—A Complete Overview; Springer: Berlin/Heidelberg, Germany, 2020; pp. 423–427. [Google Scholar]
  68. Clark, D.G.; Abbott, L.F.; Litwin-Kumar, A. Dimension of activity in random neural networks. Phys. Rev. Lett. 2023, 131, 118401. [Google Scholar] [CrossRef]
  69. Rao, A.R.; Reimherr, M. Nonlinear functional modeling using neural networks. J. Comput. Graph. Stat. 2023, 32, 1248–1257. [Google Scholar] [CrossRef]
  70. Fan, W.; Ma, Y.; Li, Q.; Wang, J.; Cai, G.; Tang, J.; Yin, D. A graph neural network framework for social recommendations. IEEE Trans. Knowl. Data Eng. 2020, 34, 2033–2047. [Google Scholar] [CrossRef]
  71. Bungert, L.; Hait-Fraenkel, E.; Papadakis, N.; Gilboa, G. Nonlinear power method for computing eigenvectors of proximal operators and neural networks. SIAM J. Imaging Sci. 2021, 14, 1114–1148. [Google Scholar] [CrossRef]
  72. Linka, K.; Schäfer, A.; Meng, X.; Zou, Z.; Karniadakis, G.E.; Kuhl, E. Bayesian Physics Informed Neural Networks for real-world nonlinear dynamical systems. Comput. Methods Appl. Mech. Eng. 2022, 402, 115346. [Google Scholar] [CrossRef]
  73. Mienye, I.D.; Sun, Y. A survey of ensemble learning: Concepts, algorithms, applications, and prospects. IEEE Access 2022, 10, 99129–99149. [Google Scholar] [CrossRef]
  74. Sahin, E.K. Assessing the predictive capability of ensemble tree methods for landslide susceptibility mapping using XGBoost, gradient boosting machine, and random forest. SN Appl. Sci. 2020, 2, 1308. [Google Scholar] [CrossRef]
  75. Bentéjac, C.; Csörgo, A.; Martínez-Muñoz, G. A comparative analysis of gradient boosting algorithms. Artif. Intell. Rev. 2020, 54, 1937–1967. [Google Scholar] [CrossRef]
  76. Pop, C.B.; Chifu, V.R.; Cordea, C.; Chifu, E.S.; Barsan, O. Forecasting the Short-Term Energy Consumption Using Random Forests and Gradient Boosting. In Proceedings of the 2021 20th RoEduNet Conference: Networking in Education and Research (RoEduNet), Iasi, Romania, 4–6 November 2021; pp. 1–6. [Google Scholar]
  77. Jafarzadeh, H.; Mahdianpari, M.; Gill, E.; Mohammadimanesh, F.; Homayouni, S. Bagging and boosting ensemble classifiers for classification of multispectral, hyperspectral and PolSAR data: A comparative evaluation. Remote Sens. 2021, 13, 4405. [Google Scholar] [CrossRef]
  78. Saifan, R.; Sharif, K.; Abu-Ghazaleh, M.; Abdel-Majeed, M. Investigating algorithmic stock market trading using ensemble machine learning methods. Informatica 2020, 44, 311–325. [Google Scholar] [CrossRef]
  79. Gabidolla, M.; Carreira-Perpiñán, M.Á. Pushing the envelope of gradient boosting forests via globally-optimized oblique trees. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 285–294. [Google Scholar]
  80. Pahno, S.; Yang, J.J.; Kim, S.S. Use of machine learning algorithms to predict subgrade resilient modulus. Infrastructures 2021, 6, 78. [Google Scholar] [CrossRef]
  81. Malek, N.H.A.; Yaacob, W.F.W.; Wah, Y.B.; Nasir, S.A.M.; Shaadan, N.; Indratno, S.W. Comparison of ensemble hybrid sampling with bagging and boosting machine learning approach for imbalanced data. Indones. J. Elec. Eng. Comput. Sci. 2023, 29, 598–608. [Google Scholar] [CrossRef]
  82. Xie, Y.; Peng, M. Forest fire forecasting using ensemble learning approaches. Neural Comput. Appl. 2019, 31, 4541–4550. [Google Scholar] [CrossRef]
  83. Yadav, D.C.; Pal, S. Analysis of heart disease using parallel and sequential ensemble methods with feature selection techniques: Heart disease prediction. Int. J. Big Data Anal. Healthc. (IJBDAH) 2021, 6, 40–56. [Google Scholar] [CrossRef]
  84. González, S.; García, S.; Del Ser, J.; Rokach, L.; Herrera, F. A practical tutorial on bagging and boosting based ensembles for machine learning: Algorithms, software tools, performance study, practical perspectives and opportunities. Inf. Fusion 2020, 64, 205–237. [Google Scholar] [CrossRef]
  85. Raj, V.; Dotse, S.Q.; Sathyajith, M.; Petra, M.I.; Yassin, H. Ensemble machine learning for predicting the power output from different solar photovoltaic systems. Energies 2023, 16, 671. [Google Scholar] [CrossRef]
  86. Noviandy, T.R.; Maulana, A.; Idroes, G.M.; Emran, T.B.; Tallei, T.E.; Helwani, Z.; Idroes, R. Ensemble machine learning approach for quantitative structure-activity relationship based drug discovery: A Review. Infolitika J. Data Sci. 2023, 1, 32–41. [Google Scholar] [CrossRef]
  87. Galicia, A.; Talavera-Llames, R.; Troncoso, A.; Koprinska, I.; Martínez-Álvarez, F. Multi-step forecasting for big data time series based on ensemble learning. Knowl.-Based Syst. 2019, 163, 830–841. [Google Scholar] [CrossRef]
  88. Bologna, G. A rule extraction technique applied to ensembles of neural networks, random forests, and gradient-boosted trees. Algorithms 2021, 14, 339. [Google Scholar] [CrossRef]
  89. Thabtah, F.; Hammoud, S.; Kamalov, F.; Gonsalves, A. Data imbalance in classification: Experimental evaluation. Inf. Sci. 2020, 513, 429–441. [Google Scholar] [CrossRef]
  90. Takase, T.; Oyama, S.; Kurihara, M. Evaluation of stratified validation in neural network training with imbalanced data. In Proceedings of the 2019 IEEE International Conference on Big Data and Smart Computing (BigComp), Kyoto, Japan, 27 February–2 March 2019; pp. 1–4. [Google Scholar]
  91. Liu, B.; Zhang, H.; Yang, L.; Dong, L.; Shen, H.; Song, K. An experimental evaluation of imbalanced learning and time-series validation in the context of CI/CD prediction. In Proceedings of the 24th International Conference on Evaluation and Assessment in Software Engineering, Trondheim, Norway, 15–17 April 2020; pp. 21–30. [Google Scholar]
  92. Zheng, M.; Wang, F.; Hu, X.; Miao, Y.; Cao, H.; Tang, M. A method for analyzing the performance impact of imbalanced binary data on machine learning models. Axioms 2022, 11, 607. [Google Scholar] [CrossRef]
  93. Gan, Y.; Dai, Z.; Wu, L.; Liu, W.; Chen, L. Deep Reinforcement Learning and Dempster-Shafer Theory: A Unified Approach to Imbalanced Classification. In Proceedings of the 2023 3rd International Conference on Computer Science, Electronic Information Engineering and Intelligent Control Technology (CEI), Wuhan, China, 15–17 December 2023; pp. 67–72. [Google Scholar]
  94. Zhao, Z.; Liang, J.; Wang, W.; Tang, J.; Fu, X.; Yan, Y. Fusion Model Classification Algorithm for Imbalanced Data. Solid State Technol. 2020, 63, 1663–1673. [Google Scholar]
  95. Sadouk, L.; Gadi, T.; Essoufi, E.H. A novel cost-sensitive algorithm and new evaluation strategies for regression in imbalanced domains. Expert Syst. 2021, 38, e12680. [Google Scholar] [CrossRef]
  96. Tanov, V.; Ivanov, I. Data-centric optimization method to imbalanced datasets. In Proceedings of the International Conference on Mathematical and Statistical Physics, Computational Science, Education, and Communication (ICMSCE 2022), Istanbul, Turkey, 8–9 December 2023; SPIE: Bellingham, WA, USA, 2023; Volume 12616, p. 1261602. [Google Scholar]
  97. Rezvani, S.; Wang, X. Class imbalance learning using fuzzy ART and intuitionistic fuzzy twin support vector machines. Inf. Sci. 2021, 578, 659–682. [Google Scholar] [CrossRef]
  98. Mienye, I.D.; Sun, Y. Performance analysis of cost-sensitive learning methods with application to imbalanced medical data. Inform. Med. Unlocked 2021, 25, 100690. [Google Scholar] [CrossRef]
  99. Thölke, P.; Mantilla-Ramos, Y.-J.; Abdelhedi, H.; Maschke, C.; Dehgan, A.; Harel, Y.; Kemtur, A.; Berrada, L.M.; Sahraoui, M.; Young, T.; et al. Class imbalance should not throw you off balance: Choosing the right classifiers and performance metrics for brain decoding with imbalanced data. NeuroImage 2023, 277, 120253. [Google Scholar] [CrossRef] [PubMed]
  100. Hussein, A.S.; Li, T.; Yohannese, C.W.; Bashir, K. A-SMOTE: A new preprocessing approach for highly imbalanced datasets by improving SMOTE. Int. J. Comput. Intell. Syst. 2019, 12, 1412–1422. [Google Scholar] [CrossRef]
  101. Thumpati, A.; Zhang, Y. Towards Optimizing Performance of Machine Learning Algorithms on Unbalanced Dataset. In Proceedings of the Artificial Intelligence Applications, Vienna, Austria, 28–29 October 2023; pp. 169–183. [Google Scholar] [CrossRef]
  102. Fan, Z.; Qian, J.; Sun, B.; Wu, D.; Xu, Y.; Tao, Z. Modeling voice pathology detection using imbalanced learning. In Proceedings of the 2020 International Conference on Sensing, Measurement & Data Analytics in the era of Artificial Intelligence (ICSMD), Xi’an, China, 15–17 October 2020; pp. 330–334. [Google Scholar]
  103. Chicco, D.; Warrens, M.J.; Jurman, G. The coefficient of determination R-squared is more informative than SMAPE, MAE, MAPE, MSE and RMSE in regression analysis evaluation. PeerJ Comput. Sci. 2021, 7, e623. [Google Scholar] [CrossRef] [PubMed]
  104. Hodson, T.O.; Over, T.M.; Foks, S.S. Mean squared error, deconstructed. J. Adv. Model. Earth Syst. 2021, 13, e2021MS002681. [Google Scholar] [CrossRef]
  105. Silva, A.; Ribeiro, R.P.; Moniz, N. Model optimization in imbalanced regression. In Proceedings of the International Conference on Discovery Science, Montpellier, France, 10 October 2022; Springer: Cham, Switzerland, 2022; pp. 3–21. [Google Scholar]
  106. Rahman, H.A.A.; Wah, Y.B.; Huat, O.S. Predictive Performance of Logistic Regression for Imbalanced Data with Categorical Covariate. Pertanika J. Sci. Technol. 2021, 29, 181–197. [Google Scholar] [CrossRef]
  107. Ren, J.; Zhang, M.; Yu, C.; Liu, Z. Balanced MSE for imbalanced visual regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 7926–7935. [Google Scholar]
  108. Laxmi Sree, B.R.; Vijaya, M.S. A Weighted Mean Square Error Technique to Train Deep Belief Networks for Imbalanced Data. Int. J. Simul. Syst. Sci. Technol. 2018. [Google Scholar] [CrossRef]
  109. Branco, P.; Torgo, L.; Ribeiro, R.P. SMOGN: A preprocessing approach for imbalanced regression. In First International Workshop on Learning with Imbalanced Domains: Theory and Applications; PMLR: New York, NY, USA, 2017; pp. 36–50. [Google Scholar]
  110. Kou, Y.; Fu, G.H. ASER: Adapted squared error relevance for rare cases prediction in imbalanced regression. J. Chemom. 2023, 37, e3515. [Google Scholar] [CrossRef]
  111. Ge, J.; Chen, H.; Zhang, D.; Hou, X.; Yuan, L. Active learning for imbalanced ordinal regression. IEEE Access 2020, 8, 180608–180617. [Google Scholar] [CrossRef]
  112. Annur Sinaga, B.; Vionanda, D.; Permana, D.; Salma, A. Comparison of error rate prediction methods in binary logistic regression modeling for imbalanced data. UNP J. Stat. Data Sci. 2023, 1, 361–368. [Google Scholar] [CrossRef]
  113. Gadekar, B.; Hiwarkar, T. A Critical Evaluation of Business Improvement through Machine Learning: Challenges, Opportunities, and Best Practices. Int. J. Recent Innov. Trends Comput. Commun. 2023, 11, 264–276. [Google Scholar] [CrossRef]
  114. Whang, S.E.; Lee, J.G. Data collection and quality challenges for deep learning. Proc. VLDB Endow. 2020, 13, 3429–3432. [Google Scholar] [CrossRef]
  115. Soni, A.; Arora, C.; Kaushik, R.; Upadhyay, V. Evaluating the Impact of Data Quality on Machine Learning Model Performance. J. Nonlinear Anal. Optim. 2023, 14, 13–18. [Google Scholar] [CrossRef]
  116. Whang, S.E.; Roh, Y.; Song, H.; Lee, J.G. Data collection and quality challenges in deep learning: A data-centric ai perspective. VLDB J. 2023, 32, 791–813. [Google Scholar] [CrossRef]
  117. Toms, A.; Whitworth, S. Ethical Considerations in the Use of Machine Learning for Research and Statistics. Int. J. Popul. Data Sci. 2022, 7. [Google Scholar] [CrossRef]
  118. Ximenes, B.H.; Ramalho, G.L. Concrete ethical guidelines and best practices in machine learning development. In Proceedings of the 2021 IEEE International Symposium on Technology and Society (ISTAS), Waterloo, ON, Canada, 28–31 October 2021; pp. 1–8. [Google Scholar]
  119. Ratul, Q.E.A.; Serra, E.; Cuzzocrea, A. Evaluating attribution methods in machine learning interpretability. In Proceedings of the 2021 IEEE International Conference on Big Data (Big Data), Orlando, FL, USA, 15–18 December 2021; pp. 5239–5245. [Google Scholar]
  120. Rodríguez-Pérez, R.; Bajorath, J. Interpretation of compound activity predictions from complex machine learning models using local approximations and shapley values. J. Med. Chem. 2019, 63, 8761–8777. [Google Scholar] [CrossRef] [PubMed]
  121. Man, X.; Chan, E. The best way to select features? Comparing MDA, LIME, and SHAP. J. Financ. Data Sci. 2021, 3, 127–139. [Google Scholar] [CrossRef]
  122. Jalali, A.; Schindler, A.; Haslhofer, B.; Rauber, A. Machine Learning Interpretability Techniques for Outage Prediction: A Comparative Study. PHM Soc. Eur. Conf. 2020, 5, 10. [Google Scholar] [CrossRef]
  123. Fang, J.P.; Zhou, J.; Cui, Q.; Tang, C.Z.; Li, L.F. Interpreting model predictions with constrained perturbation and counterfactual instances. Int. J. Pattern Recognit. Artif. Intell. 2022, 36, 2251001. [Google Scholar] [CrossRef]
  124. Rashi, A.; Madamala, R. Minimum Relevant Features to Obtain AI Explainable System for Predicting Breast Cancer in WDBC. Int. J. Health Sci. 2022, 6, 1312–1326. [Google Scholar] [CrossRef]
  125. Kyriazos, T.; Poga, M. Quantum Concepts in Psychology: Exploring the Interplay of Physics and the Human Psyche. Biosystems 2024, 235, 105070. [Google Scholar] [CrossRef]
  126. Kyriazos, T.; Poga, M. Leveraging Network Insights into Positive Emotions and Resilience for Better Life Satisfaction. Open Public Health J. 2024, 17, e18749445338146. [Google Scholar] [CrossRef]
  127. Kyriazos, T.; Poga, M. Life Satisfaction, Anxiety, Stress, Depression, and Resilience: A Multigroup Latent Class Analysis. Trends Psychol. 2024, 1–21. [Google Scholar] [CrossRef]
  128. Kyriazos, T.; Poga, M. Planfulness in Psychological Well-being: Mediating Roles of Self-Efficacy and Presence of Meaning in Life. Appl. Res. Qual. Life 2024, 19, 1927–1950. [Google Scholar] [CrossRef]
Table 1. Key aspects of model evaluation, validation, and handling imbalanced data: a visual summary.
Aspect | Key Points | Challenges Addressed
Model Validation | Ensures generalization to new data. Cross-validation methods: k-fold, stratified k-fold, LOOCV. | Prevents overfitting and underfitting. Enhances reliability of predictions.
Cross-Validation Techniques | k-fold: divides data into subsets for training and testing. Stratified k-fold: maintains class distribution. LOOCV: validates on small datasets. | Balances overfitting and underfitting. Handles sparse and imbalanced data effectively.
Performance Metrics | Classification: accuracy, precision, recall, F1 score, AUC-ROC. Regression: MSE, RMSE, R-squared. | Addresses misleading metrics in imbalanced datasets. Evaluates both false positives and false negatives.
Handling Imbalanced Data | Resampling techniques: oversampling (e.g., SMOTE), undersampling. Cost-sensitive learning. Ensemble methods (e.g., random forests). | Improves minority class predictions. Balances representation across classes.
Ethical Considerations | Mitigates bias in training data. Uses fairness metrics: demographic parity, equal opportunity. Ensures interpretability with SHAP and LIME tools. | Reduces discrimination risk. Enhances transparency and accountability in decision-making.
Transparency Tools | SHAP: assigns contribution scores to features. LIME: provides local model interpretability. | Explains model outputs to stakeholders. Builds trust in predictions and decision-making processes.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
