1. Introduction
Survey research has long been a cornerstone of social science, political science, and policy-making, providing critical insights into public opinion, voting behavior, and social attitudes. However, traditional surveys face increasing challenges: they are costly, time-consuming, and burden respondents with long questionnaires, all while suffering from declining response rates and concerns about data reliability. These limitations motivate the exploration of computational methods that could complement or extend survey research, particularly through the use of artificial intelligence.
Large Language Models (LLMs) represent a promising avenue for this purpose. Trained on vast human-written corpora, LLMs capture not only grammatical structure and world knowledge but also complex patterns of human beliefs, biases, and associations [1,2]. Much recent work focuses on the consequences of these biases and on how to mitigate them [3,4]. Other studies, however, have shown that, when properly conditioned, LLMs can simulate nuanced group-level opinions by leveraging the very biases they encode [5,6]. This has led to the emergence of “silicon sampling” methodologies, in which LLMs are employed as synthetic survey respondents to replicate human-like patterns of political and social beliefs.
Silicon sampling [6] refers to the idea of conditioning language models on externally sourced demographic or attitudinal profiles—such as those from the ANES—in order to simulate responses from synthetic personas that mirror real population distributions. The goal is to correct for the mismatched marginal distributions inherent to LLM training data, allowing researchers to generate opinion estimates for more representative subpopulations. While promising, this approach carries theoretical limitations. First, the technique assumes that the LLM’s internal conditional distributions are sufficiently aligned with those of real humans. Second, it presumes that LLM-generated responses conditioned on these profiles generalize beyond the temporal, cultural, or linguistic context of the model’s training data. These limitations highlight the need for validation studies—like ours—that test LLM behavior at both the aggregate and individual levels, across a variety of background configurations and opinion domains.
Despite this progress, critical questions remain about the granularity at which LLMs can simulate opinions—particularly whether they can accurately replicate individual-level responses, how large these LLMs need to be, and how different types of background information affect their performance.
In this study, we address these questions by proposing a framework for simulating public opinion using LLMs conditioned on structured survey data. Specifically, we use the 2020 American National Election Studies (ANES) dataset to build “backstories” for individual respondents, categorizing their attributes into three distinct groups: Demographic, Attitudinal and Political Orientation, and Moral and Social Values. By selectively providing the LLM with different categories of variables, we investigate how the amount and type of background information impacts its ability to simulate individual survey responses across a range of political and social topics.
Our research seeks to answer the following questions:
How accurately can an LLM (here, Gemma3 12B or Qwen2.5 14B) simulate individual responses to diverse opinion questions based on structured ANES backstory data?
How does the type of information provided (Demographic, Attitudinal, Moral) influence the simulation accuracy? Which combinations of variable types are most effective?
How does the LLM’s performance compare to a standard machine learning classifier (Random Forest) trained on the same data and task?
How does the LLM’s ability to select relevant variables within a pool affect prediction accuracy?
How does simulation accuracy vary across different socio-political topics?
To benchmark the performance, we compare the LLM’s predictions against real survey responses using the F1-score and Cramer’s V for individual-level accuracy and Jensen–Shannon distance (JSD) for distributional similarity. Notably, while the LLM performs comparably to Random Forests on individual-level F1-scores, it outperforms them in capturing the distributional patterns of opinions across most topics. This suggests that even moderately sized LLMs, like Gemma3 12B, can serve as powerful tools for cost-effective and scalable simulations of public opinion at a collective level.
Our main contributions are as follows:
We propose a new framework for conditioning LLMs on structured backstory variables from public survey data to simulate individual and collective opinions.
We systematically analyze the impact of different types and combinations of background information on simulation accuracy.
We directly compare LLM performance against traditional machine learning methods on the same survey prediction tasks.
We demonstrate the viability of LLMs as synthetic populations for simulating public opinion distributions, opening new avenues for low-cost and scalable survey research.
The paper proceeds by first reviewing related work on LLMs and opinion simulation, followed by a detailed description of our methodology and experimental design. We then present each simulation in detail before discussing the main results, limitations, and future directions.
2. Related Work
LLMs have demonstrated remarkable capabilities in generating coherent and contextually appropriate text. These models, trained on vast amounts of data, can simulate human-like responses, making them valuable tools in various applications, including synthetic surveying and social network simulations.
Recent research has increasingly explored the potential of LLMs to simulate aspects of human behavior. Several studies examine how LLMs reflect human cognitive biases, replicate patterns of human error in reasoning, or capture nuanced differences across demographic groups [7,8,9]. While much of this work focuses on evaluating LLMs’ ability to mimic human responses at an aggregate or cognitive task level, fewer studies have addressed the challenge of simulating individual-level opinion patterns based on structured background information. This section reviews some works that are closely related to our own, examining their behavior and limitations. We organize these works into three categories—Persona Simulation, LLMs in Social Science, and Virtual Surveys—while acknowledging the significant overlap between them.
2.1. Persona Simulation
Recent work has explored using LLMs to create virtual agents or characters with coherent personalities and behaviors. For example, ref. [10] introduced Generative Agents: language-model-driven software agents with memory, planning, and reflection components that wake up, cook breakfast, head to work, notice news, form opinions, and coordinate events in a simulated town. In evaluations, these agents exhibited believable individual and emergent social behaviors (e.g., planning a community party via invitation chains), demonstrating the potential of LLMs to simulate not only isolated responses but rich social interactions.
Other work has focused on endowing LLMs with more detailed character traits. In [11] the authors trained a “CharacterBot” on the collected writings of Lu Xun (a Chinese author) to capture his worldview and ideology. Their multi-task fine-tuning yields outputs that better preserve linguistic style and ideological perspective than simpler methods. Similarly, ref. [12] construct the LifeChoice benchmark from novels to test whether LLMs can predict characters’ decisions; they find that state-of-the-art LLMs achieve promising accuracy, yet substantial room for improvement remains.
In human–computer interaction, ref. [13] built PersonaFlow, an LLM-based system that simulates multiple expert personas (e.g., chemist, political scientist) during a brainstorming session. They report that interacting with several simulated experts improves the relevance and creativity of ideas, though it raises concerns about over-reliance and biases. Across these studies, a common limitation is that LLM-generated personas can oversimplify or stereotype real human variability. In [14] the authors noted that LLM-based demographic simulations often produce “caricatures”—excessively stereotyped or homogenized personas—and they propose metrics to quantify the lack of individuation or the exaggeration in model outputs. In general, current persona models require extensive curated data or prompt engineering and may fail to generalize beyond the studied characters.
2.2. LLMs in Social Science Experiments
In [15] the authors introduce the concept of Turing Experiments (TEs), a novel method to evaluate the zero-shot simulation capabilities of LLMs. In TEs, models are prompted to generate text transcripts simulating human responses across various experimental settings, offering insight into which human behaviors LLMs can replicate. The authors conducted four TEs across distinct domains: behavioral economics, psycholinguistics, social psychology, and collective intelligence. Their results show that LLMs can capture meaningful human-like behavioral patterns, often with higher fidelity in larger models. However, they also highlight important challenges, including potential contamination from training data, the risk of prompt-induced biases, and ethical concerns in simulating sensitive aspects of human behavior. Overall, TEs provide a valuable tool for exploring how LLMs internalize and reflect complex social and cognitive behaviors, while underscoring the need for careful experimental design and interpretation.
Other work has adapted LLMs to field settings: in [16] the authors created an automated pipeline to extract information from 319 published field experiments (economics/social science) and prompt GPT-4 to choose outcome predictions; they achieve roughly 78% accuracy overall, though performance is highly uneven (e.g., worse for studies on gender or non-Western contexts). Finally, ref. [17] introduce SocioVerse, a synthetic world built from 10 million “user profiles” and LLM-driven agents. In the domains of politics, news, and economics, SocioVerse’s agents collectively reproduce population dynamics (e.g., election outcomes, public opinion shifts) while preserving diversity of attitudes.
These modeling efforts suggest that LLMs offer a promising new tool for social science: they can accelerate hypothesis testing and policy exploration without real subjects (reducing cost and ethical constraints) [18]. However, existing systems face clear limitations. LLMs ultimately rely on patterns in their training data, so their “inferences” may reflect historical and linguistic biases. They tend to underperform in out-of-distribution or minority scenarios [19] and can produce overly confident or uniform answers compared to diverse human behavior [15]. Experts therefore caution that LLM-based simulations should be used cautiously and primarily for exploratory research, with repeated runs and careful variability analysis.
2.3. Virtual Surveys and Synthetic Respondents
Another emerging line of work treats LLMs as virtual survey participants, generating synthetic responses under given demographic profiles, which closely relates to our own work.
In [5] the authors introduce a framework for analyzing whose opinions LLMs reflect when responding to subjective questions across various topics, ranging from religion to science. The authors create OpinionQA, a dataset based on high-quality public opinion polls that allows for evaluating how LLM opinions align with those of 60 U.S. demographic groups. In their primary analysis, the authors directly asked the LLMs the survey questions without any demographic-specific prompting or roleplaying instructions. This revealed a substantial misalignment between views expressed by current LLMs and those of U.S. demographic groups. In their “steerability” experiments, most models do tend to become better-aligned with a specific group when prompted to behave like it; however, these improvements are modest. Here, we should note that the authors focused on demographic traits to condition the models. In contrast, our work goes further by incorporating attitudinal and moral dimensions into the respondent profiles. As we will demonstrate, this richer modeling helps explain why prior work observed only limited steerability: demographic factors alone are often insufficient to capture the full complexity of human opinions.
In [6] the authors introduce the concept of silicon sampling, proposing that LLMs like GPT-3 can serve as effective surrogates for human respondents in social science research. Their key idea is that “algorithmic bias” in LLMs is not a flaw but reflects fine-grained, demographically correlated patterns that can be systematically leveraged. In their study, the authors condition GPT-3 on thousands of real socio-demographic backstories drawn from large U.S. surveys, generating “silicon individuals” to simulate tasks such as vote choice prediction and answering closed-ended survey questions. While their work provides strong evidence that GPT-3 can simulate aggregate human responses when conditioned on demographic variables, it primarily focuses on traits such as race, gender, age, and political affiliation.
In [20] the authors probed ChatGPT-4 by assigning it demographic profiles from the World Values Survey and ANES. GPT-4 reproduces many cultural differences and “in-sample” response patterns across U.S. and Chinese subgroups, and even produced correct forecasts for the 2024 U.S. presidential election. However, it shows limitations on value-sensitive questions and an overall tendency to homogenize responses: subtle prompt phrasing or demographic details can substantially affect answers. In [21] the authors conducted a multi-country test by prompting GPT variants, Llama2, and other models to answer items from the European Social Survey on politics and democracy. They found that with a few relevant examples (“few-shot prompts”) LLMs generate highly realistic answers, but in zero-shot mode (no examples) performance degrades markedly. Collectively, these studies find that LLMs can simulate group-level opinion distributions, but major challenges and questions remain. In the next sections, we describe our approach to addressing the questions posed in the Introduction, developing a persona construction strategy that moves beyond demographic attributes to incorporate attitudinal and social values.
3. Materials and Methods
The ANES (American National Election Studies) is a respected research organization founded in 1948 that conducts comprehensive pre- and post-election surveys in the United States. These surveys collect extensive data on voter behavior, political attitudes, and public opinion, creating valuable datasets for social science researchers. In the next section we describe in detail the subset of the ANES 2020 dataset that we used in our work.
3.1. ANES 2020 Dataset
The ANES 2020 dataset comprises interviews conducted with respondents between 18 August 2020 and 3 November 2020 and is a primary data source for understanding American public opinion. It includes demographic, social, and behavioral information, including responses on presidential vote choice and political attitudes. The dataset also features re-interviews with 2016 ANES respondents and contains samples from both the 2020 pre-election and post-election periods.
In our work we focus exclusively on respondents from the 2020 pre-election sample, excluding the ANES 2016 respondents as well as post-election samples. We then select two groups of variables: the first we will denote by backstory variables, and the second simply by topics. The backstory variables are the set of 25 variables that can be used to compose the “persona profile” of our virtual person. The topics are a set of 10 questions that this virtual person will answer when roleplayed by the LLM. We also group the backstory variables into three distinct groups: Demographic (D), Attitudinal and Political Orientation (A), and Moral and Social Values (M). For brevity, we often refer to the last two groups simply as “Attitudinal” and “Moral” variables, respectively.
Table 1, Table 2 and Table 3 present the response options available to survey participants. The interpretation of the Demographic and Attitudinal variables is self-explanatory. Among the Social Values variables, the “Preferred Child Trait” item asks respondents to select the child characteristic they value most from the provided options. The “Birthright Citizenship End” variable measures how strongly respondents favor or oppose ending automatic citizenship for children of unauthorized immigrants. Variables prefixed with “Discrimination” gauge respondents’ perceptions regarding the prevalence of specific types of discrimination at the time of the survey. The “Historic Racism Impact” item assesses respondents’ agreement with the statement that generations of slavery and discrimination have created conditions that make life more difficult for Black Americans.
Table 4 shows the topics we are interested in.
For the topics Climate Change and Current Economy (marked with “*”), the original survey presented 5 choices. We remapped them in the following way: for Climate Change, options “A little” and “A moderate amount” were collapsed into “A little”, and options “A lot” and “A great deal” were collapsed into “A lot”. For Current Economy, options “Very Good” and “Good” were collapsed into “Good”, and options “Bad” and “Very Bad” were collapsed into “Bad”. This improves interpretability and ensures sufficient sample sizes across response categories while retaining the substantive distinctions between low and high concern or between positive and negative evaluations [22]. All other topics are presented exactly as they appear in the survey.
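For illustration, the following is a minimal sketch of this remapping using pandas. The column names and the uncollapsed response labels shown here are assumptions made for readability, not the exact ANES variable codes.

```python
import pandas as pd

# Illustrative column names and labels; the actual ANES codes and labels differ.
CLIMATE_MAP = {
    "Not at all": "Not at all",
    "A little": "A little",
    "A moderate amount": "A little",   # collapsed into "A little"
    "A lot": "A lot",
    "A great deal": "A lot",           # collapsed into "A lot"
}
ECONOMY_MAP = {
    "Very Good": "Good",               # collapsed into "Good"
    "Good": "Good",
    "Neither good nor bad": "Neither good nor bad",
    "Bad": "Bad",
    "Very Bad": "Bad",                 # collapsed into "Bad"
}

def remap_topics(df: pd.DataFrame) -> pd.DataFrame:
    """Collapse the two 5-point topic scales into three categories each."""
    out = df.copy()
    out["climate_change"] = out["climate_change"].map(CLIMATE_MAP)
    out["current_economy"] = out["current_economy"].map(ECONOMY_MAP)
    return out
```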
3.2. Data Preparation and Model Selection
Each topic in our study is treated as an independent modeling task. For a given experiment, the input dataset consists of a specific set of backstory variables along with the specific topic variable as the prediction target. Each row of the dataset corresponds to a single survey respondent, fully characterized by their backstory and topic-specific response.
Since we aim to compare the performance of LLMs against a traditional Random Forest (RF) classifier, it is essential that both models are evaluated on the same validation sets. To ensure fairness and maximize robustness, we adopt a 3-fold cross-validation strategy: in each fold, approximately 33% of the data is held out for validation, with the remaining 67% used for training (only for the RF). This procedure guarantees that all data points are used for validation exactly once, reducing variance and minimizing potential biases. All reported performance metrics are computed as the simple average across the three validation folds.
Figure 1 illustrates this schema.
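As a sketch of this protocol, the snippet below sets up the 3-fold cross-validation in scikit-learn so that the RF trains on roughly 67% of the rows per fold while both models are scored on the same held-out 33%. The integer-encoded arrays X and y, the llm_predict callable standing in for the prompted LLM, and the F1 averaging mode are our assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import KFold

def cross_validate_models(X, y, llm_predict, seed=42):
    """3-fold CV: the RF uses the training split; the LLM only sees validation rows."""
    kf = KFold(n_splits=3, shuffle=True, random_state=seed)
    rf_f1, llm_f1 = [], []
    for train_idx, val_idx in kf.split(X):
        rf = RandomForestClassifier(random_state=seed)  # parameters held fixed across experiments
        rf.fit(X[train_idx], y[train_idx])
        rf_f1.append(f1_score(y[val_idx], rf.predict(X[val_idx]), average="weighted"))
        # The LLM receives no training data; it answers for the validation rows at inference time.
        llm_f1.append(f1_score(y[val_idx], llm_predict(X[val_idx]), average="weighted"))
    # Reported metrics are the simple average across the three validation folds.
    return float(np.mean(rf_f1)), float(np.mean(llm_f1))
```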
In the early stages of this work, we conducted preliminary comparisons between Random Forests, Linear Classifiers, and Gradient Boosting Machines (all from scikit-learn). Among these, RF consistently outperformed the linear models and outperformed the gradient boosting model on most topics. Although gradient boosting can be highly effective when carefully tuned, we found its performance to be less stable across topics under a fixed parameter setup. Given our goal of comparing an LLM to a robust, interpretable, and fair out-of-the-box benchmark, Random Forests were a natural choice. They are well-suited to datasets with predominantly ordinal and categorical variables, require minimal preprocessing, and are widely used in computational social science. Their robustness to overfitting and strong default performance made them ideal for ensuring consistency and comparability across experiments.
We evaluated two Large Language Models: Gemma3 12B [23] and Qwen2.5 14B [24]. Both are high-performing, state-of-the-art models among smaller-scale LLMs, making them well-suited for this research. While the main body of our article focuses on results obtained using Gemma3—our primary LLM throughout the study—we also include a comparison with Qwen2.5 to explore the external validity of our findings across models with different alignment and training origins. Qwen2.5 is particularly relevant as it reflects a non-Western alignment paradigm, offering insight into how cultural and normative differences may influence LLM behavior in social science simulations. For clarity and conciseness, we present the key summary tables for Qwen2.5’s performance in Experiments 2 and 3 in Appendix B. A full duplication of all tables and heatmaps for this model was omitted, as it would be largely redundant and would not alter the study’s core conclusions. For Qwen2.5’s complete results, we refer the reader to our code repository.
Finally, it is important to highlight again that no fine-tuning or training is performed on the LLMs; they use only the backstory information at inference time. Thus, the training data is utilized exclusively by the RF model. In addition, respondents with missing data in any relevant field were excluded from the analysis to maintain consistency across models.
3.3. Evaluation Metrics and Statistical Analysis
To quantify the statistical uncertainty of our reported performance metrics (F1-score, JSD, Cramér’s V) and the differences between models or experimental conditions, we employed a non-parametric bootstrap procedure. For each primary metric averaged across the 3 cross-validation folds, B = 1000 bootstrap iterations were performed. In each iteration, (prediction, true label) pairs were resampled with replacement from each fold’s original validation set (maintaining original fold sample sizes). The metric was calculated for these resampled data within each fold, and these three fold-level metrics were then averaged to produce one bootstrap estimate of the mean cross-validated metric. This process yielded a distribution of 1000 bootstrap estimates for each reported mean metric or mean difference. From these distributions, 95% confidence intervals (CIs) were derived using the percentile method (i.e., the 2.5th and 97.5th percentiles). Differences or gains were considered statistically significant at the p < 0.05 level if their corresponding 95% CI did not include zero. In the main text tables we mark the significant values with an asterisk (*). The full CI data can be found in Appendix B.
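A minimal sketch of how the distributional metric, the association metric, and the percentile bootstrap could be computed for one fold’s predictions is shown below; the handling of empty categories and other implementation details are our assumptions rather than the paper’s exact code.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import chi2_contingency

def jsd(y_true, y_pred, n_classes):
    """Jensen-Shannon distance between observed and simulated answer distributions."""
    p = np.bincount(y_true, minlength=n_classes) / len(y_true)
    q = np.bincount(y_pred, minlength=n_classes) / len(y_pred)
    return jensenshannon(p, q, base=2)

def cramers_v(y_true, y_pred):
    """Cramér's V association between true and predicted labels."""
    table = np.zeros((max(y_true) + 1, max(y_pred) + 1))
    for t, p in zip(y_true, y_pred):
        table[t, p] += 1
    table = table[table.sum(axis=1) > 0][:, table.sum(axis=0) > 0]  # drop empty rows/columns
    if min(table.shape) < 2:
        return 0.0
    chi2 = chi2_contingency(table, correction=False)[0]
    return float(np.sqrt(chi2 / (table.sum() * (min(table.shape) - 1))))

def bootstrap_ci(y_true, y_pred, metric, B=1000, seed=0):
    """Percentile 95% CI from B resamples of (prediction, true label) pairs."""
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    stats = []
    for _ in range(B):
        idx = rng.integers(0, len(y_true), len(y_true))  # resample with replacement
        stats.append(metric(y_true[idx], y_pred[idx]))
    return np.percentile(stats, [2.5, 97.5])

# Example (hypothetical arrays): lo, hi = bootstrap_ci(y_val, y_sim, lambda t, p: jsd(t, p, n_classes=4))
```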
4. Results
4.1. Experimental Sequence
In the next sections we conduct four sequential experiments to evaluate and compare model performance. We begin by establishing an LLM baseline against naive models, using limited profile information (Exp 1). We then examine how access to more comprehensive variable pools affects model behavior (Exp 2). In the third experiment, we evaluate the model’s capacity for autonomous feature selection (Exp 3). Finally, we present a head-to-head comparison under identical conditions to isolate each model’s distinct capabilities (Exp 4).
Figure 2 illustrates the sequence.
4.2. Experiment 1—Random and Constant Model Baselines
Before evaluating model accuracy or variable grouping effects, we establish a performance baseline using two naive models: a Random Model and a Constant Model. These serve as essential lower bounds for interpreting the behavior of more advanced models. All experimental results presented in the main body of the article refer to Google’s Gemma3 12B model, which, for brevity, we refer to throughout as “the LLM.” Results for Qwen2.5 are included in Appendix B for comparison.
The Random Model simulates a respondent who selects answers uniformly at random, representing a performance floor with no reliance on input features. The Constant Model always predicts the most frequent class from the training data, reflecting population-level bias but ignoring respondent variation. This is especially relevant in imbalanced datasets, where majority-class predictions can appear deceptively strong.
These baselines help determine whether the LLM captures meaningful patterns or merely mimics simple heuristics. To keep the benchmark experiment as simple and interpretable as possible, we constrain the LLM to select only one backstory variable from each of the three groups. For each topic, the model selects the most predictive variable per group using its internal knowledge. These are then used to construct the virtual respondent’s prompt. This setup provides a minimal yet structured input, enabling consistent evaluation across topics. Prompt templates for this experiment are included in Appendix A. We also note that for all LLM experiments, the temperature parameter was held fixed at 0.3. This low value was chosen to reduce randomness and encourage the model to produce the most probable, deterministic response based on the provided persona. Previous work such as [6] found results to be stable across temperature settings, but since our setup differs, we plan to study temperature effects more thoroughly in future work.
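To make the setup concrete, below is a hypothetical sketch of how a persona prompt could be assembled from the selected backstory variables. The actual templates are given in Appendix A; the variable names, question wording, and the llm_generate call are illustrative assumptions, not our exact implementation.

```python
def build_persona_prompt(backstory: dict, question: str, options: list) -> str:
    """Turn selected backstory variables into a roleplay prompt for the virtual respondent."""
    profile = "\n".join(f"- {name}: {value}" for name, value in backstory.items())
    choices = "\n".join(f"{i + 1}. {opt}" for i, opt in enumerate(options))
    return (
        "You are roleplaying a survey respondent with the following profile:\n"
        f"{profile}\n\n"
        f"Question: {question}\n"
        f"Answer with exactly one of the following options:\n{choices}\n"
    )

# Illustrative usage; `llm_generate` stands in for whichever inference backend is used,
# with the temperature fixed at 0.3 for all experiments.
prompt = build_persona_prompt(
    {"Age group": "45-54",
     "Party identification": "Independent",
     "Preferred Child Trait": "Curiosity"},
    "How do you evaluate the current state of the economy?",
    ["Good", "Neither good nor bad", "Bad"],
)
# answer = llm_generate(prompt, temperature=0.3)
```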
As shown in Table 5, the LLM consistently and substantially outperforms both the Random and Constant baseline models across the 10 survey topics and the three evaluation metrics (F1-score, Jensen–Shannon Distance (JSD), and Cramér’s V), with only the two exceptions discussed below. For instance, on the “Climate Change” topic, the LLM achieved an F1-score of 0.59, compared to 0.36 for the Random model and 0.41 for the Constant model. Similarly, its JSD of 0.21 was markedly better (lower) than the Random model’s 0.29 and the Constant model’s 0.51, indicating superior distributional alignment. The Cramér’s V also showed a clear advantage for the LLM (e.g., 0.41 for “Climate Change” vs. 0.04 for Random; the Constant model has no basis for a Cramér’s V calculation as it does not use input features). The 95% confidence intervals for the Experiment 1 results can be found in Appendix B.
This pattern of LLM superiority over the naive baselines holds across topics. For example, for “Gay Marriage”, the LLM’s F1-score (0.68) and JSD (0.10) far exceeded those of the Random (F1: 0.38, JSD: 0.33) and Constant (F1: 0.59, JSD: 0.40) models. However, we point out two statistically significant exceptions. The first is the “Drug Addiction” topic, where the Constant model’s F1-score was slightly better (+0.02) due to class imbalance; here the LLM nevertheless achieved a significantly better JSD (0.20 vs. 0.41), showcasing its ability to capture more than just the majority class. The second is the “Refugee Allowing” topic, where the JSD of the Random model was 0.15 better. Here, the LLM was better on both F1-score and Cramér’s V, suggesting that the variable selection made by the LLM for this specific topic was suboptimal. This is confirmed in the subsequent experiments, where the JSD for “Refugee Allowing” becomes significantly better than that achieved by the Random model.
These results affirm that even with a highly constrained set of input variables, the LLM captures meaningful predictive patterns from the backstory information, performing well above chance and simple heuristics. This validates its potential for opinion simulation and sets the stage for exploring its capabilities with more comprehensive informational inputs in subsequent experiments.
4.3. Experiment 2—Impact of Backstory Variable Categories on Simulation Accuracy
In Experiment 1 we established a performance baseline in which the LLM was constrained to a minimal set of information: one self-selected predictor variable from each of the three primary backstory categories, totaling at most three variables. Building upon this, in this section we investigate the potential performance gains, and the differential impact, achieved when the LLM has access to all variables in each of seven distinct “variable pools” formed by combining these categories.
The 25 backstory variables are grouped into three categories (Table 1, Table 2 and Table 3), resulting in seven combinations or “variable pools”. For each topic + variable pool, the LLM is explicitly instructed to use all variables from the pool to make predictions. The prompt templates can be found in Appendix A.
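The seven pools are simply the non-empty combinations of the three categories. A short sketch of this construction is shown below; the variable names inside each category are placeholders, as the real lists hold the 25 ANES backstory variables.

```python
from itertools import combinations

# Placeholder variable names per category; the real lists hold the 25 ANES backstory variables.
CATEGORIES = {
    "D": ["age_group", "education", "has_health_insurance"],
    "A": ["party_identification", "political_ideology"],
    "M": ["preferred_child_trait", "historic_racism_impact"],
}

# The seven variable pools: D, A, M, D+A, D+M, A+M, D+A+M.
POOLS = {
    "+".join(keys): [v for k in keys for v in CATEGORIES[k]]
    for r in (1, 2, 3)
    for keys in combinations(CATEGORIES, r)
}
```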
We also trained and evaluated RF classifiers on the exact same data. A separate RF classifier is trained for each topic + variable pool combination, using only the backstory variables from the given pool as input features and the corresponding topic survey response as the target. Both inputs and targets are encoded as integers. The training parameters for the RF classifiers are held constant across all experiments and are detailed in Appendix B.
In Table 6, Table 7, Table 8 and Table 9 we summarize the best-performing models across variable pools for each topic.
Table 6 and Table 7 detail the peak F1-score, JSD, and Cramér’s V achieved by the Gemma3 12B LLM and the RF model, respectively, along with the specific variable pool yielding this optimal performance for each topic and metric. A noteworthy initial observation is that these optimal variable pools frequently differed between the LLM and RF for identical tasks, suggesting distinct information utilization strategies by each model. Table 8 quantifies the LLM’s performance gains in this experiment relative to the more constrained Experiment 1 baseline, while Table 9 directly compares the peak LLM performance against the peak RF performance. Pool names are abbreviated for space.
Analysis of the LLM’s optimal configurations (Table 6) reveals that variable pools incorporating Attitudinal (A) or Moral (M) attributes, or their combination (A+M), frequently yielded the best results across most topics and metrics. For instance, on the “Climate Change” topic, the A+M pool provided the LLM’s highest F1-score (0.70) and best JSD (0.15). As detailed in Table 8, these scores represented substantial F1 and JSD gains of 0.11 and 0.07, respectively, over the Experiment 1 baseline, indicating that richer psychological information significantly enhanced simulation accuracy for this topic. Conversely, for “Gay Marriage”, while the A pool yielded the LLM’s best F1-score (0.66), this was a slight decrease from the Experiment 1 baseline, and the JSD also showed a decline, suggesting the simpler Experiment 1 setup was more effective for these specific metrics, though Cramér’s V did see a marginal improvement.
No single variable pool proved universally optimal. While A or M variables were prevalent in high-performing configurations, Demographic (D) variables tended to contribute most effectively in combination with other types, particularly for topics with inherent demographic components like “Gender Role”. For distributional similarity (JSD), the A pool demonstrated broad effectiveness, notably achieving the largest JSD gain over the baseline (+0.26) for “Refugee Allowing”. The magnitude of these gains over the Experiment 1 baseline varied considerably; substantial JSD improvements were also seen for “Health Insurance Policy” (+0.17 with D+A pool) and “Race Diversity” (+0.14 with A pool). While providing more extensive backstory information in Experiment 2 generally benefited LLM performance, especially for JSD, the Experiment 1 baseline occasionally remained competitive or superior, highlighting a nuanced interplay between question type, evaluation metric, and the most pertinent conditioning information.
A direct comparison between the peak LLM performance and the peak RF performance, with each model utilizing its respective optimal variable pool (Table 9), reveals distinct capabilities. The LLM consistently demonstrated superior or equal ability in matching overall opinion distributions, achieving a statistically better JSD on six out of ten topics. Particularly notable JSD advantages for the LLM were observed for “Drug Addiction” (+0.18) and “Gender Role” (+0.18), underscoring its proficiency in capturing nuanced collective response patterns.
Regarding individual prediction accuracy (F1-score), the comparison was more balanced. The RF model often achieved a marginally higher peak F1-score, with a more pronounced advantage for “Gun Regulation” where its F1-score was 0.08 points higher than the LLM’s. Similarly, for associative strength (Cramér’s V), the RF model exhibited a clear superiority for “Gun Regulation” (−0.19) and “Race Diversity” (−0.17), while the LLM held a slight edge for “Drug Addiction” (+0.06), with minimal differences on other topics.
Experiment 2 demonstrates that while an RF classifier can achieve comparable or superior peak individual prediction accuracy (F1-score) and associative strength (Cramér’s V) on certain topics when operating with its optimal variable set, the Gemma3 12B LLM generally excels at replicating the distributional characteristics of survey responses (JSD). This distinction highlights their differing strengths: RF for robust predictive power on specific tasks, and LLMs for their nuanced capacity to simulate finer-grained collective opinion structures.
4.4. Experiment 3—Impact of LLM-Driven Feature Selection on Simulation Accuracy
Unlike Experiment 2, where the LLM utilized all variables within a given pool, in this experiment the LLM was first prompted to select only the variables it deemed most relevant for prediction from each topic and variable pool combination. Only these self-selected variables were then used to construct the respondent’s profile for simulation. This approach offers a practical advantage in reducing prompt token count, leading to more efficient inference. The RF model results from Experiment 2, where the RF used its optimal variable pool, serve as a benchmark for comparison.
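A minimal sketch of what this selection step might look like is given below; the real prompt template is in Appendix A, and the wording and parsing logic here are only illustrative assumptions.

```python
def build_selection_prompt(topic: str, pool_variables: list) -> str:
    """Ask the LLM which variables in the pool it deems most predictive for the topic."""
    listed = "\n".join(f"- {v}" for v in pool_variables)
    return (
        f"You will predict a survey respondent's answer about: {topic}.\n"
        "From the variables below, list only those most relevant for this prediction,\n"
        f"as a single comma-separated line:\n{listed}\n"
    )

def parse_selection(llm_reply: str, pool_variables: list) -> list:
    """Keep only names that actually belong to the offered pool."""
    chosen = {name.strip() for name in llm_reply.split(",")}
    return [v for v in pool_variables if v in chosen]
```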
Table 10 details the peak performance of the Gemma3 12B LLM in Experiment 3 across all topics and metrics when employing this feature selection strategy, along with the variable pool from which features were selected. A comparison of these results with the LLM’s performance in Experiment 2 (Table 6, where all variables in the optimal pool were used) reveals nuanced effects of feature selection. For F1-score and Cramér’s V, the differences were generally marginal across most topics, often close to zero. However, for JSD, LLM-driven feature selection yielded notable improvements in distributional alignment for several topics. For instance, JSD for “Income Inequality” and “Race Diversity” improved by 0.08 and 0.05, respectively. This suggests that by focusing on a curated set of variables, the LLM can sometimes enhance its ability to replicate collective opinion patterns.
Nevertheless, the efficacy of this internal feature selection was not uniform. A striking instance is the “Health Insurance Policy” topic, where JSD worsened by 0.11 when feature selection was employed. This decline coincided with the optimal pool for JSD shifting from D+A in Experiment 2 to A+M in Experiment 3, suggesting that the LLM’s feature selection may have deprioritized or omitted the crucial “has health insurance” demographic variable. This highlights a potential trade-off: while feature selection can enhance efficiency and, in some cases, improve JSD, it may occasionally overlook critical variables, particularly for topics with strong correlation to specific input features. The distribution of optimal variable pools also shifted in Experiment 3, with the D pool appearing more frequently as optimal for certain metrics compared to Experiment 2.
When comparing the peak LLM performance with feature selection against the peak RF performance from Experiment 2 (Table 11), the overall trends largely mirrored those observed in the Experiment 2 LLM vs. RF comparison. The LLM maintained its advantage in JSD for several topics, demonstrating strong distributional alignment. For example, for “Drug Addiction”, the LLM with feature selection achieved a JSD improvement of 0.23 over the peak RF performance, and for “Gender Role”, the improvement was 0.15.
Conversely, for individual prediction accuracy (F1-score), the RF model from Experiment 2 often held a slight edge or performed comparably to the LLM with feature selection. For “Gun Regulation” the RF outperformed the LLM by 0.07 in F1-score. Similarly, for associative strength (Cramér’s V), the RF demonstrated stronger performance for “Gun Regulation” (−0.18) and “Race Diversity” (−0.18), while the LLM showed an advantage for “Drug Addiction” (+0.09).
In essence, allowing the LLM to perform feature selection can lead to more efficient inference and, in some instances, can refine its ability to capture distributional patterns (JSD). However, its performance relative to a well-configured RF model remains largely consistent with the trends observed without this explicit LLM feature selection step: the LLM excels in distributional similarity while the RF often matches or slightly exceeds it in individual prediction accuracy on specific tasks.
4.5. Experiment 4—Head-to-Head Model Performance Across All Topic and Variable Pool Configurations
In both Experiments 2 and 3, we focused on identifying and comparing LLM and RF peak model performances across variable pools, which showed that optimal variable pools frequently differed between the LLM and RF for the same topic. In this section we present the direct comparison of these models under identical information conditions: for each topic we compare the performance under the same variable pool. To do this, we use the results from Experiment 2, since in this experiment the LLM utilizes all variables available within each pool, mirroring the RF’s input. This allows for a granular assessment of relative model strengths when the informational input is precisely matched.
The comparative performance is primarily visualized through heatmaps (Figure 3, Figure 4 and Figure 5) depicting the difference between LLM and RF scores for F1-score, Jensen–Shannon Distance (JSD), and Cramér’s V, respectively. In these figures, rows represent the survey topics and columns denote the seven variable pools. Cell values quantify the performance gap (LLM score − RF score for F1-score and Cramér’s V; RF JSD − LLM JSD for JSD), with positive values (e.g., blue shades) indicating LLM superiority for that specific topic–pool combination. As in previous experiments, an asterisk (*) denotes a difference for which the 95% bootstrap confidence interval does not include zero, indicating statistical significance (p < 0.05).
An examination of these heatmaps reveals distinct patterns in the relative performance of the Gemma3 12B LLM and the RF when given identical inputs from each variable pool.
For F1-score (Figure 3), the RF model generally demonstrates an advantage or performs comparably to the LLM across most topic–pool configurations. This is evidenced by the prevalence of neutral (light/white) or orange/red shades. For instance, on “Gun Regulation”, the RF consistently outperforms the LLM across all variable pools, with particularly strong advantages when using the D (Diff: −0.33), D+M (Diff: −0.34), and M (Diff: −0.31) pools. Similarly, for “Current Economy”, the RF shows better F1-scores with pools like A+M (Diff: −0.19) and D+A (Diff: −0.15). The LLM achieves comparable or slightly better F1-scores in isolated cells, such as for “Drug Addiction” with the D+A pool (Diff: +0.03) or “Gender Role” with the D+A+M pool (Diff: +0.02), but these instances are less frequent. This suggests that for individual prediction accuracy under these matched input conditions, the RF model often has an edge.
The JSD comparison (Figure 4) presents a more favorable outcome for the LLM. The LLM frequently achieves superior distributional similarity across a variety of topics and pools. Notably, for “Drug Addiction”, the LLM shows strong JSD advantages with the A (Diff: +0.23) and D+A (Diff: +0.22) pools. For “Gender Role”, the LLM also performs well with the A (Diff: +0.18) and D+A (Diff: +0.18) pools. Similar LLM advantages in JSD are seen for “Health Insurance Policy” with A, A+M, D+A, and D+A+M pools. However, the RF can still achieve better JSD in some cases, particularly when the D pool is used for topics like “Climate Change” (Diff: −0.24) or “Current Economy” (Diff: −0.33), and notably for “Gun Regulation” with the D pool (Diff: −0.39). This indicates that while the LLM often excels in JSD, its advantage is not universal and can be pool-dependent, with RF sometimes performing better especially with purely demographic information.
Regarding Cramér’s V (Figure 5), which measures associative strength, the RF model again tends to show stronger performance or parity in many topic–pool combinations. Negative values, indicating RF superiority, are prominent for “Gun Regulation” across all pools, with the largest differences seen with D+M (Diff: −0.33), D (Diff: −0.22), and A+M (Diff: −0.21). For “Race Diversity”, the RF also consistently outperforms the LLM. The LLM demonstrates a competitive or superior Cramér’s V for “Drug Addiction” across most pools (e.g., M pool, Diff: +0.07) and for “Health Insurance Policy” with the M pool (Diff: +0.11). However, for many other topic–pool cells, the differences are small or slightly favor the RF model.
This head-to-head comparison, where both models process the exact same variables from each pool, suggests that the RF model often exhibits superior or comparable performance in terms of F1-score and Cramér’s V. The Gemma3 12B LLM, however, frequently demonstrates an advantage in capturing the distributional nuances of survey responses, as measured by JSD, although this strength can also be influenced by the specific variable pool provided.
5. Conclusions
Our study investigated the capabilities of LLMs, specifically Gemma3 12B, to act as synthetic survey respondents, utilizing the 2020 ANES dataset. Our findings demonstrate that LLMs, when appropriately conditioned, serve as a potent tool for simulating public opinion, significantly outperforming naive baseline models even with minimal input.
While a direct quantitative comparison of our performance metrics with those in the literature is challenging due to differing experimental setups, we can contextualize our findings against key benchmarks. For instance, Argyle et al. (2022), in their seminal work on “silicon sampling,” used eleven ANES 2016 variables to predict a twelfth. They measured the correspondence between the associative patterns in human and GPT-3 data using Cramér’s V, finding a “stunning correspondence” with a mean difference of only −0.026. Similarly, other recent work [20] has reported high LLM accuracy in predicting the U.S. presidential election. However, we argue that predictive power in such tasks can be heavily driven by the inclusion of exceptionally strong proxy variables. For instance, a respondent’s political ideology or state of residence is already such a powerful predictor of vote choice that it can mask the model’s true simulation capability. We build on this foundation by tasking the LLM with a more complex challenge: simulating opinions on politically charged topics like gun regulation, refugee policy, and racial diversity.
A key contribution of our work is the detailed comparative analysis against traditional Random Forest classifiers and the nuanced exploration of different categories of backstory information—Demographic, Attitudinal and Political Orientation, and Moral and Social Values. We consistently found that while LLMs achieve individual-level prediction accuracy (F1-score) comparable to Random Forests, they notably excel in capturing the overall distributional patterns of opinions (as measured by Jensen–Shannon Distance) across a majority of socio-political topics. This superior performance in reflecting collective sentiment is particularly evident when LLMs are provided with richer combinations of attitudinal and moral variables, often surpassing the utility of demographic data alone. This highlights why previous studies focusing primarily on demographic conditioning may have observed limited steerability.
The research also explored the LLM’s inherent ability to discern relevant variables, showing that even when tasked with selecting a subset of features (Experiment 3), performance remained robust and, in some cases, JSD improved. This suggests a potential for more efficient prompting strategies, though with the caveat that crucial, topic-specific variables might occasionally be overlooked if not explicitly provided.
In direct head-to-head comparisons with Random Forests using identical input variable pools (Experiment 4), the pattern held: Random Forests often showed a slight edge or parity in F1-score and Cramér’s V, while the LLM maintained its advantage in distributional similarity (JSD). This underscores distinct strengths, with RFs suited for raw predictive power on individual responses and LLMs better at simulating the collective opinion landscape.
We hypothesize that these distinct strengths stem directly from the models’ training paradigms. The Random Forest is a discriminative model trained exclusively to find the optimal predictive pathways within the ANES sample, making it highly proficient at individual classification (F1-score, Cramér’s V). Conversely, the LLM leverages its vast pre-training on web-scale text, which has implicitly taught it the statistical distributions of beliefs and attitudes at a population level. This allows it to generate responses that better reflect the overall shape of public opinion, often leading to superior distributional accuracy (JSD).
To ensure these observations are not an artifact of a single model, we validated our findings through a full replication using Qwen2.5 14B (see Appendix B for summary tables). The results from Qwen2.5 consistently reinforce our central conclusion: while the supervised Random Forest model often holds a slight advantage in individual prediction accuracy (F1-score), an LLM excels at capturing the distributional characteristics of public opinion (JSD). This core trade-off was evident across both models. We did, however, observe minor differences in their capabilities. Gemma3 12B appeared to be a more competitive individual-level predictor, occasionally matching or exceeding the RF’s F1-score, whereas Qwen2.5’s performance lagged further behind the RF on this metric. Ultimately, the remarkable consistency in the type of strengths and weaknesses exhibited by two distinct LLMs strongly suggests that this is a fundamental distinction between generative simulation and supervised classification.
In conclusion, our research substantiates the growing potential of LLMs as a valuable and scalable instrument for social science research. They offer a promising, cost-effective method to generate synthetic survey data, explore “silicon sampling”, and complement traditional survey methodologies, especially for understanding nuanced collective opinion patterns. While traditional machine learning models like Random Forests may retain advantages in specific individual prediction contexts, the demonstrated strength of LLMs in distributional modeling positions them as an important new tool for advancing our understanding of public opinion dynamics.
6. Limitations and Future Research
While this study provides strong evidence for the utility of LLMs in public opinion simulation, it is important to acknowledge its limitations. First, our analysis is centered on small open-source models, primarily Gemma3 12B with a replication using Qwen2.5 14B. Performance and behavioral patterns may vary with different model architectures, sizes (e.g., larger proprietary models such as GPT-4), or quantization methods. Future work should extend this comparative framework to a broader range of models to assess the generalizability of our findings.
Second, the findings are rooted in the 2020 American National Election Studies (ANES) dataset. The observed relationships between demographic, attitudinal, and moral variables are specific to the U.S. socio-political context of that time. The framework’s effectiveness in simulating public opinion in other cultures, political systems, or time periods requires explicit validation.
Third, while we described our prompt templates and temperature settings in the methods section, we acknowledge that prompt design choices can meaningfully affect model behavior. Subtle changes in wording, formatting, or ordering of variables can lead to shifts in model outputs—particularly in settings involving moral or ideological framing. While prior work has found such simulations to be relatively robust across a wide range of temperature values, prompt sensitivity remains an open methodological challenge in computational social science. Future work should explore more systematic sensitivity analyses and the development of prompt-invariant methods to ensure the robustness and reproducibility of simulation results.
Finally, this work is confined to simulating responses for structured, multiple-choice survey questions. While our distributional metrics (JSD) show strong aggregate performance, there remains a risk that LLMs can generate “caricatures” that over-represent common statistical correlations rather than capturing the full spectrum of nuanced or atypical human beliefs. The ability to generate authentic, open-ended textual responses remains a key area for future exploration. These limitations highlight important directions for continued research, including cross-cultural validation and the development of more sophisticated metrics for assessing the qualitative authenticity of synthetic personas.