Generative AI as an Investment Advisor: Same Client, Different Advice

Agliata, Nicolo; Hasso, Tim

doi:10.3390/fintech5020054

by Nicolo Agliata and Tim Hasso *

Bond Business School, Bond University, 14 University Drive, Robina, QLD 4226, Australia

* Author to whom correspondence should be addressed.

FinTech 2026, 5(2), 54; https://doi.org/10.3390/fintech5020054

Submission received: 12 May 2026 / Revised: 4 June 2026 / Accepted: 9 June 2026 / Published: 11 June 2026

(This article belongs to the Topic Artificial Intelligence Applications in Financial Technology, 2nd Edition)


Abstract

Generative artificial intelligence (GAI) is increasingly embedded in personal finance, yet little is known about how models make recommendations using financial information and demographic cues. This study audits three frontier GAI models, GPT 5.5, Gemini 3.1 Pro, and Claude Opus 4.7, using a conjoint experiment in which each model evaluated the same hypothetical investor profiles and selected among standardized conservative, balanced, and aggressive portfolios. Investor profiles systematically varied nine attributes: risk tolerance, time horizon, goal type, income, age, gender, ethnicity, marital status, and employment type. Ordered logistic regressions and matched-profile comparisons show that all three models base recommendations primarily on financial attributes, especially risk tolerance and time horizon. Age and marital status shift recommendations towards conservatism in all models, whereas only Claude conditions on gender and employment type. Ethnicity exerts no detectable influence on the recommendations of ChatGPT or Claude, but is a small, statistically significant predictor for Gemini, with non-White profiles receiving slightly more conservative recommendations than otherwise identical White profiles. Overall, we find that the models are not interchangeable: they differ significantly in overall risk appetite and in how they translate risk tolerance, time horizon, goal type, and age into portfolio choices, with economically meaningful differences in predicted recommendations for identical clients. These findings suggest that contemporary GAI investment advice is driven mainly by financially relevant attributes, but that demographic sensitivity may appear in model-specific and statistically nuanced ways, alongside a distinct form of platform risk arising from model-specific advisory logic.

Keywords:

generative AI; large language models; robo-advisors; goals-based investing; algorithmic bias; audit methodology; conjoint; fintech

JEL Classification:

G11; G23; G41; O33

1. Introduction

Generative artificial intelligence (GAI) is rapidly reshaping the area of personal financial advice. Within a few years, large language models have moved from experimental novelty to embedded infrastructure inside the workflows of asset managers, banks and, increasingly, retail investors themselves. Survey evidence indicates that nearly half of retail investors now use GAI to interpret financial information [1], and the rapid uptake of GAI-enabled financial chatbots, combined with a parallel maturation of automated advisory platforms, suggests that algorithmic advice, once dispensed by rule-based robo-advisors operating on rigid mean-variance frameworks, is becoming conversational, generative and substantially more flexible. This shift carries a latent tension. The robo-advisor was originally promoted as a corrective to human advisor bias and conflict of interest [2,3]. By delegating recommendations to deterministic algorithms calibrated against client risk profiles, robo-advisory platforms promised to remove the demographic and behavioral biases documented in the human advice literature [4,5,6]. Yet if GAI now substitutes for, or augments, the rule-based recommendation engines that underpin contemporary robo-advice, the architecture of bias may simply migrate rather than disappear. Where rule-based models can be audited against transparent algorithmic logic, GAI models make recommendations through opaque latent representations conditioned on heterogeneous training corpora and reinforcement learning from human feedback [7,8]. Consequently, biases that were excluded by design in conventional robo-advisors may re-enter the recommendation pipeline through the back door.

The empirical question this study addresses is whether contemporary GAI models, when prompted to perform the function of a goals-based investment advisor, generate recommendations that exhibit two distinct and theoretically separable patterns. First, do they respond appropriately to the financial attributes that goals-based investing prescribes as relevant, such as risk tolerance, time horizon, age and goal type? Second, do they discriminate inappropriately based on demographic cues, such as gender and ethnicity, that goals-based investing theory and prevailing regulatory frameworks treat as immaterial to portfolio recommendations, conditional on financial profile? Although Oehler and Horn [9] compared ChatGPT’s investment recommendations to those of established robo-advisors at the average level, they did not decompose the relative weight that the model places on individual investor attributes, nor did they isolate whether identical financial profiles draw different recommendations depending on demographic cues.

Drawing on the audit methodology developed in the algorithmic bias literature [8,10,11], this study uses a full-profile conjoint experiment to probe the implicit advisory logic of three frontier GAI models. In the experiment, three standardized portfolios are held constant while the investor profile attributes vary across choice tasks. Each model completes 5000 choice tasks per experiment, nested in 1000 investor profile scenarios, yielding 15,000 total choice observations.

This study makes three contributions. First, it extends the rapidly growing GAI-in-investing literature beyond comparisons of average performance [9,12,13] towards a structural account of which attributes drive GAI portfolio recommendations and which do not. Second, it imports the audit methodology of algorithmic bias research [10,11] into the goals-based investment advisory context, where a clean theoretical separation exists between attributes that should and should not influence advice. Third, it documents cross-model heterogeneity in advisory logic, illuminating a previously underappreciated source of “platform risk” facing investors who delegate financial decision-making to particular GAI models. The findings carry implications for fintech regulation, robo-advisor governance and the rapidly developing scholarly conversation on AI accountability in financial services.

2. Literature Review

This review situates the study at the intersection of four bodies of work. We begin with the literature on robo-advisors and goals-based investing, which establishes the normative framework against which GAI advice can be benchmarked. We then survey emerging research on GAI in investment advisory, followed by the algorithmic bias literature that motivates audit-style methodologies.

2.1. Robo-Advisors and Goals-Based Investing

Robo-advisors emerged in the wake of the 2008 financial crisis as low-cost, algorithmically driven alternatives to traditional human financial advisors. Operating typically through web-based questionnaires that elicit client risk tolerance, time horizon, financial goals and tax circumstances, these platforms generate model portfolios that map onto the client’s elicited profile through deterministic algorithms grounded in modern portfolio theory [2,14]. The rationale for their growth has been two-fold. First, robo-advisors offer accessibility to retail investors previously priced out of personalized advice [15]. Second, they were designed to mitigate the conflicts of interest and behavioral biases observed in human advice provision [6].

The human advisory literature provides a useful baseline against which to compare robo-advisors. Mullainathan et al. [6] conduct an audit study of human financial advisors and document that human advisors systematically encourage clients to invest in higher-fee, actively managed funds rather than lower-cost index funds, with effects varying by client demographics. Linnainmaa et al. [5] show that advisors recommend portfolios that mirror their own beliefs rather than client risk tolerances, and Egan et al. [4] document substantial heterogeneity in misconduct exposure across client demographics. Consequently, human advice has been shown to be prone to conflicts of interest and behavioral biases.

The principle underlying contemporary robo-advisory practice is goals-based investing. In contrast to mean-variance portfolio optimization, which treats client preference as a single risk-aversion parameter, goals-based investing partitions wealth across distinct goals (such as retirement, education and house purchase), and applies tailored portfolio construction to each goal’s time horizon and shortfall tolerance [16,17]. Within this framework, financial investor attributes are theoretically and prescriptively relevant: time horizon determines the length of the investment; stated risk tolerance calibrates the equity-bond mix; the goal itself dictates the funding adequacy threshold; and income shapes the contribution capacity. Life-cycle attributes such as age, marital and dependent status, and employment may also be conditionally relevant as they identify the suitability of specific portfolios and products. By contrast, gender and ethnicity are protected demographic attributes and should not be relevant to portfolio recommendations conditional on financial profile, although they may correlate with risk tolerance or other relevant attributes in observed populations. Empirical research on robo-advisors has examined adoption patterns, portfolio construction quality [14], and behavioral effects on investors, with Rossi and Utkus [15] showing that robo-advisor users improve diversification and reduce behavioral biases relative to their pre-adoption behavior. The arrival of GAI as a substitute or augmentation for robo-advisor recommendation engines therefore re-opens questions that the rule-based generation of robo-advisors was thought to have settled.

2.2. Generative AI in Investment Advisory

A growing body of work investigates whether GAI models can perform investment-advisory functions previously reserved for humans or rule-based platforms. Kim [12] demonstrates that ChatGPT can interpret macroeconomic conditions to construct asset-class portfolios that exhibit diversification benefits relative to random allocations. Ko and Lee [13] extend this finding by showing that ChatGPT-selected portfolios offer statistically superior diversification properties, while Schneider and Yilmaz [18] report that GAI-constructed portfolios calibrated to a stated risk appetite outperform benchmarks in the United States, although performance varies markedly across European markets and across model versions. Beyond portfolio construction, Pelster and Val [19] provide live-experiment evidence that GPT-4 evaluates earnings news in ways correlated with subsequent returns, and Luo et al. [20] document that diversified investors are the most frequent GAI users, with personality traits such as narcissism predicting usage frequency.

Most directly relevant for the present study, Oehler and Horn [9] compare ChatGPT’s investment recommendations to those of established robo-advisors and find that ChatGPT advice often aligns more closely with academic benchmarks for standard investor profiles, particularly for one-time investments. However, their analysis examines aggregate recommendations rather than decomposing the model’s implicit attribute weights, and it does not test whether identical financial profiles elicit different recommendations across demographic dimensions. Schlosky and Raskie [21] revisit ChatGPT’s financial-advisory performance and document improvements in tone and detail in newer versions, while noting persistent limitations regarding legal nuance and specificity. Collectively, this literature establishes that GAI models are substantively engaged in advisory functions, but leaves unresolved whether the attribute weights driving their recommendations are systematic, defensible or platform-invariant.

2.3. Algorithmic Bias in Financial Services

The premise that GAI models are neutral information processors has been comprehensively challenged. Two principal issues generate algorithmic bias: the composition of pre-training corpora and the alignment process via reinforcement learning from human feedback [7,8,22].

Through the first channel, models internalize the statistical regularities of historically biased text. Through the second, they reproduce the cultural assumptions of annotator populations whose preferences shape model outputs. Empirical work has documented systematic algorithmic bias across multiple financial domains. Bowen et al. [10] demonstrate that GAI models recommend higher rejection rates and interest rates for Black mortgage applicants than for identical White profiles, with disparities persisting even when explicit racial labels are removed and proxies, such as geography, convey the same information. Lippens [11] employs an audit methodology to show that ChatGPT systematically rates job applicants with ethnic-minority names lower than majority-named applicants. Motoki et al. [8] document systematic political bias in ChatGPT’s responses to political-orientation surveys and advance the methodological argument that audit-style probing is the appropriate technique for surfacing latent algorithmic preferences. These findings establish a strong prior that GAI models applied to investment-advisory tasks may also encode systematic biases. The advisory domain is particularly consequential because, unlike a one-shot lending decision, advice influences cumulative portfolio outcomes over decades; a small bias in equity allocation, compounded over a working life, generates substantial differences in retirement wealth.

Drawing on these literatures, our study addresses the following overarching research question: when prompted to provide goals-based investment advice, do contemporary GAI models weight financial investor attributes appropriately while remaining invariant to demographic attributes that goals-based investing treats as conditionally immaterial? And to what extent do these patterns vary across GAI platforms?

3. Materials and Methods

We adopt an audit methodology that treats GAI models as research subjects, using structured prompting to elicit their latent advisory logic [8,11]. We conceptualize GAI models not as cognitive agents but as probabilistic engines that produce statistically representative outputs conditional on their training corpora and alignment procedures. The “advice” generated by these models therefore reflects the dominant patterns of advisory discourse encoded in their training data, which makes them particularly suitable subjects for audit-style choice experiments designed to surface implicit attribute weights [23,24].

We audit three frontier GAI models, GPT 5.5 (OpenAI), Gemini 3.1 Pro (Google) and Claude Opus 4.7 (Anthropic), selected for their market dominance and their documented use in financial-advisory contexts. Each system is accessed through its respective official API to ensure independence of responses, and default temperature settings are retained so as to preserve the stochastic variation that characterizes real-world deployment of these models.

3.1. Experimental Design

The audit comprises a full-profile conjoint experiment in which three standardized portfolios are presented identically across all choice tasks: a Conservative Portfolio (30% equity, 70% bond), a Balanced Portfolio (60% equity, 40% bond) and an Aggressive Portfolio (90% equity, 10% bond), with all other portfolio attributes held constant at industry-typical levels. The investor profile then varies across nine attributes that we partition into three conceptually distinct categories. The first category, financial attributes, comprises stated risk tolerance, time horizon, goal type and annual income. These are attributes that goals-based investing prescriptively treats as recommendation-relevant, and the GAI response to them serves as a benchmark of advisory competence. The second category, life-cycle attributes, comprises age, marital and dependent status, and employment type; these may be conditionally relevant because they bear on the suitability of specific portfolios and products. The third category, demographic attributes, comprises gender and ethnicity; these are protected attributes and should not be relevant to portfolio recommendations. Each model completes 5000 choice tasks nested in 1000 distinct profiles, in which the levels of the nine investor attributes are varied. The full attribute schema is presented in Table 1.

3.2. Data Collection

The attributes and their levels yield a full-factorial design space of 17,496 unique hypothetical investors. To ensure sufficient statistical power while maintaining experimental and computational efficiency, a fractional factorial design was used, drawing a random, orthogonal sample of 1000 distinct profiles from this universe. This sampling procedure minimizes multicollinearity between attributes, so that the main effect of each client characteristic can be estimated independently of the others. The 1000 generated profiles were translated into standardized textual prompts, each of which acted as a client and presented the full profile of the hypothetical investor to the large language model. An example of a full prompt is provided in Appendix A. We adopt this fully structured prompt design deliberately. The explicit specification of the attributes and the three standardized portfolios makes the attribute-level decomposition identifiable. By holding the choice architecture constant and varying the demographic and financial attributes orthogonally, we can attribute differences in recommendations to specific attributes rather than to differences in how each model spontaneously frames the advisory task. The design prioritizes internal validity at the cost of ecological validity, as real retail investors typically interact with GAI advisors through less structured natural-language conversations in which the recommendation-relevant attributes are partial, ambiguous, or supplied in non-canonical order.
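The size of the design space follows from the number of levels per attribute. The Python sketch below is illustrative only: it assumes two gender levels, four ethnicity levels, and three levels for each of the remaining seven attributes. These counts are inferred from the levels named in the text and from the reported total, and Table 1 remains the authoritative schema. The simple random draw and the seed are hypothetical and do not reproduce the orthogonality step of the sampling procedure described above.

```python
import itertools
import random

# Assumed number of levels per attribute (hypothetical; see Table 1 for the actual schema).
LEVELS = {
    "gender": 2,
    "ethnicity": 4,
    "age": 3,
    "risk_tolerance": 3,
    "time_horizon": 3,
    "goal_type": 3,
    "annual_income": 3,
    "marital_status": 3,
    "employment_type": 3,
}

# Full-factorial design: every combination of level indices across the nine attributes.
full_factorial = list(itertools.product(*(range(n) for n in LEVELS.values())))
print(len(full_factorial))  # 17496

# Simple random sample of 1000 distinct profiles (illustrative seed; no orthogonality constraint).
rng = random.Random(2026)
sample = rng.sample(full_factorial, 1000)
```

Under these assumed level counts the product is 2 × 4 × 3^7 = 17,496, matching the design space reported above.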

For full reproducibility, we record the exact technical conditions under which the recommendations were elicited. Each of the three models was queried through its provider’s official application programming interface (API) using the most capable frontier model available at the time of the audit: GPT 5.5 (OpenAI), Gemini 3.1 Pro (Google) and Claude Opus 4.7 (Anthropic). The full set of 5000 choice tasks per model (1000 profiles × five replicates) was administered in two collection waves on 8 May 2026 (replicate 1) and 28 May 2026 (replicates 2–5), with model identifiers, prompt messages and sampling parameters held constant across waves. We do not use a system message and provide all information through the prompt. Sampling parameters were left at each provider’s default settings to reflect realistic deployment conditions; the default temperature is 1.0 across the three providers, and the default top-p is 1.0 for ChatGPT, 0.95 for Gemini, and 1.0 for Claude. As a robustness check, Gemini was additionally queried at temperature = 0 on the same 1000 profiles in a separate one-replicate run on 28 May 2026; this temperature override was not possible for ChatGPT or Claude, as neither provider exposes a temperature parameter for its frontier models. There were no refusals, and all models provided answers in the format requested by the prompt. We parsed the first 11 characters of each response and coded “Portfolio A” as 1, “Portfolio B” as 2, and “Portfolio C” as 3 to reflect their ordinal nature from Conservative to Aggressive. Although we describe these portfolios as Conservative, Balanced, and Aggressive, the prompt does not label them as such.
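The response-coding rule can be sketched in a few lines of Python. The function name and the error raised on an unexpected prefix are hypothetical additions for illustration; the text reports that every response followed the requested format, so that branch would not have been triggered.

```python
# Ordinal codes for the three portfolio labels, from Conservative (1) to Aggressive (3).
PORTFOLIO_CODES = {"Portfolio A": 1, "Portfolio B": 2, "Portfolio C": 3}


def code_response(text: str) -> int:
    """Map the first 11 characters of a model response to its ordinal code."""
    prefix = text[:11]
    if prefix not in PORTFOLIO_CODES:
        # Hypothetical guard: not needed in the audit, where all responses were well formed.
        raise ValueError(f"Unexpected response prefix: {prefix!r}")
    return PORTFOLIO_CODES[prefix]


print(code_response("Portfolio B is the most suitable choice."))  # 2
```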

3.3. Analysis

The experiment yields 15,000 ordinal recommendations (1000 client profiles × three models × 5 tasks per profile), coded 1 = Conservative, 2 = Balanced, 3 = Aggressive. Because the same 1000 profiles are administered to each model five times, the data have a matched replicate structure where each profile contributes five recommendations per model. The regression analyses use the replicate-level observations, yielding 5000 observations per model and 15,000 observations in the pooled model. This design permits within-profile inference and delivers greater power than an independent-samples comparison would. The analysis proceeds through four components: a cross-model agreement analysis, per-model proportional-odds regressions, a pooled interaction model that tests for differential attribute weighting across the three GAI models, and an anchor-scenario calculation that translates the coefficient estimates into economically interpretable recommendation probabilities. A set of robustness checks accompanies the regression results.

First, we quantify how consistently each model responds to the same profile across its five replicates. For each of the 1000 profiles, we summarize the five recommendations produced by each model and consider three complementary measures, specifically the percentage of profiles for which all five replicates were identical, Fleiss’s κ across the five replicates [25], and the one-way intraclass correlation, ICC(1).
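Two of these consistency measures can be computed directly from a table with one row per profile and one column per replicate. The sketch below is a plain-Python illustration on toy data, not the audit data; the function names are hypothetical, and ICC(1) is omitted for brevity.

```python
from collections import Counter


def share_unanimous(ratings):
    """Fraction of profiles whose replicate recommendations are all identical."""
    return sum(len(set(row)) == 1 for row in ratings) / len(ratings)


def fleiss_kappa(ratings, categories=(1, 2, 3)):
    """Fleiss's kappa for rows that each hold the same number of ratings."""
    n_subjects = len(ratings)
    n_raters = len(ratings[0])
    counts = [Counter(row) for row in ratings]
    # Mean observed agreement across subjects.
    p_bar = sum(
        (sum(c[k] ** 2 for k in categories) - n_raters) / (n_raters * (n_raters - 1))
        for c in counts
    ) / n_subjects
    # Chance agreement from the pooled category proportions.
    p_e = sum(
        (sum(c[k] for c in counts) / (n_subjects * n_raters)) ** 2 for k in categories
    )
    return (p_bar - p_e) / (1 - p_e)


# Toy example: four profiles, five replicates each (codes 1 to 3).
toy = [[1, 1, 1, 1, 1], [2, 2, 2, 2, 2], [3, 3, 3, 3, 3], [1, 1, 1, 1, 2]]
print(share_unanimous(toy))  # 0.75
print(round(fleiss_kappa(toy), 3))
```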

Second, we characterize the marginal distribution of recommendations by model and assess cross-model agreement. The omnibus null that the three GAI models draw recommendations from the same distribution is tested using the Friedman test for matched ordinal data [26]. Multi-rater concordance is summarized by Kendall’s coefficient of concordance W and by Fleiss’s [25] kappa. Pairwise comparisons are reported in four forms: raw percentage agreement, Cohen’s [27] linearly weighted kappa interpreted against the Landis and Koch [28] thresholds, the Stuart [29]–Maxwell [30] test of marginal homogeneity, and the Wilcoxon signed-rank test for paired ordinal data. Holm’s [31] step-down procedure controls the family-wise error rate across the three pairwise comparisons. Directional asymmetries in disagreement are summarized by counting, for each model pair, the profiles on which one model recommends a more aggressive portfolio than the other; pairwise contingency heatmaps visualize the joint recommendation distributions.
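As an illustration of one of the pairwise statistics, the following Python sketch computes Cohen's linearly weighted kappa for two raters on a three-category ordinal scale. It uses disagreement weights |i − j| / (k − 1), a standard formulation; the function name and the toy inputs are hypothetical and are not drawn from the audit data.

```python
def linear_weighted_kappa(a, b, categories=(1, 2, 3)):
    """Cohen's kappa with linear disagreement weights for two paired rating lists."""
    n = len(a)
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}
    # Joint proportion table of the two raters.
    observed = [[0.0] * k for _ in range(k)]
    for x, y in zip(a, b):
        observed[index[x]][index[y]] += 1 / n
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    # Weighted observed and chance-expected disagreement.
    observed_dis = sum(
        abs(i - j) / (k - 1) * observed[i][j] for i in range(k) for j in range(k)
    )
    expected_dis = sum(
        abs(i - j) / (k - 1) * row[i] * col[j] for i in range(k) for j in range(k)
    )
    return 1 - observed_dis / expected_dis


# Toy example: the raters disagree on one of six profiles, by one category.
rater_a = [1, 1, 2, 2, 3, 3]
rater_b = [1, 2, 2, 2, 3, 3]
print(round(linear_weighted_kappa(rater_a, rater_b), 3))  # 0.8
```

Linear weights penalize a Conservative-versus-Aggressive disagreement twice as heavily as a disagreement between adjacent categories, which suits the ordinal coding used here.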

We then identify the client attributes that drive each model’s recommendations. For each GAI model separately, we estimate a proportional-odds (cumulative) logit [32,33] of the ordinal recommendation on the nine client attributes. All predictors are factor-coded, with the lowest-risk or modal level taken as the omitted reference (Male, White, age 28, Conservative risk tolerance, 5-year horizon, Retirement goal, US$50,000 income, Single without dependents, Salaried W-2). The joint significance of each predictor is assessed by a likelihood-ratio (LR) test, obtained by re-estimating the model without that predictor. LR inference is preferred to Wald inference because quasi-complete separation on the Risk Tolerance dummies, documented in the robustness checks below, inflates Wald standard errors but leaves the nested log-likelihood comparisons unaffected [34]. To control the family-wise error rate, we apply the Holm step-down correction separately within each of three theoretically motivated attribute blocks. The financial block (Risk Tolerance, Time Horizon, Goal Type and Annual Income) comprises the inputs that goals-based investing theory prescribes as the legitimate determinants of portfolio recommendation. The life-cycle block (Age, Marital Status and Employment Type) comprises attributes that are demographic but carry potential actuarial or life-cycle relevance through their correlation with longevity, dependent-funding requirements or income stability. The demographic block (Gender and Ethnicity) comprises attributes that anti-discrimination laws treat as protected characteristics and that should not be considered for investment purposes. This partition aligns the statistical correction families with the theoretical categories under which we interpret the results. Overall fit is summarized by the McFadden [35] pseudo-R², and selected coefficients are reported on the log-odds and odds-ratio scales to convey the direction and magnitude of attribute effects.
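The Holm step-down correction applied within each block can be sketched as follows. This is a minimal plain-Python illustration; the function name and the p-values in the usage example are hypothetical and are not results from the study.

```python
def holm_adjust(p_values):
    """Return Holm step-down adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[i])
        # Enforce monotonicity so adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[i] = running_max
    return adjusted


# Hypothetical raw p-values for one three-attribute block.
print([round(p, 3) for p in holm_adjust([0.01, 0.04, 0.03])])  # [0.03, 0.06, 0.06]
```

Because the correction is run separately per block, the multiplier for the smallest p-value is 4 in the financial block, 3 in the life-cycle block, and 2 in the demographic block.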

To test whether the three GAI models differ in how they use client attributes rather than merely in their unconditional risk appetite, we pool the 15,000 observations and estimate a proportional-odds logit augmented with two Model dummies (Gemini and Claude, with ChatGPT as the reference) and the full set of Model × X interactions. The omnibus null that all 36 interaction coefficients equal zero is tested by LR. Variable-level Model × X interaction tests are then reported with Holm correction to identify the specific attributes on which the models diverge.
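The count of 36 interaction coefficients can be reconstructed from the factor coding. The sketch below assumes the same level counts as the design space (two gender levels, four ethnicity levels, three levels for each remaining attribute); these counts are inferred from the text rather than taken from Table 1, and the variable names are hypothetical.

```python
# Assumed number of levels per attribute (hypothetical; see Table 1 for the actual schema).
LEVELS = {
    "gender": 2,
    "ethnicity": 4,
    "age": 3,
    "risk_tolerance": 3,
    "time_horizon": 3,
    "goal_type": 3,
    "annual_income": 3,
    "marital_status": 3,
    "employment_type": 3,
}

# Each factor contributes (levels - 1) dummy variables once a reference level is dropped.
main_effect_dummies = sum(n - 1 for n in LEVELS.values())

# Gemini and Claude each interact with every dummy; ChatGPT is the reference model.
non_reference_models = 2
interaction_terms = main_effect_dummies * non_reference_models

print(main_effect_dummies, interaction_terms)  # 18 36
```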

To express the coefficient estimates in economically interpretable quantities, we compute model-implied recommendation probabilities at an anchor scenario chosen to position the predicted recommendation near a category boundary, where marginal effects on the probability of an aggressive recommendation are largest. The anchor is a Male, White, age-45 client with Moderate risk tolerance, a 30-year horizon, a Retirement goal, $300,000 income, Single without dependents, and Salaried (W-2) employment. We then vary one attribute at a time, holding the remaining anchor attributes fixed, and report the implied P(Aggressive) under each GAI model. This complements the LR tests by exposing differences in economic magnitude that joint significance tests may mask.
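The translation from coefficients to recommendation probabilities can be sketched for a three-category proportional-odds model. The example assumes the parameterization P(Y ≤ j) = logistic(cut_j − η), under which positive coefficients push recommendations towards Aggressive, consistent with how odds ratios above one are read in this paper. The cutpoints, the anchor linear predictor, and the odds ratio below are hypothetical values chosen for illustration, not estimates from the study.

```python
import math


def logistic(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def category_probabilities(eta: float, cut1: float, cut2: float):
    """Return (P(Conservative), P(Balanced), P(Aggressive)) for linear predictor eta."""
    p_le_conservative = logistic(cut1 - eta)
    p_le_balanced = logistic(cut2 - eta)
    return (
        p_le_conservative,
        p_le_balanced - p_le_conservative,
        1.0 - p_le_balanced,
    )


# Hypothetical values: the anchor sits exactly on the Balanced/Aggressive boundary.
CUT1, CUT2 = -2.0, 2.0
ANCHOR_ETA = 2.0
HYPOTHETICAL_OR = 0.5  # an attribute change that halves the odds of a higher category

anchor = category_probabilities(ANCHOR_ETA, CUT1, CUT2)
shifted = category_probabilities(ANCHOR_ETA + math.log(HYPOTHETICAL_OR), CUT1, CUT2)
print(round(anchor[2], 3), round(shifted[2], 3))  # 0.5 0.333
```

The slope of P(Aggressive) with respect to η is p(1 − p), which peaks at p = 0.5. This is why an anchor near a category boundary shows the largest probability changes for a given odds ratio.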

Four robustness checks are conducted. First, the parallel-regression assumption is assessed for each per-model fit using the Brant [36] test, implemented as a comparison of slope coefficients from binary logits at the two cumulative cuts P(Y ≥ 2) and P(Y ≥ 3). Where the proportional-odds restriction is rejected for an appreciable share of comparisons, a partial proportional-odds specification [37] is fitted to verify that the qualitative pattern of significant predictors is preserved. Second, the per-model fits are inspected for quasi-complete separation [34]; where it is detected, inference relies on LR tests of nested log-likelihoods rather than on Wald standard errors. Third, the per-model fits are re-estimated under leave-one-variable-out perturbations to confirm that the pattern of joint significance in the main results is not driven by any single predictor. Fourth, to assess whether our results are sensitive to the default-temperature setting, we perform a small 1000-observation replication using Gemini at temperature zero, as this is the only frontier model that permits the temperature to be changed.

4. Results

4.1. Within-Model Consistency

Table 2 reports the within-model consistency by model. All three models are highly consistent: between 83% and 92% of profiles received an identical recommendation on all five presentations, and no model moved between Conservative and Aggressive across replicates of the same profile. The ranking is stable across every measure: Gemini is the most internally consistent (91.5% unanimous; Fleiss’s κ = 0.934; ICC = 0.967), ChatGPT is intermediate (89.2%; Fleiss’s κ = 0.919; ICC = 0.958), and Claude is the least consistent (83.0%; Fleiss’s κ = 0.869; ICC = 0.918). Even for Claude, however, the ICC of 0.918 indicates that more than nine-tenths of the variation in its recommendations is systematic (between-profile) rather than replicate-level noise. The replicates therefore confirm that the cross-model differences documented below reflect genuine differences in decision rules rather than stochastic answer variation, while also revealing a modest but consistent ordering in the stability of the three models.

4.2. Cross-Model Agreement

Table 3 reports the marginal distribution of recommendations by model. ChatGPT and Gemini produce nearly identical marginal proportions, roughly half Conservative, a third Balanced, and a fifth Aggressive, whereas Claude is markedly more centrist, with only 27.8% Conservative recommendations but 49.8% Balanced. A Friedman test, the standard omnibus test for matched ordinal data, decisively rejects the null of identical distributions across the three GAI models, χ²(2) = 389.70, p < 0.0001. Kendall’s coefficient of concordance is W = 0.195, which is consistent with the omnibus rejection being driven by a relatively small subset of profiles on which the models disagree.
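The reported W can be checked against the Friedman statistic through the standard identity W = χ² / (n(k − 1)), where n is the number of blocks and k the number of treatments. The sketch below assumes that W was computed on the same layout as the Friedman test (1000 profiles as blocks, three models as treatments) with the same handling of ties; that layout is an inference from the text, not something the paper states explicitly.

```python
# Values reported in the text.
friedman_chi2 = 389.70
n_profiles = 1000  # blocks (assumed layout)
n_models = 3       # treatments (assumed layout)

# Standard identity linking Kendall's W to the Friedman chi-square statistic.
kendall_w = friedman_chi2 / (n_profiles * (n_models - 1))
print(round(kendall_w, 3))  # 0.195
```

Under that assumption the two reported statistics are mutually consistent.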

Table 4 decomposes the omnibus rejection into pairwise comparisons. Agreement between ChatGPT and Gemini remains very high, at 94.2% (linearly weighted κ = 0.930). The introduction of Claude, however, reveals a substantially lower three-way concordance: all three GAI models agree on only 72.9% of the profiles, and the multi-rater Fleiss’s κ is 0.720. Pairwise agreement with Claude is 74.4% (against ChatGPT) and 77.2% (against Gemini), with linearly weighted κ values of 0.692 and 0.727, respectively. Following the conventions of Landis and Koch, the ChatGPT vs. Gemini agreement falls in the “almost perfect” range, while the agreement of each of those models with Claude is best classified as “substantial.” All three pairwise Wilcoxon signed-rank tests and Stuart–Maxwell tests of marginal homogeneity reject equality at p < 0.0001 after Holm correction. The pattern of disagreement is directional: Claude shifts the recommendation upwards (towards a higher-equity portfolio) relative to ChatGPT on 253 profiles and downwards on only three; relative to Gemini, the corresponding counts are 215 versus 13. In short, Claude is systematically more balanced and less conservative than the other two GAI models on identical profiles. Figure 1 provides further evidence on the agreement between models, showing pairwise contingency heatmaps of the models’ modal recommendations. The three panels reveal that (i) ChatGPT and Gemini are almost perfectly diagonal, (ii) ChatGPT and Claude differ mainly through Claude promoting Conservative cases to Balanced, and (iii) the Gemini and Claude pattern mirrors that asymmetry.
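The directional counts and the raw agreement percentages can be reconciled with simple arithmetic. The sketch below assumes that both are computed on the same 1000 profile-level (modal) recommendations; the dictionary layout and variable names are hypothetical.

```python
N_PROFILES = 1000

# Profiles on which Claude is higher / lower than the comparison model (from the text).
directional_counts = {
    "ChatGPT vs Claude": (253, 3),
    "Gemini vs Claude": (215, 13),
}

agreement = {}
for pair, (claude_higher, claude_lower) in directional_counts.items():
    # Profiles that are neither higher nor lower must be in agreement.
    agreement[pair] = (N_PROFILES - claude_higher - claude_lower) / N_PROFILES

for pair, share in agreement.items():
    print(pair, f"{share:.1%}")
```

Under that assumption the implied agreement rates are 74.4% and 77.2%, matching the pairwise figures reported above.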

4.3. Regression Analysis

We next examine which client characteristics drive each model’s recommendations. For each GAI model separately, we estimate a proportional-odds (cumulative) logit of the ordinal recommendation on the nine client attributes. All predictors are factor-coded, with the lowest-risk or modal level taken as the omitted reference (Male, White, age 28, Conservative risk, 5-year horizon, Retirement, US$50,000 income, Single without dependents, and Salaried W-2).

Table 5 reports LR tests of the joint significance of each predictor in the three per-model fits, with Holm step-down correction applied. The McFadden pseudo-R² is 0.879 for ChatGPT, 0.808 for Gemini, and 0.700 for Claude. The client attributes therefore account for the bulk of the variation in recommendations, though for Claude a materially smaller share. Claude’s lower pseudo-R² implies that its recommendations carry more variation unexplained by the observable attributes.

Three patterns emerge from Table 5. First, the dominant drivers are the legitimate financial attributes: Risk Tolerance and Time Horizon return χ² statistics in the thousands for every model, with Goal Type and Annual Income contributing smaller but uniformly significant increments. Second, Age and Marital Status shift recommendations towards conservatism in every model, with Age strongest for Claude (χ² = 286.53). Employment Type is a significant predictor only for Claude (χ² = 15.92, p < 0.001) and is undetectable for ChatGPT and Gemini. Third, within the demographic block, Ethnicity is undetectable for ChatGPT (χ² = 2.94, p = 0.80) and Claude (χ² = 1.08, p = 0.78). For Gemini, the joint LR test for Ethnicity returns χ²(3) = 10.14, p = 0.035. Gender is significant for Claude (χ² = 15.32, p < 0.001) but undetectable for ChatGPT and Gemini. To clarify the interpretive scope of these demographic attribute results, we report the effect sizes, 95% confidence intervals, and minimum detectable effects (MDEs) for the gender and ethnicity contrasts. For ChatGPT, the Female-versus-Male coefficient is β = +0.100 (95% CI [−0.19, +0.39]), and the three ethnicity contrasts have β values in the range from −0.10 to −0.33 with confidence intervals that all cross zero (Black: [−0.53, +0.33]; Hispanic/Latino: [−0.72, +0.07]; Asian: [−0.62, +0.18]). The corresponding individual MDEs at α = 0.05 two-sided and 80% power are OR = 1.51 for Gender and OR ≈ 1.77 for the ethnicity contrasts, and the joint equal-coefficient MDE for the ethnicity block is OR = 1.73. For Gemini, the Female coefficient is β = +0.227 (95% CI [−0.00, +0.46]), and the three individual ethnicity contrasts are negative and significant on a contrast-by-contrast basis (OR = 0.69, 0.63, 0.67); the corresponding individual MDE is OR = 1.59, with a joint equal-coefficient MDE of OR = 1.54.
For Claude, the individual contrast MDEs are tighter (OR = 1.30 for Gender; OR ≈ 1.45 for the ethnicity contrasts, with a joint ethnicity MDE of OR = 1.43), reflecting Claude’s lower coefficient standard errors. Two points follow. First, the null results for ChatGPT (Gender, Ethnicity), Gemini (Gender) and Claude (Ethnicity) rule out moderate-to-large demographic effects of the kind that would alter the modal portfolio recommendation, but they do not exclude smaller effects, particularly for ChatGPT, where the MDEs are widest. Second, all of these statistics are conditional on the structured prompt design adopted here, in which risk tolerance, time horizon, goal type, income, and named life goals are all explicitly specified. The detection threshold our design provides should not be extrapolated to less structured elicitations in which demographic cues may carry more weight.
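
To make the relationship between the reported confidence intervals and the MDEs concrete, the sketch below uses one common normal-approximation formula, MDE = exp((z_{1−α/2} + z_{power}) × SE), with the standard error backed out from the width of a reported 95% interval. This formula is an assumption on our part for illustration; the exact procedure used to obtain the published MDEs is not stated in this section, and the inputs are the rounded figures quoted in the text. The function names are hypothetical.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def se_from_ci(lower, upper, level=0.95):
    """Recover a standard error from a symmetric Wald interval on the log-odds scale."""
    z = STD_NORMAL.inv_cdf(0.5 + level / 2)
    return (upper - lower) / (2 * z)


def mde_odds_ratio(se, alpha=0.05, power=0.80):
    """Approximate minimum detectable odds ratio for a two-sided Wald test."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = STD_NORMAL.inv_cdf(power)
    return math.exp((z_alpha + z_power) * se)


# ChatGPT Female-versus-Male contrast: interval reported on the log-odds scale.
se_chatgpt_gender = se_from_ci(-0.19, 0.39)
# Claude Female-versus-Male contrast: interval reported on the OR scale, so take logs.
se_claude_gender = se_from_ci(math.log(1.20), math.log(1.73))

print(round(mde_odds_ratio(se_chatgpt_gender), 2))
print(round(mde_odds_ratio(se_claude_gender), 2))
```

Under these assumptions the two values come out close to the OR = 1.51 and OR = 1.30 quoted above, which illustrates why the model with the smallest coefficient standard errors also has the tightest detection threshold.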

Table 6 reports the per-model coefficient estimates and odds ratios underlying the joint tests in Table 5. The patterns are largely consistent with those tests but reveal three notable cross-model divergences. Within the life-cycle attributes, all three models shift recommendations towards conservatism as age rises and as dependents enter the household, with the age effect strongest in Claude (OR = 0.16 at age 62) and weakest in Gemini (OR = 0.43 at age 62). Claude additionally assigns less conservative recommendations to female profiles than to otherwise identical male profiles (β = 0.365, OR = 1.44, 95% CI for OR [1.20, 1.73], p < 0.001) and more conservative recommendations to gig or contract workers than to salaried workers (β = −0.441, OR = 0.64, p < 0.001); neither effect is detectable for ChatGPT or Gemini. The individual Wald contrasts in Table 6 also attribute three statistically significant negative ethnicity coefficients to Gemini, with non-White profiles receiving odds ratios in the 0.63–0.69 range relative to White profiles. The 95% confidence intervals for these three Gemini contrasts are [0.50, 0.96] for Black, [0.46, 0.87] for Hispanic/Latino, and [0.49, 0.92] for Asian (relative to White). The effects of the financial attributes are uniformly large and consistent in direction across models but vary materially in magnitude. Income sensitivity is strongest for ChatGPT (OR = 10.00 at $300,000) and weakest for Claude (OR = 2.84), with Gemini in between. The largest divergence is in Goal Type: ChatGPT and Gemini treat both Education-funding and House-deposit goals as warranting substantially more conservative allocations than Retirement (ORs of 0.05–0.18), whereas Claude reverses the sign on Education funding (β = 1.086, OR = 2.95, p < 0.001), assigning more aggressive allocations to Education goals than to Retirement goals.
The three models therefore agree on the direction of income effects but apply qualitatively different semantic mappings of named life goals onto the risk spectrum. It is important to note that Risk Tolerance and Time Horizon dummies are omitted from Table 6 because they exhibit quasi-complete separation in some subsamples where models always recommend Portfolio A (Conservative) to an investor who is described as Conservative in their Risk Tolerance, which inflates point-coefficient standard errors. However, quasi-complete separation does not affect the LR tests in Table 5 and Table 7 [34].

The per-model fits show that all three GAI models draw on the same broad set of predictors but with different point estimates. To formally test whether the three GAI models differ in how they use the client attributes, we pool the 15,000 observations and estimate a proportional-odds logit augmented with two Model dummies (Gemini and Claude, with ChatGPT as the reference) and the full set of Model × X interactions. Standard errors are clustered on the 1000 client profiles using a cluster-robust sandwich estimator. The omnibus LR test that all 36 interaction coefficients are zero is decisively rejected, χ²(36) = 1567.99, p < 0.0001; adding the interactions raises the McFadden pseudo-R² from 0.749 to 0.798.
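
The link between the LR statistic and the change in McFadden pseudo-R² can be checked with a short sketch. It assumes that both pseudo-R² values are computed against the same intercept-only log-likelihood, in which case LR = −2 × ℓ₀ × (R²_full − R²_restricted). The function names are hypothetical, and the implied null log-likelihood is only approximate because the published pseudo-R² values are rounded to three decimals.

```python
import math


def mcfadden_r2(ll_model, ll_null):
    """McFadden pseudo-R-squared: 1 - ll_model / ll_null."""
    return 1.0 - ll_model / ll_null


def lr_statistic(ll_restricted, ll_full):
    """Likelihood-ratio statistic comparing two nested models."""
    return 2.0 * (ll_full - ll_restricted)


def implied_null_loglik(lr, r2_restricted, r2_full):
    """Back out the intercept-only log-likelihood implied by LR and two pseudo-R2 values."""
    return -lr / (2.0 * (r2_full - r2_restricted))


# Figures quoted in the text for the pooled model with and without interactions.
ll_null = implied_null_loglik(1567.99, 0.749, 0.798)
# Most negative log-likelihood an intercept-only three-category model can have
# on 15,000 observations (all three categories equally likely).
lower_bound = -15000 * math.log(3)

print(round(ll_null), round(lower_bound))
```

Under these assumptions the implied intercept-only log-likelihood is roughly −16,000, which sits just above the bound of about −16,479 for a three-category outcome on 15,000 observations, so the reported LR statistic and pseudo-R² values are mutually consistent.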

Table 7 decomposes the omnibus rejection into variable-level, profile-clustered Wald χ² statistics, each testing jointly that a predictor’s Model × X interactions are zero and each Holm-corrected. The variables the three models use differently are Risk Tolerance, Time Horizon, Goal Type, and Annual Income, together with Age. Risk Tolerance, Time Horizon, Goal Type, and Age are significant under both the LR test and the profile-clustered Wald test; Annual Income is significant under the LR test but only marginal under the clustered Wald test (p = 0.076 after correction). The remaining interactions are not significant. The non-significant Model × Gender (Wald χ²(2) = 1.40, p = 0.50) and Model × Ethnicity (Wald χ²(6) = 3.96, p = 0.68) interaction tests do not imply that every demographic contrast is zero within each model. The corresponding equal-coefficient minimum detectable effects at 80% power are OR = 1.85 for the Gender interaction and OR = 2.29 for the Ethnicity interaction, meaning our pooled design can rule out large cross-model differences in demographic responsiveness but not smaller ones. In the per-model results, Claude shows evidence of conditioning on Gender, while Gemini displays borderline evidence for Ethnicity, with consistently negative individual ethnicity contrasts relative to White profiles. The substantive conclusion is that the models differ primarily in how they weight financial attributes, not in how they respond to demographic attributes.

4.4. Economic Magnitude of Effects

To gauge economic magnitude, we compute model-implied recommendation probabilities at a reference scenario chosen to position the predicted recommendation near a category boundary. The anchor scenario is a Male, White, age-45 client with Moderate risk tolerance, a 30-year horizon, a Retirement goal, $300,000 income, Single without dependents, and Salaried (W-2) employment. We then vary one demographic attribute at a time, holding the other anchor attributes fixed, and report the implied probability of an Aggressive recommendation under each GAI model. The anchor scenario is deliberately chosen near a category boundary to maximize the visibility of marginal effects, and therefore differs from the omitted reference category used in the other tables.

Several features of Table 8 are economically noteworthy. First, at the anchor itself, the three GAI models differ markedly in their implied risk appetite: P(Aggressive) is 0.04 for ChatGPT, 0.37 for Gemini, and 0.17 for Claude. Second, age sensitivity again varies: moving the client’s age from 45 to 28 raises Gemini’s P(Aggressive) by about 0.19 and Claude’s by 0.09, but ChatGPT’s by only 0.05; moving age from 45 to 62 reduces Claude’s P(Aggressive) by 0.12, a larger downward than upward shift, showing an asymmetric, retirement-cliff response to age. Third, for Ethnicity, Gemini shows a noticeable reduction in P(Aggressive) for non-White profiles, falling from 0.365 for the White anchor to 0.285 for Black, 0.266 for Hispanic/Latino and 0.278 for Asian profiles. This pattern is consistent with the small but statistically detectable Gemini ethnicity effect documented in Table 5, but the magnitude of these probability shifts should be interpreted in light of the design’s detection threshold: the pooled Model × Ethnicity interaction is not significant (Wald χ²(6) = 3.96, p = 0.68; equal-coefficient MDE OR = 2.29); therefore, the cross-model contrast between Gemini’s ethnicity sensitivity and the null results for ChatGPT and Claude is suggestive rather than statistically resolved.
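
The mapping from an odds ratio to a shift in P(Aggressive) can be illustrated with a short sketch. In a proportional-odds model the odds ratio acts on cumulative odds, and because Aggressive is the top category its probability is itself a cumulative (upper-tail) probability, so the odds ratio can be applied directly to the anchor odds. The function name is hypothetical, and the inputs are the rounded Gemini figures quoted in the text, so small rounding differences from Table 8 are to be expected.

```python
def shift_probability(p_anchor, odds_ratio):
    """Apply an odds ratio to an anchor probability and return the new probability."""
    odds = p_anchor / (1.0 - p_anchor)
    new_odds = odds * odds_ratio
    return new_odds / (1.0 + new_odds)


# Gemini: P(Aggressive) at the White anchor and the three ethnicity odds ratios.
gemini_anchor = 0.365
gemini_ethnicity_or = {"Black": 0.69, "Hispanic/Latino": 0.63, "Asian": 0.67}

for group, odds_ratio in gemini_ethnicity_or.items():
    print(group, round(shift_probability(gemini_anchor, odds_ratio), 3))
```

With these inputs the implied probabilities land within rounding distance of the 0.285, 0.266 and 0.278 reported for the Black, Hispanic/Latino and Asian profiles. The same odds ratio would produce a much smaller absolute shift at an anchor far from a category boundary, which is why the anchor scenario was placed near one.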

4.5. Robustness

We performed four robustness checks. First, the Brant parallel-regression test indicates that the proportional-odds restriction holds well for ChatGPT and Gemini but is partially violated for Claude. We therefore re-estimated Claude’s model under a partial proportional-odds specification and verified that the qualitative pattern of significant predictors in Table 5 is preserved. Second, the per-model fits exhibit quasi-complete separation on the Risk Tolerance dummies in the ChatGPT subsample and, to a lesser extent, the Claude subsample. The separation inflates Wald standard errors for the affected coefficients but does not affect the LR tests reported in Table 5 and Table 7, which compare nested log-likelihoods rather than relying on Wald inference. Third, we re-estimated the per-model fits omitting one variable at a time and confirmed that the pattern of significance reported in Table 5 is robust to leave-one-out perturbations. Fourth, the default-temperature setting under which the main experiment was conducted introduces a stochastic component that could, in principle, drive the substantive findings. To address this issue, we re-ran the experiment for Gemini at temperature zero, the lowest-stochasticity setting for the Gemini API, on the same 1000 client profiles. We note that this sensitivity check cannot be replicated for ChatGPT or Claude, as neither provider exposes a temperature parameter for the frontier model versions audited in this study. Our robustness temperature test for Gemini provides 1000 recommendations that we compare with Gemini’s 5000 default-temperature recommendations in the main sample. The temperature-zero marginal distribution (46.8%/31.2%/22.0%) is indistinguishable from the main Gemini marginal (47.1%/30.8%/22.1%) under both Stuart–Maxwell (χ²(2) = 2.20, p = 0.333) and Wilcoxon (p = 0.465) tests of homogeneity. 
At the profile level, the temperature-zero recommendation matches Gemini’s modal recommendation across the five default-temperature replicates on 97.0% of profiles (Cohen’s linearly weighted κ = 0.964; Cohen’s quadratically weighted κ = 0.976), an agreement marginally higher than the 95.8% mean pairwise agreement among the five default-temperature replicates themselves, indicating that the modal-of-five estimator on default-temperature data tracks the deterministic temperature-zero mode at least as closely as it tracks itself across resamples. The McFadden pseudo-R² is 0.804 under temperature zero against 0.808 in the main sample. The per-predictor LR-test significance pattern is preserved on all attributes except Ethnicity. However, the discrepancy is mechanical rather than substantive; the temperature-zero model shows Ethnicity coefficients that are larger in absolute magnitude than the main-sample coefficients, but the wider confidence intervals from the one-fifth sample size are unable to resolve the effect at conventional thresholds. The direction of the effect is preserved in both specifications; only its statistical detectability is sample-size-dependent. We conclude that the substantive findings for Gemini, including the modest ethnicity effect, reflect the deterministic core of the model’s decision rule rather than artefacts of default-temperature sampling.
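
For readers unfamiliar with the agreement statistic used above, the following is a minimal sketch of Cohen's weighted κ for two ordinal rating vectors, with linear or quadratic disagreement weights. The coding 0 = Conservative, 1 = Balanced, 2 = Aggressive and the toy data are assumptions for illustration only; they are not the study's recommendations.

```python
from collections import Counter


def weighted_kappa(ratings_a, ratings_b, n_categories, weighting="linear"):
    """Cohen's weighted kappa for two equal-length lists of ordinal codes 0..n_categories-1."""
    n = len(ratings_a)
    if n == 0 or n != len(ratings_b):
        raise ValueError("rating lists must be non-empty and of equal length")
    power = 1 if weighting == "linear" else 2
    span = n_categories - 1

    def weight(i, j):
        # Disagreement weight: 0 on the diagonal, 1 for the most distant categories.
        return (abs(i - j) / span) ** power

    # Observed weighted disagreement across paired ratings.
    observed = sum(weight(a, b) for a, b in zip(ratings_a, ratings_b)) / n
    # Expected weighted disagreement if the two sets of ratings were independent.
    marg_a, marg_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(
        weight(i, j) * (marg_a[i] / n) * (marg_b[j] / n)
        for i in range(n_categories)
        for j in range(n_categories)
    )
    # Undefined (division by zero) if both raters use one and the same category throughout.
    return 1.0 - observed / expected


# Toy data: temperature-zero picks versus modal default-temperature picks.
temp_zero = [0, 1, 2, 0, 1, 2]
modal_default = [0, 1, 2, 0, 1, 1]
print(weighted_kappa(temp_zero, modal_default, 3, "linear"))
print(weighted_kappa(temp_zero, modal_default, 3, "quadratic"))
```

Quadratic weights penalize a one-step disagreement (for example, Balanced versus Aggressive) less heavily relative to a two-step disagreement than linear weights do, which is why the quadratically weighted κ is typically the higher of the two when most disagreements are between adjacent portfolios.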

5. Discussion

The empirical question posed at the outset of this study was whether contemporary generative AI (GAI) models, when prompted to perform the function of a goals-based investment advisor, generate recommendations that respond appropriately to financially relevant attributes whilst remaining invariant to demographic attributes that goals-based investing treats as conditionally immaterial. Five substantive conclusions emerge from our analysis. First, all three GAI models are highly internally consistent on identical attributes: 83% to 92% of profiles receive an identical recommendation across the five replicates, and the one-way intraclass correlation exceeds 0.91 for every model, with Gemini the most consistent (ICC = 0.967) and Claude the least (ICC = 0.918). Second, all three models ground their recommendations overwhelmingly in the legitimate financial attributes (Risk Tolerance, Time Horizon, Goal Type and Annual Income), with McFadden pseudo-R² values of 0.879 (ChatGPT), 0.808 (Gemini) and 0.700 (Claude). The relatively lower fit for Claude indicates that a greater share of its variation is unaccounted for by the supplied client attributes, a result that maps directly onto its lower within-model consistency. Third, within the life-cycle attributes, Age and Marital Status are significant in every model, and the expanded sample additionally reveals that Claude, but not ChatGPT or Gemini, conditions significantly on Gender and Employment Type. Ethnicity is undetectable for ChatGPT and Claude, but is a small significant predictor for Gemini, under the structured prompt design adopted here. The pooled cross-model interaction for Ethnicity is not statistically significant; therefore, the Gemini-specific pattern is descriptive rather than statistically resolved against the other two models. 
Fourth, the three GAI models are not interchangeable, with the divergence concentrated in how the models translate Risk Tolerance, Time Horizon, Goal Type, Income and Age into a recommendation. Claude is notably divergent, producing fewer Conservative and more Balanced recommendations than the other two GAI models, treating Education-funding goals as warranting more aggressive allocations rather than less, and exhibiting a steeper drop in risk at age 62 than the corresponding rise at age 28. Fifth, although the cross-model differences in demographic use are, for most predictors, statistically indistinguishable, the absolute magnitudes of demographic sensitivity vary substantially at economically realistic anchor scenarios, with consequential implications for downstream allocation decisions.

The high within-model consistency, not previously documented for GAI investment advice, has two implications. The first is methodological: it confirms that the cross-model differences we document are systematic decision-rule differences rather than stochastic answer variation, and that the lower McFadden pseudo-R² for Claude reflects a model that conditions on a marginally broader set of attributes rather than one that produces noisier outputs. The second is substantive: replicate-level stability of this order (83% to 92% of profiles receiving an identical recommendation across all five replicates) means that an investor who consults a particular GAI model on a given profile will, in the large majority of cases, receive the same recommendation if the same query is repeated. This is a non-trivial property of automated advice and stands in contrast to the variability documented in human-advisor research, where the same advisor often recommends materially different products to identical clients across encounters.

The dominance of financial attributes in driving recommendations is both encouraging and analytically informative. From the perspective of goals-based investing theory [16,17], the attributes that should rationally govern portfolio recommendation (risk tolerance, time horizon, goal type and income) are precisely those on which the GAI models converge. The McFadden pseudo-R² values from 0.700 to 0.879 are exceptionally high for a behavioral prediction task and indicate that the models behave largely as transparent functions of their inputs rather than as opaque pattern-matchers drawing on latent training-corpus regularities. This aligns with, and extends, the aggregate-level finding of Oehler and Horn [9] that ChatGPT advice often tracks academic benchmarks more closely than that of established robo-advisors. Whereas Oehler and Horn establish alignment at the level of average recommendations, our attribute-level decomposition demonstrates that the alignment is not coincidental but reflects appropriate marginal weighting of the relevant financial attributes. The result also strengthens the more cautious findings of Kim [12] and Ko and Lee [13], who establish that GAI-constructed portfolios exhibit defensible diversification properties, by extending the evidence into the conjoint-experimental domain where the implicit attribute weights, and not merely the output portfolios, are observable.

The mixed Ethnicity and Gender results warrant careful interpretation. ChatGPT and Claude show no statistically detectable use of Ethnicity in their recommendations under this design: the joint Ethnicity tests return χ² = 2.94, p = 0.80 and χ² = 1.08, p = 0.78, respectively, and the corresponding equal-coefficient minimum detectable effects (OR = 1.73 for ChatGPT and OR = 1.43 for Claude) rule out moderate-to-large ethnicity effects but cannot exclude smaller ones. Gemini’s joint Ethnicity test crosses the conventional Holm-corrected significance threshold (p = 0.035) with non-White contrasts in the 0.63–0.69 OR range, indicating that Gemini recommends slightly more conservative portfolios to non-White investors than to otherwise identical White investors. For Gender, Claude is the only model in which the effect is statistically significant (OR = 1.44, 95% CI [1.20, 1.73]), with female profiles receiving less conservative recommendations than otherwise identical male profiles. The contrast between our results and previous studies documenting clear gender or ethnic disparities in lending and hiring contexts [7,8,10,11] should therefore be read narrowly: under this specific structured prompt design, we find no robust evidence of ethnic disparate treatment in two of three models and no robust evidence of gender disparate treatment in two of three models. We do not claim that contemporary GAI is universally free of demographic bias; rather, we claim that within an audit design that fully specifies financial profile, the residual role of demographic cues is small and model-specific. Several non-exclusive explanations may account for the limited and model-dependent demographic effects observed here. One possibility is that investment-advisory discourse is less gender- and race-coded than lending or labor-market discourse. A second possibility is that alignment procedures have been tuned to suppress demographic cues in regulated financial contexts [7].
A third possibility, which our design cannot rule out, is that the structured prompt itself crowds out demographic cues by supplying risk tolerance, time horizon, income and goal type explicitly; demographic cues may carry larger weight when the financial signal is more ambiguous. A fourth possibility, suggested by the model-specific nature of the effects, is that the alignment regimes of the three providers have diverged in this domain: some appear to have neutralized gender and ethnic cues more thoroughly than others, and the persistence of small effects in Gemini for Ethnicity and in Claude for Gender may reflect lower demographic-neutralization effort in their respective alignment processes.

The results for the life-cycle attributes are more nuanced and reveal a model-specific pattern. Age and Marital Status shift recommendations towards conservatism in every model. From a goals-based investing standpoint, neither attribute is, strictly, recommendation-relevant once time horizon, risk tolerance and goal type have been specified; the time-horizon attribute should already capture the life-cycle considerations that age might otherwise proxy, and shortfall aversion on behalf of dependents is, in principle, a function of the explicit goal rather than of marital status. The persistence of an age effect over and above the explicit time-horizon dummies therefore suggests that the GAI models are importing additional life-cycle assumptions (around retirement proximity, human-capital depletion or longevity risk) that lie outside the goals-based framework’s formal architecture. Such importation is economically defensible: an older client with a thirty-year horizon nonetheless faces a shorter expected remaining life than a younger client with the same horizon, and conservatism may rationally follow. The treatment of age varies substantively across the three models in a way that goes beyond a simple difference in overall age sensitivity. ChatGPT’s de-risking is approximately linear in age, Gemini’s age response saturates after middle age (the model treats a 45- and a 62-year-old similarly but treats both very differently from a 28-year-old), and Claude exhibits a retirement cliff (the model treats a 28- and a 45-year-old similarly but applies sharp additional de-risking at 62). Three non-exclusive mechanisms could plausibly account for these distinct age-schedule shapes.
First, the training corpora on which contemporary GAI models are trained likely contain a substantial body of retirement-proximity advisory text emphasizing capital preservation, drawdown sequencing and longevity risk, against a comparatively thinner body of text exhorting young investors to take more risk than they would naturally choose; the conservative-prescription corpus is simply larger than the aggressive-prescription corpus. This first mechanism predicts a retirement-cliff shape, which Claude exhibits. Second, the Reinforcement Learning from Human Feedback layer may reinforce this asymmetry, as labelers are likely to penalize responses that recommend Aggressive portfolios to older clients more readily than they would penalize responses that recommend Conservative portfolios to younger ones: the downside of an overaggressive recommendation to a 62-year-old client is salient and culturally legible in a way that the opportunity cost of a conservative recommendation to a 28-year-old client is not. Third, the underlying age schedule that each model approximates may itself differ across providers: Gemini appears to encode a model in which advisory caution rises rapidly through early adulthood and then plateaus, while Claude encodes a model in which caution rises gradually and then accelerates at retirement age, and ChatGPT encodes a roughly linear age schedule. Our design cannot distinguish among these mechanisms, but the cross-model variation in age-schedule shape is itself a substantive finding: the three frontier models do not merely differ in how strongly they de-risk older clients; they differ in the functional form of the de-risking schedule, with consequential implications for the life-cycle allocation a client receives at any given age.

The economic and statistical significance of these age schedules merits explicit assessment. For Claude, the asymmetric retirement-cliff response is statistically significant, and economically substantial: a 12-percentage-point reduction in P(Aggressive) at age 62 relative to age 45, compounded across the typical horizon of late-career portfolio decisions and across the millions of retail interactions an embedded GAI advisory platform conducts annually, represents a structurally important feature of Claude’s life-cycle advisory logic. For Gemini, the saturating age response is also statistically significant and economically substantial in the opposite direction: a 19-percentage-point increase in P(Aggressive) when the client is 28 rather than 45, with only a further 2-percentage-point reduction at age 62, means that Gemini’s advisory logic concentrates its age-related differentiation in early adulthood rather than near retirement. These two model-specific schedules have qualitatively different implications for the cumulative wealth accumulation of clients across the life-cycle: a 62-year-old client receiving Claude’s advice is far more likely to be moved into a conservative portfolio than the same client receiving Gemini’s advice, while a 28-year-old client receiving Gemini’s advice is far more likely to be moved into an aggressive portfolio than the same client receiving Claude’s. The differences are sufficiently large to warrant explicit acknowledgement in any compliance assessment of either platform as a GAI investment advisor.

Perhaps the most consequential finding for practical deployment is cross-model heterogeneity. The divergence is concentrated in financial attributes that are relevant for investment recommendations: Risk Tolerance, Time Horizon, Goal Type and Income. The unconditional marginal distributions illustrate the magnitude of this heterogeneity directly: Claude allocates only 27.8% of its recommendations to the Conservative portfolio against 48.7% for ChatGPT and 47.1% for Gemini, while Claude’s Balanced share of 49.8% is more than fifteen percentage points above that of either alternative. Claude’s reversal of the Education-funding coefficient, treating the goal as warranting more aggressive allocation rather than less, further indicates that the models do not share a common semantic mapping of named life goals onto the risk spectrum. The cross-model heterogeneity extends to the demographic attributes as well: Gemini’s modest, design-specific ethnicity effect is not present in the other two models, and Claude’s significant gender effect is not present in ChatGPT or Gemini. The pooled cross-model interaction tests do not statistically resolve these differences, but the descriptive pattern means that an investor selecting between the three platforms is not merely choosing a particular financial attribute-weighting but also a particular influence of demographic attributes, with implications for the ethical assessment of each platform that lie outside the conventional disparate-impact frame. This raises a form of platform risk that is, to our knowledge, largely undocumented in the existing GAI-in-finance literature. Investors who delegate portfolio recommendations to a particular GAI model are, in effect, selecting a particular implicit advisory philosophy whose attribute-weighting profile may not be evident even after extended interaction.
The risk is qualitatively distinct from the model-version risk noted by Schneider and Yilmaz [18], who report performance variation across model releases within a single provider; the heterogeneity we document is contemporaneous, persists at the frontier of each provider’s offering and arises in the attribute weights themselves rather than in downstream realized returns.

These findings carry implications for several adjacent literatures and policy domains. For the literature on robo-advisors [2,14,15], our results indicate that the migration from deterministic recommendation engines to GAI-enabled conversational interfaces is unlikely, on the present evidence, to reintroduce the ethnic biases documented in the human-advisor literature [4,5,6], although they leave open the possibility that gender may re-enter the advisory pipeline through model-specific decision rules whose interpretive status lies between defensible actuarial inference and folk-theoretic stereotype. The contrast with Mullainathan et al. [6] is especially striking on ethnicity: whereas human advisors in their audit study systematically steered clients into higher-cost actively managed products, with effects varying by client demographics, the GAI models we audit display either no detectable ethnic patterning (ChatGPT, Claude) or only a small design-specific ethnic effect (Gemini), despite recommending portfolios constructed from the same broad asset classes. We caveat this contrast with the observation that our structured prompt design supplies the financial inputs explicitly, which may attenuate the demographic role that less structured elicitations would reveal. For fintech regulation, this is a complex finding because conventional disparate-impact frameworks are poorly equipped to govern a setting in which the salient differentials are not only between distinct clients within a single platform but also between identically situated clients across platforms. For robo-advisor governance, our results suggest that audit-style methodologies of the kind developed by Lippens [11] and Motoki et al. [8] should be incorporated into routine compliance monitoring of GAI-enabled advisory services, not merely as a one-off vendor assessment but as an ongoing surveillance instrument that tracks attribute weightings across model versions and across providers over time.
For the broader scholarly conversation on AI accountability in financial services, the result that frontier GAI models differ materially in their handling of investment recommendation-relevant attributes whilst largely converging on the conditional irrelevance of ethnicity suggests that the dominant fairness narratives may be insufficient as a description of where the consequential algorithmic variation actually resides.

Several limitations of the present study warrant explicit acknowledgement. First, the audit captures a single snapshot of three model versions at a fixed point in time. Generative AI models are updated continuously and the alignment procedures that govern their behavior are subject to change at the discretion of their developers; the patterns we document may evolve, and replication across model versions and time periods is therefore essential before any conclusion can be regarded as a general property of GAI-enabled advice. Second, our prompts are presented in English and the client names that signal gender and ethnicity are drawn from a United States cultural register; the absence of detectable ethnic disparate treatment in our experiment cannot be generalized to non-Anglo settings without further audit. Third, real retail investors interact with GAI advisors through extended natural-language conversations in which the recommendation-relevant attributes are typically partial, sequentially disclosed, qualitatively described rather than quantitatively specified (a client might describe themselves as “a bit risk-averse” rather than as “Moderate risk tolerance”), and embedded within longer narrative accounts of their financial circumstances. Our results therefore characterize GAI behavior in a high-information, structured-elicitation regime, and should not be generalized without qualification to lower-information conversational regimes in which the latent attribute weights of the models may differ materially. As we discuss below, extension to less heavily anchored prompts is the most important next step suggested by the present design. Fourth, the three-portfolio choice set is a coarse simplification of the continuous allocation space in which real portfolio recommendations are situated, and effects that are subthreshold under our discrete ordinal measure may be detectable under continuous-allocation metrics. 
Fifth, although names are a well-established device for signaling implicit gender and ethnicity in audit research [11], their information content as cues to demographic identity is plausibly weaker than that of explicit labels, and a stronger experimental manipulation might reveal effects that ours does not. Sixth, the null demographic results we report (no detectable gender effect for ChatGPT or Gemini; no detectable ethnicity effect for ChatGPT or Claude) hold against minimum detectable effects in the range of OR = 1.30 to 1.77 at α = 0.05 two-sided and 80% power. These thresholds rule out moderate-to-large demographic effects but cannot exclude smaller ones, and they are conditional on the structured prompt design adopted here. Demographic effects too small for our design to detect, or effects activated under less structured elicitations in which the financial signal is partial or ambiguous, would not be visible in our results. Seventh, the null Gender and Ethnicity results we document for individual models are consistent both with the genuine absence of bias and with the presence of explicit safeguards in the alignment layer; the present audit cannot distinguish these mechanisms.

These limitations provide opportunities for future research. Longitudinal audit designs that track the attribute-weighting profiles of frontier GAI models across model versions and over time would establish whether the patterns we document are durable features of contemporary GAI advice or transient artefacts of alignment regimes.

Adversarial and ecologically graded audit designs are the most direct extension. Three variants of the present design would together characterize the boundary of generalizability identified by the structured-prompt limitation. Attribute-omission designs could systematically withhold one of the four financial attributes at a time, to test whether the demographic attributes acquire greater weight when the legitimate financial attributes are weakened or ambiguous. Narrative-elicitation designs could replace the structured-field prompt with an unstructured client biography that conveys the same financial information in natural language, to assess whether the attribute weights we document are stable across structured and conversational presentations of equivalent information. Continuous-allocation designs would replace the three-portfolio choice set with a request for a numeric equity/bond allocation, to detect demographic effects that are subthreshold under the discrete ordinal measure but resolvable on a continuous scale. Each of these design variants would loosen one of the specific structural priors we adopt for identification, and the comparison of attribute weights across designs would establish how much of our finding reflects the goals-based architecture of the prompt versus the underlying decision rules of the models.

Multilingual and cross-jurisdictional extensions would establish whether the demographic patterns we observe generalize beyond the English-language, United States cultural setting in which our audit was conducted. Welfare-oriented extensions that map the cross-model heterogeneity we document into long-run client outcomes, using, for example, the diversified-portfolio benchmarks of Kim [12] and Ko and Lee [13], would translate the abstract platform-risk finding into the metric that ultimately matters for investors.

The audit methodology developed here also extends naturally beyond goals-based portfolio recommendations to adjacent advisory domains, including tax-aware investing, debt management and intergenerational wealth transfer, where the theoretical separation between attribute-relevant and attribute-irrelevant client characteristics is similarly well defined. Finally, the comparison of frontier closed-source models with open-source alternatives, whose alignment procedures are at least partially inspectable, would shed light on the extent to which the patterns we document reflect inherent properties of the underlying language modeling versus deliberate design choices in the alignment layer.

6. Conclusions

This study has audited three frontier GAI models in the goals-based investment advisory setting, using a full-profile conjoint experiment. The audit yields four principal contributions to the rapidly developing literature on GAI in finance. First, by establishing that all three frontier GAI models exhibit high within-model consistency on identical inputs, with an intraclass correlation of at least 0.918 (Table 2), the study confirms that automated advice from frontier models is stable rather than stochastic and provides a methodological benchmark against which future audits can be calibrated. Second, by decomposing model recommendations into attribute-level effects, the study advances the GAI-in-investing literature beyond the comparison of aggregate performance towards a structural account of which client attributes drive GAI recommendations and which do not. The result is partially reassuring on the dimension that has attracted the greatest regulatory attention: contemporary frontier GAI models, when performing goals-based investment advisory, weight the legitimate financial inputs heavily, and ChatGPT and Claude exhibit no statistically detectable disparate treatment on Ethnicity under the structured prompt design adopted here. Gemini, however, exhibits a small but statistically significant Ethnicity effect, with non-White profiles receiving more conservative recommendations than otherwise identical White profiles. Similarly, while ChatGPT and Gemini show no use of Gender, Claude shows a significant Gender effect, with women receiving less conservative recommendations than otherwise identical men. Third, by importing the audit methodology of the algorithmic-bias literature into a setting in which the separation between recommendation-relevant and recommendation-irrelevant attributes is unusually clean, the study illustrates the methodological benefits of combining the audit-experimental tradition with the prescriptive framework of goals-based investing.

Fourth, and perhaps most importantly for practice, the study documents substantial cross-model heterogeneity in how frontier GAI models translate the same financial attributes into the same standardized portfolios, with unconditional conservative allocation shares spanning a twenty-one-percentage-point range across the three models and with model-specific demographic sensitivities of comparable economic magnitude. This previously underappreciated form of platform risk is qualitatively distinct both from the within-platform version risk and from the within-platform demographic bias that have dominated the conversation to date.

For investors, the findings indicate that the choice of GAI advisory platform is itself a consequential portfolio decision whose effects are likely to compound over the investment life-cycle. For platforms and regulators, the findings indicate that conventional disparate-impact concerns remain relevant. More broadly, the findings suggest that bias has neither cleanly disappeared nor simply reappeared. The use of Gender by Claude and of Ethnicity by Gemini further shows that demographic sensitivity may appear in model-specific ways, reinforcing the need to evaluate both within-model demographic effects and cross-model differences in advisory logic. The absence of detectable ethnic disparate treatment in two of three models, and of gender disparate treatment in two of three, is conditional on a structured prompt design that fully specifies the client's financial profile. Audits employing less structured or attribute-incomplete prompts may reveal demographic patterns that our design lacks the power to detect. For the broader literature on AI accountability in financial services, the findings suggest that the migration of bias from human to algorithmic advice is mixed and model-specific. As GAI becomes embedded ever more deeply in the advisory infrastructure that retail investors rely upon, audit-based monitoring of attribute-weighting profiles across platforms and over time is, in our view, no longer an optional complement to existing governance practices but an indispensable component of them.

Author Contributions

Conceptualization, N.A. and T.H.; methodology, N.A. and T.H.; formal analysis, T.H.; data curation, T.H.; writing—original draft preparation, N.A. and T.H.; writing—review and editing, N.A. and T.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original data presented in the study are openly available in OSF at https://osf.io/c769d/ (accessed on 26 May 2026).

Acknowledgments

During the preparation of this manuscript, the author(s) used GPT 5.5, Gemini 3.1 Pro, and Claude Opus 4.7 for the purposes of data collection. The GAI tools were also used for editing the manuscript for clarity before submission. The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

AI	Artificial Intelligence
API	Application Programming Interface
GAI	Generative Artificial Intelligence
GPT	Generative Pre-trained Transformer
LR	Likelihood Ratio (test)
OLS	Ordinary Least Squares
OR	Odds Ratio

Appendix A

Example of the full prompt sent through the API. The portfolio attribute levels were designed to represent historically plausible conservative, balanced, and growth-oriented stock-bond allocations. Accordingly, the three portfolios were calibrated to provide a monotonic risk-return structure, with the annualized return, volatility and maximum historical loss based on the patterns of benchmark global equity and bond indices (MSCI World Index and Bloomberg Global Aggregate Index).

** Role **
You are a financial advisor providing investment guidance to clients.
**Client profile**
Your client is Michael Anderson, who is 28 years old and based in the United States. He is self-employed and earns US$300,000 per year. He is single with no dependents. He has indicated that his risk tolerance is aggressive, and is investing for a house deposit with a 5-year time horizon. Michael Anderson has US$200,000 to invest as a lump sum and is asking you to recommend one of three portfolios.
**Investment options**

The three portfolios available are standard offerings from a reputable, regulated provider. They differ in their equity-bond allocation as shown below; all other characteristics (fees of 50 basis points, globally diversified passive index funds) are identical across the three.

|Attribute|Portfolio A|Portfolio B|Portfolio C|
|---|---|---|---|
|Equity allocation|30% equity/70% bonds|60% equity/40% bonds|90% equity/10%bonds|
|Expected annualized return|5.5%|7.0%|9.0%|
|Annualized volatility|6%|11%|16%|
|Maximum historical loss (peak-to-trough)|16%|34%|50%|

**Your task**
Recommend exactly one portfolio for Michael. Begin your response with the words “Portfolio A.”, “Portfolio B.” or “Portfolio C.” on its own line to identify your recommendation. Respond only in the format requested. Do not refuse to provide a recommendation, do not suggest to the client to seek other advice, and do not state that all three options are equivalent.
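
The prompt above follows a fixed template, so a small sketch can show how a client paragraph might be assembled from a profile and how the leading "Portfolio A/B/C" token might be extracted from a reply. This is an illustration only, not the authors' code: the `Profile` class, its field names, and both helper functions are hypothetical, and the wording for attribute levels other than those in the example prompt is not given in this appendix.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Profile:
    """Hypothetical container for one client profile (field names are assumptions)."""
    name: str
    pronoun: str         # "He" or "She"
    possessive: str      # "his" or "her"
    age: int
    employment: str      # e.g. "self-employed"
    income_usd: int
    household: str       # e.g. "single with no dependents"
    risk_tolerance: str  # e.g. "aggressive"
    goal: str            # e.g. "a house deposit"
    horizon_years: int


def client_paragraph(p: Profile) -> str:
    """Fill the client-profile paragraph using the sentence frame of the example prompt."""
    return (
        f"Your client is {p.name}, who is {p.age} years old and based in the United States. "
        f"{p.pronoun} is {p.employment} and earns US${p.income_usd:,} per year. "
        f"{p.pronoun} is {p.household}. "
        f"{p.pronoun} has indicated that {p.possessive} risk tolerance is {p.risk_tolerance}, "
        f"and is investing for {p.goal} with a {p.horizon_years}-year time horizon. "
        f"{p.name} has US$200,000 to invest as a lump sum and is asking you to "
        f"recommend one of three portfolios."
    )


# The task asks the model to begin its reply with "Portfolio A.", "Portfolio B." or "Portfolio C."
_CHOICE = re.compile(r"^\s*Portfolio ([ABC])\b")


def parse_recommendation(response: str) -> Optional[str]:
    """Return "A", "B" or "C" if the reply starts in the requested format, else None."""
    match = _CHOICE.match(response)
    return match.group(1) if match else None


michael = Profile(
    name="Michael Anderson", pronoun="He", possessive="his", age=28,
    employment="self-employed", income_usd=300_000,
    household="single with no dependents", risk_tolerance="aggressive",
    goal="a house deposit", horizon_years=5,
)
paragraph = client_paragraph(michael)
```

Returning `None` for replies that do not start in the requested format leaves the handling of malformed responses (retry, discard, or manual coding) to the caller; this appendix does not say how such cases were treated.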

References

1. Blankespoor, E.; Croom, J.; Grant, S.M. Generative AI and investor processing of financial information. SSRN 2026.
2. D’Acunto, F.; Prabhala, N.; Rossi, A.G. The promises and pitfalls of robo-advising. Rev. Financ. Stud. 2019, 32, 1983–2020.
3. Gaspar, R.M.; Oliveira, M. Robo Advising and Investor Profiling. FinTech 2024, 3, 102–115.
4. Egan, M.; Matvos, G.; Seru, A. The market for financial adviser misconduct. J. Political Econ. 2019, 127, 233–295.
5. Linnainmaa, J.T.; Melzer, B.T.; Previtero, A. The misguided beliefs of financial advisors. J. Financ. 2021, 76, 587–621.
6. Mullainathan, S.; Noeth, M.; Schoar, A. The Market for Financial Advice: An Audit Study; NBER Working Paper No. w17929; NBER: Cambridge, MA, USA, 2012.
7. Gonzalez Barman, K.; Lohse, S.; de Regt, H.W. Reinforcement learning from human feedback in LLMs: Whose culture, whose values, whose perspectives? Philos. Technol. 2025, 38, 35.
8. Motoki, F.; Pinho Neto, V.; Rodrigues, V. More human than human: Measuring ChatGPT political bias. Public Choice 2024, 198, 3–23.
9. Oehler, A.; Horn, M. Does ChatGPT provide better advice than robo-advisors? Financ. Res. Lett. 2024, 60, 104898.
10. Bowen, D.E., III; Stein, L.C.; Price, S.M.; Yang, K. Measuring and mitigating racial disparities in LLMs: Evidence from a mortgage underwriting experiment. SSRN 2024.
11. Lippens, L. Computer says “no”: Exploring systemic bias in ChatGPT using an audit approach. Comput. Hum. Behav. Artif. Hum. 2024, 2, 100054.
12. Kim, J.H. What if ChatGPT were a quant asset manager? Financ. Res. Lett. 2023, 58, 104580.
13. Ko, H.; Lee, J. Can ChatGPT improve investment decisions? From a portfolio management perspective. Financ. Res. Lett. 2024, 64, 105433.
14. Beketov, M.; Lehmann, K.; Wittke, M. Robo Advisors: Quantitative methods inside the robots. J. Asset Manag. 2018, 19, 363–370.
15. Rossi, A.G.; Utkus, S.P. Who benefits from robo-advising? Evidence from machine learning. SSRN 2021.
16. Brunel, J.L.P. Goals-Based Wealth Management: An Integrated and Practical Approach to Changing the Structure of Wealth Advisory Practices; Wiley: Hoboken, NJ, USA, 2015.
17. Chhabra, A.B. Beyond Markowitz: A comprehensive wealth allocation framework for individual investors. J. Wealth Manag. 2005, 7, 8–34.
18. Schneider, C.J.; Yilmaz, Y. Stock portfolio selection based on risk appetite: Evidence from ChatGPT. Financ. Res. Lett. 2025, 82, 107517.
19. Pelster, M.; Val, J. Can ChatGPT assist in picking stocks? Financ. Res. Lett. 2024, 59, 104786.
20. Luo, J.; Cao, Q.; Zhang, S.; Gu, D. Generative AI usage among investor types: The role of personality and perceptions. Financ. Res. Lett. 2025, 82, 107604.
21. Schlosky, M.T.T.; Raskie, S. ChatGPT as a financial advisor: A re-examination. J. Risk Financ. Manag. 2025, 18, 664.
22. Resnik, P. Large language models are biased because they are large language models. Comput. Linguist. 2025, 51, 885–906.
23. Bateman, H.; Eckert, C.; Geweke, J.; Louviere, J.; Thorp, S.; Satchell, S. Financial competence and expectations formation: Evidence from Australia. Econ. Rec. 2012, 88, 39–63.
24. Louviere, J.J.; Hensher, D.A.; Swait, J.D. Stated Choice Methods: Analysis and Applications; Cambridge University Press: Cambridge, UK, 2000.
25. Fleiss, J.L. Measuring nominal scale agreement among many raters. Psychol. Bull. 1971, 76, 378–382.
26. Conover, W.J. Practical Nonparametric Statistics, 3rd ed.; Wiley: New York, NY, USA, 1999.
27. Cohen, J. Weighted kappa: Nominal scale agreement provision for scaled disagreement or partial credit. Psychol. Bull. 1968, 70, 213–220.
28. Landis, J.R.; Koch, G.G. The measurement of observer agreement for categorical data. Biometrics 1977, 33, 159–174.
29. Stuart, A. A test for homogeneity of the marginal distributions in a two-way classification. Biometrika 1955, 42, 412–416.
30. Maxwell, A.E. Comparing the classification of subjects by two independent judges. Br. J. Psychiatry 1970, 116, 651–655.
31. Holm, S. A simple sequentially rejective multiple test procedure. Scand. J. Stat. 1979, 6, 65–70.
32. McCullagh, P. Regression models for ordinal data. J. R. Stat. Soc. Ser. B 1980, 42, 109–142.
33. Agresti, A. Analysis of Ordinal Categorical Data, 2nd ed.; Wiley: Hoboken, NJ, USA, 2010.
34. Albert, A.; Anderson, J.A. On the existence of maximum likelihood estimates in logistic regression models. Biometrika 1984, 71, 1–10.
35. McFadden, D. Conditional logit analysis of qualitative choice behaviour. In Frontiers in Econometrics; Zarembka, P., Ed.; Academic Press: New York, NY, USA, 1974; pp. 105–142.
36. Brant, R. Assessing proportionality in the proportional odds model for ordinal logistic regression. Biometrics 1990, 46, 1171–1178.
37. Peterson, B.; Harrell, F.E. Partial proportional odds models for ordinal response variables. J. R. Stat. Soc. Ser. C 1990, 39, 205–217.

Figure 1. Pairwise contingency heatmaps of the models’ modal recommendations. Diagonal cells (shaded dark) show agreement; off-diagonal cells (bolded numbers) show the asymmetric disagreement pattern. Darker shading indicates greater agreement.

Table 1. Attributes and levels in the experimental design.

Attribute	Levels
Age	28, 45, 62
Stated Risk Tolerance	Conservative, Moderate, Aggressive
Time Horizon	5 years, 15 years, 30 years
Goal Type	Retirement funding, House deposit, Education funding
Annual Income	US$50,000, US$120,000, US$300,000
Gender	Male, Female (implied by name)
Ethnicity	White, Black, Asian, Hispanic/Latino (implied by name)
Marital and Dependent Status	Single without dependents, Married with dependents, Divorced with dependents
Employment Type	Salaried (W-2 equivalent), Self-employed, Gig or contract

Note. The names of the clients were used to implicitly signal gender and ethnicity. Michael Anderson (White Male), Sarah Anderson (White Female), Jamal Washington (Black Male), Keisha Washington (Black Female), Yichen Wang (Asian Male), Mei Wang (Asian Female), Carlos Rodriguez (Hispanic/Latino Male), Sofia Rodriguez (Hispanic/Latino Female).
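
To make the size of the design space in Table 1 concrete, the sketch below enumerates the full factorial of the nine attributes and attaches the name that signals gender and ethnicity. The dictionary keys and the `full_factorial` helper are hypothetical names introduced for illustration. The later tables report N = 5000 observations per model with five replicates per profile, which suggests that only a subset of these combinations was audited; how that subset was drawn is not stated in this section and is not modeled here.

```python
from itertools import product

# Attribute levels copied from Table 1 (key names are illustrative assumptions).
LEVELS = {
    "age": [28, 45, 62],
    "risk_tolerance": ["Conservative", "Moderate", "Aggressive"],
    "time_horizon_years": [5, 15, 30],
    "goal": ["Retirement funding", "House deposit", "Education funding"],
    "income_usd": [50_000, 120_000, 300_000],
    "gender": ["Male", "Female"],
    "ethnicity": ["White", "Black", "Asian", "Hispanic/Latino"],
    "household": [
        "Single without dependents",
        "Married with dependents",
        "Divorced with dependents",
    ],
    "employment": ["Salaried", "Self-employed", "Gig or contract"],
}

# Names used to signal gender and ethnicity, from the note to Table 1.
NAMES = {
    ("White", "Male"): "Michael Anderson",
    ("White", "Female"): "Sarah Anderson",
    ("Black", "Male"): "Jamal Washington",
    ("Black", "Female"): "Keisha Washington",
    ("Asian", "Male"): "Yichen Wang",
    ("Asian", "Female"): "Mei Wang",
    ("Hispanic/Latino", "Male"): "Carlos Rodriguez",
    ("Hispanic/Latino", "Female"): "Sofia Rodriguez",
}


def full_factorial(levels):
    """Yield every combination of attribute levels as a dict, with the signalling name added."""
    keys = list(levels)
    for combo in product(*levels.values()):
        row = dict(zip(keys, combo))
        row["name"] = NAMES[(row["ethnicity"], row["gender"])]
        yield row


profiles = list(full_factorial(LEVELS))
n_profiles = len(profiles)  # 3**7 * 2 * 4 = 17,496 possible profiles
```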

Table 2. Within-model consistency by model.

Model	Consistency	Fleiss’s κ	ICC(1)
ChatGPT	89.2	0.919	0.958
Gemini	91.5	0.934	0.967
Claude	83.0	0.869	0.918

Note. Consistency is the percentage of profiles whose five replicates were all identical. ICC(1) is the one-way-random-effects intraclass correlation (between-profile share of variance).
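
The two replicate-based measures defined in the note to Table 2 can be sketched as follows. The code assumes recommendations are coded 1 (conservative), 2 (balanced), 3 (aggressive) and that every profile has the same number of replicates; the function names and the toy data are hypothetical, and ICC(1) uses the textbook one-way ANOVA estimator, which may differ in detail from the software the authors used.

```python
def consistency_pct(ratings):
    """Percentage of profiles whose replicates are all identical."""
    identical = sum(1 for reps in ratings if len(set(reps)) == 1)
    return 100.0 * identical / len(ratings)


def icc1(ratings):
    """One-way random-effects ICC(1) from between- and within-profile mean squares."""
    n = len(ratings)     # number of profiles
    k = len(ratings[0])  # replicates per profile (assumed equal for all profiles)
    grand_mean = sum(sum(reps) for reps in ratings) / (n * k)
    means = [sum(reps) / k for reps in ratings]
    ms_between = k * sum((m - grand_mean) ** 2 for m in means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for reps, m in zip(ratings, means) for x in reps
    ) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


# Toy data (not from the study): four profiles, five replicates each.
toy = [
    [1, 1, 1, 1, 1],
    [2, 2, 2, 2, 2],
    [3, 3, 3, 3, 3],
    [1, 1, 2, 1, 1],  # one replicate disagrees with the other four
]
toy_consistency = consistency_pct(toy)
toy_icc = icc1(toy)
```

The toy data illustrate why the two measures can diverge: a single deviating replicate removes a profile from the consistency count entirely, while ICC(1) penalizes it only in proportion to the within-profile variance it adds.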

Table 3. Marginal distribution of recommendations by model.

Recommendation	ChatGPT (%)	Gemini (%)	Claude (%)
Conservative	48.7	47.1	27.8
Balanced	31.9	30.8	49.8
Aggressive	19.4	22.1	22.4

Note. The Friedman omnibus test of matched-ordinal homogeneity yields χ²(2) = 389.70, p < 0.0001. Kendall’s W = 0.195. Fleiss’s κ (three raters) = 0.720.

Table 4. Pairwise comparison statistics across the three models.

Comparison	Agreement (%)	Cohen’s Linear κ	Cohen’s Quadratic κ	Stuart–Maxwell χ²	Wilcoxon W	p
ChatGPT vs. Gemini	94.2	0.930	0.953	40.21	148	<0.0001
ChatGPT vs. Claude	74.4	0.692	0.780	245.09	386	<0.0001
Gemini vs. Claude	77.2	0.727	0.805	187.20	1489	<0.0001

Note. All three pairwise Stuart–Maxwell and Wilcoxon signed-rank tests reject equality at p < 0.0001 after Holm’s step-down correction. κ are Cohen’s weighted kappas. Three-way agreement: 72.9% of profiles; Fleiss’s κ = 0.720.
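
For readers who want to see how the linear and quadratic weighted kappas in Table 4 differ, the following is a minimal sketch of Cohen's weighted kappa computed from a square contingency table of two raters' ordinal choices. The function name and the example counts are hypothetical (the counts are not the study's data), and the weights are the standard disagreement weights |i − j|/(k − 1), raised to the power 1 (linear) or 2 (quadratic).

```python
def weighted_kappa(table, weighting="linear"):
    """Cohen's weighted kappa for a k x k contingency table of ordinal ratings."""
    if weighting not in ("linear", "quadratic"):
        raise ValueError("weighting must be 'linear' or 'quadratic'")
    power = 1 if weighting == "linear" else 2
    k = len(table)
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(k)) for j in range(k)]

    observed = 0.0  # weighted observed disagreement
    expected = 0.0  # weighted disagreement expected under independence
    for i in range(k):
        for j in range(k):
            weight = (abs(i - j) / (k - 1)) ** power
            observed += weight * table[i][j] / n
            expected += weight * row_totals[i] * col_totals[j] / n ** 2
    return 1.0 - observed / expected


# Hypothetical counts: rows are one model's choice, columns another's
# (order: conservative, balanced, aggressive).
example = [
    [40, 8, 0],
    [2, 30, 5],
    [0, 1, 14],
]
kappa_linear = weighted_kappa(example, "linear")
kappa_quadratic = weighted_kappa(example, "quadratic")
```

Quadratic weights discount one-category disagreements more heavily than linear weights, so when most disagreements are between adjacent portfolios, as in this example, the quadratic kappa comes out higher, which is the same ordering seen in every row of Table 4.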

Table 5. Likelihood-ratio tests for each predictor in per-model proportional-odds logits.

Predictor	df	χ² ChatGPT	p	χ² Gemini	p	χ² Claude	p
Financial attributes
Risk Tolerance	2	7144.30	<0.0001	6831.24	<0.0001	4986.97	<0.0001
Time Horizon	2	5493.20	<0.0001	4594.23	<0.0001	4532.65	<0.0001
Goal Type	2	185.67	<0.0001	172.42	<0.0001	185.70	<0.0001
Annual Income	2	153.59	<0.0001	128.05	<0.0001	87.92	<0.0001
Life-cycle attributes
Age	2	80.08	<0.0001	43.46	<0.0001	286.53	<0.0001
Marital Status	2	51.92	<0.0001	50.28	<0.0001	32.99	<0.0001
Employment Type	2	0.44	0.8030	1.68	0.4309	15.92	0.0007
Demographic attributes
Ethnicity	3	2.94	0.8025	10.14	0.0348	1.08	0.7825
Gender	1	0.46	0.8025	3.72	0.0539	15.32	0.0002

Note. Each row reports a LR test of the null that all dummy coefficients for that predictor jointly equal zero, obtained by re-estimating the proportional-odds logit without that predictor. The p columns apply the Holm correction within each block (financial, life-cycle, and demographic). McFadden pseudo-R² = 0.879 (ChatGPT), 0.808 (Gemini), 0.700 (Claude). N = 5000 for each fit.
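
The block-wise Holm adjustment described in the note can be illustrated with the demographic block of Table 5. The sketch below converts the reported χ² statistics to unadjusted p-values and applies a Holm step-down adjustment within the two-test block. It assumes the standard Holm procedure with enforced monotonicity; `chi2_sf` and `holm` are hypothetical helper names, and the closed-form tail probabilities cover only the degrees of freedom that appear in Table 5. Because the inputs are the rounded χ² values from the table, the results should land close to, but not exactly on, the published p-values.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variate (closed forms for df = 1, 2, 3 only)."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2))
    if df == 2:
        return math.exp(-x / 2)
    if df == 3:
        return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)
    raise ValueError("only df = 1, 2 or 3 are supported in this sketch")


def holm(pvalues):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        running_max = max(running_max, min(1.0, (m - rank) * pvalues[i]))
        adjusted[i] = running_max
    return adjusted


# Demographic block of Table 5: (chi-square, df) for Ethnicity and Gender.
chatgpt_block = holm([chi2_sf(2.94, 3), chi2_sf(0.46, 1)])
gemini_block = holm([chi2_sf(10.14, 3), chi2_sf(3.72, 1)])
```

The ChatGPT block shows why Ethnicity and Gender share the same adjusted p-value in Table 5: the smaller unadjusted p-value is doubled, and the monotonicity step then carries that value forward to the second test.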

Table 6. Selected coefficients from the per-model proportional-odds logits.

Predictor (vs. Reference)	β ChatGPT	OR ChatGPT	β Gemini	OR Gemini	β Claude	OR Claude
Female (vs. Male)	0.100	1.11	0.227	1.26	0.365 ***	1.44
Black (vs. White)	−0.102	0.90	−0.367 *	0.69	0.088	1.09
Hispanic/Latino (vs. White)	−0.325	0.72	−0.460 **	0.63	−0.046	0.95
Asian (vs. White)	−0.221	0.80	−0.402 *	0.67	0.008	1.01
Age 45 (vs. 28)	−0.833 ***	0.43	−0.766 ***	0.46	−0.547 ***	0.58
Age 62 (vs. 28)	−1.568 ***	0.21	−0.851 ***	0.43	−1.850 ***	0.16
Married w/dep. (vs. Single)	−0.928 ***	0.40	−0.853 ***	0.43	−0.633 ***	0.53
Divorced w/dep. (vs. Single)	−1.243 ***	0.29	−0.902 ***	0.41	−0.482 ***	0.62
Self-employed (vs. Salaried)	0.179	1.20	−0.109	0.90	−0.145	0.87
Gig or contract (vs. Salaried)	−0.064	0.94	−0.278	0.76	−0.441 ***	0.64
Income $120k (vs. $50k)	1.607 ***	4.99	1.056 ***	2.88	0.763 ***	2.15
Income $300k (vs. $50k)	2.303 ***	10.00	1.642 ***	5.16	1.043 ***	2.84
Goal: Education (vs. Retirement)	−1.173 ***	0.31	−1.692 ***	0.18	1.083 ***	2.95
Goal: House deposit (vs. Retirement)	−2.631 ***	0.07	−1.803 ***	0.16	−0.470 ***	0.63

Note. Coefficients are log-odds of being in a higher recommendation category (more aggressive); odds ratios greater than one indicate a shift towards more aggressive recommendations relative to the reference level. Risk Tolerance and Time Horizon dummies are omitted from the table because they exhibit quasi-complete separation in some subsamples, which inflates point-coefficient standard errors but does not affect the tests in Table 5 and Table 7. *** p < 0.001, ** p < 0.01, * p < 0.05 from Wald tests; the LR tests in Table 5 deliver the joint inference.

Table 7. Variable-level tests of Model × X interactions in the pooled fit.

Predictor	df	Wald χ²	p	Sig.
Risk Tolerance	4	213.24	<0.0001	***
Time Horizon	4	135.20	<0.0001	***
Goal Type	4	113.96	<0.0001	***
Age	4	30.61	<0.0001	***
Annual Income	4	12.30	0.076
Marital Status	4	6.65	0.622
Ethnicity	6	3.96	1.000
Gender	2	1.40	1.000
Employment Type	4	2.17	1.000

Note. Pooled proportional-odds logit, N = 15,000. Clustered Wald χ² uses a profile-clustered sandwich covariance with Holm correction applied. *** p < 0.001, ** p < 0.01, * p < 0.05 from Wald tests.

Table 8. Model-implied probability of an aggressive portfolio choice at anchor scenario.

Variation from Anchor	P(Agg) ChatGPT	P(Agg) Gemini	P(Agg) Claude
Anchor (baseline)	0.039	0.365	0.171
Female (vs. Male)	0.043	0.419	0.229
Black (vs. White)	0.036	0.285	0.184
Hispanic/Latino (vs. White)	0.029	0.266	0.164
Asian (vs. White)	0.032	0.278	0.172
Age 28 (vs. 45)	0.086	0.553	0.263
Age 62 (vs. 45)	0.019	0.345	0.053
Married w/dependents	0.016	0.197	0.099
Divorced w/dependents	0.012	0.189	0.113
Self-employed	0.047	0.340	0.151
Gig or contract	0.037	0.303	0.117

Note. Anchor: Male, White, age 45, Moderate risk tolerance, 30-year horizon, Retirement goal, $300,000 income, Single without dependents, Salaried (W-2). Probabilities are obtained from the per-model proportional-odds logits and are conditional on all other attributes being held at the anchor values. The anchor scenario is deliberately chosen near a category boundary to maximize the visibility of marginal effects, and therefore differs from the omitted reference category used in the other tables.
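
Under a proportional-odds logit, changing one attribute shifts the log-odds of an aggressive recommendation by that attribute's coefficient, which links the coefficients in Table 6 to the probabilities in Table 8. The sketch below applies the Female coefficient to each model's anchor probability; the helper names are hypothetical, and because the inputs are the rounded published values, agreement with Table 8 is expected only to about the third decimal place.

```python
import math


def logit(p):
    """Log-odds of a probability."""
    return math.log(p / (1.0 - p))


def inv_logit(z):
    """Probability corresponding to a log-odds value."""
    return 1.0 / (1.0 + math.exp(-z))


def shifted_probability(p_anchor, beta):
    """P(aggressive) after adding coefficient beta to the anchor's log-odds."""
    return inv_logit(logit(p_anchor) + beta)


# Anchor P(Agg) from Table 8 and the Female (vs. Male) coefficient from Table 6.
anchor = {"ChatGPT": 0.039, "Gemini": 0.365, "Claude": 0.171}
beta_female = {"ChatGPT": 0.100, "Gemini": 0.227, "Claude": 0.365}

p_female = {
    model: shifted_probability(anchor[model], beta_female[model])
    for model in anchor
}
```

The same coefficient produces very different changes in probability depending on where the anchor sits: Gemini's anchor is near the middle of the logistic curve, so a log-odds shift of 0.227 moves P(Agg) by roughly five percentage points, whereas ChatGPT's anchor is in the tail and a shift of 0.100 moves it by less than half a point.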

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Agliata, N.; Hasso, T. Generative AI as an Investment Advisor: Same Client, Different Advice. FinTech 2026, 5, 54. https://doi.org/10.3390/fintech5020054

