1. Introduction
Generative artificial intelligence (GAI) is rapidly reshaping the area of personal financial advice. Within a few years, large language models have moved from experimental novelty to embedded infrastructure inside the workflows of asset managers, banks and, increasingly, retail investors themselves. Survey evidence indicates that nearly half of retail investors now use GAI to interpret financial information [1], and the rapid uptake of GAI-enabled financial chatbots, combined with a parallel maturation of automated advisory platforms, suggests that algorithmic advice, once dispensed by rule-based robo-advisors operating on rigid mean-variance frameworks, is becoming conversational, generative and substantially more flexible. This shift carries a latent tension. The robo-advisor was originally promoted as a corrective to human advisor bias and conflict of interest [2,3]. By delegating recommendations to deterministic algorithms calibrated against client risk profiles, robo-advisory platforms promised to remove the demographic and behavioral biases documented in the human advice literature [4,5,6]. Yet if GAI now substitutes for, or augments, the rule-based recommendation engines that underpin contemporary robo-advice, the architecture of bias may simply migrate rather than disappear. Where rule-based models can be audited against transparent algorithmic logic, GAI models make recommendations through opaque latent representations conditioned on heterogeneous training corpora and reinforcement learning from human feedback [7,8]. Consequently, biases that were excluded by design in conventional robo-advisors may re-enter the recommendation pipeline through the back door.
The empirical question this study addresses is whether contemporary GAI models, when prompted to perform the function of a goals-based investment advisor, generate recommendations that exhibit two distinct and theoretically separable patterns. First, do they respond appropriately to the financial attributes that goals-based investing prescribes as relevant, such as risk tolerance, time horizon, age and goal type? Second, do they discriminate inappropriately based on demographic cues, such as gender and ethnicity, that goals-based investing theory and prevailing regulatory frameworks treat as immaterial to portfolio recommendations, conditional on financial profile? Although Oehler and Horn [9] compared ChatGPT’s investment recommendations to those of established robo-advisors at the average level, they did not decompose the relative weight that the model places on individual investor attributes, nor did they isolate whether identical financial profiles draw different recommendations depending on demographic cues.
Drawing on the audit methodology developed in the algorithmic bias literature [8,10,11], this study uses a full-profile conjoint experiment to probe the implicit advisory logic of three frontier GAI models. In the experiment, three standardized portfolios are held constant while the investor profile attributes vary across choice tasks. Each model completes 5000 choice tasks per experiment, nested in 1000 investor profile scenarios, yielding 15,000 total choice observations.
This study makes three contributions. First, it extends the rapidly growing GAI-in-investing literature beyond comparisons of average performance [9,12,13] towards a structural account of which attributes drive GAI portfolio recommendations and which do not. Second, it imports the audit methodology of algorithmic bias research [10,11] into the goals-based investment advisory context, where a clean theoretical separation exists between attributes that should and should not influence advice. Third, it documents cross-model heterogeneity in advisory logic, illuminating a previously underappreciated source of “platform risk” facing investors who delegate financial decision-making to particular GAI models. The findings carry implications for fintech regulation, robo-advisor governance and the rapidly developing scholarly conversation on AI accountability in financial services.
3. Materials and Methods
We adopt an audit methodology that treats GAI models as research subjects, using structured prompting to elicit their latent advisory logic [8,11]. We conceptualize GAI models not as cognitive agents but as probabilistic engines that produce statistically representative outputs conditional on their training corpora and alignment procedures. The “advice” generated by these models therefore reflects the dominant patterns of advisory discourse encoded in their training data, which makes them particularly suitable subjects for audit-style choice experiments designed to surface implicit attribute weights [23,24].
We audit three frontier GAI models, GPT 5.5 (OpenAI), Gemini 3.1 Pro (Google) and Claude Opus 4.7 (Anthropic), selected for their market dominance and their documented use in financial-advisory contexts. Each system is accessed through its respective official API to ensure independence of responses, and default temperature settings are retained so as to preserve the stochastic variation that characterizes real-world deployment of these models.
3.1. Experimental Design
The audit comprises a full-profile conjoint experiment in which three standardized portfolios are presented identically across all choice tasks: a Conservative Portfolio (30% equity, 70% bond), a Balanced Portfolio (60% equity, 40% bond) and an Aggressive Portfolio (90% equity, 10% bond), with all other portfolio attributes held constant at industry-typical levels. The investor profile then varies across nine attributes that we partition into three conceptually distinct categories. The first category, financial attributes, comprises stated risk tolerance, time horizon, goal type and annual income. These are attributes that goals-based investing prescriptively treats as recommendation-relevant, and the GAI response to them serves as a benchmark of advisory competence. The second category, life-cycle attributes, comprises age, marital and dependent status, and employment type; these may be conditionally relevant because they bear on the suitability of specific portfolios and products. The third category, demographic attributes, comprises gender and ethnicity; these are protected attributes and should not be relevant to portfolio recommendations. Each model completes 5000 choice tasks nested in 1000 distinct choice profiles, in which the levels of the nine investor attributes are varied. The full attribute schema is presented in Table 1.
3.2. Data Collection
The attributes and their levels yield a full-factorial design space of 17,496 unique hypothetical investors. To ensure sufficient statistical power while maintaining experimental and computational efficiency, a fractional factorial design was utilized, drawing a random, orthogonal sample of 1000 distinct profiles from this universe. This sampling procedure minimizes multicollinearity between attributes, so that the main effect of each client characteristic can be estimated independently of the others. The 1000 generated profiles were translated into standardized textual prompts, each of which, acting as a client, presents the full profile of the hypothetical investor to the large language model. An example of a full prompt is provided in Appendix A. We adopt this fully structured prompt design deliberately. The explicit specification of the attributes and the three standardized portfolios makes the attribute-level decomposition identifiable. By holding the choice architecture constant and varying the demographic and financial attributes orthogonally, we can attribute differences in recommendations to specific attributes rather than to differences in how each model spontaneously frames the advisory task. The design prioritizes internal validity at the cost of ecological validity, as real retail investors typically interact with GAI advisors through less structured natural-language conversations in which the recommendation-relevant attributes are partial, ambiguous, or supplied in non-canonical order.
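The size of the design space can be checked with a short sketch. The per-attribute level counts below are an assumption inferred from the reported total of 17,496 (the actual schema is in Table 1), and the sampler is plain random sampling without replacement; it does not enforce the orthogonality constraint described above.

```python
import itertools
import math
import random

# Number of levels per attribute. ASSUMPTION: these counts are inferred from
# the reported design-space size (2 x 4 x 3^7); the real schema is in Table 1.
LEVELS = {
    "gender": 2,
    "ethnicity": 4,
    "age": 3,
    "risk_tolerance": 3,
    "time_horizon": 3,
    "goal_type": 3,
    "annual_income": 3,
    "marital_status": 3,
    "employment_type": 3,
}


def design_space_size(levels):
    """Number of cells in the full factorial design."""
    return math.prod(levels.values())


def sample_profiles(levels, n, seed=0):
    """Draw n distinct profiles uniformly at random from the full factorial.

    Each profile maps an attribute name to a level index (0-based).
    """
    rng = random.Random(seed)
    names = list(levels)
    full = list(itertools.product(*(range(levels[name]) for name in names)))
    return [dict(zip(names, combo)) for combo in rng.sample(full, n)]


profiles = sample_profiles(LEVELS, 1000)
print(design_space_size(LEVELS), len(profiles))
```

With a random draw of this size, attribute levels are only approximately balanced; an exactly orthogonal fraction would need a dedicated design-of-experiments routine.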
For full reproducibility, we record the exact technical conditions under which the recommendations were elicited. Each of the three models was queried through its provider’s official application programming interface (API) using the most capable frontier model available at the time of the audit: GPT 5.5 (OpenAI), Gemini 3.1 Pro (Google) and Claude Opus 4.7 (Anthropic). The full set of 5000 choice tasks per model (1000 profiles × five replicates) was administered in two collection waves on 8 May 2026 (replicate 1) and 28 May 2026 (replicates 2–5), with model identifiers, prompt messages and sampling parameters held constant across waves. We do not use a system message and provide information only through the prompt. Sampling parameters were left at each provider’s default settings to reflect realistic deployment conditions; the default temperature is 1.0 across the three providers, and the default top-p is 1.0 for ChatGPT, 0.95 for Gemini, and 1.0 for Claude. As a robustness check, Gemini was additionally queried at temperature = 0 on the same 1000 profiles in a separate one-replicate run on 28 May 2026; this temperature override was not possible for ChatGPT or Claude, as neither provider exposes a temperature parameter for its frontier model. There were no refusals, and all models provided answers in the format requested by the prompt. We parsed the first 11 characters of each response and coded “Portfolio A” as 1, “Portfolio B” as 2, and “Portfolio C” as 3 to reflect their ordinal nature from Conservative to Aggressive. Although we describe these portfolios as Conservative, Balanced, and Aggressive, the prompt does not label them as such.
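The coding rule described above can be sketched as follows. The function name and the error handling are illustrative assumptions; only the 11-character prefix rule and the 1/2/3 coding come from the text.

```python
# Ordinal codes for the three portfolio labels used in the prompt.
CODES = {"Portfolio A": 1, "Portfolio B": 2, "Portfolio C": 3}


def code_recommendation(response: str) -> int:
    """Map a raw model response to its ordinal code via the first 11 characters.

    Hypothetical helper: raises ValueError if the prefix is not a known label,
    which would flag a refusal or an off-format answer.
    """
    label = response[:11]
    if label not in CODES:
        raise ValueError(f"unexpected response prefix: {label!r}")
    return CODES[label]


print(code_recommendation("Portfolio B is the most suitable choice."))
```

Raising on an unknown prefix, rather than silently coding a missing value, makes any off-format response visible during collection.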
3.3. Analysis
The experiment yields 15,000 ordinal recommendations (1000 client profiles × three models × 5 tasks per profile), coded 1 = Conservative, 2 = Balanced, 3 = Aggressive. Because the same 1000 profiles are administered to each model five times, the data have a matched replicate structure where each profile contributes five recommendations per model. The regression analyses use the replicate-level observations, yielding 5000 observations per model and 15,000 observations in the pooled model. This design permits within-profile inference and delivers greater power than an independent-samples comparison would. The analysis proceeds through four components: a cross-model agreement analysis, per-model proportional-odds regressions, a pooled interaction model that tests for differential attribute weighting across the three GAI models, and an anchor-scenario calculation that translates the coefficient estimates into economically interpretable recommendation probabilities. A set of robustness checks accompanies the regression results.
First, we quantify how consistently each model responds to the same profile across its five replicates. For each of the 1000 profiles, we summarize the five recommendations produced by each model and consider three complementary measures, specifically the percentage of profiles for which all five replicates were identical, Fleiss’s κ across the five replicates [25], and the one-way intraclass correlation, ICC(1).
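The three consistency measures can be illustrated with a minimal sketch using textbook formulas. The function names and the toy data are assumptions for illustration; the replicates are treated as interchangeable raters, and the ICC treats the ordinal codes as numeric.

```python
from collections import Counter


def share_identical(ratings):
    """Fraction of profiles whose replicates all received the same code."""
    return sum(len(set(row)) == 1 for row in ratings) / len(ratings)


def fleiss_kappa(ratings, categories=(1, 2, 3)):
    """Fleiss's kappa with replicates as raters and profiles as subjects."""
    n_subjects = len(ratings)
    n_raters = len(ratings[0])
    totals = Counter()
    p_bar = 0.0
    for row in ratings:
        counts = Counter(row)
        totals.update(counts)
        # Proportion of agreeing rater pairs for this profile.
        agree = sum(c * c for c in counts.values()) - n_raters
        p_bar += agree / (n_raters * (n_raters - 1))
    p_bar /= n_subjects
    # Chance agreement from the pooled category shares.
    p_e = sum((totals[c] / (n_subjects * n_raters)) ** 2 for c in categories)
    return (p_bar - p_e) / (1.0 - p_e)


def icc1(ratings):
    """One-way random-effects intraclass correlation, ICC(1)."""
    n = len(ratings)
    k = len(ratings[0])
    means = [sum(row) / k for row in ratings]
    grand = sum(means) / n
    ms_between = k * sum((m - grand) ** 2 for m in means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(ratings, means) for x in row
    ) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


# Toy data: three profiles, five replicates each (not study data).
demo = [[1, 1, 1, 1, 1], [2, 2, 2, 3, 2], [3, 3, 3, 3, 3]]
print(share_identical(demo), fleiss_kappa(demo), icc1(demo))
```

Both kappa and ICC(1) are undefined when every response falls in a single category, so a production version would need a guard for that case.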
Second, we characterize the marginal distribution of recommendations by model and assess cross-model agreement. The omnibus null that the three GAI models draw recommendations from the same distribution is tested using the Friedman test for matched ordinal data [26]. Multi-rater concordance is summarized by Kendall’s coefficient of concordance W and by Fleiss’s [25] kappa. Pairwise comparisons are reported in four forms: raw percentage agreement, Cohen’s [27] linearly weighted kappa interpreted against the Landis and Koch [28] thresholds, the Stuart [29]–Maxwell [30] test of marginal homogeneity, and the Wilcoxon signed-rank test for paired ordinal data. Holm’s [31] step-down procedure controls the family-wise error rate across the three pairwise comparisons. Directional asymmetries in disagreement are summarized by counting, for each model pair, the profiles on which one model recommends a more aggressive portfolio than the other; pairwise contingency heatmaps visualize the joint recommendation distributions.
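Two of the pairwise ingredients above, the linearly weighted kappa and the Holm step-down adjustment, can be sketched in a few lines. The function names and the example inputs are illustrative assumptions, not values from the study.

```python
def linear_weighted_kappa(a, b, categories=(1, 2, 3)):
    """Cohen's kappa with linear agreement weights w = 1 - |i - j| / (K - 1)."""
    k = len(categories)
    n = len(a)
    index = {c: i for i, c in enumerate(categories)}
    joint = [[0.0] * k for _ in range(k)]
    for x, y in zip(a, b):
        joint[index[x]][index[y]] += 1.0 / n
    row = [sum(joint[i]) for i in range(k)]
    col = [sum(joint[i][j] for i in range(k)) for j in range(k)]
    observed = expected = 0.0
    for i in range(k):
        for j in range(k):
            w = 1.0 - abs(i - j) / (k - 1)
            observed += w * joint[i][j]
            expected += w * row[i] * col[j]
    return (observed - expected) / (1.0 - expected)


def holm_adjust(pvalues):
    """Holm step-down adjusted p-values, returned in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * pvalues[idx])
        running_max = max(running_max, candidate)  # enforce monotonicity
        adjusted[idx] = running_max
    return adjusted


# Toy inputs (not study data): two models' codes on six profiles, and
# three raw pairwise p-values.
model_a = [1, 1, 2, 2, 3, 3]
model_b = [1, 2, 2, 2, 3, 2]
print(linear_weighted_kappa(model_a, model_b))
print(holm_adjust([0.01, 0.04, 0.03]))
```

Linear weights give partial credit when two models differ by one category (Conservative versus Balanced) and none when they differ by two, which suits a three-level ordinal outcome.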
We then identify the client attributes that drive each model’s recommendations. For each GAI model separately, we estimate a proportional-odds (cumulative) logit [32,33] of the ordinal recommendation on the nine client attributes. All predictors are factor-coded, with the lowest-risk or modal level taken as the omitted reference (Male, White, age 28, Conservative risk tolerance, 5-year horizon, Retirement goal, US$50,000 income, Single without dependents, Salaried W-2). The joint significance of each predictor is assessed by a likelihood-ratio (LR) test, obtained by re-estimating the model without that predictor. LR inference is preferred to Wald inference because quasi-complete separation on the Risk Tolerance dummies, documented in the robustness checks below, inflates Wald standard errors but leaves the nested log-likelihood comparisons unaffected [34]. To control the family-wise error rate, we apply the Holm step-down correction separately within each of three theoretically motivated attribute blocks. The financial block (Risk Tolerance, Time Horizon, Goal Type and Annual Income) comprises the inputs that goals-based investing theory prescribes as the legitimate determinants of portfolio recommendation. The life-cycle block (Age, Marital Status and Employment Type) comprises attributes that are demographic but carry potential actuarial or life-cycle relevance through their correlation with longevity, dependent-funding requirements or income stability. The demographic block (Gender and Ethnicity) comprises attributes that anti-discrimination laws treat as protected characteristics and that should not be considered for investment purposes. This partition aligns the statistical correction families with the theoretical categories under which we interpret the results. Overall fit is summarized by the McFadden [35] pseudo-R2, and selected coefficients are reported on the log-odds and odds-ratio scales to convey the direction and magnitude of attribute effects.
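For reference, the specification and the two summary quantities described above can be written compactly. The notation is an assumption introduced here for illustration (the text does not print the equations), and the sign convention is one of several equivalent parameterizations: $x_i$ collects the factor-coded attribute dummies, $\tau_1 < \tau_2$ are the cutpoints, $\ell_{\text{full}}$, $\ell_{-k}$ and $\ell_0$ are the maximized log-likelihoods of the full model, the model without predictor $k$, and the cutpoint-only model, and $d_k$ is the number of dummies dropped with predictor $k$.

```latex
\begin{align}
\operatorname{logit} P(Y_i \ge j \mid x_i) &= x_i^{\top}\beta - \tau_{j-1},
  \qquad j = 2, 3, \\
\mathrm{LR}_k &= 2\,\bigl(\ell_{\text{full}} - \ell_{-k}\bigr)
  \;\overset{a}{\sim}\; \chi^2_{d_k} \quad \text{under } H_0\colon \beta_k = 0, \\
R^2_{\text{McF}} &= 1 - \frac{\ell_{\text{full}}}{\ell_{0}} .
\end{align}
```

Under this convention a positive coefficient shifts probability towards the Aggressive end of the scale, and the same $\beta$ applies at both cumulative cuts, which is the parallel-regression restriction.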
To test whether the three GAI models differ in how they use client attributes rather than merely in their unconditional risk appetite, we pool the 15,000 observations and estimate a proportional-odds logit augmented with two Model dummies (Gemini and Claude, with ChatGPT as the reference) and the full set of Model × X interactions. The omnibus null that all 36 interaction coefficients equal zero is tested by LR. Variable-level Model × X interaction tests are then reported with Holm correction to identify the specific attributes on which the models diverge.
To express the coefficient estimates in economically interpretable quantities, we compute model-implied recommendation probabilities at an anchor scenario chosen to position the predicted recommendation near a category boundary, where marginal effects on the probability of an aggressive recommendation are largest. The anchor is a Male, White, age-45 client with Moderate risk tolerance, a 30-year horizon, a Retirement goal, $300,000 income, Single without dependents, and Salaried (W-2) employment. We then vary one attribute at a time, holding the remaining anchor attributes fixed, and report the implied P(Aggressive) under each GAI model. This complements the LR tests by exposing differences in economic magnitude that joint significance tests may mask.
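The anchor-scenario calculation can be sketched as follows, assuming a proportional-odds parameterization in which P(Y ≥ j) is the logistic function of the linear predictor minus a cutpoint. All numeric values are hypothetical and chosen only to place the anchor on the Balanced/Aggressive boundary; they are not estimates from the study.

```python
import math


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def category_probabilities(eta, cut1, cut2):
    """Return (P(Conservative), P(Balanced), P(Aggressive)).

    Assumes P(Y >= 2) = logistic(eta - cut1) and
    P(Y >= 3) = logistic(eta - cut2), with cut1 < cut2.
    """
    p_ge2 = logistic(eta - cut1)
    p_ge3 = logistic(eta - cut2)
    return (1.0 - p_ge2, p_ge2 - p_ge3, p_ge3)


# HYPOTHETICAL values for illustration only.
CUT1, CUT2 = -1.0, 2.0
ANCHOR_ETA = 2.0      # linear predictor at the anchor profile
AGE_62_SHIFT = -0.5   # log-odds shift when age moves from 45 to 62

anchor = category_probabilities(ANCHOR_ETA, CUT1, CUT2)
older = category_probabilities(ANCHOR_ETA + AGE_62_SHIFT, CUT1, CUT2)
print(f"P(Aggressive) at anchor:  {anchor[2]:.3f}")
print(f"P(Aggressive) at age 62:  {older[2]:.3f}")
```

Placing the anchor where the linear predictor equals the upper cutpoint puts P(Aggressive) at 0.5, the point at which a given log-odds shift produces the largest change in probability.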
Four robustness checks are conducted. First, the parallel-regression assumption is assessed for each per-model fit using the Brant [36] test, implemented as a comparison of slope coefficients from binary logits at the two cumulative cuts P(Y ≥ 2) and P(Y ≥ 3). Where the proportional-odds restriction is rejected for an appreciable share of comparisons, a partial proportional-odds specification [37] is fitted to verify that the qualitative pattern of significant predictors is preserved. Second, the per-model fits are inspected for quasi-complete separation [34]; where it is detected, inference relies on LR tests of nested log-likelihoods rather than on Wald standard errors. Third, the per-model fits are re-estimated under leave-one-variable-out perturbations to confirm that the pattern of joint significance in the main results is not driven by any single predictor. Fourth, to assess whether our results are sensitive to the default-temperature setting, we perform a small 1000-observation replication using Gemini at temperature zero, as this is the only frontier model that permits temperature alteration.
5. Discussion
The empirical question posed at the outset of this study was whether contemporary generative AI (GAI) models, when prompted to perform the function of a goals-based investment advisor, generate recommendations that respond appropriately to financially relevant attributes whilst remaining invariant to demographic attributes that goals-based investing treats as conditionally immaterial. Five substantive conclusions emerge from our analysis. First, all three GAI models are highly internally consistent on identical attributes: 83% to 92% of profiles receive an identical recommendation across the five replicates, and the one-way intraclass correlation exceeds 0.91 for every model, with Gemini the most consistent (ICC = 0.967) and Claude the least (ICC = 0.918). Second, all three models ground their recommendations overwhelmingly in the legitimate financial attributes (Risk Tolerance, Time Horizon, Goal Type and Annual Income), with McFadden pseudo-R2 values of 0.879 (ChatGPT), 0.808 (Gemini) and 0.700 (Claude). The relatively lower fit for Claude indicates that a greater share of its variation is unaccounted for by the supplied client attributes, a result that maps directly onto its lower within-model consistency. Third, within the life-cycle attributes, Age and Marital Status are significant in every model, and the expanded sample additionally reveals that Claude, but not ChatGPT or Gemini, conditions significantly on Gender and Employment Type. Ethnicity is undetectable for ChatGPT and Claude, but is a small significant predictor for Gemini, under the structured prompt design adopted here. The pooled cross-model interaction for Ethnicity is not statistically significant; therefore, the Gemini-specific pattern is descriptive rather than statistically resolved against the other two models. 
Fourth, the three GAI models are not interchangeable, with the divergence concentrated in how the models translate Risk Tolerance, Time Horizon, Goal Type, Income and Age into a recommendation. Claude is notably divergent, producing fewer Conservative and more Balanced recommendations than the other two GAI models, treating Education-funding goals as warranting more aggressive allocations rather than less, and exhibiting a steeper drop in risk at age 62 than the corresponding rise at age 28. Fifth, although the cross-model differences in demographic use are, for most predictors, statistically indistinguishable, the absolute magnitudes of demographic sensitivity vary substantially at economically realistic anchor scenarios, with consequential implications for downstream allocation decisions.
The high within-model consistency, not previously documented for GAI investment advice, has two implications. The first is methodological: it confirms that the cross-model differences we document are systematic decision-rule differences rather than stochastic answer variation, and that the lower McFadden pseudo-R2 for Claude reflects a model that conditions on a marginally broader set of attributes rather than one that produces noisier outputs. The second is substantive: replicate-level stability of 90% or higher means that an investor who consults a particular GAI model on a given profile will, in the overwhelming majority of cases, receive the same recommendation if the same query is repeated. This is a non-trivial property of automated advice and stands in contrast to the variability documented in human-advisor research, where the same advisor often recommends materially different products to identical clients across encounters.
The dominance of financial attributes in driving recommendations is both encouraging and analytically informative. From the perspective of goals-based investing theory [16,17], the attributes that should rationally govern portfolio recommendation (risk tolerance, time horizon, goal type and income) are precisely those on which the GAI models converge. The McFadden pseudo-R2 values from 0.700 to 0.879 are exceptionally high for a behavioral prediction task and indicate that the models behave largely as transparent functions of their inputs rather than as opaque pattern-matchers drawing on latent training-corpus regularities. This aligns with, and extends, the aggregate-level finding of Oehler and Horn [9] that ChatGPT advice often tracks academic benchmarks more closely than that of established robo-advisors. Whereas Oehler and Horn establish alignment at the level of average recommendations, our attribute-level decomposition demonstrates that the alignment is not coincidental but reflects appropriate marginal weighting of the relevant financial attributes. The result also strengthens the more cautious findings of Kim [12] and Ko and Lee [13], who establish that GAI-constructed portfolios exhibit defensible diversification properties, by extending the evidence into the conjoint-experimental domain where the implicit attribute weights, and not merely the output portfolios, are observable.
The mixed Ethnicity and Gender results warrant careful interpretation. ChatGPT and Claude show no statistically detectable use of Ethnicity in their recommendations under this design: the joint Ethnicity tests return χ² = 2.94, p = 0.80 and χ² = 1.08, p = 0.78, respectively, and the corresponding equal-coefficient minimum detectable effects (OR = 1.73 for ChatGPT and OR = 1.43 for Claude) rule out moderate-to-large ethnicity effects but cannot exclude smaller ones. Gemini’s joint Ethnicity test crosses the conventional Holm-corrected significance threshold (p = 0.035) with non-White contrasts in the 0.63–0.69 OR range, indicating that Gemini recommends slightly more conservative portfolios to non-White investors than to otherwise identical White investors. For Gender, Claude is the only model in which the effect is statistically significant (OR = 1.44, 95% CI [1.20, 1.73]), with female profiles receiving less conservative recommendations than identical male profiles. The contrast between our results and previous studies documenting clear gender or ethnic disparities in lending and hiring contexts [7,8,10,11] should therefore be read narrowly: under this specific structured prompt design, we find no robust evidence of ethnic disparate treatment in two of three models and no robust evidence of gender disparate treatment in two of three models. We do not claim that contemporary GAI is universally free of demographic bias; rather, we claim that within an audit design that fully specifies financial profile, the residual role of demographic cues is small and model-specific. Several non-exclusive explanations may account for the limited and model-dependent demographic effects observed here. One possibility is that investment-advisory discourse is less gender- and racially coded than lending or labor-market discourse. A second possibility is that alignment procedures have been tuned to suppress demographic cues in regulated financial contexts [7]. A third possibility, which our design cannot rule out, is that the structured prompt itself crowds out demographic cues by supplying risk tolerance, time horizon, income and goal type explicitly; demographic cues may carry larger weight when the financial signal is more ambiguous. A fourth possibility, suggested by the model-specific nature of the effects, is that the alignment regimes of the three providers have diverged in this domain: some appear to have neutralized gender and ethnic cues more thoroughly than others, and the persistence of small effects in Gemini for Ethnicity and in Claude for Gender may reflect lower demographic-neutralization effort in their respective alignment processes.
The results for the life-cycle attributes are more nuanced and reveal a model-specific pattern. Age and Marital Status shift recommendations towards conservatism in every model. From a goals-based investing standpoint, neither attribute is, strictly, recommendation-relevant once time horizon, risk tolerance and goal type have been specified; the time-horizon attribute should already capture the life-cycle considerations that age might otherwise proxy, and shortfall aversion on behalf of dependents is, in principle, a function of the explicit goal rather than of marital status. The persistence of an age effect over and above the explicit time-horizon dummies therefore suggests that the GAI models are importing additional life-cycle assumptions around retirement proximity, human-capital depletion or longevity risk, that lie outside the goals-based framework’s formal architecture. Such importation is economically defensible: an older client with a thirty-year horizon nonetheless faces a shorter expected remaining life than a younger client with the same horizon, and conservatism may rationally follow. The treatment of age varies substantively across the three models in a way that goes beyond a simple difference in overall age sensitivity. ChatGPT’s de-risking is approximately linear in age, Gemini’s age response saturates after middle age (the model treats a 45- and a 62-year-old similarly but treats both very differently from a 28-year-old), and Claude exhibits a retirement-cliff (the model treats a 28- and a 45-year-old similarly but applies sharp additional de-risking at 62). Three non-exclusive mechanisms could plausibly account for these distinct age-schedule shapes. 
First, the corpora on which contemporary GAI models are trained likely contain a substantial body of retirement-proximity advisory text emphasizing capital preservation, drawdown sequencing and longevity risk, against a comparatively thinner body of text exhorting young investors to take more risk than they would naturally choose; the conservative-prescription corpus is simply larger than the aggressive-prescription corpus. This first mechanism predicts a retirement-cliff shape, which Claude exhibits. Second, the Reinforcement Learning from Human Feedback layer may reinforce this asymmetry, as labelers are likely to penalize responses that recommend Aggressive portfolios to older clients more readily than they would penalize responses that recommend Conservative portfolios to younger ones; the downside of an overaggressive recommendation to a 62-year-old client is salient and culturally legible in a way that the opportunity cost of a conservative recommendation to a 28-year-old client is not. Third, the underlying age schedule that each model approximates may itself differ across providers: Gemini appears to encode a model in which advisory caution rises rapidly through early adulthood and then plateaus, while Claude encodes a model in which caution rises gradually and then accelerates at retirement age, and ChatGPT encodes a roughly linear age schedule. Our design cannot distinguish among these mechanisms, but the cross-model variation in age-schedule shape is itself a substantive finding: the three frontier models do not merely differ in how strongly they de-risk older clients; they differ in the functional form of the de-risking schedule, with consequential implications for the life-cycle allocation a client receives at any given age.
The economic and statistical significance of these age schedules merits explicit assessment. For Claude, the asymmetric retirement-cliff response is statistically significant, and economically substantial: a 12-percentage-point reduction in P(Aggressive) at age 62 relative to age 45, compounded across the typical horizon of late-career portfolio decisions and across the millions of retail interactions an embedded GAI advisory platform conducts annually, represents a structurally important feature of Claude’s life-cycle advisory logic. For Gemini, the saturating age response is also statistically significant and economically substantial in the opposite direction: a 19-percentage-point increase in P(Aggressive) when the client is 28 rather than 45, with only a further 2-percentage-point reduction at age 62, means that Gemini’s advisory logic concentrates its age-related differentiation in early adulthood rather than near retirement. These two model-specific schedules have qualitatively different implications for the cumulative wealth accumulation of clients across the life-cycle: a 62-year-old client receiving Claude’s advice is far more likely to be moved into a conservative portfolio than the same client receiving Gemini’s advice, while a 28-year-old client receiving Gemini’s advice is far more likely to be moved into an aggressive portfolio than the same client receiving Claude’s. The differences are sufficiently large to warrant explicit acknowledgement in any compliance assessment of either platform as a GAI investment advisor.
Perhaps the most consequential finding for practical deployment is cross-model heterogeneity. The divergence is concentrated in financial attributes that are relevant for investment recommendations: Risk Tolerance, Time Horizon, Goal Type and Income. The unconditional marginal distributions illustrate the magnitude of this heterogeneity directly: Claude allocates only 27.8% of its recommendations to the Conservative portfolio against 48.7% for ChatGPT and 47.1% for Gemini, while Claude’s Balanced share of 49.8% is more than fifteen percentage points above that of either alternative. Claude’s reversal of the Education-funding coefficient, treating the goal as warranting more balanced allocation rather than less, further indicates that the models do not share a common semantic mapping of named life goals onto the risk spectrum. The cross-model heterogeneity extends to the demographic attributes as well: Gemini’s modest, design-specific ethnicity effect is not present in the other two models, and Claude’s significant gender effect is not present in ChatGPT or Gemini. The pooled cross-model interaction tests do not statistically resolve these differences, but the descriptive pattern means that an investor selecting between the three platforms is not merely choosing a particular financial attribute-weighting but also a particular influence of demographic attributes, with implications for the ethical assessment of each platform that lie outside the conventional disparate-impact frame. This raises a form of platform risk that is, to our knowledge, largely undocumented in the existing GAI-in-finance literature. Investors who delegate portfolio recommendations to a particular GAI model are, in effect, selecting a particular implicit advisory philosophy whose attribute-weighting profile may not be evident even after extended interaction. The risk is qualitatively distinct from the model-version risk noted by Schneider and Yilmaz [18], who report performance variation across model releases within a single provider; the heterogeneity we document is contemporaneous, persists at the frontier of each provider’s offering and arises in the attribute weights themselves rather than in downstream realized returns.
These findings carry implications for several adjacent literatures and policy domains. For the literature on robo-advisors [
2,
14,
15], our results indicate that the migration from deterministic recommendation engines to GAI-enabled conversational interfaces is unlikely, on the present evidence, to reintroduce the ethnic biases documented in the human-advisor literature [
4,
5,
6], although they leave open the possibility that gender may re-enter the advisory pipeline through model-specific decision rules whose interpretive status lies between defensible actuarial inference and folk-theoretic stereotype. The contrast with Mullainathan et al. [
6] is especially striking on ethnicity: where human advisors in their audit study systematically steered clients into higher-cost actively managed products, with effects varying by client demographics, the GAI models we audit display either no detectable ethnic patterning (ChatGPT, Claude) or only a small design-specific ethnic effect (Gemini), despite recommending portfolios constructed from the same broad asset classes. We caveat this contrast with the observation that our structured prompt design supplies the financial inputs explicitly, which may attenuate the demographic role that less structured elicitations would reveal. For fintech regulation, this is a complex finding because conventional disparate-impact frameworks are poorly equipped to govern a setting in which the salient differentials arise not only between distinct clients within a single platform but also between identically situated clients across platforms. For robo-advisor governance, our results suggest that audit-style methodologies of the kind developed by Lippens [
11] and Motoki et al. [
8] should be incorporated into routine compliance monitoring of GAI-enabled advisory services, not merely as a one-off vendor assessment but as an ongoing surveillance instrument that tracks attribute weightings across model versions and across providers over time. For the broader scholarly conversation on AI accountability in financial services, the result that frontier GAI models differ materially in their handling of investment recommendation-relevant attributes whilst converging on the conditional irrelevance of ethnicity suggests that the dominant fairness narratives may be insufficient as a description of where the consequential algorithmic variation actually resides.
Several limitations of the present study warrant explicit acknowledgement. First, the audit captures a single snapshot of three model versions at a fixed point in time. Generative AI models are updated continuously and the alignment procedures that govern their behavior are subject to change at the discretion of their developers; the patterns we document may evolve, and replication across model versions and time periods is therefore essential before any conclusion can be regarded as a general property of GAI-enabled advice. Second, our prompts are presented in English and the client names that signal gender and ethnicity are drawn from a United States cultural register; the absence of detectable ethnic disparate treatment in our experiment cannot be generalized to non-Anglo settings without further audit. Third, real retail investors interact with GAI advisors through extended natural-language conversations in which the recommendation-relevant attributes are typically partial, sequentially disclosed, qualitatively described rather than quantitatively specified (a client might describe themselves as “a bit risk-averse” rather than as “Moderate risk tolerance”), and embedded within longer narrative accounts of their financial circumstances. Our results therefore characterize GAI behavior in a high-information, structured-elicitation regime, and should not be generalized without qualification to lower-information conversational regimes in which the latent attribute weights of the models may differ materially. As we discuss below, extension to less heavily anchored prompts is the most important next step suggested by the present design. Fourth, the three-portfolio choice set is a coarse simplification of the continuous allocation space in which real portfolio recommendations are situated, and effects that are subthreshold under our discrete ordinal measure may be detectable under continuous-allocation metrics. 
Fifth, although names are a well-established device for signaling implicit gender and ethnicity in audit research [
11], their information content as cues to demographic identity is plausibly weaker than that of explicit labels, and a stronger experimental manipulation might reveal effects that ours does not. Sixth, the null demographic results we report (no detectable gender effect for ChatGPT or Gemini; no detectable ethnicity effect for ChatGPT or Claude) hold against minimum detectable effects in the range of OR = 1.30 to 1.77 at α = 0.05 two-sided and 80% power. These thresholds rule out moderate-to-large demographic effects but cannot exclude smaller ones, and they are conditional on the structured prompt design adopted here. Demographic effects too small for our design to detect, or effects activated under less structured elicitations in which the financial signal is partial or ambiguous, would not be visible in our results. Seventh, the null results for Gender and Ethnicity that we document in individual models are consistent both with a genuine absence of bias and with the presence of explicit safeguards in the alignment layer; the present audit cannot distinguish these mechanisms.
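To make the minimum-detectable-effect thresholds concrete, the sketch below applies the standard Wald approximation, under which the detectable log odds ratio equals the sum of the two normal quantiles multiplied by the standard error of the coefficient. This is an assumption made for illustration: the study's own power calculation may have used a different procedure, and the function names are hypothetical. The second helper inverts the relation to show which standard errors would be consistent with the reported range of OR = 1.30 to 1.77.

```python
import math
from statistics import NormalDist


def z_total(alpha=0.05, power=0.80):
    """Sum of the two-sided critical value and the power quantile."""
    nd = NormalDist()
    return nd.inv_cdf(1 - alpha / 2) + nd.inv_cdf(power)


def mde_odds_ratio(se_log_or, alpha=0.05, power=0.80):
    """Smallest odds ratio (> 1) detectable given the SE of the log odds ratio."""
    return math.exp(z_total(alpha, power) * se_log_or)


def implied_se(mde_or, alpha=0.05, power=0.80):
    """Standard error of the log odds ratio implied by a given MDE odds ratio."""
    return math.log(mde_or) / z_total(alpha, power)


# Back out the standard errors implied by the reported MDE range
# (Wald approximation only; not taken from the study itself).
for reported in (1.30, 1.77):
    se = implied_se(reported)
    print(f"MDE OR = {reported:.2f} -> implied SE(log OR) ~ {se:.3f}")
```

Under this approximation, halving the standard error (for example, by quadrupling the number of profiles per cell) would shrink the detectable log odds ratio by half, which indicates the scale of replication needed to rule out smaller demographic effects.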
These limitations provide opportunities for future research. Longitudinal audit designs that track the attribute-weighting profiles of frontier GAI models across model versions and over time would establish whether the patterns we document are durable features of contemporary GAI advice or transient artefacts of alignment regimes.
Adversarial and ecologically graded audit designs are the most direct extension. Three variants of the present design would together characterize the boundary of generalizability identified by the structured-prompt limitation. Attribute-omission designs could systematically withhold one of the four financial attributes at a time, to test whether the demographic attributes acquire greater weight when the legitimate financial attributes are weakened or ambiguous. Narrative-elicitation designs could replace the structured-field prompt with an unstructured client biography that conveys the same financial information in natural language, to assess whether the attribute weights we document are stable across structured and conversational presentations of equivalent information. Continuous-allocation designs would replace the three-portfolio choice set with a request for a numeric equity/bond allocation, to detect demographic effects that are subthreshold under the discrete ordinal measure but resolvable on a continuous scale. Each of these design variants would loosen one of the specific structural priors we adopt for identification, and the comparison of attribute weights across designs would establish how much of our finding reflects the goals-based architecture of the prompt versus the underlying decision rules of the models. Multilingual and cross-jurisdictional extensions would establish whether the demographic patterns we observe generalize beyond the English-language, United States cultural setting in which our audit was conducted. Welfare-oriented extensions that map the cross-model heterogeneity we document into long-run client outcomes, using, for example, the diversified-portfolio benchmarks of Kim [
12] and Ko and Lee [
13], would translate the abstract platform-risk finding into the metric that ultimately matters for investors. The audit methodology developed here also extends naturally beyond goals-based portfolio recommendations to adjacent advisory domains, including tax-aware investing, debt management and intergenerational wealth transfer, where the theoretical separation between attribute-relevant and attribute-irrelevant client characteristics is similarly well defined. Finally, the comparison of frontier closed-source models with open-source alternatives, whose alignment procedures are at least partially inspectable, would shed light on the extent to which the patterns we document reflect inherent properties of the underlying language modeling versus deliberate design choices in the alignment layer.
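The attribute-omission design described above could be operationalized along the following lines. This is a minimal sketch under stated assumptions: the four attribute names follow the text, but the attribute levels, the placeholder client name, the prompt wording and the function names are all hypothetical and do not reproduce the prompts used in the audit.

```python
from typing import Dict, Iterator, Tuple

# The four financial attributes named in the text.
FINANCIAL_ATTRIBUTES = ("Risk Tolerance", "Time Horizon", "Goal Type", "Income")


def omission_variants(
    profile: Dict[str, str],
) -> Iterator[Tuple[str, Dict[str, str]]]:
    """Yield (omitted attribute, reduced profile), withholding one financial
    attribute at a time and leaving every other field untouched."""
    for omitted in FINANCIAL_ATTRIBUTES:
        variant = {k: v for k, v in profile.items() if k != omitted}
        yield omitted, variant


def render_prompt(profile: Dict[str, str]) -> str:
    """Render a structured-field prompt (hypothetical wording)."""
    lines = [f"{field}: {value}" for field, value in profile.items()]
    return "Client profile\n" + "\n".join(lines)


# Illustrative profile; the levels and the name are placeholders.
example_profile = {
    "Name": "Client A",
    "Risk Tolerance": "Moderate",
    "Time Horizon": "10 years",
    "Goal Type": "Education",
    "Income": "Middle",
}

variants = list(omission_variants(example_profile))
for omitted, variant in variants:
    print(f"--- omitted: {omitted} ---")
    print(render_prompt(variant))
```

Because the name field that carries the demographic cue is retained in every variant, any change in the estimated demographic coefficients across variants can be attributed to the withheld financial attribute rather than to a change in the cue itself.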
6. Conclusions
This study has audited three frontier GAI models in the goals-based investment advisory setting, using a full-profile conjoint experiment. The audit yields four principal contributions to the rapidly developing literature on GAI in finance. First, by establishing that all three frontier GAI models exhibit high within-model consistency on identical inputs, with an intraclass correlation exceeding 0.92, the study confirms that automated advice from frontier models is stable rather than stochastic and provides a methodological benchmark against which future audits can be calibrated. Second, by decomposing model recommendations into attribute-level effects, the study advances the GAI-in-investing literature beyond the comparison of aggregate performance towards a structural account of which client attributes drive GAI recommendations and which do not. The result is partially reassuring on the dimension that has attracted the greatest regulatory attention: contemporary frontier GAI models, when performing goals-based investment advisory, weight the legitimate financial inputs heavily, and ChatGPT and Claude exhibit no statistically detectable disparate treatment on Ethnicity under the structured prompt design adopted here. Gemini, however, exhibits a small but statistically significant ethnicity effect, with non-White profiles receiving more conservative recommendations than otherwise identical White profiles. On Gender the pattern differs: ChatGPT and Gemini show no detectable Gender effect, whereas Claude shows a significant Gender effect, with women receiving less conservative recommendations than otherwise identical men. Third, by importing the audit methodology of the algorithmic-bias literature into a setting in which the separation between recommendation-relevant and recommendation-irrelevant attributes is unusually clean, the study illustrates the methodological benefits of combining the audit-experimental tradition with the prescriptive framework of goals-based investing. 
Fourth, and perhaps most importantly for practice, the study documents substantial cross-model heterogeneity in how frontier GAI models translate the same financial attributes into the same standardized portfolios, with unconditional conservative allocation shares spanning a twenty-one-percentage-point range across the three models and with model-specific demographic sensitivities of comparable economic magnitude. This previously underappreciated form of platform risk is qualitatively distinct both from the within-platform version risk and from the within-platform demographic bias that have dominated the conversation to date.
For investors, the findings indicate that the choice of GAI advisory platform is itself a consequential portfolio decision whose effects are likely to compound over the investment life-cycle. For platforms and regulators, the findings indicate that conventional disparate-impact concerns remain relevant. More broadly, the findings suggest that bias has neither cleanly disappeared nor simply reappeared. The use of Gender by Claude and of Ethnicity by Gemini further shows that demographic sensitivity may appear in model-specific ways, reinforcing the need to evaluate both within-model demographic effects and cross-model differences in advisory logic. The absence of detectable ethnic disparate treatment in two of three models, and of gender disparate treatment in two of three, is conditional on a structured prompt design that fully specifies the financial profile. Audits employing less structured or attribute-incomplete prompts may reveal demographic patterns that our design lacks the power to detect. For the broader literature on AI accountability in financial services, the findings suggest that the migration of bias from human to algorithmic advice is mixed and model-specific. As GAI becomes embedded ever more deeply in the advisory infrastructure that retail investors rely upon, audit-based monitoring of attribute-weighting profiles across platforms and over time is, in our view, no longer an optional complement to existing governance practices but an indispensable component of them.