4.1. Evaluation of Regular and Enhanced Outputs
For each case, we present the case summary here, with the regular and enhanced ChatGPT outputs juxtaposed in Supplementary S1. We use ChatGPT to summarize each case and observe that it does an adequate job of doing so. Though we only present the ChatGPT summaries of the cases here (to save space), we input the full text of each case into ChatGPT for our analysis (please see the Supplementary Materials of Schlosky et al. (2024) for the full text of the cases). We conclude each case with our evaluation, where we compare the regular and enhanced outputs and point out their strengths and weaknesses.
Case #1: James’ Financial Troubles
ChatGPT Summary:
James has $10,000 in credit card debt but only makes minimum payments despite having $100,000 in the bank. He underutilizes his company’s 401(k) match, contributing only 2% instead of 6%. Additionally, James ignores his unpaid medical bills.
Evaluation:
The first visible difference between regular and enhanced output is that with the regular output, ChatGPT takes the attitude of a distant third party in its communication. On the other hand, with the enhanced output, ChatGPT sounds more friendly and approachable. This shows that ChatGPT is highly responsive to the tone and expectations that we set for it in our prompts. For credit card debt, under the regular output, ChatGPT suggests that James pay off the balance, but it does not offer any prioritization. However, we observe prioritization in the enhanced output. This shows once again that prompting makes an important difference when using ChatGPT for financial advice. ChatGPT in the enhanced output suggests increasing retirement contributions as a second priority for James. This suggestion comes third in the regular output, but in general, it is unclear if the numbers correspond to priorities in the regular output as there is no mention of priorities. They appear as a bulleted list of actions with no priorities. Emergency fund comes out as the fourth in priority in the enhanced output, followed by creating a budget and a financial plan. We do not observe substantial differences in the context of the advice provided for these items. Different from the regular output, in the enhanced prompt, we see new suggestions, which are insurance, investment opportunities, and a regular review of the financial situation. In the regular prompt, we observe the “avoid new debt” suggestion, which does not appear under the enhanced prompt.
Overall, the enhanced output offers priorities to James, whereas the regular output does not. However, the enhanced output does not appear to follow a logical planning process. Paying off credit card debt is first, followed by increases to 401(k) contributions, medical bills, emergency fund, budgeting, insurance, etc. While these are sound recommendations, the order of importance is critical. Generally, it is wise to address risk management first. This would mean establishing an emergency fund, reviewing insurance coverage, budgeting, and then addressing debt. Without a solid foundation, a financial or medical emergency could derail James’ plans. Once the foundation is established, James can then focus on medical bills, increasing 401(k) savings, and investment strategies. He can then monitor and adjust his financial priorities accordingly.
Case #2: Sports Gambling and Winning the Lottery
ChatGPT Summary:
Mark secretly racks up $500,000 in sports betting debt while living rent-free with his family. He unexpectedly finds a $100 bill and spends it on lottery tickets, winning a $750,000 jackpot, which he hopes will solve his financial woes.
Evaluation:
In this case, both methods appear to produce prioritized action plans. In the regular output, we see the wording “step-by-step” approach. In the enhanced prompt, the suggestions start with “immediate steps”. The regular output ignores the impact of taxes, but with the enhanced output, we observe a suggestion for considering tax implications. Both approaches prioritize paying the credit card debt first and then establishing an emergency fund, although the suggested amounts differ (USD 50,000 under the regular prompt and USD 25,000 under the enhanced prompt). This difference is due to the time frame adopted: Under the regular prompt, the advice-seeker is recommended to have an emergency fund that can cover living expenses from 6 to 12 months. This time frame is 3 to 6 months under the enhanced prompt. We find this difference noteworthy. Under the regular prompt, we observe the prioritization of financial counseling and discussions with family members. This comes later with the enhanced prompt, which places a higher priority on setting up a budget and developing a financial plan. Under both prompts, we see suggestions for saving for college, investing, and insurance. As with the first case, we observe a suggestion for monitoring the financial situation, but this suggestion is generated by both prompts.
Comparing the two approaches, the advice generated with the regular prompt is not reliable, as it ignores the taxes that Mark needs to pay. Furthermore, the recommended amounts for a given action, such as allocating USD 150,000 in a diversified investment, are based on pre-tax amounts. The amounts shown in the enhanced output make more sense given the tax implications. On the other hand, it is unclear whether Mark has the financial capacity to follow the other suggestions, such as making investments and opening college savings accounts. Overall, sound advice is again generated from the enhanced prompt; however, the order of importance is not addressed.
Case #3: Living the Good Life
ChatGPT Summary:
The Smith family earns $500,000 annually but prioritizes a luxurious lifestyle and creating memories with their children over saving for retirement or their kids’ college. They have a second mortgage and rely on their job security, planning to co-sign their children’s student loans.
Evaluation:
Here, with the enhanced output, we continue observing prioritization, but this is less clear with the regular output (it appears to make general recommendations). In the enhanced output, we observe that budgeting and setting up an emergency fund are prioritized first. In this case, we observe a time frame of 6 to 12 months in the enhanced output, similarly to the regular output. Under both prompts we observe similar suggestions, such as savings via 529 accounts, having adequate insurance, estate planning, and paying off the second mortgage. One important detail that we observe is that with the regular prompt, a suggestion to consider refinancing is also provided, but we do not see this under the enhanced prompt. Moreover, a clear suggestion as to how to lower spending (“scaling back on luxury cars and other high-end purchases”) is provided in the regular output. The enhanced output is vague as to how to cut down on spending. Both prompts suggest that the Smith family maximize how much they contribute to their retirement accounts. The reasoning behind this suggestion is provided in the regular output (employee matches and tax benefits) but it is not provided in the enhanced output. However, it remains unclear if the Smith family has the financial capacity to “max out” their retirement savings before they can tackle their lifestyle expenses.
Overall, in both approaches, we observe that a specific financial planning process is not being followed. In other words, once the planner-client relationship has been established, specific and prioritized goals should be agreed upon; then, recommendations can be made. This process did not appear to be followed, and additionally, there is no specific implementation schedule regarding who is responsible for each task or the order in which activities should occur.
Case #4: A Deadly Virus and Wipe-Out of a Family’s Wealth
ChatGPT Summary:
The Franks, known for their frugality and strong financial habits, face a financial wipe-out when the stock market crashes by 90%. Their children criticize them for their sacrifices, comparing their situation to their spendthrift neighbors who now have the same net worth but enjoyed a better life before.
Evaluation:
Prioritization appears to be less clear under both prompts. We observe similar suggestions/recommendations under both prompts. One difference is that in the regular output, there is a recommendation to cut expenses. The family is already very frugal to the extent that they keep the thermostat at low temperatures to save on energy bills. Cutting expenses further may not be a viable strategy. Under both prompts, we observe a recommendation to diversify investments into bonds, real estate, etc. This assumes that in an event where the stock market plummets 90%, other asset classes can protect the family’s portfolio. Although the suggestion of diversification is sound in general, the family should be advised that in a dramatic systemic crisis, all asset classes may experience major declines in their values. If the family is anxious about the possibility of another wipe-out in the financial markets, the suggestion to invest in certificates of deposit might have been a more prudent one. Moreover, it is important to note that the losses in the Frank family’s portfolio are paper losses (i.e., unrealized losses). Selling their stocks now in order to diversify their portfolio would realize those losses and may financially ruin them, whereas staying put and waiting for a recovery may not. This point is missed by ChatGPT.
There is a recommendation to increase income under both prompts. While this appears reasonable in theory, it is unclear if the Frank family is in any position to take on additional work. ChatGPT suggests that the Frank family enjoy the present and spend quality time with their children, but at the same time, it advises them to find more sources of income, which would naturally reduce family time. There is some inconsistency in these suggestions. It also appears that the outputs provide advice regarding specific values that the clients might not have. For example, a recommendation on teaching the children about financial balance may not be followed unless and until it aligns with the clients’ current or changed value system.
Case #5: Young Couple and Their Good Fortune
ChatGPT Summary:
Sam and Julie, renovating their new home, find $500,000 in cash and jewelry hidden in the walls. They save $50,000 for emergencies, invest the rest in low-cost index funds, and keep the jewelry, anticipating a rise in gold prices.
Evaluation:
We observe very similar recommendations under both outputs, such as considering tax consequences, estate planning, and the professional appraisal of jewelry. Both methods suggest that Sam and Julie take measures to protect their assets at home in case of theft, loss, or damage. The regular output suggests “safety deposit box”, and the enhanced output suggests “insurance”. We see a suggestion for developing a budget for repairing and improving the home under the regular prompt, which is a good one. Both methods mention the children’s education; however, the case makes no mention of the couple having kids or planning to have kids. We see advice on debt management in the enhanced output. The case does not mention any debt that the couple carries. ChatGPT endorsed all the financial decisions that the couple made. Given that ChatGPT speculates that the couple may have debts, it would not make much economic sense to leave the debt unaddressed while allocating a significant proportion of funds to the index funds. Paying off existing high-interest debt should have been the prioritized recommendation. Moreover, both outputs assume that the money and jewelry that the couple found in the house belong to them. However, the legal treatment of this case is quite complex.² As a result, ChatGPT may be misleading the couple.
Case #6: 90-Year-Old Grandpa Getting Rich with Options Trading
ChatGPT Summary:
Ken convinces his 90-year-old grandfather, Mr. Wilson, to trade (naked call) options. Mr. Wilson’s account grows by $5,000,000 in a year under Ken’s management. Grateful, Mr. Wilson feels financially secure, keeps $4,000,000, and gives the rest to Ken.
Evaluation:
Both outputs start with an analysis of the risky trade (selling naked call options) and emphasize that this trade was not suitable for Ken’s grandfather. We also see that the output under the enhanced prompt raises ethical concerns with Ken’s deployment of a highly risky trading strategy on behalf of someone who clearly does not understand the risks associated with it. We also see a suggestion to invest in real estate (among other asset classes) for Mr. Wilson. This advice may be inappropriate for someone who is 90 years old considering the illiquidity and the necessary time horizon associated with investing in real estate. Both outputs suggest a diversification strategy to lower/spread risk and are highly critical of what Ken did with his grandfather’s account. Though this is a valid suggestion and a justified criticism, they skirt around the issue that Mr. Wilson does not feel financially secure, and a low-risk strategy may not necessarily be sufficient to achieve the desired outcome in the first place. This does not necessarily mean that Mr. Wilson needs to take on more risks in his portfolio, but the suggestion provided does not seem to address the concern that he has (i.e., outliving his savings). For example, neither of the prompts makes any suggestions on lowering his living expenses to the extent that it is possible.
The enhanced output suggests an emergency fund and tax planning, which are not suggested in the regular output. However, there is no specific tax advice. Moreover, neither output brings up the tax liability on the gains earlier in their analyses. Mr. Wilson’s account was up by USD 5,000,000, and Mr. Wilson kept USD 4,000,000 and gave the rest to his grandson. Given the short-term nature of the deployed trading strategy and short-term investment horizon, there would be a significant tax liability on the gains. It is unclear if Mr. Wilson would even have USD 4,000,000 after he pays the taxes on the account. Furthermore, even if we ignore the taxes on capital gains, there is still a substantial amount of money being gifted to the grandson (Ken). The taxes on the gifted amount are not mentioned in either of the outputs. Overall, we observe that ChatGPT provides after-the-fact recommendations and not proactive recommendations. Additionally, a human approach may be able to provide more specific recommendations rather than broad suggestions regarding tax planning, annuities, and risk management.
Case #7: Hardworking Salesman
ChatGPT Summary:
Hank, needing money for his daughter’s medical expenses, secretly cheated by adding a 0.25% markup on invoices over 20 years, saving $5,000,000 without impacting his company’s finances.
Evaluation:
As with the other cases, the enhanced output has a more compassionate tone and approach. The regular output is unequivocal about the illegal nature of the scheme that Hank ran. On the other hand, the enhanced output is vague on this. It says what Hank did “could be considered embezzlement” while it is clearly an example of embezzlement. Both outputs point out the ethical issues in what Hank did. The regular output assumes that Hank will not be able to use his savings due to the illegal means that he resorted to in order to accumulate them. On the other hand, the enhanced output advises a diversification strategy for Hank’s investments. In both outputs, there is an implicit endorsement of Hank’s decision to retire. Given the embezzlement scheme he ran and the legal consequences he will be facing, Hank is not in a position to retire. It is unclear whether Hank contributed to social security and whether his company offered any pension plans (defined benefit or defined contribution). It is also a legal question whether the company could confiscate Hank’s pension benefits due to the embezzlement scheme he ran. Furthermore, convicts do not draw social security payments.³ These potential issues are not brought up by ChatGPT. Both outputs emphasize the importance of having adequate health insurance. Thus, they assume that Hank did not know or was not aware of his insurance options. This seems presumptuous, as it implies that someone who risked going to jail missed a simpler solution to his problems. The enhanced output suggests life insurance, which is a good idea, and it also suggests that Hank look for additional ways to raise his income. These are generic recommendations that may or may not apply to Hank’s particular situation.
Case #8: Getting the Right Kind of Mortgage
ChatGPT Summary:
Sally and Matt, eager first-time homebuyers, secure an adjustable-rate mortgage despite knowing interest rates are likely to rise.
Evaluation:
Both outputs highlight the risks associated with ARMs and recommend that the couple refinance their adjustable-rate mortgage to a fixed-rate mortgage. This is a sensible suggestion. However, it is unclear why the couple would qualify for a fixed-rate mortgage via refinancing given that, based on the original case, they did not qualify for one at the beginning. The regular output suggests that the couple look for additional sources of income. The enhanced output suggests that the couple make extra principal payments and investments. These are great suggestions. However, it is unclear if the couple can take on additional jobs. Furthermore, the fact that they could not qualify for a less risky mortgage at the beginning suggests that the couple is not in a strong financial position. Asking them to make an extra principal payment may not be feasible. Neither of the outputs touches on the elephant in the room: The couple did not understand how adjustable-rate mortgages work with their rate adjustments. It makes sense to “hurry” and lock in a fixed-rate mortgage. In their case, they are not locking in anything. Instead of pointing out this flaw, the enhanced output finds it understandable that “[t]hey rushed into the mortgage to lock in a lower rate.”
Case #9: Insurance Needs
ChatGPT Summary:
Sally, the sole breadwinner with four children and elderly parents, worries about her family’s financial future if she dies unexpectedly. With a large mortgage and minimal retirement savings, she is stressed about meeting their financial needs and paying for her parents’ medical bills.
Evaluation:
Here, both outputs make very similar suggestions for Sally. Guidance on the amount and the term length for term life insurance is helpful, but somewhat generic. The regular output suggests long-term care insurance for Sally’s parents and setting up a 529 plan and a spousal individual retirement account (IRA). These suggestions are warranted given that Sally has young children and ailing parents, and her partner is not employed. Whether Sally can afford all these additional expenses and contributions is not explicitly discussed. Moreover, suggestions such as the 529 plan and spousal IRA, while valid, may not align with Sally’s goals. Additionally, both outputs explain permanent life insurance products adequately, but they do not go into detail with regard to the investment component of these products. A more detailed explanation is warranted given the complexities of permanent life insurance cash value and investment strategies and risks.
Case #10: Unexpected Diagnosis and Hardship Withdrawal
ChatGPT Summary:
Howard, diagnosed with cancer, exhausts his savings and stops contributing to his 401(k) to pay medical bills. Facing foreclosure, he takes a hardship withdrawal from his 401(k) to keep his home, despite insurance not covering all his treatments.
Evaluation:
The enhanced output has a very compassionate tone. It shows empathy for the unforeseen situation that Howard found himself in. On the other hand, the regular output lacks compassion, and it points fingers at Howard for mismanaging his financial and health situation. Both methods recommend that Howard seek better insurance coverage and financial assistance from the government, charities, and hospitals, and avoid hardship withdrawals. Emergency savings are recommended in both outputs but the amounts vary. The regular output suggests that Howard save enough to cover living expenses for up to 12 months, while the enhanced output suggests that he do so for up to 6 months. Both outputs state that hardship withdrawal should be the last resort. There does not seem to be a clear and actionable recommendation for Howard’s problems in either of the outputs. It also appears that mental and emotional support is the last recommendation made when perhaps it should be the first. Howard is fighting for his life, and many of the suggestions are the last things that may be on his mind besides survival. The enhanced output is generated after telling ChatGPT that it possesses many desirable characteristics, such as empathy. This seems to change the tone that ChatGPT uses, but it does not render it more human in its recommendations. In a sense, this creates a worse output: a false sense of compassion. This is because the tone is compassionate, but recommended actions are not.
Case #11: Gambling as Last Resort
ChatGPT Summary:
Emily, discovering her late husband’s hidden debts totaling $450,000 and facing foreclosure, gambles her last $10,000 emergency savings. Miraculously, she wins $1,000,000 at the casino, enough to pay off her debts.
Evaluation:
The regular output ignores the fact that Emily needs to pay taxes on her gambling income. The enhanced output does take taxes into account. Both outputs prioritize paying off debts and obligations first, suggesting credit card debt to be paid off “immediately”. However, the rest of the debts and obligations are not prioritized. For example, should family members be paid off before the mortgage? We could argue that catching up with her mortgage payments should be her top priority lest she lose her house. Furthermore, it would be beneficial to Emily to have a good idea about how much she has after taxes before she takes care of her debts and obligations. It is also surprising that the suggestion on counseling appears in the regular output, but not in the enhanced output. The recommendations under investments come before assessing what Emily’s risk tolerance truly is. This may not be prudent investment advice given her situation. We also note that the enhanced output recommends steps to take immediately. This is close to providing guidance on an implementation schedule.
Case #12: Estate Planning
ChatGPT Summary:
Sarah, a widow with a $10,000,000 net worth, plans to create equal trust funds for her ten grandchildren. She wants a competent manager to oversee the funds until they turn 18, with provisions to redirect funds to a children’s hospital if any grandchild misbehaves.
Evaluation:
Both outputs suggest an irrevocable trust, but the enhanced output suggests that Sarah first create a revocable living trust and structure it so that it becomes irrevocable after she dies. In line with Sarah’s wishes, both outputs recommend a misbehavior clause. The enhanced output also has a suggestive tone for the misbehavior clause. The importance of tax planning is highlighted in both outputs, but the outputs do not account for unforeseen events such as the death of a grandchild after the trust is formed. Sarah should factor such unforeseen events into the trust’s documents. Another possibility is that Sarah may pass away before the trust documents are finalized. It will be prudent for her to appoint someone who can oversee this process according to her wishes in case she passes away. When it comes to choosing a trustee, there is a slight difference in the outputs. The regular output sticks with professional trustees or trust companies while the enhanced output appears to leave the door open for non-professionals trusted by the trustor. Sarah should be provided with the pros and cons of choosing a professional versus a non-professional trustee.
Case #13: No Luck with Convincing His Father
ChatGPT Summary:
Adam, a successful hedge fund manager, wants to convince his father, Sam, to invest in the stock market. Sam refuses to invest, preferring to keep his $5,000,000 out of the market, leading to tension between them.
Evaluation:
Both outputs attempt to give a balanced view of the possible thinking process adopted by Adam and his father Sam. The enhanced output speculates that Sam may have invested in low-risk alternative investments such as real estate and bonds. Here, the term “alternative investments” is misused in that alternative investments do not include bonds. Furthermore, calling real estate a low-risk investment may not be appropriate, as investing in real estate comes with many risks, including liquidity risk. In addition, the enhanced output recommends that Sam consider investing in equities with low risk to diversify his portfolio. It also provides recommendations for Sam to continue to stay informed about various investment options so that he can make educated decisions aligned with his financial goals. However, both outputs fail to address what the goals of Adam and Sam are. In other words, why does Adam want his father to invest in the stock market, why did Sam not invest in the stock market in the first place, and how did he accumulate his USD 5 million? There is also an assumption in both outputs that Sam is a risk-averse investor. Sam may simply dislike the stock market and may have engaged in risky ventures such as using substantial leverage to invest in real estate or even gambling to build his wealth.
Case #14: No Way to Go Wrong with the Stock Market
ChatGPT Summary:
Alex, who has a strong quantitative background, advises his friend Henry to take on more risk in his portfolio for higher returns. He suggests Henry load up on risky assets to boost his retirement savings.
Evaluation:
Both outputs confirm the validity of the risk-return relationship but warn that Alex’s generalization is too simplistic. The enhanced output distinguishes different types of risks, which is a helpful approach. Both methods highlight the importance of risk tolerance and the psychological aspect of investing as well as the need to have a diversified portfolio. The regular output also suggests that Henry increase his retirement contributions, which, if Henry can afford it, is better advice than “simply loading up” his portfolio with more risk. However, both outputs fail to address what Henry’s goals are. If he wants to retire early, it may make more sense for him to save more, and if he has a higher risk tolerance, he may need to invest in riskier assets. However, if he has a long-term time horizon, saving more and investing in riskier assets may not be the optimal choice.
Case #15: Mutual Fund Fees
ChatGPT Summary:
Beth considers two identical mutual funds for a $100,000 investment over 10 years, one with a 6% front-end load and the other with a back-end load. Unsure which to choose, her co-worker advises her to always pick the fund with the lowest cost.
Evaluation:
Both outputs have very similar suggestions and the same numerical example. The calculations for the front-end load fund yield slightly different results. However, in both the regular and enhanced outputs, for the specific example chosen, the results must be the same for front-end and back-end load mutual funds. The answer for the back-end load fund is USD 184,912 under both prompts, which is only USD 0.23 away from the correct answer (USD 184,912.23). The answer for the front-end load fund is USD 184,016 in the regular output, differing from the correct answer (USD 184,912.23) by USD 896.23. On the other hand, the answer for the front-end load fund is USD 184,150 in the enhanced output, differing from the correct answer (USD 184,912.23) by USD 762.23 (see Case #15 in Supplementary S1 for details).
It is unclear whether these differences are due to rounding or mathematical errors introduced by ChatGPT. The regular output has a clear recommendation in favor of the back-end load fund under certain conditions, which are a declining fee structure over the years and a waiver of fees after a certain period. The enhanced output here exhibits out-of-the-box thinking and suggests an alternative: “no-load mutual funds or low-cost index funds.” The enhanced output also brings up the importance of fund performance when assessing the claim made by Beth’s co-worker. This makes sense since, in some cases, high fees can be justified if the fund performs well. It may not always be optimal for investors to choose a mutual fund with the lowest fee.
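The benchmark figure quoted for Case #15 can be checked with a short calculation. The following is a minimal sketch rather than the calculation in Supplementary S1: the 7% annual return and the 6% back-end load applied to the ending balance are assumptions inferred from the reported USD 184,912.23 and are not stated in the case summary, and the function names are hypothetical.

```python
def front_end_load_value(principal, load, annual_return, years):
    """Ending value when the load is deducted from the initial investment."""
    return principal * (1 - load) * (1 + annual_return) ** years


def back_end_load_value(principal, load, annual_return, years):
    """Ending value when the load is deducted from the final balance."""
    return principal * (1 + annual_return) ** years * (1 - load)


PRINCIPAL = 100_000    # from the case summary
LOAD = 0.06            # 6% front-end load; assumed to be the same for the back-end fund
ANNUAL_RETURN = 0.07   # assumption: inferred from the reported benchmark figure
YEARS = 10             # from the case summary

front = front_end_load_value(PRINCIPAL, LOAD, ANNUAL_RETURN, YEARS)
back = back_end_load_value(PRINCIPAL, LOAD, ANNUAL_RETURN, YEARS)

print(f"Front-end load fund: USD {front:,.2f}")
print(f"Back-end load fund:  USD {back:,.2f}")

# Gaps between the ChatGPT front-end figures quoted in the text and the benchmark
for label, reported in [("regular", 184_016), ("enhanced", 184_150)]:
    print(f"{label} output is off by USD {front - reported:,.2f}")
```

Because multiplication is commutative, a proportional load of the same size produces the same ending value whether it is charged up front or at exit, which is why the two funds must yield identical results in this example.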
Case #16: Lure of Alternative Investments
ChatGPT Summary:
Kayla, five years from retirement, considers investing in timberland based on high returns reported by college endowments. She hopes doing this will help her catch up on her retirement savings, trusting the endowments’ professional management.
Evaluation:
Both outputs describe the properties of timberland as an investment. The enhanced output is more specific. For example, it provides the investment horizon (10 to 20 years), corrects the assumption that “timberland produces high returns”, and gives specific examples of risks such as pest infestations. We see detailed explanations on how college endowments are run and how they differ from individual investors in terms of risk tolerance and time horizon. We observe the suggestion of higher contributions in both outputs via catch-up contributions, which we welcome. Other suggestions seem quite generic (e.g., diversified portfolios), and both outputs fail to mention the potentially significant upfront costs in timberland. This could include the cost of buying the land in general, harvesting and replanting costs, as well as tax implications (basis, depletion, gains, etc.).
Case #17: Debt Consolidation
ChatGPT Summary:
Oliver, overwhelmed by $185,000 in debt from credit cards, medical bills, and a car loan, follows a friend’s advice to get a debt consolidation loan. He consolidates his debt into one payment so he no longer needs to deal with multiple lenders.
Evaluation:
Here, both outputs point out similar advantages and disadvantages of the solution (“debt consolidation”) adopted by Oliver. The regular output discusses the potentially positive impact of debt consolidation on Oliver’s credit score (if he makes his payments on time). On the other hand, the enhanced output discusses the potentially negative impact of the debt consolidation on his credit score since the new loan application necessitates a hard inquiry on his credit. Interestingly, the enhanced output makes a recommendation for Oliver to look at his interest rate and loan terms despite the fact that he has already consolidated and obtained the loan. Both outputs advise Oliver to refrain from taking on new debt. Without knowing the underlying reasons behind the debt that he accumulated (e.g., lack of financial literacy, chronic illnesses), this advice may not be feasible, at least in the short-term. The suggestion to look for additional income sources (the regular output) and reduce credit card limits (the enhanced output) could be helpful to Oliver. Both outputs advised Oliver to set up an emergency fund, but the enhanced output is vague in terms of the size of this fund. The enhanced output presents “being debt-free” as the final goal. Debt in itself is not dangerous and advising someone to live on a cash basis (at least implicitly) may stop them from growing their wealth by preventing them from buying a home or pursuing an advanced degree. Moreover, from the balance he carries (USD 60,000) on his car loan, the chances are high that Oliver is underwater on his loan. This possibility was not mentioned in either of the outputs.
Case #18: Retirement Problem with No Inheritance
ChatGPT Summary:
Mark and Emily, each saving $30,000 annually in their 401(k)s with a 7% return, plan to retire in 30 years. Assuming a 4% return during retirement and 20 years of post-retirement life, they need to calculate their annual annuity payments to deplete their account by the end of their lives.
Evaluation:
The regular output considers the savings of only one spouse: the couple saves USD 60,000 a year, not USD 30,000, yet the output provided is consistent with annual savings of USD 30,000. The output does not show the details behind this calculation. The enhanced output has the correct future value of the annuity. The final answer, USD 416,966.52, is close to the correct answer (USD 417,035.40), with a difference of USD 68.88 (see Case #18 in Supplementary S1). However, the exposition of the solution can be improved. Overall, ChatGPT performs well in simple retirement problems.
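The correct figures quoted above can be reproduced with two standard time-value-of-money formulas. The following is a minimal sketch, assuming end-of-year contributions and withdrawals (an ordinary annuity) and the parameters stated in the case summary; the function names are illustrative and are not taken from the ChatGPT outputs.

```python
def future_value_of_annuity(payment, rate, years):
    """Future value of level end-of-year deposits (ordinary annuity)."""
    return payment * ((1 + rate) ** years - 1) / rate


def level_withdrawal(balance, rate, years):
    """Level end-of-year withdrawal that depletes `balance` after `years`."""
    annuity_factor = (1 - (1 + rate) ** -years) / rate
    return balance / annuity_factor


# Case #18 inputs: two spouses saving USD 30,000 each at 7% for 30 years,
# then a 4% return over a 20-year retirement.
# End-of-year timing is an assumption of this sketch.
balance_at_retirement = future_value_of_annuity(2 * 30_000, 0.07, 30)
annual_payment = level_withdrawal(balance_at_retirement, 0.04, 20)

print(f"Balance at retirement: {balance_at_retirement:,.2f}")
print(f"Annual withdrawal:     {annual_payment:,.2f}")
```

Under these assumptions, the sketch yields roughly USD 5.67 million at retirement and an annual withdrawal of about USD 417,035, in line with the correct answer cited above. Because both formulas are linear in the deposit, passing 30_000 instead of 2 * 30_000 halves both figures, which is the pattern the regular output follows.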
Case #19: Retirement Problem with Inheritance
ChatGPT Summary:
Mark and Emily, saving $30,000 annually on their 401(k)s with a 7% return, plan to retire in 30 years. They want to leave a $500,000 inheritance so they need to calculate their annual annuity payments over 20 years of retirement to achieve this goal.
Evaluation:
The regular output generates an incorrect answer (USD 171,727.00 instead of USD 400,244.52, a difference of USD 228,517.52), and it does not show the calculation steps. The enhanced output generates the correct amount at retirement (as it did in the previous case), which is USD 5,667,647.18. However, it does not properly incorporate the inheritance. The future value of the annuity is the amount available at retirement, while the inheritance is the amount at death (20 years after retirement). The enhanced output treats these two points in time as the same, and we observe this in the step where the inheritance is subtracted from the amount available at retirement. The correct treatment is to calculate the present value of the inheritance at retirement and subtract this amount from the amount available at retirement. The correct answer after following these steps is USD 400,244.52, whereas ChatGPT’s answer is USD 380,267.11, leading to a difference of USD 19,977.41 (see Case #19 in Supplementary S1). Overall, ChatGPT does not perform well in more complicated retirement problems.
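The timing issue described above can be made concrete by computing the annual payment both ways. This is a minimal sketch, assuming end-of-year withdrawals and the 4% retirement return carried over from Case #18 (the Case #19 summary does not restate it); the variable and function names are illustrative.

```python
def annuity_factor(rate, years):
    """Present value of 1 paid at the end of each year for `years` years."""
    return (1 - (1 + rate) ** -years) / rate


balance = 5_667_647.18   # amount at retirement, as reported in the text
bequest = 500_000        # inheritance to be left at death
rate, years = 0.04, 20   # retirement return is assumed from Case #18

# Correct treatment: discount the bequest back to the retirement date first.
pv_bequest = bequest / (1 + rate) ** years
correct_payment = (balance - pv_bequest) / annuity_factor(rate, years)

# Flawed treatment: subtract the undiscounted bequest at retirement.
naive_payment = (balance - bequest) / annuity_factor(rate, years)

print(f"PV of bequest at retirement: {pv_bequest:,.2f}")
print(f"Payment, bequest discounted: {correct_payment:,.2f}")
print(f"Payment, bequest undiscounted: {naive_payment:,.2f}")
```

Under these assumptions, discounting the bequest gives about USD 400,244.5, matching the correct answer above, while subtracting the undiscounted USD 500,000 gives about USD 380,244.5, which is close to (though not identical to) the USD 380,267.11 reported for the enhanced output.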
Case #20: Saving for College
ChatGPT Summary:
The Williams, with three young children, decides to invest $30,000 in call options on blue-chip stocks for each child’s college fund. They hope the high-risk investment will yield sufficient returns by the time each child turns 18.
Evaluation:
Here, the regular and enhanced outputs have a similar analysis of the case and suggest a comparable action plan. The regular output discusses “expiration risk” whereas the enhanced output discusses “time decay”. The enhanced output thus provides a more accurate assessment of the risk involved in an options trade over time. On the other hand, the regular output makes it clear that options may expire worthless, translating to a 100% loss of the invested capital. We see that 529 accounts are suggested in both outputs. However, the enhanced output also lists Uniform Gifts to Minors Act (UGMA) and Uniform Transfers to Minors Act (UTMA) accounts as additional options. Both outputs fail to mention that most 529 plans do not allow options trading. Additionally, the outputs fail to look at the impact of different accounts on the Free Application for Federal Student Aid (FAFSA). 529 accounts owned by parents are considered an asset of the parent and have a more favorable impact on the application for student aid. UGMA and UTMA accounts are considered assets of the child for FAFSA purposes, and these accounts may be used by the child for any expenses once they reach the age of majority, which is age 18 in most states. Specifically, only 5.64% of assets in parent-owned 529 plans are considered available funds to pay for college, while 20% of assets in UGMA and UTMA accounts are counted as the student’s assets available to pay for college.
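To illustrate how much the account type matters for aid purposes, the two assessment rates quoted above can be applied to the same balance. This is a simplified sketch: the USD 30,000 balance is a hypothetical figure chosen for illustration, the constant and function names are our own, and the calculation ignores any allowances or other adjustments in the actual aid formula.

```python
# Assessment rates as quoted in the text.
PARENT_ASSET_RATE = 0.0564   # parent-owned 529 plan
STUDENT_ASSET_RATE = 0.20    # UGMA/UTMA custodial account (student asset)


def assessed_amount(balance, rate):
    """Portion of an account balance counted as available to pay for college."""
    return balance * rate


balance = 30_000  # hypothetical balance, for illustration only

parent_529 = assessed_amount(balance, PARENT_ASSET_RATE)
custodial = assessed_amount(balance, STUDENT_ASSET_RATE)

print(f"Parent-owned 529: {parent_529:,.2f} counted as available")
print(f"UGMA/UTMA:        {custodial:,.2f} counted as available")
print(f"Difference:       {custodial - parent_529:,.2f}")
```

For this hypothetical balance, the custodial account is assessed at roughly three and a half times the amount assessed for the parent-owned 529 plan, which is the kind of trade-off that neither output raises.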
Case #21: Saving His Marriage
ChatGPT Summary:
Thomas and Julie, living paycheck to paycheck, take a $10,000 loan to book a one-week cruise, leaving their toddlers with grandparents. They hope the vacation will relieve their stress and improve their marriage.
Evaluation:
The advice in the regular and enhanced outputs is mostly the same. They both offer suggestions such as “side jobs/income”, “avoiding new debt”, and “emergency savings”. In earlier cases, the regular output suggested emergency savings covering 6 to 12 months of expenses; here, this is reduced to 3 to 6 months. The regular output has interesting suggestions, such as seeking government assistance and childcare options, downsizing the house, refinancing their mortgage, and selling assets. The enhanced output suggests that the couple consider saving for retirement and for their children’s college. Another possibility (not brought up by ChatGPT) is that one parent stays home with the kids to save money on childcare and transportation expenses. This may bring financial relief to the family, as childcare costs in some instances exceed housing costs (Gibson, 2024). The enhanced output also suggests specific debt repayment plans (debt snowball and debt avalanche methods), and it has a five-step action plan. Both outputs suggest skill development to raise earning potential. Community colleges are tuition-free in some states, and this could be a viable option for Thomas and Julie.
4.4. ChatGPT-4o Versus ChatGPT-5
OpenAI has recently introduced a newer model of ChatGPT: ChatGPT-5. Re-examining all the personal finance cases via ChatGPT-5 is beyond the scope of this study. However, to provide a preliminary analysis of ChatGPT-5’s financial advice capabilities, we ran through ChatGPT-5 a subset of the cases on which ChatGPT-4o did not perform particularly well, and we provide our analysis here. These cases are as follows: 3 (Living the Good Life), 4 (A Deadly Virus and Wipe-Out of a Family’s Wealth), 5 (Young People and Their Good Fortune), 6 (90-Year-Old Grandpa Getting Rich with Options Trading), 13 (No Luck with Convincing His Father), 15 (Mutual Fund Fees), and 19 (Retirement Problem with Inheritance). We first compare the outputs with regular and enhanced prompts. Next, we compare the performance of ChatGPT-5 and ChatGPT-4o. We provide the ChatGPT-5 outputs with regular and enhanced prompts in Supplementary S2.
For Case #3, we observe the following under the regular prompt. ChatGPT emphasizes the importance of saving for retirement with specific suggestions such as a saving rate of 10–15%, contributing to an IRA, saving enough to obtain the employer matching in 401(k) plans, and using automatic contributions. For the family’s mortgage, ChatGPT recommends refinancing, paying the mortgage sooner, avoiding further use of their home equity for borrowing, and keeping their mortgage payments below 25% of their after-tax monthly income. For emergencies, ChatGPT recommends keeping 6 to 12 months of living expenses in a high-yield savings account. It also urges the parents to be careful about co-signing their children’s student loans, while providing guidance on minimizing student loans and choosing an appropriate loan amount. It further suggests keeping discretionary spending at 25% of their income and creating a long-term financial plan by working with a financial planner. Under the enhanced prompt, we also see an urgent recommendation to save for retirement, supported by a numerical example. The suggestion to contribute to an IRA comes up again, but with a caveat: this time, ChatGPT suggests a backdoor Roth IRA (a strategy deployed by high-income earners).
For the couple’s mortgage, there is a specific guideline on when to refinance (if the interest rate on the mortgage exceeds 6%). As with the output under the regular prompt, we observe guidelines on student loans. It appears that under both prompts, ChatGPT wants to protect the parents’ assets from creditors by setting limits on how much their children should borrow. We see new suggestions under the enhanced prompt, such as life insurance, umbrella liability insurance, estate planning, and using donor-advised funds. We also see a rule of thumb suggested to the family (50/30/20 rule): 50% needs, 30% wants, and 20% saving/investing. Overall, for Case #3, the advice provided under the enhanced prompt is more specific and richer compared to that provided under the regular prompt. Compared to ChatGPT-4o, the newer model provides new suggestions, such as a backdoor Roth IRA, using a high-yield savings account to keep emergency funds, and limits on student loans. The newer model also ends its output with requests to further help the individuals in the case. Despite these changes, we do not observe a vast improvement from ChatGPT-4o to ChatGPT-5.
Case #4 is an emotionally charged instance where a family, despite being financially responsible, had their portfolio destroyed due to a deadly virus. The output under the regular prompt lacks technical guidance and mostly focuses on living a balanced life and using what happened as a teachable moment. The output under the enhanced prompt also stresses the importance of living a balanced life. Further, it emphasizes the importance of diversification across stocks, bonds, cash, and real assets. It also suggests that the family maintain an emergency fund. Having cash could serve as a buffer during tumultuous times, but cash can be placed into a certificate of deposit or a high-yield savings account and still serve as a buffer. Comparing the two models reveals that the recommendations from ChatGPT-5 are very similar to those from ChatGPT-4o.
For Case #5, the output under the regular prompt starts with a warning: make sure that you have legal claim on the money! This is a very important point. We observe a suggestion that the couple contact an attorney and report what they found to local authorities. ChatGPT also suggests working with a financial planner to minimize taxes. There is a suggestion to have the gold appraised and to consider selling some of it for diversification purposes. The couple is also urged to revise their will and insurance, incorporating the change in their assets. Overall, we see ChatGPT-5 commending the couple for their investment decisions and urging them to seek legal advice on whether they legally own the items they found in the house. In the enhanced output, we observe similar suggestions on the need to legally establish the ownership of the money and the jewelry found. Potential tax liability is also highlighted under the enhanced prompt. We also see a specific diversification strategy with stock index funds, bond index funds, and real estate investment trusts (REITs). Other suggestions under the new prompt include umbrella liability insurance, 529 accounts, and custodial accounts (UGMA/UTMA). As with the previous cases, the recommendations are more specific and comprehensive under the enhanced prompt compared to the regular prompt.
Unlike ChatGPT-4o, ChatGPT-5 is very clear about the couple needing to seek legal guidance on the money and jewelry found. However, as with ChatGPT-4o, ChatGPT-5 makes assumptions about the couple’s desire to have kids and maintains a positive tone about what the couple did with the money (e.g., their investment decisions). It appears that ChatGPT lacks internal consistency on the legal dimension. A human advisor would likely take the view that the couple acted prematurely, investing the money without knowing whether they had a legal claim to it. Another example of this internal inconsistency is the issue of taxes. Under the enhanced prompt, we see a suggestion that the couple put aside 35–40% of the money for taxes. At the same time, the couple is praised (“smart”) for putting USD 450,000 in index funds, which means that they did not set aside the recommended amount for taxes. Overall, we see some improvements in the new model of ChatGPT in legal matters. However, as with the previous model, we observe that some of the advice provided by ChatGPT is disjointed, lacking a coherent structure and plan.
For Case #6, in the output with the regular prompt, we observe a rich discussion on the ethical and legal implications of Ken’s trading naked call options in his grandfather’s brokerage account. ChatGPT clearly disapproves of what Ken did, and it admonishes him for managing his grandfather’s money without having “proper licenses or disclosures”. However, it leaves the door open for Ken to give advice to his grandfather on conservative investments, which we believe is somewhat inconsistent. The ethical and legal issues arising from Ken’s naked call trades are raised in the output under the enhanced prompt as well. We also see the legal issues extended to the brokerage firm for approving naked call trading for a 90-year-old. ChatGPT suggests that the appropriate portfolio for Mr. Wilson (Ken’s grandfather) comprises annuities, bonds, and dividend-paying stocks. For Case #6, the outputs from ChatGPT-4o and ChatGPT-5 are quite similar, with both strongly emphasizing that what Ken did was wrong. Moreover, we do not observe any discussion of capital gains taxes (on the USD 5,000,000 gain) or gift taxes (on the USD 1,000,000 gifted to Ken) in either model. We continue to observe unhelpful suggestions for Ken’s grandfather in terms of what he should invest in. He already feels insecure about his financial position, and it is doubtful that conservative investments will help him unless he makes lifestyle adjustments. Overall, we do not notice a substantial change in the outputs from ChatGPT-4o to ChatGPT-5.
In Case #13, in both outputs, ChatGPT-5 takes the view that the claims made by Adam (a hedge fund manager) and his father about investing in the stock market are both correct. Furthermore, the chatbot emphasizes that investing is personal and that people should not be forced into investments they are not comfortable with. This view is very similar to the output from ChatGPT-4o. Overall, we do not observe a major difference between ChatGPT-4o and ChatGPT-5 in their approach toward Case #13.
In Case #15, we observe a numerical comparison between the front-end and back-end load funds in the output under the regular prompt. It has the correct setup for the future value of the investment (taking the fees into account). The output does not provide a final answer for either of the funds but asserts that both funds have the same after-fee future value, which is correct. We do see a slight preference towards the back-end load funds if fees decline as the holding period increases. We also observe the suggestion of no-load funds (rated as the best option). When assessing the co-worker’s claim, ChatGPT urges us to pay attention to the following (in addition to the fees): expense ratio, fund performance consistency, and the length of the holding period. In the output under the enhanced prompt, we observe the same numerical comparisons, but this time, the numerical answers are provided. The answer for the back-end load fund, USD 184,912, is accurate: it differs from the correct answer by only USD 0.23 due to rounding (see Case #15 in Supplementary S2 for details). The answer for the front-end load fund is inaccurate by USD 14.23: it is USD 184,898 instead of USD 184,912.23 (see Case #15 in Supplementary S2 for details). The same argument in favor of the back-end load funds is observed under the enhanced prompt if the funds are held for the long term. As to the co-worker’s claim, here ChatGPT suggests taking liquidity needs, time horizon, and flexibility into consideration. Both ChatGPT-4o and ChatGPT-5 calculate the ending value of the fund with back-end load fees correctly. For the ending value of the fund with front-end load fees, the answers are inaccurate in both models of ChatGPT, but significantly more so under ChatGPT-4o. Overall, the newest model of ChatGPT is more numerically accurate. Both models produce somewhat similar responses to the claim made by the co-worker.
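The equality of the two after-fee future values follows directly from the commutativity of multiplication. As a brief sketch, assume an initial investment $P$, an annual return $r$, a holding period of $n$ years, and the same load rate $f$ charged either on the initial investment (front-end) or on the ending value (back-end); these symbols are our own notation and do not appear in the case text.

```latex
\begin{align}
FV_{\text{front}} &= P\,(1-f)\,(1+r)^{n}, \\
FV_{\text{back}}  &= P\,(1+r)^{n}\,(1-f), \\
FV_{\text{front}} &= FV_{\text{back}}.
\end{align}
```

If instead the back-end load declines with the holding period to some $f_n < f$, then $P\,(1+r)^{n}\,(1-f_n) > P\,(1-f)\,(1+r)^{n}$, which is consistent with the slight preference for back-end load funds over long holding periods.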
In Case #19, we have a retirement problem with inheritance. Both outputs have the correct future value at retirement. In the enhanced output, the final answer (USD 400,246) differs from the correct answer (USD 400,244.52) by USD 1.48. In the regular output, the final answer (USD 400,265) differs from the correct answer by USD 20.48 (see Case #19 in Supplementary S2 for details). The setup is correct under the regular output, but there is some inconsistency in how ChatGPT performs rounding. Even so, the final answers from ChatGPT-5 are a vast improvement over those from ChatGPT-4o. Overall, comparing the outputs from ChatGPT-4o and ChatGPT-5 in this section reveals that numerical precision markedly improved from ChatGPT-4o to ChatGPT-5, but the financial advisory quality largely remained consistent.