Article

A Cross-Disciplinary Academic Evaluation of Generative AI Models in HR, Accounting, and Economics: ChatGPT-5 vs. DeepSeek

1 Management & International Management Department, School of Business, Lebanese International University, Bekaa 146404, Lebanon
2 Accounting Information Systems Department, School of Business, Lebanese International University, Beirut 146404, Lebanon
3 Economics Department, School of Business, Lebanese International University, Bekaa 146404, Lebanon
* Author to whom correspondence should be addressed.
Adm. Sci. 2025, 15(11), 412; https://doi.org/10.3390/admsci15110412
Submission received: 8 September 2025 / Revised: 20 October 2025 / Accepted: 22 October 2025 / Published: 24 October 2025

Abstract

As generative AI becomes further integrated into academic and professional contexts, there is a clear need to determine its performance within specific, applied domains. This research compares the performance of ChatGPT-5 and DeepSeek on tasks in the domains of accounting, economics, and human resources. The models were provided two prompts per domain, and outputs were evaluated by academics across five criteria: accuracy, clarity, conciseness, systematic reasoning, and indicators of potential bias. Inter-rater reliability was assessed using Cohen’s Kappa. The findings show clear performance differences between the two models. ChatGPT-5 outperformed DeepSeek in accounting and human resources, while DeepSeek outperformed ChatGPT-5 on economics tasks. Because ChatGPT-5 outperformed DeepSeek in two of the three domains, the research proposes a reliability-based framework for comparing generative AI outputs within business disciplines and offers practical guidance on when and how to use each model in academic and professional contexts.

1. Introduction

The fast-paced growth of technology, especially in the area of machine learning and artificial intelligence (AI), is changing the way people live, learn, work, and communicate, and it raises the question of whether such advances may surpass human knowledge and capabilities (Arce et al., 2025). AI is making a major impact on education by allowing people to learn in a more personalized way that is tailored to their needs (Sajja et al., 2024; Boustani, 2025). In addition, technological advancements, such as the new interactive capabilities of AI tools, will help push the limits of conventional education in a rapidly evolving, technology-driven world (Gill et al., 2024). Technology is clearly changing both the way students learn and the way teachers will be expected to teach (Aad et al., 2024). ChatGPT and DeepSeek are examples of technologies that have already been adopted widely by educators and learners alike, and while they serve the same basic function as AI learning tools, their approaches to personalizing learning experiences may differ (AlAfnan, 2025). ChatGPT is exceptionally strong in its natural language processing capabilities and often serves as conversational, tutor-like support, while DeepSeek offers more structured, highly logical explanations of scientific concepts involving procedural problem-solving (Sapkota et al., 2025). ChatGPT promotes knowledge retention because it creates engagement in the learning process through conversational, dialogue-based learning practices. DeepSeek’s vision-language model is also well suited for classroom education, with built-in capabilities that go beyond typical educational use (Lu et al., 2024). Nevertheless, all the promise that ChatGPT and DeepSeek show as powerful educational tools merits continued evaluation and analysis to determine whether they support the learner’s educational experience or hinder it.
Although ChatGPT has been widely studied, little is known about the comparative academic performance of ChatGPT-5 and DeepSeek in applied business contexts. The current literature is mostly focused on general capabilities, STEM performance, or ethics, with insufficient research looking specifically at model performance on structured academic tasks (Jin et al., 2025), particularly across business disciplines. Moreover, many existing works evaluate model responses with generic benchmarking and/or user satisfaction rather than academic benchmarks relevant to the domain. Business disciplines such as Human Resources, Accounting, and Economics require not only factual correctness but also reasoning that incorporates and, if appropriate, challenges discipline-specific frameworks (Bou Reslan & Jabbour Al Maalouf, 2024; Gu et al., 2024; Isiaku et al., 2024). Given that AI tools are increasingly used by students, educators, and professionals in business disciplines, there is a need for comparative, academically grounded evaluations of the depth, correctness, and contextual appropriateness of AI outputs (Kasneci et al., 2023). This study strives to fill that gap with an organized, expert-evaluated comparison between ChatGPT-5 and DeepSeek, built around applied business disciplines.
The purpose of this paper is to contrast these two tools from an educational perspective, specifically in the business world, by examining the accuracy of their reasoning in response to relatively challenging questions in the fields of human resources, accounting, and economics. The significance of this study derives from the fact that DeepSeek is relatively new, and academic research on its practical performance relative to ChatGPT is still emerging. This paper assesses the accuracy, clarity, conciseness, systematic reasoning, and bias potential of the responses provided by ChatGPT 5.0 and DeepSeek to challenging qualitative questions (case studies) in the field of human resources and challenging quantitative questions in the fields of accounting and economics. The study examines in depth the solutions provided by these two tools, as evaluated by academics, thereby providing valuable insights into the contributions of both in academia.

2. Literature Review

2.1. ChatGPT

With the enhanced capabilities of natural language processing (NLP), ChatGPT has evolved into a dynamic learning medium with an ability to produce human-like, contextually appropriate interactions on a vast number of topics (Mourtajji & Arts-Chiss, 2024). Its versatile nature enables users to ask follow-up questions, paraphrase questions, and engage in interactive back-and-forth exchanges, which creates opportunities for iterative learning while simultaneously reducing extraneous cognitive load and promoting linear, explicit processing of information (Lee et al., 2024). This aligns with Cognitive Load Theory, which focuses on structuring the learning experience to manage the cognitive effort required to understand complex teaching material (Sweller, 2011). In addition, ChatGPT draws on numerous sources, which adds access to current information useful in fast-changing fields of study such as technology, science, and medicine (Kocoń et al., 2023). To be realistic about the limits of ChatGPT in contextually optimized analysis, it still lags behind domain-specific models such as DeepSeek, which are designed for particular purposes such as binary classification or specific scientific inquiry (Jiang et al., 2025). The generality of ChatGPT may leave gaps when extracting information in comparison with DeepSeek, with Jiang et al. (2025) recommending the use of a context-driven resource in such cases. That said, as an ever-evolving model focused on language, the addition of interactivity means that ChatGPT holds real promise for the user experience within an educational pedagogy (Karakose et al., 2023).
Recent studies have started exploring ChatGPT’s ability to perform on subject-matter-specific academic evaluations, identifying both strengths and weaknesses relevant to this study’s research questions. For example, Geerling et al. (2023) evaluated ChatGPT on standardized economics exams using the Test of Understanding in College Economics (TUCE). They found that ChatGPT tested at the 99th percentile for macroeconomics and the 91st percentile for microeconomics compared to actual university students, indicating very strong ability with typical multiple-choice or problem-set formats in economics content. However, these findings raised questions about the extent to which this performance carries over to open-ended, applied accounting work or HR policy analysis, domains where nuance, disciplinary norms, and context matter more than selecting an answer choice from a textbook. Thus, Geerling et al. (2023) point to the risk of over-reliance on pattern matching rather than reasoning grounded in domain-specific standards.

2.2. DeepSeek

DeepSeek is regarded as a unique AI platform that integrates both vision and language to support learning across multiple modalities, including text, imagery, and diagrams, especially in science education (Lu et al., 2024; Kotsis, 2025). DeepSeek can analyze complex representations in high-resolution diagrams, charts, and images to minimize incidental cognitive load, ideally increasing student engagement in line with Cognitive Load Theory, which treats cognitive load as a dimension of mental effort to be managed through structured, dual-modality learning design (Sweller, 2011). DeepSeek can examine visual inputs rapidly, including figures and imagery from laboratory-based content, which allows it to translate complex or difficult information into simple, structured, and coherent representations that serve all learners (Guo et al., 2025; Jiang et al., 2025). DeepSeek’s blend of complex language and visual reasoning provides a genuine resource for engaging learners in STEM contexts, where educators require students to clearly articulate their understanding of concepts and where dual or combined modality is necessary for learning and engagement (Maresova et al., 2019). DeepSeek can support students in deepening their engagement with abstract content, whether in a classroom setting or as a stand-alone experience, allowing for a full range of experiences in science education (Shafee et al., 2024).
Although DeepSeek is more recent and has been less studied in HR and economics specifically, there is still empirical work with useful signals, especially regarding reasoning and factual consistency. In a study conducted by Jahin et al. (2025), DeepSeek models were evaluated on mathematical reasoning benchmarks. Of relevance, one variant of DeepSeek (DeepSeek-R1) achieved the highest accuracy on two out of three benchmarks. This finding suggests that, for tasks requiring strong logical sequencing, DeepSeek shows promising signs of state-of-the-art performance.

2.3. ChatGPT in Education

2.3.1. Pedagogical Applications

As ChatGPT continues to grow in popularity across a wide variety of educational contexts, its usefulness in supporting many academic tasks remains appealing (Mourtajji & Arts-Chiss, 2024). It is frequently used for tasks such as summarizing dense source material, tutoring students through difficult content, drafting responses or assessments, and expanding on complex theories and ideas, all in a conversational style. ChatGPT’s features are built to mimic how humans communicate by being quick, responsive, and able to personalize learning. This can afford learners agency, allowing them to learn remotely at their own pace, explore content in a participative manner, and return to review or clarify any aspect of the material needing focus (Ciampa et al., 2023; Boustani et al., 2024). In disciplines like accounting and economics, this AI tool can serve as responsive learner support (or facilitation), whereby students are supported through sequential stages of problem-solving or inquiry, including analysis, calculations, justification of choices, and ethical considerations for future action (Kasneci et al., 2023; Küçükuncular & Ertugan, 2025). This process contributes to the fundamental nature of problem-based learning and reinforces critical analysis skills. In human resource management, ChatGPT can provide thorough explanations of theories, models, case studies, and examples, which is particularly engaging. These interactions help guide students in taking theory into practice and provide opportunities for reflection. ChatGPT is a dynamic learning agent with the potential to increase learning value across disciplines through its variety of features.

2.3.2. Advantages and Benefits

The literature outlines many benefits of using ChatGPT from an academic perspective. One overarching benefit is that it provides timely responses to a wide variety of questions, which increases the efficiency of learning. ChatGPT is available 24 h a day, giving students multiple entry points for self-directed learning regardless of time and distance. This access to learning, combined with the ability to reflect on and revise one’s understanding of concepts across a learning cycle, supports better independent learning and individual academic support (Mai et al., 2024). Moreover, ChatGPT assists with language difficulties that students may encounter, particularly non-native English speakers who may struggle to apply language in academic discourse. ChatGPT provides explanations of grammatical structure, vocabulary, and other academic writing supports, improving the standard of academic writing in conveying complex and nuanced messages. Additionally, ChatGPT can give students access to digital tools and support their effective use, improving the development of digital literacy skills that are becoming increasingly vital in postsecondary education and the workforce (Jo, 2024). Taken together, these benefits can promote equity, access, and skill development in education, thereby enhancing students’ learning.

2.3.3. Risks and Limitations

ChatGPT is arguably the best-known and most frequently used academic AI tool. Because it is so readily available and versatile, it has become the favored AI resource. However, there are serious risks to consider in using it. While it can serve as a powerful educational tool to enhance learning and complete academic tasks, education and learning with AI must always be approached with care. One major risk of AI tools like ChatGPT is that they may unintentionally facilitate academic dishonesty. This risk arises when a student delegates an assignment to the AI tool without engaging with the course material, fulfilling the assignment somewhat mechanically. This can result in students becoming detached from course content, exercising less independent thought and problem-solving, and becoming over-reliant on the AI for learning (Zawacki-Richter et al., 2024). Another limitation is that AI-generated responses may, at times, contain incorrect or false statements. This is particularly relevant in quantitative areas of assessment, such as accounting, finance, and economics, which rely on confident statistical and methodological assertions. Students using the AI need to verify outputs to avoid inaccuracies that lead to misunderstanding (Khan & Umer, 2024). While ChatGPT has clear advantages as a supportive tool in academia, it is essential for users to use it responsibly and to apply regular human oversight to sustain academic integrity and quality outputs.

2.4. DeepSeek as an Emerging AI Tool

While its widespread adoption allowed ChatGPT to become the dominant AI-in-education tool and to attract substantial discussion and study in the educational sector, DeepSeek, a newer Chinese entrant, offers significant alternatives and competitive features. Developed with bilingual support and trained on multilingual datasets, DeepSeek has positioned itself as a competent alternative with domain-specific performance, especially in STEM and the generation of technical writing (Liu et al., 2024).
Due to its relatively recent entry into the market, research and professional academic assessments of DeepSeek in the field of education are still rare. However, preliminary benchmarks show that its potential, features, and capabilities are highly comparable to, or in particular technical tasks even exceed, those of ChatGPT (Liu et al., 2024).

2.5. Comparative Use in HR, Accounting, and Economics Education

Comparative studies between AI tools, including ChatGPT, DeepSeek, and other LLMs, are inconsistent but necessary. In HR education, for instance, ChatGPT’s contextual richness and flexibility allow students to inquire about scenarios, essays, case studies, subjective evaluations, theories, dilemmas, and more. In accounting and other quantitative fields, its ability to generate step-by-step solutions makes it extremely valuable for learning, finding solutions, and understanding the rationale behind each solution (Ranta et al., 2023; Belizón et al., 2024). A more structured comparison using standardized questions across both models can provide valuable insight into which tool generates more accurate, relevant, practical, and useful responses (Chauhan et al., 2024).
Studies comparing ChatGPT and DeepSeek show mixed outcomes depending on purpose and domain. For example, DeepSeek-R1 considerably outperformed ChatGPT-4 on pediatric board exam-type questions (98.1% vs. 82.7%), signaling an advantage on structured, domain-specific input (Mansoor et al., 2025). However, ChatGPT was found to be more flexible and rhetorically effective in academic writing and communication, while DeepSeek was more grammatically accurate and stronger on factual checking (AlAfnan, 2025). Likewise, DeepSeek performed better than ChatGPT in answering radiation therapy questions in Chinese, but ChatGPT outperformed DeepSeek on follow-up care tasks in English (Luo et al., 2025). These contradictions indicate that AI performance on specific tasks is context sensitive and shaped by the purpose and domain of the input. Hence, research questions need to be clearly defined before drawing conclusions about the models’ potential in contexts such as HR, accounting, or other writing tasks.

3. Materials and Methods

This research examines two large language models (LLMs), ChatGPT-5.0 and DeepSeek, to assess how readily they can create academic-style responses to discipline-specific prompts in accounting, economics, and human resources (HR) problems/scenarios. The goal of this study is to evaluate and compare the models’ ability to achieve accuracy, clarity, conciseness, and impartiality on a predetermined, discipline-specific set of academic prompts created by experts. This approach can demonstrate each model’s capacities and limitations in applied business education contexts, aligning with calls for critical evaluation of the use of AI tools in academic contexts (Kotsis, 2025; Kasneci et al., 2023; Y. K. Dwivedi et al., 2023).
The project was implemented in three main phases. In phase one, the authors developed six academic prompts: two challenging undergraduate-level exercises in accounting, two challenging prompts in economics, and two analytical case studies in HR. The prompts were developed by reviewing the literature on core undergraduate content in accounting, economics, and HR and by consulting academic practitioners in each field about their approaches to creating academic prompts. The authors created analytical and applied reasoning questions that go beyond simple factual recall, to provide a more meaningful measure of how the LLMs handle discipline-specific complexity (Zawacki-Richter et al., 2019).
Second, both ChatGPT-5.0 and DeepSeek were given the same six prompts simultaneously. There were no follow-up prompts or clarifications, providing a standardized data-gathering procedure. The responses were recorded in their original form to preserve the data and avoid changes made by the researcher during the interaction. The six prompts (two from each of the three domains) were chosen to address concerns about difficulty level and representativeness. They were based on university-level textbooks and prior exam papers and reflected expectations of typical undergraduate learning. The prompts were provided to subject-matter experts in HR, accounting, and economics, who evaluated them for alignment with disciplinary learning outcomes and academic rigor. This expert review also served as a check on cognitive demand (i.e., factual, analytical, applied reasoning). The revision process supported content reliability and confirmed that no prompt was disproportionately more or less difficult than the others, despite the absence of a formal pilot test.
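As an illustration of this single-turn protocol, the following minimal sketch shows how identical prompts could be sent to both models and recorded verbatim. It assumes API access via OpenAI-compatible chat endpoints (the study does not specify whether an API or the web interface was used), and the model identifiers, API keys, and prompt placeholders are hypothetical.
```python
# Minimal sketch of the single-turn, no-follow-up prompting protocol (assumption:
# OpenAI-compatible endpoints; placeholder model names, keys, and prompt text).
from openai import OpenAI

PROMPTS = {
    "accounting_1": "…",  # the six prompts from Appendix A, Appendix B, and Appendix C
    "economics_1": "…",
    "hr_case_1": "…",
}

clients = {
    "chatgpt-5": OpenAI(),  # reads OPENAI_API_KEY from the environment
    "deepseek": OpenAI(base_url="https://api.deepseek.com", api_key="<DEEPSEEK_KEY>"),
}
models = {"chatgpt-5": "gpt-5", "deepseek": "deepseek-chat"}  # placeholder identifiers

responses = {}
for label, client in clients.items():
    for prompt_id, prompt in PROMPTS.items():
        reply = client.chat.completions.create(
            model=models[label],
            messages=[{"role": "user", "content": prompt}],  # single turn, no clarification
        )
        # stored verbatim, as in the study's data-preservation step
        responses[(label, prompt_id)] = reply.choices[0].message.content
```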
Third, the outputs of each model were evaluated separately by a panel of 12 academic experts in accounting, economics, and human resource management (four experts from each field). In other words, four academic raters rated the two models’ responses in each of the three domains. These raters have at least eight years of professional teaching experience and a PhD in their field. Each rater headed a group of two to three assistants who supported the rating process. These assistants are instructors or lecturers in the same field, with fewer years of experience. The tiered structure enabled primary raters to maintain consistent, accountable scoring while also harnessing the varied perspectives of junior raters. This structure eased the assessment, reduced bias, improved the quality of judgment, and enabled capacity building. The rationale for requiring expert raters with a PhD and at least eight years of experience in undergraduate education was to ensure a high order of disciplinary expertise, familiarity with pedagogy, and assessment literacy. Such experience is particularly important when judging AI-generated responses against an academic standard, as it enables raters to make consistent, nuanced judgments about content. To address bias introduced by individual perspectives or disciplinary loyalties, each rater was accompanied by two or three assistants with fewer years of experience in academia, whose purpose was to provide a second input and validate the primary rater’s evaluations. Thus, all final judgments reflected a wider community of intellectual perspectives instead of relying solely on an individual expert’s judgment. The reviewers used a five-dimensional rating scale (adapted from the literature on LLM evaluation) to rate each response with respect to (1) accuracy, (2) clarity, (3) conciseness, (4) systematic reasoning, and (5) likelihood of bias (Borji, 2023). Reviewers evaluated outputs on a three-point scale: 1 = low, 2 = moderate, 3 = high. The choice of a 3-point Likert scale for assessing AI-generated responses was deliberate, providing a compromise between simplicity and reliability of rater judgments. Coarser scales reduce ambiguity and rater fatigue, which is particularly useful given the technicality of the prompts across multiple domains. This supports higher inter-rater agreement because it mitigates the subjective interpretation issues associated with 5-point and 7-point scales. The 3-point scale also effectively captures the critical distinctions between poor, acceptable, and excellent responses that matter for the overall assessment of factuality, relevance, and clarity in each domain (Finstad, 2010). Cohen’s Kappa coefficient was calculated for each dimension to assess inter-rater agreement using SPSS version 26. Kappa is a popular statistic in methodological studies of this kind because it measures agreement among reviewers beyond what would be expected by chance (Landis & Koch, 1977; Sim & Wright, 2005; Vieira et al., 2010). Inter-rater reliability was interpreted using Landis and Koch’s (1977) benchmarks: 0.00–0.20 = slight agreement; 0.21–0.40 = fair; 0.41–0.60 = moderate; 0.61–0.80 = substantial; 0.81–1.00 = almost perfect.
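To illustrate the reliability step, the following minimal Python sketch computes Cohen’s Kappa between two raters on a 3-point scale and maps the result to the Landis and Koch (1977) bands. The ratings shown are hypothetical and for illustration only; the study itself computed Kappa in SPSS version 26.
```python
# Minimal sketch (hypothetical ratings, not the study's data) of pairwise Cohen's
# Kappa on 3-point ratings, interpreted with the Landis and Koch (1977) bands.
from sklearn.metrics import cohen_kappa_score

def landis_koch(kappa: float) -> str:
    """Map a Kappa value to the Landis and Koch (1977) agreement label."""
    bands = [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"),
             (0.80, "substantial"), (1.00, "almost perfect")]
    return next(label for upper, label in bands if kappa <= upper)

# Hypothetical ratings by two raters on the same outputs (1 = low, 2 = moderate, 3 = high)
rater_a = [2, 2, 1, 3, 2, 1, 3, 2]
rater_b = [2, 2, 1, 3, 1, 1, 3, 2]

kappa = cohen_kappa_score(rater_a, rater_b)
print(f"kappa = {kappa:.2f} ({landis_koch(kappa)} agreement)")
```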
The panel members were academic and professional experts in their respective fields, each with at least eight years of experience, and were actively involved in teaching and peer-reviewed research on teaching and learning. The makeup of the panels spans a wide breadth of disciplines, thereby enhancing the credibility and rigor of the evaluations. Each panel was led by one of the authors of this paper. Although individual perspectives carry inherent biases, their influence on the evaluations and outcomes was limited, because the expert panelists were experienced scholars and the use of multiple raters and consensus provided comparable metrics that minimized individual bias. To evaluate inter-rater reliability among the expert raters, Cohen’s Kappa was calculated. For any cases of disagreement, a consensus process was used to arrive at a final rating that reflected a common evaluation and understanding of what was rated. Specifically, raters held discussion sessions to review cases with discrepant ratings, clarify criteria, and reach consensus on uncertain evaluations. The evaluations and analyses offer an opportunity to consider how LLMs can achieve comparable outcomes in academia and expand emerging discussions on the effective use of generative AI in higher education (Batista et al., 2024; Susnjak, 2024). The evaluative prompts/scenarios provided to ChatGPT and DeepSeek are included in Appendix A, Appendix B and Appendix C. The prompts were delivered to the two generative AI models, and the evaluations were conducted between 19 August and 22 August 2025. The flowchart of the methodological process is illustrated in Figure 1.

4. Results

This section compares the outputs produced by ChatGPT-5.0 and DeepSeek for the six prompts, which include two accounting problems, two economics problems, and two HR scenarios/case analyses. The raters were all university faculty members, and they were tasked with assessing the quality of each model output along five dimensions (accuracy, clarity, conciseness, systematic reasoning, and potential for bias). Each rater scored each dimension on a 3-point scale (1 = poor, 2 = moderate, 3 = excellent). Cohen’s kappa coefficients are also reported as measures of inter-rater reliability for each dimension. The scores reported in the tables below reflect the raters’ detailed ratings along with the overall averages.

4.1. Accounting Prompts

With reference to Table 1, ChatGPT 5.0 received consistently high scores in accounting for both accuracy (M = 2.0) and clarity (M = 2.0), with more moderate scores for conciseness (M = 1.5), systematic reasoning (M = 1.5), and potential for bias (M = 1.5). These data suggest that while the content was rated as correct and clear, raters saw limits in brevity, depth, and neutrality. In contrast, DeepSeek’s ratings were slightly lower overall, with a mean of 1.5 for accuracy and clarity and lower but consistent scores for conciseness (M = 1.0) and potential for bias (M = 1.0). These findings suggest that while DeepSeek’s responses were rated more consistently, they were rated as less accurate and less informative than ChatGPT’s.
Overall, the data indicate that ChatGPT 5.0 tends to prioritize accuracy and clarity, albeit with some limitations in conciseness, systematic reasoning, and potential for bias, whereas DeepSeek demonstrates more consistent but generally lower performance across these criteria, reflecting a balance between informativeness and consistency in AI responses.
In this domain, Cohen’s Kappa for ChatGPT 5.0 was 0.67, indicating a substantial level of agreement. Raters fully agreed on the evaluation of accuracy and clarity, while moderate differences of opinion on conciseness, systematic reasoning, and potential for bias weakened the overall reliability. DeepSeek’s similar value of 0.65 also implies a substantial level of agreement overall, although there was greater consensus on conciseness and potential for bias and more divergence on accuracy and clarity. An illustration of the accounting prompt results is presented in Figure 2.
An evident point relates to the potential for bias in ChatGPT-5.0’s accounting outputs. The average score for that criterion was 1.5 (Table 1), but evaluators did not score the model similarly. One evaluator assigned the highest score of 3, judging that the outputs raised little concern for bias, whereas the other three evaluators gave a score of 1 at the low end of the scale, as they considered the potential for bias in the outputs to be high. This difference in scoring demonstrates disagreement about how bias manifests in the outputs, and it was reflected in a lower Cohen’s Kappa compared with other fields such as HR and economics. The difference also reflects the evaluators’ interpretations, thresholds of concern, and professional backgrounds: the evaluator who scored it higher viewed the response as balanced or unbiased, at least in the accounting domain, whereas the others observed possible signs of bias or applied a stricter threshold of concern. This demonstrates the subjective nature of qualitative evaluation and the need for clear rubrics or calibration exercises to make evaluations more consistent across assessments.
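The following small sketch illustrates how the tabled scores aggregate from individual ratings. The bias split (one rater scoring 3, three scoring 1) and the unanimous accuracy and clarity scores follow the text above; the conciseness and reasoning splits are assumed, chosen only so the per-dimension means match Table 1.
```python
# Sketch of score aggregation: each dimension mean is the average of the four raters'
# 3-point scores, and the overall score is the mean across the five dimensions.
from statistics import mean

accounting_chatgpt = {
    "accuracy": [2, 2, 2, 2],              # M = 2.0 (raters fully agreed, per the text)
    "clarity": [2, 2, 2, 2],               # M = 2.0 (raters fully agreed, per the text)
    "conciseness": [1, 2, 1, 2],           # M = 1.5 (assumed split)
    "systematic_reasoning": [1, 2, 1, 2],  # M = 1.5 (assumed split)
    "bias": [3, 1, 1, 1],                  # M = 1.5 (split reported in the text)
}

dimension_means = {dim: mean(scores) for dim, scores in accounting_chatgpt.items()}
overall = mean(dimension_means.values())
print(dimension_means)             # per-dimension means as in Table 1
print(f"overall = {overall:.2f}")  # 1.70, matching the accounting aggregate in Table 4
```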

4.2. Economics Prompts

In economics (Table 2), ChatGPT 5.0 was rated relatively high for accuracy (M = 2.0), conciseness (M = 2.0), and systematic reasoning (M = 2.25), with somewhat lower but still favorable outcomes for clarity (M = 1.75) and bias (M = 1.75). This indicates generally good performance, although raters noted occasional inconsistencies in clarity and reasoning. DeepSeek, on the other hand, received notably strong evaluations, with perfect outcomes for accuracy, clarity, and bias (M = 3.0 each), along with very good evaluations for conciseness (M = 2.75) and reasoning (M = 2.75). This indicates that, in economics, DeepSeek consistently outperformed ChatGPT 5.0 across all evaluative dimensions, providing answers that were not only highly accurate but also clear, unbiased, and well-structured.
These results suggest that while ChatGPT 5.0 provides generally good performance with some variability in clarity and systematic reasoning, DeepSeek demonstrates consistently superior performance across all criteria in economics, excelling in accuracy, clarity, conciseness, systematic reasoning, and bias, which highlights its strength in delivering well-rounded and reliable responses within this domain.
With a value of 0.76, the Kappa measure for ChatGPT 5.0 indicates substantial, approaching almost perfect, inter-rater reliability. Ratings were unanimous for accuracy and conciseness, with minor variation in clarity, reasoning, and bias. DeepSeek attained a high value of 0.91, indicating almost perfect inter-rater reliability; raters were nearly unanimous across dimensions, particularly accuracy, clarity, and bias. An illustration of the economics prompt results is presented in Figure 3.

4.3. HR Scenarios

ChatGPT 5.0 also performed well in HR (Table 3), achieving the best average ratings in accuracy (M = 3.0), clarity (M = 3.0), and systematic reasoning (M = 3.0), and slightly lower, though still high, ratings in conciseness (M = 2.75) and bias (M = 2.5). This indicates that raters viewed ChatGPT’s HR outputs as both highly credible and of strong explanatory power. By comparison, DeepSeek’s ratings were lower, with means of 2.0 in accuracy, clarity, reasoning, and bias, and 2.5 in conciseness. Though fairly consistent, these ratings indicate that DeepSeek’s analyses of the HR cases were perceived as less insightful than ChatGPT’s.
The findings indicate that ChatGPT 5.0 delivers highly credible and well-reasoned responses in HR, consistently outperforming DeepSeek, whose generally lower ratings across all evaluative dimensions suggest a relative gap in the depth and insightfulness of its HR outputs.
ChatGPT 5.0 produced a Cohen’s Kappa of 0.92, indicating almost perfect agreement; raters were highly consistent, with only minimal disagreement on conciseness and bias. DeepSeek’s value of 0.61 indicates substantial agreement. While some dimensions, such as clarity and reasoning, showed strong consistency, disagreement on accuracy and conciseness reduced the overall reliability. An illustration of the HR prompt results is presented in Figure 4.

4.4. Comparative Analysis

This section presents a comparison of the educational rating performance of the two AI models. The means contained in the following table were derived from the ratings reported in Table 1, Table 2 and Table 3.
In the accounting domain, the aggregate results (Table 4) indicate that ChatGPT 5.0 was superior to DeepSeek (an overall score of 1.7 for ChatGPT vs. 1.25 for DeepSeek). All raters agreed that ChatGPT’s responses in this domain were slightly more thorough and balanced than DeepSeek’s.
In the field of economics, the evaluators almost unanimously agreed that DeepSeek was superior to ChatGPT (2.90 vs. 1.95). After group calibration of the raters across the five scoring criteria, the raters independently scored DeepSeek very high on all five criteria. In the debriefing, evaluators explained their scores and reasoning: they appreciated that the model followed the intended dialogue and that the responses were clearly formatted with explicit reasoning. Moreover, they noted that the reasoning strategies clearly outlined the solutions provided and that economic terms were explicitly referenced in the outputs. In general, evaluators described DeepSeek’s outputs as not only high quality but consistently high quality. In comparison, ChatGPT 5.0 received consistent but lower scores across all five criteria; while raters acknowledged that ChatGPT’s answers and explanations were clear and precise, they felt the responses lacked sufficient analytical depth and polish. All evaluators acknowledged that ChatGPT’s responses were well-informed; however, none felt they reached the level of sophistication or fluency in economics that DeepSeek demonstrated.
In the context of HR, ChatGPT 5.0 received higher ratings in all dimensions, with a mean of 2.85 compared with DeepSeek’s 2.10. Raters indicated that ChatGPT often used illustrative examples and/or theoretical bases in its responses, which supported understanding and logic in interpreting the HR case analyses. Although DeepSeek received acceptable ratings, raters stated that its answers were brief and easy to read but lacked detail and elaborative HR terminology.

5. Discussion

This research explored the domain-specific performance of two large AI language models, ChatGPT-5.0 and DeepSeek, in accounting, economics, and human resources, using a structured evaluative rubric based on accuracy, clarity, conciseness, systematic reasoning, and bias.
In the study, we explicitly used standardized, single-turn prompts to facilitate a controlled comparison between ChatGPT and DeepSeek across the fields of HR, accounting, and economics. By using standardized prompts, we were able to assess the models’ base-level performance in generating academic-style responses while limiting the effects of user prompting style and interaction skill. Real-world dialogue may generate variability that complicates comparability; by holding the prompt structure constant, we tested the models’ abilities under standardized conditions.
We generated academic prompts and then constructed a panel of professionals to rate the quality according to the rubric elements. Alongside investigating the models as academic tools, we wanted to investigate the models as learning tools, particularly in and alongside environments that are increasingly incorporating generative AI to support instruction, assessment, and content creation for higher education.
Within the accounting context, ChatGPT-5.0 demonstrated satisfactory performance in coherence and logical reasoning, offering defensible justifications aligned with its ratings, and it rated above DeepSeek on all assessment criteria. This supports the existing literature suggesting that while large language models can produce general overview content in technical contexts, clarity and conceptual coherence are often considerably weaker in domains that rely on rule-based reasoning and sequential solvability (Kasneci et al., 2023). This finding is consistent with Cognitive Load Theory (Sweller & Chandler, 1991), which suggests that learners (or, in this case, AI models) may struggle to process information when tasks impose high levels of both intrinsic and extraneous cognitive load, as is the case in accounting. Accounting tasks often demand rule-based reasoning, sequential problem solving, and conceptual integration across different regulatory and theoretical frameworks. DeepSeek’s weaker performance may also simply stem from its training data not aligning with domain-specific terminology, a common concern in the accounting education literature on AI explanations that over-rely on technical contexts (Brougham & Haar, 2018). The educational implications point to the strengths of ChatGPT-5.0, such as scaffolding accounting teaching, explaining concepts, or providing feedback on narratives; however, for content requiring numerical precision and compliance, it would be risky to endorse ChatGPT-5.0 as an independent, competent tutor.
In contrast, the data from the domain of economics show that DeepSeek performed better than ChatGPT-5.0, achieving perfect ratings on all dimensions. This supports recently reported evaluations of domain-adapted foundation models, which found that LLMs trained on domain-related datasets produced more coherent and less biased analyses than more general models (G. Yan et al., 2025). DeepSeek’s strong synthesis of economic theory and soundly structured explanations also reflects the benchmark tests administered by OpenAI on reasoning-intensive task performance. In the context of economics education, this kind of performance indicates that DeepSeek likely has great potential for advanced undergraduate students, especially for theory comparison, explanation of concepts, scenario modeling, and problem-solving. Placed in the context of Selwyn’s (2024) argument for conceptualizing the use of AI in education through the social sciences as a pedagogy of inquiry, there still needs to be critical interrogation of the sources, structure, and assumptions underlying AI-generated outputs. This difference in performance may be explained by Cognitive Load Theory (Sweller & Chandler, 1991). Economics requires complex reasoning, careful problem-solving, and abstract concepts, all of which demand significant mental effort from students. DeepSeek, trained on such content, can help reduce a student’s extraneous cognitive load (mental effort that does not contribute to learning) by providing clearer, more organized, and relevant answers. With such responses, students can focus mentally on understanding and applying ideas in economics, which the theory describes as germane cognitive load that supports deep learning. From a learning perspective, this suggests that domain-based AI tools like DeepSeek can behave more like an expert tutor than a general model like ChatGPT-5.0.
The HR domain illustrated the strengths of ChatGPT-5.0 in handling behavioral, qualitative prompts. The model obtained the strongest possible scores in accuracy, clarity, and reasoning, and was fairly strong in conciseness and neutrality. These results corroborate earlier findings that ChatGPT consistently outperformed other generative AIs in empathy, organizational behavior, and decision-making situations, elements of HR education that we recognize as having significant dimensions (D. Dwivedi, 2025). Specifically, ChatGPT-5.0’s responses discussed advanced HR concepts, frameworks, terminologies, and principles required in reflection-based assessments. This reconciles with L. Yan et al. (2024), who noted that, with appropriate use, generative AI can support reflective practice and scenario-based learning in professional education. These characteristics are paramount for effective professional and reflective practice. ChatGPT-5.0’s proficiency in responding to behavioral, qualitative, and reflective prompts in the HR cases aligns with Social Constructivist Learning Theory (Palincsar, 2012), which holds that knowledge is co-constructed through dialogue and contextual understanding. Generative AI, when applied to reflective or scenario-based learning, can engage a similarly interactive process, allowing for meaning-making rather than mere recall of content.
To summarize, the findings show clear affordances and constraints of each AI model in its respective domains: ChatGPT-5.0 appears better suited to educational contexts involving human-centered reasoning, compliance, and organization-oriented practices (e.g., HR, leadership, ethics, and accounting), while DeepSeek showed a greater capacity for technical and conceptual abstraction in matters less tied to internal organizational practices (notably economics). In our review of the educational technology literature, we found support for earlier arguments regarding AI and education that the question is not simply whether AI is appropriate for classroom pedagogical purposes but which model is appropriate for which pedagogical task (Luckin, 2025; Albuquerque & Gomes dos Santos, 2024).
Both systems also carry non-trivial caveats that educators and curriculum developers need to consider. ChatGPT-5.0 provided comprehensive information but, at times, was verbose and repetitive, which means educators will need to work with students to identify what matters and what does not. While DeepSeek provides concise and logical answers (at least in the language of economics), it may overlook nuances when interpretive judgment is assumed to be within the content. The possibility of relying on AI uncritically echoes cautions in the literature about the consequences of over-reliance on AI in learning environments without training students to assess critically what they are given (i.e., because these models are built on stochastic processes, they may at times provide illogical answers).
The comparison across these disciplines suggests that the variable task performance of ChatGPT 5 and DeepSeek is associated with the models’ linguistic training and domain specializations. In terms of relative strengths, ChatGPT 5’s stronger performance on accounting and human resources tasks suggests enhanced contextual reasoning and interpretive depth within rule-based and recognizably human-centered domains. Conversely, DeepSeek’s higher overall performance on economics tasks suggests closer alignment with analytic, data-driven reasoning. This cross-disciplinary variation indicates that the selection of an AI tool should take into account the domain’s discourse and epistemic demands. Educators and curriculum designers could therefore use generative AI tools more deliberately, drawing on ChatGPT for explaining concepts and applied cognitive processes, and on DeepSeek for analytical modeling and quantitative analysis.
Overall, this study demonstrates an evolving space for generative AI within educational contexts and calls for intentional applications informed by pedagogy, domain demands, and critical engagement. Future research should examine model efficacy in multilingual contexts, collaborative group settings, and real-time in-class interactions to further build understanding of AI that can aid, rather than replace, education provided by people.

6. Conclusions

The current study compared two AI-based large language models, ChatGPT-5.0 and DeepSeek, assessing the correctness, clarity, conciseness, logical reasoning, and neutrality of their responses to academic prompts in accounting, economics, and human resources. These quality dimensions are essential in educational and scientific contexts. Following expert review, ChatGPT-5.0 was found to perform better in the human-centered discipline of HR and moderately better in accounting, while DeepSeek was rated better in economics. The findings support the view that such AI-based tools can be useful to educators, students, and researchers in scaffolding the generation of structured, semi-synthesized content for curriculum planning, concept review, and instructional design. A caveat remains, however, since LLMs currently come with constraints related to generalizability, reasoning, and corpus dependence, particularly in contexts that require precision or involve ethically sensitive matters. As Treves (2025) notes, AI technologies carry the potential for substantial positive consequences for education and beyond; responsible and evidence-informed use will be required in practice to minimize the potential for misinformation, bias, and weak accountability. These findings are consistent with prior studies examining AI in education (Fošner, 2024; Okulich-Kazarin et al., 2024; Iatrellis et al., 2024) and call for holistic, interdisciplinary studies to help educators maximize benefits while managing emerging risks.
Based on the above, some specific recommendations can be made for the responsible use of large language models in educational and research settings. First, institutions should develop clear usage guidelines for individual academic disciplines regarding when and how AI tools may be used in the curriculum, especially in curricular development, evaluation, and content creation. These guidelines should include the principles of transparency, validation of sources, and human oversight, especially for tasks where interpretive nuance or ethical sensitivity is at play. Second, we recommend integrating AI literacy into the training of both educators and students, covering not just technical literacy but also critical engagement with AI-generated content. This would include an understanding of how AI generates output, the limits of the respective models, and how to detect bias. Third, continued collaborative research between educators, data scientists, and ethicists to develop an evaluation framework for assessing AI-generated performance on dimensions of accuracy, fairness, epistemic integrity, and pedagogical value will be important. Finally, universities and research institutions could consider developing or working with AI committees or ethics review boards, especially for AI-integrated projects, to evaluate risk, scope, and accountability as this technology continues to evolve.
Future research will have to address the ethical, epistemological, and pedagogical implications of the use of generative AI in academia. We recommend research involving professionals from a variety of scholarly fields to document users’ experiences, including those of faculty, students, and policy makers, regarding the benefits, concerns, and applications of these technologies in teaching, learning, and research. Longitudinal and iterative methodologies may produce useful insights into how model updates affect the quality of responses over time; for example, one could investigate changes in quality by re-prompting the models with the same pairs of prompts over different time horizons. The study nevertheless has limitations. First, although inter-rater reliability was high, the evaluation of AI outputs was fundamentally a judgment call made by a small number of experts; expert judgement, although informed by scholarly purposes, is still subjective. Second, all prompts and assessments were conducted in English; given the multilingual capabilities of these models, it would be interesting to replicate the same design in other languages to see whether a similar pattern emerges. Third, we did not encounter many issues with hallucination or misinformation in the responses obtained during this study; this could simply reflect the nature of the questions asked or the way the prompts were phrased. Even with a low incidence, hallucination remains relevant in high-stakes educational or professional settings, especially in cases that entail factual accuracy or ethical accountability. Other disciplines, specifically the arts and humanities, law, medicine, or mathematics, may reach different conclusions and warrant future consideration. As Nahar et al. (2024) discuss, the potential of LLMs to generate plausible but incorrect responses is certainly not something to consider lightly. Fourth, the prompts were evaluated by academics from a purely academic perspective; other studies could evaluate the efficiency of these AI tools from a more practical perspective by including both academic and practitioner panelists. Fifth, only two prompts or scenarios were used in each domain, meaning the topics assessed in each field are relatively few. Other studies could create more inclusive prompts or exercises covering a wider variety of questions to minimize bias, or could address other disciplines that might reach different conclusions. Sixth, this investigation takes a cross-sectional approach and is mainly predicated on subjective expert evaluations; longitudinal studies and more quantitative methodologies, such as tracking user performance over time or analyzing larger datasets, could shed greater light on the lasting efficacy and practical applicability of AI models in academic settings. Seventh, the absence of a formal pilot test for the academic prompts is a limitation, as it could potentially influence the construct validity of the task design and the degree to which the prompts captured the intended disciplinary complexity.
In summary, while we investigated the affordances of AI models like ChatGPT-5.0 and DeepSeek to support educational and academic inquiry, we also want to caution the reader about the need for responsible use of these AI systems, methodological rigor, and ongoing evaluation. The rapid development of new and powerful AI tools means that such practices are becoming essential for understanding, governing, and adopting AI in a way that encourages the responsible production and sharing of knowledge.

Author Contributions

All authors contributed equally to this work. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Accounting Prompts

  • Problem 1:
On 1 January 2010, Parker Company purchased 95% of the outstanding common stock of Sid Company for $160,000.
At that time, Sid’s stockholders’ equity consisted of common stock, $120,000; other contributed capital, $10,000; and retained earnings, $23,000.
On 31 December 2010, Sid Company’s trial balance showed internally generated net income of $26,000 and dividends declared of $20,000.
  • Required:
Prepare a consolidated statements work paper on 31 December 2010, using the cost method.
Calculate the NCI on 31 December 2010.
  • Problem 2:
Plantation Homes Company is considering the acquisition of Condominiums, Inc. early in 2008. To assess the amount it might be willing to pay, Plantation Homes makes the following computations and assumptions.
A. Condominiums, Inc. has identifiable assets with a total fair value of $15,000,000 and liabilities of $8,800,000. The assets include office equipment with a fair value approximating book value, buildings with a fair value 30% higher than book value, and land with a fair value 75% higher than book value. The remaining lives of the assets are deemed to be approximately equal to those used by Condominiums, Inc.
B. Condominiums, Inc.’s pretax incomes for the years 2005 through 2007 were $1,200,000, $1,500,000, and $950,000, respectively. Plantation Homes believes that an average of these earnings represents a fair estimate of annual earnings for the indefinite future. The following are included in pretax earnings:
Depreciation on buildings (each year) 960,000
Depreciation on equipment (each year) 50,000
Extraordinary loss (year 2007) 300,000
Sales commissions (each year) 250,000
C. The normal rate of return on net assets is 15%.
  • Required:
Assume further that Plantation Homes feels that it must earn a 25% return on its investment and that goodwill is determined by capitalizing excess earnings. Based on these assumptions, calculate a reasonable offering price for Condominiums, Inc. Indicate how much of the price consists of goodwill. Ignore tax effects. (Perpetuity).

Appendix B

Economics Prompts

  • Problem 1:
Assume that Rami’s short-run total cost function is: C = q^3 − 10q^2 + 17q + 66.
Determine the output level at which he maximizes his profit if p = 5. Compute the output elasticity of cost at this output.
  • Problem 2:
Dan has a utility function u(w) = √w, where w is his wealth. All his initial wealth, equal to $36, is deposited at bank M. With a probability of 0.5, this bank can become bankrupt. Had this happened, Dan would have gotten only $4 guaranteed by the government. A risk-neutral firm N proposes that Dan purchase his problem deposit (before the uncertainty is resolved) for $X.
(a) Find all values of X that are mutually beneficial for Dan and firm N, and provide a graphical solution.
(b) Suppose that X = 20. A corrupt manager from Bank M possesses information about the bank’s position and can say with certainty whether bankruptcy will take place. He offers to sell this information. What is the maximum amount that Dan is willing to pay for this information? Provide an algebraic solution and illustrate it on a diagram with contingent commodities.
(c) Suppose that Zara faces the same problem as Dan, but she is risk-neutral. Find the maximum sum that Zara is willing to pay for the information offered by the corrupt manager described in (b) and compare it with the maximum sum that Dan is willing to pay. Illustrate on the same graph.
(d) Compare the maximum prices found in (b) and (c). Would the result of this comparison be different if Dan had different preferences but the same type of risk attitude?
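For part (a) of Problem 2, the mutually beneficial range of X can be sketched as follows, assuming the utility function is u(w) = √w as reconstructed above: Dan accepts any X at or above his certainty equivalent, while the risk-neutral firm offers at most the expected value of the deposit. This is an illustrative computation, not drawn from the original prompt or from the evaluated model outputs.

```python
import math

# Hedged sketch for Problem 2(a): range of X acceptable to both Dan and firm N,
# assuming u(w) = sqrt(w), wealth of 36 kept with prob. 0.5, and 4 recovered otherwise.

p_bankrupt = 0.5
w_good, w_bad = 36.0, 4.0

# Dan's expected utility from keeping the deposit, and its certainty equivalent
eu_keep = (1 - p_bankrupt) * math.sqrt(w_good) + p_bankrupt * math.sqrt(w_bad)  # 4.0
certainty_equivalent = eu_keep**2                                               # 16.0

# The risk-neutral firm N values the deposit at its expected payoff
expected_value = (1 - p_bankrupt) * w_good + p_bankrupt * w_bad                 # 20.0

print(f"Dan accepts if X >= {certainty_equivalent:.0f}")
print(f"Firm N accepts if X <= {expected_value:.0f}")
print(f"Mutually beneficial range: {certainty_equivalent:.0f} <= X <= {expected_value:.0f}")
```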

Appendix C

Human Resources Prompts

  • Case Analysis 1:
Hiba, a business marketing graduate with one year of experience in the retail field, has finally received an email from The Cologne Corner Company inviting her to an interview for the “Customer Service Specialist” job vacancy. On the day of the interview, the company’s receptionist welcomed Hiba and invited her to enter the interview room. Once she stepped into the room, Hiba realized that it might be challenging for her to be interviewed by the head of the human resources department, the head of the customer service department, and the retail manager all together.
During the interview, it became clear that Hiba was not the ideal candidate for the position because of her limited experience in dealing with customers and customer complaints. Yet the interviewers were delighted to offer Hiba the job because they found her considerably better than the other four applicants who had been interviewed earlier that day.
Hiba got the job and worked hard to prove herself a valuable asset to the company. However, as part of the annual appraisal program implemented by the company, the head of the customer service department, the other customer service specialists, and the customer service assistants were all required to evaluate Hiba’s performance by filling in an assessment report and returning it to the HR department. In addition, the HR department reviewed the customer service surveys submitted by the clients whom Hiba had served.
After the assessment results were reviewed and analyzed, Hiba received an email from the HR department thanking her for continuously showing a courteous attitude towards customers and informing her that she would be invited to attend a business skills workshop, which would ultimately help her enhance her verbal and non-verbal communication skills with customers.
  • Questions:
  • What is the interviewing method adopted by the company? Discuss the advantages associated with such an interviewing approach.
  • Which form of bias affected the interviewers’ decision? What would be its consequences on the recruitment decision? Elaborate.
  • What is the performance appraisal method adopted by the company to assess Hiba’s performance? How does this method affect the credibility of the results obtained?
  • Do you believe that it is necessary to communicate the performance appraisal results to Hiba? How would this affect her performance? Discuss.
  • Case Analysis 2:
Ava Manufacturing is a privately owned company established in 1995 in Lebanon. Its daily activities include importing, promoting, and distributing a variety of high-end, reliable medicines and food supplements. Kareem is a job analyst at Ava Manufacturing. His job entails determining the skills, duties, and knowledge required for performing the jobs in the organization.
Kareem intended to analyze the job of the financial analyst, who would retire soon, in order to prepare the corresponding job description and submit it to the HR manager. He chose to observe the financial analyst performing his job tasks and took notes of what he saw. After receiving the job description, the HR manager rejected it and required Kareem to analyze the job again using a different job analysis method because of the inadequacy of the method used.
After Kareem re-conducted the job analysis for the aforementioned job and submitted it, the HR manager decided to fill the vacancy with a current employee. Therefore, an announcement was communicated to the employees through the company’s website and through printed announcements pinned on bulletin boards near the elevators. The announcement also invited current employees who believed they possessed the required qualifications to apply for the vacancy by sending their resumes to the HR department’s email.
A while later, after realizing that none of the internal employees were suitable for the job opening, the HR manager decided to look for potential candidates outside the organization. He therefore chose to sponsor a regional conference for financial experts by providing the location, food, refreshments, and stationery needed during the conference, and, in the meantime, to inform the attendees about the vacancy. The HR manager believed that sponsoring the conference would let the attendees know that the company was recruiting and attract them to apply for the financial analyst position.
  • Questions:
  • What is the job analysis method used by Kareem in analyzing the job of a financial analyst? Why do you think the HR manager considered this method inadequate?
What is the internal recruitment method that the HR manager initially used to attract employees from within the firm to apply for the financial analyst vacancy? Use evidence from the text to support your answer.
  • Identify the external recruitment method used by the HR manager after realizing that no one in-house is qualified enough to fit the position. Do you believe that other external recruitment methods might be more effective in attracting the candidates?
  • Why do you believe the HR manager announced the vacancy internally before he did externally? Discuss your answer.

References

  1. Aad, S., Dagher, G. K., & Hardey, M. (2024). How does cultural upbringing influence how university students in the Middle East utilize ChatGPT technology? Administrative Sciences, 14(12), 330. [Google Scholar] [CrossRef]
  2. AlAfnan, M. A. (2025). Deepseek vs. ChatGPT: A comparative evaluation of AI tools in composition, business writing, and communication tasks. Journal of Artificial Intelligence and Technology, 5, 202–210. [Google Scholar] [CrossRef]
  3. Albuquerque, F., & Gomes dos Santos, P. (2024). Can ChatGPT be a certified accountant? Assessing the responses of ChatGPT for the professional access exam in Portugal. Administrative Sciences, 14(7), 152. [Google Scholar] [CrossRef]
  4. Arce, C. M., Gavilanes, J. C., Arce, E. M., Haro, E. M., & Jurado, D. B. (2025). Artificial Intelligence in higher education: Predictive analysis of attitudes and dependency among Ecuadorian university students. Sustainability, 17(17), 7741. [Google Scholar] [CrossRef]
  5. Batista, J., Mesquita, A., & Carnaz, G. (2024). Generative AI and higher education: Trends, challenges, and future directions from a systematic literature review. Information, 15(11), 676. [Google Scholar] [CrossRef]
  6. Belizón, M. J., Majarín, D., & Aguado, D. (2024). Human resources analytics in practice: A knowledge discovery process. European Management Review, 21(3), 659–677. [Google Scholar] [CrossRef]
  7. Borji, A. (2023). A categorical archive of ChatGPT failures. arXiv, arXiv:2302.03494. [Google Scholar] [CrossRef]
  8. Bou Reslan, F., & Jabbour Al Maalouf, N. (2024). Assessing the transformative impact of AI adoption on efficiency, fraud detection, and skill dynamics in accounting practices. Journal of Risk and Financial Management, 17(12), 577. [Google Scholar] [CrossRef]
  9. Boustani, N. M. (2025). ChatGPT—A Stormy Innovation for a sustainable business. Administrative Sciences, 15(5), 184. [Google Scholar] [CrossRef]
  10. Boustani, N. M., Sidani, D., & Boustany, Z. (2024). Leveraging ICT and generative AI in higher education for sustainable development: The case of a Lebanese private university. Administrative Sciences, 14(10), 251. [Google Scholar] [CrossRef]
  11. Brougham, D., & Haar, J. (2018). Smart technology, artificial intelligence, robotics, and algorithms (STARA): Employees’ perceptions of our future workplace. Journal of Management & Organization, 24(2), 239–257. [Google Scholar]
  12. Chauhan, D., Singh, C., Rawat, R., & Dhawan, M. (2024). Evaluating the performance of conversational AI tools: A comparative analysis. In Conversational artificial intelligence (pp. 385–409). Wiley. [Google Scholar]
  13. Ciampa, K., Wolfe, Z. M., & Bronstein, B. (2023). ChatGPT in education: Transforming digital literacy practices. Journal of Adolescent & Adult Literacy, 67(3), 186–195. [Google Scholar] [CrossRef]
  14. Dwivedi, D. (2025). Emotional intelligence and artificial intelligence integration strategies for leadership excellence. Advances in Research, 26(1), 84–94. [Google Scholar] [CrossRef]
  15. Dwivedi, Y. K., Kshetri, N., Hughes, L., Slade, E. L., Jeyaraj, A., Kar, A. K., Baabdullah, A. M., Koohang, A., Raghavan, V., Ahuja, M., Albanna, H., Albashrawi, M. A., Al-Busaidi, A. S., Balakrishnan, J., Barlette, Y., Basu, S., Bose, I., Brooks, L., Buhalis, D., … Wright, R. (2023). Opinion Paper: “So what if ChatGPT wrote it?” Multidisciplinary perspectives on opportunities, challenges and implications of generative conversational AI for research, practice and policy. International Journal of Information Management, 71, 102642. [Google Scholar] [CrossRef]
  16. Finstad, K. (2010). Response interpolation and scale sensitivity: Evidence against 5-point scales. Journal of Usability Studies, 5(3), 104–110. [Google Scholar]
  17. Fošner, A. (2024). University students’ attitudes and perceptions towards AI tools: Implications for sustainable educational practices. Sustainability, 16(19), 8668. [Google Scholar] [CrossRef]
  18. Geerling, W., Mateer, G. D., Wooten, J., & Damodaran, N. (2023). ChatGPT has aced the test of understanding in college economics: Now what? The American Economist, 68(2), 233–245. [Google Scholar] [CrossRef]
  19. Gill, S. S., Xu, M., Patros, P., Wu, H., Kaur, R., Kaur, K., Fuller, S., Singh, M., Arora, P., Parlikad, A. K., Stankovski, V., Abraham, A., Ghosh, S. K., Lutfiyya, H., Kanhere, S. S., Bahsoon, R., Rana, O., Dustdar, S., Sakellariou, R., … Buyya, R. (2024). Transformative effects of ChatGPT on modern education: Emerging era of AI chatbots. Internet of Things and Cyber-Physical Systems, 4, 19–23. [Google Scholar] [CrossRef]
  20. Gu, H., Schreyer, M., Moffitt, K., & Vasarhelyi, M. (2024). Artificial intelligence co-piloted auditing. International Journal of Accounting Information Systems, 54, 100698. [Google Scholar] [CrossRef]
  21. Guo, D., Yang, D., Zhang, H., Song, J., Wang, P., Zhu, Q., Xu, R., Zhang, R., Ma, S., Bi, X., Zhang, X., Yu, X., Wu, Y., Wu, Z. F., Gou, Z., Shao, Z., Li, Z., Gao, Z., Liu, A., … Zhang, Z. (2025). Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv, arXiv:2501.12948. [Google Scholar] [CrossRef]
  22. Iatrellis, O., Samaras, N., Kokkinos, K., & Panagiotakopoulos, T. (2024). Leveraging generative AI for sustainable academic advising: Enhancing educational practices through AI-driven recommendations. Sustainability, 16(17), 7829. [Google Scholar] [CrossRef]
  23. Isiaku, L., Kwala, A. F., Sambo, K. U., & Isiaku, H. H. (2024). Academic evolution in the age of ChatGPT: An in-depth qualitative exploration of its influence on research, learning, and ethics in higher education. Journal of University Teaching and Learning Practice, 21(6), 67–91. [Google Scholar] [CrossRef]
  24. Jahin, A., Zidan, A. H., Zhang, W., Bao, Y., & Liu, T. (2025). Evaluating mathematical reasoning across large language models: A fine-grained approach. arXiv, arXiv:2503.10573. [Google Scholar] [CrossRef]
  25. Jiang, Q., Gao, Z., & Karniadakis, G. E. (2025). DeepSeek vs. ChatGPT vs. Claude: A comparative study for scientific computing and scientific machine learning tasks. Theoretical and Applied Mechanics Letters, 15, 100583. [Google Scholar] [CrossRef]
  26. Jin, I., Tangsrivimol, J. A., Darzi, E., Hassan Virk, H. U., Wang, Z., Egger, J., Hacking, S., Glicksberg, B. S., Strauss, M., & Krittanawong, C. (2025). DeepSeek vs. ChatGPT: Prospects and challenges. Frontiers in Artificial Intelligence, 8, 1576992. [Google Scholar] [CrossRef] [PubMed]
  27. Jo, H. (2024). From concerns to benefits: A comprehensive study of ChatGPT usage in education. International Journal of Educational Technology in Higher Education, 21(1), 35. [Google Scholar] [CrossRef]
  28. Karakose, T., Demirkol, M., Yirci, R., Polat, H., Ozdemir, T. Y., & Tülübaş, T. (2023). A conversation with ChatGPT about digital leadership and technology integration: Comparative analysis based on human–AI collaboration. Administrative Sciences, 13(7), 157. [Google Scholar] [CrossRef]
  29. Kasneci, E., Sessler, K., Küchemann, S., Bannert, M., Dementieva, D., Fischer, F., Gasser, U., Groh, G., Günnemann, S., Hüllermeier, E., Krusche, S., Kutyniok, G., Michaeli, T., Nerdel, C., Pfeffer, J., Poquet, O., Sailer, M., Schmidt, A., Seidel, T., … Kasneci, G. (2023). ChatGPT for good? On opportunities and challenges of large language models for education. Learning and Individual Differences, 103, 102274. [Google Scholar] [CrossRef]
  30. Khan, M. S., & Umer, H. (2024). ChatGPT in finance: Applications, challenges, and solutions. Heliyon, 10(2), e24890. [Google Scholar] [CrossRef]
  31. Kocoń, J., Cichecki, I., Kaszyca, O., Kochanek, M., Szydło, D., Baran, J., Bielaniewicz, J., Gruza, M., Janz, A., Kanclerz, K., Kocoń, A., Koptyra, B., Mieleszczenko-Kowszewicz, W., Miłkowski, P., Oleksy, M., Piasecki, M., Radliński, Ł., Wojtasik, K., Woźniak, S., & Kazienko, P. (2023). ChatGPT: Jack of all trades, master of none. Information Fusion, 99, 101861. [Google Scholar] [CrossRef]
  32. Kotsis, K. T. (2025). ChatGPT and DeepSeek evaluate one another for science education. EIKI Journal of Effective Teaching Methods, 3(1), 98–102. [Google Scholar] [CrossRef]
  33. Küçükuncular, A., & Ertugan, A. (2025). Teaching in the AI era: Sustainable digital education through ethical integration and teacher empowerment. Sustainability, 17(16), 7405. [Google Scholar] [CrossRef]
  34. Landis, J. R., & Koch, G. G. (1977). An application of hierarchical kappa-type statistics in the assessment of majority agreement among multiple observers. Biometrics, 33(2), 363–374. [Google Scholar] [CrossRef] [PubMed]
  35. Lee, U., Jung, H., Jeon, Y., Sohn, Y., Hwang, W., Moon, J., & Kim, H. (2024). Few-shot is enough: Exploring ChatGPT prompt engineering method for automatic question generation in English education. Education and Information Technologies, 29(9), 11483–11515. [Google Scholar] [CrossRef]
  36. Liu, A., Feng, B., Xue, B., Wang, B., Wu, B., Lu, C., Zhao, C., Deng, C., Zhang, C., Ruan, C., Dai, D., Guo, D., Yang, D., Chen, D., Ji, D., Li, E., Lin, F., Dai, F., Luo, F., … Pan, Z. (2024). Deepseek-v3 technical report. arXiv, arXiv:2412.19437. [Google Scholar] [CrossRef]
  37. Lu, H., Liu, W., Zhang, B., Wang, B., Dong, K., Liu, B., Sun, J., Ren, T., Li, Z., Yang, H., Sun, Y., Deng, C., Xu, H., Xie, Z., & Ruan, C. (2024). Deepseek-vl: Towards real-world vision-language understanding. arXiv, arXiv:2403.05525. [Google Scholar] [CrossRef]
  38. Luckin, R. (2025). Nurturing human intelligence in the age of AI: Rethinking education for the future. Development and Learning in Organizations: An International Journal, 39(1), 1–4. [Google Scholar]
  39. Luo, P. W., Liu, J. W., Xie, X., Jiang, J. W., Huo, X. Y., Chen, Z. L., Huang, Z. C., Jiang, S. Q., & Li, M. Q. (2025). DeepSeek vs. ChatGPT: A comparison study of their performance in answering prostate cancer radiotherapy questions in multiple languages. American Journal of Clinical and Experimental Urology, 13(2), 176. [Google Scholar] [CrossRef]
  40. Mai, D. T. T., Da, C. V., & Hanh, N. V. (2024). The use of ChatGPT in teaching and learning: A systematic review through SWOT analysis approach. Frontiers in Education, 9, 1328769. [Google Scholar] [CrossRef]
  41. Mansoor, M., Ibrahim, A., & Hamide, A. (2025). Performance of DeepSeek and GPT models on pediatric board preparation questions: Comparative evaluation. JMIR AI, 4, e76056. [Google Scholar] [CrossRef]
  42. Maresova, P., Stemberkova, R., & Fadeyi, O. (2019). Models, processes, and roles of universities in technology transfer management: A systematic review. Administrative Sciences, 9(3), 67. [Google Scholar] [CrossRef]
  43. Mourtajji, L., & Arts-Chiss, N. (2024). Unleashing ChatGPT: Redefining technology acceptance and digital transformation in higher education. Administrative Sciences, 14(12), 325. [Google Scholar] [CrossRef]
  44. Nahar, M., Seo, H., Lee, E. J., Xiong, A., & Lee, D. (2024). Fakes of varying shades: How warning affects human perception and engagement regarding LLM hallucinations. arXiv, arXiv:2404.03745. [Google Scholar] [CrossRef]
  45. Okulich-Kazarin, V., Artyukhov, A., Skowron, Ł., Artyukhova, N., & Wołowiec, T. (2024). When artificial intelligence tools meet “non-violent” learning environments (SDG 4.3): Crossroads with smart education. Sustainability, 16(17), 7695. [Google Scholar] [CrossRef]
  46. Palincsar, A. S. (2012). Social constructivist perspectives on teaching and learning. In An introduction to Vygotsky (pp. 290–319). Routledge. [Google Scholar]
  47. Ranta, M., Ylinen, M., & Järvenpää, M. (2023). Machine learning in management accounting research: Literature review and pathways for the future. European Accounting Review, 32(3), 607–636. [Google Scholar] [CrossRef]
  48. Sajja, R., Sermet, Y., Cikmaz, M., Cwiertny, D., & Demir, I. (2024). Artificial intelligence-enabled intelligent assistant for personalized and adaptive learning in higher education. Information, 15(10), 596. [Google Scholar] [CrossRef]
  49. Sapkota, R., Raza, S., & Karkee, M. (2025). Comprehensive analysis of transparency and accessibility of ChatGPT, DeepSeek, and other SoTA large language models. arXiv, arXiv:2502.18505. [Google Scholar] [CrossRef]
  50. Selwyn, N. (2024). On the limits of artificial intelligence (AI) in education. Nordisk tidsskrift for pedagogikk og kritikk, 10(1), 3–14. [Google Scholar] [CrossRef]
  51. Shafee, S., Bessani, A., & Ferreira, P. M. (2024). Evaluation of LLM chatbots for OSINT-based cyber threat awareness. arXiv, arXiv:2401.15127. [Google Scholar] [CrossRef]
  52. Sim, J., & Wright, C. C. (2005). The kappa statistic in reliability studies: Use, interpretation, and sample size requirements. Physical Therapy, 85(3), 257–268. [Google Scholar] [CrossRef]
  53. Susnjak, T. (2024). Beyond predictive learning analytics modelling and onto explainable artificial intelligence with prescriptive analytics and ChatGPT. International Journal of Artificial Intelligence in Education, 34(2), 452–482. [Google Scholar] [CrossRef]
  54. Sweller, J. (2011). Cognitive load theory. In Psychology of learning and motivation (Vol. 55, pp. 37–76). Academic Press. [Google Scholar]
  55. Sweller, J., & Chandler, P. (1991). Evidence for cognitive load theory. Cognition and Instruction, 8(4), 351–362. [Google Scholar] [CrossRef]
  56. Treves, L. (2025). Artificial intelligence of things: Unlocking new business sustainability possibilities or opening Pandora’s box? In Digital transformation and business sustainability (pp. 31–46). Routledge. [Google Scholar]
  57. Vieira, S. M., Kaymak, U., & Sousa, J. M. (2010, July). Cohen’s kappa coefficient as a performance measure for feature selection. In International conference on fuzzy systems (pp. 1–8). IEEE. [Google Scholar]
  58. Yan, G., Peng, K., Wang, Y., Tan, H., Du, J., & Wu, H. (2025). AdaFT: An efficient domain-adaptive fine-tuning framework for sentiment analysis in Chinese financial texts. Applied Intelligence, 55(7), 701. [Google Scholar] [CrossRef]
  59. Yan, L., Greiff, S., Teuber, Z., & Gašević, D. (2024). Promises and challenges of generative artificial intelligence for human learning. Nature Human Behaviour, 8(10), 1839–1850. [Google Scholar] [CrossRef]
  60. Zawacki-Richter, O., Bai, J. Y., Lee, K., Slagter van Tryon, P. J., & Prinsloo, P. (2024). New advances in artificial intelligence applications in higher education? International Journal of Educational Technology in Higher Education, 21(1), 32. [Google Scholar] [CrossRef]
  61. Zawacki-Richter, O., Marín, V. I., Bond, M., & Gouverneur, F. (2019). Systematic review of research on artificial intelligence applications in higher education–where are the educators? International Journal of Educational Technology in Higher Education, 16(1), 1–27. [Google Scholar] [CrossRef]
Figure 1. Methodological Process Flowchart.
Figure 2. Accounting Prompts Mean Comparison.
Figure 3. Economics Prompts Mean Comparison.
Figure 4. HR Prompts Mean Comparison.
Table 1. Results of Accounting Prompts.

Model         Dimension               R1   R2   R3   R4   Mean (x̄)   Cohen’s Kappa
ChatGPT 5.0   Accuracy                2    2    2    2    2.00        0.67
              Clarity                 2    2    2    2    2.00
              Conciseness             2    1    2    1    1.50
              Systematic Reasoning    2    2    1    1    1.50
              Potential for Bias      3    1    1    1    1.50
DeepSeek      Accuracy                1    1    2    2    1.50        0.65
              Clarity                 1    1    2    2    1.50
              Conciseness             1    1    1    1    1.00
              Systematic Reasoning    1    2    1    1    1.25
              Potential for Bias      1    1    1    1    1.00
Table 2. Results of Economics Prompts.

Model         Dimension               R1   R2   R3   R4   Mean (x̄)   Cohen’s Kappa
ChatGPT 5.0   Accuracy                2    2    2    2    2.00        0.76
              Clarity                 2    1    2    2    1.75
              Conciseness             2    2    2    2    2.00
              Systematic Reasoning    2    2    3    2    2.25
              Potential for Bias      1    2    2    2    1.75
DeepSeek      Accuracy                3    3    3    3    3.00        0.91
              Clarity                 3    3    3    3    3.00
              Conciseness             3    3    2    3    2.75
              Systematic Reasoning    3    2    3    3    2.75
              Potential for Bias      3    3    3    3    3.00
Table 3. Results of HR Prompts.

Model         Dimension               R1   R2   R3   R4   Mean (x̄)   Cohen’s Kappa
ChatGPT 5.0   Accuracy                3    3    3    3    3.00        0.92
              Clarity                 3    3    3    3    3.00
              Conciseness             3    3    3    2    2.75
              Systematic Reasoning    3    3    3    3    3.00
              Potential for Bias      2    3    3    2    2.50
DeepSeek      Accuracy                1    3    3    1    2.00        0.61
              Clarity                 2    2    2    2    2.00
              Conciseness             2    3    3    2    2.50
              Systematic Reasoning    2    2    2    2    2.00
              Potential for Bias      2    2    2    2    2.00
Table 4. Average performance scores of ChatGPT versus DeepSeek across Accounting, Economics, and HR domains.

Model      Domain       Accuracy   Clarity   Conciseness   Reasoning   Bias   Mean (x̄)
ChatGPT    Accounting   2.00       2.00      1.50          1.50        1.50   1.70
           Economics    2.00       1.75      2.00          2.25        1.75   1.95
           HR           3.00       3.00      2.75          3.00        2.50   2.85
DeepSeek   Accounting   1.50       1.50      1.00          1.25        1.00   1.25
           Economics    3.00       3.00      2.75          2.75        3.00   2.90
           HR           2.00       2.00      2.50          2.00        2.00   2.10
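The inter-rater reliability figures reported in Tables 1–3 can, in principle, be reproduced with a routine like the minimal sketch below. It assumes that agreement is summarized as the average pairwise Cohen’s kappa across the four raters’ ordinal ratings; the exact aggregation used in the study may differ, and the rating vectors shown are hypothetical placeholders rather than the study data.

```python
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

def average_pairwise_kappa(ratings):
    """Average Cohen's kappa over all pairs of raters.

    ratings: dict mapping rater id -> list of ordinal scores
             (one score per rated item, e.g. per evaluation dimension).
    """
    kappas = [
        cohen_kappa_score(ratings[r1], ratings[r2])
        for r1, r2 in combinations(ratings, 2)
    ]
    return sum(kappas) / len(kappas)

# Hypothetical placeholder rating vectors (not the study data):
example = {
    "R1": [2, 3, 2, 2, 3],
    "R2": [2, 3, 2, 2, 2],
    "R3": [3, 3, 2, 2, 3],
    "R4": [2, 3, 2, 1, 3],
}
print(f"Average pairwise kappa: {average_pairwise_kappa(example):.2f}")
```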
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
