Article

AI Response Quality in Public Services: Temperature Settings and Contextual Factors

by Domenico Trezza 1,*, Giuseppe Luca De Luca Picione 2 and Carmine Sergianni 2

1 Department of Social Sciences, University of Naples Federico II, 80138 Napoli, NA, Italy
2 Department of Economics, Management and Institutions, University of Naples Federico II, 80138 Napoli, NA, Italy
* Author to whom correspondence should be addressed.
Societies 2025, 15(5), 127; https://doi.org/10.3390/soc15050127
Submission received: 7 April 2025 / Revised: 2 May 2025 / Accepted: 3 May 2025 / Published: 6 May 2025

Abstract

This study investigated how generative Artificial Intelligence (AI) systems—now increasingly integrated into public services—respond to different technical configurations, and how these configurations affect the perceived quality of the outputs. Drawing on an experimental evaluation of Govern-AI, a chatbot designed for professionals in the social, educational, and labor sectors, we analyzed the impact of the temperature parameter—which controls the degree of creativity and variability in the responses—on two key dimensions: accuracy and comprehensibility. This analysis was based on 8880 individual evaluations collected from five professional profiles. The findings revealed the following: (1) the high-temperature responses were generally more comprehensible and appreciated, yet less accurate in strategically sensitive contexts; (2) professional groups differed significantly in their assessments, with trade union representatives and regional policy staff expressing more critical views than the others; (3) the type of question—whether operational or informational—significantly influenced the perceived output quality. This study demonstrated that the AI performance was far from neutral: it depended on technical settings, usage contexts, and the profiles of the end users. Investigating these “behind-the-scenes” dynamics is essential for fostering the informed governance of AI in public services, and for avoiding the risk of technology functioning as an opaque black box within decision-making processes.

1. Introduction

The New Frontiers of Artificial Intelligence: Technology or Social Construct?

The recent rise of generative Artificial Intelligence (AI)—a technology capable of autonomously producing textual and multimedia content—is profoundly reshaping society. It offers new opportunities while simultaneously raising complex questions about the transformation of decision-making and administrative processes.
In everyday life, we tend to perceive technologies—particularly algorithmic systems—as autonomous entities, governed by objective rules and detached from their social context. However, such a perspective overlooks a fundamental truth: every technology is a socio-technical construct, emerging from the interplay between technical components and social dynamics [1,2]. The algorithms powering AI systems embed choices, values, and cognitive frames that reflect not only the work of developers but also the broader cultural, political, and institutional contexts in which these systems are conceived and applied [3,4,5].
This implies that seemingly technical decisions—such as the selection of training data or the adjustment of operational parameters—are, in fact, shaped by human priorities and worldviews. While these issues have long been at the center of debates on the ethics and governance of algorithms [6], they acquire renewed urgency in the age of AI. In this new landscape, the perception of having greater control over system inputs—such as textual prompts—clashes with the reality of a machine whose inner workings remain largely opaque to end users [7]. The challenge, or indeed the risk, lies in the fact that even simple parameters may significantly affect the language generation and information interpretation, with potentially profound consequences—especially in the context of public policy [8,9].
These socio-technical considerations highlight the importance of examining how specific algorithmic settings, which are often hidden from end users, can significantly influence the outputs of generative AI systems. Among various possible parameters, temperature represents an especially valuable case for investigation: it is relatively simple to adjust, easy to communicate (through the intuitive notion of low versus high temperature), and nevertheless capable of producing substantial variations in the creativity, variability, and coherence of AI-generated responses. For these reasons, this study took temperature as a key example to explore how technical configurations shape the perceived quality of outputs in public service contexts.
Technically, temperature governs the creativity and variability of responses: lower values tend to produce more predictable and conservative outputs, while higher values favor more original yet potentially less coherent responses. Analyzing this parameter provides a unique lens into the black box of generative AI, illustrating how a single setting can shape meaning-making, expressed priorities, and interpretive frames embedded within the system’s responses.
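To make the mechanism concrete, the following minimal sketch (in Python, with hypothetical next-token scores; it illustrates the standard temperature-scaled softmax used in language-model sampling, not Govern-AI's internals) shows how lowering the temperature concentrates probability on the most likely continuation, while raising it spreads probability across alternatives:

```python
import numpy as np

def sampling_distribution(logits: np.ndarray, temperature: float) -> np.ndarray:
    """Turn raw next-token scores into a sampling distribution.

    Standard temperature-scaled softmax: dividing the logits by the
    temperature sharpens (t < 1) or flattens (t > 1) the distribution.
    """
    scaled = logits / temperature
    scaled -= scaled.max()  # subtract the max for numerical stability
    probs = np.exp(scaled)
    return probs / probs.sum()

# Hypothetical scores for four candidate tokens.
logits = np.array([2.0, 1.0, 0.5, 0.1])
for t in (0.2, 0.7, 1.0):  # the low, medium, and high settings used in this study
    print(f"t={t}: {np.round(sampling_distribution(logits, t), 3)}")
```

At t = 0.2 nearly all the probability mass falls on the top-ranked token, which is why low-temperature outputs read as predictable and conservative; at t = 1.0 the lower-ranked tokens retain a meaningful chance of being sampled.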
The case study examined here was Govern-AI, a chatbot designed to assist professionals working in the field of public policy with tasks related to planning and implementation. Through an experimental design that manipulated the temperature across a set of questions pertaining to various domains—administration, education, welfare, and labor—we explored how the response frames shifted, what role contextual factors (such as question type) played, and what implications emerged for decision-making processes.
Ultimately, this study offered a critical reflection on the role of generative AI in public governance, highlighting both the possibilities and the ambiguities that these technologies introduce for contemporary policymakers.
Building on this premise, this study addressed a central research question: How do specific generation parameters—particularly temperature—influence the content and perceived quality of AI-generated responses in public service contexts?
This paper is structured as follows. The next section presents the theoretical background on the socio-technical nature of generative AI and its implications for public services. We then introduce the Govern-AI program and describe the experimental design, including the use of the temperature setting and the evaluation framework. The empirical analysis follows, combining the descriptive statistics, multiple correspondence analysis (MCA), and regression models. Finally, we discuss the main findings and conclude with reflections on the potential and limits of generative AI in public governance.

2. Theoretical Background

2.1. Artificial Intelligence in Public Services

In the digital age, technological innovation is profoundly transforming the public sector, offering new opportunities to enhance the efficiency, quality, and accessibility of public services [10]. The integration of digital technologies into administrative processes has led to increased automation and resource optimization, primarily through the adoption of advanced tools, such as AI. AI has opened the door to a wide array of possibilities, including the automation of tasks traditionally performed by humans, support for decision-making processes, personalized services tailored to user needs, and a reduction in administrative burdens [11].
AI is capable of executing tasks that typically require human intelligence, such as reasoning, learning, and planning. It can also perceive environmental cues, analyze large volumes of data, and act autonomously to achieve specific goals [12]. This technological evolution is redefining the very concept of public service, which is increasingly based on intelligent solutions. The result is a more interconnected and interactive ecosystem of services [13], in which public administrations can engage more effectively with citizens, collect real-time feedback, and deliver more targeted and efficient services.
By adopting an innovative approach, the use of AI in public services not only improves the user experience but also helps strengthen public trust in institutions. In this context, digitalization is not merely a tool for operational efficiency; it becomes a fundamental means for building a more dynamic, proactive administration that is responsive to the needs of society [14]. However, recent studies on the integration of AI into public services are increasingly focusing on emerging challenges, particularly concerning institutional vulnerabilities and the resilience of administrative systems [15].
The spread of these technologies raises critical questions concerning transparency and citizen awareness [16]. In recent years, several studies and policy initiatives have sought to establish both regulatory and technical frameworks to ensure that AI systems operate in an understandable, fair, and accountable manner. In Europe, since 2017, numerous efforts have been made to update the EU legal framework, with the aim of developing an ethical and legal structure aligned with European values and the Charter of Fundamental Rights of the European Union [17]. In this regard, the AI Act proposed by the European Commission represents a concrete attempt to regulate AI use in public services according to the principles of transparency and accountability [18]. However, it is important to note that formal regulation alone may prove insufficient if not accompanied by organizational learning processes and the development of critical digital skills within public administrations [19].
In parallel with regulatory efforts, various tools and strategies have been developed to raise awareness about AI use in public administration. These include the development of open-source algorithms, the publication of algorithmic registries, and the implementation of AI impact assessment models—which are all measures intended to promote greater transparency and trust in automated decision-making.
Despite this growing focus, however, significant gaps remain. First, research on AI is still largely concentrated within computer science, while studies from the perspectives of the social sciences and public sector practice are comparatively limited [20]. Second, much of the literature neglects the user’s point of view, including the role of public administrations themselves as users of AI [16]. This leaves a gap in understanding how AI technologies can be configured to better meet user needs and support transparent governance. What is needed, therefore, is a more interdisciplinary approach—one that integrates technical analysis with social, ethical, and institutional evaluations of AI use in public governance [21].
Finally, there is a lack of empirical studies that have examined AI adoption within public organizations [21]. This absence of evidence restricts our ability to identify the best practices for enhancing user awareness and to translate these practices into effective public policies.

2.2. Chatbots as a Key Application of AI in Public Services

Among the main applications of AI in public services, chatbots stand out as one of the most promising solutions: they support information retrieval and processing, facilitate faster and more intuitive access to data, and ease the completion of administrative procedures [22], contributing to greater accessibility, efficiency, and improved interaction between citizens and institutions. These AI-based systems can process requests, provide information, and manage administrative procedures in an automated and timely manner, thereby reducing the workload of human operators and optimizing public resources.
One of the key advantages of chatbots lies in their constant “availability”, which allows citizens to access services at any time, unrestricted by the operating hours of public offices. Furthermore, advancements in natural language processing (NLP) have enabled these tools to understand and respond to user queries in an increasingly intuitive and personalized way, thus enhancing the user experience [23].
However, implementing chatbots in public services also presents several challenges. These include ensuring the accuracy of the information provided, managing data privacy and security, and integrating chatbots into existing administrative systems without compromising the service quality. Additionally, while chatbots can efficiently handle a high volume of standardized requests, more complex cases—or those requiring empathy and human judgment—may still necessitate human intervention.
When strategically implemented and supported by appropriate infrastructure and clear principles of transparency and ethics, chatbots can become a cornerstone of the digital transformation of public administration, improving governance effectiveness and strengthening the relationship between citizens and institutions [24].

3. Materials and Methods

3.1. Generative Technology in Public Services: The Govern-AI Program

The analysis presented in this paper is part of the Govern-AI program (GOVERNance assistance for social areas by Artificial Intelligence)1. Launched in July 2023, this project aims to explore the potential of generative AI within the welfare domain by providing decision support tools to various stakeholders in the system: adult education (professionals from the Provincial Centers for Adult Education, CPIA), social welfare (representatives from Territorial Social Areas and institutional actors in regional policy), and the labor sector (the UIL Campania trade union).
The experimentation seeks to test the applicability of generative AI in complex environments characterized by diverse needs and layered decision-making processes. The operational core of the project is an advanced chatbot developed through a socio-technical approach, which combines algorithmic application with the participatory construction of data, based on direct input from involved actors. The AI is trained on domain-specific datasets that are continuously updatable, ensuring high adaptability to the needs of different welfare contexts.
Govern-AI is not merely a technological pilot; rather, it constitutes a living laboratory of “algorithmic governance” aimed at testing new models of interaction between generative AI and public decision-making processes. Particular attention is devoted to decision transparency, data security, and human oversight—which are all crucial elements for the responsible implementation of AI in public services.
An initial phase of the project consisted of shared testing of the system’s output capabilities. The results are not only technical validations of a system but—more importantly—also provide an opportunity to reflect sociologically on the epistemological nature of AI-based language models.

3.2. Design of the Empirical Phase

The empirical phase of this study focused on evaluating the comprehensibility and accuracy of the responses generated by the Govern-AI chatbot. The system was tested using a set of 400 prompts that were processed autonomously by AI and evenly distributed across five professional categories: adult education teachers and staff (AE), trade union representatives (SIN), operators and decision-makers from Territorial Social Areas (ATS), administrators and professionals from the regional department for social and educational policies (AMP), and professionals in institutional management (PROG.MAN).
The participants involved—33 professionals already active in the Govern-AI program—assessed the chatbot’s responses via an online questionnaire connected to a real-time updated spreadsheet. Each target group received 100 system-generated prompts tailored to their professional field and structured along two key analytical dimensions:
  • Micro- vs. macro-level, which distinguished between prompts focused on specific, localized issues and those that concerned broader, strategic matters at the regional or national level;
  • Informational vs. operational nature, which depended on whether the prompt aimed at gathering explanations and factual information or at eliciting actionable guidance.
Each prompt was associated with three AI-generated responses produced under low, medium, and high temperature settings to examine how the adjustment of this parameter influenced the output quality. In total, the process generated 1200 responses, which were then submitted for evaluation.

3.3. Language Model and Technical Infrastructure

The responses analyzed in this study were generated using the GPT-4 language model developed by OpenAI. The system was deployed via Chatbase, a customizable chatbot platform that enables interaction with large language models through a user-friendly interface and parameter management.
For the purpose of this experiment, three distinct temperature settings were applied—0.2, 0.7, and 1—in order to systematically evaluate how variations in this parameter influenced the quality of the generative outputs. The temperature, in this context, regulated the randomness and creativity of the model’s responses, where lower values produced more deterministic and conservative answers, and higher values favored more diverse and potentially original outputs.
The integration of Chatbase allowed for prompt-based testing in controlled conditions while also facilitating the real-time collection of professional evaluations through embedded feedback sheets.
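For readers who wish to approximate the generation protocol, the sketch below (using the official OpenAI Python SDK and a bare prompt; the actual experiment ran through Chatbase's hosted interface, with domain-specific training data not reproduced here) produces the three temperature variants for a single prompt:

```python
from openai import OpenAI  # official OpenAI Python SDK (pip install openai)

client = OpenAI()  # expects OPENAI_API_KEY in the environment

TEMPERATURES = (0.2, 0.7, 1.0)  # the low, medium, and high settings of the study

def generate_variants(prompt: str) -> dict[float, str]:
    """Return one GPT-4 response per temperature setting for a single prompt."""
    variants = {}
    for t in TEMPERATURES:
        reply = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            temperature=t,
        )
        variants[t] = reply.choices[0].message.content
    return variants

# Example with a hypothetical operational, micro-level prompt.
answers = generate_variants("How should a CPIA plan the intake for an adult literacy course?")
```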

3.4. Evaluation of Prompts and Responses

The questionnaire invited the participants to evaluate both the prompts and the AI-generated responses. The assessment of the responses was based on two primary criteria (Table 1):
  • Accuracy, which was defined as the factual correctness and precision of the information provided (rated on a scale from 0 to 5);
  • Comprehensibility, which referred to the readability and clarity of the text—this is crucial for enabling rapid and effective interpretation in professional settings (rated on a scale from 0 to 5).
In parallel, the relevance of the prompts was also examined, which measured the perceived usefulness of each prompt in relation to the informational and operational needs of professionals (rated on a scale from 0 to 5). This dimension was treated as a supplementary variable in the analysis to explore the relationship between the prompt formulation and the quality of the generated output.
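Putting these criteria together, each completed questionnaire row can be thought of as a record like the following (a hypothetical schema for illustration only; the study's actual spreadsheet layout is not published):

```python
from dataclasses import dataclass

@dataclass
class Evaluation:
    """One evaluator's judgment of one AI response (hypothetical field names)."""
    role: str               # AE, SIN, ATS, AMP, or PROG.MAN
    level: str              # "micro" or "macro" prompt
    prompt_type: str        # "informational" or "operational"
    temperature: str        # "low", "intermediate", or "high"
    relevance: int          # perceived usefulness of the prompt, 0-5
    comprehensibility: int  # clarity and readability of the response, 0-5
    accuracy: int           # factual correctness of the response, 0-5
```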

3.5. Analytical Approach

The analysis was based on a structured dataset that consisted of 8880 total evaluations2 distributed across the participating professional groups. To explore the relationships between the categorical variables and identify significant patterns, a multiple correspondence analysis (MCA) was applied—which is a technique particularly suited to visually mapping associations between variable categories [25].
To identify the most relevant variables that contributed to latent factors, the V-test was employed, which allowed for the detection of categories with statistically significant contributions. The MCA revealed two principal axes: the first was associated with the quality and stability of the AI-generated responses, and the second was linked to the decision-making context and prompt relevance.
In addition to the factorial representation, a multiple linear regression was conducted using ANCOVA [26] to measure the specific contribution of each variable to the variation in comprehensibility and accuracy of the AI responses. This approach enabled the quantification of how the type of prompt, the user’s professional category, and the perceived relevance of the request influenced the perceived quality of the model’s outputs.
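As an indication of how such an analysis can be reproduced, the sketch below runs an MCA on a toy slice of categorical evaluation data using the open-source prince library (the paper does not specify which software was used, and the variable codings here are illustrative):

```python
import pandas as pd
import prince  # open-source MCA implementation; the authors' tool is unspecified

# A toy slice of the evaluation dataset: every variable is categorical.
df = pd.DataFrame({
    "level": ["macro", "micro", "macro", "micro", "macro", "micro"],
    "type": ["operational", "informational", "operational",
             "informational", "informational", "operational"],
    "role": ["AE", "SIN", "ATS", "AMP", "PROG.MAN", "AE"],
    "compr_high": ["good", "medium", "good", "not good", "medium", "good"],
    "acc_high": ["good", "not good", "medium", "medium", "not good", "good"],
})

mca = prince.MCA(n_components=2).fit(df)
print(mca.eigenvalues_)            # inertia per axis (cf. the scree plot, Figure 4)
print(mca.column_coordinates(df))  # category positions on F1/F2 (cf. Figure 5)
```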

4. Results

4.1. General Descriptive Analysis

The data show, on the one hand, a relatively balanced distribution in terms of both the level and type of prompts evaluated (47.7% macro vs. 52.3% micro; 54.6% informational vs. 45.4% operational) and, on the other hand, a less uniform distribution across the five professional groups involved. The most active evaluators were the professionals in program management (26.1%) and the trade union representatives (25.9%), whereas the regional policy officials were the least represented group (8.1%), a result consistent with their smaller presence within the sample (Table 2).
With regard to the scores assigned, we present the mean values for the prompt relevance and the comprehensibility and accuracy of the responses. Since all the ratings were based on a 0–5 scale, a preliminary comparison was possible. It is worth noting that the relevance—which was defined as the extent to which the prompt was perceived as pertinent to the respondent’s professional role—received the lowest average score and showed the least variability. This may partly reflect the fact that the prompts were developed externally to the participants and covered highly heterogeneous fields of intervention.
The descriptive statistics on the AI response quality confirm a well-known trend in language models: responses were generally perceived as more comprehensible than accurate. Moreover, as illustrated in Figure 2 below, the temperature setting appears to have had a positive effect on the quality scores at its highest level (Table 3).
There is no significant difference between comprehensibility and accuracy overall. However, when cross-referenced with prompt characteristics and participant profiles, one notable difference emerged: prompts of an operational nature—which were those oriented toward practical issues—tended to generate higher-quality responses.
By contrast, the variations related to the prompt level and the professional role were less marked. The responses to the micro-level prompts were, on average, evaluated less positively. As for the user profiles, both the comprehensibility and accuracy scores followed an uneven, “rollercoaster” pattern, with the highest peaks recorded among the professionals from the CPIA (AE), the Territorial Social Areas (ATS), and program management (Figure 1).
When considering the three temperature levels, a nearly identical trend was observed for both dimensions. At the low and medium levels, there were no significant differences, but at the high temperature setting, both scores rose sharply. Increasing the creativity parameter thus appeared to positively affect the overall quality of the content (Figure 2). When examining the different prompt contexts, the substance of the trend remained largely unchanged. Whether the prompts were informational and micro in nature, or operational and macro, a clear positive shift occurred when the temperature was set to the highest level. However, the effect was less pronounced in the operational and macro prompts, even though these tended to produce responses that were, overall, rated more favorably (Figure 3).

4.2. Multiple Correspondence Analysis

To better understand the underlying dynamics, we applied an MCA. The extraction of the two factors with the highest explained inertias (Figure 4), combined with the interpretation of the V-test values for each category, allowed us to synthesize the data along two main dimensions (Table 4):
  • Factor 1, “AI response quality”, distinguished between the clear and reliable outputs (mostly associated with operational and macro-level prompts concerning the CPIA and local welfare issues) and less stable responses, which were primarily linked to informational prompts related to regional policy and union domains.
  • Factor 2, “decision-making context of the prompt”, separated low-relevance prompts (more operational and easily manageable by AI) from highly relevant prompts, which were often tied to strategic decisions.
Practically, the first factor contrasted responses rated highly with those rated poorly, which helped to trace the profile of both the prompt and the professional who evaluated it. The second factor highlighted the importance of the prompt relevance as a crucial parameter: the AI seemed to handle low-relevance and operational prompts more effectively, whereas the responses to more strategic prompts tended to be more heterogeneous and increasingly dependent on the temperature setting.
By examining the distribution of the variable modalities on the factorial plane (Figure 5), a clear trend emerged: the operational and macro-level prompts were mainly located in the area associated with what we defined as the “good AI performance” zone, while the informational and micro-level prompts appeared more frequently in the “critical AI performance” zone. This suggests that the AI response quality was strongly influenced by the nature of the prompt, and that the professionals that worked in more strategic areas—such as trade unions and regional policy—received less satisfactory outputs compared with those that interacted with the AI on more operational and local matters.
The MCA thus confirmed that the quality of the AI-generated responses was not a static attribute, but rather the result of several interconnected factors: the temperature setting, the type of prompt, and the evaluator’s professional profile. In particular, the prompts that were more procedural and standardizable benefited the most from the AI capabilities, whereas more complex and strategic requests exposed the model’s limitations in producing coherent and context-sensitive outputs.

4.3. Regression Models on AI Response Quality

To better understand the impact of key variables on the quality of AI-generated responses, we applied an Analysis of Covariance (ANCOVA). Specifically, we estimated two separate multiple regression models: one focused on the comprehensibility and the other on the accuracy. These models allowed us to quantify how the characteristics of the prompt and the evaluator's professional profile affected the perceived quality of the outputs (a worked application of the two equations is sketched after the list):
  • Comprehensibility = 3.722 + 0.066·Relevance + 0.031·macro + 0.050·inform + 0.110·role-AE − 0.442·role-AMP − 0.045·role-ATS − 0.400·role-SIN;
  • Accuracy = 3.589 + 0.100·Relevance + 0.198·macro − 0.211·inform − 0.146·role-AE − 0.615·role-AMP + 0.044·role-ATS − 0.391·role-SIN.
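To make the estimates tangible, the following snippet evaluates the two fitted equations for an arbitrary prompt profile. It is a worked application only: it treats macro and inform as 0/1 dummies and reads the role terms as dummy-coded against PROG.MAN, the category absent from the equations (our reading of the reported specification):

```python
def predict_quality(relevance: float, macro: int, inform: int, role: str) -> tuple[float, float]:
    """Evaluate the two fitted models from Section 4.3.

    Role coefficients (comprehensibility, accuracy) are assumed to be
    coded against PROG.MAN as the reference category.
    """
    roles = {"AE": (0.110, -0.146), "AMP": (-0.442, -0.615),
             "ATS": (-0.045, 0.044), "SIN": (-0.400, -0.391),
             "PROG.MAN": (0.0, 0.0)}
    r_compr, r_acc = roles[role]
    compr = 3.722 + 0.066 * relevance + 0.031 * macro + 0.050 * inform + r_compr
    acc = 3.589 + 0.100 * relevance + 0.198 * macro - 0.211 * inform + r_acc
    return compr, acc

# Example: a maximally relevant, macro-level, operational prompt rated by AMP staff.
print(predict_quality(relevance=5, macro=1, inform=0, role="AMP"))
```

Under these assumptions, such a prompt yields predicted scores of roughly 3.64 for comprehensibility and 3.67 for accuracy, below the model intercepts despite the favorable prompt profile, reflecting the strongly negative AMP terms.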
Both models confirmed the patterns observed in the MCA, showing that the AI response quality was influenced by a combination of factors related to the prompt type and user role. The relevance of the prompt was a positive and significant predictor of the AI response quality: the prompts perceived as more useful by the evaluators tended to receive higher scores for both the comprehensibility and accuracy. However, the weight of this variable was greater for the accuracy (+0.100) than for the comprehensibility (+0.066), suggesting that the relevant prompts especially encouraged more factually correct responses (Figure 6).
The “macro-level” dimension of the prompt also had a positive impact on the AI evaluation, which was stronger for the accuracy (+0.198) than for the comprehensibility (+0.031). This indicates that the prompts that addressed broader strategic or policy-making issues tended to receive higher ratings in terms of precision.
Conversely, the informational nature of the prompt had a divergent effect on the two models: it had a slightly positive impact on the comprehensibility (+0.050), but a negative effect on the accuracy (−0.211). This supports previous findings from the MCA that showed informational prompts tended to generate more fluent and readable responses, but not necessarily more accurate ones.
The effect of the evaluator’s professional profile was one of the most relevant aspects and closely mirrored the patterns found in the factorial analysis:
  • The professionals in adult education (role-AE) assigned slightly higher scores for the comprehensibility (+0.110), but lower ones for the accuracy (−0.146), suggesting that while they found the responses readable, they questioned their factual correctness.
  • The regional policy staff (role-AMP) exhibited the strongest negative impact on both dimensions (−0.442 for the comprehensibility and −0.615 for the accuracy), confirming that this group tended to provide the most critical evaluations of the AI responses.
  • The trade union representatives (role-SIN) also showed significant negative evaluations (−0.400 for the comprehensibility and −0.391 for the accuracy), indicating that the AI responses may not have sufficiently reflected the specificities of their domain.
  • The social area professionals (role-ATS) showed no significant effect on the comprehensibility (−0.045), but a slightly positive one on the accuracy (+0.044), indicating a tendency to rate the AI outputs as somewhat more precise than the other groups.
The regression analysis confirms that the AI performance was not homogeneous, but rather was significantly influenced by both the type of prompt and the profile of the evaluating user. While the responses to the operational and macro-level prompts tended to be rated more favorably, those that addressed informational or strategically relevant content presented more challenges—particularly in terms of the accuracy. Furthermore, the variation in the evaluations across the professional groups suggested that the perceived quality of the chatbot was strongly shaped by the expectations and specific needs of the different actors. This finding points to the importance of developing more targeted and personalized AI tools within the domains of welfare and labor policy.

5. Discussion

The analysis of AI-generated responses within the domains of social and labor policy has provided a nuanced picture of the dynamics that shape the quality of outputs and their perceptions by professionals involved in the evaluation. Through three levels of analytical depth—descriptive analysis, multiple correspondence analysis, and regression modeling—we identified significant patterns that offer insights not only into the AI’s performance within the Govern-AI experiment but also into broader implications for the integration of generative models in decision-making and information processes. These implications are closely tied to the growing need to ensure transparency, traceability, and adaptability in algorithmic systems implemented within public sector contexts, as emphasized by recent studies [13,15].
From a descriptive standpoint, it is clear that the AI responses tended to be rated as more comprehensible than accurate. This suggests that while the model is capable of producing formally clear texts, their factual correctness may be problematic in certain contexts. Moreover, the evaluations varied considerably across the professional groups: the adult education professionals and social area operators tended to be relatively satisfied with the responses, while the regional policy staff and trade union representatives expressed more critical judgments. This divergence may be attributed to the greater complexity of their respective domains, which require highly contextualized and precise information—which are needs that are more difficult to meet with a standard generative model, as also noted by Madan and Ashok [19], who emphasized the need to adapt AI tools to specific professional contexts to ensure their effective and context-sensitive use in public services.
The MCA helped to systematize these findings by identifying two key interpretive axes: Factor 1 distinguished the positively evaluated responses from the less satisfactory ones, while Factor 2 separated the prompts of low relevance from those with high decision-making impact. This analytical framework highlighted that the language model performed particularly well with operational and concrete prompts, whereas informational prompts—especially those in the union and policy-making domains—tended to generate more variable and less satisfying outputs. Additionally, the prompt relevance emerged as a key variable: the prompts perceived as low-impact received more stable and reliable responses, while those of medium relevance proved more problematic, likely due to their higher interpretive ambiguity. This reflects an AI that, despite significant recent advances, still relies on language models that remain fragile and are deeply influenced by contextual factors and human framing [4,7].
The regression models confirmed and quantified these dynamics, showing that the prompt relevance and macro-level framing were positive predictors for both the comprehensibility and accuracy. However, as already suggested by the MCA structure, the model showed a strong negative effect for evaluators from the AMP and SIN groups, indicating that the chatbot struggled to meet the expectations of these professionals. A particularly noteworthy result was the contrasting effect of the informational nature of the prompt: while it improved the readability of the responses, it negatively affected their accuracy. This finding suggests that the AI tended to produce fluent texts but with a higher risk of inaccuracy when asked to provide data or normative references, highlighting one of the central challenges of generative systems: not merely producing outputs that are formally well-structured but ensuring that they are also epistemically reliable and verifiable [3].

6. Conclusions

This paper is situated within the framework of the Govern-AI experimentation program, an initiative aimed at implementing generative AI in the administrative and decision-making activities of professionals working in the public services, welfare, and labor sectors. Specifically, we presented the results of a testing phase conducted on a chatbot system by adjusting the temperature setting—an essential parameter known to influence how responses are generated.
However, this occasion also offered a valuable opportunity to reflect empirically on the broader socio-technical implications of generative systems. Our intent was not only to assess the technical performance of the model but also to explore the epistemological dimensions of its outputs. The findings of this research suggest that the adoption of generative AI in social and labor policy contexts cannot disregard the need for careful model calibration according to the expectations and needs of different professional groups.
Our guiding research question—How do the generation parameters in AI systems affect outputs in public decision-making contexts?—found an articulated response through three key findings:
The AI response quality was not univocal, but multidimensional.
Comprehensibility and accuracy were not uniformly distributed. The model’s temperature played an ambivalent role: while higher values tended to enhance the clarity and detail, they did not always guarantee greater factual accuracy. This issue is especially critical for prompts involving normative or strategic content, where correctness is essential.
The differences between the professional groups were significant.
The adult education professionals and social area workers tended to accept the AI responses more readily, whereas the regional policy staff and trade union representatives were more critical. This gap reflected not only the complexity of their domains but also the difficulty in providing context-specific and specialized answers when such data were not embedded in the system.
The AI was more effective for operational tasks than for strategic decisions.
Both the MCA and regression analysis indicated that the AI responses performed better for the operational prompts. Conversely, the informational and policy-related prompts were associated with more variability and lower perceived quality. This suggests that generative AI may serve as a useful support tool for routine and practical tasks but requires greater oversight and control in matters with strategic relevance.
What are the perspectives for the future?
These findings raise questions not only within the scope of our study but also for the future deployment of AI in domains critical to public services, social protection, and labor governance. Model customization emerges as a crucial factor: static and uniform configurations risk generating outputs that fall short of the diverse needs of professionals and decision-makers. Integrating generative AI with sector-specific knowledge bases and dynamically regulating parameters, such as the temperature, may represent promising strategies for improving the quality of the outputs.
At the same time, our analysis implicitly highlights the need for the informed governance of AI in public and union settings. The adoption of generative models should not be seen as a neutral process: their configuration directly shapes knowledge production and the way digital realities are represented. This implies that professionals using these tools must develop critical and interpretative competencies to avoid repeating a recurring error in past technological transformations—namely, the tendency to perceive innovative systems as inherently objective and unquestionable, when, in fact, they are the result of precise technical and epistemological choices. One such example is a seemingly trivial parameter like temperature.
In conclusion, generative AI may indeed serve as an ally for innovation in public services but only if it is used consciously within a framework of clear and responsible governance. A simple yet illustrative question for any public professional might be as follows: What temperature setting is being used by the AI model supporting my work? Failing to ask such questions risks introducing opacity into decision-making processes, with potentially distortive effects on social and labor policies. The real issue, therefore, is not whether AI should be used but how it should be integrated so that it becomes an effective support tool rather than a source of confusion or uncritical delegation.

Author Contributions

Conceptualization, D.T.; Methodology, D.T. and G.L.D.L.P.; Software, D.T. and C.S.; Validation, D.T.; Formal analysis, D.T.; Investigation, C.S.; Resources, G.L.D.L.P.; Data curation, D.T.; Writing—original draft, D.T.; Writing—review & editing, G.L.D.L.P. and C.S.; Visualization, D.T. and C.S.; Supervision, D.T. and G.L.D.L.P.; Project administration, D.T. and G.L.D.L.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Ethical review and approval were waived for this study due to the nature of this research, which did not involve sensitive personal data, medical procedures, or any form of clinical experimentation. This study was based on the evaluation of AI-generated content by professionals in their institutional roles without collecting health-related or identifiable private information.

Informed Consent Statement

Informed consent was obtained from all subjects involved in this study.

Data Availability Statement

The data collected in this study are available upon request from the corresponding author. The data are not publicly available due to institutional privacy agreements related to the evaluation of AI systems in public service contexts.

Acknowledgments

The authors wish to thank the participants of the Govern-AI program for their collaboration, and the institutions involved in the testing phase for their logistical and technical support.

Conflicts of Interest

The authors declare no conflicts of interest.

Notes

1. www.governai.it (accessed on 2 May 2025).
2. This overall figure includes the evaluation of all three responses associated with each prompt. On average, each of the 33 participants assessed approximately 269 responses.

References

1. Latour, B. Reassembling the Social: An Introduction to Actor-Network-Theory; Oxford University Press: Oxford, UK, 2005.
2. Amaturo, E.; Aragona, B. Per un’epistemologia del digitale: Note sull’uso di big data e computazione nella ricerca sociale. Quad. Sociol. 2019, 81, 71–90.
3. Floridi, L. Etica dell’intelligenza artificiale: Sviluppi, opportunità, sfide; Raffaello Cortina Editore: Milan, Italy, 2022.
4. Seaver, N. Computing Taste: Algorithms and the Makers of Music Recommendation; University of Chicago Press: Chicago, IL, USA, 2022.
5. Amato, F.; Aragona, B.; De Angelis, M. Factors and possible application scenarios of Explainable AI. Riv. Digit. Politics 2023, 3, 543–564.
6. Zuboff, S. Surveillance capitalism and the challenge of collective action. New Labor Forum 2019, 28, 10–29.
7. Floridi, L.; Chiriatti, M. GPT-3: Its Nature, Scope, Limits, and Consequences. Minds Mach. 2020, 30, 681–694.
8. Peeperkorn, M.; Kouwenhoven, T.; Brown, D.; Jordanous, A. Is temperature the creativity parameter of large language models? arXiv 2024, arXiv:2405.00492.
9. Senadheera, S.; Yigitcanlar, T.; Desouza, K.C.; Mossberger, K.; Corchado, J.; Mehmood, R.; Li, R.Y.; Cheong, P.H. Understanding Chatbot Adoption in Local Governments: A Review and Framework. J. Urban Technol. 2024, 31, 1–35.
10. Van Noordt, C.; Misuraca, G. Artificial intelligence for the public sector: Results of landscaping the use of AI in government across the European Union. Gov. Inf. Q. 2022, 39, 101714.
11. Neumann, O.; Guirguis, K.; Steiner, R. Exploring artificial intelligence adoption in public organizations: A comparative case study. Public Manag. Rev. 2022, 26, 114–141.
12. European Parliament. Artificial Intelligence: How Does It Work, Why Does It Matter, and What Can We Do About It?; European Parliamentary Research Service: Brussels, Belgium, 2020. Available online: https://www.europarl.europa.eu/RegData/etudes/STUD/2020/641547/EPRS_STU(2020)641547_EN.pdf (accessed on 2 May 2025).
13. Wirtz, B.W.; Weyerer, J.C.; Geyer, C. Artificial Intelligence and the Public Sector—Applications and Challenges. Int. J. Public Adm. 2019, 42, 596–615.
14. Wirtz, B.W.; Langer, P.F.; Fenner, C. Artificial Intelligence in the Public Sector—A Research Agenda. Int. J. Public Adm. 2021, 44, 1103–1128.
15. Vatamanu, A.F.; Tofan, M. Integrating Artificial Intelligence into Public Administration: Challenges and Vulnerabilities. Adm. Sci. 2025, 15, 149.
16. Madan, R.; Ashok, M. AI adoption and diffusion in public administration: A systematic literature review and future research agenda. Gov. Inf. Q. 2023, 40, 101774.
17. European Commission. Ethics Guidelines for Trustworthy AI; Publications Office: Luxembourg, 2019. Available online: https://digital-strategy.ec.europa.eu/en/library/ethics-guidelines-trustworthy-ai (accessed on 2 May 2025).
18. Contissa, G.; Galli, F.; Godano, F.; Sartor, G. Il regolamento europeo sull’intelligenza artificiale: Analisi informatico-giuridica. i-lex 2021, 14, 1–36.
19. Madan, R.; Ashok, M. A public values perspective on the application of Artificial Intelligence in government practices: A synthesis of case studies. In Handbook of Research on Artificial Intelligence in Government Practices and Processes; IGI Global Scientific Publishing: Hershey, PA, USA, 2022; pp. 162–189.
20. Aoki, N. An experimental study of public trust in AI chatbots in the public sector. Gov. Inf. Q. 2020, 37, 101490.
21. Dwivedi, Y.K.; Hughes, L.; Ismagilova, E.; Aarts, G.; Coombs, C.; Crick, T.; Duan, Y.; Dwivedi, R.; Edwards, J.; Eirug, A.; et al. Artificial Intelligence (AI): Multidisciplinary perspectives on emerging challenges, opportunities, and agenda for research, practice and policy. Int. J. Inf. Manag. 2021, 57, 101994.
22. Maragno, G.; Tangi, L.; Gastaldi, L.; Benedetti, M. AI as an organizational agent to nurture: Effectively introducing chatbots in public entities. Public Manag. Rev. 2023, 25, 2135–2165.
23. Mehr, H. Artificial Intelligence for Citizen Services and Government; Ash Center for Democratic Governance and Innovation, Harvard Kennedy School: Cambridge, MA, USA, 2017; pp. 1–12.
24. Mergel, I.; Edelmann, N.; Haug, N. Defining digital transformation: Results from expert interviews. Gov. Inf. Q. 2019, 36, 101385.
25. Di Franco, G. Corrispondenze multiple e altre tecniche multivariate per variabili categoriali; FrancoAngeli: Milan, Italy, 2006; Volume 15.
26. Poston, D.L., Jr.; Conde, E.; Field, L.M. Applied Regression Models in the Social Sciences; Cambridge University Press: Cambridge, UK, 2023.
Figure 1. Comprehensibility and accuracy—level, type of question, and role.
Figure 2. Ratings of responses by temperature.
Figure 3. Prompt contexts.
Figure 4. Scree plot with factor inertia.
Figure 5. Factorial plane of MCA. The red dots represent the categories of the active variables (e.g., quality ratings across different temperature settings), while the brown squares indicate the supplementary variables (e.g., question types).
Figure 6. Estimated coefficients for multiple regression models on AI response quality.
Table 1. Evaluation criteria for prompts and responses.

Prompts
  Relevance          Relevance of the question concerning the professional’s actual needs (scale 0–5)
Responses
  Accuracy           Factual adherence and precision of the provided information (scale 0–5)
  Comprehensibility  Clarity and readability of the response (scale 0–5)
Table 2. Percentage of prompts by level, type, and role.

Variable  Category        n      %
Level     Macro           1412   47.7
          Micro           1548   52.3
Type      Informational   1617   54.6
          Operational     1343   45.4
Role      AE              589    19.9
          AMP             239    8.1
          ATS             592    20.0
          SIN             767    25.9
          PROG.MAN        773    26.1
Table 3. Average rating (and standard deviation) for prompt and responses by relevance, comprehensibility, and accuracy (0–5).

AI Content              Rating              Avg    SD
Prompt                  Relevance           3.50   0.90
Responses_temperature   Compr_low           3.81   1.04
                        Compr_intermediate  3.74   1.10
                        Compr_high          4.06   1.06
                        Acc_low             3.62   1.12
                        Acc_intermediate    3.61   1.07
                        Acc_high            4.02   1.11
Table 4. F1 and F2 axes from V-test.

Category                     F1 (AI Response Quality)   F2 (Decision-Making Context of Prompt)
Macro                        −2.947                     −0.428
Micro                         2.947                      0.428
Informational                 2.499                     −1.051
Operational                  −2.499                      1.051
AE                           −11.634                     9.886
AMP                           16.053                    −0.087
ATS                          −8.380                      2.326
SIN                           16.725                    −4.586
PROG.MAN                     −8.435                     −6.475
Compr_low_good               −30.978                    24.223
Compr_low_medium             −4.953                     −28.482
Compr_low_not good            34.406                     5.304
Compr_intermediate_good      −34.217                    21.122
Compr_intermediate_medium    −4.956                     −29.782
Compr_intermediate_not good   36.615                     9.484
Compr_high_good              −32.119                     1.172
Compr_high_medium             15.468                    −22.360
Compr_high_not good           19.815                    22.528
Acc_low_good                 −19.497                    22.984
Acc_low_medium               −15.761                    −27.872
Acc_low_not good              32.028                     8.311
Acc_intermediate_good        −19.738                    18.290
Acc_intermediate_medium      −16.551                    −27.501
Acc_intermediate_not good     32.471                    13.471
Acc_high_good                −29.677                    −4.787
Acc_high_medium               14.122                    −18.382
Acc_high_not good             18.218                    25.680
High relevance               −1.824                     −3.649
Low relevance                −0.746                      5.521
Medium relevance              2.417                     −0.979