1. Introduction
The widespread integration of artificial intelligence (AI) across all spheres of life creates a new technological reality accompanied by ambivalent public perceptions. These range from expectations of significant gains in efficiency and the automation of routine processes to concerns about the risks associated with AI influence (Stein et al., 2024). Large language models (LLMs) are increasingly applied in healthcare (Meng et al., 2024) to automate the processing of medical data (Vasilev et al., 2025a), generate and summarize medical texts (Bednarczyk et al., 2025), and facilitate the education of patients (Unger et al., 2025) and physicians (Iqbal et al., 2025). However, these systems have limitations, including a tendency to generate convincing yet inaccurate information (“hallucinations”) and to amplify systemic biases present in training datasets (Athaluri et al., 2023). These shortcomings raise serious safety concerns regarding the integration of LLMs into clinical practice, where errors may have severe consequences for patient health.
Retrieval-augmented generation (RAG) may improve the reliability of LLMs by grounding generated responses in external knowledge sources. This is particularly relevant when LLMs are used as clinical reference tools, where source traceability, up-to-date information, and hallucination reduction are important for physicians’ trust (Hang et al., 2025). Recent studies on RAG, including graph-based health-related fact-checking and contextual retrieval for rapidly evolving domain-specific knowledge (Conger et al., 2025), further highlight the need to evaluate clinicians’ attitudes toward grounded LLMs.
Despite the growing number of publications evaluating the capabilities of LLMs in solving specific medical tasks (Meng et al., 2024), existing studies often focus on technical validation conducted by AI/ML computational science specialists. Such works rely on standardized statistical metrics (Barbella & Tortora, 2022; Reiter, 2018; Lavie & Agarwal, 2007) accompanied by expert evaluation, for which a range of tools have been developed (Vasilev et al., 2025b; Tam et al., 2024). The perception of this technology by its primary stakeholders, practicing clinicians, remains understudied. Physicians’ attitudes toward AI can influence both their expert assessment and the success of LLM implementation in real-world practice (Spotnitz et al., 2024). This is particularly relevant when LLM-based chatbots are used to answer questions, including those that require retrieving information from electronic health records (EHRs) (Reshetnikov et al., 2025).
Recent studies assessing physicians’, healthcare workers’, and students’ attitudes towards LLMs have relied on non-validated questionnaires (Vasilev et al., 2025b; Spotnitz et al., 2024; Reshetnikov et al., 2025). Despite this limitation, which the authors themselves acknowledge, the findings consistently indicate a positive overall perception of LLMs and a willingness to integrate them as assistive tools in clinical practice and education. Validated questionnaires for assessing physicians’ attitudes toward AI in general have been described in the literature (Stein et al., 2024), but they fail to capture the specific features of LLMs and their application in real-world clinical practice.
There is a methodological blind spot in existing research, necessitating the development of a validated questionnaire assessing physicians’ attitudes toward medical LLMs. Developing such a questionnaire would also help identify weaknesses in the implementation pipeline and support targeted improvements in educational programs to enhance physicians’ digital readiness as well as the development of specific administrative solutions.
Thus, the development of a reliable, valid questionnaire to assess physicians’ attitudes toward LLMs represents a necessary step for conducting methodologically sound research in this field. This is particularly important for healthcare initiatives integrating LLM-based chatbots into medical information systems (Reshetnikov et al., 2025). A reliable questionnaire would enable timely identification and management of various aspects of physicians’ attitudes toward LLMs.
We previously developed the ATRAI-14 questionnaire to assess radiologists’ attitudes toward artificial intelligence (AI) technologies (Vasilev et al., 2024). The instrument demonstrated acceptable internal consistency (Cronbach’s α = 0.78; 95% CI 0.68–0.83), high test–retest reliability (intraclass correlation coefficient [ICC] = 0.89; 95% CI 0.67–0.96; p < 0.05), and acceptable criterion validity (Spearman’s ρ = 0.73; p < 0.001).
Given these confirmed psychometric properties, we proceeded to develop an ATRAI-14-based questionnaire adapted for a different target population and expanded AI application scenarios. In this context, the acronym “ATRAI” is conceptualized as denoting a family of instruments designed to assess healthcare professionals’ attitudes toward AI-based digital technologies.
The aim of this study is to develop and assess the reliability and validity of a questionnaire designed to evaluate physicians’ attitudes toward LLM-based chatbots used as a tool for analyzing medical documents, including patient EHRs.
2. Materials and Methods
The design of the questionnaire development and validation is presented in Figure 1.
2.1. Sample Selection
The questionnaire is intended for physicians of all specialties (outpatient and inpatient) and clinical directors.
2.2. Study Participants
The research team, which consisted of three physicians with at least three years of work experience, a sociologist, and three AI/ML computational scientists, was responsible for the questionnaire development.
Experts comprised two physicians with at least three years of work experience and two AI/ML computational scientists.
The focus group consisted of 15 physicians providing care in outpatient and inpatient settings. The population for reliability and validity assessment included 562 physicians of various specialties working in Moscow healthcare and taking part in the pilot project integrating an LLM-based chatbot into the Unified Medical Information and Analytical System (UMIAS) (Reshetnikov et al., 2025).
The survey was multicenter and included specialist physicians from 58 medical organizations: 5 medical organizations providing outpatient care to adults (including 22 satellite health centers), 5 medical organizations providing outpatient care to the pediatric population (including 22 satellite health centers), and 4 multidisciplinary hospitals (3 providing care to adults and 1 to pediatric patients) within the Moscow Department of Health. Participating physicians represented a diverse range of specialties: general practitioners, pediatricians, infectious disease specialists, cardiologists, colorectal surgeons (proctologists), neurologists, nephrologists, otorhinolaryngologists, ophthalmologists, pulmonologists, rheumatologists, trauma and orthopedic surgeons, urologists, andrologists, general surgeons, endocrinologists, obstetricians, immunologists, gastroenterologists, and geriatricians. We also surveyed managerial staff, including chief physicians, branch heads, deputy chief physicians, and department heads.
The LLMs being implemented in healthcare were YandexGPT 5.1 Pro (Yandex LLC., Moscow, Russia) and GigaChat 2.0 (PJSC Sberbank of Russia, Moscow, Russia).
2.3. Questionnaire Development (Item Generation, Reduction, and Questionnaire Formatting)
We based the new instrument on the previously developed and validated ATRAI-14 questionnaire designed to assess radiologists’ attitudes toward artificial intelligence (Vasilev et al., 2024). It is important to note that Moscow Healthcare Department radiologists are not only end-users of AI technologies but also participate in the development of specialized AI algorithms (Vasilev & Vladzymyrskyy, 2025). Other physicians, by contrast, typically act only as end-users of fully developed software. Given the significant differences in how these two groups use AI, we had to adapt and refine several of the original questions.
We based the development of our questionnaire, as well as the parent ATRAI-14, on the theoretical domains framework, which is validated for use in implementation and behavior-change research. According to this framework, the behavior of healthcare workers toward an implemented innovation can be comprehensively assessed across 14 domains (Cane et al., 2012). Thus, we initially preserved the domain structure of the ATRAI-14 questionnaire. It contains a part related to the respondent’s demographics and professional background, followed by the main domains: “Trust”, “Implementation Perspective”, and “Hopes and Fears”.
In accordance with the theory of planned behavior (Ajzen, 1991), we define the attitude toward an LLM assistant as a positive or negative intention of the healthcare professional to use the assistant in their clinical practice. This intention is based on the respondent’s beliefs about the real-world capabilities of the LLM assistant, the consequences of its implementation, and their evaluation of those consequences.
The background part items were adapted to the new objectives and target population. We removed items related to radiology and experience with various imaging modalities and replaced them with items capturing more detailed information on respondents’ positions and specialties. We added an item to the background part to capture prior experience with LLMs in clinical practice. Furthermore, the questions assessing experience with LLMs in the UMIAS, originally in the “Familiarity” domain, were reduced to a single question and moved to the background part.
The “Trust” domain included five items designed to assess physicians’ trust in the LLM-based chatbot integrated into UMIAS. The “Implementation Perspective” domain comprised three items assessing anticipated adoption of the LLM assistant, while the “Hopes and Fears” domain contained three items identifying physicians’ concerns regarding its implementation. In total, the adapted questionnaire consisted of 19 items (8 in the background part and 11 in the main part).
The response formats included a five-point Likert scale, multiple-choice, and a five-point scale. For the Likert-based questions, response options ranged from 1 to 5, corresponding to extremely negative and extremely positive attitudes, respectively. Several items (T2, I1, I3) allowed multiple responses, and the total score for each of these items was calculated based on the number of selected options. To enhance the reliability of the collected data, we also included items with reversed scoring (T2) in the questionnaire. The use of reverse-scored items is a widely accepted methodological practice in survey design. Its primary purpose is to identify respondents who may be answering carelessly or exhibiting response bias, such as acquiescence (tendency to agree regardless of content).
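The scoring rules described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' scoring procedure: the function names and the example answers are hypothetical, and only the 1 to 5 Likert range, the count-based scoring of multiple-response items, and the use of reversed scoring come from the text. How the two rules combine for an item that is both multiple-response and reverse-scored is not specified here, so the sketch treats them separately.

```python
def reverse_score(raw, low=1, high=5):
    """Mirror a response on a low..high scale (1 <-> 5, 2 <-> 4, 3 stays 3)."""
    if not low <= raw <= high:
        raise ValueError(f"response {raw} is outside {low}..{high}")
    return low + high - raw


def multi_response_score(selected):
    """Score a multiple-response item as the number of distinct options chosen."""
    return len(set(selected))


# Hypothetical respondent: one Likert answer and one multiple-response answer.
likert_answer = 2
chosen_options = ["option_a", "option_c"]

print(reverse_score(likert_answer))          # 4
print(multi_response_score(chosen_options))  # 2
```

Keeping the reversal as `low + high - raw` makes the rule independent of the scale length, so the same helper would work if an item used a different number of response points.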
Following the initial drafting of the questions, in-depth interviews with experts were conducted to evaluate the relevance and appropriateness of each item. Afterward, six items were revised.
The preliminary version of the questionnaire was then pilot-tested in a focus group. The focus group evaluated usability, clarity of the item wordings, and the appropriateness of the response options. Following this assessment, the research team determined whether revisions suggested by focus group members should be accepted. If similar comments were provided by the majority of physicians (10 or more) in the focus group, revisions were implemented without further discussion.
The ATRAI-LLM questionnaire was designed to assess physicians’ attitudes toward LLMs as practical tools rather than to evaluate their understanding of the underlying technical architecture. The survey items did not address technical details such as model architecture, training data, or algorithmic mechanisms, nor did they require respondents to possess such knowledge. From the perspective of the physician, the utility of an LLM assistant is judged by its performance in practice, independent of technical complexity. Within this framework, attitudes are shaped primarily by observed functionality: if the LLM performs poorly or produces errors, confidence in it is directly and negatively affected, regardless of the underlying technology.
2.4. Questionnaire Composition
We developed an electronic version of the questionnaire using the survey administration software “Yandex Forms”. The platform processes personal data only to the extent necessary for delivering the form and stores it in compliance with local data protection legislation. All data transmission between the respondent’s browser and Yandex’s servers is encrypted (HTTPS), and Yandex does not retain or publish the respondents’ answers beyond the period required for form fulfilment. No patient-identifying information (e.g., names, medical record numbers, or health-status details) was collected; the questionnaire asked only about physicians’ attitudes and practice-related experiences. Consequently, the data collection complied with relevant patient data protection requirements. Questions were presented in a series of linked pages (multiple-item screens) with accompanying electronic instructions.
Participants received a cover letter explaining the survey’s purpose. The first page of the electronic form presented the informed consent for participation in the study and for publication of the results, which participants needed to accept to proceed.
2.5. Pre-Testing
To evaluate how well respondents understood the questions and response options, four members of the research team conducted individual interviews with participants from the focus group who were similar to the sampling frame. The aim was to assess how the questions were interpreted and whether respondents’ understanding aligned with the original intent (Collins, 2003).
2.6. Sample Size Estimation
The minimum sample size for estimating a latent variable (attitude toward LLMs) based on three observed variables (“Trust”, “Implementation Perspective,” and “Hopes and Fears”) was 328 respondents (type I error rate 0.05, power 0.95) (Soper, 2026). For factor analysis under conditions of good agreement between sample and population, a wide level of communality, and three factors with at least three variables per factor, the estimated minimum necessary sample size was 450 participants (Mundfrom et al., 2005).
2.7. Reliability and Validity Assessment
A validation study was conducted to evaluate the reliability and validity of the questionnaire. Participants were provided access to the electronic version of the ATRAI-LLM. Following data collection, reliability and validity analyses were performed; the statistical methods applied are summarized in Table 1.
Reliability was assessed based on internal consistency.
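For readers unfamiliar with internal consistency, the standard Cronbach’s alpha formula is α = k/(k − 1) × (1 − Σ item variances / variance of the total score), where k is the number of items. The sketch below implements this formula in plain Python for illustration only; the study itself reports using R packages, and the function name and toy data here are hypothetical.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a table of responses.

    rows: list of respondents, each a list of k item scores (no missing values).
    Uses sample variances (n - 1 in the denominator) throughout.
    """
    if len(rows) < 2 or len(rows[0]) < 2:
        raise ValueError("need at least two respondents and two items")
    k = len(rows[0])
    item_variances = [variance(column) for column in zip(*rows)]
    total_variance = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Toy data (invented): three respondents answering two items identically,
# so the items are perfectly consistent.
toy_rows = [[1, 1], [2, 2], [3, 3]]
print(cronbach_alpha(toy_rows))  # 1.0
```

The function would fail with a division by zero if every respondent had the same total score, which is the degenerate case where alpha is undefined.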
Four types of validity were evaluated: face, content, construct, and criterion validity. Face and content validity were assessed by experts (n = 4). The following questions were considered: “Does the questionnaire measure the intended construct?” and “Does the questionnaire adequately cover all key aspects of the domain?” Each expert and research team member provided a binary response (“yes” or “no”) for each item. An item was considered acceptable if at least 75% of experts (≥3 of 4) provided an affirmative response.
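The acceptance rule for expert ratings can be expressed as a short check. This is a minimal sketch, assuming each expert's answer is recorded as a boolean; the function name and example votes are hypothetical, while the 75% threshold and the panel of four experts come from the text.

```python
def item_acceptable(votes, threshold=0.75):
    """Return True if the share of affirmative votes reaches the threshold.

    votes: list of booleans, True meaning the expert answered "yes".
    """
    if not votes:
        raise ValueError("at least one vote is required")
    return sum(votes) / len(votes) >= threshold


# With four experts, three "yes" answers meet the 75% threshold; two do not.
print(item_acceptable([True, True, True, False]))   # True
print(item_acceptable([True, True, False, False]))  # False
```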
Construct validity was examined using confirmatory factor analysis (CFA) to test the hypothesis that the observed data fit the proposed domain structure and to identify items requiring modification or removal. Criterion validity was assessed by comparing questionnaire scores with respondents’ self-reported attitudes measured on a visual analogue scale (VAS) ranging from 0 to 10, where 0 indicated the most negative attitude and 10 the most positive attitude.
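The criterion validity comparison rests on Spearman’s rank correlation between the questionnaire score and the VAS rating. The sketch below shows how that coefficient is obtained: both variables are converted to ranks (tied values share the average rank), and the Pearson correlation of the ranks is computed. It is written in plain Python for illustration; the function names and toy values are hypothetical and do not reproduce the study's analysis.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    mx, my = mean(x), mean(y)
    covariance = sum((a - mx) * (b - my) for a, b in zip(x, y))
    spread_x = sum((a - mx) ** 2 for a in x) ** 0.5
    spread_y = sum((b - my) ** 2 for b in y) ** 0.5
    return covariance / (spread_x * spread_y)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    return pearson(average_ranks(x), average_ranks(y))


# Toy example (invented): total scores and VAS ratings (0 to 10) that rise together.
total_scores = [12, 18, 20, 27]
vas_ratings = [3, 5, 6, 9]
print(spearman_rho(total_scores, vas_ratings))
```

Because the toy values are strictly increasing in both variables, the ranks coincide and the coefficient equals 1 up to floating-point rounding; real survey data with ties and disagreement would yield a value between −1 and 1.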
2.8. Statistical Data Analysis
Data were processed using R version 4.3.1 with the psych (2.4.6), lavaan (0.6-18), and ltm (1.2-0) packages. Calculated values were interpreted according to Table 1 with assessment of statistical significance. A p-value < 0.05 was considered statistically significant for all tests.
4. Discussion
We developed and validated the ATRAI-LLM questionnaire to assess physicians’ attitudes toward LLMs in healthcare, specifically when used as a tool for answering clinical queries. The final questionnaire comprised 19 items: 8 in the background part and 11 in the main part, 9 of which contributed to scoring. The validation results confirmed the three-domain structure: the three-factor model demonstrated satisfactory fit indices (RMSEA = 0.05, CFI = 0.97, TLI = 0.96, SRMR = 0.03). Three domains were retained: “Willingness to Use,” “Implementation Perspective,” and “Hopes and Fears”. Criterion validity analysis showed a statistically significant, moderate correlation between the ATRAI-LLM score and the visual analogue scale assessment (Spearman’s rho = 0.68, p < 0.001). The correlation coefficient observed in the present study was close to that reported in the ATRAI-14 study (Vasilev et al., 2024).
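As an illustration of how the reported fit indices can be read, the sketch below compares them with widely cited conventional cutoffs (RMSEA ≤ 0.06, SRMR ≤ 0.08, CFI ≥ 0.95, TLI ≥ 0.95). These cutoffs are an assumption of this sketch; the interpretation criteria actually applied in the study are those of its Table 1, which may differ. The names `CUTOFFS` and `check_fit` are hypothetical.

```python
# Conventional cutoffs (an assumption of this sketch, not taken from the study).
# "max" means the index should not exceed the bound; "min" means it should reach it.
CUTOFFS = {
    "RMSEA": ("max", 0.06),
    "SRMR": ("max", 0.08),
    "CFI": ("min", 0.95),
    "TLI": ("min", 0.95),
}


def check_fit(indices, cutoffs=CUTOFFS):
    """Return {index_name: bool} telling whether each index meets its cutoff."""
    result = {}
    for name, value in indices.items():
        direction, bound = cutoffs[name]
        result[name] = value <= bound if direction == "max" else value >= bound
    return result


# Fit indices as reported in the text.
reported = {"RMSEA": 0.05, "CFI": 0.97, "TLI": 0.96, "SRMR": 0.03}
print(check_fit(reported))
# {'RMSEA': True, 'CFI': True, 'TLI': True, 'SRMR': True}
```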
The instrument demonstrated acceptable internal consistency (Cronbach’s alpha 0.770, 95% CI [0.731, 0.800]; McDonald’s omega: total ω_t = 0.830, hierarchical ω_h = 0.610). This result exceeds the commonly accepted threshold of 0.7 for research instruments, supporting its use both in scientific studies and in practical healthcare settings, given that end-user attitudes directly influence the success of technological implementation (Spotnitz et al., 2024). Moreover, no ceiling or floor effects were observed, as scores were well distributed across the entire range (median 20 points, maximum 33 out of 36) with an approximately normal distribution (Figure 6A). Additionally, the absence of a maximum score indicates that the current implementation of LLMs is not sufficient to gain the full trust of physicians.
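A floor or ceiling check of the kind mentioned above can be sketched as follows. One common convention flags an effect when more than 15% of respondents obtain the lowest or highest possible score; that cutoff, the assumed minimum score of 0, the function names, and the toy scores are all assumptions of this sketch, and only the maximum of 36 points comes from the text.

```python
def floor_ceiling_shares(scores, min_possible, max_possible):
    """Return (share at the minimum possible score, share at the maximum)."""
    if not scores:
        raise ValueError("scores must not be empty")
    n = len(scores)
    floor_share = sum(1 for s in scores if s == min_possible) / n
    ceiling_share = sum(1 for s in scores if s == max_possible) / n
    return floor_share, ceiling_share


def has_floor_or_ceiling_effect(scores, min_possible, max_possible, cutoff=0.15):
    """Flag an effect if either extreme holds more than `cutoff` of respondents."""
    floor_share, ceiling_share = floor_ceiling_shares(scores, min_possible, max_possible)
    return floor_share > cutoff or ceiling_share > cutoff


# Toy scores (invented); the minimum of 0 is assumed for illustration.
toy_scores = [12, 18, 20, 20, 21, 25, 33]
print(has_floor_or_ceiling_effect(toy_scores, min_possible=0, max_possible=36))  # False
```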
ATRAI-LLM comprises three domains that assess different aspects of attitudes toward LLMs. The “Willingness to Use” domain measures the respondent’s perception of LLM assistant quality. The “Implementation Perspective” domain reflects respondents’ awareness of LLM assistant usefulness. The “Hopes and Fears” domain captures perceptions of how LLMs may influence physicians’ careers in terms of salary and professional prestige. We defined the attitude toward an LLM assistant as a positive or negative intention of the healthcare professional to use the assistant in their clinical practice. According to Conner (2001), three variables determine the intention: (1) the respondent’s evaluation of their behavior, (2) subjective norms, reflecting the respondent’s beliefs about how their peers and significant others would perceive the respondent’s behavior, and (3) the degree of control the respondent has over their behavior in the current situation. Therefore, the construct definition of attitude toward LLMs implemented in the ATRAI-LLM questionnaire fits well with the concept of intention-to-use determinants: the “Willingness to Use” domain corresponds to behavior evaluation, the “Implementation Perspective” domain to the degree of control over the situation, and the “Hopes and Fears” domain to subjective norms.
This domain-based questionnaire structure enables not only the assessment of physicians’ overall attitudes toward LLM assistants but also the identification of weaker domains. ATRAI-LLM may help identify specific barriers and problem areas in the integration of LLMs into real-world clinical practice, including hospital policy, thereby informing the development of targeted organizational and educational interventions. In particular, domain-specific results may support the design of training programs aimed at improving physicians’ competencies in the responsible use of LLM-generated information. Furthermore, repeated administration of the instrument during different stages of system deployment may facilitate longitudinal monitoring of changes in physicians’ attitudes. More specifically, repeated administration before deployment, after training, and after several months of use could help monitor whether physicians’ concerns decrease, whether trust increases appropriately, and whether overreliance risks emerge. This structured feedback would be essential to optimize user acceptance and ensure sustainable integration of LLM-based tools into clinical workflows.
This is also relevant for retrieval-supported LLM systems, where physicians’ attitudes may depend not only on the perceived usefulness of the generated response but also on the transparency of retrieved sources, the perceived reliability of grounding information, and the integration of these functions into clinical workflows. In this context, ATRAI-LLM may help identify whether barriers to adoption are related to general unwillingness to use LLMs, concerns about implementation, or specific doubts regarding source-grounded clinical information.
Physicians’ attitudes toward LLMs should also be interpreted as task-dependent. Previous studies suggest that clinicians may perceive LLMs more favorably when they are used for low-risk administrative or informational tasks, such as documentation support, summarization, or generation of patient educational materials, whereas greater caution is expressed when LLMs are expected to support diagnosis, treatment planning, or clinical judgment (Blease et al., 2025; Tangadulrat et al., 2023). In a qualitative study of UK general practitioners, Blease et al. showed that physicians recognized the potential of LLMs for documentation-related tasks but raised concerns about clinical judgment, accountability, and operational uncertainty. Similarly, Tangadulrat et al. reported that physicians were more cautious than medical students regarding the use of ChatGPT for treatment guidance and medical education, while both groups viewed its use for patient educational materials more positively. These findings are consistent with the structure of ATRAI-LLM, which separates willingness to use, implementation perspectives, and hopes and fears, and may help identify whether physicians’ concerns are related to specific high-risk clinical applications rather than to LLMs in general.
In comparison with the ATRAI-14 questionnaire, the ATRAI-LLM “Hopes and Fears” domain was reduced to two items. While some concerns can be expressed about content validity, sensitivity, and reliability of two-item factors, there is evidence available that even single-item scales can serve as substitutes for 20-item measures of health-related parameters (Cunny & Perri, 1991). Moreover, shorter surveys were shown to be reliable while producing higher response and completion rates (Kost & Correa da Rosa, 2018). Currently, we are performing a study to test this observation in relation to the ATRAI-LLM questionnaire.
According to the results of the factor analysis, items I2 and T1 demonstrated a redistribution of factor loadings and were assigned to factors other than those hypothesized a priori during the scale development process. The observed differences in the factor structure may be attributable to variations in the target population of the instrument and the contextual setting in which AI technologies are applied. In the present study, substantial heterogeneity was observed among respondents, as the sample comprised physicians from diverse clinical specialties. In contrast, the ATRAI-14 study involved only radiologists. This distinction is crucial, as AI has been integrated into radiology for an extended period of time; many algorithms have become commonplace in clinical practice and demonstrate high levels of accuracy. Conversely, LLMs often exhibit errors (Athaluri et al., 2023), the detection of which can significantly depend on the physician’s level of expertise.
Despite existing attempts to evaluate physicians’ attitudes toward LLMs, a major methodological limitation of previous studies is the use of non-validated questionnaires (Spotnitz et al., 2024). In a study by Xu et al. (2024), the questionnaire included demographics, AI baseline proficiency and usage, perception of LLMs, and implications of AI in medical education and healthcare. However, the absence of confirmatory factor analysis prevented definitive conclusions about whether the instrument truly measured the constructs it was intended to measure. Moreover, the authors did not perform an a priori sample size calculation. Their final sample included 102 medical students rather than practicing physicians, which restricts the applicability of the findings to real-world clinical practice.
In Sumner et al. (2025), the sample size was larger (1083 respondents), but the population was highly heterogeneous, including practicing physicians, nurses, hospital administrative staff, and medical students. Furthermore, the questionnaire domains were, in our view, not designed to evaluate the respondent’s personal stance on the potential implementation and use of LLMs in clinical practice. The use of convenience sampling also limits the representativeness of the data and the generalizability of the findings to the broader physician population.
Spotnitz et al. (2024) assessed physicians’ attitudes toward LLMs and their comfort level in using them for various clinical, educational, and research tasks. Participants expressed favorable attitudes toward most evaluated AI-assisted tasks: nearly 70% (16 out of 23) received positive ratings from at least half of the respondents, with the greatest support observed for applications involving data analysis, modeling outbreaks, creating training cases, and clinical decision support. In contrast, tasks involving direct patient communication or complex content generation, such as responding to patient questions about radiology reports or writing original scientific manuscripts, received the fewest positive and the most negative ratings. Thus, the questionnaire used in that study can be viewed primarily as a tool for assessing the acceptability of using LLMs for different tasks. Recruitment was conducted through convenience sampling, and the sample consisted of 30 physicians from a single medical center, limiting the generalizability of the findings. Moreover, although the authors state that they used a valid instrument, standard validation procedures were not performed.
Our questionnaire enables the assessment of physicians’ attitudes toward LLMs used in medicine as tools for answering medical questions. A key advantage of the instrument is its comprehensive validation, confirming its robustness across four types of validity (face, content, construct, and criterion) as well as its reliability. This supports the high quality and interpretability of the data obtained with this tool. Importantly, the potential value of ATRAI-LLM extends beyond quantitative assessment; it also provides rich material for qualitative analysis. In our opinion, it is best to pair the ATRAI-LLM survey results with data on actual LLM usage. There is a gap between self-reported attitude and real-life behavior (Vasilev et al., 2024), and the modern view on the problem calls for integrating additional sources of data to reflect the respondents’ experiences in real-world settings (Shankar et al., 2025).
This study has several limitations. During the development and testing stages, we surveyed only physicians from the Moscow Healthcare Department and validated the questionnaire exclusively within a Russian-speaking population. Furthermore, a substantial proportion of participants (71.7%) reported prior use of LLM-based chatbots within the UMIAS, which may represent a confounding factor in the analysis: the quality and performance of this specific LLM implementation may have influenced respondents’ overall perception of LLMs. We did not conduct test–retest reliability analysis because LLMs are a rapidly evolving technology. Respondents’ attitudes, experiences, and access to LLMs can change significantly even within two weeks, making traditional test–retest procedures potentially inappropriate, as any observed instability could reflect genuine change rather than measurement error. Test–retest analysis relies on the assumption of perfectly stable true scores, which is clearly violated in our case and would therefore bias the analysis. Simulated data reported by Groh show that decreasing true-score stability indeed biases test–retest estimates (Groh, 2026). Therefore, the ATRAI-LLM questionnaire provides an estimate of attitude at the time of the survey. Nevertheless, to partially address concerns regarding the internal coherence of our measures, we conducted an alternative reliability check by analyzing the relationship between a behavioral item assessing actual LLM usage (F1) and overall attitudes toward LLMs. Additionally, the ATRAI-LLM questionnaire was adapted from the previously validated ATRAI-14 instrument, which may have influenced its final structure and item count. We mitigated this potential bias by validating the new questionnaire, confirming its reliability and validity. Finally, the age and gender of respondents were not collected to preserve anonymity. Future studies should include broader demographic variables to assess possible differences in attitudes and usage patterns.
In subsequent publications, we plan to report analyses of physicians’ attitudes toward LLMs and the factors influencing these attitudes.