Open Access Review

Peer-Review Record

Modelling and Measuring Professional Vision in Medical Education: A Cognitive Process Framework

Int. Med. Educ. 2026, 5(2), 52; https://doi.org/10.3390/ime5020052

by Tina Seidel¹, Christian Kosel¹,*, Ricardo Böheim¹, Martin Gartmeier² and Pascal O. Berberat²

Reviewer 1: Xinrui Zhang

Reviewer 2: Anonymous

Submission received: 26 March 2026 / Revised: 13 May 2026 / Accepted: 14 May 2026 / Published: 22 May 2026

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

Much of the conceptual architecture in the manuscript appears adapted from prior professional vision and teacher education literature. The authors should state more directly what is newly developed for medicine.
The justification for why medicine requires a distinct professional vision model remains underdeveloped. The authors should specify which features of medical work, such as multimodal data, time pressure, or patient interaction, make a domain-specific adaptation theoretically necessary.
The authors appropriately acknowledge that eye movements are not direct measures of attention or cognition, but some passages still read as if fixation or transition metrics map rather straightforwardly onto latent noticing processes.
The discussion of simulation-based learning and instructional design is relevant but stays abstract. One or two concrete examples would make the educational implications more actionable.
Several gaze metrics depend heavily on how AOIs are defined, especially in complex or dynamic clinical scenes.

Author Response

We would like to sincerely thank Reviewer 1 for the thoughtful and constructive feedback on our manuscript. We greatly appreciate the time and expertise invested in evaluating our work, and we are particularly grateful for the insightful comments. These comments have helped us to substantially sharpen our argumentation and to clarify the contribution of the PV-CP model to medical education research. Below, we respond to each point in detail and indicate the corresponding revisions made to the manuscript.

Reviewer 1

Much of the conceptual architecture in the manuscript appears adapted from prior professional vision and teacher education literature. The authors should state more directly what is newly developed for medicine.

Thank you very much for this comment. We agree that this needed to be made explicit. We have inserted a new paragraph at the end of the Introduction (immediately following the paragraph beginning "The aim of this paper is to introduce…") that distinguishes the contributions of the present paper from existing literatures. The paragraph states that the noticing–reasoning architecture is transferred as a starting point from professional-vision research developed in teacher education, but that the sub-processes have been re-specified for medical work and aligned with measurable indicators tailored to clinical tasks. It also positions the model relative to existing theories of clinical reasoning, which describe what is reasoned about but largely treat the perceptual front-end as a black box, and to existing eye-tracking research in radiology and surgery, which produces rich descriptive findings but typically without an explicit process model. The PV-CP model is, to our knowledge, the first cognitive process model of professional vision formulated specifically for medicine, and we now state this directly in both the abstract and the new Introduction paragraph.

„What we develop here is, to our knowledge, the first cognitive process model of professional vision formulated specifically for the medical domain. While the broader concept originates in ethnographic work [8] and has been most extensively elaborated in the teaching context [10,11,14], the cognitive sub-processes posited in those frameworks have not been mapped onto the perceptual and reasoning demands of medical practice. Existing accounts of medical expertise, most prominently theories of clinical and diagnostic reasoning [21,52,53], describe how physicians integrate symptoms with knowledge schemas, but they treat the perceptual front-end of this integration largely as a black box. They specify what is reasoned about, not how clinically relevant information is selected from a complex visual scene in the first place. Conversely, eye-tracking research in radiology and surgery has produced a rich descriptive catalogue of gaze patterns [4,16,39,42] but typically operates without an explicit process model linking gaze indicators to interpretive reasoning. The PV-CP model proposed here closes this gap: it transfers the noticing–reasoning distinction from professional-vision research on the processing of teaching situations as a starting architecture, but re-specifies its sub-processes for medical work and aligns each sub-process with measurable indicators tailored to clinical tasks.“

The justification for why medicine requires a distinct professional vision model remains underdeveloped. The authors should specify which features of medical work, such as multimodal data, time pressure, or patient interaction, make a domain-specific adaptation theoretically necessary.

Thank you very much for this point. We addressed this with a new paragraph inserted in chapter 2.1. The new paragraph identifies four features of medical work that differentially load the noticing and reasoning sub-processes specified in Figure 1, and that, taken together, motivate a domain-specific adaptation rather than a relabelling of teacher-education constructs: (i) parallel multimodal information streams (patient appearance, vital-sign monitors, records, colleague input); (ii) time pressure with patient-safety consequences, which elevates the role of fast, schema-driven information selection; (iii) intrinsically dynamic visual material (anatomy in motion, evolving symptoms) rather than the static or slowly changing scenes typical of other professional-vision domains; and (iv) the patient–physician interaction, in which gaze itself functions communicatively and visual attention is also regulated by interpersonal and ethical considerations.

„Several features of medical work make a domain-specific adaptation of professional vision theoretically necessary. First, medical tasks routinely involve multimodal information streams that are processed in parallel: a physician on a ward round simultaneously monitors a patient's appearance and verbal report, vital-sign monitors, paper or electronic records, and the input of colleagues [13]. The visual demands here differ qualitatively from those of a classroom or a single radiograph. Second, many clinical decisions are made under time pressure with substantial consequences for patient safety, which constrains the duration available for deliberate visual search and elevates the importance of fast, schema-driven information selection [15,17]. Third, much of the relevant visual information is itself dynamic (anatomy in motion during ultrasound or surgery, evolving symptoms during a patient interaction) rather than the more static or slowly changing scenes that dominate other professional-vision domains. Fourth, the patient–physician interaction adds a relational layer in which gaze itself functions communicatively, so visual attention is regulated not only by diagnostic relevance but also by interpersonal and ethical considerations. Each of these features differentially loads the noticing and reasoning sub-processes specified in Figure 1, and a model that does not foreground them risks underestimating both the cognitive demands of medical visual work and the requirements for designing instructional support.“

The authors appropriately acknowledge that eye movements are not direct measures of attention or cognition, but some passages still read as if fixation or transition metrics map rather straightforwardly onto latent noticing processes.

We have strengthened the interpretive caveat in chapter 3.1, where the original short paragraph on the probabilistic nature of gaze evidence has been expanded. The revised passage explicitly notes that the inference from gaze to attention rests on auxiliary assumptions that hold imperfectly in dynamic clinical scenes (where parafoveal and covert attention can be substantial), and that long fixations and high transition frequencies are each compatible with several competing cognitive interpretations. We now state plainly that disambiguating these readings requires either converging verbal data or strong task-design constraints, and that gaze indicators should be read throughout the paper as process-level proxies whose interpretation is theoretically substantiated by the PV-CP model. We also note that Reviewer 2 raised the same concern, and the revisions above address that comment as well.

„It is important to note that eye movements do not constitute direct measures of attention or cognition. Gaze behavior provides probabilistic evidence for underlying attentional allocation, and even this inference rests on auxiliary assumptions (most centrally that foveal vision is required for detailed information uptake [41]) that hold imperfectly in dynamic clinical scenes where parafoveal and covert attention can be substantial [16,18]. Long fixations may reflect deep processing, encoding difficulty, hesitation, or simple disengagement; high transition frequencies may indicate skilled relational integration or fragmented search. Disambiguating these competing readings requires either converging evidence from multimodal data such as additional verbal data (see Chapter 4.2) or strong task-design constraints. We therefore treat gaze indicators throughout this paper as process-level proxies whose interpretation is theoretically substantiated by the PV-CP model rather than as direct windows into cognition.“

And

"For example, during the interpretation of complex diagnostic images, experienced physicians tend to display structured reductions in entropy over time, reflecting the progressive narrowing of diagnostic hypotheses [47,48]"

The discussion of simulation-based learning and instructional design is relevant but stays abstract. One or two concrete examples would make the educational implications more actionable.

We have added a new paragraph in chapter 5.2 with two worked examples: (a) a ward-round simulation using a high-fidelity manikin and a confederate "patient", in which the visual–cognitive demand profile (cue salience, multimodal noise) is the experimental manipulation and sub-process indicators (fixation on the critical region, plus a brief structured think-aloud) rather than diagnostic accuracy alone are the dependent measures; and (b) an eye-movement modelling example (EMME) in radiology in which novices view a chest CT overlaid with an expert's gaze trajectory and verbal annotations, then perform a transfer case, with feedback targeted at the specific sub-process where their performance diverged from the expert (information selection, relational organisation, or knowledge integration). Each example explicitly maps the instructional design onto sub-processes of the PV-CP model.

„Two concrete examples illustrate how this can be operationalised. (a) A ward-round simulation using a high-fidelity manikin and a confederate "patient" can systematically vary whether a key clinical cue — say, subtle peripheral cyanosis — is presented alone, alongside a salient but irrelevant cue, or embedded in a noisy multimodal display including monitor alarms and chart entries. Wearable eye tracking captures whether the learner allocates fixations to the diagnostically critical region; a brief structured think-aloud at the end of each scenario captures whether the cue, once fixated, was correctly interpreted. The instructional manipulation is not the case content but the visual–cognitive demand profile, and the dependent measures are PV sub-process indicators rather than diagnostic accuracy alone. (b) An eye-movement modelling example (EMME) in radiology presents novices with a chest CT and overlays the gaze trajectory of an expert who has annotated key cues with brief verbal justifications [5]. After viewing, learners perform a transfer case; instructional feedback is targeted at the sub-process where their performance diverged from the expert — e.g., a delayed first fixation on the relevant region (information selection), an unsystematic scanpath (relational organisation), or a fixated-but-unjustified cue (knowledge integration). In both designs the PV-CP model functions as a diagnostic framework for the learner's process, allowing instructional support to be targeted rather than generic.“

Several gaze metrics depend heavily on how AOIs are defined, especially in complex or dynamic clinical scenes.

Thanks for this comment. We agree, and this is now treated explicitly. We have added a new paragraph at the end of chapter 4.1, before the closing summary, addressing the AOI-dependency of fixation-, transition-, and entropy-based measures. The paragraph notes that AOI delineation is tractable but theory-laden in static images and considerably more demanding in dynamic clinical material, where AOIs must be tracked frame by frame or defined on moving objects, and that small differences in AOI boundaries can produce sizeable differences in derived metrics. We recommend that medical eye-tracking studies report AOI definition procedures, inter-coder agreement on AOI boundaries where applicable, and sensitivity analyses examining how key conclusions shift under reasonable variations in AOI definition.

„A further methodological caveat applies across the indicator families described above. Fixation-, transition-, and entropy-based measures all presuppose that the relevant regions of the visual scene have been delineated in advance. In static images such as a single radiograph this is tractable, although still theory-laden, but in dynamic clinical scenarios — ward rounds, surgical procedures, ultrasound examinations, or simulated patient interactions — AOIs must either be tracked frame by frame or defined on moving objects, and small differences in AOI boundaries can produce sizeable differences in derived metrics. Reported reliabilities of AOI-based measures therefore depend not only on the eye-tracker and analysis pipeline but on the explicitness and inter-rater stability of the AOI scheme itself. We recommend that medical eye-tracking studies report AOI definition procedures, inter-coder agreement on AOI boundaries where applicable, and sensitivity analyses examining how key conclusions shift under reasonable variations in AOI definition.“
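The recommended sensitivity analysis can be illustrated with a minimal sketch. It is not taken from the manuscript or the author response: the rectangular AOI, the `Fixation` and `RectAOI` types, the margin values, and the fixation data are all hypothetical, chosen only to show how total dwell time on one AOI can shift when its boundary is grown or shrunk by a few pixels.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fixation:
    """One fixation: gaze position in pixels and duration in milliseconds."""
    x: float
    y: float
    duration_ms: float


@dataclass(frozen=True)
class RectAOI:
    """Axis-aligned rectangular area of interest (hypothetical AOI shape)."""
    left: float
    top: float
    right: float
    bottom: float

    def expanded(self, margin: float) -> "RectAOI":
        """Return a copy grown (or shrunk, if margin < 0) on every side."""
        return RectAOI(self.left - margin, self.top - margin,
                       self.right + margin, self.bottom + margin)

    def contains(self, fix: Fixation) -> bool:
        """True if the fixation lies inside or on the AOI boundary."""
        return (self.left <= fix.x <= self.right
                and self.top <= fix.y <= self.bottom)


def dwell_time_ms(fixations, aoi):
    """Total dwell time: summed duration of all fixations inside the AOI."""
    return sum(f.duration_ms for f in fixations if aoi.contains(f))


def aoi_sensitivity(fixations, aoi, margins):
    """Recompute dwell time for each boundary margin (in pixels)."""
    return {m: dwell_time_ms(fixations, aoi.expanded(m)) for m in margins}


# Hypothetical fixations around an AOI whose right edge is at x = 150.
fixations = [
    Fixation(100, 100, 200),  # clearly inside
    Fixation(148, 100, 300),  # inside, but close to the right edge
    Fixation(155, 100, 250),  # just outside the right edge
    Fixation(300, 300, 400),  # far away
]
aoi = RectAOI(left=50, top=50, right=150, bottom=150)
result = aoi_sensitivity(fixations, aoi, margins=[-5, 0, 10])
print(result)
```

In this toy data a boundary shift of a few pixels changes the dwell time several-fold, which is the kind of dependence a reported sensitivity analysis would make visible. Real studies would typically use polygonal or dynamically tracked AOIs and margins justified by tracker accuracy.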

We would like to express our gratitude to Reviewer 2 for the careful and detailed evaluation of our manuscript and for the encouraging assessment that our work addresses a significant and underexplored area within medical education. We particularly appreciate the constructive suggestions regarding the clarification of newly proposed components, the differentiation from teacher education literature and the positioning of our framework relative to existing models of clinical reasoning. These comments have been highly valuable in strengthening both the conceptual clarity and the methodological rigor of our manuscript. In the following, we address each point individually and describe the corresponding revisions.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

This manuscript reviews and develops a theoretical framework for a Professional Vision Cognitive Process (PV-CP) model designed for medical education. It suggests that clinically relevant visual expertise is a hidden, process-based skill that brings together perceptual noticing and reasoning. The paper also recommends using both gaze-tracking and verbal indicators to measure this expertise. This topic is important and useful for medical education, especially as simulation, imaging, and learning analytics become more common.

Please make sure the manuscript clearly explains the following points:

What components are newly proposed for medicine?
What is transferred from teacher education literature?
How does your framework offer improvements over current models of medical expertise?
Why do you think current models of clinical reasoning are not sufficient?

The manuscript appropriately notes that gaze ≠ cognition, but still sometimes overstates the interpretability of eye-tracking metrics.

The paper proposes a useful framework, but stops short of operational next steps. A stronger final section should include concrete research questions, such as:

Do gaze/verbal profiles predict OSCE performance?
Can PV be trained longitudinally?
Which PV subprocesses distinguish residents from attendings?
Does EMT-guided feedback improve transfer to patient care?
How stable are PV indicators across cases?

The manuscript needs professional English editing.

Some minor issues:

Duplicate references exist (e.g., Waite et al. 2019 appears twice).
Figure 1 is conceptually dense and visually difficult to interpret.
Define abbreviations at first use consistently (EMT, AOI, PV-CP)
Table 1 is useful but should include psychometric cautions/reliability notes.

The manuscript examines a significant and underexplored area within medical education. The integration of perceptual noticing and reasoning processes presents a promising conceptual approach that may be suitable for publication following substantial revision.

However, the manuscript needs to explain its new ideas more clearly, provide better context for medical education, strengthen its methods, organise the content more effectively, and improve the language throughout.

Comments on the Quality of English Language

The manuscript needs professional English editing.

Author Response

Please make sure the manuscript clearly explains the following points:

What components are newly proposed for medicine?

This is taken up in the new paragraph at the end of chapter 1 (also addressing Reviewer 1's first comment) and the new paragraph in chapter 2.1 (also addressing Reviewer 1's second comment). Together they identify three medicine-specific contributions: (i) a process model that explicitly couples perceptual sub-processes (information selection, breadth of visual field, schema-aligned vs. schema-non-aligned processing, organising) with reasoning sub-processes (cue justification, hypothesis articulation, causal explanation, structural representation) under medical task demands; (ii) the explicit alignment of these sub-processes with the multimodal, time-pressured, dynamic, and interpersonal features of clinical work; and (iii) a measurement framework (gaze + verbal indicators, summarised in Table 1 and chapter 4) in which each indicator family is theoretically tied to a sub-process and accompanied by a statement of its inferential scope, limitations, and reliability considerations.

What is transferred from teacher education literature?

We have made this transparent in the new Introduction paragraph. The high-level noticing–reasoning architecture (Sherin et al., 2011 [23]; Seidel & Stürmer, 2014 [14]; Seidel et al., 2025 [10,11]) is transferred as the organising backbone of the model. The dispositions / situation-specific skills / performance distinction in §2.2 is also drawn from competence-development work largely conducted in teacher-education contexts (Blömeke et al. [22]; Weinert et al. [25]). The new Introduction paragraph names this transfer explicitly and contrasts it with what required re-specification for medicine.

How does your framework offer improvements over current models of medical expertise?

Two improvements are now stated explicitly in the new Introduction paragraph. First, classical models of clinical reasoning (e.g., Boshuizen & Schmidt [53]; Norman et al. [52]; Charlin et al. [51]) describe how acquired knowledge is mobilised in interpretation, but treat the upstream perceptual selection of information from the visual scene largely as an unmodelled prior step. The PV-CP model decomposes this perceptual front-end into specifiable sub-processes that can be measured. Second, eye-tracking research in radiology and surgery (e.g., Brunyé et al. [4]; Kundel et al. [16]; van der Gijp et al. [48]; Waite et al. [2]) has produced a rich descriptive catalogue of expert–novice gaze differences, but generally without a domain-specific process model that links those differences to interpretive reasoning. The PV-CP model bridges these literatures by mapping gaze indicators onto a process-level architecture that also accommodates verbal reasoning indicators, and by proposing that valid inference about professional vision requires the alignment of both indicator families.

Why do you think current models of clinical reasoning are not sufficient?

Current models of clinical reasoning are well developed for the knowledge-mobilisation side of diagnostic decision making but are largely silent on three points relevant to medical education: (i) how perceptually relevant information is selected in real time from a complex multimodal visual scene; (ii) how the temporal organisation of visual exploration (e.g., breadth of overview, sequencing of fixations) interacts with hypothesis generation; and (iii) how interpersonal regulation of gaze in patient interaction shapes what information enters the reasoning process at all. These are precisely the points the PV-CP model is designed to make tractable for both research and instructional design. The new Introduction paragraph and the new chapter 2.1 paragraph develop this argument.

The manuscript appropriately notes that gaze ≠ cognition, but still sometimes overstates the interpretability of eye-tracking metrics.

Addressed jointly with Reviewer 1's third comment. We have substantially expanded the interpretive caveat in chapter 3.1, which now explicitly notes that the inference from gaze to attention rests on auxiliary assumptions that hold imperfectly in dynamic clinical scenes (where parafoveal and covert attention can be substantial), and that long fixations and high transition frequencies are each compatible with several competing cognitive interpretations. The paragraph states plainly that disambiguating these readings requires either converging multimodal data such as additional verbal data or strong task-design constraints, and that gaze indicators should be read throughout the paper as process-level proxies whose interpretation is theoretically constrained by the PV-CP model rather than as direct windows on cognition. We have also added a new paragraph at the end of chapter 4.1 on AOI-definition sensitivity in dynamic clinical scenes (also addressing Reviewer 1's fifth comment).

„It is important to note that eye movements do not constitute direct measures of attention or cognition. Gaze behavior provides only probabilistic evidence for underlying attentional allocation, and even this inference rests on auxiliary assumptions — most centrally that foveal vision is required for detailed information uptake [41] — that hold imperfectly in dynamic clinical scenes where parafoveal and covert attention can be substantial [16,18]. Long fixations may reflect deep processing, encoding difficulty, hesitation, or simple disengagement; high transition frequencies may indicate skilled relational integration or fragmented search. Disambiguating these competing readings requires either converging evidence from multimodal data such as additional verbal data (see Chapter 4.2) or strong task-design constraints. We therefore treat gaze indicators throughout this paper as process-level proxies whose interpretation is theoretically substantiated by the PV-CP model rather than as direct windows on cognition.“

The paper proposes a useful framework, but stops short of operational next steps. A stronger final section should include concrete research questions, such as:

Do gaze/verbal profiles predict OSCE performance?
Can PV be trained longitudinally?
Which PV subprocesses distinguish residents from attendings?
Does EMT-guided feedback improve transfer to patient care?
How stable are PV indicators across cases?

Thank you for this valuable comment. We have addressed this by extending chapter 5.3 (Implications for professional competence development) with a new and more detailed closing paragraph that covers the questions raised by the reviewer. We chose to integrate the research agenda into chapter 5.3 rather than open a separate section, because chapter 5.3 already frames PV as a longitudinal, developmental phenomenon and the five questions follow directly from that framing. The new paragraph reads:

„Building on these developmental considerations, the PV-CP framework opens a tractable empirical agenda for medical education research. We highlight five questions that, in our view, are both feasible with current methodology and consequential for the field. First, do combined gaze and verbal-reasoning profiles, derived from standardised PV tasks, predict performance on Objective Structured Clinical Examinations (OSCEs) or in-vivo diagnostic accuracy beyond what is predicted by knowledge tests alone? Demonstrating such incremental validity would establish PV as a distinct competence component rather than a redundant proxy for knowledge. Second, can professional vision be trained longitudinally, and which sub-processes are most malleable? Designs that track noticing and reasoning indicators across the medical curriculum — ideally at matched task points in years 1, 3, and final — would clarify whether PV develops gradually with case exposure, in step-wise fashion at clerkship transitions, or only with deliberate practice. Third, which PV sub-processes most reliably distinguish residents from attending physicians on the same cases? Existing expert–novice contrasts compare populations that differ on many dimensions; resident–attending contrasts on identical material would isolate the sub-processes that change with the final, slowest-acquired layers of expertise. Fourth, do EMT-guided feedback interventions — for example, showing learners their own gaze contrasted with an expert reference — produce transfer to authentic patient-care tasks rather than only to similar laboratory cases? Fifth, how stable are PV indicators across cases of comparable difficulty, and across modalities (static images, dynamic scenes, patient encounters)? Generalisability-theory analyses are needed to estimate how many cases are required for a defensible inference at the individual learner level, which is the precondition for any formative or summative use of PV measures.“
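The fifth question, how many cases are needed for a defensible individual-level inference, can be made concrete with a small sketch. It assumes the simplest generalisability-theory design (persons crossed with cases) and the standard relative G coefficient for a mean over n cases, σ²_p / (σ²_p + σ²_res / n). The variance components and the target value below are hypothetical placeholders, not estimates from the manuscript.

```python
def g_coefficient(var_person: float, var_residual: float, n_cases: int) -> float:
    """Relative G coefficient for a score averaged over n_cases.

    Assumes a persons x cases design in which var_residual is the
    person-by-case interaction confounded with error.
    """
    if n_cases < 1:
        raise ValueError("n_cases must be at least 1")
    return var_person / (var_person + var_residual / n_cases)


def min_cases_for(var_person: float, var_residual: float,
                  target: float, max_cases: int = 1000):
    """Smallest number of cases whose G coefficient reaches the target.

    Returns None if the target is not reached within max_cases.
    """
    for n in range(1, max_cases + 1):
        if g_coefficient(var_person, var_residual, n) >= target:
            return n
    return None


# Hypothetical variance components: case-specific variation is four
# times larger than stable between-person variation.
var_person = 1.0
var_residual = 4.0

needed = min_cases_for(var_person, var_residual, target=0.7)
print(needed, round(g_coefficient(var_person, var_residual, needed), 3))
```

Under these assumed components, a single case yields a coefficient of only 0.2, so individual-level use of a PV indicator would require averaging over multiple cases. Actual variance components would have to be estimated from data, and richer designs (raters, modalities) add further terms.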

The manuscript needs professional English editing.

Thank you for noting this. The final manuscript will undergo professional English proofreading prior to publication to ensure consistent style, grammar, and idiomatic phrasing throughout.

Some minor issues:

Duplicate references exist (e.g., Waite et al. 2019 appears twice).

Thanks, we fixed this issue.

Figure 1 is conceptually dense and visually difficult to interpret.

We have retained the figure in its current form because it represents the full set of sub-processes and their relationships as posited by the PV-CP model, and we found that further visual simplification would either lose theoretical content or shift content into supplementary panels at the cost of an integrated overview. To support readers, however, we have substantially expanded the figure caption to make the figure self-contained: the revised caption walks the reader through the perceptual encoding stage, the schema-comparison stage, and the integration stage, and explicitly identifies which sub-processes constitute the noticing component and which constitute the reasoning component. We hope that this provides a clearer entry point into the figure while preserving its representational completeness.

„Figure 1. Professional vision cognitive processing model (PV-CP) [11]. The model represents professional vision as a temporally unfolding process in which incoming visual information from a clinical scene is first selectively encoded through foveal and parafoveal processing (left), shaped by activated professional schemas in extended long-term working memory. Encoded information is then compared against case- and schema-based expectations: schema-aligned cues support fast, fluent processing, whereas schema-non-aligned cues trigger more fine-grained visual search and schema adaptation (centre). Selected cues are organised into meaningful chunks and integrated with domain-specific knowledge to support interpretation, hypothesis generation, and clinical decision making (right). The two upper sub-processes constitute the noticing component of professional vision; the two lower sub-processes constitute the reasoning component.“

Define abbreviations at first use consistently (EMT, AOI, PV-CP)

Thanks, we now define all abbreviations.

Table 1 is useful but should include psychometric cautions/reliability notes.

Thanks! We addressed this comment by adding a set of reliability and psychometric notes immediately below Table 1, keyed to the relevant indicator families by superscript letters (a–e) added to the corresponding cells in the table. The notes flag the most consequential considerations for each family — including AOI-boundary sensitivity for fixation- and transition-based measures, scale-sensitivity of distributional ratios, the need for an a priori specification of expected vs. anomalous regions for schema-based measures, algorithm-dependence of scanpath similarity indices, and the small-sample bias of entropy estimators — together with concrete recommendations for what should be reported in empirical studies.

„Notes on reliability and psychometric considerations.
ᵃ Fixation-based measures (Row 1): Generally good test–retest reliability for total dwell and fixation count on stable AOIs; both metrics are sensitive to AOI boundary definition and to minimum-fixation thresholds, which should be reported.
ᵇ AOI-based distribution measures (Row 2): Distributional ratios are scale-sensitive: small AOIs in cluttered scenes can yield unstable estimates. Report AOI size, count, and the rationale for central vs. peripheral classification.
ᶜ Schema-based measures (Row 3): Validity hinges on a defensible a priori specification of expected vs. anomalous regions, ideally derived from expert consensus, with inter-rater agreement reported.
ᵈ Transition- and scanpath-based measures (Row 4): Sensitive to AOI granularity; scanpath similarity indices vary substantially by algorithm (e.g., string-edit, ScanMatch, MultiMatch). Use one indicator family consistently and report the algorithm.
ᵉ Entropy and variability measures (Row 5): Entropy estimates require minimum scanpath length and have documented small-sample bias; report fixation counts per trial and consider bias-corrected estimators.“
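To illustrate the small-sample caution for entropy measures, the following sketch compares a plug-in Shannon entropy of AOI visits with the Miller-Madow correction, one common bias-corrected estimator. The response does not name a specific estimator, so this choice is an assumption, and the AOI labels and sequence are hypothetical.

```python
import math
from collections import Counter


def plugin_entropy_bits(aoi_sequence):
    """Plug-in (maximum-likelihood) Shannon entropy of AOI visits, in bits."""
    n = len(aoi_sequence)
    if n == 0:
        raise ValueError("aoi_sequence must not be empty")
    counts = Counter(aoi_sequence)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def miller_madow_entropy_bits(aoi_sequence):
    """Plug-in entropy plus the Miller-Madow correction (K - 1) / (2N).

    K is the number of AOIs actually observed and N the number of
    fixations; dividing by ln(2) converts the correction from nats to bits.
    """
    n = len(aoi_sequence)
    k = len(set(aoi_sequence))
    return plugin_entropy_bits(aoi_sequence) + (k - 1) / (2 * n * math.log(2))


# Hypothetical short trial: four fixations alternating between two AOIs.
short_trial = ["monitor", "patient", "monitor", "patient"]
h_plugin = plugin_entropy_bits(short_trial)
h_mm = miller_madow_entropy_bits(short_trial)
print(h_plugin, h_mm)
```

The correction term shrinks as the number of fixations grows, which is one reason to report fixation counts per trial alongside any entropy value.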

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

The paper still reads as an adaptation of teacher professional vision theory rather than a distinctly medical framework. More emphasis on what sets medical visual work apart (diagnostic uncertainty, procedural dynamics, patient safety) would strengthen the argument.
Professional vision probably varies considerably across radiology, surgery, bedside examination, and ward rounds. The authors should clarify whether the PV-CP model is meant to be domain-general or adapted to specific clinical tasks.
The authors should clarify whether PV indicators are intended for individual-level assessment, group-level research, or both.
The inclusion of auditory and verbal elements sits in tension with the paper's visual attention framing. The authors should clarify whether professional vision here is strictly visual or extends to multimodal clinical perception.

Author Response

The paper still reads as an adaptation of teacher professional vision theory rather than a distinctly medical framework. More emphasis on what sets medical visual work apart (diagnostic uncertainty, procedural dynamics, patient safety) would strengthen the argument.

We thank the reviewer for this important comment. We believe that the substantive adaptation of the model to medical visual work was already present in the previous version, but the rhetorical order of Section 2 placed the generic professional-vision architecture before the medical features that motivate it, which made the medical specifics read as framing rather than as constitutive of the model. We have addressed this by restructuring Section 2 so that medical distinctiveness now leads the argument and the cognitive mechanisms are introduced as a direct response to it.

Concretely, we made two major changes. First, the four features of medical visual work, which the reviewer also identifies (diagnostic uncertainty under incomplete information; procedural dynamics and perception–action coupling; patient-safety consequences and time pressure; and multimodal parallel information streams with communicative gaze), are now treated in a dedicated opening subsection, 2.1 What makes visual work in medicine distinctive, which precedes the discussion of cognitive mechanisms. Each feature is presented in its own paragraph, which closes with an explicit statement relating that feature to the specific sub-processes of the PV-CP model.

Second, the cognitive mechanisms (information-reduction hypothesis, holistic model of image perception, organizing and integrating sub-processes) are now presented in a renamed Section 2.2, A cognitive process model for the demands of medical visual work, framed at the outset as the architecture that meets the demands set out in Section 2.1. Medical anchor points are inserted where the cognitive mechanisms are most clearly shaped by medical demands. Notably, schema alignment is now introduced as a response to decision-making against differential diagnoses rather than a determinate scene; parafoveal processing is anchored to radiological images and ward-round monitoring; information reduction is linked to safety-critical time pressure; and the chunking of organized information is illustrated with a multimodal ward-round chunk that bundles visual, auditory, and numerical cues.

Professional vision probably varies considerably across radiology, surgery, bedside examination, and ward rounds. The authors should clarify whether the PV-CP model is meant to be domain-general or adapted to specific clinical tasks.

We thank the reviewer for prompting this clarification. Our position is that the PV-CP model is proposed as a cognitive architecture across medical visual tasks rather than as a domain-general framework on the one hand or a collection of task-specific model variants on the other. The four sub-processes (information selection, breadth of visual field, organizing, and integrating) are posited to be active across radiology, surgery, bedside physical examination, and ward rounds, but the weighting and observable expression of these sub-processes are expected to differ by task.

We have added a paragraph at the end of Section 2.2 that states this scope claim and gives a one-sentence specification of the task-specific emphases for each of the four clinical contexts the reviewer names. Radiology is characterized by the dominance of information selection on static or quasi-static images and breadth of visual field; surgery and other procedural work by organizing and integrating within a perception–action loop; bedside physical examination by multimodal cue integration and schema-driven anomaly detection; and ward rounds by organizing across heterogeneous information sources together with the communicative regulation of gaze. The paragraph closes by noting that the model offers a shared vocabulary and sub-process structure for studying professional vision across these contexts, while leaving the relative weight of each sub-process open to task-specific specification.

A small change was also made to Table 1: we now give concrete examples of typical medical applications for each metric, directly linked to the indicator families.

“Scope of the model. The PV-CP model is proposed as a cognitive architecture across medical visual tasks rather than as a domain-general or fully task-specific framework. The four sub-processes — information selection, breadth of visual field, organizing, and integrating — are posited to be active across radiology, surgery, bedside physical examination, and ward rounds, but the weighting and observable expression of these sub-processes differ by task. In radiology, information selection on static or quasi-static images and breadth of visual field (foveal–parafoveal coordination, holistic first impression) dominate, while organizing operates over spatially distributed image regions [16, 42, 47]. In surgery and other procedural work, organizing and integrating unfold within a tight perception–action loop, and the relevant breadth of field includes the surgeon's own instruments and the patient anatomy in motion. In bedside physical examination, multimodal cue integration and schema-driven anomaly detection are central, and the visual scene is co-constructed with the patient's behavior. In ward rounds, organizing across heterogeneous information sources — patient, monitor, record, colleague — and the communicative regulation of gaze are particularly loaded [13]. The model thus offers a shared vocabulary and a shared sub-process structure for studying professional vision across these contexts, while leaving the relative weight of each sub-process open to task-specific specification.”

The authors should clarify whether PV indicators are intended for individual-level assessment, group-level research, or both.

We thank the reviewer for requesting this clarification. Our position is that the indicator families described in the manuscript are, at the current state of evidence, primarily suited for group-level research: expert–novice comparisons, evaluation of instructional interventions, and characterization of sub-process profiles across clinical tasks. Individual-level use is treated as an aspirational application contingent on generalizability-theory analyses that estimate the number of cases and task structure required for defensible learner-level inferences. We have added a paragraph at the end of Section 3 ("Level of use") that states this position and cross-references the reliability notes in Table 1 and Research Question 5 in Section 5.3.

“Level of use. A further scope clarification concerns the level at which the measurement framework is intended to operate. The indicator families described in the following sections are, at the current state of evidence, primarily suited for group-level research, for example comparing experts and novices, evaluating instructional interventions, or characterizing sub-process profiles across clinical tasks. Individual-level use, whether for formative feedback or summative assessment of a single learner's professional vision, places substantially stronger demands on indicator reliability than has so far been established for most gaze- and reasoning-based metrics. Reported reliabilities for fixation-, transition-, and entropy-based measures vary across studies, AOI schemes, and analytic choices (see notes to Table 1), and generalizability across cases of comparable difficulty has rarely been quantified for medical PV indicators. We therefore treat individual-level use as an aspirational application contingent on generalizability-theory analyses that estimate, for each indicator family, the number of cases and the task structure required for defensible learner-level inferences, an open empirical question we return to in Section 5.3 (Research Question 5). Until such evidence is available, individual-level interpretations should be triangulated across indicator families rather than rest on a single metric, and they should be made with explicit acknowledgement of measurement error.”
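As a rough illustration of what such a generalizability-theory analysis estimates, the following minimal sketch projects the generalizability coefficient for a simple persons × cases design and solves for the number of cases needed to reach a target value. The design, the function names, and the variance components in the usage note are assumptions made for illustration; they are not results or methods reported in the manuscript.

```python
import math


def g_coefficient(var_person, var_residual, n_cases):
    """Generalizability coefficient for relative decisions in a p x c design.

    var_person:   variance attributable to learners (true-score variance).
    var_residual: person-by-case interaction confounded with error.
    n_cases:      number of cases averaged per learner.
    """
    if var_person <= 0 or var_residual < 0 or n_cases < 1:
        raise ValueError("invalid variance components or number of cases")
    return var_person / (var_person + var_residual / n_cases)


def cases_needed(var_person, var_residual, target=0.80):
    """Smallest number of cases whose projected coefficient reaches target."""
    if not 0 < target < 1:
        raise ValueError("target must lie strictly between 0 and 1")
    if var_person <= 0 or var_residual < 0:
        raise ValueError("invalid variance components")
    # Rearranging the coefficient formula gives
    # n >= target / (1 - target) * var_residual / var_person.
    n = target / (1 - target) * var_residual / var_person
    # A small tolerance guards against floating-point overshoot before ceil.
    return max(1, math.ceil(n - 1e-9))
```

With purely hypothetical variance components of 1.0 (person) and 2.5 (residual) and a target coefficient of 0.75, `cases_needed(1.0, 2.5, target=0.75)` returns 8. The sketch makes visible why individual-level use is more demanding than group-level use: the larger the case-specific variance relative to learner variance, the more cases per learner are required.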

The inclusion of auditory and verbal elements sits in tension with the paper's visual attention framing. The authors should clarify whether professional vision here is strictly visual or extends to multimodal clinical perception.

We thank the reviewer for identifying this tension. We retain the term professional vision for three reasons: it is the established term in the literature this paper extends; visual attention is the analytically central modality in the medical tasks we consider; and visual attention is the modality with the most developed measurement repertoire. However, we now define the underlying construct as multimodal clinical perception anchored in visual attention. This scope clarification is now stated at first use in the abstract and reiterated at the close of Section 2.1, where it links back to the measurement framework in Sections 4.1 and 4.2 (gaze-based indicators capturing the visual anchor, verbal indicators capturing the reasoning that integrates information across modalities). The scope is consistent with Figure 1, which already includes auditory memory and verbal chunks alongside foveal and parafoveal visual processing.

“Taken together, this means that professional vision in medicine, as defined at the outset, is best understood as multimodal clinical perception anchored in visual attention rather than as a purely visual construct — a scope we make explicit in the measurement framework (Sections 4.1 and 4.2), where gaze-based indicators capture the visual anchor and verbal indicators capture the reasoning that integrates information across modalities.”

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

The revised manuscript presents a theoretically ambitious and well-structured contribution to the medical education literature. It successfully bridges professional vision research, cognitive process theory, and multimodal measurement approaches in a way that is both conceptually rigorous and practically relevant.

A major strength of the paper is the clear articulation of the Professional Vision Cognitive Processing (PV-CP) model. The authors convincingly demonstrate that current medical expertise research often treats perceptual processing as a “black box,” while eye-tracking studies frequently lack a coherent cognitive framework linking gaze behaviour to reasoning processes.

The manuscript effectively addresses this gap by integrating the noticing and reasoning subprocesses into a unified conceptual architecture tailored to medicine.

The implications section is especially compelling because it translates the conceptual framework into actionable research and instructional directions.

The concrete examples involving ward-round simulations and eye-movement modelling examples (EMME) substantially improve the practical relevance of the manuscript and help readers envision how the framework can inform future educational design and empirical research.

Author Response

We thank the reviewer for their positive and thoughtful evaluation of our revised manuscript. We are grateful for the recognition that the Professional Vision Cognitive Processing (PV-CP) model addresses a substantive gap in the literature by linking perceptual and reasoning processes within a single framework tailored to medicine, and that the integration of noticing and reasoning subprocesses is conceptually coherent. We are equally pleased that the implications section, together with the ward-round simulation and eye-movement modelling example (EMME) illustrations, is seen to strengthen the practical relevance of the work. As the reviewer raises no further points requiring revision, no additional changes have been made in response to this report. We thank the reviewer for their careful and constructive engagement with the manuscript.
