Search Results (653)

Search Parameters:
Keywords = inter-rater reliability

14 pages, 626 KiB  
Article
Mapping Clinical Questions to the Nursing Interventions Classification: An Evidence-Based Needs Assessment in Emergency and Intensive Care Nursing Practice in South Korea
by Jaeyong Yoo
Healthcare 2025, 13(15), 1892; https://doi.org/10.3390/healthcare13151892 (registering DOI) - 2 Aug 2025
Abstract
Background/Objectives: Evidence-based nursing practice (EBNP) is essential in high-acuity settings such as intensive care units (ICUs) and emergency departments (EDs), where nurses are frequently required to make time-critical, high-stakes clinical decisions that directly influence patient safety and outcomes. Despite its recognized importance, the implementation of EBNP remains inconsistent, with frontline nurses often facing barriers to accessing and applying current evidence. Methods: This descriptive, cross-sectional study systematically mapped and prioritized clinical questions generated by ICU and ED nurses at a tertiary hospital in South Korea. Using open-ended questionnaires, 204 clinical questions were collected from 112 nurses. Each question was coded and classified according to the Nursing Interventions Classification (NIC) taxonomy (8th edition) through a structured cross-mapping methodology. Inter-rater reliability was assessed using Cohen’s kappa coefficient. Results: The majority of clinical questions (56.9%) were mapped to the Physiological: Complex domain, with infection control, ventilator management, and tissue perfusion management identified as the most frequent areas of inquiry. Patient safety was the second most common domain (21.6%). Notably, no clinical questions were mapped to the Family or Community domains, highlighting a gap in holistic and transitional care considerations. The mapping process demonstrated high inter-rater reliability (κ = 0.85, 95% CI: 0.80–0.89). Conclusions: Frontline nurses in high-acuity environments predominantly seek evidence related to complex physiological interventions and patient safety, while holistic and community-oriented care remain underrepresented in clinical inquiry. Utilizing the NIC taxonomy for systematic mapping establishes a reliable framework to identify evidence gaps and support targeted interventions in nursing practice. 
Regular protocol evaluation, alignment of continuing education with empirically identified priorities, and the integration of concise evidence summaries into clinical workflows are recommended to enhance EBNP implementation. Future research should expand to multicenter and interdisciplinary settings, incorporate advanced technologies such as artificial intelligence for automated mapping, and assess the long-term impact of evidence-based interventions on patient outcomes. Full article
(This article belongs to the Section Nursing)
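The abstract above reports inter-rater reliability as Cohen's kappa. As a minimal sketch of how that statistic is computed for two raters, assuming one category per item, the following may help; the function name and the example codes are hypothetical and are not taken from the study.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters assigning one category per item."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same category.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal category frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_o - p_e) / (1 - p_e)

# Hypothetical domain codes from two raters (illustrative only, not study data).
a = ["phys", "phys", "safety", "phys", "behav", "safety"]
b = ["phys", "phys", "safety", "behav", "behav", "safety"]
print(round(cohens_kappa(a, b), 3))
```

Kappa corrects raw percent agreement for the agreement expected by chance, which is why it is usually preferred over simple agreement rates when categories are unevenly used.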
Show Figures

Figure 1

15 pages, 580 KiB  
Article
Reliability and Inter-Device Agreement Between a Portable Handheld Ultrasound Scanner and a Conventional Ultrasound System for Assessing the Thickness of the Rectus Femoris and Vastus Intermedius
by Carlante Emerson, Hyun K. Kim, Brian A. Irving and Efthymios Papadopoulos
J. Funct. Morphol. Kinesiol. 2025, 10(3), 299; https://doi.org/10.3390/jfmk10030299 (registering DOI) - 1 Aug 2025
Abstract
Background: Ultrasound (U/S) can be used to evaluate skeletal muscle characteristics in clinical and sports settings. Handheld U/S devices have recently emerged as a cheaper and portable alternative to conventional U/S systems. However, further research is warranted on their reliability. We assessed the reliability and inter-device agreement between a handheld U/S device (Clarius L15 HD3) and a more conventional U/S system (GE LOGIQ e) for measuring the thickness of the rectus femoris (RF) and vastus intermedius (VI). Methods: Cross-sectional images of the RF and VI muscles were obtained in 20 participants by two assessors, and on two separate occasions by one of those assessors, using the Clarius L15 HD3 and GE LOGIQ e devices. RF and VI thickness measurements were obtained to determine the intra-rater reliability, inter-rater reliability, and inter-device agreement. Results: All intraclass correlation coefficients (ICCs) were above 0.9 for intra-rater reliability (range: 0.94 to 0.97), inter-rater reliability (ICC: 0.97), and inter-device agreement (ICC: 0.98) when comparing the two devices in assessing RF and VI thickness. For the RF, the Bland–Altman plot revealed a mean difference of 0.06 ± 0.07 cm, with limits of agreement ranging from 0.21 to −0.09, whereas for the VI, the Bland–Altman plot showed a mean difference of 0.07 ± 0.10 cm, with limits of agreement ranging from 0.27 to −0.13. Conclusions: The handheld Clarius L15 HD3 was reliable and demonstrated high agreement with the more conventional GE LOGIQ e for assessing the thickness of the RF and VI in young, healthy adults. Full article
(This article belongs to the Section Kinesiology and Biomechanics)
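The Bland–Altman figures in the abstract above can be related to each other with a short calculation. This is a sketch under two assumptions that the abstract does not state: that the "±" value is the standard deviation of the paired differences, and that the limits of agreement use the conventional 1.96 multiplier. Function names are hypothetical.

```python
from statistics import mean, stdev

def limits_of_agreement(mean_diff, sd_diff, z=1.96):
    """Lower and upper limits of agreement: bias -/+ z * SD of differences."""
    return mean_diff - z * sd_diff, mean_diff + z * sd_diff

def bland_altman(x, y):
    """Bias and limits of agreement from paired measurements x and y."""
    diffs = [a - b for a, b in zip(x, y)]
    bias = mean(diffs)
    return bias, limits_of_agreement(bias, stdev(diffs))

# Summary values reported for the vastus intermedius (cm).
lo, hi = limits_of_agreement(0.07, 0.10)
print(round(lo, 2), round(hi, 2))
```

With the reported VI values this reproduces the stated limits (−0.13 to 0.27 cm). The RF values give roughly −0.08 to 0.20 cm, slightly narrower than the reported −0.09 to 0.21 cm, which may simply reflect rounding of the published mean and SD.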
12 pages, 617 KiB  
Article
Increased Posterior Tibial Slope Is Associated with Isolated Meniscal Injuries: A Case-Control Study
by Kai von Schwarzenberg, Tamara Babasiz, Jan P. Hockmann, Peer Eysel and Jörgen Hoffmann
Medicina 2025, 61(8), 1368; https://doi.org/10.3390/medicina61081368 - 29 Jul 2025
Viewed by 177
Abstract
Background and Objectives: The relationship between posterior tibial slope (PTS) and isolated meniscal injuries remains a topic of debate. This study aimed to investigate whether an increased PTS was associated with a higher risk of isolated meniscal tears, using a case-control design with propensity score matching to minimize confounding factors. Materials and Methods: A retrospective case-control study was conducted at a University Hospital. A total of 294 patients who underwent arthroscopic surgery for meniscal injuries were compared to a matched control group without documented knee pathology. Two independent observers measured PTS on standardized lateral knee radiographs and assessed inter- and intra-rater reliability. Propensity score matching was performed to control for potential confounders. Statistical analysis included logistic regression to evaluate the association between PTS and isolated meniscal injuries. Results: A significantly increased mean PTS was observed in patients with isolated meniscal injuries compared to controls (p < 0.05). However, PTS was not significantly associated with the specific location of meniscal tears. Inter- and intra-rater reliability for PTS measurements was excellent (intraclass correlation coefficient > 0.75). Conclusions: An increased posterior tibial slope was associated with a higher risk of meniscal injury, even in the absence of ACL rupture. However, no significant association was found between PTS and specific tear patterns or locations. These findings support the role of posterior tibial slope as an independent anatomical risk factor for meniscal damage and underscore the importance of its early identification in clinical risk assessment and prevention strategies. Full article
(This article belongs to the Special Issue Sports Injuries: Prevention, Treatment and Rehabilitation)
17 pages, 280 KiB  
Article
Reliability and Validity of the Lowenstein Communication Scale
by Anna Oksamitni, Hiela Lehrer, Ilana Gelernter, Michal Scharf, Lilach Front, Olga Bendit-Goldenberg, Amiram Catz and Elena Aidinoff
Neurol. Int. 2025, 17(8), 116; https://doi.org/10.3390/neurolint17080116 - 29 Jul 2025
Viewed by 115
Abstract
Background/Objectives: The Lowenstein Communication Scale (LCS) is a tool for the evaluation of communicative performance in patients with disorders of consciousness (DOC). This study investigated the reliability and validity of the LCS. Methods: We evaluated 23 inpatients with unresponsive wakefulness syndrome (UWS) and 18 in a minimally conscious state (MCS), at admission to a Consciousness Rehabilitation Department and one month later. The evaluations included assessments of LCS by two raters, and of the Coma Recovery Scale–Revised (CRS-R) by one rater. Results: Total inter-rater agreement in LCS task scoring was found in 58–100% of the patients. Cohen’s kappa values were >0.6 for most tasks. High correlations were found between the two raters on total scores and most subscales (r = 0.599–1.000, p < 0.001), and the differences between them were small. LCS subscales and total score intraclass correlations (ICC) were high. Internal consistency was acceptable (Cronbach’s α > 0.7) for most LCS subscales and total scores. Moderate to strong correlations were found between LCS and CRS-R scores (r = 0.554–0.949, p < 0.05), and the difference in responsiveness between LCS and CRS-R was non-significant. Conclusions: The findings indicate that the LCS is reliable and valid, making it a valuable clinical and research assessment tool for patients with DOC. Full article
(This article belongs to the Section Brain Tumor and Brain Injury)
13 pages, 1022 KiB  
Article
Dual-Layer Spectral CT with Electron Density in Bone Marrow Edema Diagnosis: A Valid Alternative to MRI?
by Filippo Piacentino, Federico Fontana, Cecilia Beltramini, Andrea Coppola, Daniele Mesiano, Gloria Venturini, Chiara Recaldini, Roberto Minici, Anna Maria Ierardi, Velio Ascenti, Simone Barbera, Fabio D’Angelo, Domenico Laganà, Gianpaolo Carrafiello, Giorgio Ascenti and Massimo Venturini
J. Clin. Med. 2025, 14(15), 5319; https://doi.org/10.3390/jcm14155319 - 28 Jul 2025
Viewed by 228
Abstract
Background/Objectives: Although MRI with fat-suppression sequences is the gold standard for diagnosis of bone marrow edema (BME), Dual-Layer Spectral CT (DL-SCT) with electron density (ED) provides a viable alternative, particularly in situations where an MRI is not accessible. Using MRI as the reference standard, this study analyzed how DL-SCT with ED reconstructions may be a valid alternative in the detection of BME. Methods: This retrospective study included 28 patients with a suspected diagnosis of BME via MRI conducted between March and September 2024. Patients underwent DL-SCT using ED reconstructions obtained through IntelliSpace software v. 12.1. Images were evaluated by two experienced radiologists and one young radiologist in a blinded way, giving a grade from 0 to 3 to classify BME (0 absence; 1 mild; 2 moderate; 3 severe). To reduce the recall bias effect, the order of image evaluations was set differently for each reader. p-Values were considered significant when <0.05. Fleiss’ Kappa was used to assess inter-rater reliability: agreement was considered poor for k < 0; slight for k 0.01–0.20; fair for 0.21–0.40; moderate for 0.41–0.60; substantial for 0.61–0.80; and almost perfect for 0.81–1.00. Results: All the readers detected the presence or absence of BME using DL-SCT. Inter-rater reliability for grade 0 resulted in 1 (p-value < 0.001); for grade 1: 0.21 (p-value < 0.001); for grade 2: 0.197 (p-value < 0.001); and for grade 3: 0.515 (p-value < 0.001). Conclusions: ED reconstructions allowed the identification of BME presence or absence in all analyzed cases, thus suggesting DL-SCT as a potentially effective method for its detection. Full article
(This article belongs to the Section Nuclear Medicine & Radiology)
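The agreement bands listed in the abstract above can be written as a small lookup. This is only a sketch: the abstract leaves small gaps between bands (for example between 0 and 0.01, and between 0.20 and 0.21), so treating each upper bound as inclusive is an assumption made here, and the function name is hypothetical.

```python
def agreement_label(kappa):
    """Map a kappa value to the agreement band quoted in the abstract."""
    if kappa > 1.0:
        raise ValueError("kappa cannot exceed 1")
    if kappa < 0:
        return "poor"
    # Upper bounds are treated as inclusive (an assumption; see lead-in).
    bands = [
        (0.20, "slight"),
        (0.40, "fair"),
        (0.60, "moderate"),
        (0.80, "substantial"),
        (1.00, "almost perfect"),
    ]
    for upper, label in bands:
        if kappa <= upper:
            return label

# Per-grade Fleiss' kappa values reported in the abstract.
reported = {0: 1.0, 1: 0.21, 2: 0.197, 3: 0.515}
for grade, k in reported.items():
    print(grade, k, agreement_label(k))
```

Read this way, the readers agreed almost perfectly on the absence of edema (grade 0), but only slightly to moderately on the severity grades 1 to 3.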
20 pages, 1273 KiB  
Article
Safety and Anatomical Accuracy of Dry Needling of the Quadratus Femoris Muscle: A Cadaveric Study
by Marta Sánchez-Montoya, Jaime Almazán-Polo, Néstor Vallecillo Hernández, Charles Cotteret, Fabien Guerineau, Domingo de Guzman Monreal-Redondo and Ángel González-de-la-Flor
Healthcare 2025, 13(15), 1828; https://doi.org/10.3390/healthcare13151828 - 26 Jul 2025
Viewed by 227
Abstract
Introduction: Deep dry needling (DDN) is commonly applied in physiotherapy to treat musculoskeletal pain. The quadratus femoris (QF) muscle, located in the ischiofemoral space (IFS), represents a clinically relevant yet anatomically complex target. However, limited evidence exists on the safety, accuracy, and reliability of non-ultrasound-guided DDN in this region. Aims: To assess the safety and accuracy of a standardized, non-ultrasound-guided DDN approach to the QF muscle, and to evaluate the intra- and inter-rater reliability of key procedural outcomes. Additionally, to determine the agreement between ultrasound imaging and anatomical dissection as validation methods for needle placement. Methods: An experimental cross-sectional study was conducted on five fresh cadavers (n = 24 approaches) by two physiotherapists with different DN experience. A standardized dry needling protocol was executed without ultrasound guidance, and anatomical and procedural variables were documented. Reliability (intra/inter-rater) was assessed for needle size, sciatic nerve (SN) puncture, IFS targeting, and overall success. In a subset, needle placement was validated through ultrasound and subsequent dissection. Results: The IFS was reached in 70.8% of procedures, and the SN was punctured in 16.7%. Inter-rater reliability for needle size was poor (κ = 0.04). Agreement between ultrasound and dissection was excellent for the ischiofemoral location and success (100%) and moderate for non SN puncture (90%; κ = 0.62). Conclusions: The standardized protocol demonstrated moderate accuracy and revealed a relevant clinical risk when targeting the quadratus femoris muscle. While inter-rater reliability was limited, agreement between ultrasound and dissection methods was high, supporting their complementary use for validating needle placement in anatomically complex procedures. Full article
28 pages, 4702 KiB  
Article
Clinical Failure of General-Purpose AI in Photographic Scoliosis Assessment: A Diagnostic Accuracy Study
by Cemre Aydin, Ozden Bedre Duygu, Asli Beril Karakas, Eda Er, Gokhan Gokmen, Anil Murat Ozturk and Figen Govsa
Medicina 2025, 61(8), 1342; https://doi.org/10.3390/medicina61081342 - 25 Jul 2025
Viewed by 303
Abstract
Background and Objectives: General-purpose multimodal large language models (LLMs) are increasingly used for medical image interpretation despite lacking clinical validation. This study evaluates the diagnostic reliability of ChatGPT-4o and Claude 2 in photographic assessment of adolescent idiopathic scoliosis (AIS) against radiological standards. This study examines two critical questions: whether families can derive reliable preliminary assessments from LLMs through analysis of clinical photographs and whether LLMs exhibit cognitive fidelity in their visuospatial reasoning capabilities for AIS assessment. Materials and Methods: A prospective diagnostic accuracy study (STARD-compliant) analyzed 97 adolescents (74 with AIS and 23 with postural asymmetry). Standardized clinical photographs (nine views/patient) were assessed by two LLMs and two orthopedic residents against reference radiological measurements. Primary outcomes included diagnostic accuracy (sensitivity/specificity), Cobb angle concordance (Lin’s CCC), inter-rater reliability (Cohen’s κ), and measurement agreement (Bland–Altman LoA). Results: The LLMs exhibited hazardous diagnostic inaccuracy: ChatGPT misclassified all non-AIS cases (specificity 0% [95% CI: 0.0–14.8]), while Claude 2 generated 78.3% false positives. Systematic measurement errors exceeded clinical tolerance: ChatGPT overestimated thoracic curves by +10.74° (LoA: −21.45° to +42.92°), exceeding tolerance by >800%. Both LLMs showed inverse biomechanical concordance in thoracolumbar curves (CCC ≤ −0.106). Inter-rater reliability fell below random chance (ChatGPT κ = −0.039). Universal proportional bias (slopes ≈ −1.0) caused severe curve underestimation (e.g., 10–15° error for 50° deformities). Human evaluators demonstrated superior bias control (0.3–2.8° vs. 2.6–10.7°) but suboptimal specificity (21.7–26.1%) and hazardous lumbar concordance (CCC: −0.123). 
Conclusions: General-purpose LLMs demonstrate clinically unacceptable inaccuracy in photographic AIS assessment, contraindicating clinical deployment. Catastrophic false positives, systematic measurement errors exceeding tolerance by 480–1074%, and inverse diagnostic concordance necessitate urgent regulatory safeguards under frameworks like the EU AI Act. Neither LLMs nor photographic human assessment achieve reliability thresholds for standalone screening, mandating domain-specific algorithm development and integration of 3D modalities. Full article
(This article belongs to the Special Issue Diagnosis and Treatment of Adolescent Idiopathic Scoliosis)
15 pages, 3892 KiB  
Article
Zero and Ultra-Short Echo Time Sequences at 3-Tesla Can Accurately Depicts the Normal Anatomy of the Human Achilles Tendon Enthesis Organ In Vivo
by Amandine Crombé, Benjamin Dallaudière, Marie-Camille Bohand, Claire Fournier, Paolo Spinnato, Nicolas Poursac, Michael Carl, Julie Poujol and Olivier Hauger
J. Clin. Med. 2025, 14(15), 5251; https://doi.org/10.3390/jcm14155251 - 24 Jul 2025
Viewed by 222
Abstract
Background/Objectives: Accurate visualization of the Achilles tendon enthesis is critical for distinguishing mechanical, degenerative, and inflammatory pathologies. Although ultrasonography is the first-line modality for suspected enthesis disease, recent technical advances may expand the role of magnetic resonance imaging (MRI). This study evaluated the utility of ultra-short echo time (UTE) and zero echo time (ZTE) sequences versus proton density-weighted imaging (PD-WI) for depicting the enthesis organ in healthy volunteers. Methods: In this institutional review board (IRB)-approved prospective single-center study, 50 asymptomatic adult volunteers underwent 3-Tesla hindfoot MRI with fat-suppressed PD-WI, UTE, and ZTE between 2018 and 2023. Four radiologists assessed image quality, signal-to-noise ratio, visibility, and abnormal high signal intensities (SIs) of the periost, sesamoid, and enthesis fibrocartilages (PCa, SCa, and ECa, respectively). Statistical analysis used Chi-square, McNemar, and paired Wilcoxon tests, with Benjamini–Hochberg adjustment for multiple comparisons. Results: The median age was 36 years (range: 20–51), and 58% of participants were women. PD-WI and ZTE sequences were always available while UTE was unavailable in 24% of patients. PD-WI consistently failed to concomitantly visualize all fibrocartilages. ZTE and UTE visualized all fibrocartilages in 72% and 92.1% of volunteers, respectively, with significant differences favoring ZTE and UTE over PD-WI (p < 0.0001) and UTE over ZTE (p = 0.027). Inter-rater agreement exceeded 80% except for SCa on ZTE (68%, 95% CI: 53.2–80.1). Abnormal SCa findings in asymptomatic patients were more frequent with UTE (23.7%) and ZTE (34%) than with PD-WI (2%) (p = 0.0045). Conclusions: At 3-Tesla, UTE and ZTE sequences reliably depict the enthesis organ of the Achilles tendon, outperforming PD-WI. However, the high sensitivity of these sequences also presents challenges in interpretation. Full article
14 pages, 320 KiB  
Article
Evaluating Large Language Models in Cardiology: A Comparative Study of ChatGPT, Claude, and Gemini
by Michele Danilo Pierri, Michele Galeazzi, Simone D’Alessio, Melissa Dottori, Irene Capodaglio, Christian Corinaldesi, Marco Marini and Marco Di Eusanio
Hearts 2025, 6(3), 19; https://doi.org/10.3390/hearts6030019 - 19 Jul 2025
Viewed by 452
Abstract
Background: Large Language Models (LLMs) such as ChatGPT, Claude, and Gemini are being increasingly adopted in medicine; however, their reliability in cardiology remains underexplored. Purpose of the study: To compare the performance of three general-purpose LLMs in response to cardiology-related clinical queries. Study design: Seventy clinical prompts stratified by diagnostic phase (pre or post) and user profile (patient or physician) were submitted to ChatGPT, Claude, and Gemini. Three expert cardiologists, who were blinded to the model’s identity, rated each response on scientific accuracy, completeness, clarity, and coherence using a 5-point Likert scale. Statistical analysis included Kruskal–Wallis tests, Dunn’s post hoc comparisons, Kendall’s W, weighted kappa, and sensitivity analyses. Results: ChatGPT outperformed both Claude and Gemini across all criteria (mean scores: 3.7–4.2 vs. 3.4–4.0 and 2.9–3.7, respectively; p < 0.001). The inter-rater agreement was substantial (Kendall’s W: 0.61–0.71). Pre-diagnostic and patient-framed prompts received higher scores than post-diagnostic and physician-framed ones. Results remained robust across sensitivity analyses. Conclusions: Among the evaluated LLMs, ChatGPT demonstrated superior performance in generating clinically relevant cardiology responses. However, none of the models achieved maximal ratings, and the performance varied by context. These findings highlight the need for domain-specific fine-tuning and human oversight to ensure a safe clinical deployment. Full article
13 pages, 1056 KiB  
Article
Diagnostic Accuracy and Interrater Agreement of FDG-PET/CT Lymph Node Staging in High-Risk Endometrial Cancer: The SENTIREC-Endo Study
by Jorun Holm, André Henrique Dias, Oke Gerke, Annika Loft, Kirsten Bouchelouche, Mie Holm Vilstrup, Sarah Marie Bjørnholt, Sara Elisabeth Sponholtz, Kirsten Marie Jochumsen, Malene Grubbe Hildebrandt and Pernille Tine Jensen
Cancers 2025, 17(14), 2396; https://doi.org/10.3390/cancers17142396 - 19 Jul 2025
Viewed by 353
Abstract
Background/Objectives: The SENTIREC-endo study identified a safe sentinel lymph node mapping algorithm combined with PET-positive node dissection, matching radical pelvic and paraaortic lymphadenectomy in high-risk endometrial cancer. The present study evaluated the diagnostic accuracy of FDG-PET/CT for lymph node metastases in the same population based on location, size, and Standardised Uptake Value (SUV), in addition to assessing interrater agreement across three Danish centres. Methods: This prospective multicentre study included women with high-risk endometrial cancer from the Danish SENTIREC study database (2017–2023). All patients underwent preoperative FDG-PET/CT. Diagnostic accuracy was evaluated against a pathology-confirmed reference standard. Interrater agreement was evaluated between trained specialists in Nuclear Medicine. Results: Among 227 patients, 52 patients (23%) had lymph node metastases. FDG-PET/CT identified lymph node metastases with 56% sensitivity (95% CI: 42–68) and 91% specificity (95% CI: 86–94). Positive and negative predictive values were 64% and 87%, respectively. Specificity for paraaortic nodes was high (97%), though sensitivity remained limited (56%). Lymph node size and SUVmax had moderate diagnostic value (AUC-ROC ~0.7). Interrater proportion of agreement was 95% and Cohen’s Kappa κ = 0.84 (95% CI: 0.73–0.94), the latter of which was ‘almost perfect’. Conclusions: FDG-PET/CT had limited sensitivity in lymph node staging in high-risk EC, and the diagnostic accuracy of FDG-PET/CT remains complementary to the sentinel node procedure. Due to its high specificity and strong interrater reliability, FDG-PET/CT is recommended for clinical implementation in combination with the sensitive sentinel node biopsy for the targeted dissection of PET-positive lymph nodes, particularly in paraaortic regions. Full article
(This article belongs to the Special Issue Lymph Node Dissection for Gynecologic Cancers)
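The four accuracy measures quoted in the abstract above all derive from a single 2×2 table. The sketch below shows the arithmetic. The cell counts are not reported in the abstract; they are hypothetical values back-calculated here so that they sum to 227 patients with 52 metastatic cases and round to the published percentages, and should not be read as the study's actual data.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Standard accuracy measures from the cells of a 2x2 table."""
    return {
        "sensitivity": tp / (tp + fn),  # positives correctly detected
        "specificity": tn / (tn + fp),  # negatives correctly cleared
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
    }

# Hypothetical counts (assumption, see lead-in): 29 + 23 = 52 with metastases,
# 16 + 159 = 175 without, 227 patients in total.
m = diagnostic_metrics(tp=29, fp=16, fn=23, tn=159)
for name, value in m.items():
    print(f"{name}: {value:.1%}")
```

Sensitivity and specificity describe the test itself, whereas the predictive values also depend on prevalence (here 23%), so the latter would shift in a population with a different metastasis rate.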
18 pages, 6084 KiB  
Article
Diagnostic Accuracy and Agreement Between AI and Clinicians in Orthodontic 3D Model Analysis
by Sabahattin Bor, Fırat Oğuz and Ayla Khanmohammadi
Appl. Sci. 2025, 15(14), 7786; https://doi.org/10.3390/app15147786 - 11 Jul 2025
Viewed by 421
Abstract
Background: Artificial intelligence (AI) is increasingly integrated into orthodontic workflows, including digital model analysis modules embedded in orthodontic software. While these systems offer efficiency and automation, the accuracy and clinical reliability of AI-generated measurements and diagnostic assessments remain unclear. Therefore, to use AI systems safely and effectively in clinical orthodontics, it is important to check their results by comparing them with those of experienced orthodontists. Methods: Digital models of 48 patients were analyzed by the Orthodontist group and two AI platforms: Titan (full) and SoftSmile (Bolton only). Three orthodontists independently measured all variables using 3Shape OrthoAnalyzer, and group means were used for comparison. A subset of models was reanalyzed after two weeks to assess consistency. Data distribution was evaluated, and appropriate statistical tests were applied. Reliability was assessed using intraclass correlation coefficients (ICC) and Cohen’s kappa. Results: Almost perfect agreement was observed between the orthodontists and Titan AI in molar classification (κ = 0.955 right, κ = 0.900 left; p < 0.001), with perfect agreement reported across all groups—including between the orthodontists themselves—for Angle classification (κ = 1.00). In anterior and overall Bolton analyses, no meaningful agreement was found between the orthodontists and AI platforms. However, in a subset of patients where all three methods identified the tooth size discrepancy in the same arch (either maxilla or mandible), no significant differences were found in anterior (p = 0.226) or overall Bolton values (p = 0.795). Overjet, overbite, and space analysis values showed significant differences between the orthodontist and Titan groups (p < 0.001). 
ICC analysis indicated good to excellent intra- and inter-rater reliability within the orthodontist group (≥0.77), while both AI systems demonstrated excellent internal consistency, with ICC values exceeding 0.95. Conclusions: AI-based platforms showed high agreement with orthodontists only in Angle classification. While their performance in Bolton analysis was limited, significant differences were observed in other linear measurements, indicating the need for further refinement before clinical use. Full article
10 pages, 592 KiB  
Article
Assessing the Accuracy and Reliability of the Monitored Augmented Rehabilitation System for Measuring Shoulder and Elbow Range of Motion
by Samuel T. Lauman, Lindsey J. Patton, Pauline Chen, Shreya Ravi, Stephen J. Kimatian and Sarah E. Rebstock
Sensors 2025, 25(14), 4269; https://doi.org/10.3390/s25144269 - 9 Jul 2025
Viewed by 273
Abstract
Accurate range of motion (ROM) assessment is essential for evaluating musculoskeletal function and guiding rehabilitation, particularly in pediatric populations. Traditional methods, such as optical motion capture and handheld goniometry, are often limited by cost, accessibility, and inter-rater variability. This study evaluated the feasibility and accuracy of the Microsoft Azure Kinect-powered Monitored Augmented Rehabilitation System (MARS) compared to Kinovea. Sixty-five pediatric participants (ages 5–18) performed standardized shoulder and elbow movements in the frontal and sagittal planes. ROM data were recorded using MARS and compared to Kinovea. Measurement reliability was evaluated using intraclass correlation coefficients (ICC3k), and accuracy was evaluated using root mean squared error (RMSE) analysis. MARS demonstrated excellent reliability with an average ICC3k of 0.993 and met the predefined accuracy threshold (RMSE ≤ 8°) for most movements, with the exception of sagittal elbow flexion. These findings suggest that MARS is a reliable, accurate, and cost-effective alternative for clinical ROM assessment, offering a markerless solution that enhances measurement precision and accessibility in pediatric rehabilitation. Future studies should enhance accuracy in sagittal plane movements and further validate MARS against gold-standard systems. Full article
(This article belongs to the Section Sensing and Imaging)
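
As an illustration of the two statistics named in this abstract, the sketch below computes RMSE between paired measurements and ICC(3,k) (two-way mixed effects, consistency, average of k measurements) from the usual ANOVA mean squares. It is a minimal sketch, not the authors' analysis code: the function names `rmse` and `icc3k` and the sample angles are hypothetical.

```python
import math


def rmse(a, b):
    """Root mean squared error between two paired measurement lists."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of equal length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def icc3k(ratings):
    """ICC(3,k) = (MS_rows - MS_error) / MS_rows.

    `ratings` holds one row per subject and one column per method or rater.
    """
    n, k = len(ratings), len(ratings[0])
    if n < 2 or k < 2:
        raise ValueError("need at least 2 subjects and 2 measurements each")
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    # Two-way ANOVA sums of squares without replication.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    if ms_rows == 0:
        raise ValueError("no between-subject variance; ICC is undefined")
    return (ms_rows - ms_err) / ms_rows


if __name__ == "__main__":
    # Hypothetical joint angles in degrees: column 0 = system A, column 1 = system B.
    angles = [[150.0, 148.0], [90.0, 93.0], [120.0, 121.0], [60.0, 57.0]]
    print("RMSE:", rmse([r[0] for r in angles], [r[1] for r in angles]))
    print("ICC(3,k):", icc3k(angles))
```

Because ICC(3,k) is a consistency measure, a constant offset between two systems does not lower it, which is one reason an error measure such as RMSE is reported alongside it.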

14 pages, 751 KiB  
Article
Comparison of Validity and Reliability of Manual Consensus Grading vs. Automated AI Grading for Diabetic Retinopathy Screening in Oslo, Norway: A Cross-Sectional Pilot Study
by Mia Karabeg, Goran Petrovski, Katrine Holen, Ellen Steffensen Sauesund, Dag Sigurd Fosmark, Greg Russell, Maja Gran Erke, Vallo Volke, Vidas Raudonis, Rasa Verkauskiene, Jelizaveta Sokolovska, Morten Carstens Moe, Inga-Britt Kjellevold Haugen and Beata Eva Petrovski
J. Clin. Med. 2025, 14(13), 4810; https://doi.org/10.3390/jcm14134810 - 7 Jul 2025
Viewed by 537
Abstract
Background: Diabetic retinopathy (DR) is a leading cause of visual impairment worldwide. Manual grading of fundus images is the gold standard in DR screening, although it is time-consuming. Artificial intelligence (AI)-based algorithms offer a faster alternative, though concerns remain about their diagnostic reliability. Methods: A cross-sectional pilot study among patients (≥18 years) with diabetes was established for DR and diabetic macular edema (DME) screening at the Oslo University Hospital (OUH), Department of Ophthalmology, and the Norwegian Association of the Blind and Partially Sighted (NABP). The aim of the study was to evaluate the validity (accuracy, sensitivity, specificity) and reliability (inter-rater agreement) of automated AI-based grading compared to manual consensus (MC) grading of DR and DME, performed by a multidisciplinary team of healthcare professionals. Grading of DR and DME was performed manually and by EyeArt (Eyenuk) software version 2.1.0, based on the International Clinical Disease Severity Scale (ICDR) for DR. Agreement was measured by Quadratic Weighted Kappa (QWK) and Cohen’s Kappa (κ). Sensitivity, specificity, and diagnostic test accuracy (Area Under the Curve (AUC)) were also calculated. Results: A total of 128 individuals (247 eyes) (51 women, 77 men) were included, with a median age of 52.5 years. Prevalence of any vs. referable DR (RDR) was 20.2% vs. 11.7%, while sensitivity was 94.0% vs. 89.7%, specificity was 72.6% vs. 83.0%, and AUC was 83.5% vs. 86.3%, respectively. DME was detected only in one eye by both methods. Conclusions: AI-based grading offered high sensitivity and acceptable specificity for detecting DR, showing moderate agreement with manual assessments. Such grading may serve as an effective screening tool to support clinical evaluation, while ongoing training of human graders remains essential to ensure high-quality reference standards for diagnostic accuracy and the development of AI algorithms. Full article
(This article belongs to the Special Issue Artificial Intelligence and Eye Disease)
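
The two agreement statistics named in this abstract, Cohen's kappa and quadratic weighted kappa, differ only in how disagreements are weighted. The sketch below is a generic illustration under that standard definition, not the study's analysis code: the function name `kappa` and the sample severity grades are hypothetical.

```python
def kappa(r1, r2, categories, quadratic=False):
    """Cohen's kappa, or quadratic weighted kappa when quadratic=True.

    `categories` lists the ordered grades; `r1` and `r2` are paired ratings.
    """
    if len(r1) != len(r2) or not r1:
        raise ValueError("ratings must be non-empty and of equal length")
    k = len(categories)
    if k < 2:
        raise ValueError("need at least two categories")
    n = len(r1)
    idx = {c: i for i, c in enumerate(categories)}

    # Observed joint proportions for each (rater 1, rater 2) grade pair.
    obs = [[0.0] * k for _ in range(k)]
    for a, b in zip(r1, r2):
        obs[idx[a]][idx[b]] += 1 / n
    row = [sum(obs[i]) for i in range(k)]
    col = [sum(obs[i][j] for i in range(k)) for j in range(k)]

    def weight(i, j):
        # Disagreement weight: 0 on the diagonal, larger for distant grades.
        if quadratic:
            return (i - j) ** 2 / (k - 1) ** 2
        return 0.0 if i == j else 1.0

    observed = sum(weight(i, j) * obs[i][j] for i in range(k) for j in range(k))
    expected = sum(weight(i, j) * row[i] * col[j] for i in range(k) for j in range(k))
    if expected == 0:
        raise ValueError("kappa is undefined when expected disagreement is zero")
    return 1.0 - observed / expected


if __name__ == "__main__":
    # Hypothetical severity grades (0 = none ... 4 = most severe) from two graders.
    manual = [0, 0, 1, 2, 2, 3, 4, 0, 1, 2]
    automated = [0, 1, 1, 2, 3, 3, 4, 0, 2, 2]
    print("Cohen's kappa:", kappa(manual, automated, range(5)))
    print("Quadratic weighted kappa:", kappa(manual, automated, range(5), quadratic=True))
```

With ordinal severity grades, the quadratic weighting penalizes a two-step disagreement four times as heavily as a one-step disagreement, whereas unweighted kappa treats all disagreements alike.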

28 pages, 1987 KiB  
Article
LLM-as-a-Judge Approaches as Proxies for Mathematical Coherence in Narrative Extraction
by Brian Keith
Electronics 2025, 14(13), 2735; https://doi.org/10.3390/electronics14132735 - 7 Jul 2025
Viewed by 556
Abstract
Evaluating the coherence of narrative sequences extracted from large document collections is crucial for applications in information retrieval and knowledge discovery. While mathematical coherence metrics based on embedding similarities provide objective measures, they require substantial computational resources and domain expertise to interpret. We propose using large language models (LLMs) as judges to evaluate narrative coherence, demonstrating that their assessments correlate with mathematical coherence metrics. Through experiments on two data sets—news articles about Cuban protests and scientific papers from visualization conferences—we show that the LLM judges achieve Pearson correlations up to 0.65 with mathematical coherence while maintaining high inter-rater reliability (ICC > 0.92). The simplest evaluation approach achieves performance comparable to the more complex approaches, even outperforming them on focused data sets and achieving over 90% of their performance on the more diverse data sets, while using fewer computational resources. Our findings indicate that LLM-as-a-judge approaches are effective as a proxy for mathematical coherence in the context of narrative extraction evaluation. Full article

30 pages, 787 KiB  
Systematic Review
Success Factors in Transport Interventions: A Mixed-Method Systematic Review (1990–2022)
by Pierré Esser, Shehani Pigera, Miglena Campbell, Paul van Schaik and Tracey Crosbie
Future Transp. 2025, 5(3), 82; https://doi.org/10.3390/futuretransp5030082 - 1 Jul 2025
Viewed by 286
Abstract
This study is titled “Success Factors in Transport Interventions: A Mixed-Method Systematic Review (1990–2022)”. The purpose of the systematic review is to (1) identify effective interventions for transitioning individuals from private car reliance to sustainable transport, (2) summarise psychosocial theories shaping transportation choices and identify enablers and barriers influencing sustainable mode adoption, and (3) determine the success factors for interventions promoting sustainable transport choices. The last search was conducted on 18 November 2022. Five databases (Scopus, Web of Science, MEDLINE, APA PsycInfo, and ProQuest) were searched using customised Boolean search strings. The identified papers were included or excluded based on the following criteria: (a) reported a modal shift from car users or cars to less CO2-emitting modes of transport, (b) covered the adoption of low-carbon transport alternatives, (c) comprised interventions to promote sustainable transport, (d) assessed or measured the effectiveness of interventions, or (e) proposed behavioural models related to mode choice and/or psychosocial barriers or drivers for car/no-car use. The identified papers eligible for inclusion were critically appraised using Sirriyeh’s Quality Assessment Tool for Studies with Diverse Designs. Inter-rater reliability was assessed using Cohen’s Kappa to evaluate the risk of bias throughout the review process, and low-quality studies identified by the quality assessment were excluded to prevent sample bias. Qualitative data were extracted in a contextually relevant manner, preserving context and meaning to avoid author bias and misinterpretation. Data were extracted using a form derived from the Joanna Briggs Institute. Data transformation and synthesis followed the recommendations of the Joanna Briggs Institute for mixed-method systematic reviews using a convergent integrated approach.
Of the 7999 studies, 4 qualitative, 2 mixed-method, and 30 quantitative studies successfully passed all three screening cycles and were included in the review. Many of these studies focused on modelling individuals’ mode choice decisions from a psychological perspective. In contrast, case studies explored various transport interventions to enhance sustainability in densely populated areas. Nevertheless, the current systematic reviews do not show how individuals’ inner dispositions, such as acceptance, intention, or attitude, have evolved from before to after the implementation of schemes. Of the 11 integrated findings, 9 concerned enablers and barriers to an individual’s sustainable mode choice behaviour. In addition, two integrated findings emerged based on the effectiveness of the interventions. Although numerous interventions target public acceptance of sustainable transport, this systematic review reveals a critical knowledge gap regarding their longitudinal impact on individuals and effectiveness in influencing behavioural change. However, the study may be affected by language bias as it only included peer-reviewed articles published in English. Due to methodological heterogeneity across the studies, a meta-analysis was not feasible. Further high-quality research is needed to strengthen the evidence. This systematic review is self-funded and has been registered on the International Platform of Registered Systematic Review and Meta-analysis Protocols (INPLASY; Registration Number INPLASY202420011). Full article