Hybrid Approach to Patient Review Classification at Scale: From Expert Annotations to Production-Ready Machine Learning Models for Sustainable Healthcare
Abstract
1. Introduction
1.1. Context and Significance of the Study
1.2. Literature Review and Previous Work
- The connection between findings obtained through NLP and traditional healthcare quality indicators remains limited.
- There is little evidence that these approaches are actually used in clinical practice to influence healthcare worker behavior or managerial decision-making [61].
1.3. Limitations of Previous Work and Bridge to the Present Study
1.4. The Present Study: Aim and Objectives
- Develop a hybrid method for creating an annotated corpus, combining expert validation and Large Language Models (LLMs) with iterative prompt engineering, enabling annotation scaling while maintaining quality comparable to expert annotation.
- Introduce and justify practical metrics (Practical Accuracy, Practical F1) that account for the hierarchical structure of errors and reflect the real business value of models in the medical context.
- Conduct a comparative analysis of eight machine learning architectures (from Logistic Regression to a Stacking Ensemble based on BERT, TF-IDF, and LightGBM) on a balanced dataset of 22,417 reviews.
- Scale the best model to a corpus of 4.3 million real reviews from the Prodoctorov.ru platform, with an assessment of the resulting class distribution and a calculation of the business effect of implementation.
- Extract substantive insights from the analysis of this multi-million-review corpus, including the relationship between review type and rating/length, temporal dynamics, geographical distribution normalized by population, and analysis by medical specialty.
1.5. Scientific Novelty and Contribution of the Work
- Methodological contribution: A hybrid “expert + LLM” approach is proposed and validated for creating large annotated corpora, using iterative prompt engineering (15 versions) and a modified Cohen’s kappa with hierarchical error weighting. It is shown that an LLM can achieve a level of agreement with experts (κ_mod = 0.745) comparable to inter-expert agreement, allowing it to be used to scale annotation.
- Metric contribution: Practical metrics (Practical Accuracy, Practical F1) are developed and justified as an application-specific adaptation of cost-sensitive evaluation, accounting for the different costs of errors in the medical context and allowing for the evaluation of models from the perspective of their real business value, responding to the challenge formulated in [61].
- Empirical contribution: The most comprehensive comparison to date of eight architectures (from simple linear models to complex ensembles) on a unified balanced dataset with evaluation by both standard and practical metrics is conducted.
- Applied contribution: Successful scaling of a patient review classification model to an array of 4.3 million records is demonstrated for the first time, with a quantitative assessment of business effect, directly addressing the gap identified in the literature about the lack of evidence for implementing NLP approaches in real practice [61].
- Analytical contribution: Based on the analysis of the multi-million-review corpus, stable patterns are identified that have independent value for understanding patient behavior and satisfaction dynamics.
1.6. Structure of the Article
2. Materials and Methods
2.1. Data Sources and Sampling Principles
- Collection period: 2011–2025 (14 years);
- Geographic coverage: 22 Russian cities;
- Number of medical specialties: 224;
- Rating range: 0 to 5 stars;
- Review length: 3 to 10,792 characters.
2.2. Development and Validation of the LLM-Based Classification System
2.3. Formation of the Final Training Dataset
- M: 7662 reviews (34.2%);
- O: 7839 reviews (35.0%);
- C: 6916 reviews (30.8%).
2.4. Machine Learning Models and Experimental Setup
3. Results
3.1. Results of LLM Classification Validation
3.2. Comparative Analysis of Models on the Combined Sample
3.3. Scaling on Prodoctorov.ru Data (4,340,691 Reviews)
3.4. Substantive Insights from the Analysis of 4 Million Reviews
4. Discussion
4.1. Interpretation of Main Results
4.2. Comparison with Previous Studies
4.3. Practical Significance and Business Value
4.4. Limitations of the Study
4.5. Directions for Future Research
5. Conclusions
5.1. Methodological Conclusions
- 1. The hybrid “expert + LLM” approach enables the creation of large, annotated corpora with quality comparable to expert annotation. We developed and validated an iterative prompt engineering method (15 versions), through which a Large Language Model learned to classify reviews based on expert feedback. Modified Cohen’s kappa with hierarchical error weighting (critical M ↔ O vs. acceptable involving C) was κ_mod = 0.745, nearly reaching the established validation threshold of 0.75 and comparable to inter-expert agreement (κ_mod = 0.740); a computational sketch of this weighted kappa is given after this list. The proportion of critical LLM errors was 5.3%, corresponding to a Practical Accuracy of 94.7%. This result directly addresses the need for scalable patient feedback analysis methods that maintain expert-level quality, as highlighted in recent reviews [46,61].
- 2. Three-class classification (M, O, and C) is necessary to accurately reflect the real structure of patient reviews. Analysis of 4.3 million reviews showed that combined reviews (class C) constitute 32.9% of all reviews and have independent value. They are 2.5 times longer than organizational reviews (802 vs. 320 characters) and contain the most complete information about patients’ experiences. Ignoring class C leads to the loss of one-third of the available information and, as shown by correlation analysis (r = 0.914), to systematic bias in model quality estimates.
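The modified kappa above is an instance of Cohen’s weighted kappa, κ_w = 1 − Σ w_ij·O_ij / Σ w_ij·E_ij, where O is the observed joint-label matrix, E the chance-expected matrix, and w a disagreement-penalty matrix. The Python sketch below shows the computation; the penalty values (1.0 for critical M ↔ O confusions, 0.5 for acceptable errors involving C) are illustrative assumptions rather than the study’s exact weights, so the output will not necessarily reproduce κ_mod = 0.745.

```python
import numpy as np

def weighted_kappa(conf: np.ndarray, weights: np.ndarray) -> float:
    """Cohen's kappa with an arbitrary disagreement-penalty matrix.

    conf[i, j]    -- items labeled i by rater A (e.g., the LLM) and j by rater B (gold).
    weights[i, j] -- penalty for an (i, j) disagreement; 0.0 on the diagonal.
    """
    observed = conf / conf.sum()                  # joint label proportions
    expected = np.outer(observed.sum(axis=1),     # chance agreement from the
                        observed.sum(axis=0))     # two raters' marginals
    return 1.0 - (weights * observed).sum() / (weights * expected).sum()

# Class order M, O, C. Hierarchical penalties (illustrative values):
# critical M <-> O confusions cost 1.0; errors involving C cost 0.5.
W = np.array([[0.0, 1.0, 0.5],
              [1.0, 0.0, 0.5],
              [0.5, 0.5, 0.0]])

# LLM-vs-gold confusion matrix from the validation table (Section 3.1).
conf = np.array([[476, 28, 67],
                 [51, 192, 164],
                 [220, 61, 241]], dtype=float)
print(f"weighted kappa = {weighted_kappa(conf, W):.3f}")
```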
5.2. Technical Conclusions
- 3. Hybrid BERT-based models significantly outperform traditional approaches in Practical Accuracy. Comparative analysis of eight architectures on a balanced dataset (22,417 reviews) revealed a consistent advantage of models using transformer semantic representations. The Stacking Ensemble achieved a Practical Accuracy of 92.9%, the BERT + TF-IDF Hybrid reached 92.7%, and the BERT + Logistic Regression attained 91.5%. The gap with traditional methods (TF-IDF + LR: 89.3%, Logistic Regression: 81.3%) reaches up to 11.6 percentage points, confirming the critical importance of accounting for semantic nuances in distinguishing medical and organizational problems.
- 4. Critical errors (M ↔ O confusion) are reduced to an acceptable level of 1.4%. The best model (Stacking Ensemble) makes 317 critical errors out of 22,417 reviews (1.4%), corresponding to approximately 14 critical errors per 1000 processed reviews; a sketch for counting such errors follows this list. For a medical organization receiving 50,000 reviews per year, this means only 700 cases per year require manual checking, compared to 50,000 without automation.
- 5. We identified and justified a trade-off between conservative and active classification strategies. We found a stable inverse relationship between Practical Accuracy and Practical F1 (ρ = −0.93). Models with high Practical Accuracy implement a conservative strategy, minimizing critical errors at the cost of less frequent use of class C. For medical organizations where the cost of incorrect routing is maximal, the conservative strategy is preferable.
- 6. The BERT + TF-IDF Hybrid model demonstrates maximum robustness to data variations (CV = 0.009). The Stacking Ensemble also showed high robustness (CV = 0.011). The low robustness of MLP (CV = 0.062) makes this model unsuitable for production deployment.
- 7. The TF-IDF + Logistic Regression model provides an optimal balance of speed and quality for resource-constrained scenarios. With a Practical Accuracy of 89.3%, this model trains in 7.1 s (35 times faster than hybrid approaches) and can run exclusively on CPU, making it suitable for real-time processing and for organizations without GPU infrastructure.
5.3. Applied Conclusions
- 8. The developed system automates up to 93% of patient feedback processing. When implementing the Stacking Ensemble in a typical medical organization (50,000 reviews per year), 46,450 reviews (92.9%) are processed automatically and directed to the correct departments, while only 700 reviews (1.4%) require manual checking due to critical errors. Operational cost savings amount to approximately 2.3 million rubles per year.
- 9. Scaling to 4.3 million Prodoctorov.ru reviews confirmed the production readiness of the approach. Application of the Stacking Ensemble to real data demonstrated a stable class distribution (M 46.1%, O 21.0%, and C 32.9%) and an advantage over traditional methods of 156,000 additional correctly processed complaints (3.6 percentage points).
- 10. An adaptive architecture combining speed and accuracy is proposed for high-throughput scenarios. With appropriate confidence threshold tuning, such a system achieves a balance between processing speed and Practical Accuracy, ensuring throughput sufficient for realistic data flows; a routing sketch follows this list.
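A minimal sketch of such an adaptive two-tier pipeline is shown below. It assumes sklearn-style pipelines with predict_proba over raw texts and integer-encoded labels; the 0.85 confidence threshold and the names fast_model/heavy_model are illustrative placeholders, not values fixed by the study.

```python
import numpy as np

def route_and_classify(texts, fast_model, heavy_model, threshold=0.85):
    """Two-tier adaptive inference: a fast CPU model (e.g., TF-IDF + LR)
    resolves high-confidence reviews; the rest go to a heavier ensemble.

    Both models are assumed to be sklearn-style pipelines over raw texts
    with integer-encoded class labels (0 = M, 1 = O, 2 = C).
    """
    proba = fast_model.predict_proba(texts)          # shape (n_samples, 3)
    confident = proba.max(axis=1) >= threshold       # mask of easy cases
    labels = proba.argmax(axis=1)
    if not confident.all():
        hard = [t for t, ok in zip(texts, confident) if not ok]
        labels[~confident] = heavy_model.predict(hard)
    return labels, confident.mean()                  # labels + fast-tier share
```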
5.4. Analytical Conclusions (Insights from 4 Million Reviews)
- 11. M reviews are rated significantly higher than O reviews. The average rating of M reviews is 4.75, versus 4.59 for O reviews (p < 0.001). The proportion of high ratings (4–5) is 94.0% in M reviews and 80.7% in O reviews. Patients primarily value the professional competence of doctors rather than service aspects.
- 12. C reviews are the most informative. Their average length is 802 characters, 2.5 times longer than O reviews (320 characters) and 1.6 times longer than M reviews (513 characters). This confirms the thesis of Baines et al. [62] about the key role of narrative comments in feedback acceptance.
- 13. Zero-star reviews reveal distinct dissatisfaction patterns. Among 163,277 extremely negative reviews (3.8% of all feedback), O complaints dominate (38.2%), followed by C (33.7%) and M (28.1%). Zero-star C reviews average 1013 characters, 27% longer than the overall C review average, demonstrating that extreme dissatisfaction generates particularly detailed patient narratives.
- 14. Over 14 years (2011–2025), average ratings increased by 1.24 points. The dynamics indicate a systemic improvement in patients’ perceptions of medical services. The gap between ratings of M and O aspects decreased from 0.17 to 0.07 points.
- 15. After normalization by population, Krasnodar and Rostov-on-Don lead in review activity. Class distribution remains stable across all cities (M 44–49%, O 18–24%, and C 31–37%), confirming the universality of the developed classification.
- 16. Surgical specialties dominate M reviews, while observational specialties dominate O reviews. The highest proportion of M reviews is among dental surgeons (54.0%); the lowest is among gynecologists (37.8%) and obstetricians (38.1%).
5.5. Recommendations for Implementation
- 17. Architecture choice should be determined by the requirements of the specific medical organization. For maximum safety and accuracy, the Stacking Ensemble (92.9% Practical Accuracy) is recommended; for the optimal balance of accuracy and robustness, the BERT + TF-IDF Hybrid (92.7%); for CPU-based real-time processing, TF-IDF + Logistic Regression (89.3%); and for rapid prototyping, the baseline Logistic Regression (81.3%).
- 18. For production deployment, regular model retraining (annually) is recommended, considering the dynamic nature of language and the changing patterns of reviews.
5.6. Final Reflections
5.7. Limitations and Future Directions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| Abbreviation | Meaning |
|---|---|
| BERT | Bidirectional Encoder Representations from Transformers |
| C | Combined Content (both medical and organizational complaints) |
| CV | Coefficient of Variation |
| eWOM | Electronic Word of Mouth |
| κ_mod | Modified Cohen’s Kappa with Hierarchical Weighting |
| LLM | Large Language Model |
| LR | Logistic Regression |
| M | Medical Content (complaints about treatment/diagnosis) |
| ML | Machine Learning |
| NLP | Natural Language Processing |
| O | Organizational Content (complaints about service organization) |
| PA | Practical Accuracy |
| PRW | Physician Rating Websites |
| TF-IDF | Term Frequency–Inverse Document Frequency |
Appendix A. Classification Manual for M, O, and C Reviews (Version v15)
- Specific medical actions: diagnosis, diagnostic interpretation, test analysis;
- Prescription of treatment, procedures, medications, surgeries;
- Assessment of quality of medical procedures and their outcomes;
- Specific treatment results, effectiveness, medical outcomes;
- Description of medical manipulations, examinations, operations;
- Quality of conducted studies (ultrasound, MRI, X-ray, CT), with details;
- Specific errors in diagnosis or treatment;
- Criticism of medical incompetence, with examples;
- Specific medical consultations, with description of actions.
- Specific assessment of medical decisions and their justification → M;
- Criticism of medical competence with examples → M;
- “Check moles for safety” → M (specific diagnosis);
- “Recommendations and treatment” → M (specific actions);
- “Consultation, examination, treatment” → M (specific actions).
- Appointment scheduling, waiting times, schedule delays;
- Administration work, reception, staff;
- Service costs, finances, pricing policy;
- Location convenience, accessibility, parking;
- Space organization, comfort, cleanliness;
- Communication aspects: politeness, attentiveness, talkativeness;
- Personal characteristics without assessment of professionalism;
- General service quality, service;
- Staff attitude, clinic atmosphere;
- General thanks without specifics of medical actions;
- General phrases about professionalism without medical specifics;
- Psychological comfort during communication;
- Aesthetic results without medical assessment;
- Criticism of personal characteristics without medical context.
- Only communication aspects without assessment of professionalism → O;
- General assessment of service without medical specifics → O;
- “God-given doctor, polite, attentive” → O (general phrases);
- “I’ve been treated for many years” → O (without specifics);
- “Not self-confident” → O (personal characteristic).
- Both topics are present approximately equally and are significant;
- Difficult to determine the dominant category;
- Specific medical actions combined with substantial service aspects;
- Specific medical actions + equivalent criticism/evaluation of service;
- Medical prescriptions combined with financial issues;
- Criticism of treatment combined with communication problems.
- If specific medical actions dominate → assign M
- If only general or service evaluations without medical specifics are present → assign O
- If specific medical actions appear with equivalent criticism or evaluation of service → assign C
- Specific medical actions → M
- Only general or service evaluations → O
- Equivalent combination → C
| Case | Decision | Rationale |
|---|---|---|
| “Check moles for safety” | M | Specific medical diagnosis |
| “Recommendations and treatment from doctor” | M | Specific medical prescriptions |
| “Consultation, examination, treatment” | M | Specific medical actions |
| “Prescribed to take tests” | M | Specific prescriptions |
| “Incorrect temperature measurement” | M | Medical error |
| “God-given doctor, polite, attentive” | O | General phrases without medical specifics |
| “I’ve been treated for many years” | O | No specifics of medical actions |
| “Not self-confident” | O | Personal characteristic |
| “Polite, pleasant” | O | Communication aspects |
| “Prescribed tests but was nervous” | C | Medical prescriptions + communication |
| “Treatment is effective but expensive” | C | Medical outcome + financial issue |
| “Diagnosed correctly but had to wait long” | C | Medical competence + organizational problem |
- Unambiguously M: Contains specific descriptions of medical actions, diagnoses, treatments, prescriptions, or medical errors with concrete examples.
- Unambiguously O: Contains only general praise, service comments, personal characteristics, or administrative issues without any specific medical content.
- Unambiguously C: Contains both specific medical content AND significant organizational/service aspects in roughly equal measure.
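For illustration only, the manual above can be condensed into a system prompt for automated annotation. The sketch below uses the openai client with a placeholder model name; the study’s actual v15 prompt and LLM are not restated here, so treat every string in this block as an assumption rather than the published pipeline.

```python
from openai import OpenAI  # any chat-completion client would work similarly

# Condensed restatement of the v15 decision rules; NOT the study's verbatim prompt.
RULES = (
    "You classify Russian-language patient reviews into exactly one class.\n"
    "M: specific medical actions dominate (diagnosis, treatment, prescriptions, "
    "medical errors with concrete examples).\n"
    "O: only general praise, service, communication, or administrative aspects, "
    "with no medical specifics.\n"
    "C: specific medical actions AND significant organizational/service aspects "
    "in roughly equal measure.\n"
    "Answer with a single letter: M, O, or C."
)

def classify_review(text: str, client: OpenAI, model: str = "gpt-4o-mini") -> str:
    """Label one review as M, O, or C; the model name is a placeholder."""
    resp = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{"role": "system", "content": RULES},
                  {"role": "user", "content": text}],
    )
    return resp.choices[0].message.content.strip()[:1]
```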
References
- Litvin, S.W.; Goldsmith, R.E.; Pan, B. Electronic word-of-mouth in hospitality and tourism management. Tour. Manag. 2008, 29, 458–468. [Google Scholar] [CrossRef]
- Cantallops, A.S.; Salvi, F. New consumer behavior: A review of research on eWOM and hotels. Int. J. Hosp. Manag. 2014, 36, 41–51. [Google Scholar] [CrossRef]
- Ismagilova, E.; Dwivedi, Y.K.; Slade, E.; Williams, M.D. Electronic Word of Mouth (eWOM). In Electronic Word of Mouth (eWOM) in the Marketing Context: A State of the Art Analysis and Future Directions; Springer: Cham, Switzerland, 2017; pp. 17–30. [Google Scholar] [CrossRef]
- Emmert, M.; McLennan, S. One decade of online patient feedback: Longitudinal analysis of data from a German physician rating website. J. Med. Internet Res. 2021, 23, e24229. [Google Scholar] [CrossRef]
- Kleefstra, S.M.; Zandbelt, L.C.; Borghans, I.; de Haes, H.J.; Kool, R.B. Investigating the potential contribution of patient rating sites to hospital supervision: Exploratory results from an interview study in The Netherlands. J. Med. Internet Res. 2016, 18, e201. [Google Scholar] [CrossRef]
- Bardach, N.S.; Asteria-Peñaloza, R.; Boscardin, W.J.; Dudley, R.A. The relationship between commercial website ratings and traditional hospital performance measures in the USA. BMJ Qual. Saf. 2013, 22, 194–202. [Google Scholar] [CrossRef]
- Van de Belt, T.H.; Engelen, L.J.; Berben, S.A.; Teerenstra, S.; Samsom, M.; Schoonhoven, L. Internet and social media for health-related information and communication in health care: Preferences of the Dutch general population. J. Med. Internet Res. 2013, 15, e220. [Google Scholar] [CrossRef]
- Hao, H.; Zhang, K.; Wang, W.; Gao, G. A tale of two countries: International comparison of online doctor reviews between China and the United States. Int. J. Med. Inform. 2017, 99, 37–44. [Google Scholar] [CrossRef]
- Lantzy, S.; Anderson, D. Can consumers use online reviews to avoid unsuitable doctors? Evidence from RateMDs.com and the Federation of State Medical Boards. Decis. Sci. 2020, 51, 962–984. [Google Scholar] [CrossRef]
- Gilbert, K.; Hawkins, C.M.; Hughes, D.R.; Patel, K.; Gogia, N.; Sekhar, A.; Duszak, R., Jr. Physician Rating Websites: Do Radiologists Have Online Presence? J. Am. Coll. Radiol. 2015, 12, 867–871. [Google Scholar] [CrossRef]
- Okike, K.; Peter-Bibb, T.K.; Xie, K.C.; Okike, O.N. Association between physician online rating and quality of care. J. Med. Internet Res. 2016, 18, e324. [Google Scholar] [CrossRef]
- Mostaghimi, A.; Crotty, B.H.; Landon, B.E. The availability and nature of physician information on the internet. J. Gen. Intern. Med. 2010, 25, 1152–1156. [Google Scholar] [CrossRef]
- Lagu, T.; Hannon, N.S.; Rothberg, M.B.; Lindenauer, P.K. Patients’ evaluations of health care providers in the era of social networking: An analysis of physician-rating websites. J. Gen. Intern. Med. 2010, 25, 942–946. [Google Scholar] [CrossRef]
- López, A.; Detz, A.; Ratanawongsa, N.; Sarkar, U. What patients say about their doctors online: A qualitative content analysis. J. Gen. Intern. Med. 2012, 27, 685–692. [Google Scholar] [CrossRef]
- Shah, A.M.; Yan, X.; Qayyum, A.; Naqvi, R.A.; Shah, S.J. Mining topic and sentiment dynamics in physician rating websites during the early wave of the COVID-19 pandemic: Machine learning approach. Int. J. Med. Inform. 2021, 149, 104434. [Google Scholar] [CrossRef] [PubMed]
- Ghimire, B.; Shanaev, S.; Lin, Z. Effects of official versus online review ratings. Ann. Tour. Res. 2022, 92, 103247. [Google Scholar] [CrossRef]
- Xu, Y.; Xu, X. Rating deviation and manipulated reviews on the Internet—A multi-method study. Inf. Manag. 2023, 60, 103829. [Google Scholar] [CrossRef]
- Hu, N.; Bose, I.; Koh, N.S.; Liu, L. Manipulation of online reviews: An analysis of ratings, readability, and sentiments. Decis. Support Syst. 2012, 52, 674–684. [Google Scholar] [CrossRef]
- Luca, M.; Zervas, G. Fake it till you make it: Reputation, competition, and Yelp review fraud. Manag. Sci. 2016, 62, 3412–3427. [Google Scholar] [CrossRef]
- Namatherdhala, B.; Mazher, N.; Sriram, G.K. Artificial Intelligence in Product Management: Systematic review. Int. Res. J. Mod. Eng. Technol. Sci. 2022, 4, 2914–2917. [Google Scholar]
- Jabeur, S.B.; Ballouk, H.; Arfi, W.B.; Sahut, J.M. Artificial intelligence applications in fake review detection: Bibliometric analysis and future avenues for research. J. Bus. Res. 2023, 158, 113631. [Google Scholar] [CrossRef]
- Bidmon, S.; Elshiewy, O.; Terlutter, R.; Boztug, Y. What patients value in physicians: Analyzing drivers of patient satisfaction using physician-rating website data. J. Med. Internet Res. 2020, 22, e13830. [Google Scholar] [CrossRef]
- Shah, A.M.; Yan, X.; Tariq, S.; Ali, M. What patients like or dislike in physicians: Analyzing drivers of patient satisfaction and dissatisfaction using a digital topic modeling approach. Inf. Process. Manag. 2021, 58, 102516. [Google Scholar] [CrossRef]
- Emmert, M.; Meier, F. An analysis of online evaluations on a physician rating website: Evidence from a German public reporting instrument. J. Med. Internet Res. 2013, 15, e2655. [Google Scholar] [CrossRef]
- Nwachukwu, B.U.; Adjei, J.; Trehan, S.K.; Chang, B.; Amoo-Achampong, K.; Nguyen, J.T.; Taylor, S.A.; McCormick, F.; Ranawat, A.S. Rating a sports medicine surgeon’s “quality” in the modern era: An analysis of popular physician online rating websites. HSS J. 2016, 12, 272–277. [Google Scholar] [CrossRef] [PubMed]
- Obele, C.C.; Duszak, R., Jr.; Hawkins, C.M.; Rosenkrantz, A.B. What patients think about their interventional radiologists: Assessment using a leading physician ratings website. J. Am. Coll. Radiol. 2017, 14, 609–614. [Google Scholar] [CrossRef]
- Kapoor, N.; Haj-Mirzaian, A.; Yan, H.Z.; Wickner, P.; Giess, C.S.; Eappen, S.; Khorasani, R. Patient experience scores for radiologists: Comparison with nonradiologist physicians and changes after public posting in an institutional online provider directory. Am. J. Roentgenol. 2022, 219, 338–345. [Google Scholar] [CrossRef]
- Gao, G.G.; McCullough, J.S.; Agarwal, R.; Jha, A.K. A changing landscape of physician quality reporting: Analysis of patients’ online ratings of their physicians over a 5-year period. J. Med. Internet Res. 2012, 14, e38. [Google Scholar] [CrossRef]
- Emmert, M.; Meier, F.; Heider, A.K.; Dürr, C.; Sander, U. What do patients say about their physicians? An analysis of 3000 narrative comments posted on a German physician rating website. Health Policy 2014, 118, 66–73. [Google Scholar] [CrossRef] [PubMed]
- Emmert, M.; Meier, F.; Pisch, F.; Sander, U. Physician choice making and characteristics associated with using physician-rating websites: Cross-sectional study. J. Med. Internet Res. 2013, 15, e2702. [Google Scholar] [CrossRef]
- Rahim, A.I.A.; Ibrahim, M.I.; Musa, K.I.; Chua, S.L.; Yaacob, N.M. Patient satisfaction and hospital quality of care evaluation in malaysia using servqual and facebook. Healthcare 2021, 9, 1369. [Google Scholar] [CrossRef]
- Galizzi, M.M.; Miraldo, M.; Stavropoulou, C.; Desai, M.; Jayatunga, W.; Joshi, M.; Parikh, S. Who is more likely to use doctor-rating websites, and why? A cross-sectional study in London. BMJ Open 2012, 2, e001493. [Google Scholar] [CrossRef]
- Hanauer, D.A.; Zheng, K.; Singer, D.C.; Gebremariam, A.; Davis, M.M. Public awareness, perception, and use of online physician rating sites. JAMA 2014, 311, 734–735. [Google Scholar] [CrossRef]
- Lin, Y.; Hong, Y.A.; Henson, B.S.; Stevenson, R.D.; Hong, S.; Lyu, T.; Liang, C. Assessing patient experience and healthcare quality of dental care using patient online reviews in the United States: Mixed methods study. J. Med. Internet Res. 2020, 22, e18652. [Google Scholar] [CrossRef] [PubMed]
- Daskivich, T.J.; Houman, J.; Fuller, G.; Black, J.T.; Kim, H.L.; Spiegel, B. Online physician ratings fail to predict actual performance on measures of quality, value, and peer review. J. Am. Med. Inform. Assoc. 2018, 25, 401–407. [Google Scholar] [CrossRef] [PubMed]
- Gray, B.M.; Vandergrift, J.L.; Gao, G.G.; McCullough, J.S.; Lipner, R.S. Website ratings of physicians and their quality of care. JAMA Intern. Med. 2015, 175, 291–293. [Google Scholar] [CrossRef] [PubMed]
- Skrzypecki, J.; Przybek, J. Physician review portals do not favor highly cited US ophthalmologists. In Seminars in Ophthalmology; Taylor & Francis: Abingdon, UK, 2018; Volume 33, pp. 547–551. [Google Scholar] [CrossRef]
- Widmer, R.J.; Maurer, M.J.; Nayar, V.R.; Aase, L.A.; Wald, J.T.; Kotsenas, A.L.; Timimi, F.K.; Harper, C.M.; Pruthi, S. Online physician reviews do not reflect patient satisfaction survey responses. In Mayo Clinic Proceedings; Elsevier: Amsterdam, The Netherlands, 2018; Volume 93, pp. 453–457. [Google Scholar] [CrossRef]
- Saifee, D.H.; Bardhan, I.; Zheng, Z. Do Online Reviews of Physicians Reflect Healthcare Outcomes? In International Conference of Smart Health; Springer International Publishing: Cham, Switzerland, 2017; pp. 161–168. [Google Scholar] [CrossRef]
- Trehan, S.K.; Nguyen, J.T.; Marx, R.; Cross, M.B.; Pan, T.J.; Daluiski, A.; Lyman, S. Online patient ratings are not correlated with total knee replacement surgeon–specific outcomes. HSS J. 2018, 14, 177–180. [Google Scholar] [CrossRef]
- Doyle, C.; Lennox, L.; Bell, D. A systematic review of evidence on the links between patient experience and clinical safety and effectiveness. BMJ Open 2013, 3, e001570. [Google Scholar] [CrossRef]
- Okike, K.; Uhr, N.R.; Shin, S.Y.; Xie, K.C.; Kim, C.Y.; Funahashi, T.T.; Kanter, M.H. A comparison of online physician ratings and internal patient-submitted ratings from a large healthcare system. J. Gen. Intern. Med. 2019, 34, 2575–2579. [Google Scholar] [CrossRef]
- Lu, S.F.; Rui, H. Can we trust online physician ratings? Evidence from cardiac surgeons in Florida. Manag. Sci. 2018, 64, 2557–2573. [Google Scholar] [CrossRef]
- Greaves, F.; Ramirez-Cano, D.; Millett, C.; Darzi, A.; Donaldson, L. Harnessing the cloud of patient experience: Using social media to detect poor quality healthcare. BMJ Qual. Saf. 2013, 22, 251–255. [Google Scholar] [CrossRef]
- Ranard, B.L.; Werner, R.M.; Antanavicius, T.; Schwartz, H.A.; Smith, R.J.; Meisel, Z.F.; Asch, D.A.; Ungar, L.H.; Merchant, R.M. What can Yelp teach us about measuring hospital quality? Health Aff. (Proj. Hope) 2016, 35, 697–705. [Google Scholar] [CrossRef]
- Khanbhai, M.; Anyadi, P.; Symons, J.; Flott, K.; Darzi, A.; Mayer, E. Applying natural language processing and machine learning techniques to patient experience feedback: A systematic review. BMJ Health Care Inform. 2021, 28, e100262. [Google Scholar] [CrossRef] [PubMed]
- Doing-Harris, K.; Mowery, D.L.; Daniels, C.; Chapman, W.W.; Conway, M. Understanding Patient Satisfaction with Received Healthcare Services: A Natural Language Processing Approach. AMIA Annu. Symp. Proc. 2017, 2016, 524–533. [Google Scholar] [PubMed]
- Nawab, K.; Ramsey, G.; Schreiber, R. Natural language processing to extract meaningful information from patient experience feedback. Appl. Clin. Inform. 2020, 11, 242–252. [Google Scholar] [CrossRef] [PubMed]
- Wallace, B.C.; Paul, M.J.; Sarkar, U.; Trikalinos, T.A.; Dredze, M. A large-scale quantitative analysis of latent factors and sentiment in online doctor reviews. J. Am. Med. Inform. Assoc. 2014, 21, 1098–1103. [Google Scholar] [CrossRef]
- Hao, H.; Zhang, K. The voice of Chinese health consumers: A text mining approach to web-based physician reviews. J. Med. Internet Res. 2016, 18, e108. [Google Scholar] [CrossRef]
- Shah, A.M.; Yan, X.; Shah, S.A.A.; Mamirkulova, G. Mining patient opinion to evaluate the service quality in healthcare: A deep-learning approach. J. Ambient Intell. Humaniz. Comput. 2020, 11, 2925–2942. [Google Scholar] [CrossRef]
- Hao, H. The development of online doctor reviews in China: An analysis of the largest online doctor review website in China. J. Med. Internet Res. 2015, 17, e134. [Google Scholar] [CrossRef]
- Jiang, S.; Street, R.L. Pathway linking internet health information seeking to better health: A moderated mediation study. Health Commun. 2017, 32, 1024–1031. [Google Scholar] [CrossRef]
- Syed, U.A.; Acevedo, D.; Narzikul, A.C.; Coomer, W.; Beredjiklian, P.K.; Abboud, J.A. Physician Rating Websites: An Analysis of Physician Evaluation and Physician Perception. Arch. Bone Jt. Surg. 2019, 7, 136–142. [Google Scholar]
- Russkikh, T.N.; Tinyakova, V.I.; Kukharets, D.V. Semantic analysis of patient feedback to provide decision support in the medical services market. Creat. Econ. 2024, 18, 455–474. (In Russian) [Google Scholar] [CrossRef]
- Kostrov, S.A.; Potapov, M.P.; Akkuratov, E.G. Personalizing communication with the patient: Large language models. Patient-Oriented Med. Pharm. 2025, 3, 68–79. (In Russian) [Google Scholar] [CrossRef]
- Kalabikhina, I.; Moshkin, V.; Kolotusha, A.; Kashin, M.; Klimenko, G.; Kazbekova, Z. Advancing Semantic Classification: A Comprehensive Examination of Machine Learning Techniques in Analyzing Russian-Language Patient Reviews. Mathematics 2024, 12, 566. [Google Scholar] [CrossRef]
- Kalabikhina, I.E.; Kolotusha, A.V. Database of negative reviews from patients of medical clinics in Russian cities with a population of over a million (based on infodoctor.ru for the period 2012–2023). Popul. Econ. 2025, 9, 117–126. [Google Scholar] [CrossRef]
- Kalabikhina, I.E.; Kolotusha, A.V.; Moshkin, V.S. Medical vs. Organizational Complaints: A Machine Learning Analysis Reveals Divergent Patterns in Patient Reviews Across Russian Cities. Healthcare 2025, 13, 2641. [Google Scholar] [CrossRef]
- Kalabikhina, I.E.; Kolotusha, A.V.; Moshkin, V.S. How Different Medical Practices Are Associated with Types of Patient Complaints in Russian Clinics. Healthcare 2026, 14, in press.
- Feizollah, A.; Lin, C.; O’Malley, L.; Thompson, W.; Listl, S.; Byrne, M. The Use of Natural Language Processing to Interpret Unstructured Patient Feedback on Health Services: Scoping Review. J. Med. Internet Res. 2025, 27, e72853. [Google Scholar] [CrossRef]
- Baines, R.; Regan de Bere, S.; Stevens, S.; Read, J.; Marshall, M.; Lalani, M.; Bryce, M.; Archer, J. The impact of patient feedback on the medical performance of qualified doctors: A systematic review. BMC Med. Educ. 2018, 18, 173. [Google Scholar] [CrossRef] [PubMed]
- Wong, E.; Mavondo, F.; Fisher, J. Patient feedback to improve quality of patient-centred care in public hospitals: A systematic review of the evidence. BMC Health Serv. Res. 2020, 20, 530. [Google Scholar] [CrossRef]
- Ruksakulpiwat, S.; Thongking, W.; Zhou, W.; Benjasirisan, C.; Phianhasin, L.; Schiltz, N.K.; Brahmbhatt, S. Machine learning-based patient classification system for adults with stroke: A systematic review. Chronic Illn. 2023, 19, 26–39. [Google Scholar] [CrossRef] [PubMed]
- Landis, J.R.; Koch, G.G. An application of hierarchical kappa-type statistics in the assessment of majority agreement among multiple observers. Biometrics 1977, 33, 363–374. [Google Scholar] [CrossRef] [PubMed]
| LLM\Gold Standard | M | O | C | Σ |
|---|---|---|---|---|
| M | 476 | 28 | 67 | 571 |
| O | 51 | 192 | 164 | 407 |
| C | 220 | 61 | 241 | 522 |
| Σ | 747 | 281 | 472 | 1500 |
| Model | Architecture Type | Input Features | Key Components/Hyperparameters |
|---|---|---|---|
| Stacking Ensemble | Ensemble (Stacking) | BERT embeddings + TF-IDF | Base learners: BERT (rubert-tiny2), TF-IDF + LR, LightGBM; meta-learner: Logistic Regression |
| BERT + TF-IDF Hybrid | Hybrid (Concatenation) | BERT embeddings + TF-IDF | BERT embeddings (last hidden layer, averaged over tokens) concatenated with TF-IDF features; classifier: Logistic Regression |
| BERT + Logistic Regression | Fine-tuned Transformer | BERT embeddings | rubert-tiny2 with classification head on [CLS] token; fine-tuned for 3-class classification |
| TF-IDF + Logistic Regression | Linear (Bag-of-Words) | TF-IDF | Unigrams + bigrams, max 10,000 features; L2 regularization |
| MLP | Feedforward Neural Network | TF-IDF | Two hidden layers (128 and 64 neurons), ReLU activation; trained on TF-IDF features |
| LightGBM | Gradient Boosting | TF-IDF | 100 trees, max depth 7 |
| Random Forest | Ensemble (Bagging) | TF-IDF | 100 trees, Gini criterion |
| Logistic Regression | Linear (Baseline) | TF-IDF | L2 regularization |
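The “BERT + TF-IDF Hybrid” row in the table above describes mean-pooled rubert-tiny2 embeddings concatenated with TF-IDF features and fed to Logistic Regression. A minimal sketch follows; the Hugging Face model id cointegrated/rubert-tiny2, the single-batch embedding, and the absence of dense-feature scaling are simplifying assumptions, not details confirmed by the paper.

```python
import numpy as np
import torch
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "cointegrated/rubert-tiny2"  # assumed Hugging Face id for rubert-tiny2
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
encoder = AutoModel.from_pretrained(MODEL_ID).eval()

@torch.no_grad()
def embed(texts: list[str]) -> np.ndarray:
    """Mean-pool the last hidden layer over non-padding tokens."""
    enc = tokenizer(texts, padding=True, truncation=True,
                    max_length=512, return_tensors="pt")
    hidden = encoder(**enc).last_hidden_state         # (batch, tokens, dim)
    mask = enc["attention_mask"].unsqueeze(-1)        # (batch, tokens, 1)
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

def fit_hybrid(train_texts: list[str], train_labels: list[int]):
    """TF-IDF (unigrams + bigrams, max 10,000 features) + BERT embeddings -> LR."""
    vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=10_000)
    features = hstack([vectorizer.fit_transform(train_texts),
                       csr_matrix(embed(train_texts))])  # embed in batches at scale
    classifier = LogisticRegression(max_iter=1000).fit(features, train_labels)
    return vectorizer, classifier
```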
| Model | Standard Accuracy | Standard F1 | Practical Accuracy | Practical F1 | Critical Errors (M ↔ O) | CV | Training Time (s) |
|---|---|---|---|---|---|---|---|
| Stacking Ensemble | 0.680 ± 0.007 | 0.679 ± 0.007 | 0.929 ± 0.005 | 0.101 ± 0.001 | 317 (1.4%) | 0.011 | 250.0 |
| BERT + TF-IDF Hybrid | 0.676 ± 0.006 | 0.675 ± 0.006 | 0.927 ± 0.004 | 0.103 ± 0.000 | 326 (1.5%) | 0.009 | 227.0 |
| BERT + Logistic Regression | 0.653 ± 0.009 | 0.652 ± 0.009 | 0.915 ± 0.006 | 0.120 ± 0.001 | 379 (1.7%) | 0.014 | 146.0 |
| TF-IDF + Logistic Regression | 0.613 ± 0.008 | 0.610 ± 0.008 | 0.893 ± 0.006 | 0.150 ± 0.001 | 479 (2.1%) | 0.013 | 7.1 |
| MLP | 0.498 ± 0.031 | 0.491 ± 0.030 | 0.868 ± 0.027 | 0.216 ± 0.007 | 591 (2.6%) | 0.062 | 89.3 |
| LightGBM | 0.499 ± 0.009 | 0.499 ± 0.009 | 0.831 ± 0.007 | 0.244 ± 0.002 | 760 (3.4%) | 0.018 | 8.7 |
| Random Forest | 0.450 ± 0.008 | 0.451 ± 0.008 | 0.822 ± 0.007 | 0.261 ± 0.002 | 798 (3.6%) | 0.017 | 12.4 |
| Logistic Regression | 0.503 ± 0.007 | 0.495 ± 0.006 | 0.813 ± 0.005 | 0.249 ± 0.002 | 837 (3.7%) | 0.013 | 5.2 |
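The inverse Practical Accuracy vs. Practical F1 relationship noted in the conclusions (ρ = −0.93, presumably computed over per-fold results) can be spot-checked on the table means above; on these eight points the rank correlation comes out slightly stronger, which is consistent with the reported trade-off.

```python
from scipy.stats import spearmanr

# Practical Accuracy and Practical F1 means, read off the comparison table above
# (rows in the same order, Stacking Ensemble first).
pa  = [0.929, 0.927, 0.915, 0.893, 0.868, 0.831, 0.822, 0.813]
pf1 = [0.101, 0.103, 0.120, 0.150, 0.216, 0.244, 0.261, 0.249]

rho, p_value = spearmanr(pa, pf1)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.4f})")  # rho ≈ -0.98 on these means
```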
| Class | Count | Percentage |
|---|---|---|
| M (Medical) | 1,999,467 | 46.1% |
| O (Organizational) | 911,901 | 21.0% |
| C (Combined) | 1,429,323 | 32.9% |
| Total | 4,340,691 | 100% |
| Class | Mean Rating | Median | Std. Deviation |
|---|---|---|---|
| M (Medical) | 4.75 | 5.0 | 1.00 |
| O (Organizational) | 4.59 | 5.0 | 1.29 |
| C (Combined) | 4.55 | 5.0 | 1.29 |
| Class | Mean | Median | 25th Percentile | 75th Percentile | Maximum |
|---|---|---|---|---|---|
| C (Combined) | 802 | 692 | 491 | 1003 | 10,792 |
| M (Medical) | 513 | 423 | 294 | 627 | 10,492 |
| O (Organizational) | 320 | 278 | 154 | 415 | 10,024 |
| Year | M (Medical) | O (Organizational) | C (Combined) | Total |
|---|---|---|---|---|
| 2011 | 176 | 182 | 107 | 465 |
| 2012 | 1096 | 1522 | 460 | 3078 |
| 2013 | 4819 | 6977 | 1890 | 13,686 |
| 2014 | 7649 | 10,737 | 3086 | 21,472 |
| 2015 | 18,164 | 23,807 | 6515 | 48,486 |
| 2016 | 58,335 | 53,216 | 24,449 | 136,000 |
| 2017 | 90,562 | 77,216 | 36,355 | 204,133 |
| 2018 | 132,596 | 99,371 | 59,103 | 291,070 |
| 2019 | 125,040 | 90,855 | 83,318 | 299,213 |
| 2020 | 94,251 | 61,139 | 88,022 | 243,412 |
| 2021 | 195,189 | 83,946 | 115,871 | 395,006 |
| 2022 | 248,051 | 74,135 | 179,769 | 501,955 |
| 2023 | 325,888 | 74,237 | 253,046 | 653,171 |
| 2024 | 434,350 | 90,748 | 361,453 | 886,551 |
| 2025 * | 251,098 | 57,959 | 208,262 | 517,319 |
| Year | M (Medical) | O (Organizational) | C (Combined) |
|---|---|---|---|
| 2011 | 3.64 | 3.81 | 3.59 |
| 2012 | 3.94 | 4.03 | 3.30 |
| 2013 | 3.96 | 4.13 | 3.27 |
| 2014 | 4.05 | 4.15 | 3.40 |
| 2015 | 4.25 | 4.34 | 3.47 |
| 2016 | 4.43 | 4.40 | 4.02 |
| 2017 | 4.43 | 4.43 | 3.89 |
| 2018 | 4.48 | 4.47 | 3.97 |
| 2019 | 4.53 | 4.50 | 4.15 |
| 2020 | 4.68 | 4.60 | 4.34 |
| 2021 | 4.77 | 4.67 | 4.43 |
| 2022 | 4.81 | 4.72 | 4.60 |
| 2023 | 4.84 | 4.75 | 4.69 |
| 2024 | 4.87 | 4.77 | 4.74 |
| 2025 | 4.88 | 4.81 | 4.75 |
| City | Total Reviews | % of Total | Population (Avg 2019–2023), Millions | Reviews per 1 Million | M (%) | O (%) | C (%) |
|---|---|---|---|---|---|---|---|
| Moscow | 828,864 | 19.1% | 12.65 | 65,525 | 49.0% | 19.3% | 31.7% |
| Saint Petersburg | 588,178 | 13.6% | 5.38 | 109,344 | 44.7% | 18.1% | 37.2% |
| Krasnodar | 463,383 | 10.7% | 1.04 | 444,966 | 45.3% | 23.4% | 31.3% |
| Rostov-on-Don | 312,601 | 7.2% | 1.14 | 275,068 | 45.4% | 23.5% | 31.1% |
| Kazan | 231,391 | 5.3% | 1.26 | 183,895 | 44.6% | 22.0% | 33.4% |
| Specialty | Total Reviews | M (%) | O (%) | C (%) |
|---|---|---|---|---|
| Gynecologist | 321,205 | 37.8% | 30.7% | 31.5% |
| Ultrasound Physician | 224,481 | 37.7% | 27.5% | 34.8% |
| Dentist | 197,403 | 40.0% | 25.2% | 34.8% |
| Obstetrician | 174,825 | 38.1% | 32.7% | 29.2% |
| Dental Surgeon | 116,345 | 54.0% | 18.2% | 27.8% |
| ENT Specialist | 95,760 | 42.4% | 22.6% | 35.0% |
| Therapist | 95,333 | 41.5% | 24.7% | 33.8% |
| Dermatologist | 79,530 | 42.7% | 22.2% | 35.1% |
| Neurologist | 70,746 | 40.8% | 24.2% | 35.0% |
| Ophthalmologist | 66,324 | 41.6% | 24.0% | 34.4% |
| Rating Category | Definition | Count | Percentage |
|---|---|---|---|
| High | 4–5 (inclusive) | 3,895,449 | 89.7% |
| Medium | >0 and <4 | 281,965 | 6.5% |
| Zero | 0 | 163,277 | 3.8% |
| Total | 4,340,691 | 100% |
| Characteristic | Kalabikhina et al., 2024 [57] (Mathematics) | Kalabikhina et al., 2025 [59] (Healthcare) | Present Study |
|---|---|---|---|
| Data Volume | 60 thousand | 18.7 thousand (negative only) | 4.34 million |
| Classes | Binary (positive/negative) | Binary (M, O) | Three-class (M, O, and C) |
| Sentiment Coverage | All | Negative only | All |
| Dataset Creation | Expert annotation | Expert annotation | Hybrid (expert + LLM) |
| Primary Metrics | Standard accuracy | Standard accuracy | Practical Accuracy |
| Models Compared | 3 architectures (GRU, LSTM, and CNN) | Logistic regression | 8 architectures |
| Scaling | No | No | 4.34 million reviews |
| Business Effect | Not quantified | Not quantified | +156,000 correctly processed complaints |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.