Piloting a Survey-Based Assessment of Transparency and Trustworthiness with Three Medical AI Tools
Abstract
1. Introduction
2. Materials and Methods
2.1. Developing the Survey for Transparent Reporting
2.2. Participation Procedure
2.3. Transparency and Trustworthiness Assessment
- Q9: Was the intended use specified for a specific clinical task?
- Q10: Is the tool assistive, i.e., designed to include human oversight by a medical expert?
- Q11 and Q12: Is the tool recommended for application in any setting for the intended use, or is it optimized for specific settings? If applicable in any setting, was the tool sufficiently validated in external settings?
- Q15: Was the AI model output specified, and is it appropriate for the intended use?
- Q16: Was the development in close clinical collaboration to ensure medical integrity and safety?
- Q29, Q34, Q36, Q38, Q39, and Q42: Were the training data source, the timeframe of the data collection, the number of samples in the total dataset and subclasses, instruments and settings, and medical image sizes transparently specified?
- Q30: Is the training data accessible for other researchers or regulatory bodies?
- Q43: Was cross-sectional metadata recorded and were its variables reported? (This information is important for specifying requirements for quality assessment.)
- Q44: Was missing data reported transparently?
- Q45: Were the inclusion and exclusion criteria reported transparently?
- Q50 and Q51: Were the training data preprocessing steps, including splitting, reported transparently?
- Q54: Was the data anonymized and personal information protected?
- Q55 and Q56: Did individuals give consent that their anonymized data can be used to develop this AI model? If yes, was consent revocable?
- Q57: Were any stated ethical principles considered during product development?
- Q58: Did the model deliberately use sensitive attributes to make predictions?
- Q59: Did the report reflect a performed assessment of fairness (performance stratification among the subgroups)? If yes, which groups were investigated, and was the performance similar across them all?
- Q60: Was potential harm reflected and transparently disclosed?
- Q61: Was the risk of bias across the subgroups mitigated? (Can be scored with one point if the performances across the subgroups were investigated but no differences were found.)
- Q62: Was the model performance assessed on external data?
- Q63 and Q65: Were the sizes of the total test dataset and classes transparently reported?
- Q64: Were the inclusion and exclusion criteria for the test dataset transparently reported?
- Q66 and Q67: Were the results from the model assessment shared transparently, including performance plots?
- Q68–Q74: Was the model assessment performed across the quality dimensions of bias, fairness, robustness, interpretability, human comparison, and cost efficiency?
- Were the caveats for deployment (e.g., regarding underrepresented patient groups or clinical considerations) reflected and transparently reported?
- Were underrepresented groups in the data transparently reported, and was further performance investigation in those groups suggested?
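To illustrate how the criteria above could feed into the scoring described in Section 2.3, the following minimal sketch (not the authors' published analysis code) encodes the trustworthiness-relevant question IDs and sums the points awarded per answer (0, 0.5, or 1, as defined in the scoring rubric). The thematic grouping, the exact question set, and all names in the code are illustrative assumptions.

```python
# Minimal sketch (assumption, not the authors' implementation): aggregating
# per-question scores for the trustworthiness criteria listed above.
# Each answer is graded 0 (no information), 0.5 (partial), or 1 (sufficient).

# Hypothetical grouping of question IDs by theme, reconstructed from the
# bullet list above; the published question-to-score mapping may differ.
TRUSTWORTHINESS_CRITERIA = {
    "intended_use": [9, 10, 11, 12, 15, 16],
    "training_data": [29, 30, 34, 36, 38, 39, 42, 43, 44, 45, 50, 51],
    "ethics": [54, 55, 56, 57, 58, 59, 60, 61],
    "validation": list(range(62, 75)),   # Q62-Q74
    "deployment_caveats": [76, 77],      # assumed IDs for the last two items
}

VALID_SCORES = {0.0, 0.5, 1.0}


def score_trustworthiness(answers: dict[int, float]) -> dict[str, float]:
    """Sum the points per theme; unanswered questions count as 0."""
    for qid, value in answers.items():
        if value not in VALID_SCORES:
            raise ValueError(f"Q{qid}: expected a score of 0, 0.5 or 1, got {value}")
    return {
        theme: sum(answers.get(q, 0.0) for q in questions)
        for theme, questions in TRUSTWORTHINESS_CRITERIA.items()
    }


# Example: full points for the intended-use items, partial credit for Q29.
example_answers = {9: 1.0, 10: 1.0, 11: 1.0, 12: 1.0, 15: 1.0, 16: 1.0, 29: 0.5}
print(score_trustworthiness(example_answers))
```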
3. Results
3.1. Survey Respondents and Use Cases
3.2. Transparency and Trustworthiness Scores
3.3. Summary of Assessment Results
3.3.1. Intended Use
3.3.2. Implemented Machine Learning Technology
3.3.3. Training Data Information
3.3.4. Ethical Considerations
3.3.5. Technical Validation and Quality Assessment
3.3.6. Caveats and Recommendations for Deployment
4. Discussion
4.1. Survey and Teleconference
4.2. Respondents
4.3. External Audit
4.4. Exploratory Results from Use Cases
4.5. Scoring Transparency and Trustworthiness
4.6. Conclusions and Future Works
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Appendix A
No. | Name of Consideration or Guideline | Author | Focus
---|---|---|---
1 | TRIPOD statement | Moons, K. G. M., Altman, D. G., Reitsma, J. B., Ioannidis, J. P. A., Macaskill, P., et al. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): Explanation and elaboration. Ann. Intern. Med. 162, W1–W73, DOI: 10.7326/M14-0698 (2015). Collins, G. S., Reitsma, J. B., Altman, D. G. & Moons, K. G. M. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): The TRIPOD statement. BMJ 350, 1–9, DOI: 10.1136/BMJ.g7594 (2015). | Transparent reporting of multivariable prediction models for prognosis or diagnosis |
2 | Guidelines for developing and reporting machine learning predictive models | Luo, W., Phung, D., Tran, T., Gupta, S., Rana, S., et al. Guidelines for developing and reporting machine learning predictive models in biomedical research: A multidisciplinary view. J. Med. Internet Res. 18, 1–10, DOI: 10.2196/jmir.5870 (2016). | |
3 | Datasheets for Datasets | Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J. W., Wallach, H., et al. Datasheets for Datasets. 1–28, (2018). | Reporting about datasets that are provided for the development of prediction models |
4 | Model cards for model reporting | Mitchell, M., Wu, S., Zaldivar, A., Barnes, P., Vasserman, L., et al. Model cards for model reporting. FAT* 2019—Proc. 2019 Conf. Fairness, Accountability, Transpar. 220–229, DOI: 10.1145/3287560.3287596 (2019). | Framework to encourage transparent machine learning model reporting |
5 | Model facts labels | Sendak, M. P., Gao, M., Brajer, N. & Balu, S. Presenting machine learning model information to clinical end users with model facts labels. npj Digital Medicine vol. 3 1–4, DOI: 10.1038/s41746-020-0253-3 (2020). | Presenting machine learning model information to clinical end users |
6 | FactSheets: Increasing trust in AI services through supplier’s declarations of conformity | Arnold, M., Piorkowski, D., Reimer, D., Richards, J., Tsay, J., et al. FactSheets: Increasing trust in AI services through supplier’s declarations of conformity. IBM J. Res. Dev. 63, 1–13, DOI: 10.1147/JRD.2019.2942288 (2019). | Multidimensional fact sheets capture and quantify various aspects of the product and its development to make it worthy of consumers’ trust. |
7 | A roadmap for responsible machine learning for healthcare | Wiens, J., Saria, S., Sendak, M., Ghassemi, M., Liu, V. X., et al. Do no harm: a roadmap for responsible machine learning for health care. Nat. Med. 15–18, DOI: 10.1038/s41591-019-0548-6 (2019). | Laying out critical considerations for the development, testing, and deployment of new solutions for a broad audience. |
8 | ITU/WHO Focus group AI for Health | Wiegand, T., Krishnamurthy, R., Kuglitsch, M., Lee, N., Pujari, S., et al. WHO and ITU establish benchmarking process for artificial intelligence in health. Lancet 394, 9–11, DOI: 10.1016/S0140-6736(19)30762-7 (2019). | Standardized audit framework for medical AI |
9 | CONSORT-AI | Liu, X., Cruz Rivera, S., Moher, D., Calvert, M., Denniston, A. K., et al. CONSORT-AI extension. Nat. Med. 26, 1364–1374, DOI: 10.1038/s41591-020-1034-x (2020). | Reporting guidelines for clinical trial reports for interventions involving AI |
10 | SPIRIT AI | Rivera, S. C., Liu, X., Chan, A.-W., Denniston, A. K. & Calvert, M. J. Guidelines for clinical trial protocols for interventions involving artificial intelligence: the SPIRIT-AI Extension. Bmj 370, m3210, DOI: 10.1136/bmj.m3210 (2020). | Guidelines for clinical trial protocols for interventions involving AI |
11 | STARD 2015 checklist | Bossuyt, P. M., Reitsma, J. B., Bruns, D. E., Gatsonis, C. A., Glasziou, P. P., et al. STARD 2015: An updated list of essential items for reporting diagnostic accuracy studies. BMJ 351, 1–9, DOI: 10.1136/bmj.h5527 (2015). | An updated list of essential items for reporting diagnostic accuracy studies |
12 | DECIDE AI | Vasey, B., Nagendran, M., Campbell, B., Clifton, D. A., Collins, G. S., et al. Consensus statement Reporting guideline for the early-stage clinical evaluation of decision support systems driven by artificial intelligence: DECIDE-AI. Nat. Med. 12, 28, DOI: 10.1038/s41591-022-01772-9 (2022). | Reporting guidelines to bridge the development to implementation gap in clinical AI |
13 | Twenty critical questions on transparency, replicability and effectiveness for machine learning and artificial intelligence research | Vollmer, S., Mateen, B. A., Bohner, G., Király, F. J., Ghani, R., et al. Machine learning and artificial intelligence research for patient benefit: 20 critical questions on transparency, replicability, ethics, and effectiveness. BMJ 368, 1–12, DOI: 10.1136/bmj.l6927 (2020). | For developers, editors, clinicians, and patients to inform and critically appraise where new findings may deliver patient benefit.
14 | Ethics and governance of AI for health | World Health Organization. Ethics and governance of artificial intelligence for health. (2021). | Contains key ethical principles for the design and use of AI for health |
15 | Ethics guidelines for trustworthy AI | High-Level Expert Group on Artificial Intelligence (AI-HLEG), European Commission. Ethics guidelines for trustworthy AI. European Commission https://ec.europa.eu/futurium/en/ai-alliance-consultation.1.html (2019). Available at https://digital-strategy.ec.europa.eu/en/library/ethics-guidelines-trustworthy-ai (accessed on 18 January 2021). | |
16 | Understanding artificial intelligence ethics and safety | Leslie, D. Understanding artificial intelligence ethics and safety. The Alan Turing Institute DOI: https://zenodo.org/record/3240529 (2019) (accessed on 18 November 2021). | A guide for the responsible design and implementation of AI systems in the public sector |
17 | Evidence standards framework for digital health by the National Institute for Health and Care Excellence (NICE) | NICE. Evidence Standards Framework for Digital Health Technologies. (2019). Available at https://www.nice.org.uk (accessed on 18 November 2021). | Evidence standards framework for digital health
18 | Reimagining Global Health through Artificial Intelligence. The Roadmap to AI Maturity by the Broadband Commission | Broadband Commission for Sustainable Development, U. Working Group on Digital and AI in Health Reimagining Global Health through Artificial Intelligence: The Roadmap to AI Maturity. (2020). | Actionable recommendations and call to action for advancing countries on their path to AI maturity.
19 | Artificial Intelligence/Machine Learning (AI/ML)-Based Software as a Medical Device (SaMD) Action Plan by the U.S. Food and Drug Administration (FDA) | U.S. Food and Drug Administration. Artificial Intelligence/Machine Learning (AI/ML)-Based Software as a Medical Device (SaMD) Action Plan. (2021). | Action plan to advance AI/ML-Based Software as Medical Device |
20 | Eight guiding principles for good Machine Learning practice | U.S. Food and Drug Administration. Good Machine Learning Practice for Medical Device Development: Guiding Principles. (2021). Available at https://www.fda.gov/media/153486/download (accessed on 18 November 2021). | |
21 | A Practical Framework for Artificial Intelligence Product Development in Healthcare | Higgins, D., & Madai, V. I. (2020). From Bit to Bedside: A Practical Framework for Artificial Intelligence Product Development in Healthcare. Advanced Intelligent Systems, 2(10), 2000052. https://doi.org/10.1002/aisy.202000052 | Product development framework |
Questionnaire for Transparent Model Reporting in Artificial Intelligence for Health
Purpose: This questionnaire elicits details about the development and evaluation process of your AI model implementing machine learning to predict a health outcome. We expect respondents to have a solid understanding of the development and evaluation processes of the use case. Respondents can answer questions in a team or consult colleagues to answer certain questions. We will assess answers with respect to compliance with the stated transparent reporting and trustworthy AI guidelines.
No. | Question | Suggesting Resource | Transparency | Trustworthiness
---|---|---|---|---
1 | What is the name of the use case/project/product/AI model which you will refer to during this questionnaire?
| / | ||
2 | Who developed the use case? (e.g., details of developer or team, type of organization (academia, private company, government, etc.) academic institution private company government institution other______________________________ | 1, 4, 6 | ||
3 | In which country was this use case developed?
| / | ||
4 | Are you answering this questionnaire alone or with other team members? alone
| / | ||
5 | What was your role during the use case development?
| / | ||
6 | What is your academic background?
| / | ||
7 | For how long have you been working with machine learning?
| / | ||
8 | Please indicate how long you have been involved in the development of the AI model.
| / | ||
Section 1—Intended use of the AI model In the following we will ask questions about the intended application of the model, model output and clinical collaborations. | ||||
9 | [Multiple answers possible] Please specify the primary intended use for the AI model Predicting the onset of a health status change_____________
| 1, 2, 4, 5, 6, 11, 20 | x | x |
10 | Should the AI model work autonomously or assistively?
| 4, 15 | x | x |
11 | Is your AI model optimized for a specific local or clinical setting (e.g., a specific clinical department, country, etc.)?
| 1, 2, 4 | x | x |
12 | Could the AI model potentially be utilized for tasks different from the primary intended use? If yes, please give details.
| x | x | |
13 | [Multiple answers possible] Which of the following clinical considerations apply to your AI model outcome?
| 4 | x | |
14 | [Multiple answers possible] Which form of benefit does your AI allow on the human side?
| 2, 5, 13 | x | |
15 | [Multiple answers possible] Please specify the AI model output.
| 1, 2, 4, 5, 6, 9, 10, 15 | x | x |
16 | Did you consult clinicians during the AI model development? If yes, at which stage? (e.g., design, data selection, testing)
| 6, 7, 15, 20 | x | x |
Section 2—Implemented machine learning (ML) technology In the following we will ask questions about the implemented AI methods for this use case. | ||||
17 | [Multiple answers possible] Which ML algorithm was used to build the AI model?
| 1, 2, 4, 5, 15 | x | |
18 | Was the AI model training supervised, semi-supervised or unsupervised?
| x | ||
19 | Please provide more details on the ML method (e.g., for deep learning: architecture with the number of layers and trainable parameters).
| x | ||
20 | Does the AI model solve a single task or multiple tasks? (Example for multiple tasks: Segmentation and classification)
| x | ||
21 | [Multiple answers possible] Which criteria were used to select the best/final AI model during training? (e.g., highest accuracy, F1-score, …)
| 2, 6 | x | |
22 | Does the model make decisions based on predefined thresholds? If yes, please specify those and their clinical significance
| 1, 6 | x | |
23 | [Multiple answers possible] Was any technique implemented to speed up the computational process of AI model training?
| 6 | x | |
24 | Were any methods applied to reduce overfitting? If yes, please specify hyperparameters.
| 6 | x | |
25 | Do you have one or multiple selected best AI models?
| 2 | x | |
26 | Please provide any relevant citations of ML methods which were applied.
| x | ||
27 | Can you share the source code for the model?
| x | ||
28 | [Multiple answers possible] Where can we find more detailed information on the AI model?
| |||
Section 3—Training data information The following questions refer only to the data used during the training stage and not during testing with external data (holdout from training). Questions 29–44 address the original, unprocessed, dataset (raw) and questions 45–53 address the processed data selected for training. | ||||
(3a) Information about the original unprocessed, unfiltered dataset | ||||
29 | [Multiple answers possible] Please give information on where the dataset used to develop the model was collected. Please name known locations.
| 1, 2, 3, 5, 6, 7, 14, 15, 20 | x | x |
30 | Please report on the availability and accessibility of the data.
| 6 | x | x |
31 | [Multiple answers possible] Who collected the dataset? Please specify the name of the selected option
| 3, 5, 6, 15 | x | |
32 | [Multiple answers possible] Who funded the data collection? Please specify the name of the selected option
| 1, 3, 6, 9, 10, 11 | x | |
33 | What was the purpose of collecting the data?
| 3, 6, 7, 15 | x | |
34 | When was the data collected? Please specify the timeframe.
| 1, 2, 3, 5, 6, 7, 9, 10, 14, 15, 20 | x | x |
35 | What does one data sample represent? Please specify.
| 3, 6 | x | |
36 | How many total data samples does the original dataset contain?
| 1, 2, 3, 5, 6 | x | x |
37 | [Multiple answers possible] Which data modalities are included in the original dataset? Please specify.
| 3, 6 | x | |
38 | [Multiple answers possible] Which instruments and settings were used to capture the data?
| 3, 4, 6 | x | x |
39 | If the dataset contained images: Please specify the image size of the original (raw) images.
| 3, 6 | x | x |
40 | Are individuals represented at one or at multiple timepoints in the original dataset? If multiple, please specify time intervals and irregularities.
| 3, 6 | x | x |
41 | Are data samples annotated with labels? If yes, how and by whom were these annotated?
| 1, 2, 3, 6, 14, 20 | x | x |
42 | [Multiple answers possible] How many samples of each label class were present in the original dataset?
| 1, 2, 3, 6, 9 | x | x |
43 | [Multiple answers possible] Does the dataset record cross-sectional metadata? Please select present variables and specify the frequencies or appropriate summary statistics.
| 1, 2, 3, 4, 6 | x | x |
44 | [Multiple answers possible] Did you encounter any missing data in the original dataset? If yes, please specify affected variables or data-modalities, missing fraction relative to all entries and potential reasons for missing data.
| 1, 2, 3, 6, 14 | x | x |
(3b) Information about data selection and preprocessing to prepare data for model development, comprising training and validation. This excludes testing on hold-out data. | ||||
45 | [Multiple answers possible] How many samples/individuals were selected from the original dataset for developing the model?
| 1, 2, 3, 5, 6, 14, 15 | x | x |
46 | Did you encounter any errors, sources of noise, redundancies present in the original dataset which were relevant for selecting the data for training? If yes, please provide a description and how you handled them.
| 2, 3, 6, 14 | x | |
47 | [Multiple answers possible] Which data modalities or variables were selected for the processed dataset as model input? Please choose relevant categories and specify within.
| 1, 2, 3, 5, 6, 14 | x | |
48 | [Multiple answers possible] Which preprocessing steps were performed to prepare data for ML model development?
| 1, 2, 3, 6, 14 | x | x |
49 | [Multiple answers possible] By which proportions did you split the preprocessed data samples into training, validation and test set?
| 2, 6, 7 | x | |
50 | Did you assign samples to each split at random or stratified by any criteria?
| 2, 6, 7 | x | x |
51 | Did you apply k-fold cross validation?
| 2 | x | |
52 | If k-fold cross validation was applied, was the test set held separate or was it mixed with the validation folds?
| 2 | x | |
53 | Any other comments or relevant information about model development, which was not addressed previously?
| |||
Section 4—Ethical considerations | ||||
54 | Were the datasets de-identified or anonymised so that individuals cannot be identified?
| 3, 14, 15 | x | x |
55 | Did individuals who are represented in this data give consent for their information to be used in developing this use case?
| 3, 6, 10, 14, 15 | x | x |
56 | Were individuals provided with any mechanism to revoke their consent in the future or for specific uses?
| 3, 6, 10, 14, 15 | x | x |
57 | Which kind of ethical considerations did you follow in your product development (e.g., from EMA, FDA, WHO,…)?
| / | x | x |
58 | [Multiple answers possible] Does the AI model use any sensitive attributes to make predictions? If yes, please specify the attributes.
| 4, 15 | x | x |
59 | Are there any subgroups in which the model might have lower or higher performance compared to others?
| 6, 7, 9, 13, 14, 15 | x | x |
60 | [Multiple answers possible] What are potential harms if model predictions are false? Please try to estimate (1) the likelihood that this harm occurs in an application setting and (2) the severity of the harm, and give reasons for your rating.
| 4, 9, 14, 15 | x | x |
61 | Did you apply any mitigation strategies to overcome risk of bias across sensitive attributes? If yes, please specify the method and results.
| 4, 6, 7, 14, 15 | x | x |
Section 5—Technical validation and quality assessment | ||||
62 | Which type of evaluation will you report in the following? (Please choose the type with the highest relevance for regulatory approval)
| 2, 6, 9, 10, 11, 14, 15 | x | x |
63 | How many total data samples does the evaluation dataset contain?
| 1, 2, 3, 6, 9, 11, 14 | x | x |
64 | [Multiple answers possible] Please specify the inclusion and exclusion criteria for samples/individuals in the test dataset.
| 1, 2, 3, 4, 6, 9, 10, 11, 14, 15, 20 | x | x |
65 | [Multiple answers possible] How many samples of each label class were present in the test dataset?
| 1, 2, 5, 11, 14 | x | x |
66 | [Multiple answers possible] Which performance measures are reported for this evaluation? Please specify the gold standard and respective results.
| 1, 2, 4, 5, 6, 7, 9, 10, 11, 14, 15, 20 | x | x |
67 | Can you provide relevant plots and tables about the evaluation results (e.g., ROC-AUC plot)?
| / | x | x |
68 | [Multiple answers possible] Did you investigate AI model performance variations across different groups? If yes, please specify the groups and report the results here.
| 4, 6, 7, 13, 14, 15, 20 | x | x |
69 | [Multiple answers possible] Are there output classes or groups (see previous question) for which the AI model performed worse compared to others?
| 4, 6, 11, 13, 14, 15, 20 | x | x |
70 | Have you applied statistical testing to compare AI model performance across different groups? If yes, specify the tests and significance level of p-values applied.
| 4, 6, 9, 13 | x | x |
71 | Did you perform an analysis to determine which features were most important to predict the model output? E.g., SHAP, class-activation or saliency maps? If yes, how was it done and which input features were most important?
| 6, 7, 14, 15 | x | x |
72 | Did you use approaches to assess uncertainty and variability in model output? If yes, which methods and what were the results?
| 4, 6, 7, 15 | x | x |
73 | Did you compare the model performance to one or more human experts? If yes, describe the analysis approach, competence level of the human, gold standard and results (e.g., conditions under which the machine or the human performs better)
| 20 | x | x |
74 | Did you perform a cost-efficiency (e.g., saved human hours) analysis to quantify to what extent the application of your model can save healthcare costs? If yes, describe the analysis approach and results.
| 2, 7, 13 | x | x |
75 | Any other evaluation results which you would like to report? Here is space for additional information.
| |||
Section 6—Caveats and recommendations for deployment Are there any caveats or recommendations for applying the product correctly or safely?
76 | [Multiple answers possible] Are there relevant subgroups that were not represented or under-represented in the validation dataset and in which AI model performance should be investigated?
| 1, 2, 4, 5, 6, 7, 9, 14, 15, 20 | x | x |
77 | Are there medical contexts or populations in which the reported use case is not recommended / advisable to be applied?
| 1, 2, 4, 5, 7, 9, 14, 15, 20 | x | x |
78 | Are there additional recommendations or caveats for deploying the product?
| 1, 4, 5, 14, 15, 20 |
Appendix B
Appendix B.1. Participant Information and Consent
- (0) Information about the participant;
- (1) Intended use of the medical AI product;
- (2) Implemented machine learning (ML) technology;
- (3) Training data information;
- (4) Ethical considerations;
- (5) Technical validation and quality assessment;
- (6) Caveats and recommendations for deployment.
Appendix B.1.1. Contact
Appendix B.1.2. What Does the Privacy Policy Apply to?
Appendix B.1.3. What Personal Data Do We Collect from You?
Appendix B.1.4. Legal Basis
Appendix B.1.5. Who Will Get Your Data?
Appendix B.2. What Are Your Rights?
Appendix B.2.1. Your Right to Withdraw
Appendix B.2.2. Your Right to Information and Correction
Appendix B.2.3. Your Right to Deletion of Your Personal Data
- your personal data is no longer required for the purposes for which it was collected,
- you have withdrawn your consent and there is no other legal basis,
- you object to the processing and there are no overriding legitimate grounds to justify processing,
- your personal data has been processed unlawfully, or
- your personal data must be deleted in order to comply with the legal requirements.
Appendix B.2.4. Your Right to Restrict the Processing of Your Personal Data
- the accuracy of your personal data is contested by you, until we can prove the accuracy of the data,
- the processing is not lawful,
- your data is no longer required for the purposes of processing but you need it to assert, exercise, or defend yourself against legal claims, or
- you have raised an objection, as long as it has not yet been determined whether your interests prevail.
Appendix B.2.5. Your Right to Object
Appendix B.2.6. Your Complaint Right
Appendix B.2.7. Your Right to Data Transferability
Appendix B.2.8. How Long Do We Store Your Data?
Appendix B.3. Consent and Link to Survey
References
- Topol, E.J. High-performance medicine: The convergence of human and artificial intelligence. Nat. Med. 2019, 25, 44–56. [Google Scholar] [CrossRef]
- Davenport, T.; Kalakota, R. The potential for artificial intelligence in healthcare. Future Healthc. J. 2019, 6, 94–98. [Google Scholar] [CrossRef] [PubMed]
- Bejnordi, B.E.; Zuidhof, G.; Balkenhol, M.; Hermsen, M.; Bult, P.; van Ginneken, B.; Karssemeijer, N.; Litjens, G.; van der Laak, J. Context-aware stacked convolutional neural networks for classification of breast carcinomas in whole-slide histopathology images. J. Med. Imaging 2017, 4, 1. [Google Scholar] [CrossRef]
- Lakhani, P.; Sundaram, B. Deep learning at chest radiography: Automated classification of pulmonary tuberculosis by using convolutional neural networks. Radiology 2017, 284, 574–582. [Google Scholar] [CrossRef] [PubMed]
- Matek, C.; Schwarz, S.; Spiekermann, K.; Marr, C. Human-level recognition of blast cells in acute myeloid leukaemia with convolutional neural networks. Nat. Mach. Intell. 2019, 1, 538–544. [Google Scholar] [CrossRef]
- Zhang, J.; Xie, Y.; Pang, G.; Liao, Z.; Verjans, J.; Li, W.; Sun, Z.; He, J.; Li, Y.; Shen, C.; et al. Viral Pneumonia Screening on Chest X-ray Images Using Confidence-Aware Anomaly Detection. IEEE Trans. Med. Imaging 2020, 40, 879–890. [Google Scholar] [CrossRef]
- Obermeyer, Z.; Emanuel, E.J. Predicting the Future—Big Data, Machine Learning, and Clinical Medicine. N. Engl. J. Med. 2016, 375, 1216–1219. [Google Scholar] [CrossRef]
- Kelly, C.J.; Karthikesalingam, A.; Suleyman, M.; Corrado, G.; King, D. Key challenges for delivering clinical impact with artificial intelligence. BMC Med. 2019, 17, 195. [Google Scholar] [CrossRef]
- Andaur Navarro, C.L.; Damen, J.A.; Takada, T.; Nijman, S.W.; Dhiman, P.; Ma, J.; Collins, G.S.; Bajpai, R.; Riley, R.D.; Moons, K.G.; et al. Risk of bias in studies on prediction models developed using supervised machine learning techniques: Systematic review. BMJ 2021, 375, 2281. [Google Scholar] [CrossRef]
- Liao, T.; Schmidt, L.; Raji, D. Are We Learning Yet? A Meta-Review of Evaluation Failures Across Machine Learning. In Proceedings of the Advances in Neural Information Processing Systems 35 (NeurIPS 2021), Virtual-only, 7–10 December 2021. [Google Scholar]
- WHO. Ethics and Governance of Artificial Intelligence for Health; WHO: Geneva, Switzerland, 2021; ISBN 9789240012752. [Google Scholar]
- AI-HLEG. Ethics Guidelines for Trustworthy AI; European Commission: Brussels, Belgium, 2019; pp. 1–39. [Google Scholar]
- Mitchell, M.; Wu, S.; Zaldivar, A.; Barnes, P.; Vasserman, L.; Hutchinson, B.; Spitzer, E.; Raji, I.D.; Gebru, T. Model cards for model reporting. In Proceedings of the 2019 Conference on Fairness, Accountability, and Transparency, Atlanta, GA, USA, 29–31 January 2019; pp. 220–229. [Google Scholar] [CrossRef] [Green Version]
- Gebru, T.; Morgenstern, J.; Vecchione, B.; Vaughan, J.W.; Wallach, H.; Daumeé, H.; Crawford, K. Datasheets for Datasets. arXiv 2018, 1–28. [Google Scholar] [CrossRef]
- Moons, K.G.M.; Altman, D.G.; Reitsma, J.B.; Ioannidis, J.P.A.; Macaskill, P.; Steyerberg, E.W.; Vickers, A.J.; Ransohoff, D.F.; Collins, G.S. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): Explanation and elaboration. Ann. Intern. Med. 2015, 162, W1–W73. [Google Scholar] [CrossRef] [PubMed]
- Collins, G.S.; Reitsma, J.B.; Altman, D.G.; Moons, K.G.M. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): The TRIPOD statement. BMJ 2015, 350, 1. [Google Scholar] [CrossRef] [PubMed]
- Bossuyt, P.M.; Reitsma, J.B.; Bruns, D.E.; Gatsonis, C.A.; Glasziou, P.P.; Irwig, L.; Lijmer, J.G.; Moher, D.; Rennie, D.; De Vet, H.C.W.; et al. STARD 2015: An updated list of essential items for reporting diagnostic accuracy studies. BMJ 2015, 351, h5527. [Google Scholar] [CrossRef] [PubMed]
- Luo, W.; Phung, D.; Tran, T.; Gupta, S.; Rana, S.; Karmakar, C.; Shilton, A.; Yearwood, J.; Dimitrova, N.; Ho, T.B.; et al. Guidelines for developing and reporting machine learning predictive models in biomedical research: A multidisciplinary view. J. Med. Internet Res. 2016, 18, e323. [Google Scholar] [CrossRef]
- Vasey, B.; Nagendran, M.; Campbell, B.; Clifton, D.A.; Collins, G.S.; Watkinson, P.; Weber, W.; Wheatstone, P.; Mcculloch, P.; DECIDE-AI Expert Group. Reporting guideline for the early-stage clinical evaluation of decision support systems driven by artificial intelligence: DECIDE-AI. Nat. Med. 2022, 12, 28. [Google Scholar] [CrossRef]
- Liu, X.; Cruz Rivera, S.; Moher, D.; Calvert, M.; Denniston, A.K.; SPIRIT-AI and CONSORT-AI Working Group. CONSORT-AI extension. Nat. Med. 2020, 26, 1364–1374. [Google Scholar] [CrossRef] [PubMed]
- Rivera, S.C.; Liu, X.; Chan, A.-W.; Denniston, A.K.; Calvert, M.J. Guidelines for clinical trial protocols for interventions involving artificial intelligence: The SPIRIT-AI Extension. BMJ 2020, 370, m3210. [Google Scholar] [CrossRef]
- Scott, I.; Carter, S.; Coiera, E. Clinician checklist for assessing suitability of machine learning applications in healthcare. BMJ Health Care Inform. 2021, 28, e100251. [Google Scholar] [CrossRef]
- Vollmer, S.; Mateen, B.A.; Bohner, G.; Király, F.J.; Ghani, R.; Jonsson, P.; Cumbers, S.; Jonas, A.; McAllister, K.S.L.; Myles, P.; et al. Machine learning and artificial intelligence research for patient benefit: 20 critical questions on transparency, replicability, ethics, and effectiveness. BMJ 2020, 368, l6927. [Google Scholar] [CrossRef] [Green Version]
- Sendak, M.P.; Gao, M.; Brajer, N.; Balu, S. Presenting machine learning model information to clinical end users with model facts labels. NPJ Digit. Med. 2020, 3, 41. [Google Scholar] [CrossRef]
- Wynants, L.; Riley, R.D.; Timmerman, D.; Van Calster, B. Random-effects meta-analysis of the clinical utility of tests and prediction models. Stat. Med. 2018, 37, 2034–2052. [Google Scholar] [CrossRef] [PubMed]
- Wu, E.; Wu, K.; Daneshjou, R.; Ouyang, D.; Ho, D.E.; Zou, J. How medical AI devices are evaluated: Limitations and recommendations from an analysis of FDA approvals. Nat. Med. 2021, 27, 582–584. [Google Scholar] [CrossRef]
- Muehlematter, U.J.; Daniore, P.; Vokinger, K.N. Approval of artificial intelligence and machine learning-based medical devices in the USA and Europe (2015–20): A comparative analysis. Lancet Digit. Health 2021, 3, e195–e203. [Google Scholar] [CrossRef]
- Raji, I.D.; Smart, A.; White, R.N.; Mitchell, M.; Gebru, T.; Hutchinson, B.; Smith-Loud, J.; Theron, D.; Barnes, P. Closing the AI Accountability Gap: Defining an End-to-End Framework for Internal Algorithmic Auditing. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, Barcelona, Spain, 3 January 2020; pp. 33–44. [Google Scholar]
- AI-HLEG. The Assessment List for Trustworthy AI (ALTAI) for Self Assessment; European Commission: Brussels, Belgium, 2020; ISBN 978-92-76-20009-3. [Google Scholar]
- Zicari, R.V.; Brodersen, J.; Brusseau, J.; Dudder, B.; Eichhorn, T.; Ivanov, T.; Kararigas, G.; Kringen, P.; McCullough, M.; Moslein, F.; et al. Z-Inspection®: A Process to Assess Trustworthy AI. IEEE Trans. Technol. Soc. 2021, 2, 83–97. [Google Scholar] [CrossRef]
- Liu, X.; Glocker, B.; McCradden, M.M.; Ghassemi, M.; Denniston, A.K.; Oakden-Rayner, L. The medical algorithmic audit. Lancet 2022, 7500, 3–6. [Google Scholar] [CrossRef]
- Oala, L.; Fehr, J.; Gilli, L.; Calderon-Ramirez, S.; Li, D.X.; Nobis, G.; Munoz Alvarado, E.A.; Jaramillo-Gutierrez, G.; Matek, C.; Shroff, A.; et al. ML4H Auditing: From Paper to Practice. In Proceedings of Machine Learning Research, NeurIPS 2020 ML4H Workshop, Virtual-only, 11–12 December 2020; Volume 136, pp. 281–317. [Google Scholar]
- Hind, M.; Houde, S.; Martino, J.; Mojsilovic, A.; Piorkowski, D.; Richards, J.; Varshney, K.R. Experiences with improving the transparency of AI models and services. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, Honolulu, HI, USA, 25–30 April 2020; pp. 1–8. [Google Scholar] [CrossRef]
- Bak, M.; Madai, V.I.; Fritzsche, M.; Mayrhofer, M.T. You Can’t Have AI Both Ways: Balancing Health Data Privacy and Access Fairly. Front. Genet. 2022, 13, 929453. [Google Scholar] [CrossRef]
- Amann, J.; Vetter, D.; Blomberg, S.N.; Christensen, H.C.; Coffee, M.; Gerke, S.; Gilbert, T.K.; Hagendorff, T.; Holm, S.; Livne, M.; et al. To explain or not to explain?—Artificial intelligence explainability in clinical decision support systems. PLoS Digit. Health 2022, 1, e0000016. [Google Scholar] [CrossRef]
Score | Meaning |
---|---|
0 | The answer did not provide information because of any of the following reasons:
|
0.5 | The answer provided partial information. * Q27 and Q28 if the source code and model details were planned to be published. |
1 | The answer provided sufficient information. * Q61 if the potential of bias was sufficiently investigated. |
 | UC1 | | UC2 | | UC3 |
---|---|---|---|---|---|---
Section | Transparency Points (%) | Trustworthiness Points (%) | Transparency Points (%) | Trustworthiness Points (%) | Transparency Points (%) | Trustworthiness Points (%)
(1) Intended use | 6 (75.0) | 4 (66.7) | 7 (87.5) | 5 (83.3) | 8 (100.0) | 6 (100.0) |
(2) Machine learning technology | 8 (66.7) | / | 11 (91.7) | / | 5.5 (45.8) | / |
(3) Training data info | 16.5 (68.8) | 7.5 (57.7) | 15 (62.5) | 5 (38.5) | 14.5 (60.4) | 4 (30.8) |
(4) Legal and ethical considerations | 5 (62.5) | 5 (62.5) | 3 (37.5) | 3 (37.5) | 4.5 (56.3) | 4.5 (56.3) |
(5) Technical validation and quality | 4 (30.8) | 4 (30.8) | 8 (61.5) | 8 (61.5) | 10.5 (80.8) | 10.5 (84.8) |
(6) Caveats and recommendations | 0 (0.0) | 0 (0.0) | 1 (50.0) | 1 (50.0) | 2 (100.0) | 2 (100.0) |
Total | 39.5 (59.0) | 20.5 (48.8) | 45 (67.2) | 22 (52.4) | 45 (67.2) | 27 (64.3) |
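For a worked illustration of how the raw points above map to the reported percentages, the sketch below divides each section's points by that section's number of transparency-relevant questions. The per-section maxima are inferred from the table itself (e.g., 6 of an assumed 8 points for the intended-use section gives 75.0%) rather than taken from the published methods, so they should be read as assumptions.

```python
# Worked example (maxima inferred from the table above, not taken from the
# authors' code): converting per-section transparency points into the
# percentages reported for use case 1 (UC1).

SECTION_MAX_TRANSPARENCY = {
    "(1) Intended use": 8,
    "(2) Machine learning technology": 12,
    "(3) Training data info": 24,
    "(4) Legal and ethical considerations": 8,
    "(5) Technical validation and quality": 13,
    "(6) Caveats and recommendations": 2,
}

UC1_TRANSPARENCY_POINTS = {
    "(1) Intended use": 6.0,
    "(2) Machine learning technology": 8.0,
    "(3) Training data info": 16.5,
    "(4) Legal and ethical considerations": 5.0,
    "(5) Technical validation and quality": 4.0,
    "(6) Caveats and recommendations": 0.0,
}


def report(points: float, maximum: int) -> str:
    """Format a score the way the table does, e.g. '6 (75.0)'."""
    return f"{points:g} ({100.0 * points / maximum:.1f})"


for section, maximum in SECTION_MAX_TRANSPARENCY.items():
    print(section, report(UC1_TRANSPARENCY_POINTS[section], maximum))

total_points = sum(UC1_TRANSPARENCY_POINTS.values())
total_max = sum(SECTION_MAX_TRANSPARENCY.values())
print("Total", report(total_points, total_max))   # 39.5 (59.0)
```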
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).