A Safety Controlling System Framework For Implementing Artificial Intelligence In Healthcare

In response to the need to address the safety challenges in the use of artificial intelligence (AI), this research aimed to develop a framework for a safety controlling system (SCS) to address the AI black-box problem in the healthcare industry. The system was developed by adopting the multi-attribute value model (MAVT) approach, which comprises four parts: extracting attributes, generating weights for the attributes, developing a rating scale, and finalizing the system. On the basis of the MAVT approach, three layers of attributes were created. The first level contained 6 key dimensions, the second level included 14 attributes, and the third level comprised 78 attributes. The key first level dimensions of the SCS included safety policies, incentives for clinicians, clinician and patient training, communication and interaction, planning of actions, and control of such actions. The proposed system may provide a basis for detecting AI utilization risks, preventing incidents from occurring, and developing emergency plans for AI-related risks. This approach could also guide and control the implementation of AI systems in the healthcare industry.

AI has outperformed its human counterparts in some medical tasks. Since 2016, the application of AI in healthcare has achieved considerable success, and AI models have accomplished various medical subtasks at or above the performance levels of physicians [17]. A highly accurate neural network algorithm was developed in ophthalmology for detecting diabetic retinopathy after training with manually labeled retinal fundus photographs [18]. In radiology, a convolutional neural network trained with labeled frontal chest X-ray images outperformed radiologists in detecting pneumonia [17,19]. In cardiology, a deep learning algorithm diagnosed heart attacks with a performance comparable to that of (human) cardiologists [20]. In pathology, one study trained AI algorithms with whole-slide pathology images to detect lymph node metastases of breast cancer and compared the results with those of pathologists [21]. In dermatology, a convolutional neural network was trained with clinical images and was found to classify skin lesions accurately [22]. In gastroenterology, one study developed an algorithm for detecting and characterizing polyps in colonoscopy and reported performance comparable to that of expert gastroenterologists [23]. In infectious diseases, an effective algorithm more reliable than physicians was developed to recommend medications for patients with sepsis [24]. A highly accurate algorithm was developed in pediatrics to diagnose common childhood conditions such as meningitis and influenza [25]. In addition, the Smart Tissue Autonomous Robot armed with an AI algorithm outperformed surgeons in several in vivo and ex vivo surgical tasks [26]. In psychology, one study indicated that AI models were able to categorize patient mental health status with a performance at or above the human level [27].
One of the main challenges is safety. Several reports have described unsafe and incorrect recommendations by AI algorithms [29]. The safety of AI models is mainly associated with model interpretability and explainability [1]. Interpretability is defined as the ability to understand how an AI model reaches its decisions [1]. Regarding interpretability, AI models can be categorized into white-box models, such as decision trees, and black-box models, such as neural networks [30].
Compared with white-box models, black-box models have excellent performance with almost no interpretability [31].
To address the AI black-box challenge, a considerable amount of research has focused on developing explainable AI to open the black-box [30]. As a primary method for addressing the AI black-box issue, the visualization approach was developed to explain the models' main features [32]. For example, De Fauw et al. [33] visualized sections of the patient optical coherence tomography scans used by an AI model to make medical decisions. However, visualizations can be difficult to interpret, and users tend to misread the results and over-trust their own judgement [34].
Other approaches for addressing the AI black-box issue have been developed, such as (1) analyzing one isolated layer at a time to learn the differences between layers in neural networks [35], (2) using a simplified version of the algorithm for debugging and detecting potential errors, and then training an accurate version of the algorithm [36], (3) training the black-box model to explain the level of safety by assigning a confidence level to the model's prediction [37], (4) using a group of independent algorithms for decision-making [38], and (5) feeding specific data to an AI model and then analyzing the accuracy and behavior of the model [39]. However, these methods focus on diminishing the black-box rather than opening the black-box of AI [33]. To open the black-box, the logic behind AI models' decision-making processes must be identified, and specific model tasks must be able to be paused or modified as necessary [40].
In contrast, some researchers are less concerned about opening the black-box of AI [35]. From this standpoint, understanding how an AI model makes decisions is less crucial than empirically verifying its accuracy [41]. According to this viewpoint, regulators and clinicians should accept the AI black-box models, because opaque systems are common in medicine [10]. For example, several efficient medications such as aspirin and penicillin were used before their mechanisms were discovered [42]. Because of the excellent performance and popularity of AI black-box models, and given the absence of effective methods to open the black-box, accepting AI black-box models could be considered an acceptable option. However, addressing the safety issues of AI black-box models is also essential [39,42,43].
The present study focused on developing a tool to evaluate the safety practices of AI models implemented in healthcare. The main objective of this article was to build safety guidelines for implemented AI black-box models to reduce the risk of health-related incidents and accidents. For this purpose, a three-level multi-attribute value model (MAVT) approach was used to develop a safety controlling system (SCS) for AI systems implementation.

Methodology
The SCS for AI implementation was developed by using a three-level MAVT adapted from Teo and Ling [44]. This approach consisted of four parts: (1) extracting attributes at different levels; (2) generating weights for the attributes; (3) assigning a rating scale to the attributes; and (4) finalizing the system [44]. Several techniques were used to accomplish these steps: a combination of a systematic literature review and expert interviews was used for extracting attributes, a questionnaire-based survey was used for generating weights, and a second questionnaire-based survey was used for developing a rating scale (Figure 1).
Figure 1. Components of the three-level multi-attribute value approach.

List of attributes
In the proposed SCS, the attributes were divided into three levels. The first level attributes, called key dimensions, were adopted from Fernández-Muñiz et al. [45]. These were extracted from applicable safety standards and guidelines. These key dimensions were the fundamental and well-known elements of any robust safety management system and included safety policies, incentives for clinicians, clinician and patient training, communication and interaction, the planning of actions, and the control of actions.
The second and third level attributes were developed by using a systematic literature review and interviewing ten AI domain experts. As the lowest level, the third level attributes were measurable safety elements for implemented AI systems in healthcare. The third level attributes were extracted from the systematic literature review and were subsequently refined during expert interviews. The third level attributes were clustered according to their predominant topics. These topics were named as the second level attributes.
The main reason for using a combination of a systematic review and interviews was to identify the main topics of AI safety in the included literature and to expand these topics through consultation with ten AI domain experts. In addition, we aimed to ensure that all main aspects (elements) of AI implementation safety were addressed. For this purpose, first, a systematic review was conducted, and the main elements of safety in the different key dimensions were extracted.
Second, the extracted information was categorized and discussed during interviews with AI domain experts to produce the third level attributes, as illustrated in Figure 2.

Systematic review
To identify the attributes, we followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines [46]. Two main features of the research question and search strategy were developed [47]. The following research question guided the systematic review: SRQ: What are the primary safety attributes of implemented AI models in healthcare for each key dimension?
A search strategy was developed by (1) defining keywords and identifying all relevant records, (2) filtering the identified articles, and (3) addressing the risk of bias among records [48,49]. Three sets of keywords were defined, and their combinations were used to identify relevant articles (Figure 3). The first set included "artificial intelligence," "machine learning," "deep learning," and "neural networks"; the second set consisted of "health," "healthcare," and "medicine"; and the third set included "safety policies," "safety incentives," "safety training," "communication or interaction," "safety planning," and "safety controlling." The PubMed and Google Scholar databases were used to discover relevant articles published through the end of July 2020. After identifying 829 articles with relevant content, we used a formal screening process based on inclusion and exclusion criteria. The inclusion criteria were articles associated with the research question and those written in English. The exclusion criteria included articles written in other languages, presentations, anecdotes, book chapters, letters, and posters [48,49]. Screening of the titles, abstracts, conclusions, and keywords in the identified records after removal of duplicates (n = 799) excluded 732 articles. The remaining articles (n = 67) were read in full to determine eligibility, and all were included in the review (Figure 4). Two authors independently reviewed the titles and abstracts of the identified articles selected for full-text review to address potential biases. Afterward, the authors compared their results and created a unified list.
After reading the article bodies, two authors mutually decided on whether to include the articles.

Interviews
A primary method for collecting qualitative data is interviews, which are widely used in conducting research [50]. Some researchers apply structured interviews to test a priori hypotheses by using standardized questions and analysis. In contrast, others apply qualitative interviewing methods to better understand the stated hypotheses [51]. In this article, both types of interviews were used.
After completing the systematic literature review, we applied an interview approach to develop the SCS's lower-level attributes. The objective was to examine the information extracted from the literature review, discuss unidentified aspects of safety systems, and determine measurable third level attributes. Accordingly, we interviewed ten AI domain experts working in U.S. technology companies. Three of the selected experts worked at the management level, two were in post-doctoral positions, and the remainder held job titles of data scientist or data engineer.
We asked these ten specialists the following questions:
• IRQ1. What are the attributes of safety policies for implemented AI models in healthcare?
• IRQ2. What are the attributes of incentives for clinicians for implemented AI models in healthcare?
• IRQ3. What are the attributes of clinician and patient training for implemented AI models in healthcare?
• IRQ4. What are the attributes of communication and interaction for implemented AI models in healthcare?
• IRQ5. What are the attributes of planning of actions for implemented AI models in healthcare?
• IRQ6. What are the attributes of control of actions for implemented AI models in healthcare?
The interviews were conducted during August 2020. Each interview lasted approximately 1 hour and was divided into two main parts. First, the key dimensions and results of the systematic literature review were explained and discussed. Second, the six research questions were asked, and the third level attributes were formed. After the attributes were identified on the basis of the systematic literature review and expert interviews, we organized the key dimensions and the second and third level attributes into a hierarchy tree. In this knowledge structure, the higher-level attributes represented the overall view of safety in implemented AI models, and the lower-level attributes measured the elements of safety in AI models (Figure 5). Notably, the highest level had 6 attributes, the middle level had 14 attributes, and the lowest level had 78 attributes.
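As a concrete illustration, the three-level hierarchy described above can be modeled as a nested mapping from key dimensions to second level attribute clusters to measurable third level attributes. The sketch below is a minimal Python example; the dimension names come from the text, but the lower-level entries are illustrative placeholders, not the actual 78 attributes of the SCS.

```python
# A minimal sketch of the three-level SCS attribute hierarchy.
# Entries below the first level are illustrative placeholders,
# not the actual attributes developed in the study.
scs_hierarchy = {
    "safety policies": {                            # level 1: key dimension
        "legislation and codes of practice": [      # level 2: attribute cluster
            "commitment to FDA regulations regarding SaMD",   # level 3
            "commitment to data privacy legislation",
        ],
        "liability": [
            "responsibilities of AI developers are defined",
        ],
    },
    "control of actions": {
        "checking effectiveness internally and externally": [
            "post-market surveillance is in place",
        ],
    },
}

def count_level3(tree):
    """Count the measurable (third level) attributes in the hierarchy."""
    return sum(len(attrs) for dims in tree.values() for attrs in dims.values())

print(count_level3(scs_hierarchy))  # 4 in this toy example; 78 in the full SCS
```

A structure like this makes the later steps (weighting and scoring each third level attribute) straightforward to implement as traversals of the tree.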

Weight of attributes
Because the identified attributes differed in importance regarding AI system safety, differentiating essential attributes from desirable attributes was necessary. Therefore, we assigned a weight to each attribute to indicate its degree of importance. Weights are crucial for decision-making because they indicate the most critical safety elements in AI systems implementation. To assign weights, a questionnaire containing the third level attributes was designed, and the ten AI experts who participated in developing the attributes were asked to rate each attribute on a 4-point Likert scale: not important = 1; neutral = 2; important = 3; and very important = 4.
We assessed agreement among the AI experts by calculating Kendall's W (Kendall's coefficient of concordance) [52,53]. This non-parametric statistic ranges between 0 and 1, with 1 indicating complete agreement [54]. We assessed the concordance of opinions regarding the six key dimensions of the SCS. There was strong agreement (Kendall's W greater than 0.6) among the AI experts in the key dimensions of "planning of actions" and "control of actions," and moderate agreement (Kendall's W between 0.3 and 0.6) regarding the remaining key dimensions. We then adopted the average of the experts' ratings as each third level attribute's weight. Next, the weights of all third level attributes were rescaled such that the sum of all weights was 100. In the final step, the weights of the second and the first level attributes were determined. According to the results, the key dimension of "communication and interaction" had the highest weight, followed by "control of actions" and "safety policies." The weights of the key dimensions and the second level attributes are shown in Figure 6.
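To make the weighting step concrete, the following sketch shows one way to compute Kendall's W from an m × n matrix of expert ratings and to rescale mean ratings so that the weights sum to 100. This is a simplified illustration only (it omits the tie correction that applies when several attributes receive identical ratings) and is not the authors' actual computation.

```python
import numpy as np

def kendalls_w(ratings):
    """Kendall's coefficient of concordance for an (m raters x n items)
    rating matrix. Ties are not corrected for in this simplified sketch."""
    m, n = ratings.shape
    # Convert each rater's scores to ranks 1..n across the items.
    ranks = np.argsort(np.argsort(ratings, axis=1), axis=1) + 1
    rank_sums = ranks.sum(axis=0)
    s = ((rank_sums - rank_sums.mean()) ** 2).sum()
    return 12 * s / (m ** 2 * (n ** 3 - n))

def normalize_weights(mean_ratings):
    """Rescale mean expert ratings so that the weights sum to 100."""
    mean_ratings = np.asarray(mean_ratings, dtype=float)
    return 100 * mean_ratings / mean_ratings.sum()

# Two hypothetical raters ranking four attributes identically -> W = 1.
ratings = np.array([[1, 2, 3, 4],
                    [1, 2, 3, 4]])
print(kendalls_w(ratings))              # 1.0 (complete agreement)
print(normalize_weights([3, 2, 4, 1]))  # [30. 20. 40. 10.]
```

In practice a tie-aware ranking (e.g., average ranks for tied scores) would be preferable, since 4-point Likert ratings of 78 attributes necessarily contain many ties.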

The rating system
The next part of the MAVT approach was developing a rating system and assigning it to the third level attributes. To reduce the probability of different auditors obtaining different results and to improve the generalizability of the SCS, we developed a rating system by allocating points to the third level attributes in a straightforward manner. Different types of rating systems were extracted from Teo and Ling [44] and used in the survey. The four possible rating options were as follows: (1) 0/1, in which the rating options are "0" (no) or "1" (yes); (2) 0-1, in which the rating options are a fraction between "0" and "1"; (3) 0/1/NA, in which the rating options are "0," "1," or "not applicable"; and (4) 0-1/NA, in which the rating options are a fraction between "0" and "1" or "not applicable." A questionnaire containing the third level attributes was designed, and the ten AI experts were asked to select the most relevant rating system for each attribute. We assessed agreement among the AI experts by assigning numbers from 1 to 4 to each rating system and calculating Kendall's W. There was strong agreement (Kendall's W greater than 0.6) among the AI experts in all key dimensions except "control of actions" and "safety policies." The most relevant (popular) rating system was assigned to each attribute according to the collected data.

Finalizing the model
The score of each third level attribute was determined by multiplying the attribute's weight by the auditor's rating for that attribute under the assigned rating system. The total score was then calculated by summing the scores of all third level attributes. The entire 19-step MAVT approach for developing the SCS is represented in Figure 8.
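The scoring step can be sketched as follows. The article does not specify how "not applicable" (NA) ratings affect the total, so this illustrative Python function simply excludes NA attributes and rescales the remaining weights to sum to 100; that NA policy is an assumption for the example, not the authors' stated method.

```python
def total_score(weights, ratings):
    """Total SCS score: sum of weight x rating over applicable attributes.

    ratings: 0/1 or a fraction in [0, 1], per the rating system assigned
    to each attribute, or None for "not applicable" (NA). The NA policy
    (exclude the attribute and rescale the remaining weights to 100) is
    an illustrative assumption.
    """
    pairs = [(w, r) for w, r in zip(weights, ratings) if r is not None]
    applicable = sum(w for w, _ in pairs)
    scale = 100.0 / applicable
    return sum(w * scale * r for w, r in pairs)

# Three hypothetical attributes weighted 50/30/20; the third is NA.
print(total_score([50, 30, 20], [1, 0.5, None]))  # 81.25
```

Because the weights sum to 100, the total score falls on a 0-100 scale, which makes scores comparable across institutions, as the Discussion suggests.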

Results
An SCS can be defined as a set of policies, practices, procedures, strategies, roles, functions, and resources associated with safety that interacts in an organized way to decrease the damage generated in a process [45,55]. Different SCSs have been developed for different industries and technologies, but there is a lack of studies aiming to understand the key dimensions and measurable indicators of the safety of black-box AI models in healthcare. Although the developed safety models and guidelines for industries may not apply directly to AI models in healthcare, their methods and frameworks can be adapted to create a comprehensive safety system suitable for black-box AI models.
This article developed a system to evaluate the safety performance of AI models implemented in healthcare. The proposed system was constructed by applying the three-level MAVT approach [44]. The first level attributes, adopted from Fernández-Muñiz et al. [45], were the main elements of safety standards and guidelines. The 14 attributes of the second level and the 78 elements of the third level were extracted by using a systematic literature review, conducting interviews, and performing two small questionnaire-based surveys.
The key first level dimensions of the SCS are as follows: (1) safety policies; (2) incentives for clinicians; (3) clinician and patient training; (4) communication and interaction; (5) planning of actions; and (6) control of actions.

The first key dimension.
Safety policies can be divided into three attributes: legislation and codes of practice (including six attributes), liability (including nine attributes), and continual development (including five attributes). The development of AI models in healthcare is subject to several legal regimes, such as federal regulations, state tort law, the Common Rule, and the Federal Trade Commission Act.
The current legal system is not well suited to malpractice claims arising from the use of AI black-box models in clinical workflows [56,57]. Therefore, the responsibilities of different parties, including AI developers, the source of training data, clinicians, and suppliers who provide the AI system platform, must be clearly defined [1]. In addition, clinical systems are controlled by explicitly designed rules, and using black-box data-driven devices can introduce new risks [58]. For example, traditional medical devices are updated manually, whereas AI-based devices are updated by training with new data [58]. These differences between data-driven and traditional devices require clinical regulations to be updated so that AI-based devices can be correctly implemented in clinical workflows [56].
The continual development of AI models is one of the main attributes of safety policies. AI-based devices are a new type of medical technology, and they may become outdated because of continually changing medical treatment patterns and improvements in medical instruments [59,60]. Therefore, this unique aspect of AI-based medical devices is important to consider in medical regulation [1]. The Food and Drug Administration (FDA) has defined and developed the Software as a Medical Device (SaMD) category and the Digital Health Software Precertification (Pre-Cert) Program to address this issue [56,61]. Accordingly, the FDA's new policy centers on AI developers' organizational excellence rather than on approving individual AI-based medical devices [62].
Organizational excellence is defined as an effort to develop different processes and standards intended to engage employees to deliver excellent products [62]. Consequently, developers are authorized to update AI models without review by the FDA [63]. Crucially, however, a testing process must be developed to ensure that updated models do not perform below the level of the primary models [64].
The second key dimension.
Because of false confidence, clinicians often accept the results and recommendations of AI models regardless of their accuracy [36]. To address this issue and to motivate clinicians to constantly check the results of AI systems, clinicians' incentives can be considered according to two attributes: developing safety incentive programs and adopting resolutions according to clinicians' recommendations. Safety incentive programs comprise three aspects, such as: "Are there any incentives offered to clinicians to put defined procedures of implemented AI systems into practice?" The attribute of adopting resolutions comprises three aspects, such as: "Is there any meeting with clinicians to adopt their recommendations concerning AI-based medical device operation?"

The third key dimension.
Clinicians and patients should be educated on the risks, benefits, and limitations of AI models [42]. Different actions that should be taken may include engaging clinicians in developing data-driven systems, providing training events in health organizations before and after AI model implementation, using different teaching methods to educate clinicians, asking for feedback from learners, and developing personalized education [65]. Clinician and patient training can be divided into (1) general training, including five attributes, such as "Are clinicians given sufficient training concerning AI system operation when they enter a health institution, change their positions or use new AI-based devices?" and (2) specific training for certain patients and clinicians facing high-risk events, including two attributes, such as "Are specific patients or clinicians who are facing high-risk events trained?"

The fourth key dimension.
Communication and interaction can be divided into two main attributes: human-human interaction, that is, interaction between parties such as healthcare institutions and AI developers (including five attributes), and human-AI interaction (including 18 attributes mainly adopted from Amershi et al. [66]). The safe implementation of AI-based devices in healthcare depends on comprehensive and effective interaction among healthcare institutions, clinicians, and AI developers. This interaction is necessary because AI models cannot be trained and tested for all disease states and patient demographics during clinical trials. The human-human interaction attribute includes five elements, such as "Is there any information system developed between a health institution and an AI developer during the lifetime of AI-based medical devices?" [67-69]. The attribute of human-AI interaction is associated with designing safety guidelines for interaction between humans and AI. A total of 18 attributes are considered for human-AI interaction, such as: "Is there any established description of what the AI-based medical device can do?" [66].
The fifth key dimension.
Because implemented AI models are required to perform tasks in dynamic and complex healthcare environments, and because these models cannot be fully evaluated during clinical trials, a safety plan must be created to identify all risks and adverse events and to define both preventive and emergency courses of action. The planning of actions is divided into two main aspects: (1) risk assessment and preventive plans, including eight attributes, such as: "Are all risks and adverse events identified concerning implemented AI systems?" and (2) an emergency plan for occurred risks, including four attributes, such as: "Do the health institution and the AI developer have an emergency plan for remaining risks and adverse events of AI operation?"

The sixth key dimension.
The control of actions aims to monitor all risks and adverse events and all procedures and planning. The control of actions is divided into two main aspects: 1) checking the AI system's effectiveness internally and externally, and 2) comparing incident rates with benchmarks.
Checking the AI system's effectiveness internally and externally involves eight attributes, such as "Is effective post-market surveillance developed to monitor AI-based medical devices?" Post-market surveillance has two main parts. The first is practical cooperation among clinicians, health organizations, and AI developers to gather clinical and safety-related data (explained in the communication and interaction attribute). The second is monitoring and analyzing different safety signals, longitudinal data, risks and adverse events, and thresholds for AI-based device recall [70].
The comparison of incident rates with benchmarks includes two attributes. All attributes are shown in Table 1; the excerpt below illustrates the format (attribute, weight, rating system):
• Are there procedures to report the results of external and internal evaluation? (2.25; 0/1/NA)
• Comparing incident rates with benchmarks (6.00)
• Are the accident and incident rates regularly compared with those of other healthcare institutions from the same sector using similar processes? (3.00; 0/1/NA)
• Are all policies and procedures regularly compared with those of other healthcare institutions from the same sector using similar processes? (3.00; 0/1/NA)

Discussion
This study offers an alternative solution for opening the AI black-box in healthcare by introducing an SCS. The proposed framework and system provide a basis for implementing and monitoring safety legislation and procedures, identifying the risks and adverse events in AI activities, preventing accidents and incidents from occurring, and having an emergency plan for threats. Therefore, the proposed framework and tool can guide the safety activities of implemented AI systems.
The SCS represents a set of attributes in different layers and can be used in healthcare institutions with implemented AI models. The management of healthcare institutions can use the proposed set of attributes as a checklist, verifying whether a set of desired safety elements exists.
Healthcare systems whose practices include these specific, useful attributes will achieve high scores in the SCS.
Healthcare institutions can use this framework to (1) calculate their safety score, and compare it with those of other institutions, and (2) detect deficiencies in current safety practices regarding the implemented AI models. The above steps can help improve the overall safety performance.
The proposed framework for evaluating AI safety performance was developed by using the MAVT approach, comprising four parts: extracting attributes, generating weights for attributes, developing a rating scale, and finalizing the system. With the MAVT approach, three layers of attributes were created. The first level contained 6 key dimensions, the second level contained 14 attributes, and the third level contained 78 attributes.

First key dimension.
Three attributes-"legislation and codes of practice," "liability," and "continual development"-were extracted as primary elements of safety policies from the literature review and were confirmed in interviews. Commitment to current legislation and codes of practice is a basic element of every AI system. Among current legal regimes, data privacy-related legislation plays a vital role in developing and implementing AI systems. Because of the complexity of protecting data privacy and its effects on data availability, three different viewpoints concerning the level of adaptation of data protection legislation have recently been proposed.

First view. The European Union has adopted legislation entitled the General Data Protection Regulation (GDPR), which details a comprehensive and uniform approach for data privacy, regardless of how data are collected, in what format, or who the custodian is [71]. Under the GDPR, only anonymous data can be shared. The anonymization process under the GDPR requires implementing different techniques on datasets to prevent data re-identification [71]. Although the GDPR aims to protect data privacy rather than to prevent data sharing, a fear of violation penalties has decreased data collection and data aggregation efforts among European companies and even data flow from Europe to the U.S. [72].
Second view. The current U.S. data privacy legislation is more lenient than that of the European Union [73]. In general, Europe places more emphasis on protecting citizens from technological risks, whereas the U.S. focuses more on innovation and technology [73]. Under U.S. privacy law, health data are treated differently depending on how they have been created, who is handling the data, and who the data custodian is [74]. The Health Insurance Portability and Accountability Act (HIPAA) includes a privacy rule that prohibits disclosing protected health information [56]. HIPAA limits the use of protected health information unless there is authorization from the patient or an Institutional Review Board [74]. Under HIPAA policy, any type of de-identified data is considered non-personal and not subject to data protection regulation [71]. Furthermore, HIPAA focuses on specific actors and their activities rather than on the data itself; therefore, a considerable amount of health data is not covered by HIPAA [74].
Third view. From China's perspective, AI is a powerful tool for economic success, military dominance, and controlling the population [72]. Chinese companies accumulate a tremendous amount of health-related data, which can be used in AI development, owing to lenient regulation on data collection and little public concern about data privacy [63,75]. However, in recent years, the Chinese public has started to petition large companies, such as Baidu and Alibaba, for the right to data privacy [75]. Consequently, China has initiated personal data protection laws and ethical principles for developing and using AI [76,77].
Among the third level attributes of safety policies, the elements "Software as a Medical Device (SaMD)," "Digital Health Software Precertification (Pre-Cert) Program," "current legal regimes," and "assigning responsibility" were mainly extracted from the included articles. The elements "safety objectives of the AI-based medical device," "positive atmosphere in the health institution," and "coordinating the AI-based medical device policies with existing policies" emerged mainly from the interviews. However, we observed that the AI experts differed in the weights assigned to this crucial dimension's attributes. The most contentious second level attribute was liability, on which the AI experts did not reach agreement.
The term Software as a Medical Device (SaMD) is described as "software that uses an algorithm that operates on data input to generate an output that is used for medical purposes" [78]. SaMD applications are as diverse as Computer-Aided Detection (CAD) software (for example, software detecting breast cancer) and smartphone applications for diagnostic purposes (for example, software for analyzing images collected from a magnetic resonance imaging medical device). Although some FDA guidelines for SaMD overlap with attributes of other key dimensions, we decided to consider "commitment to FDA regulations regarding Software as a Medical Device" under "safety policies." As described earlier, the Pre-Cert Pilot Program looks first at the AI developers rather than at AI-based medical devices, in contrast to the FDA process for traditional medical devices [79]. Because the FDA selected several companies to participate in developing the Software Pre-Cert pilot program, we decided to include it as an attribute.

Second and third key dimensions.
Both the "incentives for clinicians" and "clinician and patient training" attributes were formed and developed in interviews. There was moderate agreement regarding the weights of attributes and strong consensus regarding the assigned rating system.

Fourth key dimension.
Although two parts of this key dimension were mainly extracted from the literature review, a considerable amount of interview time was spent on this aspect to define the third level attributes.
Human-human interactions are associated with communication management among all parties involved in implementing AI-based medical devices, for example, AI developers and health institutions. All main communication management elements, including planning, managing, and monitoring communication, were discussed in interviews, and measurable attributes were defined. One of the main attributes of human-human interaction is developing an information system for storing, processing, collecting, creating, and distributing information. This information system comprises hardware and software, system users and developers, and the data itself.
Regarding human-AI interactions, the attributes from Amershi et al. [66] were discussed in the interviews to define measurable attributes. The main elements of the human-AI interaction included the following: AI system capability, AI system accuracy, AI system time service, AI system displaying information, AI system language, social and cultural norms in human-AI interaction, AI system readiness, dismissal of unwanted service, AI system recovery, AI system disambiguation, AI system explainability (black-box mitigation), AI system short term memory, personalizing the AI system, updating the AI system, feedback mechanisms in the AI system, the AI system's reaction to wrong actions, customizing the AI system, and notification mechanisms in the AI system. Importantly, personalization means that AI systems can learn from clinicians' actions, and customization means that clinicians can customize the AI system's actions.
One of the most controversial elements of human-AI interaction is the AI system's accuracy and effectiveness. As a part of model safety, the AI model's performance in clinical trials should exceed the performance of existing diagnostic devices and clinicians' judgment [56].
Accuracy, defined as the proportion of correct predictions, is a commonly used metric for evaluating AI algorithms' performance [56].
Many studies have reported the three measures of accuracy, sensitivity, and specificity in clinical trials to capture the full extent of models' properties [56]. However, covering all essential differences in patient demographics and disease states in clinical trials is impossible [59]. One solution is to add external validation after the clinical trials before implementing the model in clinical workflows [59]. The external validation phase would include training and testing the model by using data from the clinics where the AI model will be used [59].
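The three metrics named above can be illustrated with a small sketch. The prediction vectors below are invented, and the `evaluate` helper is a hypothetical stand-in for the reporting step of a clinical trial:

```python
# Hypothetical binary predictions (1 = disease present) vs. ground truth,
# used only to illustrate the three metrics commonly reported in trials.
def evaluate(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)   # proportion of correct predictions
    sensitivity = tp / (tp + fn)         # true positive rate
    specificity = tn / (tn + fp)         # true negative rate
    return accuracy, sensitivity, specificity

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
print(evaluate(y_true, y_pred))  # (0.75, 0.75, 0.75)
```

Reporting all three together matters because accuracy alone can look high on imbalanced clinical data while sensitivity or specificity is poor.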
Other metrics to measure model performance are stability and robustness [39]. Model stability means that, when given two almost identical input data sets, an AI model generates almost the same results [80]. Model robustness indicates the stability of the model's performance after including noise in the input data [39]. Robustness represents the model efficiency for new data outside the training data [39]. These measures are essential for applying AI models in healthcare, because the lack of stability and robustness can diminish clinicians' and patients' trust in AI models [81].
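The stability and robustness definitions above can be checked empirically. In this sketch, `model` is an invented fixed-weight scorer standing in for a trained AI model; the perturbation sizes are assumptions chosen for illustration:

```python
import random

def model(x):
    # hypothetical fixed-weight scorer standing in for a trained AI model
    weights = [0.4, -0.2, 0.1]
    return sum(w * v for w, v in zip(weights, x))

def stability_gap(x, eps=1e-4):
    # stability: two almost identical inputs should give almost identical outputs
    x_close = [v + eps for v in x]
    return abs(model(x) - model(x_close))

def robustness_gap(x, noise_scale=0.1, trials=100, seed=0):
    # robustness: average output shift when Gaussian noise is injected into the input
    rng = random.Random(seed)
    base = model(x)
    shifts = [abs(model([v + rng.gauss(0, noise_scale) for v in x]) - base)
              for _ in range(trials)]
    return sum(shifts) / trials

x = [1.0, 2.0, 3.0]
print(stability_gap(x))   # near zero -> stable
print(robustness_gap(x))  # small average shift -> robust to input noise
```

A large gap on either check would be a warning sign before clinical deployment, since unstable outputs erode clinicians' trust as noted above.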

Fifth key dimension.
In this key dimension, risk assessment was mainly extracted from the literature review, and elements of the preventive plan and emergency plan were discussed in interviews. The foundation of the "planning of actions" dimension is risk assessment. The principal risks of implemented AI systems include data difficulties, technological problems, security problems, models misbehaving, and interaction issues [82]. The two elements of models misbehaving and interaction issues were addressed under human-AI interactions. Therefore, the main remaining risks associated with the implemented AI system are data difficulties, security problems, and technological problems.
Risk of data difficulties. One of the main concerns regarding AI in healthcare is data availability [1,2,28]. Despite considerable recent efforts in collecting and releasing high-quality AI-ready datasets, most health data are not accessible to the public [1,2,28]. These data are generally collected and controlled by hospitals and other health organizations and used for operations but not for analytics or research. Therefore, the formats of the data are often not ideal for training AI models. For example, image data may not be anonymized, organized, or appropriately annotated [83]. Of the publicly available datasets, most are released once and become progressively outdated [59]. For example, despite advances in fundus camera technology, the Messidor database, whose images were acquired in 2007, is still used to train AI algorithms [84].
Other issues in data availability include coverage of rare and novel cases [85], missing data in datasets, a lack of appropriately labeled data [86], high dimensionality together with small sample sizes [87], and data contamination with artifacts and noise [88]. Among image datasets, the main issues include the difficulty of collecting many high-quality manually annotated images [89], the limitations of human perception in annotating and labeling images [90], the time required for reviewing and annotating each image in a dataset [91], the level of raters' sensitivity to a particular target [92,93], loss of information due to image processing and resizing [94], and collection of images from only a specific device [95].
Data privacy is the main difficulty in increasing data availability in healthcare [1,2,28]. A delicate balance must be struck between realizing the potential benefits of aggregating health data and protecting individual privacy rights. To this end, reported practices include anonymizing data before sharing, using validated protocols for de-identification, exploring safer ways to share data, and defining the responsibilities of health organizations as data custodians [96].
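One of the practices listed above, de-identification before sharing, can be sketched minimally. This is an assumed workflow, not one of the validated protocols cited: direct identifiers are dropped, and the record key is replaced with a salted hash so that the same patient links consistently within one data release without exposing the raw identifier. The field names and salt are hypothetical:

```python
import hashlib

SALT = "per-release-secret"  # hypothetical secret held by the data custodian

def deidentify(record, direct_identifiers=("name", "address", "phone")):
    # drop direct identifiers entirely
    clean = {k: v for k, v in record.items() if k not in direct_identifiers}
    # replace the record key with a salted one-way hash (pseudonymization)
    raw_id = clean.pop("patient_id")
    clean["pseudo_id"] = hashlib.sha256((SALT + raw_id).encode()).hexdigest()[:16]
    return clean

record = {"patient_id": "P001", "name": "Jane Doe", "phone": "555-0100",
          "age": 54, "diagnosis": "diabetic retinopathy"}
print(deidentify(record))
```

Note that, as the next paragraph explains, such hashing only links records within one institution's release; it does not by itself solve cross-institution linkage.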
However, linking de-identified data is much more difficult when patients visit different health institutions, obtain insurance through various companies, or change their location [74].
Consequently, forming fragmented health data makes data-driven innovation more difficult [74].
Mitigating the risk of data difficulties. High-quality AI-ready data are the foundation for developing accurate algorithms. Even the unintentional effects of biases due to selecting unsuitable data can decrease the accuracy of AI models. To generate high-quality AI-ready data, different methods have been proposed in various studies. Data aggregation efforts across health organizations are one way to generate high-quality data [97]. One of the main challenges in data aggregation is that the data format may differ among health organizations [98]. Therefore, usable data with consistently structured formats must be generated among health organizations [98].
Several efforts have been proposed to address this concern, including developing cloud infrastructures, adopting unified data formats such as Fast Healthcare Interoperability Resources (FHIR), and launching collaborative efforts among health organizations to create high-level joint features [1,99].
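To make the unified-format idea concrete, the sketch below expresses a minimal FHIR-style "Patient" resource as plain JSON. The field values are invented, and real FHIR resources carry many more elements; the point is that a shared resource schema lets data from different organizations aggregate consistently:

```python
import json

# Minimal FHIR-style Patient resource (illustrative values only).
patient = {
    "resourceType": "Patient",
    "id": "example-001",
    "gender": "female",
    "birthDate": "1970-03-15",
    "identifier": [
        {"system": "https://hospital.example.org/mrn", "value": "MRN-12345"}
    ],
}
print(json.dumps(patient, indent=2))
```

Because every sending system labels its data with the same `resourceType` and element names, a receiving system can validate and merge records without per-hospital format translation.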
Training AI models in a simulated virtual environment has created a unique opportunity to cover the lack of high-quality healthcare data [100]. By using the virtual environment, an AI model can learn and become powerful before it is implemented in the physical world [101]. Chawla [102] has reported the successful implementation of AI models trained inside a virtual environment. The key advantages of using a virtual environment for training AI models are as follows:
1. The virtual environment allows AI developers to simulate rare cases for training models [101].
2. The entire training process can occur in a simulated environment without the need to collect data [102].
3. Learning in the virtual environment is fast; for example, AlphaZero, an AI-based computer program, was trained over a day to become a master in playing Go, chess, and shogi [36].
However, using a virtual environment for training AI models in healthcare is not as advanced as its applications in other fields, such as autonomous cars. For example, the Waymo company has created virtual models of whole cities, and every day it sends 25,000 virtual self-driving cars through these cities to train AI algorithms [103]. Using a virtual environment gives Waymo the ability to simulate more than 5 billion miles of autonomous driving [103]. This achievement may inspire healthcare companies to develop a vast virtual world including all disease states, patient demographics, and health conditions to train AI models.
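A toy sketch of the virtual-environment idea: synthetic cases, including deliberately injected rare ones, are produced by a simulator instead of being collected from real patients, and a model is trained entirely on that simulated data. The disease model, biomarker distributions, and threshold learner here are all invented for illustration:

```python
import random

def simulate_case(rng, rare=False):
    # hypothetical single-biomarker disease model; rare cases sit in the far tail
    if rare:
        return rng.gauss(9.0, 0.5), 1
    sick = rng.random() < 0.3
    level = rng.gauss(7.0, 0.5) if sick else rng.gauss(4.0, 0.5)
    return level, int(sick)

def train_threshold(cases):
    # pick the cutoff that best separates sick from healthy on simulated data
    best_t, best_acc = 0.0, 0.0
    for t in [x / 10 for x in range(0, 120)]:
        acc = sum((level > t) == bool(label) for level, label in cases) / len(cases)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc

rng = random.Random(42)
cases = [simulate_case(rng) for _ in range(500)]
cases += [simulate_case(rng, rare=True) for _ in range(10)]  # inject rare cases
threshold, acc = train_threshold(cases)
print(threshold, acc)
```

The injected rare cases show the first advantage listed above: the simulator can oversample presentations that real-world data collection might never capture.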
Another way of generating high-quality data is building health datasets comprising data from volunteers and groups of consenting individuals. Encouraging patients to share their electronic medical record information and medical images, and creating datasets of volunteers' data have been described in several studies [83]. For example, in 2015, the U.S. National Institutes of Health set an objective to develop genomic data, lifestyle data, and biomarker data from 1 million volunteers from diverse backgrounds [63]. Another project supported by Google is developing a dataset comprising data from 10,000 volunteers over 4 years [98]. Participants in this project monitor their sleeping patterns and daily activities, answer common questions, and periodically visit specific medical testing locations [98]. However, various concerns exist regarding this type of data generation, including the lack of a specific mechanism for patients to share their data and the absence of a well-founded repository for aggregating patient data outside health care organizations. Awareness about the benefits of this process is lacking, and no institution has been authorized to monitor these projects [83,104].
The involvement of tech companies in healthcare has created a new trend of high-quality data generation [105]. For example, big tech companies collect massive amounts of behavioral data from social media and sensors [105]. Biomedical signals such as heart rate and rhythm, blood pressure, blood oxygen saturation, voice, tremor, respiratory rate, limb movement, and temperature can be recorded by modern wearable devices [28]. These biological signals can be used for detecting several health conditions and diseases [2]. Patient-generated health data are another unique method for creating high-quality data. Various health-related datasets can be built by patients and caregivers outside clinics by using software applications, wearable sensors, monitoring devices, smartphones, and tablets with cameras [106]. Recently, substantial improvements have been made in high-quality and low-cost technologies with the potential to collect various patient-generated data regarding movement and behavior, environmental toxins, social interactions, diseases, images, and other physiological variables [107]. For example, one study has begun developing comprehensive open-access datasets through parents recording the behavior of their children with autism by using cell phone cameras [106]. In addition, the FDA has made efforts to establish a path for collecting patient- and caregiver-generated health data in clinical trials [108].
Collecting lifespan data from implemented AI-based medical devices is another method to access high-quality health data. These efforts require creating a system as a combination of hardware and software components to store and transfer generated data [109]. For example, by implementing an AI model in different health organizations, high-quality data can be collected and stored in a repository outside health organizations, with consideration of data privacy.
Security problems. One of the main risks associated with implemented AI systems is security. Adversarial attacks, one of the major types of security problems in AI systems, can occur when flawed AI systems are susceptible to manipulation by inputs explicitly designed to fool them [59]. For example, one study has shown that adding a very small amount of perturbation to images can cause medical image classifiers to incorrectly classify a mole with a 100% confidence level [59]. Because the issue of adversarial attacks cannot be completely addressed in clinical trials, fully managing malicious attacks is a main aspect of the safe implementation of AI systems in healthcare. Hostile attacks can be partially addressed by effective post-market surveillance; however, implementing regulatory actions and novel techniques can secure AI systems against adversarial attacks [11]. For example, in situations in which clinical data can be changed with fraudulent intent, blockchain techniques allow data to be stored in immutable interconnected blocks [11].
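The adversarial-perturbation mechanism can be shown in miniature. This sketch is an FGSM-style attack in spirit only: the "classifier" is a toy linear scorer over four invented pixel values, not the medical image classifier from the cited study. Each pixel is nudged by a small amount in the direction that raises the score, flipping the predicted class:

```python
def score(pixels, weights, bias=0.0):
    # toy linear classifier: positive score -> "malignant", negative -> "benign"
    return sum(w * p for w, p in zip(weights, pixels)) + bias

weights = [0.5, -0.3, 0.8, -0.6]       # invented model weights
image = [0.2, 0.5, 0.2, 0.6]           # toy 4-pixel "image", classified benign
assert score(image, weights) < 0

eps = 0.15                             # small per-pixel perturbation budget
sign = lambda w: 1 if w > 0 else -1
# push every pixel a little in the direction that increases the score
adversarial = [p + eps * sign(w) for p, w in zip(image, weights)]

print(score(image, weights), score(adversarial, weights))  # sign flips
```

Each individual pixel changes by only 0.15, yet the aligned perturbations accumulate across pixels, which is why high-dimensional image classifiers are so vulnerable to inputs that look unchanged to a human reader.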
Technological problems. Typically, the technological problems in AI systems relate to software and hardware. From a software perspective, AI systems are explicitly concerned with algorithms. Although we have discussed the main issues associated with algorithms, such as data difficulties and accuracy, generalization and algorithm fairness must also be addressed. Unknown accuracy of the results for minority subgroups is a major element of algorithm fairness [59]. For example, one study has developed an AI algorithm with high accuracy in the classification of benign and malignant moles but has found that it has poor performance on images of darker skin because it was trained on data from mainly fair-skinned patients [59]. Therefore, in developing and implementing AI systems, further training of AI models on data from minority groups, and the accuracy of AI models for underrepresented groups, must be considered [59].
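A subgroup accuracy audit of the kind implied above can be sketched briefly. The predictions and group labels are invented; the point is that an aggregate accuracy figure can hide poor performance on an underrepresented group:

```python
def accuracy(pairs):
    # fraction of records where the predicted label matches the true label
    return sum(t == p for t, p, *_ in pairs) / len(pairs)

# (true label, predicted label, skin-tone group) -- toy audit data
results = [
    (1, 1, "light"), (0, 0, "light"), (1, 1, "light"), (0, 0, "light"),
    (1, 1, "light"), (0, 0, "light"), (1, 1, "light"), (0, 1, "light"),
    (1, 0, "dark"), (0, 1, "dark"),
]
overall = accuracy(results)
by_group = {g: accuracy([r for r in results if r[2] == g])
            for g in {"light", "dark"}}
print(overall, by_group)  # overall looks acceptable; the minority group fails
```

Running such a per-group breakdown before and after deployment is a lightweight way to operationalize the fairness requirement described above.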
From a hardware perspective, AI systems are mostly concerned with implementing algorithms on a physical computation platform [110]. Different physical computation platforms, distinguishable in terms of power efficiency, computation capability, and form factor, have been developed for AI systems, including general-purpose central processing units; graphical processing units; customizable and programmable accelerator hardware platforms, such as application-specific integrated circuits and field-programmable gate arrays; and other emerging platforms such as memristor crossbar circuits [110]. However, from the hardware perspective, the memory wall is a major challenge for AI systems [110]. The memory wall is defined as a situation in which improvements in processor speed are masked by the much slower progress in dynamic random access memory (DRAM) speed [111]. Although DRAM organization has improved, this aspect remains a major issue in AI systems [111].
Sixth key dimension.
Among elements of this key dimension, post-market surveillance was mainly extracted from the literature review. This effort was supplemented by internal and external validation and use of benchmarks formed and discussed during interviews. Part of the safe implementation of AI-based healthcare devices is post-market surveillance to monitor medical devices' safety [70].
Implementing comprehensive and effective post-market surveillance is essential for two reasons: (1) the FDA's new policy focuses on AI developers rather than AI-based medical devices, and (2) AI models cannot be trained and tested for all disease states and patient demographics during clinical trials and external validation [68]. The post-market surveillance system should include practical cooperation among clinicians, health organizations, and AI developers to efficiently gather clinical and safety-related data. Such a system should correctly identify safety signals, practically collect longitudinal data, effectively report adverse events, and strictly define thresholds for device recall [70,112]. An ideal level of post-market surveillance in AI-based medical devices includes three parts: extensively collecting data across the lifespan of devices, integrating results into electronic health records, and fully tracking and reporting adverse events [67]. Developing and implementing a clear definition of, and distinction between, information systems (data for human-human interaction) and post-market surveillance systems (data for AI-human interaction) is crucial.
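The "strictly defined threshold for device recall" requirement can be sketched as a simple monitor. Adverse-event reports are tracked over a sliding window, and crossing a predefined rate flags the device for review; the window size and threshold here are assumed illustration values, not regulatory ones:

```python
from collections import deque

class SurveillanceMonitor:
    """Toy post-market monitor: flags a device when the recent adverse-event
    rate exceeds a predefined recall threshold."""

    def __init__(self, window=100, max_event_rate=0.05):
        self.outcomes = deque(maxlen=window)   # 1 = adverse event, 0 = normal use
        self.max_event_rate = max_event_rate

    def report(self, adverse: bool) -> bool:
        # record one field report; return True once the window is full and
        # the observed adverse-event rate crosses the threshold
        self.outcomes.append(int(adverse))
        rate = sum(self.outcomes) / len(self.outcomes)
        return (len(self.outcomes) == self.outcomes.maxlen
                and rate > self.max_event_rate)

monitor = SurveillanceMonitor(window=10, max_event_rate=0.2)
reports = [False] * 7 + [True] * 3   # 30% adverse in the last 10 reports
flags = [monitor.report(r) for r in reports]
print(flags[-1])  # threshold crossed on the final report
```

In a real system the reports would flow in from the cooperating clinicians and health organizations described above, and a raised flag would trigger the defined recall or review procedure rather than a print statement.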

Study limitations
The proposed framework of the AI SCS in the healthcare industry has several limitations.
First, we did not perform safety audits to ensure the developed tool's effectiveness. Therefore, at this time, the quality of the proposed approach cannot be assessed in terms of:
1. The comprehensibility of the considered safety elements to potential auditors.
2. The robustness of the rating scale for each safety element to secure a reliable rating under similar conditions.
3. The potential for improving key dimensions and different layers of attributes.
4. The feedback from the healthcare institutions about the system.
To address the above challenges, the proposed framework should be implemented in several healthcare institutions concurrently to investigate its effectiveness. In addition, several key questions should be addressed, including (1) clinicians' acceptance of the framework, (2) the compatibility of the model across multiple healthcare institutions, (3) the opportunity for implementation in different types of healthcare organizations, and (4) the framework's effectiveness.
The second limitation of this study is that the number of AI experts involved was relatively small. Many framework attributes were identified during the interviews, thus indicating their importance in developing AI safety system requirements. Finally, the structural relationships between measurable variables (the third level attributes) and latent variables (the first and the second level attributes) should also be assessed to validate the developed model. A survey including many health institutions considering implementation of AI-based systems (including medical devices) should be conducted for that purpose. Another essential consideration is developing a set of robust AI-relevant safety criteria. Finally, implementation of the proposed system in real settings would require comprehensive management and appropriate regulatory oversight.

Conclusion
This article has discussed the challenges in advancing the implementation of AI in healthcare.
We have outlined the safety challenges of AI in the context of explainability as opposed to the black-box approach. Our main objective was to develop a framework for an SCS for AI as an alternative to opening the black-box. We adopted the MAVT approach to develop an AI system's safety attributes at three levels. This development process consisted of four parts: extracting attributes, generating weights for attributes, creating a rating scale, and finalizing the framework's architecture. We used a systematic literature review and interviews with subject experts to establish the safety attributes' hierarchical structure. The first level contained 6 key dimensions, the second level included 14 attributes, and the third level had 78 attributes. The SCS's key first