Predicting Risks of Machine Translations of Public Health Resources by Developing Interpretable Machine Learning Classifiers

We aimed to develop machine learning classifiers as a risk-prevention mechanism that helps medical professionals with little or no knowledge of their patients' languages predict the likelihood of clinically significant mistakes or incomprehensible MT outputs, based on the features of the English source information given as input to MT systems. A multinomial Naïve Bayes (MNB) classifier was developed to provide intuitive probabilistic predictions of erroneous health translation outputs, based on computational modelling of a small number of optimised features of the original English source texts. The best-performing MNB classifier, using a small number of optimised features (8), achieved a statistically higher AUC (M = 0.760, SD = 0.03) than the classifier using high-dimension natural features (135) (M = 0.631, SD = 0.006, p < 0.0001, SE = 0.004) and the automatically optimised classifier (22) (M = 0.7231, SD = 0.0084, p < 0.0001, SE = 0.004). Furthermore, MNB (8) had statistically higher sensitivity (M = 0.885, SD = 0.100) than the full-feature classifier (135) (M = 0.577, SD = 0.155, p < 0.0001, SE = 0.005) and the automatically optimised classifier (22) (M = 0.731, SD = 0.139, p < 0.0001, SE = 0.0023). Finally, MNB (8) reached statistically higher specificity (M = 0.667, SD = 0.138) than the full-feature classifier (135) (M = 0.567, SD = 0.139, p = 0.0002, SE = 0.026) and the automatically optimised classifier (22) (M = 0.633, SD = 0.141, p = 0.0133, SE = 0.026).


Introduction
Public health discourse is a heterogeneous system which encompasses clinical, research, and educational resources. Clinical resources are known for the lexical and syntactic irregularities that clinical professionals introduce into medical records [1,2]. Research resources are known for their linguistic and content complexity and their formality [3][4][5]. Both genres pose significant challenges for machine translation (MT) technologies, which are susceptible to language difficulty and irregularity [6][7][8]. Compared to these specialised discourses, public health educational resources developed by health authorities are designed to have high information accessibility, as these resources are intended to inform, guide, and support the general public in acquiring an essential understanding of health risks and diseases. Given the important social function of public health education resources, national and international health authorities often make their original English health resources available through professional translation, not only for English-speaking populations but also for people and communities from linguistically and culturally diverse backgrounds. For many years, the World Health Organization (WHO) has been a champion in multilingual health communication and promotion. Public health educational resources developed by health authorities such as the WHO have significantly fewer lexical and syntactic irregularities than clinical writings, and more controlled language difficulty than research or policy materials.
With increasing cost-effectiveness and practical convenience, machine translation is finding more and more applications in online health education and promotion. Neural MT (NMT) tools such as Google Translate have significantly improved on the performance of traditional statistical MT tools, especially in the linguistic fluency and acceptability of MT outputs [9][10][11][12]. MT tools are used extensively not only by professional translators but also by global multilingual users, especially those with limited English proficiency and health literacy, in search of authoritative health information to help them understand and acquire the latest reliable health knowledge. Our study aimed to explore the risks of MT tools in the translation of online health promotion resources for non-English-speaking readers. We developed risk-prevention mechanisms, i.e., machine learning classifiers for risk prediction, to enable a safer and more informed use of MT tools in health promotion and to close the gap in the provision of quality health information to vulnerable populations and communities.

Related Work
Machine translation (MT) is increasingly applied in healthcare settings to facilitate communication and interaction between medical professionals and multicultural and multilingual populations including migrants, ethnic minorities, and refugees. A growing number of studies have shown that despite the convenience, low cost, and increasing accuracy of MT tools, there are still considerable risks in the use of current translation technologies in healthcare settings. The development of effective risk-prevention mechanisms will enable a safer and more informed use of MT tools and help close the gap in technology-assisted healthcare to provide better quality health services to vulnerable populations and communities. Many current studies on the evaluation of MT output are based on lexical, grammatical, or syntactic properties of the language. The authors of [13] used Naive Bayes, decision tree, and support vector machine (SVM) classifiers to automatically detect statistical machine translation errors in Brazilian Portuguese, including number agreement, gender agreement, verb inflectional errors, part of speech errors, extra word, missing word, untranslated word, incorrectly translated word, misspelling, missing, untranslated or mistranslated multiword expressions, and reordering words. The decision tree algorithm correctly classified 76.7% of the instances using the 5-W (a token window of size five), whereas Naive Bayes and SVM achieved their best overall precision using the 7-W (73.6 and 74.5%, respectively). However, such studies have rarely used machine learning methods to explore erroneous MT output in the context of clinical and health research. In our study, we detected and classified some important types of clinically significant MT errors, which matter especially for vulnerable users of MT tools who have very limited bilingual skills and medical knowledge, as well as for medical professionals who might use MT tools to communicate with their patients.

Collection and Selection of WHO Public Health Education Resources
We systematically searched for articles on public health education on infectious diseases from the website of the WHO (May 2021) under the section of Health Topics (www.who.int/health-topics/, accessed on 31 May 2021). We reviewed and retained 185 articles for subsequent corpus annotation and statistical analyses. The standard structure of articles in the Health Topics section included an overview of the disease, its symptoms, and treatment. Articles that appeared in sections other than Health Topics such as WHO news/media, research publications, or policy materials were not included in the data collection to control for the content difficulty and the target readerships of the original English health materials to be used for developing machine learning algorithms. The finalized sample covered more than 30 global and regional leading infectious diseases: human immunodeficiency virus, Dengue fever, Ebola virus, Lassa fever, Yellow fever, Zika virus, foodborne botulism, tetanus, campylobacteriosis, Brucellosis, Cholera, Shiga toxin-producing Escherichia coli, waterborne Legionella pneumophila, plague, Salmonella, Salmonella Typhi, mycotoxins, herpes simplex virus type 1 and type 2, lymphatic filariasis, dracunculiasis, Schistosomiasis, human monkeypox, leprosy, malaria, human papillomavirus, cervical cancer, Middle East respiratory syndrome coronavirus, listeriosis, leishmaniasis, and so on.

Statistical Analysis
We assessed the readability of the original English materials using Flesch Reading Ease [14], Gunning Fog Index [15], and SMOG Index [16]. For Flesch Reading Ease, a mean score of 32 indicates the suitability of these English health resources for senior college students; for Gunning Fog and SMOG Index, a mean score higher than 14 also suggests that these materials were accessible only to highly educated readers. Existing research shows that linguistic difficulty caused by syntactic and lexical complexity can affect the quality of automatic translation outputs. However, the applicability of readability tools as input features for the development of machine learning algorithms remains understudied.
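As an illustration of how the three readability scores relate to simple text counts, the following minimal sketch implements their commonly published formulas. The counts in the usage example are hypothetical and are not taken from the WHO corpus; commercial tools may count syllables and sentences differently, so their scores can deviate from this sketch.

```python
import math


def flesch_reading_ease(words, sentences, syllables):
    """Flesch Reading Ease: higher scores indicate easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def gunning_fog(words, sentences, complex_words):
    """Gunning Fog Index: approximate years of education needed."""
    return 0.4 * ((words / sentences) + 100.0 * (complex_words / words))


def smog_index(polysyllables, sentences):
    """SMOG Index, based on words with three or more syllables."""
    return 1.0430 * math.sqrt(polysyllables * (30.0 / sentences)) + 3.1291


# Hypothetical counts for one document (assumed for illustration only).
fre = flesch_reading_ease(words=300, sentences=12, syllables=540)
fog = gunning_fog(words=300, sentences=12, complex_words=60)
smog = smog_index(polysyllables=60, sentences=12)
print(round(fre, 2), round(fog, 2), round(smog, 2))
```

With these assumed counts, the Flesch score falls near 30 and both grade-level indices exceed 14, the same range the section describes as suitable only for highly educated readers.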

Screening and Classification of Machine Translation Mistakes
In order to develop machine learning classifiers to assess the risk profile or suitability of original English health education and promotion materials for machine translation, we first identified the human translations of the original English texts collected. The WHO has 5 official languages, and the English resources we collected on the WHO website have been mostly translated into the other 4 official languages, including the target language under study, Chinese. Human translations were matched pairwise with the Google Translate outputs for the same English materials. Two bilingual translators compared and assessed the differences between the paired human and machine translations. Linguistic differences between the paired translations that did not create barriers to understanding the Chinese translations or cause any clinically significant misunderstanding of the original English materials were accepted as translation variations. Machine translation divergences from the matching human translations that both bilingual translators independently judged to be clinically significant (Cohen's kappa coefficient 0.941, Asymptotic Standard Error = 0.026, 95% CI: 0.89, 0.992) were highlighted. The review of machine translation outputs identified 4 large categories of automatic translation issues in the WHO standard public health education and promotion resources: conceptual mistakes, mistranslation of disease symptoms, incomprehensible translations, and ambiguous translations that made the translations difficult to understand. The first two types of translation errors (Examples 1-9) are clinically significant: unless the erroneous machine translation outputs are corrected by bilingual medical professionals, the translated information can cause misdiagnosis of diseases or even result in life-threatening behaviours. The last two types of automatic translation issues (Examples 10-11) were related to suboptimal translation outputs with low understandability and low usability of the translated health information.
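The inter-rater agreement reported above is measured with Cohen's kappa, which corrects the observed agreement for the agreement expected by chance. The following is a minimal sketch of that calculation; the two rating lists are hypothetical and do not reproduce the study's annotations.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must label the same non-empty item set.")
    n = len(rater_a)
    # Proportion of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance from each rater's label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = set(freq_a) | set(freq_b)
    expected = sum(freq_a[label] * freq_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical judgements: "sig" = clinically significant, "ok" = acceptable.
translator_1 = ["sig"] * 5 + ["ok"] * 5
translator_2 = ["sig"] * 4 + ["ok"] * 6
kappa = cohens_kappa(translator_1, translator_2)
print(round(kappa, 3))
```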

Classification of Clinically Significant MT Errors
In our study, to demonstrate the significant impact of English health resources on MT (Google Translate) output quality, we selected standard health information on infectious diseases from the World Health Organization (WHO) as training and testing data for developing machine learning classifiers using multinomial Naive Bayes (MNB). Mistakes in the Chinese MT outputs varied in terms of the accuracy of clinical information (epidemiology, symptoms, and diagnosis) and communicative effectiveness (logical confusion).
The diverse machine translation issues identified through double-blind human evaluation revealed an important fact: the original English materials associated with erroneous, suboptimal, or less effective MT outputs did not involve complex medical terminology or jargon, as is conventionally believed (Table 1). Instead, most of these translation mistakes occurred with high-frequency, low-difficulty vocabulary (see Appendix A) such as 'onset' (example 1), 'domestic' and 'semi-domestic' (example 2), 'drug solutions' (example 3), 'event driven' (example 4), 'culling' (example 5), 'outbreak' (example 6), 'delivery' (example 7), and 'occasionally' (example 9). We suspect that such inconspicuous yet clinically significant medical and health translation mistakes were due to the limited ability of current neural MT systems to correctly interpret and discriminate the meanings of polysemous words in general versus specialised texts such as medical and health materials.

English: longer onset times for lower levels of intoxication.
Human: it took longer for symptoms to reveal when the intoxication was low.
MT: the attacks of symptoms took longer when the intoxication was low.

English: different mosquito species live in different habitats; some breed around houses (domestic), others in jungles (wild), and some in both habitats (semi-domestic).
Human: different mosquito species live in different habitats; some breed around houses (domestic), others in jungles, and some in both environments (semi-domestic).
MT: different mosquito species live in different habitats; some breed around houses (national), others in jungles, and some in both places (semi-national).

Related to medical, clinical, and procedural measures
English: serological or other testing; culling can be effective in areas with low prevalence.
Human: serological or other testing; killing can be effective in areas with low prevalence.
MT: serological or other testing; removal can be effective in areas with low prevalence.

Confusion (sentence level)
Related to epidemiology such as disease or virus transmission
English: occasionally, humans working or travelling in the forest are bitten by infected mosquitoes and develop yellow fever.
Human: People working or travelling in the forest are occasionally bitten by infected mosquitoes and develop yellow fever.
MT: People who occasionally work or travel in the forest are bitten by infected mosquitoes and develop yellow fever.

Related to medical, clinical, and procedural measures
English: medical male circumcision reduces the risk of heterosexually acquired HIV infection in men by approximately 50%, including in 'real world' settings where scale up occurred alongside the increasing coverage of ART with its secondary prevention effect.
Human: Medical male circumcision can reduce the risks of men contracting HIV through heterosexual activities by about 50%, including in 'real world'. With the increasing application of antiretroviral therapy and its secondary prevention effect, this method (medical male circumcision) was on the rise.
MT: Medical male circumcision can reduce the risks of men contracting HIV through heterosexual activities by about 50%, including in 'real world'. With the increasing application of ART, its secondary prevention effect increased as well.

English: a person wanting to know his HIV status collects a specimen, performs a test, and interprets the test results in private or with someone they trust.
Human: a person wanting to know his HIV status can privately or with someone they trust collect a specimen, perform a test, and interpret the test results.
MT: a person wanting to know his HIV status can collect a specimen, perform a test, and then privately or explain the test results to people they trust.
Note: Underlined text marks where the translation mistake occurs.
The meaning of a polysemous word that is predominant in general textual materials tends to be prioritized in machine translation outputs. In the health and medical domain, this translation bias can cause clinically significant, misleading, or incomprehensible translation outputs. These translation issues, however, are less conspicuous or detectable than those of earlier MT systems such as statistical MT, as neural machine translation outputs are characterized by their improved fluency and idiomaticity, as noted by Way [17]:
"[Neural] MT output can be deceptively fluent; sometimes perfect target-language sentences are output, and less thorough translators and proofreaders may be seduced into accepting such translations, despite the fact that such translations may not be an actual translation of the source sentence at hand at all!"
For end users of these MT tools, a lack of adequate medical training or health literacy can significantly increase the risks of using MT tools to acquire critical health knowledge and public health advice from health authorities such as the WHO.

Training and Testing of Machine Learning Classifiers
We divided the total sample set into 70% training (129) and 30% testing (56) data. We manually classified the original English materials as MT-error-prone English texts (EPET) or non-MT-error-prone English texts (NEPET) in order to develop supervised machine learning classifiers that predict the likelihood of MT errors (falling under any of the MT mistakes defined in this study, such as conceptual mistakes, symptom mistakes, and incomprehensible or ambiguous/confusing translation outputs) from the features of the original English health materials. In the training data, the ratio of EPET to NEPET was 1.35 (EPET: 74; NEPET: 55). In the testing data, the ratio of EPET to NEPET was 0.87 (EPET: 26; NEPET: 30).
Various English corpus annotation systems exist. In our study, we annotated the original English public health materials from the WHO website using Readability Studio (Oleander Software Ltd., Vandalia, OH, USA) and USAS (University of Lancaster Semantic Annotation System) [18]. Readability Studio was used to test the hypothesis that the language difficulty of the original English texts was the main factor contributing to the occurrence of clinically significant errors in machine translation outputs. There is a rich literature on the impact of source texts on the quality of translations. Readability Studio generated descriptive statistics for a total of 20 morphological, structural, and lexical features of the original English texts: average number of sentences per paragraph; number of difficult sentences (more than 22 words); longest sentence; average sentence length; number of unique words; number of syllables; average number of characters; average number of syllables; number of proper nouns; number of monosyllabic words; number of unique monosyllabic words; number of complex (more than three syllable) words; number of unique complex (more than three syllable) words; number of long (more than six characters) words; number of unique long words; misspellings; overused words; wordy items; passive voice; and sentences that begin with conjunctions.
The use of the USAS semantic annotation system was based on our observation of the patterns of clinically significant errors in machine translation outputs, as shown in the illustrative examples. It was high-frequency polysemous words that tended to cause mistakes in automatic translations, rather than morphologically or syntactically complex expressions. Semantic annotation helps explore the relations between the semantic meanings of the original English expressions and the errors that occurred in the machine translation results. The USAS semantic system annotated the original English health texts with 115 finely classified semantic features: general or abstract terms A1-A15; body/the individual B1-B5; arts/crafts C1; emotion E1-E6; food/farming F1-F4; government/public G1-G3; housing/home H1-H5; commerce/industry I1-I4; sports/games K1-K6; living creatures L1-L3; movement/location/transport M1-M8; numbers/measurements N1-N6; substances/materials/objects/equipment O1-O4; education P1; language/communication Q1-Q4; social actions, states, and processes S1-S9; time T1-T4; world and environment W1-W5; psychological actions, states, and processes X1-X9; science and technology Y1-Y2; and names and grammar Z1-Z99. The choice of USAS reflects the importance of semantic discrimination by neural machine translation tools such as Google Translate for producing reliable and accurate health translation outputs.

Optimisation Techniques
The total number of features of the annotated English health materials was 135. Small (185 samples) and high-dimensional (135 features) data such as ours can add uncertainty or cause overfitting in the machine learning classifiers being developed. In order to increase the performance of the machine learning algorithms, we used a backward selection method, recursive feature elimination (RFE) with a support vector machine (SVM) as the base estimator, to optimize the original high-dimensional feature set, with minimal classification error (MCE) as the selection criterion. Machine learning is distinct from statistical modelling in that the relevance of a certain feature largely depends on the other features included in the machine learning model. Features which differ with statistical significance between two samples might not be considered top discriminating features by a machine learning model. It is the interaction of features which affects the performance of machine learning classifiers. The two optimisation techniques we used were based on two different hypotheses: hypothesis 1 states that errors in machine translations of public health resources were due to the morphological, structural, and lexical complexity and difficulty of the resources, and hypothesis 2 states that errors in machine translations of public health resources were due to the combined morphological, structural, lexical, and semantic complexity of these resources, particularly the technically challenging issue of automatic translation of polysemous words in specialised health domains. These hypotheses were based on our observation of the original WHO resources in English and their machine translations into Chinese. To test them, we first applied separate optimisation to the marked morphological, structural, or lexical (MSL) features (20 in total) and to the annotated semantic features (115). This resulted in two separate optimised feature sets: the optimised MSL feature set (5), which contained the number of difficult sentences, average number of characters, average number of syllables, passive voice, and sentences beginning with conjunctions; and the optimised semantic feature set (10): A14 (exclusivizers and particularizers), A7 (probability), A8 (seem), H5 (furniture, household fittings), I1 (money generally), L1 (life and living things), N6 (frequency), S3 (relationship), S5 (groups, affiliation), and X9 (ability). Next, we applied joint optimisation to the two combined feature sets to explore the interaction between morphological, structural, lexical, and semantic features, using the same backward elimination method. Joint optimisation produced an optimised feature set (22) which included 2 syntactic and 20 semantic features: number of difficult sentences, average number of characters, A7 (probability), A14 (exclusivizers, particularizers), B1 (anatomy, physiology), B5 (clothes, personal belongings), E2 (liking), F3 (smoking, non-medical drugs), F4 (farming, horticulture), K1 (entertainment), L2 (living creatures), M8 (stationary), O4 (physical attributes), Q4 (media), S1 (social actions, states, processes), S3 (relationship), S7 (power relationship), W5 (green issues), X3 (sensory), X7 (wanting; planning; choosing), X8 (trying), and X9 (ability).
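The general idea of backward elimination driven by a classification-error criterion can be sketched as follows. This is only an illustrative greedy wrapper, not the SVM-based RFE procedure used in the study: the function name `backward_eliminate`, the toy error function, and the set of "informative" features are all assumptions made for the example. In practice, the error function would be the cross-validated classification error of the chosen base estimator.

```python
def backward_eliminate(features, error_fn):
    """Greedy backward elimination.

    Repeatedly drops the feature whose removal gives the lowest error and
    remembers the smallest-error subset seen along the way.
    """
    current = list(features)
    best_set, best_err = list(current), error_fn(current)
    while len(current) > 1:
        # Score every subset obtained by removing exactly one feature.
        candidates = [
            (error_fn([f for f in current if f != drop]), drop)
            for drop in current
        ]
        err, drop = min(candidates)
        current = [f for f in current if f != drop]
        # Prefer smaller subsets when the error is no worse.
        if err <= best_err:
            best_set, best_err = list(current), err
    return best_set, best_err


# Toy stand-in for a classification error (hypothetical, for illustration):
# missing an informative feature costs 1, each uninformative one costs 0.1.
INFORMATIVE = {"difficult_sentences", "B1", "X9"}


def toy_error(subset):
    missing = len(INFORMATIVE - set(subset))
    noise = len(set(subset) - INFORMATIVE)
    return missing + 0.1 * noise


all_features = ["difficult_sentences", "avg_characters", "B1", "K1", "X9"]
selected, error = backward_eliminate(all_features, toy_error)
print(selected, error)
```

Because the error is re-evaluated on each candidate subset, a feature's usefulness depends on which other features remain, which mirrors the point above about feature interaction.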

Refinement of Automatically Optimised Features
In order to develop interpretable machine learning classifiers to predict the likelihood of machine translation errors, we reviewed the jointly optimised features (CFJO 22) and further reduced the number of features in this automatically selected feature set on the basis of the interpretability of the semantic features. Only semantic features which were linguistically relevant in health and medical materials were retained: B1 (anatomy, physiology), L2 (living creatures), S1 (social actions, states, processes), and X9 (ability). In order to compensate for the loss of model accuracy after manual feature elimination and refinement of the jointly optimised feature set (refined CFJO_6), we added two semantic features based on our observation of the patterns of errors in automatic translation outputs: A13 (degree) and T2 (time). We also compared the relative performance of binary classifiers using the five optimised MSL features to further improve the performance of the refined jointly optimised feature set. Table 2 shows the area under the receiver operating characteristic (ROC) curve (AUC) of the binary classifiers using the independently optimised morphological, lexical, and syntactic features: number of difficult sentences, average number of characters, average number of syllables, passive voice, and sentences that begin with conjunctions. It shows that the number of difficult sentences (22 words or more) (AUC = 0.534, 95% CI: 0.45, 0.62) and passive voice (AUC = 0.532, 95% CI: 0.45, 0.62) had the two highest AUCs. Figure 1 and Table 3 show the AUCs of binary logistic regressions with different optimised feature sets as independent variables, used to predict the probability that an original WHO English health material is a machine-translation-error-prone text. Among the five optimised feature sets, the feature set obtained through joint optimisation achieved the highest AUC (0.778), followed by the enhanced joint optimisation, which included five features from the automatic joint optimisation and three features added manually following a close examination of the linguistic relevance of the automatically selected semantic features (semantic optimisation 10) and a comparison of the relative performance of the automatically selected morphological, lexical, and syntactic features (MSL optimisation 5) (Table 3).
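For readers less familiar with the AUC, it can be read as the probability that a randomly chosen error-prone text receives a higher predicted score than a randomly chosen non-error-prone text. The minimal sketch below computes it directly from that pairwise definition; the labels and scores are hypothetical and unrelated to Tables 2 and 3.

```python
def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("Both classes must be present.")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical predictions: 1 = error-prone text, 0 = non-error-prone text.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]
auc = roc_auc(labels, scores)
print(round(auc, 3))
```

An AUC close to 0.5, as reported for the single MSL features, therefore means the feature ranks the two classes barely better than chance.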

Multinomial Naïve Bayes (MNB) Classifiers
The multinomial Naïve Bayes (MNB) classifier is a statistical classification algorithm based on Bayes' theorem and has been shown to be effective for categorical text data analysis [19]. MNB is effective, robust, and highly scalable; it works well on small data and can yield simple and powerful models for disease prediction [20,21]. Moreover, the MNB classifier performs probabilistic learning: it computes the posterior probability of a given sample, which can be used for further decision making [22,23]. In this research, the MSL and semantic features are represented as the number of feature occurrences in the original text, i.e., as bag-of-words-style count features, which makes MNB more suitable than other Naïve Bayes classifiers such as Bernoulli Naive Bayes (which assumes that the features are binary) and Gaussian Naive Bayes (which assumes that the features are continuous and follow a normal distribution). Furthermore, the objective of our research is to learn how likely an English source text is to be prone to machine translation errors. Thus, the MNB classifier is more suitable than other machine learning models in our study for ensuring the interpretability and reliability of the learned classifiers. We performed 5-fold cross-validation (on the training data) and hold-out validation (training on the training data and evaluating on the testing data) in order to assess the performance of MNB under different settings.
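To make the probabilistic output concrete, the following is a minimal from-scratch sketch of a multinomial Naïve Bayes classifier over count features with Laplace smoothing. It is not the implementation used in the study; the class name `SimpleMultinomialNB`, the three feature columns, and all counts are hypothetical and chosen only for illustration.

```python
import math


class SimpleMultinomialNB:
    """Multinomial Naive Bayes over non-negative count features."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing constant

    def fit(self, X, y):
        self.classes_ = sorted(set(y))
        n_features = len(X[0])
        self.log_prior_ = {}
        self.log_likelihood_ = {}
        for c in self.classes_:
            rows = [x for x, label in zip(X, y) if label == c]
            self.log_prior_[c] = math.log(len(rows) / len(X))
            # Smoothed total count of each feature within class c.
            totals = [
                sum(row[j] for row in rows) + self.alpha
                for j in range(n_features)
            ]
            denom = sum(totals)
            self.log_likelihood_[c] = [math.log(t / denom) for t in totals]
        return self

    def predict_proba(self, x):
        """Posterior probability of each class for one count vector."""
        log_joint = {
            c: self.log_prior_[c]
            + sum(n * w for n, w in zip(x, self.log_likelihood_[c]))
            for c in self.classes_
        }
        # Subtract the maximum before exponentiating for numerical stability.
        top = max(log_joint.values())
        unnorm = {c: math.exp(v - top) for c, v in log_joint.items()}
        z = sum(unnorm.values())
        return {c: v / z for c, v in unnorm.items()}


# Hypothetical counts per text: [difficult sentences, passive voice, B1].
X_train = [[8, 5, 1], [7, 6, 2], [1, 1, 6], [2, 0, 7]]
y_train = ["EPET", "EPET", "NEPET", "NEPET"]
model = SimpleMultinomialNB().fit(X_train, y_train)
proba = model.predict_proba([6, 4, 1])
print(proba)
```

The returned dictionary gives one posterior probability per class, which is the kind of output that can later be thresholded to trade sensitivity against specificity.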

Classifier Scalability
In order to evaluate the robustness and scalability of the MNB classifiers, we evaluated the performance of MNB on different sizes of training data (20, 40, 60, 80, 100, and 120 samples) using different feature sets. Furthermore, the Mann-Whitney U test was applied to evaluate the statistical significance of the differences in the AUCs in different settings (using different features and different training data set sizes). The performance was validated on the testing data (56 samples) in terms of AUCs. The comparison results are shown in Figure 3 and Table 6. Regardless of the changes in the training data size, MNBs with the MSL full (AUC Mean: 0.5442) and enhanced CFJO (AUC Mean: 0.7549) features were more robust and stable than those with other feature sets, with the lowest standard deviations of 0.0146 and 0.0203, respectively. The performance of MNBs with the Semantic full (AUC Mean: 0.6594) and CFJO (AUC Mean: 0.7017) features fluctuated more when the size of the training data changed, with the largest standard deviations of 0.0359 and 0.0357, respectively. MNB with the enhanced CFJO features achieved the highest mean AUC (0.7549) of the classifiers compared.
Figure 3. AUC of MNB on testing data with different feature sets and different training dataset sizes.
To validate the scalability of the MNBs, we performed the Mann-Whitney U test to assess the statistical significance of the differences between feature sets as the training dataset size changed. The enhanced CFJO was statistically better than the other four feature sets, with all p-values of the paired Mann-Whitney U tests less than 0.05. CFSO was statistically better than both the MSL and semantic full features, with p-values of 0.005 (95% CI of AUC mean difference: 0.1316 to 0.1809) and 0.045 (95% CI of AUC mean difference: −0.0508 to 0.1329), respectively. CFJO was also statistically better than the MSL full features, with a p-value of 0.0049 (95% CI of AUC mean difference: 0.0377 to 0.1928). The statistical results demonstrated that the MNB with the enhanced CFJO features had better stability, scalability, and robustness; it was less likely to be affected by the training data size and worked well on small data.
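The comparison of AUCs across training sizes can be illustrated with a small sketch of the Mann-Whitney U statistic. The version below uses average ranks for ties and a plain normal approximation for the two-sided p-value, with no continuity or tie correction; statistical packages typically use exact or corrected methods for samples this small, so their p-values will differ. The two AUC lists are hypothetical and are not the values in Table 6.

```python
import math


def mann_whitney_u(sample_a, sample_b):
    """Return (U, two-sided p) using a normal approximation."""
    pooled = sorted(
        [(v, 0) for v in sample_a] + [(v, 1) for v in sample_b]
    )
    # Assign average ranks to tied values.
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    n_a, n_b = len(sample_a), len(sample_b)
    rank_sum_a = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)
    u_a = rank_sum_a - n_a * (n_a + 1) / 2.0
    u_b = n_a * n_b - u_a
    u = min(u_a, u_b)
    mean_u = n_a * n_b / 2.0
    sd_u = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12.0)
    z = (u - mean_u) / sd_u
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return u, p_value


# Hypothetical AUCs over six training sizes for two feature sets.
auc_set_a = [0.74, 0.75, 0.76, 0.77, 0.78, 0.73]
auc_set_b = [0.52, 0.54, 0.55, 0.56, 0.53, 0.57]
u_stat, p_val = mann_whitney_u(auc_set_a, auc_set_b)
print(u_stat, round(p_val, 4))
```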

Discussions
Probabilistic Outputs Table 7 shows outputs of the readability tool based binary classifiers and MNB classifiers as probabilities of belonging to either machine translation error-prone and non-errorprone English public health materials. Readability tools-based classifiers and outputs of MNB classifiers used two full feature sets: MLS (20) and semantics (115) and their separately optimised feature sets MLS optimised (5) and semantics optimised (10) did not differ significantly between English health materials prone to machine translation errors and those which were not prone to machine translation errors. The outputs of MNB classifiers using separately optimised and then combined feature sets (CFSO) and jointly optimised feature set (CFJO) and the derived refined CFJO models (refined CFJO; enhanced CFJO) differed significantly between two sets of English materials on infectious diseases.  Figure 4 is a histogram that shows the number of non-MT error prone (non-EPET) and MT-error-prone English (EPET) health materials that fell into each 10% probability bin based on MNB output. Sixty-seven percent of non-MT error prone English health materials were assigned a probability of risk-free texts <50% (specificity = 0.67), and 88% of MT-error-prone English health materials were assigned a probability of risky texts ≥50% (sensitivity = 0.88).  In Figure 4, about 12% of MT-error-prone WHO English public health educational and promotion materials were assigned low probabilities of 0-10% and 21-30%. In order to improve the classifier sensitivity and prediction accuracy, one method is to adjust probability thresholds in order to achieve the desired sensitivity and specificity pairing. Table  8 shows that if we decrease the current classifier probability threshold from 0.5 to 0.2, the model sensitivity increases from 0.8846 (95% CI: 0.762, 1.00) to 0.9615 (0.888, 1.0), but specificity decreases from 0.667 (95% CI: 0.498, 0.835) to 0.33 (95% CI: 0.165, 0.502). 
By contrast, if we increase the probability threshold from 0.5 to 0.8, specificity increases from 0.667 (95% CI: 0.498, 0.835) to 0.80 (95% CI: 0.657, 0.943), while sensitivity decreases from 0.8846 to 0.5385 (95% CI: 0.347, 0.730).
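The trade-off between sensitivity and specificity as the threshold moves can be sketched as follows. This is a minimal illustration, assuming a text is flagged as error-prone when its predicted probability is at or above the threshold. The probabilities and labels are made-up toy values, not the study's data, and `sensitivity_specificity` is a hypothetical helper name.

```python
def sensitivity_specificity(probs, labels, threshold):
    """Return (sensitivity, specificity) at a given probability threshold.

    labels: 1 = MT-error-prone text, 0 = non-error-prone text.
    A text is flagged as error-prone when its probability >= threshold.
    """
    tp = sum(1 for p, y in zip(probs, labels) if y == 1 and p >= threshold)
    fn = sum(1 for p, y in zip(probs, labels) if y == 1 and p < threshold)
    tn = sum(1 for p, y in zip(probs, labels) if y == 0 and p < threshold)
    fp = sum(1 for p, y in zip(probs, labels) if y == 0 and p >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


# Toy predicted probabilities of being error-prone (illustration only).
toy_probs = [0.95, 0.85, 0.70, 0.55, 0.25, 0.90, 0.60, 0.45, 0.30, 0.10]
toy_labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

for threshold in (0.2, 0.5, 0.8):
    sens, spec = sensitivity_specificity(toy_probs, toy_labels, threshold)
    print(f"threshold={threshold:.1f} sensitivity={sens:.2f} specificity={spec:.2f}")
```

On these toy values, lowering the threshold raises sensitivity at the cost of specificity, and raising it does the opposite, which is the same direction of change reported in the text.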
The diagnostic utility (positive likelihood ratio, LR+; negative likelihood ratio, LR−) is another measure of the effectiveness of the assessment tool. The positive likelihood ratio is the ratio of sensitivity to the false positive rate (1 − specificity), and the negative likelihood ratio is the ratio of the false negative rate (1 − sensitivity) to specificity. Large positive likelihood ratios and small negative likelihood ratios indicate good assessment tools. Table 8 shows that 0.5 is the best probability threshold for the best performing MNB classifier, which uses the enhanced CFJO (8) feature set: two MLS features (number of difficult sentences and passive voice) and six semantic features, namely B1 (anatomy, physiology), L2 (living creatures), S1 (social actions, states, processes), X9 (ability), A13 (degree), and T2 (time).
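The two likelihood ratios defined above can be computed directly from a sensitivity and specificity pair. The sketch below applies the definitions to the rounded sensitivity and specificity values quoted in the text for the thresholds 0.2, 0.5, and 0.8. Because those inputs are rounded, the results may differ slightly from the figures in Table 8. The function name `likelihood_ratios` is introduced here for illustration.

```python
def likelihood_ratios(sensitivity, specificity):
    """Return (LR+, LR-) from a sensitivity and specificity pair.

    LR+ = sensitivity / (1 - specificity)
    LR- = (1 - sensitivity) / specificity
    """
    lr_pos = sensitivity / (1.0 - specificity)
    lr_neg = (1.0 - sensitivity) / specificity
    return lr_pos, lr_neg


# Rounded (sensitivity, specificity) pairs quoted in the text, keyed by threshold.
reported = {
    0.2: (0.9615, 0.33),
    0.5: (0.8846, 0.667),
    0.8: (0.5385, 0.80),
}

for threshold, (sens, spec) in reported.items():
    lr_pos, lr_neg = likelihood_ratios(sens, spec)
    print(f"threshold={threshold:.1f} LR+={lr_pos:.3f} LR-={lr_neg:.3f}")
```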

Conclusions
The purpose of our study was twofold. First, we aimed to develop machine learning classifiers as decision aids to help vulnerable non-English-speaking people appreciate the risks of using MT tools to seek and acquire health information online. Second, we aimed to develop machine learning classifiers as a risk-prevention mechanism to help medical professionals with little or no knowledge of the patient's language predict the likelihood of clinically significant mistakes or incomprehensible MT outputs, based on features of the English source information given as input to the MT systems. An important contribution of our study was the development of interpretable machine learning models to help assess and predict the risks of using machine translation tools to translate public health educational and promotion resources developed by international health organizations such as the WHO. The application of the machine learning classifier in clinical or health research settings requires an understanding of its functionality.
The linguistic cues that the MNB classifier uses to predict whether a given English health text is likely to be translated erroneously by Google Translate are the syntactic structure and the semantic meanings of the words neighboring the actual translation mistake. We call these neighboring words indicative contextual features (ICFs). The best performing classifier we developed succeeded in classifying MT-error-prone and non-MT-error-prone English texts (185) using eight ICFs, with a sensitivity of 88%, a specificity of 67%, and an AUC of 0.76. It should be noted that the occurrence of translation errors does not require the presence of all eight ICFs in an English text. The illustrative examples show that the distribution of these ICFs is diverse: an error-prone text may contain an MLS feature (number of difficult sentences) together with one or a few semantic ICFs, a combination of semantic ICFs, or an MLS feature alone. The empirical evidence of our study challenges the traditional view that mistakes or suboptimal translation outputs are largely due to the presence of complex medical jargon as a hallmark of medical discourse.
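To make the role of the eight ICFs more concrete, the following is a minimal sketch of a multinomial Naive Bayes classifier written from scratch with Laplace smoothing. It is an illustration of the general technique, not the authors' implementation. The feature names follow the eight features listed in the text, while the training counts, the smoothing value `alpha`, and the function names `train_mnb` and `predict_proba` are assumptions made for this example.

```python
import math

# The eight ICFs named in the text; all counts below are made-up toy values.
FEATURES = ["difficult_sentences", "passive_voice", "B1", "L2", "S1", "X9", "A13", "T2"]


def train_mnb(rows, labels, alpha=1.0):
    """Fit class priors and smoothed per-feature log-likelihoods."""
    model = {}
    for c in sorted(set(labels)):
        class_rows = [r for r, y in zip(rows, labels) if y == c]
        totals = [sum(r[j] for r in class_rows) for j in range(len(FEATURES))]
        denom = sum(totals) + alpha * len(FEATURES)
        model[c] = {
            "log_prior": math.log(len(class_rows) / len(rows)),
            "log_lik": [math.log((t + alpha) / denom) for t in totals],
        }
    return model


def predict_proba(model, row):
    """Return the posterior probability of each class for one feature-count row."""
    scores = {
        c: p["log_prior"] + sum(x * w for x, w in zip(row, p["log_lik"]))
        for c, p in model.items()
    }
    top = max(scores.values())  # subtract the maximum for numerical stability
    expd = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(expd.values())
    return {c: v / z for c, v in expd.items()}


# Toy training data: 1 = MT-error-prone, 0 = non-error-prone (illustration only).
toy_rows = [
    [3, 2, 4, 0, 3, 1, 2, 1],
    [2, 3, 5, 1, 2, 2, 1, 0],
    [0, 1, 1, 3, 0, 0, 1, 2],
    [1, 0, 0, 4, 1, 0, 0, 3],
]
toy_labels = [1, 1, 0, 0]
toy_model = train_mnb(toy_rows, toy_labels)

# A text with several ICFs, and a text with only an MLS feature present.
mixed_text = [2, 2, 3, 0, 2, 1, 1, 0]
mls_only_text = [3, 0, 0, 0, 0, 0, 0, 0]
print(predict_proba(toy_model, mixed_text))
print(predict_proba(toy_model, mls_only_text))
```

In this toy setting, both example texts receive a probability above 0.5 of being error-prone, even though the second contains only one of the eight features, which mirrors the observation that not all ICFs need to be present.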
Our study revealed that, with rapid developments in computer science, the accuracy of machine translation tools such as neural machine translation is improving significantly. The translation of medical terminology and jargon no longer represents the top challenge for automatic translation technologies. Rather, it is linguistic phenomena such as polysemy or the context-dependent nature of common words in specialised health and medical domains that cause subtle yet clinically significant errors and confusion in machine translation outputs. The development of machine learning classifiers could help add protection to MT-assisted health education for vulnerable communities and people by giving health and medical professionals a reliable, real-time estimate of the risks of using MT tools to engage with people with limited English and health literacy.
Promoting the safe and risk-aware use of MT technologies to enhance access to public health resources and services represents an understudied field of health communication. Despite widely reported issues and risks of MT tools, there is a persistent lack of research tools to help reduce the risks of MT-mediated health communication among multilingual and multicultural populations. We developed machine learning classifiers to provide intuitive probabilistic predictions of the risks of using Google Translate online to translate certain English health resources into the Chinese language,