Forecasting Erroneous Neural Machine Translation of Disease Symptoms: Development of Bayesian Probabilistic Classifiers for Cross-Lingual Health Translation

Background: Machine translation (MT) technologies have increasing applications in healthcare. Despite their convenience, cost-effectiveness, and constantly improving accuracy, research shows that the use of MT tools in medical or healthcare settings poses risks to vulnerable populations. Objectives: We aimed to develop machine learning classifiers (multinomial naïve Bayes, MNB, and relevance vector machine, RVM) to forecast nuanced yet significant MT errors of clinical symptoms in Chinese neural MT outputs. Methods: We screened human translations of the MSD Manuals for information on self-diagnosis of infectious diseases and produced their matching neural MT outputs for subsequent pairwise quality assessment by trained bilingual health researchers. Different feature optimisation and normalisation techniques were used to identify the best feature set. Results: The RVM classifier using optimised, L2-normalised semantic features achieved the highest sensitivity, specificity, AUC, and accuracy. MNB achieved similarly high performance using the same optimised semantic feature set. The best probability threshold of the best performing RVM classifier was 0.6, with a very high positive likelihood ratio (LR+) of 27.82 (95% CI: 3.99, 193.76) and a low negative likelihood ratio (LR−) of 0.19 (95% CI: 0.08, 0.46). These results suggest the high diagnostic utility of our model for predicting the probability of erroneous MT of disease symptoms, helping to avert inaccurate self-diagnosis among vulnerable people without adequate medical knowledge or the ability to ascertain the reliability of MT outputs. Conclusion: Our study demonstrated the viability, flexibility, and efficiency of introducing machine learning models to help promote risk-aware use of MT technologies to achieve optimal, safer digital health outcomes for vulnerable people.


Introduction
Digital technologies have increasing applications in healthcare and clinical settings [1][2][3][4][5][6][7][8]. Machine translation (MT) tools offer rapid, cost-effective solutions to persistent barriers in health communication caused by language issues, compounded by other socioeconomic factors such as educational levels, health literacy, and cultural backgrounds. The availability, convenience, and privacy afforded by online MT tools have enabled better access to health and medical information among vulnerable people and communities. However, the risks and harms of the increasing uptake of these MT tools, which are often designed for general purposes [9][10][11][12], in clinical or self-diagnosis settings are well documented [13][14][15]. For people with bilingual skills or higher educational or health literacy levels, the effect of these MT tools on health decision making is largely limited, as such people can draw on relevant health knowledge and skills, or direct contact with medical professionals, to critically assess the reliability and validity of MT outputs. For vulnerable people who lack the necessary bilingual skills and medical knowledge, the increasing use of online MT tools can have clinically significant consequences.
Research has shown that various factors can contribute to erroneous outputs of MT tools when they are applied in specialised medical or healthcare settings. In contrast to earlier MT paradigms, such as statistical MT and rule-based MT, neural MT tends to handle difficult medical jargon and more complex sentence structures better and to generate more fluent and natural outputs. In our study, we focused on the MT quality issues associated with disease symptoms, which are often conveyed in high-frequency, polysemous words whose meanings require higher levels of context-dependent interpretation. Unlike signs, symptoms are subjective descriptions and assessments of an individual's health condition. Important variability exists in the semantic meanings of symptom terms between their usage in general language and in specialised domains such as health and medicine. Symptoms provide first-hand information from patients to medical professionals in disease diagnosis and confirmation of cases. In health and medical resources developed for educational and promotional purposes, an exact, well-defined use of symptom terms can effectively help people understand their health conditions and how they may progress. Currently, there is a lack of standardised bilingual vocabularies of symptoms, even though symptoms are widely used in international guidelines for disease definition and classification, alongside laboratory tests. For example, the inclusion of symptoms in the detection of dengue fever helped increase the specificity of disease screening tools, whereas laboratory tests contributed to higher screening sensitivity [16][17][18][19]. In settings with limited healthcare resources, accurate symptom description is more affordable than laboratory tests.
The translation of underdefined symptom terms poses significant challenges to neural MT systems such as Google Translate. Our study aimed to develop effective, affordable solutions and countermeasures to the MT issues related to symptoms. We developed Bayesian machine learning classifiers to predict the likelihood of MT errors in the treatment of symptoms. The output of our models was the probability that a given original English medical text on disease diagnosis would lead to erroneous symptom translations by Google Translate. People and MT users with limited medical knowledge can thus make more informed health decisions for themselves and those they care for.

Screening of Original English Source Texts
To promote the informed use of MT tools by vulnerable patients and their caregivers to acquire health information through computer-aided translation, we developed Bayesian machine learning classifiers to help the public understand the likelihood of inaccurate self-diagnosis based on the outputs of online MT applications. The Merck Manual of Diagnosis and Therapy (MSD Manuals) is widely used in health education and family healthcare around the world [20,21]. Its Chinese consumer edition is commissioned to leading national medical professionals of the Chinese Preventive Medicine Association. High-quality human translations of the MSD Manuals were used as references to evaluate the quality and reliability of neural machine translation outputs. We screened professional human translations of the MSD Manuals for information on self-diagnosis of infectious diseases and produced their matching neural machine translation outputs for subsequent pairwise quality assessment by trained bilingual health researchers. Pairwise comparison between human and machine translations helped us to identify and verify clinically significant MT errors (kappa coefficient 0.842, 95% CI: 0.762, 0.922) of symptoms that could cause inaccurate self-diagnosis of highly transmissible diseases by consumers of the MSD Manuals.
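For illustration, inter-rater agreement of this kind can be computed with scikit-learn's cohen_kappa_score; the binary labels below are hypothetical placeholders rather than our annotation data.

```python
# A minimal sketch of an inter-rater agreement check; the labels are
# hypothetical (1 = clinically significant symptom MT error, 0 = none).
from sklearn.metrics import cohen_kappa_score

rater_a = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
rater_b = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1]

print(f"Cohen's kappa: {cohen_kappa_score(rater_a, rater_b):.3f}")
```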

Multi-Dimensional Features
Observation of the clinically significant errors in the machine translation outputs suggested that language difficulty, morphologically or syntactically complex expressions, and the semantic meanings of the original English expressions were the main factors contributing to the occurrence of machine translation errors. Thus, the original MSD Manuals were represented by global, high-level, multi-dimensional features instead of traditional local lexical features (the frequency/occurrence of words, e.g., bag-of-words). The multi-dimensional features comprised both structural and semantic features, which were extracted by two publicly available English corpus annotation systems.

Structural Features
The Readability Studio (Oleander Software) was applied to extract a total of 20 morphological and structural features of the original English texts, including descriptive statistics [22][23][24][25][26]. The structural features fell into four global groups of different dimensions: complex sentences (six features), lexical complexity (three features), morphological and orthographic complexity (eight features), and content density (three features). The complex sentence features were average number of sentences per paragraph, number of difficult sentences (more than 22 words), longest sentence, average sentence length, passive voice, and sentences beginning with conjunctions. The lexical complexity features were number of unique words, number of unique long words, and number of unique monosyllabic words. The morphological and orthographic complexity features consisted of number of syllables, average number of characters, average number of syllables, number of monosyllabic words, number of complex (three or more syllables) words, number of unique words with three or more syllables, number of long (six or more characters) words, and misspellings. The content density features were number of proper nouns, overused words, and wordy items.
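Readability Studio is commercial software; as a rough illustration of the kind of structural features it extracts, the sketch below approximates a few of them in plain Python. The naive sentence and word splitting heuristics are assumptions, not the tool's actual algorithms.

```python
# A rough, illustrative approximation of a few structural features; the study
# used Readability Studio, so these naive heuristics are assumptions only.
import re

def structural_features(text: str) -> dict:
    sentences = [s for s in re.split(r"[.!?]+\s*", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    return {
        "avg_sentence_length": len(words) / max(len(sentences), 1),
        "difficult_sentences": sum(                      # sentences > 22 words
            1 for s in sentences if len(re.findall(r"[A-Za-z']+", s)) > 22),
        "long_words": sum(1 for w in words if len(w) >= 6),
        "unique_words": len({w.lower() for w in words}),
    }

print(structural_features("The rash spreads quickly. Seek care if the fever persists."))
```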

Semantic Features
For semantic features, USAS (the University of Lancaster Semantic Annotation System) [22,23] was utilised to explore the potential relations between clinically significant symptom errors in MT and the semantic types of the original English words and expressions. In total, 115 fine-grained semantic features of the original English health texts were extracted and annotated by the USAS semantic system. The extracted 115 features fell into 21 major discourse fields: general and abstract terms (A1-A15, 15 features); the body and the individual (B1-B5, five features); arts and crafts (C1, one feature); emotion (E1-E6, six features); food and farming (F1-F4, four features); government and public (G1-G3, three features); architecture, housing and the home (H1-H5, five features); money and commerce in industry (I1-I4, four features); entertainment, sports and games (K1-K6, six features); life and living things (L1-L3, three features); movement, location, travel and transport (M1-M8, eight features); numbers and measurements (N1-N6, six features); substances, materials, objects and equipment (O1-O4, four features); education (P1, one feature); language and communication (Q1-Q4, four features); social actions, states and processes (S1-S9, nine features); time (T1-T4, four features); world and environment (W1-W5, five features); psychological actions, states and processes (X1-X9, nine features); science and technology (Y1-Y2, two features); and names and grammar (Z0-Z9, Z99, 11 features). These hierarchically arranged semantic types gave us a global view of the distribution of semantic meanings in the original English texts on disease diagnosis, which was useful for investigating the importance of word choice and vocabulary diversity for neural machine translation tools like Google Translate to provide a reliable and accurate translation.
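As a sketch of how such annotations become model inputs, the snippet below counts USAS tags into a fixed-order feature vector. The tagged tokens and the tag subset shown here are hypothetical; in our study the annotations came from the USAS system itself.

```python
# A minimal sketch of turning USAS tag output into a feature vector; the
# tagged tokens and the tag subset (of the 115 tags) are hypothetical.
from collections import Counter

SEMANTIC_TAGS = ["A2", "A3", "A4", "B1", "B2", "B3", "N5", "O1", "Z99"]

def semantic_vector(tagged_tokens):
    counts = Counter(tag for _, tag in tagged_tokens)
    return [counts.get(tag, 0) for tag in SEMANTIC_TAGS]

tagged = [("fever", "B2"), ("spreads", "A2"), ("rash", "B2"), ("the", "Z5")]
print(semantic_vector(tagged))  # -> [1, 0, 0, 0, 2, 0, 0, 0, 0]
```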

Bayesian Machine Learning Classifiers
Bayesian framework-based methods provide probabilistic predictions for given samples and are widely used to assist decision making in medical research [27][28][29]. Probabilistic learning allows researchers to develop a more intuitive interpretation of uncertainty and makes utility assessment interpretable and useful to patients and medical professionals in disease diagnosis. In our study, two Bayesian machine learning classifiers, the relevance vector machine (RVM) and multinomial naïve Bayes (MNB), were developed to predict MT errors of clinical symptoms in Chinese neural MT outputs. RVM has the same functional form as the support vector machine (SVM). RVM is known as a sparse classifier and is less susceptible to overfitting arising from model complexity. RVM suits the development of machine learning classifiers on small data sets like ours because of its enhanced generalisation ability [30,31]. MNB is an effective, easy-to-train statistical classifier based on Bayes' theorem; it works well on categorical text data, is highly scalable, and is less likely to overfit the data [32,33].
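A minimal sketch of the two classifiers is shown below on hypothetical data; EMRVC is taken from the third-party sklearn-rvm package, which is one possible RVM implementation (an assumed choice, not necessarily the implementation used in our study).

```python
# A minimal sketch of the two Bayesian classifiers on hypothetical data;
# EMRVC comes from the third-party sklearn-rvm package (an assumption).
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn_rvm import EMRVC  # pip install sklearn-rvm

rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(185, 135)).astype(float)  # count-style features
y = rng.integers(0, 2, size=185)                       # 1 = symptom-error-prone

mnb = MultinomialNB().fit(X, y)
rvm = EMRVC(kernel="rbf").fit(X, y)

# Both classifiers output class-membership probabilities for new texts.
print(mnb.predict_proba(X[:3])[:, 1])
print(rvm.predict_proba(X[:3])[:, 1])
```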
The collected MSD Manuals (185 samples in total) were manually annotated as symptom-error-prone (75 samples) or non-symptom-error-prone (110 samples) English health materials. To evaluate the performance of the developed RVM and MNB, the annotated data were randomly split into training data (70%) and testing data (30%). The training data (129 samples) contained 53 symptom-error-prone and 76 non-symptom-error-prone English health materials. The testing data (56 samples) contained 22 symptom-error-prone and 34 non-symptom-error-prone English health materials. We applied both five-fold cross-validation and holdout validation to evaluate the performance of the classifiers using five evaluation metrics (accuracy, macro F-score, sensitivity, specificity, and area under the curve, AUC). For five-fold cross-validation, the training data (129 samples) were further randomly split into five subsets. For each fold, the classifier was trained on four subsets and validated on the remaining one; this process was repeated five times so that each subset served as the validation data once. For holdout validation, the classifiers were trained on the training data (129 samples) and validated on the holdout testing data (56 samples).
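The evaluation protocol can be sketched as follows; X and y are hypothetical stand-ins for the 135-dimensional feature matrix and the error-prone labels, while the 70/30 split and five-fold cross-validation mirror the procedure described above.

```python
# A minimal sketch of the evaluation protocol; X and y are hypothetical
# stand-ins for the study's feature matrix and annotations.
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(185, 135)).astype(float)
y = rng.integers(0, 2, size=185)

# 70/30 split, as in the study (129 training / 56 testing samples).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

clf = MultinomialNB()
print("5-fold CV AUC:", cross_val_score(clf, X_train, y_train, cv=5,
                                        scoring="roc_auc").mean())
clf.fit(X_train, y_train)
print("Holdout accuracy:", clf.score(X_test, y_test))
```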

Feature Optimisation
The original English texts were represented by a total of 135 multi-dimensional features (20 structural and 115 semantic features), so the feature dimension (135) was larger than the number of training samples (129). Aiming to discover a concise yet effective feature set, and thereby a simple model with good generalisation ability and a lower risk of overfitting, we applied recursive feature elimination (RFE) with a support vector machine (SVM) as the base estimator to perform backward feature reduction and remove unimportant features [34]. To obtain a feature set that could produce stable performance, we performed five-fold cross-validation on the training data for recursive feature elimination. The features with the highest five-fold cross-validated performance were selected by RFE as the optimised features.
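A sketch of this step using scikit-learn's RFECV, which combines recursive feature elimination with cross-validated selection of the feature count, is shown below on hypothetical data.

```python
# A sketch of RFE with an SVM base estimator and five-fold cross-validation,
# using hypothetical data in place of the study's 135 features.
import numpy as np
from sklearn.feature_selection import RFECV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.random((129, 135))
y_train = rng.integers(0, 2, size=129)

selector = RFECV(
    estimator=SVC(kernel="linear"),  # linear kernel exposes coef_ for ranking
    step=1,                          # drop one feature per elimination round
    cv=5,
    scoring="roc_auc",
)
selector.fit(X_train, y_train)
print("Optimised feature count:", selector.n_features_)
X_train_opt = selector.transform(X_train)  # reduced feature matrix
```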
To explore the relevance of different aspects of original-language complexity (morphological and structural complexity only; semantic complexity only; and the interaction between the two) to symptom-error-proneness in machine translations of public health resources, two optimisation strategies were applied to extract the most informative features from the original features. First, RFE was applied to the 20 structural features and the 115 semantic features separately to obtain the best Structural-Optimised Features (TOF) and Semantic-Optimised Features (SOF). Then, we applied RFE to perform joint optimisation on the full 135 multi-dimensional features (Jointly Optimised Features, JOF) to explore the potential interactions and relations between morphological-structural features and semantic features.
Furthermore, to prevent features with a larger range from dominating the RVM optimisation process, we performed data normalisation to scale the features and improve the model's generalisation ability [35,36]. MNB, which uses discrete features (counts of feature occurrences), did not require data normalisation. Two normalisation methods were applied in our study: Min-Max normalisation (denoted as Min-Max; the data were scaled to a certain range, e.g., [0, 1]) and L2-norm normalisation (denoted as L2; each data sample was scaled individually to unit norm, i.e., the sum of the squares of its feature values equals 1).
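The two normalisation schemes can be sketched with scikit-learn as follows; X_opt is a hypothetical optimised feature matrix standing in for the study's data.

```python
# A sketch of the two normalisation schemes applied before RVM training;
# X_opt is a hypothetical optimised feature matrix.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, Normalizer

X_opt = np.random.default_rng(0).random((129, 14))

X_minmax = MinMaxScaler(feature_range=(0, 1)).fit_transform(X_opt)  # per-feature scaling to [0, 1]
X_l2 = Normalizer(norm="l2").fit_transform(X_opt)                   # per-sample scaling to unit norm

print(np.square(X_l2).sum(axis=1)[:3])  # each row sums to ~1.0
```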

Results
We compared the performance of different methods with different feature sets (structural-optimised features, TOF; semantic-optimised features, SOF; and jointly optimised features, JOF) and data normalisation techniques (Min-Max and L2) with respect to AUC, accuracy, F-score, sensitivity, and specificity. The results of five-fold cross-validation (CV) on the training data and holdout validation on the testing data for the different models are shown in Table 1 and Figure 1. For the RVM classifier, optimised features always outperformed non-optimised features (the original full feature sets) on the testing data: using the structural-optimised features, the AUC and specificity of RVM increased from 0.682 and 0.71 (using the full structural features) to 0.759 and 0.91, respectively; using the semantic-optimised features, the AUC and specificity of RVM increased from 0.894 and 0.91 (using the full semantic features) to 0.912 and 0.94, respectively; applying the jointly optimised features, the AUC and sensitivity of RVM increased from 0.77 and 0.868 (using the full structural and semantic features) to 0.82 and 0.878, respectively. With data normalisation, the performance of RVM with semantic-optimised features and with jointly optimised features was further improved. The best performing RVM used L2 normalised SOF, with an AUC of 0.937, a sensitivity of 0.86, and a specificity of 0.94. For MNB, which does not require data normalisation, the best performing model used JOF, with an AUC of 0.933, a sensitivity of 0.82, and a specificity of 0.97. The performance of MNB with optimised features was less consistently improved on the training data (five-fold CV).
These results demonstrated that developing a simple yet highly cost-effective model with fewer features indicative of English health materials prone to symptom errors in neural machine translations was both practicable and applicable. Compared with MNB, RVM with L2 normalised SOF had higher AUC, sensitivity, and specificity, and was therefore selected as the best performing model for further diagnostic utility assessment and decision making in our study.
To evaluate the suitability of the Bayesian machine learning classifiers for assessing whether an original English material would prompt machine translation errors, we compared the performance of RVM and MNB with traditional readability formulas: the Flesch Reading Ease Score (based on average sentence length and average number of syllables per word), the Gunning Fog Index (based on average sentence length and the percentage of hard words), and the SMOG Index (based on polysyllabic words of three or more syllables). In applying the readability formulas as binary classifiers, the underlying hypothesis was that there was a positive correlation between the difficulty of the English texts and the number of errors in their MT outputs: the more difficult the original English health materials were, the more likely the MT systems would produce a machine translation error as defined in our study. Thus, materials with a Flesch Reading Ease Score lower than 60, a Gunning Fog Index greater than 12, or a SMOG Index greater than 12 were regarded as difficult to read and symptom-error-prone. As shown in Table 1, the performance of the readability-formula-based binary classifiers was worse than a random guess (AUC = 0.5), with AUCs of 0.318 (Flesch Reading Ease Score), 0.277 (Gunning Fog Index), and 0.283 (SMOG Index). This finding suggests that symptom-error-proneness was not related to the readability and complexity of the original English health materials; easy-to-read materials could also prompt MT systems to produce clinically significant symptom errors. Thus, currently available standard readability formulas are neither suitable nor reliable for assessing whether the machine translation of English source materials will contain symptom errors. The best performing RVM (AUC: 0.937; sensitivity: 0.86; specificity: 0.94) and MNB (AUC: 0.933; sensitivity: 0.82; specificity: 0.97) demonstrated that machine learning methods were more suitable, effective, and robust for identifying symptom-error-prone English health materials on infectious diseases.
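A sketch of these readability baselines is shown below, using the textstat package as one possible implementation of the formulas (an assumption, not necessarily the tool used in our study) together with the thresholds described above.

```python
# A minimal sketch of the readability-formula baselines; textstat is one
# possible implementation of these formulas (an assumed choice).
import textstat

def readability_flags(text: str) -> dict:
    """True flags mark a text as difficult, i.e., predicted symptom-error-prone."""
    return {
        "flesch_hard": textstat.flesch_reading_ease(text) < 60,
        "fog_hard": textstat.gunning_fog(text) > 12,
        "smog_hard": textstat.smog_index(text) > 12,  # note: SMOG is unreliable on short texts
    }

sample = ("Dengue fever often begins abruptly with high fever, severe headache, "
          "and pain behind the eyes.")
print(readability_flags(sample))
```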

Machine Learning Classifiers Using Different Optimised Feature Sets
Table 2 shows the two-tailed Mann-Whitney U tests of RVM with different feature sets on the testing data across the five evaluation metrics: AUC, accuracy, F-score, sensitivity, and specificity. The results showed that the overall performance (considering all five evaluation metrics) of the best performing RVM with L2 normalised SOF was statistically significantly better than that of RVM using Min-Max normalised TOF (p = 0.0122, CI: 0.0685 to 0.4043), RVM with JOF (p = 0.0367, CI: 0.0381 to 0.0715), RVM with the full structural feature set (p = 0.0122, CI: 0.1128 to 0.6004), and RVM with the full structural and semantic feature set (p = 0.0367, CI: 0.0522 to 0.0367). This result indicates that semantic features were more informative and effective than morphological and structural features for identifying the symptom-error-prone English health education materials. Machine translations with significant symptom errors were mainly associated with the bilingual vocabularies and expressions of symptoms rather than with syntactic complexity (e.g., average number of sentences per paragraph, number of difficult sentences, and number of unique words).

Table 3 shows outputs of the readability formula-based binary classifiers and the RVM and MNB machine learning classifiers as probabilities of belonging to either symptom-error-prone (SEP) or non-symptom-error-prone (NSEP) English health materials. Outputs of RVM using L2 normalised structural-optimised features and of MNB using structural-optimised features (five features) did not differ significantly between English health materials prone to machine translation errors and those that were not. Outputs of the readability formula-based classifiers and of the MNB and RVM classifiers using the other feature sets differed significantly between the two sets of original English health materials on infectious diseases. RVM with L2 normalised SOF and MNB with SOF had the highest mean probabilities (RVM: 0.802; MNB: 0.818) on SEP English health materials and low mean probabilities (RVM: 0.209; MNB: 0.077) on NSEP English health materials, demonstrating the effectiveness of the semantic-optimised features and the ability of the Bayesian machine learning classifiers to distinguish between SEP and NSEP English health materials.

Table 3. Comparison of readability formula and MLC (RVM, MNB) outputs between symptom-error-prone (SEP) and non-symptom-error-prone (NSEP) English texts (machine learning classifier outputs were assigned probabilities). Bold values were significant.

Figure 2 shows histograms of the number of symptom-error-prone (SEP) and non-symptom-error-prone (NSEP) English health materials that fell into each 10% probability bin, based on the outputs of RVM with L2 normalised SOF (left) and MNB with SOF (right).
For RVM, 94% of NSEP English health materials were assigned a probability of being error-prone < 50% (specificity = 0.94), and 86% of SEP English health materials were assigned a probability of being error-prone ≥ 50% (sensitivity = 0.86), with considerable overlap in outputs between the NSEP and SEP texts. For MNB, 94% of NSEP English health materials were assigned a probability of being error-prone < 50% (specificity = 0.94), and 82% of SEP English health materials were assigned a probability of being error-prone ≥ 50% (sensitivity = 0.82). As shown in Figure 2, the MNB outputs overlapped less between the NSEP and SEP texts than the RVM outputs. Thus, RVM was more suitable than MNB for further decision making, since its overlapping output distribution allows an expert to select different thresholds to obtain the desired sensitivity and specificity pairings for diagnostic utility under different criteria. Conversely, with fewer overlapping outputs, changing the threshold of MNB barely changes its sensitivity and specificity.

Diagnostic Utility
In Figure 2 (left), nearly 14% of symptom-error-prone MSD Manuals were assigned low probabilities of 21-30%. To improve the classifier sensitivity, the probability threshold can be adjusted to obtain the desired sensitivity and specificity pairing. Table 4 shows that if the probability threshold of the best performing RVM decreased from 0.50 to 0.23, the model sensitivity increased from 0.86 (95% CI: 0.72 to 1.01) to 0.95 (95% CI: 0.87 to 1.04), but the specificity decreased from 0.94 (95% CI: 0.86 to 1.02) to 0.71 (95% CI: 0.55 to 0.86). By contrast, if the probability threshold increased from 0.5 to 0.9, the sensitivity decreased from 0.86 (95% CI: 0.72 to 1.01) to 0.59 (95% CI: 0.39 to 0.80) and the specificity increased from 0.94 (95% CI: 0.86 to 1.02) to 0.97 (95% CI: 0.91 to 1.03). Diagnostic utility (positive likelihood ratio, LR+; negative likelihood ratio, LR−) was another effective criterion for evaluating the assessment tool. The likelihood ratios quantify how a prediction changes the probability of a given outcome: LR+ is the ratio of sensitivity to the false positive rate, i.e., sensitivity/(1 − specificity), and LR− is the ratio of the false negative rate to specificity, i.e., (1 − sensitivity)/specificity. An assessment tool is regarded as effective and practicable when it has a large LR+ and a small LR−. Table 4 shows that 0.6 was the best probability threshold for the best performing RVM classifier using the 14 L2 normalised semantic-optimised features: A2 (affect: modify, change, and cause/connected), A3 (being), A4 (classification: generally kinds, groups, examples, particular/general and detail), A6 (comparing: similar/different, usual/unusual and variety), A7 (definite), A13 (degree), B1 (anatomy and physiology), B2 (health and disease), B3 (medicines and medical treatment), N5 (quantities: entirety, maximum, exceeding and waste), O1 (substances and materials generally: solid, liquid and gas), O2 (objects generally), Z5 (grammatical bin), and Z99 (unmatched).
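The likelihood ratio calculation can be sketched directly from the sensitivity/specificity pairs reported above; the three threshold rows below reuse the values quoted from Table 4.

```python
# A minimal sketch of the diagnostic-utility calculation, using the
# sensitivity/specificity pairs reported for three probability thresholds.
def likelihood_ratios(sensitivity: float, specificity: float) -> tuple:
    lr_pos = sensitivity / (1 - specificity)   # LR+ = TPR / FPR
    lr_neg = (1 - sensitivity) / specificity   # LR- = FNR / TNR
    return lr_pos, lr_neg

for threshold, sens, spec in [(0.23, 0.95, 0.71), (0.50, 0.86, 0.94), (0.90, 0.59, 0.97)]:
    lr_pos, lr_neg = likelihood_ratios(sens, spec)
    print(f"threshold={threshold:.2f}  LR+={lr_pos:.2f}  LR-={lr_neg:.2f}")
```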

Figure 2. Percentage of symptom-error-prone (SEP) and non-symptom-error-prone (NSEP) English texts assigned by RVM with L2 normalised SOF (left) and MNB with SOF (right) to each 10% probability bin.


Conclusions
MT technologies offer convenient, cost-effective solutions to existing barriers that vulnerable people face in accessing healthcare services in multicultural countries. Although the risks and harms of the increasing uptake of MT tools in clinical settings are well documented, limited protective mechanisms or countermeasures have been developed to help alleviate their impact on the communities and people who rely on these low-cost technologies to access medical services. Our study demonstrated the viability, flexibility, and efficiency of introducing machine learning models to help promote risk-aware use of MT technologies to achieve optimal, safer digital health outcomes for vulnerable people. We found that erroneous neural MT outputs of infectious disease symptoms were associated with the current lack of standardised bilingual vocabularies of symptoms. The interpretation of subjective symptom terms can vary substantially between general and specialised use, as well as across individuals: types and severity of pain, ranges and alarming levels of body temperature, cognitive abilities, consciousness, physical mobility, types of experienced vision problems or disturbances, and malfunction of body parts. These were the symptom terms that were often mistranslated by neural MT tools and that could cause misleading self-diagnosis. High-frequency, polysemous symptom words in Chinese require context-dependent approaches to medical translation, in which human translators clearly outperformed neural MT tools. Our research solution to this issue with current neural MT tools in health and medical settings was the development of high-sensitivity machine learning classifiers that can effectively predict the likelihood of erroneous MT outputs in the translation of subjective symptom terms. We believe that the combined use of machine translation and machine learning tools will add much-needed safety to online digital health aids and help empower vulnerable communities and people.