The Application of Projection Word Embeddings to a Medical Record Scoring System

Medical record scoring is important in a health care system. Artificial intelligence (AI) with projection word embeddings has been validated for its performance in disease coding tasks, maintaining both the vocabulary diversity of open internet databases and the medical terminology understanding of electronic health records (EHRs). We considered that an AI-enhanced system might also be applied to score medical records automatically. This study aimed to develop a series of deep learning models (DLMs) and validate their performance on the medical record scoring task; we also analyzed the practical value of the best model. We used admission medical records from the Tri-Service General Hospital from January 2016 to May 2020, which were scored by our visiting staff of different levels from different departments. The medical records were scored on a scale from 0 to 10. All samples were divided into a training set (n = 74,959) and a testing set (n = 152,730) based on time, which were used to train and validate the DLMs, respectively. The mean absolute error (MAE) was used to evaluate each DLM's performance. For the original AI medical record scoring, the scores predicted by the BERT architecture were closer to the actual reviewer scores than those of the projection word embedding and LSTM architecture: the original MAE was 0.84 ± 0.27 using the BERT model versus 1.00 ± 0.32 using the LSTM model. A linear mixed model could be used to improve model performance, with the adjusted predicted scores being closer to the actual scores than the original predictions. However, the projection word embedding with the LSTM model (0.66 ± 0.39) provided better performance than BERT (0.70 ± 0.33) after linear mixed model enhancement (p < 0.001). In addition to comparing different architectures for scoring medical records, this study further used a linear mixed model to successfully adjust the AI medical record scores to bring them closer to the actual physicians' scores.


Introduction
With the continuing advancement of technology, the amount of data generated by humans is growing explosively [1]. Effectively exploiting these growing data may yield valuable information, as many successful cases across industries have already demonstrated [2]. However, the majority of these data are unstructured [3] and cannot be used directly by traditional analytical methods. At the same time, new algorithms are expected to exploit these data to enable stronger decision-making capacity [4,5]. In recent years, with breakthrough developments in deep neural networks across diverse fields, we are already capable of directly analyzing data in the form of videos, texts, and voices. Hence, the focus of research is now on developing applications that solve practical problems.
The medical system is an important field that is well suited to such applications. Medical knowledge is accumulating quickly, making it increasingly likely for doctors to have knowledge gaps [6], which may cause misdiagnoses and thus urgently need to be addressed [7]. Computer-aided diagnosis systems have developed greatly in recent years, aiming to solve this problem, yet so far unsuccessfully [8]. This is probably because the majority of medical data are unstructured [9]; take cancer, for example, where about 96% of cancer diagnoses are made from pathological section reports, the data of which are recorded as text descriptions and videos [10]. Thus, it is difficult for traditional models to directly link these original unstructured data with diagnostic information. With the advancement of artificial intelligence (AI) technology, the new generation of computer-aided diagnosis systems is expected to contribute greatly to the intellectualization of medical systems and to further eliminate human errors, increasing the quality of medical care [11]. In 2012, AlexNet won the ILSVRC, leading the third AI revolution [12]. Since then, more powerful deep learning models have been developed, such as VGGNet [13], Inception Net [14], ResNet [15], and DenseNet [16]. This revolution led by deep learning has made enormous progress in image recognition tasks, driving breakthroughs in related research. Computer-aided diagnosis tools built on deep learning technology have increased the quality of medical care [11]. Examples include lymph node metastasis detection [17], diabetic retinopathy detection [18], skin cancer classification [19], pneumonia detection [20], and bleeding identification [21]. There have been over 300 studies (mostly in the last 2 years) applying such technologies to medical image analysis [22].
It is worth mentioning that the most impressive capacity of deep learning technology is automatic feature extraction. Given a large annotated database, it has been proven to reach, or even surpass, the level of human experts [15,23,24].
The current approach to exploiting the large amount of information in medical records is to have experts read and code them according to the ICD (International Statistical Classification of Diseases and Related Health Problems). This work is not only necessary for our national health insurance declaration system but can also be used in disease monitoring, hospital management, clinical studies, and policy planning. However, manual classification is expensive and, most importantly, time-inefficient. For example, in disease monitoring, since outbreaks of infectious disease can cause massive casualties [25], many countries have developed disease monitoring systems specifically targeting contagious diseases, such as the Real-time Outbreak and Disease Surveillance (RODS) system [26]. To ensure timeliness, this system requires emergency physicians to enter data within set time limits when identifying notifiable diseases, which makes it hard to extend to other diseases. With the advancement of data science, it is widely expected that an automatic disease interpretation model can be developed to solve the high cost and time inefficiency of manual interpretation.
Owing to the widespread digitization of medical records, a great number of studies have attempted to use this information for text mining and ICD code classification. Existing approaches primarily use a bag-of-words model to standardize the text of medical records, then apply a support vector machine (SVM), random forest, or other classifiers for diagnosis classification [27][28][29][30][31]. However, previous studies have found that these methods are incapable of accurate diagnosis classification because of the particularity and diversity of clinical terms, where synonyms need to be properly handled before data preprocessing [10]. A complete medical dictionary would integrate the currently recommended forms of clinical terms, yet building one is almost impossible due to the complexity of clinical terminology. Therefore, traditional automatic classification programs can hardly make significant progress. In addition, the bag-of-words model treats different words as different features and counts the occurrences of each feature in an article. Although this makes it possible to use a dictionary to handle the synonym problem, similar words are still treated as distinct features. Thus, the number of features produced by the bag-of-words model is enormous, causing a curse of dimensionality for subsequent classifiers and leading to the inefficiency and slow progress of traditional algorithms.
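The classical pipeline described above can be sketched as follows. This is a minimal illustration, not the cited systems: the record snippets, labels, and query sentence are all hypothetical, and the point is only to show how bag-of-words counts feed a linear SVM.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical record snippets with invented coarse labels.
records = [
    "patient presents with fever and productive cough",
    "chest pain radiating to left arm with diaphoresis",
    "fever cough and shortness of breath",
    "substernal chest pain and elevated troponin",
]
labels = ["respiratory", "cardiac", "respiratory", "cardiac"]

# Bag-of-words counts every distinct token as an independent feature, so
# near-synonyms become unrelated columns -- the dimensionality problem
# discussed above.
model = make_pipeline(CountVectorizer(), LinearSVC())
model.fit(records, labels)

print(model.predict(["cough with fever"])[0])
```

Because "cough" and "fever" occur only in the respiratory examples, the linear SVM assigns the query to that class; any unseen word (e.g., a newly emerged disease name) simply contributes no features at all.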
Beyond classification efficiency, the greatest challenge for traditional algorithms is new diseases. For instance, there was an H1N1 outbreak in 2009, with related cases never recorded before 2008. Traditional classification algorithms are completely unable to properly classify newly emerged words [27][28][29][30][31]. This disadvantage makes it impossible for traditional methods to reach full automation. To address this issue, we propose word embedding as a technical breakthrough for disease classification. Since the 20th century, word embedding has been an important technology for enabling computers to understand semantics. Its core idea is to represent each word as a vector in a high-dimensional space, such that similar characters/words map to similar vectors that express their semantic meaning [32,33]. The word2vec model published by the Google team in 2013 is considered the most important breakthrough in recent word embedding research. It has been verified to give similar words very high cosine similarity and very small Euclidean distance in vector space [34]. However, this technology has a disadvantage: once applied, it converts an article into a variable-length matrix, making it inapplicable to traditional classifiers such as SVMs and random forests. A common solution is to take the average or weighted average of the word vectors of all words in an article as its semantic representation [35]. However, from the Multi-Genre NLI (MultiNLI) Corpus competition released by the natural language research team at Stanford (https://nlp.stanford.edu/projects/snli/), we can see that combining modern AI technology gives models better performance. Language processing analysis is mostly based on Recurrent Neural Networks (RNNs) or Convolutional Neural Networks (CNNs).
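The geometric property that word2vec has been verified to possess can be illustrated with hand-made toy vectors. The 4-dimensional embeddings below are invented purely for illustration; real word2vec vectors typically have 100-300 dimensions.

```python
import numpy as np

# Invented toy embeddings; "pyrexia" is a clinical synonym of "fever".
vec = {
    "fever":    np.array([0.9, 0.1, 0.0, 0.2]),
    "pyrexia":  np.array([0.8, 0.2, 0.1, 0.2]),
    "fracture": np.array([0.0, 0.9, 0.8, 0.1]),
}

def cosine(a, b):
    # Cosine similarity: 1.0 means identical direction in embedding space.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean(a, b):
    return float(np.linalg.norm(a - b))

# Synonyms: high cosine similarity and small Euclidean distance.
print(cosine(vec["fever"], vec["pyrexia"]), euclidean(vec["fever"], vec["pyrexia"]))
# Unrelated terms: low similarity and large distance.
print(cosine(vec["fever"], vec["fracture"]), euclidean(vec["fever"], vec["fracture"]))
```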
Their core principle is to use convolutional layers (which have no memory but can gradually integrate surrounding single-character information into higher-order features, and therefore require more layers) or Long Short-Term Memory units (which have short- and long-term memory and thus need fewer layers) for feature extraction, enabling the processing of information in matrix form [36]. The CNN has become the primary method in computer vision competitions. The reason for its success is the fuzzy matching behavior of the convolutional layer, which allows similar image features to be integrated. Through appropriate design, we can change the convolutional layer from recognizing similar image features to recognizing similar vocabularies. Hence, CNNs have been applied in text mining, such as semantic classification [37], short sentence searching [38], and chapter analysis [39], and have shown considerably good performance. In a more recent development, Bidirectional Encoder Representations from Transformers (BERT), developed by Google, has swept all kinds of natural language processing competitions [40]. Yet, its core is still good word/sentence/paragraph embedding. Generally speaking, combining good embedding technology with modern deep learning neural networks is undoubtedly the best option for current natural language processing tasks.
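The "fuzzy matching" of a convolutional layer over text can be sketched with plain numpy. Here the token embeddings and filter weights are random stand-ins; the filter sizes and dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, dim, n_filters, width = 10, 8, 4, 3   # 10 tokens, 8-dim embeddings

tokens = rng.normal(size=(seq_len, dim))        # embedded "sentence"
filters = rng.normal(size=(n_filters, width, dim))

# A text CNN slides each filter over every window of `width` consecutive
# token vectors; a filter responds strongly to any word combination whose
# embeddings resemble its weights -- the fuzzy matching described above.
feature_map = np.array([
    [np.sum(tokens[i:i + width] * f) for i in range(seq_len - width + 1)]
    for f in filters
])
pooled = feature_map.max(axis=1)                # max-over-time pooling
print(feature_map.shape, pooled.shape)          # (4, 8) (4,)
```

Because filters match word-vector patterns rather than exact strings, synonyms with nearby embeddings activate the same filter, which is what lets a CNN generalize across similar vocabularies.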
Our team has already applied this approach to disease classification of discharge record summaries and shown that, compared with traditional models, an AI model combining a word embedding model and a CNN reduces the error rate by 30% in disease classification tasks, makes modeling easier by avoiding troublesome text integration preprocessing, and learns external language resources through unsupervised learning to integrate similarities among clinical phrases [41]. However, although the combination of word embedding and CNN outperforms traditional methods in disease classification, its accuracy still cannot match that of humans. One of the reasons is error in understanding semantic meaning. Therefore, improving the word embedding model's understanding of medical terms might increase its subsequent analytical performance [42]. Two studies have evaluated the application of word embedding models trained on different resources to biomedical NLP and found that EHR-trained word embeddings better capture semantic properties [43,44]. On the other hand, external data resources have an often-neglected advantage: the vocabulary diversity of external internet resources is far greater than that of an internal task database. This advantage greatly affects real disease coding tasks. Hence, an embedding training process needs to be developed that maintains both the vocabulary diversity of internet resources and the understanding of medical terms in the internal task database. A recent word embedding comparison study showed that EHR-trained word embeddings could usually better capture medical semantics [43]. Yet even though the Mayo Clinic research team used an EHR with a large amount of data, the total vocabulary was only about 100,000 words, still far less diverse than external databases [43,44]. This is due to the absence of some rare and periodic diseases, such as the 2003 SARS outbreak and the 2009 H1N1 outbreak.
Therefore, EHR-trained word embedding models cannot cover enough vocabulary. For this reason, our team developed a projection word embedding model that has the vocabulary diversity of Wikipedia/PubMed as well as the understanding of medical terms in EHRs [45].
A medical record is both a historical record and the foundation of a patient's medical care. It records the patient's conditions, the reasons for and results of examinations and tests, the treatment methods, and the outcomes during care. It integrates and analyzes the patient's related information, provides the basis on which medical decisions are executed, and even affects national health policy. The basic purpose of a medical record is to remind oneself and other medical colleagues of a patient's daily condition and the attending physician's current thinking. When medical treatment is being performed, the medical record serves as the communication tool among physicians and the means for continuity of care. In other words, the medical record is the only text material that records a patient's conditions for the attention of all medical care personnel. A medical record is an index of medical care quality, reflecting a physician's clinical reasoning and diagnostic basis. It serves as a reference for learning, research, and education. Meanwhile, it also serves as evidence in medical disputes to clarify the attribution of liability. The medical record is the foundation of patient care, as it records the care provided by medical personnel; thus, all results obtained from observation or examination can be found in the medical record. Therefore, any change in the patient's condition can be traced in the record, so that the patient's current condition can be evaluated for suitable treatment. Moreover, communication with the patient should also be documented in the medical record, so that medical personnel can learn the patient's expectations of treatment, resulting in a closer doctor-patient relationship. For other professionals, a detailed medical record saves a great deal of communication time and avoids misunderstanding or missing the patient's previous conditions, which could lead to mistreatment.
The content of medical records also has legal effects. It is the basis of insurance benefits and even affects national health policy. For example, public health studies usually need to include case information under national health insurance, and, by studying a large number of medical records, such studies can help public health researchers and medical officials establish more suitable public health decisions and administrative rules that protect the rights and interests of both doctors and patients. Clinical decision-making guidelines formulated by many specialist medical associations also use information from medical records. The demographic information implicit in these records is also collected at the national level and published as national health statistics for comparison with other countries, serving as a way for countries to communicate and learn from each other for mutual benefit.
In this study, as shown in the graphical abstract, a scoring database was established by having experts score medical records. An AI model was trained to learn the experts' scoring logic so as to screen high-quality medical record summaries. The resulting database may, in turn, promote the establishment of subsequent AI models, improve model accuracy, and serve as teaching examples to improve the efficiency of medical education.

Data Source
In this study, inpatient medical records from the Tri-Service General Hospital from 1 January 2016 to 31 December 2019 were used as the basic database, with ethical approval from the institutional review board (IRB no. A202005104). Physicians of different levels from different departments were invited to score medical record summaries. The scoring dimensions comprise 12 items based on clinical writing standards, following the detailed structure of the QNOTE scale's inpatient record: chief complaint, history of the present illness, problem list, past medical history, medications, adverse drug reactions and allergies, social and family history, review of systems, physical findings, assessment, plan of care, and follow-up information. The completeness of each item's record, as well as the 5 attributes of the electronic medical record's examination information (completeness, correctness, concordance, plausibility, and currency), were evaluated on a 5-level Likert scale: strongly disagree, disagree, no comment (neither agree nor disagree), agree, and strongly agree. Specialists from different departments were required to review 227,689 medical records and preliminarily score them on a 10-point Likert scale based on the average of the above 5 attributes. These scores were then used as the training target of the AI model to represent medical record writing quality. All samples were divided into a training set (n = 74,959) and a testing set (n = 152,730) based on time and then evaluated by department. Predictions on the testing set were compared with the actual scores, and the MAE on the Likert scale was used as the evaluation index of model performance. Finally, the aforementioned model was deployed at the Tri-Service General Hospital as an automatic medical record scoring system, so as to screen high-quality medical records for future teaching and research.

AI Algorithm
The collected medical records and various writing quality indicators were used for artificial intelligence model training. The model architecture uses the word embedding and LSTM model developed by our team. The word embedding uses a projection word embedding lookup table to convert each single character into a mathematical vector and takes the entire input article as the input matrix. We used projection word embedding to construct a deep convolutional network model, enabling the network to integrate the transformed semantic vectors and extract features of written medical records based on different word combinations. First, we used a word embedding lookup table trained on Wikipedia and the PubMed library, and then used the EHR to perform projection word embedding training. Next, we concatenated the converted text matrices in parallel so that the network could refer to two different word embedding sources simultaneously. In addition, we used each word embedding separately as the conversion source to compare their effects on prediction performance.
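The parallel concatenation of the two embedding sources can be sketched as follows. The vocabulary, embedding tables, and projection matrix below are random stand-ins (in the actual system the first table is trained on Wikipedia/PubMed and the projection is learned from the EHR); only the wiring is illustrated.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"fever": 0, "cough": 1, "pain": 2}   # toy character/word vocabulary
dim = 8

# Stand-in for the Wikipedia/PubMed-trained embedding table.
emb_wiki = rng.normal(size=(len(vocab), dim))
# Stand-in for the projection matrix learned from the EHR; projecting the
# external table yields an EHR-adapted table while preserving coordinates.
proj = rng.normal(size=(dim, dim))
emb_ehr = emb_wiki @ proj

def encode(tokens):
    ids = [vocab[t] for t in tokens]
    # Look up both tables and concatenate column-wise, so downstream layers
    # see both embedding sources for every token simultaneously.
    return np.concatenate([emb_wiki[ids], emb_ehr[ids]], axis=1)

x = encode(["fever", "cough"])
print(x.shape)  # (2, 16): 2 tokens, two 8-dim embeddings side by side
```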

Long Short-Term Memory (LSTM)
In an RNN, the output can be fed back to the network as input, thereby creating a loop structure. RNNs are trained through backpropagation, during which they encounter the problem of vanishing gradients. The gradient is used to update the weights of the neural network, and it vanishes when it shrinks as it propagates backwards in time. Layers that receive small gradients therefore stop learning, effectively leaving the network with only short-term memory.
The LSTM architecture was introduced by Hochreiter and Schmidhuber [46] to alleviate the problem of vanishing gradients. LSTMs use a mechanism called gates to learn long-term dependencies; the gates learn which information in the sequence is important to keep or discard. LSTMs have three gates: input, forget, and output. At the core of the LSTM model, pointwise addition and multiplication add or delete information from the memory cell. These operations are performed using the input and forget gates of the LSTM block, together with the "tanh" activation function at the output. Apart from the original architecture and model parameters, the other settings were: epochs = 20, batch size = 300, and learning rate = 0.001.
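The gate mechanism described above can be sketched as a single LSTM step in plain numpy. The weights are random stand-ins and the dimensions are arbitrary; this is only meant to make the pointwise add/multiply operations on the memory cell concrete.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    """One LSTM step. W, U, b stack the input, forget, candidate, and
    output gate parameters along the first axis (4 * hidden rows)."""
    hidden = h.shape[0]
    z = W @ x + U @ h + b
    i = sigmoid(z[0 * hidden:1 * hidden])   # input gate: what to write
    f = sigmoid(z[1 * hidden:2 * hidden])   # forget gate: what to erase
    g = np.tanh(z[2 * hidden:3 * hidden])   # candidate cell content
    o = sigmoid(z[3 * hidden:4 * hidden])   # output gate: what to expose
    c_new = f * c + i * g                   # pointwise add/multiply on memory
    h_new = o * np.tanh(c_new)              # "tanh" activation at the output
    return h_new, c_new

rng = np.random.default_rng(0)
inp, hidden = 8, 16                         # e.g. embedded token -> hidden state
W = rng.normal(scale=0.1, size=(4 * hidden, inp))
U = rng.normal(scale=0.1, size=(4 * hidden, hidden))
b = np.zeros(4 * hidden)

h, c = np.zeros(hidden), np.zeros(hidden)
for _ in range(5):                          # run over a 5-token toy sequence
    h, c = lstm_step(rng.normal(size=inp), h, c, W, U, b)
print(h.shape, c.shape)
```

Because the forget gate multiplies the cell state rather than repeatedly squashing it, the gradient along c can survive many steps, which is exactly how LSTMs sidestep the vanishing gradient problem of plain RNNs.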

Bidirectional Encoder Representation from Transformers (BERT)
Besides the original word embedding and LSTM architecture, the BERT architecture was also used for feature extraction. BERT is a recent attention-based model with a bidirectional Transformer network that is pre-trained on a large corpus; the pre-trained model is then effectively used to solve various language tasks through fine-tuning [40,47]. In brief, the task-specific BERT architecture represents the input text as a sequence of tokens. The input representation is the sum of the token embeddings, the segmentation embeddings, and the position embeddings [40]. For a classification task, the first token in the sequence is a special token denoted [CLS]. The encoder output at the [CLS] position is followed by a fully-connected layer, and finally a softmax layer aggregates the result for classification [47]. If the NLP task involves sentence pairs, as in question answering, the sentences are separated by another special token, [SEP]. The BERT multilingual base model (cased) was used for transfer feature learning, and the other parameters were set to epochs = 30, batch size = 32, and learning rate = 0.00001.
Through these two methods, we can enable the network to learn the semantic meanings of different individual characters and to learn from different texts, such as Wikipedia and PubMed. Then, through fine-tuning retraining on the EHR, the pre-trained BERT architecture only needs to change its output from predicting context to predicting the categories of multiple medical record quality dimensions; it can then be trained on medical record information.
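The [CLS]-based classification head described above can be sketched in numpy. The encoder output is a random stand-in for a fine-tuned BERT's hidden states, and the dimensions (sequence length, hidden size, number of score classes) are chosen arbitrarily for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, hidden, n_classes = 12, 32, 10     # [CLS] + 11 tokens; 10 score levels

# Stand-in for the encoder output of a fine-tuned BERT: one hidden vector
# per input token, with the special [CLS] token at position 0.
encoder_out = rng.normal(size=(seq_len, hidden))

# Classification head: a fully-connected layer on the [CLS] vector, then
# softmax to aggregate into class probabilities.
W = rng.normal(scale=0.1, size=(hidden, n_classes))
b = np.zeros(n_classes)

logits = encoder_out[0] @ W + b             # use only the [CLS] position
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(probs.shape)
```

Fine-tuning for the scoring task amounts to replacing BERT's pre-training output with exactly this kind of head and training W, b (and the encoder) on the expert-scored records.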

Linear Mixed Model Function for Medical Records Scoring Prediction
Suppose the data are collected from m independent groups of observations (called clusters, or subjects in longitudinal data).
Here, Y_m is an n × 1 vector of the dependent variable for patient m, X_m is an n × q matrix of all the independent variables for patient m, B_m is a q × 1 unknown vector of regression coefficients, and e_m is an n × 1 vector of residuals, giving the individual-level model

Y_m = X_m B_m + e_m.

This results in a multi-level mixed model with random effects for all samples, which is expressed as

Y = XB + Zu + e,

where Z is a matrix of known constants containing the independent variables with random effects, and u is a vector of random effects for all patients. The best linear unbiased prediction (BLUP) is important for predicting the medical record score of each patient, and it can be calculated by following the steps in [48].
Moreover, Z_m is an n × p matrix of the independent variables with random effects for patient m; these matrices contain the observed data. After building the prediction tool, we have the G matrix, the B vector, and σ². G is the p × p variance-covariance matrix of the random effects, B is the q × 1 vector of fixed-effect coefficients, and σ² is the variance of the residuals. If the independence assumption holds, the n × n residual covariance matrix is

R = σ² I_n.

Finally, the BLUP of the random effect for patient m can be estimated as

û_m = G Z_m^T (Z_m G Z_m^T + R)^(−1) (Y_m − X_m B).

Based on this result, we can estimate the patient-specific regression coefficients B_m, obtained by adding the estimated random effects to the corresponding fixed-effect coefficients:

B_m = B + û_m.

These coefficients can then be used to predict the medical record score. Note that this calculation cannot make direct forecasts without the covariate values; thus, the covariate information at the time of interest must be generated. We propose two methods for generating this information: (1) assume consistency between the last observed time and the time of interest, and (2) predict by linear extrapolation. We will assess these methods in our analysis. Unquestionably, clinicians can use the most reasonable values based on their judgment to predict the covariates at the time of interest. In summary, we can combine this method with population information to predict the medical record score.
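The BLUP computation can be sketched numerically as follows. All quantities below (G, B, σ², the design matrices, and the dimensions) are simulated stand-ins for the fitted population-level values, chosen only to make the matrix algebra concrete.

```python
import numpy as np

rng = np.random.default_rng(0)
n, q, p = 30, 3, 2                      # records for one subject; q fixed, p random effects

# Stand-ins for the population-level quantities obtained when the LMM was fitted.
B = np.array([7.0, 0.3, -0.2])          # fixed-effect coefficients (q,)
G = np.diag([0.5, 0.1])                 # random-effect covariance (p x p)
sigma2 = 0.4                            # residual variance

# Simulated data for one subject m: design matrices and observed scores.
X = np.column_stack([np.ones(n), rng.normal(size=(n, q - 1))])
Z = X[:, :p]                            # random intercept and slope
u_true = rng.multivariate_normal(np.zeros(p), G)
Y = X @ B + Z @ u_true + rng.normal(scale=np.sqrt(sigma2), size=n)

# BLUP: u_m = G Z' (Z G Z' + R)^(-1) (Y - X B), with R = sigma^2 I
# under the independence assumption.
R = sigma2 * np.eye(n)
V = Z @ G @ Z.T + R
u_hat = G @ Z.T @ np.linalg.solve(V, Y - X @ B)

# Subject-specific coefficients: add the BLUP to the matching fixed effects.
B_m = B.copy()
B_m[:p] += u_hat
print(u_hat.shape, B_m.shape)
```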

Evaluation Criteria
We evaluated the generalization performance of each model on the training and testing samples. The mean absolute error (MAE) was used to compare the performance of the models, as follows:

MAE = (1/n) Σᵢ |yᵢ − ŷᵢ|,

where yᵢ is the reviewer's score for record i, ŷᵢ is the model-predicted score, and n is the number of records.
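The MAE above can be computed directly; the reviewer and predicted scores in this snippet are hypothetical examples on the 10-point scale.

```python
def mean_absolute_error(y_true, y_pred):
    """Mean absolute error between reviewer scores and predicted scores."""
    assert len(y_true) == len(y_pred)
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)

# Hypothetical reviewer scores vs. model predictions.
reviewer = [8, 7, 9, 6]
predicted = [7.5, 7.2, 8.1, 6.4]
print(round(mean_absolute_error(reviewer, predicted), 6))  # 0.5
```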

Results
The research scheme is shown in Figure 1, where a total of 227,689 medical records were scored by experts. In AI model training, the medical records were divided into the training set and testing set based on year, where 74,959 records were used to establish BERT and LSTM models, and 152,730 records were used to test record scoring. LMM was then employed to modify BERT and LSTM to establish another two models. In the end, MAE was used to compare the four models' efficiencies in predicting medical record scores.

Figure 1.
Training and testing set generation. Schematic of the data set creation and analysis strategy, devised to ensure a robust and reliable data set for training and testing the network. Once a medical record's data were placed in one of the data sets, that record's data were used only in that set, avoiding 'cross-contamination' between the training and testing sets. The details of the flow chart and how each data set was used are described in the Methods. Table 1 shows the distribution of medical records across departments. It can be seen that 74,959 records were included for modeling, and 152,730 records were then used for prediction. The average score from experts was 7.24 ± 1.02 for the training set and 7.67 ± 0.84 for the testing set; after BERT and LSTM modeling of medical record scoring, the average predicted score in the testing set was 7.47 ± 0.89 for BERT and 7.15 ± 1.05 for LSTM.
After training through the BERT and LSTM models, the artificial intelligence model had already scored the medical records. Our team's projection word embedding model allows the model to have both the vocabulary diversity of Wikipedia/PubMed and an understanding of medical terms in the EHR. The concept of projection word embedding follows our previous studies: a concept from linear algebra in which matrix multiplication projects all coordinates into a new coordinate system. Such a conversion changes the correlations of certain points while maintaining all current coordinates. In addition to the original projection word embedding and LSTM architecture, we attempted to use the BERT architecture for feature extraction. BERT stands for Bidirectional Encoder Representations from Transformers; the elementary unit of the BERT architecture is the multi-head self-attention layer of the Transformer encoder, and the overall architecture is a stack of bidirectional Transformer encoder layers. As shown in Table 2, overall, against the experts' scoring, the trained BERT scoring model had a prediction score of 7.49 ± 0.28, whereas LSTM had 7.17 ± 0.31; after modification by the linear mixed model (LMM), the BERT and LSTM prediction scores were 7.36 ± 0.56 and 7.33 ± 0.65, respectively. After stratifying by department group, such as internal medicine, surgery, obstetrics, and pediatrics, BERT consistently had higher prediction scores than LSTM, while all LSTM prediction scores increased after LMM modification. Looking further into individual departments, most departments' BERT prediction scores were higher than those of LSTM, and the latter increased after LMM modification. Table 2. BERT and LSTM original prediction scores and LMM-modified scores.

LMM-Modified LSTM Prediction Scores
It can be seen from Table 3 that, when reviewer physicians' scores and AI scores were compared using the mean absolute error (MAE), both BERT and LSTM AI scores were 0.6~1.3 points lower than reviewer physicians' scores; thus, the linear mixed model (LMM) was introduced for modification, reducing the difference to 0.3~1 points, a significant reduction (p < 0.001). The rationale for the LMM modification is that an ordinary linear regression contains only two sources of variation: fixed effects and noise. The latter is a random component not considered in our model, while the former comprises the predictable factors. The AI scoring of medical records after LMM modification is also more realistic. After stratifying by department, it was found that, in some departments, the LMM-modified MAE was not significantly reduced compared with the original MAE. Hence, experts' scores were plotted as a heat map (Figure 2), where it was found that some groups of scoring physicians and scored physicians had closer scores; these were analyzed separately.
In Table 4, medical record prediction scores and MAE are analyzed for Blocks A to H, respectively; except for Block F, most blocks had record scores similar to the previous results, and the MAE of LSTM prediction scores was significantly reduced (p < 0.05) after LMM modification.

Figure 2.
Heat map of medical record scores from scoring and scored physicians. X-axis: physicians who wrote the medical records; Y-axis: scoring physicians and their departments. A redder grid cell means the scoring physician gave a higher score to the record-writing physician. Clusters appear in some areas; we therefore mark out several blocks (A to H) and describe their characteristics in Table 4.
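The reviewer-by-writer matrix behind a heat map like Figure 2 can be assembled as in the sketch below (illustrative data and function names only; the actual figure was built from the study's scoring records). Passing the resulting matrix to, e.g., matplotlib's `imshow` with a red colormap would reproduce the kind of plot the caption describes.

```python
import numpy as np

def score_matrix(records):
    """Build the reviewer-by-writer mean-score matrix behind the heat map.
    `records` is a list of (reviewer, writer, score) tuples; pairs with
    no scored records are returned as NaN (blank cells in the plot)."""
    reviewers = sorted({r for r, _, _ in records})
    writers = sorted({w for _, w, _ in records})
    mat = np.full((len(reviewers), len(writers)), np.nan)
    for i, r in enumerate(reviewers):
        for j, w in enumerate(writers):
            scores = [s for rr, ww, s in records if rr == r and ww == w]
            if scores:
                mat[i, j] = float(np.mean(scores))
    return reviewers, writers, mat

# Toy records: reviewer r1 scores writer w1 twice and w2 once, etc.
records = [("r1", "w1", 8), ("r1", "w1", 9), ("r1", "w2", 5), ("r2", "w2", 7)]
reviewers, writers, mat = score_matrix(records)
```

Cluster blocks such as A–H then correspond to sub-rectangles of `mat` where neighboring reviewers and writers show similar means.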
In spite of this, we were still unable to identify why the MAE of certain departments showed no significant reduction after LMM modification, so a heat map analysis was performed on the LMM-modified LSTM prediction scores. Figure 3 shows that some reviewers' LMM-modified LSTM prediction scores had relatively large MAE. After grouping reviewers by the LMM-modified error (|Grade − LMM-modified LSTM|), experts' scores were close within groups, but the BERT and LSTM prediction scores were lower than the original experts' scores. In Figure 4, we further use MAE to evaluate model efficiency: comparing the MAE of the LMM-modified models (|Grade − LMM-modified BERT|, |Grade − LMM-modified LSTM|) with the MAE of the original models (|Grade − BERT|, |Grade − LSTM|), the LMM modification effectively reduced the MAE in Q1–Q3, but not in Q4. Thus, it is suspected that some scoring physicians in Q4 may have scored incorrectly.
Healthcare 2021, 9, x 15 of 20
Figure 3.
MAE heat map of LMM-modified LSTM prediction scores from scoring and scored physicians. X-axis: physicians who wrote the medical records; Y-axis: scoring physicians and their departments. By subtracting the original scores from the LMM-modified LSTM prediction scores and plotting the resulting MAE against the scoring physicians as a heat map, it can be seen that some reviewers' scores are on the high side.
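The screening of reviewers whose error remains large after LMM modification, which motivates the Q4 suspicion above, can be illustrated with a short sketch. The helper name and the top-quartile cutoff are our own illustrative choices; the study's actual Q1–Q4 grouping may have been defined differently.

```python
import numpy as np

def flag_suspect_reviewers(per_reviewer_mae, quantile=0.75):
    """Flag reviewers whose LMM-adjusted MAE stays above the top-quartile
    cutoff, mirroring the Q4 group suspected of inconsistent scoring.
    `per_reviewer_mae` maps reviewer id -> adjusted MAE."""
    maes = np.array(list(per_reviewer_mae.values()), dtype=float)
    cutoff = np.quantile(maes, quantile)
    return sorted(r for r, m in per_reviewer_mae.items() if m > cutoff)

# Toy per-reviewer adjusted MAEs: reviewer "d" stands out.
mae_by_reviewer = {"a": 0.2, "b": 0.3, "c": 0.4, "d": 1.5}
suspects = flag_suspect_reviewers(mae_by_reviewer)
```

Reviewers returned by such a screen would then be candidates for manual audit of their scoring standards.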

Discussion
In this study, the projection word embedding model was used to develop an AI system to evaluate the writing quality of inpatient medical records. Combined with results from previous studies, this AI system is already capable of accurate classification at the level of three-digit ICD-10 coding. Since three-digit coding is already at the disease level and the subsequent digits are only qualifiers (such as location), reaching this level makes it possible to fully automate common disease classification tasks and to extract disease features from other medical descriptions through this algorithm. In addition to the original word embedding and LSTM architecture, a BERT architecture was also employed to extract disease features for medical record scoring. The LMM was further used as a correction to bring the AI scores closer to the actual reviewer physicians' scores. Moreover, we identified that some physicians over-scored medical records; if these scoring standards can be improved in the future, better medical record writing quality can be expected.
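As a rough illustration of the "projection" idea behind the embedding model, the toy sketch below hashes each token directly into a fixed ±1 vector instead of looking it up in a trained embedding table, then mean-pools the vectors and squashes a linear read-out into the 0–10 grade range. It is a stand-in for intuition only: the study's actual system uses trained projection word embeddings feeding an LSTM (or BERT), not this pooled linear scorer, and all names and dimensions here are invented.

```python
import hashlib
import numpy as np

def projection_embed(token, dim=16):
    """Hash-based projection embedding: maps a token to a fixed +/-1
    vector without any embedding table (toy stand-in for the trained
    projection embeddings used in the study)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    bits = np.unpackbits(np.frombuffer(digest, dtype=np.uint8))[:dim]
    return bits.astype(np.float32) * 2.0 - 1.0

def score_record(tokens, w, dim=16):
    """Toy scorer: mean-pool the projection embeddings of a record's
    tokens, then squash a linear read-out into the 0-10 grade range.
    `w` is a (dim,) weight vector; an empty record gets the midpoint."""
    if not tokens:
        return 5.0
    pooled = np.mean([projection_embed(t, dim) for t in tokens], axis=0)
    return float(10.0 / (1.0 + np.exp(-(pooled @ w))))
```

The advantage the study exploits is visible even in the toy: because the embedding is computed from the token itself, out-of-vocabulary words from open internet text still receive a stable representation.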
In addition, why is the quality of medical record writing so important? The medical record is the historical record of the patient's health care and the basis of care; its content documents the patient's condition during the care process, the reasons for and results of examinations, and the treatment methods and outcomes. Recent studies have shown that it is feasible to use electronic health records (EHRs) to predict disease risk, such as atrial fibrillation (AF) [49], coronary heart disease in patients with hypertension [50], fall risk [51], multiple sclerosis [52], and cervical cancer [53]. Over the past two decades, the investigation of genetic variation underlying disease susceptibility has increased considerably. Most notably, genome-wide association studies (GWAS) have investigated tens of millions of single-nucleotide polymorphisms (SNPs) for associations with complex diseases. However, results from numerous GWAS have revealed that the majority of statistically significantly associated genetic variants have small effects [54] and may not be predictive of disease risks [55], and many diseases are associated with tens of thousands of genetic variants [56]. These findings have led to the resurgence of the polygenic risk score (PRS), an aggregate measure of many genetic variants weighted by their individual effects on a given phenotype. However, epidemiologic studies are expensive and complex to run, which raises the question of whether a PRS could be developed and applied in a clinical setting using genetic data that are more readily available. Recently, some scholars have proposed new ideas for developing and implementing PRS predictions using biobank-linked EHR data [57].
For the medical records scoring system, this not only saves physicians the time spent scoring medical records but also provides feedback immediately after writing is completed, improving the quality of medical record writing. In past research, clinicians spent 3.7 h per day, or 37% of their work day, on the EHR [58]. There was a marked reduction in EHR time with both clinician and resident seniority; despite this improvement, the total time spent on the EHR remained exceedingly high even among the most experienced physicians [58]. The significance of the increasing shift toward EHRs cannot be overstated, particularly in the current era of healthcare, and there is increasing scrutiny of documentation [59,60]. These increased demands can lead to EHR fatigue and physician burnout. In a survey of a general internal medicine group, 38% reported feeling burnt out, with 60% citing high documentation pressure and 50% describing too much EHR time at home [61]. Burnout has, in turn, been linked to poorer resident wellbeing [62].
There are still some limitations of electronic medical records. First, this scoring system can only be used in our hospital, because the medical record systems of different hospitals are not interoperable. Second, entering data into an EHR requires a great deal of physician time, and many physicians experience burnout symptoms due to EMR-related workloads. Third, cyber-attacks are a perennial concern for EHRs; it is therefore imperative that cybersecurity be continually enhanced. Fourth, timing discrepancies occur in EHRs and can lead to serious clinical consequences.
In summary, combining projection word embedding and LSTM with the LMM yields better prediction scores. This system can be used to assist medical record scoring so that young physicians receive immediate writing feedback, improving the quality of medical record writing in our country and allowing the public, medical units, and insurance units all to benefit. In the future, it may be possible to introduce such technologies into hospitals to support personalized precision medicine.