A Novel Machine-Learning-Based Hybrid CNN Model for Tumor Identification in Medical Image Processing

Abstract: The popularization of electronic clinical medical records makes it possible to use automated methods to quickly extract high-value information from medical records. As essential medical information, oncology medical events are composed of attributes that describe malignant tumors. In recent years, oncology medical event extraction has become a research hotspot in academia; many academic conferences have published it as an evaluation task and provided a series of high-quality annotated data. Targeting the discrete nature of the attributes of tumor-related medical events, this article proposes a joint extraction method for medical events, which realizes the joint extraction of the primary tumor site and primary tumor size, as well as the extraction of tumor metastasis sites. In addition, given the small number and limited types of annotated texts for tumor-related medical events, a pseudo-data-generation algorithm based on the global random replacement of key information is proposed, which improves the transfer learning ability of the joint extraction method across different types of tumor-related medical events. The proposed method won third place in the electronic medical record-based clinical medical event extraction evaluation task of CCKS2020. A large number of experiments on the CCKS2020 dataset verify the effectiveness of the proposed method.


Introduction
With the rapid popularity of electronic medical records and the advent of big medical data, natural language processing (NLP) technology in the medical field has become a current research hotspot. NLP-related technologies, such as event extraction, relationship extraction, etc., can be used as automated methods to quickly extract scientifically valuable information from clinical medical records, thereby improving the work efficiency of scientific researchers and accelerating the progress of drug research [1].
Event extraction is a primary task of NLP. Its purpose is to extract events that users are interested in from unstructured information and present them to users in a structured form. In recent years, tumor-related medical event extraction has become a research hotspot in academia; the 4th Health Information Processing Conference (CHIP2018 [2]) and the 13th and 14th National Conference on Knowledge Graph and Semantic Computing (CCKS2019 [3], CCKS2020 [4]) all used it as a major evaluation task, attracting a large number of participants from industry and providing a series of high-quality annotated data, which has significantly promoted research on medical event extraction.
Given medical record text whose central entity is a tumor, tumor-related medical event extraction defines several attributes of the tumor-related medical event, such as tumor size and primary tumor site, and identifies and extracts these attributes. The data released by CHIP2018, CCKS2019, and CCKS2020 define three attributes: primary tumor site, primary tumor size, and tumor metastasis site. However, these three attributes are relatively discrete; that is, each can exist relatively independently without being affected by the others. For example, in medicine, any body part may become a site of tumor metastasis, regardless of the tumor's original location, and the size of the primary tumor and the site of tumor metastasis are not related either medically or in practice. The only partial connection is that the primary tumor size, as a measurement of the primary tumor site, usually co-occurs with the primary tumor site at the sentence level, but this is not absolute. For the extraction of tumor-related medical events, the author of [5] proposed CCMNN in a previous work, a multi-neural-network collaboration method that extracts the three attributes. Based on the observation that the primary tumor site and the primary tumor size co-occur at the sentence level, CCMNN uses a rule-based method to extract the primary tumor size. However, because of the arbitrariness of natural language and the irregular writing of medical record texts, the actual performance of this method is not good [6,7].
In response to the problems in CCMNN, this article improves CCMNN and proposes a joint extraction method for medical events. This method realizes the joint extraction of the primary tumor site and primary tumor size, as well as the extraction of the tumor metastasis sites. In the CCKS2020 electronic medical record-based clinical medical event extraction evaluation task, this method obtained an F1 value of 73.52, winning third place. To verify the effectiveness of the method, many comparative experiments between the method in this paper and CCMNN were carried out on the CCKS2019 and CCKS2020 medical event extraction datasets. The experimental results show that the method in this paper significantly improves the absolute F1 value compared to CCMNN [8,9]. Further exploratory analysis shows that the method dramatically improves the extraction of the primary tumor size, achieving the research purpose of this article.
In addition, given the problem of the small number and types of medical record texts for oncology medical events, this paper proposes a pseudo-data-generation algorithm based on the global random replacement of crucial information [10][11][12]. The experimental results on the CCKS2020 medical event extraction dataset show that the algorithm can effectively expand the number and types of annotated medical record texts and improve the transfer learning ability of this method for different kinds of tumor-related medical events.

Related Research
Similar to information extraction in general fields [13,14], medical information extraction refers to determining the boundaries of professional terms in medical texts and then classifying them based on domain information [15]. The current methods of medical information extraction fall mainly into two types: shallow machine-learning methods and deep-neural-network methods. Shallow machine-learning methods mainly include the Hidden Markov Model (HMM), Conditional Random Field (CRF), Support Vector Machine (SVM), etc. [16]. The author of [17] verified that the CRF-based Gimli method reached an F1 value of 72.23 on the JNLPBA 2004 dataset. The author of [18] proposed a multi-feature fusion CRF method, which can accurately identify the disease and symptom entities in medical record text and can also accurately identify unregistered entity words. Shallow machine-learning methods rely, to a large extent, on the design of hand-crafted features. To address this problem, the author of [19] used the CRF model for biomedical entity recognition and added different word vectors on top of basic hand-crafted features; the F1 value on the JNLPBA 2004 dataset reached 71.39. The author of [20] used a small number of hand-crafted features and word vectors to construct a CRF model [21,22] and added post-processing; the resulting F1 value on the JNLPBA 2004 corpus was 71.77.
Sustainability 2022, 14, 1447
In studies using deep neural networks for medical information extraction, the author of [23] first used neural networks to generate word vectors on unlabeled biomedical texts and then built a multi-layer neural network, obtaining an F1 value of 71.01 on the JNLPBA 2004 dataset. The author of [24] used the BiLSTM model to obtain an F1 value of 88.6 on the BioCreative GM dataset and an F1 value of 72.76 on the JNLPBA 2004 corpus. Finally, the author of [25] proposed a neural-network model based on CNN-BLSTM-CRF that reached the best F1 values on the BioCreative II GM and JNLPBA 2004 datasets [26][27][28][29][30].
In research on tumor-related medical event extraction, the author of [31] proposed an extraction method based on pattern matching, which achieved an F1 value of 69.7 on the CHIP2018 dataset. The author of [32] proposed a multi-neural-network collaboration extraction method that obtained an F1 value of 76.35 on the CCKS2019 dataset. Zhao et al. [33] proposed a method based on a multi-sequence labeling model and obtained an F1 value of 76.17 on the CCKS2019 dataset. The author of [34] proposed an ELMo-based sequence-labeling method, which combined rules to obtain an F1 value of 70.69 on the CCKS2019 dataset. In the latest related research, Dai et al. [35] proposed an extraction method based on RoBERTa, which uses a large number of external resources to fine-tune RoBERTa and uses rules to preprocess the data; the method obtained an F1 value of 76.23 on the CCKS2020 dataset. The author of [36] used the BiLSTM-CRF model to extract tumor-related medical events; the model is based on RoBERTa and uses data augmentation and rules to process the data. It achieved an F1 value of 74.58 on the CCKS2020 dataset [37,38].
Research on medical information extraction closely follows the pace of general information-extraction research. Still, its progress lags behind, mainly due to the lack of large-scale, high-quality annotated medical data. In addition, the current electronic medical record-based methods for tumor-related medical event extraction either use a large number of rules, which significantly reduces their generalization ability [39,40], or rely heavily on pre-trained language models [41] and external resources, which increases their demand for computing resources and domain knowledge and hinders their practical application.

Task Analysis
The definitions of the primary tumor site, primary tumor size, and tumor metastasis site are as follows. The primary tumor site is the tissue or organ where a specific malignant tumor first appears; usually, there are obvious characteristic words in its context, such as "cancer", "malignant tumor", "MT", "CA", etc. The primary tumor size is a measure of the size of the primary tumor, generally in the form of length, area, or volume. The tumor metastasis site is the tissue or organ to which the malignant tumor spreads from the primary site.
The CCKS2020 electronic medical record-based clinical medical event extraction evaluation task takes the transfer learning ability of a method across different types of tumor-related medical events as an essential indicator. Therefore, the difference between the data distributions of the provided training set and test set is mainly reflected in the type of tumor. This article counts the tumor type information in the training set (train) and test set (test) of the CCKS2020 medical event extraction dataset, as listed in Table 1. It can be seen from Table 1 that train mainly contains two kinds of tumor-related medical events, lung and breast, which together account for 83.48%, with lung-related medical events alone accounting for 62.67%. In contrast, test includes many types of tumor-related medical events that do not appear in train, such as those of the stomach, pancreas, and uterus. In addition, there are also significant differences in the specific descriptions of the tumor-related medical events that occur in both train and test.
Figure 1 shows the architecture of the joint medical event extraction method proposed in this article. The method is divided into two parts: (1) the joint extraction of the primary tumor site and the primary tumor size; (2) the extraction of the tumor metastasis site. The method first extracts candidate words for the primary tumor site, formalizing this step as a named entity recognition task and using the BiLSTM-CRF model for the extraction.

Design Methods
The first layer of the BiLSTM-CRF model is the embedding layer, which maps each token in the medical record text to a token-embedding representation (a token is a character, punctuation mark, English letter, or other symbol in the medical record text) and outputs the embedding representation sequence of the text. If a medical record text X contains n tokens, the embedding representation sequence of X can be expressed as $X = (x_1, x_2, x_3, \ldots, x_n)$, where $x_i \in \mathbb{R}^d$ is the embedding representation of the i-th token and $d$ is the embedding dimension. Before the sequence enters the next layer, dropout is applied to alleviate overfitting.
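The embedding step described above can be illustrated with a minimal pure-Python sketch. It assumes character-level tokens, randomly initialized embedding vectors, and inverted dropout; the function names, the embedding dimension of 8, and the dropout rate of 0.5 are illustrative choices, not values reported in this paper.

```python
import random

def build_vocab(texts):
    """Map every distinct character (token) to an integer id; id 0 is for unknown tokens."""
    vocab = {"<unk>": 0}
    for text in texts:
        for ch in text:
            vocab.setdefault(ch, len(vocab))
    return vocab

def init_embeddings(vocab_size, dim, seed=0):
    """Randomly initialized embedding table: one d-dimensional vector per token id."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(vocab_size)]

def embed(text, vocab, table):
    """Return the sequence X = (x_1, ..., x_n) for a text of n tokens."""
    return [table[vocab.get(ch, 0)] for ch in text]

def dropout(seq, rate, rng):
    """Inverted dropout: zero each component with probability `rate`, rescale the rest."""
    keep = 1.0 - rate
    return [[v / keep if rng.random() < keep else 0.0 for v in vec] for vec in seq]

# Hypothetical example text; real inputs would be medical record texts.
vocab = build_vocab(["left lung CA 3.5x2.0cm"])
table = init_embeddings(len(vocab), dim=8)
X = embed("lung CA", vocab, table)                 # n = 7 tokens, d = 8
X_train = dropout(X, rate=0.5, rng=random.Random(1))
```

In a real model the embedding table would be a trainable parameter, and dropout would be active only during training.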
The training data of the BiLSTM-CRF model adopt the BIO labeling scheme; the data are processed into a format suitable for model training according to the manual annotations.
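As an illustration of the BIO scheme, the following sketch converts character-level annotation spans into one tag per token. The span format `(start, end, label)` with an exclusive end offset and the label name `SITE` are assumptions made for this example; the actual annotation format of the CCKS data may differ.

```python
def to_bio(text, spans):
    """Convert (start, end, label) spans (end exclusive) into one BIO tag per character."""
    tags = ["O"] * len(text)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the entity
    return tags

# Hypothetical example: "left lung" is annotated as a primary tumor site.
text = "left lung CA"
tags = to_bio(text, [(0, 9, "SITE")])
```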
The candidate words of the primary tumor site may include several candidates that refer to the same body part at different granularities, so they need to be screened. The screening follows the principle of refinement: the most precise description of the body part is retained. For example, if both "lung" and "left upper lobe" are candidates, then "left upper lobe" is selected as the primary site of the tumor (Figure 2).
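The screening rule can be sketched as follows, assuming that a part-of hierarchy of body parts is available. The `PARENT_OF` mapping below is a hypothetical stand-in for such a resource; the paper does not state how granularity is determined.

```python
def screen_candidates(candidates, parent_of):
    """Keep the most specific candidates: drop any candidate that is an ancestor of another one."""
    def ancestors(term):
        seen = set()
        while term in parent_of and parent_of[term] not in seen:
            term = parent_of[term]
            seen.add(term)
        return seen

    covered = set()
    for cand in candidates:
        covered |= ancestors(cand)
    return [cand for cand in candidates if cand not in covered]

# Hypothetical anatomy hierarchy (child -> parent), for illustration only.
PARENT_OF = {"left upper lobe": "left lung", "left lung": "lung"}
selected = screen_candidates(["lung", "left upper lobe"], PARENT_OF)
```

Candidates from unrelated body parts are unaffected, because neither is an ancestor of the other.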

Figure 2. Bidirectional LSTM for Brain Tumor Detection
The size of the primary tumor is composed of numbers, length units (mm or cm), and binary symbols representing multiplication (;, ×, X, etc.), combined according to specific rules. The method first finds all strings in the medical record text that match this defined form and then extracts them as candidates for the size of the primary tumor.
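A pattern of this kind can be sketched with a regular expression. The exact rules used in the paper are not given, so the pattern below is an assumption: one or more numbers separated by a multiplication symbol, ending in a length unit, with `×`, `x`, `X`, and `*` accepted as multiplication symbols.

```python
import re

NUM = r"\d+(?:\.\d+)?"      # integer or decimal number
UNIT = r"(?:mm|cm)"         # length units named in the text
TIMES = r"[×xX*]"           # assumed set of multiplication symbols

# Zero or more "number [unit] times" groups, then a final number with a unit.
SIZE_RE = re.compile(
    rf"(?:{NUM}\s*{UNIT}?\s*{TIMES}\s*)*{NUM}\s*{UNIT}", re.IGNORECASE
)

def extract_size_candidates(text):
    """Return (matched_text, start_offset) for every size-like string in the text."""
    return [(m.group(), m.start()) for m in SIZE_RE.finditer(text)]

# Hypothetical example sentence.
sample = "left upper lobe mass 3.5×2.0cm, lymph node 12mm, age 65"
size_candidates = extract_size_candidates(sample)
```

Requiring a trailing unit keeps bare numbers such as an age from being picked up as size candidates.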
Next, the primary tumor site candidates and the primary tumor size candidates are combined to obtain candidate tuples for the tumor size relationship. The principle of combination is that the primary tumor site should appear before the primary tumor size candidate in the medical record text. Due to the randomness of natural language, there are many abbreviations and omissions in the medical record text. To solve the above problems, this article is for each tumor source.
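The combination principle stated above (the site must precede the size in the text) can be sketched as a simple pairing step. The candidate format `(text, start_offset)` and the function name are assumptions for illustration; how the paper subsequently scores or filters the tuples is not covered here.

```python
def build_size_tuples(sites, sizes):
    """Pair each site candidate with every size candidate that starts after the site ends."""
    return [
        (site_text, size_text)
        for site_text, site_pos in sites
        for size_text, size_pos in sizes
        if site_pos + len(site_text) <= size_pos
    ]

# Hypothetical example: the first size appears before the site and is not paired with it.
text = "12mm nodule noted earlier; left upper lobe mass 3.5×2.0cm"
sites = [("left upper lobe", text.index("left upper lobe"))]
sizes = [("12mm", text.index("12mm")), ("3.5×2.0cm", text.index("3.5×2.0cm"))]
tuples = build_size_tuples(sites, sizes)
```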

Datasets and Result Evaluation
CCKS2019 released a tumor-related medical event extraction evaluation task, with 1000 annotated tumor-related medical record texts as the training set and 400 as the test set; CCKS2020 released 1000 annotated tumor-related medical record texts as the training set and 300 as the test set. These two datasets are used to verify the effectiveness of the method in this paper. This paper uses the standard precision (P), recall (R), and micro-averaged F1 as model evaluation indicators, computed as follows:

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F1 = \frac{2 \times P \times R}{P + R}$$

Here, TP stands for true positives, that is, the number of attributes that the model predicts as positive and that are actually positive; FP stands for false positives, that is, the number of attributes that the model predicts as positive but that are actually negative; FN stands for false negatives, that is, the number of attributes that the model predicts as negative but that are actually positive.
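The micro-averaged indicators can be computed by pooling TP, FP, and FN over all records before taking the ratios. The sketch below assumes that each record is represented as a set of `(attribute, value)` pairs and that a prediction counts as correct only on an exact match; the official CCKS scoring script may use a different matching criterion.

```python
def micro_prf(gold, pred):
    """Micro-averaged precision, recall, and F1 over lists of per-record (attribute, value) sets."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)    # predicted and correct
        fp += len(p - g)    # predicted but not in the gold annotation
        fn += len(g - p)    # in the gold annotation but not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical gold annotations and predictions for two records.
gold = [
    {("site", "left upper lobe"), ("size", "3.5×2.0cm")},
    {("metastasis", "liver")},
]
pred = [
    {("site", "left upper lobe"), ("size", "3.5cm")},
    {("metastasis", "liver"), ("metastasis", "bone")},
]
p, r, f1 = micro_prf(gold, pred)
```

In this example TP = 2, FP = 2, and FN = 1, so P = 1/2, R = 2/3, and F1 = 4/7.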

Experimental Results
The method in this paper won third place in the CCKS2020 electronic medical record-based clinical medical event extraction evaluation task. This section first gives the top five results of this evaluation task. The results are listed in Table 2.

Table 2. Medical event extraction on CCKS2020 dataset.

Team          F1 Score
DST [1]       76.23
TMAIL [2]     74.58
LHJB [3]      73.25
ARALOAK [4]   72.73

The teams DST and TMAIL both use the RoBERTa pre-trained language model. The DST team crawled 960,000 medical texts from the Internet to fine-tune the RoBERTa model and used rules to clean the data and replace characters. The TMAIL team also used rules to perform preprocessing operations, such as data cleaning and character replacement. The methods proposed by DST and TMAIL and the method in this article all use the BiLSTM-CRF model as the main component. However, based on an analysis of the properties of the primary tumor site and primary tumor size attributes, the method in this paper uses two BiLSTM-CRF models to extract these two attributes. Therefore, from an empirical point of view, because the method is tailored to these two attributes, it should be more robust and yield better extraction results under the same conditions. In addition, compared with the methods proposed by DST and TMAIL, as presented in Figure 3, the advantages of this method are:

1. Not using the RoBERTa pre-trained language model, but using randomly initialized token-embedding representations;
2. Not using any external resources;
3. Not using rules for preprocessing operations such as dataset cleaning and character replacement.

Because no rules are used to preprocess the data, the method in this article has better generalization capability; because no external resources or pre-trained language models are used, it requires fewer computing resources. To verify the superiority of the joint extraction method proposed in this article, the method and CCMNN were both run on the CCKS2019 and CCKS2020 datasets. Each model was trained on the training set and tested on the corresponding test set; the experimental results are listed in Table 3. For fairness, the pseudo-data-generation algorithm proposed in this article was not used in either method. It can be seen from Table 3 that, on both datasets, the method in this paper consistently outperforms CCMNN, which proves the effectiveness of the joint extraction method proposed in this paper. Specifically, compared to CCMNN, the absolute F1 value of the method in this paper increases by 3.13 on the CCKS2019 dataset and by 4.14 on the CCKS2020 dataset.
In addition, it can be seen from Figure 4 that, for both the method in this paper and CCMNN, the performance difference between the two datasets is significant, mainly for the following two reasons:

1. To evaluate the transfer learning ability of the methods, the training set and test set of the CCKS2020 dataset have fairly different data distributions;
2. The data distributions of the CCKS2019 and CCKS2020 datasets are quite different, as seen in Figures 5 and 6.
To further explore the advantages of this method compared with CCMNN, we further analyze the per-attribute statistics of the method in this article. Since this method and CCMNN use the same method to extract tumor metastasis sites, the extraction results are also the same, so they are not shown in Table 4. It can be seen from Table 4 that, on the two datasets, the method in this paper and the CCMNN method achieve nearly the same performance in the extraction of the primary tumor site. However, in the extraction of the primary tumor size, the method in this paper significantly improves the absolute F1 values over CCMNN, by +8.93 (CCKS2019) and +7.51 (CCKS2020), respectively, as shown in Figure 7. Compared with CCMNN, which uses a rule-based method to extract the size of the primary tumor, the joint extraction method proposed in this paper can effectively improve the extraction performance for the primary tumor size, achieving the research purpose of this article.
To improve the transfer learning ability of the method in this paper, this paper proposes a pseudo-data-generation algorithm based on the global random replacement of crucial information. DST and TMAIL use similar pseudo-labeling algorithms. For example, TMAIL globally reorders the sentences of the medical record texts to obtain 2800 pseudo-annotated samples.
To verify the effectiveness of the pseudo-data-generation algorithm proposed in this article, we conducted a series of experiments on the CCKS2020 dataset. First, the algorithm was used to generate 2000 pseudo-labeled samples; then, for different combinations of training data, the neural network model was trained with the method in this article and evaluated on the CCKS2020 test set. The results are presented in Figures 8 and 9, respectively.
A total of five sets of experiments were conducted. Train refers to the 1000 training samples of CCKS2020; test refers to the 300 test samples of CCKS2020; train +500, train +1000, train +1500, and train +2000 refer to train with the corresponding amount of pseudo-labeled data added. In addition, due to the randomness of the pseudo-labeled data, the above experimental process was repeated ten times, and the average F1 value of these ten runs was taken as the final F1 value. It can be seen that, when 1000 pieces of pseudo-labeled data are added to train, the method in this paper obtains an F1 value of 74.68, which surpasses the F1 value (73.52) that the method obtained in the CCKS2020 medical event extraction evaluation task. In addition, we can also conclude that, as the amount of pseudo-labeled data added to the training set increases, the performance of the method in this paper first increases and then decreases after reaching a peak. Still, the pseudo-labeled data is always beneficial to the method in this paper. The reasons are:

1. The algorithm can significantly expand the number and types of labeled medical record texts, which is critical to improving model performance;
2. The pseudo-labeled samples generated by this algorithm are random and may not necessarily match real scenarios. Therefore, adding too much pseudo-labeled data interferes with the correctness of the model to a certain extent, which limits the improvement in model performance.
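The paper does not spell out the pseudo-data-generation algorithm, so the following is only one plausible reading of "global random replacement of crucial information": each annotated attribute value in a record is replaced with a value of the same attribute type drawn at random from a pool collected over the whole training set. The record format, label names, and function names are all assumptions made for this sketch, and spans are assumed not to overlap.

```python
import random

def collect_pools(records):
    """Gather, per attribute label, every annotated value seen anywhere in the training set."""
    pools = {}
    for rec in records:
        for start, end, label in rec["spans"]:
            pools.setdefault(label, set()).add(rec["text"][start:end])
    return {label: sorted(values) for label, values in pools.items()}

def make_pseudo_record(rec, pools, rng):
    """Replace each annotated span with a random value of the same label and recompute offsets."""
    text, pieces, new_spans = rec["text"], [], []
    cursor = length = 0
    for start, end, label in sorted(rec["spans"]):
        prefix = text[cursor:start]              # unannotated text is copied unchanged
        replacement = rng.choice(pools[label])
        length += len(prefix)
        new_spans.append((length, length + len(replacement), label))
        length += len(replacement)
        pieces += [prefix, replacement]
        cursor = end
    pieces.append(text[cursor:])
    return {"text": "".join(pieces), "spans": new_spans}

# Hypothetical annotated records with (start, end, label) spans, end exclusive.
records = [
    {"text": "left upper lobe CA, 3.5×2.0cm", "spans": [(0, 15, "SITE"), (20, 29, "SIZE")]},
    {"text": "gastric antrum CA, 12mm", "spans": [(0, 14, "SITE"), (19, 23, "SIZE")]},
]
pools = collect_pools(records)
pseudo = make_pseudo_record(records[0], pools, random.Random(0))
```

Because replacements are drawn at random, a generated record can pair a site with an implausible size, which is consistent with the interference effect described in the second reason above.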

Conclusions
This article proposes a joint extraction method for medical events, which realizes the joint extraction of two tumor event attributes. It also presents a pseudo-data-generation algorithm based on the global random replacement of crucial information, which improves the model's transfer learning ability. The method in this paper won third place in the CCKS2020 electronic medical record-based clinical medical event extraction evaluation task. A large number of experiments on the CCKS2019 and CCKS2020 datasets show that the performance of the method in this paper is greatly improved compared with the CCMNN method; in particular, the performance of primary tumor size extraction has been dramatically improved, and the research purpose of this article has been achieved. However, the pseudo-data-generation algorithm proposed in this article is highly random, so the generated pseudo-data do not necessarily conform to natural semantics, which damages the model to a certain extent. Therefore, we will next study pseudo-data-generation algorithms based on semantic similarity replacement to improve the quality of the generated pseudo-data and further improve the model's performance.