Automated Generation of Synoptic Reports from Narrative Pathology Reports in University Malaya Medical Centre Using Natural Language Processing

Pathology reports represent a primary source of information for cancer registries. University Malaya Medical Centre (UMMC) is a tertiary hospital responsible for training pathologists; thus narrative reporting remains important. However, unstructured free-text reports make information extraction tedious for clinical audits and data analysis-related research. This study aims to develop an automated natural language processing (NLP) algorithm that summarizes existing narrative breast pathology reports from UMMC into narrower, structured synoptic pathology reports, together with a checklist-style report template to ease the creation of pathology reports. The rule-based NLP algorithm was developed in the R programming language using 593 pathology specimens from 174 patients provided by the Department of Pathology, UMMC. A pathologist provided specific keywords for the data elements to define the semantic rules of the NLP. The system was evaluated by calculating precision, recall, and F1-score. The proposed NLP algorithm achieved a micro-F1 score of 99.50% and a macro-F1 score of 98.97% on 178 specimens with 25 data elements. This achievement corresponds to clinicians’ needs and could improve communication between pathologists and clinicians. The study is significant because structured data is easily minable and could generate important insights.


Introduction
In the year 2020, there were 48,639 Malaysians diagnosed and reported with cancer. Among them, 17.3% (8418) had suffered from breast cancer [1]. Each verified cancer diagnosis is based on tissue histology, which is documented in a pathology report. A breast pathology report is a medical document that contains the description of breast cells and tissues, called specimens, made by a pathologist based on microscopic evidence and used to make a diagnosis of disease [2]. By reading the description from the report, the clinicians can determine whether the tissue is cancerous or noncancerous, and consequently decide the best treatment solution for the patient.
To date, traditional narrative pathology reporting comprises the following three main sections: macroscopy, microscopy, and gross description. It is still the preferred standard in most clinical institutions [3][4][5], especially in University Malaya Medical Centre (UMMC), which serves as the training centre to train the trainees on how to write a

Study Setting
The study was conducted at the Department of Pathology, UMMC. A total of 298 narrative breast pathology reports with 593 specimens in Docx format from 174 patients, written by UMMC pathologists, were obtained from the Laboratory Information System (LIS). The dataset is relatively small because UMMC uses a closed-architecture third-party LIS, which restricts access by non-medical personnel and prevents the direct export of pathology reports. The pathologist was therefore required to copy each pathology report manually from the LIS into a Docx file, which increased the time and difficulty of obtaining the reports within the time constraint. However, all the reports obtained were randomly selected to reduce bias in the dataset.
In the initial step, the reports of a patient are extracted from the system. The extracted document may contain several related breast pathology reports of the same patient, so the next step is to separate these reports. Moreover, a pathology report may include one or more specimens (number of specimens in this study = 593). Figure 1 shows the composition of a patient's pathology reports. These reports contain three main sections: macroscopy, microscopy, and interpretation. The macroscopy section delineates the size of lesions and their margins as observed with the naked eye. The microscopy section describes the size of lesions and their margins in the samples under the microscope. It also describes whether the cancer cells are in the lymph channels or lymph nodes. Pathologists also report the results of the immunohistochemistry (IHC) test in the microscopy section. The interpretation section describes the overall condition of the examination, such as the breast laterality, type of procedure, histologic type, and grade.
This study was approved by the Medical Research Ethics Committee (MREC), UMMC, Kuala Lumpur (MREC ID NO: #733.22) to develop a point of care data capture for institutional breast cancer registry. The data used were de-identified secondary data from EMRs. In the development of the rule-based NLP algorithm in this study, a total of 415 pathology specimens from 174 patients (70% of the total dataset) were randomly selected for the training process, while the remaining 178 pathology specimens from 52 patients (30% of the total dataset) were used as testing data to evaluate the information extraction ability of the algorithm. Figure 2 shows the proposed workflow. The pathologist from the Department of Pathology, UMMC, involved in this study identified several key data elements from the pathology report. Table 1 shows the data elements identified by the pathologist.

• Report's reference number -Unique ID for a report.

Interpretation section
• Type of procedure -Procedure used to obtain specimen.
• Specimen laterality -Site of breast involved.
• Histologic type -Description of how a tumor looks under a microscope.
• Histologic grade -Nottingham modification of Bloom-Richardson (NSBR) grading system based on tumor tubule formation, number of mitotic figures in most active areas and nuclear pleomorphism.

Macro and microscopy section
• Lesion -Type of lesion and its size observed macroscopically and microscopically.
• Margins -Distance of lesion from different margins.
• DCIS appearances -Architectural growth pattern of DCIS.
• Lymphovascular invasion -Presence of tumor cells in lymphatics or blood vessels.
• Skin change involvement -Presence of skin change, including puckering, dimpling, a rash, or redness of the skin of the breast.
• Paget disease -Presence of eczema-like changes to the skin of the nipple and the area of darker skin surrounding the nipple.
• Regional lymph nodes -Number of lymph nodes examined and number of lymph nodes involved by tumor cells.

Rule-Based NLP Algorithm
The algorithm in this study was implemented in the R programming language, version 3.6.1. The first version of the NLP algorithm, comprising a set of rules to convert the narrative report into a synoptic format, was developed based on the training set. It was then improved according to the pathologist's suggestions after the pathologist manually verified the extracted results. These rules were interpreted in a specific order, called a decision list, to resolve ambiguity in the rule-based system.
Before the extraction process, the algorithm proposed in this study automatically separates the different reports of the same patient, and each specimen within the same report, by recognizing the report reference number in the document, as illustrated in Figure 1. As a result, one free-text diagnosis row was available for each specimen as an input to the NLP algorithm. Figure 3 illustrates the data input of the NLP algorithm obtained from a report.
In the first pre-processing step after reading the single diagnosis row as input, all the characters were changed to lowercase to avoid case-sensitivity issues during key element extraction. Next, meaningless symbols and blank spaces that were introduced when the reports were moved across platforms from the LIS were removed. Roman numerals used to indicate the histologic grade value were converted to Arabic numerals to keep the format consistent. Word stemming using the R package "hunspell" [19] was performed to reduce the variety of terms in the report.
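The normalization steps above can be sketched as follows. The paper's implementation is in R; this is a minimal Python illustration only, and the set of characters kept, the `ROMAN_TO_ARABIC` mapping, and the function name `normalize` are assumptions rather than the authors' rules. Stemming and spell checking are left out.

```python
import re

# Hypothetical mapping; the paper only states that Roman-numeral grades were converted.
ROMAN_TO_ARABIC = {"iii": "3", "ii": "2", "i": "1"}


def normalize(text: str) -> str:
    """Lowercase, drop stray symbols, convert grade numerals, collapse whitespace."""
    text = text.lower()
    # Keep letters, digits, whitespace, and punctuation that carries meaning
    # in measurements and scores (assumed character set).
    text = re.sub(r"[^a-z0-9\s.,:;%<>()/×+\-]", " ", text)
    # Convert a Roman numeral only when it directly follows the word "grade".
    text = re.sub(
        r"\bgrade\s+(iii|ii|i)\b",
        lambda m: "grade " + ROMAN_TO_ARABIC[m.group(1)],
        text,
    )
    return re.sub(r"\s+", " ", text).strip()
```

Restricting the numeral conversion to the word following "grade" is a design choice made here to avoid rewriting ordinary words such as "in"; the original rules may differ.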
Since the raw pathology reports were narrative, the possibility of them containing misspelled terms is high, which might reduce the efficiency of the following extraction step. To correct misspellings in the input data, our NLP algorithm included a spell checker function that identifies misspelled words and returns the correct version by utilizing the R package "hunspell". This package was selected because of its convenience in adding customized dictionaries to the current dictionary without overwriting the existing contents. After normalization, the specimen in the report was split into three main sections by recognizing the occurrence of labels for each section, creating semi-structured data. Successive pre-processing steps used a subset of sections or applied specific rules to different sections, which greatly reduced the computation time. Each section was further subdivided into sub-rows, one for every new paragraph. Then, each sub-row was analyzed using NLP rules.
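The section-splitting step could look like the Python sketch below (the paper's code is in R). It assumes, purely for illustration, that each section starts on its own line with a label followed by a colon and that every line is one paragraph; the real label formats in the UMMC reports are not given in the text.

```python
import re

# Assumed label format: "<label>:" at the start of a line.
SECTION_LABEL = re.compile(
    r"^[ \t]*(macroscopy|microscopy|interpretation)[ \t]*:", re.I | re.M
)


def split_sections(specimen_text: str) -> dict:
    """Split one specimen's text into sections, each a list of sub-rows."""
    matches = list(SECTION_LABEL.finditer(specimen_text))
    sections = {}
    for i, m in enumerate(matches):
        # A section runs until the next label, or to the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(specimen_text)
        body = specimen_text[m.end():end]
        # One sub-row per non-empty line (treated here as a paragraph).
        sections[m.group(1).lower()] = [r.strip() for r in body.splitlines() if r.strip()]
    return sections
```

Later rules can then be applied only to `sections["interpretation"]` or another subset, which is the source of the computation-time saving described above.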
When performing information extraction, several R lists were created. The structure of the list that contains a set of syntactic expressions used to match patterns within the input text is shown in Equation (1), where PredefinedList is an R list that stores n regular expressions for each specific data element.
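As an illustration of a predefined list of regular expressions per data element, a Python analogue of the R lists might look like this. The element names, the patterns, and the helper `match_element` are hypothetical; the expressions actually used in the study are the ones listed in the paper's Table 2.

```python
import re

# Hypothetical patterns for illustration only.
PREDEFINED_LISTS = {
    "type_of_procedure": [r"\bmastectomy\b", r"\bwide local excision\b", r"\bcore biopsy\b"],
    "specimen_laterality": [r"\bleft\b", r"\bright\b"],
}


def match_element(element, rows):
    """Return the first text matched by any pattern of `element`, scanning rows in order."""
    for row in rows:
        for pattern in PREDEFINED_LISTS[element]:
            m = re.search(pattern, row)
            if m:
                return m.group(0)
    return None  # the data element was not found in these rows
```

For example, `match_element("type_of_procedure", ["left breast", "wide local excision specimen"])` returns `"wide local excision"`.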
The type of procedure is commonly reported in the interpretation section. The algorithm matches the text of the interpretation section against the predefined list. However, in some cases, the pathologists may report the procedure type in the macroscopy section. Consequently, the algorithm analyzes the diagnosis rows in the macroscopy section only if no result was found in the interpretation section. For the extraction of the examination date, the R package "lubridate" [20] was utilized. Integrating this package into the algorithm simplifies the extraction of dates written in various formats, such as day (D)/month (M)/year (Y), M/D/Y, or Y/M/D, where the month can be written in a numerical or alphabetical format. The algorithm automatically outputs the dates extracted from the report in the format Y/M/D to ensure a uniform date format across all reports.
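The idea of normalizing dates to Y/M/D can be sketched with Python's standard library instead of "lubridate". This is a substitute technique, not the package's behavior: the list `FORMATS` and its order are assumptions, and the sketch reads ambiguous numeric dates day first.

```python
import re
from datetime import datetime

# Candidate formats tried in order; the order acts as the tie-breaker
# for ambiguous dates (assumed: day before month).
FORMATS = ["%d/%m/%Y", "%d/%m/%y", "%Y/%m/%d", "%d %B %Y", "%d %b %Y"]


def to_ymd(raw: str):
    """Return the date as 'YYYY/MM/DD', or None if no candidate format fits."""
    cleaned = re.sub(r"[.\-]", "/", raw.strip())  # unify separators
    for fmt in FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).strftime("%Y/%m/%d")
        except ValueError:
            continue
    return None
```

Because the first matching format wins, a string such as "5.6.19" is read here as 5 June 2019; placing a month-first format earlier in the list would give 6 May instead.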
To extract the types of lesion with their sizes and their distances to different margins from the macroscopy and microscopy sections, the algorithm first draws out all three-dimensional measurements, followed by two-dimensional and then one-dimensional measurements, in every row of input. This ordering avoids duplication in the extraction of measurements. The measurements of the specimens, normally three-dimensional, are written in the same format as the sizes of the lesions reported by the pathologists. However, the specimen size is less important for a clinician when reviewing a patient. Therefore, our algorithm was designed to determine whether a measurement refers to a specimen or a tumor by recognizing the keyword "specimen" or other terms used to describe a lesion in the sentence. A margin distance is extracted only when the following criteria are met: a one-dimensional measurement (1D) or a measurement within a range (R), and a margin keyword (K), are present in the sentence, as shown in Equation (2).
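A minimal Python sketch of the measurement priority and the margin criterion follows. It assumes lowercase input with "cm" or "mm" units, and it covers only the one-dimensional case of the margin rule (ranges are omitted). The regular expressions and function names are illustrative, not the authors' expressions.

```python
import re

NUM = r"\d+(?:\.\d+)?"
ONE_D = rf"{NUM}\s*(?:cm|mm)"
SEP = r"\s*[x×]\s*"
# Longest alternative first, so a 3D size is never re-read as three 1D sizes.
MEASUREMENT = re.compile(
    rf"{ONE_D}{SEP}{ONE_D}{SEP}{ONE_D}|{ONE_D}{SEP}{ONE_D}|{ONE_D}"
)
# A 1D measurement followed by "from [the] <name> margin".
MARGIN = re.compile(rf"({ONE_D})\s+(?:away\s+)?from\s+(?:the\s+)?(\w+)\s+margin")


def lesion_sizes(sentence):
    """Return the measurements in a sentence, skipping specimen descriptions."""
    if "specimen" in sentence:
        return []
    return MEASUREMENT.findall(sentence)


def margin_distances(sentence):
    """Map each named margin to the 1D distance reported for it."""
    return {name: dist for dist, name in MARGIN.findall(sentence)}
```

Putting the three-dimensional alternative first in the pattern is one simple way to get the "3D, then 2D, then 1D" priority without a second pass to remove duplicates.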
When extracting the presence (P) of lymphovascular invasion, skin change, and Paget disease, the sentences containing the keywords for these data elements (K) were identified, as shown in Equation (3). Negation detection was then performed on these sentences using a set of defined linguistic rules. A list of negation words (n) used in clinical writing was created from the training dataset. Then, the positions of the negation terms in the sentence were located. Different data elements may be present in the same sentence; hence, with these positions defined, the algorithm can identify which data element was negated by selecting the closest negation term to the specific keyword (N) (see Equation (4)). The absence of the targeted data elements is defined in Equation (5).
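One possible reading of the closest-negation rule is sketched below in Python. The negation list, the single-token keywords, and the choice to assign each negation term to its nearest keyword by token distance are all assumptions made for illustration; the paper's Equations (3) to (5) define the actual rule.

```python
import re

NEGATIONS = {"no", "not", "without", "absent", "negative"}  # illustrative list
KEYWORDS = {  # hypothetical single-token keyword per data element
    "lymphovascular_invasion": "lymphovascular",
    "skin_change": "skin",
    "paget_disease": "paget",
}


def presence(sentence):
    """Map each data element mentioned in the sentence to 'present' or 'absent'."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    # Token position of every keyword that occurs in the sentence.
    positions = {el: tokens.index(kw) for el, kw in KEYWORDS.items() if kw in tokens}
    result = {el: "present" for el in positions}
    for i, tok in enumerate(tokens):
        if tok in NEGATIONS and positions:
            # The negation term is attributed to the nearest keyword.
            nearest = min(positions, key=lambda el: abs(positions[el] - i))
            result[nearest] = "absent"
    return result
```

With the sentence "Lymphovascular invasion is present but no Paget disease is seen.", the term "no" is one token away from "paget" and five away from "lymphovascular", so only Paget disease is marked absent.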
Two grading systems were used in pathology reporting: the NSBR grading system and the DCIS grading system. The NSBR grading system uses a numeric value (grade 1 to 3), while the DCIS grading system uses the terms low, intermediate, and high. This difference in grading helps our algorithm to differentiate these data elements. Most of the time, the number of lymph nodes examined and the number of lymph nodes that showed malignancy were reported together in one sentence. Hence, more semantic rules were required to identify the sentence boundary and extract the result correctly, as the responses of both data elements were in the same numeric format. The algorithm draws a new sentence boundary by identifying the presence of a comma or conjunction keywords.
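The distinction between the two grading systems can be illustrated with two patterns, one numeric and one term-based. This Python sketch assumes normalized lowercase text in which Roman numerals have already been converted; the exact phrasing matched by the authors' rules is not given, so both patterns are hypothetical.

```python
import re

NSBR = re.compile(r"\bgrade\s*:?\s*([1-3])\b")                     # numeric grade
DCIS = re.compile(r"\b(low|intermediate|high)(?:\s+nuclear)?\s+grade\b")  # term grade


def extract_grades(text):
    """Return the numeric histologic grade and the DCIS grade term, if present."""
    nsbr = NSBR.search(text)
    dcis = DCIS.search(text)
    return {
        "histologic_grade": int(nsbr.group(1)) if nsbr else None,
        "dcis_grade": dcis.group(1) if dcis else None,
    }
```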
To extract the histologic type of a specimen, a list of commonly used morphology terms was predefined. When an identified morphology description is fully contained in another description that also matches the predefined list, the more specific one is prioritized. For example, "papillary carcinoma" is contained in the more specific description "papillary carcinoma with invasion"; with the decision list in place, the more precise description is extracted first. In the ancillary studies, which included the breast biomarker ER, PR, and HER2 testing results, the terms used to describe the outcomes for biomarkers ER and PR were "positive" or "negative" or a percentage of staining. In HER2 result reporting, other than the terms "positive" and "negative", some pathologists report the result using a scoring method from score 0 to 3+. Pathologists often describe these three biomarkers in the same sentence; hence, the locations of the biomarkers and their respective results are critical for getting the correct result during extraction. The distance formula (see Equation (4)) was applied to identify the test result of the targeted biomarker. Table 2 lists examples of the regular expressions used for the key data elements.
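The "more specific description first" rule for histologic type can be approximated by ordering the term list from longest to shortest, as in this Python sketch. The term list is a small illustrative sample and sorting by length is an assumption; the study's decision list may order its rules differently.

```python
# Illustrative sample of morphology terms, not the study's full list.
MORPHOLOGY_TERMS = [
    "papillary carcinoma",
    "papillary carcinoma with invasion",
    "invasive carcinoma",
]


def histologic_type(text):
    """Return the most specific (longest) morphology term found in the text."""
    for term in sorted(MORPHOLOGY_TERMS, key=len, reverse=True):
        if term in text:
            return term
    return None
```

Checking longer terms first means "papillary carcinoma with invasion" is returned before its substring "papillary carcinoma" is ever tested.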

Synoptic Report
The synoptic report proposed in this study does not merely place the running text of different elements on separate lines; it follows a more confined structure, as defined by the CAP [8]. The required data elements adhere to a paired format in which each required data element is followed by a response. Therefore, the data elements and their responses extracted from the narrative report by the NLP algorithm were displayed on separate lines.
A checklist-style pathology reporting template was also created as an alternative way to create a synoptic report. Both the checklist-style pathology reporting template and the NLP algorithm shared the same database in order to keep the data input consistent. Since the results extracted by the NLP algorithm were in a structured format, the data can be easily integrated into the MySQL database through the R package "RMySQL" [21]. The primary key "reportID" in table report_info acts as the foreign key in the rest of the database tables to enable linking among tables. Figure 4 illustrates the entity-relationship diagram (ERD) of the database.
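The primary key and foreign key relationship can be illustrated with a small in-memory example. This sketch uses Python's built-in SQLite module rather than MySQL and "RMySQL"; only the names report_info and reportID come from the text, while the `margin` table, its columns, and the sample values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript(
    """
    CREATE TABLE report_info (
        reportID  TEXT PRIMARY KEY,
        exam_date TEXT
    );
    CREATE TABLE margin (              -- hypothetical child table
        id          INTEGER PRIMARY KEY,
        reportID    TEXT NOT NULL REFERENCES report_info(reportID),
        margin_name TEXT,
        distance    TEXT
    );
    """
)
conn.execute("INSERT INTO report_info VALUES (?, ?)", ("R001", "2019/06/05"))
conn.execute(
    "INSERT INTO margin (reportID, margin_name, distance) VALUES (?, ?, ?)",
    ("R001", "deep", "0.5 cm"),
)
# Linking the tables through reportID recovers the responses for one report.
rows = conn.execute(
    "SELECT r.reportID, m.margin_name, m.distance "
    "FROM report_info r JOIN margin m ON m.reportID = r.reportID"
).fetchall()
```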

Evaluation of the NLP Algorithm
Human validation was performed to evaluate the performance of the NLP algorithm. The pathologist in UMMC was provided with all the unannotated narrative pathology reports and the synoptic pathology reports generated automatically by the proposed algorithm, for both the training and testing datasets. To validate the result of the system, the pathologist marked each extracted result as either present or absent in the narrative pathology report and as correctly or incorrectly identified from the report. Borderline cases can occur when the number of specimens in each section does not match. For example, there were two examined specimens in the macroscopy section, while only one specimen was reported in the microscopy section. In this scenario, the accuracy of the extraction task may be affected. As a solution, the text mining program adds a reminder message about this problem at the beginning of the program-generated synoptic report, so that the clinicians can pay more attention to the problem addressed. Precision, recall, and F1 scores were computed for each classification category to evaluate the algorithm's performance:

$$\text{Precision} = \frac{\text{True positive}}{\text{True positive} + \text{False positive}}, \quad \text{Recall} = \frac{\text{True positive}}{\text{True positive} + \text{False negative}}, \quad \text{F1} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

These performance scores were first computed independently for each variable and then averaged (macro-average); in addition, the contributions of all variables were aggregated to compute the average (micro-average).
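The difference between macro- and micro-averaging can be made concrete with a short Python sketch. It takes macro-F1 as the mean of the per-element F1 scores and micro-F1 as the F1 of the pooled counts, which is the usual convention; the function names and the zero-division handling are assumptions.

```python
def prf(tp, fp, fn):
    """Precision, recall, and F1 from raw counts (0.0 when a ratio is undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def macro_micro_f1(counts):
    """`counts` maps each data element to a (TP, FP, FN) tuple."""
    # Macro: average the per-element F1 scores, each element weighted equally.
    macro = sum(prf(*c)[2] for c in counts.values()) / len(counts)
    # Micro: pool the counts over all elements, then compute a single F1.
    tp, fp, fn = (sum(c[i] for c in counts.values()) for i in range(3))
    micro = prf(tp, fp, fn)[2]
    return macro, micro
```

Micro-averaging is dominated by data elements with many responses, while macro-averaging lets a rarely reported element with a few errors pull the score down, which is why the two values can differ.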

NLP Algorithm to Extract Important Variables from Breast Pathology Report
Responses of 25 data elements were extracted by the proposed NLP algorithm. Once the NLP algorithm was refined and optimized on the training dataset, with performance scores of micro-F1 = 0.9959 and macro-F1 = 0.9931, the algorithm was tested on the testing set, which included 178 specimens, and achieved performance scores of micro-F1 = 0.9950 and macro-F1 = 0.9897. Table 3 shows the performance scores of the three metrics for the training data and testing data. Overall, the high F1-score on both the training and testing data shows the effectiveness of the proposed rule-based NLP algorithm in identifying the responses of the key data elements from the narrative pathology reports of a single institution, UMMC, and converting them into synoptic pathology reports with structured data.
Other than achieving high accuracy in data element response extraction, our proposed NLP algorithm was able to split each report from the same patient and to separate each specimen with its three major sections from a report correctly in all samples involved. The successful separation of each specimen helps to increase the efficiency of the proposed NLP algorithm in the following extraction step.
When reporting, the unit of a measurement, such as the size of the lesion or the distance from a specific margin, may be omitted. In Example 1-S1, the unit for the tumor's distance from the anterior margin was missing. This caused the algorithm to recognize the numeric value as a float number and exclude it during extraction. Consequently, our algorithm included a set of rules that differentiate a regular number from a measurement in order to add the missing unit. For example, the unit "cm" will be added to Example 1-S1, but not to S2.
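A rule of this kind might be sketched in Python as below. The assumption made here, for illustration only, is that a bare number counts as a measurement when it is directly followed by "from [the] <name> margin", and that "cm" is the default unit; the authors' actual rules are not spelled out in the text.

```python
import re

# A number with no unit, directly followed by "from [the] <name> margin".
MISSING_UNIT = re.compile(r"\b(\d+(?:\.\d+)?)(?=\s+from\s+(?:the\s+)?\w+\s+margin)")


def add_missing_unit(sentence, unit="cm"):
    """Append the default unit to margin distances that were written without one."""
    return MISSING_UNIT.sub(rf"\1 {unit}", sentence)
```

Applied to Example 1, S1 becomes "the tumor is 1 cm from the anterior margin and 1.5 cm from the deep margin.", while the block number in S2 is left unchanged because it is not followed by "from ... margin".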
Integrating the R package "lubridate" in the algorithm helped to differentiate date values from other numeric values such as size, distance, and the patient's register number. However, in some rare cases, the package "lubridate" extracts the date value incorrectly. For example, when the date is written as "5.6.19", the month value can be 5 or 6, depending on whether the decision list checks the month or the day first. This increased the number of false positives and reduced the precision score. In extracting the results (positive or negative) of the biomarker tests, the proposed algorithm can identify the value correctly even when both "positive" and "negative" appear in the same sentence, as shown in Example 2. The original text was "nuclear positivity" before word stemming. Our algorithm can also assign the correct result to different biomarkers even when they are mentioned in the same sentence (Example 3). This is achieved by calculating the distance between the biomarker's keyword and the results in the text.
When extracting the lesion size measurement, the sizes of specimens (Example 4-S1) were excluded from the extraction process even when they had the same format as a lesion size (Example 4-S2). Our algorithm differentiated the size measurements by recognizing the keyword "specimen" and other keywords that describe a lesion. Different lesion sizes (Example 4-S2 and S4) and margin distances (Example 4-S3 and S5) were extracted and stored separately in the database table to ensure correctness when generating the synoptic report. However, there are still some false positive cases, as shown in Example 5. The algorithm extracted the type of lesion involved as "cystic spaces", while the correct response should be "fibrosis". This is because both morphology terms are present in the predefined list, and the algorithm selects only the keyword nearest to the measurement.
In some cases, the reporting can be ambiguous, such as when the information written by pathologists is inconsistent across sections. In Example 6, the number of lymph nodes stated in the microscopy section is "6", while in the interpretation section, the number reported is "2". This leads to an increase in the false positive rate. The algorithm failed to extract the appearance of DCIS in Example 7 since the keyword "DCIS" is not present in the sentence; hence, the sentence was skipped by the algorithm, which increased the false negative value. In some rare cases, the lesion involved in the margin distance was not mentioned in the same sentence by the pathologists but was reported in a different section (Example 8). Hence, the algorithm could not extract the type of lesion involved in the specific margin, leading to an increase in the false negative rate.
The examples of diagnosis rows in the pathology reports are:

Example 1.
S1 "The tumor is 1 from the anterior margin and 1.5 cm from the deep margin."
S2 "Block 7-medial margin (shaved)."

Example 2. "ER: Negative, <1% of nuclear positive."

Example 3. "The tumour cells show strong nuclear positivity (>90%) for ER and PR and negative for HER2 (0-1+)."

Example 4.
S1 "A mastectomy specimen weighing 350 g, measuring 18 cm × 13 cm × 3 cm."
S2 "Serial cut sections show a well circumscribed tumour 1.7 cm × 1.5 cm × 0.6 cm."
S3 "It is abutting the deep margin which is 0.1 cm away, 1.5 cm from the superior margin, 3.6 cm from inferior margin, 0.1 cm from the medial margin and 17.5 cm from the lateral margin."
S4 "There is an area of firm whitish comedo-like lesion 3.5 cm × 2.5 cm × 2 cm."
S5 "It is 0.5 cm from the deep margin, 0.1 cm from inferior margin, 3.5 cm from superior margin, 2 cm from medial margin and 10 cm from lateral margin."

Example 5. "Shows scattered areas of fibrosis with a focal area displaying cystic spaces measuring 0.3 cm in maximum dimension."

Example 6. In microscopy section: "A total of 6 reactive lymph nodes present which shows no tumor involvement." In interpretation section: "2 lymph nodes. No tumour metastasis"

Example 7. "The neoplastic cells are arranged in a solid pattern with foci of comedo-necrosis."

Example 8. "The nearest margin is the deep margin at 1 mm away."

Synoptic Report
After the extraction step, the extracted values were matched to their data elements as a "response"; each data element with its corresponding response was displayed on a separate line. Table 4 illustrates some examples of converting narrative information in pathology reports to a synoptic format. A checklist-style reporting template (see Supplementary Materials) that shares the same database was created to keep the data structure consistent between existing and newly inserted reports.

Table 4. Examples of pathology report information in narrative and synoptic format, respectively.

Discussion
The growing volume of data in oncological diagnostics has increased the complexity of pathology reports. However, the pathology reports from the LIS are written as free text divided into sections instead of as distinct categories for each data element. Clinicians need to review and interpret the key points of the long narrative report to determine the subsequent treatment process. Most of the clinical information was stored in textual form, and the database structure differs from that used for structured data. In such a scenario, NLP offers an opportunity to automate the encoding of narrative reports into clean and structured data, hence producing a synoptic report as an alternative to the costly manual data extraction process [22,23]. However, the majority of the NLP techniques applied in the clinical domain focus mainly on primary tasks, such as ureteric stone identification [24], stroke detection [25], section label generation [26], or cancer morphology classification [27], or on classification tasks involving fields with only a few labels, for example, site, laterality, behavior, histology, and grade [6].
We developed an automated rule-based NLP algorithm to extract responses for 25 data elements from the pathology reports, thus converting them into synoptic pathology reports. These reports were highly variable and unstructured, because various pathologists with different writing styles labeled the same data element differently across reports. Moreover, the closed-architecture third-party information system used in the institution made exporting the existing pathology reports across platforms troublesome. A single exported document contained different pathology reports of the same patient. We therefore designed our algorithm to separate each pathology report automatically before performing the extraction process, to minimize the human effort.
The proposed algorithm was developed with the aim of being integrated into the current reporting pipeline in UMMC to produce synoptic reports that ease clinicians' audit and research work. To achieve this objective, it was necessary to reach the highest possible micro-average performance scores (best value at 1, worst at 0) [28]. Our proposed automated NLP algorithm achieved encouraging results on both the training and the testing datasets when compared with manual extraction by the expert pathologist (micro-F1 of 0.9950 on the test set). The data element microscopy margin distance had the highest number of false positive and false negative values, 2 out of 27 and 3 out of 27, respectively. These errors are mainly due to confusing writing formats used by some pathologists, such as combining different specimens into the same section or reporting a data element in multiple sections, which do not appear in most pathology reports. A similar work based on a self-supervised convolutional neural network-based algorithm was developed by Spandorfer et al. to convert unstructured narrative computed tomography pulmonary angiography (CTPA) reports into structured reports [26]. While their algorithm achieved a high accuracy of 91.6% and 95.9% using strict and modified criteria, respectively, it provided only the most basic structure by applying section labels to the sentences.
Despite the rise of machine learning techniques applied in NLP, rule-based NLP algorithms are still widely applied in clinical NLP, which differs considerably from the general NLP community. Because a rule-based algorithm is straightforward, it is easier for the developer to debug through iterative refinement with the clinician's feedback [29]. Besides that, machine learning-based NLP methods were mainly used for data prediction, estimation, and association mining. Furthermore, the difficulty of interpreting and correcting specific errors reported by the end-user, due to the black box mechanism of machine learning algorithms, has consolidated the popularity of rule-based NLP in the clinical domain.
In the study by Hammami et al., an automated cancer morphology classification based on a rule-based NLP approach was developed [27]. Their algorithm achieved a micro-F1 score of 98.14% on a single task, extracting cancer morphology codes as defined in the Third Edition of International Classification of Diseases for Oncology (ICD-O-3) from Italian pathology reports. Another rule-based NLP algorithm developed by Odisho et al. achieved an overall accuracy of over 94%. However, the data extraction from the prostate pathology reports involved only a small set of data elements, such as Gleason Score, margin status, extracapsular extension, seminal vesicle invasion, and TN stage, with a small range of possible values [30]. In addition, Bozkurt et al. developed three types of NLP algorithms (rule-based, deep learning, and a hybrid model) to compare their performance in classifying the severity of a prostate cancer patient using clinical notes. Their results show that the rule-based model achieved the highest accuracy of 0.86, outperforming the deep learning model with an accuracy of 0.73 and the hybrid model that combined both methods with an accuracy of 0.75 [31].
Even though generating the rules in rule-based NLP requires human effort, the transparency of the approach is essential in clinical applications, because it allows domain knowledge from knowledge bases or experts to be incorporated. Most machine learning-based techniques require an enormous set of well-curated input to deliver accuracy in a specific task, such as data prediction, classification, or association mining. For example, Levy et al. developed a current procedural terminology code predictor using support vector machine (SVM), extreme gradient boosting (XGBoost), and bidirectional encoder representations from transformers (BERT) models trained on 93,039 pathology reports [32]. In another study, Kalra et al. developed four types of machine learning-based NLP algorithms (linear SVM, radial basis function SVM, logistic regression, and extreme gradient boosting) to classify 1949 manually cleaned pathology reports into different diagnosis categories [7].
Synoptic reporting, an alternative reporting style, has shown a significant increase in the completeness of data elements in pathology reports across various diseases, including but not limited to breast, upper gastrointestinal, lung, colon, and prostate cancers [33][34][35][36][37]. Beyond these favorable quantitative outcomes, Yunker's study showed that synoptic reports reduce the time pathologists spend producing the report [38]. Furthermore, making the diagnosis report easier to read helps to improve its quality. In several studies, the use of a standardized proforma in colorectal cancer pathology reporting greatly improved the quality of the report by increasing the mean number of lymph nodes identified in the surgical specimen [36,39,40]. In addition, highly structured synoptic reports are more amenable to, and more valued for, secondary use, as most analysis models are derived from structured data [4].
Nevertheless, Lankshear's study drew the opposite conclusion regarding the time required to produce a cancer pathology report. Pathologists who were given a five-point Likert scale (1 = significantly less than narrative reports, 3 = about the same, and 5 = significantly more than narrative reports) reported that slightly more time is required (mean score = 3.51), which indicated that the synoptic report required 25% to 50% more time to complete. However, the majority of the physician group (60%) reported that the time required to obtain the final pathology report in synoptic format was about the same as in narrative format [9]. The length of the reports can be another issue for synoptic reports [5]. In narrative reporting, it is acceptable to omit an absent data element, whereas in synoptic reporting the absent data element is reported as "not applicable", which increases the length of the report. Overall, the advantages of synoptic reports outweigh the disadvantages; even so, synoptic reports have still not been widely adopted in the clinical domain. The most critical barrier to implementing synoptic reporting is pathologists' personal preference for the flexibility of the narrative report [41,42]. Another factor preventing successful implementation is the incompatibility of a new reporting format with existing work environments, such as the database structure used to store the EMRs [3,43].
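The effect on report length noted above, where absent data elements are still listed in a synoptic report, can be sketched as follows. This is a minimal illustration under stated assumptions: the checklist items in `TEMPLATE` and the function `render_synoptic` are hypothetical and do not reproduce the 25-element template used in this study.

```python
# Hypothetical checklist; the study's actual template has 25 data elements.
TEMPLATE = [
    "Histological type",
    "Tumour grade",
    "Lymphovascular invasion",
    "Margin status",
    "Margin distance",
]


def render_synoptic(extracted: dict, template=TEMPLATE,
                    placeholder: str = "Not applicable") -> str:
    """Emit one line per checklist item, filling absent items with a placeholder."""
    return "\n".join(
        f"{name}: {extracted.get(name, placeholder)}" for name in template
    )


# Only one element was found in the narrative text of this made-up specimen.
report = render_synoptic({"Tumour grade": "2"})
```

Here `report` always contains five lines, one per checklist item, even though only one element was found. A narrative report could simply omit the other four, which is why the synoptic format tends to be longer but also why it is complete and easy to parse.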
There are several limitations to this study. First, the proposed rule-based NLP algorithm achieved more than 95% accuracy for most data elements, but part of this accuracy depends on the underlying patterns in the pathology reports, which are mostly standardized at UMMC. In other words, there is still room for improvement in the generalizability of the algorithm. Second, the rules may require manual updating because only a small dataset was used: the 593 specimens involved in the algorithm development may not cover all scenarios, which can increase the number of false negatives. Third, the pathology reports are currently acquired manually by the pathologist, which is laborious. Future work is therefore needed to minimize human intervention by automatically obtaining and annotating the pathology reports from the LIS at UMMC. In addition, expanding the dataset to include reports from different institutions will be considered to increase the generalizability and further reduce the bias of the algorithm.
Our study showed significant progress in promoting the implementation of synoptic reporting in the clinical domain by presenting an automated way to convert existing narrative reports into synoptic reports that suit the working environment. This innovation serves not only to digitize old reports for audit and research purposes but also to support prospective digitized data collection.

Conclusions
This study demonstrates a novel NLP algorithm that extracts a larger set of data elements and their respective responses. It relies on ad hoc linguistic rules defined on 593 specimens from 174 patients and achieves a micro-F1 score of more than 98% on both the training and testing sets. A highly structured synoptic report can meet the diverse requirements of its various users. Nevertheless, the involvement of experts in different areas, such as clinicians, pathologists, and data scientists, is important, as their domain knowledge and insight help to identify which features work best to improve synoptic reporting for all stakeholders. In summary, our algorithm assists clinicians by highlighting the key points of narrative reports and serves as a preliminary step in promoting synoptic reporting in the clinical domain.