Applied Sciences
  • Article
  • Open Access

9 December 2023

Parallel-Based Corpus Annotation for Malay Health Documents

1 Faculty of Information Science & Technology, Universiti Kebangsaan Malaysia, UKM, Bangi 43000, Selangor, Malaysia
2 Faculty of Industrial Technology, Universitas Pembangunan Nasional “Veteran” Yogyakarta, Yogyakarta 55283, Indonesia
3 Faculty of Engineering and Informatics, Universitas Multimedia Nusantara, Banten 15810, Indonesia
* Author to whom correspondence should be addressed.

Abstract

Named entity recognition (NER) is a crucial component of various natural language processing (NLP) applications, particularly in healthcare. It involves accurately identifying and extracting named entities such as medical terms, diseases, and drug names, which are essential for tasks like clinical text analysis, electronic health record management, and medical research. However, healthcare NER faces challenges, especially in Malay, for which specialized corpora are limited and no general corpus is available yet. To address this, the paper proposes a method for constructing an annotated corpus of Malay health documents. Because NER tools are highly language-dependent and few are available for Malay, the researchers leverage a parallel source that contains annotated entities in English. Additional credible Malay documents are incorporated as sources to enhance the development. The targeted health entities in this research include penyakit (diseases), simptom (symptoms), and rawatan (treatments). The primary objective is to facilitate the development of NER algorithms specifically tailored to the healthcare domain in the Malay language. The methodology encompasses data collection, preprocessing, annotation of text in both English and Malay, and corpus creation. The outcome of this research is the establishment of the Malay Health Document Annotated Corpus, which serves as a valuable resource for training and evaluating NLP models in the Malay language. Future research directions may focus on developing domain-specific NER models, exploring alternative algorithms, and enhancing performance. Overall, this research aims to address the challenges of healthcare NER in the Malay language by constructing an annotated corpus and facilitating the development of tailored NER algorithms for the healthcare domain.

1. Introduction

Named entity recognition (NER) is a crucial task in the field of natural language processing (NLP). It involves identifying and categorizing named entities in text, such as people, organizations, locations, dates, and other specific terms [1]. NER is highly valuable in various applications, such as data analysis and decision support systems. For example, in healthcare, NER can be used to extract relevant medical terms, diseases, and drug names from clinical texts, facilitating clinical text analysis, electronic health record management, and medical research [2]. NER is also used in information retrieval, where it helps improve search results by accurately identifying and categorizing entities mentioned in documents [3].
The availability of a standard Malay language corpus and machine learning algorithms can catalyze a new wave of Malay NLP research, particularly in ongoing research on NER, semantic analysis, information retrieval, sentiment analysis, and translations. These resources would enable researchers to develop more accurate and effective NER models specific to the Malay language, improving the overall quality of Malay NLP applications.
Currently, domain-specific applications primarily focus on the specific context and often do not extend to other languages with diverse morphological and syntactic structures [4]. Therefore, the development of a standard Malay language corpus and machine learning algorithms tailored to the language is essential. This would enable the expansion of NLP applications to encompass a wider range of domains and promote cross-linguistic research and development. By investing in creating a comprehensive annotated corpus and advancing machine learning algorithms for Malay NLP, researchers can unlock the full potential of NER and other NLP tasks in the Malay language [5]. This will not only contribute to the growth of the field but also facilitate the development of innovative applications that cater to the unique linguistic characteristics of Malay.
The growth of health-related information in the Malay language necessitates the development of NLP tools and resources tailored to the Malay-speaking community. However, the existing NER tools primarily focus on basic entity types, such as person, organization, and location, and often do not support the Malay language. Moreover, it should be noted that the field of identifying syntax and semantics in the Malay language lacks the abundance of tools and resources that are readily available in English [6]. This scarcity poses a significant challenge in accurately performing named entity recognition (NER) tasks in Malay health documents. These challenges highlight the need for specialized NER models and resources specifically tailored to the Malay language and the health domain.
In addressing this challenge, leveraging parallel corpora, which consist of aligned texts in English and Malay, emerges as the most suitable solution. By utilizing parallel corpora, we can leverage the existing tools and resources for English NER and adapt them effectively to the Malay language, facilitating the identification of named entities in Malay health documents. This approach maximizes the available resources and enables the development of robust NER models specifically tailored to the Malay language.
The primary objective of this research is to address the challenges of healthcare NER in the Malay language by constructing an annotated corpus and facilitating the development of tailored NER algorithms for the healthcare domain. By creating a comprehensive annotated corpus and advancing machine learning algorithms for Malay NLP, this research aims to unlock the full potential of NER and other NLP tasks in the Malay language, ultimately improving information extraction, analysis, and understanding in the healthcare sector.
In this paper, the research on building an annotated corpus for Malay health documents is presented, focusing on named entity recognition (NER). The remainder of the paper is organized as follows. Section 2 covers the background and related work, providing an overview of the current state of NER research in the Malay language and discussing the limitations of existing resources and approaches. Section 3 describes the methodology employed to create the Malay Health Document Annotated Corpus. This includes data collection, preprocessing, annotation of both English and Malay text, and the process of combining annotated documents to create the corpus. In Section 4, the primary result of the study is presented, namely the creation of the Malay Health Document Annotated Corpus. Section 5 discusses the challenges in building the Malay Health Document Annotated Corpus. The importance of this corpus as a useful tool for training and testing NER models in the Malay language is elucidated, along with the wide range of biomedical concepts that have been correctly identified and labeled within the corpus. In the final section, the main findings of the research are summarized, and potential future directions for this work are discussed.

3. Methodology

The main function of a research methodology is to ensure that the research is conducted systematically, consistently, and objectively. The creation of the Annotated Malay Health Document Corpus consists of several stages, as illustrated in Figure 1. These stages encompass data collection via web scraping, annotation of the English text, annotation of the Malay text, and finally the creation of the corpus. Each of these stages contributes to the overall process of developing a valuable resource for Malay health document analysis and research.
Figure 1. Framework for corpus creation.

3.1. Data Collection

Health information is widely available via various sources, including online articles and social media. Each of these sources has a different writing style, and their information bears varying levels of availability and reliability. The unstructured text used as our data source comes from web pages on health-themed websites, which we extracted using web scraping. Our health text data were mainly sourced from the MyHealth portal, an online platform maintained by the Malaysian Ministry of Health and active in 2022.
This methodology allowed for the collection of substantial data from unorganized textual sources, thereby facilitating subsequent examination and annotation. The MyHealth portal plays a pivotal role in the healthcare system of Malaysia, with the objective of facilitating its transition toward a more comprehensive, interconnected, and digitally enabled service. It aims to offer healthcare information that is comprehensive, easily understandable, and of superior quality. By using data from the MyHealth portal, our study benefits from the wealth of health-related information available on this platform.
For this study, we selected about 100 articles and documents from the MyHealth portal, as shown in Table 1. These were analyzed and annotated to create a substantial corpus. Once the data were collected, the next step was to prepare them for further analysis and annotation. Irrelevant information such as advertisements, unrelated images, author biographies, reference or support group information, and final reviews was carefully removed. This was done to ensure that only relevant content, i.e., content directly related to health topics, was retained. Additionally, formatting inconsistencies in the original documents, such as variations in font size, style, and line spacing, were addressed. This ensures uniformity across documents, so that the data are easier to analyze and process.
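The cleaning step described above can be illustrated with a minimal Python sketch. The paper does not specify its actual filters, so the prefix list below is a hypothetical stand-in for the kinds of lines the authors report removing, and `clean_document` is an illustrative name rather than the authors' code.

```python
import re

# Hypothetical markers of non-content lines (assumption: the paper does not
# list the exact rules it used to drop advertisements, biographies, etc.).
BOILERPLATE_PREFIXES = (
    "advertisement",
    "author:",
    "references",
    "support group",
    "last reviewed",
)

def clean_document(raw_text: str) -> str:
    """Drop boilerplate lines and normalize whitespace in one scraped document."""
    kept = []
    for line in raw_text.splitlines():
        # Collapse runs of whitespace and trim both ends.
        line = re.sub(r"\s+", " ", line).strip()
        if not line:
            continue  # skip empty lines left over from the page layout
        if line.lower().startswith(BOILERPLATE_PREFIXES):
            continue  # skip lines that match a boilerplate marker
        kept.append(line)
    return "\n".join(kept)
```

In practice the filter list would be tuned to the page templates of the scraped site; a prefix match is only the simplest possible rule.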
Table 1. Statistics of the manually annotated parallel corpus.
In this research project, we gathered a robust corpus consisting of approximately 3952 health-related sentences in the Malay language and roughly 3728 corresponding sentences in English. The large corpus size is essential for conducting thorough analysis and annotation, as it provides a diverse range of data for examination. With a substantial corpus, we can draw more reliable conclusions and insights from our study. Examples of the sentences in both Malay and English can be viewed in Table 2. This table is illustrative of the variety and complexity of sentences that were included in our data collection effort.
Table 2. Some examples of text in English and Malay.
English was selected as the reference language for our dataset because it is extensively used in health-related NLP studies and therefore has a robust framework of tools and methods. Using English as the standard lets us apply pre-existing research and methodologies to our cross-linguistic investigation and keeps our comparison and analysis consistent and precise.
The Malay and English collections exhibit an equivalent quantity of documents, although a discernible discrepancy is observed in the number of sentences. The Malay language corpus exhibits a greater quantity of sentences in comparison to the English corpus. The main reason behind this disparity lies in the structural and linguistic differences between the two languages. Often, a single English sentence can expand into multiple sentences in Malay to convey the same meaning. This is due to the nuanced complexities inherent in the translation process between English and Malay. As Malay has unique syntactic and semantic properties, it often requires more sentences to capture the same information contained in a single English sentence. This linguistic phenomenon is illustrated in the first two rows of Table 2. This crucial observation underscores the challenges and intricacies involved in cross-lingual studies, particularly when developing natural language processing algorithms that accurately capture the subtleties of different languages. It also highlights the importance of developing tailored methodologies that take into account the specific linguistic features and structures of the target language.

3.2. Annotation of English Text

The existing entity recognition algorithms, such as the Stanford CoreNLP tools [25], predominantly classify basic entity types like person, organization, and location. These established tools, while effective in their own right, lack comprehensive support for the Malay language. This poses a significant challenge for our project since the primary objective is to develop a customized named entity recognition (NER) and relation extraction system tailored to Malay.
Considering this, we resolved to create a tailored annotation schema that would effectively cater to the unique needs of the Malay language. This approach would ensure that our annotated text corpus was well equipped to serve as a potent training and evaluation resource for custom NER and relation extraction algorithms.
For the English texts within our corpus, we employed biomedical NER tools such as BioYODIE NER. This powerful tool enabled us to efficiently identify named entities such as disease, symptoms, care, and others [25]. This identification process is critical as it facilitates the comprehensive mapping of each text’s entity landscape, providing valuable data for subsequent processing and analysis (Table 3).
Table 3. Examples of annotated English text (using BioYODIE tools).
To enhance the breadth of our entity identification, we additionally employed the Stanza i2b2 and NCBI-Disease tools [26]. These resources were instrumental in identifying other biomedical entities, including categories like problem, treatment, test, and disease. The inclusion of these tools in our entity recognition process ensures broader coverage, enabling us to capture a more diverse set of entities within the corpus (Table 4 and Table 5).
Table 4. Examples of annotated English text (using NCBI-Disease tools).
Table 5. Examples of annotated English text (using Stanza i2b2 tools).
Via the judicious use of these tools, we were able to create a comprehensive annotated corpus that encompasses a wide range of entity categories. This enriched corpus serves as a valuable resource for training and evaluating our custom NER and relation extraction algorithms, bringing us one step closer to achieving our project objectives. By tailoring our approach to suit the unique linguistic context of Malay, we aim to drive significant advancements in the field of Malay language processing.
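Combining the output of several annotators raises the question of what to do when their spans overlap. The paper does not describe its merging rule, so the sketch below assumes one simple policy, longest span wins, purely for illustration. The `Span` type and `merge_annotations` function are hypothetical names; the labels reuse categories mentioned in the text.

```python
from typing import NamedTuple

class Span(NamedTuple):
    start: int    # character offset, inclusive
    end: int      # character offset, exclusive
    label: str    # e.g. "problem", "treatment", "disease"
    source: str   # which tool produced the span

def merge_annotations(spans: list[Span]) -> list[Span]:
    """Keep non-overlapping spans, preferring the longer span on overlap.

    Assumption: "longest span wins" is an illustrative policy, not the
    rule used by the authors.
    """
    # Visit longer spans first so a nested shorter mention is rejected.
    by_length = sorted(spans, key=lambda s: (-(s.end - s.start), s.start))
    kept: list[Span] = []
    for cand in by_length:
        if all(cand.end <= k.start or cand.start >= k.end for k in kept):
            kept.append(cand)
    return sorted(kept, key=lambda s: s.start)
```

Keeping the `source` field makes it possible to audit later which tool contributed each entity, which is useful when the tools use different label sets.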

3.3. Annotation of Malay Text

In the process of creating annotated Malay texts and documents, we leverage reference annotations derived from English texts. More specifically, these are annotated English texts that have been processed using the BioYODIE tools [26], which are designed to provide entity annotations for diseases, symptoms, and care. In addition to this, we also draw upon references from annotated English texts that have been processed using the Stanza and NCBI tools [26], which specialize in providing entity annotations for diseases.
In order to ensure the accurate identification and labeling of biomedical-named entities within our corpus, we consult additional resources such as the Malay Wikipedia [27] and the dictionary from Dewan Bahasa [28]. These additional sources provide valuable insights into the specific linguistic and terminological nuances of the Malay language.
The primary aim of annotating Malay texts and documents is to identify named entities such as penyakit (diseases), simptom (symptoms), and rawatan (treatments). These annotations serve as an invaluable asset in the process of training and evaluating natural language processing (NLP) models tailored to the Malay language. Upon the completion of the annotation process, we are left with a comprehensive corpus of annotated Malay texts. Representative examples of these annotated texts can be found in Table 6. This rigorous process of annotation serves to guarantee the accurate identification and classification of biomedical-named entities within the Malay language, thus paving the way for the development of highly effective NLP models designed specifically for the Malay language.
Table 6. Examples of annotated Malay text.
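For training a sequence model, entity annotations are commonly stored as one tag per token. The sketch below assumes the widely used BIO scheme (B = beginning, I = inside, O = outside); the paper does not state which tagging scheme it uses, and the label names PENYAKIT, SIMPTOM, and RAWATAN are simply the three entity types written in upper case.

```python
def to_bio(tokens: list[str], entities: list[tuple[int, int, str]]) -> list[str]:
    """Convert token-index entity spans (start, end_exclusive, label) to BIO tags.

    Assumption: spans do not overlap and lie within the token list.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in entities:
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # continuation tokens
    return tags

# Illustrative sentence built from terms that appear in this paper.
tokens = ["sakit", "dada", "dan", "sesak", "nafas"]
entities = [(0, 2, "SIMPTOM"), (3, 5, "SIMPTOM")]
tags = to_bio(tokens, entities)
# tags == ["B-SIMPTOM", "I-SIMPTOM", "O", "B-SIMPTOM", "I-SIMPTOM"]
```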

4. Corpus Malay Health Document

The Malay Health Document Annotated Corpus, a detailed collection of annotated health documents, is a crucial asset for researchers and practitioners focusing on the Malay language. It facilitates the training and evaluation of named entity recognition (NER) models specifically crafted for Malay. These models excel in accurately extracting pertinent information from Malay health documents, benefiting medical research, clinical text analysis, and electronic health record management.
Moreover, this corpus plays a pivotal role in advancing various natural language processing (NLP) technologies in healthcare, such as natural language understanding, sentiment analysis, and text classification. It covers a wide array of health-related entities, including diseases, symptoms, and treatments, thus thoroughly representing the healthcare sector, encompassing medical, pharmaceutical, and clinical research areas. The utilization of this corpus not only enhances the effectiveness of NER models in discerning and retrieving valuable data from Malay-language health documents but also aids in expanding the scope and efficiency of NLP technologies within the healthcare field. This amplifies their applicability and utility in diverse scenarios like medical research and clinical text analysis (see Figure 2).
Figure 2. Example of tagging.
The primary result of this research is the creation of the Malay Health Document Annotated Corpus, which is derived from both English and Malay health documents. The corpus contains a diverse set of accurately labeled health-named entities, such as penyakit (diseases), simptom (symptoms), and rawatan (treatments). These entities can be seen in Table 7, which provides descriptions and examples for each entity type.
Table 7. Descriptions and examples for each entity type. Examples are translated from Malay.
The development of the Malay Health Document Annotated Corpus significantly contributes to the growing body of NLP resources for the Malay language. By providing a comprehensive annotated corpus, researchers are enabled to develop and evaluate NER models that can accurately analyze Malay health documents. This ultimately leads to better health outcomes for Malay speakers. Furthermore, the annotated corpus serves as a starting point for future research in Malay NLP, particularly in the health domain, opening up opportunities for advancements in this field.
By enabling the development of more accurate NER models for Malay health documents, the Malay Health Document Annotated Corpus can contribute to the creation of innovative healthcare technologies. These technologies can automate the analysis and interpretation of health information, leading to faster diagnosis, more personalized treatment plans, and improved patient outcomes.
Unlike existing NLP resources for the Malay language that focus on general text or news articles, the Malay Health Document Annotated Corpus specifically targets the healthcare domain. This makes it a specialized resource that captures the unique vocabulary, terminology, and entities found in health documents. By focusing on this specific domain, the corpus provides researchers and practitioners with a more accurate and tailored resource for developing healthcare-related NLP technologies.

5. Discussion (Challenges)

Several challenges emerged in building the Malay Health Document Annotated Corpus: synonyms in Malay annotations, ambiguous entity categorization, co-reference in translation, and polysemous terms.

5.1. Synonyms in Malay Annotations

The Malay language contains synonyms, that is, terms that represent the same concept but have different lexical realizations. For example, “barah” and “kanser” both refer to the concept of “cancer” in Malay. Capturing these synonyms in the annotated corpus requires careful attention to ensure that their identical meaning is retained.
This task is not trivial, as it directly influences the efficacy of the subsequent training and evaluation of NLP models. Machine learning models rely on a clear, consistent representation of the data to learn effectively. If a model perceives “barah” and “kanser” as distinct entities, it may fail to generalize appropriately, leading to potential misclassifications in unseen data or new contexts.
This can have significant consequences in NLP tasks such as sentiment analysis, text classification, or information retrieval, where accurate representation and understanding of the data are crucial for reliable results. Misclassifications can lead to incorrect interpretations, biased predictions, or inaccurate information retrieval, undermining the effectiveness and trustworthiness of NLP models.
Additionally, the intricacy of handling synonyms extends beyond mere identification. The model must also consider the context and co-occurrence of these terms within the textual data. It is important to note that even though synonyms refer to the same concept, their usage might differ based on the context. For example, one term may be more prevalent in formal writing, while the other is commonly used in daily conversations or specific regions.
Moreover, it is also essential to acknowledge the cultural and linguistic nuances associated with these synonyms. Some terms might carry different connotations or emotional valences despite referring to the same concept, which further emphasizes the need for nuanced understanding and handling of these terms during the annotation and model training process [29].
To address these challenges, advanced NLP techniques, such as word embeddings or contextual models, might be deployed. These techniques can capture the semantic similarity between different words and help the model understand that “barah” and “kanser” refer to the same concept. Furthermore, domain expertise and a careful annotation process play a crucial role in ensuring the consistency and accuracy of the data representation.
Overall, synonyms in the data make annotation and model training more difficult. However, these problems can be mitigated through careful annotation and advanced NLP techniques, which helps produce NLP models that are robust and context-aware.
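As a much simpler alternative to the embedding-based techniques mentioned above, synonyms can be mapped to one canonical form with a lookup table before training. The sketch below shows only that dictionary approach; the table is hypothetical, and only the barah/kanser pair comes from the text.

```python
# Hypothetical synonym table; only the barah/kanser pair is taken from the paper.
CANONICAL = {
    "barah": "kanser",
    "kanser": "kanser",
}

def normalize_mention(mention: str) -> str:
    """Map each token of a mention to its canonical synonym, if one is known."""
    tokens = mention.strip().lower().split()
    return " ".join(CANONICAL.get(tok, tok) for tok in tokens)
```

A lookup table cannot capture the differences in register or connotation discussed above, so it is best seen as a normalization aid for annotators rather than a replacement for contextual models.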

5.2. Ambiguous Entity Categorization

There are certain words or phrases that can serve as entities for multiple category types, presenting a complex issue in named entity recognition (NER). For instance, consider the phrase “sakit dada” (chest pain), which could be perceived as an entity within either the disease or symptom categories. This duality generates a demand for context-specific interpretation by the NER system. If “sakit dada” appears within a disease diagnostic context, the NER should classify it within the disease entity category. Alternatively, if the phrase is cited in the description of symptoms, the NER should allocate it to the symptom entity category.
In many scenarios, the NER system needs to analyze the broader context, taking into account related words in the sentence or document, to determine the most appropriate entity categorization. This is essentially utilizing the principles of co-reference resolution and word sense disambiguation to clarify semantic relationships and meanings.
This context-sensitive entity categorization presents significant challenges in developing an accurate and reliable NER system. The complexity is amplified when dealing with the medical domain, given the vast range of terminologies and their potential overlap between categories. Furthermore, the NER system must also factor in the linguistic and cultural nuances that can influence the interpretation of certain words or phrases. Consequently, handling such ambiguities requires sophisticated models with robust context-understanding capabilities, well-crafted feature sets, and effective training methods. These requirements underscore the need for high-quality, annotated training data like the Malay Health Document Annotated Corpus.
However, even with these resources, achieving a high level of accuracy in ambiguous entity categorization remains a demanding task. This is a significant area of research focus, with potential solutions exploring advanced techniques like deep learning and complex NLP models, as well as inter-disciplinary approaches integrating linguistics, medical knowledge, and computational methodologies.
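The idea of using nearby words to choose a category can be made concrete with a toy heuristic. The sketch below is not the authors' method and is far weaker than the contextual models discussed above: it looks for a cue word within a small window around the mention and leaves the mention unlabeled when no cue is found. The cue table and the `categorize` function are hypothetical.

```python
from typing import Optional

# Hypothetical cue words mapped to entity categories (assumption for illustration).
CUES = {"penyakit": "PENYAKIT", "simptom": "SIMPTOM"}

def categorize(tokens: list[str], start: int, end: int, window: int = 3) -> Optional[str]:
    """Pick a label for tokens[start:end] from cue words within `window` tokens.

    Returns None when no cue is found, signalling that the mention needs
    manual review or a stronger model.
    """
    lo = max(0, start - window)
    context = tokens[lo:start] + tokens[end:end + window]
    for tok in context:
        label = CUES.get(tok.lower())
        if label is not None:
            return label
    return None

# "sakit dada" preceded by the cue word "simptom" is labeled as a symptom.
label = categorize(["simptom", "seperti", "sakit", "dada"], 2, 4)
```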

5.3. Co-Reference in Translation

The next challenge lies in the use of co-reference during the translation process from English to Malay. Co-reference refers to the use of words or phrases that point to the same concept or entity within a sentence or text. For example, in the sentences shown in Table 8, a pronoun such as “it” can be used later in the text to refer back to an entity mentioned earlier. Handling co-reference correctly is crucial in the translation process as it aids in maintaining consistency and clarity.
Table 8. Examples of the sentences translated to Malay using co-reference.
Co-references can become significantly complex, especially within lengthy and nuanced texts. For instance, a document might initially mention “sakit perut” and subsequently use pronouns like “ia” in other parts of the text to refer to “sakit perut”. In such scenarios, the NER system must be adept enough to recognize that “ia” is indeed referring to the initial mention of “sakit perut”.
Effectively leveraging co-reference in the translation process necessitates a deep understanding of the structure and semantics of both languages. The system must recognize and maintain co-reference throughout the translation process while ensuring that the final translation remains accurate and comprehensible [30]. This requires advanced techniques in natural language processing and machine learning, as well as a good understanding of both languages’ cultural and social contexts.
Furthermore, in many instances, the source and target languages might have different co-reference rules and conventions. For example, Malay might have different ways of referring to entities or concepts compared to English. Thus, the system must be capable of adapting the co-reference from the source language to the target language in a natural and accurate manner. This is often challenging and necessitates ongoing research and development.
Addressing these challenges requires careful consideration and the development of methodologies that account for synonyms, resolve entity categorization ambiguities, and accurately handle co-reference during translation. By addressing these challenges, the Malay Health Document Annotated Corpus can be further refined and serve as a valuable resource for training and evaluating NLP models in the Malay language.

5.4. Polysemous Terms

In some cases, the challenge lies in what are known as “multiple translations” or “polysemous terms”. This refers to situations where a single word or phrase can hold multiple meanings or translations in another language, particularly within specialized contexts like medicine or technology. For instance, “shortness of breath” and “breathlessness” are two English medical terms that both signify difficulty breathing, and both share the same Malay translation, “sesak nafas”.
This can complicate the selection of the appropriate translation, especially when context sensitivity is a requirement. Context plays a vital role in determining the best translation for such polysemous terms, and this challenge increases when the context is intricate or subject to individual interpretation. This is a common issue in machine translation, and solutions usually employ deep learning models that can consider broader contextual information to better understand and determine an accurate translation.
Additionally, such polysemous terms also pose a significant challenge to the named entity recognition (NER) systems since the same word or phrase might be classified under different categories based on its different meanings. This introduces the necessity for advanced models that can effectively discern the semantic boundaries of such terms within the given context.
Moreover, this issue further accentuates the importance of domain-specific knowledge. In the example of “shortness of breath” and “breathlessness”, having knowledge about medical terminologies can guide the translation process more accurately. It highlights the requirement for a multidisciplinary approach, incorporating subject matter expertise in conjunction with computational methodologies, to effectively handle multiple translations and polysemous terms [31].
Lastly, this challenge also calls attention to the value of extensive, high-quality, and well-annotated corpora, like the Malay Health Document Annotated Corpus. They serve as critical resources for training machine translation and NER systems, enabling them to better understand and handle the complexities of multiple translations and polysemous terms.

6. Conclusions and Future Work

This research has successfully spearheaded the development of the Malay Health Document Annotated Corpus, which is a crucial resource for training and evaluating named entity recognition (NER) models for the Malay language. By meticulously identifying and labeling biomedical-named entities, this corpus significantly enhances the suite of NLP resources available for Malay. It has the potential to improve health outcomes for Malay speakers by enabling the development and evaluation of NER models that can efficiently analyze Malay health documents.
The research conducted using the Malay Health Document Annotated Corpus has shown promising results in the development of an NER model for the Malay language. The model, trained using supervised machine learning techniques like the conditional random field algorithm, has demonstrated the ability to accurately identify and extract biomedical entities from Malay health documents. The evaluation of the model using standard measures such as precision, recall, and the F1-score has provided insights into its effectiveness. These findings highlight the potential of NLP technologies in the health sector for the Malay language.
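The evaluation measures named above can be computed at the entity level by comparing predicted and gold-standard entities. The sketch below assumes exact-match scoring, where a prediction counts only if its span and label both match the gold annotation; the paper does not state which matching criterion it used.

```python
def prf1(gold: set, predicted: set) -> tuple[float, float, float]:
    """Entity-level precision, recall, and F1 under exact matching.

    Each entity is a hashable item such as (start, end, label).
    """
    tp = len(gold & predicted)  # entities that match exactly
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# One of two predictions is correct, and one of three gold entities is found.
gold = {(0, 2, "SIMPTOM"), (3, 5, "SIMPTOM"), (7, 8, "PENYAKIT")}
pred = {(0, 2, "SIMPTOM"), (3, 5, "PENYAKIT")}
p, r, f = prf1(gold, pred)
```

Here the second prediction has the right span but the wrong label, so it counts against both precision and recall under exact matching.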
Several challenges such as synonyms in Malay annotations, ambiguous entity categorization, co-reference in translation, and the handling of polysemous terms have been identified as key areas for future research. Addressing these issues will not only enhance the quality of the corpus but also significantly contribute to the advancement of natural language processing technologies. Focusing on these areas promises to improve the accuracy and utility of NLP models, particularly in the context of the Malay language, thereby elevating the overall effectiveness of language processing applications.
Furthermore, future research could develop domain-specific NER models customized for other sectors such as finance, law, or education, which would broaden the spectrum of NLP resources available for the Malay language. Researchers could also investigate different machine learning algorithms, deep learning techniques that can learn from large amounts of data, or methods that leverage knowledge from related tasks. These avenues have the potential to improve the performance of NER models tailored to the Malay language, thereby expanding the reach of NLP within the Malay-speaking world.

Author Contributions

Conceptualization, H.; methodology, H.; software, H.; writing—original draft preparation, H.; writing—review and editing, H., S.S., L.Q.Z. and A.F.N.; supervision, S.S. and L.Q.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Publicly available datasets were analyzed in this study. These data can be found at http://www.myhealth.gov.my/.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Goyal, A.; Kumar, M.; Gupta, V. Recent named entity recognition and classification techniques: A systematic review. Comput. Sci. Rev. 2018, 29, 21–43.
  2. Raza, S.; Reji, D.J.; Shajan, F.; Bashir, S.R. Large-Scale Application of Named Entity Recognition to Biomedicine and Epidemiology. PLOS Digital Health 2022, 1, e0000152.
  3. Patil, N.; Patil, A.; Pawar, B.V. Named Entity Recognition using Conditional Random Fields. In Proceedings of the International Conference on Computational Intelligence and Data Science (ICCIDS 2019), Gurgaon, India, 6–7 September 2019.
  4. Morsidi, F.; Sulaiman, S.; Suliana, S.; Siti, A.M.; Rohaizah, A.W. Malay Named Entity Recognition: A Review. J. ICT Educ. JICTIE 2016, 2, 1–14.
  5. Salleh, M.S.; Asmai, S.A.; Basiron, H.; Ahmad, S. A Malay Named Entity Recognition Using Conditional Random Fields. In Proceedings of the International Conference on Information and Communication Technology (ICoICT), Melaka, Malaysia, 17–19 May 2017.
  6. Mohd Noor, N.; Sulaiman, J.; Noah, S.A. Malay Name Entity Recognition Using Limited Resources. Adv. Sci. Lett. 2016, 22, 2968–2971.
  7. Ramachandran, R.; Arutchelvan, K. Named entity recognition on biomedical literature documents using a hybrid-based approach. J. Ambient. Intell. Humaniz. Comput. 2021, 1–10.
  8. Wei, H.; Gao, M.; Zhou, A.; Chen, F.; Qu, W.; Wang, C.; Lu, M. Named entity recognition from biomedical texts using a fusion attention-based BiLSTM-CRF. IEEE Access 2019, 7, 73627–73636.
  9. Bhasuran, B.; Murugesan, G.; Abdulkadhar, S.; Natarajan, J. Stacked Ensemble Combined with Fuzzy Matching for Biomedical Named Entity Recognition of Diseases. J. Biomed. Inform. 2016, 64, 1–9.
  10. Keretna, S.; Lim, C.P.; Creighton, D. A Hybrid Model for Named Entity Recognition Using Unstructured Medical Text. In Proceedings of the International Conference on Systems Engineering (SOSE), Glenelg, SA, Australia, 9–13 June 2014.
  11. Wang, C.; Wang, H.; Zhuang, H.; Li, W.; Han, S.; Zhang, H.; Zhuang, L. Chinese medical-named entity recognition based on a multi-granularity semantic dictionary and multimodal tree. J. Biomed. Inform. 2020, 111, 103583.
  12. Li, L.; Zhao, J.; Hou, L.; Zhai, Y.; Shi, J.; Cui, F. An attention-based deep learning model for clinical named entity recognition of Chinese electronic medical records. BMC Med. Inform. Decis. Mak. 2019, 19, 235.
  13. Herwando, R.; Jiwanggi, M.A.; Adriani, M. Medical entity recognition using a conditional random field (CRF). In Proceedings of the 2017 International Workshop on Big Data and Information Security (IWBIS), Jakarta, Indonesia, 23–24 September 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 57–62.
  14. Suwarningsih, W.; Supriana, I.; Purwarianti, A. ImNER Indonesian Medical Named Entity Recognition. In Proceedings of the 2nd International Conference on Technology, Informatics, Management, Engineering, and Environment, Bandung, Indonesia, 19–21 August 2017; pp. 184–188.
  15. Mohamed, H.; Omar, N.; Aziz, M.J.A. Malay Part of Speech Tagger: A Comparative Study on Tagging Tools. Asia-Pac. J. Inf. Technol. Multimed. 2015, 4, 11–23.
  16. Saad, S.; Mansor, M.K. Named entity recognition approach for Malay crime news retrieval. Gema Online J. Lang. Stud. 2018, 18, 216–235.
  17. Nadia, U.; Omar, N. Malay named entity recognition using a rule-based approach. Asia-Pac. J. Inf. Technol. Multimed. 2019, 8, 37–47.
  18. Salleh, M.S.; Asmai, S.A.; Basiron, H.; Ahmad, S. Named Entity Recognition using the Fuzzy C-Means Clustering Method for Malay Textual Data Analysis. J. Telecommun. Electron. Comput. Eng. JTEC 2018, 10, 121–126.
  19. Ulanganathan, T.; Ebrahim, A.; Xian BC, M.; Bouzekri, K.; Mahmud, R.; Hoe, O.H. Benchmarking Mi-NER: Malay entity recognition engine. In Proceedings of the 9th International Conference on Information, Process, and Knowledge Management, Nice, France, 19–23 March 2017; pp. 52–58.
  20. Sazali, S.S.; Rahman, N.A.; Bakar, Z.A. Information extraction: Evaluating named entity recognition from classical Malay documents. In Proceedings of the 2016 Third International Conference on Information Retrieval and Knowledge Management (CAMP), Malacca, Malaysia, 23–24 August 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 48–53.
  21. Alfred, R.; Leong, L.C.; On, C.K.; Anthony, P. Malay Named Entity Recognition Based on a Rule-Based Approach. Int. J. Mach. Learn. Comput. 2014, 4, 300–306.
  22. Lan, T.S.; Logeswaran, R. Challenges and developments in Malay natural language processing. J. Crit. Rev. 2020, 7, 61–65.
  23. Salah, R.E.; Zakaria, L.Q.B. Building the classical Arabic entity recognition corpus (CANERCorpus). In Proceedings of the 2018 Fourth International Conference on Information Retrieval and Knowledge Management (CAMP), Kota Kinabalu, Malaysia, 26–28 March 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 1–8.
  24. Fu, Y.; Lin, N.; Yang, Z.; Jiang, S. An open-source dataset and a multi-task model for Malay named entity recognition. arXiv 2021, arXiv:2109.01293.
  25. Kraljevic, Z.; Searle, T.; Shek, A.; Roguski, L.; Noor, K.; Bean, D.; Dobson, R.J. Multi-domain clinical natural language processing with MedCAT: The medical concept annotation toolkit. Artif. Intell. Med. 2021, 117, 102083.
  26. Kühnel, L.; Fluck, J. We are not ready yet: Limitations of state-of-the-art disease named entity recognizers. J. Biomed. Semant. 2022, 13, 26.
  27. Wikipedia Bahasa Melayu. 2022. Available online: https://ms.wikipedia.org/ (accessed on 23 December 2022).
  28. Portal Rasmi Pusat Rujukan Persuratan Melayu. 2022. Available online: https://prpm.dbp.gov.my/ (accessed on 19 December 2022).
  29. Sharifian, F. Cultural linguistics: The state of the art. Adv. Cult. Linguist. 2017, 1–28.
  30. Brack, A.; Müller, D.U.; Hoppe, A.; Ewerth, R. Co-reference resolution in research papers from multiple domains. In Proceedings of the Advances in Information Retrieval: 43rd European Conference on IR Research, ECIR 2021, Virtual Event, 28 March–1 April 2021; Proceedings, Part I 43. Springer International Publishing: Berlin/Heidelberg, Germany, 2021; pp. 79–97.
  31. Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G.S.; Dean, J. Distributed representations of words and phrases and their compositionality. Adv. Neural Inf. Process. Syst. 2013, 3111–3119.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
