Recognizing Indonesian Acronym and Expansion Pairs with Supervised Learning and MapReduce

During the previous decades, intelligent identification of acronym and expansion pairs from a large corpus has garnered considerable research attention, particularly in the fields of text mining, entity extraction, and information retrieval. Herein, we present an improved approach to recognize the accurate acronym and expansion pairs from a large Indonesian corpus. Generally, an acronym can be either a combination of uppercase letters or a sequence of speech sounds (syllables). Our proposed approach can be computationally divided into four steps: (1) acronym candidate identification; (2) acronym and expansion pair collection; (3) feature generation; and (4) acronym and expansion pair recognition using supervised learning techniques. Further, we introduce eight numerical features and evaluate their effectiveness in representing the acronym and expansion pairs based on the precision, recall, and F-measure. Furthermore, we compare the k-nearest neighbors (K-NN), support vector machine (SVM), and bidirectional encoder representations from transformers (BERT) algorithms in terms of accurate acronym and expansion pair classification. The experimental results indicate that the SVM polynomial model that considers eight features exhibits the highest accuracy (97.93%), surpassing those of the SVM polynomial model that considers five features (90.45%), the K-NN algorithm with k = 3 that considers eight features (96.82%), the K-NN algorithm with k = 3 that considers five features (95.66%), BERT-Base model (81.64%), and BERT-Base Multilingual Cased model (88.10%). Moreover, we analyze the performance of the Hadoop technology using various numbers of data nodes to identify the acronym and expansion pairs and obtain their feature vectors. The results reveal that the Hadoop cluster containing a large number of data nodes is faster than that with fewer data nodes when processing from ten million to one hundred million pairs of acronyms and expansions.


Introduction
Big data can be distinguished based on the large amount of digital data that is being created at an unprecedented rate by humans, sensor networks, mobile telecommunications, the Internet of Things, and many other heterogeneous devices [1][2][3][4]. These data exist in the form of query logs [5,6]; transaction records in database [7]; images, videos, and audios; abstracts of digital manuscripts; webpages; and microblog posts [8]. The accumulation of mobile telecommunication data has been

• We introduce eight continuous features to represent the acronym and expansion pairs, which differ from the feature vectors introduced by Chang et al. [16] for scoring abbreviations. The first five features were initially proposed by Wahyudi et al. [23]; however, in this study, two of these features are modified to accommodate the acronyms of typed initialisms, whereas the remaining three are new features that have been introduced to improve the accuracy. The three new features measure the ratio of accurate matching between the characters in the expansion and those in the acronym; furthermore, they can distinguish between accurate and inaccurate ratios. The formulas and their definitions are discussed in Section 4.

• We compare the performance of several supervised learning algorithms, namely SVM [26], K-NN [27], and Bidirectional Encoder Representations from Transformers (BERT) [28], to automatically determine the accurate acronym and expansion pairs from a large Indonesian corpus based on precision, recall, and F-measure. Further, we measure the performance of the SVM using several different kernels, the performance of K-NN using various k values, and the performance of BERT-Base and BERT-Base Multilingual Cased models [28].
• We evaluate the throughput of the big data technology under different data nodes using Hadoop MapReduce to construct the candidate pairs of acronym and expansion and obtain their feature vectors.
This article is organized as follows. Section 2 reviews related work. Section 3 presents an approach to recognize potential acronym and expansion pairs, and Section 4 discusses the new definitions and formulas of the eight numerical features. Section 5 presents our methodology, Section 6 the results and discussion, and Section 7 the conclusions.

Related Works
An automated method, known as the acronym finding program (AFP), was introduced by Taghva et al. [13] to recognize the acronym and expansion pairs in unstructured texts. An acronym and its expansion can be recognized by matching the acronym's characters with the initial letter of each word in the definition based on a certain confidence level. If the confidence level is above the given threshold, the matching is acceptable. The AFP demonstrated high accuracy and was tested using 1328 text files.
Park et al. [15] introduced a hybrid text mining approach based on the abbreviation pattern rules, linguistic cue words, and text markers that were specifically designed for documents containing English text. The pattern rules specify the manner in which each character in an acronym is formed from the expansion. The text markers are symbols that are often used to describe the relation of an acronym and its expansion, whereas the linguistic cue words are words that are frequently used to relate the abbreviation and its definition such as "short", "stand", and "or". They verified their hybrid approach using three different types of documents: automotive technical books, pharmaceutical books, and National Aeronautics and Space Administration (NASA) press releases; moreover, they determined that their method exhibited high recall and precision rates.
Another statistical learning method to identify the acronym and expansion pairs in biomedical literature was proposed by Chang et al. [16]. This method is divided into four main steps: identifying possible abbreviations, aligning the abbreviations with their prefix strings, computing their feature vectors, and scoring them using logistic regression. Although the overall accuracy is high with a precision of 99% and a recall rate of 82%, the researchers discovered that errors occurred with respect to MEDLINE because of the presence of synonyms and words with identical meanings; therefore, the algorithm failed when examining the correspondences between letters, which is a major drawback of the letter-matching techniques [16]. Furthermore, Nadeau et al. [17] proposed supervised learning with weak constraints to detect the acronym-definition pairs in English texts. The results indicate that the method was comparable with other methods that used stronger constraint rules. A previously conducted study [18] used SVM to heuristically determine the accurate expansion candidates based on words similar to the acronyms. Moreover, Choi et al. [21] introduced a method to extract the acronym and expansion pairs and examined the co-occurrence of words in Wikipedia to solve the problem of polysemous acronyms.
The problem of finding acronyms in texts and presenting them with their related expansions is a particular form of Named Entity Recognition (NER). Devlin et al. [29] introduced BERT, a language representation model designed to pre-train deep bidirectional representations from texts by jointly conditioning on the left and right context in all layers. The pre-trained model can be further fine-tuned with an additional output layer to produce a state-of-the-art model for many different tasks, for instance, an NER task to find acronym and expansion pairs in texts. BERT models have been applied to the CoNLL-2003 NER task and performed competitively with state-of-the-art methods [29].
For conducting big data analysis, many studies have used Hadoop technology to obtain a solution to the scalability and performance problems associated with traditional computing techniques [1,30].
Hadoop is a parallel and distributed processing platform that uses the MapReduce computing paradigm [31,32] to uniformly distribute the computing tasks across data nodes to rapidly process large amounts of data on the Hadoop distributed file system (HDFS) [33]. MapReduce simplifies data processing using two functions, i.e., map and reduce. The map function separates the data input into key-value pairs. It subsequently uses the computational power of the data nodes to process the key-value pairs and returns a set of intermediate key-value pairs to the reduce function for obtaining the results [32].
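The division of labour between the map and reduce functions can be illustrated with a small single-process sketch. It does not use Hadoop: the function names (`map_fn`, `reduce_fn`, `run_mapreduce`) and the toy task of counting all-uppercase tokens are assumptions chosen only to show how key-value pairs flow from map, through grouping by key, to reduce.

```python
from collections import defaultdict


def map_fn(line):
    """Map step: emit one (token, 1) pair per all-uppercase token in the line."""
    for token in line.split():
        word = token.strip(".,;:()\"'")
        if len(word) > 1 and word.isupper():
            yield word, 1


def reduce_fn(key, values):
    """Reduce step: combine all values that share the same key."""
    return key, sum(values)


def run_mapreduce(lines):
    """Simulate map -> group by key -> reduce in a single process."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            groups[key].append(value)  # intermediate key-value pairs
    return dict(reduce_fn(key, values) for key, values in groups.items())


if __name__ == "__main__":
    corpus = ["NPWP adalah Nomor Pokok Wajib Pajak.", "Daftar NPWP di KPP."]
    print(run_mapreduce(corpus))
```

In a real cluster, the grouping step is performed by the framework across data nodes, so only the map and reduce bodies need to be written by the user.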
A previously conducted study [34] introduced a set of Apriori algorithms for data association analysis using the MapReduce paradigm and investigated their effectiveness using a set of 5 million unique single items. The authors determined that MapReduce is suitable for conducting big data analysis. Xun et al. [35] proposed a method for identifying frequent itemsets in a parallel and distributed manner using MapReduce. Zhonghua [36] performed big data processing using Hadoop in petroleum exploration applications to extract seismic attributes, because seismic attribute analysis usually cannot handle a large volume of data under traditional computing paradigms.

Determining the Candidate Pairs of Acronym and Expansion
The acronym candidates can be determined by splitting each sentence in a text into words. Then, for each word, the ratio of the number of uppercase letters to the length of the word is calculated. If the ratio is ≥75%, then the word is an acronym candidate; however, if the ratio is between 50% and 75%, then the word is examined to determine whether it contains digits. If it contains at least one digit, then it is an acronym candidate; otherwise, it is further examined to check whether the word is present in the Indonesian dictionary. If it is not present, then it is an acronym candidate; otherwise, it is ignored. Algorithm 1 shows the step-by-step procedure to determine the acronym candidates, and Algorithm 2 shows the step-by-step procedure to obtain the candidate pairs of acronym and expansion.
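The decision rules above can be sketched as a short function. This is an illustration under stated assumptions, not a reproduction of Algorithm 1: the name `is_acronym_candidate` is hypothetical, the dictionary is modelled as a plain set of lowercase words, tokens are assumed to be already stripped of punctuation, and words whose ratio is below 50% are also sent to the dictionary check, following the description of syllable-type acronyms in the Data section.

```python
def is_acronym_candidate(word, dictionary):
    """Apply the uppercase-ratio, digit, and dictionary rules to one word.

    `dictionary` is assumed to be a set of lowercase Indonesian words.
    """
    if not word:
        return False
    ratio = sum(ch.isupper() for ch in word) / len(word)
    if ratio >= 0.75:
        return True  # mostly uppercase, e.g. "NPWP"
    if ratio >= 0.50 and any(ch.isdigit() for ch in word):
        return True  # mixed letters and a digit, e.g. "P3K"
    # Assumption: every remaining word is accepted only if it is not a dictionary word.
    return word.lower() not in dictionary


if __name__ == "__main__":
    toy_dictionary = {"nomor", "pajak"}
    for token in ["NPWP", "P3K", "Pilkada", "Nomor"]:
        print(token, is_acronym_candidate(token, toy_dictionary))
```

The dictionary check is deliberately permissive: it lets many non-acronyms through as candidates, which is acceptable because the later classification stage decides which pairs are accurate.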
After identifying an acronym candidate, the next step is to build all the expansion candidates from the words surrounding the acronym, i.e., the sentences on the left and right sides of the acronym [15], taking from 2 words up to n words, where n is calculated as follows. Let S be a sentence on the right or left side of acronym A that comprises two or more words, and let W = {w_1, w_2, ..., w_k} be the set of words in S, with |W| the number of elements in W. Let U be the number of uppercase letters in A if A is an acronym formed by uppercase letters, or the number of vowels in A if A is an acronym formed by a combination of syllables, and let D be the digit value in A. Indonesian acronyms can be formed by a combination of letters and a digit, where the digit gives the number of times the letter before it appears. For example, "L2Dikti" can be written as "LLDikti" and denotes "Lembaga Layanan Pendidikan Tinggi" ("The Institute for Higher Education Services" in English). Therefore, the value of n can be set using the following equation:

Algorithm 2
    ...
    for k ← 2 to n do
        pairs ← pairs + (word, k_grams)    ▷ push the acronym and expansion pairs to the array
    end for
    rightSentence ← GETSENTENCE(right, sentence)
    n ← DETNVALUE(word, rightSentence)
    for k ← 2 to n do
        pairs ← pairs + (word, k_grams)    ▷ push the acronym and expansion pairs to the array
    end for
    return pairs    ▷ return an array containing the candidate acronym and expansion pairs
end procedure
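Two of the ideas in this section lend themselves to a short sketch: rewriting a digit as repeated letters, and enumerating expansion candidates of 2 to n words on each side of the acronym. The function names below are hypothetical. Because the equation for n is not reproduced here, n is simply passed in as a parameter, and the sketch assumes that each k-gram is the run of k words immediately adjacent to the acronym, which the text does not state explicitly.

```python
def expand_digits(acronym):
    """Rewrite a digit as repetitions of the preceding letter, e.g. "L2Dikti" -> "LLDikti"."""
    out = []
    for ch in acronym:
        if ch.isdigit() and out:
            # The letter is already present once, so add (digit - 1) more copies.
            out.append(out[-1] * (int(ch) - 1))
        else:
            out.append(ch)
    return "".join(out)


def candidate_pairs(acronym, left_words, right_words, n):
    """Pair the acronym with every k-gram (2 <= k <= n) adjacent to it on either side."""
    pairs = []
    for side, words in (("left", left_words), ("right", right_words)):
        for k in range(2, min(n, len(words)) + 1):
            # Assumption: left k-grams end at the acronym, right k-grams start after it.
            gram = words[-k:] if side == "left" else words[:k]
            pairs.append((acronym, " ".join(gram)))
    return pairs


if __name__ == "__main__":
    print(expand_digits("L2Dikti"))
    print(candidate_pairs("NPWP", ["Nomor", "Pokok", "Wajib", "Pajak"], [], 4))
```

Generating every k-gram up to n yields many inaccurate pairs by design; separating accurate from inaccurate pairs is left to the feature vectors and the classifier.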

Numerical Features
Once n is determined, the next step is to calculate the continuous features of each acronym and its definition pair, representing the correlation between the acronym and its expansion. For acronyms of type uppercase letters, the features show the correlation between the uppercase letters in the acronym and the letters in the expansion, whereas for acronyms of type speech sound, the features show the relation between the words in the expansion and the syllables of the acronym. The features are as follows:
• F_1 measures the correlation between the total number of characters in the acronym and the total number of words in the expansion; generally, the former matches the latter. Therefore, F_1 is equal to 1 if they match exactly; otherwise, F_1 is less than 1. Let A be an acronym and E be the expansion. Let L_A be the length of A for an acronym of type uppercase letters or the number of syllables for an acronym of type sequence of speech sounds. In addition, let L_E be the number of words in E excluding conjunctions and prepositions. Then, F_1 is calculated using the following equation:
• F_2 measures the number of words in the expansion that are in title case (the first letter of the word is capitalized). Let A be an acronym and E be an expansion comprising several words, denoted as W = {w_1, w_2, ..., w_k}, such that the elements in W represent the words in E. Let |W| be the number of elements that are written in title case, excluding conjunctions and prepositions, and let L_E be the number of words in E, excluding conjunctions and prepositions. Then, F_2 is calculated as follows:
• F_3 weights the matching of the letters in the acronym and its expansion, excluding conjunctions and prepositions. The acronyms formed by the combination of uppercase letters are generally abbreviated based on the letters in the expansion; thus, F_3 provides a good weight for the matching of the letters. Let A be an acronym and L_A be the length of A. Further, we assume that T_m is a total match and t_m is a total mismatch between the letters in the acronym and its expansion. Then, F_3 is calculated using the following equation: For example, "NPWP," which stands for "Nomor Pokok Wajib Pajak" ("Tax ID number" in English), has F_3 = 1 because it is abbreviated according to the letters in the expansion, i.e., T_m = 4, t_m = 0, and L_A = 4. However, F_3 would be less than 1 if at least one mismatch occurred.
• F_4 weights the correlation between the first and last letters of the acronym. The first letter of the acronym is matched with the first letter of the expansion, and the last letter of the acronym is matched with the first letter of the last word of the expansion. For the acronyms formed by a sequence of speech sounds (syllables), this feature weights the correlation between the first syllable of the acronym and that of the expansion. Furthermore, it measures the matching between the last syllable of the acronym and the first syllable of the last word in the definition. F_4 is 1 if both correlations match, 0.5 if only one correlation matches, and 0 otherwise.
• F_5 penalizes the acronym definitions that contain many prepositions and conjunctions because acronyms usually do not contain many prepositions and conjunctions. Let E be an expansion that comprises several words and W_p = {w_1, w_2, ..., w_k} be the conjunctions and prepositions present in E. Furthermore, let |W_p| be the number of elements in W_p and L_E be the number of words in E. Then, the equation of F_5 can be given as follows:
• F_6 is the ratio of accurate matching between the characters in the expansion and those in the acronym. Let A be an acronym and L_A be the length of A. Assuming that T_m is a total match, the ratio F_6 can be measured using the following equation:
• F_7 is introduced to distinguish between the accurate ratio of appearance (F_6) and an inaccurate ratio. Therefore, the value of F_7 is 1 if F_6 is 1, indicating that the order of the acronym characters matches the characters in the expansion; otherwise, the value of F_7 is 0.
• F_8 is the mean of F_1 to F_7.
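Of the eight features, F_4, F_7, and F_8 are fully specified by the prose, so they can be sketched directly. The following is a minimal illustration restricted to acronyms of type uppercase letters. The function names are hypothetical, letter matching is assumed to be case-insensitive, and F_6 and the other feature values are taken as inputs rather than computed.

```python
def f4_uppercase(acronym, expansion):
    """F_4 for uppercase-letter acronyms: 1 if both ends match, 0.5 if one does, else 0."""
    words = expansion.split()
    if not acronym or not words:
        return 0.0
    # First acronym letter against the first letter of the expansion.
    first_ok = acronym[0].lower() == words[0][0].lower()
    # Last acronym letter against the first letter of the last expansion word.
    last_ok = acronym[-1].lower() == words[-1][0].lower()
    return (first_ok + last_ok) / 2


def f7(f6):
    """F_7 is 1 when F_6 equals 1 and 0 otherwise."""
    return 1.0 if f6 == 1 else 0.0


def f8(f1_to_f7):
    """F_8 is the mean of F_1 to F_7, given as a sequence of seven values."""
    if len(f1_to_f7) != 7:
        raise ValueError("expected exactly seven feature values")
    return sum(f1_to_f7) / 7


if __name__ == "__main__":
    print(f4_uppercase("NPWP", "Nomor Pokok Wajib Pajak"))
    print(f7(1.0), f7(0.75))
    print(f8([1, 1, 1, 1, 1, 1, 1]))
```

Because F_8 averages the other seven values, a pair that scores 1 on every individual feature also scores 1 on F_8, and any single weak feature lowers it proportionally.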

Methodology
Our proposed approach encompasses several important stages, as shown in Figure 1. The initial stage is the crawling stage, which collects a large number of online news articles from the web. The web articles are downloaded from trusted sources such as news.detik.com, okezone.com, liputan6.com, sindonews.com, jpnn.com, tribunnews.com, kompas.com, and viva.co.id. After the web articles are downloaded, they are stored locally. The second stage is the cleaning stage, which removes hypertext markup language (HTML) tags, links, images, and JavaScript code from the webpages. After the completion of the cleaning process, the webpages become plain texts. The next stage involves identification of the acronym candidates by tokenizing each sentence into words and determining the possible expansions on both sides of the acronyms [15]. Subsequently, the value of n is determined using Equation (1). The next stage involves the generation of the feature vectors for each acronym and expansion pair; the feature vectors represent the correlation between the two. For the acronyms of type uppercase letters, the features show the relation between the uppercase letters in the acronym and its expansion. However, for acronyms of type syllables, the features show the correlation between the words in the expansion and the syllables.

Models are constructed using the training set, and the accuracy of the models is verified using the testing set and measured using the precision, recall, and F-measure (F1-score). The F-measure is the harmonic mean of precision and recall. For the bi-class classification problem, TP is the number of positive samples accurately predicted as positive, FP is the number of negative samples inaccurately predicted as positive, FN is the number of positive samples inaccurately predicted as negative, and TN is the number of negative samples accurately predicted as negative. The equations for precision, recall, and F-measure can be given as follows:

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

$$F\text{-measure} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

We used SVM-Light [37] to build the SVM models using linear, polynomial, radial, and sigmoid kernels and the Weka data mining tool [38] to learn using the K-NN method with k values of 3, 5, 7, and 9. In addition, we used BERT-Base and BERT-Base Multilingual Cased models [28] to compare the SVM and K-NN algorithms with a state-of-the-art language model for finding acronym and expansion pairs. The default BERT parameters, such as the number of training epochs, the learning rate, and the maximum sequence length, were used for the fine-tuning process. We used 5000 manually annotated acronym and expansion pairs to train and evaluate the models using ten-fold cross-validation [39].
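The three evaluation measures follow directly from the confusion-matrix counts, using the standard definitions of precision, recall, and F-measure. The sketch below is illustrative only: the function name is hypothetical, the counts in the usage example are invented and are not taken from Table 2 or Table 3, and returning 0.0 when a denominator is zero is an assumed convention.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F-measure from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # The F-measure is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    p, r, f = precision_recall_f1(tp=90, fp=10, fn=30)
    print(f"precision={p:.4f} recall={r:.4f} f1={f:.4f}")
```

TN does not appear in any of the three measures, which is why specificity has to be reported separately when the true negative rate is of interest.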

Data
In this study, many news articles were downloaded from the trusted online news portals, and all downloaded webpages were cleaned first. The 5000 acronym and expansion pairs used as the training set and the 2000 pairs used as the testing set were manually annotated from the candidate set. In the training set, 3073 pairs (61.46%) have acronyms of type capital letters (uppercase ratio greater than 75%); 25 (0.50%) have acronyms that combine characters and a digit (ratio between 50% and 75%), for example, "LP2M" or "LPPM", which stands for "Lembaga Penelitian dan Pengabdian kepada Masyarakat" ("The Institute for Research and Community Services" in English); and 1902 (38.04%) have acronyms of type syllables (ratio below 50%, checked against an Indonesian dictionary). In the testing set, 1118 pairs (55.90%) have acronyms of type capital letters, 26 (1.30%) have acronyms that combine characters and a digit, and 856 (42.80%) have acronyms of type syllables.
The uniform resource locator (URL) pattern of each news portal must be initially determined, including the publication date of the articles, to facilitate the automatic crawling process. Table 1 presents several examples of the URL patterns.

Training Models
We evaluated the accuracy of the training models using ten-fold cross-validation. The learning process was repeated ten times, and each of the ten subsamples was used once for validation. The model with the highest accuracy was selected as the final model [39]. The ten-fold cross-validation test was performed for the SVM, K-NN, and BERT algorithms. The SVM algorithm used linear, polynomial, radial, and sigmoid kernels, whereas the K-NN algorithm used four different k values, i.e., 3, 5, 7, and 9. The pre-trained BERT-Base and BERT-Base Multilingual Cased models that support the Indonesian language were used and fine-tuned using the default parameters. The 5000 manually annotated training samples comprise 2469 positive and 2531 negative samples, which is a fairly balanced training set. The TP, FP, FN, and TN of the training models are summarized in Table 2. Table 2 shows that the accuracy (F1-score) of the SVM model constructed using the polynomial kernel is higher than those of the models constructed using the remaining three kernels. The accuracy of the SVM model with the polynomial kernel is 99.52%, whereas the accuracy of the SVM model with radial, linear, and sigmoid kernels is 99.37%, 99.17%, and 99.09%, respectively. The accuracy of the K-NN model with k = 3 is the same as that of the K-NN model with k = 5 (99.15%); however, it is slightly higher than the accuracy of the K-NN model with k = 7 and k = 9 (99.11%). Furthermore, the accuracies of the BERT-Base and BERT-Base Multilingual Cased models are 95.36% and 95.99%, respectively. Hence, the SVM model with the polynomial kernel is the best supervised learning model for finding acronym and expansion pairs in the Indonesian corpus.
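The ten-fold procedure can be sketched as an index split in which every sample is used for validation exactly once. This is a generic illustration, not the splitting performed by SVM-Light or Weka: the function name is hypothetical, and the folds are built without shuffling, whereas in practice the data would normally be shuffled or stratified first.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Assumption: deal indices round-robin into k folds, with no shuffling.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


if __name__ == "__main__":
    for fold_no, (train, validation) in enumerate(k_fold_indices(5000), start=1):
        print(fold_no, len(train), len(validation))
```

With 5000 samples and ten folds, each round trains on 4500 samples and validates on the remaining 500.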

Determining the Best Method
We evaluated and compared the performance of the SVM model based on the polynomial kernel using the eight feature vectors proposed in this study. Moreover, we examined the performance of the K-NN model with k = 3 using eight feature vectors and the K-NN model with k = 3 using five feature vectors (F_1, F_2, F_3, F_4, and F_5), as previously proposed by Jufri et al. [23]. We also evaluated the performance of the two BERT-Base models. We used 2000 additional manually annotated acronym and expansion pairs for this evaluation. The results demonstrate that the accuracy of the SVM model with the polynomial kernel and eight feature vectors (97.93%) was greater than that of the SVM model that used five feature vectors, the two K-NN models, and the two BERT-Base models. These findings indicate that our proposed approach, with the eight feature vectors, is more accurate than the other supervised learning methods. The confusion matrix, summarized in Table 3, shows that the SVM model obtained using eight feature vectors is superior to the SVM model obtained using five feature vectors, the two K-NN models obtained using two different sets of feature vectors, and the two BERT-Base models, as measured by the F-measure (F1-score). Both BERT-Base models predicted more negative samples as positive; therefore, their precision was low. The results also show that the specificity (true negative rate) of the proposed method is high and does not differ considerably from its true positive rate. We display the acronyms and their expansions identified automatically using our proposed approach in a web-based repository at http://indoacro.cs.unsyiah.ac.id.

Hadoop Performance Analysis
Performance evaluation was conducted using sets of 100,000, 200,000, and 300,000 cleaned web files. Different set sizes were used to observe the time trends associated with the generation of acronym and definition pairs and the calculation of their numerical features. Approximately 52 million pairs of acronyms and expansions were generated from 100,000 cleaned files; approximately 103 million pairs were generated from 200,000 cleaned files; and approximately 145 million pairs were generated from 300,000 cleaned files. Further, we evaluated the performance using several virtual servers, each of which runs the CentOS 7 (x86_64) operating system and has an 8-core 2 GHz processor, 16 GB of memory, and 200 GB of storage. Figure 2 shows that the performance improves when additional data nodes are added to the Hadoop cluster.

Conclusions
Our proposed method can effectively find the accurate acronym and expansion pairs and successfully recognize the acronyms that are pronounceable, such as "Pilkada" ("regional election" in English), "Damkar" ("firefighters"), and "Kemendagri" ("Ministry of Home Affairs"). The SVM polynomial model obtained using eight feature vectors exhibits the highest accuracy (97.93%), outperforming the SVM model obtained using five feature vectors (90.45%), the K-NN algorithm with k = 3 using eight feature vectors (96.82%), the K-NN algorithm with k = 3 using five feature vectors (95.66%), the BERT-Base model (81.64%), and the BERT-Base Multilingual Cased model (88.10%). Furthermore, the results confirm that Hadoop MapReduce rapidly and effectively generates candidate acronym and expansion pairs and calculates the feature vectors of the pairs. The results also confirm that increasing the number of data nodes improves the performance.