A Digital Thesaurus of Ethnic Groups in the Mekong River Basin

This research was aimed at constructing a thesaurus of the ethnic groups in the Mekong River Basin that compiles controlled vocabularies in both Thai and English, with a digital platform that enables semantic search and linked open data. The research method involved four steps: (1) organization of knowledge content; (2) construction of the thesaurus; (3) development of a digital thesaurus platform; and (4) evaluation. The concepts and theories used in the research comprised knowledge organization, thesaurus construction, digital platform development, and system evaluation. The tool for developing the digital thesaurus was the TemaTres web application. The research results are: (1) 4273 principal words related to the ethnic groups were compiled and classified by term across eight levels of depth; 2596 were found to have hierarchical relationships, and 6858 had associative relationships; (2) the digital thesaurus platform was able to manage the controlled vocabularies related to the Mekong ethnic groups by storing both Thai and English vocabularies. When a term is retrieved, the vocabulary and the details of its broader term, narrower term, related term, cross reference, and scope note are displayed. Thus, semantic search is viable through applications, linked open data technology, and web services.


Introduction
The term 'ethnic group' is generally understood in the humanities as a social group or a category of the population who cluster together as part of a bigger society, which can be a community, a country, or a region. Due to common characteristics, frequently taken from biological and physical aspects, and relations in terms of kinship, language, culture, religion, and belief, an ethnic group possesses common key cultural heritage that is different from other groups or communities. Each member is able to perceive the differences and each has a means of communication and interaction in the group. The members of an ethnic group have lifestyles and social activities that differ from one area to another, which also depend on the government's policy, the environment, and socio-economic development carried out in their domicile [1][2][3][4]. It can be said that an ethnic group is the population that has become part of a bigger multi-cultural society. An ethnic group is an important element behind the country's development because the members' cultural identities and beliefs are hard to change or demolish. Thus, national development that brings impact on the cultures and beliefs of ethnic groups requires profound knowledge and understanding, especially in a region where a high number of different ethnic groups live. In short, the cultural multiplicity and differences of ethnic groups are significant elements when setting national development strategies [5,6].
A large body of research has addressed the social and developmental aspects of ethnic groups, e.g., studies on the influence of ethnic groups and cultural identities on conflicts in the United States [7][8][9][10][11][12]; studies on public policies, for example, education, public health, and population movement, which have an impact on the way of life and cultures of ethnic groups [13,14]; and studies on risk factors of minority groups residing in different areas, which include health, consumption, way of life, and cultural change [15,16]. There are also a number of studies aimed at classifying ethnic groups, with in-depth analysis or comparison of genetic traits and classification of the colors of skin, eyes, and other characteristics [17][18][19]. The ethnic group issue is thus linked to many fields of study other than the humanities: studies of ethnic groups have been conducted in history, archaeology, politics and government, demography, social development, and medical science.
In the field of information science, which covers analysis, categorization, system management, and construction of access tools to serve users, research involving ethnic groups is scarce. In the systematic classification schemes that enable access to information in libraries or databases, such as the Dewey Decimal Classification (DDC) and the Library of Congress Classification (LC), ethnic groups are placed in subclasses and subdivisions of the social sciences and/or sociology and humanities, without branching out into contents or related details at the ethnic group level (subclass 305 in DDC [https://www.oclc.org/content/dam/oclc/webdewey/help/300.pdf (accessed on: 11 March 2021)]; subclass HT in LC [https://www.loc.gov/aba/cataloging/classification/lcco/lcco_h.pdf (accessed on: 2 February 2021)]). Research on the spoken languages of ethnic groups, e.g., in Thailand, classified ethnic groups into five categories: Austro-Asiatic, Tai dialect, Sino-Tibetan, Hmong-Mien, and Austronesian [20]. The ethnic groups in Southeast Asia have been classified into four groups: Austro-Asiatic, Tai-Kadai dialect, Sino-Tibetan, and Malayo-Polynesian [21]. Chaikhambung and Tuamsuk additionally categorized the ethnic groups in Thailand, based on knowledge organization, in order to build an ontology for semantic search purposes; the ethnic groups in Thailand were divided into 12 classes and 51 subclasses [22,23]. Later, Chansanam et al. [24] extended this finding by arranging the Thai ethnic groups ontology in an accessible form using linked open data technology. Categorization of ethnic groups in different countries in the Mekong Basin appears in the research by, for instance, Pholsena [25], who reported that the first Prime Minister of Lao PDR, Kaysone Phomvihane, changed the vocabulary and ethnic group classification in Lao PDR by dividing the groups into Lao-Thai, Mon-Khmer, Hmong-Mien, and Sino-Tibetan. In addition, Mackerras [26] discussed the classification of the minority groups in China. These works, however, did not analyze the categories or organize knowledge based on the principles of information science, nor did they aim to apply the results to knowledge management or semantic search; hence, they were not able to provide open access via a web service.
A thesaurus is a knowledge resource that compiles vocabularies that have been classified into groups based on semantic similarities and word relationships. Schütze and Pedersen [27] defined a thesaurus as the matching of one word to another that is related. The list of words in a thesaurus is in alphabetical order, with priority given to the classes and the broader term (BT), related term (RT), and cross references that state both the USE and UF (used for) cases [28,29]. A thesaurus is thus a tool that helps users understand the overall vocabulary of a domain. Moreover, it is important as a retrieval tool for information in various databases, because a thesaurus organizes knowledge that represents the concepts and vocabulary used in practice or found in natural language. It also provides additional meanings and relationships for the words in its corpus. A thesaurus is therefore used as an efficient and precise retrieval index as required by users [30]. In this research, we defined a "word" as "a unit of language that native speakers can identify" and a "term" as "a word or expression used for some particular thing." A thesaurus in information science is a controlled vocabulary developed, in terms of both structural and grammatical methods, to compile the words in use, and it serves as the tool that provides descriptive keywords for a document. A thesaurus is also used to assist in selecting keywords when searching for information. As a controlled vocabulary, it not only helps the indexer or the searcher select the word that best represents the topic needed, but also enables the user to see the holistic picture of all words and arrive at the elements of the topic being searched, covering all relevant aspects. Subject headings are also a controlled vocabulary; a thesaurus appears similar but has a more complicated structure, and the symbols representing its relationships have clearer and more specific forms and meanings than those of subject headings [31][32][33].
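The relationship types described above can be made concrete with a small data structure. The following is a minimal sketch, not the structure used by the platform in this research: the class name `Term`, its field names, and the helper `format_entry` are hypothetical, and the sample entry is illustrative only.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Term:
    """One controlled-vocabulary entry with its thesaurus relationships."""
    label_en: str
    label_th: str = ""                                   # Thai label, left empty here
    broader: List[str] = field(default_factory=list)    # BT
    narrower: List[str] = field(default_factory=list)   # NT
    related: List[str] = field(default_factory=list)    # RT
    used_for: List[str] = field(default_factory=list)   # UF (non-preferred synonyms)
    scope_note: str = ""                                 # SN


def format_entry(term: Term) -> str:
    """Render a term in a conventional tagged thesaurus layout."""
    lines = [term.label_en]
    if term.scope_note:
        lines.append(f"  SN {term.scope_note}")
    tagged = (("UF", term.used_for), ("BT", term.broader),
              ("NT", term.narrower), ("RT", term.related))
    for tag, values in tagged:
        for value in sorted(values):
            lines.append(f"  {tag} {value}")
    return "\n".join(lines)


# Illustrative entry only; these relationships are not taken from the actual thesaurus.
entry = Term(label_en="Language groups",
             narrower=["Sino-Tibetan", "Austro-Asiatic"],
             related=["Way of life"])
print(format_entry(entry))
```

Keeping each relationship type in its own field is what lets a retrieval tool show the "holistic picture" of a term: the broader, narrower, and related terms can be listed separately rather than as one undifferentiated synonym list.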
A review of related digital or online thesauruses found no digital thesaurus in the ethnic group domain. An existing, internationally and widely known thesaurus, the UNESCO Thesaurus [http://vocabularies.unesco.org/browser/thesaurus/en/ (accessed on: 2 February 2021)], is a controlled vocabulary in the fields of education, culture, natural science, humanities, social sciences, communication, and information. Most of its vocabulary related to ethnic groups concerns the ethnic groups in America and Africa, and it has inadequate detail to be used as a management or specific access tool for ethnic groups. Other cultural digital vocabulary corpuses include those of Yale University [https://hraf.yale.edu/ (accessed on: 16 April 2021)] and The Getty Research Institute [https://www.getty.edu/research/tools/vocabularies/aat/ (accessed on: 2 February 2021)], in which limited ethnic vocabulary is found and has not been put in the form of a thesaurus.
The Mekong is the life source for over 60 million people who live around its basin in the areas of six countries including Southern China, Myanmar, Laos, Thailand, Cambodia, and Vietnam. The basin is the source of multiplicity of cultures and the domicile of over 95 different ethnic groups, whose lifestyles still follow the beliefs and roots of their former cultures, although changes can be traced owing to the influence of environmental and technological development [34]. The Mekong River Basin is "the river of economy", owing to its major role in the socio-economic development in Southeast Asia. It is an upstream resource of agricultural systems, energy production, food security, ecological system, and human well-being. Thus, attempts to enable the Mekong River Basin to acquire its roles in regional economic propulsion are being investigated at an international level. However, besides political issues and interrelations among the countries, the development process also involves the problem of understanding the ethnic groups in the area, which is of great importance [35]. Therefore, research studies for knowledge and understanding of ethnic groups are unavoidable.
Besides compiling the vocabularies related to the ethnic groups in the Mekong River Basin, the development of a thesaurus for the ethnic groups was carried out based on the concept of knowledge organization in the classification of the information on the ethnic groups, such that there is linked open access to all related contents, for example, dialects, languages, beliefs, attires, rituals, or social systems. Thus, the study did not only involve the organization of the names of the ethnic groups, but the thesaurus can also be used as a resource for studying the relationships of the content about the ethnic groups and as a tool to access and retrieve knowledge about the ethnic groups in the database and on the Internet. Furthermore, the results obtained can be used in semantic search and open access for an international standard of data exchange.

Research Objectives
This research was aimed at constructing a thesaurus of the ethnic groups in the Mekong River Basin, which compiles the controlled vocabularies both in Thai and English languages, with a digital platform for managing the thesaurus in terms of semantic search and linked open data.

Methodology
The Research and Development Method was applied for the research, consisting of four steps: (1) analysis, synthesis, and knowledge organization; (2) construction of the thesaurus; (3) development of a digital thesaurus platform for the ethnic groups in the Mekong River Basin; and (4) evaluation of the digital thesaurus platform for the ethnic groups in the Mekong River Basin.

1. Analysis, synthesis, and knowledge organization: These processes were performed by means of document analysis and knowledge organization as follows:

1.1. Data resources for the analysis: The information related to the ethnic groups in the Mekong River Basin was compiled from various sources, namely: (1) the domestic and international databases of information resources, which hold many collections in the humanities and on socio-cultural aspects of the Mekong River Basin, with emphasis on studies conducted in Thai and English; (2) [...]

1.2. Compilation of data: The researcher specified the keywords for information retrieval, namely ethnic group, ethnicity, and the Mekong River Basin, and retrieved information from the databases stated in 1.1 through the retrieval channels of each data source, i.e., topics, keywords, subject headings, abstracts, or descriptions. The data were then downloaded and the documents were filed systematically on a cloud drive.

1.3. Extraction and screening of data: The researcher extracted the keywords or vocabulary appearing in the data collected on the cloud drive, considering vocabulary with a specific meaning related to the ethnic groups. Next, the vocabulary was screened and selected by counting the frequency of each word and removing repeated words, synonyms, and ambiguous words, which yielded 4069 words related to the ethnic groups in the Mekong River Basin.
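The screening step above (frequency counting, then removal of repeats, synonyms, and ambiguous words) can be sketched as follows. This is only an illustration under stated assumptions, not the researchers' actual procedure: the function name `screen_vocabulary`, the synonym map, and the ambiguous-word list are all hypothetical, and the sample words are invented for the example.

```python
from collections import Counter


def screen_vocabulary(extracted, synonyms=None, ambiguous=None):
    """Return (word, frequency) pairs after screening, most frequent first.

    extracted: raw keywords as they appear in the collected documents
    synonyms:  maps a lower-case variant to the form that should be kept
    ambiguous: words to discard because their meaning is unclear
    """
    synonyms = synonyms or {}
    ambiguous = {w.lower() for w in (ambiguous or ())}
    kept = []
    for word in extracted:
        key = " ".join(word.lower().split())   # normalize case and spacing
        key = synonyms.get(key, key)           # collapse synonyms onto one form
        if key and key not in ambiguous:
            kept.append(key)
    # Counter merges repeated words, so each word appears once with its count.
    return Counter(kept).most_common()


# Hypothetical sample input, for illustration only.
raw = ["Hmong", "hmong ", "Ritual", "rite", "group"]
screened = screen_vocabulary(raw, synonyms={"rite": "ritual"}, ambiguous=["group"])
print(screened)
```

Note that this sketch lower-cases every word for comparison; a real workflow would need to preserve the original capitalization of proper names and handle Thai text separately.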

1.4. Word classification: The researcher classified the vocabulary according to the fundamental criteria for categorization and justification for knowledge organization based on domain-specific criteria [36,37]: starting from high-frequency down to low-frequency words, placing words with the same meaning together and words with close meanings next to one another, separating words with different meanings, checking correctness and avoiding ambiguity of meaning using an online English dictionary (WordWeb, https://www.wordwebonline.com/ (accessed on: 11 March 2021)), and finally recording the arranged word groups using the TemaTres 3.1 Program [38]. The outcome is a vocabulary structure convenient for use and for the development of further thesauruses. The completed process provides 12 vocabulary groups for the ethnic groups in the Mekong River Basin: language groups, social organization, costume, art works and entertainment, general name, demography and residential, history, customs and rituals, social dynamics, economic system, way of life, and religion and beliefs (Figure 1). In each group, there are subgroups at different levels on the same or closely related topics, as in the example of the language groups shown in Figure 2.


2. Construction of the thesaurus: The approaches to thesaurus construction were drawn from several sources [39][40][41][42], and the following steps were followed: [...] Cross referencing, i.e., USE and UF (used for), was done to link words with the same meaning or words that can be used interchangeably. Scope notes were then added to the words having a broader term to make the thesaurus complete.
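The USE/UF cross references described above work as a lookup from any entry term to its preferred term. The sketch below illustrates this idea under stated assumptions; it is not how TemaTres stores cross references. The function names `build_use_index` and `resolve` are hypothetical, and the sample terms are illustrative rather than taken from the thesaurus.

```python
def build_use_index(used_for):
    """Invert UF lists into a USE lookup.

    used_for maps each preferred term to its non-preferred (UF) variants.
    The result maps every variant, case-insensitively, to its preferred term.
    """
    use = {}
    for preferred, variants in used_for.items():
        for variant in variants:
            key = variant.casefold()
            if key in use and use[key] != preferred:
                # A variant may point to only one preferred term.
                raise ValueError(f"{variant!r} is assigned to two preferred terms")
            use[key] = preferred
    return use


def resolve(term, use_index):
    """Return the preferred term for a query, or the query itself if none is recorded."""
    return use_index.get(term.casefold(), term)


# Illustrative cross references only.
use_index = build_use_index({"Costume": ["Attire", "Dress"]})
print(resolve("attire", use_index))   # follows the USE reference
print(resolve("Costume", use_index))  # already a preferred term
```

Rejecting a variant that points to two preferred terms keeps the vocabulary controlled: every entry term leads to exactly one indexing term.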

2.5. The thesaurus word list was verified and evaluated by five specialists: two information scientists with expertise in knowledge organization and thesaurus construction, and three academics in anthropology and sociology with expertise in the ethnic groups of the Mekong River Basin. The snowball technique was used to select the experts, beginning with the first and second information science experts, followed by the third, fourth, and fifth ethnic group experts. The vocabularies were adjusted following the experts' opinions and suggestions before arriving at the thesaurus structure for the ethnic groups in the Mekong River Basin.

[...] of relationship, and retrieval [44]. The evaluation was performed following the steps below:

4.1. Query selection: The terms used to demonstrate the system's efficiency over the stored corpuses were selected by two experts in ethnic studies. In this research, 15 sets of vocabularies were sampled for retrieval from 160 corpuses, as shown in Table 2, in order to find the "precision" and "recall" values in the next step.

4.2. Evaluation of retrieval is very important in measuring the efficiency and effectiveness of the system [45]. In this research, the system efficiency was measured by precision and recall [46], as shown in Table 3, and the F-measure was used to test the precision.

Table 3. Experimental results: percentages of precision and recall.

[...] as compared to 1, the system and corpuses prove to have high effectiveness in retrieval and completeness [47].
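The measures named in this evaluation can be illustrated with a short calculation. The sketch below assumes the standard set-based definitions of precision and recall and the balanced form of the F-measure (the harmonic mean of the two). The function name `precision_recall_f` and the document identifiers are hypothetical, and the numbers are not the values from Table 3.

```python
def precision_recall_f(retrieved, relevant):
    """Compute precision, recall, and the balanced F-measure for one query.

    retrieved: identifiers returned by the system
    relevant:  identifiers judged relevant for the query
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)                      # relevant items that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical query: 4 documents retrieved, 5 relevant, 3 in common.
p, r, f = precision_recall_f({"d1", "d2", "d3", "d4"},
                             {"d1", "d2", "d3", "d5", "d6"})
print(f"precision={p:.2f} recall={r:.2f} F={f:.4f}")
```

Values close to 1 on all three measures indicate that a query returns mostly relevant items and misses few of them, which is how the reported results are interpreted.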

The F-measure was used to measure the system efficiency, according to the following formula:

$$F = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

The F-measure value obtained from this study was F = 0.8358, which indicates that the system is efficient and that retrieval precision is high. Thus, searching for information related to the ethnic groups in the Mekong River Basin with the developed thesaurus and its controlled vocabulary is useful.

Results of Research
2. The digital thesaurus of the ethnic groups in the Mekong River Basin is a digital platform that manages the controlled vocabulary related to the ethnic groups in the Mekong River Basin. It contains both Thai and English vocabularies, and when searched, it displays the word with its broader term, narrower term, related term, cross reference, and scope note (for certain words). The digital thesaurus enables semantic search via a web application at https://www.thesaurus.asiana.net/vocab/ (accessed on 14 July 2021) (Figures 5 and 6), open data via a SPARQL endpoint at https://www.thesaurus.asiana.net/vocab/sparql.php (accessed on 14 July 2021) (Figure 7), and a web service by means of an application programming interface (API) at https://www.thesaurus.asiana.net/vocab/services.php (accessed on 14 July 2021) (Figure 8).
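As an illustration of how the SPARQL endpoint could be used for semantic search, the sketch below builds a query for the broader terms of a given label and the corresponding request URL. It makes several assumptions that should be checked against the live service: that the data are exposed with the SKOS vocabulary (`skos:prefLabel`, `skos:broader`), that labels carry language tags, and that the endpoint accepts the standard `query` parameter of the SPARQL protocol. The function names are hypothetical, and no request is actually sent.

```python
from urllib.parse import urlencode

# Endpoint address as given in the text.
ENDPOINT = "https://www.thesaurus.asiana.net/vocab/sparql.php"


def broader_terms_query(label, lang="en"):
    """Build a SPARQL query for the broader terms of a label (assumes SKOS modelling)."""
    # Escape backslashes and double quotes so the label is a valid string literal.
    safe = label.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "PREFIX skos: <http://www.w3.org/2004/02/skos/core#>\n"
        "SELECT ?broaderLabel WHERE {\n"
        f'  ?concept skos:prefLabel "{safe}"@{lang} ;\n'
        "           skos:broader ?broader .\n"
        "  ?broader skos:prefLabel ?broaderLabel .\n"
        "}"
    )


def request_url(query, endpoint=ENDPOINT):
    """Encode the query as a GET request URL using the 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query})


# Illustrative label; whether it exists in the thesaurus is not verified here.
query = broader_terms_query("Language groups")
url = request_url(query)
print(query)
print(url)
```

The resulting URL could be fetched with any HTTP client; the response format depends on the endpoint's configuration, which is not described in the text.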

Discussion
This research continues earlier work [22][23][24], which organized knowledge and constructed a channel for exchanging information on the ethnic groups in Thailand in the form of an ontology taxonomy and linked open data. This research differs from the previous work as demonstrated in Table 4, especially in the scope of knowledge, which has been expanded to cover the ethnic groups in the Mekong River Basin, and in the thesaurus construction, with broader terms, narrower terms, related terms, and vocabulary relationships as well as cross references. This structure helps point out the connections among information and knowledge about ethnic groups in different dimensions, i.e., languages, social structure, marriage, art and entertainment, demography, history, tradition, beliefs, lifestyles, socioeconomic movements, etc. In addition to the relationships of the content, other relationships can be seen among the Mekong Basin's ethnic groups that are found to be from the same root. Moreover, the thesaurus in this research can be managed and accessed through a digital platform on the Internet, with functions that serve semantic search and linked open data. Therefore, interested academics and researchers can use the thesaurus as a tool to link with existing databases of resources on various aspects of the ethnic groups, for instance, folk tales, folk music, folklore, and beliefs, based on international standards. The digital thesaurus of ethnic groups in the Mekong River Basin was developed in a two-language version: English and Thai. As the thesaurus was aimed at offering an access tool to the sources of knowledge about the ethnic groups in the Mekong River Basin, a standard controlled vocabulary and open access were essential.
The UNESCO Thesaurus [http://vocabularies.unesco.org/browser/thesaurus/en/ (accessed on 14 July 2021)], an existing and internationally known thesaurus developed in 1977, offers a list of controlled vocabulary that is useful for content analysis and documentary search in culture, natural science, humanities, social science, communication, and information, and it is continuously being improved. The UNESCO Thesaurus is outstanding in presenting vocabularies in many languages, including English, Arabic, French, Russian, and Spanish, and is thus called a multilingual thesaurus. This is probably a limitation of our research that can be addressed in the next step, because Tematres does not support multilingual thesaurus development. However, the vocabulary it offers for ethnic groups mostly concerns the ethnic groups in America and Africa. Few words are offered for Asia, and the details are inadequate for it to be used specifically as a management tool for access to the ethnic groups. As demonstrated in Figure 9, there are 12 items related to ethnic groups with narrower concepts. If "Asian" is selected, only one ethnic group, i.e., Indians, is shown. Other capabilities are provided, such as searching through a web browser, linked open data, and web services (API). It can be concluded that the outcome of this research is an access tool that follows international standards and complements the UNESCO Thesaurus. The thesaurus from this research is specific to the ethnic groups in the Mekong River Basin and enables knowledge management and the management of digital information resources in the humanities with high coverage and completeness.
The development of this digital thesaurus platform is a starting point for further expansion of content in the humanities and cultural heritage, and for related work such as the development of a thesaurus of digital humanities, information analysis, storage and semantic search, and the development of information resources for open access to other information sources. One important issue is to develop other digital platforms on online social networks for presenting vocabularies. Besides group discussions to reach conclusions when considering vocabularies, opportunities should be provided for academics or interested individuals to take part and experiment with real usage by constructing or setting up content on online platforms, where research results are presented and linked widely across the Internet, so that users can access all research works at any place and time. Interested individuals can also take part in the consideration of words and provide their opinions, enhancing the value of the research works.