Metadata Integration Framework for Data Integration of Socio-Cultural Anthropology Digital Repositories: A Case Study of Princess Maha Chakri Sirindhorn Anthropology Centre

Abstract: Data integration is one of the most challenging tasks for digital collections whose data are stored across various repositories. Data integration across digital repositories has several challenges. First, data heterogeneity in terms of data schema and data values usually occurs across diverse data sources. Second, heterogeneity in data representation and semantic issues are among the problems. The same data may appear in different repositories with varied data representations, i.e., metadata schema. Recent research has focused on matching several related metadata schemas. In this paper, a metadata integration framework is proposed to support digital repositories in socio-cultural anthropology at the Princess Maha Chakri Sirindhorn Anthropology Centre (SAC), Thailand. The proposed framework is defined based on the Metadata Lifecycle Model (MLM). It utilizes non-procedural schema mappings to express data relationships in diverse schemas. A case study of metadata integration over the SAC digital repositories was conducted to validate the framework. The SAC common metadata schema was designed to support data mapping across 13 digital repositories. The SAC “One Search” system was developed to exemplify the system implementation of the framework. Evaluation results showed that the proposed metadata integration framework can support domain experts in socio-cultural anthropology in unified searching across the repositories.


Introduction
For more than 30 years, Thailand's Princess Maha Chakri Sirindhorn Anthropology Centre (SAC) has been developing digital repositories in anthropology, archaeology, history, ethnology, and socio-cultural studies for academic forums and the general public. Currently, the center has 31 digital repositories with more than 140,000 digital resources, including data records, online databases, articles, e-books, newsletters, videos, photos, audio files, etc. [1]. The data provided by the center are academically reliable and cover a wide range of fields, so these databases have become one of the most important online information sources in the anthropology field in Thailand.
Although each digital repository has its own purposes and designs, with some shared common entities such as ethnic groups, the data have been stored in different locations and database systems. This creates a limitation wherein users are unable to search for the desired information in an integrated and unified fashion. Moreover, the user interface (UI) and the data schema of the data records are different in each database. As a result, users who are not familiar with the subjects of all the available repositories can be confused as to where to search for the information. Therefore, to solve the problem of data silos and disparate retrieval systems in which the data in the repositories are not linked, and to implement a unified search UI, the SAC's "One Search" approach was designed and developed to integrate the data of all repositories and to enable users to access and retrieve all data within the SAC digital repository through one search channel.
Informatics 2022, 9, 38
In this paper, a metadata integration framework is proposed to support data integration across digital repositories. The proposed framework is defined based on the Metadata Lifecycle Model (MLM). The framework consists of five steps: analyzing information content, creating metadata requirements, developing metadata schema, creating metadata schema mapping profiles, and developing metadata service system and evaluation. A case study of metadata integration over the SAC digital repositories was conducted to validate the framework.
In adopting the framework, the SAC common metadata schema was designed based on the existing metadata schema standards, i.e., Dublin Core (DC) and Europeana Data Model (EDM). The design was also based on content analysis of the SAC digital repositories. The metadata schema mapping profiles define non-procedural schema mappings to express data relationships in diverse schemas. Using the mapping profiles, the existing source metadata schemas were subsequently mapped into the target SAC common metadata schema. Thus, the integration of data from different sources can be conducted, and a search system can be developed based on the SAC common metadata schema. A prototype of the SAC "One Search" system over 13 SAC digital repositories exemplified data integration and unified search system development. Based on the evaluation results, the unified search system provides sufficient support for the description and retrieval of the data by domain experts.

Data Heterogeneity
Data heterogeneity is a common phenomenon in distributed information sources and is growing with the development of systems and applications, which has created an enormous amount of data and information [2,3]. When such data are put to use, sharing and integrating them becomes a challenge in the implementation process [4][5][6][7][8][9]. Data non-standardization, diverse data representation, data conflicts, and data with related semantic features are some of the issues that may be found inside the data [10].
There are still numerous issues to be overcome in the deployment of data integration. Sharing and integrating data from loosely coupled sources, heterogeneity of data representation, and mapping data from diverse data sources are the most challenging aspects of data integration [11][12][13][14]. The semantic characteristics of multiple data forms and sources are particularly problematic when dealing with extensive data, which almost certainly contain heterogeneous data [10,13,15,16].
One of the most vexing issues in data management is automatically identifying proper mappings between various structured data types [17,18]. Data mappings are fundamental in data cleaning [19,20], data integration [21], and semantic integration [22,23]. In addition, they constitute the fundamental connection for the construction of large-scale semantic web and peer-to-peer information systems, which promote the collaboration of independent data sources [24]. As a result, the challenge of data mapping is manifested in different ways, including schema matching [25,26], schema mapping [17,27], ontology alignment [28], and model matching [18,29].
A semantic data mapping procedure is one of the potential approaches to solving heterogeneous data problems from a semantic perspective [30][31][32][33][34]. The main objective of the semantic data mapping process is to produce data format representations from data sources and convert them into an XML data format using a semantic perspective [35][36][37]. This is an important process in the implementation of data integration technology [38]. Semantic data mapping is a standardization and mapping process that produces uniformity among data with various representations, heterogeneous formats, and different semantics across the applications of different data sources [39][40][41].

Semantic Data Integration
Integrating datasets or data sources is a major problem for semantic integration because it is difficult to identify the semantic information that the data contain. The semantic information determined from the data refers to real-world concepts and can be used as the basis for integration. Many technologies are used to address the challenges of semantic integration. This section discusses approaches, frameworks, techniques, and related challenges for semantic integration. Schema matching is the task of finding semantic correspondences between elements (or attributes) of two given database schemas [42][43][44][45][46]. This task is essential for enabling data integration and systems interoperability in domains such as e-commerce, geospatial applications, biology, and health.
Schema matching is difficult for several reasons. Different schemas might use different names for elements, such as attributes, that express the same concept. Conversely, elements with similar names might refer to different concepts. Elements that are structurally identical in two schemas may still differ in meaning. Finally, several elements in one schema may together correspond to a single element representing a concept in another schema.
Semantic correspondences between items in two schemas are found through schema matching [43]. Database schemas, XML DTDs, HTML form elements, and other types of heterogeneous data sources are all good sources of schemas [44]. Connecting two disparate data sources is an important initial step in any integration [45].
Although several approaches to this problem have been proposed over the years, none is currently regarded as a complete solution. With any technique, a specialist user may still need to check the results to ensure that they are accurate. Schema matching methods typically use one or more functions, called matchers, to establish a similarity value between pairs of schema elements. Each matcher evaluates the similarity of two input elements as a value between 0 and 1; the higher the value, the more similar the elements. A pair of elements is referred to as a matching candidate. Matchers may assess similarity using schema element names, thesaurus-based semantic similarity, data types, cardinality comparisons, and even the data values themselves.
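As an illustration of the matcher idea described above, the following minimal sketch scores pairs of schema element names with a value between 0 and 1 and keeps the pairs above a threshold as matching candidates. It is not the method used in this paper: the use of Python's `difflib.SequenceMatcher`, the threshold of 0.8, and the sample element names are assumptions made only for this example.

```python
from difflib import SequenceMatcher
from itertools import product


def name_similarity(a: str, b: str) -> float:
    """Return a lexical similarity score in [0, 1] for two element names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def matching_candidates(source, target, threshold=0.8):
    """Score every (source, target) pair and keep those at or above the
    threshold, sorted from most to least similar."""
    scored = [(s, t, name_similarity(s, t)) for s, t in product(source, target)]
    kept = [c for c in scored if c[2] >= threshold]
    return sorted(kept, key=lambda c: c[2], reverse=True)


# Hypothetical element names, chosen only for illustration.
source_elements = ["Title", "Alternative title", "Author"]
target_elements = ["Title", "Creator"]

candidates = matching_candidates(source_elements, target_elements)
for s, t, score in candidates:
    print(f"{s} -> {t}: {score:.2f}")
```

A purely lexical matcher such as this one may miss synonym pairs (for example, two differently named elements that both denote the person who made a resource), which is one reason why thesaurus-based matchers and expert review remain necessary.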
Semantic integration combines data integration with data semantics. Data integration makes it possible to use and manipulate numerous data sources transparently [46], while semantics can be described as "the field of linguistics and logic concerned with meaning" [47]. When the two are combined, the result is a technique that represents the data conceptually and employs conceptual models of the relationships among them, thereby reducing heterogeneities. The integration of semantically diverse data is a key challenge. Structural and semantic heterogeneity are two forms of data heterogeneity [48]. Goh summarized the reasons for semantic heterogeneity [49] as follows:
− Naming conflicts: synonyms and homonyms among attribute values.
− Scaling and unit conflicts: adoption of different units of measure or scales in reporting.
− Confounding conflicts: arise from the confounding of distinct concepts.
Ontologies help resolve data heterogeneity by enabling data interoperability. Gruber defined ontology as the "specification of a conceptualization" [50].

Metadata Ontology
Cverdelj-Fogaraši and colleagues proposed one of the recent techniques for semantic data integration: metadata ontology [51]. The proposed technique is focused on semantic integration for information systems. The method provides semantics to document metadata descriptions and enables semantic mapping between metadata of a domain and metadata of another field. The metadata ontology technique consists of the service layer, data access layer, and persistence layer.
To implement the metadata ontology, the ebXML Registry Information Model standard [52] can be utilized to specify the metadata. There are four parts to the metadata ontology: the core, classification, association, and provenance. The major components are illustrated in Figure 1. The approach was successfully tested and evaluated on real-life data by two independent departments [53]. It is important to remember that the core classes and related attributes, classification, and association all fall under this system's "core" category, as with the higher ontology idea of provenance.

One of the fundamental challenges for semantic integration is data heterogeneity. There are three types of data heterogeneity. Syntactic heterogeneity is caused by the use of different models or languages. Schema heterogeneity results from structural differences. Semantic heterogeneity is caused by different meanings or interpretations of data in various contexts [49]. In addition to the challenges mentioned above, there are other challenges to implementing the semantic integration architecture in real life [53]. These challenges may be divided into the following primary categories: scalability with the size of the schema, user interaction, and mapping maintenance [49]. Whereas most methodologies focus on small-sized schemas, techniques that work well with large-sized schemas must be investigated. Schema mapping cannot be completely autonomous. Thus, designing interaction with the user in performing a schema mapping task is a significant challenge. Schemas often change. Thus, schema matching techniques must also facilitate mapping maintenance.

Methodology
In supporting data integration across digital repositories, a metadata integration framework is defined based on the Metadata Lifecycle Model (MLM) [54]. MLM, proposed by the Metadata Architecture and Application Team, is a methodology involving a ten-step process by which digital library projects can design and implement metadata provision. MLM emphasizes the iterative processes from requirement and content analysis and system specification to metadata system and service evaluation.
The proposed metadata integration framework, shown in Figure 1, is a generic framework that not only guides the design of the metadata schema of digital repositories based on requirement and content analysis but also covers the process of metadata schema mapping across digital repositories and the development of the metadata service system. The framework consists of five steps: analyzing information content, creating metadata requirements, developing metadata schema, creating metadata schema mapping profiles, and developing metadata service system and evaluation. The steps in adopting the framework for the SAC digital repositories are described as follows.

Analyzing Information Content
The SAC digital repositories, when considering content, context, and structure, can be classified into five groups as follows: 1. Ethnic Groups; 2. Museums and Archives; 3. Cultural Heritage; 4. Archaeology and History, and 5. Anthropology. Details are shown in Table 1.

Creating Metadata Requirements
This research used content analysis and related metadata standards to create the guidelines for identifying metadata by adapting and applying the Dublin Core (DC) metadata [55] to analyze the elements of SAC's repositories. DC was used as a base model to design the metadata schema. The Dublin Core Metadata Element Set comprises 15 elements, for example: (1) Contributor: an entity responsible for making contributions to the resource. (2) Coverage: the spatial or temporal topic of the resource, the spatial applicability of the resource, or the jurisdiction under which the resource is relevant.

Developing the Metadata Schema
In developing the common metadata schema for the SAC's repositories, metadata elements were defined based on the Dublin Core Metadata elements. Some DC metadata elements were selected based on their appropriateness in the context of the subjects of SAC repositories. Next, the selected elements from DC were adopted together with new elements to ensure that the developed metadata schema could describe and enable users to access the needed information.
The SAC common metadata elements defined in this research consist of 11 metadata elements. They are based on five metadata elements from the Dublin Core Metadata [55] and six metadata elements from the Europeana Data Model (EDM) mappings of Europeana [56], which is related to a variety of anthropological data and a large amount of image data. The SAC common metadata elements that are based on the Dublin Core Metadata [55] consist of five elements: (1) Title, (2) Description, (3) Creator, (4) Type, and (5) Relation. The elements that are based on the EDM consist of six metadata elements: Properties, Provenance, Time, Location, Rights, and References.
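The element list above can be written down as a small data structure. The sketch below is only an illustration: the element names and their DC/EDM origins come from the text, while the class name, the `elements_from` and `unknown_elements` helpers, and the sample record are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetadataElement:
    name: str
    origin: str  # "DC" (Dublin Core) or "EDM" (Europeana Data Model)


# The 11 SAC common metadata elements as listed in the text.
SAC_COMMON_SCHEMA = (
    MetadataElement("Title", "DC"),
    MetadataElement("Description", "DC"),
    MetadataElement("Creator", "DC"),
    MetadataElement("Type", "DC"),
    MetadataElement("Relation", "DC"),
    MetadataElement("Properties", "EDM"),
    MetadataElement("Provenance", "EDM"),
    MetadataElement("Time", "EDM"),
    MetadataElement("Location", "EDM"),
    MetadataElement("Rights", "EDM"),
    MetadataElement("References", "EDM"),
)


def elements_from(origin: str) -> list:
    """Return the names of the common elements derived from one standard."""
    return [e.name for e in SAC_COMMON_SCHEMA if e.origin == origin]


def unknown_elements(record: dict) -> list:
    """Return the keys of a record that are not SAC common elements."""
    known = {e.name for e in SAC_COMMON_SCHEMA}
    return [key for key in record if key not in known]


# Hypothetical record used only to exercise the helper.
print(unknown_elements({"Title": "Bronze drum", "Subject": "Music"}))
```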

Creating Metadata Schema Mapping Profiles
In this step, the metadata elements of existing SAC digital repositories, i.e., source metadata elements, are grouped based on the metadata elements of the SAC common metadata schema, i.e., target metadata elements. The mapping can be many-to-one: more than one source metadata element of a repository can be grouped into one target metadata element. For example, the Anthropology Museum repository contains two metadata elements, Title and Alternate Title, which can be grouped into the Title element of the SAC common metadata schema.
The metadata schema mapping profile stores all the mapping information between the source and the target metadata elements. The mapping profile can be represented in the form "Source Repository Name (Metadata Element Names) => Target Metadata Schema Name (Metadata Element Name)", e.g., "Museum (Title, Alternative title) => SACCommon (Title)". The use of mapping profiles can facilitate mapping maintenance, i.e., profile updates, when the source metadata element names are added or updated.
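The profile notation above is regular enough to be parsed mechanically. The following sketch shows one possible way to read such a profile line and to support the maintenance task of spotting source elements that no profile covers yet. The regular expression, the `MappingProfile` type, the helper names, and the extra source element "Donor" are assumptions for illustration and are not part of the SAC implementation.

```python
import re
from typing import NamedTuple


class MappingProfile(NamedTuple):
    source_repository: str
    source_elements: tuple
    target_schema: str
    target_element: str


# "Source (elem, elem, ...) => Target (elem)"
PROFILE_PATTERN = re.compile(
    r"^\s*(?P<src>[^()]+?)\s*\((?P<src_elems>[^()]*)\)"
    r"\s*=>\s*(?P<tgt>[^()]+?)\s*\((?P<tgt_elem>[^()]*)\)\s*$"
)


def parse_profile(text: str) -> MappingProfile:
    """Parse one mapping profile line into its four parts."""
    match = PROFILE_PATTERN.match(text)
    if match is None:
        raise ValueError(f"not a valid mapping profile: {text!r}")
    source_elements = tuple(
        name.strip() for name in match.group("src_elems").split(",") if name.strip()
    )
    return MappingProfile(
        match.group("src").strip(),
        source_elements,
        match.group("tgt").strip(),
        match.group("tgt_elem").strip(),
    )


def unmapped_elements(profiles, repository, current_elements):
    """Return source elements of a repository that no profile maps yet."""
    mapped = {
        name
        for profile in profiles
        if profile.source_repository == repository
        for name in profile.source_elements
    }
    return [name for name in current_elements if name not in mapped]


profile = parse_profile("Museum (Title, Alternative title) => SACCommon (Title)")
print(profile)
# "Donor" is a hypothetical newly added source element.
print(unmapped_elements([profile], "Museum", ["Title", "Alternative title", "Donor"]))
```

Because the mapping is declared as data rather than coded as a procedure, a newly added or renamed source element only requires a profile update.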

Developing Metadata Service System and Evaluation
In adopting the SAC common metadata schema and mapping profiles, the SAC "One Search" prototype system is developed. The development of the prototype system consists of two major steps: metadata transformation and search system development. The steps are described as follows.

• Metadata transformation. In this step, the metadata schema mapping profiles are added into the search system. The mapping profiles allow the metadata elements of all the repository resources to be transformed into the SAC common metadata elements. Specifically, the resources of the source repositories will be described based on the SAC common metadata elements in an integrated repository. The source metadata element names are also preserved for display purposes.
• Search system development. A prototype search system called SAC "One Search" is developed. The system allows all the repository resources to be displayed and searched in a unified fashion. Specifically, the resources of the source repositories will be displayed based on the SAC common metadata elements. In addition, user queries in terms of SAC common metadata schema can be conducted.
The major components of the SAC's "One Search" prototype system are illustrated in Figure 2.
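The two steps above can be sketched as follows, under stated assumptions. The profile content for the "Museum" repository follows the example given earlier in the paper; the in-memory index, the function names, and the sample record (including its "Shelf" field) are hypothetical and do not describe the actual One Search implementation.

```python
from collections import defaultdict

# source repository -> {source element name -> SAC common element name}
PROFILES = {
    "Museum": {"Title": "Title", "Alternative title": "Title"},
}


def transform(repository, record, profiles=PROFILES):
    """Describe a source record with SAC common elements, keeping the
    original source element name next to each value for display."""
    mapping = profiles[repository]
    common = defaultdict(list)
    for source_name, value in record.items():
        target = mapping.get(source_name)
        if target is None:
            continue  # element not covered by any profile; skipped here
        common[target].append({"value": value, "source_element": source_name})
    return {"repository": repository, "elements": dict(common)}


def search(index, element, keyword):
    """Return records whose given common element contains the keyword."""
    keyword = keyword.lower()
    return [
        record
        for record in index
        if any(
            keyword in entry["value"].lower()
            for entry in record["elements"].get(element, [])
        )
    ]


# Hypothetical source record used only for illustration.
source_record = {
    "Title": "Bronze drum",
    "Alternative title": "Kettle drum",
    "Shelf": "A-12",
}
index = [transform("Museum", source_record)]
hits = search(index, "Title", "kettle")
print(hits)
```

A query against the common Title element finds the record through its alternative title, while the stored `source_element` still tells the interface which original label to display.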

SAC Common Metadata Schema
The SAC common metadata schema, developed as a common metadata schema for the repositories of the Princess Maha Chakri Sirindhorn Anthropology Centre, consists of 11 metadata elements, as shown in Table 2. The description of each metadata element consists of a name, definition, format, and example.

Metadata Schema Mappings
The results of metadata element mappings between the 13 source digital repositories and the SAC common metadata schema are shown in Table 3. For brevity, only partial lists of the metadata schema mapping profiles are shown.
Table 3. Metadata element mapping between source digital repositories and SAC common metadata schema.

Unified Search System Development
A prototype search system was developed as a unified metadata service system. The system organizes and presents data from 13 SAC digital repositories based on the SAC common metadata schema. The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) [57] is a standard used to retrieve the data from the source repositories.
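To make the harvesting step more concrete, the sketch below builds an OAI-PMH `ListRecords` request URL and parses a response using only the Python standard library. The base URL, the sample response, and the assumption that a source repository exposes the `oai_dc` metadata format are illustrative; the paper does not state how the SAC harvester is implemented.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# XML namespaces defined by OAI-PMH 2.0 and Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request. A resumption token, when present,
    is an exclusive argument and replaces metadataPrefix."""
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return f"{base_url}?{urlencode(params)}"


def parse_list_records(xml_text):
    """Extract the identifier and Dublin Core fields of each record."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        identifier = rec.findtext(
            "oai:header/oai:identifier", default="", namespaces=NS
        )
        fields = {}
        for element in rec.iterfind("oai:metadata/oai_dc:dc/*", NS):
            name = element.tag.split("}", 1)[-1]  # drop the namespace part
            fields.setdefault(name, []).append((element.text or "").strip())
        records.append({"identifier": identifier, "fields": fields})
    return records


# Hypothetical response, shortened for illustration.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:museum/1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Bronze drum</dc:title>
          <dc:creator>Unknown</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""

url = list_records_url("https://repository.example.org/oai")  # hypothetical endpoint
records = parse_list_records(SAMPLE_RESPONSE)
print(url)
print(records)
```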
The SAC "One Search" system for demonstrating the unified metadata approach of the Sirindhorn Anthropology Centre can be accessed at: https://onedb.sac.or.th/, accessed on 9 March 2022. The system allows the unified representation and searching of 140,000 digital resources from SAC's 13 repositories based on the SAC common metadata schema. An example of unified resource representation on the SAC One Search system is shown in Figure 3.

Comparison with Existing Socio-Cultural Anthropology Digital Repositories
In order to verify the coverage of the SAC common metadata schema, the metadata elements of the SAC common metadata schema are compared with those of the common metadata schema of two existing socio-cultural anthropology digital repositories: the Smithsonian Learning Lab (https://learninglab.si.edu/search, accessed on 12 April 2022) and the National Institutes for the Humanities, Japan (https://int.nihu.jp/?lang=en&, accessed on 12 April 2022). Both repositories were selected because they have provided access to digital resources in socio-cultural anthropology collections and provided a unified search system in exploring various resource types and collections. The comparison of the SAC common metadata elements and those of the reference systems is shown in Table 4. The comparison results show that the coverage of the SAC common metadata elements is comparable with that of the reference systems. Specifically, most of the metadata elements of the reference systems can be mapped with the SAC common metadata elements.
However, one metadata element of the reference systems, "Subject/Keyword", has no equivalent among the SAC common metadata elements. This element is planned for future work to support unified subject classification among the SAC digital repositories using the domain ontology approach.

Evaluation of Search Application
The prototype system was subsequently evaluated by anthropology domain experts and information management experts from SAC. The assessment was carried out on 12 December 2021, based on Bruce and Hillmann's Continuum of Metadata Quality [58], using four dimensions: completeness, accuracy, accessibility, and conformance to expectations. The experts rated all four dimensions of the metadata at the highest satisfaction level (mean above 3.50), with conformance to expectations rated highest (mean = 4.78) (Table 5). The experts' suggestions included adding a "Provenance" element, providing a feature for system users to add new elements, and providing an English version of the metadata. The researchers modified the metadata schema in response to the discussion with the expert group to ensure that quality improvements were made as advised.

Conclusions
Data heterogeneity among various digital repositories of a data provider, i.e., data silos, has often led to inconsistency and inefficiency in users' data access. In this paper, a metadata integration framework based on metadata schema mapping is proposed to resolve such a challenge. The framework was designed as a generic framework based on the Metadata Lifecycle Model. Based on the framework, the common metadata schema of Thailand's Princess Maha Chakri Sirindhorn Anthropology Centre (SAC Common Metadata) was developed. The SAC common metadata schema consists of 11 metadata elements designed based on the Dublin Core (DC) and the Europeana Data Model (EDM) metadata elements. The mapping procedure between the source metadata elements from SAC's 13 anthropology digital repositories and the target SAC common metadata schema was described. Metadata integration of the existing digital repositories increases the likelihood of the resources being discovered and accessed via a unified search system. Finally, the SAC "One Search" system was developed as a prototype search system. It provides a web-based portal for representing and searching digital resources from different repositories based on the metadata elements of the common metadata schema. The metadata schema mapping profiles support the process of metadata transformation from the source repositories into the target integrated repository. An evaluation of the metadata schema found that it can sufficiently support the description and retrieval of the data by domain experts. A comparison of the coverage of the SAC common metadata elements with that of two existing socio-cultural digital repositories is also provided.
The implications of this research include (1) the elaboration and description of a metadata integration framework defined based on the Metadata Lifecycle Model (MLM) and (2) the design and adoption of a common metadata schema and metadata schema mapping profiles to support the development of a unified search system for heterogeneous digital repositories in the socio-cultural anthropology domain. Future work includes extending the common metadata schema and mappings to support unified subject classification across digital repositories using the domain ontology approach. Software tools and an implementation based on the framework are planned to be released to benefit other digital repositories with similar requirements. One limitation of the proposed framework is that it relies on experts to create the metadata schema mapping profiles. Future research should investigate incorporating a semi-automated mechanism to simplify the experts' mapping tasks.