Optimizing Sensor Ontology Alignment through Compact co-Firefly Algorithm

The Semantic Sensor Web (SSW) links semantic web techniques with sensor networks, and utilizes sensor ontologies to describe sensor information. Annotating sensor data with different sensor ontologies can help to implement different sensor systems' inter-operability, which requires that the sensor ontologies themselves are inter-operable. Therefore, it is necessary to match the sensor ontologies by establishing meaningful links between semantically related sensor information. Since the Swarm Intelligence Algorithm (SIA) represents a good methodology for addressing the ontology matching problem, we investigate a popular SIA, that is, the Firefly Algorithm (FA), to optimize the ontology alignment. To reduce the memory consumption and better trade off the algorithm's exploitation and exploration, in this work, we propose a general-purpose ontology matching technique based on the Compact co-Firefly Algorithm (CcFA), which combines the compact encoding mechanism with the co-Evolutionary mechanism. Our proposal utilizes the Gray code to encode the solutions, two compact operators to respectively implement the exploiting and exploring strategies, and two Probability Vectors (PVs) to represent the swarms that respectively focus on exploitation and exploration. Through the communication between the two swarms in each generation, CcFA is able to effectively improve the search efficiency when addressing the sensor ontology matching problem. The experiment utilizes the Conference track and three pairs of real sensor ontologies to test our proposal's performance. The statistical results show that the CcFA-based ontology matching technique can effectively match both the sensor ontologies and other general ontologies in the domain of organizing conferences.


Introduction
In recent years, sensors have been used in a wide range of applications, such as urban traffic planning, flood prediction, health care, and satellite imaging for earth and space observation. To make different types of sensors collaborate on a common task to detect and identify a multitude of observations, we need to combine the sensor data with the Internet, web services and database technologies, which is the so-called sensor web [1,2]. However, since the acquired sensor data on the sensor web lack semantic information and may be heterogeneous at the syntax, schema and semantic levels, it is difficult to share and integrate them and implement their inter-operability.

The rest of the paper is organized as follows: Section 3 presents the basic concepts on sensor ontology matching; Section 4 shows the CcFA-based sensor ontology matching technique in detail; Section 5 presents the greedy strategy for filtering the final alignment; Section 6 shows the statistical experiment; and finally, Section 7 draws the conclusion and presents the future work.

Swarm Intelligence Algorithm Based Ontology Alignment
Many Machine Learning (ML) techniques have been applied to match ontologies and determine high-quality alignments, such as Logistic Regression (LR) [19], Neural Network (NN) [20], Word Embedding (WE) [21], Graph Embedding (GE) [22], Support Vector Machine (SVM) [23], Clustering Algorithm [24], Decision Tree (DT) [25] and so forth. Researchers also try to improve the matching efficiency through high performance computing techniques such as Parallel Computing (PC) [26] and Cloud Computing (CC) [27]. Due to the complexity of the ontology matching process, SIA-based techniques have recently become an effective approach for determining high-quality ontology alignments.
The first generation of SIA-based matchers aimed at solving the ontology meta-matching problem, that is, how to determine the optimal parameters to aggregate different matchers and optimize the quality of the obtained ontology alignment. Genetics for Ontology ALignments (GOAL) [28] was the first SIA-based ontology meta-matcher, which used an Evolutionary Algorithm (EA) to optimize the aggregating weight set of different ontology matchers. Ginsca et al. [29] proposed to use EA to optimize all the parameters in the whole meta-matching process, which included the aggregating weight set and a threshold for filtering the final alignment. Xue et al. [30] introduced a new metric to measure an ontology alignment's quality, which did not require the utilization of a golden standard alignment, and formally defined the ontology meta-matching problem. Their approach was able to match multiple ontology pairs at a time and overcame three drawbacks of the EA-based meta-matchers. More recently, He et al. [31] used the Artificial Bee Colony (ABC) algorithm to address the ontology meta-matching problem, whose results were better than those of the EA-based matchers. Xue et al. [32] proposed a Multi-objective Compact Firefly Algorithm (MCFA) to optimize ontology alignments. Their proposal borrowed the idea of MOEA/D [33] by first decomposing the original problem into three sub-problems, and then using three PVs to respectively address the three sub-problems. Later on, Xue [34] further constructed a single-objective model for the biomedical ontology matching problem, and proposed a Compact Firefly Algorithm (CFA) to address it. CFA utilizes the compact encoding mechanism to represent the swarm of fireflies with a probabilistic model instead of storing each individual, but it may suffer from premature convergence. To overcome this drawback, in this work, CcFA combines the compact encoding mechanism with the co-Evolutionary mechanism to further enhance the algorithm's performance.

Sensor Ontology Matching Problem
An ontology can be defined as a 3-tuple (C, DP, OP), where C is the set of classes, that is, the set of concepts that populate the domain of interest; DP is the set of data properties, that is, the set of features describing the classes; and OP is the set of object properties, that is, the set of relations existing between the concepts. In particular, concepts, datatype properties and object properties are called ontology entities. A sensor ontology formally defines the shared concepts and their relationships in the sensor domain. However, since sensor ontologies are developed and maintained by different organizations with various requirements, they may define the same observation in different ways, which yields the sensor ontology heterogeneity problem. It is necessary to find the identical sensor entities to bridge the semantic gap between two ontologies, which is the so-called sensor ontology matching. The obtained set of sensor entity mappings is called a sensor ontology alignment.
It is obvious that how to calculate two sensor entities' similarity value is the prerequisite technique for matching sensor ontologies [35]. Since no single similarity measure can ensure its effectiveness in every context, a sensor ontology matching system usually aggregates several similarity measures to improve the result's confidence. In this work, we combine three broad categories of similarity measures, that is, structure-based, linguistic-based and syntax-based similarity measures. To be specific, we construct a context profile for each sensor entity by collecting the information from its direct ascendant and descendant entities. Then, the similarity of two entities e1 and e2 is calculated as follows:

sim(e1, e2) = ( Σ_{w1 ∈ p1} max_{w2 ∈ p2} sim(w1, w2) + Σ_{w2 ∈ p2} max_{w1 ∈ p1} sim(w1, w2) ) / ( |p1| + |p2| )

where p1 and p2 are respectively e1's and e2's profiles. With respect to the similarity measure on two elements w1 and w2 in two profiles, we aggregate the WordNet [36] based distance, a linguistic-based similarity measure, and the N-gram distance [37], a syntax-based similarity measure, as follows:

sim(w1, w2) = max( simWordNet(w1, w2), simNgram(w1, w2) )

Since an alignment with more correspondences and a higher mean similarity value should have better quality, the quality of a sensor ontology alignment A is calculated as follows:

f(A) = 1/2 · ( |A| / max(|O1|, |O2|) + ( Σ_{i=1}^{|A|} sim_i ) / |A| )

where |O1| and |O2| are respectively the entity numbers of the two sensor ontologies O1 and O2, |A| is the number of mappings in the alignment A, and sim_i is the i-th entity mapping's similarity. Further, the sensor ontology matching problem is defined as follows:

max f(X)
s.t. X = (x1, x2, ..., x_{|O1|})^T, x_i ∈ {0, 1, ..., |O2|}, i = 1, 2, ..., |O1|

where x_i = j means the i-th entity in O1 is mapped to the j-th entity in O2 (x_i = 0 means it is not mapped).
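To make the measures concrete, the syntax-based component and the alignment quality can be sketched in Python. The bigram Dice coefficient for the N-gram measure, the symmetric best-match average over profiles, and the equal-weight quality formula are assumptions of this sketch, and the WordNet-based measure is omitted to keep it self-contained:

```python
def ngram_sim(w1, w2, n=2):
    """Dice coefficient over character n-grams (one common N-gram variant)."""
    g1 = {w1[i:i + n] for i in range(len(w1) - n + 1)}
    g2 = {w2[i:i + n] for i in range(len(w2) - n + 1)}
    if not g1 or not g2:
        return 1.0 if w1 == w2 else 0.0
    return 2 * len(g1 & g2) / (len(g1) + len(g2))

def profile_sim(p1, p2, word_sim=ngram_sim):
    """Symmetric best-match average between two context profiles."""
    if not p1 or not p2:
        return 0.0
    s12 = sum(max(word_sim(w1, w2) for w2 in p2) for w1 in p1)
    s21 = sum(max(word_sim(w1, w2) for w1 in p1) for w2 in p2)
    return (s12 + s21) / (len(p1) + len(p2))

def alignment_quality(sims, n1, n2):
    """Reward both coverage (more correspondences) and mean similarity."""
    if not sims:
        return 0.0
    return 0.5 * (len(sims) / max(n1, n2) + sum(sims) / len(sims))
```

Here alignment_quality plays the role of f(A): it rewards both the number of correspondences relative to the larger ontology and their mean similarity.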

Compact co-Firefly Algorithm
FA is inspired by the social behaviour of fireflies, which produce short and rhythmic flashes for communication. The intuition behind FA is that fireflies tend to fly to brighter locations, that is, solutions with better objective function values [18]. In this work, we propose the Compact co-Firefly Algorithm (CcFA), which utilizes the compact encoding mechanism and the co-Evolutionary mechanism to improve the solution's quality. CcFA works with two sub-swarms, that is, a better swarm with a better elite solution and a worse swarm with a worse elite solution, which apply different evolving strategies. In particular, the better swarm mainly focuses on exploitation, which can increase the convergence speed, while the worse swarm emphasizes exploration, which is able to enhance the solution's diversity. In each generation, the sub-swarm with the better elite solution becomes the better swarm, and in this way, the two swarms can better trade off the algorithm's exploitation and exploration. To improve the search performance, we utilize two Probability Vectors (PVs) to represent the two sub-swarms, that is, PV_better and PV_worse.

Compact Encoding Mechanism
An alignment consists of several entity mappings, and each entity mapping can be described by two entities' indices. On this basis, in this work, we empirically choose the Gray code, a binary encoding mechanism, to encode an alignment. An example of the compact encoding mechanism is shown in Figure 1, where the index means the source concept index and the corresponding bit values are the target concept index encoded through Gray code; for example, the source concept "Measure" with index 8 is mapped to the target concept "Parameter" with index 6, whose Gray code is 110.
In particular, the Gray code 000 means a source concept is not mapped to any target concept. We need to point out that, given the scale n of the target ontology, the length of the code is equal to ⌈log2 n⌉, and Figure 1 only shows a simple example of encoding with Gray code. A PV's dimension number is equal to the length of an individual, and each dimension represents the probability of the corresponding individual bit's value being one. Since PV stores each individual bit's probability of being one, each bit value of a newly generated individual can be determined by comparing a random number with its corresponding probability in PV. Figure 2 shows an example of generating a solution through PV. Given a PV (0.1, 0.3, 0.5, 0.9)^T, where each element presents the probability of being 1 with respect to a solution's gene bit, we generate four random numbers in [0,1], for example, 0.2, 0.4, 0.6 and 0.1, and determine a new solution by comparing them with PV's elements accordingly. To be specific, since 0.2 > 0.1, 0.4 > 0.3, 0.6 > 0.5 and 0.1 < 0.9, the newly generated solution is 0001.

In addition, PV is updated in each generation to make the individuals generated in the next generation closer to the elite. With respect to the word "closer", it means the newly generated individual is more likely to be the elite solution. For example, given two PVs PV1 = (0.1, 0.1, 0.1, 0.1)^T and PV2 = (0.9, 0.9, 0.9, 0.9)^T, it is obvious that a new individual generated by PV2 is more likely to be 1111 than one generated by PV1. Therefore, if the elite 1111 is the global optimal solution, we need to update PV1 by moving it towards PV2. In particular, if a bit value of the elite solution is 1 (or 0), the corresponding element of PV will be increased (decreased) by a step, which makes the newly generated solutions closer to the elite solution.
For example, given a PV (0.1, 0.3, 0.5, 0.9)^T, an elite solution 1110 and step = 0.1, since the first bit value of the elite solution is 1, we accordingly increase the first element of PV by step, that is, to 0.2, which makes the first bit value of a newly generated individual more likely to be 1 (the same as the elite solution's first bit value). Therefore, after updating all elements of PV, the newly generated individuals will be closer to the elite solution in terms of each bit value. When all elements of PV are close to 1 or 0, the newly generated individuals will be the same and the algorithm converges. For more on the theoretical analysis of the compact encoding mechanism, please see also the work of Harik et al. [7].
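The encoding and PV machinery above can be sketched as follows. The standard binary-reflected Gray code is assumed (under this convention, e.g., index 6 encodes as 101; Figure 1's encoding may follow a different indexing), and update_pv follows the step-based update just described:

```python
import random

def gray_encode(n, bits):
    """Encode integer n as a binary-reflected Gray-code bit list of the given length."""
    g = n ^ (n >> 1)
    return [(g >> i) & 1 for i in reversed(range(bits))]

def gray_decode(code):
    """Decode a Gray-code bit list back to the integer it represents."""
    n = 0
    for bit in code:
        n = (n << 1) | bit
    mask = n >> 1
    while mask:          # undo g = n ^ (n >> 1)
        n ^= mask
        mask >>= 1
    return n

def sample_individual(pv, rng=random):
    """Each bit is 1 when a uniform random number falls below its probability."""
    return [1 if rng.random() < p else 0 for p in pv]

def update_pv(pv, elite, step=0.1):
    """Move each probability one step towards the elite's bit value, clamped to [0, 1]."""
    return [min(1.0, p + step) if b == 1 else max(0.0, p - step)
            for p, b in zip(pv, elite)]
```

For the example above, update_pv([0.1, 0.3, 0.5, 0.9], [1, 1, 1, 0]) moves the PV to (0.2, 0.4, 0.6, 0.8).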

Movement Operator
In the classic FA, a firefly ind_i's position is updated by moving it towards a more attractive (brighter) firefly ind_j through the α-step and the β-step, which are respectively given as follows:

ind_i = ind_i + α · (rand − 1/2)

ind_i = ind_i + β0 / (1 + γ · r_ij^2) · (ind_j − ind_i)

where α is a randomisation parameter, rand is a random number drawn from [0, 1], r_ij is the two fireflies' distance, β0 is the attractiveness when r_ij = 0, and γ is the light absorption parameter.

Exploitation Strategy
FA's exploitation searches for a better individual in the vicinity of a solution, which is implemented through the α-step. In this work, the α-step is implemented through a local search process on ind_i. Given a firefly ind, we first generate C new individuals through PV; then, we utilize the binary crossover operator on each new individual and ind to obtain ind's neighborhood; finally, we select the elite from this neighborhood. The pseudo-code of the α-step algorithm is shown in Algorithm 1.

Algorithm 1 α-step Algorithm
Input: a firefly ind, a probability vector PV, the neighborhood scale C
for i = 1 to C do
    generate a new individual ind_i through PV;
    ind_i = binaryCrossover(ind, ind_i);
end for
return the best individual in {ind_1, ind_2, ..., ind_C}
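A minimal Python sketch of this α-step local search follows; a uniform binary crossover is assumed, since the concrete crossover operator is not spelled out in the text:

```python
import random

def uniform_crossover(a, b, rng=random):
    """Binary crossover: each bit is taken from either parent with equal chance."""
    return [x if rng.random() < 0.5 else y for x, y in zip(a, b)]

def alpha_step(ind, pv, fitness, C=5, rng=random):
    """Local search around ind: cross C PV-sampled individuals with ind
    and return the best neighbour found (a sketch of the paper's alpha-step)."""
    neighbours = []
    for _ in range(C):
        sampled = [1 if rng.random() < p else 0 for p in pv]
        neighbours.append(uniform_crossover(ind, sampled, rng))
    return max(neighbours, key=fitness)
```

With a saturated PV, every sampled neighbour equals the elite, so the search degenerates to returning the elite itself, which mirrors the convergence behaviour described in Section 4.1.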

Exploration Strategy
FA's exploration aims at keeping the population's diversity, which is implemented through the β-step. In this work, we utilize the edit distance to measure two fireflies ind_i and ind_j's distance:

edit(ind_i, ind_j) = Σ_{k=1}^{|ind_i|} |ind_i,k − ind_j,k|

where |ind_i| is the number of ind_i's bits, and ind_i,k and ind_j,k are respectively the k-th bits of ind_i and ind_j. Next, a new individual can be obtained by moving ind_i towards ind_j according to edit(ind_i, ind_j) and the attractiveness β0/(1 + γ · r_ij^2). The pseudo-code of the β-step algorithm is shown in Algorithm 2.
When the edit distance between the two PVs is too small, all of PV_worse's elements will be re-initialized. In particular, the edit distance between two PVs is defined as follows:

edit(PV_better, PV_worse) = Σ_{i=1}^{|PV|} |PV_better^i − PV_worse^i|

where |PV| is a PV's dimension number, and PV_better^i and PV_worse^i are respectively the i-th dimensions of PV_better and PV_worse.
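The distance computations and a possible discrete β-step move can be sketched as follows; copying each differing bit from the brighter firefly with probability β0/(1 + γr²) is an assumption about how the continuous move is discretized:

```python
import random

def edit_distance(a, b):
    """Bitwise (Hamming-style) edit distance between two equal-length bit lists."""
    return sum(x != y for x, y in zip(a, b))

def beta_step(ind_i, ind_j, beta0=1.0, gamma=0.02, rng=random):
    """Move ind_i towards the brighter ind_j: each differing bit is copied
    from ind_j with probability beta0 / (1 + gamma * r^2)."""
    r = edit_distance(ind_i, ind_j)
    attract = beta0 / (1 + gamma * r * r)
    return [bj if bi != bj and rng.random() < attract else bi
            for bi, bj in zip(ind_i, ind_j)]

def pv_distance(pv_a, pv_b):
    """Edit distance between two probability vectors; a small value
    triggers the re-initialization of PV_worse."""
    return sum(abs(x - y) for x, y in zip(pv_a, pv_b))
```

Note that the attractiveness decays with distance, so a far-away firefly moves only a few bits per step while a nearby one is pulled almost entirely onto its neighbour.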

Pseudo-Code of Compact co-Firefly Algorithm
In this work, CcFA uses the following configuration: the maximum generation maxGen = 3000, the attractiveness β0 = 1.0, the light absorption parameter γ = 0.02, the local search's neighborhood scale C = 5, and the step length for updating PV step = 0.1. These parameters represent a trade-off configuration obtained empirically to achieve the highest average alignment quality on all testing cases. CcFA's pseudo-code is shown in Algorithm 3.
CcFA applies two evolutionary strategies on PV_better and PV_worse, respectively. Through the competition between the two elites, PV_better and PV_worse are updated. By adaptively switching the search strategies, CcFA can better trade off the algorithm's convergence speed and the individuals' diversity. In particular, when all elements of PV_better and PV_worse are equal to 1 or 0, CcFA converges: the algorithm terminates the loop and outputs elite_better.

Algorithm 3 Compact co-Firefly Algorithm
** Initialization **
generation = 0;
initialize PV_better and PV_worse by setting all the probabilities inside as 0.5;
generate elite_better and elite_worse through PV_better and PV_worse, respectively;
** Evolving Process **
while generation < maxGen do
    ** Exploitation **
    generate a solution ind_new through PV_better;
    ind_new = α-step(ind_new); // see also Algorithm 1
    if f(ind_new) > f(elite_better) then elite_better = ind_new; end if
    update PV_better towards elite_better by step;
    ** Exploration **
    generate a solution ind_new through PV_worse;
    ind_new = β-step(ind_new, elite_worse); // see also Algorithm 2
    if f(ind_new) > f(elite_worse) then elite_worse = ind_new; end if
    update PV_worse towards elite_worse by step;
    ** Communication **
    if f(elite_worse) > f(elite_better) then swap elite_better and elite_worse, and swap PV_better and PV_worse; end if
    if edit(PV_better, PV_worse) is too small then re-initialize PV_worse; end if
    generation = generation + 1;
end while
return elite_better;
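To make the control flow concrete, here is a runnable toy sketch of the CcFA loop on a OneMax-style fitness, where maximizing the number of ones stands in for the alignment quality f(A); the exploration step, restart rule and termination details are simplified assumptions of this sketch:

```python
import random

def sample(pv, rng):
    """Draw a bit string from a probability vector."""
    return [1 if rng.random() < p else 0 for p in pv]

def update(pv, elite, step=0.1):
    """Shift every probability one step towards the elite's bit values."""
    return [min(1.0, p + step) if b else max(0.0, p - step)
            for p, b in zip(pv, elite)]

def ccfa(fitness, dim, max_gen=200, step=0.1, reset_eps=0.5, seed=0):
    """Sketch of the CcFA loop: a 'better' PV exploits, a 'worse' PV explores,
    the two swap roles whenever the worse elite overtakes the better one,
    and the worse PV is restarted when it collapses or mirrors the better one."""
    rng = random.Random(seed)
    pv_b, pv_w = [0.5] * dim, [0.5] * dim
    elite_b, elite_w = sample(pv_b, rng), sample(pv_w, rng)
    for _ in range(max_gen):
        # exploitation: sample near the better swarm's model
        cand = sample(pv_b, rng)
        if fitness(cand) > fitness(elite_b):
            elite_b = cand
        pv_b = update(pv_b, elite_b, step)
        # exploration: sample the worse swarm's model
        cand = sample(pv_w, rng)
        if fitness(cand) > fitness(elite_w):
            elite_w = cand
        pv_w = update(pv_w, elite_w, step)
        # communication: the swarm holding the better elite becomes 'better'
        if fitness(elite_w) > fitness(elite_b):
            pv_b, pv_w = pv_w, pv_b
            elite_b, elite_w = elite_w, elite_b
        # restart the worse model when it collapses or gets too close
        close = sum(abs(x - y) for x, y in zip(pv_b, pv_w)) < reset_eps
        if close or all(p in (0.0, 1.0) for p in pv_w):
            pv_w = [0.5] * dim
    return elite_b
```

Calling ccfa(sum, dim=8) maximizes the number of ones; on a real matching task the fitness would evaluate the decoded alignment's quality f(A).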

Final Alignment Determination
In this work, we utilize a greedy heuristic to filter the alignment. First, we remove those correspondences with a similarity value lower than 0.88 from the obtained alignment to ensure the precision of the final alignment. Here, we use the threshold 0.88 by referring to Fernandez et al. [38]. Then, we sort the remaining correspondences by descending similarity, and select them one by one into the final alignment as long as they do not conflict with previously selected ones. We use a logic reasoning approach to judge whether two correspondences conflict with each other. According to Wang [39], an ontology's entities are often organized by their "is-a" relationships, and a correct alignment should be consistent with that hierarchy. In Figure 3, if the correspondence (a1, b1) has a high similarity value, that is, a1 matches b1, then the mappings between a1's super-concept a2 (or sub-concept a3) and b1's sub-concept b3 (or super-concept b2), that is, (a2, b3) and (a3, b2), both conflict with (a1, b1).
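The greedy filtering can be sketched as follows, assuming correspondences are (source, target, similarity) triples and that is_ancestor predicates expose the "is-a" hierarchies; additionally enforcing one-to-one mappings is an assumption of this sketch:

```python
def select_alignment(correspondences, is_ancestor_src, is_ancestor_tgt,
                     threshold=0.88):
    """Greedy filtering: drop low-similarity pairs, then pick by descending
    similarity, skipping any pair that reuses an entity or crosses the
    is-a hierarchy of an already selected pair.
    is_ancestor_*(x, y) -> True when x subsumes y."""
    candidates = [c for c in correspondences if c[2] >= threshold]
    candidates.sort(key=lambda c: c[2], reverse=True)
    selected = []
    for a, b, sim in candidates:
        conflict = any(
            a2 == a or b2 == b or
            (is_ancestor_src(a, a2) and is_ancestor_tgt(b2, b)) or
            (is_ancestor_src(a2, a) and is_ancestor_tgt(b, b2))
            for a2, b2, _ in selected)
        if not conflict:
            selected.append((a, b, sim))
    return selected
```

The crossing check rejects exactly the (a2, b3) and (a3, b2) patterns of Figure 3: a mapping whose source subsumes a selected source while its target is subsumed by the selected target (or vice versa).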

Experimental Configuration
In the experiments, the OAEI Conference track with the ra1 version [40] and three pairs of real sensor ontologies are used to test CcFA's performance. We compare CcFA with two SIA-based ontology matching techniques, that is, the EA-based matcher [41] and the PSO-based matcher [42], and four state-of-the-art sensor ontology matching systems, that is, SOBOM [43], CODI [44], ASMOV [45] and FuzzyAlign [38], whose code and configuration parameters are available online. SOBOM works based on syntax- and structure-based similarity measures, and it can obtain better results when the concepts' literals and the ontology hierarchy structure are complete. CODI utilizes a syntax-based similarity measure and a Markov Logic based probabilistic model to produce the alignment. ASMOV determines the alignment in an iterative way, and it also takes into consideration three kinds of similarity measures to calculate the similarity values. FuzzyAlign uses a fuzzy rule-based strategy to combine three categories of similarity measures, and it utilizes an EA to determine a threshold for filtering the final alignment.
The obtained alignments are evaluated through the f-measure [46], which is based on the golden standard alignment. CcFA uses the parameters (see also Section 4.3) that ensure the highest average alignment quality on all exploited testing cases, while EA and PSO utilize the configurations reported in their own literature. All SIA-based techniques' results are the mean values of thirty independent runs.

The Results of Statistical Comparison
The Conference track requires matching seven ontologies describing the domain of organizing conferences, that is, Cmt (http://msrcmt.research.microsoft.com/cmt), Pcs (http://precisionconference.com), OpenConf (http://www.zakongroup.com/technology/openconf.shtml), Edas (http://edas.info/), Ekaw (http://ekaw.vse.cz), Iasted (http://iasted.com/conferences/2005/cancun/ms.htm) and Sigkdd (http://www.acm.org/sigs/sigkdd/kdd2006). The three pairs of real sensor ontologies are MMI Device ontology vs SSN ontology, CSIRO sensor ontology vs SSN ontology, and MMI Device ontology vs CSIRO sensor ontology. In particular, the MMI Device ontology (http://mmisw.org/ont/mmi/device) is developed by the Marine Metadata Interoperability (MMI) project for promoting the exchange, integration and use of marine data; the SSN ontology (https://www.w3.org/TR/vocab-ssn) is developed by the World Wide Web Consortium (W3C) for modeling the knowledge in the sensor network domain; and the CSIRO sensor ontology (https://www.w3.org/2005/Incubator/ssn/wiki/SensorOntology2009) is developed by M. Compton et al. from CSIRO (Australia) for providing a semantic description of sensors in terms of the sensor grounding and operation specification. All these sensor ontologies are widely used and openly accessible. The reasons why we select the SSN ontology, the CSIRO sensor ontology and the MMI Device ontology in the experiment are: (1) they define lots of overlapping information with different representations; (2) SSN is one of the most used global reference ontologies, and it provides an alignment with another upper ontology, DOLCE Ultra Lite (http://ontologydesignpatterns.org/ont/dul/DUL.owl), which can be used as the golden alignment to measure an alignment's quality, that is, to calculate the f-measure value. Table 1 shows the statistical information about these ontologies.

Table 1. A brief description of the sensor ontologies.
Ontology                 Classes    Data Properties    Object Properties
Cmt                         36            10                  49
Pcs                         23            14                  24
OpenConf                    62            21                  24
Edas                       104            20                  30
Ekaw                        77             0                  33
Iasted                     140             3                  38
Sigkdd                      49            11                  17
MMI Device ontology         55            43                  28
SSN ontology                19             2                  36
CSIRO sensor ontology       75            11                  61

We utilize Friedman's test [47] and Holm's test [48] to carry out the statistical experiment in terms of the alignments' quality. In particular, Friedman's test is used to figure out whether there are any differences among the competitors, and Holm's test is used to check whether one competitor statistically outperforms the others. In Friedman's test, we need to reject the null hypothesis that all the competitors are equivalent. Therefore, the computed value X^2_r must be equal to or greater than the tabled critical chi-square value at the specified level of significance α. Here, we choose α = 0.05, and since we are comparing 7 matchers, we need to consider the critical value for 6 degrees of freedom, that is, X^2_0.05 = 12.592. In Table 2, each value represents the f-measure, and the number in round parentheses is the corresponding computed rank. The computed X^2_r = 119.73 is greater than 12.592, and therefore, the null hypothesis is rejected. Then, Holm's test is further carried out. As shown in Table 2, since our approach ranks with the lowest value, it is set as the control matcher that is compared with the others. In Holm's test, the z-value is the testing statistic for comparing the i-th and j-th matchers, which is used for finding the p-value (the corresponding probability from the table of the normal distribution). The p-value is then compared with α = 0.05, an appropriate level of significance. According to Table 3, we can state that CcFA statistically outperforms the other competitors on f-measure at the 5% significance level. In Figure 4, we compare CcFA with the OAEI's participants (http://oaei.ontologymatching.org/2019/results/conference/index.html) in terms of average f-measure.
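For reproducibility, the X^2_r statistic can be computed from the f-measure table with a few lines of Python; this is a generic sketch using average ranks for ties, with rank 1 assigned to the best score:

```python
def ranks(row):
    """Average ranks (1 = best) for one test case's f-measure scores."""
    order = sorted(range(len(row)), key=lambda j: -row[j])
    r = [0.0] * len(row)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and row[order[j + 1]] == row[order[i]]:
            j += 1                      # extend the current tie group
        avg = (i + j) / 2 + 1           # average of 1-based positions i..j
        for idx in order[i:j + 1]:
            r[idx] = avg
        i = j + 1
    return r

def friedman_statistic(scores):
    """scores[i][j] = f-measure of matcher j on test case i.
    Returns Friedman's chi-square statistic X^2_r."""
    n, k = len(scores), len(scores[0])
    rank_sums = [0.0] * k
    for row in scores:
        for j, r in enumerate(ranks(row)):
            rank_sums[j] += r
    return (12.0 / (n * k * (k + 1))) * sum(R * R for R in rank_sums) \
        - 3.0 * n * (k + 1)
```

For the seven matchers compared here, the returned statistic would be checked against the critical value X^2_0.05 = 12.592 for 6 degrees of freedom.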
As can be seen from the tables and the figure, the f-measure values obtained by CcFA outperform all the other competitors, which shows that CcFA can effectively optimize the ontology alignments. In particular, the quality of CcFA's alignments is better than that of EA and PSO, which shows that CcFA's compact encoding mechanism and compact operators can better trade off the algorithm's exploration and exploitation. Since no single similarity measure can effectively distinguish all the heterogeneous concepts in all situations, it is necessary to aggregate several similarity measures to improve the result's precision. We utilize a hybrid similarity measure which combines three kinds of similarity measures to calculate the entity similarity value, and therefore CcFA's results are significantly higher than those of the systems that only take into consideration one or two categories of similarity measures, such as SOBOM, CODI, DOME, Lily, ALin, the LogMap family and Wiktionary. On the other hand, FuzzyAlign, ASMOV, AML, SANOM and ONTMAT1 apply too many similarity measures, which can lead to conflicting results and decrease the recall value. Thus, how many similarity measures should be selected and combined to ensure the quality of the alignment will be one of our future works. Currently, there is a new trend of developing lightweight sensor ontologies, such as the Sensor, Observation, Sample, and Actuator (SOSA) ontology [49], which only provides the specification for the kernel SSN entities involved in the acts of observation, actuation and sampling, and IoT-Lite [50]. CcFA can also represent an efficient approach for matching these lightweight sensor ontologies, since they contain fewer entities and the search space is relatively smaller.
However, since the entities' semantic relationships in the lightweight ontologies could be more complicated and richer, lightweight sensor ontology alignment is more complex, that is, one source ontology entity may be mapped to more than one target ontology entity, and the relationships could be equivalence or subsumption. To address this complex matching problem [51], a feasible method is to introduce various mapping patterns [52] into CcFA to detect the complex correspondences, which is one of our future works.

Conclusions and Future Work
In order to support the information integration of various sensor ontologies, in this paper, a general-purpose ontology matching technique based on CcFA is proposed. Our proposal makes use of the compact α-step and β-step to implement the discrete exploitation and exploration, and trades them off during the evolving process through two PVs. The experiment shows the effectiveness of this combination of the compact encoding mechanism and the co-Evolutionary mechanism, and our approach statistically outperforms the other competitors on the alignment's quality at the 5% significance level when matching sensor ontologies and other ontologies in the domain of organizing conferences.
In the future, we will further study techniques that can adaptively select and combine various similarity measures according to different heterogeneity situations. Moreover, we will improve the CcFA-based approach to match large-scale sensor ontologies, which is a challenge in the ontology matching domain. Another challenge is the problem of Instance Coreference Resolution (ICR) [53] in the sensor network domain, which requires matching large-scale sensor instances in the Linked Open Data (LOD) cloud. Currently, there is no SIA-based technique that can effectively solve ICR, and we are also interested in addressing this challenge with CcFA. In this work, we mainly aim at matching sensor ontologies, as well as other general ontologies, such as those in the domain of organizing conferences. With respect to specific ontologies like biomedical and geographical ontologies, directly applying our proposal to match them could yield low precision and recall, because these tasks require specific background knowledge bases and complex forms of alignment. Therefore, we would like to extend the CcFA-based matching technique to address the matching tasks in these specific domains.