Ontology-based Feature Selection: A Survey

The Semantic Web emerged as an extension to the traditional Web, towards adding meaning to a distributed Web of structured and linked data. At its core, the concept of ontology provides the means to semantically describe and structure information and data and expose it to software and human agents in a machine- and human-readable form. For software agents to be realized, it is crucial to develop powerful artificial intelligence and machine learning techniques, able to extract knowledge from information and data sources and represent it in the underlying ontology. This survey aims to provide insight into key aspects of ontology-based knowledge extraction from various sources, such as text, images, databases and human expertise, with emphasis on the task of feature selection. First, some of the most common classification and feature selection algorithms are briefly presented. Then, selected methodologies, which utilize ontologies to represent features and perform feature selection and classification, are described. The presented examples span diverse application domains, e.g., medicine, tourism, mechanical and civil engineering, and demonstrate the feasibility and applicability of such methods.


Introduction
The recent development of the Semantic Web enables the systematic representation of vast amounts of knowledge within an ontological framework. An ontology is a formal and explicit description of shared and agreed knowledge as a set of concepts within a domain and the relationships that hold among them. The ontological model provides a rich set of axioms to link pieces of information and enables automatic reasoning to infer knowledge that has not been explicitly asserted before.
In many cases, reasoning can be cast as a classification task. An important step towards accurate and efficient classification is feature selection. Consequently, the identification of high-quality features from an ontology hierarchy plays a significant role in the ability to extract information from an ontological model. This report summarizes related work which tackles the problem of feature representation and selection based on ontologies, in the context of knowledge extraction from documents, images, databases and human expertise. In the first and second sections, a brief summary of selected classification and feature selection methods is presented. In the third section, the concept of ontology as a building block of the Semantic Web is introduced. In the fourth section, example methodologies of ontology-based feature selection are summarized. In the fifth section, open issues and challenges are considered. In the last section, the conclusions of this survey are discussed.

Data Classification
One of the most common applications of machine learning is data classification. Data classification can be defined as the data analysis task which, given a set of observations belonging to known categories, aims at identifying to which category a new observation belongs. In the case that the feature and target variables are not categorical, but take continuous values, the classification task is called regression.
In essence, this problem attempts to learn the relationship between a set of feature variables and a target variable of interest. In practice, a large variety of problems can be expressed as associations between feature and target variables, which provides a broad range of applications, such as customer target marketing [1,2], medical disease diagnosis [3,4,5], speech and handwriting recognition [6,7,8,9], multimedia data analysis [10,11], biological data analysis [12], document categorization and filtering [13,14], and social network analysis [15,16,17].
Classification algorithms typically contain two steps, the learning step and the testing step. The first one constructs the classification model, while the second evaluates it by assigning class labels to unlabeled data. For a test instance under consideration, the output of a classification algorithm may be presented either as discrete label or numerical score. In the former case, the classifier assigns a single label, which identifies the class of the test instance, while in the latter a numerical score is returned which associates the test instance with each class. The advantage of a numerical score is that it incorporates the additional information of "belongingness" of the test instances to each category and thus it facilitates their ranking.
A close relative to the classification problem is data clustering [18,19]. Clustering is the task of dividing a population of data points into a number of groups, such that the members of the same group are in some sense similar to each other and dissimilar to the data points in other groups.
The key difference between the two tasks is that in the case of clustering, data are segmented using similarities between feature variables, while in the case of classification, data partitioning is based on a training data set. Consequently, clustering has no understanding of the underlying group structure, whereas classification uses knowledge encoded in the training data set in the form of a target variable. As a result, the classification task is referred to as supervised learning and clustering as unsupervised learning.
Probabilistic methods are based on two probabilities, namely a prior probability, which is derived from the training data, and a posterior probability that a test instance belongs to a particular class. There are two approaches for the estimation of the posterior probability. In the first approach, called generative, the training dataset is used to determine the class probabilities and class-conditional probabilities and the Bayes theorem is employed to calculate the posterior probability. In the second approach, called discriminative, the training dataset is used to identify a direct mapping of a test instance onto a class.
A common example of a generative model is the Naive Bayes classifier [31,32]. Assuming a test instance T with features x = [x_1, x_2, ..., x_d], the probability that T belongs to class y can be calculated with the Bayes theorem:

P(y|x) = P(y) P(x|y) / P(x)   (1)

Then the problem of classification is to find the class which maximizes the above probability given the features of the test instance T. Since the denominator is constant across all classes, the problem can be reduced as follows:

y* = argmax_y P(y) P(x|y)

In the above equation, the class probability P(y) is the fraction of training instances which belong to class y. The class-conditional probability P(x|y) can be calculated under the naive Bayes assumption that the features x_i are independent of each other. This simplification allows the class-conditional probability to be calculated as a product of the feature-wise conditional probabilities:

P(x|y) = ∏_{i=1..d} P(x_i|y)

The term P(x_i|y) is computed as the fraction of the training instances classified as y which contain the i-th feature. Generally, the naive Bayes assumption does not hold; however, the naive Bayes model has been proven to work quite well in practice.

A popular discriminative classifier is logistic regression [31], where the posterior probability is modeled as:

P(y|x) = 1 / (1 + exp(−θᵀx))

where θ is a vector of parameters to be estimated from the training data. Given m independent training instances with class labels y = [y_1, y_2, ..., y_m] and feature vectors X = [x_1, x_2, ..., x_m] respectively, the unknown parameters are derived from the maximization of the posterior probability with respect to θ:

θ* = argmax_θ ∏_{i=1..m} P(y_i|x_i; θ)

In Decision Tree Classification [20,21,22], data are recursively split into smaller subsets until all formed subsets exhibit class purity, i.e., all members of each subset are sufficiently homogeneous and belong to the same unique class.
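The generative decision rule above can be sketched in a few lines of Python. The toy word-set corpus, the class names and the add-one (Laplace) smoothing are illustrative assumptions, not part of the survey's formulation:

```python
from collections import Counter, defaultdict

def train_naive_bayes(records, labels):
    """Estimate P(y) and the per-class feature counts from 'feature present' records."""
    class_counts = Counter(labels)
    feature_counts = defaultdict(Counter)  # feature_counts[y][f] = #instances of y containing f
    for feats, y in zip(records, labels):
        for f in feats:
            feature_counts[y][f] += 1
    priors = {y: n / len(labels) for y, n in class_counts.items()}
    return priors, feature_counts, class_counts

def classify(feats, priors, feature_counts, class_counts):
    """Pick argmax_y P(y) * prod_i P(x_i|y), with add-one smoothing (an assumption)."""
    best, best_score = None, 0.0
    for y, prior in priors.items():
        score = prior
        for f in feats:
            score *= (feature_counts[y][f] + 1) / (class_counts[y] + 2)
        if score > best_score:
            best, best_score = y, score
    return best

# hypothetical corpus: word sets labeled "sport" or "tech"
records = [{"ball", "goal"}, {"goal", "team"}, {"cpu", "ram"}, {"ram", "code"}]
labels = ["sport", "sport", "tech", "tech"]
model = train_naive_bayes(records, labels)
print(classify({"goal", "ball"}, *model))   # → sport
```

In a real text classifier the products above are usually replaced by sums of log-probabilities to avoid numerical underflow.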
In order to optimize the decision tree, an impurity measure is employed and the optimal splitting rule at each node is determined by maximizing the amount by which the impurity decreases due to the split. A commonly used function for this purpose is the Shannon entropy. If T are the training records at the Nth node of the decision tree and (p_1, p_2, ..., p_k) are the fractions of the records which belong to k different classes, then the entropy H(T) is:

H(T) = − Σ_{i=1..k} p_i log p_i

and the Information Gain (impurity decrease) from picking attribute α for splitting the node is:

IG(T, α) = H(T) − Σ_v (|T_v| / |T|) H(T_v)

where T_v denotes the set of all training records where attribute α = v. An extension to decision tree classification is the Random Forest (RF) algorithm [33]. This algorithm, essentially, trains a large set of decision trees and combines their predictive ability in a single classifier. These decision trees vote for the class membership of an instance and the class with the majority of votes becomes the result of the classification. In order to build each decision tree, first a training data set is derived from the original dataset by random sampling with replacement, meaning that training records are allowed to be selected multiple times. Then, during training, only a random subset of features, out of the whole feature space, is considered for splitting each node. These two steps introduce randomness into the tree learning process and increase the diversity of the base classifiers, thus improving the overall accuracy of the classification process. The RF classifier belongs to a broader family of methods called ensemble learning [31].
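The entropy and Information Gain formulas translate directly into code. The following sketch uses a hypothetical toy split in which the attribute perfectly separates the two classes, reaching the maximum gain of 1 bit:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy H(T) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(records, labels, attr):
    """IG from splitting on attr: H(T) - sum_v |T_v|/|T| * H(T_v)."""
    n = len(labels)
    splits = {}
    for rec, y in zip(records, labels):
        splits.setdefault(rec[attr], []).append(y)
    remainder = sum(len(part) / n * entropy(part) for part in splits.values())
    return entropy(labels) - remainder

# toy data: the "wind" attribute perfectly separates the classes
records = [{"wind": "low"}, {"wind": "low"}, {"wind": "high"}, {"wind": "high"}]
labels = ["play", "play", "stay", "stay"]
print(information_gain(records, labels, "wind"))   # → 1.0
```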
A classification method closely related to Decision Trees is called Rule Based Classification [26,27,28]. Essentially, all paths in a decision tree represent rules, which map test instances to different classes. However, for Rule-based methods the classification rules are not required to be disjoint, rather they are allowed to overlap. Rules can be extracted either directly from data (rule induction) or built indirectly from other classification models.
CN2 [34] and RIPPER [35] are two popular direct algorithms, which use the sequential covering paradigm. In this approach, a target class is selected and the corresponding rule is mined from the data incrementally by successive additions of conjuncts to the rule antecedent, according to a certain criterion (e.g., maximization of FOIL's Information Gain for RIPPER or minimization of the entropy measure for CN2). After each rule is grown, all matched training instances are removed. This process is repeated until all remaining training instances belong to one class. This class serves as the default, and is assigned to all test instances which do not trigger any rule.
A novel family of algorithms which aim at mining classification rules indirectly, is the so called Associative Classification [36]. Associations are interesting relations between variables in large datasets. Association rules can quantify such relations by means of constraints on measures of significance or interest. The best-known constraints are imposing minimum threshold values on support and confidence. In the training phase, an associative classifier, mines a set of Class Association Rules from the training data. This process occurs in two steps. First, the training data set is searched for repeated patterns of feature-value pairs. Then, association rules are generated from the resulting pairs, according to the proportion of the data they represent (support) and their accuracy (confidence). The mined CARs are used to build the classification model according to some strategy such as applying the strongest rule, selecting a subset of rules, forming a combination of rules or using rules as features.
Another promising new class of methods are Support Vector Machines (SVM) [37]. SVM classifiers are generally defined for binary classification tasks. Intuitively, they attempt to draw a decision boundary between the data items of two classes, according to some optimality criterion. A common criterion employed by SVM is that the decision surface must be maximally far away from any data point. The margin of the separation is determined by the distance from the decision surface to the closest data points. Such data points are called support vectors.
Finding the maximum margin hyperplane is a quadratic optimization problem [38]. In case the training data are not linearly separable, slack variables can be introduced in the formulation to allow some training instances to violate the support vector constraint, i.e., they are allowed to be on the "other" side of the support vector from the one which corresponds to their class. When data are non-linear, a linear model will not have acceptable accuracy. However, in this case it is possible to use a mapping to transform the original data into linear-separable data in a higher dimension. Such transformations rely on the notion of similarity between data records. Such similarities are expressed as dot-products which are called kernel functions. A typical function of this kind is the Gaussian Kernel function.
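As a minimal illustration of such a kernel function, the Gaussian kernel can be computed directly from the squared Euclidean distance between two records. The bandwidth parameter sigma below is a free choice for the sketch, not a value prescribed by the text:

```python
from math import exp

def gaussian_kernel(x, z, sigma=1.0):
    """K(x, z) = exp(-||x - z||^2 / (2 sigma^2)): similarity in an implicit
    higher-dimensional feature space, without computing the mapping itself."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return exp(-sq_dist / (2 * sigma ** 2))

# identical points are maximally similar ...
print(gaussian_kernel([1.0, 2.0], [1.0, 2.0]))   # → 1.0
# ... while distant points decay toward 0
print(gaussian_kernel([0.0, 0.0], [3.0, 4.0]))   # very small, ≈ exp(-12.5)
```

An SVM trained with this kernel only ever evaluates K(x, z) on pairs of records, which is what makes the implicit high-dimensional mapping tractable.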
Recently, Artificial Neural Networks have been proven to be powerful classifiers [32]. They attempt to mimic the human brain by means of an interconnected network of simple computational units, called neurons. Neurons are functions which map an input feature vector to an output value according to predefined weights. These weights express the influence of each feature over the output of the neuron and are learned during the training phase. A typical tool to perform the training process is the backpropagation algorithm. Backpropagation uses the chain rule to compute the derivative of the error (loss function) with respect to the weights of the network and applies gradient methods (e.g., stochastic gradient descent) to find weight values which minimize the error.
In general, the neural mapping results from the composition of a net value function ξ, which summarizes the input data into a net value v = ξ(x, w), and an activation function which transforms the net value into an output value h = φ(v). Depending on the form of the functions ξ and φ, several neuron types can be derived. For example, the perceptron or linear neuron uses a weighted sum of the features plus a bias value as the net value function, while the activation function is a simple multiple of the net value with a parameter α, usually equal to 1. The sigmoidal neuron uses the same net value function as the perceptron, but replaces the activation function either with the sigmoid function or with the hyperbolic tangent function. On the other hand, the distance neuron uses the same activation function as the perceptron, but replaces the net value function with a distance measure ξ(x, w) = ‖x − w‖. The radial basis function neuron uses again a distance measure, typically the Euclidean distance or the Mahalanobis distance, as the net value function, but employs the exponential φ(v) = exp(−v) as the activation function. Polynomial units extend the perceptron net value function by adding a quadratic form, ξ(x, w) = wᵀx + w_0 + xᵀWx, while the activation function is usually linear or sigmoid.
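A few of the neuron types above can be sketched as plain functions; the weights and inputs in the example are arbitrary illustrative values:

```python
from math import exp

def perceptron(x, w, b, a=1.0):
    """Linear neuron: net value v = w·x + b, activation phi(v) = a * v."""
    v = sum(wi * xi for wi, xi in zip(w, x)) + b
    return a * v

def sigmoid_neuron(x, w, b):
    """Same net value function, sigmoid activation phi(v) = 1 / (1 + exp(-v))."""
    v = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 / (1 + exp(-v))

def rbf_neuron(x, w):
    """Distance-based net value (squared Euclidean), exponential activation exp(-v)."""
    v = sum((xi - wi) ** 2 for xi, wi in zip(x, w))
    return exp(-v)

x, w, b = [1.0, 2.0], [0.5, -0.25], 0.0
print(perceptron(x, w, b))        # → 0.0
print(sigmoid_neuron(x, w, b))    # → 0.5
print(rbf_neuron([1.0], [1.0]))   # → 1.0
```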
The aforementioned classifiers are eager learners, meaning that they build the generalization function during training by observing the whole training set, but without any insight into future queries. When the training data are not distributed evenly in the input space, eager methods may suffer from poor generalization capabilities. Essentially, by trying to minimize the global error over the whole training set they may exhibit inaccurate approximations of local subspaces.
Instance-based classification attempts to parry this shortcoming [39]. In the first phase, instance-based classifiers do not build any approximation models; rather, they simply store the training records. When a query is submitted, the system uses a distance function to extract, from the training data set, those records which are most similar to the test instance. Label assignment is performed based on the extracted subset. Common instance-based classifiers are the K-Nearest Neighbor (KNN), the Self-Organizing Map (SOM) and Learning Vector Quantization (LVQ) [40]. A generalization of instance-based learning is lazy learning, where training examples in the neighborhood of the test instance are used to train a locally optimal classifier. The field of classification is vast and still rapidly evolving. For an excellent in-depth discussion on classification methods, the curious reader is referred to [31].
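A minimal KNN sketch, assuming squared Euclidean distance and a simple majority vote (both common but not the only choices), may look as follows:

```python
from collections import Counter

def knn_classify(query, records, labels, k=3):
    """Label the query by majority vote among its k nearest stored training records."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(rec, query)), y)
        for rec, y in zip(records, labels)
    )
    votes = Counter(y for _, y in dists[:k])
    return votes.most_common(1)[0][0]

# toy training set with two well-separated clusters
records = [(0.0, 0.0), (0.1, 0.2), (0.9, 1.0), (1.0, 0.8)]
labels = ["a", "a", "b", "b"]
print(knn_classify((0.05, 0.1), records, labels, k=3))   # → a
```

Note that all computation happens at query time, which is exactly the "lazy" behavior described above.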

Feature Selection
The first step towards successful classification is finding an accurate representation of the problem domain. In other words, one has to concretely define the features and target classes which will be input to the classifier. In practice, these are part of available raw data, for example in huge databases or through a live data collection system. In this pile of data, explicit features and target classes are not so clear cut, and need to be uncovered. In general, this process is called Feature Engineering (FE) and encompasses algorithms for generating features from raw data (feature generation), transforming existing features (feature transformation), selecting the most important features (feature selection), understanding feature behavior (feature analysis) and determining feature importance (feature evaluation) [41].
Feature selection is one of the most popular and well-studied methods of FE. When data records are derived from databases, they usually contain data columns which are unrelated to the targets under consideration. If all data are blindly used as features, they increase the dimensionality of the problem, which has a negative impact on the classification task. First of all, an increasing number of dimensions in the feature space results in exponential expansion of the computational cost. This issue is directly related to the curse of dimensionality. Furthermore, as the volume of the feature space increases, it becomes sparsely populated, and irrelevant dimensions may drive even closely related data points apart, making them appear as distant as unrelated ones. This increases overfitting and reduces the accuracy of the classifier. Also, the more features are used for the classification, the more difficult it becomes to decode the underlying relationship between the used features and the classifier output. Restricting the used features to only those that are strictly relevant to the target classes results in improved interpretability of the model. Finally, data collection becomes easier, and of higher quality, when the number of features is small.
The feature selection process attempts to remedy these issues by identifying features which can be excluded without adversely affecting the classification outcome. Feature selection is closely related to feature extraction. The main difference is that while feature selection maintains the physical meaning of the retained features, feature extraction attempts to reduce the number of dimensions by mapping the physical feature space onto a new mathematical space. Consequently, the derived features lack a physical meaning, which makes the addition and analysis of new physical features problematic. In this sense, feature selection appears to be superior in terms of readability and interpretability.
Feature selection can be supervised, unsupervised or semi-supervised. Supervised methods consider the classification information, and use measures to quantify the contribution of each feature to the total information, thus keeping only the most important ones. Unsupervised methods attempt to remove redundant features in two steps. First, features are clustered into groups, using some measure of similarity, and then the features with the strongest correlations to the other features in the same group are retained as the representatives of the group. Identification and removal of irrelevant features is more difficult and abstract, and depends on some heuristic of relevance or interestingness. To devise such heuristics, researchers have employed several measures such as scatter separability, entropy, category utility, maximum likelihood, density, and consensus [42]. Semi-supervised feature selection addresses the case when both a large set of unlabeled and a small set of labeled data are available. The idea is to use the supervised class-based clustering of features in the small dataset as a constraint for the unsupervised locality-based clustering of the features in the large dataset.
Depending on whether and how they use the classification system, feature selection algorithms, are divided into three categories, namely filters, wrappers and embedded models.
Filter models select subsets of variables as a preprocessing step, independently of the chosen classifier. In the first step, features are analyzed and ranked on the basis of how they correlate to the target classes. This analysis can either consider features separately and perform ranking independently of the feature space (univariate), or evaluate groups of features (multivariate). Multivariate analysis has the advantage that interactions between features are taken into account in the selection process. In the second step, the features with the highest scores are used in the classification model.
Some of the most common evaluation metrics which have been used for ranking and filtering are:

Chi-Square: The χ² correlation uses the contingency table of a feature-target pair to evaluate the likelihood that a selected feature and a target class are correlated. The contingency table shows the distribution of one variable (the feature) in rows and another (the target) in columns. Based on the entries, the observed values are calculated. Also, under the assumption that the variables are independent (null hypothesis), the expected values are derived. Small values of χ² show that the expected values are close to the observed values, thus the null hypothesis holds. On the contrary, high values show strong correlation between the feature and the target value. The χ² metric is defined as:

χ² = Σ_i Σ_j (O_ij − E_ij)² / E_ij

where O_ij and E_ij are the observed and expected values of feature i with respect to class j.

ANOVA: A metric related to χ² is Analysis of Variance. It tests whether several groups are similar or different by comparing their means and variances, and returns an F-statistic which can be used for feature selection. The idea is that a feature where each of its possible values corresponds to a different target class will be a useful predictor. Let f̄ be the mean value of feature f (grand mean), f̄_i the mean value of feature f in each individual group i with class assignment c_i, and f_ij the value of feature f at record j of group i. The Sum of Squares Between is defined as:

SSB = Σ_i n_i (f̄_i − f̄)²

and the Sum of Squares Within as:

SSW = Σ_i Σ_j (f_ij − f̄_i)²

Then the F-statistic is computed as:

F = (SSB / DF_B) / (SSW / DF_W)

where DF_B = k − 1 and DF_W = N − k are the degrees of freedom between and within groups, with k the number of classes and N the total number of records. A high value of F indicates that feature f has high predictive power.

Fisher Score: It is based on the intuition that high-quality features should assign similar values to instances in the same class and different values to instances from different classes.
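The χ² computation can be illustrated on a small contingency table; the table values below are hypothetical:

```python
def chi_square(table):
    """Chi-square statistic from a feature-by-class contingency table (list of rows).
    Expected counts come from the independence assumption: row_sum * col_sum / total."""
    total = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_sums[i] * col_sums[j] / total
            chi2 += (observed - expected) ** 2 / expected
    return chi2

# independent feature/class: observed == expected, so chi2 = 0
print(chi_square([[10, 10], [10, 10]]))   # → 0.0
# strongly associated feature/class: large chi2
print(chi_square([[20, 0], [0, 20]]))     # → 40.0
```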
Let μ_i denote the mean of the i-th feature, n_j the number of instances in the j-th class, and μ_ij, ρ_ij the mean and the variance of the i-th feature in the j-th class, respectively. Then the Fisher Score for the i-th feature is:

F_i = Σ_j n_j (μ_ij − μ_i)² / Σ_j n_j ρ_ij

Pearson Correlation Coefficient: It is used as a measure for quantifying linear dependence between a feature variable X_i and a target variable Y_k. It ranges from −1 (perfect anticorrelation) to 1 (perfect correlation) and is defined as:

R(i, k) = cov(X_i, Y_k) / sqrt(var(X_i) var(Y_k))

where the covariance and variances of the variables are estimated from the training data.
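The Pearson coefficient, estimated from training samples as described, can be sketched as follows (the toy sequences are illustrative):

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between a feature column and a target column."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    varx = sum((x - mx) ** 2 for x in xs)
    vary = sum((y - my) ** 2 for y in ys)
    return cov / (varx * vary) ** 0.5

print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))   # → 1.0 (perfect correlation)
print(pearson([1, 2, 3, 4], [8, 6, 4, 2]))   # → -1.0 (perfect anticorrelation)
```

In a filter model, features would be ranked by |R| and the highest-scoring ones retained.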
Mutual Information: The Information Gain metric provides a method of measuring the dependence between the i-th feature and the target classes c = [c_1, c_2, ..., c_k] as the decrease in total entropy, namely:

IG(f_i, c) = H(f_i) − H(f_i | c)

where H(f_i) is the entropy of f_i and H(f_i | c) the entropy of f_i after observing c. High information gain indicates that the selected feature is relevant. IG has gained popularity due to its computational efficiency and simple interpretation. It has also been extended to account for feature correlation and redundancy. Other MI-based metrics are Gini Impurity and Minimum-Redundancy-Maximum-Relevance.
Filter models select features based on their statistical similarities to a target variable. Wrapper methods take a different approach and use the preselected classifier as a way to evaluate the accuracy of the classification task for a specific feature subset. A wrapper algorithm consists of three components, namely a feature search component, a feature evaluation component and a classifier. At each step, the search component generates a subset of features which will be evaluated for the classification task. When the total number of features is small, it is possible to test all possible feature combinations. However, this approach, known as SUBSET, becomes quickly computationally intractable.
Greedy search methods overcome this problem by using a heuristic rule to guide the subset generation. In particular, forward selection starts with an empty set and evaluates the classification accuracy of each feature separately. The best feature initializes the set. In the subsequent iterations, the current set is combined with each of the remaining features and the union is tested for its classification accuracy. The feature producing the best classification is added permanently to the selected features, and the process is repeated until the number of features reaches a threshold or none of the remaining features improves the classification. On the other hand, backward elimination starts with all features. At each iteration, all features in the set are removed one by one and the resulting classification is evaluated. The feature affecting the classification the least is removed from the list. Finally, bidirectional search starts with an empty set (expanding set) and a set with all features (shrinking set). At each iteration, first a feature is forward selected and added to the expanding set, with the constraint that the added feature exists in the shrinking set. Then a feature is backward eliminated from the shrinking set, with the constraint that it has not already been added to the expanding set.
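The forward-selection loop described above can be sketched generically. The `evaluate` callable stands in for a full train-and-validate cycle, and the toy per-feature scores are purely hypothetical:

```python
def forward_selection(features, evaluate, max_features=None):
    """Greedy forward selection: grow the feature set one best-scoring feature
    at a time; stop when no remaining feature improves the score.
    `evaluate(subset)` is assumed to return the classification accuracy."""
    selected, best_score = [], float("-inf")
    remaining = list(features)
    limit = max_features or len(features)
    while remaining and len(selected) < limit:
        scored = [(evaluate(selected + [f]), f) for f in remaining]
        score, best = max(scored)
        if score <= best_score:   # no remaining feature improves the classifier
            break
        selected.append(best)
        remaining.remove(best)
        best_score = score
    return selected

# hypothetical evaluator: only features "a" and "b" carry signal (a stand-in
# for training and validating a real classifier on each candidate subset)
useful = {"a": 0.3, "b": 0.2, "c": 0.0}
evaluate = lambda subset: sum(useful[f] for f in subset)
print(forward_selection(["c", "b", "a"], evaluate))   # → ['a', 'b']
```

Backward elimination is the mirror image: start from the full list and repeatedly drop the feature whose removal hurts `evaluate` the least.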
Many more strategies have been used to search the feature space, such as branch-and-bound, simulated annealing and genetic algorithms. Branch-and-bound uses depth-first search to traverse the feature subset tree, pruning those branches which have a worse classification score than that of an already traversed, fully expanded branch. Simulated annealing and genetic algorithms encode the selected features in a binary vector. At each step, offspring vectors, representing different combinations of features, are generated and tested for their accuracy. A common technique for performance assessment is k-fold cross-validation. The training data are split into k sets and the classification task is performed k times, using at each iteration one set as the validation set and the remaining k−1 sets for training. Filter methods are cheap, but the selected features do not consider the biases of the classifiers. Wrapper methods select features tailored to a given classifier, but have to run the training phase many times, hence they are very expensive. Embedded methods combine the advantages of both filters and wrappers by integrating feature selection into the training process. For example, pruning in Decision Trees and Rule-based Classifiers is a built-in mechanism to select features. In another family of classification methods, the change in the loss function incurred by changes in the selected features can be either exactly computed or approximated, without the need to retrain the model for each candidate variable. Combined with greedy search strategies, this approach allows for efficient feature selection (e.g., RFE/SVM, Gram-Schmidt/LLS). A third type of embedded methods are regularization methods, which apply to classifiers where weight coefficients are assigned to features (e.g., SVM or logistic regression). In this case, the feature selection task is cast as an optimization problem with two components, namely maximization of goodness-of-fit and minimization of the number of variables.
The latter condition is achieved by forcing weights to be small or exactly zero. Features with coefficients close to zero are removed.
Specifically, the feature weight vector is defined as:

w* = argmin_w c(w, X) + penalty(w)   (12)

where c is the loss function and penalty(w) is the regularization term. A well-studied form of the regularization term is the ℓ_p norm, penalty(w) = λ ‖w‖_p^p: Lasso regularization uses p = 1, while Ridge uses p = 2. Elastic net combines the two:

penalty(w) = λ_1 ‖w‖_1 + λ_2 ‖w‖_2²

Many more feature selection algorithms and variations can be found in the literature. Due to its significance in the classification task, feature selection, and feature engineering in general, is a highly active field of research. For an in-depth presentation, the interested reader is referred to [41], [42] and [43]. Comprehensive reviews can be found in [44] and [45].
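For an ℓ1 (Lasso) penalty, the mechanism that forces weights to exactly zero can be illustrated with the one-dimensional soft-thresholding operator, a standard result for the ℓ1 proximal step (the weights and threshold below are illustrative):

```python
def soft_threshold(w, lam):
    """Proximal operator of the l1 penalty: shrink w toward 0 by lam, and clip
    weights smaller than lam to exactly 0 -- this is how Lasso performs
    embedded feature selection."""
    if w > lam:
        return w - lam
    if w < -lam:
        return w + lam
    return 0.0

weights = [0.75, -0.05, 0.02, -1.25]
print([soft_threshold(w, 0.25) for w in weights])   # → [0.5, 0.0, 0.0, -1.0]
```

Features whose weights land exactly at zero are removed from the model, in line with the text above.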

Ontologies
The enormous amount of information available in the continuously expanding Web by far exceeds human processing capabilities. This gave rise to the question whether it is possible to build tools which automate information retrieval and knowledge extraction from the Web repository. The Semantic Web emerged as a proposed solution to this problem. In its essence, it is an extension to the Web in which content is represented in such a way that machines are able to process it and infer new knowledge from it. Its purpose is to alleviate the limitations of current knowledge engineering technology with respect to searching, extracting, maintaining, uncovering and viewing information, and to support advanced knowledge-based systems. Within the Semantic Web framework, information is organized in conceptual spaces according to its meaning. Automated tools search for inconsistencies and ensure content integrity. Keyword-based search is replaced by knowledge extraction through query answering.
In order to realise its vision, the Semantic Web does not rely on "exotic" intelligent technology, where agents are able to mimic humans in understanding the predominant HTML content. Rather, it approaches the problem from the Web page side. Specifically, it requires Web pages to contain informative (semantic) annotations about their content. These semantics (metadata) enable software to process information without the need to "understand" it. The eXtensible Markup Language (XML) was a first step towards this goal. Nowadays, the Resource Description Framework (RDF), RDF Schema (RDFS) and the Web Ontology Language (OWL) are the main technologies which drive the implementation of the Semantic Web.
In general, ontologies are the basic building blocks for inference techniques on the Semantic Web. As stated in W3C's OWL Requirements Documents [46]: "An ontology defines the terms used to describe and represent an area of knowledge." Ontological terms are concepts and properties which capture the knowledge of a domain area. Concepts are organised in a hierarchy which expresses the relationships among them by means of superclasses, representing higher-level concepts, and subclasses, representing specific (constrained) concepts. Properties are of two types: those that describe attributes (features) of the concepts, and those that introduce binary relations between the concepts.
In order to succeed in the goal to express knowledge in a machine-processable way, an ontology has to exhibit certain characteristics, namely abstractness, preciseness, explicitness, consensus and domain specificity. An ontology is abstract when it specifies knowledge in a conceptual way: instead of making statements about specific occurrences of individuals, it covers situations at a general level. Ontologies are expressed in a knowledge representation language which is grounded on formal semantics, i.e., it describes the knowledge rigorously and precisely. Such semantics do not refer to subjective intuitions, nor are they open to different interpretations. Furthermore, knowledge is stated explicitly. Notions which are not directly included in the ontology are not part of the conceptualization it captures. In addition, an ontology reflects a common understanding of domain concepts within a community. In this sense, a prerequisite of an ontology is the existence of social consensus. Finally, it targets a specific domain of interest. The more refined the scope of the domain, the more effective an ontology can be at capturing the details, rather than covering a broad range of related topics.
The most popular language for engineering ontologies is OWL [47]. OWL (and the latest OWL 2) defines constructs, namely classes, associated properties and utility properties, which can be used to create domain vocabularies, along with constructs for expressiveness (e.g., cardinalities, unions, intersections), thus enabling the modelling of complex and rich axioms. There are many tools available which support the engineering of OWL ontologies (e.g., Protégé, TopBraid Composer) and OWL-based reasoning (e.g., Pellet, HermiT). Ontology engineering is an active topic, and a growing number of fully developed domain and generic/upper ontologies are already publicly available, such as Dublin Core (DC) [48], Friend Of A Friend (FOAF) [49], the Gene Ontology (GO) [50] and Schema.org [51], to name a few. An extensive list of ontologies and ontology engineering methodologies has been recently published in Kotis et al. 2020 [52].
The Semantic Web is vast and combines many areas of research and technological advances. A comprehensive introduction can be found in [53] and [54]. The interested reader can find a detailed presentation of Semantic Web technologies in [55].

Ontology-based Feature Selection
The main research domain where ontologies have been employed to select features is text classification, namely the task of assigning predefined categories to free-text documents based on their content. The continuous increase in the volume of text documents on the Internet makes text classification an important tool for searching information. Due to their enormous scale in terms of the number of classes, training examples, features and feature dependencies, text classification applications present considerable research challenges. Elhadad et al. [56] use the WordNet [57] lexical taxonomy (as an ontology) to classify Web text documents based on their semantic similarities. In the first phase, a number of filters are applied to each document to extract an initial vector of terms, called a bag of words (BoW), which represents the document space. In particular, a Natural Language Processing Parser (NLPP) parses the text and extracts words in the form of tagged components (parts of speech), such as verbs, nouns, adjectives, etc. Words which contain symbolic characters, non-English words and words which can be found in pre-existing stopword lists are eliminated. Furthermore, in order to reduce redundancy, stemming algorithms are used to replace words having equivalent morphological forms with their common root. For example, the words "Fighting" and "Fights" are replaced with the stemmed word "Fight". In the second phase, all words in the initial BoW are examined for semantic similarities with categories in WordNet. Specifically, if a path exists, in the WordNet taxonomy, from a word to a WordNet category via a common parent (hypernym), then the word is retained; otherwise it is discarded. Once the final set of terms has been selected, the feature vector for each document is generated by assigning a weight to each term.
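As an illustration, the first-phase filtering can be sketched in plain Python; the stopword list and the suffix-stripping stemmer below are toy stand-ins for the NLPP parser and a full Porter-style stemmer:

```python
import re

# Toy stopword list; a real system would use a full pre-existing list.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "is", "are"}

def simple_stem(word):
    """Toy suffix-stripping stemmer (a stand-in for Porter stemming)."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def build_bow(text):
    """First-phase filtering: keep alphabetic tokens only, drop
    stopwords, then stem to merge morphological variants."""
    tokens = re.findall(r"[A-Za-z]+", text.lower())
    return sorted({simple_stem(t) for t in tokens if t not in STOPWORDS})
```

In the second phase, each retained term would additionally be checked for a hypernym path to a WordNet category before being weighted.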
The authors use the term frequency-inverse document frequency (TFIDF) statistical measure, since it computes the importance of a term t both in an individual document and in the whole training set. TFIDF is defined as TFIDF(t) = TF(t) × IDF(t), where

TF(t) = (number of occurrences of term t) / (total number of terms in the document)   (15)

and

IDF(t) = log((total number of documents) / (number of documents containing term t))   (16)

Effectively, terms which appear frequently in a document, but rarely in the overall corpus, are assigned larger weights. The authors compared against the Principal Component Analysis (PCA) method and report superior classification results. However, they recognize that a limitation of their approach is that important terms that are not included in WordNet will be excluded from the feature selection.

Vicient et al. [58] employ the Web to support feature extraction from raw text documents which describe an entity (symbolized with ae), according to a given ontology of interest. In the first step, the OpenNLP [59] parser analyzes the document and detects potential named entities (PNE) related to the ae, as noun phrases containing one or more words beginning with a capital letter. A modified Pointwise Mutual Information (PMI) measure is used to rank the PNE and identify those which are most relevant to the ae according to some threshold; in particular, for each pne_i ∈ PNE, probabilities are approximated by Web hit counts provided by a Web search engine. In the second step, a set of subsumer concepts (SC) is extracted from the retained named entities (NE). To do so, the text is scanned for instances of certain linguistic patterns which contain each ne_i ∈ NE. Each pattern is used in a Web query and the resulting Web snippets determine the subsumer concepts representing the ne_i. Next, the extracted SC are mapped to ontological classes (OC) from the input ontology. Initially, for each ne_i, all its potential subsumer concepts are directly matched to lexically similar ontological classes.
If no matches are found, then WordNet is used to expand the SC and direct matching is repeated. Specifically, the parents (hypernyms) in the WordNet hierarchy of each subsumer concept sc_i are added to SC. In order to determine which parent concepts are most relevant to the named entity ne_i, a search engine is queried for common appearances of the ae and the ne_i. The returned Web snippets are used to determine which parent synsets of sc_i are most related to ne_i. Synsets in WordNet are groupings of words from the same lexical category which are synonymous and express the same concept. Finally, a Web-based version of the PMI measure, defined as

SOCscore(soc_i, ne_i, ae) = hits(soc_i & ne_i & ae) / hits(soc_i & ae)   (18)

is used to rank each of the extracted ontological classes (soc_i) related to a named entity. The soc_i with the highest score which exceeds a threshold is used as annotation. The authors tested their method in the Tourism domain. For the evaluation, they compared precision (ratio of correct features to retrieved features) and recall (ratio of correct features to ideal features) against features manually selected by human experts. They report 70-75% precision and more than 50% recall, and argue that such results considerably reduce the human effort required to annotate textual resources.
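The final ranking step of Eq. (18) reduces to a ratio of hit counts. A minimal sketch follows, with the search-engine hit counts passed in as dictionaries; in the actual system they would come from live Web queries:

```python
def rank_ontological_classes(candidates, joint_hits, marginal_hits, threshold=0.5):
    """Choose the annotating class for a named entity using the score
    of Eq. (18): hits(soc & ne & ae) / hits(soc & ae). Hit counts are
    dictionaries keyed by candidate ontological class."""
    best, best_score = None, threshold
    for soc in candidates:
        if marginal_hits.get(soc, 0) == 0:
            continue  # avoid division by zero for unseen classes
        score = joint_hits.get(soc, 0) / marginal_hits[soc]
        if score > best_score:
            best, best_score = soc, score
    return best  # None if no candidate clears the threshold
```

The candidate class names and counts in any usage are illustrative; only the highest-scoring class above the threshold is kept as the annotation.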
Wang et al. [60] reduce the dimensionality of the text classification problem by determining an optimal set of concepts to identify document context (semantics). First, the document terms are mapped to concepts derived from a domain-specific ontology. For each set of documents of the same class, the extracted concepts are organized in a concept hierarchy. A hill-climbing algorithm [61] is used to search the hierarchy and derive an optimal set of concepts which represents the document class. They apply their method to the classification of medical documents and use the Unified Medical Language System (UMLS) [62] as the underlying domain-specific ontology. The UMLS query API is used to map document terms to concepts and to derive the concept hierarchy. For the hill-climbing heuristic, a frequency measure is assigned to each leaf concept node, while the weight of each parent node is the sum of its children's weights. Based on the assigned weights, a distance measure between two documents is derived and used to define the fitness function. Test documents undergo the same treatment and are classified based on the extracted optimal representative concepts. For their experiments the authors use a KNN classifier and report improved accuracy, but admit that an obvious limitation of their method is that it is only applicable in domains that have a fully developed ontology hierarchy.
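The bottom-up weighting that feeds the hill-climbing heuristic (frequencies at the leaves, sums at internal nodes) can be sketched as follows; the dictionary encoding of the hierarchy is an assumption for illustration:

```python
def propagate_weights(tree, leaf_weights):
    """Assign each leaf concept its frequency and each internal node
    the sum of its children's weights. `tree` maps an internal node
    to its list of children; nodes absent from `tree` are leaves."""
    nodes = set(tree) | {c for children in tree.values() for c in children}

    def weight(node):
        children = tree.get(node, [])
        if not children:
            return leaf_weights.get(node, 0.0)
        return sum(weight(c) for c in children)

    return {n: weight(n) for n in nodes}
```

The resulting node weights would then feed the document distance measure used as the fitness function.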
Khan et al. [63] obtain document vectors defined in a vector space model. This is accomplished in the following steps. First, after identifying all the words in the documents, they remove the stopwords from the word database, creating a BoW. Next, a stemming algorithm is applied to assign each word to its respective root word. Phrase frequency is estimated using a part-of-speech (POS) tagger. Next, they apply the maximal frequent sequence (MFS) method [64] to extract the most frequent terms. An MFS is a sequence of words that is frequent in a document collection and, moreover, is not contained in any other frequent sequence [64]. The final set of features is selected by examining similarities with ontology-based categories in WordNet [57] and applying a wrapper approach. Using the TFIDF statistical measure, weights are assigned to each term. Finally, the classifier is trained using the Naive Bayes algorithm.
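The TFIDF weighting of Eqs. (15)-(16), used here as well as in [56], can be computed directly from the definitions; a natural logarithm is assumed, since the choice of base only rescales the weights:

```python
import math

def tfidf(term, doc_tokens, corpus):
    """TFIDF per Eqs. (15)-(16): term frequency in one document
    scaled by the inverse document frequency over the corpus."""
    tf = doc_tokens.count(term) / len(doc_tokens)      # Eq. (15)
    docs_with_term = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / docs_with_term)       # Eq. (16)
    return tf * idf
```

A term occurring in every document gets IDF = log(1) = 0, so corpus-wide terms carry no weight, as intended.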
Abdollali et al. 2019 [65] also address feature selection in the context of classification of medical documents. In particular, they aim at distinguishing clinical notes which reference Coronary Artery Disease (CAD) from those that do not. Similarly to [60], they use a query tool (MetaMap) to map meaningful expressions in the training documents to concepts in UMLS. Since their target is CAD documents, they only keep the concepts "Disease or Syndrome" and "Sign or Symptom" and discard the rest. The retained concepts are assigned a TFIDF weight to form the feature vector matrix which will be used in the classification. In the second stage, the Particle Swarm Optimization [66] algorithm is used to select the optimal feature subset. Particle values are initialized randomly with numbers in [-1, 1], where a positive number indicates an active feature and a negative value an inactive one. The fitness function for each particle is based on the classification accuracy, i.e.:

Fitness(S) = (TP + TN) / (TP + TN + FP + FN)   (19)

where S represents the feature set, TP and FP are the numbers of correctly and incorrectly identified documents, and TN and FN the numbers of correctly and incorrectly rejected documents. 10-fold cross validation is used to compute a particle's fitness value as the average of the accuracies of ten classification runs. The authors evaluated their method using five classifiers (NP, LSVM, KNN, DT, LR) and reported both a significant reduction of the feature space and improved accuracy of the classification in most of their tests.

Lu et al. 2013 [67] attempt to predict the probability of hospital readmission within 30 days after a heart failure, by means of the medication list prescribed to patients during their initial hospitalization. In the first stage, the authors combine two publicly accessible drug ontologies, namely RxNorm [68] and NDF-RT [69], into a tree structure that represents the hierarchical relationship between drugs.
The RxNorm ontology serves as a drug thesaurus, while NDF-RT serves as a drug functionality knowledge base. The combined hierarchy consists of six class levels. The top three levels correspond to classes derived from the Legacy VA class list in NDF-RT and represent the therapeutic intention of drugs. The fourth level represents the general active ingredients of drugs. The fifth level refers to the dosage of drugs and uses a unique identifier to match drugs to the most representative class in RxNorm (RXCUI). The lowest level refers to the dose form of drugs and uses the local drug codes used by different hospitals. Each clinical drug corresponds to a single VA class, a single group of ingredients and a single RxNorm class. In the second stage, a top-down depth-first traversal of the tree hierarchy is used to select a subset of nodes as features. For each branch, the nodes are sorted according to the information gain ratio (IGR(F) = IG(F)/H(F)). The features in the ordered list are marked for selection one by one, while parent and child features with lower scores are removed from the list. In order to evaluate their method, the authors use the Naive Bayes classifier and employ the area under the receiver operating characteristic curve to evaluate its performance. Their experiments showed that the ontology-guided feature selection outperformed the other, non-ontology-based methods.
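The selection pass over the drug hierarchy can be sketched as a greedy variant: nodes are visited in descending score order and, once a node is selected, its ancestors and descendants (which score lower at that point) are discarded. The parent-map encoding is an assumption for illustration, and the scores stand in for the information gain ratios:

```python
def select_tree_features(scores, parent):
    """Greedy ontology-guided selection: visit nodes in descending
    information-gain-ratio order; once a node is selected, discard its
    ancestors and descendants. `parent` maps each node to its parent
    (the root maps to None)."""
    def ancestors(node):
        found = set()
        while parent.get(node) is not None:
            node = parent[node]
            found.add(node)
        return found

    selected, excluded = [], set()
    for node in sorted(scores, key=scores.get, reverse=True):
        if node in excluded:
            continue
        selected.append(node)
        excluded |= ancestors(node)                              # drop ancestors
        excluded |= {m for m in scores if node in ancestors(m)}  # drop descendants
    return selected
```

The effect is that each root-to-leaf path contributes at most one feature, keeping the most informative level of abstraction in the hierarchy.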
An important application of ontology-based feature selection algorithms is the selection of manufacturing processes. Mabkhot et al. [70] describe an ontology-based decision support system (DSS), which aims at assisting the selection of a suitable manufacturing process (MPS) for a new product. In essence, selected aspects of MPS are mapped to ontological concepts, which serve as features in rules used for case-based reasoning. Traditionally, MPS has relied on expert human knowledge to achieve the optimal matching between design specifications, material characteristics and process capabilities. However, due to the continuous evolution of material and manufacturing technologies and the increasing product complexity, this task becomes more and more challenging for humans. The proposed DSS consists of two components, namely the ontology and the case-based reasoning (CBR) subsystem. The purpose of the ontology is to encode all the knowledge related to manufacturing in a way which enables the reasoner to make a recommendation for a new product design. It consists of three main concepts: the manufacturing process (MfgProcess), the material (EngMaterial) and the product (EngProduct). The MfgProcess concept captures the knowledge about manufacturing in subconcepts, such as casting, molding, forming, machining, joining and rapid manufacturing.
The properties of each manufacturing process are expressed in terms of shape generation capabilities, which describe the product shape features a process can produce, and range capabilities, which express the product attributes that can be met by the process, such as dimensions, weight, quantity and material. The EngMaterial concept captures knowledge about materials, in terms of material type (e.g., metal, ceramic) and material process capability (e.g., sand casting materials). The EngProduct concept encodes knowledge about products, defined in the form of shape features and attributes. The ontology facilitates the construction of rules which relate manufacturing processes with engineering products, through the matching of product features and attributes with process characteristics and capabilities. The Semantic Web Rule Language (SWRL) [71] has been used as an effective method to represent causal relations. The purpose of the CBR subsystem is to find the optimal product-to-process matching. It does so in two steps. First, it scans the ontology for a similar product. To quantify product similarity, appropriate feature and attribute similarity measures have been developed, and human experts have been employed to assign proper weights to features and attributes. If a matching product is found, then the corresponding process is presented to the decision maker; otherwise, SWRL rule-based reasoning is used to find a suitable manufacturing process. Finally, the ontology is updated with the newly extracted knowledge. The authors presented a use case to demonstrate the usability and effectiveness of the proposed DSS and argue that in the future such systems will become more and more relevant.
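The product-similarity step of the CBR subsystem can be sketched as a weighted aggregate of per-feature similarities; the feature names, similarity functions and weights below are illustrative, not those of [70]:

```python
def product_similarity(p1, p2, weights, sim_fns):
    """Weighted aggregate of per-feature similarities between two
    products; in [70] the weights are assigned by human experts."""
    total = sum(weights.values())
    return sum(w * sim_fns[f](p1[f], p2[f]) for f, w in weights.items()) / total
```

A retrieval step would compute this score against every stored case and present the process of the best-matching product when the score is high enough.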
In [72], Kang et al. develop an ontology-based representation model to select appropriate machining processes, as well as the corresponding inference rules. The ontology is quantified in terms of features, process capability with relevant properties, machining processes, and relationships between concepts. A reasoning inference mechanism is applied to obtain the final set of processes for individual features. The final process with the highest contribution is determined by a procedure that matches the accuracy requirements of a specific feature with the capability of the candidate processes. The preceding machining process is then selected so that the precedence relationship constraint between the processes is met, until no further preceding processes are required. The whole process selection scheme is neutral (i.e., general enough) in the sense that it does not depend on a specific restriction, and thus it constitutes a reusable platform.
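The backward selection of preceding processes can be sketched as a walk over precedence constraints; the process names and the parent-style encoding are hypothetical, chosen only to illustrate the chaining step:

```python
def process_chain(final_process, precedes):
    """Resolve a machining sequence by walking precedence constraints
    backwards from the selected final process; `precedes` maps a
    process to its required predecessor (or None when none remains)."""
    chain = [final_process]
    while precedes.get(chain[0]) is not None:
        chain.insert(0, precedes[chain[0]])
    return chain
```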
Hat et al. [73] also apply ontology within the mechanical engineering domain, in particular the field of Noise, Vibration and Harshness (NVH). Similarly to the previous work, the authors map important aspects of noise identification to ontological concepts, which serve as features for reasoning. They propose an ontology-based system for identifying noise sources in agricultural machines. At the same time, their method provides an extensible framework for sharing knowledge for noise diagnosis. Essentially, they seek to encode prior knowledge relating noise sound signals (targets) with vibrational sound signals (sources) in an ontology, equipped with rules, and perform reasoning to identify noise sources based on the characteristics of test input and output sound signals (parotic noise). In order to build the ontology, professional experience, literature and standard specifications were first surveyed to extract the concepts related to NVH. The Protégé tool was used to convert the concept knowledge into an OWL ontology and implement the SWRL rules, which match sound source and parotic noise signals. The Pellet tool is employed for reasoning. To quantify the signal correlations, the time signals are converted to the frequency domain and the values of seven common signal characteristics are calculated: the relation of the parotic signal frequency to the ignition frequency, peak frequency, Pearson coefficient, frequency doubling, loudness, sharpness and roughness. The effectiveness of the method was demonstrated in a use case, where the prototype system correctly identified the main noise source; after improving the designated area, the noise was significantly reduced. The authors argue that the continuous improvement of the knowledge base and rule set of the ontology model has the potential to allow the system to perform reasoning that simulates the thinking process of an expert in the field of NVH.
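Of the seven signal characteristics, the Pearson coefficient is the most self-contained to illustrate; computed over two frequency-domain magnitude vectors it reads:

```python
def pearson(x, y):
    """Pearson correlation coefficient between two equal-length
    sequences (e.g., spectral magnitude vectors of a source signal
    and the parotic-noise signal)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)
```

A value near 1 indicates that a candidate vibrational source tracks the parotic noise closely, supporting the rule-based identification.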
Belgiu et al. 2014 [74] develop an ontological framework to classify buildings based on data acquired with Airborne Laser Scanning (ALS). They followed five steps. Initially, they preprocessed the ALS point cloud and applied the eCognition software to convert it to raster data, which were used to delineate buildings and remove irrelevant objects. Additionally, they obtained values for 13 building features grouped in four categories: extent features, which define the size of the building objects; shape features, which describe the complexity of building boundaries; and the height and slope of the buildings' roofs. In the next step, human expert knowledge and knowledge available in the literature were employed to define three general-purpose building ontology classes, independent of the application and the data at hand, namely A. Residential/Small Buildings, B. Apartment/Block Buildings and C. Industrial and Factory Buildings. In order to identify the metrics most relevant to the identification of building types, a set of 45 samples was used to train a Random Forest classifier with 500 trees and √m features (where m is the number of input features). The feature selection process identified slope, height, area and asymmetry as the most important features. The first three were modelled in the ontology with thresholds determined empirically by the RF classifier. Finally, building type classification was carried out based on the formed ontology. The classification accuracy was assessed by means of precision, recall and F-measure; the authors reported convincing results for class A, while classes B and C had less accurate results. However, they argue that their method can prove useful for classifying building types in urban areas.

Finally, two interesting applications of ontology-based feature selection algorithms concern the recommendation systems (RS) and the information security/privacy research areas. In [75], Di Noia et al. develop a filter-based feature selection algorithm by incorporating ontology-driven data summarization for a linked data (LD) based recommender system (RS). The feature selection mechanism determines the k most important features for evaluating the similarity between instances of a given class, on top of data summaries built with the help of an ontology. Two types of descriptors are employed, namely pattern frequency descriptors and cardinality descriptors. A pattern is defined as a schema using an RDF triple denoted as (C, P, D), where C and D are classes or datatypes, and P is a property that expresses their relationship; C is called the source type and D the target type. The patterns are used to generate data summarizations from a knowledge graph-based framework. Each pattern is associated with a frequency that corresponds to the number of relational assertions from which the pattern has been extracted. Therefore, a pattern frequency descriptor can be viewed as a set of statistical measures. A cardinality descriptor encodes information about the semantics of properties as used within specific patterns and can be used in computing the similarities between these patterns. To obtain the cardinality descriptors, the authors extended the above-mentioned knowledge graph framework. The LD and one or more ontologies are the inputs to the knowledge graph framework, while its outputs are a type graph, a set of patterns along with their respective frequencies, and the cardinality descriptors. The filtering-based feature selection then consists of two main steps. First, the cardinality descriptors are used to filter out features (i.e., pattern properties) that correspond to properties connecting one target type with many source types. Second, the pattern frequency descriptors are applied to rank all features in descending order of frequency and select the top k features.
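The two-step filter can be sketched as follows; each pattern is represented here by a dictionary with a boolean flag standing in for the cardinality descriptor, and the property names are illustrative:

```python
def select_pattern_features(patterns, k):
    """Two-step filter: drop patterns whose cardinality descriptor
    links one target type to many source types, then rank the rest
    by frequency and keep the top k properties."""
    kept = [p for p in patterns if not p["many_sources"]]
    kept.sort(key=lambda p: p["frequency"], reverse=True)
    return [p["property"] for p in kept[:k]]
```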
In [76], Guan et al. studied the problem of mapping security requirements (SRs) to security patterns (SPs). Viewing the SPs as features, feature selection is set up to perform the above mapping procedure. This selection is based on an ontology-based framework and a classification scheme. To accomplish this task, they describe the SRs using four attributes, namely asset (A), threat (T), security attribute (SA) and priority (P). The SRs are represented as rows of a two-dimensional matrix, whose columns correspond to the above attributes. The meaning of each SR is then: for a given asset A, one or more threats T may threaten A by violating one or more attribute values of SA. In addition, each SR is to be fulfilled in a sequence according to the value of P during software development. The authors generate complete and consistent SRs by eliciting values for the above attributes using the risk-based analysis proposed in [77]. On the other hand, security patterns are described in terms of three attributes, namely Context, which defines the conditions and situation in which the pattern is applicable; Problem, which defines the vulnerable aspect of an asset; and Solution, which defines the scheme that solves the security problem. To intertwine the above information, they developed a two-level ontological framework using an OWL-based security ontology. The first level concerns the ontology-based description of SRs and the second the ontology-based description of SPs. These descriptions were carried out mainly by quantifying the risk-relevant information and annotating security-related information. Finally, a classification scheme selects an appropriate set of SPs for each SR.
The classification scheme is developed by considering multiple aspects, such as the lifecycle stage, which organizes patterns according to the software development phase; the architectural layer, which organizes information from low to high abstraction levels; the application concept, which partitions the security patterns according to which part of the system they are trying to protect; and the threat type, which groups patterns by the security problems they solve.
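A hedged sketch of how an SR row of the attribute matrix might be matched to candidate SPs; the attribute encoding and the threat-coverage criterion are simplified illustrations, not the full ontology-based scheme of [76]:

```python
from dataclasses import dataclass

@dataclass
class SecurityRequirement:
    asset: str        # A
    threats: list     # T
    attributes: list  # SA (violated security attributes)
    priority: int     # P (fulfilment order during development)

def match_patterns(sr, patterns):
    """Return the security patterns whose Problem covers at least one
    of the SR's threats; `patterns` maps a pattern name to the set of
    threats it addresses."""
    return [name for name, solved in patterns.items()
            if set(sr.threats) & solved]
```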

Open Issues and Challenges
Features exhibit dependencies among each other and can therefore be structured as trees or graphs. Ontology-based feature selection in the era of knowledge graphs (KGs), such as DBpedia, Freebase and YAGO, is influenced by several issues. An important and open issue in this domain is the linking of one document-mentioned entity to a particular KG entity, and the way it affects how other surrounding document entities are linked. Furthermore, it is more and more common nowadays to see an increasing number of inter-task dependencies being modeled, where pairs of tasks such as a) Named-Entity Recognition (NER) and Entity Extraction and Linking (EEL), b) Word Sense Disambiguation (WSD) and EEL, or c) EEL and Relation Extraction and Linking (REL), are seen as interdependent. The combined treatment of those tasks will continue to exist and advance, since it has proven highly effective for the precision of the overall information/knowledge extraction process. Regarding the communities contributing to this area of research, related works have been conducted by the Semantic Web community as well as by others, such as the NLP, AI and DB communities. Works conducted by the NLP community focus more on unstructured input, while Database and Data Mining related works target semi-structured input [78]. As mentioned above, ontologies play a key role in feature selection. However, the engineering of ontologies, although it has been advancing at a fast pace over the last decade, has not yet reached the point where consensus in domain-specific communities delivers gold-standard ontologies for each case and application area. On the other hand, a number of issues and challenges related to the collaborative engineering of reused and live ontologies have been recently reported [52], indicating that this topic is still active and emerging.
For instance, as far as feature selection is concerned, different ontologies of the same domain used in the same knowledge extraction task will most probably result in different sets of selected features (schema bias). Furthermore, human bias in conceptualizing context during the process of engineering ontologies (in a top-down collaborative ontology engineering approach) will inevitably influence the feature selection tasks. Specifically, in the cases where large KGs (e.g., DBpedia) are used for knowledge extraction, such a bias is present at both the conceptual/schema (ontology) and the data (entity) levels. Debiasing KGs is a key challenge in the Semantic Web and KG community itself [79], and consequently in the domain of KG-based feature selection.
Important challenges arise when ontology-based feature selection is applied to linked data (LD). LD appear to be one of the main structural elements of big data; for example, data created in social media platforms are mainly LD. LD exhibit significant correlations across various types of links and therefore possess a more complex structure than traditional attribute-value data. However, they provide extra, yet valuable, information [80]. The challenges of using ontology-based feature selection with LD concern the development of ontology-based frameworks that exploit the complex relations between data samples and features in order to perform effective feature selection, and the evaluation of the relevance of features without the guidance of label information.
Another interesting research area is real-time feature selection. The main difficulty in dealing with real-time feature selection is that both data samples and new features must be taken into account simultaneously. Most of the methods that exist in the literature rely on feature pre-selection or on feature selection without online classification [81,82]. On the other hand, an ontology encoded in trees or knowledge graphs may provide some benefits, such as solid representations of the current relations between features, which can be used to predict possible relations between the currently available features and the ones that are expected to arise in real-time processing tasks. Therefore, achieving real-time analysis and prediction for high-dimensional datasets remains a challenge.

Finally, an important open issue to consider is scalability. Scalability quantifies the impact imposed by increasing the training data size on the computational performance of an algorithm in terms of accuracy and memory [80,82]. The basics of feature selection and classification were developed before the era of big data. Therefore, most feature selection algorithms do not scale well to extremely high-dimensional data; their efficiency deteriorates quickly or they even become computationally infeasible. On the other hand, scaling up favors the accuracy of the model. Therefore, there is a trade-off between finding an appropriate set of features and the model's accuracy. In this direction, the challenge is to define appropriate ontology-based relations between features in order to group them in such a way that the resulting set of features maintains acceptable model accuracy.

Conclusion
This study provides an overview of ontology-based classification with emphasis on the feature selection process. The presented methodologies show that ontologies can effectively uncover dominant features in diverse knowledge domains and can be integrated into existing feature selection and classification algorithms. Specifically, in the context of text classification, domain-specific ontologies, combined with the WordNet taxonomy, can be utilized to map terms in documents to concepts in the ontology, thus replacing specific term-based document features with abstract and generic concept-based features. The latter capture the content of the text and can be used to train accurate and efficient classifiers. In the field of mechanical engineering, ontologies can be employed to map human knowledge to concepts that serve as features for case-based reasoning and support decision making, such as the selection of a manufacturing process or noise source identification. Similarly, in the area of civil engineering, building type recognition can be facilitated by ontologies. Although this survey is by no means exhaustive, it demonstrates the broad applicability and feasibility of ontology-based feature extraction and selection. Finally, certain open issues and challenges are discussed and a number of relevant problems are identified.