Integration of Context Information through Probabilistic Ontological Knowledge into Image Classification

Abstract: The use of ontological knowledge to improve classification results is a promising line of research. The availability of a probabilistic ontology raises the possibility of combining the probabilities coming from the ontology with the ones produced by a multi-class classifier that detects particular objects in an image. This combination not only provides the relations existing between the different segments, but can also improve the classification accuracy. In fact, it is known that contextual information can often suggest the correct class. This paper proposes a possible model that implements this integration, and the experimental assessment shows the effectiveness of the integration, especially when the classifier's accuracy is relatively low. To assess the performance of the proposed model, we designed and implemented a simulated classifier whose accuracy can be set a priori with sufficient precision.


Introduction
The topic of this paper is the problem of recognising the content of a digital image. This is a particularly important problem due to the very large number of images now available on the Internet, and the need to produce an automatic description of the content of the images. This research topic has received increasing attention, as shown by the references in Section 2, and well-performing systems using deep networks have been proposed. In this paper, which extends what was presented in [1,2], we consider a method that exploits context information in the image to improve the performance of a classifier.
Classifiers for recognising the content of natural images are usually based on information extracted only from images and can be, in most general cases, prone to errors. The approach taken in this paper attempts to integrate some domain knowledge in the loop. The framework presented here aims to integrate the output of a classifier/detector, considered as the probability that a particular object is present in a definite part of the input image, with encoded domain knowledge. The most commonly used tools for encoding a priori information are standard ontologies; however, they fail when dealing with real-world uncertainty. For this reason, we preferred to include a Probabilistic Ontology (henceforth, PO) [3] in our framework, which associates probabilities with the coded information, and thus provides an adequate solution to the issue of coding the context information necessary to correctly understand the content of an image. Such information is then combined with the classifier output to correct possible classification errors on the basis of surrounding objects.
The aim of this work is to boost the performance of a system for the recognition/identification of classes of objects in natural images, introducing knowledge coming from the real world, expressed in terms of the probability of a set of spatial relations between the objects in the images, into the loop. A probabilistic ontology can be made available for the considered domain, but it could also be built or enriched by using entities and relations extracted from a document related to the image. For example, the picture could have been extracted from a technical report or a book where the text gives information that is related to the considered images. We wish to stress the fact that we are not thinking of a text that directly comments on or describes the image, but of a text which is completed and illustrated by the image. In this case, both classes of objects which can appear in the image and the relations connecting them could be mentioned in the text and could therefore be automatically extracted [4]. A probability can then be associated with them on the basis of the reliability of the extraction or the frequency of the item in the text.
The objective of the system discussed in this paper, which is depicted in Figure 1 and detailed in Section 3, is to obtain a set of keywords that can be used to describe the content of an image. The system takes an image as the input and produces a set of hypotheses on the presence of some objects in the image. Some of these hypotheses are likely to be wrong. As an example, let us consider the case of the reflection of a building on the water, beneath a boat; it is likely that a simple classifier will label that reflection as a building, while the boat can be labelled correctly. Our opinion is that the spatial relation between the two image segments, together with the external knowledge that an image segment beneath a boat and surrounded by water is more likely to be water than a building, can be used to correct the misclassification. This world knowledge, formalised in a probabilistic ontology, together with the output of the classifier, is fed to a probabilistic model [5], with the goal of improving the performance of the single classifiers. The framework described in this paper has two main aspects of novelty. The first one is that, to the best of our knowledge, a probabilistic ontology has never been proposed for a computer vision problem. The integration of a probabilistic model with a probabilistic ontology presents a second element of novelty.
This paper extends [1,2] by giving an experimental evaluation of the simulated classifier introduced in Section 4 and by experimentally justifying the choice of the parameters used in the proposed probabilistic model.
The paper is structured as follows. Related works are briefly reviewed in the next section. Our system is described in Section 3, and the experimental results are described in Section 4 and discussed in Section 5. Section 6 presents some final remarks.

Related Work
Human beings express their knowledge and communicate using natural language, and, in fact, they usually find it easy to describe the content of images with simple and concise sentences. Because of this human skill, it is not difficult for a human user, when using an image search engine, to formulate a query by means of natural language.
Due to the large number of images available on the web, to answer textual image queries, the capability to describe the content of an image automatically would be very helpful. However, such a task is not easy at all for a machine, as it requires a visual understanding of the scene. This means that almost every object in the image must be recognised, and how the objects relate to each other in the scene must be understood [6]. This task is tackled in two different ways. The most classical one [7][8][9][10] attempts to solve the single sub-problems separately and combines the solutions to obtain a description of an image. A different approach [6,11,12] proposes a framework that incorporates all of the sub-problems in a single joint model. A method that tries to merge the two main approaches was proposed recently in [13] using a semantic attention model. The problem is, however, very far from being solved.
In the context of textual image queries, it can be enough to extract a less complex description from the images (image annotation [14]), such as a list of entities represented in the image and information about their positions and mutual spatial relations in the image. The work proposed in this document addresses this task, which is also, as mentioned above, a necessary sub-task of the more general problem of generating a description in natural language.
The use of ontologies in the context of image recognition is not new [15]. For instance, in [16], a framework was proposed for ontology-based image retrieval for natural images, where a domain ontology was developed to model qualitative semantic image descriptions. An ontology of spatial relations was proposed in [17] in order to guide image interpretation and the recognition of the structures contained. In [18], low-level features, describing the colour, position, size, and shape of segmented regions, were extracted and mapped to descriptors; these descriptors were used to build a simple vocabulary termed object ontology. Probabilistic ontologies have been applied recently in various tasks. In [19], the authors described a PO which models a list of publications from the DBLP database; new interest for the authors was inferred using a Bayesian network. An activity recognition system integrating probabilistic inference with the represented domain ontology was introduced in [20]; this ontology-based activity recognition system is augmented with probabilistic reasoning through a Markov Logic Network. An infrastructure for probabilistic reasoning with ontologies based on a Markov logic engine was recently presented in [21], and applied to different tasks including activity recognition and root cause analysis. In [22], the authors proposed a scheme that uses a PO capable of detecting potential violations of contracts between on-demand Cloud service providers and customers, and alerts the provider when a violation is detected. A probabilistic semantic model that enables reasoning over uncertainty without losing semantic information is the basis for a system providing reminders to elderly people in their home environment while they perform their daily activities [23]. To the best of our knowledge, a probabilistic ontology has never been used for the task of image recognition and annotation.
Contextual information has been used in image recognition for a long time [24,25], and it has already been shown [26] that the use of spatial relations can decrease the response time and error rate, and that the presence of objects that have a unique interpretation improves the identification of ambiguous objects in the scene. Just to mention a few application domains, contextual information has been used for face recognition [27], medical image analysis [28], and analysis of group activity [29].
In the same way, the use of probabilistic models is not new in computer vision; in particular, a probabilistic model combining the statistics of local appearance and position of objects was proposed in [30] for the task of face recognition, and in [31] in an image retrieval task, showing that adding a probabilistic model in the loop can improve the recognition rate. In [32], a probabilistic semantic model was proposed in which the visual features and the textual words are connected via a hidden layer. More recently, in the context of 3D object recognition, a system that builds a probabilistic model for each object based on the distribution of its views was proposed in [33]. In [34], a hierarchical Bayesian network was introduced in a weakly supervised segmentation model; in particular, the system learns the semantic associations between sets of spatially neighbouring pixels, defined as the probability that these sets share the same semantic label. Finally, Ref. [35], in the context of action recognition, presented a generative model that allows the characterisation of joint distributions of regions of interest, local image features, and human actions.

Materials and Methods
The framework proposed in this paper (see Figure 1 for a graphical description) is a pipeline composed of several logical modules. The first module is a classifier, or a set of classifiers, the goal of which is detecting a set of predefined classes of objects in the image, therefore determining a set of regions of interest in the image. For each identified region of interest, the first module produces classifier scores for each of the classes of objects considered, computed in terms of probability, and the spatial relations between all the regions of interest.
The hypotheses formulated for each segment in the image by a statistical classifier are then fed to a probabilistic model, which has been trained offline. This module is the core of our framework and has the objective of validating or correcting the hypotheses produced by the first module. The probabilistic model integrates the output of the classifier with the world knowledge coded in a probabilistic ontology that is expressed in terms of the probability of a spatial relationship between instances of two classes of image objects.
The class associated with each segment together with the relations existing between segment pairs constitute the output of the system and can be interpreted as a basic description of the image.

Probabilistic Ontology
This section discusses how a fragment of PO providing the information needed by our system can be constructed. Such a fragment is needed for the experimental assessment of the system.
Ontologies cannot cope properly with uncertain information when dealing with real-world problems. To overcome this problem, in previous years, some tools have been designed to add probabilities to the information contained in the ontologies. Among the tools proposed, one of the most important is probably PrOWL [36]. The POs obtained are capable of encoding a priori knowledge for real-world applications.
As a consequence, the research area concerning POs is very active and we expect that a number of POs in different domains will be available soon. However, a PO in the domain of the dataset used in our assessment is necessary for the experiments. Therefore, we designed and implemented an ad hoc ontology. The scheme of the PO needed to contain the classes of objects and the spatial relations considered in our analysis. The probabilities were estimated from a training set of images where the objects had been manually labelled, and spatial relations were constructed between pairs of regions of interest. In more detail, we estimated the probability of two classes having a given relation by the frequency of the event in the dataset. No smoothing was applied.
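The frequency-based estimate described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `estimate_po`, the class labels, and the toy observations are hypothetical, and the sketch assumes that the counts are normalised per relation over all class pairs.

```python
from collections import Counter


def estimate_po(triples):
    """Estimate Pr(r, c1, c2) as a relative frequency, with no smoothing.

    `triples` is an iterable of (relation, class_1, class_2) observations.
    Assumption: counts are normalised per relation over all class pairs.
    """
    counts = Counter(triples)
    totals = Counter()
    for (relation, _c1, _c2), n in counts.items():
        totals[relation] += n
    return {key: n / totals[key[0]] for key, n in counts.items()}


# Hypothetical labelled pairs of regions (not taken from the dataset).
observations = [
    ("near", "chair", "table"),
    ("near", "chair", "table"),
    ("near", "table", "chair"),
    ("near", "bed", "lamp"),
    ("intersecting", "chair", "table"),
]
po = estimate_po(observations)
print(po[("near", "chair", "table")])          # 2 of the 4 "near" pairs
print(po.get(("near", "lamp", "chair"), 0.0))  # unseen triple: probability 0
```

Because no smoothing is applied, any triple that never occurs in the training data receives probability zero, which is why the lookup for an unseen triple falls back to 0.0.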
Formally, $D$ denotes the set of image segments used to compute the probabilities, $R = \{r_1, \ldots, r_i\}$ denotes the set of relations considered, and $C$ is the set of classes of objects. The probability that two segments of classes $c_1$ and $c_2$, respectively, satisfy the relation $r$ is estimated as

$$\Pr(r, c_1, c_2) = \frac{D_r(c_1, c_2)}{\sum_{c_x, c_y \in C} D_r(c_x, c_y)} \qquad (1)$$

where $D_r(c_x, c_y)$ is the number of times pairs of segments in $D$ of classes $c_x$ and $c_y$, respectively, satisfy the relation $r$. In general, $\Pr(r, c_1, c_2) \neq \Pr(r, c_2, c_1)$, since the relations are not necessarily symmetric. We are not aware of any available tool designed for constructing a PO directly; therefore, we used Protégé (freely available from http://protege.stanford.edu/) to formalise the schema of the ontology, and we used Pronto [37] as a reasoner for POs, because it adopts the standard OWL 1.1.
The schema developed with Protégé was imported into Pronto simply by adding the probabilities into the corresponding XML files.
An example is given in Figure 2, where the element tagged pronto:certainty is added to the axiom produced by Protégé. Although Pronto accepts probability ranges, simple values are used here, so the two extremes of the interval coincide (0.070990; 0.070990 in the example).

Combination Models
In this section, we discuss the probabilistic model adopted to integrate the classifiers with the ontological knowledge.
The role of PO in the task considered in this work requires that probabilities describing the domain of interest are integrated with the probabilities coming from the classifier associated with each class of objects for each input region of interest.
The main goal of the system discussed here is the identification of some classes of objects in the input image. Our aim is to exploit the spatial relations between pairs of identified objects to improve the classification. In more formal terms, in every image, a set $S$ of regions of interest is identified and there are a number of possible relations $R$ between pairs of regions. The classifier associates a probability distribution over all possible classes $C$ with each region. The classifier alone associates the region of interest with the class with the highest probability, that is, the image segment is classified by choosing the most probable class; this represents our baseline, as it only considers the classifier output without any information coming from the PO. However, the output of the classifier can be interpreted as a random variable $c(s)$ with values in the set of classes $C$. The remainder of this section reports the method used to integrate this random variable with the ontological probabilities.
Given any triplet $(c_1, c_2, r)$, where $c_1, c_2 \in C$ is any pair of classes and $r \in R$ is any possible relation, the probability $\Pr(r, c_1, c_2)$ that two image instances of classes $c_1$ and $c_2$, respectively, have a relation $r$ is given in Equation (1). Our claim is that the classification performance can improve by integrating this last element of information with the probabilities returned by the classifier. It is worth pointing out that the solution produced by this integration procedure is likely to be consistent with the knowledge given by the ontology. This is a very important feature for systems where post-processing requires a set of properties on the considered candidates. In fact, if the relation holding between two regions of interest is unlikely for the classes assigned by the classifier, the corresponding ontological probability is very low, lowering the probability of the corresponding pair of classes.
Given a context $x = (s_1, s_2, r : r(s_1, s_2))$, where $s_1$ and $s_2$ are the identified regions of interest and $r$ is a relation holding between $s_1$ and $s_2$, and two classes $c_1$ and $c_2$, we compute the following log-linear probability

$$\Pr(c_1, c_2 \mid x) = \frac{1}{Z_{x,c_1,c_2}} \exp\big( v_{c_1} f_C(s_1, c_1) + v_{c_2} f_C(s_2, c_2) + v_{r,c_1,c_2} f_{PO}(r, c_1, c_2) \big) \qquad (2)$$

where $f_C(s, c) = \Pr(c(s) = c)$ and $f_{PO}(r, c_1, c_2) = \Pr(r(c_1, c_2))$, while $Z_{x,c_1,c_2}$ is a normalisation factor that depends on $x$ and on the classes assigned to the two segments. Note that the features $f_C(\cdot)$ are produced by the classifier, while $f_{PO}(\cdot)$ depends on the probabilistic ontology. To summarise, Equation (2) takes two families of parameters into account: class parameters $v_c$ for each class $c$ and relation parameters $v_{r,c_1,c_2}$ for each type of relation $r$ and pair of classes $(c_1, c_2)$.

A training step is necessary for the estimation of the parameters defined above. The objective of the training is to maximise the likelihood of the training set. For the implementation of this optimisation procedure, we used the Toolkit for Advanced Optimisation (TAO) library (which is part of the PETSc library [38]), which includes a collection of optimisation algorithms for a variety of classes of problems (unconstrained, bound-constrained, and PDE-constrained minimisation, nonlinear least-squares, and complementarity). For this paper, we focused on unconstrained minimisation methods, which are very popular when minimising a function with many unconstrained variables. The method adopted was the Limited Memory Variable Metric, which is a quasi-Newton optimisation solver that solves the Newton step with an approximation factor composed using the BFGS update formula [39].
After the training step, once all the parameters $V = \{v_c, v_{r,c_i,c_j}\}$, with $c, c_i, c_j \in C$ and $r \in R$, have been estimated, the objective is to associate the correct class with each identified region of interest. Given a context $x$ and two candidate classes $c_1$ and $c_2$, we must assign a score expressing how well the two classes fit the context. To this aim, two different models were considered. The former, referred to as M1, takes the score to be the probability $\Pr(c_1, c_2 \mid x)$ given in Equation (2), while the latter (M2) takes the logarithm of $\Pr(c_1, c_2 \mid x)$ as the score. In fact, when adopting, as in our case, a log-linear expression, only considering exponents is much more efficient than directly summing probabilities. We then used two expressions, $sc_1$ and $sc_2$, corresponding to M1 and M2, respectively, given as

$$sc_1(c_1, c_2 \mid x) = \Pr(c_1, c_2 \mid x), \qquad sc_2(c_1, c_2 \mid x) = \log \Pr(c_1, c_2 \mid x).$$

After this, we needed to compute, for each context $x$, a score indicating how much a given class $c$ is associated with a region of interest $s$ in the context. This was done by summing up all the scores of the association of each class with each region, and the relations were assumed to include any of the possible relation types. We then associated with the first region the class which maximised such a score over all segment pairs including it. Formally, this is expressed as

$$SC(c \mid s) = \sum_{s' \in S} \; \sum_{r \in R \,:\, r(s, s')} \; \sum_{c' \in C} sc\big(c, c' \mid (s, s', r)\big),$$

where $sc$ stands for $sc_1$ or $sc_2$, according to the model adopted between M1 and M2. It is worth pointing out that all the relations considered in this work were symmetrical; then, for each existing context $x = (s_1, s_2, r : r(s_1, s_2))$, the symmetrical context $(s_2, s_1, r : r(s_2, s_1))$ was also defined, producing the same scores as the first one; therefore, only the first of the two cases needed to be considered when computing the scores. If asymmetrical relations come into play, the score expressions can be easily generalised. The last step is assigning to each detected region of interest the class which maximises the score, that is,

$$c^*(s) = \arg\max_{c \in C} SC(c \mid s).$$

This final output of the classifier, together with the relations between the regions of interest, can be used as a starting point for creating a simple textual description of the image.
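To make the scoring procedure of this section concrete, the following Python sketch combines classifier probabilities and ontological probabilities for one context and picks a class for each region. It is only an illustration under stated assumptions: the function names `pair_scores` and `classify`, the parameter values, and all the probabilities are hypothetical (loosely inspired by the boat and water example in the Introduction), and the normalisation factor is assumed to sum over all class pairs of the context.

```python
import math
from itertools import product


def pair_scores(x, classes, f_c, f_po, v_c, v_r):
    """Return {(c1, c2): (sc1, sc2)} for a context x = (s1, s2, r)."""
    s1, s2, r = x
    exponent = {}
    for c1, c2 in product(classes, repeat=2):
        exponent[(c1, c2)] = (
            v_c[c1] * f_c[s1][c1]
            + v_c[c2] * f_c[s2][c2]
            # Triples missing from the ontology get probability 0.
            + v_r[(r, c1, c2)] * f_po.get((r, c1, c2), 0.0)
        )
    # Assumption: Z normalises over all class pairs of this context.
    log_z = math.log(sum(math.exp(e) for e in exponent.values()))
    return {k: (math.exp(e - log_z), e - log_z) for k, e in exponent.items()}


def classify(regions, contexts, classes, f_c, f_po, v_c, v_r, model="M2"):
    """Sum pair scores per region and class, then take the arg max."""
    idx = 0 if model == "M1" else 1
    total = {s: {c: 0.0 for c in classes} for s in regions}
    for x in contexts:
        s1, s2, _r = x
        scores = pair_scores(x, classes, f_c, f_po, v_c, v_r)
        for (c1, c2), sc in scores.items():
            # Relations are symmetric, so one context serves both regions.
            total[s1][c1] += sc[idx]
            total[s2][c2] += sc[idx]
    return {s: max(classes, key=lambda c: total[s][c]) for s in regions}


# Hypothetical toy data: region "b" is a reflection misread as a building.
classes = ["boat", "water", "building"]
f_c = {
    "a": {"boat": 0.8, "water": 0.1, "building": 0.1},
    "b": {"boat": 0.1, "water": 0.4, "building": 0.5},
}
f_po = {
    ("near", "boat", "water"): 0.6, ("near", "water", "boat"): 0.6,
    ("near", "boat", "building"): 0.05, ("near", "building", "boat"): 0.05,
}
v_c = {c: 1.0 for c in classes}                                  # assumed
v_r = {("near", c1, c2): 5.0 for c1 in classes for c2 in classes}  # assumed
contexts = [("a", "b", "near")]

baseline = {s: max(f_c[s], key=f_c[s].get) for s in f_c}
print(baseline)
print(classify(["a", "b"], contexts, classes, f_c, f_po, v_c, v_r, "M1"))
print(classify(["a", "b"], contexts, classes, f_c, f_po, v_c, v_r, "M2"))
```

In this toy setting the baseline labels region "b" as a building, while both scoring variants relabel it as water because the ontology makes a boat near water far more likely than a boat near a building. In the paper the parameters are learnt by maximum likelihood; here they are fixed by hand purely for illustration.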

Results
This section describes the quantitative assessment of the performance of the proposed method detailed in Section 3.
The main objective of the experiments described in this section was to assess whether the model proposed really improves the performance of the classifier. To this end, we measured the classification performance of our model against the classifier's performance. The literature on object recognition is very rich [40], but in order to make this experiment as general as possible, we decided not to use an existing system for the detection/classification task, but preferred to design a simulated classifier for which we were able to set a desired accuracy. The use of a simulated classifier is not novel (see, for instance, [41]). In this way, it is possible to obtain an idea of the impact that the ontological information has on the performance, and to describe the dependence of the system performance on the classification accuracy.
What we needed, then, was a method to simulate the behaviour of a multi-class classifier with an assigned accuracy $a$. To this end, we designed a strategy that is detailed by the pseudo-code in Algorithm 1. In this strategy, given a region of interest in an image containing an object of a particular class $c$, a set of $n$ random scores is extracted from the interval $[0, 1]$, where $n$ is the number of classes. After this, with probability $a$, the maximum score is assigned to the gold class $c$, while the other scores are assigned to the other classes randomly. As a last step, the scores are normalised to get a probability distribution. In the end, since the simulated classifier assigns the class with the highest score to each image region of interest, the classifier has the desired accuracy $a$, as the highest score is assigned to the correct class with a probability of $a$.

The dataset selected for this experimental assessment is a subset of the MIT-Indoor dataset [42] where interesting objects have been manually segmented and labelled; this gives us a reliable ground truth for estimating the performance of our combination model. The dataset includes 1700 images that have been manually segmented. These images (see Figure 3 for a few samples) were taken at common indoor locations, such as kitchens, bedrooms, libraries, gyms, and so on.
The dataset is partitioned into three subsets: $S_{PO}$, $S_{CM}$ and $S_{Test}$. The first two, each containing 30% of the whole dataset, were used to train the probabilistic ontology ($S_{PO}$) and the combination model ($S_{CM}$) proposed in this work (see Section 3), and the remaining 40% went into the $S_{Test}$ subset that was used to assess the performance of the system. In our experiments, the three subsets were selected randomly at each run of the algorithm. Each run was repeated several times (see below for details) in order to avoid experiment bias due to lucky or unlucky data splits. From our point of view, it is particularly important that the probabilistic ontology and the combination model are trained on different data, as this is what is very likely to happen in real cases.

The dataset contains a large set of object classes, some of them with very few objects. In order to avoid the impact that small classes might have on the construction of the probabilistic ontology, we preferred to take only the six with the largest number of items: the adopted classes and the number of times they occur in the dataset are reported in Table 1.

The final information necessary for building the probabilistic ontology was the definition of the spatial relations between the objects. We considered three different relations corresponding to the relative positions of two regions of interest in the same image: near, very near and intersecting. All three relations were symmetrical. The near and very near relations between two regions of interest were computed considering

$$\frac{A(S_1) + A(S_2)}{A(CH_{S_1,S_2})} > \theta$$

where $S_1$ and $S_2$ are the two regions of interest, $CH_{S_1,S_2}$ is the convex hull [43] determined by the two regions $S_1$ and $S_2$, and $A(\cdot)$ is the area of the region passed as the argument; the parameter $\theta$ is set to 0.5 for near and to 0.8 for very near regions. The two relations are not exclusive, so two regions that are very near are also near.

The simulated classifier played an important role in our experimental evaluation, so we decided to verify how close the accuracy of the classifier was to the one requested. The results are shown in Table 2. The experiment was run using the test set, and, due to the randomness of the simulator, the results were averaged over 10 trials, with a target accuracy increasing from 20% to 70%. It is clear from the table that the simulated classifier performed as expected.

In Section 3.2, where the proposed model is explained, it is stated that the model depends, in our case, on 114 parameters: 6 for the classifier probabilities, one for each class, and 108 ($3 \cdot 6 \cdot 6$) for all possible relations between classes. This is the most general parametrisation of the model as it considers the maximal number of parameters. However, there are other possible parametrisations. For instance, we can consider the following alternatives: (i) one single parameter for the classifier (all six parameters are equal) and one for the relations (all 108 are equal), giving two parameters; (ii) six different parameters for the classifier and only one parameter for the ontological part of the model, giving seven parameters in total; (iii) 108 different parameters for the ontological part and a single parameter for the classifier, that is, only 108 parameters; (iv) 108 learnt parameters for the ontological part and a single parameter for the classifier, summing to 109 parameters.
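The near and very near relations defined earlier in this section can be illustrated with the following Python sketch. It rests on several assumptions that are not stated in the paper: regions are represented as simple polygons given by lists of vertex tuples, the test compares the summed area of the two regions with the area of their joint convex hull, and the comparison is strict. The function names `polygon_area`, `convex_hull`, `relations` and `square` are hypothetical.

```python
def polygon_area(points):
    """Area of a simple polygon by the shoelace formula."""
    n = len(points)
    s = 0.0
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]
        s += x1 * y2 - x2 * y1
    return abs(s) / 2.0


def convex_hull(points):
    """Andrew's monotone chain; hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def relations(region_1, region_2,
              thresholds=(("near", 0.5), ("very near", 0.8))):
    """Names of the relations satisfied by two polygonal regions.

    Assumption: the ratio (A(S1) + A(S2)) / A(CH) is compared with theta.
    """
    hull_area = polygon_area(convex_hull(region_1 + region_2))
    ratio = (polygon_area(region_1) + polygon_area(region_2)) / hull_area
    return [name for name, theta in thresholds if ratio > theta]


def square(x0):
    """Unit square whose lower-left corner is at (x0, 0)."""
    return [(x0, 0), (x0 + 1, 0), (x0 + 1, 1), (x0, 1)]


print(relations(square(0), square(1.2)))  # small gap between the squares
print(relations(square(0), square(2)))    # gap as wide as one square
print(relations(square(0), square(4)))    # squares far apart
```

With these thresholds, every pair that is very near is also near, in agreement with the remark that the two relations are not exclusive. The intersecting relation is not covered by this sketch.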
We tested the performance of the system using all four alternative parametrisations, and the results are shown in Table 3. The results of the models with two and seven parameters, presented in the first two rows of the table, were always below the classifier accuracy, except for the 20% case. The 108-parameter case returned accuracies showing a lack of regularity, which makes this parametrisation unreliable. The behaviour when using 109 parameters was similar to the cases with two and seven parameters. The full model, using 114 parameters, behaved much better, as shown in the next section, and is therefore the one we chose to use in our system.

Having ruled out the alternative parametrisations, we focused our attention on this model, for which we aimed to assess the improvement that could be obtained by introducing probabilistic ontological knowledge into the loop. To this end, we compared the system's performance with a baseline consisting of the (simulated) classifier alone. The two approaches discussed in Section 3.2, M1 and M2, were applied to combine the PO into the system. The results, shown in Figure 4 and discussed in the next section, were obtained by averaging over 20 runs.

Discussion
The system accuracy of the approaches proposed in this paper is depicted in Figure 4 and compared with the accuracy of the statistical classifier applied alone.
The experiment inspected the performance of our system by considering a wide range of accuracy levels for the simulated classifier; in fact, the values spanned from a lowest accuracy of 20% to an accuracy of 80%. It is clear from the graph in Figure 4 that the M2 model outperformed the M1 model, the performance of which slightly deteriorated as the classifier performance improved. A possible interpretation for this behaviour could be that too much confidence is given to the a priori score given by the probabilistic ontology with respect to the actual input data evidence.
On the other hand, the M2 model was shown to behave better than the classifier alone when the performance of the latter was no more than 55%, which is a realistic experimental condition. We can also point out that this model performed better than the classifier alone when the accuracy of the latter was below 30%. A low performance of the classifier can indicate that the task is not particularly easy. Even for classifiers obtaining an accuracy between 30% and 55%, the adoption of an approach integrating PO knowledge was shown to be advantageous.
It is also worth pointing out that model M2 showed an almost constant accuracy when the classifier accuracy was no more than 40%; then, it started increasing, but at a slower rate than the classifier. The graph of the M2 model starts getting steeper soon after the classifier alone returns a better performance, so that the two graphs are almost parallel. A possible explanation for this behaviour is that, unlike the M1 model, which seems to always give the same level of trust to the classifier, the M2 model is always trying to adapt to the classifier performance. This could suggest that a better ontology design, resulting in a better PO, could help the system to exceed the performance obtained by the classifier alone.

Conclusions
This paper proposes two probabilistic models for integrating probabilities coming from a probabilistic ontology, representing domain knowledge, with the probabilities produced by some sort of statistical classifier. The two models were experimentally evaluated and only one of them showed a level of performance that may encourage researchers to push forward its use in real systems.
In order to obtain a clear idea of the performance of the integration module, we removed the effects of most of the external factors. To this end, we conducted our experiments using images that had been manually segmented and labelled, and used a simulated classifier designed in such a way that we could control its accuracy. In future work, we plan to assess the performance of the proposed approach when coupled with state-of-the-art modules.
A prototype of a fragment of a probabilistic ontology was designed and populated using three binary relations which can be automatically detected in input images. The probabilities corresponding to each relation were estimated from their frequencies in the ontology training set. When more sophisticated ontologies containing information from large datasets are available, we expect the integration to give even better results.

Figure 1. Scheme of the proposed framework.

Figure 2. Piece of the XML of the Probabilistic Ontology (PO) corresponding to an axiom with an associated probability.
In Equation (2), two families of parameters are taken into account: class parameters $v_c$ for each class $c$ and relation parameters $v_{r,c_1,c_2}$ for each type of relation $r$ and pair of classes $(c_1, c_2)$. In total, the model has $|C|$ class parameters and $|R||C|^2$ relation parameters.

Figure 3. Sample images from the dataset [42] used in the experiments.

Figure 4. Performance of the two systems compared with the baseline. The error bars give the 95% confidence intervals. Results are the average of 20 runs.

Algorithm 1. Pseudo-code for the simulated classifier.
Input: A set of classes ClassSet; A ground truth GoldClass; A target accuracy DesiredAccuracy
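Since only the input line of Algorithm 1 is available here, the strategy described in Section 4 can be sketched in Python as follows. This is a hedged reading of the prose, not the authors' pseudo-code: the function name `simulated_classifier` and the class labels are hypothetical, and the sketch assumes that, when the classifier is made to miss, the top score is given to a randomly chosen wrong class.

```python
import random


def simulated_classifier(class_set, gold_class, desired_accuracy, rng=random):
    """Return a score distribution whose arg max is gold_class with
    probability desired_accuracy (needs at least two classes)."""
    # One random score per class, sorted so that index 0 is the maximum.
    scores = sorted((rng.random() for _ in class_set), reverse=True)
    others = [c for c in class_set if c != gold_class]
    rng.shuffle(others)
    if rng.random() < desired_accuracy:
        # Hit: the gold class receives the maximum score.
        order = [gold_class] + others
    else:
        # Assumption: on a miss the gold class never takes the top score.
        order = others[:]
        order.insert(rng.randrange(1, len(class_set)), gold_class)
    total = sum(scores)
    # Normalise the scores into a probability distribution.
    return {c: s / total for c, s in zip(order, scores)}


# Hypothetical class labels; six classes as in the experiments.
classes = ["c1", "c2", "c3", "c4", "c5", "c6"]
rng = random.Random(0)
trials = 20000
hits = sum(
    max(dist, key=dist.get) == "c1"
    for dist in (simulated_classifier(classes, "c1", 0.4, rng)
                 for _ in range(trials))
)
print(hits / trials)  # empirical accuracy, expected to be close to 0.4
```

Forcing a wrong class on a miss is what keeps the empirical accuracy at the requested value; if the scores were instead assigned uniformly at random on a miss, the gold class would still win in a fraction of those cases and the accuracy would exceed the target.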

Table 2. Accuracy of the simulated classifier. The accuracies of the simulated classifier were averaged over 10 trials.

Table 3. Different parametrisations for the combination model. Performance was measured on single runs with varied classifier accuracy levels (top row).