A General Model for Side Information in Neural Networks

We investigate the utility of side information in the context of machine learning and, in particular, in supervised neural networks. Side information can be viewed as expert knowledge, additional to the input, that may come from a knowledge base. Unlike other approaches, our formalism can be used by a machine learning algorithm not only during training but also during testing. Moreover, the proposed approach is flexible, as it caters for different formats of side information, and we do not constrain the side information to be fed into the input layer of the network. A formalism is presented based on the difference between the neural network loss without and with side information, stating that side information is useful when adding it reduces the loss during the test phase. As a proof of concept, we provide experimental results for two datasets: the MNIST dataset of handwritten digits and the House Price prediction dataset. For the experiments we used feedforward neural networks containing two hidden layers, as well as a softmax output layer. For both datasets, side information is shown to be useful in that it significantly improves the classification accuracy.


Introduction
The use of side information in a supervised machine learning (ML) setting, proposed in [1], was referred to as privileged information, and was further developed in [2], which extended the paradigm to unsupervised learning. As illustrated in [3], privileged information refers to data which belong neither to the input space nor to the output space but are still beneficial for learning. In general, side information may, for example, be knowledge given by a domain expert.
In our work, side information comes separate to the training and test data, but unlike other approaches such as the one described above, it can be used by the ML algorithm not only during training but also during testing. In particular, we concentrate on supervised neural networks (NNs) and do not constrain the side information to be fed into the input layer of the network, which constitutes a non-trivial extension of the paradigm. To distinguish between the two approaches, we refer to the additional information used in the previous approaches, when the information is only visible during the training phase, as privileged information, and in our approach, when the information is visible during both the train and test phases, simply as side information. In our model, side information is fed into the NN from a knowledge base, which inspects the current input data and searches for relevant side information in its memory. Two examples of the potential use of side information in ML are the use of azimuth angles as side information for images of 3D chairs [4], and the use of dictionaries as side information for extracting medical concepts from social media [5].
Here, we present an architecture and formalism for side information, and also a proof of concept of our model on a couple of datasets to demonstrate how it may be useful in the sense of improving the performance of an NN. We note that side information is not necessarily useful in all cases, and, in fact, it could be malicious, causing the performance of an NN to decrease.
Our proposal for incorporating side information is relatively simple when side information is available, and, within the proposed formalism, it is straightforward to measure its contribution to learning. It is also general in terms of the choice of neural network architecture that is used, and able to cater for different formats of side information.
To reiterate, the main research question we address here is: when side information comes separate to the training and test data, what is the effect, on the ML algorithm, of making use of such side information, not only during training but also during testing? The rest of the paper is organised as follows: In Section 2, we review some related work, apart from the privileged information already mentioned above. In Section 3, we present a neural network architecture that includes side information, while in Section 4 we present our formalism for side information based on the difference between the loss of the NN without and with side information, where side information is deemed to be useful if the loss decreases. In Section 5, we present the materials and methods used for implementing the proof of concept for side information based on two datasets, while in Section 6 we present the results and discuss them. Finally, in Section 7, we present our concluding remarks.

Related Work
We now briefly mention some recent related research. In [6], a language model is augmented with a very large external memory, and uses a nearest-neighbour retrieval model to search for similar text in this memory to enhance the training data. The retrieved information can be used during the training or fine-tuning of neural language models. This work focuses on enhancing language models, and, as such, does not address the general problem of utilising side information.
In a broad sense, our proposal fits into the informed ML paradigm with prior information, where prior knowledge is independent from the input data, has a formal representation, and is integrated into the ML algorithm [7]. Moreover, human knowledge may be input into an ML algorithm in different formats, through examples, which may be quantitative or qualitative [8]. In addition, there are different ways of incorporating the knowledge into an ML algorithm, say an NN, for example, by modifying the input, the loss function, and/or the network architecture [9]. Another trend is that of incorporating physical knowledge into deep learning [10], and more specifically encoding differential equations in NNs, leading to physics-informed NNs [11].
Our method of using probabilistic facts, described in Section 3, is a straightforward mechanism for extracting the information that is needed from the knowledge base in order to integrate it into the neural network architecture in a flexible manner. Moreover, our proposal allows feeding side information into any layer of the NN, not just into the input layer, which is, to the best of our knowledge, novel.
Somewhat related is also neuro-symbolic artificial intelligence (AI), where neural and symbolic reasoning components are tightly integrated via neuro-symbolic rules; see [12] for a comprehensive survey of the field, and [13], which looks at combining ML with semantic web components. While NNs are able to learn non-linear relationships from complex data, symbolic systems can provide a logical reasoning capability and structured representation of expert knowledge [14]. Thus, in a broad sense, the symbolic component may provide an interface for feeding side information into an NN. In addition, a neuro-symbolic framework was introduced in [15] that supports querying, learning, and reasoning about data and knowledge.
We point out that, as such, our formalism for incorporating side information into an NN does not attempt to couple symbolic knowledge reasoning with an NN as does neuro-symbolic AI [12,15]. The knowledge base, defined in Section 3, acts as a repository, which we find convenient to describe in terms of a collection of facts, providing a link, albeit weak, to neuro-symbolic AI. For example, in Section 5, in the context of the well-known MNIST dataset [16], we use expert knowledge on whether an image of a digit is peculiar as side information for the dataset, and inject this information directly into the NN, noting that we do not insist the side information exists for all data items.

A General Method for Incorporating Side Information
To simplify the model under discussion, let us assume a generic neural network (NN) with input and output layers and some hidden layers in-between. We use the following terminology: x_t is the input to the NN at time step t, y_t is the true target value (or simply the target value), ŷ_t is the estimated target value (or the NN's output value), and s_t is the side information, defined more precisely below.
The side information comes from a knowledge base, KB, containing both facts (probabilistic statements) and rules (if-then statements). To start off, herein, we will restrict ourselves to side information in the form of facts represented as triples, (p, z, c), with p being the probability that the instance z is a member of class c, and where a class may be viewed as a collection of facts. In other words, the triple (p, z, c) can be written as p = P(z|c), which is the likelihood of z given c. We note that facts may not necessarily be atomic units, and that side information such as s_t is generally viewed as a conjunction of one or more facts.
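As a minimal sketch of this representation (the field and class names here are illustrative, not prescribed by the formalism), a fact triple (p, z, c) and a conjunction of facts might be encoded as:

```python
from dataclasses import dataclass

# A knowledge-base fact (p, z, c), where p = P(z | c) is the
# likelihood of instance z given class c.
@dataclass(frozen=True)
class Fact:
    p: float  # probability P(z | c)
    z: str    # instance identifier (text, image id, ECG segment id, ...)
    c: str    # class name

# Side information s_t is a conjunction of one or more facts.
s_t = [
    Fact(p=0.9, z="headache", c="symptom"),
    Fact(p=0.4, z="image_7", c="green"),
]

# Every fact carries a valid probability.
assert all(0.0 <= f.p <= 1.0 for f in s_t)
```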
Let us examine some examples of side information facts, demonstrating that the knowledge base KB can, in principle, contain very general information: (i) (p, headache, symptom) is the probability of a headache being a symptom. Here, headache is a textual instance and the likelihood is P(headache|symptom). (ii) (p, image_i, green) is the probability that the background of image_i is green; in this case the probability could denote the proportion of the background of the image. Here, image_i is an image instance and the likelihood is P(image_i|green). (iii) (p, ecg_i, irregular) is the probability that ecg_i is irregular. Here, ecg_i is a segment of an ECG time series instance and the likelihood is P(ecg_i|irregular).
The architecture of the neural network incorporating side information is shown in Figure 1. We assume that the NN operates in discrete time steps, and that at the current time step, t, instance x_t is fed into the network and ŷ_t is its output. As stated above, we will concentrate on a knowledge base of facts, and detail what preprocessing may occur to incorporate facts into the network as side information. It is important that the preprocessing step at time t can only access the side information and the input presented to the network at that time, i.e., it cannot look into the past or the future. It is, of course, sensible to assume that one cannot access inputs that will be presented to the network in the future, at times greater than t; regarding the past, at times less than t, the reasoning is that this information is already encoded in the network. Moreover, the learner may not be allowed to look into the past due to privacy considerations. On a high level, this is similar to continual learning (CL) frameworks, where a CL model is updated at each time step using only the new data and the old model, without access to data from previous tasks due to speed, security, and privacy constraints [17].
As noted in the caption of Figure 1, side information may be connected to the rest of the network in several ways through additional nodes. The decision of whether, for example, to connect the side information to the first or second hidden layers and/or also to the output node can be made on the basis of experimental results. We emphasise that in our model, side information is available in both the training and test phases. Let x_t be the input instance to the network at time step t and s_t = KB(x_t, C) represent the side information at time t, where C is a set of constraints which may be empty. To clarify how side information is incorporated into an NN, we give, in Algorithm 1, high-level pseudo-code corresponding to the architecture for side information, as shown in Figure 1.

Algorithm 1
Training or testing with side information
1: x_t is the input to the network at time step t during the train or test phase
2: λ ← layer of the network that the side information is fed into
3: C ← the set of constraints on the knowledge base
4: for time step t = 1, 2, . . ., n do
5:     x_t is fed as input to the network
6:     s_t ← KB_C(x_t)
7:     an encoding of s_t is fed into the network at layer λ
8:     ŷ_t ← output of the network
9: end for

Figure 1. Architecture of a neural network with side information. The arrow from additional side information nodes to the hidden layers indicates that there are weighted links between them. The architecture is instantiated once it is specified which links from side information nodes to the network are active, and furthermore, we do not discount the possibility of having active links directly from side information nodes to the output node.
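One pass of this loop can be sketched as a NumPy forward computation. The layer sizes, tanh activations, and random weights below are illustrative choices, not the paper's configuration; the only structural point being shown is that the side-information values enter at hidden layer λ (here λ = 1) through their own weighted links:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions; m is the number of side-information values.
n_in, n_h1, n_h2, n_out, m = 4, 8, 6, 3, 2

# Randomly initialised weights; Ws holds the weighted links from the
# additional side-information nodes into the first hidden layer.
W1 = rng.normal(size=(n_h1, n_in))
Ws = rng.normal(size=(n_h1, m))
W2 = rng.normal(size=(n_h2, n_h1))
W3 = rng.normal(size=(n_out, n_h2))

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def forward(x_t, s_t):
    # Step 7 of Algorithm 1: the encoding of s_t enters at layer lambda = 1.
    h1 = np.tanh(W1 @ x_t + Ws @ s_t)
    h2 = np.tanh(W2 @ h1)
    return softmax(W3 @ h2)

x_t = rng.normal(size=n_in)
s_t = np.array([0.75, 0.10])   # probabilities p_tj extracted from the KB
y_hat = forward(x_t, s_t)
assert abs(float(y_hat.sum()) - 1.0) < 1e-9
```

Feeding s_t into the second hidden layer or the output node instead would simply move the `Ws @ s_t` term to the corresponding pre-activation.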
We stipulate that KB(x_t, C) = KB_C(x_t), i.e., we can either first compute the result of applying KB to x_t, resulting in KB(x_t), and then constrain KB(x_t) to C, or, alternatively, we can first constrain KB to C, resulting in KB_C, and then apply KB_C to x_t. For this reason, from now on, we prefer to write KB_C(x_t) instead of KB(x_t, C), noting that when C is empty, i.e., C = ∅, then KB_C = KB_∅ = KB. We leave the internal working of KB_C(x_t) unspecified, although we note that it is important that the computational cost of KB_C(x_t) is low.
The set of constraints C may be application-dependent, for example, on whether x_t is text- or image-based. Moreover, for a particular dataset we may constrain knowledge-based queries to a small number of one or more classes. To give a more concrete example, if x_t = image_i, i.e., the input to the network is an image, and we have constrained the knowledge base to the class green, then its output will be of the form s_t = (p, image_i, green). In this case, the knowledge base matches the input image precisely and returns in s_t its probability of being green. In the more general case, the knowledge base will only match part of the input.
More specifically, we let s_t = {s_t1, s_t2, . . ., s_tm} be the result of applying the knowledge base KB_C to the input x_t, where each s_tj, for j = 1, 2, . . ., m, encodes a fact (p_tj, z_tj, c_tj), so that the probability, p_tj = P(z_tj|c_tj), for j = 1, 2, . . ., m, will be fed to the network as side information. From a practical perspective, it is effective to set a non-negative threshold θ such that s_tj = (p_tj, z_tj, c_tj) is included in s_t only when p_tj > θ.
We also assume for convenience that the number of facts returned by s_t = KB_C(x_t) that are fed into the network as side information is fixed at a maximum, say m, and that they all pass the threshold, i.e., p_tj > θ. Now, if there are more than m facts that pass the threshold, then we only consider the top m facts with the highest probabilities, where ties are broken randomly. On the other hand, if there are fewer than m facts that pass the threshold, say k < m, then we consider the values for the m − k facts that did not pass the threshold as missing. A simple solution to deal with the missing values is to consider their imputation [18]; in particular, we suggest using constant imputation [19], replacing the missing values with a constant such as the mean or the median for the class, which will not adversely affect the output of the network. We observe that if k = 0 for all inputs x_t in the test data, but not necessarily in the training data, then side information is reduced to privileged information. Now, assuming that all the side information is fed directly into the first hidden layer, as is the input data, it can be shown that using side information is equivalent to extending the input data by m values through a preprocessing step, and then training and testing the network on the extended input. When side information is fed into other layers of the network, this equivalence breaks down.
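The thresholding and imputation steps above can be sketched as follows. This is a simplified illustration: for the imputation constant we reuse the class of the highest-ranked surviving fact, whereas in practice the mean or median would be taken per class of the missing fact, as the text suggests:

```python
import random

def select_side_info(facts, m, theta, class_means):
    """Keep at most m fact probabilities with p > theta; impute the rest.

    facts:       list of (p, z, c) triples returned by KB_C(x_t)
    class_means: per-class constant used to impute missing values
    """
    passing = [f for f in facts if f[0] > theta]
    random.shuffle(passing)                        # ties broken randomly
    passing.sort(key=lambda f: f[0], reverse=True)  # stable sort keeps shuffle for ties
    top = passing[:m]                               # top m facts by probability
    values = [p for (p, z, c) in top]
    # Fewer than k < m facts passed: treat the m - k remaining values as
    # missing and fill them with a class constant (mean/median).
    while len(values) < m:
        values.append(class_means.get(top[0][2] if top else None, 0.5))
    return values

facts = [(0.9, "img", "green"), (0.2, "img", "blue"), (0.7, "img", "round")]
v = select_side_info(facts, m=3, theta=0.5, class_means={"green": 0.5})
assert v == [0.9, 0.7, 0.5]   # two facts pass; one value is imputed
```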

A Formalism for Side Information
Our formalism for assessing the benefits of side information is based on quantifying the difference between the loss of the neural network without and with side information; see (1) below. Additionally, we provide a detailed analysis for two of the most common loss functions, squared loss and cross-entropy loss [20]; see (2) and (3), respectively, below.
Moreover, in (5) we present the notion of side information being individually useful at time step t: the smallest non-negative δ such that the probability that the difference in loss at t, between side information being absent and present, is at least a given non-negative margin, ε, is greater than or equal to 1 − δ. When side information is individually useful for all time steps, it is said to be globally useful. Furthermore, in (6) we modify (5) to allow for the concept of side information being useful on average.
To clarify the notation, we assume that x_t is the input to the NN at time t, y_t is the target value given input x_t, ŷ_t is the estimated target value given x_t, i.e., the output of the NN when no side information is available for x_t, and ŷ^s_t is the estimated target value given x_t and side information s_t, i.e., the output of the NN when side information s_t is present for x_t.
To formalise the contribution of the side information, let L(y_t, ŷ_t) be the NN loss on input x_t and L(y_t, ŷ^s_t) be the NN loss on input x_t with side information s_t. The difference between the loss from ŷ_t and the loss from ŷ^s_t, where 0 ≤ y_t ≤ 1 and 0 < ŷ_t, ŷ^s_t < 1, is given by

L(y_t, ŷ_t) − L(y_t, ŷ^s_t). (1)
Now, assuming squared loss for a regression problem, we obtain

L(y_t, ŷ_t) − L(y_t, ŷ^s_t) = (y_t − ŷ_t)² − (y_t − ŷ^s_t)² = (ŷ_t − ŷ^s_t)(ŷ_t + ŷ^s_t − 2y_t), (2)

which has a positive solution when ŷ_t < ŷ^s_t < 2y_t − ŷ_t, i.e., both terms in the last equation are negative, or ŷ_t > ŷ^s_t > 2y_t − ŷ_t, i.e., both these terms are positive.
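A quick numerical check of the squared-loss difference and its factored form (the specific values of y_t, ŷ_t, and ŷ^s_t below are arbitrary illustrations satisfying the first condition):

```python
def sq_loss_diff(y, yhat, yhat_s):
    # L(y, yhat) - L(y, yhat_s) under squared loss, as in (2)
    return (y - yhat) ** 2 - (y - yhat_s) ** 2

# Choose values with yhat < yhat_s < 2*y - yhat, so both factors
# (yhat - yhat_s) and (yhat + yhat_s - 2*y) are negative.
y, yhat, yhat_s = 1.0, 0.6, 0.8

d = sq_loss_diff(y, yhat, yhat_s)
assert abs(d - (yhat - yhat_s) * (yhat + yhat_s - 2 * y)) < 1e-12
assert d > 0   # side information reduced the loss
```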
If we restrict ourselves to binary classifiers, then squared loss reduces to the well-known Brier score [21]. Inspecting (2) for this case, when y_t = 1 we have the constraint ŷ_t < ŷ^s_t, ensuring both terms are negative. Correspondingly, when y_t = 0, we have the constraint ŷ_t > ŷ^s_t, ensuring that both terms are positive; note that in this case, ŷ^s_t < ŷ_t is intuitive, since the classifier returns the probability of y_t being 1, i.e., estimating the chance of a positive outcome.
On the other hand, assuming cross-entropy loss for a classification problem, we obtain

L(y_t, ŷ_t) − L(y_t, ŷ^s_t) = −y_t log(ŷ_t) + y_t log(ŷ^s_t) = y_t log(ŷ^s_t / ŷ_t), (3)

which has a positive solution when ŷ_t < ŷ^s_t.
In this case, if we restrict ourselves to binary classifiers, then cross-entropy loss reduces to the well-known logarithmic score [21]. Here, we need to note that a positive solution always exists, since the binary cross-entropy loss is given by

L(y_t, ŷ_t) = −y_t log(ŷ_t) − (1 − y_t) log(1 − ŷ_t), (4)

when the estimated target value is ŷ_t, and similarly replacing ŷ_t with ŷ^s_t when the estimated target value is ŷ^s_t. Now, we would like the probability, 1 − δ, for 0 ≤ δ ≤ 1 and a given margin, ε ≥ 0, for the difference between L(y_t, ŷ_t) and L(y_t, ŷ^s_t) to be minimised; that is, we wish to evaluate

arg min_δ P(L(y_t, ŷ_t) − L(y_t, ŷ^s_t) ≥ ε) ≥ 1 − δ. (5)

The probability in (5) can be estimated by parallel or repeated applications of the train and test cycle for the NN without and with side information, where the losses L(y_t, ŷ_t) and L(y_t, ŷ^s_t) are recorded for every individual run in order to minimise over δ for a given ε. Each time the NN is trained, its parameters are initialised randomly and the input is presented to the network in a random order, as was done in [22]; other methods of randomly initialising the network are also possible [23].
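The empirical estimation of (5) can be sketched as follows. The paired losses below are hypothetical values standing in for the per-run losses recorded over repeated train/test cycles; the estimator simply takes δ as one minus the empirical frequency with which the loss difference clears the margin ε:

```python
import numpy as np

def estimate_delta(loss_without, loss_with, eps):
    """Estimate the smallest delta such that
       P(L(y, yhat) - L(y, yhat_s) >= eps) >= 1 - delta,
    from paired losses recorded over repeated train/test runs."""
    diffs = np.asarray(loss_without) - np.asarray(loss_with)
    p_hat = np.mean(diffs >= eps)   # empirical probability of clearing the margin
    return 1.0 - p_hat              # smallest delta satisfying the bound

# Hypothetical losses from 10 paired runs of the NN without / with side info.
without = [0.40, 0.38, 0.42, 0.39, 0.41, 0.40, 0.43, 0.37, 0.40, 0.39]
with_si = [0.35, 0.36, 0.36, 0.34, 0.37, 0.35, 0.38, 0.36, 0.35, 0.34]

delta = estimate_delta(without, with_si, eps=0.03)
assert 0.0 <= delta <= 1.0
```

The same estimator serves for usefulness on average by first averaging the losses within each run before differencing.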
When (5) is satisfied, we say that side information is individually useful for t, where t ∈ {1, 2, . . ., n}. In the case when side information is individually useful for all t ∈ {1, 2, . . ., n}, we say that side information is (strictly) globally useful. We define side information to be useful on average if the following modification of (5) is satisfied:

arg min_δ P( (1/n) Σ_{t=1}^{n} (L(y_t, ŷ_t) − L(y_t, ŷ^s_t)) ≥ ε ) ≥ 1 − δ, (6)

noting that we can estimate (6) in a similar way to estimating (5), with the difference that, in order to minimise over δ for a given ε, the average loss is calculated for each run of the NN.
In the case where the greater or equal operator (≥) in the inequality L(y_t, ŷ_t) − L(y_t, ŷ^s_t) ≥ ε, contained in (5), is substituted with equality to zero (= 0) or is within a small margin thereof (i.e., ε is close to 0), we may say that side information is neutral. Moreover, we note that there may be situations when side information is malicious, which can be captured by swapping L(y_t, ŷ_t) and L(y_t, ŷ^s_t) in (5). We will not deal with these cases herein and leave them for further investigation. (Similar modifications can be made to (6) for side information to be neutral or malicious on average.) We note that malicious side information may be adversarial, i.e., intentionally targeting the model, or simply noisy, which has a similar effect but is not necessarily deliberate.
In both cases, the presence of side information will cause the loss function to increase. Here, we do not touch upon the methods for countering adversarial attacks, which is an ongoing endeavour [24].

Materials and Methods
As a proof of concept of our model of side information, we consider two datasets. The first is the Modified National Institute of Standards and Technology (MNIST) dataset [16]. It is a database of handwritten digits, stored as 28 × 28 pixel images, consisting of a training set of 60,000 instances and a test set of 10,000 instances. These are the standard training and test sizes for MNIST, which result in about 85.7% of the data being used for training and about 14.3% reserved for testing. This is a classification problem with 10 labels, each depicting a digit identity between 0 and 9.
Our second experiment is based on the House Price dataset, which is adapted from Zillow's home value prediction Kaggle competition dataset [25], and contains 10 input features as well as a label variable [26]. The dataset consists of 1460 instances. The main task is to predict whether the house price is above or below the median value, according to the value of the label variable being 1 or 0, respectively; it is therefore a binary classification problem.
For both experiments we utilised a fully connected feedforward neural network (FNN) containing two hidden layers, as well as a softmax output layer; see also Figure 1 and Algorithm 1 regarding the modification of the neural network to handle side information. Our experiments mainly aim at evaluating whether or not the utilisation of side information is beneficial to the performance of the learning algorithm, i.e., whether it is useful. This is achieved by evaluating the difference between the respective NN losses without and with side information. The experiments demonstrate the effectiveness of our approach in improving the learning performance, which is evidenced by a reduction in the loss (i.e., improvements in the accuracy) of the classifiers, for each of the two experiments, during the test phase.
In both experiments, all the reported classification accuracy values and usefulness estimates (according to (5) and (6)) are averaged over 10 separate runs of the NN. For every experiment, we also report the statistical significance values of the differences between the corresponding classification accuracy values, without and with side information. To this effect, we perform a paired t-test to assess the difference in accuracy levels obtained, where the null hypothesis states that the corresponding accuracy levels are not significantly different.
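The paired t-test over matched runs can be sketched as below. The accuracy values are hypothetical placeholders for the 10 per-run accuracies; in practice an off-the-shelf routine such as `scipy.stats.ttest_rel` would give the statistic and p-value directly:

```python
import math

def paired_t_statistic(acc_without, acc_with):
    """Paired t-test statistic over n matched runs of the NN."""
    d = [a - b for a, b in zip(acc_with, acc_without)]  # per-run differences
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)     # sample variance
    return mean / math.sqrt(var / n)

# Hypothetical accuracies from 10 paired runs without / with side info.
acc_without = [0.95, 0.94, 0.96, 0.95, 0.95, 0.94, 0.96, 0.95, 0.94, 0.95]
acc_with    = [0.97, 0.96, 0.97, 0.98, 0.97, 0.96, 0.97, 0.97, 0.96, 0.97]

t = paired_t_statistic(acc_without, acc_with)
assert t > 0   # accuracy with side information is higher on average
```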
For each run out of the 10 repetitions, the parameters of the NN are initialised randomly. In addition, the order in which the input data are presented to the network is also randomly shuffled during each run. Also, missing side information values are replaced with the average value of the side information for the respective class label. The number of epochs used is 10, with a batch size of 250. The loss adopted in both experiments is the cross-entropy loss.
Before presenting the results in Section 6, we describe some additional detail for the two datasets in Sections 5.1 and 5.2.

MNIST Dataset
For this dataset we add side information in the form of a single attribute, giving information about how peculiar the handwriting of the corresponding handwritten digit looks. The spectrum of values is between 0, which refers to a very odd-looking digit with respect to its ground-truth identity, and 1, which denotes a very standard-looking digit; we also made use of three intermediate values, 0.25, 0.5 and 0.75. This side information was coded manually for 20% of the test and training data, and was fed into the network through the knowledge base right after the first hidden layer. In accordance with the adopted FNN architecture, we believe that adding side information after the first hidden layer best highlights the utilisation of the proposed concept of side information, since if the side information were to be added prior to that (i.e., before the first layer), its impact would be equivalent to extending the input. Otherwise, adding it after the second hidden layer (although technically possible) may not enable the side information to sufficiently impact the learning procedure. According to the (p, z, c) representation, the side information here can be generated as (p, image, not-peculiar).
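This encoding can be sketched as follows. The function name, the dictionary of per-digit means, and the specific mean values are illustrative assumptions; only the five-level scale and the per-class mean imputation for the uncoded 80% come from the setup described above:

```python
# Peculiarity side information for MNIST: a value in {0, 0.25, 0.5, 0.75, 1},
# coded manually for ~20% of the data; missing values are imputed with the
# per-class (digit) mean.
PECULIARITY_LEVELS = (0.0, 0.25, 0.5, 0.75, 1.0)

def side_info_value(coded, digit, class_means):
    if coded is not None:
        assert coded in PECULIARITY_LEVELS
        return coded              # fact (p, image, not-peculiar) with p = coded
    return class_means[digit]     # constant imputation for the uncoded 80%

class_means = {d: 0.8 for d in range(10)}   # illustrative per-digit means

assert side_info_value(0.75, 3, class_means) == 0.75
assert side_info_value(None, 3, class_means) == 0.8
```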
As mentioned earlier, an FNN with two hidden layers is utilised. The first hidden layer consists of 350 hidden units, whereas the second hidden layer contains 50 hidden units (these choices are reasonable given the size and structure of the dataset), followed by a softmax activation. Similar to numerous previous works in ML, we focus on comparisons based on an FNN with a certain capacity (two hidden layers in our case, rather than five or six, for instance) in order to concentrate on assessing the algorithm rather than the power of a deep discriminative model.

House Price Dataset
The FNN utilised in this experiment consists of two hidden layers, followed by a softmax activation. The first hidden layer consists of 12 hidden units, while the second consists of 4 hidden units (these choices are reasonable given the size and structure of the dataset). A portion of 20% of the data is reserved for the test set.
Out of the 10 features, we experimented with one feature at a time being kept aside. Whilst the other 9 features are presented as input to the input layer of the NN in the normal manner, the aforementioned 10th feature is used to generate side information, which is fed to the NN through the knowledge base right after the first hidden layer. The reasoning behind selecting this location for adding the side information is similar to the one highlighted with MNIST. According to the (p, z, c) representation, here, the side information can be generated as (p, house-data, feature-value), where the feature in question varies according to the one left out.
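The leave-one-feature-out setup can be sketched as below. The function and class-name strings are illustrative, and we assume feature values normalised to [0, 1] so that the held-out value can play the role of the probability p in the triple:

```python
# Feature j is removed from the input vector and re-enters as side information.
def split_features(row, j):
    """row: list of 10 feature values; j: index of the side-information feature."""
    inputs = row[:j] + row[j + 1:]                  # 9 features go to the input layer
    side = (row[j], "house-data", f"feature-{j}")   # (p, z, c)-style triple
    return inputs, side

row = [0.2, 0.5, 0.9, 0.1, 0.3, 0.7, 0.4, 0.6, 0.8, 0.55]
inputs, side = split_features(row, j=2)
assert len(inputs) == 9 and side[0] == 0.9
```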

Results and Discussion
For the MNIST dataset, as can be seen in the first row of Table 1, the inclusion of side information in the manner proposed here leads to a significant improvement in the overall classification accuracy. In addition, the values of ε and δ displayed in the first row of Table 2 indicate that, with MNIST, side information is useful, both individually and on average. Since MNIST is a dataset on which FNNs typically attain high accuracy values, achieving an improvement when learning from MNIST is important, even if it is not large.
For the House Price dataset, as can be seen in the second and third rows of Table 1, including a single feature as side information always leads to a significant improvement in the classification accuracy. We refer to Group 1 of the features as the scenario of leaving out one of the first five features (i.e., feature 1, 2, 3, 4 or 5) as the side information feature. Correspondingly, we refer to Group 2 as the scenario based on leaving out one of the other five features (i.e., feature 6, 7, 8, 9 or 10) as the side information feature. We have found it convenient to group features in this manner given that those belonging to the same group have ultimately resulted in very similar accuracy values. Interestingly, we notice that using any of the first five features as the side information feature (in Group 1) leads to a greater improvement than leaving out one of the latter five features (in Group 2). As such, the information involved in the first five features is seemingly more relevant for the classifier. This is also in line with the intuition that can be obtained after a manual inspection of the features, since the information involved in the Group 1 features is overall more global than the information in the Group 2 features. For example, Group 1 contains information about the overall quality and condition of the house. This opens possibilities for utilising our side information framework for future work on explainability [27] as well as feature selection [28].
The values of ε and δ reported in the second and third rows of Table 2 demonstrate that including side information in the learning procedure is considerably useful, both individually and on average, when learning from the House Price dataset. Furthermore, the findings drawn from Table 2 are in line with those from Table 1 regarding the dominance of the features from Group 1 over the Group 2 features in terms of relevance.
Table 1. Test classification accuracy for the two experiments, MNIST and House Price, followed by the results of a statistical significance test (p-value for a 95% confidence interval). A bold entry indicates that the classification accuracy of the FNN for the respective experiment is significantly higher than its competitor. The inclusion of side information leads to a statistically significant improvement in the classification accuracy in both experiments, compared to the case when side information is not included.

Table 2. Difference between the classification losses for the two experiments, MNIST and House Price, without and with side information, expressed in the second and third columns by ε and δ, according to (5), when side information is individually useful (we note that the values were normalised and reported as percentages). In addition, in the fourth and fifth columns, the values of ε and δ express how useful side information is on average, according to (6). For both datasets, side information is useful both individually and on average. (One recovered row of the table body: ε = 2.96%, δ = 0.10 individually; ε = 3.60%, δ = 0.10 on average.)

Concluding Remarks
Herein, we presented a novel model for side information in neural networks, where the side information is made visible to the network during both the training and testing phases. The side information is fed into the neural network with the aid of a knowledge base of facts, as shown in Figure 1 and in the high-level pseudo-code given in Algorithm 1.
We developed a formalism for side information, whereby it can be individually useful according to (5) or useful on average according to (6). As a proof of concept, we then provided experimentation on two datasets, MNIST and House Price, averaging the results over 10 runs of an FNN with two hidden layers. For both datasets, the use of side information improves the classification accuracy significantly and is useful both individually and on average. Extensive experimentation to provide further validation of the approach is left as future work.
In future work, we aim to investigate the relationship between side information and explainable ML models, noting that explainability is one of the pillars of trustworthy ML, broadly known as explainable AI (XAI) [27]. By explainable ML, we refer to methods which aim to analyse the data in a way that involves attributes understandable to humans. The peculiarity of digits in the MNIST dataset is an example of such an attribute. Some of these attributes can be cast as side information in our formalism and, when shown to be useful, will improve the performance of the NN.