The Constrained Median: A Way to Incorporate Side Information in the Assessment of Food Samples

Abstract: A classical problem in the field of food science concerns the consensus evaluation of food samples. Typically, several panelists are asked to provide scores describing the perceived quality of the samples, and subsequently, the overall (consensus) scores are determined. Unfortunately, gathering a large number of panelists is a challenging and very expensive way of collecting information. Interestingly, side information about the samples is often available. This paper describes a method that exploits such information with the aim of improving the assessment of the quality of multiple samples. The proposed method is illustrated by discussing an experiment on raw Atlantic salmon (Salmo salar), where the evolution of the overall score of each salmon sample is studied. The influence of incorporating knowledge of storage days, results of a clustering analysis, and information from additionally performed sensory evaluation tests is discussed. We provide guidelines for incorporating different types of information and discuss their benefits and potential risks.


Introduction
One of the most traditional ways of determining the quality of food is by performing sensory evaluation tests, such as asking panelists to provide absolute evaluations expressing the overall quality of a food sample [1][2][3]. Typically, scales with three to nine points are used to express the degree of spoilage/freshness [4][5][6][7], the appearance [8], or the flavor [9] of a food sample. In acceptance tests, the nine-point hedonic scale [10,11] is often used, where the points on the scale represent ordered categorical labels ranging from "dislike extremely" as a score of "1" to "like extremely" as a score of "9". These scores can be used to determine a consensus score of the overall quality of a food sample. Unfortunately, the availability of panelists is often limited. Generally, a small panel (usually fewer than 30 individuals) can hardly be a good representation of a target market [12], and the resulting consensus score might not be a good representation of the overall quality of a sample. Thus, incorporating other sources of information may help improve the assessment of the quality of a sample.
In most cases, side information about the samples is available. Typical examples of such information include: the storage days of the food sample, additionally performed chemical analyses, or additionally performed sensory evaluation tests, such as ranking, discrimination, and threshold tests. These types of information usually are of a relative nature and hint at some relations between the scores assigned to the food samples. Several studies have shown that learning with side information can be effective in machine learning [13,14]. In this paper, we develop a method that combines such side information about several food samples and scores provided by panelists to find their overall score jointly. This paper is organized as follows. In Section 2, the method to combine scores and other types of information is described and the experimental setup is provided. Section 2.1 describes the median, the most commonly used measure of central tendency of scores. Section 2.2 provides a non-exhaustive list of some potential real-life situations where side information about the samples could be available, and Section 2.3 describes an efficient method to combine scores and other types of information to assign an overall score to multiple samples. Section 3 illustrates the method by presenting an experiment on raw Atlantic salmon (Salmo salar). The results are shown and discussed in Section 4. The paper ends with some conclusions in Section 5.

The Median
We start with a description of the theory that provides the main building blocks of our approach. Denote by x j the jth sample in a set X = {x 1 , . . . , x n } of n food samples. Consider the setting where each of m panelists has assigned a score on a k-point scale to a given food sample, and the goal is to agree on the consensus score that should be assigned.
Since it will be the scale used in the experimental setup, the five-point scale illustrated in Figure 1 is considered. However, it is important to highlight that the results of this paper are straightforwardly extended to any other k-point scale. Scales with an odd number of points are typically preferred, allowing for a neutral response in case of bipolar scales [15].  Notably, the median is the most commonly used measure of the central tendency of scores found in studies on food quality [16][17][18][19][20]. This measure can be understood as the score that separates the lower half from the upper half of the scores provided by the panelists for a given sample [21]. As will be explained below, this procedure is equivalent to assigning the score that minimizes the sum of absolute differences.
Typically, gathering together several panelists is a challenging, time consuming, and expensive exercise. Therefore, it is common to provide the panelists with multiple food samples during the same experiment. In general, the scores assigned to each sample are considered to be independent, and the assessment of a consensus score to each sample is assumed to be an independent task. Note that it is often difficult to gather the same number of panelists for different experiments.
Consider the problem setting where several panelists provide scores for each of the n food samples. For example, consider a simple setting where nine panelists each assign a score to a given food sample on the five-point scale fixed in Figure 1. The scores provided by the panelists are 4, 1, 2, 1, 4, 3, 3, 4, 3 and are represented in increasing order as 1, 1, 2, 3, 3, 3, 4, 4, 4. Denote by m j the number of scores assigned to the jth sample and by s i (j) the ith lowest score assigned to sample x j , where j ∈ {1, . . . , n} and i ∈ {1, . . . , m j }.
The goal is to agree on the consensus score that should be assigned to each sample in X. Obtaining the median score for each sample is equivalent to directly computing the vector s * , as follows:

$$ s^{*} = \underset{s \in \{1,\dots,5\}^{n}}{\arg\min} \; \sum_{j=1}^{n} \sum_{i=1}^{m_j} \left| s_i(j) - s(j) \right| \tag{1} $$

where s * (j) is referred to as the median of the jth sample. Note that in case m j is odd for all j, the median (and thus, the minimizer of Equation (1)) is unique. In case m j is even for at least one j, there can be multiple medians (and thus, multiple minimizers of Equation (1)).
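The equivalence between the median and the minimizer of the sum of absolute differences can be checked numerically. The following is a minimal sketch for a single sample on the five-point scale; the function names are illustrative and not part of the original method, and the even-sized panel is a hypothetical input added only to show non-uniqueness.

```python
import statistics


def sum_abs_diff(scores, s):
    """Sum of absolute differences between a candidate score s and the panel scores."""
    return sum(abs(x - s) for x in scores)


def medians_by_minimization(scores, k=5):
    """Return every score on the k-point scale that minimizes the sum of absolute differences."""
    costs = {s: sum_abs_diff(scores, s) for s in range(1, k + 1)}
    best = min(costs.values())
    return [s for s, c in costs.items() if c == best]


# The nine scores used in the text (odd panel size, unique median).
scores = [4, 1, 2, 1, 4, 3, 3, 4, 3]
print(medians_by_minimization(scores), statistics.median(scores))

# Hypothetical even-sized panel: two minimizers, hence two medians.
even_panel = [2, 2, 3, 3]
print(medians_by_minimization(even_panel))
```

For the nine scores from the text, the only minimizer is three, which coincides with the middle score of the ordered list.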

Information about Samples
In most cases, side information about the different samples could be available. We provide a non-exhaustive list of some potential real-life situations hereafter.

Knowledge of Storage Days
Researchers are often interested in studying the temporal evolution of the attributes of perishable food. This is typically done by asking panelists to provide a score to a food sample that comes from different time spans of the shelf life of the same food product. In general, it is expected that food samples should be less fresh as time goes by. Thus, it is expected that the less fresh the sample is, the lower the score should be. One example is evaluating the freshness or tenderness of meats, where the score may only decrease with time [22].
Consider the setting where samples coming from the same food product are indexed in increasing order of storage days. Thus, the potential consensus scores should naturally reduce to those that satisfy the following constraints:

$$ s(1) \geq s(2) \geq \cdots \geq s(n). $$

Note that the overall trend of the scores should be decreasing; however, different decreasing patterns are possible. One possible pattern of scores is illustrated in Figure 2. Note that multiple consecutive samples could be assigned the same score, as illustrated in Figure 2 for Points 0, 1, and 2. Typically, this occurs when the number of points (k) on the scale is small.

In studies on the acceptability of beverages, the evolution of certain attributes of beverages is of interest. A common method used in this situation is the time-intensity (TI) method [21] (see Chapter 8).
Typically, a beverage sample is first swallowed, and then, an attribute is evaluated by assigning a score at different time spans. For several types of beverages, such as beer, wine, and soda, it is expected that a beverage sample should have an increasing acceptance at first, eventually decreasing afterwards.
Typical examples include the evaluation of the astringency and flavor of beer and wine, where these attributes increase in intensity at first and eventually decrease with time [23,24].
Consider the setting where a beverage sample is evaluated by panelists over a period of time. Thus, the potential consensus scores should naturally reduce to those that satisfy the following constraints: s(1) ≤ · · · ≤ s(a) and s(a) ≥ · · · ≥ s(n) , for some a ∈ {1, . . . , n}.
This means that there should be a unimodal pattern. One possible pattern of scores is illustrated in Figure 3. Note that if one considers a short duration, say t ∈ {0, . . . , 3}, then the overall trend is only increasing. However, if one considers a long duration at a later time, say t ∈ {2, . . . , 8}, then the overall trend is only decreasing.
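The decreasing and unimodal patterns described above can be expressed as simple predicates on a candidate vector of scores. The sketch below is illustrative only: the function names and the example vectors are assumptions, not values taken from Figures 2 and 3.

```python
def is_nondecreasing(s):
    """True if s(1) <= s(2) <= ... <= s(n)."""
    return all(a <= b for a, b in zip(s, s[1:]))


def is_nonincreasing(s):
    """True if s(1) >= s(2) >= ... >= s(n), the storage-day constraint."""
    return all(a >= b for a, b in zip(s, s[1:]))


def is_unimodal(s):
    """True if s(1) <= ... <= s(a) and s(a) >= ... >= s(n) for some position a."""
    return any(
        is_nondecreasing(s[: a + 1]) and is_nonincreasing(s[a:])
        for a in range(len(s))
    )


# Hypothetical score vectors, chosen only to exercise the predicates.
print(is_nonincreasing([5, 5, 5, 4, 3, 3, 2]))    # ties between consecutive samples are allowed
print(is_unimodal([2, 3, 4, 5, 5, 4, 3, 2, 1]))   # rises, then falls
print(is_unimodal([3, 2, 4]))                     # falls, then rises: not unimodal
```

Purely increasing and purely decreasing vectors also satisfy the unimodal predicate (with the peak at the last or first position), which matches the remark about short and late observation windows.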

Results of a Clustering Analysis
In many studies, food samples are stored at different (temperature and atmospheric) conditions or represent the same food product, but originating from a different initial batch, manufacturer, or season. In addition, the initial contamination of the food (i.e., initial microbial load) plays a big role in the spoilage rate of every sample, and thus, the decreasing pattern of the scores might not always hold. Thus, the storage days could not be used as the only tool to compare these samples. For instance, it is not always the case that samples that have been stored at different conditions for the same duration of time will be similar. Similarly, it is not always the case that a sample is always preferred over another sample that is stored at different conditions and has been stored for longer.
It is well established that microbial growth is the most important cause of food spoilage, particularly in meats [25], producing volatile organic compounds (VOCs) and, subsequently, off-odors and off-flavors. These odors and flavors result in an olfactory impact that is associated with the spoilage of food. Therefore, the relation between the VOC profiles and the quality of food has caught the attention of many researchers in food science. Several studies have successfully used the composition of the VOC profiles to evaluate the quality of food, such as seafood [26,27] and meat [28].
To establish a relation between the VOC profiles of the samples and their resulting consensus scores, clustering analysis, a method for merging similar groups of samples based on the similarity of their VOC profile, can be used. In general, it is expected that samples clustered together should be quite similar, and thus, their scores should not be very different. Therefore, the absolute difference of the scores of these samples should not exceed a certain threshold. Note that we prefer not to impose that samples in the same cluster should have strictly the same scores because this might be too restrictive. However, this is still a possibility, as will be further explained below.
The considered setting may naturally reduce the potential consensus scores to those that satisfy the following constraints:

$$ \left| s(j_1) - s(j_2) \right| \leq \varepsilon, \quad \text{for any } j_1, j_2 \in I_b \text{ and any cluster } b, $$

where I b is the set of indices corresponding to the b th cluster and ε is a threshold on the absolute difference of the scores of samples in the same cluster. Note that the value of ε may depend on several considerations, such as the distance between clusters (different clustering analysis tools result in different distances between clusters) or the number of points (k) on the scale used for scoring. The special case where ε = 0 amounts to restricting with equality constraints only. For instance, consider the case where samples {x 1 , x 2 } are found in one cluster, samples {x 3 , x 4 , x 5 } are found in another cluster, and ε = 1. It is expected that the absolute difference of the scores of every pair of samples in each cluster should be smaller than or equal to one. This process is illustrated in Figure 4. It can be seen that the absolute difference of scores for each pair of samples in the same cluster is smaller than or equal to one, thus satisfying the constraints.
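The cluster constraint amounts to a pairwise check within each cluster. The following sketch assumes 0-based sample indices, a threshold named `eps`, and hypothetical score vectors (the values shown in Figure 4 are not reproduced here).

```python
from itertools import combinations


def satisfies_cluster_constraints(s, clusters, eps=1):
    """Check that every pair of samples in the same cluster differs by at most eps.

    s        -- list of scores, s[j] being the score of sample j (0-based)
    clusters -- list of clusters, each a list of sample indices
    eps      -- threshold on the absolute difference (eps=0 forces equal scores)
    """
    return all(
        abs(s[j1] - s[j2]) <= eps
        for cluster in clusters
        for j1, j2 in combinations(cluster, 2)
    )


# Samples {x1, x2} and {x3, x4, x5} from the text, written with 0-based indices.
clusters = [[0, 1], [2, 3, 4]]

print(satisfies_cluster_constraints([4, 5, 2, 3, 2], clusters, eps=1))  # feasible
print(satisfies_cluster_constraints([4, 5, 1, 3, 2], clusters, eps=1))  # x3 and x4 differ by 2
print(satisfies_cluster_constraints([4, 4, 2, 2, 2], clusters, eps=0))  # equality constraints
```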

Ranking test
Recently, researchers have been adopting scoring methods for determining the quality of food samples along with ranking methods to order the samples according to their quality [29][30][31][32]. Ranking tests involve several panelists providing rankings (with ties) on samples. Typically, these rankings are aggregated to obtain a consensus ranking that describes an underlying order of the samples; thus, it is expected that the scores agree with this consensus ranking of the samples. For example, rankings have been previously used to study the desirability of different meats [33,34]. In this setting, these rankings can give useful information as a reference for the relative desirability of meats.
In general, it is expected that samples ranked higher are preferred over samples ranked lower; thus, it is expected that the higher the sample is ranked, the greater the score should be. Note that we do not impose that a sample ranked higher than another sample should have a strictly greater score; instead, we allow their scores to be equal as well. This is because the considered scale is typically not rich enough to distinguish between similar samples. In case two samples are tied, it is expected that their scores should be similar. Note that the situation where samples are tied is similar to that where samples are in the same cluster. For simplicity, the special case where ε = 0 is considered. The considered setting may naturally reduce the potential consensus scores to those that satisfy the following constraints:

$$ s(j_1) \leq s(j_2) \ \text{ if } x_{j_1} \text{ is ranked lower than } x_{j_2}, \qquad s(j_1) = s(j_2) \ \text{ if } x_{j_1} \text{ and } x_{j_2} \text{ are tied}. $$

For instance, consider the ranking with ties x 1 ≺ x 2 ∼ x 3 ≺ x 4 , in which x 1 is ranked lowest, x 2 and x 3 are tied, and x 4 is ranked highest. It is expected that x 1 should be assigned a score smaller than or equal to that of sample x 2 , which should be assigned a score equal to that of sample x 3 , which should be assigned a score smaller than or equal to that of sample x 4 . More formally, the resulting constraints are s(1) ≤ s(2) = s(3) ≤ s(4). This process is illustrated in Figure 5.

Figure 5. Example of scores describing the ranking x 1 ≺ x 2 ∼ x 3 ≺ x 4 .
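A ranking with ties can be turned into a feasibility check on a candidate vector of scores. In the sketch below, the ranking is represented as a list of tied groups ordered from lowest-ranked to highest-ranked; this representation, the function name, and the score values are assumptions made for illustration.

```python
def satisfies_ranking(s, ranking):
    """Check a score assignment against a ranking with ties (threshold 0 for ties).

    s       -- dict mapping sample index to score
    ranking -- list of groups of tied samples, ordered from lowest- to highest-ranked
    """
    # Tied samples must share exactly the same score.
    for group in ranking:
        if len({s[j] for j in group}) > 1:
            return False
    # Scores must not decrease when moving to a higher-ranked group;
    # checking consecutive groups is enough because <= is transitive.
    group_scores = [s[group[0]] for group in ranking]
    return all(a <= b for a, b in zip(group_scores, group_scores[1:]))


# The ranking from the text: x1 lowest, x2 and x3 tied, x4 highest.
ranking = [[1], [2, 3], [4]]

print(satisfies_ranking({1: 2, 2: 3, 3: 3, 4: 5}, ranking))  # s(1) <= s(2) = s(3) <= s(4)
print(satisfies_ranking({1: 2, 2: 3, 3: 4, 4: 5}, ranking))  # tie between x2 and x3 broken
print(satisfies_ranking({1: 4, 2: 3, 3: 3, 4: 5}, ranking))  # x1 scored above x2
```

Equal scores across different groups remain feasible, reflecting the choice not to impose strictly greater scores for higher-ranked samples.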

Discrimination test
Many discrimination tests can be seen as a special case of a ranking test. For instance, in an A-notA test, panelists are provided with one sample and are asked whether or not it is similar to a reference sample A [35]. Based on the responses of the panelists, if there is no significant difference between the samples, then it is expected that they should be assigned a similar score. Therefore, the absolute difference of the scores of these samples should not exceed a certain threshold ε.
Another instance is a duo-trio test, where panelists are provided with two samples and a reference sample that is identical to one of the two samples and are asked to match one of the two samples to the reference sample [36] (see Chapter 4). It is expected that the reference sample and the sample identical to it should be scored equally. Moreover, if a large number of panelists are not able to distinguish the identical samples from the third sample, then it is expected that this third sample should be assigned a score similar to that of the identical samples. Therefore, the absolute difference of the scores of the non-identical samples should not exceed a certain threshold ε. This process is illustrated in Figure 6.
Another instance is a two-out-of-five test, where panelists are given five samples and are asked to distinguish two identical samples from the other three samples [37]. It is expected that the identical samples should be scored equally. Moreover, if a large number of panelists are not able to distinguish the identical samples from the other three, then it is expected that there is no significant difference among the five samples and that all the samples should be assigned similar scores. Therefore, the absolute difference of the scores of these samples should not exceed a certain threshold ε.

Figure 6. Example of scores where there is no significant difference between x 3 and the reference sample x 1 (equivalently, its identical sample x 2 ) and where a threshold ε = 1 is considered.

Threshold test
In threshold tests, panelists are asked to determine a threshold of noticing a certain stimulus [21] (see Chapter 6). Different versions of the threshold test have been proposed, the differential threshold test and the absolute threshold test being the most prominent examples. In the former, the aim is to determine the threshold at which an increase in a noticed stimulus can be perceived, whereas in the latter, the aim is to determine the lowest threshold at which a stimulus can be noticed. Note that the case in which there is a decrease in stimulus can also be considered. One example is determining the (consumer) rejection of chocolate bitterness [38].
In differential threshold tests, it is expected that the sample where an increase in stimulus is not noticed should have a quite similar score to the previous sample that has one increment less of the stimulus, and thus, their scores should not be very different. Therefore, the absolute difference of the scores of these samples should not exceed a certain threshold ε. However, it is expected that the samples where an increase in stimulus is noticed should be scored differently from the previous sample. Therefore, the absolute difference of the scores of these samples should be greater than or equal to this threshold.
Note that samples arranged in increasing (or decreasing) order of stimulus should have scores that are either increasing s(i) ≤ s(i + 1) or decreasing s(i) ≥ s(i + 1) for any i ∈ {1, . . . , n − 1}. The considered setting may naturally reduce the potential consensus scores to those that satisfy the following additional constraints:

$$ \left| s(i) - s(i-1) \right| \leq \varepsilon \ \text{ for } i \in \xi_1, \qquad \left| s(i) - s(i-1) \right| \geq \varepsilon \ \text{ for } i \in \xi_2, $$

where ξ 1 is the set of indices corresponding to the samples where a stimulus is not noticed, ξ 2 is the set of indices corresponding to the samples where a stimulus is noticed, and ε is a threshold on the absolute difference of the scores of consecutive samples. For instance, consider that the stimulus is first noticed at sample x 4 . It is expected that the absolute differences of the scores of consecutive samples x 1 and x 2 and samples x 2 and x 3 should be smaller than or equal to ε = 1 and that the absolute difference of the scores of samples x 3 and x 4 should be greater than or equal to ε = 1. Similarly, given that the stimulus is noticed a second time at sample x 6 , it is expected that the absolute difference of the scores of samples x 4 and x 5 should be smaller than or equal to one and that the absolute difference of the scores of samples x 5 and x 6 should be greater than or equal to one. This process is illustrated in Figure 7.

Figure 7. Example of scores in a differential detection test describing no detection of a stimulus (in green) and the detection of a stimulus (in red) at samples x 4 and x 6 and where a threshold ε = 1 is considered.
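The differential threshold example above can be written as a feasibility check. This is a sketch under stated assumptions: each sample is compared with the sample immediately before it, the threshold is called `eps`, the set `noticed` holds the 1-based indices of the samples at which the stimulus is noticed, and the score vectors are hypothetical (they are not the values of Figure 7).

```python
def satisfies_differential_threshold(s, noticed, eps=1):
    """Check scores against differential threshold constraints.

    s       -- list of scores for x1..xn in order of stimulus (s[0] is x1)
    noticed -- set of 1-based sample indices at which a change in stimulus is noticed
    eps     -- threshold on the absolute difference of consecutive scores
    """
    # The scores must be monotone (either non-decreasing or non-increasing).
    nondecreasing = all(a <= b for a, b in zip(s, s[1:]))
    nonincreasing = all(a >= b for a, b in zip(s, s[1:]))
    if not (nondecreasing or nonincreasing):
        return False
    # Compare every sample x_i (i >= 2) with its predecessor x_{i-1}.
    for i in range(2, len(s) + 1):
        diff = abs(s[i - 1] - s[i - 2])
        if i in noticed and diff < eps:
            return False  # noticed change, but scores are too close
        if i not in noticed and diff > eps:
            return False  # no noticed change, but scores are too far apart
    return True


# Stimulus noticed at x4 and again at x6, as in the text.
noticed = {4, 6}
print(satisfies_differential_threshold([1, 1, 2, 3, 3, 4], noticed))  # feasible
print(satisfies_differential_threshold([1, 1, 2, 2, 3, 4], noticed))  # x3 and x4 scored equally
print(satisfies_differential_threshold([1, 3, 3, 4, 4, 5], noticed))  # x1 and x2 differ by 2
```

With a threshold of one, a difference of exactly one satisfies both kinds of constraint, so the threshold alone does not separate the two cases on a coarse scale.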
In absolute threshold tests, samples with a stimulus are only compared to a reference sample. Thus, if an increase in stimulus is not noticed in a sample, then it is expected that this sample should be assigned a score similar to that of the reference sample. Therefore, the absolute difference of their scores should not exceed a certain threshold ε. However, it is expected that the samples where an increase in stimulus is noticed should be assigned a different score than that assigned to the reference sample. Therefore, the absolute difference of their scores should be greater than or equal to this threshold.
Note that the scores should be either increasing s(i) ≤ s(i + 1) or decreasing s(i) ≥ s(i + 1) for i ∈ {1, . . . , n − 1}. We consider the first sample (i = 1) to be the reference sample. The considered setting may naturally reduce the potential consensus scores to those that satisfy the following additional constraints:

$$ \left| s(i) - s(1) \right| \leq \varepsilon \ \text{ for } i < c, \qquad \left| s(i) - s(1) \right| \geq \varepsilon \ \text{ for } i \geq c, $$

where c is the sample at which an increase in stimulus is noticed and ε is a threshold on the absolute difference between the scores of the samples and that of the reference sample.
For instance, consider that a stimulus is first noticed at sample x 4 . It is expected that the absolute difference of the scores of samples x 1 and x 4 should be greater than or equal to one. Now, consider that sample x 4 , where the first stimulus is noticed, is the new reference and that a second stimulus is noticed at sample x 6 . It is expected that the absolute difference of the scores of samples x 4 and x 6 should be greater than or equal to one. This process is illustrated in Figure 8.

The Constrained Median
As we have previously discussed, the considered settings may naturally reduce the set of potential consensus scores from {1, . . . , 5}^n to a non-empty subset S ⊆ {1, . . . , 5}^n . We conjecture that, in most real-life situations, it seems natural for S to result from the conjunction of some (in)equality constraints on the components of s. However, this condition should not be a requirement if it does not comply with the characteristics of the considered problem.
Thus, the consensus scores should be the ones given by the vector that minimizes the sum of distances while satisfying the constraints of S. Therefore, the problem defined by Equation (1) can now be restricted to s ∈ S, as follows:

$$ s^{*} = \underset{s \in S}{\arg\min} \; \sum_{j=1}^{n} \sum_{i=1}^{m_j} \left| s_i(j) - s(j) \right| \tag{2} $$

where s * is referred to as a constrained median. This concept is illustrated in the following example.

Example 1.
Consider a simple setting where nine panelists each assign a score to two given food samples on the five-point scale fixed in Figure 1. The scores assigned to each sample are represented in increasing order in Table 1. From Table 1, it can be clearly seen that the median for sample x 1 is three and the median for sample x 2 is four (i.e., the score in the middle, in this case for i = 5). Analogously, we consider the problem defined by Equation (1), and we compute the sum of distances between the scores provided by the panelists and every possible vector of scores (in general, for n samples and k scores, the number of possible vectors of scores is k^n). The results are illustrated in Table 2. We see that the minimizer (thus the median) is the vector of scores (3, 4). Note that, as expected, this vector coincides with the result of computing the median for each of the different samples separately.
In the setting where it is known that the first sample is fresher than the second sample (they are samples from different time spans of the shelf life of the same food), it is expected that the score of the first sample should be greater than or equal to the score of the second sample. Finding a solution by simply looking at Table 1 is not an easy task. Thus, the problem defined by Equation (2) is considered. The set of constraints S is formed by the vectors of scores in which the first sample is assigned a score greater than or equal to the score of the second sample. Such vectors are highlighted in gray in Table 2. It can be seen that the minimizer (thus the constrained median) is the vector of scores (4, 4). A conclusion is reached that the score assigned to sample x 1 should be greater than that originally assigned.

Table 1. Scores assigned to each sample, represented in increasing order.

| The ith lowest score | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|
| Score of sample x 1 (s i (1)) | 1 | 1 | 2 | 3 | 3 | 3 | 4 | 4 | 4 |
| Score of sample x 2 (s i (2)) | 2 | 3 | 4 | 4 | 4 | 5 | 5 | 5 | 5 |

Table 2. Sum of distances between the scores provided by the panelists and all possible vectors of scores. The minimizers are shown in bold, and the vectors of scores in which the first sample is assigned a score greater than or equal to the score of the second sample are highlighted in gray.

Note that the difference between the number of panelists for each experiment can be extremely large in some instances. For instance, a very small number of panelists may be gathered for one experiment and a much larger number for another. Such a scenario can be approached from two different points of view: (a) each panelist is represented by one evaluation or (b) each sample is represented by one evaluation. The former approach is analogous to the problem defined by Equation (2), whereas the latter approach can be formalized as follows:

$$ s^{*} = \underset{s \in S}{\arg\min} \; \sum_{j=1}^{n} \frac{1}{m_j} \sum_{i=1}^{m_j} \left| s_i(j) - s(j) \right| \tag{3} $$

where the evaluations are averaged based on the number of panelists m j that provide scores for the jth sample. It must be noted that both approaches are equivalent if all m j 's are equal. Moreover, both problems are also equivalent if there are no constraints (i.e., solving the problem defined by Equation (1) resulting in the median).
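Example 1 can be reproduced by enumerating all k^n vectors of scores, as the text describes. The sketch below is a brute-force illustration and not necessarily the efficient procedure used by the authors; the function names are assumptions, and the scores of sample x 1 are taken to be those of the nine-panelist example given earlier (1, 1, 2, 3, 3, 3, 4, 4, 4).

```python
from itertools import product


def cost(scores_per_sample, s):
    """Sum of absolute differences between panel scores and the candidate vector s."""
    return sum(abs(x - s_j) for xs, s_j in zip(scores_per_sample, s) for x in xs)


def constrained_medians(scores_per_sample, feasible=lambda s: True, k=5):
    """Return all minimizers of the sum of distances over the feasible vectors.

    feasible -- predicate describing the set S (assumed non-empty);
                the default accepts every vector, which gives the plain median.
    """
    n = len(scores_per_sample)
    candidates = [s for s in product(range(1, k + 1), repeat=n) if feasible(s)]
    best = min(cost(scores_per_sample, s) for s in candidates)
    return [s for s in candidates if cost(scores_per_sample, s) == best]


x1 = [1, 1, 2, 3, 3, 3, 4, 4, 4]
x2 = [2, 3, 4, 4, 4, 5, 5, 5, 5]

# Unconstrained problem: the per-sample medians.
print(constrained_medians([x1, x2]))

# Constraint "first sample is fresher": s(1) >= s(2).
print(constrained_medians([x1, x2], feasible=lambda s: s[0] >= s[1]))
```

The enumeration grows as k^n, so it is only practical for a handful of samples; it is used here because it mirrors the construction of Table 2 directly.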

Example 2.
Consider a simple setting where one panelist assigns a score to a given food sample and nine panelists each assign a score to a second given food sample on the five-point scale fixed in Figure 1. The scores are represented in increasing order in Table 3.
To determine the vector of consensus scores, the problem defined by Equation (1) is considered, and for each of the 25 possible vectors of scores, the sum of distances to the scores provided by the panelists is computed. The minimizer of this sum of distances (thus the median) is the vector of scores (2,3).
Consider now that we know that the first sample is fresher than the second sample. Based on the approach where each panelist is represented by one evaluation, the problem defined by Equation (2) is solved, resulting in the vector of scores (3, 3) as the constrained median. Based on the approach where each sample is represented by one evaluation, the problem defined by Equation (3) is solved, resulting in the vector of scores (2, 2) as the constrained median. If we consider that each panelist is represented by one evaluation (i.e., the score of each sample depends on the number of panelists), then it seems logical to change the median score of x 1 to three. However, if we consider that each sample is represented by one evaluation (i.e., the score of each sample depends on the proportion of panelists), then it seems logical to change the median score of x 2 to two.

Table 3. Scores assigned to each sample, represented in increasing order.

| The ith lowest score | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|
| Score of sample x 1 (s i (1)) | 2 | | | | | | | | |
| Score of sample x 2 (s i (2)) | 2 | 2 | 3 | 3 | 3 | 5 | 5 | 5 | 5 |
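The two points of view in Example 2 differ only in how each sample's distances are weighted. The following brute-force sketch illustrates this; the function name and the `per_sample` flag are assumptions introduced for illustration, and exact fractions are used so that ties are not affected by floating-point rounding.

```python
from fractions import Fraction
from itertools import product


def constrained_medians(scores_per_sample, feasible, per_sample=False, k=5):
    """Brute-force minimizers of the (optionally averaged) sum of distances.

    per_sample=False -- each panelist counts once (sum of distances)
    per_sample=True  -- each sample counts once (distances divided by m_j)
    feasible         -- predicate describing the set S (assumed non-empty)
    """
    def cost(s):
        total = Fraction(0)
        for xs, s_j in zip(scores_per_sample, s):
            weight = Fraction(1, len(xs)) if per_sample else Fraction(1)
            total += weight * sum(abs(x - s_j) for x in xs)
        return total

    n = len(scores_per_sample)
    candidates = [s for s in product(range(1, k + 1), repeat=n) if feasible(s)]
    best = min(cost(s) for s in candidates)
    return [s for s in candidates if cost(s) == best]


x1 = [2]                              # one panelist
x2 = [2, 2, 3, 3, 3, 5, 5, 5, 5]      # nine panelists


def fresher(s):
    """The first sample is fresher: s(1) >= s(2)."""
    return s[0] >= s[1]


print(constrained_medians([x1, x2], lambda s: True))            # plain median
print(constrained_medians([x1, x2], fresher))                   # one evaluation per panelist
print(constrained_medians([x1, x2], fresher, per_sample=True))  # one evaluation per sample
```

Under the per-panelist weighting the single score for x 1 is outvoted by the nine scores for x 2 , whereas under the per-sample weighting the single score carries as much weight as the whole panel for x 2 .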

Materials and Methods
The evolution of four fresh Atlantic salmon fillets (A, B, C, D) was studied over time. The data of this study originated from the same experiment performed by [39], who described the adopted materials and methods summarized hereafter.
Each fillet was (equally) divided into five samples that were stored under the same conditions for a specific number of storage days and then analyzed by selective-ion flow-tube mass spectrometry (SIFT-MS); a subindex was added to each salmon sample to indicate the corresponding storage day. Subsequently, after a sample was stored for a specific number of days, it was frozen at −32 °C under vacuum. The samples were thawed and grouped into five groups of four samples, one sample from each fillet, and were then provided to the panelists in a random order as shown in Table 4. The groups of samples were not provided to the panelists in chronological order to prevent the panelists from recognizing a pattern in the experiments that could affect their evaluations.

Table 4. The order of grouping the salmon samples from different storage days (represented by the corresponding subindex) and the day each group was provided to the panelists.

Several panelists (nine or ten depending on the day) with prior experience in performing sensory evaluation of salmon were recruited from the Department of Food Safety and Food Quality at the Faculty of Bioscience Engineering at Ghent University. Sensory evaluation was based on olfactory evaluation and performed in individual booths under red light at SensoLab in Ghent University. Each panelist was asked to assign to each sample a score on the 5-point scale described in Figure 1, where the scores "1", "3", and "5" instead represented spoiled, neither spoiled nor fresh, and fresh, respectively.
Additionally, several panelists (between 23 and 28 depending on the day) with no prior experience in sensory evaluation of salmon were recruited from multiple departments at the Faculty of Bioscience Engineering and were asked to express a ranking of the four samples of salmon by ordering them from most fresh to least fresh. For each group of samples, we computed the consensus ranking(s) r * as described in [39], and we refer to that work for further details on the considered method of [40] for the aggregation of rankings.

Consensus Scores of Salmon
The scores assigned to each sample are represented in increasing order in Table 5. The median score for each sample can be directly determined as the scores in the middle of each column, shown in bold in Table 5. Note that for Groups 3, 4, and 5, there was an even number of scores provided by the panelists. Since the scores right before (i.e., i = 5) and right after (i.e., i = 6) the middle were the same, the median was unique.

Table 5. Scores assigned by the panelists on each day (columns: Group 1, Group 2, Group 3, Group 4, Group 5). The medians are shown in bold.

In what follows, the influence of incorporating knowledge of storage days, results of a clustering analysis, and information from additionally obtained rankings of the samples is illustrated, and the results are discussed.

Incorporating Knowledge of Storage Days
By considering the consensus scores of each sample separately, it was clear that the scores of the samples from Fillets A and B were decreasing over time. However, the scores of the samples from Fillets C and D were increasing at certain storage days. For instance, sample C 3 was assigned a score of three, and sample C 4 was assigned a higher score of four. Similarly, sample D 5 was assigned a score of two, and sample D 6 was assigned a higher score of four.
To determine the consensus vector of scores that should be assigned to the five samples of the same fillet, while incorporating the knowledge of storage days of the samples, the problem defined by Equation (2) was considered, where the additional constraints are summarized in Table 6. The median and the constrained median for the samples of each fillet are summarized in Table 9. After incorporating the knowledge of storage days of the samples from each of Fillets C and D, the constrained medians in Table 9 no longer increased over storage time for any fillet.

Table 6. The constraints based on the storage days of the samples (columns: Fillet, Constraints).

Note that including the knowledge of storage days should only be considered when several factors of the initial conditions of the samples are similar, particularly their contamination, dimensions, composition, and packaging and storage conditions. Verifying the similarity of the initial conditions of the samples required measurements.
From Table 9, it was deduced that, in an ideal situation where the samples from the same fillet had similar initial conditions, incorporating the knowledge of storage days of the samples from Fillet C indicated that sample C 4 should be assigned a lower score and that samples C 3 , C 4 , and C 5 should be equally scored in terms of freshness. Moreover, it was deduced that incorporating the knowledge of storage days of the samples from Fillet D indicated that sample D 5 should be assigned a higher score and sample D 6 should be assigned a lower score, and these samples should be scored equally in terms of freshness.
There existed several potential risks when incorporating knowledge of storage days of samples that were not initially similar. The main concern was the underlying assumption that the samples had similar spoilage rates; if this assumption did not hold, their assigned scores might be incorrectly related to their storage days. Since microbiological analysis of each salmon sample was not performed, the storage days should not be used as the only tool to compare these samples.

Incorporating Results of a Clustering Analysis
Recently, the characterization of VOCs using selective-ion flow-tube mass spectrometry (SIFT-MS) has attracted the attention of many researchers and has been validated for fish metabolite research [26,27]. Additionally, researchers have used hierarchical agglomerative clustering [41], a commonly used clustering analysis tool, to establish a relation between the VOC profiles of the samples and their resulting consensus scores [42]. The methods of using SIFT-MS to quantify the VOC profiles and the results of the clustering were described by [39] and are summarized in Table 7.

Table 7. Clustered samples based on the similarity of their VOC profile as described by [39]. Columns: Clusters; Samples.

To determine the consensus vector of scores that should be assigned to each of the samples, while incorporating the results of a clustering analysis of the samples, the problem defined by Equation (2) with a threshold = 1 on the absolute difference of the scores of the samples was considered.
The median and the constrained median for the samples of each fillet are gathered in the upcoming Table 9.
From Table 9, it was deduced on the basis of the scores provided by the panelists that, in Cluster 3, samples C6 and D5 were less fresh than all the other samples in the cluster. However, incorporating the results of a clustering analysis of all the samples indicated that samples C6 and D5 should each be assigned a higher score.
Note that including the results of a clustering analysis should only be considered when the measurements are accurate and the clusters are well defined. Otherwise, there is a risk of inaccurately clustering samples and, thus, assigning incorrect scores. In many real-world datasets, there is no absolute optimal number of clusters. As a result, a balance between a clustering that reflects the data best and a parsimonious model should be determined. For instance, a very small number of clusters may result in a large number of samples in a single cluster, and thus, their assigned scores may be incorrectly forced to similar values. Conversely, a very large number of clusters may result in a small number of samples in each cluster, thus resulting in a small number of constraints.
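The clustering constraint can be sketched in the same spirit. The sketch assumes that the threshold of 1 applies to every pair of samples within the same cluster and that the objective is the summed absolute difference to the panelists' scores; the scale, the scores, and the cluster assignment are hypothetical.

```python
from itertools import combinations, product

def within_threshold(x, clusters, threshold=1):
    """True if every pair of samples in the same cluster differs by at most `threshold`."""
    return all(
        abs(x[i] - x[j]) <= threshold
        for members in clusters
        for i, j in combinations(members, 2)
    )

def l1_cost(x, panel):
    """Summed absolute difference between a candidate vector and all panelists' scores."""
    return sum(abs(a - b) for row in panel for a, b in zip(x, row))

# Hypothetical data: 3 panelists, 4 samples; samples 0, 1, 2 share a cluster.
panel = [
    [4, 4, 2, 1],
    [5, 4, 2, 2],
    [4, 5, 3, 1],
]
clusters = [[0, 1, 2], [3]]

candidates = [
    x for x in product(range(1, 6), repeat=4) if within_threshold(x, clusters)
]
best_cost = min(l1_cost(x, panel) for x in candidates)
minimizers = [x for x in candidates if l1_cost(x, panel) == best_cost]
print(minimizers, best_cost)  # [(4, 4, 3, 1)] 5
```

Here the classical median would be (4, 4, 2, 1); the third sample is pulled up by one point because it belongs to the same cluster as two higher-scored samples. A singleton cluster, such as the fourth sample, contributes no constraint, which reflects the remark that many small clusters yield few constraints.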

Incorporating Consensus Rankings
The simultaneous adoption of different sensory evaluation tests, such as scoring tests for determining the quality of food samples and ranking tests for determining the existence of a significant difference between the samples, has attracted the attention of many researchers [29][30][31][32]39]. The additional data produced from the ranking tests could be incorporated to improve the assessment of the quality of the salmon samples. The methods to obtain a consensus ranking of these samples were described by [39], and the results are summarized in Table 8. As the method of [40] was considered for identifying the consensus ranking(s), it might be the case that for some groups, there were multiple consensus rankings (obtained as the minimizers of the Kemeny distance). In such a case, each of these consensus rankings was considered separately as a set of constraints.
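As a rough illustration of why several consensus rankings may arise, the following sketch enumerates all strict rankings of a small group and keeps every minimizer of the summed Kendall tau distance. It assumes rankings without ties and does not reproduce the method of [40]; the sample labels and the panel rankings are hypothetical.

```python
from itertools import combinations, permutations

def kendall_tau(r1, r2):
    """Number of sample pairs that two strict rankings order differently."""
    pos1 = {s: i for i, s in enumerate(r1)}
    pos2 = {s: i for i, s in enumerate(r2)}
    return sum(
        (pos1[a] - pos1[b]) * (pos2[a] - pos2[b]) < 0
        for a, b in combinations(r1, 2)
    )

def kemeny_consensus(rankings):
    """Return all rankings minimizing the summed Kendall tau distance, and that minimum."""
    samples = rankings[0]
    costs = {
        p: sum(kendall_tau(p, r) for r in rankings)
        for p in permutations(samples)
    }
    best = min(costs.values())
    return [p for p, c in costs.items() if c == best], best

# Hypothetical panel of two rankings (freshest first) that disagree on one pair.
rankings = [
    ("w", "x", "y", "z"),
    ("x", "w", "y", "z"),
]
consensus, cost = kemeny_consensus(rankings)
print(consensus, cost)
# [('w', 'x', 'y', 'z'), ('x', 'w', 'y', 'z')] 1
```

Both rankings are minimizers in this example, so each one would be translated separately into a set of constraints. Enumerating all permutations is feasible for groups of four samples but grows factorially with the group size.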
To determine the consensus vectors of scores that should be assigned to the four samples in each group, while incorporating the consensus rankings of the samples, the problem defined by Equation (2) was considered. The median and the constrained median for the samples in each group are summarized in the upcoming Table 9.
From Table 9, it was deduced that, in an ideal situation where a large number of panelists provided rankings of the samples, incorporating the knowledge of the consensus ranking of the samples in Group 4 indicated that sample B5 was similar to sample C6. Therefore, sample B5 should be assigned a lower score, resulting in equal scores assigned to samples B5, C6, and D7. Similar conclusions could be drawn for Groups 1, 2, 3, and 5.
Note that incorporating a consensus ranking should only be considered when the number of panelists providing a ranking was large enough. Otherwise, there existed the potential risk of inaccurately ordering samples and, thus, assigning incorrect scores. In this experiment, the number of panelists providing rankings on the salmon samples was between 23 and 28 depending on the group. This number may be considered to be large enough for obtaining consensus rankings; however, having more panelists would result in more reliable consensus rankings. Therefore, in this study, we did not use the consensus rankings as the only tool to compare these samples.

Comparing the Constrained Medians
To study the influence of incorporating the previously discussed information that invoked different constraints on the median of each sample, we summarize all the medians and the constrained medians for each setting in Table 9.

Table 9. The median and the constrained medians for each sample in every group after incorporating knowledge of storage days, results of a clustering analysis, and consensus rankings.

It could be seen that the medians and the constrained medians for some of the samples were equal. We concluded that the scores assigned to these samples agreed with each type of side information on the samples. However, the medians and the constrained medians for the other samples differed. The additional constraints might provide a better understanding of the score that should be assigned to each sample.
Note that simultaneously incorporating the knowledge of storage days, the results of a clustering analysis, and consensus rankings would result in many constraints. As a result, the constrained median of samples C7, D7, and D8 stayed the same, while the constrained median of all the other samples was either three or four. The reader should bear in mind that adding too many constraints might result in forcing the scores of all the samples to be the same or, even worse in the case of contradictory constraints, rendering the set S in Equation (2) empty.
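Before combining several sources of side information, it may be worth checking how much of the constraint set S is left. The sketch below counts the feasible score vectors for three samples under hypothetical constraints; the five-point scale, the constraint functions, and the strict form of the ranking constraint are illustrative assumptions only.

```python
from itertools import product

SCALE = range(1, 6)  # assumption: a five-point scale

def feasible_set(n_samples, constraints):
    """Enumerate the score vectors that satisfy every constraint (the set S)."""
    return [
        x for x in product(SCALE, repeat=n_samples)
        if all(check(x) for check in constraints)
    ]

def storage(x):
    """Hypothetical storage-day constraint: scores do not increase over time."""
    return x[0] >= x[1] >= x[2]

def cluster(x):
    """Hypothetical cluster constraint: samples 0 and 2 differ by at most 1."""
    return abs(x[0] - x[2]) <= 1

def ranking(x):
    """Hypothetical ranking constraint that contradicts the storage order."""
    return x[2] > x[0]

print(len(feasible_set(3, [storage])))           # 35
print(len(feasible_set(3, [storage, cluster])))  # 13
print(len(feasible_set(3, [storage, ranking])))  # 0: S is empty
```

Each added constraint shrinks S, and the last combination leaves no feasible vector at all, in which case no constrained median exists.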
It is important to note that choosing to incorporate side information depends on the reliability of the scores assigned to the samples. For instance, it is recommended to incorporate side information in settings where the assigned score might not be a good representation of the overall quality of a sample, such as when the number of panelists is very small. Moreover, choosing an optimal source of side information depends on the quality of that information. For instance, it is recommended that the initial conditions of the samples should be similar, the clustering analysis should be performed carefully, and the number of panelists performing ranking tests should be large enough.
In our case, since the measurements of the VOC profiles of our samples were accurate and we believed the chemical analysis of the VOC profile was our most reliable source of information, we restricted our attention to the constraints provided by the clustering analysis. We concluded that samples C6 and D5 should be assigned a higher score than was originally assigned to them by the classical median.

Conclusions
In this paper, we presented a new method for combining the scores provided by panelists for given samples with different types of side information on these samples. The presented method is especially useful in sensory evaluation when the number of panelists providing scores is very small, yet side information on the samples is available or can be easily obtained. It is noteworthy that this method is not limited by the size of the scale or by the number of panelists and samples. Other examples of potential applications for this method include, but are not limited to, decision making problems [43,44], online evaluation [45], and recommender systems (e.g., social matching systems and gift, music, and movie recommenders) [46].
Here, we illustrated the method by means of an experiment concerning the freshness of raw Atlantic salmon. Moreover, we discussed the influence of incorporating the knowledge of storage days, the results of a clustering analysis, and the consensus ranking of the samples on the assignment of the consensus scores to these samples. We provided guidelines for incorporating the different types of side information and pointed out their benefits and potential risks.
We end by noting that, in the field of food science, researchers are not only interested in determining the quality of food samples, but also in understanding and identifying the reasons for the scores assigned to the samples. Thus, the resulting consensus scores can be of use in relating the characteristics of samples to their assigned scores.