Panel and Panelist Performance in the Sensory Evaluation of Black Ripe Olives from Spanish Manzanilla and Hojiblanca Cultivars

There is vast experience in the application of sensory analysis to green Spanish-style olives, but ripe black olives (≈1 × 10⁶ kg for 2016/2017) have received scarce attention, and panelists have less experience in evaluating this presentation. Therefore, the study of their performance during its assessment is critical. Using a previously developed lexicon, ripe olives from the Manzanilla and Hojiblanca cultivars from different origins were analysed by Quantitative Descriptive Analysis (QDA). The panel (eight men and six women) was trained, and the QDA tests were performed following recommendations similar to those for green olives. The data were examined using SensoMineR v.1.07, programmed in R, which provides a diversity of easy-to-interpret graphical outputs. The repeatability and reproducibility of the panel and panelists were good for product characterisation. However, the panel performance investigation was essential in detecting details of panel work (detection of panelists with low discriminant power, those who interpreted the scale differently from the whole panel, and those who required training in several or specific descriptors). Besides, the study identified the descriptors that were hard to evaluate (skin green, vinegar, bitterness, or natural fruity/floral).


Introduction
World table olive production was around 2.6 × 10⁶ tons in the 2016/2017 season, according to the last consolidated balance of the International Olive Oil Council [1]. Approximately 40% of this was processed as black ripe table olives (Californian style). This style was first developed in the USA, which is still one of the most relevant contributors, with a current production of about 80 × 10³ tons [1], but other countries, like Spain, Greece, Turkey, or Egypt, are progressively increasing their production. Black ripe table olive processing includes a storage phase, usually accomplished by immersing the fruits in brine or an acidified solution, followed by a darkening step, which consists of the application of one (or several) lye treatments and subsequent immersion in tap water to remove the excess alkali. During this oxidation phase, air is also bubbled through the suspension to accelerate browning. The colour is then fixed with a ferrous gluconate solution, after which the olives are packed and the cans sterilised [2]. Due to their numerous treatments in aqueous solutions, the products usually offer a rather plain organoleptic profile, which has favoured their introduction into new markets. In fact, according to the Trade Standards Applying to Table Olives [3], the only requisites for these olives are sensory characteristics and texture in agreement with their processing system.

Olives and Their Processing
The olives were of the Manzanilla and Hojiblanca cultivars, harvested at the green maturation stage in October 2016. Their origins were Aljarafe (Sevilla) and Lora de Estepa (Sevilla) for Manzanilla, and Lora de Estepa (Sevilla) and Alameda (Málaga) for Hojiblanca. The samples were identified as MAL, ML, HL, and HA, according to cultivar (initial letter) and growing area (remaining letter/s).
Freshly harvested olives from each cultivar and origin were directly brined in 25 L PVC (polyvinyl chloride) fermenters (15 kg olives) in an acidified (2.4% acetic acid) solution. After three months of storage, the fruits were subjected to the darkening process. For this purpose, horizontal stainless steel cylindrical containers (0.4 m diameter, 0.7 m length) were used. The fruits were treated with a 3% lye solution until the alkali reached the pit. After removing the alkali, the olives were washed to lower the pH to 8.0 units. During both operations, an oxygen-saturated environment was maintained in the suspension by bubbling air through a perforated tube lying along the bottom of the oxidation vessels. Subsequently, the black colour developed was fixed using a 0.1% ferrous gluconate solution with the pH adjusted to 4.5 to prevent the precipitation of the iron as hydroxide. Afterwards, the darkened olives were placed in glass jars (145 g of olives), together with 170 mL of 3.5% NaCl cover solution, which also contained 0.2 g ferrous gluconate/L and had the pH adjusted to 4.5 with acetic acid. Finally, the jars were closed and sterilised at 130 °C for 20 min [23].
The sensory analysis of the black ripe olives prepared as above was carried out after storage at room temperature for 30 days (to allow complete olive flesh/brine equilibrium) and 210 days (estimated maximum normal period of the product on the shelves before replacement). The new codes were those previously mentioned, plus 1 (one-month storage) or 2 (seven-month storage), respectively. A panel composed of eight men and six women, making a total of 14 panelists (average age of 40 years), performed the analysis. They all belonged to the Instituto de la Grasa (IG) staff and had vast experience in sensory studies due to their participation in the development of the Sensory Analysis Method for Table Olives [4] and their permanent involvement in diverse IG table olive sensory projects (e.g., [10,11]). Before the tests, the panelists were trained for one hour twice a week for two months to familiarise them with the QDA techniques and the black ripe olive descriptors, using industrially processed black ripe olives from Spanish cultivars. The samples were always presented in the standard glasses [24], which were coded with three randomly chosen digits. After each test, the mouth was rinsed with tap water, freely available in each booth. In this way, the panelists were progressively familiarised with the product, the sensory descriptors included in the evaluation sheet, and informal tentative evaluations and, finally, were allowed to practise with the unstructured scale (1, complete absence; 11, strongest perception) of the evaluation sheet for another month. After these periods, they were considered ready for the evaluation of the real samples because of their previous expertise in sensory testing.
The assessed descriptors included appearance (skin red, skin green, skin sheen, flesh red, flesh yellow, and flesh green), aroma (briny, mushroom, earth/soil, oak/barrel, nutty, artificial fruity/floral, natural fruity/floral, vinegary, alcohol, fishy smell/ocean, and cheese smell), taste (sourness, bitterness, and saltiness), flavor (ripeness, buttery, metallic, rancid, soapy smell/medicinal, and gassy smell), and texture/mouthfeel (firmness, fibrousness, moisture release, mouth coating, chewiness, astringency, and residual). Their definitions and references may be found elsewhere [10].
For the tests, the black ripe olive samples were presented to the panelists at ambient temperature (20 ± 1 °C) in a panel room equipped with individual booths under incandescent white lighting and free from any odors. The panelists were asked to mark the intensity of the different descriptors on the evaluation sheets. The scores of the attributes were measured to a precision of one decimal place and the results tabulated.

Data Analysis
The data were mainly studied using the SensoMineR v.1.07 software (Agrocampus Ouest, Rennes, France) [25], a package designed and programmed in the R language [26]. It combines classical sensory statistical methods with others conceived directly in the developers' laboratory. In this way, SensoMineR provides a synthesis of the results of the usual analysis of variance (ANOVA) models, as well as a diversity of easy-to-interpret graphical outputs. Notably, the package includes several options for panel evaluation, such as multivariate analysis and the generation of virtual panels by bootstrapping techniques, which allow for the estimation of the corresponding confidence limits. XLSTAT [27] was also applied for specific analyses and tests.
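As an illustration of the virtual-panel idea mentioned above, the following minimal Python sketch resamples panelists with replacement and derives a percentile interval for a panel mean. It is not the SensoMineR implementation; the function name `bootstrap_mean_ci`, its parameters, and the scores are hypothetical and only serve to show the principle.

```python
import random
import statistics


def bootstrap_mean_ci(panelist_scores, n_virtual_panels=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the panel mean of one sample/descriptor.

    panelist_scores: one (session-averaged) score per panelist.
    """
    rng = random.Random(seed)
    k = len(panelist_scores)
    means = []
    for _ in range(n_virtual_panels):
        # A virtual panel: panelists drawn with replacement from the real panel.
        virtual_panel = [rng.choice(panelist_scores) for _ in range(k)]
        means.append(statistics.fmean(virtual_panel))
    means.sort()
    lower = means[int((alpha / 2) * n_virtual_panels)]
    upper = means[int((1 - alpha / 2) * n_virtual_panels) - 1]
    return lower, upper


# Hypothetical scores of 14 panelists on the 1-11 scale (not data from the study).
scores = [6.2, 5.8, 7.1, 6.5, 6.0, 5.5, 6.9, 7.4, 6.1, 5.9, 6.6, 8.0, 6.3, 6.4]
low, high = bootstrap_mean_ci(scores)
print(round(low, 2), round(high, 2))
```

A narrow interval suggests that the panel mean would change little if a different set of comparable panelists had been recruited.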

Results and Discussion
The data matrix consisted of the following variables: sample-storage period (simply "sample" from now on), panelist, session, and the 33 descriptors, making a total of 36 columns. Additionally, sample, panelist, and session had 8, 14, and 3 levels, respectively, making a total of 336 rows. Therefore, the overall number of cells was 12,096. The generated database was already used for product characterization [10], but, in this work, the analysis is focused on the panel and panelist performance as an exercise for improving their evaluation and training.
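The dimensions quoted above follow directly from the design. A minimal check in Python, assuming a fully balanced design with no missing evaluations (the variable names are illustrative only):

```python
# Design levels taken from the text.
n_samples = 8        # 4 cultivar/origin combinations x 2 storage periods
n_panelists = 14
n_sessions = 3
n_descriptors = 33

# One row per sample x panelist x session evaluation.
n_rows = n_samples * n_panelists * n_sessions
# Three design columns (sample, panelist, session) plus one column per descriptor.
n_columns = 3 + n_descriptors
n_cells = n_rows * n_columns

print(n_rows, n_columns, n_cells)  # 336 36 12096
```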

Overview of Results
After checking the dataset for possible outliers and typing errors, the data were subjected to a first overview (frequency histograms and boxplots), which indicated that several descriptors received low scores and were hardly noticed; others, however, were perceived by the panelists, were distributed along the scale, and allowed for discrimination among samples (data not shown). Further details can be found elsewhere [10].

Panel Performance
Numerous techniques are available for evaluating panel and panelist performance, with ANOVA and multivariate analysis being the most common. Kermit and Lengard Almli [16] presented univariate and multivariate data analysis methods to assess the individual and group performances in a sensory panel. Notably, Husson et al. [25] developed SensoMineR, which includes several innovative tools with this objective.

Effect of Sample (Power of Discrimination)
The evaluation of the panel performance is an essential premise not only for obtaining reliable results in sensory analysis, but also for improving the selection of panelists and their training. In this work, the panelperf instruction from SensoMineR, with the appropriate models and the corresponding analysis of variance, was used. The ANOVA was fitted to the following full model: Score = sample + panelist + session + sample·panelist + sample·session + panelist·session, where score stands for the expected evaluation value, and sample, panelist, and session are the predictive variables, with the effect of storage being included as levels of the variable sample. The panelist and the session were both studied as random effects, but the sample was considered to be fixed [28].
The results regarding performance (Table 1) showed that the panel was able to discriminate the samples based on skin green, flesh green, skin sheen, flesh red, firmness, fibrousness, flesh yellow, skin red, vinegary, moisture release, fishy smell/ocean, and saltiness. Good segregation among the samples or products by panelists is systematically reported in numerous publications ([6,17,28–30], among others).

Effect of Panelist
The significant effect of the panelist, with very low p-values regardless of the descriptor, indicates a different interpretation of the scales. Such an effect is not desirable, but it is usually observed. However, its presence does not prevent reaching appropriate conclusions, since the panelists' variance can be accounted for by the ANOVA and by centring the data with respect to panelists [31]. The assessors' performance will be studied in detail later.
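Centring with respect to panelists can be sketched in a few lines of Python. This is a minimal illustration, assuming a single descriptor stored as (panelist, sample, score) records; the function name `centre_by_panelist` and the scores are hypothetical.

```python
import statistics
from collections import defaultdict


def centre_by_panelist(records):
    """Subtract each panelist's own mean from his or her scores.

    records: list of (panelist, sample, score) tuples for one descriptor.
    """
    by_panelist = defaultdict(list)
    for panelist, _sample, score in records:
        by_panelist[panelist].append(score)
    panelist_mean = {p: statistics.fmean(v) for p, v in by_panelist.items()}
    return [(p, s, score - panelist_mean[p]) for p, s, score in records]


# Hypothetical scores: A2 uses a higher part of the scale than A1.
records = [
    ("A1", "HA1", 3.0), ("A1", "HL1", 5.0), ("A1", "ML1", 4.0),
    ("A2", "HA1", 6.0), ("A2", "HL1", 8.0), ("A2", "ML1", 7.0),
]
centred = centre_by_panelist(records)
for row in centred:
    print(row)
```

After centring, both hypothetical panelists show the same pattern across samples, so the shift in scale level no longer masks their agreement.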

Effect of Session
The effect of the session was not significant for any descriptor (Table 1), which indicates an overall good panelist performance over time (the samples were assessed in the same way from one session to another); this is an appropriate and desired situation. Consequently, no further comments on this aspect are required.

Sample·Panelist Interaction
In the case of a total consensus among the members of the panel when assessing the descriptors in all samples, this interaction should not be significant. However, in this work, there were numerous significant cases (Table 1). The interaction is usually evaluated through the coefficients of the ANOVA, defined as the difference between the expected mean score by all panelists and that given by a specific one. It would be tedious to reproduce them for all descriptors, so only skin red and flesh red are shown as examples (Figure 1). The effect might be significant because of two circumstances: (i) the panelists do not rank the samples in the same order and (ii) they do not use the scale in the same way. Both situations were found in this work. As an example of different ranks, for skin red, panelist 1 gave the highest score to HA1, but panelist 2 ranked it second from the bottom; a similar behaviour occurred for flesh red with panelist 5 with respect to panelist 6 (Figure 1). On the other hand, for skin red, panelist 1 used a narrower scale than panelist 6; the same trend can be observed for flesh red with panelist 1 and panelist 12 (Figure 1). Therefore, to improve panel performance, further training in the scoring of some attributes and in the amplitude of their scales will be required.
The corresponding coefficients of each panelist in the ANOVA model were assessed to identify the panelists who mainly contributed to the interaction [19]. With this aim, the difference between the expected score and that given by a specific panelist, over all sessions and samples, represents how far that panelist scores the samples from the product mean of the whole panel. No significant differences were usually observed (panelists had, in general, good reproducibility), but some peculiarities were noticed. For example, panelist A12 scored skin green (Figure 2A) noticeably higher than any other panelist; consequently, he was critical to the significance of this interaction. Additionally, panelist A3 tended to score skin red, skin sheen, and flesh red above the panel average (Figure 2A).
Another way of observing the sample·panelist interaction and measuring the panelists' reproducibility is by plotting the mean per panelist against the mean of the whole panel, according to samples. In agreement with previous comments, some panelists gave high scores to several descriptors and, in this line, panelist A12 overscored skin green in samples HL2, HA2, MAL2, and ML2 (Figure 2B). These high scores were due to a tendency of this panelist to evaluate several descriptors (flesh yellow and briny, data not shown) higher than other panel members. Similarly, outstanding scores were observed for panelist A5 in vinegary, alcohol, and sourness, and for panelist A8 in mouth coating, chewiness, astringency, and residual (data not shown). However, most of the panelists scored only one descriptor differently, like A4 in grassy smell, A10 in cheesy smell, A3 in buttery, or A6 in rancid, to mention a few cases. Therefore, no panelist systematically contributed to the interaction, but the above-mentioned results indicate that the panel performance could be improved by further training of some panel members (A12, A5, and A8 on several descriptors, or A4, A10, A3, and A6 only on specific ones). Kermit and Lengard Almli [19] also found several assessors who showed poor performance in some attributes, such as mealiness or fruity flavor.
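The sample·panelist interaction can be made concrete with the usual balanced-design estimate of an interaction coefficient (cell mean minus sample mean minus panelist mean plus grand mean). The Python sketch below is a simplified illustration, assuming one mean score per sample and panelist; the function name `interaction_coefficients` and the scores are hypothetical, and the output is not meant to reproduce that of SensoMineR.

```python
import statistics


def interaction_coefficients(cell_means):
    """Balanced-design sample x panelist interaction estimates.

    cell_means[sample][panelist] is the mean score over sessions.
    Returns coef[sample][panelist] =
        cell mean - sample mean - panelist mean + grand mean.
    """
    samples = list(cell_means)
    panelists = list(cell_means[samples[0]])
    grand = statistics.fmean(cell_means[s][p] for s in samples for p in panelists)
    sample_mean = {
        s: statistics.fmean(cell_means[s][p] for p in panelists) for s in samples
    }
    panelist_mean = {
        p: statistics.fmean(cell_means[s][p] for s in samples) for p in panelists
    }
    return {
        s: {
            p: cell_means[s][p] - sample_mean[s] - panelist_mean[p] + grand
            for p in panelists
        }
        for s in samples
    }


# Hypothetical means: A1 ranks HA1 above HL1, whereas A2 ranks them the other way.
cell_means = {
    "HA1": {"A1": 6.0, "A2": 3.0},
    "HL1": {"A1": 4.0, "A2": 5.0},
}
coef = interaction_coefficients(cell_means)
print(coef)
```

Coefficients of opposite sign for the same sample, as in this toy case, correspond to panelists who rank the samples in a different order; coefficients close to zero indicate agreement with the panel.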

Sample·Session Interaction
These interactions refer to the variation of the mean of each sample from one session to another, and they should not be confused with the session effect, which applies to the mean of all samples between sessions. In the study, the sample·session interaction was only significant in two cases: saltiness (which was an important descriptor for sample discrimination) and metallic (Table 1). In saltiness, the significant interaction was mainly produced by the different scoring of samples HA2, HL1, HA1, MAL1, and MAL2 in session S1 (Figure 3), while, in the case of metallic, the significant interaction was due to the abnormally high score of MAL1 in session S1 (Figure 3).

Panelist·Session Interaction
If significant, it means that one or more panelists do not grade the products similarly from one session to another. There were several significant panelist·session interactions. Among the descriptors that contributed to discrimination, mushroom, oak/barrel, cheesy smell, sourness, chewiness, bitterness, and saltiness had significant interactions (Table 1). The contribution of panelists to this interaction might also be evaluated by their respective coefficients, estimated as commented above. Figure 4 shows examples. Among the panelists who most contributed to the differences in scores between sessions was A13 for skin red, flesh red, and flesh green. Regarding other descriptors, A12 actively contributed to vinegar and A5 to natural fruity/floral, alcohol, and earth/soil (data not shown). However, most of the panelists had homogeneous contributions in most of the descriptors (skin green, skin sheen, flesh yellow, or briny, Figure 4). Moreover, no panelist showed a systematic trend for all descriptors, with only a few exceptions, like A12 for skin sheen and flesh red or A7 for mushroom (Figure 4). Consequently, the interaction was mainly due to the contribution of a reduced number of panelists (frequently only one), with limited influence on the panel repeatability.
The panelist·session interaction might also be presented as a plot of the mean per session against the mean over all sessions, according to panelists (Figure 5). Ideally, they should follow a line, regardless of sessions. In general, the panelists followed a similar trend over sessions (Figure 5 for some descriptors) with only occasional exceptions, like panelist A6 for rancid. Other cases were related to panelists A4, A12, and A8 for bitterness due to the abnormally low scores given by them (data not shown).
Finally, the plot of the different coefficients over sessions is the most common evaluation of the panelist·session interaction (Figure 6, for flesh red as an example). In this case, the problems that could be observed are, again, different ranking in successive sessions or different amplitude of scale over sessions. In Figure 6, panelist A13 assigned an excessively high score in the first session, while in the second session the score was low. Additionally, the amplitude of the scale for this descriptor was wider in the first session than in the second. In saltiness, the situation was different: A12 had a very low contribution (coefficient), but the scale amplitude was similar among sessions; in firmness and fibrousness, panelist A13 was the only one who gave an excessively high score and, consequently, made a high contribution to the interaction, while, on the contrary, contributing little to saltiness. Therefore, the detailed analysis of this interaction allowed for detecting some weaknesses in panel performance and a lack of coherence in some panelists. Personalized training would then be advisable.

Panelist Performance
When a panelist can discriminate among samples and is repeatable and reproducible (that is, scores the same product consistently and agrees with the rest of the panel), he or she is considered to be reliable according to Rossi [18]. There are several techniques for evaluating these panelist performance parameters. Tomic et al. [20] developed a series of graphs for easy visualisation of the sensory profiling data for performance. Kermit and Lengard Almli [19] mentioned consonance analysis with PCA, full ANOVA model and notation, assessor sensitivity, assessor reproducibility, or agreement test as appropriate to evaluate the assessor and panel performance. Lanza and Amoruso [17] mentioned the repeatability index (RIt) and deviation index (DIt) to evaluate how assessors perform against themselves over time and their performance with respect to the whole panel, respectively. In this work, the diverse tools proposed by Husson et al. [31] for studying the panelist work are followed in particular.

Discrimination Power of Each Panelist
The individual efficiency of panelists was evaluated with the model: score = sample + session. The p-values (Table 2) associated with the F-test of the sample effect for each panelist are, then, the appropriate parameter to measure this discrimination power. Their values, with rows and columns sorted by the median estimated over them (Table 2), showed that most of the panelists were able to discriminate the black ripe table olive samples based on several of the descriptors that were developed by Lee et al. [9] and used later by López-López et al. [10]. Their efficiencies, in decreasing order, were: A14, A4, A2, A3, A6, A5, A8, A1, A12, A13, and A7, while only A11, A10, and A9 did not have any discriminant power (Table 2). Skin green was the only descriptor that received an overall significant median; however, mouth coating, flesh red, briny, flesh green, or skin red were among the attributes most differently perceived in the samples (Table 2). On the contrary, soapy smell/medicinal, fishy smell, cheesy smell, alcohol, or metallic were among the most similarly perceived; however, this does not necessarily mean that the panelists were not able to differentiate samples, but that these attributes were present in very low intensity or even completely absent (Table 2). There is controversy about the p-value that could be used as a cut-off level to consider a panelist acceptable. Stone et al. [32] proposed p ≤ 0.5, but the problem was that there were so many p-values below 0.5 when evaluating tea that almost any laboratory would retain them. Powers [33] pointed out that the real question was establishing the number of attributes with significant performance necessary for a judge to be an acceptable assessor. However, no agreement on this aspect was achieved.
In this work, in general, the panelists were not systematically excellent in all descriptors, but most of them were good at some descriptors (significant p-value), and their overall performance was reasonable; however, according to these results, panelists A11, A10, and A9 should be candidates for further training or even removal from the panel if their performance does not sufficiently improve. Kermit and Lengard Almli [19] also identified an assessor with further need for training in the attributes pea flavor, sweetness, fruity, and off flavor.
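The per-panelist model score = sample + session can be illustrated with a short Python sketch that computes the F statistic of the sample effect for one panelist and one descriptor. It assumes a balanced layout with one score per sample and session; the function name `sample_f_statistic` and the scores are hypothetical. The p-value reported in tables of this kind would then be read from the F distribution with the returned degrees of freedom.

```python
import statistics


def sample_f_statistic(scores):
    """F statistic of the sample effect in the model score = sample + session.

    scores[sample] is a list with one score per session (balanced, no gaps).
    Returns (F, df_sample, df_residual).
    """
    samples = list(scores)
    a = len(samples)                 # number of samples
    b = len(scores[samples[0]])      # number of sessions
    grand = statistics.fmean(x for s in samples for x in scores[s])
    sample_mean = {s: statistics.fmean(scores[s]) for s in samples}
    session_mean = [
        statistics.fmean(scores[s][k] for s in samples) for k in range(b)
    ]
    ss_sample = b * sum((sample_mean[s] - grand) ** 2 for s in samples)
    # Residual: what is left after removing the sample and session main effects.
    ss_resid = sum(
        (scores[s][k] - sample_mean[s] - session_mean[k] + grand) ** 2
        for s in samples
        for k in range(b)
    )
    df_sample, df_resid = a - 1, (a - 1) * (b - 1)
    f_value = (ss_sample / df_sample) / (ss_resid / df_resid)
    return f_value, df_sample, df_resid


# Hypothetical scores of one panelist: three samples, three sessions.
scores = {
    "HA1": [7.0, 7.5, 6.5],
    "HL1": [4.0, 4.5, 3.5],
    "ML1": [5.0, 5.0, 5.0],
}
f_value, df_sample, df_resid = sample_f_statistic(scores)
print(round(f_value, 2), df_sample, df_resid)
```

A large F value (small p-value) means that the panelist separates the samples well relative to his or her own session-to-session noise.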

Panelist Repeatability
The panelists' repeatability is the ability to consistently score the same product for a given attribute [18] and was evaluated by the standard deviation (SD) of the measurements of a descriptor from each panelist on each sample. It was considered that, when the residual of the ANOVA model for each panelist and descriptor (Table 3) was ≤ 1.96 (p ≤ 0.95), the panelist scored the samples in a narrow range through the successive sessions, and only panelists with residuals above this limit scored differently between sessions. In this work, there were no panelists who systematically graded the descriptors differently from one session to another (SD ≥ 1.96, in bold); however, several of them showed residuals above the limit for one or several descriptors, although not by a large margin. Therefore, in general, the panelists showed acceptable repeatability.
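A simplified view of this repeatability measure is the pooled within-sample standard deviation of one panelist's scores across sessions. The Python sketch below assumes the same number of sessions for every sample and uses hypothetical scores; the function name `repeatability_sd` is illustrative, and the value is only an approximation to the residual of the ANOVA model described above.

```python
import math
import statistics


def repeatability_sd(scores):
    """Pooled within-sample standard deviation over sessions for one panelist.

    scores[sample] is the list of scores that the panelist gave across sessions.
    With equal numbers of sessions, the pooled variance is the mean variance.
    """
    variances = [statistics.variance(v) for v in scores.values()]
    return math.sqrt(statistics.fmean(variances))


# Hypothetical scores of one panelist for one descriptor over three sessions.
scores_one_panelist = {
    "HA1": [6.0, 7.0, 8.0],   # some spread between sessions
    "HL1": [4.0, 4.0, 4.0],   # perfectly repeatable
}
sd = repeatability_sd(scores_one_panelist)
print(round(sd, 3), sd <= 1.96)
```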

Panelist Reproducibility
The panelist agreement with the panel, associated with reproducibility [18], was assessed by the correlation between the panelists' scores and the adjusted means of the panel (estimated by the ANOVA model) according to descriptors.
The procedure is similar to that used by Nyambaka et al. [30] to study the sensory changes in dehydrated cowpea leaves. The data are presented in a table, in which both panelists (in columns) and descriptors (in rows) are sorted from the highest to the lowest marginal median (Table 4). The panelists in agreement with the panel (significant correlation, in black) were, in descending order of their medians, A6, A8, A14, A5, A1, A7, A13, A10, A9, A3, A2, A4, A12, and A11, while the negative correlations (in black and italic) were distributed more or less evenly, indicating opposed agreement with the panel (divergent behaviour). The inconsistency of some panelists when evaluating cowpea leaves was attributed to particular preferences of assessors [30] and could also be possible in table olives for some attributes, like firmness or fibrousness.
Overall, the descriptors that had the best agreement between panelists and panel, sorted by the median, were (in decreasing order of relationship) skin green, skin sheen, flesh red, firmness, flesh green, fibrousness, flesh yellow, and moisture release (Table 4). They were also among the descriptors with the most discriminant power. On the contrary, those with more discrepancies among the panelists were residual, artificial fruity/floral, metallic, rancid, sourness, or soapy smell/medicinal (Table 4), all of them with no discriminant influence.
These results show that the overall behaviour of the panelists was reasonable, although there was still room for improvement in their performance, particularly regarding those panelists with strongly opposed correlation to the mean of the panel. Alternatively, they could be candidates for further rejection.
Lanza and Amoruso [17] used line plots by attribute and the deviation index (DIt) to evaluate the agreement between the panelists and the whole panel. Their results are in line with those described above, since they also found some panelists who clearly deviated from the consensus. According to these authors, this type of result helps the panel leader to identify repeatability problems of specific assessors as compared to the whole panel and to correct the deviation by the corresponding training.
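The agreement measure used in this section can be sketched as the Pearson correlation, for one descriptor, between a panelist's product scores and the panel means. The sketch below is a simplification: it uses plain arithmetic means rather than the ANOVA-adjusted means, and the panelist labels (P1 to P3) and scores are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical scores of three panelists on four products, one descriptor.
scores = {
    "P1": [2.0, 4.0, 6.0, 8.0],
    "P2": [3.0, 4.0, 7.0, 9.0],
    "P3": [8.0, 6.0, 5.0, 2.0],  # ranks the products in the opposite order
}
n_products = 4

# Plain panel means per product (assumption: not ANOVA-adjusted).
panel_means = [
    sum(s[k] for s in scores.values()) / len(scores) for k in range(n_products)
]

# Positive values indicate agreement with the panel, negative values divergence.
agreement = {p: pearson(s, panel_means) for p, s in scores.items()}
```

With these invented scores, P1 and P2 correlate positively with the panel, whereas P3 shows the kind of negative correlation that would flag divergent behaviour.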

Clustering
A first multivariate approach to the similarity among panelists was achieved by hierarchical clustering analysis based on the scores given to the sample descriptors by each of them. The study was performed in XLSTAT, using Ward's aggregation criterion [28]. Three groups of panelists were formed when comparing the panelists' behaviour (Figure 7A). The greatest dissimilarity was found between the group formed by A4 and A6 and the other panelists. The dissimilarity within the groups of other panelists was appreciably lower, leading to three groups. Two of them were composed of four and seven panelists, while the third only included panelist A8, who had a peculiar behaviour. Therefore, in this case, the cluster analysis, which considers the overall panelist performance, showed that the panelists followed a somewhat similar trend when evaluating the black ripe olive samples, but did not reveal their peculiarities. In line with this result, hierarchical classification is more usually applied to the classification of products or to the study of the association among descriptors. Francois et al. [28] used this technique for assessing the astringency of different beers, while Pense-Lheritier et al. [29] applied it to link the sensory changes induced by the addition of drugs to different beverages. Alasalvar et al. [6] found similarity among the flavor of natural and roasted Turkish hazelnut cultivars. Clustering was also used to segregate different consumer segments according to their overall liking scores [34].
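For readers unfamiliar with Ward's criterion, the sketch below shows a naive agglomerative clustering in which, at each step, the two groups whose merger causes the smallest increase in within-group sum of squares are joined. It is not the XLSTAT implementation; the panelist labels, the two-value score profiles, and the function names are hypothetical.

```python
def centroid(points):
    """Mean vector of a list of equal-length score vectors."""
    n = len(points)
    return [sum(p[j] for p in points) / n for j in range(len(points[0]))]

def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b merge."""
    ca, cb = centroid(a), centroid(b)
    d2 = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * d2

def ward_clusters(profiles, n_clusters):
    """Merge panelists step by step until n_clusters groups remain.

    profiles maps a panelist label to that panelist's vector of scores.
    """
    clusters = [([name], [vec]) for name, vec in profiles.items()]
    while len(clusters) > n_clusters:
        best = None
        # Exhaustive search for the cheapest merge (fine for small panels).
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                cost = ward_cost(clusters[i][1], clusters[j][1])
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        merged = (clusters[i][0] + clusters[j][0],
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return sorted(sorted(names) for names, _ in clusters)

# Hypothetical score profiles (two descriptor scores per panelist).
profiles = {
    "P1": [1.0, 1.0], "P2": [1.2, 0.9],
    "P3": [5.0, 5.1], "P4": [5.2, 4.9],
    "P5": [9.0, 1.0],
}
groups = ward_clusters(profiles, 3)
```

In this toy example the isolated profile P5 ends up alone in its group, analogous to a panelist with peculiar behaviour.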

Panelist Reproducibility
The multivariate study of the agreement among the panelists and the whole panel [18] was made in SensoMineR using bootstrapping, by considering the results of a virtual panel obtained by taking successive samples (500 simulations) from the real data and applying Principal Component Analysis. Only two eigenvalues ≥ 1 were found, and they accounted for approximately 42% and 26% of the variance, respectively. The analysis was made using the function panellipse.session. The resampling technique has been described in detail elsewhere [31].
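The resampling step can be illustrated with a minimal sketch. It assumes that each virtual panel is built by drawing panelists with replacement from the real panel, which is our reading of the procedure in [31] rather than a reproduction of the SensoMineR code; the function name, the seed, and the scores are hypothetical, and the subsequent PCA projection is omitted.

```python
import random

def virtual_panel_means(scores, n_sim=500, seed=0):
    """Product means for n_sim virtual panels.

    scores maps a panelist label to a list of product scores for one
    descriptor. Each virtual panel has the same size as the real one and
    is drawn with replacement (assumption about the resampling unit).
    """
    rng = random.Random(seed)
    panelists = list(scores)
    n_products = len(scores[panelists[0]])
    sims = []
    for _ in range(n_sim):
        drawn = [rng.choice(panelists) for _ in panelists]
        sims.append([
            sum(scores[p][k] for p in drawn) / len(drawn)
            for k in range(n_products)
        ])
    return sims

# Hypothetical scores of four panelists on three products.
scores = {
    "P1": [2.0, 5.0, 7.0],
    "P2": [3.0, 5.5, 6.0],
    "P3": [2.5, 4.0, 8.0],
    "P4": [1.5, 6.0, 7.5],
}
sims = virtual_panel_means(scores)
```

The spread of the simulated means around the real panel mean is what the confidence ellipses later summarise in the PCA plane.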
The closeness of the whole panel and the panelists' answers was evaluated by projecting them onto the first two PCs. A PCA on the consensus allows for visualizing the strength of the consensus and the global discrimination of the products; besides, treatment identification shows the observed differences between the products [35]. In this work, the agreement between the whole panel (square symbols, with a different colour for each sample) and the panelists (acronyms linked to the samples by circle symbols of the same colour) was assessed by the distance from each panelist to the position of the corresponding sample (Figure 7B). PC1 was highly efficient for segregating samples from Manzanilla (on the left) and Hojiblanca (on the right) and could be associated with cultivar, while PC2 was able to distinguish samples as a function of growing area and storage. In general, the projections of the panelists for each sample were situated around that of the whole panel (sample associated with the same colour), although some of them were far from their respective samples. The discrepant panelists were (as identified by the corresponding acronyms) the same already mentioned in previous sections, mainly: A12 and A8 for HL2; A8 for HA2; A13, A12, A8, and A6 for HA1; A12, A7, A9, A6, A3, and A2 for MAL2; A12, A7, A6, and A2 for ML2; A13, A11, A9, A8, A7, A5, and A1 for MAL1; and A12, A8, A7, A6, and A2 for ML1. The panelist who most often scored the samples differently was A12, followed by A8, A7, and A6. Lower discrepancies were observed for A2, A9, A13, A3, and A5. However, they represent just a few cases of divergence, while most of the panelists' scores are jointly distributed around their corresponding samples. Additionally, panelists had greater ability (closeness to the sample average) to evaluate long stored Hojiblanca samples (HL2 and HA2) than any other sample.
In conclusion, this plot has identified the panelists who will require particular training, although the performance of the others will also benefit from training. Our results are in agreement with those presented by Tomic et al. [21], who also found underperforming panelists and emphasized the need for a detailed study of their behavior using the established statistical methods for the evaluation. Lanza and Amoruso [17] studied the performance of panelists against the whole panel using eggshell plots, concluding that a few panelists ranked some of the descriptors quite differently from the consensus, while there was good agreement for others, like hardness.

Study by Variables Projection on the Correlation Circle According to Sessions
The analysis was carried out using the virtual panel described above [31]. A first approach to the panel repeatability was obtained by projecting the descriptors (only the more relevant ones, contribution > 0.20) onto the first two PCs according to sessions. Close positions of a descriptor in the correlation circle for the different sessions indicate good repeatability. The panel was particularly repeatable among sessions for some descriptors, like skin green, astringency, flesh green, moisture release, fibrousness, flesh red, skin sheen, or flesh yellow. However, others showed appreciable distances from one session to another, like fishy smell/ocean, saltiness, or chewiness (Figure 8A). The interpretation of the relationships among variables is not straightforward due to these oscillations in the variables' projections. Nevertheless, it is possible to establish overall associations, mainly for those variables with high repeatability among sessions. For example, firmness, fibrousness, or chewiness are opposed to moisture release, ripeness, or flesh green. Additionally, those black ripe olives with high astringency could also present flesh yellow or skin green notes, but low vinegar or ripeness scores.
Galán Soldevilla et al. [14] associated bitter, sour, and wood with Green, Cured, and Traditional Aloreña de Málaga table olives, respectively. In black ripe olives, discrimination among the samples from different origins was mainly based on the 2nd and 3rd PCs, which were the components linked to aroma and flavour characteristics; however, the more linear behaviour of panelists was related to a textural dimension that was strongly connected to PC1 [9]. Kinesthetic sensations were also critical for the segregation between defective and non-defective samples by PCA [12].

Study by Sample Projections According to Sessions
The analysis was also carried out using the virtual panel described above. In this case, the median scores of the virtual panel perception of the samples (the same as those of the real panel) were projected onto the plane of the first two PCs according to sessions. Subsequently, the closest 95% of the points in the generated cloud were used to draw the confidence ellipses (p-value = 0.05), which were built according to the procedure described by Husson et al. [31] (Figure 8B). The repeatability of the panel across sessions can be assessed by the displacement of the sample centres. In general, the separation between the sample centres due to session was limited, indicating good panel agreement between sessions, which is also corroborated by the overlapping of their confidence ellipses. Incidentally, the plot also indicates that the long stored fruits showed lower dispersion by sessions than the just processed fruits (one-month storage).
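One possible reading of "the closest 95% of the points" is sketched below: the points of the simulated cloud are ranked by their Mahalanobis distance to the cloud centre, and the 5% most distant ones are discarded before the ellipse is drawn. The use of the Mahalanobis distance is an assumption made for illustration and is not necessarily the exact rule in [31]; the function name and the simulated coordinates are hypothetical.

```python
import random

def closest_fraction(points, fraction=0.95):
    """Keep the given fraction of 2-D points closest to the cloud centre.

    Closeness is measured by the squared Mahalanobis distance (assumption).
    """
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    # Sample covariance matrix of the cloud.
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2

    def d2(p):
        dx, dy = p[0] - mx, p[1] - my
        # Quadratic form with the inverse of the 2x2 covariance matrix.
        return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det

    keep = int(round(fraction * n))
    return sorted(points, key=d2)[:keep]

# Hypothetical PC1/PC2 coordinates of one sample over 200 virtual panels,
# plus one clearly aberrant point.
rng = random.Random(1)
cloud = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 0.5)) for _ in range(200)]
cloud.append((100.0, 100.0))
kept = closest_fraction(cloud)
```

The retained points would then define the region enclosed by the ellipse for that sample and session.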

Conclusions
Usually, the study of the panel performance is a preliminary, but superficial, task during the sensory evaluation of products. However, a detailed investigation of the panel and panelist performance is a convenient tool to uncover the details of their evaluation. In this work, such a study allowed the assessment of the panel performance as a whole, the detection of the panelists with the lowest discriminant power and of those who interpreted the scale differently from the panel and, therefore, require further training, and even the discovery that the stored black ripe olive products are perceived more similarly by the panelists over sessions. Besides, the study identified the descriptors that were hard to evaluate (skin green, vinegar, bitterness, or natural fruity/floral). Therefore, panelists would require particular training on them or, if they do not reach the appropriate level of discrimination, should be replaced by others with higher sensitivity. In summary, the work has confirmed that such studies are an essential tool for appropriate panel control and training, which should be a permanent concern of the panel leader.

Funding: This research was funded in part by the Ministry of Economy and Competitiveness from the Spanish government through Project AGL2014-54048-R, partially financed by the European Regional Development Fund (ERDF).