Application of QBA to Assess the Emotional State of Horses during the Loading Phase of Transport

Simple Summary
The traditional approach to animal welfare has been primarily based on assessing negative factors. However, over the past few years, due to changes in societal beliefs and values and improved scientific understanding of animals, researchers are being asked to develop tools to identify positive animal-based measures and, more generally, to evaluate animals’ emotional states. Transportation, particularly the loading phase, is a crucial welfare concern in horses, and the development, enhancement, and implementation of positive animal-based indicators during transport are of critical importance. This study aimed to investigate the accuracy of Qualitative Behaviour Assessment (QBA), using both free choice profiling and fixed list methodologies, in horses during the loading phase of transport. A total of 13 stakeholders were asked to assess different sets of videos of horses being loaded for road transport using both their own descriptors and a list of descriptors. Our results showed that first giving observers the possibility to use their own descriptors is useful for developing and implementing a fixed list with terms that fit the specific conditions to be evaluated. This would allow the observers to better understand and quantify each given descriptor and therefore obtain a good QBA accuracy level. Furthermore, a specific list of descriptors was developed for use in pre-transport loading situations.

Abstract
To identify feasible indicators to evaluate animals’ emotional states as a parameter to assess animal welfare, the present study aimed to investigate the accuracy of the free choice profiling (FCP) and fixed list (FL) approaches of Qualitative Behaviour Assessment (QBA) in horses during the loading phase of transport. A total of 13 stakeholders were trained to score 2 different sets of videos of mixed breed horses loaded for road transport, using both FCP and FL, in 2 sessions.
The Generalized Procrustes Analysis (GPA) consensus profile explained a higher percentage of variation (80.8%) than the mean of 1000 randomized profiles (41.2 ± 1.6%; p = 0.001) for the FCP method, showing an excellent inter-observer agreement. GPA identified two main factors, explaining 65.1% and 3.7% of the total variation. Factor 1, ranging from ‘anxious/tense’ to ‘calm/relaxed’, described the valence of the horses’ emotional states. Factor 2, ranging from ‘bright’ to ‘assessing/withdrawn’, described the arousal. As for FL, the first and second components of the Principal Component Analysis (PCA) (PC1 and PC2, respectively), explaining on average 59.8% and 12.6% of the data variability, had significant agreement between observers. PC1 ranges from relaxed/confident to anxious/frightened, while PC2 ranges from alert/inquisitive to calm. Our study highlighted the need to use descriptors specifically selected, through a prior FCP process, for the situation to be evaluated in order to obtain a good QBA accuracy level.

not practical [48]. When a more standardized assessment is required, the use of a fixed list is more feasible than FCP [28,46]. Advantages of using a fixed list for QBA evaluation in on-farm assessment include easier application and a straightforward interpretation of results by experienced assessors and experts [34]. Fixed QBA lists are included in several on-farm welfare assessment protocols, as a measure for both positive and negative emotional states [50–55]. A fixed list of QBA descriptors has also been developed and included in the AWIN welfare assessment protocol for horses [56].
To the authors' knowledge, QBA has never been tested in horses during transport procedures; therefore, a specific list of descriptors to be used in horses subjected to transport has never been developed. To identify valid indicators to assess animal emotional states and to assess if they may be feasible to evaluate welfare of horses during the loading phase of transport, the present study aimed at assessing the accuracy of QBA by using the FCP and the fixed list approaches.

Animals and Video-Clips
A total of 40 videos of horses during the loading phase of transport were selected from a pool of videos collected during previous studies; the videos were recorded using a digital video camera (Canon Legria HFR88, Canon Inc., Tokyo, Japan), controlled by the experimenter. The entire loading procedure was video-recorded for each horse. Length of videos ranged from 39 to 110 s (mean = 50 s).
Videos were divided as follows:
• A total of 28 videos of Spanish Breton meat horses, of both sexes (M = 14; F = 14), aged 15 ± 2.79 months (min = 12 months; max = 24 months), from a meat horse farm located in North Eastern Italy. Horses were transported to the slaughterhouse, two to four horses at a time, using the same truck, according to the farm's ordinary routine. Transport took place in the afternoon (~4:00 p.m.) on different days from April to October 2018. The farm manager performed the usual loading procedures, which involved minimal handling of horses: moving fences to let horses enter the loading lane, and inciting horses from behind using voice and moving a stick only when they refused to move. More information regarding this group of horses can be found in Dai and colleagues' study [15];
• A total of 12 videos of Arabian horses, of both sexes (M = 5; F = 7), aged 2.66 ± 1.77 years (min = 1 year; max = 7 years), from an equestrian center located in Northern Italy. The youngest horse had previously been transported only once; all the remaining horses had previously been transported to different competitions. All the horses were trained by the same person, but no specific training to load was in place. Horses were transported to show competitions, three to seven horses at a time, using the same truck, according to the farm's ordinary routine. Transport took place in the afternoon (~2:00 p.m.) on different days from August to November 2019. The farm manager or a groom performed the usual loading procedures, which involved leading horses using a headcollar and lead rope, inciting horses from behind using voice, moving a whip, and touching the back of the horse with a whip only when they refused to move.

Assessors and Observation Session
A total of 13 members of a horse welfare protection organization were recruited as assessors (11 females and 2 males). All assessors had extensive previous experience with horses, while none had previous experience with QBA. All assessors were native English speakers.
Three tutors were involved in the study to educate assessors in the use of QBA and to guide the observation sessions. The tutors were female veterinarians, experts in applied ethology and welfare evaluation of equids, with previous experience in using QBA in horses for both research and welfare assessment purposes.
Observations were organized in three sessions: an introductory session, a first video assessment session using FCP, and a second video assessment session using a fixed list of terms. All the sessions were held live online, using Microsoft Teams (© Microsoft 2022), due to the global pandemic.

Introductory Session
Before starting the assessment of the video clips, an introductory session, lasting approximately 1 h, was organized. In this first session the assessors were introduced to the concept of QBA and to the operative procedures of the study. Tutors explained the QBA background and how it has been used for on-farm welfare assessment in different species.

Free-Choice Profile Session
In the first observation session, FCP was chosen to ensure that assessors were not biased by predefined terms. Tutors played the set of 15 videos, including both show and meat horses during loading. After watching each video, assessors were asked to use descriptors to illustrate the expressive style of the horse's behavior in that specific video. Assessors could ask to re-watch each video clip twice. After watching all the videos, each assessor compiled their own unique list of descriptors. Assessors were then provided with a personal link to a data collection sheet (see Section 2.2.5 for more information) containing their own terms, to be scored on a Visual Analogue Scale (125 mm continuous scale). The same set of videos was played again, and assessors were asked to score horse behavior using their own list of descriptors. Tutors explained how to use the scale: the left end represented the 'minimum' and the right end the 'maximum' point of the scale; a mark at the far left of the scale, or a zero, indicated that the expressed quality of the specific descriptor was "entirely absent", whereas the 'maximum' at 125 mm indicated that "the expressed quality of the specific descriptor was constantly obvious during the observation".

Fixed List Creation
Following the FCP session, tutors collected the lists generated by each assessor and collapsed them into a single list containing all the identified descriptors. Starting from this list, tutors moderated a discussion regarding the meaning of each descriptor. Assessors were asked to define each term and were able to discuss it until they reached a group consensus on its interpretation. Following this session, a fixed list of 21 descriptors with definitions was generated (Table 1).

Fixed List Session
The generated fixed list was then used in the last session to score the set of 25 videos of horses kept for meat production during the loading procedure. Assessors received a link to a data collection sheet (see Section 2.2.5 for more information) containing the fixed list of terms, to be scored on a 125 mm Visual Analogue Scale.

Data Collection
Online proprietary software (Qualtrics, Provo, UT, USA) was used to build the data collection sheets and distribute them. For the FCP session, one data collection sheet was built for each assessor. The sheet contained the personal list of generated terms, each followed by a Visual Analogue Scale (1 to 125 mm long). Each participant received via e-mail a personal link to access their sheet. Following the Fixed List creation, one data collection sheet was created, containing the fixed list of descriptors, each followed by a Visual Analogue Scale (1 to 125 mm long). The link to access this sheet was distributed to all assessors.

Statistical Analysis
The inter-observer agreement in the FCP was investigated using Generalized Procrustes Analysis (GPA), a multivariate statistical technique that does not rely on fixed variables. GPA transforms individual observer scoring patterns into multidimensional configurations, makes them comparable with each other through a sequence of rototranslations and rescalings, and determines the "mean" of these configurations, named the "best fit" or "consensus profile". This calculation is essentially a process of pattern recognition and takes place independently of the meaning of the terminologies used by assessors. How well individual assessor scores fit the consensus profile (i.e., the degree of agreement) is quantified by the Procrustes Statistic and visually represented by an 'observer plot' [29].
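The GPA idea described above can be illustrated with a minimal Python sketch. It is not the software used in the study: the function names (`gpa`, `randomization_test`), the zero-padding of observers with fewer terms, the single unit-norm rescaling, and the row-shuffling randomization scheme are all assumptions made for illustration.

```python
import numpy as np

def _center_scale(x):
    """Translate a configuration to its centroid and rescale it to unit norm."""
    x = x - x.mean(axis=0)
    norm = np.linalg.norm(x)
    return x / norm if norm > 0 else x

def _rotate_onto(x, target):
    """Orthogonal Procrustes: rotate/reflect x to best match target."""
    u, _, vt = np.linalg.svd(x.T @ target)
    return x @ (u @ vt)

def gpa(configs, n_iter=100, tol=1e-10):
    """Simplified GPA. configs: list of (animals x terms) score matrices,
    one per observer; the number of terms may differ between observers."""
    width = max(np.asarray(c).shape[1] for c in configs)
    aligned = []
    for c in configs:
        c = np.asarray(c, dtype=float)
        # Assumption: pad with zero columns so all observers share one space.
        c = np.pad(c, ((0, 0), (0, width - c.shape[1])))
        aligned.append(_center_scale(c))
    consensus = aligned[0]
    previous = np.inf
    for _ in range(n_iter):
        aligned = [_rotate_onto(x, consensus) for x in aligned]
        consensus = np.mean(aligned, axis=0)
        residual = sum(np.sum((x - consensus) ** 2) for x in aligned)
        if previous - residual < tol:
            break
        previous = residual
    total = sum(np.sum(x ** 2) for x in aligned)
    # Share of total variation captured by the consensus profile.
    return consensus, aligned, 1.0 - residual / total

def randomization_test(configs, n_perm=1000, seed=0):
    """Compare the observed consensus with consensuses obtained after
    shuffling the animal order independently within each observer."""
    rng = np.random.default_rng(seed)
    observed = gpa(configs)[2]
    null = np.empty(n_perm)
    for k in range(n_perm):
        shuffled = [np.asarray(c)[rng.permutation(len(c))] for c in configs]
        null[k] = gpa(shuffled)[2]
    p_value = (1 + np.sum(null >= observed)) / (n_perm + 1)
    return observed, null.mean(), p_value
```

Calling `randomization_test(list_of_score_matrices)` returns the observed share of explained variation, the mean over the randomized profiles, and a permutation p-value. Dedicated GPA software typically also re-estimates the scaling factors iteratively, which this sketch omits.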
To analyze inter-observer reliability for each descriptor of the generated QBA fixed list, Kendall's coefficient of concordance (W) was calculated on the raw descriptor scores. Kendall's W values can vary from 0 (no agreement at all) to 1 (complete agreement). Thresholds from [57] were used to interpret Kendall's W. The Bonferroni procedure was used to adjust p-values.
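As an illustration of the statistic, the following Python sketch computes Kendall's W with the usual correction for tied ranks, together with a plain Bonferroni adjustment. The study itself used the R "irr" package; the function names and the input layout (one row of scores per rater for a single descriptor) are assumptions made here.

```python
from collections import Counter

def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def kendalls_w(scores):
    """scores[r][v] is the score given by rater r to video v for one
    descriptor. Returns Kendall's W corrected for ties."""
    m, n = len(scores), len(scores[0])
    rank_rows = [average_ranks(row) for row in scores]
    totals = [sum(row[v] for row in rank_rows) for v in range(n)]
    mean_total = m * (n + 1) / 2
    s = sum((t - mean_total) ** 2 for t in totals)
    # Tie correction: sum of (t^3 - t) over each rater's groups of tied ranks.
    ties = sum(c ** 3 - c for row in rank_rows for c in Counter(row).values())
    denominator = m ** 2 * (n ** 3 - n) - m * ties
    if denominator == 0:  # every rater gave the same score to every video
        return float("nan")
    return 12 * s / denominator

def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    k = len(p_values)
    return [min(1.0, p * k) for p in p_values]
```

With one such matrix per descriptor, `kendalls_w` would be called once per descriptor and the resulting p-values (obtained separately, for example from a chi-square approximation) passed together to `bonferroni`.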
For the subsequent analyses, those descriptors with Kendall's W < 0.20 (poor agreement between raters) were excluded, namely "Aggressive", "Excited", and "Withdrawn". A Principal Component Analysis (PCA, correlation matrix) was performed on the remaining descriptors (n = 18) for each assessor separately. Analyses were conducted using R software (version 3.6.1) and "FactoMineR" and "irr" packages.
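A PCA on the correlation matrix for a single assessor can be sketched as follows. This is a minimal NumPy illustration and not the FactoMineR routine used in the study; the function name `pca_correlation` and the input layout (rows are videos, columns are descriptors) are assumptions.

```python
import numpy as np

def pca_correlation(scores):
    """PCA on the correlation matrix of one assessor's scores.

    scores: array of shape (videos, descriptors). Every descriptor must
    vary across videos, otherwise its standard deviation is zero and the
    standardization below is undefined.
    Returns (explained variance ratios, loadings, component scores).
    """
    x = np.asarray(scores, dtype=float)
    # Standardize each descriptor (zero mean, unit sample variance).
    z = (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)
    corr = np.corrcoef(x, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)
    order = np.argsort(eigvals)[::-1]  # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained = eigvals / eigvals.sum()
    # Loadings: correlations between descriptors and components.
    loadings = eigvecs * np.sqrt(np.clip(eigvals, 0.0, None))
    pc_scores = z @ eigvecs
    return explained, loadings, pc_scores
```

Running this once per assessor gives, for each principal component, the share of variance explained and the descriptor loadings, which can then be compared across assessors. The sign of each component is arbitrary, so loadings may need to be aligned before such a comparison.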

Free Choice Profile
The FCP methodology allowed the observers to generate their own unique set of terms to describe behavioral expression and then use it to score the behavior of fifteen horses (both show and meat horses) at the loading phase of transport. The Procrustes Statistic of the GPA consensus profile explained a significantly higher percentage of variation (80.8%) than the mean of 1000 randomized profiles (41.2 ± 1.6%; p = 0.001), indicating the consensus to be a significant feature of the dataset rather than an artefact of the Procrustean calculation procedures. The observer plot (Figure 1) reflects the consensus among the 13 observers, as all of them fall within the 95% confidence region. This shows an excellent inter-observer agreement of the FCP method, in line with what has already been shown in other species [27,42,49,58,59], and generally supports the reliability of the QBA methodology.
Figure 1. Observer plot showing the consensus among the 13 QBA assessors using the FCP methodology; the circle represents the 95% confidence region.
Two main factors of the consensus profile were identified, explaining 65.1% and 3.7% of the total variation between animals, respectively, in accordance with the vast majority of the studies that applied FCP to different species [27,28,39,41,42,49,58] and found two main dimensions. To provide an overview of highly correlated terms (correlation ≥ 0.60) for all assessors, Table 1 lists the most used descriptors (with their frequencies over the 13 observers) with the highest positive and negative correlation to factors 1 and 2 of the consensus profile. On the basis of these results, GPA factor 1 (GPA1) was qualitatively labelled as ranging from 'anxious/tense' (positive correlation) to 'calm/relaxed' (negative correlation), and GPA factor 2 (GPA2) as ranging from 'bright' at the high end of the axis to 'assessing/withdrawn' at the low end. GPA1 describes the valence of the horses' emotional states, while GPA2 appears to describe the arousal. Most of the terms used by our observers were labelled as belonging to GPA1 and distinguish between horses that seemed tense/anxious (chosen by 11 and 10 observers, respectively) or even scared (chosen by 6 observers) during the loading phase and horses that seemed to be calm/relaxed (11 observers each) or even willing (reported by 3 observers). Calm vs. activation/agitation is also one of the most common GPA factors in other FCP studies, even when used to evaluate other species assessed under different conditions [22,39,58,60–62]. On the other hand, very few participants reported descriptors that were then labelled in GPA2: only one observer rated a horse as "bright", and the adjectives "assessing", "depressed" and "withdrawn" were used once each.
Since QBA aims to assess how the animal is behaving and thereby capture its affective state [21,41], when it is applied to loading situations, the most used terms describing emotional valence that we collected seem to embed an arousal component themselves. This may be because an animal that is habituated to and confident about being loaded reacts less than one that feels uncomfortable or even scared. This is particularly true for horses, which are well known for their typical flight reactions [63]. Horses afraid of the confinement that the trailer imposes indeed exhibit behaviors such as rearing, pulling back, head tossing, pawing, and turning sideways [6]. Therefore, for example, it would be quite difficult to find descriptors with low valence and low arousal, such as "apathetic" or "annoyed", when the animal is not comfortable with transport, especially during this phase, or, vice versa, descriptors with high valence and high arousal, such as "happy" or "look for contact". Apathetic, annoyed, happy, and look for contact are all descriptors that belong to the AWIN QBA fixed list [56]. If we had designed our study using that fixed list first, it is likely that our observers would have struggled to use these terms, which do not fit well with the possible loading reactions of horses. This has also been hypothesized by Napolitano and colleagues in their study on dairy buffaloes: if terms were imposed on observers through pre-determined scoring lists, agreement would not be as high as found in that study, or would be high for some terms but not others [58].
On the other hand, it is well recognized that long and complex lists may be difficult for different assessors to fully understand and implement [28,48], while a more concise list may also be easier and faster to complete.
Moreover, looking at the number of GPA1 descriptors, there is considerably higher variability and a larger number of terms in the list positively correlated with GPA1 (13 positively correlated descriptors vs. 5 negatively correlated descriptors). Different raters found many different shades in the demeanor of those horses that were not comfortable being loaded, while "calm", "relaxed" and "confident" were the main terms used to describe cooperative or unstressed horses. This could be because our observers tended to focus more on problematic behaviors and horses than on quiet situations. Despite that, we found a clear semantic correlation between the terms used to describe each horse. Therefore, in addition to the high level of inter-observer agreement, QBA may be a tool worth considering for assessing variation in the valence and arousal of emotional states. Or, at least, it is easier to reach a consensus on it, even in the event that all the observers were collectively wrong, as hypothesized by Wemelsfelder and colleagues [29]. Figure 2 shows the 'horse plot' of the QBA, in which individual horses (shown in the first 15 scored videos) are positioned on the two main factors of the GPA consensus profile. These positions and the variation between them can be semantically interpreted with the qualitative labels discussed above. Once more, it seems evident that one GPA factor, GPA dimension 1, is more "influential" and variable than the other, indicating a different ability of observers to capture variation in arousal that is not linked to variability in emotional valence.

Fixed Term List
Although the FCP methodology removes possible bias due to provided terms by allowing each observer to use their own terms, it is well recognized that fixed lists are more practical, feasible, and suitable than FCP for on-farm implementation of QBA [28,48], when a standardized way of assessment is needed for feasibility reasons. Therefore, starting from the FCP results, we built a Fixed List, reported in Table 2, and then tested the inter-observer reliability of each term after the observers used them to describe a second group of 25 horses. It is important that, before they use the list, all the observers reach consensus on the meaning of each term, in order to remove any linguistic barriers or misunderstanding [59], and that terms failing to reach good agreement are then eliminated. Kendall's W results are reported in Table 3. For the subsequent analyses, those descriptors with Kendall's W < 0.20 (poor agreement between raters) were excluded, namely "Aggressive", "Excited", and "Withdrawn".
A Principal Component Analysis (PCA, correlation matrix) was performed on the remaining descriptors (n = 18) for each assessor separately. Then, we decided not to consider descriptors that were only mentioned once in each principal component (PC). Figure 3 provides an example of a PCA graph performed on a single assessor, with the first dimension on the x-axis and the second one on the y-axis. It is immediately evident that PC1 is the one that varies more. Results showed that the first and second PCs both had significant agreement between raters in terms of loadings and scores, and these two PCs explained on average 59.8% and 12.6% of the data variability, respectively (Table 4). The agreement on the third PC (which explained a lower percentage of variance) was poor and non-significant. PC1 ranges from relaxed/confident to anxious/frightened, while PC2 ranges from alert/inquisitive to calm.
Thus, in both the FCP and FL methods, the first dimension (GPA1 and PC1, respectively) generally reflects a valence of mood, with "relaxed/confident" vs. "anxious/tense/frightened", in accordance with what has already been described by Clarke and colleagues [60]. In addition, even with the FL, the percentage of explained variance is higher for PC1, linked to valence, than for PC2, which described the arousal. This again suggests that in this specific situation (the loading phase of transport) and with this specific species (horses), observers were less able to distinguish arousal variability independently of pleasantness, and tended instead to use terms that embedded an arousal description.

Table 4.
PC1: […] (13), Comfortable (12), Relaxed (12), Willing (11), Keen (9), Obedient (7), Inquisitive (5), Anxious (13), Frightened (13), Panicked (13), Stressed (13), Tense (13), Unsure (13), Flighty (8), Reactive (8), Alert ([…])
PC2: […] (10), Inquisitive (9), Keen (8), Distracted (7), Flighty (7), Obedient (6), Reactive (6), Willing (4), Confident (3), Panicked (3), Comfortable (2), Calm (2), Flighty (2), Reactive (2)
PC3: 0.066 (0.659), 0.121 (0.063), 7.5 ± 2.1 (4.7-11.5), Alert (7), Distracted (6), Inquisitive (5), Obedient (4), Comfortable (3), Flighty (3), Keen (3), Panicked (3), Reactive (3), Willing (3), Distracted (4), Reactive (4), Flighty (3), Inquisitive (3), Unsure (3), Willing (3), Comfortable (2), Stressed (2), Tense (2)

The lower agreement, compared to the one reached in the FCP phase, could be because observers were required to use fixed terms, as also assumed by Napolitano and colleagues [58]. This semantic consensus and the inter-observer agreement once again make QBA a tool worth considering when assessing whether a horse is calm and confident during loading or is instead worried and afraid.

Conclusions
As societal beliefs and values are changing and our knowledge of animal sentience is growing, a positive animal welfare approach is increasingly needed to meet these new needs. Therefore, in order to identify new feasible indicators to evaluate animals' positive and negative emotional states as a parameter to assess animal welfare, the aim of our study was to assess the accuracy of both FCP and FL QBA approaches in horses during the sensitive phase of pre-road transport loading. It is well known that the FCP methodology frees observers from bias and misunderstandings and allows them to describe how the animal is behaving in the words they prefer. On the other hand, the use of fixed lists is more practical, which is why it is the first-choice method in on-farm situations.
Our results showed the importance of developing both these methodologies under the specific conditions in which they will be used, in order to obtain a list of terms that fit. More specifically, in both FCP and FL, most of the terms that our observers used were labelled in the "valence dimension", while they struggled to capture arousal variation. This is most likely related to how a horse tends to deal with the loading phase: its body activation tends to increase the more uncomfortable it feels and, vice versa, its movements are minimized if it is calm and relaxed at the idea of being loaded. Furthermore, a more concise list may be easier and faster to complete in on-farm conditions. Thus, our findings highlighted the need to use descriptors specifically selected, through a prior FCP process, for the situation to be evaluated in order to obtain a good QBA accuracy level.
Institutional Review Board Statement: Ethical review and approval were waived for this study due to no identifiable human data collection.
Informed Consent Statement: Not applicable. According to the guidelines of the WMA Declaration of Helsinki, an Informed Consent Statement is not applicable because no identifiable human data has been collected.

Data Availability Statement:
The data that support the findings of this study are available on request from the corresponding author E.D.C.