This study assessed the reliability and feasibility of 10 animal-based welfare measures for extensively managed ewes. Body condition score, fleece condition, skin lesions, tail length, dag score and lameness are proposed for on-farm use in welfare assessments of extensive sheep production systems. These six valid measures address the main welfare concerns for ewes, and they are reliable and feasible. When combined, they provide an overview of the nutritional, health and welfare state of the ewes as well as evidencing previous or potential welfare concerns.
4.1. Reliability of the Animal-Based Welfare Measures
High inter- and intra-observer agreements, from ‘substantial/moderate’ to ‘substantial/almost perfect’, were found for BCS, fleece cleanliness, fleece condition, skin lesions, tail length, dag score and lameness. In the present study, BCS was the measure whose agreement increased the most: the inter-observer agreement and the intra-observer agreement of the TSO increased from ‘moderate’ at mid-pregnancy to ‘almost perfect’ at weaning. Body condition is widely accepted as a valid and important welfare measure that reflects the nutritional state of sheep [13,14]. The results of the present study suggest that a quarter-point scale is reliable, but that operators require sufficient training and experience to achieve high agreement in this measurement [27,35,36]. In this study, the experienced observers (TSO, observers 2 and 7) showed the highest agreement and repeatability for this measure. The increased training sessions and the clarification of the descriptive terms used may have helped to achieve ‘almost perfect’ inter- and intra-observer agreement at the end of the study. Although individual differences, observer expertise and differences in the reassessment interval (15-day period at MP vs. 24 h at ML and WN) may have influenced the levels of agreement obtained, there is evidence that the level of observer agreement increases significantly when sufficient training is provided [13,18].
Rumen fill, foot-wall integrity and hoof overgrowth were the measures with lower agreement in this study. This is likely the result of difficulties associated with assessing these measures, e.g., the presence of fleece and the fact that ewes often moved backwards and forwards along the race, which particularly affected how easily foot-wall integrity could be assessed. In addition, the scoring scales and the descriptive terms used for foot-wall integrity may have affected the levels of observer agreement. Simplifying the scoring scales and clarifying the descriptive terms may provide higher agreement and make these measures more useful for future on-farm assessments.
The performance of each welfare measure was evaluated in line with previous reliability studies [13,18,23,27]. Percentage of agreement was used as it provides an easy illustration of observer agreement. However, as this method does not estimate the amount of agreement that could occur by chance, Kendall’s coefficient of concordance (W) and Kappa (k) were selected to statistically assess the inter- and intra-observer agreement of ordinal and binomial measures. Care is needed, however, when interpreting k values, because they are affected by the prevalence of the condition under consideration. Populations with few animals presenting the condition of interest can yield very low values of k that do not necessarily reflect low levels of observer agreement [37]. In the present study, tail length was scored on a simple binomial scale and showed a high percentage of agreement across the three time points examined (MP: 71–86%; ML: 85–97%; WN: 96–100%). However, k values were generally low: from 0.28 to 0.39 at MP, from −0.01 to 0.56 at ML and from 0.37 to 1.00 at WN. Discrepancies between the percentage of agreement and k values may be a consequence of the low number of animals that had adequate tail length in this study (n = 8, as determined by the TSO, whereas 92%, n = 87, had short-docked tails at weaning), and may not necessarily mean low inter-observer agreement. It is possible that higher k values would have been achieved if more animals in this study had adequate tail length. Similar difficulties in the interpretation of k values have been reported in previous studies [23,37,38]. Another factor that needs to be considered when evaluating reliability is the interval between reassessments. In the present study, low intra-observer reliability at mid-pregnancy cannot be completely attributed to a lack of consistency of the observers, as the length of the reassessment interval at this stage (15-day period) may have affected the levels of intra-observer agreement for dag score, skin lesions, foot-wall integrity and hoof overgrowth.
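The effect of prevalence on k can be illustrated with a minimal sketch. It assumes Cohen's kappa for a single pair of observers scoring a binary measure (the exact kappa variant and software used in the study are not stated here), and the counts are hypothetical values chosen for illustration, not data from this study. The function names are likewise illustrative.

```python
from collections import Counter


def percent_agreement(a, b):
    """Proportion of animals given the same score by both observers."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("score lists must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    p_observed = percent_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement expected from each observer's marginal score frequencies.
    p_chance = sum(counts_a[c] * counts_b[c] for c in set(a) | set(b)) / (n * n)
    if p_chance == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_observed - p_chance) / (1.0 - p_chance)


def paired_scores(both_yes, only_a, only_b, both_no):
    """Expand a 2x2 agreement table into two lists of binary scores."""
    a = [1] * (both_yes + only_a) + [0] * (only_b + both_no)
    b = [1] * both_yes + [0] * only_a + [1] * only_b + [0] * both_no
    return a, b


# Hypothetical counts (NOT study data): 100 animals, 10 disagreements in each case.
rare = paired_scores(both_yes=2, only_a=5, only_b=5, both_no=88)        # condition is rare
balanced = paired_scores(both_yes=45, only_a=5, only_b=5, both_no=45)   # condition is common

for label, (a, b) in (("rare", rare), ("balanced", balanced)):
    print(f"{label}: agreement = {percent_agreement(a, b):.2f}, "
          f"kappa = {cohen_kappa(a, b):.2f}")
# rare:     agreement = 0.90, kappa = 0.23
# balanced: agreement = 0.90, kappa = 0.80
```

Both hypothetical scenarios have the same 90% agreement, yet k drops sharply when few animals present the condition, because the agreement expected by chance is already high.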
Overall, there is wide variation in the scientific literature on how the reliability of welfare measures is assessed. Currently, there is no agreement on the number of animals, number of observers or the methodology that should be used. For instance, a reliability study in lambs used four observers to assess 966 lambs [23], a study of welfare assessment for adult sheep used two observers and 360 ewes [15], and studies assessing the reliability of locomotion scoring in various species have used five observers and 83 cows [39], three observers and 30 video clips of sheep [40], and three observers and 80 photographs and videos of foot-rot lesions in sheep [27]. The sample size selected in the present study was based on a power calculation and recommendations by the AWIN sheep protocol [10]. The fact that the performance of the measures was tested on-farm during different stages of the sheep production cycle further supports their reliability and applicability under farm conditions.
4.2. Feasibility of the Animal-Based Welfare Measures
Welfare measures need to be practical if they are to be valuable. Sheep farms in Australia can commonly have 12,000 animals, and they are usually managed by a single person [9,41]. This highlights the need for feasible measures that can be taken in a short time with few resources and personnel, as time and labor are limited in extensive sheep systems. When assessing the feasibility of the measures in this study, a variety of factors were considered, such as the time spent on the assessment, the resources required and the ability to collect these measurements across different farms. Feasibility was assessed for a third party performing the assessment, not a farmer. Generally, the measures tested proved to be feasible, requiring on average 2.5 min to assess an individual ewe at weaning. The significant decrease in the time spent on the assessment at weaning might have been influenced by individual differences among the observers and by familiarization with the scoring scales and assessment protocol, although no differences were found in the time spent assessing the ewes between mid-pregnancy and mid-lactation. Lactation was considered the least practical period due to the presence of lambs, which made sheep handling difficult during the assessment. This needs to be considered when deciding on key times to perform on-farm welfare assessments.
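As a rough illustration of why assessment time matters at this scale, the sketch below converts the reported average of 2.5 min per ewe into total observer hours. The function name and the sample of 100 ewes are hypothetical and are not taken from the study protocol. The calculation also ignores mustering and yarding time.

```python
MINUTES_PER_EWE = 2.5  # average assessment time per ewe at weaning, as reported in the text


def assessment_hours(n_ewes, minutes_per_ewe=MINUTES_PER_EWE):
    """Total assessment time in hours for a single observer scoring n_ewes."""
    if n_ewes < 0 or minutes_per_ewe < 0:
        raise ValueError("inputs must be non-negative")
    return n_ewes * minutes_per_ewe / 60


whole_flock = assessment_hours(12_000)  # a 12,000-animal farm: 500.0 hours
sample = assessment_hours(100)          # hypothetical sample of 100 ewes: about 4.2 hours

print(f"whole flock: {whole_flock:.1f} h, sample of 100: {sample:.1f} h")
```

Under these assumptions, scoring every animal individually would be impractical, whereas a sample of around 100 ewes could be scored within a working day.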
The most feasible measures were found to be BCS, fleece cleanliness, fleece condition, skin lesions, tail length, dag score and lameness. Clear practical advantages of these measures are that none requires specialized equipment, that the only infrastructure required is a raceway, which is a common facility on sheep farms, and that, other than the labor required to bring the sheep into the yards, they do not interrupt farm management practices. It should also be considered that most farmers visually monitor their sheep in the paddock, rather than gathering them into the yards. In this context, it has been shown that some of these measures, e.g., thin body condition, lameness and dags, can be examined from a distance during key stages of the production cycle [2,42] with minimal interference with farm work. Thus, the measures selected may be considered more acceptable by producers. Foot-wall integrity and hoof overgrowth, on the other hand, were found to be less practical, as they were time-consuming and not easy to assess because ewes often moved backwards and forwards. Additionally, their implementation across farms is limited, as they need to be assessed in races without covered side walls.
4.3. Recommended Measures for On-Farm Welfare Assessment of Extensively Managed Ewes
This research is important because it identified measurements that are suitable for use under commercial conditions [43]. The validity of these measures reported in Munoz et al. [19], together with their reliability and feasibility examined in this study, indicates that six animal-based measures (BCS, fleece condition, skin lesions, tail length, dag score and lameness) are appropriate for inclusion in welfare protocols for extensively managed ewes, particularly in Australia. When these measures are combined, they provide a snapshot of the current welfare status of ewes, as well as evidence of past or potential welfare risks. For example, the combination of a decline in BCS, poor fleece condition and a high dag score helps to identify that the welfare of that animal is compromised, while also facilitating the identification of the problem and the appropriate treatment. These measures address important welfare issues identified by producers, industry, specialists and the general public [10,38,44].
Fleece cleanliness, although repeatable and feasible, might not be meaningful for extensive systems. Fleece cleanliness has previously been proposed as an important welfare measure for sheep, as it can provide information about the quality of the environment [10,15,18,23,45]. However, this measure is more valuable for intensive indoor lambing systems, where it is important to assess the cleanliness of the floor/bedding and how the animal is coping with this environment. Rumen fill, foot-wall integrity and hoof overgrowth were discarded based on poor reliability and feasibility. Rumen fill has been identified as a relevant animal-based measure for sheep and lambs, as it provides short-term information on food access [38]. In the present study, rumen fill was difficult to assess and this was reflected in the poor levels of agreement achieved. The presence of the fleece was the main factor affecting the levels of inter-observer agreement. Similar results were reported in a previous study on lambs, where only ‘moderate’ inter-observer agreement was obtained [2]. In view of the difficulties of assessing rumen fill in ewes that are not in short wool and its limitations in assessing sheep welfare, the measure was excluded. Foot-wall integrity and hoof overgrowth showed poor repeatability and limited feasibility for implementation across different farms. It should also be considered that broader measures, such as lameness, may be more relevant for assessing ewe welfare than foot-wall integrity and hoof overgrowth.
Besides discriminating which welfare measures are more suitable for extensive conditions, it is also important to identify alternatives that could be used to measure on-farm welfare in sheep. For instance, limited research has been done to develop practical assessments of fear of humans in sheep, and studies on this topic vary in methodology and performance [46]. The majority of this research has focused on intensively managed sheep [15,47,48,49], usually under experimental conditions [41,47,50]. Further work is needed to validate a practical on-farm assessment of fear of humans that could be applied to extensive systems. Recent studies by Hazard et al. [51,52] have investigated several behavioral traits in sheep that could be used to validate the assessment of fear of humans in extensive farming conditions. Additionally, limited work has been done to develop practical on-farm assessments for clinical and sub-clinical mastitis [10]. Udder examination and collection of milk samples to perform an on-farm test (e.g., California mastitis test) are time-consuming and labor intensive, which makes these assessments less appealing for on-farm use. Further studies on the development of practical welfare assessments should consider the incorporation of new technologies for the practical assessment of mastitis and for tracking grazing behavior and sheep movement to detect sick/lame animals. Finally, it should be considered that extensive systems are characterized by seasonal variation in both climate and food availability, which results in seasonal variation in the welfare status of sheep [18]. Welfare measures must therefore be able to detect variation in the welfare status of ewes over the main risk periods of the production cycle [18], as well as be sensitive enough to identify differences between farms. Further research into the development of welfare assessments for extensive systems should assess both the seasonal variation of the selected measures and their ability to detect differences between farms, as only one property was examined in the present study.