Methods used and Application of the Mouse Grimace Scale in Biomedical Research 10 Years On: A Systematic Scoping Review

The Mouse Grimace Scale (MGS) was developed 10 years ago to assess pain through characterisation of changes in five facial features or action units. The strength of the technique is that it is proposed to be a measure of spontaneous or non-evoked pain. A comprehensive scoping review of the academic literature was performed. The MGS has been employed mainly in evaluation of acute pain, particularly in the pain and neuroscience research fields. There has however been use of the technique in a wide range of fields, and based on limited study it does appear to have utility for pain assessment across a spectrum of animal models. Use of the method does allow detection of pain of a longer duration, up to a month post-initial insult. There has been less use of the technique using real-time methods and this is an area in need of further research.

experience. (Mogil, 2009) This concern is not unique to pain research; pain commonly arises in other disease conditions and may be a target for novel therapeutics. Secondly, pain and its sequelae may influence the results obtained from animal model studies, affecting a range of physiological and immunological processes occurring. This further impacts on the reliability and translatability of the results obtained from these studies. (Carbone and Austin, 2016;González-Cano et al., 2020;Peterson et al., 2017) Finally, pain presents a significant cost to animal welfare through the impact on individual animals. Therefore, the assessment of pain and application of methods to mitigate its effects, are needed to safeguard animal welfare and to conform to ethical requirements in biomedical research, for instance the refinement aspect of the 3Rs. (Russell and Burch, 1959) This assists in addressing societal concerns around the use of animals in research.
One of the more commonly used assessment methods, suggested to be specific to pain, is facial expression scoring using the so-called 'grimace scales'. (Mogil et al., 2020; Whittaker and Howarth, 2014) The idea behind using facial expressions as a readout for pain neurobiology came from human facial coding scales. (Nagakura et al., 2019; Serizawa et al., 2019) The Facial Action Coding System (FACS) allows categorization of movements of the facial muscles. Specific combinations of movements lead to changes in discrete facial regions or "facial action units (FAUs)", for instance the closing of the eyelids. Recognition of changes in these FAUs has been proposed to allow determination of emotional state. (Descovich et al., 2017; Ekman, 1992; LeResche, 1982) Grimace scales were developed for non-human animals, with the goal of standardizing methods for different species. The original grimace scale was developed for mice by Langford and colleagues in 2010, and validated through application of a variety of preclinical pain assays. In this scale, changes in 5 facial action units are assessed to determine the level of pain: (1) orbital tightening, (2) nose bulge, (3) cheek bulge, (4) ear position and (5) whisker change. Grimace scale development in other species followed (see Mogil et al., 2020 for full history), as did further examination of the mouse grimace scale (MGS) in a range of animal models and conditions.
There have been a number of reviews on grimace scales in a variety of species, (Descovich et al., 2017;McLennan et al., 2019;Mogil et al., 2020;Mota-Rojas et al., 2020) but none to our knowledge that have focussed solely on mice, and used systematic methods to identify all studies where the MGS was utilised. Now 10 years on from the publication of the original study, a comprehensive systematic assimilation of the evidence on the MGS is warranted; mice being the most commonly used mammal in biomedical research. (Homberg et al., 2017) In contrast to a systematic review and meta-analysis, scoping reviews are broader in scope and bring together all current evidence, regardless of quality (Colquhoun et al., 2014). They may also pave the way for future systematic reviews on a clearly defined question identified in the scoping review. Therefore the aim of this systematic scoping review was to identify all published studies on the MGS and assimilate the evidence based on features of the scale use, with a particular focus on the application of the technique across a range of animal models, the methods used, and the impact of external variables on validity and reliability. This review will provide increased strength of evidence to guide researchers, ethics committees, and policy makers on the use and application of the MGS in biomedical research.

Search Strategy
The search strategy aimed to locate published studies in English. An initial limited search of Medline was undertaken to identify articles on the topic. The text words contained in the titles and abstracts of relevant articles, and the index terms used to describe the articles were used to develop a full search strategy for Medline via Pubmed using MeSH and free text terms. The search strategy was adapted for Scopus and Web of Science (including CAB abstracts) database searches. The three databases were searched in May 2020 using the developed search strategies (see Appendix A). The search was updated in October 2020. Key concepts used for searching were "mice" and "grimace scale". Hand searching of reference lists was performed to identify additional studies. Studies published from database inception were eligible for inclusion. Publications were excluded electronically if they were conference abstracts with full study detail and results not available, or review articles.

Eligibility Criteria
Studies were included if they investigated the Mouse Grimace Scale in mice, irrespective of age, sex or strain. Studies that looked at a change in any number of facial action units but did not report this as use of a 'grimace scale' were excluded. Studies that used the MGS and reported it as such, but modified the method slightly, were however eligible for inclusion. Only studies that investigated the MGS on the understanding that it was a measure of pain were eligible; for example, a study using the MGS to assess positive emotion would have been ineligible for inclusion. All study designs were eligible for inclusion. Studies investigating new ways of collecting MGS data, for example by automation techniques, were excluded. However, studies evaluating the objective nature of the test, for example those examining reliability between observers or institutions, were eligible for inclusion.

Study Selection
Following the search, all identified citations were collated and uploaded into EndNote X8.0.1 and duplicates removed. Potentially relevant studies were retrieved in full and their citation details imported into Covidence (Veritas Health Innovation, Melbourne, Australia). Titles were screened by one reviewer (AW) for assessment against the inclusion criteria for the review. Abstract and full text screening were performed by all authors (AW, YL, THB) with two independent reviewers being required to certify the inclusion of each study. Disagreements that arose between the reviewers at each stage of the study selection process were resolved through discussion with the third reviewer.

Data Extraction
Data were extracted from the included studies by three independent reviewers (AW, YL, THB) using an electronic form developed by the authors (Appendix 2). All reviewers initially performed independent review of the same 3 studies (Dwivedi et al., 2016; Hassan et al., 2017) to pilot the extraction tool and check for data consistency. Following this, the remaining studies were allocated randomly between the 3 reviewers, with each study being extracted by one reviewer. Only data directly relevant to the research question were extracted. All data extracted were reviewed by the authorship team to ensure completeness of extraction. Study authors were contacted where necessary to clarify findings or seek further information. In accordance with guidelines on systematic scoping reviews, (Peters et al., 2015) the goal of the review was to provide an overview of evidence on the MGS regardless of quality. Hence methodological quality assessment of included studies was not undertaken.

Study Characteristics
A total of 240 articles were retrieved. Six studies were retrieved through hand searching of the reference lists of included studies or forward citation searching. Following title and abstract screening, 59 articles were assigned for full-text retrieval, with 48 articles being included at full-text review (Figure 1). The majority (n=7) of the exclusions after full-text review were because the studies evaluated MGS automation methods rather than pain in mice. The characteristics of the included studies are presented in Table 1. Observational studies were eligible for inclusion. However, the majority (92%) of the studies (n=43) adopted an experimental study design, using the typical randomized controlled trial (RCT) design or pseudo-RCT design (where allocation to groups is systematic and not random). The remaining studies adopted a quasi-experimental design, such as a pre-test, post-test repeated measures design with no group running in parallel. Since the first report of the MGS by Langford and colleagues in 2010, the number of publications investigating the method has grown considerably, reaching an approximately steady rate of around 6-9 publications per year, sustained over the last 5 years (Figure 2).

Animal Model Characteristics
Studies were allocated into three categories based on the types of interventions applied to the mice for subsequent grimace score measurement. The categories considered were 1) animal model, 2) husbandry/procedural and 3) biological. Studies were categorised as utilising animal models if they used an animal model of a human condition likely to cause pain. Husbandry/procedural grouping was applied if the study investigated procedures commonly performed as part of laboratory routines, breeding procedures or veterinary treatments including anaesthesia and analgesia provision. The biological classification was reserved for those studies that investigated grimace scores resulting from inherent biological variation such as between sexes and strains or as a result of difficult to control environmental variables such as circadian rhythms. Based on our classification 65% (n=31) of studies used animal models, 31% (n=15) looked at husbandry/procedural interventions and 4% (n=2) investigated biological variation in grimace scores. It was considered that the interventions applied would lead to pain arising of substantially different natures. We utilised a published pain classification system (Melnikova, 2010) for assignment of studies based on pain type ( Figure 3). Figure 4 presents a sub-classification of the type of animal models or procedures used in the included studies, with expected pain type resulting. The animal model groupings are based on that presented by Hau and Shapiro, 2010. It should be noted that whilst some studies may have had a primary focus on evaluating response to one intervention, they may have reported on impact of other factors, for example sex differences. In reporting, we have considered evidence from all studies irrespective of the classification assigned.

Mouse Characteristics
The included studies used a wide range of inbred strains and outbred stocks of mice. The C57BL/6 strain was the most commonly used (38% of uses), followed by the outbred ICR/CD-1 (24%). Transgenic or knockout/in strains of specific relevance to the research questions investigated in the publications were commonly used (14%). Figure 5 illustrates the relative uses of the various strains. Excluding the mutant, transgenic and other categories, 45% of the mice used were black-coloured, 44% white-coloured and 11% brown/agouti. Considering standard inbred or outbred strains/stocks only, eight studies used more than one strain. (Cho et al., 2019; Miller et al., 2015; Miller and Leach, 2015a, 2016; Rea et al., 2018; Rosen et al., 2017; Sorge et al., 2014; Tillu et al., 2015) Only 3 of these studies directly contrasted grimace scores between the strains. (Cho et al., 2019; Miller et al., 2015; Miller and Leach, 2015a) The direction of effect for grimace scores in these comparisons is presented in Table 2. There are some differences in strain effects on grimace scores between the sexes.
Notes to Figure 5: A number of papers used more than one strain. The ICR and CD-1 nomenclature has been considered to represent the same stock. 'Other' includes hybrid or recombinant strains.
Key to the strain comparisons: The direction of the arrow indicates that the strain at the arrowhead responded with a lower MGS score. Red lines indicate a comparison between female mice, blue lines a comparison between male mice, and black lines comparisons where sex was not separated. A solid line indicates that a live score was used; a dashed line indicates that a retrospective score was used.
Male mice only were investigated in 40% (n=19) of the studies and females only in 21% (n=10), with 36% (n=17) of the studies investigating both sexes. Sex of mice was unreported in one study (Table 3).

MGS Measurement Methods
The majority (88%) of studies evaluated the MGS by retrospective scoring of photographs, either obtained directly by camera or extracted as stills from video footage, as reported in the original study. (Langford et al., 2010) To date only 5 studies have used real time methods, (Bu et al., 2015; Chartier et al., 2020; Gallo et al., 2020; Hsi et al., 2020; Miller and Leach, 2015a) with 3 of these studies directly contrasting these results with those obtained from retrospective scoring. (Gallo et al., 2020; Miller and Leach, 2015a) One study did not state the method of MGS scoring. (Kim et al., 2015) The breakdown of collection method and timing is detailed in Figure 7. In the studies performing direct comparison, live scores were found to be significantly lower than the corresponding retrospective scores in two of the studies. (Miller and Leach, 2015a) In the final study, (Gallo et al., 2020) a principal component analysis (PCA) produced a component on which real time MGS and image scoring were highly intercorrelated (with nesting behaviour as a third factor).
The original study described the MGS in terms of 5 FAUs. However, in 18 (38%) of the studies scoring was modified by excluding specific action units, or in one case by combining the cheek and nose bulge action units into one. (Mai et al., 2018) In the studies that used 4 action units for scoring, whisker position was the action unit excluded in the majority (60%) of cases (Figure 8). In the majority of studies (36/48), the individual action unit scores were averaged to arrive at a final score for the photograph or time point (real time scoring), yielding a maximum score of 2. In 10 studies, the individual action unit scores were summed to arrive at the final score (maximum score of 10 for 5 FAUs). The method of achieving the final score was unclear in the remaining 2 studies. (Hassan et al., 2017; Mitchell et al., 2020) A number of studies accounted for individual responses to pain by using mean difference scores in data presentation and analysis to correct for baseline grimace scores. For studies where the whisker position FAU was excluded, 50% (6/12) of the studies used mice that were black coloured, 33% (4/12) white coloured, and 17% (2/12) brown coloured (χ²(2, N = 12) = 3, p = 0.22).
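The two ways of combining action unit scores described above, and the baseline correction by difference scores, can be illustrated with a minimal Python sketch. It assumes each action unit is scored between 0 and 2, which is implied by the reported maxima (2 when averaging, 10 when summing 5 FAUs). The function names and dictionary keys are hypothetical and are not taken from any of the included studies.

```python
from statistics import mean


def composite_score(fau_scores, method="mean"):
    """Combine per-action-unit scores into one grimace score.

    fau_scores: dict mapping an action unit name to its score (assumed 0 to 2).
    method: "mean" averages the units (maximum 2); "sum" adds them
            (maximum 2 x number of units scored, e.g. 10 for 5 FAUs).
    """
    values = list(fau_scores.values())
    if not values:
        raise ValueError("at least one action unit score is required")
    if any(v < 0 or v > 2 for v in values):
        raise ValueError("each action unit score is assumed to lie between 0 and 2")
    if method == "mean":
        return mean(values)
    if method == "sum":
        return sum(values)
    raise ValueError("method must be 'mean' or 'sum'")


def difference_score(post, baseline, method="mean"):
    """Baseline-corrected score: post-intervention minus baseline composite."""
    return composite_score(post, method) - composite_score(baseline, method)


# Hypothetical scores for one image, using all five action units.
post = {"orbital_tightening": 2, "nose_bulge": 1, "cheek_bulge": 1,
        "ear_position": 1, "whisker_change": 1}
baseline = {"orbital_tightening": 0, "nose_bulge": 0, "cheek_bulge": 0,
            "ear_position": 1, "whisker_change": 0}

print(composite_score(post, "mean"))
print(composite_score(post, "sum"))
print(difference_score(post, baseline, "mean"))
```

Because the summed score depends on how many action units are scored, a study that drops whisker scoring has a maximum of 8 rather than 10, whereas the averaged score keeps the same 0 to 2 range. This is one reason the two approaches are not directly comparable across studies.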

Figure 8: Facial Action Units (FAUs) utilised for scoring in the included studies. n represents the number of studies. Specific action units were generally excluded as described here, although in one study two of the action units were combined.
A range of study durations were used in included studies, often with multiple time points being assessed within a single study. Duration of MGS assessment ranged from directly after the intervention to over a month following. This is illustrated in Figure 9, categorised by expected pain type. Refer to Table 1 for detail of interventions applied in the studies.
Figure 9. Heat map contrasting the type of pain expected to arise from the interventions with the time points investigated after the intervention. Colour gradation represents the percentage of studies where grimace scores moved in the expected direction of effect (legend bins: 20-40%, 40-60%, 60-80%, 80-100%), with increased shading indicating a greater number of investigations; for example, 100% of studies evaluating procedures likely to cause acute pain showed increased MGS scores within the 24 hours after the intervention. * Consider that no change in MGS score is expected.

Corroborating Methods of Affective State Assessment Used
A range of alternative methods for assessing animal affective state were utilised in 37/47 (79%) of the included studies (Figure 10). These methods were largely behavioural in nature but did include measures of physiology, such as corticosterone analyses or bodyweight (being an expression of feeding behaviour). The most common measures used, in rank order, were: Von Frey filaments for assessment of mechanical allodynia; bodyweight; general clinical/disease scoring, which may have been tailored to the model used (e.g. an EAE scoring scheme); burrowing behaviour; pain-related behaviour scoring, such as the use of composite pain measures; and open field tests for activity and locomotion. In the majority of cases (31 studies), data from these tests corroborated MGS scoring. In the remaining studies, either no association was seen with the chosen measures, (Miller et al., 2015; Mitchell et al., 2020; Zhu et al., 2017) or there was unclear reporting or lack of direct comparison in the same animals. (Hsi et al., 2020)

Circadian Rhythm
In the majority of the studies there was no specific reporting of the light cycle stage at which MGS data were recorded. Given the lack of reporting, it was assumed that recording was performed during the light stage. Five studies (11%) either reported conducting recording during the dark stage, or had measurement timelines suggesting that both stages would be crossed. (Dwivedi et al., 2016; Jurik et al., 2014; Matsumiya et al., 2012; Miller and Leach, 2015a; Rea et al., 2018) However, only three of these studies examined circadian rhythm effects. (Matsumiya et al., 2012; Miller and Leach, 2015a; Rea et al., 2018) These studies and their impact on the MGS are reported in Table 4.
Table 4: Grimace scores were higher in the dark than in bright light for the CD1 mice. Light transition led to decreased orbital tightening and nose bulge. C57BL/6J mice showed no significant difference between the CGRP-induced grimace in light and dark. Responses to CGRP were generally similar in direction to those recorded in the light.

Variability Arising From Observers
A number of studies (20/48) utilised more than one observer for ascertaining grimace scores. Ten of these studies (Table 5) specifically reported the metrics associated with agreement between the observers, that allowed them to combine the results with an assurance of external reliability.

Table 5
Study | Number of observers | Consistency metrics
Inter-observer variability
Faller et al., 2015 | 2 | There was an excellent correlation between the two observers for MGS measurement (r = 0.98), assessed using Type II regression analysis. However, Bland-Altman analysis showed that the slope differed from unity, with a bias towards higher MGS scores in one observer.
Hohlbaum et al., 2020 | 4 (2 novice, 2 expert scorers) | Good agreement between all observers was observed (ICC = 0.851) when all three time points were examined. However, inter-rater reliability differed across time points. The best agreement was achieved for orbital tightening and the poorest for nose and cheek bulge, and this depended on the observers' experience levels. In general, experienced observers produced scores of higher consistency than inexperienced observers.
Langford et al., 2010 | 7 | Inter-rater reliability was high as assessed by intra-class correlation coefficient (ICC average = 0.90). When high-definition video cameras were used, over 97% of pain versus no-pain images were categorised correctly.
Mittal et al., 2016 | 6 | ICC and Cronbach's alpha values were low (ICC average < 0.7, α < 0.8). This resulted from large intra-coder variability for three of the coders. Therefore, only the results of the coders with low variability were used in data presentation (updated metrics not reported).
Rea et al., 2018 | 2 | Correlation coefficients ranged between 0.89 and 0.92.
Roughan et al., 2016 | 4 | There was high inter-observer consistency, with ICC values ranging from 0.75 to 0.84.
Roughan and Sevenoaks, 2019 | 6 novice and 6 expert scorers | The α values for experts and novices were high (0.88 to 0.94 and 0.78 to 0.87 respectively). Agreement between novices and experts was generally good (ICC ranging from 0.7 to 0.84 across the time points).
Sorge et al., 2014 | 2 | Moderate to high inter-rater correlation (r = 0.64, P < 0.001). Group data from one rater compared to the other were almost identical.
Tuttle et al., 2018 | 2 | High inter-rater consistency, with Cronbach's alpha of 0.89.
Jirkof et al., 2020 | 3 | Median MGS scores were significantly different at a number of time points between the 3 laboratories. They were however qualitatively similar, i.e. in direction of effect.
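Several of the studies above report Cronbach's alpha as a measure of consistency between observers. As a rough illustration only, the sketch below applies the standard textbook formula, treating each observer as an "item" and each scored image as a case. The included studies do not describe their exact computations, so this is an assumption about the general approach rather than a reproduction of any study's method, and the function name and example scores are hypothetical.

```python
from statistics import variance


def cronbach_alpha(ratings):
    """Cronbach's alpha for a set of observers scoring the same images.

    ratings: list of lists, one inner list of scores per observer,
             all inner lists the same length (one score per image).
    """
    k = len(ratings)
    if k < 2:
        raise ValueError("at least two observers are required")
    n = len(ratings[0])
    if n < 2 or any(len(r) != n for r in ratings):
        raise ValueError("each observer must score the same two or more images")

    # Variance of the per-image totals summed across observers.
    totals = [sum(scores) for scores in zip(*ratings)]
    total_var = variance(totals)
    if total_var == 0:
        raise ValueError("total scores do not vary; alpha is undefined")

    # Sum of each observer's own score variance.
    observer_var = sum(variance(r) for r in ratings)
    return k / (k - 1) * (1 - observer_var / total_var)


# Hypothetical scores from two observers for the same four images.
observer_a = [0, 1, 2, 2]
observer_b = [0, 2, 2, 1]
print(round(cronbach_alpha([observer_a, observer_b]), 3))
```

Note that alpha reflects consistency in how observers rank the images, not absolute agreement, so a systematic bias in one observer (as found by Bland-Altman analysis in Faller et al., 2015) would not necessarily lower it.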

Discussion
In this paper we have presented the first comprehensive overview of all studies investigating the MGS, assimilating information on the types of animal models/conditions where the MGS has been applied, methods applied, and external factors affecting validity of the technique. It is hoped that this assimilation will guide future validation, and use of the MGS by researchers and thus promote wider scale implementation of the method. Key findings of our assimilation are discussed below.

Methods Used
To date the majority (88%) of uses of the MGS in biomedical research settings have used retrospective recording, either through collection of video footage and subsequent still extraction, or through primary collection of photographic images. Retrospective scoring brings some key advantages when using the MGS as a research outcome measure. These methods provide a greater degree of certainty in the findings by allowing scores to be re-confirmed and thus the data replicated, by allowing multiple observers to cross-check, and by allowing scoring to occur at a time that suits the researcher. (Mota-Rojas et al., 2020) This can all occur without the potential modulating influence of a human observer on the scores. (Sorge et al., 2014) Whilst not discussed in the included studies, an assumed challenge in using cameras to secure facial images is the need to achieve a face-on shot. This might be achieved by using a 'burst' mode to take photos in rapid succession, or by an observer taking photographs manually. However, the latter does raise concerns about the effect of observer presence on grimace scores and the impact of any noise produced by the camera when photographs are taken.
A real time method has advantages for clinical pain assessment, since scores can be attained quickly, allowing immediate action such as applying a humane endpoint or providing analgesics. The method may also provide some advantages in a research scenario by limiting the need for post-processing of images, which is invariably time consuming. (Mogil et al., 2020) To date there has been limited evaluation of real time scoring in mice, and of the five studies that have utilised this, only three directly contrasted it with validated retrospective scoring methods. Two studies found live scores to be lower than the corresponding retrospective scores. (Miller and Leach, 2015a) One reason proposed for the lower scores from live scoring is that the face changes rapidly during live scoring, whereas random selection of images will, for example, capture blinking, which is assigned a high score and so contributes to relatively higher scores. (Miller and Leach, 2015a) Alternatively, as proposed by Chartier et al., 2020, the presence of a human observer in real time scoring may influence mouse performance of the facial action units; increased alertness could lower the grimace scores through eye widening and 'pricking' of ears. It should be noted that there are considerable differences in the technique used for collection of real time data, with some studies basing a score on a single observation point, (Bu et al., 2015; Gallo et al., 2020) as opposed to mathematical integration of several scores taken across a period. (Miller and Leach, 2015a) The former would be simpler in a clinical context but may be associated with loss of sensitivity and validity. In spite of this, point grimace scores were determined to move in the expected direction of effect in these studies, implying validity.
In rats there has been dedicated study into methods of real time scoring and their relationship with retrospective scoring, (Leung et al., 2016;Leung et al., 2019) and this is clearly needed in mice.
Whilst 62% of studies did use all five of the originally described action units for scoring, in a significant proportion of studies (37%) scoring was modified by excluding specific action units, or by combining units. Most of these adaptations involved excluding whisker scoring, which seems to be regarded as hard to visualise and score. (Mogil et al., 2020) It has been suggested by some authors that this difficulty in scoring whiskers is related to black coat colour. (Cho et al., 2019; Mai et al., 2018) However, this proposition is not supported by our synthesis, which implies that whiskers are excluded from scoring at similar rates independent of coat colour (although study numbers are low). Inexperience in scoring may also affect the ability to accurately identify action units; for instance, Hohlbaum et al., 2020 demonstrated that cheek and nose bulge scoring had reduced inter-observer agreement compared with orbital tightening, with inexperienced scorers having even lower accuracy.

Validity of the MGS across a range of pain types
The MGS is described as a measurement of pain, i.e. it has face validity for pain. There is clear evidence from the included studies that the MGS changes in response to painful events and is modified by analgesics, further supporting this proposition (see e.g. Faller et al., 2015; Leach et al., 2012; Matsumiya et al., 2012). However, another important aspect of the validity of a pain measure is the extent to which the technique measures pain and is not influenced by other conditions such as sickness behaviour, in other words whether it has construct validity. This review assists in evaluating these concepts in a number of ways.
It is clear that whilst the majority of the studies examining the MGS are conducted by researchers in the pain field, there has now been use of the technique across a range of non-pain focussed animal models, with the technique being especially utilised in the oral health science and neuroscience fields. There has also been significant focus on the technique in husbandry and welfare investigations in mice, with a focus on the effects of surgery and analgesic administration on the score, and by inference pain.
In the majority of these models, especially over an acute timeframe, the MGS has good utility. However, even though use of the technique has increased over the past decade, 48 studies is a small fraction of all the studies being conducted in laboratory mice. It is surprising that more researchers have not taken the opportunity to include the technique in their study. This may be due to a lack of awareness, among researchers outside the pain and veterinary research fields, of both the technique and its validity. It is hoped that this review will raise awareness among these researchers, but there is probably a significant role for animal ethics committees in this dissemination effort.
In the original study by Langford et al., 2010 it was considered that the MGS was only suitable for measuring acute pain, based on the lack of grimace response when models of chronic pain were applied. This would make sense from an evolutionary perspective since, as prey animals, mice may learn to control a facial pain response to avoid predation. (Matsumiya et al., 2012) However, later studies question this assumption. Figure 9 provides clear evidence that, across a range of different expected pain types, grimace scores are detected up to a month post-initial insult in situations where pain might be expected. This evidence is particularly strong for neuropathic pain, which might be expected to be longer lasting and has been investigated in a reasonable number of studies. In visceral or mixed pain the MGS also appears to be able to detect an effect, but there have been limited studies, and it should be noted that the studies into mixed pain both come from the same laboratory looking at pain in one model of breast carcinoma. (de Almeida et al., 2019; de Almeida et al., 2020) There is clearly a need for future study in models where these types of pain are expected. To date, no studies have shown changed grimace scores at time points greater than a month after the assumed painful treatment. However, only two studies specifically looked at these time points and there is the possibility that pain was not actually present at these times, especially in one of the studies, which utilised a relapsing-remitting colitis model induced by DSS. Interpreting findings at these later time points is made more challenging by the lack of other validated measures of pain against which to corroborate MGS findings.
A range of physiological and behavioural outputs were measured in the included studies which lend support to the proposition that the MGS has good construct validity. These included the use of assessment of mechanical allodynia, general clinical scoring, pain behaviour scoring or indicators of luxury behaviour such as nest building or burrowing. In the main, outcomes from these tests moved in the same direction as mouse grimace scores, suggesting convergent validity. However, out of all the measures assessed, arguably only a couple are specific to pain and are plagued by the same issue that surrounds MGS validation; that of establishing incontrovertibly what they are measuring. For example, burrowing and nest building behaviour are largely taken to be generalised indicators of well-being or affective state, (Jirkof, 2014) and are modified not just in response to pain, but sickness behaviours see e.g. (Cunningham et al., 2007;Gaskill and Pritchett-Corning, 2016;Jirkof et al., 2013;Whittaker et al., 2015). Composite pain behaviour scoring and use of Von Frey testing are specific to pain and therefore more reliable corroborating measures. However, the debate around the differences between nociception and pain needs to be borne in mind (see Deuis et al., 2017 for full discussion). The former is a physiological function, but a reaction to a stimulus does not necessarily signify the experience of pain. Therefore the widespread historical use of stimulus-evoked tests, such as the Von Frey filaments, may be a contributing factor to the poor translation rates in pain research. (Deuis et al., 2017) One of the key cited advantages of the MGS is that it measures spontaneous pain. (Mogil, 2009) Based on this discussion perhaps the most reliable corroborating measure against which to assess the MGS is another readout of spontaneous pain, with composite pain behaviour scoring being the only measure to completely fulfil this description. 
In the studies that compared these two readouts, the direction of effect was aligned but the studies are few in number (5). (Hassan et al., 2017; Jurik et al., 2014; Leach et al., 2012; Miller et al., 2015)
Another finding of this review that questions the construct validity of the MGS is the change in grimace scores in response to techniques that would not be expected to elicit pain. Of the eleven studies that examined the MGS over the 24 hour period after an intervention expected to elicit no pain or only momentary pain, six found grimace score elevations. Further examination shows that three of these studies were examining the effect of anaesthesia/analgesia on grimace scores. (Hohlbaum et al., 2017, 2018; Miller et al., 2015) In general both inhalational (Hohlbaum et al., 2017; Miller et al., 2015) and injectable anaesthetics (Hohlbaum et al., 2018) increase scores in the absence of a presumed painful event. However, whilst analgesia might similarly be expected to elevate scores, neither tramadol (Jirkof et al., 2020) nor buprenorphine (Miller et al., 2015) was found to have any impact. The impact of the anaesthetics is short-lived, having resolved by 24 hours. It is postulated that this could be related to a 'hangover' or sedative effect remaining after the procedure, which could be envisaged to lead to eye closure as in sleep. However, a lingering muscle relaxant effect could perhaps similarly affect the other action units. The evidence for an elevation with inhalational anaesthetics is also not clear, with a strain effect being identified in the Miller et al., 2015 study. The study by Sorge et al., 2014 is mechanistically different to the other studies within this group, since exposure to a painful insult was applied, with differences in grimace response shown to result from a form of male pheromone-induced stress analgesia.
The remaining two studies found increased grimace scores as a result of blood sampling (Meyer et al., 2020) and handling and identification. (Roughan and Sevenoaks, 2019) In the former, (Meyer et al., 2020) facial vein and retrobulbar bleeding increased scores in the immediate post-procedural period. This study also provides further evidence for the effects of isoflurane on the MGS, with increased scores seen in anaesthetised compared to sham-handled groups. In the study of Roughan and Sevenoaks, 2019, increased scores were seen as a result of tail handling and ear tagging. There are several points of relevance here in relation to MGS construct validity. Firstly, the blood sampling interventions applied are likely to produce momentary pain as opposed to no pain, (Whittaker and Barker, 2020) so evidence of a change actually supports construct validity. Secondly, tail handling has been suggested to be aversive rather than painful, (Hurst and West, 2010) so an effect does call into question the specificity of the scale for pain (although it should be noted that a previous study found no effect of handling). Thirdly, whilst blood sampling only caused immediate post-procedural changes in MGS (later time points were not examined), differences between groups for handling and identification often persisted for 24 hours, when it might be assumed that any pain would have resolved, although, as demonstrated in this study, at a time point when inflammation remained. (Roughan and Sevenoaks, 2019) Interestingly, there was also non-convergence of findings relating to inflammatory response and MGS, with tunnel-handled mice demonstrating a greater response than tail-handled animals.
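The notion of convergence used in this section, namely that a corroborating measure "moves in the same direction" as the MGS, can be made concrete with a small tally. The sketch below is illustrative only and is not the analysis performed in this review; the function name, the input layout and the example values are all hypothetical. The one substantive assumption is that each measure is flagged by whether it is expected to rise with pain (as grimace scores do) or to fall (as burrowing or nest building would).

```python
def direction_agreement(pairs):
    """Count how often a corroborating measure agrees in direction with the MGS.

    pairs: iterable of (mgs_change, other_change, rises_with_pain), where
    rises_with_pain is False for measures expected to fall when pain
    increases (hypothetical layout, chosen for illustration only).
    Returns (number agreeing, number compared).
    """
    agree = 0
    compared = 0
    for mgs_change, other_change, rises_with_pain in pairs:
        if mgs_change == 0 or other_change == 0:
            continue  # no direction to compare
        # Flip measures that fall with pain so both share one sign convention.
        aligned = other_change if rises_with_pain else -other_change
        compared += 1
        if (mgs_change > 0) == (aligned > 0):
            agree += 1
    return agree, compared


# Hypothetical changes from baseline (not data from any included study).
example = [
    (0.5, -3.0, False),  # MGS up, burrowing down: agreement
    (0.4, 2.0, True),    # MGS up, pain behaviour score up: agreement
    (0.3, 1.0, False),   # MGS up, nest score up: disagreement
]
print(direction_agreement(example))
```

A tally of this kind says nothing about specificity: a measure that also responds to sickness or stress will still agree in direction, which is the limitation discussed above.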

Reliability
Pain scales should be reliable, that is, produce similar results whenever they are used. (Good et al., 2001) This requires that between-animal, intra-animal and temporal variations are minimised unless they result from differences in pain experience. Reliability impacts on validity since, if errors in measurement are significant, the scale no longer performs well at assessing pain. (Mogil et al., 2020) The included papers assessed a number of measures of reliability including within-observer (intra-observer), between-observer (inter-observer) and across-site variability.
Whilst a fair proportion of the studies investigating grimace scales utilised more than one observer for scoring, only 50% of these analysed and reported on between-observer metrics. This represents a significant loss of data on the reliability of these scales. It also raises the question of whether these data were not analysed, or were analysed but not reported, perhaps because of low agreement. If protocol registration were more readily available, and more widely taken up, in pre-clinical studies, this question might not have arisen. Moreover, in encouraging the use of these scales for practical welfare assessment as clinical tools, this question is important; few institutions will be able to rely on the same, single observer to perform all scoring.
Based on the limited evidence available, inter-rater agreement generally ranges from good to excellent. However, a recent study (Hohlbaum et al., 2020) does suggest that this may change across time, with differences potentially being obscured by assimilation of all data. This is a factor that should be considered in future studies. Related to this, there may also be differences in scores for similar treatments when taken across laboratories. (Jirkof et al., 2020) It is not clear whether this relates to inter-observer differences, or differences in housing/test conditions, but it does call into question the external validity of MGS results. (Jirkof et al., 2020) However, importantly this study did find that whilst values across research centres were numerically different, direction of effect was similar so general validity was maintained. Fewer studies have reported on intra-rater variability, although the study by Mittal et al., 2016, which used a large number of coders (6), did report significant within-coder variability in 3 individuals. All of these findings raise the question of whether training and experience in the use of the scales impacts on reliability. Few studies have specifically examined this, and detailed information on training was rarely provided in the included studies. Evidence for a training effect is currently conflicting, with one study (Hohlbaum et al., 2020) suggesting greater consistency if scorers were experienced, whilst another study (Roughan and Sevenoaks, 2019) found good correlation between novice and expert scorers. The impact of training, and the type/frequency of training needed to produce reliable grimace scores, is an area that needs further research, especially if the technique is going to gain more widespread acceptance as a pain assessment tool. This is also a particular consideration for real time scoring, which needs to be performed quickly and does not offer the opportunity for re-review of collected images.
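The point that pooling can obscure time-dependent differences in agreement can be illustrated with a short calculation. The sketch below is not the analysis used by any of the included studies, which did not share a single agreement statistic; it assumes, purely for illustration, that agreement is summarised with a two-way random effects, absolute agreement, single-rater intraclass correlation coefficient, ICC(2,1). The function names, the data layout (one row per image or animal, one column per observer) and the example scores are hypothetical.

```python
def icc2_1(scores):
    """ICC(2,1) for a table with one row per subject and one column per rater.

    Requires at least two subjects and two raters, and some variation in
    the scores (otherwise the denominator is zero).
    """
    n = len(scores)       # subjects (e.g. images or animals)
    k = len(scores[0])    # raters (observers)
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Two-way ANOVA sums of squares without replication.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in scores for v in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


def icc_by_time_point(scores_by_time):
    """Return ({time point: ICC}, ICC over all time points pooled)."""
    per_time = {t: icc2_1(rows) for t, rows in scores_by_time.items()}
    pooled_rows = [row for rows in scores_by_time.values() for row in rows]
    return per_time, icc2_1(pooled_rows)


# Hypothetical mean grimace scores from two observers (not real data).
scores_by_time = {
    "1 h": [[0.2, 0.4], [1.0, 1.2], [1.6, 1.4], [0.8, 0.6]],
    "24 h": [[0.2, 0.8], [0.4, 0.2], [0.6, 1.0], [0.2, 0.6]],
}
per_time, pooled = icc_by_time_point(scores_by_time)
print(per_time, pooled)
```

Reporting the per-time-point values alongside the pooled value is the design choice of interest here: a single pooled coefficient can look acceptable even when agreement at one time point is poor, because between-subject variance is larger once time points are combined.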

The Impact of Biological Variation and the External Environment on the MGS
The synthesis demonstrates that there are a number of features of biology and the external environment that influence grimace scores. These include the influences of strain, sex, the circadian cycle and observers. These differences should be considered in future investigations of grimace scores, especially in the development of intervention scores.
A limited number of studies have directly contrasted more than one strain (Cho et al., 2019; Miller et al., 2015; Miller and Leach, 2015a). It is difficult to draw any conclusions on the impact of strain on grimace scores since there appear, at least on the basis of one study, (Miller and Leach, 2015a) to be interactions between sex and strain on grimace scores. In general, with some exceptions due to sex differences, it appears that the strain order from propensity to score low to high is C57BL/6, CD1, C3H/H, BALB/c. However, it is worth noting that much of this information on strain differences comes from one study, (Miller and Leach, 2015a) where a painful insult was not applied. This may be of relevance, particularly on consideration of the interaction between sex and strain, since it is well established that there are differences in pain thresholds between male and female rodents, with females having a lower pain threshold in response to a variety of nociceptive inputs. (Hurley and Adams, 2008) It is interesting to note that this strain ranking shows no obvious trend based on coat colouration, implying that inability to score individual action units due to this may have minimal impact on scores obtained.
Evidence as to the presence, or nature of any differences in scores as a result of sex is far from settled. The majority of studies that compared sex differences within the same strain found no differences in scores. In regard to the minority of studies that did find sex differences, there is a fairly even split between those that found scores were lower in females and vice versa. This is perhaps surprising given the finding using traditional pain assays that female rodents have a lower pain threshold in the face of hot thermal, (Sternberg et al., 2004) chemical, (Gaumond et al., 2002) inflammatory, (Dina et al., 2001) and mechanical nociceptive insults. (Barrett et al., 2002) However, varied findings in relation to sex differences are not uncommon in these other models, and probably arise due to differences in study design as well as genotype. (Mogil et al., 2000) The absence of a sex effect in the majority of the studies that evaluated both sexes may also speak to a lack of sensitivity of the scoring, whereby differences are present, but cannot be discriminated. Another more general finding arising from the assimilation is that in spite of increased promotion of the use of both sexes in preclinical research due to concerns about translation, (Clayton and Collins, 2014;Whittaker and Hickman, 2020) the majority of studies used one sex (predominantly males). Even when two sexes were used in the included studies, an opportunity was often missed by failing to make direct comparisons between them.
Circadian rhythms commonly apply to biological and physiological processes in animals. (Konecka and Sroczynska, 1998) Mice as nocturnal animals are active mainly during the dark phase. (Ripperger et al., 2011) The strength of this circadian clock is such that even in constant darkness this pattern of activity will persist despite the absence of external cues. (Ripperger et al., 2011) There is also evidence of a circadian rhythm in pain sensitivity across a range of animal species, (Frederickson et al., 1977;Hamra et al., 1993;Konecka and Sroczynska, 1998) potentially brought about by a rhythm associated with opioid peptide production. (Naber et al., 1981;Oliverio et al., 1982) Considering that general levels of activity are likely to confound behavioural measurements particularly (although not exclusively), it follows that experimental protocols would control for this, and report on time of testing. This also raises the question of whether performing behavioural tests in the light phase is a major methodological error. (Yang et al., 2008) Given this, it is surprising that many of the included studies failed to report on timing of MGS measurements; this being an item in the updated ARRIVE guidelines recommended set. (Percie du Sert et al., 2020) Given the lack of dedicated study and reporting deficiencies, there is limited evidence to support or refute an effect of circadian rhythm on the MGS. However, two studies hint at potential differences (Matsumiya et al., 2012;Rea et al., 2018) with a suggestion of higher scores or pain in the dark phase. Nevertheless, Rea et al. 2018 did discuss that light transition appeared to cause decreases in orbital tightening and nose bulge, and it is not clear whether this effect would have persisted once acclimatised to the light.
Observer effects on the scale have been little investigated. This is unsurprising given that the majority of studies using the MGS have utilised retrospective analysis for scoring. However, as previously discussed, observer effects may be relevant when photography is used, and are of clear importance in real time scoring since it is well established from animal behaviour research that a human observer may influence animal behaviour. (Martin and Bateson, 2007) There is some suggestion from other species of minimal impacts on grimace scores by human observers, see e.g. Leung et al., 2016. However, this needs dedicated investigation in the context of mice. Furthermore, the nature of the observer may be important in determining their impact on scores. For example, Sorge et al., 2014 demonstrated that the presence of human males led to a stress-induced analgesia and reduced grimace scores, and familiarity with the observer may also be a factor in response. (Mogil et al., 2020)

Conclusions and Recommendations for Future Research
This review has assimilated all primary literature to date on the MGS. It is concluded that the MGS has utility across a range of animal models, and expected pain types. There do however appear to be some differences arising as a result of biological variation such as sex or strain of mouse. These variables need consideration in study design or analysis to account for them appropriately. There is also some limited evidence that the MGS may not be wholly specific to pain. However, this evidence mainly comes from studies into husbandry or drug interventions, the latter generally only having a short-term effect, which can likely be explained by the pharmacological effects. It would be interesting to delve further into any potentially non-pain related grimace effects in animal models where other symptoms might be assumed to co-occur with pain, for example sickness behaviour. This could potentially be achieved by using analgesics to eliminate the pain response, although of course the risk of drug confounding would need to be considered.
Further research is needed on the use of the MGS as a real time method, and how this can be done to maintain validity of the method, whilst being practically feasible. Related to this is the question of how reliable scoring between observers is, and what type of training (if any) is needed to maximise between observer agreement. Finally, whilst there is suggestion from studies in this synthesis, (Sorge et al., 2014) and others, (Langford et al., 2006) that there is a social modulation of pain by conspecifics and the presence of other species, there has been little investigation of this fascinating area in the context of grimace responses.

Funding
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors. AW was supported by a Peter Doherty Biomedical Fellowship (APP1140072).

Medline
((Mouse[tiab]) OR (Mice[tiab]) OR (Murine[tiab]) OR (Murin*[tiab]) OR (Mus[tiab]) OR (Musculus[tiab]) OR (Transgenic Animal[tiab]) OR (Mice[mh])) AND ((Grimace Scale) OR (Grimace Score[tiab]) OR (Facial grimace[tiab]))

Scopus
TITLE-ABS-KEY("Mouse" OR "Mice" OR "Murine" OR "Murin* " OR "Mus " OR "Musculus" OR "Transgenic Animal") AND TITLE-ABS-KEY("Grimace Scale" OR "Grimace Score" OR "Facial grimace")

Web of Science
TS=(Mouse OR Mice OR Murine OR Murin* OR Mus OR Musculus OR Transgenic Animal OR Mice) AND TS=(Grimace Scale OR Grimace Score OR Facial grimace)

Heat map contrasting interventions used by the type of pain expected to be elicited. Colouration/number in box represents number of studies. Whilst some studies could arguably have been included in multiple categories, to simplify reporting one category has been assigned. ǂ Pain in this study, Mittal et al., 2016, was assigned as acute since induced by cold stress, although sickle cell pain can be neuropathic in origin. * Model of Hsi et al., 2020 did not relate to a neuropathy.

Note that a number of papers used more than one strain. The ICR and CD-1 nomenclature has been considered to represent the same stock. Other includes hybrid or recombinant strains.

Figure 6. Network map comparing MGS scores between strains. Each line represents a study effect. The direction of the arrow represents that the strain at the arrowhead responded with a lower MGS score. Red lines indicate a comparison between female mice, blue lines indicate comparison between male mice and black lines indicate comparisons where sex was not separated. A solid line indicates that a live score was used, a dashed line indicates that a retrospective score was used.

Heat map contrasting type of pain expected to arise from the interventions with the time points after the intervention investigated. Colouration gradation represents percentage of studies where grimace scores moved in the expected direction of effect, with increased shading indicating greater number of investigations, for example 100% of studies evaluating procedures likely to cause acute pain showed increased MGS scores within the 24 hours after the intervention. * Consider that no change in MGS score is expected.

Figure 10. Word cloud illustrating corroborating methods of affective state assessment used in the included studies. Size of the word illustrates their relative frequency of use.