Smoothing in Ordinal Regression: An Application to Sensory Data

The so-called proportional odds assumption is popular in cumulative, ordinal regression. In practice, however, such an assumption is sometimes too restrictive. For instance, when modeling the perception of boar taint on an individual level, it turns out that, at least for some subjects, the effects of predictors (androstenone and skatole) vary between response categories. For more flexible modeling, we consider the use of a 'smooth-effects-on-response penalty' (SERP) as a connecting link between proportional and fully non-proportional odds models, assuming that parameters of the latter vary smoothly over response categories. The usefulness of SERP is further demonstrated through a simulation study. Besides flexible and accurate modeling, SERP also enables fitting of parameters in cases where the pure, unpenalized non-proportional odds model fails to converge.


Introduction
The production of entire male pigs is an alternative to surgical castration taking animal welfare concerns into account. However, the elevated levels of so-called boar taint may impair consumer acceptance (see, for example, the Ref. [1] and references therein). Boar taint is (presumably) caused by two malodorous volatile substances, namely: androstenone and skatole (compare, for example, the Ref. [2]). In an experimental study presented in [3], fat samples from roughly 1000 pig carcasses were collected and subjected to a thorough sensory evaluation and quantification using a panel of 10 trained assessors on a sensory score scale ranging from 0 = 'untainted' to 5 = 'strongly tainted'. The absolute frequencies of the panelist scores are shown in Figure 1. The question of interest is how the sensory evaluation is influenced by the samples' androstenone and skatole content. In this context, the Ref. [3] considered the average panel ratings as the response, which gives a quasicontinuous variable, and made standard linear modeling the approach of choice. As an alternative, panel ratings may be discretized to a binary outcome, with a typical cut-point for dichotomization (boar-tainted/no boar taint) fixed at 2; compare, for example, the Ref. [2]. With this, a binary (e.g., logit) model can be used instead of standard linear modeling. On an individual, subject-specific level, dichotomization/binary regression is thus a sensible approach as well, whereas linear modeling with a relatively small number of (ordinal) response categories may be questionable [4]. However, dichotomization still poses the problem of loss of information and choice of the threshold. Thus, for a clearer understanding of the individual panelists' rating patterns of deviant smell, we (a) consider an ordinal model utilizing as much information as possible, and (b) fit those models to each panelist separately. 
The latter is done because the effects of androstenone and skatole can be very different between people, and it is important to examine those subject-specific effects, for instance, when selecting potential raters to identify boar-tainted carcasses at the slaughter line. That is why [3] also provided modeling on an individual level as supplementary information (online appendix), but only considered the dichotomized data. Thus, with prediction/inference in mind, to realize an adequate predictive model of deviant smell via an ordinal regression model, we consider the following rating scale: no boar taint (0/1), low boar taint (2), medium boar taint (3) and high boar taint (4/5), where the extreme categories are collapsed (due to the small number of observations in categories 1 and 5). A very popular type of ordinal regression model is the cumulative logistic model, particularly the so-called proportional odds model [5]. Using the latter, however, imposes some restrictions that may not hold for at least some of our raters. Omitting those restrictions and allowing for the most flexible cumulative logit model, on the other hand, may result in numerical problems in the fitting algorithm, high variability in the estimated effects of androstenone and skatole, and results that are hard to interpret. We hence discuss a regularized cumulative model here that is a data-driven compromise between proportional and fully non-proportional odds models, assuming that parameters of the latter vary smoothly across categories, an idea that has already been applied successfully with rating scales as predictors [6]. As will be illustrated, the model proposed maintains the flexibility of the non-proportional odds model and adapts very well to the underlying data structure, while at the same time providing competitive or better accuracy with respect to parameter estimates and prediction. Furthermore, results are typically easier to interpret.
The rest of the paper is organized as follows: we begin with a short review of cumulative models for ordinal response in Section 2, and introduce our smoothing/penalty approach in Section 3. A simulation study is found in Section 4, while application to the sensory data is provided in Section 5. Section 6 concludes with a discussion.

Cumulative Models for Ordinal Response
Models for categorical outcome variables have been the subject of many discussions, with several approaches available in the literature for various forms of empirical applications; see, for example, the Refs. [7,8]. Assuming a categorical response variable Y, with k distinct but ordered categories, the information supplied by each response category can be incorporated in a model using the ordinal rather than the multinomial class of models [8,9]. The ordered model that is probably most frequently used is the cumulative model developed by McCullagh [5], which is not only popular from a frequentist, but also a Bayesian point of view; see, for example, the Ref. [10]. The model is often motivated by an underlying/latent, continuous variable, say Ỹ, and a linear model of the form
\[
\tilde{Y}_i = x_i^\top \tilde{\delta} + \epsilon_i, \tag{1}
\]
where \(x_i = (x_{i1}, \ldots, x_{ip})^\top\) is a vector of covariates observed on unit i = 1, . . . , n, \(\tilde{\delta}\) is the vector of corresponding regression parameters, and \(\epsilon_i\) an error term with continuous distribution function F. Then, it is assumed that the observable, ordinal response Y is obtained via the threshold model
\[
Y_i = r \quad \Longleftrightarrow \quad \tilde{\delta}_{0,r-1} < \tilde{Y}_i \leq \tilde{\delta}_{0r},
\]
with \(-\infty = \tilde{\delta}_{00} < \tilde{\delta}_{01} < \cdots < \tilde{\delta}_{0k} = \infty\) being cut-points on the (latent) scale of Ỹ. It then follows that
\[
P(Y_i \leq r \mid x_i) = F\big(\tilde{\delta}_{0r} - x_i^\top \tilde{\delta}\big). \tag{2}
\]
The choice of F in (2) results in different forms of the cumulative model. Normally distributed \(\epsilon_i\), for instance, leads to the so-called (cumulative) probit model. Note, however, that a Gaussian \(\epsilon_i\) does not mean that the latent \(\tilde{Y}_i\) is also marginally normal. Nor are the thresholds \(\tilde{\delta}_{0r}\) assumed to be equidistant. As a consequence, skewed or bimodal (ordinal) data can also be analyzed within the framework of a cumulative/latent variable model. Besides the probit model, the most popular cumulative model is the cumulative logit model, which is obtained from F being the logistic distribution function, such that
\[
\log \frac{P(Y_i \leq r \mid x_i)}{P(Y_i > r \mid x_i)} = \tilde{\delta}_{0r} - x_i^\top \tilde{\delta}. \tag{3}
\]
In the case of our sensory data, however, it makes sense to rewrite model (3) in terms of
\[
\log \frac{P(Y_i > r \mid x_i)}{P(Y_i \leq r \mid x_i)} = \delta_{0r} + x_i^\top \delta, \tag{4}
\]
with \(\delta_{0r} = -\tilde{\delta}_{0r}\) and \(\delta = \tilde{\delta}\).
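To make the latent-variable construction concrete, the following minimal Python sketch (our own illustration, not the authors' code, which was written in R; all function names are ours) computes the cumulative probabilities of (2) under a logistic F and draws ordinal responses via the threshold model:

```python
import math
import random

def cumulative_probs(x, thresholds, delta):
    """P(Y <= r | x), r = 1, ..., k-1, as in (2) with logistic F."""
    eta = sum(xj * dj for xj, dj in zip(x, delta))   # x' * delta_tilde
    F = lambda z: 1.0 / (1.0 + math.exp(-z))         # logistic distribution function
    return [F(t - eta) for t in thresholds]

def draw_ordinal(x, thresholds, delta, rng):
    """Draw Y via the threshold model: Y = r iff t_{r-1} < Y_latent <= t_r."""
    eta = sum(xj * dj for xj, dj in zip(x, delta))
    u = rng.random()
    y_latent = eta + math.log(u / (1.0 - u))         # logistic noise via inverse CDF
    return 1 + sum(y_latent > t for t in thresholds)
```

Because the noise has a skew-free logistic distribution while the thresholds need not be equidistant, repeated draws can produce arbitrarily skewed ordinal frequency patterns, which is the point made above about skewed or bimodal data.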
Now model (4) gives the (log) odds of 'deviant smell' if threshold r is used for dichotomization, that is, to distinguish 'deviant' from 'normal' smell. This model (either in form (3) or (4)) is typically referred to as the proportional odds model (POM), since the effect of the covariates does not depend on the cut-point r, but is rather constant across categories. In other words, the odds ratio when increasing a specific covariate by one unit is the same for all cut-points r. For instance, as a 'textbook example', we may fit a proportional odds model to the data shown in Figure 1, with covariates androstenone and skatole, and a rater-specific effect in (1). Here, and throughout the paper, the two covariates were standardized after being transformed logarithmically. An interaction effect was also incorporated (as done in [3]). In a few cases (14 per rater), however, androstenone had a value of zero, which may be due to androstenone content below the detection threshold, or defective measurement. Therefore, those observations were excluded from further analysis. In the case of a consumer study with a large sample of consumers drawn at random from the population, and each consumer evaluating just a relatively small number of products, we would set the rater/consumer effect as a random effect. With a hand-picked panel of raters, however, and each panelist evaluating about 1000 products (see Figure 1), those rater-specific effects are set as fixed effects, that is, as an additional factor identifying the rater. As a next step, when comparing this model to a more complicated model where the effects of androstenone and skatole also vary with the rater (both fitted using polr() from the R package MASS [11]), the latter model is significantly better, with the p-value being virtually zero (likelihood ratio test with LR = 219.023, df = 27). This confirms the statement made earlier that the effects of androstenone/skatole can be very different between raters.
One step further, it also makes sense to have parameters δ 0r vary with raters. Since the model with both rater-specific thresholds and effects of androstenone/skatole varying with the rater is equivalent to fitting separate models for each rater, the latter will be done throughout the paper (as those individual models are easier to interpret).
In practice, however, the proportional odds assumption made so far is also sometimes violated; compare, for example, the Ref. [12]. In general, if the effects of covariates turn out to vary (substantially) across response categories, the proportional odds assumption will produce biased results. In such a situation, a more general form of model (4) that relaxes the proportional odds assumption may be used, although at the expense of increased model complexity. The general cumulative logit model, or rather the non-proportional odds model (NPOM), is given by
\[
\log \frac{P(Y_i > r \mid x_i)}{P(Y_i \leq r \mid x_i)} = \delta_{0r} + x_i^\top \delta_r, \tag{5}
\]
where \(\delta_r = (\delta_{1r}, \ldots, \delta_{pr})^\top\), and the restrictive global effect δ is now replaced by a more liberal, category-specific effect that accounts for every single response class/cut-point r in the model. Model (5) has the property of stochastic ordering [5], which implies that \(\delta_{0,r+1} + x_i^\top \delta_{r+1} < \delta_{0r} + x_i^\top \delta_r\) holds for all \(x_i\) and all r = 1, . . . , k − 2, since \(P(Y_i > r+1 \mid x_i) < P(Y_i > r \mid x_i)\) must hold for all categories. However, it is often the case that such a constraint is not met during the iterative procedure typically used to fit the model, leading to unstable likelihoods with an ill-conditioned parameter space. This is one reason why the simpler model (4) is often adopted in practice, even when inappropriate. Alternatively, the two different forms of effects in (4) and (5) may be combined in one model. In other words, assuming \(x_i\) is partitioned into \(u_i\) and \(v_i\) such that \(x_i^\top = (u_i^\top, v_i^\top)\), one obtains the so-called partial proportional odds model (PPOM) [13] as follows:
\[
\log \frac{P(Y_i > r \mid x_i)}{P(Y_i \leq r \mid x_i)} = \delta_{0r} + u_i^\top \delta + v_i^\top \gamma_r, \tag{6}
\]
where \(u_i\) has a global effect δ and \(v_i\) has a category-specific effect \(\gamma_r\). PPOM can help to reduce model complexity, but at the extra cost of assigning candidate covariates to a particular effect type.
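The stochastic-ordering requirement of (5) can be checked numerically. Below is a small, self-contained Python sketch (function names and the example coefficients are hypothetical, chosen by us for illustration) showing that category-specific slopes which are harmless near x = 0 may violate the ordering elsewhere in the covariate space:

```python
def exceedance_predictors(x, d0, D):
    """Linear predictors delta_{0r} + x' delta_r of NPOM (5), r = 1, ..., k-1."""
    return [d0r + sum(xj * dj for xj, dj in zip(x, dr))
            for d0r, dr in zip(d0, D)]

def respects_stochastic_ordering(x, d0, D):
    """Stochastic ordering: the predictors must strictly decrease in r at this x."""
    eta = exceedance_predictors(x, d0, D)
    return all(eta[r + 1] < eta[r] for r in range(len(eta) - 1))

d0 = [1.0, 0.0, -1.0]            # intercepts, decreasing
D = [[0.2], [0.5], [0.9]]        # category-specific slopes, increasing with r
print(respects_stochastic_ordering([0.0], d0, D))   # True: fine at x = 0
print(respects_stochastic_ordering([4.0], d0, D))   # False: predictors cross
```

This is exactly the kind of crossing that makes the unpenalized NPOM likelihood ill-conditioned during iterative fitting.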
To distinguish between POM and NPOM, proportional odds tests may come into play, such as likelihood ratio tests. However, with those tests depending on the existence of the likelihood of the general model, which is often ill-conditioned, such tests are often not feasible. One test that is independent of the likelihood of the general model is the so-called Brant test [14]. This test examines the proportionality assumption of the entire model (omnibus) as well as of each of the individual variables in the model. The approach is based on viewing (5) as a combination of k − 1 correlated binary logistic regressions. Brant shows that the separate binary logistic regression estimates \(\hat{\delta}_1, \ldots, \hat{\delta}_{k-1}\) of \(\delta_r\), r = 1, . . . , k − 1, are asymptotically unbiased and follow a multivariate normal distribution. Consequently, a Wald-type test based on the differences in the estimated coefficients, producing a chi-square statistic, can be used. Thus, with the δ_r all equal under POM, any difference \(\hat{\delta}_r - \hat{\delta}_l\), r ≠ l, makes a possible test statistic for testing the proportional odds assumption, also componentwise. The corresponding test, however, suffers from low power and "may provide no clear indication as to the nature of the discrepancy [from the proportional odds model] detected" [14]. Therefore, Brant focused on testing H_0: δ_r = δ for all r vs. H_1: δ_r = φ_r δ, where φ_r > 0 captures misspecification of the distributional form of the latent variable, in this instance, a non-logistic link function. Indeed, when applying the Brant test to the cumulative logit models of the sensory data (see Table 1), it turned out that not all the models met the parallel slope assumption (see the highlighted p-values). The observed non-proportionality is more pronounced in the overall models than in the respective individual covariates. In summary, about half of the models under consideration failed the parallel slope assumption.
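Brant's view of (5) as k − 1 correlated binary regressions is easy to emulate: dichotomize the response at each cut-point and fit a separate binary logit each time. The sketch below (pure Python with a single covariate; our own illustrative code, not Brant's or the authors' implementation) returns the cut-point-specific slope estimates whose differences the Brant statistic is built on:

```python
import math

def fit_binary_logit(x, z, steps=25):
    """Newton-Raphson for a binary logit with intercept and one covariate."""
    a, b = 0.0, 0.0
    for _ in range(steps):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, zi in zip(x, z):
            p = 1.0 / (1.0 + math.exp(-(a + b * xi)))
            w = p * (1.0 - p)                # working weight
            g0 += zi - p                     # gradient, intercept
            g1 += (zi - p) * xi              # gradient, slope
            h00 += w; h01 += w * xi; h11 += w * xi * xi
        det = h00 * h11 - h01 * h01          # 2x2 Fisher information determinant
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b

def cutpoint_slopes(x, y, k):
    """Slope estimates from the k-1 separate dichotomizations z_r = 1(y > r)."""
    return [fit_binary_logit(x, [1.0 if yi > r else 0.0 for yi in y])[1]
            for r in range(1, k)]
```

Under proportional odds, the returned slopes should differ only by sampling noise; systematic differences across cut-points are what the Brant chi-square statistic formalizes.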
However, the reliability of this test and similar conventional tests for validating the proportional odds assumption has often been criticized for being prone to misleading conclusions in empirical applications; see, for example, the Refs. [15,16]. Approaches other than statistical hypothesis testing have therefore been recommended: the Ref. [16], for instance, suggests a graphical approach for validating the proportional odds assumption, and [17] proposed modified residuals that can also be used to check the proportional odds assumption. Given our sensory data, a very intuitive graphical approach is to examine the contour plots of the estimated log-odds of being in dichotomized categories of the deviant smell model for different cut-points r. Figures 2 and 3 show the corresponding log-odds as a function of the predictors androstenone and skatole under NPOM and POM for the seventh and eighth panelist (G, H), respectively. In addition, the dichotomized data using cut-points r = 1, 2, 3 are given as red/blue dots, since the cumulative logistic model (5) can be interpreted such that a binary logit model is employed on the dichotomized data using potential threshold r = 1, 2, 3. The log-odds shown in Figure 2 for both NPOM (top) and POM (bottom) indicate that the odds of being in the upper categories (i.e., categories of more severe boar taint) increase for increasing androstenone and skatole. With NPOM (top), however, the shape of the contour lines changes between columns, that is, thresholds r, whereas the log-odds all have the same shape across cut-points for POM (bottom) by construction (as the δ coefficients do not change across r; compare (4)). Having the relatively large sample size in mind (n ≈ 1000 for each panelist), this may indicate that the proportional odds assumption is inappropriate here.
For panelist H (Figure 3, top), though, contour plots also change considerably for different cut-points r, but appear rather erratic and hard to interpret, in contrast to the POM results (bottom). From the latter, we get the clear picture of (log-)odds of deviant smell (i.e., upper categories) that vary mainly in the androstenone direction (increasing for increasing androstenone), since contour lines are rather parallel to the skatole axis. Using NPOM (top), by contrast, we can hardly make such a statement. As a consequence, we may be willing to give up the flexibility of NPOM (5) in order to have a model fit that can be interpreted. In light of those findings, it would be desirable to have a tool available for moving NPOM estimates towards POM automatically, if and as far as this is supported by the data. Besides interpretation, there is another very important advantage of POM over NPOM: the former is much simpler, and the estimates' variance is typically smaller, which can lead to a smaller mean squared error even in the case of bias (the so-called bias-variance trade-off).
To this end, the use of shrinkage penalties could be considered a viable means of reaching a good, data-driven compromise between the non-proportional and proportional odds model. In other words, when putting an appropriate penalty on the parameters of the more general, non-proportional odds model, a trade-off may be found between bias and variance of the estimated parameters. Several types of penalties have already been suggested in the literature for categorical models. On the one hand, these comprise penalties adopted from regularization methods for continuous models; see, for example, the Refs. [18][19][20]. On the other hand, there are penalties specifically designed to suit the needs of categorical variables. A review and discussion of the latter is, for instance, given by [21], which also sketches a smoothing penalty for ordinal regression. However, the idea is neither discussed in detail nor applied to data. In this study, we will further investigate this approach, which we call the 'smooth-effects-on-response penalty' (SERP), and use it for modeling our sensory data in a more flexible way than POM.

Smooth Ordinal Regression
For a smooth transition from the general model (NPOM) to the restricted model (POM), we consider the use of a specific penalty. This penalty enables the parameters of NPOM to be smoothed across response categories, resulting in a data-driven compromise between the most flexible but potentially over-complex NPOM and the very popular but potentially too restrictive POM.
Let the overall design matrix be given by \(X = (X_1^\top, \ldots, X_n^\top)^\top\), and let \(L_i(\theta) = \partial h(\eta_i)/\partial \eta\) be the derivative of h(η) evaluated at \(\eta_i = X_i \theta\), where \(\eta_i = (\eta_{i1}, \ldots, \eta_{i,k-1})^\top\); compare (5). Here, h is the inverse link, that is, the response function of the cumulative logit model. Then, the penalized score function \(s_p(\theta)\) is given by
\[
s_p(\theta) = X^\top L^\top(\theta)\, \Sigma^{-1}(\theta)\, (z - \mu) - \frac{1}{2}\, \frac{\partial J_\lambda(\delta)}{\partial \theta}, \tag{7}
\]
where \(L(\theta) = \operatorname{diag}(L_i(\theta))\) is the block-diagonal matrix of derivatives, \(\Sigma(\theta) = \operatorname{diag}(\Sigma_1(\theta), \ldots, \Sigma_n(\theta))\) the block-diagonal matrix of covariance matrices of the k-dimensional binary response vectors \(z_i\) indicating the category of observation i, \(z = (z_1^\top, \ldots, z_n^\top)^\top\) the combined vector of observed values, and \(\mu = (\mu_1^\top, \ldots, \mu_n^\top)^\top\) the vector of mean vectors, that is, of the k-dimensional class probabilities \(\mu_i = h(X_i \theta)\). The SERP penalty term is
\[
J_\lambda(\delta) = \lambda \sum_{j=1}^{p} \sum_{r=1}^{k-2} \big( \delta_{j,r+1} - \delta_{jr} \big)^2, \tag{8}
\]
which penalizes squared differences of coefficients belonging to adjacent response categories and leaves the thresholds \(\delta_{0r}\) unpenalized; its derivative can be written as \(\frac{1}{2}\, \partial J_\lambda(\delta)/\partial \theta = \lambda \Omega \theta\), with a corresponding penalty matrix Ω. Equating the score function to zero yields the estimation equation \(s_p(\hat{\theta}) = 0\), which may be solved with the following iterative routine:
\[
\hat{\theta}^{[t+1]} = \hat{\theta}^{[t]} + F_p^{-1}\big(\hat{\theta}^{[t]}\big)\, s_p\big(\hat{\theta}^{[t]}\big), \quad t = 0, 1, 2, \ldots,
\]
where \(F_p(\theta) = F(\theta) + \lambda \Omega\) is the penalized/pseudo-Fisher information matrix, \(F(\theta) = X^\top W(\theta) X\) the Fisher information, and \(W(\theta) = L^\top(\theta) \Sigma^{-1}(\theta) L(\theta)\) the weight matrix. Assuming \(\hat{\theta}^{[0]} = (\hat{\delta}_0^{[0]\top}, \hat{\delta}^{[0]\top})^\top\) is the vector of the initial \((k-1)(p+1)\) values in the algorithm, in our particular case (logit link), \(\hat{\delta}_0^{[0]}\) is obtained from the logistic transformation (see the left-hand side of (5)) of the cumulative, relative class frequencies in the data; \(\hat{\delta}^{[0]}\), on the other hand, is a \(p(k-1)\) vector of zeros. Alternatively, POM estimates may be used as starting values. Given that \(\hat{\theta}\) contains penalized estimates of the θ parameters, the approximate covariances are obtained by the sandwich matrix
\[
\widehat{\operatorname{Cov}}\big(\hat{\theta}\big) = F_p^{-1}\big(\hat{\theta}\big)\, F\big(\hat{\theta}\big)\, F_p^{-1}\big(\hat{\theta}\big), \tag{9}
\]
where all notations are as defined earlier. For more details on estimation procedures for cumulative models and the like, see, for example, the Refs. [8,22].
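The penalty matrix Ω can be assembled from first-order differences of adjacent category-specific coefficients. Below is a small Python sketch of this construction (our own illustration, under the stacking convention stated in the comments; the intercept block, which is unpenalized, is omitted):

```python
def difference_penalty(p, k):
    """Omega = D'D for the SERP squared-difference penalty, acting on the
    stacked slopes (delta_{j1}, ..., delta_{j,k-1}), covariate by covariate."""
    m = p * (k - 1)
    omega = [[0.0] * m for _ in range(m)]
    for j in range(p):
        for r in range(k - 2):               # one difference per adjacent pair
            i0, i1 = j * (k - 1) + r, j * (k - 1) + r + 1
            omega[i0][i0] += 1.0
            omega[i1][i1] += 1.0
            omega[i0][i1] -= 1.0
            omega[i1][i0] -= 1.0
    return omega

def quad_form(omega, theta):
    """theta' Omega theta = sum of squared adjacent differences."""
    return sum(omega[i][j] * theta[i] * theta[j]
               for i in range(len(theta)) for j in range(len(theta)))
```

With p = 2 and k = 4, the slopes (1, 2, 4, 3, 3, 3) give a quadratic form of (2 − 1)² + (4 − 2)² + 0 = 5, whereas constant slopes give 0. Coefficients that are constant across categories (i.e., POM) are thus left unpenalized, which is exactly what drives the shrinkage towards proportionality.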

Choosing the Tuning Parameter and Measures of Performance
Since the penalty term J_λ(δ) in (7) depends on the tuning parameter λ, a suitable value of λ needs to be determined. Common practice is to fit the penalized model for a sequence of λ values and select the best value via a tuning routine [23,24]. A typical tuning routine entails choosing the best model based on criteria such as AIC, BIC, and so forth. Alternatively, the λ value at which the model's out-of-sample prediction error is minimal can be determined via cross-validation. There are various specifications of such prediction errors in the literature, including classification error, squared error, minus log-likelihood error, and so forth. The pros and cons of some of these error types are, for example, reviewed in [25,26]. The squared error (or Brier score [27]) particularly captures the sum of squared distances between the observed classes and the predicted values/probabilities. We refer to this here as the mean squared prediction error (MSPE). For a multi-class model with response categories 1, . . . , k, it is defined as
\[
\mathrm{MSPE} = \frac{1}{n} \sum_{i=1}^{n} \sum_{r=1}^{k} \big( I(y_i = r) - \hat{\pi}_{ir} \big)^2, \tag{10}
\]
where \(I(y_i = r)\) is the indicator variable for each (potential) level of the dependent variable, \(\hat{\pi}_{ir}\) are the predicted probabilities for the respective categories and subject i, and n is the total number of observations. We will use MSPE alongside other error metrics to determine λ and compare the performance of our proposed approach with standard approaches. Another common performance metric is the mean squared error (MSE). Given that the true parameters δ_jr are known (as they are in simulation studies), we can obtain the MSE of parameter estimates as follows:
\[
\mathrm{MSE} = \frac{1}{p(k-1)} \sum_{j=1}^{p} \sum_{r=1}^{k-1} \big( \hat{\delta}_{jr} - \delta_{jr} \big)^2,
\]
where k and p are the number of response categories and predictors in the model, respectively. The MSE can also be calculated covariate/component-wise.
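For concreteness, the MSPE of Equation (10) can be computed as follows (a minimal Python sketch; the argument names are ours):

```python
def mspe(y, pi_hat):
    """Mean squared prediction error (Brier score), cf. Equation (10).

    y:      observed classes, coded 1, ..., k
    pi_hat: per-subject lists of predicted class probabilities (length k)
    """
    total = 0.0
    for yi, probs in zip(y, pi_hat):
        for r, p in enumerate(probs, start=1):
            indicator = 1.0 if yi == r else 0.0
            total += (indicator - p) ** 2
    return total / len(y)
```

Perfect probabilistic predictions give an MSPE of 0; for a single subject in class 1 with predicted probabilities (0.5, 0.5, 0), the MSPE is 0.25 + 0.25 = 0.5.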
Please note that, in order to reduce complexity, we used a single, global penalty parameter λ in (7). In general, we could also use covariate-specific penalty parameters λ_j within the penalty term (8). This, however, would mean that cross-validation needs to be carried out over a multi-dimensional space.
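The computational burden of covariate-specific tuning is easy to quantify: with a grid of m candidate values, a single global λ requires m model fits per cross-validation fold, whereas covariate-specific λ_j require m^p. A short Python illustration (the grid values are hypothetical):

```python
from itertools import product

lambda_grid = [0.0, 0.1, 1.0, 10.0, 100.0]   # hypothetical tuning grid
p = 3                                        # e.g., androstenone, skatole, interaction

global_candidates = list(lambda_grid)                       # one global lambda
specific_candidates = list(product(lambda_grid, repeat=p))  # one lambda_j per covariate

print(len(global_candidates), len(specific_candidates))     # prints: 5 125
```

Even for this moderate grid and p = 3, the covariate-specific search is 25 times larger, which is why a single global λ is used here.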

Numerical Experiments
Before applying SERP to the sensory data, the effect and performance of the penalty shall be investigated in simulation studies where the truth is known. Following Equation (5), the probabilities P(Y_i > r | x_i), r = 1, . . . , 3, were obtained with the two covariates x_{i1} and x_{i2}, including an interaction, where both variables are iid N(0, 1), and i = 1, . . . , 1000. The intercepts δ_0 = (δ_{01}, . . . , δ_{03}) were set to be equidistant, namely δ_0 = (0.1, −1.0, −2.1), whereas the slope parameters δ_j = (δ_{j1}, . . . , δ_{j3}), j = 1, 2, 3, were selected to form three different simulation settings. The resulting class probabilities were subsequently used with a multinomial distribution to generate the corresponding observed values, y_i ∈ {1, . . . , 4}. In general, the number of response categories, and sample and effect size(s) were chosen to be comparable to the sensory data from Sections 1, 2 and 5.
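The data-generating process just described can be sketched as follows (Python, our own illustration; the slope values below are placeholders, not the settings used in the simulation study):

```python
import math
import random

def class_probs(x, d0, D):
    """Class probabilities from (5): pi_r = P(Y > r-1 | x) - P(Y > r | x)."""
    expit = lambda z: 1.0 / (1.0 + math.exp(-z))
    exceed = [expit(d0r + sum(xj * dj for xj, dj in zip(x, dr)))
              for d0r, dr in zip(d0, D)]
    exceed = [1.0] + exceed + [0.0]          # P(Y > 0) = 1, P(Y > k) = 0
    return [exceed[r] - exceed[r + 1] for r in range(len(exceed) - 1)]

def simulate(n, d0, D, rng):
    """Draw n observations: two N(0,1) covariates, interaction, multinomial y."""
    data = []
    for _ in range(n):
        x1, x2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        x = (x1, x2, x1 * x2)
        pi = class_probs(x, d0, D)
        u, cum, y = rng.random(), 0.0, len(pi)
        for r, p in enumerate(pi, start=1):  # inverse-CDF draw from the multinomial
            cum += p
            if u <= cum:
                y = r
                break
        data.append((x, y))
    return data

d0 = (0.1, -1.0, -2.1)                       # equidistant intercepts, as in the text
D = [(0.5, 0.5, 0.2)] * 3                    # placeholder proportional slopes
```

With proportional placeholder slopes, the exceedance probabilities are automatically decreasing in r, so all class probabilities are non-negative; genuinely category-specific δ_r must be chosen such that this remains true over the relevant covariate range.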
For an illustrative application of SERP to ordinal models, SERP is applied to a cumulative logit model built from the generated data, using a grid of λ ∈ [0, ∞). As shown in Figure 4 (top), at large values of λ, all NPOM estimates converge to the POM estimates; see the dashed blue horizontal line on each display. The bottom displays provide a second visualization of SERP's smoothing steps from NPOM's original set of estimates towards POM's estimates, represented by line strokes across ordinal levels/cut-points r. Again, we see the initial cut-point-specific estimates (solid black) all shrunken to the parallel estimates (dashed blue horizontal lines), with several tuning steps of SERP in between (gray). An optimal value of λ regulating the degree of smoothing can be determined following a predefined criterion, for instance, a λ value minimizing a performance metric; see, for example, Equation (10) in Section 3. In this particular setting, the vertical (dashed red) lines in the top displays and the non-horizontal dashed lines in the bottom displays of Figure 4 give the resulting set of estimates at which the 5-fold cross-validated test errors (MSPE) were minimal. Unless stated otherwise, this tuning criterion is used throughout this study for SERP-fitted models. We next investigate SERP's improvement over (N)POM (where the notation (N)POM refers to POM and/or NPOM). Thus, following the described data-generating process, 100 replications (each of SERP and (N)POM) were obtained for the three different simulation settings. For comparison purposes, a test set error (MSPE) of each of the models was obtained for all the simulation runs. In addition, the MSE of estimates with respect to the true slope parameters plus interaction was also obtained. Figure 5 shows the pairwise differences across simulation runs in MSE and MSPE of (N)POM and SERP, with the horizontal dashed line in each plot indicating the mark of SERP's improvement over (N)POM.
In other words, differences above the dashed lines indicate better performance of SERP than (N)POM. As observed in the first simulation setting (column 1), where the underlying coefficients vary across categories, SERP outperformed both NPOM and POM in terms of MSE and MSPE. In the second simulation setting (column 2), where truly constant coefficients generated the data, POM expectedly performed distinctly better than NPOM. SERP, however, adapted very well and gave estimates that are as good as POM's. The third simulation setting (column 3) had varying underlying coefficients for the first covariate and constant coefficients for the second. As before, SERP adapted very well (compare the MSE for the three coefficient vectors δ_1, δ_2, δ_3), thus producing models with better predictions on average than both NPOM and POM. It should be pointed out that we used one, global λ here. Nevertheless, SERP is highly competitive with POM on the (truly constant) δ_2, while performing much better on the (truly varying) δ_1. If compared to NPOM, SERP is superior on δ_2 and competitive on δ_1. This indicates that coefficients are decisively shrunken towards proportionality in the first case (δ_2), while substantial non-proportionality is allowed for in the latter (δ_1). Finally, it should be noted that we only considered settings here that are comparable to the real, sensory data from Sections 1, 2, and 5. In general, however, it has been our experience that, if the models considered are identifiable within cross-validation, the penalty parameter chosen will yield results that are (at least) competitive with NPOM and/or POM.

Application to Sensory Evaluation
We continue with our real data problem from Section 1, where panelists' ratings of the degree of deviation from a normal smell were modeled using the two covariates, androstenone and skatole (each log-transformed, standardized), plus the interaction effect. For a detailed discussion of the experiment(s) to generate this data set, and a preliminary analysis of all variables of interest, the reader is referred to [3]. Our interest here is on the rater-specific ordinal models of the scores of deviant smell.
We have already introduced the cumulative logit models of the individual raters in Section 2. In order to understand the rating patterns of individual panelists, and for accurate inference, it is necessary to determine whether (and to what extent) category-specific effects or global effects are suitable for the individual models of deviant smell. Moreover, it is necessary to have the parameters in the model, and the model itself, be completely identifiable. SERP could, therefore, provide the means of arriving at a good set of estimates other than (N)POM's original estimates, as well as help to induce convergence where NPOM fails to converge. Hence, cumulative logit models of deviant smell were obtained for all the panelists using SERP and the two standard approaches, POM and NPOM. The obtained estimates and standard errors of SERP, with standard errors being extracted from (9), together with those of (N)POM, are given, by way of example, for panelists G and H in Tables 2 and 3, respectively. Standard errors (SE) could also be used to calculate common, approximate 95% confidence intervals in terms of 'estimate ± 2 SE'. It should be noted, though, that those SE are obtained for a given smoothing parameter; compare (9). That means that we treat the smoothing parameter as if it had been specified a priori. Although this is commonly done in penalized regression, it ignores variation that is induced by cross-validation (or other methods used to find an appropriate λ in practice). As a consequence, those SE are typically biased downwards and lead to some under-coverage of confidence intervals. Of course, the usual POM confidence intervals (in terms of SERP, obtained for λ → ∞) are conditioned on the much stricter assumption of proportional odds, and hence are only valid if this assumption is true. With respect to point estimates, we see that in the case of panelist G, SERP shrinks the NPOM estimates towards POM, but still produces a non-proportional odds model.
With panelist H, by contrast, the cross-validated λ yields a proportional odds model, as all slope coefficients are (virtually) constant across cut-points (compare Table 3). We shall later examine the extent of SERP's improvement over (N)POM via re-sampling procedures. Figure 6 shows the log-odds of deviant smell for panelists G, H, and I, in analogy to Figures 2 and 3. Again, we see (top row) that SERP shrinks the estimates for panelist G towards the proportional odds model, but still maintains some non-proportionality; also compare Figure 2. For panelist H (Table 3), the (nearly) proportional odds model is obtained (compare Figure 3 and the Supplementary Material). More generally speaking, not only does SERP help to locate the 'best' set of coefficients, but one could also make an informed decision as to which of POM and NPOM, or the combination of estimates from both models, that is, partial proportional odds (6), could be adequate in an empirical study. Another point that is nicely seen from Figure 6 and Tables 2 and 3 is that the effects of androstenone and skatole can indeed be very different between people. On the one hand, effects are much stronger for panelists G and I than for panelist H. On the other hand, panelist H senses boar taint for increasing androstenone rather than skatole, as iso-lines look rather parallel to the y-axis in Figure 6 (middle row). For panelist G, it is a combination of both, whereas, according to panelist I, deviant smell is mainly caused by increased levels of skatole (iso-lines rather parallel to the x-axis in Figure 6, bottom). Further comparison of SERP and the other methods was made with respect to the out-of-sample MSPE of the respective methods. Those errors were obtained for each of the panelists on a randomly chosen test set amounting to 20% of the original data set, over 100 replications of this experiment. Figure 7 captures the pairwise prediction error differences of SERP against (N)POM (with differences above zero favoring SERP).
As observed, SERP shows distinct improvement over NPOM for 9 out of 10 panelists, as the majority of differences for the respective panelists lie above the dashed horizontal line. Improvement over POM, however, is much less pronounced than SERP's improvement over NPOM. For panelists A, D, E, and G, at least the median of differences is (slightly) above zero. For the other panelists, though, SERP is still no worse than POM. In summary, SERP appears to be a safe choice when it comes to modeling/prediction of panelists' ratings in our application. Compared to the standard approaches (POM and NPOM), results in terms of prediction accuracy are at least similar, but often better. With respect to boar taint perception, our results clearly show that people can be very different, both in terms of the effects of androstenone and skatole, and in terms of (non-)proportionality. This needs to be kept in mind, for instance, when hiring raters for detecting boar-tainted carcasses at the slaughter line.

Discussion
Regularization has been a topic of intensive research in statistics for decades now. However, regularization methods that are specifically designed for categorical data are relatively new. Particularly, the focus has rarely been on ordinal response models. This might be due to the fact that the model most frequently used for ordinal response data (if not employing a standard linear model) is the cumulative model with global effects, particularly the proportional odds model, which can be nicely motivated via a latent variable and a corresponding linear model. Hence, most of the regularization methods proposed for ordinal response data can be seen as extensions of methods typically found in the high-dimensional linear or generalized linear model framework. For instance, the Ref. [28] considered high-dimensional genomic data and forward stagewise regression in the proportional odds model, the Ref. [29] proposed a Boosting approach for variable selection, the Ref. [30] used a sparsity-inducing Bayesian framework, whereas [20] employed a penalty approach. In the latter case, class-specific parameters were also considered, but only in a continuation ratio model, similar to [19]. Instead of a penalty term, the Ref. [31] proposed to use pseudo-observations in order to regularize and stabilize fitting in the proportional odds model.
The proportional odds assumption, however, rules out category-specific effects. Since, with our sensory data, it is very clear that, at least for some raters, the proportional odds assumption is too restrictive, alternatives are needed. Simply allowing for category-specific effects that are estimated via the usual maximum likelihood, however, is hardly an option, because the model becomes too complex, which leads to large variance in the fitted coefficients, or even numerical issues such as non-convergence. We hence investigated a penalty approach that smooths effects across ordered categories. The new approach, called SERP, showed very encouraging results both in simulation studies and with our sensory data. As observed, SERP makes it possible to find, in a completely data-driven manner, a good compromise between the purely non-proportional odds model and the usual but restrictive assumption of proportionality. Additionally, SERP may be used to check in a rather informal way for partial (non-)proportionality; that is, it may help to make a decision on the structure of the partial proportional odds model (6). If supported by the data, coefficients for some variables may be shrunken towards proportionality, while the coefficients for other variables still indicate non-proportionality.
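To make the idea concrete, the following minimal sketch (not the authors' implementation, which is provided by the R package serp) fits a cumulative logit model with category-specific slopes on simulated data and adds a quadratic first-order difference penalty λ Σ<sub>k</sub>(β<sub>k+1</sub> − β<sub>k</sub>)². Setting λ = 0 gives the unpenalized non-proportional odds model, while a very large λ shrinks the category-specific slopes towards a common value, i.e., towards the proportional odds model. All data, names, and tuning values below are illustrative.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

rng = np.random.default_rng(1)
n, K = 400, 4                                   # n observations, K response categories
x = rng.normal(size=n)

# Simulate from a proportional-odds model: P(Y <= k | x) = expit(alpha_k - beta * x)
alpha_true = np.array([-1.0, 0.0, 1.0])
cum_true = expit(alpha_true[None, :] - 1.2 * x[:, None])
y = (rng.uniform(size=n)[:, None] > cum_true).sum(axis=1)   # y in {0, ..., K-1}

def neg_pen_loglik(theta, lam):
    """Penalized negative log-likelihood with category-specific slopes beta_k."""
    alpha, beta = theta[:K - 1], theta[K - 1:]
    cum = expit(alpha[None, :] - x[:, None] * beta[None, :])
    full = np.hstack([np.zeros((n, 1)), cum, np.ones((n, 1))])
    p = np.clip(np.diff(full, axis=1), 1e-10, None)          # category probabilities
    loglik = np.log(p[np.arange(n), y]).sum()
    penalty = lam * np.sum(np.diff(beta) ** 2)               # first-order smoothing penalty
    return -loglik + penalty

theta0 = np.concatenate([np.linspace(-1, 1, K - 1), np.zeros(K - 1)])
fit = lambda lam: minimize(neg_pen_loglik, theta0, args=(lam,), method="BFGS").x

beta_free = fit(0.0)[K - 1:]      # lam = 0: non-proportional odds fit
beta_smooth = fit(1e4)[K - 1:]    # large lam: shrunk towards proportionality
print("NPOM slopes:", beta_free)
print("heavily smoothed slopes:", beta_smooth)
```

In practice, λ is not fixed by hand as above but chosen in a data-driven way (e.g., by cross-validation), which is exactly the compromise between POM and NPOM that SERP provides.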
A penalty very similar to SERP has been proposed by [32] for the bivariate ordered logistic model. In this framework, the authors also proposed a partial likelihood ratio test (PLRT) that works with the penalized likelihood. As an alternative to the Brant test, a corresponding PLRT could also be used with our data and methodology to distinguish between raters for whom the proportional odds assumption might be acceptable and those for whom it is significantly violated. In our case, however, the Brant test turned out to be superior in terms of both size and power. For instance, the PLRT was very sensitive to the smoothing parameter λ; in particular, it produced unsatisfactory results for large λ (where the results derived by [32] do not hold). When deciding between global and category-specific effects after penalized fitting via SERP, we hence prefer making the decision in a rather informal way on the basis of the estimated penalty parameter and regression coefficients.
Besides the (non-)proportional odds models considered in this paper, various other models and methods have been proposed for handling ordinal data, for instance, the stereotype model [33], probabilistic index models [34–36], and rank-based models [37,38], to name just a few. In the present paper, however, we focused on cumulative models in combination with a specific type (8) of quadratic, first-order smoothing penalty. An R [39] implementation of the proposed method is provided as the add-on package serp [40], available from the open access repositories CRAN and GitHub.

Institutional Review Board Statement: Not applicable, as this research only involved secondary analysis of existing data, collected for the purposes of a prior study [3].
Informed Consent Statement: Not applicable, as this research only involved secondary analysis of existing data, collected for the purposes of a prior study [3].

Data Availability Statement: The sensory data [3] analyzed in this paper are available from Zenodo. Additionally, a second real data example (an analysis of the wine dataset) using the proposed method is provided in the R package serp [40].