2.1. STAR Models and Deep Learning
Recently, owing to the flexibility of implementations such as TensorFlow (see
Abadi et al. 2016), deep learning approaches can also be used for fitting STAR models. In Agarwal et al. (2021), neural additive models are discussed. Such models use subnetworks to represent each input variable ${X}^{(j)}$ with a smooth function ${f}_{j}$ and connect their output nodes directly to the response Y, which is possibly transformed by a nonlinear activation function ${g}^{-1}$; see Figure 1a for an illustration. A neural additive model thus allows the fitting of additive models in line with (2). To fit a STAR model in the general case of (1) through deep learning, we need to slightly generalize the structure of the neural additive model to allow the subnetworks to use as input all covariates used in component ${f}_{j}$; see Figure 1b for a schematic example. Related models are discussed in Rügamer et al. (2021) in an even more general context of multivariate $\xi$, allowing for modeling of the full conditional distribution of Y.
2.2. STAR Models and Gradient Boosting
However, it is less known that some modern implementations of gradient boosting are actually able to nonparametrically fit STAR models of the form (1). The solution is based on imposing so-called feature interaction constraints, an idea that goes back to Lee et al. (2015). Their approach was implemented in XGBoost in 2018 and, at our request (Mayer 2020), in LightGBM in 2020. Interaction constraints are specified as a collection of feature subsets
${\mathcal{X}}_{1},\dots ,{\mathcal{X}}_{p}$, where each
${\mathcal{X}}_{j}$ specifies a group of features that are allowed to interact. Algorithmically, the constraints are enforced by the following simple rule during tree growth:
At each split, the set of split candidates is the union of those ${\mathcal{X}}_{j}$ that contain all previously used split variables of the current branch. Consequently, each tree branch uses features from one feature set
${\mathcal{X}}_{j}$ only. Its associated prediction on the scale of
$\xi $ contributes to the component
${f}_{j}$ of an implicitly defined STAR model of the form (
1), where each model component
${f}_{j}$ uses feature subset
${\mathcal{X}}_{j}$ as specified by the constraints.
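The rule above can be sketched as a small helper function (hypothetical, not part of XGBoost or LightGBM), using the constraint sets $\{A\}$ and $\{B, C, D\}$ that also appear in Table 1:

```python
def allowed_split_candidates(used_features, constraint_sets):
    """Return the features a branch may still split on, given the split
    variables already used on the current branch and the interaction
    constraint sets X_1, ..., X_p (each a set of feature names)."""
    used = set(used_features)
    # Union of all constraint sets that contain every previously used feature.
    candidates = set()
    for feature_set in constraint_sets:
        if used <= feature_set:
            candidates |= feature_set
    return candidates

constraints = [{"A"}, {"B", "C", "D"}]
# Root split: no feature used yet, so every feature is a candidate.
print(sorted(allowed_split_candidates([], constraints)))     # ['A', 'B', 'C', 'D']
# After splitting on B, the branch is confined to {B, C, D}.
print(sorted(allowed_split_candidates(["B"], constraints)))  # ['B', 'C', 'D']
```

For a partition, as soon as the first split variable is chosen, exactly one constraint set remains, which is why each tree then uses only that set.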
An important type of constraint is feature
partitions, i.e., disjoint feature sets
${\mathcal{X}}_{j}$. They include the special case of the collection of singletons
$\left\{{X}^{(1)}\right\},\dots ,\left\{{X}^{(p)}\right\}$ that would produce an additive model of the form (
2). For partitions, by the above rule, the first split variable of a tree determines the feature set ${\mathcal{X}}_{j}$ to be used throughout that tree.
Figure 2 illustrates a simple example of such a model with the following equation:
$\xi ={f}_{0}+{f}_{1}(A)+{f}_{2}(B,C,D)$.
The corresponding constraints form a feature partition specified by
${\mathcal{X}}_{1}=\left\{A\right\}$ and ${\mathcal{X}}_{2}=\{B,C,D\}$.
In the two programming languages R and Python, interaction constraints for XGBoost and LightGBM are specified as part of the parameter list passed to the corresponding
train method.
Table 1 shows how to specify the
interaction_constraints parameter for an example with covariates A, B, C, and D and interaction constraints
$\left\{A\right\}$ and
$\{B,C,D\}$ corresponding to a model as in
Figure 2.
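As an illustrative Python sketch of such a parameter list (the exact accepted formats vary by library version; XGBoost commonly takes the constraints as a string of nested index lists, LightGBM as nested lists of column indices):

```python
feature_names = ["A", "B", "C", "D"]  # column order fixes the indices: A=0, ..., D=3

# XGBoost: constraints often passed as a string of nested lists of indices.
xgb_params = {
    "objective": "reg:squarederror",
    "interaction_constraints": "[[0], [1, 2, 3]]",  # {A} and {B, C, D}
}

# LightGBM: constraints as nested lists of column indices.
lgb_params = {
    "objective": "regression",
    "interaction_constraints": [[0], [1, 2, 3]],  # {A} and {B, C, D}
}

# These dictionaries would then be passed to xgboost.train / lightgbm.train
# together with the training data.
```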
XGBoost and LightGBM offer different loss/objective functions. These specify the functional
$\xi $ to be modeled; see
Table 2 for some of the possibilities.
Remark 1.
At the time of writing, and up to at least XGBoost version 1.4.1.1, XGBoost respects only non-overlapping interaction constraint sets ${\mathcal{X}}_{j}$, i.e., partitions. LightGBM can also deal with overlapping constraint sets.
Boosted trees with interaction constraints support only nonlinear components, unlike, e.g., deep learning and componentwise boosting, both of which allow a mix of linear and nonlinear components. See Remark 4, as well as our two case studies, for a two-step procedure that linearizes some effects fitted by XGBoost or LightGBM, thus overcoming this limitation.
Feature preprocessing for boosted trees is simpler than for other modeling techniques. Missing values are allowed for most implementations, feature outliers are unproblematic, and some implementations (including LightGBM) can directly deal with unordered categorical variables. Furthermore, highly correlated features are only problematic for interpretation, not for model fitting.
Model-based boosting “mboost” with trees as building blocks is an alternative to using XGBoost or LightGBM with interaction constraints.
2.3. STAR Models and Supervised Dimension Reduction
Since STAR models (including GAMs with two-dimensional interaction surfaces) typically use only a small subset of (interacting) features per component, the keyword “dimension reduction” rarely appears in connection with this type of model. However, a strength of tree-based components (e.g., trees as base learners in model-based boosting “mboost” or the approach via interaction constraints) and deep learning is that some components can use even a large number of interacting features. Such components serve as one-dimensional representations of their features, conditional on the other features. The values of such components (or sometimes of sums of multiple components) might be used as derived features in another model or analysis. Thus, STAR models offer an effective way to perform (one-dimensional) supervised dimension reduction. Note that, by supervised, we mean that the dimension reduction procedure uses the response variable of the model. Some examples illustrate this.
Example 1.
 1.
House price models with additive effects for structural characteristics and time (for maximal interpretability) and one multivariate component using all locational variables with complex interactions (for maximal predictive performance). The model equation could be as follows:
$\xi ={f}_{0}+{f}_{1}({x}^{(1)})+\dots +{f}_{p-1}({x}^{(p-1)})+{f}_{p}({x}^{(p)},\dots ,{x}^{(q)})$.
Component ${f}_{p}$ provides a one-dimensional representation of all locational variables. We will see an example of such a model in the Florida case study.
 2.
This is similar to the first example, but adds the date of sale to the component with all locational variables, leading to a model with time-dependent location effects. The component depending on locational variables and time represents those variables by a one-dimensional function. Such a model will be shown in the Swiss case study below.
In the above examples, the feature subsets
${\mathcal{X}}_{j}$ used by components
${f}_{j}$ are non-overlapping, i.e., form a partition. For a STAR model
f of the general form (
1), where features might appear in multiple components, we can extend the above idea of dimensionality reduction.
Definition 1 (Purely additive contributions and encoders). Let ${\mathcal{X}}^{\prime}$ be a feature subset of interest and ${\widehat{f}}_{j}$ the fitted additive components of a STAR model as in Equation (1). The contribution of ${\mathcal{X}}^{\prime}$ to the predictor $\widehat{f}(x)$ is given by the partial predictor
${\widehat{f}}_{J({\mathcal{X}}^{\prime})}(x):={\sum}_{j\in J({\mathcal{X}}^{\prime})}{\widehat{f}}_{j}(x)$,
where $J({\mathcal{X}}^{\prime}):=\{j:{\mathcal{X}}^{\prime}\cap {\mathcal{X}}_{j}\ne \varnothing \}$ denotes the index set of components using features in ${\mathcal{X}}^{\prime}$. Furthermore, we call ${\widehat{f}}_{J({\mathcal{X}}^{\prime})}(x)$ a purely additive contribution or encoder of ${\mathcal{X}}^{\prime}$ if ${\bigcup}_{j\in J({\mathcal{X}}^{\prime})}{\mathcal{X}}_{j}\subseteq {\mathcal{X}}^{\prime}$, i.e., if ${\widehat{f}}_{J({\mathcal{X}}^{\prime})}(x)$ depends solely on features in ${\mathcal{X}}^{\prime}$. In this case, we say that ${\mathcal{X}}^{\prime}$ has a purely additive contribution, or that ${\mathcal{X}}^{\prime}$ is encodable, and write the encoder as ${\widehat{f}}_{J({\mathcal{X}}^{\prime})}({x}^{\prime})$, where ${x}^{\prime}$ are values of the feature vector ${X}^{\prime}$ corresponding to ${\mathcal{X}}^{\prime}$.
Remark 2.
From the above definition, it directly follows that the purely additive contribution ${\widehat{f}}_{J({\mathcal{X}}^{\prime})}({x}^{\prime})$ of a (possibly large) set of features ${\mathcal{X}}^{\prime}$ provides a supervised one-dimensional representation of the features in ${\mathcal{X}}^{\prime}$, optimized for predictions on the scale of ξ and conditional on the effects of the other features.
For simplicity, we assume that each model component is irreducible, i.e., it uses only as many features as necessary. In particular, a component additive in its features would be represented by multiple components instead.
The encoder ${\widehat{f}}_{J({\mathcal{X}}^{\prime})}({x}^{\prime})$ of ${\mathcal{X}}^{\prime}$ is defined up to a shift, i.e., ${\widehat{f}}_{J({\mathcal{X}}^{\prime})}({x}^{\prime})+c$ for any constant c is an encoder of ${\mathcal{X}}^{\prime}$ as well. If c is chosen so that the average value of the encoder is 0 on some reference dataset, e.g., the training dataset, we speak of a centered encoder.
Example 2. Consider the fitted STAR model
$\widehat{f}(x)={\widehat{f}}_{0}+{\widehat{f}}_{1}({x}^{(1)},{x}^{(2)})+{\widehat{f}}_{2}({x}^{(2)},{x}^{(3)})+{\widehat{f}}_{3}({x}^{(4)},{x}^{(5)},{x}^{(6)})$.
Following the above definition, we can say:
 1.
Let ${\mathcal{X}}^{\prime}=\{{X}^{(1)},{X}^{(2)},{X}^{(3)}\}$. Then, ${\widehat{f}}_{1}({x}^{(1)},{x}^{(2)})+{\widehat{f}}_{2}({x}^{(2)},{x}^{(3)})$ is a purely additive contribution of ${\mathcal{X}}^{\prime}$.
 2.
Due to interaction effects, the singleton $\left\{{X}^{(1)}\right\}$ and the set $\{{X}^{(1)},{X}^{(2)}\}$ are not encodable.
 3.
Let ${\mathcal{X}}^{\prime}=\{{X}^{(4)},{X}^{(5)},{X}^{(6)}\}$ and ${x}^{\prime}=({x}^{(4)},{x}^{(5)},{x}^{(6)})$. Then, ${\widehat{f}}_{3}({x}^{\prime})$ is an encoder of ${\mathcal{X}}^{\prime}$.
 4.
The fitted model $\widehat{f}(x)$ is an encoder of the set of all features. This is true for STAR models in general.
Next, we consider the fitted STAR model
$\widehat{f}(x)={\widehat{f}}_{0}+{\widehat{f}}_{1}({x}^{(1)})+{\widehat{f}}_{2}({x}^{(2)})+{\widehat{f}}_{3}({x}^{(3)},{x}^{(4)},{x}^{(5)})$.
Here, the feature sets $\left\{{X}^{(1)}\right\},\left\{{X}^{(2)}\right\},$ and $\{{X}^{(3)},{X}^{(4)},{X}^{(5)}\}$ form a partition. As a direct consequence of Definition 1, each feature set of a partition (and every union of such sets) is encodable. This property will implicitly be used in both of our case studies.
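Definition 1 translates directly into a small check. As an illustration, we use component feature sets (written as sets of feature indices) consistent with the statements of Example 2; the helper function is hypothetical:

```python
def is_encodable(x_prime, feature_sets):
    """Check whether the feature set x_prime has a purely additive
    contribution in a STAR model whose components use `feature_sets`."""
    # J(X'): indices of components whose feature set intersects X'.
    J = [j for j, Xj in enumerate(feature_sets) if Xj & x_prime]
    # Encodable iff the union of those components' features lies in X'.
    return set().union(*(feature_sets[j] for j in J)) <= x_prime

# Component feature sets consistent with the first model of Example 2:
sets = [{1, 2}, {2, 3}, {4, 5, 6}]
print(is_encodable({1, 2, 3}, sets))  # True
print(is_encodable({1}, sets))        # False: the component on {1, 2} leaks X^(2)
print(is_encodable({4, 5, 6}, sets))  # True
```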
Depending on the implementation of the STAR algorithm, it might be possible to directly extract the encoder of a feature set
${\mathcal{X}}^{\prime}$ from the fitted model. A procedure that works independently of the implementation is described in the following Algorithm 1, which requires only access to the prediction function
$\widehat{f}(x)$ on the scale of
$\xi $.
Algorithm 1 Encoder extraction 
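A minimal Python sketch of such an implementation-independent encoder extraction, assuming only access to a prediction function on the scale of $\xi$ (all function and argument names are illustrative, not part of any library):

```python
import numpy as np
import pandas as pd

def extract_encoder(predict, X_ref, grid, features, center=True):
    """Evaluate the encoder of an encodable feature set over `grid`.

    predict  -- callable mapping a DataFrame to predictions on the scale of xi
    X_ref    -- reference data; its first row provides fixed values for all
                features outside the encodable set
    grid     -- DataFrame with the feature combinations x' to evaluate
    features -- column names forming the encodable feature set X'
    """
    # Repeat one arbitrary reference row once per grid point ...
    base = pd.concat([X_ref.iloc[[0]]] * len(grid), ignore_index=True)
    # ... and overwrite the columns of the encodable feature set.
    base[features] = grid[features].to_numpy()
    values = predict(base)
    if center:
        # Shift so that the encoder averages to 0 on the reference data.
        values = values - np.mean(predict(X_ref))
    return values

# Tiny demonstration with a known additive model f(a, b) = 2a + b:
predict = lambda df: 2 * df["a"].to_numpy() + df["b"].to_numpy()
X_ref = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 5.0]})
grid = pd.DataFrame({"a": [0.0, 1.0, 2.0]})
print(extract_encoder(predict, X_ref, grid, ["a"]))  # [-4. -2.  0.]
```

Because the contribution of an encodable feature set is purely additive, the choice of the reference row affects the result only through an additive constant, which centering removes.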

Remark 3 (Raw scores and boosting)
. By default, predictions of XGBoost and LightGBM are on the scale of Y. If the functional ξ includes a link function (e.g., log or logit), one can obtain predictions on the scale of ξ via the argument output_margin (XGBoost) or raw_score (LightGBM) of the respective predict method. This is relevant for the application of Algorithm 1 and also for the interpretation of effects, as in Section 2.4.
The values of a (possibly centered) encoder of a feature set ${\mathcal{X}}^{\prime}$ can be used as a one-dimensional representation of ${\mathcal{X}}^{\prime}$ for subsequent analyses, e.g., as a derived covariate in a simplified regression model. In the case studies below, we will see that this approach produces models with an excellent trade-off between interpretability and predictive strength.
Example 3. For instance, after fitting the STAR model (4) of Example 1 with XGBoost, we could extract the (centered) purely additive contribution ${\widehat{f}}_{J({\mathcal{X}}^{\prime})}^{c}$ of all location covariates ${\mathcal{X}}^{\prime}$ using Algorithm 1 and calculate a subsequent linear regression of the form
$\xi ={\beta}_{0}+{\beta}_{1}{x}^{(1)}+\dots +{\beta}_{p-1}{x}^{(p-1)}+{\beta}_{p}{\widehat{f}}_{J({\mathcal{X}}^{\prime})}^{c}({x}^{\prime})$,
where ${x}^{\prime}=({x}^{(p)},\dots ,{x}^{(q)})$ represents the values of the locational variables. The main difference from the initial XGBoost model is that the effects of the building characteristics are now linear.
Remark 4 (Modeling strategy)
. The workflow in the last example is in line with the following general modeling strategy. Groups of related features (for instance, a large set of locational variables) are sometimes difficult to represent in a linear regression model. How should the features be transformed? Which interactions are important? How should one deal with strong multicollinearity? These burdens can be delegated to an initial STAR model of suitable structure. The model components representing such feature groups are then extracted and plugged as high-level features into a subsequent linear regression model. This allows one, e.g., to linearize some additive effects of a fitted boosted trees model with interaction constraints.
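A numpy-only sketch of this two-step strategy, with simulated data standing in for the values of the extracted centered encoder (all names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Step 1 (assumed already done): a STAR model was fitted with interaction
# constraints and the centered encoder of the locational variables was
# extracted; here, random values stand in for it.
x1 = rng.normal(size=n)               # a structural covariate, e.g., living area
encoder_values = rng.normal(size=n)   # stand-in for the centered encoder of X'
y = 1.0 + 0.5 * x1 + encoder_values + rng.normal(scale=0.1, size=n)

# Step 2: plug the encoder in as a derived, high-level covariate of a
# linear regression model.
X = np.column_stack([np.ones(n), x1, encoder_values])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
# beta[1] is the now-linear effect of x1; beta[2] should be close to 1
# because the encoder is already expressed on the scale of xi.
```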
Encoders are also helpful for interpreting effects in general STAR models, as will be explained in the next section.
2.4. Interpreting STAR Models
In this section, we provide some information on how to interpret effects of features and feature sets in general STAR models (
1), with a special focus on simple, fully transparent descriptions. These techniques will be used later in the case studies.
One of the most common techniques to describe the effect of a feature set
${\mathcal{X}}^{\prime}$ in an ML model is the partial dependence plot (PDP) introduced in
Friedman (2001). It visualizes the average partial effect of
${\mathcal{X}}^{\prime}$ by taking the average of many
individual conditional expectation profiles (ICE, see
Goldstein et al. 2015). The ICE profile for the
ith observation with observed covariate vector
${x}_{i}$ and feature set
${\mathcal{X}}^{\prime}$ is calculated by evaluating predictions
$\widehat{f}(x)$ over a sufficiently fine grid of values for vector components corresponding to
${\mathcal{X}}^{\prime}$, keeping all other vector components of
${x}_{i}$ fixed. The stronger the interaction effects from other features, the less parallel are the ICE profiles across multiple observations. Thus, a visual test for additivity can be performed by plotting many ICE profiles and checking whether they are parallel (see
Goldstein et al. 2015). Conversely, if all ICE profiles of
${\mathcal{X}}^{\prime}$ are parallel,
${\mathcal{X}}^{\prime}$ is represented by the model in an additive way. In that case, a single ICE profile (or the PDP) serves as a fully transparent description of the effect of
${\mathcal{X}}^{\prime}$ in the sense that it is clear how
${\mathcal{X}}^{\prime}$ acts on
$\widehat{f}(x)$ globally for all observations ceteris paribus.
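The computation just described can be sketched as follows (the helper names are ours, not from any library):

```python
import numpy as np

def ice_profiles(predict, X, feature_idx, grid):
    """One ICE profile per row of X: vary feature `feature_idx` over `grid`
    while keeping all other columns of each row fixed."""
    profiles = np.empty((X.shape[0], len(grid)))
    for k, g in enumerate(grid):
        X_mod = X.copy()
        X_mod[:, feature_idx] = g
        profiles[:, k] = predict(X_mod)
    return profiles

def is_additive(profiles, tol=1e-8):
    """Profiles are parallel iff, after subtracting each profile's own first
    value, they all coincide; the feature then acts additively."""
    shifted = profiles - profiles[:, [0]]
    return bool(np.max(np.abs(shifted - shifted[0])) < tol)

X = np.random.default_rng(1).normal(size=(5, 2))
grid = np.linspace(-1.0, 1.0, 7)

# Additive model f(x) = x0^2 + x1: ICE profiles of x0 are parallel.
prof = ice_profiles(lambda A: A[:, 0] ** 2 + A[:, 1], X, 0, grid)
pdp = prof.mean(axis=0)  # the PDP is the average of the ICE profiles
print(is_additive(prof))   # True

# Interaction f(x) = x0 * x1: profiles of x0 are not parallel.
prof2 = ice_profiles(lambda A: A[:, 0] * A[:, 1], X, 0, grid)
print(is_additive(prof2))  # False
```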
Studying ICE and PDP plots is not only interesting for interpreting feature effects in complex ML models. Along with other generalpurpose tools from the field of explainable ML (see, e.g.,
Biecek and Burzykowski 2021;
Molnar 2019), they can also be used to interpret models of restricted complexity, such as linear regression models, GAMs, or STAR models. There, they serve as (feature-centric) alternatives to partial residual plots (
Wood 2017) that are frequently used to visualize effects of
single model components. Note that, up to a vertical shift, partial residual plots coincide with ICE curves for features that appear in only one (singlefeature) component.
In the context of STAR models, from Definition 1, it is easy to see that, for a feature set ${\mathcal{X}}^{\prime}$ that has a purely additive contribution, ICE profiles of ${\mathcal{X}}^{\prime}$ are parallel across all observations, and their values correspond to values of the encoder ${\widehat{f}}_{J({\mathcal{X}}^{\prime})}({x}^{\prime})$ evaluated over the domain of ${x}^{\prime}$ (up to an additive constant). This includes centered or uncentered encoders as extracted by Algorithm 1. In fact, Algorithm 1 differs from the calculation of ICE profiles only in terms of technical details.
Thus, the effects of an encodable feature set ${\mathcal{X}}^{\prime}$ can be described in a simple yet transparent way by, e.g., showing one ICE profile, the PDP, or by evaluating its encoder over the domain of ${x}^{\prime}$. This is a major advantage of STAR models over unstructured ML models, where transparent descriptions of feature effects are unrealistic due to complex highorder interactions.
However, since it is difficult to describe ICE/PDP/encoders of more than two features, this concept is limited from a practical perspective to feature sets with only one or two features.
Remark 5.
To benefit from additivity, effects of features in STAR models are typically interpreted on the scale of ξ.
ICE profiles and the PDP can be used to interpret effects of nonencodable feature sets as well. Due to interaction effects, however, a single ICE profile or a PDP cannot give a complete picture of such an effect.
We have mentioned that describing multivariable ICE/PDP/encoders of more than two features is difficult in practice. However, depending on the modeling situation, it is not uncommon that a possibly large, encodable feature set ${\mathcal{X}}^{\prime}$ represents a low-dimensional feature set ${\mathcal{X}}^{\prime\prime}$ with mapping ${X}^{\prime}=\varphi ({X}^{\prime\prime})$. Here, ${X}^{\prime}$ and ${X}^{\prime\prime}$ denote the feature vectors corresponding to the feature sets ${\mathcal{X}}^{\prime}$ and ${\mathcal{X}}^{\prime\prime}$. In this case, the ICE/PDP/encoder can be evaluated over values of ${X}^{\prime\prime}$ to provide, again, a fully transparent description of the effects of ${X}^{\prime\prime}$ and ${X}^{\prime}$. Thus, instead of using the encoder ${\widehat{f}}_{J({\mathcal{X}}^{\prime})}^{c}({x}^{\prime})$ with high-dimensional ${x}^{\prime}$, we would use the equivalent encoder ${\widehat{f}}_{J({\mathcal{X}}^{\prime})}^{c}(\varphi ({x}^{\prime\prime}))$ with low-dimensional ${x}^{\prime\prime}$ to describe the effects of feature set ${\mathcal{X}}^{\prime}$ (or ${\mathcal{X}}^{\prime\prime}$), even if we cannot exactly describe the effects of single features in ${\mathcal{X}}^{\prime}$.
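A minimal sketch of this projection trick, where `phi`, the encoder, and all names are hypothetical stand-ins:

```python
def effect_on_grid(encoder, phi, grid_2d):
    """Evaluate the encoder of the high-dimensional feature vector X' over a
    low-dimensional grid of X'' values, composing it with the mapping phi."""
    return [encoder(phi(*point)) for point in grid_2d]

# Toy stand-ins: phi derives two "locational variables" from coordinates,
# and the encoder is simply their sum.
phi = lambda lat, lon: (lat + lon, lat * lon)
encoder = sum
print(effect_on_grid(encoder, phi, [(1.0, 2.0)]))  # [5.0]
```

The result can then be displayed, e.g., as a map over latitude/longitude, even though the underlying encoder acts on many derived locational variables.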
We have seen three such situations in Example 1: The full set of location variables can represent lowdimensional features, such as:
The administrative unit (a onedimensional feature).
The address (also a onedimensional feature).
Latitude/longitude (a twodimensional feature).
Similarly, time-dependent location effects could represent the two features “administrative unit” and “time”.
We use this projection trick in both of our case studies in
Section 3 and
Section 4 to describe multivariate location effects.