1. Introduction
Survival analysis is an important and fundamental tool for modeling time-to-event data [1], which are encountered in medicine, reliability, safety, finance, etc. This is one of the reasons why many machine learning models have been developed to deal with time-to-event data and to solve the corresponding problems in the framework of survival analysis [2]. The crucial peculiarity of time-to-event data is that a training set consists of censored and uncensored observations. When the time to an event exceeds the duration of an observation, we have a censored observation. When the event is observed, i.e., the time to the event coincides with the duration of the observation, we deal with an uncensored observation.
Many survival models are able to cover various cases of time-to-event probability distributions and their parameters [
2]. One of the important models is the Cox proportional hazards model [
3], which can be regarded as a semi-parametric regression model. There are also many parametric and nonparametric models. When considering machine learning survival models, it is important to point out that, in contrast to other machine learning models, their outcomes are functions, for instance, survival functions, hazard functions or cumulative hazard functions. For instance, the well-known effective model called the random survival forest (RSF) [
4] predicts survival functions (SFs) or cumulative hazard functions.
An important area of application of survival models is treatment effect estimation, which is often solved in the framework of machine learning [5]. The treatment effect shows how effective a treatment may be depending on the characteristics of a patient. The problem is solved by dividing patients into two groups, called treatment and control, such that patients from the different groups can be compared. One of the popular measures of treatment efficiency used in machine learning models is the average treatment effect (ATE) [6], which is estimated on the basis of observed data about patients, for instance, as the mean difference between outcomes of patients from the treatment and control groups.
Due to the difference between characteristics of patients and their responses to a particular treatment, the treatment effect is measured using the conditional average treatment effect (CATE), which is defined as the mean difference between outcomes of patients from the treatment and control groups, conditional on a patient feature vector [
7]. In fact, most methods of CATE estimation are based on constructing two regression models for controls and treatments. However, two difficulties can be encountered in CATE estimation. The first one is that the treatment group is usually very small; therefore, many machine learning models cannot be accurately trained on such small datasets. The second difficulty is fundamental: each patient cannot be simultaneously in the treatment and control groups, i.e., we either observe the patient outcome under treatment or under control, but never both [
8]. Nevertheless, to overcome these difficulties, many methods for estimating CATE have been proposed and developed due to the importance of the problem in many areas [
9,
10,
11,
12,
13].
One of the approaches for constructing regression models for controls and treatments is the application of the Nadaraya–Watson kernel regression [
14,
15], which uses standard kernel functions, for instance, the Gaussian, uniform or Epanechnikov kernels. In order to avoid selecting a standard kernel, Konstantinov et al. [
16] proposed to implement the kernels, and thereby the whole Nadaraya–Watson kernel regression, by using a set of identical neural subnetworks with shared parameters and a specific way of training the network. The corresponding method, called TNW–CATE (Trainable Nadaraya–Watson regression for CATE), is based on the important assumption that the domains of the feature vectors from the treatment and control groups are similar. Indeed, we often treat patients after they have been in the control group, i.e., it is assumed that treated patients came to the treatment group from the control group. For example, it is difficult to expect that patients with pneumonia will be treated with new drugs for stomach disease. The neural kernels (kernels implemented as neural networks) are more flexible and can accurately model a complex location structure of feature vectors, for instance, when the feature vectors from the control and treatment groups are located on a spiral, as shown in
Figure 1, where small triangular and circle markers correspond to the treatment and control groups, respectively. This is another important peculiarity of the TNW–CATE. Results provided in [
16] showed that TNW–CATE outperforms other methods when the treatment group is very small and the feature vectors have a complex structure.
Following the ideas behind the TNW–CATE, we propose the CATE estimation method, called BENK (the Beran Estimator with Neural Kernels), dealing with censored time-to-event data in the framework of survival analysis. The main idea behind the proposed method is to apply the Beran estimator [
17] for estimating SFs of treatments and controls and to compare them for estimating the CATE. One of the important peculiarities of the Beran estimator is that it takes into account distances between feature vectors by using kernels which measure the similarity between any two feature vectors. On the one hand, the Beran estimator can be regarded as an extension of the Kaplan–Meier estimator. It allows us to obtain SFs that are conditional on the feature vectors, which can be viewed as outcomes of regression survival models for the treatment and control groups. On the other hand, the Beran estimator can also be viewed as an analogue of the Nadaraya–Watson kernel regression for survival analysis. However, typical kernels, for example, the Gaussian one, cannot cope with the possible complex structure of data. Therefore, similarly to the TNW–CATE model, we propose to implement kernels in the Beran estimator by means of neural subnetworks and to estimate CATE by using the obtained SFs. The whole neural network model is trained in an end-to-end manner.
Various numerical experiments illustrate BENK and its peculiarities. They also show that BENK outperforms well-known meta-models, namely the T-learner, the S-learner and the X-learner, for several control and treatment outcome functions, with the base survival models taken to be the Cox model, the RSF and the Beran estimator with Gaussian kernels.
BENK is implemented using the framework PyTorch with open code. The code of the proposed algorithms can be found at
https://github.com/Stasychbr/BENK (accessed on 27 October 2023).
The paper is organized as follows.
Section 2 is a review of the existing CATE estimation models, including CATE estimation survival models, the Nadaraya–Watson regression models and general survival models. A formal statement of the CATE estimation problem is provided in
Section 3. The CATE estimation problem in the case of censored data is stated in
Section 4. The Beran estimator is considered in
Section 5. A description of BENK is provided in
Section 6. Numerical experiments illustrating BENK and comparing it with other models can be found in
Section 7. Concluding remarks are provided in
Section 8.
2. Related Work
Estimating CATE. One of the important approaches to implementing personalized medicine is treatment effect estimation. As a result, many interesting machine learning models have been developed and implemented to estimate the CATE. First, we point out an approach which uses the Lasso model for estimating the CATE [
18]. The SVM was also applied to solve the problem [
19]. A unified framework for constructing fast tree-growing procedures for solving the CATE problem was provided in [
20]. McFowland et al. [
21] estimated CATE by using the anomaly detection model. A set of meta-algorithms or meta-learners, including the T-learner, the S-learner and the X-learner, were studied in [
12]. Many other models related to the CATE estimation problem are studied in [
22,
23].
The aforementioned models are constructed by using machine learning methods different from neural networks. However, neural networks have become a basis for developing many interesting and efficient models [
24,
25,
26,
27].
Due to the importance of the CATE problem, there are many other publications devoted to this problem [
28,
29,
30,
31].
The next generation of models that solve the CATE estimation problem is based on architectures of transformers with the attention operations [
32,
33,
34]. The transfer learning technique was successfully applied to the CATE estimation in [
35,
36]. Ideas of using the Nadaraya–Watson kernel regression in the CATE estimation were studied in [
37]. These ideas lead to good results when the numbers of examples in the treatment and control groups are large. At the same time, a small amount of training data may lead to overfitting and unsatisfactory results. The problem of overcoming this limitation motivated researchers to introduce a neural network of a special architecture that implements trainable kernels in the Nadaraya–Watson regression [
16].
Machine learning models in survival analysis. The importance of survival analysis applications can be regarded as one of the reasons for developing many machine learning methods that deal with censored and time-to-event data. A comprehensive review of machine learning survival models is presented in [
2]. A large portion of models use the Cox model, which can be viewed as a simple and applicable survival model that establishes a relationship between covariates and outcomes. Various extensions of the Cox model have been proposed. They can be conditionally divided into two groups. The first group retains the linear relationship between covariates and outcomes and includes various modifications of the Lasso models [
38]. The second group of models relaxes the linear relationship assumption accepted in the Cox model [
39].
Many survival models are based on using the RSFs, which can be regarded as powerful tools, especially when models learn on tabular data [
40,
41]. At the same time, there are many survival models based on neural networks [
42,
43].
Estimating CATE with censored data. Censored data can be regarded as an important type of data, especially for estimating the treatment effect, because in many applications the outcomes are time-to-event data. This peculiarity is a reason for developing many CATE models that deal with censored data in the framework of survival analysis [
44,
45,
46]. Modifications of the survival causal trees and forests for estimating the CATE based on censored observational data were proposed in [
44]. An approach combining a treatment-specific semi-parametric Cox loss with a treatment-balanced deep neural network was studied in [
47]. Nagpal et al. [
48] presented a latent variable approach to model the CATE under the assumption that an individual can belong to one of the latent clusters with distinct response characteristics. The problem of CATE estimation by focusing on learning (discrete-time) treatment-specific conditional hazard functions was studied in [
49]. A three-stage modular design for estimating CATE in the framework of survival analysis was proposed in [
50]. A comprehensive simulation study presenting a wide range of settings, describing CATE by taking into account the covariate overlap, was carried out in [
51]. Rytgaard et al. [
52] presented a data-adaptive estimation procedure for estimation of the CATE in a time-to-event setting based on generalized random forests. The authors proposed a two-step procedure for estimation, applying inverse probability weighting to construct time-point-specific weighted outcomes as input for the forest. A unified framework for counterfactual inference, applicable to survival outcomes and formulation of a nonparametric hazard ratio metric for evaluating the CATE, were proposed in [
53].
In spite of the many works and results devoted to estimating the CATE with censored data, the available methods are mainly based on the assumption that the treatment group contains a large number of examples. Moreover, there are no results implementing the Nadaraya–Watson regression by means of neural networks.
3. CATE Estimation Problem Statement
In the CATE estimation problem, all patients are divided into two groups: control and treatment. Let the control group be a set of c patients, such that the i-th patient is characterized by a feature vector and an observed outcome (time to event, temperature, blood pressure, etc.). It is also supposed that the treatment group is a set of s patients, such that the i-th patient is similarly characterized by a feature vector and an observed outcome. The group indicator of the i-th patient is denoted $T_i$, where $T_i = 0$ ($T_i = 1$) corresponds to the control (treatment) group.
We use different notations for the feature vectors of controls and treatments in order to avoid additional indices. However, a single common vector notation is used for both groups when estimating the CATE.
Suppose that the potential outcomes of patients from the control and treatment groups are
F and
H, respectively. The treatment effect for a new patient with the feature vector
is estimated by the individual treatment effect, defined as the difference between the treatment and control outcomes. The fundamental problem of computing the CATE is that only one of the outcomes
f or
h for each patient can be observed. An important assumption of unconfoundedness [
54] is used to allow the untreated patients to be used to construct an unbiased counterfactual for the treatment group [
55]. According to the assumption, potential outcomes are characteristics of a patient before the patient is assigned to a treatment condition, or, formally, the treatment assignment
T is independent of the potential outcomes F and H, conditional on the feature vector.
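In standard causal-inference notation, writing $X$ for the random feature vector (the symbols are our choice here), this condition can be written as
$$\{F, H\} \perp T \mid X = \mathbf{x}.$$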
The second assumption, called the overlap assumption, concerns the joint distribution of treatments and covariates. It states that, for each value of the feature vector, there is a positive probability of being both treated and untreated.
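In the same assumed notation, the overlap assumption means that, with probability 1,
$$0 < \Pr\{T = 1 \mid X = \mathbf{x}\} < 1.$$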
Let X denote the random feature vector. The treatment effect is estimated by means of the CATE, which is defined as the expected difference between the two potential outcomes conditional on the feature vector [56]. By using the unconfoundedness and overlap assumptions, the CATE can be rewritten in terms of the outcomes observed in the treatment and control groups.
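Using $F$ and $H$ for the control and treatment potential outcomes, $T$ for the treatment assignment and $X$ for the feature vector (again our choice of symbols), these two expressions can be written as
$$\tau(\mathbf{x}) = \mathbb{E}\left[H - F \mid X = \mathbf{x}\right],$$
$$\tau(\mathbf{x}) = \mathbb{E}\left[H \mid X = \mathbf{x}, T = 1\right] - \mathbb{E}\left[F \mid X = \mathbf{x}, T = 0\right].$$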
The motivation behind unconfoundedness is that nearby observations in the feature space can be treated as having come from a randomized experiment [
7].
Suppose that two functions express the outcomes of the control and treatment patients, respectively, as functions of the feature vector with additive noise governed by the normal distribution with zero expectation. The above implies that the CATE can be estimated as the difference between the treatment and control outcome functions.
An example illustrating the controls (circle markers), the treatments (triangle markers) and the corresponding unknown outcome functions is shown in
Figure 1.
4. CATE with Censored Data
Before considering the CATE estimation problem with censored data, we introduce basic notions of survival analysis. Let us define the control training set, which consists of c triplets, where each triplet contains the feature vector characterizing the i-th patient from the control group, the time to the event concerning the i-th control patient and the indicator of censoring. If the indicator equals 1, then the event of interest is observed (an uncensored observation); if it equals 0, then the observation is censored. Only right-censoring is considered, i.e., the observed survival time is less than or equal to the true survival time. Many applications of survival analysis deal with right-censored observations [
2]. The main goal of survival machine learning modeling is to use the control training set to estimate probabilistic characteristics of the time F to the event of interest for a new patient with a given feature vector.
In the same way, we define the treatment training set, which consists of s triplets, where each triplet contains the feature vector characterizing the i-th patient from the treatment group, the time to the event concerning the i-th treatment patient and the indicator of censoring.
The survival function (SF) can be regarded as an important concept in survival analysis. It represents the probability that a patient with a given feature vector survives beyond time t. The hazard function can be viewed as another important concept in survival analysis. It is defined as the rate of the event at time t given that no event occurred before time t. It can be expressed through the SF.
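Writing $S(t \mid \mathbf{x})$ for the SF and $h(t \mid \mathbf{x})$ for the hazard function (the notation is our assumption), the standard relation is
$$h(t \mid \mathbf{x}) = -\frac{\partial}{\partial t} \ln S(t \mid \mathbf{x}).$$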
The integral of the hazard function over the interval from 0 to t is called the cumulative hazard function; it accumulates the risk of the event up to time t. It can also be expressed through the SF.
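In the same assumed notation, the cumulative hazard function satisfies
$$H(t \mid \mathbf{x}) = \int_0^t h(u \mid \mathbf{x})\,\mathrm{d}u = -\ln S(t \mid \mathbf{x}).$$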
The above functions for controls and treatments are written with indices 0 and 1, respectively, for instance, $S_0(t \mid \mathbf{x})$ and $S_1(t \mid \mathbf{x})$.
In order to compare survival models, Harrell’s concordance index, or the C-index [
57], is usually used. The C-index measures the probability that, in a randomly selected pair of examples, the example that failed first had a worse predicted outcome. It is calculated as the ratio of the number of pairs correctly ordered by the model to the total number of admissible pairs. A pair is not admissible if the events of both examples are right-censored or if the earlier time in the pair corresponds to a censored observation. A survival model is perfect when the C-index equals 1; a C-index of 0.5 means that the model is equivalent to random guessing, and a C-index below 0.5 means that the model is worse than random guessing.
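As an illustration only (not the implementation used in the experiments), the C-index can be computed over all admissible pairs as in the following sketch; the array names and the convention that higher risk scores correspond to earlier expected events are assumptions.

```python
import numpy as np

def c_index(times, events, risks):
    """Harrell's C-index: fraction of admissible pairs ordered correctly.

    times  : observed times, shape (n,)
    events : 1 for an observed event, 0 for right-censoring, shape (n,)
    risks  : predicted risk scores (higher risk = earlier expected event)
    """
    correct, admissible = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # a pair is admissible only if the earlier time is an observed event
            if events[i] == 1 and times[i] < times[j]:
                admissible += 1
                if risks[i] > risks[j]:
                    correct += 1.0      # correctly ordered pair
                elif risks[i] == risks[j]:
                    correct += 0.5      # tied predictions count as half
    return correct / admissible
```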
In contrast to the standard CATE estimation problem statement given in the previous section, the CATE estimation problem with censored data has a different statement, because outcomes in survival analysis are random times to an event of interest with some conditional probability distribution. In other words, the predictions provided by a survival machine learning model for a patient characterized by a feature vector are represented as functions of time, for instance, as the SF. This implies that the CATE should be reformulated by taking this peculiarity into account. It is assumed that the SFs as well as the hazard functions for control and treatment patients, estimated by using the control and treatment datasets, have indices 0 and 1, respectively.
The following definitions of the CATE in the case of censored data can be found in [
58]:
Difference in expected lifetimes:
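In terms of the control and treatment SFs $S_0(t \mid \mathbf{x})$ and $S_1(t \mid \mathbf{x})$ and the potential outcomes $F$ and $H$ introduced above (a reconstruction in our notation), this definition reads
$$\tau(\mathbf{x}) = \mathbb{E}\left[H \mid \mathbf{x}\right] - \mathbb{E}\left[F \mid \mathbf{x}\right] = \int_0^\infty \left[S_1(t \mid \mathbf{x}) - S_0(t \mid \mathbf{x})\right] \mathrm{d}t.$$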
We will use the first integral definition of the CATE. Let $t_1 < t_2 < \dots < t_m$ be the distinct times to the event of interest obtained from the training set. The SF provided by a survival machine learning model is a step function, i.e., it can be represented as a sum over the intervals $[t_j, t_{j+1})$ of indicator functions of these intervals multiplied by the corresponding values $S_j$ of the SF in the intervals. Hence, the integral defining the expected lifetime reduces to a finite sum.
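With the step representation above, where $S_j$ denotes the value of the SF in the interval $[t_j, t_{j+1})$ and $t_0 = 0$ (the notation is ours), the integral of the SF becomes
$$\int_0^{t_m} S(t \mid \mathbf{x})\,\mathrm{d}t = \sum_{j=0}^{m-1} S_j\,(t_{j+1} - t_j).$$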
6. Neural Network for Estimating CATE
Let us consider how the Beran estimator with neural kernels can be implemented by means of a neural network of a special type. Our first aim is to implement kernels
by means of a neural subnetwork, which is called the neural kernel and is a part of the whole network for implementing the Beran estimator. The second aim is for this network to learn on the control data. Having the trained kernel, we can apply it to compute the conditional survival function for controls, as well as for treatments, because the kernels in (
14) do not directly depend on the times to events of controls or treatments. However, in order to train the kernel, we have to train the whole network because the loss function is defined through the SF
, which represents the probability of survival of a control patient up to time
t, which is estimated by means of the Beran estimator. This implies that the whole network contains blocks of the neural kernels for computing kernels
, normalization for computing the kernel weights
and the Beran estimator in accordance with (
14). In order to realize a training procedure for the network, we randomly select a portion (
n examples) from all control training examples and form a single specific example from
n selected ones. This random selection is repeated
N times to have
N examples for training. Thus, for every feature vector from the control group, we add other vectors from the same set of controls. By composing n pairs of vectors and including the other elements of the training examples (the times to events and the censoring indicators), we obtain one composite vector of data, representing one new training example for the entire neural network. Such new training examples can be constructed for each control example. The formal construction of the training set is considered below.
Having the trained neural kernel, we can successfully use it for computing the SF of controls and the SF of treatments for arbitrary feature vectors, again applying the Beran estimator.
Let us consider the training algorithm in detail. First, we return to the set of c controls. For every index i from this set, we construct N subsets, each containing n examples randomly selected from the set of controls and indexed by a corresponding index set. Here, N and n can be regarded as tuning hyperparameters. The upper index r indicates that the r-th example of a subset is randomly taken from the set of controls, i.e., it coincides with one of the original control examples. Each subset, jointly with the i-th control example, forms a training example for the control network. The total number of constructed examples is cN, and these examples are used for training the neural network, whose output is the estimate of the SF of controls.
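The construction of the composite training examples can be sketched as follows; the array names X_c, T_c, D_c (control features, times to events and censoring indicators) and the data layout are illustrative assumptions, not the notation of the released code.

```python
import numpy as np

def build_training_examples(X_c, T_c, D_c, n, N, seed=None):
    """For every control example i, draw N random subsets of n controls and form
    composite examples: the n pairs (x_i, x_r), the times and censoring indicators
    of the selected controls, and the target time/indicator of example i."""
    rng = np.random.default_rng(seed)
    c = len(X_c)
    examples = []
    for i in range(c):
        for _ in range(N):
            idx = rng.choice(c, size=n, replace=False)          # random index subset
            pairs = np.stack([np.concatenate([X_c[i], X_c[r]]) for r in idx])
            examples.append((pairs, T_c[idx], D_c[idx], T_c[i], D_c[i]))
    return examples                                             # c * N composite examples
```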
The architecture of the neural network, consisting of n subnetworks that implement the neural kernels, is shown in Figure 2. Examples produced from the dataset of controls are fed to the whole neural network, such that each pair of feature vectors is fed to one subnetwork, which implements the kernel function. The output of each subnetwork is the corresponding kernel value. All subnetworks are identical and have shared weights. After normalizing the kernels, we obtain n weights, which are used to estimate SFs by means of the Beran estimator in (14). The block of the whole neural network that implements the Beran estimator uses all weights and the corresponding times to events and censoring indicators. As a result, we obtain the estimate of the SF of controls for the i-th training example. In the same way, we compute the SFs for all training examples. These functions are the basis for training. In fact, the normalization block and the block that implements the Beran estimator can be regarded as parts of the neural network, and the whole network is trained in an end-to-end manner.
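A minimal sketch of the Beran estimator block is given below, assuming that the normalized kernel weights are already ordered by ascending times to events; it is an illustration of the standard Beran formula rather than the released implementation.

```python
import torch

def beran_sf(weights, delta):
    """Step SF produced by the Beran estimator.

    weights : (n,) normalized kernel weights ordered by ascending time to event
    delta   : (n,) censoring indicators in the same order (1 = uncensored)
    Returns the SF values at the ordered event times (a step function).
    """
    cum_before = torch.cumsum(weights, dim=0) - weights          # sum of w_j for j < i
    factors = 1.0 - weights / (1.0 - cum_before).clamp(min=1e-8)
    factors = torch.where(delta.bool(), factors, torch.ones_like(factors))
    return torch.cumprod(factors, dim=0)
```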
According to (
13), expected lifetimes are used to compute the CATE
. Therefore, the whole network is trained by means of a loss function defined as follows: it is the mean of the squared differences, over the subset of uncensored examples from the training set, between the observed time to the event of the k-th example and the expected lifetime computed through the estimated SF, which is obtained by integrating the SF. The sum in the loss function (18) is taken over uncensored examples only. However, the Beran estimator itself uses all the examples.
One of the loss functions that takes into account all data (censored and uncensored) is based on the C-index. However, our aim is not to estimate the SF or the CHF as such; we aim to make the expected time to event computed from the estimated SF close to the observed time to event. Therefore, we use the standard mean squared error (MSE) loss function. Censored times would introduce bias into the MSE and, therefore, they are not used.
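A sketch of such an MSE loss over uncensored examples, with the expected lifetime obtained by integrating the step SF, might look as follows (tensor names and shapes are illustrative assumptions).

```python
import torch

def benk_loss(sf_steps, time_grid, target_times, uncensored):
    """MSE between expected lifetimes and observed times over uncensored examples.

    sf_steps     : (batch, m) SF values on the m intervals of the time grid
    time_grid    : (m + 1,) ordered distinct event times with a leading zero
    target_times : (batch,) observed times to events of the target examples
    uncensored   : (batch,) boolean mask of uncensored target examples
    """
    dt = time_grid[1:] - time_grid[:-1]              # interval lengths
    expected_life = (sf_steps * dt).sum(dim=1)       # integral of the step SF
    sq_err = (expected_life - target_times) ** 2
    return sq_err[uncensored].mean()                 # censored examples are excluded
```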
It is important to point out that our aim is to train subnetworks with shared parameters, which implement the neural kernels. Having the trained neural kernels, we can use them to compute kernels for controls and for treatments and then to compute estimates of the SFs $S_0$ and $S_1$ for controls and treatments, respectively, i.e., we realize the idea of transferring the task from the control group to the treatment group. Let the ordered time moments corresponding to the times to events of controls and of treatments be given. Then, the CATE can be computed through the SFs $S_0$ and $S_1$, again by using the Beran estimators with the trained neural kernels, i.e., in accordance with (13), the CATE is the difference between the expected lifetimes of treatments and controls, where the expected lifetimes are obtained by integrating the step-wise estimates of $S_1$ and $S_0$ over the corresponding time intervals.
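Given the two estimated step SFs on their respective time grids, the CATE can be computed, for instance, as in the following sketch (an illustration under the same step-function convention as above).

```python
def estimate_cate(sf_treat, grid_treat, sf_contr, grid_contr):
    """CATE as the difference of expected lifetimes of treatments and controls.

    sf_*   : step SF values on the intervals of the corresponding time grid
    grid_* : ordered event times of the corresponding group with a leading zero
    """
    life_treat = (sf_treat * (grid_treat[1:] - grid_treat[:-1])).sum()
    life_contr = (sf_contr * (grid_contr[1:] - grid_contr[:-1])).sum()
    return life_treat - life_contr
```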
The illustration of the neural networks that predict $S_0$ and $S_1$ for a new feature vector is shown in Figure 3. It can be seen from Figure 3 that the first neural network consists of c subnetworks, such that pairs composed of the new feature vector and a control feature vector are fed to the subnetworks. Predictions of the first neural network are c kernels, which are used to compute $S_0$ by means of the Beran estimator (14). The neural network for predicting the kernels used to estimate the treatment SF $S_1$ has the same architecture. This network consists of s subnetworks and uses feature vectors from the dataset of treatments. After computing the estimates $S_0$ and $S_1$, we can find the CATE.
Phases of training and computing CATE
by means of neural kernels are schematically shown as Algorithms 1 and 2, respectively.
Algorithm 1 The algorithm for training neural kernels
Require: Dataset of c controls, dataset of s treatments, number N of generated subsets, number n of examples in each generated subset
Ensure: Trained neural kernels for their use in the Beran estimator for control and treatment data
1: for every control example i do
2:  for every generated subset k = 1, ..., N do
3:   Generate a subset of n randomly selected controls
4:   Form a composite training example from the subset and the i-th control example
5:  end for
6: end for
7: Train the weight-sharing neural network with the loss function given in (18) on the set of constructed examples
Algorithm 2 The algorithm for computing CATE for a new feature vector
Require: Trained neural kernels, control and treatment datasets, testing example
Ensure: CATE for the testing example
1: for every control example i = 1, ..., c do
2:  Form a pair of vectors from the testing example and the i-th vector of the dataset of controls
3:  Feed the pair to the trained neural kernel and predict the corresponding kernel value
4: end for
5: for every treatment example i = 1, ..., s do
6:  Form a pair of vectors from the testing example and the i-th vector of the dataset of treatments
7:  Feed the pair to the trained neural kernel and predict the corresponding kernel value
8: end for
9: Compute the normalized kernel weights for controls and for treatments
10: Estimate the control and treatment SFs using (14)
11: Compute the CATE using (20)
7. Numerical Experiments
Numerical experiments studying BENK and comparing it with available models are performed by using simulated datasets, because for real data the true CATEs are unknown due to the fundamental problem of causal inference [
8]. This implies that control and treatment datasets are randomly generated in accordance with predefined outcome functions.
7.1. CATE Estimators for Comparison and Their Parameters
For investigating BENK and comparing it with alternatives, we use nine models, which can be combined into three groups (the T-learner, the S-learner and the X-learner), such that each group is based on three base models for estimating SFs (the RSF, the Cox model and the Beran estimator with Gaussian kernels). The models are described below in terms of survival models:
The T-learner [
12] is a model which estimates the control SF
and the treatment SF
for every feature vector. The CATE in this case is defined in accordance with (
13);
The S-learner [
12] is a model which estimates a single SF instead of separate control and treatment SFs, where the treatment assignment indicator is included as an additional feature of the feature vector. As a result, we have a modified dataset combining the controls (with indicator value 0) and the treatments (with indicator value 1). The CATE is determined as the difference between the estimates obtained by setting the indicator to the treatment and control values;
The X-learner [
12] is based on computing the so-called imputed treatment effects and is represented by the following three steps. First, the outcome functions for controls and treatments are estimated using a regression algorithm. Second, the imputed treatment effects are computed as the differences between the observed outcomes and the estimated outcome functions of the opposite group. Third, two regression functions are estimated for the imputed treatment effects of controls and treatments, respectively. The CATE for a point is defined as a weighted linear combination of these two regression functions, where the weight is equal to the proportion of treated patients. The original X-learner does not deal with censored data; therefore, we propose a simple survival modification of the X-learner. It is assumed that the outcome functions are the expectations of the times to an event corresponding to the control and treatment data, respectively. These expectations are computed by means of one of the algorithms for determining estimates of the SFs. The regression functions for the imputed treatment effects are implemented using the random forest regression algorithm for all the base models.
Estimations of the SFs and the corresponding expected times to events are carried out by means of the following survival regression algorithms:
The RSF parameters of random forests used in experiments are the following:
The numbers of trees are 10, 50, 100, 200;
The depths are 3, 4, 5, 6;
The smallest numbers of examples which fall in a leaf are 1 example, 1%, 5% and 10% of the training set.
The above values for the hyperparameters are tested, choosing those leading to the best results;
The Cox proportional hazards model [
3], which is used with the elastic net regularization with a 3 to 1 ratio of the two penalty coefficients;
In contrast to the proposed BENK model, this baseline uses the Beran estimator with standard Gaussian kernels. Several values of the kernel parameters, including bandwidth values 5, 50, 200, 500 and 700, are tested, choosing those leading to the best results.
In sum, we have nine models for comparison, whose notations are given in
Table 1.
7.2. Generating Synthetic Datasets
As described above, we generate artificial complex feature spaces and outcomes in the numerical experiments. All the feature vectors, for both controls and treatments, are generated by means of three functions: the spiral function, the bell-shaped function and the circular function. The idea of using these functions stems from the goal of obtaining complex data structures, which are poorly processed by many standard methods. These functions are defined through a parameter as follows:
Spiral functions: The feature vectors, having dimensionality
d and being located on the Archimedean spirals, are defined by parametric spiral equations whose forms differ slightly for even and odd d. Values of the spiral parameter are uniformly generated from a fixed interval for all numerical experiments;
Bell-shaped functions: Features are represented as a set of almost non-overlapping Gaussians. As
the parameter is uniformly generated in the numerical experiments, we can define the corresponding lower and upper bounds of the uniform distribution. The feature vector of dimensionality d is composed of bell-shaped (Gaussian) components defined on almost non-overlapping sub-intervals of this range, so that each feature corresponds to its own region in the distribution;
Circular functions: The corresponding feature space is generated by using only even numbers of features. The feature vectors are located on two-dimensional circles defined through the parameter and an indicator function.
Each pair of features corresponds to their own two-dimensional circle and to their own region in the distribution.
In all experiments, the treatment feature vectors are generated in the same way as the control feature vectors. However, for feature vectors from the control and treatment groups, the corresponding times to events
f and
h are different and are generated by using the Weibull distribution, as follows:
where
u is a random variable uniformly distributed on the interval from 0 to 1; values
f and
h larger than 2000 are clipped to this value.
This way of generating f and h is in agreement with the Cox model. Hence, we can use the Cox model as a base model, along with RSFs and the Beran estimator with Gaussian kernels, in the numerical experiments.
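A Cox-consistent way of drawing such Weibull times via the inverse transform is sketched below; the baseline scale and shape constants and the coefficient vector beta are illustrative assumptions, not the exact values used in the experiments.

```python
import numpy as np

def weibull_cox_time(x, beta, lam=1e-5, shape=2.0, clip=2000.0, seed=None):
    """Draw a time to event from a Weibull distribution whose scale depends on
    the covariates through a Cox-type term exp(beta @ x)."""
    rng = np.random.default_rng(seed)
    u = rng.uniform(0.0, 1.0)                                   # u ~ U(0, 1)
    t = (-np.log(u) / (lam * np.exp(beta @ x))) ** (1.0 / shape)
    return min(t, clip)                                         # clip times larger than 2000
```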
The proportion of censored data, denoted p, is taken as 33% of all observations in the experiments. Hence, the censoring indicators are generated from the binomial distribution with probabilities 0.67 for an uncensored observation and 0.33 for a censored one.
The Precision in Estimation of Heterogeneous Effects metric (PEHE), proposed in [
61], is used to measure the performance of the models in the numerical experiments. According to [
61], this metric evaluates the ability of each method to capture treatment effect heterogeneity.
If we label the test dataset as
, then the PEHE can be defined as follows:
where the normalization is by the size of the test set, which is fixed to the same value for all numerical experiments.
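For completeness, the PEHE over a test set with known true CATEs can be computed, for instance, as follows (the root-mean-squared form is assumed here).

```python
import numpy as np

def pehe(cate_pred, cate_true):
    """Root mean squared difference between predicted and true CATEs."""
    diff = np.asarray(cate_pred) - np.asarray(cate_true)
    return np.sqrt(np.mean(diff ** 2))
```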
The proportion of treatments to controls in most experiments is 20%, except for the experiments studying how the proportion of treatments impacts the CATE, where the proportion of treatments to controls is denoted as q. For example, if 100 controls are generated for an experiment with q equal to 20%, then 20 treatments are generated in addition to the controls, such that the total number of examples is 120. The generated feature vectors in all experiments consist of 10 features; the corresponding dataset size is 300 unless otherwise stated. To select optimal hyperparameters of BENK, additional validation examples are generated, such that they belong only to the control group; the size of this additional validation set is a fraction of the training set size. After the BENK neural network training, this validation set is concatenated with the training set for the other models, which are trained using cross-validation with three splits. For studying the dependencies, we repeat the numerical experiments 100 times and report the mean values across these 100 iterations.
Each subnetwork is a fully connected neural network consisting of five layers, with the corresponding activation functions ReLU6, ReLU6, ReLU6, Tanh and Softplus. The inputs of each subnetwork are represented in a symmetric form of the pair of feature vectors to ensure the symmetry property of the kernels. The non-negativity of the neural kernels is achieved by using the Softplus activation function in the last layer of the subnetworks, which ensures that the output is always positive.
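A minimal PyTorch sketch of one such kernel subnetwork is given below; the hidden layer sizes and the exact symmetric encoding of the input pair are illustrative assumptions.

```python
import torch.nn as nn

class NeuralKernel(nn.Module):
    """One kernel subnetwork: five fully connected layers with the activations
    ReLU6, ReLU6, ReLU6, Tanh and Softplus; the final Softplus keeps the kernel
    value positive."""
    def __init__(self, in_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU6(),
            nn.Linear(hidden, hidden), nn.ReLU6(),
            nn.Linear(hidden, hidden), nn.ReLU6(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1), nn.Softplus(),
        )

    def forward(self, pair):
        # pair: a symmetric encoding of (x, x_i), e.g., their elementwise
        # absolute difference (an assumption; the paper only states that the
        # input form ensures the symmetry of the kernel)
        return self.net(pair)
```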
7.3. Study of the BENK Properties
In all figures illustrating results of the numerical experiments, dotted curves correspond to the T-learner (triangle markers), the S-learner (triangle markers) or the X-learner (circle markers) under the condition of using the Beran estimator with Gaussian kernels. Dash-and-dot curves correspond to the Cox models. Dashed curves with the same markers correspond to the same models implemented using RSFs. The solid curve with cross markers corresponds to BENK. The PEHE metric is used to represent the results of the experiments; the smaller the PEHE values, the better the results. To avoid cluttering the figures with curves, we show, for each of the T-, S- and X-learners, the best base model obtained in each experiment.
First, we study different CATE estimators using different numbers
c of controls, taking the values 100, 200, 300, 500, 1000. The number of treatments
t is determined as 20% of the number of controls. The value of the hyperparameter n is fixed in these experiments.
Figure 4,
Figure 5 and
Figure 6 illustrate how values of the PEHE metric depend on the number
c of controls for different estimators when different functions are used for generating examples.
Figure 4 shows the difference between the PEHE metric of BENK and other models in the experiment, with the feature vectors located around the spiral. The T-SF, S-Beran and X-SF models are provided in
Figure 4 because they show the best competitive metric values. In order to illustrate how the variance in results depends on the amount of input data, the error bars are also depicted in
Figure 4. It can be seen from
Figure 4 that the variance in the results decreases as the number of controls grows. This indicates that the neural network is trained properly. We do not add error bars to the other graphs so as not to mask the relative positions of the corresponding curves.
Figure 5 illustrates similar dependencies when the bell-shaped function is used for generating the feature vectors. The selected models in this case are T-Cox, S-SF and X-Cox.
Figure 6 illustrates the relationship between different models obtained on the circular feature space. The competitive algorithms given in the picture are T-Beran, S-Beran and X-Beran. It can be seen from
Figure 4,
Figure 5 and
Figure 6 that the proposed model BENK provides better results in comparison with other models. The largest relative difference between BENK and other models can be observed when the feature vectors are generated in accordance with the spiral function. This function produces the most complex data structure, such that other studied models cannot cope with it.
Another interesting question is how the CATE estimators depend on the proportion
q of treatments and controls in the training set. Particularly, for the proposed BENK model, we try to study whether an increasing number of treatments (the set
) provides better CATE results with an unchanged number of controls (the set
). The corresponding numerical results are shown in
Figure 7,
Figure 8 and
Figure 9. One can see from
Figure 7,
Figure 8 and
Figure 9 that the improvement in the PEHE in comparison with other CATE estimators is substantial when
q is changed from 10% to 20% in the experiments with the spiral and bell-shaped functions. Moreover, we again observe the outperformance of BENK in comparison with other estimators.
In the previous experiments, the amount of censored data was taken as 33% of all observations. However, it is interesting to study how this amount impacts the PEHE of the CATE estimators.
Figure 10,
Figure 11 and
Figure 12 illustrate the corresponding dependencies when different generating functions are used. It can be seen from
Figure 10,
Figure 11 and
Figure 12 that the PEHE metrics for all estimators, including BENK, increase with the amount of censored data.
Table 2 aims to quantitatively compare the results under a fixed experimental setting (fixed numbers of controls and treatments, number of features and proportion of censored data). One can see from
Table 2 that BENK provides outperforming results. Let us compare results obtained for BENK with the results provided by other models in
Table 2. For comparison, we can apply the standard t-test. The obtained
p-values for all pairs of models are shown in the last column. We can see from
Table 2 that all
p-values are smaller than the significance level. Hence, we can conclude that the outperformance of BENK is statistically significant. It is interesting to note from
Table 2 that the methods based on the Cox model (T-Cox, S-Cox, X-Cox) show worse results. This can be explained by the restrictive assumption of a linear relationship of the features, which is made in the Cox model. This assumption contradicts the complex spiral, bell-shaped and circular functions and does not allow us to obtain better results. It should be pointed out that T-NW provides the best result for the bell-shaped generating function among the methods other than BENK. This is explained by the fact that the bell-shaped function is close to the Gaussian function; therefore, the method based on the Nadaraya–Watson kernel regression does not crucially differ from BENK. It is also interesting to note that efficient methods such as the S-learner and the X-learner often provide worse results in comparison with the T-learner, which is rather weak in standard CATE tasks. This is due to peculiarities of survival data, which differ from standard regression and classification data.
It should be noted that we did not provide results of various deep neural network extensions of the CATE estimators because they have not been successful. The problem is that neural networks require a large amount of data for training, and the considered small datasets led to overfitting of the networks. This is why we studied models that provide satisfactory predictions under the condition of small amounts of data.
8. Conclusions
A new method called BENK for solving the CATE problem under the condition of censored data has been presented. It extends the idea behind TNW–CATE proposed in [
16] to the case of censored data. In spite of many similar parts, TNW–CATE and BENK are different because BENK is based on the Beran estimator for training and can be successfully applied to survival analysis of controls and treatments. However, TNW–CATE and BENK use the same idea of training neural kernels: the kernels are implemented as neural networks instead of using standard kernel functions.
It is also interesting to point out that BENK does not require one to have a large dataset for training, even though a neural network is used for implementing the kernels. This is due to the special way proposed to train the network, which considers pairs of examples from the control group for training, as in Siamese neural networks. Our experiments have illustrated the outperforming characteristics of BENK. At the same time, we have to point out some disadvantages of BENK. First, it has many tuning parameters, including the parameters of the neural network and the training parameters n and N, such that the training time may be significantly increased in comparison with other methods of solving the CATE problem. Second, BENK assumes that the feature vector domains are similar for controls and treatments. This does not mean that they have to coincide totally, but the corresponding difference between domains should not be very large. A method which could take into account a possible difference between the feature vector domains for controls and treatments can be regarded as a direction for further research. An idea behind such a method is to combine domain adaptation models and BENK.
Another direction for further research is to study robust versions of BENK when there are anomalous observations that may impact training the neural network. An idea behind the robust version is to use attention weights for feature vectors and also to introduce additional attention weights for predictions.
It should be noted that the Beran estimator is one of several estimators that are used in survival analysis. Moreover, we have studied only the difference in expected lifetimes as a definition of the CATE in the case of censored data. There are other definitions, for instance, the difference in SFs and the hazard ratio, which may lead to more interesting models. Therefore, BENK implementations and studies using other estimators and definitions of the CATE can also be considered as directions for further research.
The proposed method can be used in applications that are different from medicine. For example, it can be applied to selection and control of the most efficient regimes in the Internet of Things. This is also an interesting direction for further research.