Estimating the Individual Treatment Effect on Survival Time Based on Prior Knowledge and Counterfactual Prediction

The estimation of the Individual Treatment Effect (ITE) on survival time is an important research topic in clinical causal inference. Various representation learning methods have been proposed to deal with its three key problems, i.e., reducing selection bias, handling censored survival data, and avoiding balancing non-confounders. However, none of them considers all three problems in a single method. In this study, by combining the Counterfactual Survival Analysis (CSA) model and Dragonnet from the literature, we first propose CSA-Dragonnet to deal with the three problems simultaneously. Moreover, we found that conclusions from traditional Randomized Controlled Trials (RCTs) or Retrospective Cohort Studies (RCSs) can offer valuable bound information to the counterfactual learning of the ITE, which has never been used by existing ITE estimation methods. Hence, we further propose CSA-Dragonnet with Embedded Prior Knowledge (CDNEPK), which formulates a unified expression of the prior knowledge given by RCTs or RCSs, inserts counterfactual prediction nets into CSA-Dragonnet, and defines loss items based on the bounds for the ITE extracted from the prior knowledge. Semi-synthetic data experiments showed that CDNEPK has superior performance. Real-world experiments indicated that CDNEPK can offer meaningful treatment advice.


Introduction
In this paper, problems related to the estimation of the Individual Treatment Effect (ITE) on survival time will be discussed. Estimating treatment effects from observational data is an important research topic of causal inference [1]. With the development of personalized healthcare, there has been increasing concern about estimating the Individual Treatment Effect on survival time, which indicates how much an individual could benefit from a pair of treatments in the sense of prolonging survival time [1], and therefore can help a doctor or a patient determine which treatment to select.
To introduce the related concepts and existing works more clearly and concisely, we need to give some notations in this section. Suppose there are N patients. For each individual patient i with baseline x_i composed of some covariates (e.g., basic information, laboratory tests, and imaging tests), let Y_i^{T=1}(x_i) and Y_i^{T=0}(x_i) represent the potential outcomes of a pair of treatments T ∈ {1, 0}. Since a patient can only receive one actual treatment, of the two potential outcomes, the one which cannot be observed is referred to as counterfactual [2]. The Individual Treatment Effect of T = 1 relative to T = 0 is defined by ∆Y_i(x_i) = Y_i^{T=1}(x_i) − Y_i^{T=0}(x_i) [1,2], which is also counterfactual because one of Y_i^{T=1} and Y_i^{T=0} must be counterfactual. This makes it difficult to directly learn a model for ∆Y_i, i.e., ∆Y_i(x_i) = f(x_i), based on historical data, which would be useful for predicting the ITE for a new patient before the selection of treatment.
A common idea to reduce selection bias in representation learning methods for ITE estimation is balancing the confounders. As one of the typical representation learning-based methods for ITE estimation, the Counterfactual Regression (CFR) method proposed by Shalit et al. [3] first uses a fully connected network (FCN) φ(x_i) to map x_i into a representation space. Then, taking the idea of separating the learning for T = 1 and T = 0 as mentioned above, the CFR method uses two FCNs to predict ŷ_i^{t_i=1} and ŷ_i^{t_i=0} based on D^{T=1} and D^{T=0}, respectively, and finally optimizes the three FCNs by minimizing a CFR loss function defined as the weighted summation of ∑_{i∈D^{T=1}} (y_i^{t_i=1} − ŷ_i^{t_i=1})², ∑_{i∈D^{T=0}} (y_i^{t_i=0} − ŷ_i^{t_i=0})², and IPM(p_φ^{T=1}, p_φ^{T=0}), where the first two items obviously measure the estimation errors for the data from D^{T=1} and D^{T=0}, respectively, and IPM(p_φ^{T=1}, p_φ^{T=0}) represents the integral probability metric (IPM) between the probability distributions of D^{T=1} and D^{T=0} in the representation space, whose minimization means balancing the impact of the confounders on the two distributions in the representation space. As mentioned in [1], the CFR model has also been extended to some other improved models, such as those in [6,7]. Additionally, considering the unmeasured confounders, Anpeng Wu et al. proposed an instrumental variable-based counterfactual regression method, which can also be regarded as an improvement to the CFR model [5].

ITE estimation methods based on representation learning (including the CFR method mentioned above) face the censoring problem when they are applied to survival time, because the output y_i^{t_i=1} or y_i^{t_i=0} denoting survival time in this case will become unavailable (also referred to as censored) if the follow-up of patient i has been lost or patient i is still alive when the trial ends [4,8,9]. To make the methods applicable to survival data, Chapfuwa et al. proposed a Counterfactual Survival Analysis (CSA) method [4], which improves CFR by replacing the censored outputs (i.e., the survival times) with the so-called observed censoring times (please refer to the notations in Section 2.1) and revising the corresponding estimation error items in the loss function.
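As an illustration of the loss structure just described, the following is a minimal sketch (ours, not the implementation of [3]): two factual squared-error terms plus an IPM term, with the simplest member of the IPM family, a linear mean discrepancy, standing in for the Wasserstein/MMD distances actually used in CFR; all function names and the weight alpha are hypothetical.

```python
# Illustrative sketch of a CFR-style loss (not the authors' code).

def mean_vec(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[k] for v in vectors) / n for k in range(len(vectors[0]))]

def linear_mmd(reps_t1, reps_t0):
    """Distance between the mean representations of the two treatment
    groups -- one of the simplest members of the IPM family."""
    m1, m0 = mean_vec(reps_t1), mean_vec(reps_t0)
    return sum((a - b) ** 2 for a, b in zip(m1, m0)) ** 0.5

def cfr_loss(pred_t1, y_t1, pred_t0, y_t0, reps_t1, reps_t0, alpha=1.0):
    """Weighted sum of the two factual squared errors and the IPM term."""
    err1 = sum((p - y) ** 2 for p, y in zip(pred_t1, y_t1)) / len(y_t1)
    err0 = sum((p - y) ** 2 for p, y in zip(pred_t0, y_t0)) / len(y_t0)
    return err1 + err0 + alpha * linear_mmd(reps_t1, reps_t0)
```

With identical group representations the IPM term vanishes, which is exactly the "balanced" situation the minimization aims for.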
Although balancing the confounders helps to reduce selection bias, the above-mentioned methods cannot discriminate between confounders and non-confounders (i.e., they treat all covariates in the baseline as confounders) and therefore may balance the non-confounders, which do not affect the treatment assignment t_i. Shi et al. pointed out that this may lead to a decrease in prediction precision and proposed the so-called Dragonnet to prevent balancing non-confounders [10]. Besides the three FCNs used in CFR, Dragonnet introduces an FCN to predict the treatment t̂_i and further incorporates a cross-entropy distance between t̂_i and t_i in the loss function, whose minimization helps to reduce the influence of the non-confounders in the representation space and therefore helps to prevent them from being balanced [10]. However, unlike CFR and CSA, the Dragonnet method has the demerits of not balancing the confounders and not being applicable to censored survival data.
Besides representation learning methods, there are also other machine learning-based methods for estimating treatment effects on survival time, such as the random survival forest (RSF), COX regression, and accelerated failure time (AFT) models [11-13], etc. In spite of their effectiveness, a common limitation of these kinds of methods is that they do not balance the confounders, which could lead to selection bias [4].
Actually, besides the above-mentioned methods, Randomized Controlled Trials (RCTs) [14] are always the gold standard for treatment effect estimation, in which patients meeting some particular inclusion criteria are selected and randomly assigned to a treatment group and a control group; then, the treatment effect is evaluated by comparing the difference between the trial results of the two groups. RCTs do not face the problem of selection bias because the random allocation of the treatment can guarantee that the baseline distributions in the treatment group and control group are identical. Similarly, there are also Retrospective Cohort Studies (RCSs) for treatment effect estimation [15], in which the data of the treatment and control groups are selected from historical data based on inclusion criteria, and RCSs likewise require identical baseline distributions in the two groups to avoid selection bias. However, RCTs may not be feasible in many cases, e.g., forcing non-smokers to smoke in an RCT on smoking is against ethics [16]. Even in the cases where RCTs are feasible, the strict inclusion criteria limit the generalizability of RCTs to patients who cannot be represented by the included ones [17,18], which is also the case for RCSs. These demerits limit the application of RCTs and RCSs.
In spite of their limitations, the conclusions obtained from accomplished RCTs or RCSs can offer valuable qualitative prior knowledge to the counterfactual learning of the ITE: although ∆Y_i(x_i) is counterfactual and unavailable for each patient i, qualitative information about it can be obtained from the results of RCTs or RCSs, where, for patients with a significant treatment effect, there is ∆Y_i > 0 with a high probability, and for patients without a significant treatment effect, there is ∆Y_i = 0 (please refer to Section 4.1 for details). However, although there exist effective methods for treatment effect estimation which introduce prior knowledge, such as [19,20], which incorporate prior knowledge on the relationship between the baseline x_i and the treatment t_i, to the best of our knowledge, there is still no method for treatment effect estimation which can take advantage of prior knowledge on the ITE obtained from RCTs or RCSs.
To sum up, there are four problems which need to be considered in the estimation of the ITE on survival time, i.e., (i) how to balance the confounders to reduce selection bias; (ii) how to handle the censored survival data; (iii) how to avoid balancing the non-confounders, which may lead to a decrease in prediction precision; and (iv) how to take advantage of prior knowledge on the ITE obtained from RCTs or RCSs.
Considering the situation that the existing methods have proposed solutions to problems (i)-(iii) separately, that none of them take all three problems into consideration in a single method, and that there has been no solution to problem (iv), in this paper, we first propose a new model called CSA-Dragonnet based on CSA and Dragonnet to combine CSA's solutions to problems (i) and (ii) and Dragonnet's solution to problem (iii), and then propose a CSA-Dragonnet with Embedded Prior Knowledge (CDNEPK) to further incorporate the prior knowledge obtained from RCTs or RCSs.
The more important contributions of this paper come from the second part, i.e., the proposing of CDNEPK, which include: (i) finding a way to express different kinds of prior knowledge extracted from RCTs or RCSs in a unified form; (ii) embedding the prior knowledge into CSA-Dragonnet by inserting counterfactual prediction nets, whose outputs are denoted by ŷ_i^{T=1−t_i}, and defining loss items based on the bounds for the ITE extracted from the prior knowledge.
Dragonnet [10] consists of four FCNs, i.e., φ, h^{T=1}, h^{T=0}, and ψ. Similar to the CSA model, h^{T=1} and h^{T=0} are used to predict the outcomes for patients from D^{T=1} and D^{T=0}, respectively. Unlike the CSA model, a new FCN ψ is introduced in Dragonnet to predict the treatment t̂_i by t̂_i = ψ(φ(x_i)), and the loss to be minimized is defined by:

L_Dragon = ∑_{j=0,1} ∑_{i∈D^{T=j}} (y_i^{t_i=j} − ŷ_i^{t_i=j})² + β ∑_{i∈D_all} CE(t̂_i, t_i), (2)

where the first two items are estimation errors of the outcomes, and ∑_{i∈D_all} CE(t̂_i, t_i) is the average cross-entropy distance between t̂_i and t_i over all patients, which reflects the impact of the non-confounders on t̂_i. So, its minimization helps to reduce the influence of the non-confounders in the representation space and prevents them from being balanced. From (1) and (2) it can be seen that, of the four problems to be considered in the estimation of the ITE on survival time mentioned in Section 1, the CSA method gives solutions to problems (i) and (ii), i.e., how to balance the confounders to reduce selection bias and how to handle the censored survival data, and the Dragonnet method gives a solution to problem (iii), i.e., how to avoid balancing the non-confounders, which may lead to a decrease in prediction precision.
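The cross-entropy term just described can be sketched as follows (an illustrative sketch rather than the authors' code; the weight beta is hypothetical):

```python
import math

# Illustrative sketch: a Dragonnet-style loss adds a cross-entropy term
# between the predicted treatment probability t_hat and the actual
# treatment t to the factual outcome errors.

def cross_entropy(t_hat, t):
    """Binary cross-entropy for one patient; t_hat is P(T=1 | phi(x))."""
    eps = 1e-12  # numerical guard against log(0)
    return -(t * math.log(t_hat + eps) + (1 - t) * math.log(1 - t_hat + eps))

def dragonnet_loss(preds, ys, t_hats, ts, beta=1.0):
    """Outcome squared error plus beta-weighted average treatment cross-entropy."""
    outcome = sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)
    ce = sum(cross_entropy(th, t) for th, t in zip(t_hats, ts)) / len(ts)
    return outcome + beta * ce
```

Driving the cross-entropy term down forces the representation to retain the covariates that actually predict treatment assignment, which is the mechanism that discourages balancing non-confounders.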

CSA-Dragonnet
As summarized in Section 1, for the ITE estimation, the CFR model [3] was proposed to reduce selection bias by balancing the confounders in the representation space, and the CSA method [4] was proposed by extending CFR to handle survival data which could be censored. Since CFR does not discriminate between the confounders and the non-confounders, it also balances the non-confounders, which leads to a decrease in prediction precision. Hence, Dragonnet was proposed in [10] to reduce the influence of the non-confounders in the representation space and to prevent them from being balanced. However, Dragonnet still suffers from the problems of selection bias and censoring [10]. So, in this section, we propose a CSA-Dragonnet based on the CSA model [4] and Dragonnet [10] to combine their advantages. Figure 1 shows the architecture of the proposed CSA-Dragonnet. The CSA-Dragonnet consists of three parts, i.e., (i) as in the CSA [4] and Dragonnet [10] models, the baseline x_i of all patients from D_all is mapped onto a latent representation by an FCN φ(x_i); (ii) as in Dragonnet [10], in order to reduce the influence of the non-confounders in the representation space, a single-layered FCN ψ with φ(x_i) as the input is used to predict the probability of the treatment, i.e., t̂_i = ψ(φ(x_i)); (iii) as in the CSA model [4], in order to predict ŷ_i^{t_i=1} and ŷ_i^{t_i=0}, the representations φ(x_i) of all the patients are divided into two parts, i.e., those for patients from D^{T=1} and those for patients from D^{T=0}, which are further fed into two groups of networks on the top and bottom branches of Figure 1, respectively. In the two branches, g^{T=*}, h^{T=*}, and u^{T=*} (with * = 1 or 0) are all FCNs, ε_1 and ε_0 are specially designed random inputs [4], and ⊕ denotes the concatenating operation rather than summation, i.e., the input of h^{T=*} is a vector composed of g^{T=*}(φ(x_i)) and u^{T=*}(ε_*). The random inputs ε_1 and ε_0 are utilized to introduce some randomness into the time generation process [4]. Please refer to reference [4] for the details of the non-parametric survival model.

In summary, the whole CSA-Dragonnet is a three-head neural network, in which the inputs include the baseline and two random sources, and the outputs include the predicted probability of the treatment t̂_i as well as the predicted survival times ŷ_i^{t_i=1} and ŷ_i^{t_i=0}. Eight FCNs are involved in the network, among which φ and ψ are shared networks for all patients, while u^{T=1}, g^{T=1}, h^{T=1} on the top branch are only applicable to patients with t_i = 1 and u^{T=0}, g^{T=0}, h^{T=0} on the bottom branch are only applicable to patients with t_i = 0. FCNs φ, g^{T=1}, g^{T=0} are defined by Leaky Rectified Linear Unit (ReLU) activation functions; FCNs u^{T=1}, u^{T=0} use Hyperbolic Tangent (tanh) activation functions; FCNs h^{T=1}, h^{T=0} are defined by exponential activation functions; and ψ is defined by the softmax activation function. The loss function to train the eight FCNs in CSA-Dragonnet can be defined by combining (1) for CSA and (2) for Dragonnet as follows:

L_CSA−Dragon = ∑_{j=0,1} ∑_{i∈D^{T=j}} ℓ(γ_i, ŷ_i^{t_i=j}, δ_i) + α · IPM(p_φ^{T=1}, p_φ^{T=0}) + β ∑_{i∈D_all} CE(t̂_i, t_i), (3)

where ℓ(γ_i, ŷ_i^{t_i=j}, δ_i) denotes the estimation error item for patient i as defined in (1), the last item comes from (2) (i.e., the loss function of Dragonnet), and the other items come from (1) (i.e., the loss function of the CSA method). Please refer to references [21] and [22] for detailed definitions of the IPM distance and cross-entropy distance, respectively.
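The three-head structure and the activation choices above can be sketched as follows (a toy forward pass with fixed, hypothetical weights in place of learned FCNs; for the binary treatment, the softmax head is written as its two-class special case, a sigmoid):

```python
import math

# Illustrative forward pass (a sketch, not the authors' code) of the
# three-head structure: shared phi, treatment head psi, and per-treatment
# branches g -> concatenate with u(eps) -> h, with the stated activations
# (Leaky ReLU for phi/g, tanh for u, exponential output for h).

def leaky_relu(v, slope=0.01):
    return [x if x > 0 else slope * x for x in v]

def forward(x, eps):
    """One toy patient: returns (t_hat, y_hat_T1, y_hat_T0)."""
    phi = leaky_relu([xi * 0.5 for xi in x])           # shared representation
    t_hat = 1.0 / (1.0 + math.exp(-sum(phi)))          # psi head (sigmoid)
    y_hats = []
    for branch in (1, 0):                              # top / bottom branch
        g = leaky_relu([p * (0.3 + 0.1 * branch) for p in phi])
        u = [math.tanh(e) for e in eps]                # random-source net
        concat = g + u                                 # concatenation, not sum
        y_hats.append(math.exp(sum(concat) * 0.1))     # exponential output > 0
    return t_hat, y_hats[0], y_hats[1]
```

The exponential output head guarantees strictly positive survival-time predictions, which is one natural reading of the activation choice reported in the text.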
As explained in Section 1, (i) IPM(p_φ^{T=1}, p_φ^{T=0}) measures the difference between the impacts of the confounders in the representation space, so minimizing it helps to balance the confounders and reduces selection bias [3,4]; (ii) ∑_{i∈D_all} CE(t̂_i, t_i) measures the impact of the non-confounders on t̂_i, so its minimization helps to reduce the influence of the non-confounders in the representation space and prevents them from being balanced [10]; (iii) as for the first part, since there is δ_i = 0 and γ_i^{t_i=j} is set as the observed censoring time y_cen,i for patients whose survival times are censored, minimizing the summation encourages the predicted survival time to exceed the observed censoring time [4]. Hence, CSA-Dragonnet with the loss L_CSA−Dragon can balance the confounders, handle the censored survival data, and avoid balancing the non-confounders simultaneously.

Remark 1.
CSA-Dragonnet is a combination of CSA and Dragonnet. (i) It will reduce to the CSA model [4] if the middle branch for t̂_i in Figure 1 and the cross-entropy distance ∑_{i∈D_all} CE(t̂_i, t_i) in (3) are removed; (ii) CSA-Dragonnet will reduce to Dragonnet if φ is directly connected to h^{T=1} and h^{T=0} in the top and bottom branches (i.e., u^{T=1}, g^{T=1}, u^{T=0}, g^{T=0}, ε_1, and ε_0 are removed), the IPM distance between the distributions of φ(x_i^{t_i=1}) and φ(x_i^{t_i=0}) is removed, and the first part of L_CSA−Dragon is replaced with ∑_{j=0,1} ∑_{i∈D^{T=j}} (y_i^{t=j} − ŷ_i^{t=j})², which cannot handle censored data.

A Unified Expression of the Prior Knowledge Yielded by RCTs and RCSs
As mentioned in Section 1, the results of RCTs or RCSs can offer valuable information about the ITE to counterfactual learning. In the following, two examples are given to illustrate it in detail.

Example 1. McNamara et al. investigated whether advanced Hepatocellular Carcinoma (HCC) patients with liver function in good condition could benefit from "sorafenib" [23] through a systematic review and meta-analysis of 30 related studies based on RCTs or RCSs, which comprised 8678 patients altogether. The conclusion was that patients with Child-Pugh grade A could benefit from "sorafenib" significantly, while the effect of "sorafenib" on patients with Child-Pugh grade B is still controversial [23].
The Child-Pugh (CP) grades mentioned in the conclusion are widely used to describe the liver functional status of a patient [23]. A grade is determined by a CP score defined as the summation of the scores of five covariates in the baseline listed in Table 1, i.e., the covariate scores of hepatic encephalopathy (HE), ascites (AC), total bilirubin (TBIL), albumin (ALB), and prothrombin time (PT), which are further assigned according to the conditions given in Table 1 [24,25]. More concretely, each row of Table 1 gives the rules for assigning a score to the corresponding covariate listed in the first column. In addition, a CP score of five or six is banded into CP grade A, and a CP score of seven, eight, or nine is banded into CP grade B [24,25].

Example 2. A study reported in [26] investigated whether a hepatectomy could prolong the survival time of patients with early-stage HCC. A total of 143 patients with HCC were involved in the trial, all of whom satisfied the inclusion criterion of "with single tumor lower than 2 cm, no distant metastasis (DM), no vascular invasion (VI), and no ascites (AC)". Comparisons between the results of the hepatectomy and control groups showed that the hepatectomy could not significantly extend survival time for patients satisfying the inclusion criterion.
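The CP scoring and banding just described can be sketched as follows (a hedged sketch: the interval cut-offs below follow the standard Child-Pugh criteria, while Table 1 remains authoritative for the exact values used in the paper):

```python
# Sketch of Child-Pugh scoring. (lower, upper, score) triples per
# covariate; bounds are inclusive, and the first matching interval wins.
CP_RULES = {
    "HE":   [(0, 0, 1), (1, 2, 2), (3, 4, 3)],        # encephalopathy grade
    "AC":   [(0, 0, 1), (1, 1, 2), (2, 2, 3)],        # ascites: none/mild/severe
    "TBIL": [(0, 34, 1), (34, 51, 2), (51, float("inf"), 3)],   # umol/L
    "ALB":  [(35, float("inf"), 1), (28, 35, 2), (0, 28, 3)],   # g/L
    "PT":   [(0, 4, 1), (4, 6, 2), (6, float("inf"), 3)],       # prolongation, s
}

def cp_score(patient):
    """Sum the per-covariate scores; `patient` maps covariate name to value."""
    total = 0
    for cov, rules in CP_RULES.items():
        for lo, hi, score in rules:
            if lo <= patient[cov] <= hi:
                total += score
                break
    return total

def cp_grade(score):
    """Band the CP score into a grade as in the text (5-6 -> A, 7-9 -> B)."""
    if 5 <= score <= 6:
        return "A"
    if 7 <= score <= 9:
        return "B"
    return "C"
```

A patient with the best value for every covariate scores 1 × 5 = 5 points and is banded into grade A.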
Let Y_i^{T=1(0)} denote the potential survival time of patient i receiving "sorafenib" (not receiving "sorafenib"); the conclusions in Example 1 actually tell us that if a patient i belongs to CP grade A, then there is ∆Y_i > 0 with high certainty, even if the patient was not involved in the meta-analysis conducted by [23]. This prior knowledge offers valuable information on the counterfactual ITE for the patients involved in representation learning. Similarly, let Y_i^{T=1(0)} represent the potential survival time of patient i receiving a hepatectomy (not receiving a hepatectomy); we know from Example 2 that if patient i meets the condition of "with single tumor lower than 2 cm, no distant metastasis (DM), no vascular invasion (VI), and no ascites", there is ∆Y_i = 0 with high probability, which is also important prior knowledge for representation learning.
It can be seen that Examples 1 and 2 describe the conditions of the knowledge in different ways. In Example 2, the original covariates in the baseline are directly evaluated in the inclusion criterion, which is common in RCTs or RCSs, while in Example 1, the CP score derived from the original covariates in the baseline is evaluated in the condition. The latter is also a representative way to express the conditions of the patients, because besides the CP score adopted in Example 1, there are many other kinds of scores measuring the initial conditions of the patients related to various diseases, which may influence the further treatment effects, such as the influence of the lung allocation score [27] on lung transplantation [28] and the influence of the renal score [29] on renal cryoablation [30], etc.
In the following, we define sets to denote the groups of patients satisfying the two typical kinds of conditions mentioned above in a unified way. For Example 1, the group of patients with CP grade A can be written as:

Ω = { x_i | 5 ≤ ∑_j s_j(x_{i,j}) ≤ 6, with s_j(x_{i,j}) = l if λ_{j,l} ≤ x_{i,j} ≤ µ_{j,l} },

where the coefficients λ_{j,l} and µ_{j,l} for j ∈ {HE, AC, TBIL, ALB, PT} and l = 1, 2, and 3 can be assigned according to Table 1. For example, it is direct that there are λ_{HE,2} = 1, µ_{HE,2} = 2; λ_{AC,2} = µ_{AC,2} = 1; λ_{PT,1} = 0, µ_{PT,1} = 4; and λ_{TBIL,3} = 51, µ_{TBIL,3} = +∞. As for Example 2, the group of patients satisfying the inclusion criterion of "with single tumor lower than 2 cm, no distant metastasis (DM), no vascular invasion (VI), and no ascites (AC)" can be directly written as:

Γ = { x_i | x_{i,diameter} < 2, x_{i,number} = 1, x_{i,DM} = 0, x_{i,VI} = 0, x_{i,AC} = 0 },

where x_{i,j} denotes the jth element of x_i for j ∈ {diameter, number, DM, VI, AC}. Both kinds of conditions are thus expressed as interval constraints on the original or derived covariates. When the prior knowledge involves several such condition sets Ω_ρ (or Γ_ρ) for ρ = 1, . . . , R, we let Ω = ∪_ρ Ω_ρ (or Γ = ∪_ρ Γ_ρ), where ∪ represents the union. The prior knowledge can be finally written as: ∆Y_i > 0 for x_i ∈ Ω, and ∆Y_i = 0 for x_i ∈ Γ.
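The unified interval-condition form can be sketched as follows (a minimal sketch; the GAMMA encoding of Example 2's criterion is our hypothetical illustration, with an inclusive upper bound approximating the strict inequality on tumor diameter):

```python
# Each condition set is a conjunction of interval constraints
# lam <= x_j <= mu on (possibly derived) covariates; the overall group
# is a union of such condition sets.

def satisfies(patient, condition):
    """condition: dict covariate -> (lam, mu) interval, both inclusive."""
    return all(lam <= patient[cov] <= mu
               for cov, (lam, mu) in condition.items())

def in_union(patient, conditions):
    """Membership in the union of several condition sets (Omega or Gamma)."""
    return any(satisfies(patient, c) for c in conditions)

# Hypothetical encoding of Example 2's inclusion criterion (Gamma):
GAMMA = [{"diameter": (0, 2), "number": (1, 1),
          "DM": (0, 0), "VI": (0, 0), "AC": (0, 0)}]
```

A patient record is tested against every condition set in turn, so adding another study's criterion is just appending another dictionary to the list.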

Importance of Counterfactual Prediction
As shown in Figure 1, during the training process, only the factual output ŷ_i^{t_i} is predicted for patient i. However, the prior knowledge ∆Y_i > 0 for x_i ∈ Ω and ∆Y_i = 0 for x_i ∈ Γ obviously may offer additional information on the counterfactual potential output, which can be further used as some kind of bound for the counterfactual prediction ŷ_i^{T=1−t_i}. So, in order to take full advantage of the prior knowledge given by RCTs or RCSs, in the following, we will first enhance CSA-Dragonnet by incorporating counterfactual prediction branches which can output ŷ_i^{T=1−t_i}, and further introduce new items into the loss function to guide the training of the counterfactual prediction outputs. We refer to the enhanced method as CSA-Dragonnet with Embedded Prior Knowledge (CDNEPK).

Architecture of CDNEPK with Incorporated Counterfactual Prediction Branches
To support the counterfactual prediction, two new branches to predict ŷ_i^{T=1−t_i} for t_i = 1 and 0, i.e., ŷ_i^{T=0} and ŷ_i^{T=1}, can be added to CSA-Dragonnet, as shown in Figure 2. Figure 3 gives a more concise diagram for the CDNEPK, which is equivalent to Figure 2. In Figure 3, the top branch is for factual prediction, which actually combines the calculations in both the top and bottom branches of Figure 1 (or the top and 4th branches of Figure 2) into one branch. Similarly, the bottom branch of Figure 3 is for counterfactual prediction, which combines the 2nd and 5th branches of Figure 2. For convenience, for a patient i who has received treatment t_i, we call the top branch of Figure 3 (which consists of g^{T=t_i}, h^{T=t_i}, and u^{T=t_i}) the factual prediction branch, and call the bottom branch of Figure 3 (which consists of g^{T=1−t_i}, h^{T=1−t_i}, and u^{T=1−t_i}) the counterfactual prediction branch hereafter.

Figure 2. Introducing counterfactual prediction branches into CSA-Dragonnet.
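The factual/counterfactual routing of the two branches can be sketched as follows (a hypothetical helper, not the authors' code):

```python
# For a patient who actually received treatment t_i, the nets indexed by
# t_i form the factual prediction branch and those indexed by 1 - t_i
# form the counterfactual prediction branch.

def predict_both(x, t_i, branch_fns):
    """branch_fns maps treatment index (1 or 0) to that branch's
    prediction function (standing in for the g/h/u net stack)."""
    return {"factual": branch_fns[t_i](x),             # y_hat^{T = t_i}
            "counterfactual": branch_fns[1 - t_i](x)}  # y_hat^{T = 1 - t_i}
```

The same two net stacks serve every patient; only the roles "factual" and "counterfactual" swap depending on the treatment actually received.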

Loss Items of CDNEPK with Incorporated Prior Knowledge
As explained in Section 4.2, the prior knowledge ∆Y_i > 0 for x_i ∈ Ω and ∆Y_i = 0 for x_i ∈ Γ offers valuable information for the training of the bottom counterfactual prediction branch. In this section, we will discuss how to incorporate this information into the loss function according to the different situations of t_i = 1 or 0 (i.e., whether patient i has actually received the treatment or not), δ_i = 0 or 1 (i.e., whether the patient's survival time is censored or not), and ∆Y_i > 0 or ∆Y_i = 0 (i.e., whether patient i could greatly benefit from the treatment T = 1 relative to T = 0 or not according to prior knowledge).

1. Patients with prior knowledge ∆Y_i > 0 (i.e., x_i ∈ Ω).

(i) t_i = 0, δ_i = 1, and ∆Y_i > 0. In this case, since the survival time is not censored, we know that of the two potential outputs Y_i^{T=1} and Y_i^{T=0} in the prior knowledge Y_i^{T=1} − Y_i^{T=0} > 0, Y_i^{T=0} has the ground truth observation, i.e., there is Y_i^{T=0} = γ_i. Hence, γ_i can be used as a lower bound of the counterfactual prediction ŷ_i^{T=1−t_i} = ŷ_i^{T=1}; in other words, there should be a constraint ŷ_i^{T=1} > γ_i. Letting N_{x_i∈Ω} denote the number of patients who belong to Ω, we can define the following loss item:

(1/N_{x_i∈Ω}) ∑_{x_i∈Ω∩D^{T=0}} δ_i max(0, γ_i − ŷ_i^{T=1}), (9)

whose minimization will penalize γ_i − ŷ_i^{T=1} > 0 and favors the satisfaction of the constraint ŷ_i^{T=1} > γ_i.

(ii) t_i = 0, δ_i = 0, and ∆Y_i > 0. In this case, the survival time is censored, which means that at the end of the trial the patient was still alive. So, the observed time γ_i is a lower bound of Y_i^{T=0} and hence of Y_i^{T=1}, and we can therefore define the loss item by only replacing the δ_i in (9) with (1 − δ_i), considering δ_i = 0, i.e.,

(1/N_{x_i∈Ω}) ∑_{x_i∈Ω∩D^{T=0}} (1 − δ_i) max(0, γ_i − ŷ_i^{T=1}). (10)

It is worth mentioning that although γ_i^{t_i=0} is used as the lower bound in both case (i) and case (ii), it is more conservative in this case than in case (i).

(iii) t_i = 1, δ_i = 1, and ∆Y_i > 0. In this case, Y_i^{T=1} has the ground truth γ_i, and the prior knowledge Y_i^{T=0} < Y_i^{T=1} = γ_i gives γ_i as an upper bound of the counterfactual prediction ŷ_i^{T=0}. Then, the loss item can be defined by penalizing ŷ_i^{T=0} − γ_i > 0 as follows:

(1/N_{x_i∈Ω}) ∑_{x_i∈Ω∩D^{T=1}} δ_i max(0, ŷ_i^{T=0} − γ_i). (11)

(iv) t_i = 1, δ_i = 0, and ∆Y_i > 0. In this case, Y_i^{T=1} cannot be observed because the survival time is censored, and therefore Y_i^{T=1} only has a lower bound γ_i, which implies no bound on the counterfactual Y_i^{T=0}. So, in this case, the prior knowledge does not offer additional information for the counterfactual model training.

2. Patients with prior knowledge ∆Y_i = 0 (i.e., x_i ∈ Γ).

(i) δ_i = 1 and ∆Y_i = 0. In this case, the survival time can be observed, so similar to (1).(i) and (1).(iii), for t_i = 1 or 0, Y_i^{T=t_i} has the ground truth γ_i. Since ∆Y_i = 0 implies Y_i^{T=1−t_i} = Y_i^{T=t_i}, γ_i can serve as the label of the counterfactual predicted survival time ŷ_i^{T=1−t_i}, and the loss item can be defined as follows:

(1/N_{x_i∈Γ}) ∑_{j=0,1} ∑_{x_i∈D^{T=j}∩Γ} δ_i (γ_i − ŷ_i^{T=1−j})², (12)

where N_{x_i∈Γ} denotes the number of patients who belong to Γ and D^{T=j}∩Γ denotes the intersection of D^{T=j} (j = 0 or 1) and Γ.

(ii) δ_i = 0 and ∆Y_i = 0. In this case, the survival time cannot be observed, so similar to (1).(ii), γ_i serves as a lower bound of Y_i^{T=t_i} and hence of Y_i^{T=1−t_i}, and we can define the loss item as:

(1/N_{x_i∈Γ}) ∑_{j=0,1} ∑_{x_i∈D^{T=j}∩Γ} (1 − δ_i) max(0, γ_i − ŷ_i^{T=1−j}), (13)

whose minimization will penalize γ_i − ŷ_i^{T=1−t_i} > 0.
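The case analysis above can be sketched as follows; the hinge and squared-error penalty forms are our illustrative stand-ins for the per-patient contributions of Eqs. (9)-(13):

```python
# Hedged sketch of one patient's prior-knowledge loss contribution.

def hinge(x):
    """Penalize only positive violations of a one-sided bound."""
    return max(0.0, x)

def prior_loss_item(gamma, y_cf_hat, t, delta, knowledge):
    """gamma: observed time; y_cf_hat: counterfactual prediction
    y_hat^{T = 1 - t}; knowledge: 'positive' (Delta > 0, x in Omega)
    or 'zero' (Delta = 0, x in Gamma)."""
    if knowledge == "positive":
        if t == 0:                      # cases (i)/(ii): gamma lower-bounds y^{T=1}
            return hinge(gamma - y_cf_hat)
        if t == 1 and delta == 1:       # case (iii): gamma upper-bounds y^{T=0}
            return hinge(y_cf_hat - gamma)
        return 0.0                      # case (iv): no extra information
    if knowledge == "zero":
        if delta == 1:                  # gamma is a direct label for y^{T=1-t}
            return (gamma - y_cf_hat) ** 2
        return hinge(gamma - y_cf_hat)  # censored: gamma is a lower bound
    return 0.0                          # no prior knowledge for this patient
```

Averaging these contributions over the members of Omega and Gamma reproduces the structure of the loss items described in the text.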

Training Algorithm for CDNEPK
The final loss item L_CP for the counterfactual prediction ŷ_i^{T=1−t_i} can be defined as the summation of the loss items (9)-(13) (Equation (14)), and the loss function for CDNEPK is finally defined as:

min_{φ, ψ, g^{T=1}, h^{T=1}, u^{T=1}, g^{T=0}, h^{T=0}, u^{T=0}} L_CDNEPK = min (L_CSA−Dragon + L_CP), (15)

where L_CSA−Dragon has been defined in (3). By now, all four problems mentioned in Section 1, i.e., (i) balancing the confounders, (ii) handling the censored data, (iii) avoiding balancing the non-confounders, and (iv) taking advantage of prior knowledge, have been properly considered in CDNEPK. The training algorithm of CDNEPK is summarized as follows.
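The composition of the overall objective (15) can be sketched as follows (the finite-difference descent loop is a toy stand-in for the Adam optimizer actually used, purely to illustrate the training flow over a joint loss):

```python
# Sketch: total loss = CSA-Dragonnet loss plus prior-knowledge loss L_CP,
# minimized jointly over all parameters.

def total_loss(csa_dragon_loss, prior_items):
    """L_CDNEPK = L_CSA-Dragon + L_CP, with L_CP the sum of Eqs. (9)-(13)."""
    return csa_dragon_loss + sum(prior_items)

def train(loss_fn, params, steps=50, lr=0.1):
    """Toy finite-difference gradient descent (illustrative only)."""
    eps = 1e-5
    for _ in range(steps):
        base = loss_fn(params)
        grads = []
        for k in range(len(params)):
            shifted = params[:k] + [params[k] + eps] + params[k + 1:]
            grads.append((loss_fn(shifted) - base) / eps)
        params = [p - lr * g for p, g in zip(params, grads)]
    return params
```

The point of the sketch is only that a single scalar objective drives all parameters, exactly as (15) prescribes for the eight FCNs.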

Remark 2.
It is worth noting that Algorithm 1 can also be used for the training procedure of the CSA-Dragonnet proposed in Section 3 just by replacing the loss function in line four of Algorithm 1 (i.e., Formula (15)) with the loss function of CSA-Dragonnet (i.e., Formula (3)).

Data Generating and Experiment Setup
Based on an ACTG dataset, which is given by [31] and contains 2139 HIV patients who received either the treatment of "monotherapy with Zidovudine" or the treatment of "Didanosine with combination therapy", [4] proposed the following scheme for generating the semi-synthetic dataset D_all = {(γ_i, x_i, δ_i, t_i), i = 1, 2, . . . , N} [4].
In the scheme (16), the treatment t_i is simulated via a logistic model; the censored time y_cen,i is assumed to follow a lognormal distribution; and the observed time γ_i is the minimum of the survival time y_i^{t_i} and the censored time y_cen,i. δ_i = 1(0) indicates that the survival time is (is not) observed, which is determined by comparing y_cen,i and y_i^{t_i} in the simulation, e.g., if y_cen,i is longer than y_i^{t_i}, δ_i is set as 1 [4]. Λ = {d_1, d_2, λ, µ, κ^{T=1}, κ^{T=0}, η^{T=1}, η^{T=0}, χ^{T=1}, χ^{T=0}, µ_c, σ_c²} contains the parameters of the simulation scheme. The Individual Treatment Effect ∆Y_i can be acquired by comparing the two generated potential outcomes for patient i. It is worth mentioning that (16) generates both potential outcomes, which is impossible in the real world, so the generated data can be used to evaluate the performance of a counterfactual learning method which treats y_i^{T=1−t_i} as counterfactual and unobservable. In this section, to generate semi-synthetic data with simulated prior knowledge, we divided all the patients' baselines covered by the ACTG dataset [31] into four cases, i.e., Θ_1, Θ_2, Θ_3, and Θ_4, and then, by setting the parameters in Λ of (16) properly, generated four different datasets satisfying different conditions, respectively, i.e., D_1 = {x_i | x_i ∈ Θ_1 and ∆Y_i > 0}, D_2 = {x_i | x_i ∈ Θ_2 and ∆Y_i > 0}, D_3 = {x_i | x_i ∈ Θ_3 and ∆Y_i = 0}, and D_4 = {x_i | x_i ∈ Θ_4 and ∆Y_i has a wide distribution}. The final semi-synthetic dataset was obtained by D_all = ∪_{i=1}^{4} D_i. Through properly selecting the parameters in Λ, among the 2139 patients in D_all, there were 417 patients belonging to D_1 or D_2, 668 patients belonging to D_3, and 1054 patients belonging to D_4.
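The structure of the generating scheme can be sketched as follows (all parameter values and the survival-model coefficients are hypothetical; only the structure, logistic treatment assignment, lognormal censoring, observed time as a minimum, and the observation indicator, follows the text):

```python
import math
import random

# Sketch of one semi-synthetic record (gamma_i, x_i, delta_i, t_i).

def generate_patient(x, rng, coef=0.8, mu_c=2.0, sigma_c=0.5):
    score = sum(x) * coef
    p_treat = 1.0 / (1.0 + math.exp(-score))       # logistic treatment model
    t = 1 if rng.random() < p_treat else 0
    y = math.exp(1.0 + 0.5 * t + 0.1 * sum(x)      # hypothetical survival model
                 + 0.2 * rng.gauss(0, 1))
    y_cen = rng.lognormvariate(mu_c, sigma_c)      # lognormal censoring time
    gamma = min(y, y_cen)                          # observed time
    delta = 1 if y_cen > y else 0                  # 1: survival time observed
    return gamma, x, delta, t
```

Because both potential outcomes could be generated by evaluating the survival model at t = 1 and t = 0, the ground-truth ITE is available for evaluation, which is exactly the advantage of semi-synthetic data noted in the text.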
From the viewpoint of evaluating an ITE estimation method based on the dataset D all , although ∆ Y i was generated by the simulation and was known, we treated ∆ Y i as counterfactual (not observable) but assumed the prior knowledge ∆ Y i > 0 for x i ∈ Ω and ∆ Y i = 0 for x i ∈ Γ, where Ω = Θ 1 ∪ Θ 2 and Γ = Θ 3 . We had no prior knowledge for patients not belonging to Ω or Γ, among whom ∆ Y i may have randomly varied from negative to positive.
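One simple way to turn such bounds into training signals is a hinge-style penalty on the predicted ITE. The sketch below uses hypothetical names and is not the exact loss items of CDNEPK, which are defined with their weighting factors in (3):

```python
import numpy as np

def prior_knowledge_penalty(ite_pred, in_omega, in_gamma, margin=0.0):
    """Hinge-style penalties for predicted ITEs that violate prior bounds.

    in_omega: boolean mask of patients known to benefit (Delta_Y > 0)
    in_gamma: boolean mask of patients known not to benefit (Delta_Y = 0)
    Patients in neither set contribute nothing to the penalty.
    """
    ite_pred = np.asarray(ite_pred, dtype=float)
    # penalize predictions at or below `margin` for patients who should benefit
    loss_omega = np.maximum(0.0, margin - ite_pred[in_omega]).sum()
    # penalize any nonzero prediction for patients known to have no benefit
    loss_gamma = np.abs(ite_pred[in_gamma]).sum()
    return loss_omega + loss_gamma
```

Adding such a term to the training objective pulls the counterfactual predictions toward agreement with the RCT/RCS conclusions.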
In the experiment, the dataset was randomly divided into a training set, a validation set, and a test set with a ratio of 70%:15%:15%. As in CSA [4], the FCNs φ, g T=1 , g T=0 , u T=1 , u T=0 used in CDNEPK and CSA-Dragonnet were two-layer MLPs with 100 hidden units, and the FCNs h T=1 , h T=0 used in CDNEPK and CSA-Dragonnet were one-layer MLPs. In addition, all the hidden layers in φ, g T=1 , g T=0 used batch normalization and a dropout probability of p = 0.2. As in Dragonnet [10], the FCN ψ used in CDNEPK and CSA-Dragonnet was a one-layer MLP. The weighting factors α and β in (3) were set to 1000 and 100, respectively, which were selected by cross-validation. The number of iterations c was set to 80 and the batch size to 850. An Adam optimizer was used with the learning rate r = 3 × 10 −3 .

Experimental Results
We compared our proposed CDNEPK and CSA-Dragonnet with the following methods: (i) CSA [4]; (ii) the accelerated failure time (AFT) model with Weibull distributions [12]; (iii) the random survival forest (RSF) model [13]; and (iv) the Cox proportional hazards model [11]. Among them, CSA was introduced in the preliminaries, and its settings for the FCNs were identical to those of CDNEPK and CSA-Dragonnet in the simulation. Unlike CDNEPK, CSA-Dragonnet, and CSA, which apply balanced representation learning, the AFT, RSF, and Cox models take the treatment directly as a covariate, which limits their ability to handle selection bias.
In the experiments, we adopted the PEHE (precision in the estimation of a heterogeneous effect) and the absolute error of the ATE (average treatment effect), which are widely used for assessing the Individual Treatment Effect error [4] and are defined, respectively, as
$$\varepsilon_{\mathrm{PEHE}} = \frac{1}{N}\sum_{i=1}^{N}\left(\hat{\Delta}_{Y_i} - \Delta_{Y_i}\right)^{2}, \qquad \varepsilon_{\mathrm{ATE}} = \left|\frac{1}{N}\sum_{i=1}^{N}\hat{\Delta}_{Y_i} - \frac{1}{N}\sum_{i=1}^{N}\Delta_{Y_i}\right|.$$
It is worth noting that ε PEHE and ε ATE can only be calculated in simulation experiments, where the ground truth ∆ Y i is available; they cannot be calculated for real-world data, where ∆ Y i is counterfactual [4]. Table 2 presents the comparison results among Cox, AFT, RSF, CSA, CSA-Dragonnet, and CDNEPK. It can be seen that the Cox and AFT models performed worst, since they adopt linear models and do not account for selection bias. RSF also suffered from selection bias, but its ability to handle nonlinear survival data led to a lower ε PEHE and ε ATE than Cox and AFT. CSA, the baseline method of this paper, dealt with nonlinearity and selection bias by representation learning and by balancing the confounders, yielding a significant improvement over Cox, AFT, and RSF. Building on CSA, the proposed CSA-Dragonnet took confounder identification into account and further improved ε PEHE and ε ATE . Finally, CDNEPK, which additionally exploits the prior knowledge, was superior to all the other methods.
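Given arrays of true and estimated ITEs, the two metrics are straightforward to compute; a minimal sketch:

```python
import numpy as np

def pehe(ite_true, ite_pred):
    """Mean squared error between estimated and true individual effects."""
    ite_true, ite_pred = np.asarray(ite_true), np.asarray(ite_pred)
    return np.mean((ite_pred - ite_true) ** 2)

def ate_error(ite_true, ite_pred):
    """Absolute difference between the estimated and true average effects."""
    return abs(np.mean(ite_pred) - np.mean(ite_true))
```

Note that a low ε ATE alone can hide large individual-level errors that cancel out on average, which is why both metrics are reported.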

Real-World Experiment on Hepatocellular Carcinoma
As the third most fatal cancer among men worldwide, Hepatocellular Carcinoma (HCC) has a high mortality rate [32]. Although a hepatectomy is the most effective treatment for HCC, the post-hepatectomy mortality of some patients remains high, and how much a hepatectomy can prolong the survival time of HCC patients remains controversial [33]. In this section, we utilized CDNEPK to estimate the Individual Treatment Effect of a hepatectomy for each patient.
The dataset used in this section included the records of 1459 patients, which were retrospectively collected from three hospitals in China. Among the 1459 patients, 784 were treated with a hepatectomy and the other 675 were not. Basic information, laboratory tests, and imaging tests were included in the patients' records. The basic information included gender, age, and ECOG-PS score. The laboratory tests consisted of alpha-fetoprotein (AFP), blood tests (i.e., total bilirubin, alanine transaminase, aspartate aminotransferase, and alkaline phosphatase), and hepatitis tests (i.e., HBsAg, HBsAb, HBeAg, HBeAb, HBcAb, and HCVAb). The imaging tests contained tumor number, diameter, site, distant metastasis, vascular invasion, and ascites. All of the above 21 clinical covariates of the baseline, together with whether a patient had a hepatectomy, were included in our final analysis.
In Example 2 we mentioned that, with high probability, HCC patients with a single small tumor cannot benefit from a hepatectomy. For HCC patients in other cases, there are also RCTs or RCSs that focus on whether they could benefit from a hepatectomy. [34] summarizes the results as follows: (i) patients with a single tumor smaller than 2 cm, no distant metastasis (DM), no vascular invasion (VI), and no ascites (AC) could not benefit significantly from a hepatectomy; (ii) patients with 2-3 tumors smaller than 2 cm, no DM, no VI, and no AC could not significantly extend their survival time through a hepatectomy; (iii) patients with a single tumor between 5-10 cm, no DM, no VI, and no AC could benefit significantly from a hepatectomy.
Similar to Example 2 of Section 4.1, the results of the RCTs and RCSs given in [34] can be expressed as the following prior knowledge. Let Y i T=1(0) represent the potential survival time of patient i receiving (not receiving) a hepatectomy, and divide all the patients' baselines covered by the HCC dataset into four cases, i.e., Θ 1 , Θ 2 , and Θ 3 , corresponding to situations (i)-(iii) above, and Θ 4 , covering all the remaining baselines. According to Formula (7), there is Γ = Θ 1 ∪ Θ 2 and Ω = Θ 3 . In addition, for patients not belonging to Ω or Γ, we have no prior knowledge, among whom ∆ Y i may have a wide distribution.
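The summarized conditions can be encoded as a simple membership rule. The thresholds below mirror items (i)-(iii), while the function and label names are illustrative:

```python
def hcc_case(tumor_count, max_diameter_cm, dm, vi, ac):
    """Assign a baseline to one of the cases summarized from [34].

    dm, vi, ac: booleans for distant metastasis, vascular invasion, ascites.
    Returns 'Theta1'..'Theta3' for the three summarized situations, or
    'none' when no prior knowledge applies.
    """
    if dm or vi or ac:
        return "none"
    if tumor_count == 1 and max_diameter_cm < 2:
        return "Theta1"  # single tumor < 2 cm: no significant benefit
    if 2 <= tumor_count <= 3 and max_diameter_cm < 2:
        return "Theta2"  # 2-3 tumors < 2 cm: no significant benefit
    if tumor_count == 1 and 5 <= max_diameter_cm <= 10:
        return "Theta3"  # single tumor 5-10 cm: significant benefit
    return "none"
```

Patients mapped to Theta1 or Theta2 form Γ (no benefit), those mapped to Theta3 form Ω (benefit), and the rest carry no prior knowledge.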
In the experiment, we obtained a trained CDNEPK by using Algorithm 1 of Section 4.5 based on the data of the 1459 HCC patients, in which the settings of CDNEPK were identical to those in Section 5.
A direct usage of the obtained CDNEPK is to give the predicted ITE ∆̂ Y new (x new ) for a new patient who has not yet received treatment, where ∆̂ Y new (x new ) denotes the output of CDNEPK with x new as the input. This kind of prediction may help a doctor or a patient choose the proper treatment. We did not, however, divide the dataset of the 1459 HCC patients into a training set and a test set in order to report the predicted ITEs and their prediction errors on the test set, because the ITE is counterfactual in real-world data, which means that its ground truth is unavailable for any patient.
In the following, we show another usage of the obtained CDNEPK, i.e., analyzing the importance of each covariate to the ITE. Based on the obtained CDNEPK, we first calculated ∆̂ Y i for all of the 1459 patients based on their baselines, and then built the relationship between the estimated ITE and the baseline by solving the following lasso regression problem:
$$\min_{\epsilon}\ \sum_{i=1}^{N}\left(\hat{\Delta}_{Y_i}-\epsilon^{\top}x_i\right)^{2}+\iota\lVert\epsilon\rVert_{1}, \qquad (22)$$
where ε = [ε 1 , . . . , ε m ] T and ι are the regression coefficient vector and the weighting factor, respectively. Formula (22) can be solved by the method in [20]. According to the idea of factor analysis [35], the absolute values of the m regression coefficients, i.e., |ε 1 |, . . . , |ε m |, reflect the contributions of the m covariates of the baseline x i to the ITE: the greater |ε j | is, the greater the contribution of the jth covariate. Hence, through cross-validation, we selected the four covariates whose regression coefficients had the greatest absolute values as the key covariates most important to the ITE, i.e., tumor diameter, alpha-fetoprotein, aspartate aminotransferase, and distant metastasis.
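A lasso problem of this form can also be solved with plain proximal gradient descent (ISTA), shown below as a self-contained sketch; the function name and step-size choice are illustrative, not the method of [20]:

```python
import numpy as np

def lasso_ista(X, y, iota=0.1, n_iter=2000):
    """Minimize ||y - X e||^2 + iota * ||e||_1 by proximal gradient (ISTA).

    X: (n, m) baseline matrix; y: (n,) estimated ITEs.
    Returns the coefficient vector e; covariate importance is |e_j|.
    """
    n, m = X.shape
    # step size from the Lipschitz constant of the squared-loss gradient
    lr = 1.0 / (2.0 * np.linalg.norm(X, 2) ** 2)
    e = np.zeros(m)
    for _ in range(n_iter):
        grad = 2.0 * X.T @ (X @ e - y)   # gradient of the squared loss
        z = e - lr * grad                # gradient step
        # soft-thresholding: proximal operator of the L1 penalty
        e = np.sign(z) * np.maximum(np.abs(z) - lr * iota, 0.0)
    return e
```

Ranking the covariates by `np.abs(e)` and keeping the top four then reproduces the selection step described above.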
In Figure 4, box plots illustrate the relationships between the ITE and the four key covariates. It is apparent from Figure 4a that the estimated ITE increased with tumor diameter when the diameter was below 8 cm. In contrast, when the diameter was less than 2 cm, the median ∆̂ Y i was below zero, which indicates that patients with tumors smaller than 2 cm may not benefit significantly from a hepatectomy. On the whole, patients with tumors between 5-8 cm could benefit the most from a hepatectomy. Figure 4b indicates that the ITE increased with alpha-fetoprotein, while Figure 4c shows that the ITE decreased with aspartate aminotransferase. It can be inferred that the benefit of a hepatectomy is positively associated with liver function. Figure 4d shows that patients without distant metastasis had higher benefits than those with distant metastasis in terms of both the median and the upper quartile. Thus, patients without distant metastasis have a high probability of benefiting from a hepatectomy.
The above example shows that, with CDNEPK, we can utilize observational historical data and prior knowledge to estimate the individual surgical benefit for HCC patients, and can further analyze the influence of the covariates on the trend of the surgical benefit. The results offer HCC surgeons quantitative information and valuable auxiliary treatment advice, which cannot be obtained by RCTs or RCSs alone.

Conclusions
In this paper, we propose CSA-Dragonnet and CDNEPK to estimate the ITE on survival time from observational data. The key novelty of our methods is that we insert counterfactual prediction nets into CSA-Dragonnet and extract valuable bound information for the counterfactual prediction from the prior knowledge yielded by RCTs and RCSs to guide the training of the counterfactual outputs. Experiments based on semi-synthetic data and real-world data showed that CDNEPK outperformed existing methods and that it can provide auxiliary treatment advice for surgeons.