1. Introduction
Logistic regression is one of the most commonly used techniques for modeling the relationship between a binary dependent variable and one or more independent variables.
In data analysis and machine learning, a transformation refers to a mapping of a variable into a new variable. A transformation can be linear or nonlinear, depending on whether the mapping is linear or nonlinear. Linear transformations can be used to improve the interpretability of coefficients in linear regression and make a fitted model easier to understand [1], whereas nonlinear transformations are often used to improve the fit of the model to the data [2].
Three types of linear transformations are commonly used in machine learning prior to model fitting, namely, min–max normalization, z-score standardization and simple scaling. Since variables measured on different scales may not contribute equally to model fitting, min–max normalization transforms all continuous variables into the same range [0, 1] to avoid a possible bias. Essentially, min–max normalization subtracts the minimum value of a continuous variable from each value and then divides by the range of the variable. z-score standardization rescales a continuous variable to the standard scale, i.e., each value is expressed as the number of standard deviations it lies from the mean. Mathematically, z-score standardization subtracts the mean value of a continuous variable from each value and then divides by the standard deviation of the variable. Simple scaling multiplies a continuous variable by a constant, shrinking variables with large values and expanding variables with small values. All three types of linear transformations are discussed by Adeyemo, Wimmer and Powell [3] for logistic regression.
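As an illustration, the three transformations can be written in a few lines of R. This is only a sketch on a generic numeric vector x (our own example values), not code from the study above:
# x is one continuous variable
x <- c(2, 5, 9, 14, 20)
# min-max normalization: map x into [0, 1]
x_minmax <- (x - min(x)) / (max(x) - min(x))
# z-score standardization: center at the mean, scale by the standard deviation
x_zscore <- (x - mean(x)) / sd(x)
# simple scaling: multiply (or divide) by a constant, e.g., the maximum absolute value
x_scaled <- x / max(abs(x))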
However, the work in the literature on transformations in regression has some limitations. First, most work focuses on improving the fit of the regression model [4,5,6,7,8,9]. Second, the effects of transformations are rarely discussed. Morrell, Pearson, and Brant [10] examined how linear transformations affect a linear mixed-effects model and the tests of significance of fixed effects in the model. They showed how linear transformations modify the random effects, their covariance matrix, and the value of the restricted log-likelihood. Zeng [11] studied invariant properties of some statistical measures under monotonic transformations for univariate logistic regression. Zeng [12] derived analytic properties of some well-known category encodings, such as ordinal encoding, order encoding and one-hot encoding, in multivariate logistic regression by means of linear transformations. Adeyemo, Wimmer and Powell [3] compared the prediction accuracy of the three types of linear transformations, min–max normalization, z-score standardization and simple scaling, in logistic regression by means of simulation.
In this paper, we first generalize a linear transformation of a single variable to multiple variables by means of a matrix multiplication. We then study various effects of a generalized linear transformation in logistic regression. We show that an invertible generalized linear transformation has no effect on predictions, multicollinearity, quasi-complete separation, or complete separation. We also show that multiple linear transformations have no effect on the variance inflation factor (VIF). Numeric examples with randomly generated transformations applied to a real dataset are presented to illustrate our theoretical results.
The remainder of this paper is organized as follows. In Section 2, we give two definitions of a generalized linear transformation and show that they are equivalent. In Section 3, we study the effects of a generalized linear transformation on logistic regression. In Section 4, we present numeric examples to validate our theoretical results. Finally, the paper is concluded in Section 5.
Throughout the paper, we concentrate on transformations of independent variables, which are also sometimes called explanatory variables.
  2. A Generalized Linear Transformation in Logistic Regression
Let $(x_1, x_2, \ldots, x_p)$ be the vector of $p$ independent variables and $y$ be the dependent variable. Let us consider a sample of $n$ independent observations $(y_i, x_{i1}, x_{i2}, \ldots, x_{ip})$, $i = 1, 2, \ldots, n$, where $y_i$ is the value of $y$ and $x_{i1}, x_{i2}, \ldots, x_{ip}$ are the values of the $p$ independent variables $x_1, x_2, \ldots, x_p$ for the $i$-th observation. Without loss of generality, we assume $x_1, x_2, \ldots, x_p$ are all continuous variables since otherwise they can be converted into continuous variables.
Let us adopt the matrix notation:
$$y = (y_1, y_2, \ldots, y_n)', \quad \beta = (\beta_0, \beta_1, \ldots, \beta_p)', \quad X = (x_{ij})_{\,i = 1, \ldots, n;\; j = 0, 1, \ldots, p},$$
where $x_{i0} = 1$ for all $i$ (used for intercept $\beta_0$) and matrix $X$ is called the design matrix. Here, $\beta_0, \beta_1, \ldots, \beta_p$ are called regression coefficients or regression parameters.
Without causing confusion, we also use $x_0, x_1, \ldots, x_p$ to denote the $p+1$ columns or column vectors of $X$, where $x_0$ is the column of 1's. We further use the capital letter $X_i$ to denote the row vector $(1, x_{i1}, x_{i2}, \ldots, x_{ip})$ for $i = 1, 2, \ldots, n$.
Definition 1. A linear transformation is a linear function of a variable which maps or transforms the variable into a new one. Specifically, a linear transformation of variable $x$ can be defined as $x^{t} = ax + b$, where $a$ and $b$ are constants and $a$ is nonzero. For convenience, let us call a linear transformation of a single variable a simple linear transformation. By multiple linear transformations, we mean a set of simple linear transformations. Here, we use the letter $t$ in the superscript to denote the new variable after a transformation.
Note that $a$ and $b$ in Definition 1 are not vectors since $x$ is a variable.
Definition 1 can be generalized naturally by matrix multiplication to transform a set of variables to a new set of variables.
Definition 2. A generalized linear transformation is a linear matrix-vector expression
$$(x_1^{t}, x_2^{t}, \ldots, x_p^{t})' = A\,(x_1, x_2, \ldots, x_p)' + b$$
that transforms or maps independent variables $x_1, x_2, \ldots, x_p$ into new independent variables $x_1^{t}, x_2^{t}, \ldots, x_p^{t}$, where $A = (a_{ij})$ is a $p \times p$ matrix of real numbers and $b = (b_1, b_2, \ldots, b_p)'$ is a vector of real constants. Here, $x_1, x_2, \ldots, x_p$ are variables, not vectors. It should not be confused with a linear transformation between two vector spaces, in which there is no vector $b$. Here and hereafter, we use the prime symbol ′ in the superscript for the transpose of a vector or a matrix. The new variables $x_1^{t}, x_2^{t}, \ldots, x_p^{t}$ in component form are
$$x_i^{t} = a_{i1}x_1 + a_{i2}x_2 + \cdots + a_{ip}x_p + b_i, \quad i = 1, 2, \ldots, p.$$
Consider a simple linear transformation, $x_k^{t} = ax_k + b$, for some $k$ with $1 \le k \le p$ and $a \ne 0$. Without loss of generality, assume $k = 1$. Let $A$ be the $p$-dimensional diagonal matrix with $a_{11} = a$ and $a_{22} = \cdots = a_{pp} = 1$, and let $b = (b, 0, \ldots, 0)'$ be a $p$-dimensional column vector. Then $x_1, x_2, \ldots, x_p$ are transformed into $x_1^{t}, x_2, \ldots, x_p$ according to Definition 2. Similarly, consider a set of simple linear transformations, say, $x_k^{t} = a_kx_k + b_k$ for $k = 1, 2, \ldots, m$ with $a_k \ne 0$ and $m \le p$. Let $A$ be the $p$-dimensional diagonal matrix with $a_{kk} = a_k$ for $k = 1, \ldots, m$ and $a_{kk} = 1$ for $k = m+1, \ldots, p$, and let $b = (b_1, \ldots, b_m, 0, \ldots, 0)'$ be a $p$-dimensional column vector. Then $x_1, x_2, \ldots, x_p$ are transformed into $x_1^{t}, \ldots, x_m^{t}, x_{m+1}, \ldots, x_p$ according to Definition 2. Hence, both a simple linear transformation and multiple linear transformations are special cases of a generalized linear transformation.
However, Definition 2 is not convenient to use since the new design matrix is somewhat complicated. Therefore, we give another definition that incorporates the design matrix.
Definition 3. A generalized linear transformation is a matrix multiplication $XC$ that transforms the design matrix $X = (x_0, x_1, \ldots, x_p)$ into a new design matrix $XC$, where $x_1, \ldots, x_p$ are the 2nd to the last columns of $X$ and $C$ is a $(p+1) \times (p+1)$ matrix of real numbers as follows:
$$C = \begin{pmatrix} 1 & c_{01} & c_{02} & \cdots & c_{0p} \\ 0 & c_{11} & c_{12} & \cdots & c_{1p} \\ \vdots & \vdots & \vdots & & \vdots \\ 0 & c_{p1} & c_{p2} & \cdots & c_{pp} \end{pmatrix}. \qquad (2)$$
Note that we require the first column of $C$ to be 0 except for the first entry (which is 1) in order for $XC$ to be the new design matrix, i.e., for its first column to remain the column of 1's.
For convenience, let us partition $C$ into 4 blocks such that
$$C = \begin{pmatrix} 1 & C_1 \\ 0 & C_{11} \end{pmatrix}, \qquad (3)$$
where $C_1$ is the $p$-dimensional row vector $(c_{01}, c_{02}, \ldots, c_{0p})$, $0$ is the $p$-dimensional column vector of all 0's, and $C_{11}$ is the $p \times p$ submatrix obtained by deleting the first column and the first row of $C$, that is,
$$C_{11} = \begin{pmatrix} c_{11} & c_{12} & \cdots & c_{1p} \\ \vdots & \vdots & & \vdots \\ c_{p1} & c_{p2} & \cdots & c_{pp} \end{pmatrix}.$$
In the following, we prove that the two definitions of a generalized linear transformation are equivalent.
Theorem 1. Definition 2 and Definition 3 are equivalent.
Proof. Let us begin with Definition 2. Its new design matrix has rows $(1, x_{i1}^{t}, \ldots, x_{ip}^{t})$, $i = 1, \ldots, n$, and can be written as
$$X\begin{pmatrix} 1 & b' \\ 0 & A' \end{pmatrix}.$$
Hence, the new design matrix of Definition 2 is in the form of Definition 3 with
$$C = \begin{pmatrix} 1 & b' \\ 0 & A' \end{pmatrix}.$$
Note that the submatrix obtained by deleting the first row and first column of the matrix above is the transpose of $A$, that is, $C_{11} = A'$.
Next, let us begin with Definition 3. The 2nd, 3rd, …, last columns of the matrix $XC$ are from the linear transform
$$x_j^{t} = c_{0j} + c_{1j}x_1 + c_{2j}x_2 + \cdots + c_{pj}x_p, \quad j = 1, 2, \ldots, p,$$
respectively. Hence, Definition 3 is in the form of Definition 2 with
$$A = C_{11}' \quad \text{and} \quad b = C_1'.$$
We have concluded our proof. □
If we expand along the first column to find the determinant of $C$ in (2), we immediately see that the determinant of $C$ is equal to the determinant of $C_{11}$. Therefore, $C$ is nonsingular (or invertible) if and only if $C_{11}$ in (3) is nonsingular. In addition, it follows from the proof of Theorem 1 that $C$ is nonsingular if and only if $A$ in Definition 2 is nonsingular.
Moreover, it is easy to see that if $C$ is nonsingular, then the inverse of $C$ can be written as
$$C^{-1} = \begin{pmatrix} 1 & -C_1C_{11}^{-1} \\ 0 & C_{11}^{-1} \end{pmatrix}.$$
From now on, we will use Definition 3 unless otherwise specified. For convenience, let us call the generalized linear transformation $XC$ invertible if $C$ is invertible.
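As an illustration of the block structure in Definition 3, the following R sketch (the object names C_11, C_1 and C are ours, chosen to match the notation above) assembles C from a randomly generated p × p block C_11 and a p-dimensional row vector C_1 and confirms that det(C) = det(C_11):
p <- 3
set.seed(123)
C_11 <- matrix(runif(p * p), nrow = p)      # the p x p lower-right block
C_1  <- runif(p)                            # the p-dimensional upper-right block (row vector)
# assemble C = [ 1  C_1 ; 0  C_11 ]
C <- rbind(c(1, C_1), cbind(rep(0, p), C_11))
det(C) - det(C_11)                          # 0 up to rounding, so C is invertible iff C_11 is
solve(C)                                    # the inverse exists since C_11 is nonsingular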
  3. Effects of a Generalized Linear Transformation
In logistic regression, the dependent variable $y$ is binary with 2 values, 0 and 1. Let the conditional probability that $y = 1$ given $x_1, x_2, \ldots, x_p$ be denoted by $\pi$. Logistic regression assumes the logit linearity between the log odds and the independent variables:
$$\ln\frac{\pi}{1-\pi} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p. \qquad (10)$$
Equation (10) above can be written as
$$\pi = \frac{1}{1 + \exp\{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p)\}}. \qquad (11)$$
The following log likelihood is used in logistic regression:
$$l(\beta) = \sum_{i=1}^{n}\bigl[\,y_i\ln\pi_i + (1 - y_i)\ln(1 - \pi_i)\,\bigr], \qquad (12)$$
where $\pi_i$ is the value of (11) at the $i$-th observation $X_i$.
The maximum likelihood method is used to estimate the parameters in logistic regression. Specifically, the maximum likelihood estimators (MLE) are the values of the parameters $\beta_0, \beta_1, \ldots, \beta_p$ that maximize (12). The vector $\hat{\beta}$ of the MLE estimators of $\beta$ satisfies [13]
$$\sum_{i=1}^{n}(y_i - \pi_i)\,x_{ij} = 0, \quad j = 0, 1, \ldots, p, \qquad (13)$$
or in matrix-vector form
$$X'(y - \pi) = 0, \qquad (14)$$
where $\pi = (\pi_1, \pi_2, \ldots, \pi_n)'$ and $\pi_i = 1/\bigl(1 + \exp(-X_i\hat{\beta})\bigr)$ for $i = 1, 2, \ldots, n$. Note that after a generalized linear transformation $XC$, (12) and (14) hold with the design matrix $X$ replaced by the new design matrix $XC$.
Equation (13) or (14) represents ($p+1$) nonlinear equations in $\hat{\beta}$ and cannot be solved explicitly in general [14]. Rather, they can be solved numerically by the Newton-Raphson algorithm [15] as follows
$$\beta^{(k+1)} = \beta^{(k)} + (X'WX)^{-1}X'(y - \pi), \qquad (15)$$
where $W$ is the $n \times n$ diagonal matrix with diagonal elements $\pi_i(1 - \pi_i)$, $i = 1, \ldots, n$. In addition, $\pi = (\pi_1, \ldots, \pi_n)'$. Both $W$ and $\pi$ are evaluated at $\beta^{(k)}$ in (15).
If $X'WX$ is nonsingular and the data are not completely separable or quasi-completely separable [16], then the MLE estimator $\hat{\beta}$ exists and is unique.
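For concreteness, a bare-bones R implementation of iteration (15) might look as follows. This is only a sketch under our own naming (in practice one would call glm()); it assumes X is the design matrix with a leading column of 1's and y is the 0/1 response:
newton_logistic <- function(X, y, tol = 1e-10, max_iter = 50) {
  beta <- rep(0, ncol(X))                                  # starting value
  for (k in seq_len(max_iter)) {
    pi_hat <- 1 / (1 + exp(-as.vector(X %*% beta)))        # pi evaluated at the current beta
    W <- diag(pi_hat * (1 - pi_hat))                       # n x n diagonal weight matrix
    step <- solve(t(X) %*% W %*% X, t(X) %*% (y - pi_hat)) # (X'WX)^{-1} X'(y - pi)
    beta <- beta + as.vector(step)
    if (max(abs(step)) < tol) break                        # stop when the update is negligible
  }
  beta
}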
The MLE estimator $\hat{\beta}$ can be used to predict $\pi$ by the linear combination of the variables $x_1, x_2, \ldots, x_p$. In particular, we have $n$ fitted values
$$\hat{\pi}_i = \frac{1}{1 + \exp(-X_i\hat{\beta})}, \quad i = 1, 2, \ldots, n. \qquad (16)$$
  3.1. Effects on MLE Estimator and Predictions
Theorem 2. For logistic regression, if the MLE estimator of $\beta$ is $\hat{\beta}$, then the MLE estimator after a generalized linear transformation $XC$ is $C^{-1}\hat{\beta}$, assuming $C$ is nonsingular. Moreover, the generalized linear transformation does not affect predictions.
Proof. Since $\hat{\beta}$ is the maximum likelihood estimator of $\beta$, (14) is satisfied by $\hat{\beta}$. Multiplying both sides of (14) by $C'$, we obtain
$$C'X'(y - \pi) = 0. \qquad (17)$$
Clearly, (17) can be rewritten as
$$(XC)'(y - \pi) = 0. \qquad (18)$$
Writing $X_i\hat{\beta}$ as $(X_iC)(C^{-1}\hat{\beta})$ for $i = 1, 2, \ldots, n$, we have
$$\pi_i = \frac{1}{1 + \exp\bigl(-(X_iC)(C^{-1}\hat{\beta})\bigr)}, \quad i = 1, 2, \ldots, n. \qquad (19)$$
It follows from (18) and (19) that $C^{-1}\hat{\beta}$ satisfies (14) for the new design matrix $XC$. Hence, the linear combination $C^{-1}\hat{\beta}$ of $\hat{\beta}$ is the new MLE estimator after the generalized linear transformation $XC$.
Let us now predict $\pi$ for a set of values of the variables $x_1^{t}, \ldots, x_p^{t}$ in the new system after the generalized linear transformation $XC$, using the new MLE estimator $C^{-1}\hat{\beta}$. Let $v = (1, v_1, \ldots, v_p)$ be a specific value of $(1, x_1, \ldots, x_p)$. Then, the row vector $v$ in the original system becomes $vC$ in the new system. By (16), the predicted conditional probability of $y = 1$ at $vC$ in the new system is
$$\frac{1}{1 + \exp\bigl(-(vC)(C^{-1}\hat{\beta})\bigr)} = \frac{1}{1 + \exp(-v\hat{\beta})}. \qquad (20)$$
The right-hand side of (20) is the predicted conditional probability of $y = 1$ at $v$ in the original system. □
  3.2. Effects on Multicollinearity
Perfect multicollinearity, or complete multicollinearity, or multicollinearity in short, refers to a situation in logistic regression in which two or more independent variables are linearly related [17]. In particular, if two independent variables are linearly related, then it is called collinearity.
Mathematically, multicollinearity means there exist constants $c_0, c_1, \ldots, c_p$ such that
$$c_0 + c_1x_1 + c_2x_2 + \cdots + c_px_p = 0,$$
where at least two of $c_1, \ldots, c_p$ are nonzero. If we treat the constant column $x_0$ as an independent variable, then we just require at least one of $c_1, \ldots, c_p$ to be nonzero.
Multicollinearity is a common issue in logistic regression. If there is multicollinearity, the design matrix $X$ will not have full column rank $p+1$. Hence, the $(p+1)\times(p+1)$ matrix $X'WX$ in (15) will have rank less than $p+1$. Thus, the inverse matrix $(X'WX)^{-1}$ in (15) does not exist, which makes the iteration in (15) impossible.
If there is near multicollinearity and there is no separation of the data points, theoretically $X'WX$ in (15) has an inverse and the iteration in (15) can proceed. Yet, iteration (15) may not compute the inverse $(X'WX)^{-1}$ accurately and hence may cause unstable estimates and inaccurate variances [18].
Some authors define multicollinearity in logistic regression to be a high correlation between independent variables [19,20,21]. Let us call multicollinearity with high correlation near multicollinearity and reserve multicollinearity for perfect multicollinearity or complete multicollinearity.
Let us define the VIF now. Let $R_j^2$ be the R-squared that results when $x_j$ is linearly regressed against the other $p - 1$ independent variables. Then the VIF for $x_j$ is defined as
$$\mathrm{VIF}_j = \frac{1}{1 - R_j^2}.$$
Near multicollinearity can be detected by using the VIF [2]. The larger the VIF of an independent variable, the larger the correlation between this independent variable and the others. However, there is no standard for acceptable levels of VIF. Multicollinearity can be mitigated by a generalized cross-validation (GCV) criterion in partially linear regression models [22,23].
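The definition translates directly into R: regress one predictor on the others, take the resulting R-squared, and invert one minus it. The helper below is our own sketch (for a fitted model, car::vif reports the same quantity):
vif_manual <- function(data, j) {
  # data: a data.frame containing only the independent variables; j: column index of x_j
  fit <- lm(reformulate(names(data)[-j], response = names(data)[j]), data = data)
  r2  <- summary(fit)$r.squared     # R_j^2 from regressing x_j on the other variables
  1 / (1 - r2)                      # VIF_j = 1 / (1 - R_j^2)
}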
  3.2.1. Preliminary Results in Linear Regression
As the VIF is related to linear regression, let us briefly introduce some preliminary results in linear regression. As for logistic regression, we consider $p$ independent variables $x_1, x_2, \ldots, x_p$. Unlike logistic regression, the dependent variable $y$ in linear regression is a continuous variable. We shall adopt the same notation as in logistic regression unless otherwise specified. In particular, $X$ is the design matrix.
In linear regression, the relationship between $y$ and $x_1, x_2, \ldots, x_p$ is formulated as a linear combination
$$y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \cdots + \beta_px_p + \varepsilon,$$
where $\varepsilon$ is a random error, or in matrix notation
$$y = X\beta + \varepsilon.$$
The ordinary least squares (OLS) estimator $\hat{\beta}$ of $\beta$ satisfies [2]
$$X'X\hat{\beta} = X'y.$$
Assuming the $(p+1)$-dimensional square matrix $X'X$ is nonsingular, the OLS estimator $\hat{\beta}$ is unique and can be written explicitly as
$$\hat{\beta} = (X'X)^{-1}X'y. \qquad (26)$$
The OLS estimator $\hat{\beta}$ can be used to predict $y$ by the linear combination of the variables $x_1, x_2, \ldots, x_p$ as follows
$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1x_1 + \hat{\beta}_2x_2 + \cdots + \hat{\beta}_px_p. \qquad (27)$$
Like Gelman and Hill [1] and Chatterjee and Hadi [2], we will call a predicted value a fitted value if the values of $x_1, x_2, \ldots, x_p$ come from one of the $n$ observations. So, we have $n$ fitted values
$$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1x_{i1} + \hat{\beta}_2x_{i2} + \cdots + \hat{\beta}_px_{ip}, \quad i = 1, 2, \ldots, n.$$
Therefore, the $n$-dimensional column vector $\hat{y} = (\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_n)'$ of the $n$ fitted values can be expressed as
$$\hat{y} = X\hat{\beta} = X(X'X)^{-1}X'y. \qquad (28)$$
It is easy to show that the OLS estimator is $C^{-1}\hat{\beta}$ after an invertible generalized linear transformation $XC$. Moreover, the generalized linear transformation does not affect predictions. Indeed, let us predict $y$ for a set of values of the variables $x_1, x_2, \ldots, x_p$, which could be any set of values, not necessarily from one of the $n$ observations. We first transform the row vector of values $(1, x_1, \ldots, x_p)$ into $(1, x_1, \ldots, x_p)C$, whose entries other than the first are values of the transformed variables $x_1^{t}, \ldots, x_p^{t}$. Next, we apply (27) with the new estimator and obtain $(1, x_1, \ldots, x_p)CC^{-1}\hat{\beta} = (1, x_1, \ldots, x_p)\hat{\beta}$, which is the predicted value of the original model.
In linear regression, the coefficient of determination, denoted by $R^2$ and also called R-squared, is given by Chatterjee and Hadi [2] as
$$R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2} = \frac{\sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}, \qquad (29)$$
where $\bar{y}$ is the mean of the dependent variable $y$, that is, $\bar{y} = \frac{1}{n}\sum_{i=1}^{n}y_i$, and $\hat{y}_i$ is the fitted value
$$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1x_{i1} + \hat{\beta}_2x_{i2} + \cdots + \hat{\beta}_px_{ip}.$$
The coefficient of determination $R^2$ can be related to the square of the correlation between $y$ and $\hat{y}$ as follows [2]
$$R^2 = \bigl[\mathrm{Corr}(y, \hat{y})\bigr]^2,$$
where
$$\mathrm{Corr}(y, \hat{y}) = \frac{\sum_{i=1}^{n}(y_i - \bar{y})(\hat{y}_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2\,\sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2}}.$$
Theorem 3. $R^2$ in linear regression is invariant under invertible generalized linear transformations.
Proof. Expressing $\hat{y}_i$ in the numerator of the 2nd equation in (29) in matrix form and applying (26), we obtain
$$\sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 = y'X(X'X)^{-1}X'y - n\bar{y}^2. \qquad (32)$$
Substituting (32) into (29) yields
$$R^2 = \frac{y'X(X'X)^{-1}X'y - n\bar{y}^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}. \qquad (33)$$
Now let $XC$ be an invertible generalized linear transformation. Then the OLS estimator after the transformation becomes $C^{-1}\hat{\beta}$. In this case, $R^2$ in (33) becomes
$$\frac{y'XC\bigl((XC)'(XC)\bigr)^{-1}(XC)'y - n\bar{y}^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2} = \frac{y'X(X'X)^{-1}X'y - n\bar{y}^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2},$$
which returns to $R^2$ in (29) before the generalized linear transformation. □
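Theorem 3 is easy to check numerically. The sketch below (simulated data and object names are our own) fits a linear regression before and after a random invertible generalized linear transformation of the predictors and compares the two R-squared values:
set.seed(42)
n <- 100; p <- 3
X1 <- matrix(rnorm(n * p), ncol = p)              # original predictors x_1, ..., x_p
y  <- as.vector(1 + X1 %*% c(2, -1, 0.5) + rnorm(n))
A  <- matrix(runif(p * p), p, p)                  # invertible with probability 1
b  <- runif(p)
X2 <- sweep(X1 %*% t(A), 2, b, "+")               # transformed predictors as in Definition 2
r2_before <- summary(lm(y ~ X1))$r.squared
r2_after  <- summary(lm(y ~ X2))$r.squared
all.equal(r2_before, r2_after)                    # TRUE: R^2 is unchanged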
  3.2.2. Effects on Logistic Regression
In Definitions 2 and 3, we defined a generalized linear transformation only for independent variables. Since an independent variable is used as the dependent variable in order to find its VIF, we consider a simple linear transformation for the dependent variable in the following result. 
Lemma 1. Consider a linear regression with $y$ as the dependent variable and $x_1, x_2, \ldots, x_p$ as the independent variables. If we make a simple linear transformation on $y$ such as $y^{t} = ay + b$ and a generalized linear transformation $XC$ on the independent variables with nonsingular $C$, then $C^{-1}(a\hat{\beta} + b\,e_0)$ is the OLS estimator of the new linear regression after the transformations, where $X$ is the design matrix, $e_0 = (1, 0, \ldots, 0)'$ is the $(p+1)$-dimensional column vector whose first entry is 1 and whose other entries are 0, and $\hat{\beta}$ is the OLS estimator of the original linear regression.
Proof. Since the new linear regression has design matrix $XC$ and its dependent variable can be expressed as $ay + b\mathbf{1}$, where $\mathbf{1}$ is the $n$-dimensional column vector of all 1's, it is sufficient to show that $C^{-1}(a\hat{\beta} + b\,e_0)$ satisfies
$$(XC)'(XC)\,C^{-1}(a\hat{\beta} + b\,e_0) = (XC)'(ay + b\mathbf{1}). \qquad (34)$$
Substituting into the left-hand side of (34), noting that $Xe_0$ is the first column of $X$, i.e., the column of 1's, and replacing $X'X\hat{\beta}$ with $X'y$, we obtain
$$(XC)'(XC)\,C^{-1}(a\hat{\beta} + b\,e_0) = C'X'X(a\hat{\beta} + b\,e_0) = C'(aX'y + bX'\mathbf{1}) = (XC)'(ay + b\mathbf{1}),$$
which is the right-hand side of (34). □
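A quick numerical check of Lemma 1 is sketched below (simulated data; e0 denotes the first standard basis vector, as in our statement of the lemma): the OLS coefficients of the transformed regression coincide with C^{-1}(a β̂ + b e0).
set.seed(7)
n <- 50; p <- 2
x <- matrix(rnorm(n * p), ncol = p)
y <- 3 + x %*% c(1, -2) + rnorm(n)
X <- cbind(1, x)                                   # design matrix with intercept column
beta_hat <- solve(t(X) %*% X, t(X) %*% y)          # OLS estimator of the original regression
a <- 2; b <- 5                                     # simple linear transformation of y
C_11 <- matrix(runif(p * p), p, p); C_1 <- runif(p)
C <- rbind(c(1, C_1), cbind(rep(0, p), C_11))      # matrix C of Definition 3
beta_new <- solve(t(X %*% C) %*% (X %*% C), t(X %*% C) %*% (a * y + b))
e0 <- c(1, rep(0, p))
all.equal(as.vector(beta_new), as.vector(solve(C) %*% (a * beta_hat + b * e0)))   # TRUE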
Theorem 4. VIF for each independent variable is invariant under multiple linear transformations in logistic regression.
Proof. Without loss of generality, we assume multiple linear transformations $x_k^{t} = a_kx_k + b_k$, $a_k \ne 0$, for the first $m$ independent variables, $k = 1, 2, \ldots, m$, where $1 \le m \le p$. To find the VIFs after the transformations, we do a linear regression for each $x_k^{t}$, $k = 1, \ldots, m$, making $x_k^{t}$ the dependent variable and the remaining (transformed and untransformed) variables the independent variables, and a linear regression for each $x_k$, $k = m+1, \ldots, p$, making $x_k$ the dependent variable and the remaining variables the independent variables. We only prove the invariance of the VIF for $x_1^{t}$ and of the VIF for $x_p$, as the invariance of the VIF for $x_2^{t}, \ldots, x_m^{t}$ can be proved similarly to $x_1^{t}$ and the invariance of the VIF for $x_{m+1}, \ldots, x_{p-1}$ can be proved similarly to $x_p$.
To find the VIF for $x_1^{t}$, we do a linear regression making $x_1^{t}$ the dependent variable and $x_2^{t}, \ldots, x_m^{t}, x_{m+1}, \ldots, x_p$ the independent variables. In this case, the dependent variable $x_1^{t} = a_1x_1 + b_1$ and the independent variables result from a generalized linear transformation of the design matrix whose columns are $1, x_2, \ldots, x_p$; the transformation matrix is upper triangular, with first row $(1, b_2, \ldots, b_m, 0, \ldots, 0)$, diagonal $(1, a_2, \ldots, a_m, 1, \ldots, 1)$ and all other entries 0. Since the determinant of this matrix equals $a_2a_3\cdots a_m \ne 0$, it is nonsingular. By Lemma 1, the vector of fitted values of the transformed regression is $a_1\hat{x}_1 + b_1\mathbf{1}$, where $\hat{x}_1 = (\hat{x}_{11}, \ldots, \hat{x}_{n1})'$ is the vector of fitted values when $x_1$ is regressed on $x_2, \ldots, x_p$. By (29), it is sufficient to prove the following identity:
$$\frac{\sum_{i=1}^{n}(\hat{x}_{i1}^{t} - \bar{x}_1^{t})^2}{\sum_{i=1}^{n}(x_{i1}^{t} - \bar{x}_1^{t})^2} = \frac{\sum_{i=1}^{n}(\hat{x}_{i1} - \bar{x}_1)^2}{\sum_{i=1}^{n}(x_{i1} - \bar{x}_1)^2}, \qquad (35)$$
where $\bar{x}_1$ and $\bar{x}_1^{t}$ are the means of $x_1$ and $x_1^{t}$, respectively, and $\hat{x}_{i1}^{t}$ is the $i$-th fitted value of the transformed regression. Since $x_1^{t} = a_1x_1 + b_1$, we have $\bar{x}_1^{t} = a_1\bar{x}_1 + b_1$ and hence $x_{i1}^{t} - \bar{x}_1^{t} = a_1(x_{i1} - \bar{x}_1)$; likewise, $\hat{x}_{i1}^{t} - \bar{x}_1^{t} = a_1(\hat{x}_{i1} - \bar{x}_1)$. Therefore, the numerator and the denominator of the left-hand side of (35) are $a_1^2$ times those of the right-hand side, and (35) follows. Hence, $R^2$, and therefore the VIF, for $x_1^{t}$ equals that for $x_1$.
To find the VIF for $x_p$, we do a linear regression making $x_p$ the dependent variable and $x_1^{t}, \ldots, x_m^{t}, x_{m+1}, \ldots, x_{p-1}$ the independent variables. In this case, the independent variables result from a generalized linear transformation of the design matrix whose columns are $1, x_1, \ldots, x_{p-1}$; the transformation matrix is upper triangular, with first row $(1, b_1, \ldots, b_m, 0, \ldots, 0)$, diagonal $(1, a_1, \ldots, a_m, 1, \ldots, 1)$ and all other entries 0. Since the determinant of this matrix equals $a_1a_2\cdots a_m \ne 0$, it is nonsingular. By Theorem 3, $R^2$, and hence the VIF, for $x_p$ after the generalized linear transformation is the same as the VIF for $x_p$ prior to the generalized linear transformation. □
Remark 1. VIFs are not necessarily invariant under an invertible generalized linear transformation $XC$. For instance, let $x_1^{t} = x_2$ and $x_2^{t} = x_1$ and keep $x_3, \ldots, x_p$ unchanged. Then $x_1^{t}, x_2^{t}, \ldots, x_p^{t}$ result from the generalized linear transformation with $C_1 = 0$ and $C_{11}$ the permutation matrix that swaps the first two coordinates. Since the determinant of $C$ is −1, $C$ is nonsingular. However, the VIF for $x_1^{t}$ after the generalized linear transformation $XC$ equals the VIF for $x_2$ prior to the generalized linear transformation, and the VIFs for $x_1$ and $x_2$ are unequal in general. The following result is immediate.
 Theorem 5. Multicollinearity exists in logistic regression if, and only if, it exists after an invertible generalized linear transformation.
 Remark 2. All the results about multicollinearity and VIF also apply to machine learning algorithms in which multicollinearity is applicable such as linear regression.
   3.3. Effects on Linear Separation
Albert and Anderson [16] first assumed the design matrix $X$ to have full column rank, that is, no multicollinearity. They then introduced the concepts of separation (including complete separation and quasi-complete separation) and overlap in logistic regression with an intercept. They showed that separation leads to nonexistence of a (finite) MLE and that overlap leads to a finite and unique MLE. Therefore, like multicollinearity, separation is a common issue in logistic regression.
Definition 4. There is a complete separation of data points if there exists a vector $\alpha = (\alpha_0, \alpha_1, \ldots, \alpha_p)'$ that correctly allocates all observations to their response groups; that is,
$$X_i\alpha > 0 \ \text{if } y_i = 1 \quad \text{and} \quad X_i\alpha < 0 \ \text{if } y_i = 0, \quad i = 1, 2, \ldots, n. \qquad (38)$$
Definition 5. There is a quasi-complete separation if the data are not completely separable, but there exists a vector $\alpha$ such that
$$X_i\alpha \ge 0 \ \text{if } y_i = 1 \quad \text{and} \quad X_i\alpha \le 0 \ \text{if } y_i = 0, \quad i = 1, 2, \ldots, n,$$
and equality holds for at least one subject in each response group.
Definition 6. If neither a complete nor a quasi-complete separation exists, then the data is said to have overlap.
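A toy example in R illustrates Definition 4. With the artificial data below (our own construction), x separates the two response groups completely, and glm() issues its usual warnings:
x <- c(1, 2, 3, 4, 6, 7, 8, 9)
y <- c(0, 0, 0, 0, 1, 1, 1, 1)            # y = 1 exactly when x > 5: complete separation
fit <- glm(y ~ x, family = binomial)
# typical warnings: "fitted probabilities numerically 0 or 1 occurred"
# and possibly "algorithm did not converge"; the (finite) MLE does not exist here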
 Theorem 6. An invertible generalized linear transformation does not affect the data configuration of logistic regression.
 Proof.  We consider three cases.
Case 1. There is a complete separation of data points in the original system. Then (38) holds for some vector $\alpha$. The row $X_i$ of the design matrix becomes $X_iC$ for $i = 1, \ldots, n$ after the invertible generalized linear transformation $XC$. Let $\alpha^{t} = C^{-1}\alpha$; then $\alpha^{t}$ is a constant column vector of dimension $(p+1)$. Since $(X_iC)\alpha^{t} = X_iCC^{-1}\alpha = X_i\alpha$, (38) holds after the generalized linear transformation. Therefore, there is also a complete separation of data points after the generalized linear transformation $XC$.
Case 2. There is a quasi-complete separation of data points in the original system. It can be proved similarly to Case 1.
Case 3. The original data points have overlap. Then the new data points after the generalized linear transformation $XC$ also have overlap. We prove this by contradiction. Assume otherwise that the new data points after the generalized linear transformation do not have overlap. Then there is either a complete separation or a quasi-complete separation of data points. Let us first assume there is a complete separation of data points after the generalized linear transformation $XC$. Then there is a vector $\alpha$ such that (38) holds for the rows $X_iC$ of the new design matrix, $i = 1, \ldots, n$. Let $\alpha^{t} = C\alpha$; then $X_i\alpha^{t} = X_iC\alpha$, so (38) holds with $\alpha^{t}$ in the original system, which is a contradiction. Next, let us assume there is a quasi-complete separation after the generalized linear transformation $XC$. It can be proved similarly. □
  4. Numeric Examples
In this section, we use real data, the well-known German Credit Data from a German bank, to validate our theoretical results. The German Credit Data can be found in the UCI Machine Learning Repository [24]. The original dataset is in the file "german.data", which contains categorical/symbolic attributes. It has 1000 observations representing 1000 loan applicants. The statistical software package R (version 3.4.2) and RStudio are employed for our analyses. Since there are only 1000 records, we will not split them into training and test sets. We read german.data with R's read_table function, call the resulting data frame german_credit_raw, and use the colnames() function to rename the columns.
There are 21 variables or attributes in german_credit_raw, including the following 8 numerical ones, which are denoted by $x_1, x_2, \ldots, x_8$, respectively:
duration: Duration in months;
credit_amount: Credit amount; 
installment_rate: Installment rate in percentage of disposable income;
current_address_length: Present residence since; 
age: Age in years; 
num_credits: Number of existing credits at this bank; 
num_dependents: Number of people being liable to provide maintenance for;
credit_status: Credit status: 1 for good loans and 2 for bad loans.
Let us define a new variable called default as default = credit_status − 1. With the new variable default, 0 is for good loans and 1 is for bad loans. Since it is not easy to interpret categorical variables, we will only consider numerical variables.
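In R, the new variable can be created in one line; the statement below is a sketch consistent with the definition above.
> german_credit_raw$default <- german_credit_raw$credit_status - 1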
  4.1. Validation of Invariance of Separation
Let us first build a logistic regression model logit_model_1 using all the 8 numerical variables and the glm function in R. In the following, we italicize statements in R, use ">" for the R prompt and make outputs from R bold.
> logit_model_1 <- glm(default ~ duration + credit_amount + installment_rate + current_address_length + age + num_credits + num_dependents + credit_status, data = german_credit_raw, family = "binomial")
Warning message:
glm.fit: algorithm did not converge
We see a warning message as above. It indicates a separation in the data. Indeed, this separation is from variable credit_status: (38) holds, for instance, with the vector $\alpha$ whose intercept entry is -1.5, whose entry for credit_status is 1, and whose remaining entries are 0. By Definition 4, there is a complete separation of data points.
Now let us make a generalized linear transformation. We randomly generate the $8 \times 8$ matrix $C_{11}$, shown in Table 1, and the 8-dimensional row vector $C_1$ in (3) by calling the R function runif, which generates random values from a uniform distribution with a default range of 0 to 1. We set a seed for reproducibility. We denote $C_{11}$ and $C_1$ by C_11 and C_1 in R, respectively. We call R's function det to calculate the determinant of $C_{11}$:
> set.seed(1)
> C_11 <- matrix(runif(64),nrow = 8)
We use R function det to find the determinant of C_11 to be 0.01433565.
Vector $C_1$ is generated as follows:
> set.seed(10)
> C_1 = runif(n = 8, min = 1, max = 20)
[1] 10.642086  6.828602  9.111246 14.168940  2.617583  5.283296  6.216080  6.173796
Since $C_{11}$ is nonsingular, so is $C$, because the determinant of $C$ equals that of $C_{11}$ (Section 2). Now $X$ is transformed into the new design matrix $XC$ as in Definition 3, and we compute the eight transformed variables in R accordingly.
Let us build a logistic regression model logit_model_2 for the eight transformed variables. 
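One way to carry out the transformation and the fit in R is sketched below. The column selection, the suffix "_2" for the transformed variables and the data frame name german_credit_2 are our own choices, not necessarily those used to produce the outputs reported here.
> X_num <- as.matrix(german_credit_raw[, c("duration", "credit_amount", "installment_rate", "current_address_length", "age", "num_credits", "num_dependents", "credit_status")])
> X_new <- sweep(X_num %*% C_11, 2, C_1, "+")   # columns 2 to 9 of the new design matrix XC
> colnames(X_new) <- paste0(colnames(X_num), "_2")
> german_credit_2 <- data.frame(default = german_credit_raw$default, X_new)
> logit_model_2 <- glm(default ~ ., data = german_credit_2, family = "binomial")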
We see the same warning message as for the eight original variables. Therefore, after a nonsingular generalized linear transformation, the separation in the data remains.
  4.2. Validation of MLE
Let us drop credit_status and rebuild a logistic regression model called logit_model_3. The main output is shown in Table 2.
The output also indicates the data still has overlap after the transformation. Hence, we have validated Theorem 6. 
We see variables current_address_length, num_credits and num_dependents are not significant at the 0.05 level. Since we are not focused on building a model, let us still keep these variables. Let us extract the coefficients and put them in a vector called model_coef_3 as follows:
> model_coef_3 <- data.frame(coef(logit_model_3))
> model_coef_3 <- as.matrix(model_coef_3)
Next, let us make a generalized linear transformation. We use the letter $D$ rather than $C$ to distinguish this case from Section 4.1. We randomly generate the $7 \times 7$ matrix $D_{11}$, shown in Table 3, and the 7-dimensional row vector $D_1$ in (3) by calling the R function runif. Again, we denote $D_{11}$ and $D_1$ by D_11 and D_1 in R, respectively.
> set.seed(2)
> D_11 <- matrix(runif(49),nrow = 7)
> det(D_11)
[1] 0.2851758
> set.seed(20)
> D_1 = runif(n = 7, min = 1, max = 20)
[1] 17.672906 15.602131  6.300300 11.054110 19.295234 19.626737  2.735319
Since the determinant of $D_{11}$ is nonzero, $D$ is nonsingular; the matrix $D$ is shown in Table 4.
We use the R function solve to find its inverse $D^{-1}$, which we call inv_D in R (see Table 5).
Now $X$ is transformed into the new design matrix $XD$ as in Definition 3, and we compute the seven transformed variables in R accordingly. Let us build a logistic regression model for the seven transformed variables and call it logit_model_4. The main output is shown in Table 6.
Let us extract the coefficients into model_coef_4 to get more digits, as shown in Table 7:
> model_coef_4 <- data.frame(coef(logit_model_4))
Let us find the product of $D^{-1}$ (inv_D) and the vector model_coef_3 in R as follows:
> inv_D%*%model_coef_3
The result of the product is shown in Table 8 below.
This is exactly the same as model_coef_4. Next, we calculate the predicted values for all the 1000 records using both models logit_model_3 and logit_model_4 by calling the R function predict, and then use the all.equal utility to check whether the two sets of predictions are nearly equal:
> model_3_predictions = predict(logit_model_3, german_credit_raw, type="response")
> model_4_predictions = predict(logit_model_4, german_credit_raw, type="response")
> all.equal(model_3_predictions, model_4_predictions, tolerance = 1e-13)
[1] “Mean relative difference: 0.0000000000005060054”
We see that the two predictions are identical up to rounding errors. Thus, we have validated Theorem 2.
Note that a nonlinear transformation, even a one-to-one correspondence, will not have the properties in Theorem 2, even for a single variable. For instance, let us define a one-to-one nonlinear transformation of variable age and call the transformed variable age_6 in R. Let us build a univariate logistic regression model called logit_model_5 for age and a univariate logistic regression model called logit_model_6 for age_6. Next, we apply these two models to predict the values for german_credit_raw.
> model_5_predictions = predict(logit_model_5, german_credit_raw, type="response")
> model_6_predictions = predict(logit_model_6, german_credit_raw, type="response")
> all.equal(model_5_predictions, model_6_predictions, scale=1)
[1] “Mean absolute difference: 0.008512868”
We see that the predictions from logit_model_5 are in general different from those of logit_model_6.
  4.3. Validation of Invariance of VIF
For the logistic regression model logit_model_3 in Section 4.2, we use the vif function in the car package of R to find the VIF for all 7 variables. The result is shown in Table 9.
> car::vif(logit_model_3)
Next, we randomly generate multiple simple linear transformations as follows:
> set.seed(30)
> A = runif(n = 7)
> set.seed(40)
> B = runif(n = 7, min = 1, max = 10)
> german_credit_raw$duration_7 = A[1] * german_credit_raw$duration + B[1]
> german_credit_raw$credit_amount_7 = A[2] * german_credit_raw$credit_amount + B[2]
…
> german_credit_raw$num_dependents_7 = A[7] * german_credit_raw$num_dependents + B[7]
We build a logistic regression for the variables after the multiple simple linear transformations and call it logit_model_7. We then find the VIFs as follows and display the result in Table 10:
> car::vif(logit_model_7)
Hence, we have validated Theorem 4. There is no need to validate Theorem 5 (the invariance of multicollinearity) as its analytical proof is straightforward.