Abstract
Linear transformations such as min–max normalization and z-score standardization are commonly used in logistic regression for the purpose of scaling. However, the work in the literature on linear transformations in logistic regression has two major limitations. First, most work focuses on improving the fit of the regression model. Second, the effects of transformations are rarely discussed. In this paper, we first generalized a linear transformation for a single variable to multiple variables by matrix multiplication. We then studied various effects of a generalized linear transformation in logistic regression. We showed that an invertible generalized linear transformation has no effect on predictions, multicollinearity, quasi-complete separation and complete separation. We also showed that multiple linear transformations do not affect the variance inflation factor (VIF). Numeric examples with real data were presented to validate our results. These invariance results justify the use of linear transformations in logistic regression.
Keywords:
logistic regression; linear transformations; predictions; ordinary least squares estimator; maximum likelihood estimator
MSC:
62J12
1. Introduction
Logistic regression is one of the most commonly used techniques for modeling the relationship between the dependent variable and one or more independent variables.
In data analysis and machine learning, a transformation refers to a mapping of a variable into a new variable. A transformation can be linear or nonlinear, depending on whether the mapping is linear or nonlinear. Linear transformations can be used to improve interpretability of coefficients in linear regression and make a fitted model easier to understand [1], whereas nonlinear transformations are often used to improve the fit of the model on the data [2].
Three types of linear transformations are commonly used in machine learning prior to model fitting, namely, min–max normalization, z-score standardization and simple scaling. Since variables measured on different scales may not contribute equally to model fitting, min–max normalization transforms all continuous variables into the same range [0, 1] to avoid a possible bias. Essentially, min–max normalization subtracts the minimum value of a continuous variable from each value and then divides by the range of the variable. z-score standardization rescales a continuous variable to the standard scale, i.e., the number of standard deviations each value lies from the mean. Mathematically, z-score standardization subtracts the mean value of a continuous variable from each value and then divides by the standard deviation of the variable. Simple scaling multiplies a continuous variable by a constant, shrinking variables with large values and expanding those with small values. All three types of linear transformations are discussed by Adeyemo, Wimmer and Powell [3] for logistic regression.
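As a concrete illustration, the three linear transformations above can be sketched in a few lines of Python (Python is used here purely for illustration; the data and the scaling constant are made up):

```python
import numpy as np

# Toy column of values for one continuous variable (illustrative only)
x = np.array([2.0, 4.0, 6.0, 10.0])

# Min-max normalization: subtract the minimum, divide by the range -> [0, 1]
min_max = (x - x.min()) / (x.max() - x.min())

# z-score standardization: subtract the mean, divide by the standard deviation
z_score = (x - x.mean()) / x.std(ddof=1)

# Simple scaling: divide by a constant (10 is an arbitrary choice here)
scaled = x / 10.0

assert min_max.min() == 0.0 and min_max.max() == 1.0  # lies in [0, 1]
assert abs(z_score.mean()) < 1e-12                    # centered at 0
```

All three are linear maps of the form ax + b with different choices of slope and shift.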
However, the work in the literature on transformations in regression has some limitations. First, most work focuses on improving the fit of the regression model [4,5,6,7,8,9]. Second, the effects of transformations are rarely discussed. Morrell, Pearson, and Brant [10] examined how linear transformations affected a linear mixed-effect model and the tests of significance of fixed effects in the model. They showed how linear transformations modified the random effects, their covariance matrix, and the value of the restricted log-likelihood. Zeng [11] studied invariant properties of some statistical measures under monotonic transformations for univariate logistic regression. Zeng [12] derived analytic properties of some well-known category encodings such as ordinal encoding, order encoding and one-hot encoding in multivariate logistic regression by means of linear transformations. Adeyemo, Wimmer and Powell [3] compared the prediction accuracy of the three types of linear transformations, min–max normalization, z-score standardization and simple scaling, in logistic regression, by means of simulation.
In this paper, we first generalized a linear transformation for a single variable to multiple variables by matrix multiplication. We then studied various effects of a generalized linear transformation in logistic regression. We showed that an invertible generalized linear transformation has no effect on predictions, multicollinearity, quasi-complete separation, and complete separation. We also showed that multiple linear transformations do not affect the variance inflation factor (VIF). Numeric examples with randomly generated transformations on real data were presented to illustrate our theoretical results.
The remainder of this paper is organized as follows. In Section 2, we give two definitions of a generalized linear transformation and show that they are equivalent. In Section 3, we study the effects of a generalized linear transformation on logistic regression. In Section 4, we present numeric examples to validate our theoretical results. Finally, the paper is concluded in Section 5.
Throughout the paper, we concentrate on transformations of independent variables, which are also sometimes called explanatory variables.
2. A Generalized Linear Transformation in Logistic Regression
Let $x = (x_1, x_2, \dots, x_p)'$ be the vector of independent variables and $y$ be the dependent variable. Let us consider a sample of $n$ independent observations $(x_{i1}, x_{i2}, \dots, x_{ip}, y_i)$, $i = 1, 2, \dots, n$, where $y_i$ is the value of $y$ and $x_{i1}, \dots, x_{ip}$ are the values of the independent variables for the $i$-th observation. Without loss of generality, we assume $x_1, \dots, x_p$ are all continuous variables, since otherwise they can be converted into continuous variables.
Let us adopt the matrix notation:
$$ y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}, \quad X = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1p} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n1} & \cdots & x_{np} \end{pmatrix}, \quad \beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{pmatrix}, $$
where $x_{i0} = 1$ for all $i$ (used for the intercept $\beta_0$) and the $n \times (p+1)$ matrix $X$ is called the design matrix. Here, $\beta_0, \beta_1, \dots, \beta_p$ are called regression coefficients or regression parameters.
Without causing confusion, we also use $x_0, x_1, \dots, x_p$ to denote the columns or column vectors of $X$. We further use the capital letter $X_i$ to denote the row vector $(1, x_{i1}, \dots, x_{ip})$ of $X$ for $i = 1, 2, \dots, n$.
Definition 1.
A linear transformation is a linear function of a variable which maps or transforms the variable into a new one. Specifically, a linear transformation of a variable $x$ can be defined as $x^t = ax + b$, where $a$ and $b$ are constants and $a$ is nonzero. For convenience, let us call a linear transformation of a single variable a simple linear transformation. By multiple linear transformations, we mean a set of simple linear transformations. Here, we use the letter $t$ in the superscript to denote the new variable after a transformation.
Note that $a$ and $b$ in Definition 1 are not vectors since $x$ is a variable.
Definition 1 can be generalized naturally by matrix multiplication to transform a set of variables to a new set of variables.
Definition 2.
A generalized linear transformation is a linear matrix-vector expression
$$ (x_1^t, x_2^t, \dots, x_p^t)' = A(x_1, x_2, \dots, x_p)' + b $$
that transforms or maps the independent variables $x_1, \dots, x_p$ into new independent variables $x_1^t, \dots, x_p^t$, where $A$ is a $p \times p$ matrix of real numbers and the entries of $b = (b_1, \dots, b_p)'$ are real constants. Here, $x_1, \dots, x_p$ are variables, not vectors.
It should not be confused with a linear transformation between two vector spaces, in which there is no vector $b$. Here and hereafter, we use the prime symbol ′ in the superscript for the transpose of a vector or a matrix. The new variables in component form are
$$ x_j^t = a_{j1}x_1 + a_{j2}x_2 + \cdots + a_{jp}x_p + b_j, \quad j = 1, \dots, p. $$
Consider a simple linear transformation, $x_k^t = ax_k + b_0$, for some $k$ with $1 \le k \le p$. Without loss of generality, assume $k = 1$. Let $A$ be the $p$-dimensional diagonal matrix with $a_{11} = a$ and $a_{jj} = 1$ for $2 \le j \le p$. Let $b = (b_0, 0, \dots, 0)'$ be a $p$-dimensional column vector. Then $x_1, \dots, x_p$ are transformed into $x_1^t, \dots, x_p^t$ according to Definition 2. Similarly, consider a set of simple linear transformations, say, $x_j^t = a_jx_j + b_j$ for $1 \le j \le m$ with $m \le p$. Let $A$ be the $p$-dimensional diagonal matrix with $a_{jj} = a_j$ for $1 \le j \le m$ and $a_{jj} = 1$ for $m < j \le p$. Let $b = (b_1, \dots, b_m, 0, \dots, 0)'$ be a $p$-dimensional column vector. Then $x_1, \dots, x_p$ are transformed into $x_1^t, \dots, x_p^t$ according to Definition 2. Hence, both a simple linear transformation and multiple linear transformations are special cases of a generalized linear transformation.
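The diagonal construction just described can be sketched numerically as follows (Python used for illustration; the slopes and shifts are arbitrary):

```python
import numpy as np

p = 4                     # number of independent variables (illustrative)
m = 2                     # transform only the first m variables
a = np.array([2.0, 0.5])  # slopes a_j of the simple transformations
b_small = np.array([1.0, -3.0])  # shifts b_j

# Build A (diagonal: a_j for j <= m, 1 otherwise) and b (b_j then zeros)
A = np.diag(np.concatenate([a, np.ones(p - m)]))
b = np.concatenate([b_small, np.zeros(p - m)])

x = np.array([1.0, 2.0, 3.0, 4.0])  # one observation of the p variables
x_new = A @ x + b                   # generalized linear transformation

# Componentwise this is exactly the simple transformations,
# with the last p - m variables left unchanged:
assert np.allclose(x_new, [2*1.0 + 1.0, 0.5*2.0 - 3.0, 3.0, 4.0])
```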
However, Definition 2 is not convenient to use since the new design matrix is somewhat complicated. Therefore, we give another definition that incorporates the design matrix.
Definition 3.
A generalized linear transformation is a matrix multiplication $X^t = XC$ that transforms $X$ into $X^t$, where the 2nd to the last columns of $X^t$ are the new variables $x_1^t, \dots, x_p^t$ and $C$ is a $(p+1) \times (p+1)$ matrix of real numbers as follows:
$$ C = \begin{pmatrix} 1 & c_{01} & \cdots & c_{0p} \\ 0 & c_{11} & \cdots & c_{1p} \\ \vdots & \vdots & & \vdots \\ 0 & c_{p1} & \cdots & c_{pp} \end{pmatrix}. \quad (2) $$
Note that we require the first column of $C$ to be 0 except the first entry (which is 1) in order for $X^t = XC$ to be the new design matrix.
For convenience, let us partition $C$ into 4 blocks such that
$$ C = \begin{pmatrix} 1 & C_1 \\ 0 & C_{11} \end{pmatrix}, \quad (3) $$
where $C_1$ is the $p$-dimensional row vector $(c_{01}, \dots, c_{0p})$, 0 is the $p$-dimensional column vector of all 0's, and $C_{11}$ is the submatrix obtained by deleting the first column and the first row of $C$, that is, $C_{11} = (c_{ij})_{1 \le i,j \le p}$.
In the following, we prove that the two definitions of a generalized linear transformation are equivalent.
Theorem 1.
Definition 2 and Definition 3 are equivalent.
Proof.
Let us begin with Definition 2. Its new design matrix is
$$ X^t = \begin{pmatrix} 1 & (A(x_{11}, \dots, x_{1p})' + b)' \\ \vdots & \vdots \\ 1 & (A(x_{n1}, \dots, x_{np})' + b)' \end{pmatrix} = X \begin{pmatrix} 1 & b' \\ 0 & A' \end{pmatrix}. $$
Hence, the new design matrix of Definition 2 is in the form of Definition 3 with
$$ C = \begin{pmatrix} 1 & b' \\ 0 & A' \end{pmatrix}. $$
Note that the submatrix obtained by deleting the first row and first column of the matrix $C$ above is the transpose of $A$, that is, $C_{11} = A'$.
Next, let us begin with Definition 3.
The second, third, …, last columns of $X^t = XC$ are from the linear transformation
$$ x_j^t = c_{1j}x_1 + c_{2j}x_2 + \cdots + c_{pj}x_p + c_{0j}, \quad j = 1, \dots, p, $$
respectively. Hence, Definition 3 is in the form of Definition 2 with
$$ A = C_{11}' $$
and
$$ b = C_1'. $$
We have concluded our proof. □
If we expand along the first column to find the determinant of $C$ in (2), we immediately see that the determinant of $C$ is equal to the determinant of $C_{11}$. Therefore, $C$ is nonsingular (or invertible) if and only if $C_{11}$ in (3) is nonsingular. In addition, it follows from the proof of Theorem 1 that $C$ is nonsingular if and only if $A$ in Definition 2 is nonsingular.
Moreover, it is easy to see that if $C_{11}$ is nonsingular, then the inverse of $C$ can be written as
$$ C^{-1} = \begin{pmatrix} 1 & -C_1C_{11}^{-1} \\ 0 & C_{11}^{-1} \end{pmatrix}. \quad (9) $$
From now on we will use Definition 3 unless otherwise specified. For convenience, let us call the generalized linear transformation $X^t = XC$ invertible if $C$ is invertible.
3. Effects of a Generalized Linear Transformation
In logistic regression, the dependent variable $y$ is binary with the two values 0 and 1. Let the conditional probability that $y = 1$ be denoted by
$$ \pi = P(y = 1 \mid x_1, \dots, x_p). $$
Logistic regression assumes logit linearity between the log odds and the independent variables:
$$ \log\frac{\pi}{1-\pi} = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p. \quad (10) $$
Equation (10) above can be written as
$$ \pi = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p)}}. \quad (11) $$
The following log likelihood is used in logistic regression:
$$ l(\beta) = \sum_{i=1}^{n} \left[ y_i \log \pi_i + (1 - y_i) \log(1 - \pi_i) \right]. \quad (12) $$
The maximum likelihood method is used to estimate the parameters in logistic regression. Specifically, the maximum likelihood estimators (MLE) are the values of the parameters that maximize (12). The vector $\hat\beta$ of the MLE estimators of $\beta_0, \beta_1, \dots, \beta_p$ satisfies [13]
$$ \sum_{i=1}^{n} (y_i - \pi_i)\,x_{ij} = 0, \quad j = 0, 1, \dots, p, \quad (13) $$
or in matrix-vector form
$$ X'(y - \pi) = 0, \quad (14) $$
where $\pi = (\pi_1, \dots, \pi_n)'$ and $\pi_i = 1/(1 + e^{-X_i\beta})$ for $i = 1, \dots, n$. Note that after a generalized linear transformation, (12) and (14) hold with the design matrix $X$ replaced by the new design matrix $X^t = XC$.
Equation (13) or (14) represents $(p+1)$ nonlinear equations in $\beta_0, \beta_1, \dots, \beta_p$ and cannot be solved explicitly in general [14]. Rather, they can be solved numerically by the Newton-Raphson algorithm [15] as follows:
$$ \beta^{(k+1)} = \beta^{(k)} + (X'WX)^{-1}X'(y - \pi), \quad (15) $$
where $W$ is the diagonal matrix with diagonal elements $\pi_i(1 - \pi_i)$, $i = 1, \dots, n$. Both $W$ and $\pi$ are evaluated at $\beta^{(k)}$ in (15).
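Iteration (15) can be sketched in a few lines. The following Python code (with simulated data, not the paper's) implements the Newton-Raphson update above and checks that the score equations (14) hold at convergence:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # design matrix with intercept
true_beta = np.array([0.5, 1.0, -1.0])
y = (rng.random(n) < 1 / (1 + np.exp(-X @ true_beta))).astype(float)

beta = np.zeros(p + 1)
for _ in range(25):
    pi = 1 / (1 + np.exp(-X @ beta))   # current fitted probabilities
    W = np.diag(pi * (1 - pi))         # diagonal weight matrix in (15)
    step = np.linalg.solve(X.T @ W @ X, X.T @ (y - pi))
    beta = beta + step
    if np.max(np.abs(step)) < 1e-10:
        break

# At the MLE, the score equations X'(y - pi) = 0 from (14) hold:
score = X.T @ (y - 1 / (1 + np.exp(-X @ beta)))
assert np.max(np.abs(score)) < 1e-8
```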
If $X'WX$ is nonsingular and the data are not completely separable or quasi-completely separable [16], then the MLE estimator $\hat\beta$ exists and is unique.
The MLE estimator $\hat\beta$ can be used to predict $y$ by the linear combination of the variables:
$$ \hat\pi = \frac{1}{1 + e^{-(\hat\beta_0 + \hat\beta_1 x_1 + \cdots + \hat\beta_p x_p)}}. \quad (16) $$
In particular, we have fitted values $\hat\pi_i = 1/(1 + e^{-X_i\hat\beta})$, $i = 1, \dots, n$.
3.1. Effects on MLE Estimator and Predictions
Theorem 2.
For logistic regression, if the MLE estimator of $\beta$ is $\hat\beta$, then the MLE estimator after a generalized linear transformation $X^t = XC$ is $C^{-1}\hat\beta$, assuming $C$ is nonsingular. Moreover, the generalized linear transformation does not affect predictions.
Proof.
Since $\hat\beta$ is the maximum likelihood estimator of $\beta$, (14) is satisfied by $\hat\beta$. Multiplying both sides of (14) by $C'$, we obtain
$$ C'X'(y - \pi) = 0. \quad (17) $$
Clearly, (17) can be rewritten as
$$ (XC)'(y - \pi) = 0. \quad (18) $$
Writing $X_i\hat\beta$ as $(X_iC)(C^{-1}\hat\beta)$ for $i = 1, \dots, n$, we have
$$ \pi_i = \frac{1}{1 + e^{-(X_iC)(C^{-1}\hat\beta)}}, \quad i = 1, \dots, n. \quad (19) $$
It follows from (18) and (19) that $C^{-1}\hat\beta$ satisfies (14) for the new design matrix $X^t = XC$. Hence, $C^{-1}\hat\beta$ is the new MLE estimator after the generalized linear transformation $X^t = XC$.
Let us now predict $y$ for a set of values of the variables in the new system after the generalized linear transformation $X^t = XC$, using the new MLE estimator $C^{-1}\hat\beta$. Let $v_1, \dots, v_p$ be specific values of $x_1, \dots, x_p$, respectively. Then the row vector $V = (1, v_1, \dots, v_p)$ in the original system becomes $VC$ in the new system. By (16), the predicted conditional probability of $y = 1$ when $(x_1, \dots, x_p) = (v_1, \dots, v_p)$ in the new system is
$$ \frac{1}{1 + e^{-(VC)(C^{-1}\hat\beta)}} = \frac{1}{1 + e^{-V\hat\beta}}. \quad (20) $$
The right-hand side of (20) is the predicted conditional probability of $y = 1$ when $(x_1, \dots, x_p) = (v_1, \dots, v_p)$ in the original system. □
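Theorem 2 can be checked numerically. The Python sketch below (simulated data; fit_logistic is an illustrative helper, not from the paper) fits the model before and after a transformation X → XC and confirms that the new MLE is C⁻¹β̂ and that the linear predictors, hence the predictions, coincide:

```python
import numpy as np

def fit_logistic(X, y, iters=50):
    # Illustrative Newton-Raphson MLE solver following (15)
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        pi = 1 / (1 + np.exp(-X @ beta))
        W = np.diag(pi * (1 - pi))
        beta = beta + np.linalg.solve(X.T @ W @ X, X.T @ (y - pi))
    return beta

rng = np.random.default_rng(2)
n = 300
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = (rng.random(n) < 1 / (1 + np.exp(-X @ np.array([0.3, 1.0, -0.5])))).astype(float)

# An invertible generalized linear transformation X -> XC (Definition 3):
# the first column of C is (1, 0, 0)' so XC is again a design matrix.
C = np.array([[1.0, 2.0, -1.0],
              [0.0, 1.5,  0.5],
              [0.0, 0.3,  2.0]])

beta_orig = fit_logistic(X, y)
beta_new = fit_logistic(X @ C, y)

# Theorem 2: the new MLE is C^{-1} beta_orig, and predictions are unchanged
assert np.allclose(beta_new, np.linalg.solve(C, beta_orig), atol=1e-6)
assert np.allclose(X @ beta_orig, (X @ C) @ beta_new, atol=1e-6)
```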
3.2. Effects on Multicollinearity
Perfect multicollinearity or complete multicollinearity or multicollinearity, in short, refers to a situation in logistic regression in which two or more independent variables are linearly related [17]. In particular, if two independent variables are linearly related, then it is called collinearity.
Mathematically, multicollinearity means there exist constants $c_0, c_1, \dots, c_p$ such that
$$ c_1x_1 + c_2x_2 + \cdots + c_px_p = c_0, \quad (21) $$
where at least two of $c_1, \dots, c_p$ are nonzero. If we treat the constant variable $x_0 = 1$ as an independent variable, then we just require at least one of $c_1, \dots, c_p$ to be nonzero.
Multicollinearity is a common issue in logistic regression. If there is multicollinearity, the design matrix $X$ will not have full column rank $p + 1$. Hence, the matrix $X'WX$ in (15) will have rank less than $p + 1$. Thus, the inverse matrix $(X'WX)^{-1}$ in (15) does not exist, which makes the iteration in (15) impossible.
If there is near multicollinearity and there is no separation of the data points, theoretically $X'WX$ in (15) has an inverse and the iteration in (15) can proceed. Yet, iteration (15) may fail to compute an accurate inverse and hence may produce unstable estimates and inaccurate variances [18].
Some authors define multicollinearity in logistic regression to be a high correlation between independent variables [19,20,21]. Let us call multicollinearity with high correlation near multicollinearity, and reserve the term multicollinearity for perfect or complete multicollinearity.
Let us define the VIF now. Let $R_j^2$ be the R-squared that results when $x_j$ is linearly regressed against the other independent variables. Then the VIF for $x_j$ is defined as
$$ \mathrm{VIF}_j = \frac{1}{1 - R_j^2}. \quad (22) $$
Near multicollinearity can be detected by using the VIF [2]. The larger the VIF of an independent variable, the larger the correlation between this independent variable and the others. However, there is no standard for acceptable levels of VIF. Multicollinearity can be combated by a generalized cross-validation (GCV) criterion in partially linear regression models [22,23].
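The definition in (22) can be illustrated directly. In the Python sketch below (simulated data; the helper vif is ours, not the function of the same name in R's car package), each VIF is computed from the R-squared of the auxiliary regression, and it is also checked that simple linear transformations of the columns leave the VIFs unchanged:

```python
import numpy as np

def vif(X):
    # VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing x_j on the others
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef = np.linalg.lstsq(others, X[:, j], rcond=None)[0]
        resid = X[:, j] - others @ coef
        r2 = 1 - resid @ resid / np.sum((X[:, j] - X[:, j].mean()) ** 2)
        out[j] = 1 / (1 - r2)
    return out

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 3))
X[:, 2] = X[:, 2] + 0.9 * X[:, 0]   # correlated columns inflate the VIF

v = vif(X)
assert v[0] > 1 and v[2] > 1

# Multiple simple linear transformations x_j -> a_j x_j + b_j leave each VIF unchanged
Xt = X * np.array([2.0, 0.5, 3.0]) + np.array([1.0, -2.0, 4.0])
assert np.allclose(vif(Xt), v)
```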
3.2.1. Preliminary Results in Linear Regression
As the VIF is related to linear regression, let us briefly introduce some preliminary results in linear regression. As for logistic regression, we consider independent variables $x_1, \dots, x_p$. Unlike logistic regression, the dependent variable $y$ in linear regression is a continuous variable. We shall adopt the same notation as in logistic regression unless otherwise specified. In particular, $X$ is the design matrix.
In linear regression, the relationship between $y$ and $x_1, \dots, x_p$ is formulated as a linear combination
$$ y = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p + \varepsilon, \quad (23) $$
where $\varepsilon$ is a random error, or in matrix notation
$$ y = X\beta + \varepsilon. \quad (24) $$
The ordinary least squares (OLS) estimator $\hat\beta$ of $\beta$ satisfies [2]
$$ X'X\hat\beta = X'y. \quad (25) $$
Assuming the $(p+1)$-dimensional square matrix $X'X$ is nonsingular, the OLS estimator is unique and can be written explicitly as
$$ \hat\beta = (X'X)^{-1}X'y. \quad (26) $$
The OLS estimator can be used to predict $y$ by the linear combination of the variables as follows:
$$ \hat{y} = \hat\beta_0 + \hat\beta_1 x_1 + \cdots + \hat\beta_p x_p. \quad (27) $$
Like Gelman and Hill [1] and Chatterjee and Hadi [2], we will call a predicted value a fitted value if the values of $x_1, \dots, x_p$ come from one of the $n$ observations. So, we have fitted values $\hat{y}_i = X_i\hat\beta$, $i = 1, \dots, n$.
Therefore, the $n$-dimensional column vector $\hat{y} = (\hat{y}_1, \dots, \hat{y}_n)'$ of the fitted values can be expressed as
$$ \hat{y} = X\hat\beta = X(X'X)^{-1}X'y. \quad (28) $$
It is easy to show that the OLS estimator is $C^{-1}\hat\beta$ after an invertible generalized linear transformation $X^t = XC$. Moreover, the generalized linear transformation does not affect predictions. Indeed, let us now predict $y$ for a set of values $v_1, \dots, v_p$ of the variables $x_1, \dots, x_p$, which could be any set of values, not necessarily from one of the $n$ observations. We first transform the row vector $V = (1, v_1, \dots, v_p)$ of values into $VC$. Next, we apply (27) and obtain $(VC)(C^{-1}\hat\beta) = V\hat\beta$, which is the predicted value of the original model.
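A quick numeric check of this claim, with simulated data (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.1, size=n)

# Invertible C with first column (1, 0, 0)', as required for a design matrix
C = np.array([[1.0, 1.0, 2.0],
              [0.0, 2.0, 1.0],
              [0.0, 0.5, 3.0]])

beta = np.linalg.lstsq(X, y, rcond=None)[0]
beta_new = np.linalg.lstsq(X @ C, y, rcond=None)[0]

assert np.allclose(beta_new, np.linalg.solve(C, beta))   # OLS estimator becomes C^{-1} beta
assert np.allclose(X @ beta, (X @ C) @ beta_new)         # fitted values unchanged
```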
In linear regression, the coefficient of determination, denoted by $R^2$ and also called R-squared, is given by Chatterjee and Hadi [2] as
$$ R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2} = \frac{\sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}, \quad (29) $$
where $\bar{y}$ is the mean of the dependent variable $y$, that is, $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$, and $\hat{y}_i$ is the fitted value of the $i$-th observation.
The coefficient of determination can be related to the square of the correlation between $y$ and $\hat{y}$ as follows [2]:
$$ R^2 = [\mathrm{Corr}(y, \hat{y})]^2, \quad (30) $$
where
$$ \mathrm{Corr}(y, \hat{y}) = \frac{\sum_{i=1}^{n}(y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}{\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}\,\sqrt{\sum_{i=1}^{n}(\hat{y}_i - \bar{\hat{y}})^2}}. \quad (31) $$
Theorem 3.
$R^2$ in linear regression is invariant under invertible generalized linear transformations.
Proof.
Expressing the sum $\sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2$ in the numerator of the 2nd equation in (29) in matrix form and applying (26), we obtain
$$ \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 = \hat{y}'\hat{y} - n\bar{y}^2 = y'X(X'X)^{-1}X'y - n\bar{y}^2. \quad (32) $$
Substituting (32) into (29) yields
$$ R^2 = \frac{y'X(X'X)^{-1}X'y - n\bar{y}^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}. \quad (33) $$
Now let $X^t = XC$ be an invertible generalized linear transformation. Then the OLS estimator after the transformation becomes $C^{-1}\hat\beta$. In this case, the numerator in (33) becomes
$$ y'XC(C'X'XC)^{-1}C'X'y - n\bar{y}^2 = y'X(X'X)^{-1}X'y - n\bar{y}^2, $$
which returns to $R^2$ in (29) before the generalized linear transformation. □
3.2.2. Effects on Logistic Regression
In Definitions 2 and 3, we defined a generalized linear transformation only for independent variables. Since an independent variable is used as the dependent variable in order to find its VIF, we consider a simple linear transformation for the dependent variable in the following result.
Lemma 1.
Consider a linear regression with $y$ as the dependent variable and $x_1, \dots, x_p$ as the independent variables. If we make a simple linear transformation on $y$ such as $y^t = ay + b$ and a generalized linear transformation $X^t = XC$ on the independent variables with nonsingular $C$, then $C^{-1}(a\hat\beta + be')$ is the OLS estimator of the new linear regression after the transformations, where $XC$ is the design matrix, $e = (1, 0, \dots, 0)$ is a $(p+1)$-dimensional row vector and $\hat\beta$ is the OLS estimator of the original linear regression.
Proof.
Since the new linear regression has design matrix $XC$ and the dependent variable can be expressed as $y^t = ay + bXe'$, where $e = (1, 0, \dots, 0)$ is a $(p+1)$-dimensional row vector (so that $Xe'$ is the column vector of all 1's), it is sufficient to show that $C^{-1}(a\hat\beta + be')$ satisfies
$$ (XC)'(XC)\,C^{-1}(a\hat\beta + be') = (XC)'y^t. \quad (34) $$
Substituting $C^{-1}(a\hat\beta + be')$ into the left-hand side of (34) and replacing $X'X\hat\beta$ with $X'y$, we obtain
$$ C'X'X(a\hat\beta + be') = C'X'(ay + bXe') = (XC)'y^t, $$
which is the right-hand side of (34). □
Theorem 4.
VIF for each independent variable is invariant under multiple linear transformations in logistic regression.
Proof.
Without loss of generality, we assume multiple simple linear transformations for the first $m$ independent variables, $x_j^t = a_jx_j + b_j$ for $1 \le j \le m$, where $m \le p$. To find the VIFs, we do a linear regression for each $x_j^t$, $1 \le j \le m$, making $x_j^t$ the dependent variable and the remaining variables the independent variables. Similarly, we do a linear regression for each $x_j$, $m < j \le p$, making $x_j$ the dependent variable and the remaining variables the independent variables. We only prove the invariance of the VIF for $x_1$ and of the VIF for $x_p$, as the invariance of the VIF for $x_2, \dots, x_m$ can be proved similarly to $x_1$ and the invariance of the VIF for $x_{m+1}, \dots, x_{p-1}$ can be proved similarly to $x_p$.
To find the VIF for $x_1$, we do a linear regression making $x_1^t$ the dependent variable and $x_2^t, \dots, x_m^t, x_{m+1}, \dots, x_p$ the independent variables. In this case, the dependent variable and the independent variables result from a generalized linear transformation, where the design matrix is formed from the independent variables and the transformation matrix is the upper triangular matrix as follows
Since the determinant of this upper triangular matrix is nonzero, the OLS estimator after the multiple linear transformations follows from Lemma 1. By (29), it is sufficient to prove the following identity:
Since the denominator of the left-hand side of (35) is
It is sufficient to show that
Expressing the left-hand side of (36) as the multiplication of vectors
where is the -dimensional column vector with all elements of and is the -dimension vector of fitted values for
Applying (28) for the vector of fitted values and design matrix and applying Lemma 1, we obtain
Hence, and so (37) becomes
which is the right-hand side of (36).
To find the VIF for $x_p$, we do a linear regression making $x_p$ the dependent variable and $x_1^t, \dots, x_m^t, x_{m+1}, \dots, x_{p-1}$ the independent variables. In this case, the independent variables result from a generalized linear transformation, where the design matrix is formed from the independent variables and the transformation matrix is the upper triangular matrix as follows
Since the determinant of this upper triangular matrix is nonzero, by Theorem 3 the VIF for $x_p$ after the generalized linear transformation is the same as the VIF for $x_p$ prior to the generalized linear transformation. □
Remark 1.
VIFs are not necessarily invariant under an invertible generalized linear transformation $X^t = XC$. For instance, let $x_1^t = x_2$ and $x_2^t = x_1$ and keep $x_3, \dots, x_p$ unchanged. Then $x_1^t, \dots, x_p^t$ result from the generalized linear transformation $X^t = XC$, with $C$ the permutation matrix that swaps the columns corresponding to $x_1$ and $x_2$.
Since the determinant of $C$ is −1, $C$ is nonsingular. However, the VIF for $x_1^t$ after the generalized linear transformation equals the VIF for $x_2$ prior to the generalized linear transformation, and the VIFs for $x_1$ and $x_2$ are unequal in general.
The following result is immediate.
Theorem 5.
Multicollinearity exists in logistic regression if, and only if, it exists after an invertible generalized linear transformation.
Remark 2.
All the results about multicollinearity and VIF also apply to machine learning algorithms in which multicollinearity is applicable such as linear regression.
3.3. Effects on Linear Separation
Albert and Anderson [16] first assumed the design matrix $X$ to have full column rank, that is, no multicollinearity. They then introduced the concepts of separation (including complete separation and quasi-complete separation) and overlap in logistic regression with intercept. They showed that separation leads to nonexistence of a (finite) MLE and that overlap leads to a finite and unique MLE. Therefore, like multicollinearity, separation is a common issue in logistic regression.
Definition 4.
There is a complete separation of data points if there exists a vector $b$ that correctly allocates all observations to their response groups; that is,
$$ X_ib > 0 \text{ if } y_i = 1 \quad \text{and} \quad X_ib < 0 \text{ if } y_i = 0, \quad i = 1, \dots, n. \quad (38) $$
Definition 5.
There is quasi-complete separation if the data are not completely separable, but there exists a vector $b$ such that
$$ X_ib \ge 0 \text{ if } y_i = 1 \quad \text{and} \quad X_ib \le 0 \text{ if } y_i = 0, \quad i = 1, \dots, n, \quad (39) $$
and equality holds for at least one subject in each response group.
Definition 6.
If neither a complete nor a quasi-complete separation exists, then the data is said to have overlap.
Theorem 6.
An invertible generalized linear transformation does not affect the data configuration of logistic regression.
Proof.
We consider three cases.
Case 1. There is a complete separation of data points in the original system. Then (38) holds for some vector $b$. The $i$-th row of the design matrix is $X_iC$ for $i = 1, \dots, n$ after the invertible generalized linear transformation $X^t = XC$. Let $b^t = C^{-1}b$; then $b^t$ is a constant column vector of dimension $(p + 1)$. Since $(X_iC)b^t = X_ib$, (38) holds after the generalized linear transformation. Therefore, there is also a complete separation of data points after the generalized linear transformation $X^t = XC$.
Case 2. There is a quasi-complete separation of data points in the original system. It can be proved similarly to Case 1.
Case 3. The original data points have overlap. Then the new data points after the generalized linear transformation also have overlap. We prove this by contradiction. Assume otherwise that the new data points after the generalized linear transformation do not have overlap. Then there is either a complete separation or a quasi-complete separation of data points. Let us first assume there is a complete separation of data points after the generalized linear transformation $X^t = XC$. Then there is a vector $b^t$ such that (38) holds. Row $i$ of the design matrix after the generalized linear transformation is $X_iC$ for $i = 1, \dots, n$. Let $b = Cb^t$; then (38) holds with $b$ in the original system, which is a contradiction. Next, let us assume there is a quasi-complete separation after the generalized linear transformation $X^t = XC$. It can be proved similarly. □
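Case 1 can also be illustrated numerically. In the Python sketch below (toy data, illustrative only), a vector b completely separates the original data, and C⁻¹b separates the transformed data XC exactly as in the proof:

```python
import numpy as np

# Toy design matrix (intercept column first) and binary responses
X = np.array([[1.0, -2.0, 0.5],
              [1.0, -1.0, 1.0],
              [1.0,  2.0, -0.5],
              [1.0,  3.0, 0.0]])
y = np.array([0, 0, 1, 1])

b = np.array([0.0, 1.0, 0.0])   # X b > 0 iff y = 1: complete separation (38)
assert np.all((X @ b > 0) == (y == 1))

# Invertible generalized linear transformation (first column of C is (1, 0, 0)')
C = np.array([[1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0],
              [0.0, 1.0, 4.0]])
b_new = np.linalg.solve(C, b)   # b^t = C^{-1} b

# Since (X C) b^t = X b, the same separation holds after the transformation
assert np.all(((X @ C) @ b_new > 0) == (y == 1))
```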
4. Numeric Examples
In this section, we use real data, the well-known German Credit Data from a German bank, to validate our theoretical results. The German Credit Data can be found in the UCI Machine Learning Repository [24]. The original dataset is in the file "german.data", which contains categorical/symbolic attributes. It has 1000 observations representing 1000 loan applicants. The statistical software R (version 3.4.2) and RStudio are employed for our analyses. Since there are only 1000 records, we do not split them into training and test sets. We read german.data using R's read_table function, call the result german_credit_raw, and use the colnames function to rename the columns.
There are 21 variables or attributes in german_credit_raw, including the following 8 numerical ones, which are denoted by $x_1, x_2, \dots, x_8$, respectively:
- duration: Duration in months;
- credit_amount: Credit amount;
- installment_rate: Installment rate in percentage of disposable income;
- current_address_length: Present residence since;
- age: Age in years;
- num_credits: Number of existing credits at this bank;
- num_dependents: Number of people being liable to provide maintenance for;
- credit_status: Credit status: 1 for good loans and 2 for bad loans.
Let us define a new variable called default as default = credit_status - 1. With the new variable default, 0 is for good loans and 1 is for bad loans. Since it is not easy to interpret categorical variables, we will only consider the numerical variables.
4.1. Validation of Invariance of Separation
Let us first build a logistic regression model logit_model_1 using all 8 numerical variables and the glm function in R. In the following, we italicize statements in R, use ">" for the R prompt and set outputs from R in bold.
> logit_model_1 <- glm(default ~ duration + credit_amount + installment_rate + current_address_length + age + num_credits + num_dependents + credit_status, data = german_credit_raw, family = "binomial")
Warning message:
glm.fit: algorithm did not converge
We see a warning message as above. It indicates a separation in the data. Indeed, this separation comes from the variable credit_status: since default is completely determined by credit_status, (38) holds. By Definition 4, there is a complete separation of data points.
Now let us make a generalized linear transformation. We randomly generate the matrix $C_{11}$, shown in Table 1, and the 8-dimensional row vector $C_1$ in (3) by calling the R function runif, which generates random values from a uniform distribution, by default between 0 and 1. We set a seed for reproducibility. We denote $C_{11}$ and $C_1$ by C_11 and C_1 in R, respectively. We call R's function det to calculate the determinant of $C_{11}$.
Table 1.
Matrix C_11.
> set.seed(1)
> C_11 <- matrix(runif(64),nrow = 8)
We use R function det to find the determinant of C_11 to be 0.01433565.
Vector $C_1$ is generated as follows:
> set.seed(10)
> C_1 = runif(n = 8, min = 1, max = 20)
[1] 10.642086 6.828602 9.111246 14.168940 2.617583 5.283296
6.216080 6.173796
Since $C_{11}$ is nonsingular, so is $C$ by (9). Now we transform $X$ into $X^t = XC$ as in (6) and denote the transformed variables accordingly in R.
Let us build a logistic regression model logit_model_2 for the eight transformed variables.
We also see the warning message as for the eight original variables. Therefore, after a nonsingular generalized linear transformation, the separation in data remains.
4.2. Validation of MLE
Let us drop credit_status and rebuild a logistic regression model called logit_model_3. The main output is shown in Table 2.
Table 2.
Coefficients and statistics for model 3.
The output indicates that, with credit_status dropped, the data has overlap; we will see below that the overlap remains after a transformation, which, together with Section 4.1, validates Theorem 6.
We see variables current_address_length, num_credits and num_dependents are not significant at the 0.05 level. Since we are not focused on building a model, let us still keep these variables. Let us extract the coefficients and put them in a vector called model_coef_3 as follows:
> model_coef_3 <- data.frame(coef(logit_model_3))
> model_coef_3 <- as.matrix(model_coef_3)
Next, let us make a generalized linear transformation. We use the letter $D$ rather than $C$ to distinguish this case from that in Section 4.1. We randomly generate the matrix $D_{11}$ and the 7-dimensional row vector $D_1$ in (3) by calling the R function runif. Again, we denote $D_{11}$, shown in Table 3, and $D_1$ by D_11 and D_1 in R, respectively.
Table 3.
Matrix D_11.
> set.seed(2)
> D_11 <- matrix(runif(49),nrow = 7)
> det(D_11)
[1] 0.2851758
> set.seed(20)
> D_1 = runif(n = 7, min = 1, max = 20)
[1] 17.672906 15.602131 6.300300 11.054110 19.295234 19.626737
2.735319
Since the determinant of $D_{11}$ is nonzero, $D$ is nonsingular by (9); $D$ is shown in Table 4:
Table 4.
Matrix D.
We use the R function solve to find its inverse, called inv_D in R (see Table 5).
Table 5.
Inverse matrix of D.
Now we transform $X$ into $X^t = XD$ as in (6) and denote the transformed variables accordingly in R. Let us build a logistic regression model for the seven transformed variables and call it logit_model_4. The main output is shown in Table 6:
Table 6.
Coefficients and statistics of model 4.
Let us extract the coefficients into model_coef_4 to get more digits, as shown in Table 7:
Table 7.
Coefficients of model 4.
> model_coef_4 <- data.frame(coef(logit_model_4))
Let us find the product of inv_D and the vector model_coef_3 in R as follows:
> inv_D%*%model_coef_3
The result of the product is shown in Table 8 below.
Table 8.
Product of inverse matrix of D and coefficients of model 3.
This is exactly the same as model_coef_4. Next, we calculate the predicted values for all 1000 records using both models logit_model_3 and logit_model_4 by calling the R function predict, and then use the all.equal utility to check that the two predictions are nearly equal:
> model_3_predictions = predict(logit_model_3, german_credit_raw, type="response")
> model_4_predictions = predict(logit_model_4, german_credit_raw, type="response")
> all.equal(model_3_predictions, model_4_predictions, tolerance = 1e-13)
[1] "Mean relative difference: 0.0000000000005060054"
We see that the two predictions are identical up to rounding errors. Thus, we have validated Theorem 2.
Note that a nonlinear transformation, even a one-to-one correspondence, does not have the properties in Theorem 2, even for a single variable. For instance, let us define a one-to-one nonlinear correspondence age_6 of the variable age. Let us build a univariate logistic regression model called logit_model_5 for age and a univariate logistic regression model called logit_model_6 for age_6. Next, we apply these two models to predict the values for german_credit_raw.
> model_5_predictions = predict(logit_model_5, german_credit_raw, type="response")
> model_6_predictions = predict(logit_model_6, german_credit_raw, type="response")
> all.equal(model_5_predictions, model_6_predictions, scale=1)
[1] “Mean absolute difference: 0.008512868”
We see that the predictions from logit_model_5 are in general different from the predictions from logit_model_6.
4.3. Validation of Invariance of VIF
For the logistic regression model logit_model_3 in Section 4.2, we use the vif function in the car package of R to find the VIF for all 7 variables. The result is shown in Table 9.
Table 9.
VIF for model 3.
> car::vif(logit_model_3)
Next, we randomly generate multiple simple linear transformations as follows:
> set.seed(30)
> A = runif(n = 7)
> set.seed(40)
> B = runif(n = 7, min = 1, max = 10)
> german_credit_raw$duration_7 = A[1] * german_credit_raw$duration + B[1]
> german_credit_raw$credit_amount_7 = A[2] * german_credit_raw$credit_amount + B[2]
…
> german_credit_raw$num_dependents_7 = A[7] * german_credit_raw$num_dependents + B[7]
We build a logistic regression for the variables after multiple simple linear transformations and call it logit_model_7. We then find VIF as follows and display the result in Table 10
Table 10.
VIF for model 7.
> car::vif(logit_model_7)
Hence, we have validated Theorem 4. There is no need to validate Theorem 5 (the invariance of multicollinearity) as its analytical proof is straightforward.
5. Conclusions
In this paper, we first generalized a linear transformation for a single variable to multiple variables by matrix multiplication. We then studied various effects of a generalized linear transformation in logistic regression. We showed that an invertible generalized linear transformation has no effect on predictions, multicollinearity, quasi-complete separation, and complete separation. We also showed that multiple linear transformations do not affect the variance inflation factor (VIF). Numeric examples with real data were presented to validate our theoretical results.
Author Contributions
Writing—original draft, G.Z.; Writing—review & editing, S.T. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
http://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/, accessed on 6 November 2022.
Conflicts of Interest
The authors declare no conflict of interest.
References
1. Gelman, A.; Hill, J. Data Analysis Using Regression and Multilevel/Hierarchical Models; Cambridge University Press: Cambridge, UK, 2006.
2. Chatterjee, S.; Hadi, A.S. Regression Analysis by Example, 5th ed.; John Wiley & Sons: New York, NY, USA, 2013.
3. Adeyemo, A.; Wimmer, H.; Powell, L.M. Effects of normalization techniques on logistic regression in data science. J. Inf. Syst. Appl. Res. 2019, 12, 37–44.
4. Box, G.E.P.; Tidwell, P.W. Transformation of the Independent Variables. Technometrics 1962, 4, 531–550.
5. Whittemore, A.S. Transformations to Linearity in Binary Regression. SIAM J. Appl. Math. 1983, 43, 703–710.
6. Kay, R.; Little, S. Transformations of the explanatory variables in the logistic regression model for binary data. Biometrika 1987, 74, 495–501.
7. Feng, C.; Wang, H.; Lu, N.; Chen, T.; He, H.; Lu, Y.; Tu, X.M. Log-transformation and its implications for data analysis. Shanghai Arch. Psychiatry 2014, 26, 105–109.
8. Zhang, M.; Chen, S.; Rain, S.C. Evaluating Continuous Variable Transformations in Logistic Regression. In Proceedings of the Midwest SAS User Group Conference 2015, Omaha, NE, USA, 18–20 October 2015.
9. Lee, D.K. Data transformation: A focus on the interpretation. Korean J. Anesthesiol. 2020, 73, 503–508.
10. Morrell, C.H.; Pearson, J.D.; Brant, L.J. Linear Transformations of Linear Mixed-Effects Models. Am. Stat. 1997, 51, 338–343.
11. Zeng, G. Invariant Properties of Logistic Regression Model in Credit Scoring under Monotonic Transformations. Commun. Stat. Theory Methods 2017, 46, 8791–8807.
12. Zeng, G. On the analytical properties of category encodings in logistic regression. Commun. Stat. Theory Methods 2021, advance online publication.
13. Hosmer, D.W.; Lemeshow, S.; Sturdivant, R.X. Applied Logistic Regression, 3rd ed.; John Wiley & Sons, Inc.: New York, NY, USA, 2013.
14. Zeng, G. On the Existence of an Analytical Solution in Multiple Logistic Regression. Int. J. Appl. Math. Stat. 2021, 60, 53–67.
15. Refaat, M. Credit Risk Scorecards: Development and Implementation Using SAS; Lulu.com: Raleigh, NC, USA, 2011.
- Albert, A.; Anderson, J.A. On the existence of maximum likelihood estimates in logistic regression models. Biometrika 1984, 71, 1–10. [Google Scholar] [CrossRef]
- Zeng, G.; Zeng, E. On the Relationship between Multicollinearity and Separation in Logistic Regression. Commun. Stat. Simul. Comput. 2021, 50, 1989–1997. [Google Scholar] [CrossRef]
- Shen, L.; Gao, Y.; Xiao, J. Simulation of Hydrogen Production from Biomass Gasification in Interconnected Fluidized Beds. Biomass Bioenergy 2008, 32, 120–127. [Google Scholar] [CrossRef]
- Vatcheva, K.P.; Lee, M.; McCormick, J.B.; Rahbar, M.H. Multicollinearity in Regression Analyses Conducted in Epidemiologic Studies. Epidemiology 2016, 6, 227. [Google Scholar] [CrossRef] [PubMed]
- Cincotta, K. Multicollinearity in Zero Intercept Regression: They Are Not Who We Thought They Were. In Proceedings of the Society of Cost Estimating and Analysis (SCEA) Conference, Albuquerque, NM, USA, 6–10 June 2011. [Google Scholar]
- Dohoo, I.R.; Ducrot, C.; Fourichon, C.; Donald, A.; Hurnik, D. An overview of techniques for dealing with large numbers of independent variables in epidemiologic studies. Prev. Vet. Med. 1997, 29, 221–239. [Google Scholar] [CrossRef] [PubMed]
- Amini, M.; Roozbeh, M. Optimal partial ridge estimation in restricted semiparametric regression models. J. Multivar. Anal. 2015, 136, 26–40. [Google Scholar] [CrossRef]
- Roozbeh, M. Optimal QR-based estimation in partially linear regression models with correlated errors using GCV criterion. Comput. Stat. Data Anal. 2018, 117, 45–61. [Google Scholar] [CrossRef]
- Lichman, M. UCI Machine Learning Repository; University of California, School of Information and Computer Science: Irvine, CA, USA, 2013; Available online: http://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/ (accessed on 16 December 2022).
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).