Article

Bicluster Analysis of Heterogeneous Panel Data via M-Estimation

1 School of Management, University of Science and Technology of China, Hefei 230026, China
2 New Finance Research Center, International Institute of Finance, University of Science and Technology of China, Hefei 230026, China
* Author to whom correspondence should be addressed.
Mathematics 2023, 11(10), 2333; https://doi.org/10.3390/math11102333
Submission received: 15 April 2023 / Revised: 9 May 2023 / Accepted: 12 May 2023 / Published: 17 May 2023

Abstract

This paper investigates the latent block structure in the heterogeneous panel data model. It is assumed that the regression coefficients have group structures across individuals and structural breaks over time, where change points can cause changes to the group structures and structural breaks can vary between subgroups. To recover the latent block structure, we propose a robust biclustering approach that utilizes M-estimation and concave fused penalties. An algorithm based on local quadratic approximation is developed to optimize the objective function, which is more compact and efficient than the ADMM algorithm. Moreover, we establish the oracle property of the penalized M-estimators and prove that the proposed estimator recovers the latent block structure with a probability approaching one. Finally, simulation studies on multiple datasets demonstrate the good finite sample performance of the proposed estimators.

1. Introduction

Panel data models can fully utilize both cross-sectional and time-series information, making them a popular tool in fields such as economics and finance. Traditional panel data models often assume that the regression coefficients are homogeneous across individuals and over periods, which is too rigid an assumption. In many real-world applications, heterogeneity in individual and/or time dimensions is often observed. For example, in precision medicine research, different subgroups of patients may respond differently to treatments, while in economics, events such as the 2009 European debt crisis led to varying debt-to-GDP ratios among European countries. Although these heterogeneous factors are unobserved and latent, modeling them will bring significant improvement to data analysis.
Numerous estimation methods have been developed for panel data models with heterogeneous coefficients, addressing two main sources of heterogeneity: individual and period. To account for heterogeneity across individuals, a commonly used assumption is that individuals can be classified into subgroups with identical coefficients within the same subgroup but different coefficients across subgroups. Penalty-based methods have been frequently used to cluster coefficients in the individual direction. Su et al. [1] propose C-Lasso, a modified version of the Lasso, for subgroup identification and coefficient estimation; this method is based on a penalized objective function inspired by the fused Lasso of Tibshirani et al. [2]. Wang and Zhu [3] study high-dimensional panel data models using a concave fused penalty method for both subgroup identification and variable selection and prove the asymptotic properties of the proposed estimator under specific regularity conditions. To capture the heterogeneity that may exist over time, structural breaks are often assumed. Qian and Su [4] use a group fused Lasso method to estimate both the number of breaks and the model parameters simultaneously. Furthermore, Qian and Su [5] employ an adaptive group fused Lasso approach that applies the shrinkage method to PLS and PGMM estimation; their method can consistently determine the number of breaks and estimate the break dates with a probability approaching one.
However, these studies have two primary limitations. Firstly, they only account for heterogeneity in one dimension, which may be insufficient for modeling the complex data prevalent in the era of big data. Secondly, their objective functions employ the least squares loss, which can result in substantial estimation bias when the data distribution is heavy-tailed or contains outliers. Consequently, further research has been conducted to address these issues. Some researchers have focused on the two-dimensional heterogeneous panel data model, where the coefficients exhibit both a subgroup structure and structural breaks. Okui and Wang [6] allow the number, timing, and size of structural breaks to vary across different subgroups and employ the K-means method and the adaptive group fused Lasso method to identify the individual group structure and the structural breaks, respectively. Lumsdaine et al. [7] study cases where the group structure of the coefficients changes after an unknown structural break and develop a novel iterative algorithm to estimate the coefficients and recover the unknown structure. Additionally, some researchers have focused on robust estimation. For example, Zhang et al. [8] set the objective function as the sum of an L1 loss and a concave pairwise fused penalty when studying the panel data model with an individual group structure and provide an easy-to-implement algorithm based on local linear approximation [9] to find local minima. Cheng et al. [10] further generalize the L1 loss to a general loss function under the framework of M-estimation.
In this paper, we generalize the coefficient structure studied by [6,7] to a more general block structure. The regression coefficients with a block structure exhibit both an individual-group structure and temporal–structural breaks, where the individual-group structure can change at change points, and the temporal–structural breaks can vary across different groups. Furthermore, the regression coefficients are identical within the same sub-block, while they exhibit heterogeneity across different sub-blocks. This block structure is highly flexible and more general compared to the structures studied previously. Additionally, the homogeneous panel data model, the panel data model with a group structure, and the panel data model with structural breaks can all be viewed as special cases of the model being investigated in this study.
We propose a robust biclustering method based on M-estimation and double concave fused penalties for simultaneously recovering the unknown block structure and estimating the regression coefficients. The M-estimator is robust to heavy-tailed distributions and outliers, while the double concave fused penalty automatically identifies the potential block structure. We develop an effective algorithm utilizing local quadratic approximation to optimize the objective function, which is computationally more efficient than the Alternating Direction Method of Multipliers (ADMM) [11] algorithm. Moreover, we establish the asymptotic convergence property of the oracle estimator and prove that the proposed estimator recovers the latent block structure with a probability approaching one. Simulation experiments on multiple datasets demonstrate that the proposed estimator performs well in finite samples; in addition, models based on the L1 and Huber loss functions achieve more accurate results than those based on the L2 loss function in the presence of heavy-tailed data distributions.

2. Materials and Methods

2.1. Model Setting

Given panel data observations {(x_it, y_it), i = 1, ..., N; t = 1, ..., T}, this paper explores a linear panel data model that allows the intercept and slope coefficients to be heterogeneous across both the individual and time dimensions. The model is represented as follows,

y_{it} = \mu_{it} + x_{it}^{\top}\eta_{it} + \epsilon_{it}, \quad i = 1,\ldots,N;\; t = 1,\ldots,T,

where y_it ∈ R is the response variable and x_it ∈ R^{P-1} is the explanatory variable. The intercept term μ_it and the slope coefficient η_it ∈ R^{P-1} can vary across both the individual and time dimensions. The random errors ε_it are assumed to be independent and identically distributed with mean 0 and standard deviation σ, and their distribution is denoted by f.
We assume that the observed data are collected from an unknown, L 0 different blocks, where the regression coefficients are homogeneous within the same block but heterogeneous across different blocks. It is worth noting that the group structure and structural breaks can be viewed as special cases of the block structure. Below, we provide the relevant notation to describe this block structure.
Let β_it = (μ_it, η_it^⊤)^⊤ and z_it = (1, x_it^⊤)^⊤. Then, Equation (1) can be expressed as follows,

y_{it} = z_{it}^{\top}\beta_{it} + \epsilon_{it}, \quad i = 1,\ldots,N;\; t = 1,\ldots,T.

We denote the block structure as B = {B_1, ..., B_{L_0}}, where B_k is the index set of the samples belonging to the kth sub-block; if the ith individual's observation at time t belongs to the kth sub-block, then (i, t) ∈ B_k. Let α_1^0, ..., α_{L_0}^0 denote the true regression coefficients corresponding to the L_0 sub-blocks, and let β_it^0 be the true value of β_it. Then, we have

\beta_{it}^{0} = \begin{cases} \alpha_1^{0}, & \text{if } (i,t)\in B_1,\\ \alpha_2^{0}, & \text{if } (i,t)\in B_2,\\ \;\vdots & \\ \alpha_{L_0}^{0}, & \text{if } (i,t)\in B_{L_0}. \end{cases}
In practical scenarios, the real block structure is often unknown. To recover the block structure described above, it is necessary to estimate the number of sub-blocks, the index sets of each block, and the block-specific coefficients.
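To make the block notation concrete, the following sketch (Python; the sizes, labels, and change point are purely illustrative and are not taken from the paper) builds a block-structured coefficient array from a membership map, in the spirit of the display above.

```python
import numpy as np

# Illustrative sizes and coefficients (borrowing the two-block values of Example 1).
N, T, P = 4, 6, 2
alpha = {1: np.array([2.0, 3.0]),   # coefficients of sub-block B_1
         2: np.array([2.0, 5.0])}   # coefficients of sub-block B_2

# Membership map: block_of[i, t] is the sub-block label of observation (i, t).
# Here, hypothetically, individuals 3-4 switch from B_1 to B_2 after period 3.
block_of = np.ones((N, T), dtype=int)
block_of[2:, 3:] = 2

# beta0[i, t] = alpha_l whenever (i, t) belongs to B_l.
beta0 = np.stack([[alpha[block_of[i, t]] for t in range(T)] for i in range(N)])
print(beta0.shape)   # (4, 6, 2): one P-vector of coefficients per observation
```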

2.2. Proposed Estimator

In this subsection, we propose a biclustering estimation method that automatically recovers the block structure without specifying the number of blocks and provides robust estimates of the regression coefficients via M-estimation and concave fused penalties. Let β = (β_11^⊤, β_12^⊤, ..., β_1T^⊤, ..., β_N1^⊤, β_N2^⊤, ..., β_NT^⊤)^⊤ denote the coefficients to be estimated. To recover the block structure under the fused sparsity assumption, a natural idea is to shrink the coefficient difference ||β_it − β_jt|| between two samples (i, t) and (j, t) that belong to the same block B_l to zero. We therefore consider the following objective function based on M-estimation and concave fused penalties,

Q(\beta;\lambda,\gamma) = \sum_{i=1}^{N}\sum_{t=1}^{T}\rho\big(y_{it} - z_{it}^{\top}\beta_{it}\big) + \sum_{t=1}^{T}\sum_{i<j}P_{\lambda}\big(\|\beta_{it}-\beta_{jt}\|\big) + \sum_{i=1}^{N}\sum_{t<t'}P_{\gamma}\big(\|\beta_{it}-\beta_{it'}\|\big),

where the first term on the right-hand side uses a loss function ρ from the M-estimation literature. It satisfies several conditions: it is a continuous convex function on R, differentiable almost everywhere except for a finite set of points, has a unique global minimum at 0, and ρ(0) = 0. Commonly used loss functions such as the least squares (L2), absolute deviation (L1), and Huber losses all satisfy these conditions. The second and third terms are two fused penalty terms designed to identify the individual-group structure and the temporal structural breaks, respectively.
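For reference, here is a minimal sketch of the three loss functions named above and their derivatives φ = ρ', which reappear as weights in the local quadratic approximation of Section 2.3. These are the textbook definitions, not code from the paper; the Huber threshold 1.345 is the conventional default also used in the simulations of Section 3.

```python
import numpy as np

# Least squares (L2), absolute deviation (L1), and Huber losses: all are convex,
# continuous, minimized at 0 with rho(0) = 0, as required in the text.
def rho_l2(r):
    return r ** 2

def rho_l1(r):
    return np.abs(r)

def rho_huber(r, delta=1.345):
    a = np.abs(r)
    return np.where(a <= delta, 0.5 * r ** 2, delta * (a - 0.5 * delta))

# Their derivatives phi = rho'.
def phi_l2(r):
    return 2.0 * r

def phi_l1(r):
    return np.sign(r)

def phi_huber(r, delta=1.345):
    return np.clip(r, -delta, delta)
```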
Commonly used penalty terms include the Lasso [12], Ridge [13], and Elastic net [14]. Ridge and Elastic net shrink the fusion terms toward small values, but not exactly to zero, which makes it difficult to cluster coefficients within the same block. The Lasso, on the other hand, can shrink the fusion terms to exactly zero but tends to introduce bias into the estimated coefficients; this bias can affect the accuracy of the parameter estimates and lead to suboptimal recovery of the latent block structure. To achieve both automatic recovery of the parameter structure and unbiased or nearly unbiased estimation of the coefficients, concave fused penalty functions have been proposed, such as the smoothly clipped absolute deviation (SCAD) [15] and the minimax concave penalty (MCP) [16]. In this paper, following the approach of Wang and Zhu [3], Ma and Huang [17], and Wang et al. [18], we use the SCAD penalty function with a tuning parameter λ,

P_{\lambda}(k) = \lambda\int_{0}^{k}\min\big\{1,\,(a - x/\lambda)_{+}/(a-1)\big\}\,dx,

and the MCP penalty function with a tuning parameter γ,

P_{\gamma}(k) = \gamma\int_{0}^{k}\big(1 - x/(a\gamma)\big)_{+}\,dx,

where the fixed parameter a controls the concavity of the penalty function, and k represents a pairwise difference in regression coefficients between individuals or periods. P_λ(k) and P_γ(k) can compress some of the pairwise differences ||β_it − β_jt|| and ||β_it − β_it'|| to exactly zero, thereby recovering the block structure of the regression coefficients.
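For later use in the algorithm, the derivatives of the two penalties can be written down explicitly. The sketch below assumes the standard SCAD and MCP derivative formulas; the default concavity values a = 3.7 and a = 3 are common conventions and are only placeholders here, since the paper tunes a by grid search.

```python
import numpy as np

def scad_deriv(k, lam, a=3.7):
    """P'_lambda(k) for the SCAD penalty, k >= 0: equals lam on [0, lam],
    decays linearly on (lam, a*lam], and is 0 beyond a*lam."""
    k = np.asarray(k, dtype=float)
    return lam * np.minimum(1.0, np.maximum(a - k / lam, 0.0) / (a - 1.0))

def mcp_deriv(k, gam, a=3.0):
    """P'_gamma(k) for the MCP penalty, k >= 0: decays linearly from gam to 0 at a*gam."""
    k = np.asarray(k, dtype=float)
    return np.maximum(gam - k / a, 0.0)
```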
For given λ and γ, we define the proposed estimator as

\hat{\beta}(\lambda,\gamma) = \arg\min_{\beta\in\mathbb{R}^{NTP}} Q(\beta;\lambda,\gamma).

In the following, we abbreviate β̂(λ, γ) as β̂ when there is no ambiguity. Although the penalty terms in Equation (4) are concave and global minimizers are difficult to obtain, local minimizers can be obtained through iterative algorithms. Ma and Huang [17] and Wang et al. [18] apply the ADMM algorithm to solve the penalized optimization problem by transforming it into a Lagrangian-constrained optimization problem. However, that algorithm is computationally intensive and involves cumbersome steps. In this paper, we develop a novel algorithm based on local quadratic approximation [15] to solve Equation (7). In the next section, we provide a detailed derivation.

2.3. Proposed Algorithm

We propose a local quadratic approximation (LQA)-based algorithm to solve Equation (7). The local quadratic approximation was introduced by Fan and Li [15]. Specifically, given a non-zero value x_0 as the initial value, the penalty function P_λ(|x|) can be approximated by a first-order Taylor expansion in x²,

P_{\lambda}(|x|) = P_{\lambda}\big((x^{2})^{1/2}\big) \approx P_{\lambda}(|x_{0}|) + \frac{P_{\lambda}'(|x_{0}|)}{2|x_{0}|}\,\big(x^{2}-x_{0}^{2}\big).
Specifically, let β^{(k−1)} denote the estimate of β obtained after the (k−1)th iteration. At the kth iteration, we locally approximate ρ(·), P_λ(·), and P_γ(·) around β^{(k−1)}, which yields

Q^{(k)}(\beta;\gamma,\lambda) = \sum_{i=1}^{N}\sum_{t=1}^{T}\frac{\phi\big(|y_{it}-z_{it}^{\top}\beta_{it}^{(k-1)}|\big)}{2\,|y_{it}-z_{it}^{\top}\beta_{it}^{(k-1)}|}\,\big(y_{it}-z_{it}^{\top}\beta_{it}\big)^{2} + \sum_{t=1}^{T}\sum_{i<j}\frac{P_{\lambda}'\big(\|\beta_{it}^{(k-1)}-\beta_{jt}^{(k-1)}\|\big)}{2\,\|\beta_{it}^{(k-1)}-\beta_{jt}^{(k-1)}\|}\,\|\beta_{it}-\beta_{jt}\|^{2} + \sum_{i=1}^{N}\sum_{t<t'}\frac{P_{\gamma}'\big(\|\beta_{it}^{(k-1)}-\beta_{it'}^{(k-1)}\|\big)}{2\,\|\beta_{it}^{(k-1)}-\beta_{it'}^{(k-1)}\|}\,\|\beta_{it}-\beta_{it'}\|^{2} + C,

where ϕ(·), P_λ'(·), and P_γ'(·) are the derivatives of ρ(·), P_λ(·), and P_γ(·), respectively, and C depends only on β^{(k−1)} and can be treated as a constant when solving for the kth estimate. By minimizing Q^{(k)}(β; γ, λ), we obtain β^{(k)}.
Equation (9) has an explicit minimizer. To simplify Equation (9), we first define

Y = (y_{11},\ldots,y_{1T},\ldots,y_{N1},\ldots,y_{NT})^{\top},
Z_{i} = \mathrm{diag}\big(z_{i1}^{\top},\ldots,z_{iT}^{\top}\big),\qquad Z = \mathrm{diag}\big(Z_{1},\ldots,Z_{N}\big),
\Delta_{1} = \mathrm{diag}\!\Big(\sqrt{\phi\big(|Y - Z\beta^{(k-1)}|\big)\big/|Y - Z\beta^{(k-1)}|}\Big),

where the square root √·, the function ϕ(·), the division ·/·, and the absolute value |·| are applied element-wise when acting on vectors.
Let e_i^{(c)} be the N-dimensional vector with 1 in the ith entry and 0 elsewhere, and let e_t^{(r)} be the T-dimensional vector with 1 in the tth entry and 0 elsewhere. We define

\delta_{i,j}^{(c)} = I_{T\times T}\otimes\big(e_{i}^{(c)}-e_{j}^{(c)}\big)^{\top}\otimes I_{P\times P},\qquad \delta_{t,t'}^{(r)} = I_{N\times N}\otimes\big(e_{t}^{(r)}-e_{t'}^{(r)}\big)^{\top}\otimes I_{P\times P},

where I_{T×T}, I_{P×P}, and I_{N×N} are identity matrices and ⊗ denotes the Kronecker product. We concatenate δ_{i,j}^{(c)} for i < j and δ_{t,t'}^{(r)} for t < t' to obtain

\delta_{c} = \big(\delta_{1,2}^{(c)\top},\ldots,\delta_{N-1,N}^{(c)\top}\big)^{\top},\qquad \delta_{r} = \big(\delta_{1,2}^{(r)\top},\ldots,\delta_{T-1,T}^{(r)\top}\big)^{\top}.

Let U = δ_c β^{(k−1)} and V = δ_r β^{(k−1)}. U and V are both NTP-dimensional vectors that can be expressed as U = (u_{11}^⊤, ..., u_{NT}^⊤)^⊤ and V = (v_{11}^⊤, ..., v_{NT}^⊤)^⊤, respectively, where u_it and v_it are P-dimensional vectors. Let ū_it and v̄_it be the L2 norms of u_it and v_it, respectively, and let Ū = (ū_11, ..., ū_NT)^⊤ and V̄ = (v̄_11, ..., v̄_NT)^⊤.
Thus, Equation (9) can be written as

Q^{(k)}(\beta;\gamma,\lambda) = \tfrac{1}{2}(Y - Z\beta)^{\top}\Delta_{1}^{\top}\Delta_{1}(Y - Z\beta) + \tfrac{1}{2}\beta^{\top}\Delta_{2}^{\top}\Delta_{2}\beta + \tfrac{1}{2}\beta^{\top}\Delta_{3}^{\top}\Delta_{3}\beta + C,

where \Delta_{2} = \big(I_{P\times P}\otimes\mathrm{diag}\big(\sqrt{P_{\lambda}'(\bar U)/\bar U}\big)\big)\,\delta_{c} and \Delta_{3} = \big(I_{P\times P}\otimes\mathrm{diag}\big(\sqrt{P_{\gamma}'(\bar V)/\bar V}\big)\big)\,\delta_{r}.
By minimizing the above expression, we obtain the update formula for the kth step,

\beta^{(k)} = \big(Z^{\top}\Delta_{1}^{\top}\Delta_{1}Z + \Delta_{2}^{\top}\Delta_{2} + \Delta_{3}^{\top}\Delta_{3}\big)^{-1} Z^{\top}\Delta_{1}^{\top}\Delta_{1} Y.
We repeat this iterative process until the norm of the difference between β^{(k)} and β^{(k−1)} is smaller than a given threshold δ (set to 10^{-5} in our experiments), at which point the algorithm terminates. As noted by Hunter and Li [19], this algorithm belongs to the class of MM algorithms, and its convergence is guaranteed.
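The whole update can be sketched compactly in code. The following is a minimal, dense-matrix illustration of the iteration above (suitable only for small N and T, and not the authors' implementation): `phi`, `pen_c`, and `pen_r` are arbitrary callables for the loss derivative and the two penalty derivatives, for instance the `phi_huber`, `scad_deriv`, and `mcp_deriv` helpers sketched earlier.

```python
import numpy as np

def build_design(x):
    """Stack z_it = (1, x_it')' into the block-diagonal design Z of shape (NT, NTP)."""
    N, T, p1 = x.shape
    P = p1 + 1
    Z = np.zeros((N * T, N * T * P))
    for k, (i, t) in enumerate((i, t) for i in range(N) for t in range(T)):
        Z[k, k * P:(k + 1) * P] = np.concatenate(([1.0], x[i, t]))
    return Z

def pairwise_ops(N, T, P):
    """Stacked difference operators: same-period pairs (delta_c) and same-individual pairs (delta_r)."""
    idx = lambda i, t: (i * T + t) * P
    def diff_row(a, b):
        D = np.zeros((P, N * T * P))
        D[:, a:a + P] = np.eye(P)
        D[:, b:b + P] = -np.eye(P)
        return D
    dc = [diff_row(idx(i, t), idx(j, t)) for t in range(T)
          for i in range(N) for j in range(i + 1, N)]
    dr = [diff_row(idx(i, t), idx(i, s)) for i in range(N)
          for t in range(T) for s in range(t + 1, T)]
    return np.vstack(dc), np.vstack(dr)

def lqa_step(Y, Z, dc, dr, beta, phi, pen_c, pen_r, P, eps=1e-8):
    """One local-quadratic-approximation update: a weighted ridge-type solve around beta."""
    r = np.abs(Y - Z @ beta) + eps
    w1 = phi(r) / r                                   # loss weights
    def pen_weights(D, pen):
        d = (D @ beta).reshape(-1, P)
        nrm = np.linalg.norm(d, axis=1) + eps
        return np.repeat(pen(nrm) / nrm, P)           # one weight per difference block
    w2, w3 = pen_weights(dc, pen_c), pen_weights(dr, pen_r)
    A = (Z.T @ (w1[:, None] * Z) + dc.T @ (w2[:, None] * dc)
         + dr.T @ (w3[:, None] * dr) + eps * np.eye(Z.shape[1]))   # small jitter for stability
    return np.linalg.solve(A, Z.T @ (w1 * Y))
```

Iterating `lqa_step` from a suitable initial value, and stopping once successive iterates differ by less than 10^{-5} or after 50 iterations (as in Section 3), reproduces the scheme described above.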
To perform the iterative process, it is necessary to specify the initial value of the regression coefficients. An appropriate initial value can reduce the number of iterations and computation time. Following the approach of Wang et al. [18], we use ridge regression to obtain the initial values. The specific formula is as follows,
\beta^{(0)} = \arg\min_{\beta\in\mathbb{R}^{NTP}}\Big\{\sum_{i=1}^{N}\sum_{t=1}^{T}\big(y_{it}-z_{it}^{\top}\beta_{it}\big)^{2} + \lambda\sum_{t=1}^{T}\sum_{i<j}\|\beta_{it}-\beta_{jt}\|^{2} + \gamma\sum_{i=1}^{N}\sum_{t<t'}\|\beta_{it}-\beta_{it'}\|^{2}\Big\} = \arg\min_{\beta\in\mathbb{R}^{NTP}}\|Y-Z\beta\|^{2} + \lambda\|\delta_{c}\beta\|^{2} + \gamma\|\delta_{r}\beta\|^{2} = \big(Z^{\top}Z + \lambda\delta_{c}^{\top}\delta_{c} + \gamma\delta_{r}^{\top}\delta_{r}\big)^{-1}Z^{\top}Y.

Here, λ and γ are tuning parameters for the initialization, which are set to 10^{-3} in all subsequent experiments.
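Under the same dense-matrix conventions as the sketch above, this ridge-fused initialization is a one-line closed-form solve (an illustration only; lam0 and gam0 are the small initialization constants, not the main tuning parameters).

```python
import numpy as np

def ridge_init(Y, Z, dc, dr, lam0=1e-3, gam0=1e-3):
    """Closed-form ridge-fused initial value: (Z'Z + lam0*dc'dc + gam0*dr'dr)^{-1} Z'Y."""
    A = Z.T @ Z + lam0 * (dc.T @ dc) + gam0 * (dr.T @ dr)
    return np.linalg.solve(A, Z.T @ Y)
```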
To select the optimal tuning parameters in the objective function, we use the modified Bayesian information criterion (mBIC) [20], which has been widely used for hyperparameter selection in heterogeneous structure recovery [8,10,21,22]. Notably, Cheng et al. [10] have demonstrated the selection consistency of the mBIC for subgroup identification under the M-estimation framework. In this paper, the mBIC is defined as

\mathrm{mBIC}(\lambda,\gamma) = \log\Big(\frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T}\rho\big(y_{it}-z_{it}^{\top}\hat\beta_{it}\big)\Big) + c\,\frac{\log\log(NT)}{NT}\,\log(NTP)\,\widehat{L(\lambda,\gamma)}\,P,
where \widehat{L(\lambda,\gamma)} is an estimate of the number of sub-blocks, and c is a constant. In this paper, when ρ is the L2 loss, we follow the setting of Ma and Huang [17] and set c = 10; when ρ is the L1 loss, we follow the setting of Zhang et al. [8] and set c = 5; when ρ is the Huber loss, we also set c = 5. To search for the optimal values of the parameters λ, γ, and the concavity parameter a in the penalty functions, we use a grid search over the ranges [λ_min, λ_max], [γ_min, γ_max], and [a_min, a_max], respectively, with a given step size. We calculate the mBIC value for each combination of λ, γ, and a, and the combination that minimizes the mBIC is selected as the optimal tuning parameters, which are then used to obtain the final estimation result.
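The tuning-parameter search can be organized as in the following sketch. It assumes a hypothetical wrapper `fit_fn(lam, gam, a)` that runs the penalized M-estimation for one parameter combination and returns the fitted coefficients, the estimated number of sub-blocks, and the residuals; the grid defaults follow the simulation settings of Section 3.

```python
import numpy as np
from itertools import product

def mbic(resid, rho, L_hat, P, NT, c=5.0):
    """Modified BIC for one fit: log of the average loss plus the model-size penalty."""
    return (np.log(np.mean(rho(resid)))
            + c * np.log(np.log(NT)) / NT * np.log(NT * P) * L_hat * P)

def grid_search(fit_fn, rho, P, NT,
                lam_grid=np.arange(0.1, 1.51, 0.2),
                gam_grid=np.arange(0.1, 1.51, 0.2),
                a_grid=np.arange(2, 11, 2), c=5.0):
    """Return (best_mbic, lam, gam, a) over the grid."""
    best = None
    for lam, gam, a in product(lam_grid, gam_grid, a_grid):
        beta_hat, L_hat, resid = fit_fn(lam, gam, a)   # hypothetical fitting wrapper
        crit = mbic(resid, rho, L_hat, P, NT, c)
        if best is None or crit < best[0]:
            best = (crit, lam, gam, a)
    return best
```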

2.4. Asymptotic Properties

We first investigate the properties of the oracle estimator. If the underlying block structure B = {B_l : l = 1, ..., L_0} is known, the oracle estimator is defined by

\tilde{\alpha} = \arg\min_{\alpha\in\mathbb{R}^{PL_0}}\Big\{\sum_{l=1}^{L_0}\sum_{(i,t)\in B_l}\rho\big(y_{it}-z_{it}^{\top}\alpha_{l}\big)\Big\},

where \tilde\alpha = (\tilde\alpha_1^{\top},\ldots,\tilde\alpha_{L_0}^{\top})^{\top}. The oracle estimator is unavailable in practice because it assumes knowledge of the true block structure, but it plays a significant role in the theoretical analysis.
First, we define some notation. Let z_{it,p} denote the pth element of z_it. λ_min(·) and λ_max(·) denote the minimum and maximum eigenvalues of a matrix, respectively. Let ϕ denote the derivative of ρ, let ϕ' and ϕ'' denote the first and second derivatives of ϕ, respectively, and let f' denote the first derivative of the distribution f of ε_it. S_P = {d ∈ R^P : ||d||_2 = 1} denotes the unit sphere in R^P.
Since ρ may have non-differentiable points, its derivative ϕ may have discontinuities. Therefore, we classify ϕ according to its properties. Following He and Shao [23], we distinguish two categories: smooth functions and jump functions. When ϕ is Lipschitz continuous on its domain, we call it a smooth function; when ϕ has a finite number of jump points but is Lipschitz continuous on the intervals between adjacent jump points, we call it a jump function. Clearly, the L2 loss and the Huber loss yield smooth functions, while the L1 loss yields a jump function.
We introduce the following assumptions.
(A1). There exists a constant M_1 such that

|x_{it,p}| \le M_{1}, \quad 1\le i\le N,\; 1\le t\le T,\; 1\le p\le P,

and there exist two positive constants C_1 and C_2 such that

C_{1} \le \lambda_{\min}\Big(\tfrac{1}{NT}Z^{\top}Z\Big) \le \lambda_{\max}\Big(\tfrac{1}{NT}Z^{\top}Z\Big) \le C_{2}.

(A2). L_0 P = O((NT)^{c_1}) for some 0 < c_1 < 1/3.
(A3). When the loss function is a smooth function, ϕ' and ϕ'' are bounded and c_0 = Eϕ'(ε_it) ∈ (0, ∞); when the loss function is a jump function, ϕ, f, and f' are bounded and c_0 = ∫ϕ(r) f'(r) dr ∈ (0, ∞).
(A4). \sup_{d_1,d_2\in S_P}\sum_{i=1}^{N}\sum_{t=1}^{T}|z_{it}^{\top}d_1|^{2}\,|z_{it}^{\top}d_2|^{2} = O(NT).
Remark 1.
Assumption (A1) is a regularity condition on the design matrix: the minimum and maximum eigenvalues of (NT)^{-1}Z^{\top}Z are bounded by constants, which is a common assumption in heterogeneous panel data analysis based on concave fused penalties; see Ma and Huang [17], Ma et al. [22], Wang and Zhu [3], among others. Assumption (A2) allows the true coefficient dimension L_0 P to increase with the sample size NT, but at a slower rate. Assumption (A3) bounds the loss function and the error distribution for the two types of ϕ; it is used by He and Shao [23] to prove the asymptotic normality of M-estimators in linear regression models. For loss functions commonly used in M-estimation (such as L1, L2, and Huber) and commonly used error distributions (such as the normal and t-distributions), assumption (A3) is easily verified. Assumption (A4) further restricts the design matrix; if z_it is a random sample from a P-variate distribution and E(|d^{\top}z_{it}|^4) is uniformly bounded over d ∈ S_P, then assumption (A4) holds.
Under these assumptions, we can obtain the consistency properties of the oracle estimator.
Theorem 1.
Under assumptions (A1)–(A4), we have

\|\tilde\alpha - \alpha^{0}\| = O_{p}\Big(\sqrt{\tfrac{L_{0}P}{NT}}\Big),\qquad \|\tilde\beta - \beta^{0}\| = O_{p}\Big(\sqrt{\tfrac{L_{0}P\,|B_{\max}|}{NT}}\Big),
\sup_{i,t}\|\tilde\beta_{it} - \beta_{it}^{0}\| = O_{p}\Big(\sqrt{\tfrac{L_{0}P}{NT}}\Big),

where |B_max| denotes the sample size of the largest sub-block.
We can also provide the asymptotic normality theory for the oracle estimator (the proof of Theorem 1 is given in Appendix A.1).
Theorem 2.
Under assumptions (A1)–(A4), we have

\sqrt{NT}\; d^{\top}(\tilde\alpha - \alpha^{0})\big/\sigma(d) \;\xrightarrow{\;d\;}\; N(0,1),

where \sigma^{2}(d) = c_{0}^{-2}\,E\phi^{2}(\epsilon_{it})\; d^{\top}\big((NT)^{-1}Z^{\top}Z\big)^{-1}d, and d is a PL_0 × 1 vector such that ||d|| = 1 (the proof of Theorem 2 is given in Appendix A.2).
Let b denote the minimum difference between the coefficients of any two sub-blocks, i.e., b = \min_{l\ne l'}\|\alpha_{l}^{0}-\alpha_{l'}^{0}\|. Let |B_min| denote the sample size of the smallest sub-block. Let p_λ(s) = λ^{-1}P_λ(s) and p_γ(s) = γ^{-1}P_γ(s) denote the standardized penalty functions, and let p_λ'(s) and p_γ'(s) be their derivatives. To derive the asymptotic properties of the proposed estimator, additional assumptions are required.
(A5). For any c_1 < c_2 ≤ 1, there exists M_2 > 0 such that

(NT)^{(1-c_{2})/2}\, b \ge M_{2},

where c_1 is defined in Assumption (A2).
(A6). There exist two positive constants c_3 and c_4 such that, for any ε_it and any c ∈ [−c_3, c_3], we have

P\big(|\phi(\epsilon_{it} + c)| > x\big) \le 2\exp(-c_{4}x^{2}).

(A7). The normalized penalty functions p_λ(s) and p_γ(s) are symmetric, non-decreasing, and concave on [0, ∞), with p_λ(0) = p_γ(0) = 0. There exists a constant a > 0 such that p_λ(s) is constant when s ≥ aλ and p_γ(s) is constant when s ≥ aγ. The derivatives p_λ'(s) and p_γ'(s) are continuous except at a finite number of points, and p_λ'(0+) = p_γ'(0+) = 1.
Remark 2.
Assumption (A5) bounds from below the minimum difference in regression coefficients between different sub-blocks, which is essential for the separability of the coefficients. Assumption (A6) further restricts the error term and is relatively mild for M-estimators. Specifically, when ρ is the L2 loss, ϕ(ε_it) = 2ε_it, and assumption (A6) is equivalent to requiring that the error term ε_it has sub-Gaussian tails, a common assumption in high-dimensional statistics. When ρ is the L1 or Huber loss, assumption (A6) holds trivially because ϕ is bounded by a constant. Assumption (A7) restricts the concave penalty functions and is easily verified for SCAD and MCP, where the positive constant a controls the concavity of the penalty. Note that when s ≥ aλ, p_λ(s) is constant; this means that when (i, t) and (j, t) belong to different sub-blocks, the fused penalty term P_λ(||β_it − β_jt||) tends to a constant, i.e., it does not compress the coefficient differences across different sub-blocks.
Theorem 3.
Under assumptions (A1)–(A7) and the conditions \max(\lambda,\gamma) = o\big((NT)^{-(1-c_{2})/2}\big) and \sqrt{P}\log(NT)\big/\big(\min(\lambda,\gamma)\,|B_{\min}|\big) = o(1), the oracle estimator is a local minimizer of the objective function with probability tending to one; i.e., as both N → ∞ and T → ∞, we have

\lim_{N,T\to\infty} P\big(\hat\beta = \tilde\beta\big) = 1.
Under the conditions of Theorems 2 and 3, we can obtain the following corollary (Proof of Theorem 3 in Appendix A.3).
Corollary 1.
\sqrt{NT}\; d^{\top}(\hat\alpha - \alpha^{0})\big/\sigma(d) \;\xrightarrow{\;d\;}\; N(0,1),

where \sigma^{2}(d) = c_{0}^{-2}\,E\phi^{2}(\epsilon_{it})\; d^{\top}\big((NT)^{-1}Z^{\top}Z\big)^{-1}d, and d is a PL_0 × 1 vector such that ||d|| = 1.
In practice, the distribution of ε_it is unknown, and E\phi^{2}(\epsilon_{it}) can be estimated by

\widehat{E\phi^{2}(\epsilon_{it})} = (NT - \hat{L}P)^{-1}\sum_{i=1}^{N}\sum_{t=1}^{T}\phi\big(y_{it} - z_{it}^{\top}\hat\beta_{it}\big)^{2}.

When ϕ(·) is a smooth function, c_0 can be estimated by

\hat{c}_{0} = (NT)^{-1}\sum_{i=1}^{N}\sum_{t=1}^{T}\phi'\big(y_{it} - z_{it}^{\top}\hat\beta_{it}\big).
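In code, these plug-in quantities are straightforward. The sketch below assumes the loss derivative `phi` and its derivative `phi_prime` are supplied as callables (for the Huber loss, for example, `phi_prime(r)` is the indicator of |r| ≤ δ).

```python
import numpy as np

def variance_components(resid, phi, phi_prime, L_hat, P):
    """Plug-in estimates of E[phi^2(eps)] and c_0 from fitted residuals (smooth-loss case)."""
    NT = resid.shape[0]
    e_phi2_hat = np.sum(phi(resid) ** 2) / (NT - L_hat * P)
    c0_hat = np.mean(phi_prime(resid))
    return e_phi2_hat, c0_hat
```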

3. Simulation

3.1. Simulation Setting

This section presents artificially constructed examples to investigate the finite-sample performance of the proposed estimator. The simulated data are independently generated from the model

y_{it} = \mu_{it} + x_{it}^{\top}\eta_{it} + \epsilon_{it},\quad i = 1,\ldots,N;\; t = 1,\ldots,T.

We consider different combinations of the number of individuals N, the number of time periods T, and the dimension of the regression coefficients P. For each combination of (N, T), we change the distribution of ε_it from normal to heavy-tailed and compare the estimation results under different loss functions, including the L1, Huber, and L2 losses.
We use the MCP penalty function in the experiments; SCAD yields similar results and is therefore not reported. The maximum number of iterations is set to 50, and the convergence threshold δ is set to 10^{-5}: the algorithm terminates when the number of iterations exceeds 50 or the update of the coefficients is smaller than 10^{-5}. Each group of experiments is repeated R = 100 times. In each experiment, we perform a grid search to select the optimal tuning parameters λ, γ, and a by comparing mBIC values; the range of λ and γ is [0.1, 1.5] with a step size of 0.2, and the range of a is [2, 10] with a step size of 2. We use the following metrics to assess the error of the regression coefficient estimates and the accuracy of the block structure recovery.
  • RMSE: root mean square error between the estimated parameter β̂ and the true parameter β^0,
    \frac{1}{R}\sum_{r=1}^{R}\sqrt{\frac{1}{NTP}\,\|\hat\beta^{r}-\beta^{0}\|^{2}}.
  • Bias: average absolute bias between the estimated parameter β̂ and the true parameter β^0,
    \frac{1}{R}\sum_{r=1}^{R}\frac{1}{NTP}\sum_{i=1}^{N}\sum_{t=1}^{T}\sum_{p=1}^{P}\big|\hat\beta^{r}_{it,p}-\beta^{0}_{it,p}\big|.
  • Per: the percentage of replications in which the estimated number of blocks equals the true number of blocks,
    \frac{1}{R}\sum_{r=1}^{R} I\big(\hat{L}^{r} = L_{0}\big).
  • ERI: the Rand Index (RI) is used to evaluate clustering accuracy; it ranges between 0 and 1, with higher values indicating better performance. Motivated by the form of the RI, we compute period-specific and individual-specific Rand indices, denoted RI_t and RI_i, respectively, and define ERI as their average over all periods and individuals (a computational sketch is given after this list),
    \mathrm{ERI} = \frac{1}{2}\Big(\frac{1}{T}\sum_{t=1}^{T}RI_{t} + \frac{1}{N}\sum_{i=1}^{N}RI_{i}\Big).
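A minimal computational sketch of the ERI is given below, assuming the estimated and true block memberships are stored as (N, T) label arrays; the pairwise Rand index used here is the textbook definition, not code from the paper.

```python
import numpy as np
from itertools import combinations

def rand_index(a, b):
    """Rand index between two labelings of the same items."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)

def eri(est_blocks, true_blocks):
    """Average of the per-period and per-individual Rand indices, as defined above."""
    N, T = true_blocks.shape
    ri_t = np.mean([rand_index(est_blocks[:, t], true_blocks[:, t]) for t in range(T)])
    ri_i = np.mean([rand_index(est_blocks[i, :], true_blocks[i, :]) for i in range(N)])
    return 0.5 * (ri_t + ri_i)
```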

3.2. Simulation Examples

Example 1.
In this example, we generate simulated data from the model

y_{it} = \mu_{it} + x_{it}\eta_{it} + \epsilon_{it},\quad i = 1,\ldots,N;\; t = 1,\ldots,T,

where x_it = 2 × e_it and the e_it are independent and identically distributed standard normal random variables. The intercept μ_it and the slope coefficient η_it are both scalars and share the same block structure. As shown in Figure 1, we construct three sets of panel data by varying the combinations of N and T. Each set of data can be partitioned into two sub-blocks, B_1 and B_2, corresponding to coefficients β_1 = (2, 3)^⊤ and β_2 = (2, 5)^⊤, respectively.
To verify the robustness of the proposed method, we consider three scenarios for generating ε_it. Scenario 1 (normal distribution): ε_it ~ N(0, 1). Scenario 2 (heavy-tailed distribution): ε_it ~ 0.5 × t(3), where t(3) denotes the Student's t-distribution with 3 degrees of freedom. Scenario 3 (mixture distribution): ε_it ~ 0.3 × N(0, 0.5²) + 0.2 × N(0, 5²).
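A data-generating sketch for this example follows. The block membership is passed in as an (N, T) label array because the exact layouts of Figure 1 are not reproduced here, and Scenario 3 is implemented as a 0.8/0.2 two-component normal mixture, which is an assumption on our part.

```python
import numpy as np

def simulate_example1(block_of, scenario=1, seed=0):
    """Generate (y, x) for Example 1 given an (N, T) array of block labels in {1, 2}."""
    rng = np.random.default_rng(seed)
    N, T = block_of.shape
    x = 2.0 * rng.standard_normal((N, T))                 # x_it = 2 * e_it
    coef = {1: (2.0, 3.0), 2: (2.0, 5.0)}                 # (mu, eta) per block
    mu = np.vectorize(lambda b: coef[b][0])(block_of)
    eta = np.vectorize(lambda b: coef[b][1])(block_of)
    if scenario == 1:                                      # standard normal errors
        eps = rng.standard_normal((N, T))
    elif scenario == 2:                                    # heavy-tailed: 0.5 * t(3)
        eps = 0.5 * rng.standard_t(3, size=(N, T))
    else:                                                  # normal mixture (assumed 0.8/0.2 weights)
        heavy = rng.random((N, T)) < 0.2
        eps = np.where(heavy, rng.normal(0.0, 5.0, (N, T)),
                       rng.normal(0.0, 0.5, (N, T)))
    return mu + x * eta + eps, x
```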
We obtain nine groups of panel data by varying the combinations of ( N , T ) and the distribution of ϵ i t . For each group of data, we compare the performance of the oracle estimator, the L 1 -loss-based estimator, the L 2 -loss-based estimator, and the Huber-loss-based estimator. The oracle estimator is the estimator when the block structure is known, and for convenience, we use the L 2 loss as its loss function, so the oracle estimator has an explicit solution. The estimators under an unknown block structure are influenced by the tuning parameters λ , γ , and a. We use a grid search method to obtain the optimal hyperparameter combination following the setting in Section 2.3. Additionally, there is an extra parameter δ in the Huber loss function to control the shape of the loss function, and we use the default value of 1.345.
Example 2.
In this example, we adopt the same block partitioning under the different combinations of (N, T) as in Example 1, the difference being an increase in the dimension of the regression coefficients from 2 to 4. The model is specified as

y_{it} = \mu_{it} + x_{it,1}\eta_{it,1} + x_{it,2}\eta_{it,2} + x_{it,3}\eta_{it,3} + \epsilon_{it},\quad i = 1,\ldots,N;\; t = 1,\ldots,T,

where the true coefficients of the first block are β_1 = (2, 3, 2, 1)^⊤ and those of the second block are β_2 = (2, 1, 3, 1)^⊤. The explanatory variables (x_{it,1}, x_{it,2}, x_{it,3}) are generated from a multivariate normal distribution with mean (0, 0, 0)^⊤ and covariance matrix

\begin{pmatrix} 1 & 0.5 & 0.25 \\ 0.5 & 1 & 0.5 \\ 0.25 & 0.5 & 1 \end{pmatrix}.
The error term ϵ i t is generated according to the same mixture distribution as in Example 1. We obtain three sets of panel data for different combinations of N and T.
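Generating the correlated covariates of this example is straightforward; the sketch below simply draws from the multivariate normal distribution with the covariance matrix stated above.

```python
import numpy as np

def simulate_example2_covariates(N, T, seed=0):
    """Draw (x_it1, x_it2, x_it3) for all (i, t) from N(0, Sigma) with the stated covariance."""
    rng = np.random.default_rng(seed)
    sigma = np.array([[1.0, 0.5, 0.25],
                      [0.5, 1.0, 0.5],
                      [0.25, 0.5, 1.0]])
    return rng.multivariate_normal(np.zeros(3), sigma, size=(N, T))   # shape (N, T, 3)
```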
For each set of data, we compute the oracle estimator and the estimator for the unknown block structure following the same procedure as in Example 1. Then, we analyze various result indicators to investigate the performance of the proposed method as the dimension of the regression coefficients increases.
Example 3.
To evaluate the effectiveness of the dual-penalty design in handling two-dimensional heterogeneous panel data, we conduct an ablation experiment by removing either the individual-dimension penalty or the time-dimension penalty from the objective function (4) and comparing the results with the full objective function that includes both penalties.
We use a set of simulated data from Example 1 with N = T = 32, where the error term follows a standard normal distribution and the loss term in the objective function is the L2 loss.

3.3. Simulation Results

Table 1, Table 2 and Table 3 display the simulation results for Example 1, corresponding to three different distributions of the error term. In each table, we consider three combinations of ( N , T ) , reporting the results of the oracle estimator when the block structure is known, and the results based on three different loss functions when the block structure is unknown. The objective function for obtaining the oracle estimator is set to use the L 2 loss. Since obtaining the oracle estimator is difficult in practice, we use it only for comparison. We report the mean and standard deviation of 100 repeated experiments, with the standard deviation displayed in parentheses.
In Table 1, the error terms follow the standard normal distribution. Simulations based on the L1 and Huber losses perform similarly to those based on the L2 loss. In terms of coefficient estimation, when N = T = 16, the L2 loss slightly outperforms the L1 and Huber losses, but as N and T increase, the differences among the three diminish, and all of them approach the oracle estimator. In terms of structure recovery, the Per metric equals 1 in the second and third combinations of (N, T), indicating that the estimated number of blocks equals the true number of blocks, and the ERI metric approaches 1, indicating that the method accurately assigns the samples to the correct sub-blocks.
In Table 2, the error terms follow a heavy-tailed t-distribution. The results based on the L1 and Huber losses clearly outperform those based on the L2 loss. When N = T = 16, the RMSE based on the L1 loss is even slightly lower than that of the oracle estimator based on the L2 loss; this is because the heavy-tailed distribution increases the probability of outliers in the simulated data, and the L2 loss is more sensitive to outliers than the L1 loss. As N and T increase, the results based on the L1 and Huber losses approach the oracle estimator, and the RMSE and Bias metrics become increasingly close to those of the oracle estimator. The Per and ERI metrics reach or approach 1, indicating that the method accurately recovers the block structure.
In Table 3, the error terms follow a mixture of normal distributions with a stronger heavy-tailed effect. The simulation results are worse than those of the previous two cases, but the results based on the L1 and Huber losses are still better than those based on the L2 loss. As N and T increase, the Per and ERI metrics gradually approach 1. This experiment again confirms the robustness and block-structure recovery ability of the proposed estimator.
Below, we analyze the performance of the proposed estimator as the coefficient dimension P increases. In Table 4, the error term follows the heavy-tailed mixture normal distribution. Increasing P makes coefficient estimation and block structure recovery more challenging, but the performance metrics corresponding to the L1 and Huber losses remain superior to those of the L2 loss. As N and T increase, all performance metrics of the estimator improve significantly, and the Per metric corresponding to the L1 loss approaches 0.9 when N = T = 32.
Finally, we present the results of the ablation experiment on the fused penalty terms. Since the simulated data have a block structure, a penalty in only one dimension cannot compress the coefficient differences in the other dimension and thus cannot recover the block structure; hence, we report only the RMSE and Bias metrics in Table 5. The double-penalty method outperforms either single-dimension penalty by a large margin, indicating the necessity of biclustering analysis for block-structured panel data. The double-penalty method not only recovers the unknown structure but also greatly reduces the estimation error of the coefficients.

4. Discussion

Panel data models with heterogeneous coefficients have gained considerable attention in various fields due to their ability to capture complex data patterns. In this paper, we extend the existing literature by proposing a more general block structure that captures heterogeneity in the individual and time dimensions in a flexible manner. Our proposed model exhibits both an individual-group structure that can change at change points and temporal structural breaks that can vary across groups. A robust biclustering method based on M-estimation and double concave fused penalties is developed to estimate the coefficients, which can handle heavy-tailed data and outliers. Under certain regularity conditions, we establish the asymptotic normality of the oracle estimator and the proposed estimator. Numerical simulations validate the excellent finite-sample performance of our proposed method in terms of both the recovery of the unknown structure and the estimation bias of the regression coefficients. Furthermore, our simulations specifically investigate the performance of the proposed model in the presence of heavy-tailed distributions, which highlights its superior performance in handling outliers. We believe that our method has potential applications in various fields where data exhibit complex heterogeneity.
Despite this progress, some limitations remain and warrant further research. The first is the lack of a selection-consistency proof for the modified Bayesian information criterion used for tuning-parameter selection; establishing it would require additional regularity assumptions on the distributions of the covariates and error terms, and the methodology proposed by Cheng et al. [10] for clustering individual group structures is a natural starting point. A second challenge arises from the high-dimensional matrix calculations involved in solving the objective function: the computational burden grows rapidly with N and T, since each update requires solving an NTP-dimensional linear system. To address this issue, a divide-and-conquer strategy with parallel computation could be used to improve the algorithm's efficiency. Finally, variable selection through an L1 penalty on the covariates is another promising direction, particularly when the covariate dimension P is large. These topics will be the subject of future studies.

Author Contributions

Methodology, W.C.; Software, W.C.; Formal analysis, W.C.; Investigation, W.C.; Resources, Y.L.; Data curation, W.C.; Writing—original draft, W.C.; Writing—review & editing, Y.L.; Supervision, Y.L.; Project administration, Y.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

All data generated or analyzed during this study are included in this article.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

In the appendix, we provide the proofs of Theorems 1–3.

Appendix A.1. Proof of Theorem 1

Here, we present the proof of Theorem 1. The first conclusion of Theorem 1 follows directly from Example 1 in He and Shao [23].
Since \|\tilde\alpha - \alpha^{0}\| = O_{p}\big(\sqrt{L_{0}P/(NT)}\big), there exists a positive constant C such that \|\tilde\alpha - \alpha^{0}\| \le C\sqrt{L_{0}P/(NT)}. Using the relationship between α̃ and β̃, we have

\|\tilde\beta - \beta^{0}\|^{2} = \sum_{l=1}^{L_{0}}\sum_{(i,t)\in B_{l}}\|\tilde\alpha_{l} - \alpha_{l}^{0}\|^{2} \le |B_{\max}|\sum_{l=1}^{L_{0}}\|\tilde\alpha_{l} - \alpha_{l}^{0}\|^{2} = |B_{\max}|\,\|\tilde\alpha - \alpha^{0}\|^{2} \le C^{2}\,\frac{L_{0}P\,|B_{\max}|}{NT}.

Thus, \|\tilde\beta - \beta^{0}\| \le C\sqrt{L_{0}P\,|B_{\max}|/(NT)}.
Finally, applying a simple inequality yields

\sup_{i,t}\|\tilde\beta_{it} - \beta_{it}^{0}\| = \sup_{l}\|\tilde\alpha_{l} - \alpha_{l}^{0}\| \le \|\tilde\alpha - \alpha^{0}\| \le C\sqrt{L_{0}P/(NT)}.
This completes the proof of Theorem 1.

Appendix A.2. Proof of Theorem 2

By assumption (A2), there exists 0 < c_1 < 1/3 such that L_0P = O((NT)^{c_1}). Clearly,

(L_{0}P)^{3}\,\big(\log(L_{0}P)\big)^{2} = o(NT).
Therefore, by Example 1 in He and Shao [23], the result of Theorem 2 follows directly.

Appendix A.3. Proof of Theorem 3

We partition the block structure of the regression coefficients as follows. We first divide the samples into C groups according to the group structure in the individual dimension; then, we partition them into K time segments according to the structural breaks in the time dimension (if a structural break occurs for any individual at a given time point, we segment all individuals at that point rather than splitting only the individuals of the affected groups). In this way, we obtain KC sub-blocks, which is no smaller than the number of true sub-blocks. Let B_l^c be the set of sample indices that belong to both the cth individual group and the lth true sub-block, and let B_l^k be the set of sample indices that belong to both the kth time segment and the lth true sub-block. Denote
L(\beta) = \sum_{i=1}^{N}\sum_{t=1}^{T}\rho\big(y_{it} - z_{it}^{\top}\beta_{it}\big),
P(\beta) = \sum_{t=1}^{T}\sum_{i<j}P_{\lambda}\big(\|\beta_{it}-\beta_{jt}\|\big) + \sum_{i=1}^{N}\sum_{t<t'}P_{\gamma}\big(\|\beta_{it}-\beta_{it'}\|\big),
L^{B}(\alpha) = \sum_{l=1}^{L}\sum_{(i,t)\in B_{l}}\rho\big(y_{it} - z_{it}^{\top}\alpha_{l}\big),
P^{B}(\alpha) = \lambda\sum_{l<l'}\sum_{c=1}^{C}\big(|B_{l}^{c}|\,|B_{l'}^{c}|\big)\,p_{\lambda}\big(\|\alpha_{l}-\alpha_{l'}\|\big) + \gamma\sum_{l<l'}\sum_{k=1}^{K}\big(|B_{l}^{k}|\,|B_{l'}^{k}|\big)\,p_{\gamma}\big(\|\alpha_{l}-\alpha_{l'}\|\big),
where |B_l^c| denotes the number of samples belonging to both the cth individual group and the lth true sub-block, and |B_l^k| denotes the number of samples belonging to both the kth time segment and the lth true sub-block. To simplify the notation without creating confusion, we use Q(β) to refer to the objective function in Equation (4), which is defined as
Q(\beta) = L(\beta) + P(\beta),

and let

Q^{B}(\alpha) = L^{B}(\alpha) + P^{B}(\alpha).
Let M_B denote the set of coefficient vectors in R^{NTP} with block structure B. We define a mapping T : M_B → R^{LP}, which maps a block-structured NTP-dimensional vector β to the LP-dimensional vector α, where α = (α_1^⊤, ..., α_L^⊤)^⊤ is the concatenation of L P-dimensional vectors and α_l represents the common coefficient of the lth sub-block. Additionally, we define a mapping T_B : R^{NTP} → R^{LP}, which maps an arbitrary NTP-dimensional vector β to an LP-dimensional vector α according to the block structure B. For any β = (β_11^⊤, ..., β_1T^⊤, ..., β_N1^⊤, ..., β_NT^⊤)^⊤ and B = {B_1, ..., B_L}, the mapping is given by

T_{B}(\beta) = \Big(|B_{1}|^{-1}\sum_{(i,t)\in B_{1}}\beta_{it}^{\top},\;\ldots,\;|B_{L}|^{-1}\sum_{(i,t)\in B_{L}}\beta_{it}^{\top}\Big)^{\top}.
From the properties of the mappings T and T_B, it follows that for any β ∈ M_B, we have T(β) = T_B(β) and P(β) = P^B(T(β)). Letting α = T(β), we also have P(T^{-1}(α)) = P^B(α). Therefore, we conclude that

Q(\beta) = Q^{B}(T(\beta)),\qquad Q^{B}(\alpha) = Q\big(T^{-1}(\alpha)\big).
We denote the true regression coefficients by β^0 and α^0 and the oracle estimators by β̃ and α̃. Then, we define the set

\Theta_{1} = \Big\{\beta\in\mathbb{R}^{NTP} : \sup_{i,t}\|\beta_{it}-\beta_{it}^{0}\| \le \sqrt{L_{0}P/(NT)}\Big\}.
We now prove Theorem 3 in two steps.
(i) Let β ∈ Θ_1, and denote β* = T^{-1}(T_B(β)). For all β* ≠ β̃, we have Q(β*) > Q(β̃) with probability tending to 1.
(ii) Define the set Θ_2 = {β : sup_{i,t} ||β_it − β̃_it|| ≤ s}, where s is a positive sequence. When s is small enough, we have Q(β) ≥ Q(β*) with probability tending to 1 for all β ∈ Θ_1 ∩ Θ_2.
Clearly, if (i) and (ii) are proved, then for any β ∈ Θ_1 ∩ Θ_2 with β* ≠ β̃, we have Q(β) ≥ Q(β*) > Q(β̃), which means that the oracle estimator β̃ is a local minimizer of Q(β), and this conclusion holds with probability tending to 1.
The proof of (i) is as follows. Let α* = T_B(β); then

\sup_{l}\|\alpha^{*}_{l}-\alpha_{l}^{0}\|_{2} = \sup_{l}\Big\||B_{l}|^{-1}\sum_{(i,t)\in B_{l}}\beta_{it} - \alpha_{l}^{0}\Big\|_{2} = \sup_{l}\Big\||B_{l}|^{-1}\sum_{(i,t)\in B_{l}}\big(\beta_{it} - \beta_{it}^{0}\big)\Big\|_{2} \le \sup_{l}|B_{l}|^{-1}\sum_{(i,t)\in B_{l}}\|\beta_{it}-\beta_{it}^{0}\|_{2} \le \sup_{i,t}\|\beta_{it}-\beta_{it}^{0}\|_{2} \le \sqrt{L_{0}P/(NT)};

therefore, for all l and l', the following inequality holds,

\|\alpha^{*}_{l}-\alpha^{*}_{l'}\| \ge \|\alpha_{l}^{0}-\alpha_{l'}^{0}\| - 2\sup_{l}\|\alpha^{*}_{l}-\alpha_{l}^{0}\| \ge b - 2\sqrt{L_{0}P/(NT)}.

Using the inequality from assumption (A5), (NT)^{(1-c_{2})/2}\,b \ge M_{2}, and the condition \max(\lambda,\gamma) = o\big((NT)^{-(1-c_{2})/2}\big), we obtain

b - 2\sqrt{L_{0}P/(NT)} > \max(a\lambda,\, a\gamma).

Therefore, assumption (A7) implies that P^B(α*) = C, where C is a constant, and hence Q^B(α*) = L^B(α*) + C. Since α̃ is the global minimizer of L^B(α), it follows that Q^B(α*) > Q^B(α̃) for any α* ≠ α̃. Finally, using Equation (A11), we have Q^B(α*) = Q(T^{-1}(α*)) = Q(β*) and Q^B(α̃) = Q(β̃), so Q(β*) > Q(β̃) for any β* ≠ β̃. This completes the proof of conclusion (i).
Continuing with the proof of (ii), we define two functions

P_{1}(\beta) = \sum_{t=1}^{T}\sum_{i<j}P_{\lambda}\big(\|\beta_{it}-\beta_{jt}\|\big),\qquad P_{2}(\beta) = \sum_{i=1}^{N}\sum_{t<t'}P_{\gamma}\big(\|\beta_{it}-\beta_{it'}\|\big).
By a Taylor expansion around β*_it, we can decompose the difference of the objective function into three parts,

Q(\beta) - Q(\beta^{*}) = \Gamma_{1} + \Gamma_{2} + \Gamma_{3},

where each part takes the following form:

\Gamma_{1} = L(\beta) - L(\beta^{*}),
\Gamma_{2} = \sum_{t=1}^{T}\sum_{i=1}^{N}\Big(\frac{\partial P_{1}(\beta^{m})}{\partial\beta_{it}}\Big)^{\top}\big(\beta_{it}-\beta^{*}_{it}\big),
\Gamma_{3} = \sum_{i=1}^{N}\sum_{t=1}^{T}\Big(\frac{\partial P_{2}(\beta^{m})}{\partial\beta_{it}}\Big)^{\top}\big(\beta_{it}-\beta^{*}_{it}\big).

Here, β^m = θβ + (1 − θ)β*, where θ is a scalar between 0 and 1.
First, we handle the first part,

\Gamma_{1} = \sum_{i=1}^{N}\sum_{t=1}^{T}\Big(\rho\big(y_{it}-z_{it}^{\top}\beta_{it}\big) - \rho\big(y_{it}-z_{it}^{\top}\beta^{*}_{it}\big)\Big) = -\sum_{i=1}^{N}\sum_{t=1}^{T}\phi\big(y_{it}-z_{it}^{\top}\beta^{m}_{it}\big)\,z_{it}^{\top}\big(\beta_{it}-\beta^{*}_{it}\big) = -\sum_{i=1}^{N}\sum_{t=1}^{T}\phi\big(\epsilon_{it}+z_{it}^{\top}(\beta^{0}_{it}-\beta^{m}_{it})\big)\,z_{it}^{\top}\big(\beta_{it}-\beta^{*}_{it}\big).

By Assumption (A6) and z_{it}^{\top}(\beta^{0}_{it}-\beta^{m}_{it}) = o_{p}(1), we can deduce that

P\Big(\big|\phi\big(\epsilon_{it}+z_{it}^{\top}(\beta^{0}_{it}-\beta^{m}_{it})\big)\big| > c_{4}^{-1}\log(NT)\Big) \le 2(NT)^{-c_{4}},

which implies that, as N and T approach infinity, |\phi(\epsilon_{it}+z_{it}^{\top}(\beta^{0}_{it}-\beta^{m}_{it}))| \le c_{4}^{-1}\log(NT) holds with probability tending to one. Thus, we can bound Γ_1 as follows,

\Gamma_{1} \ge -\sum_{i=1}^{N}\sum_{t=1}^{T} c_{4}^{-1}\log(NT)\,\big|z_{it}^{\top}(\beta_{it}-\beta^{*}_{it})\big| = -\sum_{l=1}^{L}\sum_{(i,t)\in B_{l}} c_{4}^{-1}\log(NT)\,\Big|z_{it}^{\top}\Big(\beta_{it}-|B_{l}|^{-1}\sum_{(j,t')\in B_{l}}\beta_{jt'}\Big)\Big| = -\sum_{l=1}^{L}\sum_{(i,t),(j,t')\in B_{l}} c_{4}^{-1}\log(NT)\,|B_{l}|^{-1}\big|z_{it}^{\top}(\beta_{it}-\beta_{jt'})\big| \ge -\sum_{l=1}^{L}\sum_{(i,t),(j,t')\in B_{l}} c_{4}^{-1}\log(NT)\,|B_{\min}|^{-1}\sqrt{P}\,M_{1}\,\|\beta_{it}-\beta_{jt'}\|.
For Γ_2 and Γ_3, using the results of Wang et al. [18] and Fang et al. [24], we have

\Gamma_{2} \ge \lambda\sum_{t=1}^{T}\sum_{l=1}^{L}\sum_{(i,t),(j,t)\in B_{l},\, i<j} p_{\lambda}'(4s)\,\|\beta_{it}-\beta_{jt}\|,\qquad \Gamma_{3} \ge \gamma\sum_{i=1}^{N}\sum_{l=1}^{L}\sum_{(i,t),(i,t')\in B_{l},\, t<t'} p_{\gamma}'(4s)\,\|\beta_{it}-\beta_{it'}\|.
Finally, by combining the above results, we obtain

Q(\beta) - Q(\beta^{*}) \ge -4\sum_{l=1}^{L}\sum_{(i,t),(j,t')\in B_{l},\, i<j,\, t<t'} c_{4}^{-1}\log(NT)\,|B_{\min}|^{-1}\sqrt{P}\,M_{1}\,\|\beta_{it}-\beta_{jt'}\| + \sum_{l=1}^{L}\Big[\sum_{t=1}^{T}\sum_{(i,t),(j,t)\in B_{l},\, i<j}\lambda\,p_{\lambda}'(4s) + \sum_{i=1}^{N}\sum_{(i,t),(i,t')\in B_{l},\, t<t'}\gamma\,p_{\gamma}'(4s)\Big]\|\beta_{it}-\beta_{jt'}\| \ge \sum_{l=1}^{L}\sum_{(i,t),(j,t')\in B_{l},\, i<j,\, t<t'}\Big(\min(\lambda,\gamma)\,p'(4s) - 4\,c_{4}^{-1}\log(NT)\,|B_{\min}|^{-1}\sqrt{P}\,M_{1}\Big)\|\beta_{it}-\beta_{jt'}\|.

As s → 0, p_λ'(4s) → 1 and p_γ'(4s) → 1. Moreover, since \sqrt{P}\log(NT)\big/\big(\min(\lambda,\gamma)|B_{\min}|\big) = o(1), we have Q(\beta) - Q(\beta^{*}) \ge 0.
This completes the proof.

References

  1. Su, L.; Shi, Z.; Phillips, P.C. Identifying latent structures in panel data. Econometrica 2016, 84, 2215–2264. [Google Scholar] [CrossRef]
  2. Tibshirani, R.; Saunders, M.; Rosset, S.; Zhu, J.; Knight, K. Sparsity and smoothness via the fused LASSO. J. R. Stat. Soc. Ser. B 2005, 67, 91–108. [Google Scholar] [CrossRef]
  3. Wang, W.; Zhu, Z. Group structure detection for a high-dimensional panel data model. Can. J. Stat. 2022, 50, 852–866. [Google Scholar] [CrossRef]
  4. Qian, J.; Su, L. Shrinkage estimation of regression models with multiple structural changes. Econom. Theory 2016, 32, 1376–1433. [Google Scholar] [CrossRef]
  5. Qian, J.; Su, L. Shrinkage estimation of common breaks in panel data models via adaptive group fused Lasso. J. Econom. 2016, 191, 86–109. [Google Scholar] [CrossRef]
  6. Okui, R.; Wang, W. Heterogeneous structural breaks in panel data models. J. Econom. 2021, 220, 447–473. [Google Scholar] [CrossRef]
  7. Lumsdaine, R.L.; Okui, R.; Wang, W. Estimation of panel group structure models with structural breaks in group memberships and coefficients. J. Econom. 2023, 233, 45–65. [Google Scholar] [CrossRef]
  8. Zhang, Y.; Wang, H.J.; Zhu, Z. Robust subgroup identification. Stat. Sin. 2019, 29, 1873–1889. [Google Scholar] [CrossRef]
  9. Zou, H.; Li, R. One-step sparse estimates in nonconcave penalized likelihood models. Ann. Stat. 2008, 36, 1509–1533. [Google Scholar]
  10. Cheng, C.; Feng, X.; Li, X.; Wu, M. Robust analysis of cancer heterogeneity for high-dimensional data. Stat. Med. 2022, 41, 5448–5462. [Google Scholar] [CrossRef]
  11. Boyd, S.; Parikh, N.; Chu, E.; Peleato, B.; Eckstein, J. Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends® Mach. Learn. 2011, 3, 1–122. [Google Scholar]
  12. Tibshirani, R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B (Methodol.) 1996, 58, 267–288. [Google Scholar] [CrossRef]
  13. Hoerl, A.E.; Kennard, R.W. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics 1970, 12, 55–67. [Google Scholar] [CrossRef]
  14. Zou, H.; Hastie, T. Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 2005, 67, 301–320. [Google Scholar] [CrossRef]
  15. Fan, J.; Li, R. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 2001, 96, 1348–1360. [Google Scholar] [CrossRef]
  16. Zhang, C.H. Nearly unbiased variable selection under minimax concave penalty. Ann. Stat. 2010, 38, 894–942. [Google Scholar] [CrossRef]
  17. Ma, S.; Huang, J. A concave pairwise fusion approach to subgroup analysis. J. Am. Stat. Assoc. 2017, 112, 410–423. [Google Scholar] [CrossRef]
  18. Wang, W.; Yan, X.; Ren, Y.; Xiao, Z. Bi-Integrative Analysis of Two-Dimensional Heterogeneous Panel Data Model. arXiv 2021, arXiv:econ.EM/2110.10480. Available online: http://xxx.lanl.gov/abs/2110.10480 (accessed on 10 October 2021).
  19. Hunter, D.R.; Li, R. Variable selection using MM algorithms. Ann. Stat. 2005, 33, 1617. [Google Scholar] [CrossRef]
  20. Wang, H.; Li, R.; Tsai, C.L. Tuning Parameter Selectors for the Smoothly Clipped Absolute Deviation Method. Biometrika 2007, 94, 553–568. [Google Scholar] [CrossRef]
  21. Wang, H.; Li, B.; Leng, C. Shrinkage tuning parameter selection with a diverging number of parameters. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 2009, 71, 671–683. [Google Scholar] [CrossRef]
  22. Ma, S.; Huang, J.; Zhang, Z.; Liu, M. Exploration of heterogeneous treatment effects via concave fusion. Int. J. Biostat. 2019, 16, 20180026. [Google Scholar] [CrossRef] [PubMed]
  23. He, X.; Shao, Q.M. On parameters of increasing dimensions. J. Multivar. Anal. 2000, 73, 120–135. [Google Scholar] [CrossRef]
  24. Fang, K.; Chen, Y.; Ma, S.; Zhang, Q. Biclustering analysis of functionals via penalized fusion. J. Multivar. Anal. 2022, 189, 104874. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Block structures corresponding to different combinations of N and T in Example 1. The orange regions correspond to block 1, and the blue regions correspond to block 2. (a) shows the block structure for N = T = 16 . (b) shows the block structure for N = T = 32 . (c) shows the block structure for N = 16 and T = 32 .
Table 1. Experimental Indicators of Each Combination When the Error Term is Normally Distributed in Example 1.
(N, T) | Model | RMSE | Bias | Per | ERI
(16, 16) | Oracle | 0.034 (0.025) | 0.056 (0.021) | – | –
(16, 16) | L1 | 0.040 (0.031) | 0.087 (0.041) | 0.99 | 0.998 (0.002)
(16, 16) | L2 | 0.039 (0.026) | 0.073 (0.030) | 1 | 0.999 (0.001)
(16, 16) | Huber | 0.039 (0.026) | 0.075 (0.031) | 1 | 0.998 (0.001)
(16, 32) | Oracle | 0.020 (0.015) | 0.039 (0.014) | – | –
(16, 32) | L1 | 0.024 (0.020) | 0.068 (0.034) | 1 | 0.999 (0.001)
(16, 32) | L2 | 0.021 (0.016) | 0.040 (0.020) | 1 | 0.999 (0.001)
(16, 32) | Huber | 0.022 (0.016) | 0.041 (0.018) | 1 | 0.998 (0.001)
(32, 32) | Oracle | 0.015 (0.012) | 0.028 (0.011) | – | –
(32, 32) | L1 | 0.016 (0.014) | 0.035 (0.021) | 1 | 0.998 (0.001)
(32, 32) | L2 | 0.016 (0.011) | 0.028 (0.012) | 1 | 0.999 (0.001)
(32, 32) | Huber | 0.016 (0.010) | 0.028 (0.012) | 1 | 0.999 (0.001)
Table 2. Experimental Indicators of Each Combination When the Error Term is t-Distributed in Example 1.
(N, T) | Model | RMSE | Bias | Per | ERI
(16, 16) | Oracle | 0.024 (0.020) | 0.046 (0.022) | – | –
(16, 16) | L1 | 0.023 (0.018) | 0.058 (0.032) | 0.98 | 0.998 (0.001)
(16, 16) | L2 | 0.026 (0.021) | 0.063 (0.039) | 0.89 | 0.983 (0.011)
(16, 16) | Huber | 0.024 (0.019) | 0.060 (0.035) | 0.99 | 0.998 (0.001)
(16, 32) | Oracle | 0.016 (0.013) | 0.030 (0.011) | – | –
(16, 32) | L1 | 0.017 (0.013) | 0.041 (0.019) | 1 | 0.999 (0.001)
(16, 32) | L2 | 0.023 (0.018) | 0.054 (0.033) | 0.91 | 0.987 (0.006)
(16, 32) | Huber | 0.018 (0.014) | 0.043 (0.028) | 1 | 0.998 (0.002)
(32, 32) | Oracle | 0.012 (0.009) | 0.021 (0.009) | – | –
(32, 32) | L1 | 0.012 (0.009) | 0.032 (0.012) | 1 | 0.999 (0.001)
(32, 32) | L2 | 0.015 (0.012) | 0.037 (0.023) | 0.94 | 0.993 (0.004)
(32, 32) | Huber | 0.012 (0.010) | 0.033 (0.022) | 1 | 0.999 (0.001)
Table 3. Experimental Indicators of Each Combination When the Error Term Follows a Mixture Distribution in Example 1.
(N, T) | Model | RMSE | Bias | Per | ERI
(16, 16) | Oracle | 0.044 (0.037) | 0.084 (0.038) | – | –
(16, 16) | L1 | 0.054 (0.038) | 0.122 (0.056) | 0.79 | 0.986 (0.012)
(16, 16) | L2 | 0.081 (0.066) | 0.162 (0.098) | 0.42 | 0.943 (0.037)
(16, 16) | Huber | 0.057 (0.065) | 0.151 (0.117) | 0.74 | 0.980 (0.011)
(16, 32) | Oracle | 0.034 (0.025) | 0.062 (0.025) | – | –
(16, 32) | L1 | 0.049 (0.034) | 0.112 (0.530) | 0.81 | 0.986 (0.010)
(16, 32) | L2 | 0.088 (0.075) | 0.207 (0.129) | 0.49 | 0.953 (0.028)
(16, 32) | Huber | 0.051 (0.040) | 0.118 (0.073) | 0.76 | 0.984 (0.010)
(32, 32) | Oracle | 0.023 (0.017) | 0.044 (0.017) | – | –
(32, 32) | L1 | 0.032 (0.024) | 0.086 (0.066) | 0.89 | 0.989 (0.005)
(32, 32) | L2 | 0.061 (0.073) | 0.149 (0.106) | 0.62 | 0.971 (0.021)
(32, 32) | Huber | 0.044 (0.046) | 0.093 (0.068) | 0.83 | 0.987 (0.007)
Table 4. Experimental Indicators of Each Combination When the Error Term Follows a Mixture Distribution and P = 4 in Example 2.
(N, T) | Model | RMSE | Bias | Per | ERI
(16, 16) | Oracle | 0.044 (0.037) | 0.084 (0.038) | – | –
(16, 16) | L1 | 0.054 (0.038) | 0.122 (0.056) | 0.79 | 0.986 (0.012)
(16, 16) | L2 | 0.073 (0.062) | 0.157 (0.093) | 0.42 | 0.943 (0.037)
(16, 16) | Huber | 0.057 (0.045) | 0.151 (0.087) | 0.74 | 0.980 (0.011)
(16, 32) | Oracle | 0.034 (0.025) | 0.062 (0.025) | – | –
(16, 32) | L1 | 0.049 (0.034) | 0.112 (0.053) | 0.81 | 0.986 (0.010)
(16, 32) | L2 | 0.070 (0.063) | 0.154 (0.091) | 0.49 | 0.953 (0.028)
(16, 32) | Huber | 0.051 (0.040) | 0.118 (0.073) | 0.76 | 0.984 (0.010)
(32, 32) | Oracle | 0.023 (0.017) | 0.044 (0.017) | – | –
(32, 32) | L1 | 0.032 (0.024) | 0.086 (0.066) | 0.89 | 0.989 (0.005)
(32, 32) | L2 | 0.061 (0.073) | 0.149 (0.106) | 0.62 | 0.971 (0.002)
(32, 32) | Huber | 0.044 (0.036) | 0.093 (0.068) | 0.83 | 0.987 (0.007)
Table 5. Ablation Experiment of Penalty Terms in Example 3.
Model | RMSE | Bias
Oracle | 0.015 (0.012) | 0.028 (0.011)
Double Penalties | 0.016 (0.011) | 0.028 (0.012)
Individual Penalty Only | 0.155 (0.119) | 0.799 (0.342)
Temporal Penalty Only | 0.226 (0.130) | 0.440 (0.154)

Share and Cite

MDPI and ACS Style

Cui, W.; Li, Y. Bicluster Analysis of Heterogeneous Panel Data via M-Estimation. Mathematics 2023, 11, 2333. https://doi.org/10.3390/math11102333

AMA Style

Cui W, Li Y. Bicluster Analysis of Heterogeneous Panel Data via M-Estimation. Mathematics. 2023; 11(10):2333. https://doi.org/10.3390/math11102333

Chicago/Turabian Style

Cui, Weijie, and Yong Li. 2023. "Bicluster Analysis of Heterogeneous Panel Data via M-Estimation" Mathematics 11, no. 10: 2333. https://doi.org/10.3390/math11102333

