Article

Incorporating Prior Information in Latent Structures Identification for Panel Data Models

1 College of Tourism, Hunan Normal University, Changsha 410081, China
2 School of Management, Chongqing University of Science and Technology, Chongqing 401331, China
3 School of Economics, Xiamen University, Xiamen 361005, China
* Author to whom correspondence should be addressed.
Mathematics 2025, 13(9), 1505; https://doi.org/10.3390/math13091505
Submission received: 14 March 2025 / Revised: 22 April 2025 / Accepted: 29 April 2025 / Published: 2 May 2025

Abstract: In this paper, we explore latent structures for panel data models in the presence of available prior information. The latent structure in panel models allows individuals to be classified into several distinct groups, where the individuals within the same group share the same slope parameters, while the group-specific parameters are heterogeneous across groups. To incorporate the prior information, we design a new alternating direction method of multipliers (ADMM) algorithm based on the pairwise group fused Lasso penalty approach. The asymptotic properties of the estimators and the convergence of the ADMM algorithm are established. Simulation studies demonstrate the advantages of the proposed method over existing methods in terms of both estimation efficiency and detection accuracy. We illustrate the practical utility of the proposed procedure by analyzing the relationship between electricity consumption and GDP in China.

1. Introduction

Panel data models are widely encountered in many substantive areas of both econometrics and statistics. When conducting analyses of panel data, conventional approaches typically assume that the slopes of the regressors are completely homogeneous across all individuals [1]. However, such homogeneity assumptions are often at odds with reality in many practical analyses; see, for example, [2,3,4], to name a few. From the perspective of modeling strategy, the complete homogeneity assumption ignores individual-specific heterogeneity, which is quite important in panel data analysis. On the other hand, estimating the parameters for each individual separately achieves complete heterogeneity, but the resulting estimators and the relevant inferences are inefficient or even imprecise, as the time dimension T is usually very short. To strike a balance between completely homogeneous and fully heterogeneous models, a natural way is to assume that the panel units can be classified into several unknown groups, where the members within a group share the same slope parameters, while the slopes differ across groups. In this way, the convergence rate of the estimators improves from $\sqrt{T}$ to $\sqrt{N_k T}$, where $N_k$ is the number of individuals in the $k$th group. The group structure here is called a latent structure [5], and the identification of latent structures is sometimes called homogeneity structure learning [6,7] or subgroup identification [8].

1.1. Literature Review

The existing literature on exploring latent group structures can be roughly classified into three categories. First, researchers have found that identifying the latent groups can be formulated as a penalized estimation problem, so that modern feature selection techniques [9,10] can be employed. A pivotal contribution in this area can be traced back to [5], who developed a novel variant of the Lasso estimator called the Classifier-Lasso (C-Lasso). The C-Lasso penalty takes an additive-multiplicative form, which forces the parameters of individuals into several groups for a given number of groups. This method has subsequently been extended to various panel models; see, e.g., [11,12,13,14]. Recently, ref. [15] proposed a new penalized estimation procedure based on the pairwise adaptive group fused Lasso (PAGFL) penalty, which can automatically identify latent group structures by shrinking the differences in slopes between pairs of individuals towards zero. Compared to the C-Lasso, this approach is guaranteed to converge to a unique global minimizer. Second, note that clustering panel individuals into groups is essentially an unsupervised clustering problem; hence, the classical K-means algorithm can be adapted to the panel regression framework with group structures after slight modifications. Research in this direction includes [2,16,17]. The third popular approach is based on the sequential binary segmentation algorithm (SBSA) of [18], which was originally designed for structural break detection. Ref. [19] employed the SBSA to detect latent group structures in nonlinear panel models through two steps: first, sorting the preliminary unconstrained consistent estimators of the regression coefficients, and then performing break point detection. The individuals between adjacent estimated change points then form a group. This idea has subsequently been extended by [20,21] to detect latent structures in spatial dynamic panels and time-varying panels.
These methods make important contributions but often ignore the use of prior information. It is well known that incorporating prior information into economic and statistical modeling can help improve estimation performance. For example, in the application of predicting clinical measures from neuroimages, ref. [22] proposed a sparse fused group Lasso method that incorporates spatial and group structure information to enhance prediction accuracy. Similarly, ref. [23] detected the sparsity and homogeneity of regression coefficients under imposed prior constraints. To quickly identify clustered patterns in spatial regression and reduce the computational burden, ref. [24] built a minimum spanning tree (MST) based on spatial neighborhood information, and then detected homogeneity in the coefficients based on the tree. Along this line, ref. [25] proposed a new MST by simultaneously leveraging the concept of network topology and local model similarity. Numerical experiments and real-world data analyses in the literature have demonstrated that incorporating more prior information can achieve better estimation and clustering efficiency.
In panel data analysis, researchers often have access to prior information before economic modeling, such as existing research conclusions or objective geographic information. For example, ref. [15] used the preliminary individual-specific estimators to construct the data-adaptive weights in the PAGFL. The adaptive weights can be regarded as indicators of similarity among units, reflecting the likelihood that two units belong to the same group. One important example, which is closely related to this work, is [26], who extended the idea of [6] and developed a panel-CARDS algorithm. This method first sorts all individuals based on a preliminary estimator and then constructs an ordered partition set for the individuals. By applying shrinkage estimation between and within the segments, the latent structure can be identified in a data-driven manner. Notably, constructing the rank mapping from the preliminary estimates effectively borrows prior information. Another leading example arises from [2], who studied grouped fixed effects and assumed that some coefficients do not change across individuals. This model setup also encodes the prior information that the coefficients of certain regressors are homogeneous across individuals. Hence, as noted by [2] in their supplementary material, adding prior information on the structure of unobserved homogeneity deserves further investigation, which is precisely the issue we address in this paper.

1.2. Contributions and Organization

This paper mainly focuses on incorporating prior information to improve latent structure detection. The contributions are as follows. First, we provide a general prior information framework in which the prior is represented as a constrained region for the regression coefficients of panel models. To identify the latent group structure while utilizing the prior information, we develop a novel penalized estimation scheme based on the concave pairwise fusion penalty combined with the constraint information. Second, due to the restrictions imposed by the prior constraint, traditional numerical optimization algorithms cannot be applied directly. To implement the proposed approach, we design an alternating direction method of multipliers (ADMM) algorithm [27] for solving the constrained fused Lasso problem. The ADMM algorithm is straightforward and easy to implement in free software such as R. Additionally, we have developed a new R package, phpadmm, to facilitate the estimation procedures described in this paper, which is available upon request. Third, we establish the convergence of the ADMM algorithm and the asymptotic results for the estimators, and we validate the theoretical results through simulation studies and a real-world application. Fourth, we apply the proposed latent structure detection procedure to investigate the relationship between electricity consumption and GDP in China, a significant topic that has garnered considerable interest in the field of energy economics, and identify distinct subgroups among the 30 provinces.
The remainder of the paper is structured as follows. We present the proposed ADMM estimation method in Section 2. The convergence of the algorithm and the asymptotic theory for the estimators are examined in Section 3. Section 4 compares the finite sample performance of different methods. We investigate the effect of electricity consumption on GDP using the proposed methodology in Section 5. Section 6 concludes the paper. The proofs and an additional testing procedure are given in the Appendix.
Notation. Throughout this paper, we adopt the following notations. Denote by $0_m$ an $m \times 1$ zero vector, by $1_m$ an $m \times 1$ vector of ones, and by $I_m$ an $m \times m$ identity matrix. Let $\langle a, b\rangle = a^T b$ be the inner product of two vectors $a$ and $b$ of the same dimension. For an $m \times n$ matrix $A = (a_1, \ldots, a_n)^T$, we denote its transpose by $A^T$ and write $\mathrm{vec}(A) = (a_1^T, \ldots, a_n^T)^T$. Throughout this paper, a superscript zero on any quantity refers to the corresponding true quantity, which is a fixed number, vector or matrix.

2. Model and Proposed Estimation Method

2.1. Panel Heterogeneity with Prior Constraint Information

Let $y_{it}$ be the real-valued dependent variable and $x_{it} = (x_{it,1}, \ldots, x_{it,p})^T$ a $p \times 1$ vector of regressors for $i = 1, \ldots, N$ and $t = 1, \ldots, T$, where $i$ and $t$ index the individual and the time period, respectively. We consider the following linear panel model with a latent group structure:
$$y_{it} = \mu_i + x_{it}^T \beta_i^0 + \epsilon_{it}, \quad i = 1, \ldots, N;\ t = 1, \ldots, T,$$
where $\mu_i$ is the fixed individual effect for the $i$th individual, $\epsilon_{it}$ is an error term with mean zero, and $\beta_i^0$ is a $p \times 1$ vector of unknown coefficients for the $i$th individual. The set of coefficients $\{\beta_1^0, \ldots, \beta_N^0\}$ admits the following latent grouping structure:
$$\beta_i^0 = \begin{cases}\alpha_1^0, & \text{if } i \in G_1^0, \\ \quad\vdots & \\ \alpha_{K_0}^0, & \text{if } i \in G_{K_0}^0,\end{cases}$$
where $K_0$ is an unknown integer, $\alpha_l^0 \ne \alpha_k^0$ for any $l \ne k$, and $\alpha_k^0$ represents the common slope parameters of the $k$th subgroup. The set $G^0 = \{G_1^0, \ldots, G_{K_0}^0\}$ forms a partition of $\{1, \ldots, N\}$, and $G_l^0 \cap G_k^0 = \emptyset$ for any $l \ne k$.
Let $B = (\beta_1, \ldots, \beta_N)^T$ be an $N \times p$ coefficient matrix, and let $\beta = \mathrm{vec}(B)$. In this paper, we assume that the prior information can be formulated as a constrained region, i.e.,
$$\beta \in \mathcal{C},$$
where $\mathcal{C}$ is a non-empty convex set, which is quite general. For example, in a cross-province panel data analysis, one may consider that provinces in neighboring regions are more likely to belong to the same group, and we can incorporate this geographical information into $\mathcal{C}$. Similarly, in financial studies, the financial returns of similar industry sectors tend to share the same scores on the common factors. In addition, when $\mathcal{C} = \{\beta : \beta \in \mathbb{R}^{Np}\}$, the group identification in (2) simplifies to the no-prior-information situation, which has been discussed by, e.g., [5,12,13]. The convex set $\mathcal{C}$ is user-specified and can include equality constraints, inequality constraints, or other constrained forms depending on practical considerations. To enhance the readability of this paper, we summarize the key notations in Table 1.
Prior information is typically known in advance and significant, especially in the context of this paper. As mentioned earlier, several existing methods have utilized prior information in identifying the latent group structures of panel models, including the ordering-based method [26] and the adaptive weights in the group fused Lasso penalty [15]. In essence, these two methods both make use of the distance between the initial slope estimators of any two units. In contrast, our method generalizes the definition of prior information to include both equality and inequality constraints. We summarize the differences between the existing works and our framework regarding prior information in Table 2. Effectively utilizing this prior information can enhance both homogeneity detection and parameter estimation accuracy. Our objective in this paper is to explore how to incorporate the prior constraints (3) into the process of detecting homogeneity.

2.2. A Regularized Approach

To detect homogeneity in panel data models, one may utilize the K-means clustering method, as demonstrated by [2,8], or the binary segmentation method proposed by [28]. However, both methods struggle to incorporate prior information effectively. In this paper, we consider a regularized approach with a concave pairwise fusion penalty to detect homogeneity in model (1). This method was first applied by [29] for subgroup analysis and has since been extensively employed in various fields, such as precision medicine. After concentrating out the fixed effects, the objective function can be represented as
$$L(\beta; \gamma) = \frac{1}{2NT}\sum_{i=1}^N\sum_{t=1}^T(\tilde{y}_{it} - \tilde{x}_{it}^T\beta_i)^2 + \sum_{i<j} p_\gamma(\|\beta_i - \beta_j\|_1) \quad \text{s.t. } \beta \in \mathcal{C},$$
where $\tilde{x}_{it} = x_{it} - \bar{x}_i$ and $\tilde{y}_{it} = y_{it} - \bar{y}_i$, in which $\bar{x}_i = \frac{1}{T}\sum_{t=1}^T x_{it}$ and $\bar{y}_i = \frac{1}{T}\sum_{t=1}^T y_{it}$; $p_\gamma(\cdot)$ is a penalty function and $\gamma > 0$ is a tuning parameter that controls the degree of penalization on $\|\beta_i - \beta_j\|_1$. There are a total of $N(N-1)/2$ terms in $\sum_{i<j} p_\gamma(\|\beta_i - \beta_j\|_1)$, which encourages shrinkage of the pairwise differences in slopes between any two subjects. This regularized adjustment can identify the latent homogeneous structures of the coefficients $\beta$. Ref. [30] pointed out that the non-convex penalty $p_\gamma(\cdot)$ should satisfy four conditions, and the smoothly clipped absolute deviation (SCAD) [9] and minimax concave (MCP) [31] penalties are adequate choices.
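To fix ideas, the following is a minimal R sketch of the two penalty functions, written with penalty level lambda and concavity parameter gamma as these symbols appear in the closed-form updates of Section 2.3 (the function names are ours, not from an existing package):

```r
## Minimal sketch of the MCP and SCAD penalties p(t) evaluated at t >= 0
## (e.g., t = ||beta_i - beta_j||_1), with penalty level lambda and
## concavity parameter gamma; both are constant for large t, which is
## what allows large slope differences to escape shrinkage.
mcp <- function(t, lambda, gamma) {
  ifelse(t <= gamma * lambda,
         lambda * t - t^2 / (2 * gamma),  # quadratically flattened near 0
         gamma * lambda^2 / 2)            # flat beyond gamma * lambda
}
scad <- function(t, lambda, gamma) {
  ifelse(t <= lambda,
         lambda * t,                                              # Lasso-like part
         ifelse(t <= gamma * lambda,
                (2 * gamma * lambda * t - t^2 - lambda^2) / (2 * (gamma - 1)),
                lambda^2 * (gamma + 1) / 2))                      # flat tail
}
```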
To further incorporate the prior constraint condition (3), we introduce a new regularized term and rewrite the objective function (4) as
$$\tilde{L} = \frac{1}{2NT}\sum_{i=1}^N\sum_{t=1}^T(\tilde{y}_{it} - \tilde{x}_{it}^T\beta_i)^2 + \sum_{i<j} p_\gamma(\|\beta_i - \beta_j\|_1) + I_{\mathcal{C}}(\beta),$$
where $I_{\mathcal{C}}(\cdot)$ is the indicator function of $\mathcal{C}$, equaling zero when $\beta$ belongs to $\mathcal{C}$ and $+\infty$ otherwise. It should be noted that directly minimizing (5) is computationally challenging, since the SCAD and MCP penalties are not convex and the pairwise fusion penalization term is non-separable in the $\beta_i$. To achieve computational feasibility, we formally propose an ADMM algorithm to minimize (5).

2.3. ADMM Implementation

The ADMM algorithm is an optimization algorithm that is particularly useful for solving optimization problems with specific structures, such as those involving separable or block-separable objective functions [27]. Its key idea is to decompose a global optimization problem that is hard to solve into several smaller pieces that can be solved more easily. For our optimization problem, following the principle of ADMM, we first introduce two sets of auxiliary parameters, $\delta_{ij} = \beta_i - \beta_j$ and $\eta_i = \beta_i$, to reparameterize the objective function (5) as a constrained optimization problem:
$$\tilde{L}(\beta, \eta, \delta) = \frac{1}{2NT}\sum_{i=1}^N\sum_{t=1}^T(\tilde{y}_{it} - \tilde{x}_{it}^T\beta_i)^2 + \sum_{i<j} p_\gamma(\|\delta_{ij}\|_1) + I_{\mathcal{C}}(\eta) \quad \text{s.t. } \beta_i - \beta_j - \delta_{ij} = 0,\ \beta - \eta = 0,$$
where $\delta = (\delta_{ij}^T, i < j)^T$ and $\eta = (\eta_1^T, \ldots, \eta_N^T)^T$. To solve (6), we apply the ADMM algorithm to the following augmented Lagrangian function:
$$Q(\beta, \eta, \delta, \nu, \vartheta) = \tilde{L}(\beta, \eta, \delta) + \sum_{i<j}\langle\nu_{ij}, \beta_i - \beta_j - \delta_{ij}\rangle + \frac{\rho_2}{2}\sum_{i<j}\|\beta_i - \beta_j - \delta_{ij}\|^2 + \sum_{j=1}^N\langle\vartheta_j, \beta_j - \eta_j\rangle + \frac{\rho_1}{2}\sum_{j=1}^N\|\beta_j - \eta_j\|^2,$$
where $\nu = (\nu_{ij}^T, i < j)^T$ and $\vartheta = (\vartheta_1^T, \ldots, \vartheta_N^T)^T$ are Lagrange multipliers, and $\rho_1, \rho_2$ are two fixed augmentation parameters. The ADMM algorithm has the advantage of decomposing (7) into several manageable subproblems. Along this line, we consider alternating minimization over each block of parameters $\beta$, $\eta$, $\delta$, $\vartheta$ and $\nu$. Specifically, given the current values at the $m$th iteration, denoted as $(\beta^{(m)}, \eta^{(m)}, \delta^{(m)}, \vartheta^{(m)}, \nu^{(m)})$, the ADMM computation at the $(m+1)$th iteration proceeds as follows:
$$\beta^{(m+1)} = \arg\min_\beta Q(\beta, \eta^{(m)}, \delta^{(m)}, \vartheta^{(m)}, \nu^{(m)}),$$
$$\eta^{(m+1)} = \arg\min_\eta Q(\beta^{(m+1)}, \eta, \delta^{(m)}, \vartheta^{(m)}, \nu^{(m)}),$$
$$\delta^{(m+1)} = \arg\min_\delta Q(\beta^{(m+1)}, \eta^{(m+1)}, \delta, \vartheta^{(m)}, \nu^{(m)}),$$
$$\vartheta^{(m+1)} = \vartheta^{(m)} + \rho_1(\beta^{(m+1)} - \eta^{(m+1)}),$$
$$\nu^{(m+1)} = \nu^{(m)} + \rho_2(\Omega\beta^{(m+1)} - \delta^{(m+1)}),$$
where $\Omega$ is the pairwise-difference matrix defined below.
For the first minimization problem in (8), $\beta^{(m+1)}$ is the minimizer of
$$H_\rho(\beta) = \frac{1}{2NT}\sum_{i=1}^N\sum_{t=1}^T(\tilde{y}_{it} - \tilde{x}_{it}^T\beta_i)^2 + \frac{\rho_2}{2}\sum_{i<j}\|\beta_i - \beta_j - \delta_{ij}^{(m)} + \rho_2^{-1}\nu_{ij}^{(m)}\|^2 + \frac{\rho_1}{2}\sum_{j=1}^N\|\beta_j - \eta_j^{(m)} + \rho_1^{-1}\vartheta_j^{(m)}\|^2 + \text{constant}.$$
By simple algebraic manipulation, we can rewrite it in matrix form
$$H_\rho(\beta) = \frac{1}{2NT}\|\tilde{y} - \tilde{x}\beta\|^2 + \frac{\rho_2}{2}\|\Omega\beta - \delta^{(m)} + \rho_2^{-1}\nu^{(m)}\|^2 + \frac{\rho_1}{2}\|\beta - \eta^{(m)} + \rho_1^{-1}\vartheta^{(m)}\|^2 + \text{constant},$$
where $\tilde{y} = (\tilde{y}_1^T, \ldots, \tilde{y}_N^T)^T$ is an $NT \times 1$ vector with $\tilde{y}_i = (\tilde{y}_{i1}, \ldots, \tilde{y}_{iT})^T$, $\tilde{x} = \mathrm{diag}(\tilde{x}_1, \ldots, \tilde{x}_N)$ is an $NT \times Np$ block-diagonal matrix with $\tilde{x}_i = (\tilde{x}_{i1}, \ldots, \tilde{x}_{iT})^T$, and $\Omega := E \otimes I_p$ is an $\frac{N(N-1)}{2}p \times Np$ matrix, where $\otimes$ denotes the Kronecker product, $E = \{(e_i - e_j)^T, i < j\} \in \mathbb{R}^{\frac{N(N-1)}{2} \times N}$, and $e_i$ is the $N \times 1$ vector with one in the $i$th component and zeros elsewhere. Notably, the minimizer has a closed-form solution:
$$\beta^{(m+1)} = \Big(\frac{1}{NT}\tilde{x}^T\tilde{x} + \rho_1 I_{Np} + \rho_2\Omega^T\Omega\Big)^{-1}\Big(\frac{1}{NT}\tilde{x}^T\tilde{y} + \rho_1(\eta^{(m)} - \rho_1^{-1}\vartheta^{(m)}) + \rho_2\Omega^T(\delta^{(m)} - \rho_2^{-1}\nu^{(m)})\Big).$$
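A minimal R sketch of this closed-form update is given below; xt denotes the stacked block-diagonal design matrix, yt the demeaned response, and Omega the pairwise-difference matrix (all argument names are our own placeholders):

```r
## beta-update: solve the ridge-type linear system behind the closed form.
update_beta <- function(xt, yt, Omega, eta, delta, theta, nu, rho1, rho2, N, T) {
  Np <- ncol(xt)
  A  <- crossprod(xt) / (N * T) + rho1 * diag(Np) + rho2 * crossprod(Omega)
  b  <- crossprod(xt, yt) / (N * T) +
        rho1 * (eta - theta / rho1) +
        rho2 * t(Omega) %*% (delta - nu / rho2)
  drop(solve(A, b))
}
```

Since the system matrix does not change across iterations, its factorization can be computed once and reused, the usual device for ADMM with a quadratic loss.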
For $\eta^{(m+1)}$ in (9), the subproblem is actually a standard convex optimization problem, since we only need to minimize
$$\frac{\rho_1}{2}\|\xi_j^{(m)} - \eta_j\|^2 + I_{\mathcal{C}}(\eta_j),$$
where $\xi_j^{(m)} = \beta_j^{(m+1)} + \rho_1^{-1}\vartheta_j^{(m)}$. In practical implementation, we employ the quadratic programming routine solve.QP available from the R package quadprog.
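The following is a minimal sketch of this step for a polyhedral constraint set written as t(Amat) %*% eta >= bvec, with the first meq rows treated as equalities, which is exactly the interface of solve.QP; the wrapper name is an assumption of ours:

```r
## eta-update: project xi = beta^(m+1) + theta^(m)/rho1 onto C by solving
## min (rho1/2) * ||xi - eta||^2  s.t.  t(Amat) %*% eta >= bvec.
library(quadprog)

update_eta <- function(xi, Amat, bvec, meq, rho1) {
  p <- length(xi)
  sol <- solve.QP(Dmat = rho1 * diag(p), dvec = rho1 * xi,
                  Amat = Amat, bvec = bvec, meq = meq)
  sol$solution
}
```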
For $\delta^{(m+1)}$ in (10), by discarding the terms that are independent of $\delta$, we just need to minimize
$$\frac{\rho_2}{2}\|\zeta_{ij}^{(m)} - \delta_{ij}\|^2 + p_\gamma(\|\delta_{ij}\|_1),$$
where $\zeta_{ij}^{(m)} = \beta_i^{(m+1)} - \beta_j^{(m+1)} + \rho_2^{-1}\nu_{ij}^{(m)}$. For the MCP with $\gamma > 1/\rho_2$, the updated $\delta^{(m+1)}$ has the closed-form solution
$$\delta_{ij}^{(m+1)} = \begin{cases}\dfrac{S(\zeta_{ij}^{(m)}, \lambda/\rho_2)}{1 - 1/(\gamma\rho_2)}, & \text{if } \|\zeta_{ij}^{(m)}\| \le \gamma\lambda, \\[4pt] \zeta_{ij}^{(m)}, & \text{if } \|\zeta_{ij}^{(m)}\| > \gamma\lambda.\end{cases}$$
For the SCAD penalty with $\gamma > 1/\rho_2 + 1$, the update is given by
$$\delta_{ij}^{(m+1)} = \begin{cases}S(\zeta_{ij}^{(m)}, \lambda/\rho_2), & \text{if } \|\zeta_{ij}^{(m)}\| \le \lambda + \lambda/\rho_2, \\[4pt] \dfrac{S(\zeta_{ij}^{(m)}, \gamma\lambda/(\rho_2(\gamma-1)))}{1 - 1/((\gamma-1)\rho_2)}, & \text{if } \lambda + \lambda/\rho_2 < \|\zeta_{ij}^{(m)}\| \le \gamma\lambda, \\[4pt] \zeta_{ij}^{(m)}, & \text{if } \|\zeta_{ij}^{(m)}\| > \gamma\lambda,\end{cases}$$
where $S(z, t) = (1 - t/\|z\|)_+ z$ is the groupwise soft-thresholding operator, with $(x)_+ = x$ if $x > 0$ and $0$ otherwise.
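In R, the MCP case of this update can be sketched as follows (the SCAD case is analogous); here zeta is the vector $\zeta_{ij}^{(m)}$ and the norm in the soft-thresholding operator is taken to be the Euclidean norm, an assumption on our part:

```r
## Groupwise soft-thresholding operator S(z, t) = (1 - t/||z||)_+ z.
soft <- function(z, t) {
  nz <- sqrt(sum(z^2))
  if (nz == 0) return(z)
  max(0, 1 - t / nz) * z
}

## delta-update for the MCP penalty with gamma > 1/rho2.
update_delta_mcp <- function(zeta, lambda, gamma, rho2) {
  if (sqrt(sum(zeta^2)) <= gamma * lambda) {
    soft(zeta, lambda / rho2) / (1 - 1 / (gamma * rho2))  # shrink
  } else {
    zeta                                                  # leave untouched
  }
}
```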
As suggested by [27], a reasonable termination criterion for an ADMM algorithm is that both the primal and dual residuals are sufficiently small. In this paper, the primal and dual residuals are defined as
$$\|r_1^{(m+1)}\|^2 = \|\Omega\beta^{(m+1)} - \delta^{(m+1)}\|^2, \qquad \|r_2^{(m+1)}\|^2 = \|\beta^{(m+1)} - \eta^{(m+1)}\|^2,$$
and
$$\|s_1^{(m+1)}\|^2 = \|\rho_1(\eta^{(m+1)} - \eta^{(m)})\|^2, \qquad \|s_2^{(m+1)}\|^2 = \|\rho_2\Omega^T(\delta^{(m+1)} - \delta^{(m)})\|^2,$$
respectively. We stop the iteration when $\|r_k^{(m)}\| \le e_{\mathrm{primal}}$ for $k = 1, 2$ and $\|s_1^{(m)} + s_2^{(m)}\| \le e_{\mathrm{dual}}$, where $e_{\mathrm{primal}}$ and $e_{\mathrm{dual}}$ are pre-specified primal and dual feasibility tolerances. We summarize the above procedures in Algorithm 1.
Algorithm 1: ADMM algorithm for panel data models with prior constraints
[Pseudocode: initialize $(\beta^{(0)}, \eta^{(0)}, \delta^{(0)}, \vartheta^{(0)}, \nu^{(0)})$ as in Section 2.4; repeat the updates (8)-(12) until $\|r_k^{(m)}\| \le e_{\mathrm{primal}}$ for $k = 1, 2$ and $\|s_1^{(m)} + s_2^{(m)}\| \le e_{\mathrm{dual}}$; output $\hat{\beta} = \beta^{(m)}$.]

2.4. Initial Values and Tuning Parameter

In nonconvex optimization, selecting an appropriate initialization for the parameters is crucial, because a good starting point not only helps produce an optimal solution but also significantly accelerates the iterations. As proposed by [26], we set the initial values for $\beta$ to the estimators without the latent group structure, that is, the minimizers of
$$L_{NT}(\beta) = \frac{1}{2NT}\sum_{i=1}^N\sum_{t=1}^T(\tilde{y}_{it} - \tilde{x}_{it}^T\beta_i)^2.$$
Solving this optimization problem yields the least squares estimates
$$\beta_i^{(0)} = \Big(\frac{1}{T}\sum_{t=1}^T\tilde{x}_{it}\tilde{x}_{it}^T\Big)^{-1}\frac{1}{T}\sum_{t=1}^T\tilde{x}_{it}\tilde{y}_{it},$$
for $i = 1, \ldots, N$. Then, set $\delta_{ij}^{(0)} = \beta_i^{(0)} - \beta_j^{(0)}$, $\eta_i^{(0)} = \beta_i^{(0)}$, $\vartheta^{(0)} = 0$, and $\nu^{(0)} = 0$. The question remains how to determine the tuning parameters $\rho_1$, $\rho_2$ and $\lambda$. Following [29], we set $\rho_1 = \rho_2 = 1$ and $\lambda = 3$.
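A minimal R sketch of this initialization, assuming y is a T x N matrix and x a T x p x N array (our own data layout), is:

```r
## Individual-by-individual OLS on within-demeaned data, giving beta^(0).
init_beta <- function(x, y) {
  N <- dim(x)[3]; p <- dim(x)[2]
  beta0 <- matrix(0, N, p)
  for (i in 1:N) {
    xi <- scale(x[, , i], center = TRUE, scale = FALSE)  # tilde x_it
    yi <- y[, i] - mean(y[, i])                          # tilde y_it
    beta0[i, ] <- solve(crossprod(xi), crossprod(xi, yi))
  }
  beta0
}
```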
The selection of γ is particularly important, since the performance of group structures identification depends on γ . Following the strategy of [5], we minimize the following BIC-type criterion function:
$$IC(\gamma) = \log(\hat{\sigma}^2_{NT}(\gamma)) + p\hat{K}(\gamma)\psi_{NT},$$
where $\hat{\sigma}^2_{NT}(\gamma) = 2L_{NT}(\hat{\beta}(\gamma))$ is the mean of the squared residuals and $\hat{K}(\gamma)$ denotes the number of groups corresponding to $\gamma$. Here, the positive constant $\psi_{NT}$, which depends on $N$ and $T$, serves to balance the degrees of freedom against the model fitting errors. To select the optimal $\gamma$, define a grid of equally spaced points $\gamma_{\min} = \gamma_0 < \cdots < \gamma_W = \gamma_{\max}$. To speed up the computation, we employ a warm-start strategy: for each $\gamma_w \in \{\gamma_1, \ldots, \gamma_W\}$, we compute the solution $\hat{\beta}(\gamma_w)$ using $\hat{\beta}(\gamma_{w-1})$ as the initial value. The optimal value of $\gamma$ is defined as $\hat{\gamma} = \arg\min_{\gamma_w, w=0,\ldots,W} IC(\gamma_w)$.
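The warm-start search can be sketched in R as below; admm_fit is a hypothetical wrapper around Algorithm 1 returning the estimate beta, the implied number of groups K and the residual variance sigma2 (these names are placeholders, not the phpadmm interface):

```r
## BIC-type selection of gamma over a grid, warm-starting each fit
## from the solution at the previous grid point.
select_gamma <- function(gammas, beta0, p, psi_NT, admm_fit) {
  best <- list(ic = Inf); beta_init <- beta0
  for (g in gammas) {
    fit <- admm_fit(gamma = g, beta_init = beta_init)
    ic  <- log(fit$sigma2) + p * fit$K * psi_NT   # criterion (18)
    if (ic < best$ic) best <- list(ic = ic, gamma = g, fit = fit)
    beta_init <- fit$beta                         # warm start
  }
  best
}
```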

3. Theoretical Analysis

In this section, we discuss the convergence of the ADMM algorithm (Algorithm 1) and establish the asymptotic properties of the proposed estimator.

3.1. Convergence of the Algorithm

Under some mild conditions, the convergence of the ADMM algorithm can be guaranteed: for the iterated sequence $\{\beta^{(m)}, \eta^{(m)}, \delta^{(m)}, \vartheta^{(m)}, \nu^{(m)}\}$, there exists a global minimizer $\{\beta^*, \eta^*, \delta^*, \vartheta^*, \nu^*\}$ satisfying the first-order condition for a stationary point. The following proposition formalizes the convergence of Algorithm 1.
Proposition 1. 
Recall the primal and dual residuals defined in (16) and (17). It holds that $\lim_{m\to\infty}\|r_j^{(m)}\|^2 = 0$ for $j = 1, 2$ and $\lim_{m\to\infty}\|s_1^{(m)} + s_2^{(m)}\|^2 = 0$ for both the SCAD and MCP penalties. Therefore, the sequence $\{\beta^{(m)}, \eta^{(m)}, \delta^{(m)}, \vartheta^{(m)}, \nu^{(m)}\}_{m=1}^\infty$ converges to an optimal point.
Proposition 1 indicates that Algorithm 1 converges if both the primal feasibility and dual feasibility hold. The proof of this proposition follows a strategy similar to that in [29], and the detailed proof is postponed to the Appendix.

3.2. Asymptotic Property

We now establish the theoretical properties of the penalized estimator. First, we introduce some notations. Let $x_i = (x_{i1}, \ldots, x_{iT})^T$ and $y_i = (y_{i1}, \ldots, y_{iT})^T$. Let $G^{(K)} = (G_1^{(K)}, \ldots, G_K^{(K)})$ be an arbitrary $K$-partition of $\{1, \ldots, N\}$. We define
$$\hat{\sigma}^2_{G^{(K)}} = (NT)^{-1}\sum_{k=1}^K\sum_{i\in G_k^{(K)}}\sum_{t=1}^T\big(\tilde{y}_{it} - \tilde{x}_{it}^T\hat{\beta}_{i, G_k^{(K)}}\big)^2,$$
where $\hat{\beta}_{i, G_k^{(K)}} = \arg\min_{\beta_i}(N_kT)^{-1}\sum_{i\in G_k^{(K)}}\sum_{t=1}^T(\tilde{y}_{it} - \tilde{x}_{it}^T\beta_i)^2$. Denote by $N_k$ the number of elements in the $k$th group. Define $\rho(t) = \lambda^{-1}p_\gamma(t, \lambda)$ as the scaled penalty function and $\rho'(t)$ as its first-order derivative. We then introduce the following assumptions.
Assumption 1. 
(i) $\{(x_{it}, y_{it}), t = 1, \ldots, T\}$ is stationary strong mixing for each $i$, with mixing coefficients $\alpha_i(\cdot)$, and $\alpha(\cdot) = \max_i\alpha_i(\cdot)$ satisfies $\alpha(\tau) \le c_\alpha\rho^\tau$ for some constants $c_\alpha > 0$ and $\rho \in (0, 1)$. (ii) The random variables $\{x_i, y_i\}_{i=1}^N$ are independent, and $\max_{i,t}E\|x_{it}\|^{2q} < c_1$ for some constant $c_1 < \infty$ and $q > 4$. (iii) There exist two positive constants $c_2$ and $c_3$ such that $\min_{1\le k\le K}\mu_{\min}\big(\frac{1}{TN_k}\sum_{i\in G_k^0}E(\tilde{x}_i^T\tilde{x}_i)\big) \ge c_2$ and $\max_{1\le i\le N}\mu_{\max}\big(\frac{1}{T}E(x_i^Tx_i)\big) \le c_3$. (iv) The error term satisfies $E(\epsilon_{it}) = 0$, $E(x_{it}\epsilon_{it}) = 0$ and $\max_{i,t}E|\epsilon_{it}|^{2q} < c_1$. (v) $N_k$ either tends to infinity or remains fixed as $T \to \infty$, and $N = O(T^2)$.
Assumption 2. 
(i) The function $\rho(t)$ is symmetric, non-decreasing and concave on $[0, \infty)$. It satisfies $\rho(0) = 0$ and is constant for all $t \ge a\lambda$ for some constant $a > 0$. (ii) The derivative $\rho'(t)$ satisfies $\rho'(0+) = 1$ and is continuous except at a finite number of values of $t$.
Assumption 3. 
(i) There exist positive definite matrices $\Omega_k$ for $k = 1, \ldots, K$ such that $\frac{1}{TN_k}\sum_{i\in G_k^0}\sum_{t=1}^T\tilde{x}_{it}\tilde{x}_{it}^T \xrightarrow{p} \Omega_k$ as $T \to \infty$ or $(N_k, T) \to \infty$. (ii) Let $B_k = \frac{1}{\sqrt{N_kT}}\sum_{i\in G_k^0}\sum_{t=1}^T E(\tilde{x}_{it}\epsilon_{it})$. Then $\frac{1}{\sqrt{N_kT}}\sum_{i\in G_k^0}\sum_{t=1}^T\tilde{x}_{it}\epsilon_{it} - B_k \xrightarrow{d} N(0, \Psi_k)$ as $T \to \infty$ or $(N_k, T) \to \infty$, where $B_k$ equals $O_p(\sqrt{N_k/T})$ if $\tilde{x}_{it}$ is not exogenous and $0$ otherwise.
Assumption 4. 
$$\min_{1\le \hat{K}<K}\,\inf_{G^{(\hat{K})}}\hat{\sigma}^2_{G^{(\hat{K})}} \xrightarrow{p} \underline{\sigma}^2 > \sigma_0^2 \text{ as } (N,T)\to\infty, \text{ where } \sigma_0^2 = \lim_{(N,T)\to\infty}(NT)^{-1}\sum_{k=1}^{K_0}\sum_{i\in G_k^0}\sum_{t=1}^T E(\tilde{y}_{it} - \tilde{x}_{it}^T\beta_i^0)^2.$$
Assumption 5. 
As $(N, T) \to \infty$, $\psi_{NT} \to 0$ and $\psi_{NT}T \to \infty$.
Assumption 1(i,ii) imposes some conditions on $\{(x_{it}, y_{it})\}$ that are quite common in panel data models. Assumption 1(iii,iv) specifies moment conditions on the regressors $x_{it}$ and the noise term $\epsilon_{it}$, respectively. Assumption 1(v) allows the number of elements in each group to be finite, which distinguishes it from the conditions set forth in [5]. Assumption 2 is standard for concave penalty functions, such as SCAD and MCP. Assumption 3 is utilized to study the asymptotic normality of the proposed estimators. Finally, Assumptions 4 and 5 are adapted from [5] to establish the classification consistency of the information criterion in (18).
Once β ^ is given, the estimated group pattern can be directly derived by classifying the coefficients β ^ i ’s into groups. We denote K ^ as the estimated number of groups and α ^ k as the common slope shared by the kth group for k = 1 , , K ^ . By definition, the estimated group pattern G ^ = ( G ^ 1 , , G ^ K ^ ) is given by
$$\hat{G}_k = \{i \in \{1, \ldots, N\} : \hat{\beta}_i = \hat{\alpha}_k\}, \quad \text{for } k = 1, \ldots, \hat{K}.$$
We then have the following limiting distribution for α ^ k .
Theorem 1. 
Suppose that Assumptions 1-3 hold. Then,
$$\sqrt{N_kT}(\hat{\alpha}_k - \alpha_k^0) - \Omega_k^{-1}B_k \xrightarrow{d} N(0, \Omega_k^{-1}\Psi_k\Omega_k^{-1}),$$
for $k = 1, \ldots, K$, as $T \to \infty$ or $(N_k, T) \to \infty$.
Theorem 1 demonstrates that the group-specific estimator possesses the asymptotic normality property with a convergence rate of $\sqrt{N_kT}$. If the group membership were known, the oracle estimator for $\alpha$ would be given by
$$\hat{\alpha}_k^{or} = \Big(\sum_{i\in G_k^0}\tilde{x}_i^T\tilde{x}_i\Big)^{-1}\sum_{i\in G_k^0}\tilde{x}_i^T\tilde{y}_i, \quad \text{for } k = 1, \ldots, K.$$
As pointed out by [5,26], $\hat{\alpha}_k^{or}$ and $\hat{\alpha}_k$ share similar asymptotic normality. Given the estimated grouping structure $\hat{G}$, a post-Lasso version of $\alpha$ is obtained by
$$\hat{\alpha}_{\hat{G}_k} = \Big(\sum_{i\in\hat{G}_k}\tilde{x}_i^T\tilde{x}_i\Big)^{-1}\sum_{i\in\hat{G}_k}\tilde{x}_i^T\tilde{y}_i, \quad \text{for } k = 1, \ldots, \hat{K}.$$
The post-Lasso estimators perform at least as well as the Lasso estimators in terms of convergence rate, but they exhibit a smaller second-order bias, as shown by [32]. It would be interesting to investigate the higher-order asymptotic properties of the post-Lasso estimators within our framework.
We utilize the information criterion defined in (18) to determine the number of groups. The following theorem justifies its asymptotic validity.
Theorem 2. 
Suppose that Assumptions 1-2 and 4-5 hold. The information criterion in (18) will select an optimal tuning parameter $\gamma_{opt}$ such that the estimate $\hat{\beta}(\gamma_{opt})$ approaches the oracle estimator $\hat{\beta}^{or}$ with probability approaching 1, and
$$P(\hat{K}(\gamma_{opt}) = K) \to 1, \quad \text{as } (N, T) \to \infty.$$
Theorem 2 indicates that, under some mild conditions, the minimizer of (18) can only be one that produces the correct number of groups as $N$ or $T$ (or both) go to infinity. Since $\psi_{NT}$ in the IC (18) plays a crucial role in determining the latent group structures, we follow [5] and pick $\psi_{NT}$ by trying a list of candidates. Similar to [26], we find that $\psi_{NT} = \frac{1}{2\sqrt{NT}}$ has fairly good finite sample performance. The same setting is also applied in the empirical study.

4. Simulation Studies

In this section, we conduct a series of simulation studies to assess the finite sample performance of the proposed method. We consider three data generating processes (DGPs), described below.

4.1. Basic Setup

DGP 1. The first data setting is similar to DGP 1 in [5]. Here, the observations $\{(y_{it}, x_{it}), i = 1, \ldots, N, t = 1, \ldots, T\}$ are generated from a linear panel model where $x_{it} = (0.2\mu_i + e_{it1}, 0.2\mu_i + e_{it2})^T$, with the fixed effects $\mu_i$ following an i.i.d. standard normal distribution across individuals. The error terms $e_{it1}$ and $e_{it2}$ are i.i.d. standard normal and independent of $\mu_i$. The slopes of the individuals are divided into three groups with ratio $N_1 : N_2 : N_3 = 4 : 3 : 3$. The true coefficients are
$$B_a = \begin{pmatrix}0.4 & 1 & 1.6 \\ 1.6 & 1 & 0.4\end{pmatrix}^T.$$
DGP 2. The second setting is similar to DGP 3 in [26]. In this setting, the coefficients are divided into eight groups, where the first group accounts for 30% of the total individuals and each of the remaining seven groups contains 10% of the individuals. The generation of $x_{it}$ is similar to DGP 1. The group-specific coefficients take the values
$$B_b = \begin{pmatrix}-4 & -3 & -2 & -1 & 1 & 2 & 3 & 4 \\ -4 & -3 & -2 & -1 & 1 & 2 & 3 & 4\end{pmatrix}^T.$$
DGP 3. In the third DGP, we let $x_{it} = (x_{it1}, x_{it2}, x_{it3})^T$ with $x_{itk} = 0.2\mu_i + e_{itk}$ for $k = 1, 2, 3$. Individuals are also divided into three groups with ratio $N_1 : N_2 : N_3 = 3 : 4 : 3$. The true parameter values for each group are
$$B_c = \begin{pmatrix}0.6 & 0.6 & 0.6 \\ 1.5 & 1 & 0.5 \\ 1 & 0 & 1\end{pmatrix}^T.$$
The constraint set for each DGP includes equality and/or inequality constraints, which are described as follows.
(i)
(equality constraint:) For B a , the sum of coefficients for each individual is the same. For  B b , the sum of all elements equals zero, and for B c , the coefficients for the first regressor remain constant across all individuals.
(ii)
(inequality constraint:) For B a , all elements in the coefficient matrix are required to be greater than zero.
Hence, the constrained region for $\beta_a := \mathrm{vec}(B_a)$ is
$$\mathcal{C}_a = \{\beta_a : R_{ae}^T\beta_a = 0,\ R_{ai}^T\beta_a > 0_6\},$$
where $R_{ae} = \big((1_2^T, -1_2^T, 0_2^T)^T, (0_2^T, 1_2^T, -1_2^T)^T\big)$ and $R_{ai} = I_6$. For $\beta_b := \mathrm{vec}(B_b)$,
$$\mathcal{C}_b = \{\beta_b : R_{be}^T\beta_b = 0\},$$
where $R_{be} = 1_{16}$. For $\beta_c := \mathrm{vec}(B_c)$,
$$\mathcal{C}_c = \{\beta_c : R_c^T\beta_c = 0_2\},$$
where $R_c = \big((1, 0_2^T, -1, 0_5^T)^T, (0_3^T, 1, 0_2^T, -1, 0_2^T)^T\big)$.
Throughout these DGPs, we consider individual sizes $N = 100, 200$ and time spans $T = 10, 20$ and $40$. The error terms $\epsilon_{it}$ are drawn from a standard normal distribution, independent across $i$ and $t$, and also independent of all regressors. The fixed effects $\mu_i$ and the error terms $\epsilon_{it}$ are mutually independent. For each case, we conduct 300 replications. For comparison, we also report the results obtained without prior information, following [5]. A small R sketch of DGP 1 is given below.
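The following minimal sketch simulates DGP 1 under the conventions above (the function name and data layout are our own):

```r
## Simulate DGP 1: three groups with ratio 4:3:3 and coefficients B_a.
sim_dgp1 <- function(N = 100, T = 10) {
  Ba  <- rbind(c(0.4, 1.6), c(1, 1), c(1.6, 0.4))  # group-specific slopes
  grp <- rep(1:3, times = round(N * c(0.4, 0.3, 0.3)))
  mu  <- rnorm(N)                                  # fixed effects
  y <- x1 <- x2 <- matrix(0, T, N)
  for (i in 1:N) {
    x1[, i] <- 0.2 * mu[i] + rnorm(T)
    x2[, i] <- 0.2 * mu[i] + rnorm(T)
    y[, i]  <- mu[i] + Ba[grp[i], 1] * x1[, i] +
               Ba[grp[i], 2] * x2[, i] + rnorm(T)
  }
  list(y = y, x1 = x1, x2 = x2, group = grp)
}
```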

4.2. Simulation Results

We present the frequency with which a specific number of groups is obtained across 300 replications for all DGPs. While the accurate determination of the number of groups is undoubtedly crucial, this metric alone does not provide insight into the degree of similarity between the estimated groups and the true groups. Following [6], we employ the normalized mutual information (NMI) measure to assess the similarity between the estimated group structure G ^ and the true group structure G 0 . Specifically, for two sets of disjoint clusters A = { A 1 , A 2 , } and B = { B 1 , B 2 , } on the same set { 1 , , N } , the NMI is defined as
$$\mathrm{NMI}(A, B) = \frac{I(A; B)}{[H(A) + H(B)]/2},$$
where
$$I(A; B) = \sum_{i,j}\frac{|A_i \cap B_j|}{N}\log\Big(\frac{N|A_i \cap B_j|}{|A_i|\cdot|B_j|}\Big) \quad \text{and} \quad H(A) = -\sum_i\frac{|A_i|}{N}\log\frac{|A_i|}{N},$$
where $|A|$ denotes the cardinality of the set $A$. Obviously, when the two groupings are exactly the same, i.e., $A = B$, we have $I(A; B) = H(A) = H(B)$ and $\mathrm{NMI}(A, B) = 1$. In general, the NMI measure takes values in $[0, 1]$, and a larger NMI implies that the two group structures are closer. To measure the estimation accuracy for the slope parameters, we use the average model error, defined as $\frac{1}{N}(\hat{\beta} - \beta^0)^T(\hat{\beta} - \beta^0)$.
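Computing the NMI from two membership vectors is straightforward; a minimal R sketch (our own helper, not a package function) is:

```r
## NMI between two group labelings a and b (integer vectors of length N).
nmi <- function(a, b) {
  tab <- table(a, b) / length(a)            # joint frequencies |A_i n B_j| / N
  pa  <- rowSums(tab); pb <- colSums(tab)   # marginal frequencies
  r   <- tab * log(tab / outer(pa, pb))     # summands of I(A; B)
  I   <- sum(r[tab > 0])                    # skip empty cells (0 * log 0 := 0)
  H   <- function(p) -sum(p * log(p))       # entropy
  I / ((H(pa) + H(pb)) / 2)
}
```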
Figure 1 reports the classification results across DGPs 1-3 for different combinations of $N$ and $T$. The yellow and green boxes represent the detection results with and without prior constraints, respectively. The figure shows the NMI between the estimated group structure $\hat{G}$ and the true group structure $G^0$, and suggests that, as $N$ or $T$ increases, the NMI increases rapidly for both methods. More importantly, our method exhibits higher NMI values across all DGPs, which implies that incorporating prior information can substantially improve the classification performance. This phenomenon is particularly pronounced when $T$ is small. We display the average model estimation error in Figure 2. As expected, the average model estimation error decreases as $N$ or $T$ increases. The proposed parameter estimators clearly have smaller estimation biases, since the prior constraint information yields extra efficiency in homogeneity detection.
Next, given the number of groups, we focus on the point estimation of the post-Lasso estimator. To save space, we report only the simulation results for the first group-specific parameter of each DGP in Table 3, based on $N = 200$ individuals. The oracle estimators, which assume that the latent group structure is known, are also included for comparison. As shown in the table, the biases of the three types of estimators become negligible as $T$ increases, indicating that all the estimators are consistent. Additionally, the standard deviations (SD) are close to their corresponding estimated standard errors (ESE), and the empirical coverage probabilities (ECP) converge to the nominal level of 95%. This validates the asymptotic normality of the proposed estimator, as demonstrated in Theorem 1. The no-prior estimators yield relatively smaller ECPs and larger biases compared to the other two types of estimators, especially when $T$ is small. Undoubtedly, the oracle estimators perform reasonably well, and our method produces results closer to those of the oracle estimators. In summary, incorporating prior information significantly improves the finite sample performance in terms of estimation efficiency, particularly when the time span is small.

5. Empirical Analysis

With the deepening of China’s reform and opening up, the country has undergone profound changes over the past few decades, becoming the second-largest economic entity and the second-largest power consumer in the world. It is well known that electricity plays a crucial role in both human activity and industrial production, including sectors such as communication and engineering. Thus, electricity serves as a solid foundation for economic development. Consequently, many researchers have focused on the relationship between electricity consumption and economic growth. For example, ref. [33] found that real GDP and electricity consumption in China are cointegrated using data from 1971 to 2004. Additionally, ref. [34] investigated the relationship between electricity consumption and economic growth through a panel data analysis. China’s vast territory encompasses 34 provinces, municipalities, and autonomous regions, creating a significant domestic market. This paper primarily focuses on the homogeneous effect of electricity consumption on GDP across these provinces.
GDP is the most closely monitored statistical index in macroeconomics and serves as an important measure of economic development. At the beginning of each year, the Chinese government releases the provincial GDP data for the previous year. The electricity consumption data used in this study are derived from the Li Keqiang Index (https://en.wikipedia.org/wiki/Li_Keqiang_index#) (accessed on 1 February 2025), which has been cited by The Economist as a reliable measure of China's economy. This index incorporates three indicators (electricity consumption, railway cargo volume, and bank lending) that reflect the operation of China's economy more accurately. Due to significant data gaps and challenges in data collection, the Tibet Autonomous Region, Hong Kong, Macao, and Taiwan are excluded from this analysis. Therefore, the study includes a total of 30 provinces, with observations spanning from 2001 to 2019, resulting in $N = 30$ and $T = 19$. All data utilized in this study are sourced from the China Statistical Yearbook Database, available at https://www.stats.gov.cn/sj/ndsj (accessed on 1 February 2025).
We first examine the existence of a homogeneous effect across provinces using a residual-based bootstrapping procedure. The detailed algorithm can be found in the Appendix. The resulting p-value is very close to zero, indicating the statistical significance of the homogeneous effect. Moreover, numerous empirical analyses suggest a positive relationship between GDP and electricity consumption. Therefore, we assume that the coefficients for all provinces are positive as prior information. To investigate this hidden homogeneity, we consider the following linear fixed effects model with a latent group structure:
$$y_{it} = \mu_i + x_{it}\beta_i + \epsilon_{it}, \quad i = 1, \ldots, N;\ t = 1, \ldots, T, \quad \text{s.t. } \beta_i > 0,$$
where $y_{it}$ is the logarithm of GDP for the $i$th province in the $t$th year, $x_{it}$ represents electricity consumption, $\mu_i$ characterizes the fixed effect, and $\epsilon_{it}$ is the error term. The prior information here takes the form of an inequality constraint on the parameters $\beta_i$, i.e., $\mathcal{C} = \{\beta : \beta_i > 0, i = 1, \ldots, N\}$.
Table 4 presents the classification results using two different approaches: the proposed estimator incorporating the inequality constraint $\mathcal{C}$, and the PAGFL estimator proposed by [15], which does not account for the prior constraint. Using the proposed ADMM algorithm described in Algorithm 1, the provinces are divided into two groups. As shown in the table, Group 1 includes provinces such as Anhui, Fujian, Guangdong, Guangxi, Hainan, Hebei, Henan, Hubei, Hunan, Jiangsu, Liaoning, Inner Mongolia, Shandong, Shanxi, Shanghai, Sichuan, Xinjiang, Yunnan, and Zhejiang, with an estimated coefficient of 1.1692. Group 2 comprises Beijing, Gansu, Guizhou, Heilongjiang, Jilin, Jiangxi, Ningxia, Qinghai, Shaanxi, Tianjin, and Chongqing, with a significantly higher coefficient of 3.4770.
Group 1 consists largely of provinces from the Eastern and Central regions, characterized by advanced industrialization, technological innovation, and a strong emphasis on services and capital investment. In these regions, the direct relationship between electricity consumption and economic growth is weaker, as economic growth is increasingly driven by non-energy-intensive sectors. While electricity consumption remains significant, its marginal impact on GDP growth is reduced due to energy efficiency improvements and shifts toward high-value-added industries. Group 2, primarily consisting of provinces from the Western and Northeastern regions, shows a stronger dependency of economic growth on electricity consumption. These regions are still in the process of industrialization, infrastructure development, and urbanization, making electricity consumption a crucial driver of economic growth. Beijing, Tianjin, and Chongqing are notable inclusions in Group 2 despite being considered economically advanced. This classification can be explained by the ongoing infrastructure expansion and high-tech investments in these cities, which require substantial energy inputs. Beijing and Tianjin, while economically developed, are still heavily investing in energy-demanding sectors such as smart technologies, urban infrastructure, and research and development. Chongqing, traditionally an industrial city, continues to undergo substantial industrialization and urban expansion, further increasing its reliance on electricity consumption for growth. This observation is consistent with the findings of [35,36], who reported that in Western regions, energy consumption has a significant causal relationship with economic growth, while in Eastern and Central regions, this relationship is less pronounced. This pattern highlights the varying roles of electricity consumption in different stages of industrialization across China's provinces, where more industrialized areas show diminishing returns from energy use, and less industrialized areas remain heavily reliant on electricity to drive economic growth.
However, if we do not incorporate the prior information, the provinces are classified into five groups, consisting of 2, 1, 1, 15 and 11 province(s), with corresponding group-specific coefficients of 0.6935, 0.8715, 0.9837, 1.0879 and 3.4351, respectively. This suggests that imposing the prior information helps to prevent overly divergent groupings, leading to more stable classifications. We also calculate the mean squared prediction errors (MSPE) of the two methods. Specifically, the observations of all individuals in the first 12 years are used as the training set, while the observations in the subsequent 7 years serve as the validation set. Based on the identified latent structures, we calculate the MSPE as
$$\mathrm{MSPE} = \frac{1}{N|T_{\mathrm{val}}|}\sum_{i=1}^N\sum_{t\in T_{\mathrm{val}}}(\hat{y}_{it} - y_{it})^2,$$
where $T_{\mathrm{val}} = \{13, \ldots, 19\}$, $y_{it}$ is the true observation and $\hat{y}_{it}$ is the predicted value. As a result, the MSPE of our estimator is 0.8396, while the MSPE without prior information is 3.9443, much larger than that of our method. The reduced prediction error illustrates that incorporating prior constraint information can improve predictive performance in latent structure identification for panel models.
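The MSPE computation itself is a one-liner; a minimal R sketch, assuming yhat and y are T x N matrices of predictions and observations, is:

```r
## Mean squared prediction error over the validation years 13-19.
mspe <- function(yhat, y, t_val = 13:19) {
  mean((yhat[t_val, ] - y[t_val, ])^2)   # averages over N * |T_val| terms
}
```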

6. Discussion

Identifying latent group structures in panel data analysis has recently attracted significant research interest. In this paper, we explore a pairwise fusion approach that incorporates prior constraint information to identify the group pattern, and we design an efficient ADMM algorithm to solve the optimization problem. Simulation studies and a real data application show that the proposed estimators outperform existing approaches in terms of both detection accuracy and predictive performance. Our work still has limitations; for example, it is not always clear how to obtain prior constraint information in practical data analysis. Moreover, our asymptotic results are built on the condition that the prior information is correctly specified; their behavior when the prior information is misspecified remains to be explored.
There are some interesting issues that warrant further research. First, we only consider the commonly used linear panel model with individual fixed effects. It is possible to adapt our proposed estimation framework to other panel models with latent group structures, such as panel models with endogenous regressors [5] or nonlinear panel models [19]; the specific objective function and estimation routine would need to be re-examined. Second, the selection consistency property in Theorem 2 assumes that the true number of groups $K_0$ is fixed. It is appealing to consider a general framework that allows $K_0$ to increase with the sample size, although this may bring additional technical difficulties. Third, it is reasonable to allow for the presence of cross-section dependence in the current models. For such cases, utilizing the factor structure [37] or interactive fixed effects [38] is potentially feasible but would require specific efforts. We leave these issues as future research topics.

Author Contributions

Conceptualization, Y.L. and X.L.; data curation, M.L.; formal analysis, X.L.; investigation, Y.L.; methodology, M.L.; project administration, Y.L.; software, M.L.; supervision, Y.L.; validation, X.L.; visualization, Y.L.; writing (original draft), X.L. and M.L.; writing (review and editing), Y.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Natural Science Foundation of Hunan Province (No. 2023JJ40453), the Scientific Research Project of the Education Department of Hunan Province (No. 23B0086), and the National Natural Science Foundation of China (No. 72374071).

Data Availability Statement

Data are available in a publicly accessible repository: https://www.stats.gov.cn/sj/ndsj (accessed on 14 March 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Appendix A.1

This appendix mainly contains the technical proofs and the test procedure for heterogeneity.

Appendix B. Proofs

Appendix B.1

Proof of Proposition 1. 
By (11) and (12),
$$Q(\beta^{(m+1)}, \eta^{(m+1)}, \delta^{(m+1)}, \nu^{(m+1)}, \vartheta^{(m+1)}) - Q(\beta^{(m+1)}, \eta^{(m+1)}, \delta^{(m+1)}, \nu^{(m)}, \vartheta^{(m)}) = \rho_1^{-1}\|\vartheta^{(m+1)} - \vartheta^{(m)}\|^2 + \rho_2^{-1}\|\nu^{(m+1)} - \nu^{(m)}\|^2.$$
Then, for any $\delta$ and $\eta$, we have
$$Q(\beta^{(m+1)}, \eta^{(m+1)}, \delta^{(m+1)}, \nu^{(m)}, \vartheta^{(m)}) \le Q(\beta^{(m+1)}, \eta, \delta, \nu^{(m)}, \vartheta^{(m)}),$$
from the minimization steps (9) and (10). Define the constraint set $S(\eta, \delta) = \{(\eta, \delta) : \beta^{(m+1)} - \eta = 0;\ \Omega\beta^{(m+1)} - \delta = 0\}$. Define
$$f^{(m+1)} = \inf_{S(\eta,\delta)}\Big\{\frac{1}{2NT}\|\tilde{y} - \tilde{x}\beta^{(m+1)}\|^2 + \sum_{i<j}p_\gamma(\|\delta_{ij}\|, \lambda) + I_{\mathcal{C}}(\eta)\Big\} = \inf_{S(\eta,\delta)} Q(\beta^{(m+1)}, \eta, \delta, \nu^{(m)}, \vartheta^{(m)}).$$
Hence,
$$Q(\beta^{(m+1)}, \eta^{(m+1)}, \delta^{(m+1)}, \nu^{(m)}, \vartheta^{(m)}) \le f^{(m+1)}.$$
Let $t$ be an arbitrary positive integer. From (11) and (12), we obtain
$$\vartheta^{(m+t-1)} = \vartheta^{(m)} + \rho_1\sum_{i=1}^{t-1}(\beta^{(m+i)} - \eta^{(m+i)}), \qquad \nu^{(m+t-1)} = \nu^{(m)} + \rho_2\sum_{i=1}^{t-1}(\Omega\beta^{(m+i)} - \delta^{(m+i)}).$$
It holds that
$$\begin{aligned} &Q(\beta^{(m+t)}, \eta^{(m+t)}, \delta^{(m+t)}, \nu^{(m+t-1)}, \vartheta^{(m+t-1)}) \\ &= \frac{1}{2NT}\|\tilde{y} - \tilde{x}\beta^{(m+t)}\|^2 + p_\gamma(|\delta^{(m+t)}|, \lambda) + I_{\mathcal{C}}(\eta^{(m+t)}) + \nu^{(m+t-1)T}(\Omega\beta^{(m+t)} - \delta^{(m+t)}) + \frac{\rho_2}{2}\|\Omega\beta^{(m+t)} - \delta^{(m+t)}\|^2 \\ &\quad + \vartheta^{(m+t-1)T}(\beta^{(m+t)} - \eta^{(m+t)}) + \frac{\rho_1}{2}\|\beta^{(m+t)} - \eta^{(m+t)}\|^2 \\ &= \frac{1}{2NT}\|\tilde{y} - \tilde{x}\beta^{(m+t)}\|^2 + p_\gamma(|\delta^{(m+t)}|, \lambda) + I_{\mathcal{C}}(\eta^{(m+t)}) + \nu^{(m)T}(\Omega\beta^{(m+t)} - \delta^{(m+t)}) + \rho_2\sum_{i=1}^{t-1}\|\Omega\beta^{(m+i)} - \delta^{(m+i)}\|^2 + \frac{\rho_2}{2}\|\Omega\beta^{(m+t)} - \delta^{(m+t)}\|^2 \\ &\quad + \vartheta^{(m)T}(\beta^{(m+t)} - \eta^{(m+t)}) + \rho_1\sum_{i=1}^{t-1}\|\beta^{(m+i)} - \eta^{(m+i)}\|^2 + \frac{\rho_1}{2}\|\beta^{(m+t)} - \eta^{(m+t)}\|^2 \\ &\le f^{(m+t)}. \end{aligned}$$
Note that $Q(\beta, \eta^{(m)}, \delta^{(m)}, \nu^{(m)}, \vartheta^{(m)})$ is strongly convex and differentiable with respect to $\beta$, as the Hessian matrix $(\frac{1}{NT}\tilde{x}^T\tilde{x} + \rho_1 I_{Np} + \rho_2\Omega^T\Omega)$ is positive definite. Together with the fact that $Q(\beta, \eta, \delta, \nu, \vartheta)$ is convex in $(\delta, \eta)$, the sequence $\{\beta^{(m)}, \eta^{(m)}, \delta^{(m)}, \vartheta^{(m)}, \nu^{(m)}\}_{m=1}^\infty$ converges to a stationary point $\{\beta^*, \eta^*, \delta^*, \vartheta^*, \nu^*\}$ by directly applying Theorem 4.1 of [39]. Thus we have
$$\Omega\beta^* - \delta^* = 0, \qquad \beta^* - \eta^* = 0, \quad \text{(A3)}$$
and
$$f^* = \lim_{m\to\infty} f^{(m+1)} = \lim_{m\to\infty} f^{(m+t)} = \inf_{S(\eta,\delta)}\Big\{\frac{1}{2NT}\|\tilde{y} - \tilde{x}\beta^*\|^2 + p_\gamma(|\delta^*|, \lambda) + I_{\mathcal{C}}(\eta^*)\Big\}.$$
Moreover, for any $t \ge 0$, we have
$$\begin{aligned} \lim_{m\to\infty} Q(\beta^{(m+t)}, \eta^{(m+t)}, \delta^{(m+t)}, \nu^{(m+t-1)}, \vartheta^{(m+t-1)}) &= \frac{1}{2NT}\|\tilde{y} - \tilde{x}\beta^*\|^2 + p_\gamma(|\delta^*|, \lambda) + I_{\mathcal{C}}(\eta^*) + \Big(t - \frac{1}{2}\Big)\rho_2\|\Omega\beta^* - \delta^*\|^2 \\ &\quad + \lim_{m\to\infty}\nu^{(m)T}(\Omega\beta^* - \delta^*) + \lim_{m\to\infty}\vartheta^{(m)T}(\beta^* - \eta^*) + \Big(t - \frac{1}{2}\Big)\rho_1\|\beta^* - \eta^*\|^2 \\ &\le f^*. \end{aligned}$$
Hence,
$$\lim_{m\to\infty}\|r_1^{(m)}\|^2 = \|r_1^*\|^2 = \|\Omega\beta^* - \delta^*\|^2 = 0, \qquad \lim_{m\to\infty}\|r_2^{(m)}\|^2 = \|r_2^*\|^2 = \|\beta^* - \eta^*\|^2 = 0. \quad \text{(A4)}$$
Since $Q(\beta, \eta^{(m)}, \delta^{(m)}, \nu^{(m)}, \vartheta^{(m)})$ is convex in $\beta$, it then follows that
$$\begin{aligned} \partial Q(\beta, \eta^{(m)}, \delta^{(m)}, \nu^{(m)}, \vartheta^{(m)})/\partial\beta\big|_{\beta=\beta^{(m+1)}} &= -\frac{1}{NT}\tilde{x}^T(\tilde{y} - \tilde{x}\beta^{(m+1)}) + \Omega^T\nu^{(m)} + \rho_2\Omega^T(\Omega\beta^{(m+1)} - \delta^{(m)}) + \vartheta^{(m)} + \rho_1(\beta^{(m+1)} - \eta^{(m)}) \\ &= -\frac{1}{NT}\tilde{x}^T(\tilde{y} - \tilde{x}\beta^{(m+1)}) + \Omega^T\big(\nu^{(m+1)} - \rho_2(\Omega\beta^{(m+1)} - \delta^{(m+1)}) + \rho_2(\Omega\beta^{(m+1)} - \delta^{(m)})\big) \\ &\quad + \vartheta^{(m+1)} - \rho_1(\beta^{(m+1)} - \eta^{(m+1)}) + \rho_1(\beta^{(m+1)} - \eta^{(m)}) \\ &= -\frac{1}{NT}\tilde{x}^T(\tilde{y} - \tilde{x}\beta^{(m+1)}) + \Omega^T\nu^{(m+1)} + \rho_2\Omega^T(\delta^{(m+1)} - \delta^{(m)}) + \vartheta^{(m+1)} + \rho_1(\eta^{(m+1)} - \eta^{(m)}) \\ &= 0. \end{aligned}$$
Therefore,
$$s_1^{(m+1)} + s_2^{(m+1)} = \frac{1}{NT}\tilde{x}^T(\tilde{y} - \tilde{x}\beta^{(m+1)}) - \Omega^T\nu^{(m+1)} - \vartheta^{(m+1)}.$$
Since $\|\Omega\beta^* - \delta^*\|^2 = 0$ and $\|\beta^* - \eta^*\|^2 = 0$ from (A3), we have
$$\lim_{m\to\infty}\partial Q(\beta, \eta^{(m)}, \delta^{(m)}, \nu^{(m)}, \vartheta^{(m)})/\partial\beta\big|_{\beta=\beta^{(m+1)}} = -\frac{1}{NT}\tilde{x}^T(\tilde{y} - \tilde{x}\beta^*) + \Omega^T\nu^* + \vartheta^* = 0.$$
So $\lim_{m\to\infty}(s_1^{(m+1)} + s_2^{(m+1)}) = 0$. Combining this with (A4), the proof of Proposition 1 is completed. □

Appendix B.2

Proof of Theorem 1.
We divide the proof of Theorem 1 into several steps:
Step 1: Define $\mathcal{M}_G$ as the restricted subspace of $\mathbb{R}^{Np}$ where
$$\mathcal{M}_G = \{\beta \in \mathcal{C} : \beta_i = \beta_j \text{ for any } i, j \in G_k^0,\ 1 \le k \le K\}. \quad \text{(A5)}$$
Suppose that the true latent structure $G^0$ is known; then the oracle estimator for $\beta$ is
$$\hat{\beta}^{or} = \arg\min_{\beta\in\mathcal{M}_{G^0}}\frac{1}{NT}\|\tilde{y} - \tilde{x}\beta\|^2,$$
and the oracle estimator for $\alpha$, denoted by $\hat{\alpha}^{or}$, is hence given by
$$\hat{\alpha}_k^{or} = \arg\min_{\alpha_k}\frac{1}{N_kT}\sum_{i\in G_k^0}\sum_{t=1}^T(\tilde{y}_{it} - \tilde{x}_{it}^T\alpha_k)^2, \quad \text{for } k = 1, \ldots, K.$$
Then we have the following lemma.
Lemma A1. 
Suppose that Assumptions 1 and 2 hold. Then, with probability $1 - o(K/T)$, we have $\|\hat{\beta}^{or} - \beta^0\| = O_p(\sqrt{K/T}\ln T)$.
Proof of Lemma A1. 
To prove Lemma A1, it is sufficient to show
$$P\big(\|\hat{\beta}^{or} - \beta^0\| \le C\sqrt{K(\ln T)^2/T}\big) \ge 1 - o(K/T), \quad \text{(A6)}$$
where $C$ is some positive constant. Let $E_0 = \{\mu_{\min}(\frac{1}{TN_k}\sum_{i\in G_k^0}\tilde{x}_i^T\tilde{x}_i) > c_2/2\}$. For any event $E$, we denote its complement by $E^c$. By applying Lemma A.1 in [26], we have
$$\begin{aligned} P\big(\sqrt{N_k}\|\hat{\alpha}_k^{or} - \alpha_k^0\| \ge C\ln T/\sqrt{T}\big) &= P\Big(\sqrt{N_k}\Big\|\Big(\frac{1}{TN_k}\sum_{i\in G_k^0}\tilde{x}_i^T\tilde{x}_i\Big)^{-1}\frac{1}{TN_k}\sum_{i\in G_k^0}\tilde{x}_i^T\epsilon_i\Big\| \ge C\ln T/\sqrt{T}\Big) \\ &\le P\Big(\sqrt{N_k}\Big\|\Big(\frac{1}{TN_k}\sum_{i\in G_k^0}\tilde{x}_i^T\tilde{x}_i\Big)^{-1}\Big\|\cdot\Big\|\frac{1}{TN_k}\sum_{i\in G_k^0}\tilde{x}_i^T\epsilon_i\Big\| \ge C\ln T/\sqrt{T},\ E_0\Big) + P(E_0^c) \\ &\le P\Big(\Big\|\frac{1}{TN_k}\sum_{i\in G_k^0}\tilde{x}_i^T\epsilon_i\Big\| \ge \frac{c_2}{2}C\ln T/\sqrt{N_kT}\Big) + o(T^{-1}) = o(T^{-1}). \end{aligned}$$
It is obvious that
$$P\big(\|\hat{\beta}^{or} - \beta^0\|^2 \ge C^2K(\ln T)^2/T\big) = P\Big(\sum_{k=1}^K N_k\|\hat{\alpha}_k^{or} - \alpha_k^0\|^2 \ge C^2K(\ln T)^2/T\Big) \le \sum_{k=1}^K P\big(N_k\|\hat{\alpha}_k^{or} - \alpha_k^0\|^2 \ge C^2(\ln T)^2/T\big) = o(K/T).$$
Hence, (A6) holds. Compared with Theorem 2 in [26], the oracle estimator $\hat{\beta}^{or}$ of our method approaches the true $\beta^0$ with the higher probability $1 - o(K/T)$, while their method only attains probability $1 - o(K/T) - \epsilon_0$. The underlying reason is that we take the concave pairwise fusion penalty to directly detect the homogeneity, while [26] adopted the panel-CARDS procedure, which requires good preliminary estimates of $\beta$ at the outset. □
Step 2: Let
$$b_{NT} = \min_{i\in G_k, j\in G_m, k\ne m}\|\beta_i^0 - \beta_j^0\| = \min_{k\ne m}\|\alpha_k^0 - \alpha_m^0\|$$
be the minimal difference of the common coefficients between two groups. It is important to show the following lemma.
Lemma A2. 
Suppose that the assumptions in Lemma A1 hold. If $b_{NT} > a\lambda$ and $\lambda \gg \phi_{NT}$, where $\phi_{NT} = C\ln T\sqrt{K/T}$, the objective function (5) has a local minimizer $\hat{\beta}(\lambda)$ such that
$$P\big(\hat{\beta}(\lambda) = \hat{\beta}^{or}\big) \to 1.$$
Proof of Lemma A2. 
Define
$$L_{NT}(\beta) = \frac{1}{2NT}\|\tilde{y} - \tilde{x}\beta\|^2, \qquad H_{NT}(\beta) = \lambda\sum_{i<j}\rho(\|\beta_i - \beta_j\|) + I_{\mathcal{C}}(\beta). \quad \text{(A7)}$$
Recall the definition of $\mathcal{M}_G$ in (A5). We define the following two mappings:
$$T : \mathcal{M}_G \to \mathbb{R}^{Kp} \quad \text{and} \quad T^* : \mathbb{R}^{Np} \to \mathbb{R}^{Kp},$$
where $T(\beta)$ is the $Kp \times 1$ vector consisting of $K$ blocks of dimension $p$ whose $k$th block is the common slope of the $\beta_i$ for $i \in G_k$, and $T^*(\beta) = \big((N_1^{-1}\sum_{i\in G_1^0}\beta_i)^T, \ldots, (N_K^{-1}\sum_{i\in G_K^0}\beta_i)^T\big)^T$. Obviously, when $\beta$ belongs to $\mathcal{M}_G$, $T(\beta) = T^*(\beta)$.
From (A7), we have $\tilde{L}(\beta) = L_{NT}(\beta) + H_{NT}(\beta)$. Define
$$L_{NT}^G(\alpha) = L_{NT}(T^{-1}(\alpha)), \qquad H_{NT}^G(\alpha) = H_{NT}(T^{-1}(\alpha)). \quad \text{(A9)}$$
Hence $\tilde{L}^G(\alpha) = L_{NT}^G(\alpha) + H_{NT}^G(\alpha)$. We also consider a small neighborhood of $\beta^0$:
$$\Theta = \{\beta \in \mathcal{C} : \|\beta - \beta^0\| \le \phi_{NT}\}.$$
From Lemma A1, there exists an event $E_1 = \{\|\hat{\beta}^{or} - \beta^0\| \le \phi_{NT}\}$ such that $P(E_1^c) \le o(K/T)$. For any $\beta \in \mathcal{C}$, let $\beta^* = T^{-1}(T^*(\beta))$. It is sufficient to show that $\hat{\beta}^{or}$ is a local minimizer of $\tilde{L}(\beta)$ with probability approaching 1. For this, we need to show:
(1).
On the event $E_1$, $\tilde{L}(\beta^*) > \tilde{L}(\hat{\beta}^{or})$ for any $\beta^* \in \Theta$ with $\beta^* \ne \hat{\beta}^{or}$;
(2).
There exists an event $E_2$ such that $P(E_2^c) \le o(T^{-1})$. On the event $E_1 \cap E_2$, there exists a neighborhood of $\hat{\beta}^{or}$, denoted by $\Theta_{NT}$, such that $\tilde{L}(\beta) \ge \tilde{L}(\beta^*)$ for any $\beta \in \Theta_{NT} \cap \Theta$, with strict inequality when $\beta \ne \beta^*$.
If both (1) and (2) hold, we have $\tilde{L}(\beta) > \tilde{L}(\hat{\beta}^{or})$ for any $\beta \in \Theta_{NT} \cap \Theta$ with $\beta \ne \hat{\beta}^{or}$, and therefore $\hat{\beta}^{or}$ is a strict local minimizer of $\tilde{L}(\beta)$ on the event $E_1 \cap E_2$.
We now prove the result in (1). We first show that $H_{NT}^G(T^*(\beta))$ equals a constant for any $\beta \in \Theta$. Recall that $T^*(\beta) = \alpha = (\alpha_1^T, \ldots, \alpha_K^T)^T$ and $b_{NT} = \min_{k\ne m}\|\alpha_k^0 - \alpha_m^0\|$. Then, for any $k$ and $m$, we have
$$\min_{1\le k<m\le K}\|\alpha_k - \alpha_m\| = \min_{1\le k<m\le K}\|(\alpha_k - \alpha_k^0) - (\alpha_m - \alpha_m^0) + (\alpha_k^0 - \alpha_m^0)\| \ge \min_{1\le k<m\le K}\|\alpha_k^0 - \alpha_m^0\| - 2\max_{1\le k\le K}\|\alpha_k - \alpha_k^0\|,$$
and
$$\max_{1\le k\le K}\|\alpha_k - \alpha_k^0\|^2 = \max_{1\le k\le K}\Big\|N_k^{-1}\sum_{i\in G_k}(\beta_i - \beta_i^0)\Big\|^2 \le \max_{1\le k\le K} N_k^{-1}\sum_{i\in G_k}\|\beta_i - \beta_i^0\|^2 \le \phi_{NT}^2.$$
Then $\min_{1\le k<m\le K}\|\alpha_k - \alpha_m\| \ge b_{NT} - 2\phi_{NT} > a\lambda$. By Assumption 2, $\rho(\|\alpha_k - \alpha_m\|)$ is constant on $\Theta$, and therefore $H_{NT}(T^{-1}(\alpha))$ is also constant for $\beta \in \Theta$. So $\tilde{L}^G(\alpha) = L_{NT}^G(\alpha) + \text{Constant}$ for all $\beta \in \Theta$. Note that $L_{NT}^G(\alpha)$ is strictly convex in $\alpha$ and $\hat{\alpha}^{or}$ is the unique global minimizer of $L_{NT}^G(\alpha)$; hence $L_{NT}^G(T^*(\beta)) > L_{NT}^G(\hat{\alpha}^{or})$ for all $T^*(\beta) \ne \hat{\alpha}^{or}$. From (A9), we have $\tilde{L}^G(\hat{\alpha}^{or}) = \tilde{L}(\hat{\beta}^{or})$ and $\tilde{L}^G(T^*(\beta)) = \tilde{L}(T^{-1}(T^*(\beta))) = \tilde{L}(\beta^*)$. Hence $\tilde{L}(\beta^*) > \tilde{L}(\hat{\beta}^{or})$ for all $\beta^* \ne \hat{\beta}^{or}$. So (1) is obtained.
We next prove the result in (2). Let $\Theta_{NT} = \{\beta \in \mathcal{C} : \|\beta - \hat{\beta}^{or}\| \le t_{NT}\}$, where $t_{NT}$ is a positive sequence. For any $\beta \in \Theta_{NT} \cap \Theta$, we have
$$\tilde{L}(\beta) - \tilde{L}(\beta^*) = L_{NT}(\beta) - L_{NT}(\beta^*) + H_{NT}(\beta) - H_{NT}(\beta^*) \equiv I_1 + I_2.$$
By using a Taylor expansion, we have
$$I_1 = L_{NT}(\beta) - L_{NT}(\beta^*) = -(\tilde{y} - \tilde{x}\breve{\beta})^T\tilde{x}(\beta - \beta^*), \qquad I_2 = H_{NT}(\beta) - H_{NT}(\beta^*),$$
where $\breve{\beta} = \kappa\beta + (1 - \kappa)\beta^*$ lies between $\beta$ and $\beta^*$ for some $\kappa \in (0, 1)$.
For $I_2$, we have
$$\begin{aligned} I_2 &= \lambda\sum_{\{j>i\}}\rho'(\|\breve{\beta}_i - \breve{\beta}_j\|)\|\breve{\beta}_i - \breve{\beta}_j\|^{-1}(\breve{\beta}_i - \breve{\beta}_j)^T(\beta_i - \beta_i^*) + \lambda\sum_{\{j<i\}}\rho'(\|\breve{\beta}_i - \breve{\beta}_j\|)\|\breve{\beta}_i - \breve{\beta}_j\|^{-1}(\breve{\beta}_i - \breve{\beta}_j)^T(\beta_i - \beta_i^*) \\ &= \lambda\sum_{\{j>i\}}\rho'(\|\breve{\beta}_i - \breve{\beta}_j\|)\|\breve{\beta}_i - \breve{\beta}_j\|^{-1}(\breve{\beta}_i - \breve{\beta}_j)^T\{(\beta_i - \beta_i^*) - (\beta_j - \beta_j^*)\}. \end{aligned}$$
For each $i, j \in G_k$, we have $\beta_i^* = \beta_j^*$ and $\breve{\beta}_i - \breve{\beta}_j = \kappa(\beta_i - \beta_j)$. Thus
$$I_2 = \lambda\sum_{k=1}^K\sum_{\{i,j\in G_k, i<j\}}\rho'(\|\breve{\beta}_i - \breve{\beta}_j\|)\|\breve{\beta}_i - \breve{\beta}_j\|^{-1}(\breve{\beta}_i - \breve{\beta}_j)^T(\beta_i - \beta_j) + \lambda\sum_{1\le k<m\le K}\sum_{\{i\in G_k, j\in G_m\}}\rho'(\|\breve{\beta}_i - \breve{\beta}_j\|)\|\breve{\beta}_i - \breve{\beta}_j\|^{-1}(\breve{\beta}_i - \breve{\beta}_j)^T\{(\beta_i - \beta_i^*) - (\beta_j - \beta_j^*)\}.$$
Since $\sup_i\|\beta_i^* - \beta_i^0\|^2 = \sup_k\|\alpha_k - \alpha_k^0\|^2 \le \phi_{NT}^2$ and $\breve{\beta}_i$ lies between $\beta_i$ and $\beta_i^*$,
$$\sup_i\|\breve{\beta}_i - \beta_i^0\| \le \kappa\sup_i\|\beta_i - \beta_i^0\| + (1 - \kappa)\sup_i\|\beta_i^* - \beta_i^0\| \le \kappa\phi_{NT} + (1 - \kappa)\phi_{NT} = \phi_{NT}. \quad \text{(A11)}$$
For any $k \ne m$, $i \in G_k$ and $j \in G_m$,
$$\|\breve{\beta}_i - \breve{\beta}_j\| \ge \min_{i\in G_k, j\in G_m}\|\beta_i^0 - \beta_j^0\| - 2\max_i\|\breve{\beta}_i - \beta_i^0\| \ge b_{NT} - 2\phi_{NT} > a\lambda,$$
and hence $\rho'(\|\breve{\beta}_i - \breve{\beta}_j\|) = 0$ by Assumption 2. Since $\breve{\beta}_i - \breve{\beta}_j = \kappa(\beta_i - \beta_j)$,
$$I_2 = \lambda\sum_{k=1}^K\sum_{\{i,j\in G_k, i<j\}}\rho'(\|\breve{\beta}_i - \breve{\beta}_j\|)\|\breve{\beta}_i - \breve{\beta}_j\|^{-1}(\breve{\beta}_i - \breve{\beta}_j)^T(\beta_i - \beta_j) = \lambda\sum_{k=1}^K\sum_{\{i,j\in G_k, i<j\}}\rho'(\|\breve{\beta}_i - \breve{\beta}_j\|)\|\beta_i - \beta_j\|. \quad \text{(A12)}$$
By using a similar argument, it is easy to show that
$$\sup_i\|\beta_i^* - \hat{\beta}_i^{or}\| = \sup_k\|\alpha_k - \hat{\alpha}_k^{or}\| \le \sup_i\|\beta_i - \hat{\beta}_i^{or}\|.$$
Since
$$\sup_{i,j\in G_k}\|\breve{\beta}_i - \breve{\beta}_j\| \le 2\sup_i\|\breve{\beta}_i - \beta_i^*\| \le 2\sup_i\|\beta_i - \beta_i^*\| \le 2\big(\sup_i\|\beta_i - \hat{\beta}_i^{or}\| + \sup_i\|\beta_i^* - \hat{\beta}_i^{or}\|\big) \le 4\sup_i\|\beta_i - \hat{\beta}_i^{or}\| \le 4t_{NT},$$
we have $\rho'(\|\breve{\beta}_i - \breve{\beta}_j\|) \ge \rho'(4t_{NT})$ due to the concavity of $\rho(\cdot)$. So
$$I_2 \ge \sum_{k=1}^K\sum_{\{i,j\in G_k, i<j\}}\lambda\rho'(4t_{NT})\|\beta_i - \beta_j\|. \quad \text{(A13)}$$
For $I_1$, we let
$$B = (B_1, \ldots, B_N)^T = -[(\tilde{y} - \tilde{x}\breve{\beta})^T\tilde{x}]^T,$$
where $B_i = -\tilde{x}_i^T(\tilde{y}_i - \tilde{x}_i\breve{\beta}_i)$. Hence
$$I_1 = B^T(\beta - \beta^*) = \sum_{k=1}^K\sum_{\{i,j\in G_k\}} N_k^{-1}B_i^T(\beta_i - \beta_j) = \sum_{k=1}^K\sum_{\{i,j\in G_k\}}\frac{B_i^T(\beta_i - \beta_j)}{2N_k} + \sum_{k=1}^K\sum_{\{i,j\in G_k\}}\frac{B_j^T(\beta_j - \beta_i)}{2N_k} = \sum_{k=1}^K\sum_{\{i,j\in G_k, i<j\}} N_k^{-1}(B_j - B_i)^T(\beta_j - \beta_i).$$
Note that $\sup_i\|B_i\| \le \sup_i\big(\|\tilde{x}_i^T\epsilon_i\| + \|\tilde{x}_i\|^2\cdot\|\beta_i^0 - \breve{\beta}_i\|\big)$. By Assumption 1, we have $\sup_i\|\tilde{x}_i^T\epsilon_i\| \le c_1\sqrt{Tp}$ and $\sup_i\|\tilde{x}_i\|^2 \le 2c_3$. From (A11), we have $\sup_i\|\beta_i^0 - \breve{\beta}_i\| \le \phi_{NT}$. Hence, there exists an event $E_2$ such that $P(E_2^c) = o(T^{-1})$ and, on the event $E_2$, $\sup_i\|B_i\| \le (c_1\sqrt{Tp} + 2c_3\phi_{NT})$. Then
$$N_k^{-1}\big|(B_j - B_i)^T(\beta_j - \beta_i)\big| \le N_k^{-1}\|B_j - B_i\|\cdot\|\beta_i - \beta_j\| \le 2N_k^{-1}\sup_i\|B_i\|\cdot\|\beta_i - \beta_j\| \le 2N_k^{-1}(c_1\sqrt{Tp} + 2c_3\phi_{NT})\|\beta_i - \beta_j\|. \quad \text{(A14)}$$
Combining (A12)–(A14), we have
$$\tilde{L}(\beta)-\tilde{L}(\beta^*)\ge\sum_{k=1}^{K}\sum_{\{i,j\in G_k,\,i<j\}}\Big\{\lambda\rho'(4t_{NT})-2N_k^{-1}\big(c_1\sqrt{Tp}+2c_3\phi_{NT}\big)\Big\}\|\beta_i-\beta_j\|.$$
By choosing a sufficiently small $t_{NT}$, we have $\rho'(4t_{NT})\to1$ and $2N_k^{-1}(c_1\sqrt{Tp}+2c_3\phi_{NT})\le\lambda\rho'(4t_{NT})$. Therefore, $\tilde{L}(\beta)-\tilde{L}(\beta^*)\ge0$ for any $\beta\in\Theta_{NT}$, and the proof of (2) is complete.
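The three properties of $\rho'(\cdot)$ used repeatedly in Step 2, namely that concavity of $\rho$ makes $\rho'$ non-increasing, that $\rho'(t)=0$ once $t>a\lambda$, and that $\rho'(t)\to1$ as $t\to0$, can be checked numerically. Below is a minimal sketch in Python, assuming $\rho$ is the $\lambda$-normalized SCAD penalty of [9]; the constant $a=3.7$ and the grid values are illustrative only.

```python
import numpy as np

def scad_derivative(t, lam, a=3.7):
    """Derivative rho'(t) of the lambda-normalized SCAD penalty:
    1 for t <= lam; (a*lam - t)/((a - 1)*lam) for lam < t <= a*lam; 0 beyond.
    The function is non-increasing in t, reflecting the concavity of rho."""
    t = np.asarray(t, dtype=float)
    middle = (a * lam - t) / ((a - 1.0) * lam)
    return np.where(t <= lam, 1.0, np.where(t <= a * lam, middle, 0.0))

lam = 0.5
print(scad_derivative(4 * 0.01, lam))  # ~1.0: rho'(4 t_NT) -> 1 for small t_NT
print(scad_derivative(2.0, lam))       # 0.0: between-group terms vanish (t > a*lam)
```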
Step 3: From Lemma A2, as $T\to\infty$ and $\sigma_0^2\to0$, we have $P(\hat K=K)\to1$ and $P(\hat G_1=G_1^0,\dots,\hat G_K=G_K^0\mid\hat K=K)\to1$. By the law of total probability, it follows that
$$P(\hat G_1=G_1^0,\dots,\hat G_K=G_K^0)=P(\hat G_1=G_1^0,\dots,\hat G_K=G_K^0\mid\hat K=K)\,P(\hat K=K)\to1,$$
as $T\to\infty$. Next, let $\mathcal{B}$ be a Borel-measurable set; then
$$P\big(\sqrt{N_kT}(\hat\alpha_k-\alpha_k^0)\in\mathcal{B}\big)=P\big(\sqrt{N_kT}(\hat\alpha_k-\alpha_k^0)\in\mathcal{B}\mid\hat\beta=\hat\beta^{or}\big)P\big(\hat\beta=\hat\beta^{or}\big)+P\big(\sqrt{N_kT}(\hat\alpha_k-\alpha_k^0)\in\mathcal{B}\mid\hat\beta\ne\hat\beta^{or}\big)P\big(\hat\beta\ne\hat\beta^{or}\big)=P\big(\sqrt{N_kT}(\hat\alpha_k^{or}-\alpha_k^0)\in\mathcal{B}\big)\{1-o(1)\}+o(1)\to P\big(\sqrt{N_kT}(\hat\alpha_k^{or}-\alpha_k^0)\in\mathcal{B}\big)\quad\text{as }T\to\infty.$$
This means that $\sqrt{N_kT}(\hat\alpha_k-\alpha_k^0)$ has the same asymptotic distribution as $\sqrt{N_kT}(\hat\alpha_k^{or}-\alpha_k^0)$, where
$$\sqrt{N_kT}\,(\hat\alpha_k^{or}-\alpha_k^0)=\Big(\frac{1}{TN_k}\sum_{i\in G_k^0}\tilde x_i^{T}\tilde x_i\Big)^{-1}\frac{1}{\sqrt{N_kT}}\sum_{i\in G_k^0}\tilde x_i^{T}\epsilon_i.$$
By Assumption 3, we have $\sqrt{N_kT}(\hat\alpha_k^{or}-\alpha_k^0)=\Omega_k^{-1}B_k\xrightarrow{d}N(0,\Omega_k^{-1}\Psi_k\Omega_k^{-1})$. Thus the results in Theorem 1 are obtained.   □
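As a quick numerical illustration of this limiting distribution, one can simulate a single group with known membership, compute the pooled within-group (oracle) estimator, and inspect the standardized estimation errors. A minimal sketch follows, assuming a scalar regressor with already within-transformed data and i.i.d. standard normal errors, so that $\Omega_k=\Psi_k=E[\tilde x_{it}^2]=1$ and the limit is $N(0,1)$; the sample sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
Nk, T, alpha0, reps = 50, 20, 1.0, 2000
z = np.empty(reps)
for r in range(reps):
    x = rng.normal(size=(Nk, T))    # within-transformed scalar regressor
    eps = rng.normal(size=(Nk, T))  # idiosyncratic errors
    y = alpha0 * x + eps
    alpha_hat = np.sum(x * y) / np.sum(x * x)  # oracle pooled OLS over the group
    z[r] = np.sqrt(Nk * T) * (alpha_hat - alpha0)
print(z.mean(), z.std())  # close to 0 and 1, matching the N(0, 1) limit
```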

Appendix B.3

Proof of Theorem 2. 
For a given tuning parameter $\gamma$, denote the corresponding parameter estimator by $\hat\beta(\gamma)$ and the associated mean squared error by $\hat\sigma^2_{\hat G(\gamma)}$. Then
$$\hat\sigma^2_{NT}(\gamma)=2L_{NT}(\hat\beta(\gamma))\big[I\{\hat G(\gamma)=G^0\}+I\{\hat G(\gamma)\ne G^0\}\big]=2L_{NT}(\hat\beta(\gamma))\,I\{\hat G(\gamma)=G^0\}+o_p(1)\to\sigma_0^2,\quad\text{as }(N,T)\to\infty.$$
By definition, $\mathrm{IC}(\gamma_{opt})=\ln\hat\sigma^2_{NT}(\gamma_{opt})+pK(\gamma_{opt})\psi_{NT}$, which approaches $\ln(\sigma_0^2)$ as $(N,T)\to\infty$ by Assumption 5. For any other $\gamma$, we consider two cases:
Case 1: Under-fitted model. Suppose that $\hat K=\hat K(\gamma)$ with $1\le\hat K<K$. From Assumption 4, we have
$$\mathrm{IC}(\gamma)=\ln\hat\sigma^2_{\hat G_{\hat K}}+p\hat K\psi_{NT}\ge\min_{1\le\hat K<K}\min_{\hat G_{\hat K}}\ln\hat\sigma^2_{\hat G_{\hat K}}+p\hat K\psi_{NT}\to\ln\underline\sigma^2>\ln(\sigma_0^2),\quad\text{as }(N,T)\to\infty.$$
Hence $\mathrm{IC}(\gamma)>\mathrm{IC}(\gamma_{opt})$ as $(N,T)\to\infty$.
Case 2: Over-fitted model. Consider any $\gamma$ such that $\hat K=\hat K(\gamma)>K$. Following Lemma S1.14 in [5], we can obtain $NT\big(\ln\hat\sigma^2_{G^0}-\ln\hat\sigma^2_{\hat G_{\hat K}}\big)=O_p(1)$. Then, by Assumption 5, we have
$$P(\hat K>K)=P\big(\mathrm{IC}(\gamma)<\mathrm{IC}(\gamma_{opt})\big)=P\Big(NT\big(\ln\hat\sigma^2_{G^0}-\ln\hat\sigma^2_{\hat G_{\hat K}}\big)>(\hat K-K)NT\psi_{NT}\Big)\to0,\quad\text{as }(N,T)\to\infty.$$
Combining Cases 1 and 2, the information criterion in (18) consistently selects an optimal tuning parameter $\gamma$, which yields the oracle estimator of $\beta$ with probability approaching 1.    □
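In practice, Theorem 2 translates into a simple grid search: fit the penalized model for each candidate $\gamma$, record the implied number of groups and the mean squared error, and keep the $\gamma$ minimizing $\mathrm{IC}(\gamma)$. A minimal sketch follows; `fit_penalized` is a hypothetical routine returning the slope estimates and the estimated group count for a given $\gamma$, and the rate $\psi_{NT}=\ln(NT)/\sqrt{NT}$ is an assumed choice for illustration, not necessarily the paper's exact specification.

```python
import numpy as np

def information_criterion(y_t, x_t, beta_hat, K_hat, psi_NT):
    """IC(gamma) = ln(sigma_hat^2) + p * K_hat * psi_NT for within-transformed
    data y_t (N, T) and x_t (N, T, p), with slope estimates beta_hat (N, p)."""
    N, T, p = x_t.shape
    resid = y_t - np.einsum('itp,ip->it', x_t, beta_hat)
    sigma2_hat = np.sum(resid ** 2) / (N * T)
    return np.log(sigma2_hat) + p * K_hat * psi_NT

def select_gamma(y_t, x_t, gammas, fit_penalized):
    """Grid search over candidate tuning parameters gamma."""
    N, T, _ = x_t.shape
    psi_NT = np.log(N * T) / np.sqrt(N * T)  # assumed rate satisfying Assumption 5
    ics = []
    for g in gammas:
        beta_hat, K_hat = fit_penalized(g)   # hypothetical penalized ADMM fit
        ics.append(information_criterion(y_t, x_t, beta_hat, K_hat, psi_NT))
    return gammas[int(np.argmin(ics))]
```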

Appendix C. Testing for Heterogeneity

It is important to determine whether the heterogeneity effect is statistically significant. The hypothesis of no heterogeneity can be represented by
$$H_0:\ \beta_1=\beta_2=\cdots=\beta_N$$
against the alternative hypothesis
$$H_1:\ \beta_i\ne\beta_j\quad\text{for some }i\ne j.$$
Under the null hypothesis, the model is
$$y_{it}=\mu_i+\beta_1^{T}x_{it}+\epsilon_{it}.$$
By applying the fixed-effect transformation, we have
$$\tilde y_{it}=\beta_1^{T}\tilde x_{it}+\tilde\epsilon_{it},$$
where $\tilde\epsilon_{it}=\epsilon_{it}-\frac{1}{T}\sum_{t=1}^{T}\epsilon_{it}$. The regression parameter $\beta_1$ can be estimated by OLS, yielding residuals $\bar{\hat\epsilon}_{it}$ and the sum of squared errors $S_0=\bar{\hat\epsilon}^{T}\bar{\hat\epsilon}$, where $\bar{\hat\epsilon}=(\bar{\hat\epsilon}_1^{T},\dots,\bar{\hat\epsilon}_N^{T})^{T}$ and $\bar{\hat\epsilon}_i=(\bar{\hat\epsilon}_{i1},\dots,\bar{\hat\epsilon}_{iT})^{T}$. The likelihood ratio test of $H_0$ is based on
$$F=\frac{S_0-S_1(\hat K)}{\hat\delta^2},$$
where $S_1(\hat K)$ is the sum of squared errors from the latent structures model (2), and $\hat K$ is the estimated number of groups. The residual variance $\hat\delta^2$ is given by
$$\hat\delta^2=\frac{1}{N(T-1)}\hat\epsilon^{T}\hat\epsilon=\frac{1}{N(T-1)}S_1(\hat K).$$
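To make the construction of $F$ concrete, the sketch below computes $S_0$ from the pooled within-regression under $H_0$ and $S_1(\hat K)$ from group-wise regressions using estimated group labels. The function name and the assumption that the labels come from the latent-structure fit are illustrative.

```python
import numpy as np

def heterogeneity_F(y_t, x_t, groups):
    """F = (S0 - S1(K_hat)) / delta2_hat with delta2_hat = S1 / (N (T - 1)).
    y_t: (N, T) within-transformed responses; x_t: (N, T, p) within-transformed
    regressors; groups: length-N labels from the latent-structure estimate."""
    groups = np.asarray(groups)
    N, T, p = x_t.shape
    X, Y = x_t.reshape(N * T, p), y_t.reshape(N * T)
    b0, *_ = np.linalg.lstsq(X, Y, rcond=None)       # pooled OLS under H0
    S0 = np.sum((Y - X @ b0) ** 2)
    S1 = 0.0
    for k in np.unique(groups):                      # group-wise OLS under H1
        idx = groups == k
        Xk, Yk = x_t[idx].reshape(-1, p), y_t[idx].reshape(-1)
        bk, *_ = np.linalg.lstsq(Xk, Yk, rcond=None)
        S1 += np.sum((Yk - Xk @ bk) ** 2)
    delta2 = S1 / (N * (T - 1))
    return (S0 - S1) / delta2
```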
However, the asymptotic distribution of $F$ is non-standard, making it challenging to tabulate critical values directly. To address this issue, we propose a residual-based bootstrap procedure to calculate the p-values. The detailed steps are outlined in Algorithm A1.
Algorithm A1:  Residual bootstrap procedure
  • Step 1. Calculate the likelihood ratio test statistic $F$ based on the original sample $\{(y_{it},x_{it}),\,i=1,\dots,N;\,t=1,\dots,T\}$.
  • Step 2. Obtain the usual fixed-effect estimator $\bar{\hat\beta}_1$ without the latent structures specification, and compute the residuals $\bar{\hat u}_{it}=\bar{\hat\epsilon}_{it}-\bar{\hat\epsilon}_{i\cdot}$, where $\bar{\hat\epsilon}_{i\cdot}=T^{-1}\sum_{t=1}^{T}\bar{\hat\epsilon}_{it}$.
  • Step 3. Generate iid draws $v_{it}$, $i=1,\dots,N$, $t=1,\dots,T$, from $N(0,1)$, and set $\epsilon_{it}^{*}=\bar{\hat u}_{it}v_{it}$ and $y_{it}^{*}=x_{it}^{T}\bar{\hat\beta}_1+\bar{\hat\epsilon}_{i\cdot}+\epsilon_{it}^{*}$.
  • Step 4. Calculate the test statistic $F^{(b)}$ based on the bootstrap resample $\{(y_{it}^{*},x_{it}),\,i=1,\dots,N;\,t=1,\dots,T\}$.
  • Step 5. Repeat Steps 3–4 for $b=1,\dots,B$ to obtain the bootstrap statistics $\{F^{(1)},\dots,F^{(B)}\}$. The p-value is the proportion of the $F^{(b)}$ that exceed $F$.
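A minimal sketch of Algorithm A1 is given below, reusing the `heterogeneity_F` helper above. Here `compute_F` and `fit_pooled` are hypothetical routines (the former wraps the latent-structure refit plus the $F$ computation, the latter the pooled fixed-effect fit of Step 2); all names are assumptions, not the paper's implementation.

```python
import numpy as np

def bootstrap_pvalue(y, x, compute_F, fit_pooled, B=500, seed=0):
    """Residual bootstrap for the heterogeneity test (Algorithm A1).
    y: (N, T) responses; x: (N, T, p) regressors.
    compute_F : hypothetical routine mapping a panel (y, x) to the statistic F,
                including the latent-structure refit (Steps 1 and 4).
    fit_pooled: hypothetical routine for Step 2, returning the pooled slope
                beta1 (p,), demeaned residuals u (N, T), and the residual
                time-means fe (N,), which proxy the fixed effects."""
    rng = np.random.default_rng(seed)
    F_obs = compute_F(y, x)                       # Step 1
    beta1, u, fe = fit_pooled(y, x)               # Step 2
    F_boot = np.empty(B)
    for b in range(B):
        v = rng.normal(size=y.shape)              # Step 3: iid N(0,1) multipliers
        y_star = np.einsum('itp,p->it', x, beta1) + fe[:, None] + u * v
        F_boot[b] = compute_F(y_star, x)          # Step 4
    return float(np.mean(F_boot >= F_obs))        # Step 5: bootstrap p-value
```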

References

  1. Hsiao, C. Analysis of Panel Data; Number 54; Cambridge University Press: Cambridge, UK, 2014.
  2. Bonhomme, S.; Manresa, E. Grouped patterns of heterogeneity in panel data. Econometrica 2015, 83, 1147–1184.
  3. Su, L.; Chen, Q. Testing homogeneity in panel data models with interactive fixed effects. Econom. Theory 2013, 29, 1079–1135.
  4. Phillips, P.C.; Sul, D. Transition modeling and econometric convergence tests. Econometrica 2007, 75, 1771–1855.
  5. Su, L.; Shi, Z.; Phillips, P.C. Identifying latent structures in panel data. Econometrica 2016, 84, 2215–2264.
  6. Ke, Z.T.; Fan, J.; Wu, Y. Homogeneity pursuit. J. Am. Stat. Assoc. 2015, 110, 175–194.
  7. Xiao, D.; Ke, Y.; Li, R. Homogeneity structure learning in large-scale panel data with heavy-tailed errors. J. Mach. Learn. Res. 2021, 22, 1–42.
  8. Zhang, Y.; Wang, H.J.; Zhu, Z. Quantile-regression-based clustering for panel data. J. Econom. 2019, 213, 54–67.
  9. Fan, J.; Li, R. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 2001, 96, 1348–1360.
  10. Tibshirani, R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B Stat. Methodol. 1996, 58, 267–288.
  11. Huang, W.; Jin, S.; Su, L. Identifying latent grouped patterns in cointegrated panels. Econom. Theory 2020, 36, 410–456.
  12. Huang, W.; Su, L.; Zhuang, Y. Detecting unobserved heterogeneity in efficient prices via classifier-lasso. J. Bus. Econ. Stat. 2023, 41, 509–522.
  13. Su, L.; Ju, G. Identifying latent grouped patterns in panel data models with interactive fixed effects. J. Econom. 2018, 206, 554–573.
  14. Su, L.; Wang, X.; Jin, S. Sieve estimation of time-varying panel data models with latent structures. J. Bus. Econ. Stat. 2019, 37, 334–349.
  15. Mehrabani, A. Estimation and identification of latent group structures in panel data. J. Econom. 2023, 235, 1464–1482.
  16. Ando, T.; Bai, J. Panel data models with grouped factor structure under unknown group membership. J. Appl. Econom. 2016, 31, 163–191.
  17. Liu, R.; Shang, Z.; Zhang, Y.; Zhou, Q. Identification and estimation in panel models with overspecified number of groups. J. Econom. 2020, 215, 574–590.
  18. Bai, J. Estimating multiple breaks one at a time. Econom. Theory 1997, 13, 315–352.
  19. Wang, W.; Su, L. Identifying latent group structures in nonlinear panels. J. Econom. 2021, 220, 272–295.
  20. Su, L.; Wang, W.; Xu, X. Identifying latent group structures in spatial dynamic panels. J. Econom. 2023, 235, 1955–1980.
  21. Wang, Y.; Phillips, P.C.; Su, L. Panel data models with time-varying latent group structures. J. Econom. 2024, 240, 105685.
  22. Beer, J.C.; Aizenstein, H.J.; Anderson, S.J.; Krafty, R.T. Incorporating prior information with fused sparse group lasso: Application to prediction of clinical measures from neuroimages. Biometrics 2019, 75, 1299–1309.
  23. Li, Y.; Jin, B. Pairwise fusion approach incorporating prior constraint information. Commun. Math. Stat. 2020, 8, 47–62.
  24. Li, F.; Sang, H. Spatial homogeneity pursuit of regression coefficients for large datasets. J. Am. Stat. Assoc. 2019, 114, 1050–1062.
  25. Zhang, X.; Liu, J.; Zhu, Z. Learning coefficient heterogeneity over networks: A distributed spanning-tree-based fused-lasso regression. J. Am. Stat. Assoc. 2024, 119, 485–497.
  26. Wang, W.; Phillips, P.C.; Su, L. Homogeneity pursuit in panel data models: Theory and application. J. Appl. Econom. 2018, 33, 797–815.
  27. Boyd, S.; Parikh, N.; Chu, E. Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers; Now Publishers Inc.: Delft, The Netherlands, 2011.
  28. Ke, Y.; Li, J.; Zhang, W. Structure identification in panel data analysis. Ann. Stat. 2016, 44, 1193–1233.
  29. Ma, S.; Huang, J. A concave pairwise fusion approach to subgroup analysis. J. Am. Stat. Assoc. 2017, 112, 410–423.
  30. Jeon, J.-J.; Kwon, S.; Choi, H. Homogeneity detection for the high-dimensional generalized linear model. Comput. Stat. Data Anal. 2017, 114, 61–74.
  31. Zhang, C.-H. Nearly unbiased variable selection under minimax concave penalty. Ann. Stat. 2010, 38, 894–942.
  32. Belloni, A.; Chernozhukov, V. Least squares after model selection in high-dimensional sparse models. Bernoulli 2013, 19, 521–547.
  33. Shiu, A.; Lam, P.-L. Electricity consumption and economic growth in China. Energy Policy 2004, 32, 47–54.
  34. Xu, S.-C.; He, Z.-X.; Long, R.-Y. Factors that influence carbon emissions due to energy consumption in China: Decomposition analysis using LMDI. Appl. Energy 2014, 127, 182–193.
  35. Akkemik, K.A.; Göksal, K.; Li, J. Energy consumption and income in Chinese provinces: Heterogeneous panel causality analysis. Appl. Energy 2012, 99, 445–454.
  36. Wang, N.; Fu, X.; Wang, S. Economic growth, electricity consumption, and urbanization in China: A tri-variate investigation using panel data modeling from a regional disparity perspective. J. Clean. Prod. 2021, 318, 128529.
  37. Pesaran, M.H. Estimation and inference in large heterogeneous panels with a multifactor error structure. Econometrica 2006, 74, 967–1012.
  38. Miao, K.; Li, K.; Su, L. Panel threshold models with interactive fixed effects. J. Econom. 2020, 219, 137–170.
  39. Tseng, P. Convergence of a block coordinate descent method for nondifferentiable minimization. J. Optim. Theory Appl. 2001, 109, 475–494.
Figure 1. The normalized mutual information across different N and T for DGPs 1–3.
Figure 2. The average model estimation errors across different N and T for DGPs 1–3.
Table 1. The notations in our working model.

| | Notation | Illustration |
|---|---|---|
| Observed quantities | $y_{it}$ | the univariate response variable |
| | $x_{it}$ | the $p\times1$ explanatory variables |
| | $\mathcal{C}$ | the prior constraint information set |
| | $N$ | the number of all individuals |
| | $T$ | the number of observations for each individual |
| Unknown parameters | $\mu_i$ | the fixed effect of the $i$th individual |
| | $\beta_i$ | the slope of the $i$th individual |
| | $\mathcal{G}$ | the latent group structures |
| | $G_k$ | the membership of the $k$th group |
| | $\alpha_k$ | the common slope of the $k$th group |
| | $K$ | the number of latent groups |
Table 2. Comparison with other literature in latent structures identification for panel models.

| Methods | Penalty Forms | Prior Information |
|---|---|---|
| C-Lasso [5] | $\sum_{i=1}^{N}\prod_{k=1}^{K}\lVert\beta_i-\alpha_k\rVert$ | No prior information |
| Panel-CARDS [26] | $\sum_{l=1}^{L-1}\sum_{i\in B_l,\,j\in B_{l+1}}p_{\gamma_1}(\lVert\beta_i-\beta_j\rVert)+\sum_{l=1}^{L}\sum_{i,j\in B_l}p_{\gamma_2}(\lVert\beta_i-\beta_j\rVert)$ | An ordered segmentation $\{B_1,\dots,B_L\}$ |
| PAGFL [15] | $\sum_{1\le i<j\le N}\ddot w_{ij}\lVert\beta_i-\beta_j\rVert$ | Adaptive weights $\ddot w_{ij}$'s |
| Our work | $\sum_{1\le i<j\le N}p_{\gamma}(\lVert\beta_i-\beta_j\rVert_1)+I_{\mathcal{C}}(\beta)$ | A convex set $\mathcal{C}$ |
Table 3. The parameter estimation results of $\alpha_{11}$ for DGPs 1–3 based on $N=200$.

| DGP | | No-Prior [5] T=10 | T=20 | T=40 | Prior T=10 | T=20 | T=40 | Oracle T=10 | T=20 | T=40 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Bias | −0.025 | −0.008 | −0.002 | −0.007 | −0.004 | −0.003 | 0.010 | −0.001 | −0.002 |
| | SD | 0.088 | 0.042 | 0.029 | 0.084 | 0.041 | 0.029 | 0.054 | 0.037 | 0.029 |
| | ESE | 0.060 | 0.041 | 0.029 | 0.062 | 0.042 | 0.029 | 0.058 | 0.041 | 0.029 |
| | ECP | 0.830 | 0.920 | 0.970 | 0.850 | 0.950 | 0.960 | 0.980 | 0.960 | 0.970 |
| 2 | Bias | −0.011 | −0.015 | −0.001 | 0.008 | −0.004 | −0.001 | −0.004 | −0.007 | 0.000 |
| | SD | 0.086 | 0.040 | 0.027 | 0.089 | 0.042 | 0.027 | 0.062 | 0.040 | 0.027 |
| | ESE | 0.060 | 0.042 | 0.029 | 0.061 | 0.042 | 0.029 | 0.058 | 0.041 | 0.029 |
| | ECP | 0.780 | 0.980 | 0.960 | 0.830 | 0.950 | 0.960 | 0.940 | 0.980 | 0.970 |
| 3 | Bias | 0.005 | 0.003 | 0.003 | 0.018 | 0.005 | 0.004 | 0.005 | 0.004 | 0.003 |
| | SD | 0.078 | 0.042 | 0.031 | 0.077 | 0.043 | 0.032 | 0.060 | 0.042 | 0.031 |
| | ESE | 0.066 | 0.043 | 0.029 | 0.065 | 0.043 | 0.029 | 0.058 | 0.041 | 0.029 |
| | ECP | 0.910 | 0.970 | 0.930 | 0.920 | 0.960 | 0.920 | 0.940 | 0.950 | 0.930 |

Bias is the empirical bias; SD is the Monte Carlo standard deviation; ESE is the estimated standard error; ECP is the empirical coverage probability of the 95% Wald confidence interval.
Table 4. Classification results among 30 provinces in China.

| With Prior: Group 1 | With Prior: Group 2 | Without Prior: Group 1 | Without Prior: Group 2 | Without Prior: Group 3 | Without Prior: Group 4 | Without Prior: Group 5 |
|---|---|---|---|---|---|---|
| 1.1692 | 3.4770 | 0.6935 | 0.8715 | 0.9837 | 1.0879 | 3.4351 |
| Anhui | Beijing | Guangdong | Shandong | Zhejiang | Anhui | Beijing |
| Fujian | Gansu | Jiangsu | | | Fujian | Gansu |
| Guangdong | Guizhou | | | | Guangxi | Guizhou |
| Guangxi | Heilongjiang | | | | Hainan | Heilongjiang |
| Hainan | Jilin | | | | Hebei | Jilin |
| Hebei | Jiangxi | | | | Henan | Jiangxi |
| Henan | Ningxia | | | | Hubei | Ningxia |
| Hubei | Qinghai | | | | Hunan | Qinghai |
| Hunan | Shaanxi | | | | Liaoning | Shaanxi |
| Jiangsu | Tianjin | | | | Inner Mongolia | Tianjin |
| Liaoning | Chongqing | | | | Shanxi | Chongqing |
| Inner Mongolia | | | | | Shanghai | |
| Shandong | | | | | Sichuan | |
| Shanxi | | | | | Xinjiang | |
| Shanghai | | | | | Yunnan | |
| Sichuan | | | | | | |
| Xinjiang | | | | | | |
| Yunnan | | | | | | |
| Zhejiang | | | | | | |

The value in the first row of each column is the estimated common slope of that group.