1. Introduction
The least-squares estimation technique has proven highly effective in many fields. From the observed data, the method establishes a mathematical relationship between dependent and independent variables, which can then be used for forecasting, optimization, and control. Under appropriate design conditions, the resulting parameter estimates have well-known optimality properties. When the error terms are assumed to follow a Gaussian distribution, the least-squares estimates coincide with the maximum likelihood estimators. Without the Gaussian assumption, the Gauss–Markov theorem shows that the least-squares estimates remain the best linear unbiased estimates of the regression parameters, provided that the errors have zero mean and a common finite variance and are uncorrelated. In these settings, the observed data are typically presumed to be exact, and the statistical techniques have accordingly been developed for precisely defined datasets.
Nevertheless, in many real-world situations, the available data are not only random but also vague, because of imprecise information or expressions such as “approximately 10”, “slightly more than 10”, “somewhere between 5 and 10”, or subjective qualifiers such as “fair”, “good”, or “excellent”. In such instances, the classical least-squares method is not appropriate. Regression analysis involves two forms of uncertainty, randomness and fuzziness, but often only random errors are modeled. Therefore, a broader regression framework is necessary to account for both random and fuzzy uncertainties. To handle data ambiguity, various statistical techniques have been developed [1,2]. The pioneering works on regression analysis using the fuzzy model were conducted by Tanaka and his collaborators [3,4,5]. In recent decades, numerous researchers have studied regression within fuzzy environments [6,7,8,9,10,11]. Choi and Yoon introduced several fuzzy linear regression models, such as the componentwise fuzzy linear regression model [6], the fuzzy rank linear model [7], the general fuzzy regression model [8], and the equivalence in alpha-level linear regression [9]. Lee et al. [10] employed bootstrap techniques to make inferences about fuzzy regression models. Namdari et al. [11] and Sohn et al. [12] proposed fuzzy logistic regression models utilizing LAD (least absolute deviation) and LSE (least-squares estimation).
More recently, Yoon developed fuzzy mediation and moderation analysis techniques grounded in fuzzy regression analysis [13,14] and introduced a variable selection technique for multiple fuzzy regression models [15]. Additionally, Bas and Egrioglu [16,17] discussed robust fuzzy regression, and some authors [18,19] applied the TSK system to fuzzy regression.
In these studies, the discrepancies between the regression models and the observed data are interpreted not as random errors governed by probability distributions but as consequences of the fuzziness of the model structure. Nonetheless, various researchers have investigated models incorporating random errors in this context [20,21,22,23,24,25,26,27,28].
The mathematical properties of regression models, including optimality and large-sample behavior, are of central importance in statistics [28,29,30,31]. However, research on fuzzy estimation has predominantly focused on practical applications, and theoretical work on these properties remains relatively limited. Körner and Näther [32,33,34], as well as Yoon and Grzegorzewski [28], have examined the characteristics of the BLUE (best linear unbiased estimator). Furthermore, Kim et al. [25] and Yoon et al. [27,28] have investigated asymptotic theory for large samples. Recently, Croci et al. [29] used semidefinite programming to analyze BLUE properties, while Song et al. [30] and Young [31] demonstrated consistency-related properties among optimality criteria.
This study builds on a prior investigation [25] that explored the asymptotic properties of least-squares estimation with vague data. The main goal of this paper is to identify sufficient conditions for the consistency and asymptotic normality of a sequence of least-squares estimators with vague data in a multiple linear regression model. In addition, the asymptotic efficiency of these estimators is discussed.
This paper is structured as follows. Section 2 introduces the assumptions underlying the fuzzy model, along with the preliminary concepts needed to derive the main results. In Section 3, we develop the counterparts of the normal equations and derive the estimators. The large-sample properties of the fuzzy least-squares estimators (FLSEs), including their strong consistency and asymptotic normality, are discussed in Sections 3 and 4. Lastly, Section 5 presents the approximate confidence region for the parameters as well as a comparison of the asymptotic relative efficiency of the FLSEs with that of the crisp least-squares estimators.
2. Preliminaries
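Let us begin by examining the standard multiple linear regression model:
\[ y_i = \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \varepsilon_i, \quad i = 1, \ldots, n, \tag{1} \]
which can be expressed in matrix notation as
\[ Y = X\beta + \varepsilon, \tag{2} \]
where $Y$ is an $n \times 1$ vector of observed response variables, $X$ denotes an $n \times p$ matrix of known constants $x_{ij}$, $\beta$ represents the $p \times 1$ vector of unknown parameters, and $\varepsilon$ is an $n \times 1$ vector of unobserved random errors, which are assumed to follow a distribution with d.f. $F$ such that $E(\varepsilon_i) = 0$ and $\mathrm{Var}(\varepsilon_i) = \sigma^2$, where $0 < \sigma^2 < \infty$.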
The most popular estimation method for $\beta$ in model (2) is least-squares estimation; the least-squares estimator (LSE) is given by
\[ \hat{\beta} = (X^{\top}X)^{-1}X^{\top}Y, \]
where the regression matrix $X$ is supposed to have full rank, i.e., the columns of $X$ are linearly independent. Within the Gaussian model, where the error terms are assumed to follow a Gaussian distribution, the LSE satisfies the standard optimality properties: it is unbiased and asymptotically efficient. Moreover, by the Gauss–Markov theorem, which is the more general result, the LSE is the best linear unbiased estimator (BLUE) under the sole assumptions that the error distribution has a mean of zero and a finite variance $\sigma^2$.
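As a minimal illustration of the closed form $(X^{\top}X)^{-1}X^{\top}Y$ of the LSE, the following Python sketch solves the normal equations with NumPy. The function name `ols_estimate` and the toy data are assumptions made only for this example; they do not come from the paper.

```python
import numpy as np

def ols_estimate(X, y):
    """Solve the normal equations X'X b = X'y for a full-column-rank X."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    if np.linalg.matrix_rank(X) < X.shape[1]:
        raise ValueError("X must have full column rank")
    # Solving the linear system avoids forming the explicit inverse of X'X.
    return np.linalg.solve(X.T @ X, X.T @ y)

# Hypothetical toy data: a column of ones plus one regressor, with y = 1 + 2x exactly.
X_demo = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y_demo = np.array([1.0, 3.0, 5.0, 7.0])
beta_hat = ols_estimate(X_demo, y_demo)
print(beta_hat)
```

Because the toy responses lie exactly on a line, the estimate reproduces the generating coefficients; with noisy data the same call returns the least-squares fit.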
An estimator $\hat{\beta}_n$ of $\beta$ is said to be strongly (weakly) consistent if $\hat{\beta}_n \to \beta$ almost surely (in probability) as $n \to \infty$. The strong consistency of $\hat{\beta}_n$ was established in [35,37] under specific conditions on the design points $x_{ij}$. Both [35,37] assume that the input variables are observed without error. For the linear model with random inputs, An et al. [36] introduced a novel estimator of the unknown parameters, derived from the Fourier transform of a symmetric weight function, and established its strong consistency under the assumption of independent and identically distributed observations.
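Consistency can be illustrated numerically. The sketch below is only a simulation under assumed settings (standard normal design points drawn once per sample size, Gaussian errors with unit variance, and a hypothetical true parameter vector); it is not a proof and does not reproduce the conditions of [35,37].

```python
import numpy as np

def lse_error(n, beta, rng, sigma=1.0):
    """Maximum absolute error of the crisp LSE for one simulated sample of size n."""
    X = rng.normal(size=(n, beta.size))             # assumed design points
    y = X @ beta + rng.normal(scale=sigma, size=n)  # crisp responses with random errors
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    return float(np.max(np.abs(beta_hat - beta)))

rng = np.random.default_rng(0)
beta_true = np.array([2.0, -1.0])  # hypothetical true parameters
errors = {n: lse_error(n, beta_true, rng) for n in (50, 500, 5000, 50000)}
for n, err in errors.items():
    print(n, round(err, 4))
```

Under these assumptions the estimation error is of order $n^{-1/2}$, so the printed errors should become small for the largest sample sizes, although a single run need not be monotone in $n$.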
In this study, the experimental data are regarded as imprecise. Moreover, we assume that these data can be interpreted as samples drawn from a fuzzy set-valued random variable.
To extend the least-squares method to cases involving imprecise data—while ensuring that it reduces to the classical method when the model elements are precise—we introduce the following fuzzy linear regression model:
where
and
are random fuzzy variables,
are unknown crisp regression parameters that must be estimated based on the observed values of
and
, and
are assumed to be crisp random error vectors. In Equation (3), the operator ⊕ denotes the addition of fuzzy sets, while + represents the conventional addition of vectors.
It is important to note that model (3) addresses two distinct types of uncertainty, vagueness and randomness, simultaneously. Hence, model (3) can be viewed as an extension of the conventional linear regression model to situations where the observations of both explanatory and response variables are fuzzy numbers, since crisp values can be treated as a special case of degenerate fuzzy numbers. Therefore, this model is different from Tanaka’s fuzzy regression models, which are discussed in [32,38,39], etc.
Vague data are modeled by specific fuzzy sets on the real number space.
Following [21,24], we present some definitions concerning fuzzy sets and fuzzy numbers, along with fundamental results from fuzzy theory.
A fuzzy subset of $\mathbb{R}$ is defined as a mapping, known as the membership function, from $\mathbb{R}$ to the interval $[0,1]$. Consequently, a fuzzy subset $A$ is represented by its membership function $\mu_A$.
For any $\alpha \in (0,1]$, the crisp set $A_\alpha = \{x \in \mathbb{R} : \mu_A(x) \ge \alpha\}$ is referred to as the $\alpha$-cut of $A$.
A fuzzy number $A$ is a normal and convex fuzzy subset of the real line with a bounded support.
The collection of all fuzzy numbers is denoted by . Notably, there are no universally applicable guidelines for determining the membership function of a fuzzy observation.
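As a specific instance, we utilize a particular parametric class of fuzzy numbers, referred to as $LR$-fuzzy numbers:
\[ \mu_A(x) = \begin{cases} L\left(\dfrac{m - x}{l}\right) & \text{if } x \le m, \\ R\left(\dfrac{x - m}{r}\right) & \text{if } x > m, \end{cases} \]
where $L, R : [0,\infty) \to [0,1]$ are predefined left-continuous and non-increasing functions satisfying $L(0) = R(0) = 1$ and $L(1) = R(1) = 0$. The functions $L$ and $R$ serve as the left and right shape functions of $A$, with $m$ representing the mode of $A$, while $l$ and $r$ correspond to the left and right spreads of $A$. The notation $A = (m, l, r)_{LR}$ is used to represent an $LR$-fuzzy number.
The parameters $l$ and $r$ indicate the level of fuzziness associated with the numerical value, which may be either symmetric or asymmetric. If both $l$ and $r$ are equal to 0, the numerical value is entirely crisp, meaning it has no fuzziness. The $\alpha$-cuts of the fuzzy numbers are given by the interval representation:
\[ A_\alpha = \left[ m - l\,L^{-1}(\alpha),\; m + r\,R^{-1}(\alpha) \right], \quad \alpha \in (0,1]. \]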
The collection of all $LR$-fuzzy numbers is denoted as $\mathcal{F}_{LR}(\mathbb{R})$. Specifically, when $L(x) = R(x) = \max\{0, 1 - x\}$ in $A = (m, l, r)_{LR}$, the fuzzy number $A$ is referred to as a triangular fuzzy number, represented as $A = (m, l, r)_T$. Consequently, fuzzy numbers provide an effective means for modeling imprecise data.
The statistical treatment of fuzzy numbers usually requires elementary operations between them. Operations on fuzzy numbers are defined based on Zadeh’s extension principle [40].
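An advantageous aspect of $LR$-fuzzy numbers is that the operations ⊕ and · can be expressed through straightforward operations on the parameters $m$, $l$, $r$:
\[ (m_1, l_1, r_1)_{LR} \oplus (m_2, l_2, r_2)_{LR} = (m_1 + m_2,\; l_1 + l_2,\; r_1 + r_2)_{LR} \]
and
\[ \lambda \cdot (m, l, r)_{LR} = \begin{cases} (\lambda m,\; \lambda l,\; \lambda r)_{LR} & \text{if } \lambda > 0, \\ (\lambda m,\; -\lambda r,\; -\lambda l)_{RL} & \text{if } \lambda < 0, \\ \mathbf{1}_{\{0\}} & \text{if } \lambda = 0. \end{cases} \]
On the other hand, applying the least-squares method to fuzzy data requires a suitable metric on the space of fuzzy sets. Various metrics can be defined on the class of fuzzy numbers. The distance between two fuzzy numbers is often determined by the difference between their $\alpha$-cuts.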
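The parameter-wise arithmetic and the $\alpha$-cuts of triangular fuzzy numbers can be sketched in a few lines of Python. The class name `TriangularFuzzyNumber`, its field names, and the example values are assumptions chosen for illustration; the sketch covers only the triangular case $L(x) = R(x) = \max\{0, 1 - x\}$.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TriangularFuzzyNumber:
    m: float  # mode
    l: float  # left spread (non-negative)
    r: float  # right spread (non-negative)

    def __post_init__(self):
        if self.l < 0 or self.r < 0:
            raise ValueError("spreads must be non-negative")

    def __add__(self, other):
        # Fuzzy addition acts parameter-wise on mode and spreads.
        return TriangularFuzzyNumber(self.m + other.m, self.l + other.l, self.r + other.r)

    def scale(self, lam):
        # A negative scalar swaps the roles of the left and right spreads.
        if lam >= 0:
            return TriangularFuzzyNumber(lam * self.m, lam * self.l, lam * self.r)
        return TriangularFuzzyNumber(lam * self.m, -lam * self.r, -lam * self.l)

    def alpha_cut(self, alpha):
        # Triangular shape: the cut is [m - l(1 - alpha), m + r(1 - alpha)].
        if not 0 < alpha <= 1:
            raise ValueError("alpha must lie in (0, 1]")
        return (self.m - self.l * (1 - alpha), self.m + self.r * (1 - alpha))

about_ten = TriangularFuzzyNumber(10.0, 1.0, 2.0)   # "approximately 10"
about_five = TriangularFuzzyNumber(5.0, 0.5, 0.5)   # "approximately 5"
total = about_ten + about_five
flipped = about_ten.scale(-2.0)
print(total, flipped, about_ten.alpha_cut(0.5))
```

A crisp number is the special case with both spreads equal to zero, for which every $\alpha$-cut collapses to the single point $m$.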
A significant class of metrics can be formulated using support functions. The support function corresponding to any fuzzy set
is expressed as:
where
denotes the
-dimensional unit sphere in
, and
signifies the inner product in
. A metric on
is introduced based on the
-metric within the space of Lebesgue integrable functions:
for any
. Ming and Friedman [
39] proposed a metric for fuzzy numbers
X and
Y, defined in terms of the distance between their respective images. For
, the metric is given as:
where
and
represent the lower and upper endpoints of
. Diamond [
21] introduced an alternative metric applicable to the set of all triangular fuzzy numbers. Let
denote the collection of all triangular fuzzy numbers in
. For
, the metric is defined as follows:
where
represents the compact interval supporting
X, and
refers to its mode. Given
and
, the metric is equivalently expressed as:
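As a minimal sketch of a metric of this type, the following Python function assumes Diamond's distance in its commonly cited form: the square root of the sum of the squared differences between the left support endpoints, the modes, and the right support endpoints. The function name `diamond_distance` and the (mode, left spread, right spread) tuple encoding are illustrative assumptions rather than notation from this paper.

```python
import math

def diamond_distance(x, y):
    """Distance between triangular fuzzy numbers given as (mode, left spread, right spread)."""
    mx, lx, rx = x
    my, ly, ry = y
    left = (mx - lx) - (my - ly)   # difference of the left support endpoints
    mode = mx - my                 # difference of the modes
    right = (mx + rx) - (my + ry)  # difference of the right support endpoints
    return math.sqrt(left ** 2 + mode ** 2 + right ** 2)

d_example = diamond_distance((10.0, 1.0, 2.0), (12.0, 1.0, 1.0))
print(d_example)
```

For two crisp numbers (zero spreads) this distance reduces to $\sqrt{3}$ times their absolute difference, so it agrees with the usual distance on the real line up to a constant factor.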
Then, , where the notation means that it converges almost surely.
Theorem 1 (Lemma in [35], p. 125). Let be a sequence of real numbers and be a sequence of i.i.d. r.v.’s such that and for all i. Moreover, as . Then, as .
3. Fuzzy Least-Squares Estimators and Asymptotic Normality
Statistical analysis under model (3) is primarily concerned with inference about the regression parameters. We reconsider the multiple regression model:
where
,
(
) are triangular fuzzy numbers
,
, where
,
are the modes,
,
are the left spreads and
,
are the right spreads of
and
, respectively. And we assume that the crisp random vectors
for expressing randomness represented by
with crisp random variables
,
,
.
In (4), the operations + and · represent standard addition and scalar multiplication of vectors, respectively.
It is important to note that the component-wise representation of model (4) is expressed through the following crisp models:
where the terms
,
, and
in
are subject to the constraints
and
, almost surely.
In the least-squares approach, our goal is to determine the estimators for that minimize the sum of squared residuals between the n observed values of Y and their corresponding predicted values .
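To make the minimization concrete, the sketch below fits crisp coefficients to triangular fuzzy data by minimizing the sum of squared distances between observed and predicted responses, where the distance is assumed to be the sum of squared differences of modes and support endpoints. It further assumes that all coefficients are non-negative, so that scalar multiplication does not swap left and right spreads. Under these assumptions the problem reduces to one crisp least-squares fit on stacked data. The function `fuzzy_lse_nonneg` and the simulated data are hypothetical, and the sketch is not claimed to coincide with the estimator derived in this paper.

```python
import numpy as np

def fuzzy_lse_nonneg(Xm, Xl, Xr, ym, yl, yr):
    """Least-squares fit for triangular fuzzy data, assuming non-negative coefficients.

    Xm, Xl, Xr: (n, p) arrays of modes, left spreads, right spreads of the inputs.
    ym, yl, yr: length-n arrays of modes, left spreads, right spreads of the responses.
    """
    Xm, Xl, Xr = (np.asarray(a, dtype=float) for a in (Xm, Xl, Xr))
    ym, yl, yr = (np.asarray(a, dtype=float) for a in (ym, yl, yr))
    # Stack the mode, left-endpoint, and right-endpoint equations into one crisp problem.
    design = np.vstack([Xm, Xm - Xl, Xm + Xr])
    target = np.concatenate([ym, ym - yl, ym + yr])
    beta, *_ = np.linalg.lstsq(design, target, rcond=None)
    if np.any(beta < 0):
        raise ValueError("negative coefficient: the non-negativity assumption fails")
    return beta

# Hypothetical noise-free data generated from known non-negative coefficients.
rng = np.random.default_rng(1)
n, beta_true = 30, np.array([1.5, 0.5])
Xm = rng.uniform(1.0, 5.0, size=(n, 2))
Xl = rng.uniform(0.1, 0.5, size=(n, 2))
Xr = rng.uniform(0.1, 0.5, size=(n, 2))
ym, yl, yr = Xm @ beta_true, Xl @ beta_true, Xr @ beta_true
beta_hat = fuzzy_lse_nonneg(Xm, Xl, Xr, ym, yl, yr)
print(beta_hat)
```

When all spreads are zero, the three stacked blocks coincide and the fit reduces to the classical least-squares estimate, in line with the requirement that the fuzzy method contain the crisp one as a special case.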
Any vector
that minimizes
is referred to as the fuzzy least-squares estimator of
, given the fuzzy data
, where
and
, with
and
d.
Furthermore, if
and
, then
Denoting
, we obtain for
:
Thus, the fuzzy least-squares normal equations for
are given by:
In matrix terms, the normal equations (6) are as follows:
where
is the
matrices of known constants
which express the values of the
jth independent variable for the
ith sample,
is the
matrices of the values
, which is the difference of the right spread
and left spread
of the
,
is the
vector of observations, and
, where
and
are the
vectors of left and right spreads of response variable
, respectively.
If
, then (7) has a unique solution, given by
Moreover, we obtain from (5) that
where
.
Under the same assumptions, the Gauss–Markov theorem asserts that, for any fixed $p$-vector $c$, the linear combination $c^{\top}\hat{\beta}$ is the best linear unbiased estimator (BLUE) of $c^{\top}\beta$; that is, $c^{\top}\hat{\beta}$ has the minimum variance among all linear unbiased estimators of $c^{\top}\beta$. While this is a significant optimality property of the FLSE, its practical utility is limited unless the distribution of the estimator is known.
If we assume that the random errors in (8) follow a normal distribution, then the FLSE is identical to the MLE, and we have
Moreover, utilizing standard theoretical results, one can construct confidence intervals and perform hypothesis tests for the fuzzy regression parameters, as well as derive prediction intervals for new observations, based on the known values of the fuzzy regression variables.
Nonetheless, the Gaussian assumption is overly restrictive and often difficult to validate in practical applications, which calls for more suitable alternatives. This is where large-sample methods can help, and Section 4 discusses the asymptotic properties of the FLSE.
6. Conclusions
In data analysis, many situations arise where the data are both stochastic and fuzzy because of uncertain information or linguistic expressions. Fuzzy theory offers a promising approach to handling such ambiguous data. In particular, many fuzzy regression models have been proposed to investigate causal relationships within such datasets. Beyond estimation, however, it is essential to study the mathematical properties of fuzzy regression estimates. Optimality properties, particularly for large samples, are of central importance in statistics, yet the literature on these topics remains sparse.
In this context, the present study examines a fuzzy regression model with multiple fuzzy inputs, fuzzy outputs, and fuzzy error terms, and establishes properties such as consistency, asymptotic normality, and confidence regions with relative efficiency. We introduce a streamlined formulation of the fuzzy least-squares estimator and examine its fundamental properties. Our findings indicate that the proposed estimator satisfies the best linear unbiased estimator (BLUE) criterion. Furthermore, the asymptotic normality obtained under broad assumptions provides a basis for constructing statistical tests and confidence intervals, which are crucial for both model validation and forecasting; these topics merit further exploration in subsequent research. The asymptotic relative efficiency indicates that the analysis of this paper using triangular fuzzy numbers is valid.
In addition, a significant challenge for our future research is to obtain analogous results for broader and more intricate fuzzy regression models based on trapezoidal fuzzy numbers, LR-fuzzy numbers, and similar constructs.