On Supervised Classification of Feature Vectors with Independent and Non-Identically Distributed Elements

In this paper, we investigate the problem of classifying feature vectors with mutually independent but non-identically distributed elements that take values from a finite alphabet. First, we show the importance of this problem. Next, we propose a classifier and derive an analytical upper bound on its error probability. We show that the error probability tends to zero as the length of the feature vectors grows, even when only one training feature vector per label is available. Thereby, we show that, for this important problem, at least one asymptotically optimal classifier exists. Finally, we provide numerical examples showing that the proposed classifier outperforms conventional classification algorithms when the amount of training data is small and the length of the feature vectors is sufficiently large.


Background
Supervised classification is a machine learning technique that maps an input feature vector to an output label based on a set of correctly labeled training data. There is no single learning algorithm that works best on all supervised learning problems, as shown by the no free lunch theorem in [1]. As a result, many algorithms have been proposed in the literature, whose performance depends on the underlying problem and on the amount of training data available. The most widely used algorithms in the literature are decision trees [2,3], Support Vector Machines (SVM) [4,5], Rule-Based Systems [6], naive Bayes classifiers [7], k-nearest neighbors (KNN) [8], logistic regression, and neural networks [9,10].

Motivation
In the following, we discuss the motivation for this work.

Lack of Tight Upper Bounds on the Performance of Classifiers
In general, there are no tight upper bounds on the performance of the classifiers used in practice. Many previous works provide only experimental performance results. However, this approach has drawbacks. First, one has to rely on trial and error in order to develop a good classifier for a given problem, which undermines reliability. Second, algorithms whose performance has been verified only experimentally may work for a given problem but fail when applied to a similar one. Finally, experimental results do not provide intuition about the underlying problem, whereas analytical results provide an understanding of the underlying problem and of the corresponding solutions.
Motivated by this, in this paper we aim to investigate classifiers with analytical upper bounds on their performance.

Independent and Non-Identically Distributed Features
In general, we can categorize the statistical properties of the feature vectors, which are the input to the classifier, into three types. To this end, let Y^n(X) = (Y_1(X), Y_2(X), . . . , Y_n(X)) denote the input feature vector to the supervised classifier, where n is the length of the feature vector and X is the label to which the feature vector Y^n(X) belongs. Then, we can distinguish the following three types of feature vectors, depending on the statistics of the elements of Y^n(X).
The first type of feature vector is when the elements of Y^n(X) are independent and identically distributed (i.i.d.). This is the simplest features model, but also the least applicable in practice. This model is identical to hypothesis testing, which has been well investigated in the literature [11–13]. As a result, tight upper bounds on the performance of supervised learning algorithms for this type of feature vector are available in the hypothesis testing literature. For instance, the authors in [11] showed that the posterior entropy and the maximum a posteriori error probability decay to zero with the length of the feature vector at the same exponential rate, where the maximum achievable exponent is the minimum Chernoff information. In [12], the authors determine the requirements on the length of the vector Y^n(X) and the number of labels m in order to achieve an exponentially vanishing error probability in testing m hypotheses while minimizing the rejection zone. In [13], the authors provide an upper bound and a lower bound on the error probability of Bayesian m-ary hypothesis testing in terms of the conditional entropy.
The second type of feature vector is when the elements of Y^n(X) are mutually dependent and non-identically distributed (d.non-i.d.). This features model is the most general and the most applicable in practice. However, it is also the most difficult to tackle analytically. As a result, the supervised learning algorithms proposed for this features model lack tight analytical upper bounds on their performance [14–23]. This is because there is no framework that produces closed-form results when deriving the statistics of vectors with d.non-i.d. elements whose underlying distributions are unknown. How, then, can we analytically investigate classifiers for practical scenarios in which the feature vectors have d.non-i.d. elements? A possible approach leads us to the third type of feature vector, explained in the following.
The third type of feature vector is when the elements of Y^n(X) are mutually independent but non-identically distributed (i.non-i.d.). This features model is much simpler than the d.non-i.d. features model and, more importantly, it is analytically tractable, as we show in this paper. Furthermore, this features model is applicable in practice. Specifically, there exists a class of algorithms, known as Independent Component Analysis (ICA), that transforms vectors with d.non-i.d. elements into vectors with i.non-i.d. elements with zero or negligible loss of information [24–28]. The origins of ICA can be traced back to Barlow [29], who argued that a good representation of binary data can be achieved by an invertible transformation that transforms vectors with d.non-i.d. elements into vectors with i.non-i.d. elements. Finding such a transformation with no prior information about the distribution of the data had been considered an open problem until recently [28]. Specifically, the authors in [28] show that this hard problem can be solved accurately with a branch-and-bound search tree algorithm, or tightly approximated with a series of linear problems. Thereby, the authors in [28] provide the first efficient set of solutions to Barlow's problem.

Small Training Set
The main factor that impacts the accuracy of supervised classification is the amount of training data. In fact, most supervised algorithms are able to learn only if there is a very large set of training data available [30]. The main reason for this is the curse of dimensionality [31,32], which states that "the higher the dimensionality of the feature vectors, the more training data are needed for the supervised classifier" [33]. For example, supervised classification methods such as random forest [34,35] and KNN [36] suffer from the curse of dimensionality. However, having large training data sets is not always possible in practice. As a result, designing a supervised classification algorithm that exhibits good performance even when the training data set is extremely small is important.
Motivated by this, in this paper, we investigate supervised classifiers for the case when t training feature vectors per label are available, where t = 1, 2, ...

Contributions
In this paper, we propose an algorithm for supervised classification of feature vectors with i.non-i.d. elements when the number of training feature vectors per label is t, where t = 1, 2, . . . Next, we derive an upper bound on the error probability of the proposed classifier for uniformly distributed labels and prove that the error probability decays exponentially to zero as the length of the feature vector, n, grows, even when only one training vector per label is available, i.e., when t = 1. Hence, the proposed classification algorithm provides asymptotically optimal performance even when the number of training vectors per label is extremely small. We compare the performance of the proposed classifier with those of the naive Bayes classifier and the KNN algorithm. Our numerical results show that the proposed classifier significantly outperforms the naive Bayes classifier and the KNN algorithm when the number of training feature vectors per label is small and the length of the feature vectors n is sufficiently large.
The proposed algorithm is a form of the nearest neighbor classification algorithm, where the nearest neighbor is searched in the domain of empirical distributions. As a result, we refer to the algorithm as the nearest empirical distribution. The nearest empirical distribution algorithm is not new and, to the best of our knowledge, it was first proposed in [37] for the case when the elements of Y n (X) are i.i.d., i.e., for the equivalent problem of hypothesis testing. However, in this paper, we propose the nearest empirical distribution algorithm for the case when the elements of Y n (X) are i.non-i.d., which is much more complex than the problem of hypothesis testing where the elements of Y n (X) are i.i.d.
To the best of our knowledge, this is the first paper that investigates the important problem of classifying feature vectors with i.non-i.d. elements and provides an upper bound on the resulting error probability. The novelty of this paper lies not in the classifier itself, but rather in showing the importance of the problem of classifying feature vectors with i.non-i.d. elements and in showing analytically that at least one classifier with an asymptotically optimal error probability exists when at least one training feature vector per label is available.
The remainder of this paper is structured as follows. In Section 2, we formulate the considered classification problem. In Section 3, we present our classifier and derive an upper bound on its error probability. In Section 4, we provide numerical examples of the performance of the proposed classifier. Finally, Section 5 concludes the paper.

Problem Formulation
The machine learning model is comprised of a label X, a feature vector Y^n(X) = (Y_1(X), Y_2(X), . . . , Y_n(X)) of length n mapped to the label X, and a detected label X̂, as shown in Figure 1. In this paper, we adopt the information-theoretic style of notation, whereby random variables are denoted by capital letters and their realizations by lowercase letters. The feature vector Y^n(X) is the input to the machine learning algorithm, whose aim is to detect the label X from the observed feature vector Y^n(X).
The performance of the machine learning algorithm is measured by the error probability P_e = Pr(X ≠ X̂). We adopt the modelling in [38–40] and represent the dependency between the label X and the feature vector Y^n(X) via a joint probability distribution p_{X,Y^n}(x, y^n). Now, in order to gain a better understanding of the problem, we include the joint probability distribution p_{X,Y^n}(x, y^n) in the model in Figure 1. To this end, since p_{X,Y^n}(x, y^n) = p_{Y^n|X}(y^n|x) p_X(x) holds, instead of p_{X,Y^n}(x, y^n) we can include the conditional probability distribution p_{Y^n|X}(y^n|x) and the probability distribution p_X(x) in the model in Figure 1, and thereby obtain the model in Figure 2. The classification learning model in Figure 2 is then a system comprised of a source that generates a label X according to the distribution p_X(x), a feature vector generator modelled by the conditional probability distribution p_{Y^n|X}(y^n|x), a feature vector Y^n, a classifier that aims to detect X from the observed feature vector Y^n, and the detected label X̂. Note that the system model in Figure 2 can be seen equivalently as a communication system comprised of a source X, a channel with input X and output Y^n, and a decoder (i.e., a detector) that aims to detect X from Y^n. The notation used in this paper, the letter X for labels and the letter Y for features, follows the notation used in information theory for modelling communication systems. In the classification model shown in Figure 2, we assume that the label X takes values from the set X according to p_X(x) = 1/|X|, where |·| denotes the cardinality of a set. Next, we assume that the i-th element of the feature vector Y^n, Y_i, for i = 1, 2, . . . , n, takes values from the set Y = {y_1, y_2, . . . , y_{|Y|}} according to the conditional probability distribution p_{Y_i|X}(y_i|x).
Moreover, we assume that the elements of the feature vector Y^n are i.non-i.d. As a result, the feature vector Y^n takes values from the set Y^n according to the conditional probability distribution p_{Y^n|X}(y^n|x) given by

p_{Y^n|X}(y^n|x) (a)= ∏_{i=1}^{n} p_{Y_i|X}(y_i|x) (b)= ∏_{i=1}^{n} p_i(y_i|x),    (1)

where (a) comes from the fact that the elements in the feature vector Y^n are mutually independent and (b) is for the sake of notational simplicity, where p_i is used instead of p_{Y_i|X}. As a result of (1), the considered classification model in Figure 2 can be represented equivalently as in Figure 3. Next, we assume that p_i(y_i|x), ∀i, and thereby p_{Y^n|X}(y^n|x), are unknown to the classifier. Instead, the classifier knows X and Y, and for each x_i ∈ X, where i = 1, 2, . . . , |X|, it has access to a finite set of t correctly labelled input–output pairs, denoted by T_i and referred to as the training set for label x_i. Finally, we assume that the condition in (2) holds, which means that the distribution of the feature vectors Y^n(X) for label X = i is not a perturbation of the distribution of the feature vectors Y^n(X) for label X = j. As a result, the proposed classifier only applies to the subset of data vectors with i.non-i.d. elements that satisfy (2).
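The factorization in (1) can be illustrated with a short sketch. The alphabet, feature length, and per-position distributions below are illustrative assumptions, not values from the paper.

```python
import random

# Sketch of the i.non-i.d. feature model in (1): each position i has its
# own distribution p_i(y | x) over a finite alphabet, and the positions
# are sampled independently, so p(y^n | x) factorizes as a product.
random.seed(42)
alphabet = [0, 1, 2]
# One (illustrative) distribution per position; deliberately non-identical.
p = [[0.8, 0.1, 0.1],
     [0.1, 0.8, 0.1],
     [0.1, 0.1, 0.8],
     [0.3, 0.4, 0.3],
     [0.5, 0.25, 0.25]]

def draw_feature_vector(p, alphabet):
    """Draw y^n by sampling each position from its own distribution."""
    return [random.choices(alphabet, weights=p_i, k=1)[0] for p_i in p]

y = draw_feature_vector(p, alphabet)   # a length-5 vector over {0, 1, 2}
```

Each call produces one feature vector whose elements are independent but drawn from different per-position distributions, which is exactly the i.non-i.d. setting assumed above.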
For the classification system model defined above and illustrated in Figure 3, we wish to propose a classifier that exhibits an asymptotically optimal error probability P_e = Pr(X ≠ X̂) with respect to the length n of Y^n for any t ≥ 1, i.e., for any t ≥ 1, P_e → 0 as n → ∞. Moreover, we wish to obtain an analytical upper bound on the error probability of the proposed classifier for given t and n.

The Proposed Classifier and Its Performance
In this section, we propose our classifier, derive an analytical upper bound on its error probability, and prove that the classifier exhibits asymptotically optimal performance as the length n of the feature vector Y^n grows to infinity.
For a given vector v = (v_1, v_2, . . . , v_m), let ‖v‖_r denote its Minkowski norm of order r, i.e.,

‖v‖_r = (∑_{i=1}^{m} |v_i|^r)^{1/r}.    (3)

Moreover, for a given feature vector y^k = (y_1, y_2, . . . , y_k), let I[y^k = y] be the function defined as

I[y^k = y] = ∑_{i=1}^{k} Z[y_i = y],    (4)

where Z[y_i = y] is an indicator function assuming the value 1 if y_i = y and 0 otherwise. Hence, I[y^k = y] counts the number of elements in y^k that have the value y.

The Proposed Classifier
Let ŷ^{nt}_i be the vector obtained by concatenating all t training feature vectors for the input label x_i, i.e.,

ŷ^{nt}_i = (ŷ^n_{i,1}, ŷ^n_{i,2}, . . . , ŷ^n_{i,t}),    (5)

where ŷ^n_{i,l} denotes the l-th training feature vector for label x_i. Let P_{ŷ^{nt}_i} be the empirical probability distribution of the concatenated training feature vector for label x_i, ŷ^{nt}_i, given by

P_{ŷ^{nt}_i}(y) = I[ŷ^{nt}_i = y]/(nt),  ∀y ∈ Y.    (6)

Let y^n be the feature vector observed at the classifier, whose label the classifier wants to detect, and let P_{y^n} denote the empirical probability distribution of y^n, given by

P_{y^n}(y) = I[y^n = y]/n,  ∀y ∈ Y.    (7)

Using the above notation, we propose the following classifier.
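As a concrete numerical sketch (the alphabet and training vectors below are illustrative): the empirical probability of a symbol is taken as its count divided by the vector length, consistent with the counting function I above.

```python
from collections import Counter

# Empirical distribution of a concatenated training vector: for a label
# with t training vectors of length n, the concatenation has length n*t,
# and each symbol's empirical probability is count / (n*t).
def empirical_distribution(vec, alphabet):
    counts = Counter(vec)
    return [counts[a] / len(vec) for a in alphabet]

alphabet = [0, 1, 2]
training = [[0, 1, 1, 2], [1, 1, 0, 2]]             # t = 2 vectors, n = 4
concat = [s for vec in training for s in vec]       # length n*t = 8
P_hat = empirical_distribution(concat, alphabet)    # -> [0.25, 0.5, 0.25]
```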
Proposition 1. For the considered system model, we propose a classifier with the following classification rule:

x̂ = arg min_{x_i ∈ X} ‖P_{y^n} − P_{ŷ^{nt}_i}‖_r,    (8)

where r ≥ 1 and ties are resolved by assigning a label among the ties uniformly at random. (For example, if ‖P_{y^n} − P_{ŷ^{nt}_i}‖_r = ‖P_{y^n} − P_{ŷ^{nt}_j}‖_r holds for some i ≠ j, we set x̂ = x_i or x̂ = x_j uniformly at random.)
As seen from (8), the proposed classifier assigns the label x_i if the empirical probability distribution of the concatenated training feature vector mapped to label x_i, P_{ŷ^{nt}_i}, is the closest, in terms of the Minkowski distance of order r, to the empirical probability distribution P_{y^n} of the observed feature vector. In that sense, the proposed classifier can be considered a nearest empirical distribution classifier.
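The rule above can be sketched in a few lines, using r = 2 and illustrative data; in this sketch ties go to the lowest label index rather than being randomized, and all names are ours, not the paper's.

```python
# Minimal sketch of the nearest-empirical-distribution rule:
# compare the empirical distribution of the observed vector with the
# empirical distribution of each label's concatenated training vector.
def empirical(vec, alphabet):
    return [sum(1 for s in vec if s == a) / len(vec) for a in alphabet]

def minkowski(p, q, r):
    return sum(abs(a - b) ** r for a, b in zip(p, q)) ** (1.0 / r)

def classify(y, training_concats, alphabet, r=2):
    """training_concats[i] plays the role of the concatenation for label i."""
    P_y = empirical(y, alphabet)
    dists = [minkowski(P_y, empirical(t, alphabet), r)
             for t in training_concats]
    return min(range(len(dists)), key=lambda i: dists[i])

alphabet = [0, 1]
# Label 0 favours symbol 0; label 1 favours symbol 1 (t = 1 per label).
training = [[0, 0, 0, 1], [1, 1, 1, 0]]
label = classify([0, 0, 1, 0], training, alphabet)   # -> 0
```

Note that the classifier never estimates per-position distributions; it only compares pooled empirical distributions, which is what makes it usable with a single training vector per label.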

Upper Bound on the Error Probability
The following theorem establishes an upper bound on the error probability of the proposed classifier.

Theorem 1. Let P̄_j, for j = 1, 2, . . . , |X|, be a vector defined as

P̄_j = (p̄(y_1|x_j), p̄(y_2|x_j), . . . , p̄(y_{|Y|}|x_j)),    (9)

where p̄(y|x_j) is given by

p̄(y|x_j) = (1/n) ∑_{i=1}^{n} p_i(y|x_j).    (10)

Then, for a given r ≥ 1, the error probability of the proposed classifier is upper bounded as in (11), where ε is given by (12).

Proof of Theorem 1. Without loss of generality, we assume that x_1 is the input to p_{Y^n|X}(y^n|x) and that y^n is observed. Let A_k, for 1 ≤ k ≤ |Y|, be the set defined in (13), and let B_k, for 1 ≤ k ≤ |Y|, be the set defined in (14). For y^n ∈ A_k, we obtain (15), where (a) follows from (13). Moreover, for ŷ^{nt}_1 ∈ B_k, we obtain (16), where (a) follows from (14). Next, we have the upper bound in (17), where (a) follows from the Minkowski inequality. Combining (15)–(17), we obtain (18). Hence, the Minkowski distance between the empirical probability distribution of the observed vector y^n and the empirical probability distribution of the concatenated training vector for label x_1 is upper bounded by the right-hand side of (18). We now derive a lower bound for ŷ^{nt}_i, where i ≠ 1. For any x_i such that i ≠ 1, we have (19), where (a) follows from (15) and (b) is again due to the Minkowski inequality. The expression in (19) can be written equivalently as (20), where i ≠ 1. Now, using the definitions of P_{ŷ^{nt}_i} and P̄_1, given by (6) and (9), respectively, in (20), we can replace the expression on the right-hand side of (20) by ‖P_{ŷ^{nt}_i} − P̄_1‖_r, and thereby for any i ≠ 1 we have (21). The expression in (21) represents a lower bound on the Minkowski distance of order r between the empirical probability distribution of the observed vector y^n and the empirical probability distribution of the concatenated training vector for any label x_i, where i ≠ 1.
Using the bounds in (18) and (21), we now relate the left-hand sides of (18) and (21). As long as the inequality in (22) holds for each i ≠ 1, which is equivalent to (23) for i ≠ 1, we obtain (24), where (a), (b), and (c) follow from (18), (22), and (21), respectively. Thereby, from (24), we have (25) for i ≠ 1. Note that the right- and left-hand sides of (25) can be replaced by the Minkowski distances of the vectors v_1 = (I[y^n = y_1]/n − I[ŷ^{nt}_1 = y_1]/(nt), . . .) and v_2 = (I[y^n = y_1]/n − I[ŷ^{nt}_i = y_1]/(nt), . . .), respectively. Now, (26) and (27) can be replaced by ‖P_{y^n} − P_{ŷ^{nt}_1}‖_r and ‖P_{y^n} − P_{ŷ^{nt}_i}‖_r, respectively, by the definitions of P_{y^n} and P_{ŷ^{nt}_i} given by (7) and (6), respectively. Therefore, (25) can be written equivalently as (28). Let us now highlight what we have obtained: there is an ε for which, if (23) holds for i ≠ 1, and if for that ε there are sets A and B such that y^n ∈ A and ŷ^{nt}_1 ∈ B, then (28) holds for i ≠ 1, and thereby our classifier detects that x_1 is the correct label. Using this, we can upper bound the error probability as in (29), where S is the set defined in (30). In the following, we derive the expression in (29). The right-hand side of (29) can be upper bounded as in (31), where (a) follows from Boole's inequality. Next, we have the upper bound in (32) for the first expression on the right-hand side of (31), i.e., for Pr(y^n ∉ A | ε ∈ S), where Ā_k is the complement of A_k and (a) follows from Boole's inequality. Note that the random variables Z[y_1 = y_k], Z[y_2 = y_k], . . . , Z[y_n = y_k] in (32) are n independent Bernoulli random variables with probabilities of success p_1(y_k|x_1), p_2(y_k|x_1), . . . , p_n(y_k|x_1), respectively. Let W[y_k] be a binomial random variable with parameters (n, p̄(y_k|x_1)). We proceed with the proof by introducing the following well-known theorem by Hoeffding [41].
Theorem 2 (Hoeffding [41]). Assume that Z_1, Z_2, . . . , Z_n are n independent Bernoulli random variables with probabilities of success p_1, p_2, . . . , p_n, respectively. Next, let Z be defined as Z = Z_1 + Z_2 + . . . + Z_n, and let p̄ be defined as p̄ = (p_1 + p_2 + . . . + p_n)/n. Let W be a binomial random variable with parameters (n, p̄). Then, for given a and b, where 0 ≤ a ≤ np̄ ≤ b ≤ n holds, we have

Pr(a ≤ Z ≤ b) ≥ Pr(a ≤ W ≤ b).    (33)

In other words, the probability distribution of W is more dispersed around its mean np̄ than is the probability distribution of Z. Except in the trivial case when a = b = 0, the bound in (33) holds with equality if and only if p_1 = . . . = p_n = p̄.
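This dispersion statement can be checked numerically with exact probabilities; the success probabilities and the interval below are illustrative choices satisfying the condition a ≤ np̄ ≤ b.

```python
# Exact check of the dispersion statement: for independent Bernoulli
# variables with non-identical success probabilities and mean p_bar,
# Pr(a <= Z <= b) is at least the binomial value Pr(a <= W <= b).
def sum_pmf(probs):
    # exact pmf of the sum of independent Bernoulli(p_i) via convolution
    pmf = [1.0]
    for p in probs:
        new = [0.0] * (len(pmf) + 1)
        for k, q in enumerate(pmf):
            new[k] += q * (1.0 - p)
            new[k + 1] += q * p
        pmf = new
    return pmf

probs = [0.1, 0.3, 0.5, 0.7, 0.9]       # illustrative, non-identical
n = len(probs)
p_bar = sum(probs) / n                  # p_bar = 0.5
z_pmf = sum_pmf(probs)                  # Z: Poisson-binomial pmf
w_pmf = sum_pmf([p_bar] * n)            # W: Binomial(n, p_bar) pmf
a, b = 2, 3                             # satisfies a <= n*p_bar = 2.5 <= b
z_prob = sum(z_pmf[a:b + 1])            # Pr(a <= Z <= b)
w_prob = sum(w_pmf[a:b + 1])            # Pr(a <= W <= b)
# Theorem 2 predicts z_prob >= w_prob: W is the more dispersed variable.
```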
Setting a = n(p̄ − δ) and b = n(p̄ + δ) in (33), we obtain (34). Using (34), we have the upper bound in (35), where (a) follows from (34). We now return to the proof of Theorem 1. According to Theorem 2, the probability distribution of W[y_k] is more dispersed around its mean np̄(y_k|x_1) than is the probability distribution of ∑_{1≤j≤n} Z[y_j = y_k]. Therefore, we can upper bound the probability in the last line of (32) as in (36), where the set S is defined in (30) and (a) follows from (35). Now, let us introduce another well-known result by Hoeffding [42].
Theorem 3 (Hoeffding's inequality [42]). Let W_1, W_2, . . . , W_n be n independent random variables such that for each 1 ≤ i ≤ n we have Pr(W_i ∈ [a_i, b_i]) = 1. Then, for S_n defined as S_n = W_1 + W_2 + . . . + W_n, we have

Pr(|S_n − E[S_n]| ≥ s) ≤ 2 exp(−2s² / ∑_{i=1}^{n}(b_i − a_i)²),    (37)

where E[S_n] is the expectation of S_n.
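A quick Monte Carlo sanity check of the one-sided form of this bound for Bernoulli variables (so a_i = 0, b_i = 1): the deviation probability Pr(S_n − E[S_n] ≥ nδ) should stay below exp(−2nδ²). The parameters are illustrative.

```python
import math
import random

# Estimate Pr(S_n - E[S_n] >= n*delta) for n Bernoulli(1/2) variables
# and compare it against the Hoeffding bound exp(-2*n*delta**2).
random.seed(1)
n, delta, trials = 50, 0.2, 2000
p = 0.5
exceed = 0
for _ in range(trials):
    s = sum(1 if random.random() < p else 0 for _ in range(n))
    if s - n * p >= n * delta:        # deviation of at least n*delta
        exceed += 1
empirical_prob = exceed / trials
bound = math.exp(-2 * n * delta ** 2)  # about 0.018 for these parameters
```

The empirical deviation frequency is well below the bound, which is what makes the exponential terms in the error probability analysis vanish as n grows.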
Returning to (36), and using the result in (37) with a_i = 0 and b_i = 1, since W[y_k] is a sum of independent Bernoulli random variables each taking the value 0 or 1, we obtain (38), where the set S is defined in (30). Inserting (38) into (32), we obtain the upper bound

Pr(y^n ∉ A | ε ∈ S) ≤ 2|Y| e^{−2nε²}.    (39)
Similarly, we have the result in (40) for the second expression on the right-hand side of (31), where again (a) follows from Boole's inequality. Note that, due to (5), for any integer l such that 0 ≤ l ≤ t − 1, the random variables Z[y_{nl+1} = y_k], Z[y_{nl+2} = y_k], . . . , Z[y_{nl+n} = y_k] in (40) are n independent Bernoulli random variables with probabilities of success p_1(y_k|x_1), p_2(y_k|x_1), . . . , p_n(y_k|x_1), respectively, since y_{nl+1}, y_{nl+2}, . . . , y_{nl+n} are the elements of the (l+1)-th training vector concatenated in ŷ^{nt}_1. In addition, note that

p̄(y_k|x_1) = (1/n) ∑_{j=1}^{n} p_j(y_k|x_1).    (41)

Notice that, for each 0 ≤ l ≤ t − 1, p_1(y_k|x_1) + p_2(y_k|x_1) + . . . + p_n(y_k|x_1) is the sum of the probabilities of success of the random variables Z[y_{nl+1} = y_k], Z[y_{nl+2} = y_k], . . . , Z[y_{nl+n} = y_k]. Thereby, the right-hand side of (41) is the average probability of success of the random variables Z[y_j = y_k] for 1 ≤ j ≤ nt. Now, let W[y_k] be a binomial random variable with parameters (nt, p̄(y_k|x_1)). Once again, according to Theorem 2, the probability distribution of W[y_k] is more dispersed around its mean ntp̄(y_k|x_1) than is the probability distribution of ∑_{1≤j≤nt} Z[y_j = y_k]. Therefore, the probability in the last line of (40) can be upper bounded as in (42), where the set S is defined in (30), which leads to (43). Inserting (39) and (43) into (31), and then inserting (31) into (29), we obtain the upper bound on the error probability in (44). The value of ε given by (12) is the one that yields the tightest upper bound on the error probability P_e in (44). This completes the proof of Theorem 1.
The following corollary provides a simplified upper bound on the error probability when t → ∞. Corollary 1. When the number of training vectors per label goes to infinity, i.e., when t → ∞, which is equivalent to the case when the probability distribution p(y^n|x) is known at the classifier, the error probability of the proposed classifier is upper bounded as in (45), where ε is given by (46). Proof. The proof is straightforward and follows by letting t → ∞ in (11).
As can be seen from (8) and (11), the performance of the proposed classifier depends on r. We cannot derive the value of r that minimizes the error probability, since we do not have an exact expression for the error probability, only an upper bound on it. On the other hand, the value of r that is optimal with respect to the upper bound also cannot be derived in practice, since the upper bound depends on P̄_j, which is unknown in practice because p_{Y^n|X}(y^n|x) is unknown. As a result, for our numerical examples, we consider the Euclidean distance (r = 2), which is one of the most widely used distance metrics in practice.
The following corollary establishes the asymptotic optimality of the proposed classifier with respect to n.

Corollary 2.
The proposed classifier has an error probability that satisfies P_e → 0 as n → ∞ if |Y| ≤ O(n^m), where m is fixed and r > 2m. Here, n^m indicates the dimension of the space, i.e., the maximum number of alphabet symbols that each element of the feature vector y^n can take. Thereby, the proposed classifier is asymptotically optimal.
Proof. For the proof, please see Appendix A.

Simulation Results
In this section, we provide simulation results for the performance of the proposed classifier for r = 2 and compare it to benchmark schemes. The benchmark schemes that we adopt for comparison are the naive Bayes classifier and the KNN algorithm. We cannot adopt a classifier based on a neural network, since neural networks require a very large training set, which we assume is not available. For the naive Bayes classifier, the probability distribution p_{Y^n|X}(y^n|x) is estimated from the training vectors as follows. Let again ŷ^{nt}_i be the vector obtained by concatenating all training feature vectors for the input label x_i, as in (5). Then, the estimate of p(y_j = y|x_i), denoted by p̂(y_j = y|x_i), is found as

p̂(y_j = y|x_i) = I[ŷ^{nt}_i = y]/(nt),    (48)

and the naive Bayes classifier decides according to

x̂ = arg max_{x_i ∈ X} ∏_{j=1}^{n} p̂(y_j|x_i).    (49)

The main problem of the naive Bayes classifier occurs when a symbol y_j ∈ Y is not present in the training feature vectors. In that case, p̂(y_j|x_i) in (48) equals zero for all x_i ∈ X and, as a result, the right-hand side of (49) is zero, since at least one of the factors in the product in (49) is zero. In this case, the naive Bayes classifier fails to provide an accurate classification of the labels. In what follows, we see that this issue of the naive Bayes classifier appears frequently when we have a small number of training feature vectors. The KNN classifier, on the other hand, works as follows. For the observed feature vector y^n, the KNN classifier looks for the k nearest feature vectors to y^n among all training feature vectors ŷ^n_{r,s}, for 1 ≤ r ≤ |X| and 1 ≤ s ≤ t. Then, considering the resulting set of k input–output pairs, the KNN classifier decides the label that is the most frequent among them. The optimal value of k for t = 1 is k = 1.
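The zero-count failure mode of the naive Bayes baseline can be sketched as follows; the alphabet, training vector, and function names are illustrative.

```python
# Sketch of the naive Bayes failure mode described above: a symbol that
# never appears in the training vector receives estimated probability
# zero, which zeroes out the whole product over the observed symbols.
def estimate(train_vec, alphabet):
    # pooled symbol frequencies over the (concatenated) training vector
    return {a: sum(1 for s in train_vec if s == a) / len(train_vec)
            for a in alphabet}

def naive_bayes_score(y, p_hat):
    # product of per-symbol estimated probabilities
    score = 1.0
    for s in y:
        score *= p_hat[s]
    return score

alphabet = [0, 1, 2]
p_hat = estimate([0, 1, 1, 0], alphabet)      # symbol 2 is never seen
score = naive_bayes_score([0, 1, 2], p_hat)   # -> 0.0: the label's score
                                              # is wiped out by one symbol
```

With a single short training vector per label, such unseen symbols are common, which is exactly the regime in which the nearest-empirical-distribution classifier retains an advantage.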
In the following, we provide numerical examples where we illustrate the performance of the proposed classifier when p Y n |X (y n |x) is artificially generated.

The I.I.D. Case with One Training Sample per Label
In the following examples, we assume that the classifiers have access to only one training feature vector for each label, the elements of the feature vectors are generated i.i.d., and the alphabet size of the feature vector, |Y |, is fixed.
In Figures 4 and 5, we compare the error probability of the proposed classifier with those of the naive Bayes classifier and the KNN algorithm for the cases |Y| = 6 and |Y| = 20, respectively. In both examples, we have two different labels, i.e., |X| = 2. As a result, we have two different probability distributions, p_{Y^n|X}(y^n|x_1) and p_{Y^n|X}(y^n|x_2), which are randomly generated as follows. We first generate two random vectors of length 6 and length 20 for Figures 4 and 5, respectively, where the elements of these vectors are drawn independently from a uniform probability distribution. We then normalize these vectors such that the sum of their elements is equal to one. These two normalized, randomly generated vectors then represent the two probability distributions. The simulation is carried out as follows. For each n, we generate one training vector for each label using the aforementioned probability distributions. Then, as test samples, we generate 1000 feature vectors for each label, pass these feature vectors through our proposed classifier, the naive Bayes classifier, and the KNN algorithm, and compute the errors. The length of the feature vector n is varied from n = 1 to n = 100. We repeat the simulation 5000 times and then plot the error probability. Figures 4 and 5 show that the proposed classifier outperforms both the naive Bayes classifier and KNN. The main reason for this performance gain is that, when only one training vector per label is available, the proposed classifier is more resilient to errors than the naive Bayes classifier, whereas the KNN algorithm performs very poorly because of the "curse of dimensionality". Specifically, the naive Bayes classifier cannot perform an accurate classification when n is small compared to |Y|, since the probability that some symbol is not present in one of the training feature vectors is close to 1.
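This experiment can be sketched compactly; the parameters and trial counts below are much smaller than in the paper, only the proposed classifier (with r = 2) is simulated, and all names are illustrative.

```python
import random

# Compact Monte Carlo sketch of the i.i.d. experiment: two labels with
# randomly generated distributions over an alphabet of size 6, one
# training vector per label (t = 1), nearest-empirical-distribution rule.
random.seed(0)

def random_dist(k):
    w = [random.random() for _ in range(k)]
    s = sum(w)
    return [x / s for x in w]

def draw(dist, n):
    return random.choices(range(len(dist)), weights=dist, k=n)

def empirical(vec, k):
    return [sum(1 for s in vec if s == a) / len(vec) for a in range(k)]

def classify(y, trains, k):
    # squared Euclidean distance is monotone in the r = 2 distance
    P_y = empirical(y, k)
    return min(range(len(trains)),
               key=lambda i: sum((a - b) ** 2
                                 for a, b in zip(P_y, empirical(trains[i], k))))

k, n, trials = 6, 100, 200
p = [random_dist(k), random_dist(k)]      # one distribution per label
trains = [draw(p[0], n), draw(p[1], n)]   # t = 1 training vector per label
errors = 0
for _ in range(trials):
    true_label = random.randrange(2)
    if classify(draw(p[true_label], n), trains, k) != true_label:
        errors += 1
error_rate = errors / trials
```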
On the other hand, the KNN algorithm cannot perform an accurate classification for large n, since the dimension of the input feature vector becomes much larger than the amount of training data and the "curse of dimensionality" occurs. In Figure 6, we compare the performance of the proposed classifier for different values of r when |Y| = 6 with the derived upper bounds. As can be seen, for this example, the derived theoretical upper bounds have a slope similar to that of the exact error probabilities. Moreover, we can see that, for this example, the optimal r is r = 1. However, this is not always the case; the optimal r depends on p_{Y^n|X}(y^n|x_k), |Y|, and |X|.

The I.Non-I.D. Case with Overlapping Distributions
In this example, we consider the i.non-i.d. case where the probability distributions p_i(y_i|x_k) are overlapping for all i, as shown in Figure 7. The small orthogonal lines on the x-axis in Figure 7 represent the alphabet symbols, i.e., the elements of Y, and the probability of occurrence of a symbol y_i is equal to the intersection of the corresponding orthogonal line with the represented probability distribution p_i(y_i|x_k), for k = 1, 2. By "overlapping", we mean the following. Let Y_v and Y_u denote the sets of outputs generated by p_v(y_v|x_k) and p_u(y_u|x_k), respectively. If for any v and u, Y_v ∩ Y_u ≠ ∅ holds, we say that the output alphabets are overlapping. To demonstrate the performance of our proposed classifier in the overlapping case, we assume that we have two different labels, X = {x_1, x_2}, where the corresponding conditional probability distributions p_i(y_i|x_1) and p_i(y_i|x_2) are obtained as follows. For a given n, let Y = {−n, −n + 1, . . . , 0, . . . , n − 1, n} be the set of all alphabet symbols. Note that the size of Y grows with n. Moreover, let u_i and v_i, for 1 ≤ i ≤ n, be vectors of length 2n + 1, given by (50) and (51), respectively. The number of zeros on each side of the vectors u_i and v_i is (n − i).
To generate a feature vector from label x_1 (x_2), we generate the vector y^n = (y_1, y_2, . . . , y_n), where y_k takes values from the set Y according to the probability distribution p_i(y_i|x_1) determined by u_i (respectively, p_i(y_i|x_2) determined by v_i). The simulation is carried out as follows. For each n, we generate one training feature vector for each label. Then, we generate 1000 feature vectors for each label, pass them through our proposed classifier, the naive Bayes classifier, and the KNN algorithm, and calculate the error probability. We vary the length of the feature vector from n = 1 to n = 100, repeat the simulation 1000 times, and then plot the error probability.
As shown in Figure 8, there is a huge difference between the performance of the two benchmark classifiers and that of the proposed classifier. The error probability of the naive Bayes classifier is almost 0.5 for all shown values of n, as it is susceptible to the problem of unseen symbols in the training vectors. The error probability of the KNN classifier is also almost 0.5 for n > 20, as it is susceptible to the "curse of dimensionality". However, the error probability of our proposed classifier continuously decays as n increases. In Figure 9, we run the same experiments as in Figure 8 but with t = 100, i.e., 100 training feature vectors per label. As can be seen from Figure 9, the performance of the proposed classifier is better than that of the naive Bayes classifier for n > 15. Since |Y| = 2n + 1, for small values of n the naive Bayes classifier has access to many training samples and, thereby, its performance is very close to the case when the probability distribution p_{Y^n|X}(y^n|x) is known, i.e., to the maximum-likelihood classifier, and hence it is near optimal. As n increases, the number of symbols rises, i.e., |Y| grows, and due to the aforementioned issue of the naive Bayes classifier with unseen symbols, our proposed classifier performs much better classification than the naive Bayes classifier. Furthermore, note that the error probability of our proposed classifier decays exponentially as n increases, which is not the case for the naive Bayes classifier. Moreover, Figure 9 also shows the theoretical upper bound on the error probability derived in (11).

The I.Non-I.D. Case with Non-Overlapping Distributions
In this example, we consider the i.non-i.d. case where the probability distributions p_j(y_j|x_i) are non-overlapping for all j, as shown in Figure 10, where "overlapping" was defined in Section 4.2. Hence, we test the other extreme in terms of the possible distributions of the elements in the feature vector Y^n.
To demonstrate the performance of our proposed classifier in the non-overlapping case, we assume that we have two different labels $\mathcal{X} = \{x_1, x_2\}$; the corresponding conditional probability distributions $p_i(y_i|x_1)$ and $p_i(y_i|x_2)$ are obtained as follows. For a given $n$, let $\mathcal{Y} = \{1, 2, 3, \ldots, (n+1)^2 - 1\}$ be the set of all alphabet symbols of the elements in the feature vectors. Note again that the size of $\mathcal{Y}$ grows with $n$. In addition, let $u_i$ and $v_i$, for $1 \le i \le n$, be vectors of length $(n+1)^2 - 1$, given by
$$
u_i = \Big(0, \ldots, 0, \tfrac{1}{i(i+1)}, \ldots, \tfrac{1}{i(i+1)}, 0, \ldots, 0\Big), \qquad
v_i = \Big(0, \ldots, 0, \cdots, 0, \ldots, 0\Big), \qquad (53)
$$
where the number of zeros on the left-hand side of both $u_i$ and $v_i$ is $i^2 - 1$. To generate a feature vector from label $x_1$ ($x_2$), we generate the vector $y^n = (y_1, y_2, \ldots, y_n)$, where each $y_i$ takes values from the set $\mathcal{Y}$ with probability distribution $p_i(\cdot \,|\, x_1) = u_i$ ($p_i(\cdot \,|\, x_2) = v_i$).
The simulation is carried out as follows. For each $n$, we generate one training feature vector for each label. Then, we generate 250 test feature vectors for each label, pass them through our proposed classifier, the naive Bayes classifier, and the KNN algorithm, and calculate the error probabilities. We vary the length of the feature vector from $n = 1$ to $n = 80$, repeat the simulation 250 times, and plot the average error probability. As shown in Figure 11, there is a large gap between the performance of the proposed classifier and that of the two benchmark classifiers. The error probability of the naive Bayes classifier is almost 0.5 for all shown values of $n$, since it is susceptible to the issue of alphabet symbols unseen in the training feature vector. The error probability of the KNN classifier is almost 0.5 for $n > 30$, since it becomes susceptible to the "curse of dimensionality". In contrast, the error probability of our proposed classifier still decays continuously as $n$ increases.
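The unseen-symbol failure of the naive Bayes classifier observed in both experiments is easy to reproduce in isolation. In the sketch below, the uniform element distribution is an illustrative assumption (the experiments above use the non-uniform distributions defined earlier), and the unsmoothed per-position likelihood is the mechanism at issue, not the full benchmark implementation.

```python
import random

def naive_bayes_score(y, train_vecs):
    # Per-position empirical estimate of p(y_i | x) from the training
    # vectors, without smoothing: a symbol never seen at position i
    # gets probability 0, which drives the whole product to 0.
    score = 1.0
    for i, symbol in enumerate(y):
        count = sum(v[i] == symbol for v in train_vecs)
        score *= count / len(train_vecs)
    return score

rng = random.Random(0)
n = 50
alphabet = list(range(-n, n + 1))      # |Y| = 2n + 1 symbols, as in the example

def sample():
    # Illustrative stand-in: each element uniform over the alphabet.
    return [rng.choice(alphabet) for _ in range(n)]

train = [sample()]                     # T = 1 training vector for this label
test = sample()
score = naive_bayes_score(test, train)
# With one training vector over a 101-symbol alphabet, almost every test
# symbol is unseen at its position, so the estimated likelihood collapses
# to 0 for every label and the classifier cannot distinguish them.
print(score)
```

This is why the naive Bayes error curve stays near 0.5 whenever the alphabet size grows faster than the number of training samples.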
Note that, in our numerical examples, we compared our algorithm with the benchmark schemes on two extreme cases of i.non-i.d. vectors, referred to as "overlapping" and "non-overlapping". Any other i.non-i.d. vector can be represented as a combination of "overlapping" and "non-overlapping" vectors. Since our algorithm outperforms the benchmark schemes for small $T$ in both of these extreme cases, it will also outperform them on any combination of "overlapping" and "non-overlapping" vectors, i.e., for any other i.non-i.d. vector.

Conclusions
In this paper, we proposed a supervised classification algorithm that assigns labels to input feature vectors with independent but non-identically distributed elements, a statistical property found in practice. We proved that the proposed classifier is asymptotically optimal, since its error probability converges to zero as the length of the input feature vectors grows. We showed that this asymptotic optimality is achievable even when only one training feature vector per label is available. In the numerical examples, we compared the proposed classifier with the naive Bayes classifier and the KNN algorithm. Our numerical results show that the proposed classifier outperforms the benchmark classifiers when the number of training data is small and the length of the input feature vectors is sufficiently large.

Data Availability Statement:
Data sharing is not applicable to this article as no new data were created or analyzed in this study.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. Proof of Corollary 2
The proof is almost identical to the proof of Theorem 1; however, here we derive a looser upper bound on the error probability than that in (11), one that is independent of $P_{\hat{y}_i^{nT}}$. Without loss of generality, we assume that $x_1$ is the input to $p_{Y^n|X}(y^n|x)$ and that $y^n$ is observed at the classifier.
Let $B_{k,l}$, for $1 \le k \le |\mathcal{Y}|$ and $1 \le l \le |\mathcal{X}|$, be a set defined as
$$
B_{k,l} = \left\{ \hat{y}^{nt} : \left| \frac{I\big(\hat{y}^{nt} = y_k\big)}{nt} - p(y_k \,|\, x_l) \right| \le \frac{3}{\sqrt{t}} \right\}.
$$