On the Optimal Error Exponent of Type-Based Distributed Hypothesis Testing

Distributed hypothesis testing (DHT) has emerged as a significant research area, but the information-theoretic optimality of coding strategies is often hard to address. This paper studies DHT problems under the type-based setting, which is motivated by popular federated learning methods. Specifically, two communication models are considered: (i) the DHT problem over noiseless channels, where each node observes i.i.d. samples and sends a one-dimensional statistic of the observed samples to the decision center for decision making; and (ii) the DHT problem over AWGN channels, where the distributed nodes are restricted to transmitting functions of the empirical distributions of the observed data sequences due to practical computational constraints. For both problems, we present the optimal error exponent by providing both achievability and converse results. In addition, we offer the corresponding coding strategies and decision rules. Our results not only offer coding guidance for distributed systems, but also have the potential to be applied to more complex problems, enhancing the understanding and application of DHT in various domains.


Introduction
Distributed hypothesis testing (DHT) is a significant problem in the field of information theory [1]. In this problem, each distributed node observes partial data generated from a joint distribution and transmits an encoded message through a communication channel to a decision center, aiming to detect the true hypothesis. The primary goal of DHT is to maximize the decision error exponent in the asymptotic regime, and many different communication models [2-6] have been considered in the previous literature. The main challenges of DHT arise in two respects. Firstly, due to the intricate distributed structures, most existing works have focused on demonstrating achievability results, with converse results being limited to specific cases, such as the 1-bit [3], log_2 3-bit [7], and O(log_2 n)-bit [1] communication channels. Secondly, many of the achievability results were established using random coding with auxiliary random variables [8], which are difficult to implement in real systems.
Notice that the distributed encoders in many real applications are required to process high-dimensional data [9], such as images, texts, and audio. Consequently, many federated learning algorithms focus on computing quantities such as statistics, empirical risks, and gradients of the data [10], which can be viewed as certain functions of the empirical distribution (type) of the data (for example, given the data x_1, ..., x_n and a feature function f(x), the statistic (1/n) \sum_{i=1}^n f(x_i) = \sum_x \hat{P}_X(x) f(x) is a linear function of the empirical distribution \hat{P}_X).
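This observation can be checked numerically. The following sketch (with arbitrary toy data and an arbitrary feature function, not taken from the paper) computes the type of a sample sequence and verifies that the empirical mean of a feature is a linear function of the type:

```python
import numpy as np

def empirical_distribution(samples, alphabet):
    """Type (empirical distribution) of a sample sequence."""
    counts = np.array([np.sum(samples == a) for a in alphabet], dtype=float)
    return counts / len(samples)

# Arbitrary toy data on a three-letter alphabet.
alphabet = np.array([0, 1, 2])
rng = np.random.default_rng(0)
samples = rng.choice(alphabet, size=1000, p=[0.5, 0.3, 0.2])

p_hat = empirical_distribution(samples, alphabet)

# An arbitrary feature f(x): the statistic (1/n) * sum_i f(x_i) equals
# sum_x p_hat(x) * f(x), i.e., a linear function of the type.
f = np.array([1.0, -2.0, 0.5])
stat_from_samples = float(np.mean(f[samples]))
stat_from_type = float(p_hat @ f)
```

In particular, a node only needs the type `p_hat`, not the raw samples, to compute any such statistic.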
Motivated by this observation, we investigate the optimal decision error exponent of DHT based on the empirical distributions (type-based) under two common communication models. The first problem considers a noiseless channel, which is the typical mathematical model in real federated learning scenarios. It comes from the reality that federated learning often assumes that the nodes and the center machine can exchange information precisely; however, the dimensionalities of the transmitted signals are limited [9]. Specifically, it is assumed that each node can only transmit the empirical mean of a one-dimensional feature, and such settings have gained significant attention recently in federated and multi-modal machine learning [9,11]. The second problem assumes that the signal of each node, encoded from the empirical distribution, is transmitted over an additive white Gaussian noise (AWGN) channel, which is a widely-used mathematical model for real-world channels [12]. The main goal of this paper is to establish the optimal error exponent for the aforementioned two problems by presenting: (i) the converse bound for the error exponent; and (ii) a practical coding strategy that achieves the converse bound.
The contributions of this paper are summarized as follows. First, in Section 4.1, we demonstrate the optimal error exponent for type-based hypothesis testing over noiseless channels, where one-dimensional functions for all nodes and the corresponding decision rule are provided. Moreover, by applying the information geometric approach in [13], the hypotheses and the feature functions of each node can be modeled as vectors in the joint and marginal distribution spaces, respectively. In Section 4.3, the optimal feature function of each node is interpreted as a decomposition of the hypothesis vector in the joint distribution space into vectors in the marginal distribution spaces, where each decomposed component indicates the contribution of the corresponding node in making the inference.
Second, we establish the optimal achievable error exponent of type-based hypothesis testing over AWGN channels by presenting both the achievability and converse results. In particular, the achievability part is based on a mixture of the amplify-and-forward and decode-and-forward coding strategies. Specifically, when the observed empirical distribution at a distributed node is sufficiently close to one of the true marginal distributions with respect to the two hypotheses, the node is confident of the true hypothesis. Then, we apply the decode-and-forward strategy, which first estimates the true hypothesis based on the observed empirical distribution and then applies binary phase shift keying (BPSK) to transmit the decoded bit to the decision center. On the other hand, when the observed empirical distribution is far from both true marginal distributions, we apply the amplify-and-forward strategy, encoding and transmitting the observed empirical distribution by pulse amplitude modulation (PAM) to the decision center. By applying the proposed coding strategy and conducting the log-likelihood ratio test at the decision center, we show the achievable error exponent in Section 5.2. Finally, we demonstrate the converse result for the error exponent in Section 5.3 based on a genie-aided approach. The main idea is to add additional information to the distributed nodes. By either revealing the true hypothesis to the distributed nodes or eliminating the channel noises, we show that the error exponent in Section 5.2 is also an upper bound of the optimal error exponent, which establishes the optimality.

Problem Formulations
Suppose that there are K random variables X^K ≜ (X_1, ..., X_K). In this paper, we consider the binary hypothesis testing problem, where the two hypotheses H_0 and H_1 are defined as

H_i: the observable data are i.i.d. generated according to P^{(i)}_{X^K}, for i = 0, 1,   (1)

with the data taking values in the alphabet set X_1 × ... × X_K. In addition, we assume that there are K distributed nodes, where the k-th (k = 1, ..., K) node can only observe the samples X_k ≜ {x_{k,1}, ..., x_{k,n}}.
To facilitate clarity in our illustration, we concentrate on the discrete case, assuming that each alphabet X_k is discrete, and X ≜ X_1 × ... × X_K. In addition, for a joint distribution Q_{X^K} ∈ P^X, we use [Q_{X^K}]_{X_k} to denote its marginal distribution with respect to X_k. We also denote P^{(i)}_{X_k} ≜ [P^{(i)}_{X^K}]_{X_k} as the marginal distributions of P^{(i)}_{X^K}, for i = 0, 1. In the distributed hypothesis testing problem, we introduce a common assumption in the distributed setup [14] that the generating distributions P^{(0)}_{X^K} and P^{(1)}_{X^K} satisfy D(P^{(1)}_{X^K} || P^{(0)}_{X^K}) < ∞ and D(P^{(0)}_{X^K} || P^{(1)}_{X^K}) < ∞, to avoid trivial irregularities. Due to the type-based restriction, we further assume that P^{(0)}_{X_k} ≠ P^{(1)}_{X_k} for all k; otherwise, the transmitted message, as a function of the empirical distribution, would be uninformative for distinguishing the hypotheses. In the following, we denote \hat{P}_{X_k} as the empirical distribution of X_k, defined as

\hat{P}_{X_k}(x) ≜ (1/n) \sum_{i=1}^n 1{x_{k,i} = x}, for x ∈ X_k.   (2)

Type-Based Hypothesis Testing over Noiseless Channels
As shown in Figure 1, node k (k = 1, ..., K) can encode the observed data X_k and transmit a scalar signal via the function u_k. Due to the computational requirement introduced in Section 1, we impose the restriction that the encoder u_k depends on the data only through the empirical distribution \hat{P}_{X_k}, i.e., u_k : P^{X_k} → R, where P^{X_k} denotes the set of probability distributions defined on the alphabet X_k. The most direct method is to transmit the empirical distributions themselves by encoding them into the real space, which can lead to computational difficulty for federated learning data. In this paper, we further consider one of the most commonly used approaches in federated learning [15,16] and assume that u_k computes a one-dimensional statistic

u_k(\hat{P}_{X_k}) = \sum_{x ∈ X_k} \hat{P}_{X_k}(x) f_k(x),

where the feature function f_k : X_k → R. Then, the decision center collects the statistics {u_k(\hat{P}_{X_k})}_{k=1}^K and makes a decision Ĥ on the true hypothesis. We prove in Section 4 that the further restriction of computing empirical means of features is without loss of generality, in the sense that we can make decisions as good as if we observed the types. Additionally, the error probability is defined as

P_e^{(n)} ≜ P_H(H_0) P^n(Ĥ ≠ H | H = H_0) + P_H(H_1) P^n(Ĥ ≠ H | H = H_1),   (3)

where H denotes the true hypothesis, P_H(H_0) and P_H(H_1) are the prior distributions, and P^n(·) is the probability measure defined by the data sampling process (1). In particular, we focus on the asymptotic error decaying rate, i.e., the error exponent, defined as

E ≜ lim_{n→∞} −(1/n) log P_e^{(n)},   (4)

where all logarithms are base e unless otherwise specified. The goal is to find the maximal error exponent of (4) and to design the feature functions f_1, ..., f_K and the detailed decision rule such that this error exponent is achieved based on the log-likelihood ratio test (LLRT).
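To make the transmission pipeline concrete, the following toy simulation (a hypothetical instance with binary alphabets and simple indicator features, not the optimal features derived in Section 4) mimics the scheme: each node sends the empirical mean of a one-dimensional feature, and the center sums the statistics and compares against a threshold:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy instance (not from the paper): K = 2 nodes with binary alphabets and
# independent components; the f_k are simple indicator features, not the optimal f*_k.
p0 = [np.array([0.7, 0.3]), np.array([0.6, 0.4])]   # marginals under H0
p1 = [np.array([0.4, 0.6]), np.array([0.3, 0.7])]   # marginals under H1
f  = [np.array([0.0, 1.0]), np.array([0.0, 1.0])]   # per-node feature functions

def transmit(n, marginals):
    """Each node sends the empirical mean of its one-dimensional feature."""
    stats = []
    for k in range(2):
        samples = rng.choice(2, size=n, p=marginals[k])
        stats.append(f[k][samples].mean())           # = sum_x type(x) * f_k(x)
    return stats

# The center sums the statistics and thresholds halfway between the two means.
mean0 = sum(p0[k] @ f[k] for k in range(2))          # expected sum under H0
mean1 = sum(p1[k] @ f[k] for k in range(2))          # expected sum under H1
tau = 0.5 * (mean0 + mean1)

u = transmit(2000, p0)                               # data generated under H0
H_hat = 0 if sum(u) < tau else 1
```

The midpoint threshold here is an illustrative choice; the optimal threshold and features follow from the analysis in Section 4.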


Type-Based Hypothesis Testing over AWGN Channels
As depicted in Figure 2, we employ the identical hypothesis testing formulation presented in (1). In this context, it is assumed that nodes 1 through K encode and transmit length-m sequences using functions g_1, ..., g_K, which operate based on their respective observations, through additive white Gaussian noise (AWGN) channels to the decision center. To accommodate the computational constraints, we restrict the encoder g_k to depend on the data only through the empirical distribution, i.e.,

g_k : P^{X_k} → R^m.   (5)

Moreover, the averaged power constraints of the AWGN channels are

(1/m) E[ ||g_k(\hat{P}_{X_k})||^2 ] ≤ p_k, for k = 1, ..., K,   (6)

where the expectations are taken over the data sampling process defined in (1). Then, the decision center makes a decision Ĥ based on the received signals Y_k = g_k(\hat{P}_{X_k}) + Z_k, where the noises Z_k are drawn from a Gaussian distribution with covariance proportional to I_m, and I_m denotes the m × m identity matrix. Additionally, we make the following assumption to make the errors arising from the AWGN channels and the decision process comparable, so that the trade-off between them can be described. In detail, we assume that the sequence length m increases with n, and that there exists a positive constant µ such that

lim_{n→∞} m/n = µ.   (8)

Our goal is to design the optimal encoders g_1, ..., g_K, subject to the constraints (5) and (6), as well as the decision rule Ĥ, where we have assumed P_H(H_0) = P_H(H_1) = 1/2 for explicit mathematical expression, such that the error exponent defined in (4) is maximized.
Figure 2. The transmission procedures for the type-based distributed hypothesis testing problem over AWGN channels.

Related Works
Distributed hypothesis testing problems, also known as multiterminal hypothesis testing [1,3,14] or decentralized detection [17,18], have been extensively explored in the literature. In scenarios where each node can observe a single observation and send an encoded message to the central machine, the authors of [17] demonstrated that determining the optimal coding scheme is NP-hard, while [18,19] provided characterizations of the minimum decoding error rate and the optimal coding scheme for conditionally independent nodes. Furthermore, in situations where each node can observe n samples and transmit an encoded message to the decision center, [3,5,14,20] investigated the optimal decoding error exponents for the case of K = 2 nodes, with [21] generalizing the results to K > 2 nodes. Additionally, the author of [5] studied the Neyman-Pearson-like test, which further constrained the encoded messages to be empirical functional means, and provided optimal functions for the scenario with K = 2 nodes. The result presented in Section 4 can be perceived as a generalization of such setups to the case with K > 2 nodes.
On the other hand, DHT over noisy channels represents a novel and highly significant sub-problem within the broader context. While current research has primarily focused on transmission over discrete memoryless channels, certain aspects of this sub-problem have been investigated. For instance, some studies have explored scenarios involving side information [22] and cases of testing against independence [23]. Additionally, optimal Type-II error considerations have been examined [24], along with investigations into the optimal pairs of Type-I and Type-II errors [25].
Diverging from the existing literature, the present paper delves into the DHT problem in the context of widely considered AWGN channels while also addressing the implications of computational demands.This novel approach fills a critical research gap and extends the understanding of DHT to a broader set of channel conditions, thus contributing to the advancement of the field.

Type-Based Hypothesis Testing over Noiseless Channels
In this section, we present the optimal error exponent along with the corresponding decision rule for type-based hypothesis testing over noiseless channels. We commence by introducing the optimal error exponent under the condition that the decision center has access to the empirical distributions from different nodes.

Definition 1. Given marginal distributions R_{X_1}, ..., R_{X_K}, the quantities D*_i are defined as

D*_i(R_{X_1}, ..., R_{X_K}) ≜ min_{Q_{X^K} ∈ Q(R_{X_1}, ..., R_{X_K})} D(Q_{X^K} || P^{(i)}_{X^K}), for i = 0, 1,   (9)

where Q(R_{X_1}, ..., R_{X_K}) ≜ {Q_{X^K} : [Q_{X^K}]_{X_k} = R_{X_k}, k = 1, ..., K}, which represents the set of all distributions with the given marginals.

The following result provides the operational meaning of (9), which can be proved by Sanov's theorem [12].

Lemma 1. When H_i is the true hypothesis, the probability that nodes 1, ..., K observe the empirical distributions \hat{P}_{X_1}, ..., \hat{P}_{X_K}, respectively, is given by

P^n(\hat{P}_{X_1}, ..., \hat{P}_{X_K} | H = H_i) ≐ exp(−n D*_i(\hat{P}_{X_1}, ..., \hat{P}_{X_K})),

where "≐" is the conventional dot-equal notation, i.e., we write f_n ≐ g_n when lim_{n→∞} (1/n) log(f_n/g_n) = 0. In addition, by applying the log-likelihood ratio test to detect the true hypothesis, the optimal decision error exponent based on the empirical distributions is

E* ≜ min_{R_{X_1}, ..., R_{X_K}} max{ D*_0(R_{X_1}, ..., R_{X_K}), D*_1(R_{X_1}, ..., R_{X_K}) }.   (10)

Note that the type-based hypothesis testing problem assumes that the signal from each node is a function of the empirical distribution. Hence, the optimal error exponent in (4) cannot exceed E*. In the following, we prove that the error exponent E* can be achieved and provide the corresponding decision rule.
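For intuition, in the single-node case (K = 1) with full access to the type, the optimal Bayesian exponent under the LLRT reduces to the Chernoff information between the two hypotheses. The following sketch evaluates it for a toy pair of binary distributions via a grid search over the tilting parameter:

```python
import numpy as np

def chernoff_information(p0, p1, num=999):
    """max over lam in (0,1) of -log sum_x p0(x)^(1-lam) * p1(x)^lam (grid search)."""
    grid = np.linspace(1e-3, 1 - 1e-3, num)
    vals = [-np.log(np.sum(p0 ** (1 - lam) * p1 ** lam)) for lam in grid]
    return max(vals)

# Toy symmetric pair of binary distributions; by symmetry the optimum is at lam = 1/2.
p0 = np.array([0.7, 0.3])
p1 = np.array([0.3, 0.7])
E_star = chernoff_information(p0, p1)   # approximately 0.0872 nats
```

For general K, the distributed exponent in (10) plays the analogous role, with the divergences computed through the marginal-constrained minimization (9).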

Optimal Feature
First, we introduce the following definitions of exponential and linear families, which will be useful for delineating our results.
Definition 2 (Exponential family). Given a distribution P_Z(z) and a function T : Z → R, we define the distribution P^{(λ)}_Z(· ; T, P_Z) as

P^{(λ)}_Z(z; T, P_Z) ≜ P_Z(z) exp( λ T(z) − α(λ) ),   (11)

with α(λ) ≜ log \sum_{z′ ∈ Z} P_Z(z′) exp(λ T(z′)). In addition, we use E_Z(T, P_Z) ≜ { P^{(λ)}_Z(· ; T, P_Z) : λ ∈ R } to denote the exponential family passing through P_Z with T being the natural statistic.
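As a quick numerical illustration of Definition 2 (with an arbitrary toy distribution and natural statistic), the tilted member of the exponential family is obtained by exponential reweighting followed by normalization, which absorbs the factor exp(−α(λ)):

```python
import numpy as np

def tilted(p, T, lam):
    """Member P^(lam) of the exponential family through p with natural statistic T."""
    w = p * np.exp(lam * T)
    return w / w.sum()          # dividing by w.sum() implements exp(-alpha(lam))

p = np.array([0.5, 0.3, 0.2])   # toy base distribution
T = np.array([1.0, 0.0, -1.0])  # toy natural statistic
q = tilted(p, T, 1.5)           # a member of the family with lam = 1.5
```

Setting λ = 0 recovers the base distribution p, and sweeping λ over R traces out the one-parameter family E_Z(T, P_Z).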
Definition 3 (Linear family). Given a function h : Z → R, we define the linear family L_Z(h) as

L_Z(h) ≜ { P_Z ∈ P^Z : \sum_{z ∈ Z} P_Z(z) h(z) = 0 }.   (13)

In addition, we define the half-spaces S^{(0)}_Z(h) ≜ { P_Z : \sum_z P_Z(z) h(z) ≤ 0 } and S^{(1)}_Z(h) ≜ { P_Z : \sum_z P_Z(z) h(z) ≥ 0 }.

Then, for i = 0, 1 and t > 0, we define the sets

D_i(t) ≜ { (R_{X_1}, ..., R_{X_K}) : D*_i(R_{X_1}, ..., R_{X_K}) < t }.

We also define D(t) ≜ D_0(t) ∩ D_1(t). It can be verified that, for all t ≥ 0, both D_0(t) and D_1(t) are convex subsets of P^{X_1} × ... × P^{X_K}, and thus D(t) is also convex. In addition, we have the following lemma.
Lemma 2. For E* as defined in (10), the sets D_0(E*) and D_1(E*) are disjoint, i.e., D(E*) = ∅.

Proof. See Appendix A.
Based on Lemma 2, it follows from the separating hyperplane theorem (see, e.g., Section 2.5.1 of [26]) that there exist functions f*_1, ..., f*_K such that the induced hyperplane separates D_0(E*) and D_1(E*); i.e., (15) and (16) hold for all (R_{X_1}, ..., R_{X_K}) in D_0(E*) and D_1(E*), respectively. Furthermore, we denote

h*(x^K) ≜ \sum_{k=1}^K f*_k(x_k),   (17)

and then we have the following proposition. Given P_Z ∈ P^Z and S ⊂ P^Z, we adopt the notation D(S || P_Z) ≜ inf_{Q_Z ∈ S} D(Q_Z || P_Z) [27,28], where P^Z denotes the set of all distributions supported on Z.

Proposition 1. The optimal exponent E* as defined in (10) satisfies

E* = D( S^{(1)}_X(h*) || P^{(0)}_{X^K} ) = D( S^{(0)}_X(h*) || P^{(1)}_{X^K} ).   (18)

Proof. See Appendix B.
Consequently, we establish the optimality of E * and provide the corresponding decision rule.
Theorem 1. Let f*_1, ..., f*_K denote the features as defined in (15) and (16). The optimal error exponent of (4) is given by E*, where E* is defined in (10). In addition, the corresponding decision rule Ĥ is

Ĥ = H_0 if \sum_{k=1}^K u_k(\hat{P}_{X_k}) ≤ 0, and Ĥ = H_1 otherwise,   (20)

where u_k(\hat{P}_{X_k}) = \sum_x \hat{P}_{X_k}(x) f*_k(x).

Proof. See Appendix C.

General Geometric Structure
The geometry associated with Proposition 1 and Theorem 1 is depicted in Figure 3. In this figure, each point represents a distribution in P^X, and the decision boundary (20) corresponds to the linear family L_X(h*) defined in (13). In addition, from Corollary 3.1 of [27], there exist λ_0, λ_1 ∈ R such that

Q^{(i)}_{X^K} = P^{(λ_i)}_{X^K}(· ; h*, P^{(i)}_{X^K}), for i = 0, 1,   (21)

where P^{(λ_i)}_{X^K}(· ; h*, P^{(i)}_{X^K}), i = 0, 1, are as defined in (11). In this context, Q^{(0)}_{X^K} and Q^{(1)}_{X^K} in (21) are the I-projections [27] of P^{(0)}_{X^K} and P^{(1)}_{X^K} onto this linear family, respectively, which also induces the two exponential families E_X(h*, P^{(0)}_{X^K}) and E_X(h*, P^{(1)}_{X^K}) with h* as their common natural statistic. Additionally, the points in D_0(E*) and D_1(E*) are separated by the linear family L_X(h*).

Figure 3. Q^{(i)}_{X^K} are the I-projections of P^{(i)}_{X^K} onto the linear family L_X(h*), i = 0, 1, and L_X(h*) divides D_0(E*) and D_1(E*) into different half-spaces.

Local Information Geometric Analysis
Although an explicit information geometry has been shown, we further apply the local information geometric framework [13] to provide fundamental insights into this problem. Some useful notations and definitions in local information geometry are introduced as follows.
Definition 4 (ε-neighborhood). Given a finite alphabet Z, and letting R_Z be a distribution supported on Z with all entries positive, its ε-neighborhood N_ε(R_Z) is defined as

N_ε(R_Z) ≜ { P_Z ∈ P^Z : \sum_{z ∈ Z} (P_Z(z) − R_Z(z))^2 / R_Z(z) ≤ ε^2 }.

Then, with R_Z used as the reference distribution, each distribution P_Z ∈ P^Z can be equivalently expressed as a vector φ ∈ R^{|Z|} or a function f : Z → R with

φ(z) ≜ (P_Z(z) − R_Z(z)) / (ε √R_Z(z)) and f(z) ≜ φ(z) / √R_Z(z),   (23)

referred to as the information vector and feature function associated with P_Z, respectively. This provides a three-way correspondence P_Z ↔ φ ↔ f, which will be useful in our derivations.
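The correspondence in Definition 4 can be illustrated numerically. The sketch below assumes the standard local-geometry scaling φ(z) = (P_Z(z) − R_Z(z))/(ε √R_Z(z)) and f(z) = φ(z)/√R_Z(z) (the usual convention in this framework; constants may differ from the paper's exact definitions), and checks the round trip P_Z ↔ φ ↔ f:

```python
import numpy as np

eps = 0.05
R = np.array([0.4, 0.35, 0.25])      # reference distribution

# Hypothetical perturbation direction; R @ f = 0 keeps P normalized.
f = np.array([0.5, -0.2, -0.52])
P = R * (1 + eps * f)                # a distribution in the eps-neighborhood of R

phi = (P - R) / (eps * np.sqrt(R))   # information vector associated with P
f_back = phi / np.sqrt(R)            # recover the feature function
```

Under this scaling, Euclidean geometry on the information vectors locally approximates the KL-divergence geometry on distributions.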
Based on Definition 4, we introduce the local assumption that

P^{(i)}_{X^K} ∈ N_ε(P_{X^K}), for i = 0, 1,   (24)

for some reference distribution P_{X^K}. We use ψ^{(i)} ↔ P^{(i)}_{X^K}, i = 0, 1, to represent the corresponding information vectors [cf. (23)]. For each k = 1, ..., K, and given a feature f_k : X_k → R, we define the corresponding information vector φ_k ∈ R^{|X_k|}, where P_{X_k} ≜ [P_{X^K}]_{X_k} is used as the reference distribution. Note that a matrix B_k ∈ R^{|X| × |X_k|} exists, with entries expressed via the Kronecker delta δ_{x_k x̃_k}, such that the feature f_k defined on X_k, when considered as a mapping from X to R, corresponds to the information vector B_k φ_k in R^{|X|}. Leveraging this correspondence, we can further establish the information vector for h(x^K) = \sum_{k=1}^K f_k(x_k) as B_0 φ_0, where we have defined

B_0 ≜ [B_1, ..., B_K] and φ_0 ≜ [φ_1^T, ..., φ_K^T]^T,   (27)

and where, for each k = 1, ..., K, φ_k ∈ R^{|X_k|} denotes the information vector corresponding to f_k. Additionally, given a matrix A ∈ R^{m_1 × m_2}, we use A† to denote its Moore-Penrose inverse [30], and we define the associated column space R(A) ≜ {Ax : x ∈ R^{m_2}} and projection matrix Π_A ≜ A A†. Then, we can establish the local counterpart of E* in Theorem 1 as follows.
Theorem 2. Under the local assumption (24), let ψ^{(i)} ↔ P^{(i)}_{X^K}, i = 0, 1, denote the corresponding information vectors. Then, for h* as defined in (17), we have the correspondence h* ↔ B_0 φ*_0, where

φ*_0 ≜ B_0† (ψ^{(1)} − ψ^{(0)}),   (26)

and where B_0 is defined in (27). In addition, the optimal exponent E* in (10) can be expressed as

E* = (ε^2/8) || Π_{B_0} (ψ^{(1)} − ψ^{(0)}) ||^2 + o(ε^2).   (29)

Proof. See Appendix D.
Note that from Theorem 2, we have B_0 φ*_0 = Π_{B_0} (ψ^{(1)} − ψ^{(0)}), where Π_{B_0} is the projection matrix associated with the subspace R(B_0). The optimal feature B_0 φ*_0 in (26) thus corresponds to the projection of the sufficient statistic f_LLR ↔ (ψ^{(1)} − ψ^{(0)}) onto the function space that encompasses all possible h's of the form h(x^K) = \sum_{k=1}^K f_k(x_k). In other words, B_0 φ*_0 represents the best approximation of f_LLR within the function space of interest, which leads to the optimal decision error exponent E* as shown in (29).
Moreover, from (26), this optimal feature can be decomposed into K components in the subspaces R(B_k), for k = 1, ..., K, where the B_k are defined in (27). This decomposition structure is depicted in Figure 4.

Remark 1. The vectors B_k φ*_k are not simply the orthogonal projections of B_0 φ*_0 onto the subspaces R(B_k), since these subspaces, for k = 1, ..., K, are not mutually orthogonal. Therefore, the decomposition of B_0 φ*_0 depends on the Gram matrix [30] of the subspaces R(B_k), as illustrated in Figure 4. Furthermore, it is noteworthy that the orthogonal projection of B_0 φ*_0 onto the subspace R(B_k) can be interpreted as characterizing the optimal error exponent of the binary hypothesis testing problem solely with the observations of X_k [12]. When the subspaces R(B_k) are orthogonal to each other, the optimal inference approach is straightforward, involving the extraction of the optimal information from each node by orthogonal projection. However, when the subspaces R(B_k) are not orthogonal, different nodes may share various forms of common information. Our result fundamentally demonstrates how to handle this shared information and extract the optimal features through the decomposition of the information vector over non-orthogonal subspaces. This insight provides a novel approach to address the challenges posed by non-orthogonal subspaces and reveals how to extract the most informative features effectively, ultimately leading to improved performance in the distributed hypothesis testing problem.
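The projection and decomposition described above can be sketched numerically with random (hypothetical) subspaces standing in for the R(B_k): Π_{B_0} = B_0 B_0† projects the information-vector difference onto R(B_0), and the per-node components generally differ from naive orthogonal projections onto the individual R(B_k), since those subspaces are not mutually orthogonal:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical per-node feature subspaces B_1, B_2 inside a 6-dimensional joint space.
B1 = rng.standard_normal((6, 2))
B2 = rng.standard_normal((6, 2))
B0 = np.hstack([B1, B2])               # B_0 = [B_1, B_2]; joint feature space R(B_0)

psi = rng.standard_normal(6)           # stand-in for psi^(1) - psi^(0)

Pi_B0 = B0 @ np.linalg.pinv(B0)        # projection matrix onto R(B_0)
phi_star = np.linalg.pinv(B0) @ psi    # least-squares coefficients (Moore-Penrose)
approx = B0 @ phi_star                 # best approximation of the LLR feature in R(B_0)

# Per-node component of the decomposition vs. a naive orthogonal projection:
comp1 = B1 @ phi_star[:2]
proj1 = B1 @ np.linalg.pinv(B1) @ psi
```

The Moore-Penrose solution accounts for the overlap (Gram matrix) between the subspaces, which is exactly why `comp1` and `proj1` differ for non-orthogonal R(B_k).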

Type-Based Hypothesis Testing over AWGN Channels
This section presents the optimal error exponent of the type-based hypothesis testing problem over AWGN channels, along with the corresponding coding strategy. To begin, we introduce several notations that will help in the presentation of the results, where D*_i(·) is as defined in (9). Moreover, we define the following error exponent with respect to ω ⊆ [K].
where we have used A \ B to represent the relative complement of set B in set A, and where µ is as defined in (8). We can also verify that E_{[K]} = E*, where E* is as defined in (10). Finally, we define the quantity

E ≜ min_{ω ∈ 2^{[K]}} E_ω,   (33)

which will be shown to be the optimal error exponent, where 2^{[K]} denotes the power set of [K].
Theorem 3. The optimal error exponent of (4) is given by E, where E is as defined in (33).

In the following, we prove Theorem 3 by establishing both the achievability and converse results.

The Coding Strategy for Distributed Nodes
First, for each k = 1, ..., K and for some γ ∈ (0, 1), we define the following regimes of empirical distributions. The specific choice of γ does not affect the achievable error exponent as long as γ ∈ (0, 1); it serves to separate the decode-and-forward and amplify-and-forward coding strategies introduced in Section 1.

Decode-and-forward regime:

Amplify-and-forward regime:
Consequently, in the amplify-and-forward regime, we can transmit such empirical distributions with exponentially large power by pulse amplitude modulation (PAM) while still satisfying the power constraint. Specifically, let P^{(n)}_{X_k} be the set of all possible empirical distributions of X_k with n samples, and index these empirical distributions by {1, ..., η_k}. Then, according to the observed empirical distribution, the encoder of node k maps the corresponding index to a PAM constellation point. Furthermore, if the empirical distribution is in the decode-and-forward regime, we initially detect the true hypothesis and then transmit the bit using binary phase shift keying (BPSK) with the appropriate power. By employing these strategies, the achievability result can be obtained through repeated transmissions from all the distributed nodes. In other words, the resulting encoder g*_k for node k is defined as in (38).

Proposition 2. The encoders as defined in (38) satisfy the power constraint (6).

Proof. See Appendix E.
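The mixed strategy can be sketched as follows. This is a simplified, hypothetical rendering (the KL-based closeness test, the threshold rule, and the constellation mapping are illustrative placeholders, not the paper's exact construction in (38)):

```python
import numpy as np

def kl(p, q):
    """KL divergence D(p || q); assumes q has full support."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def encode_node(p_hat, p0_marg, p1_marg, threshold, pam_step):
    """Sketch of the mixed strategy: decode-and-forward when the observed type
    is close to one of the marginals, amplify-and-forward via PAM otherwise."""
    d0, d1 = kl(p_hat, p0_marg), kl(p_hat, p1_marg)
    if min(d0, d1) < threshold:
        # Decode-and-forward: detect the hypothesis locally, send the bit by BPSK.
        bit = 0 if d0 <= d1 else 1
        return ("BPSK", 1.0 if bit == 0 else -1.0)
    # Amplify-and-forward: map the type's index to a PAM constellation point
    # (a toy index here; the paper enumerates all n-types of the alphabet).
    idx = int(np.argmax(p_hat))
    return ("PAM", pam_step * idx)
```

The point of the split is that the amplify-and-forward branch is entered only with exponentially small probability, so allocating it exponentially large power still satisfies the averaged power constraint.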

Decision Rule and Achievable Error Exponent
After the decision center receives the output signals, it averages the repeated transmissions of each node, where [·]_i denotes the i-th entry of a given vector. Then, we conduct the log-likelihood ratio test (LLRT) (41) to detect the true hypothesis. Note that exponentially large power is allocated to the empirical distributions in the amplify-and-forward regime [cf. (35), (36)]; hence the decision center can correctly detect the coding regime of the nodes with super-exponentially high probability. Therefore, we can assume that the decision center knows the coding regime of the nodes and define the following regime of the received signals with respect to the subset ω ⊆ [K].
for all k ∈ ω, where ⌊·⌋ denotes the floor function [31]. The following result shows that the decoding error of (43) can be neglected.

Proposition 3. The decoding error of the empirical distributions recovered by (43) does not affect the achievable error exponent.

Proof. See Appendix F.
In the following, we denote p̃_k ≜ p_k − δ for k = 1, ..., K, and discuss the decision error exponent when the received signals are in Θ_ω. For k ∈ ω, the empirical distribution \hat{P}_{X_k} can be recovered by (43), and for k ∈ [K] \ ω, node k detects the hypothesis according to the observed empirical distribution and transmits the detected bit by BPSK [cf. (38)] through the AWGN channel. Then, the decision center detects the true hypothesis from the received signals by the LLRT (41), which can be reduced to (45), where 2^{[K] \ ω} denotes the power set of [K] \ ω. Consequently, the decision error exponent is characterized by the following proposition.
Proposition 4. For any ε > 0 and ω ∈ 2^{[K]}, the decision error exponent achieved by the decision rule (45) satisfies the following bound, where E is as defined in (33).
Proof.See Appendix G.
Noting that the overall decision error probability aggregates the error contributions over all regimes Θ_ω, the following proposition establishes the achievable error exponent of the coding strategy (38).
Proposition 5. By using the encoders g*_1, ..., g*_K as defined in (38) and the decision rule Ĥ from (41), the achievable error exponent is given by E, where E is as defined in (33).

The Converse Result
In this section, we show that E is indeed an upper bound on the exponent in (4), which establishes Theorem 3. Our main technique is a genie-aided approach, which provides different kinds of additional information to the nodes and computes the corresponding error exponents under this additional information. As depicted in Figure 5, given an index set ω ∈ 2^{[K]}, suppose that for all k ∈ ω, node k can know and cancel the channel noise in advance; then, the channel is noiseless, and the decision center can perfectly receive the empirical distribution \hat{P}_{X_k}. On the other hand, suppose that for all k ∈ [K] \ ω, we can reveal the true hypothesis H to node k; then, with such additional information, we can establish the following upper bound on (4) [cf. (33)].

Proposition 6. Given an index set ω ∈ 2^{[K]}, suppose that for all k ∈ ω, the decision center can obtain \hat{P}_{X_k} perfectly. Additionally, for all k ∈ [K] \ ω, node k can obtain the true hypothesis H. The resulting optimal decision error exponent is E_ω, where E_ω is as defined in (32).

Proof. See Appendix H.
Figure 5. A geometric explanation of the genie-aided approach, which leads to E_ω as an upper bound of the error exponent in (4).
Notice that Proposition 6 holds for all ω ∈ 2^{[K]}, and without the additional information, we cannot obtain better performance than that of Proposition 6 for DHT over AWGN channels. We then conclude the following upper bound on the error exponent.

Proposition 7. For all possible encoders g_1, ..., g_K under the power constraint (6), the corresponding error exponent with respect to the LLRT decision rule is at most E, where E is as defined in (33).
Remark 2 (Local-geometric interpretation). Note that the expression of the optimal error exponent E as defined in (33) is quite intricate, which could limit our understanding. To simplify the analysis, we introduce the local geometry assumption given in (24). In Appendix I, we demonstrate that the error exponent corresponds to a more manageable expression (51), where for ω = {i_1, ..., i_{|ω|}}, the relevant quantities are defined accordingly. The first term in (51) represents the optimal error exponent [cf. (29)] when the decision center can access the empirical distributions \hat{P}_{X_k}, k ∈ ω. The second term corresponds to the optimal error exponent when node k, k ∈ [K] \ ω, knows the true hypothesis H and transmits the bit using BPSK modulation. The total error exponent is the sum of these two parts, and E takes the minimum sum among all possible splits of the index set [K]. In other words, E characterizes the optimal trade-off between accessing empirical distributions at the decision center and having individual nodes transmit bits with BPSK modulation.
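The structure of E in (33), a minimum over all splits ω of the node set, can be sketched with placeholder per-split terms (the actual expressions for the two terms are given in the paper; here they are stand-in callables, so only the enumeration structure is illustrated):

```python
from itertools import chain, combinations

def powerset(K):
    """All subsets of {0, ..., K-1}."""
    idx = range(K)
    return chain.from_iterable(combinations(idx, r) for r in range(K + 1))

def overall_exponent(E_type, E_bpsk, K):
    """min over omega of [E_type(omega) + E_bpsk(complement of omega)].
    E_type and E_bpsk are placeholder callables standing in for the two
    terms of E_omega in the paper."""
    best = float("inf")
    for omega in powerset(K):
        omega_c = tuple(k for k in range(K) if k not in omega)
        best = min(best, E_type(omega) + E_bpsk(omega_c))
    return best
```

For example, if each node contributed a fixed exponent to whichever side of the split it lands on, the minimizing ω would simply route every node to its cheaper side.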

Discussion
This paper discusses the DHT problem over two communication models. The first is the noiseless channel, which is most commonly considered in current distributed learning and federated learning systems [9,11]. For the noiseless channels, we show that by using one-dimensional statistics from different nodes, it is possible to achieve the same error exponent as when the decision center has knowledge of the corresponding empirical distributions. This result is significant, as it simplifies the coding process at distributed nodes, allowing them to transmit only the necessary statistics rather than the entire empirical distribution, which provides a practical implementation of the result in [5]. This finding justifies transmitting statistics, the most widely-used strategy in distributed learning and federated learning [11].
For the AWGN channels, this paper introduces a novel coding strategy that combines the decode-and-forward and amplify-and-forward techniques. The underlying concept of this coding strategy is based on the observation that the probability of the empirical distribution deviating significantly from the true marginal distribution diminishes exponentially. Consequently, by employing sufficiently large power, we can transmit the empirical distribution almost perfectly to the decision center while satisfying the averaged power constraint. When the prior distributions are not 1/2, the strategy still works and attains the optimal error exponent; the only difference is to adjust the BPSK points for the two hypotheses according to the power constraint. The demonstrated optimality of the achieved decision error exponent further indicates that the proposed coding strategy is highly effective and successfully approaches the theoretical limit within the given constraints of the problem.

Conclusions
This paper investigates DHT problems over both noiseless channels and AWGN channels, where the distributed nodes are constrained to encode the observed empirical distributions, driven by practical computational considerations. In the first problem, we demonstrate that using one-dimensional statistics at the distributed nodes and simply summing them up as the decision rule achieves the optimal error exponent. For the second problem, we propose a coding strategy that combines decode-and-forward and amplify-and-forward techniques, and we introduce a genie-aided approach to establish the optimality of the achieved decision error exponent. Overall, our findings offer valuable insights into coding techniques for distributed nodes, and the established strategies can be extended to more general scenarios, broadening the applicability of DHT in diverse settings.

Appendix A

Indeed, since D(t) is non-empty, there exist (R_{X_1}, …, R_{X_K}) and ε > 0 such that D*_i(R_{X_1}, …, R_{X_K}) < t − ε for i = 0, 1, and thus D(t − ε) is non-empty.

Appendix B. Proof of Proposition 1
We know that D_i(E*) ⊂ S_X^(i)(h*) for i = 0, 1, where for t ≥ 0 and i = 0, 1, D_i(t) is as defined in Lemma 2; the claimed inclusion then follows.

Appendix C. Proof of Theorem 1
On the one hand, note that from the Markov relation, the minimum possible decision error is obtained when we choose the empirical distributions P̂_{X_1}, …, P̂_{X_K} themselves as the statistics.
On the other hand, from Proposition 1, the error exponents associated with the type I error and the type II error are given in terms of S_{X^K}^(0) and S_{X^K}^(1), respectively. From (18), both exponents equal E*, and thus the error exponent for P_n(Ĥ ≠ H) is also E*.

Appendix D. Proof of Theorem 2
To begin, we define ψ ≜ ψ^(1) − ψ^(0). Then, for given f_k : X_k → R, it follows from Lemma 17 of [13] that the exponent based on the feature can be written in terms of ζ ≜ B_0 φ_0 ∈ R(B_0), where φ_0 is as defined in (27). Note that the projection matrix Π_{B_0} satisfies Π_{B_0} = (Π_{B_0})^2 and ζ = Π_{B_0} ζ. Therefore, the Cauchy–Schwarz inequality yields the desired bound, with equality if and only if ζ takes the optimal value ζ* up to some constant scalar c ≠ 0. To determine the value of c, note that ζ* ↔ h*, where h* is the optimal feature as defined in (17). Note that in (21), for each i = 0, 1, Q_{X^K}^(i) depends only on the product λ_i h*; we may therefore assume λ_0 = 1/2 and simply write λ for λ_1. Then, we have Q_{X^K}^(0)(x^K) expressed in terms of P^(0), which implies the correspondence; similarly, Q_{X^K}^(1)(x^K) ↔ ψ^(1) + λζ + o(ε).
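The two algebraic facts invoked above (the projector is idempotent and fixes every vector in the range of B_0) are easy to check numerically. Below, a random full-column-rank matrix stands in for the paper's B_0; this is only a sanity-check sketch, not part of the proof.

```python
import numpy as np

rng = np.random.default_rng(1)

# Numerical check of the projection facts used in the proof, with a random
# full-column-rank matrix standing in for the paper's B_0.
B0 = rng.normal(size=(6, 3))
Pi = B0 @ np.linalg.pinv(B0)        # orthogonal projector onto R(B0)
zeta = B0 @ rng.normal(size=3)      # any zeta = B0 @ phi0 lies in R(B0)

idempotent = np.allclose(Pi, Pi @ Pi)        # Pi = Pi^2
fixes_range = np.allclose(zeta, Pi @ zeta)   # zeta = Pi @ zeta
```

For full-column-rank B0, `np.linalg.pinv(B0)` equals (B0ᵀB0)⁻¹B0ᵀ, so B0 @ pinv(B0) is exactly the orthogonal projector onto the column space, which is why both checks pass.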

Appendix E. Proof of Proposition 2
According to Sanov's theorem, P_n(P̂_{X_k} ∈ M_k^c) ≐ exp(−n^{1−γ}), and P_n(P̂_{X_k} ∉ M_k^c) ≐ 1. Then, we have an upper bound that converges to 0 as n → ∞. Additionally, for the power constraint,
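The Sanov-type decay invoked above can be illustrated numerically in the simplest binary case, where the exponent is the minimum KL divergence over distributions outside the neighbourhood. The grid search below is an illustrative sketch, not the paper's argument.

```python
import math

def sanov_exponent(p, delta, grid=20001):
    """Sanov's theorem for a Bernoulli(p) source: P(|P_hat - p| > delta)
    decays like exp(-n * E) with E = min_{|q - p| > delta} D(q || p).
    Plain grid-search sketch of that minimization (assumes 0 < p < 1)."""
    def kl(q):
        out = 0.0
        for a, b in ((q, p), (1.0 - q, 1.0 - p)):
            if a > 0.0:
                out += a * math.log(a / b)
        return out
    return min(kl(i / (grid - 1)) for i in range(grid)
               if abs(i / (grid - 1) - p) > delta)

E = sanov_exponent(0.5, 0.1)   # minimizer sits near q = 0.6 (or q = 0.4)
```

Since D(q‖p) is increasing in |q − p|, the minimizer sits on the boundary of the neighbourhood, which is why the deviation probability is exponentially small yet nonzero.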

Appendix F. Proof of Proposition 3
Note that, equivalently, the event can be written in terms of Z̄_k ∼ N(0, σ_k^2/m). We then apply the standard Gaussian tail result [32]: for any α > 0, −lim_{m→∞} (1/m) log P(Z̄_k > α) = α^2/(2σ_k^2), for k = 1, …, K. To finish the proof, we introduce the following lemma. In what follows, we denote Ω ≜ [K] \ (ω ∪ ω̃ ∪ ω̄). For those indices k ∈ ω̃ or k ∈ Ω, although they do not contribute to the Gaussian-like error exponents, they impose an additional restriction. By letting R^ω_{X_k} = R^ω̃_{X_k} = P̂_{X_k} (for k ∈ ω̃ or k ∈ Ω), which can then be optimized, we obtain the lower bound on (A15).
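The Gaussian tail exponent used above is a standard fact and is easy to verify numerically via the complementary error function. This is a sanity-check sketch only.

```python
import math

def tail_exponent_estimate(alpha, sigma, m):
    """For Zbar ~ N(0, sigma^2 / m), estimate -(1/m) * log P(Zbar > alpha).
    The Gaussian tail result says this tends to alpha^2 / (2 sigma^2).
    Uses P(Zbar > alpha) = 0.5 * erfc(alpha * sqrt(m) / (sigma * sqrt(2)))."""
    p = 0.5 * math.erfc(alpha * math.sqrt(m) / (sigma * math.sqrt(2.0)))
    return -math.log(p) / m

est = tail_exponent_estimate(1.0, 1.0, 500)   # limiting exponent is 0.5
```

The estimate approaches the limit from above because the polynomial prefactor of the Gaussian tail contributes a vanishing O((log m)/m) correction.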

Appendix H. Proof of Proposition 6
Let the encoders for k ∈ [K] \ ω be functions of H and P̂_{X_k}. The upper bound follows from the fact that the type is itself generated from the hypothesis H; therefore, an encoder acting on both the hypothesis and the type is simply a function of the true hypothesis. Let ρ_k^(i) denote the i-th entry of ρ_k, defined with (1/2)κ_k and satisfying ∑_{i=1}^m p_k^(i) = p_k. The error exponent with respect to the LLRT is the minimum of the resulting exponents. Here, we explain the optimality of κ_k^(i): let R*_{X_k}, θ_k^(i)* be the solution to problem (A19); then, for any other pair (κ̃_k^(i), κ̃_k), the corresponding comparison of the resulting θ values follows.

Appendix I
Based on the results in Appendix D, E_ω as defined in (32) satisfies (51), where k_ω ≜ ∑_{k∈ω} |X_k|; the result can then be verified using Lagrange multipliers.

Figure 1 .
Figure 1. The transmission procedures for the type-based distributed hypothesis testing problem over noiseless channels.

Figure 3 .
Figure 3. The geometric structure in distributed hypothesis testing, with Q

Figure 4 .
Figure 4. The information decomposition structure in distributed hypothesis testing with K = 2 nodes, compared with the orthogonal decompositions on the subspace R(B_k) for each node k = 1, 2.
the decision center can recover the empirical distributions P̂_{X_k} (k ∈ ω) from the received signals θ_k via the decoder:
