An Enhanced Decoding Algorithm for Coded Compressed Sensing with Applications to Unsourced Random Access

Unsourced random access (URA) has emerged as a pragmatic framework for next-generation distributed sensor networks. Within URA, concatenated coding structures are often employed to ensure that the central base station can accurately recover the set of sent codewords during a given transmission period. Many URA algorithms employ independent inner and outer decoders, which can help reduce computational complexity at the expense of a decay in performance. In this article, an enhanced decoding algorithm is presented for a concatenated coding structure consisting of a wide range of inner codes and an outer tree-based code. It is shown that this algorithmic enhancement has the potential to simultaneously improve error performance and decrease the computational complexity of the decoder. This enhanced decoding algorithm is applied to two existing URA algorithms, and the performance benefits of the algorithm are characterized. Findings are supported by numerical simulations.


I. INTRODUCTION
Massive machine-type communication (mMTC) is a rapidly growing class of wireless communications which aims to connect tens of billions of unattended devices to wireless networks. One significant application of mMTC is that of distributed sensing, which consists of a large number of wireless sensors that gather data over time and transmit their data to a central server, which then interprets the received data to produce useful information and/or make executive decisions. When combined with recent advances in machine learning (ML), such networks are expected to open a vast realm of economic and academic opportunities. However, the large population of unattended devices within these networks threatens to overwhelm existing wireless communication infrastructures by dramatically increasing the number of network connections; it is expected that the number of machines connected to wireless networks will exceed the population of the planet by an entire order of magnitude. Additionally, the traffic and demand profiles characteristic of individual sensors and actuators are highly inefficient under existing human-centric communication protocols; specifically, the sporadic and bursty nature of sensor transmissions is very costly under the estimation/enrollment/scheduling procedures typical of cellular networks. The combination of these challenges necessitates the design of novel physical and medium access control (MAC) layer protocols to efficiently handle the demands of these wireless devices.
One recently proposed paradigm for efficiently handling the demands of unattended devices is that of unsourced random access (URA), first proposed by Polyanskiy in 2017 [1]. URA captures many of the nuances of IoT devices by considering a network with an exceedingly large number of uncoordinated devices, of which only a small percentage is active at any given point in time. When a device/user is active, it encodes its short message using a common codebook and then transmits its codeword over a regularly scheduled time slot, as facilitated by a beacon. Furthermore, the power available to each user is strictly limited and assumed to be uniform across devices. The use of a common codebook is characteristic of URA and has two important implications: first, the network does not need to maintain a dictionary of active devices and their unique codebook information; second, the receiver does not know which node transmitted a given message unless the message itself contains a unique identifier. The receiver is then tasked with recovering an unordered list of transmitted messages sent during each time slot by the collection of active devices. The performance of URA schemes is evaluated with respect to the per-user probability of error (PUPE), which is the probability that a user's message is not present in the receiver's final list of decoded messages (this measure is defined in (3)). In [1], Polyanskiy provides finite block length achievability bounds for the short block lengths typical of URA applications using random Gaussian coding and maximum likelihood decoding. However, these bounds were produced in the absence of complexity constraints and thus are impractical for deployment in real-world networks. Over the past few years, several URA schemes have been proposed as means to obtain near-optimal performance with tractable complexity [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17], [18], [19], [20], [21], [22], [23].
All of the aforementioned URA schemes employ concatenated channel codes to recover the messages sent by the collection of active users at the receiver. We note that the term channel code is used broadly such that it includes certain signal dictionaries such as those commonly used for compressed sensing (CS). Though it is conceptually simpler to decode the inner and outer codes independently, it is a well-known fact within coding theory that dynamically sharing information between the inner and outer decoders will often improve the performance of the decoder. In this paper, we present a novel algorithm for sharing information between a wide class of inner codes and a tree-based outer code that significantly improves the PUPE performance and reduces the computational complexity of the scheme. Specifically, our main contributions are as follows.
1) A general system model consisting of a wide class of inner codes and an outer tree code is developed. An enhanced decoding algorithm is presented whereby the outer tree code may guide the convergence of the inner code by restricting the search space of the inner decoder to parity-consistent paths.
2) The coded compressed sensing (CCS) scheme of Amalladinne et al. in [9] is considered under this model. The enhanced decoding algorithm is applied to CCS and the performance benefits are quantified.
3) The CCS for massive MIMO scheme of Fengler et al. in [22] is considered under this model. The enhanced decoding algorithm is applied to CCS for massive MIMO and the performance benefits are quantified.

DRAFT December 2, 2021

II. SYSTEM MODEL
Consider a URA system consisting of $K$ active devices which are referred to by a fixed but arbitrary label $j \in [K]$. Each of these users wishes to simultaneously transmit its $B$-bit message $w_j$ to a central base station over a Gaussian multiple access channel (GMAC) using a concatenated code consisting of an inner code $\mathcal{C}$ and an outer tree code $\mathcal{T}$. This inner code $\mathcal{C}$ has the crucial property that, given a linear combination of $K \leq \delta$ codewords, the constituent information messages may be individually recovered with high probability. Furthermore, we assume that the probability that any two active users' messages are identical is low, i.e., $\Pr(w_i = w_j) < \epsilon$ for $i \neq j$.
We consider a scenario where it is either computationally intractable to inner encode/decode the entire message simultaneously or it is otherwise impractical to transmit the entire inner codeword at once; thus, each user must divide its information message into fragments and inner encode/decode each fragment individually. To ensure that the message can be reconstructed from its fragments at the receiver, the information fragments are first connected together using an outer tree-based code $\mathcal{T}$, and then inner-encoded using code $\mathcal{C}$. The resulting signal is transmitted over the channel. We elaborate on this process below.
Each message $w_j$ is broken into $L$ fragments, where fragment $\ell$ has length $m_\ell$ and $\sum_{\ell \in [L]} m_\ell = B$. Notationally, $w_j$ is represented as the concatenation of fragments by $w_j = w_j(1) w_j(2) \ldots w_j(L)$. The fragments are outer-encoded together by adding parity bits to the end of each fragment, with the exception of the first fragment. This is accomplished by taking random linear combinations of the information bits contained in previous sections. The parity bits appended to the end of section $\ell$ are denoted by $p_j(\ell)$ and have length $l_\ell$. This outer-encoded vector is denoted by $v_j$, where $v_j(\ell) = w_j(\ell) p_j(\ell)$. The vector $v_j$ now assumes the form shown in Fig. 1. After the outer-encoding process is complete, user $j$ inner-encodes each fragment $v_j(\ell)$ individually using $\mathcal{C}$ and concatenates the encoded fragments to form signal $x_j$. Each user then simultaneously transmits its signal to the base station over a GMAC. The received signal at the base station assumes the form
$$y = \sum_{j \in [K]} d\, x_j + z, \tag{1}$$
where $z$ is a vector of Gaussian noise with independent standard normal components and $d$ accounts for the transmit power.
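The outer encoding described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the toy fragment profile, the generator matrices `gens`, and the function names are all assumptions introduced here, and parity is computed as a random linear combination over GF(2) of the information bits in earlier fragments.

```python
import numpy as np

def make_generators(info_lens, parity_lens, rng):
    """gens[l][q]: random binary matrix mapping the info bits of fragment q < l
    to the parity bits of fragment l (hypothetical construction)."""
    return [[rng.integers(0, 2, size=(info_lens[q], parity_lens[l])) for q in range(l)]
            for l in range(len(info_lens))]

def tree_encode(w, info_lens, parity_lens, gens):
    """Split w into fragments and append parity bits to every fragment but the first."""
    cuts = np.cumsum([0] + list(info_lens))
    info = [w[cuts[l]:cuts[l + 1]] for l in range(len(info_lens))]
    encoded = []
    for l in range(len(info_lens)):
        p = np.zeros(parity_lens[l], dtype=int)
        for q in range(l):                       # parity depends on earlier fragments only
            p = (p + info[q] @ gens[l][q]) % 2   # random linear combination over GF(2)
        encoded.append(np.concatenate([info[l], p]))
    return encoded

rng = np.random.default_rng(0)
info_lens, parity_lens = [4, 3, 2, 1], [0, 1, 2, 3]   # toy profile with m_l + l_l = 4
gens = make_generators(info_lens, parity_lens, rng)
w = rng.integers(0, 2, size=sum(info_lens))           # B = 10 information bits
v = tree_encode(w, info_lens, parity_lens, gens)      # list of L outer-encoded fragments
```

The first fragment carries no parity, so `parity_lens[0]` is zero; later fragments trade information bits for parity bits while the fragment length stays fixed in this toy profile.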
Recall that the receiver is tasked with producing an unordered list of all the transmitted messages. A naive way to do this is to have the inner and outer decoders operate independently of each other. That is, the inner decoder is run on each of the $L$ fragments in $y$ to produce $L$ estimates of the outer-encoded codewords. Since $\mathcal{C}$ has the property that, given a linear combination of its codewords, the constituent input signals may be recovered with high probability, the aggregate signal in every slot can be expanded into a list of $K$ encoded fragments $\{\hat{v}_j(\ell) : j \in [K]\}$. It is pertinent to remind the reader that $\hat{v}_j(\ell)$ does not necessarily correspond to the message sent by user $j$, as the receiver has no way of connecting a received message to an active user within URA. At this point, the receiver has $L$ lists $\mathcal{L}_1, \mathcal{L}_2, \ldots, \mathcal{L}_L$, each with $K$ outer-encoded fragments. From these lists, the receiver must estimate the $K$ messages sent by the active devices during the frame. This is done by running the tree decoder on the $L$ lists to find parity-consistent paths across lists. Specifically, the tree decoder first selects a root fragment from list $\mathcal{L}_1$ and computes the corresponding parity section $p(2)$. The tree decoder then branches out to all fragments in list $\mathcal{L}_2$ whose parity sections match $p(2)$; each match creates a parity-consistent partial path. This process repeats until the last list $\mathcal{L}_L$ is processed. At this point, if there is a single path from $\mathcal{L}_1$ to $\mathcal{L}_L$, the message created by that path is deemed valid and stored for further processing; if there are multiple parity-consistent paths from a given root fragment or no parity-consistent paths from a given root fragment, a decoding failure is declared. Fig. 2 illustrates this process.
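The stitching procedure just described may be easier to follow as code. The sketch below is an illustration under stated assumptions rather than the decoder of [9]: the toy profile, the random generators `G`, and the helper names are hypothetical, and the inner decoder is assumed to return error-free lists.

```python
import numpy as np

rng = np.random.default_rng(1)
info_lens, parity_lens = [4, 2, 2, 1], [0, 2, 2, 3]   # toy profile, fragment length 4
L, K = len(info_lens), 3
# G[l][q]: hypothetical random binary matrix tying info bits of fragment q to parity of fragment l
G = [[rng.integers(0, 2, size=(info_lens[q], parity_lens[l])) for q in range(l)]
     for l in range(L)]

def parity(info_frags, l):
    """Parity section of fragment l implied by the info fragments 0..l-1."""
    p = np.zeros(parity_lens[l], dtype=int)
    for q in range(l):
        p = (p + np.array(info_frags[q]) @ G[l][q]) % 2
    return tuple(int(b) for b in p)

def encode(w):
    cuts = np.cumsum([0] + info_lens)
    info = [tuple(int(b) for b in w[cuts[l]:cuts[l + 1]]) for l in range(L)]
    return [info[l] + parity(info, l) for l in range(L)]

def tree_decode(lists):
    """Return the messages whose root fragment has exactly one parity-consistent path."""
    decoded = []
    for root in lists[0]:
        paths = [[root[:info_lens[0]]]]           # each path is a list of info fragments
        for l in range(1, L):
            extended = []
            for path in paths:
                target = parity(path, l)          # parity this path demands at stage l
                for frag in lists[l]:
                    if frag[info_lens[l]:] == target:
                        extended.append(path + [frag[:info_lens[l]]])
            paths = extended
        if len(paths) == 1:                       # zero or several survivors: failure
            decoded.append(tuple(b for frag in paths[0] for b in frag))
    return decoded

messages = rng.integers(0, 2, size=(K, sum(info_lens)))
codewords = [encode(w) for w in messages]
# The receiver only sees unordered per-stage lists, with no link to user identities
lists = [sorted({cw[l] for cw in codewords}) for l in range(L)]
decoded = tree_decode(lists)
```

Because the true path from each root always survives when the lists are error-free, any root with a single surviving path yields a genuinely transmitted message; roots with several survivors are dropped, which is the failure event described above.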

Fig. 2. This figure illustrates the operation of the tree decoder. The inner decoder $\mathcal{C}^{-1}$ produces $L$ lists of $K$ messages each. The outer tree decoder then finds parity-consistent paths across lists to extract valid messages.
While intuitive, this strategy is sub-optimal because information is not being shared by the inner and outer decoders. If the inner and outer decoders were to operate concurrently, the output of the outer decoder could be used to reduce the search space of the inner decoder, thus guiding the convergence of the inner decoder to a parity-consistent solution. The smaller search space also provides an avenue for reducing decoding complexity [24], [25]. Explicitly, assume that immediately after the inner decoder produces list $\mathcal{L}_\ell$, the outer decoder finds all parity-consistent partial paths from the root node to stage $\ell$, and let $R$ denote the number of such paths. Each of these $R$ parity-consistent partial paths has an associated parity section $p_r(\ell + 1)$. Furthermore, it is known that only those fragments in $\mathcal{L}_{\ell+1}$ that contain one of the admissible parity sections $\{p_r(\ell + 1) : r \in [R]\}$ may be part of the $K$ transmitted messages. Thus, when producing $\mathcal{L}_{\ell+1}$, the search space of the inner decoder may be reduced drastically to just the subset of fragments that contain an admissible parity section $p_r(\ell + 1)$.
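As a small illustration of this restriction, the sketch below collects the admissible parity sections implied by a set of surviving partial paths and tests whether a candidate fragment is compatible with them. The generator `G`, the toy sizes, and the function names are assumptions made for this example only.

```python
import numpy as np

rng = np.random.default_rng(2)
m_prev, l_next = 3, 4                              # toy sizes: info bits so far, parity bits next
G = rng.integers(0, 2, size=(m_prev, l_next))      # hypothetical parity generator

def next_parity(path_info_bits):
    """Parity section p_r(l+1) implied by the info bits along one partial path."""
    return tuple(int(b) for b in (np.array(path_info_bits) @ G) % 2)

def admissible_parities(partial_paths):
    """Set of parity sections that any surviving partial path could demand."""
    return {next_parity(path) for path in partial_paths}

def is_admissible(fragment, m_next, admissible):
    """A candidate fragment v = w || p is kept only if its parity section is admissible."""
    return tuple(fragment[m_next:]) in admissible

partial_paths = [(0, 1, 1), (1, 0, 1), (1, 1, 0)]  # R = 3 surviving partial paths
P_next = admissible_parities(partial_paths)
candidate = (0, 0) + next_parity(partial_paths[0]) # m_next = 2 info bits, then parity
keep = is_admissible(candidate, 2, P_next)
```

At most $R$ of the $2^{l_{\ell+1}}$ possible parity sections are admissible, which is where the reduction in search space comes from.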
This algorithmic enhancement has the potential to simultaneously reduce decoding complexity and improve PUPE performance. Still, a precise characterization of the benefits of this enhanced algorithm depends on the inner code chosen. We now consider two situations in which this algorithm may be applied: Coded Compressed Sensing (CCS) [9] and CCS for massive MIMO [22]. For each of the considered schemes, the complexity reduction and performance improvements are quantified. We emphasize that this algorithmic enhancement is applicable to other scenarios beyond those considered in this paper; one such example is the CHIRRUP scheme presented by Calderbank and Thompson in [12].

III. CASE STUDY 1: CODED COMPRESSED SENSING
In recent years, CCS has emerged as a practical scheme for URA that offers good performance with low complexity [9], [11], [13], [14]. Though many variants of CCS have emerged, we will focus on the original version published by Amalladinne et al. in [9]. At its core, CCS seeks to exploit a connection between URA and compressed sensing (CS). This connection may be understood by transforming a $B$-bit message $w$ into a length-$2^B$ index vector $m$; the single non-zero entry therein is a one at location $[w]_2$, which is the integer (expressed in radix 10) whose binary representation is $w$. This bijection is denoted $f(\cdot)$. The vector $m$ may then be compressed into signal $x$ using sensing matrix $A$ and transmitted over a noisy channel. The multiple access channel naturally adds the sent signals from the active devices. At the receiver, the original signals may be recovered from $y$ using standard CS recovery techniques such as non-negative least-squares (NNLS) or least absolute shrinkage and selection operator (LASSO). However, for messages of even modest lengths, the size of the index vector, $2^B$, is too large for standard CS solvers to handle. To circumvent this challenge, a divide and conquer approach can be employed.
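The link between messages and sparse vectors can be made concrete with a toy example. The sketch below is only illustrative: the Gaussian sensing matrix, its dimensions, and the helper names `f` and `f_inv` are assumptions, and noise is omitted.

```python
import numpy as np

def f(bits):
    """Map a bit tuple to the integer index of the single non-zero entry."""
    return int("".join(str(b) for b in bits), 2)

def f_inv(index, n_bits):
    """Recover the bit tuple from an index."""
    return tuple(int(c) for c in format(index, f"0{n_bits}b"))

B, n = 6, 20
rng = np.random.default_rng(3)
A = rng.standard_normal((n, 2**B)) / np.sqrt(n)    # hypothetical sensing matrix
messages = [(0, 0, 1, 0, 1, 1), (1, 1, 0, 0, 0, 1)]

m_sum = np.zeros(2**B)
for w in messages:
    m_sum[f(w)] += 1                               # each user contributes a one-sparse vector
y = A @ m_sum                                      # the channel adds the compressed signals
```

Here two users produce a 2-sparse vector of length $2^6 = 64$; with $B$ on the order of 100 bits, the same vector would have $2^{100}$ entries, which is why the message must be split into fragments.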
In CCS, the inner code $\mathcal{C}$ consists of the CS encoder and the outer tree code $\mathcal{T}$ is identical to that presented in Section II. Note that there is an additional step between $\mathcal{T}$ and $\mathcal{C}$: the outer-encoded message $v$ is transformed into the inner code input $m$ via the bijection described above. Furthermore, $\mathcal{C}$ has the property that, given a linear combination of its codewords, the corresponding set of $K$ one-sparse constituent inputs may be recovered with high probability. This, combined with the assumption that $\Pr(w_i = w_j) < \epsilon$ for $i \neq j$, makes CCS an eligible candidate for the enhanced decoding algorithm described previously. We review below the CCS encoding and decoding operations.

A. CCS Encoding
When user $j$ wishes to transmit a message to the central base station, it encodes its message in the following manner. First, it breaks its $B$-bit message into $L$ fragments and outer-encodes the $L$ fragments using the tree code described in Section II; this yields outer codeword $v_j$. Recall that fragment $\ell$ has $m_\ell$ information bits and $l_\ell$ parity bits. We emphasize that $m_\ell + l_\ell = v_\ell$ is constant for all sections in CCS, but the ratio of $m_\ell$ to $l_\ell$ is subject to change. Fragment $v_j(\ell)$ is then converted into a length-$2^{m_\ell + l_\ell}$ index vector, denoted by $m_j(\ell)$, and compressed using sensing matrix $A$ into vector $x_j(\ell)$. Within the next transmission frame, user $j$ transmits its encoded fragments across the GMAC with all other active users. At the base station, the received vector associated with slot $\ell$ assumes the form
$$y(\ell) = \sum_{j \in [K]} d\, A\, m_j(\ell) + z(\ell) = d\, A \Big( \sum_{j \in [K]} m_j(\ell) \Big) + z(\ell), \tag{2}$$
where $z(\ell)$ is a vector of Gaussian noise with standard normal components and $d$ reflects the transmit power. This is a canonical form of a $K$-sparse compressed vector embedded in Gaussian noise.

B. CCS Decoding
CCS decoding begins by running a standard CS solver such as NNLS or LASSO on each section to produce $L$ $K$-sparse vectors. The $K$ indices in each of these $L$ slots are converted back to binary representations using $f^{-1}(\cdot)$, and the tree decoder is run on the resultant $L$ lists to produce estimates of the transmitted messages.
This process may be improved by applying the proposed enhanced decoding algorithm, which proceeds as follows for CCS. The inner CS solver first recovers section 1, and then computes the set of possible parity patterns for section 2, denoted by $\mathcal{P}_2$. The columns of $A$ are then pruned dynamically to remove all columns associated with inadmissible parity patterns in section 2. This reduces the number of columns of $A$ from $2^{m_2 + l_2}$ to $|\mathcal{P}_2|\, 2^{m_2}$, since each admissible parity pattern may be paired with any of the $2^{m_2}$ information patterns. Section 2 is then recovered, and the process repeats itself until section $L$ has been decoded; at this point, valid paths through the $L$ lists are identified and the list of estimated transmitted messages is finalized. Fig. 3 illustrates this process.
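The column pruning step can be sketched as follows. This is a hedged illustration: it assumes that a fragment $v = w\,p$ is mapped to a column index with the information bits in the high-order positions and the parity bits in the low-order positions, and the function name, matrix sizes, and parity values are hypothetical.

```python
import numpy as np

def prune_columns(A, l_bits, admissible):
    """Keep only the columns of A whose index ends in an admissible parity pattern.

    Assumes the low-order l_bits of a column index encode the parity section.
    Returns the reduced matrix and the retained column indices."""
    idx = np.arange(A.shape[1])
    parity_part = idx % (2**l_bits)
    keep = np.flatnonzero(np.isin(parity_part, sorted(admissible)))
    return A[:, keep], keep

m_bits, l_bits, n = 3, 4, 16
rng = np.random.default_rng(4)
A = rng.standard_normal((n, 2**(m_bits + l_bits)))   # 128 columns before pruning
admissible = {0b0101, 0b1110, 0b0011}                # integer-coded parity patterns (toy values)
A_pruned, keep = prune_columns(A, l_bits, admissible)
```

Any CS solver can then be run on `A_pruned` in place of `A`; the entries of its solution map back to original fragment indices through `keep`. In this toy case the width drops from 128 columns to 3 × 8 = 24.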

C. Results
As previously mentioned, the algorithmic enhancement presented in this article has the potential to improve both the performance and the computational complexity of concatenated coding schemes. As CCS is a URA scheme, its performance is evaluated with respect to the per-user probability of error (PUPE), which is defined as
$$P_e = \frac{1}{K} \sum_{j \in [K]} \Pr\big( w_j \notin \hat{W}(y) \big), \tag{3}$$
where $\hat{W}(y)$ is the estimated list of transmitted messages, with at most $K$ items. Since many different CS solvers with varying computational complexities may be employed within the CCS framework, the complexity reduction offered by the enhanced decoding algorithm will be quantified by counting the number of columns removed from the matrix $A$.
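In simulation, the PUPE is typically estimated by counting, frame by frame, the fraction of transmitted messages missing from the decoded list and averaging over many frames. A minimal sketch of the per-frame computation follows; the function name and the toy messages are hypothetical.

```python
def pupe(sent, decoded):
    """Fraction of sent messages that are absent from the decoded list (one frame)."""
    decoded_set = set(decoded)
    return sum(w not in decoded_set for w in sent) / len(sent)

sent = ["0110", "1010", "1111", "0001"]      # K = 4 transmitted messages
decoded = ["1010", "0001", "0000"]           # receiver output; "0000" is a false alarm
frame_error = pupe(sent, decoded)            # two of four messages are missing
```

Note that false alarms such as `"0000"` do not enter this metric directly; they are limited by the constraint that the decoded list holds at most $K$ items.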

As discussed in [24], the column pruning operation has at least four major implications on the performance of CCS. These implications are summarized below.
1) Many CS solvers rely on iterative methods or convex optimization solvers to recover $x$ from $y = Ax$. Decreasing the width of $A$ will result in a reduction in computational complexity, the exact size of which will depend on the CS solver employed.
2) When all message fragments have been correctly recovered for stages $1, 2, \ldots, \ell$, the matrix $A$ is pruned in such a way that is perfectly consistent with the true signal. In this scenario, the search space for the CS solver is significantly reduced and the performance will improve.
3) When an erroneous message fragment has been incorrectly identified as a true message fragment by stage $\ell$, the column pruning operation will guide the CS solver to a list of fragments that is more likely to contain additional erroneous fragments. This further propagates the error and helps erroneous paths stay alive longer.
4) When a true fragment is removed from a CS list, its associated parity pattern may be discarded and disappear entirely. This results in the loss of a correct message and additional structured noise which may degrade the PUPE performance of other valid messages.
Despite having both positive and negative effects, the net effect of the enhanced decoding algorithm on the system's PUPE performance is positive, as illustrated in Fig. 4. From this figure, we gather that the enhanced decoding algorithm reduces the required $E_b/N_0$ by nearly 1 dB for a low number of users. Furthermore, for the entire range of number of users considered, the enhanced algorithm is at least as good as the original algorithm and often much better.
By tracking the expected number of parity-consistent partial paths, it may be possible to compute the expected column reduction ratio at every stage. However, this is a daunting task, as explained in [9]. Instead, we estimate the expected column reduction ratio by applying the analysis from [9] with the following simplifying assumptions:
• No two users have the exact same message fragments at any stage: $w_i(\ell) \neq w_j(\ell)$ whenever $i \neq j$ and for all $\ell \in [L]$.
• The inner CS decoder makes no errors in producing lists $\mathcal{L}_1, \ldots, \mathcal{L}_L$.
Under these assumptions and starting from a designated root node, the number of erroneous paths that survive stage $\ell$, denoted $L_\ell$, is subject to the following recursion,
$$L_\ell \mid L_{\ell-1} \sim \mathrm{B}\big( (L_{\ell-1} + 1) K - 1,\; 2^{-l_\ell} \big), \qquad L_1 = 0,$$
where $\mathrm{B}(n, p)$ denotes a binomial distribution. When the matrix $A$ is pruned dynamically, $K$ copies of the tree decoder run in parallel and, as such, the expected number of parity-consistent partial paths at stage $\ell$ can be expressed as
$$P_\ell = K \big( 1 + \mathbb{E}[L_\ell] \big).$$
Under the further simplifying assumptions that all parity patterns are independent and $P_\ell$ concentrates around its mean, we can approximate the number of admissible parity patterns. The probability that a particular path maps to a specific parity pattern in section $\ell$ is $2^{-l_\ell}$ and, hence, the probability that this pattern is not selected by any of the $P_{\ell-1}$ partial paths becomes
$$\big( 1 - 2^{-l_\ell} \big)^{P_{\ell-1}}.$$
Taking the complement of this event and multiplying by the number of parity patterns, we get an approximate expression for the mean number of admissible patterns,
$$\mathbb{E}\big[ |\mathcal{P}_\ell| \big] \approx 2^{l_\ell} \Big( 1 - \big( 1 - 2^{-l_\ell} \big)^{P_{\ell-1}} \Big).$$
Thus, the expected column reduction ratio at slot $\ell$, denoted $\mathbb{E}[R_\ell]$, is obtained by comparing this mean with the total number $2^{l_\ell}$ of parity patterns in slot $\ell$, as detailed in [24]. Fig. 5 shows the estimated versus simulated column reduction ratio across stages. Overall, the number of columns in $A$ can be reduced drastically for some stages, thus significantly lowering the complexity of the decoding algorithm.
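The approximation above is straightforward to evaluate numerically. The sketch below is a hedged illustration, not the computation behind Fig. 5. It assumes the mean recursion $\mathbb{E}[L_\ell] = 2^{-l_\ell}(K\,\mathbb{E}[L_{\ell-1}] + K - 1)$ with no erroneous paths at the root, a path count of $K(1 + \mathbb{E}[L_\ell])$, and it reports the fraction of columns retained, $\mathbb{E}|\mathcal{P}_\ell| / 2^{l_\ell}$, which is an assumed convention (the removed fraction is its complement). The parity profile and the value of $K$ are toy values.

```python
def expected_pruning(K, parity_lens):
    """Approximate path counts and retained column fractions per stage.

    Stages are 0-indexed here: stage 0 is the root section, which carries no parity."""
    exp_err = [0.0]                 # expected erroneous paths per root at stage 0
    paths = [float(K)]              # K tree decoders, one surviving path each
    retained = [1.0]                # the first section cannot be pruned
    for l in range(1, len(parity_lens)):
        rho = 2.0 ** (-parity_lens[l])
        # fraction of parity patterns selected by at least one path through stage l-1
        retained.append(1.0 - (1.0 - rho) ** paths[l - 1])
        exp_err.append(rho * (K * exp_err[l - 1] + K - 1))
        paths.append(K * (1.0 + exp_err[l]))
    return paths, retained

paths, retained = expected_pruning(K=100, parity_lens=[0, 9, 9, 9, 12])
```

With many parity bits per section, each path selects one of a large number of patterns, so only a small fraction of the columns needs to be kept.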

IV. CASE STUDY 2: CODED COMPRESSED SENSING FOR MASSIVE MIMO
A natural extension of the single-input single-output (SISO) version of CCS proposed in [9] is a version of CCS where the base station utilizes $M \gg 1$ receive antennas. In this scenario, we assume that the receive antennas are sufficiently separated to ensure negligible spatial correlation across channels. Furthermore, we adopt a block fading model where the channel remains fixed for a coherence period of $n$ channel uses and all coherence blocks are assumed to be completely independent, as in [21]. Each active user transmits its message over $L$ coherence blocks, with one coherence block corresponding to each of the $L$ sections described above; thus the total number of channel uses is $N = nL$. As in SISO CCS, the receiver is tasked with producing an estimated list of the messages transmitted by the collection of active users during a given time instant. In addition to observing the received signal, the base station has knowledge of the total number of active users, the codes used for encoding messages, and the second-order statistics of MIMO channels. We note that channel state information (CSI) is not fully known; thus, the decoding algorithm can be characterized as non-coherent [25]. The scheme we consider in this work was first presented by Fengler et al. in [22].

A. MIMO Encoding
The encoding process for CCS with massive MIMO is analogous to the encoding process for CCS; for a thorough description of this process, please see Section III. However, the signal received by the base station will have a different structure as the base station employs $M$ receive antennas. Let $x(t, \ell)$ denote the $t$th symbol in block $\ell$ of vector $x$. Then, the signal observed by the base station is of the form
$$y(t, \ell) = \sum_{j \in [K]} x_j(t, \ell)\, h_j(\ell) + z(t, \ell),$$
where $z(t, \ell)$ is circularly-symmetric complex white Gaussian noise with zero mean and variance $N_0/2$ per dimension and $h_j(\ell) \sim \mathcal{CN}(0, I_M)$ is a vector of small-scale fading coefficients representing the channel between user $j$ and the base station's $M$ antennas.

B. MIMO Decoding
Recall that a URA receiver is tasked with producing an unordered list of the messages transmitted by the collection of active devices. To do this, the receiver must first identify the list of fragments transmitted during each of the $L$ coherence blocks and then extract the transmitted messages by finding parity-consistent paths across lists.
The receiver architecture presented in [22] features a concatenated code, where the inner code C is decoded using a covariance-based activity detection algorithm and the outer tree code T is decoded in a manner identical to that presented in Section II.
Recall that each active user transforms its outer-encoded message $v$ into a 1-sparse index vector $m$. Let $\{i_j(\ell) : j \in [K]\}$ denote the set of indices chosen by the active users during block $\ell$. Then, the signal observed at the base station is of the form
$$Y(\ell) = A(\ell)\, \Gamma(\ell)\, H(\ell) + Z(\ell),$$
where $H(\ell)$ has independent $\mathcal{CN}(0, 1)$ entries, $Z(\ell)$ is independent complex Gaussian noise, and $\Gamma(\ell)$ is a diagonal matrix that indicates which indices have been selected during block $\ell$; that is, $\Gamma(\ell) = \mathrm{diag}(\gamma_0(\ell), \ldots, \gamma_{2^{v_\ell}-1}(\ell))$, where
$$\gamma_i(\ell) = \begin{cases} 1, & i \in \{ i_j(\ell) : j \in [K] \}, \\ 0, & \text{otherwise}. \end{cases}$$
Finally, $Y(\ell)$ is an $n \times M$ matrix where the rows of $Y(\ell)$ correspond to various time instants and the columns of $Y(\ell)$ correspond to the different antennas present at the base station. Fig. 6 illustrates this configuration.
Determining which fragments were sent during coherence block $\ell$ is equivalent to estimating $\Gamma(\ell)$. This process is referred to as activity detection and may be accomplished through covariance matching when the number of receive antennas is large, as described in [22]. An iterative algorithm for estimating $\Gamma(\ell)$ was first proposed by Fengler et al. in [22] and is summarized in Algorithm 1. After the collection of fragments transmitted in each of the $L$ sub-blocks has been recovered by Algorithm 1, tree decoding is employed to disambiguate the collection of transmitted messages.
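As before, it is possible to leverage the enhanced version of the tree decoding process, with its dynamic pruning, to improve performance and lower complexity. The application of the proposed algorithmic enhancement to the activity detection algorithm may be visualized in the following way. Let $\mathcal{S}_\ell$ denote the set of indices to perform coordinate descent over during coherence block $\ell$; in its original formulation, $\mathcal{S}_\ell = [2^{v_\ell}]$. After list $\mathcal{L}_1$ has been produced by the activity detection algorithm, the tree decoder can compute the set of all admissible parity patterns $\mathcal{P}_2$ for list $\mathcal{L}_2$; then, $A(2)$ may be pruned to only contain those columns corresponding to messages with parity patterns in $\mathcal{P}_2$. A similar strategy can be applied moving forward, yielding a reduced admissible set $\mathcal{P}_\ell$ for parity patterns at stage $\ell$. In turn, this reduces the index set $\mathcal{S}_\ell$ to
$$\mathcal{S}_\ell = \big\{ i \in [2^{v_\ell}] : \text{the parity section of } f^{-1}(i) \text{ lies in } \mathcal{P}_\ell \big\},$$
which may be significantly smaller than $[2^{v_\ell}]$. This algorithmic refinement guides the activity detection algorithm to a parity-consistent solution and reduces the search space of the inner decoder, thus improving performance significantly [25].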
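To make the role of the index set concrete, the sketch below runs a covariance-based coordinate descent over a restricted set of columns. The update rule is an assumption: it is the rank-one maximum likelihood update with a Sherman-Morrison refresh of the inverse covariance that is commonly used for this kind of activity detection, not a transcription of Algorithm 1. The function name, the fixed number of sweeps, and all dimensions are hypothetical.

```python
import numpy as np

def activity_detection(Y, A, noise_var, index_set, n_sweeps=10):
    """Estimate the activity vector gamma by coordinate descent over index_set only."""
    n, M = Y.shape
    sigma_hat = (Y @ Y.conj().T) / M                  # sample covariance of the observations
    sigma_inv = np.eye(n, dtype=complex) / noise_var  # inverse of the model covariance N0 * I
    gamma = np.zeros(A.shape[1])
    for _ in range(n_sweeps):
        for k in index_set:
            a = A[:, k]
            u = sigma_inv @ a
            q = np.real(a.conj() @ u)                 # a^H Sigma^{-1} a
            r = np.real(u.conj() @ sigma_hat @ u)     # a^H Sigma^{-1} Sigma_hat Sigma^{-1} a
            d = max((r - q) / q**2, -gamma[k])        # step that keeps gamma[k] non-negative
            gamma[k] += d
            sigma_inv -= d * np.outer(u, u.conj()) / (1 + d * q)  # rank-one inverse update
    return gamma

rng = np.random.default_rng(5)
n, N, M, K, N0 = 32, 64, 64, 4, 0.1
A = (rng.standard_normal((n, N)) + 1j * rng.standard_normal((n, N))) / np.sqrt(2)
active = rng.choice(N, size=K, replace=False)
H = (rng.standard_normal((K, M)) + 1j * rng.standard_normal((K, M))) / np.sqrt(2)
Z = np.sqrt(N0 / 2) * (rng.standard_normal((n, M)) + 1j * rng.standard_normal((n, M)))
Y = A[:, active] @ H + Z

full = activity_detection(Y, A, N0, range(N))                  # original index set
pruned_set = sorted(set(active.tolist()) | {0, 1, 2, 3})       # toy admissible index set
pruned = activity_detection(Y, A, N0, pruned_set)              # enhanced variant
```

Restricting the inner loop to `pruned_set` shortens every sweep in proportion to the number of removed indices and forces the estimate to zero on inadmissible columns, which mirrors the two benefits claimed for dynamic pruning.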

C. Results
The simulation results presented in this section correspond to a scenario with $K \in [25, 150]$ active users and $M \in [25, 125]$ antennas at the base station. Each user encodes its 96-bit message into $L = 32$ blocks with 100 complex channel uses per block. The length of the outer-encoded block is $v_\ell = 12$ for all $\ell \in [L]$, and a parity profile of $(l_1, l_2, \ldots, l_L) = (0, 9, 9, \ldots, 9, 12, 12, 12)$ is employed. The energy per bit $E_b/N_0$ is fixed at 0 dB and the columns of $A(\ell)$ are chosen randomly from a sphere of radius $\sqrt{nP}$. These parameters are chosen to match [22]. Fig. 7 shows the PUPE of this scheme for a range of active users and several different values of $M$. In this figure, the dashed lines represent the performance of the original algorithm and the solid lines represent the performance of the enhanced version with dynamic pruning.
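As a quick consistency check on these parameters, the short sketch below expands the parity profile and recovers the message length. The number of sections with 9 parity bits is not stated explicitly; it is inferred here as $L - 4 = 28$ so that the profile has $L = 32$ entries, which is an assumption.

```python
L, v, n = 32, 12, 100
parity_profile = [0] + [9] * (L - 4) + [12, 12, 12]   # assumed expansion of (0, 9, ..., 9, 12, 12, 12)
info_profile = [v - l for l in parity_profile]        # information bits per section

total_info_bits = sum(info_profile)                   # 12 + 28 * 3 + 0
total_channel_uses = n * L                            # N = nL
columns_per_section = 2 ** v                          # width of A(l) before pruning
```

Under this expansion the information bits sum to 96, in agreement with the stated message length, and the last three sections carry parity bits only.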
From Fig. 7, we gather that the proposed algorithm reduces the PUPE for a fixed number of active users and a fixed number of antennas at the base station. Additionally, this algorithm may be used as a means to reduce the number of antennas required to achieve a target PUPE. For instance, when $K = 100$, the enhanced algorithm allows for a 23% reduction in the number of antennas at the base station with no degradation in error performance. Fig. 8 provides the ratio of average runtimes of the enhanced decoding algorithm versus the original decoding algorithm. The enhanced decoding algorithm also offers a significant reduction in computational complexity, especially for a low number of active users.

V. CONCLUSION
In this article, a framework for a concatenated code architecture consisting of a structured inner code and an outer tree code was presented. This framework was specifically designed for URA applications, but may find applications in other fields as well. An enhanced decoding algorithm was proposed for this framework that promises to improve performance and decrease computational complexity. This enhanced decoding algorithm was applied to two URA schemes: coded compressed sensing (CCS) and CCS for massive MIMO. In both cases, PUPE performance gains were observed and the decoding complexity was significantly reduced.
The proposed algorithm is a natural extension of the existing literature. From coding theory, we know that there are at least three ways for inner and outer codes to interact. First, the two codes may operate completely independently of one another in a Forney-style concatenated fashion; this is the style of the original CCS decoder presented in [9]. Second, information messages may be passed between inner and outer decoders as both decoders converge to the correct codeword; this is the style of CCS-AMP, which was proposed by Amalladinne et al. in [11]. Finally, a successive cancellation decoder may be employed in the spirit of coded decision feedback; this is the style highlighted in this article and considered in [24], [25]. Thus, the dynamic pruning introduced in this paper can be framed as an application of coding theoretic ideas to a concatenated coding structure that is common within URA.
Though the examples presented in this article pertained to CCS, we emphasize that dynamic pruning may be applicable to many algorithms beyond CCS. For instance, this approach may be relevant to support recovery in exceedingly large dimensions, where a divide and conquer approach is needed. As long as the inner and outer codes subscribe to the structure described in Section II, this algorithmic enhancement can be leveraged to obtain performance and/or complexity improvements.

Fig. 1. This figure illustrates the structure of a user's outer-encoded message, denoted by $v$. Fragment $\ell$ consists of the concatenation of information bits, denoted by $w(\ell)$, and parity bits, denoted by $p(\ell)$.

Fig. 3. This figure illustrates the enhanced decoding algorithm applied to CCS. After recovering $\mathcal{L}_\ell$, the sensing matrix $A$ is pruned so that list $\mathcal{L}_{\ell+1}$ only contains parity-consistent fragments.

Fig. 4. This figure shows the required $E_b/N_0$ to obtain a PUPE of 5% vs the number of active users. This figure was generated by simulating a CCS scenario with $K \in [10 : 175]$ users, each of which wishes to transmit a $B = 75$ bit message divided into $L = 11$ stages over 22,517 channel uses. NNLS was used as the CS solver.

Fig. 5. This figure illustrates the column reduction ratio provided by the enhanced decoding algorithm for each stage of the outer code and a varying number of users. Lines represent numerical results and markers represent simulated results. Clearly, the size of the sensing matrix may be drastically reduced.

Fig. 6. This figure illustrates the structure of $Y(\ell)$, where the rows correspond to time instants and the columns correspond to receive antennas.
Algorithm 1 Activity Detection via Coordinate Descent
1: Inputs: Sample covariance $\hat{\Sigma}_{Y(\ell)} = \frac{1}{M} Y(\ell) Y(\ell)^{H}$
2: Initialize: $\Sigma_\ell = N_0 I_n$, $\gamma(\ell) = 0$
3: for $i = 1, 2, \ldots$ do
4:   for $k \in \mathcal{S}_\ell$ do
5:     …
Here, $\mathcal{S}_\ell$ denotes the set of indices to perform coordinate descent over during coherence block $\ell$; in its original formulation, $\mathcal{S}_\ell = [2^{v_\ell}]$.

Fig. 7. This figure illustrates the performance advantage of applying the enhanced decoding algorithm presented in this paper to CCS for massive MIMO. The dashed line represents the original performance from [22] and the solid line represents the performance of the enhanced algorithm.

Fig. 8. Ratio of average runtimes of the enhanced decoding algorithm versus the original decoding algorithm.