Low-Complexity Address Generation for Multiuser Detectors in IDMA Systems

Abstract: This paper presents a low-complexity address generation unit (AGU) for multiuser detectors in interleave division multiple access (IDMA) systems. To this end, all possible options for designing AGUs are analyzed in detail for the first time. Subsequently, a complexity reduction technique is applied to each of those architectures. More specifically, some components in the AGUs are relocated to make them shareable and removable without affecting the functionality. Because this modification is completely transparent, it is applicable to any existing multiuser detector without tailoring the interfacing components therein. All the resulting AGUs are compared in terms of hardware complexity, and a new architecture simpler than the state-of-the-art one is developed. Implementation results in a 65 nm CMOS process demonstrate that the proposed AGU reduces the equivalent gate count and the power consumption of the prior art by 13% and 31%, respectively.


Introduction
Nonorthogonal multiple access (NOMA) is an emerging class of multiple-access technologies in 5G telecommunications and the Internet of Things [1]. Interleave division multiple access (IDMA) is one of the NOMA schemes that distinguishes multiple users according to their distinct interleaving patterns [2]. By virtue of its fine scalability and robustness, IDMA is considered as a promising NOMA candidate for the forthcoming applications [3].
Recent works in the literature have pioneered sophisticated multiuser detector architectures for IDMA systems [4][5][6][7][8][9][10][11]. As generalized in Figure 1, a U-user detector incorporates U user-wise processing blocks (UPBs) and one elementary signal estimator (ESE). Each UPB contains its own address generation unit (AGU) for accessing memories therein. Since all users employ distinct interleaving patterns and access memories in their own manners, all the U AGUs are implemented separately, making the total number of AGUs in a detector U. Accordingly, when a massive number of users are connected, i.e., U >> 1, the AGUs as a whole contribute a significant portion to the overall hardware complexity.
Despite this significant implementation cost, only one AGU structure [5] has been presented in the literature, and it has never been studied in detail. For the first time, this paper analyzes all possible options for designing AGUs, and then applies a complexity reduction technique to each of those architectures. More specifically, some components in the AGUs are relocated to make them shareable and removable without affecting the functionality. Because this modification is completely transparent, it is applicable to any existing multiuser detector without tailoring the interfacing components therein. All the resulting AGUs are compared in terms of hardware complexity, and a new architecture simpler than the state-of-the-art one is developed. Implementation results in a 65 nm CMOS process demonstrate that the proposed AGU reduces the equivalent gate count and the power consumption of the prior art by 13% and 31%, respectively.
The rest of this paper is organized as follows. Section 2 reviews the fundamentals of multiuser detection in IDMA systems. Section 3 compares two addressing modes for AGUs. The proposed complexity reduction technique is presented in Section 4. In Section 5, all possible options for AGUs are evaluated and discussed along with the implementation results. Concluding remarks are made in Section 6.

Background
At the transmitter of the uth user (Tx_u) in Figure 1, each of N information bits is first replicated S times by a spreader. The resulting sequence of J = NS chips is then permuted by an interleaver of length J. The sequences of the J chips before and after interleaving can be indexed by {j} and {π_u(j)}, respectively, for j = 0, 1, ..., J - 1, where π_u(·) is the interleaving function of the uth user. In the case of J = 4 and U = 2, for example, {j} = {0, 1, 2, 3}, {π_1(j)} = {2, 0, 3, 1}, and {π_2(j)} = {1, 3, 2, 0}. Note that the two interleaving patterns are distinct. The chips departing from the U users go through a wireless channel while interfering with each other. In a multiuser detector receiving the chips with interference, the ESE first distributes l_u(π_u(j)) to UPB_u for u = 1, 2, ..., U, where l_u(π_u(j)) is the log-likelihood ratio (LLR) of the jth chip from the uth user. In return, UPB_u answers the ESE with e_u(π_u(j)), called an extrinsic LLR. After the ESE and the UPBs exchange their LLRs several times, the final estimates of the N information bits are determined by the signs of the LLRs.
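The transmitter chain described above can be sketched in a few lines of Python. This is only an illustration: the function names `spread` and `interleave` and the bit values are hypothetical, and the sketch assumes that the jth transmitted chip is the π_u(j)-th chip of the spread sequence, which is one reading of the index sets {j} and {π_u(j)}.

```python
def spread(bits, S):
    """Replicate each information bit S times (repetition spreading)."""
    return [b for b in bits for _ in range(S)]


def interleave(chips, pi):
    """Assumed convention: the j-th output chip is chips[pi[j]]."""
    return [chips[pi[j]] for j in range(len(chips))]


bits = [1, 0]            # N = 2 hypothetical information bits
S = 2                    # spreading factor, so J = N * S = 4
chips = spread(bits, S)  # [1, 1, 0, 0]

pi_1 = [2, 0, 3, 1]      # interleaving pattern of user 1 (from the text)
pi_2 = [1, 3, 2, 0]      # interleaving pattern of user 2 (from the text)

tx_1 = interleave(chips, pi_1)
tx_2 = interleave(chips, pi_2)
print(tx_1, tx_2)
```

Because the two patterns differ, the same spread sequence leaves the two transmitters in different chip orders, which is what allows the detector to tell the users apart.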
The operation of UPB_u can be formulated for all j = 0, 1, ..., J - 1 as
$$
\begin{aligned}
e_u(\pi_u(j)) &= d_u(p_u(j)) - l_u(\pi_u(j)) \\
&= \sum_{s=0}^{S-1} l_u\big(S\,p_u(j) + s\big) - l_u(\pi_u(j)) \\
&= \sum_{s=0}^{S-1} l_u\big(S \lfloor \pi_u(j)/S \rfloor + s\big) - l_u(\pi_u(j)),
\end{aligned}
\tag{1}
$$
where d_u(p_u(j)) is called a despread LLR, and p_u(j) is the index of the despread LLR that corresponds to l_u(π_u(j)). Accordingly, the first line of (1) states that an extrinsic LLR, e_u(π_u(j)), is calculated by subtracting an incoming LLR, l_u(π_u(j)), from its corresponding despread LLR, d_u(p_u(j)). Comparing the first and the second lines of (1) shows that d_u(p_u(j)) is the sum of the S LLRs associated with p_u(j). Since p_u(j) = floor(π_u(j)/S), as rewritten in the third line of (1), {π_u(j)} can be divided into J/S disjoint subsets, each of which has S elements associated with the same p_u(j). Then, d_u(p_u(j)) can be interpreted as the sum of the S LLRs in such a subset. Let us exemplify with J = 4, S = 2, and {π_u(j)} = {2, 3, 1, 0}. The elements in subset {π_u(0), π_u(1)} = {2, 3} are associated with p_u(j) = 1, and d_u(p_u(j)) = d_u(1) is the sum of S = 2 LLRs, l_u(2) + l_u(3). The elements in subset {π_u(2), π_u(3)} = {1, 0} are associated with p_u(j) = 0, and d_u(p_u(j)) = d_u(0) = l_u(1) + l_u(0). It is worth noting that obtaining one despread LLR by accumulating the S LLRs associated with the same p_u(j) is the inverse of spreading, which makes S replicas of the p_u(j)-th information bit.
The state-of-the-art scheme to calculate (1), which is called on-the-fly despreading [5], has been predominantly employed by the latest UPBs [5][6][7][8]. It comprises two phases: (1) reception and (2) response. In the first phase, UPB_u receives l_u(π_u(j)) for all j = 0, 1, ..., J - 1 from the ESE, and stores them into a memory named M_L. Simultaneously, it adds l_u(π_u(j)) to the p_u(j)-th entry in a memory named M_P1. M_P1 contains J/S entries, each of which corresponds to a partial sum (PS) of d_u(p_u(j)). After all the J LLRs have been accumulated in this way, the PSs become {d_u(p_u(j))}.
Let us revisit the example with J = 4, S = 2, and {π_u(j)} = {2, 3, 1, 0}.
(1) For j = 0, l_u(2) is stored into M_L, and the PS of d_u(1) in M_P1 is set to l_u(2).
(2) For j = 1, l_u(3) is stored into M_L, and the PS of d_u(1) in M_P1, which has been l_u(2), is updated to l_u(2) + l_u(3).
(3) For j = 2, l_u(1) is stored into M_L, and the PS of d_u(0) in M_P1 is set to l_u(1).
(4) For j = 3, l_u(0) is stored into M_L, and the PS of d_u(0) in M_P1, which has been l_u(1), is updated to l_u(1) + l_u(0).
As a result of J = 4 cycles, {l_u(π_u(j))} and {d_u(p_u(j))} have been prepared in M_L and M_P1, respectively. In the second phase, UPB_u returns e_u(π_u(j)) = d_u(p_u(j)) - l_u(π_u(j)) to the ESE for all j = 0, 1, ..., J - 1. The minuend and the subtrahend are retrieved from M_P1 and M_L, respectively. The next reception phase, in which new PSs are to be computed, may start in the middle of the response phase. However, since {d_u(p_u(j))} in M_P1 are in use during the response phase, they should not be overwritten. Accordingly, the new PSs are stored into a duplicate memory of M_P1, named M_P2. As the phases iterate, the roles of M_P1 and M_P2 alternate. For example, in even-numbered iterations, M_P1 provides d_u(p_u(j)), while M_P2 manages the new PSs. In odd-numbered iterations, the roles are reversed.
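The two-phase schedule can be checked against a direct evaluation of (1) with a small behavioral model. The sketch below is an illustration rather than the implementation of [5]: the function names and LLR values are hypothetical, M_L is assumed to be written and read at the sequential address j, and the alternation between M_P1 and M_P2 is left out because only one iteration is modeled.

```python
def direct_extrinsic(l, pi, S):
    """Evaluate (1) directly; l[k] holds the LLR l_u(k) for k = 0..J-1."""
    J = len(pi)
    # Despread LLR d_u(p): sum of the S LLRs whose index k satisfies k // S == p.
    d = [sum(l[S * p + s] for s in range(S)) for p in range(J // S)]
    return [d[pi[j] // S] - l[pi[j]] for j in range(J)]


def on_the_fly_extrinsic(l, pi, S):
    """Reception phase followed by response phase (single iteration)."""
    J = len(pi)
    M_L = [0.0] * J          # incoming LLRs, written at address j (assumption)
    M_P = [0.0] * (J // S)   # partial sums, addressed by p_u(j)
    for j in range(J):       # reception: one LLR arrives per cycle
        llr = l[pi[j]]
        M_L[j] = llr
        M_P[pi[j] // S] += llr
    # response: despread LLR minus the stored incoming LLR
    return [M_P[pi[j] // S] - M_L[j] for j in range(J)]


pi = [2, 3, 1, 0]               # interleaving pattern from the example
S = 2
l = [0.5, -1.0, 2.0, 0.25]      # hypothetical LLR values l_u(0..3)

e_direct = direct_extrinsic(l, pi, S)
e_fly = on_the_fly_extrinsic(l, pi, S)
print(e_direct, e_fly)
```

Both functions return the same extrinsic LLRs, which illustrates why accumulating partial sums during reception is enough to have every d_u(p_u(j)) ready when the response phase begins.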

AGUs Based on Sequential and Interleaved Addresses
As stated above, every UPB accesses M_L, M_P1, and M_P2 every cycle to read and write LLRs and PSs, which necessitates the generation of proper read and write addresses. The AGU is responsible for generating such addresses, and interfaces with the memories as depicted in Figure 2. The nomenclature of the signals is as follows. The baseline text stands for the functionality of an address, while the subscript designates the associated memory. For example, RA_L is the read address for M_L, and WA_L is the write address for M_L. In a similar manner, RA_P1, RA_P2, WA_P1, and WA_P2 are the read and write addresses for M_P1 and M_P2, respectively. Note that both M_P1 and M_P2 take WA_P as their common write address. c[0] is the least significant bit (LSB) of the current iteration count c; it is 1 if the current iteration is odd-numbered and 0 if it is even-numbered. Table 1 summarizes these meanings for quick reference.
Recall that M_L has J entries to store l_u(π_u(j)) for j = 0, 1, ..., J - 1, and each of M_P1 and M_P2 has J/S entries to hold the PSs of d_u(p_u(j)) for p_u(j) = 0, 1, ..., J/S - 1. While p_u(j) is the only addressing scheme for the J/S entries of M_P1 and M_P2, two different options are available for accessing the J entries of M_L. One is to use sequential addresses (SAs), {j}, and the other is to adopt interleaved addresses (IAs), {π_u(j)}. Nevertheless, only the former has been presented in the literature [5], and it has never been compared with the latter.
Figure 3 sketches the existing AGU using SAs [5]. The output of the counter at the bottom, which is a cyclic sequence of the elements in {j} = {0, 1, ..., J - 1}, is readily used as RA_L. The output of the other counter at the top, which precedes the bottom one by E - 1 cycles, plays the role of n_WA_L. Here, n_WA_L is the next write address for M_L that precedes WA_L by one cycle, and E is the latency of the ESE. R_J denotes a D-type register holding log2 J bits, where the subscript J represents the argument of the logarithm. Both RA_L and WA_L are log2 J bits long so as to address all J entries in M_L. WA_L is generated by an R_J that delays n_WA_L by one cycle. Given i as input, interleaver_u produces π_u(i), and a division-by-S-and-floor unit (DFU) calculates floor(i/S). Putting it all together, given RA_L = j as input, the cascade of interleaver_u and a DFU bounded by dotted lines derives RA_P = floor(π_u(j)/S) = p_u(j). RA_P serves as one input of each multiplexer. By the other set of interleaver_u and the following DFU, n_WA_L is transformed into n_WA_P, which feeds the remaining input of each multiplexer. According to c[0], RA_P1 and RA_P2 alternate between n_WA_P and RA_P. This implements the aforementioned role exchange of M_P1 and M_P2, i.e., one retrieves d_u(p_u(j)) to compute e_u(π_u(j)) in the response phase, while the other prefetches a PS to accumulate l_u(π_u(j)) during the reception phase. A series of R_J and a DFU that follows the upper interleaver_u produces WA_P.
On the other hand, Figure 4 depicts another possible AGU that uses IAs. Unlike the SA-based AGU (S-AGU), RA_L is the output of interleaver_u that shuffles a cyclic sequence {j} from a counter, i.e., {π_u(j)}. n_WA_L is also taken from the output of the other interleaver_u. The two counters are out of phase by E - 1 cycles, as they are in Figure 3. Since RA_L is already an IA, a DFU is the only remaining stage ahead of RA_P. Similarly, n_WA_P is produced by a DFU that takes n_WA_L as input. n_WA_P and RA_P are connected to the multiplexers.
Both AGUs in Figures 3 and 4 contain two counters, two interleavers, and two multiplexers. Excluding these common components, the remaining ones are colored grey for emphasis. The IA-based AGU (I-AGU) has one fewer R_J than the S-AGU. On the other hand, since the SAs in {j} are independent of u, unlike the IAs in {π_u(j)}, the counters and the R_J that generate RA_L and WA_L can be shared among all UPB_u for u = 1, 2, ..., U, which is an advantage of the S-AGU. Another noteworthy merit of the S-AGU is that it may employ the simplified memory subsystem in Ref. [7]. More specifically, M_L is usually implemented with a dual-port memory to accommodate two requests per cycle. When M_L is accessed by SAs, however, a pair of adjacent requests can be integrated into one, and the number of memory accesses per cycle is reduced from two to one. Then, M_L can be implemented with a single-port memory instead of a dual-port one, reducing the hardware complexity significantly.
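The difference between the two addressing modes can be illustrated with a behavioral sketch of the read-address paths. The function names are hypothetical, the E - 1 cycle offset between the two counters is not modeled, and the multiplexer polarity follows the earlier example in which M_P1 supplies despread LLRs in even-numbered iterations.

```python
def s_agu_read_addresses(pi, S):
    """S-AGU: RA_L is the counter output j; RA_P = floor(pi(j) / S)."""
    return [(j, pi[j] // S) for j in range(len(pi))]


def i_agu_read_addresses(pi, S):
    """I-AGU: RA_L is the interleaver output pi(j); RA_P = floor(RA_L / S)."""
    return [(pi[j], pi[j] // S) for j in range(len(pi))]


def select_partial_sum_addresses(c0, ra_p, n_wa_p):
    """Multiplexers steered by c[0]; returns (RA_P1, RA_P2).

    Even iteration (c0 == 0): M_P1 is read with RA_P while M_P2 is
    prefetched with n_WA_P; odd iteration: the roles are swapped.
    """
    return (ra_p, n_wa_p) if c0 == 0 else (n_wa_p, ra_p)


pi = [2, 3, 1, 0]   # example interleaving pattern, J = 4
S = 2

s_addr = s_agu_read_addresses(pi, S)
i_addr = i_agu_read_addresses(pi, S)
print(s_addr)
print(i_addr)
```

In both modes the RA_P sequence is identical; only the address presented to M_L differs, and in the S-AGU it does not depend on the user's interleaving pattern, which is why that part can be shared among the UPBs.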

DFU-Reduced Architecture
A DFU includes a division that incurs a significant hardware burden. It is therefore important to minimize the number of DFUs. To this end, we now manipulate the AGUs as follows. The top right part of Figure 3 is redrawn in step 1 of Figure 5. We exchange the locations of R_J and the following DFU as illustrated in step 2, as such a change does not affect the functionality at all. Then, instead of using two separate DFUs, the output of one DFU can be shared as shown in step 3, mitigating the complexity. Besides, R_J is reduced to R_J/S, which is a register holding log2(J/S) < log2 J bits. Note that each of RA_P1, RA_P2, and WA_P is log2(J/S) bits long so as to address the J/S entries in M_P1 and M_P2. Therefore, the relocation results in not only the removal of a DFU but also the reduction of the bit-width. The entire architecture of the DFU-reduced S-AGU is illustrated in Figure 6.
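The claim that swapping the register and the DFU leaves the outputs unchanged can be checked with a cycle-level sketch. The helper names `dfu` and `register` and the input stream are hypothetical, and the register is assumed to reset to zero.

```python
def dfu(seq, S):
    """Division-by-S-and-floor unit applied to every cycle of a stream."""
    return [x // S for x in seq]


def register(seq, reset=0):
    """One-cycle delay: the output at cycle t is the input at cycle t - 1."""
    seq = list(seq)
    return [reset] + seq[:-1]


S = 2
interleaved = [2, 3, 1, 0, 2, 3, 1, 0]   # hypothetical upper-interleaver output

# Step 1: two DFUs, the register sits in front of the second one.
n_wa_p_before = dfu(interleaved, S)
wa_p_before = dfu(register(interleaved), S)

# Step 3: one shared DFU, the register is moved behind it.
n_wa_p_after = dfu(interleaved, S)
wa_p_after = register(n_wa_p_after)

# Register widths for J = 8192, S = 16 (parameters quoted in the evaluation).
J_eval, S_eval = 8192, 16
bits_R_J = (J_eval - 1).bit_length()                # log2 J
bits_R_JS = (J_eval // S_eval - 1).bit_length()     # log2 (J/S)
print(wa_p_before, wa_p_after, bits_R_J, bits_R_JS)
```

Since the DFU is a memoryless function of its input, delaying before or after it yields the same stream, and the moved register only has to store the narrower quotient.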
Electronics 2020, 9, x FOR PEER REVIEW
A similar approach can be taken to the I-AGU as well. Step 1 of Figure 7 redraws the top right part of Figure 4. Swapping R_J and the following DFU, we obtain step 2 of Figure 7. Subsequently, we can share the output of the sole DFU and substitute R_J with R_J/S, as depicted in step 3. However, note that the output of the DFU is now n_WA_P, from which WA_L cannot be retrieved. To secure the indispensable WA_L from n_WA_L, we need to add an R_J as illustrated in step 4. Unlike the original I-AGU, which is of the lowest complexity, the DFU-reduced I-AGU in Figure 8 requires the additional register to hold n_WA_P for one cycle. Thus, while one DFU is eliminated, one register is appended. In short, the S-AGU benefits more from the relocation technique than the I-AGU.
Figure 7. Step-by-step illustration of removing DFU from I-AGU.
It is worth noting that the complexity of each DFU can be somewhat relieved by confining S to be a power of two, as the division by a power of two can be easily achieved by right-shift operations. In exchange for this benefit, however, the restriction narrows the range of applicability.
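The power-of-two shortcut can be illustrated as follows. This is a minimal sketch with hypothetical function names; J = 8192 and S = 16 are the parameters quoted in the evaluation.

```python
def dfu_generic(i, S):
    """General DFU: floor division by an arbitrary spreading factor S."""
    return i // S


def dfu_pow2(i, log2_S):
    """DFU for S = 2**log2_S: a right shift replaces the divider."""
    return i >> log2_S


J, S = 8192, 16
log2_S = S.bit_length() - 1
assert S == 1 << log2_S   # the shortcut is valid only when S is a power of two

matches = all(dfu_generic(i, S) == dfu_pow2(i, log2_S) for i in range(J))
print(log2_S, matches)
```

In hardware, a shift by a constant amounts to discarding the lower log2 S address bits, so the DFU needs no arithmetic logic in this special case.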

Evaluation and Discussion
For all the AGU architectures in Figures 3, 4, 6, and 8, Table 2 summarizes the numbers of DFUs and register bits in a U-user detector. The feasibility of a single-port M_L is also tabulated. Counting the DFUs is straightforward, as the total is the number of DFUs per AGU multiplied by the number of AGUs in a detector, U. In contrast, register bits should be counted while taking into account the following: the R_J that produces WA_L in S-AGUs can be shared among the U UPBs, as the SA-based WA_L is identical in all UPBs. In the case of the original S-AGU, the number of R_J preceding the DFU is U, whereas the number of R_J producing WA_L is 1, making the total number of register bits (U + 1) log2 J. In the case of the DFU-reduced S-AGU, the numbers of R_J/S and R_J are U and 1, respectively, making the total number of register bits U log2(J/S) + log2 J. All the R_J's in I-AGUs take different inputs and cannot be shared.
For the common and practical set of parameters used in Refs. [5-8], e.g., U = 16, J = 8192, and S = 16, Table 3 enumerates the numbers of DFUs and register bits. The percentages in parentheses are calculated with respect to the original S-AGU. The DFU-reduced S-AGU includes the fewest DFUs and register bits. In particular, it requires 33% fewer DFUs and 29% fewer register bits than the original S-AGU, highlighting the benefits of the proposed relocation. On top of that, it may adopt a single-port M_L, making it promising in every aspect of hardware complexity.
In addition to the DFUs and the register bits, the AGUs include other components that contribute to the overall hardware complexity: algebraic interleavers [9,12], counters, and multiplexers. Besides, identical logic can be synthesized differently as fan-in, fan-out, and gate sizing in the circuitry vary. To evaluate the architectures more thoroughly by taking such factors into account, they were implemented in a 65 nm CMOS process using a 300 MHz clock and a 1.2 V supply. The corresponding results are summarized in Table 4. Equivalent gates were counted by regarding a two-input NAND gate as one. Power consumption was measured by back-annotating switching activities. The percentages in parentheses are again calculated with respect to the original S-AGU. The original S-AGU and I-AGU have almost the same equivalent gate counts and power dissipation. In contrast, the DFU-reduced S-AGU integrates fewer gates and consumes less power than the DFU-reduced I-AGU, as the latter demands the additional registers in order to obtain WA_L. Comparing the original and the DFU-reduced pairs, the DFU-reduced ones dissipate much lower power than their counterparts, owing to the removal of computationally intensive DFUs. In particular, the DFU-reduced S-AGU can be realized with 13% fewer gates and consumes 31% less power than the original S-AGU.
Table 4. Implementation results (columns: Architecture, Equivalent Gate Count, Power Consumption).
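The register-bit formulas given in this section, and the quoted savings of the DFU-reduced S-AGU, can be reproduced with a short calculation. The function names are hypothetical. The DFU counts of three per original AGU and two per DFU-reduced AGU are an assumption inferred from the removal of one DFU and the stated 33% reduction, not values read from the tables.

```python
def log2_int(x):
    """Exact base-2 logarithm of a power of two."""
    assert x > 0 and x & (x - 1) == 0, "expected a power of two"
    return x.bit_length() - 1


def register_bits_original_s_agu(U, J):
    """(U + 1) * log2 J: U per-user R_J plus one shared R_J for WA_L."""
    return (U + 1) * log2_int(J)


def register_bits_reduced_s_agu(U, J, S):
    """U * log2(J/S) + log2 J: U per-user R_J/S plus one shared R_J."""
    return U * log2_int(J // S) + log2_int(J)


U, J, S = 16, 8192, 16

bits_orig = register_bits_original_s_agu(U, J)
bits_red = register_bits_reduced_s_agu(U, J, S)
bit_saving_pct = round(100 * (1 - bits_red / bits_orig))

# Assumed DFU counts per AGU (inferred, see the note above).
dfus_orig = 3 * U
dfus_red = 2 * U
dfu_saving_pct = round(100 * (1 - dfus_red / dfus_orig))

print(bits_orig, bits_red, bit_saving_pct, dfus_orig, dfus_red, dfu_saving_pct)
```

With these parameters the register-bit count drops from 221 to 157, which matches the 29% figure quoted above.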

Conclusions
We have analyzed four kinds of AGUs and found that the proposed DFU-reduced S-AGU is the most efficient one from the viewpoint of hardware implementation. Implementation results in a 65 nm CMOS process have demonstrated that the equivalent gate count and the power dissipation can be reduced by 13% and 31%, respectively. The reduction in DFUs and register bits has been achieved by exchanging the locations of a DFU and a register, and by sharing the output of one DFU rather than employing multiple DFUs. As such a modification is completely transparent, i.e., it does not affect the functionality, it can be readily applied to all the existing multiuser detectors without tailoring the interfacing components therein. Future work includes extending the proposed AGU to parallel IDMA architectures [6,8] that adopt different interleaving patterns. In addition, the algebraic interleavers in AGUs will be investigated for a further reduction of the complexity, as they include several multipliers to be optimized.