Probabilistic Shaping for Finite Blocklengths: Distribution Matching and Sphere Shaping

In this paper, we provide a systematic comparison of distribution matching (DM) and sphere shaping (SpSh) algorithms for short blocklength probabilistic amplitude shaping. For asymptotically large blocklengths, constant composition distribution matching (CCDM) is known to generate the target capacity-achieving distribution. However, as the blocklength decreases, the resulting rate loss diminishes the efficiency of CCDM. We claim that for such short blocklengths over the additive white Gaussian noise (AWGN) channel, the objective of shaping should be reformulated as obtaining the most energy-efficient signal space for a given rate (rather than matching distributions). In light of this interpretation, multiset-partition DM (MPDM) and SpSh are reviewed as energy-efficient shaping techniques. Numerical results show that both have smaller rate losses than CCDM. SpSh—whose sole objective is to maximize the energy efficiency—is shown to have the minimum rate loss amongst all, which is particularly apparent for ultra short blocklengths. We provide simulation results of the end-to-end decoding performance showing that up to 1 dB improvement in power efficiency over uniform signaling can be obtained with MPDM and SpSh at blocklengths around 200. Finally, we present a discussion on the complexity of these algorithms from the perspectives of latency, storage and computations.

As the modulation order increases, the maximum rate that can be achieved with uniform signaling starts to suffer from a loss with respect to the channel capacity. As an example, the maximum achievable information rate (AIR) for multilevel coding (MLC) in combination with multi-stage decoding (MSD) [2] is the mutual information (MI) of the channel input and output. If a uniform signaling strategy is employed with MLC-MSD, the MI is bounded away from capacity. This gap is called the shaping gap and is up to 0.255 bits per real channel use (bit/1-D) for the additive white Gaussian noise (AWGN) channel. When translated into an increase in the signal-to-noise ratio (SNR) required to obtain a certain MI, this so-called ultimate shaping gap corresponds to a 1.53 dB loss in energy efficiency [9]. The well-known 1.53 dB is an asymptotic result for the AWGN channel and is only relevant for coded modulation (CM) systems where the maximum AIR is the MI, and when the number of channel uses n as well as the modulation order approach infinity.
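As a quick consistency check of the two figures quoted above, the ultimate shaping gain can be written as the ratio of the normalized second moment of an n-cube (1/12) to that of an n-sphere as n grows large (1/(2πe)). The symbol γ_s below is introduced here only for illustration and is not used elsewhere in the paper.

```latex
\gamma_s = \frac{1/12}{1/(2\pi e)} = \frac{\pi e}{6} \approx 1.4233,
\qquad
10 \log_{10} \gamma_s \approx 1.53~\text{dB},
\qquad
\tfrac{1}{2} \log_2 \gamma_s \approx 0.2546~\text{bit/1-D}.
```

Both numbers are therefore two views of the same asymptotic quantity: one expressed as an energy penalty at a fixed rate, the other as a rate penalty at a fixed (high) SNR.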
There exist numerous techniques in the literature, most of them proposed in the late 1980s and early 1990s, that attempt to close the shaping gap. Motivated by the fact that the capacity-achieving distribution for the AWGN channel is Gaussian, these techniques fundamentally aim at one of the following. The first approach is to construct a signal constellation with a Gaussian-like geometry, which is called geometric shaping (GS) [10]-[17]. The other approach is to induce a Gaussian-like distribution over the signal structure, which is called probabilistic shaping (PS) [18]-[22]. PS techniques can be further classified into two subgroups using the terminology introduced by Calderbank and Ozarow in [18]. The direct approach is to start with a target distribution (which is typically close to the capacity-achieving distribution) on a low-dimensional signal structure and have an algorithm try to obtain it [18], [20]. Following recent literature [23], the direct approach can also be called distribution matching (DM). The indirect approach is to start with a target rate and bound the multi-dimensional signal structure by a sphere, which we call sphere shaping (SpSh) [21], [22]. Here, a Gaussian distribution is induced indirectly (when n → ∞) as a by-product. Finally, there exist some hybrid shaping approaches in which GS and PS are combined [24]-[26]. We refer to [27, Sec. 4.5] for a detailed discussion on GS, and to [27, Ch. 4] and [28, Sec. II] on PS.
GS, PS, and hybrid shaping are shown on the top layer of Fig. 1, where the taxonomy of constellation shaping (as discussed in the current paper) is illustrated. We call this first layer the shaping approach. On the second layer, which we call the shaping method, PS is split into two following the Calderbank/Ozarow terminology [18].
In the context of bit-interleaved coded modulation (BICM), signal shaping techniques again attracted a considerable amount of attention in the 2000s. GS was investigated for BICM in [29]-[31], and PS was studied in [32]-[35]. An iterative demapping and decoding architecture with PS was proposed in [36]. The achievability of the so-called generalized MI (GMI) was shown for independent but arbitrarily distributed bit-levels in [37]. In [38], it was demonstrated that the GMI is a nonconvex function of the input bit distribution, i.e., the problem of computing the input distribution that maximizes GMI is nonconvex. An efficient numerical algorithm to compute optimal input distributions in BICM was introduced in [39]. The effect of mismatched shaping, i.e., not using the true symbol probabilities or reference constellation at the receiver, was examined in [40]. The achievable rates, error exponents and error probability of BICM with PS were analyzed in [41]. Signal shaping was investigated for BICM at low SNR in [42]. PS in BICM was considered for Rayleigh fading channels in [43], [44].
Recently, probabilistic amplitude shaping (PAS) has been proposed to provide low-complexity integration of shaping into existing binary forward error correction (FEC) systems with bit-metric decoding (BMD) [28]. PAS uses a reverse concatenation strategy where the shaping operation precedes FEC coding, as shown in Fig. 2 (left). This construction was first examined for constrained coding problems [45]. A corresponding soft-decision decoding approach for this structure was studied in [46]. PAS can be considered as an instance of the Bliss architecture [45] where a shaping code is used in the outer layer, and parity symbols are then added in the inner layer. The main advantage of this structure is that amplitude shaping can be added to existing CM systems as an outer code. In addition to closing the shaping gap, PAS moves the rate adaptation functionality to the shaping layer. This means that, instead of using many FEC codes of different rates to obtain a granular set of transmission rates, the rate is adjusted by the amplitude shaper with a fixed FEC code. Owing to these advantages, PAS has attracted a lot of attention. PAS has been combined with low-density parity-check (LDPC) codes [28], polar codes [47] and convolutional codes [48]. Its performance has been evaluated over AWGN channels [28], optical channels [49], [50], wireless channels [48] and parallel channels with channel state information available at the transmitter [51].
The key building blocks of the PAS framework are the amplitude shaper and deshaper, i.e., the purple boxes in Fig. 2 (left). The function of the amplitude shaper is to map uniform binary sequences to shaped amplitude sequences in an invertible manner. A careful selection of the set of sequences that the shaper can output, with the aim of matching a target distribution (direct approach) or constructing an energy-efficient signal space (indirect approach), improves the overall performance. We call the way this selection is accomplished the shaping architecture, which affects the performance of PAS. On the other hand, the actual implementation of this architecture is called here the shaping algorithm, and it determines the complexity of attaining this performance. The third and fifth layers of Fig. 1 illustrate shaping architectures and algorithms, respectively. The difference between the shaping architecture and the underlying algorithm is discussed in detail in Sec. II-D.
For the initial proposal of PAS [28], constant composition distribution matching (CCDM) was employed as the shaping architecture [52]. The basic principle of CCDM is to utilize amplitude sequences having a fixed empirical distribution that is information-theoretically close to the target distribution. To this end, a constant composition constraint is put on the output sequences such that all have the same amplitude composition. To realize such a mapping, arithmetic coding (AC) is used in a way similar to [53]. Although CCDM has vanishing rate loss for asymptotically large blocklengths [52], it has two fundamental drawbacks that prohibit its use for finite blocklengths. First, as recently shown in [54] and [55, Fig. 4], CCDM suffers from high rate losses as the blocklength decreases. Second, CCDM is implemented based on AC, which is strictly sequential in the input length [53], [56, Ch. 5].
To replace CCDM in the short-to-moderate blocklength regime and to provide more hardware-friendly implementations, improved techniques have been devised. The most prominent DM examples other than CCDM include multiset-partition DM (MPDM) [55] and product DM (PDM) [51], [57]. Briefly stated, MPDM uses different compositions and expands the set of output sequences to achieve smaller rate losses than CCDM. With the same objective, PDM internally uses multiple binary matchers to generate the desired distribution as a product distribution. In [58], a parallel-amplitude (PA) architecture is proposed for DM to enable an even higher degree of parallelization. Also in [58], subset ranking (SR) is introduced as an alternative to the conventional AC method for binary-output CCDM. As for indirect shaping methods, enumerative sphere shaping (ESS) and shell mapping (SM) are notable SpSh algorithms, which were initially proposed in [21] and [59], respectively. ESS has recently been considered in the PAS framework [48], [60]-[63], as has SM in [64]. Furthermore, low-complexity implementation ideas for both of these algorithms have been presented in [65].
The fourth layer in Fig. 1, which we call transformation for DM and ordering for SpSh, designates the way a shaping algorithm formulates a solution to the problem defined by the shaping architecture. As an example, CCDM considers sequences having the same composition [52]. By realizing a binary-to-nonbinary transformation with AC [52], [53], CCDM can directly be used to produce amplitude sequences. On the other hand, separate binary-to-binary transformations can be employed for different bit-levels using AC [53] or SR [58]. Then these bit-levels can be combined such that the corresponding channel input distribution is close to the capacity-achieving distribution [51], [57]. As another example, SpSh considers amplitude sequences in a sphere. ESS orders these sequences lexicographically [21], while SM and [22, Algorithm 1] order them based on their energy.
Other shaping schemes have been proposed that are briefly listed in the following. A detailed analysis of them is outside the scope of this manuscript. The concept of a "mark ratio controller" was proposed for low-complexity implementation of bit-level DM (BL-DM) in [66], [67]. In the streaming DM of [68] and the prefix-free code distribution matching with framing of [69], [70], switching is performed between two (or more) variable-length shaping codes such that the output is always of fixed length. In [71], a "multi-composition" idea similar to [55] was applied to BL-DM. The authors of [72] provided a finite-precision implementation for AC-CCDM. In [73], a shaper based on ESS was introduced to shape a subset of the amplitude bit-levels, which is referred to as partial ESS (P-ESS). The authors of [74] introduced the "hierarchical" DM, which realizes a nonuniform distribution with hierarchical LUTs. An approximate sphere shaping implementation based on Huffman codes was proposed in [75].
In this work, we examine DM and SpSh methods. The contributions of this paper are threefold. First, this paper is, to the best of our knowledge, the first study where a systematic comparison of different PS architectures is provided. Second, using rate loss as well as AIRs for finite-length shapers as the performance metrics, we claim that shaping strategies which aim to construct energy-efficient signal sets are more effective than the techniques which focus on matching distributions for the AWGN channel. For the analyzed schemes, this means that MPDM and SpSh are more efficient for short blocklengths than CCDM, whose sole objective is to obtain the capacity-achieving distribution. Our claim is then verified via frame error rates (FERs) that are obtained in end-to-end decoding simulations of the PAS system employing long and short systematic LDPC codes from [76] and [77], respectively. The improvements in power efficiency that we obtained during end-to-end decoding simulations are consistent with the predictions made by finite-length AIRs. The third contribution of this paper is to provide a discussion on the required storage, computational complexity, and latency of different DM and SpSh algorithms.
The paper is organized as follows. The first part is tutorial-like. In Sec. II, background information on uniform and shaped signaling schemes, and amplitude shaping is provided. Section IV reviews DM and SpSh schemes from shaping architecture and algorithmic implementation perspectives. The second part of the paper is reserved for the comparison of four amplitude shaping architectures. Rate losses, AIRs, and end-to-end decoding performance of PAS are studied in Sec. V. Section VI is devoted to a high-level discussion on latency and complexity of the schemes under consideration. Finally, conclusions are given in Sec. VII.

A. Notation and Definitions
We use capital letters $X$ to denote random variables and lowercase letters $x$ to specify their realizations. Random vectors of length $n$ are indicated by $X^n$, while their realizations are denoted by $x^n$. Element-wise multiplication of $x^n$ and $y^n$ is shown by $x^n y^n$. Calligraphic letters $\mathcal{X}$ represent sets. The Cartesian product of $\mathcal{X}$ and $\mathcal{Y}$ is indicated as $\mathcal{X} \times \mathcal{Y}$, while $\mathcal{X}^n$ is the $n$-fold Cartesian product of $\mathcal{X}$ with itself. Boldface capital letters $\mathbf{P}$ specify matrices. Probability distributions over $\mathcal{X}$ are denoted by $P_X(x)$. The probability density function (PDF) of $Y$ conditioned on $X$ is indicated by $f_{Y|X}(y|x)$.
The discrete-time AWGN channel output at time $i = 1, 2, \ldots, n$ is given by
$$Y_i = X_i + Z_i, \tag{1}$$
where $Z_i$ is the noise, which is independent of the input $X_i$ and drawn from a zero-mean Gaussian distribution with variance $\sigma^2$. There is an average power constraint $\mathbb{E}[X^2] \le P$, and the SNR is defined as $\mathrm{SNR} = P/\sigma^2$. The capacity of the AWGN channel is $C = \frac{1}{2} \log_2(1 + \mathrm{SNR})$ in bit/1-D. This capacity can be achieved as $n \to \infty$ by employing a codebook (set of input sequences) in which all the codewords (input sequences) are generated with entries independent and identically distributed according to a zero-mean Gaussian with variance $P$ [78, Ch. 9]. The corresponding random coding argument shows that channel input sequences drawn from a Gaussian distribution are likely to lie in an $n$-sphere of squared radius $nP(1 + \varepsilon)$ for any $\varepsilon > 0$ when $n \to \infty$. This motivates selecting the signal points from within an $n$-sphere, or equivalently using an $n$-sphere as the signal space boundary, in order to achieve capacity. For a more detailed discussion on the asymptotic duality of Gaussian distributions and $n$-spherical signal spaces for large $n$, we refer the reader to, e.g., [9, Sec. IV-B].

B. Discrete Constellations and Amplitude Shaping
We consider $2^m$-ary amplitude-shift keying (ASK) alphabets $\mathcal{X} = \{\pm 1, \pm 3, \ldots, \pm(2^m - 1)\}$, which can be factorized as $\mathcal{X} = \mathcal{S} \times \mathcal{A}$. Here $\mathcal{S} = \{\pm 1\}$ and $\mathcal{A} = \{1, 3, \ldots, 2^m - 1\}$ are the sign and amplitude alphabets, respectively. The cardinality of the amplitude alphabet is $n_a = |\mathcal{A}| = 2^{m-1}$. Motivated by the fact that the capacity-achieving distribution for the AWGN channel is symmetric around the origin, we restrict our attention to the amplitude distribution $P_A(a)$, and assume that the sign distribution $P_S(s)$ is uniform and independent of the amplitudes. The distribution of the channel input $X = SA$ is then $P_X(x) = P_S(s) P_A(a)$.
The distribution that maximizes the MI for ASK constellations subject to an average power constraint does not have a known analytical form. Maxwell-Boltzmann (MB) distributions $P_A(a) = K(\lambda) e^{-\lambda a^2}$ for $a \in \mathcal{A}$ are used for shaping amplitudes, e.g., in [20], [28], since they are the discrete-domain counterpart of the Gaussian distribution and maximize the entropy for a given average energy [78]. Furthermore, as shown in [79, Table 5.1], the difference in MI between the MB distribution and the capacity-achieving distribution is insignificant for ASK constellations. For MB distributions, $\lambda$ determines the variance of the distribution while $K(\lambda)$ normalizes it.
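To make the role of λ concrete, the following minimal sketch evaluates the MB distribution over the amplitude alphabet of 8-ASK and reports its entropy and average energy. The function names and the λ values are chosen here for illustration only and are not taken from the paper.

```python
import math


def maxwell_boltzmann(amplitudes, lam):
    """Return the MB PMF P_A(a) = K(lam) * exp(-lam * a^2) over the given amplitudes."""
    weights = [math.exp(-lam * a * a) for a in amplitudes]
    norm = sum(weights)  # this is 1 / K(lam)
    return [w / norm for w in weights]


def entropy_bits(pmf):
    """Entropy H(A) in bits."""
    return -sum(p * math.log2(p) for p in pmf if p > 0)


def average_energy(amplitudes, pmf):
    """Average energy per symbol, sum of P_A(a) * a^2."""
    return sum(p * a * a for a, p in zip(amplitudes, pmf))


amplitudes = [1, 3, 5, 7]  # amplitude alphabet of 8-ASK (m = 3)
for lam in (0.0, 0.02, 0.05):  # illustrative values of lambda (assumption)
    pmf = maxwell_boltzmann(amplitudes, lam)
    print(lam, [round(p, 4) for p in pmf],
          round(entropy_bits(pmf), 4), round(average_energy(amplitudes, pmf), 3))
```

With λ = 0 the amplitudes are uniform; increasing λ lowers both the entropy and the average energy, which is the trade-off that the shaper exploits.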
In a dual manner, SpSh is also employed for amplitude shaping in the discrete domain [21], [22]. In [54], it is shown that when an $n$-spherical region of $\mathcal{X}^n$ is used as the signal space, the distribution induced on $\mathcal{A}$ approaches an MB distribution as $n \to \infty$. The authors of [64] showed that at finite $n$, SpSh minimizes the informational divergence between the induced distribution and an MB distribution.
To employ high-order modulation formats such as $2^m$-ASK for $m \ge 2$, a binary labeling strategy is necessary. A discussion on binary labeling can be found in [8, Sec. 2]. We assume that the binary label $B_1 B_2 \cdots B_m$ of a channel input $X$ can be decomposed into a sign bit $B_1$ and amplitude bits $B_2 B_3 \cdots B_m$.
Example 1 (Binary labeling). The binary reflected Gray code (BRGC) is tabulated for 8-ASK in Fig. 2 (right). Here, $B_1$ is symmetric around zero. Furthermore, when $X$ has a distribution which is symmetric around zero, $B_1$ is uniform and stochastically independent of $B_2$ and $B_3$. In this paper, we assume that the BRGC is used for labeling.

C. Fundamentals of Amplitude Shaping Schemes
The amplitude shaper is a block that maps $k$-bit uniform sequences to $n$-amplitude shaped sequences in an invertible manner. The tasks of this block are (i) to create a shaping codebook $\mathcal{A}^{\bullet} \subseteq \mathcal{A}^n$, and (ii) to realize a shaping encoder to index these sequences. The former task is related to the properties of the desired set $\mathcal{A}^{\bullet}$, while the latter deals with the algorithmic implementation of the mapping. This difference is discussed in detail in Sec. II-D. In the remainder of this section, we introduce the concepts and parameters that are associated with the shaping techniques that will be investigated in this paper.
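The energy of a sequence $x^n = (x_1, x_2, \ldots, x_n)$ is $e(x^n) = \sum_{i=1}^{n} x_i^2$. When $n$-sequences are represented as points in an $n$-dimensional ($n$-D) space, the set
$$\mathcal{A}^{\bullet} = \left\{ x^n \in \mathcal{A}^n : e(x^n) \le E^{\bullet} \right\} \tag{2}$$
consists of all amplitude sequences located in or on the surface of the $n$-sphere of squared radius $E^{\bullet}$. The zero-energy point is at the center of this sphere.
The composition of a sequence $x^n$ is $C = [n_1, n_2, \ldots, n_{n_a}]$, where $n_j$ denotes the number of times the $j$th element $a_j$ of $\mathcal{A}$ occurs in $x^n$, i.e.,
$$n_j = \sum_{i=1}^{n} \mathbb{1}[x_i = a_j] \tag{3}$$
for $j = 1, 2, \ldots, n_a$. Here $\mathbb{1}[\cdot]$ is the indicator function, which is 1 when its argument is true and 0 otherwise. The number of unique $n$-sequences with the same composition $C$ is given by the multinomial coefficient
$$\mathrm{MC}(C) = \frac{n!}{n_1! \, n_2! \cdots n_{n_a}!}. \tag{4}$$
For a set $\mathcal{A}^{\bullet}$ of amplitude sequences with $P_A(a)$ induced on $\mathcal{A}$, the average energy per symbol is
$$E = \sum_{a \in \mathcal{A}} P_A(a) \, a^2. \tag{5}$$
The shaping rate of the set $\mathcal{A}^{\bullet}$ is defined as
$$R_s = \frac{\log_2 |\mathcal{A}^{\bullet}|}{n} \tag{6}$$
in bit/1-D. The input blocklength of a shaping algorithm that indexes sequences from the shaping set $\mathcal{A}^{\bullet}$ is
$$k = \left\lfloor \log_2 |\mathcal{A}^{\bullet}| \right\rfloor \tag{7}$$
in bits. It can be shown that the parameters of a shaping code $\mathcal{A}^{\bullet}$ satisfy the inequality
$$H(A) \overset{(a)}{\ge} R_s \overset{(b)}{\ge} \frac{k}{n}, \tag{8}$$
where (a) is due to the finite blocklength $n$ and (b) is due to the binary-input nature of the shaping algorithm, i.e., the rounding in (7). Here $H(A)$ is the entropy of $P_A$ in bits. In (8), both (a) and (b) are satisfied with equality when $n \to \infty$.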
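The quantities in (3), (4), and (6)-(8) can be evaluated exactly with integer arithmetic. The sketch below does so for a constant-composition set, using the composition [95, 69, 37, 15] with n = 216 that appears later in Example 5; the function names are hypothetical and chosen only for this illustration.

```python
import math
from collections import Counter


def composition(seq, alphabet):
    """Composition C = [n_1, ..., n_na] of a sequence, as in (3)."""
    counts = Counter(seq)
    return [counts[a] for a in alphabet]


def multinomial(comp):
    """Number of distinct sequences sharing the composition comp, as in (4)."""
    result = math.factorial(sum(comp))
    for n_j in comp:
        result //= math.factorial(n_j)  # each intermediate quotient is an integer
    return result


def constant_composition_parameters(comp):
    """Return (H(A), R_s, k, n) for the set of all sequences with composition comp."""
    n = sum(comp)
    size = multinomial(comp)
    entropy = -sum((c / n) * math.log2(c / n) for c in comp if c > 0)
    shaping_rate = math.log2(size) / n  # (6)
    k = size.bit_length() - 1           # exact floor(log2(size)), (7)
    return entropy, shaping_rate, k, n


entropy, shaping_rate, k, n = constant_composition_parameters([95, 69, 37, 15])
print(n, k, round(entropy, 4), round(shaping_rate, 4), round(k / n, 4))
```

Using `bit_length` rather than a floating-point logarithm keeps the rounding in (7) exact even when the set size has hundreds of bits.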
The rate loss of a shaping set $\mathcal{A}^{\bullet}$ with induced distribution $P_A(a)$ can then be defined in bit/1-D as
$$R_{\text{loss}} = H(A) - \frac{k}{n}. \tag{9}$$

D. Shaping Architecture vs. Shaping Algorithm
The aforementioned shaping schemes have in common that they aim at solving an indexing problem, in which the binary input at the mapper determines an output sequence. At the receiver side, the inverse operation is carried out. There are many different approaches to this indexing problem, and for proper characterization and categorization, it is insightful to differentiate between architecture and algorithm.
When we speak of the architecture, we mean the underlying principle behind the mapping operation, which in turn can be realized with various different algorithms as shown in the fifth layer of Fig. 1. For instance, the CCDM principle (i.e., architecture) is that the sequences at the mapper output have a fixed number of occurrences of each amplitude, i.e., they satisfy the composition $C$. Furthermore, the mapping algorithm can operate on one nonbinary or several binary subsets of the output sequence. Bit-level [51], [57] and parallel-amplitude [58] designs are modifications to the conventional CCDM architecture that carry out such a transformation from one nonbinary to several binary DMs. Among all algorithms, a lookup table (LUT) is probably the simplest way to solve the CCDM indexing problem, yet the LUT size is prohibitively large as it reaches Gbit size already for short blocklengths [75]. The original mapping method for a nonbinary-alphabet CCDM is AC [52, Sec. IV], which is modified from [53]. For binary-output CCDM, SR has recently been proposed as a low-serialism alternative to AC. MPDM [55] extends the CCDM principle (and thus architecture) by using variable-composition DM, yet internally uses CCDM methods for mapping and demapping.
As another example, the SpSh principle (i.e., architecture) is that the sequences at the output of the shaper satisfy a maximum-energy constraint, i.e., they satisfy (2). The problem of indexing these sequences can again be solved by using a LUT. On the other hand, ESS [21], SM [22] and [22, Algorithm 1] are constructive algorithms to index sequences in a sphere. The required storage and computational complexity of these algorithms are compared in Sec. VI. For further discussion on SM, we refer the reader to [22], [80], [81, Ch. 8] and [27, Sec. 4.3].
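The maximum-energy constraint can be illustrated by counting the sequences that a sphere shaping set contains. The following sketch counts them with a simple memoized recursion over the remaining energy budget. It is not the ESS or SM algorithm of the cited references, only a counting aid under the sphere constraint; the function names and the toy parameters are assumptions made for illustration.

```python
import math
from functools import lru_cache


def count_sphere_sequences(n, amplitudes, e_max):
    """Count length-n amplitude sequences whose energy sum(a_i^2) is at most e_max."""

    @lru_cache(maxsize=None)
    def count(remaining, budget):
        if remaining == 0:
            return 1  # all positions filled within the energy budget
        # Try every amplitude that still fits, then fill the rest recursively.
        return sum(count(remaining - 1, budget - a * a)
                   for a in amplitudes if a * a <= budget)

    return count(n, e_max)


def sphere_parameters(n, amplitudes, e_max):
    """Return (set size, shaping rate in bit/1-D, input length k) of the sphere set."""
    size = count_sphere_sequences(n, amplitudes, e_max)
    return size, math.log2(size) / n, size.bit_length() - 1


# Toy example: n = 4 amplitudes from {1, 3, 5, 7} with maximum energy 28.
print(sphere_parameters(4, (1, 3, 5, 7), 28))
```

The recursion depth equals n, so this plain version is only meant for short blocklengths; a constructive shaper would additionally need an ordering of the sequences in order to index them.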
In this work, the architectures of CCDM and MPDM, and different algorithmic realizations of DM are discussed in Sec. IV-A. The architecture of SpSh and different ways of realizing it (ESS and SM) are examined in Sec. IV-B.

A. Uniform Signaling
In uniform signaling, a $k$-bit uniform sequence $u^k$ is encoded into an $n_c$-bit coded sequence $c^{n_c}$ by a rate $R_c = k/n_c$ FEC code, as shown in Fig. 3 (a). Afterwards, the coded sequence $c^{n_c}$ is divided into $m$-bit vectors, each of which is mapped to a channel input symbol via the symbol mapper. Finally, assuming that $n_c/m = n$, the sequence $x^n \in \mathcal{S}^n \times \mathcal{A}^n$ is transmitted over the channel. The transmission rate of this construction is $R = k/n$ bit/1-D. We will compare the uniform and shaped signaling techniques at the same transmission rate $R$, since this is the only fair basis for comparison, as recently discussed in [61, Sec. IV-A] and [82].

B. Probabilistic Amplitude Shaping
Böcherer et al. introduced in [28] the PAS framework, which couples an outer shaping code and an inner FEC code to realize shaped and coded modulation. Figure 3 (b) shows the basic PAS architecture where first, an amplitude shaping block maps a $k$-bit uniform information sequence $u^k$ to an $n$-amplitude sequence $a^n = (a_1, a_2, \ldots, a_n)$ in an invertible manner, where $a_j \in \mathcal{A}$ for $j = 1, 2, \ldots, n$. After this mapping block, these amplitudes are transformed into bits using the last $m-1$ bits of the employed binary labeling. We note that due to the shaped nature of $a^n$, the bits at the output of the amplitude-to-bit conversion in Fig. 3 (b) are nonuniform. These $n(m-1)$ bits are then used as the input of a systematic, rate $R_c = (m-1)/m$ FEC code which is specified by an $n(m-1)$-by-$nm$ parity-check matrix $\mathbf{P}$. The $n$-bit parity output of this code is employed as the sign bit-level, i.e., the first bit of the binary labels, to determine the sign sequence $s^n$. The channel input sequence is then the element-wise product of $s^n$ and $a^n$.
To use a higher FEC code rate $R_c > (m-1)/m$, a modified PAS architecture is proposed in [28], as shown in Fig. 3 (c). The code rate in this scheme is $R_c = (m-1+\gamma)/m$, where $\gamma = R_c m - (m-1)$ specifies the number of extra data bits that will be transmitted per symbol. In this modified structure, in addition to the $n(m-1)$-bit output of the shaper, $\gamma n$ extra information bits $\tilde{u}^{\gamma n}$ are fed to the FEC code, which is now specified by an $(m-1+\gamma)n$-by-$mn$ parity-check matrix $\mathbf{P}$. The $(1-\gamma)n$-bit parity output of the FEC code is then multiplexed with the uniform bits $\tilde{u}^{\gamma n}$ to form an $n$-bit sequence that will select the signs. The transmission rate of this scheme is
$$R = \frac{k}{n} + \gamma$$
in bit/1-D.

C. PAS Receiver
At the receiver, the log-likelihood ratio (LLR) of $B_j$ is computed by a soft demapper as
$$L_j = \log \frac{\sum_{x \in \mathcal{X}_{j,0}} P_X(x) \, f_{Y|X}(y|x)}{\sum_{x \in \mathcal{X}_{j,1}} P_X(x) \, f_{Y|X}(y|x)} \tag{10}$$
based on the channel output $Y$ for $j = 1, 2, \ldots, m$, where $\mathcal{X}_{j,u}$ denotes the set of $x \in \mathcal{X}$ which have $B_j = u$ in their binary labels for $u \in \{0, 1\}$. We emphasize that the nonuniform a-priori information on the symbols is used in (10). Instead of symbol-wise probabilities $P_X(x)$, bit-wise probabilities $P_{B_j}(b_j)$ for $j = 1, 2, \ldots, m$ can equivalently be used to compute the LLRs as in [28, eq. (60)] or [8, eqs. (3.29)-(3.32)]. Then based on the LLRs, a binary FEC decoder recovers the bits that were encoded by the FEC code. In the case of uniform signaling, these bits are the estimates of the information bits. For the PAS architecture shown in Fig. 3 (b), the output of the decoder consists of the estimates of the amplitude bits. These are then mapped back to the information bit estimates using the inverse functions of the blocks in the shaper (green box), i.e., the corresponding bit-to-amplitude mapper followed by the corresponding amplitude deshaper. In addition to this, for the PAS architecture shown in Fig. 3 (c), the decoder also outputs the estimates of the $\gamma n$ extra data bits which were used as some of the signs. According to [28], a bit-metric decoder achieves the rate $R_{\text{BMD}}$ for any input distribution $P_X(x)$, where
$$R_{\text{BMD}} = \left[ H(X) - \sum_{j=1}^{m} H(B_j \mid Y) \right]^{+}. \tag{11}$$
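The soft demapping step with nonuniform symbol priors can be sketched as follows for 8-ASK over the AWGN channel. The label table is an assumed BRGC ordering from -7 to +7 (the paper's Fig. 2 is not reproduced here), the LLR is taken as the log-ratio of the probabilities of bit value 0 over bit value 1, and the prior and noise variance are illustrative values only.

```python
import math

# Assumed BRGC labels B1 B2 B3 for 8-ASK, listed from -7 to +7 (illustrative).
POINTS = [-7, -5, -3, -1, 1, 3, 5, 7]
LABELS = ["000", "001", "011", "010", "110", "111", "101", "100"]


def bitwise_llrs(y, prior, sigma2):
    """Return [L_1, ..., L_m], where L_j = log(P(B_j = 0 | y) / P(B_j = 1 | y))."""
    # Unnormalized posteriors P_X(x) * f_{Y|X}(y|x); the common Gaussian
    # normalization constant cancels in the ratio.
    post = [p * math.exp(-(y - x) ** 2 / (2.0 * sigma2))
            for x, p in zip(POINTS, prior)]
    llrs = []
    for j in range(len(LABELS[0])):
        num = sum(w for w, label in zip(post, LABELS) if label[j] == "0")
        den = sum(w for w, label in zip(post, LABELS) if label[j] == "1")
        llrs.append(math.log(num / den))
    return llrs


# Symmetric, nonuniform prior: amplitude PMF [0.4, 0.3, 0.2, 0.1] with uniform signs.
prior = [0.05, 0.10, 0.15, 0.20, 0.20, 0.15, 0.10, 0.05]
print([round(v, 3) for v in bitwise_llrs(2.3, prior, sigma2=1.0)])
```

For very large |y| relative to the noise standard deviation, the exponentials can underflow; a practical demapper would typically work in the log domain instead.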

D. Selection of Parameters for PAS
In this section, we study the optimum shaping and FEC coding rates for PAS using AIRs. Thus we consider the case where $n \to \infty$, which implies that $k = nH(A)$ from (8), and consequently, $R = H(A) + \gamma$.
In the PAS architecture, to obtain a target rate $R = H(A) + \gamma$ using the $2^m$-ASK constellation, a total of $n(m - R)$ redundancy bits are added to a channel input sequence by shaping and coding operations combined. Shaping is responsible for $n(m - 1 - H(A))$ redundant bits whereas coding adds $n(H(A) + 1 - R)$. This is illustrated in Fig. 4 where the content of a channel input sequence produced by the generalized PAS architecture of Fig. 3 (c) is shown. The striped areas represent the information carried in signs (red), which is $\gamma n$ bits, and in amplitudes (green), which is $k = nH(A)$ bits. Dotted areas show the redundant bits in a sequence. When $\gamma = 0$, i.e., $R_c = (m-1)/m$, all signs are selected by redundancy bits and thus, the striped red area in Fig. 4 vanishes. When $H(A) = m - 1$, the amplitudes are uniformly distributed, i.e., there is no shaping, and thus, the dotted green area in Fig. 4 disappears. We note that a similar illustration was provided for a single ASK symbol in [55, Fig. 9]. In Table I, the content of a sequence at the output of a PAS transmitter (in accordance with Fig. 4) is tabulated for Example 2 where $n = 216$.
When the input is constrained to be MB-distributed, $H(X) = H(A) + 1$ can be used as a design parameter which tunes the balance between shaping and coding redundancies at a fixed rate $R$. More specifically, the entropy $H(A)$ of the MB distribution is controlled by $\lambda$. Thus by changing $\lambda$, the amount of shaping redundancy in an amplitude can be adjusted. The question is then how to choose the optimum $\lambda$. Following Wachsmann, Fischer and Huber [2], [83], we use the gap to capacity (normalized SNR), which is defined as
$$\Delta\mathrm{SNR} = 10 \log_{10} \frac{\mathrm{SNR}\big|_{R_{\text{BMD}} = R}}{\mathrm{SNR}\big|_{C = R}} \tag{12}$$
in dB, as the metric to be minimized when searching for the optimum MB distribution (see footnote 4) for a fixed rate $R$ and constellation size $2^m$. The numerator in (12) is the SNR value at which $R_{\text{BMD}} = R$ for a given $P_X$, and the denominator is the SNR value at which the capacity $C = R$, i.e., $2^{2R} - 1$. We note that instead of the MI in [2, eq. (55)], we now use the BMD rate of (11). Since coding contributes $n(H(A) + 1 - R)$ redundant bits among the $nm$ bits of a channel input sequence (see Fig. 4), the rate of the FEC code that should be employed in PAS to obtain a transmission rate $R$ for a given constellation entropy $H(X)$ is
$$R_c = \frac{m - 1 + \gamma}{m} = \frac{m + R - H(X)}{m}. \tag{13}$$
Footnote 4: In general, the gap-to-capacity curve can be plotted for any parametric family of distributions. Here we only consider the MB distributions since they have been shown to perform very close to the capacity of ASK constellations over the AWGN channel and maximize the energy efficiency [20].
Example 3 (Optimal PAS parameters). In Fig. 5, the entropy $H(X)$ of an MB-distributed input $X$ with $|\mathcal{X}| = 8$ two-sided amplitude levels (i.e., 8-ASK) vs. $\Delta\mathrm{SNR}$ is plotted for $R = 2.25$ bit/1-D. On the top horizontal axis, the corresponding FEC code rates in (13) are also shown. The rightmost point (indicated by a square) corresponds to uniform signaling where the target rate of 2.25 bit/1-D is obtained by using a FEC code of rate $R_c = R/m = 3/4$. In this trivial case, all 0.75 bits of redundancy are added by the coding operation, and the gap to capacity $\Delta\mathrm{SNR}$ is 1.04 dB. The leftmost part of the curve where $H(X)$ goes to $R$ belongs to the uncoded signaling case, i.e., $R_c = 1$, where $R$ is attained by shaping the constellation such that $H(X) = R$. Here $\Delta\mathrm{SNR}$ is infinite since without coding, reliable communication is only possible over a noiseless channel. The minimum $\Delta\mathrm{SNR}$ in Fig. 5 is obtained with $H(X) = 2.745$, which corresponds to $R_c = 0.835$ from (13). In DVB-S2 [76] and IEEE 802.11 [77], the code rate that is closest to 0.835 is $5/6 \approx 0.833$. Accordingly, the best performance is expected to be provided by FEC rate 5/6, with an SNR gain over uniform signaling of 0.83 dB according to this analysis. This will be confirmed by the numerical simulations presented in Sec. V-C.
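The numbers in Example 3 can be reproduced with the relations R = H(A) + γ, H(X) = H(A) + 1 and R_c = (m − 1 + γ)/m given in the text. The short sketch below is only an arithmetic aid; the function names are hypothetical.

```python
def pas_fec_rate(m, rate, entropy_x):
    """FEC code rate R_c = (m - 1 + gamma) / m with gamma = R - H(A), H(A) = H(X) - 1."""
    gamma = rate - (entropy_x - 1.0)  # extra data bits carried by the signs, per symbol
    if gamma < -1e-12 or gamma > 1.0 + 1e-12:
        raise ValueError("gamma must lie in [0, 1] for the PAS architecture")
    return (m - 1 + gamma) / m


def redundancy_split(m, rate, entropy_x):
    """Per-symbol redundancy added by shaping and by coding, respectively."""
    shaping = m - entropy_x    # m - 1 - H(A)
    coding = entropy_x - rate  # H(A) + 1 - R
    return shaping, coding


m, rate = 3, 2.25  # 8-ASK at R = 2.25 bit/1-D, as in Example 3
for entropy_x in (3.0, 2.745, 2.25):  # uniform, optimum from Fig. 5, uncoded
    print(entropy_x, round(pas_fec_rate(m, rate, entropy_x), 3),
          redundancy_split(m, rate, entropy_x))
```

The three entropies correspond to the rightmost point of the curve (uniform signaling), the reported optimum, and the uncoded limit; in every case the two redundancy terms sum to m − R.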

IV. DISTRIBUTION MATCHING AND SPHERE SHAPING SCHEMES
This section gives an overview of various shaping schemes that are compatible with the PAS framework. We focus on constructive methods, i.e., the direct use of a LUT for mapping or demapping is not considered herein due to its impracticality even for moderate blocklengths. Also, only fixed-length schemes are considered.

A. Distribution Matching Schemes (Direct Method)
In the following, an overview of distribution matching architectures and algorithms is given. The difference between these two aspects was discussed in Sec. II-D. All of the following schemes have in common that a certain PMF is targeted explicitly. For finite-length DM, this means that some quantization might be required to achieve an integer-valued composition. Possible quantization rules include a simple rounding operation [28, Sec. V-A2], or minimizing the Kullback-Leibler divergence [84]. We note that neither of these approaches is necessarily optimal in achieving the maximum information rate for a given $n$ and channel law.
CCDM has been proposed for PAS in [52]. We speak of constant composition if all matcher output sequences are permutations of a particular base sequence, which is typically described by the composition $C$ stating the number of occurrences of each amplitude. The number of unique output sequences of the corresponding matcher, i.e., the cardinality of the shaping set $\mathcal{A}^{\bullet} \subseteq \mathcal{A}^n$, is given by the multinomial coefficient $\mathrm{MC}(C)$, as defined in (4). Each amplitude sequence in $\mathcal{A}^{\bullet}$ has the same energy $E^{\bullet}$, and consequently, they are all located on the $n$-shell of squared radius $E^{\bullet}$ as shown in Fig. 6.
MPDM has been proposed in [55] as an extension to CCDM that lifts the constant-composition principle. MPDM is based on the idea that the target composition $C$ need not be achieved in each output sequence; rather, it is sufficient if the ensemble average over all sequences gives the target composition. Considering the example of pairwise partition in [55], this means that each composition has a complement, both with the same number of occurrences, such that their average is the target composition. There is, however, no known constructive algorithm for this variable-composition mapping problem. This is circumvented by reducing the number of unique sequences of each composition to be a power of two, which can come at the expense of some small rate loss. This additional constraint enables Huffman coding on the compositions, i.e., we can build a tree where a variable-length prefix determines the node and thus, the composition to be used. The remaining binary payload is mapped with conventional CCDM techniques. Note that the prefix and payload lengths are balanced such that the overall mapping operation is fixed-length. It has been shown in [55] that pairwise MPDM with such a tree structure gives an approximately fourfold length reduction compared to CCDM at the same information rate. It has also been demonstrated to give significant rate improvements at a fixed blocklength for various QAM formats transmitted over the AWGN channel [85] and the optical fiber channel [86].
Example 5 (MPDM). We consider the same target PMF as in Example 4. Pairwise MPDM with tree structure utilizes 945 compositions whose average is again [95, 69, 37, 15]. The shaping rate (6) of the matcher that produces sequences with these compositions is $R_s = 1.7315$ bit/1-D. The corresponding input length (7) is $k = 374$, which is 7 bits more than that of CCDM, i.e., a 1.9% rate increase.
CCDM has initially been realized with AC, which is sequential in the input length, i.e., at most k serial operations have to be carried out for mapping and n for demapping$^{5}$. Since the serialism of the AC method can be an obstacle to high-throughput CCDM operation, means to run several DMs in parallel have been proposed. For BL-DM [57] or PDM [51], where the target distribution is a product distribution, the parallelization factor is $\log_2 n_a = m - 1$ since one binary-alphabet DM is used for each bit level. This approach has been numerically shown to have reduced rate loss compared to a single nonbinary DM, yet comes at the expense of having the DM output limited to compositions that are generated from a product distribution. In [58], a different parallelization technique has been proposed, which operates on amplitudes rather than on bit levels. For each of $n_a - 1$ out of the $n_a$ amplitudes, a binary-alphabet DM is operated in parallel, with the first DM determining the positions of the first amplitude and the second DM determining where to place the second amplitude within those positions that have not been occupied by the preceding (i.e., first) amplitude. These DM operations can be run in parallel and only the final step of combining the subsequences into the nonbinary output sequence is sequential. We note that both bit-level DM and parallel-amplitude DM are compatible with MPDM.
The schemes discussed in the preceding paragraphs can be considered as extensions to the CCDM architecture that either nest various CCDMs for improved performance (MPDM) or transform a nonbinary CCDM into several binary CCDMs to achieve a larger parallelization (bit-level and parallel-amplitude DM). In [58], SR has been proposed as an alternative to the conventional AC algorithm for CCDM as shown in the bottom layer of Fig. 1. SR solves the CCDM indexing problem by representing a binary-alphabet sequence as a constant-order subset that determines the positions of either binary symbol. For a given sorting, such as lexicographical, the rank of such a subset is found by "enumerating" all preceding sequences, an approach that is used for source coding in [87], [88] and for shaping in [21], [65]. This mapping from sequence to (binary) rank is called ranking in the combinatorics literature and acts as the inverse mapping. The unranking operation from bits to sequence is the DM mapping. The advantage of SR over AC is that the number of serial operations is significantly reduced [58, Sec. V].
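To illustrate the ranking and unranking of constant-weight binary sequences, the following Python sketch uses plain lexicographic enumeration with binomial coefficients. It is not the SR algorithm of [58]: the loop below is serial in n and does not reproduce the reduced serialism reported there, and the function names are hypothetical.

```python
from math import comb


def unrank(index, n, weight):
    """Return the length-n binary sequence with `weight` ones whose
    lexicographic rank (0-based) is `index` (bits-to-sequence direction)."""
    if not 0 <= index < comb(n, weight):
        raise ValueError("index out of range")
    sequence = []
    for position in range(n):
        remaining = n - position - 1
        # Sequences that put a 0 here still need `weight` ones later on.
        with_zero_here = comb(remaining, weight)
        if index < with_zero_here:
            sequence.append(0)
        else:
            sequence.append(1)
            index -= with_zero_here
            weight -= 1
    return sequence


def rank(sequence):
    """Count the sequences of the same length and weight that precede
    `sequence` lexicographically (sequence-to-bits direction)."""
    n, weight, index = len(sequence), sum(sequence), 0
    for position, bit in enumerate(sequence):
        if bit == 1:
            index += comb(n - position - 1, weight)
            weight -= 1
    return index


# Round trip over every sequence of length 6 with three ones.
assert all(rank(unrank(i, 6, 3)) == i for i in range(comb(6, 3)))
```

Each step needs one binomial coefficient, which is why the cost of such schemes depends strongly on whether these coefficients are computed on the fly or looked up.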

B. Sphere Shaping Schemes (Indirect Method)
In this section, a review of SpSh algorithms is provided. All ensuing algorithms target a certain rate, i.e., a number of unique output sequences, rather than a PMF. To this end, for a given $\mathcal{A}$, $n$ and target $k$, the maximum-energy constraint $E^\bullet$ is selected such that the set $\mathcal{A}^\bullet$, as defined in (2), satisfies $|\mathcal{A}^\bullet| \geq 2^k$. This set consists of all $2^m$-ASK amplitude lattice$^{6}$ points on the surface of or inside the n-sphere of squared radius $E^\bullet$ as shown in Fig. 6. We note that the possible sequence energy values for these points, i.e., the squared radii of the n-dimensional shells that the sequences are located on, are $\{n, n+8, \dots, E^\bullet\}$, and the number of shells is calculated as [48]
$$L = \left\lfloor \frac{E^\bullet - n}{8} \right\rfloor + 1.$$
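The selection of the energy constraint can be sketched in Python as follows. This is a minimal illustration, assuming odd-integer ASK amplitudes that include 1 (so that sequence energies start at n and advance in steps of 8); the function names are hypothetical, and the counting is done by a simple dynamic program rather than by any specific algorithm from the cited references.

```python
def count_in_sphere(amplitudes, n, e_max):
    """Number of length-n amplitude sequences with energy at most e_max."""
    if e_max < 0:
        return 0
    ways = [0] * (e_max + 1)  # ways[e]: sequences so far with energy exactly e
    ways[0] = 1
    for _ in range(n):
        extended = [0] * (e_max + 1)
        for energy, count in enumerate(ways):
            if count:
                for a in amplitudes:
                    if energy + a * a <= e_max:
                        extended[energy + a * a] += count
        ways = extended
    return sum(ways)


def smallest_e_max(amplitudes, n, k):
    """Smallest energy bound whose sphere holds at least 2**k sequences."""
    if 2 ** k > len(amplitudes) ** n:
        raise ValueError("k exceeds the number of available sequences")
    e_max = n  # all-ones sequence, assuming amplitude 1 is in the alphabet
    while count_in_sphere(amplitudes, n, e_max) < 2 ** k:
        e_max += 8  # squares of odd integers are congruent to 1 modulo 8
    return e_max


def number_of_shells(n, e_max):
    """Shells with squared radii n, n + 8, ..., e_max."""
    return (e_max - n) // 8 + 1


# Toy case: two dimensions of {1, 3, 5, 7}, at least 2**3 sequences.
e_toy = smallest_e_max([1, 3, 5, 7], 2, 3)
print(e_toy, number_of_shells(2, e_toy))
```

Recounting from scratch for every candidate bound is wasteful but keeps the sketch short; a single pass that accumulates the per-shell counts would give the same result.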
Remark 1. We see from the sphere-hardening result discussed, e.g., by Wozencraft

In the following, we explain two different algorithms to realize SpSh: Enumerative sphere shaping (ESS) and shell mapping (SM). Provided with identical parameters, these two address the same set $\mathcal{A}^\bullet$ of sequences where the difference is in the bits-to-amplitudes mapping.
ESS starts from the assumption that the energy-bounded amplitude sequences, i.e., a n ∈ A • , can be ordered lexicographically.Thus the index of an amplitude sequence is defined to be the number of sequences which are lexicographically smaller.To represent n-amplitude sequences in a sphere, an energy-bounded enumerative amplitude trellis is constructed [48, Sec.III-B].Operating on this enumerative trellis, n-step recursive algorithms are devised to realize the lexicographical index-sequence mapping in an efficient manner [21], [65].These algorithms demand the storage of a matrix (i.e., the trellis) of size (n + 1) × L where each element can be up to nR s -bit long.The required storage and computational complexity of ESS is discussed in Sec.VI.
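A compact Python sketch of lexicographic indexing on an energy-bounded trellis is given below. It follows the description above only in spirit: for brevity the trellis has one column per energy value up to the bound instead of the L occupied columns, numbers are kept in full precision, and all names are hypothetical rather than taken from [21], [48] or [65].

```python
def build_trellis(amplitudes, n, e_max):
    """trellis[j][e]: number of ways to extend a length-j prefix of energy e
    to a length-n sequence whose total energy does not exceed e_max."""
    trellis = [[0] * (e_max + 1) for _ in range(n + 1)]
    trellis[n] = [1] * (e_max + 1)
    for j in range(n - 1, -1, -1):
        for e in range(e_max + 1):
            trellis[j][e] = sum(
                trellis[j + 1][e + a * a] for a in amplitudes if e + a * a <= e_max
            )
    return trellis


def shape(index, trellis, amplitudes, n, e_max):
    """Map an index to the amplitude sequence with that lexicographic rank."""
    if not 0 <= index < trellis[0][0]:
        raise ValueError("index out of range")
    sequence, energy = [], 0
    for j in range(n):
        for a in sorted(amplitudes):
            nxt = energy + a * a
            count = trellis[j + 1][nxt] if nxt <= e_max else 0
            if index < count:  # the target lies in the subtree of amplitude a
                sequence.append(a)
                energy = nxt
                break
            index -= count  # skip all sequences that continue with a
    return sequence


def deshape(sequence, trellis, amplitudes, e_max):
    """Count the sequences that are lexicographically smaller."""
    index, energy = 0, 0
    for j, symbol in enumerate(sequence):
        for a in sorted(amplitudes):
            if a >= symbol:
                break
            nxt = energy + a * a
            if nxt <= e_max:
                index += trellis[j + 1][nxt]
        energy += symbol * symbol
    return index


amps, n_dim, bound = [1, 3, 5, 7], 4, 28
t = build_trellis(amps, n_dim, bound)
assert all(
    deshape(shape(i, t, amps, n_dim, bound), t, amps, bound) == i
    for i in range(t[0][0])
)
```

Both directions take n steps and only add or subtract trellis entries, which matches the qualitative description of the algorithm in the text.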
Another way of ordering n-amplitude sequences in a sphere is to sort them based on their energy, i.e., based on the index of the n-dimensional shell that they are located on.Sequences on the same shell can be sorted lexicographically.To this end, a trellis which is different from that of ESS is constructed [22], [65].Based on this trellis, two different indexing algorithms are proposed in [22].The first one [22, Algorithm 1], which was proposed around the same time as ESS [21], has performance and complexity similar to ESS.The second one [22, Algorithm 2], which is the well-known shell mapping (SM), is based on the divide-and-conquer (D&C) principle, and enables a tradeoff between the computational and storage complexities [27,Sec. 4.3].The D&C principle was used to enumerate sequences from the Leech lattice earlier in [59].The basic principle is to successively divide an n-dimensional indexing problem into two n/2-dimensional problems, creating a log 2 n-step operation.Consequently, SM demands the storage of a matrix of size (log 2 n+1)×L where each element is again nR s -bit long.The required storage and computational complexity of SM is discussed in Sec.VI.
In their initial proposals, the shaping matrices of ESS [21], SM [22] and [22, Algorithm 1] are computed with full precision (FP). To decrease the storage complexity of ESS and SM, a bounded-precision (BP) implementation method is proposed in [65]. The idea is that any number can be expressed in base 2 as $m \cdot 2^p$. Here m and p are called the mantissa and the exponent, stored using $n_m$ and $n_p$ bits, respectively. Then each number in a shaping matrix, i.e., in the trellis, is rounded down to $n_m$ bits after its computation, and stored in the form (m, p). The invertibility of the ESS and SM functions is preserved with this approach [65]. We note that the BP implementation can also be used to realize [22, Algorithm 1]. The BP implementation decreases the memory required to store an element of the shaping trellis from $\lceil nR_s \rceil$ bits to $n_m + n_p$ bits. Typical values of $n_m$ and $n_p$ are a few bytes. The required storage and the computational complexity of the BP implementation are discussed in Sec. VI. The disadvantage of this approximation is that the numbers in the trellis, and thus the number of output sequences, decrease, causing a rate loss. However, this rate loss is shown to be upper-bounded by $-\log_2(1 - 2^{1-n_m})$ bit/1-D [65].
Example 7 (Bounded-precision rate loss). If the shaping set $\mathcal{A}^\bullet$ in Example 6 is constructed with BP using $n_m = 9$ bit mantissas and $n_p = 7$ bit exponents, the resulting rate loss is upper-bounded by 0.0056 bit/1-D. For ESS and SM, the actual rate losses are 0.0021 and 0.0003 bit/1-D, respectively. Since the shaping rate with FP was $R_s = 1.7538$, these rate losses keep $R_s > 1.75$, and consequently, keep k = 112. Therefore, we claim that when more than a few bytes are used to store mantissas, the BP rate loss is smaller than the loss due to the rounding operation in (7). Consequently, the operational rate k/n is not affected. However, the required memory to store an element of the shaping matrix drops from $\lceil nR_s \rceil = 113$ bits to $n_m + n_p = 16$ bits.
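The mantissa-exponent rounding and the bound quoted above can be checked with a few lines of Python. This is a sketch under the assumption that "rounding down to $n_m$ bits" means keeping the $n_m$ most significant bits of an integer; the function names are hypothetical.

```python
from math import log2


def round_down(value, n_m):
    """Keep the n_m most significant bits of a positive integer.
    Returns (mantissa, exponent) with mantissa * 2**exponent <= value."""
    exponent = max(value.bit_length() - n_m, 0)
    return value >> exponent, exponent


def bp_rate_loss_bound(n_m):
    """Upper bound -log2(1 - 2**(1 - n_m)) on the BP rate loss in bit/1-D."""
    return -log2(1 - 2 ** (1 - n_m))


mantissa, exponent = round_down(1000, 4)
print(mantissa, exponent, mantissa << exponent)  # the stored value never exceeds 1000
print(round(bp_rate_loss_bound(9), 4))           # bound for 9-bit mantissas
```

Because the rounding only ever decreases trellis entries, the number of indexable sequences can shrink but never grow, which is consistent with the one-sided bound.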
Both ESS and SM index the same set of sequences for fixed n, $\mathcal{A}$ and $E^\bullet$. The difference is in (i) the way the algorithms are implemented and (ii) the way the sequences are ordered. We discuss the former difference in Sec. VI. Due to the round-down operation in (7), only the sequences with indices smaller than $2^k$ are actually utilized. The remaining ones, i.e., the ones at the end of the list, are unused. For SM, all these sequences have the highest possible energy $E^\bullet$. On the other hand, for ESS, these sequences are at the end of the lexicographical list and are not necessarily from the outermost shell. Thus operationally, the output average symbol energy of SM is no greater than that of ESS for a fixed set of parameters. This difference could be important for ultra short blocklengths; however, for blocklengths larger than a few dozen, it becomes insignificant$^{7}$. Furthermore, as discussed in [90], by simply removing some connections from the shaping trellis, it is possible to force the discarded sequences to be from the outermost shell for ESS as well.

C. Geometric Interpretation of the Shaping Approaches
Output sequences of CCDM have a fixed composition and thus all have the same sequence energy nE, i.e., they are located on the n-dimensional shell of squared radius $E^\bullet = nE$. We note that there are multiple compositions that lead to the same sequence energy and thus, the corresponding shell is only partially utilized by CCDM, as shown in Fig. 6 (left).
With multiple compositions at its output, MPDM makes use of multiple partly filled n-shells, as in Fig. 6 (middle). The average symbol energy as well as the squared radius E of the outermost shell that is utilized by MPDM depend on the actual set of considered compositions. Finally, n-sphere shaping employs all sequences inside the n-dimensional sphere of squared radius $E^\bullet$, as shown in Fig. 6 (right). Note that for simplicity, we have in this explanation neglected the constraint that any practical binary scheme can only address a power-of-two number of shaped sequences. When all three approaches enclose the same number of sequences at a fixed n, their average energies as in (5) satisfy $E_{\text{ccdm}} \geq E_{\text{mpdm}} \geq E_{\text{spsh}}$. Thus at any blocklength, SpSh makes use of the set of sequences having the least average energy and is thus the most energy-efficient scheme. This observation will later be confirmed by the rate loss analysis in Sec. V-A.

V. PERFORMANCE COMPARISON
This section studies the performance of the shaping schemes explained in Sec.IV.The used metrics are (i) finite-length rate loss at a fixed blocklength n, (ii) maximum AIR for BMD and (iii) FER.

A. Rate Loss Analysis
The methodology of computing the rate loss for DM and SpSh schemes in a fair manner is illustrated in Fig. 7. For the DM schemes of Sec. IV-A, the following steps are carried out in order to obtain the rate loss for a particular n. First, the target distribution $P_A$ (and thus the modulation order $2^m$) is fixed. The target distribution is MB, optimized for a particular SNR. We then quantize $P_A$ to $P_{\bar{A}}$ to get the integer-valued target composition $C = nP_{\bar{A}}$, where the quantization criterion is to minimize the Kullback-Leibler divergence between $P_A$ and $P_{\bar{A}}$ [84]. For CCDM, $k = \lfloor \log_2 \mathrm{MC}(C) \rfloor$ bits can be addressed where MC(·) is as defined in (4). For nonconstant composition DMs such as MPDM [55], the number of addressable bits depends on the addressable bits of all constituent compositions, considering the specific constraints of the DM construction such as pairwise partitioning [55, Sec. III-A]. The rate loss is finally computed as $R_{\text{loss}} = H(\bar{A}) - k/n$, as defined in (9).

Fig. 7. Flowchart for the computation of loss for CCDM, MPDM and SpSh.
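The first two steps of this procedure can be sketched in Python as shown below. Two points are assumptions rather than the paper's method: the MB family is parametrized by a generic decay parameter `lam` instead of an SNR, and the quantization uses a simple largest-remainder rule in place of the Kullback-Leibler-optimal quantizer of [84]. All names are hypothetical.

```python
from math import exp, floor


def maxwell_boltzmann(amplitudes, lam):
    """MB PMF with P(a) proportional to exp(-lam * a**2)."""
    weights = [exp(-lam * a * a) for a in amplitudes]
    total = sum(weights)
    return [w / total for w in weights]


def quantize_largest_remainder(pmf, n):
    """Integer composition summing to n that stays close to n * pmf.
    Stand-in for the KL-optimal quantization used in the paper."""
    scaled = [p * n for p in pmf]
    composition = [floor(x) for x in scaled]
    deficit = n - sum(composition)
    by_remainder = sorted(
        range(len(pmf)), key=lambda i: scaled[i] - composition[i], reverse=True
    )
    for i in by_remainder[:deficit]:
        composition[i] += 1  # hand the leftover symbols to the largest remainders
    return composition


pmf = maxwell_boltzmann([1, 3, 5, 7], 0.03)  # lam = 0.03 is an arbitrary choice
print(quantize_largest_remainder(pmf, 216))
```

The resulting composition can then be passed to a multinomial count to obtain k and the rate loss for CCDM.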

For the SpSh schemes of Sec. IV-B, the approach must be different since it is not possible to explicitly target a certain distribution or composition. From the above methodology for MPDM schemes, we obtain the number of input bits k for a given n. For each n, we find the smallest $E^\bullet$ (i.e., the squared radius of the sphere) such that the number of sequences inside the n-sphere satisfies $|\mathcal{A}^\bullet| \geq 2^k$. We compute the induced distribution $P_{\tilde{A}}$ [48, eq. (17)], and the corresponding entropy $H(\tilde{A})$. The rate loss is again obtained as $R_{\text{loss}} = H(\tilde{A}) - k/n$. This procedure ensures that the DM and SpSh schemes are compared at the same rate$^{8}$, i.e., at identical k and n. We note, however, that the target distribution and thus the source entropy differ.
and the average symbol energy E are the same as CCDM's.
The smallest where k is the input length of MPDM. The corresponding induced distribution is $P_{\tilde{A}} = [0.4393, 0.3220, 0.1722, 0.0665]$. Table II shows the input length k, average symbol energy E and rate loss $R_{\text{loss}}$ of CCDM, MPDM and ESS for these parameters. We see that MPDM is able to address a larger set of sequences than CCDM, leading to a seven-bit increase in the input length.
Since their induced distributions are the same, this is reflected as a decrease in rate loss. Then, starting with the same target k, ESS employs a set of sequences with smaller average energy. This also translates into a decrease in rate loss as shown in Table II. Figure 8 (left) shows rate loss vs. blocklength for CCDM, MPDM, ESS, and SM. The target distribution is the same as in Example 8. The target k for ESS is the number of bits achieved by MPDM at each n, which ensures the same transmission rate R. We observe that all advanced schemes, i.e., MPDM, ESS, and SM, clearly outperform CCDM. The more efficient signal space usage of ESS and SM becomes particularly apparent at very short blocklengths$^{9}$. The inset of Fig. 8 (left) shows the rate losses at n = 216. Here CCDM has a 0.035 bit larger rate loss than MPDM.
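For the SpSh branch of the comparison, the induced distribution and the rate loss can be sketched in Python as follows. The sketch assumes that every sequence inside the sphere is used (the truncation to $2^k$ sequences is ignored) and obtains the marginal of one position by counting the completions of the remaining n − 1 positions; it is not claimed to reproduce [48, eq. (17)] term by term, and the names are hypothetical.

```python
from math import log2


def count_in_sphere(amplitudes, n, e_max):
    """Number of length-n amplitude sequences with energy at most e_max."""
    if e_max < 0:
        return 0
    ways = [0] * (e_max + 1)
    ways[0] = 1
    for _ in range(n):
        extended = [0] * (e_max + 1)
        for energy, count in enumerate(ways):
            if count:
                for a in amplitudes:
                    if energy + a * a <= e_max:
                        extended[energy + a * a] += count
        ways = extended
    return sum(ways)


def induced_pmf(amplitudes, n, e_max):
    """Marginal amplitude distribution when all sequences in the sphere are
    used equally often (identical for every position by symmetry)."""
    total = count_in_sphere(amplitudes, n, e_max)
    return [count_in_sphere(amplitudes, n - 1, e_max - a * a) / total
            for a in amplitudes]


def spsh_rate_loss(amplitudes, n, e_max, k):
    """Rate loss H(induced PMF) - k/n in bit/1-D."""
    pmf = induced_pmf(amplitudes, n, e_max)
    entropy = -sum(p * log2(p) for p in pmf if p > 0)
    return entropy - k / n


# Toy parameters, not those of Example 8.
print(induced_pmf([1, 3, 5, 7], 8, 64), spsh_rate_loss([1, 3, 5, 7], 8, 64, 10))
```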

B. Achievable Information Rates
Here, we numerically study the AIRs of ESS, MPDM and CCDM in the finite blocklength regime. As the figure of merit, the finite blocklength AIR for BMD is used as defined in [55, eq. (15)]:
$$\mathrm{AIR}_n = R_{\mathrm{BMD}} - R_{\mathrm{loss}}. \qquad (15)$$
Here $R_{\mathrm{BMD}}$ and $R_{\mathrm{loss}}$ are as defined in (11) and (9). We note that (15) converges to (11) when n → ∞. The finite-length AIR in (15) has been employed to compare ESS and CCDM for the optical fibre channel in [61]-[63]. We note here that (15) is an instance of the rate expression [91, eq. (1)] provided for the layered PS architecture$^{10}$. In Fig. 8 (right), $\mathrm{AIR}_n$ in bit/1-D is shown versus SNR in dB for 8-ASK with ESS, MPDM and CCDM. We use shaping blocks of length n = 216, which is compatible with the $n_c = 648$-bit LDPC codes of IEEE 802.11 [77] that will be employed in PAS in subsequent sections. The shaping algorithms operate at a rate of k/n = 1.75, i.e., k is set to 378 bits. We note that this means we plotted the curves for fixed distributions and did not optimize them at each SNR, unlike [28, Fig. 4] or [55, Fig. 5]. For comparison, the Shannon capacity $\frac{1}{2}\log_2(1+\mathrm{SNR})$ and the GMI for uniform 8-ASK are also plotted. We observe that ESS and MPDM close most of the shaping gap. From the inset figure, we see that ESS and MPDM are roughly 0.72 dB more SNR-efficient than uniform signaling at rate R = 2.25. We note that R = 2.25 corresponds to γ = R − k/n = 0.5, and thus, $R_c = 5/6$ in the PAS context. As a reference, the maximum possible capacity gain due to shaping at this rate is 1.04 dB. The remaining gap of 0.32 dB is due to the finite blocklength nature of shaping and the discrete nature of the employed constellation.
From the inset of Fig. 8 (right), we see that MPDM is 0.23 dB more power-efficient than CCDM. This difference is consistent with the empirical relation between the power loss and rate loss explained in footnote 7, more specifically, $P_{\text{loss,dB}} \approx 6 R_{\text{loss}} = 6 \cdot 0.0352 = 0.21$ dB.
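The conversion between rate loss and power loss used here can be reproduced with a two-line Python sketch. It assumes the high-SNR slope of the AWGN capacity $\frac{1}{2}\log_2(1+\mathrm{SNR})$, under which one bit/1-D costs $20\log_{10} 2 \approx 6.02$ dB; the function name is hypothetical.

```python
from math import log10

DB_PER_BIT = 20 * log10(2)  # about 6.02 dB per bit/1-D at high SNR


def power_loss_db(rate_loss):
    """Approximate SNR penalty in dB for a rate loss given in bit/1-D."""
    return DB_PER_BIT * rate_loss


print(round(power_loss_db(0.0352), 2))  # rate-loss difference read from Fig. 8 (left)
```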
We conclude from Fig. 8 that from a practical point of view, MPDM and SpSh perform almost the same at blocklengths larger than n ≈ 200.Therefore, to make a choice among these at such values of n, required storage, computational complexity and latency of the algorithms that can be used to implement MPDM and SpSh should be considered.We will discuss these aspects of shaping algorithms in Sec.VI.
Remark 2 (Targeting a rate with DM).Example 8 shows that when the entropy of the target distribution is taken to be the target rate k/n (1.75 in Example 8), CCDM and MPDM are not able to obtain 2 k sequences.This is due to the inevitable nonzero rate loss of the DM schemes for finite blocklengths.For such cases, we increase the SNR that the target distribution is optimized for, until we obtain 2 k output sequences for the DM schemes.

C. End-to-End Decoding Performance
In the following, the decoding performance is evaluated after transmission of 64-QAM over an AWGN channel. The BRGC in Fig. 2 (right) is used for amplitude-to-bit mapping after shaping, and for symbol mapping after FEC encoding as shown in Fig. 2 (left). Different transmission rates and length regimes of LDPC codes are considered. For each SNR, the simulations are run until at least 100 frame errors are observed. For the first case of long FEC, we use codes from the DVB-S2 LDPC standard [76] with blocklength $n_c = 64800$ bits. In the case of short FEC, the LDPC codes from the 802.11 standard [77] of length $n_c = 648$ bits are used.
For a fixed 1-D constellation size $M = 2^m$, FEC code rate $R_c$ and target transmission rate R (in bit/1-D), we compute $\gamma = R_c m - (m - 1)$ and accordingly, $k/n = R - \gamma$. Here the total number of 1-D symbols in an $n_c$-bit FEC codeword is $n = n_c/m$. For DM algorithms working with $\mathcal{A} = \{1, 3, 5, 7\}$, the AWGN-optimal MB PMFs at 10.7 and 14 dB SNR are quantized to obtain the integer composition based on [84] for the target rates 4 and 4.5 bit/2-D, respectively. For SpSh algorithms, $E^\bullet$ is selected as the minimum value that leads to $R_s \geq k/n$. Both ESS and SM are then implemented with FP. The amplitude shaping function of SM is implemented using [22, Algorithm 2].
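The parameter bookkeeping above can be checked with a small Python sketch that uses exact fractions. The function name is hypothetical, and the transmission rate is passed in bit/2-D and halved internally, which is an assumption about how the per-dimension rate R is obtained for square QAM.

```python
from fractions import Fraction


def pas_parameters(m, code_rate, rate_2d, n_c):
    """Return (gamma, k/n, n, k) for PAS with a 2**m-ASK constellation per dimension."""
    gamma = code_rate * m - (m - 1)     # information bits carried on the sign bits
    rate_1d = Fraction(rate_2d) / 2     # bit/1-D
    shaping_rate = rate_1d - gamma      # k/n in bit/1-D
    n = Fraction(n_c, m)                # amplitudes per FEC codeword
    return gamma, shaping_rate, n, shaping_rate * n


# 64-QAM (m = 3 per dimension), rate-5/6 code, 648-bit codewords.
for rate in (Fraction(4), Fraction(9, 2)):
    print(pas_parameters(3, Fraction(5, 6), rate, 648))
```

For these settings the sketch yields k/n = 1.5 and 1.75 bit/1-D with n = 216, which are the values used for the short-code simulations.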
Figure 9 (left) shows the decoding performance with DVB-S2 LDPC codes for ESS, SM, MPDM, and uniform signaling at a transmission rate of 4.5 bits per complex channel use (bit/2-D). ESS, SM and MPDM, all of length 180 amplitudes, use either the LDPC code of rate 5/6 (solid curves) or rate 4/5 (dashed curves). At this shaping blocklength, each LDPC frame consists of 120 shaped blocks. In order to achieve a transmission rate of 4.5 bit/2-D, the redundancy added by the shaping scheme is varied. For uniform 64-QAM, the code rate is set to 3/4. We observe for the shaped schemes that the performance with FEC rate 5/6 is superior to that with rate 4/5, for which the reasons are outlined in Sec. III-D, and focus on 5/6 in the following.
At a FER of 1e-3, the shaped schemes outperform uniform signaling by approximately 0.9 dB. We further note that ESS, SM and MPDM have very similar performance, with ESS and SM being approximately 0.05 dB more power-efficient than MPDM. This is in good agreement with the rate loss analysis of Fig. 8 (left) where also only a marginal improvement of the SpSh schemes over MPDM is found.

Remark 3. From the discussion in Sec. III-D, we expect an SNR improvement of approximately 0.83 dB of the shaped schemes over uniform signaling, which is in good agreement with the observed improvement of 0.9 dB. Potential reasons for the roughly 0.1 dB difference between the theoretical analysis and the numerical simulations are the different coding gaps of the employed LDPC codes as well as the finite-length rate loss of the shaping schemes.
For short LDPC codes with shaped signaling, the shaping blocklength is set to n = 216, which, in combination with the LDPC code length of $n_c = 648$ bits and 64-QAM, gives a one-to-one correspondence between the blocklengths of FEC and shaping. In Fig. 9 (right), the decoding performance is analyzed at transmission rates of 4 and 4.5 bit/2-D. Uniform 64-QAM requires LDPC rates 2/3 and 3/4, respectively. For the shaped schemes, the code rates that minimize ∆SNR for 64-QAM at rates 4 and 4.5 bit/2-D can be computed using (12) to be $R_c \approx 0.79$ and 0.83, respectively. Thus $R_c = 5/6$, being the closest available rate to these values, is used for shaped signaling.
As shown in Fig. 9 (right), we observe that at rate 4 bit/2-D, ESS, SM and MPDM, which have identical decoding performance in this setup, require 1.1 dB less SNR than uniform signaling to achieve a FER of 1e-3. This improvement is due to the finite-length shaping gain as well as the reduced coding gap of the rate-5/6 LDPC code over the rate-2/3 code. We further observe that ESS and MPDM are 0.22 dB more power-efficient than CCDM.
Figure 9 (right) also shows the FER at rate 4.5 bit/2-D.Here, ESS, SM and MPDM again perform identically.Uniform signaling is significantly outperformed by approximately 0.9 dB SNR.CCDM is now 0.23 dB less SNR-efficient than the other shaping approaches.
We have seen that the performance of the MPDM and SpSh schemes is almost identical for the considered shaping length.Hence, implementation aspects, which are discussed next, are believed to be of significant importance in the comparison between these schemes.

VI. APPROXIMATE COMPLEXITY DISCUSSION
In the preceding section, we followed the conventional approach of comparing different schemes by studying the blocklength that is required to obtain a certain shaping gain.While this is certainly a natural choice for analysing and comparing shaped systems, this approach inherently assumes that shorter blocks are always better, for instance because they have advantages regarding implementation.In the following, we comment on the implementation aspect by considering computational complexity, latency, and storage requirements.An example where slightly longer blocklengths can be beneficial also from an implementation perspective is the parallel-amplitude architecture proposed in [58,Sec. III].By allowing a small additional rate loss, the throughput is increased significantly by using n a − 1 DMs in parallel.Furthermore, the serialism (and thus, the latency) of the SR method of [58, Sec.IV] is smaller than AC-CCDM.It can thus be beneficial to make the blocks slightly larger than for conventional CCDM in order to facilitate implementation.
An interesting example where the selection of the shaping blocklength does not depend only on the complexity vs. shaping gain tradeoff is the nonlinear regime of the optical fibres.The authors of [61] recently found that shaping over shorter blocklengths increases the nonlinear tolerance, and thus, the effective SNR.Their claim is that when the complexity considerations are ignored, there is an optimum n that optimizes the balance between linear shaping gain and nonlinear tolerance.

A. Latency
In order to evaluate the latency of the discussed amplitude shaping algorithms, we use the concepts of "degree of serialism" and "parallelization factor" as defined in [58].Degree of serialism is the number of loop iterations that are completed for shaping/deshaping operations.We stress that this quantity neglects the computational complexity of these iterations, and thus the latency of the operations within each sequential processing step.Therefore the degree of serialism can only serve as a rough indicator for latency.On the other hand parallelization factor is the number of simultaneously possible executions of a process to complete shaping/deshaping operations.
AC, which can be employed to realize CCDM, is by nature a highly serial algorithm, and AC-CCDM has a serialism of k for matching and n for dematching [52]. SR-DM, which is an alternative to AC-CCDM in the binary-output case [58], has a serialism of $\min(n_1, n - n_1)$ and 1 for shaping and deshaping, respectively$^{11}$.
In BL-DM [57] and PDM [51], a binary-output matcher is used for each of the log 2 n a = m − 1 amplitude bit levels to enable parallelization, and thus, the parallelization factor is log 2 n a .As another attempt, PA-DM uses a binary-output matcher for n a − 1 of the n a amplitudes [58], and thus, the parallelization factor is n a − 1.A more detailed discussion on improving the parallelization of DM algorithms is provided in Sec.IV-A.
The shaping and deshaping algorithms of ESS [21] and [22, Algorithm 1] have a serialism of k and n, respectively.On the other hand SM [22] operates based on the D&C principle as in [59], and therefore has a serialism of log 2 n for deshaping.Table III summarizes the serialism of discussed shaping schemes.

B. Storage Requirements
AC-CCDM, which employs an extension of [53] to nonbinary output, associates an interval in [0, 1) to each binary input sequence and to each constant-composition amplitude sequence [52, Sec. IV]. In simplified terms, the final interval is computed by recursively splitting the initial interval into $n_a$ subintervals. The algorithm only requires the storage of the interval and the source statistics (i.e., the composition), which can be realized with $\log n$ bits$^{12}$. Thus we denote the storage complexity of AC-CCDM by $O(\log n)$. A similar reasoning can be used to determine the storage complexity of SR-DM [58, Sec. IV], which is also $O(\log n)$.

Table III notes: Sh: L multiplications, comparisons and subtractions†. Dsh: L multiplications and additions. † SM requires a division per dimension for shaping as well. (Sh: Shaping, Dsh: Deshaping, FP: Full-precision, BP: Bounded-precision [65], BC: Binomial Coefficient.)

In MPDM, in addition to the requirements of the underlying CCDM algorithm, a composition is chosen based on a prefix of the binary input sequence. For this purpose, a prefix code and the corresponding Huffman tree are constructed [55, Sec. III-C]. To store the binary tree, a LUT can be constructed. The size of this table depends on the number of utilized compositions and grows with n for a fixed $\mathcal{A}$ and k/n. For practical scenarios, the number of compositions is on the order of a few hundred as shown in the following example.
Example 9 (MPDM, number of compositions).We consider A = {1, 3, 5, 7}, n = 216 and target rates k/n = 1.5 and 1.75 bit/1-D.To obtain these target rates, MPDM uses 318 and 593 different compositions, respectively.Note that these are the parameters that are used for the simulations considered in Fig. 9 (right).
FP implementations of ESS and [22, Algorithm 1] require the storage of an n-by-L matrix where each element is at most $\lceil nR_s \rceil$ bits long. Thus following Remark 1, the storage complexity of these algorithms is $O(n^3)$ for fixed $R_s$. FP SM can be realized by storing a $\log_2 n$-by-L matrix [22], which has complexity $O(n^2 \log n)$. We note here that these values are in alignment with [22,

Remark 4. To compute the required storage for SpSh in the BP case, we will assume that $n_m$ is independent of n. This assumption relies on the fact that the rate loss resulting from BP only depends on $n_m$ [65]. Thus for a fixed rate loss, the required value of $n_m$ is independent of n. Expressing the number of bits to store the exponent as $n_p = \lceil \log_2(\lceil nR_s \rceil - n_m) \rceil$, we see that $n_p$ behaves as $\log_2 n$ for a fixed $n_m$. We note here that for a fixed n, $\mathcal{A}$ and target k, the natural choice for $n_m$ is the smallest value that keeps the number of sequences at least $2^k$ [65].
For the BP implementations of ESS, SM and [22, Algorithm 1], each element of the stored shaping matrix is at most $(n_m + n_p)$ bits long [65]. Following Remark 4, the storage complexity of ESS and [22, Algorithm 1] in the BP case is $O(n^2 \log n)$. On the other hand, the storage complexity of BP SM is $O(n \log^2 n)$.
Example 11 (BP SpSh, required storage).To realize ESS or [22, Algorithm 1] with n m = 9 and n p = 7 for the setup in Example 6, at most Ln(n m + n p ) = 11.39 kB of memory is required.On the other hand, when implemented using n m = 6 and n p = 7, SM demands at most L log 2 n(n m + n p ) = 0.87 kB of memory.We note that the mantissa lengths n m are selected according to the discussion in Remark 4.
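The storage figures of this example can be reproduced with the Python sketch below. Several inputs are assumptions, since Example 6 is not restated here: n = 64 and L = 89 are inferred values that happen to reproduce the quoted 11.39 kB and 0.87 kB, the exponent length is computed with ceilings around both the element length and the logarithm, and 1 kB is taken as 1000 bytes. The function names are hypothetical.

```python
from math import ceil, log2


def exponent_bits(n, shaping_rate, n_m):
    """Bits for the exponent: ceil(log2(ceil(n * R_s) - n_m)) (assumed form)."""
    return ceil(log2(ceil(n * shaping_rate) - n_m))


def bp_storage_bytes(rows, num_shells, n_m, n_p):
    """Upper bound on the storage of a rows-by-L matrix of (mantissa, exponent) pairs."""
    return rows * num_shells * (n_m + n_p) / 8


N, L, RS = 64, 89, 1.7538  # N and L are inferred; RS is the FP shaping rate of Example 7

ess_bytes = bp_storage_bytes(N, L, 9, exponent_bits(N, RS, 9))
sm_bytes = bp_storage_bytes(int(log2(N)), L, 6, exponent_bits(N, RS, 6))
print(round(ess_bytes / 1000, 2), "kB for ESS,", round(sm_bytes / 1000, 2), "kB for SM")
```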
In conclusion, we believe that storage requirements in the order of a few kB are not critical for high-throughput operation, particularly in comparison to latency and complexity.Note that the required storage for BL-DM, PDM and PA-DM depends on the underlying algorithm.

C. Computational Complexity
To comment on the computational complexity of the amplitude shaping algorithms, we will mainly consider the number of required bit operations or computations of binomial coefficients (BC).The caveat here is that this approach only gives a rough estimate since the complexity of an operation depends heavily on the specific case that it is executed in.As an example, the seemingly simple operation of comparing the sizes of two numbers can be computationally challenging for large numbers.On the other hand the notoriously expensive division operation reduces to a simple shift in registers for some specific divisors.
As explained in Sec. VI-B, AC-CCDM can be realized by splitting an interval into $n_a$ subintervals per 1-D. This requires at most $n_a$ multiplications. For each multiplication, one of the multipliers is found by a division using the statistics of the composition. Finally, at most $n_a$ comparisons are carried out. We note that practical issues such as "numerical precision", "gaps between intervals" and "rescaling" are omitted here, and the reader is referred to [56], [92], [93] for details.
An approximate implementation of AC-CCDM is proposed in [68] where computations are realized with fixed-point operations. However, this implementation also requires multiplications, divisions and comparisons of large integer numbers. In addition, an implementation of AC-DMs based on finite-precision arithmetic is provided in [72]. SR-CCDM, in contrast to AC, is based on calculating binomial coefficients (BCs). Thus, the number of bit operations depends strongly on how this computation is implemented or whether the BCs can be pre-computed and stored. However, to give a rough indication of computational complexity, we need to compute $(n_a - 1)$ BCs for unranking (shaping), and $(n_a - 1)/2$ BCs for ranking (deshaping) per 1-D.
When ESS and [22, Algorithm 1] are implemented with FP, at most $n_a$ additions (subtractions) of numbers from the corresponding shaping matrix are required per 1-D. These numbers are at most $\lceil nR_s \rceil$ bits long, i.e., $n_a \lceil nR_s \rceil$ bit operations$^{13}$ per dimension (bit oper./1-D) are necessary. Thus, the computational complexity of these algorithms is $O(n)$. The FP implementation of SM, however, requires at most L multiplications of numbers from the shaping matrix. Therefore, the computational complexity of SM is $O(n^3)$. Table III summarizes the serialism, required storage and computational complexity of the discussed shaping algorithms as classified in Fig. 1. The main conclusion from Table III is that for DM, AC and SR provide a tradeoff between serialism and computational complexity. However, we note that SR can only be used for binary-output DM. On the other hand, for SpSh, SM and ESS create a tradeoff between required storage and computational complexity. The selection among different algorithms then depends on the actual resources that are available for shaping in practice, and thus, we refrain from making definitive suggestions here.
We conclude this paper by showing in Fig. 10 the maximum required storage versus the maximum number of computations required to implement BP and FP SpSh, and BP AC-CCDM$^{15}$. We see that there is a computational complexity vs. required storage tradeoff between ESS (and [22, Algorithm 1]) and SM. ESS requires larger storage but can be implemented with a smaller complexity, and only demands additions and subtractions. On the other hand, SM can be realized with a smaller storage, but requires many multiplications and divisions. In fact, by modifying the corresponding shaping and deshaping algorithms, it is also possible to adjust the balance between computational complexity and required storage as explained in [27, Sec. 4.3.4], i.e., operate between the ESS and SM clusters in Fig. 10. Furthermore, there is also a difference in the computational complexities of ESS and [22, Algorithm 1]. An initial step is required in [22, Algorithm 1] where the n-shell that the corresponding sequence is located on is determined. This step requires at most L − 1 additions and comparisons.

$^{15}$ Here we assume that BP AC-CCDM is implemented with finite-precision arithmetic using 16-bit numbers, which is comparable to the values selected in [72].
Finally, Fig. 10 also shows that BP AC-CCDM can be implemented with moderate computational complexity and minimal storage. Furthermore, these requirements do not heavily depend on the blocklength n. Thus for large n where its rate losses are small, and for applications for which the high serialism of AC is not important, AC-CCDM is an effective and low-complexity choice as a shaping algorithm.

VII. CONCLUSION
This paper reviewed prominent amplitude shaping architectures and algorithms for the probabilistic amplitude shaping (PAS) framework. Constant composition distribution matching (CCDM), multiset-partition DM (MPDM) and sphere shaping (SpSh) are all optimum shaping techniques for asymptotically large blocklengths, in the sense that they have vanishing rate loss. However, for short blocklengths, CCDM addresses a smaller set of output sequences than MPDM and SpSh, leading to higher rate losses. We provided evidence for the AWGN channel that seeking to utilize the signal space in an energy-efficient manner is better than attempting to obtain the capacity-achieving distribution, which is derived for asymptotically large, and thus impractical, blocklengths. Therefore, MPDM, SpSh, and other energy-efficient shaping architectures are suitable for use over a wider blocklength regime, especially for blocklengths below a couple of hundred symbols.
In addition to the rate loss analysis, we evaluated the achievable information rates (AIR) and frame error rates (FER) of the PAS framework employing CCDM, MPDM and SpSh as the underlying amplitude shaping approach. Enumerative sphere shaping (ESS) and shell mapping (SM) are both considered as potential SpSh implementations. AWGN channel simulations with 64-QAM demonstrate that power-efficiency gains on the order of 1 dB can be obtained already at blocklengths around 200 by employing MPDM and SpSh, and thus justify our earlier observation on the objective of amplitude shaping. CCDM provides gains around 0.75 dB for the same settings. Furthermore, these gains are predicted well by shaping gain and AIR computations based on bit-metric decoding.
In the last part of the paper, we discussed the performance of shaping algorithms considering latency, required storage and computational complexity. To realize DM, an arithmetic coding (AC)-based implementation of MPDM requires minimal storage and can be implemented with a few computations per input symbol. However, AC has a higher serialism than a subset ranking (SR)-based implementation, which in turn has increased computational complexity. For SpSh, ESS and SM provide a tradeoff between storage and computational complexity: the cost of ESS is dominated by the required storage, and that of SM by the required number of computations. Thus, the decision on which algorithm should be used to realize energy-efficient amplitude shaping depends on the application-specific requirements on latency, available storage and tolerable computational complexity.

Fig. 2. (Left) Block diagram of the PAS architecture. Amplitude shaping blocks (green boxes) are examined in the current paper. (Right) The binary reflected Gray code (BRGC) for 8-ASK. A quadrature amplitude modulation (QAM) symbol is the concatenation of two ASK symbols.

Example 2 (Shaping, FEC and transmission rates in PAS). Consider the PAS architecture with 8-ASK, a rate $R_c = 5/6$ FEC code, and a target rate $R = 2.25$ bit/1-D. The rate of the extra data that will be carried in the signs of the channel inputs is $\gamma = R_c m - (m - 1) = 0.5$ bit/1-D. Therefore the rate of the amplitude shaper should be $k/n = R - \gamma = 1.75$ bit/1-D. If the length of the FEC code is $n_c = 648$ bits, the blocklength is $n = n_c/m = 216$ real symbols. Then the output set of the amplitude shaper must consist of at least $2^k = 2^{216 \cdot 1.75} = 2^{378}$ sequences.
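The arithmetic of Example 2 can be checked with a short script. This is only a sketch: the function name `pas_rates` and its parameter names are hypothetical, and $m = 3$ is assumed for 8-ASK (one sign bit plus two amplitude bits), which is what makes $\gamma = 0.5$ in the example.

```python
from fractions import Fraction

def pas_rates(m, fec_rate, target_rate, fec_length):
    """Rates of a PAS setup with 2^m-ASK (hypothetical helper, not from the paper).

    Returns (gamma, shaper_rate, n, k), using exact rational arithmetic.
    """
    gamma = fec_rate * m - (m - 1)      # extra data carried on the signs, bit/1-D
    shaper_rate = target_rate - gamma   # k/n, rate of the amplitude shaper, bit/1-D
    n = fec_length // m                 # blocklength in real symbols
    k = shaper_rate * n                 # number of shaper input bits
    return gamma, shaper_rate, n, k

# Parameters of Example 2: 8-ASK (m = 3 assumed), R_c = 5/6, R = 2.25, n_c = 648.
gamma, shaper_rate, n, k = pas_rates(
    m=3, fec_rate=Fraction(5, 6), target_rate=Fraction(9, 4), fec_length=648
)
print(f"gamma = {gamma}, k/n = {shaper_rate}, n = {n}, k = {k}")
```

Using `Fraction` avoids floating-point rounding in $R_c m$, so the shaper input length comes out as an exact integer.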

Fig. 4. Content of a channel input sequence produced by PAS.

Fig. 5. Channel input entropy vs. gap-to-capacity for 8-ASK at the target rate of $R = 2.25$ bit/1-D. The x-axis above shows the corresponding FEC code rates.

Fig. 6. Illustration of the $n$-dimensional signal points employed by CCDM (left), MPDM (middle) and SpSh (right). Each circle represents an $n$-dimensional shell. Darker portions of the shells indicate the signal points on them which are utilized by the corresponding shaping approach.

Example 12 (FP SpSh, computational complexity). Based on Example 6, at most $n_a \lceil nR_s \rceil = 452$ bit oper./1-D are necessary to realize ESS and [22, Algorithm 1]. In contrast, for SM algorithms, at most $L \lceil nR_s \rceil^2 = 1136441$ bit oper./1-D are required.
With the BP approach, ESS and [22, Algorithm 1] can be implemented with at most $n_a (n_m + n_p)$ bit oper./1-D. Then their computational complexity is $O(\log n)$. On the other hand, BP SM can be realized with at most $L (n_m + n_p)^2$ bit oper./1-D. Therefore the complexity of SM is now $O(n \log^2 n)$.
Example 13 (BP SpSh, computational complexity). When Example 6 is now constructed with $n_m = 9$ and $n_p = 7$, ESS and [22, Algorithm 1] require at most $n_a (n_m + n_p) = 64$ bit oper./1-D. Correspondingly, if SM is realized with $n_m = 6$ and $n_p = 7$, $L (n_m + n_p)^2 = 15041$ bit oper./1-D are necessary.
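The bit-operation counts of Examples 12 and 13 can be reproduced as follows. This is a sketch under stated assumptions: the function names are hypothetical, the number length is taken as $\lceil nR_s \rceil$ bits (the rounding that matches the quoted figures), and $n_a = 4$ is assumed to be the number of amplitudes in Example 6.

```python
import math

def fp_bit_ops(n, rate_s, n_a, num_shells):
    """Worst-case bit oper./1-D with full precision (FP) for (ESS, SM)."""
    width = math.ceil(n * rate_s)      # length in bits of a stored number
    ess = n_a * width                  # n_a additions of width-bit numbers
    sm = num_shells * width ** 2       # L multiplications of width-bit numbers
    return ess, sm

def bp_bit_ops(n_a, num_shells, n_m, n_p):
    """Worst-case bit oper./1-D with bounded precision (BP) for (ESS, SM)."""
    width = n_m + n_p                  # mantissa plus exponent bits
    return n_a * width, num_shells * width ** 2

# Setup of Example 6: n = 64, R_s = 1.7538 bit/1-D, L = 89; n_a = 4 is assumed.
fp_ess, fp_sm = fp_bit_ops(n=64, rate_s=1.7538, n_a=4, num_shells=89)
bp_ess, _ = bp_bit_ops(n_a=4, num_shells=89, n_m=9, n_p=7)   # ESS precision
_, bp_sm = bp_bit_ops(n_a=4, num_shells=89, n_m=6, n_p=7)    # SM precision
print(fp_ess, fp_sm, bp_ess, bp_sm)
```

The two BP calls use different mantissa lengths because Example 13 quotes $n_m = 9$ for ESS and $n_m = 6$ for SM.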

Fig. 10. Maximum computational complexity vs. maximum required storage of ESS and SM. Red- and blue-colored markers indicate FP and BP implementations, respectively. Radii of the markers are proportional to the corresponding blocklength $n \in \{64, 216, 512\}$.
... Jacobs in [89, Sec. 5.5], that $E^{\bullet} \approx nE$ for large $n$. Following Laroia et al. [22, Sec. III-A] and approximating the required average energy to transmit $R$ bit/1-D by $c 2^{2R}$, we can write $E^{\bullet} \approx n c 2^{2R}$, where $c$ is some constant. Therefore $L$ in (14) is a linear function of $n$ for a fixed rate $R$.
Example 6 (Sphere shaping). The shaping set $\mathcal{A}^{\bullet} \subset \mathcal{A}^n$ for the parameters $n = 64$, $\mathcal{A} = \{1, 3, 5, 7\}$ and $E^{\bullet} = 768$, i.e., $L = 89$, has the shaping rate $R_s = 1.7538$ bit/1-D. The input length of the corresponding amplitude shaper is $k = 112$ bits. The induced PMF is $P_A(a) = [0.42, 0.32, 0.18, 0.08]$ over $\mathcal{A}$, where the average energy per dimension is $E = 11.6316$.
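The size of a sphere shaping set such as the one in Example 6 can be counted with a simple dynamic program over the accumulated energy. This is a minimal sketch rather than ESS or SM themselves: the function name `sphere_shaping_set` is hypothetical, and the set is assumed to contain every amplitude sequence whose energy is at most $E^{\bullet}$.

```python
import math

def sphere_shaping_set(n, amplitudes, e_max):
    """Count length-n amplitude sequences with total energy <= e_max.

    Returns (number of sequences, number of occupied energy shells).
    """
    # counts[e] = number of sequences built so far with energy exactly e
    counts = [0] * (e_max + 1)
    counts[0] = 1
    for _ in range(n):
        nxt = [0] * (e_max + 1)
        for e, c in enumerate(counts):
            if c == 0:
                continue
            for a in amplitudes:
                if e + a * a <= e_max:   # prune: energy only ever increases
                    nxt[e + a * a] += c
        counts = nxt
    size = sum(counts)
    shells = sum(1 for c in counts if c)
    return size, shells

# Parameters of Example 6.
size, shells = sphere_shaping_set(n=64, amplitudes=(1, 3, 5, 7), e_max=768)
rate_s = math.log2(size) / 64        # shaping rate R_s in bit/1-D
k = math.floor(math.log2(size))      # shaper input length in bits
print(f"L = {shells}, R_s = {rate_s:.4f} bit/1-D, k = {k} bits")
```

Python integers are arbitrary precision, so the count, which exceeds $2^{112}$ here, is exact. The `shells` value counts the distinct energy levels that occur, which corresponds to $L$ in the example.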
(Left) FER vs. SNR for 64-QAM at a transmission rate of 4.5 bit/2-D. DVB-S2 LDPC codes with $n_c = 64800$ bits are used. All shaping schemes use a blocklength of $n = 180$. (Right) FER vs. SNR for 64-QAM at transmission rates of 4 and 4.5 bit/2-D. LDPC codes of 802.11 with $n_c = 648$ bits are used. All shaping schemes use a blocklength of $n = 216$.

Example 10 (FP SpSh, required storage). To realize ESS or [22, Algorithm 1] for the setup in Example 6, at most $L n \lceil nR_s \rceil$ bits, i.e., 80.46 kilobytes (kB), of memory is required. For SM, on the other hand, at most $L \lceil \log_2 n \rceil \lceil nR_s \rceil$ bits, i.e., 7.54 kB, of memory should be allocated.
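The storage figures of Example 10 follow from the same parameters. The sketch below uses a hypothetical function name and assumes entries of $\lceil nR_s \rceil$ bits and decimal kilobytes (1 kB = 1000 bytes), since these are the conventions under which the quoted values are reproduced.

```python
import math

def fp_storage_kilobytes(n, rate_s, num_shells):
    """Worst-case FP storage in kB for (ESS, SM), assuming 1 kB = 1000 bytes."""
    width = math.ceil(n * rate_s)                               # bits per entry
    ess_bits = num_shells * n * width                           # L * n entries
    sm_bits = num_shells * math.ceil(math.log2(n)) * width      # L * log2(n) entries
    return ess_bits / 8 / 1000, sm_bits / 8 / 1000

# Setup of Example 6: n = 64, R_s = 1.7538 bit/1-D, L = 89.
ess_kb, sm_kb = fp_storage_kilobytes(n=64, rate_s=1.7538, num_shells=89)
print(f"ESS: {ess_kb:.2f} kB, SM: {sm_kb:.2f} kB")
```

The roughly tenfold gap comes from ESS storing on the order of $n$ columns per shell, against $\log_2 n$ for SM.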