On the Capacity of Amplitude Modulated Soliton Communication over Long Haul Fibers

The capacity limits of fiber-optic communication systems in the nonlinear regime are not yet well understood. In this paper, we study the capacity of amplitude modulated first-order soliton transmission, defined as the maximum of the so-called time-scaled mutual information. Such definition allows us to directly incorporate the dependence of soliton pulse width to its amplitude into capacity formulation. The commonly used memoryless channel model based on noncentral chi-squared distribution is initially considered. Applying a variance normalizing transform, this channel is approximated by a unit-variance additive white Gaussian noise (AWGN) model. Based on a numerical capacity analysis of the approximated AWGN channel, a general form of capacity-approaching input distributions is determined. These optimal distributions are discrete comprising a mass point at zero (off symbol) and a finite number of mass points almost uniformly distributed away from zero. Using this general form of input distributions, a novel closed-form approximation of the capacity is determined showing a good match to numerical results. Finally, mismatch capacity bounds are developed based on split-step simulations of the nonlinear Schro¨dinger equation considering both single soliton and soliton sequence transmissions. This relaxes the initial assumption of memoryless channel to show the impact of both inter-soliton interaction and Gordon–Haus effects. Our results show that the inter-soliton interaction effect becomes increasingly significant at higher soliton amplitudes and would be the dominant impairment compared to the timing jitter induced by the Gordon–Haus effect.


Introduction
It is predicted that the capacity of data transfer network, mainly consists of optical fibers, will fall behind the data traffic demands in the near future [1]. The prediction implies the need for exploiting current optical fiber infrastructure to their limits before migrating to the next generation of optical fiber systems. However, the fundamental information transmission capacity of the most basic optical fiber link (i.e., standard single-mode fiber) is not fully known in the nonlinear regime. Different approaches have been used to tackle this problem in the literature including the recent application of nonlinear Fourier transform (NFT) to approach the limits of the nonlinear optical fiber [2,3]. Using NFT, the nonlinear dispersive fiber channel, defined by the nonlinear Schrödinger equation (NLSE), is transformed to linear channels in nonlinear spectral domain, redefining the capacity problem formulation for nonlinear optical fibers.
By applying NFT, the available degrees of freedom in temporal domain are transformed to two types of spectra in the nonlinear spectral domain, namely the discrete and continuous spectra. Therefore, NFT is regarded as a base for development of new techniques of data transmission, and different communication system designs have been proposed using NFT [4][5][6][7][8][9][10][11][12][13]. The performance of such NFT-employed system for long-haul communication is investigated by simulation and experiment [14,15]. However, it has been observed that the noise behavior is not trivial in these systems [16][17][18], and the performance largely depends on the design. Moreover, the application of NFT in estimating the capacity of nonlinear optical fibers is not straightforward since the NFT and inverse NFT (INFT) must be performed numerically and are computationally complex [19,20].
An estimation of the capacity of the nonlinear optical fiber by only signaling on its continuous spectrum defined by NFT is provided in [21,22]. Achievable rates have been predicted, but it has been shown that due to the signal dependency of the noise, the capacity will be saturated at high power. Moreover, several works in the literature have been focused on estimating the achievable information rates (AIR) of the fiber when the discrete spectrum (i.e., soliton transmission) is used as the signal space. In [23], a capacity lower bound for amplitude modulated first-order soliton communication system is estimated using a half-Gaussian input distribution. In [24], an achievable rate is estimated taken into account the Gordon-Haus effect that leads to timing jitter at the receiver. In [18], AIR is estimated for a more complicated system that modulates both the eigenvalue and the norming constant in the discrete spectrum. Assuming a receiver capable of detecting variable pulse duration, in [25], the time-scaled mutual information (MI) is numerically optimized considering the memoryless channel model for soliton communication.
In this paper, we investigate the capacity of the optical fiber channel when only a single discrete spectrum point is encoded and the data is mapped on the imaginary part of the corresponding eigenvalue. This is essentially equivalent to the amplitude modulated soliton communication in [26]. As mentioned above, a number of capacity bounds for such channel has been derived previously [18,23,24], and AIR in bits per second were also discussed in [25]. However, some intrinsic limitations, such as dependence of bandwidth on soliton amplitude and the interaction between neighboring soliton pulses have been ignored. Compared to the state-of-art works in the literature (e.g., [23,25]), we investigate the effect of channel memory induced by solitonic interaction, which is mostly ignored in the literature. In order to incorporate the time-bandwith degrees of freedom into the capacity problem formulation, we study the maximization of time-scaled MI similar to [25] but by assuming a more practical communication system that uses a fixed symbol duration (i.e., soliton pulse width). A general form of capacity-approaching input distributions are proposed through the optimization of an approximated normalized channel model, providing important insights into the optimal design of soliton communication systems. In addition, an analytical estimation of the capacity of amplitude modulated soliton transmission is provided. This paper is structured as follows: In Section 2, we initially consider a commonly used memoryless non-Gaussian channel model for the imaginary part of the eigenvalue [16]. By applying the variance normalizing transform (VNT) [22,27], the original channel is transformed into an equivalent channel with normalized noise power, which is then approximated by a unit-variance additive white Gaussian noise (AWGN) model in Section 3. Taking into account a peak amplitude constraint imposed by bandwidth limitations, the capacity in bits/normalized time and its corresponding input distribution are estimated using the proposed AWGN model and also an approximate analytical approach. Next, in the Section 4, we consider the effect of channel memory by developing the mismatch capacity bounds based on the split-step simulation of single soliton and soliton sequence transmissions over the NLSE. Based on the mismatch capacity results, the impact of inter-soliton interaction and Gordon-Haus effects on the capacity of soliton communication systems is studied.

Channel Model
At a low launch power, the optical fiber channel can be modeled as a linear dispersive channel impaired by AWGN noise. However, the Kerr nonlinearity becomes significant when the signal power increases to allow transmission over long haul fibers. The propagation of the complex envelope of a narrowband optical field in a standard single-mode fiber can be described by the stochastic nonlinear Schrödinger equation (NLSE), as discussed in ( [28], Chapter 4). Assuming the fiber loss to be perfectly compensated by an ideal distributed Raman amplification, the NLSE is given as where Q(T, Z) denotes the complex envelope of the optical field, N(T, Z) represents the amplifier spontaneous emission (ASE) noise term, T and Z are time and propagation distance, and β 2 and γ indicate group velocity dispersion and Kerr nonlinearity respectively. Note that the fiber loss term α here is omitted since ideal distributed Raman amplification is assumed. The ASE noise is modeled by a zero mean white Gaussian noise with autocorrelation The spectral density of the noise in [W/(km · Hz)] is N ASE = αhν 0 K T for the ideal distributed Raman amplification assumed in this work, where hν 0 denotes the photon energy and K T denotes the phonon occupancy factor. The NLSE could be normalized into the form with the corresponding normalized parameters as where dispersion length is defined as L D = T 2 0 /|β 2 |, and normalizing time T 0 can be selected independent of other parameters. Consequently, the autocorrelation of the normalized noise is, where σ 2 = N ASE 2γL 2 D T 0 according to the normalization (3). Using the inverse scattering method, NFT transforms the time domain optical signal into scattering data, consisting of continuous spectrum ρ(λ, z), eigenvalues λ m (z) M m=1 and corresponding norming constants C m (z) M m=1 which evolve linearly along the fiber in nonlinear spectral domain. It can be shown that, in a noise-free and interaction-free scenario, the eigenvalues λ m are preserved during the evolution along the fiber [29]. If only one eigenvalue exists at z = 0 and ρ(λ, 0) = 0, the solution of NLSE is a first-order soliton, which can be described analytically as where the only eigenvalue is λ 1 = ζ + iη (η > 0). Also, e 2 = C 1 2η and ψ = arg C 1 (z) where C 1 denotes the norming constant corresponding to eigenvalue λ 1 .
The Energy of the soliton in (5) is equal to 4η, where the temporal width and bandwidth are proportional to 1/η and η respectively. Note that within this work, only the imaginary part of the eigenvalue is modulated and the real part is set to zero, i.e., η = A, ζ = 0. Thus, at z = 0, the input pulse can be expressed as q(t, z = 0) = 2Asech(2At).
The propagation of the soliton pulse over the fiber is described by NLSE, and at the receiver side, the eigenvalue can be detected by NFT or pulse energy estimation. If the detected eigenvalue is denoted as R, the channel model for this amplitude modulated first-order soliton transmission system can be described by a conditional PDF P R|A (r|a), which is non-Gaussian with a variance dependent on its mean [16,30]. Ignoring inter-soliton interactions, a memoryless channel model can be defined for the amplitude modulated soliton system based on a noncentral chi-squared distribution (NCX2) with 4 degrees of freedom as [16,23] where I 1 (·) denotes the modified first order Bessel function of the first kind. The mean and variance of this distribution for large a are µ NCX2 (a) = σ 2 N + a and σ 2 NCX2 (a) = 1 2 σ 4 N + aσ 2 N respectively, where σ 2 N = 1 2 σ 2 L 2L D at distance Z = L and σ 2 is the power spectral density of the normalized ASE noise as defined in Equation (4). It can be seen that the channel model (7) for the imaginary part of the eigenvalue (soliton amplitude, or soliton energy) is non-Gaussian with signal dependent variance. In the next section, we develop different approaches to estimate the capacity of the channel described by (7).

Capacity Formulation for Memoryless Soliton Communication Channel
Here, the capacity problem for the channel defined by the conditional PDF (7) is formulated considering a peak amplitude constraint since the bandwidth occupied by soliton pulses is directly related to their amplitudes. That is, the modulating data on higher amplitudes requires larger bandwidth while the maximum signal bandwidth is restricted by physical limitations. Moreover, in practical scenarios, peak power is also constrained due to device limitations. Another important issue that needs to be considered for soliton communications systems is that soliton pulses defined as in (6) are not time-limited, and thus, they should be truncated for practical implementations.
We define the practical width of a soliton pulse (denoted by t s ) as the temporal width that contains 1 − δ of the soliton energy. Recalling the energy of the normalized soliton (6) is equal to 4A, this practical width can be obtained by solving the equation below for t s which is given by where the fixed value δ should be sufficiently small to make the truncation error negligible compared to noise. For example, assuming that the soliton pulse width is defined based on containing 99.9% of its energy (δ = 0.001), we have t s = 3.8/A. Noting that the temporal width of soliton pulses is inversely related to their amplitudes, we can also introduce a minimum amplitude constraint to limit the utilization of the temporal resources. Based on the constraints mentioned above, the capacity problem can be formulated as where C bpcu denotes the capacity in bits per symbol per channel use, I(A; R) represents the MI. Denoting the transmitted and received eigenvalues with random variables A and R respectively, A ub is the maximum amplitude constraint determined by maximum bandwidth or peak power and A lb is the minimum amplitude constraint determined by the maximum allowed symbol duration. Note that we also consider the possibility of transmitting no soliton over a symbol duration (i.e., off symbol) with probability p 0 , which is denoted by A = 0 here.
Noting that the signal space and the temporal resources are inter-related in the underlying soliton communication system, we will use an alternative capacity formulation that maximizes time-scaled MI [25] to get better insights into AIRs of the system in bits per second. Unlike [25], we assume a fixed symbol duration for all transmitted solitons to facilitate practical implementation. Since the pulse width is inversely related to the amplitude of the soliton, the minimum nonzero soliton amplitude A min ≥ A lb (i.e., maximum pulse width) in a given input distribution determines the symbol duration. Note that A min is not necessarily equal to the minimum amplitude constraint A lb and P(A < A min ) = p 0 . The time-scaled MI (MI) is thus defined as where MI is divided by the normalized symbol duration, resulting in a unit of [bits/normalized time].
The data rate in [bits/second] can be estimated by dividing the time-scale MI (11) with the normalizing time T 0 in (3). The corresponding time-scaled capacity formulation is then given by Note that the minimum amplitude constraint A lb can be also relaxed, since it is already inherently imposed by the modified objective function, i.e., the time-scaled MI. This is because the optimal solution would not include the small soliton amplitudes that consume the available temporal resources inefficiently due to their very large pulse width. Hence the capacity problem can be also written as In Section 3.2, it is shown that a minimum nonzero soliton amplitude A min naturally appears in the optimal distribution of the capacity problem in (13).

Equivalent Channel Model Based on VNT
To simplify the capacity analysis, similar to the method used in [22,[30][31][32], variance normalizing transform (VNT) is applied here to transform the original signal-dependent noise channel to a channel with a fixed noise power at sufficiently large signal-to-noise ratios. In general, the VNT can be applied to any random variable R where its variance σ 2 R is related to the mean µ R as σ 2 R = f 2 (µ R ). Then the variance of the transformed random variable, Y = T(R), is normalized to one (i.e., mean independent) at sufficiently large values of µ R . The general form of VNT can be written based on [33] as Therefore the normalized random variable Y = T(R) has the moments σ 2 Y ≈ 1 and µ Y = E[y] ≈ T(µ R ) for sufficiently large value of µ R . Substituting the statistics of the NCX2 channel considered in this work, the VNT will be given as where the approximation is made for mathematical simplicity and due to the fact that the variance normalization itself defined by VNT is only precise at large values of u/σ 2 N where the adopted approximation is also precise [22,27,31,32].
As shown in Figure 1, an equivalent soliton communication system can be defined based on the VNT approach where the noise power is signal-independent at large signal levels. Note that, in order to perform the coding and decoding at the same signal space, it is convenient to include both VNT and inverse VNT (IVNT) meaning that the soliton amplitude, A, is determined from the original input data X = T(A) as Noting the square root form of the VNT defined in (15) and considering that the NCX2 model in (7) defines the channel between the soliton eigenvalues A and R in Figure 1, the equivalent channel model between the transformed random variables X and Y is described by a noncentral chi (NCX) conditional PDF as where The capacity in bit per symbol of the system in (10) can then be rewritten based on the random variables X and Y as where X lb = T(A lb ) and X ub = T(A ub ). Moreover, the corresponding time-scaled capacity formulation is given by or based on the relaxed constraint as where the time-scaled MI can be written as and X min denotes the minimum nonzero symbol amplitude, i.e., A min = T −1 (X min ) = σ 2 N X 2 min /4. It is important to notice that the VNT transformation does not affect the MI between input and output, i.e., I(A; R) = I(X; Y), since the VNT function (15) is a monotonic and invertible function within the interested domain (See Lemma in [22]). Hence, the capacity formulations in (12) and (19) are equivalent.

Approximate AWGN Channel Model
It has been shown that the probability distribution of the normalized random variable after VNT tends to Gaussian distribution for a family of originally non-Gaussian probability distributions [22,31]. In this section, we first show that this is also true for the NCX distribution (17) in a Kullback-Leibler (KL) divergence sense. This inspires us to propose an approximate AWGN channel model to describe the amplitude modulated soliton communication system after VNT transformation as where the additive noise Γ is Gaussian with zero mean and unit variance.
The KL divergence between the NCX distribution, P Y|X (y|x), given in (17) and a Gaussian distribution Q Y|X (y|x) with mean x and unit variance tends to zero for a sufficient large x, that is where KL divergence, D KL (P, Q|x), is defined as Proof of Proposition 1. The detailed proof of Proposition 1 is shown in Appendix A Proposition 1 indicates that the NCX channel model (17) behaves similar to the approximate AWGN channel for a sufficiently large x. For example, The KL divergence D KL is estimated as small as 1.77 × 10 −12 for x = 86.67. This is by assuming that the pulse width contain 99.9% of the energy (δ = 0.001) and some typical fiber parameters as in Table 1. Next, we will show that the proposed approximate AWGN channel converges to the original NCX channel at sufficiently large large X lb . Theorem 1. Given the input X ∈ {0 ∪ [X lb , X ub ]} at a sufficiently large X lb , the mismatch capacity of the NCX channel with the approximate AWGN channel defined by (22) as auxiliary channel converges to the actual capacity of the NCX channel.
Proof of Theorem 1. The detailed proof of Theorem 1 is shown in Appendix B.
In [34,35], it is shown for the AWGN channel with amplitude constraints that the capacity-achieving distribution is discrete with a finite number of mass points for such channels. An upper bound is proposed in [36] for the number of mass points. However, these works focus on the MI-based capacity formulation. In the next Proposition, we extend the result in [34] to show the discreteness of the optimal solution to the time-scaled MI maximization problem for the proposed approximate AWGN channel.

Proposition 2.
Given an AWGN channel with the input amplitude constraint of X ∈ {0 ∪ [X lb , X ub ]} and X lb → ∞, the optimal input distribution for the capacity formulation in (19) is discrete with a finite number of mass points.
Proof of Proposition 2. The detailed proof of Proposition 2 is shown in Appendix C. Now, approximating the channel in (19) with an AWGN model based on Theorem 1 and considering the conclusion of Proposition 2 on the discreteness of the optimal input distribution asymptotically, the MI between X and Y can be expressed as where h(Y) denotes the output differential entropy, h(Γ) denotes the differential entropy of the unit variance AWGN noise, x k and p X (x k ) denote the input symbols and their corresponding probabilities within the input source alphabet, M denote the size of the nonzero alphabet, x 0 = 0 and p X (x 0 ) = p 0 denotes the corresponding probability. Hence, the problem in (19) can be rewritten as where the time-scaled MI function R(X; Y) is a function of two (M + 1)-length vectors x and p X which denote the mass points and their probabilities. As mentioned in the previous sections, the minimum amplitude constraint can be also relaxed yielding Since the input distribution is discrete, the vector [x, p X ] is sufficient to describe the input random variable X. The discreteness of the capacity-achieving input distribution allows for numerical evaluation of the capacity expression using similar algorithms as in [30,34]. In this work, the optimization over [x, p X ] is performed using an interior-point optimizer in MATLAB given the number of nonzero mass point is fixed at M. The optimization on M is then performed based on an exhaustive search approach which will keep increasing M until additional mass points can no longer improve the optimized time-scaled MI.  Figure 2 shows the capacity-achieving distributions obtained by solving (26) and the corresponding capacity estimation using the optimized input distribution. For these results, we assume an ideal distributed Raman amplified 2000 km fiber with the parameters detailed in Table 1. Using the constraint from X ub = 200 to X ub = 500. This range of peak amplitude constraint corresponds to the range of maximum eigenvalue from A ub = 0.4 to A ub = 2.5, which represent the peak optical power −5 dBm and +10 dBm, respectively. (d) Figure 2. The optimal input distribution and the corresponding optimized time-scaled mutual information (MI) obtained as the numerical solution of (26) subject to the peak amplitude constraint X ub assuming δ = 0.001. (a) The location of the optimal mass points (the peak amplitude is shown as the purple solid line with star) (b) The optimal probability of the mass point at zero (i.e., off symbol) (c) The optimal probabilities of the nonzero mass points, (d) The maximum Time-scaled MI given based on the solution of (26) and the lower bounds on the time-scaled capacity of the original noncentral chi-squared distribution (NCX) channel achieved by using different input distributions, including, on-off keying (OOK), 4 pulse amplitude modulation (4-PAM) and the input distribution given in (a) to (c). Note that the additional power axis denotes the power level of the solitons corresponds to the peak amplitude X ub assuming δ = 0.001.
In Figure 2a-c, the optimal distributions are shown for various peak amplitude constraints X ub . The figures show that the optimal distributions consist of an isolated mass point at zero (off symbol), and a uniform-like distribution starting from a minimum nonzero symbol (denoted by X min ) to the maximum symbol amplitude (denoted by X max = X ub ). It is also important to point out that the probabilities at X min and X max getting closer to the probabilities of the mass points in between as X ub increases, showing a convergence towards a uniform distribution. Note that the results in [25] shows a nonuniform distribution of optimal mass points since the pulse width is assumed to be variable. Figure 2d presents the capacity of the approximate AWGN channel based on the solution of (26) as well as some lower bounds on the capacity of the original NCX channel (17). The best lower bound is obtained by applying the optimal distribution of the approximate AWGN channel as in Figure 2a-c to the time-scaled MI of the NCX channel. This lower bound precisely overlaps with the capacity of the approximate AWGN channel, further confirming the result of Theorem 1, in a MI sense, i.e., that the AWGN channel is a very good approximation of the NCX channel within the range of consideration. Figure 2d also includes the time-scaled MI estimated for the transmission of conventional on-off keying (OOK) and 4 pulse amplitude modulation (4-PAM) signals over the original NCX channel. As expected, both conventional modulations show lower time-scaled MI comparing to the optimized input distribution. However, the conventional 4-PAM signal achieves even lower time-scaled MI than OOK. This is due to the fact that the fixed symbol duration is inversely related to the amplitude of the minimum nonzero amplitude X min , which is X min = X ub /3 for 4-PAM but X min = X ub for OOK. In general, for a K-PAM modulation scheme, the time-scaled MI can be upper bounded by the time-scaled source entropy, H(X) t s (A min ,δ) = σ 2 N X 2 min 2 ln(2/δ−1) log 2 (K), where the X min = X ub /(K − 1). It can be then shown that the time-scaled source entropy for K-PAM will always decrease with respect to K for K ≥ 2. This suggests that K-PAM with higher K cannot achieve better time-scaled MI than OOK. It is also worth noting that some of the sub-optimal distributions proposed in the literature (e.g., the half-Gaussian bound proposed in [23]) is not included here as the half-Gaussian input source would give a zero time-scaled MI when a fixed symbol duration is considered as in this paper.

Analytical Capacity Approximation
Inspired by the optimal input distributions obtained in the last section as presented in Figure 2, in this section, we focus on developing an analytical approach for time-scaled capacity estimation of the soliton communication system. Assuming that the peak amplitude constraint X ub is sufficiently large, Figure 2 shows that the capacity-achieving input distribution obtained by solving (26) is discrete with a finite number of mass points including an almost uniform distribution within [X min , X max = X ub ], and an additional mass points at zero, where the optimal X min needs to be found by solving the optimization problem. We therefore consider a general form of discrete input distribution with a mass point at zero with probability p 0 and a discrete uniform distribution within [X min , X max ] to find an analytical estimation of the solution to the capacity problem given in (26). Note that the upper boundaries of the distribution is denoted by X max ≤ X ub rather than X max = X ub to keep it inline with the peak amplitude constraint introduced earlier.
To write the corresponding MI based on (25), we first need to define the statistics of the channel output given the input signal parameters, P Y (y|p 0 , X min , X max ). In order to make the capacity analysis tractable, we make an approximation that the distribution of the noisy output signal Y given the transmission of nonzero mass points, i.e., P Y (y|X ∈ [X min , X max ]) is approximated by a continuous uniform distribution within the range [X min , X max ]. This approximation is reasonable when the number of mass points M are large and the noise variance is small compared to the signal level. Based on this approximation and also considering the Gaussian noise added to the zero mass point, we can write where the f G (·) denotes the PDF of a zero mean, unit variance Gaussian distribution and u(y|X min , X max ) denotes the step function that is equal to 1 when y is within [X min , X max ] and 0 otherwise. Considering the approximate PDF in (28), we now calculate the differential entropy of the received signal as where the approximation a leads from applying the approximate output distribution in (28), and the approximation b is valid under the assumption that X min 0, i.e., f G (y ≥ X min ) ≈ 0. Substituting (29) into the Equation (25), the approximated MI is then given as a function of p 0 , X min and X max as Noting that the scaling time (9) is a function of the minimum mass point X min , the approximate time-scaled MI function R app (X; Y) is then given as (31), the solution to the capacity problem given in (26), is obtained as

Theorem 2. Given the approximated time-scaled MI function in
where the optimal parameters of the input distribution are given as where W(·) denotes the Lambert W function.
Proof of Theorem 2. The detailed proof of Theorem 2 is shown in Appendix D.
Using Theorem 2, the approximate solution to the capacity problem in (26) can be calculated analytically. As it can be observed in Figure 3, this approximate capacity result demonstrates a close match to the exact capacity results obtained numerically.

Mismatch Capacity for Soliton Communication over the NLSE Channel
So far, we have focused on the capacity estimation of the first-order soliton transmission based on the commonly used memoryless channel model defined by the noncentral chi-squared distribution in (7). In this section, we study the capacity limits of the soliton transmission over a more realistic description of the fibre-optic channel defined by the NLSE. Hence, both the Gordon-Haus effect and the nonlinear interactions between adjacent soliton pulses can be incorporated into the capacity analysis. For this purpose, we use the numerical evaluation of mismatch capacity bounds based on split-step simulation of the NLSE. The mismatch capacity approach is commonly used to provide a lower bound on the capacity of a communication system, by assuming a mismatch distribution for decoding the received signal [32,37]. If the mismatch distribution is denoted by Q Y|X (y|x) and the real channel statistics is denoted by P Y|X (y|x), the time-scaled mismatch capacity bound for a discrete input signal is expressed as where p X (x j ) denotes the input probability of symbol x j taken from optimization (26), and E P Y|X (y|x k ) [·] denotes an expectation operation over the channel model P Y|X (y|x k ) . Recall from Section 3.1 that the unit-variance Gaussian distribution and the NCX distribution are well matched for the interested range of interest. Thus, a unit-variance Gaussian distribution Q Y|X (y|x) is a reasonable mismatch distribution to be employed in the calculation of the mismatch capacity.
To take into account the impairments introduced by ASE noise, such as Gordon-Haus timing jitter, as well as intersoliton interaction effects, we use the split-step method to simulate the propagation of single soliton or soliton sequence transmission over the fiber. Hence many realizations of the fiber-optic channel can be generated based on the simulation of NLSE to establish the statistics of the realistic channel given the capacity-approaching input distribution obtained in Section 3.2, (i.e., P(y|x k )). The generated channel statistics can then be used to numerically estimate the mismatch capacity in (36) through a Monte Carlo approach. Noting that the input distribution applied here is not necessarily the optimal distribution for the realistic channel, our results, C Mismatch , provide a lower bound on the mismatch capacity, which in turn gives a lower bound on the capacity of the realistic soliton communication system. The simulation of the channel realization required for the Monte Carlo estimation of mismatch capacity is generated following each function block of the proposed system as in Figure 1. The pulses correspond to the input alphabet will be transmitted into a simulated fiber perturbed by ASE noise via split step Fourier method based on NLSE (1). The output pulse from the simulated fiber will then be put through an NFT detector, which extracts the eigenvalue R from the detected pulse. The received eigenvalue R will then be VNT transformed into the transformed domain for decoding the information. Unless otherwise mentioned, δ = 0.001 is assumed to calculate the soliton duration, i.e., 99.9% soliton energy pulse-width.

Mismatch Capacity for Single Soliton Transmission
We first focus on single soliton transmission over the NLSE which takes into account the Gordon-Haus effect while ignoring the inter-soliton interaction effects. Using identical fiber parameters as in Table 1, Figure 3 compares the time-scaled mismatch capacity calculated based on 1000 realizations per possible symbol for X ub ∈ [200, 500] with the time-scaled capacity of AWGN model obtained in Section 3.2 and the analytical approximation derived in Section 3.3. From Figure 3, it can be observed that the time-scaled MI increases as the peak amplitude constraint increases. It is also observed all the curves provide a well-matched estimations of the capacity, confirming that the Gordon-Haus effect is not so significant within the range of interest here. Nevertheless, we can see that, for larger X ub , the gap between mismatch and AWGN curves increases, which can be due to the stronger Gordon-Haus effect, that will be experienced by larger amplitude soliton pulses. Note that the timing jitter introduced by the Gordon-Haus effect can shift the soliton beyond the limited timing window over which the NFT is applied, which leads to energy loss and possible errors in eigenvalue detection.

Mismatch Capacity for Soliton Sequence Transmission
The memoryless channel model of soliton communication considered in Section 3 and in most of the literature is only valid when there is no intersoliton interactions, limiting the accuracy of the model to the cases where the sequence of soliton pulses are well separated. In this section, we use the mismatch capacity approach introduced above to provide some insights on the impact of inter-soliton interaction effects on the capacity of soliton communication systems. In the previous section, the performance of the system is discussed based on simulating the transmission of a single soliton pulse through a long haul fiber-optic channel, which neglects the inter-solitonic interactions. In this section, the transmission of a sequence of three soliton pulses is considered, where the middle soliton is considered to be the target soliton for detection. Meanwhile, the neighboring solitons (i.e., the first and the third solitons) are assumed to be independently and randomly selected based on the statistics of the input signal distribution taken from the solution of the AWGN capacity formulation in (26). Note that the pulse width of a soliton is a function of δ and X min in the input signal distribution. The simulation is performed based on the same split step Fourier method employed in Section 4.1, while the NFT-based detection is only performed over the pulse width of the middle soliton.
It has been shown in [38] that, even in the absence of any noise, solitons can exert attracting or repelling forces on each other when they are not place far enough, and this leads to inter-soliton interaction effects. Thus, before implementing the soliton sequence transmission in the presence of the ASE noise, we intend to estimate the mean squared error (MSE) induced by the noiseless inter-soliton interaction to evaluate the significance of this effect for different soliton separations. Recall that the ASE noise power after VNT is normalized to 1. Hence, the inter-soliton interaction effect would be negligible relative to noise, if the inter-soliton interaction MSE is much less than 1, i.e., where E[·] denotes expectation over all possible combination of the three-soliton sequences, Y nl denotes the received VNT transformed eigenvalue in a noiseless scenario. The noiseless simulation is based on the identical simulation parameters as in Table 1 but in the absence of ASE noise (i.e., assuming noiseless ideal distributed Raman amplification) and using the input soliton amplitudes taken from the capacity-approaching distribution given in Section 3.2. In this section, the signaling of the solitons are based on four different δ parameters and their corresponding pulse width. Note that a smaller δ leads to a longer symbol duration as defined by (9), which results in more separation between solitons an thus less inter-soliton interaction. Figure 4 shows the inter-soliton interaction MSE estimated by simulating the transmission of all possible three-soliton sequences following the input distribution given in Section 3.2 assuming different values of δ. The overall trend of the MSE is increasing as the peak amplitude constraint X ub is increasing. Moreover, as expected, decreasing the δ parameter reduces the MSE. In fact, reducing δ corresponds to the decreasing the fraction of energy truncation that essentially extends the soliton temporal separation. The additional temporal separation will reduce the force between the solitons [38], thus, the inter-soliton interaction is mitigated. Note that, for δ = 10 −3 , the MSE goes beyond unity for X ub > 300 as shown in Figure 4, meaning that the inter-soliton interaction effect becomes comparable to noise beyond that point, hence, the δ parameter needs to be reduced to maintain a low interaction effect. Similarly, it is observed that the MSE becomes comparable to noise for δ = 3 × 10 −4 beyond X ub = 400.  Table 1.
In order to evaluate the impact of intersoliton interaction effect on the capacity of the system, Figure 5 shows the time-scaled capacity results and the corresponding MI calculated based on different proposed methods including the AWGN model and mismatch decoding with or without inter-soliton interaction effects for different values of δ. Figure 5a shows the significant impact of intersoliton interaction effects on the time-scaled capacity at higher peak amplitudes. For example, for δ = 10 −3 , the time-scaled MI gradually drops beyond X ub = 300 and tends to zero before X ub = 400. It is also observed that when δ decreases, the longer symbol duration scales down the time-scaled MI in the whole range of X ub but the efficiency of the communication system in combating intersoliton interaction effects improves (i.e., capacity drop shifts to higher soliton amplitudes). This indicates that there is a trade-off in selecting the parameter δ. On the one hand, a smaller δ mitigates more effectively both inter-soliton interaction and Gordon-Haus effects, and on the other hand, it reduces how efficiently the temporal resources are being used. Hence, in future work, δ also needs to be included in the capacity problem formulation. Nevertheless, Figure 5a gives an estimation of sensitivity of the time-scaled capacity with respect to δ by providing the mismatch results at different values of this parameter. Therefore, by taking the supremum of the curves with different δ values in different parts of the dynamic range, we can obtain a good estimation of the capacity lower bound in the presence of soliton interaction. For example, based on the available results, the capacity result at δ = 10 −3 is best up to X ub = 300 while the capacity results for δ = 3 × 10 −4 and δ = 10 −4 are best in ranges X ub ∈ [300, 400] and X ub > 400, respectively. (b) Figure 5. The capacity estimation of the soliton communication based on the AWGN model optimization in (26), and the mismatch capacity bounds in the presence (mismatch inter) or absence (mismatch no inter) of inter-soliton interaction effects in terms of (a) time-scaled MI and (b) MI, for different values of δ and the link parameters stated in Table 1.
The MI results presented in Figure 5b is produced from scaling back the optimized time-scaled MI results in Figure 5a. It therefore focuses on how efficiently each soliton is decoded rather than how efficiently the temporal resources are being used. The figure shows that, for δ = 10 −3 , the inter-soliton interaction effect strongly degrades the mismatch capacity beyond X ub = 300 as expected from Figures 4 and 5a. By reducing δ, it is observed that the inter-soliton interaction effect decreases and it almost matches the mismatch capacity results with no interaction at δ = 10 −4 . This is also expected from Figure 4, as δ = 10 −4 shows MSE 1 for most of the range of interest. In addition, the mismatch capacity at δ = 10 −5 even outperforms the mismatch capacity with no interaction and almost matches the AWGN result. This is because the mismatch with interaction at δ = 10 −5 corresponds to the transmission of a soliton sequence with longer symbol duration. The longer duration essentially eliminates both the Gordon-Haus effect as well as the interaction effects while this is not the case in the mismatch results with no interaction where we still assume shorter pulse width with δ = 10 −3 . This also verifies the accuracy of the proposed AWGN approximation model compared to the realistic simulated channel when both the Gordon-Haus and inter-soliton interaction effects are negligible.

Conclusions
In this paper, we proposed a number of new approaches for estimating the capacity of the amplitude-modulated soliton communication systems. We provided insights into the AIRs of such systems when effects such as Gordon-Haus and inter-soliton interaction are present. The non-central Chi squared channel model that is commonly used in the literature was initially considered and was then approximated by a unit-variance AWGN channel by applying VNT. Using the approximated channel model and subject to a peak amplitude constraint, optimal input distribution and the corresponding capacity were obtained numerically. The optimized distributions are discrete with a mass point at zero corresponding to no soliton transmission as well as an almost uniform distribution of mass points spread in a range away from zero up to the peak amplitude constraint. Using this general form of the optimal distribution based on the approximate AWGN model and applying some mathematical simplifications, we developed an analytical expression to estimate the capacity of the soliton communication system. Despite the additional approximations, the analytical approach provides a close match to the results obtained numerically based on the AWGN model. The optimal input distribution based on AWGN model were also used to calculate the mismatch capacity of the soliton communication system using the split-step simulation of the realistic channel defined by the NLSE. The results show that the effect of inter-soliton interaction caused by limiting the soliton pulse width is stronger than the Gordon-Haus effect for long haul fibers operating in a range of launch powers up to 10 dBm. They also show the trade-off between extending the pulse width to avoid inter-soliton interaction and compressing the pulse width to improve the temporal efficiency.
In future works, the soliton pulse truncation factor δ can be included in the capacity problem formulation as an additional variable. This allows for a more comprehensive analysis of the soliton interaction effects. Moreover, the capacity problem based on the assumption of variable pulse width can be considered in the presence of soliton interaction effects. Another interesting problem related to this work is the capacity analysis of higher-order soliton transmissions. ; Modified first order Bessel function of the first kind: I 1 (·); Probability density function (PDF) of y: P Y (y); Conditional PDF of y given x: P Y|X (y|x); MI between input X and output Y: I(X; Y); Time-scaled MI between input X and output Y: R(X; Y) [25]; KL divergence between distribution P and distribution Q given parameter x: D KL (P, Q|x); Lower constraint: (·) lb ; Upper constraint: (·) ub ; Minimum nonzero value: (·) min ; Maximum nonzero value: (·) max ; Optimal value: (·) * ; Lambert W function: W(·).

Appendix A
Proof of Proposition 1. Following a similar method as in [32], a non-negative term KL divergence is employed to evaluate difference between two distributions, where P and Q denote the distributions, x indicates the given parameter(s) of the two distribution P and Q. Within this proof, P Y|X (y|x) is considered to be a noncentral chi distribution as where I 1 (·) denotes the modified Bessel function of the first kind, and the mean and variance of P Y|X are denoted with µ NCX (x) and σ 2 NCX (x) respectively. Q Y|X (y|x) is considered as a Gaussian distribution with identical mean µ NCX (x) and variance σ 2 NCX (x), i.e., To prove the convergence of the NCX distribution to a Gaussian distribution with mean x and unit variance at a sufficiently large x, we first verify the convergence of the moments of of the NCX distribution and then show its tendency to Gaussian distribution at large x. Taking the limits of the first and second moments at large values of x, we obtain which verifies the convergence of moments to the corresponding values in the theorem statement. Now substituting the NCX distribution (A2) and its corresponding Gaussian distribution (A3) into (A1), the KL divergence can be expressed as where h NCX (x) denotes the differential entropy of the NCX distribution (A2) given parameter x, and E NCX (·) denotes the expectation over the NCX distribution (A2). The first term can be expressed as while the second term can be written as Since function f (y) = ln(I 1 (xy)) where x is a given non-negative constant and function g(y) = ln(y) are concave functions [39], Jensen's inequality is applied to obtain an upperbound on the KL divergence as Next, we find the limit of the upper bound of KL divergence D ub (x) using the limits of mean µ NCX and variance σ 2 NCX (x) already calculated in (A4) and (A5), that is At last, using the non-negativity of KL divergence, we have 0 ≤ lim x→+∞ D KL ≤ lim x→+∞ D ub = 0, i.e., lim x→+∞ D KL = 0. Therefore, we can conclude that the KL divergence between (A2) and (A3) goes to zero when x is sufficiently large and this concludes the proof.

Appendix B
Proof of Theorem 1. In this proof, we show that the gap between the NCX channel capacity and the mismatch capacity of the NCX channel given the approximate AWGN channel as auxiliary channel tends to zero as X lb → ∞. Consider that the input random variable X ∈ {0 ∪ [X lb , X ub ]} is separated into zero and nonzero sets. Then the PDF of X can be written as where δ(x) denotes the Dirac delta function, and PX(x) denotes the PDF of the nonzero inputX. Similarly, the output random variable Y can also be separated in a similar manners as where the PŶ(y) = X P Y|X PX(x)dx denotes the PDF of the output corresponding to the nonzero input. The MI between input X and output Y is then given as Substituting Equation (A12) in the output differential entropy h(Y), it is then rewritten as where we changed the variable of integral in the first term of the last equality as y = y − X lb . Taking the Taylor expansion of the logarithmic functions inside the two integrals of the right hand side of (A14) at y = 0 and y = 0, respectively, we obtain where the∆ and ∆ 0 are higher order terms for nonzero and zero input, respectively. At X lb → ∞, can be written in the form of expectations. They will therefore vanish since, for NCX distribution, lim X lb →∞ P Y |X (y ≤ −X lb /2|x ≥ X lb ) = 0 and lim X lb →∞ P Y|X (y ≥ X lb /2|0) = 0. Inserting (A15) in (A13), the MI is given as The mismatch capacity is a proven lower bound of the capacity. Assuming mismatch decoder design based on the Gaussian distribution, Q Y|X (y|x), the mismatch capacity I LB is defined as where the mismatch output distribution Q Y (y) can be written in similar manner as (A12) as where the QŶ(y) = X Q Y|X (y|x)PX(x)dx denotes the PDF of the output corresponding to the nonzero input. The mismatch capacity at X lb can be obtained via similar approach as before as where the∇ and ∇ 0 are higher order terms of the Taylor expansion for nonzero and zero inputs.
Similarly, at X lb → ∞,∇ = E ∑ ∞ can be written in the form of expectations. They will therefore vanish, for AWGN channel, since lim X lb →∞ Q Y |X (y ≤ −X lb /2|x ≥ X lb ) = 0 and lim X lb →∞ Q Y|X (y ≥ X lb /2|0) = 0 for the Gaussian distribution. The gap between the MI I(X; Y) and its lower bound I LB (X; Y) is then defined as +∞ −X lb /2 P X (x)P Y |X (y |x) log 2 P Y |X (y |x) Q Y |X (y |x) dy dx − X ub X lb +∞ −X lb /2 P X (x)P Y |X (y |x) log 2 PŶ (y ) QŶ (y ) dy dx.
(A20) At X lb → ∞, the vanishing terms∆,∇, ∆ 0 and ∇ 0 tends to 0, Hence, the limit of I gap is given by where the second term in (A20) is a nonnegative KL divergence term, hence, the I gap at X lb → ∞ is bounded by which is the expectation of the KL divergence over the nonzero range of X. According to Proposition 1, lim x→∞ D KL (P, Q|x ∈ [X lb , X ub ]) for NCX PDF P Y|X (y|x) and Gaussian PDF Q Y|X (y|x) tends to 0, therefore, the upper bound of the I gap also tends to 0. This completes the proof.

Appendix C
Proof of Proposition 2. Consider the input random variable X ∈ {0 ∪ [X lb , X ub ]} is separated into zero and nonzero sets. Then the probability density function of X can be where δ(x) denotes the Dirac delta function, and PX(x) denotes the PDF of the nonzero inputX. Similarly, the output random variable Y can also be separated with similar manners as P Y (y) = p 0 P Y|X (y|0) + (1 − p 0 )PŶ(y), where the PŶ(y) = X P Y|X (y|x)PX(x)dx denotes the PDF of the output corresponding to the nonzero input. Using the Taylor expansion as in Equation (A15), Appendix B, and considering the AWGN channel model, defined by P Y|X (y|x), the MI is given as where we changed the variable of integral in the first term of the last equality as y = y − X lb .