1. Introduction
The volume of transferred information, including confidential information, is growing continuously. According to a Statista Research Department report [1], global data creation is projected to exceed 180 zettabytes by 2025. A competitive environment has emerged for designing and improving both attack systems and information security systems. These circumstances increase the mathematical and logical complexity and the degree of intellectualization of the algorithms, processes, and technical means in use. As a result, the effectiveness and dependability [2] (reliability and security) of telecommunication systems and networks, as well as of their components that implement data protection functions, need to be improved.
Integrating methods of channel coding and cryptographic protection, or secure channel coding schemes, is one of the ways to increase the efficiency of information-processing tools, as well as to ensure data protection during its storage and transmission in telecommunication systems and networks.
Note that short packet transmission [3] is a key feature of modern wireless systems, ultra-reliable networks, sensor networks, massive machine-type communications (MTC), and IoT applications [4]. The prevalence of such systems and networks requires creating new approaches and adapting existing ones to ensure the integrity and confidentiality of the transmitted information. In particular, the computing resources required for channel coding and cryptographic protection, as well as the processing speed they provide, can play a decisive role.
This study considers the information interaction of MTC objects in a network with a dynamically changing structure. Each object of such a network, for example, a dynamic wireless sensor network [5], has its own unique system of commands or alerts. This system of commands or alerts forms an ensemble of messages to be agreed between the object of information interaction and other network participants.
1.1. Related Literature
Currently known secure-channel coding schemes are based on the McEliece cryptosystem [6,7,8,9], universal stochastic coding [10,11], ‘golden’ cryptography [12,13], perfect algebraic constructions [14,15], and the use of permutations [16,17].
This study develops an approach using permutations.
The methodology of integrated-information security based on non-separable factorial coding [
18,
19] uses a subset of the set of permutations
of numbers
as codewords. Each number
is encoded by a binary code with a fixed length of
bits. Such a conversion yields a non-standard, redundant frame structure that does not require a separate field for the syncword, maintains frame synchronization on the data signal, and allows the non-separable factorial code to be used as a transport mechanism in short-packet communications [20,21,22,23,24,25,26,27]. The cost of including syncwords is not negligible in such systems [28,29,30]. A non-separable factorial code makes it possible to search for frame boundaries effectively even at a bit error rate close to 0.5, which is important for information transmission under strong noise [31,32]. In addition, non-separable factorial coding may be a suitable tool to implement a cross-layer integrated approach to security and to achieve secure short-packet communication from the perspective of both cryptography and physical layer security [26,27].
Previous studies [33,34] investigate the ability of a non-separable factorial code to detect and correct communication channel errors. The efficiency of the code has been proven; it is achieved, among other factors, through the synchronization properties of the code [31,32]. The studies [33,34] use the binary Hamming distance between codewords.
In this paper, similarly to error-correcting Reed–Solomon coding [35], we consider symbols as the elements of a codeword. This approach is of interest for ensuring reliable transmission of permutations, in particular, for a three-pass cryptographic protocol based on permutations [36].
We introduce the following definition to distinguish between the binary Hamming distance used in previous studies [
33,
34] and the Hamming distance between permutations of symbols
.
Definition 1. The symbol Hamming distance between two permutations is the number of symbol positions in which the two permutations differ.
It is obvious that and . In addition, if and only if .
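Definition 1 can be illustrated with a minimal Python sketch. The function name `symbol_hamming_distance` and the example permutations are illustrative assumptions, not notation from the paper. The sketch relies only on the definition itself: the distance counts differing positions, so it is symmetric and equals zero only for identical permutations. Two distinct permutations of the same symbols can never differ in exactly one position.

```python
def symbol_hamming_distance(p, q):
    """Number of positions in which two equal-length permutations differ."""
    if len(p) != len(q):
        raise ValueError("permutations must have the same length")
    return sum(1 for a, b in zip(p, q) if a != b)


p = (0, 1, 2, 3)
q = (0, 1, 3, 2)   # differs from p in the last two positions
r = (1, 2, 3, 0)   # differs from p in every position

print(symbol_hamming_distance(p, q))  # 2
print(symbol_hamming_distance(p, r))  # 4
print(symbol_hamming_distance(p, p))  # 0
```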
Definition 2. A block code is a code formed by a subset of the permutations of a given length with a given minimum symbol Hamming distance between its codewords.
In this case, is the symbol code distance.
Let be the -code size equal to the number of its codewords.
Since the code size
determines the amount of information transmitted by each codeword equal to
bits, the use of a
-code of the maximum size is the most efficient in terms of channel capacity. The last statement also follows from the central problem of coding theory [
37,
38].
In the literature [39], such codes are called error-correcting permutation codes. These codes are used for error correction in powerline communications with M-ary frequency shift keying modulation [40].
There are lower bounds for
(in particular, Gilbert–Varshamov bounds and their improvements) as well as algebraic techniques for constructing
-codes [
39,
41,
42,
43,
44,
45,
46,
47]. For example,
,
, if
is a prime power then
and
[
41],
and
[42]. Studies [39,43] use automorphism groups to provide lower bounds. The authors of [44] use permutations invariant under isometries. The studies [45,46] use sequential partition and extension, parallel partition and extension, and a modified Kronecker product operation. The recent study [47] improves the lower bounds using permutation rational functions.
In this study, in contrast to the known algebraic methods, we present a statistical method for constructing such a code and estimating its size. We also take into account that the code must be unique for each object in the dynamic wireless sensor network, and that the participants of the information exchange can agree on the code by applying a cryptographic protocol [36]. Under these conditions, increasing the variability and unpredictability of the codeword ensemble is a key condition for ensuring the protocol strength.
1.2. Main Contributions
We will generate codewords for the code by enumerating a set of permutations of a given length and selecting those permutations whose symbol Hamming distance to every previously selected permutation is not less than the specified code distance. Constructing the code is complicated by the fact that, as the permutation length M increases, generating all M! permutations becomes practically impossible.
The goal of the study is to determine the dependence of the code size on the values of and when using the proposed statistical method.
To achieve this goal, the following tasks must be solved.
A statistical algorithm to generate codewords for a -code must be developed and implemented.
An analysis of the distribution frequency of a random value for a given number of implementations of the codeword generating algorithm must be performed. The distribution law for must be determined.
The dependences of the average and the maximum -code size, its standard deviation from the parameters and must be explored.
A technique to estimate a -code size depending on parameters and must be developed and applied.
1.3. Paper Structure
This paper is organized as follows:
Section 2 describes an algorithm to generate a set of codewords, analyses the dependence of the
-code size on the values of
and
, and presents a technique for constructing an approximation polynomial for the code size dependencies;
Section 3 shows the results of implementing the developed technique for
and discusses the results, and
Section 4 is the conclusion.
2. Materials and Methods
2.1. Algorithm to Generate Codewords
Figure 1 shows the algorithm to generate a set of codewords of a
-code.
Initially, the set of codewords is empty. The initial complete set of permutations is generated in random order. The first permutation is selected and placed into the set of codewords being generated. Then the second permutation is selected, and its symbol Hamming distance to the first codeword is calculated. If the calculated distance is no less than the given value, the second permutation is also placed into the set of codewords. Otherwise, the next permutation from the initial set is selected. The process continues in the same way: a permutation is selected, its Hamming distances to all codewords selected so far are calculated, and the permutation is placed into the set of codewords if all these distances are no less than the given value. The process stops when all permutations of the initial set have been enumerated. After that, the number of permutations in the set of codewords is counted.
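The selection procedure described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `generate_codewords`, `length`, and `min_distance` are hypothetical, the complete set is built with `itertools.permutations` and then shuffled, and the example parameters are chosen only to keep the run short.

```python
import itertools
import random


def symbol_distance(p, q):
    """Symbol Hamming distance between two permutations."""
    return sum(a != b for a, b in zip(p, q))


def generate_codewords(length, min_distance, rng):
    """Greedy selection over the complete set of permutations in random order."""
    pool = list(itertools.permutations(range(length)))
    rng.shuffle(pool)                     # random order of the initial set
    codewords = []                        # initially empty
    for candidate in pool:
        # Accept the candidate only if it is far enough from every codeword so far.
        if all(symbol_distance(candidate, c) >= min_distance for c in codewords):
            codewords.append(candidate)
    return codewords


code = generate_codewords(length=5, min_distance=4, rng=random.Random(1))
print(len(code))
```

Because the result depends on the random order of the initial set, repeated runs with different seeds give a distribution of code sizes rather than a single value.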
A complete set of permutations can be constructed either by generating the permutations in a certain order (for example, lexicographic) and then shuffling them, or by using random factorial numbers and their bijective transformation into permutations. Storing the permutation numbers (or the corresponding factorial numbers) instead of the permutations themselves reduces the required amount of memory; however, the additional transformations increase the time needed to generate and output a permutation.
To reduce the amount of memory required to store the full set of
permutations, the initial set of permutations can be generated simultaneously with their analysis. In this case, in the algorithm of
Figure 1, there is no block for generating the initial set of permutations, and the block for selecting the next permutation is replaced by a block for generating the next permutation (
Figure 2).
At the same time, the uniqueness check of the generated permutation is additionally implemented in the new block.
Table 1 shows estimates of the mathematical expectation
and the standard deviation
of the code size, as well as its maximum value
obtained as a result of implementing the algorithm shown in
Figure 1 for 10,000 experiments with
and
.
Figure 3 shows a histogram of the distribution of a random value
for
and
.
Let the null hypothesis state that the distribution of a random value
corresponds to a normal distribution. Pearson’s chi-squared test [48] gives no reason to reject the null hypothesis: the achieved p-value is 0.2768.
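A normality check of this kind can be sketched with the Python standard library. The binning used in the study is not stated here, so this sketch makes its own assumptions: equiprobable bins under the fitted normal law, two estimated parameters (mean and standard deviation), and a number of bins chosen so that the degrees of freedom are even, which allows a closed-form p-value. The function name `chi_squared_normality` and the synthetic sample are hypothetical.

```python
import math
import random
import statistics


def chi_squared_normality(sample, bins=7):
    """Pearson statistic and p-value for H0: the sample is normally distributed.

    Assumes equiprobable bins and bins - 3 degrees of freedom. The closed-form
    p-value used below is valid only when bins - 3 is a positive even number.
    """
    dof = bins - 3
    if dof <= 0 or dof % 2:
        raise ValueError("this sketch needs bins - 3 to be a positive even number")
    fitted = statistics.NormalDist(statistics.fmean(sample), statistics.stdev(sample))
    edges = [fitted.inv_cdf(i / bins) for i in range(1, bins)]
    observed = [0] * bins
    for value in sample:
        observed[sum(value > edge for edge in edges)] += 1
    expected = len(sample) / bins
    statistic = sum((o - expected) ** 2 / expected for o in observed)
    half = statistic / 2
    # Survival function of the chi-squared law for an even number of degrees of freedom.
    p_value = math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(dof // 2))
    return statistic, p_value


rng = random.Random(0)
skewed = [rng.expovariate(1.0) for _ in range(5000)]   # clearly non-normal sample
print(chi_squared_normality(skewed))
```

The null hypothesis is rejected when the p-value falls below the chosen significance level; a p-value above that level, such as the 0.2768 reported above, gives no reason to reject it.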
The normality of the distribution of a random value is also confirmed for and : . However, for and .
Note that as the value of
increases, the implementation of the algorithm shown in
Figure 1 becomes more difficult, since the generation of a complete set of
permutations requires a significant amount of memory and processor time (
Figure 4). For example, storing of
permutations using a fixed length binary code to encode permutation symbols requires 209.37 MB of memory; for M = 15 this amount is 8.92 TB. These calculations do not take into account the need to store service information. If we add service information then the memory amount required to form a complete set of permutations in the Python programming language [
49] is 67 MB for
, 667 MB for
, 7.15 GB for
, and 70 GB for
. It is possible to somewhat reduce the amount of used memory by optimizing the program code. However, it is almost impossible to implement the algorithm shown in
Figure 1 on a standard modern workstation when
.
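The raw storage estimate can be checked with a short Python sketch. It assumes that each symbol of a permutation of length M is stored in the smallest fixed number of bits able to represent M distinct symbols, that service information is ignored, and that the units are binary (1 MB = 2^20 bytes, 1 TB = 2^40 bytes). The function name `raw_storage_bytes` is hypothetical. Under these assumptions the formula reproduces the 8.92 TB quoted for M = 15, and the 209.37 MB figure is reproduced for a length of 11, which is an inference from the arithmetic rather than a value stated in the text.

```python
import math


def raw_storage_bytes(length):
    """Bytes needed to store all length! permutations with fixed-length symbols."""
    bits_per_symbol = (length - 1).bit_length()   # enough bits for `length` symbols
    total_bits = math.factorial(length) * length * bits_per_symbol
    return total_bits / 8


print(round(raw_storage_bytes(11) / 2**20, 2))  # 209.37 (MB, binary units)
print(round(raw_storage_bytes(15) / 2**40, 2))  # 8.92 (TB, binary units)
```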
The average time to generate one permutation was determined experimentally by generating 1,000,000 permutations of a given length .
All experiments in this research were implemented in the Python programming language [
49] using the PyCharm Community Edition 2020.3 [
50] integrated development environment on a desktop personal computer with the following parameters:
Here, we provide the possibility to construct a -code for the values of that do not allow generating permutations in practice.
The approach proposed in this study is based on the following. The algorithm to generate a set of codewords shown in
Figure 1 is preserved. At the same time, the initial set of permutations is a proper subset of the complete set of
permutations. The size of such a proper subset is denoted by
.
2.2. Algorithms to Generate the Initial Set of Random Permutations
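Permutations of the initial set will also be generated randomly. Here, we consider two algorithms: the first obtains a factorial number by converting a random decimal number through division, and the second generates the digits of a factorial number individually at random.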
For example, permutation
with the basic permutation
can be generated with both the first and the second algorithm. The first algorithm: A decimal number
is generated, converted into a factorial number
, and then converted into a permutation syndrome [
52]
and a permutation
itself. The second algorithm: Each element of the syndrome
is generated separately and is then converted into a permutation
.
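The two ways of obtaining a random factorial number can be sketched in Python as follows. The exact syndrome construction of [52] is not reproduced here; as an assumption, the sketch decodes the factorial digits in the Lehmer-code manner, where each digit selects one of the symbols not yet used from the basic permutation. All function names are hypothetical.

```python
import math
import random


def number_to_digits(number, length):
    """Convert an integer in [0, length!) to factorial digits by repeated division."""
    digits = []
    for radix in range(1, length + 1):       # radices 1, 2, ..., length
        number, digit = divmod(number, radix)
        digits.append(digit)
    return digits[::-1]                      # most significant digit first


def digits_by_division(length, rng):
    """First algorithm (sketch): one random integer, converted by division."""
    return number_to_digits(rng.randrange(math.factorial(length)), length)


def digits_independently(length, rng):
    """Second algorithm (sketch): each factorial digit is drawn separately."""
    return [rng.randrange(radix) for radix in range(length, 0, -1)]


def digits_to_permutation(digits, base):
    """Decode factorial digits against a basic permutation (Lehmer-style)."""
    remaining = list(base)
    return [remaining.pop(d) for d in digits]


rng = random.Random(7)
base = list(range(6))
print(digits_to_permutation(digits_by_division(6, rng), base))
print(digits_to_permutation(digits_independently(6, rng), base))
```

The first variant manipulates one integer as large as the factorial of the permutation length, while the second draws small independent digits, so its digit generation can be distributed across workers.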
Both of the above algorithms to generate the initial set of permutations control the uniqueness of permutations within the set (
Figure 5).
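A minimal Python sketch of building the initial set with uniqueness control, and then applying the codeword selection to it, is given below. It is an illustration under assumptions rather than the authors' code: uniqueness is enforced with a hash set, digits are drawn one by one as in the second algorithm, and the parameter values in the usage lines are arbitrary.

```python
import math
import random


def symbol_distance(p, q):
    return sum(a != b for a, b in zip(p, q))


def random_permutation(length, rng):
    """Draw the factorial digits one by one and decode them immediately."""
    remaining = list(range(length))
    return tuple(remaining.pop(rng.randrange(len(remaining))) for _ in range(length))


def unique_random_permutations(length, count, rng):
    """Return `count` distinct random permutations in generation order."""
    if count > math.factorial(length):
        raise ValueError("count exceeds the number of permutations")
    seen, result = set(), []
    while len(result) < count:
        perm = random_permutation(length, rng)
        if perm not in seen:                 # uniqueness control within the set
            seen.add(perm)
            result.append(perm)
    return result


def code_size_from_subset(length, min_distance, subset_size, rng):
    """Size of the code selected greedily from a random subset of permutations."""
    codewords = []
    for candidate in unique_random_permutations(length, subset_size, rng):
        if all(symbol_distance(candidate, c) >= min_distance for c in codewords):
            codewords.append(candidate)
    return len(codewords)


rng = random.Random(2024)
sizes = [code_size_from_subset(8, 6, 500, rng) for _ in range(20)]
print(min(sizes), sum(sizes) / len(sizes), max(sizes))
```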
Note that the above algorithms to generate
permutations can also be applied in the block for generating the next permutation of the algorithm in
Figure 2. In this case, the algorithms will output the permutation for analysis instead of writing it to the memory.
To compare the speed of the two algorithms for generating random permutations shown in Figure 5, we evaluated only the parts in which they differ, namely the procedures for generating factorial numbers. The average time to generate one factorial number (Figure 6) was calculated from the generation of 10,000 numbers.
The obtained graphs indicate that, as the permutation length increases, the time to generate a factorial number with the first algorithm (Figure 5) grows much faster than with the second algorithm. In addition, unlike the first algorithm, the processes of the second algorithm in Figure 5 are convenient for parallelization. This makes it possible to further increase the performance of the algorithm.
In this paper, we will use the second proposed algorithm to generate the initial set of random permutations, .
2.3. Dependence of the -Code Size on the Values of , , and
We will use to denote block factorial code formed by a subset of random permutations, and will use to denote the size of -code.
Next, we determine the dependence of the size on the value of . Such dependence can be used both to determine the required value of when designing a data transmission system with a -code, and to evaluate the efficiency of the code constructed from random permutations.
We will determine the dependence experimentally. In this case, the values are formed as follows.
Let
. Then
or
where
is a step;
.
Let
. Here, we accept
for
. Values
for
are given in
Table 2.
Similarly to
Figure 3,
Figure 7 shows the histograms of the distribution of a random value
for
at
and
, constructed as a result of 10,000 experiments.
By analogy with the distribution of a random value
, we accept the null statistical hypothesis, which states that a random variable
is normally distributed. We apply Pearson’s chi-square test to test the null hypothesis.
Table 3 shows the
p-values obtained for
at
and
from
Table 2.
In
Table 3, the green highlights the cases where the normal distribution for
at the significance level of
is confirmed; and the red highlights the cases where the normal distribution for
is not confirmed. These results can serve as evidence that at large
values the normal distribution begins to be observed at smaller values of
.
Figure 8 shows the graphs of estimates of the mathematical expectation
, standard deviation
, and the maximum value
of the
-code size against the value
. The curves on
Figure 8 are obtained as a result of 10,000 experiments for
and
.
Figure 8 also shows the approximation curves [53] and their equations, as well as the coefficient of determination R² for the dependences of the estimates of the mathematical expectation and of the maximum value. A value of R² close to unity indicates that both dependences are accurately described by a second-degree polynomial.
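A second-degree least-squares fit and its coefficient of determination can be sketched with the Python standard library. The data points below are synthetic and purely illustrative; they are not values from Figure 8 or Table 4. The function names `fit_quadratic` and `r_squared` and the coefficient names `a`, `b`, `c` are assumptions.

```python
def fit_quadratic(xs, ys):
    """Least-squares coefficients (a, b, c) of y = a*x**2 + b*x + c."""
    s = [sum(x ** k for x in xs) for k in range(5)]            # power sums of x
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    # Augmented matrix of the normal equations for the unknowns (a, b, c).
    m = [[s[4], s[3], s[2], t[2]],
         [s[3], s[2], s[1], t[1]],
         [s[2], s[1], s[0], t[0]]]
    for i in range(3):                                         # Gauss-Jordan elimination
        pivot = max(range(i, 3), key=lambda r: abs(m[r][i]))
        m[i], m[pivot] = m[pivot], m[i]
        for r in range(3):
            if r != i:
                factor = m[r][i] / m[i][i]
                m[r] = [u - factor * v for u, v in zip(m[r], m[i])]
    return tuple(m[i][3] / m[i][i] for i in range(3))


def r_squared(xs, ys, coeffs):
    """Coefficient of determination of the fitted polynomial."""
    a, b, c = coeffs
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (a * x * x + b * x + c)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 0.1, 2.9, 10.2, 20.8]           # synthetic, roughly 2x^2 - 3x + 1
coeffs = fit_quadratic(xs, ys)
print(coeffs, r_squared(xs, ys, coeffs))
```

Extrapolation then amounts to evaluating the fitted polynomial at an argument outside the fitted range, which is reliable only as far as the quadratic trend continues to hold.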
Table 4 summarizes the coefficients
,
,
for the
and
approximation polynomials.
Note that according to Equation (2). Then the approximation functions can be easily calculated by setting the values of for the required .
2.4. Technique for Constructing an Approximation Polynomial
To construct approximations for dependencies and and, if necessary, to perform extrapolation to predict the behaviour of these functions at values exceeding the upper limit of the range of their statistical study, it is necessary to perform the next steps:
To calculate , to set and values, and to calculate ;
To generate dependencies and for the range of values determined in accordance with (2);
To determine approximation polynomials for and .
It is also possible to select the values of for constructing dependencies and in the opposite direction with respect to (1), from the smallest to the largest. In this case, the method to obtain approximations is as follows:
Values of
,
, and
are chosen. Values of
are calculated using an expression
where
. It is obvious that
;
Dependences and are also formed for the range of values determined in accordance with (3);
Quadratic approximation polynomials are calculated for and .
To obtain approximation polynomials of the form in the obtained expressions of the form , it is necessary to perform the replacement .
3. Results
Here, we apply the developed method for when it is necessary to predict the average and the maximum number of codewords with formed by and random permutations.
To construct approximations, we use the
values given in
Table 2 for the step
at
.
Figure 9 shows the graphs of the estimates of the mathematical expectation
and the maximum value
of the code size
against the value
. Each value of
and
was formed as a result of 10,000 experiments.
If we place values and into the expression (1) , and calculate the corresponding values of , we can get , . Note that , . Then the predicted values and are equal to 73.3836 and 81.2250 for and 77.1631 and 84.8191 for .
Figure 10 shows the histograms of the distribution of random values
and
constructed as a result of
experiments. The resulting average values are 72.8667 and 76.5667 (maximum values are 76 and 81).
Here, we determine the confidence interval for the obtained sample means [
54]:
where
is the sample mean;
is the corrected sample standard deviation, is the sample standard deviation;
is the number of experiments ;
is the upper quantile of Student’s t-distribution with degrees of freedom.
Let . Then, the confidence interval is for , and it is for .
The predicted values of 73.3836 and 77.1631 fall within the indicated confidence intervals.
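The confidence interval for a sample mean can be sketched in Python as the sample mean plus or minus the Student quantile multiplied by the corrected sample standard deviation and divided by the square root of the number of experiments. This is the standard form matching the quantities listed above; the sample below is hypothetical, and the quantile is passed in as a parameter because the Python standard library has no Student's t-distribution. The value 2.262 is the tabulated two-sided 95% quantile for 9 degrees of freedom and is used only for this illustration.

```python
import math
import statistics


def mean_confidence_interval(sample, t_quantile):
    """Confidence interval for the mean, given a Student's t quantile for n - 1 d.o.f."""
    n = len(sample)
    mean = statistics.fmean(sample)
    s = statistics.stdev(sample)             # corrected (n - 1) standard deviation
    half_width = t_quantile * s / math.sqrt(n)
    return mean - half_width, mean + half_width


# Hypothetical code sizes from 10 runs; not data from the study.
sample = [72, 74, 73, 75, 71, 73, 72, 74, 73, 73]
low, high = mean_confidence_interval(sample, t_quantile=2.262)
print(low, high)
```

A predicted value is consistent with the experiment at the chosen confidence level when it lies between the two returned bounds.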
Then, let
when
.
Figure 11 shows the graphs of the estimates of the mathematical expectation
and the maximum value
of the
-code size against the value
. Each value of
and
was formed as a result of 10,000 experiments.
By placing values and into the expression (1) , and calculating the corresponding values of we get , . Taking into account that , , the predicted values and are equal to 73.3922 and 80.8342 for and 77.1775 and 84.3992 for .
Let the step
be further increased.
Table 5 summarizes the predicted values
and
for
and
when
.
Table 5 shows that all predicted values fall within the indicated confidence intervals
for
and
for
.
Here, we calculate and present in
Table 6 the relative prediction error for the values given in
Table 5. We assume that the maximum number of reference points
forms the most accurate prediction.
The results in
Table 6 indicate that three points as far as possible from each other
are sufficient to obtain an approximation curve (an approximation polynomial of the second degree). At the same time, the authors recommend using four points
to construct such a curve.
Here, we discuss the study results.
The proposed algorithm for generating codewords makes it possible to construct a
-code with the required code distance. However, the obtained
values do not reach the known lower bounds [
39,
41,
42,
43,
44,
45,
46,
47]. For example,
. At the same time, paper [
39] gives the lower bound of 154 for the
-code size. The corresponding values for
are
vs. the lower bound of 42 and
vs. the lower bound of 77 in [
39]. However, we cannot say that the result is negative. First, in this study, we used not an algebraic, but a statistical method for code construction. Second, the proposed statistical method, unlike the algebraic method, allows for the construction of a unique system of commands or alerts for dynamic wireless sensor network objects. Note also that increasing the
-code size may lead to a decrease in the number of different possible
-codes, which can be constructed for the defined values of
,
, and
. In turn, the number of different possible
-codes is important for applying the
-code both in secure-channel coding schemes and for constructing a unique system of commands or alerts for MTC objects. At the same time, we do not deny the need to continue the search for new effective and fast statistical methods for
-code construction or to improve the proposed method. Determining the balance between the
-code size and the number of possible different
-codes is a topical problem that can be the subject of further research.
The study has shown that the relative error in predicting the size of -code increases with increasing the hypothetical number of permutations in the initial set, as well as with increasing the step . However, the nature of this dependence is not obvious and can be further investigated.
4. Conclusions
In this paper, we have developed and implemented a statistical algorithm that generates the codewords of a block factorial code by enumerating a set of permutations of a given length and selecting those permutations whose symbol Hamming distance to all previously selected codewords is not less than the specified code distance.
We applied two algorithms to generate a random factorial number. The first algorithm converts a random decimal number by division, and the second algorithm randomly generates the individual digits of a factorial number. We found that the second algorithm is faster.
We have determined experimentally the dependences of the average and the maximum values of the size of a -code constructed from a subset of permutations, on the value of .
A technique to compute approximation quadratic polynomials for the determined dependences of the average and the maximum values of the -code size has been developed. A key feature of this technique is to use the function (1) of a double logarithm and to use a quadratic polynomial. The approximation polynomials and their corresponding curves can be used to extrapolate the dependencies and predict their behavior at values exceeding the upper limit of their statistical study range.
Finally, we confirmed the effectiveness of the developed technique to estimate the average and the maximum size values for and at the upper limit of the statistical study range . The prediction relative error of -code size did not exceed the value of 0.72% obtained for and .