1. Introduction
Impairing the robustness of cryptographic applications is a sensitive topic. The interest in direct attacks, vulnerabilities, and backdoors for all currently used ciphers is certainly justified by economic and geopolitical reasons. If a vulnerable implementation of a cryptographic algorithm is surreptitiously distributed, an “evil” actor or a national security agency might easily access any sort of sensitive and precious information. However, “legal” actors might also exist that openly mandate or encourage the adoption of cryptographic implementations that include backdoors, in order to realize “key escrow” mechanisms. For instance, a country might legislate that judiciary representatives must always be able to recover any kind of encrypted communication involved in a criminal case.
Until a few years ago, it was only conjectured [1] that major security agencies were able to decrypt a large portion of the world’s encrypted traffic, mainly thanks to vulnerabilities hidden in pseudo-random generators or major cryptographic algorithms and applications. Some examples of this practice might be the Hans Bühler case in 1994 [2], the Dual-EC algorithm proposed in 2004 by the US National Institute of Standards and Technology [3,4], and perhaps the OpenBSD backdoor incident that emerged in 2010 [5,6]. However, in the last few years many government bodies have openly talked about enforcing by law “responsible encryption” or “exceptional access to encrypted documents” [7,8]: essentially, these are just more palatable words for “key escrow” and “backdoors”.
The approach to backdoor construction has changed in the last few years. In the past, the focus was mainly on weaknesses in pseudo-random generators or software implementations that might allow attackers to predict some secret data of the target users. Nowadays, the emphasis is on theoretical backdoors based on mathematical properties of the cryptographic primitives. The main advantage of this new approach is that it is very difficult to discover a mathematical backdoor by just looking at the cryptographic algorithm. For example, Bannier and Filiol [9] showed in 2017 how a block cipher similar to the Advanced Encryption Standard (AES) can be devised so that it includes, by design, a hidden mathematical backdoor that allows a knowledgeable attacker to effectively break the cipher and recover the key.
Evil actors and legal actors pursue very different goals. This fact justifies the adoption of very different backdoor mechanisms. An evil actor is primarily concerned with how convenient it is to trigger the backdoor and, secondarily, with how well the backdoor mechanism is hidden from the final user; however, it is not crucially important to also preserve the security of the cipher. Thus, a backdoor introduced by an evil actor might even be a vulnerability hidden in a cipher implementation such that anyone knowing about its existence could easily break the cipher and recover the encrypted messages. For instance, a mechanism that can be easily exploited might be based on a semi-prime generator that selects just one of the primes at random, while the other prime is fixed. The Euclidean algorithm applied to two different vulnerable semi-primes outputs the fixed prime; thus, anyone can easily break the cipher even if the fixed prime is not known in advance. Perhaps not surprisingly, there are a lot of very weak public keys on the Internet [10,11]. However, a legal actor does not want to significantly impair the security of a cryptographic algorithm, because the final users might just refuse to adopt an insecure cipher. A backdoor introduced by legal actors is likely a vulnerability embedded in a cryptographic implementation that allows only “authorized” actors to decipher the encrypted messages without knowing the private keys of the final users. Usually, this means that the retrieval of the encrypted messages can be performed only if the actor knows a secret escrow key related to the backdoor itself.
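The fixed-prime weakness mentioned above can be illustrated with a minimal Python sketch. The function name and the toy values are assumptions made only for this illustration: three well-known Mersenne primes stand in for real RSA primes, and no real key generator is being reproduced.

```python
from math import gcd

def shared_prime_attack(n1, n2):
    """Return (shared, cofactor1, cofactor2) if n1 and n2 share a
    nontrivial common factor, otherwise None."""
    g = gcd(n1, n2)  # Euclidean algorithm
    if g == 1 or g == n1 or g == n2:
        return None
    return g, n1 // g, n2 // g

# Toy values (assumptions): Mersenne primes used as stand-ins.
FIXED_PRIME = 2**61 - 1          # hypothetical prime reused by a flawed generator
N1 = FIXED_PRIME * (2**31 - 1)   # first vulnerable semi-prime
N2 = FIXED_PRIME * (2**19 - 1)   # second vulnerable semi-prime

print(shared_prime_attack(N1, N2))
```

Neither semi-prime has to be factored directly: a single greatest common divisor computation exposes the fixed prime, which is why such a mechanism offers no protection against third parties.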
Among the most widespread cryptographic algorithms, RSA [12] deserves special consideration, because it is conveniently used to protect any kind of sensitive data transmitted over the Internet. It is commonly believed that RSA has been properly designed and that, by itself, it does not contain hidden vulnerabilities. However, a large number of attacks on RSA have been proposed since its invention. These attacks span from directly factoring the semi-prime in the public key to exploiting weaknesses in the generation algorithm for the prime factors; for a survey, see [13]. Furthermore, several RSA backdoors have been proposed: they are specially crafted values in RSA parameters that allow a knowledgeable attacker to recover the private key from publicly available information. For an in-depth discussion of several RSA backdoors, see [14].
This article proposes a new idea to inject backdoors in RSA key generators, which was loosely inspired by the concept of “implicit hints” of May and Ritzenhofen [15] in pairs of semi-primes. However, this idea differs significantly from the backdoors based on implicit hints and, as far as I know, from any other published backdoor proposal (a preliminary version of this work has been published as a preprint [16]).
More specifically, May and Ritzenhofen proposed the implicit factorization problem (IFP), which is based on the premise that two or more semi-primes with factors sharing some common bits can be factored with some variants of Coppersmith’s algorithm [17,18]. The authors stated that “[…] one application of their result is malicious key generation of RSA moduli, i.e., the construction of backdoored RSA moduli”. In my opinion, however, a backdoor based on shared bits, as described in [15], is not really effective for RSA. It is practically not possible to exploit this backdoor in large “balanced” semi-primes, such as current RSA moduli, because the time required by Coppersmith’s algorithm to factor a semi-prime grows exponentially as the size difference of the factors becomes smaller. Moreover, this vulnerability is self-evident to anyone looking at the factors, because there would be a long run of identical bits in the two values; thus, such backdoors cannot be easily concealed from the owner of the private keys.
The new idea is the following: rather than including in the bit expansions of the factors a long run of identical bits, the bit expansions include portions of correlated bits, where the correlation is bound to a secret designer key not known to the owner of the backdoored keys. In practice, the backdoor designer enforces some mathematical conditions on the values of the factors, such as congruences modulo a large prime (of nearly the same size as the factors), which acts as the designer key.
Following the IFP approach in [15], I first devised a backdoor (named the TSB, Twin Semi-prime Backdoor) based on mutual correlations between the factors of two distinct semi-primes. Afterwards, I devised a simpler backdoor (named the SSB, Single Semi-prime Backdoor) based on the same idea but suitable for injecting a backdoor in a single semi-prime. The backdoors can be applied to RSA and also to any other cipher whose security is based on the difficulty of the integer factorization of semi-primes.
A key difference from the IFP approach is that in triggering the backdoors, that is, in order to factor the semi-prime(s) by exploiting the designer key, there is no need to apply some variant of Coppersmith’s algorithm. Therefore, if the value of the designer key is known, factoring the semi-prime(s) is easy and efficient. However, if the designer key is not known, there seems to be no efficient way to factor the semi-prime(s). Moreover, without the designer key, there seems to be no efficient way to detect the existence of the backdoor, even when looking at the distinct prime factors of the semi-prime(s). Of course, significant progress in quantum computing might affect the robustness of the proposed backdoors. However, such progress will likely have a significant impact on all aspects of the RSA algorithm.
The rest of the article is organized as follows. Section 2 includes some mathematical notation and an introduction to the basic RSA algorithm. Section 3 includes a discussion of the prior works related to RSA backdoors and the implicit factorization problem (IFP). Section 4 presents the simpler backdoor, the SSB, while Section 5 presents the more sophisticated backdoor for a pair of semi-primes, the TSB. Finally, Section 6 includes the conclusions of this work.
2. Preliminaries
In this article, denotes the relation in which is a multiple of c; often the shorter notation will be used. The notation denotes the operation remainder of the division ; hence, and .
If is an integer, its size in bits is defined as . Writing means that x and y are equal or differ by at most one, while means that x and y differs by a value negligible with respect to the sizes of x and y. If , for any large n, then ; that is, both and may be considered to be approximately equal to n, ignoring a difference in size.
If h is an integer, denotes the k most significant bits of h (a value from 0 to $2^k - 1$), while denotes the k least significant bits.
A semi-prime is a number N such that where p and q are primes. Therefore, . If , then the semi-prime is said to be balanced. When considering sequences of semi-primes (), it is assumed that they have a common size , for every i; furthermore, the primes have a common size ; it follows that all primes have the same size .
The RSA public key cryptosystem was invented by Rivest, Shamir, and Adleman [12] in 1977. In its simplest form, the algorithm is based on a balanced semi-prime $N = p \cdot q$ and a couple of exponents e, d such that $\gcd(e, \varphi(N)) = 1$ and $e \cdot d \equiv 1 \pmod{\varphi(N)}$. Here, $\varphi(N)$ denotes Euler’s totient function, which can be easily computed as $(p-1)(q-1)$ if the prime factors p and q are known. Theoretically, the value of e could be random, while the value of d can be computed from e and $\varphi(N)$ using the Extended Euclidean algorithm. The pair $(N, e)$ is the “public key” of RSA, and the encryption function is $E(m) = m^e \bmod N$. Either the pair $(N, d)$ or the pair $(p, q)$ is the “private key”, and the decryption function is $D(c) = c^d \bmod N$. Of course, factoring N allows an attacker to recover the private key from the public key, because from p and q we can compute $\varphi(N)$ and then $d$.
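As an illustration of the textbook scheme just described, here is a minimal Python sketch. The function names, the public exponent 65537, and the toy Mersenne primes (which do not form a balanced semi-prime) are assumptions chosen only to keep the example short; no padding is applied, so this is not a secure implementation.

```python
from math import gcd

def rsa_keygen(p, q, e=65537):
    """Build a textbook RSA key pair from two known primes."""
    N = p * q
    phi = (p - 1) * (q - 1)          # Euler's totient of N
    if gcd(e, phi) != 1:
        raise ValueError("e must be coprime with phi(N)")
    d = pow(e, -1, phi)              # modular inverse (Python 3.8+)
    return (N, e), (N, d)

def rsa_encrypt(m, public_key):
    N, e = public_key
    return pow(m, e, N)

def rsa_decrypt(c, private_key):
    N, d = private_key
    return pow(c, d, N)

# Toy primes (assumption): two Mersenne primes, far too small for real use.
P, Q = 2**61 - 1, 2**31 - 1
PUBLIC, PRIVATE = rsa_keygen(P, Q)
CIPHERTEXT = rsa_encrypt(42, PUBLIC)
print(rsa_decrypt(CIPHERTEXT, PRIVATE))
```

The sketch also makes the last remark concrete: `rsa_keygen` needs only p and q to derive d, so whoever factors N obtains the private key.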
3. Related Work
Many authors proposed to classify backdoors embedded in cryptographic applications according to several different criteria. Following [19], there exist three types of backdoors: (1) weak backdoors, (2) information transfer via subliminal channels, and (3) SETUP mechanisms. Weak backdoors are based on modifications of the cryptographic protocol such that it would be possible for anyone to break the cipher and recover the secret data. Vulnerabilities falling under the information transfer via subliminal channels category allow an attacker to exploit the cryptographic protocol in such a way as to create a hidden communication channel that cannot be intercepted or unambiguously detected. Finally, SETUP (Secretly Embedded Trapdoor with Universal Protection) mechanisms create vulnerabilities in the cryptographic protocols that cannot be easily exploited by third-party attackers.
SETUP mechanisms were first proposed by Young and Yung [20,21] in 1996: they coined the term “kleptography” to denote the usage of cryptographic primitives in order to design “safe” backdoors in other cryptographic protocols. Following the classical distinction between asymmetric and symmetric cryptography, SETUP mechanisms can lead to asymmetric backdoors and symmetric backdoors.
In an asymmetric backdoor, the information required to recover the encrypted messages is protected by an asymmetric cipher. Usually, this means that some data that allow an actor to recover any user private key are encrypted with the public key of the designer of the RSA implementation and stored inside the corresponding user public key. Any actor that knows the corresponding designer private key may extract the data from the user public key and decipher them to recover the user private key. Notice that in this case the RSA implementation is “tamper resistant”: even reverse engineering cannot reveal the designer private key.
In a symmetric backdoor, however, the designer key that allows an actor to recover the user private key from the user public key is stored in some form inside the RSA implementation itself. To be secure and undetected, the RSA implementation (perhaps a physical device) must be “tamper proof”.
Existing RSA backdoors may also be categorized according to the place where the backdoor’s specific data are stored: either in the semi-prime N alone or also in the exponent e of any public key $(N, e)$. “Exponent-based” backdoors are somewhat easier to devise, because e could theoretically be any random value coprime with $\varphi(N)$. However, most RSA implementations make use of special fixed values for the public exponent, such as small values or values with a small Hamming weight, in order to improve the efficiency of the RSA algorithm. Thus, exponent-based backdoors cannot be easily hidden from the final user and can be perceptibly slower than honest RSA implementations. Backdoors embedded in the public key’s semi-prime do not limit the choice of the public exponent; however, they must address a crucial problem: how to encode information about the factorization of the semi-prime in the semi-prime itself, in such a way that the information is encrypted with a secret key and, possibly, the pair $(p, q)$ is indistinguishable from a pair of primes generated by an honest RSA implementation.
This article proposes two backdoors embedded in the semi-primes of the RSA’s public keys; as a matter of fact, the backdoors apply to any cryptographic protocol based on the integer factorization of semi-primes. Related work concerning exponent-based backdoors is not further discussed here; examples of these backdoors can be found in [14,22,23,24].
3.1. Symmetric Backdoors
The proposed SSB algorithm implements a symmetric backdoor, because the escrow key is fixed and hard-wired in the hardware or software device that generates the vulnerable semi-primes. As we shall see, the TSB might be considered either a symmetric or an asymmetric backdoor.
The first RSA backdoor was proposed by Anderson [
25] in 1993. It is a symmetric backdoor embedded in the public key’s semi-prime: let
be an
m-bit secret prime (the “backdoor key”), and let
and
be pseudo-random functions that, given a seed in argument, produce a
-bit value (in the original article,
and
). For any vulnerable
-bit semi-prime
, let
be
-bit random numbers that are coprime with
, and let
and
. Given
N and
, it is possible to compute
, then factor the
m-bit number
, and finally compute
p and
q. Kaliski [
26] proved that it is possible to discover the backdoor by either computing the continued fraction
, because the expansion likely contains an approximation of the fraction
, or by finding a reduced basis of a suitable lattice built on the primes of two vulnerable moduli. He also showed that the backdoor can be detected by the lattice method when 14 or more non-factored vulnerable moduli are available. It is easy to observe that Kaliski’s detection algorithm can be easily defeated by introducing a “dynamic backdoor key” whose exact value depends, for instance, on an incremental counter. However, another drawback of Anderson’s backdoor is that
; hence, triggering the backdoor for currently used public key sizes might require factoring a too large integer.
The first backdoor proposed in this article, the SSB, is similar to Anderson’s construction, in that triggering the backdoor involves, as a first step, computing the remainder of the integer division of the semi-prime by the designer (escrow) key. However, a key difference from Anderson’s idea is the form of the primes p and q, which allows the SSB to escape detection by Kaliski’s algorithms and to avoid factoring a large integer when exploiting the backdoor.
In 2003, Crépeau and Slakmon [
23] presented, among several other exponent-based backdoors, a semi-prime-based backdoor that relies on Coppersmith’s attack [
18] and encrypts the factor
p in the RSA modulus
in such a way that the bits in
have the correct distribution for a random semi-prime, while the middle
bits of
N are an encryption, via a pseudo-random function
, of
. The SSB and TSB backdoors use an entirely different mechanism and do not rely on Coppersmith’s attack, which means that they can be efficiently exploited even on very large balanced semi-primes.
In 2008, Joye [27] studied the performance of generating a semi-prime N in which some bits are prescribed; he developed as an example an RSA symmetric backdoor based on Coppersmith’s attack in which some of the bits of p are encrypted in q. While this study is relevant when analyzing the generation times of any semi-prime backdoor, his proposal is entirely different from the present one.
The symmetric backdoor proposed by Patsakis [28] in 2012 is based on yet another idea: the parameterized, randomized backdoor algorithm decomposes an integer as a sum of squares in a way depending on a designer’s secret parameter. The backdoor consists of imposing that the semi-prime, once decomposed using the secret parameter, yields a nonlinear system that can be easily solved because its solutions are properly bounded.
In 2017, Nemec, Sys, and others [
29] exposed ROCA (Return of Coppersmith’s Attack), a critical vulnerability (perhaps unintentional) in the key generation algorithm of the
RSALib library, which is written, adopted, and distributed to third parties by Infineon, one of the top producers of cryptographic hardware devices. This work raised much interest because the flaw was already present in devices produced in 2012 and the total number of affected devices and, consequently, vulnerable keys is huge. In any
generated by the flawed
RSALib, all primes
p and
q have the form
, where
is the
primorial number composed by the product of the first
t primes, and
k,
a are random integers. The values of t for semi-primes of bit length 512, 1024, 2048, and 4096 are, respectively, 39, 71, 126, and 225. This means that the number of truly random bits in each of the primes is reduced, respectively, to 98, 171, 308, and 519. In order to find the factors of a vulnerable semi-prime, a variant of Coppersmith’s attack is used: it is possible to efficiently factor
when the value
is known. Hence, the recovering procedure determines a suitable divisor
M of
of size
(to reduce the search space for
a), guesses an exponent a, computes $65537^a \bmod M$, and factors
N. It is also easy to verify whether a given key is flawed:
N is likely vulnerable if the discrete logarithm
exists. Actually, this logarithm can be easily computed by the Pohlig–Hellman algorithm [
30] because
is the product of many small consecutive primes. Hence, ROCA arguably belongs to the weak backdoor category.
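The detection idea can be illustrated with a simplified Python sketch. It does not compute the discrete logarithm with the Pohlig–Hellman algorithm; instead it checks a necessary condition, prime by prime: the residue of N modulo each small prime must lie in the subgroup generated by 65537. The function names, the short list of primes, and the toy values are assumptions made for this illustration only, and the toy numbers are not required to be prime.

```python
from math import prod

GENERATOR = 65537
# Toy subset of small primes (assumption); real moduli involve many more.
TOY_PRIMES = [3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37]

def generated_subgroup(g, r):
    """Return the set of powers of g modulo the prime r."""
    members, x = set(), 1
    while x not in members:
        members.add(x)
        x = x * g % r
    return members

def has_roca_fingerprint(N, primes=TOY_PRIMES):
    """Necessary condition only: N mod r must be a power of 65537 for every r."""
    return all(N % r in generated_subgroup(GENERATOR % r, r) for r in primes)

# Numbers of the form k*M + (65537^a mod M), with M the product of the toy primes.
M = prod(TOY_PRIMES)
P_LIKE = 3 * M + pow(GENERATOR, 5, M)
Q_LIKE = 8 * M + pow(GENERATOR, 11, M)

print(has_roca_fingerprint(P_LIKE * Q_LIKE))
print(has_roca_fingerprint(323))
```

With only a handful of primes, an honest modulus can pass the test by chance; the check becomes selective as the list of primes grows, because each prime for which 65537 generates a proper subgroup filters out a fraction of the random moduli.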
3.2. Asymmetric Backdoors
The proposed TSB algorithm can be used to implement both symmetric and asymmetric backdoors. In fact, the TSB makes use of an embedded designer key but also generates two distinct semi-primes. If both semi-primes are used to build two distinct public keys, both available to a third-party attacker, then tampering with the TSB device may expose the designer key and break the keys; in this first case, the TSB behaves as a symmetric backdoor. Alternatively, the TSB can be used to generate a public key (from one of the generated semi-primes) and a dedicated escrow key composed of the hard-coded large prime inside the device and the other semi-prime, which must be considered the designer’s secret key. This is a reasonable scenario for cryptographic keys used in a highly-secure work environment. In this second case, the TSB must be considered an asymmetric backdoor, because tampering with the device is not enough to break an already generated key.
The first examples of asymmetric backdoors proposed by Young and Yung [
20] in 1996 were exponent-based. However, that article also includes the description of an asymmetric semi-prime-based backdoor named PAP, for “Pretty Awful Privacy”. The backdoor designer defines a designer’s RSA public key
and private key
, where
. Let
and
be invertible functions depending on a fixed key
K that transform a seed of
bits in a pseudo-random value of
bits. In order to create a backdoor, the designer first chooses a prime
p of bit length
at random then searches the smallest value
K such that
.
is then encrypted as
. The RSA semi-prime
N results from the search of a prime
q such that the
most significant bits of
coincide with
. The attacker can easily break the public key by extracting
from
N then starting an exhaustive search of the value for
K that, when applied to the inverse permutations
and
, permits the extraction of the proper factor
p using the RSA private key
.
In a series of articles published between 1997 and 2008, Young and Yung [
21,
31,
32,
33] proposed several kleptographic backdoors for RSA using different cryptographic algorithms for embedding the factor
p in
N. Specifically, in [
21] the backdoor PAP2 is embedded in the RSA semi-prime via the ElGamal protocol [
34]; that is, encrypting
p in
N is based on a Diffie–Hellman key exchange. In [
31] the backdoor PP, for “Private Primes”, is based on Rabin’s cryptosystem [
35]; it also differs from the one described in [
21] because it uses non-volatile memory to store the number of generated backdoored keys so as to lower the probability of producing the same key twice. In [
32] the encryption of the factor
p inside the semi-prime
N is achieved by means of an elliptic curve Diffie–Hellman key exchange. In 2008, Young and Yung [
33] revisited the backdoor proposed in [
32] and implemented it on the OpenSSL library. After some optimization effort, this implementation was made faster than the original OpenSSL RSA key generation methods.
In 2010, Patsakis [
28,
36] proposed yet another kleptographic mechanism that relies on Coppersmith’s attack and forges
p and
q so that the most significant bits of both of them are of the form
, where
a is a secret design parameter,
r is a random value, and
is the designer’s asymmetric public key.
In 2016, Wüller, Kühnel, and Meyer [
37] proposed an RSA backdoor called PHP, for “Prime Hiding Prime”, in which the information required to factor
N is hidden in
N itself. The idea is to select a prime
p such that
is a prime, where
is the RSA public key of the designer. To factor
, the designer computes
. An improvement of PHP, called PHP’, is also described in [
37]: here,
, where
s is the concatenation of
random bits and
. Half of the bits of
p are enough to recover the factorization of
N thanks to the Coppersmith’s attack.
Markelova [
19] revisited Anderson’s idea for a symmetrical backdoor and devised SETUP mechanisms that protect the backdoor by means of some public key algorithms, in particular, based on discrete logarithm problems on both finite fields and elliptic curves. The author also presented a SETUP backdoor exploiting the Chinese Remainder theorem. The article [
19] also includes a discussion of the similarities between these SETUP backdoors and the ROCA backdoor.
3.3. The Implicit Factorization Problem
In 1985, Rivest and Shamir [
38] introduced the
oracle complexity as a new way to look at the complexity of the factorization problem (and the related RSA attack): they showed that the semi-prime
N can be factored in polynomial time if an oracle provides
of the bits of
p. In 1996, Coppersmith [
17,
18] improved the result by showing that an explicit “hint” about the top half bits of
p is sufficient for factoring
N in polynomial time. In particular, Coppersmith described some algorithms based on lattice reduction and the LLL procedure [
39] to find small integer roots of univariate modular polynomials or bivariate integer polynomials. Later [
40,
41], these algorithms were reformulated in simpler ways and heuristically extended to the multivariate polynomial case.
The seminal article [
15] focusing on “implicit hints” was published in 2009 and it is due to May and Ritzenhofen. An oracle gives an implicit hint when it does not output the value of some bits of one of the factors of the semi-prime; rather, the oracle outputs another semi-prime whose primes share some bits with the factors of the original semi-prime. The authors formally introduced the implicit factorization problem (IFP) and showed that two semi-primes
and
can be factored in time
if
, with
. The algorithm is based on a lattice reduction: the search for the unknown primes
is reduced to a search for a basis of a suitable lattice by means of the quadratic Gaussian reduction algorithm. This result implies that only highly imbalanced semi-primes can be factored, because
; hence,
. The authors also extended this result to
semi-primes and showed that a polynomial algorithm based on the Lenstra-Lenstra-Lovász lattice basis reduction (LLL) algorithm [
39] exists if
. For the balanced case, this result is not useful, because it means that all
primes are identical; hence, they can be easily recovered by the Euclidean algorithm. However, the authors also showed that their method can be used to factor
k balanced semi-primes when some additional conditions are satisfied and
bits are discovered by brute force.
In the following years, many articles improved and extended the results of May and Ritzenhofen: further details can be found in a survey [
42] published in 2018.
All attacks and vulnerabilities based on these results assume that the factors of vulnerable semi-primes share some identical bits. From a practical point of view, backdoors relying on shared identical bits cannot be easily concealed from anyone looking at the factors, that is, from the owner of the private key. Furthermore, all the results cited in this section are based on some variants of Coppersmith’s algorithms [17,18]. In contrast, the proposed backdoors generate semi-primes whose factors have no shared bits and do not require Coppersmith’s algorithm. Therefore, they are difficult to detect and are much more efficient when applied to balanced semi-primes with a large size, such as those used in the current RSA public keys.
4. SSB: A Backdoor Embedded in a Single Semi-Prime
This section presents the SSB (Single Semi-prime Backdoor), a proposal for a new backdoor encoded in the value of a semi-prime N. The section first describes the vulnerability and how the semi-prime is generated; then, it describes the procedure to efficiently factor the semi-prime, provided that the corresponding escrow key is known. Finally, the section reports an analysis of the theoretical and practical efficiency of the backdoor.
4.1. Generation of a Vulnerable Semi-Prime
The first step of the generation of a vulnerable semi-prime is choosing an “escrow key”, which is a secret value that must be known in order to detect and exploit the backdoor. The escrow key is a prime T of a size slightly smaller than the size of the primes in the semi-prime. Thus, if is the reference bit length of the primes (e.g., for RSA-1024), then , where typically for ; a good value for , 1024, and 2048 appears to be . The backdoor designer must also choose the value of a constant K, which must be related to the value of , as discussed later; in particular, it will be shown that a good practical choice may be .
In order to create a vulnerable semi-prime, two distinct primes p and q, each of them having a bit length of roughly , must be generated. The backdoor exists whenever the following condition holds:
H0. There exists a positive integer k with $1 < k \le K$ such that $q \equiv k \cdot p \pmod{T}$.
Algorithm 1 below can be used to generate the two primes
p and
q satisfying the condition H0. It is based on Dirichlet’s theorem stating that there are infinitely many primes of the form
if
(consider
and
). The semi-prime is then computed as
.
Algorithm 1: Generation of a vulnerable semi-prime with escrow key T
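The following Python sketch shows one way a generator compatible with condition H0 could work. It is a hypothetical reconstruction, not the article's Algorithm 1: it assumes that H0 reads q ≡ k·p (mod T) with 1 < k ≤ K, an assumption consistent with the numerical values of the running example below. All function names are illustrative, and the Miller–Rabin test uses a fixed set of small bases, which is adequate for a sketch but not for production use.

```python
import random

_BASES = (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37)

def is_probable_prime(n):
    """Miller-Rabin test with a fixed set of small bases (sketch only)."""
    if n < 2:
        return False
    for b in _BASES:
        if n % b == 0:
            return n == b
    d, s = n - 1, 0
    while d % 2 == 0:
        d, s = d // 2, s + 1
    for a in _BASES:
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = x * x % n
            if x == n - 1:
                break
        else:
            return False
    return True

def random_prime(bits, rng):
    """Random probable prime with exactly `bits` bits."""
    while True:
        n = rng.getrandbits(bits) | (1 << (bits - 1)) | 1
        if is_probable_prime(n):
            return n

def generate_vulnerable_primes(T, alpha, K, rng):
    """Return (p, q, k) with alpha-bit primes p, q and q = k*p (mod T).

    Assumes T is a prime of fewer than alpha bits (the escrow key)."""
    lo, hi = 1 << (alpha - 1), (1 << alpha) - 1
    while True:
        p = random_prime(alpha, rng)
        k = rng.randrange(2, K + 1)
        r = k * p % T                     # required residue of q modulo T
        j_min = -((r - lo) // T)          # ceil((lo - r) / T)
        j_max = (hi - r) // T
        # Dirichlet's theorem: the progression r + j*T contains primes.
        for j in rng.sample(range(j_min, j_max + 1), j_max - j_min + 1):
            q = r + j * T
            if q != p and is_probable_prime(q):
                return p, q, k

# Small demonstration parameters (assumptions): alpha = 64, c = 5, K = 30.
DEMO_RNG = random.Random(2023)
DEMO_T = random_prime(64 - 5, DEMO_RNG)
DEMO_P, DEMO_Q, DEMO_K = generate_vulnerable_primes(DEMO_T, 64, 30, DEMO_RNG)
print(DEMO_P * DEMO_Q)
```

The design choice worth noting is that p is drawn freely and only q is constrained: q is searched along the arithmetic progression with common difference T, so a new p is drawn whenever no prime of the right size is found on it.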
4.2. Recovering Procedure
The key idea of the SSB, and also the proof that it works as expected, is its recovering procedure. Formally, the factors of N can be efficiently recovered by knowing in advance only the semi-prime N and the escrow key T. The values of the parameters α, K, and c may affect the running time of the recovering procedure; however, there is no need to know them to recover the factors.
The recovering procedure can be split into three phases:
1. Recovering “low-level” coefficients.
2. Recovering “high-level” coefficients.
3. Recovering the factors.
Generally speaking, in a practical implementation of the recovering procedure it might be convenient to interleave the executions of these three phases. However, here the phases are discussed independently to simplify the description of the whole procedure.
Example 1. A “running example” may be useful to understand the description of the SSB’s recovering procedure. Let α = 128, c = 5, K = 30. Pick as a random secret the 123-bit prime T = 6451117418610792529759522664972769997. Then, pick as vulnerable semi-prime N = 54577680260424665710663143106120874652519112194523277824721618245793829954991 (of bit length 255).
4.2.1. Recovering “Low-Level” Coefficients
At the beginning, only N and T are known. The equation $N = p \cdot q$ and the equation in condition H0, $q \equiv k \cdot p \pmod{T}$, imply the following:
$$N \equiv p \cdot q \pmod{T} \tag{1}$$
$$q \equiv k \cdot p \pmod{T} \tag{2}$$
By combining them, we obtain the following:
$$p^2 \equiv k^{-1} N \pmod{T} \tag{3}$$
Because $1 < k \le K$, where K is a reasonably small constant, we can exhaustively test every possible value for k and discard any value for which $k^{-1} N$ in the Galois field GF(T) is a quadratic non-residue, that is, discard any value k such that for all integers $x \in GF(T)$, $x^2 \not\equiv k^{-1} N \pmod{T}$. Here, $k^{-1}$ denotes the value in GF(T) such that $k \cdot k^{-1} \equiv 1 \pmod{T}$.
The output of this phase is a list containing candidate values for the “low-level” coefficient k and the corresponding quadratic residue $k^{-1} N$ in GF(T). The correct value of k yields $p^2 \bmod T$.
Example 2 (Continuing Example 1). There are 14 values for k ∈ [2, 30] that yield a quadratic residue in GF(T). They are 3, 4, 9, 10, 12, 13, 14, 16, 19, 22, 23, 25, 27, and 30.
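The first phase can be sketched in Python as follows, under the assumption that condition H0 reads q ≡ k·p (mod T), so that p² ≡ k⁻¹·N (mod T). The function name is illustrative, and quadratic residuosity is tested with Euler's criterion. The constants `T_EX` and `N_EX` are copied from Example 1, so the values printed by the last line can be compared with the list in Example 2.

```python
def low_level_candidates(N, T, K):
    """Return the pairs (k, k^-1 * N mod T), 2 <= k <= K, whose second
    entry is a quadratic residue modulo the odd prime T."""
    candidates = []
    for k in range(2, K + 1):
        residue = pow(k, -1, T) * N % T          # k^-1 * N in GF(T)
        if pow(residue, (T - 1) // 2, T) == 1:   # Euler's criterion
            candidates.append((k, residue))
    return candidates

# Values of the running example (Example 1).
T_EX = 6451117418610792529759522664972769997
N_EX = 54577680260424665710663143106120874652519112194523277824721618245793829954991
print([k for k, _ in low_level_candidates(N_EX, T_EX, 30)])

# Small self-check on constructed values (assumptions): the numbers below
# are not required to be prime, only to satisfy q = k*p (mod T) with k = 7.
T_SMALL = 2**31 - 1
P_SMALL = 123456789123
Q_SMALL = 7 * P_SMALL % T_SMALL + 5 * T_SMALL
SMALL_CANDIDATES = low_level_candidates(P_SMALL * Q_SMALL, T_SMALL, 30)
```

About half of the values of k are expected to survive this filter, since a random non-zero element of GF(T) is a quadratic residue with probability one half; the remaining candidates are eliminated in the next phases.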
4.2.2. Recovering “High-Level” Coefficients
This phase starts by knowing N, T, k, and $k^{-1} N \bmod T$. Actually, this phase is executed once for any candidate in the list built in the previous phase; any candidate is discarded as soon as it yields inconsistent results.
The first step computes the square root of $k^{-1} N$ in GF(T); that is, it finds the values whose square is congruent to $k^{-1} N$ modulo T, typically by means of the Tonelli–Shanks algorithm [43,44]. Because in general any square root has two distinct values in GF(T), there are two possible values $\gamma_1$ and $\gamma_2$ for $p \bmod T$, where $\gamma_2 = T - \gamma_1$. In the following, let $\gamma$ be either $\gamma_1$ or $\gamma_2$; this phase has to be performed with both values by discarding the value that yields inconsistent results.
Starting from $\gamma$, the value $k \cdot \gamma \bmod T$ can be easily computed from Equation (2), so several candidate values for $p \bmod T$ and $q \bmod T$ are now known.
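The square-root step can be sketched in Python with a textbook Tonelli–Shanks routine. The function names are illustrative, and the pairing of each root γ with k·γ mod T assumes that Equation (2) is q ≡ k·p (mod T). The constants in the usage lines are taken from Example 1 and from the table of Example 3.

```python
def tonelli_shanks(a, p):
    """Return one square root of a modulo the odd prime p.

    Raises ValueError if a is a quadratic non-residue."""
    a %= p
    if a == 0:
        return 0
    if pow(a, (p - 1) // 2, p) != 1:
        raise ValueError("not a quadratic residue")
    if p % 4 == 3:                       # easy case
        return pow(a, (p + 1) // 4, p)
    q, s = p - 1, 0
    while q % 2 == 0:                    # write p - 1 = q * 2^s with q odd
        q, s = q // 2, s + 1
    z = 2
    while pow(z, (p - 1) // 2, p) != p - 1:
        z += 1                           # first quadratic non-residue
    m, c, t, r = s, pow(z, q, p), pow(a, q, p), pow(a, (q + 1) // 2, p)
    while t != 1:
        i, t2 = 0, t
        while t2 != 1:                   # least i with t^(2^i) = 1
            t2, i = t2 * t2 % p, i + 1
        b = pow(c, 1 << (m - i - 1), p)
        m, c = i, b * b % p
        t, r = t * c % p, r * b % p
    return r

def residue_pairs(k, gamma, T):
    """Both roots gamma, T - gamma, each paired with k * root mod T."""
    return [(g, k * g % T) for g in (gamma, T - gamma)]

# First two rows of the table in Example 3 (k = 3).
T_EX = 6451117418610792529759522664972769997
GAMMA_EX = 1101001108223132047246029465205384188
print(residue_pairs(3, GAMMA_EX, T_EX))
```

For a given candidate k, one would call `tonelli_shanks` on the residue k⁻¹·N mod T produced by the previous phase and pass the result to `residue_pairs`.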
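Example 3 (Continuing Example 2). The 14 possible values for k, each of them with two possible roots γ1 and γ2, yield the following 28 cases (two consecutive rows for each value of k, in the order listed in Example 2):
| Root γ (mod T) | k·γ mod T |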
| 1101001108223132047246029465205384188 | 3303003324669396141738088395616152564 |
| 5350116310387660482513493199767385809 | 3148114093941396388021434269356617433 |
| 383884601054424720447564657194317617 | 1535538404217698881790258628777270468 |
| 6067232817556367809311958007778452380 | 4915579014393093647969264036195499529 |
| 255923067369616480298376438129545078 | 2303307606326548322685387943165905702 |
| 6195194351241176049461146226843224919 | 4147809812284244207074134721806864295 |
| 674267825617802548964398838956350795 | 291560837567232959884465724590737953 |
| 5776849592992989980795123826016419202 | 6159556581043559569875056940382032044 |
| 550500554111566023623014732602692094 | 154889230727999753716654126259535131 |
| 5900616864499226506136507932370077903 | 6296228187882792776042868538713234866 |
| 872807543698631712198073475805281438 | 4895380649471419728815432520495888697 |
| 5578309874912160817561449189167488559 | 1555736769139372800944090144476881300 |
| 1772631623417650051858813089627283653 | 5463490472014723136744815259863661151 |
| 4678485795193142477900709575345486344 | 987626946596069393014707405109108846 |
| 3033616408778183904655979003889226190 | 3380040610175394766179005407418229061 |
| 3417501009832608625103543661083543807 | 3071076808435397763580517257554540936 |
| 1334962546318133547479911059450973176 | 6010936124212159812839742134650180353 |
| 5116154872292658982279611605521796821 | 440181294398632716919780530322589644 |
| 392162883320122101182846126731882268 | 2176466014431893696263092123128639899 |
| 6058954535290670428576676538240887729 | 4274651404178898833496430541844130098 |
| 2533078726893509165415881053303209543 | 200753951053578036729560241218889516 |
| 3918038691717283364343641611669560454 | 6250363467557214493029962423753880481 |
| 2426893127022547123724783203111380952 | 2612271408066545325283876093029593827 |
| 4024224291588245406034739461861389045 | 3838846010544247204475646571943176170 |
| 367000369407710682415343155068461396 | 3457892555397395895454742521875687695 |
| 6084117049203081847344179509904308601 | 2993224863213396634304780143097082302 |
| 2107797709484122639489264125803469073 | 5173874517026546416842219789349142217 |
| 4343319709126669890270258539169300924 | 1277242901584246112917302875623627780 |
The semi-prime
N can be written as follows:
that is, if
,
From the last equation, it is easy to obtain the following bounds:
Therefore,
. Because by construction
c is a small constant, it is possible to adopt a brute force approach to discover the missing “high-level” coefficients
and
. The brute force search guesses the value of the sum
, starting from the lower bound
(from Equation (
7)) and ending at the upper bound
(from Equation (
6)).
For any candidate value of the sum
, Equation (
5) can be transformed by introducing an unknown
,
,
,
:
that is,
Because we are looking for integer solutions for
x and
, the brute force attack just tries all values for
C, in increasing order, and immediately discards any value such that
is not a square. If the value of
C survives, the solutions
are computed; if either one of the solutions is an integral number, the pair
is recorded as a candidate solution.
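The brute force search described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the paper's SageMath implementation. Writing p = x · T + a and q = (C − x) · T + b, with a = p mod T and b = q mod T, the condition p · q = N becomes the quadratic T · x² − (C · T + b − a) · x + ((N − a · b)/T − C · a) = 0. This algebra is worked out here from the factor decomposition and may differ in form from Equation (5). The value of T is again assumed to be the sum of two roots from Example 3, and the function name is hypothetical.

```python
from math import isqrt

N = 54577680260424665710663143106120874652519112194523277824721618245793829954991
# ASSUMPTION: T is the sum of the two square roots listed for one k in Example 3.
T = 255923067369616480298376438129545078 + 6195194351241176049461146226843224919
P_MOD_T = 4147809812284244207074134721806864295  # p mod T of the surviving candidate
Q_MOD_T = 6195194351241176049461146226843224919  # q mod T of the surviving candidate


def recover_high_level(n, t, a, b, lo, hi):
    """Search p = x*t + a and q = (c - x)*t + b with lo <= c <= hi and p*q == n."""
    m, rem = divmod(n - a * b, t)
    if rem:  # n is not congruent to a*b modulo t: inconsistent candidate
        return None
    for c in range(lo, hi + 1):  # c is the guessed sum of the two coefficients
        lin = c * t + b - a
        disc = lin * lin - 4 * t * (m - c * a)
        if disc < 0:
            continue
        root = isqrt(disc)
        if root * root != disc:  # discriminant is not a perfect square
            continue
        for num in (lin - root, lin + root):
            x, r = divmod(num, 2 * t)
            if r == 0 and 0 <= x <= c:
                p, q = x * t + a, (c - x) * t + b
                if p * q == n:
                    return x, c - x, p, q
    return None


solution = recover_high_level(N, T, P_MOD_T, Q_MOD_T, 71, 1312)
print(solution)
```

The final check p · q = N makes the search self-validating, so a spurious integer root cannot be reported as a factorization.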
Example 4 (Continuing Example 3). By Equation (6), π ν ≤ 1312, and the search interval for π + ν is [71, 1312]. Eventually, the brute force search phase yields the following:
| | |
| 8459626466054297349616399379260014164347 | |
| 8457579353068579085275623974455862931102 | |
| 8460098809150150504624722565825147543255 | |
| 8455567114736811835697200866446146361343 | |
| 8460098809150150504624722565825147543255 | |
| 8456206922405235876897946807541470224038 |
|
| 8460159710142991168177115744323858742548 | |
| 8454674421387565411156205086222433061299 | |
| 8460176966608408915640022393992616856441 | |
| 8454431238974637688887602540186506313669 | |
| 8459527860681315896302789741399585733765 | |
| 8458844931455875155214043724730914133903 | |
| 8458688931473612092596816415096292863204 | |
| 8459473936150433673255660520780811038011 | |
| 8458600731145590019601856845450035587533 | |
| 8458563270745932805742932307196370272787 | |
| 8458946310355913998823466442261921695379 | |
| 8459841091607833499654026572791050078911 | |
| 8460057876762344444478490757720182548647 | |
| 8456175388241485667746177173305070300817 | |
| 8460111356431450904636879475680056375399 | |
| 8456394071690787199309265394309605704461 | |
| 8459207454427345656382788041250813432771 | |
| 8457795501543823956302037177881981637553 | |
| 8459993466423705060298814722415082625743 | |
| 8457367241929899374346925285427054004837 | |
| 8458499704585927357292407303891989899950 | |
| 8459330259393827233818979265142169741243 | |
There is only one candidate: k = 9, p mod T = 4147809812284244207074134721806864295, q mod T = 6195194351241176049461146226843224919, π = 48, ν = 26.
4.2.3. Recovering the Factors
This phase starts by knowing N, T, p mod T, q mod T, and a list of candidate solutions (π, ν).
For any candidate solution (π, ν), the corresponding p = π · T + (p mod T) and q = ν · T + (q mod T) are computed, then the product p · q is compared to N. One of the candidate solutions certainly yields a factorization of the semi-prime.
Example 5 (Continuing Example 4). Finally, we obtain p = 48 · T + (p mod T) and q = 26 · T + (q mod T), and we verify that p · q = 54577680260424665710663143106120874652519112194523277824721618245793829954991 = N.
4.3. Analysis
The time complexity of the SSB’s recovering procedure can be easily obtained. As explained in the previous subsection, the procedure starts by recovering the “low-level” coefficients by means of an exhaustive search among
possible values for
k. For every candidate value, the procedure must execute some operations in GF(
T) whose cost is in
and also use the Tonelli–Shanks algorithm to determine if a value
is a quadratic residue, which costs
[
45]. The list of candidate values for
k has expected length
, because in a finite field with an odd number of elements any quadratic residue has two square roots; thus, half of the elements of the field are not the square of another element. Therefore, the “high-level” coefficients recovery phase is executed on
candidate values for
k and includes an exhaustive search in an interval of size
; in every iteration the procedure executes a few integer operations on values of bit length
; hence, every execution of this phase has a cost in
. Finally, the cost of every execution of the third phase is dominated by two multiplications of values of bit length
; hence, it is in
. In summary, the worst-case cost of the whole recovering procedure is in
.
The values of the parameters K and c are chosen by the backdoor designer. We would expect that larger values of K and c yield shorter running times for Algorithm 1 and longer running times for the recovery procedure; this intuition is confirmed by the experiments. However, the value of c cannot be made too large, or it would be possible to discover the backdoor by just guessing the escrow key T of bit length α − c. By letting
and
, for instance,
and
as suggested in
Section 4.1, one obtains a running time for the recovery procedure in
, that is, a polynomial in the size of the semi-prime.
Experimental Results
In order to confirm that the backdoor works as expected and to assess the execution times with respect to the designer’s parameters, the SSB has been implemented in SageMath [46] and extensive tests have been performed (the code is open-source and available at https://gitlab.com/cesati/ssb-and-tsb-backdoors.git, accessed on 17 September 2023).
In particular, three values for α have been considered: 512 (the size of factors for RSA-1024), 1024 (RSA-2048), and 2048 (RSA-4096). All tests have been performed by choosing c = 7. This means that the escrow keys have sizes 505, 1017, and 2041 bits, respectively. The value of c is so small that detecting the existence of the backdoor by simply guessing the value of the escrow key does not appear to be significantly easier than guessing one of the factors of the corresponding semi-primes. Every test trial involves choosing a value for the parameter K, generating an escrow key T and a vulnerable semi-prime, then recovering the factors of the semi-prime by just using the values of the semi-prime and the escrow key. The tests have been executed by varying the parameter K so as to determine a value yielding both fast generation of vulnerable semi-primes and a reasonably quick recovery of the factors.
The tests have been executed on three computational nodes with 16 physical Intel Xeon E5-2620 v4 cores running at 2.1 GHz with 64 GiB of RAM. The nodes are based on the Slackware 14.2 software distribution with a Linux kernel version 5.4.78 and SageMath version 9.1. All tests have properly recovered the factors of the vulnerable semi-primes. Each value of
has been tested 20 times. The SageMath code is sequential; that is, each test trial runs on a single computation core.
Table 1 and Figure 1 report averages and standard deviations of the running times.
The experimental results confirm that the value of K is crucial in determining both the time required to generate a vulnerable semi-prime and the time required to recover the factors. Even if the code has not been optimized at all, the recovery time is reasonably small for all tested values of K; hence, the SSB is a practically effective backdoor. However, the generation time is also very important whenever the backdoor mechanism has to be hidden in hardware devices or software programs that are supposed to yield robust, legitimate semi-primes. While in general larger values of K are associated with smaller generation times, there seems to be a threshold value for K above which the generation times are essentially constant and near the minimum observed value. From the data shown in Table 1 and Figure 1, K may be safely set to values near 500, 1000, and 2000 for α = 512, 1024, and 2048, respectively; that is, K ≈ α.
5. TSB: A Backdoor Embedded in a Pair of Semi-Primes
This section describes the TSB (Twin Semi-prime Backdoor), a new proposal for a backdoor embedded in the values of a pair of semi-primes, N1 and N2. These semi-primes are typically generated on the same device but can be used independently. For instance, the two semi-primes might be used in two different RSA keys. It is not hard to justify the generation of two different RSA keys. For instance, the user might be told that one RSA key is for business or work usage and the other one is for personal usage. Alternatively, one of the semi-primes can be used to build an RSA key while the other one can be separately stored as an escrow key for the RSA key.
This section first reports how the two semi-primes are generated. Then, it describes the procedure to efficiently factor both semi-primes, provided that the corresponding designer key is known. Finally, the section reports an analysis of the theoretical and practical efficiency of the backdoor.
5.1. Generation of the Vulnerable Pair of Semi-Primes
The first step of the generation of a vulnerable pair is choosing a “designer key”, which is a secret value that must be known in order to detect and exploit the backdoor. The designer key is a prime T of a size slightly smaller than the size of the primes in each semi-prime. Thus, if is the reference bit length of the primes (e.g., for RSA-1024), then , where typically for ; a good value for , 1024, and 2048 appears to be . The backdoor designer must also choose the values of two constants K and B. The value of K is related to the value of , as discussed later; typically, , e.g., , 200, and 400 for , 1024, and 2048, respectively. The constant acts as a detection threshold, so any value for B such that is valid.
In order to create a vulnerable pair, four distinct primes, , , , and , each of them having a bit length of roughly , must be generated. The backdoor exists whenever the following conditions hold:
H1. There exists a positive integer h with such that .
H2. There exists a positive integer with such that .
H3. There exists a positive integer with such that .
H4. The integers h, , and are all coprimes; that is, .
H5. is not a divisor of modulo T; that is, .
H6. .
Algorithm 2 can be used to generate the four primes
,
,
, and
satisfying the conditions H1–H6 above. Once more, the algorithm is implicitly based on Dirichlet’s theorem stating that there are infinitely many primes of the form
when
.
Algorithm 2: Generation of a vulnerable pair of semi-primes
Finally, the semi-primes are computed as N1 = p1 · q1 and N2 = p2 · q2. Observe that N1 and N2 are coprime, because all factors are necessarily different by construction.
5.2. Recovering Procedure
The key idea of the TSB, and also the proof that it works as expected, is its recovering procedure. Formally, the factors of N1 and N2 can be efficiently recovered by knowing in advance only the pair of semi-primes and the designer key T. The values of the parameters α, K, and c may affect the running time of the recovering procedure; however, there is no need to know them to recover the factors.
The recovering procedure can be split into four phases:
1. Recovering “medium-level” coefficients.
2. Recovering “low-level” coefficients.
3. Recovering “high-level” coefficients.
4. Recovering the factors.
Generally speaking, in a practical implementation of the recovering procedure it might be convenient to interleave the executions of these four phases. However, the phases are here described independently to simplify the description of the whole procedure.
Example 6. This is the “running example” for the TSB’s recovering procedure. Let α = 64, c = 3, K = 100, and B = 2⁵⁷. Pick as a random secret the 61-bit prime T = 1350856093440009833. Then, pick as vulnerable semi-primes N1 = 199771249142689629600100193795300988277 and N2 = 330849388672597230630022641974377014199 (both of bit length 128).
5.2.1. Recovering “Medium-Level” Coefficients
The recovering procedure starts from the following data: N1, N2, and the “secret” prime T.
Equations in conditions H1, H2, and H3 enforce the following congruences of N1 and N2 modulo T:
N1 ≡ h · k1 · γ² (mod T), (8)
N2 ≡ k2 · γ² (mod T). (9)
It turns out that N1 and N2 are congruent modulo T to two values that have a big common factor, γ². However, the Euclidean algorithm on N1 and N2 does not really help here:
The point is that the greatest common divisor is relative to the lifted images of the products in the Galois field GF(T), and it is not related to the greatest common divisor of the products and in .
Example 7 (Continuing Example 6).
To overcome this problem, observe that Equations (8) and (9) also imply the following ones:
and therefore there exist two integers
,
such that
From the last two equations,
Observe that dropping
from Equation (
12) yields
Similarly, from Equation (
13),
Hence, the sizes of the “medium” coefficients
and
are so small that they can be quickly recovered by a brute force approach as in Algorithm 3. It is possible to recognize the proper values of
and
because the size of
produced by the gcd with the right values is usually much higher than the average value resulting from a gcd with random wrong values. In fact, by condition H6,
; hence, the procedure selects any candidate pair of medium-level coefficients
for which the greatest common divisor in Equation (
14) is between
B and
T. Moreover, the value returned by the Euclidean algorithm with the right values must be a square in the Galois field GF(
T); hence, the procedure may use this condition to filter some false positives. In all test cases, the first value found by this brute force procedure yields a proper factorization result.
Algorithm 3: Brute force search of the medium-level coefficients
Example 8 (Continuing Example 7). There are only two possible pairs of medium-level coefficients in [0, 100²] × [0, 100] that yield a greatest common divisor higher than B = 2⁵⁷: (671, 10) and (5277, 79). The gcd for the pair (671, 10) is 196865400950880229, which is the square of 10632559655363908 modulo T. The gcd for the pair (5277, 79) is 1547721494390890062: because it is above T, the pair can be discarded.
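The search of Example 8 can be sketched in Python as follows. This is a hedged illustration rather than a transcription of Algorithm 3. It assumes that the two residues N1 mod T and N2 mod T are lifted by adding multiples of T (named m1 and m2 here, both hypothetical names), that m1 ranges over [0, K²] and m2 over [0, K], and that the threshold printed as "B = 257" in the extracted text stands for 2⁵⁷. A pair is kept when the gcd lies strictly between B and T and is a quadratic residue modulo T.

```python
from math import gcd

T = 1350856093440009833
N1 = 199771249142689629600100193795300988277
N2 = 330849388672597230630022641974377014199
K = 100
B = 2 ** 57  # ASSUMPTION: the flattened "257" of the running example is read as 2^57


def search_medium_level(n1, n2, t, k, b):
    """Return every (m1, m2, g) whose gcd g lies in (b, t) and is a square in GF(t)."""
    r1, r2 = n1 % t, n2 % t
    hits = []
    for m2 in range(k + 1):
        lifted2 = r2 + m2 * t
        for m1 in range(k * k + 1):
            g = gcd(r1 + m1 * t, lifted2)
            # Euler's criterion filters values that cannot be a square modulo t.
            if b < g < t and pow(g, (t - 1) // 2, t) == 1:
                hits.append((m1, m2, g))
    return hits


hits = search_medium_level(N1, N2, T, K, B)
print(hits)
```

The double loop performs about one million gcd computations on integers of roughly 70 bits, which takes on the order of a second in plain Python.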
5.2.2. Recovering “Low-Level” Coefficients
The previous phase might determine several candidate pairs of medium-level coefficients, and the current phase must be applied to each of them.
This phase starts by assuming to know the following data:
,
,
T,
,
, and the value
derived from Equation (
14). The value of the “low-level” coefficient
can be immediately computed using Equation (
13):
or, assuming
,
where
.
However, by inverting Equation (
12) one obtains the value of the product
:
or, assuming
,
.
Since both
h and
are not greater than
K, their product is below
. Moreover, by condition H4,
. Because the number of multiplicative partitions of this product does not exceed
[
47,
48], the procedure may exhaustively generate all possible candidate pairs
and apply the forthcoming phases to each of them. When these phases are performed on the true pair
, a proper factorization of
and
is computed.
Example 9 (Continuing Example 8). Two exact integer divisions yield k2 = 69 and (h k1) = 4606 = 2 · 7² · 47. Therefore, there are six possible pairs (h, k1), corresponding to the non-trivial subsets of the three values 2, 7², and 47: (2, 2303), (47, 98), (49, 94), (94, 49), (98, 47), and (2303, 2).
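The two divisions and the enumeration of Example 9 can be reproduced with the short Python sketch below. It is an illustration under assumptions: the lifted residues are taken to be (N1 mod T) + m1 · T and (N2 mod T) + m2 · T with the pair (m1, m2) = (671, 10) found in Example 8, and candidate pairs (h, k1) are all non-trivial coprime factorizations of the product h · k1, as condition H4 suggests. The function and variable names are hypothetical.

```python
from math import gcd

T = 1350856093440009833
N1 = 199771249142689629600100193795300988277
N2 = 330849388672597230630022641974377014199
M1, M2 = 671, 10                 # medium-level pair reported in Example 8
GAMMA_SQ = 196865400950880229    # gcd reported in Example 8


def low_level_candidates(n1, n2, t, m1, m2, g):
    """Return (k2, h*k1, list of coprime pairs (h, k1)) from two exact divisions by g."""
    hk1, rem1 = divmod(n1 % t + m1 * t, g)
    k2, rem2 = divmod(n2 % t + m2 * t, g)
    if rem1 or rem2:
        raise ValueError("the gcd does not divide both lifted residues")
    pairs = [(d, hk1 // d) for d in range(2, hk1)
             if hk1 % d == 0 and gcd(d, hk1 // d) == 1]
    return k2, hk1, pairs


k2, hk1, pairs = low_level_candidates(N1, N2, T, M1, M2, GAMMA_SQ)
print(k2, hk1, pairs)
```

Trial division is adequate here because the product h · k1 is bounded by a small power of K.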
5.2.3. Recovering “High-Level” Coefficients
This phase starts by knowing the following data: N1, N2, T, h, k1, k2, and γ².
The procedure starts by computing the square roots of γ² in GF(T); that is, it finds the values whose square is congruent to γ² modulo T, typically by means of the Tonelli–Shanks algorithm [43,44]. Because in general any quadratic residue has two distinct square roots in GF(T), one obtains two possible values γ1 and γ2 for γ, where γ2 = T − γ1. In the following, let γ be either γ1 or γ2; the procedure has to perform this phase with both values and discard the one that yields inconsistent results.
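For readers who wish to reproduce this step outside SageMath, a textbook version of the Tonelli–Shanks algorithm is sketched below in Python. It is a generic implementation, not the code used for the experiments, and it is applied here to the value of γ² obtained in Example 8 for the running example.

```python
T = 1350856093440009833          # designer key of the running example
GAMMA_SQ = 196865400950880229    # gcd found in Example 8


def tonelli_shanks(n, p):
    """Return one square root of n modulo the odd prime p (n must be a residue)."""
    n %= p
    if n == 0:
        return 0
    if pow(n, (p - 1) // 2, p) != 1:
        raise ValueError("n is not a quadratic residue modulo p")
    # Write p - 1 = q * 2^s with q odd.
    q, s = p - 1, 0
    while q % 2 == 0:
        q //= 2
        s += 1
    # Find any quadratic non-residue z.
    z = 2
    while pow(z, (p - 1) // 2, p) != p - 1:
        z += 1
    m, c = s, pow(z, q, p)
    t, r = pow(n, q, p), pow(n, (q + 1) // 2, p)
    while t != 1:
        # Least i > 0 such that t^(2^i) == 1.
        i, t2 = 0, t
        while t2 != 1:
            t2 = t2 * t2 % p
            i += 1
        b = pow(c, 1 << (m - i - 1), p)
        m, c = i, b * b % p
        t, r = t * c % p, r * b % p
    return r


root = tonelli_shanks(GAMMA_SQ, T)
roots = sorted((root, T - root))  # the two roots are opposite in GF(T)
print(roots)
```

Since T − 1 is divisible by 8 in this example, the simple shortcut available for primes congruent to 3 modulo 4 does not apply, and the general loop is actually exercised.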
It is now possible to compute the value
, because
means the following:
where obviously
is computed in GF(
T); that is,
.
The value
can now be inferred from the equation in condition H1, because
Also,
and
can be computed from conditions H2 and H3:
Example 10 (Continuing Example 9). The square roots of γ² = 196865400950880229 in GF(T) are γ1 = 10632559655363908 and γ2 = 1340223533784645925. The six possible pairs (h, k1) and the two possible roots γ1 and γ2 yield the following 12 cases:
| q1 mod T | q2 mod T | p1 mod T | p2 mod T |
| 5316279827681954 | 21265119310727816 | 685500817531612520 | 366823308110054826 |
| 1345539813612327879 | 1329590974129282017 | 665355275908397313 | 984032785329955007 |
| 1264857461085442480 | 499730303802103676 | 1249852184152786057 | 820374834734901808 |
| 85998632354567353 | 851125789637906157 | 101003909287223776 | 530481258705108025 |
| 331038891447662896 | 520995423112831492 | 584496908244388744 | 1227986014848582496 |
| 1019817201992346937 | 829860670327178341 | 766359185195621089 | 122870078591427337 |
| 632428730542721240 | 999460607604207352 | 1148848274865562281 | 410187417367450904 |
| 718427362897288593 | 351395485835802481 | 202007818574447552 | 940668676072558929 |
| 165519445723831448 | 1041990846225662984 | 1168993816488777488 | 613993007424291248 |
| 1185336647716178385 | 308865247214346849 | 181862276951232345 | 736863086015718585 |
| 466909284818889792 | 171375204382903130 | 454232818686074308 | 1147050503383169489 |
| 883946808621120041 | 1179480889057106703 | 896623274753935525 | 203805590056840344 |
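The rows of the table above can be regenerated with a few modular operations. The Python sketch below uses relations inferred from the numbers of Examples 8 to 11, not quoted from conditions H1–H3: q1 ≡ γ · h⁻¹, q2 ≡ γ · h, p1 ≡ h · k1 · q2, and p2 ≡ k2 · q1 (all modulo T). These relations, the column order, and the function name are assumptions made for illustration.

```python
T = 1350856093440009833
GAMMA_1 = 10632559655363908
GAMMA_2 = T - GAMMA_1


def residues(gamma, h, k1, k2, t):
    """Return (q1, q2, p1, p2) modulo t; the relations are INFERRED from the worked examples."""
    q1 = gamma * pow(h, -1, t) % t   # modular inverse of h in GF(t), Python 3.8+
    q2 = gamma * h % t
    p1 = h * k1 * q2 % t
    p2 = k2 * q1 % t
    return q1, q2, p1, p2


# The pair (h, k1) = (47, 98) with both roots, and k2 = 69 from Example 9.
row_gamma_1 = residues(GAMMA_1, 47, 98, 69, T)
row_gamma_2 = residues(GAMMA_2, 47, 98, 69, T)
print(row_gamma_1)
print(row_gamma_2)
```

Each pair (h, k1) contributes two rows, one per root, and the second row is the component-wise opposite of the first modulo T.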
At this point the procedure knows the values N1, N2, T, p1 mod T, q1 mod T, p2 mod T, and q2 mod T.
The semi-prime
(
) can be written as follows:
that is, if
,
The following bounds can be easily obtained from the last equation:
Therefore,
. Because by construction
c is a small constant, the procedure can adopt a brute force approach to discover the missing “high-level” coefficients
and
. The brute force search guesses the value of the sum
, starting from the lower bound
(from Equation (
23)) and ending at the upper bound
(from Equation (
22)).
For any candidate value of the sum
, Equation (
21) can be transformed by introducing an unknown
,
,
,
:
that is,
Because we are looking for integer solutions for
x and
, the brute force attack tries all values for
C, in increasing order, and immediately discards any value such that
is not a square. If the value of
C survives, the solutions
are computed; if either one of the solutions is an integral number, the pair
is recorded as a candidate solution.
Example 11 (Continuing Example 10). By Equation (22), π1 ν1 ≤ 110 and π2 ν2 ≤ 182. The search interval for π1 + ν1 is [20, 110]. The search interval for π2 + ν2 is [26, 182]. Eventually, the brute force search phase yields the following candidates:
| | | | |
| 147882225056116242909 | 244912533420701231951 | |
| 147222186060035527550 | 243949765754682004760 | |
| 146714639142360634749 | 244614821750351042527 | |
| 147878492694158853453 | 244584070795448038178 |
|
|
| 147741686848298522541 | 244444700795865219999 | |
| 147306366554550564348 | 244842826140386624154 | |
| 147347067872903355989 | 244614821750351042527 | |
| 147777488784871629677 | 244673613681882690950 | |
| 147741686848298522541 | 244444700795865219999 | |
| 147725344017071121644 | 244749828556075164398 | |
| 147727922012766098477 | 244772788355361842613 | |
| 147298208022831052744 | 244740357969687905399 | |
Therefore, there is only one surviving parameter set: h = 47, k1 = 98, k2 = 69, π1 = 9, ν1 = 12, π2 = 12, ν2 = 14, p1 mod T = 101003909287223776, q1 mod T = 85998632354567353, p2 mod T = 530481258705108025, and q2 mod T = 851125789637906157.
5.2.4. Recovering the Factors
This phase starts by knowing Ni, T, pi mod T, qi mod T, and a list of candidate solutions (πi, νi), for i ∈ {1, 2}. The procedure now works on every semi-prime separately.
For any candidate solution (πi, νi), it computes the corresponding pi = πi · T + (pi mod T) and qi = νi · T + (qi mod T) and then it simply verifies whether pi · qi = Ni. One of the candidate solutions certainly yields a factorization of the semi-prime.
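This last phase amounts to one multiplication per candidate. A minimal Python sketch follows, using the surviving parameter set of Example 11; the function name is hypothetical, and the assignment of π to p and ν to q follows the notation of that example.

```python
T = 1350856093440009833
N1 = 199771249142689629600100193795300988277
N2 = 330849388672597230630022641974377014199


def try_candidate(n, t, p_mod_t, q_mod_t, pi, nu):
    """Rebuild p and q from their residues and high-level coefficients; None if p*q != n."""
    p = pi * t + p_mod_t
    q = nu * t + q_mod_t
    return (p, q) if p * q == n else None


# Surviving parameter set of Example 11.
factors_1 = try_candidate(N1, T, 101003909287223776, 85998632354567353, 9, 12)
factors_2 = try_candidate(N2, T, 530481258705108025, 851125789637906157, 12, 14)
print(factors_1)
print(factors_2)
```

A wrong candidate is rejected by the same test, so the function can be applied to every entry of the candidate list without further filtering.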
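Example 12 (Continuing Example 11). Finally, we obtain p1 = 9 · T + (p1 mod T), q1 = 12 · T + (q1 mod T), p2 = 12 · T + (p2 mod T), and q2 = 14 · T + (q2 mod T), and we verify that p1 · q1 = N1 and p2 · q2 = N2.
5.3. Analysis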
The time complexity of the TSB’s recovering procedure can be easily obtained. As already explained, the procedure starts by recovering the “medium-level” coefficients by means of an exhaustive search among
possible values for the pair
. For every candidate pair, the procedure must execute the Euclidean algorithm on values of bit lengths up to
, which costs
. It may also use the Tonelli–Shanks algorithm to determine if a value
is a quadratic residue, which costs
[
45]. The “low-level” coefficients recovery phase involves a couple of integer divisions on values
, a factorization of a value
, and the generation of up to
candidate pairs
; hence, the cost of each execution of this recovery phase is
. The “high-level” coefficients recovery phase includes an exhaustive search in an interval of size
; in every iteration the procedure executes a few integer operations on values of bit length
; hence, the cost of every execution of this phase is
. Finally, the cost of every execution of the fourth phase is dominated by four multiplications of values of bit length
; hence, it is in
. In summary, the worst-case cost of the whole recovering procedure is in
.
The values of the parameters K and c are chosen by the backdoor designer. It is easy to observe that larger values of K and c yield shorter running times for Algorithm 2 and longer running times for the recovery procedure. However, the value of c cannot be made too large, or it would be possible to discover the vulnerability by just guessing the designer key T of bit length α − c. Moreover, experimental results show that larger values of c do not necessarily yield shorter times for the generation phase. By letting
and
, as suggested in
Section 5.1, one obtains a running time for the recovery procedure in
, that is, a polynomial in the size of the semi-primes.
Experimental Results
In order to confirm that the backdoor works as expected and to assess the execution times with respect to the designer’s parameters, the TSB has been implemented in SageMath [46] and extensive tests have been performed (the code is open-source and available at https://gitlab.com/cesati/ssb-and-tsb-backdoors.git, accessed on 17 September 2023).
In particular, three values for α have been considered: 512 (the size of factors for RSA-1024), 1024 (RSA-2048), and 2048 (RSA-4096). All tests have been performed by choosing c = 7. This means that the designer keys have sizes 505, 1017, and 2041 bits, respectively. The value of c is so small that detecting the existence of the backdoor by simply guessing the value of the designer key does not appear to be significantly easier than guessing one of the factors of the corresponding semi-primes. Every test trial involves choosing a value for the parameter K, generating a designer key T and a pair of vulnerable semi-primes, then recovering the factors of the semi-primes by just using the values of the semi-primes and the designer key. The tests have been executed by varying the parameter K so as to determine a value yielding both fast generation of vulnerable semi-primes and a reasonably quick recovery of the factors.
The tests have been executed on the same computational nodes described in
Section 4. All tests have properly recovered the factors of the vulnerable semi-primes. Each value of
has been tested 20 times. The SageMath code is sequential; that is, each test trial runs on a single computation core.
Table 2 and Figure 2 report averages and standard deviations of the running times.
The value of K is crucial in determining both the time required to generate a pair of semi-primes and the time required to recover the factors. The experimental results show that, even if the SageMath code is not optimized, the recovery time is reasonably small for all tested values of K; hence, the TSB is a practically effective backdoor. However, generation time is also very important whenever the backdoor mechanism has to be hidden in hardware devices or software programs that are supposed to yield robust, legitimate semi-primes. While in general larger values of K are associated with smaller generation times, there seems to be a threshold value for K above which the generation times are essentially constant and near the minimum observed value. From the data shown in Table 2 and Figure 2, K can be safely set to values near 100, 200, and 400 for α = 512, 1024, and 2048, respectively; that is, K ≈ α/5.