In this section, we analyze the error in the computation of the reciprocal square root 1/√b for the scaled input b. Analogous to the analysis for the reciprocal, we determine a tight bound on the absolute error, taking into account all (rounding) errors, assuming fixed-point arithmetic with f fractional bits in Algorithm 5. In Section 5.3, we will use this bound to determine the minimal number of additional bits n needed to guarantee that the absolute error of the reciprocal square root stays within the target precision, also taking into account the errors due to scaling.
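Algorithm 5 itself is not reproduced in this section. As context for the error analysis, a minimal sketch of a standard Newton–Raphson iteration for 1/√b in fixed-point arithmetic with f fractional bits could look as follows; the function name, the initial approximation, and the rounding convention are illustrative assumptions, not the paper's exact choices:

```python
from fractions import Fraction

def rsqrt_fixed(b: Fraction, x0: Fraction, f: int, iters: int) -> Fraction:
    """Approximate 1/sqrt(b) by the Newton iteration x <- x*(3 - b*x^2)/2,
    rounding every intermediate product to f fractional bits (round to nearest).
    All values are kept exact as Fractions on the grid of multiples of 2^-f."""
    scale = 1 << f

    def rnd(v: Fraction) -> Fraction:
        # round to the nearest multiple of 2^-f; each call models one rounding error
        return Fraction(round(v * scale), scale)

    x = rnd(x0)
    for _ in range(iters):
        t = rnd(rnd(b * x) * x)     # rounded computation of b * x^2
        x = rnd(x * (3 - t) / 2)    # rounded Newton update
    return x
```

With exact arithmetic this iteration converges quadratically; the rounding calls are exactly the kind of error sources that Lemmas A5 and A6 bound.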
With the help of Lemmas A5 and A6, we are able to give a bound on the total error for the reciprocal square root after all iterations.
Proof. Clearly, in the first case the initial error is already below the stated bound. Because no iterations are performed, no further errors are introduced, and the final error remains below this bound.
For the remaining small values of f, we exhaustively compute the error for all possible inputs b, taking into account all rounding possibilities. The maximum error found this way lies within the stated bound.
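The exhaustive computation for small f can be mimicked along the following lines. This is only a sketch: it assumes, for illustration, that the scaled input b ranges over the f-bit values in [1/2, 1), that the initial approximation is the constant 1, and that one Newton step rounds as shown; Algorithm 5 may differ in all of these choices.

```python
import math
from fractions import Fraction

def max_abs_error(f: int, iters: int) -> float:
    """Exhaustively run the rounded Newton iteration for 1/sqrt(b) over all
    f-bit inputs b in [1/2, 1) (assumed range) and return the largest
    absolute error observed."""
    scale = 1 << f
    worst = 0.0
    for k in range(scale // 2, scale):            # enumerate b = k / 2^f
        b = Fraction(k, scale)
        x = Fraction(1)                           # illustrative initial value
        for _ in range(iters):
            t = Fraction(round(b * x * x * scale), scale)        # rounded b*x^2
            x = Fraction(round(x * (3 - t) / 2 * scale), scale)  # rounded update
        worst = max(worst, abs(float(x) - 1.0 / math.sqrt(float(b))))
    return worst
```

For f = 8 this loops over only 128 inputs, so the maximum error is found in milliseconds; the same brute-force idea applies to the small values of f treated in the proof.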
For larger values of f, we follow an approach analogous to that for the reciprocal: first, we derive an expression that bounds the absolute error as a function of f and the number of iterations; second, we compute the value of this error bound in the smallest such case, which lies below the stated bound; third, we show that for larger values of f, the error bound is always smaller than in that case.
Following Lemma A5, we know that in the case of exact arithmetic, the error at the start of the final iteration is bounded by the expression given there. Lemma A6 tells us that the rounding error in the first iteration is bounded by one expression, while in every subsequent iteration it is bounded by another. Combining these results, we obtain the following bound on the total error at the start of the final iteration:
Applying (A4) to this bound, without the third-order term (which may be omitted in this case), gives
In the smallest case considered, this yields an expression whose maximum value, by a simple numerical analysis, is slightly below the stated bound.
We complete the proof by showing that the error bound for larger values of f never exceeds the bound just obtained, with the number of iterations defined by (12). Since the number of iterations is increasing as a function of f, for each iteration count we may consider the lowest value of f attaining that count; this value is then itself increasing as a function of the iteration count.
Since the relevant inequality clearly holds for all intermediate values of f, it suffices to bound the error at these lowest values. To that end, we will consider separately the three terms in the definition of the bound that depend on the iteration count and on f.
To evaluate the first term, we note that the constant involved is defined to have the value 1.045 in the smallest case, and smaller values thereafter. It thus follows that the entire first term decreases rapidly with the iteration count.
For convenience, we introduce two auxiliary quantities in the analysis below; the second term may then be rewritten in terms of them. Using the definition of the number of iterations, we obtain an expression whose parameters are confined to a bounded range. Taking the derivative thus yields for the second term a bound
where the quantities involved lie in the indicated ranges. This bound is almost identical to the bound we found in the analysis for the reciprocal. The factor before the outer parentheses is now positive for any valid b and f. Additionally, it is easy to verify that the factor within the outer parentheses is negative in the smallest case. Because the negative part only increases in absolute size with f, the derivative is, and will remain, negative. This shows that the original term is decreasing as a function of f.
The third term is handled in the same way. Writing the number of iterations as a function of f, we may rewrite the term accordingly. Taking the derivative, we find a bound of the same shape.
Again, this bound is very similar to the bound in the analysis for the reciprocal. The factor before the outer parentheses is positive for any valid b and f, and with the known constants it is easy to verify that the factor within the outer parentheses is negative in the smallest case. Additionally, the negative part only increases in absolute size with f. Therefore, the derivative is always negative, which shows that the original term is decreasing as a function of f.
Combining these results shows that the error bound holds for all valid f and b, which proves the statement. Note that the bound could be tightened further by computing it explicitly for an arbitrary larger value of f. □
Similar to the reciprocal, we will perform our computations with extra precision to control the effect of rounding. Therefore, in the following, we assume a total of f + n fractional bits. Applying Theorem 4 then bounds the resulting error.
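The search for the minimal number of additional bits n can be sketched in the same brute-force style. The input range, the initial approximation, and the target accuracy 2^-f used below are illustrative assumptions; the paper determines n analytically from the bound of Theorem 4 rather than by search.

```python
import math
from fractions import Fraction

def _err(b: Fraction, grid: int, iters: int) -> float:
    """Absolute error of the rounded Newton iteration for one input b,
    with all intermediate values rounded to the grid of multiples of 1/grid."""
    x = Fraction(1)                                  # illustrative initial value
    for _ in range(iters):
        t = Fraction(round(b * x * x * grid), grid)         # rounded b*x^2
        x = Fraction(round(x * (3 - t) / 2 * grid), grid)   # rounded update
    return abs(float(x) - 1.0 / math.sqrt(float(b)))

def minimal_extra_bits(f: int, iters: int, n_max: int = 8) -> int:
    """Return the smallest n <= n_max such that computing with f + n
    fractional bits keeps the absolute error below 2^-f for every f-bit
    input b in [1/2, 1) (assumed range); -1 if no such n is found."""
    for n in range(n_max + 1):
        grid_in, grid = 1 << f, 1 << (f + n)
        if all(_err(Fraction(k, grid_in), grid, iters) < 2.0 ** -f
               for k in range(grid_in // 2, grid_in)):
            return n
    return -1
```

This mirrors the role of the extra precision in the analysis: the iteration is carried out on a finer grid of 2^-(f+n), and n is chosen just large enough that the accumulated rounding errors stay below the target precision 2^-f.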