Successive Standardization of Rectangular Arrays

In this note we illustrate and develop further with mathematics and examples, the work on successive standardization (or normalization) that is studied earlier by the same authors in Olshen and Rajaratnam (2010) and Olshen and Rajaratnam (2011). Thus, we deal with successive iterations applied to rectangular arrays of numbers, where to avoid technical difficulties an array has at least three rows and at least three columns. Without loss, an iteration begins with operations on columns: first subtract the mean of each column; then divide by its standard deviation. The iteration continues with the same two operations done successively for rows. These four operations applied in sequence completes one iteration. One then iterates again, and again, and again,.... In Olshen and Rajaratnam (2010) it was argued that if arrays are made up of real numbers, then the set for which convergence of these successive iterations fails has Lebesgue measure 0. The limiting array has row and column means 0, row and column standard deviations 1. A basic result on convergence given in Olshen and Rajaratnam (2010) is true, though the argument in Olshen and Rajaratnam (2010) is faulty. The result is stated in the form of a theorem here, and the argument for the theorem is correct. Moreover, many graphics given in Olshen and Rajaratnam (2010) suggest that but for a set of entries of any array with Lebesgue measure 0, convergence is very rapid, eventually exponentially fast in the number of iterations. Because we learned this set of rules from Bradley Efron, we call it"Efron's algorithm". More importantly, the rapidity of convergence is illustrated by numerical examples.


Introduction
Research summarized in this paper is an extension of that reported in [2] and a conference proceeding [3], and concerns successive standardization (or normalization) of large rectangular arrays X of real numbers, such as arise in gene expression, from protein chips, or in the earth and environmental sciences. The basic message here is that convergence that holds on all but a set of measure 0 in the paper by Olshen and Rajaratnam ([2] is shown here to be exponentially fast in a sense we make precise. The basic result in [2], though true is not argued correctly in [2]. The gap is filled here. Typically there is one column per subject; rows correspond to "genes" or perhaps gene fragments (including those that owe to different splicing of "the same" genes, or proteins). Typically, though not always, columns divide naturally into two groups: "affected" or not. Two-sample testing of rows that correspond to "affected" versus other individuals or samples is then carried out simultaneously for each row of the array. Corrections for multiple comparisons may be very simple, or might perhaps allow for "false discovery." As was noted in [2], data may still suffer from problems that have nothing to do with differences between groups of subjects or differences between "genes" or groups of them. There may be variation in background, perhaps also in "primers." Thus, variability across subjects might be unrelated to status. In comparing two random vectors that may have been measured in different scales, one puts observations "on the same footing" by subtracting each vector's mean and dividing by its standard deviation. Thereby, empirical covariances are changed to empirical correlations, and comparisons proceed. But how does one do this in the rectangular arrays described earlier? An algorithm by which such "regularization" is accomplished was described to us by colleague Bradley Efron; so we call the full algorithm Efron's algorithm. We shall use the terms successive "normalization" or successive "standardization" interchangeably (and also point out that Bradley Efron considers the the latter term a better description of the algorithm). To avoid technical problems that are described in [2] and repeated here, we assume there are at least three rows and at least three columns to the array. It is immaterial to convergence, though not to limiting values, whether we begin regularization to be described by row or by column. In order that we fix an algorithm, we begin by column in the computations though by row in the mathematics. Thus, we first mean polish the column, then standard deviation polish the column; next we mean polish the row, and standard deviation polish the row. The process is then repeated. By "mean polish" of, say, a column, we mean subtract the mean value for that column from every entry. By "standard deviation" polish the column, we mean divide each number by the standard deviation of the numbers in that column. Definitions for "row" are entirely analogous.
In [2] convergence is studied with entries in the rectangular array with I rows and J columns viewed as elements of R IJ , Euclidean IJ space. Convergence holds for all but a Borel subset of R IJ of Lebesgue measure 0. The limiting vector has all row and column means 0, all row and column standard deviations 1. We emphasize that to show convergence on a Lebesgue set of full measure, it is enough to find a probability mutually absolutely continuous with respect to Lebesgue measure for which convergence is established with probability 1.

Preliminaries
We begin precise definitions:X By analogy, set and analogously define (S (n) j ) 2 for n = 2, 3, .. By induction on dimension applied to regular conditional distributions, (S (0) i ) 2 > 0 a.s. As explained in [2] the 2 × 2 case illustrates the need to work with dimensions greater than or equal to 3. Consider the following arbitrary 2 × 2 matrix If a < b and c < d, then A moment's reflection shows that if X is I × 2,with n odd, then after each row normalization, each column has an odd number of entries, each entry being +1 or −1. However, each row has exactly one +1 and one −1. Thus (S With min(I, J) ≥ 3, both tend to 1 a.s. as n ր ∞. The fact that {X ij } iid implies in particular that X is row and column exchangeable. Thus, if where each column is I × 1, and π is a permutation of the integers {1, ..., J}, then X ∼ [x π(1),c , ..., x π(J),c ]. The analogous holds for rows, where where x i,r is 1 × J

Background and Motivation
A first step in the argument of [2] is to note that Lebesgue measure on the Borel subsets of R IJ is mutually absolutely continuous with respect to IJ product Gaussian measure, each coordinate being standard Gaussian. Thereby, the distinction between measure and topology is blurred; arguments of [2] as corrected here. Having translated a problem concerning Lebesgue measure to one concerning Gaussian measure, one cannot help note from graphs in [2] Figures 1, 2, 3, 6, and 7 that suggest with entries of X chosen independently from a common absolutely continuous distribution, and as led to these figures, almost surely ultimately, convergence is at least exponentially fast. In these graphs, the ordinate is always the difference of logarithms (the base does not matter) of squared Frobenius norm of the difference between current iteration and the immediately previous iteration; always abscissa indexes iteration. One purpose of this paper is to demonstrate the ultimately almost sure rapidity of convergence. Readers will note we assume that coordinates are independent and identically distributed, with standard normal distributions. This assumption is used explicitly though unnecessarily in [2], but only implicitly here, and then only to the extent that arguments here depend on those in [2]. Obviously, this Gaussian assumption is sufficient. It is pretty obviously not necessary.
For the sake of motivation we illustrate the patterns of convergence from successive normalizations on 5×5 matrices. We first describe the details of the successive normalization in a manner very similar to [2] . Consider a simple 5 × 5 matrix with entries generated from a normal distribution with a given mean and standard deviation. In our case we take the mean to be 2 and variance to be 4, though the specific values the mean and variance parameter take do not really matter. We first standardize the initial matrix X (0) at the level of each by row, i.e., first subtracting the row mean from each entry and then dividing each entry in a given row by its row standard deviation. The matrix is then standardized at the level of each column, i.e., by first subtracting the column mean from each entry and then by dividing each entry by the respective column standard deviation. Row mean and standard deviation polishing followed by column mean and standard deviation polishing is defined as one iteration in the process of attempting to row and column standardize the matrix. The resulting matrix is defined as X (1) . Now the same process is repeated with X (1) and repeated until successive renormalization eventually yields a row and column standardized matrix. The successive normalizations are repeated until "convergence" which for our purposes is defined as the difference in the squared Frobenius norm between two consecutive iterations being less than 10 −8 .
The figures (see  are plots of the log of the ratios of the squared Frobenius norms of the differences between consecutive iterates. In particular, they capture the type of convergence patterns that are observed in the 5 × 5 case from different starting values. In [2] it was proved that regardless of the starting value (provided the dimensions are at least 3) the process of successive normalization always converges, in the sense that it leads to a doubly standardized matrix. In addition, it was noted that the convergence is very rapid. We can see empirically from Figures 1-11 that eventually the log of the successive squared differences tends to decay in a straight line, i.e., the rate of convergence is perhaps exponential. This phenomenon of linear decay between successive iterations in the log scale is observed in all the diagrams. Hence, one is led naturally to ask whether this is always  A natural question to ask is whether the convergence phenomenon observed will still occur if simultaneous normalization is undertaken as compared to successive normalization. In other words the row and column mean polishing and row and column standard deviation polishing is all done at once: In this case the simultaneous normalization algorithm does not converge. In fact it is shown to diverge.     X denotes an I × J matrix with values x ∈ ℜ IJ . We take coordinates X ij (x) = X ij to be iid N (0, 1). As in [2], [3], 3 ≤ min(I, J) ≤ max(I, J) < ∞. As before, ij ], where, in an obvious notation X In view of [2], almost surely (S . Without loss, we take (S In various places we make implicit use of a theorem of Skorokhod, the Heine-Borel Theorem, and this obvious fact. Suppose we are given a sequence of random variables Y = (Y 1 , Y 2 , ...) and a condition C 0 that depends only on their finite-dimensional distributions. If we wish to make a conclusion C 1 concerning Y, then it is enough to find one probability space that supports Y with given finite dimensional distributions for which C 0 implies C 1 . The theorem of Skorokhod mentioned(see pages 6-8 of [1]) is to the effect that if the underlying probability space is (for measure theoretic purposes) the real line with measures given by, say, distribution functions F = (F 1 , F 2 , ...); and if F is compact with respect to weak convergence, where F q l converges to G in distribution; then if each F i is absolutely continuous with respect to some H (which itself is absolutely continuous), and Y i (x) = F i (x), then Y q l converges H-almost surely to a random variable Y .
Off of a set of probability 0, i,j (X (q) ij ) 2 = IJ for all q ≥ 1. We assume that x lies outside this "cursed" set. This is key to convergence.
Since the entries of X are independent, X is row and column exchangeable. This property is inherited by X (q) for every q. Because all entries (for q ≥ 1) are a.e. bounded uniformly in q, E{X ) 2 tends to 0 along a subsequence as m increases, not only is the limit bounded as a function of x, but also the limit random variable has expectation 0. Necessarily every almost sure subsequential limit in m of the random variables X (2m−1) ·j has mean 0. Likewise, every almost sure subsequential limit in m of the random variables (X (2m−1) ij ) 2 has expectation 1. All are bounded as functions of x. One consequence of these matters is that P (A j ) = 0. The next paragraph is a proof.
There is a subsequence of {q l } -for simplicity write it as {q l } -along which almost surely (a) lim As a corollary, one sees now that X (2q l −1) ·j → 0 a.s. Since the original {q l } could be taken to be an arbitrary subsequence of {q}, we conclude that convergence of row and column means to 0 and convergence of row and column standard deviations to 1 takes place everywhere except on a set of Lebesgue measure 0. THEOREM 4.1 Efron's algorithm converges almost surely for X on a Borel set of entries with complement a set of Lebesgue measure 0.
We turn now to a study of rates of convergence. To begin, define the following Z = lim m X (m) a.e. and λ Almost everywhere convergence and the fact that for each row and each column, not every Z ij can be of the same sign enable us to conclude that for m large enough Remember that for j = 1, ..., J ≥ 3, 1 I I i=1 (Z ij ) 2 = 1, and analogously for i = 1, ..., I ≥ 3. Now write We know that For all (i, j), Z i· = Z ·j = 0 a.e. To continue, we compute that is the positive square root of (S One argues that where D = D(I, J) < ∞. A key observation is that for every m, there exists j = j(m) for which {λ (2m−1) ij } are not of the same or all of the opposite sign as {Z ij : i = 1, ..., I}, which, as was noted, are not, themselves of the same sign. Argue analogously regarding {λ (2m) ij }. Refer now to Figure 9. Our arguments show the correctness of the concentric circles in R IJ on which X (n) , X (n+1) , X (n+2) , · · · ultimately lie. We conclude that to suitable approximation, the successive iterates lie on such circles with radii in geometric ratio. That ratio is uniformly < 1, but is not arbitrarily close to 0. However Figure 9 buries a key idea in our study of convergence are rates. Thus, write θ (n) for the angle between X (n) and Z.
Obviously, the squared Frobenius norm X (n) − Z 2 F can be expressed as . Therefore, convergence of X (n) to Z implies that θ (n) can be taken arbitrarily close to 0. This is another way of saying that for each n, X (n) 2 F = IJ. Failure of this condition is why "simultaneous normalization" fails.

Illustrations of Convergence
We now illustrate the rapidity of convergence in the 3×3 case to give a geometric perspective. First note that in the 3 × 3 case the set of fixed points are characterized by 3 unique values. For instance in the following doubly normalized matrix which arises a result of successive normalization has only three unique elements. (4) Hence we can use the first column of the limit matrix to represent the fixed point arising from successive normalization. Hence the curve of fixed points can be generated by applying the successive normalization process to random starting values. Figures 10 -11 below illustrates the curve that characterizes the set of fixed points for the 3 × 3 matrix case when this numerical exercise is implemented. The origin is marked in black. The latter 3 subfigures superimpose the unit circle on the diagram in order to illustrate that the set of fixed points represent a "ring" around the unit sphere. Figure 12 considers different starting For each example, a magnification or close up of the path is provided. It is clear from these diagrams that the algorithm "accelerates" or "speeds up" as the sequence nears its limit.  There are at least several directions for future research that continues to build from [2], [3] and from material presented here. For one, Adam Olshen notes that often in applications, and for many reasons, X may not be exactly rectangular. Thus, some initial coordinates may be missing. By arguments not reported here we have shown that so long as "missingness" is independent of values that would be reported for a full X, and no matter the realization there are almost surely genuine data for at least three rows for each column and at least three columns for each row, and so long as standard deviations are defined with "correct:" divisors, then convergence on all but a set of initial values of measure 0 is plausible. The process of rendering rows and columns with means 0 and standard deviations 1 is potentially a powerful "hammer" in search of a biological "nail." Just as studying data by (scale-free) correlations rather than in original given scales can lend insight, so, too, can the process of successive normalization described here.