Rosenblatt’s First Theorem and Frugality of Deep Learning

Rosenblatt's first theorem about the omnipotence of shallow networks states that elementary perceptrons can solve any classification problem if there are no discrepancies in the training set. Minsky and Papert considered elementary perceptrons with restrictions on the neural inputs: a bounded number of connections or a relatively small diameter of the receptive field for each neuron in the hidden layer. They proved that under these constraints an elementary perceptron cannot solve some problems, such as the connectivity of input images or the parity of pixels in them. In this note, we demonstrate Rosenblatt's first theorem at work, show how an elementary perceptron can solve a version of the travel maze problem, and analyse the complexity of that solution. We also construct a deep network algorithm for the same problem; it is much more efficient. The shallow network uses an exponentially large number of neurons in the hidden layer (Rosenblatt's A-elements), whereas for the deep network second-order polynomial complexity is sufficient. We demonstrate that for the same complex problem the deep network can be much smaller, and reveal a heuristic behind this effect.


Introduction
Rosenblatt [1] studied elementary perceptrons (Fig. 1). A- and R-elements are classical linear threshold neurons. The R-element is trainable by the Rosenblatt algorithm, while the A-elements should represent a sufficient collection of features. Rosenblatt assumed no restrictions on the choice of the A-elements. He proved that elementary perceptrons can separate any two non-intersecting sets of binary images (Rosenblatt's Theorem 1 in [1]). The proof is very simple. For each binary image x we can create an A-element A_x that produces output 1 for this image and 0 for all others. Indeed, let the input retina have n elements and x = (x_1, ..., x_n) be a binary vector (x_i = 0 or 1) with k non-zero elements. The corresponding A-element A_x has input synapses with weights w_i = 1/k if x_i = 1 and w_i = -1/k if x_i = 0. For an arbitrary binary image y, Σ_i w_i y_i ≤ 1, and this sum is equal to 1 if and only if y = x. The threshold for the A_x output can be selected as 1 - 1/(2k). Thus,

OutA_x(y) = 1 if y = x, and OutA_x(y) = 0 otherwise.   (1)

The set of neurons A_x created for all binary vectors x transforms binary images into the vertices of the standard simplex in R^{2^n} with coordinates OutA_x(y) (1). Any two non-intersecting subsets of the standard simplex can be separated by a hyperplane. Therefore, there exists an R-element that separates them. According to the convergence theorem (Rosenblatt's Theorem 4 in [1]), this R-element can be found by the perceptron learning algorithm (a simple relaxation method for solving systems of linear inequalities).
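The construction in the proof is easy to check directly. The sketch below (the helper names `make_A_element` and `A_output` are ours, for illustration) builds the weights w_i = ±1/k and the threshold 1 - 1/(2k) for a small binary image and verifies that the resulting A-element fires on x and on no other image:

```python
import itertools

def make_A_element(x):
    """Weights and threshold of the indicator A-element A_x (assumes k > 0)."""
    k = sum(x)  # number of non-zero pixels of x
    w = [1.0 / k if xi == 1 else -1.0 / k for xi in x]
    threshold = 1.0 - 1.0 / (2 * k)
    return w, threshold

def A_output(w, threshold, y):
    """Linear threshold neuron: fires iff the weighted sum exceeds the threshold."""
    s = sum(wi * yi for wi, yi in zip(w, y))
    return 1 if s > threshold else 0

# A_x fires on x and keeps silent on every other binary image of the retina.
n = 4
x = (1, 0, 1, 1)
w, th = make_A_element(x)
for y in itertools.product((0, 1), repeat=n):
    assert A_output(w, th, y) == (1 if y == x else 0)
```

Any image y ≠ x either misses a one of x or has an extra one, and in both cases the weighted sum drops by at least 1/k below 1, which is why the threshold 1 - 1/(2k) separates x from all other vertices.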
Thus, Rosenblatt's first theorem is proven:

Theorem 1.1. An elementary perceptron can separate any two non-intersecting sets of binary images.
Of course, selecting the A-elements in the proposed form for all 2^n binary images is not necessary in realistic applied classification problems. Sometimes even an empty hidden layer can be used (the so-called linearly separable problems). Therefore, together with Rosenblatt's Theorem 1 we get the problem of a reasonable (if not optimal) selection of A-elements. There are many frameworks for approaching this question, for example the general feature selection algorithms: we generate (for example, randomly, or with some additional heuristics) a large set of A-elements that is sufficient for solving the classification problem, and then select an appropriate subset of features using different methods. For the bibliography on feature selection we refer to recent reviews [2, 3, 4].
The Minsky and Papert book "Perceptrons" [5] was published seven years after Rosenblatt's book. They started from restricted perceptrons and assumed that each A-element has a bounded receptive field (either of a pre-selected diameter or with a bounded number of inputs). Immediately, instead of Rosenblatt's omnipotence of unrestricted perceptrons, they found that an elementary perceptron with such restrictions cannot solve some problems, like connectivity of the image or parity of the number of pixels in it. Minsky and Papert's results were generalized to more general metric spaces and graphs [6]. The heuristic behind these results is quite simple: if a human cannot solve the problem immediately, at a glance, and needs to apply some sequential operations like counting pixels or following a tangled path, then this problem is not solvable by a restricted elementary perceptron.
At the same time, we can expect that the unrestricted elementary perceptron can solve this problem, but at the cost of great (exponential?) complexity. Multilayer ("deep") networks are expected to solve these problems without an explosion of complexity. In that sense, deep networks should be simpler than shallow networks for problems that cannot be solved by restricted elementary perceptrons and require (from humans) a combination of parallel and sequential actions.
In this note, we demonstrate the relative simplicity of deep solvers on a version of the well-known travel maze problem (Fig. 2).This geometric problem is closely related to the connectivity problem and has been used for benchmarking in various areas of machine learning (see, for example, [7]).

Figure 2: Have we chosen the right delicacies (right) for our guests (left)? a) A prototype travel maze problem. b) A simplified form of the problem with piecewise linear paths for further formal description (Sec. 2). Complexity depends on the number of guests and the number of links in a path.
For formal analysis of the travel maze problem, we need to represent the paths on a discrete retina of S-elements (Fig. 1). Then, to implement the logic of the proof of Rosenblatt's first theorem, each A-element should be an indicator element for a possible path. For each guest-delicacy pair in Fig. 2 a) or b), an elementary perceptron must be created that returns 1 if there is a path from this guest to this delicacy, and 0 if there is no such path. Thus, n^2 elementary perceptrons should be created. We can easily combine them into a shallow network with n^2 outputs. To finalize the formal statement we should specify the set of possible paths. In our work, we select a very simple specification without loops, steps back, or non-transversal intersections of paths (Fig. 2 b)).

Formal problem statement

Consider the following problem. There are n people, each of whom owns a single object from the set {1, 2, ..., n}, and different people own different objects. This correspondence between people and objects can be drawn as a diagram consisting of n broken lines, each of which contains L links (Fig. 3). Each stage of the diagram consists of n links and can be encoded by a permutation or, equivalently, by a permutation matrix in a natural way. Namely, if an edge is drawn from node i to node π(i) (i = 1, 2, ..., n), then the permutation can be represented in the form

π = ( 1     2     ...   n
      π(1)  π(2)  ...  π(n) ),

or by the permutation matrix P = (p_{ij}), where p_{ij} = 1 if j = π(i) and p_{ij} = 0 otherwise. If the permutation matrix X_i (i = 1, 2, ..., L) corresponds to the i-th stage, then the product X_1 · X_2 · ... · X_L is again a permutation matrix, and it defines the correspondence "person-object". It is required to construct a shallow (fully connected) neural network that determines the correspondence "person-object" from the diagram.
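The encoding of stages by permutation matrices, and the fact that their product gives the end-to-end "person-object" correspondence, can be sketched as follows (the helper `perm_matrix` is ours, with 0-based indices):

```python
import numpy as np

def perm_matrix(pi):
    """n x n permutation matrix P with P[i, pi[i]] = 1 (0-based indices)."""
    n = len(pi)
    P = np.zeros((n, n), dtype=int)
    P[np.arange(n), pi] = 1
    return P

# Three stages of a diagram for n = 3 people. The matrix product composes
# the stage permutations: person i ends at pi3(pi2(pi1(i))).
X1 = perm_matrix([1, 0, 2])
X2 = perm_matrix([2, 1, 0])
X3 = perm_matrix([0, 2, 1])
Z = X1 @ X2 @ X3
# Z maps person 0 -> object 2, person 1 -> object 1, person 2 -> object 0,
# and Z is again a permutation matrix (one 1 per row and per column).
```
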

Figure 4: A shallow (fully connected) neural network for the travel maze problem (Fig. 3). The input layer contains Ln^2 neurons. The network differs from the classical elementary perceptron (Fig. 1) by having n^2 output neurons instead of one, and can be considered as a union of n^2 elementary perceptrons with a joint retina and hidden layer of A-elements.
We denote by P_k (k = 0, 1, ..., n! - 1) the permutation matrices corresponding to the permutations π_k listed in lexicographical order; the entries of the matrix P_k equal 1 in the positions (i, π_k(i)) (i = 1, ..., n), and the other entries are equal to 0.
Let (X_1, X_2, ..., X_L) (X_i ∈ M for all i = 1, ..., L) be an L-tuple (a word of length L) over the set M of permutation matrices. The number of such permutation matrices is n!, and the number of L-tuples with elements from M is (n!)^L. Consider all such words arranged in lexicographical order. To each word W_j = (X_1, X_2, ..., X_L) with number j (j = 0, 1, ..., (n!)^L - 1) we assign, with the same number, the product P^t_j = X_1 · X_2 · ... · X_L. The matrix P^t_j is also an n × n permutation matrix.
The entries of the matrices X_1, X_2, ..., X_L are the inputs of the neural network (see Fig. 4). Each input corresponds to an input neuron (S-element, Fig. 1). An inner-layer A-neuron y_j corresponds to the L-tuple W_j = (X_1, X_2, ..., X_L) with the same number j. The neuron y_j should give the output signal 1 if the input vector is W_j, and output 0 for all other (n!)^L - 1 possible input vectors. Other input vectors are impossible in our settings (Fig. 4). Each matrix element of every permutation matrix X_i is either 0 or 1; therefore, the L-tuples can be coded as 0-1 sequences, that is, vertices of the Ln^2-dimensional unit cube. (Apparently, there are more vertices than L-tuples of permutation matrices.) This cube is a convex body and each vertex can be separated from all other vertices by a linear functional. In particular, for each j we can find a linear functional l_j such that l_j(W_j) > 1/2 and l_j(W_k) < 1/2 (k ≠ j; k, j = 1, ..., (n!)^L). Here, with some abuse of language, we use the same notation for the tuple W_j and the corresponding vertex of the cube (a 0-1 sequence of length Ln^2). Thus, each inner neuron y_j can be chosen in the form of a linear threshold element with the output signal (compare to (1) and Theorem 1.1)

y_j = h(l_j(X_1, X_2, ..., X_L) - 1/2),

where h is the Heaviside step function. We use the same notation y_j for the output of the neuron y_j.
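One explicit choice of such a separating functional (our illustration, not the only possible one) mirrors the A_x construction from the proof of Theorem 1.1: weights +1 on the 1-bits of the target vertex, -1 on its 0-bits, with a bias chosen so that the target scores 1 and every other vertex scores at most 0.

```python
import itertools

def vertex_neuron(target):
    """Linear threshold neuron firing exactly on one 0-1 vertex `target`."""
    k = sum(target)
    w = [1 if b == 1 else -1 for b in target]
    bias = -(k - 1)  # l(target) = k - (k - 1) = 1; any other vertex scores <= 0

    def y(v):
        l = sum(wi * vi for wi, vi in zip(w, v)) + bias
        return 1 if l > 0.5 else 0  # Heaviside step at 1/2

    return y

# The neuron fires on its target vertex and on no other vertex of the cube.
d = 5
target = (1, 0, 1, 1, 0)
y = vertex_neuron(target)
assert all(y(v) == (1 if v == target else 0)
           for v in itertools.product((0, 1), repeat=d))
```

Flipping any single bit of the target lowers the weighted sum by exactly 1, so the separation margin around the threshold 1/2 is the same for every vertex.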
The structural difference between the shallow network for the travel maze problem (Fig. 4) and the elementary perceptron (Fig. 1) is the number of neurons in the output layer. For the travel maze problem the answer is a permutation matrix with n^2 0-1 elements. The inner-layer neuron y_j detects the L-tuple of one-step permutation matrices W_j = (X_1, X_2, ..., X_L). When this input vector is detected, y_j sends the output signal 1 to the output neurons connected with it. For all other input vectors, it keeps silent. The output neurons are simple linear adders. The output neurons z_{qr} are labelled by pairs of indices q, r = 1, ..., n. The matrix of outputs is the permutation matrix from the start to the end of the travel. The structure of the output connections of y_j is determined by the input L-tuple W_j = (X_1, X_2, ..., X_L): the connection from y_j to z_{qr} has weight 1 if the corresponding entry (P^t_j)_{qr} = 1, and weight 0 if (P^t_j)_{qr} = 0. Thus the neuron y_j encodes the answer to our problem. Let us represent the network functioning in more detail with explicit algebraic presentations. All the inputs and outputs are Boolean (0-1) variables. We use the standard Boolean algebra notations; in particular, x̄ = 1 - x. If y_j is a neuron of the inner layer, then it corresponds to the expansion of j in base n!: j = a_{j,L-1}(n!)^{L-1} + a_{j,L-2}(n!)^{L-2} + ... + a_{j,0}, so that W_j = (X_1, ..., X_L) = (P_{a_{j,L-1}}, P_{a_{j,L-2}}, ..., P_{a_{j,0}}).

The inner neuron y_j is the conjunction of the input entries selected by the permutations of W_j:

y_j = x^{(1)}_{1, π_{a_{j,L-1}}(1)} · x^{(1)}_{2, π_{a_{j,L-1}}(2)} · ... · x^{(L)}_{n, π_{a_{j,0}}(n)}.

The third-level neurons z_{ij} form the matrix Z = (z_{ij}) that is the answer to this problem:

Z = Σ_j y_j P^t_j.

Since exactly one neuron y_j fires for any admissible input, this sum selects the correct permutation matrix. The constructed network memorizes the products of all L-tuples of permutation matrices, recognizes the input L-tuple of permutations, and sends the product to the output. For example, for n = 2 and L = 3 we have 5 = 1 · (2!)^2 + 0 · (2!) + 1 for j = 5, therefore W_5 = (P_1, P_0, P_1). We can write similar expressions for all other y_j.
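The whole shallow construction can be sketched end-to-end for tiny n and L. In this sketch (function names are ours) the inner layer is modelled as an exact lookup over the (n!)^L cube vertices rather than as explicit threshold units; the stored matrix plays the role of the output connections carrying P^t_j to the adders z_{qr}:

```python
import itertools
import numpy as np

def perm_matrix(pi):
    """Permutation matrix with a 1 in position (i, pi[i]) (0-based)."""
    n = len(pi)
    P = np.zeros((n, n), dtype=int)
    P[np.arange(n), pi] = 1
    return P

def shallow_maze_network(n, L):
    """One 'inner neuron' per L-tuple W_j: memorize all (n!)^L products."""
    perms = [perm_matrix(p) for p in itertools.permutations(range(n))]
    table = {}
    for word in itertools.product(perms, repeat=L):  # (n!)^L inner neurons
        key = tuple(np.concatenate([X.ravel() for X in word]))
        P = word[0]
        for X in word[1:]:
            P = P @ X  # P_j^t = X_1 * X_2 * ... * X_L
        table[key] = P

    def forward(word):
        key = tuple(np.concatenate([X.ravel() for X in word]))
        return table[key]  # exactly one y_j fires; the adders emit P_j^t
    return forward

net = shallow_maze_network(2, 3)  # (2!)^3 = 8 inner neurons
word = [perm_matrix([1, 0]), perm_matrix([0, 1]), perm_matrix([1, 0])]
Z = net(word)  # swap, identity, swap -> identity correspondence
```

Even at n = 4, L = 4 the table already has (4!)^4 = 331776 entries, which is the exponential cost the deep construction below avoids.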

Deep neural network solution
The calculation of the matrix Z = X_1 · X_2 · ... · X_L can be performed by a deep network that multiplies the matrices sequentially: Z_1 = X_1, Z_{k+1} = Z_k · X_{k+1} (k = 1, ..., L - 1), Z = Z_L. To calculate the entries of the matrices, we use conjunction and addition modulo 2. The resulting network has depth L and requires only polynomially many (in n and L) neurons and connections between neurons.
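The sequential multiplication can be sketched directly, one stage per layer, with each entry computed by conjunction and addition modulo 2 as in the text (function names are ours):

```python
import numpy as np

def bool_matmul(A, B):
    """One stage of the deep network: entries via conjunction (AND) and
    addition modulo 2 (XOR). For permutation matrices at most one term
    per entry equals 1, so the mod-2 sum coincides with the usual sum."""
    n = A.shape[0]
    C = np.zeros((n, n), dtype=int)
    for q in range(n):
        for r in range(n):
            acc = 0
            for s in range(n):
                acc ^= A[q, s] & B[s, r]  # AND, then mod-2 accumulation
            C[q, r] = acc
    return C

def deep_maze_network(word):
    """Depth-L pipeline: Z_1 = X_1, Z_{k+1} = Z_k * X_{k+1}."""
    Z = word[0]
    for X in word[1:]:
        Z = bool_matmul(Z, X)
    return Z
```

Each stage touches only the current partial product and the next one-step permutation, which is the source of the polynomial (rather than exponential) complexity.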
Let A be a banded matrix of bandwidth r + 1, that is, A_{ij} = 0 whenever |i - j| > r. The maximum number of nonzero entries in an arbitrary row of A is at most 2r + 1, so the number of nonzero entries in A is at most (2r + 1)n. If A and B are banded matrices of bandwidth r + 1 and t + 1, respectively, then the product AB is a banded matrix of bandwidth r + t + 1.
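Both counting facts are easy to check numerically (the helper `bandwidth` is ours; it returns the distance r from the diagonal rather than the bandwidth r + 1):

```python
import numpy as np

def bandwidth(A):
    """Smallest r such that A[i, j] == 0 whenever |i - j| > r."""
    idx = np.nonzero(A)
    return int(max(abs(i - j) for i, j in zip(*idx))) if len(idx[0]) else 0

rng = np.random.default_rng(0)
n, r, t = 8, 2, 1
A = rng.integers(0, 2, (n, n))
B = rng.integers(0, 2, (n, n))
# Zero out the entries outside the bands.
i, j = np.indices((n, n))
A[np.abs(i - j) > r] = 0
B[np.abs(i - j) > t] = 0

assert np.count_nonzero(A) <= (2 * r + 1) * n  # at most 2r+1 nonzeros per row
assert bandwidth(A @ B) <= r + t               # band distances add on products
```

The band of the product adds up because a nonzero entry (AB)_{ij} needs an index s with |i - s| ≤ r and |s - j| ≤ t, hence |i - j| ≤ r + t.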
Theorem 5.1. For an r-bounded problem, there is a shallow neural network with a depth of 3 and explicit bounds on the numbers of neurons and connections between them.
Theorem 5.2. For an r-bounded problem, there is a deep neural network with depth L and (ir + 1)N_{ir} connections between neurons at stage i if Lr ≤ n - 1.

Conclusion and outlook
• A shallow neural network combined from elementary Rosenblatt perceptrons can solve the travel maze problem, in accordance with Rosenblatt's first theorem.
• The complexity of the constructed deep-network solution of the travel maze problem is much smaller than that of the shallow-network solution (the main terms are 2Ln^2 versus (n!)^L for the numbers of neurons and 2Ln^3 versus Ln^2(n!)^L for the numbers of connections).
The first result is important in the context of the widespread myth that Rosenblatt's elementary perceptrons have limited abilities and that Minsky and Papert revealed these limitations. This mythology has penetrated even into the encyclopedic literature [8].
Rosenblatt's original perceptrons [1] (Fig. 1) can solve any problem of classification of binary images and, after minor modification, even wider classes of problems. This simple fact was proven in Rosenblatt's first theorem, and nobody has criticised this theorem and its proof. The universal representation properties of shallow neural networks were studied in the 1990s from different points of view, including approximation of real-valued functions [9] and evaluation of upper bounds on rates of approximation [10]. An elegant analysis of shallow neural networks involved infinite-dimensional hidden layers [11], and upper bounds were derived on the speed of decrease of the approximation error as the number of network units increases. Abilities and limitations of shallow networks were recently reviewed in detail [12].
Of course, a single R-element can solve only linearly separable problems, and, obviously, not all problems are linearly separable. Stating this trivial fact does not require any intellectual effort. Minsky and Papert [5] considered much more complex systems than a single linear threshold R-element. They studied the same elementary perceptrons that Rosenblatt did (Fig. 1) with one restriction: the receptive fields of the A-elements are bounded. This limitation may mean a sufficiently small diameter of the receptive field (the most common condition) or a limited number of input connections of each A-neuron. Elementary perceptrons with such restrictions have limited abilities: if we have only local information, then we cannot solve such a global problem as checking the connectivity of a set, or the travel maze problem, at one glance. We should integrate the local knowledge into a global criterion using a sequence of steps. This intuitively clear statement was accurately formalised and proved for the parity problem by Minsky and Papert [5].
Without restrictions, elementary perceptrons are omnipotent. In particular, they can solve the travel maze problem in the proposed form, but the complexity of the solutions can be huge (Theorem 3.1). On the contrary, the deep network solution (Theorem 4.1) is much simpler and seems much more natural. It combines the solution from the one-step permutations locally, step by step, whereas the shallow network operates with all possible global paths. Restriction of the possible travel paths by a bounded radius of a single step (Sec. 5) does not change the situation qualitatively. (The restricted problem is simpler than the original one. This should not be confused with the possible network limitations, which complicate all problems.) The second observation seems to be more important than the first one: properly selected deep solutions can be much simpler than shallow solutions. In contrast to the widely discussed huge deep structures and their surprising efficiency (see the detailed exposition of the mathematics of deep learning in [13]), relatively small but deep neural networks are unsurprisingly effective for solving problems where local information should be integrated into a global decision, as in the discussed version of the travel maze problem. These networks combine the benefits of fine-grained parallel processing and solving problems at a glance with the possibility to emulate the logic of sequential data analysis when it is necessary. The important question in this context is: "How deep should be the depth?" [14]. The answer depends on the problem.
An open question remains: are the complexity estimates sharp? How far are our solutions from the best ones? We do not expect this problem to have a simple solution, because even for multiplication of n × n matrices no final solution has yet been found, despite great efforts and significant progress (to the best of our knowledge, the latest improvement, from n^2.37287 to n^2.37286, was achieved recently [15]). Another open question might attract attention: analyse the original geometric travel maze problem (Fig. 2) instead of its more algebraic simplification presented in Fig. 3. It includes many non-trivial tasks, for example, a convenient discrete representation of the possible paths with bounded curvature, lengths, and ends, and a constructive selection of ε-networks in the space of such paths for preparing the input weights of the A-elements.
The complexities of functions computable by deep and shallow networks used for solving classification problems were compared for networks of the same complexity [16]. The complexity of functions was measured using topological invariants of the superlevel sets. The results seem to support the idea that deep networks can address more difficult problems with the same amount of resources.
The problem of effective parallelism can be considered the central problem addressed by neuroinformatics as a whole [17]. It has long been known that the efficiency of parallel computations increases more slowly than the number of processors. There is a well-known "Minsky hypothesis": the efficiency of a parallel system increases (approximately) proportionally to the logarithm of the number of processors; at least, it is a concave function. Shallow neural networks pretend to solve all problems in one step, but the cost of that may be an enormous amount of resources. Deep networks make possible a trade-off between resources (the number of elements) and the time needed to solve a problem, since they can combine the efficient parallelism of neural networks with elements of sequential reasoning. Therefore, neural networks can be a useful tool for solving the problem of efficient fine-grained parallel computing if we can answer the question: how deep should the depths be for different classes of problems? The case study presented in our note gives an example of a significant increase of efficiency for a reasonable choice of depth.

Figure 3: Game diagram with L stages (a formalized and simplified version of the travel maze problem).

Figure 7: A deep neural network diagram for simplified travel maze problem.