Why Dilated Convolutional Neural Networks: A Proof of Their Optimality

One of the most effective image processing techniques is the use of convolutional neural networks that use convolutional layers. In each such layer, the value of the layer’s output signal at each point is a combination of the layer’s input signals corresponding to several neighboring points. To improve the accuracy, researchers have developed a version of this technique, in which only data from some of the neighboring points is processed. It turns out that the most efficient case—called dilated convolution—is when we select the neighboring points whose differences in both coordinates are divisible by some constant ℓ. In this paper, we explain this empirical efficiency by proving that for all reasonable optimality criteria, dilated convolution is indeed better than possible alternatives.


Convolutional Layers: Input and Outpu
At present, one of the most efficient techniques in image processing and in other areas is a convolutional neural network; see, e.g., [1]. Convolutional neural networks include special types of layers that perform linear transformations.
Each such layer is characterized by integer-values parameters X ≤ X, Y ≤ Y, d in ≥ 1, and d out ≥ 1; then: • the input to this layer consists of the values F d (x , y ), where d , x , and y are integers for which X ≤ x ≤ X, Y ≤ y ≤ Y, and 1 ≤ d ≤ d in ; and • the output of this layer consists of the values G d (x, y), where d, x, and y are integers for which X ≤ x ≤ X, Y ≤ y ≤ Y, and 1 ≤ d ≤ d out .

Convolutional Layer: Transformation
A general linear transformation has the form for some coefficients K d (x, x , y, y , d ).
Transformations performed by a convolutional layer are a specific case of such generic linear transformations, where the following two restrictions are imposed: • first, each value G d (x, y) depends only on the values F d (x , y ), for which both differences |x − x | and |y − y | do not exceed some fixed integer L, and • the coefficients K d (x, x , y, y , d ) depend only on the differences x − x and y − y : for some coefficients k d (i, j, d ) defined for all pairs (i, j) for which |i|, |j| ≤ L.
The values k d (i, j, d ) are known as a filter.
The resulting linear transformation takes the form Thus, the output G d (x, y) of a convolutional layer corresponding to the point (x, y) is determined by the values F d (x − i, y − j) of the input to this layer at points (x − i, y − j) corresponding to |i| ≤ L and |j| ≤ L. This is illustrated by Figure 1, where, for L = 1 and for a point (x, y) marked by an asterisk, we show all the points (x , y ) = (x 0 − i, y 0 − j) that determine the values G d (x, y). For convenience, points (x , y ) that do not affect the values G d (x, y), are marked by zeros. For L = 2, a similar picture has the following form. This is illustrated by Figure 2.

Sparse Filters and Dilated Convolution
Originally, convolutional neural networks used filters in which all the values k d (i, j, d ) for |i|, |j| ≤ L can be non-zero. It turned out, however, that we can achieve a better accuracy if we consider sparse filters, i.e., filters in which, for some pairs (i, j) with |i|, |j| ≤ L, all the values k d (i, j, d ) are fixed at 0; see, e.g., [2][3][4].
In Figure 3, we show an example of such a situation, when L = 2 and only values k d (i, j, d ), for which both i and j are even allowed to be non-zero. In general, it turned out that such a restriction works best if we only allow k d (i, j, d ) = 0 for pairs (i, j) which are divisible by some integer , i.e., if we take In this case, the layer's output signal G d (x, y) can be written in the following equivalent form: where we denoted L The resulting networks are known as dilated convolutional neural networks, since skipping some points (i, j) in the description of the filter is kind of equivalent to extending (dilating) the distance between the remaining points; see, e.g., [2][3][4].

Empirical Fact That Needs Explanation
In principle, we could select other points (i, j) at which the filter can be non-zero. For example, we could select points for which j is even, but i can be any integer. This is illustrated by Figure 4. Alternatively, for L = 2, as points (i, j) at which k d (i, j, d ) can be non-zero, we could select the points (0, 0), (0, ±1), (±1, 0), and (±2, ±2), see Figure 5. However, empirical evidence shows that the selection corresponding to dilated convolution-when we select points for which i and j are both divisible by some integer -works the best [2][3][4].
To the best of our knowledge, there is no theoretical explanation for this empirical result-that dilated convolution leads to better results that select other sets of non-zerovalued points (i, j). The main objective of this paper is to provide such an explanation.
Comment. Let us emphasize that the only objective of this paper is to explain this empirical fact; we are not yet at a stage where we can propose a new method or even any improvements to the known methods.

Let Us Reformulate This Situation in Geometric Terms: Case of Traditional Convolution
In the original convolution Formula (1), to find the values G d (x, y), the layer's output signal at a point (x, y), we need to know the values F d (x , y ), the layer's input signal at all the points (x , y ) of the type (x − i, y − j) for |i|, |j| ≤ L. We can reformulate it by saying that we need to know the values F d (x , y ) at all the points (x , y ) at which the ∞ distance does not exceed L: where we denoted We use, in this formula, the bounded subset D of the "grid" Z × Z and not the whole set S def = Z × Z only matters at the border of the domain D. So, to simplify our formulas, we can follow the usual tradition (see, e.g., [3]) and simply use the whole set S = Z × Z instead of the bounded set D: Comment. Note that the set S is potentially infinite. What makes the set of all the points (x , y )-which affects the values G d (x, y)-finite is the restriction d ∞ ((x, y), (x , y )) ≤ L, whose meaning is that such points (x , y ) should belong to the corresponding neighborhood of the point (x, y).

Case of Dilated Convolution
The dilated convolution can be described in a similar way. Namely, we can describe the Formula (4) as where S (x, y) denotes the set of all the points (x , y ) for which both differences x − x and y − y are divisible by : Note that, in this representation of dilated convolution, while we have several different sets S (x, y) for different points (x, y), there is only one filter k d (x − x , y − y , d ), namely the same filter that was used in the original representation (4). So, in this new representation, we have exactly as many parameters as before.
The main difference between this formula and the Formula (9) is that, in contrast to the usual convolution (9), where the same set S = Z × Z could be used for all the points (x, y), here, in general, we may need different sets S (x, y) for different points (x, y).
For example, if = 2, then we need four such sets: • for points (x, y) for which both x and y are even, the Formula (10) holds for x and y are even}; • for points (x, y) for which x is even but y is odd, the Formula (10) holds for x is even and y is odd}; • for points (x, y), for which x is odd but y is even, the Formula (10) holds for x is odd and y is even}; • finally, for points (x, y) for which x and y are both odd, the Formula (10) holds for In this case, instead of the single set S 1 (x, y) = S (as in the case of the traditional convolution), we have a set of such sets To avoid confusion, we will call subsets of the original "grid" Z × Z sets, while the set of such sets will be called a family. In these terms, the Formula (10) can be described as follows: where S F (x, y) denotes the set S ∈ F from the family F that contains the point (x, y): In this representation, all four sets S from the family F are infinite-just like the set S corresponding to the traditional convolution is infinite. Similarly to the traditional convolution, what makes the set of all the points (x , y )-which affects the values G r (x, y)finite is the restriction d ∞ ((x, y), (x , y )) ≤ L, whose meaning is that such points (x , y ) should belong to the corresponding neighborhood of the point (x, y). Figure 6 describes which of the four sets S ∈ F corresponds to each point (x, y) from the "grid" Z × Z: where S is the set of all the pairs (x, y) ∈ Z × Z, in which both differences x − i and y − j are divisible by 3.
In general, for an arbitrary point (x, y), we should use the set x mod , y mod .

Other Cases
Such a representation is possible not only for dilated convolution. For example, the above case when we allow arbitrary value i and require the value j to be even can be described in a similar way, with where: • for points (x, y) for which y is even, we take • and for points (x, y) for which y is odd, we take We can also have families which have an infinite number of sets; an example of such a family will be given below.
We can also, in principle, consider the situations when we do not require that the coefficients k d (x, x , y, y , d ) depend only on the differences x − x and y − y . Thus, we arrive at the following general description.

General Case
In the general case, we get the following situation: • we have a family F of subsets of the "grid" Z × Z; • the values G d (x, y) of the layer's output signal at a point (x, y) are determined by the formula for some values K d (x, x , y, y , d ), where S F (x, y) denotes the set S ∈ F from the family F that contains the point (x, y).
For the Formula (23) to uniquely determine the values G d (x, y), we need to make sure that the set S F (x, y) is uniquely determined by the point (x, y), i.e., that for each point (x, y), the family F contains one, and only one, set S that contains this point. In other words: • different sets from the family F must be disjoint, and • the union of all the sets S ∈ F must coincide with the whole "grid" Z × Z.
In mathematical terms, the family F must form a partition of the "grid" Z × Z. Comment. To avoid possible confusion, it is worth mentioning that while different sets S from the family F are disjoint, this does not preclude the possibility that sets S F (x, y) and S F (x , y ) corresponding to different points (x, y) = (x , y ) can be identical. For example, in the description of the traditional convolution, the family F consists of only one set F = S . In this case, for all points (x, y) and (x , y ), In terms of sets corresponding to different points, disjointness means that if the sets S F (x, y) and S F (x , y ) are different, then these sets must be disjoint:

We Do Not a Priori Require Shift-Invariance
Please note that we do not a priori require that the sets S F (x, y) and S F (x 0 , y 0 ) corresponding to two different points (x, y) and (x 0 , y 0 ) should be obtained from each other by shift-this property is known as shift invariance and is satisfied both for the usual convolution and for the dilated convolution.
It should be emphasized, however, that we will show that this shift-invariance property holds for the optimal arrangement.

Let Us Avoid the Degenerate Case
From a purely mathematical viewpoint, we can have a partition of the "grid" Z × Z into one-point sets {(x, y)}. This is an example when the family F has infinitely many subsets.
In this case, no matter what value L we choose, the Formula (23) implies that the values G d (x, y) of the layer's output signal at a point (x, y) are determined only by the values F d (x, y) of the layer's input at this same point. This is equivalent to using a convolution with L = 0; such a convolution is known as the 1-by-1 convolution.
While such a convolution is often useful, in this case, for each point (x, y), there is only one point (x , y ) = (x, y), so it is not possible to select only some of the points (x , y )-which is the whole idea of dilation. Since, in this paper, we study dilation, we will therefore avoid this 1-by-1 situation and additionally require that at least one set from the family F should contain more than one element.

What We Plan to Do
We will consider all possible families F that form a partition of the "grid" Z × Z, and we will show that for all optimality criteria that satisfy some reasonable conditions, the optimal family is either the family of sets corresponding to the dilated convolution-or a natural modification of this family.
Let us describe what we mean by optimality criteria.

What Does "Optimal" Mean?
In our case, we select between different families of sets F , F , . . . . In general, we select between alternatives a, b, etc. Out of all possible alternatives, we want to select an optimal one. What does "optimal" mean?
In many cases, "optimal" is easy to describe: • we have an objective function f (a) that assigns a numerical value to each alternative a-e.g., the average approximation error of the numerical method a for solving a system of differential equations, and • optimal means that we select an alternative for which the value of this objective function is the smallest possible (or, for some objective functions, the largest possible).
However, this is not the only possible way to describe optimality. For example, if we are minimizing the average approximation error, and there are several different numerical methods with the exact same smallest value of average approximation error, then we can use this non-uniqueness to select, e.g., the method with the shortest average computation time. In this case, we have, in effect, a more complex preference relation between alternatives than in the case when decision is made based solely on the value of the objective function. Specifically, in this case, an alternative b is better than the alternative a-we will denote it by a < b-if: If this still leaves several alternatives which are equally good, then we can optimize something else, and thus, have an even more complex optimality criterion.
In general, having an optimality criterion means that we are able to compare pairs of alternatives-at least some such pairs-and conclude that: • for some of these pairs, we have a < b, • for some of these pairs, we have b < a, and • for some others pairs, we conclude that alternatives a and b are, from our viewpoint, of equal value; we will denote this by a ∼ b.
Of course, these relations must satisfy some reasonable properties. For example, if b is better than a, and c is better than b, then c should be better than a; in mathematical terms, the relation < must be transitive.
We must have some alternative which is better than or equivalent to all othersotherwise, the optimization problem has no solutions. It also makes sense to require that there is only one such optimal alternative-indeed, as we have mentioned, if there are several equally good optimal alternatives, this means that the original optimality criterion is not final, that we can use this non-uniqueness to optimize something else, i.e., in effect, to modify the original criterion into a final (or at least "more final") one.

Invariance
There is an additional natural requirement for possible optimality criteria, which is related to the fact that the original "grid" Z × Z has lots of symmetries, i.e., transformations that transform this "grid" into itself.
For example, if we change the starting point of the coordinate system to a new point (x 0 , y 0 ), then a point that originally had coordinates (x, y) now has coordinates (x − x 0 , y − y 0 ). It makes sense to require that the relative quality of two different families F and F will not change if we simply change the starting point.
Similarly, we can change the direction of the x-axis, then a point (x, y) becomes (−x, y). If we change the direction of the y-axis, we obtain a transformation (x, y) → (x, −y). Finally, we can rename the coordinates: what was x will become y and vice versa; this corresponds to the transformation (x, y) → (y, x). Such transformations should also not affect the relative quality of different families.
Please note that we are not requiring that the family F of sets be shift-invariant, what we require is that the optimality criterion is shift-invariant.
Let us explain why, in our opinion, it makes sense to require that the optimality criterion is shift-invariance-as well as having other invariance properties. Indeed, let us consider any usual optimality criterion such as accuracy of classification, robustness to noise, etc. What each criterion means is, e.g., the overall classification accuracy over the set S of all possible cat and not-a-cat images I ∈ S. We want this method to correctly classify images into cats and not-cats, whether these images are centered or somewhat shifted. Thus, to adequately compare different methods, we should test these methods on a set S of images that includes both original and shifted images. Here: • if we shift each image I from the set S by the same shift (x 0 , y 0 ), i.e., replace each image I ∈ S by a shifted image I = T x 0 ,y 0 (I) for which I (x, y) = I(x − x 0 , y − y 0 ), • then, we should get, in effect, the exact same set of images: The only difference between these two sets of images may be the few images where the cat is right at the image's boundary; in this paper, we will ignore this difference-just like we ignored the bounded-ness in the previous text. In this ignoring-bounds approximation, we conclude that T x 0 ,y 0 (S) = {T(x 0 , y 0 )(I) : I ∈ S} = S.
How does a shift of the original image affect the input signals to the following convolution layers? In between the very first input layer and the following convolution layers, we may have (and usually do have) layers that perform the "compression" of the (x, y) part-i.e., that transform: • values corresponding to several points (x, y) • into values corresponding to a single new point (x , y ). In general, the (x, y)-shift of the original data corresponds to a shift of the transformed data-but by smaller shift values. For example, if data corresponding to each new (x, y)point come from data from four different "pre-compression" points, then the shift by (x 0 , y 0 ) in the pre-(x, y)-compression layer corresponds to a shift of the convolution layer input by (x 0 /2, y 0 /2).
Since the set of input images should not change if we apply a shift, we can conclude that for each convolution layer, the set of the corresponding inputs to this layer should also not change if we shift all these inputs, i.e., if we replace each input F d (x, y) with a shifted input for some shift (x 0 , y 0 ). The set of inputs on which we compare different methods does not change when we apply a shift. So, if one method was better when we processed original inputs, it should still be better if we process shifted inputs-since the resulting set of inputs is the same. In other words, the quality (e.g., accuracy) Q F (S) of a method corresponding to the family F , when gauged by the set of inputs corresponding to original images, should be the same as this method's quality Q F (T x 0 ,y 0 (S)) on the set of all the inputs obtained from the original set S by this shift-since these two sets of inputs are, in effect, the same set: T x 0 ,y 0 (S) = S. Thus, Q F (T x 0 ,y 0 (S)) = Q F (S). However, as one can see, shifting all the inputs is equivalent to shifting all the sets from the family F . Indeed, if we apply the Formula (23) to the shifted layer's input i.e., in terms of the shifted coordinates X def = x − x 0 and Y def = y − x 0 for which x = X + x 0 and y = Y + y 0 , we get-taking into account that the distance d ∞ does not change with shift-that: and where C denotes the condition (X + x 0 , Y + y 0 ) ∈ S F (X + x 0 , Y + y 0 ). In terms of the family F , the main difference between the Formulas (23) and (29) is that instead of the condition (x , y ) ∈ S F (x, y), we now have a new condition i.e., equivalently, (X , Y ) ∈ S F (X + x 0 , Y + y 0 ) − (x 0 , y 0 ). It is easy to check that this new condition is equivalent to (Y , Y ) ∈ S F (X, Y), where the new family F is obtained by shifting sets from the original family F . So: • the relative quality of two families does not change if we shift all the layer's inputs; • however, shifting all the layer's inputs is equivalent to shifting all the sets from the family F . Thus, the relative quality of two families does not change if we shift both families. In other words, a reasonable optimality criterion-which describes which family is bettershould be invariant with respect to shifts.
Similarly, we can argue that a reasonable optimality criterion should not change if we rename xand y-axes, etc. Now, we are ready for the precise formulation of the problem.

Definitions and the Main Result
Definition 1. By a family, we mean a family of non-empty subsets of the "grid" Z × Z, a family in which: • all sets from this family are disjoint, and • at least one set from this family has more than one element. Terminological comment. To avoid possible misunderstandings, let us emphasize that here, we consider several levels of sets, and to avoid confusion, we use different terms for sets from different levels: • first, we consider points (x, y) ∈ Z × Z; Comment about the requirements. In the previous text, we argued that for each family F , the union of all its sets ∪{S : S ∈ F } should coincide with the whole "grid" Z × Z. However, in our definition of an alternative, we did not impose this requirement. We omitted this requirement to make our result stronger-since, as we see from the following Proposition, this requirement actually follows from all the other requirements.

Definition 2.
By an optimality criterion, we mean a pair of relations (<, ∼) on the class of all possible families that satisfy the following conditions: Comment. The pair of relations (<, ∼) between families of subsets forms what is called a pre-order or quasi-order. This notion is more general than partial order, since, in contrast to the definition of the partial order, we do not require that if a ≤ b and b ≤ a, then a = b: in principle, we can have a ∼ b for some a = b. Definition 3. We say that a family F is optimal with respect to the optimality criterion (<, ∼), if for every other family F , we have either F < F or F ∼ F .

Definition 4.
We say that the optimality criterion is final if there exists exactly one family which is optimal with respect to this criterion.

Definition 5.
By a transformation T : Z × Z, we mean one of the following transformations: T x 0 ,y 0 (x, y) = (x − x 0 , y − y 0 ), T −+ (x, y) = (−x, y), T +− (x, y) = (x, −y), and T ↔ (x, y) = (y, x). Definition 7. We say that the optimality criterion is invariant if for all transformations T, F < F implies that T(F ) < T(F ), and F ∼ F implies that T(F ) ∼ T(F ). Proposition 1. For every final invariant optimality criterion, the optimal family is equal, for some integer ≥ 1, to one of the following two families: • the family of all the sets S ,x 0 ,y 0 def = {(x 0 + · n x , y 0 + · n y ) : n x , n y ∈ Z} corresponding to all possible pairs of integers (x 0 , y 0 ) for which 0 ≤ x 0 , y 0 < ; • the family of all the sets S ,x 0 ,y 0 def = {(x 0 + · n x , y 0 + · n y ) : n x , n y ∈ Z and n x + n y is even} corresponding to all possible pairs of integers (x 0 , y 0 ) for which 0 ≤ x 0 , y 0 < .

Comments.
• This proposition takes care of all invariant (and final) optimality criteria. Thus, it should work for all usual criteria based on misclassification rate, time of calculation, used memory, or any others used in neural networks: indeed, if one method is better than another for images in general, it should remain to be better if we simply shift all the images or turn all the images upside down. Images can come as they are, they can come upside down, they can come shifted, etc. If, for some averaging criterion, one method works better for all possible images, but another method works better for all upside-down versions of these images-which is, in effect, the same class of possible images-then, from the common sense viewpoint, this would mean that something is not right with this criterion. • The first possibly optimal case corresponds to dilated convolution. In the second possibly optimal case, the optimal family contains similar but somewhat different sets; an example of such a set is given in Figure 7. Thus, this result explains the effectiveness of dilated convolution-and also provides us with a new alternative worth trying. • The following proof is similar to several proofs presented in [5,6].
Proof. 1 • . Since the optimality criterion is final, there exists exactly one optimal family F opt . Let us first prove that this family is itself invariant, i.e., that T(F opt ) = F opt for all transformations T. Indeed, the fact that the family F opt is optimal means that for every family F , we have F < F opt or F ∼ F opt . Since this is true for every family F , it is also true for every family T −1 (F ), where T −1 denotes inverse transformation (i.e., a transformation for which T(T −1 (x, y)) = (x, y)). Thus, for every family F , we have either T −1 (F ) < F opt or T −1 (F ) ∼ F opt . Due to invariance, we have F = T(T −1 (F )) < T(F opt ) or F ∼ T(F opt ). By definition of optimality, this means that the alternative T(F opt ) is also optimal. However, since the optimality criterion is final, there exists exactly one optimal family, so T(F opt ) = F opt .
The statement is proven. 2 • . Let us now prove that the optimal family contains a set S that, in its turn, contains the point (0, 0) and some point (x, y) = (0, 0). Indeed, by definition of a family, every family-including the optimal family-contains at least one set S that has at least two points. Let S be any such set from the optimal family, and let (x 0 , y 0 ) be any of its points. Then, due to Part 1 of this proof, the set S def = T x 0 ,y 0 (S) also belongs to the optimal family, and this set contains the point Since the set S had at least two different points, the set S = T x 0 ,y 0 (S) also contains at least two different points. Thus, the set S must contain a point (x, y) which is different from (0, 0).
The statement is proven. 3 • . In the following text, by S , we will mean a set from the optimal family F opt whose existence is proven in Part 2 of this proof: namely, a set that contains the point (0, 0) and a point (x, y) = (0, 0). 4 • . Let us prove that if the set S contains a point (x, y), then this set also contains the points (x, −y), (−x, y), and (y, x).
Indeed, due to Part 1 of this proof, with the set S the optimal family F opt also contains the set T +− (S ). This set contains the point T +− (0, 0) = (0, 0). Thus, the sets S and T +− (S ) have a common element (0, 0). Since different sets from the optimal family must be disjoint, it follows that the sets S and T +− (S ) must coincide. The set T +− (S ) contains the points (x, −y) for each point (x, y) ∈ S. Since T +− (S ) = S , this implies that for each point (x, y) ∈ S , we have (x, −y) ∈ T +− (S ) = S .
Similarly, we can prove that (−x, y) ∈ S and (y, x) ∈ S . The statement is proven. 5 • . By combining the two conclusions of Part 4-that (x, −y) ∈ S and that therefore T −+ (x, −y) = (−x, −y) ∈ S , we conclude that for every point (x, y) ∈ S , the point is also contained in the set S . 6 • . Let us prove that if the set S contains two points (x 1 , y 1 ) and (x 2 , y 2 ), then it also contains the point Indeed, due to Part 1 of this proof, the set T −x 2 ,−y 2 (S ) also belongs to the optimal family. This set shares an element with the original set S . Thus, the set T −x 2 ,−y 2 (S ) must coincide with the set S . Due to the fact that (x 1 , y 1 ) ∈ S , the element belongs to the set T x 1 ,y 1 (S ) = S . The statement is proven. 7 • . Let us prove that if the set S contains a point (x, y), then, for each integer c, this set also contains the point c · (x, y) = (c · x, c · y).
Indeed, this follows from Parts 6 and 7. 9 • . The set S contains some points which are different from (0, 0), i.e., points for which at least one of the integer coordinates is non-zero. According to Parts 4 and 5, we can change the signs of both x and y coordinates and still get points from S . Thus, we can always consider points with non-negative coordinates.
Let d denote the greatest common divisor of all positive values of the coordinates of points from S .
If a value x appears as an x-coordinate of some point (x, y) ∈ S , then, due to Part 4, we have (x, −y) ∈ S and thus, due to Part 5, Similarly, if a value y appears as a y-coordinate of some point (x, y) ∈ S , then we get (0, 2y) ∈ S and thus, due to Part 3, (2y, 0) ∈ S .
It is known that a common divisor d of the values v 1 , . . . , v n can be represented as a linear combination of these values: For each value v i , we have (2v i , 0) ∈ S , thus, for 2d = c 1 · (2v 1 ) + . . . + c n · (2v n ), by Part 8, we get (2d, 0) ∈ S . Due to Part 4, we thus have (0, 2d) ∈ S , and due to Parts 6 and 7, all points (n x · (2d), n y · (2d)) for integers n x and n y also belong to the set S .
If S has no other points, then for the set containing (0, 0), we indeed conclude that this set is the same as what we described for dilated convolution, with = 2d. 10 • . What if there are other points in the set S ? Since d is the greatest common divisor of all the coordinate values, each of these points has the form (c x · d, c y · d) for some integers c x and c y . Since this point is not of the form (n x · (2d), n y · (2d)), this means that either c x , or c y is an odd number-or both.
Let us first consider the case when exactly one of the values c x and c y is odd. Without losing generality, let us assume that c x is odd, so c x = 2n x + 1 and c y = 2n y for some integers n x and n y . Due to Part 9, we have (2n x · d, 2n y · d) ∈ S , so the difference also belongs to the set S . Thus, similarly to Part 9, we can conclude that for every two integers c x and c y , we have (c x · d, c y · d) ∈ S . So, in this case, S coincides, for = d, with the set corresponding to dilated convolution.
The only remaining case is when not all points (c x · d, c y · d) belong to the set S . This means that for some such point, both values c x and c y are odd: c x = 2n x + 1 and c y = 2n y + 1 for some integers n x and n y . Due to Part 9, we have (2n x · d, 2n y · d) ∈ S , so the difference ((2n x + 1) · d, (2n y + 1) · d) − (2n x · d, 2n y · d) = (d, d) also belongs to the set S .
Since, due to Part 9, we have (2n x · d, 2n y · d) ∈ S for all n x and n y , we conclude, by using Part 5, that ((2n x + 1) · d, (2n y + 1) · d) ∈ S . So, all pairs for which both coordinates are odd multiples of d are in S . Thus, we get the new case described in the Proposition. 11 • . The previous results were about the sets containing the point (0, 0). For all other sets S containing some other point (x 0 , y 0 ), we get the same result if we take into account that the optimal family is invariant, and thus, with the set S, the optimal family also contains the set T x 0 ,y 0 (S) that contains (0, 0) and is, thus, equal either to the family corresponding to dilated convolution or to the new similar family.
The proposition is proven.

Conclusions
One of the efficient machine learning ideas is the idea of a convolutional neural network. Such networks use convolutional layers, in which the layer's output at each point is a combination of the layer's input corresponding to several neighboring points. A reasonable idea is to restrict ourselves to only some of the neighboring points. It turns out that out of all such restrictions, the best results are obtained when we only use neighboring points, for which the differences in both coordinates are divisible by some constant . Networks implementing such restrictions are known as dilated convolutional neural networks.
In this paper, we provide a theoretical explanation for this empirical conclusion.

Future Work
This paper describes a general abstract result: that for any optimality criterion that satisfies some reasonable properties, some dilated convolution is optimal. To be practically useful, it is desirable to analyze which dilated convolutions are optimal for different practi-cal situations and for specific criteria uses in machine learning, such as misclassification rate, time of calculation, used memory, etc. (and the combination of these criteria). It is also desirable to analyze what size neighborhood we should choose for different practical situations and for different criteria.
Author Contributions: All three authors contributed equally to this paper: M.C. formulated the neural-related problem, V.K. provided a general optimization framework for solving such problems, J.C. came up with an idea of how to adjust this general framework for this particular problem, and jointly, all three of us transformed this idea into precise definitions and results. All authors have read and agreed to the published version of the manuscript.