Convolutional neural networks (CNNs) are designed for data characterized by a grid topology. A CNN exploits the local relationships among adjacent elements of the data, extracting knowledge through the adaptive learning of patterns that proceeds from low-level to high-level features. CNNs are widely used in machine vision for object recognition. They are deep neural networks created to operate on grid inputs characterized by strong spatial dependencies in local regions. An example of a grid input is a two-dimensional image: it is nothing more than an array of values between 0 and 255. Adjacent pixels in an image are correlated with each other and together they define a pattern, a texture, a contour, and so on. CNNs associate these characteristics with values called weights, which will be similar for local regions with similar patterns [35].

The quality that distinguishes them is also the operation that gives them their name: convolution. It is nothing more than a scalar product between matrices, specifically between a grid structure of weights and a similar structure extracted from the input. Convolutional networks are deep networks in which the convolutional layer appears at least once, although most use far more than one. The input of a 2D CNN is an image, that is, an array in which each value corresponds to a single pixel occupying a precise position within the image. RGB (red, green, blue) images are described by a set of matrices of the intensities of the primary colors; the dimensions of an image are therefore not limited to height and width, but include a third: depth. The dimensions of the image, including the depth, are given to the input layer, the very first layer of a CNN. The subsequent activation maps, that is, the inputs of the subsequent layers, also have a multidimensional structure, and their number matches the number of independent properties relevant to the classification [36].
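As a minimal illustration of the grid structure described above, the following sketch (using NumPy purely for illustration) builds a toy RGB image as a three-dimensional array whose axes are height, width, and depth:

```python
import numpy as np

# A toy 4x4 RGB image: height x width x depth (three color channels),
# with pixel intensities in the 0-255 range.
image = np.random.randint(0, 256, size=(4, 4, 3), dtype=np.uint8)

print(image.shape)     # (height, width, depth)
print(image.shape[2])  # depth = 3 channels (R, G, B)
```

The third axis is the depth mentioned in the text: one matrix of intensities per primary color.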

Convolutional neural networks operate on grid structures characterized by spatial relationships between pixels, inherited from one layer to the next through values that describe small local regions of the previous layer. The set of matrices of a hidden layer, the result of convolution or other operations, is called a feature map or activation map; the trainable parameters are tensors called filters or kernels [37].

From a mathematical point of view, a CNN can be regarded as a densely connected neural network, with the substantial difference that the first layers carry out a convolution operation. The following equation indicates the relationship between the input and output of an intermediate layer:
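A sketch of the standard formulation can be written using the notation adopted later in this section (filter $p$, layer $q$, bias $b^{(p,q)}$); the symbols $F_q$ (filter size) and $d_q$ (depth of layer $q$) are assumptions not fixed by the text:

$$
h^{(p,\,q+1)}_{i,j} \;=\; b^{(p,q)} \;+\; \sum_{r=1}^{F_q}\sum_{s=1}^{F_q}\sum_{k=1}^{d_q} w^{(p,q)}_{r,s,k}\, h^{(q)}_{i+r-1,\;j+s-1,\;k}
$$

where $h^{(q)}$ is the activation map of layer $q$ and $w^{(p,q)}$ are the weights of the $p$-th filter applied to that layer.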

Let us analyze in detail the basic elements of a CNN-based architecture.

#### 2.4.1. Convolutional Layer

Convolution is the fundamental operation of a CNN. It places a filter (kernel) in every possible position of the image, covering it entirely, and calculates the scalar product between the kernel and the corresponding sub-matrix of the input volume of equal dimensions. Convolution can thus be viewed as overlaying the kernel on the input image. A kernel is characterized by the following hyperparameters: height, width, depth, and number [38].

Usually kernels have a square shape and a depth equal to that of the layer to which they are applied. The number of possible alignments between kernel and image defines the height and width of the next feature map. A distinction should be made between the depth of a kernel and the depth of the hidden layer/activation map: the first is the same as that of the layer to which the kernel is applied, while the second derives from the number of kernels applied. The number of kernels is a hyperparameter defined according to the capacity to distinguish ever more complex shapes that one wants to give the network. Kernels are therefore the components with which the characteristics of image patterns are associated.
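The relation between alignments and output size can be sketched as follows; the helper name is hypothetical and assumes a stride-1 convolution without padding:

```python
def feature_map_shape(height, width, kernel_size, num_kernels):
    """Output shape of a stride-1 convolution with no padding.

    Each spatial dimension allows L - F + 1 alignments of the kernel;
    the depth of the new activation map equals the number of kernels.
    """
    return (height - kernel_size + 1, width - kernel_size + 1, num_kernels)

# A 32x32 input convolved with five 5x5 kernels:
print(feature_map_shape(32, 32, 5, 5))  # (28, 28, 5)
```

Note how the depth of the output comes from the number of kernels, not from the depth of each kernel.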

The kernels of the first layers identify primitive forms, while the later ones learn to distinguish ever larger and more complex forms. One property of convolution is translation equivariance: translated images are interpreted in the same way, and the values of the activation map translate together with the input values. This means that similar shapes generate similar feature maps, regardless of their location in the image.
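Translation equivariance can be checked directly with a minimal sketch (NumPy used for illustration; `conv2d_valid` is a hypothetical helper implementing a single-channel, stride-1 convolution with no padding):

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Minimal single-channel, stride-1 convolution with no padding."""
    H, W = image.shape
    F = kernel.shape[0]
    out = np.zeros((H - F + 1, W - F + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + F, j:j + F] * kernel)
    return out

rng = np.random.default_rng(0)
kernel = rng.standard_normal((3, 3))

# The same 3x3 pattern embedded at two positions of a blank image.
img_a = np.zeros((8, 8)); img_a[1:4, 1:4] = 1.0
img_b = np.zeros((8, 8)); img_b[3:6, 3:6] = 1.0  # shifted by (2, 2)

out_a = conv2d_valid(img_a, kernel)
out_b = conv2d_valid(img_b, kernel)

# Equivariance: the responses are identical, just translated by (2, 2).
print(np.allclose(out_a[0:4, 0:4], out_b[2:6, 2:6]))  # True
```

The filter's response to the pattern moves with the pattern, which is exactly the behavior described above.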

Another property of convolution is the following: a convolution on layer q increases the receptive field of a feature in passing from layer q to layer q + 1. In other words, each value of the activation map of the next layer captures a wider spatial region than the previous one. The feature maps of successive layers capture characteristic aspects of increasingly larger regions, and this is why CNNs can be called deep: long sequences of blocks of layers are needed to examine the whole image.
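For stride-1 convolutions, the receptive field grows by F − 1 per layer; the following sketch (function name hypothetical) quantifies the growth:

```python
def receptive_field(num_layers, kernel_size=3):
    """Side of the input region seen by one output value after stacking
    `num_layers` stride-1 convolutions: it grows by F - 1 per layer."""
    return 1 + num_layers * (kernel_size - 1)

print(receptive_field(1))  # 3  -> one 3x3 layer sees a 3x3 region
print(receptive_field(3))  # 7  -> three 3x3 layers cover a 7x7 region
```

This is why long sequences of small-filter layers are needed before a single feature can summarize the whole image.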

The convolution operation involves a contraction of layer q + 1 with respect to layer q and a consequent loss of information. The problem can be mitigated by using so-called padding, a technique that adds pixels to the edges of the activation maps to maintain the spatial footprint. Obviously, in order not to alter the information, these pixels are assigned null values. The result is an increase in the size (height and width) of the input volume by the amount that is lost because of the convolution.

Since the product with zero values is zero, the external regions subject to padding do not contribute to the result of the scalar product. In effect, the convolutional filter is allowed to overflow the edges of the layer, and the scalar product is computed only over cells with values other than 0. This type of padding is called half-padding, since about half of the filter goes beyond the edges when it is placed at the extremes. Half-padding is used to maintain the spatial footprint.

When padding is not used, we simply speak of valid-padding, which in practice does not give good results for the following reason: while with half-padding the cells at the edges contribute to the information, with valid-padding they see fewer passes of the filter and are under-represented. Another form of padding is full-padding, in which the filter is allowed to extend almost completely beyond the layer, overlapping cells containing only zeros. Doing so increases the spatial footprint of the layer, in the same way that valid-padding reduces it.
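The three padding schemes can be compared with a short sketch, assuming stride 1 and an odd filter size F (valid adds no pixels, half adds (F − 1)/2 per side, full adds F − 1 per side; the function name is hypothetical):

```python
def conv_output_size(L, F, padding):
    """Spatial side of the output for the three padding schemes
    discussed above (stride 1, odd filter size F)."""
    pad = {"valid": 0, "half": (F - 1) // 2, "full": F - 1}[padding]
    return L + 2 * pad - F + 1

for mode in ("valid", "half", "full"):
    print(mode, conv_output_size(32, 5, mode))
# valid 28  (footprint shrinks)
# half  32  (footprint preserved)
# full  36  (footprint grows)
```

Half-padding keeps the footprint unchanged, while valid- and full-padding shrink and grow it by the same amount, F − 1.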

A convolutional filter computes the scalar product in every single position of the input layer, but it is also possible to limit the computation to a smaller number of positions by using a stride S. The convolution is then applied at positions 1, S + 1, 2S + 1, and so on, along both dimensions. It follows that the stride reduces each spatial dimension by a factor of about 1/S, and thus the area by a factor of about 1/S². Generally, values limited to 1 or 2 are used, while larger strides aim to reduce the memory requirement. Through the stride it is possible to capture complex patterns in large portions of the image, with results similar to those produced by max pooling.
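The effect of the stride on the output size can be sketched as follows (helper name hypothetical, no padding assumed):

```python
def strided_output_size(L, F, S):
    """Number of filter positions along one dimension with stride S
    (no padding): positions 1, S+1, 2S+1, ... up to L - F + 1."""
    return (L - F) // S + 1

print(strided_output_size(32, 3, 1))  # 30
print(strided_output_size(32, 3, 2))  # 15 -> roughly a 1/S reduction per side
```

Doubling the stride roughly halves each side of the feature map, hence the 1/S² reduction in area mentioned above.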

In general, the dimensions of the input images are reduced to avoid complications in defining the hyperparameters. As far as convolution is concerned, the number of filters is usually a power of 2 to facilitate computation, the stride is 1 or 2, and the filter size is 3 or 5. Small filters lead to deeper, better-performing networks.

Finally, each convolutional filter is associated with a bias: given a filter p and a layer q, the bias is indicated with b(p, q). The bias is an additive term of the activation map, and its presence increases the number of parameters of each filter by one unit. Like all other parameters, the bias value is learned by backpropagation during the training phase.
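As a minimal sketch of the role of the bias (NumPy used for illustration; the specific sizes are assumptions), a single 3×3×3 filter applied to an RGB patch produces one activation value, and the bias contributes exactly one extra trainable parameter:

```python
import numpy as np

# Hypothetical 3x3x3 filter applied to an RGB patch: the bias b(p, q)
# is a single trainable scalar added to every entry of the activation map.
rng = np.random.default_rng(1)
F, depth = 3, 3
weights = rng.standard_normal((F, F, depth))
bias = 0.5

patch = rng.standard_normal((F, F, depth))
activation = np.sum(weights * patch) + bias  # scalar product plus bias

# One filter contributes F*F*depth weights plus one bias parameter.
num_params = weights.size + 1
print(num_params)  # 28
```

During training, `bias` is updated by backpropagation exactly like the entries of `weights`.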