1. Introduction
In this paper, we propose the use of a Discrete Wavelet transform as the convolution operation in a convolutional neural network [1,2], revised in [3]. In recent years, convolutional neural networks (CNNs) have found multiple applications in different areas, mainly in image recognition and feature separation in an information cube, owing to their excellent classification capacity, the flexibility of their algorithm [2], and their ability to be adapted to different problems in various areas [4], such as feature analysis [5]. The design of different CNN architectures [6,7] has progressed both to improve classification certainty and to save computing resources. To enhance these architectures, different hybrid CNN algorithms have been proposed [8,9] that apply to various problems and present advantages over traditional CNNs [10] by being able to process information on multiple scales and across various measurements [4], as presented in [11]. Hybrid CNNs are a combinatorial strategy that uses other algorithms to improve several results, such as accuracy. In particular, the multi-resolution strategy allows one to separate the multifrequency response, a typical situation within seismic signatures such as those presented in [12,13]. In that respect, several procedures from geophysics studies may be followed. Nevertheless, this particular approach of highlighting several features independently of the frequency responses, as well as reclassifying this behavior over the rest of the geophysical characteristics, is considered quite valuable by these authors as well.
The information flow of a traditional CNN is redefined by incorporating the Discrete Wavelet transform [14,15] into the design of its architecture, providing an alternative to the convolution of a CNN; the advantages and disadvantages of this proposal are then analyzed. The contribution of the Wavelet transform to the use of a CNN is known from different works [6,16,17,18]. However, it is commonly used to process the data before they enter the CNN [17,19].
Unlike these approaches, a time-frequency decomposition method is proposed [9,20,21] that seeks to recover as much information as possible and then organize it into patterns that allow specific characteristics to be categorized with better sensitivity than a traditional CNN [22,23]. It should be noted that the approach followed in this paper incorporates the Pooling algorithm as a matrix-based information compaction strategy, achieving the representation of relevant information between neighboring vectors.
This method aims to extract features of specific interest in a data interval scaled in power, time, and frequency, containing multidimensional data, using a proximity map between them. The assertiveness rate defined by the user is of particular interest. A Discrete Wavelet transform explores a focused frequency response in terms of several characteristics that may be highlighted later.
In the context of this work, it is interesting to point out the number of variables (Table 1), both global and local, which play an essential role in the decision-making for the construction of the method, making this composition of local algorithms a global strategy that can be optimized under multiple metrics, such as processing speed or assertiveness in the extraction of features, as well as other figures of merit that are analyzed at the end of this work.
2. The Proposed Methodology
The structure of a CNN follows different steps throughout the data classification process [22,23,24], which can be alternated and applied several times to achieve a more complex structure according to the needs of the problem studied [3,16,25]. This section briefly explains the proposed method, which differs from a traditional CNN and includes the following steps (Figure 1).
A set of data (signals, images, information in various dimensions, etc.) must be considered homogeneously: vectors with the same length and representing the same phenomenon give an information cube. This is equivalent to saying that the vectors must belong to the same vector space. In this sense, a set of data must be built that, on the one hand, is sensitive to the changes observed in the phenomenon (diversity of information intrinsic to its entropy) and, on the other hand, allows the classification of stably diverse patterns formed in different situations [26,27]. The data are intended to allow for the determination of characteristics independently of the scale [10,28,29] (Figure 2).
For this purpose, a sliding window is constituted over the input data, with a shift that allows the extraction of local characteristics typical of nonpersistent behaviors in a focused way. It is interesting to note that comparing windows with standard information provides a better convolution strategy, which is primarily based on a discrete Mother Wavelet strategy such as the Morlet wavelet. Given the latter, it is of interest to generate such pre-processing.
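As a rough illustration of this windowing step, the following Python sketch extracts overlapping windows from a single trace; the window length, shift, and trace length are illustrative values only, not the parameters fixed in this work.

```python
# Minimal sketch of the sliding-window extraction described above.
# window_length and shift are illustrative parameters.
import numpy as np

def sliding_windows(signal, window_length, shift):
    """Return overlapping windows of a 1-D signal as rows of a matrix."""
    n_windows = (len(signal) - window_length) // shift + 1
    return np.array([signal[k * shift : k * shift + window_length]
                     for k in range(n_windows)])

trace = np.random.randn(600)              # one vector of the information cube
windows = sliding_windows(trace, 11, 1)   # 11-point window, unit shift
print(windows.shape)                      # (590, 11)
```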
- B. Convolution
The convolution can be interpreted as a "moving weighted average" where the "weight" is determined by a function g(x). The operation is commutative. Convolution is an operation that integrates several data points from an information source, such as a function f(x), against a given kernel g(x). Since the function f(x) may present significant variations, such as peaks or discontinuities, averaging around each point x tends to decrease these variations, lowering the peaks and smoothing the discontinuities. The use of the DWT instead of learned kernels from a CNN provides the precision to cover all possible ranges of scales, giving the possibility of highlighting features with loss of power. Under the hypothesis that the Discrete Wavelet transform is a convolution in the strict sense, we propose inserting it as the convolution of a convolutional neural network and modifying the characterization procedure by using the Self-Organizing Map (Appendix A).
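To illustrate the idea of replacing a learned kernel with a Wavelet filter, the sketch below applies one DWT level to a windowed vector using PyWavelets; the Daubechies-2 mother wavelet and the window length are assumptions for the example, since the actual mother wavelet is a selectable parameter (Table 1).

```python
# Hedged sketch: one DWT level used in place of a learned convolution kernel.
import numpy as np
import pywt

window = np.random.randn(11)        # one sliding-window vector (illustrative length)
cA, cD = pywt.dwt(window, 'db2')    # approximation and detail coefficients
print(cA.shape, cD.shape)           # both roughly half the input length
```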
- C. Normalization
In this normalization step, the dimension of the vectors does not change compared to the previous step. It is carried out only on the basis of the infinity norm, computed in a global manner and applied to the entire data cube.
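A minimal sketch of this global normalization, assuming the data cube is stored as a NumPy array and that the infinity norm is taken over the entire cube; the dimensions below are illustrative.

```python
# Global infinity-norm normalization of the whole data cube
# (assumed shape: planes x vectors x samples, illustrative values).
import numpy as np

cube = np.random.randn(4, 350, 600)
cube_normalized = cube / np.max(np.abs(cube))   # divide by the global infinity norm
```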
- D. Pooling
The next step is Pooling. In the case of the Wavelet, this can be Pooling or Unpooling, corresponding to data reductions or increases, respectively, in the vectors resulting from convolution [1,3,14]. In the case of Pooling, Max Pooling or Average Pooling can be used. Otherwise, data are added as either repeated values, averages, or zeros.
In the case of the present manuscript, a feature map is built to determine regions of similarity from a Self-Organizing Map.
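The following sketch illustrates Max and Average Pooling over a coefficient matrix; the 2x2 window and stride are illustrative choices, since both are selectable parameters of the method (Table 1).

```python
# Minimal sketch of Max/Average Pooling over a coefficient matrix.
import numpy as np

def pool2d(matrix, window=2, stride=2, mode="max"):
    """Two-dimensional pooling with a square window and fixed stride."""
    rows = (matrix.shape[0] - window) // stride + 1
    cols = (matrix.shape[1] - window) // stride + 1
    out = np.empty((rows, cols))
    for i in range(rows):
        for j in range(cols):
            block = matrix[i * stride:i * stride + window,
                           j * stride:j * stride + window]
            out[i, j] = block.max() if mode == "max" else block.mean()
    return out

coeffs = np.random.randn(6, 8)               # illustrative coefficient matrix
print(pool2d(coeffs, mode="max").shape)      # (3, 4)
```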
- E. Classification
The traditional CNN algorithm establishes a conventional neural network such as the Multilayer Perceptron. The Multilayer Perceptron is a well-explored neural network known for its strategy in both training and classification, where the authors must carefully establish the parameters that determine the depth of the network. The input is the flattened vector. This is where neural network learning takes place, through a feedforward process applying backpropagation (the application of the chain rule) for the training of the network. This amounts to calculating the gradient of the loss function with respect to the weights, as used in gradient descent, which is a time-consuming strategy. In the case of this manuscript, the use of a Self-Organizing Map is established, since a classification is required such that the information contained in the information cube can be distributed within the map in a way that allows highlighting characteristics not defined by the various scales.
Once a feature approximation map has been built, this leads to a multicolor classification printed on the rest of the information cube, once learning has been ensured over the number of established epochs. This classification must be performed using some classification method, such as the Self-Organizing Map already mentioned, which will be discussed in detail in this manuscript. It is interesting to note that classification presented as a multicolor illumination has already been explored by Molino in [20].
It is essential to note that parameters are involved in each step and must be chosen carefully. In the rest of this work, we will provide context for the number of variables to be selected and the effect these variables have on the global result. Their adjustment will be a matter of exploration based on global optimization; it cannot be the result of an approximation without the capacity for revision, rectification, and repetition. It should be noted that this set of steps results from a thorough review of the literature from the different areas described in the Introduction, seeking to give certainty and optimization at multiple scales in an information cube that must be multipurpose.
The idea behind integrating DWT instead of a local group of convolution filters is to cover the whole spectrum of scales by using diverse approximation levels. The final objective is to highlight a specific feature through a convolution strategy; thus, ensuring feature selections through multiscale approximation is a suitable technique for this purpose. It is important to explore this strategy as an integrated aspect of a Self-Organizing Map and a sliding window, as presented in the following section, to highlight how this information is recovered.
3. Structure
Unlike the traditional information processing algorithm based on a data planning structure, an original time–frequency decomposition is proposed, which allows the preprocessing to retain more information regardless of the scales. This proposal contemplates several steps that need to be described in detail in terms of the scalar elements processed and the results of the neural network, which, for our purposes, will be a Self-Organizing Map. Initially, the information is presented by forming various packages at multiple scales from an information cube, as shown in Figure 2. In this sense, the management of the indices allows strict tracking of the geometry of the information, which leads us to provide an ordering of the indexing without losing the relevant characteristics in local terms and their collateral effects on geometrically close regions (Figure 3).
It is essential to establish that the scope of such a learning strategy by the Self-Organizing Map is a distribution of characteristics on a map of K neurons, distributed so as to allow the construction of a stable representation. This is invariant with respect to the type of information to be processed.
As a second stage, the characterization of the rest of the information is based on the trained neurons and the generated map, where it is possible to determine common characteristics in the rest of the maps, which, for this particular example, refer to the planes of the information cube. In Section 4, more details of the case studies are given.
Based on these two stages, a stable notation must be generated that allows us to distinguish the processing of the information. For the case of Figure 2, the depth will be defined as .
The generated notation is as follows, let
where:
a: represents the element of the vector processed by the Wavelet, which we will call approximations;
d: represents the element of the vector processed by the Wavelet, which we will call details;
: represents the plane based on the information cube to be processed, ;
j: initial vector number, ;
i: number of elements in the vector, ;
: number of applied Wavelet;
: level of applied Wavelet;
: original vector data bounded by the sliding window of with .
Input vectors correspond to the input plane with and scalar value .
There are n input vectors of length m. The idea of using Wavelet decomposition levels for feature extraction is that a characteristic present within the analyzed data is highlighted from the rest of the information for a given type of scale. This procedure is repeated for as many levels as are selected, extracting information as an accurate and reproducible mechanism.
This experimental process applies a Discrete Wavelet transform to the initial data, processed after separation through a sliding window. For this example, we will apply it so that, depending on the number of levels, we will have an information distribution that depends on the size of the initial information cube and of the initial sliding window; here, two levels are applied to the initial data, as shown below:
3.1. Convolution: Application of Wavelet
The Discrete Wavelet transform is commonly used to separate approximation and detail coefficients that regulate certain representations at several levels. For example, A represents the approximation vectors and D the detail vectors of the input vectors, following Equation (2), as presented in the following representation. Further decomposition is possible, e.g., at level 2, where the approximation and detail vectors are referred to Vector 1 (Appendix D).
This gives the result that each vector is "decomposed" into several vectors whose lengths decrease at each level; for two levels, this yields the level-1 and level-2 approximation and detail vectors (Figure 4).
3.2. Pooling
In this context, the variation of the characteristics by sliding windows at the beginning of the processing makes it possible to detect high-frequency events. However, this depends exclusively on the initial sampling capacity during data acquisition, in other words, on the information cube. This condition is crucial to define several parameters of this algorithm, listed in Table 1, such as the numerical definition of the sliding windows, the type of mother Wavelet to use, the type of Pooling to use, and the number of neurons, among others.
Concerning the procedure called Pooling, we have one option to handle the sampling, which is part of the conceptual definition of the information cube. In the remainder of the section, we will describe some differences that we consider of interest for the context of the article, concentrating on the processing used in the proposed methodology.
After forming the matrices, Pooling is applied to each matrix, for this case, , , . This manuscript employs Max Pooling for dimensionality reduction.
In this case, two-dimensional Pooling is conducted on the matrices , , and , so that the matrices have the same dimensions; that is, from m x to m x , Pooling is conducted on the larger matrices. Then, matrices of the same size will be obtained; however, different information will be taken for each expression.
Concerning these strategies, we have sought to extract common characteristics allowing determination of the sensitivity of the aggregation, depending on the direction of the search for characteristics. In this sense, the way the information cube is observed based on the indexes (m, n, ) and , plays a fundamental role.
Now, for the specific case of the strategy proposed in this manuscript, we define an approximation based on the Approximation and Detail matrices formed in the previous section (Appendix B).
Concerning the last matrix, we have the following representation that allows the formation of the indexes necessary for the Pooling operation.
For the matrix proposed to generate the Pooling (in our case, Max Pooling was used, this being a variable to be discussed in another research context) based on the Details, we will call it , and for the one related to the Approximations, we will call it . Therefore, the dimensions of both matrices are given by two fundamental parameters, known as , the output columns in the Pooling process, i.e., the size of the window concerning the local columns (it is also the length of the input vector, referring to n in Equation (1)), and , the output rows in the Pooling process corresponding to the local window in terms of the Pooling window (it also expresses the number of input vectors used by the algorithm, m in Equation (1)). Then, there are two matrices with the dimensions , which are based on the following equation:
where:
n: full input length with respect to Equation (1); for the case of m, it is defined by the pool determined by the data division of the predecessor process;
: Pooling window size in terms of columns; in this case, the window in terms of rows is equal;
: Pooling division, called the shift between Pooling windows; for the case of , the same value is proposed.
If we consider and as the limits to be defined, these are given by the number of input vectors in . The number of points is given by the data division coming from the Wavelets; therefore, for the first level it is , while for the second level it is .
Given the information obtained by the proposed window concerning , it is essential to define an interaction relating to the total data in and . We will define this variable as t, which is given by:
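A small bookkeeping sketch of these dimensions, assuming the usual pooled-length relation floor((length - window) / shift) + 1; the symbols for the window size and shift follow Table 1, and the concrete lengths below are illustrative, not those of the case study.

```python
def pooled_length(length, window, shift):
    """Number of Pooling outputs along one axis."""
    return (length - window) // shift + 1

# Illustrative coefficient-vector lengths for two Wavelet levels
n_level1, n_level2 = 301, 152
cols_level1 = pooled_length(n_level1, window=2, shift=2)
cols_level2 = pooled_length(n_level2, window=2, shift=2)
print(cols_level1, cols_level2)
```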
3.3. Classification
It is essential to start from the context of the processing necessary for classification and pattern recognition when working with the information cube and all the proposed decompositions.
Within Machine Learning, we find different types of learning [10]. The one that interests us, due to the nature of some problems that involve the selection of characteristics in a multiscale way, is Unsupervised Learning, where there are no defined labels for the output of the algorithm; instead, the very structure of the data allows learning and classification. This is due to the knowledge representation: new patterns are defined during the Wavelet decomposition and Pooling processing, and these characteristics constantly appear in the self-organizing representation within an unsupervised strategy.
The unsupervised algorithm to be explored is the SOM (Self-Organizing Map), which gives us a representation respecting the multidimensionality of each vector resulting from the Approximations and Details generated by the Wavelet, processed by local selection in the Pooling mechanism, and the decomposition from the sliding window proposed as preprocessing of the information cube. The challenge in the case study is to identify faults within large sections of the stacked information cube, as presented by [13,20].
A Self-Organizing Map (SOM) model has two layers of neurons: the input layer, with (*) neurons, one for each input datum, and the output layer, formed by neurons. The output layer stores the information and forms the feature map. The information is propagated from the initial (input) layer to the final (output) layer. Each neuron i of the initial layer is connected through the weights to each of the neurons * of the final (output) layer. This product will be developed in the remainder of this section.
In this type of neural network, we use competitive learning, which means that the layer of neurons modifies its weights so that these weights become increasingly similar to the input data. The competitive learning concept presents the adaptive weight strategy, allowing one to incorporate knowledge at a specific neuron and the related modification at the surrounding neurons. These weights are called the BMU (best matching unit), which, in our case, will be stored in the matrix . Each neuron in the output layer receives as input the data vector modified by the weight vector . Once this is completed, the procedure consists of comparing the weight vectors with the input data and verifying which is the closest among them using the algebraic expression:
We will concretely develop this equation concerning the work followed in previous sections. One of the fundamental characteristics in generating the map is the approximation of neighboring neurons to the winning neuron, on a two-dimensional basis, which, in our case, is represented by a Gaussian equation, where is the related variance.
Therefore, concerning the modification of weights, seen as a learning stage, a classical weight-update equation is proposed.
where:
is the learning rate or learning factor, which will be a value in the interval [0, 1];
is the weight matrix;
is the training matrix that is the product of the formation of the Approximations and the Details seen in the previous sections;
i, j correspond to the iterations concerning the indices of the initial formation of the information cube based on Equation (1).
Since the multidimensional integration of the information in this processing mechanism is on an information cube that expands into several cubes, depending on the number of levels generated by the wavelets and the decomposition from a local sliding window, it is of interest to track the index as a consequent plane given in , as well as the indices expressed in Equations (A6)–(A8) and (3) (Appendix B and Appendix C). In this way, each data point backed up in and has the following indexation:
where:
This results in the following equation, which is used in the operation of each scalar in
Now, from the operation performed in , the minimum value is sought, where x is the value in i and y is the value in j that corresponds to the minimum in with respect to . Concerning the proximity function for each neuron in , and according to the winning neuron given by the values x, y, it is proposed to use a multidimensional Gaussian, expressed in Equation (8), to determine the proximities according to the differences as follows:
Thus, the following matrix H is generated:
The feedback equation for the weight update is the following, based on the base equation expressed in Equation (10):
Understanding that:
Now, to implement the SOM algorithm, in this case, we build a tuple of values for the dimensions of the cube to be processed and incorporate the data into it for the training process.
where:
is the number of features per observation to be considered.
is the number of planes of interest for training.
is the number of vectors in the region of interest.
is the number of points in each vector in the region of interest.
For the construction of the SOM Analysis in terms of the characterization of features, the following tuple is proposed:
This implementation of the method as a whole is expressed in Appendix D of this manuscript. For this work and regarding the problem being addressed, the parameters taken into account and their variation intervals are:
Method:
The information cube is defined.
The sliding window is defined based on values and .
The weight of each node is initialized to a random value.
The input vector V is chosen.
The Euclidean distance between the input vector V and the weight matrix is found.
The node that produces the smallest distance is found.
For each node of the two-dimensional map, Steps 3 and 4 are repeated.
The best matching unit (BMU), or closest element, is calculated.
The topological neighborhood of the BMU and its radius are found in the map, as expressed in Equation (16).
The neurons are determined.
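The sketch below summarizes these steps as a minimal SOM training loop in Python: random weight initialization, BMU search by Euclidean distance, a Gaussian topological neighborhood, and the weight update with a learning rate. The grid size, learning rate, neighborhood width, and input dimensions are illustrative values, not the tuned parameters of this work.

```python
# Minimal SOM training sketch (Kohonen-style), following the steps above.
import numpy as np

def train_som(data, grid=(6, 6), epochs=50, alpha=0.5, sigma=1.5, seed=0):
    rng = np.random.default_rng(seed)
    n_features = data.shape[1]
    weights = rng.random((grid[0], grid[1], n_features))      # random initialization
    yy, xx = np.mgrid[0:grid[0], 0:grid[1]]                   # neuron grid coordinates
    for _ in range(epochs):
        for v in data:
            dist = np.linalg.norm(weights - v, axis=2)        # Euclidean distance to every neuron
            bx, by = np.unravel_index(np.argmin(dist), grid)  # best matching unit (BMU)
            h = np.exp(-((yy - bx) ** 2 + (xx - by) ** 2)
                       / (2 * sigma ** 2))                    # Gaussian topological neighborhood
            weights += alpha * h[..., None] * (v - weights)   # weight update toward the input
    return weights

vectors = np.random.randn(200, 16)        # pooled feature vectors (illustrative)
W = train_som(vectors)
print(W.shape)                            # (6, 6, 16)
```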
4. Case Study
As mentioned above, the actual analysis will be performed on an available data set consisting of a group of mechanical signals that we will call the seismic cube (Figure 5). For data confidentiality reasons, this set of signals will be treated only as input data with the characteristics of signals in the frequency domain, omitting data such as location, extension, etc. The study was carried out in an area of Mexico that corresponds to an oil field. Through field studies, seismic data were obtained, processed, and converted to a readable format representing a set of vectors and ultrasound signals with 1200 entries. The data set comprises around 1500 planes, with 600 points per vector and 350 vectors per plane.
Data were collected in the field. They were originally arranged three-dimensionally in a cube (seismic cube), in SEG-Y format, one of the various standard formats for this type of geophysical data. The seismic cube is made up of each of the ultrasound signals that travel through the subsurface and are captured at the surface. This process is explained in detail in [20].
These signals are the input data in our model and are convolved using a Wavelet decomposition strategy. This wavelet and the number of signal decomposition levels will be selected as parameters.
The next parameters to be selected correspond to the Pooling, where we choose the size of the Pooling matrix, which helps us reduce the dimensionality of the data after convolution, as well as the step size of this matrix over the data path. This is followed by the use of the Euclidean norm for data normalization.
The point of interest here is how the method classifies this data set according to the architecture's adaptation of the Wavelet. An analysis of this same data set was previously conducted with a SOM without the Pooling and sliding-window processing stages; with this work, however, we seek to contribute to understanding the effects of coupling the Wavelet to a Self-Organizing Map through a sliding window.
In order to specify the experiments and compare them across several parameters, a reference experiment is proposed, as shown in Figure 6. This particular result is presented as Case 1 and is compared to several other experiments by the mean square error (MSE) between weight matrices (Equation (21)). MSE is a common comparison metric between vectors or matrices, since dimensionality is not lost and scaling is not a key factor in comparing values. MSE is used to compare the weight matrices since it is stable for defining the main differences between neurons.
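A minimal sketch of this comparison, assuming both weight matrices come from SOMs of the same grid size; the shapes and values below are illustrative.

```python
# MSE between two SOM weight matrices of the same shape (cf. Equation (21)).
import numpy as np

def mse(weights_base, weights_case):
    return np.mean((weights_base - weights_case) ** 2)

W_base = np.random.randn(6, 6, 16)    # base-case weight matrix (illustrative shape)
W_case = np.random.randn(6, 6, 16)    # another case to compare
print(mse(W_base, W_case))
```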
In this study, our objective is to obtain, through the proposed algorithm, a classification of the data that allows us to identify areas with possible geological and geophysical characteristics suitable for the storage of hydrocarbons, following [20]. These areas are identified through the algorithm and are visible in the graphics obtained after training the neural network and classifying the data.
The first step performed is the identification of the input vectors, , as described in Section 3. Each vector represents data or a signal. The second step is to run the algorithm, obtaining a particular weight matrix considered as a basis, where the rest of the cases are those to be compared and named .
To test this proposal, a case study published in [20] is used, based on a seismic analysis taking partial data called stacked traces, as observed in Figure 5. Therefore, significant effort is necessary to combine all the possible variables (Table 1) to build the best case study from the proposed methodology. Understanding the meaning of correctness is crucial to allow for a proper selection of the most appropriate parameters.
Table 2 shows some of the most sensitive variables to be selected. These are presented in terms of a suitable combination of current results. The size of the sliding window, called , is considered constant for this case study, with a nominal value of 11 points. The mother wavelets are modified according to the parameter in Table 3. In Table 3, the resulting value is the error, which is calculated following Equation (21).
where n is the number of weights from the estimated weight matrix for a particular case (Table 3).
Given the analysis provided by the method presented in Section 2 and Section 3, the following result is presented: the sector to be scrutinized is linked as characteristics in the rest of the observed planes following the information cube. This result is the set of various patterns and is visualized in Figure 6; as in the results presented in [20], an identification of some clusters of characteristics is achieved, shared in such a way that they give us satisfactory results in both the separation and classification of the characteristics. Approximate results have been found by comparison with [30]. In this case, a map of neurons is given, where 36 neurons are built and illuminated through the image according to Figure 6 and the related map in Figure 7. In this initial case, the number of selected neurons is reflected in the numbers shown within each of the related neurons on the given map. The selected neurons are represented in colors within Figure 6. The reader may identify several characteristics that are not highlighted in other cases, according to Table 2. These characteristics correspond to patterns identified within facies in the seismic data.
The following modifications (Table 2) are performed and shown in the next figures for Cases 3, 8, 10, and 11 and the related maps.
In this case, Figure 8 shows the resulting classification of the map based on the selection of a particular characteristic according to Figure 9. However, the reader may note that, although the number of neurons and the number of points are the same as in Case 1, the results are different, since several seismic characteristics are not highlighted.
Now, in Cases 8 and 10, there are similar responses in terms of the characteristics highlighted in Figure 10 and Figure 12, which tend toward a blurred representation in the same state. Figure 11 and Figure 13 show the SOM maps for Cases 8 and 10, respectively. In Case 11, shown in Figure 14 with its corresponding SOM map in Figure 15, although the Daubechies 3 wavelet shows the area of interest, the Daubechies 2 wavelet shows better results.
In this sense, the plane named inline plane 2370 is used to train the algorithm, as the first group of characteristics to be selected and then searched for in the rest of the planes. In this context, it has been possible to categorize stably and consistently from a predetermined frequency-based decomposition, following accurate pattern recognition for feature extraction based on the variable selection.
Alternatively, a comparison between a CNN and the proposal followed in this paper is made by processing the same selected plane as that used previously, as shown in Figure 16. The reader may notice that the image is quite blurred in comparison to Figure 6. The results of the processing of this information cube depict the number of selected neurons, as presented in Figure 17, where the information accumulates in 10 of the 36 generated neurons, allowing fewer features to be depicted.
5. Conclusions
This work presents an algorithm based on three essential pillars in the analysis and extraction of features: the decomposition of information, the expansion through the agreed compression of local information from the convolution, and the construction of maps of said features.
The functional contribution to knowledge is the ability to distinguish features with a high variation cost during a single segment of information calculation. In later applications, processing all the information to distinguish originally highlighted features is unnecessary. The procedure proposed in this paper presents various angles of local optimization from the perspective of global metrics, such as computation speed, which is not the focus of this paper.
This methodology uses maps to efficiently determine features with a particular type of scaling of specific interest in a map analysis. This particular interest is given by the rate of assertiveness designed by the user. In the case studies presented in this manuscript, the rate of reasonable coincidence is based on previous analysis. The Wavelet is used as the multidimensional convolution process, the Pooling process is used to determine the most significant effect between neighboring data, and the construction of a Self-Organizing Map is used to highlight differences through a multidimensional analysis.