1. Introduction
Some problems in the field of pattern recognition have been successfully solved. Commercial systems for speech recognition, image recognition, and automatic text analysis are well known. The degree of success in solving these problems depends on how completely the subject area can be formally described [1]. Face image recognition has been solved only as an isolated task. The problem of identifying grammar and syntax errors is more complex. Recognition of images and scenes, dictation of texts from a microphone, and automatic classification of text remain unsolved tasks; existing systems only demonstrate their level of complexity. The difficulties that arise in solving these problems lie in the synchronization of the analyzed information, which leads to the formation of a large number of hypotheses. When a large amount of information must be processed and synchronized, verifying these hypotheses becomes a non-trivial task that also remains unsolved within the framework of the applied methods. Currently, the complexity of the methods for representing semantic and pragmatic information by both metalinguistic and figurative means practically rules out their effective use for solving these problems. Within the scientific direction of artificial intelligence, numerous attempts have been and are being made to use semantic and pragmatic information, mainly to solve the problem of human–machine communication in natural language. The works of Alkon, Alwang, and Bengio [2,3,4] are widely known. Their success is due to the fact that the semantic picture is replaced by the rigid structure of a relational database, from which natural language interpretations are generated and attempts are made to interpret statements in terms of concepts. However, these interpretations are highly ambiguous because of the inaccuracy of the underlying language models, and it is not possible to form such a model automatically from texts alone. Less well known are approaches that use semantic information directly. The Quasi-Zo image recognition system [4,5] uses a world model to analyze scenes in which individual objects are represented by generalized geometric shapes such as balls and cylinders. With the help of this model, objects in the scene are represented, segmented, and identified, and then described in metalinguistic terms, together with the relations between them and their dynamics. However, all of these steps are processed separately.
The development of methods for representing information at the semantic and pragmatic levels (equally convenient for both linguistic and image recognition tasks) is a key point in improving both the quality and the functionality of these systems and in the transition to the next stage of development of intelligent systems (IS), i.e., the stage of creating integrated multimodal systems for information processing and storage. The existence of these tasks motivates the search for new approaches to representing and processing information from different modalities (verbal, visual, and supermodal, i.e., semantic and pragmatic) in synchrony [6].
Introducing knowledge into artificial IS is effective not because individual intellectual functions are modeled but because the computing environment in which entire tasks are solved is modeled. Intelligent systems are those that perform intellectual functions within the framework of cognitive behavior: perception, learning, formation of thinking patterns (using a pattern to solve current problems), problem solving, prediction, decision making, linguistic behavior, etc. Therefore, IS include natural language processing systems, word processing systems, and automated systems. Existing systems can be divided into two classes: single-level systems that recognize speech events using one-way or modified Bayesian rules (implemented on a neural network) and synchronous processing systems that use empirical linguistic rules [7].
For example, after an acoustic speech signal is fed into the system, it is digitized, cleaned of noise, normalized in amplitude, and freed from redundant information. Its segments are then compared with the reference patterns for each level that were formed during the training stage.
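As a minimal illustration of this preprocessing stage (not tied to any specific system), the amplitude normalization and segmentation steps can be sketched in Python; the frame length and function names here are our own assumptions:

import numpy as np

def preprocess(signal, frame_len=256):
    """Normalize a digitized speech signal by amplitude and split it into frames."""
    # Remove the DC offset and normalize by peak amplitude.
    signal = signal - np.mean(signal)
    peak = np.max(np.abs(signal))
    if peak > 0:
        signal = signal / peak
    # Trim to a whole number of frames and reshape into (n_frames, frame_len).
    n_frames = len(signal) // frame_len
    return signal[:n_frames * frame_len].reshape(n_frames, frame_len)

Each resulting frame can then be compared against the per-level reference patterns (e.g., by a distance measure) formed during the training stage.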
In the case of simple recognition problems, commands with a limited vocabulary and single-level statistical approaches are most often used. To solve more complicated problems, such as finding keywords in a stream of continuous speech, the structural approach has to use information from all levels of language, from morphology to syntax, as well as extralinguistic information such as semantics and pragmatics. The complexity of building speech recognition systems comes from the fact that a large amount of information with different internal structures, processed by different algorithms, must be combined into a single whole. In addition, practical solutions to the speech recognition problem encounter a psychological barrier: a person expects the same possibilities in communication with speech recognition systems as in communication with another person. Solving the latter task involves recreating, as far as possible, the way a person processes and presents the information at their disposal. This means that, in addition to integrating linguistic and extralinguistic sources of knowledge at different levels, it is necessary to integrate information processing subsystems from other modalities, primarily visual. Effective synchronization and integration of a large amount of disparate information become possible when three problems are solved. First, it is necessary to use the same algorithms to process information with different structures. Second, it is desirable to implement these algorithms with specialized equipment (directed precisely at these algorithms) instead of universal processing means. Third, it is necessary to implement an associative means of synchronizing information.
Analysis of existing systems showed that, as in speech recognition, two main approaches are used to solve the image recognition problem: geometric and linguistic. Image recognition has its own difficulties in sorting through large quantities of information because of the considerable computational cost involved. IS that work well can only be built if they are synchronized and highly resilient. They can be represented as a set of rules or, when the information is stored in a database, as a declarative representation of knowledge. Solving the problem of integrating information from different modalities would allow us to escape this vicious circle [8,9,10].
The aim of this work is to identify effective ways of synchronizing multilevel structured information from different modalities (images, speech, and text), which allows the structure of information to be reproduced naturally, as it occurs in the human brain. The processing optimization methods should make it possible to model a stable, sustainable process. For this purpose, neural network representation and processing of different modalities can be used. It is also necessary to develop a method and algorithm for training neural networks for robustness and synchronization.
The current paper contains five sections in which we describe the use of CGNN training to analyze the time sequences to which speech and textual information are reduced, resulting in stability for dynamic information generation. Then, a description of the algorithm and an example implementation of a stability model in a CGNN are presented.
2. Cohen–Grossberg Network Training for Different Modalities
A review of existing ANN-based solutions shows that they can be divided into two types: static and dynamic systems. Classical networks with neuron-like elements can solve the problem of recognizing spatial images and speech characteristics. Dynamic images and speech can also be recognized using networks with delay elements and dynamic neural networks. In this case, special techniques are used to take into account the temporal organization of the information.
An ANN that takes dynamic temporal information into account, such as a Cohen–Grossberg network, can be used to analyze the temporal sequences to which the representation of both speech and visual/textual information is reduced.
The variation of an ANN presented by Cohen and Grossberg in [11] exhibits self-organization, competitiveness, and other properties. Grossberg designed a continuous-time competitive network based on the human visual system. His work is characterized by the use of nonlinear mathematics to model specific functions. The topics of their papers include specific areas, such as how competitive networks can enhance recognizable information in vision, and their work is characterized by a high level of mathematical complexity [12,13].
In order to take the temporal structure of the information into account, a special technique is used. Information is fed in with delays through additional network inputs, and the network is expected to produce the highest output for the stored pattern closest to the delayed input sequence. In this case, the network begins to take into account the temporal context of the input, and dynamic images are formed automatically.
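A minimal Python sketch of this delay technique is given below, assuming NumPy and an arbitrary delay depth of our own choosing; each network input is the current sample stacked with its delayed copies, so the temporal context enters the network explicitly:

import numpy as np

def delay_embed(sequence, n_delays):
    """Stack each sample with n_delays delayed copies to form network inputs."""
    T = len(sequence)
    rows = []
    for t in range(n_delays, T):
        # Input at time t: [x(t), x(t-1), ..., x(t-n_delays)]
        rows.append(sequence[t - n_delays:t + 1][::-1])
    return np.array(rows)

x = np.sin(np.linspace(0, 6 * np.pi, 100))   # toy temporal sequence
inputs = delay_embed(x, n_delays=4)          # shape: (96, 5)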
We propose the use of the learning law for the adaptive weights in the Grossberg network, which Grossberg calls long-term memory (LTM) because the rows of $W^2$ represent patterns that have been stored and can be recognized by the network. The stored pattern that is closest to the input produces the highest output in the second layer.
One law of learning for $W^2$ is given by:

$$\frac{dw^2_{i,j}(t)}{dt} = \alpha\left\{-w^2_{i,j}(t) + n^2_i(t)\,n^1_j(t)\right\} \quad (1)$$

where $\alpha$ is the learning rate coefficient, $n^1_j(t)$ and $n^2_i(t)$ are the outputs of the first and second layers, and $t$ is the time variable. The equation contains a passive decay term (the first term in the brackets on the right) and a Hebbian-like learning term (the second term). Combined, these terms implement Hebb's rule with decay.
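For illustration, Equation (1) can be integrated with a simple Euler scheme; the step size, learning rate, and layer activities below are illustrative assumptions, not values from any specific experiment:

import numpy as np

def ltm_update(W, n1, n2, alpha=1.0, dt=0.01):
    """One Euler step of dW/dt = alpha * (-W + n2 * n1^T) (Hebb's rule with decay)."""
    # Passive decay (-W) plus a Hebbian-like product of post- and presynaptic outputs.
    dW = alpha * (-W + np.outer(n2, n1))
    return W + dt * dW

W = np.zeros((2, 2))                 # adaptive weights (LTM)
n1 = np.array([0.2, 0.8])            # first-layer (normalized) output
n2 = np.array([0.1, 0.9])            # second-layer output
for _ in range(1000):                # with constant activity, W converges to outer(n2, n1)
    W = ltm_update(W, n1, n2)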
The first-layer equation normalizes the strength of an input pattern after receiving external inputs. The input vector ($p$) is used to calculate the excitatory and inhibitory inputs. It takes the shape of

$$\varepsilon\,\frac{dn^1(t)}{dt} = -n^1(t) + \left({}^{+}b^1 - n^1(t)\right)\left[{}^{+}W^1\right]p - \left(n^1(t) + {}^{-}b^1\right)\left[{}^{-}W^1\right]p \quad (2)$$

where $\varepsilon$ determines the speed of response, $p$ is the input vector, $t$ is the time variable, and ${}^{-}b^1$ is the inhibitory bias.
Equation (2) is an intriguing shunting model with the following inhibitory input:

$$\left[{}^{-}W^1 p\right]_i = \sum_{j \neq i} p_j \quad (3)$$

The sum of all elements of the input vector, apart from the $i$th element, therefore constitutes the inhibitory input to the $i$th neuron.
The on-center/off-surround pattern is created by the two matrices ${}^{+}W^1$ (ones on the diagonal, zeros elsewhere) and ${}^{-}W^1$ (zeros on the diagonal, ones elsewhere): the inhibitory input, which shuts the neuron off, comes from the elements of the input vector other than the $i$th, while the excitatory input comes from the $i$th element of the input vector, centered at the same point. The input pattern is normalized by this style of connection pattern.
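For concreteness, the two connection matrices and the resulting excitatory and inhibitory inputs can be written out directly; the layer size and input vector below are illustrative:

import numpy as np

S = 4                                   # number of neurons (illustrative)
W_plus = np.eye(S)                      # on-center: ones on the diagonal
W_minus = np.ones((S, S)) - np.eye(S)   # off-surround: ones everywhere else

p = np.array([1.0, 0.5, 0.25, 0.25])    # example input vector
excitatory = W_plus @ p                 # i-th neuron is excited by p_i only
inhibitory = W_minus @ p                # i-th neuron is inhibited by sum of p_j, j != i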
The lower bound of the shunting model is set to zero by setting the inhibitory bias (${}^{-}b^1$) to zero for the sake of simplicity. Moreover, all components of the excitatory bias (${}^{+}b^1$) are set uniformly:

$${}^{+}b^1_i = {}^{+}b^1 \quad \text{for all } i \quad (4)$$
As a result, the upper bound for each neuron is the same. Let us examine the first layer's normalizing effect, where the $i$th neuron's response has the following form:

$$\varepsilon\,\frac{dn^1_i(t)}{dt} = -n^1_i(t) + \left({}^{+}b^1 - n^1_i(t)\right)p_i - n^1_i(t)\sum_{j \neq i} p_j \quad (5)$$

At steady state, $dn^1_i(t)/dt = 0$ gives us:

$$0 = -n^1_i + \left({}^{+}b^1 - n^1_i\right)p_i - n^1_i\sum_{j \neq i} p_j \quad (6)$$
Solving for the neuron's steady-state output, the outcome is:

$$n^1_i = \frac{{}^{+}b^1 p_i}{1 + \sum_{j} p_j} \quad (7)$$
The relative intensity of the $i$th input is defined as follows:

$$\bar{p}_i = \frac{p_i}{P}, \qquad P = \sum_{j} p_j \quad (8)$$
The steady-state activity of the neurons then takes the following form:

$$n^1_i = \left(\frac{{}^{+}b^1 P}{1 + P}\right)\bar{p}_i \quad (9)$$
Hence, regardless of the size of the overall input ($P$), $n^1_i$ is always proportional to the relative intensity ($\bar{p}_i$). Moreover, the total activity of the layer is bounded:

$$\sum_{i} n^1_i = \frac{{}^{+}b^1 P}{1 + P} \le {}^{+}b^1 \quad (10)$$
The input vector is normalized so that the relative intensities of its separate components are maintained while the overall activity is reduced to less than ${}^{+}b^1$. As a result, rather than encoding the instantaneous variations in the total input activity ($P$), the first layer's outputs ($n^1_i$) encode the relative input intensities ($\bar{p}_i$). This outcome is the result of the shunting model's nonlinear gain control and the on-center/off-surround coupling of the inputs.
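The normalizing effect of Equation (9) is easy to verify numerically; the bias value and input vectors below are illustrative. Two inputs with the same relative intensities but totals differing by a factor of ten yield outputs with identical relative intensities and bounded total activity:

import numpy as np

b_plus = 1.0                                  # uniform excitatory bias (+b1)

def layer1_steady_state(p):
    """Steady-state first-layer output: n_i = +b1 * p_i / (1 + P)."""
    P = p.sum()
    return b_plus * p / (1.0 + P)

weak = np.array([0.2, 0.8])
strong = 10.0 * weak                          # same relative intensities, 10x total
n_weak = layer1_steady_state(weak)
n_strong = layer1_steady_state(strong)

# Both outputs are proportional to the relative intensities [0.2, 0.8],
# and each total activity stays below +b1 = 1.0.
print(n_weak / n_weak.sum(), n_strong / n_strong.sum())   # both ~[0.2, 0.8]
print(n_weak.sum(), n_strong.sum())                       # 0.5 and ~0.91, both < 1.0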
The first layer of the Grossberg network thus explains both the consistency of the processed information and the dynamic properties of the visual system: the network responds to relative, not absolute, picture intensities.
The second layer of the Grossberg network, a continuous-time competitive layer, serves a number of purposes. First, the total activity in the layer is normalized, just as in the first layer. Second, the layer contrast-enhances the detected pattern, making it more likely that the neuron with the greatest input also produces the strongest response. Lastly, it stores the amplified pattern, acting as short-term memory (STM).
The presence of feedback in the second layer is the primary distinction between the two layers. It enables the network to retain a pattern even when the input is no longer present. The layer also engages in competition, which amplifies the recognizable information in the pattern.
The equation for the second layer takes the following form:

$$\varepsilon\,\frac{dn^2(t)}{dt} = -n^2(t) + \left({}^{+}b^2 - n^2(t)\right)\left\{\left[{}^{+}W^2\right]f^2\!\left(n^2(t)\right) + W^2 a^1\right\} - \left(n^2(t) + {}^{-}b^2\right)\left[{}^{-}W^2\right]f^2\!\left(n^2(t)\right) \quad (11)$$

This is a shunting model with an excitatory input of $\{[{}^{+}W^2]f^2(n^2(t)) + W^2 a^1\}$, in which the on-center feedback is expressed as $[{}^{+}W^2]f^2(n^2(t))$, and the adaptive weights $W^2$, similar to those in a Kohonen network, multiply the output of the first layer ($a^1$). Following training, the rows of $W^2$ indicate prototype patterns. The inhibitory input to the shunting model is the off-surround feedback $[{}^{-}W^2]f^2(n^2(t))$.
The following example of a network with two neurons can be considered to demonstrate the impact of the second layer of a Grossberg network (with ${}^{-}b^2 = 0$):

$$\varepsilon\,\frac{dn^2_1(t)}{dt} = -n^2_1(t) + \left({}^{+}b^2 - n^2_1(t)\right)\left\{f^2\!\left(n^2_1(t)\right) + \left({}_{1}w^2\right)^{T} a^1\right\} - n^2_1(t)\,f^2\!\left(n^2_2(t)\right) \quad (12)$$

and

$$\varepsilon\,\frac{dn^2_2(t)}{dt} = -n^2_2(t) + \left({}^{+}b^2 - n^2_2(t)\right)\left\{f^2\!\left(n^2_2(t)\right) + \left({}_{2}w^2\right)^{T} a^1\right\} - n^2_2(t)\,f^2\!\left(n^2_1(t)\right) \quad (13)$$

where ${}_{1}w^2$ and ${}_{2}w^2$ are the rows of $W^2$.
)) and the output of the first layer serve as the internal multipliers for the second layer (normalized input model). The prototype model that is most similar to the input model has the highest internal multiplier. The second layer then engages in competition among neurons, which has the effect of supporting large outputs while attenuating small outputs, thereby tending to improve the output pattern. Competition in a Grossberg network preserves large values while reducing small values, yet it need not necessarily reduce all small values to zero. The activation function controls how much recognizable information is amplified [
11].
Two key characteristics should be mentioned. First, some enhancement of the pattern occurs even before the input is removed. The second layer's inputs satisfy:

$$\frac{\left({}_{2}w^2\right)^{T} a^1}{\left({}_{1}w^2\right)^{T} a^1} = 1.5 \quad (14)$$

As a result, the input to the second neuron is 1.5 times that of the first neuron. However, after a quarter of a second, the second neuron's output surpasses that of the first neuron by a factor of 6.34.
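This competitive behavior can be reproduced with a short simulation. The parameter values below ($\varepsilon$, ${}^{+}b^2$, the weight matrix, the input, and the faster-than-linear transfer function) are our own assumptions for a plausible instance of such an example, not values confirmed by the source; what the sketch demonstrates is the qualitative effect described above, namely that a modest input advantage is strongly amplified by the second layer:

import numpy as np

eps, b_plus = 0.1, 1.0
W2 = np.array([[0.9, 0.45],
               [0.45, 0.9]])                  # prototype patterns in the rows
a1 = np.array([0.2, 0.8])                     # normalized first-layer output
f2 = lambda n: 10.0 * n**2 / (1.0 + n**2)     # assumed faster-than-linear nonlinearity

n2 = np.zeros(2)
dt = 1e-4
inp = W2 @ a1                                 # [0.54, 0.81]: input ratio of 1.5
for _ in range(int(0.25 / dt)):               # integrate for 0.25 s
    fb = f2(n2)
    # On-center feedback excites each neuron; off-surround feedback inhibits it.
    dn = -n2 + (b_plus - n2) * (fb + inp) - n2 * (fb.sum() - fb)
    n2 = n2 + (dt / eps) * dn

print(inp[1] / inp[0])    # 1.5: modest input advantage
print(n2[1] / n2[0])      # much larger output ratio after the competition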
The network further enhances and stores the pattern once the input is set to zero, which is the second distinguishing feature of the response. Even after the input is stopped, the output continues; Grossberg [11] refers to this tendency as reverberation. The nonlinear feedback enables the network to store the pattern, and the on-center/off-surround pattern of the connections, determined by ${}^{+}W^2$ and ${}^{-}W^2$, produces the contrast enhancement.
Both layers of the Grossberg network are assumed to use an on-center/off-surround structure [14]. Other connection patterns can be used in various applications. The directed receptive field has been suggested as a structure to implement this technique [15]. The “on” (excitatory) connections for this structure originate from one side of the field, whereas the “off” (inhibitory) connections originate from the other side of the field.
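A directed receptive field can be sketched as a connection vector whose excitatory weights lie on one side and inhibitory weights on the other; the field width and input below are illustrative assumptions:

import numpy as np

width = 7                               # receptive field width (illustrative)
half = width // 2
# "On" (excitatory) connections on the left, "off" (inhibitory) on the right.
field = np.concatenate([np.ones(half), [0.0], -np.ones(half)])

p = np.random.rand(32)                  # a one-dimensional input pattern
# Sliding the field across the input makes it respond to directed intensity edges.
response = np.convolve(p, field[::-1], mode="valid")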
When $n^2_i(t)$ is not active, it is feasible to disable learning in specific circumstances. The training equation in this case has the following form:

$$\frac{dw^2_{i,j}(t)}{dt} = \alpha\, n^2_i(t)\left\{-w^2_{i,j}(t) + n^1_j(t)\right\} \quad (15)$$

which is expressed in the form of a vector as

$$\frac{d\left({}_{i}w^2(t)\right)}{dt} = \alpha\, n^2_i(t)\left\{-{}_{i}w^2(t) + n^1(t)\right\} \quad (16)$$

where ${}_{i}w^2$ is a vector composed of the elements of the $i$th row of $W^2$.
Learning is only possible when $n^2_i(t)$ is nonzero, since the terms on the right-hand side of Equation (15) are multiplied by $n^2_i(t)$. This is a continuous-time implementation of the instar learning rule. The topology and structure of the data being transformed are preserved because of this learning law for the adaptive weights: similar fragments are transformed along the same trajectory, whereas distinct fragments are transformed along different paths. In this scenario, the network starts to consider the temporal context of the input. It is then possible to automatically create reference patterns for dynamic images [12,13].
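Finally, the gated (instar-style) form of the learning law in Equations (15) and (16) can be sketched as follows; only the rows whose second-layer neuron is active are updated, and each active row moves toward the current first-layer output. The activities and step size are illustrative assumptions:

import numpy as np

def gated_ltm_update(W, n1, n2, alpha=1.0, dt=0.01):
    """Euler step of d(w_i)/dt = alpha * n2_i * (-w_i + n1) for every row i."""
    # Rows with n2_i = 0 are left untouched: no activity, no learning.
    return W + dt * alpha * n2[:, None] * (n1[None, :] - W)

W = np.array([[0.9, 0.45],
              [0.45, 0.9]])
n1 = np.array([0.2, 0.8])          # normalized input pattern
n2 = np.array([0.0, 1.0])          # only the winning neuron is active
for _ in range(2000):              # the winning row converges to n1
    W = gated_ltm_update(W, n1, n2)
print(W)                           # row 0 unchanged; row 1 approaches [0.2, 0.8]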