Article

Application of Inverse Optimization Algorithms in Neural Network Models for Short-Term Stock Price Forecasting

Image Processing and Artificial Intelligence Laboratory, Tomsk State University of Control Systems and Radioelectronics, Lenina Str., Tomsk 634050, Russia
* Author to whom correspondence should be addressed.
Big Data Cogn. Comput. 2025, 9(9), 235; https://doi.org/10.3390/bdcc9090235
Submission received: 7 August 2025 / Revised: 1 September 2025 / Accepted: 5 September 2025 / Published: 9 September 2025

Abstract

This paper introduces novel inverse optimization algorithms (RC and DC) for neural network training in stock price forecasting, aiming to overcome the tendency of traditional gradient descent to converge to local minima. The key novelty is a stochastic algorithm for inverse problems adapted to neural network training, where target function values decrease iteratively through selective weight modification. Experimental analysis used closing price data from 40 Russian companies, comparing traditional activation functions (linear, sigmoid, tanh) with specialized functions (sincos, cloglogm, mish) across perceptrons and single-hidden-layer networks. Key findings show the superiority of the DC method for single-layer networks, while RC proves most effective for hidden-layer networks. The linear activation function with the RC algorithm delivered optimal results in most experiments, challenging conventional nonlinear activation preferences. The optimal architecture, namely, a single hidden layer with two neurons, achieved the best prediction accuracy in 70% of cases. The research confirms that inverse optimization algorithms can provide higher training efficiency than classical gradient methods, offering practical improvements for financial forecasting.

1. Introduction

The stock market serves as an indicator of a country’s economic health and plays a crucial role in attracting and redistributing capital through the issuance of securities. In addition, it is one of the primary instruments for investment and preservation of public savings [1]. In this context, analyzing the dynamics of securities, particularly stocks, is of interest to many market participants and enables more effective managerial and investment decision making. One of the main analytical tools is technical analysis, which, unlike fundamental analysis, assumes that all relevant factors are already incorporated into historical prices. Consequently, only past data need to be analyzed [2].
The task of forecasting prices for financial instruments, as a specific case of time series analysis, is one of the key challenges in the field of analytics and applied machine learning, as it involves the study of complex dynamic systems. The data used to describe the behavior of such systems are characterized by nonlinearity, a stochastic nature, and high levels of noise. The analysis of these data must rely on identifying hidden patterns under conditions of volatility. Statistical methods, such as Autoregressive Integrated Moving Average or Generalized Autoregressive Conditional Heteroskedasticity (GARCH), demonstrate insufficient effectiveness under these conditions, especially for short-term forecasting.
Neural networks demonstrate the ability to identify nonlinear dependencies and hidden relationships. Activation functions and learning algorithms are key components of neural networks that determine training efficiency and, consequently, the accuracy of predictive values. Currently, classical learning algorithms based on gradient descent are predominantly used. However, these methods suffer from several critical weaknesses. Specifically, they tend to become trapped in local minima [3], exhibit sensitivity to initial parameter values, and show poor convergence performance when dealing with data containing periodic components—a characteristic feature of financial time series. These fundamental limitations of existing approaches motivated us to develop and investigate novel approaches to neural network training.
The aim of this research is to develop inverse optimization algorithms for neural network training and to conduct a comparative analysis of traditional and inverse training methods regarding their impact on prediction accuracy. The study addresses the efficiency of neural network models using various activation functions and training algorithms based on Russian stock market data. The research will test the following primary hypothesis: inverse optimization algorithms can provide higher neural network training efficiency compared with classical gradient-based approaches.
The significance of this research is driven by two key factors. First, there is a pressing need to develop neural network training methods that enhance forecasting accuracy in the investment domain, thereby improving the reliability of investment decision making processes. Second, inverse optimization algorithms remain insufficiently explored in the context of time series prediction tasks, creating a notable research gap. The scientific novelty of this work lies in the development of neural network training algorithms founded on an inverse approach.
This study comprises the following sections: a literature review, a description of the developed algorithm, experimental design and results analysis, and conclusions and implications.

2. Materials and Methods

2.1. Literature Review

Time series analysis occupies a prominent position among methods for forecasting financial asset dynamics. Statistical models have long served as the primary tools employed for this purpose. Among various statistical approaches, linear and nonlinear regression models, autoregressive models, and moving average models are particularly noteworthy [4]. Econometric approaches, such as the Autoregressive Conditional Heteroskedasticity (ARCH) model proposed in [5], along with its modification—the GARCH model introduced in [6]—have become fundamental frameworks for modeling temporal variability and volatility clustering in financial asset returns. Subsequently, several GARCH modifications were developed to better adapt to stock market volatility, including Integrated GARCH [7], Exponential GARCH [8], Glosten–Jagannathan–Runkle GARCH [9], and Threshold GARCH [10].
Despite the success of traditional econometric models, they face a number of challenges, including limited flexibility in modeling complex nonlinear processes and insufficient accuracy under certain market conditions. Their parametric nature and fixed functional form restrict their ability to adapt to the complex nonlinear dependencies observed in modern stock markets, particularly when dealing with high-frequency and large-scale data.
This is precisely why more advanced machine learning methods are employed for such tasks, among which neural network models are some of the most popular [11,12]. In particular, deep learning models have been successfully applied to forecasting stock quotes and share prices based on fundamental and technical analysis [13,14,15].
However, improving the efficiency of neural network models remains a relevant challenge. To address this, researchers are developing new types of networks (recurrent, transformers, hybrid), optimizing network architecture (number and configuration of layers, neuron count), and designing innovative training algorithms and activation functions.
The type of activation function affects gradient stability, training efficiency, and the accuracy of output values. Traditional activation functions (linear, sigmoid, tanh) are characterized by implementation simplicity; however, they have significant limitations, including the vanishing gradient problem and insufficient effectiveness in modeling temporal dependencies [3,16,17]. Modern specialized activation functions (such as mish) have demonstrated high efficiency in specific applications across multiple studies [18,19,20] and have sparked interest among researchers and practitioners regarding their potential for advancing deep learning methodologies [21,22], including practical applications in stock market dynamics prediction [15].
One of the challenges researchers face when training neural networks is optimizing the loss function, considering its nonlinear, multi-extremal nature and the large number of parameters to be optimized [23,24,25]. Gradient descent with gradients calculated using the backpropagation algorithm is typically employed for this purpose [26]. Because standard gradient descent does not always ensure good convergence, researchers have developed and implemented adaptive optimization methods, including Adaptive Gradient Algorithm [26], Adaptive Delta [27], Root Mean Square Propagation [28], Adaptive Moment Estimation (Adam) [29], Adam with Maximum Norm [30], Nesterov-accelerated Adaptive Moment Estimation [31], and their modifications. Despite the popularity of these methods, researchers have noted several limitations: sensitivity to initial values and tendency to converge to local minima [32]. In some cases, the use of gradient methods may not lead to improvements in error function values, resulting in optimization failure. Particularly, optimization difficulties are observed when data contain periodic components [33], which may characterize processes involving various types of fluctuations in the stock market. These fundamental weaknesses of existing approaches represent barriers to achieving reliable financial forecasting accuracy and highlight the need for alternative optimization strategies that can overcome these inherent limitations.
To address the aforementioned limitations, researchers have been investigating various metaheuristic algorithms, which are classified according to the nature of the processes they simulate. The classification includes evolutionary methods [34]; swarm-based algorithms [35]; approaches derived from mathematics, chemistry, and physics [36]; as well as techniques inspired by human behavior [37]. The primary advantage of metaheuristic algorithms for neural network optimization is that the objective function does not necessarily need to be differentiable or continuous, and their gradient-free search process can effectively bypass local minima. However, the disadvantages of such methods include high computational resource requirements, time complexity, and the need to configure numerous parameters. As the number of optimized parameters in a neural network increases, the required resources, such as memory and processing time, grow significantly, potentially preventing the achievement of an acceptable accuracy level within a reasonable timeframe. One proposed solution to this problem involves training only a single layer—typically the first or last—but this approach reduces the number of adjustable parameters and does not guarantee finding the global minimum. This underscores the relevance of researching new approaches to neural network training.
The identified limitations of traditional gradient-based methods create gaps in the current state of research. Specifically, existing neural network training approaches lack robust mechanisms for avoiding local minima convergence, particularly when processing financial data with inherent periodicity and high volatility. Furthermore, current optimization methods demonstrate insufficient adaptability to the nonlinear, multi-extremal nature of financial time series, limiting their practical applicability in investment decision making contexts. These weaknesses in the existing body of research directly motivated the development of inverse optimization algorithms presented in this study.

2.2. Inverse Optimization Algorithms for Neural Network Training

2.2.1. Neural Network Model

In this paper, we examine a deep neural network model (Figure 1) capable of reproducing nonlinear dependencies and revealing hidden patterns. This model has gained widespread application in stock market dynamics modeling [38,39,40,41]. The selection of neural network structure and parameters, including activation functions f, training hyperparameters, and the training algorithm, significantly determines the effectiveness of this analytical tool [42].
The neural network structure is determined by the number of hidden layers and the number of neurons in each layer. This study examines both perceptron models (without hidden layers) and neural networks with a single hidden layer, with the latter being the most prevalent approach in prediction tasks [43,44,45]. Specifically, after analyzing 177 scientific publications, Xu et al. [43] established that over 60% of research studies employ neural networks with a single hidden layer. This architecture represents an optimal compromise between computational efficiency and predictive power, enabling the approximation of nonlinear dependencies in time series without significantly increasing computational costs, thus ensuring high prediction accuracy while maintaining reasonable implementation complexity. When determining the number of neurons in the hidden layer, researchers distinguish between small dimensionality (1–10 neurons) and large dimensionality (11–20 neurons) [16]. This investigation examines configurations with 2, 8, and 16 neurons in the hidden layer.
The number of hidden layers and neurons determines the total number of adjustable weight coefficients w (total_weights), which can be calculated based on the number of neurons in the input layer (input_neurons) and hidden layer (hidden_neurons):
total_weights = (input_neurons + 1) × hidden_neurons + hidden_neurons + 1
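As a quick check of the resulting parameter counts, the short helper below evaluates this formula for the architectures examined later in the paper (a 30-value input window with 2, 8, or 16 hidden neurons); the function name and the printed listing are illustrative only.

def total_weights(input_neurons: int, hidden_neurons: int) -> int:
    # (inputs + bias) -> hidden layer, plus (hidden outputs + bias) -> output neuron
    return (input_neurons + 1) * hidden_neurons + hidden_neurons + 1

# Architectures examined in this study: a 30-value time window and 2, 8, or 16 hidden neurons
for h in (2, 8, 16):
    print(f"hidden_neurons = {h:2d} -> total_weights = {total_weights(30, h)}")
# hidden_neurons =  2 -> total_weights = 65
# hidden_neurons =  8 -> total_weights = 257
# hidden_neurons = 16 -> total_weights = 513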
The training algorithm adjusts the values of the neural network’s weight coefficients w to improve its predictive performance. The objective of this algorithm is to ensure maximum correspondence between the neural network output and actual values. This is achieved by minimizing the error function, which represents the sum of squared differences between the actual price values and the model’s predicted values z.

2.2.2. Algorithm for Neural Network Training Using the Inverse Optimization Approach

Most optimization methods employ an iterative process in which the value of the objective function f(x) at the current point is calculated at each step, and the search direction for the next point is determined (Figure 2). An alternative approach involves formulating an inverse problem, where the arguments x that provide a specified function value are sought (Figure 2) [46,47]. In this study, a stochastic algorithm for solving the inverse problem [48] was used to implement the inverse approach, which was modified for neural network training. Thus, the resulting algorithm is metaheuristic, where the search for the optimum is guided by the following rules:
  • At each iteration, the target function value decreases with a certain step, ensuring minimization of the error function.
  • At each iteration, only one weight coefficient is selected for modification with a specified probability. Then, its new value is determined to achieve the target function value by solving an equation. If the selected weight coefficient cannot achieve the target function value (no solution to the equation exists), another weight coefficient is chosen until all arguments have been considered. For illustration, Figure 3 shows a scenario where it is impossible to reach the target value y2 from point A by changing x1. Consequently, this argument is excluded from calculations, and the solution is found by modifying argument x2. Thus, each weight coefficient has two characteristics, namely, a selection probability and a usability indicator u, which takes the value 1 if the argument can be selected for solving the problem and 0 otherwise.
  • If no weight coefficient allows the target function value to be reached, the step size for changing the target function value is reduced.
When establishing the probability β for argument selection, the following approaches were considered:
  • All probabilities are equal (random choice (RC)), and the weight coefficient is selected randomly at each iteration.
  • Probabilities are calculated based on the gradient: the higher the absolute gradient value for a specific weight coefficient is, the higher the probability that this coefficient will be selected for modification (derivative choice (DC)).
Figure 2. Classical and inverse optimization scheme. The orange color represents the variable quantities that are adjusted to find a solution, while the blue color indicates the factors or elements that drive these changes.
Figure 3. Modification of the argument to achieve a new target function value y2: absence of solution (left); solution found (right). The dashed line shows a possible search direction when changing one argument. Colors represent the function levels of f(x).
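To make the scenario in Figure 3 concrete, the toy sketch below tries to reach a lower target value of a two-argument function by Newton iterations on a single coordinate. The quadratic test function, the starting point, and the target value are invented for illustration and are not taken from the paper.

import numpy as np

def solve_one_coordinate(f, x_start, idx, y_target, iters=50, tol=1e-6, eps=1e-6):
    # Try to satisfy f(x) = y_target by changing only coordinate idx (Newton's method).
    x = np.array(x_start, dtype=float)
    for _ in range(iters):
        err = f(x) - y_target
        if abs(err) < tol:
            return x, True                      # target value reached
        x_eps = x.copy()                        # numerical partial derivative along idx
        x_eps[idx] += eps
        grad = (f(x_eps) - f(x)) / eps
        if abs(grad) < 1e-12:
            break
        x[idx] -= err / grad                    # one-coordinate Newton update
    return x, False                             # target value not reachable this way

f = lambda x: x[0] ** 2 + x[1] ** 2             # toy error surface
point_a = [0.5, 2.0]                            # f(A) = 4.25, target y2 = 3.0
print(solve_one_coordinate(f, point_a, idx=0, y_target=3.0))  # changing x1 alone: no solution
print(solve_one_coordinate(f, point_a, idx=1, y_target=3.0))  # changing x2 reaches the target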
Thus, the developed RC algorithm comprises the following steps.
Step 1. Initialization:
The weights w are randomly initialized, and the error function is computed along with its minimization step:
h = f1(Price_prev × w0)
z = f2(h_bias × w1)
J(w) = Σ_{i=1}^{n} (z_i − Price_i)²
Jprev = J(w)
ΔJ = J(w)/d
where Price_prev is the matrix of input price values (Price_{t−1}, Price_{t−2}, ...), including the bias; Price represents the vector of actual output values; z denotes the vector of predicted values; n is the number of observations; f1 is the activation function of the hidden layer; f2 is the activation function of the output layer; h_bias is the output of the hidden layer with the bias; and d is the divisor determining the step size ΔJ.
Step 2. Reduction in the error function by a specified step size:
Apply stepwise reduction to the cost function utilizing a defined step size:
Jtarget = Jprev − ΔJ/r
Step 3. Selection of argument for modification:
For each weight with u = 1, calculate the selection probability β (m = 1, …, total_weights):
l = Σ_{j=1}^{total_weights} u_j
β_m = 1/l if u_m = 1, and β_m = 0 otherwise
Choose the weight coefficient with index p based on the probability β. In the absence of coefficient selection, increase the step reduction factor according to r = rq (q is the step reduction coefficient). Should r exceed the maximum threshold rmax, the algorithm terminates. Otherwise, initialize the usage indicator u to unity for all weights w, recalculate the probability β, and perform random selection of the weight coefficient based on the updated selection probabilities β.
Step 4. If the selected weight coefficient pertains to the second level (p ≥ (input_neurons + 1) × hidden_neurons), proceed to Step 5; otherwise, proceed to Step 6.
Step 5. Modification of the weight coefficient belonging to the second layer:
Compute the coefficient index v within the level:
v = p − (input_neurons + 1) × hidden_neurons.
Update the selected weight w1v by solving the following equation:
J(w1v) = Jtarget
The application of Newton’s method in this process requires computing the gradient g1 as follows:
z_error = z − Price
z_delta = z_error ⊙ f2′(z)
g1 = h_bias^T × z_delta
where f2′(z) is the derivative of the activation function, and ⊙ denotes element-wise multiplication.
Formulate a new weight coefficient array to enable subsequent modification of the selected element and error function evaluation:
w1* = w1
The Newton-based iterative updates are implemented according to the following formula:
w1*v = w1*v − (J(w*) − Jtarget)/(2g1v)
The values for J(w*) and g1v are recalculated after each iteration using Equations (1) and (2).
Subsequently, the new value of the error function is calculated using the obtained weighting coefficient:
Jnew = J(w*)
If Jnew < Jprev, then w1 = w1*, u = 1 for all weights, and Jprev = Jnew. Proceed to Step 2.
Otherwise, set u_p = 0 and proceed to Step 2.
Step 6. Modification of the weight coefficient belonging to the first layer:
Determine the row and column indices v and b for the first-layer coefficient matrix by decomposing the linear index p:
v = p mod hidden_neurons
b = ⌊p/hidden_neurons⌋
The hidden layer error and gradient are computed as follows:
h_error = z_delta × w1[:−1]^T
h_delta = h_error ⊙ f1′(h)
g0 = Price_prev^T × h_delta
where w1[:−1] is the array of weight coefficients w1 excluding the element associated with the bias term, and f1′(h) is the derivative of the hidden-layer activation function.
The selected weight w0_b,v is updated by solving the following equation:
J(w0_b,v) = Jtarget
To solve the equation using Newton’s method, the iterative formula is employed as follows (preliminarily, array w0 is copied to array w0*: w0* = w0):
w0*_b,v = w0*_b,v − (J(w*) − Jtarget)/(4g0_b,v)
Here, J(w*) and g0_b,v are recalculated after each iteration using Formulas (1) and (3).
Compute the new value of the error function with the updated weight coefficient:
Jnew = J(w*)
If Jnew < Jprev, then w0 = w0*, u = 1 for all weights, and Jprev = Jnew. Proceed to Step 2.
Otherwise, set u_p = 0 and proceed to Step 2.
The pseudocode of the RC is presented in Appendix A. Figure 4 shows the main blocks of the algorithm, including the error function minimization, weight coefficient selection, and equation solving. The absence of active (u = 1) weights leads to a reduction in the search step until the step becomes sufficiently small. For a single-layer network (perceptron), hidden_neurons is set to zero. After step 3, the algorithm proceeds directly to Step 5, where w1 denotes the weights connecting the input neurons to the output neuron.
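The steps above can also be condensed into a compact, self-contained sketch. The NumPy code below is a hypothetical re-implementation of the RC procedure for a network with one hidden layer, written for illustration only: the function and variable names, the uniform weight initialization range, the max_steps safeguard, and the zero-gradient guard inside the Newton loop are our own assumptions and do not come from the paper. X is assumed to hold the lagged, normalized prices (one row per observation) and y the corresponding next values.

import numpy as np

def train_rc(X, y, hidden_neurons=2,
             f1=lambda a: a, f1_deriv=lambda a: np.ones_like(a),
             f2=lambda a: a, f2_deriv=lambda a: np.ones_like(a),
             d=5.0, q=2.0, r_max=1e6, newton_iters=50, newton_tol=1e-3,
             max_steps=20000, seed=0):
    # Illustrative sketch of the RC inverse training procedure (linear activations by default).
    rng = np.random.default_rng(seed)
    n, inputs = X.shape
    Xb = np.hstack([X, np.ones((n, 1))])                 # inputs plus bias column
    w0 = rng.uniform(-0.5, 0.5, (inputs + 1, hidden_neurons))
    w1 = rng.uniform(-0.5, 0.5, hidden_neurons + 1)      # hidden-to-output weights plus bias

    def forward(w0_, w1_):
        a0 = Xb @ w0_                                    # hidden pre-activations
        h_bias = np.hstack([f1(a0), np.ones((n, 1))])    # hidden outputs plus bias
        a1 = h_bias @ w1_
        return a0, h_bias, a1, f2(a1)

    def cost(w0_, w1_):
        return float(np.sum((forward(w0_, w1_)[3] - y) ** 2))

    def gradients(w0_, w1_):
        a0, h_bias, a1, z = forward(w0_, w1_)
        z_delta = (z - y) * f2_deriv(a1)
        g1 = h_bias.T @ z_delta                          # gradient w.r.t. second-layer weights
        h_delta = np.outer(z_delta, w1_[:-1]) * f1_deriv(a0)
        g0 = Xb.T @ h_delta                              # gradient w.r.t. first-layer weights
        return g0, g1

    first_layer = (inputs + 1) * hidden_neurons
    u = np.ones(first_layer + hidden_neurons + 1, dtype=bool)   # usability indicators
    J_prev, r = cost(w0, w1), 1.0
    dJ = J_prev / d

    for _ in range(max_steps):
        if r > r_max:
            break
        J_target = J_prev - dJ / r                       # Step 2: lower the error target
        usable = np.flatnonzero(u)
        if usable.size == 0:                             # Step 3: nothing usable -> reduce the step
            r *= q
            u[:] = True
            continue
        p = int(rng.choice(usable))                      # RC: uniform random weight choice
        w0_try, w1_try = w0.copy(), w1.copy()
        for _ in range(newton_iters):                    # solve J(w*) = J_target on one weight
            J_cur = cost(w0_try, w1_try)
            if abs(J_cur - J_target) < newton_tol:
                break
            g0, g1 = gradients(w0_try, w1_try)
            if p >= first_layer:                         # Step 5: second-layer weight
                v = p - first_layer
                if abs(g1[v]) < 1e-12:
                    break
                w1_try[v] -= (J_cur - J_target) / (2.0 * g1[v])
            else:                                        # Step 6: first-layer weight
                v, b = p % hidden_neurons, p // hidden_neurons
                if abs(g0[b, v]) < 1e-12:
                    break
                w0_try[b, v] -= (J_cur - J_target) / (4.0 * g0[b, v])
        J_new = cost(w0_try, w1_try)
        if J_new < J_prev:                               # accept the modification, reset usability
            w0, w1, J_prev = w0_try, w1_try, J_new
            u[:] = True
        else:                                            # exclude this weight for now
            u[p] = False
    return w0, w1, J_prev

A DC variant would differ only in the weight-selection step, where the uniform choice over usable weights is replaced by probabilities proportional to the absolute gradient values, as described next.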
When using the DC algorithm, probabilities β are proportional to the absolute values of the partial derivatives. To determine these probabilities, arrays c0 and c1 are initially formed from vector u, corresponding to arrays w0 and w1 and containing usage eligibility indicators for each element in calculations. Next, the absolute values of gradients are determined, taking into account the usage indicators:
c_w0 = |g0| ⊙ c0
c_w1 = |g1| ⊙ c1
Finally, the probabilities of selecting a weight coefficient are determined using the following formula:
β0 = c_w0 / (Σ c_w0 + Σ c_w1)
β1 = c_w1 / (Σ c_w0 + Σ c_w1)
where β0 is a matrix containing the probabilities of selecting first-layer weight coefficients, and β1 is a vector containing the probabilities of selecting second-layer weight coefficients. For a single-layer network, only the β0 matrix is calculated.
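A minimal sketch of this probability computation is given below, assuming c0 and c1 are 0/1 arrays of usage indicators shaped like g0 and g1; the function name and the zero-total guard are illustrative additions.

import numpy as np

def dc_probabilities(g0, g1, c0, c1):
    # DC selection probabilities: proportional to |gradient|, restricted to usable weights.
    cw0 = np.abs(g0) * c0
    cw1 = np.abs(g1) * c1
    total = cw0.sum() + cw1.sum()
    if total == 0.0:                        # no usable weight remains
        return np.zeros_like(cw0), np.zeros_like(cw1)
    return cw0 / total, cw1 / total         # beta0 (first layer), beta1 (second layer)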

3. Experimental Results and Discussion

For the experiments, daily closing price data of stocks from the Russian stock market were used. This choice is determined by the characteristics of the Russian stock market, including high volatility and a concentration of trading activity in the commodity sector, which forms specific cross-sectoral correlation patterns; these characteristics make technical analysis particularly relevant for stock analysis. The dataset was constructed using a time window concept. Specifically, a fixed set of consecutive historical stock price values was used as features for predicting the next value in the time series. Based on the research in [16], the time window size was set to 30. Data preprocessing included normalization of the original values using Min–Max transformation, which improved the speed and accuracy of neural network training. Additionally, the original dataset was divided into training and test sets according to the chronological separation principle, ensuring the model’s effectiveness was verified on new data.
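A minimal sketch of this preparation step is shown below, assuming the closing prices arrive as a one-dimensional array; the 80/20 chronological split ratio and the function name are our own assumptions, while the window size of 30 and the Min–Max scaling follow the description above.

import numpy as np

def make_dataset(prices, window=30, train_ratio=0.8):
    # Min-Max scaling of the closing-price series
    prices = np.asarray(prices, dtype=float)
    scaled = (prices - prices.min()) / (prices.max() - prices.min())
    # Sliding window: the previous `window` values are the features for the next value
    X = np.array([scaled[i:i + window] for i in range(len(scaled) - window)])
    y = scaled[window:]
    # Chronological train/test split (no shuffling)
    split = int(len(X) * train_ratio)
    return X[:split], y[:split], X[split:], y[split:]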
To compare the performance of algorithms, widely used neural network training methods, including Adam and Stochastic Gradient Descent (SGD), implemented in Python 3, were utilized. Standard metrics were employed for algorithm evaluation: mean squared error (MSE) and mean absolute error (MAE). The learning rate in the Adam and SGD algorithms was varied from 0.1 to 0.00001 to determine the optimal configuration, with the number of epochs set to 100. For the inverse algorithms RC and DC, the number of Newton method iterations was set to 50, with an accuracy of 0.001. The following hyperparameter values were established for the inverse algorithms: r = 1, q = 2, rmax = 100,000, and d = 10 for the perceptron. For the neural network with one hidden layer, two hyperparameters were changed: rmax = 1,000,000 and d = 5. A linear activation function was used for the output layer. However, for the hidden layer, a set of functions was considered, including both classical activation functions and those specifically designed for time series modeling. In particular, descriptions of the sincos, cloglogm, cloglog, logsigm, rootsig, sinc, and wave functions are provided in [16]. In addition, the mish function is described in [19], and the snake function is presented in [33]. In this study, “tang” denotes the tangent activation function, whereas “arctg” refers to the arctangent function. Additionally, time constraints were imposed on the problem-solving process with inverse algorithms. A maximum of 5 min was allocated for the single-layer network, and 2 min was allocated for the neural network with one hidden layer.
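For reference, two of the specialized hidden-layer activation functions can be written in a few lines; the definitions follow the cited sources (mish from [19], snake from [33] with the frequency parameter fixed at a = 1 as an assumption), and the remaining functions are omitted here.

import numpy as np

def mish(x):
    # mish(x) = x * tanh(softplus(x)), with softplus(x) = ln(1 + e^x)
    return x * np.tanh(np.log1p(np.exp(x)))

def snake(x, a=1.0):
    # snake(x) = x + (1/a) * sin^2(a*x); proposed for data with periodic components
    return x + np.sin(a * x) ** 2 / a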

3.1. Forecasting the Closing Price of Gazprom PJSC Shares

In the first part of our computational experiments, we addressed the task of forecasting the closing price of Gazprom PJSC shares. Gazprom shares were selected for analysis because they possess high liquidity. As some of the most actively traded stocks on the Moscow Exchange, they have significant market capitalization and are included in the main Moscow Exchange and Russian Trading System indices with high weighting, which makes them representative for studying the Russian stock market. For this purpose, we compiled a time series dataset of daily closing prices from trading sessions covering the period from 4 January 2023 to 15 March 2025. This interval, encompassing 562 trading days, enables the investigation into the dynamics of the company’s shares across various phases of the market cycle, ensuring a balance between data relevancy and sufficiency.
Table 1 presents the MAE and MSE values obtained on the test set using a single-layer neural network. Table 2 shows a fragment of the simulation results for the linear model. According to the obtained results, the best MSE and MAE values were achieved with the cloglog activation function and the DC learning algorithm.
Table 3, Table 4 and Table 5 present the simulation results for a neural network with one hidden layer containing 2, 8, and 16 neurons, respectively. The optimal solution was obtained using a neural network with one hidden layer containing two neurons, employing a linear activation function and the RC learning algorithm. It is worth noting that for the single-layer neural network, the DC algorithm proved to be the most effective learning algorithm for most activation functions. In contrast, for the neural network with one hidden layer, it demonstrated the poorest performance. For neural networks with one hidden layer containing two and eight neurons, the best MAE and MSE values for most activation functions were achieved using the RC algorithm. However, with 16 neurons, the Adam algorithm yielded the best MSE and MAE values for the majority of activation functions.
Figure 5 presents a box plot diagram of absolute deviations between predicted and actual values (a) and the predicted values (b) on the test set using a neural network with one hidden layer containing two neurons and a linear activation function. Analysis of the results demonstrates that the RC algorithm exhibits the smallest dispersion of absolute errors, as well as the lowest values for their mean (indicated by “+”) and median (indicated by a horizontal line within the box). These results indicate that the RC algorithm’s predicted values on the test set most accurately correspond to the actual data. DC exhibits the highest median with considerable variability. The time series visualization demonstrates that RC (violet dashed line) achieves superior tracking accuracy during volatile market periods (observations 130–140), whereas traditional methods (SGD and Adam) and DC exhibit notable lag during rapid market transitions. The convergence behavior around observation 130 highlights the adaptive capacity of the proposed approach to recalibrate following market corrections, whereas conventional optimizers struggle to recover tracking accuracy. For model comparison and to determine the significance of absolute error differences between them, we also applied statistical analysis, specifically the Wilcoxon test to perform pairwise comparisons of the models. This test revealed highly significant (p-value < 0.0001) differences between all models except for SGD and DC (p-value = 0.009).
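The pairwise comparison described above can be reproduced with a short SciPy sketch; the dictionary layout of per-model absolute errors and the function name are assumptions made for illustration.

from itertools import combinations
from scipy.stats import wilcoxon

def pairwise_wilcoxon(abs_errors):
    # abs_errors maps a model name (e.g., "RC", "DC", "Adam", "SGD")
    # to its array of absolute errors on the test set.
    p_values = {}
    for m1, m2 in combinations(abs_errors, 2):
        _, p = wilcoxon(abs_errors[m1], abs_errors[m2])   # paired, non-parametric test
        p_values[(m1, m2)] = p
    return p_values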
Figure 6 illustrates the results for the mish activation function, which shows that classical training algorithms and DC failed to adjust weights properly for adequate prediction of the modeled time series values. The Wilcoxon test revealed highly significant differences between all models (p-value < 0.0001).
Figure 7 presents a heatmap summarizing the data from Table 1, Table 3, Table 4 and Table 5. It can be observed that the single-layer network yields the largest number of MSE values below 0.01; however, the minimum values were obtained using a neural network with a hidden layer.
The current stage in the evolution of artificial intelligence is marked by significant progress in the field of large language models (LLMs), such as ChatGPT, Gemini, and Claude, which opens new perspectives for their application in financial forecasting and analytics. In recent years, the international literature has witnessed numerous studies demonstrating the effectiveness of LLMs in predicting market indicators [49,50]. In this paper, we examined the forecasting of Gazprom’s stock closing price using the Claude 4 Sonnet Thinking model. The choice of this model is justified by its ability not only to generate coherent text but also to consistently articulate its reasoning logic, which enhances the transparency of the methodology and increases confidence in the obtained results.
The following prompt was used to obtain forecast values: “The table in the file contains stock closing price values: price t−1 is the closing price value at time (day) t−1, price t−2 is the closing price value at time t−2. For each row, provide a forecast of the stock price for time t using your intuition.” Table 6 presents a fragment of the Excel file that was uploaded for analysis by the language model. The model returned its forecasts together with brief reasoning, for example:
  • Row 2: Forecast 157.80—observing a downward trend from 158.68 to 158.01, expecting the decline to continue
  • Row 3: Forecast 155.90—significant drop from 158.01 to 156.52, downward trend is accelerating
  • Row 4: Forecast 151.50—sharp decline from 156.52 to 153.09, downward momentum persists
  • Row 5: Forecast 154.20—slight upward correction from 153.09 to 153.56, potential rebound possible
Figure 8 presents the actual and predicted stock prices obtained using the language model. The minimal divergence between predicted and actual values during periods of high volatility indicates stable performance under changing market conditions, with the model successfully adapting to both correction phases (95–100) and periods of rapid price appreciation (130–140). The MSE value for the test sample was 0.0027, whereas the MAE was 0.0389. Thus, it should be noted that the obtained results surpass those achieved using a single-layer neural network. Furthermore, for the network with one hidden layer, the MSE and MAE metrics were lower than those obtained when using Adam, SGD, and DC optimizers.

3.2. Experiments Using Data from 40 Russian Companies

In the second part of the experimental study, 40 stocks of Russian companies were selected for analysis. The input data consisted of the 1000 most recent closing price values (with the last value dated 1 July 2025). Figure 9 illustrates the dynamics of the normalized stock closing prices.
Table 7 presents the results of a comparative analysis of single-layer neural network training methods across various activation functions using two time windows (5 and 30 periods). Each value indicates the number of stocks for which the corresponding training method demonstrated the best mean squared error performance on the test dataset when using a specific activation function.
For the five-period time window, the following most effective combinations were identified: Adam optimizer with arctg function (18 cases), cloglogm (19 cases), invers (21 cases), mish (21 cases), and snake (21 cases); SGD with softplus (21 cases) and softs functions (18 cases); DC method with sincos function (25 cases); and RC method with logsig function (15 cases).
For the 30-period time window, there was a shift in method effectiveness: the DC method demonstrated superiority with arctg function (19 cases), cloglogm (21 cases), invers (20 cases), linear (33 cases), mish (21 cases), and snake (25 cases); the RC method showed the best results with cloglogm (17 cases), logsig (16 cases), mish (12 cases), and sincos functions (17 cases); and the Adam optimizer maintained effectiveness with the softs function (26 cases).
Analysis of the aggregate indicators revealed a significant influence of the time window size on the effectiveness of the investigated methods: the Adam algorithm achieved minimum MSE values in 185 cases with a 5-period window, whereas the DC algorithm demonstrated superior accuracy in 222 cases with a 30-period window.
Figure 10 presents the results of a comparative analysis of activation functions and training methods for time series forecasting with window sizes of 5 and 30 periods. Performance evaluation was conducted using the MSE criterion on the test dataset. Specifically, the comparative analysis results show the distribution of activation functions by frequency of achieving minimum MSE values across the studied stocks (left) and the distribution of training algorithms by frequency of achieving minimum MSE values across the studied stocks (right).
The results demonstrate a consistent advantage of the linear activation function regardless of the time window size, with the most pronounced effect observed with the 30-period window. The DC optimization algorithm exhibits minimum MSE values in the maximum number of cases under investigation, with its proportion of superior performance being substantially higher for the 30-period time window compared with the 5-period window.
Table 8 presents the modeling results for a neural network with one hidden layer. The table indicates the number of stocks for which each investigated training method with the corresponding activation function demonstrated the minimum MSE value on the test dataset. The RC algorithm with 50 and 80 Newton method iterations was used as the inverse training method. The research results show that for the neural network architecture with two neurons in the hidden layer, the RC method demonstrates superior performance for most activation functions. As the number of neurons in the hidden layer increases, the advantage shifts to classical training methods. Furthermore, increasing the number of Newton method iterations from 50 to 80 led to higher computational costs. This prevented algorithm convergence within the limited time interval, consequently resulting in higher MSE values in some cases.
Figure 11 illustrates a histogram showing the distribution of activation functions based on the number of stocks for which each function provided the lowest MSE value on the test dataset. The analysis demonstrates that when varying the number of neurons in the hidden layers of the network, minimum error values were predominantly achieved using the linear activation function. Regardless of the neural network architecture, the RC method with 50 iterations of the Newton algorithm consistently produced solutions with the lowest MSE values.
Figure 12 illustrates the minimum MSE values obtained on the test dataset for various stocks and neural network architectures. Research findings indicate that a neural network architecture with a single hidden layer containing two neurons provides optimal prediction accuracy in 70% of the examined cases. The empirical analysis of neural network architectures revealed that a single-layer configuration fails to adequately approximate complex nonlinear relationships in the data, whereas increasing the number of hidden neurons to sixteen leads to overfitting, excessive adaptation to noise in the training dataset, and significant deterioration of the generalization capability on test data.
Figure 13 presents heatmaps for the mean MSE (a) and MAE (b) values for a neural network with two neurons in the hidden layer. For each stock, modeling was performed with each activation function, and the algorithm providing the lowest MSE and MAE values for the specific function was identified. Subsequently, we calculated the mean values across all activation functions and identified learning algorithms. The results demonstrate that for the linear activation function only, the RC algorithm achieved the best performance. For the softs activation function, all algorithms yielded comparably high values, indicating that this function is unsuitable for time series modeling within this architecture.

4. Conclusions and Suggestions for Future Work

The research focuses on improving the accuracy of short-term stock price forecasting based on inverse optimization algorithms and neural network modeling. Modified neural network training algorithms are proposed, and their effectiveness for financial forecasting tasks is evaluated. The study included analysis of classical (linear, sigmoid) and modern (mish, cloglog) activation functions to assess algorithm stability and identify the most effective activation functions for the forecasting problem. Experimental data confirm the validity of the proposed hypothesis and demonstrate the applicability of the suggested algorithms, which in several test cases provided improved neural network training quality compared with traditional approaches. Analysis of the results showed that for a single-layer neural network, optimal performance indicators were achieved using the DC method. However, for a neural network with one hidden layer, the RC method produced the best results.
The optimal architecture identified in our study, namely, a neural network with one hidden layer containing just two neurons, challenges the notion that deeper and more complex networks necessarily yield better forecasting results. This finding supports the principle of parsimony in model selection and aligns with Xu et al.’s [43] observation that single hidden layer architectures represent an optimal compromise between computational efficiency and predictive power for many applications. It should be noted that linear activation functions are rarely employed in hidden layers of neural networks due to gradient propagation issues. Specifically, when using the Adam optimization algorithm, superior results were achieved with other activation functions (snake, sigmoid, and logsig). However, when implementing inverse optimization algorithms, the linear activation function demonstrates optimal efficiency in the majority of our experiments.
The implications of these findings extend beyond the specific domain of stock price forecasting. The results suggest that the methodology for selecting training algorithms should be reconsidered in the context of financial time series analysis. The prevalent preference for gradient-based methods may be suboptimal for tasks involving data with periodic components and high volatility, which characterize stock market dynamics.
The comparative analysis with LLMs, such as Claude 4 Sonnet Thinking, reveals another interesting dimension. Although LLMs demonstrated impressive forecasting capabilities that surpassed single-layer neural networks, they still did not consistently outperform optimized neural networks with one hidden layer trained using the inverse algorithm. This suggests that specialized, efficiently trained neural networks remain competitive tools for financial forecasting, even in the era of large foundation models.
Future research directions should include the following:
  • Integration of inverse optimization algorithms with other neural network architectures, particularly recurrent networks and transformers, which have shown promise in capturing temporal dependencies in financial data;
  • Exploration of hybrid approaches that combine the strengths of inverse optimization with traditional algorithms or metaheuristics;
  • Investigation of feature selection techniques to complement the proposed training algorithms and further enhance forecasting accuracy.
In addition, further research directions involve adapting and applying the developed algorithms to other machine learning tasks, particularly data clustering problems.
In conclusion, this study demonstrates that inverse optimization algorithms represent a valuable addition to the toolkit of methods for training neural networks in financial forecasting applications. By circumventing the limitations of gradient-based approaches, these algorithms enable more effective exploitation of the capabilities of neural networks, ultimately leading to improved forecasting accuracy and more reliable investment decision making.

Author Contributions

Conceptualization, E.G.; methodology, E.G.; software, E.G. and R.G.; validation, E.G., R.G. and E.V.; formal analysis, E.V.; investigation, E.G., R.G. and E.V.; resources, E.G. and R.G.; data curation, R.G.; writing—original draft preparation, E.G.; writing—review and editing, R.G. and E.V.; visualization, E.G. and R.G.; supervision, E.G.; project administration, E.G.; funding acquisition, E.G. All authors have read and agreed to the published version of the manuscript.

Funding

This work was carried out within the framework of the TUSUR Development Program for 2025–2036 of the Strategic Academic Leadership Program “Priority 2030”.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets generated and/or analysed during the current study are available from the corresponding author on reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

INPUT:
- Training data (Price_prev, Price)
- Network architecture (input_neurons, hidden_neurons)
- Parameters: d (step divisor), q (step reduction coefficient), rmax (maximum threshold)
OUTPUT: Optimized weights w0, w1
BEGIN
1. INITIALIZATION
  Initialize weights w0, w1 randomly
  Set r = 1, u = 1 for all weights
  // Compute initial error function using Equation (1)
  z = f2(w1 × f1(w0 × Price_prev))
  Jprev = Σ(Price − z)²
  // Set initial step size
  step = Jprev/d
2. MAIN LOOP
  WHILE r ≤ rmax DO
  2.1. TARGET REDUCTION
    // Apply stepwise reduction using Equation (2)
    Jtarget = Jprev − step/r
  2.2. WEIGHT SELECTION
    // Calculate selection probabilities for RC method
    total_usable = Σ ui for all weights
    IF total_usable = 0 THEN
          r = r × q
     Set u = 1 for all weights
     total_usable = total number of weights
    END IF
    FOR each weight i with ui = 1 DO
      // Equal probabilities
      βi = 1/total_usable
    END FOR
    // Select weight index p based on probability β
    p = RandomSelection(β)
  2.3. LAYER DETERMINATION
    IF p ≥ (input_neurons + 1) × hidden_neurons THEN
      GO TO STEP 3 (Second Layer Modification)
    ELSE
      GO TO STEP 4 (First Layer Modification)
    END IF
3. SECOND LAYER MODIFICATION
  3.1. COEFFICIENT INDEX CALCULATION
    v = p − (input_neurons + 1) × hidden_neurons
  3.2. NEWTON’S METHOD ITERATION
    Copy w1 to w1*
    REPEAT
      // Compute hidden layer output with bias
      // Append bias
      hbias = [f1(w0 × Price_prev); 1]
      // Compute gradient g1 using Equation (2)
      δ2 = (z − Price) ⊙ f2’(w1 × hbias)
      g1 = hbiasT × δ2
      // Newton update
      w1*v = w1*v − (J(w*) − Jtarget)/(2g1v)
      // Recalculate J(w*) and gradient using Equations (1) and (2)
      znew = f2(w1* × hbias)
      Jnew = Σ(Price − znew)²
      UNTIL convergence or max_iterations
  3.3. ACCEPTANCE CHECK
    IF Jnew < Jprev THEN
      w1 = w1*
      u = 1 for all weights
      Jprev = Jnew
      GO TO STEP 2
    ELSE
      up = 0
      GO TO STEP 2
    END IF
4. FIRST LAYER MODIFICATION
  4.1. MATRIX INDICES CALCULATION
    v = p mod hidden_neurons
    b = ⌊p/hidden_neurons⌋
  4.2. NEWTON’S METHOD ITERATION
    Copy w0 to w0*
    REPEAT
      // Compute hidden layer error using Equation (3)
      h = f1(w0 × Price_prev)
      δ2 = (z − Price) ⊙ f2’(w1 × [h; 1])
      // Exclude bias weights
      w1_no_bias = w1[1:hidden_neurons, :]
      δ1 = (w1_no_biasT × δ2) ⊙ f1’(w0 × Price_prev)
      // Compute gradient g0 using Equation (3)
      g0 = Price_prevT × δ1
      // Newton update
      w0*b,v = w0*b,v − (J(w*) − Jtarget)/(4g0b,v)
      // Recalculate J(w*) and gradient using Equations (1) and (3)
      hnew = f1(w0* × Price_prev)
      znew = f2(w1 × [hnew; 1])
      Jnew = Σ(Price − znew)²
      UNTIL convergence or max_iterations
  4.3. ACCEPTANCE CHECK
    IF Jnew < Jprev THEN
      w0 = w0*
      u = 1 for all weights
      Jprev = Jnew
      GO TO STEP 2
    ELSE
      up = 0
      GO TO STEP 2
    END IF
  END WHILE
5. RETURN optimized weights w0, w1
END

References

  1. Chang, P.C.; Liu, C.H.; Lin, J.L.; Fan, C.Y.; Ng, C.S.P. A neural network with a case based dynamic window for stock trading prediction. Expert Syst. Appl. 2009, 36, 6889–6898. [Google Scholar] [CrossRef]
  2. Rouf, N.; Malik, M.B.; Arif, T.; Sharma, S.; Singh, S.; Aich, S.; Kim, H.-C. Stock Market Prediction Using Machine Learning Techniques: A Decade Survey on Methodologies, Recent Developments, and Future Directions. Electronics 2021, 10, 2717. [Google Scholar] [CrossRef]
  3. Bishop, C.M. Pattern Recognition and Machine Learning; Springer: Berlin/Heidelberg, Germany, 2006. [Google Scholar]
  4. Box, G.E.P.; Jenkins, G.M. Time Series Analysis, Forecasting and Control; Holden Day: San Francisco, CA, USA, 1976. [Google Scholar]
  5. Engle, R.F. Autoregressive conditional heteroscedasticity with estimates of the variance of united kingdom inflation. Econom. J. Econom. Soc. 1982, 50, 987–1007. [Google Scholar] [CrossRef]
  6. Bollerslev, T. Generalized autoregressive conditional heteroskedasticity. J. Econom. 1986, 31, 307–327. [Google Scholar] [CrossRef]
  7. Engle, R.F.; Bollerslev, T. Modelling the persistence of conditional variances. Econom. Rev. 1986, 5, 1–50. [Google Scholar] [CrossRef]
  8. Nelson, D.B. Conditional heteroskedasticity in asset returns: A new approach. Econometrica 1991, 59, 347–370. [Google Scholar] [CrossRef]
  9. Glosten, L.R.; Jagannathan, R.; Runkle, D.E. On the relation between the expected value and the volatility of the nominal excess return on stocks. J. Financ. 1993, 48, 1779–1801. [Google Scholar] [CrossRef]
  10. Zakoian, J.-M. Threshold heteroskedastic models. J. Econ. Dyn. Control 1994, 18, 931–955. [Google Scholar] [CrossRef]
  11. Vukovic, D.; Spitsyna, L.Y.; Gribanova, E.; Spitsin, V.; Lyzin, I. Predicting the Performance of Retail Market Firms: Regression and Machine Learning Methods. Mathematics 2023, 11, 1916. [Google Scholar] [CrossRef]
  12. Sarve, Y.A.; Phadke, A.C. A Survey on Data-Driven Techniques of Remaining Useful Life Assessment for Predictive Maintenance of the System. J. Sustain. Innov. 2025, 2, 58–71. [Google Scholar] [CrossRef]
  13. Lv, P.; Wu, Q.; Xu, J.; Shu, Y. Stock Index Prediction Based on Time Series Decomposition and Hybrid Model. Entropy 2022, 24, 146. [Google Scholar] [CrossRef]
  14. Zhao, C.; Hu, P.; Liu, X.; Lan, X.; Zhang, H. Stock Market Analysis Using Time Series Relational Models for Stock Price Prediction. Mathematics 2023, 11, 1130. [Google Scholar] [CrossRef]
  15. Liu, Q.; Tao, Z.; Tse, Y.; Wang, C. Stock market prediction with deep learning: The case of China. Financ. Res. Lett. 2021, 46, 102209. [Google Scholar] [CrossRef]
  16. Gomes, G.S.-D.-S.; Ludermir, T.B.; Leyla, M.M.; Lima, R. Comparison of new activation functions in neural network for forecasting financial time series. Neural Comput. Appl. 2011, 20, 417–439. [Google Scholar] [CrossRef]
  17. Akter, S.; Haider, M.R. mTanh: A Low-Cost Inkjet-Printed Vanishing Gradient Tolerant Activation Function. J. Low Power Electron. Appl. 2025, 15, 27. [Google Scholar] [CrossRef]
  18. Dubey, S.R.; Singh, S.K.; Chaudhuri, B.B. Activation functions in deep learning: A comprehensive survey and benchmark. Neurocomputing 2022, 503, 92–108. [Google Scholar] [CrossRef]
  19. Misra, D. Mish: A Self Regularized Non-Monotonic Activation Function. In Proceedings of the 31st British Machine Vision Conference, Manchester, UK, 7–11 September 2020; Volume 1. [Google Scholar]
  20. Khang, W.G.; Sugiyarto, S.; Firza Afiatin, M.Y.; Robiatul Mahmudah, K.; Nursyiva, I.; Mesith, C.; Choo, W.O. Comparison of Activation Functions in Convolutional Neural Network for Poisson Noisy Image Classification. Emerg. Sci. J. 2024, 8, 592–602. [Google Scholar] [CrossRef]
  21. Zhang, S.; Lu, J.; Zhao, H. Deep Network Approximation: Beyond ReLU to Diverse Activation Functions. J. Mach. Learn. Res. 2023, 25, 1687–1725. [Google Scholar]
  22. Bouraya, S.; Belangour, A. A comparative analysis of activation functions in neural networks: Unveiling categories. Bull. Electr. Eng. Inform. 2024, 13, 3301–3308. [Google Scholar] [CrossRef]
  23. Han, F.; Jiang, J.; Ling, Q.H.; Su, B.Y. A survey on metaheuristic optimization for random single-hidden layer feedforward neural network. Neurocomputing 2019, 335, 261–273. [Google Scholar] [CrossRef]
  24. Ojha, V.K.; Abraham, A.; Snášel, V. Metaheuristic design of feedforward neural networks: A review of two decades of research. Eng. Appl. Artif. Intell. 2017, 60, 97–116. [Google Scholar] [CrossRef]
  25. Darwish, A.; Hassanien, A.E.; Das, S. A survey of swarm and evolutionary computing approaches for deep learning. Artif. Intell. Rev. 2020, 53, 1767–1812. [Google Scholar] [CrossRef]
  26. Shen, L.; Chen, C.; Zou, F.; Jie, Z.; Sun, J.; Liu, W. A Unified Analysis of AdaGrad With Weighted Aggregation and Momentum Acceleration. IEEE Trans. Neural Netw. Learn. Syst. 2023, 35, 14482–14490. [Google Scholar] [CrossRef] [PubMed]
  27. Mandasari, S.; Irfan, D.; Wanayumini; Rosnelly, R. Comparison of sgd, adadelta, adam optimization in gender classification using cnn. JURTEKSI (J. Teknol. Dan Sist. Inf.) 2023, 9, 345–354. [Google Scholar] [CrossRef]
  28. Elshamy, R.; Abu-Elnasr, O.; Elhoseny, M.; Elmougy, S. Improving the efficiency of RMSProp optimizer by utilizing Nestrove in deep learning. Dent. Sci. Rep. 2023, 13, 8814. [Google Scholar] [CrossRef]
  29. Reyad, M.; Sarhan, A.M.; Arafa, M. A modified Adam algorithm for deep neural network optimization. Neural Comput. Appl. 2023, 35, 17095–17112. [Google Scholar] [CrossRef]
  30. Nutakki, M.; Mandava, S. Optimizing home energy management: Robust and efficient solutions powered by attention networks. Heliyon 2024, 10, E26397. [Google Scholar] [CrossRef]
  31. Lantian, L.; Weizhi, X.; Hui, Y. Character-level neural network model based on Nadam optimization and its application in clinical concept extraction. Neurocomputing 2020, 414, 182–190. [Google Scholar] [CrossRef]
  32. Bai, L.; Liming, N. Gradient based invasive weed optimization algorithm for the training of deep neural network. Multimed. Tools Appl. 2021, 80, 22795–22819. [Google Scholar] [CrossRef]
  33. Liu, Z.; Tilman, H.; Masahito, U. Neural Networks Fail to Learn Periodic Functions and How to Fix It. In Proceedings of the 34th Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 1–12 December 2020. [Google Scholar]
  34. Sohail, A. Genetic Algorithms in the Fields of Artificial Intelligence and Data Sciences. Ann. Data. Sci. 2023, 10, 1007–1018. [Google Scholar] [CrossRef]
  35. Ozsoydan, F.B.; Gölcük, I. A hyper-heuristic based reinforcement-learning algorithm to train feedforward neural networks. Eng. Sci. Technol. Int. J. 2022, 35, 101261. [Google Scholar] [CrossRef]
  36. Kirkpatrick, S.; Gelatt, C.D.J.; Vecchi, M.P. Optimization by simulated annealing. Science 1983, 220, 671–680. [Google Scholar] [CrossRef]
  37. Biswas, S.; Singh, G.; Maiti, B.; Ezugwu, A.E.-S.; Saleem, K.; Smerat, A.; Abualigah, L.; Bera, U.K. Integrating Differential Evolution into Gazelle Optimization for advanced global optimization and engineering applications. Comput. Methods Appl. Mech. Eng. 2025, 434, 117588. [Google Scholar] [CrossRef]
  38. Huang, Y.; Capretz, L.F.; Ho, D. Machine Learning for Stock Prediction Based on Fundamental Analysis. In Proceedings of the IEEE Symposium Series on Computational Intelligence (SSCI), Orlando, FL, USA, 5–7 December 2021; pp. 1–10. [Google Scholar] [CrossRef]
  39. Oyewole, A.T.; Adeoye, O.B.; Addy, W.A.; Okoye, C.C.; Ofodile, O.C.; Ugochukwu, C.E. Predicting stock market movements using neural networks: A review and application study. Comput. Sci. IT Res. J. 2024, 5, 651–670. [Google Scholar] [CrossRef]
  40. Zaheer, S.; Anjum, N.; Hussain, S.; Algarni, A.D.; Iqbal, J.; Bourouis, S. A Multi Parameter Forecasting for Stock Time Series Data Using LSTM and Deep Learning Model. Mathematics 2023, 11, 590. [Google Scholar] [CrossRef]
  41. Wolff, D.; Echterling, F. Stock picking with machine learning. J. Forecast. 2023, 43, 81–102. [Google Scholar] [CrossRef]
  42. Kaveh, M.; Mesgari, M.S. Application of Meta-Heuristic Algorithms for Training Neural Networks and Deep Learning Architectures: A Comprehensive Review. Neural Process. Lett. 2023, 55, 4519–4622. [Google Scholar] [CrossRef] [PubMed]
  43. Xu, A.; Chang, H.; Xu, Y.; Li, R.; Li, X.; Zhao, Y. Applying artificial neural networks (ANNs) to solve solid waste-related issues: A critical review. Waste Manag. 2021, 124, 385–402. [Google Scholar] [CrossRef]
  44. Lollia, F.; Gamberini, R.; Regattieri, A.; Balugani, E.; Gatos, T.; Gucci, S. Single-hidden layer neural networks for forecasting intermittent demand. Int. J. Prod. Econ. 2017, 183, 116–128. [Google Scholar] [CrossRef]
  45. Luo, Y.; Hu, J.; Zhang, G.; Zhang, P.; Xie, Y.; Kuang, Z.; Zeng, X.; Li, S. A dissolved oxygen levels prediction method based on single-hidden layer feedforward neural network using neighborhood information metric. Appl. Soft Comput. 2024, 167, 112328. [Google Scholar] [CrossRef]
  46. Gribanova, E. Development of iterative algorithms for solving the inverse problem using inverse calculations. East.-Eur. J. Enterp. Technol. 2020, 4, 27–34. [Google Scholar] [CrossRef]
  47. Gribanova, E.; Savitsky, A. Algorithm for Estimating the Time of Posting Messages on Vkontakte Online Social Network. Int. J. Inf. Technol. Secur. 2020, 12, 3–14. [Google Scholar]
  48. Gribanova, E. Elaboration of an Algorithm for Solving Hierarchical Inverse Problems in Applied Economics. Mathematics 2022, 10, 2779. [Google Scholar] [CrossRef]
  49. Pelster, M.; Val, J. Can ChatGPT assist in picking stocks? Financ. Res. Lett. 2023, 59, 104786. [Google Scholar] [CrossRef]
  50. Ko, H.; Lee, J. Can ChatGPT improve investment decisions? From a portfolio management perspective. Financ. Res. Lett. 2024, 64, 105433. [Google Scholar] [CrossRef]
Figure 1. Neural network model (h represents the output of the hidden layer of the neural network and z is the final neuron output).
Figure 4. Main blocks of the algorithm, including the error function reduction, weight coefficient selection, and equation solving. The cycle repeats until no active weight coefficients remain.
Figure 5. Comparative performance analysis of optimization algorithms for neural networks with linear activation function and single hidden layer architecture (two neurons): (a) Box plot distribution of absolute deviations between predicted and actual values across four optimization methods: SGD, Adam, RC, and DC. Hereinafter, the plus sign indicates the mean value, the line the median, the rectangle the 25–75% quartile range, and the whiskers either the minimum and maximum values or 1.5 times the interquartile range, with points beyond shown as outliers. (b) Time series comparison of actual market values (solid black line) versus predicted trajectories from all four optimization algorithms over 160 observation periods.
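A box plot in the style described above (mean marker, median line, 25–75% box, 1.5 IQR whiskers) can be produced with matplotlib roughly as follows; the price and prediction arrays here are synthetic stand-ins, not the paper's data.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
actual = np.cumsum(rng.normal(size=160))            # stand-in for the real closing-price series
predictions = {                                     # stand-ins for the four models' outputs
    "SGD": actual + rng.normal(0.0, 0.8, 160),
    "Adam": actual + rng.normal(0.0, 0.5, 160),
    "RC": actual + rng.normal(0.0, 0.3, 160),
    "DC": actual + rng.normal(0.0, 0.4, 160),
}
deviations = [np.abs(actual - p) for p in predictions.values()]

fig, ax = plt.subplots()
ax.boxplot(deviations, showmeans=True, whis=1.5)    # mean marker, median line, 1.5 IQR whiskers
ax.set_xticklabels(list(predictions))
ax.set_ylabel("Absolute deviation")
plt.show()
```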
Figure 6. Comparative performance analysis of optimization algorithms for neural networks with mish activation function and single hidden layer architecture (two neurons): (a) Box plot distribution of absolute deviations between predicted and actual values across four optimization methods. (b) Time series comparison of actual market values (solid black line) versus predicted trajectories from all four optimization algorithms over 160 observation periods.
Figure 7. Heatmap of MSE values for different neural network architectures and training methods (activation function and training algorithm).
Figure 8. Actual and predicted stock prices obtained using the Claude 4 Sonnet Thinking language model over 160 observation periods.
Figure 9. Dynamics of normalized closing prices for the stocks of 40 Russian companies.
Figure 10. Frequency distribution of optimal activation functions and training algorithms achieving the lowest MSE values across 40 stocks. A window size of 30 (a) demonstrates a strong preference for linear activation (33 stocks) and the DC training method, whereas a window size of 5 (b) shows a similar activation preference but a more balanced distribution of training methods, suggesting that shorter temporal windows benefit from diverse optimization approaches.
Figure 11. Frequency distribution of optimal activation functions achieving the lowest MSE across 40 stocks for different hidden layer sizes: (a) 2 neurons, (b) 8 neurons, and (c) 16 neurons. Linear activation consistently outperforms all alternatives, delivering optimal results for 36–38 stocks regardless of network architecture complexity.
Figure 12. MSE performance comparison across neural network architectures for 40 individual stocks. The two-neuron hidden layer configuration consistently achieves the lowest prediction errors for the majority of stocks, demonstrating superior forecasting accuracy compared with the other architectures considered.
Figure 13. Heatmaps showing mean MSE (a) and MAE (b) values for neural networks with two hidden neurons across different activation functions and learning algorithms.
Table 1. Experimental results with a single-layer neural network for Gazprom PJSC stock (the best results are highlighted in green).

| Activation Function | MSE (Adam) | MSE (SGD) | MSE (RC) | MSE (DC) | MAE (Adam) | MAE (SGD) | MAE (RC) | MAE (DC) |
|---|---|---|---|---|---|---|---|---|
| linear | 0.0074 | 0.0106 | 0.005 | 0.0034 | 0.0671 | 0.0814 | 0.0552 | 0.0452 |
| snake (0.5) | 0.0075 | 0.0095 | 0.0074 | 0.0065 | 0.0667 | 0.0763 | 0.0687 | 0.0656 |
| arctg | 0.0087 | 0.0158 | 0.0046 | 0.005 | 0.0749 | 0.1049 | 0.0532 | 0.0543 |
| sincos | 0.05 | 0.05 | 0.1278 | 0.2632 | 0.1622 | 0.1622 | 0.3304 | 0.4737 |
| invers | 0.008 | 0.0143 | 0.0047 | 0.0042 | 0.0712 | 0.0984 | 0.0533 | 0.0508 |
| softs | 0.0257 | 0.0323 | 0.0336 | 0.0331 | 0.1171 | 0.1192 | 0.1351 | 0.1355 |
| tanh | 0.0088 | 0.0104 | 0.0056 | 0.0057 | 0.0743 | 0.1106 | 0.0596 | 0.0597 |
| softplus | 0.0646 | 0.0347 | 0.0084 | 0.0089 | 0.2416 | 0.1700 | 0.0778 | 0.0769 |
| mish | 0.0066 | 0.0098 | 0.0079 | 0.0057 | 0.0695 | 0.0734 | 0.0726 | 0.0586 |
| cloglogm | 0.0067 | 0.008 | 0.0035 | 0.0029 | 0.064 | 0.064 | 0.0462 | 0.041 |
| cloglog | 0.0138 | 0.011 | 0.0075 | 0.0054 | 0.0976 | 0.0804 | 0.0728 | 0.0587 |
| logsigm | 0.0047 | 0.0603 | 0.0037 | 0.0033 | 0.0519 | 0.225 | 0.0488 | 0.0444 |
| rootsig | 0.0133 | 0.0138 | 0.0117 | 0.0071 | 0.0933 | 0.0965 | 0.0862 | 0.0649 |
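For reference, the MSE and MAE values reported in Tables 1–5 are presumed to follow the standard definitions over the n test observations, with y_i the actual and ŷ_i the predicted normalized closing price: MSE = (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)² and MAE = (1/n) Σ_{i=1}^{n} |y_i − ŷ_i|.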
Table 2. Modeling results with different values of the learning hyperparameter (the best results are highlighted in green).

| α | Test MSE (Adam) | Test MSE (SGD) | Test MSE (RC) | Test MSE (DC) | Test MAE (Adam) | Test MAE (SGD) | Test MAE (RC) | Test MAE (DC) |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.0101 | 1.19 × 10^12 | 0.0050 | 0.0034 | 0.0841 | 1,016,380 | 0.0552 | 0.0452 |
| 0.1 | 0.0074 | 5.21 × 10^10 | – | – | 0.0671 | 213,836.4 | – | – |
| 0.01 | 0.0099 | 9.1 × 10^11 | – | – | 0.0782 | 889,434.8 | – | – |
| 0.001 | 0.0121 | 2.68 × 10^10 | – | – | 0.0869 | 151,634.9 | – | – |
| 0.0001 | 0.2020 | 0.0106 | – | – | 0.3911 | 0.0814 | – | – |
| 0.00001 | 0.2940 | 0.0126 | – | – | 0.4798 | 0.0888 | – | – |
Table 3. Experimental results with a single hidden layer neural network with two neurons for Gazprom PJSC stock (the best results are highlighted in green).

| Activation Function | MSE (Adam) | MSE (SGD) | MSE (RC) | MSE (DC) | MAE (Adam) | MAE (SGD) | MAE (RC) | MAE (DC) |
|---|---|---|---|---|---|---|---|---|
| linear | 0.0099 | 0.0274 | 0.0020 | 0.0419 | 0.0862 | 0.1425 | 0.0323 | 0.1684 |
| snake (0.5) | 0.0078 | 0.0814 | 0.0474 | 0.037 | 0.0668 | 0.2639 | 0.1861 | 0.156 |
| arctg | 0.0431 | 0.0161 | 0.0079 | 0.0554 | 0.1788 | 0.1046 | 0.0558 | 0.2052 |
| sincos | 0.0273 | 0.0165 | 0.0023 | 0.1082 | 0.1325 | 0.0947 | 0.0351 | 0.3127 |
| sigmoid | 0.0902 | 0.1627 | 0.1699 | 0.137 | 0.2191 | 0.3769 | 0.1636 | 0.3503 |
| invers | 0.0431 | 0.0152 | 0.0034 | 0.0532 | 0.1792 | 0.1003 | 0.0395 | 0.1975 |
| softs | 0.0407 | 0.0486 | 0.0051 | 0.039 | 0.1569 | 0.1965 | 0.0519 | 0.1624 |
| tanh | 0.0477 | 0.0167 | 0.0060 | 0.0582 | 0.1897 | 0.1082 | 0.0463 | 0.2117 |
| mish | 0.0582 | 0.1308 | 0.0026 | 0.1096 | 0.1095 | 0.3373 | 0.0390 | 0.3042 |
| cloglogm | 0.0369 | 0.0245 | 0.0146 | 0.037 | 0.1544 | 0.1326 | 0.0574 | 0.1451 |
| logsigm | 0.0902 | 0.1627 | 0.1699 | 0.1369 | 0.2191 | 0.3769 | 0.1636 | 0.3503 |
| sinc | 0.1026 | 0.1708 | 0.0033 | 0.0871 | 0.2406 | 0.3862 | 0.0439 | 0.2777 |
| wave | 0.1105 | 0.0754 | 0.0363 | 0.1014 | 0.2668 | 0.2352 | 0.1529 | 0.3022 |
Table 4. Experimental results with a single hidden layer neural network with eight neurons for Gazprom PJSC stock (the best results are highlighted in green).

| Activation Function | MSE (Adam) | MSE (SGD) | MSE (RC) | MSE (DC) | MAE (Adam) | MAE (SGD) | MAE (RC) | MAE (DC) |
|---|---|---|---|---|---|---|---|---|
| linear | 0.0082 | 0.0102 | 0.0022 | 0.0356 | 0.0774 | 0.0755 | 0.0343 | 0.1502 |
| snake (0.5) | 0.0077 | 0.0274 | 0.0028 | 0.0216 | 0.0707 | 0.1322 | 0.0409 | 0.115 |
| arctg | 0.0092 | 0.0099 | 0.0022 | 0.0364 | 0.0750 | 0.0749 | 0.0353 | 0.1562 |
| sincos | 0.0497 | 0.0430 | 0.0035 | 0.0267 | 0.1758 | 0.1800 | 0.0467 | 0.1273 |
| sigmoid | 0.0490 | 0.0535 | 0.0383 | 0.157 | 0.1557 | 0.2094 | 0.0826 | 0.3706 |
| invers | 0.0124 | 0.0099 | 0.0104 | 0.0385 | 0.0910 | 0.0743 | 0.0619 | 0.1544 |
| softs | 0.0317 | 0.0140 | 0.0044 | 0.1067 | 0.1290 | 0.0995 | 0.0519 | 0.301 |
| tanh | 0.0148 | 0.0098 | 0.0090 | 0.0385 | 0.0936 | 0.0745 | 0.0747 | 0.1593 |
| mish | 0.0113 | 0.0139 | 0.0199 | 0.0751 | 0.0794 | 0.0885 | 0.0932 | 0.2486 |
| cloglogm | 0.0144 | 0.0125 | 0.0031 | 0.0675 | 0.0964 | 0.0857 | 0.0430 | 0.2329 |
| logsigm | 0.0490 | 0.0535 | 0.0420 | 0.157 | 0.1557 | 0.2094 | 0.0848 | 0.3706 |
| sinc | 0.0168 | 0.0343 | 0.0036 | 0.1689 | 0.0959 | 0.1381 | 0.0417 | 0.3839 |
| wave | 0.0111 | 0.0108 | 0.0046 | 0.1882 | 0.0770 | 0.0752 | 0.0483 | 0.408 |
Table 5. Experimental results with a single hidden layer neural network with sixteen neurons for Gazprom PJSC stock (the best results are highlighted in green).

| Activation Function | MSE (Adam) | MSE (SGD) | MSE (RC) | MSE (DC) | MAE (Adam) | MAE (SGD) | MAE (RC) | MAE (DC) |
|---|---|---|---|---|---|---|---|---|
| linear | 0.0118 | 0.0100 | 0.0025 | 0.0739 | 0.0789 | 0.0819 | 0.0361 | 0.2301 |
| snake (0.5) | 0.0070 | 0.0132 | 0.0083 | 0.0296 | 0.0614 | 0.0941 | 0.0670 | 0.1447 |
| arctg | 0.0112 | 0.0125 | 0.0197 | 0.0238 | 0.0889 | 0.0879 | 0.1015 | 0.1124 |
| sincos | 0.0263 | 0.0625 | 0.0384 | 0.2386 | 0.1292 | 0.2267 | 0.1602 | 0.4606 |
| sigmoid | 0.0149 | 0.0373 | 0.0552 | 0.1485 | 0.1041 | 0.1669 | 0.1360 | 0.3612 |
| invers | 0.0106 | 0.0117 | 0.0081 | 0.1818 | 0.0846 | 0.0844 | 0.0563 | 0.3895 |
| softs | 0.0155 | 0.0111 | 0.0071 | 0.012 | 0.1072 | 0.0820 | 0.0669 | 0.09 |
| tanh | 0.0083 | 0.0138 | 0.0140 | 0.0265 | 0.0776 | 0.0883 | 0.0976 | 0.1349 |
| mish | 0.0107 | 0.0128 | 0.0164 | 0.0195 | 0.0734 | 0.0831 | 0.1080 | 0.1063 |
| cloglogm | 0.0109 | 0.0115 | 0.0161 | 0.0188 | 0.0866 | 0.0789 | 0.1015 | 0.1021 |
| logsigm | 0.0149 | 0.0373 | 0.0414 | 0.1485 | 0.1041 | 0.1669 | 0.1230 | 0.3612 |
| sinc | 0.0222 | 0.1674 | 0.0056 | 0.184 | 0.1273 | 0.3830 | 0.0552 | 0.4 |
| wave | 0.0177 | 0.0195 | 2,334,324 | 0.4522 | 0.0993 | 0.1002 | 202.1322 | 0.6271 |
Table 6. Fragment of the table with source data for forecasting.

| Price t−1 | Price t−2 | Price t−3 | Price t−4 | Price t−5 | Price t−6 | Price t−7 | Price t−8 | … | Price t−30 |
|---|---|---|---|---|---|---|---|---|---|
| 158.01 | 158.68 | 159.19 | 158.18 | 159.87 | 160.14 | 160.87 | 158.1 | … | 163.52 |
| 156.52 | 158.01 | 158.68 | 159.19 | 158.18 | 159.87 | 160.14 | 160.87 | … | 162.51 |
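Rows like those in Table 6 are a lagged (sliding-window) view of the closing-price series. A minimal pandas sketch of how such a table can be assembled is shown below; the function name, column labels, and the demo series are illustrative and not taken from the paper.

```python
import numpy as np
import pandas as pd

def make_lag_table(prices, window=30):
    """Build rows of lagged prices (Price_t-1 ... Price_t-window) for each forecast date,
    mirroring the layout of Table 6."""
    prices = pd.Series(prices, name="price")
    cols = {f"Price_t-{k}": prices.shift(k) for k in range(1, window + 1)}
    table = pd.DataFrame(cols).dropna()            # drop rows with an incomplete history
    table["target"] = prices.loc[table.index]      # price to be predicted at time t
    return table

# Stand-in series; in the paper the inputs are normalized closing prices of 40 Russian stocks.
demo = make_lag_table(np.linspace(150.0, 165.0, 200), window=30)
print(demo.head())
```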
Table 7. Modeling results for a single-layer neural network. Each cell gives the number of stocks (out of 40) for which the corresponding training algorithm achieved the lowest MSE with the given activation function and period window.

| Activation Function | Adam (window = 5) | SGD (window = 5) | RC (window = 5) | DC (window = 5) | Adam (window = 30) | SGD (window = 30) | RC (window = 30) | DC (window = 30) |
|---|---|---|---|---|---|---|---|---|
| arctg | 18 | 10 | 2 | 10 | 8 | 4 | 9 | 19 |
| cloglog | 7 | 11 | 13 | 9 | 7 | 16 | 7 | 10 |
| cloglogm | 19 | 3 | 9 | 9 | 1 | 1 | 17 | 21 |
| invers | 21 | 7 | 4 | 8 | 6 | 2 | 12 | 20 |
| linear | 16 | 0 | 10 | 14 | 2 | 0 | 5 | 33 |
| logsigm | 9 | 8 | 15 | 8 | 3 | 6 | 16 | 15 |
| mish | 21 | 6 | 6 | 7 | 3 | 4 | 12 | 21 |
| rootsig | 13 | 15 | 8 | 4 | 10 | 6 | 13 | 11 |
| sincos | 0 | 1 | 14 | 25 | 1 | 7 | 17 | 15 |
| snake | 21 | 4 | 5 | 10 | 7 | 2 | 6 | 25 |
| softplus | 11 | 21 | 3 | 5 | 8 | 17 | 6 | 9 |
| softs | 12 | 18 | 6 | 4 | 26 | 3 | 5 | 6 |
| tang | 17 | 10 | 9 | 4 | 10 | 5 | 8 | 17 |
| Total | 185 | 114 | 104 | 117 | 92 | 73 | 133 | 222 |
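Counts of this kind can be produced by picking, for every stock and activation function, the training algorithm with the lowest test MSE and tallying the winners. The sketch below assumes a hypothetical long-format results table with columns stock, activation, algorithm, and mse; it illustrates the bookkeeping only and is not the authors' code.

```python
import pandas as pd

def tally_best_algorithm(results: pd.DataFrame) -> pd.DataFrame:
    """Count, per activation function, how many stocks each training algorithm wins
    (lowest test MSE). Assumed layout: one row per stock/activation/algorithm with
    its test MSE, in columns named 'stock', 'activation', 'algorithm', and 'mse'."""
    # Row with the lowest MSE for every (stock, activation) pair.
    best = results.loc[results.groupby(["stock", "activation"])["mse"].idxmin()]
    # Winner counts per activation function and algorithm, plus a total row.
    counts = best.pivot_table(index="activation", columns="algorithm",
                              values="stock", aggfunc="count", fill_value=0)
    counts.loc["Total"] = counts.sum()
    return counts
```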
Table 8. Modeling results for a neural network with one hidden layer. Each cell gives the number of stocks (out of 40) for which the corresponding training algorithm achieved the lowest MSE with the given activation function; columns are grouped by the number of neurons in the hidden layer (2, 8, and 16).

| Activation Function | Adam (2) | SGD (2) | RC50 (2) | RC80 (2) | Adam (8) | SGD (8) | RC50 (8) | RC80 (8) | Adam (16) | SGD (16) | RC50 (16) | RC80 (16) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| arctg | 2 | 1 | 35 | 2 | 10 | 4 | 25 | 1 | 14 | 20 | 2 | 4 |
| cloglogm | 3 | 2 | 32 | 3 | 6 | 5 | 27 | 2 | 7 | 25 | 2 | 6 |
| invers | 1 | 4 | 34 | 1 | 3 | 3 | 33 | 1 | 8 | 21 | 4 | 7 |
| linear | 0 | 0 | 40 | 0 | 0 | 0 | 40 | 0 | 1 | 0 | 39 | 0 |
| logsigm | 18 | 5 | 10 | 7 | 5 | 9 | 24 | 2 | 25 | 2 | 13 | 0 |
| mish | 5 | 0 | 33 | 2 | 33 | 3 | 4 | 0 | 26 | 3 | 11 | 0 |
| sigmoid | 17 | 5 | 10 | 8 | 5 | 7 | 25 | 3 | 23 | 2 | 15 | 0 |
| sinc | 4 | 6 | 30 | 0 | 9 | 0 | 31 | 0 | 16 | 0 | 24 | 0 |
| sincos | 2 | 4 | 34 | 0 | 3 | 9 | 21 | 7 | 9 | 28 | 3 | 0 |
| snake | 5 | 0 | 32 | 3 | 8 | 5 | 27 | 0 | 22 | 9 | 8 | 1 |
| softs | 11 | 5 | 15 | 9 | 3 | 0 | 37 | 0 | 15 | 3 | 11 | 11 |
| tang | 3 | 6 | 24 | 7 | 10 | 12 | 17 | 1 | 9 | 27 | 0 | 4 |
| wave | 8 | 17 | 15 | 0 | 18 | 2 | 19 | 1 | 17 | 5 | 16 | 2 |
| Total | 79 | 55 | 344 | 42 | 113 | 59 | 330 | 18 | 192 | 145 | 148 | 35 |
