Review

Artificial Intelligence in Finance: From Market Prediction to Macroeconomic and Firm-Level Forecasting

by
Flavius Gheorghe Popa
* and
Vlad Muresan
Automation and Computer Science, Technical University of Cluj-Napoca, 400027 Cluj-Napoca, Romania
*
Author to whom correspondence should be addressed.
AI 2025, 6(11), 295; https://doi.org/10.3390/ai6110295
Submission received: 31 August 2025 / Revised: 28 October 2025 / Accepted: 30 October 2025 / Published: 17 November 2025
(This article belongs to the Special Issue AI in Finance: Leveraging AI to Transform Financial Services)

Abstract

This review surveys how contemporary machine learning is reshaping financial and economic forecasting across markets, macroeconomics, and corporate planning. We synthesize evidence on model families, such as regularized linear methods, tree ensembles, and deep neural architectures, and explain their optimization (with gradient-based training) and design choices (activation and loss functions). Across tasks, Random Forest and gradient-boosted trees emerge as robust baselines, offering strong out-of-sample accuracy and interpretable variable importance. For sequential signals, recurrent models, especially LSTM ensembles, consistently improve directional classification and volatility-aware predictions, while transformer-style attention is a promising direction for longer contexts. Practical performance hinges on aligning losses with business objectives (for example, cross-entropy vs. RMSE/MAE), handling class imbalance, and avoiding data leakage through rigorous cross-validation. In high-dimensional settings, regularization (such as ridge/lasso/elastic-net) stabilizes estimation and enhances generalization. We compile task-specific feature sets for macro indicators, market microstructure, and firm-level data, and distill implementation guidance covering hyperparameter search, evaluation metrics, and reproducibility. We conclude with open challenges (accuracy–interpretability trade-off, limited causal insight) and outline a research agenda combining econometrics with representation learning and data-centric evaluation.

1. Introduction

Financial and economic forecasting underpins decision-making in markets, macroeconomic policy, and corporate planning. The underlying data are typically high-dimensional, noisy, and non-stationary; these characteristics constrain classical econometric models and motivate the growing use of machine learning (ML) methods that learn flexible patterns from heterogeneous signals [1,2,3]. Across the literature, tree-based ensembles (e.g., Random Forests, gradient boosting) and deep sequence models (e.g., LSTMs) frequently report improvements over traditional baselines in asset-return prediction, macro nowcasting, and firm-level revenue forecasting [3,4,5,6,7,8].
Figure 1 summarizes the forecasting pipeline we reference throughout the paper (data preparation, feature engineering, model training/validation, and evaluation/deployment), which is expanded in Section 2.4.
This review synthesizes how modern ML contributes to forecasting in three application families: (i) financial markets (directional movement and volatility-aware predictions), (ii) macroeconomic indicators (e.g., GDP, inflation, unemployment) and (iii) company-specific planning (e.g., revenues).
We compare model classes, such as regularized linear models, tree/boosting ensembles, and neural architectures (LSTM/transformer), and connect design choices (feature representation, activation and loss functions, regularization) to business objectives and evaluation metrics. We highlight Random Forest and gradient boosting as robust, transparent baselines and LSTM-style models for sequential signals, while noting the promise of attention mechanisms [9,10,11].
We conduct a narrative review of recent peer-reviewed studies and influential preprints across the three domains above, emphasizing works with out-of-sample evaluation and explicit comparisons against classical approaches. Where relevant, we include methodological sources on activations, optimization, and loss design when these choices materially affect forecasting performance.
This review does not follow a systematic protocol of literature selection (e.g., PRISMA) but instead adopts a narrative and exploratory approach. The sources considered were identified through broad searches on Google Scholar and related platforms, complemented by insights gained while completing machine learning courses (Stanford’s Coursera sequence, Udemy tutorials) and hands-on experimentation in Python 3.10 with regression models, cross-validation, and training–validation–test splits. The focus on finance stems from the observation that AI offers unique potential to improve prediction in sales, pricing, and macroeconomic variables, where understanding which features carry the greatest weight is critical. To provide readers with a clear foundation, the review begins with basic predictive methods (linear and logistic regression, cost functions, activation functions) before moving to advanced architectures and applications. A narrative review is appropriate in this context, given the heterogeneity of methods and the need to integrate both fundamental concepts and applied studies in financial forecasting.
In the following, we highlight four main findings. First, we indicate which model family suits which task, showing where each tends to excel (RF/GBM as strong starting points across regimes; LSTM ensembles for high-frequency classification; transformers for long-range dependencies). Second, we provide a design playbook that links data representation (price vs. returns), loss/metric selection (cross-entropy vs. RMSE/MAE; robust losses for heavy tails), and regularization (lasso/elastic-net; nonconvex penalties) to out-of-sample performance and stability. Third, we distill implementation practices to avoid data leakage, handle class imbalance, and structure hyperparameter search and cross-validation. Fourth, we discuss open challenges: non-stationarity and regime shifts, the accuracy–interpretability trade-off, and computational scaling for attention.
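As a concrete illustration of the leakage guards in the third point, a time-respecting split trains only on the past and tests only on the future. A minimal sketch with scikit-learn's TimeSeriesSplit on a synthetic series (the data and split count are illustrative):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))   # 100 periods, 5 predictors (synthetic)
y = rng.normal(size=100)        # target, e.g., next-period return

tscv = TimeSeriesSplit(n_splits=4)
for train_idx, test_idx in tscv.split(X):
    # every training index precedes every test index: no look-ahead
    assert train_idx.max() < test_idx.min()
```

A plain shuffled K-fold would mix future observations into the training folds, which is exactly the leakage the review warns against.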
Recent sequence model advances: In addition to RNNs/LSTMs and transformers, very recent work explores selective state-space models (Mamba) for time-series forecasting [12,13]. Mamba variants use bidirectional encoders and channel-aware tokenization to capture long-range dependencies with linear-time complexity, and several studies report competitive or state-of-the-art accuracy on multivariate benchmarks [14]. Examples include Bi-Mamba+ (bidirectional Mamba with a forget-gate and a series-relation decider for channel-independent vs. channel-mixing tokenization) [15], CMMamba (channel-mixing Mamba), and a channel-independent, bidirectional, gated Mamba with an interactive recurrent mechanism (CIBG-Mamba-IRM) [16]. Collectively, these point to an emerging alternative to transformers for long-horizon and multivariate settings.

Methods of Literature Search

We searched Scopus, Web of Science, SSRN, and arXiv (2010–2025) using combinations of: “financial forecasting,” “macroeconomic nowcasting,” “revenue/demand forecasting,” “random forest/gradient boosting/elastic net,” “LSTM/transformer,” “leakage,” and “time-series cross-validation.” Inclusion criteria: peer-reviewed articles or influential preprints with empirical evaluation on financial, macroeconomic, or firm-level forecasting; explicit out-of-sample or time-respectful evaluation; and comparison to classical baselines. Exclusion criteria: purely theoretical pieces lacking empirical validation; studies prior to 2010; and works focused exclusively on reinforcement learning, multi-agent trading, or portfolio optimization (briefly acknowledged elsewhere). Searches were complemented by citation snowballing from seed surveys. For each study we extracted dataset/domain, forecast horizon, model class, loss/metric, main result, and leakage guards.

2. Model Families and Learning Components

2.1. Regularized & Baseline Linear Models

Logistic regression is one of the most widely used classification algorithms in machine learning and statistics. Unlike linear regression, which predicts continuous outcomes, logistic regression models binary outcomes by predicting, for each instance, a probability between 0 and 1 that it belongs to the positive class.
The core of logistic regression lies in the logistic function (also known as the sigmoid function), which maps any real-valued input to a value between 0 and 1:
σ(z) = 1 / (1 + e^(−z)),   z = wᵀx + b,   y ∈ {0, 1}
where z represents the linear combination of input features “x” with weights “w” and bias term “b”.
For a binary classification problem, the logistic regression model predicts the probability that an instance belongs to class 1: [3,9]
P(y = 1 | x; w, b) = 1 / (1 + e^(−(wᵀx + b)))
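The model above can be sketched numerically; the weights `w`, bias `b`, and input `x` here are illustrative placeholders, not fitted values:

```python
import numpy as np

def sigmoid(z):
    # maps any real z to a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.8, -0.5])   # hypothetical weights
b = 0.1                     # hypothetical bias
x = np.array([1.2, 0.4])    # one feature vector

z = w @ x + b               # linear combination z = w^T x + b
p = sigmoid(z)              # P(y = 1 | x; w, b)
```

For z = 0.86 this gives a probability of roughly 0.70, i.e., the instance would be classified as class 1 under a 0.5 threshold.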
When working with high-dimensional datasets, the scientific literature provides substantial evidence that regularized regression methods (Lasso, Ridge, Elastic Net) outperform traditional linear models in macroeconomic forecasting [4,7].
A study provides a direct comparison between consensus forecasts and Elastic Net, an AR model, and a random walk for predicting the US unemployment rate. Measured through Mean Absolute Error (MAE), Elastic Net consistently demonstrates superior accuracy across all forecasting horizons. Compared to the Blue Chip consensus, Elastic Net’s superiority extends to identifying turning points in the business cycle, predicting these turning points significantly earlier. For horizons of 12 months or less, the accuracy improvement is statistically significant. The study employs a rolling forecast framework and considers horizons of up to two years. Elastic Net leverages the FRED-MD dataset, which is a high-dimensional set of 138 macroeconomic variables, highlighting its capacity to manage complex relationships and high dimensionality [3,4].
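The rolling-forecast setup described above can be sketched as follows. The data here are a synthetic stand-in (not FRED-MD), and `alpha`/`l1_ratio` are illustrative hyperparameter choices, not the study's:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(1)
T, p = 120, 50                                   # 120 periods, 50 predictors
X = rng.normal(size=(T, p))
beta = np.zeros(p)
beta[:3] = [0.5, -0.3, 0.2]                      # sparse "true" coefficients
y = X @ beta + 0.1 * rng.normal(size=T)

window, errors = 80, []
for t in range(window, T):
    model = ElasticNet(alpha=0.05, l1_ratio=0.5)
    model.fit(X[t - window:t], y[t - window:t])  # fit on the past window only
    errors.append(abs(model.predict(X[t:t + 1])[0] - y[t]))

mae = float(np.mean(errors))                     # evaluation metric: MAE
```

Refitting inside the loop mimics the rolling framework, and evaluating each one-step-ahead prediction before the model ever sees that observation mirrors the study's out-of-sample MAE comparison.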
A complementary approach is Complete Subset Regression, which averages forecasts from all linear models built on subsets of a fixed size k drawn from a larger predictor set. By choosing k to balance bias and variance, CSR has been shown to improve predictive accuracy in high-dimensional settings such as stock return forecasting, outperforming simple equal-weight combination and other shrinkage methods [17].

2.2. Trees and Ensembles

A decision tree is a predictive model that recursively splits data according to information gain, which is measured using entropy:
Information Gain = H(p₁^node) − (w_left · H(p₁^left) + w_right · H(p₁^right))
where entropy is defined as
H(p) = −p log₂(p) − (1 − p) log₂(1 − p)
Entropy reflects the uncertainty of a distribution: it is maximal at p = 0.5 (complete unpredictability), and minimal at p = 0 or p = 1 (full certainty).
Thus, decision trees split data at points where the reduction in entropy (information gain) is maximized.
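The split criterion above can be checked numerically; the parent/child class counts are made-up examples:

```python
import numpy as np

def entropy(p1):
    # binary entropy H(p) in bits; 0 by convention at p = 0 or p = 1
    if p1 in (0.0, 1.0):
        return 0.0
    return -p1 * np.log2(p1) - (1 - p1) * np.log2(1 - p1)

# Parent node: 10 positives, 10 negatives. Candidate split:
# left child (8 pos, 2 neg), right child (2 pos, 8 neg).
h_parent = entropy(0.5)            # maximal uncertainty: 1 bit
w_left, w_right = 10 / 20, 10 / 20 # fraction of samples in each child
gain = h_parent - (w_left * entropy(0.8) + w_right * entropy(0.2))
```

Here the gain is about 0.28 bits: the split is chosen because it moves both children away from the p = 0.5 point of maximal entropy.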
Random Forest (RF), introduced by Breiman (2001), has become one of the most widely applied machine learning algorithms in economics and finance. It is particularly valued for its ability to manage high-dimensional data, capture nonlinear relationships, and provide robust out-of-sample predictive performance. Across the surveyed literature, Random Forest is consistently positioned as a benchmark method for forecasting financial markets, macroeconomic indicators, and corporate financial planning [3,7].
A Random Forest is an ensemble of decision trees, where each tree is constructed on a bootstrapped sample of the data. At every split, a random subset of predictors is considered, reducing correlation between trees and mitigating overfitting [7]. The final prediction is obtained by averaging the predictions of all trees (for regression tasks) or by majority vote (for classification tasks).
Two core principles define RF:
1. Bootstrap Aggregation (Bagging): Trees are grown on resampled datasets, increasing model diversity.
2. Random Feature Selection: At each split, only a subset of predictors is considered, reducing variance and improving generalization.
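Both principles map directly onto scikit-learn hyperparameters; a minimal sketch on synthetic data, where `bootstrap` and `max_features` are the relevant knobs:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# synthetic placeholder data: 300 samples, 20 features
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

rf = RandomForestClassifier(
    n_estimators=200,      # many trees: averaging reduces variance
    bootstrap=True,        # principle 1: bagging on resampled datasets
    max_features="sqrt",   # principle 2: random predictor subset per split
    random_state=0,
)
rf.fit(X, y)
importances = rf.feature_importances_  # internal variable-importance measure
```

The `feature_importances_` attribute is the interpretability hook discussed below: it ranks predictors by their contribution to impurity reduction across the forest.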
Breiman’s (2001) seminal work demonstrates that RF achieves convergence of generalization error as the number of trees grows, without overfitting, provided correlation among trees remains low and their individual predictive strength is sufficient [7].
Applications in stock market prediction show that RF is particularly effective when combined with price-based indicators. For instance, Patel et al. (2015) found that RF outperformed alternative classifiers, including SVMs and neural networks, in forecasting directional movement in the Indian Stock market [18,19]. Similarly, RF has been employed alongside LSTM and hybrid models to forecast asset prices, demonstrating competitive accuracy and robustness in volatile environments [20].
RF has been used to forecast GDP growth, inflation, and other macroeconomic variables. Coulombe (2020) develops a Macroeconomic Random Forest that delivers significant forecasting gains for U.S. unemployment and inflation while remaining interpretable through generalized time-varying parameters [8]. Yoon (2021) evaluates Random Forest and Gradient Boosting models for Japan’s real GDP growth using quarterly macroeconomic data from 1981–2018; forecasts are generated in a pseudo-real-time expanding-window setup that avoids look-ahead bias, and accuracy assessed via RMSE and MAPE shows both ML models outperforming institutional benchmarks from the IMF and the Bank of Japan [10]. Likewise, Paruchuri (2021) applies RF to Italian GDP forecasting, emphasizing its strength in capturing nonlinear dynamics absent in traditional econometric approaches [7,21].
In the corporate domain, Microsoft researchers reported that RF outperformed traditional econometric methods such as ARIMA for quarterly revenue forecasting [22]. Its ability to incorporate external and high-dimensional data sources makes it a preferred model for financial planning and analysis tasks requiring rapid updates.
The following are the advantages of Random Forest:
  • Robustness to Overfitting: Averaging over many trees reduces variance while maintaining predictive power [7].
  • Nonlinear Modeling Capability: RF captures complex, nonlinear relationships without strong parametric assumptions [3].
  • Variable Importance Measures: RF provides internal estimates of feature relevance, enhancing interpretability in economic and financial settings.
  • Strong Out-of-Sample Accuracy: Demonstrated across diverse applications, from macroeconomic forecasting to corporate planning [5,8,10].
Despite its strengths, the literature also identifies key limitations:
  • Black-Box Nature: Interpretability remains limited compared to linear econometric models [6].
  • Bias in Extrapolations: RF performs poorly outside the range of training data, making it less suitable for structural counterfactual analysis.
  • Class Imbalance: In financial forecasting tasks with skewed datasets, RF can be biased toward majority classes, requiring resampling or weighting strategies [23].
  • Computational Demands: Large forests can be resource-intensive, though parallelization mitigates this problem.
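As one guard against the class-imbalance limitation above, scikit-learn exposes a `class_weight="balanced"` option that reweights classes inversely to their frequency; a minimal sketch on synthetic 90/10 data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# synthetic imbalanced binary data: roughly 90% class 0, 10% class 1
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

rf = RandomForestClassifier(
    n_estimators=100,
    class_weight="balanced",  # upweight the minority class in each split
    random_state=0,
)
rf.fit(X, y)
```

Resampling strategies (over-/under-sampling) are the alternative mentioned in the text; reweighting has the advantage of leaving the data untouched.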
The Random Forest algorithm represents a methodological bridge between traditional econometrics and modern machine learning. Its reliance on ensembles of decision trees makes it robust, flexible, and well-suited for predictive tasks where nonlinearities, noise, and high-dimensional inputs prevail. In financial markets, RF delivers competitive performance against advanced models such as LSTMs, often serving as a strong baseline [12]. In macroeconomics, it provides timely and accurate forecasts of GDP growth, outperforming institutional benchmarks in some contexts. In corporate finance, RF supports revenue forecasting and planning tasks by leveraging large and complex datasets. While not a substitute for causal inference, it stands out as a powerful predictive tool and frequently serves as a baseline in modern forecasting pipelines [24].

2.3. Neural Architecture

Feedforward neural networks (FNNs) are universal function approximators capable of mapping any continuous function given sufficient complexity in their architecture [25]. An FNN consists of successive layers of neurons: an input layer, one or more hidden layers, and an output layer. Each neuron computes a weighted sum of its inputs, adds a bias term, and passes the result through a nonlinear activation function. This produces outputs that feed exclusively forward into subsequent layers, establishing the forward propagation mechanism.
Mathematically, the feedforward operation can be expressed as
y = f(x; W, b) = σ(Wx + b)
where x denotes the input vector, W the weights, b the biases, and σ an activation function. The choice of activation (such as sigmoid, tanh, or ReLU) critically influences the learning dynamics and stability [10].
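The forward pass above can be sketched in a few lines; the weights are illustrative placeholders, and ReLU stands in for the activation σ:

```python
import numpy as np

def relu(z):
    # rectified linear unit: max(0, z) elementwise
    return np.maximum(0.0, z)

x = np.array([1.0, -2.0, 0.5])      # input vector (3 features)
W = np.array([[0.2, -0.1, 0.4],
              [0.0,  0.3, -0.2]])   # 2 hidden units x 3 inputs
b = np.array([0.1, -0.1])           # one bias per hidden unit

h = relu(W @ x + b)                 # forward propagation through one layer
```

Stacking such layers, each feeding its output strictly forward into the next, is the forward-propagation mechanism the text describes.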
Learning in FNNs is achieved by optimizing weights and biases such that feedforward outputs approximate target values. This optimization relies on error functions and gradient-based updates. While gradient descent and backpropagation remain central to training, researchers have also turned to metaheuristic optimization (such as genetic algorithms and swarm intelligence) to overcome local minima and enhance generalization [25].
A core challenge in feedforward training lies in balancing bias and variance, ensuring that models generalize to unseen data without underfitting or overfitting. Regularization strategies, such as Lasso, group Lasso, and their extensions, have been incorporated directly into feedforward learning to constrain complexity and prune redundant neurons [9].
The feedforward principle extends beyond neural architectures into forecasting methodologies. In econometric and financial applications, feedforward models enable the direct mapping from historical and exogenous variables to predicted outcomes. For instance, Masini, Medeiros, and Mendes (2021) survey feedforward and recurrent architectures in time series forecasting, emphasizing their capability to approximate nonlinear dependencies in economic and financial data [3].
Hybrid forecasting models further integrate feedforward networks with econometric tools. For example, Stempień and Słępaczuk (2024) demonstrate that combining ARIMA with feedforward deep learning models (e.g., LSTMs) improves predictive accuracy for stock indices and cryptocurrencies [24], while Patel et al. (2015) show that two-stage SVR-ANN and SVR-RF fusion models can enhance stock index prediction based on technical indicators [19]. These hybridizations rely on the feedforward component to capture nonlinearities that classical models cannot represent.
Additionally, predictive operations in supply chain planning have adopted feedforward logic. Gallego-Garcia and Garcia-Garcia (2021) argue that predictive planning relies on forward-flow models that allocate resources according to statistically projected demand scenarios [26]. Here, feedforward ensures that forecasts inform decisions without recursive feedback delays, thereby improving adaptability and efficiency.
While feedforward networks provide static mappings, recurrent neural networks (RNNs) and long short-term memory (LSTM) networks extend this by incorporating feedback loops and temporal dependencies [10,12,13]. Feedforward architectures are well-suited for problems where inputs and outputs are independent observations, whereas recurrent systems excel in sequential domains. Nonetheless, feedforward models remain central even in hybrid systems, as they establish the non-recurrent computational backbone.
Recent literature highlights several directions in refining feedforward models:
  • Activation Functions: evolving from sigmoid to ReLU and its variants to mitigate vanishing gradients [27].
  • Regularization Techniques: integration of hidden-layer pruning via group Lasso to enforce sparsity and improve efficiency [9].
  • Metaheuristic Optimization: incorporating swarm intelligence or evolutionary algorithms to optimize weights and architectures [25].
  • Hybrid Forecasting: combining feedforward neural structures with econometric models for financial time series [20].
These developments reflect the enduring relevance of feedforward while acknowledging the need for adaptivity in high-dimensional, nonlinear, and volatile domains.
Feedforward, as both a computational paradigm and a methodological principle, underpins much of modern forecasting and machine learning. At its core, feedforward represents the forward-only propagation of signals, ensuring that predictions derive from structured transformations of inputs without cyclical dependence. In neural networks, this manifests as layer-by-layer computation, while in forecasting systems, it aligns with predictive models that map historical data to expected outcomes.
Across the surveyed literature, feedforward emerges not as an isolated mechanism, but as the foundation upon which regularization, optimization, and hybridization strategies build. Its simplicity, universality, and adaptability explain why feedforward continues to play a central role in both theoretical advancements and applied forecasting in economics, finance, and engineering.

Attention and Transformer Architectures

Recurrent and convolutional models laid the groundwork for handling sequential data in finance, but their limitations in capturing long-range dependencies motivated a shift toward attention mechanisms and transformer architectures. Transformers dispense with recurrence by applying self-attention, allowing models to weigh relationships across all time steps simultaneously. This flexibility has made them the dominant paradigm in NLP and computer vision, and they are increasingly adapted to financial applications. Beyond generic sequence modeling, transformers are now applied directly to financial microstructure. For limit-order-book (LOB) forecasting, transformer architectures with positional or dual attention have demonstrated superior performance in intraday price-movement classification, outperforming CNN and RNN baselines on benchmark LOB datasets [28,29]. In parallel, FinBERT [30] adapts BERT for the finance domain, significantly improving sentiment classification of analyst reports and financial news compared to classical machine learning methods. For multi-horizon time-series forecasting, the Temporal Fusion Transformer (TFT) [31,32] provides an interpretable attention framework, and more recent work shows that TFT-based models can improve stock-market prediction when enriched with technical indicators [31,32]. At a glance, the alignment between task and transformer variant can be summarized as:
  • LOB microstructure (intraday) → TransLOB/TLOB for directional classification [28,29].
  • Financial NLP → FinBERT for sentiment classification [30].
  • Multi-horizon time-series → TFT for interpretable regression and forecast calibration [31,32].
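At the heart of all these variants is scaled dot-product self-attention, which weighs relationships across all time steps simultaneously. A minimal generic sketch with random weights (not any specific architecture above):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    # project the sequence into queries, keys, and values
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # pairwise similarities between all time steps, scaled by sqrt(d)
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # softmax over time steps (numerically stabilized)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V  # each output is an attention-weighted mix of values

rng = np.random.default_rng(0)
T, d = 6, 4                                      # 6 time steps, width 4
X = rng.normal(size=(T, d))                      # toy input sequence
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)              # shape (6, 4)
```

Because every output attends to every time step, no information has to be carried through a recurrent state, which is what enables the long-range dependency modeling discussed above.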

2.4. Learning Components

2.4.1. Cost Functions

At the heart of supervised learning lies the problem of evaluating how well a model approximates the relationship between explanatory variables and a target. This evaluation is conducted through a cost function, also referred to as a loss function, which quantifies the discrepancy between predicted and observed values. The minimization of such a function provides the guiding principle for parameter estimation and model selection. Across statistical learning, from classical regression to advanced ensemble methods, the cost function serves as the mathematical embodiment of model performance.
In its simplest form, the cost function arises in linear regression, where predictions are modeled as
ŷ = f_{w,b}(x) = wx + b
The associated cost function is the mean squared error (MSE):
J(w, b) = (1 / (2m)) · Σᵢ₌₁ᵐ (ŷ⁽ⁱ⁾ − y⁽ⁱ⁾)²
where
  • m is the number of training examples.
  • ŷ⁽ⁱ⁾ = f_{w,b}(x⁽ⁱ⁾) = wx⁽ⁱ⁾ + b is the prediction for input x⁽ⁱ⁾.
  • y⁽ⁱ⁾ is the true target value.
  • The squared term (ŷ⁽ⁱ⁾ − y⁽ⁱ⁾)² represents the error for each example.
The division by 2m provides two conveniences:
  • The cost is averaged over the dataset, normalizing it across datasets of different sizes.
  • The factor of 2 simplifies derivatives when applying gradient descent later.
This quadratic form provides several desirable properties: it is convex, ensuring a unique global minimum, and differentiable, enabling efficient optimization via gradient descent. The squared error penalizes larger deviations more heavily than smaller ones, yielding a natural fit for continuous-valued regression tasks. From the perspective of statistical theory, minimizing this cost corresponds to maximum likelihood estimation under Gaussian error assumptions [3].
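Gradient descent on this cost can be sketched directly; note how the 1/(2m) factor cancels the 2 from differentiating the square, leaving the mean error (times x, for w) as the gradient. The data and learning rate are illustrative:

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * X + 1.0                  # true line: w = 2, b = 1 (no noise)

w, b, lr = 0.0, 0.0, 0.05          # initial parameters, learning rate
for _ in range(2000):
    err = (w * X + b) - y          # per-example errors yhat - y
    w -= lr * np.mean(err * X)     # dJ/dw = (1/m) sum(err * x)
    b -= lr * np.mean(err)         # dJ/db = (1/m) sum(err)
```

Because the MSE surface is convex, these updates converge to the unique global minimum, recovering w ≈ 2 and b ≈ 1.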
The cost function is not only a measure of fit but also a geometric tool: visualizations reveal its “bowl-shaped” surface, where the global minimum corresponds to the optimal regression line. Thus, in regression the cost function embodies both statistical consistency and computational tractability (Figure 2).
When the prediction task shifts from regression to classification, the cost function must adapt accordingly. Logistic regression, for instance, introduces the logistic loss (cross-entropy):
J(w) = −(1/m) · Σᵢ₌₁ᵐ [ y⁽ⁱ⁾ log(ŷ⁽ⁱ⁾) + (1 − y⁽ⁱ⁾) log(1 − ŷ⁽ⁱ⁾) ]
where ŷ is the predicted probability of class membership. Unlike squared error, which poorly models probability distributions, cross-entropy naturally arises from maximum likelihood estimation of Bernoulli-distributed outcomes [18]. This illustrates a crucial principle: the choice of cost function must be aligned with the statistical nature of the prediction task.
Modern machine learning extends the concept of cost functions beyond parametric models. In gradient boosting, Friedman formalized learning as numerical optimization in function space, where at each stage the model fits a new function to the negative gradient of the cost function [25]. Here, the cost function defines not only the end goal but also the incremental updates. Depending on the modeling task, one may adopt squared error (regression), absolute error (robust regression), Huber loss (robust to outliers), or logistic loss (classification). Thus, the cost function becomes the unifying criterion driving the entire boosting procedure.
Tree-based ensembles and neural networks similarly rely on well-defined cost functions. Random Forests indirectly minimize squared error by averaging tree outputs, while deep networks learn through stochastic gradient descent on cost functions such as cross-entropy. Importantly, regularization terms (e.g., λ‖w‖²) are often incorporated into cost functions to prevent overfitting, effectively trading off bias and variance [3].
In financial forecasting, the design of the cost function interacts intimately with the choice of input features. Kamalov et al. (2021) demonstrated that classifiers built on raw stock prices and those built on returns may converge to different optima due to differences in feature distributions [33]. Since the cost function embodies assumptions about the underlying distribution, its minimization may yield systematically biased models if the feature space is poorly aligned with the problem. For instance, when return distributions are approximately Gaussian, optimization landscapes are smoother, leading to stable convergence. In contrast, irregular stock price distributions induce non-standard error surfaces, altering the trajectory of optimization.
This highlights a broader insight: the efficacy of a cost function cannot be decoupled from the data representation. Both the functional form of the cost and the statistical distribution of inputs determine learning outcomes.
The literature in high-dimensional time series forecasting emphasizes the necessity of regularized cost functions to ensure model stability when the number of predictors exceeds the sample size. Penalized regressions (e.g., LASSO, Elastic Net) minimize augmented cost functions of the form:
Q(β) = Σₜ₌₁ᵀ (Yₜ − Xₜᵀβ)² + λ‖β‖₁,
balancing goodness-of-fit with parsimony [3]. These cost functions inherit convexity, allowing efficient computation while simultaneously enforcing sparsity. Similarly, robust boosting procedures employ cost functions such as the Huber loss, which interpolates between quadratic and absolute loss to maintain efficiency under Gaussian noise while resisting the influence of outliers [25].
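The sparsity-inducing effect of the L1 penalty can be sketched with scikit-learn's Lasso on synthetic data; the penalty strength `alpha` and the "true" coefficient values are illustrative:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
T, p = 200, 40                          # 200 observations, 40 predictors
X = rng.normal(size=(T, p))
beta = np.zeros(p)
beta[:2] = [1.0, -0.8]                  # only 2 predictors truly matter
y = X @ beta + 0.1 * rng.normal(size=T)

lasso = Lasso(alpha=0.1).fit(X, y)      # minimizes ||y - Xb||^2 + penalty
n_nonzero = int(np.sum(lasso.coef_ != 0.0))
```

Most of the 40 coefficients are driven exactly to zero, illustrating how the augmented cost function enforces parsimony while keeping the genuinely informative predictors.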
Across domains (linear regression, classification, boosting, and high-dimensional forecasting) the cost function serves as the fundamental bridge between statistical theory and computational practice. It encodes assumptions about data distributions, defines the optimization landscape, and mediates trade-offs between bias, variance, and robustness. As financial forecasting studies illustrate, the alignment between cost function and feature space can decisively affect predictive power [34]. As ensemble and neural methods show, the cost function extends beyond error measurement to govern iterative learning itself [25]. And as advances in time series econometrics demonstrate, augmenting cost functions with penalties enables effective learning in high-dimensional, noisy environments [3].
Ultimately, the cost function is not a mere mathematical artifact. It is the formalization of the question “What does it mean to predict well?” Its definition, selection, and minimization are thus central to the practice of machine learning, ensuring that models remain not only predictive but also statistically principled and robust across varied domains.

2.4.2. Loss Functions

Machine learning methods fundamentally revolve around the ability to evaluate how well a model performs on observed data. This evaluation is codified in a loss function, a mathematical device that quantifies the discrepancy between predicted outcomes and true values. While related to the broader notion of a cost function, the loss function operates at the granularity of individual examples, and its aggregation yields the cost across datasets. The design and selection of loss functions directly influence statistical properties, computational feasibility, and ultimately predictive performance. Across regression, classification, boosting, and high-dimensional forecasting, the loss function is the unifying principle by which learning is defined.
The difference between loss and cost is the following:
  • Loss is the measure of error for a single training example.
  • Cost aggregates the loss across all training examples, often as an average.
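The distinction can be made concrete in a few lines, using squared error as the per-example loss:

```python
import numpy as np

def loss(pred, target):
    # loss: error of a single training example
    return (pred - target) ** 2

preds = np.array([2.5, 0.0, 2.0])
targets = np.array([3.0, -0.5, 2.0])

# cost: the loss aggregated (here, averaged) across all examples
cost = np.mean([loss(p, t) for p, t in zip(preds, targets)])
```

Each call to `loss` scores one example; `cost` is the dataset-level average that optimization actually minimizes.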
In regression, the squared error serves as the canonical loss:
loss(f(x⁽ⁱ⁾), y⁽ⁱ⁾) = (f(x⁽ⁱ⁾) − y⁽ⁱ⁾)²,
leading to the mean squared error cost function. In classification, particularly logistic regression, the squared error loss proves inadequate because of the nonlinear transformation introduced by the sigmoid function. Instead, the logistic loss is adopted:
$$\mathrm{loss}\left(f\left(x^{(i)}\right), y^{(i)}\right) = -y^{(i)} \log f\left(x^{(i)}\right) - \left(1 - y^{(i)}\right) \log\left(1 - f\left(x^{(i)}\right)\right)$$
which is convex and penalizes confident misclassifications strongly. This convexity ensures tractable optimization and forms the basis of the cross-entropy family of losses.
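The loss/cost distinction above can be made concrete with a small sketch (the function names are our own, for illustration):

```python
import numpy as np

def squared_loss(f_x, y):
    # Per-example regression loss.
    return (f_x - y) ** 2

def logistic_loss(f_x, y):
    # Per-example classification loss; f_x is a predicted probability.
    return -y * np.log(f_x) - (1 - y) * np.log(1 - f_x)

def cost(loss_fn, preds, targets):
    # Cost = average of the per-example losses over the dataset.
    return float(np.mean([loss_fn(f, y) for f, y in zip(preds, targets)]))

# Logistic loss penalizes a confident misclassification far more severely:
mild = logistic_loss(0.51, 0)    # predicted 0.51 when the truth is 0
severe = logistic_loss(0.99, 0)  # predicted 0.99 when the truth is 0
```

Averaging `squared_loss` over a dataset reproduces the MSE cost; averaging `logistic_loss` gives the cross-entropy cost.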
In statistical learning theory, forecasting is framed as a decision problem: the objective is to find a function $f(x)$ that minimizes expected predictive loss under the joint distribution of inputs and outputs [3]. For regression tasks, commonly used losses include the squared error $L(y, f) = (y - f)^2$ and the absolute error $L(y, f) = |y - f|$. These induce risk functions such as mean squared error (MSE) or mean absolute error (MAE), which not only serve as measures of fit but also as criteria for theoretical consistency. For classification, the logistic and cross-entropy losses are widely adopted, as they ensure convexity and penalize confident misclassifications severely [25,34].
This general framework allows for specialization. In macroeconomic forecasting, Masini et al. (2021) emphasize that loss functions embody assumptions about error structure, influencing both estimation and inference. For instance, squared error assumes symmetry and is sensitive to outliers, while robust alternatives such as the Huber loss provide resilience in fat-tailed environments [3,7].
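The robustness contrast can be seen numerically; a minimal sketch (the `delta` parameter and residual values are illustrative):

```python
import numpy as np

def huber(r, delta=1.0):
    """Huber loss on residuals r: quadratic for |r| <= delta,
    linear beyond, so outliers have bounded influence."""
    r = np.asarray(r, dtype=float)
    quadratic = 0.5 * r ** 2
    linear = delta * (np.abs(r) - 0.5 * delta)
    return np.where(np.abs(r) <= delta, quadratic, linear)

residuals = np.array([0.1, 0.5, 10.0])   # the last value mimics a fat tail
squared = 0.5 * residuals ** 2           # squared loss: outlier dominates
robust = huber(residuals)                # Huber grows only linearly
```

For the outlier residual of 10, the squared loss is 50 while the Huber loss is 9.5, which is exactly the resilience in fat-tailed environments described above.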
In the context of neural networks, Crone et al. (2011) highlight that empirical performance evaluations depend critically on the chosen error metric [35]. The NN3 forecasting competition revealed that loss-based evaluations such as symmetric mean absolute percentage error (sMAPE) or mean absolute scaled error (MASE) often yield different rankings of forecasting methods, illustrating the dependence of conclusions on loss specification. Borovkova and Tsiamas (2019) further stress that for high-frequency stock market classification, cross-entropy loss (logistic loss) is employed to optimize LSTM ensembles [12,34]. The probabilistic nature of cross-entropy aligns with directional classification tasks, translating prediction into likelihood-based scoring that can be aggregated across time.
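For reference, the two competition metrics mentioned above can be computed as follows (one common convention; exact definitions vary slightly across studies):

```python
import numpy as np

def smape(actual, forecast):
    """Symmetric mean absolute percentage error, in percent."""
    a = np.asarray(actual, dtype=float)
    f = np.asarray(forecast, dtype=float)
    return 100.0 * np.mean(2.0 * np.abs(f - a) / (np.abs(a) + np.abs(f)))

def mase(actual, forecast, train):
    """Mean absolute scaled error: forecast MAE divided by the
    in-sample MAE of the naive one-step (random-walk) forecast."""
    a = np.asarray(actual, dtype=float)
    f = np.asarray(forecast, dtype=float)
    tr = np.asarray(train, dtype=float)
    scale = np.mean(np.abs(np.diff(tr)))
    return np.mean(np.abs(a - f)) / scale
```

Because sMAPE is scale-free while MASE is anchored to a naive benchmark, the two can rank the same set of forecasts differently, which is precisely the dependence on loss specification observed in the NN3 results.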
Friedman’s (2001) gradient boosting framework makes the centrality of loss explicit by viewing function estimation as numerical optimization in function space [36]. The algorithm iteratively fits weak learners in the direction of the negative gradient of the loss, thereby generalizing boosting to arbitrary differentiable losses. Different loss choices lead to distinct algorithms: squared-error produces LS-Boost; absolute-deviation yields LAD-Boost; Huber loss balances efficiency and robustness; and logistic loss enables boosting for classification. Thus, loss functions not only evaluate predictions but also define the optimization trajectory of learning algorithms.
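A compact sketch of this functional-gradient view for the squared-error case, using a toy weak learner of our own (a one-split regression stump):

```python
import numpy as np

def fit_stump(x, r):
    """Weak learner: one-split regression stump on a single feature."""
    best = None
    for t in np.unique(x)[:-1]:
        left, right = r[x <= t], r[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, t, left_val, right_val = best
    return lambda xq: np.where(xq <= t, left_val, right_val)

def ls_boost(x, y, rounds=100, lr=0.1):
    """LS-Boost: under squared-error loss the negative gradient is the
    residual, so each round fits a stump to the current residuals and
    takes a shrunken step (lr) in function space."""
    f = np.full(len(y), y.mean())
    for _ in range(rounds):
        h = fit_stump(x, y - f)   # fit the weak learner to the negative gradient
        f = f + lr * h(x)
    return f

x = np.arange(8.0)
y = np.where(x > 3, 2.0, 0.0)    # step-function target
fitted = ls_boost(x, y)
```

Swapping the residual for the gradient of an absolute-deviation or Huber loss would turn the same loop into LAD-Boost or Huber boosting, which is the generality Friedman's framework provides.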
Kamalov et al. (2021) demonstrate how the form of input features (prices vs. returns) interacts with the optimization of loss functions [33]. Because the statistical distribution of returns is closer to normality, while prices exhibit irregular and stock-specific distributions, classifiers trained on different features effectively minimize different empirical losses. The optimization landscape, and therefore convergence behavior, is shaped by this interaction between feature distribution and loss design. This finding underscores that the loss function cannot be disentangled from the data representation in financial forecasting.
In high-dimensional forecasting, the simple minimization of squared error is insufficient due to overparameterization. Folded concave penalties, such as SCAD and MCP, modify the effective loss by adding nonconvex regularization terms [33]. Fan (2014) proves that under certain regularity conditions, these penalized loss functions enjoy the strong oracle property: the resulting estimator performs as well as if the true underlying sparsity pattern were known in advance [26]. This theoretical guarantee emphasizes how carefully designed loss functions, combined with penalties, ensure both statistical optimality and computational tractability.
Yoon (2021) applies loss-based evaluation to machine learning models forecasting Japanese GDP growth [37]. Forecast accuracy is assessed using mean absolute percentage error (MAPE) and root mean squared error (RMSE), metrics derived from $L_1$ and $L_2$ loss, respectively. The study shows that gradient boosting and Random Forest models outperform institutional benchmarks (IMF, BOJ), demonstrating that the choice and minimization of suitable loss functions can yield superior predictive power even in macroeconomic contexts.
From forecasting competitions and high-frequency financial classification to macroeconomic growth prediction and penalized regression, the loss function emerges as the unifying element of modern predictive modeling. It provides the mathematical link between data, model, and optimization procedure. Neural networks rely on cross-entropy to ensure probabilistic calibration; boosting adapts to arbitrary differentiable losses; penalized regressions augment loss with regularization to achieve oracle properties; and economic forecasts are judged by error metrics rooted in $L_1$ and $L_2$. Ultimately, the design and interpretation of loss functions embody both statistical rigor and practical forecasting performance, making them the cornerstone of machine learning in economics and finance [3,34].

2.4.3. Optimization (Gradient Descent)

Gradient descent (GD) is the backbone of modern machine learning optimization. It provides a systematic method for minimizing cost functions by iteratively updating parameters in the direction of steepest descent. Though conceptually straightforward, its practical implementation in high-dimensional, noisy, or economically meaningful settings raises deep theoretical and empirical considerations. Across disciplines, from econometrics to finance, and from boosting algorithms to deep learning, the mechanism of gradient descent reveals both its versatility and its challenges.
Gradient descent is described as an iterative optimization algorithm used to minimize the cost function of a model. In regression problems, the cost function measures the discrepancy between predicted values $f_{w,b}(x)$ and observed outcomes $y$. Gradient descent provides a procedure to find the parameters $(w, b)$ that minimize this cost.
Formally, the algorithm updates the parameters according to
$$w := w - \alpha \frac{\partial J(w, b)}{\partial w}$$
$$b := b - \alpha \frac{\partial J(w, b)}{\partial b}$$
where
  • $J(w, b)$ is the cost function.
  • $\partial J(w, b) / \partial w$ and $\partial J(w, b) / \partial b$ are the partial derivatives (the gradients).
  • $\alpha$ is the learning rate, a scalar controlling the step size.
This update is performed repeatedly until convergence.
The derivative (gradient) indicates the slope of the cost function with respect to each parameter. Because the cost function in linear regression has a “bowl-shaped” convex surface, the gradient always points away from the minimum (see Figure 2). By subtracting a fraction of the gradient, gradient descent moves the parameters toward the minimum, ensuring cost reduction at each step.
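The update rule above can be sketched for one-variable linear regression; the toy data, step size, and iteration count are illustrative:

```python
import numpy as np

def gradient_descent(x, y, alpha=0.1, iters=2000):
    """Batch gradient descent for f(x) = w*x + b under the MSE cost
    J(w, b) = (1/2m) * sum_i (w*x_i + b - y_i)^2."""
    w, b = 0.0, 0.0
    for _ in range(iters):
        err = w * x + b - y
        grad_w = (err * x).mean()   # dJ/dw
        grad_b = err.mean()         # dJ/db
        w -= alpha * grad_w         # simultaneous update of w and b
        b -= alpha * grad_b
    return w, b

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0                   # data lie exactly on a line
w, b = gradient_descent(x, y)       # GD recovers w = 2, b = 1
```

Because the cost surface is convex, any sufficiently small learning rate converges to the unique minimum; too large a rate makes the iterates diverge instead.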
This subsection synthesizes perspectives from nine recent contributions that address gradient descent in theory and practice, highlighting its role in cost minimization, convergence dynamics, and application to forecasting problems.
In statistical learning theory, this iterative minimization is cast as a decision problem: choosing $f(x)$ to minimize expected predictive loss [3]. The gradient descent procedure is thus the computational realization of risk minimization, translating theoretical objectives into parameter updates.
Friedman’s seminal work on gradient boosting generalized gradient descent beyond parameter space, reframing it as optimization in function space [26]. Here, each iteration fits a weak learner to the negative gradient of the cost, effectively performing a stagewise descent. This perspective reveals the breadth of gradient descent: it underlies not only neural training but also ensemble learning, robust regression, and classification.
Different loss functions induce different descent dynamics: squared-error leads to LS-Boost, absolute deviation to LAD-Boost, and logistic likelihood to classification boosting. The cost surface and descent trajectory thus depend critically on the chosen error measure.
Arpit et al. (2017) provide empirical evidence on how gradient descent behaves in deep neural networks [38]. Their analysis shows that networks first minimize cost by capturing simple patterns, before gradually memorizing noise. This progression is a property of gradient descent itself: its incremental updates prioritize directions in the cost landscape associated with large, generalizable gradients, before fitting idiosyncratic fluctuations.
Activation functions modulate these dynamics by shaping gradient flow. Saturating functions such as sigmoid and tanh can lead to vanishing gradients, slowing descent, while rectified linear units (ReLU) preserve gradients and enable deeper architectures [6]. Hence, gradient descent is inseparable from both cost function definition and network architecture.
Gradient descent interacts closely with regularization. Penalized regression methods, such as LASSO or Elastic Net, augment the cost function with penalty terms that modify gradient updates [3]. Similarly, group LASSO in neural networks prunes hidden units by forcing entire weight groups towards zero [9]. From an economic perspective, regularization ensures that gradient descent avoids overfitting in high-dimensional settings, yielding estimators with desirable properties such as sparsity or oracle efficiency [3]. These augmented gradient flows reflect a balance between data fit and complexity control.
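How a penalty reshapes the update can be sketched for the lasso via proximal gradient descent (ISTA); the data, step size, and penalty strength here are illustrative:

```python
import numpy as np

def soft_threshold(v, t):
    # Proximal operator of the L1 penalty: shrink each weight toward zero.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_prox_gd(X, y, lam=0.5, alpha=0.05, iters=1000):
    """Minimize (1/2n)||Xw - y||^2 + lam * ||w||_1: a plain gradient step
    on the smooth squared-error term, followed by soft-thresholding,
    which is what drives irrelevant coefficients exactly to zero."""
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(iters):
        grad = X.T @ (X @ w - y) / n
        w = soft_threshold(w - alpha * grad, alpha * lam)
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([2.0, 0.0, 0.0, 0.0, 0.0])   # only feature 0 matters
w = lasso_prox_gd(X, y)
```

The thresholding step is the "augmented gradient flow" in miniature: the data-fit gradient pulls weights toward the least-squares solution while the penalty prunes small ones, yielding the sparsity discussed above.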
In large datasets, stochastic gradient descent (SGD) replaces exact gradients with noisy estimates from subsamples. This reduces computation while maintaining unbiased updates [39]. Though noisy, SGD often escapes shallow local minima and converges faster in practice, explaining its dominance in deep learning.
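Mini-batch SGD differs from the batch version only in using a random subsample per update; a minimal sketch on the same kind of toy regression (batch size and epochs are illustrative):

```python
import numpy as np

def sgd(x, y, alpha=0.1, epochs=500, batch=2, seed=0):
    """Mini-batch SGD for f(x) = w*x + b: each update uses a noisy
    gradient estimate computed from a random subsample."""
    rng = np.random.default_rng(seed)
    w, b = 0.0, 0.0
    m = len(x)
    for _ in range(epochs):
        order = rng.permutation(m)          # reshuffle every epoch
        for s in range(0, m, batch):
            j = order[s:s + batch]
            err = w * x[j] + b - y[j]
            w -= alpha * (err * x[j]).mean()
            b -= alpha * err.mean()
    return w, b

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0
w, b = sgd(x, y)
```

Each mini-batch gradient is an unbiased but noisy estimate of the full gradient; here the data are exactly linear, so the noise vanishes at the optimum and the iterates settle on it.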
In financial forecasting, SGD is used for training models such as recurrent neural networks (LSTMs) and multilayer perceptrons [34]. Genetic programming approaches, while not gradient-based, are benchmarked against gradient-trained networks and ensemble methods such as Random Forest and boosting [9], reinforcing SGD’s centrality in empirical performance comparisons.
Gradient descent plays a critical role in macroeconomic forecasting models, where parameter-rich specifications must be estimated from relatively small samples [9]. Penalized cost functions optimized via gradient descent yield stable predictors in high-dimensional environments, such as forecasting GDP, inflation, or unemployment [3]. In financial forecasting, the interaction between features (prices vs. returns) and optimization affects convergence landscapes [34]. LSTM models trained via gradient descent on price-based features outperform return-based analogues, highlighting how data representation interacts with gradient dynamics.
Athey (2019) emphasizes that while gradient descent excels at predictive optimization, its use in economics requires caution when causal inference is the objective [39]. In such cases, minimizing predictive error may not yield unbiased policy-relevant estimates. Gradient descent unifies statistical theory and computational practice. As optimization in parameter or function space, it enables diverse methods from boosting to neural networks. Its trajectory reveals insights into learning dynamics—how models generalize before memorizing noise, and how architecture influences convergence. In applied economics and finance, gradient descent provides scalable tools for forecasting while requiring careful adaptation when inference, not prediction, is the goal.
The surveyed literature confirms that gradient descent is not merely an algorithm but the engine of modern learning: its interaction with cost functions, data structures, and regularization defines the boundary between statistical rigor and practical forecasting success.

2.4.4. Linear Activation

As described in Section 2.3, neurons apply a weighted sum of inputs followed by an activation function. In the linear case, this activation is simply the identity, so the output equals the weighted sum. While rarely used in hidden layers, the linear activation remains important in regression tasks where the network’s output is unconstrained.
Formally, for an input vector $x = (x_1, x_2, \ldots, x_n)$ with weights $w = (w_1, w_2, \ldots, w_n)$ and bias $b$, the linear activation is defined as:
$$f(z) = z = w \cdot x + b$$
where
  • $z$ is the pre-activation (weighted sum).
  • $f(z)$ is the activation, which in the linear case is simply the identity function.
Linear activation functions are fundamental in linear regression, where the model hypothesis is
$$f_{w,b}(x) = w \cdot x + b$$
This is essentially a single-neuron model with a linear activation. In this context,
  • The neuron outputs a continuous value, suitable for predicting numerical targets.
  • The cost function (mean squared error) directly evaluates how close the linear activation outputs are to the actual target values.
Thus, the linear activation forms the basis of supervised regression tasks, where the model maps continuous inputs to continuous outputs without nonlinear transformations.
Limitations of linear activation when applied to multi-layer networks:
  • Lack of expressiveness. If all layers in a neural network use purely linear activations, the composition of functions collapses into a single linear transformation. For example:
    $$f(x) = W_2\left(W_1 x + b_1\right) + b_2 = \left(W_2 W_1\right) x + \left(W_2 b_1 + b_2\right)$$
    which is still linear in $x$. Thus, adding more layers provides no additional representational power.
  • Inability to model complex functions. Many real-world problems involve nonlinear patterns (classification, signal recognition, image tasks). A network of linear activations cannot approximate such functions. This motivates the use of nonlinear activations such as sigmoid, tanh, or ReLU.
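The collapse of stacked linear layers can be verified numerically; an illustrative check with random weights:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

# Two stacked linear layers with identity activations ...
two_layers = W2 @ (W1 @ x + b1) + b2
# ... equal exactly one linear layer with W = W2 W1 and b = W2 b1 + b2.
one_layer = (W2 @ W1) @ x + (W2 @ b1 + b2)

collapse_holds = np.allclose(two_layers, one_layer)
```

The identity holds for any weights and any input, which is why depth only pays off once a nonlinearity separates the layers.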
Linear activation is characterized as the foundational activation function in neural models. It embodies the identity mapping
$$f(z) = z$$
and underlies linear regression as the simplest feedforward model. While it is limited in multi-layer contexts due to its inability to capture nonlinear relationships, it remains crucial in regression tasks and as an output function when unbounded real-valued predictions are required.
Thus, in an academic perspective, the linear activation function illustrates both the origin of neural modeling—as a direct generalization of regression—and the motivation for nonlinear extensions, which expand representational capacity and make deep networks powerful.
The linear function directly passes weighted inputs to the next layer, which makes it equivalent to traditional regression analysis. This connection explains why early applications of machine learning in forecasting were often compared with econometric models such as OLS regression, ARIMA, or vector autoregression [9,17,24]. For instance, in macroeconomic forecasting, statistical models assuming linear relationships between inputs and outputs have historically dominated policy analysis [17]. Neural networks with a linear output activation can be interpreted as flexible regressions, bridging classical econometrics with machine learning [3].
The linear function is also essential for output layers in regression tasks, including GDP growth prediction [40], or price forecasting [34]. Without constraining outputs to a bounded range (as with sigmoid or tanh), the linear function allows models to produce unrestricted continuous values, a necessity in financial applications.
The first advantage is interpretability. As shown in studies of machine learning in economics [22,39], decision-makers require transparent models. With linear activations, each weight can be mapped directly to marginal contributions, preserving an interpretable link between predictors (e.g., interest rates, prices) and outcomes (e.g., GDP growth).
Second, the linear function reduces the risk of overfitting. Ensemble approaches such as bagging [36] and Random Forests [9] highlight how variance reduction improves prediction. Similarly, linear activation ensures that hidden units do not introduce uncontrolled nonlinearities, which can destabilize models when sample sizes are limited, as in many macroeconomic datasets [21].
Third, linear activations facilitate hybrid modeling. Research combining econometric and nonlinear ML models finds that the linear part captures stable, long-term relationships, while nonlinear components (e.g., LSTM or boosting) capture residual dynamics [20]. This division would not be possible without retaining a linear mapping in part of the network.
The drawback is that stacking multiple linear layers collapses to a single linear transformation, regardless of depth. Thus, a network using only linear activations cannot model nonlinear dependencies that are crucial in financial time series, where volatility clustering, regime changes, or behavioral anomalies dominate [34,35]. Empirical competitions such as the NN3 study on time series prediction confirmed that nonlinear models—enabled by nonlinear activations—outperform linear baselines on complex data [35].
In finance, studies using LSTM ensembles [34] or gradient boosting machines [26] show superior performance relative to linear benchmarks when high-frequency or nonlinear patterns are present. Similarly, Random Forests applied to GDP forecasting [40] outperform purely linear methods, especially in turbulent periods. These results highlight the insufficiency of linear activation for capturing nonlinear market dynamics.
Despite limitations, the linear function remains indispensable. First, it provides a baseline benchmark. Many works on financial forecasting compare advanced neural networks or ensemble learners against linear regression or linear-activated models [34,40]. The linear benchmark clarifies whether predictive gains are due to genuine nonlinear learning or mere complexity.
Second, linear output layers are widely used even in nonlinear architectures. For example, LSTM networks forecasting prices typically include a linear output to map hidden nonlinear transformations to continuous predictions [18,34].
Finally, linear activation supports regularization and penalization frameworks. Folded concave penalized estimation [18] and complete subset regression [17] both extend linear models with regularization [17]. These methods rely on linear mappings at their core but enhance them with constraints to balance bias and variance.
The linear activation function occupies a paradoxical role: by itself, it is too limited to model nonlinear financial dynamics, yet it is indispensable as a benchmark, an output mapping, and a stabilizing component in hybrid frameworks. Evidence across forecasting studies confirms that models relying solely on linear activations rarely achieve state-of-the-art performance. However, when combined with nonlinear techniques such as Random Forests [9], gradient boosting [26], or LSTMs [34], the linear function ensures interpretability and comparability.
In short, the linear activation function is not obsolete—it is the anchor point of economic forecasting models. It preserves the interpretability of traditional econometrics while providing the foundation upon which nonlinear methods demonstrate their added value.

2.4.5. Sigmoid Activation

Building on the general architecture in Section 2.3, the sigmoid activation maps the neuron’s weighted input into the [0,1] interval via the logistic function. It is mathematically defined as
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
where
  • $z = w \cdot x + b$ is the linear combination of inputs and parameters.
  • $\sigma(z)$ is the activated output.
This function is smooth, differentiable, and strictly increasing, making it suitable for probabilistic interpretations.
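A direct implementation of $\sigma(z)$; the branching is a standard numerical-stability detail (not discussed in the reviewed papers) that avoids overflow in the exponential for large negative inputs:

```python
import numpy as np

def sigmoid(z):
    """Logistic function sigma(z) = 1 / (1 + exp(-z)), computed in a
    numerically stable way for both signs of z."""
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    exp_z = np.exp(z[~pos])              # safe: z < 0 here, no overflow
    out[~pos] = exp_z / (1.0 + exp_z)
    return out
```

The output lies strictly in (0, 1), with $\sigma(0) = 0.5$ and the symmetry $\sigma(-z) = 1 - \sigma(z)$, which underpins its probabilistic reading.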
Across the reviewed works, sigmoid functions are emphasized as a non-linear activation in feedforward and recurrent neural networks used for financial forecasting, particularly for binary classification problems such as price direction prediction [34,38].
In logistic regression, the sigmoid serves as the link function, transforming linear combinations of inputs into probabilities of a binary outcome [34].
In deep learning for financial time series, sigmoid activations appear particularly in output layers to provide probability distributions over outcomes such as “buy” vs. “sell” in stock forecasting [34].
In feedforward forecasting models, it is highlighted as a classical squashing function alongside hyperbolic tangent, central to the universal approximation capabilities of neural networks [3].
The literature identifies several benefits of sigmoid activations:
1. Probabilistic Interpretation—Outputs lie in $[0, 1]$, naturally mapping to probabilities.
2. Non-linearity—Introduces the ability to learn nonlinear relationships beyond linear models.
3. Universality—As shown by approximation theorems, networks with sigmoid activations can approximate any measurable function given sufficient hidden units.
4. Ease of Implementation—Historically, sigmoid was widely adopted due to its mathematical simplicity and interpretability.
Despite its historical importance, the articles also underline well-known drawbacks:
  • Vanishing Gradients: For large $|z|$, gradients approach zero, which slows down or prevents learning in deep architectures.
  • Non-zero-centered outputs: Since outputs lie in $[0, 1]$, gradients during optimization may propagate inefficiently.
  • Slower convergence compared to newer functions such as ReLU or tanh, especially in multi-layer networks.
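The vanishing-gradient effect follows directly from the derivative $\sigma'(z) = \sigma(z)(1 - \sigma(z))$, which peaks at 0.25; a small illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-10.0, 10.0, 1001)
dsigma = sigmoid(z) * (1.0 - sigmoid(z))   # derivative of the sigmoid
peak = dsigma.max()                        # attained at z = 0

# Backpropagation multiplies one such factor per sigmoid layer, so the
# gradient magnitude shrinks at least geometrically with depth:
worst_case_10_layers = peak ** 10
```

Even in the best case of ten layers all operating at $z = 0$, the upstream gradient is scaled by under $10^{-5}$, which is why deep sigmoid stacks train slowly compared to ReLU networks.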
Several variants such as hard sigmoid and sigmoid-weighted linear units (SiLU) are discussed in modern contexts, aiming to reduce computational cost or improve training dynamics.
The reviewed articles show practical use of sigmoid activations in different financial prediction settings:
  • LSTM networks for stock forecasting employ sigmoid gates (input, forget, and output) to regulate information flow, crucial in sequential modeling of financial time series [34].
  • Feedforward forecasting models integrate sigmoid functions at output layers to transform weighted inputs into directional probabilities of market movement [34].
  • Hybrid forecasting models also rely on sigmoid-based layers in combination with other methods, ensuring probabilistic classification outputs.
In synthesis, the sigmoid activation function holds a central place in the development of machine learning models for classification and forecasting tasks. Its mathematical structure provides a bridge between linear models and probabilistic interpretation, particularly relevant in financial prediction where outputs are directional probabilities. While vanishing gradient issues have limited its use in deeper hidden layers, sigmoid remains fundamental in output layers of classification networks and in the gating mechanisms of recurrent architectures such as LSTMs.
Thus, the sigmoid function is both historically foundational and practically indispensable in specific contexts, embodying the transition from linear to nonlinear, probabilistic machine learning models for forecasting.

3. Signals and Feature Sets in the Literature

The predictive performance of machine learning models in economics and finance depends not only on the choice of algorithms but also on the design of input features and signals. The literature emphasizes that representation choices—such as modeling prices versus returns, or levels versus growth rates—along with frequency alignment, the treatment of revisions, and leakage-proof evaluation protocols, significantly influence out-of-sample accuracy. This section synthesizes findings on the main families of signals used in macroeconomic forecasting, financial markets, and company-level planning, as well as common feature-engineering practices that improve model robustness and interpretability [3,4,17,22,34,37,40].

3.1. Macroeconomic Indicators

Macroeconomic datasets are widely used as both targets and predictors in forecasting studies. Common variables include GDP growth, inflation (CPI), unemployment, industrial production, retail sales, housing activity, trade balances, and sentiment surveys such as PMI/ISM indices. Research often combines so-called “hard” indicators, such as production and sales, with “soft” survey-based measures to improve robustness, particularly for nowcasting when official releases are delayed [17,37,40]. To improve stationarity and interpretability, variables are typically transformed into differences, quarter-on-quarter or year-on-year growth rates, log-growth, or rolling averages. Lag structures are introduced to capture publication delays and align predictors with the intended forecast horizon. Because official statistics are released asynchronously, macroeconomic panels are inherently “ragged,” and rigorous nowcasting pipelines attempt to align release calendars, ingest data in real time, and re-estimate models sequentially to avoid look-ahead bias [5,17]. A further challenge arises from data revisions: first-release estimates may differ materially from later vintages. Studies such as Yoon [22] demonstrate that performance evaluations based only on final data tend to overstate forecast accuracy, making real-time or pseudo-real-time vintages necessary. Research using the FRED-MD dataset shows that Elastic Net models significantly improve unemployment forecasting compared to autoregressive benchmarks, particularly in identifying turning points [9]. Similarly, Maehashi and Shintani (2020) find that lasso, ridge, and ensemble methods outperform traditional factor models in predicting Japanese macroeconomic variables. Recent frameworks, such as Bolhuis and Rayner (2020) and Coulombe (2020), highlight the importance of ensemble approaches and interpretable tree-based models for robust nowcasting [28,29,30].
Overall, regularized regression methods and tree ensembles tend to perform strongly in macro panels because they can handle collinearity and heterogeneous predictors, while neural networks and deep learning show mixed improvements unless applied to very large, carefully curated datasets [3,4].
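The standard transformations described above (differences, growth rates, lags) are straightforward to express; a sketch with a synthetic quarterly GDP level series (the values and column names are illustrative only):

```python
import numpy as np
import pandas as pd

# Synthetic quarterly GDP levels.
gdp = pd.Series([100.0, 102.0, 103.0, 105.0, 107.0, 110.0, 112.0, 115.0],
                index=pd.period_range("2020Q1", periods=8, freq="Q"))

features = pd.DataFrame({
    "qoq_growth": gdp.pct_change(1),          # quarter-on-quarter growth
    "yoy_growth": gdp.pct_change(4),          # year-on-year growth
    "log_growth": np.log(gdp).diff(),         # log-difference, approx. qoq
    "qoq_lag1": gdp.pct_change(1).shift(1),   # lag to respect release delay
})
```

The leading NaN rows produced by differencing and lagging must be dropped from the training window rather than filled, since filling them would implicitly use future information.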

3.2. Financial Market Signals

Financial markets research has traditionally relied on price-based and return-based signals, along with volatility and liquidity measures. Typical predictors include log-returns, realized volatility derived from intraday data, trading volume, and liquidity proxies such as bid–ask spreads. For directional forecasting tasks, returns rather than prices are almost universally preferred, both for reasons of stationarity and for alignment with classification metrics such as cross-entropy or AUC [3,34]. Studies at daily or weekly horizons often construct rolling windows of lagged returns, volatility, and trend/reversal indicators, whereas intraday forecasting employs shorter overlapping windows (e.g., 5–60 min). Importantly, labeling procedures must avoid implicit leakage by ensuring that features at time t do not incorporate future observations [6]. In high-frequency applications, sequence models such as Long Short-Term Memory (LSTM) networks can provide clear gains over traditional classifiers, but only under rigorous, leakage-free rolling evaluation. Borovkova and Tsiamas (2019) report that ensembles of LSTMs significantly outperform both logistic regression and tree-based methods in high-frequency stock market classification [12]. Lanbouri and Achchab (2020) demonstrate the effectiveness of LSTM networks in predicting intraday stock prices, showing that they capture short-term dynamics missed by traditional econometric baselines [11]. Guo (2024) further confirms these findings, showing that LSTMs integrated with technical indicators achieve superior predictive accuracy at intraday horizons compared to ensemble tree baselines [41]. Representation choices, such as modeling close-to-close returns versus range-based volatility, have been shown to alter optimization outcomes substantially, highlighting the sensitivity of market forecasts to feature design [34].
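A leakage-free construction of directional labels from lagged log-returns can be sketched as follows (the window length and synthetic prices are illustrative):

```python
import numpy as np

def lagged_return_features(prices, n_lags=3):
    """Features for predicting the return direction at time t use only
    returns up to t-1, so no future information leaks into the inputs."""
    log_returns = np.diff(np.log(np.asarray(prices, dtype=float)))
    X, y = [], []
    for t in range(n_lags, len(log_returns)):
        X.append(log_returns[t - n_lags:t])           # r_{t-3}, r_{t-2}, r_{t-1}
        y.append(1.0 if log_returns[t] > 0 else 0.0)  # direction label at t
    return np.array(X), np.array(y)

prices = np.exp(0.01 * np.arange(12.0))   # synthetic upward-drifting series
X, y = lagged_return_features(prices)
```

The half-open slice `[t - n_lags:t]` is what enforces the leakage rule: the observation at t appears only in the label, never in the features.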

3.3. Company-Level and Operational Data

At the firm level, forecasting models increasingly incorporate company-specific financial and operational data. Common inputs include financial statements, sales histories, key performance indicators (KPIs), transaction logs, and event calendars such as promotions or product launches. These datasets often have hierarchical structures (e.g., SKU → product category → regional totals), which require reconciliation in both model training and evaluation [5,18]. Feature engineering typically includes seasonality indicators (month, quarter, holidays), moving averages, price and promotion variables, as well as lagged sales. Hierarchical data can be modeled at the granular level with aggregation upwards, or alternatively at the aggregate level with disaggregation downwards using proportional or model-based splits [5,18]. However, real-world sales data are prone to censoring and quality issues, including stockouts, discontinuations, rebranded SKUs, and store closures. Best practice involves explicit handling of such cases through imputation, censoring flags, and data lineage tracking for reproducibility [5]. While deep learning methods show benefits when large numbers of related series are available—allowing the model to learn shared seasonality and event embeddings—regularized regression and ensemble methods remain competitive and often more interpretable in corporate settings [5,18].

3.4. Feature Engineering Patterns

Several recurring feature-engineering strategies cut across macroeconomic, market, and company-level studies. Temporal structures are captured through lag stacks, rolling averages, volatility estimates, and multi-scale windows. Calendar seasonality is encoded using cyclic functions (e.g., sine–cosine transforms) to avoid artificial breaks at period boundaries [3]. Scaling and outlier treatment are also important: standardization and robust scaling stabilize optimization, while winsorization or clipping mitigate the impact of extreme events in models sensitive to tails. Interaction features, such as volatility × trend, are often found to improve performance, while ensemble models automatically discover nonlinear interactions without manual specification [3]. Most importantly, feature representation must align with the business objective. For example, return-based representations are more suitable for classification of market direction, whereas level or difference representations better match magnitude-based metrics such as RMSE or MAPE [34]. Rigorous work further stresses that features must be leakage-proof: all inputs at time t should be constructed solely from information available at or before t. This includes respecting macroeconomic release calendars and avoiding the use of future averages or revisions [17].
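The sine-cosine encoding mentioned above takes only a few lines; a sketch for month-of-year (the function name is our own):

```python
import numpy as np

def cyclic_encode_month(month):
    """Map month 1..12 onto the unit circle so that December and
    January end up adjacent, with no artificial break at year end."""
    angle = 2.0 * np.pi * (month - 1) / 12.0
    return np.sin(angle), np.cos(angle)

dec = np.array(cyclic_encode_month(12))
jan = np.array(cyclic_encode_month(1))
feb = np.array(cyclic_encode_month(2))

# Distance Dec->Jan equals Jan->Feb, unlike the raw encoding 12 vs. 1.
d_dec_jan = np.linalg.norm(dec - jan)
d_jan_feb = np.linalg.norm(jan - feb)
```

A raw integer encoding places December eleven units from January; the circular encoding restores the true one-month gap, which matters for any distance- or gradient-based learner.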

3.5. Data Quality and Leakage Risks

The validity of empirical findings depends heavily on split design and evaluation protocols. Studies consistently emphasize the use of rolling or expanding-window cross-validation, avoiding shuffled splits that would otherwise contaminate temporal structure [3]. For macroeconomic applications, vintage or pseudo-real-time setups are necessary to ensure that forecasts are generated with the information available at the decision date [17,22]. In financial market tasks with overlapping horizons, researchers apply “embargoing” strategies to prevent target leakage across training and validation sets [3]. For corporate forecasting, rolling-origin evaluation is preferred, as it mimics real-world deployment where forecasts are generated sequentially. Across all domains, reproducibility requires careful documentation of data sources, versions, exclusions, and preprocessing pipelines [37].

3.6. Summary

Across macroeconomic, market, and company-level applications, the literature demonstrates that signal design and feature engineering are as critical to forecasting performance as model choice. Best practices include aligning representation to the forecasting objective (returns for direction, levels/changes for magnitudes), using vintage or pseudo-real-time protocols to reflect true information sets, and escalating model complexity only when supported by data scale and history. Transparent documentation of pipelines and rigorous time-respecting evaluation are increasingly recognized as prerequisites for reliable and reproducible results [3,4,17,22,34,37,40].
Temporal validation protocols are critical in finance and economics because data are time-ordered and subject to structural breaks, revisions, and non-stationarity. Table 1 summarizes recommended temporal validation protocols and key considerations across markets, macroeconomic indicators, and firm-level datasets. Unlike random cross-validation, which risks look-ahead bias, appropriate schemes must ensure that only information available at the forecast date is used. Common approaches include rolling windows, expanding windows, and rolling-origin evaluation, sometimes combined with embargo periods to prevent overlap between training and testing.
By aligning validation strategy with data characteristics, researchers can avoid leakage, respect publication lags, and generate forecasts that better mimic real-time decision settings. The choice of protocol should therefore be treated as an integral design decision rather than an afterthought.

4. Task-Centric Synthesis and Practical Guidance

This section synthesizes implications across financial markets, macroeconomics, and company-level planning, emphasizing representation choices and time-respectful validation over architectural novelty. Throughout, we draw on the sources catalogued in Section 3 and add a small set of canonical references to anchor recurrent practices in each subfield [3,4,6,17,22,34,40,41,42,43,44,45,46,47,48,49,50,51].
The following checklist distills recurring best practices across the reviewed studies and serves as a reproducible starting point for financial, macroeconomic and firm-level forecasting:
Step 1: Define the prediction target. Begin with a clear goal and data representation.
  • Use returns for directional or classification tasks.
  • Use levels or growth rates for magnitude forecasting (GDP, revenue, etc.).
Step 2: Select the loss function aligned with the objective (see Table 2 for a concise mapping of tasks to appropriate metrics).
  • Cross-entropy or AUC for directional accuracy.
  • RMSE, MAE, or MAPE for continuous prediction.
  • Include regularization (ridge/lasso/elastic-net) to stabilize estimation.
Step 3: Start simple, then escalate.
  • Level 1: Linear/logistic regression, understand gradient descent and overfitting.
  • Level 2: Regularized linear models, control complexity in high-dimensional data.
  • Level 3: Tree ensembles (RF, GBM), capture nonlinearities and interactions.
  • Level 4: Sequence and attention models (LSTM, Transformer), model temporal dependencies [12,13].
Step 4: Use time-aware validation (see Table 1 for domain-specific protocols).
  • Markets & firms: rolling or expanding windows.
  • Macroeconomics: vintage or pseudo-real-time evaluation.
Step 5: Guard against data leakage.
  • Build features only from information available at or before the forecast date.
  • Use purged or embargoed splits for overlapping horizons.
Step 6: Report calibration and ablations.
  • Report both accuracy and probability calibration.
  • Run ablations against strong ensemble baselines to test if added complexity pays off.
This playbook offers a reproducible workflow that moves from conceptual understanding to empirically robust forecasting, thus helping readers bridge theory, implementation, and evaluation in a transparent, step-wise manner.
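As a toy illustration of Steps 2 and 3 (loss choice plus regularization, starting simple), the following sketch fits a ridge-regularized linear model by gradient descent in pure Python. The data, learning rate, and penalty strength are made-up values for demonstration; any real application would use a vetted library and time-aware validation as described above.

```python
def ridge_gd(xs, ys, lam=0.1, lr=0.05, epochs=5000):
    """Fit y ~ w*x + b by gradient descent on the mean squared error
    plus an L2 (ridge) penalty on the slope, illustrating how
    regularization shrinks coefficients to control complexity."""
    w, b, n = 0.0, 0.0, len(xs)
    for _ in range(epochs):
        grad_w = sum((w * x + b - y) * x for x, y in zip(xs, ys)) / n + lam * w
        grad_b = sum((w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# On noiseless data generated by y = 2x + 1, the penalty pulls the
# fitted slope slightly below the unregularized value of 2:
w, b = ridge_gd([0, 1, 2, 3], [1, 3, 5, 7])
```

Setting `lam=0` recovers ordinary least squares in the limit; increasing `lam` trades a little bias for lower variance, which is exactly the stabilizing effect invoked for high-dimensional settings in Step 2.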
Table 2. Decision-aware mapping of forecasting tasks to evaluation metrics.
Forecasting Task | Appropriate Metrics | Notes/Caveats
Directional classification | Log-loss (cross-entropy), Brier score, calibration curves | Penalize overconfidence; check probability calibration
Magnitude regression | Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), Mean Absolute Percentage Error (MAPE, with caution) | MAPE unstable near zero values; MAE/RMSE generally preferred
Portfolio/decision tasks | Utility-adjusted or cost-sensitive metrics (profit-and-loss, Sharpe ratio, downside risk) | Evaluation should reflect economic usefulness rather than purely statistical fit

4.1. Financial Markets

Across daily–weekly horizons, the most durable improvements derive from labels and representation rather than from model escalation. Targets defined as forward k-period returns (rather than prices) and optionally scaled by recent volatility produce more stationary series, while multi-horizon returns, realized-volatility measures, and simple trend/reversal transforms supply recurring signal families. Evidence on momentum and short-horizon reversal underscores the value of such representations, and realized-volatility estimators derived from high-frequency data stabilize both inputs and targets [6,17,34,41,42,43]. Model choice should then reflect data granularity and the dependence induced by labeling: tree/boosting baselines reliably capture nonlinear interactions among return/volatility/trend features at end-of-day cadence, whereas sequence/attention models become useful in microstructure contexts where inputs are windowed order-book sequences and horizons are tightly coupled to order flow [6,34,46,47]. Evaluation must be strictly forward-only. Purged/embargoed cross-validation mitigates leakage from overlapping windows; out-of-time audit slices assess regime sensitivity; and when many variants are compared, multiple-testing adjustments such as the Deflated Sharpe Ratio help control backtest overfitting. Probability calibration should accompany risk-adjusted return metrics [17,34,44,45].
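The label construction described above, forward k-period returns scaled by recent realized volatility, can be sketched as follows. This is an illustrative pure-Python version; the horizon `k` and volatility window are arbitrary assumptions, and the volatility estimate uses only returns strictly before time t, so no future information enters the scaling.

```python
import math

def vol_scaled_forward_returns(prices, k=5, vol_window=20):
    """Turn a price series into volatility-scaled forward k-period
    log-return labels: targets are returns (not prices), divided by a
    realized-volatility estimate computed from past returns only."""
    rets = [math.log(prices[i + 1] / prices[i]) for i in range(len(prices) - 1)]
    labels = []
    for t in range(vol_window, len(rets) - k):
        past = rets[t - vol_window:t]                 # strictly past returns
        mean = sum(past) / vol_window
        vol = (sum((r - mean) ** 2 for r in past) / vol_window) ** 0.5
        fwd = sum(rets[t:t + k])                      # forward k-period log return
        labels.append(fwd / vol if vol > 0 else 0.0)
    return labels
```

Because each label spans k future periods, adjacent labels overlap; this is precisely the dependence that the purged/embargoed splits discussed above are designed to handle.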

4.2. Macroeconomics

Nowcasting and short-horizon forecasting combine heterogeneous “hard” indicators (production, sales, housing) with “soft” surveys (PMI/ISM, sentiment). Regularized linear models and tree ensembles are dependable because they cope with collinearity, mixed frequency, and ragged panels without demanding very large, curated corpora; reported gains from deep architectures are mixed unless panels are extensive and curated, and evaluation is truly real-time [3,4,40].
Methodological realism matters most: adopt vintage or pseudo-real-time protocols that respect release calendars and revisions; build features and targets only from information available at the forecast date; benchmark against autoregressive and institutional baselines; report RMSE/MAPE with revision-sensitivity. Studies that follow these protocols often see model rankings shift relative to ex-post evaluations [17,22].
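A pseudo-real-time information filter of the kind these protocols require can be sketched in a few lines. The observation schema here (reference period, release date, value) and the dates are hypothetical, chosen only to show the mechanism: an observation is visible to a forecast only once its release date has passed.

```python
from datetime import date

def available_at(observations, forecast_date):
    """Pseudo-real-time filter: keep only observations whose publication
    (release) date is on or before the forecast date, mimicking the
    information set an analyst actually had. Each observation is a tuple
    (reference_period, release_date, value); an illustrative schema."""
    return [(ref, val) for ref, release, val in observations
            if release <= forecast_date]

# An indicator referenced to March but released in late April is invisible
# to a forecast made on April 15 (hypothetical dates and values):
obs = [(date(2024, 2, 1), date(2024, 3, 15), 1.2),
       (date(2024, 3, 1), date(2024, 4, 28), 0.9)]
visible = available_at(obs, date(2024, 4, 15))
```

In a full vintage setup the values themselves would also change across releases; this sketch handles only publication lags, which is the pseudo-real-time (rather than true vintage) case.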
Bottom line for macro. Couple growth-rate/differenced representations and lag stacks with a vintage evaluation; begin with ridge/elastic-net and ensembles, then consider deep models only when scale and curation justify added complexity [3,4,5,17].

4.3. Company-Level Planning

For revenues, demand, and operations, the literature stresses seasonality and event structure, hierarchical coherence (SKU → category → total), and rolling-origin evaluation that mirrors deployment. Regularized linear and ensemble baselines are often strong and transparent; deep models help when there is long history, many related series, or rich promotion/price signals that benefit from learned embeddings. Representation choices (levels vs. changes; price vs. return proxies) should be tied to the loss used for decisions, and governance (lineage, stockouts/censoring, ID changes) must be explicit [5,18,34].
Bottom line for company-level. Build seasonality- and event-aware baselines, respect hierarchy in modeling and scoring, and escalate to deep architectures when history and cross-series signals justify it; reproducibility and lineage often matter as much as model class [5,18,22].
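Hierarchical coherence (SKU → category → total) can be enforced in its simplest form by bottom-up aggregation: forecast at the SKU level and sum upward so that all levels agree exactly. The sketch below uses hypothetical SKU names and a made-up hierarchy mapping; more sophisticated reconciliation schemes exist, but bottom-up is the baseline most of the reviewed work starts from.

```python
def bottom_up_forecasts(sku_forecasts, hierarchy):
    """Bottom-up reconciliation sketch: aggregate SKU-level forecasts so
    that category and total forecasts are coherent (they sum exactly).
    'hierarchy' maps each category to its SKUs; an illustrative schema."""
    category = {cat: sum(sku_forecasts[s] for s in skus)
                for cat, skus in hierarchy.items()}
    total = sum(category.values())
    return category, total

cats, total = bottom_up_forecasts(
    {"sku_a": 10.0, "sku_b": 5.0, "sku_c": 7.0},
    {"drinks": ["sku_a", "sku_b"], "snacks": ["sku_c"]})
```

Scoring should then respect the same hierarchy: evaluating only the total can hide large, offsetting errors at the SKU level.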

4.4. Cross-Cutting Guidance

Define the decision and loss first, then choose a representation that matches it (returns for direction; levels/changes for magnitude). Start with regularized linear models and RF/GBM; adopt time-respectful validation—rolling/expanding windows in markets and firms, vintage evaluation in macro—and report out-of-time performance and calibration. Escalate to LSTM/attention when residual structure indicates unresolved temporal dependence or when the problem relies on long context; always compare against a strong ensemble under identical protocols. Treat leakage control and reproducibility as first-class, not afterthoughts [3,6,9,11,17,22,34].
To summarize, the choice of evaluation metric should be aligned with the decision context of the forecasting task rather than applied mechanically. See Table 2 for a task-to-metric mapping that aligns evaluation with the decision context.
  • Directional classification (e.g., predicting up/down market moves) is best evaluated with log-loss, Brier score, or calibration curves, since these capture not only classification accuracy but also the reliability of predicted probabilities.
  • Magnitude regression (e.g., forecasting GDP growth, inflation, or asset returns) is typically assessed using MAE and RMSE. MAPE is sometimes applied for interpretability, but its instability near zero targets makes it less reliable in macro econometric or financial settings.
  • Portfolio or decision-driven tasks (e.g., trading strategies or asset allocation) require utility-adjusted or cost-sensitive losses, such as realized profit-and-loss, Sharpe ratio, or downside risk measures. These metrics ensure that evaluation reflects the economic value of predictions, not only their statistical accuracy.
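The two directional metrics in the first bullet are easy to state precisely; the sketch below gives minimal pure-Python versions of log-loss and the Brier score and shows why log-loss is the sharper guard against overconfidence.

```python
import math

def log_loss(y_true, p_pred, eps=1e-12):
    """Average binary cross-entropy; heavily penalizes confident errors."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)      # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def brier_score(y_true, p_pred):
    """Mean squared error of predicted probabilities; lower is better,
    and it is bounded in [0, 1] unlike log-loss."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_pred)) / len(y_true)

# A confident wrong forecast is punished far more by log-loss than a
# mildly wrong one (-ln 0.6 vs. -ln 0.01):
mild = log_loss([1], [0.6])    # about 0.51
harsh = log_loss([1], [0.01])  # about 4.61
```

Calibration curves complement both scores by plotting predicted probabilities against observed frequencies, revealing systematic over- or under-confidence that a single scalar cannot.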

5. Challenges and Open Problems

Structural changes (policy moves, crises, liquidity shocks) alter the data-generating process and the payoff to signals. Ensemble baselines are comparatively stable but still degrade when relationships drift. Credible work reports out-of-time results and stability across subperiods; shift-aware validation and simple online updates that respect temporal causality remain underused [3,6,9].
In macro nowcasting, final vintages do not exist at forecast time. Studies using vintage or pseudo-real-time protocols often find model rankings change versus ex-post tests. The field would benefit from maintained public vintage panels, shared release calendars, and standardized ragged-edge handling with explicit revision-sensitivity reporting [4,5,17,22].
Leakage is the most common reason for irreproducible gains: overlapping label windows, feature construction from future bars, and silent censoring/ID changes in firm data. Robust practice—forward-only features, non-overlapping targets, time-ordered splits, and data lineage—is clear but unevenly enforced. Tooling that makes these practices the default would raise the empirical floor [3,22,34].
Directional tasks and rare events face class imbalance. Many papers optimize accuracy yet deploy thresholded decisions whose economics require calibrated probabilities. Pair discrimination metrics with calibration checks and validate thresholds out-of-time; consider cost-sensitive training when decisions are asymmetric [3,6].
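When error costs are asymmetric, the cost-minimizing decision threshold for a calibrated probability follows from a short expected-cost argument: acting costs (1 − p)·c_fp, not acting costs p·c_fn, so one should act when p exceeds c_fp/(c_fp + c_fn). The sketch below encodes that derivation; the cost values in the example are hypothetical.

```python
def expected_cost_threshold(cost_false_pos, cost_false_neg):
    """Cost-minimizing threshold for a calibrated binary probability:
    acting on probability p costs (1-p)*c_fp in expectation, abstaining
    costs p*c_fn, so act whenever p > c_fp / (c_fp + c_fn).
    A derivation sketch, not a fitted model."""
    return cost_false_pos / (cost_false_pos + cost_false_neg)

# If a false alarm costs 1 unit and a missed event costs 4, the
# threshold drops from the naive 0.5 down to 0.2:
t = expected_cost_threshold(1.0, 4.0)
```

This is why calibration matters for thresholded decisions: the formula is only valid if p is a reliable probability, and why validating the chosen threshold out-of-time remains necessary.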
Enterprises favor models that are stable under moderate re-tuning and provide intelligible drivers. Tree ensembles offer diagnostics (importance, partial dependence) and robust baselines; deep sequence models need additional explanation layers and sensitivity analyses. Governance artifacts—change logs, challenger-vs-champion tests, sign checks—are still rare [3,5,9,18].
Sequence models (especially attention/transformers) require substantial data and computation. Reported gains are promising for long contexts and multimodal inputs, but fair comparisons against strong ensembles under matched budgets are not universal. Cost-aware benchmarks and data-efficient methods (pretraining, transfer) would help [3,6,11].
Results often hinge on a specific market, country, or firm set. Cross-market studies and meta-analyses that quantify effect sizes across datasets would clarify where algorithmic complexity pays off versus where representation and evaluation hygiene dominate [3,4,34,40].
Research agenda (concise). Standardize shift-aware reporting; build vintage-accurate macro benchmarks; integrate decision-aware evaluation and calibration; mandate reproducibility and lineage; and evaluate sequence models against equal-effort ensemble baselines under matched costs [3,6,9,11,17,22].

6. Conclusions

This review synthesized how machine learning supports forecasting across financial markets, macroeconomics, and company-level planning. For tabular problems at daily/weekly or business cadences, tree ensembles such as Random Forests and gradient boosting remain strong, stable baselines. At high frequency, LSTM-based ensembles often improve directional hit-rates and calibration when inputs are windowed returns/volatility and evaluation is strictly time-ordered; attention/transformer variants are promising for long contexts and multimodal inputs, though their benefits must be weighed against complexity. Outcomes depend as much on representation and loss–metric alignment as on the algorithm itself [3,6,9,11,34].
In macroeconomics, dependable performance often comes from regularized linear models and ensembles applied to growth-rate or differenced indicators with lag stacks, evaluated in vintage or pseudo-real-time setups. In company settings, seasonality- and event-aware baselines with hierarchical coherence and rolling-origin evaluation yield robust results; deep models help when long histories and many related series allow the model to learn shared structure. These practical choices (representation, evaluation protocol, governance) are repeatedly more decisive than adding architectural complexity [4,5,17,18,40].
Practical guidance is as follows: define the decision and loss first; choose a representation that matches it (returns for direction; levels/changes for magnitude); start with regularized linear and RF/GBM baselines; and adopt time-respectful validation (rolling/expanding in markets and firms; vintage in macro). Escalate to LSTM/attention when residual structure indicates unresolved temporal dependence or when the problem relies on long context, and always compare against a strong ensemble under identical protocols. Treat calibration, leakage control, and reproducibility as first-class objectives [3,6,9,17,22,34].
Limitations of this review: This review has several limitations that should be acknowledged. First, the selection of literature did not follow a systematic search protocol across defined databases; instead, sources were identified through exploratory searches (e.g., Google Scholar, practitioner articles) and guided by the authors’ learning path through machine learning courses (Coursera, Udemy) and practical experimentation in Python. As such, relevant studies may have been overlooked, particularly those published in specialized outlets or non-English sources. Second, the emphasis of this review is deliberately pedagogical, beginning with foundational predictive methods such as regression, cost functions, and activation functions, before advancing to more complex architectures and applications in finance. This choice of scope means that certain cutting-edge areas of AI in finance (e.g., reinforcement learning, agent-based modeling, applications to decentralized finance and cryptocurrencies) are not addressed in detail. Third, the field of AI in finance is rapidly evolving, and any snapshot risks becoming incomplete as new datasets, methods, and applications emerge. Future updates will be needed to incorporate these developments and to provide a more systematic comparison of methods across domains. Finally, while a narrative review allows integration across diverse technical and applied literatures, it does not provide the quantitative aggregation of evidence that a meta-analysis or systematic review would offer. Readers should therefore treat the conclusions as a structured synthesis of existing knowledge rather than a definitive ranking of methods.
Research agenda: Priorities include shift-robust pipelines with routine out-of-time/regime-split reporting; vintage-accurate macro benchmarks with shared calendars and revision-sensitivity; decision-aware evaluation pairing accuracy with calibration and threshold economics; governance and reproducibility by default; and cost-aware sequence modeling that tests LSTM/attention against strong ensembles under matched budgets and data, including high-dimensional linear ensemble baselines such as Complete Subset Regression, with pretraining/transfer explored where appropriate [3,6,9,11,17,22].

Author Contributions

Conceptualization, F.G.P. and V.M.; methodology, F.G.P. and V.M.; software, F.G.P.; validation, F.G.P. and V.M.; formal analysis, F.G.P.; investigation, F.G.P.; resources, F.G.P. and V.M.; data curation, F.G.P.; writing—original draft preparation, F.G.P.; writing—review and editing, F.G.P. and V.M.; visualization, F.G.P.; supervision, V.M.; project administration, V.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

No new data were created or analyzed in this study.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AI: Artificial Intelligence
ARIMA: Autoregressive Integrated Moving Average
AUC: Area Under the Curve
BOJ: Bank of Japan
CPI: Consumer Price Index
ECB: European Central Bank
ENS: Ensemble Methods
FNN: Feedforward Neural Network
GDP: Gross Domestic Product
IMF: International Monetary Fund
ISM: Institute for Supply Management
KPIs: Key Performance Indicators
LSTM: Long Short-Term Memory
MAE: Mean Absolute Error
MAPE: Mean Absolute Percentage Error
MASE: Mean Absolute Scaled Error
ML: Machine Learning
NN: Neural Network
OLS: Ordinary Least Squares
PMI: Purchasing Managers’ Index
ReLU: Rectified Linear Unit
RF: Random Forest
RMSE: Root Mean Squared Error
RNN: Recurrent Neural Network
SGD: Stochastic Gradient Descent
SiLU: Sigmoid Linear Unit
sMAPE: Symmetric Mean Absolute Percentage Error
SVM: Support Vector Machine

References

  1. Maehashi, K.; Shintani, M. Macroeconomic Forecasting Using Factor Models and Machine Learning: An Application to Japan. J. Jpn. Int. Econ. 2020, 58, 101104. [Google Scholar] [CrossRef]
  2. Muskaan, M.; Sarangi, P.K. A Literature Review on Machine Learning Applications in Financial Forecasting. JTMGE 2020, 11, 23–27. [Google Scholar] [CrossRef]
  3. Masini, R.P.; Medeiros, M.C.; Mendes, E.F. Machine Learning Advances for Time Series Forecasting. arXiv 2021, arXiv:2012.12802. [Google Scholar] [CrossRef]
  4. Crone, S.F.; Hibon, M.; Nikolopoulos, K. Advances in Forecasting with Neural Networks? Empirical Evidence from the NN3 Competition on Time Series Prediction. Int. J. Forecast. 2011, 27, 635–660. [Google Scholar] [CrossRef]
  5. Fama, E.F. Efficient Capital Markets: A Review of Theory and Empirical Work. J. Financ. 1970, 25, 383–417. [Google Scholar] [CrossRef]
  6. Long, X.; Kampouridis, M.; Jarchi, D. An In-Depth Investigation of Genetic Programming and Nine Other Machine Learning Algorithms in a Financial Forecasting Problem. In Proceedings of the 2022 IEEE Congress on Evolutionary Computation (CEC), Padua, Italy, 18–23 July 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 1–8. [Google Scholar]
  7. Rayner, B.; Bolhuis, M. Deus Ex Machina? A Framework for Macro Forecasting with Machine Learning. IMF Work. Pap. 2020, 2020, 25. [Google Scholar] [CrossRef]
  8. Coulombe, P.G. The Macroeconomy as a Random Forest. arXiv 2020, arXiv:2006.12724. [Google Scholar] [CrossRef]
  9. Smalter Hall, A. Machine Learning Approaches to Macroeconomic Forecasting. Econ. Rev. 2018. [Google Scholar] [CrossRef]
  10. Elliott, G.; Gargano, A.; Timmermann, A. Complete Subset Regressions. J. Econom. 2013, 177, 357–373. [Google Scholar] [CrossRef]
  11. Lanbouri, Z.; Achchab, S. Stock Market Prediction on High Frequency Data Using Long-Short Term Memory. Procedia Comput. Sci. 2020, 175, 603–608. [Google Scholar] [CrossRef]
  12. Borovkova, S.; Tsiamas, I. An Ensemble of LSTM Neural Networks for High-frequency Stock Market Classification. J. Forecast. 2019, 38, 600–619. [Google Scholar] [CrossRef]
  13. Sherstinsky, A. Fundamentals of Recurrent Neural Network (RNN) and Long Short-Term Memory (LSTM) Network. Phys. D Nonlinear Phenom. 2020, 404, 132306. [Google Scholar] [CrossRef]
  14. Wang, Z.; Kong, F.; Feng, S.; Wang, M.; Yang, X.; Zhao, H.; Wang, D.; Zhang, Y. Is Mamba Effective for Time Series Forecasting? arXiv 2024, arXiv:2403.11144. [Google Scholar] [CrossRef]
  15. Liang, A.; Jiang, X.; Sun, Y.; Shi, X.; Li, K. Bi-Mamba+: Bidirectional Mamba for Time Series Forecasting. arXiv 2024, arXiv:2404.15772. [Google Scholar]
  16. Li, Q.; Qin, J.; Cui, D.; Sun, D.; Wang, D. CMMamba: Channel Mixing Mamba for Time Series Forecasting. J. Big Data 2024, 11, 153. [Google Scholar] [CrossRef]
  17. Khan, S.; Naseer, M.; Hayat, M.; Zamir, S.W.; Khan, F.S.; Shah, M. Transformers in Vision: A Survey. ACM Comput. Surv. 2022, 54, 1–41. [Google Scholar] [CrossRef]
  18. Nwankpa, C.; Ijomah, W.; Gachagan, A.; Marshall, S. Activation Functions: Comparison of Trends in Practice and Research for Deep Learning. In Proceedings of the 2nd International Conference on Computational Sciences and Technology, Jamshoro, Pakistan, 17–19 December 2018. [Google Scholar]
  19. Patel, J.; Shah, S.; Thakkar, P.; Kotecha, K. Predicting Stock Market Index Using Fusion of Machine Learning Techniques. Expert. Syst. Appl. 2015, 42, 2162–2172. [Google Scholar] [CrossRef]
  20. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
  21. Paruchuri, H. Conceptualization of Machine Learning in Economic Forecasting. Asian Bus. Rev. 2021, 11, 51–58. [Google Scholar] [CrossRef]
  22. Wasserbacher, H.; Spindler, M. Machine Learning for Financial Forecasting, Planning and Analysis: Recent Developments and Pitfalls. Digit. Financ. 2022, 4, 63–88. [Google Scholar] [CrossRef]
  23. Arpit, D.; Jastrzębski, S.; Ballas, N.; Krueger, D.; Bengio, E.; Kanwal, M.S.; Maharaj, T.; Fischer, A.; Courville, A.; Bengio, Y.; et al. A Closer Look at Memorization in Deep Networks. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017. [Google Scholar]
  24. Stempien, D.; Slepaczuk, R. Hybrid Models for Financial Forecasting: Combining Econometric, Machine Learning, and Deep Learning Models. arXiv 2025, arXiv:2505.19617. [Google Scholar]
  25. Gallego-García, S.; García-García, M. Predictive Sales and Operations Planning Based on a Statistical Treatment of Demand to Increase Efficiency: A Supply Chain Simulation Case Study. Appl. Sci. 2020, 11, 233. [Google Scholar] [CrossRef]
  26. Fan, J.; Xue, L.; Zou, H. Strong Oracle Optimality of Folded Concave Penalized Estimation. Ann. Stat. 2014, 42, 819–849. [Google Scholar] [CrossRef] [PubMed]
  27. Mienye, I.D.; Swart, T.G.; Obaido, G. Recurrent Neural Networks: A Comprehensive Review of Architectures, Variants, and Applications. Information 2024, 15, 517. [Google Scholar] [CrossRef]
  28. Wallbridge, J. Transformers for Limit Order Books. arXiv 2020, arXiv:2003.00130. [Google Scholar] [CrossRef]
  29. Berti, L.; Kasneci, G. TLOB: A Novel Transformer Model with Dual Attention for Price Trend Prediction with Limit Order Book Data. arXiv 2025, arXiv:2502.15757. [Google Scholar]
  30. Araci, D. FinBERT: Financial Sentiment Analysis with Pre-Trained Language Models. arXiv 2019, arXiv:1908.10063. [Google Scholar]
  31. Lim, B.; Arik, S.O.; Loeff, N.; Pfister, T. Temporal Fusion Transformers for Interpretable Multi-Horizon Time Series Forecasting. arXiv 2019, arXiv:1912.09363. [Google Scholar] [CrossRef]
  32. Yang, J.; Li, P.; Cui, Y.; Han, X.; Zhou, M. Multi-Sensor Temporal Fusion Transformer for Stock Performance Prediction: An Adaptive Sharpe Ratio Approach. Sensors 2025, 25, 976. [Google Scholar] [CrossRef]
  33. Kamalov, F.; Gurrib, I.; Rajab, K. Financial Forecasting with Machine Learning: Price Vs Return. J. Comput. Sci. 2021, 17, 251–264. [Google Scholar] [CrossRef]
  34. Alemu, H.Z.; Wu, W.; Zhao, J. Feedforward Neural Networks with a Hidden Layer Regularization Method. Symmetry 2018, 10, 525. [Google Scholar] [CrossRef]
  35. Buckmann, M.; Joseph, A. An Interpretable Machine Learning Workflow with an Application to Economic Forecasting. SSRN J. 2022. [Google Scholar] [CrossRef]
  36. Friedman, J.H. Greedy Function Approximation: A Gradient Boosting Machine. Ann. Stat. 2001, 29, 1189–1232. [Google Scholar] [CrossRef]
  37. Yoon, J. Forecasting of Real GDP Growth Using Machine Learning Models: Gradient Boosting and Random Forest Approach. Comput. Econ. 2021, 57, 247–265. [Google Scholar] [CrossRef]
  38. Agrawal, A.; Gans, J.; Goldfarb, A. The Economics of Artificial Intelligence: An Agenda; University of Chicago Press: Chicago, IL, USA, 2019; ISBN 978-0-226-61333-8. [Google Scholar]
  39. Ojha, V.K.; Abraham, A.; Snášel, V. Metaheuristic Design of Feedforward Neural Networks: A Review of Two Decades of Research. Eng. Appl. Artif. Intell. 2017, 60, 97–116. [Google Scholar] [CrossRef]
  40. Fildes, R.; Stekler, H. The State of Macroeconomic Forecasting. J. Macroecon. 2002, 24, 435–468. [Google Scholar] [CrossRef]
  41. Guo, H. Predicting High-Frequency Stock Market Trends with LSTM Networks and Technical Indicators. AEMPS 2024, 139, 235–244. [Google Scholar] [CrossRef]
  42. Jegadeesh, N.; Titman, S. Returns to Buying Winners and Selling Losers: Implications for Stock Market Efficiency. J. Financ. 1993, 48, 65–91. [Google Scholar] [CrossRef]
  43. Asness, C.S.; Moskowitz, T.J.; Pedersen, L.H. Value and Momentum Everywhere. J. Financ. 2013, 68, 929–985. [Google Scholar] [CrossRef]
  44. Andersen, T.G.; Bollerslev, T.; Diebold, F.X.; Labys, P. Modeling and Forecasting Realized Volatility. SSRN J. 2001. [Google Scholar] [CrossRef]
  45. López de Prado, M.M. Advances in Financial Machine Learning; Wiley: Hoboken, NJ, USA, 2018; ISBN 978-1-119-48208-6. [Google Scholar]
  46. Bailey, D.H.; López De Prado, M. The Deflated Sharpe Ratio: Correcting for Selection Bias, Backtest Overfitting, and Non-Normality. JPM 2014, 40, 94–107. [Google Scholar] [CrossRef]
  47. Sirignano, J.; Cont, R. Universal Features of Price Formation in Financial Markets: Perspectives from Deep Learning. SSRN J. 2018. [Google Scholar] [CrossRef]
  48. Zhang, Z.; Zohren, S.; Roberts, S. DeepLOB: Deep Convolutional Neural Networks for Limit Order Books. arXiv 2018. [Google Scholar] [CrossRef]
  49. Giannone, D.; Reichlin, L.; Small, D. Nowcasting: The Real-Time Informational Content of Macroeconomic Data. J. Monet. Econ. 2008, 55, 665–676. [Google Scholar] [CrossRef]
  50. Croushore, D.; Stark, T. A Real-Time Data Set for Macroeconomists. J. Econom. 2001, 105, 111–130. [Google Scholar] [CrossRef]
  51. Ghysels, E.; Sinko, A.; Valkanov, R.I. MIDAS Regressions: Further Results and New Directions. SSRN J. 2006. [Google Scholar] [CrossRef]
Figure 1. General workflow of a machine learning forecasting pipeline. A training dataset is processed by a learning algorithm to produce a trained model, which is then applied to generate predictions.
Figure 2. “Bowl shape” representation of the cost function where we have one global minimum.
Table 1. Recommended temporal validation protocols for different forecasting domains.
Domain/Data Type | Recommended Protocol(s) | Key Considerations
Financial markets (high-frequency, daily prices, order book data) | Rolling windows with walk-forward evaluation; apply embargo periods to prevent look-ahead bias | Market data are highly non-stationary; overlapping information can cause leakage; evaluation should mimic real-time trading.
Macroeconomic indicators (GDP, CPI, unemployment, PMI) | Expanding windows using only information available at each point in time; ideally use vintage/real-time datasets | Data revisions and publication lags must be respected; forecasts should only use contemporaneously available data.
Firm-level data (sales, earnings, planning variables) | Rolling-origin evaluation (train on early history, then move forward step by step); sometimes expanding windows if longer history is available | Firm-level datasets are often short and subject to structural breaks; align validation with the decision horizon (monthly, quarterly).
