Review

Big Data and Graph Deep Learning for Financial Decision Support from Social Networks: A Critical Review

by Leonidas Theodorakopoulos * and Alexandra Theodoropoulou
Department of Management Science and Technology, University of Patras, 26334 Patras, Greece
* Author to whom correspondence should be addressed.
Electronics 2026, 15(7), 1405; https://doi.org/10.3390/electronics15071405
Submission received: 25 February 2026 / Revised: 20 March 2026 / Accepted: 25 March 2026 / Published: 27 March 2026
(This article belongs to the Special Issue Deep Learning and Data Analytics Applications in Social Networks)

Abstract

Social network content is increasingly used as an auxiliary evidence stream for financial monitoring, risk assessment, and short-horizon decision support, yet many reported gains are hard to interpret because observability, timing, and attribution are handled inconsistently across studies. This review critically synthesizes the end-to-end pipeline that transforms social posts, interaction traces, linked artifacts, and related signals into decision-facing indicators, emphasizing evidence provenance, sampling bias, conditioning (bot/spam filtering, entity linking, timestamp alignment), and the modeling blocks typically used (text, temporal, relational, and fusion components) under deployment constraints. Across sentiment, relational, and multimodal or cross-platform signals, the analysis finds that apparent improvements often depend more on alignment discipline and conservative attribution than on architectural novelty, and that performance can be inflated by attention confounds, temporal leakage, and visibility effects. Relational indicators are most defensible for monitoring coordination and propagation patterns, while multimodal gains require clear ablations and realistic missing-modality tests. To support decision readiness, the paper consolidates assurance requirements covering manipulation, degraded observability, calibration and traceability, and provides compact reporting checklists and failure-mode mitigations. Overall, the review supports bounded claims and argues for time-aware evaluation and auditable pipelines as prerequisites for operational use.

1. Introduction

Financial decision support increasingly draws on big data streams that are not produced for markets but still shape expectations and behavior [1]. Social networks are a prominent case: they generate high-volume, high-velocity, heterogeneous evidence (text, interaction traces, links, and media) under shifting access constraints, deletions, and platform-driven visibility filters [2]. The practical question is not whether these streams can correlate with market outcomes, but when they can be treated as operational evidence rather than as a noisy record of attention and amplification.
Deep learning has made it technically feasible to process these inputs at scale. Contextual encoders can represent short, informal posts; temporal modules can model bursty arrival patterns and session cutoffs; graph encoders can exploit interaction structure and cross-entity dependencies; fusion mechanisms can combine text with interaction traces, linked artifacts, and basic market controls [3,4]. However, the literature often reports performance improvements without fully specifying how evidence was observed, aligned to decision time, or protected against leakage. As a result, similar headline metrics can reflect very different problem formulations and very different risks [5]. A second gap is interpretability at the level that matters for finance. In decision settings, it is rarely sufficient to provide a post hoc explanation of a model output. What matters is whether an output can be traced back to evidence that was observable before the decision cutoff, whether attribution to instruments is defensible under ambiguity, and whether the pipeline degrades predictably under missing data, manipulation, and drift. These constraints link modeling choices to governance requirements and to the real costs of errors [6,7].
This review addresses the field as an end-to-end decision pipeline rather than as a collection of architectures. Social evidence is treated as selectively observed and potentially adversarial. Modeling is discussed in terms of what it contributes under these constraints, not only in terms of representational capacity. Task signals are analyzed as constructed indicators whose meaning depends on alignment, aggregation, and reliability controls. Assurance and auditability are then used to bound what claims are defensible and comparable across studies. The review focuses primarily on finance-facing applications in which social network evidence is used for risk monitoring, volatility-sensitive forecasting, event-driven market interpretation, manipulation or coordination surveillance, and short-horizon decision support, rather than for long-horizon asset-pricing or portfolio-optimization tasks.

1.1. Contributions of This Study

This review makes the following contributions:
  • It reframes social network analytics for finance around evidence observability and decision cutoffs, clarifying why platform access, ranking effects, and deletions change what results can legitimately claim.
  • It synthesizes model components (text, temporal, relational, fusion, deployment constraints) through their failure modes and assumptions, rather than treating architectures as an end in themselves.
  • It organizes the applied literature by signal type (sentiment aggregation, relational indicators, multimodal and cross-platform signals) and explains how each signal becomes decision-facing only after attribution and alignment steps.
  • It introduces an assurance-oriented perspective, emphasizing manipulation risk, robustness under missing evidence and delay, calibration under shift, and traceability of outputs back to time-appropriate evidence.
  • It consolidates bounded findings, recurring failure modes, and research directions that follow directly from validity gaps rather than from model fashion.

1.2. Scope and Limitations

This review focuses on how social network evidence is transformed into decision-facing signals through deep learning pipelines, with emphasis on time alignment, attribution, and assurance properties that affect validity. It does not attempt to rank methods by headline predictive metrics across studies, because tasks, horizons, evidence access, and leakage controls differ in ways that often make such comparisons uninterpretable. Claims are therefore presented as bounded by evidence provenance, observability constraints, and evaluation design choices reported in the underlying studies.

1.3. Paper Structure

Section 2 describes the critical narrative review methodology and literature-scoping approach followed in this study. Section 3 treats social network data as operational evidence and examines provenance, sampling bias, conditioning, and practical data issues that directly affect evaluation. Section 4 reviews the main model components used to represent and combine this evidence under timing and observability constraints. Section 5 analyzes how social evidence is transformed into decision-facing task signals, covering sentiment aggregation, relational indicators, multimodal and cross-platform integration, and transfer under drift. Section 6 focuses on assurance and decision readiness through the lens of network structure and information diffusion, with emphasis on manipulation, synthetic content, platform bias, information cascades, and traceability requirements. Section 7 synthesizes stable findings and common failure modes in relation to real-time processing and scalability constraints. Section 8 outlines research directions and Section 9 concludes the paper.

2. Methods

This article is a critical narrative review rather than a systematic or PRISMA-style review. The aim is not to exhaustively enumerate every published study but to synthesize the methodological, operational, and assurance-related issues that shape whether social network evidence can be used responsibly in financial decision support. The literature was therefore selected to provide broad coverage of the end-to-end pipeline addressed in this paper, including evidence provenance and sampling, conditioning and alignment, text and temporal modeling, graph-based and multimodal architectures, signal construction, and deployment-oriented assurance concerns.
To support this synthesis, the review was developed through targeted searches of major scholarly sources, principally Google Scholar, Scopus, and Web of Science, combined with backward and forward inspection of references in relevant papers. Priority was given to peer-reviewed English-language publications, including journal articles, conference papers, and selected scholarly book chapters where they contributed directly to the methodological argument. Emphasis was placed on work published from 2015 onward, reflecting the period in which transformer-based, graph-based, and multimodal deep learning methods became more central to social–financial analytics.
Study selection was guided by relevance rather than by formal exclusion counts. Papers were included when they addressed the use of social network or closely related online behavioral data in finance-facing tasks or when they offered methodological insight directly relevant to the pipeline examined here, such as entity linking, coordinated activity detection, temporal evaluation, multimodal fusion, calibration, or auditability. Studies were not included simply because they reported high predictive accuracy; preference was given to work that made the data pipeline, timing assumptions, attribution strategy or evaluation design sufficiently visible to support methodological interpretation. Because the purpose of the review is critical synthesis rather than exhaustive evidence aggregation, the manuscript does not report PRISMA flow statistics or claim comprehensiveness in the strict systematic-review sense. Instead, it aims to provide a structured and reproducible account of how the reviewed literature was scoped and why the resulting claims are intentionally bounded.

3. Social Network Data in Financial Contexts

This section treats social data as operational evidence, not as an abstract text corpus. The central critique is that many studies implicitly assume the evidence is neutral, complete, and temporally well-defined. In practice, the evidence is shaped by platform access constraints, ranking and moderation systems, participant self-selection, and measurement error introduced during collection and preprocessing. These factors do not merely add noise; they can induce systematic bias and can also manufacture apparent predictability through leakage or misalignment.
Section 3 frames social network inputs as big data in the strict sense: large scale, rapid arrival, heterogeneous formats, and structured missingness caused by platform access limits, ranking, and moderation, all of which directly shape what can be evaluated and what can be trusted.

3.1. Evidence Types, Provenance, Sampling Bias

Social network inputs should be treated as observational evidence produced under platform-specific constraints, not as a neutral record of public opinion [8]. Posts and interactions are generated by heterogeneous actors with different incentives, and they are made visible through ranking, recommendation, and moderation systems on Twitter, as well as through platform governance and moderation processes that affect what remains observable and what spreads [9,10]. As a result, the dataset that researchers and practitioners actually collect is typically a filtered slice of underlying activity, with systematic distortions that can materially affect downstream financial tasks [8].
A useful distinction is between content, interaction traces, and asset-linked mentions. Content includes text, images, video, URLs, and quoted or reshared fragments. Interaction traces include replies, reshares, mentions, follower ties, co-comment patterns, and engagement counts. Asset-linked mentions include cashtags, ticker strings, token symbols, protocol names, and company or product references that are mapped to tradable instruments; in practice, this mapping can be non-trivial because symbols collide across venues and communities and therefore requires explicit disambiguation [11,12]. In StockTwits specifically, asset-linked posting conventions (cashtags and crypto tickers ending with “.X”) make the linkage explicit, but they also introduce their own platform-specific constraints and edge cases [13].
Provenance determines what the evidence represents. Collection methods (official APIs, scraping, archives, third-party providers) impose different sampling mechanisms, rate limits, and coverage gaps, so the same platform can yield materially different datasets depending on how accounts and posts are selected and whether the sampling frame is designed to be representative [14,15]. Keyword- or account-based collection strategies can further skew what is observed by disproportionately capturing high-activity or high-salience narratives, which makes model results partly a property of the collection rule rather than the underlying behavior one intends to study [16]. Deletions and edits add another layer of uncertainty because the archived record can diverge from what was observable at the decision cutoff, especially in politically or commercially contested contexts where decay and removal are non-random [17]. When provenance is under-specified, reported improvements become difficult to interpret because the evidence-generating process is unknown, unstable across time, and often non-comparable across studies [14].
Sampling bias is structural. Visibility bias arises because what becomes observable is shaped by platform exposure controls and activity concentration (including the over-representation of hyperactive actors under common sampling choices), so popular messages are not a random sample of what was posted [18]. Self-selection bias follows from the fact that finance-oriented posting communities are not interchangeable with the broader investor population: social media-based investment advice use is patterned, and high-attention communities can display distinctive risk-taking and decision styles relative to routine investors. Event-conditioned sampling occurs when datasets are assembled around high-attention episodes, which can inflate apparent predictability relative to routine periods because both participation and visibility regimes change during peaks [19]. Finally, longitudinal analyses are affected by churn in accounts and content availability (deletions, suspensions, protection), which can change the composition of observed evidence even when the market environment is otherwise comparable [20].
These biases are not only theoretical. Empirical evidence from high-attention retail communities shows that social media attention can shift investor behavior in systematically non-representative ways: for example, attention generated on r/wallstreetbets has been associated with riskier stock selection, larger position sizes, and lower holding-period returns, indicating that self-selected, attention-intensive communities are not neutral mirrors of the broader investor population [19]. A related lesson appears in user-centric prediction settings. Bouadjenek et al. [21] show that predictive usefulness is not evenly distributed across participants, that different users are informative over different horizons, and that removing consistently incorrect users can improve simple stock market prediction from self-labeled social media posts. This strengthens the broader point made here: sampling strategy changes not only what is observed, but also which behavioral subpopulation the model is effectively trained to trust.
These issues imply that social network signals are not only noisy, but also conditionally observed. Empirical claims should therefore be interpreted as contingent on platform access, visibility mechanics, community composition, and the specific sampling strategy used to build the dataset.
Figure 1 illustrates how the observed dataset used in most studies is not a direct sample of underlying social network activity but the output of a multi-stage filtering process. Raw evidence—spanning content, interaction traces, and asset-linked mentions—first passes through platform-level mechanisms such as ranking, moderation, and deletion, which determine what is visible at any given time. Collection methods then impose a second layer of selection through API rate limits, keyword filters, and provider-specific coverage constraints. On top of these, structural sampling biases further narrow and distort what remains: visibility effects over-represent amplified content, self-selection skews the population toward particular communities, event-conditioned sampling inflates coverage of high-attention episodes, and community turnover alters evidence composition over time. Each layer introduces information loss that is neither random nor easily characterized after the fact. The resulting dataset should therefore be understood as conditionally observed, and any downstream claim inherits the assumptions embedded in every upstream filtering step.

3.2. Conditioning: Bots, Spam, Entity Resolution, Timestamp Alignment

Conditioning defines what is treated as admissible evidence and determines whether the resulting signal reflects organic behavior, coordinated activity, or platform amplification. The same downstream model can behave very differently under small changes in filtering thresholds, deduplication rules, or entity linking. For this reason, conditioning should be described as a core methodological component, with clear assumptions and failure modes [21,22].
Automated and coordinated activity is a persistent feature of finance-related social channels. Some accounts generate high-volume promotional content, others coordinate timing and phrasing across groups, and some are designed to inflate engagement metrics. Treating this activity as ordinary participation can distort both content-based features and network measures. Conversely, overly aggressive filtering can remove legitimate high-frequency actors or communities with distinctive posting patterns [23,24]. In practice, bot and coordination filtering usually combines several signal types rather than relying on a single detector: profile-level metadata, posting-rate and burstiness features, repetition and duplication cues, timing synchronization across accounts, network or cascade structure, and, increasingly, ensemble or drift-aware classifiers that can be updated when platform behavior changes [23,24,25,26,27]. Conditioning pipelines often rely on detectors trained on specific platforms and time periods; when platform behavior shifts, these detectors can degrade, altering the evidence distribution in ways that are difficult to detect without explicit monitoring [25].
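As a concrete illustration of combining several weak cues rather than relying on a single detector, the sketch below scores an account from posting-rate, duplication, burstiness, and timing-synchronization features and gates admissibility on a threshold. The feature set, saturation constants, weights, and threshold are illustrative assumptions, not values taken from any cited system.

```python
from dataclasses import dataclass

@dataclass
class AccountFeatures:
    posts_per_hour: float     # posting-rate cue
    duplicate_ratio: float    # share of near-duplicate posts, in [0, 1]
    burstiness: float         # dispersion of inter-post gaps (variance / mean)
    sync_score: float         # timing overlap with other flagged accounts, in [0, 1]

def coordination_score(f: AccountFeatures,
                       weights=(0.25, 0.30, 0.15, 0.30)) -> float:
    """Blend several weak cues into a single suspicion score in [0, 1].
    The saturation constants and weights are illustrative, not calibrated."""
    rate = min(f.posts_per_hour / 20.0, 1.0)   # saturate at 20 posts/hour
    burst = min(f.burstiness / 5.0, 1.0)
    cues = (rate, f.duplicate_ratio, burst, f.sync_score)
    return sum(w * c for w, c in zip(weights, cues))

def is_admissible(f: AccountFeatures, threshold: float = 0.6) -> bool:
    """Keep an account's posts as evidence only below the threshold; the
    threshold should be revalidated whenever platform behavior shifts."""
    return coordination_score(f) < threshold
```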
Spam, templated promotion, and duplication create additional ambiguity. Near-duplicate posts may reflect coordinated campaigns, routine reposting, or quotation cascades. Removing duplicates can reduce low-information repetition, yet it can also erase propagation structure that is relevant for influence analysis or manipulation detection. The appropriate handling depends on the task: for sentiment-to-signal pipelines, deduplication may reduce artificial emphasis; for surveillance settings, duplication patterns may be part of the phenomenon of interest [26,27].
Entity resolution is a common point of hidden error. Ticker strings and token symbols are ambiguous, company names overlap, and reference vocabularies change over time through rebranding, mergers and symbol updates. Informal language introduces misspellings, slang, sarcasm and multilingual mentions that defeat naive keyword matching. In practice, current pipelines range from rule-based and dictionary-driven matching to context-aware named-entity recognition, candidate-generation plus disambiguation workflows, and graph-supported or neural entity-linking models that use surrounding context to resolve ambiguous mentions [28,29,30]. Where finance-facing resources are available, time-aware alias tables, instrument dictionaries and graph-structured reference resources can improve mention-to-asset mapping, but they also need temporal versioning so that later symbol changes do not leak future knowledge into earlier periods. If entity linking is performed using post hoc dictionaries or curated lists created after the studied period, the pipeline can inadvertently introduce future knowledge. Robust approaches therefore require explicit disambiguation rules, time-aware mappings where relevant, and confidence thresholds that allow uncertain mentions to be excluded rather than forced into a possibly incorrect label [28,29,30].
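A minimal sketch of such time-aware, abstention-friendly entity linking is shown below. The alias table, instrument identifiers, and confidence threshold are hypothetical placeholders; a real pipeline would draw them from versioned reference data and an upstream disambiguation model.

```python
import datetime as dt
from typing import Optional

# Hypothetical time-versioned alias table: symbol -> [(valid_from, instrument_id)]
ALIASES = {
    "ACME":  [(dt.date(2015, 1, 1), "ACME_US")],
    "ACMEX": [(dt.date(2021, 3, 1), "ACME_US")],  # later alias of the same issuer
}

def link_mention(symbol: str, post_date: dt.date,
                 confidence: float, min_conf: float = 0.8) -> Optional[str]:
    """Map a mention to an instrument using only aliases valid at post time.
    Returns None (abstain) when the symbol is unknown, not yet valid, or the
    upstream disambiguation confidence is below the threshold."""
    if confidence < min_conf:
        return None
    candidates = [iid for valid_from, iid in ALIASES.get(symbol.upper(), [])
                  if valid_from <= post_date]
    return candidates[-1] if candidates else None

# A pre-listing or low-confidence mention is excluded rather than forced:
print(link_mention("ACMEX", dt.date(2019, 6, 1), confidence=0.95))  # None: alias not yet valid
print(link_mention("ACME",  dt.date(2019, 6, 1), confidence=0.95))  # "ACME_US"
```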
Timestamp alignment is one of the highest-leverage factors for validity [31]. Social timestamps are typically recorded in UTC, while market labels are defined by exchange calendars, local trading hours, holidays, and after-hours sessions. Misalignment can place posts into the wrong market interval, creating spurious associations [32]. Collection delay further complicates alignment: evidence may be observable to a collector only after a lag that varies by access method and platform constraints. In addition, edits and deletions can change content after publication; using the final archived text as if it were the original message can leak information that was not available when a decision would have been made [33,34].
Together, these conditioning steps determine whether social evidence is comparable across time, platforms, and market regimes. They also shape whether later evaluation reflects anticipatory signals or a mixture of prediction and reaction content that happens to correlate with outcomes.
Figure 2 clarifies the time alignment assumptions that determine whether social signals can be interpreted as decision-relevant evidence. It distinguishes a correct setup, where the evidence window ends at a clearly defined decision cutoff and the label horizon starts afterward, from two common errors. The first error is window overlap, where features inadvertently include reaction content that occurs during the labeling period. The second is calendar-day aggregation that crosses trading-session boundaries, which can misplace after-hours posts into the wrong decision interval and distort evaluation.
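The sketch below encodes the core cutoff check behind Figure 2: a post counts as evidence only if, after an assumed collection delay, it would have been observable before the decision cutoff in exchange-local time. The five-minute delay and the New York session convention are illustrative assumptions, not requirements of any particular platform.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

NY = ZoneInfo("America/New_York")  # exchange-local calendar assumed for labels

def observable_before_cutoff(post_utc: datetime, cutoff_local: datetime,
                             collection_delay: timedelta = timedelta(minutes=5)) -> bool:
    """A post is admissible evidence only if it would have reached the collector
    before the decision cutoff; the delay default is an illustrative assumption."""
    return post_utc.astimezone(NY) + collection_delay <= cutoff_local

# Decision at the 2024-06-03 market open, 09:30 New York time:
cutoff = datetime(2024, 6, 3, 9, 30, tzinfo=NY)
post = datetime(2024, 6, 3, 13, 15, tzinfo=timezone.utc)   # 09:15 New York time
print(observable_before_cutoff(post, cutoff))               # True: arrives 09:20, before 09:30
```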

3.3. Practical Data Issues That Directly Affect Evaluation

Several practical properties of social network evidence have direct consequences for evaluation quality. These properties are not minor implementation details; they change what is observable at decision time [35], they modify label meaning, and they can inflate performance estimates if not handled explicitly. In applied return-prediction settings, seemingly small choices about what is treated as available evidence can shift conclusions [36].
Missingness and coverage breaks are typically structured rather than random. Rate limits, API policy changes, scraping interruptions, and platform outages can create gaps that cluster around major events, when attention surges and collection systems are most likely to saturate [37]. Moderation sweeps and account bans can also remove content selectively, creating a persistence bias in the archived record, while nonrandom tweet mortality can further distort what remains available for replication and evaluation [38,39]. If such gaps are not characterized, evaluation may over-represent calmer periods or stable communities, while under-sampling volatile episodes where decision support is most valuable. When evidence density varies sharply across time, aggregate metrics can hide brittle behavior and mask failure under low-observability conditions.
A second issue is reaction content and label contamination. Many social posts are generated in response to price moves, news releases, or other publicly visible outcomes; empirically, sentiment can have strong same-day effects while offering limited signal beyond short horizons [40]. If these posts enter the feature window used to predict an overlapping horizon, the model is partly reading the outcome rather than anticipating it. This risk is not limited to extreme cases; it can occur whenever feature construction uses coarse intervals (e.g., daily bins) or when timestamps are aligned to calendar days rather than market sessions. Separating anticipatory evidence from contemporaneous commentary and retrospective narratives is therefore essential for interpreting claims about predictive or decision value, especially when workflows distinguish intraday from post-market information [41].
Evaluation comparability is also constrained by dataset construction choices that differ across studies [42]. Query terms, language filters, included communities, and conditioning rules produce materially different evidence distributions even within the same platform and time period, and ranking-based content selection can further shape what is actually observed [43]. As a result, cross-paper comparisons of accuracy or AUC are often not meaningful unless the instrument universe, evidence window, label horizon, and leakage controls are aligned [44]. This is particularly relevant when studies report improvements over baselines trained on different samples or when the baseline is evaluated under a different split strategy, because split design can materially affect leakage risk and effective task difficulty [45]. Without comparable experimental conditions, performance differences may reflect sampling and preprocessing rather than model capability [42].
Cross-platform evidence introduces further alignment and identity problems. User identities rarely map cleanly across platforms, and linkage accuracy depends on heterogeneous signals and network structure [46]. Community semantics and diffusion dynamics vary across platforms, so merging streams can combine interactions governed by different exposure logics. When platforms are joined via keyword overlap, the merge can conflate unrelated discussions that share vocabulary. When joined via URLs, the evidence becomes conditioned on link-sharing behavior, which differs across communities and time and can overweight highly active link-sharing groups [47]. These alignment choices therefore affect representativeness and, by extension, any downstream financial inference.
Finally, operational constraints are rarely incorporated into evaluation. Many studies implicitly assume that the full evidence stream is available instantly and consistently, yet real collection processes have delays and failures, and these are first-order concerns in deployed ML pipelines [48,49]. For decision-centric tasks, it is not enough to report a metric on an archived dataset; it matters whether the evidence would have been observable within the decision latency budget, whether the pipeline can tolerate missing modalities without collapsing, and whether performance remains stable under shifts in platform behavior and drift [50,51]. When these conditions are not tested, reported results should be interpreted as optimistic estimates under idealized observability rather than as evidence of deployable decision support [52].
Practical minimum reporting requirements follow directly from these issues: evidence coverage over time and gaps; the observation process and collection delay; handling of deletions and edits; separation of anticipatory evidence from reaction content; and split strategies that respect time and information availability [5]. These items establish whether evaluation results are interpretable as decision-relevant performance rather than as artifacts of dataset construction, and they map naturally onto structured documentation practices for datasets and models [53].
For this reason, benchmark datasets are useful only when they are paired with benchmark protocols. In social–financial settings, a reproducible benchmark should specify at minimum the instrument universe, the collection route and access constraints, the evidence window and decision cutoff, the label horizon and session convention, the handling of deletions, edits, and structured missingness, and the baseline comparisons used to separate social signal value from simple market or attention effects [53]. Standardized evaluation should also rely on time-forward or otherwise time-respecting splits, rather than random partitions, and should report robustness under delay, modality loss, and period shift rather than only average predictive metrics [49]. Without these conditions, a “benchmark” can improve convenience while still preserving the same comparability problems that currently limit interpretation across studies [53]. These requirements are consistent with the broader assurance items summarized later in Table 4.
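A minimal sketch of the kind of time-forward split such a protocol implies is given below; the fold sizing and the one-day embargo gap are illustrative choices rather than a prescribed standard, and the embargo should match the label horizon in practice.

```python
def time_forward_splits(dates, n_folds: int = 3, embargo: int = 1):
    """Walk-forward splits over a sorted list of unique trading dates: each fold
    trains on strictly earlier dates and tests on a later block, with a short
    embargo gap to absorb label-horizon overlap. Fold sizing is illustrative."""
    fold = len(dates) // (n_folds + 1)
    for k in range(1, n_folds + 1):
        train_end = k * fold
        test_start = train_end + embargo
        yield dates[:train_end], dates[test_start:test_start + fold]
```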

4. Model Components for Social–Financial Representation and Inference

This section summarizes the main model components used to convert social network evidence into representations that can support forecasting, detection, or risk-related tasks. The focus is on how these components behave under the constraints described earlier, including noisy and selectively observed evidence, ambiguous entity mentions, tight timing requirements, and distribution shift across market regimes. Instead of treating architectures as an end in themselves, the discussion emphasizes what each component contributes to the overall pipeline, which assumptions it relies on, and which recurring failure modes limit decision relevance.

4.1. Text Representation for Noisy, Time-Sensitive Evidence

Text is the most common social evidence used in financial tasks because it is cheap to collect, easy to store, and quickly available. At the same time, it is a fragile input. The dominant challenge is not only ambiguity of language, but the fact that finance-related social text is produced under incentives that include persuasion, performance, promotion, and coordination, including attention-driven episodes and explicit promotion campaigns that can shape what gets written and repeated [19,54]. This makes the observed text distribution structurally different from newswire or formal reports, even when it references the same events.
Most recent systems rely on contextual encoders that map posts into dense representations (often via domain-adapted transformer variants or fine-tuned deep models used as sentiment and classification backbones in finance settings), followed by lightweight task heads for classification or regression [55,56]. In social–financial settings, these encoders are typically expected to handle short messages, informal spelling, heavy use of slang, and frequent code-switching; empirical work on financial tweets repeatedly shows that preprocessing choices and representation learning are not secondary details, because performance varies materially across modeling families and embedding choices [57]. They also face systematic semantic shifts: the polarity of key terms depends on market context, community conventions evolve, and new jargon appears rapidly around emergent assets and narratives. As a result, models that perform well on a static test set can degrade quickly when applied to later periods; this is consistent with broader evidence on concept drift and shifting data-generating processes in streaming settings, and with drift-adaptation work specific to text streams [58,59]. A related practical issue is that “generic” sentiment tooling often fails to transfer cleanly to finance-specific short posts because vocabulary and intent are domain-specific, so strong-looking accuracy in one platform or time slice may not persist out-of-sample [60].
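For concreteness, the sketch below shows the typical encoder-plus-lightweight-head pattern for short posts, using mean-pooled contextual embeddings followed by a small classification head. The checkpoint name and the three-class head are placeholders; a domain-adapted financial encoder and a task-specific head would normally be substituted.

```python
import torch
from transformers import AutoTokenizer, AutoModel

MODEL = "distilbert-base-uncased"   # placeholder; swap in a finance-adapted encoder
tokenizer = AutoTokenizer.from_pretrained(MODEL)
encoder = AutoModel.from_pretrained(MODEL)

posts = ["$ACME earnings beat, guidance raised", "acme mooning again lol"]
batch = tokenizer(posts, padding=True, truncation=True, max_length=64, return_tensors="pt")

with torch.no_grad():
    hidden = encoder(**batch).last_hidden_state            # (batch, tokens, dim)
    mask = batch["attention_mask"].unsqueeze(-1)            # ignore padding when pooling
    post_vectors = (hidden * mask).sum(1) / mask.sum(1)     # mean-pooled post embeddings

task_head = torch.nn.Linear(post_vectors.size(-1), 3)       # e.g., bearish / neutral / bullish
logits = task_head(post_vectors)
```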
Recent work also points toward financial foundation models and LLM-style systems that rely on broader pretraining and more flexible adaptation across finance-related language tasks. These models can improve transfer across sentiment classification, information extraction, and target-aware interpretation, and they make it easier to reuse a common language backbone rather than rebuilding narrow task-specific pipelines for each setting. At the same time, their additional capacity does not remove the core constraints emphasized in this review. In social–financial settings, larger language models still depend on explicit entity attribution, strict decision-time alignment, and careful separation of anticipatory evidence from reaction content; otherwise, more expressive representations can amplify rather than solve timing and attribution problems. Their practical value should therefore be assessed in relation to latency, calibration, drift tolerance, and evidence traceability, not only by offline language-task performance [55,56,57].
A recurring methodological weakness is that improvements in representation quality are often inferred from downstream prediction gains without isolating the source of those gains [61]. In this setting, shortcut learning should be understood as the broader problem: the model achieves strong apparent performance by exploiting correlates of the target that are easier to learn than the intended decision-relevant semantics. In finance-facing applications, look-ahead bias is one specific manifestation of this broader problem. It arises when label construction, horizon definition, event curation or evidence aggregation allow the representation to absorb information that would not have been available at the decision cutoff [62]. Under these conditions, apparent progress can stem from dataset construction rather than from better semantic understanding, and even small changes in look-ahead choices can alter what the model is effectively allowed to know. When labels are derived from market movement or from post hoc curated event lists, the encoder may learn proxies for time period, topic clusters, platform visibility or reaction intensity instead of extracting anticipatory, decision-relevant content, which is consistent with broader evidence on shortcut learning in deep models [63]. Similarly, when posts are aggregated without separating anticipatory text from reaction commentary, the representation can encode an outcome description rather than an advance signal. These issues can persist even with strong encoders, because a more expressive model may exploit such shortcuts more efficiently rather than become more economically informative.
A practical example arises when daily or event-centered datasets mix anticipatory posts with messages written after a visible price move, earnings release, or other high-attention market episode; in that case, the model may achieve strong apparent accuracy by learning reaction intensity, event identity, or platform attention rather than genuinely forward-looking financial content [62].
Entity conditioning is another critical factor in text representation. Finance posts often mention multiple instruments, refer indirectly to firms (“the chipmaker”), or use ambiguous tickers and token symbols. Representations that ignore entity context can mix sentiment and intent across instruments, producing signals that are difficult to attribute and unstable across periods. Recent finance-oriented targeted sentiment formulations make the target explicit by tying polarity to a stated entity rather than to the whole post, and entity-aware financial resources likewise show that a single item of text can express distinct sentiments toward different entities [64,65]. More robust designs therefore incorporate explicit entity markers, instrument-specific prompts, or joint mention-to-instrument models. However, forcing an entity assignment when confidence is low can be worse than abstaining, because systematic misattribution introduces persistent bias into downstream decision signals; this aligns with broader evidence on reject-option mechanisms used to avoid high-cost errors [66].
Token-level artifacts also matter more than is usually acknowledged. Cashtags, emojis, repeated punctuation, and URL patterns can carry predictive information in some datasets, but that information may be platform-specific and short-lived because platforms differ in message conventions and in what is amplified or suppressed [67,68]. Heavy reliance on such cues can inflate short-horizon performance while reducing portability across platforms or time. Preprocessing that strips these markers indiscriminately can also remove meaningful intent signals in certain communities, so text normalization choices should be treated as part of the representation design, rather than a generic cleanup step [69].
The tension between representation capacity and operational constraints is especially visible in this setting. Larger encoders can capture richer context, yet they increase inference latency and the memory/compute footprint, which becomes binding when posts arrive at high volume or when the system has a tight decision budget [70]. In applied finance pipelines, these costs often drive the choice of smaller or adapted models even when larger variants are more accurate in offline tests [71]. For decision support, representational accuracy is only valuable if it can be delivered within the latency regime implied by the task and if confidence estimates remain stable under drift rather than only on a fixed historical split [58].

4.2. Temporal Modeling and Session-Aware Alignment

Temporal structure is not an optional refinement in social–financial analytics. Most targets of interest are defined over time, evidence arrives with delays and irregular bursts, and the same textual content can have different implications depending on when it appears relative to market sessions and scheduled events [72]. Temporal modeling therefore has two intertwined roles: representing time-varying evidence and enforcing an alignment discipline that prevents outcome leakage through window overlap or calendar mistakes [31].
A common starting point is to aggregate posts into fixed windows and learn a mapping from window-level features to a future label. This approach is simple, but it makes strong assumptions. Fixed bins assume that evidence density is stable and that the relevant horizon is constant across regimes. In practice, social volume is highly uneven, with spikes during earnings, crises, and viral episodes, and the choice of aggregation can make models sensitive to volume as a proxy for attention or volatility rather than content [73]. If aggregation does not account for this, performance can look better in event-heavy periods while degrading under routine conditions. Windowing can also blur decision timing: daily features may include posts created after a market move if the bin is defined by calendar day rather than by the relevant trading session [72].
Session-aware alignment is a minimal requirement for interpretability. Financial labels often follow exchange calendars, with holidays, half-days, and distinct pre-market and after-hours intervals, so the meaning of “next day” depends on whether the label is defined at the market close or the market open [74]. Social timestamps are typically recorded in UTC and must be mapped to the session definition used for labels; otherwise, misalignment can shift posts into the wrong interval and create artificial correlations, especially when using short horizons [75]. The issue becomes more severe when decisions are tied to specific cutoffs such as market open, close, or scheduled announcements, because the key criterion is whether a post was observable before the decision cutoff and within the evidence window assumed by the system.
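A minimal sketch of session assignment under these constraints is shown below: posts at or after the close, or on non-trading days, are attributed to the next session's evidence window. The fixed 16:00 New York close and the empty holiday set are deliberate simplifications; a real pipeline would use a full exchange calendar with half-days and pre-market conventions.

```python
from datetime import date, datetime, time, timedelta
from zoneinfo import ZoneInfo

NY = ZoneInfo("America/New_York")
HOLIDAYS: set = set()            # populate from an exchange calendar in practice

def assign_session(post_utc: datetime) -> date:
    """Attribute a post to the trading session whose decision window it can inform:
    posts at or after the (simplified) 16:00 close roll to the next session, and
    weekends or holidays roll forward to the next trading day."""
    local = post_utc.astimezone(NY)
    d = local.date()
    if local.time() >= time(16, 0):
        d += timedelta(days=1)
    while d.weekday() >= 5 or d in HOLIDAYS:
        d += timedelta(days=1)
    return d
```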
Temporal architectures attempt to address these issues, but they do not automatically fix them. Recurrent networks and temporal convolution models can capture sequential patterns, yet they still rely on correctly defined input sequences and labels, and they are often evaluated under protocols that do not reflect deployment [76,77]. Attention-based temporal pooling can help focus on salient evidence bursts, but it can also over-weight high-activity moments that are partly reaction cascades, particularly in settings where volume is tightly coupled to volatility or attention [78]. More recent approaches represent time explicitly through positional encodings and time-aware embeddings, or through continuous-time mechanisms designed for irregular event streams [79,80]. Related finance-facing work also increasingly uses transformer-style temporal encoders when long-range dependencies and irregular inter-arrival times need to be modeled more explicitly than in fixed-window aggregation alone [79,80]. These designs are better aligned with social evidence because they model variable inter-arrival times and reduce reliance on arbitrary binning, but the risk of learning proxies for period identity remains high when splits do not respect time order [77].
Another persistent challenge is the interaction between temporal modeling and exogenous structure. Market behavior changes across regimes; scheduled events such as earnings releases introduce predictable discontinuities; and broader event risk can dominate short-horizon outcomes [81,82,83]. Models that ignore these factors may attribute regime effects to social signals. Conversely, models that incorporate market covariates without a careful evaluation design can hide leakage. A practical approach is to treat event indicators and basic market states as controls, then evaluate whether social evidence provides incremental value beyond those controls under a time-respecting split [84,85].
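The sketch below illustrates one way to operationalize this incremental-value check: the same time-forward split is used to compare a controls-only baseline against a controls-plus-social model. The column names are hypothetical, and the logistic baseline is a deliberately simple stand-in rather than a recommended model.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def incremental_auc(df, cutoff_date,
                    controls=("realized_vol", "earnings_flag", "abs_return_lag1"),
                    social=("sent_mean", "post_volume_z")):
    """Compare a controls-only model with a controls-plus-social model on one
    time-forward split; `df` is a pandas frame with 'date' and binary 'label' columns."""
    train, test = df[df["date"] < cutoff_date], df[df["date"] >= cutoff_date]

    def fit_eval(cols):
        model = LogisticRegression(max_iter=1000).fit(train[list(cols)], train["label"])
        return roc_auc_score(test["label"], model.predict_proba(test[list(cols)])[:, 1])

    base, full = fit_eval(controls), fit_eval(tuple(controls) + tuple(social))
    return base, full, full - base   # a positive gap suggests incremental social value
```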
Temporal generalization should also be treated as a central evaluation objective. Performance that is stable across adjacent periods can still fail under distribution shifts caused by platform changes, new asset narratives, or changes in retail participation [86,87]. Updating models frequently may reduce drift effects, but it raises additional requirements: clear retraining triggers, validation refresh policies, and safeguards against training on reaction content that is only observable after outcomes. Without these controls, frequent updating can quietly introduce future information into the model state and produce optimistic results, especially under recurring drift patterns [88,89].

4.3. Relational Modeling and Graph Encoders

Relational modeling addresses a limitation of text-only approaches: market-relevant behavior is often expressed through interaction structure, coordinated exposure, and dependencies across instruments. Graph representations formalize these dependencies by defining nodes (users, posts, assets, or communities) and edges (mentions, replies, reshares, co-posting, similarity links, or inferred influence traces). In social–financial settings, the graph is rarely stable across time. It is a task-specific construct whose validity depends on how edges are defined, how temporal ordering is handled, and how much of the interaction space is observable under platform and collection constraints [90,91].
Graph encoders are commonly used to learn representations that summarize neighborhood context, supporting tasks such as influence estimation, coordination detection, and propagation modeling [92]. They are also used to represent multi-entity dependence where outcomes are coupled, such as cross-asset spillovers, sector co-movement, and correlated attention cascades [93]. In both uses, the graph definition imposes strong assumptions. Follower links treat influence as a relatively static property, whereas reply and reshare links capture event-driven interaction that can be transient. Co-mention edges are also ambiguous because a single message can reference multiple instruments for contrasting reasons, and frequency can reflect attention rather than informative content [94].
Temporal construction choices can inflate results through subtle leakage. Many studies build graphs over long windows and then predict outcomes over shorter horizons. When the graph construction window overlaps the labeling horizon, edges can encode reaction structure and indirectly carry outcome information; in temporal link prediction, this is exactly why training/evaluation is typically framed as forecasting future links from strictly prior interaction history and time slices [95,96]. Even with separated windows, topology can shift rapidly during high-attention episodes; models trained on quieter periods can fail when interaction patterns densify or become dominated by inorganic campaigning and coordination [24].
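A minimal sketch of leakage-aware graph construction is shown below: edges are built only from interactions that occur strictly before the decision cutoff, and node features are computed on that pre-cutoff graph alone. The reshare-style edge schema and the degree features are illustrative assumptions.

```python
from collections import defaultdict
from datetime import datetime

def build_pre_cutoff_graph(interactions, cutoff: datetime):
    """interactions: iterable of (source_user, target_user, timestamp) tuples.
    Only interactions strictly before the cutoff contribute edges, so reaction-period
    structure from the label horizon cannot leak into the features."""
    adj = defaultdict(set)
    for src, dst, ts in interactions:
        if ts < cutoff:
            adj[src].add(dst)
    return adj

def node_features(adj, user):
    """Simple degree features computed on the pre-cutoff graph only."""
    out_degree = len(adj.get(user, set()))
    in_degree = sum(user in nbrs for nbrs in adj.values())
    return {"out_degree": out_degree, "in_degree": in_degree}
```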
Representativeness is another recurring issue. Observed interaction graphs reflect platform visibility and collection limits, so missing edges are often systematic rather than random. When missingness is non-random (e.g., more central actors or ties are more likely to be absent), centrality and related network measurements can become biased, and downstream interpretations of influence can partly reflect observability rather than underlying dynamics [97]. For longitudinal network settings, analytic choices that restrict to complete cases can further reduce representativeness, and explicit missing-data modeling is often needed to limit bias [98].
Edge semantics are frequently under-specified. Interactions can represent endorsement, disagreement, irony, or coordination, and their meaning varies across communities and time. Reply edges may signal debate rather than diffusion; reshares may amplify content or mock it; engagement can be manipulated. Graph encoders can learn predictive structure from these patterns, yet the learned signal may reflect platform behavior or manipulation intensity rather than information transfer [99,100].
Operational constraints also affect what relational models can support. Dynamic graphs at social scale are expensive to update and encode, especially under near-real-time or streaming regimes. Many systems, therefore, rely on sampling, sparsification, or periodic rebuilds, which can change what motifs remain observable and how stable message passing is across time [101,102].
Finally, relational models can appear superior to text-only baselines while exploiting confounds. Interaction density around an instrument is correlated with volatility and news intensity, so graph features can function as proxies for market stress. Without controls for volume, scheduled events, and basic market state variables, the contribution of graph structure can be overstated; evaluation designs also need to avoid methodological traps that inflate apparent predictability [103].

4.4. Fusion Mechanisms for Heterogeneous Evidence

Heterogeneous evidence is the norm in social–financial settings. Text, interaction traces, simple market features, and sometimes images or links arrive at different rates, with different reliability, and with different failure modes. Recent finance-facing work increasingly uses multimodal transformer-style and cross-attention architectures to combine these inputs, rather than relying only on simple concatenation or late ensemble rules. Fusion methods aim to combine these inputs into a shared representation or a coordinated set of representations that improve downstream tasks [104]. The main risk is that fusion increases flexibility faster than it increases validity: a fused model can fit spurious correlations more easily, and its apparent gains can be driven by leakage through timing overlap or by dominance of a single modality that is correlated with the label [105].
A basic fusion pattern is early fusion, where features from different modalities are concatenated and processed jointly [106]. This approach is easy to implement and can work when modalities are consistently present and aligned to the same decision window. In practice, alignment is often imperfect. Interaction counts may reflect a longer accumulation horizon than text, market features may update at a different frequency than social evidence, and images or links may be missing for many posts [107,108]. When early fusion is applied without explicit alignment rules, the model can exploit timing artifacts, learning that certain modalities are richer after outcomes begin to unfold [108].
Late fusion combines modality-specific models at the prediction stage, typically via weighted averaging, stacking, or gating, which can reduce cross-modal interference and keep diagnosis more modular. However, late fusion often depends on weights (fixed or learned) that reflect one evidence regime. When evidence density shifts, these weights can become unstable: a modality that is reliable in calm periods may become dominated by low-quality or manipulated signals during high-attention episodes, while a modality that is usually sparse may suddenly arrive at scale and overwhelm the ensemble unless fusion is explicitly uncertainty/quality-aware [109,110,111].
Intermediate fusion uses cross-attention or shared latent spaces so modalities can condition each other. In social–financial analytics, this is attractive because message meaning is often shaped by relational context, while interaction patterns are partly driven by content and timing. At the same time, this design is usually more expensive than early or late fusion. It requires modality-specific encoders to remain active up to the fusion stage, additional parameters for cross-modal interaction layers, and tighter synchronization across inputs that may arrive at different rates or with different delays. In practice, this increases memory use, training cost, and inference latency, especially when text sequences, interaction structures, and market-side features must be processed jointly rather than through modular branches. The burden becomes more visible under streaming or near-real-time conditions, where one delayed or missing modality can hold back the fused representation or force the system into fallback behavior. Intermediate fusion can therefore offer richer cross-modal conditioning, but the gain comes with higher orchestration cost and a greater risk that apparent predictive improvements mainly reflect increased flexibility rather than better evidence integration. This is also why it is especially exposed to shortcut learning: the model can lean on engagement or attention intensity as a proxy for volatility or news pressure, producing strong metrics while extracting limited decision-relevant signal from content. The risk is higher when training data over-represents event-heavy windows and when evaluation does not explicitly test stability across market states or regime changes, since attention and sentiment effects on volatility and correlation can be state-dependent [112,113,114].
In finance-facing applications, early fusion is commonly used for joint text-plus-market inputs, late fusion for modular combination of social and technical baselines and intermediate fusion for settings where textual, relational and market-side signals are expected to condition one another rather than contribute independently.
A central practical issue is missing-modality behavior. Real evidence streams are incomplete: some posts have no images, some platforms restrict access to engagement counters, and collection pipelines drop segments under load. Fusion systems that assume complete inputs are brittle, so two design choices are common: imputing missing modalities with learned/default embeddings (or explicit imputation modules), or masking/dropping missing modalities so the model learns to rely on available evidence [115,116]. Imputation is not neutral when missingness is structured (missing-not-at-random), as it often is under platform policies, content types, and community behavior; it can induce spurious dependencies and biased feature attribution, which is especially problematic when representations are later treated as evidence [117,118]. Masking (including modality-dropout training) is typically safer for robustness, but only when training exposes the model to realistic missingness patterns; otherwise, missingness itself can become a shortcut “platform signature” rather than an absence of information [119].
Another constraint is that modalities differ in trustworthiness. Text can be synthetic, interaction counts can be inflated, and cross-platform signals can be distorted by coordinated reposting, so treating all modalities as equally reliable encourages overconfident decisions. More careful fusion introduces reliability weighting tied to evidence quality indicators (e.g., account credibility, bot-likelihood scores, coordination cues, provenance constraints, freshness), often paired with uncertainty- or conflict-aware aggregation so noisy modalities are down-weighted rather than amplified [120,121]. However, these reliability cues are not time-stable: coordination patterns and manipulation tactics vary by context and change as platforms and adversaries adapt, so static trust heuristics are fragile [100]. Reliability estimation itself can drift under domain and time shift, which is why test-time adaptation work in multimodal misinformation detection is relevant when reliability modules are treated as operational components rather than fixed calibrators [122].
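The sketch below combines the two ideas discussed above: missing modalities are masked rather than imputed, and the remaining ones are weighted by an externally supplied reliability score. The modality names, scores, and weights are placeholders, and the abstention behavior when no trusted evidence remains is a design choice rather than a standard.

```python
from typing import Dict, Optional

def fuse(scores: Dict[str, Optional[float]],
         reliability: Dict[str, float]) -> Optional[float]:
    """scores: modality -> prediction in [0, 1], with None marking a missing modality.
    reliability: modality -> quality weight in [0, 1], e.g., adjusted for bot-likelihood
    or coordination cues. Missing modalities are masked rather than imputed, and the
    function abstains (returns None) when no trusted evidence remains."""
    weighted, total = 0.0, 0.0
    for name, score in scores.items():
        if score is None:
            continue                      # mask, do not impute
        w = reliability.get(name, 0.0)
        weighted += w * score
        total += w
    return weighted / total if total > 0 else None

fused = fuse({"text": 0.72, "graph": 0.55, "market": None},
             {"text": 0.6, "graph": 0.9, "market": 0.8})
```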
For decision-centric tasks, the most informative fusion results are those that isolate incremental value. It should be clear whether improvements come from adding a new modality, from changing alignment rules, or from increasing model capacity. Comparisons are difficult to interpret when the fused model is larger and better tuned than unimodal baselines, because gains can be driven by capacity rather than by multimodal complementarity [123]. Recent fusion work typically addresses this with matched baselines and explicit modality ablations (e.g., unimodal vs. fused variants under comparable settings) [124,125]. Ablations should therefore keep training budgets and split strategies comparable, and they should report sensitivity to evidence availability, not only average performance. For intermediate fusion in particular, ablations should not be limited to predictive performance alone. They should also report whether gains remain after accounting for training budget, inference latency, memory demand, and sensitivity to missing or delayed modalities, because otherwise part of the apparent advantage may simply come from higher computational capacity and tighter cross-modal coupling rather than from genuinely better evidence integration.
Fusion is valuable when it improves decision-relevant robustness: stable performance across regimes, controlled behavior under partial evidence, and reduced dependence on manipulable modalities. Finance-oriented multimodal models increasingly test performance under stressed conditions (e.g., crash-focused settings) and quantify the incremental contribution of each modality rather than reporting only pooled averages [126,127]. Robustness under incomplete inputs is also commonly studied through explicit missing-modality settings and training/evaluation designs that prevent the model from winning by exploiting availability artifacts [128]. When these properties are not demonstrated, fusion gains often reflect flexible fitting to dataset-specific artifacts rather than reliable integration of heterogeneous evidence.

4.5. Training and Deployment Constraints

Model performance in social–financial analytics is bounded by practical constraints that are often treated as secondary. Training data is expensive to curate with defensible timestamps and entity mappings, evidence distributions drift, and deployment settings impose latency and reliability requirements that rule out many designs even if they look strong offline [12,50]. These constraints shape what is feasible and should be treated as part of the modeling problem rather than as implementation detail.
Training regimes frequently assume stationarity that does not hold. Community language, platform mechanics, and market structure change, so a model trained on a historical slice can become miscalibrated when deployed later, a classic setup for concept drift and degraded decision quality [58]. Retraining can mitigate drift, but it introduces a second constraint: the retraining pipeline must avoid incorporating reaction content or post-event artifacts that were not available at decision time. Without careful cutoff rules and frozen evidence snapshots, continual training can leak future information into the model state while appearing to improve performance [44].
Compute and latency constraints limit model size, feature extraction, and fusion complexity, but the burden is not distributed evenly across model components. For text encoders, the main cost is per-message inference and memory footprint, which becomes critical when post volume spikes and the system must score many short messages within a narrow decision window. For temporal modules, the bottleneck is often less the forward pass itself than the need to maintain session-consistent windows, refresh state under streaming arrival, and retrain often enough to limit drift. For graph-based components, latency is strongly shaped by graph construction and update frequency: rebuilding edges, refreshing neighborhoods, or propagating over dynamic interaction structures can become more expensive than downstream prediction, especially when coordination patterns are changing quickly. Fusion layers add a different constraint, because they require synchronization across modalities that may arrive with unequal delay or completeness; this creates overhead not only through larger model capacity, but also through buffering, fallback logic, and robustness handling when one modality is stale or missing. Finally, resource-aware deployment must also account for provenance, logging, and reproducibility controls, since auditable systems incur additional storage and processing overhead even when these components do not directly improve predictive accuracy.
During high-attention periods, these costs accumulate at the same time as evidence volume rises. If the pipeline cannot sustain throughput, it will either drop evidence, increase delay, or degrade model quality through adaptive sampling. Under these conditions, the best offline architectures are not automatically preferable; resource-aware, latency-bounded designs are often more operationally useful because they balance representational richness against predictable service behavior, controlled degradation, and auditability under load [129,130,131].
Deployment environments also constrain what can be logged, reproduced, and audited. Reproducibility requires explicit versioning of models, preprocessing rules, entity dictionaries, and time-alignment policies; otherwise it becomes difficult to reconstruct why a signal was produced, particularly when evidence streams contain deletions, edits, or shifting platform access. End-to-end provenance capture across ML pipelines provides a concrete mechanism for reconstructing decisions from artifact lineage rather than relying on informal documentation, and provenance can also be treated as a verifiable requirement when auditability is a first-class constraint [132,133]. Monitoring is equally important because drift can occur in evidence content, in interaction patterns, and in the mapping between signals and outcomes; this is better framed as tracking evidence-quality and data-quality degradation (e.g., completeness/missingness, consistency) rather than only tracking prediction error [134]. Where manipulation risk is material, monitoring should also track shifts in coordination prevalence and coordinated diffusion patterns as part of evidence quality [135].
Calibration and uncertainty estimation are often underemphasized relative to raw predictive metrics, even though decision support depends on whether confidence values behave sensibly under drift and regime change [136]. For operational use, confidence governs thresholding, alert volume, and risk exposure, so calibration quality directly shapes what gets escalated and what gets suppressed [137]. Models that look strong on average but become overconfident under distribution shift can concentrate operational risk, especially when confidence is used as a gate for action [138]. Training and evaluation therefore need to treat calibration and uncertainty as first-class objectives, not as auxiliary diagnostics, and to test them explicitly under realistic shifts rather than only in i.i.d. (independent and identically distributed) settings [139]. Common approaches include post hoc calibration and broader uncertainty-quantification techniques, such as ensemble- and dropout-based strategies, when systems must support abstention, alert-rate control, or thresholded escalation decisions [139].
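As an illustration of the post hoc route, the sketch below fits a single temperature parameter on held-out logits and reports a simple calibration-error estimate before and after scaling; the synthetic data, binning scheme, and temperature grid are illustrative assumptions.

```python
# Minimal sketch: post hoc temperature scaling and a simple calibration check.
# The synthetic data, binning scheme, and temperature grid are illustrative assumptions.
import numpy as np

def softmax(logits, T=1.0):
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)
    p = np.exp(z)
    return p / p.sum(axis=1, keepdims=True)

def expected_calibration_error(probs, labels, n_bins=10):
    """Gap between confidence and accuracy, averaged over confidence bins."""
    conf = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return ece

def fit_temperature(logits, labels, grid=np.linspace(0.5, 5.0, 46)):
    """Pick the temperature that minimizes negative log-likelihood on held-out data."""
    nll = lambda T: -np.log(softmax(logits, T)[np.arange(len(labels)), labels] + 1e-12).mean()
    return min(grid, key=nll)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    labels = rng.integers(0, 2, size=500)
    # Informative but noisy logits whose magnitudes need not match true confidence.
    logits = np.stack([(1 - labels) * 3.0, labels * 3.0], axis=1) + rng.normal(0, 2.5, (500, 2))
    T = fit_temperature(logits, labels)
    print("ECE before:", round(expected_calibration_error(softmax(logits), labels), 3))
    print("ECE after :", round(expected_calibration_error(softmax(logits, T), labels), 3), "T =", round(float(T), 2))
```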
Finally, many published comparisons confound modeling advances with training budget, hyperparameter tuning, and dataset curation effort [140]. When one model receives substantially more tuning or engineering, gains cannot be attributed cleanly to the architecture, and cross-paper conclusions become unstable [141]. Results are most interpretable when budgets and tuning procedures are comparable across baselines, and when sensitivity to preprocessing decisions (cleaning choices, feature construction, filtering thresholds) is reported alongside headline metrics, because those pipeline choices can materially change outcomes [142,143].

4.6. Comparative Synthesis of Model Components

Table 1 summarizes the main model components used in social–financial systems and compares them along dimensions that are stable across datasets: the evidence type each component is best suited to represent, typical failure modes under realistic observability, sensitivity to distribution shift, and operational constraints such as latency and update cost. This approach avoids cross-study ranking based on headline metrics that are not directly comparable across tasks, horizons, and evaluation designs.
Text encoders are the default choice when the evidence stream is dominated by short, informal messages; their main limitation is sensitivity to language drift and to shortcut learning driven by reaction content or platform visibility artifacts. Temporal modules are most valuable when decision cutoffs and label horizons are defined precisely and when evidence arrives in irregular bursts; they can still fail when session alignment is weak or when aggregation smears timing [144]. Graph encoders are appropriate when dependence structure is central to the task, such as coordination patterns or cross-entity coupling; they are vulnerable to edge semantics ambiguity, missing edges, and leakage through graph construction windows [145]. Fusion mechanisms can improve robustness when modalities complement each other and missingness is handled explicitly; they also increase the risk that one manipulable modality dominates or that alignment artifacts drive apparent gains [146].
Across components, the most reliable improvements are those that persist under time-respecting evaluation, remain stable across regimes, and do not depend on fragile preprocessing thresholds. Capacity increases alone are not sufficient evidence of decision utility, particularly when they raise latency, reduce calibration stability, or amplify sensitivity to evidence-quality shifts [103,147].
Figure 3 summarizes the end-to-end pipeline that connects social network evidence to financial actions and post-decision evaluation. It starts from evidence acquisition and conditioning, where platform access constraints, bot and spam activity, entity linking, and timestamp alignment determine what is actually observable and attributable at the decision time. The pipeline then separates representation (text and temporal modeling), relational structure (graph construction and graph encoders), and fusion across heterogeneous inputs, before mapping outputs into actions under operational constraints and tracking outcomes. The figure also highlights where common validity risks enter, especially sampling and visibility bias, timing overlap and observation delay, attribution errors in instrument mapping, and robustness degradation under drift, manipulation, or missing modalities.

5. From Social Evidence to Decision-Facing Signals

Section 5 synthesizes how social network evidence is transformed into task-oriented signals used in financial applications. The emphasis is on the construction of signals that can be interpreted at the decision level, including how they are attributed to instruments or themes, aligned to market time, and tested under shifting regimes. Rather than treating each modeling family as a separate topic, the discussion groups the literature by signal type and highlights where empirical claims tend to be strongest, where they depend on narrow conditions, and where common data and evaluation choices can produce optimistic results.

5.1. Sentiment Signals and Aggregation

Sentiment extracted from social text is often treated as a direct proxy for market expectations. In practice, however, sentiment is an intermediate evidence variable that becomes meaningful only after it is attributed to a target (asset, sector, or market theme), aligned to a decision time, and aggregated into a stable signal [148,149]. A similar methodological point appears in adjacent financial analytics settings, where behavioral traces do not become financially meaningful indicators automatically but only after they are summarized, weighted, and linked to a predictive objective; recent reviews of customer lifetime value modeling likewise show that deterministic, probabilistic, and AI-based value formulations depend heavily on how behavioral histories are aggregated and converted into decision-facing indicators [150]. Although customer lifetime value and market-sentiment inference are not identical decision settings, the comparison reinforces the same methodological lesson: financial indicators are shaped by modeling choices rather than read directly from raw behavior. In social–financial pipelines, this is especially important because aggregation choices can convert platform behavior into an apparent financial signal, particularly when engagement, visibility, or event intensity influence which messages enter the evidence set [151].
Sentiment estimation in finance spans a spectrum from simple polarity detection to stance, intent, and aspect-level judgments [152]. Polarity alone is frequently insufficient because finance posts can express optimism about a long horizon while warning about short-term risk, or endorse one asset while criticizing the broader market. For decision use, a more relevant object is often stance toward a specific instrument or claim, because stance is closer to actionable intent than generic positivity [152]. However, stance models depend heavily on accurate entity linking and on context that is frequently missing in short posts [153].
Aggregation is the main step that turns post-level sentiment into a decision-relevant indicator. Common aggregations include averaging sentiment scores, counting positive and negative messages, using net-balance indices, and applying weights based on author credibility or engagement. Each choice encodes assumptions. Equal-weight aggregation assumes each message carries similar informational value. Engagement weighting assumes higher visibility implies higher relevance, even though visibility is partly driven by ranking systems and algorithmic amplification that can systematically privilege low-credibility content [154]. Credibility weighting assumes a stable mapping between account-level features and reliability, but these features are not all of the same kind. Some are relatively static metadata, such as account age, persistence over time, declared affiliation, historical topical focus, or verification and profile-completeness indicators where such signals are available. These features can serve as coarse priors, but they are limited because they can be fabricated, purchased, or remain unchanged even when behavior becomes manipulative. Other signals are dynamic behavioral features, including posting burstiness, original-to-reshare ratio, repeated cashtag or hashtag usage, cross-account timing synchronization, unusual engagement-to-activity ratios, abrupt topic switching, and repeated linking to the same domains or artifacts across narrow windows. These dynamic signals are often more informative for detecting current degradation in credibility, coordinated promotion, or shifts from routine participation to engineered amplification, which is why credibility pipelines increasingly combine bot detection and behavior-based risk indicators rather than relying on static metadata alone [155]. More broadly, interaction-derived behavioral features can also track attention and stress states, for example, through joint movement in volume and volatility, so weighted sentiment indices can drift toward measuring attention rather than decision-relevant information if the weighting logic is not stress-tested across market states [156].
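The following sketch contrasts three of these aggregation rules on the same toy evidence set, making explicit how the choice of weights can change the sign and magnitude of the resulting index; the field names and example values are hypothetical.

```python
# Minimal sketch: three common post-level sentiment aggregations for one instrument.
# Field names and the credibility/engagement weighting are illustrative assumptions.

posts = [  # sentiment in [-1, 1]; weight proxies are hypothetical account features
    {"sentiment": 0.9,  "credibility": 0.2, "engagement": 500},
    {"sentiment": 0.4,  "credibility": 0.9, "engagement": 12},
    {"sentiment": -0.7, "credibility": 0.8, "engagement": 30},
]

def equal_weight(posts):
    """Assumes every message carries similar informational value."""
    return sum(p["sentiment"] for p in posts) / len(posts)

def net_balance(posts, threshold=0.1):
    """Share of positive minus share of negative messages; ignores magnitude."""
    pos = sum(p["sentiment"] > threshold for p in posts)
    neg = sum(p["sentiment"] < -threshold for p in posts)
    return (pos - neg) / len(posts)

def weighted(posts, key):
    """The weighting key encodes an assumption about which posts matter more."""
    total = sum(p[key] for p in posts)
    return sum(p["sentiment"] * p[key] for p in posts) / total

print("equal       :", round(equal_weight(posts), 3))
print("net balance :", round(net_balance(posts), 3))
print("credibility :", round(weighted(posts, "credibility"), 3))
print("engagement  :", round(weighted(posts, "engagement"), 3))
```

On this toy example the equal-weight and engagement-weighted indices are clearly positive, while the credibility-weighted index is close to zero and slightly negative, which illustrates why the weighting assumption, rather than the underlying posts, can determine the apparent signal.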
The evidence window and sampling rules often dominate what the aggregated sentiment represents. Aggregating over long windows can dilute actionable changes and can blend anticipatory commentary with reaction narratives. Narrow windows reduce contamination but increase variance and sensitivity to bursts, especially during major events. If the window overlaps the labeling horizon, sentiment indices can partially encode the outcome through contemporaneous discussion. Even when overlap is avoided, a calendar-day window can still misalign with market sessions and after-hours trading, shifting posts into the wrong decision interval. Session-aware windows and explicit decision cutoffs are therefore necessary if the sentiment signal is meant to support time-bound actions [41,157,158].
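A minimal sketch of session-aware alignment is shown below, assuming a single hypothetical decision cutoff, one timezone, and no holiday calendar; it illustrates how posts after the cutoff or on non-trading days are rolled forward to the next decision interval.

```python
# Minimal sketch: assign posts to session-aligned decision windows instead of calendar days.
# The 15:30 cutoff, single timezone, and absence of holidays are simplifying assumptions.
from datetime import datetime, time, timedelta

CUTOFF = time(15, 30)  # hypothetical decision cutoff before the session close

def decision_date(posted_at: datetime):
    """Posts after the cutoff only become usable for the next session's decision."""
    d = posted_at.date()
    if posted_at.time() >= CUTOFF:
        d = d + timedelta(days=1)
    while d.weekday() >= 5:          # roll weekend posts forward to Monday's decision
        d = d + timedelta(days=1)
    return d

for ts in [datetime(2025, 3, 7, 10, 0),   # Friday morning -> Friday decision
           datetime(2025, 3, 7, 16, 0),   # Friday after cutoff -> Monday decision
           datetime(2025, 3, 8, 12, 0)]:  # Saturday -> Monday decision
    print(ts, "->", decision_date(ts))
```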
A further complication is volume and attention confounding. Social sentiment indices often co-move with message volume and engagement. Volume spikes are correlated with volatility and news intensity, which means that sentiment features can act as indirect proxies for market stress rather than for directional expectations. If models do not separate sentiment polarity from attention level, improvements may reflect detection of high-attention episodes rather than extraction of market-relevant beliefs [159]. Practical mitigations include reporting sentiment conditional on volume levels, using normalized indices that control for baseline activity, and evaluating incremental value over simple attention measures [160].
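One simple form of such a control is sketched below: the sentiment value is reported together with an attention z-score computed against a rolling volume baseline, so downstream analysis can condition on attention regime; the window length and flagging threshold are illustrative assumptions.

```python
# Minimal sketch: separate sentiment polarity from attention level before interpretation.
# Window length, z-score threshold, and the toy input series are illustrative assumptions.
import numpy as np

def attention_adjusted(sentiment, volume, window=30, z_flag=2.0):
    """Return (sentiment, attention z-score, attention-driven flag) per period.

    Downstream use can then report sentiment separately for normal vs. high-attention
    periods instead of letting volume spikes dominate the index.
    """
    sentiment, volume = np.asarray(sentiment, float), np.asarray(volume, float)
    out = []
    for t in range(len(volume)):
        hist = volume[max(0, t - window):t]
        if len(hist) < 5:                           # not enough history to normalize
            out.append((sentiment[t], np.nan, False))
            continue
        z = (volume[t] - hist.mean()) / (hist.std() + 1e-9)
        out.append((sentiment[t], z, z >= z_flag))  # flag likely attention-driven periods
    return out

flags = attention_adjusted(sentiment=[0.1] * 40,
                           volume=[100 + (i % 7) * 5 for i in range(35)] + [900] * 5)
print(flags[-1])  # mild sentiment, but the volume spike is flagged as attention-driven
```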
Label and ground truth choices also shape what aggregation can legitimately claim. When sentiment models are trained or calibrated using weak labels (emojis, hashtags, reaction icons), the resulting scores may reflect platform conventions rather than financial meaning [161]. When labels are derived from price movement, the sentiment estimator can become circular, because it is optimized to reproduce outcomes that it is later used to predict [103]. Human annotation improves interpretability but remains sensitive to financial literacy, sarcasm, and mixed sentiment. For this reason, aggregation should be paired with uncertainty estimates and with error analysis focused on ambiguous, multi-entity, and event-driven posts [162].
Finally, sentiment signals are regime-dependent. During crises and high-volatility periods, language becomes more polarized and narrative coordination becomes more common; during stable periods, sentiment may be diffuse and less predictive [163]. Platform drift adds another layer, as moderation and ranking changes alter which messages are observed and which accounts dominate discussion [154]. Aggregation methods that work in one situation can degrade in another if they rely on engagement structure, a fixed vocabulary, or a stable distribution of author types.
In decision-oriented settings, sentiment aggregation is most useful when it is instrument-specific, time-aligned to session cutoffs, and designed to reduce dependence on visibility artifacts. When these conditions are not met, aggregated sentiment often captures platform attention dynamics and post-event commentary more than anticipatory information [159].

5.2. Relational Signals: Diffusion, Coordination, Influence

Relational signals aim to capture information that is not contained in isolated messages. In finance-related social environments, market-relevant patterns often emerge through who interacts with whom, how narratives propagate, and whether multiple accounts behave in a coordinated way. These phenomena are frequently operationalized through interaction graphs, diffusion traces, or community-level summaries, then used to support tasks such as early warning, manipulation detection, influence estimation, and cross-asset dependence analysis [164,165,166].
Diffusion signals describe how content spreads over time. The underlying assumption is that propagation dynamics encode something about credibility, urgency, or coordinated amplification. That assumption is plausible in some settings, but it is easy to misapply. Propagation can reflect the platform’s ranking and recommendation system as much as it reflects user choice, so diffusion speed and reach are not purely social quantities [167,168]. In addition, diffusion is strongly event-conditioned: during high-attention episodes, even low-information content can propagate quickly, while in routine periods informative content may remain local. If diffusion features are evaluated mainly on event-heavy intervals, they can become proxies for market stress rather than indicators of decision-relevant signals [169].
Coordination signals focus on patterned behavior across accounts: synchronized posting, repeated phrasing, shared link targets, bursts of reshares within narrow windows, or consistent cross-account co-mentioning. These patterns can indicate manipulation or organized promotion, but they also occur in legitimate settings, such as communities responding to scheduled announcements or reacting to breaking news. A central difficulty is separating coordination that is strategically engineered from coordination that is simply collective attention. The two can look similar in aggregate statistics. More robust formulations rely on fine-grained timing, repeated motif structures, or stability of coordinated behavior across multiple episodes, rather than on raw volume or engagement spikes alone [170,171,172].
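The sketch below illustrates the episode-stability idea in its simplest form: account pairs are flagged only when near-synchronous sharing of the same link recurs across multiple distinct links; the time window and repetition threshold are illustrative assumptions.

```python
# Minimal sketch: flag account pairs that repeatedly share the same link within a narrow window.
# The 60-second window and the two-episode repetition threshold are illustrative assumptions.
from collections import defaultdict
from itertools import combinations

def coordinated_pairs(events, window_s=60, min_episodes=2):
    """events: list of (account, url, timestamp_seconds).

    A pair is flagged only if near-synchronous sharing of the *same* URL recurs across
    at least `min_episodes` distinct URLs, which helps separate engineered amplification
    from a one-off collective reaction to a single breaking story.
    """
    by_url = defaultdict(list)
    for account, url, ts in events:
        by_url[url].append((account, ts))

    pair_episodes = defaultdict(set)
    for url, posts in by_url.items():
        for (a1, t1), (a2, t2) in combinations(sorted(posts, key=lambda p: p[1]), 2):
            if a1 != a2 and abs(t1 - t2) <= window_s:
                pair_episodes[frozenset((a1, a2))].add(url)

    return {tuple(sorted(p)): urls for p, urls in pair_episodes.items() if len(urls) >= min_episodes}

if __name__ == "__main__":
    events = [("a", "u1", 0), ("b", "u1", 20), ("a", "u2", 500), ("b", "u2", 530), ("c", "u1", 10_000)]
    print(coordinated_pairs(events))  # {('a', 'b'): {'u1', 'u2'}}
```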
Influence signals attempt to estimate whether certain accounts, communities, or narratives have disproportionate downstream effects. Influence is often inferred from follower structure, reshare cascades, or centrality-based features. These proxies are convenient, yet they mix exposure, visibility, and persuasion. An account can appear influential because it is amplified by platform mechanics, because it posts after outcomes are already visible and others reshare it, or because it specializes in high-attention topics. Without careful temporal separation, influence estimates can drift into reverse causality, where market moves trigger social activity and then social activity is interpreted as having driven the move [173,174,175].
Across diffusion, coordination, and influence, the construction of relational features is the main validity bottleneck. Graphs built from long windows can leak outcome information when they include reaction interactions that occur after the decision cutoff. Co-mention graphs are sensitive to entity ambiguity, especially when a single post references multiple instruments with different intents. Reply edges often reflect disagreement rather than diffusion, while reshare edges can represent endorsement, ridicule, or automated amplification. These ambiguities matter because relational models can be highly effective at exploiting structure, even when that structure encodes platform artifacts or post-event behavior rather than anticipatory information [96,176,177].
Observability constraints further complicate interpretation. Private groups, deleted threads, and rate-limited collection create systematic gaps. Missing edges are not random and can vary with volatility and moderation intensity. When the observed network is treated as complete, model outputs can overweight highly visible communities and underrepresent less visible channels that may still matter for risk monitoring or manipulation. This representativeness issue becomes more severe when results are generalized beyond the specific platform, language, or access tier used for collection [178,179].
Evaluation should reflect the operational meaning of relational signals. For detection tasks, timeliness is part of performance: a delayed correct detection may have limited value, and time-aware evaluation metrics explicitly incorporate detection delay rather than treating outputs as static labels [180]. False positives also have direct cost because they consume analyst capacity and can create alarm fatigue; recent work formalizes the “cry-wolf” effect and shows why standard detection metrics can be insufficient in practice [181]. For influence-oriented tasks, stability across regimes is more informative than peak performance in a single episode, because influence relationships can be transient and can shift when communities migrate or when platform policies change [182]. For diffusion-based indicators, it is important to separate the effect of content from the effect of attention, since diffusion intensity and attention often rise with volatility regardless of informational content [183].
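A minimal sketch of a delay-aware score is given below, in which detection credit decays linearly with delay and unmatched alerts are counted as false positives; the delay budget and decay form are illustrative assumptions rather than a standard metric definition.

```python
# Minimal sketch: evaluation that treats detection delay and false alerts as part of performance.
# The linear credit decay over a 6-hour budget is an illustrative assumption.
def timeliness_score(alerts, events, max_delay_h=6.0):
    """alerts/events: lists of (entity, time_h). Credit decays linearly with delay;
    alerts matching no event within the budget count as false positives."""
    credit, matched, false_positives = 0.0, set(), 0
    for entity, a_t in alerts:
        candidates = [(e_ent, e_t) for e_ent, e_t in events
                      if e_ent == entity and 0.0 <= a_t - e_t <= max_delay_h
                      and (e_ent, e_t) not in matched]
        if candidates:
            _, e_t = min(candidates, key=lambda c: a_t - c[1])
            credit += 1.0 - (a_t - e_t) / max_delay_h
            matched.add((entity, e_t))
        else:
            false_positives += 1
    recall_like = credit / max(len(events), 1)
    return recall_like, false_positives

# One timely detection, one detection outside the delay budget (scored as a false alert).
print(timeliness_score(alerts=[("X", 1.0), ("Y", 9.0)], events=[("X", 0.0), ("Y", 2.0)]))
```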
Relational signals are therefore most defensible when they are built from time-local evidence that respects decision cutoffs, when interaction semantics are audited rather than treated as interchangeable edges, and when results remain stable after controlling for attention level and event structure. This is consistent with time-aware evaluation work that treats on-time detection as part of correctness [184], with engagement studies showing that likes, replies, retweets, and quotes behave differently and should not be treated as equivalent relational evidence [185], and with finance evidence that jointly models sentiment and attention across sources to avoid attributing attention-driven movement to content alone [186].

5.3. Multimodal and Cross-Platform Signals

Multimodal and cross-platform signals reflect an empirical reality: finance narratives rarely stay in one format or one platform. Text posts reference charts, screenshots, memes, videos, and links, while related claims can appear across investor communities with different user mixes and platform mechanics [187]. Combining these sources can increase coverage and reduce reliance on any single stream, including setups that explicitly fuse textual sentiment with visual encodings of price action [188]. It also introduces new alignment errors and new routes for spurious predictability, especially when modalities differ in availability, timing, and trustworthiness.
Multimodal signals typically combine social text with auxiliary inputs such as market time series, engagement traces, images, or linked content. The most common multimodal setting is text plus market features [189]. This pairing can improve short-horizon forecasting by providing context on volatility, trend, and liquidity conditions. Yet attribution is often underspecified: market/technical features are frequently strong baselines, so multimodal gains should be reported alongside technical-only comparisons and explicit ablations that quantify the incremental contribution of the social channel [189,190]. Without isolating marginal value, improvements attributed to multimodality may be driven mainly by added market covariates rather than better extraction of social information.
Visual content raises different issues. Images and short videos in finance communities often include annotated charts, screenshots of positions, headlines, or stylized memes that signal group identity and intent. These inputs can carry information that is not present in text alone, but they are also easy to manipulate and hard to attribute, which makes integrity checks and provenance handling central when visuals are treated as decision evidence [191]. A chart screenshot does not guarantee a real position; a headline image can be edited; memes can encode stance indirectly and shift meaning quickly across subgroups [192]. Visual models can then learn community markers and attention cues rather than content that generalizes across periods, especially when labels are limited and models exploit shortcut correlations that look predictive in-sample but do not hold under shift [193].
Linked and referenced content introduces a provenance challenge. URLs may point to news, blog posts, filings, or aggregator pages. Using link targets as evidence can be informative, but it can also introduce instability because pages can disappear or drift away from what was originally referenced [194]. That instability is not only technical: pipelines that treat URL targets as “evidence” need explicit content-trace mechanisms that record what was actually consulted at decision time [195]. Selection effects matter too: algorithmic curation can change exposure to posts that contain external links, so link-derived features can partly track platform-specific linking patterns rather than underlying financial content [196].
Cross-platform signals extend these issues. Platforms differ in user identity conventions, language style, moderation policies, and ranking logic; propagation between platforms can therefore be asymmetric and delayed [197]. Because identity linkage across platforms is itself a difficult inference problem, cross-platform fusion often falls back to topics, URLs, or coarse entity mentions rather than robust person-level mapping [198]. Multi-platform analyses also show systematic asymmetries in narrative framing and lead–lag dynamics between platforms, which makes cross-platform aggregation useful for broad themes but fragile for instrument-level decisions where attribution precision matters [199].
Missingness is a defining feature of multimodal and cross-platform settings. Many posts have no images; some platforms restrict access to engagement counters; some evidence disappears through deletion or moderation. Missingness is often correlated with content type, community norms, or platform policy, so it is not safe to treat it as random [200]. Models can learn missingness itself as a predictor, which yields performance that does not transfer when collection conditions change [201]. Practical designs must therefore incorporate explicit missing-modality handling and should be evaluated under controlled missingness patterns that resemble real collection constraints [202].
Trustworthiness is uneven across modalities and platforms. Engagement can be inflated, visuals can be fabricated, and cross-platform reposting can be coordinated [203]. Treating all channels as equally reliable therefore encourages overconfident outputs. More stable approaches attach reliability indicators to each modality, but these indicators should be grounded in feature-level signals rather than in generic trust assumptions. Relevant signals include provenance-related features (whether the content can be linked to a stable source, whether the linked artifact was accessible at decision time, whether the same URL or image hash recurs across coordinated accounts), account-behavior features (posting burstiness, account age and persistence, original-to-reshare ratio, repeated cashtag or hashtag usage, cross-account timing synchronization, and unusual engagement-to-activity patterns), and observability features (freshness, collection delay, deletion or moderation exposure, and whether a modality is intermittently absent under known platform constraints) [204]. These cues do not establish truth on their own, but they can help distinguish routine evidence from content that is more likely to be manipulated, recycled, or selectively observed. Just as importantly, shifts in the distribution of these signals can indicate that the reliability estimator itself is drifting. Sudden increases in synchronized posting, template reuse, anomalous engagement relative to historical baselines, dominance of newly created or previously inactive accounts, or rising conflict between social content and externally attributable sources are all warning signs that a weighting scheme calibrated on earlier periods may no longer be valid [122]. Reliability weighting should therefore be monitored as a time-varying component, with explicit reporting of when its cues become unstable, less discriminative, or biased by changing manipulation and moderation conditions.
Multimodal and cross-platform integration is most useful when it improves robustness rather than just average accuracy [205]. Practical evidence includes performance stability across regimes, graceful degradation under missing evidence, and reduced dependence on dominant or easy-to-game channels [206,207]. When these properties are not demonstrated, observed gains often reflect flexible fitting to dataset-specific artifacts and platform-dependent visibility patterns.
Table 2 provides a compact summary of the main multimodal and cross-platform signal sources, clarifying what each adds beyond text-only inputs, the dominant validity risks, and the minimum safeguards needed for results to be interpretable under realistic observability and alignment constraints.

5.4. What Transfers Across Market Phases, What Breaks Under Drift

A recurring claim in the literature is that social signals generalize across time. In practice, transfer is uneven. Some relationships appear stable because they reflect broad attention and information-flow dynamics, while others are contingent on narrow conditions such as high-volatility intervals, episodic attention shocks, or concentrated retail participation. Separating durable structure from episode-specific correlation matters because the latter can look strong in retrospective evaluation yet deliver limited decision value in ordinary conditions [208,209,210].
Signal stability varies by task and by the way the signal is constructed. Aggregated sentiment measures can shift in apparent usefulness across calmer versus turbulent market phases, and the same social variable can change meaning as platform behavior and incentives change. Relational indicators show similar fragility: dense interaction around an asset may reflect organic diffusion in one period, then become dominated by bot activity or coordinated promotion in another, creating spurious persistence that breaks under drift [211,212]. Multimodal settings inherit the same failure modes when one channel becomes intermittently unreliable, for example when linked resources disappear or drift away from what was originally referenced, so time-respecting evaluation needs to treat evidence availability as part of the system state [213].
Drift enters through multiple pathways. Language change is common in finance communities, where slang, sarcasm conventions, and narrative templates evolve quickly, particularly around new assets or new market narratives [214]. Platform changes can be equally disruptive: adjustments in recommendation and personalization systems alter what becomes visible, moderation actions change which accounts survive, and access constraints reshape the observable sample [215]. Behavioral change adds another layer, because the population producing finance posts is not fixed; participation expands and contracts with market excitement, and new sub-communities emerge with different norms and different levels of sophistication [18]. These shifts can break models even when the underlying financial target is defined consistently.
Evaluation choices can hide drift sensitivity. If test sets are drawn from the same contiguous period as training data, performance can look stable while failing when applied to later periods with different language and interaction patterns. Event-centered datasets amplify this risk because they concentrate on a limited set of market episodes, often with unusually strong coupling between attention, narrative coordination, and price movement. When systems are evaluated mainly on such windows, they can appear robust while relying on a narrow set of high-attention cues. Designs that separate training and testing time windows (rather than relying on a single split) and that treat forecasting evaluation as rolling, time-respecting estimation are more informative about deployment behavior under drift [216,217]. Drift-aware monitoring and detection further helps make “breakage under shift” observable rather than implicit [218].
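The sketch below shows a minimal rolling, time-respecting split generator with an embargo gap between training and test windows; the window lengths, gap, and step are illustrative assumptions.

```python
# Minimal sketch: rolling, time-respecting evaluation splits with an embargo gap.
# The split sizes and gap are illustrative assumptions.
import numpy as np

def rolling_time_splits(n, train_len, test_len, gap=0, step=None):
    """Yield (train_idx, test_idx) pairs that keep test data strictly after training data.

    The gap leaves an embargo between training and test windows so that evidence windows
    overlapping the label horizon cannot leak across the split.
    """
    step = step or test_len
    start = 0
    while start + train_len + gap + test_len <= n:
        train = np.arange(start, start + train_len)
        test = np.arange(start + train_len + gap, start + train_len + gap + test_len)
        yield train, test
        start += step

if __name__ == "__main__":
    for train, test in rolling_time_splits(n=20, train_len=8, test_len=4, gap=2):
        print(f"train {train[0]}-{train[-1]}  |  test {test[0]}-{test[-1]}")
```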
Some failure modes are particularly common. First, models can become overconfident under shift, producing extreme probabilities even when evidence quality degrades or when the evidence distribution changes [219]. Second, signals can invert, where indicators that were positively associated with returns or volatility become negatively associated after participation patterns change or after communities learn to game visible metrics [220]. Third, entity attribution errors increase under drift because new tickers, renamed tokens, and evolving slang expand ambiguity; forced mapping of uncertain mentions then introduces systematic noise rather than random error [221].
Mitigation strategies can reduce, but not eliminate, these problems. Periodic retraining can help when drift is gradual, yet it becomes risky when retraining data includes reaction content that is only observable after outcomes, or when platform changes alter sampling in ways that retraining silently absorbs [222]. Calibration and uncertainty estimation are often more valuable than marginal accuracy gains because they control decision thresholds and alert volume under changing conditions [162]. Stress testing under degraded observability is also informative: models should be evaluated when evidence is sparse, delayed, or missing, because these conditions coincide with operational pressure and platform instability [223].
Transfer claims are therefore most credible when supported by time-forward evaluation over multiple non-adjacent periods, with explicit reporting of where performance degrades and which evidence properties changed [222]. Without that, a model can appear reliable mainly because it is evaluated within a narrow distribution that does not resemble actual deployment.

6. Assurance Conditions for Decision Readiness: Manipulation, Observability, and Traceability

Social network analytics can produce useful signals, but the evidence stream is easy to distort and difficult to observe consistently. Decision relevance therefore depends not only on model accuracy, but on whether the pipeline remains reliable when exposure is shaped by platform ranking, when content is synthetic or coordinated, and when posts or accounts are deleted or removed so that the remaining data no longer reflects what was actually visible at the time [38,224]. Access constraints and platform policy changes further complicate observability, so operational claims should be tied to what was actually collectible at the time and under the same access regime [225]. This section therefore consolidates assurance conditions, especially manipulation resilience, observability under stress, and traceability of outputs back to time-appropriate evidence, supported by explicit documentation of data provenance and intended use [226].
Figure 4 provides a compact view of the end-to-end pipeline from social evidence to actions and evaluation, highlighting where validity risks enter. The figure is used here as an assurance map: manipulation and platform bias affect what evidence is observed; timing and alignment issues can create optimistic results through overlap and delay; attribution errors can corrupt instrument-level signals; and drift or missing modalities can degrade reliability under real operating conditions.

6.1. Manipulation, Synthetic Content, and Platform Bias

Manipulation is not an edge case in finance-related social streams. It appears as coordinated posting and reposting, engagement inflation, narrative seeding around specific instruments, and strategic use of ambiguous tickers or token symbols [23,24]. Some campaigns aim to affect market behavior directly, while others target the measurement process itself by shaping what becomes visible, what receives interaction, and what is likely to be collected under keyword- or account-based sampling [54]. These behaviors can generate signals that look consistent and predictive in retrospective data, even when they reflect engineered amplification rather than informational content.
Synthetic content increases this risk. Large-scale text generation enables high-volume, stylistically consistent messaging that can be adapted rapidly to current narratives. Synthetic images and short videos can also be used to circulate edited headlines, fabricated screenshots, or persuasive chart imagery [227,228]. The practical impact is twofold. First, models trained on historical corpora can misinterpret synthetic artifacts as legitimate community signals, especially when the artifacts mimic high-credibility styles. Second, the presence of synthetic content can change the base rate of certain linguistic and visual cues, weakening features that previously carried meaning [229,230]. In this setting, an apparent improvement in prediction may reflect improved detection of campaign templates, not better inference about underlying market expectations.
Platform bias interacts with manipulation and often dominates what is observable. Ranking systems and recommendation logic systematically shape exposure, concentrating attention on certain narratives and accounts while suppressing others [231,232]. Moderation and enforcement actions remove content in non-random ways, sometimes after it has already propagated widely. Access constraints add further selection: APIs and scraping pipelines rarely provide a complete stream, and the resulting dataset is typically conditioned on visibility and access rather than only on user behavior [233]. When this conditioning is not modeled, social signals can become proxies for platform amplification rather than for beliefs or information.
Ethical concerns extend beyond technical validity. Even when social media content is publicly accessible, its reuse in finance-facing systems raises questions about privacy, secondary use, user expectations, and compliance with platform-specific terms and access policies [231,233]. These issues are not only legal or procedural. They affect whether evidence collection and downstream deployment remain defensible when systems are used for surveillance, risk escalation, or action prioritization. The concern is sharper in financial contexts because outputs can influence trading behavior, market monitoring, or suspicion scoring, which creates stronger incentives for strategic manipulation and raises the possibility that model use itself may feed back into market behavior [220]. For this reason, responsible deployment requires explicit documentation of collection conditions, conservative treatment of identifiable user-level signals, and governance safeguards that distinguish public-signal monitoring from intrusive profiling or opaque market-facing automation.
These issues matter because they distort error costs and evaluation meaning. A surveillance-oriented system that is insensitive to coordinated amplification can produce high false-positive rates by treating campaign bursts as emerging risks. A trading-oriented system can be steered by manufactured attention that coincides with volatility, producing timing signals that do not persist outside the sampled period. The situation worsens when collection is unstable during spikes, or when sampling endpoints can be influenced in ways that change what counts as observable activity [14,234].
Mitigation is partly technical and partly methodological. On the technical side, evidence streams benefit from provenance-aware collection, indicators of inauthentic or coordinated behavior, and reliability signals that down-weight evidence with suspicious propagation patterns. Because removals and account enforcement are systematically uneven, pipelines also need to treat deletions as a source of bias in what can later be audited and re-scored [235]. On the methodological side, claims should be supported by evaluations that include manipulation-prone episodes, include stress tests where engagement is artificially inflated or content is templated, and report degradation under these conditions, not only average performance. Provenance-based trust evidence models can help formalize what is accepted as reliable evidence at runtime, rather than treating all channels as equally trustworthy [236].

6.2. Degraded Observability, Missingness, and Delay

Operational settings rarely provide a complete and stable evidence stream. Missingness arises from rate limits, access tier changes, scraping interruptions, API outages, moderation actions, and user-driven deletions, and these gaps are typically structured rather than random [237,238]. They often cluster around high-attention episodes, when posting volume spikes and collection systems saturate, and they can also follow platform interventions that remove specific narratives or communities. As a result, the moments when decision support is most demanded can coincide with the poorest observability [239].
Degraded observability changes both feature meaning and model calibration. A sentiment index computed from sparse posts is not simply noisier; it can become systematically biased if the remaining observable content comes disproportionately from high-visibility accounts, promoted narratives, or a single language community. Relational signals are similarly affected because missing edges alter network structure in non-uniform ways; centrality, cascade depth, and coordination motifs can change dramatically when parts of the interaction graph are unobserved. Multimodal systems face additional fragility because one channel can disappear entirely, for example, when image access is restricted, engagement counters are hidden, or linked pages become unavailable [240,241,242].
Robustness therefore requires explicit handling of evidence absence, but the appropriate choice depends on the financial task. Many pipelines implicitly assume that missing content can be imputed without consequence or that models will generalize from complete-data training. In practice, imputation can introduce systematic artifacts when missingness correlates with the target, with volatility, or with platform policy, because missingness patterns can be informative rather than incidental [243]. For sentiment-to-signal pipelines, masking or missing-indicator handling is usually preferable when missing posts, engagement counters, or linked artifacts reflect platform constraints, event-time overload, or selective moderation, because the absence pattern may itself carry information about observability and evidence quality. Imputation is more defensible only when the missing field is low-level, auxiliary, and comparatively stable; for example, short gaps in dense numeric covariates or optional modality embeddings whose absence is not itself decision-relevant, and even then the model should retain an explicit indicator that the value was unavailable. For relational or surveillance-oriented tasks, masking is generally safer than imputation because synthetic filling can fabricate cascade structure, centrality, or coordination motifs that were never actually observed; under these conditions, partial observability should be treated as part of the problem rather than repaired away. For multimodal systems, learned/default imputation may be operationally acceptable when one optional modality is frequently absent but the system is explicitly trained and stress-tested under that absence pattern; otherwise, masking and fallback behavior are more defensible [244,245]. However, even masking is not risk-free because models can learn collection failure as a shortcut predictor rather than as a sign of evidential uncertainty. The practical implication is therefore not that masking is universally preferable to imputation, but that missingness should be preserved when absence is plausibly informative, whereas imputation is more appropriate when the missing field is peripheral, structurally regular, and unlikely to change the interpretation of the financial signal. Because the appropriateness of masking and imputation depends on the decision setting, Table 3 summarizes the main task-dependent trade-offs most relevant to social–financial pipelines.
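As a concrete example of the masking-with-indicator option, the sketch below replaces missing values with a neutral fill while appending explicit missingness indicators, so absence remains visible to the model instead of being repaired away; the feature layout and fill value are hypothetical.

```python
# Minimal sketch: preserve missingness as information instead of silently imputing it.
# The feature layout and neutral fill value are illustrative assumptions.
import numpy as np

def mask_with_indicator(x):
    """Replace NaNs with a neutral fill value and append one indicator column per feature,
    so the model can distinguish 'value was zero' from 'value was unobservable'."""
    x = np.asarray(x, float)
    missing = np.isnan(x).astype(float)          # 1.0 where the evidence was absent
    filled = np.where(np.isnan(x), 0.0, x)       # neutral fill; no synthetic structure is invented
    return np.concatenate([filled, missing], axis=1)

# Example: engagement counter hidden for one post, another field absent for a second post.
features = np.array([
    [0.4, np.nan, 1.0],
    [np.nan, 0.2, 0.0],
])
print(mask_with_indicator(features))
```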
A second dimension is delay. Even when evidence is ultimately collected, it may not be observable within the decision latency budget. Delays can be variable, platform-specific, and load-dependent. If retrospective evaluation assumes immediate availability, results can be overly optimistic, particularly for short-horizon decisions where minutes matter [246,247]. Delay also interacts with reaction content: later-arriving posts are more likely to be commentary after an event begins to unfold, which increases the risk that the model is implicitly conditioned on outcomes rather than on anticipatory evidence unless alignment and timing controls are explicit [248].
Robust decision support requires fallback behavior. When evidence quality drops, systems should degrade gracefully, for example, by abstaining (reject option), widening uncertainty bounds (set-valued outputs), switching to conservative thresholds, or relying more heavily on stable covariates that are consistently observable [249,250]. Without explicit fallback logic, models can produce confident outputs from thin evidence and can concentrate errors in precisely the windows where operational impact is highest. Calibration becomes central here: probability estimates that are acceptable on average can drift sharply under distributional shift and degraded evidence quality, destabilizing decision thresholds and alert volume [251].
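A minimal sketch of such fallback logic is shown below, where the action becomes progressively more conservative as a single evidence-quality score degrades; the thresholds, action labels, and the scalar quality score are illustrative assumptions.

```python
# Minimal sketch: graceful degradation when evidence quality drops.
# The quality score, confidence thresholds, and action labels are illustrative assumptions.
def decide(score: float, confidence: float, evidence_quality: float) -> str:
    """Return an action that becomes more conservative as evidence quality degrades.

    evidence_quality could combine coverage, delay, and reliability mass; here it is
    treated as a single number in [0, 1].
    """
    if evidence_quality < 0.3:
        return "abstain"                          # too little trustworthy evidence to act on
    threshold = 0.6 if evidence_quality > 0.7 else 0.8   # stricter gate under partial evidence
    if confidence < threshold:
        return "monitor"                          # log and track, but do not escalate
    return "escalate"                             # confident output backed by adequate evidence

print(decide(score=0.4, confidence=0.75, evidence_quality=0.9))   # escalate
print(decide(score=0.4, confidence=0.75, evidence_quality=0.5))   # monitor
print(decide(score=0.4, confidence=0.95, evidence_quality=0.2))   # abstain
```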
Evaluation should match these conditions. Beyond time-forward splits, studies that claim operational readiness should include controlled tests where evidence is partially removed, delayed, or corrupted in plausible ways, because distribution shifts can change reliability even when average metrics look stable [252]. For multimodal systems, missing-modality stress tests should be reported rather than assumed. For relational systems, partial-observability tests should examine how outputs change when edges are missing in structured patterns consistent with rate limiting or moderation; centrality measures can be sensitive to incomplete network data, and diffusion structure can be mis-inferred under partial observation [253,254]. For all systems, performance should be reported not only as an average but also as conditional behavior under low-observability and high-pressure intervals, since this is where decision support is most likely to fail.

6.3. Interpretability, Traceability, and Audit Needs

Interpretability in social–financial systems has a narrower meaning than in many general ML settings. The primary requirement is not a human-friendly story about why a model produced an output, but the ability to reconstruct which evidence was used, whether that evidence was observable at the decision time, and which transformations turned raw posts into a decision-facing signal. Without this traceability, it is difficult to diagnose failures, hard to separate model error from data misalignment, and risky to deploy systems in workflows that require accountability [255,256].
Explanations based on feature attributions or attention weights can be useful as diagnostic aids, but they are easily over-interpreted. In social evidence streams, highly weighted tokens or edges may reflect platform artifacts, common templates, or topic markers rather than drivers that generalize across periods. Explanations can also be unstable under small perturbations in preprocessing, sampling, or timing, and they can be gamed—so they are a weak basis for high-stakes justification [257,258]. For graph-based models, explanations are further complicated by ambiguous edge semantics and partial observability; highlighting an influential neighbor is not the same as identifying a meaningful influence mechanism, and many explainers struggle to remain faithful and reliable across settings [259].
For decision readiness, traceability should be framed as an engineering property of the pipeline. In financial settings, this requirement is not only methodological but also organizational, because systems that influence monitoring, risk assessment, or action prioritization must support post hoc reconstruction, secure evidence handling, and defensible governance review. More broadly, work on AI-enabled financial services has also emphasized that security, privacy, and trust are inseparable from the practical legitimacy of these systems, since opaque or weakly controlled pipelines can undermine both regulatory confidence and user acceptance [260]. At minimum, a system should log the evidence window and decision cutoff, the collection timestamps and any observation delay, the entity linking outputs and confidence, and the aggregation rules that produce indices. For relational signals, the graph construction policy must be versioned, including the edge types, the temporal window used to form edges, and any sampling or sparsification rules. For multimodal systems, missing-modality handling and reliability weighting should be explicit and auditable, and aligned with process-level quality gates rather than treated as an afterthought [261,262]. These records allow post-incident analysis to answer basic questions after a failure: whether evidence was stale, whether an entity was misattributed, whether the signal was dominated by suspicious engagement, or whether modality dropout triggered an implicit behavior change.
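The sketch below illustrates one possible shape for such a trace record; the field set is a hypothetical minimum intended to show the kind of information that must be captured, not a complete audit schema.

```python
# Minimal sketch: a per-signal trace record that supports post hoc reconstruction.
# The field set and values are illustrative assumptions, not a complete audit schema.
from dataclasses import dataclass, field, asdict
import json

@dataclass
class SignalTrace:
    signal_id: str
    decision_cutoff: str                 # ISO timestamp of the decision cutoff
    evidence_window: tuple               # (start, end) of the evidence window
    collection_delay_s: float            # observed delay between posting and collection
    entity_linking: dict                 # mention -> (instrument, confidence)
    aggregation_rule: str                # versioned name of the aggregation policy
    graph_policy_version: str            # edge types and temporal window for relational features
    missing_modalities: list = field(default_factory=list)
    model_version: str = "unversioned"

trace = SignalTrace(
    signal_id="sig-000123",
    decision_cutoff="2025-03-07T15:30:00Z",
    evidence_window=("2025-03-07T09:30:00Z", "2025-03-07T15:15:00Z"),
    collection_delay_s=42.0,
    entity_linking={"$ABC": ("ABC Corp", 0.93)},
    aggregation_rule="credibility_weighted_v2",
    graph_policy_version="reply+reshare_6h_v1",
    missing_modalities=["images"],
    model_version="fusion-2025.02",
)
print(json.dumps(asdict(trace), default=str, indent=2))
```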
Audit needs extend beyond reconstruction. Operational use requires monitoring of evidence quality and model behavior over time. This includes tracking missingness rates, shifts in the distribution of key features, changes in the prevalence of templated content, and changes in coordination indicators [50]. It also includes monitoring of calibration and threshold stability, because decision systems often fail through alert flooding or silent under-detection even when headline metrics look stable; calibration shift can emerge under distribution shift and may need explicit detection/correction mechanisms [138]. When models are retrained, audits must ensure that the training corpus respects decision-time observability and does not absorb post-event artifacts that create optimistic backtests [31].
Interpretability should therefore be treated as part of a broader accountability stack: reproducible preprocessing and alignment, evidence provenance and freshness controls, and traceability logs that allow post hoc review of decisions [132,133]. Explanations are valuable insofar as they support debugging and governance, but they cannot substitute for rigorous control over what evidence was available, how it was transformed, and how uncertainty was handled under drift and partial observability.
To make results interpretable and comparable across studies, Table 4 lists a minimal set of assurance items that should be reported for social signal pipelines used in financial tasks.

7. Synthesis: Stable Findings, Fragile Findings, and Failure Modes

The literature on social network analytics for finance has expanded rapidly, with improvements reported across sentiment modeling, relational inference, and multimodal integration. More broadly, work on big data decision pipelines also reinforces that real-time deployment requirements differ from retrospective predictive evaluation, especially when outputs are used to support operational decisions [263]. At the same time, many findings remain difficult to compare—and sometimes difficult to interpret—because evidence observability, platform-shaped visibility, and manipulation exposure are handled inconsistently across studies [28,264,265]. This section consolidates what appears to be relatively stable under more defensible evaluation choices, then separates recurring failure modes from genuinely open technical and methodological problems.
Because headline performance statistics are rarely comparable across studies with different evidence windows, label horizons, access conditions, and leakage controls, Table 5 summarizes the literature comparatively at the level of task formulation, evidence type, model family, and evaluation discipline rather than by pooled metric ranking.
Across these task families, the most stable pattern is not a consistent ranking of architectures, but a consistent ranking of evaluation discipline. Studies become more interpretable when they define the decision cutoff explicitly, align evidence to market time correctly, separate anticipatory from reaction content, and compare models under matched observability and baseline conditions. In contrast, apparently strong headline metrics become difficult to interpret when the underlying task formulation, sampling regime, or leakage controls differ. For this reason, the present review treats methodological comparability as more informative than pooled performance values across heterogeneous studies.

7.1. Stable Findings and Narrow Claims That Hold

Social network evidence can add value in narrowly defined settings, but the value is conditional and often concentrated in high-attention windows. Signals derived from social text, interaction dynamics, or cross-platform activity tend to be more informative when narratives are rapidly forming, participation is elevated, and uncertainty is high [2,19]. Under these conditions, attention and sentiment measures can support monitoring and triage tasks and may provide incremental information beyond simple market baselines, especially when signals are instrument-specific and aligned to explicit intraday or post-market decision cutoffs [22,41].
Across model families, improvements that persist most often are associated with better alignment and attribution rather than with architectural novelty alone. Time-respecting evaluation design and leakage control typically reduce optimistic estimates while producing results that are more likely to transfer [44,77]. Similarly, explicit handling of financial entity mentions, including disambiguation of ambiguous cashtags, improves interpretability and reduces persistent error accumulation in instrument-level signals [11].
Relational indicators are most defensible when they are used to characterize coordination patterns, narrative propagation, or dependence structure, rather than to assert causal influence from social actors to market outcomes [266]. Diffusion and coordination features can contribute to detection and surveillance, especially when evaluated with timeliness constraints and when the analysis distinguishes cascade structure and exposure timing instead of treating all propagation traces as equivalent [267]. Multimodal and cross-platform integration is most credible when it improves robustness to partial evidence and platform-specific shifts, and when the incremental contribution of each modality is isolated rather than assumed through explicit robustness and component-level evaluation [51,268].
Finally, assurance-oriented practices appear broadly necessary for decision readiness. Robust handling of missingness and delay, calibration under shift, and traceability of outputs back to time-appropriate evidence are repeatedly linked to whether systems can be interpreted as decision support rather than as retrospective pattern matching [269,270,271]. When these conditions are made explicit, claims become narrower, but they also become more comparable and more operationally meaningful.

7.2. Fragile Findings and Common Failure Modes

Many reported gains remain fragile because they depend on evaluation and data-construction choices that do not reflect deployment. A frequent problem is temporal leakage in subtle forms, including evidence windows that overlap label horizons and preprocessing/evaluation choices that leak future information into model development. In finance-facing ML settings, even strong retrospective results can disappear once look-ahead bias is removed or leakage-resistant validation is enforced [272,273]. A related issue for social data is that archived or recollected datasets may differ from what was actually observable at decision time because of nonrandom content/account disappearance and changing access conditions [39].
A second failure mode is confounding by attention and visibility. Social volume, engagement, and cascade intensity are often entangled with investor attention and volatility conditions, so models can appear predictive while mostly tracking stressed or highly salient periods rather than extracting decision-relevant social content [73,274]. Platform ranking adds another layer: what is observed is a visibility-shaped stream, and amplification can differ systematically across content types and account profiles, which changes what a model learns from the collected sample [154]. When studies do not condition on attention regimes or compare against simple attention baselines, gains attributed to sentiment or relational modeling can be overstated.
Entity attribution errors are also a persistent source of instability. Ambiguous symbols, evolving naming conventions, and multi-entity posts can produce systematic mislabeling at the instrument level; forcing a single attribution under uncertainty introduces structured bias rather than random noise. This concern is consistent with findings from noisy social media entity recognition and entity-linking research, where emerging entities and ambiguous mentions reduce reliability unless uncertainty is handled explicitly [275,276]. Under drift, these attribution failures can propagate into aggregated sentiment indices, co-mention graphs, and cross-platform topic alignment.
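A conservative attribution policy can be expressed as an abstention-first rule: attribute only when a mention resolves to a single candidate or when contextual support for one candidate is strong, and otherwise withhold the instrument label. The following sketch is illustrative; the symbol dictionary, scoring inputs, and confidence threshold are hypothetical.

# Minimal sketch of abstention-first attribution for ambiguous cashtags.
SYMBOL_CANDIDATES = {
    "$APP": ["AppLovin", "Applied Signal"],   # ambiguous: multiple plausible entities
    "$XYZ": ["XYZ Corp"],                      # unambiguous
}

def attribute(cashtag, context_score_by_entity, min_confidence=0.75):
    """Return (entity, confidence), or (None, confidence) to abstain under ambiguity."""
    candidates = SYMBOL_CANDIDATES.get(cashtag, [])
    if len(candidates) == 1:
        return candidates[0], 1.0
    if not candidates:
        return None, 0.0
    # Pick the best-supported candidate, but abstain if support is weak or split.
    scored = sorted(((context_score_by_entity.get(c, 0.0), c) for c in candidates), reverse=True)
    best_score, best = scored[0]
    if best_score < min_confidence:
        return None, best_score          # abstain rather than force an attribution
    return best, best_score

print(attribute("$XYZ", {}))                                          # ('XYZ Corp', 1.0)
print(attribute("$APP", {"AppLovin": 0.55, "Applied Signal": 0.40}))  # (None, 0.55) -> abstain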
Relational modeling introduces additional fragility through heterogeneous edge semantics and partial observability. Interaction edges can reflect endorsement, disagreement, irony, or automated amplification, while edge missingness is structured, driven by private spaces, rate limits, deletions, and moderation. Under partial observation, graph-based inference of diffusion and dependence can become unstable, and centrality-style indicators can shift substantially as observation conditions change [253,254]. In addition, coordinated inauthentic activity can create topology patterns that resemble organic propagation, which increases the risk that relational indicators function as attention proxies unless observability and coordination controls are explicit [277].
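A simple way to expose this fragility is a sensitivity check that perturbs observability and measures how much centrality rankings move. The sketch below uses the networkx and scipy libraries, a synthetic directed graph, and random edge removal as a crude stand-in for structured missingness; the graph, removal fractions, and the use of PageRank are illustrative choices rather than a recommended protocol.

import random
import networkx as nx
from scipy.stats import spearmanr

random.seed(0)

# Illustrative interaction graph; in practice edges would also carry types (reply, quote, repost).
G = nx.gnp_random_graph(200, 0.05, seed=1, directed=True)
full_rank = nx.pagerank(G)

# Crude observability stress test: drop a fraction of edges and compare centrality rankings.
for frac in (0.1, 0.3, 0.5):
    H = G.copy()
    removed = random.sample(list(H.edges()), int(frac * H.number_of_edges()))
    H.remove_edges_from(removed)
    partial_rank = nx.pagerank(H)
    nodes = sorted(G.nodes())
    rho, _ = spearmanr([full_rank[n] for n in nodes], [partial_rank[n] for n in nodes])
    print(f"edges removed: {frac:.0%}  Spearman rank correlation: {rho:.3f}")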
Multimodal and cross-platform systems add degrees of freedom that can amplify these problems. Modalities differ in update rate and reliability; missingness is often correlated with platform policy and content type [200]. When one channel dominates learning (e.g., market covariates or engagement signals), apparent gains can reflect channel dominance rather than genuinely improved fusion. This is why modality contribution should be isolated with ablations and missing-modality stress tests rather than inferred from average performance alone [51,206].
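In practice, a missing-modality stress test can be run by re-scoring the fused model with each channel masked in turn and comparing against the full-input result. The sketch below illustrates the evaluation loop only; the feature blocks, labels, and the stand-in "fused" scorer are placeholders rather than a trained model.

import numpy as np

rng = np.random.default_rng(0)

# Placeholder inputs: one feature block per modality (text, interaction, market).
n = 500
X = {
    "text": rng.normal(size=(n, 8)),
    "interaction": rng.normal(size=(n, 4)),
    "market": rng.normal(size=(n, 3)),
}
y = rng.integers(0, 2, size=n)

def fused_predict(blocks):
    """Stand-in for a trained fusion model: a fixed linear score over concatenated blocks."""
    concat = np.hstack([blocks[k] for k in ("text", "interaction", "market")])
    w = np.linspace(-1, 1, concat.shape[1])          # placeholder weights
    return (concat @ w > 0).astype(int)

def accuracy(pred, y):
    return float((pred == y).mean())

print("all modalities:", accuracy(fused_predict(X), y))

# Missing-modality stress test: zero out one channel at a time and re-evaluate.
for name in X:
    masked = {k: (np.zeros_like(v) if k == name else v) for k, v in X.items()}
    print(f"without {name}:", accuracy(fused_predict(masked), y))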
Finally, reliability often degrades exactly when stakes are highest. During high-attention episodes, evidence volume rises, manipulation pressure increases, and collection systems or access regimes become less stable, producing both missingness and delay [238,247]. Under these shifts, models can become miscalibrated or overconfident, which in turn destabilizes thresholds and can produce alert flooding or silent under-detection even when average metrics appear acceptable [251]. When studies report only aggregate performance without conditional results for degraded observability or manipulation-prone windows, operational risk is typically understated. Table 6 summarizes how recurrent failure modes typically manifest in reported results and lists minimal mitigations that make claims more interpretable.
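Conditional calibration reporting can be as simple as computing a reliability measure separately for normal and stressed windows rather than pooling them. The following sketch computes a binned expected calibration error per regime; the probabilities, labels, and the regime flag are synthetic placeholders.

import numpy as np

def expected_calibration_error(prob, label, n_bins=10):
    """Binned gap between predicted probability and observed frequency."""
    bins = np.clip((prob * n_bins).astype(int), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            gap = abs(prob[mask].mean() - label[mask].mean())
            ece += mask.mean() * gap
    return ece

rng = np.random.default_rng(0)
prob = rng.uniform(size=2000)                               # placeholder model probabilities
label = (rng.uniform(size=2000) < prob * 0.8).astype(int)   # deliberately overconfident regime
stressed = rng.uniform(size=2000) < 0.2                     # placeholder flag for high-attention windows

for name, mask in [("normal", ~stressed), ("stressed", stressed)]:
    print(name, round(expected_calibration_error(prob[mask], label[mask]), 3))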

8. Research Directions

Progress in this area depends less on introducing new model families and more on establishing evaluation and assurance practices that make claims comparable and decision-relevant:
  • A first priority is standardizing time-aware study design for social evidence, including explicit decision cutoffs, session-aligned windows, and observability constraints that model collection delay and deletion effects. Shared benchmarks are useful only if they enforce these constraints; otherwise they encourage optimization to leakage-prone setups.
  • A second direction is stronger treatment of attribution under ambiguity. This includes time-aware entity dictionaries, conservative abstention when mention confidence is low, and evaluation that measures not only sentiment accuracy but also attribution error and its downstream impact on indices and graphs. Work in this area would shift the focus from post-level classification toward instrument-level signal validity, which is closer to financial use.
  • Third, robustness needs to be evaluated as a primary objective. This requires stress tests that mimic operational degradation: rate limiting, partial outages, modality dropout, and shifts in bot prevalence. Models should be compared on graceful degradation and calibration stability, not only on peak performance. In parallel, reliability weighting methods should be developed and tested under shifting manipulation intensity, with explicit reporting of when weighting fails or becomes biased.
  • Fourth, relational analysis would benefit from clearer semantics and observability models. Rather than treating all interactions as equivalent edges, future work should differentiate edge types and evaluate whether the learned structure reflects diffusion, debate, coordinated amplification, or platform-driven exposure. Dynamic graph designs should also be evaluated under time-local construction rules to avoid embedding post-event structure that inflates results.
  • Fifth, multimodal and cross-platform integration should be judged by incremental value and transferability. Research should report modality contribution under consistent training budgets, include missing-modality stress tests, and evaluate cross-platform alignment methods under realistic propagation delays and vocabulary drift. Where identity linkage is impossible, topic- and URL-based alignment should be treated as an approximation with measured error, not as a neutral merge.
  • Finally, decision readiness calls for practical traceability standards. Systems should be designed so that outputs can be reconstructed from versioned preprocessing, logged evidence identifiers, and explicit action mappings with thresholds and abstention rules. This would make failures diagnosable, support governance requirements, and reduce the gap between retrospective modeling and accountable operational deployment; a minimal sketch of such an output record follows this list.
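As a minimal illustration of the last point, the sketch below packages a signal into an auditable record that carries the evidence identifiers, pipeline version, thresholds, and abstention rule needed to reconstruct the output later. Every field name, threshold, and version string is hypothetical.

import hashlib
import json
from datetime import datetime, timezone

def emit_signal_record(instrument, score, evidence_ids, pipeline_version,
                       alert_threshold=0.7, abstain_below=0.4):
    """Package a decision-facing output so it can be audited and reconstructed later."""
    if score < abstain_below:
        action = "abstain"
    elif score >= alert_threshold:
        action = "alert"
    else:
        action = "monitor"
    record = {
        "instrument": instrument,
        "score": round(score, 4),
        "action": action,
        "thresholds": {"alert": alert_threshold, "abstain_below": abstain_below},
        "evidence_ids": sorted(evidence_ids),          # post/interaction identifiers used
        "pipeline_version": pipeline_version,          # versioned preprocessing and model
        "decision_time_utc": datetime.now(timezone.utc).isoformat(),
    }
    record["evidence_digest"] = hashlib.sha256(
        json.dumps(record["evidence_ids"]).encode()
    ).hexdigest()
    return record

print(json.dumps(emit_signal_record("XYZ", 0.82, ["p1", "p2", "p9"], "prep-2.3.1+model-0.9"), indent=2))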

9. Conclusions

This review examined how social network evidence is used to support financial tasks, focusing on the transformation from raw posts and interactions into decision-facing signals. The central finding is that the usefulness of these signals is conditional on disciplined handling of observability, timing, and attribution. When evidence windows respect session cutoffs, reaction content is separated from anticipatory signals, and entity linking is treated conservatively, social data can support monitoring and triage and can sometimes add incremental value beyond basic market baselines during high-attention periods.
At the same time, many reported gains remain difficult to interpret because they are sensitive to platform visibility mechanics, manipulation pressure, and structured missingness that intensifies during market stress. Relational and multimodal methods can improve robustness, but they also introduce new failure modes when edge semantics are ambiguous, modalities are misaligned, or one channel dominates learning. For systems intended to inform consequential actions, decision readiness therefore depends on assurance practices that go beyond predictive metrics, including stress testing under degraded observability, calibration under shift, and traceability of outputs back to time-appropriate evidence.
In general, the literature supports cautious, bounded claims: social signals are most defensible as complementary evidence streams whose value depends on explicit alignment and accountability constraints, rather than as standalone predictors. Future progress will be driven by methods and benchmarks that prioritize time-aware evaluation, attribution accuracy, robustness under adversarial and missing evidence, and auditable decision pipelines that make operational use interpretable and governable.

Author Contributions

Conceptualization, L.T.; methodology, L.T.; investigation, L.T. and A.T.; writing—original draft preparation, A.T.; writing—review and editing, L.T. and A.T.; visualization, A.T.; supervision, L.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

No new data were created for this study.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Theodorakopoulos, L.; Theodoropoulou, A.; Bakalis, A. Big Data in Financial Risk Management: Evidence, Advances, and Open Questions: A Systematic Review. Front. Artif. Intell. 2025, 8, 1658375. [Google Scholar] [CrossRef] [PubMed]
  2. Verma, R.; Verma, P. Economic News, Social Media Sentiments, and Stock Returns: Which Is a Bigger Driver? J. Risk Financ. Manag. 2025, 18, 16. [Google Scholar] [CrossRef]
  3. Day, M.-Y.; Lee, C.-C. Deep Learning for Financial Sentiment Analysis on Finance News Providers. In Proceedings of the 2016 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), San Francisco, CA, USA, 18–21 August 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 1127–1134. [Google Scholar]
  4. Xie, L.; Chen, Z.; Yu, S. Deep Convolutional Transformer Network for Stock Movement Prediction. Electronics 2024, 13, 4225. [Google Scholar] [CrossRef]
  5. Arian, H.; Norouzi Mobarekeh, D.; Seco, L. Backtest Overfitting in the Machine Learning Era: A Comparison of out-of-Sample Testing Methods in a Synthetic Controlled Environment. Knowl.-Based Syst. 2024, 305, 112477. [Google Scholar] [CrossRef]
  6. Arsenault, P.-D.; Wang, S.; Patenaude, J.-M. A Survey of Explainable Artificial Intelligence (XAI) in Financial Time Series Forecasting. ACM Comput. Surv. 2025, 57, 1–37. [Google Scholar] [CrossRef]
  7. Mahdi, O.A.; Pardede, E.; Bevinakoppa, S.; Ali, N. Federated Learning Under Concept Drift: A Systematic Survey of Foundations, Innovations, and Future Research Directions. Electronics 2025, 14, 4480. [Google Scholar] [CrossRef]
  8. Lazer, D.M.J.; Pentland, A.; Watts, D.J.; Aral, S.; Athey, S.; Contractor, N.; Freelon, D.; Gonzalez-Bailon, S.; King, G.; Margetts, H.; et al. Computational Social Science: Obstacles and Opportunities. Science 2020, 369, 1060–1062. [Google Scholar] [CrossRef]
  9. Huszár, F.; Ktena, S.I.; O’Brien, C.; Belli, L.; Schlaikjer, A.; Hardt, M. Algorithmic Amplification of Politics on Twitter. Proc. Natl. Acad. Sci. USA 2022, 119, e2025334119. [Google Scholar] [CrossRef]
  10. Gorwa, R.; Binns, R.; Katzenbach, C. Algorithmic Content Moderation: Technical and Political Challenges in the Automation of Platform Governance. Big Data Soc. 2020, 7, 205395171989794. [Google Scholar] [CrossRef]
  11. Evans, L.; Owda, M.; Crockett, K.; Vilas, A.F. A Methodology for the Resolution of Cashtag Collisions on Twitter—A Natural Language Processing & Data Fusion Approach. Expert Syst. Appl. 2019, 127, 353–369. [Google Scholar] [CrossRef]
  12. Daudert, T. A Multi-Source Entity-Level Sentiment Corpus for the Financial Domain: The FinLin Corpus. Lang. Resour. Eval. 2022, 56, 333–356. [Google Scholar] [CrossRef]
  13. Chen, C.Y.-H.; Hafner, C.M. Sentiment-Induced Bubbles in the Cryptocurrency Market. J. Risk Financ. Manag. 2019, 12, 53. [Google Scholar] [CrossRef]
  14. Pfeffer, J.; Mayer, K.; Morstatter, F. Tampering with Twitter’s Sample API. EPJ Data Sci. 2018, 7, 50. [Google Scholar] [CrossRef]
  15. Alizadeh, M.; Zare, D.; Samei, Z.; Alizadeh, M.; Kubli, M.; Aliahmadi, M.; Ebrahimi, S.; Gilardi, F. Comparing Methods for Creating a National Random Sample of Twitter Users. Soc. Netw. Anal. Min. 2024, 14, 160. [Google Scholar] [CrossRef]
  16. Gulnerman, A.G.; Karaman, H.; Pekaslan, D.; Bilgi, S. Citizens’ Spatial Footprint on Twitter—Anomaly, Trend and Bias Investigation in Istanbul. ISPRS Int. J. Geo-Inf. 2020, 9, 222. [Google Scholar] [CrossRef]
  17. Khan, M.T.; Dimitrov, D.; Dietze, S. Characterization of Tweet Deletion Patterns in the Context of COVID-19 Discourse and Polarization. In Proceedings of the 36th ACM Conference on Hypertext and Social Media, Chicago, IL, USA, 15–18 September 2025; ACM: New York, NY, USA, 2025; pp. 43–47. [Google Scholar]
  18. Olteanu, A.; Castillo, C.; Diaz, F.; Kıcıman, E. Social Data: Biases, Methodological Pitfalls, and Ethical Boundaries. Front. Big Data 2019, 2, 13. [Google Scholar] [CrossRef] [PubMed]
  19. Warkulat, S.; Pelster, M. Social Media Attention and Retail Investor Behavior: Evidence from r/Wallstreetbets. Int. Rev. Financ. Anal. 2024, 96, 103721. [Google Scholar] [CrossRef]
  20. Bastos, M. This Account Doesn’t Exist: Tweet Decay and the Politics of Deletion in the Brexit Debate. Am. Behav. Sci. 2021, 65, 757–773. [Google Scholar] [CrossRef]
  21. Bouadjenek, M.R.; Sanner, S.; Wu, G. A User-Centric Analysis of Social Media for Stock Market Prediction. ACM Trans. Web 2023, 17, 1–22. [Google Scholar] [CrossRef]
  22. Krystyniak, K.; Liu, H.; Hu, H. What’s Trending? Stock-Level Investor Sentiment and Returns. Int. J. Financ. Stud. 2025, 13, 158. [Google Scholar] [CrossRef]
  23. Cresci, S.; Lillo, F.; Regoli, D.; Tardelli, S.; Tesconi, M. Cashtag Piggybacking: Uncovering Spam and Bot Activity in Stock Microblogs on Twitter. ACM Trans. Web 2019, 13, 1–27. [Google Scholar] [CrossRef]
  24. Tardelli, S.; Avvenuti, M.; Tesconi, M.; Cresci, S. Detecting Inorganic Financial Campaigns on Twitter. Inf. Syst. 2022, 103, 101769. [Google Scholar] [CrossRef]
  25. Alothali, E.; Hayawi, K.; Alashwal, H. SEBD: A Stream Evolving Bot Detection Framework with Application of PAC Learning Approach to Maintain Accuracy and Confidence Levels. Appl. Sci. 2023, 13, 4443. [Google Scholar] [CrossRef]
  26. Abdelwahab, A.; Mostafa, M. A Deep Neural Network Technique for Detecting Real-Time Drifted Twitter Spam. Appl. Sci. 2022, 12, 6407. [Google Scholar] [CrossRef]
  27. Graham, T.; Hames, S.; Alpert, E. The Coordination Network Toolkit: A Framework for Detecting and Analysing Coordinated Behaviour on Social Media. J. Comput. Soc. Sci. 2024, 7, 1139–1160. [Google Scholar] [CrossRef]
  28. Weber, D.; Neumann, F. Amplifying Influence through Coordinated Behaviour in Social Networks. Soc. Netw. Anal. Min. 2021, 11, 111. [Google Scholar] [CrossRef] [PubMed]
  29. Liu, W.; Cui, X. Improving Named Entity Recognition for Social Media with Data Augmentation. Appl. Sci. 2023, 13, 5360. [Google Scholar] [CrossRef]
  30. Li, H.; Li, C.; Sun, Z.; Zhu, H. Entity Linking Model Based on Cascading Attention and Dynamic Graph. Electronics 2024, 13, 3845. [Google Scholar] [CrossRef]
  31. Liu, F.; Chen, L.; Zheng, Y.; Feng, Y. A Prediction Method with Data Leakage Suppression for Time Series. Electronics 2022, 11, 3701. [Google Scholar] [CrossRef]
  32. Wang, W. Investor Sentiment and Stock Market Returns: A Story of Night and Day. Eur. J. Financ. 2024, 30, 1437–1469. [Google Scholar] [CrossRef]
  33. Fang, Z.; Dudek, J.; Costas, R. Facing the Volatility of Tweets in Altmetric Research. J. Assoc. Inf. Sci. Technol. 2022, 73, 1192–1195. [Google Scholar] [CrossRef]
  34. Ulloa, R.; Mangold, F.; Schmidt, F.; Gilsbach, J.; Stier, S. Beyond Time Delays: How Web Scraping Distorts Measures of Online News Consumption. Commun. Methods Meas. 2025, 19, 179–200. [Google Scholar] [CrossRef]
  35. Davidson, B.I.; Wischerath, D.; Racek, D.; Parry, D.A.; Godwin, E.; Hinds, J.; Van Der Linden, D.; Roscoe, J.F.; Ayravainen, L.; Cork, A.G. Platform-Controlled Social Media APIs Threaten Open Science. Nat. Hum. Behav. 2023, 7, 2054–2057. [Google Scholar] [CrossRef]
  36. Albarrak, M.S. The Effect of Twitter Messages and Tone on Stock Return: The Case of Saudi Stock Market “Tadawul”. J. Risk Financ. Manag. 2024, 17, 405. [Google Scholar] [CrossRef]
  37. Hino, A.; Fahey, R.A. Representing the Twittersphere: Archiving a Representative Sample of Twitter Data under Resource Constraints. Int. J. Inf. Manag. 2019, 48, 175–184. [Google Scholar] [CrossRef]
  38. Elmas, T. The Impact of Data Persistence Bias on Social Media Studies. In Proceedings of the 15th ACM Web Science Conference 2023, Austin, TX, USA, 30 April–1 May 2023; ACM: New York, NY, USA, 2023; pp. 196–207. [Google Scholar]
  39. Küpfer, A. Nonrandom Tweet Mortality and Data Access Restrictions: Compromising the Replication of Sensitive Twitter Studies. Polit. Anal. 2024, 32, 493–506. [Google Scholar] [CrossRef]
  40. Wang, X.; Xiang, Z.; Xu, W.; Yuan, P. The Causal Relationship between Social Media Sentiment and Stock Return: Experimental Evidence from an Online Message Forum. Econ. Lett. 2022, 216, 110598. [Google Scholar] [CrossRef]
  41. Sun, G.; Li, Y. Intraday and Post-Market Investor Sentiment for Stock Price Prediction: A Deep Learning Framework with Explainability and Quantitative Trading Strategy. Systems 2025, 13, 390. [Google Scholar] [CrossRef]
  42. Vicente, P. Sampling Twitter Users for Social Science Research: Evidence from a Systematic Review of the Literature. Qual. Quant. 2023, 57, 5449–5489. [Google Scholar] [CrossRef] [PubMed]
  43. Dujeancourt, E.; Garz, M. The Effects of Algorithmic Content Selection on User Engagement with News on Twitter. Inf. Soc. 2023, 39, 263–281. [Google Scholar] [CrossRef]
  44. Apicella, A.; Isgrò, F.; Prevete, R. Don’t Push the Button! Exploring Data Leakage Risks in Machine Learning and Transfer Learning. Artif. Intell. Rev. 2025, 58, 339. [Google Scholar] [CrossRef]
  45. Shobayo, O.; Adeyemi-Longe, S.; Popoola, O.; Ogunleye, B. Innovative Sentiment Analysis and Prediction of Stock Price Using FinBERT, GPT-4 and Logistic Regression: A Data-Driven Approach. Big Data Cogn. Comput. 2024, 8, 143. [Google Scholar] [CrossRef]
  46. Wang, R.; Zhu, H.; Wang, L.; Chen, Z.; Gao, M.; Xin, Y. User Identity Linkage Across Social Networks by Heterogeneous Graph Attention Network Modeling. Appl. Sci. 2020, 10, 5478. [Google Scholar] [CrossRef]
  47. Cinelli, M.; De Francisci Morales, G.; Galeazzi, A.; Quattrociocchi, W.; Starnini, M. The Echo Chamber Effect on Social Media. Proc. Natl. Acad. Sci. USA 2021, 118, e2023301118. [Google Scholar] [CrossRef]
  48. Zarour, M.; Alzabut, H.; Al-Sarayreh, K.T. MLOps Best Practices, Challenges and Maturity Models: A Systematic Literature Review. Inf. Softw. Technol. 2025, 183, 107733. [Google Scholar] [CrossRef]
  49. Kreuzberger, D.; Kühl, N.; Hirschl, S. Machine Learning Operations (MLOps): Overview, Definition, and Architecture. IEEE Access 2023, 11, 31866–31879. [Google Scholar] [CrossRef]
  50. Kustitskaya, T.A.; Esin, R.V.; Noskov, M.V. Model Drift in Deployed Machine Learning Models for Predicting Learning Success. Computers 2025, 14, 351. [Google Scholar] [CrossRef]
  51. Reza, M.K.; Prater-Bennette, A.; Asif, M.S. Robust Multimodal Learning With Missing Modalities via Parameter-Efficient Adaptation. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 742–754. [Google Scholar] [CrossRef] [PubMed]
  52. Sambasivan, N.; Kapania, S.; Highfill, H.; Akrong, D.; Paritosh, P.; Aroyo, L.M. “Everyone Wants to Do the Model Work, Not the Data Work”: Data Cascades in High-Stakes AI. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, Yokohama, Japan, 8–13 May 2021; ACM: New York, NY, USA, 2021; pp. 1–15. [Google Scholar]
  53. Mitchell, M.; Wu, S.; Zaldivar, A.; Barnes, P.; Vasserman, L.; Hutchinson, B.; Spitzer, E.; Raji, I.D.; Gebru, T. Model Cards for Model Reporting. In Proceedings of the Conference on Fairness, Accountability, and Transparency, Atlanta, GA, USA, 29–31 January 2019; ACM: New York, NY, USA, 2019; pp. 220–229. [Google Scholar]
  54. Ardia, D.; Bluteau, K. Twitter and Cryptocurrency Pump-and-Dumps. Int. Rev. Financ. Anal. 2024, 95, 103479. [Google Scholar] [CrossRef]
  55. Huang, A.H.; Wang, H.; Yang, Y. FINBERT: A Large Language Model for Extracting Information from Financial Text. Contemp. Account. Res. 2023, 40, 806–841. [Google Scholar] [CrossRef]
  56. Nasiopoulos, D.K.; Roumeliotis, K.I.; Sakas, D.P.; Toudas, K.; Reklitis, P. Financial Sentiment Analysis and Classification: A Comparative Study of Fine-Tuned Deep Learning Models. Int. J. Financ. Stud. 2025, 13, 75. [Google Scholar] [CrossRef]
  57. Memiş, E.; Akarkamçı (Kaya), H.; Yeniad, M.; Rahebi, J.; Lopez-Guede, J.M. Comparative Study for Sentiment Analysis of Financial Tweets with Deep Learning Methods. Appl. Sci. 2024, 14, 588. [Google Scholar] [CrossRef]
  58. Hovakimyan, G.; Bravo, J.M. Evolving Strategies in Machine Learning: A Systematic Review of Concept Drift Detection. Information 2024, 15, 786. [Google Scholar] [CrossRef]
  59. Garcia, C.M.; Abilio, R.; Koerich, A.L.; Britto, A.D.S.; Barddal, J.P. Concept Drift Adaptation in Text Stream Mining Settings: A Systematic Review. ACM Trans. Intell. Syst. Technol. 2025, 16, 1–67. [Google Scholar] [CrossRef]
  60. Wilksch, M.; Abramova, O. PyFin-Sentiment: Towards a Machine-Learning-Based Model for Deriving Sentiment from Financial Tweets. Int. J. Inf. Manag. Data Insights 2023, 3, 100171. [Google Scholar] [CrossRef]
  61. Giantsidi, S.; Tarantola, C. Deep Learning for Financial Forecasting: A Review of Recent Trends. Int. Rev. Econ. Finance 2025, 104, 104719. [Google Scholar] [CrossRef]
  62. AlRashedy, A.S.; Mathkour, H.I. Label-Driven Optimization of Trading Models Across Indices and Stocks: Maximizing Percentage Profitability. Mathematics 2025, 13, 3889. [Google Scholar] [CrossRef]
  63. Geirhos, R.; Jacobsen, J.-H.; Michaelis, C.; Zemel, R.; Brendel, W.; Bethge, M.; Wichmann, F.A. Shortcut Learning in Deep Neural Networks. Nat. Mach. Intell. 2020, 2, 665–673. [Google Scholar] [CrossRef]
  64. Sinha, A.; Kedas, S.; Kumar, R.; Malo, P. SEntFiN 1.0: Entity-Aware Sentiment Analysis for Financial News. J. Assoc. Inf. Sci. Technol. 2022, 73, 1314–1335. [Google Scholar] [CrossRef]
  65. Pan, R.; García-Díaz, J.A.; Valencia-García, R. Individual- vs. Multiple-Objective Strategies for Targeted Sentiment Analysis in Finances Using the Spanish MTSA 2023 Corpus. Electronics 2024, 13, 717. [Google Scholar] [CrossRef]
  66. Hendrickx, K.; Perini, L.; Van Der Plas, D.; Meert, W.; Davis, J. Machine Learning with a Reject Option: A Survey. Mach. Learn. 2024, 113, 3073–3110. [Google Scholar] [CrossRef]
  67. Reschke, F.; Strych, J.-O. Emojis and Stock Returns. Rev. Behav. Financ. 2024, 16, 223–233. [Google Scholar] [CrossRef]
  68. Ballinari, D.; Behrendt, S. How to Gauge Investor Behavior? A Comparison of Online Investor Sentiment Measures. Digit. Financ. 2021, 3, 169–204. [Google Scholar] [CrossRef] [PubMed]
  69. Palomino, M.A.; Aider, F. Evaluating the Effectiveness of Text Pre-Processing in Sentiment Analysis. Appl. Sci. 2022, 12, 8765. [Google Scholar] [CrossRef]
  70. Chitty-Venkata, K.T.; Mittal, S.; Emani, M.; Vishwanath, V.; Somani, A.K. A Survey of Techniques for Optimizing Transformer Inference. J. Syst. Archit. 2023, 144, 102990. [Google Scholar] [CrossRef]
  71. Jaggi, M.; Mandal, P.; Narang, S.; Naseem, U.; Khushi, M. Text Mining of Stocktwits Data for Predicting Stock Prices. Appl. Syst. Innov. 2021, 4, 13. [Google Scholar] [CrossRef]
  72. Xiao, Q.; Ihnaini, B. Stock Trend Prediction Using Sentiment Analysis. PeerJ Comput. Sci. 2023, 9, e1293. [Google Scholar] [CrossRef]
  73. Yang, N.; Fernandez-Perez, A.; Indriawan, I. Spillover between Investor Sentiment and Volatility: The Role of Social Media. Int. Rev. Financ. Anal. 2024, 96, 103643. [Google Scholar] [CrossRef]
  74. Fuertes, A.-M.; Olmo, J. On Setting Day-Ahead Equity Trading Risk Limits: VaR Prediction at Market Close or Open? J. Risk Financ. Manag. 2016, 9, 10. [Google Scholar] [CrossRef]
  75. Zhang, L.; Hua, L. Market Predictability Before the Closing Bell Rings. Risks 2024, 12, 180. [Google Scholar] [CrossRef]
  76. Casolaro, A.; Capone, V.; Iannuzzo, G.; Camastra, F. Deep Learning for Time Series Forecasting: Advances and Open Problems. Information 2023, 14, 598. [Google Scholar] [CrossRef]
  77. Cerqueira, V.; Torgo, L.; Mozetič, I. Evaluating Time Series Forecasting Models: An Empirical Study on Performance Estimation Methods. Mach. Learn. 2020, 109, 1997–2028. [Google Scholar] [CrossRef]
  78. Zhang, W.; Liu, J.; Deng, W.; Tang, S.; Yang, F.; Han, Y.; Liu, M.; Wan, R. AMTCN: An Attention-Based Multivariate Temporal Convolutional Network for Electricity Consumption Prediction. Electronics 2024, 13, 4080. [Google Scholar] [CrossRef]
  79. Shchur, O.; Türkmen, A.C.; Januschowski, T.; Günnemann, S. Neural Temporal Point Processes: A Review. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence; International Joint Conferences on Artificial Intelligence Organization, Montreal, QC, Canada, 19–27 August 2021; pp. 4585–4593. [Google Scholar]
  80. Foumani, N.M.; Tan, C.W.; Webb, G.I.; Salehi, M. Improving Position Encoding of Transformers for Multivariate Time Series Classification. Data Min. Knowl. Discov. 2024, 38, 22–48. [Google Scholar] [CrossRef]
  81. Milidonis, A.; Chisholm, K. The Regime-Switching Structural Default Risk Model. Risks 2024, 12, 48. [Google Scholar] [CrossRef]
  82. Ugras, Y.J.; Ritter, M.A. Market Reaction to Earnings Announcements Under Different Volatility Regimes. J. Risk Financ. Manag. 2025, 18, 19. [Google Scholar] [CrossRef]
  83. Alexiou, L.; Goyal, A.; Kostakis, A.; Rompolis, L. Pricing Event Risk: Evidence from Concave Implied Volatility Curves. Rev. Financ. 2025, 29, 963–1007. [Google Scholar] [CrossRef]
  84. Bergmeir, C.; Hyndman, R.J.; Koo, B. A Note on the Validity of Cross-Validation for Evaluating Autoregressive Time Series Prediction. Comput. Stat. Data Anal. 2018, 120, 70–83. [Google Scholar] [CrossRef]
  85. Kapoor, S.; Narayanan, A. Leakage and the Reproducibility Crisis in Machine-Learning-Based Science. Patterns 2023, 4, 100804. [Google Scholar] [CrossRef]
  86. Yue, Z.; Yu, G. Effects of Policy Communication Changes on Social Media: Before and After Policy Adjustment. Systems 2025, 13, 248. [Google Scholar] [CrossRef]
  87. Kim, H. Social Media Engagement and Retail Investors’ Short-Termism. Financ. Res. Lett. 2025, 85, 108249. [Google Scholar] [CrossRef]
  88. Suárez-Cetrulo, A.L.; Quintana, D.; Cervantes, A. A Survey on Machine Learning for Recurring Concept Drifting Data Streams. Expert Syst. Appl. 2023, 213, 118934. [Google Scholar] [CrossRef]
  89. Guerrero Cano, J.V.; Aguiar, G.J.; Cano, A. Anticipating to Change: A Proactive Approach for Concept Drift Adaptation in Data Streams. Mach. Learn. 2026, 115, 3. [Google Scholar] [CrossRef]
  90. Huang, X.; Li, J.; Yuan, Y. Link Prediction in Dynamic Social Networks Combining Entropy, Causality, and a Graph Convolutional Network Model. Entropy 2024, 26, 477. [Google Scholar] [CrossRef] [PubMed]
  91. Liu, Y.; Wei, Z.; Chen, L.; Xu, C.; Guan, Z. Multi-Modal Temporal Dynamic Graph Construction for Stock Rank Prediction. Mathematics 2025, 13, 845. [Google Scholar] [CrossRef]
  92. Liu, W.; Wang, S.; Ding, J. Influence Maximization Based on Adaptive Graph Convolution Neural Network in Social Networks. Electronics 2024, 13, 3110. [Google Scholar] [CrossRef]
  93. Morshed, A. Graph Neural Networks and Explainable Spillovers: Global Monetary and Oil Shocks in GCC Financial Markets. Economies 2025, 13, 308. [Google Scholar] [CrossRef]
  94. Meštrović, A.; Petrović, M.; Beliga, S. Retweet Prediction Based on Heterogeneous Data Sources: The Combination of Text and Multilayer Network Features. Appl. Sci. 2022, 12, 11216. [Google Scholar] [CrossRef]
  95. Sun, M.; Tang, M. A Review of Link Prediction Algorithms in Dynamic Networks. Mathematics 2025, 13, 807. [Google Scholar] [CrossRef]
  96. Liu, Z.; Li, Z.; Li, W.; Duan, L. Deep Graph Tensor Learning for Temporal Link Prediction. Inf. Sci. 2024, 660, 120085. [Google Scholar] [CrossRef]
  97. Smith, J.A.; Moody, J.; Morgan, J.H. Network Sampling Coverage II: The Effect of Non-Random Missing Data on Network Measurement. Soc. Netw. 2017, 48, 78–99. [Google Scholar] [CrossRef]
  98. De La Haye, K.; Embree, J.; Punkay, M.; Espelage, D.L.; Tucker, J.S.; Green, H.D. Analytic Strategies for Longitudinal Networks with Missing Data. Soc. Netw. 2017, 50, 17–25. [Google Scholar] [CrossRef]
  99. Cinelli, M.; Cresci, S.; Quattrociocchi, W.; Tesconi, M.; Zola, P. Coordinated Inauthentic Behavior and Information Spreading on Twitter. Decis. Support Syst. 2022, 160, 113819. [Google Scholar] [CrossRef]
  100. Yang, Y.; Paudel, R.; McShan, J.; Hindman, M.; Huang, H.H.; Broniatowski, D. Coordinated Link Sharing on Facebook. Sci. Rep. 2025, 15, 15684. [Google Scholar] [CrossRef]
  101. Jiao, P.; Guo, X.; Jing, X.; He, D.; Wu, H.; Pan, S.; Gong, M.; Wang, W. Temporal Network Embedding for Link Prediction via VAE Joint Attention Mechanism. IEEE Trans. Neural Netw. Learn. Syst. 2022, 33, 7400–7413. [Google Scholar] [CrossRef] [PubMed]
  102. Wei, Q.; Hu, G. Evaluating Graph Neural Networks under Graph Sampling Scenarios. PeerJ Comput. Sci. 2022, 8, e901. [Google Scholar] [CrossRef] [PubMed]
  103. Wasserbacher, H.; Spindler, M. Machine Learning for Financial Forecasting, Planning and Analysis: Recent Developments and Pitfalls. Digit. Financ. 2022, 4, 63–88. [Google Scholar] [CrossRef]
  104. Baltrusaitis, T.; Ahuja, C.; Morency, L.-P. Multimodal Machine Learning: A Survey and Taxonomy. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 41, 423–443. [Google Scholar] [CrossRef]
  105. Kim, J.; Hong, J.; Choi, Y. Causal Inference for Modality Debiasing in Multimodal Emotion Recognition. Appl. Sci. 2024, 14, 11397. [Google Scholar] [CrossRef]
  106. Ramachandram, D.; Taylor, G.W. Deep Multimodal Learning: A Survey on Recent Advances and Trends. IEEE Signal Process. Mag. 2017, 34, 96–108. [Google Scholar] [CrossRef]
  107. Mou, S.; Xue, Q.; Chen, J.; Takiguchi, T.; Ariki, Y. MM-iTransformer: A Multimodal Approach to Economic Time Series Forecasting with Textual Data. Appl. Sci. 2025, 15, 1241. [Google Scholar] [CrossRef]
  108. Bustarviejo, J.; Bousoño-Calzón, C. Multimodal Information Fusion for Financial Forecasting via Cross-Attention and Calibrated Uncertainty. Mach. Learn. Appl. 2026, 23, 100840. [Google Scholar] [CrossRef]
  109. Pereira, L.M.; Salazar, A.; Vergara, L. A Comparative Study on Recent Automatic Data Fusion Methods. Computers 2023, 13, 13. [Google Scholar] [CrossRef]
  110. Pan, L.; Han, X.; Liu, X.; Liu, Y. A Practical Multimodal Fusion System with Uncertainty Modeling for Robust Visual and Affective Applications. IEEE Access 2025, 13, 145289–145302. [Google Scholar] [CrossRef]
  111. Nirala, V.; Ratneshwer. A Robust Weighted Late Fusion Approach for IoT. Internet Things 2026, 36, 101857. [Google Scholar] [CrossRef]
  112. Alomari, M.; Al Rababa’a, A.R.; El-Nader, G.; Alkhataybeh, A.; Ur Rehman, M. Examining the Effects of News and Media Sentiments on Volatility and Correlation: Evidence from the UK. Q. Rev. Econ. Financ. 2021, 82, 280–297. [Google Scholar] [CrossRef]
  113. Al Guindy, M. Cryptocurrency Price Volatility and Investor Attention. Int. Rev. Econ. Financ. 2021, 76, 556–570. [Google Scholar] [CrossRef]
  114. Huang, C.; Chen, J.; Huang, Q.; Wang, S.; Tu, Y.; Huang, X. AtCAF: Attention-Based Causality-Aware Fusion Network for Multimodal Sentiment Analysis. Inf. Fusion 2025, 114, 102725. [Google Scholar] [CrossRef]
  115. Wu, Y.; Chen, J.; Hu, L.; Xu, H.; Liang, H.; Wu, J. OmniFuse: A General Modality Fusion Framework for Multi-Modality Learning on Low-Quality Medical Data. Inf. Fusion 2025, 117, 102890. [Google Scholar] [CrossRef]
  116. Zhao, B.; Zhang, W.; Zou, Z. MCE: Towards a General Framework for Handling Missing Modalities under Imbalanced Missing Rates. Pattern Recognit. 2026, 172, 112591. [Google Scholar] [CrossRef]
  117. Seijo-Pardo, B.; Alonso-Betanzos, A.; Bennett, K.P.; Bolón-Canedo, V.; Josse, J.; Saeed, M.; Guyon, I. Biases in Feature Selection with Missing Data. Neurocomputing 2019, 342, 97–112. [Google Scholar] [CrossRef]
  118. Emmanuel, T.; Maupong, T.; Mpoeleng, D.; Semong, T.; Mphago, B.; Tabona, O. A Survey on Missing Data in Machine Learning. J. Big Data 2021, 8, 140. [Google Scholar] [CrossRef]
  119. Almakhamreh, A.H.A.; Bozkir, A.S. CrossPhire: Benefiting Multimodality for Robust Phishing Web Page Identification. Appl. Sci. 2026, 16, 751. [Google Scholar] [CrossRef]
  120. Orabi, M.; Mouheb, D.; Al Aghbari, Z.; Kamel, I. Detection of Bots in Social Media: A Systematic Review. Inf. Process. Manag. 2020, 57, 102250. [Google Scholar] [CrossRef]
  121. Yang, Q.; Zhao, Y.; Cheng, H. Uncertainty-Aware Evidential Fusion for Multi-Modal Object Detection in Autonomous Driving. Drones 2026, 10, 130. [Google Scholar] [CrossRef]
  122. Xu, K.; Wang, S.; Diao, Z. DATTAMM: Domain-Aware Test-Time Adaptation for Multimodal Misinformation Detection. Appl. Sci. 2025, 15, 11832. [Google Scholar] [CrossRef]
  123. Zhang, X.; Duh, K. Reproducible and Efficient Benchmarks for Hyperparameter Optimization of Neural Machine Translation Systems. Trans. Assoc. Comput. Linguist. 2020, 8, 393–408. [Google Scholar] [CrossRef]
  124. Pawłowski, M.; Wróblewska, A.; Sysko-Romańczuk, S. Effective Techniques for Multimodal Data Fusion: A Comparative Analysis. Sensors 2023, 23, 2381. [Google Scholar] [CrossRef]
  125. Dong, Y.; Hao, Y. A Stock Prediction Method Based on Multidimensional and Multilevel Feature Dynamic Fusion. Electronics 2024, 13, 4111. [Google Scholar] [CrossRef]
  126. Liu, R.; Liu, H.; Huang, H.; Song, B.; Wu, Q. Multimodal Multiscale Dynamic Graph Convolution Networks for Stock Price Prediction. Pattern Recognit. 2024, 149, 110211. [Google Scholar] [CrossRef]
  127. Sheng, Y.; Qu, Y.; Ma, D. Stock Price Crash Prediction Based on Multimodal Data Machine Learning Models. Financ. Res. Lett. 2024, 62, 105195. [Google Scholar] [CrossRef]
  128. Yu, S.; Wang, J.; Hussein, W.; Hung, P.C.K. Robust Multimodal Federated Learning for Incomplete Modalities. Comput. Commun. 2024, 214, 234–243. [Google Scholar] [CrossRef]
  129. Ngo, D.; Park, H.-C.; Kang, B. Edge Intelligence: A Review of Deep Neural Network Inference in Resource-Limited Environments. Electronics 2025, 14, 2495. [Google Scholar] [CrossRef]
  130. Wang, Y.; Zhao, J. A Unified and Resource-Aware Framework for Adaptive Inference Acceleration on Edge and Embedded Platforms. Electronics 2025, 14, 2188. [Google Scholar] [CrossRef]
  131. De Barrena, T.F.; Fernandes, A.; Ferrando, J.L.; García, A.; Landaluce, H.; Angulo, I. Adaptive High Frequency Data Streaming for Soft Real-Time Industrial AI: A Scalable Microservices Based Architecture with Dynamic Downsampling. Array 2025, 27, 100488. [Google Scholar] [CrossRef]
  132. Schlegel, M.; Sattler, K.-U. Capturing End-to-End Provenance for Machine Learning Pipelines. Inf. Syst. 2025, 132, 102495. [Google Scholar] [CrossRef]
  133. Vonderhaar, L.; Couder, J.; Procko, T.T.; Lueddeke, E.; Cisneros, D.; Ochoa, O. Verifying Machine Learning Interpretability and Explainability Requirements Through Provenance. Software 2026, 5, 9. [Google Scholar] [CrossRef]
  134. Mohammed, S.; Budach, L.; Feuerpfeil, M.; Ihde, N.; Nathansen, A.; Noack, N.; Patzlaff, H.; Naumann, F.; Harmouch, H. The Effects of Data Quality on Machine Learning Performance on Tabular Data. Inf. Syst. 2025, 132, 102549. [Google Scholar] [CrossRef]
  135. Zareie, A.; Bakir, M.E.; Greenwood, M.A.; Bontcheva, K.; Scarton, C. Identifying Coordination in Online Social Networks through Anomalous Sharing Behaviour. Online Soc. Netw. Media 2025, 50, 100341. [Google Scholar] [CrossRef]
  136. Xiao, Y.; Shao, H.; Liu, B. Evaluating Calibration of Deep Fault Diagnostic Models under Distribution Shift. Comput. Ind. 2025, 171, 104334. [Google Scholar] [CrossRef]
  137. Johnston, S.S.; Fortin, S.; Kalsekar, I.; Reps, J.; Coplan, P. Improving Visual Communication of Discriminative Accuracy for Predictive Models: The Probability Threshold Plot. JAMIA Open 2021, 4, ooab017. [Google Scholar] [CrossRef]
  138. Shashikumar, S.P.; Amrollahi, F.; Nemati, S. Unsupervised Detection and Correction of Model Calibration Shift at Test-Time. In Proceedings of the 2023 45th Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), Sydney, Australia, 24–27 July 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1–4. [Google Scholar]
  139. Abdar, M.; Pourpanah, F.; Hussain, S.; Rezazadegan, D.; Liu, L.; Ghavamzadeh, M.; Fieguth, P.; Cao, X.; Khosravi, A.; Acharya, U.R.; et al. A Review of Uncertainty Quantification in Deep Learning: Techniques, Applications and Challenges. Inf. Fusion 2021, 76, 243–297. [Google Scholar] [CrossRef]
  140. Meaney, C.; Wang, X.; Guan, J.; Stukel, T.A. Comparison of Methods for Tuning Machine Learning Model Hyper-Parameters: With Application to Predicting High-Need High-Cost Health Care Users. BMC Med. Res. Methodol. 2025, 25, 134. [Google Scholar] [CrossRef] [PubMed]
  141. Kim, J.; Lee, D. Comparative Study on Hyperparameter Tuning for Predicting Concrete Compressive Strength. Buildings 2025, 15, 2173. [Google Scholar] [CrossRef]
  142. Xin, D.; Miao, H.; Parameswaran, A.; Polyzotis, N. Production Machine Learning Pipelines: Empirical Analysis and Optimization Opportunities. In Proceedings of the 2021 International Conference on Management of Data, Virtual Event, China, 20–25 June 2021; ACM: New York, NY, USA, 2021; pp. 2639–2652. [Google Scholar]
  143. Martins, P.; Cardoso, F.; Váz, P.; Silva, J.; Abbasi, M. Performance and Scalability of Data Cleaning and Preprocessing Tools: A Benchmark on Large Real-World Datasets. Data 2025, 10, 68. [Google Scholar] [CrossRef]
  144. Sezer, O.B.; Gudelek, M.U.; Ozbayoglu, A.M. Financial Time Series Forecasting with Deep Learning: A Systematic Literature Review: 2005–2019. Appl. Soft Comput. 2020, 90, 106181. [Google Scholar] [CrossRef]
  145. Wu, Z.; Pan, S.; Chen, F.; Long, G.; Zhang, C.; Yu, P.S. A Comprehensive Survey on Graph Neural Networks. IEEE Trans. Neural Netw. Learn. Syst. 2021, 32, 4–24. [Google Scholar] [CrossRef]
  146. Gao, J.; Li, P.; Chen, Z.; Zhang, J. A Survey on Deep Learning for Multimodal Data Fusion. Neural Comput. 2020, 32, 829–864. [Google Scholar] [CrossRef]
  147. Cheng, Y.; Wang, D.; Zhou, P.; Zhang, T. Model Compression and Acceleration for Deep Neural Networks: The Principles, Progress, and Challenges. IEEE Signal Process. Mag. 2018, 35, 126–136. [Google Scholar] [CrossRef]
  148. Liu, Q.; Son, H. Methods for Aggregating Investor Sentiment from Social Media. Humanit. Soc. Sci. Commun. 2024, 11, 925. [Google Scholar] [CrossRef]
  149. Muhammad, I.; Rospocher, M. On Assessing the Performance of LLMs for Target-Level Sentiment Analysis in Financial News Headlines. Algorithms 2025, 18, 46. [Google Scholar] [CrossRef]
  150. Kasula, V.K.; Tumma, C.; Konda, B. A Comprehensive Review of Artificial Intelligence Models for Lifetime Value Optimization. In Proceedings of the 2025 2nd International Conference on Research Methodologies in Knowledge Management, Artificial Intelligence and Telecommunication Engineering (RMKMATE), Chennai, India, 7–8 May 2025; IEEE: Piscataway, NJ, USA, 2025; pp. 1–5. [Google Scholar] [CrossRef]
  151. Abdollahi, H.; Fjesme, S.L.; Sirnes, E. Measuring Market Volatility Connectedness to Media Sentiment. N. Am. J. Econ. Financ. 2024, 71, 102091. [Google Scholar] [CrossRef]
  152. ALDayel, A.; Magdy, W. Stance Detection on Social Media: State of the Art and Trends. Inf. Process. Manag. 2021, 58, 102597. [Google Scholar] [CrossRef]
  153. Tang, Y.; Yang, Y.; Huang, A.; Tam, A.; Tang, J. FinEntity: Entity-Level Sentiment Classification for Financial Texts. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–10 December 2023; Association for Computational Linguistics: Stroudsburg, PA, USA, 2023; pp. 15465–15471. [Google Scholar]
  154. Corsi, G. Evaluating Twitter’s Algorithmic Amplification of Low-Credibility Content: An Observational Study. EPJ Data Sci. 2024, 13, 18. [Google Scholar] [CrossRef]
  155. Aguilera, A.; Quinteros, P.; Dongo, I.; Cardinale, Y. CrediBot: Applying Bot Detection for Credibility Analysis on Twitter. IEEE Access 2023, 11, 108365–108385. [Google Scholar] [CrossRef]
  156. Akdogan, Y.E.; Anbar, A. More than Just Sentiment: Using Social, Cognitive, and Behavioral Information of Social Media to Predict Stock Markets with Artificial Intelligence and Big Data. Borsa Istanb. Rev. 2024, 24, 61–82. [Google Scholar] [CrossRef]
  157. Broadstock, D.C.; Zhang, D. Social-Media and Intraday Stock Returns: The Pricing Power of Sentiment. Financ. Res. Lett. 2019, 30, 116–123. [Google Scholar] [CrossRef]
  158. Chu, X.; Wan, X.; Qiu, J. The Relative Importance of Overnight Sentiment versus Trading-Hour Sentiment in Volatility Forecasting. J. Behav. Exp. Financ. 2023, 39, 100826. [Google Scholar] [CrossRef]
  159. Audrino, F.; Sigrist, F.; Ballinari, D. The Impact of Sentiment and Attention Measures on Stock Market Volatility. Int. J. Forecast. 2020, 36, 334–357. [Google Scholar] [CrossRef]
  160. Oliveira, N.; Cortez, P.; Areal, N. The Impact of Microblogging Data for Stock Market Prediction: Using Twitter to Predict Returns, Volatility, Trading Volume and Survey Sentiment Indices. Expert Syst. Appl. 2017, 73, 125–144. [Google Scholar] [CrossRef]
  161. Kastrati, M.; Kastrati, Z.; Shariq Imran, A.; Biba, M. Leveraging Distant Supervision and Deep Learning for Twitter Sentiment and Emotion Classification. J. Intell. Inf. Syst. 2024, 62, 1045–1070. [Google Scholar] [CrossRef]
  162. Pan, Q.; Meng, Z. Hybrid Uncertainty Calibration for Multimodal Sentiment Analysis. Electronics 2024, 13, 662. [Google Scholar] [CrossRef]
  163. Wagner, M.; Wei, X. Ambiguous Investor Sentiment. Financ. Res. Lett. 2024, 67, 105773. [Google Scholar] [CrossRef]
  164. Firdaniza, F.; Ruchjana, B.; Chaerani, D.; Radianti, J. Information Diffusion Model in Twitter: A Systematic Literature Review. Information 2021, 13, 13. [Google Scholar] [CrossRef]
  165. Bellutta, D.; Carley, K.M. Investigating Coordinated Account Creation Using Burst Detection and Network Analysis. J. Big Data 2023, 10, 20. [Google Scholar] [CrossRef]
  166. Hirshleifer, D.; Peng, L.; Wang, Q. News Diffusion in Social Networks and Stock Market Reactions. Rev. Financ. Stud. 2025, 38, 883–937. [Google Scholar] [CrossRef]
  167. Bandy, J.; Diakopoulos, N. Curating Quality? How Twitter’s Timeline Algorithm Treats Different Types of News. Soc. Media Soc. 2021, 7, 20563051211041648. [Google Scholar] [CrossRef]
  168. Gausen, A.; Luk, W.; Guo, C. Using Agent-Based Modelling to Evaluate the Impact of Algorithmic Curation on Social Media. J. Data Inf. Qual. 2023, 15, 1–24. [Google Scholar] [CrossRef]
  169. Chen, Z.-H.; Wu, W.-L.; Li, S.-P.; Bao, K.; Koedijk, K.G. Social Media Information Diffusion and Excess Stock Returns Co-Movement. Int. Rev. Financ. Anal. 2024, 91, 103036. [Google Scholar] [CrossRef]
  170. Tardelli, S.; Nizzoli, L.; Tesconi, M.; Conti, M.; Nakov, P.; Da San Martino, G.; Cresci, S. Temporal Dynamics of Coordinated Online Behavior: Stability, Archetypes, and Influence. Proc. Natl. Acad. Sci. USA 2024, 121, e2307038121. [Google Scholar] [CrossRef]
  171. Zouzou, Y.; Varol, O. Unsupervised Detection of Coordinated Fake-Follower Campaigns on Social Media. EPJ Data Sci. 2024, 13, 62. [Google Scholar] [CrossRef]
  172. Tian, Y.; Xie, Y. Artificial Cheerleading in IEO: Marketing Campaign or Pump and Dump Scheme. Inf. Process. Manag. 2024, 61, 103537. [Google Scholar] [CrossRef]
  173. Ogburn, E.L.; Sofrygin, O.; Díaz, I.; Van Der Laan, M.J. Causal Inference for Social Network Data. J. Am. Stat. Assoc. 2024, 119, 597–611. [Google Scholar] [CrossRef] [PubMed]
  174. Agarwal, S.; Mehta, S. Effective Influence Estimation in Twitter Using Temporal, Profile, Structural and Interaction Characteristics. Inf. Process. Manag. 2020, 57, 102321. [Google Scholar] [CrossRef]
  175. Loh, W.W.; Ren, D. Estimating Social Influence in a Social Network Using Potential Outcomes. Psychol. Methods 2022, 27, 841–855. [Google Scholar] [CrossRef]
  176. Yang, J.; Li, Y.; Gao, C.; Dong, W. Entity Disambiguation with Context Awareness in User-Generated Short Texts. Expert Syst. Appl. 2020, 160, 113652. [Google Scholar] [CrossRef]
  177. Park, J.H.; Moon, J.Y.; Hong, S.-J. Understanding the Bi-Directional Message Diffusion Mechanism in the Context of IT Trends and Current Social Issues. Inf. Manag. 2021, 58, 103527. [Google Scholar] [CrossRef]
  178. Morstatter, F.; Liu, H. Discovering, Assessing, and Mitigating Data Bias in Social Media. Online Soc. Netw. Media 2017, 1, 1–13. [Google Scholar] [CrossRef]
  179. Torres-Lugo, C.; Pote, M.; Nwala, A.C.; Menczer, F. Manipulating Twitter through Deletions. Proc. Int. AAAI Conf. Web Soc. Media 2022, 16, 1029–1039. [Google Scholar] [CrossRef]
  180. López-Vizcaíno, M.; Nóvoa, F.J.; Fernández, D.; Cacheda, F. Time Aware F-Score for Cybersecurity Early Detection Evaluation. Appl. Sci. 2024, 14, 574. [Google Scholar] [CrossRef]
  181. Diallo, A.R.; Homri, L.; Boeuf, T.; Dantan, J.-Y.; Bonnet, F. Quantifying and Mitigating Alarm Fatigue Caused by Fault Detection Systems. Reliab. Eng. Syst. Saf. 2026, 267, 111890. [Google Scholar] [CrossRef]
  182. Horta Ribeiro, M.; Hosseinmardi, H.; West, R.; Watts, D.J. Deplatforming Did Not Decrease Parler Users’ Activity on Fringe Social Media. PNAS Nexus 2023, 2, pgad035. [Google Scholar] [CrossRef] [PubMed]
  183. Ben El Hadj Said, I.; Slim, S. The Dynamic Relationship between Investor Attention and Stock Market Volatility: International Evidence. J. Risk Financ. Manag. 2022, 15, 66. [Google Scholar] [CrossRef]
  184. Lopez-Vizcaino, M.F.; Novoa, F.J.; Fernandez, D.; Cacheda, F. Measuring Early Detection of Anomalies. IEEE Access 2022, 10, 127695–127707. [Google Scholar] [CrossRef]
  185. Toraman, C.; Şahinuç, F.; Yilmaz, E.H.; Akkaya, I.B. Understanding Social Engagements: A Comparative Analysis of User and Text Features in Twitter. Soc. Netw. Anal. Min. 2022, 12, 47. [Google Scholar] [CrossRef]
  186. Kim-Hahm, H.; Abou-Zaid, A.S.; Mohd, A. News vs. Social Media: Sentiment Impact on Stock Performance of Big Tech Companies. J. Risk Financ. Manag. 2025, 18, 660. [Google Scholar] [CrossRef]
  187. Cookson, J.A.; Lu, R.; Mullins, W.; Niessner, M. The Social Signal. J. Financ. Econ. 2024, 158, 103870. [Google Scholar] [CrossRef]
  188. Ho, T.-T.; Huang, Y. Stock Price Movement Prediction Using Sentiment Analysis and CandleStick Chart Representation. Sensors 2021, 21, 7957. [Google Scholar] [CrossRef]
  189. Ruan, L.; Jiang, H. Stock Price Prediction Using FinBERT-Enhanced Sentiment with SHAP Explainability and Differential Privacy. Mathematics 2025, 13, 2747. [Google Scholar] [CrossRef]
  190. Nguyen, N.-H.; Nguyen, T.-T.; Ngo, Q.T. DASF-Net: A Multimodal Framework for Stock Price Forecasting with Diffusion-Based Graph Learning and Optimized Sentiment Fusion. J. Risk Financ. Manag. 2025, 18, 417. [Google Scholar] [CrossRef]
  191. Duszejko, P.; Walczyna, T.; Piotrowski, Z. Detection of Manipulations in Digital Images: A Review of Passive and Active Methods Utilizing Deep Learning. Appl. Sci. 2025, 15, 881. [Google Scholar] [CrossRef]
  192. Asmawati, E.; Saikhu, A.; Siahaan, D.O. Sentiment Analysis of Meme Images Using Deep Neural Network Based on Keypoint Representation. Informatics 2025, 12, 118. [Google Scholar] [CrossRef]
  193. Hill, B.G.; Koback, F.L.; Schilling, P.L. The Risk of Shortcutting in Deep Learning Algorithms for Medical Imaging Research. Sci. Rep. 2024, 14, 29224. [Google Scholar] [CrossRef]
  194. Jones, S.M.; Van De Sompel, H.; Shankar, H.; Klein, M.; Tobin, R.; Grover, C. Scholarly Context Adrift: Three out of Four URI References Lead to Changed Content. PLoS ONE 2016, 11, e0167475. [Google Scholar] [CrossRef]
  195. Farhan, M.; Butt, U.; Sulaiman, R.B.; Alraja, M. Self-Sovereign Identities and Content Provenance: VeriTrust—A Blockchain-Based Framework for Fake News Detection. Future Internet 2025, 17, 448. [Google Scholar] [CrossRef]
  196. Bandy, J.; Diakopoulos, N. More Accounts, Fewer Links: How Algorithmic Curation Impacts Media Exposure in Twitter Timelines. Proc. ACM Hum.-Comput. Interact. 2021, 5, 1–28. [Google Scholar] [CrossRef]
  197. Murdock, I.; Carley, K.M.; Yağan, O. An Agent-Based Model of Cross-Platform Information Diffusion and Moderation. Soc. Netw. Anal. Min. 2024, 14, 145. [Google Scholar] [CrossRef]
  198. Gao, H.; Wang, Y.; Shao, J.; Shen, H.; Cheng, X. User Identity Linkage across Social Networks with the Enhancement of Knowledge Graph and Time Decay Function. Entropy 2022, 24, 1603. [Google Scholar] [CrossRef]
  199. Huang, M.; Wang, J.-L.; Zhang, Z.-K. Narrative Co-Evolution in Hybrid Social Networks: A Longitudinal Computational Analysis of Confucius Institutes. Entropy 2025, 27, 1240. [Google Scholar] [CrossRef] [PubMed]
  200. Zhan, Y.; Yang, R.; You, J.; Huang, M.; Liu, W.; Liu, X. A Systematic Literature Review on Incomplete Multimodal Learning: Techniques and Challenges. Syst. Sci. Control Eng. 2025, 13, 2467083. [Google Scholar] [CrossRef]
  201. Martínez-Plumed, F.; Ferri, C.; Nieves, D.; Hernández-Orallo, J. Missing the Missing Values: The Ugly Duckling of Fairness in Machine Learning. Int. J. Intell. Syst. 2021, 36, 3217–3258. [CrossRef]
  202. Pereira, R.C.; Abreu, P.H.; Rodrigues, P.P.; Figueiredo, M.A.T. Imputation of Data Missing Not at Random: Artificial Generation and Benchmark Analysis. Expert Syst. Appl. 2024, 249, 123654. [Google Scholar] [CrossRef]
  203. Nevado-Catalán, D.; Pastrana, S.; Vallina-Rodriguez, N.; Tapiador, J. An Analysis of Fake Social Media Engagement Services. Comput. Secur. 2023, 124, 103013. [Google Scholar] [CrossRef]
  204. Chelas, S.; Routis, G.; Roussaki, I. Detection of Fake Instagram Accounts via Machine Learning Techniques. Computers 2024, 13, 296. [Google Scholar] [CrossRef]
  205. Yuan, Y.; Li, Z.; Zhao, B. A Survey of Multimodal Learning: Methods, Applications, and Future. ACM Comput. Surv. 2025, 57, 1–34. [Google Scholar] [CrossRef]
  206. Singh, S.; Saber, E.; Markopoulos, P.P.; Heard, J. Regulating Modality Utilization within Multimodal Fusion Networks. Sensors 2024, 24, 6054. [Google Scholar] [CrossRef]
  207. Ma, X.; Cai, X.; Song, Y.; Liang, Y.; Liu, G.; Yang, Y. RMP: Robust Multi-Modal Perception Under Missing Condition. Electronics 2025, 15, 119. [Google Scholar] [CrossRef]
  208. Ma, Y.; Li, S.; Zhou, M. Twitter-Based Market Uncertainty and Global Stock Volatility Predictability. N. Am. J. Econ. Finance 2025, 75, 102256. [Google Scholar] [CrossRef]
  209. Anand, A.; Pathak, J. The Role of Reddit in the GameStop Short Squeeze. Econ. Lett. 2022, 211, 110249. [Google Scholar] [CrossRef]
  210. Gan, B.; Alexeev, V.; Bird, R.; Yeung, D. Sensitivity to Sentiment: News vs Social Media. Int. Rev. Financ. Anal. 2020, 67, 101390. [Google Scholar] [CrossRef]
  211. Bousbaa, Z.; Sanchez-Medina, J.; Bencharef, O. Financial Time Series Forecasting: A Data Stream Mining-Based System. Electronics 2023, 12, 2039. [Google Scholar] [CrossRef]
  212. Fan, R.; Talavera, O.; Tran, V. Social Media Bots and Stock Markets. Eur. Financ. Manag. 2020, 26, 753–777. [Google Scholar] [CrossRef]
  213. Zeng, T.; Shema, A.; Acuna, D.E. Dead Science: Most Resources Linked in Biomedical Articles Disappear in Eight Years. In Information in Contemporary Society; Taylor, N.G., Christian-Lamb, C., Martin, M.H., Nardi, B., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2019; Volume 11420, pp. 170–176. [Google Scholar]
  214. Abdalgader, K.; Matroud, A.A.; Al-Doboni, G. Temporal Dynamics in Short Text Classification: Enhancing Semantic Understanding Through Time-Aware Model. Information 2025, 16, 214. [Google Scholar] [CrossRef]
  215. Eg, R.; Demirkol Tønnesen, Ö.; Tennfjord, M.K. A Scoping Review of Personalized User Experiences on Social Media: The Interplay between Algorithms and Human Factors. Comput. Hum. Behav. Rep. 2023, 9, 100253. [Google Scholar] [CrossRef]
  216. Bugajev, A.; Kriauzienė, R.; Chadyšas, V. Realistic Data Delays and Alternative Inactivity Definitions in Telecom Churn: Investigating Concept Drift Using a Sliding-Window Approach. Appl. Sci. 2025, 15, 1599. [Google Scholar] [CrossRef]
  217. Oliveira, J.M.; Ramos, P. Evaluating the Effectiveness of Time Series Transformers for Demand Forecasting in Retail. Mathematics 2024, 12, 2728. [Google Scholar] [CrossRef]
  218. Lin, X.; Chang, L.; Nie, X.; Dong, F. Temporal Attention for Few-Shot Concept Drift Detection in Streaming Data. Electronics 2024, 13, 2183. [Google Scholar] [CrossRef]
  219. Cao, Z.; Li, Y.; Kim, D.-H.; Shin, B.-S. Deep Neural Network Confidence Calibration from Stochastic Weight Averaging. Electronics 2024, 13, 503. [Google Scholar] [CrossRef]
  220. Kogan, S.; Moskowitz, T.J.; Niessner, M. Social Media and Financial News Manipulation. Rev. Financ. 2023, 27, 1229–1268. [Google Scholar] [CrossRef]
  221. Fernandez Vilas, A.; Diaz Redondo, R.P.; Lorenzo Garcia, A. The Irruption of Cryptocurrencies Into Twitter Cashtags: A Classifying Solution. IEEE Access 2020, 8, 32698–32713. [Google Scholar] [CrossRef]
  222. Hewamalage, H.; Ackermann, K.; Bergmeir, C. Forecast Evaluation for Data Scientists: Common Pitfalls and Best Practices. Data Min. Knowl. Discov. 2023, 37, 788–832. [Google Scholar] [CrossRef]
  223. Incorvaia, G.; Hond, D.; Asgari, H. Uncertainty Quantification of Machine Learning Model Performance via Anomaly-Based Dataset Dissimilarity Measures. Electronics 2024, 13, 939. [Google Scholar] [CrossRef]
  224. Cresci, S. A Decade of Social Bot Detection. Commun. ACM 2020, 63, 72–83. [Google Scholar] [CrossRef]
Figure 1. Evidence types, provenance layers, and sampling biases that shape the observed dataset.
Figure 2. Decision cutoffs, evidence windows, and common leakage routes.
Figure 3. End-to-end social–financial decision pipeline and where validity risks enter.
Figure 4. Social–financial pipeline with primary validity risks.
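For concreteness, the sketch below illustrates the alignment discipline summarized in Figure 2: features may only use evidence whose ingestion time precedes the decision cutoff, and the label is computed strictly after it. It is a minimal illustration, assuming a hypothetical pandas DataFrame of posts with post_time, observed_time, and sentiment columns and a time-indexed price series; it is not taken from any reviewed system.

```python
# Minimal sketch: align an evidence window and a label horizon to a decision cutoff
# so that nothing observed at or after the cutoff can leak into the features.
# All column names (post_time, observed_time, sentiment) are hypothetical.
import pandas as pd

def build_example(posts: pd.DataFrame,
                  cutoff: pd.Timestamp,
                  evidence_window: pd.Timedelta,
                  label_horizon: pd.Timedelta,
                  prices: pd.Series) -> dict:
    """Aggregate only evidence observable before the cutoff; label strictly after it."""
    window_start = cutoff - evidence_window
    observable = posts[
        (posts["post_time"] >= window_start)
        & (posts["observed_time"] < cutoff)   # use ingestion time, not the author timestamp
    ]
    features = {
        "n_posts": len(observable),
        "mean_sentiment": observable["sentiment"].mean() if len(observable) else 0.0,
    }
    # The label uses prices strictly inside (cutoff, cutoff + label_horizon].
    future = prices[(prices.index > cutoff) & (prices.index <= cutoff + label_horizon)]
    label = float(future.iloc[-1] > prices.asof(cutoff)) if len(future) else None
    return {"features": features, "label": label}
```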
Table 1. Model blocks, suitability, typical validity risks, and feasibility constraints.
Model Block | Best Suited Evidence | Typical Validity Risks | Feasibility Constraints
Text encoders | Short posts, threads, instrument-anchored mentions | Reaction content in the evidence window; post edits/deletions not handled; ambiguous mentions forced to an asset | Per-message inference latency and memory footprint under volume spikes; calibration under shift
Temporal modules | Session-level sequences, irregular bursts, rolling windows | Session misalignment; horizon overlap; observation delay ignored | Window/state maintenance sensitivity; retraining cadence and monitoring overhead
Graph encoders | Interaction graphs, co-mentions, user–asset links | Graph construction crossing decision cutoffs; edge semantics ambiguity; visibility-driven missing edges | Graph construction and update cost at scale; sampling/sparsification can alter behavior
Fusion mechanisms | Text + interaction + market features; optional media/links | Cross-modal timing mismatch; missing-modality artifacts; dominance of manipulable modalities | Cross-modal synchronization and buffering overhead; robustness to missing or delayed evidence required
Lightweight tabular heads | Aggregated indices, engineered indicators | Leakage via post hoc aggregates; target leakage through market-derived labels | Low latency; easier monitoring but limited expressiveness
Retrieval-augmented analysis | Threads plus external artifacts and links | Retrieval includes future material; freshness/provenance not enforced | Retrieval and external lookup latency; audit logging/versioning needed for reproducibility
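As a composition example for the blocks in Table 1, the following minimal sketch wires precomputed text, temporal, graph, and market representations into a late-fusion head with per-block presence masks, so that a missing block degrades the output rather than silently contributing zeros. The dimensions, block names, and masked-averaging scheme are illustrative assumptions, not a reference architecture from the reviewed literature.

```python
# Minimal late-fusion sketch over precomputed block outputs (text, temporal, graph,
# market). Dimensions and the gating scheme are illustrative, not a reference design.
import torch
import torch.nn as nn

class LateFusionHead(nn.Module):
    def __init__(self, dims: dict, hidden: int = 64):
        super().__init__()
        # One small projection per evidence block, so a block can be dropped cleanly.
        self.proj = nn.ModuleDict({k: nn.Linear(d, hidden) for k, d in dims.items()})
        self.out = nn.Sequential(nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, blocks: dict, present: dict) -> torch.Tensor:
        # `present[k]` is a (batch, 1) mask: 1.0 if the block was actually observed.
        fused, total = 0.0, 0.0
        for k, x in blocks.items():
            m = present[k]
            fused = fused + m * self.proj[k](x)
            total = total + m
        fused = fused / total.clamp(min=1.0)   # average only over observed blocks
        return self.out(fused)

# Usage with hypothetical embedding sizes for each evidence block:
head = LateFusionHead({"text": 768, "temporal": 32, "graph": 128, "market": 16})
```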
Table 2. Multimodal and cross-platform signals: what they add and what can go wrong.
Signal Source | What It Adds | Main Risks | Minimum Safeguards to Report
Social text + market features | Context on volatility and trend; sometimes improves stability | Market channel dominates; social contribution becomes unclear | Incremental value test (market-only vs. market + social) under the same time-forward split
Visual content (images, video frames) | Non-text cues (charts, screenshots, community-coded signals) | Fabrication, weak attribution, small labeled sets; models learn style and attention cues | Provenance assumptions; missing-visual handling; error analysis on manipulated or low-quality visuals
Links and referenced content | External context; topic anchoring via sources | Pages change or disappear; timing often reaction-driven; link-sharing bias | Time cutoffs for link availability; retention/archiving policy; robustness to dead or modified links
Cross-platform aggregation | Broader coverage; reduces single-platform dependence | Alignment errors, identity mismatch, vocabulary drift; conflation of unrelated threads | Alignment method (topic/entity/URL); propagation delay handling; tests on platform-shift periods and missing-platform conditions
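The first safeguard in Table 2, an incremental value test, can be made concrete with the sketch below: a market-only baseline and a market-plus-social model are fitted and scored under the same time-forward split, and only the difference in held-out performance is attributed to the social stream. The column names (date, ret_1d, realized_vol, sent_index, post_volume, y) and the choice of a logistic baseline are hypothetical.

```python
# Minimal sketch of an incremental-value test: market-only vs. market + social
# features under the SAME time-forward split. Column names are hypothetical.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def incremental_value(df: pd.DataFrame, split_date: str) -> dict:
    df = df.sort_values("date")
    train, test = df[df["date"] < split_date], df[df["date"] >= split_date]
    market_cols = ["ret_1d", "realized_vol"]
    social_cols = market_cols + ["sent_index", "post_volume"]
    scores = {}
    for name, cols in [("market_only", market_cols), ("market_plus_social", social_cols)]:
        model = LogisticRegression(max_iter=1000)
        model.fit(train[cols], train["y"])
        scores[name] = roc_auc_score(test["y"], model.predict_proba(test[cols])[:, 1])
    return scores  # report both; the social contribution is the difference
```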
Table 3. Task-dependent comparison of masking and imputation under structured missingness.
Task Setting | When Masking/Missing-Indicator Handling Is Preferable | When Limited Imputation Is More Defensible | Main Risk If Mishandled
Sentiment-to-signal aggregation | When missing posts, counters, or linked artifacts reflect moderation, rate limits, platform outages, event-time overload, or selective visibility, because absence may itself indicate reduced observability or evidence quality | When the missing field is auxiliary and low-level, such as short gaps in dense numeric covariates, and when an explicit missingness indicator is retained | Imputation can convert collection artifacts into artificial stability and distort the meaning of the sentiment index
Surveillance/manipulation detection | Usually preferable, because structured absence may itself be part of the signal, and synthetic completion can hide suspicious propagation patterns or moderation effects | Only in narrow cases involving peripheral numeric fields that do not alter cascade structure or behavioral interpretation | Imputation can fabricate coordination motifs, suppress anomaly cues, or increase false confidence in threat assessment
Relational/graph-based analysis | Preferable when missing edges or users arise from rate limiting, visibility restrictions, or removals, since partial observability affects centrality and cascade structure in non-uniform ways | Rarely appropriate, except for carefully bounded structural assumptions that are disclosed and stress-tested | Filled-in edges or interactions can create graph structures that were never observed and can mislead diffusion or centrality analysis
Multimodal fusion | Preferable when modality absence is informative, irregular, or platform-driven, and when the system must preserve uncertainty about what was actually observed | More defensible when one optional modality is frequently absent in a stable pattern and the model is explicitly trained and evaluated under that condition | Default filling can make the model appear robust while actually relying on unrealistic completion patterns
Short-horizon operational decision support | Preferable when absence is tied to timing, collection delay, or evidence freshness constraints, because missingness affects what was knowable at decision time | Limited imputation may be acceptable for slow-moving auxiliary features whose absence does not change time validity | Imputation can blur the boundary between unavailable evidence and available evidence, leading to unrealistic operational evaluation
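To make the distinction in Table 3 concrete, the sketch below keeps the primary social signal masked with an explicit missingness indicator while allowing limited, capped forward-filling of a slow-moving auxiliary covariate. The column names and the three-day fill limit are illustrative assumptions, not a recommended default.

```python
# Minimal sketch contrasting the two policies on a daily feature frame:
# (a) keep the gap and add an explicit missingness indicator for the primary signal;
# (b) impute only a slow-moving auxiliary covariate. Column names are hypothetical.
import pandas as pd

def apply_missingness_policy(daily: pd.DataFrame) -> pd.DataFrame:
    out = daily.copy()

    # (a) Masking: preserve NaN in the sentiment index and expose the gap explicitly,
    # so downstream models can learn that the evidence stream was unobservable.
    out["sent_index_missing"] = out["sent_index"].isna().astype(int)

    # (b) Limited imputation: forward-fill only an auxiliary, slow-moving covariate,
    # and cap the fill so longer outages are not silently smoothed over.
    out["listing_age_days"] = out["listing_age_days"].ffill(limit=3)

    # The primary social signal is never fabricated: NaN is left for the aggregator
    # or model to handle (abstain or down-weight), rather than filled with zeros.
    return out
```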
Table 4. Minimal assurance checklist for decision-ready social signal pipelines.
What to Report | Why It Matters
Evidence source and access limits (API tier, sampling method, rate limits) | Observed data is a visibility-filtered sample; access changes can alter results without any model change
Evidence freshness policy (cutoffs, delays tolerated, staleness handling) | Prevents post-cutoff reaction content from entering features and inflating performance
Observation delay and backlog behavior under spikes | Delays are largest when attention spikes; retrospective studies often assume immediate availability
Deletion and edit handling (retention, re-fetching, snapshot policy) | Post edits and removals can create unrealistic evidence that was not observable at decision time
Entity linking method and confidence handling (abstention or fallback) | Ambiguous mentions can systematically corrupt instrument-level indices and graphs
Window definitions (evidence window, decision cutoff, label horizon; session convention) | Avoids leakage and cross-session mixing; enables apples-to-apples comparison of results
Missing-modality policy (masking, imputation, reliability weights) | Missingness is structured and can become a predictor; models must degrade predictably
Controls for attention confounds (volume/engagement baselines, conditioning by activity level) | Prevents social indicators from acting as proxies for volatility or news intensity
Calibration and thresholding policy (abstention, uncertainty, alert-rate control) | Decision costs depend on confidence; miscalibration under drift drives unstable actions and alerts
Stress tests (partial outages, modality dropout, manipulation templates, shifted periods) | Measures robustness under plausible failure conditions rather than only average-case accuracy
Traceability logs (evidence IDs, timestamps, preprocessing versions, mapping rules) | Enables reconstruction, debugging, and governance; supports accountable decision use
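One lightweight way to satisfy the traceability item in Table 4 is to emit a structured record alongside every indicator, capturing the evidence identifiers, timestamps, and preprocessing versions needed to reconstruct it later. The sketch below shows one possible schema; all field names and identifier formats are hypothetical.

```python
# Minimal sketch of a traceability record for one emitted indicator.
# Field names are illustrative; the point is that evidence IDs, timestamps, and
# preprocessing versions are captured at emission time for later reconstruction.
import json
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone

@dataclass
class IndicatorTrace:
    indicator_id: str
    instrument: str
    decision_cutoff: str          # ISO-8601, UTC
    evidence_ids: list            # post/edge IDs actually used
    preprocessing_version: str    # e.g., version tag of the cleaning pipeline
    entity_linking_rule: str      # mapping rule or linker model version
    emitted_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

trace = IndicatorTrace(
    indicator_id="sent-xyz-0001",
    instrument="TICKER_A",
    decision_cutoff="2026-03-02T14:30:00+00:00",
    evidence_ids=["post:123", "post:456"],
    preprocessing_version="clean-v0.9.2",
    entity_linking_rule="cashtag-exact-v3",
)
print(json.dumps(asdict(trace), indent=2))  # in practice, append to an audit log
```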
Table 5. Comparative synthesis of representative task settings in social–financial analytics.
Task Family | Typical Evidence Used | Common Model Families | Common Evaluation Target/Horizon | What Tends to Hold Up Across Studies | Why Raw Metrics Are Hard to Compare
Sentiment-based short-horizon forecasting | Post text, sentiment scores, cashtags, basic engagement indicators | Contextual text encoders, FinBERT-style models, lightweight sequence or tabular prediction heads | Intraday, same-day, post-market, next-session direction or volatility-sensitive response | Gains are most defensible when entity attribution is explicit, evidence windows end before the decision cutoff, and sentiment is evaluated beyond simple attention baselines | Same-day versus next-day targets, market-close conventions, reaction contamination, and session alignment differ substantially across studies
Risk monitoring and volatility-sensitive inference | Aggregated sentiment indices, activity bursts, narrative intensity, news-linked or cross-source signals | Temporal models, hybrid text-plus-market pipelines, multimodal fusion architectures | Volatility response, stress-state classification, risk monitoring around event windows, short-horizon deterioration signals | Social evidence is often most useful during high-attention or uncertainty periods, especially when it is evaluated incrementally over market-only controls | Volatility definitions, event windows, baseline controls, and horizon length vary enough to make pooled metrics unstable
Coordination, manipulation, and surveillance tasks | Interaction graphs, repost cascades, user–asset links, repeated phrasing, behavioral metadata | Graph encoders, dynamic graph models, graph-plus-text hybrids, anomaly and coordination detectors | Suspicious coordination, diffusion anomalies, influence estimation, surveillance flags, manipulation-related detection | Relational signals are more defensible for monitoring structure, propagation, and coordinated behavior than for claiming direct causal prediction of market outcomes | Graph construction rules, edge semantics, visibility gaps, and labeling practices differ sharply across studies
Multimodal or cross-platform decision support | Text, interaction traces, linked artifacts, optional market covariates, sometimes images or external documents | Early, late, or intermediate fusion; retrieval-supported pipelines; multimodal transformers | Event interpretation, short-horizon support, robustness under partial evidence, decision support under heterogeneous observability | Improvements are more credible when ablations, missing-modality tests, and reliability controls are explicit | Platform access, identity linkage quality, modality availability, and observability assumptions are rarely aligned across papers
Transfer, drift, and deployment-oriented evaluation | Time-separated social evidence, repeated evaluation windows, delayed or incomplete evidence, platform-change periods | Drift-aware pipelines, recalibrated classifiers, uncertainty-aware models, monitoring-oriented evaluation designs | Time-forward generalization, breakage under shift, robustness under delay or missingness | Stable findings are more likely when studies report where performance degrades, how evidence properties changed, and whether calibration survives shift | Adjacent-period splits, retraining frequency, platform evolution, and observability changes make cross-study performance numbers especially fragile
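The drift-oriented row of Table 5 argues for reporting performance trajectories rather than pooled averages. A minimal walk-forward sketch is given below; the model family, column names, and period boundaries are placeholder choices, and in practice each period's score would be read alongside evidence-availability and calibration diagnostics.

```python
# Minimal sketch of time-forward (walk-forward) evaluation with per-period reporting:
# retrain on past data only and record where performance degrades, rather than
# reporting a single pooled score. Column names are hypothetical.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

def walk_forward_report(df: pd.DataFrame, feature_cols: list, periods: list) -> pd.DataFrame:
    """`periods` is an ordered list of (train_end, test_end) date pairs."""
    df = df.sort_values("date")
    rows = []
    for train_end, test_end in periods:
        train = df[df["date"] < train_end]
        test = df[(df["date"] >= train_end) & (df["date"] < test_end)]
        if train.empty or test.empty:
            continue
        model = GradientBoostingClassifier().fit(train[feature_cols], train["y"])
        auc = roc_auc_score(test["y"], model.predict_proba(test[feature_cols])[:, 1])
        rows.append({"train_end": train_end, "test_end": test_end,
                     "auc": auc, "n_test": len(test)})
    return pd.DataFrame(rows)  # inspect the trajectory, not just its average
```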
Table 6. Common failure modes and minimal mitigations.
Failure Mode | How It Appears in Results | Minimal Mitigation
Visibility and sampling bias | Strong performance that does not transfer when access tier, ranking, or moderation changes | Report access limits; test across non-adjacent periods; sensitivity analysis to sampling rules
Temporal overlap leakage | Unusually high short-horizon accuracy; sharp drop under time-forward evaluation | Enforce strict decision cutoffs; separate evidence and label horizons; document window definitions
Entity ambiguity and misattribution | Noisy or inconsistent instrument-level signals; unstable cross-asset results | Confidence-aware entity linking; abstain on ambiguous mentions; evaluate attribution error explicitly
Engagement and attention confounding | Models succeed mainly during high-volatility episodes; weak incremental value beyond volume proxies | Include attention baselines; condition results by activity level; control for event intensity
Drift in language and participation | Degradation over time; brittle behavior on new assets and emerging narratives | Time-forward testing on later periods; recalibration; monitor feature shift and error by period
Missing evidence and observation delay | Silent failures or alert flooding under spikes; calibration collapse in sparse evidence | Missingness stress tests; delay-aware evaluation; conservative fallback and abstention rules
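Several of the mitigations in Table 6 (calibration, abstention, alert-rate control) can be combined into a simple decision rule, sketched below. The isotonic calibrator and the two thresholds are illustrative assumptions; in practice they would be set on a later validation period and tuned to the operating cost of false alerts.

```python
# Minimal sketch of a conservative decision rule: calibrate scores on held-out,
# later data, then act only above a high threshold and abstain in a middle band.
import numpy as np
from sklearn.isotonic import IsotonicRegression

def fit_calibrator(val_scores: np.ndarray, val_labels: np.ndarray) -> IsotonicRegression:
    # Map raw model scores to calibrated probabilities using a held-out, later period.
    return IsotonicRegression(out_of_bounds="clip").fit(val_scores, val_labels)

def decide(calibrator: IsotonicRegression, score: float,
           act_above: float = 0.75, abstain_below: float = 0.60) -> str:
    p = float(calibrator.predict([score])[0])
    if p >= act_above:
        return "alert"        # confident enough to act
    if p >= abstain_below:
        return "abstain"      # defer to a human or wait for more evidence
    return "no_action"
```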
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
