Exploring Multi-Modal LLMs for Time Series Anomaly Detection

Niu, Hao; Habault, Guillaume; Ung, Huy Quang; Legaspi, Roberto; Li, Zhi; Wang, Yanan; Zeng, Donghuo; Vizcarra, Julio; Taya, Masato

doi:10.3390/cmsf2025011022

Open Access Proceeding Paper

Exploring Multi-Modal LLMs for Time Series Anomaly Detection^†

by Hao Niu ^*, Guillaume Habault, Huy Quang Ung, Roberto Legaspi, Zhi Li, Yanan Wang, Donghuo Zeng, Julio Vizcarra and Masato Taya

KDDI Research, Inc., 2-1-15 Ohara, Fujimino-shi 356-8502, Saitama, Japan

^* Author to whom correspondence should be addressed.

^† Presented at the 11th International Conference on Time Series and Forecasting, Canaria, Spain, 16–18 July 2025.

^‡ These authors contributed equally to this work.

Comput. Sci. Math. Forum 2025, 11(1), 22; https://doi.org/10.3390/cmsf2025011022

Published: 11 August 2025

(This article belongs to the Proceedings of The 11th International Conference on Time Series and Forecasting)

Abstract

Anomaly detection in time series data is crucial across various domains. Traditional methods often struggle with continuously evolving time series requiring adjustment, whereas large language models (LLMs) and multi-modal LLMs (MLLMs) have emerged as promising zero-shot anomaly detectors by leveraging embedded knowledge. This study expands recent evaluations of MLLMs for zero-shot time series anomaly detection by exploring newer models, additional input representations, varying input sizes, and conducting further analyses. Our findings reveal that while MLLMs are effective for zero-shot detection, they still face limitations, such as effectively integrating both text and vision representations or handling longer input lengths. These challenges unveil diverse opportunities for future improvements.

Keywords:

multi-modal large language models; time series; anomaly detection; zero-shot

1. Introduction

Time series data, consisting of chronologically ordered data points, is widely used in fields such as transportation, cybersecurity, healthcare, and industrial operations. Anomalies in time series often signal opportunities, such as drastic changes in stock prices in finance, or problems, such as traffic accidents in transportation or irregular vital signs in healthcare. As data volumes grow, especially with emerging application scenarios, advanced techniques are increasingly needed for time series anomaly detection (TSAD) in evolving datasets.

Traditional TSAD methods, including the Z-score, exponential smoothing, and ARIMA [1], rely on rigid distributional assumptions, struggle with complex, non-linear relationships, and lack adaptability to changing data and the appearance of new patterns. Machine learning approaches like isolation forest [2] and neural networks [3] offer improvements but require extensive training and potential updates to remain effective, adding operational overhead.

The rapid development of large language models (LLMs) presents a promising shift in TSAD. With their embedded knowledge, LLMs can identify anomalies using vast pre-existing information from diverse sources without additional training, making them well-suited for dynamic time series data. Some studies have explored LLMs for TSAD using textual time series representations [4,5,6,7]. More recently, multi-modal LLMs (MLLMs) have extended this potential by introducing multiple data modalities, providing a richer understanding of complex time series [8,9,10]. A prior study [8] found that vision representations outperformed textual representations, while refs. [9,10] focused solely on vision inputs, demonstrating MLLMs’ effectiveness in zero-shot and few-shot TSAD, respectively.

This study expands on these efforts by evaluating MLLMs’ zero-shot TSAD capabilities from multiple perspectives. The key differences between our study and recent zero-shot studies [8,9] are summarized in Table 1.

The main contributions of our study are summarized as follows:

We conduct an extensive empirical evaluation of MLLMs for zero-shot TSAD, incorporating a wider range of models with newer versions, different model sizes, more time series representations, and varying input lengths.
We analyze results at two levels, (i) assessing models’ ability to distinguish normal and anomalous time series and (ii) evaluating their precision in pinpointing anomalies.
We extend analysis beyond detection accuracy and also assess response appropriateness to determine how well models follow instructions.
Our findings highlight both MLLMs’ potential and limitations, such as their inability to effectively combine text and vision representations. These insights reveal new research directions for improving multimodal TSAD.

2. Methodology

This section outlines the experimental design, detailing the models, time series representations, prompting strategy, input lengths, dataset generation, and evaluation methods.

2.1. Evaluated MLLMs

For this study, we selected a diverse set of recent proprietary and open source MLLMs, including Gemini-2.0 [11], GPT-4o [12], LLama-3.2 [13], Ovis2 [14], Qwen2.5-VisionLanguage (Qwen2.5-VL) [15], and LLaVa-OneVision (LLaVa-OV) [16]. To analyze the impact of model size on performance, we evaluated both their large and lite versions, as detailed in Table 1.

To evaluate the capabilities of these models to detect anomalies, we defined a “Dummy” baseline, which considers that all time series are anomaly-free. We also included isolation forest, a classical anomaly detection algorithm, as used in [8], to serve as another baseline.

2.2. Time Series Representation

Unlike traditional LLMs, which process Text-only inputs, MLLMs integrate both language and vision capabilities, enabling them to handle text, images, or a combination of both. This multimodal flexibility has led to strong performance across various tasks. However, existing studies [8,9] on MLLMs for TSAD have primarily focused on a single representation type—either Text-only or Vision-only. Extending beyond these studies, we introduce a Text and Vision input type, where both textual and visual representations of the time series are provided simultaneously.

Our objective is to determine whether a particular input representation (Text-only, Vision-only, or Text and Vision) is optimal for time series anomaly detection.

2.3. Prompting Strategy

We base our prompting strategy on the template introduced by Zhou et al. [8], which instructs models to return a list of time step ranges corresponding to the positions of anomalies in the time series. However, we introduce the following key modifications to enhance instruction clarity and reduce ambiguity in model outputs: (i) explicit indexing: we specify that time step indices range from 0 to [L − 1], where L is the time series length; (ii) exclusive end indices: the prompt clarifies that each anomaly range should follow an exclusive end index format; and (iii) strict format enforcement: we explicitly instruct models to adhere to the output template to prevent inconsistencies.

Our revised prompt is organized in two parts. The specific part varies depending on the input representation and describes how the time series is provided to the model. The common part remains consistent across all input representations and defines the task, ensuring the model understands the expected output format. Our template is shown below:

Specific part

Text-only: [Time series text] This time series comprises [L] values.

Vision-only: [Image] This image depicts a time series comprising [L] values.

Text and Vision: [Time series text] [Image] This time series, illustrated by the accompanying image, comprises [L] values.

Common part

The indices range from 0 to [L − 1]. Assume there are up to [N] ranges of anomalies within this series.

Detect these anomaly ranges in the series. If there are no anomalies, answer with an empty list [].

Each range should be described by a start and an exclusive end index. List these ranges one by one in JSON format.

Use the following output template: [{“start”: …, “end”: …}, {“start”: …, “end”: …}, …]. Strictly follow this template.

Do not include any additional explanations, description, reasoning, or code. Only return the answer.

In the specific part, the [Time series text] placeholder provides the time series as a plain text sequence, such as [0.23, −0.15, …]. The expected output is an empty list ([]) if the model detects that there are no anomalies. Otherwise, it follows a structured format, such as [{“start”: 0, “end”: 12}, {“start”: 72, “end”: 85}].
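As a minimal sketch of how the two-part template above could be assembled in code, the following function fills in the placeholders for each input representation. The function name, its arguments, and the use of a literal "[Image]" marker are illustrative assumptions; how an image is actually attached depends on each model's API and is not shown.

```python
def build_prompt(length, max_ranges, series_text=None, with_image=False):
    """Fill the prompt template for one time series (illustrative helper).

    length      -- number of values L in the series
    max_ranges  -- the [N] placeholder (its value is not given in the text)
    series_text -- plain-text sequence such as "[0.23, -0.15, ...]", or None
    with_image  -- True when a plot of the series is attached separately
    """
    if series_text is None and not with_image:
        raise ValueError("provide series_text, an image, or both")

    # Specific part: depends on the input representation.
    if series_text is not None and with_image:
        specific = (
            f"{series_text} [Image] This time series, illustrated by the "
            f"accompanying image, comprises {length} values."
        )
    elif series_text is not None:
        specific = f"{series_text} This time series comprises {length} values."
    else:
        specific = (
            f"[Image] This image depicts a time series comprising {length} values."
        )

    # Common part: identical for all representations.
    common = "\n".join([
        f"The indices range from 0 to {length - 1}. Assume there are up to "
        f"{max_ranges} ranges of anomalies within this series.",
        "Detect these anomaly ranges in the series. If there are no anomalies, "
        "answer with an empty list [].",
        "Each range should be described by a start and an exclusive end index. "
        "List these ranges one by one in JSON format.",
        'Use the following output template: [{"start": ..., "end": ...}, '
        '{"start": ..., "end": ...}, ...]. Strictly follow this template.',
        "Do not include any additional explanations, description, reasoning, "
        "or code. Only return the answer.",
    ])
    return specific + "\n" + common
```

For example, `build_prompt(100, 5, series_text="[0.23, -0.15]")` yields the Text-only variant, where the value 5 for N is purely hypothetical.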

2.4. Time Series Length

We hypothesize that increasing the input length provides models with additional contextual information, such as periodic patterns, which may improve their ability to detect anomalies. However, longer inputs also require more tokens and computational resources, posing challenges for MLLMs in modeling long-term dependencies. This increased complexity could make anomaly detection more difficult and potentially reduce model efficiency.

For image-based inputs, longer time series may introduce further limitations, as compressing longer sequences into a fixed image resolution could reduce the level of detail available for analysis. Given these trade-offs, we aim to evaluate how input length affects MLLM performance and identify configurations where models either excel or struggle. To achieve this, we experiment with two primary input lengths: L = 100 (short) and L = 1000 (long).

2.5. Synthetic Dataset

Following Zhou et al.’s work [8], we consider four types of anomalies observed in time series data. Our objective is to assess MLLMs’ ability to detect diverse anomaly patterns. The four anomaly types are defined as follows: (1) Trend anomalies—sudden shifts in the established trend, either accelerating or decelerating, as illustrated in Figure 1a. (2) Frequency Domain anomalies—abrupt increases or decreases in frequency components (Figure 1b). (3) Point anomalies—isolated data points that deviate significantly from the expected pattern, as illustrated in Figure 1c. (4) Range anomalies—a consecutive sequence of points where values exceed normal thresholds (Figure 1d).

To ensure consistency with prior studies, we adopt the same parameters as Zhou et al. [8] for input sequences of length 1000. However, for input sequences of length 100, we increase the frequency and shorten anomaly ranges to better align with this shorter length. Figure 1 illustrates the differences between normal time series (green) and those containing anomalies (blue) across all anomaly types (columns) and both input lengths (rows). As expected, longer time series (L = 1000) tend to smooth out finer details, while shorter ones (L = 100) appear sparser, making anomalies visually distinct. For each anomaly type, we generate 400 synthetic time series following the methodology of Zhou et al. [8]. Table 2 details the distribution statistics of normal and anomalous time series.
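To make the point and range anomaly definitions concrete, the sketch below injects each into a noisy sine wave and records the ground-truth ranges with exclusive end indices. It is not the generator used in the study (which follows Zhou et al. [8]); the function names, period, noise level, magnitude, and offset are all hypothetical values chosen for illustration.

```python
import math
import random


def make_series(length, period, noise_std=0.05, seed=0):
    """Noisy sine wave used as a 'normal' series (hypothetical parameters)."""
    rng = random.Random(seed)
    return [
        math.sin(2 * math.pi * t / period) + rng.gauss(0.0, noise_std)
        for t in range(length)
    ]


def inject_point_anomalies(series, positions, magnitude=3.0):
    """Add isolated spikes; returns the new series and ground-truth ranges."""
    out = list(series)
    for t in positions:
        out[t] += magnitude
    # Each point anomaly is a range of width 1 with an exclusive end index.
    return out, [(t, t + 1) for t in sorted(positions)]


def inject_range_anomaly(series, start, end, offset=2.0):
    """Shift a consecutive block of values so it leaves the normal band."""
    out = list(series)
    for t in range(start, end):
        out[t] += offset
    return out, [(start, end)]
```

A short series such as `make_series(100, period=20)` with `inject_range_anomaly(series, 30, 40)` gives one labeled anomaly covering time steps 30 to 39.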

2.6. Evaluation Methods and Metrics

We evaluate TSAD performance at two levels, namely series-level detection and step-level detection. Series-level detection assesses whether a model correctly classifies an entire time series as normal or anomalous. If the model outputs an empty list, the time series is considered normal; otherwise, it is classified as containing anomalies. Step-level detection, in contrast, evaluates whether the model can correctly identify the specific time steps that constitute anomalies. This distinction allows us to determine whether some models can detect the presence of anomalies but struggle to pinpoint their exact locations.
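The two detection levels can be sketched as two ways of turning a list of ranges into labels. The helper names below are illustrative, and the code assumes the ranges have already been cleaned so that all indices lie within the series.

```python
def series_label(ranges):
    """Series-level label: 0 (normal) for an empty list, 1 otherwise."""
    return 0 if len(ranges) == 0 else 1


def step_labels(ranges, length):
    """Step-level labels: 1 for every index covered by a range, else 0.

    Each range is a dict with a 'start' and an exclusive 'end' index.
    """
    labels = [0] * length
    for r in ranges:
        for t in range(r["start"], r["end"]):
            labels[t] = 1
    return labels
```

For instance, if the true anomaly is `{"start": 72, "end": 85}` and a model answers `{"start": 0, "end": 12}`, both series-level labels are 1 (a correct detection), yet the step-level labels do not overlap at all.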

Both detection tasks are treated as binary classification problems, where normal instances belong to class 0 and anomalous ones to class 1. However, the generated dataset is imbalanced. As shown in Table 2, for series-level detection, approximately 2/3 of the time series contain anomalies. The imbalance is even more pronounced in step-level detection, where, as illustrated in Figure 1, the vast majority of time steps are normal. To account for these imbalances, we focus on precision, recall, and the F1-score for the anomaly class (pos_label = 1). For these metrics, we also set the parameter zero_division to 0. In addition, we added the balanced accuracy (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.balanced_accuracy_score.html (accessed on 31 March 2025)), which accounts for class distribution, with one being the optimal score. Given the numerous experimental settings, we report only results averaged across anomaly types, but full results are available in our GitHub repository (https://github.com/haoniukr/tsad (accessed on 31 March 2025)).
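The metric conventions above can be illustrated with a small dependency-free sketch. The study itself relies on scikit-learn; the function below is only an assumed re-implementation of the same definitions (anomaly class as the positive class, undefined ratios set to 0, balanced accuracy as the mean recall over the classes present in the ground truth), written for a non-empty list of binary labels.

```python
def anomaly_class_scores(y_true, y_pred):
    """Precision, recall, F1 for class 1, plus balanced accuracy.

    Undefined ratios (zero denominators) are reported as 0.0, mirroring
    the zero_division = 0 setting described in the text.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0

    # Balanced accuracy: average recall over classes that occur in y_true.
    per_class_recall = []
    if tp + fn:
        per_class_recall.append(tp / (tp + fn))
    if tn + fp:
        per_class_recall.append(tn / (tn + fp))
    balanced_accuracy = sum(per_class_recall) / len(per_class_recall)

    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "balanced_accuracy": balanced_accuracy,
    }
```

Applied to a detector that always answers with an empty list on the trend series of length 100 (186 normal, 214 anomalous in Table 2), this gives precision, recall, and F1 of 0 and a balanced accuracy of 0.5, which is why balanced accuracy is a useful complement to the anomaly-class metrics on imbalanced data.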

3. Cleaning Models’ Answers

Despite our revised prompt, the models did not always adhere to the instructed answer format in our experiment, necessitating output cleaning before evaluation.

Table 3 presents the percentage of answers that required cleaning, averaged across anomaly types and time series representations for each input length.

The results indicate that using longer time series tends to increase formatting issues, making it harder to maintain response appropriateness. Additionally, Gemini-2.0 models and GPT-4o produced a higher percentage of improperly formatted answers.

However, such a metric must be interpreted alongside the severity of the formatting issues. We classified them into five categories, each requiring a different cleaning process, as explained in the following:

Unexpected characters (UC): Answers that contained extraneous characters (e.g., backticks or JSON tags) that prevented proper parsing but did not affect anomaly detection results. Cleaning involved simply removing these characters.
Low-impact issues (ERL): Answers that included unnecessary formatting elements (e.g., commas, double quotes, or leading zeros) that, while not altering results, required careful additional checks before removal to ensure proper processing.
Moderate-impact issues (ERM): These answers hit the token limit due to listing every anomaly separately instead of using range-based notation. This issue led to missing portions of the answer, potentially affecting model performance. Cleaning involved discarding incomplete ranges while preserving the rest.
High-impact issues (ERH): Answers that included indices exceeding the input length, indicating a failure to follow instructions. If both start and end indices were out of range, the range was removed; if only the end index was invalid, it was replaced with the maximum valid index.
Formatting issues (FMT): Answers that did not follow the desired output format, requiring cleaning that might negatively impact the performance (however, this could also be interpreted as a penalty for not following the instructions). Some answers mixed explanations with properly formatted ranges, necessitating text removal. Others provided anomaly ranges only in textual descriptions, requiring conversion to the correct format. In two cases, the model failed to return a valid response (once outputting Python code and once stating that it lacked information to provide an answer). These were replaced with an empty list, meaning no anomalies were detected.
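Several of the cleaning rules above can be approximated in a few lines. The sketch below is not the cleaning code used in the study: it is an assumed simplification that extracts every complete range object with a regular expression (which discards stray characters and a truncated final range), then applies the out-of-range rule. Clamping an invalid end index to L, rather than L − 1, is an interpretation chosen here because end indices are exclusive, and dropping a range whose start alone is out of range is likewise an assumption.

```python
import re

# Matches one complete {"start": a, "end": b} object; digits may be quoted
# or carry leading zeros, which int() normalizes.
_RANGE = re.compile(
    r'\{\s*"start"\s*:\s*"?(\d+)"?\s*,\s*"end"\s*:\s*"?(\d+)"?\s*\}'
)


def clean_answer(raw, length):
    """Return a list of in-bounds ranges extracted from a raw model answer."""
    cleaned = []
    for start_s, end_s in _RANGE.findall(raw):
        start, end = int(start_s), int(end_s)
        if start >= length:
            # Start lies beyond the series: drop the range (assumption when
            # only the start is invalid; the text covers the both-invalid case).
            continue
        # Only the end is invalid: clamp it (assumed here to be L, since the
        # end index is exclusive).
        end = min(end, length)
        cleaned.append({"start": start, "end": end})
    return cleaned
```

Answers that describe ranges only in prose, or that contain no usable range at all, would still need the manual handling described for the FMT category; this sketch simply returns an empty list for them.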

Figure 2 illustrates the distribution of cleaning operations using heatmaps across models and combinations of anomaly type and time series representation for these categories (columns). The top row represents shorter time series (L = 100), while the bottom row represents longer ones (L = 1000).

In this figure, we omitted ERL issues, as they were rare and only occurred with the long input length. Some key observations can be drawn from this figure: (i) UC issues appeared mostly with Gemini-2.0 (flash and flash-lite) and GPT-4o, with Gemini-2.0-flash-lite producing them across all settings, while Gemini-2.0-flash had issues mainly with Text-only inputs; (ii) ERM issues occurred mainly in the experiment with L = 1000 using Llama-3.2 and Gemini-2.0 models; (iii) ERH issues primarily appeared in Frequency and Point anomaly types when using GPT-4o, both Llama-3.2 models, and Ovis2-8B; and (iv) FMT issues were exclusive to Llama-3.2-90B.

Therefore, although Gemini-2.0 models have a lower proportion of properly formatted answers, the issues they generate are not severe. In contrast, errors from Llama-3.2-90B, and to a lesser extent GPT-4o, were more critical, which might impact their performance.

4. Experimental Results

4.1. Performance Evaluation

The tables in this subsection present the precision, recall, and F1-scores for the anomaly class, along with the balanced accuracy score, for various MLLMs (rows) and input representations (columns), averaged across anomaly types.

4.1.1. Series-Level Detection

Table 4 presents the series-level detection results for the short input length.

Based on both the F1-score and balanced accuracy, Qwen2.5-VL-72B achieves the highest overall performance with Vision-only input but struggles to further improve with multimodal data (Text and Vision), despite also excelling in that setting. In contrast, GPT-4o models can in most cases enhance their Text-only performance when using Text and Vision, highlighting their ability to benefit from both textual and visual information. However, Gemini-2.0 models exhibit a different behavior: they rank among the best in F1-score and balanced accuracy with Text-only and Vision-only inputs, respectively, but perform worse with Text and Vision, suggesting less affinity with both modalities.

Overall, except for LLama-3.2-11B and LLaVa-OV-7B, which mostly remain consistent across all inputs, the performance of other models appears to depend on the choice of input representation.

Table 5 presents results for the long input length. The highest F1-score of the anomaly class is achieved by Gemini-2.0-flash using Text-only input. However, the highest balanced accuracy is attained by Ovis2-34B using Vision-only input. This result is driven by this model’s high precision, indicating strong anomaly detection without an excessive misclassification of normal time series. Focusing on balanced accuracy, we observed that the results of GPT-4o and Gemini-2.0 models further improve when incorporating both modalities. On a separate note, we notice that Qwen2.5-VL-7B behaves similarly to the dummy baseline (i.e., assuming all time series contain no anomalies).

4.1.2. Step-Level Detection

Table 6 and Table 7 present the step-level detection results for the short and long input lengths, respectively. The results indicate that models struggle to accurately pinpoint the positions of anomalies, particularly with Text-only input and a long input length. Nevertheless, similar observations can be extracted from these results: overall, (i) most models do not benefit from combining both modalities and (ii) Vision-only achieves the best performance for both short and long input lengths. In addition, Gemini-2.0-flash-lite with Vision-only emerges as the top-performing model across both input lengths.

4.2. Overall Insights Derived from Our Study

Large vs. Lite Models: While larger models generally outperform their lite versions, the differences are sometimes subtle and exceptions exist. Notably, Gemini-2.0-flash-lite often surpasses Gemini-2.0-flash in step-level detection with Vision-based input, achieving the top performance.

Effect of Time Series Representation: Most MLLMs favor specific input representations, likely due to their training data and design. While Text and Vision is expected to provide the best performance as it integrates both modalities, the associated results often fall between or even behind those of Text-only and Vision-only approaches.

Impact of Input Length: Detection performance is significantly influenced by input length. Models are generally more effective at identifying anomalies in shorter time series.

Different Level of Detection: In both series- and step-level detections, Vision-only is the most effective representation for most MLLMs. However, GPT-4o and Gemini-2.0 exhibit distinct behaviors: they generally achieve the best performance in series-level detection with Text-only or Text and Vision input, whereas Vision-only or Text and Vision input is more effective for step-level detection, highlighting their varying strengths across detection levels.

5. Limitations and Opportunities

Our results demonstrate that MLLMs can be effective for TSAD, with some models outperforming the baselines for specific representations, input lengths, and detection levels. However, key limitations persist, requiring further research.

Limited Benefit from Multimodality: Most models fail to leverage both text and vision inputs effectively, likely due to training limitations or architectural constraints. Future research should explore improved fusion techniques, such as fine-tuning or dedicated alignment modules, to better leverage the presence of multiple time series representations.

Challenges with Long Input Length: Longer inputs not only decrease response appropriateness but also increase detection difficulty, as shown in our cleaning process and experimental results. Optimizing model performance for long inputs—e.g., a patching mechanism segmenting them using sliding windows—could be a valuable future direction.

Prompt Dependency: We used a uniform prompt to ensure transparency and consistency with the literature, but model-specific prompts might enhance performance. Moreover, for models being optimized for textual responses, requesting structured outputs like a list of ranges may not be optimal. Adapting prompts to each model could yield better results.

Response Quality and Performance Issues: Some MLLMs fail to generate the expected outputs, requiring cleaning processes that may affect their performance. While we used the cleaning rate to assess response appropriateness, a more comprehensive evaluation should consider factors such as hallucination rates and answer diversity and consistency.
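The sliding-window patching idea mentioned above for long inputs could look roughly like the following. This is one possible reading of that future direction, not something evaluated in the study; the function names, window size, and stride are hypothetical, and the per-window detections are assumed to come from some detector that returns ranges relative to the start of each window.

```python
def sliding_windows(series, window, stride):
    """Split a series into (offset, chunk) pairs covering every time step."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    n = len(series)
    if n <= window:
        return [(0, list(series))]
    starts = list(range(0, n - window + 1, stride))
    if starts[-1] + window < n:
        # Add a final window aligned with the end so no tail is left out.
        starts.append(n - window)
    return [(s, list(series[s:s + window])) for s in starts]


def merge_window_ranges(per_window):
    """Map window-relative ranges back to global indices and merge overlaps.

    per_window is a list of (offset, ranges) pairs, where each range is a
    dict with 'start' and an exclusive 'end', relative to its window.
    """
    flagged = set()
    for offset, ranges in per_window:
        for r in ranges:
            flagged.update(range(offset + r["start"], offset + r["end"]))
    merged = []
    for t in sorted(flagged):
        if merged and merged[-1]["end"] == t:
            merged[-1]["end"] = t + 1  # extend the current run
        else:
            merged.append({"start": t, "end": t + 1})
    return merged
```

With this scheme, a series of length 1000 could be queried as several windows of length 100, keeping each prompt within the regime where the models performed better, at the cost of more requests and of losing context that spans windows.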

Author Contributions

Conceptualization, H.N.; methodology, H.N., G.H., H.Q.U., R.L., Z.L., Y.W., D.Z., J.V. and M.T.; software, H.N. and G.H.; validation, H.N. and G.H.; formal analysis, H.N. and G.H.; investigation, H.N. and G.H.; resources, H.N., G.H., H.Q.U., R.L., Z.L., Y.W., D.Z., J.V. and M.T.; data curation, H.N. and G.H.; writing—original draft preparation, H.N. and G.H.; writing—review and editing, H.N., G.H., H.Q.U., R.L., Z.L., Y.W., D.Z., J.V. and M.T.; visualization, H.N. and G.H.; supervision, M.T.; project administration, H.N. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

In this work, we generated synthetic data based on Zhou et al.’s work [8], the code of which is available at https://github.com/Rose-STL-Lab/AnomLLM/ (accessed on 9 January 2025).

Conflicts of Interest

All authors were employed by the company KDDI CORPORATION or KDDI Research, Inc.

References

1. Gujral, E. Survey: Anomaly Detection Methods. figshare 2023.
2. Liu, F.T.; Ting, K.M.; Zhou, Z.H. Isolation Forest. In Proceedings of the IEEE International Conference on Data Mining, Pisa, Italy, 15–19 December 2008.
3. Zamanzadeh Darban, Z.; Webb, G.I.; Pan, S.; Aggarwal, C.C.; Salehi, M. Deep Learning for Time Series Anomaly Detection: A Survey. ACM Comput. Surv. 2024, 57, 1–42.
4. Zhou, T.; Niu, P.; Wang, X.; Sun, L.; Jin, R. One Fits All: Power General Time Series Analysis by Pretrained Lm. In Proceedings of the 37th International Conference on Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023.
5. Dong, M.; Huang, H.; Cao, L. Can LLMs Serve As Time Series Anomaly Detectors? arXiv 2024, arXiv:2408.03475.
6. Liu, C.; He, S.; Zhou, Q.; Li, S.; Meng, W. Large Language Model Guided Knowledge Distillation for Time Series Anomaly Detection. arXiv 2024, arXiv:2401.15123.
7. Alnegheimish, S.; Nguyen, L.; Berti-Equille, L.; Veeramachaneni, K. Large Language Models Can Be Zero-shot Anomaly Detectors for Time Series? arXiv 2024, arXiv:2405.14755.
8. Zhou, Z.; Yu, R. Can LLMs Understand Time Series Anomalies? In Proceedings of the 13th International Conference on Learning Representations (ICLR 2025), Singapore, 24–28 April 2025.
9. Xu, X.; Wang, H.; Liang, Y.; Yu, P.S.; Zhao, Y.; Shu, K. Can Multimodal LLMs Perform Time Series Anomaly Detection? arXiv 2025, arXiv:2502.17812.
10. Zhuang, J.; Yan, L.; Zhang, Z.; Wang, R.; Zhang, J.; Gu, Y. See it, Think it, Sorted: Large Multimodal Models are Few-shot Time Series Anomaly Analyzers. arXiv 2024, arXiv:2411.02465.
11. Mallick, S.B.; Kilpatrick, L. Gemini 2.0: Flash, Flash-Lite and Pro. Available online: https://developers.googleblog.com/en/gemini-2-family-expands/ (accessed on 31 March 2025).
12. OpenAI. GPT-4o System Card. arXiv 2024, arXiv:2410.21276.
13. Meta. Llama 3.2: Revolutionizing Edge AI and Vision with Open, Customizable Models. Available online: https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/ (accessed on 31 March 2025).
14. Lu, S.; Li, Y.; Chen, Q.G.; Xu, Z.; Luo, W.; Zhang, K.; Ye, H.J. Ovis: Structural Embedding Alignment for Multimodal Large Language Model. arXiv 2024, arXiv:2405.20797.
15. Bai, S.; Chen, K.; Liu, X.; Wang, J.; Ge, W.; Song, S.; Dang, K.; Wang, P.; Wang, S.; Tang, J.; et al. Qwen2.5-VL Technical Report. arXiv 2025, arXiv:2502.13923.
16. Li, B.; Zhang, Y.; Guo, D.; Zhang, R.; Li, F.; Zhang, H.; Zhang, K.; Zhang, P.; Li, Y.; Liu, Z.; et al. LLaVA-OneVision: Easy Visual Task Transfer. arXiv 2024, arXiv:2408.03326.

Figure 1. Illustrations of the different anomaly types: (a) trend, (b) frequency, (c) point, and (d) range.

Figure 2. Heatmaps of the total number of cleaning operations that were necessary per model, per anomaly type, and per input type for each category: (a) UC, (b) ERM, (c) ERH, and (d) FMT.

Table 1. Comparison with related studies on zero-shot time series anomaly detection (TSAD) using multi-modal LLMs (MLLMs).

	Study [8] (ICLR25)	Study [9] (arXiv25)	Ours
MLLM Models	Gemini-1.5 (flash), GPT-4o (mini), InternVL2-Llama3 (76B), Qwen-VL (chat)	Gemini-1.5 (pro; flash), GPT-4o (main; mini), Qwen2-VL (72B; 7B), LLaVa-NeXt (72B; 8B)	Gemini-2.0 (flash; flash-lite), GPT-4o (main; mini), LLama-3.2 (90B; 11B), Ovis2 (34B; 8B), Qwen2.5-VisionLanguage (72B; 7B), LLaVa-OneVision (72B; 7B)
Primary input types	Text, Vision	Vision	Text, Vision, Text and Vision
Primary input lengths	1000	400	1000, 100
Primary metrics	General performance metrics *		Response appropriateness; general performance metrics; balanced accuracy score

* General performance metrics: precision, recall and F1 ([8] also used their affinity variants).

Table 2. Number of time series with and without anomalies over the 400 generated for each input length and each anomaly type (T = trend, F = frequency, P = point, and R = range).

Input Length	100				1000
	T	F	P	R	T	F	P	R
Normal	186	74	76	101	230	40	117	99
Anomalous	214	326	324	299	170	360	283	301
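The class proportions quoted in Section 2.6 can be recomputed directly from Table 2. The short sketch below does this arithmetic; the dictionary layout and function name are illustrative, while the counts are copied from the table (in the order T, F, P, R).

```python
# Counts from Table 2, ordered as trend, frequency, point, range.
TABLE_2 = {
    100: {"normal": [186, 74, 76, 101], "anomalous": [214, 326, 324, 299]},
    1000: {"normal": [230, 40, 117, 99], "anomalous": [170, 360, 283, 301]},
}


def anomalous_fraction(length):
    """Share of series that contain anomalies for a given input length."""
    anomalous = sum(TABLE_2[length]["anomalous"])
    normal = sum(TABLE_2[length]["normal"])
    return anomalous / (anomalous + normal)
```

This gives 1163/1600 ≈ 0.727 for L = 100 and 1114/1600 ≈ 0.696 for L = 1000, slightly above the two-thirds approximation used in the text.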

Table 3. Percentage (%) of answers cleaned, averaged over anomaly types and input types, for experiments with input lengths 100 and 1000.

	Gemini-2.0		GPT-4o		LLama-3.2		Ovis2		Qwen2.5-VL		LLaVa-OV
	Flash-Lite	Flash	Mini	Main	11B	90B	8B	34B	7B	72B	7B	72B
L = 100	99.7	35.9	0.3	41.0	0.4	1.6	0.2	0.0	0.0	0.0	6.7	0.0
L = 1000	99.5	44.9	1.0	52.1	0.8	4.5	2.1	0.0	0.0	3.9	0.2	0.3

Table 4. Performance comparison on 400 time series (L = 100). Red [resp. gray] cells indicate performance lower than [resp. equal to] the dummy baseline, and yellow indicates lower than isolation forest. Uncolored cells indicate models that outperform both baselines, for which the best and second-best performances are highlighted.

Input Types	Text				Text and Vision				Vision
Selected metrics	Prec.	Rec.	F1	B.Acc.	Prec.	Rec.	F1	B.Acc.	Prec.	Rec.	F1	B.Acc.
Dummy	0.0000	0.0000	0.0000	0.5000	0.0000	0.0000	0.0000	0.5000	0.0000	0.0000	0.0000	0.5000
Isolation Forest	0.7269	1.0000	0.8364	0.5000	0.7269	1.0000	0.8364	0.5000	0.7269	1.0000	0.8364	0.5000
Gemini-2.0-flash-lite	0.7276	1.0000	0.8370	0.5013	0.9374	0.7778	0.7732	0.7651	0.9873	0.7606	0.7698	0.8605
Gemini-2.0-flash	0.7873	0.9528	0.8511	0.6622	0.9488	0.7440	0.7284	0.7767	0.9911	0.7657	0.7843	0.8692
GPT-4o-mini	0.7557	0.9824	0.8457	0.5852	0.8145	0.9439	0.8556	0.7101	0.9524	0.0200	0.0380	0.5050
GPT-4o	0.8284	0.6296	0.6612	0.7007	0.7213	0.6161	0.6487	0.7610	0.7500	0.2632	0.3374	0.6316
LLama-3.2-11B	0.7269	1.0000	0.8364	0.5000	0.7269	1.0000	0.8364	0.5000	0.7269	1.0000	0.8364	0.5000
LLama-3.2-90B	0.7269	1.0000	0.8364	0.5000	0.7292	0.9847	0.8327	0.5060	0.7274	1.0000	0.8367	0.5012
Ovis2-8B	0.7269	1.0000	0.8364	0.5000	0.9565	0.7459	0.8052	0.7950	0.7411	0.6442	0.6786	0.8085
Ovis2-34B	0.7819	0.7281	0.7244	0.6141	0.8992	0.7433	0.7337	0.6464	0.8943	0.8230	0.8157	0.6467
Qwen2.5-VL-7B	0.0000	0.0000	0.0000	0.5000	0.2500	0.0034	0.0066	0.5017	0.5000	0.0115	0.0223	0.5057
Qwen2.5-VL-72B	0.9399	0.5178	0.5439	0.6426	0.9374	0.9479	0.9367	0.8502	0.9588	0.9633	0.9586	0.9086
LLaVa-OV-7B	0.7269	1.0000	0.8364	0.5000	0.7269	1.0000	0.8364	0.5000	0.7269	1.0000	0.8364	0.5000
LLaVa-OV-72B	0.7269	1.0000	0.8364	0.5000	0.7269	1.0000	0.8364	0.5000	0.7926	0.9900	0.8700	0.6671

Table 5. Performance comparison on 400 time series (L = 1000). Red [resp. gray] cells indicate performance lower than [resp. equal to] the dummy baseline, and yellow indicates lower than isolation forest. Uncolored cells indicate models that outperform both baselines, for which the best and second-best performances are highlighted.

Input Types	Text				Text and Vision				Vision
Selected metrics	Prec.	Rec.	F1	B.Acc.	Prec.	Rec.	F1	B.Acc.	Prec.	Rec.	F1	B.Acc.
Dummy	0.0000	0.0000	0.0000	0.5000	0.0000	0.0000	0.0000	0.5000	0.0000	0.0000	0.0000	0.5000
Isolation Forest	0.6962	1.0000	0.8078	0.5000	0.6962	1.0000	0.8078	0.5000	0.6962	1.0000	0.8078	0.5000
Gemini-2.0-flash-lite	0.6967	1.0000	0.8081	0.5011	0.9346	0.7448	0.7357	0.7431	0.8957	0.8209	0.8055	0.6516
Gemini-2.0-flash	0.7944	0.9535	0.8382	0.7267	0.7374	0.7125	0.7232	0.8360	0.9094	0.7493	0.7166	0.6753
GPT-4o-mini	0.7247	0.9059	0.7811	0.5505	0.7969	0.7545	0.6743	0.6416	0.5000	0.2315	0.3097	0.6158
GPT-4o	0.8189	0.6457	0.6869	0.7174	0.9210	0.5765	0.6139	0.7480	0.7500	0.4738	0.5097	0.7369
LLama-3.2-11B	0.6962	1.0000	0.8078	0.5000	0.6946	0.9910	0.8034	0.4955	0.6953	0.9950	0.8054	0.4975
LLama-3.2-90B	0.6959	0.9977	0.8067	0.4989	0.6978	0.9976	0.8079	0.5051	0.6980	0.9735	0.8004	0.5042
Ovis2-8B	0.6962	1.0000	0.8078	0.5000	0.7500	0.5686	0.6083	0.7843	0.7500	0.5674	0.6087	0.7837
Ovis2-34B	0.4381	0.4788	0.4536	0.6144	0.9638	0.6515	0.7121	0.7613	1.0000	0.6967	0.7821	0.8484
Qwen2.5-VL-7B	0.0000	0.0000	0.0000	0.5000	0.0000	0.0000	0.0000	0.5000	0.0000	0.0000	0.0000	0.5000
Qwen2.5-VL-72B	0.5000	0.1825	0.2485	0.5912	0.9992	0.5350	0.5668	0.7662	0.9741	0.6888	0.7437	0.7805
LLaVa-OV-7B	0.6962	1.0000	0.8078	0.5000	0.6962	1.0000	0.8078	0.5000	0.6962	1.0000	0.8078	0.5000
LLaVa-OV-72B	0.6962	1.0000	0.8078	0.5000	0.7944	0.7845	0.6789	0.6422	0.7209	0.8225	0.6847	0.5363
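
To make the baseline rows easier to read, the following minimal sketch computes the four reported metrics from binary labels using their standard definitions. It is not the evaluation code used in this study: the names `Scores`, `binary_scores`, and `_ratio`, the toy labels, and the convention of scoring an undefined ratio as 0 are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Scores:
    precision: float
    recall: float
    f1: float
    balanced_accuracy: float


def _ratio(num: float, den: float) -> float:
    # Assumption: an undefined ratio (zero denominator) is scored as 0.
    return num / den if den else 0.0


def binary_scores(y_true: Sequence[int], y_pred: Sequence[int]) -> Scores:
    """Score binary labels where 1 = anomalous and 0 = normal."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = _ratio(tp, tp + fp)
    recall = _ratio(tp, tp + fn)
    specificity = _ratio(tn, tn + fp)
    f1 = _ratio(2 * precision * recall, precision + recall)
    # Balanced accuracy is the mean of recall and specificity.
    return Scores(precision, recall, f1, (recall + specificity) / 2)


# Hypothetical toy labels: 7 anomalous series and 3 normal ones.
labels = [1] * 7 + [0] * 3
always_normal = binary_scores(labels, [0] * len(labels))
always_anomalous = binary_scores(labels, [1] * len(labels))
print(always_normal)
print(always_anomalous)
```

Under these assumptions, a predictor that never flags an anomaly obtains zero precision, recall, and F1 with a balanced accuracy of 0.5, which is the pattern of the Dummy row in Table 5. A predictor that flags every series obtains a recall of 1 and a precision equal to the share of anomalous series, yet its balanced accuracy is also 0.5, so high recall and F1 alone do not indicate useful detection. How the reported scores are aggregated across series is not specified in this section and may differ from this single-pass computation.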

Table 6. Performance comparison on 400 × 100 time steps. Red [resp. gray] cells indicate performance lower than [resp. equal to] the dummy baseline, and yellow indicates lower than isolation forest. Uncolored cells indicate models that outperform both baselines, for which the best and second-best performances are highlighted.

Input Types	Text				Text and Vision				Vision
Selected metrics	Prec.	Rec.	F1	B.Acc.	Prec.	Rec.	F1	B.Acc.	Prec.	Rec.	F1	B.Acc.
Dummy	0.0000	0.0000	0.0000	0.6365	0.0000	0.0000	0.0000	0.6365	0.0000	0.0000	0.0000	0.6365
Isolation Forest	0.1388	0.5890	0.1984	0.5612	0.1388	0.5890	0.1984	0.5612	0.1388	0.5890	0.1984	0.5612
Gemini-2.0-flash-lite	0.2808	0.2673	0.2477	0.6991	0.4120	0.4446	0.4057	0.8426	0.4225	0.4740	0.4288	0.8620
Gemini-2.0-flash	0.4153	0.3520	0.3552	0.7493	0.4101	0.4347	0.4049	0.8395	0.3747	0.4712	0.4014	0.8553
GPT-4o-mini	0.1276	0.1921	0.1335	0.6294	0.1307	0.1746	0.1315	0.6437	0.0024	0.0034	0.0024	0.6370
GPT-4o	0.3471	0.2831	0.2933	0.7669	0.3547	0.3201	0.3212	0.7873	0.0820	0.1077	0.0858	0.6826
LLama-3.2-11B	0.0767	0.4178	0.1168	0.4950	0.0903	0.3906	0.1342	0.4994	0.0985	0.3279	0.1391	0.5309
LLama-3.2-90B	0.1368	0.2884	0.1683	0.6073	0.1976	0.3455	0.2236	0.6122	0.1897	0.3800	0.2314	0.6433
Ovis2-8B	0.0667	0.1628	0.0834	0.4894	0.1558	0.1694	0.1503	0.6879	0.1697	0.1580	0.1502	0.6980
Ovis2-34B	0.1504	0.1576	0.1423	0.6490	0.3073	0.2405	0.2563	0.7415	0.2957	0.2610	0.2588	0.7405
Qwen2.5-VL-7B	0.0000	0.0000	0.0000	0.6365	0.0015	0.0011	0.0012	0.6371	0.0058	0.0051	0.0048	0.6390
Qwen2.5-VL-72B	0.1406	0.1808	0.1431	0.6929	0.2921	0.3800	0.3171	0.7811	0.3084	0.4267	0.3341	0.7990
LLaVa-OV-7B	0.0378	0.0765	0.0441	0.5825	0.1398	0.3628	0.1855	0.6008	0.1497	0.3838	0.1971	0.6145
LLaVa-OV-72B	0.1135	0.1706	0.1210	0.6076	0.1560	0.3688	0.2006	0.6658	0.1803	0.4116	0.2351	0.6801

Table 7. Performance comparison on 400 × 1000 time steps. Red [resp. gray] cells indicate performance lower than [resp. equal to] the dummy baseline, and yellow indicates lower than isolation forest. Uncolored cells indicate models that outperform both baselines, for which the best and second-best performances are highlighted.

Input Types	Text				Text and Vision				Vision
Selected metrics	Prec.	Rec.	F1	B.Acc.	Prec.	Rec.	F1	B.Acc.	Prec.	Rec.	F1	B.Acc.
Dummy	0.0000	0.0000	0.0000	0.6519	0.0000	0.0000	0.0000	0.6519	0.0000	0.0000	0.0000	0.6519
Isolation Forest	0.0797	0.6320	0.1234	0.5927	0.0797	0.6320	0.1234	0.5927	0.0797	0.6320	0.1234	0.5927
Gemini-2.0-flash-lite	0.0421	0.1108	0.0492	0.6077	0.1817	0.3344	0.2166	0.8005	0.2240	0.3782	0.2626	0.8260
Gemini-2.0-flash	0.0775	0.1114	0.0688	0.6732	0.1828	0.2757	0.1997	0.7762	0.1908	0.3461	0.2335	0.8102
GPT-4o-mini	0.0318	0.0754	0.0344	0.6087	0.0426	0.0665	0.0417	0.6517	0.0255	0.0261	0.0209	0.6590
GPT-4o	0.0456	0.0535	0.0424	0.6612	0.0724	0.1055	0.0786	0.6901	0.0828	0.1506	0.0997	0.7167
LLama-3.2-11B	0.0278	0.2637	0.0440	0.5495	0.0384	0.3464	0.0627	0.4999	0.0428	0.3320	0.0701	0.5432
LLama-3.2-90B	0.0325	0.1289	0.0426	0.5790	0.1082	0.3325	0.1229	0.5808	0.1021	0.3396	0.1318	0.6196
Ovis2-8B	0.0300	0.3257	0.0520	0.4352	0.0704	0.1072	0.0746	0.6923	0.0760	0.0882	0.0735	0.6868
Ovis2-34B	0.0344	0.0327	0.0295	0.6610	0.1401	0.1095	0.1160	0.7002	0.1459	0.1591	0.1428	0.7238
Qwen2.5-VL-7B	0.0000	0.0000	0.0000	0.6519	0.0000	0.0000	0.0000	0.6519	0.0000	0.0000	0.0000	0.6519
Qwen2.5-VL-72B	0.0080	0.0132	0.0081	0.6533	0.1250	0.1710	0.1347	0.7282	0.1657	0.1939	0.1658	0.7390
LLaVa-OV-7B	0.0169	0.0816	0.0214	0.6183	0.0466	0.2968	0.0726	0.5592	0.0476	0.2974	0.0744	0.5659
LLaVa-OV-72B	0.0244	0.0270	0.0159	0.6371	0.0869	0.2336	0.1112	0.6729	0.1018	0.3271	0.1413	0.6756

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
