4.2. Analysis of Prompt Injection Vulnerabilities
This section presents the results of the LLM analysis from a cybersecurity perspective. The results are structured into three main dimensions: (1) a detailed matrix with the specific vulnerability results per LLM and technique, (2) the global resistance of each LLM, measured as the percentage of consistently rejected techniques, and (3) the effectiveness of each prompt injection technique analysed, measured as the percentage of vulnerable systems.
Table 4 presents the complete vulnerability matrix so that specific patterns of vulnerability per LLM and technique can be identified. The results are obtained based on the criteria established in Section 3.2.3. For each prompt injection technique, two payloads were considered, and the score reflects how many of them succeeded. An LLM that resisted both payloads receives a score of 0/2. An LLM that was vulnerable to one of the two payloads is considered vulnerable to the technique and receives a score of 1/2, or 2/2 if it was vulnerable to both. A partial vulnerability to a payload contributes 0.5, so an LLM that is partially vulnerable to both payloads of the same technique also receives a score of 1/2. The main results of resistance and effectiveness are calculated from these scores.
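The scoring rule described above can be sketched in a few lines of Python. This is a minimal illustration rather than the procedure actually used in the study: the `Outcome` labels and both function names are hypothetical, and `global_resistance` follows a literal reading of "percentage of consistently rejected techniques" (techniques scored 0/2). How partial scores enter the published percentages is not specified here, so that part in particular is an assumption.

```python
from enum import Enum


class Outcome(Enum):
    """Result of one payload against one LLM (illustrative labels)."""
    RESISTANT = 0.0
    PARTIAL = 0.5
    VULNERABLE = 1.0


def technique_score(first: Outcome, second: Outcome) -> float:
    """Score out of 2 for one technique: the sum over its two payloads."""
    return first.value + second.value


def global_resistance(scores: list[float]) -> float:
    """Percentage of techniques with score 0/2 (assumed literal reading)."""
    rejected = sum(1 for score in scores if score == 0)
    return 100.0 * rejected / len(scores)


# Hypothetical outcomes for four techniques, not data from Table 4.
scores = [
    technique_score(Outcome.RESISTANT, Outcome.RESISTANT),    # 0/2
    technique_score(Outcome.PARTIAL, Outcome.PARTIAL),        # 1/2
    technique_score(Outcome.VULNERABLE, Outcome.RESISTANT),   # 1/2
    technique_score(Outcome.VULNERABLE, Outcome.VULNERABLE),  # 2/2
]
print(scores, global_resistance(scores))
```

Under this reading, two partial vulnerabilities and one full vulnerability yield the same technique score, which matches the 1/2 case described in the text.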
As observed in Table 4, LLM3 is the only model that is resistant to all prompt injections. The LLM with the most vulnerabilities is LLM8, with a total of thirteen vulnerabilities and one partial vulnerability. The second most vulnerable model is LLM5, with a total of thirteen vulnerabilities. For Base64 Encoding, a partial vulnerability is recorded when the LLM refuses to execute the malicious instruction but still decodes the payload, since the decoding exposes its internal processing and demonstrates that the system processes potentially dangerous content before its security evaluation, thus expanding the attack surface. For Task Redefinition, a partial vulnerability is recorded when the system partially executes its new role before rejecting it.
The global resistance per LLM can also be consulted in Table 4. As described above, LLM3 presents the best results, with 100% global resistance, which means that it rejects all prompt injection techniques analysed in this study. LLM3 can therefore be used as a baseline in adversarial security, as its defence is robust against the diversity of attack vectors implemented.
LLM6, LLM2, and LLM1 show high resistance, with global resistance values of 86.67%, 80.00%, and 80.00%, respectively. The vulnerabilities in these three models are limited but detectable. These systems suggest that product maturity and resources dedicated to security matter, although none reach the resistance of LLM3. LLM9, LLM4, and LLM7 show medium resistance, which indicates moderate vulnerabilities, with global resistance percentages of 73.33%, 71.67%, and 60.00%, respectively. Finally, LLM5 and LLM8 are the two models with the lowest global resistance, and thus the two most susceptible to attack. Both present global resistance values of 46.67%. This means that more than half of the attacks were successful, suggesting that both models need to strengthen their security defences.
To finalise the analysis from the cybersecurity perspective, the effectiveness of each attacking technique is measured. Thus, the vectors that represent potential threats and those that are effective against specific LLMs can be identified.
Table 4 shows this effectiveness metric by technique, which was measured as the percentage of LLMs that were vulnerable to a specific technique in both validation phases.
Conversation Poisoning seems to be the most effective technique, as it compromised eight of the nine analysed models. Thus, this attacking technique achieved an effectiveness score of 88.89%. This technique can be particularly dangerous, as it exploits a fundamental aspect of conversational LLMs, which is the prioritisation of recent context over assistant directives. When a user establishes a legitimate conversational framework over several messages (in this case, as an “academic cybersecurity researcher”), systems interpret this accumulated context as sufficient justification to lower their defences against sensitive content.
Role-playing techniques (DAN variants) show variable effectiveness: DAN 13.0 and STAN Prompt compromise 55.56% of LLMs, while older versions (DAN 10.0, KEVIN Jailbreak) prove less effective (22.22%). This suggests that providers have updated their defences against older jailbreaking patterns that are widely available in public repositories. On the other hand, classic obfuscation techniques (Character Substitution, Base64 Encoding, and Reverse Psychology) were blocked by the vast majority of LLMs, indicating that providers implement robust input preprocessing. Language Switching achieves 44.44% effectiveness through multilingual reformulation, indicating that security filters may be more optimised for specific languages.
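The effectiveness percentages quoted above follow directly from the number of compromised models out of the nine evaluated. The short sketch below reproduces them. The function name is hypothetical, and only the count of eight for Conversation Poisoning is stated explicitly in the text; the other counts are inferred from the reported percentages.

```python
def effectiveness(vulnerable_llms: int, total_llms: int = 9) -> float:
    """Percentage of evaluated LLMs compromised by a technique (2 decimals)."""
    if not 0 <= vulnerable_llms <= total_llms:
        raise ValueError("vulnerable_llms must be between 0 and total_llms")
    return round(100.0 * vulnerable_llms / total_llms, 2)


# Compromised-model counts: 8 is stated in the text, the rest are
# inferred from the reported percentages (assumption).
compromised = {
    "Conversation Poisoning": 8,
    "DAN 13.0": 5,
    "STAN Prompt": 5,
    "Language Switching": 4,
    "DAN 10.0": 2,
    "KEVIN Jailbreak": 2,
}

for technique, count in compromised.items():
    print(f"{technique}: {effectiveness(count):.2f}%")
```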
4.3. Performance Analysis in the Retrieval of Maritime Accident Information
In this section, the precision analysis results are presented. These are structured around four dimensions: (1) overall performance by LLM, (2) analysis by cognitive level according to Bloom’s Taxonomy, (3) analysis of specific strengths and weaknesses patterns, and (4) comparative analysis between evaluation metrics.
To do so, the questionnaire must be defined beforehand.
Table 5 presents the 28 questions defined for this study. The questions were derived from the information contained in the first accident report and subsequently applied consistently to the remaining reports in the dataset. Additionally, questions were introduced to assess the capabilities of the distinct LLMs being analysed.
Table 6 presents the mean score of each system across the full set of 28 questions, calculated as the average of the three evaluation methodologies (LLM as a Judge, DeepEval, BERTScore F1). As can be observed, LLM6 is the best-performing LLM, reaching a mean score of 0.643. Together with LLM7, which achieves a mean score of 0.638, it forms the high-ranking group, indicating more robust and consistent performance in maritime incident analysis within the evaluated set.
A second, larger group is concentrated in the medium ranking, comprising LLM3, LLM4, LLM5, LLM2, and LLM1. The proximity of these results suggests relatively homogeneous behaviour, with very small differences in terms of overall performance. Within this intermediate block, LLM3, LLM4, and LLM5 stand out in particular, as their scores are practically equivalent, indicating that, although none reach the top quartile, they maintain a competitive level in this domain.
Finally, LLM8 and LLM9 are placed in the low-ranking group. In the case of LLM8, the gap with respect to the medium group is small, whereas LLM9 exhibits a clearly inferior performance compared to the rest of the assistants, becoming decoupled from the general distribution. This result is consistent with its smaller size (8B parameters) and its local execution environment.
To aid interpretation of these results, Table 7 provides examples of LLM responses alongside the ground truth.
For instance, a low score (between 0 and 0.25) corresponds to a response that is dissimilar to the ground truth or that states the information is not specified, even though experts found a clear answer to the question in the report. A high score (between 0.76 and 1) is awarded to a correct reply that, apart from small nuances, coincides with the ground truth declared by experts according to the reports. Intermediate answers fall into two further categories, one from 0.26 to 0.5 and the other from 0.51 to 0.75, which gradually approach the ground truth.
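The four score ranges can be expressed as a simple lookup, sketched below. The band labels for the two intermediate ranges are hypothetical, since the text only names the low and high ends, and assigning values that fall between the stated ranges (for example 0.255) to the lower band is also an assumption.

```python
def score_band(score: float) -> str:
    """Map a score in [0, 1] to one of the four ranges described in the text."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    if score <= 0.25:
        return "low"
    if score <= 0.50:
        return "medium-low"   # hypothetical label for 0.26 to 0.5
    if score <= 0.75:
        return "medium-high"  # hypothetical label for 0.51 to 0.75
    return "high"


for value in (0.10, 0.40, 0.60, 0.90):
    print(value, score_band(value))
```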
To gain deeper insight into system behaviour beyond overall performance, an analysis by cognitive level was conducted following the Revised Bloom’s Taxonomy. The 28 benchmark questions are distributed across three categories: Remembering (12 questions), Analysing (10 questions), and Applying (6 questions), enabling the assessment of differentiated capabilities in literal retrieval, contextual reasoning, and information application. The main results are presented in Table 8.
Remembering-type questions yield the highest average scores across all three metrics used (LLM as a Judge: 0.727; DeepEval: 0.611; BERTScore F1: 0.619), indicating that the assistants handle direct fact extraction from explicit content in the documents more easily. Within this group, the question “When was the vessel built?” (0.810) proves to be the simplest, while “In which sea did the accident occur?” (0.522) is the most challenging.
In contrast, Analysing-type questions show a notable decline in performance (LLM as a Judge: 0.588; DeepEval: 0.459; BERTScore F1: 0.469). This result reflects the greater cognitive complexity of this level, which requires integrating multiple pieces of information. The most accessible question in this category is “What did the crew member consume before heading to their post?” (0.644), while “Was there any communication from the vessel’s crew to nearby ships or maritime authorities before the incident?” (0.420) proves more difficult.
Finally, Applying-type questions show intermediate behaviour. Although the LLM evaluator’s judgement maintains a relatively high score (0.649), the automated metrics penalise this type of task more severely (DeepEval: 0.468; BERTScore F1: 0.469). This pattern suggests that, while systems tend to correctly describe the expected procedure or reasoning, they more frequently fall short in the precise execution of calculations, unit conversions, or numerical derivations, particularly when information must be combined with exactness.
The analysis by cognitive level also reveals that the LLMs exhibit differentiated capability profiles, confirming that overall performance can obscure relevant information. Some assistants show a clear strength in certain levels of Bloom’s Taxonomy, while others display a more balanced behaviour across levels.
For example, LLM6 achieves the strongest overall performance, with a particularly solid profile in Applying (0.606) and Analysing (0.568) questions, while also ranking among the highest in Remembering (0.723). This profile suggests a well-developed capacity both for retrieving explicit information and for applying and connecting data from the document.
LLM7 also shows a consistent and balanced profile, with strong results across all three categories (Remembering 0.724, Applying 0.596, and Analysing 0.561). This pattern reflects good adaptability to diverse task types, without a pronounced weakness at any of the cognitive levels examined.
LLM3 similarly presents a stable and well-rounded profile, with high scores in Remembering (0.705), Applying (0.550), and the highest score in the group for Analysing (0.571). Its consistency across cognitive levels reinforces the impression of a robust and reliable system for document analysis tasks.
A second group includes LLM4, LLM5, and LLM2, whose overall results are closely aligned. LLM4 shows particular strength in Remembering (0.706), while its scores in Applying (0.548) and Analysing (0.531) are somewhat lower. LLM5 presents a similar profile, with solid performance in Remembering (0.703) and less consistency in application and analysis tasks. LLM2 behaves similarly, with scores close to the group average across all levels.
LLM1 presents a relatively balanced profile, though with somewhat lower scores than the previous group, particularly in Analysing (0.464). LLM8 falls below the main cluster in all three categories, with a more pronounced gap in Analysing (0.477), suggesting greater limitations in interpretive or reasoning-based tasks. Finally, LLM9 shows the lowest performance across all three cognitive levels, with notably reduced results in Analysing (0.349) and Applying (0.338).
The three metrics employed display distinct behaviours and allow the quality of responses to be analysed from different perspectives.
Table 9 summarises the obtained scores, divided by cognitive category and evaluation metric.
In the case of LLM as a Judge, the average score obtained is the highest of the three (0.661). This metric relies on a specific prompt that incorporates contextual information from the maritime domain and allows for a degree of tolerance toward synonyms and formatting differences, unless they contradict the ground truth. As a result, it is particularly useful for assessing whether the essential content of a response is correct, even when the wording does not match the expected reference literally.
BERTScore F1 yields an intermediate average score (0.533). It operates exclusively on the basis of semantic similarity between embeddings, without access to the original question or any domain-specific maritime information. For this reason, it may penalise responses that are conceptually correct but phrased differently from the ground truth, particularly for open-ended questions or those with greater expressive variability.
As for DeepEval, this metric obtains the lowest average score of the three (0.526). Although it also uses an evaluator model, its comparison between the generated response and the reference answer is carried out using more fixed criteria, which leads it to more frequently penalise broad reformulations, formatting deviations, or partially correct responses. This explains why its scores are, in general, lower than those of LLM as a Judge.
Therefore, each metric has its own strengths and limitations. LLM as a Judge is better suited to evaluating content considering context and domain knowledge; DeepEval introduces a stricter standard for comparing responses; and BERTScore F1 provides an objective measure grounded in semantic similarity. While the combination of all three yields a more robust and balanced evaluation than any single metric alone, it is recommended that additional available metrics be incorporated in future work to complement the findings presented in this study.
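As a consistency check, the overall averages reported for the three metrics (0.661, 0.526, and 0.533) can be recovered from the per-category means by weighting each category by its number of questions (12, 10, and 6). The sketch below performs this calculation. All figures are taken from the text above, while the variable and function names are illustrative, and the weighting itself is an inference that happens to reproduce the reported values rather than a documented step of the study.

```python
# Number of benchmark questions per Bloom category (from the text).
QUESTIONS = {"Remembering": 12, "Analysing": 10, "Applying": 6}

# Per-category mean scores reported for each metric (from the text).
CATEGORY_MEANS = {
    "LLM as a Judge": {"Remembering": 0.727, "Analysing": 0.588, "Applying": 0.649},
    "DeepEval": {"Remembering": 0.611, "Analysing": 0.459, "Applying": 0.468},
    "BERTScore F1": {"Remembering": 0.619, "Analysing": 0.469, "Applying": 0.469},
}


def weighted_mean(means: dict[str, float]) -> float:
    """Average of category means, weighted by question count."""
    total = sum(QUESTIONS.values())
    return sum(means[cat] * n for cat, n in QUESTIONS.items()) / total


for metric, means in CATEGORY_MEANS.items():
    print(f"{metric}: {weighted_mean(means):.3f}")
```

The three printed values agree with the reported overall averages to three decimal places.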
Beyond the quantitative performance results, the observed errors also provide relevant insight into hallucination risks in maritime accident report analysis. In this safety-critical domain, hallucinated information may not simply constitute an incorrect answer but may distort the interpretation of an accident by introducing unsupported causal factors, incorrect timelines, non-existent communications with maritime authorities, inaccurate weather or visibility conditions, or safety recommendations not grounded in the report. If such outputs were used without expert supervision, they could affect the identification of contributing factors and lead to misleading conclusions. For this reason, LLM-based systems in this domain should be understood as decision-support tools for experts rather than autonomous investigative systems. Although RAG has been proposed as a strategy to improve factual grounding and reduce hallucinations, recent reviews show that hallucinations may still arise from both retrieval failures and generation deficiencies in RAG-based systems [25].
During the experiments, incorrect or unsupported reasoning was observed mainly in questions requiring analysis or application of information rather than direct factual retrieval. Some low-scoring answers failed to identify information that was present in the report, while others provided responses that did not match the expert-defined ground truth. These errors were especially relevant in Analysing questions, where the model had to integrate information distributed across different parts of the document, and in Applying questions, where calculations, unit conversions, or numerical derivations were required. Therefore, although the study did not compute a separate hallucination rate, the comparison with the expert-defined ground truth captured several cases of incomplete, unsupported, or incorrectly reasoned answers.
Regarding the relationship between cybersecurity resistance and hallucination tendency, the results do not support a direct conclusion that cybersecurity-resistant models systematically hallucinate less. The proposed framework evaluated cybersecurity robustness and document-analysis performance as two complementary dimensions, but hallucinations were not isolated as an independent metric. In fact, the results suggest that security and factual reliability should not be assumed to evolve in parallel, as a model may perform well in information retrieval while remaining vulnerable to prompt injection, or may be robust against adversarial prompts without necessarily achieving the best accuracy.