Article

Beyond Manual Media Coding: Evaluating Large Language Models and Agents for News Content Analysis

by Stavros Doropoulos 1,*, Elisavet Karapalidou 2, Polychronis Charitidis 2, Sophia Karakeva 2 and Stavros Vologiannidis 1

1 Department of Computer, Informatics and Telecommunications Engineering, International Hellenic University, 62124 Serres, Greece
2 DataScouting, 30 Vakchou Street, 54629 Thessaloniki, Greece
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(14), 8059; https://doi.org/10.3390/app15148059
Submission received: 30 June 2025 / Revised: 14 July 2025 / Accepted: 18 July 2025 / Published: 20 July 2025
(This article belongs to the Special Issue Natural Language Processing in the Era of Artificial Intelligence)

Abstract

The vast volume of media content, combined with the costs of manual annotation, challenges scalable codebook analysis and risks reducing decision-making accuracy. This study evaluates the effectiveness of large language models (LLMs) and multi-agent teams in structured media content analysis based on codebook-driven annotation. We construct a dataset of 200 news articles on U.S. tariff policies, manually annotated using a 26-question codebook encompassing 122 distinct codes, to establish a rigorous ground truth. Seven state-of-the-art LLMs, spanning low- to high-capacity tiers, are assessed under a unified zero-shot prompting framework incorporating role-based instructions and schema-constrained outputs. Experimental results show weighted global F1-scores between 0.636 and 0.822, with Claude-3-7-Sonnet achieving the highest direct-prompt performance. To examine the potential of agentic orchestration, we propose and develop a multi-agent system using Meta’s Llama 4 Maverick, incorporating expert role profiling, shared memory, and coordinated planning. This architecture improves the overall F1-score over the direct prompting baseline from 0.757 to 0.805 and demonstrates consistent gains across binary, categorical, and multi-label tasks, approaching commercial-level accuracy while maintaining a favorable cost–performance profile. These findings highlight the viability of LLMs, both in direct and agentic configurations, for automating structured content analysis.

1. Introduction

Processing and systematically extracting insights from media content is increasingly becoming a critical challenge for organizations, governments, and individuals. The volume of media data is growing exponentially with news articles from both traditional and online media serving as vital information channels for society [1]. However, the combination of extensive information, time limitations, and constrained cognitive processing capacity can substantially impair decision-making accuracy [2]. Understanding, analyzing, and leveraging this vast stream of media data are crucial for effective and accurate decision making [3], public relations [4], crisis management [5,6], and policy development [7]. These demands and constraints highlight the importance of integrating technological advancements in content analysis.
Content analysis can be defined as the “systematic, objective, quantitative analysis of message characteristics” [8]. While the need to understand information dates back to the dawn of human civilization, the first systematic efforts in content analysis emerged in the 17th and 18th centuries, particularly in the context of religious document interpretation. This was followed by a significant expansion of interest in information understanding during periods of economic crisis and in pre-, post-, and wartime environments [9]. Today, content analysis is carried out by a range of actors, including media monitoring organizations, public relations professionals, analysts, and intelligence agencies, who are responsible for developing workflows for the structured coding of news stories.
News story coding is a systematic method that involves defining and operationalizing variables to measure message content, developing coding schemes or dictionaries (codebooks), sampling and coding the content with reliability testing, and finally analyzing the results to establish relationships between content characteristics and other variables [8]. In practice, the creation of a codebook, both for human and automated annotation, involves the selection of questions and a specified answering schema. Codebooks are typically tailored to specific domains, organizations, or semantic topics. They comprise a list of domain-specific variables and possible values (codes), coding instructions, examples, and additional guidelines. This structure ensures consistent annotation and supports replicable, data-driven content analysis. They serve as a standardized framework for coders, reducing subjectivity and enabling comparison across studies [8,9].
Manual codebook-based annotation of news stories presents several limitations, including inter-annotator reliability issues, biases, and, most critically, scalability constraints due to the labor-intensive nature of the task. Although technological advancements have enabled the rapid creation and dissemination of media content, content understanding and analysis remain limited by the throughput of human cognitive capacity and by the financial resources required for either in-house or outsourced analysis. Processing and coding a typical web news article, generally ranging between 500 and 2500 words, can require approximately 10 to 35 min, depending on article length and the complexity of the coding schema. Recent technological advances in generative artificial intelligence (AI) [10] have enabled the creation of large language models (LLMs) capable of unprecedented contextual understanding without the need for domain-specific training [11,12,13]. Many research works explore the application of LLMs to automate and augment qualitative research tasks. Studies have shown that LLMs can be effective in generating thematic analyses of textual data and producing high-quality annotations, often rivaling human performance while significantly reducing time and effort [14]. However, structured media codebook analysis remains under-explored across diverse LLMs, particularly when leveraging state-of-the-art prompting techniques and coordinated expert roles within multi-agent systems. To address this gap, this study investigates the following research questions:
  • To what extent can LLMs accurately respond to media analysis questions derived from structured codebooks?
  • Can agentic workflows enhance the performance of LLMs in codebook-based news content analysis?
To this end, we compiled a dataset of 200 news articles related to the tariff policies announced by the United States in 2025. We then manually annotated these articles using a 26-question, tariff-specific analysis codebook developed for this study, comprising 122 possible codes. Based on the ground truth established by human annotators, this study evaluates the question-answering capabilities of LLMs through various model configurations and agentic workflows. To the best of our knowledge, this is the first work to systematically explore the applicability of LLMs compared to a multi-agent expert team approach for comprehensive codebook-based news content analysis. More specifically, the contributions of this work are threefold. First, we enable a reproducible evaluation of automated content analysis by publishing our annotated dataset and structured codebook on an open data repository. Then, we design and assess a prompting framework across seven state-of-the-art LLMs spanning multiple capacity tiers, achieving strong performance on direct annotation tasks. Finally, we develop and benchmark an agentic architecture utilizing Meta’s Llama 4 Maverick model powering a multi-agent team of media, political, trade economist, and reviewer expert agents, demonstrating improved accuracy (F1 increasing from 0.757 to 0.805) and significantly enhanced cost efficiency relative to commercial models.
The rest of this paper is structured as follows. First, related work is presented in Section 2. Then, we detail the dataset and the data annotation process and analytically describe the proposed method in Section 3. Next, we provide details of our extensive experimental study in Section 4. Finally, Section 5 and Section 6 conclude the paper and discuss possible future research directions.

2. Related Work

2.1. Large Language Models and Codebooks for Deductive Coding

The field of natural language processing (NLP) was fundamentally transformed by the introduction of the Transformer architecture and its “attention mechanism” [10]. This innovation enabled models to process text by weighing the importance of different words in relation to each other, effectively capturing long-range dependencies and complex contextual relationships that were challenging for previous architectures. This breakthrough paved the way for the development of large language models, such as those in the GPT series [11], Llama [12], and Mistral [13], which are pre-trained on vast and diverse text corpora.
A key advantage of LLMs is their powerful zero-shot and few-shot learning capabilities. Unlike traditional machine learning models, LLMs can perform a wide range of tasks with no or very few examples, simply by being prompted with instructions in natural language. This ability drastically reduces the need for task-specific training data and extensive feature engineering. The process of learning new tasks directly from examples given in the prompt is known as in-context learning (ICL) [15]. This remarkable ability stems from the massive scale of these models and the extensive text data they are pre-trained on, minimizing or eliminating the need for parameter updates for new tasks. Following this discovery, researchers have actively explored ways to enhance ICL, evaluated its effectiveness in various models [16], and benchmarked it against traditional methods like fine-tuning and other few-shot learning techniques [17]. A key advancement in this area is “chain-of-thought” (CoT) prompting [18]. This method enhances few-shot learning in LLMs by providing not just the answers to examples but also the step-by-step reasoning to reach them. The benefits of CoT are particularly pronounced in larger models [19]. Further studies have examined the ideal composition of these in-context examples. The utilization of these methods has established the viability of using LLMs for deductive coding.
Foundational studies by Chew et al. [14] and Xiao et al. [20] demonstrated that with carefully designed prompts, LLMs could achieve performance comparable to human coders on specific tasks. A key finding from this initial wave of research was that providing the model with the codebook’s explicit rules and definitions in a codebook-centered prompt was significantly more effective than providing only coded examples [20]. This established the codebook itself, rather than a large corpus of examples, as the central artifact for guiding the LLM’s analytical process. This work spurred a wave of methodological refinements aimed at improving the fidelity and transparency of LLM-driven analysis. In the context of qualitative analysis, CoT was shown to substantially improve coding accuracy and, crucially, make the LLM’s thinking process more transparent and auditable for researchers [21,22]. The emphasis on structured reasoning has led researchers to focus on adapting codebooks to be more machine-readable. For some tasks, a high-quality, well-structured codebook can even enable a zero-shot approach to outperform few-shot learning with examples, underscoring the primacy of clear instructions over in-context examples [23].
Rather than pursuing full automation, many have proposed collaborative frameworks where LLMs act as a partner in analysis. These LLM-in-the-loop systems can dramatically reduce the labor involved in coding large datasets but introduce new challenges for methodological objectivity [24]. A primary concern is that the LLM may learn to mimic the specific biases and interpretive idiosyncrasies of its human collaborator, potentially amplifying them across the dataset and creating a feedback loop that undermines the validity of the findings. This has led to calls for more reflexive approaches to content analysis, where the researcher’s interaction with the LLM is itself a subject of critical examination [21].
Much of the recent work in this area focuses on applying and validating these methods in specific domains. In fields like political science, for instance, researchers are evaluating the ability of LLMs to serve as reliable measurement tools for complex, theory-laden concepts defined in discipline-specific codebooks. This involves not only adapting codebooks for LLM use but also fine-tuning LLM behavior to more closely adhere to the nuanced rules of the codebook [25]. The validation of these applications often relies on traditional social science metrics for inter-coder reliability, assessing the level of agreement between the LLM and expert human coders. Nevertheless, despite achieving high agreement scores, LLMs continue to exhibit inherent limitations such as hallucinations, limited context size, scope drift, biases, and lack of understanding. The introduction of agent-based systems through the use of profiling, memory, planning and action tools can mitigate some of those issues while increasing the overall efficacy of automated content analysis [26].

2.2. Coding with Autonomous Agents

An agent can be defined as a system “that can be viewed as perceiving its environment through sensors and acting upon that environment through actuators” [27]. Additionally, the concept of a rational agent describes an agent that selects an action for each percept sequence such that the expected outcome maximizes its performance measure, based on the available perceptual evidence and the agent’s prior knowledge [27]. This definition of an agent, as one that utilizes environmental input, draws on prior knowledge, and employs tools to achieve a goal, aligns well with recent LLM-based agentic architectures. However, while early agentic approaches were constrained by limited capabilities in natural language understanding and generation, recent advancements in LLMs have enabled the development of intelligent agents that can collaborate to accomplish tasks of high cognitive complexity. The ability of LLM-based agents to increase their accuracy through collaborative reasoning and the utilization of prior and in-context knowledge and tools mitigates some of the inherent LLM limitations. This enables LLM-based agents to excel in a variety of tasks including content understanding and reasoning [28,29], software engineering [30,31,32,33], societal simulations [34,35,36], economic modeling [37,38], and robotics [39,40,41].
In recent LLM-based agentic architectures, agents leverage profiling, memory mechanisms, planning modules, and a predefined set of actions to accomplish specified tasks [26]. The profiling module defines agent roles and characteristics using handcrafted prompt templates, LLM-generated descriptions, or automated alignment with task-specific datasets, and therefore establishes a contextual persona foundation that guides all subsequent agent behaviors [34,42]. The memory module stores environmental perceptions and past experiences through memory architectures that integrate short-term contextual awareness or long-term knowledge storage. This enables the utilization of memory operations, such as reading, writing, and updating, thus allowing behavioral modeling, situational awareness, and task-oriented adaptation over time [43,44]. The planning module facilitates task decomposition and multi-agent planning utilizing reasoning strategies, such as CoT [18], Tree-of-Thought (ToT) [45], and Reasoning and Acting (ReAct) [46], which allow LLM agents to incorporate feedback from the environment and human users to iteratively refine execution plans. Finally, the action module translates agent decisions into executable actions by following plans, retrieving relevant memories or utilizing external systems. It employs both external tools (e.g., APIs, databases, or specialized models) and internal LLM capabilities (e.g., planning, dialogue generation, and commonsense reasoning) to carry out tasks effectively [31,47,48].
The application of agentic workflows to structured, codebook-based news analysis remains under-explored. Recent work applies multi-agent LLM systems to perform large-scale thematic analysis, focusing on exploring diverse qualitative themes through internal consistency metrics [49]. In contrast, we conduct a structured content analysis using a comprehensive, manually developed codebook and human-annotated ground truth, enabling a direct evaluation of LLM coding capabilities, from direct responses to multi-agent teams, on domain-specific analytical questions.
Manual media coding involves sophisticated, multi-step workflows, from codebook preparation to content understanding, question answering, error analysis, and collaborative iterative refinement, which are, in effect, a representation of an agentic system. These processes embody the core principles of agentic AI: breaking down a complex goal into a plan, using tools, and reflecting on feedback and collaboration to improve performance. This points toward a research frontier focused on formalizing these processes into more autonomous agentic workflows, which is the central focus of this study.

3. Materials and Methods

3.1. Data Collection

The dataset for this analysis was constructed by collecting publicly available online articles through a systematic search and manual screening process. The search was conducted across the open internet, targeting online news articles, reports, and commentary. A major web search engine was used to identify relevant content. The search was limited to content published between 7 April 2025 and 21 April 2025. The search was restricted to English-language articles. Instead of a simple keyword search, a targeted search strategy was employed to retrieve articles that covered a specific intersection of topics. To be included in the results, an article had to contain terms from both of the following thematic categories:
  • U.S. Administration: This theme focused on identifying the key political actor. The search looked for mentions of President Donald Trump by name, his official titles (e.g., “President Trump,” “US President”), and the administration as an entity.
  • Trade and Tariff Policy: This theme centered on economic policy and international trade. The search targeted a wide range of relevant terms, including specific keywords like tariff and import duty, as well as broader concepts such as trade policy, trade war, and trade barriers. It was also designed to capture discussions around policy actions, like tariff announcements or proposals.
This two-part approach ensured that the retrieved articles were not just about the Trump administration or trade in general, but specifically about the administration’s involvement with trade and tariff policy. The initial search yielded a large volume of results. These were then manually screened by an analyst to ensure their relevance to the research questions. The process involved reviewing titles, lead paragraphs, and, when necessary, the full text to discard irrelevant results and create a final, focused dataset of 200 articles for analysis. This process follows media monitoring and open-source intelligence workflows, where the selection of relevant data for codebook-based analysis originates from keyword-based searches.
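As a concrete illustration, the sketch below expresses this two-category inclusion criterion in Python; the term lists are abbreviated examples drawn from the description above, not the exact query submitted to the search engine.

```python
# Illustrative sketch of the two-category inclusion criterion; the term lists are
# abbreviated examples, not the exact search query used in the study.
ADMINISTRATION_TERMS = ["donald trump", "president trump", "us president", "trump administration"]
TRADE_TERMS = ["tariff", "import duty", "trade policy", "trade war", "trade barrier"]

def is_candidate(article_text: str) -> bool:
    """An article qualifies only if it matches at least one term from both themes."""
    text = article_text.lower()
    has_admin = any(term in text for term in ADMINISTRATION_TERMS)
    has_trade = any(term in text for term in TRADE_TERMS)
    return has_admin and has_trade
```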

3.2. Codebook

To support this study, a comprehensive analysis codebook was developed. The codebook is designed to systematically analyze media coverage of Donald Trump’s tariff policy. It provides a structured framework for uniformly tagging and evaluating articles. The codebook is divided into several sections, each addressing a specific aspect of the media coverage, for a total of 26 questions and 122 codes.

3.2.1. Content and Source Identification

This section of the codebook is concerned with the basic attributes of each article. It aims to classify the content type and identify the author and the sources that are cited. The questions in this section are as follows:
  • Content Type: Is the article a news story, an opinion piece, or an analysis?
  • Author Mentioned: Does the article mention the author’s name?
  • Sources Cited: Are there specific references cited in the article?
  • Source Types: If sources are cited, what type of sources are they (e.g., government official, financial leader, academic professional)?

3.2.2. Tariff Policy Mention

This section focuses on whether and how Donald Trump’s tariff policy is mentioned in the article. This includes the following:
  • Tariff Policy Mentioned: Is Donald Trump’s tariff policy mentioned in the article?
  • Prominence of Mention: How prominently is the tariff policy mentioned (e.g., prominent mention, passing mention, mere mention)?

3.2.3. Entities and Stakeholders

This section identifies the key entities and stakeholders involved in the discussion of the tariff policy. The questions cover the following:
  • Countries Affected: Does the article mention any specific countries as being directly impacted by Donald Trump’s tariff policy? If so, which countries?
  • Industries Affected: Does the article mention specific industries as being directly impacted by Donald Trump’s tariff policy? If so, which industries?
  • Brands Affected: Does the article mention specific brands (companies) as being directly impacted by Donald Trump’s tariff policy? If so, which brands?
  • Political Leaders and Stakeholders: Are other political leaders or stakeholders explicitly mentioned in relation to Donald Trump’s tariff policy in the article? If so, who?

3.2.4. Sentiment Analysis

This section of the codebook is designed to capture the sentiment of the article towards various aspects of the tariff policy. This includes the following:
  • Sentiment towards China: What is the sentiment regarding the impact of Donald Trump’s tariff policy towards China (e.g., Positive, Negative, Neutral, Undefined)?
  • Sentiment towards the Automotive Industry: What is the sentiment regarding the impact of Donald Trump’s tariff policy on the automotive industry?
  • Overall Sentiment towards Donald Trump: What is the overall sentiment of the article towards Donald Trump in relation to his tariff policy?
  • Overall Sentiment towards the Economic Impact: What is the overall sentiment of the article towards the economic impact of Donald Trump’s tariff policy?
  • Overall Sentiment towards the Political Impact: What is the overall sentiment towards the political impact of Donald Trump’s tariff policy?

3.2.5. Policy Framing

This final section examines how the tariff policy is framed within a broader political context. The questions are as follows:
  • Connection to “America First”: Does the article connect Donald Trump’s tariff policy to his “America First” philosophy, either in a direct or indirect manner?
  • Direct or Indirect Connection: If connected to the “America First” philosophy, is the connection direct or indirect?
The codebook uses a mix of binary, categorical and multi-label questions to capture a wide range of data, allowing for a detailed and multi-faceted analysis of media coverage on this topic, as detailed in Table 1.
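For illustration, the sketch below shows how questions of the three types could be represented in a machine-readable form; the identifiers, wording, and code lists are illustrative abbreviations rather than the published codebook itself.

```python
# Illustrative representation of codebook questions of the three types
# (binary, categorical, multi-label); identifiers and code lists are examples,
# not the published codebook.
codebook_excerpt = [
    {
        "id": "tariff_policy_mentioned",
        "question": "Is Donald Trump's tariff policy mentioned in the article?",
        "type": "binary",
        "codes": [True, False],
    },
    {
        "id": "prominence_of_mention",
        "question": "How prominently is the tariff policy mentioned?",
        "type": "categorical",
        "codes": ["prominent mention", "passing mention", "mere mention"],
    },
    {
        "id": "industries_affected",
        "question": "Which industries are mentioned as directly impacted?",
        "type": "multi_label",
        "codes": ["automotive", "technology", "agriculture", "retail"],  # abbreviated list
    },
]
```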

3.3. Annotation Process

The 200 selected articles were subjected to manual annotation by an expert analyst based on the predefined codebook. The codebook provided detailed definitions, rules, and examples for applying specific labels to the text. Each article was systematically reviewed by the analyst, who applied codes to relevant codebook questions. The primary objective of this process was to systematically convert qualitative textual data into a structured format suitable for quantitative analysis.
To ensure the reliability and validity of the manual annotation, a rigorous quality control procedure was implemented, adapting the principles described in international standards for acceptance sampling (ISO 2859 [50] and ANSI/ASQ Z1.4-2003 [51]).
The total set of coding decisions from the 200 articles was treated as a complete lot of annotations. We used the normal inspection severity (Level II), which stipulates that a batch of annotations is considered to be of acceptable quality if the error rate does not exceed a predefined threshold. For a lot size ranging from 151 to 280, this threshold, using an Acceptance Quality Limit (AQL) of 0.4, is set at 4 rejected items.
To verify this, a single sampling plan was used. After the analyst completed the annotation of the articles, two independent reviewers assessed a random sample of 32 annotation decisions. An annotation was deemed erroneous if it represented a clear misapplication of the codebook’s rules. If this sample of 32 contained more than a specified number of errors (set to 4 errors according to the defined error rate threshold), the whole lot of 200 articles would be rejected. In such a case, the lot would be returned to the analyst for a complete review and re-annotation, accompanied by careful instructions and clarification of the codebook to rectify the systematic errors. The reviewers agreed on 31 of the 32 items (Cohen’s κ = 0.65), indicating a high level of inter-rater reliability in evaluating annotation correctness. Taking into account both of their independent assessments, 30 of the 32 annotations were deemed acceptable, remaining within the allowable error threshold (AQL of 0.4, threshold of 4 rejected items). Although Cohen’s κ is considered a conservative estimate, particularly when class distributions are skewed or when agreement is near-perfect, the raw agreement rate of 96.9% between the two reviewers reflected strong consistency. As a result, the dataset was accepted without requiring re-annotation.
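The acceptance decision described above can be summarized in a short sketch; the function below illustrates the single-sampling rule (sample of 32, acceptance threshold of 4 errors) and is not the reviewers’ actual tooling.

```python
def accept_lot(sampled_errors: int, acceptance_number: int = 4) -> bool:
    """Single sampling plan: accept the lot of annotations if the number of
    erroneous items in the inspected sample does not exceed the acceptance number."""
    return sampled_errors <= acceptance_number

# In this study, 2 of the 32 sampled annotations were judged erroneous.
print(accept_lot(sampled_errors=2))  # True -> dataset accepted without re-annotation
```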
The dataset, including the codebook, collected articles, and human annotations used in this study, is publicly available at (https://doi.org/10.5281/zenodo.15767938, accessed on 29 June 2025).

3.4. Large Language Model-Based Question Answering

3.4.1. Model Selection

To gain a comprehensive understanding of the potential and limitations of using LLMs for news content analysis, we conducted a thorough evaluation of various models across different performance tiers in an attempt to assess their performance in the task of tagging news based on a predefined list of questions [52]. This included testing LLMs classified as low-, mid- and high-level based on their complexity, capabilities and resource requirements. Table 2 shows the different models tested in our experiments.
Specifically, for the lower-tier models, we opted for the quantized version of Meta’s Llama 3.1 [53] and the distilled version of DeepSeek-R1 [54], which is based on the Qwen 2.5 [55] architecture. Both models support a similar context window (128 K) in terms of the volume of data they can process at once. However, they differ significantly in size, with Llama 3.1 utilizing approximately 8 billion parameters and DeepSeek-R1 employing around 32 billion. While a direct comparison between these two models may not be entirely equitable, given their distinct optimization techniques (quantization versus distillation) [56], training pipelines and intended use cases, it remains valuable to examine how they perform relative to each other, particularly when benchmarked against more advanced and resource-intensive LLMs. This comparison could also provide an additional insight into the trade-offs between model efficiency and performance, especially in resource-constrained environments [57,58].
For the mid-tier category, we selected Meta’s Llama 3.3 [53] and a quantized version of Qwen 3 [59]. These two models exhibit notable differences in both architecture and configuration, most significantly in their parameter counts and context window sizes. Llama 3.3 represents a balanced model with a moderate number of parameters and a reasonably large context window, making it well suited for a variety of general-purpose tasks [12]. In contrast, Qwen 3, despite being quantized for efficiency, features 235 billion parameters, placing it near the boundary of what might typically be considered a high-tier model [59]. However, its relatively limited context window of just 32 K tokens may constrain its effectiveness in tasks that require broader document comprehension or long-range reasoning, thus justifying its inclusion in the mid-tier category for this analysis. Mid-tier models often strike a favorable balance between computational efficiency and task performance. They are especially valuable in real-world applications where resource constraints exist but where the complexity of tasks still demands a reasonable level of model sophistication [60]. Evaluating models like Llama 3.3 and Qwen 3 in this tier provides insights into how architectural trade-offs, such as parameter scaling versus context length, affect practical outcomes across different types of news content analysis tasks.
For the high-tier category, we selected two of the most advanced open-weight models available, Meta’s Llama 4 Maverick [61] (in its quantized form) and the latest iteration of DeepSeek-R1, identified as version 0528. Both models significantly exceed the capabilities of the low- and middle-tier selections in terms of parameter count, context window size and computational demands. They represent the cutting edge of open-weight LLM development and are recognized for their advanced performance [62]. High-tier models such as these typically exhibit superior reasoning abilities, enhanced problem-solving skills, greater proficiency in handling complex tasks and more nuanced content comprehension [63]. As such, comparing them against lower-tier models provides valuable insights into the extent and scalability of their capabilities, especially in the domain of news content analysis where context, nuance and depth of understanding are critical.
While our primary focus was open models with available weights across all performance tiers, we also included a leading commercial model in the high-tier group to broaden the scope of our evaluation, Anthropic’s Claude 3.7 Sonnet [64]. This model, accessed via a commercial platform, features an expansive context window of 200 K tokens and, although its parameter count remains undisclosed, it is considered to be competitive with GPT-4-class models [65,66,67]. Including Claude 3.7 Sonnet allows us to benchmark open-weight models against state-of-the-art proprietary alternatives, offering an additional dimension of comparison. This helps illustrate not only how far open-weight models have advanced but also where commercial offerings may still hold an edge in specific use cases.

3.4.2. Prompt Formulation

LLMs, built on Transformer architectures with self-attention mechanisms, excel at natural language tasks due to their ability to process long-range dependencies and understand context, meaning and intent behind text sequences [68,69,70]. Trained on vast, diverse datasets, they grasp nuanced syntax, semantics and pragmatics across multiple domains [71]. Their applications range from summarization and classification to advanced discourse analysis and content generation. However, LLMs are constrained by fixed knowledge cut-offs [72], which limit their awareness of post-training developments; retraining them to stay current, especially in fast-evolving fields like news content analysis, is resource-intensive and impractical [73,74]. To address this, dynamic augmentation techniques allow LLMs to incorporate up-to-date external information at inference time [75,76]. In our study, this involved supplementing the model with news articles and a detailed codebook to answer questions that require contextual understanding. Thus, rather than relying solely on static knowledge, the model synthesizes real-time information using its deep linguistic capabilities, making this approach scalable and practical for media monitoring and journalism. Consequently, our research prioritized prompt engineering [77], including the development of a global system prompt to ensure analytical consistency and a structured codebook schema featuring specific questions, instructions, tagging values and data extraction protocols, all of which are detailed below.
In the first part, we developed a concise, comprehensive global system prompt for the selected LLMs, designed to serve as a foundational guide throughout the content analysis process. This system prompt outlined the model’s role, established its responsibilities in handling news content, emphasized the primary focus on the topic of Trump’s tariffs and offered general guidelines to shape its reasoning process. Importantly, the prompt was designed to function independently of any specific news story, as it provided only high-level instructions intended to align the model’s behavior without introducing bias from example-based conditioning. Our prompting strategy followed a zero-shot approach, meaning no examples of the desired output format were included. Instead, we relied exclusively on clear, direct suggestions that the model had to interpret and apply autonomously in order to reason and generate its final output. Another central component of our approach was role prompting. We instructed the LLMs to adopt the persona of a professional journalist, an expert tasked with evaluating, interpreting and tagging news content with accuracy, impartiality, and editorial insight. This role-based framing was reinforced with elements of emotional/motivational prompting aimed at enhancing the model’s steerability: by framing the task as a measured, professional responsibility, the prompt encouraged the model to take its task seriously and generate thoughtful, deliberate responses. Additionally, to strengthen the LLM’s grasp of the full analytical context, we introduced a meta-cognitive step. Before proceeding with tagging individual questions from the codebook, the model was instructed to briefly summarize its approach so as to re-evaluate its understanding of the given schema. This reflective prompt was implemented to encourage the model to internalize the logic and structure of the tagging system and general guidelines, thereby improving the relevance and accuracy of its output.
In the second part of our prompt engineering process, after designing the initial version of the codebook, including the core questions, eligible tagging values and their descriptions, we shifted our focus toward refining its clarity, structure, and usability. The questions were carefully reviewed and paired, when necessary, with distinct, unambiguous instructions for every available response option. This refinement followed the same principles of meticulous prompt design used in our global system prompt, with an emphasis on reducing ambiguity and improving model interpretability. From a software implementation perspective, we developed a structured schema architecture to formalize the codebook. This schema not only defined the data types associated with each value but also introduced standardized tag names for all possible answers across every question in the codebook. This structured format ensured compatibility with automated processing systems and provided the LLMs with a clear, machine-readable framework to operate within. A key enhancement to this schema was the addition of a reasoning prompt, a directive requiring the LLM to include a brief explanation of its thought process for each selected answer. Even in our earliest experiments, we observed that prompting the model to articulate its reasoning (similar to chain-of-thought prompting) significantly improved its performance. It encouraged more deliberate analysis and often helped the model self-correct or avoid superficial interpretations.
In summary, the complete context provided to the LLMs for each task consisted of three core components, the global system prompt, the refined codebook (including structured questions, detailed value descriptions and standardized tags) and the relevant news content. Together, these elements formed a comprehensive input designed to guide each model toward consistent, interpretable and high-quality outputs.
Conventionally, LLMs are designed to generate free-form natural language output in response to user instructions or conversational prompts. While this flexibility is useful in many general-purpose applications, it poses challenges in contexts like structured content tagging, particularly in our case, where the goal is to extract standardized labels from news articles. Relying on free-text responses would have required extensive post-processing to parse, interpret, and validate the output, introducing unnecessary complexity and potential for inconsistency. It quickly became evident that a structured and consistent output format was far more suitable for our task. To ensure precision and minimize downstream processing, we adopted a schema-driven approach that constrained the model’s responses to a predefined structure. During the software implementation phase, we leveraged standard parsing libraries to enforce this format, guiding the model to return its answers in a machine-readable structure aligned with the codebook schema. This approach not only improved the reproducibility of the tagging process across all models but also significantly reduced the risk of malformed or ambiguous outputs. By tightly coupling prompt design with structural constraints at the software level, we created a robust pipeline capable of extracting clean, well-organized labels from unstructured news content with minimal human intervention. The complete prompt used is presented in Appendix A.1.
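As one possible realization of this schema-driven approach, the sketch below constrains a response covering a small subset of questions using Pydantic (v2); the class and field names, the enumerations, and the reasoning field are assumptions for exposition, and the full schema used in the study covers all 26 codebook questions.

```python
# Illustrative Pydantic (v2) schema constraining the LLM's answers for a small subset
# of codebook questions; field names and enumerations are assumptions for exposition.
from enum import Enum
from typing import List
from pydantic import BaseModel, Field

class Prominence(str, Enum):
    PROMINENT = "prominent_mention"
    PASSING = "passing_mention"
    MERE = "mere_mention"

class TaggedAnswer(BaseModel):
    reasoning: str = Field(description="Brief explanation of the selected answer")

class TariffPolicyMention(TaggedAnswer):
    mentioned: bool
    prominence: Prominence

class AffectedEntities(TaggedAnswer):
    industries: List[str] = Field(default_factory=list)
    countries: List[str] = Field(default_factory=list)

class ArticleAnnotation(BaseModel):
    tariff_policy: TariffPolicyMention
    affected_entities: AffectedEntities

# A malformed model response fails validation instead of silently propagating:
# annotation = ArticleAnnotation.model_validate_json(llm_response_text)
```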

3.5. LLM-Based Multi-Agent Codebook Answering

To evaluate the influence that agentic workflows can have on the performance of LLMs during codebook-based news content analysis, we designed and evaluated an LLM-based multi-agent architecture. Our system simulates a team of expert analysts and utilizes specific profiles, memory modules, planning architectures, and actions to produce schema-specific media coding outputs, as detailed in Figure 1.
A modular multi-agent architecture was developed for structured news content analysis using LLM-based agents. The system comprises a coordinator agent that orchestrates domain-specific expert agents in media, politics, and economics. This distribution of cognitive load, a principle of distributed reasoning and cognition, assigns each agent a specific role to constrain its interpretive scope, thereby reducing the risk of hallucinations [78,79]. Each of these agents addresses a subset of the 26-question codebook while also providing relevant domain-specific analysis. The analyst agents share the same conversational context to mitigate context dilution and drift across iterative reasoning steps. A reviewer agent discusses and integrates expert responses into a unified output, functioning as a critical internal consistency check. Finally, a converter agent processes the unified output for schema compliance. Memory services, including session and context stores, support state management and inter-agent contextual awareness. Tool modules facilitate access to questions and JSON validation. The agents’ architecture, detailed in Table 3, enables distributed reasoning, profile-specialization-driven analysis, and standardized output generation. The complete agent profiles and orchestration instructions are presented in Appendix B.2. For the system implementation, Google’s Agent Development Kit version 1.4 was used [80].
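The sketch below illustrates this orchestration pattern (coordinator delegation, experts sharing one conversational context, reviewer integration, and schema conversion) in framework-agnostic Python; it depicts the flow rather than the Agent Development Kit implementation used in the study, and all names and signatures are assumptions.

```python
# Framework-agnostic sketch of the multi-agent flow: coordinator delegation, domain
# experts sharing one conversational context, reviewer integration, schema conversion.
# This illustrates the pattern only; it is not the ADK code used in the study.
from dataclasses import dataclass
from typing import Callable, List

LlmFn = Callable[[str, str], str]  # (agent profile, prompt) -> model response

@dataclass
class ExpertAgent:
    name: str             # e.g., "media_expert", "political_expert", "trade_economist"
    profile: str          # role/profiling instructions for this expert
    questions: List[str]  # subset of codebook question IDs assigned to this expert
    llm: LlmFn

    def analyze(self, article: str, shared_context: List[str]) -> str:
        # Experts read and append to the same shared context (shared memory),
        # mitigating context dilution across iterative reasoning steps.
        prompt = "\n".join(shared_context + [article, f"Answer: {', '.join(self.questions)}"])
        answer = self.llm(self.profile, prompt)
        shared_context.append(f"[{self.name}] {answer}")
        return answer

def run_session(article: str, experts: List[ExpertAgent],
                reviewer: LlmFn, converter: LlmFn) -> str:
    shared_context: List[str] = []                 # session/context store for this article
    for expert in experts:                         # coordinator delegates sub-tasks in turn
        expert.analyze(article, shared_context)
    unified = reviewer("Reviewer: integrate and sanity-check the expert answers.",
                       "\n".join(shared_context))  # internal consistency check
    return converter("Converter: emit schema-compliant JSON only.", unified)
```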

3.6. Evaluation Metrics

The codebook questions are clearly differentiated by both their type and the nature of their expected responses. At a high level, this differentiation is based on the number of labels permitted per question—specifically, whether the model is expected to select a single value or multiple values. While this classification is useful for pre-processing and formatting during implementation, its primary function is to support a deeper semantic categorization that plays a crucial role in the final evaluation process, as we organize the questions into three distinct groups: binary, categorical, and multi-label. Binary questions require a single boolean response, either true or false, and typically represent the presence or absence of a specific feature or characteristic in the news content. Categorical questions also allow only one response, but from a predefined list of mutually exclusive values, and are used when a piece of content must be classified into a single category within a defined taxonomy. Multi-label questions permit multiple responses, also from a fixed set of predefined values, and are appropriate when the content may relate to several applicable categories simultaneously. This three-tiered classification system not only improves prompt design and model alignment but also informs our evaluation metrics, as each group requires distinct performance criteria and error analysis techniques. It ensures that model outputs are both interpretable and consistent with the logical structure of our coding framework.
To assess the LLMs’ tagging performance against human-annotated ground truth, we designed an evaluation process using the unified metric of the weighted F1-score across all question types (binary, categorical, and multi-label) to ensure consistency and comparability despite their differing structures. Rather than relying on accuracy for binary questions, weighted F1 was chosen to better handle class imbalances. The same weighted F1 metric was applied to categorical questions to evaluate how well the model distinguishes mutually exclusive labels and is adapted for multi-label questions by treating each label as a separate binary classification, aggregating performance accordingly. Precision measures the correctness of positive predictions, while recall gauges the model’s ability to capture all relevant instances. Both metrics are derived from the counts of True Positives (TP), False Positives (FP), and False Negatives (FN), as presented in Equation (1) below:
\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}    (1)
The F1-score presented in Equation (2) harmonizes these metrics to provide a balanced performance measure:
\text{F1-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}    (2)
By standardizing on weighted F1, shown in Equation (3), our evaluation framework ensured consistent, interpretable, and unbiased comparisons across all question formats, accounting fairly for variations in answer distribution and question complexity.
\text{Weighted F1-score} = \sum_{c \in C} \frac{n_c}{N} \cdot F1(c)    (3)
where
  • c: Each of the binary, categorical, and multi-label classes.
  • C: Set of all classes.
  • F1(c): F1-score for class c.
  • n_c: Number of true instances (support) for class c.
  • N = \sum_{c \in C} n_c: Total number of instances across all classes.
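A minimal sketch of this evaluation, assuming scikit-learn and illustrative toy labels, is given below; binary and categorical questions are scored directly, while multi-label questions are binarized per code before computing the weighted F1-score.

```python
# Sketch of the per-question weighted F1 evaluation; labels are toy examples.
# Multi-label questions are binarized per code before scoring.
from sklearn.metrics import f1_score
from sklearn.preprocessing import MultiLabelBinarizer

def weighted_f1_single_label(y_true, y_pred):
    """Binary and categorical questions: one label per article."""
    return f1_score(y_true, y_pred, average="weighted", zero_division=0)

def weighted_f1_multi_label(y_true_sets, y_pred_sets, all_codes):
    """Multi-label questions: each code is treated as a separate binary classification."""
    mlb = MultiLabelBinarizer(classes=all_codes)
    y_true = mlb.fit_transform(y_true_sets)
    y_pred = mlb.transform(y_pred_sets)
    return f1_score(y_true, y_pred, average="weighted", zero_division=0)

# Toy example (not from the dataset):
print(weighted_f1_single_label(["prominent", "passing"], ["prominent", "mere"]))
print(weighted_f1_multi_label([{"automotive"}, {"tech", "retail"}],
                              [{"automotive"}, {"tech"}],
                              all_codes=["automotive", "tech", "retail"]))
```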

4. Results

4.1. Model-Based Direct LLM Prompting Evaluation

After executing the finalized codebook across all selected LLMs, alongside the corresponding news content, we collected each model’s predicted answers for all questions, applied to every curated article in the dataset. This process yielded a set of model-generated annotations, reflecting how each LLM interpreted and tagged the content based on the provided schema, global prompt, and codebook instructions. Each model’s predictions were systematically compared against the ground-truth annotations produced by our expert human annotator. This comparative analysis allowed us to assess the performance of the models across a wide range of question types and tagging tasks, forming the empirical basis for our quantitative assessment. In addition, it enabled us to examine and compare the relative performance of each model within the context of its respective tier, i.e., low, middle, or high.
The evaluation of F1-scores across 26 questions, categorized into binary, categorical, and multi-label types, reveals clear and consistent performance disparities among the seven models assessed. Table 4 provides an aggregated overview of average F1-scores for each question group (binary, categorical and multi-label) and a global score that represents the model’s overall tagging effectiveness across the full codebook on a per-model basis. Overall, the commercial model Claude-3-7-Sonnet (M6) achieved the highest average F1-score at 0.822, demonstrating superior performance across all question types. It was especially strong in multi-label tasks, which proved to be the most challenging category for all models. The model’s performance in all question categories suggests a high degree of generalization, making it particularly suitable for applications that demand nuanced judgment across diverse types of analysis tasks.
DeepSeek-R1 (M4) and its distilled variant DeepSeek-R1-Distill-Qwen (M5) also perform robustly, achieving global F1-scores of 0.800 and 0.787, respectively. These models show particular strength in binary tasks, and while not as dominant as Claude-3-7-Sonnet in multi-label performance, they remain reliable choices for diverse classification needs. Meta-Llama-3-3 (M3) ranks slightly behind these models overall but performs competitively in binary and categorical tasks, making it a solid middle-tier option. On the other end of the spectrum, Meta-Llama-3-1 (M7) underperforms. With a global F1 score of just 0.636, it struggles most in multi-label classification, often showing the lowest scores across those questions. Even in binary and categorical tasks where other models perform well, its results remain well below average, showcasing the difficulties low-tier models have in complex coding-based analysis.
Meta-Llama-4-Maverick (M1) delivered solid results, outperforming the mid-tier Qwen3 (M2) and the low-tier Llama-3-1 (M7), yet it consistently lagged behind DeepSeek-R1 (M4) and Claude-3-7-Sonnet (M6), particularly on categorical and multi-label tasks. This suggests that scale alone does not ensure top-tier performance in complex structured coding and reasoning. These findings motivated our development of an agentic system to determine whether Meta-Llama-4-Maverick (M1), using coordinated, role-specialized agents that reason according to their expert roles, could surpass the limitations observed in direct single-model runs.

4.2. Agentic Architecture with Multi-Step Reasoning Evaluation

The detailed agentic architecture was evaluated using a high-tier LLM, Meta’s Llama 4 Maverick. A different conversational session was simulated for each article of the dataset. The input to the setup agent was the article’s text and ID without any additional information. Our evaluation captured the agents’ responses, flows, actions and token usage information. The session was marked as completed when the final JSON response was received, and thus the system had achieved the specified task. Table 5 showcases the increased performance of the agentic system, on all question categories, when compared to the single model approach.
These results demonstrate that our agentic approach meaningfully enhances structured coding performance, notably improving Meta-Llama-4-Maverick’s capabilities across all question categories. Remarkably, this elevates its performance close to that of the top commercial system (Claude) and even marginally surpasses the larger, reasoning-enabled DeepSeek-R1 model. We show how coordinated, role-specialized agentic collaboration can bridge gaps in parameter size and training specialization to achieve high-quality content analysis.

4.3. Per-Question Performance and Insights

Figure 2 presents the final evaluation metrics, i.e., the weighted F1-scores for all questions, computed by comparing each model’s output to the ground truths provided by the expert human annotator. Table A1 of Appendix A.2 also presents the evaluation metrics per question alongside the question types.
Binary questions, which mostly involve direct presence or absence detections (e.g., mentions, references), yield consistently high scores across models, typically exceeding 0.80, with only minor variations between low- and high-capacity models. The Llama-Maverick-based agentic team maintains this strong performance, virtually matching Sonnet’s performance of 0.911 by increasing direct M1’s 0.841 to 0.910.
Categorical questions, which demand nuanced assessments of prominence, sentiment, or political framing, display notable performance variability across systems. These tasks often involve subtle judgments requiring discourse-level comprehension, such as evaluating how prominently a policy is discussed or inferring sentiment toward economic and political consequences. While Claude-3-7-Sonnet outperforms others on tasks like evaluating economic impacts (Q23), the agentic approach still provides competitive results on several questions (e.g., 0.791 on Q26 vs. direct M1’s 0.594), though inherently subjective judgments remain more difficult to standardize.
Multi-label questions, requiring identification of multiple applicable entities (such as industries or stakeholders), present greater challenges. Direct models frequently struggle with these tasks, exhibiting generally lower F1-scores, especially in smaller-scale systems. For example, in question Q21, Meta-Llama-3-1 scores as low as 0.210. Direct Meta-Llama-4-Maverick achieves only 0.536 on Q14, whereas Claude-3-7-Sonnet reaches around 0.593. Here, the agentic system demonstrates clear advantages by distributing cognitive load among specialized roles and consolidating findings, resulting in more balanced multi-label extraction (e.g., achieving 0.643 on Q14).
These patterns indicate that while simpler binary tasks are reliably addressed by both large direct models and the agentic system, structured multi-agent workflows are particularly effective for distributed extraction tasks. Notably, even binary questions such as Q25 can prove difficult when they require deep political understanding and indirect contextual connections. While the agentic system improves Meta-Llama-4-Maverick’s scores on nuanced tasks such as Q11 and Q26 (sentiment and ideological framing), underscoring the benefit of collaborative reasoning for subjective classifications, fine-grained categorical judgments continue to represent a critical limitation for automated media content analysis, irrespective of the model or methodology used.

4.4. Model Efficiency

Table 6 details the total tokens processed and indicative costs for each model. Token-level analysis indicates that agentic orchestration leads to an approximately 37% increase in token consumption and nearly doubles total inference time relative to baseline prompting using the same model, largely due to task decomposition, inter-agent communication, and intermediate reasoning steps. While this results in additional computational overhead and latency, it produces clear gains in structured coding accuracy, raising Meta-Llama-4-Maverick’s overall F1-score from 0.757 to 0.805 over the entire 200-article evaluation set. Although the direct accuracy-to-cost ratio slightly declines (from 1.10 to 0.72 for the agentic approach), this approach remains highly competitive, especially compared to substantially more expensive commercial models like Claude-3-7-Sonnet (accuracy-to-cost ratio of 0.0507). Per-agent token utilization is shown in Table A2 of Appendix B.
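As a quick check of the reported ratios, the snippet below recomputes the accuracy-to-cost figures from the global F1-scores and total processing costs reported for the full 200-article evaluation (see also Section 5).

```python
# Recomputing the accuracy-to-cost ratios (global F1 divided by total USD cost for
# the 200-article evaluation) from the figures reported in this paper.
systems = {
    "Agentic Meta-Llama-4-Maverick": (0.805, 1.12),   # (global F1, total cost in USD)
    "Claude-3-7-Sonnet":             (0.822, 16.21),
}
for name, (f1, cost) in systems.items():
    print(f"{name}: {f1 / cost:.3f}")
# Agentic Meta-Llama-4-Maverick: 0.719  (reported as 0.72)
# Claude-3-7-Sonnet: 0.051  (reported as 0.0507)
```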
While the agentic approach incurs additional latency and computational overhead, these findings support agentic workflows as an effective strategy in contexts where accuracy outweighs moderate increases in inference cost. Figure 3 provides a visual comparison of total cost versus average F1-score, with bubble size reflecting total tokens used for the 200 articles’ evaluations. This shows that while Claude-3-7-Sonnet achieves the highest F1-score, it does so at a much higher cost. By contrast, our agentic system with Meta-Llama-4-Maverick delivers a substantial accuracy improvement over direct prompting with a moderate cost increase, offering a compelling balance of performance and efficiency overall.

5. Discussion

In this section, we discuss the findings of our comprehensive evaluation of LLMs’ capacity for structured media codebook analysis. We address two fundamental research questions regarding LLM-based media coding efficacy and the potential benefits of agentic workflows, through a systematic experimentation on our curated dataset of 200 annotated news articles spanning 26 codebook-derived questions.
  • RQ1: To What Extent Can LLMs Accurately Respond to Media Analysis Questions Derived from Structured Codebooks?
Our results confirm that modern LLMs are capable of responding to structured media analysis questions with reasonable efficacy, comparable to human accuracy in challenging content analysis tasks [82,83]. Our methodology for direct LLM prompting followed best practices, including system-role definition, schema-constrained response formats, and explicit reasoning instructions, achieving global F1-scores ranging from 0.636 to 0.822 across seven models of small-, medium-, and high-tier capacities. Anthropic’s Claude 3.7 Sonnet achieved the highest overall performance, followed by DeepSeek R1 and Meta’s Llama 3.3.
The reported variability in performance across model tiers underscores that raw LLM capabilities play a critical role, especially in handling nuanced, multi-label questions. Nevertheless, our findings suggest that with carefully designed and structured prompts, even direct-call LLMs can handle codebook-based content analysis requiring advanced reasoning and contextual understanding without prior training. Despite no known prior exposure to the codebook’s thematic definitions, models generally respected its structural logic, suggesting that prompt engineering alone can elicit schema-consistent behavior, especially in mid- and high-tier models.
  • RQ2: Can Agentic Workflows Enhance the Performance of LLMs in Codebook-Based News Content Analysis?
To evaluate the impact of agentic orchestration, we developed a multi-agent coordination layer atop Meta’s Llama 4 Maverick. This agentic architecture integrated multi-agent coordination and planning, tool utilization, and sub-task delegation, while preserving schema consistency. The system yielded a consistent improvement in performance, with the global average F1-score increasing from 0.757 (Direct-LLM prompt) to 0.805 under agentic execution, with consistent gains across binary, categorical, and multi-label classification tasks.
While the agentic configuration increased both input and output token volumes (see Table 6), our cost–performance analysis (Figure 3) demonstrates that it remains among the most cost-effective strategies for local models. This moderate cost increase preserves feasibility for large-scale corpora analysis. At a total processing cost of USD 1.12, compared to Sonnet’s USD 16.21 for the full dataset, the system achieved near state-of-the-art performance without reliance on higher-capacity models. This contrast illustrates a key strength of agentic workflows: the ability to increase performance through multi-agent profile-based cooperation, memory utilization, tool execution, and planning architectures. Our findings align with recent academic studies that report consistent performance improvements associated with the utilization of agentic architectures [26,29,30,31,42]. However, while these results are encouraging, particularly in binary and categorical tasks, trained human annotators remain the benchmark for reliability, especially in cases requiring subtle contextual judgments or interpretation across overlapping multi-label codes.

6. Conclusions

This paper explored the capabilities of LLMs in performing structured media content analysis, guided by two primary research questions. The key contributions of this work are as follows:
  • A curated dataset of 200 news articles, annotated with 26 codebook-derived questions and 122 codes related to US tariffs, is introduced, supporting reproducible evaluation and future research on media content analysis systems.
  • A systematic evaluation of seven state-of-the-art LLMs is conducted using structured prompting strategies. Notably, Claude-3-7-Sonnet achieved a global F1-score of 0.822, demonstrating strong performance in direct prompting settings.
  • An expert-based agentic architecture built on Meta’s Llama 4 Maverick is proposed and benchmarked, demonstrating how role-specialized reasoning can systematically improve structured codebook-based content analysis performance (raising the F1-score from 0.757 to 0.805) while remaining cost-effective relative to commercial models.
While the proposed approach demonstrates strong performance, several limitations remain. First, although our LLM-based and agentic approaches may ultimately enable large-scale audits of media data, potentially at global coverage that would be infeasible for purely human coders, the scalability of the method must be further validated on larger and more diverse datasets. Evaluating the framework on additional languages, a more varied collection of sources, and different thematic categories would increase its generalizability. Second, although structured prompting and agentic reasoning improve accuracy, human-in-the-loop validation may be essential to ensure reliability and provide explainability insights; our agentic system is designed to support such validation in future research. Future work could also extend the agentic approach to other architectures and models to further generalize our findings, and an evaluation against earlier LLM releases, traditional NLP techniques, and shallow ML models could provide additional insight into the relative performance and impact of the proposed approach [84]. While the focus of this study was to evaluate LLMs and LLM-based agents using a human-expert-created codebook, future work could explore codebook update phases within the agentic workflow. Finally, future work could address systematic errors in subjective sentiment and multi-label extraction, potentially through hybrid or ensemble approaches, since our results show important differences by question type and context across models; the agentic approach could, for example, assign different models to different questions or question categories.
This study has implications for scalable automation in media monitoring, public relations analysis, political communication, and journalistic accountability audits. Although this study employed a tariff-focused codebook, the methodology is applicable to other policy- or issue-specific content analyses. Such analyses, however, often demand near-perfect accuracy to prevent misinformation, inaccurate decision making, and biases that, depending on the domain, could have significant consequences. As LLM-based systems become increasingly integrated into content analysis workflows, attention must be given to interpretability, accountability, ethics, and error auditing to ensure responsible deployment in both academic and applied settings.

Author Contributions

Conceptualization, S.D. and S.V.; methodology, S.D., E.K. and P.C.; software, S.D. and E.K.; validation, S.D. and S.K.; formal analysis, S.D. and E.K.; investigation, S.D. and S.K.; resources, P.C.; data curation, S.K., E.K. and P.C.; writing—original draft preparation, S.D.; writing—review and editing, S.D., E.K. and P.C.; visualization, S.D. and E.K.; supervision, S.V.; project administration, S.D. and S.V. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset, including the codebook, collected articles, and human annotations used in this study, is publicly available at (https://doi.org/10.5281/zenodo.15767938, accessed on 29 June 2025) for research purposes.

Acknowledgments

During the preparation of this manuscript/study, the authors used generative AI for the purpose of editing grammar. The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors employed by the company DataScouting and the remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AI      Artificial Intelligence
ANSI    American National Standards Institute
API     Application Programming Interface
AQL     Acceptance Quality Limit
CoT     Chain of Thought
CSV     Comma-Separated Values
FP8     8-bit Floating Point
GPT     Generative Pre-trained Transformer
ICL     In-Context Learning
ISO     International Organization for Standardization
JSON    JavaScript Object Notation
LLM     Large Language Model
LLMs    Large Language Models
NLP     Natural Language Processing
OSINT   Open Source Intelligence
ReAct   Reasoning and Acting
ToT     Tree of Thought

Appendix A

Appendix A.1. LLM Model Prompting

The global prompt used for the model-based approach:
You are a world-class journalist with exceptional expertise in analyzing
and extracting information related to Trump’s tariffs from media.
Your task is to carefully read a given content (web article, social media
post or tv news transcript) and tag the relevant information
according to a specific codebook schema.
Ensure that you:
- Provide concise, accurate tagging strictly based on the given content.
- Leverage your ability to interpret complex data and identify nuanced
information.
- Make use of all relevant details, ensuring no omissions.
Focus solely on the content---avoid speculation or external information.
Please begin by summarizing your approach to ensure alignment with
the codebook schema. Then, proceed with the tagging.
This is important to your career.
Questions:
{
    "question": "Q1",
    "prompt": "What is the primary type of content for this article?",
    "questionAnswerType": "SINGLE_CHOICE",
    "eligibleQuestionAnswers": [
        "News",
        "Opinion",
        "Analysis",
        "Other"
    ]
},
{
    "question": "Q2",
    "prompt": "Is the author’s name mentioned in the article?",
    "questionAnswerType": "SINGLE_CHOICE",
    "eligibleQuestionAnswers": [
        "TRUE",
        "FALSE"
    ]
},
{
    "question": "Q3",
    "prompt": "Are specific references cited in the article?",
    "questionAnswerType": "SINGLE_CHOICE",
    "eligibleQuestionAnswers": [
        "TRUE",
        "FALSE"
    ]
},
{
    "question": "Q4",
    "prompt": "If the answer to Q3 is true, provide the list of types of reference
    cited.\nInstructions:\nDo not include vague or anonymous references that do not
    clearly specify the individual or entity (e.g., general references to \"officials\"
    without specifying who they are). If a person is named, categorize ONLY once, ONLY
    according to the entity they are linked to (government, financial, institution, etc.).
    For Government Official/Political Leader: include references to government officials,
    political leaders, presidents, heads of states, prime ministers, ministers and other
    high ranking public office holders. Include also references to capital cities when
    clearly referring to the government; for example, Brussels confirmed, according
    to Beijing officials, according to Washington officials. Financial Leader: include
    references to financial institutions or leaders within financial organizations, like
    ECB, IMF, Fed (examples: according to the European Central Bank, Fed Chair Jerome
    Powell stated). Exclude: do not include representatives of stock markets or commercial
    companies. Institution leader: include references to institutions and leaders or
    representatives of major institutions who act as policy makers, such as the European
    Union, World Health Organization (examples: WHO announced, EU Commission declared,
    EU President Ursula von der Leyen claimed). Industry leader: include references
    to representatives of industry associations, such as Chamber of Commerce. Exclude:
    do not include individual company or brand representatives such as CEOs or brand
    spokespersons. Academic Professional: include references of experts or individuals who
    are explicitly linked to an academic institution or university. Exclude: do not
    include persons who are cited as independent experts without a clear institutional
    affiliation. Report/study/Books: include references to data, statistics or
    finding explicitly taken/cited from named reports, studies or books. Think
    Tank: include references that specifically mention the entity as a Think Tank (e.g.
    the Think Tank term needs to be mentioned in the article). Spokesperson: include
    references to official spokespersons; for example, the White House Press Secretary
    and government spokespersons. Other: use when the source is a media outlet (such as
    a newspaper, TV network, news agency) or when no other category clearly applies.",
    "questionAnswerType": "MULTI_CHOICE",
    "eligibleQuestionAnswers": [
        "Government official/political leader",
        "Financial leader",
        "Institution leader",
        "Industry leader",
        "Academic professional",
        "Report/study/books",
        "Think tank",
        "Spokesperson",
        "Other"]
},
{
    "question": "Q5",
    "prompt": "Is Donald Trump’s tariff policy mentioned in the article?",
    "questionAnswerType": "SINGLE_CHOICE",
    "eligibleQuestionAnswers": [
        "TRUE",
        "FALSE"
    ]
},
{
    "question": "Q6",
    "prompt": "How prominently is the tariff policy mentioned in the article? If it is
    not mentioned answer Undefined\nInstructions:\nJustify your selection based on
    where in the article it is mentioned.Select the one that applies. Guidelines: Prominent
    mention: main topic, the tariff policy / tariffs mentioned in the headline and in the
    first paragraph, Most of the article discusses the tariff policy/tariffs in depth.
    Passing mention: the tariff policy/tariffs is a minor topic of the article: there is
    no mention in the headline or the first paragraph, the tariff policy/tariffs is
    referred to randomly in middle or end paragraph without much elaboration, there is no
    deep discussion of the tariff policy/tariff. Mere mention: the tariff policy/tariffs
    is a mere mention in the article: the tariff policy/tariffs is mentioned only
    once/twice anywhere in the article, not many details are mentioned.",
    "questionAnswerType": "SINGLE_CHOICE",
    "eligibleQuestionAnswers": [
        "Prominent mention",
        "Passing mention",
        "Mere mention",
        "Undefined"
    ]
},
{
    "question": "Q7",
    "prompt": "Is Donald Trump mentioned in the article regarding his tariff policy?",
    "questionAnswerType": "SINGLE_CHOICE",
    "eligibleQuestionAnswers": [
        "TRUE",
        "FALSE"
    ]
},
{
    "question": "Q8",
    "prompt": "How prominently is Donald Trump mentioned in the article in relation to his
    tariff policy (even if the tariff policy is not the primary focus of the article).
    If it is not mentioned answer Undefined\nInstructions:\nJustify based on where in
    the article he is mentioned.Select the one that applies. Guidelines: Prominent
    mention: main topic, the tariff policy / tariffs mentioned in the headline and in
    the first paragraph, Most of the article discusses the tariff policy/tariffs
    in depth. Passing mention: the tariff policy/tariffs is a minor topic of the article:
    there is no mention in the headline or the first paragraph, the tariff policy/tariffs
    is referred to randomly in middle or end paragraph without much elaboration, there is
    no deep discussion of the tariff policy/tariff. Mere mention: the tariff policy/tariffs
    is a mere mention in the article: the tariff policy/tariffs is mentioned only
    once/twice anywhere in the article, not many details are mentioned.",
    "questionAnswerType": "SINGLE_CHOICE",
    "eligibleQuestionAnswers": [
        "Prominent mention",
        "Passing mention",
        "Mere mention",
        "Undefined"
    ]
},
{
    "question": "Q9",
    "prompt": "Does the article mention any specific countries as being directly impacted
    by Donald Trump’s tariff policy?",
    "questionAnswerType": "SINGLE_CHOICE",
    "eligibleQuestionAnswers": [
        "TRUE",
        "FALSE"
    ]
},
{
    "question": "Q10",
    "prompt": "If the answer to q9 is true, choose from the countries listed.
    \nInstructions:\nChoose only the countries that are explicitly mentioned as being
    directly impacted by Donald Trump’s tariff policy. Select the country even if it
    is mentioned differently. Example. Select USA if article mentions US",
    "questionAnswerType": "MULTI_CHOICE",
    "eligibleQuestionAnswers": [
        "USA",
        "China",
        "Mexico",
        "Canada",
        "Germany",
        "India",
        "France",
        "UK",
        "Italy",
        "Other"
    ]
},
{
    "question": "Q11",
    "prompt": "What is the sentiment regarding the impact of Donald Trump’s tariff policy
    towards China. If China is not mentioned answer Undefined.",
    "questionAnswerType": "SINGLE_CHOICE",
    "eligibleQuestionAnswers": [
        "Positive",
        "Negative",
        "Neutral",
        "Undefined"
    ]
},
{
    "question": "Q12",
    "prompt": "Does the article mention a Chinese political leader/stakeholder taking a
    direct stance on Donald Trump’s tariff policy?\nInstructions:\nPolitical leaders:
    the article needs to mention: government officials (presidents, former presidents,
    prime ministers, heads of states, ministers, official spokespersons/White house press
    secretary, ambassadors), political leaders (party leaders). Include also references
    to capital cities when clearly referring to the government (examples: Brussels confirmed,
    according to Beijing officials, according to Washington officials). Exclude: political
    commentators, analysts, unofficial spokespersons, leaders of political movements
    without formal government rules. Stakeholder: for leading policy makers such leaders
    within financial organizations (like the heads of the ECB, IMF, or Fed) and leaders
    within major international institutions like the European Union or the World Health
    Organization. Exclude: representatives or leaders of commercial companies, academic
    experts, think-tank representatives or industry analysts. ",
    "questionAnswerType": "SINGLE_CHOICE",
    "eligibleQuestionAnswers": [
        "TRUE",
        "FALSE"
    ]
},
{
    "question": "Q13",
    "prompt": "Does the article mention retaliatory / counter measures taken by China
    in response to Donald Trump’s tariff policy? ",
    "questionAnswerType": "SINGLE_CHOICE",
    "eligibleQuestionAnswers": [
        "TRUE",
        "FALSE"
    ]
},
{
    "question": "Q14",
    "prompt": "If the answer to q13 is true, which of the following retaliatory / counter
    measures are mentioned as taken by China in response to Donald Trump’s tariff
    policy?\nInstructions:\nSelect all that apply",
    "questionAnswerType": "MULTI_CHOICE",
    "eligibleQuestionAnswers": [
        "Retaliatory Tariffs",
        "Trade negotiations and agreements",
        "Diversification of trade relationships",
        "Non-tariff measures and trade barriers",
        "Strategic and sector/industry specific measures",
        "Market adaptation and internal policy adjustments",
        "Production/manufacturing allocation",
        "Undefined"
    ]
},
{
    "question": "Q15",
    "prompt": "Does the article mention specific industries as being directly impacted
    by Donald Trump’s tariff policy?",
    "questionAnswerType": "SINGLE_CHOICE",
    "eligibleQuestionAnswers": [
        "TRUE",
        "FALSE"
    ]
},
{
    "question": "Q16",
    "prompt": "If the answer to question q15 is true, which industries are affected?
    \nInstructions:\nSelect only the industries that are explicitly mentioned as being
    directly impacted by Donald Trump’s tariff policy. Agriculture stands for all
    farming, including soybeans, corn, wheat, tomatoes. Blockchain includes bitcoin,
    cryptocurrency. Energy includes oil, fossil fuel, and clean energy. Food (e.g.
    chocolate, can include issues such as safety, processing). Metal includes steel,
    aluminum and other metals, even precious metals like gold and silver.
    Technology includes smartphones, computers, software, hardware. ",
    "questionAnswerType": "MULTI_CHOICE",
    "eligibleQuestionAnswers": [
        "Agriculture",
        "Automotive",
        "Aviation",
        "Energy",
        "Food",
        "Metal",
        "Pharmaceutical",
        "Retail",
        "Shipping",
        "Technology",
        "Blockchain",
        "Other"
    ]
},
{
    "question": "Q17",
    "prompt": "What is the sentiment regarding the impact of Donald Trump’s tariff policy
    on automotive industry? If automotive industry is not mentioned answer Undefined.",
    "questionAnswerType": "SINGLE_CHOICE",
    "eligibleQuestionAnswers": [
        "Positive",
        "Negative",
        "Neutral",
        "Undefined"
    ]
},
{
    "question": "Q18",
    "prompt": "Does the article mention specific brands (companies) as being directly
    impacted by Donald Trump’s tariff policy?",
    "questionAnswerType": "SINGLE_CHOICE",
    "eligibleQuestionAnswers": [
        "TRUE",
        "FALSE"
    ]
},
{
    "question": "Q19",
    "prompt": "If the answer to q18 is true, select which brands are mentioned in the
    article.\nInstructions:\nRespond with brands that are explicitly mentioned as being
    directly impacted by Donald Trump’s tariff policy. Select VW if the article also
    mentions Volkswagen. Select Mercedes if the article mentions Mercedes-Benz
    Models / products are excluded from the list. If you select OTHER: add in a note
    all other brands that are mentioned in the article as being directly affected
    by Donald Trump’s tariff policy. Do not add models / products. ",
    "questionAnswerType": "MULTI_CHOICE",
    "eligibleQuestionAnswers": [
        "Apple",
        "Audi",
        "BMW",
        "Boeing",
        "Dandelion",
        "Ford",
        "Hershey",
        "Jaguar",
        "Mercedes",
        "Meta",
        "Microsoft",
        " NVIDIA",
        "Porsche",
        "Primo Chocolate",
        "Samsung",
        "Shein",
        "Temu",
        "TikTok",
        "VW",
        "Tesla",
        "Other"
    ]
},
{
    "question": "Q20",
    "prompt": "Are other political leaders or stakeholders explicitly mentioned in
    relation to Donald Trump’s tariff policy in the article?\nInstructions:
    \nPolitical leaders: the article needs to mention: government officials
    (presidents, former presidents, prime ministers, heads of states, ministers,
    official spokespersons/White house press secretary, ambassadors), political
    leaders (party leaders). Include also references to capital cities when clearly
    referring to the government (examples: Brussels confirmed, according to
    Beijing officials, according to Washington officials). Exclude: political
    commentators, analysts, unofficial spokespersons, leaders of political movements
    without formal government rules. Stakeholder: for leading policy makers such leaders
    within financial organizations (like the heads of the ECB, IMF, or Fed) and leaders
    within major international institutions like the European Union or the World
    Health Organization. Exclude: representatives or leaders of commercial companies,
    academic experts, think-tank representatives or industry analysts. Note: The question
    specifically asks for mentions of other political leaders or stakeholders,
    besides Donald Trump, explicitly in relation to Donald Trump’s tariff policy.
    If the article only mentions Donald Trump as a political leader in connection with
    his own tariff policy, the answer should be FALSE. ",
    "questionAnswerType": "SINGLE_CHOICE",
    "eligibleQuestionAnswers": [
        "TRUE",
        "FALSE"
    ]
},
{
    "question": "Q21",
    "prompt": "If the answer to q21 is True, choose from persons listed.",
    "questionAnswerType": "MULTI_CHOICE",
    "eligibleQuestionAnswers": [
        "Christine Lagarde, ECB President",
        "Jerome Powell, Fed Chairman",
        "Ursula von der Leyen, EU President",
        "Xi Jinping, China President",
        "Other"
    ]
},
{
    "question": "Q22",
    "prompt": "What is the overall sentiment of the article towards Donald Trump in
    relation to his tariff policy?\nInstructions:\nSuggested criteria for annotating
    sentiment related to Donald Trump’s decision to impose tariffs, while these
    categories offer a structures approach, please apply your critical thinking to
    ensure an accurate assessment. The categories are intended as guidance rather
    than strict rules.Positive (highlights positive outcomes): when the article
    expresses approval, support, praise, benefits Negative (highlights negative impacts):
    critical, disapproval, failure, backlash, harm Neutral (balanced): balanced, objective,
    informative, factual Undefined (when it is ambiguous): unclear, vague, nonspecific,
    irrelevant ",
    "questionAnswerType": "SINGLE_CHOICE",
    "eligibleQuestionAnswers": [
        "Positive",
        "Negative",
        "Neutral",
        "Undefined"
    ]
},
{
    "question": "Q23",
    "prompt": "What is the overall sentiment of the article towards the economic impact
    of Donald Trump’s tariff policy?\nInstructions:\nEconomic impact should refer
    to factors such as inflation, supply chain disruption, job loss/creation, slower
    economic growth/GDP decline, impact on stock market/currencies, consumer spending
    decline, retaliatory measures.  Suggested criteria for annotating the overall
    sentiment of the article regarding the economic impact of Donald Trump’s tariff
    policy, while these categories offer a structures approach, please apply your
    critical thinking to ensure an accurate assessment. The categories are intended
    as guidance rather than strict rules. Positive (highlights positive outcomes):
    when the article expresses approval, support, praise, benefits. Negative
    (highlights negative impacts): critical, disapproval, failure, backlash, harm.
    Neutral (balanced): balanced, objective, informative, factual. Undefined (when
    it is ambiguous): unclear, vague, nonspecific, irrelevant ",
    "questionAnswerType": "SINGLE_CHOICE",
    "eligibleQuestionAnswers": [
        "Positive",
        "Negative",
        "Neutral",
        "Undefined"
    ]
},
{
    "question": "Q24",
    "prompt": "What is the overall sentiment towards the political impact of Donald
    Trump’s tariff policy?\nInstructions:\nPolitical impact should refer to factors
    such as changes in Donald Trump’s popularity (domestic/global), impact on
    US-foreing-ally relations, backlash from domestic/foreign industries,
    elections. Suggested criteria for annotating the overall sentiment of the article
    regarding the political impact of Donald Trump’s tariff policy, while these
    categories offer a structures approach, please apply your critical thinking to
    ensure an accurate assessment. The categories are intended as guidance
    rather than strict rules. Positive (highlights positive outcomes): when the
    article expresses approval, support, praise, benefits. Negative (highlights
    negative impacts): critical, disapproval, failure, backlash, harm. Neutral (balanced):
    balanced, objective, informative, factual. Undefined (when it is ambiguous):
    unclear, vague, nonspecific, irrelevant ",
    "questionAnswerType": "SINGLE_CHOICE",
    "eligibleQuestionAnswers": [
        "Positive",
        "Negative",
        "Neutral",
        "Undefined"
    ]
},
{
    "question": "Q25",
    "prompt": "Does the article connect Donald Trump’s tariff policy to his \"America
    First\" philosophy, either in direct or indirect manner?\nInstructions:\n
    The core of Donald Trump’s America First, philosophy emphasizes on prioritizing
    American interests, particularly in economic and foreign policy decisions.
    Trump advocates for reducing trade deficits, bringing jobs back to the U.S.,
    focusing on national security, and reshaping U.S. foreign relationships
    to ensure they benefited the country’s economy and security first. As part of
    his America First, agenda, Trump’s tariff policy aimed to reshape global trade
    by imposing tariffs on imports to protect American industries, reduce trade
    imbalances, and discourage offshoring. The day Trump announced his tariff policy
    was called Liberation Day.",
    "questionAnswerType": "SINGLE_CHOICE",
    "eligibleQuestionAnswers": [
        "TRUE",
        "FALSE"
    ]
},
{
    "question": "Q26",
    "prompt": "Does the article connect the policy directly or indirectly to the ’America
    First’ philosophy?If the answer to Q26 is False answer Undefined.\nInstructions:
    \nDirectly: the article explicitly links the tariff policy to Donald Trump’s
    \"America First\" philosophy citing the phrase \"America First\" in the context of
    justifying his policy. Indirectly: the article suggests or implies a connection
    with the \"America First\" philosophy citing references such as: prioritizing
    American interests, reducing trade deficits, bringing jobs back to the U.S.,
    focusing on national security, and reshaping U.S. foreign relationships,
    protect American industries, reduce trade imbalances, and discourage
    offshoring. The day Trump announced his tariff policy was called Liberation Day.",
    "questionAnswerType": "SINGLE_CHOICE",
    "eligibleQuestionAnswers": [
        "Directly",
        "Indirectly",
        "Undefined"
    ]
}
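Several of the questions above are conditional on a preceding binary answer (e.g., Q4 on Q3, Q10 on Q9, Q14 on Q13, Q16 on Q15, Q19 on Q18). The sketch below illustrates one way such gating can be enforced on model outputs, mirroring the rule given to the Reviewer agent ("if Q3 is FALSE, Q4 should be empty array"); the dependency map and helper names are illustrative and not the study's exact code.

# Sketch of post-hoc enforcement of the codebook's conditional questions.
# The map lists only prerequisites stated explicitly in the question prompts;
# it is illustrative, not the implementation used in the study.
MULTI_LABEL_DEPENDENCIES = {
    "Q4": "Q3",    # reference types require Q3 == TRUE
    "Q10": "Q9",   # impacted countries require Q9 == TRUE
    "Q14": "Q13",  # Chinese counter-measures require Q13 == TRUE
    "Q16": "Q15",  # impacted industries require Q15 == TRUE
    "Q19": "Q18",  # impacted brands require Q18 == TRUE
}

def enforce_conditionals(answers: dict) -> dict:
    """Blank out dependent multi-label answers whose prerequisite is FALSE."""
    cleaned = dict(answers)
    for dependent, prerequisite in MULTI_LABEL_DEPENDENCIES.items():
        if cleaned.get(prerequisite) == "FALSE":
            cleaned[dependent] = []
    return cleaned

# Example: Q3 answered FALSE, so any Q4 labels are dropped.
print(enforce_conditionals({"Q3": "FALSE", "Q4": ["Other"], "Q9": "TRUE"}))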
Response models for the high-level types of codebook questions (single- and multi-value answers):
{
    "type": "object",
    "fields": {
        "answer": {
            "type": "enum",
            "description": <QUESTION-PROMPT>,
            "values": <LIST-OF-TAGNAMES>
        },
        "reasoning": {
            "type": "str",
            "description": "The brief reasoning process that led to the answer.
            Single sentence if possible."
        }
    }
}
{
    "type": "object",
    "fields": {
        "answer": {
            "type": "list",
            "description": <QUESTION-PROMPT>,
            "items": {
                "type": "enum",
                "values": <LIST-OF-TAGNAMES>
            }
        },
        "reasoning": {
            "type": "str",
            "description": "The brief reasoning process that led to the answer.
            Single sentence if possible."
        }
    }
}
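As an illustration only, the sketch below shows how a response model of either type can be derived programmatically from a codebook entry's eligibleQuestionAnswers using Pydantic; the library choice and helper names are assumptions, not the implementation used in the study.

# Sketch: deriving a schema-constrained response model from one codebook entry.
# Pydantic is used here purely to illustrate the single- and multi-value schemas above.
from typing import List, Literal
from pydantic import create_model

def response_model_for(question: dict):
    """Build a response model constrained to the question's eligible codes."""
    options = tuple(question["eligibleQuestionAnswers"])
    answer_type = (
        Literal[options] if question["questionAnswerType"] == "SINGLE_CHOICE"
        else List[Literal[options]]
    )
    return create_model(
        f'{question["question"]}Response',
        answer=(answer_type, ...),   # constrained to the codebook codes
        reasoning=(str, ...),        # brief reasoning, single sentence if possible
    )

q1 = {
    "question": "Q1",
    "questionAnswerType": "SINGLE_CHOICE",
    "eligibleQuestionAnswers": ["News", "Opinion", "Analysis", "Other"],
}
Q1Response = response_model_for(q1)
print(Q1Response.model_json_schema())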

Appendix A.2. Metrics

Table A1. F1-scores for binary, categorical, and multi-label types of questions. All models were evaluated against the annotator’s values. The table shows each model by its abbreviation. M1: Meta-Llama-4-Maverick; M2: Qwen3; M3: Meta-Llama-3-3; M4: DeepSeek-R1; M5: DeepSeek-R1-Distill-Qwen; M6: Claude-3-7-Sonnet; M7: Meta-Llama-3-1; A: Agentic (Meta-Llama-4-Maverick). Bold indicates the highest value in each row.
Question   Type          M1      M2      M3      M4      M5      M6      M7      A
Q1         categorical   0.912   0.914   0.934   0.927   0.903   0.913   0.914   0.863
Q2         binary        0.693   0.935   0.726   0.841   0.823   0.867   0.708   0.940
Q3         binary        0.940   0.829   0.864   0.950   0.944   0.940   0.940   0.944
Q4         multi-label   0.588   0.481   0.527   0.553   0.541   0.660   0.440   0.657
Q5         binary        1.000   1.000   0.997   1.000   1.000   1.000   1.000   0.990
Q6         categorical   0.753   0.738   0.732   0.772   0.699   0.754   0.640   0.742
Q7         binary        0.995   0.997   0.995   0.995   1.000   0.997   0.947   0.990
Q8         categorical   0.447   0.481   0.641   0.481   0.444   0.658   0.506   0.536
Q9         binary        0.937   0.930   0.952   0.941   0.934   0.933   0.776   0.919
Q10        multi-label   0.800   0.738   0.838   0.843   0.760   0.835   0.609   0.781
Q11        categorical   0.632   0.813   0.810   0.804   0.793   0.841   0.517   0.809
Q12        binary        0.781   0.868   0.800   0.915   0.929   0.904   0.807   0.896
Q13        binary        0.780   0.888   0.856   0.851   0.920   0.910   0.733   0.925
Q14        multi-label   0.536   0.510   0.535   0.569   0.591   0.593   0.396   0.643
Q15        binary        0.849   0.785   0.815   0.804   0.832   0.850   0.727   0.850
Q16        multi-label   0.683   0.552   0.615   0.657   0.652   0.734   0.468   0.692
Q17        categorical   0.906   0.857   0.869   0.880   0.915   0.933   0.672   0.929
Q18        binary        0.929   0.929   0.945   0.944   0.929   0.945   0.684   0.955
Q19        multi-label   0.802   0.707   0.828   0.821   0.787   0.877   0.391   0.790
Q20        binary        0.677   0.616   0.784   0.825   0.840   0.860   0.485   0.805
Q21        multi-label   0.677   0.616   0.724   0.801   0.720   0.836   0.210   0.701
Q22        categorical   0.769   0.758   0.744   0.703   0.700   0.525   0.656   0.604
Q23        categorical   0.831   0.824   0.868   0.877   0.834   0.865   0.761   0.841
Q24        categorical   0.494   0.559   0.536   0.552   0.540   0.558   0.470   0.535
Q25        binary        0.664   0.437   0.708   0.767   0.742   0.814   0.583   0.792
Q26        categorical   0.594   0.393   0.687   0.722   0.693   0.779   0.482   0.791

Appendix B

Appendix B.1. Agentic System Token Metrics

Table A2. Per agent token utilization.
Agents              Input Tokens   Output Tokens   Total Tokens
Setup Agent         160,082        126,873         286,955
Media Analyst       569,442        133,801         703,243
Political Analyst   885,590        160,859         1,046,449
Trade Economist     1,006,613      177,110         1,183,723
Reviewer            841,492        230,152         1,071,644
Converter           455,562        52,763          508,325

Appendix B.2. Detailed Agent Prompts and System Orchestration

Appendix B.2.1. Media Analyst Prompt Template

You are a Media and Journalism Expert analyzing news articles about Donald Trump’s
tariff policies.
ARTICLE TO ANALYZE:
{article_data}
You are a world-class expert with exceptional expertise in analyzing and extracting
information related to Trump’s tariffs from media.
Your task is to carefully read a given content (web article) and tag the relevant information.
Ensure that you:
- Provide concise, accurate tagging strictly based on the given content.
- Leverage your ability to interpret complex data and identify nuanced information.
- Make use of all relevant details, ensuring no omissions.
Focus on the content and avoid speculation.
Please begin by summarizing your approach to ensure alignment with the codebook schema.
Then, proceed with the tagging.
This is important to your career.
Your expertise covers:
- Content classification and journalistic standards
- Authorship analysis and source credibility
- Media framing and presentation techniques
- Publication context and editorial positioning
ANALYSIS QUESTIONS TO ADDRESS:
{MEDIA_QUESTIONS}
Instructions:
1. Read the provided article carefully
2. Apply your media expertise to analyze the content
3. Address each relevant question from your domain
4. Provide detailed insights about journalistic approach, sourcing, framing, and presentation
5. Focus on HOW the story is told, not just WHAT is told
Provide a comprehensive analysis covering all relevant media and journalism aspects.

Appendix B.2.2. Political Analyst Prompt Template

You are a Political Analysis Expert analyzing news articles about
Donald Trump’s tariff policies.
ARTICLE TO ANALYZE:
{article_data}
You are a world-class expert with exceptional expertise in analyzing and extracting
information related to Trump’s tariffs from media.
Your task is to carefully read a given content (web article) and tag the relevant information.
Ensure that you:
- Provide concise, accurate tagging strictly based on the given content.
- Leverage your ability to interpret complex data and identify nuanced information.
- Make use of all relevant details, ensuring no omissions.
Focus on the content and avoid speculation.
Please begin by summarizing your approach to ensure alignment with the codebook schema.
Then, proceed with the tagging.
This is important to your career.
Your expertise covers:
- Political leadership and governance analysis
- International relations and diplomatic implications
- America First, philosophy and policy alignment
- Political strategy and policy implementation
Distinguish sentiment levels carefully:
- NEGATIVE: Any criticism, concern, worry, opposition, or unfavorable tone
- NEUTRAL: Purely factual reporting without evaluative language
- POSITIVE: Explicit support, praise, or favorable framing
ANALYSIS QUESTIONS TO ADDRESS:
{POLITICAL_QUESTIONS}
Instructions:
1. Read the provided article carefully
2. Apply your political expertise to analyze the content
3. Address each relevant question from your domain
4. Provide detailed insights about political dynamics, leadership, and policy implications
5. Focus on political strategy, governance, and international relations aspects
6. Make use of all relevant details, especially regarding persons mentioned,
ensuring no omissions.
7. Implicit references to persons or philosophies are also valid.
Provide a comprehensive analysis covering all relevant political aspects.

Appendix B.2.3. Trade Economist Prompt Template

You are a Trade and Economics Expert analyzing news articles about
Donald Trump’s tariff policies.
You are a world-class expert with exceptional expertise in analyzing and extracting
information related to Trump’s tariffs from media.
Your task is to carefully read a given content (web article) and tag the relevant information.
Ensure that you:
- Provide concise, accurate tagging strictly based on the given content.
- Leverage your ability to interpret complex data and identify nuanced information.
- Make use of all relevant details, ensuring no omissions.
Focus on the content and avoid speculation.
Please begin by summarizing your approach to ensure alignment with the codebook schema.
Then, proceed with the tagging.
This is important to your career.
ARTICLE TO ANALYZE:
{article_data}
Your expertise covers:
- International trade and tariff analysis
- Economic impact assessment
- Industry and sector analysis
Distinguish sentiment levels carefully:
- NEGATIVE: Any criticism, concern, worry, opposition, or unfavorable tone
- NEUTRAL: Purely factual reporting without evaluative language
- POSITIVE: Explicit support, praise, or favorable framing
ANALYSIS QUESTIONS TO ADDRESS:
{TRADE_QUESTIONS}
Instructions:
1. Read the provided article carefully
2. Apply your trade and economics expertise to analyze the content
3. Address each relevant question from your domain
4. Provide detailed insights about economic impacts, trade relationships, and market dynamics
5. Focus on quantitative impacts, industry effects, and economic implications
Provide a comprehensive analysis covering all relevant trade and economic aspects.

Appendix B.2.4. Reviewer Prompt Template

You are a Research Synthesis Expert responsible for converting expert
analyses into structured JSON output.
You will receive detailed analyses from three domain experts
You are a world-class expert with exceptional expertise in analyzing and extracting
information related to Donald Trump’s tariffs from media.
Your task is to carefully read a given content (web article) and tag the relevant information.
Ensure that you:
- Provide concise, accurate tagging strictly based on the given content.
- Leverage your ability to interpret complex data and identify nuanced information.
- Make use of all relevant details, ensuring no omissions.
Focus on the content and avoid speculation.
Please begin by summarizing your approach to ensure alignment with the codebook schema.
Then, proceed with the tagging.
This is important to your career.
Your task is to synthesize their insights into a structured JSON
response following the exact question schema format.
CRITICAL REQUIREMENTS:
1. Review all expert analyses carefully and map their insights
to the appropriate questions (Q1--Q26)
2. For each question, provide a "reasoning" (brief explanation) both an "answer"
(from the specified options)
3. Use "Undefined" for questions where experts did not provide clear answers
4. Maintain consistency across related questions
5. Follow conditional logic: skip dependent questions if prerequisite is FALSE
(provide empty arrays for multi-select)
6. MAKE NO OMISSIONS - e.g., If an analyst mentioned a person or industry
include it in the answer.
7. Check if a reviewer missed a subtle/indirect reference.
SYNTHESIS PROCESS:
1. Review the original article and all expert analyses
2. For each of the 26 questions, determine the most appropriate
answer based on expert insights
3. Provide brief reasoning for each answer based on the expert analyses
4. Ensure all responses match the specified options exactly
5. Handle conditional questions properly
(e.g., if Q3 is FALSE, Q4 should be empty array)
The output will be automatically formatted as JSON according to the schema -
focus on providing accurate answers and reasoning.
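The prompt templates above are executed within the orchestration flow summarized in Table 3. The framework-agnostic Python sketch below illustrates the sequential hand-off (setup, domain experts, reviewer, format converter) with a shared memory passed along the chain; run_agent(), the abbreviated templates, and the question partitions are placeholders, not the actual agent-framework implementation used in the study.

# Framework-agnostic sketch of the agent hand-off: setup -> domain experts ->
# reviewer -> format converter, with a shared memory dictionary carried along.
def run_agent(prompt: str) -> str:
    # Placeholder for an actual LLM call (e.g., through an agent SDK or HTTP API).
    return f"[model output for prompt starting: {prompt[:40]!r}]"

EXPERT_TEMPLATES = {  # abbreviated stand-ins for the templates in Appendix B.2
    "media": "Media Analyst prompt ...\nARTICLE:\n{article_data}\nQUESTIONS:\n{questions}",
    "political": "Political Analyst prompt ...\nARTICLE:\n{article_data}\nQUESTIONS:\n{questions}",
    "trade": "Trade Economist prompt ...\nARTICLE:\n{article_data}\nQUESTIONS:\n{questions}",
}
REVIEWER_TEMPLATE = ("Reviewer prompt ...\nARTICLE:\n{article_data}\n"
                     "EXPERT ANALYSES:\n{analyses}")

def analyze_article(article_data: str, question_groups: dict) -> str:
    memory = {"article": article_data}  # shared memory visible to all agents

    # Setup agent: parse and structure the raw article for downstream agents.
    memory["structured_article"] = run_agent(f"Parse and structure:\n{article_data}")

    # Domain experts: each receives the article plus its own question subset.
    for role, questions in question_groups.items():
        prompt = EXPERT_TEMPLATES[role].format(
            article_data=memory["structured_article"], questions=questions)
        memory[f"{role}_analysis"] = run_agent(prompt)

    # Reviewer: synthesize expert analyses into draft answers for Q1-Q26.
    analyses = "\n\n".join(v for k, v in memory.items() if k.endswith("_analysis"))
    draft = run_agent(REVIEWER_TEMPLATE.format(
        article_data=memory["structured_article"], analyses=analyses))

    # Format converter: enforce schema-compliant JSON output.
    return run_agent(f"Convert to schema-compliant JSON:\n{draft}")

# Illustrative question partition; the real split follows the agents' domains.
result = analyze_article("Sample article text ...",
                         {"media": "Q1-Q4", "political": "Q5-Q12", "trade": "Q13-Q19"})
print(result)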

References

  1. Al-Quran, M.W.M. Traditional media versus social media: Challenges and opportunities. Tech. Rom. J. Appl. Sci. Technol. 2022, 4, 145–160. [Google Scholar] [CrossRef]
  2. Shahrzadi, L.; Mansouri, A.; Alavi, M.; Shabani, A. Causes, consequences, and strategies to deal with information overload: A scoping review. Int. J. Inf. Manag. Data Insights 2024, 4, 100261. [Google Scholar] [CrossRef]
  3. Power, D.J.; Phillips-Wren, G. Impact of social media and Web 2.0 on decision-making. J. Decis. Syst. 2011, 20, 249–261. [Google Scholar] [CrossRef]
  4. Duhé, S.C. New Media and Public Relations; Peter Lang: New York, NY, USA, 2007. [Google Scholar]
  5. Ghassabi, F.; Zare-Farashbandi, F. The role of media in crisis management: A case study of Azarbayejan earthquake. Int. J. Health Syst. Disaster Manag. 2015, 3, 95–102. [Google Scholar]
  6. Reuter, C.; Hughes, A.L.; Kaufhold, M.A. Social media in crisis management: An evaluation and analysis of crisis informatics research. Int. J. Hum.-Comput. Interact. 2018, 34, 280–294. [Google Scholar] [CrossRef]
  7. Soroka, S.; Farnsworth, S.; Lawlor, A.; Young, L. Mass media and policy-making. In Routledge Handbook of Public Policy; Routledge: Abingdon, Oxon, UK, 2012; pp. 204–214. [Google Scholar]
  8. Neuendorf, K.A. The Content Analysis Guidebook; SAGE: Thousand Oaks, CA, USA, 2017. [Google Scholar]
  9. Krippendorff, K. Content Analysis: An Introduction to Its Methodology; SAGE Publications: Thousand Oaks, CA, USA, 2018. [Google Scholar]
  10. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, Long Beach, CA, USA, 4–9 December 2017; Curran Associates: Red Hook, NY, USA, 2017. [Google Scholar]
  11. OpenAI. GPT-4 Technical Report. arXiv 2023, arXiv:2303.08774. [Google Scholar] [CrossRef]
  12. Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv 2023, arXiv:2307.09288. [Google Scholar] [CrossRef]
  13. Jiang, A.Q.; Sablayrolles, A.; Mensch, A.; Bamford, C.; Chaplot, D.S.; de las Casas, D.; Bressand, F.; Lengyel, G.; Lample, G.; Saulnier, L.; et al. Mistral 7B. arXiv 2023, arXiv:2310.06825. [Google Scholar] [CrossRef]
  14. Chew, R.; Bollenbacher, J.; Wenger, M.; Speer, J.; Kim, A. LLM-assisted content analysis: Using large language models to support deductive coding. arXiv 2023, arXiv:2306.14924. [Google Scholar] [CrossRef]
  15. Dong, Q.; Li, L.; Dai, D.; Zheng, C.; Ma, J.; Li, R.; Xia, H.; Xu, J.; Wu, Z.; Liu, T.; et al. A survey on in-context learning. arXiv 2022, arXiv:2301.00234. [Google Scholar]
  16. Chowdhery, A.; Narang, S.; Devlin, J.; Bosma, M.; Mishra, G.; Roberts, A.; Barham, P.; Chung, H.W.; Sutton, C.; Gehrmann, S.; et al. Palm: Scaling language modeling with pathways. J. Mach. Learn. Res. 2023, 24, 1–113. [Google Scholar]
  17. Mosbach, M.; Pimentel, T.; Ravfogel, S.; Klakow, D.; Elazar, Y. Few-shot fine-tuning vs. in-context learning: A fair comparison and evaluation. arXiv 2023, arXiv:2305.16938. [Google Scholar]
  18. Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Ichter, B.; Xia, F.; Chi, E.; Le, Q.V.; Zhou, D. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, 28 November–9 December 2022; Curran Associates, Inc.: Red Hook, NY, USA, 2022; pp. 24824–24837. [Google Scholar]
  19. Suzgun, M.; Scales, N.; Schärli, N.; Gehrmann, S.; Tay, Y.; Chung, H.W.; Chowdhery, A.; Le, Q.V.; Chi, E.H.; Zhou, D.; et al. Challenging big-bench tasks and whether chain-of-thought can solve them. arXiv 2022, arXiv:2210.09261. [Google Scholar]
  20. Xiao, Z.; Yuan, X.; Liao, Q.V.; Abdelghani, R.; Oudeyer, P.Y. Supporting qualitative analysis with large language models: Combining codebook with GPT-3 for deductive coding. In Proceedings of the 28th International Conference on Intelligent User Interfaces, Sydney, NSW, Australia, 27–31 March 2023; pp. 75–78. [Google Scholar]
  21. Dunivin, Z.O. Scaling hermeneutics: A guide to qualitative coding with LLMs for reflexive content analysis. EPJ Data Sci. 2025, 14, 28. [Google Scholar] [CrossRef]
  22. Dunivin, Z.O. Scalable qualitative coding with llms: Chain-of-thought reasoning matches human performance in some hermeneutic tasks. arXiv 2024, arXiv:2401.15170. [Google Scholar]
  23. Ruckdeschel, M. Just Read the Codebook! Make Use of Quality Codebooks in Zero-Shot Classification of Multilabel Frame Datasets. In Proceedings of the 31st International Conference on Computational Linguistics, Abu Dhabi, United Arab Emirates, 19–24 January 2025; pp. 6317–6337. [Google Scholar]
  24. Dai, S.C.; Xiong, A.; Ku, L.W. LLM-in-the-loop: Leveraging large language model for thematic analysis. arXiv 2023, arXiv:2310.15100. [Google Scholar]
  25. Halterman, A.; Keith, K.A. Codebook llms: Adapting political science codebooks for llm use and adapting llms to follow codebooks. arXiv 2024, arXiv:2407.10747. [Google Scholar] [CrossRef]
  26. Wang, L.; Ma, C.; Feng, X.; Zhang, Z.; Yang, H.; Zhang, J.; Chen, Z.; Tang, J.; Chen, X.; Lin, Y.; et al. A survey on large language model based autonomous agents. Front. Comput. Sci. 2024, 18, 186345. [Google Scholar] [CrossRef]
  27. Russell, S.J.; Norvig, P. Artificial Intelligence: A Modern Approach; Prentice Hall: Englewood Cliffs, NJ, USA, 1995. [Google Scholar]
  28. Shinn, N.; Cassano, F.; Gopinath, A.; Narasimhan, K.; Yao, S. Reflexion: Language agents with verbal reinforcement learning. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, 10–16 December 2023; Curran Associates, Inc.: Red Hook, NY, USA, 2023; pp. 8634–8652. [Google Scholar]
  29. D’Arcy, M.; Hope, T.; Birnbaum, L.; Downey, D. Marg: Multi-agent review generation for scientific papers. arXiv 2024, arXiv:2401.04259. [Google Scholar] [CrossRef]
  30. Huang, D.; Zhang, J.M.; Luck, M.; Bu, Q.; Qing, Y.; Cui, H. Agentcoder: Multi-agent-based code generation with iterative testing and optimisation. arXiv 2023, arXiv:2312.13010. [Google Scholar]
  31. Qian, C.; Cong, X.; Yang, C.; Chen, W.; Su, Y.; Xu, J.; Liu, Z.; Sun, M. Communicative agents for software development. arXiv 2023, arXiv:2307.07924. [Google Scholar] [CrossRef]
  32. Hong, S.; Zheng, X.; Chen, J.; Cheng, Y.; Wang, J.; Zhang, C.; Wang, Z.; Yau, S.K.S.; Lin, Z.; Zhou, L.; et al. Metagpt: Meta programming for multi-agent collaborative framework. arXiv 2023, arXiv:2308.00352. [Google Scholar]
  33. Yang, J.; Jimenez, C.E.; Wettig, A.; Lieret, K.; Yao, S.; Narasimhan, K.; Press, O. Swe-agent: Agent-computer interfaces enable automated software engineering. In Advances in Neural Information Processing Systems 37: Thirty-eighth Annual Conference on Neural Information Processing Systems, NeurIPS 2024, Vancouver, BC, Canada, 10–15 December 2024; Curran Associates, Inc.: Red Hook, NY, USA, 2024; pp. 50528–50652. [Google Scholar]
  34. Park, J.S.; O’Brien, J.; Cai, C.J.; Morris, M.R.; Liang, P.; Bernstein, M.S. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, San Francisco, CA, USA, 29 October–1 November 2023; pp. 1–22. [Google Scholar]
  35. Yang, Z.; Zhang, Z.; Zheng, Z.; Jiang, Y.; Gan, Z.; Wang, Z.; Ling, Z.; Chen, J.; Ma, M.; Dong, B.; et al. Oasis: Open agents social interaction simulations on one million agents. arXiv 2024, arXiv:2411.11581. [Google Scholar] [CrossRef]
  36. Aher, G.V.; Arriaga, R.I.; Kalai, A.T. Using large language models to simulate multiple humans and replicate human subject studies. In Proceedings of the International Conference on Machine Learning, PMLR, Honolulu, HI, USA, 23–29 July 2023; pp. 337–371. [Google Scholar]
  37. Li, N.; Gao, C.; Li, M.; Li, Y.; Liao, Q. Econagent: Large language model-empowered agents for simulating macroeconomic activities. arXiv 2023, arXiv:2310.10436. [Google Scholar]
  38. Hao, Y.; Xie, D. A multi-llm-agent-based framework for economic and public policy analysis. arXiv 2025, arXiv:2502.16879. [Google Scholar]
  39. Ahn, M.; Brohan, A.; Brown, N.; Chebotar, Y.; Cortes, O.; David, B.; Finn, C.; Fu, C.; Gopalakrishnan, K.; Hausman, K.; et al. Do as i can, not as i say: Grounding language in robotic affordances. arXiv 2022, arXiv:2204.01691. [Google Scholar] [CrossRef]
  40. Singh, H.; Das, R.J.; Han, M.; Nakov, P.; Laptev, I. MALMM: Multi-Agent Large Language Models for Zero-Shot Robotics Manipulation. arXiv 2024, arXiv:2411.17636. [Google Scholar]
  41. Bharadhwaj, H.; Vakil, J.; Sharma, M.; Gupta, A.; Tulsiani, S.; Kumar, V. Roboagent: Generalization and efficiency in robot manipulation via semantic augmentations and action chunking. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 13–17 May 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 4788–4795. [Google Scholar]
  42. Zhang, H.; Du, W.; Shan, J.; Zhou, Q.; Du, Y.; Tenenbaum, J.B.; Shu, T.; Gan, C. Building cooperative embodied agents modularly with large language models. arXiv 2023, arXiv:2307.02485. [Google Scholar]
  43. Zhu, X.; Chen, Y.; Tian, H.; Tao, C.; Su, W.; Yang, C.; Huang, G.; Li, B.; Lu, L.; Wang, X.; et al. Ghost in the minecraft: Generally capable agents for open-world environments via large language models with text-based knowledge and memory. arXiv 2023, arXiv:2305.17144. [Google Scholar]
  44. Zhong, W.; Guo, L.; Gao, Q.; Ye, H.; Wang, Y. Memorybank: Enhancing large language models with long-term memory. Proc. AAAI Conf. Artif. Intell. 2024, 38, 19724–19731. [Google Scholar] [CrossRef]
  45. Yao, S.; Yu, D.; Zhao, J.; Shafran, I.; Griffiths, T.; Cao, Y.; Narasimhan, K. Tree of thoughts: Deliberate problem solving with large language models. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, 10–16 December 2023; Curran Associates Inc.: Red Hook, NY, USA, 2023; pp. 11809–11822. [Google Scholar]
  46. Yao, S.; Zhao, J.; Yu, D.; Du, N.; Shafran, I.; Narasimhan, K.; Cao, Y. React: Synergizing reasoning and acting in language models. In Proceedings of the International Conference on Learning Representations (ICLR), Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
  47. Shen, Y.; Song, K.; Tan, X.; Li, D.; Lu, W.; Zhuang, Y. Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, 10–16 December 2023; Curran Associates, Inc.: Red Hook, NY, USA, 2023; pp. 38154–38180. [Google Scholar]
  48. Schick, T.; Dwivedi-Yu, J.; Dessì, R.; Raileanu, R.; Lomeli, M.; Hambro, E.; Zettlemoyer, L.; Cancedda, N.; Scialom, T. Toolformer: Language models can teach themselves to use tools. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, 10–16 December 2023; Curran Associates, Inc.: Red Hook, NY, USA, 2023; pp. 68539–68551. [Google Scholar]
  49. Qiao, T.; Walker, C.; Cunningham, C.; Koh, Y.S. Thematic-LM: A LLM-based Multi-agent System for Large-scale Thematic Analysis. In Proceedings of the ACM on Web Conference 2025, Sydney, NSW, Australia, 28 April–2 May 2025; pp. 649–658. [Google Scholar]
  50. ISO 2859-1; Sampling Procedures for Inspection by Attributes—Part 1: Sampling Schemes Indexed by Acceptance Quality Limit (AQL) for Lot-by-Lot Inspection. International Organization for Standardization (ISO): Geneva, Switzerland, 1999.
  51. ANSI/ASQ Z1.4-2003; American National Standards Institute, American Society for Quality. Sampling Procedures and Tables for Inspection by Attributes. ASQ Quality Press: Milwaukee, WI, USA, 2003.
  52. Chang, Y.; Wang, X.; Wang, J.; Wu, Y.; Yang, L.; Zhu, K.; Chen, H.; Yi, X.; Wang, C.; Wang, Y.; et al. A survey on evaluation of large language models. ACM Trans. Intell. Syst. Technol. 2024, 15, 1–45. [Google Scholar] [CrossRef]
  53. Grattafiori, A.; Dubey, A.; Jauhri, A.; Pandey, A.; Kadian, A.; Al-Dahle, A.; Letman, A.; Mathur, A.; Schelten, A.; Vaughan, A.; et al. The llama 3 herd of models. arXiv 2024, arXiv:2407.21783. [Google Scholar] [CrossRef]
  54. Guo, D.; Yang, D.; Zhang, H.; Song, J.; Zhang, R.; Xu, R.; Zhu, Q.; Ma, S.; Wang, P.; Bi, X.; et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv 2025, arXiv:2501.12948. [Google Scholar]
  55. Yang, A.; Yang, B.; Zhang, B.; Hui, B.; Zheng, B.; Yu, B.; Li, C.; Liu, D.; Huang, F.; Wei, H.; et al. Qwen2.5 Technical Report. arXiv 2025, arXiv:2412.15115. [Google Scholar]
  56. Girija, S.S.; Kapoor, S.; Arora, L.; Pradhan, D.; Raj, A.; Shetgaonkar, A. Optimizing LLMs for Resource-Constrained Environments: A Survey of Model Compression Techniques. arXiv 2025, arXiv:2505.02309. [Google Scholar] [CrossRef]
  57. Bai, G.; Chai, Z.; Ling, C.; Wang, S.; Lu, J.; Zhang, N.; Shi, T.; Yu, Z.; Zhu, M.; Zhang, Y.; et al. Beyond efficiency: A systematic survey of resource-efficient large language models. arXiv 2024, arXiv:2401.00625. [Google Scholar] [CrossRef]
  58. Seymour, L.; Kutukcu, B.; Baidya, S. Large Language Models on Small Resource-Constrained Systems: Performance Characterization, Analysis and Trade-offs. arXiv 2024, arXiv:2412.15352. [Google Scholar] [CrossRef]
  59. Yang, A.; Li, A.; Yang, B.; Zhang, B.; Hui, B.; Zheng, B.; Yu, B.; Gao, C.; Huang, C.; Lv, C.; et al. Qwen3 technical report. arXiv 2025, arXiv:2505.09388. [Google Scholar] [CrossRef]
  60. Shi, Y.; Shu, P.; Liu, Z.; Wu, Z.; Li, Q.; Liu, T.; Liu, N.; Li, X. Mgh radiology llama: A llama 3 70b model for radiology. arXiv 2024, arXiv:2408.11848. [Google Scholar]
  61. Meta. The Llama 4 Herd: The Beginning of a New Era of Natively Multimodal AI Innovation. 2025. Available online: https://ai.meta.com/blog/llama-4-multimodal-intelligence/ (accessed on 27 June 2025).
  62. Wu, H.; Han, Z.; Zhou, J.T.; Huang, H.; Zhang, C. Computational Reasoning of Large Language Models. arXiv 2025, arXiv:2504.20771. [Google Scholar]
  63. Ferrag, M.A.; Tihanyi, N.; Debbah, M. Reasoning beyond limits: Advances and open problems for llms. arXiv 2025, arXiv:2503.22732. [Google Scholar]
  64. Anthropic. Claude 3.7 Sonnet and Claude Code. 2025. Available online: https://www.anthropic.com/news/claude-3-7-sonnet (accessed on 27 June 2025).
  65. Dinc, M.T.; Bardak, A.E.; Bahar, F.; Noronha, C. Comparative analysis of large language models in clinical diagnosis: Performance evaluation across common and complex medical cases. JAMIA Open 2025, 8, ooaf055. [Google Scholar] [CrossRef] [PubMed]
  66. Pirkelbauer, P. CompilerGPT: Leveraging Large Language Models for Analyzing and Acting on Compiler Optimization Reports. arXiv 2025, arXiv:2506.06227. [Google Scholar] [CrossRef]
  67. Viegas, C.; Gheyi, R.; Ribeiro, M. Assessing the Capability of LLMs in Solving POSCOMP Questions. arXiv 2025, arXiv:2505.20338. [Google Scholar] [CrossRef]
  68. Arora, G.; Jain, S.; Merugu, S. Intent detection in the age of LLMs. arXiv 2024, arXiv:2410.01627. [Google Scholar] [CrossRef]
  69. Jain, Y.; Hollander, J.; He, A.; Tang, S.; Zhang, L.; Sabatini, J. Exploring the Potential of Large Language Models for Estimating the Reading Comprehension Question Difficulty. In Proceedings of the International Conference on Human-Computer Interaction, Gothenburg, Sweden, 22–27 June 2025; Springer: Berlin/Heidelberg, Germany, 2025; pp. 202–213. [Google Scholar]
  70. Harnad, S. Language writ large: LLMs, ChatGPT, meaning, and understanding. Front. Artif. Intell. 2025, 7, 1490698. [Google Scholar] [CrossRef] [PubMed]
  71. Ma, B.; Li, Y.; Zhou, W.; Gong, Z.; Liu, Y.J.; Jasinskaja, K.; Friedrich, A.; Hirschberg, J.; Kreuter, F.; Plank, B. Pragmatics in the era of large language models: A survey on datasets, evaluation, opportunities and challenges. arXiv 2025, arXiv:2502.12378. [Google Scholar]
  72. Cheng, J.; Marone, M.; Weller, O.; Lawrie, D.; Khashabi, D.; Van Durme, B. Dated data: Tracing knowledge cutoffs in large language models. arXiv 2024, arXiv:2403.12958. [Google Scholar] [CrossRef]
  73. Xia, Y.; Kim, J.; Chen, Y.; Ye, H.; Kundu, S.; Hao, C.; Talati, N. Understanding the Performance and Estimating the Cost of LLM Fine-Tuning. arXiv 2024, arXiv:2408.04693. [Google Scholar] [CrossRef]
  74. Zhang, L.; Liu, X.; Li, Z.; Pan, X.; Dong, P.; Fan, R.; Guo, R.; Wang, X.; Luo, Q.; Shi, S.; et al. Dissecting the runtime performance of the training, fine-tuning, and inference of large language models. arXiv 2023, arXiv:2311.03687. [Google Scholar] [CrossRef]
  75. Zhou, H.; Hu, C.; Yuan, D.; Yuan, Y.; Wu, D.; Liu, X.; Zhang, C. Large language model (llm)-enabled in-context learning for wireless network optimization: A case study of power control. arXiv 2024, arXiv:2408.00214. [Google Scholar]
  76. Zhang, X.; Zhang, J.; Mo, F.; Wang, D.; Fu, Y.; Liu, K. LEKA: LLM-Enhanced Knowledge Augmentation. arXiv 2025, arXiv:2501.17802. [Google Scholar]
  77. Schulhoff, S.; Ilie, M.; Balepur, N.; Kahadze, K.; Liu, A.; Si, C.; Li, Y.; Gupta, A.; Han, H.; Schulhoff, S.; et al. The prompt report: A systematic survey of prompting techniques. arXiv 2024, arXiv:2406.06608. [Google Scholar]
  78. Hutchins, E. Cognition in the Wild; MIT Press: Cambridge, MA, USA, 1995. [Google Scholar]
  79. Li, G.; Hammoud, H.; Itani, H.; Khizbullin, D.; Ghanem, B. Camel: Communicative agents for “mind” exploration of large language model society. In Advances in Neural Information Processing Systems 36: Proceedings of the Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, 10–16 December 2023; Curran Associates, Inc.: Red Hook, NY, USA, 2023; Volume 36, pp. 51991–52008. [Google Scholar]
  80. Google. adk-python: Agent Development Kit (ADK). 2025. Available online: https://github.com/google/adk-python (accessed on 25 June 2025).
  81. OpenRouter, Inc. OpenRouter: The Unified Interface for LLMs. 2025. Available online: https://openrouter.ai (accessed on 27 June 2025).
  82. Gilardi, F.; Alizadeh, M.; Kubli, M. ChatGPT outperforms crowd workers for text-annotation tasks. Proc. Natl. Acad. Sci. USA 2023, 120, e2305016120. [Google Scholar] [CrossRef] [PubMed]
  83. Bojić, L.; Zagovora, O.; Zelenkauskaite, A.; Vuković, V.; Čabarkapa, M.; Veseljević Jerković, S.; Jovančević, A. Comparing large Language models and human annotators in latent content analysis of sentiment, political leaning, emotional intensity and sarcasm. Sci. Rep. 2025, 15, 11477. [Google Scholar] [CrossRef] [PubMed]
  84. Rasool, A.; Shahzad, M.I.; Aslam, H.; Chan, V.; Arshad, M.A. Emotion-aware embedding fusion in large language models (Flan-T5, Llama 2, DeepSeek-R1, and ChatGPT 4) for intelligent response generation. AI 2025, 6, 56. [Google Scholar] [CrossRef]
Figure 1. LLM-based news coding multi-agent architecture.
Figure 2. Heatmap presenting the average F1-scores per question for all evaluated models.
Figure 3. Model performance comparison showing cost effectiveness versus average F1-score over the 200-article dataset. Each bubble represents a different language model, with bubble size proportional to total token usage (input + output tokens). The x-axis shows the total inference cost in USD, while the y-axis represents the average F1-score across evaluation tasks. Pricing data for the LLMs was sourced from [81].
Table 1. Codebook question types and distribution summary.
Question Type | Count | Percentage
Binary | 11 | 42.3%
Categorical | 9 | 34.6%
Multi-label | 6 | 23.1%
Total | 26 | 100.0%
Table 2. Models selected for the evaluation process.
Model | Quantization | Parameters | Context | Weights
Meta-Llama-3-1 | FP8 | 8 B | 128 K | Available
DeepSeek-R1-Distill-Qwen | - | 32 B | 128 K | Available
Meta-Llama-3-3 | - | 70 B | 128 K | Available
Qwen3 | FP8 | 235 B | 32 K | Available
Meta-Llama-4-Maverick | FP8 | 402 B | 1 M | Available
DeepSeek-R1 | - | 685 B | 160 K | Available
Claude-3-7-Sonnet | - | - | 200 K | Not Available
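Each model in Table 2 was queried under the same zero-shot, role-instructed, schema-constrained prompt. The sketch below illustrates what one such request could look like through OpenRouter's OpenAI-compatible endpoint [81]; it is not the authors' exact pipeline, and the model slug, role instruction, and the two toy schema fields are assumptions introduced for illustration only.
```python
# Minimal sketch (assumed, not the authors' exact code): one zero-shot,
# schema-constrained coding request via OpenRouter's OpenAI-compatible API [81].
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

# Toy subset of a codebook schema: one binary and one categorical question.
coding_schema = {
    "type": "object",
    "properties": {
        "q1_mentions_tariffs": {"type": "boolean"},
        "q2_article_type": {
            "type": "string",
            "enum": ["news report", "opinion", "analysis", "other"],
        },
    },
    "required": ["q1_mentions_tariffs", "q2_article_type"],
    "additionalProperties": False,
}

def code_article(article_text: str, model: str = "meta-llama/llama-4-maverick") -> str:
    """Return the model's JSON answers for one article as a raw string."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system",
             "content": "You are an expert media coder. Answer every "
                        "codebook question strictly from the article text."},
            {"role": "user", "content": article_text},
        ],
        # Structured output; assumes the selected model supports json_schema.
        response_format={
            "type": "json_schema",
            "json_schema": {"name": "codebook_answers", "strict": True,
                            "schema": coding_schema},
        },
        temperature=0,
    )
    return response.choices[0].message.content
```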
Table 3. Agents’ profiles and planning characteristics.
Agent | Profile Instructions | Planning and Actions
Coordinator | Orchestrates workflow execution across agents; manages sequence and agent hand-offs | Agent coordination
Setup Agent | Parses article input, extracts metadata, and prepares structured content for downstream agents | Performs preprocessing: parses content, extracts identifiers, and stores structured input
Media Analyst | Expert in journalism and media framing; analyzes content classification, authorship, and editorial stance | Examines how information is presented; applies framing and authorship heuristics; addresses the media-specific question subset
Political Analyst | Specialist in political leadership, governance, and international relations; detects sentiment and ideological framing | Analyzes political actors and context; detects implicit and explicit sentiment; addresses political and governance questions
Trade Economist | Expert in international trade and economic impact; analyzes tariffs, market responses, and industry implications | Identifies affected industries; assesses trade dynamics and economic reasoning
Reviewer Agent | Synthesis expert; integrates multiple expert perspectives into a single structured output conforming to schema | Maps expert insights to the codebook-question schema; checks consistency and resolves ambiguity
Format Converter | Schema compliance agent; finalizes JSON structure and ensures output validity | Converts structured output into schema-compliant JSON; ensures formatting
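As a rough illustration of how the role-profiled agents in Table 3 could be composed with the Agent Development Kit [80], the sketch below wires them into a sequential workflow. The agent instructions are abbreviated paraphrases of the profiles above, and routing Llama 4 Maverick through LiteLLM/OpenRouter is an assumption, not the authors' documented configuration.
```python
# Minimal sketch of the Table 3 team with Google's adk-python [80].
# Model wiring via LiteLLM/OpenRouter is assumed for illustration.
from google.adk.agents import LlmAgent, SequentialAgent
from google.adk.models.lite_llm import LiteLlm

MODEL = LiteLlm(model="openrouter/meta-llama/llama-4-maverick")  # needs OPENROUTER_API_KEY

setup_agent = LlmAgent(
    name="setup_agent", model=MODEL,
    instruction="Parse the article, extract metadata, and store structured input.",
    output_key="structured_article",   # shared via session state (memory)
)
media_analyst = LlmAgent(
    name="media_analyst", model=MODEL,
    instruction="Analyze framing, authorship, and editorial stance; answer the media questions.",
    output_key="media_answers",
)
political_analyst = LlmAgent(
    name="political_analyst", model=MODEL,
    instruction="Analyze political actors, sentiment, and ideological framing; answer the political questions.",
    output_key="political_answers",
)
trade_economist = LlmAgent(
    name="trade_economist", model=MODEL,
    instruction="Assess tariffs, affected industries, and economic impact; answer the trade questions.",
    output_key="trade_answers",
)
reviewer = LlmAgent(
    name="reviewer_agent", model=MODEL,
    instruction="Merge the expert answers into one consistent set of codebook answers.",
    output_key="merged_answers",
)
format_converter = LlmAgent(
    name="format_converter", model=MODEL,
    instruction="Emit the merged answers as schema-compliant JSON only.",
)

# The coordinator role is realized here as a fixed sequential hand-off.
coordinator = SequentialAgent(
    name="coordinator",
    sub_agents=[setup_agent, media_analyst, political_analyst,
                trade_economist, reviewer, format_converter],
)
```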
Table 4. Aggregated F1-scores for each question group (binary, categorical, and multi-label), as well as the global average. Each model is shown by its abbreviation. M1: Meta-Llama-4-Maverick; M2: Qwen3; M3: Meta-Llama-3-3; M4: DeepSeek-R1; M5: DeepSeek-R1-Distill-Qwen; M6: Claude-3-7-Sonnet; M7: Meta-Llama-3-1. The highest value in each row is attained by M6 (Claude-3-7-Sonnet).
Average Scores | M1 | M2 | M3 | M4 | M5 | M6 | M7
Global | 0.757 | 0.737 | 0.782 | 0.800 | 0.787 | 0.822 | 0.636
Binary | 0.841 | 0.838 | 0.858 | 0.894 | 0.899 | 0.911 | 0.763
Categorical | 0.704 | 0.704 | 0.758 | 0.746 | 0.724 | 0.759 | 0.624
Multi-label | 0.681 | 0.601 | 0.678 | 0.707 | 0.675 | 0.756 | 0.419
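The aggregation behind Table 4 can be reproduced, given per-article gold labels and model answers, by computing a weighted F1-score per question and then averaging within each question group and over all 26 questions. The sketch below assumes scikit-learn and hypothetical data structures; it reflects the reported metric, not the authors' released evaluation code.
```python
# Sketch of the assumed aggregation: weighted F1 per question, then a mean
# per question group (binary, categorical, multi-label) and a global mean.
from statistics import mean
from sklearn.metrics import f1_score
from sklearn.preprocessing import MultiLabelBinarizer

def question_f1(gold, pred, qtype):
    """gold/pred: per-article answers for a single codebook question."""
    if qtype == "multi-label":
        # Binarize label sets before scoring multi-label questions.
        mlb = MultiLabelBinarizer().fit(gold + pred)
        return f1_score(mlb.transform(gold), mlb.transform(pred),
                        average="weighted", zero_division=0)
    # Binary and categorical questions: one label per article.
    return f1_score(gold, pred, average="weighted", zero_division=0)

def aggregate(per_question_scores, question_types):
    """per_question_scores: {question_id: f1}; question_types: {question_id: type}."""
    report = {"Global": mean(per_question_scores.values())}
    for group in ("binary", "categorical", "multi-label"):
        report[group] = mean(score for qid, score in per_question_scores.items()
                             if question_types[qid] == group)
    return report
```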
Table 5. Evaluation of Meta’s Llama 4 Maverick with direct LLM prompting versus the Llama 4 Maverick multi-agent team.
Average Scores | Llama 4 Maverick Direct-LLM Prompt | Llama 4 Maverick Multi-Agent Team
Global | 0.757 | 0.805
Binary | 0.841 | 0.910
Categorical | 0.704 | 0.739
Multi-label | 0.681 | 0.711
Table 6. Token usage and indicative costs across evaluated models for processing the full dataset of 200 articles.
Model | Input Tokens (M) | Output Tokens (K) | Total Tokens (M) | Input Cost/M ($) * | Output Cost/M ($) * | Inference Time (s)
Meta-Llama-4-Maverick | 3.15 | 356 | 3.50 | 0.15 | 0.60 | 4508
Qwen3 | 3.39 | 369 | 3.76 | 0.13 | 0.60 | 5300
Meta-Llama-3-3 | 4.10 | 440 | 4.54 | 0.05 | 0.18 | 8288
DeepSeek-R1 | 2.90 | 267 | 3.16 | 0.50 | 2.15 | 13,517
DeepSeek-R1-Distill-Qwen | 3.47 | 497 | 3.97 | 0.075 | 0.15 | 11,513
Claude-3-7-Sonnet | 2.99 | 482 | 3.47 | 3.0 | 15.0 | 8067
Meta-Llama-3-1 | 3.28 | 373 | 3.66 | 0.016 | 0.022 | 5409
Agentic System (Llama-4) | 3.92 | 882 | 4.80 | 0.15 | 0.60 | 8967
* Pricing data for the LLMs was sourced from [81]. Open-weight LLMs can be deployed locally; the providers’ per-million-token rates are used here as a resource-utilization indicator.
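The indicative cost for each row of Table 6 follows from total cost = input tokens (M) × input rate + output tokens (M) × output rate, with rates in USD per million tokens [81]. A minimal sketch with three rows from the table is shown below; the figures are indicative resource-utilization estimates, not billed amounts.
```python
# Indicative-cost arithmetic for a few Table 6 rows:
# cost = input_M * rate_in + (output_K / 1000) * rate_out  (USD).
ROWS = {
    # model: (input tokens M, output tokens K, $/M in, $/M out)
    "Meta-Llama-4-Maverick":    (3.15, 356, 0.15, 0.60),
    "Claude-3-7-Sonnet":        (2.99, 482, 3.0, 15.0),
    "Agentic System (Llama-4)": (3.92, 882, 0.15, 0.60),
}

def indicative_cost(in_m, out_k, rate_in, rate_out):
    return in_m * rate_in + (out_k / 1000.0) * rate_out

for model, (in_m, out_k, rate_in, rate_out) in ROWS.items():
    print(f"{model}: ${indicative_cost(in_m, out_k, rate_in, rate_out):.2f}")
# e.g., Claude-3-7-Sonnet: 2.99 * 3.0 + 0.482 * 15.0 ≈ $16.20
```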