Structured Summarization of League of Legends Match Data Optimized for Large Language Model Input

Kim, Jooyoung; Lee, Wonkyung; Park, Jungwoon

doi:10.3390/app15137190


by Jooyoung Kim ¹,*, Wonkyung Lee ² and Jungwoon Park ³

¹ Department of Convergence Software, Myongji University, Seoul 03674, Republic of Korea
² PS Analytics, Seoul 03993, Republic of Korea
³ Department of Electrical & Electronics Engineering, Yonsei University, Seoul 03829, Republic of Korea
* Author to whom correspondence should be addressed.

Appl. Sci. 2025, 15(13), 7190; https://doi.org/10.3390/app15137190

Submission received: 28 May 2025 / Revised: 9 June 2025 / Accepted: 24 June 2025 / Published: 26 June 2025

(This article belongs to the Special Issue Applications of Natural Language Processing to Data Science)


Abstract

Large-scale match data from esports games like League of Legends are stored in complex JSON files that often exceed the input token limitations of large language models (LLMs), restricting advanced analysis and applications such as automated commentary and strategic insight generation. This paper introduces the League of Legends Match Data Compactor (LoL-MDC), a tool designed to transform extensive match data into a concise and structured format optimized for LLM processing. By systematically summarizing structured match information—including match overviews, player and team statistics, timeline summaries, and algorithmically selected key events—the LoL-MDC significantly reduces the data size from approximately 80,000 tokens to under 2000 tokens while retaining analytical value. This method enables LLMs to generate coherent match summaries, analyze player performances, and identify key momentum shifts more effectively than processing raw JSON files. Additionally, the LoL-MDC integrates a winning probability metric to quantitatively enhance the selection of pivotal game events, ensuring relevance in esports analytics. Experimental evaluations demonstrate that the LoL-MDC improves data processing efficiency while maintaining critical insights. The proposed approach provides a structured and adaptable framework for applying LLMs to esports analytics and can be adapted to other competitive gaming environments, supporting AI-driven applications in match analysis, player performance evaluation, and strategic forecasting.

Keywords: esports analytics; League of Legends; JSON data processing; data compression; large language models

1. Introduction

The exponential growth of the esports industry has transformed competitive gaming into a global phenomenon. Titles like League of Legends [1] have established extensive professional scenes, attracted millions of viewers, and generated a significant economic impact [2,3]. As competition intensifies, teams, analysts, and coaches increasingly rely on match data to evaluate performance, plan strategies, and study opponents. These analyses often uncover trends and patterns that are difficult to observe without computational tools [4]. However, the large volume and complexity of these data present significant challenges. A single match of League of Legends can generate detailed JSON records of player actions, in-game events, and time series data, often exceeding 100 kilobytes in size [5]. This makes manual analysis and automated processing both time-consuming and computationally demanding.

Recent advances in large language models (LLMs) have opened new possibilities for automated analysis of esports data. These models excel at natural language comprehension, summarization, and generating expert-like commentary [6]. However, a key limitation of most LLMs is their restricted input size, typically ranging from 4096 to 8192 tokens. While extended-context models like GPT-4o support up to 128,000 tokens [7], the detailed JSON records from a single League of Legends match can still approach or exceed this limit in complex scenarios.

Existing summarization methods, such as manual extraction or adaptive data chunking [8,9], lack structured and quantitative identification of high-impact events critical to rigorous esports analysis. Current approaches do not systematically utilize measurable momentum metrics to identify pivotal in-game events, creating a clear research gap. The summarization techniques employed in esports analytics draw theoretically from structured data reduction and summarization principles from the information retrieval domain, which emphasize identifying events based on measurable momentum and probability metrics [10,11]. Empirical studies have shown that accurately representing momentum shifts enhances the interpretability of match summaries [12,13], a capability inadequately addressed by simplistic chunking methods. Hence, this study addresses two research questions. (1) Can structured summarization effectively reduce the data size while maintaining analytical accuracy for esports analysis? (2) Does integrating momentum-based win probability metrics objectively enhance the identification of pivotal game events for LLM-based analytics?

We propose the League of Legends Match Data Compactor (LoL-MDC), a tool designed to transform extensive match data into concise summaries suitable for LLM analysis. The LoL-MDC systematically extracts explicitly defined and quantifiable information, including match metadata, player and team statistics, simplified timeline views, and key events. The key events are identified algorithmically using a measurable advantage metric (defined as the cumulative difference between team-level gold or experience scores computed at each minute) and a corresponding Pythagorean expectation-based winning probability model. This structured and reproducible approach objectively prioritizes events based on their quantifiable momentum impact, ensuring analytical precision and clarity. The method can in principle be applied to other esports titles by substituting game-specific metrics (e.g., net worth in Dota 2 or objective control in Valorant), though empirical validation beyond League of Legends remains a subject for future research.

Importantly, the primary contribution of the LoL-MDC extends beyond a mere reduction in data size. Its structured approach using momentum and probability metrics directly addresses the ambiguity and disorganization inherent in raw JSON match logs, improving the correctness and interpretability of analytical insights generated by LLMs. This paper evaluates the effectiveness of the LoL-MDC through quantitative experiments measuring data size reduction and qualitative evaluations assessing analytical accuracy improvements provided by compacted data. The results demonstrate a measurable reduction in data size of over 97% and substantial improvement in the quality and correctness of analytical insights provided by LLMs compared with using unstructured JSON data.

The remainder of this paper is structured as follows. Section 2 reviews the related literature on esports data analysis and LLMs in gaming. Section 3 outlines the detailed algorithmic implementation of the LoL-MDC. Section 4 presents experimental results and analysis. Finally, Section 5 summarizes the contributions and outlines future empirical validations needed to confirm generalizability to other competitive gaming scenarios.

2. Related Works

2.1. Esports Data Analysis

Esports data analysis has evolved through the application of machine learning, data visualization, and statistical modeling methods to handle large volumes of in-game telemetry and player behavior data [10,14]. Earlier research has primarily concentrated on synchronization and structured representation of match events, video logs, and sensor-based data to assist teams, analysts, and broadcasters in decision making [15,16]. Recent studies have expanded this scope to include predictive modeling and real-time metrics, aiming to improve strategic insights, training methods, and viewer experiences [12,13,17].

Visualization-oriented methods have also emerged prominently. Xenopoulos et al. [18] introduced ggViz, a visual analytics tool for Counter-Strike: Global Offensive (CSGO). The ggViz tool enables users to perform interactive, sketch-based queries of game state data, facilitating visual identification of similar scenarios using win probability charts and heatmaps. Despite its strengths in interactive exploration and visual summarization, ggViz does not provide structured textual summaries or quantitative momentum metrics, which are important for automated, language-based analytics.

Overall, current methods in esports data analysis typically focus on visualization, exploratory analysis, and predictive modeling, but structured textual summarization explicitly designed for automated analyses, particularly those involving LLMs, remains underexplored.

2.2. Large Language Models in Games

LLMs have recently been employed across various gaming contexts, including narrative generation, non-player character (NPC) interactions, and strategic reasoning. For instance, LLMs can adapt narratives dynamically, generate coherent dialogues, and respond effectively to player actions, enhancing player immersion [19,20].

In competitive gaming, LLMs have been applied for automated commentary, event summarization, and strategic insight generation [6,21]. These applications typically require structured textual representations of game states and events to allow precise interpretation and reduce model hallucination risks due to token limit constraints [22].

A recent example addressing structured textual summarization is the “Chain of Summarization” proposed by Ma et al. [23] for StarCraft II. This method incorporates single-frame observation summarization and multi-frame strategic summarization, enabling LLMs to reason and plan strategically with proficiency comparable to experienced human players. However, this method does not explicitly use quantitative metrics such as momentum shifts or probability analysis, which limits its direct applicability to structured event analysis and rigorous esports data interpretation.

The proposed LoL-MDC addresses this limitation by explicitly integrating quantitative momentum and win probability metrics within structured textual summaries to enhance LLM-driven analytics specifically in esports contexts.

3. LoL-MDC: League of Legends Match Data Compactor

The League of Legends Match Data Compactor is an automated tool designed to compress and curate extensive League of Legends match data into a concise, structured format suitable for LLMs. By processing raw JSON files, the LoL-MDC extracts essential match information—including match overviews, team and player statistics, simplified timelines, and pivotal events—while significantly reducing the data size. As shown in Figure 1, the system employs advanced techniques such as algorithmically calculating winning probability metrics using a generalized Pythagorean expectation formula and identifying momentum shifts through quantitative advantage metrics to ensure high-quality summarization for effective narrative and strategic analysis. We note here that while the process shown in Figure 1 suggests linear steps, the actual order is configurable. The term “LLM-ready” indicates that the final JSON output adheres to common tokenization and formatting standards for seamless integration into prevalent large language models.

3.1. Match Overview

The match overview provides a high-level summary that establishes the contextual framework for further analysis. It includes essential metadata such as the match identifier, game title, patch version, participating teams, match date and time, match duration, and the winning side. By focusing on key reference points, this component ensures the summarized data remains directly usable by LLMs while eliminating redundant details.

To generalize this component for other esports titles, the metadata fields can be replaced with title-specific equivalents, such as game-specific roles or identifiers. This modular design makes the compactor broadly applicable to various gaming ecosystems.

3.2. Team and Player Statistics

A detailed analysis of individual and team performances is critical for understanding competitive gameplay. The LoL-MDC compiles key performance metrics at both the player and team levels. For players, it includes identifiers such as names, chosen champions, and roles or positions, along with performance metrics like kills, deaths, assists, gold earned, creep score, damage dealt, and vision score. At the team level, it aggregates these metrics to provide an overview, capturing total kills, deaths, assists, and objectives secured, such as dragons, barons, towers, and rift heralds, which collectively illustrate the game’s flow.

This component can be adapted for other games by substituting terminology and metrics. For example, champions in League of Legends can be replaced with heroes in Dota 2 [24] or agents in Valorant [25], and objectives can be customized to align with game-specific milestones. This adaptability ensures relevance across different titles.
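The player-to-team aggregation described in this subsection can be sketched as follows. This is a minimal illustration, not the LoL-MDC schema: the `PlayerStats` class, its field names, and the `aggregate_team` helper are hypothetical, and only a subset of the listed metrics is shown.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class PlayerStats:
    # Hypothetical field names chosen for illustration only.
    name: str
    champion: str  # would become "hero" or "agent" for other titles
    role: str
    team_id: int
    kills: int
    deaths: int
    assists: int
    gold_earned: int
    creep_score: int


def aggregate_team(players: list[PlayerStats], team_id: int) -> dict[str, int]:
    """Sum per-player metrics into a team-level overview for one team."""
    members = [p for p in players if p.team_id == team_id]
    return {
        "kills": sum(p.kills for p in members),
        "deaths": sum(p.deaths for p in members),
        "assists": sum(p.assists for p in members),
        "gold_earned": sum(p.gold_earned for p in members),
        "creep_score": sum(p.creep_score for p in members),
    }
```

Keeping the per-player record flat and deriving team totals on demand is one way to avoid storing the same numbers twice in the compacted output; whether the LoL-MDC does so is not stated in the text.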

3.3. Simplified Timeline View

Static statistics alone often fail to capture the dynamic flow of a match. The LoL-MDC addresses this by incorporating a simplified timeline view, highlighting momentum shifts and strategic turning points throughout the game. This timeline uses two forms of data: primary and secondary. Primary data include raw sequential metrics such as gold earned or experience points (XP) collected by each player over time. Secondary data comprise calculated metrics derived from primary data. Specifically, we compute an advantage metric at each one-minute interval, defined as the difference in aggregated metrics (such as gold or XP) between the two teams. This advantage metric quantifies the relative lead or deficit between teams, calculated by

$\mathrm{Advantage}(t) = \sum_{i=1}^{5} M_{i}(t) - \sum_{j=6}^{10} M_{j}(t)$, (1)

where $M_{i}(t)$ is the chosen metric (gold or XP) for player $i$ at minute $t$. The winning probability calculated from each metric is then obtained with a generalized Pythagorean expectation:

$P_{1}(t) = \frac{S_{1}(t)^{\alpha}}{S_{1}(t)^{\alpha} + S_{2}(t)^{\alpha}}$, (2)

where

$S_{1}(t) = \sum_{i=1}^{5} M_{i}(t), \quad S_{2}(t) = \sum_{j=6}^{10} M_{j}(t)$. (3)

The exponent $\alpha$ is a tunable parameter that can be periodically updated through learning from historical match data.

The final winning probability $P_{final}(t)$ is then computed by averaging the probabilities $P_{1}^{(metric)}(t)$ derived from multiple metrics (e.g., gold and XP):

$P_{final}(t) = \frac{1}{|M|} \sum_{metric \in M} P_{1}^{(metric)}(t)$, (4)

where $M$ denotes the set of selected metrics (e.g., {gold, XP}), and each metric-specific probability $P_{1}^{(metric)}(t)$ is individually calculated using the generalized Pythagorean expectation given in Equation (2).
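Equations (1) to (4) can be sketched in a few lines of Python for a single minute $t$. The function names are hypothetical, the exponent value passed in is a placeholder (the text gives no fitted $\alpha$), a single shared $\alpha$ across metrics is an assumption, and returning 0.5 when both team totals are zero is an assumption added to avoid division by zero.

```python
from __future__ import annotations


def team_totals(per_player: list[float]) -> tuple[float, float]:
    """Equation (3): sums over players 1-5 (S1) and players 6-10 (S2)."""
    if len(per_player) != 10:
        raise ValueError("expected one value for each of the 10 players")
    return sum(per_player[:5]), sum(per_player[5:])


def advantage(per_player: list[float]) -> float:
    """Equation (1): team 1 total minus team 2 total."""
    s1, s2 = team_totals(per_player)
    return s1 - s2


def pythagorean_win_probability(s1: float, s2: float, alpha: float) -> float:
    """Equation (2): generalized Pythagorean expectation for team 1."""
    if s1 <= 0 and s2 <= 0:
        return 0.5  # assumption: no information yet, treat as even
    a, b = s1 ** alpha, s2 ** alpha
    return a / (a + b)


def final_win_probability(metrics: dict[str, list[float]], alpha: float) -> float:
    """Equation (4): mean of the metric-specific probabilities."""
    if not metrics:
        raise ValueError("at least one metric is required")
    probs = [
        pythagorean_win_probability(*team_totals(values), alpha)
        for values in metrics.values()
    ]
    return sum(probs) / len(probs)
```

Calling `final_win_probability({"gold": gold_at_t, "xp": xp_at_t}, alpha)` once per minute would yield the probability series that the key event selection in Section 3.4 consumes.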

3.4. Selected Key Events

Key events play a decisive role in determining the narratives and outcomes of esports matches. The LoL-MDC identifies these moments by analyzing match logs for high-impact events such as securing objectives, executing multi-elimination sequences, or achieving critical milestones. Each event is annotated with its timing, event type, and relevant details, along with momentum indicators before and after the event. This ensures the preservation of pivotal moments that define the match’s storyline.

To algorithmically identify these moments from match logs, we define the momentum impact of events occurring between minutes $t$ and $t + 1$ as follows:

$\mathrm{MomentumImpact} = |P_{final}(t + 1) - P_{final}(t)|$. (5)

Events are then ranked based on Equation (5), where the top-N events are retained. The overall extraction procedure is summarized in Algorithm 1. Note that event timestamps are recorded in milliseconds, so each timestamp is divided by 60,000 to map it onto the minute scale.

Events with the most significant momentum shifts, measured by changes in the winning probability metric from the timeline view, are prioritized. For example, a critical dragon kill or Baron steal that shifts the team advantage would be retained. The number of key events included (e.g., top five events) can be adjusted based on the level of detail desired. When momentum metrics are unavailable, alternative indicators from the timeline can be used.

Algorithm 1 Key event extraction.

Input: event_logs, winning_probabilities, N
key_events ← [ ]
for each event e in event_logs do
    $t \leftarrow \lfloor e.timestamp / 60{,}000 \rfloor$
    $P_{before} \leftarrow$ winning_probabilities[$t$]
    $P_{after} \leftarrow$ winning_probabilities[$t + 1$]
    $\Delta \leftarrow | P_{after} - P_{before} |$
    append $(e, \Delta)$ to key_events
end for
sort key_events by $\Delta$ (descending)
return top N events
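A runnable rendering of Algorithm 1 might look like the sketch below. It assumes each event is a dictionary with a millisecond `timestamp` key and that `winning_probabilities` holds one $P_{final}$ value per minute; both are illustrative choices. Clamping the index at the end of the series is also an assumption, since Algorithm 1 does not say how events in the final minute are handled.

```python
from __future__ import annotations

MS_PER_MINUTE = 60_000


def extract_key_events(
    event_logs: list[dict],
    winning_probabilities: list[float],
    n: int,
) -> list[tuple[dict, float]]:
    """Return the n events with the largest momentum impact, Equation (5)."""
    if not winning_probabilities:
        raise ValueError("winning_probabilities must not be empty")
    last = len(winning_probabilities) - 1
    scored = []
    for event in event_logs:
        t = event["timestamp"] // MS_PER_MINUTE  # floor to the minute index
        # Assumption: clamp so events in the final minute do not index past the end.
        p_before = winning_probabilities[min(t, last)]
        p_after = winning_probabilities[min(t + 1, last)]
        scored.append((event, abs(p_after - p_before)))
    # sorted() is stable, so events with equal impact keep their log order.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]
```

Because the impact is computed per minute, every event inside the same minute receives the same score; the sketch returns each event together with its score so that a caller can apply a secondary tie-breaking rule if needed.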

The final output of the LoL-MDC is a concise, structured JSON file that fits within the token limitations of modern LLMs while retaining essential match information. As shown in Figure 2, the compacted JSON file is designed for both human readability and machine interpretability, ensuring its utility for a wide range of esports analytics applications.

3.5. Cross-Game Adaptability and Examples

While this paper focuses on League of Legends, the principles underlying the LoL-MDC can be adapted to a wide range of esports titles by tailoring its components to reflect the unique characteristics of each game. Esports titles typically share fundamental structural characteristics such as defined player roles, team-based metrics, quantifiable performance indicators (kills, objectives, and scores), and critical momentum-shifting events. These similarities enable the LoL-MDC framework to be readily adapted across different competitive gaming environments. Below are specific examples illustrating how the LoL-MDC can be applied to other games.

Aside from the match overview, which can be easily customized by replacing metadata fields with game-specific details, team and player statistics can be adapted by substituting terms and metrics. For instance, “champion” in League of Legends could be replaced with “hero” in Dota 2 [24] or “agent” in Valorant [25]. Additional metrics, such as “headshots” for Counter-Strike [26] or “damage blocked” for Overwatch [27], can be incorporated to reflect the unique gameplay mechanics of these titles.

The timeline view and key events are particularly flexible. In Dota 2, the timeline could use “net worth difference” as the primary data, while in Overwatch, key events could include milestones like “payload checkpoint reached”. Similarly, in Counter-Strike, events such as “bomb defusal” or “round ace” could be captured to highlight pivotal moments.

By customizing each component—match overview, statistics, timeline, and key events—the LoL-MDC can serve as a versatile framework for esports analytics across a variety of games. This adaptability ensures that the tool remains relevant and effective in supporting AI-driven analysis for competitive gaming ecosystems.

4. Experiments

In this section, the proposed LoL-MDC is evaluated to verify its effectiveness in compacting curated League of Legends match data into a manageable format for LLMs. The experiments are divided into two parts: quantitative evaluation and qualitative evaluation. The quantitative evaluation measures the compactor’s efficiency in reducing data size and token counts across different game versions and language settings. It also examines the contributions of each component (match overview, team and player statistics, timeline view, and selected key events) to the compacted data. The qualitative evaluation assesses the usability of compacted data by comparing the responses of a commercial LLM chatbot to those using raw JSON files and curated expert summaries. Specifically, these experiments aim to verify two primary hypotheses: (1) the LoL-MDC significantly reduces token counts and data size without losing analytically critical match details, and (2) the structured format provided by the LoL-MDC enhances the accuracy and coherence of LLM-generated insights compared with less structured data.

4.1. Datasets and Experimental Set-Up

The dataset used for the experiments consists of JSON files internally curated by the analytics team at LOL.PS [28], one of the largest League of Legends statistics websites. Each match record was initially obtained from the Match-V5 endpoint of the Riot Games API [5]. A single raw response from this endpoint typically exceeds 300 kB (approximately 120,000 tokens), containing comprehensive metadata and numerous auxiliary fields, which makes it impractical for direct use with LLM prompts. To address this issue, we utilized an intermediate, internally curated version containing the variables essential for performance analytics—match metadata, per-team and per-player statistics, one-minute aggregated gold and experience (XP) timelines, and high-level event logs (objectives, multi-kills, and turret destructions)—to describe the match information. This curated version removes redundant low-level frame details and purely cosmetic fields, reducing the JSON file size to approximately 160 kB per match while preserving all critical information. Throughout the remainder of this paper, we refer to this curated version as the “raw” dataset and use it as our baseline for evaluating the performance and effectiveness of our proposed compactor. This curated “raw” dataset, while not fully identical to the original Riot API response, closely mirrors realistic formats regularly utilized by professional esports analysts and serves as a practical baseline for our evaluations.

The experiments focused on matches from two regions: North America (NA) using English (en_US) and Korea (KR) using Korean (ko_KR) language settings. Data were collected for two patch versions, 14.21 and 14.22, to evaluate the robustness of the LoL-MDC across different configurations. For each patch version, 10,000 curated matches were processed to assess the compactor’s efficiency in reducing data size and token counts. Additionally, a subset of five anonymized matches was selected for qualitative evaluation, where GPT-4o responses were compared using raw and compacted JSON inputs. For the hyperparameters in the LoL-MDC, we set a 1 min interval for timeline view generation and selected the top five key events to summarize pivotal moments effectively.

To evaluate tokenization efficiency, we utilized two open-source tokenizers: o200k_base [29] and cl100k_base [30]. The o200k_base tokenizer is employed by GPT-4o and GPT-4o-mini models, while the cl100k_base tokenizer is used by OpenAI models such as GPT-4, GPT-4-turbo, GPT-3.5-turbo, and embedding models. These tokenizers represent different encoding strategies and token capacities, enabling us to analyze how the LoL-MDC performs across diverse LLM architectures. File size reduction was measured in bytes, and token counts were calculated using the respective tokenizers to ensure consistency and comparability.

Qualitative evaluations assessed the ability of compacted data to answer analytical questions by comparing LLM responses generated from raw JSON and compacted JSON files. To ensure consistency, all experiments used the standardized prompt as shown in Listing 1.

Listing 1. Standardized prompt for JSON-based question answering.

You are an expert esports analyst with a specialization in analyzing competitive gaming data.
Your task is to evaluate and summarize the given match data and provide accurate and detailed insights.
[JSON]
{
	…
}
[Question]
What was the total gold earned by the winning team?
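Assembling the prompt in Listing 1 programmatically could look like the following sketch. The `build_prompt` helper is hypothetical, and serializing the match data with compact separators is an assumption; the text does not state how the JSON was formatted before being placed in the prompt.

```python
import json

# Instruction text copied from Listing 1.
INSTRUCTIONS = (
    "You are an expert esports analyst with a specialization in analyzing "
    "competitive gaming data.\n"
    "Your task is to evaluate and summarize the given match data and "
    "provide accurate and detailed insights."
)


def build_prompt(match_data: dict, question: str) -> str:
    """Fill the Listing 1 template with one match and one question."""
    # ensure_ascii=False keeps Korean names as-is instead of \uXXXX escapes.
    payload = json.dumps(match_data, ensure_ascii=False, separators=(",", ":"))
    return f"{INSTRUCTIONS}\n[JSON]\n{payload}\n[Question]\n{question}"
```

Using one builder for both the raw and the compacted inputs keeps the instruction text and question identical across conditions, so that only the JSON payload differs.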

4.2. Quantitative Evaluation

The quantitative evaluation assesses the compactor’s ability to reduce the data size and token counts without losing critical match information. Direct comparisons with alternative summarization methods, such as chunked JSON or LLM-assisted summarizers, were beyond the intended scope, as our primary objective was to assess the effectiveness of structured summarization optimized for LLM input. The analysis includes results from the North America (NA) region using English and the Korea (KR) region using Korean across two patch versions: 14.21 and 14.22. The evaluation highlights the LoL-MDC’s performance in different languages and configurations, demonstrating its robustness and adaptability.

4.2.1. Global Reduction Analysis

Table 1 and Table 2 present the overall statistics for the NA and KR regions, respectively. For matches from the NA region with patch 14.22, the LoL-MDC reduced the average decompressed size from 149,324 bytes to 3929 bytes, achieving over 97% reduction. Token counts were reduced from 75,849 to 1591 using cl100k_base and from 75,008 to 1576 using o200k_base. Similar results were observed for patch 14.21, with compacted sizes averaging 4220 bytes and token counts reduced by over 97.5%.

For matches from the KR region, the LoL-MDC achieved comparable results across both patch versions. For patch 14.22, the average decompressed size of 163,989 bytes was reduced to 4241 bytes. Token counts were reduced from 84,478 to 1751 using cl100k_base and from 83,598 to 1695 using o200k_base. Patch 14.21 followed a similar trend, with decompressed sizes reduced from 162,954 bytes to 4227 bytes and token counts reduced by 97.4%.

The results indicate that the LoL-MDC achieved consistent performance across regions, languages, and configurations. While slight differences in the compacted sizes and token counts arose due to regional variations, such as naming conventions and text length, the overall reductions consistently exceeded 97% in all cases. These findings confirm the LoL-MDC’s adaptability to diverse datasets while preserving essential analytical content.
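The reduction percentages can be re-derived from the averages quoted above. The short check below uses only the before and after values that appear in this subsection; the `reduction` helper and the dictionary labels are illustrative.

```python
def reduction(before: float, after: float) -> float:
    """Fractional reduction: 1 - after / before."""
    if before <= 0:
        raise ValueError("before must be positive")
    return 1.0 - after / before


# (before, after) averages copied from the text of Section 4.2.1.
REPORTED = {
    "NA 14.22 bytes": (149_324, 3_929),
    "NA 14.22 cl100k_base tokens": (75_849, 1_591),
    "NA 14.22 o200k_base tokens": (75_008, 1_576),
    "KR 14.22 bytes": (163_989, 4_241),
    "KR 14.22 cl100k_base tokens": (84_478, 1_751),
    "KR 14.22 o200k_base tokens": (83_598, 1_695),
    "KR 14.21 bytes": (162_954, 4_227),
}

for label, (before, after) in REPORTED.items():
    print(f"{label}: {reduction(before, after):.2%}")
```

Every pair listed here works out to a reduction above 97%, which matches the claim made in the text.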

4.2.2. Component-Level Analysis

The compacted match data generated by the LoL-MDC were divided into four main components: match overview, team and player statistics, timeline view, and selected key events. Table 3 and Table 4 summarize the compacted sizes and token counts for these components in the NA and KR regions for the 14.22 patch, respectively. This analysis highlights the relative size of each component and their adaptability for different configurations.

Team and player statistics consistently accounted for the largest portion of the compacted data. In the NA region, this component averaged 2197.53 bytes, corresponding to approximately 900 tokens in cl100k_base and 889 tokens in o200k_base. In the KR region, the size was slightly larger at 2286.33 bytes, with 956 tokens in cl100k_base and 913 tokens in o200k_base. The increased size in the KR region is primarily due to linguistic differences, such as longer champion names and player IDs in Korean.

In contrast, match overview was consistently the smallest component across both regions. It averaged about 220 bytes and 79 tokens regardless of tokenizer or region. This component contains metadata such as the match ID, game title, patch version, participating teams, and match duration. Its small size reflects its role as a concise summary of the match’s context.

The timeline view and selected key events occupied intermediate positions in terms of size. In the NA region, the timeline view averaged 681.31 bytes, translating to approximately 349 tokens in both tokenizers. The selected key events averaged 735.96 bytes, corresponding to 246 tokens in cl100k_base and 242 tokens in o200k_base. The KR region followed a similar pattern, with the timeline view averaging 773.86 bytes and the selected key events averaging 866.91 bytes.

The sizes of these components, particularly the timeline view and selected key events, are configurable. For example, the size of the timeline view depends on the time intervals used to aggregate data; shorter intervals result in finer-grained data but increase the size, whereas longer intervals reduce granularity and compact the representation. Similarly, the size of the selected key events is directly influenced by the number of events recorded. Retaining only a few key events minimizes the component’s size, whereas recording more events increases its footprint. This adaptability allows the LoL-MDC to balance data granularity with token limitations.
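The interval trade-off can be illustrated with a small sketch that thins a per-minute series and measures its serialized size. Keeping every `step`-th sample (plus the final one) is an assumption made here for simplicity; the LoL-MDC may aggregate intervals differently, and the helper names are hypothetical.

```python
from __future__ import annotations

import json


def downsample(series: list[float], step: int) -> list[float]:
    """Keep every step-th sample of a per-minute series, plus the last one."""
    if step < 1:
        raise ValueError("step must be at least 1")
    kept = series[::step]
    # Always retain the final state of the match if the stride skipped it.
    if series and (len(series) - 1) % step != 0:
        kept.append(series[-1])
    return kept


def encoded_size(series: list[float]) -> int:
    """Size in bytes of the series when serialized as compact JSON."""
    return len(json.dumps(series, separators=(",", ":")).encode("utf-8"))
```

For a 30-minute match sampled every minute, `downsample(series, 5)` keeps seven points instead of thirty-one, so `encoded_size` shrinks accordingly at the cost of hiding swings that occur between the retained minutes.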

4.3. Qualitative Evaluation

To evaluate the usability of the LoL-MDC, we conducted a qualitative analysis by comparing the answers generated by GPT-4o using raw JSON input files and compacted JSON input files. GPT-4o was selected for its extended context capabilities and strong natural language understanding, which allow it to effectively analyze esports data and generate detailed responses. The analysis focused on answering a series of structured questions using a sampled dataset of five anonymized matches from 14.22 KR matches, with Match 1 visualized in Figure 3 to illustrate the statistics provided by LOL.PS.

Match 1’s statistics, as displayed in Figure 3, include player-level details such as kills, deaths, assists (KDA), damage dealt, damage taken, wards placed, creep score (CS), and items used. Additionally, team-level information, such as total kills and gold, provides a holistic summary of the game. This dataset served as the foundation for evaluating the performance of compacted JSON input files compared with raw JSON input files.

The qualitative evaluation consisted of three categories of questions designed to analyze the effectiveness of compacted data in generating meaningful answers compared with raw JSON data. For each question type, we provide examples of the prompts used and the answers generated.

4.3.1. General Questions

General questions focus on extracting basic match statistics and team-level performance metrics. The following questions were used:

(1) What was the total gold earned by the winning team?
(2) What was the KDA of the winning team’s SUPPORT and his or her champion?
(3) Which team had a higher total creep score?
(4) How many objectives (e.g., dragons or barons) were secured by each team?

Figure 4 and Figure 5 display the responses generated by GPT-4o using raw JSON and compacted JSON inputs, respectively. Although the raw JSON inputs did not exceed the token constraints in this case, the chatbot failed to answer Question 3 (the total creep score) correctly. Despite the presence of creep score data in the JSON files, the model struggled to extract and synthesize the information accurately. It provided correct answers for the remaining three questions, covering the total gold, KDA, and objectives.

In contrast, the compacted JSON inputs enabled GPT-4o to provide accurate and complete answers for all four questions. The chatbot successfully identified the total gold, KDA, objectives, and creep score, demonstrating the value of the LoL-MDC in structuring input data for effective analysis. The same pattern held for all five sampled matches, which highlights the compactor’s ability to enhance the usability of match data by organizing information in a way that facilitates accurate LLM responses.

4.3.2. Match Overview Questions

The match overview questions were designed to summarize the match concisely and identify key details. The following questions were used:

(1) Summarize the match in one sentence, highlighting the winning team and their key strengths.
(2) Who was the MVP and why?
(3) What were the most significant moments that shifted the momentum of the game?

Figure 6 and Figure 7 show the responses generated using raw and compacted JSON inputs. When provided with raw JSON inputs, GPT-4o exhibited hallucinations in its responses, as shown in its answer to Question 2. Additionally, the chatbot’s output contained opaque internal identifiers, such as “Players 9 and 10” or “Timestamp: 132,244”, instead of champion names or readable descriptions. These issues made it difficult for users to interpret the match summary and key moments effectively.

In contrast, the compacted JSON inputs allowed GPT-4o to generate structured and accurate match overviews. The chatbot correctly identified the MVP with a coherent explanation, highlighted significant momentum shifts, and provided a readable summary of the match. Whereas the raw JSON inputs often resulted in vague or misleading outputs, the compacted JSON inputs consistently produced high-quality responses, demonstrating the effectiveness of the compacted format in enabling structured, high-level match analysis.
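
The readability issue can be illustrated with a small sketch that resolves raw identifiers into human-readable labels before the data reach the model. The lookup table, the champion assignments, and the assumption that timestamps are expressed in milliseconds are illustrative only and are not taken from the LoL-MDC implementation.

```python
# Hypothetical participant lookup; a real pipeline would build this
# from the match data rather than hard-coding it.
PARTICIPANTS = {
    9: "Jinx (red team, bottom lane)",
    10: "Thresh (red team, support)",
}


def format_timestamp(ms: int) -> str:
    """Convert a timestamp in milliseconds (assumed unit) to mm:ss."""
    minutes, seconds = divmod(ms // 1000, 60)
    return f"{minutes:02d}:{seconds:02d}"


def describe_player(participant_id: int) -> str:
    """Replace a numeric participant ID with a readable label when one is known."""
    return PARTICIPANTS.get(participant_id, f"Player {participant_id}")


print(describe_player(9))
print(format_timestamp(132_244))  # prints 02:12
```

Performing this kind of substitution once, during compaction, means the model never has to infer which champion "Player 9" refers to.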

4.3.3. Match-Comparing Questions

Match-comparing questions required analyzing trends and differences across five anonymized matches. The following questions were used:

Which team performed better across the sampled matches in terms of total kills and gold?
What was the most frequently picked champion across the sampled matches?
What was the most common factor contributing to a team’s victory in these matches?

When using raw JSON files, GPT-4o was unable to process more than two matches simultaneously due to token constraints, making meaningful match comparisons infeasible. As a result, no valid multi-match responses could be generated, preventing the chatbot from identifying overarching trends or patterns across the dataset.

In contrast, compacted JSON files enabled GPT-4o to analyze all five matches simultaneously, facilitating detailed comparisons across multiple metrics as shown in Figure 8. The chatbot successfully identified the best-performing team, the most frequently picked champion, and the most common victory factors. This demonstrates the LoL-MDC’s effectiveness in structuring match data for large-scale trend analysis, overcoming the limitations imposed by raw JSON inputs.
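
The effect of compaction on multi-match analysis can be estimated with simple arithmetic. The sketch below uses the average `o200k_base` token counts for patch 14.22 NA from Table 1. The context budget and the number of tokens reserved for the prompt and the answer are assumptions made for illustration and are not values reported in this study.

```python
import math


def max_matches(context_budget: int, tokens_per_match: float, reserved: int = 0) -> int:
    """Number of whole matches that fit in the budget after reserving prompt/answer tokens."""
    usable = context_budget - reserved
    if usable <= 0:
        return 0
    return math.floor(usable / tokens_per_match)


# Average o200k_base token counts for patch 14.22 NA (Table 1).
RAW_TOKENS_PER_MATCH = 75_008.02
COMPACT_TOKENS_PER_MATCH = 1_576.12

# Assumed values, not taken from the paper.
CONTEXT_BUDGET = 128_000
RESERVED_TOKENS = 4_000

raw_fit = max_matches(CONTEXT_BUDGET, RAW_TOKENS_PER_MATCH, RESERVED_TOKENS)
compact_fit = max_matches(CONTEXT_BUDGET, COMPACT_TOKENS_PER_MATCH, RESERVED_TOKENS)
print(f"raw matches that fit: {raw_fit}")
print(f"compacted matches that fit: {compact_fit}")
```

These figures are based on averages; an individual match can be considerably smaller or larger, so the number of raw matches that fit in practice varies from case to case.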

4.4. Findings and Contributions

The findings from this study confirm that the LoL-MDC is highly effective at compacting esports match data while preserving its analytical value. The compactor achieved up to 97% reductions in data size and token counts, enabling efficient analysis within the input constraints of LLMs. It demonstrated compatibility with tokenization schemes such as o200k_base and cl100k_base, making it suitable for diverse LLM architectures. By preserving match narratives, key events, and performance metrics, the LoL-MDC enables meaningful insights that align closely with expert analyses.
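
The reduction ratios can be recomputed directly from the averages in Tables 1 and 2. The following sketch does so for every region, patch, and metric; the function and variable names are introduced here for illustration only.

```python
def reduction_pct(raw: float, compact: float) -> float:
    """Percentage by which the compacted value is smaller than the raw value."""
    return (1 - compact / raw) * 100


# (raw, compacted) averages copied from Table 1 (NA) and Table 2 (KR).
AVERAGES = {
    "14.21 NA": {"bytes": (173_284.81, 4_219.55),
                 "cl100k_base": (89_410.73, 1_721.49),
                 "o200k_base": (88_589.34, 1_705.87)},
    "14.22 NA": {"bytes": (149_324.22, 3_929.28),
                 "cl100k_base": (75_849.92, 1_591.07),
                 "o200k_base": (75_008.02, 1_576.12)},
    "14.21 KR": {"bytes": (162_954.40, 4_226.73),
                 "cl100k_base": (83_914.27, 1_743.96),
                 "o200k_base": (83_032.83, 1_689.48)},
    "14.22 KR": {"bytes": (163_989.32, 4_240.62),
                 "cl100k_base": (84_477.72, 1_750.54),
                 "o200k_base": (83_598.03, 1_694.96)},
}

reductions = {
    (dataset, metric): reduction_pct(raw, compact)
    for dataset, metrics in AVERAGES.items()
    for metric, (raw, compact) in metrics.items()
}

for (dataset, metric), pct in reductions.items():
    print(f"{dataset:9s} {metric:12s} {pct:.2f}% smaller")
```

For these averages, every computed reduction lies between 97% and 99%, for both byte size and token count.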

The qualitative evaluation further highlights the advantages of compacted JSON inputs for question-answering tasks. Using compacted JSON files, GPT-4o generated coherent and accurate responses across all question categories, including general match statistics, high-level overviews, and cross-match comparisons. In contrast, raw JSON inputs often led to incomplete or incorrect answers, either because the input exceeded the token limit or because the model failed to extract the relevant values from the raw structure. These findings support the LoL-MDC’s ability to transform complex match data into manageable and insightful formats, enabling LLMs to answer a wide range of analytical questions effectively. The compacted format therefore does more than remove redundant metadata: it improves the accuracy and coherence of LLM-generated responses compared with the baseline JSON files, indicating practical analytical advantages beyond data size reduction.

Additionally, the modular design of the compactor ensures scalability across patch versions and adaptability to other esports titles, further broadening its application scope. The ability to adjust components such as the timeline view intervals and the number of selected key events enables analysts to balance data granularity with computational efficiency. The experimental results also emphasize the LoL-MDC’s computational efficiency, allowing it to process entire tournaments or seasons within feasible timeframes. These contributions position the LoL-MDC as a robust and versatile tool for advancing esports analytics and unlocking new possibilities for AI applications in competitive gaming.
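
The trade-off between granularity and size can be sketched as a small configuration object that controls the timeline interval and the number of key events. The class, the field names, the default values, and the ranking of events by absolute change in winning probability are simplifying assumptions made for this illustration and do not reproduce the actual LoL-MDC implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CompactorConfig:
    """Hypothetical tuning knobs; names and defaults are illustrative."""
    timeline_interval_min: int = 5
    max_key_events: int = 10


def sample_timeline(frames: list[dict], cfg: CompactorConfig) -> list[dict]:
    """Keep only the frames whose minute falls on the configured interval."""
    return [f for f in frames if f["minute"] % cfg.timeline_interval_min == 0]


def select_key_events(events: list[dict], cfg: CompactorConfig) -> list[dict]:
    """Keep the events with the largest absolute win-probability change,
    then return them in chronological order."""
    ranked = sorted(events, key=lambda e: abs(e["win_prob_delta"]), reverse=True)
    return sorted(ranked[: cfg.max_key_events], key=lambda e: e["minute"])


# Toy data, invented for the example.
frames = [{"minute": m, "gold_diff": 100 * m} for m in range(11)]
events = [
    {"minute": 3, "type": "kill", "win_prob_delta": 0.02},
    {"minute": 12, "type": "dragon", "win_prob_delta": -0.08},
    {"minute": 25, "type": "baron", "win_prob_delta": 0.15},
]

cfg = CompactorConfig(timeline_interval_min=5, max_key_events=2)
print([f["minute"] for f in sample_timeline(frames, cfg)])
print([e["type"] for e in select_key_events(events, cfg)])
```

A larger interval or a smaller event limit yields a shorter summary at the cost of detail, which is the balance an analyst would tune for a given context budget.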

5. Conclusions

This paper introduced the League of Legends Match Data Compactor, a tool designed to transform voluminous esports match data into concise, structured formats optimized for LLMs. By leveraging curated JSON data from LOL.PS, the LoL-MDC demonstrated an ability to reduce data sizes by up to 97% while retaining critical analytical information. This enables LLMs to efficiently process multiple matches simultaneously within token constraints, addressing practical challenges in esports analytics.

Quantitative evaluations highlighted the LoL-MDC’s effectiveness in significantly reducing data sizes and token counts, ensuring compatibility with tokenization schemes such as o200k_base and cl100k_base. Qualitative results indicated that the compacted format improved the coherence and accuracy of LLM-generated summaries, enabling more effective extraction of player performances and pivotal match moments compared with less structured data. These findings provide preliminary evidence of the LoL-MDC’s potential to enhance AI-driven analytics for esports.

Limitations and Future Research

While the results of this study indicate promising directions for the use of the LoL-MDC in esports analytics, several areas require further investigation to strengthen the generalizability and robustness of our findings. The qualitative evaluation was based on a limited sample of five matches, suggesting the need for larger-scale evaluations to confirm broader applicability. Additionally, our assessments relied on subjective interpretations of LLM-generated outputs without systematically quantifying response consistency or the frequency of hallucinations. To address these issues, we plan to deploy the LoL-MDC on an accessible web platform, allowing users and esports analysts to interact with the model directly and provide structured feedback.

Future research will also involve systematic experiments to evaluate hallucination frequency, inter-rater reliability, explicit control of LLM parameters such as temperature settings, benchmarking against alternative NLP summarization methods, and detailed assessments of LLM performance metrics, including accuracy, latency, computational cost, and scalability, under realistic conditions.

Author Contributions

Conceptualization, W.L.; methodology, J.K.; software, validation, and analysis, J.K.; writing—original draft preparation, J.K.; writing—review and editing, W.L. and J.P.; funding acquisition, J.K. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the 2024 Research Fund of Myongji University.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Restrictions apply to the availability of these data. Data were obtained from Riot Games via the Riot Games API (https://developer.riotgames.com/ (accessed on 10 December 2022)) and are available from the authors with the permission of Riot Games.

Conflicts of Interest

Author Wonkyung Lee was employed by the company PS Analytics. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

Riot Games. League of Legends. Available online: https://www.leagueoflegends.com/ (accessed on 23 June 2025).
Lee, J.Y.; An, J.W.; Lee, S.W. Factors affecting eSports audience satisfaction-The case of League of Legends. J. Korea Game Soc. 2014, 14, 35–46.
Jarrett, J. Gaming the gift: The affective economy of League of Legends ‘fair’ free-to-play model. J. Consum. Cult. 2021, 21, 102–119.
Przybylski, A.K.; Rigby, C.S.; Ryan, R.M. A motivational model of video game engagement. Rev. Gen. Psychol. 2010, 14, 154–166.
Riot Games. Riot Games API. Available online: https://developer.riotgames.com/ (accessed on 23 June 2025).
Gallotta, R.; Todd, G.; Zammit, M.; Earle, S.; Liapis, A.; Togelius, J.; Yannakakis, G.N. Large language models and games: A survey and roadmap. arXiv 2024, arXiv:2402.18659.
OpenAI. OpenAI Platform. Available online: https://platform.openai.com/ (accessed on 23 June 2025).
Zhong, Z.; Liu, H.; Cui, X.; Zhang, X.; Qin, Z. Mix-of-Granularity: Optimize the Chunking Granularity for Retrieval-Augmented Generation. arXiv 2024, arXiv:2406.00456.
Yepes, A.J.; You, Y.; Milczek, J.; Laverde, S.; Li, R. Financial report chunking for effective retrieval augmented generation. arXiv 2024, arXiv:2402.05131.
Hodge, V.J.; Devlin, S.; Sephton, N.; Block, F.; Cowling, P.I.; Drachen, A. Win prediction in multiplayer esports: Live professional match prediction. IEEE Trans. Games 2019, 13, 368–379.
Hitar-García, J.A.; Morán-Fernández, L.; Bolón-Canedo, V. Machine learning methods for predicting league of legends game outcome. IEEE Trans. Games 2022, 15, 171–181.
Novak, A.R.; Bennett, K.J.; Pluss, M.A.; Fransen, J. Performance analysis in esports: Modelling performance at the 2018 League of Legends World Championship. Int. J. Sport. Sci. Coach. 2020, 15, 809–817.
Choi, E.; Kim, J.; Lee, W. Rethinking Evaluation Metric for Probability Estimation Models Using Esports Data. In Proceedings of the 2023 IEEE 22nd International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom), Exeter, UK, 1–3 November 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 2683–2689.
Stepanov, A.; Lange, A.; Khromov, N.; Korotin, A.; Burnaev, E.; Somov, A. Sensors and game synchronization for data analysis in esports. In Proceedings of the 2019 IEEE 17th International Conference on Industrial Informatics (INDIN), Helsinki, Finland, 22–25 July 2019; IEEE: Piscataway, NJ, USA, 2019; Volume 1, pp. 933–938.
Block, F.; Hodge, V.; Hobson, S.; Sephton, N.; Devlin, S.; Ursu, M.F.; Drachen, A.; Cowling, P.I. Narrative bytes: Data-driven content production in esports. In Proceedings of the 2018 ACM International Conference on Interactive Experiences for TV and Online Video, Seoul, Republic of Korea, 26–28 June 2018; pp. 29–41.
Korotin, A.; Khromov, N.; Stepanov, A.; Lange, A.; Burnaev, E.; Somov, A. Towards understanding of esports athletes’ potentialities: The sensing system for data collection and analysis. In Proceedings of the 2019 IEEE SmartWorld, Ubiquitous Intelligence & Computing, Advanced & Trusted Computing, Scalable Computing & Communications, Cloud & Big Data Computing, Internet of People and Smart City Innovation (SmartWorld/SCALCOM/UIC/ATC/CBDCom/IOP/SCI), Leicester, UK, 19–23 August 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1804–1810.
Fei, Y. Quantitative Evaluation of Predictive Analytics: A Comparative Study of Machine Learning Models in eSports Outcome Forecasting. In Proceedings of the 2024 2nd International Conference on Image, Algorithms and Artificial Intelligence (ICIAAI 2024), Singapore, 9–11 August 2024; Atlantis Press: Dordrecht, The Netherlands, 2024; pp. 137–145.
Xenopoulos, P.; Rulff, J.; Silva, C. GgViz: Accelerating large-scale esports game analysis. Proc. ACM Hum.-Comput. Interact. 2022, 6, 1–22. Available online: https://github.com/pnxenopoulos/ggViz (accessed on 23 June 2025).
Urbanek, J.; Fan, A.; Karamcheti, S.; Jain, S.; Humeau, S.; Dinan, E.; Rocktäschel, T.; Kiela, D.; Szlam, A.; Weston, J. Learning to Speak and Act in a Fantasy Text Adventure Game. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; pp. 673–683.
Park, J.S.; O’Brien, J.; Cai, C.J.; Morris, M.R.; Liang, P.; Bernstein, M.S. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, San Francisco, CA, USA, 29 October–1 November 2023; pp. 1–22.
Ranella, N.; Eger, M. Towards Automated Video Game Commentary Using Generative AI. In Proceedings of the EXAG@ AIIDE, Salt Lake City, UT, USA, 10–11 November 2023.
Todd, G.; Earle, S.; Nasir, M.U.; Green, M.C.; Togelius, J. Level generation through large language models. In Proceedings of the 18th International Conference on the Foundations of Digital Games, Lisbon, Portugal, 12–14 April 2023; pp. 1–8.
Ma, W.; Mi, Q.; Zeng, Y.; Yan, X.; Lin, R.; Wu, Y.; Wang, J.; Zhang, H. Large language models play starcraft ii: Benchmarks and a chain of summarization approach. Adv. Neural Inf. Process. Syst. 2024, 37, 133386–133442.
Valve Corporation. Dota 2. Available online: https://www.dota2.com/ (accessed on 23 June 2025).
Riot Games. Valorant. Available online: https://playvalorant.com/ (accessed on 23 June 2025).
Valve Corporation. Counter-Strike. Available online: https://csonline.nexon.com/ (accessed on 23 June 2025).
Blizzard Entertainment. Overwatch. Available online: https://playoverwatch.com/ (accessed on 23 June 2025).
PS Analytics. LOL.PS. Available online: https://lol.ps/ (accessed on 23 June 2025).
OpenAI. o200k_base Tokenizer. Available online: https://github.com/openai/tiktoken (accessed on 23 June 2025).
OpenAI. cl100k_base Tokenizer. Available online: https://github.com/openai/tiktoken (accessed on 23 June 2025).

Figure 1. Overview of the LoL-MDC pipeline for transforming raw match data into a compact, LLM-ready format.

Figure 2. Sample output of the LoL-MDC’s compacted JSON file, including the match overview, team and player statistics, timeline metrics, and key events.

Figure 3. Example statistics from LOL.PS (anonymized) summarizing Match 1 of the sampled dataset.

Figure 4. Responses generated by GPT-4o using raw JSON inputs for general questions. Red-colored text denotes wrong information, while green-colored text denotes the correct information.

Figure 5. Responses generated by GPT-4o using LoL-MDC-compacted input for general questions.

Figure 6. Responses generated by GPT-4o using raw JSON inputs for match overview questions. Red-colored text denotes wrong information, while orange-colored text denotes raw, system-level values that are hard to read.

Figure 7. Responses generated by GPT-4o using LoL-MDC-compacted input for match overview questions.

Figure 8. Responses generated by GPT-4o using LoL-MDC-compacted input for match-comparing questions.

Table 1. Averaged data size and token counts for North America (NA) matches in English.

Metric	14.21 NA	14.22 NA
Raw Size (bytes)	173,284.81	149,324.22
Compacted Size (bytes)	4219.55	3929.28
Raw Token Count (`cl100k_base`)	89,410.73	75,849.92
Compacted Token Count (`cl100k_base`)	1721.49	1591.07
Raw Token Count (`o200k_base`)	88,589.34	75,008.02
Compacted Token Count (`o200k_base`)	1705.87	1576.12

Table 2. Averaged data size and token counts for Korea (KR) matches in Korean.

Metric	14.21 KR	14.22 KR
Raw Size (bytes)	162,954.40	163,989.32
Compacted Size (bytes)	4226.73	4240.62
Raw Token Count (`cl100k_base`)	83,914.27	84,477.72
Compacted Token Count (`cl100k_base`)	1743.96	1750.54
Raw Token Count (`o200k_base`)	83,032.83	83,598.03
Compacted Token Count (`o200k_base`)	1689.48	1694.96

Table 3. Compacted data statistics for NA/en_US matches (patch 14.22).

Component	Size (bytes)	Tokens (`cl100k_base`)	Tokens (`o200k_base`)
Match Overview	220.48	79.00	79.00
Team and Player Statistics	2197.53	900.14	888.90
Timeline View	681.31	348.66	348.66
Selected Key Events	735.96	245.91	242.20

Table 4. Compacted data statistics for KR/ko_KR matches (patch 14.22).

Component	Size (bytes)	Tokens (`cl100k_base`)	Tokens (`o200k_base`)
Match Overview	219.52	78.00	78.00
Team and Player Statistics	2286.33	956.49	912.53
Timeline View	773.86	402.66	402.66
Selected Key Events	866.91	296.29	284.68

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

