1. Introduction
Large Language Models (LLMs), such as GPT-4, are increasingly embedded within research, industry, and everyday applications, powering widely adopted tools including Microsoft Copilot and ChatGPT 4 [
1]. These systems promise seamless interaction with user-provided documents, giving the impression that the form of input should not matter. In principle, an LLM should return the same answer whether the source material is encoded as a typeset PDF, a DOCX file with styles and embedded objects, a minimal TXT file, or an XML file structured with markup tags. In practice, however, file formats introduce non-trivial differences. PDFs often contain layout artifacts such as multi-column text, headers, or figures that complicate parsing; DOCX files intermix semantic text with rich formatting metadata; TXT discards all structure, leaving only raw character streams; and XML introduces explicit markup that can either support structure-aware interpretation or increase parsing overhead. These distinctions do not necessarily alter the semantic content of the model’s responses, but they can shape efficiency, that is, how quickly a response is produced, and style, that is, how naturally and fluently the output reads.
This observation raises two central research questions. First, does file format influence the readability and linguistic complexity of answers generated by LLMs? Second, do any such stylistic effects interact with performance-related factors such as response latency? Prior work suggests that format indeed influences efficiency: both practical analyses of document ingestion and controlled experiments on format normalization demonstrate systematic differences in response time [
2,
3,
4]. Yet much less is known about whether readability or stylistic qualities vary with format. Addressing this gap, we evaluate GPT-4’s responses to 100 queries derived from 50 peer-reviewed academic papers, each represented in four encodings: TXT, PDF, DOCX, and XML. Across this dataset of 400 question–answer pairs, we assess (i) efficiency, measured through response latency and verbosity; (ii) linguistic style, captured through readability indices (Flesch–Kincaid Grade Level, Dale–Chall), average word and sentence length, and lexical diversity; and (iii) semantic stability, evaluated through embedding-based similarity measures to ensure that stylistic variation does not obscure meaning preservation.
The importance of format sensitivity becomes clearer when considering how models internally process documents. Research on ingestion pipelines shows that LLMs are not truly format-agnostic: parsing irregular layouts, footnotes, or embedded objects in PDFs can degrade downstream behavior [
2]. Practitioner guidance reflects this as well, with technical documentation warning against raw PDF ingestion and recommending normalization into cleaner encodings such as DOCX, HTML, or structured text for question-answering and retrieval [
3]. Such evidence underscores that document representation is not a superficial concern, but a critical determinant of usability and performance.
At the same time, research in evaluation of natural language generation has expanded beyond semantic accuracy to include stylistic and qualitative aspects of output. Contemporary evaluators such as G-Eval, BLEURT, and COMET aim to capture fluency, coherence, and readability [
5,
6,
7]. Traditional readability indices like Flesch–Kincaid and Dale–Chall, while decades old, remain interpretable, widely used, and adaptable to NLP tasks, including post-editing effort estimation and fluency assessment [
8,
9,
10]. Readability datasets and standardized computational tools make it possible to apply these methods in a systematic and large-scale way [
11,
12]. Complementary measures such as lexical richness and syntactic complexity extend this analysis by linking vocabulary diversity and sentence structure to perceptions of text difficulty, credibility, and trust [
13,
14].
Semantic stability across different formats has also been investigated, with embedding-based similarity methods consistently showing that core meaning is often preserved even when stylistic or structural properties shift [
4,
15,
16]. Practitioner reports support this finding: plain text inputs tend to elicit faster and more concise responses, while PDFs are more prone to generating verbose or inconsistent answers [
17]. Moreover, technical studies emphasize that encoding itself is never neutral. Work on byte-level processing and polyglot file structures demonstrates that adversarial or ambiguous encodings can mislead models [
18,
19]. Similarly, noise from inconsistent markup or format irregularities has been shown to reduce classification and retrieval accuracy [
20,
21], while effective normalization processes are crucial for optimizing outcomes in high-stakes applications such as biomedical and legal NLP [
22,
23].
Despite these insights, very few studies have jointly investigated readability, stylistic complexity, and response efficiency under controlled conditions in which both model and content are held constant. This study addresses this critical gap by systematically comparing GPT-4’s performance across multiple file encodings. By integrating readability and linguistic complexity metrics with embedding-based semantic similarity checks, we disentangle stylistic differences from substantive meaning. In doing so, we provide the first comprehensive analysis of how document format shapes both the efficiency and presentation quality of LLM outputs, offering practical guidance for researchers and practitioners deploying these systems at scale.
2. Methods
To ensure transparency and reproducibility, this study was structured around a multi-stage workflow that integrated document preparation, agent deployment, query execution, and multi-metric evaluation. This process, illustrated in
Figure 1, was designed to systematically isolate the influence of file format while holding all other variables constant. The workflow began with the conversion of academic articles into four file types: TXT, DOCX, PDF, and XML. Format-specific GPT-4 agents were then deployed through Microsoft Copilot Studio. Each format-specific agent (e.g., PDF_Research, DOCX_Research, XML_Research) was created using identical configurations of the same GPT-4 model instance, with no custom prompts or modified system topics. All agents relied solely on Copilot Studio’s built-in system topics (Conversation Start, Greeting, End of Conversation) and the default GPT-4o reasoning model, ensuring that only the document format varied across conditions. Each agent was queried with identical questions, and outputs were logged alongside associated metadata, including response time, answer length, and input file size. The responses were subsequently evaluated along two complementary dimensions. First, semantic similarity analysis was applied to confirm whether meaning was preserved across formats. Second, readability and complexity metrics were used to assess differences in linguistic style. Statistical analyses (Kruskal–Wallis tests, post hoc pairwise comparisons, and correlation analyses) were then employed to evaluate whether format exerted a systematic effect on performance. By maintaining identical content, questions, and model parameters across all conditions, this structured pipeline ensured that file format alone served as the independent variable.
2.1. Corpus and Document Preparation
The dataset consisted of 50 academic research articles, all originally published in PDF format. Each article was systematically converted into three additional encodings (DOCX, XML, and TXT) using automated conversion tools optimized for preserving textual fidelity. To safeguard against distortions introduced during conversion, all outputs were manually inspected and verified for alignment with the source text. This quality control step ensured that differences observed downstream could be attributed to format representation rather than content discrepancies. For each article, two natural-language queries were designed to probe GPT-4’s responses at different levels of granularity. The first query was intentionally broad, targeting high-level information such as the study’s main objective, the affiliations of its authors, or the identities of contributing researchers. For instance, one representative query asked: “What is the main objective of the MONO2REST approach?” By phrasing these questions in plain language and avoiding direct reference to full titles, we ensured that the model was evaluated on substantive content retrieval rather than on surface-level metadata recognition. The second query served as a follow-up, probing deeper into technical details such as study methodology, modeling strategies, or specific algorithmic approaches. This two-tier design, a general inquiry followed by a targeted follow-up, allowed us to capture both surface-level and in-depth dimensions of model performance. Across 50 documents, this strategy yielded 100 unique questions. Each question was tested under four format conditions (TXT, DOCX, PDF, XML), resulting in a corpus of 400 question–answer pairs for downstream analysis.
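The paper does not name the conversion tools used; as a minimal sketch of the PDF-to-TXT step only, assuming the pdfminer.six library and a hypothetical corpus layout, the extraction could be scripted as follows (the DOCX and XML versions would require separate converters).

```python
# Minimal sketch of the PDF-to-TXT step only, assuming the pdfminer.six library.
# The corpus layout is hypothetical, and the authors' actual conversion tools are not specified.
from pathlib import Path

from pdfminer.high_level import extract_text  # pip install pdfminer.six


def pdf_to_txt(pdf_path: Path, out_dir: Path) -> Path:
    """Extract the text layer of a PDF and write it to a parallel .txt file."""
    text = extract_text(str(pdf_path))
    out_path = out_dir / (pdf_path.stem + ".txt")
    out_path.write_text(text, encoding="utf-8")
    return out_path


if __name__ == "__main__":
    out_dir = Path("corpus/txt")
    out_dir.mkdir(parents=True, exist_ok=True)
    for pdf in sorted(Path("corpus/pdf").glob("*.pdf")):  # hypothetical corpus location
        print("converted:", pdf_to_txt(pdf, out_dir))
```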
2.2. Response Collection and Metadata Recording
Each GPT-4 output was systematically collected with detailed metadata to enable both performance-oriented and content-focused analyses. The following parameters were recorded for every query–response pair:
Response time (in seconds): Latency was measured with millisecond precision to capture even subtle differences in processing speed between formats.
Answer length (in characters): Used as a proxy for verbosity, this measure provided insight into how much textual detail the model produced under each condition.
Input file size (in KB): Document size was tracked to examine whether file weight contributed to response latency or introduced variability in the generated outputs.
To safeguard data integrity, multiple validation steps were implemented. Each query was verified to have been executed against the correct document version and in the intended file format. Responses were then cross-checked for alignment with the source material, for example by ensuring that answers correctly referenced the appropriate methodology, findings, or author list. In addition, retrieval logs from Copilot Studio were inspected to confirm that the deployed agents accessed the intended inputs without misrouting across formats. All Copilot agents were executed under the same GPT-4o environment in Microsoft Copilot Studio, which uses a fixed, deterministic configuration and a single stable model endpoint. These constraints ensured consistent generation behavior across all formats. This validation process established a high level of confidence that the final dataset (400 responses spanning 50 documents, four formats, and two queries per document) was both complete and accurately aligned. These safeguards provided a reliable foundation for subsequent multi-level analyses of efficiency, linguistic style, and semantic stability.
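For illustration, a per-query log record along the lines described above could be captured as follows; the field names and CSV layout are hypothetical, not the schema used in the study.

```python
# Illustrative per-query log record; field names are hypothetical, not the authors' schema.
import csv
import time
from dataclasses import asdict, dataclass, fields


@dataclass
class QueryRecord:
    document_id: str          # e.g., "paper_017"
    file_format: str          # "TXT", "DOCX", "PDF", or "XML"
    query_type: str           # "general" or "follow_up"
    query_text: str
    answer_text: str
    response_time_s: float    # latency in seconds
    answer_length_chars: int  # verbosity proxy
    input_file_size_kb: float


def timed_query(ask, query: str, **meta) -> QueryRecord:
    """Wrap an agent call `ask(query) -> str`, recording latency and answer length."""
    start = time.perf_counter()
    answer = ask(query)
    elapsed = time.perf_counter() - start
    return QueryRecord(query_text=query, answer_text=answer,
                       response_time_s=round(elapsed, 3),
                       answer_length_chars=len(answer), **meta)


def append_to_csv(record: QueryRecord, path: str = "responses.csv") -> None:
    """Append one record to a CSV log, writing the header on first use."""
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=[fl.name for fl in fields(QueryRecord)])
        if f.tell() == 0:
            writer.writeheader()
        writer.writerow(asdict(record))
```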
2.3. Readability and Complexity Metrics
To assess the stylistic qualities of GPT-4 outputs, each response was systematically evaluated using a suite of established readability and linguistic complexity measures. These indices, which have been widely applied in both educational research and computational linguistics, provide quantitative insights into the accessibility of text and the cognitive effort required for comprehension.
Applying such measures in this context served two purposes. First, it enabled a structured evaluation of whether different file formats influenced not only the efficiency of responses but also the stylistic properties of the generated language. Second, it allowed us to capture subtle variations in how GPT-4 adapts sentence structure, vocabulary richness, and overall fluency depending on the input representation. By combining these readability and complexity metrics with performance indicators such as response time and verbosity, the analysis provided a multidimensional perspective. This ensured that any format-driven differences could be interpreted not only in terms of computational efficiency but also in terms of how naturally and clearly the model communicated its answers.
2.3.1. Flesch–Kincaid Grade Level (FKGL)
The FKGL, Equation (1), provides an estimate of the U.S. school grade level required to understand a passage [
8].
\[ \mathrm{FKGL} = 0.39 \left(\frac{W}{S}\right) + 11.8 \left(\frac{\mathrm{Syl}}{W}\right) - 15.59 \tag{1} \]
This formula is essentially a weighted combination of two core measures:
Average Sentence Length (ASL) = W/S, which reflects syntactic complexity.
Average Syllables per Word (ASW) = Syl/W, which captures lexical complexity.
The variables are defined as follows:
W: The total number of words in the passage. A word is typically defined as any sequence of characters separated by whitespace.
S: The number of sentences in the passage. Sentences are usually determined by punctuation markers such as periods, exclamation marks, or question marks.
Syl: The estimated number of syllables, obtained by standard rule-based counting methods.
The constants (0.39, 11.8, −15.59) are empirically derived to scale the score to align with U.S. grade levels. Interpretation typically follows grade-level bands, ranging from ≤ 1 (very easy; early primary school) to 16+ (graduate-level or professional prose), as shown in
Table 1 [
8,
12].
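A minimal sketch of Equation (1) is given below; the syllable counter is a naive vowel-group heuristic that only approximates rule-based counts, and the implementation is illustrative rather than the tooling actually used in the study.

```python
# Illustrative implementation of Equation (1). The syllable counter is a rough
# vowel-group heuristic, not an exact rule-based count; packages such as textstat
# expose an equivalent flesch_kincaid_grade() function.
import re


def count_syllables(word: str) -> int:
    """Approximate syllables as runs of vowels, with a minimum of one per word."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def fkgl(text: str) -> float:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    asl = len(words) / max(1, len(sentences))   # average sentence length, W / S
    asw = syllables / max(1, len(words))        # average syllables per word, Syl / W
    return 0.39 * asl + 11.8 * asw - 15.59


print(round(fkgl("The model answered the question. The reply was short and clear."), 2))
```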
2.3.2. Dale–Chall Readability Score
The Dale–Chall formula [
9] estimates readability based on word familiarity and sentence length. The score is given in Equation (2):
\[ \text{Dale–Chall} = 0.1579 \left(\frac{D}{W} \times 100\right) + 0.0496 \left(\frac{W}{S}\right) \tag{2} \]
where the variables are defined as:
D: The number of words in the text that do not appear in the Dale–Chall list of 3000 familiar words. These words are considered more challenging for a 4th-grade reader.
W: The total number of words in the passage, defined as sequences of characters separated by whitespace.
S: The number of sentences in the passage, generally determined by terminal punctuation markers (period, exclamation mark, or question mark).
If the percentage of difficult words exceeds 5%, a constant of 3.6365 is added to the score. Values can then be interpreted across grade bands, with ≤4.9 indicating very simple text and ≥10 corresponding to graduate-level material as shown in
Table 2 [
9,
12].
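The sketch below illustrates Equation (2), assuming the list of 3000 familiar words has been loaded from a local file (the path is hypothetical); readability packages commonly bundle this list.

```python
# Illustrative implementation of Equation (2). The familiar-word list must be supplied;
# the file path below is hypothetical, and the 5% adjustment follows the rule stated above.
import re


def dale_chall(text: str, familiar_words: set) -> float:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = [w.lower() for w in re.findall(r"[A-Za-z']+", text)]
    difficult = [w for w in words if w not in familiar_words]
    pct_difficult = 100.0 * len(difficult) / max(1, len(words))
    score = 0.1579 * pct_difficult + 0.0496 * (len(words) / max(1, len(sentences)))
    if pct_difficult > 5.0:
        score += 3.6365   # adjustment constant for texts with many unfamiliar words
    return score


familiar_words = set(open("dale_chall_3000.txt").read().split())  # hypothetical path
print(round(dale_chall("The encoder maps each sentence to a dense vector.", familiar_words), 2))
```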
2.3.3. Lexical Diversity (Type–Token Ratio, TTR)
Lexical richness was assessed through the Type–Token Ratio [
13], shown in Equation (3):
\[ \mathrm{TTR} = \frac{V}{N} \tag{3} \]
where each variable is defined as follows:
V: The number of distinct word types in the text, ignoring repeated occurrences. For example, if “system” appears ten times, it is counted once.
N: The total number of word tokens in the passage, typically defined as sequences of characters separated by whitespace.
Higher values indicate a broader vocabulary relative to text length, though TTR is known to be sensitive to sample size (shorter texts tend to have artificially higher values).
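Equation (3) reduces to a short computation; the whitespace/word-character tokenizer in the sketch below is illustrative only.

```python
# Type-Token Ratio (Equation (3)): distinct word types divided by total word tokens.
import re


def type_token_ratio(text: str) -> float:
    tokens = [w.lower() for w in re.findall(r"[A-Za-z']+", text)]
    return len(set(tokens)) / max(1, len(tokens))


print(round(type_token_ratio("the system parses the system output"), 2))  # 4 types / 6 tokens
```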
2.4. Semantic Similarity (For Content Consistency)
To ensure that format-driven shifts in readability did not alter the meaning of answers, we measured semantic similarity across formats. Each answer was embedded using the SentenceTransformers library with the MPNet-base-v2 model, which is optimized for capturing sentence-level meaning [
15,
Pairwise cosine similarity was then calculated between answers produced from the same query across the four formats, as shown in Equation (4):
\[ \cos(A, B) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} \tag{4} \]
where the variables are defined as:
A, B: Embedding vectors representing the semantic content of two answers. Each vector encodes meaning in a high-dimensional space.
A · B: The dot product between the two vectors, measuring their alignment.
‖A‖, ‖B‖: The Euclidean (L2) norms of vectors A and B, which scale the dot product by their magnitudes.
Although cosine similarity can in principle range from −1 to 1, values in this setting were effectively bounded between 0 (no similarity) and 1 (identical meaning). In practice, higher cosine similarity indicates stronger semantic equivalence between answers across formats.
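A minimal sketch of this comparison, using the sentence-transformers package and assuming the all-mpnet-base-v2 checkpoint corresponds to the MPNet model referenced above, is shown below with toy answers standing in for the collected responses.

```python
# Pairwise cosine similarity (Equation (4)) between answers to the same query.
# The all-mpnet-base-v2 checkpoint is assumed to correspond to the MPNet model cited above,
# and the answers below are toy examples, not responses collected in the study.
from itertools import combinations

from sentence_transformers import SentenceTransformer, util  # pip install sentence-transformers

model = SentenceTransformer("all-mpnet-base-v2")

answers = {
    "TXT":  "The approach migrates a monolithic application to REST microservices.",
    "DOCX": "MONO2REST decomposes a monolithic system into RESTful microservices.",
    "PDF":  "Its goal is to transform monolithic code bases into REST-based services.",
    "XML":  "The method converts a monolith into REST microservices automatically.",
}

embeddings = {fmt: model.encode(text, convert_to_tensor=True) for fmt, text in answers.items()}

for fmt_a, fmt_b in combinations(answers, 2):
    similarity = util.cos_sim(embeddings[fmt_a], embeddings[fmt_b]).item()
    print(f"{fmt_a}-{fmt_b}: {similarity:.3f}")
```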
2.5. Statistical Analyses
A structured statistical framework was applied to determine whether observed differences across formats and question types were statistically meaningful. Because readability and complexity metrics are text-derived and may not follow normal distributions, assumption checking and normalization steps were performed before choosing the appropriate tests. This ensured that each comparison was grounded in the correct statistical methodology rather than relying on uniform parametric assumptions.
Normality Testing and Justification: The distribution of each metric was first assessed using normality tests (e.g., Shapiro–Wilk and Anderson–Darling) [
24,
25]. When data failed to meet normality assumptions, we applied non-parametric tests to avoid inflating Type I error rates.
Group Comparisons: To compare metrics across the four file formats, we used the Kruskal–Wallis H test, a rank-based non-parametric alternative to ANOVA. Significant omnibus results were followed with Dunn’s post hoc tests using Bonferroni correction to identify specific pairwise differences [
26,
27].
Pairwise Comparisons by Question Type: To test whether responses differed between general and follow-up queries, we employed the Mann–Whitney U test, again with Bonferroni adjustment for multiple comparisons [
28].
Correlation Analysis: Associations between metrics (e.g., readability vs. sentence length, verbosity vs. response time) were evaluated using Spearman’s rank correlation coefficient, which is robust to non-linear relationships and ordinal data [
29]. This allowed us to capture monotonic trends even when absolute values were not normally distributed.
Reliability Across Formats: To quantify how interchangeable answers were across formats, we calculated intraclass correlation coefficients (ICCs), which provide a measure of consistency across repeated observations of the same item under different conditions [
30].
By integrating normality checks, non-parametric group tests, post hoc comparisons, and correlation analyses, this statistical framework allowed us to isolate the effects of file format and question type on both readability and performance.
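A condensed sketch of this framework is shown below. SciPy covers the omnibus, pairwise, and correlation tests; Dunn’s post hoc test and the ICC are delegated to the scikit-posthocs and pingouin packages, which are assumed tooling choices rather than the authors’ actual scripts, and the column names refer to the hypothetical response log sketched earlier.

```python
# Condensed sketch of the statistical framework described above. scikit-posthocs and
# pingouin are assumed tooling choices, not necessarily the packages used by the authors;
# column names refer to the hypothetical response log sketched in Section 2.2.
import pandas as pd
from scipy import stats
import scikit_posthocs as sp   # pip install scikit-posthocs
import pingouin as pg          # pip install pingouin

df = pd.read_csv("responses.csv")  # hypothetical log of the 400 question-answer pairs

# 1. Normality checks (Shapiro-Wilk) per format.
for fmt, grp in df.groupby("file_format"):
    print(fmt, stats.shapiro(grp["response_time_s"]))

# 2. Kruskal-Wallis omnibus test across the four formats.
groups = [grp["response_time_s"].values for _, grp in df.groupby("file_format")]
print(stats.kruskal(*groups))

# 3. Dunn's post hoc pairwise comparisons with Bonferroni correction.
print(sp.posthoc_dunn(df, val_col="response_time_s", group_col="file_format",
                      p_adjust="bonferroni"))

# 4. Mann-Whitney U test: general vs. follow-up queries.
general = df.loc[df["query_type"] == "general", "response_time_s"]
follow_up = df.loc[df["query_type"] == "follow_up", "response_time_s"]
print(stats.mannwhitneyu(general, follow_up))

# 5. Spearman rank correlation between verbosity and latency.
print(stats.spearmanr(df["answer_length_chars"], df["response_time_s"]))

# 6. Intraclass correlation: consistency of a metric (here a hypothetical "fkgl" column)
#    for the same question across the four formats.
print(pg.intraclass_corr(data=df, targets="question_id", raters="file_format",
                         ratings="fkgl"))
```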
3. Results
Our analysis was designed to progress in a stepwise manner, beginning with measures of readability and linguistic complexity, moving next to performance-oriented indicators such as response latency, and concluding with an assessment of semantic stability. This layered structure allowed us to determine whether differences introduced by file format are confined to surface-level stylistic variation or whether they extend into deeper dimensions of efficiency and meaning preservation. By examining outputs across these three tiers, we sought to develop a comprehensive understanding of how document encoding influences both the form and function of GPT-4 responses.
3.1. Overview of Experimental Setup and Query Design
The study corpus consisted of 50 academic research articles, all of which were originally published in PDF format. To systematically evaluate the role of encoding, each article was converted into three additional formats (DOCX, XML, and TXT) using automated conversion tools. These tools were selected for their ability to preserve textual fidelity, ensuring that core content was carried consistently across representations. Following conversion, all versions were manually inspected to confirm alignment with the source document. This validation step was critical to isolating file format as the sole independent variable while eliminating potential confounds arising from content mismatches or incomplete transfers. For each article, two natural-language queries were constructed to evaluate GPT-4’s responses. The first query was intentionally broad, typically addressing high-level information such as the study’s primary objectives, author affiliations, or contributing research groups. For instance, one representative query asked: “What is the main objective of the MONO2REST approach?” By formulating these queries in plain language and avoiding reliance on full paper titles or metadata, we ensured that the evaluation targeted substantive content retrieval rather than superficial matching. The second query was designed as a follow-up, probing deeper into technical details such as the study’s methodology, analytical models, or specific computational approaches. Together, this two-tier query structure captured both surface-level and in-depth dimensions of GPT-4 outputs, offering a richer perspective on the impact of file format. Across the 50 documents, this design yielded 100 unique questions. Each question was then tested under four experimental conditions, corresponding to the four file formats, resulting in a total of 400 question–answer pairs available for downstream analysis.
3.2. Readability and Complexity Across Formats
Readability and surface-level complexity metrics remained highly consistent across all four formats. Statistical testing using the Kruskal–Wallis method confirmed no significant between-format differences for any of the indices examined: Flesch–Kincaid Grade Level (FKGL,
p = 0.86), Dale–Chall score (
p = 0.84), average sentence length (
p = 0.72), and lexical diversity (
p = 0.69). As illustrated in
Figure 2, the violin plots for each metric display substantial overlaps, with medians tightly clustered around comparable values across TXT, DOCX, PDF, and XML. This convergence indicates that GPT-4’s generated answers were equally readable and stylistically consistent regardless of the input encoding. None of the formats exhibited a systematic tendency toward producing either simpler phrasing or more elaborate sentence structures. Put differently, while file format clearly influenced performance measures such as response latency (reported in subsequent sections), it did not alter how accessible, fluent, or stylistically complex the responses appeared to a human reader.
3.3. Response Time, Efficiency, and Latency Differences Across Formats
In contrast to the stability observed in readability measures, response latency was strongly influenced by file format. Kruskal–Wallis tests confirmed significant between-format differences (H = 30.59, p ≈ 1 × 10⁻⁶ overall), with this effect consistent across both general queries (p = 2.2 × 10⁻⁴) and follow-up queries (p = 4.4 × 10⁻⁵). As depicted in
Figure 3, XML responses were consistently the fastest, with a median response time of approximately 6.8 s. Pairwise post hoc comparisons revealed that XML significantly outperformed DOCX (p = 0.00024), TXT (p = 0.0003), and PDF (p = 1.1 × 10⁻⁶). Among the slower formats, PDF and DOCX exhibited the longest latencies, frequently exceeding 7.5–8 s.
TXT responses occupied an intermediate position, slower than XML but faster than PDF and DOCX. Taken together, these results establish file format as a clear determinant of efficiency, with XML consistently delivering the most stable and pronounced gains in response speed.
To further examine whether the observed latency differences were influenced by caching, execution order, or temporal variation, three additional experiments (Exp 1, Exp 2, and Exp 3) were conducted several months after the original February 2025 run, using the same dataset, identical queries, and the same set of Copilot agents.
Experiment 1 (morning, Day 1): Followed the original fixed-order workflow (PDF_Research → DOCX_Research → XML_Research → TXT_Research). Each query was submitted sequentially across all agents, completing the first query for every agent before moving to the second.
Experiment 2 (afternoon, Day 2): Repeated the same fixed order and procedure as Experiment 1 but was executed the following day in the afternoon to test for possible session-level or time-of-day effects.
Experiment 3 (afternoon, Day 3): Employed an interleaved execution order (TXT_Research → DOCX_Research → PDF_Research → XML_Research), alternating the sequence of agents to assess whether order influenced response times.
All three experiments were executed within Microsoft Copilot Studio using the built-in GPT-4o model, where generation parameters (temperature, top-p, and max-tokens) are fixed and not user-configurable. Each agent was linked to a single document format, ensuring identical content and model conditions across runs. Across the three replication experiments, no statistically significant latency differences were observed between formats as shown in
Figure 4.
This contrasts with the initial February 2025 run, which showed significantly faster XML responses. The convergence of mean response times across formats likely reflects backend improvements to Microsoft’s GPT-4o document-ingestion pipeline. These replications confirm that while file format previously exerted a measurable impact on latency, platform-level enhancements have since minimized these differences, yielding more uniform performance across encoding types.
3.4. Answer Length, Verbosity, and File Size
Despite substantial differences in file encoding, the length of GPT-4’s answers remained statistically indistinguishable across formats. Kruskal–Wallis testing confirmed no significant differences in output length (p = 0.83). As illustrated in
Figure 5 (top row), the distributions of character counts across TXT, DOCX, PDF, and XML overlapped almost entirely, with medians clustering at similar values. This suggests that input format did not systematically influence how much detail the model produced. However, scatterplots in the middle row of
Figure 5 reveal a strong and consistent positive correlation between answer length and response time, with Spearman’s ρ ranging from 0.47 to 0.68 (all p < 10⁻⁶). In practical terms, longer answers naturally require more time to generate, regardless of format. These findings underscore verbosity as a key determinant of latency, independent of input encoding.
By contrast, input file size itself showed no meaningful association with response time. As shown in the bottom row of
Figure 5, correlations between file size and latency were negligible (ρ ≈ 0.03–0.06, all nonsignificant). These results indicate that the computational burden arises from the generative process rather than from parsing document size or weight. Taken together, the findings highlight verbosity, not file size, as the primary driver of efficiency differences. This distinction is important for deployment: while file format sets baseline processing overhead (as seen in latency differences across XML, TXT, DOCX, and PDF), the length of generated responses ultimately determines real-world throughput.
3.5. Correlation Structures Among Readability, Verbosity, and Latency
The correlation heatmaps presented in
Figure 6 shed light on the mechanistic drivers of readability and response efficiency. As expected, the Flesch–Kincaid Grade Level (FKGL) was strongly associated with average sentence length (ρ = 0.73 overall; ranging from 0.60 to 0.83 across query subsets), confirming that sentence length is the dominant factor shaping readability scores. Similarly, Dale–Chall values correlated positively with average word length (ρ = 0.52), consistent with the measure’s emphasis on vocabulary complexity. Performance-related variables also demonstrated systematic relationships. Verbosity, measured as total answer length, showed a robust positive correlation with response time (ρ = 0.57 overall), reinforcing the earlier observation that longer outputs take more time to generate. In contrast, lexical diversity was negatively correlated with both length (ρ = –0.60) and latency (ρ = −0.34). This indicates that more concise responses tended to incorporate a wider range of vocabulary, while longer and slower answers often recycled words, reducing lexical richness. Importantly, these correlation structures were consistent across both general and follow-up queries, demonstrating that the underlying linguistic mechanics remained stable regardless of query type. Together, the results suggest that while file format can shift baseline latency, it does not fundamentally alter the relationships among readability, verbosity, and semantic style. Instead, these dynamics reflect general properties of GPT-4’s generative behavior, largely independent of document encoding.
3.6. Semantic Similarity and Content Fidelity
Finally, semantic similarity analyses demonstrated that the underlying meaning of GPT-4’s responses was preserved across all file format conditions. As shown in
Figure 7, cosine similarity medians were uniformly high, consistently exceeding 0.91 across format pairs. Comparisons between XML and TXT yielded the highest median similarity (0.923), whereas PDF–DOCX pairs were marginally lower (0.911). Importantly, both general and follow-up query subsets followed this same pattern, and statistical testing confirmed no significant differences among format pairs (Kruskal–Wallis
H = 5.69,
p = 0.89). A handful of outlier cases revealed minor format-induced drift, most often associated with PDF encodings that introduced layout artifacts; these instances were rare and did not materially affect the overall findings. Taken together, the results indicate that while file format strongly influenced efficiency metrics such as latency, it had negligible impact on semantic fidelity. In other words, encoding differences altered how fast GPT-4 produced responses, but not what those responses conveyed. These findings reinforce the conclusion that format operates as a performance lever rather than a semantic one: XML offers efficiency advantages, while TXT, DOCX, and PDF remain equivalent in preserving the substantive content of answers.
4. Discussion
This study demonstrates that while file format exerts only a modest influence on readability and linguistic complexity, it plays a more decisive role in shaping the efficiency of GPT-4’s responses. Across 400 question–answer pairs, readability scores including Flesch–Kincaid and Dale–Chall showed no significant differences between TXT, DOCX, PDF, and XML. In practical terms, this means that GPT-4 consistently generated responses at a comparable grade level regardless of the input encoding. Such consistency suggests that once textual content is successfully ingested, the model’s language generation process is largely format-agnostic with respect to stylistic complexity and readability.
In the original experiment conducted in February 2025, response latency varied significantly by format. XML consistently produced faster outputs than the other three formats, even though XML files were often larger in size. This finding indicated that document structure mattered more than document weight. XML’s explicit tagging appeared to provide the model with clearer parsing boundaries, thereby streamlining processing, whereas DOCX and PDF contained layout metadata that introduced noise and slowed performance. The TXT format’s minimalist design proved less effective than more structured formats, suggesting that the total absence of organizational cues can obstruct rather than facilitate the model’s scanning process.
To evaluate whether these differences were stable over time, three replication experiments (Exp 1–Exp 3) were conducted several months later under the updated Copilot Studio environment. Each used the same dataset and queries but varied the execution order and timing of agent calls. Across all three replications, no statistically significant latency differences were found among formats. The previously observed XML advantage diminished, and mean response times converged around comparable values for PDF, DOCX, TXT, and XML. This convergence likely reflects backend improvements in Microsoft’s GPT-4o infrastructure, particularly in document-ingestion and layout-parsing pipelines.
These later results do not invalidate the initial findings. Instead, they represent two snapshots of the same system at different stages of maturity. February 2025 results captured a period when format-specific parsing overhead still influenced response time, whereas the October 2025 replications show that ongoing platform updates have reduced these disparities. Together, these findings highlight the evolving nature of LLM-based systems: as their underlying infrastructure improves, the influence of input format on latency decreases while core linguistic behavior remains consistent.
Beyond baseline timing, correlation analyses clarified the main drivers of latency. Across all formats, longer answers, not larger input files, were most strongly associated with slower responses. Response time scaled with verbosity, reflecting the computational cost of generating additional tokens. Furthermore, lexical diversity tended to decline as answers became longer and slower, indicating that extended responses often recycled vocabulary rather than introducing new terms. This finding links efficiency with linguistic richness, underscoring that verbosity does not necessarily translate into more informative content.
Semantic similarity results provide reassurance that content fidelity was preserved. Across all formats, cosine similarity scores remained above 0.91, confirming that meaning was largely stable. Rare exceptions were observed for specific queries, most notably in cases where PDF or DOCX encodings interfered with the representation of structured lists, leading to minor semantic drift. While these deviations were uncommon, they highlight that encoding artifacts can occasionally alter output. For most users, this means that format choice will not affect the factual content of answers. However, for developers and researchers deploying LLM systems at scale, awareness of these edge cases is essential, particularly in high-stakes contexts such as biomedical or legal applications.
4.1. Practical Implications
These evolving results carry several practical lessons for both practitioners and system designers:
Format choice can impact efficiency: While the original experiments showed XML as the fastest format, subsequent replications found comparable response times across all formats, suggesting that ongoing backend improvements have reduced these disparities.
Readability is robust across formats: Users can expect comparable difficulty levels (FKGL and Dale–Chall) regardless of input format, which is reassuring for accessibility.
Verbosity, not file size, drives latency: Longer answers take more time to generate, while input size alone does not predict performance.
Semantic meaning is preserved: High similarity scores confirm that factual content remains consistent across formats, reducing the risk of loss of information.
Designing systems with structured formats pays off: For developers of retrieval-augmented or Copilot-style assistants, prioritizing structured inputs such as XML (or clean HTML) can improve stability and speed in earlier or less-optimized environments without compromising content quality.
4.2. Analysis of Response Time Differences Across File Formats in Copilot
It must be pointed out that, although Copilot’s response time in our study varied depending on the file format of the documents, the key difference lies in how different file types are parsed, tokenized, and preprocessed before the model receives them for inference. A synopsis of the factors contributing to variability in response time is provided below.
Different preprocessing pipelines: PDF, DOCX, TXT, and XML formats require different preprocessing steps. PDFs often need text extraction or OCR, DOCX files require unpacking and conversion, while XML files are already structured and text-based, minimizing parsing time.
Efficient content segmentation: XML includes semantic tags such as <title>, <abstract>, and <section>, which enable the system to identify and prioritize relevant sections quickly (see the sketch after this list).
Lower token count: XML text tends to be cleaner and more compact, resulting in fewer tokens and faster model input processing.
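To make the segmentation point concrete, the minimal sketch below parses an XML document with Python’s standard xml.etree.ElementTree; the tag names follow the examples above and the toy document is illustrative only.

```python
# Minimal illustration of structure-aware segmentation. The tag names follow the examples
# above (<title>, <abstract>, <section>) and the toy document is illustrative only.
import xml.etree.ElementTree as ET

xml_doc = """<paper>
  <title>MONO2REST: Migrating a Monolith to REST</title>
  <abstract>We propose an automated migration approach.</abstract>
  <section name="Methodology">The approach uses static analysis and clustering.</section>
</paper>"""

root = ET.fromstring(xml_doc)

# Explicit tags let a pipeline jump directly to the relevant passages instead of
# re-deriving structure from visual layout, as PDF extraction must.
title = root.findtext("title")
abstract = root.findtext("abstract")
sections = {sec.get("name"): (sec.text or "").strip() for sec in root.findall("section")}

print(title)
print(abstract)
print(sections["Methodology"])
```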
However, this does not indicate a true model speed difference. The model inference speed, the actual computation once text tokens are received, is essentially constant for a given input length. The observed differences arise primarily from the preprocessing stage, not from the model reasoning or generation itself.
In summary, as shown in
Table 3, XML-based research papers yield shorter Copilot response times primarily because their structure allows for faster, more direct text extraction and fewer preprocessing steps. The model’s inference process itself is not inherently faster; the efficiency comes from reduced preprocessing overhead. For LLM-based querying of scientific literature, XML or clean structured text formats are therefore recommended.
5. Conclusions
In conclusion, this study demonstrates that while file format does not alter the semantic fidelity or readability of GPT-4’s answers, it significantly affects how quickly and consistently those answers are generated. Among the formats tested, XML consistently delivered the most efficient and stable performance, highlighting the value of structured encodings. By contrast, PDF and DOCX introduced measurable delays and, in rare cases, minor semantic misalignments likely stemming from layout or metadata artifacts. TXT functioned as a reliable baseline, though its lack of organizational structure meant it did not provide the same performance gains as XML. Importantly, verbosity emerged as the strongest predictor of latency, whereas document size itself showed no meaningful influence.
However, follow-up replication experiments conducted several months later under the updated Microsoft Copilot Studio environment showed that these latency differences had largely converged. The new experiments (Exp 1–Exp 3), performed with both fixed and interleaved execution orders, revealed no statistically significant differences across formats. This temporal convergence suggests that backend improvements, particularly in GPT-4o’s document-ingestion and parsing pipelines, have reduced the disparities observed in the February 2025 results. The findings therefore represent two valid snapshots of the same system at different stages of evolution: earlier runs captured format-specific sensitivity, whereas later runs reflect a more uniform, optimized environment.
This work is not without limitations. First, it focused exclusively on a single large language model, GPT-4, meaning that results may differ when applied to alternative architectures or future model revisions. Second, the dataset comprised 50 academic papers with two queries per document. While sufficient to reveal systematic trends, this scope may not capture the full variability present in other domains such as legal, industrial, or medical corpora. Third, readability indices such as Flesch–Kincaid and Dale–Chall, although widely established and interpretable, provide approximations of linguistic difficulty and may not fully reflect human perceptions of clarity or fluency.
Additionally, Microsoft Copilot Studio operates as a continuously evolving cloud platform. Its underlying GPT-4o model, caching behavior, and document-parsing components are periodically updated without user control, which can influence timing and ingestion performance across different dates of testing. As a result, experiments conducted in February 2025 and those repeated in October 2025 reflect distinct but valid system states. Moreover, while Copilot Studio allows multiple agents to be created independently, agents within the same application share fixed default parameters (e.g., temperature, top-p, and max-token settings) that cannot be modified by the user. These evolving and fixed platform characteristics highlight the importance of documenting model versions, platform configuration, and execution timeframe in future reproducibility efforts.
These results suggest that the effect of file format on system efficacy is a function of the underlying AI model’s developmental stage and algorithmic optimization. Consequently, the practical implications vary across systems with different levels of refinement. For organizations managing large-scale document collections, particularly when working with earlier or less optimized LLM environments, adopting structured representations such as XML, or equivalently clean and semantically tagged HTML, can yield measurable efficiency gains without compromising content quality. On the enterprise scale, even modest reductions in latency can compound into substantial time and cost savings. Conversely, for smaller organizations or projects operating within newer, more optimized AI platforms, the benefits of format conversion may be minimal. Modern LLMs, as demonstrated with GPT-4, can handle TXT, DOCX, PDF, and XML without altering the underlying readability or semantic quality of their responses.
These findings indicate that file format is not a significant determinant of information comprehensibility, but rather a key factor influencing the efficiency of data processing and retrieval. While meaning and readability remain robust across formats, response speed and stability are clearly sensitive to encoding choices. As AI models and enterprise platforms such as Copilot Studio continue to evolve, these performance gaps may continue to narrow. The integrity of AI-driven document analysis is contingent upon transparent and reproducible research practices. This requires a persistent and meticulous recording of all experimental conditions, specifically the model’s version, computational environment, and timestamp of execution.
We must point out that, while earlier versions such as GPT-4o (which was employed in this study) showed measurable response-time differences when processing research papers in different file formats, GPT-5’s upgraded architecture (the most recent version, introduced on 7 August 2025) largely eliminates these disparities. This improvement stems from unified preprocessing, multimodal document understanding, and better token handling. The reasons that response-time differences across file formats have diminished are summarized below (see
Table 4):
Unified input pipeline: GPT-5 converts all supported file formats (PDF, DOCX, XML, TXT, HTML) into a standardized internal representation before inference.
Built-in multimodal document understanding: GPT-5 can interpret structured and visual layout directly, making parsing times nearly identical.
Parallelized context construction: Multiple segments of a document are processed concurrently, reducing sequential preprocessing delays.
Optimized token management: The tokenizer efficiently compresses repetitive markup, ensuring similar token counts across formats.
However, minimal residual variations may still occur due to:
File size and embedded graphics (especially in image-heavy PDFs)
Encoding or corruption anomalies
Complexity of the user query or retrieval pathway
In summary, in GPT-4o, response-time variations across file formats arose from external parsing and tokenization steps. With GPT-5, those bottlenecks have been largely eliminated due to integrated preprocessing and multimodal handling. The model now delivers nearly uniform response times across XML, TXT, DOCX, and PDF formats. For researchers, this means that format choice is no longer a critical factor regarding performance, and in turn content quality and structure take precedence.