A Conceptual Model for Born-Accessible Web Accessibility Regeneration Using Large Language Models

Vera-Amaro, Guillermo; Rojano-Cáceres, José Rafael

doi:10.3390/computers15060343

by Guillermo Vera-Amaro and José Rafael Rojano-Cáceres *

Facultad de Estadística e Informática, Universidad Veracruzana, Xalapa-Enríquez 91020, Veracruz, Mexico

* Author to whom correspondence should be addressed.

Computers 2026, 15(6), 343; https://doi.org/10.3390/computers15060343

Submission received: 22 April 2026 / Revised: 23 May 2026 / Accepted: 26 May 2026 / Published: 27 May 2026

(This article belongs to the Section Human–Computer Interactions)

Abstract

Web accessibility remediation using large language models (LLMs) has recently gained attention; however, most approaches remain tool-centric and lack formal architectural grounding. This article introduces a formally structured conceptual model for born-accessible web remediation using LLMs. The model was derived through a systematic literature review and refined under the Design Science Research Methodology. Unlike patch-based repair strategies, it treats remediation as constrained regeneration, producing accessible content from semantically reorganized inputs. The model defines five core components (input acquisition, intermediate transformation, prompt configuration, generative inference, and output evaluation) and formalizes their interactions and decision mechanisms. A controlled demonstration was conducted using multiple LLMs (GPT, Gemini) and automated tools (Lighthouse, Axe, WAVE), complemented by checklist-based structural inspection. Results indicate that accessibility improvement depends strongly on the architectural structuring of transformation and evaluation sequencing. The formalization advances LLM-driven accessibility remediation from empirical experimentation toward a reproducible, decision-governed generative paradigm, providing a structured foundation for the systematic development of accessibility-oriented architectures, frameworks, and software systems.

Keywords:

web accessibility; conceptual modeling; systematic literature review; prompt engineering; artificial intelligence

1. Introduction

The World Wide Web has become essential infrastructure for education, commerce, governance, and healthcare. Recent reports indicate that 67.9% of the global population uses the Internet [1], while approximately 16% lives with some form of disability [2], including over two billion people with visual impairment [3]. For many of these users, equitable participation depends on accessible web design and assistive technology compatibility.

Despite international standards such as the Web Content Accessibility Guidelines (WCAG) [4], non-compliance remains widespread. Large-scale automated studies show that most high-traffic websites contain detectable accessibility errors [5], with similar deficiencies reported in educational and governmental domains [6]. These patterns indicate systemic shortcomings rather than isolated defects.

Recent advances in machine learning have accelerated the adoption of AI-assisted decision-making and generative workflows across multiple domains, including scientific modeling, engineering optimization, predictive analytics, and automated content generation [7,8]. These developments demonstrate the growing capability of AI systems to support complex transformation and reconstruction tasks under domain-specific constraints.

Within web accessibility research, these advances in AI offer new opportunities for accessibility remediation, as they can interpret and generate structured markup aligned with accessibility guidelines [9,10]. However, existing applications remain fragmented and experimental. Studies report inconsistencies, hallucinations, limited generalization, and frequent WCAG violations in generated outputs [11,12,13]. Crucially, there is still no formally structured model that integrates input acquisition, prompt configuration, generative transformation, and evaluation within a coherent remediation framework.

This work is situated within an emerging line of research on AI-driven web accessibility remediation, where prior studies have explored preliminary architectural perspectives [14]. In contrast to earlier exploratory formulations, the present article introduces a formally specified conceptual model with a focused emphasis on LLMs and a born-accessible regeneration paradigm [15], in which accessibility is embedded during the reconstruction of a new version of the webpage rather than introduced afterward through incremental patch-based corrections applied to the original source code. The model defines its core constructs, component interactions, transformation logic, constraints, and feedback mechanisms within a structured design science framework.

Although the operational stages of the proposed workflow may resemble processes commonly found in LLM-based systems, the contribution of this work does not lie in defining a simple sequential pipeline. Instead, the novelty of the proposal resides in the formalization of accessibility remediation as a constrained generative architecture governed by explicit transformations, accessibility constraints, and decision-oriented evaluation mechanisms. Unlike conventional prompt-centered workflows, the proposed model specifies how acquisition, transformation, prompting, generation, and evaluation interact as formally defined operators within a reproducible and extensible remediation framework.

By formalizing accessibility remediation as a constrained generative process, the proposal moves beyond descriptive approaches toward a systematically articulated conceptual artifact grounded in software engineering principles. The main contributions of this article are:

A structured development process grounded in the Design Science Research Methodology [16], clarifying problem identification, objective definition, artifact design, demonstration, and evaluation.
A systematic literature review on LLMs and related techniques, informing the formal definition and scope of the proposed conceptual model.
A formalization of a conceptual model for web accessibility remediation using LLMs, specifying its constructs, relationships, and operational constraints.
A demonstration designed to integrate seamlessly into a visually impaired user’s workflow, featuring compatibility with screen readers and keyboard shortcuts.

The remainder of this article is structured as follows. Section 2 provides the background. Section 3 presents the methodology. Section 4 reports the literature review and defines the objectives of the solution. Section 5 details the formal design of the conceptual model, followed by its demonstration and evaluation in Section 6 and Section 7. Section 8 presents the discussion, and Section 9 concludes the article and outlines future research directions.

2. Background and Motivation

Web accessibility aims to ensure that digital interfaces can be perceived, understood, navigated, and operated by users with diverse abilities [17]. The Web Content Accessibility Guidelines (WCAG) provide a structured set of success criteria that guide accessible design and evaluation [4]. In practice, however, achieving accessibility compliance extends beyond simply detecting violations. Manual accessibility remediation is widely recognized as time-consuming, cognitively demanding, and prone to inconsistency [18]. Correcting issues such as missing semantic roles, inappropriate document structure, or inadequate alternative text often requires human judgment and contextual reasoning rather than mechanical rule application [19,20].

Automated evaluation tools—such as static analyzers that inspect HTML code—can efficiently identify syntactic violations. Nevertheless, they offer limited insight into semantic accessibility and user experience considerations [13]. These tools typically flag detectable rule breaches but cannot reliably determine whether alternative text is meaningful, whether heading structures convey logical hierarchy, or whether interactive components behave coherently with assistive technologies [21]. Consequently, remediation remains heavily dependent on expert interpretation and manual intervention.

Artificial intelligence seeks to construct systems capable of performing tasks traditionally requiring human reasoning [22]. Within this domain, LLMs have demonstrated strong capabilities in understanding structured documents, transforming markup languages, and generating semantically enriched content [9,23]. This has led researchers to explore their application in accessibility remediation [19]. LLMs can suggest structural corrections, generate alternative text, and reorganize markup into more semantically coherent forms [24]. Prompt engineering techniques further enable structured guidance during generation processes [25]. However, empirical studies consistently report important limitations. Generated outputs frequently contain accessibility violations, incomplete corrections, hallucinated attributes, or non-compliant structures [12]. Several studies emphasize that large language models cannot yet produce fully compliant code without human supervision [9,11,13,26]. Moreover, results are often context-dependent and lack generalizable reliability [10].

These findings suggest that the challenge is not merely whether LLMs can assist remediation, but how they should be systematically integrated into a structured and reproducible process. Therefore, the goal of this research is to design and formally articulate a conceptual model that structures born-accessible web accessibility remediation using LLMs.

Current accessibility remediation practices present a dual limitation: manual approaches require expert knowledge and substantial effort, and automated and generative approaches lack structured integration and formal guidance. While techniques such as web scraping, prompt configuration, generative inference, and automated evaluation exist independently, there is no formally articulated conceptual structure that defines their relationships, constraints, and feedback mechanisms within a unified remediation workflow [14]. As a result, developers frequently rely on ad hoc experimentation when applying LLMs to accessibility tasks. This absence of formalization produces inconsistent outcomes, limited reproducibility, and difficulties in comparing alternative configurations.

Problem Statement

There is no formally defined, structured, and reproducible conceptual model that systematically guides the transformation of non-accessible web content into regenerated born-accessible alternatives using LLMs while ensuring alignment with established accessibility standards.

3. Research Methodology

This study adopts the Design Science Research Methodology (DSRM) as its guiding research framework. DSRM is specifically intended for the systematic creation and evaluation of artifacts designed to solve identified problems in information systems research. According to [16], the process consists of six nominal activities (problem identification, objectives of a solution, design and development, demonstration, evaluation, and communication) that form a logical progression for artifact-oriented research, although they may be revisited iteratively depending on research dynamics.

Figure 1 illustrates the adapted DSRM applied in this study. The figure preserves the nominal six-stage structure while mapping each activity to the corresponding sections of this article. Unlike purely empirical research designs, DSRM emphasizes artifact construction and refinement through iterative interaction between problem context, theoretical grounding, and evaluation cycles.

By grounding the development of the conceptual model in DSRM, this study ensures that the artifact is not only descriptively presented but systematically derived, demonstrated, and evaluated. This methodological alignment strengthens the theoretical contribution of the work by positioning the model as a formally articulated design artifact rather than an isolated procedural proposal.

Iterations of the Conceptual Model

The presented model was developed through an iterative research process conducted across multiple complementary studies. An initial conceptual abstraction emerged from a broad review of AI-assisted accessibility remediation approaches and LLM-related workflows [22]. This preliminary structure was progressively refined through empirical investigations involving blind-user interaction analysis, accessibility barrier identification, and evaluation of remediation workflows using automated tools, expert inspection, and assistive technology testing [19,20].

The resulting findings contributed to the formulation of a broader theoretical perspective on AI-assisted accessibility remediation processes [27]. Building upon these prior iterations, the present study conducted a focused systematic literature review specifically centered on LLM-based web accessibility remediation, enabling the refinement, delimitation, and formal specification of the ATPGE model presented in this article.

Therefore, the quality assessment of the proposal was conducted at two complementary levels: (1) component-level evaluation, examining acquisition, prompting, generation, and evaluation mechanisms independently through empirical experimentation and accessibility assessment; and (2) architectural-level assessment, evaluating the coherence, reproducibility, and decision-oriented behavior of the complete proposed remediation artifact.

4. Systematic Literature Review

To inform the formalization of the proposed conceptual model, a systematic literature review was conducted. The objective of this review is not merely descriptive but analytical: to identify existing techniques, architectural patterns, methodological gaps, and unresolved challenges at the intersection of web accessibility remediation and artificial intelligence. The findings of this review directly support the definition of solution objectives in the Design Science Research process.

4.1. Review Methodology

Given that research on LLMs for accessibility has reached sufficient maturity, a systematic literature review (SLR) was selected as the most appropriate secondary study design. The review process followed the main activities of an SLR in software engineering [28]: (1) planning, (2) conducting, and (3) reporting. These stages were complemented by adherence to PRISMA reporting standards, incorporating artifacts such as the PRISMA flow diagram and checklist to enhance methodological transparency [29]. The complete workflow is illustrated in Figure 2.

4.2. Identification of the Need for the Review

Before conducting the full review, a preliminary search for existing secondary studies was performed to determine whether a dedicated synthesis already addressed the integration of LLMs, associated techniques, and web accessibility remediation. Searches were performed in Scopus and Web of Science, two multidisciplinary databases widely recognized for comprehensive coverage in computing and information systems research [35]. The following search string was employed: (“web accessibility” OR “digital accessibility”) AND (“artificial intelligence” OR “AI” OR “scraping” OR “large language models” OR “LLM” OR “prompt engineering”).
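The boolean structure of this search string can be made explicit with a short sketch. The two term groups are taken from the string above; the helper names `or_group` and `build_query` are illustrative assumptions, and the exact quoting or field syntax required by Scopus or Web of Science may differ from this plain form.

```python
def or_group(terms):
    """Quote each term and join the group with OR, wrapped in parentheses."""
    return "(" + " OR ".join(f'"{term}"' for term in terms) + ")"


def build_query(*groups):
    """Join several OR-groups with AND into a single boolean search string."""
    return " AND ".join(or_group(group) for group in groups)


# Term groups copied from the preliminary search string reported in the text.
accessibility_terms = ["web accessibility", "digital accessibility"]
ai_terms = [
    "artificial intelligence",
    "AI",
    "scraping",
    "large language models",
    "LLM",
    "prompt engineering",
]

query = build_query(accessibility_terms, ai_terms)
print(query)
```

Keeping the term groups as lists makes it straightforward to add or drop a synonym and regenerate the string when candidate searches are compared.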

As shown in Table 1, the search identified several recent review studies addressing related domains. These studies explore digital accessibility, assistive technologies, cognitive accessibility, and general applications of artificial intelligence. However, none provide a focused synthesis of workflows that integrate content extraction, generative modeling, and accessibility evaluation within a unified remediation framework.

Identified Research Gap

Although prior reviews discuss accessibility tools, AI-based classification systems, or assistive technologies independently, they do not systematically examine the procedural integration of web content extraction techniques, prompt-based generative workflows using large language models, and evaluation mechanisms aligned with accessibility standards. No study was found that explicitly maps how these components can be organized into a reproducible and structured remediation process.

4.3. Research Questions

To guide the review, the Population–Concept–Context (PCC) framework was adopted [30]. The PCC structure supports exploratory questions and facilitates the systematic construction of search strings. The guiding question of the review is formulated as follows: In web content (P), how are LLM-related strategies (C) applied within the context of accessibility remediation (C)? From this overarching question, the PCC components were derived and are summarized in Table 2. Based on these PCC elements, four research questions were formulated to guide the review:

RQ1: What techniques are employed to extract and preprocess web content?

RQ2: What types of LLMs are applied in web-related workflows?

RQ3: How is prompt engineering used to guide LLMs in producing accessible content?

RQ4: What evaluation strategies are employed to assess the accessibility of outputs?

These research questions enable systematic mapping of current practices and directly inform the definition of components for the proposed conceptual model.

4.4. Identification and Screening

The search strategy followed the structured procedure described by Zhang et al. [31], which combines manual and automated search techniques to ensure comprehensive coverage while maintaining study quality. The process comprised five primary steps, with a sixth complementary step incorporated to retrieve relevant studies that may have been missed during the initial searches.

4.4.1. Step 1: Identification of Digital Libraries

Digital libraries were selected based on the research questions and established guidelines for software engineering (SE) reviews [30,39]. In line with recommendations in [28], at least two multidisciplinary indexing databases were included: Scopus and Web of Science. These platforms aggregate peer-reviewed content from major publishers, including SpringerLink, ScienceDirect, IEEE Xplore, and the ACM Digital Library. Additionally, preprints from arXiv were included through Scopus indexing. Table 3 summarizes the selected sources.

4.4.2. Step 2: Establishing the Quasi-Gold Standard (QGS)

A set of 17 relevant studies was identified through manual search and defined as the Quasi-Gold Standard (QGS). This reference set was used to evaluate the coverage and effectiveness of candidate automated search strings. To ensure transparency and reproducibility, the complete QGS dataset is publicly available in the Mendeley Data repository [40].

4.4.3. Step 3: Definition of Search Strings

Search terms were derived from the PCC framework and iteratively refined using keywords extracted from the QGS studies. These terms were then used to construct candidate search strings (CS) by combining key terms and their abbreviations to balance coverage and specificity.

4.4.4. Step 4: Conducting the Automated Search and Screening

Inclusion and exclusion criteria were defined prior to the automated search to ensure alignment with the review objectives and to preserve the intended scope of the study (see Table 4). Given that research on LLMs for accessibility remediation significantly expanded after late 2023 [22], the temporal selection window was restricted to studies published from 2024 onward.

The screening process was conducted in sequential stages ordered by increasing evaluation effort, as summarized in Table 5. This staged filtering approach minimized bias and reduced unnecessary full-text assessments by progressively narrowing the candidate pool.

4.4.5. Step 5: Evaluation of Search Performance

To evaluate the performance of the candidate search strings, the retrieved results were compared against the QGS. Two standard information retrieval metrics were applied [31]: sensitivity (recall), which measures the proportion of relevant studies that were retrieved, and precision (effort), which indicates the proportion of relevant studies among all retrieved studies. The metrics are defined as follows:

\text{Sensitivity (recall)} = \frac{N_{rel}}{N_{rel}^{tot}} = \frac{17}{17} = 1.00 \qquad (1)

\text{Precision (effort)} = \frac{N_{rel}}{N_{retr}} = \frac{17}{78} = 0.2179 \qquad (2)

Among the candidate strings, CS02 achieved full sensitivity (100%), ensuring that no QGS study was missed, while maintaining good precision (21.79%). Consequently, CS02 was selected as the final search string for the automated retrieval phase (see Table 6).
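The two metrics can be checked with a few lines of arithmetic. The sketch below uses the counts reported for CS02 (17 QGS studies, all 17 retrieved, 78 records retrieved in total); the function and variable names are illustrative assumptions.

```python
def sensitivity(relevant_retrieved, relevant_total):
    """Share of the known relevant studies (QGS) that the search retrieved."""
    return relevant_retrieved / relevant_total


def precision(relevant_retrieved, retrieved_total):
    """Share of all retrieved records that are relevant."""
    return relevant_retrieved / retrieved_total


qgs_size = 17       # studies in the Quasi-Gold Standard
qgs_retrieved = 17  # QGS studies found by the candidate string
retrieved = 78      # total records returned by the candidate string

s = sensitivity(qgs_retrieved, qgs_size)
p = precision(qgs_retrieved, retrieved)
print(f"sensitivity = {s:.2f}, precision = {p:.4f}")
# prints: sensitivity = 1.00, precision = 0.2179
```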

4.4.6. Step 6: Retrieval of Additional Studies

To mitigate the risk of missing relevant publications, backward and forward snowballing were conducted following the guidelines proposed in [32]. This complementary process identified two additional relevant studies not captured during the initial database search, thereby strengthening coverage and completeness.

4.5. Eligibility of Studies

The quality assessment (QA) aimed to determine the extent to which the results of the included empirical studies were methodologically sound and free from major sources of bias [28]. In accordance with established software engineering review practices [33], and consistent with recent accessibility-focused secondary studies [22,36], the methodological rigor of each study was evaluated using a structured checklist. Table 7 presents the defined quality assessment criteria.

Each primary criterion (QA1–QA4) was assigned a binary score of 1 (Yes) or 0 (No). An additional score (QA5) was incorporated to reflect publication quality, awarding weighted points based on journal ranking: +1 for Q1 venues, +0.75 for Q2, +0.50 for Q3, and +0.25 for Q4. The overall quality score for each study was calculated as:

\text{Score} = \sum (QA_{1} + QA_{2} + QA_{3} + QA_{4} + QA_{extra}) \qquad (3)

To ensure comparability across studies, min–max normalization [36] was applied to the raw quality assessment scores. The normalization procedure rescales all values into the interval [0, 1], where higher values indicate stronger methodological quality. The normalization formula is defined as follows:

\text{Normalization} = \frac{\text{Score} - \min(\text{Score})}{\max(\text{Score}) - \min(\text{Score})} \qquad (4)

In this study, the minimum possible raw score was 0.00, while the maximum possible score was 6.00, corresponding to studies satisfying all quality assessment criteria and receiving the maximum venue-ranking score. For example, a study obtaining a raw score of 4.00 would produce the following normalized value:

\text{Normalization} = \frac{4.00 - 0.00}{6.00 - 0.00} = 0.67 \qquad (5)

A minimum normalized threshold of 0.60 was established as the eligibility cutoff. Consequently, all studies retained in the final corpus achieved normalized scores equal to or greater than this threshold. The complete scoring matrix, including raw and normalized values, is available in the Mendeley Data repository [40].
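The scoring and cutoff procedure can be sketched as follows. The venue weights, the score bounds (0.00 and 6.00), and the 0.60 threshold are taken from the text; the function names, the treatment of unranked venues as 0 points, and the example study are assumptions made for illustration only.

```python
# Venue-ranking points as reported in the text.
VENUE_POINTS = {"Q1": 1.00, "Q2": 0.75, "Q3": 0.50, "Q4": 0.25}


def raw_score(qa_answers, quartile=None):
    """Sum the binary checklist answers and add the venue-ranking points.

    Assumption: a venue without a quartile contributes 0 points.
    """
    return sum(1 for answer in qa_answers if answer) + VENUE_POINTS.get(quartile, 0.0)


def normalize(score, lowest=0.0, highest=6.0):
    """Min-max normalization using the bounds stated in the text."""
    return (score - lowest) / (highest - lowest)


def eligible(score, threshold=0.60):
    """A study is retained when its normalized score reaches the cutoff."""
    return normalize(score) >= threshold


# Worked example from the text: a raw score of 4.00.
print(round(normalize(4.0), 2), eligible(4.0))  # prints: 0.67 True

# Hypothetical study: three checklist items satisfied, published in a Q2 venue.
example = raw_score([True, True, True, False], "Q2")
print(example, normalize(example), eligible(example))  # prints: 3.75 0.625 True
```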

4.6. Included Studies

Following the automated search and staged screening process, 24 studies were retained. An additional relevant study was identified through forward snowballing [32]. After applying the quality assessment threshold, a total of 25 primary studies were included in the final analysis. The complete list of primary studies included in the final corpus is available in the Mendeley Data repository [40]. The PRISMA flow diagram shown in Figure 3 summarizes the identification, screening, eligibility, and inclusion phases of the review process.

4.7. Data Extraction and Analysis

Each study was coded using a standardized extraction template aligned with the PCC framework and the defined research questions, ensuring traceability between extracted data and review objectives. For example, extraction process and input format informed RQ1; LLM type and purpose addressed RQ2–RQ3; and evaluation type supported RQ4.

To structure the analysis, an inductive thematic synthesis approach was adopted following the recommendations of [34]. The process was conducted manually by the authors and involved iterative immersion in the extracted data, open coding of relevant segments, constant comparison across studies, and progressive abstraction into higher-order analytical themes.

Initially, each study was examined to identify recurrent concepts related to extraction mechanisms, prompt engineering strategies, generative processes, and accessibility evaluation approaches. These codes were iteratively grouped into descriptive categories and subsequently consolidated into broader analytical dimensions aligned with the research questions. The final thematic structure resulted in four interconnected dimensions: Scraping, Prompting, Generation, and Evaluation. The complete extraction schema and coded dataset are available in the Mendeley Data repository [40].

4.8. Review Findings

This section presents the results of the systematic review and synthesizes the evidence that informs the formalization of the proposed model.

4.8.1. Descriptive Analysis of Included Studies

As shown in Figure 4, the final corpus consists of studies published between 2024 and 2026, reflecting the rapid expansion of research on LLMs in multidisciplinary contexts. Of the included studies, 8 were published in journals, 12 in conference proceedings, 4 as preprints, and one in a book chapter.

Figure 5 shows that all publications are indexed in Scopus, and that 7 (28%) are also indexed in Web of Science.

Regarding journal quality, Figure 6 indicates that 20% of the journal studies appeared in Q1 venues, 60% in Q2, 13% in Q3, and 7% in Q4, demonstrating a strong concentration of high-impact publications.

In terms of source distribution, Figure 7 shows that the majority of studies were retrieved from ACM (8), followed by SpringerLink (6) and IEEE Xplore (5).

4.8.2. Thematic Analysis

The included studies were classified according to four analytical dimensions derived from the research questions: Scraping, Prompting, Generation, and Evaluation. These categories reflect the core technical components involved in LLM-based web workflows.

The thematic categorization of the included studies in Table 8 reveals distinct structural tendencies. While several studies address multiple dimensions, generation and evaluation are the most prominent, followed by prompting and scraping mechanisms. This distribution indicates that current research predominantly focuses on LLM content generation and evaluation—often substituting traditional tools—while comparatively less attention is given to systematic prompt design. The lower presence of scraping components suggests a methodological gap in retrieval-oriented accessibility workflows.

As shown in Table 9, the most frequent co-occurrence pattern is Prompting + Generation + Evaluation, indicating that current research emphasizes output production and validation rather than full pipeline integration. Generation + Evaluation also appears recurrently, reinforcing the centrality of performance-oriented assessment. Only one study addresses all four dimensions simultaneously, so such comprehensive integration remains exceptional. The limited presence of combinations involving scraping suggests that input acquisition is often treated as a peripheral or procedural step.

Figure 8 presents the thematic map derived from the inductive synthesis as a conceptual representation of the relationships identified across the reviewed studies. Its purpose is to visually summarize how the literature organizes and combines the principal architectural and methodological elements involved in LLM-based accessibility regeneration workflows. The resulting thematic structure also guides the analysis presented in the following sections according to the defined research questions.

4.8.3. RQ1: Techniques for Extracting and Preprocessing Web Content

The analysis confirms that content acquisition remains a foundational yet weakly theorized layer in LLM-based accessibility workflows. Extraction is widely implemented but seldom conceptualized as an architectural construct; instead, it is treated as a technical prerequisite.

Extraction Techniques

HTML scraping continues to dominate acquisition practices. Lightweight parsers such as Beautiful Soup are most frequently reported, while Selenium and Playwright are employed for dynamic, JavaScript-rendered content [19,25]. This distinction reflects two implicit assumptions: static DOM retrieval versus client-side rendered state capture.

However, few studies explicitly evaluate structural fidelity or semantic loss during scraping. Even in more structured remediation pipelines [26,41], extraction quality is assumed rather than measured, despite its downstream influence on LLM behavior.

Preprocessing

Table 10 shows that HTML remains the dominant preprocessing representation, often accompanied by embedded CSS and JavaScript. Only a minority of studies convert HTML into plain text, Markdown, or JSON-based formats. Multimodal inputs appear primarily in image-captioning contexts [42,43], rather than in full-page remediation workflows. Schema-guided or template-aware structuring is rarely formalized.

Overall, preprocessing decisions prioritize structural compatibility and direct markup handling rather than semantic abstraction. Retrieval augmentation and schema-constrained transformation appear only in isolated cases and are not yet integrated as formal architectural layers.
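To illustrate what a JSON-based intermediate representation might look like, the sketch below reduces a page to a flat list of headings and images. It is not taken from any reviewed study: it uses Python's standard `html.parser` module as a stand-in for the scraping libraries named above, and the class name `OutlineExtractor`, the node schema, and the sample markup are all assumptions.

```python
import json
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class OutlineExtractor(HTMLParser):
    """Collect headings and images into a JSON-serializable list of nodes."""

    def __init__(self):
        super().__init__()
        self.nodes = []
        self._heading = None  # heading node currently receiving text, if any

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS:
            self._heading = {"type": "heading", "level": int(tag[1]), "text": ""}
            self.nodes.append(self._heading)
        elif tag == "img":
            attributes = dict(attrs)
            self.nodes.append(
                {"type": "image", "src": attributes.get("src"), "alt": attributes.get("alt")}
            )

    def handle_data(self, data):
        if self._heading is not None:
            self._heading["text"] += data

    def handle_endtag(self, tag):
        if self._heading is not None and tag in HEADINGS:
            self._heading["text"] = self._heading["text"].strip()
            self._heading = None


# Hypothetical input page fragment.
html = (
    "<h1>Store</h1><img src='logo.png'>"
    "<h2> Offers </h2><img src='sale.png' alt='Sale banner'>"
)

parser = OutlineExtractor()
parser.feed(html)
parser.close()

nodes = parser.nodes
missing_alt = [n["src"] for n in nodes if n["type"] == "image" and not n["alt"]]

print(json.dumps(nodes, indent=2))
print("images without alt text:", missing_alt)
```

A representation of this kind discards styling and scripts, which is the semantic abstraction that the reviewed pipelines mostly skip. It also makes structural gaps, such as the image without alternative text, visible before any prompt is built.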

Implications for Conceptual Model Formalization

The literature reveals procedural extraction and utility-oriented transformation, but lacks explicit modeling of acquisition fidelity, structural preservation criteria, or compatibility constraints. This fragmentation motivates the formal separation of input acquisition and intermediate transformation in the proposed model.

4.8.4. RQ2: Types of LLMs That Are Applied in Web-Related Workflows

Figure 9 confirms a strong predominance of general-purpose proprietary LLMs, particularly GPT-family models, followed by Gemini variants, including Gemma.

Figure 10 shows that within the GPT family, GPT-4o is the most frequently reported model, consistent with the timeframe of the reviewed studies. Open-source models appear less frequently and are typically employed in experimental contexts.

Role of LLMs

Most applications employ LLMs for generation (61%), specifically for code regeneration, alt-text generation, and structural rewriting [19,25,41], followed by evaluation-oriented uses (36%) involving structured assessment frameworks or heuristic methods [9,44]. Classification and prediction tasks remain comparatively marginal [45].

Model Configuration and Specialization

Architectural justification for model selection is rare. Choices are generally pragmatic—based on API access or benchmark reputation—rather than aligned with task-specific reasoning capabilities. Fine-tuning or Retrieval-Augmented Generation (RAG) is scarcely implemented [46]. Specialization is usually limited to prompt refinement rather than knowledge injection.

Context Window and Token Considerations

Token consumption and context length constraints are acknowledged [19], yet seldom modeled as design variables. Differences in token usage across models do not consistently correlate with accessibility outcomes, suggesting that architectural structuring influences performance more than model size alone.

Implications for Conceptual Model Formalization

Across studies, the LLM functions as a flexible transformation engine rather than a bounded architectural component. The evidence supports treating generation as a probabilistic operator constrained by structural, accessibility, and evaluation rules, rather than as unconstrained synthesis.

4.8.5. RQ3: Prompt Engineering Guiding LLMs in Producing Accessible Web Content

Prompt engineering emerges as a central yet inconsistently formalized mechanism in LLM-based accessibility workflows. While LLMs provide strong generative capacity, output quality depends heavily on instruction framing, structural constraints, and contextual grounding. Across the corpus, prompting ranges from simple zero-shot directives [9] to approaches that combine reasoning steps with interactive responses (ReAct) [25] or apply structured template-guided remediation [19].

Distribution of Prompting Strategies

Table 11 shows that approximately 86% of the reviewed studies explicitly report a prompting strategy. Zero-shot prompting remains dominant, normally used in conjunction with role instruction prompting and chaining approaches. Advanced reasoning strategies (e.g., chain-of-thought or chain-of-verification) appear in a smaller subset of works.

Instruction Framing and Constraint Embedding

The most common configuration consists of high-level directives such as “fix accessibility issues” or “ensure WCAG compliance.” However, these instructions are frequently under-specified. More formalized approaches [19,25,41,47] embed explicit WCAG references, structural constraints, ARIA rules, or output-format policies within the prompt. This reduces output ambiguity and narrows the generative search space. Despite this, few studies analyze instruction granularity as a design variable.

Template-Guided and Constrained Regeneration

Template injection represents a more controlled prompting paradigm. Rather than requesting free correction, the model receives a predefined accessible scaffold or structural schema. Empirical evidence [19] suggests that template guidance could improve structural stability and reduce drift. Nevertheless, explicit template-based prompting remains limited in the broader literature.

Prompting in Multimodal Contexts

In image accessibility contexts [43], prompting directly affects descriptive completeness and usefulness. Variations in verbosity, specificity, and contextual framing influence perceived usability, demonstrating that prompt design functions as a content calibration mechanism rather than a mere instruction.

Hallucination and Structural Drift

Across the literature, two paradigms emerge: (1) free regeneration based on general accessibility instructions, and (2) constrained regeneration embedding structural and normative rules. Evidence suggests that constrained prompting yields more stable accessibility outcomes and reduces structural hallucination. Unstructured prompts increase risks of fabricated elements or unintended semantic modifications.

Implications for Conceptual Model Formalization

These findings justify modeling prompting as a distinct architectural layer. Rather than treating it as ad hoc instruction tuning, prompting must be formalized as a constraint-definition mechanism governing structural enforcement, accessibility compliance, verbosity calibration, and hallucination mitigation within the remediation pipeline.

4.8.6. RQ4: Evaluation Strategies Employed to Assess the Accessibility of Outputs

Evaluation is no longer purely tool-based but is increasingly structured around multi-layer validation schemes. However, fragmentation persists. Three dominant approaches were identified: automated (10 studies), hybrid (6 studies), and manual, including expert and end-user validation (6 studies).

Automated Evaluation

As shown in Figure 11, automated evaluation remains the most frequent approach. Tools such as Axe, WAVE and Lighthouse are widely used to quantify accessibility improvement through error counts, score differentials, or composite metrics [19,25,41].

Figure 12 shows that some recent studies are beginning to explore the use of LLMs as evaluators of semantic aspects. Research focusing on image accessibility [43] demonstrates that lexical similarity metrics are insufficient to assess descriptive usefulness or contextual adequacy. Automated evaluation therefore remains prevalent and primarily syntactic, although early forms of semantic assessment are emerging.

Hybrid Evaluation

Slightly more than half of the studies that initially rely on automated metrics combine them with manual assessment. Study [11] evaluates the accessibility of the generated webpage with both manual and tool-based methods, employing GPT-4o together with the WAVE and Axe tools. Similarly, study [19] combines automated scoring with manual semantic inspection and screen-reader walkthroughs. These approaches distinguish macro-level scoring from micro-level structural integrity. Thematic analysis suggests that LLM-generated outputs still require verification, and hybrid strategies appear more reliable than purely automated pipelines.

Manual Expert-Based Evaluation

Manual evaluation typically involves WCAG-based checklists, structured questionnaires, or qualitative scoring [43,47,48]. Only one study reports a formal Cognitive Walkthrough protocol [19]. The average number of participants across user studies is approximately ten, ranging from one to thirty-nine, often involving authors or academic participants. Although resource-intensive, such evaluations provide stronger evidence of usability.

User-Centered and Assistive Technology Evaluation

Direct user validation remains limited but methodologically richer. A small subset of studies (2) incorporates blind-user walkthroughs, NVDA testing, or task-based evaluation [19,43]. These studies frequently apply preliminary filtering thresholds before advancing candidates to user testing, reducing manual evaluation effort. Despite its critical role in ecological validity, user-based evaluation is rarely embedded into iterative feedback loops.

Decision-Oriented and Threshold-Based Mechanisms

Findings show clearer evidence of decision modeling than previously documented. Study [19] applies quantitative thresholds (e.g., minimum Lighthouse and WAVE criteria) before proceeding to expert validation. Study [48] formalizes checkpoint classification (evaluated vs. not-evaluated) to guide remediation routing. Study [41] employs multi-dimensional assessment policies governing output acceptance and [25] integrates iterative remediation guided by tool feedback loops. These approaches treat evaluation as a control mechanism rather than as descriptive reporting. Nevertheless, explicit mathematical formalization of acceptance functions remains rare, and most studies do not define reproducible threshold criteria.

Implications for Conceptual Model Formalization

The revised evidence strongly supports modeling evaluation as a decision-governed architectural component. Rather than functioning as a terminal reporting step, evaluation must be formalized as a multi-layer operator combining automated conformance metrics, structural inspection, and optional user validation, governed by explicit acceptance rules. This shift transforms remediation from metric comparison into a structured control process.

4.9. Objective of the Proposed Solution

The literature review reveals fragmentation across four dimensions: input acquisition, LLM configuration, prompting strategies, and evaluation mechanisms. Although widely implemented, these elements are seldom formalized as interrelated architectural constructs. In particular, transformation logic, structural constraints, and decision-based evaluation remain weakly articulated.

In response, this research aims to formally define a conceptual model for born-accessible web remediation using LLMs. The proposed artifact seeks to articulate explicit constructs and relationships within the remediation pipeline, specify transformation logic across its stages, embed structural and normative constraints aligned with accessibility standards, formalize threshold-based and rule-based evaluation mechanisms, and support reproducible implementation and systematic experimentation across LLM configurations.

5. Formal Specification for the Conceptual Model

Following the third activity of the Design Science Research Methodology, this section presents the formal specification of the proposed conceptual model named ATPGE (Acquisition, Transformation, Prompting, Generative, Evaluation), composed of five interconnected components, each one contributing to the full transformation pipeline, through the following roles:

Input acquisition (A), which retrieves structured HTML from a target webpage while preserving semantic structure.
Intermediate transformation (T), which converts HTML into an LLM-compatible representation (e.g., text, Markdown, JSON).
Prompt configuration (P), which constructs structured prompts including accessibility rules and formatting constraints.
Generative inference (G), which generates candidate accessible pages using selected LLMs.
Output evaluation (E), which evaluates accessibility metrics and determines acceptance or refinement.

To avoid ambiguity, the model uses standard functional notation from set theory [49] to describe transformations between representations. In this theory, a set is a collection of things, and the elements belonging to the collection are written between curly brackets. Thus, if W denotes the set of web pages, its elements are the pages w_{1}, w_{2}, \dots, w_{n}, formally:

W = \{w_{1}, w_{2}, w_{3}, \dots, w_{n}\}

(6)

Another fundamental concept is the notion of a function, which can be understood as a rule that assigns to each object in a set (the domain) a unique object in another set (the range). In other words, the domain denotes the set of possible inputs, while the range denotes the set of possible outputs. This formalism is used in Section 5.1 to define the components that compose the ATPGE model.

To formally define the relationships of the proposed conceptual model, we establish it as a composition of functions, meaning that one function is applied after another. Thus, ATPGE is a pipeline that transforms a page w \in W through sequential stages of acquisition, transformation, prompt construction, generative inference, and evaluation. The resulting output can be expressed as:

y = F (w) = E (G (P (T (A (w)))))

(7)

where y represents an accessibility-improved version of the original content w. Formally, the ATPGE model is defined as the functional composition:

F = E \circ G \circ P \circ T \circ A

(8)

where each component of the model is defined as a mapping between sets. Thus, the model specifies the internal logic of LLM-driven accessibility remediation under a born-accessible paradigm [15]. Instead of correcting isolated WCAG violations within the original source code, the model treats remediation as a constrained regeneration process in which a new accessible version is reconstructed from the semantic content of the original webpage. Accessibility properties are therefore embedded during generation rather than applied post hoc as localized fixes.
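As an illustration only, the composition in Equation (8) can be sketched in Python with placeholder stages. Every stage body, the `compose` helper, and the score value are assumptions introduced here; the model defines the stages abstractly. Because P and G also take a constraint configuration and a model, and G returns a set of candidates, the sketch fixes those parameters in advance and produces a single candidate so that the chain composes directly.

```python
from functools import reduce
from typing import Callable


def compose(*stages: Callable) -> Callable:
    """Compose stages right to left: compose(E, G, P, T, A)(w) == E(G(P(T(A(w)))))."""
    return reduce(lambda f, g: lambda x: f(g(x)), stages)


# Placeholder stages (hypothetical bodies; only the order of application matters).
def A(w: str) -> str:
    """Acquisition: page identifier -> raw HTML."""
    return f"<html><body><h1>{w}</h1></body></html>"


def T(h: str) -> str:
    """Transformation: raw HTML -> LLM-ready Markdown-like text."""
    return h.replace("<html><body><h1>", "# ").replace("</h1></body></html>", "")


def P(r: str) -> str:
    """Prompt configuration, with the constraint set fixed in advance."""
    return f"Regenerate as accessible HTML:\n{r}"


def G(prompt: str) -> str:
    """Generative inference, stubbed; a real LLM call would go here."""
    title = prompt.splitlines()[-1].lstrip("# ")
    return f"<main><h1>{title}</h1></main>"


def E(o: str) -> tuple:
    """Evaluation: output -> (score, decision); the score is a toy value."""
    score = 100 if "<main>" in o else 0
    return score, ("accept" if score >= 96 else "refine")


F = compose(E, G, P, T, A)  # F = E ∘ G ∘ P ∘ T ∘ A
print(F("Demo page"))
```

Keeping each stage a plain function makes it easy to replace one of them (for example, a different transformation) without touching the rest of the chain.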

5.1. Core Components

This section defines each component of the ATPGE model. Each function in the pipeline is written in the form:

f : X \to Y

(9)

where f denotes a function that maps each element of set X to exactly one element of set Y. Formally, a function is a set of ordered pairs (x, y) drawn from the Cartesian product X \times Y in which every x \in X appears in exactly one pair, where the Cartesian product is defined as:

X \times Y = \{(x, y) \mid x \in X \land y \in Y\}

(10)

5.1.1. Input Acquisition (A)

This component extracts structured web content from a target webpage, typically via the HTML scraping techniques identified in RQ1. This function is represented by:

A : W \to H

(11)

where the set of web pages is defined as W = \{w_{1}, w_{2}, w_{3}, \dots, w_{n}\} and the set of raw HTML representations as H = \{h_{1}, h_{2}, h_{3}, \dots, h_{n}\}. The objective is to preserve structural hierarchy while defining appropriate content selection criteria.

5.1.2. Intermediate Transformation (T)

This component converts raw HTML into a representation optimized for LLM ingestion. The model treats this as a design decision rather than a simple preprocessing step:

T : H \to R

(12)

where R represents the set of LLM-ready representations, and each r \in R corresponds to a representation optimized for language model processing (e.g., Markdown, plain text, or JSON). Transformation policies include structural preservation, semantic abstraction, noise reduction, and token-length optimization.
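A minimal sketch of one possible transformation T, assuming Markdown as the target representation and using only Python's standard `html.parser`. The class name `MarkdownSketch`, the `BLOCKS` table, and the handled tag subset are assumptions made for illustration. Nested block elements and links are not handled, so this is not a substitute for a dedicated converter.

```python
from html.parser import HTMLParser

# Block-level tags handled by this sketch and their Markdown prefixes.
BLOCKS = {"h1": "# ", "h2": "## ", "h3": "### ", "p": "", "li": "- "}


class MarkdownSketch(HTMLParser):
    """Convert a small HTML subset to Markdown, dropping script/style noise."""

    def __init__(self) -> None:
        super().__init__()
        self.lines = []    # finished Markdown lines
        self.block = None  # tag name of the block currently being read
        self.buffer = []   # text fragments collected for that block
        self.skip = 0      # depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip += 1
        elif tag in BLOCKS:
            self.block, self.buffer = tag, []

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.skip = max(0, self.skip - 1)
        elif tag == self.block:
            # Join fragments, then collapse runs of whitespace.
            text = " ".join("".join(self.buffer).split())
            if text:
                self.lines.append(BLOCKS[tag] + text)
            self.block = None

    def handle_data(self, data):
        if self.block and not self.skip:
            self.buffer.append(data)


def to_markdown(html: str) -> str:
    """T : H -> R for the Markdown case (illustrative only)."""
    parser = MarkdownSketch()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.lines)
```

The sketch touches two of the listed policies: heading levels and list items survive (structural preservation), while scripts, styles, and presentational wrappers are discarded (noise reduction).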

5.1.3. The Prompt Configuration (P)

This component is modeled as a function that acts as a configuration layer responsible for generating structured prompt templates rather than ad hoc instructions. This proposition is based on the findings of RQ3, which indicate that prompting strategies may vary (e.g., zero-shot, one-shot, few-shot, structured prompting). Formally:

P : R \times C \to Π

(13)

where C denotes the constraint configuration, defined as C = \{C_{w}, C_{f}, C_{s}, C_{d}\}, in which C_{w} denotes WCAG references, C_{f} formatting rules, C_{s} schema templates, and C_{d} domain-specific restrictions. The resulting prompt instance π \in Π encodes accessibility objectives, structural constraints, output formatting requirements, and regeneration policies that guide the generative inference stage.

Optionally, Retrieval-Augmented Specialization (Ψ) provides this layer with LLM specialization through mechanisms such as retrieval-augmented generation. Formally, let:

Ψ : Q \to 2^{K}

(14)

where the function Ψ maps a query q \in Q, derived from the transformed input representation, to a subset of the knowledge resource set K. Thus, 2^{K} represents the power set of K, which by definition includes the empty set ∅, representing the absence of retrieval augmentation, as well as any subset of retrieved knowledge resources (e.g., WCAG guidelines, specialization examples). Therefore, the prompt configuration can be extended with retrieval-augmented specialization as:

P : R \times C \times 2^{K} \to Π

(15)

Thus, prompt construction may additionally incorporate retrieved accessibility standards, domain-specific structural exemplars, accessibility rule fragments, and prior validated accessible patterns.
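The prompt configuration function, including its optional retrieval extension, might be sketched as follows. The `Constraints` field names, the section titles, and the wording of the instruction line are hypothetical; the model only requires that the prompt be built from a representation r, a constraint configuration C, and a (possibly empty) subset of K.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Constraints:
    """Constraint configuration C = {C_w, C_f, C_s, C_d}; field names are illustrative."""
    wcag: tuple = ()           # C_w: WCAG references
    formatting: tuple = ()     # C_f: formatting rules
    schema_template: str = ""  # C_s: schema template
    domain: tuple = ()         # C_d: domain-specific restrictions


def _bullets(title, items):
    """Render a titled bullet list as one prompt section."""
    return title + ":\n" + "\n".join(f"- {item}" for item in items)


def build_prompt(r, c, retrieved=frozenset()):
    """P : R x C x 2^K -> Π. An empty `retrieved` set means no retrieval augmentation."""
    parts = ["Regenerate the page below as accessible-by-design HTML; do not patch it."]
    if c.wcag:
        parts.append(_bullets("WCAG references", c.wcag))
    if c.formatting:
        parts.append(_bullets("Formatting rules", c.formatting))
    if c.schema_template:
        parts.append("Template to follow:\n" + c.schema_template)
    if c.domain:
        parts.append(_bullets("Domain restrictions", c.domain))
    if retrieved:
        # Sorted so that the same retrieved subset always yields the same prompt.
        parts.append(_bullets("Retrieved knowledge", sorted(retrieved)))
    parts.append("Source content:\n" + r)
    return "\n\n".join(parts)
```

Treating the empty set as the default for `retrieved` mirrors the role of ∅ in 2^{K}: the same function covers both the plain and the retrieval-augmented case.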

5.1.4. Generative Inference (G)

This component submits configured prompts to one or more large language models (LLMs) and produces candidate accessible outputs. Findings from RQ2 revealed model diversity and varying generative tasks across studies. Formally:

G : Π \times M \to 2^{O}

(16)

Given a pair (π, m), where π \in Π is a configured prompt and m \in M is a selected language model, the function generates a set of candidate outputs G(π, m) \subseteq O. These outputs correspond to alternative accessible page reconstructions produced under the structural and accessibility constraints encoded in the prompt.

Generation is therefore constrained by accessibility rules and structural requirements defined in the prompt. Modeling generation as a mapping into 2^{O} captures the nondeterministic nature of LLM inference, where multiple candidate outputs may be produced under identical prompt-model configurations. Making this nondeterminism explicit supports reproducible and systematic experimentation, and multiple models or configurations may run in parallel for comparative evaluation.
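To make the set-valued codomain concrete, the sketch below samples a model several times and keeps the distinct outputs. The `Model` interface (a callable taking a prompt and a sample index) and the `stub_model` stand-in are assumptions introduced here; no real LLM API is called.

```python
from typing import Callable

# Hypothetical model interface: (prompt, sample index) -> generated HTML.
Model = Callable[[str, int], str]


def generate(prompt: str, model: Model, n_samples: int = 3) -> frozenset:
    """G(π, m): sample the model n_samples times and return the distinct candidates."""
    return frozenset(model(prompt, i) for i in range(n_samples))


def stub_model(prompt: str, sample_index: int) -> str:
    """Stand-in for an LLM call; output varies with the sample index to mimic sampling."""
    landmark = ("main", "main", "div")[sample_index % 3]
    return f"<{landmark}><h1>{prompt}</h1></{landmark}>"


candidates = generate("Title", stub_model)
print(len(candidates))  # identical samples collapse into one set element
```

Returning a set rather than a single string means the evaluation stage can be applied to each candidate in turn, as the model prescribes.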

5.1.5. Output Evaluation Component (E)

Finally, this component measures accessibility compliance and determines whether the output should be accepted or refined. Unlike descriptive reporting found in most studies (RQ4), the model formalizes evaluation as a decision mechanism. Let us define:

E : O \to S \times D

(17)

where, given a generated output o \in O, the function returns both an accessibility score s \in S and a decision d \in D = \{\text{accept}, \text{refine}\}.

When multiple candidate outputs are generated, G(π, m) \subseteq O, the evaluation function is applied to each candidate output individually. If the resulting decision does not satisfy the acceptance criteria, the process iterates through the pipeline by refining the prompt configuration (P), adjusting the model selection (G), or modifying the intermediate transformation (T).

To make acceptance explicit, it can be expressed as the decision rule:

d(o) = \begin{cases} \text{accept}, & \text{if } s(o) \geq Ω \\ \text{refine}, & \text{otherwise} \end{cases}

(18)

where Ω represents the acceptance criterion. In practice, the score s(o) may be scalar or multi-metric. Evaluation is therefore part of the architecture’s control logic, not only a reporting step.

Thus, this mechanism yields a repeatable remediation loop where evaluation governs progression and resource-intensive validation (e.g., expert or end-user checks) can be applied only after automated acceptance.
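The remediation loop governed by Equation (18) could be sketched as below for the scalar-score case. The function names, the `max_rounds` cap, and the toy scores are assumptions; in particular, how each round refines P, G, or T is left to the hypothetical `generate_round` callable.

```python
def decide(score: float, omega: float) -> str:
    """Decision rule d(o): accept when the score reaches the criterion Ω."""
    return "accept" if score >= omega else "refine"


def remediation_loop(generate_round, score, omega, max_rounds=3):
    """Evaluate candidates round by round; stop at the first accepted output.

    generate_round(i) returns the candidates of round i (after whatever
    refinement of prompt, model, or transformation the caller applies).
    Returns (accepted_output, round_index), or (None, max_rounds) if none pass.
    """
    for round_index in range(max_rounds):
        for candidate in generate_round(round_index):
            if decide(score(candidate), omega) == "accept":
                return candidate, round_index
    return None, max_rounds


# Toy data: candidate names and scores are invented for the example.
rounds = {0: ["v0a", "v0b"], 1: ["v1a"], 2: ["v2a"]}
scores = {"v0a": 80, "v0b": 90, "v1a": 97, "v2a": 99}
print(remediation_loop(lambda i: rounds[i], scores.get, omega=96))
```

Bounding the loop with `max_rounds` is a design choice of this sketch: it keeps the iteration finite when no candidate ever satisfies the criterion.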

5.2. Model Properties

From the formal specification presented above, the proposed model exhibits several key properties:

Modularity. Each component can be independently replaced without affecting the overall structure, enabling flexible adaptation and reuse.
Composability. The model is defined as a functional transformation chain, allowing components to be systematically combined through composition.
Extensibility. Additional evaluation mechanisms or constraints can be incorporated without altering the core architecture.
Reproducibility. Given fixed inputs, parameters, and configurations, the model produces consistent and stable outcomes.
Adaptivity. The model supports iterative refinement, where evaluation outcomes guide adjustments in transformation, prompting, or model selection.
Technology-agnosticism. The model does not depend on a specific platform, provider, or LLM architecture, but instead defines the remediation process at a conceptual level. This technological neutrality enables flexible adoption across different environments and supports its application to real-world scenarios, where the focus is on solving accessibility-related problems rather than optimizing for a particular technology stack.

5.3. Conceptual ATPGE Model

The final conceptual model is organized as a layered functional architecture of five components connected through explicit transformations, as illustrated in Figure 13.

6. Demonstration

To demonstrate the operational feasibility of the proposed conceptual model, a controlled experimental application was conducted using a benchmark website intentionally designed to contain accessibility violations. The objective of this demonstration is not to optimize model performance, but to illustrate how each formally defined component interacts within a reproducible workflow.

6.1. Experimental Design

The selected benchmark was the web testing course accessibility demo provided by Deque Systems, a site intentionally containing known accessibility violations and therefore suitable for controlled remediation experiments. Its use enables reproducible comparison of remediation strategies without risks associated with modifying production systems.

Following the structure of the conceptual model, the first stage corresponds to input acquisition (A : W \to H). Three scraping techniques were compared: Beautiful Soup, Jina Reader, and Crawl4AI (see Table 12). Beautiful Soup showed competitive execution times but did not preserve the expected HTML character count (7007) and lacks native Markdown export. Both Jina Reader and Crawl4AI support structured Markdown output; Crawl4AI retrieved the highest HTML character count, while Jina Reader produced more semantically structured Markdown representations.

The intermediate transformation stage (T : H \to R) therefore used Crawl4AI for HTML extraction and Jina Reader for Markdown retrieval, prioritizing semantically rich, LLM-compatible representations over raw syntactic extraction.

The prompt configuration stage (P : R \times C \to Π) employed a structured regeneration prompt (see Table 13) embedding WCAG constraints, reflection-based verification, and optional template guidance. When no template was provided, the setup corresponded to zero-shot instruction-driven prompting with implicit chain-of-thought reasoning. When included, the template functioned as a structural exemplar, corresponding to one-shot constrained prompting.

The generative inference stage (G : Π \times M \to 2^{O}) used two large language models: GPT-5.2 and Gemini 3 Flash. Both represent state-of-the-art transformer architectures widely adopted in research and industry [19], while exhibiting different deployment characteristics (latency, token handling, and cost), enabling comparative analysis without bias toward a single ecosystem.

The demonstration followed a 2 \times 2 \times 2 factorial design across three variables: model (GPT-5.2, Gemini 3 Flash; temperature = 0.5), input representation (HTML, Markdown), and template constraint (with or without an accessibility-aware Bootstrap-based template). Table 14 summarizes the experimental configuration.
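The eight configurations implied by this factorial design can be enumerated mechanically, as in the sketch below. The dictionary keys are illustrative, and the enumeration order is an assumption: it is not meant to reproduce the V1 to V8 labeling of Table 14.

```python
from itertools import product

MODELS = ("GPT-5.2", "Gemini 3 Flash")
INPUT_REPRESENTATIONS = ("HTML", "Markdown")
TEMPLATE_OPTIONS = (False, True)  # without / with the accessibility-aware template

# One dictionary per cell of the 2 x 2 x 2 design; temperature is held constant.
configurations = [
    {"model": m, "input": r, "template": t, "temperature": 0.5}
    for m, r, t in product(MODELS, INPUT_REPRESENTATIONS, TEMPLATE_OPTIONS)
]

for config in configurations:
    print(config)
```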

Finally, the evaluation stage (E : O \to S \times D) used tools widely adopted in accessibility research [22,36]. Lighthouse provides composite heuristic scoring aligned with WCAG, Axe-core performs rule-based violation detection, and WAVE complements these with visual annotations. Together, they enable triangulated automated assessment, reducing tool-specific bias.

6.2. Operational Instantiation of the ATPGE Model

The five components of the conceptual model were operationally instantiated in sequence. HTML was extracted without manual correction to preserve ecological validity (h = A(w)). A Markdown version was subsequently generated to implement the transformation layer (r = T(h)). Prompts embedded either raw content or content combined with an accessibility-aware template enforcing semantic structure, ARIA roles, and WCAG guidance (π = P(r, c)). In all cases, models were instructed to regenerate the page as accessible-by-design rather than apply localized fixes (O = G(π, m)). Token usage and latency were recorded to assess computational feasibility.

As shown in Table 15, the original page (V0) and all regenerated variants were evaluated using Lighthouse, WAVE AIM score, and Axe-core violations. Differences across input representations and models are examined in the following section through structured evaluation criteria and threshold-based decision analysis ((s, d) = E(o)).

7. Evaluation

This phase assesses whether the conceptual model satisfies its formal objectives using explicit acceptance criteria and statistical analysis.

7.1. Iterative Threshold-Based Filtering and Acceptance Modeling

Threshold values were defined according to three criteria: WCAG alignment, severity filtering, and WAVE interpretability. However, the objective of this rule is not to establish definitive WCAG 2.2 compliance certification. Instead, the acceptance mechanism functions as an automated preselection filter within the iterative model evaluation workflow.

Given that manual accessibility inspection, screen-reader validation, and expert-based evaluation are resource-intensive processes, the proposed model first applies automated threshold-based filtering to eliminate variants presenting lower accessibility quality. Candidate outputs satisfying the automated criteria may subsequently advance to additional validation stages involving manual structural inspection and assistive technology testing.

A Lighthouse score \geq 96 was considered indicative of near-complete automated compliance; Axe issues \leq 3 ensured that only minor residual violations remained; and an AIM score \geq 9 reflected high accessibility maturity on the 1–10 scale. These thresholds are consistent with prior decision-oriented remediation strategies [19,26]. The automated filtering rule, in which L denotes the Lighthouse score, A the number of Axe issues (not the acquisition function of Section 5.1.1), and AIM the WAVE AIM score, is therefore defined as:

f(x) = \begin{cases} 1, & \text{if } L \geq 96 \land A \leq 3 \land \mathit{AIM} \geq 9 \\ 0, & \text{otherwise} \end{cases}

(19)
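Equation (19) translates directly into a predicate. In the sketch below the function and parameter names are illustrative. The first check uses the values reported for V5 in Section 7.2 (Lighthouse 96, 11 Axe violations, AIM 6.0); the second uses a profile at the bounds reported there for the strongest variants (Lighthouse 98, one Axe issue, AIM 9.9).

```python
def automated_filter(lighthouse: float, axe_issues: int, aim: float) -> int:
    """Return 1 if all three thresholds of Equation (19) hold, else 0."""
    return int(lighthouse >= 96 and axe_issues <= 3 and aim >= 9)


print(automated_filter(lighthouse=96, axe_issues=11, aim=6.0))  # V5 profile: rejected
print(automated_filter(lighthouse=98, axe_issues=1, aim=9.9))   # strong profile: passes
```

Because the rule is a conjunction, a high Lighthouse score alone cannot compensate for excess Axe violations or a low AIM score.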

Following automated filtering, accepted variants were additionally subjected to expert-based structural validation using keyboard navigation and screen-reader interaction. In this secondary evaluation stage, outputs were considered acceptable provided that no critical accessibility failures affecting semantic navigation, keyboard interaction, landmark discoverability, or assistive technology interpretation were identified during manual inspection.

Formally, the expert-centered validation stage can be represented as:

d_{m}(o) = \begin{cases} \text{accept}, & \text{if } C_{f}(o) = 0 \\ \text{refine}, & \text{otherwise} \end{cases}

(20)

where C_{f}(o) denotes the number of critical interaction or semantic accessibility failures identified during manual inspection.

This multi-stage evaluation strategy reduces the cost of expert-centered accessibility validation by restricting manual inspection to variants that first satisfy automated accessibility thresholds. Consequently, ATPGE treats accessibility evaluation as an iterative and progressively refined decision process rather than as a single metric-based acceptance step.
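The cost-saving cascade described here can be sketched as a two-stage decision in which manual inspection is requested only for variants that pass the automated thresholds. The `Variant` record, the intermediate `"needs-manual-inspection"` state, and the example values are assumptions added for illustration; the thresholds and the zero-critical-failure rule come from Equations (19) and (20).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Variant:
    name: str
    lighthouse: float
    axe_issues: int
    aim: float
    # Count of critical failures found manually; None until inspection is done.
    critical_failures: Optional[int] = None


def staged_decision(v: Variant) -> str:
    """Apply the automated filter first, then the expert-centered rule."""
    passes_automated = v.lighthouse >= 96 and v.axe_issues <= 3 and v.aim >= 9
    if not passes_automated:
        return "refine"  # rejected early; no manual effort is spent
    if v.critical_failures is None:
        return "needs-manual-inspection"
    return "accept" if v.critical_failures == 0 else "refine"


# Invented example values, not taken from the reported results.
print(staged_decision(Variant("A", 97, 5, 8.0)))
print(staged_decision(Variant("B", 99, 0, 9.9)))
print(staged_decision(Variant("B", 99, 0, 9.9, critical_failures=0)))
```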

7.2. Quantitative Accessibility Assessment

As shown in Table 15, all regenerated variants achieved Lighthouse scores between 96 and 100. However, deeper differentiation emerges when considering Axe violations and WAVE AIM scores. Variants V1, V3, V7, and V8 achieved the strongest combined performance (Lighthouse \geq 98, Axe \leq 1, AIM = 9.9, zero contrast errors), eliminating all WAVE-detected structural issues. In contrast, V5 retained 11 Axe violations and 11 contrast errors despite a Lighthouse score of 96, resulting in a lower AIM score (6.0).

Markdown-based regeneration without template constraints (V3, V7) showed high structural stability. However, the impact of template-guided regeneration varied across configurations. While V4 exhibited additional WAVE contrast-related issues, V8—also combining Markdown input with template guidance—achieved one of the strongest overall accessibility performances. These observations suggest that accessibility outcomes depend more on the interaction between model behavior, prompt configuration, and template constraints than on the representation format alone.

Under the threshold-based filtering, V1, V2, V3, V7, and V8 meet all acceptance criteria, whereas V4, V5, and V6 are rejected due to insufficient AIM scores or excessive violations.

7.3. Statistical Comparison of Model Performance

Given the non-normal distribution of the data (Shapiro–Wilk test) and the small sample size, a Mann–Whitney U test was applied. No statistically significant differences were found between models in accessibility metrics or token consumption (p > 0.05). As shown in Figure 14, GPT exhibited slightly more consistent AIM scores, whereas Gemini showed greater dispersion across configurations. A significant difference was observed only in execution time (p = 0.0286), as shown in Figure 15.

A Spearman rank correlation analysis revealed a very strong negative correlation between Axe issues and AIM score (ρ = -0.9280, p = 0.0009), a very strong positive correlation between Lighthouse and AIM (ρ = 0.8892, p = 0.0031), and no statistically significant correlation between token consumption and accessibility metrics (p > 0.5).
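For readers unfamiliar with the statistic, the sketch below computes Spearman's ρ as the Pearson correlation of average ranks, using only the Python standard library. The two data vectors are invented to show the mechanics and are not the values of Table 15; the helper names are likewise illustrative. In practice, a statistics library such as SciPy (`scipy.stats.spearmanr`, `scipy.stats.mannwhitneyu`) would also supply the p-values.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented example: more Axe issues paired with strictly lower AIM scores.
axe_issues = [0, 1, 2, 5, 11]
aim_scores = [9.9, 9.5, 9.0, 7.5, 6.0]
print(spearman(axe_issues, aim_scores))
```

With a strictly decreasing pairing such as this one, ρ reaches its minimum of -1; the reported value of -0.9280 indicates a relationship that is nearly, but not perfectly, monotonic.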

7.4. Component-Level Structural Assessment with Screen Readers

To complement automated accessibility metrics, the accepted variants were additionally subjected to a manual component-level structural assessment described in [26], which is focused on semantic integrity and assistive technology compatibility. While automated tools provide efficient conformance-oriented evaluation, they cannot fully determine whether regenerated interfaces behave coherently from the perspective of screen-reader interaction and keyboard navigation. Therefore, the model incorporates a human-centered validation stage as part of the evaluation layer.

The assessment was conducted by the two authors, both with prior experience in accessibility evaluation and assistive technology inspection workflows with blind users. The evaluation process employed keyboard-only navigation and the NVDA screen reader to manually inspect the regenerated pages under realistic desktop browsing conditions.

The inspection protocol focused on the following structural and interaction-oriented dimensions:

semantic HTML correctness,
heading hierarchy consistency,
ARIA role and landmark integrity,
form labeling behavior,
keyboard navigability,
screen-reader announcement coherence,
alternative text interpretation,
search interaction accessibility,
and overall structural consistency.

Each accepted variant was individually reviewed following a sequential inspection protocol consisting of four stages:

Structural navigation inspection. The evaluators verified whether the page exposed a coherent semantic structure through headings, landmarks, regions, and navigation elements. This stage focused on heading hierarchy progression, landmark discoverability, skip-navigation behavior, and consistency of semantic HTML elements.
Keyboard interaction assessment. The evaluators navigated the interface exclusively using keyboard interaction (Tab, Shift+Tab, Enter, arrow keys, and shortcut navigation). This stage verified focus order, keyboard reachability of interactive elements, search functionality accessibility, and absence of keyboard traps.
Screen-reader interaction validation. Using NVDA, the evaluators inspected how the regenerated content was announced and interpreted by the assistive technology. Particular attention was given to ARIA role interpretation, form label announcements, alternative text behavior, landmark identification, reading order consistency, and navigation shortcuts commonly employed by screen-reader users.
Semantic consistency review. Finally, the generated source code was manually inspected to verify whether accessibility-related structures identified during interaction corresponded correctly with the underlying semantic implementation. This included verification of ARIA relationships, label associations, heading nesting, and structural preservation.
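Some items of this protocol, such as heading hierarchy progression, lend themselves to a quick scripted pre-check before the manual walkthrough. The sketch below flags skipped heading levels using Python's standard `html.parser`. It is an assumption-laden aid (the class and function names are hypothetical) and does not replace keyboard or NVDA inspection, which assess behavior that static parsing cannot observe.

```python
from html.parser import HTMLParser


class HeadingOutline(HTMLParser):
    """Collect the levels of h1..h6 elements in document order."""

    def __init__(self) -> None:
        super().__init__()
        self.levels = []

    def handle_starttag(self, tag, attrs):
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self.levels.append(int(tag[1]))


def heading_issues(html: str) -> list:
    """Report places where the heading level increases by more than one step."""
    parser = HeadingOutline()
    parser.feed(html)
    parser.close()
    issues = []
    for previous, current in zip(parser.levels, parser.levels[1:]):
        if current > previous + 1:
            issues.append(f"heading level jumps from h{previous} to h{current}")
    return issues


print(heading_issues("<h1>Site</h1><h2>News</h2><h4>Item</h4>"))
```

Moving back up the hierarchy (for example, from h3 to h2) is not flagged, since closing a subsection is a normal outline pattern.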

During the evaluation, observations were systematically documented using a component-oriented checklist adapted from [26]. Findings were categorized according to structural correctness, interaction behavior, and semantic consistency. Variants presenting severe navigation inconsistencies, missing semantic relationships, or problematic interaction patterns were flagged even when automated accessibility scores remained high.

The results of the manual assessment, presented in Table 16, additionally served to analyze the trade-off between automated LLM-based regeneration and human supervision. Although several variants achieved high automated scores, manual inspection revealed differences in ARIA consistency, landmark interpretation, contrast behavior, and semantic navigation quality.

Under this secondary validation stage, variants V1, V3, and V8 satisfied the expert-centered acceptance criteria, whereas V2 and V7 were flagged for refinement due to contrast-related issues and semantic discoverability inconsistencies identified during manual inspection. These observations reinforce findings from prior literature indicating that LLM-generated outputs may still require expert supervision despite strong conformance-oriented metrics [19].

This evaluation stage was intentionally designed as a hybrid validation mechanism combining automated accessibility metrics with expert-based assistive technology inspection. The objective was not only to verify WCAG-oriented conformance scores, but also to determine whether the regenerated outputs preserved usable accessibility characteristics under realistic interaction conditions.

8. Discussion

Recent research has explored the application of LLMs to web accessibility remediation, yet most contributions remain either tool-centric, patch-oriented, or empirically descriptive. Studies such as [11,12] show that LLMs can generate accessible code, yet frequently introduce WCAG violations, highlighting the limitations of unconstrained generation. Similarly, refs. [9,13] emphasize that LLM outputs require human supervision and may introduce semantic inconsistencies. In contrast, the present work does not merely evaluate LLM performance. It formalizes accessibility remediation as a structured generative transformation governed by explicit architectural components and decision rules. Rather than asking whether LLMs can improve accessibility, this study defines how such improvement can be systematically organized, constrained, and evaluated.

8.1. Positioning with Respect to Related Work

Much of the recent work on AI-assisted accessibility improvement can be organized around a shared remediation pipeline—detection, generation, and evaluation—yet contributions diverge in what is formalized and where decisions are enforced. Agentic approaches such as [25] frame remediation as an iterative control problem: violations are detected, candidate fixes are generated, and subsequent tool feedback guides further refinement. Related work, such as [41], advances this direction by emphasizing structured prompting and output gating mechanisms. In these approaches, evaluation functions not merely as descriptive reporting but as a selection mechanism determining whether outputs are accepted, revised, or discarded. This aligns closely with the decision-rule perspective adopted in this paper, where evaluation is modeled explicitly as a formal function governing progression within the remediation workflow.

A complementary contribution is the use of component-level structural assessment as a bridge between automated scoring and usable accessibility evidence [26]. The checklist-based approach extends evaluation beyond aggregate metrics by inspecting semantic structure and verifying interaction via keyboard-only navigation and screen readers. This perspective directly motivates our multi-stage evaluation strategy, in which metric-based filtering precedes structural inspection of accepted variants.
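The multi-stage strategy described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the study: the `Variant` class, the tool names, the threshold values, and the three example variants are hypothetical, and the checklist entries stand in for results that a human inspector would record after keyboard-only and screen-reader testing.

```python
from dataclasses import dataclass, field


@dataclass
class Variant:
    """One regenerated page candidate with its evaluation evidence (hypothetical)."""
    name: str
    scores: dict[str, float]                                   # automated scores, 0-100
    checklist: dict[str, bool] = field(default_factory=dict)   # manual inspection results


def passes_metric_filter(variant: Variant, thresholds: dict[str, float]) -> bool:
    """Stage 1: every automated score must reach its threshold."""
    return all(variant.scores.get(tool, 0.0) >= minimum
               for tool, minimum in thresholds.items())


def passes_structural_inspection(variant: Variant) -> bool:
    """Stage 2: at least one checklist item exists and all items pass."""
    return bool(variant.checklist) and all(variant.checklist.values())


def evaluate(variants: list[Variant], thresholds: dict[str, float]) -> dict[str, str]:
    """Map each variant to 'discard', 'refine', or 'accept'."""
    decisions = {}
    for v in variants:
        if not passes_metric_filter(v, thresholds):
            decisions[v.name] = "discard"      # never reaches manual inspection
        elif not passes_structural_inspection(v):
            decisions[v.name] = "refine"       # good metrics, structural issues
        else:
            decisions[v.name] = "accept"
    return decisions


# Hypothetical thresholds and candidates, for illustration only.
THRESHOLDS = {"lighthouse": 90.0, "axe": 90.0}
CANDIDATES = [
    Variant("A", {"lighthouse": 98, "axe": 95}, {"keyboard": True, "screen_reader": True}),
    Variant("B", {"lighthouse": 96, "axe": 92}, {"keyboard": True, "screen_reader": False}),
    Variant("C", {"lighthouse": 80, "axe": 95}),
]

DECISIONS = evaluate(CANDIDATES, THRESHOLDS)
print(DECISIONS)
```

Placing the metric filter first means that manual effort is spent only on candidates that already satisfy the automated thresholds, which is the ordering the text motivates.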

Not all decision-oriented workflows rely on LLM generation. Study [48] proposes a structured evaluation flow that distinguishes evaluated from non-evaluated checkpoints, enabling targeted feedback even when full automation is not feasible. Although not LLM-centered, such decision structuring reinforces a key principle: evaluation must be treated as an explicit control mechanism rather than an afterthought.

Other relevant research targets modality-specific subproblems, underscoring the importance of constraint-aware generation. For instance, ref. [43] focuses on image accessibility through alternative-text generation and assessment, illustrating that accessibility remediation requires artifact-specific rules aligned with WCAG principles. Similarly, ref. [50] emphasizes user experience evidence, noting that improvements in conformance metrics may not fully capture interaction quality for assistive technology users. This observation supports our inclusion of structural and interaction-based checks following automated screening.

Finally, ref. [19] represents an implementation-oriented instantiation of a modular remediation architecture derived from the conceptual foundations proposed here. While that work validates feasibility at the framework level, the present article contributes at a higher level of abstraction by formalizing constructs, interactions, constraints, and decision mechanisms. This enables reproducible instantiation and systematic comparison across models, prompting strategies, and evaluation regimes.

Key Takeaway

Most existing studies operationalize accessibility remediation as implementation-oriented workflows or experimental prompting strategies. In contrast, the proposed model contributes a higher-level conceptual abstraction that formally specifies the structural relationships, constraints, and control logic governing the remediation process itself.

8.2. Practical Deployment and End-User Applicability

Although the demonstration was conducted using a single benchmark page, the ATPGE model was intentionally designed as a technology-agnostic conceptual architecture capable of supporting different implementation scenarios. Rather than defining a fixed software tool, the model specifies a structured remediation workflow that can be instantiated in multiple forms depending on deployment requirements.

In practice, the model could be operationalized as browser extensions, CMS-integrated accessibility assistants, middleware remediation services, automated auditing pipelines, or standalone accessibility-support frameworks. For example, a CMS plugin could automatically acquire and transform page content, generate accessible alternatives through LLM-based regeneration, and subsequently present candidate outputs for editor validation before publication. Similarly, browser-based implementations could support real-time accessibility adaptation or accessibility-aware content rewriting.

The proposed workflow was intentionally structured to balance automation and human supervision. Acquisition, transformation, prompting, and candidate generation can be highly automated, substantially reducing repetitive remediation effort. However, the evaluation layer preserves the possibility of human-centered validation, particularly for semantic interpretation, interaction quality, and assistive technology compatibility. Consequently, the effort required from end users depends on the selected deployment configuration and validation requirements.

For lightweight remediation scenarios, users may rely primarily on automated evaluation and threshold-based acceptance mechanisms. In contrast, high-assurance accessibility contexts—such as governmental, educational, or public-service systems—may additionally incorporate expert inspection and user-centered validation stages. This flexibility represents one of the principal advantages of the ATPGE architecture, as it supports different levels of automation without assuming complete replacement of human oversight.
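The distinction between lightweight and high-assurance configurations can be illustrated with a small routing function. The sketch below is an assumption about how such a configuration might be encoded; the `Assurance` enum, the stage names, and the `decide` signature are hypothetical and are not part of the ATPGE specification.

```python
from enum import Enum
from typing import Optional


class Assurance(Enum):
    """Hypothetical deployment profiles."""
    LIGHTWEIGHT = "lightweight"   # automated evaluation only
    HIGH = "high"                 # e.g., governmental or educational systems


def required_stages(level: Assurance) -> list[str]:
    """List the evaluation stages a deployment profile triggers."""
    stages = ["automated_metrics", "threshold_acceptance"]
    if level is Assurance.HIGH:
        stages += ["expert_inspection", "user_validation"]
    return stages


def decide(level: Assurance, automated_score: float, threshold: float,
           expert_ok: Optional[bool] = None,
           user_ok: Optional[bool] = None) -> str:
    """Return 'accept', 'refine', or 'pending_human_review'."""
    if automated_score < threshold:
        return "refine"                       # fails the automated gate
    if level is Assurance.LIGHTWEIGHT:
        return "accept"                       # automated gate is sufficient
    if expert_ok is None or user_ok is None:
        return "pending_human_review"         # human evidence not yet recorded
    return "accept" if (expert_ok and user_ok) else "refine"


print(required_stages(Assurance.HIGH))
print(decide(Assurance.LIGHTWEIGHT, 95.0, 90.0))
print(decide(Assurance.HIGH, 95.0, 90.0))
print(decide(Assurance.HIGH, 95.0, 90.0, expert_ok=True, user_ok=False))
```

In this sketch the automated threshold acts as a common first gate, and the human stages are added on top of it rather than replacing it, so oversight is preserved in the high-assurance profile.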

Key Takeaway

The current study focuses on conceptual formalization and controlled demonstration rather than large-scale production deployment. Future work should therefore evaluate the operational scalability of the model across multi-page environments, heterogeneous web platforms, and longitudinal accessibility maintenance workflows involving real end users.

8.3. Main Contributions

The contributions of this study extend beyond empirical accessibility evaluation. The proposed ATPGE model contributes at theoretical, methodological, and practical levels by formalizing accessibility remediation as a structured generative process, defining explicit architectural and decision-oriented mechanisms, and providing a reusable foundation for future accessibility-oriented systems and workflows.

8.3.1. Theoretical Contribution

The main theoretical contribution lies in reframing accessibility remediation from a defect-repair paradigm to a constrained generative paradigm. Patch-based approaches operate through incremental correction $x' = x + \Delta$, where $\Delta$ represents localized fixes applied to an existing page. In contrast, born-accessible regeneration reconstructs $y = F(x)$ under accessibility constraints embedded during generation, aligning with the design principles advocated in [15]. This conceptual shift addresses a recurring limitation in prior LLM studies: improvements are often local and metric-driven rather than structural and systemic. By defining regeneration as a constrained transformation governed by evaluation rules, the model introduces a higher level of abstraction consistent with design science principles.

The notion of born-accessible adopted in this work should therefore not be interpreted as conventional accessibility authoring from an initially empty design process. Instead, the model operates through accessibility-oriented regeneration: an existing webpage serves as semantic input for generating a newly reconstructed accessible version under explicit accessibility constraints. This positions ATPGE between traditional patch-based remediation and native born-accessible authoring paradigms.

Importantly, the contribution is not limited to arranging known LLM workflow stages into a sequential process. Conventional LLM pipelines typically emphasize prompt execution and output generation as implementation procedures. In contrast, the model formalizes the relationships, constraints, and decision mechanisms connecting each stage, transforming remediation into a structured architectural process. The novelty therefore emerges from the explicit conceptual integration of accessibility constraints, transformation logic, and evaluation-governed regeneration within a unified formal model.

8.3.2. Methodological Contribution

Methodologically, the study formalizes LLM-based remediation as a structured transformation system aligned with the design and development phase of the DSRM [16]. Scraping, transformation, prompting, generation, and evaluation are defined as typed operators connected through explicit mappings and constraint sets, rather than procedural steps. Evaluation is operationalized as a decision function regulating acceptance and refinement. This decomposition enables reproducible experimentation and systematic comparison of configurations, moving beyond ad hoc pipeline implementations.

This abstraction distinguishes ATPGE from conventional procedural workflows by elevating remediation stages into formally specified architectural operators governed by explicit control and acceptance rules.
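To make the idea of typed operators concrete, the following Python sketch composes the five stages as callables with explicit signatures and uses the evaluation result as a decision function that either accepts a candidate or triggers another refinement round. It is an illustrative sketch only: the type aliases, the `Pipeline` class, the default threshold, the round limit, and all `stub_*` functions are hypothetical, and the stub generator is a deterministic placeholder rather than a real LLM call.

```python
import re
from dataclasses import dataclass
from typing import Callable

# Hypothetical operator signatures for the five stages.
Acquire = Callable[[str], str]            # URL -> raw HTML
Transform = Callable[[str], str]          # raw HTML -> intermediate representation
BuildPrompt = Callable[[str, str], str]   # (intermediate, feedback) -> prompt
Generate = Callable[[str], str]           # prompt -> regenerated HTML
Evaluate = Callable[[str], float]         # regenerated HTML -> score in [0, 1]


@dataclass(frozen=True)
class Pipeline:
    acquire: Acquire
    transform: Transform
    build_prompt: BuildPrompt
    generate: Generate
    evaluate: Evaluate
    threshold: float = 0.9    # assumed acceptance threshold
    max_rounds: int = 3       # assumed refinement budget

    def run(self, url: str) -> tuple[str, float, bool]:
        """Return (best HTML, its score, whether it met the threshold)."""
        intermediate = self.transform(self.acquire(url))
        feedback = ""
        best_html, best_score = "", -1.0
        for round_no in range(1, self.max_rounds + 1):
            html = self.generate(self.build_prompt(intermediate, feedback))
            score = self.evaluate(html)
            if score > best_score:
                best_html, best_score = html, score
            if score >= self.threshold:       # decision rule: accept
                return html, score, True
            # decision rule: refine, feeding evaluation evidence back
            feedback = f"Round {round_no} scored {score:.2f}; address remaining issues."
        return best_html, best_score, False


# Deterministic stand-ins so the sketch runs without network or model access.
def stub_acquire(url: str) -> str:
    return "<div><b>Welcome</b></div>"

def stub_transform(html: str) -> str:
    return re.sub(r"<[^>]+>", "", html).strip()

def stub_build_prompt(text: str, feedback: str) -> str:
    return f"Regenerate accessibly: {text}\n{feedback}".strip()

def stub_generate(prompt: str) -> str:
    text = prompt.splitlines()[0].split(": ", 1)[1]
    lang = ' lang="en"' if "address remaining issues" in prompt else ""
    return f"<html{lang}><main><h1>{text}</h1></main></html>"

def stub_evaluate(html: str) -> float:
    checks = ['lang="', "<main>", "<h1>"]
    return sum(check in html for check in checks) / len(checks)


pipeline = Pipeline(stub_acquire, stub_transform, stub_build_prompt,
                    stub_generate, stub_evaluate)
html, score, accepted = pipeline.run("https://example.org")
print(accepted, round(score, 2))
```

Because each operator is a separate value with a declared signature, a different scraper, prompt builder, model, or evaluator can be substituted without changing the control logic, which is what allows configurations to be compared systematically.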

8.3.3. Practical Implications

For researchers, the proposed model provides a structured reference architecture for designing and comparing LLM-based remediation systems. For practitioners, it indicates that structured prompting and template constraints significantly enhance structural robustness. Importantly, the findings suggest that accessibility improvement is not solely model-dependent. Architectural structuring of input transformation and evaluation plays a decisive role. As the artifact remains an abstract structural specification, methodologies, frameworks, and implementations may be derived from it. In this sense, the study [19] can be interpreted as a framework-level instantiation validating the feasibility of the conceptual model under empirical conditions.

Key Takeaway

In the context of the proposed model, remediation does not imply incremental modification of the original source code. Instead, the process reconstructs a new accessible artifact derived from the semantic and structural content of the original webpage.

8.4. Limitations

Several limitations must be acknowledged. First, the demonstration was conducted on a single benchmark page under controlled conditions, which limits direct generalization to large-scale production environments. The objective of the experiment was to validate the operational coherence of the conceptual model while minimizing external variability during comparison of prompting, transformation, and evaluation strategies. Consequently, the study does not yet evaluate deployment complexity, scalability across heterogeneous websites, or long-term maintenance requirements.

However, the model was intentionally designed as a modular and technology-agnostic architecture capable of supporting different implementation scenarios, including CMS-integrated assistants, browser extensions, middleware remediation services, and accessibility auditing frameworks as explored in related studies [25,41]. Future work should therefore investigate how the model behaves in large-scale and real-world deployment contexts involving multiple pages, dynamic content, heterogeneous frontend technologies, and continuous accessibility monitoring workflows.

Second, automated metrics primarily capture conformance rather than experiential quality. This limitation was mitigated through manual structural assessment, including semantic inspection and keyboard/screen-reader testing. Although manual inspections were performed by the authors—introducing potential observational bias—the evaluation was supported by structured reports from WAVE and the WebAIM WCAG 2 Checklist [51], increasing objectivity.

Third, LLM output is probabilistic rather than deterministic, meaning identical prompts may yield different outputs. Reflection phases and normative constraint reinforcement were incorporated to reduce hallucination risks, consistent with concerns raised in [25].

Finally, the model does not eliminate the need for expert validation. Instead, it structures and optimizes the remediation workflow, reducing manual effort while preserving oversight.

9. Conclusions and Future Work

This study formalized LLM-driven web accessibility remediation as a structured conceptual model grounded in design science principles. Rather than treating accessibility improvement as an incremental defect-repair process, the proposed model defines regeneration as a constrained transformation governed by explicit architectural components, normative constraints, and decision-based evaluation mechanisms. By integrating structured prompting, threshold-based filtering, and component-level structural assessment, the study advances accessibility remediation from empirical experimentation toward a formally articulated generative paradigm.

The demonstration illustrates that accessibility improvement is not solely dependent on model choice. Architectural structuring of input transformation, constraint embedding, and evaluation sequencing plays a decisive role in achieving consistent and verifiable outcomes. In this sense, the contribution extends beyond implementation: it provides a reusable structural specification from which frameworks, tools, and domain-specific applications may be derived.

While this marks a significant advancement, it also surfaces several challenges and limitations that offer avenues for future research. Future work should focus on instantiating the model across diverse application contexts, including browser extensions capable of performing real-time remediation. Evaluation should be extended to multiple real-world domains and involve additional LLMs, external accessibility experts, and users with disabilities to strengthen ecological validity. Further research may also refine the decision-theoretic foundations of the model by defining adaptive threshold mechanisms that determine whether automated, expert, end-user, or hybrid evaluation phases should be triggered.

Finally, the potential for partially automating structured manual inspection, leveraging formalized reports such as the WebAIM WCAG 2 Checklist [51], offers a promising direction toward scalable, decision-aware accessibility validation.

Author Contributions

Conceptualization, G.V.-A.; methodology, G.V.-A.; software, G.V.-A.; validation, G.V.-A. and J.R.R.-C.; formal analysis, G.V.-A.; investigation, G.V.-A. and J.R.R.-C.; resources, G.V.-A. and J.R.R.-C.; data curation, G.V.-A.; writing—original draft preparation, G.V.-A.; writing—review and editing, G.V.-A. and J.R.R.-C.; visualization, G.V.-A.; supervision, J.R.R.-C.; project administration, G.V.-A. and J.R.R.-C.; funding acquisition, G.V.-A. and J.R.R.-C. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Secretaría de Ciencia, Humanidades, Tecnología e Innovación (SECIHTI) under scholarship No. 1351657. The APC was partially funded through the 2026 Institutional Program for Academic Strengthening and Excellence, administered by the General Directorate for Academic Development and Educational Innovation at Universidad Veracruzana.

Institutional Review Board Statement

Not applicable. This study did not directly involve humans or animals; therefore, ethical review and approval were not required.

Data Availability Statement

The dataset supporting the findings of this study has been deposited in the publicly accessible Mendeley Data repository at [40].

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

LLM	Large Language Model
AI	Artificial Intelligence
WCAG	Web Content Accessibility Guidelines
ARIA	Accessible Rich Internet Applications
SLR	Systematic Literature Review
NVDA	NonVisual Desktop Access
RQ	Research Question
DSRM	Design Science Research Methodology
PRISMA	Preferred Reporting Items for Systematic reviews and Meta-Analyses
CSI	Computer Standards & Interfaces
IST	Information & Software Technology
FRAI	Frontiers in Artificial Intelligence
QGS	Quasi-Gold Standard
IC	Inclusion Criteria
EC	Exclusion Criteria
CS	Candidate String
ATPGE	Acquisition, Transformation, Prompting, Generative and Evaluation Model
SECIHTI	Secretaría de Ciencia, Humanidades, Tecnología e Innovación

References

1. Kemp, S. Digital 2025: Global Overview Report. Available online: https://datareportal.com/reports/digital-2025-global-overview-report (accessed on 10 March 2026).
2. WHO. Health Equity for Persons with Disabilities: Guide for Action: Executive Summary; World Health Organization: Geneva, Switzerland, 2024.
3. WHO. World Report on Vision. Available online: https://www.who.int/publications/i/item/world-report-on-vision (accessed on 10 March 2026).
4. Henry, S.L. WCAG 2 Overview. Available online: https://www.w3.org/WAI/standards-guidelines/wcag/ (accessed on 10 March 2026).
5. WebAIM. The WebAIM Million. Available online: https://webaim.org/projects/million/ (accessed on 10 March 2026).
6. Campoverde-Molina, M.; Luján-Mora, S.; Valverde, L. Accessibility of university websites worldwide: A systematic literature review. Univers. Access Inf. Soc. 2023, 22, 133–168.
7. ur Rehman, Z.; Khalid, U.; Ijaz, N.; Mujtaba, H.; Haider, A.; Farooq, K.; Ijaz, Z. Machine learning-based intelligent modeling of hydraulic conductivity of sandy soils considering a wide range of grain sizes. Eng. Geol. 2022, 311, 106899.
8. ur Rehman, Z.; Aziz, Z.; Khalid, U.; Ijaz, N.; ur Rehman, S.; Ijaz, Z. Artificial intelligence-driven enhanced CBR modeling of sandy soils considering broad grain size variability. J. Rock Mech. Geotech. Eng. 2025, 17, 3161–3179.
9. Delnevo, G.; Andruccioli, M.; Mirri, S. On the Interaction with Large Language Models for Web Accessibility: Implications and Challenges. In Proceedings of the 2024 IEEE 21st Consumer Communications & Networking Conference (CCNC), Las Vegas, NV, USA, 6–9 January 2024; pp. 1–6.
10. López-Gil, J.M.; Pereira, J. Turning manual web accessibility success criteria into automatic: An LLM-based approach. Univers. Access Inf. Soc. 2025, 24, 837–852.
11. Ahmed, A.; Fresco, M.; Forsberg, F.; Grotli, H. From Code to Compliance: Assessing ChatGPT’s Utility in Designing an Accessible Webpage—A Case Study. arXiv 2025.
12. Aljedaani, W.; Habib, A.; Aljohani, A.; Eler, M.; Feng, Y. Does ChatGPT Generate Accessible Code? Investigating Accessibility Challenges in LLM-Generated Source Code. In Proceedings of the 21st International Web for All Conference, Singapore, 13–14 May 2024; pp. 165–176.
13. Othman, A.; Dhouib, A.; Nasser Al Jabor, A. Fostering websites accessibility: A case study on the use of the Large Language Models ChatGPT for automatic remediation. In Proceedings of the 16th International Conference on PErvasive Technologies Related to Assistive Environments, Corfu, Greece, 5–7 July 2023; pp. 707–713.
14. Vera-Amaro, G.; Rojano-Cáceres, J.R. Towards a Conceptual Model for AI-Driven Web Accessibility Remediation: A Prompt-Based Approach. In Proceedings of the 2025 13th International Conference in Software Engineering Research and Innovation (CONISOFT), La Paz, Mexico, 27–31 October 2025; pp. 288–297.
15. Lazar, J. A Framework for Born-Accessible Development of Software and Digital Content. In Human-Computer Interaction–INTERACT 2023; Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Springer: Cham, Switzerland, 2023; Volume 14145 LNCS, pp. 333–338.
16. Peffers, K.; Tuunanen, T.; Rothenberger, M.A.; Chatterjee, S. A Design Science Research Methodology for Information Systems Research. J. Manag. Inf. Syst. 2007, 24, 45–77.
17. Henry, S.L. Introduction to Web Accessibility. Available online: https://www.w3.org/WAI/fundamentals/accessibility-intro/ (accessed on 11 March 2026).
18. Singh, U.; Divya Venkatesh, J.; Muraleedharan, A.; Saluja, K.S.; J H, A.; Biswas, P. Accessibility Analysis of Educational Websites Using WCAG 2.0. Digit. Gov. Res. Pract. 2024, 5, 32.
19. Vera-Amaro, G.; Rojano-Cáceres, J.R. Accessible Web Content Generation Using LLMs: An Empirical Study on Prompting Strategies and Template-Guided Remediation. IEEE Lat. Am. Trans. 2025, 23, 1230–1239.
20. Vera-Amaro, G.; Rojano-Cáceres, J.R. Accessibility Challenges for Blind Users Authoring in Content Management Systems: An Empirical Study. IEEE Rev. Iberoam. De Tecnol. Del Aprendiz. 2026, 21, 268–278.
21. Power, C.; Freire, A.P.; Petrie, H.; Swallow, D. Guidelines are only half of the story: Accessibility problems encountered by blind users on the Web. In Proceedings of the CHI Conference on Human Factors in Computing Systems, Austin, TX, USA, 5–10 May 2012; pp. 433–442.
22. Vera-Amaro, G.; Rojano-Cáceres, J.R. Towards accessible website design through artificial intelligence: A systematic literature review. Inf. Softw. Technol. 2025, 186, 107821.
23. Håkansson, A.; Phillips-Wren, G. Generative AI and Large Language Models-Benefits, Drawbacks, Future and Recommendations. Procedia Comput. Sci. 2024, 246, 5458–5468.
24. Alsakran, W.; Alabduljabbar, R. Exploring the Potential of LLMs and Attributed Prompt Engineering for Efficient Text Generation and Labeling. In Proceedings of the 2024 2nd International Conference on Foundation and Large Language Models (FLLM), Dubai, United Arab Emirates, 26–29 November 2024; pp. 244–252.
25. Huang, C.; Ma, A.; Vyasamudri, S.; Puype, E.; Kamal, S.; Garcia, J.B.; Cheema, S.; Lutz, M. ACCESS: Prompt Engineering for Automated Web Accessibility Violation Corrections. arXiv 2024.
26. Abu Doush, I.; Kassem, R. Can generative AI create accessible web code? A benchmark analysis of AI-generated HTML against accessibility standards. Univers. Access Inf. Soc. 2025, 24, 3483–3506.
27. Vera-Amaro, G.; Vera-Amaro, R.; Mata-Rivera, M.F.; Rojano-Cáceres, J.R. Artificial Intelligence in Web Accessibility: Towards a Theory of LLM-Assisted Remediation for Visual Disabilities. Technologies 2026, 14, 287.
28. Kitchenham, B.; Budgen, D.; Brereton, P. Evidence-Based Software Engineering and Systematic Reviews; Chapman and Hall/CRC: New York, NY, USA, 2015.
29. Page, M.J.; McKenzie, J.E.; Bossuyt, P.M.; Boutron, I.; Hoffmann, T.C.; Mulrow, C.D.; Shamseer, L.; Tetzlaff, J.M.; Akl, E.A.; Brennan, S.E.; et al. The PRISMA 2020 statement: An updated guideline for reporting systematic reviews. BMJ 2021, 372, n71.
30. Carrera-Rivera, A.; Ochoa, W.; Larrinaga, F.; Lasa, G. How-to conduct a systematic literature review: A quick guide for computer science research. MethodsX 2022, 9, 101895.
31. Zhang, H.; Babar, M.A.; Tell, P. Identifying relevant studies in software engineering. Inf. Softw. Technol. 2011, 53, 625–637.
32. Wohlin, C. Guidelines for snowballing in systematic literature studies and a replication in software engineering. In Proceedings of the 18th International Conference on Evaluation and Assessment in Software Engineering, London, UK, 13–14 May 2014; pp. 1–10.
33. Zhou, Y.; Zhang, H.; Huang, X.; Yang, S.; Babar, M.A.; Tang, H. Quality assessment of systematic reviews in software engineering. In Proceedings of the 19th International Conference on Evaluation and Assessment in Software Engineering, Nanjing, China, 27–29 April 2015; pp. 1–14.
34. Cruzes, D.S.; Dyba, T. Recommended Steps for Thematic Synthesis in Software Engineering. In Proceedings of the 2011 International Symposium on Empirical Software Engineering and Measurement, Banff, AB, Canada, 22–23 September 2011; pp. 275–284.
35. Gusenbauer, M.; Haddaway, N.R. Which academic search systems are suitable for systematic reviews or meta-analyses? Evaluating retrieval qualities of Google Scholar, PubMed, and 26 other resources. Res. Synth. Methods 2020, 11, 181–217.
36. Campoverde-Molina, M.; Luján-Mora, S. Artificial intelligence in web accessibility: A systematic mapping study. Comput. Stand. Interfaces 2026, 96, 104055.
37. Casanova, E.; Guffanti, D.; Hidalgo, L. Technological Advancements in Human Navigation for the Visually Impaired: A Systematic Review. Sensors 2025, 25, 2213.
38. Chemnad, K.; Othman, A. Digital accessibility in the era of artificial intelligence—Bibliometric analysis and systematic review. Front. Artif. Intell. 2024, 7, 1349668.
39. Petersen, K.; Vakkalanka, S.; Kuzniarz, L. Guidelines for conducting systematic mapping studies in software engineering: An update. Inf. Softw. Technol. 2015, 64, 1–18.
40. Vera-Amaro, G.; Rojano-Cáceres, J.R. Supplementary Material for A Conceptual Model for Born-Accessible Web Accessibility Remediation Using Large Language Models. Mendeley Data. 2026. Available online: https://data.mendeley.com/datasets/nf7dz42sfn/2 (accessed on 10 March 2026).
41. Fathallah, N.; Hernández, D.; Staab, S. AccessGuru: Leveraging LLMs to Detect and Correct Web Accessibility Violations in HTML Code. In Proceedings of the 27th International ACM SIGACCESS Conference on Computers and Accessibility, Denver, CO, USA, 26–29 October 2025; pp. 1–22.
42. Huynh, G.K.; Lin, W. SmartCaption AI—Enhancing Web Accessibility with Context-Aware Image Descriptions Using Large Language Models. In Proceedings of the 2024 International Conference on Computer and Applications (ICCA), Cairo, Egypt, 17–19 December 2024; pp. 1–7.
43. Pedemonte, G.; Leotta, M.; Ribaudo, M. Improving Web Accessibility with an LLM-Based Tool: A Preliminary Evaluation for STEM Images. IEEE Access 2025, 13, 107566–107582.
44. Paternò, F.; Vinci, M.; Manca, M.; Iannuzzi, N. How an LLM Can Improve Automatic Web Accessibility Validation? In Proceedings of the 16th Biannual Conference of the Italian SIGCHI Chapter, Salerno, Italy, 6–10 October 2025; pp. 1–8.
45. Gu, M.; Wang, Z.; Lai, S.; Gao, Z.; Zhou, S.; Bu, J. Towards Scalable Web Accessibility Audit with MLLMs as Copilots. arXiv 2025.
46. Moterani, G.; Lin, W.R. Breaking the Linear Barrier: A Multi-Modal LLM-Based System for Navigating Complex Web Content. In Proceedings of the 2025 IEEE 49th Annual Computers, Software, and Applications Conference (COMPSAC), Toronto, ON, Canada, 8–11 July 2025; pp. 2066–2075.
47. Doush, I.A.; Kassem, R. Evaluating AI-Generated Web Code for Accessibility Compliance: A Metric-Driven Approach. In Proceedings of the 11th International Conference on Software Development and Technologies for Enhancing Accessibility and Fighting Info-Exclusion, Abu Dhabi, United Arab Emirates, 13–15 November 2024; pp. 338–344.
48. Ara, J.; Sik-Lanyi, C.; Kelemen, A.; Guzsvinecz, T. An inclusive framework for automated web content accessibility evaluation. Univers. Access Inf. Soc. 2025, 24, 1581–1607.
49. Poornima, U.S.; Suma, V. Visualization of Object Oriented Modeling from the Perspective of Set Theory. Lect. Notes Softw. Eng. 2013, 1, 214–218.
50. Lin, W.; Adewale, B.; Li, M.; Nasir, M.; Sultana, A.; Khokhar, R.H.; Zhang, Y. Dynamic Web Page Modification for Accessibility Using AI and Large Language Models. In Computer and Information Science and Engineering; Springer: Cham, Switzerland; Mount Pleasant, SC, USA, 2025; Volume 1192, pp. 33–46.
51. WebAIM. WebAIM’s WCAG 2 Checklist. Available online: https://webaim.org/standards/wcag/checklist (accessed on 11 March 2026).

Figure 1. Design Science Research Process adopted in this study (adapted from [16]).

Figure 2. Systematic literature review process aligned with PRISMA reporting standards, adapted from Kitchenham et al. [28], Page et al. [29], Carrera-Rivera et al. [30], Zhang et al. [31], Wohlin [32], Zhou et al. [33], and Cruzes and Dyba [34].

Figure 3. PRISMA flow diagram for the literature review process.

Figure 4. Distribution of included studies by year and publication type.

Figure 5. Distribution of included studies by publication venue.

Figure 6. Distribution of studies by journal quartile.

Figure 7. Distribution of studies by source.

Figure 8. Thematic map of LLM-based accessibility remediation derived through manual inductive thematic synthesis of the included studies. The map summarizes the principal relationships among scraping, prompting, generation, and evaluation processes identified during the review.

Figure 9. Distribution of LLMs reported by model family.

Figure 10. Breakdown of reported LLMs by GPT-family variant.

Figure 11. Evaluation approaches reported.

Figure 12. Distribution of tools employed.

Figure 13. Layered functional architecture of the proposed accessibility remediation model.

Figure 14. Comparison of WAVE AIM scores across models.

Figure 15. Comparison of execution time across models.

Table 1. Retrieved literature review studies.

Ref.	Source	Library	Year	Papers	Period
[36]	CSI	ScienceDirect	2026	53	2018–2025
[22]	IST	ScienceDirect	2025	31	2019–2025
[37]	Sensors	MDPI	2025	58	2019–2024
[38]	FRAI	Frontiers	2024	43	2018–2023

Table 2. Definition of PCC components.

Component	Terms	Synonyms
Population	Web content	web, websites, web pages
Concept	LLM related strategies	scraping, prompt engineering, model
Context	Accessibility remediation	evaluation, web accessibility, accessible

Table 3. Selected digital libraries.

Source	Description	Included Content
Scopus (Elsevier)	One of the largest indexing platforms of peer-reviewed multidisciplinary literature.	Peer-reviewed articles from IEEE Xplore, ACM Digital Library, SpringerLink, ScienceDirect, and preprints from arXiv.
Web of Science	Multidisciplinary citation database indexing records from top-tier academic journals.
arXiv (via Scopus)	Open-access repository of scholarly preprints.

Table 4. Inclusion and exclusion criteria.

Inclusion Criteria	Exclusion Criteria
IC1. Between 2020 and 2025.	EC1. Outside computing domains.
IC2. Published in English.	EC2. Not primary research.
IC3. Peer-reviewed or preprint.	EC3. No access to full-text.
IC4. Addresses at least one RQ.	EC4. Duplicated studies.

Table 5. Application of selection criteria.

Stage	Criteria	Activity
Stage 1	IC1, EC1	Filter by publication year and discipline.
Stage 2	IC2, IC3, EC2	Filter by language and type.
Stage 3	EC3, EC4	Title and abstract screening.
Stage 4	IC4	Full-text review of relevant studies.

Table 6. Evaluation of candidate search strings.

ID	Search String	Sens	Prec
CS01	(“web accessibility”) AND (“scraping” OR “artificial intelligence” OR “ai” OR “model” OR “llm” OR “large language model”)	100%	3.23%
CS02	(“web accessibility”) AND (“scraping” OR “prompt engineering” OR “framework” OR “model”) AND (“artificial intelligence” OR “large language model” OR “llm”)	100%	21.79%
CS03	(“web” OR “accessible web”) AND (“scraping” OR “prompt engineering” OR “llm”) AND “artificial intelligence”	19%	0.28%
CS04	(“web accessibility”) AND (“scraping” OR “prompt engineering” OR “model” OR “evaluation”) AND (“large language model” OR “llm”)	82%	21.21%
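The Sens and Prec columns can be read with the usual quasi-gold-standard definitions: sensitivity is the share of known relevant studies that a search string retrieves, and precision is the share of retrieved records that are relevant. The short Python sketch below assumes these standard definitions; the counts in the example are hypothetical and are not the counts behind Table 6, which are not reported in this section.

```python
def sensitivity(relevant_retrieved: int, relevant_total: int) -> float:
    """Percentage of known relevant studies (e.g., the QGS) found by the search."""
    if relevant_total <= 0:
        raise ValueError("relevant_total must be positive")
    return 100.0 * relevant_retrieved / relevant_total


def precision(relevant_retrieved: int, retrieved_total: int) -> float:
    """Percentage of retrieved records that are relevant."""
    if retrieved_total <= 0:
        raise ValueError("retrieved_total must be positive")
    return 100.0 * relevant_retrieved / retrieved_total


# Hypothetical counts: a QGS of 20 studies, a search returning 120 records,
# 18 of which belong to the QGS.
SENS = sensitivity(18, 20)
PREC = precision(18, 120)
print(f"sensitivity = {SENS:.2f}%, precision = {PREC:.2f}%")
```

A string with high sensitivity but very low precision retrieves the relevant studies at the cost of a large screening workload, which is the trade-off the candidate strings in the table are compared on.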

Table 7. Quality assessment criteria (QAC).

Dimension	ID	Description	Points
Report	QA1	Is the research aim clearly stated and aligned with the study’s content?	+1
Rigor	QA2	Is the methodological approach clearly described and appropriate?	+1
	QA3	Are the results clearly presented and supported by data, examples, or discussion?	+1
Credibility	QA4	Are the conclusions of the study presented?	+1
Relevance	QA5	Published in an indexed venue (JCR, SJR)?	+(0–2)
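
As a rough illustration of how the points in Table 7 could combine, the Python sketch below assumes that a study's quality score is the plain sum of QA1 to QA4 (one point each) plus the QA5 venue points (0 to 2), giving a maximum of 6. The additive rule and the function name are assumptions made for this example; the table itself lists only the per-criterion points.

```python
def quality_score(qa1: bool, qa2: bool, qa3: bool, qa4: bool, qa5_points: int) -> int:
    """Sum the Table 7 points: QA1-QA4 are yes/no, QA5 contributes 0, 1, or 2."""
    if qa5_points not in (0, 1, 2):
        raise ValueError("QA5 points must be 0, 1, or 2")
    # True counts as 1 and False as 0 when summed.
    return sum([qa1, qa2, qa3, qa4]) + qa5_points


# Hypothetical study: meets QA1, QA2, and QA4, with the top venue rating.
print(quality_score(True, True, False, True, 2))
```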

Table 8. Distribution of studies by analytical dimension.

Dimension	RQ	Studies	Percentage
Generation	RQ3	18	36%
Evaluation	RQ4	18	36%
Prompting	RQ2	10	20%
Scraping	RQ1	4	8%

Table 9. Most frequent dimension co-occurrences.

Combination	Studies
Prompting + Generation + Evaluation	6
Generation + Evaluation	4
Scraping + Evaluation	2
Scraping + Prompting + Generation + Evaluation	1

Table 10. Preprocessed input formats for LLMs.

LLM Input Format	Studies	Percentage
HTML	18	51%
Images	6	17%
CSS	4	11%
TEXT	3	9%
JavaScript	2	6%
Markdown	1	3%
JSON	1	3%

Table 11. Distribution of reported prompting strategies.

Prompt Strategy	Studies	Percentage
Zero-shot	12	32%
Role-Prompting	5	14%
Prompt Chaining	4	11%
Few-shot	3	8%
Contextual prompting	3	8%
ReAct	2	5%
One-shot	1	3%
Chain-of-Verification	1	3%
Chain-of-Thought	1	3%
Not reported	5	14%

Table 12. Comparison of web scraping techniques.

Method	HTML Chars	HTML Time (s)	Markdown Chars	Markdown Time (s)	Native Markdown Output
Beautiful Soup	6520	1.62	3479	2.04	No
Jina Reader	6842	2.98	5870	1.91	Yes
Crawl4AI	7007	3.53	4868	3.92	Yes
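
One way to read Table 12 is as the size reduction each tool achieves when its output is Markdown rather than HTML. The Python sketch below recomputes that reduction from the character counts in the table. The dictionary layout and the function name are illustrative choices, and the reduction metric itself is not a column reported in the table.

```python
# Character counts copied from Table 12: (HTML chars, Markdown chars).
CHAR_COUNTS = {
    "Beautiful Soup": (6520, 3479),
    "Jina Reader": (6842, 5870),
    "Crawl4AI": (7007, 4868),
}


def markdown_reduction(html_chars: int, markdown_chars: int) -> float:
    """Percentage by which the Markdown output is shorter than the HTML output."""
    return round(100.0 * (1 - markdown_chars / html_chars), 1)


reductions = {tool: markdown_reduction(h, m) for tool, (h, m) in CHAR_COUNTS.items()}
for tool, pct in reductions.items():
    print(f"{tool}: {pct}% fewer characters in Markdown")
```

Fewer characters generally mean fewer input tokens, so this ratio gives a quick sense of how much context budget each extraction route leaves for the prompt.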

Table 13. Formal structure of the model regeneration prompt, illustrating the fixed and dynamic components used to guide accessibility-oriented HTML regeneration under WCAG 2.2 constraints.

Stage	Type	Prompt Template
Identity Framing	Fixed	You are a Web accessibility expert specialized in WCAG 2.2 compliance and semantic HTML correction. You analyze, regenerate, and produce fully accessible HTML documents.
Regeneration Objective	Fixed	You will regenerate a complete HTML document that is accessible by design according to WCAG 2.2 guidelines.
	Dynamic	HTML or Markdown input
	Dynamic	Root URL
Normative Constraint Enforcement	Fixed	Revise all WCAG 2.2 criteria including color contrast, alt text, ARIA labels, correct heading hierarchy starting from h1, and semantic HTML elements.
	Fixed	Preserve and correctly reference ARIA relationships.
	Fixed	Detect the language of the page and add the appropriate lang attribute to the <html> tag.
Resource Normalization	Fixed	Convert relative URLs to absolute URLs using the provided root URL.
	Fixed	Convert external CSS into inline <style> rules… correcting accessibility issues such as insufficient contrast.
	Fixed	Include required JS files in the <head> section.
Template-Guided Structuring (Optional)	Dynamic	“The following is an accessibility-aware structural template to guide the regeneration…” template
Reflection and Verification	Fixed	Re-evaluate the regenerated HTML against WCAG 2.2. Double-check structural correctness. Ensure no new accessibility violations were introduced.
Output Constraint	Fixed	Return ONLY the final accessible HTML code. Do NOT include explanations or Markdown formatting.
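
The fixed and dynamic rows of Table 13 can be pictured as a simple assembly step. The Python sketch below is a minimal illustration, not the authors' implementation: the fixed strings are copied or abbreviated from the table, while the function name, the `Input:`, `Root URL:`, and `Structural template:` labels, the blank-line joining, and the subset of constraints shown are all assumptions made for this example.

```python
from typing import Optional

# Fixed stages, copied or abbreviated from Table 13.
IDENTITY = ("You are a Web accessibility expert specialized in WCAG 2.2 "
            "compliance and semantic HTML correction.")
OBJECTIVE = ("You will regenerate a complete HTML document that is accessible "
             "by design according to WCAG 2.2 guidelines.")
CONSTRAINTS = [  # only a subset of the fixed constraint rows is shown
    "Preserve and correctly reference ARIA relationships.",
    "Detect the language of the page and add the appropriate lang attribute "
    "to the <html> tag.",
    "Convert relative URLs to absolute URLs using the provided root URL.",
]
REFLECTION = ("Re-evaluate the regenerated HTML against WCAG 2.2. Double-check "
              "structural correctness. Ensure no new accessibility violations "
              "were introduced.")
OUTPUT = ("Return ONLY the final accessible HTML code. Do NOT include "
          "explanations or Markdown formatting.")


def build_prompt(source: str, root_url: str, template: Optional[str] = None) -> str:
    """Join fixed stages with the dynamic slots in the order of Table 13."""
    parts = [IDENTITY, OBJECTIVE,
             f"Input:\n{source}",          # dynamic: HTML or Markdown input
             f"Root URL: {root_url}"]      # dynamic: root URL
    parts.extend(CONSTRAINTS)
    if template is not None:               # optional template-guided stage
        parts.append(f"Structural template:\n{template}")  # hypothetical label
    parts.extend([REFLECTION, OUTPUT])
    return "\n\n".join(parts)


# Hypothetical usage with placeholder content.
prompt = build_prompt("<p>Hello</p>", "https://example.org")
print(prompt.splitlines()[0])
```

Keeping the fixed stages as constants and passing only the dynamic slots as arguments makes it straightforward to hold the prompt constant across experimental variants while changing the input format or template.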

Table 14. Experimental configuration used in the demonstration.

ID	Model	Input	Template
V1	GPT-5.2	HTML	No
V2	GPT-5.2	HTML	Yes
V3	GPT-5.2	Markdown	No
V4	GPT-5.2	Markdown	Yes
V5	Gemini 3	HTML	No
V6	Gemini 3	HTML	Yes
V7	Gemini 3	Markdown	No
V8	Gemini 3	Markdown	Yes
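
Table 14 is a full 2 × 2 × 2 factorial design over model, input format, and template use. The Python sketch below shows one way to enumerate those eight variants in the same order as the table; the variable names and the dictionary representation are illustrative assumptions.

```python
from itertools import product

MODELS = ["GPT-5.2", "Gemini 3"]
INPUT_FORMATS = ["HTML", "Markdown"]
TEMPLATE_OPTIONS = [False, True]  # "No" / "Yes" in Table 14

# product() varies the last factor fastest, which matches the V1..V8 ordering.
variants = [
    {"id": f"V{i}", "model": model, "input": fmt, "template": use_template}
    for i, (model, fmt, use_template) in enumerate(
        product(MODELS, INPUT_FORMATS, TEMPLATE_OPTIONS), start=1
    )
]

for v in variants:
    print(v["id"], v["model"], v["input"], "Yes" if v["template"] else "No")
```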

Table 15. Accessibility evaluation results for the original page and regenerated variants.

ID	Lighthouse Score	WAVE AIM	Axe Issues	Tokens	Time (s)
Original	50	4.7	50	-	-
V1	100	9.9	0	4832	46.17
V2	96	9.6	1	4322	49.45
V3	100	9.9	0	4987	59.12
V4	96	8.3	3	4147	46.68
V5	96	6.0	11	4352	43.01
V6	96	8.4	3	3955	26.92
V7	98	9.9	1	4359	26.84
V8	100	9.9	0	3882	21.29
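
Threshold-based acceptance over the scores in Table 15 can be sketched as a simple filter. In the Python example below, the three cut-off values are hypothetical and chosen only for illustration; the thresholds actually used in the study are those defined in its evaluation section and may differ. The score tuples are copied from the table.

```python
# Rows copied from Table 15: (Lighthouse score, WAVE AIM, Axe issues).
RESULTS = {
    "V1": (100, 9.9, 0),
    "V2": (96, 9.6, 1),
    "V3": (100, 9.9, 0),
    "V4": (96, 8.3, 3),
    "V5": (96, 6.0, 11),
    "V6": (96, 8.4, 3),
    "V7": (98, 9.9, 1),
    "V8": (100, 9.9, 0),
}

# Hypothetical cut-offs for illustration only; not the paper's values.
MIN_LIGHTHOUSE = 95
MIN_WAVE_AIM = 9.0
MAX_AXE_ISSUES = 1


def is_accepted(lighthouse: int, wave_aim: float, axe_issues: int) -> bool:
    """A variant passes only if it meets all three thresholds."""
    return (lighthouse >= MIN_LIGHTHOUSE
            and wave_aim >= MIN_WAVE_AIM
            and axe_issues <= MAX_AXE_ISSUES)


accepted = [vid for vid, scores in RESULTS.items() if is_accepted(*scores)]
print(accepted)
```

With these illustrative cut-offs the accepted set is V1, V2, V3, V7, and V8, which happens to coincide with the variants listed as columns in Table 16. Other threshold choices could yield the same set, so this agreement does not identify the values used in the study.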

Table 16. Results of the expert-based component-level structural assessment conducted using keyboard navigation and the NVDA screen reader on accepted variants.

Component	V1	V2	V3	V7	V8
Skip to main content	Correct	Correct	Visible	Visible	Correct
Heading hierarchy	Correct	Correct	Correct	Correct	Correct
ARIA usage	High	High	Moderate	Low	Low
Form labels	Correct	Correct	Correct	Correct	Correct
Search functionality	Pass	Pass	Pass	Pass	Pass
Contrast	Pass	Fail	Pass	Pass	Pass
Landmark structure	Correct	Correct	Correct	Correct	Correct
Alternative text	Minor issues	Minor issues	Minor issues	Minor issues	Minor issues
Categories section detected	Pass	Pass	Pass	Fail	Pass
Structural integrity	High	High	High	High	High


Share and Cite

MDPI and ACS Style

Vera-Amaro, G.; Rojano-Cáceres, J.R. A Conceptual Model for Born-Accessible Web Accessibility Regeneration Using Large Language Models. Computers 2026, 15, 343. https://doi.org/10.3390/computers15060343
