Automated Detection of Deceptive Design Patterns on University Websites: A Comparative Analysis of Browser-Based Tools and LLM-Based Approaches

Król, Karol

doi:10.3390/app16094543



by

Karol Król

Digital Cultural Heritage Laboratory, Department of Land Management and Landscape Architecture, Faculty of Environmental Engineering and Land Surveying, University of Agriculture in Krakow, Balicka 253c, 30-198 Krakow, Poland

Appl. Sci. 2026, 16(9), 4543; https://doi.org/10.3390/app16094543

Submission received: 8 April 2026 / Revised: 1 May 2026 / Accepted: 2 May 2026 / Published: 5 May 2026

(This article belongs to the Section Computing and Artificial Intelligence)


Featured Application

The proposed methodology provides a preliminary identification of potential deceptive design patterns on websites by analysing the structural characteristics of the front-end interface. It can also compare websites as digital environments without establishing their actual influence on user behaviour.

Abstract

Although tools for the automated identification of deceptive design patterns on websites have emerged in recent years, their scope and reliability remain understudied. Institutional websites are a particularly interesting research domain: they have an extensive information structure and shape the conditions of user interaction. The purpose of the article is to empirically evaluate signs of deceptive design patterns on Polish universities’ websites and to analyse how automated analytical tools identify them. The study covers all public universities in Poland (N = 65). The analysis involved automated tools representing different methodological underpinnings, including web browser extensions and GPT language model-based analytical procedures. The study pinpoints significant differences in the paradigms behind the results provided by the two methods. Browser extensions yielded only qualitative suggestions of potential problems and did not generate complete and comparable quantitative results for the entire population of the investigated websites. Results from the heuristic Real-Time Deceptive Pattern Auditor (RTDPA) were highly concentrated (mean 90.03, median 90, SD = 1.55, range 80–95), which may suggest limited discriminatory power for this website collection. In contrast, the rule-based Structural Interface Risk Screening (SIRS) revealed a much greater differentiation of results (mean 89.23, median 90, SD = 11.50, range 50–100). The association between the results from the two procedures was very weak (r = 0.089, p ≈ 0.48), which indicates their limited quantitative comparability. These findings indicate that the current capabilities of automated tools offer merely fragmented detection of selected deceptive design patterns, instead of a complete systemic diagnosis of the problem. Although they are measurement tools by declaration, the solutions can offer only preliminary screening, flagging potential risk areas. Not differentiating between risk signalling and actual measurement may lead to the illusion of automated precision.

Keywords: deceptive design indicators; digital trust; consent quality; choice architecture

1. Introduction

Website quality evaluation is an important part of digital environment studies, covering analysis of interface characteristics, information structure, and functionalities that affect usability, transparency, and availability of information [1]. In the literature, this type of analysis typically comes as case studies or investigations of limited, purposively selected website collections, rather than in-depth multi-criteria audits of individual websites, which are most commonly found in the market and design practice. The latter angle facilitates detailed diagnosis of the interface and functional solutions. Still, its use in comparative research on larger, uniform sets of websites remains limited due to its effort intensity and low scalability [2]. This poses a significant methodological and operational challenge when the research objective is to evaluate specific characteristics of a digital environment at an institutional, sectoral, or systemic level, rather than to analyse individual cases. These limitations are particularly relevant to phenomena that are not solely linked to the website’s functionality but also to interface configuration, which affects how users make decisions. In this context, the identification of patterns considered deceptive design is important, not as an incidental problem of interface quality, but as a potentially systemic characteristic of a digital environment, affecting choice architecture, cognitive load, and information access equality [3]. Deceptive design patterns refer to interface configurations that influence user choice architecture through information, effort, or perceptual asymmetry, potentially reducing transparency in decision-making. Therefore, tools capable of partially automating analysis are increasingly used to screen websites and identify selected interface characteristics at scale across large collections of websites while retaining methodological control over the scope and meaning of the results. 
Considering the rapid advancements in AI-based tools, including large language models (LLMs), the question arises as to whether it is possible and reasonable to use these technologies to automate the identification of deceptive design patterns, including dark patterns, across large collections of websites [4]. In light of the sustainable development of digital services, the issue of the effectiveness, reliability, transparency, and interpretability of results generated by such tools for automated website analysis has become highly relevant [5]. Therefore, the question of using AI tools in deceptive design analysis pertains not only to detection performance but also to the scope of their responsible use for evaluating the quality of the digital environment of public institutions. Despite the growing use of AI-based tools, existing studies primarily focus on detection performance or case-based analyses. There is still a lack of systematic comparative research examining the character, stability, and interpretability of results generated by different classes of automated tools, particularly in large, homogeneous website collections. This study addresses this gap by providing a comparative evaluation of browser-based tools and LLM-based approaches in the context of deceptive design pattern detection.

As beneficiaries of public trust, universities should be particularly mindful of the quality of their websites and web applications. Their responsibility goes beyond conformity with applicable technical and regulatory standards, such as the Web Content Accessibility Guidelines (WCAG) [6], World Wide Web Consortium (W3C) guidelines [7], Web Hypertext Application Technology Working Group (WHATWG) specifications [8], and principles of sustainable design, like Green SEO [9]. It also includes employing transparent, ethical, inclusive, and user-friendly design practices [10]. Another important part of this responsibility is to avoid design solutions that could lead to informational or decision asymmetry, including patterns considered deceptive design or dark patterns. This study aims to evaluate the incidence of deceptive design patterns on Polish universities’ websites and compare the paradigms and stability of results generated by various automated procedures declared as detection tools. The work employs automated screening procedures, serving as tools for identifying potential signs of risk areas within the investigated collection of websites. The study follows three methodological angles, covering tools available as web browser components and original GPT models with different operational configurations and ranges of generated results. Both the literature and practitioners increasingly call for significant automation of deceptive design pattern identification in response to the limited scalability of manual analyses [4].

The evolution of crawler-based detection in deceptive design pattern research [4,11], combined with the increasing use of AI-based and LLM-based tools in online content analysis [5,12], suggests that deceptive design pattern detection could indeed be automated to a significant degree. In light of the above, the author proposes that the currently available tools for automated website analysis, including browser extensions and front-end-focused LLMs, can identify selected deceptive design signs in large collections of websites.

The present contribution is both empirical and methodological. The article offers a systematic analysis of the incidence of deceptive design patterns on Polish universities’ websites and a comparative evaluation of the characteristics of results generated by the automated detection procedures. The analysis demonstrates the need to discriminate between a screening function and a complete measurement of the incidence of deceptive design patterns. In conceptual terms, the article employs the notion of deceptive design patterns as the analytical framework for investigating interface configurations that affect the user’s choice architecture, contrasting it with a more specific and normative idea of dark patterns. Methodologically, the author proposes and empirically tests a comparative technique for analysing results of automated detection tools, including both a structural principle-based approach and a generative heuristic approach powered by large language models (LLMs). This combination identifies differences in the tools’ capabilities to discriminate the investigated websites and in the transparency of the results they generate. The method also identifies the risk of ‘illusion of automated precision’ in generative AI-based evaluations. Furthermore, the results provide empirical evidence of structural indicators associated with potential deceptive design patterns across the investigated collection of Polish universities’ websites. This makes them relevant to further research on digital environment quality and the practice of public-sector website quality monitoring. In this study, the primary object of analysis is not the websites themselves, but the characteristics and behaviour of automated detection procedures applied to a uniform collection of websites, which serve as a controlled empirical context.

The remainder of the article is structured as follows. Section 2 presents the study’s conceptual framework and research questions, focusing on the automation of deceptive design pattern detection and the characteristics of results generated by AI- and LLM-based tools. Section 3 describes the material and research methods, including the profile of the website collection, the employed tools, and the research procedure. Section 4 reports the results of the empirical study, and Section 5 interprets and discusses them in light of prior research and the limitations of the selected methodology. The results are summarised in Section 6, along with their implications and future research directions.

2. Conceptual Framework and Research Questions

2.1. Automated Detection of Deceptive Design Patterns as a Methodological Problem

Deceptive design patterns sit at the intersection of the technical, visual, and cognitive dimensions of websites. To identify them, one needs to consider the website’s formal structural characteristics, such as specific interface components and the sequence of events, as well as how these elements affect the user’s perception, choice architecture, and interaction pipeline [13]. Following the conceptual framework proposed by Münscher et al. [13], choice architecture here refers to the arrangement and presentation of decision options within an interface, including their exposure hierarchy, defaults, the distribution of effort required to choose among alternatives, and the language in which the options are communicated to the user. Although it does not formally limit freedom of choice, choice architecture guides the decision-making process through the website’s structure, shaping the circumstances under which decisions are made [14]. In this context, asymmetry of effort concerns a situation where formally equal decision options in an interface require the user to make significantly different physical or cognitive efforts. The differences can involve the number of actions required, the complexity of the selection procedure, the necessity of processing additional information, or the visual presentation of options, leading to systemic bias towards a specific outcome at the expense of others, without formally limiting the choice range [15]. Therefore, deceptive design patterns are not merely technical or semantic issues; they are complex phenomena that depend heavily on context and relationships. This complexity arises from the interplay between structural interface elements, visual presentation, linguistic framing, and the sequence of user interactions, which together shape the user’s decision-making environment.

The literature defines deceptive design as a set of recurrent design patterns that affect user choice architecture through information, effort, or perceptual asymmetry, thereby providing limited transparency in decision processes [4,12]. The most common categories include asymmetric consent configurations (such as pre-selected options or the lack of an equivalent refusal option), interface interference (which favours a specific option at the expense of others through visual or structural means), and obstruction (which means hindering actions that the system does not favour by design, such as unsubscribing, changing settings, or withdrawing consent) [4]. The literature has identified patterns based on linguistic and semantic pressure that employ wording suggesting urgency, scarcity, or social proof. Some conceal important information by fragmenting content or delaying the disclosure of the costs and consequences of users’ decisions [16]. How the interface works and affects the user choice pipeline is critical to deceptive design research, making the concept particularly useful in comparative studies and website audits [17].

The present study uses deceptive design patterns as its analytical framework to describe the problem of structural interface configurations that increase the risk of disrupting the balance of user choice, regardless of assumptions about designers’ intent. The literature has generally used the term ‘dark patterns’ to refer to some of these schemes, both in research and design practice, often with a clear normative emphasis [4,18]. The present paper considers dark patterns to be a special case of deceptive design patterns, a subset of a broader category of user interface configurations that affect the user choice architecture (Table 1).

Despite the rising attention to deceptive design patterns and substantial advances in automated interface analysis tools, studies have so far tended to focus on case studies of individual websites, qualitative analyses, or experimental detectors whose applicability is limited [16]. Many contributions are descriptive, conceptual, and categorising, and investigate the problems of methodology, definitions, and deceptive design pattern categories [12,17]. The literature offers relatively few systematic comparative studies evaluating the usefulness, reliability, and limitations of various techniques for automated identification of deceptive design patterns in large, homogeneous collections of institutional websites [4]. The problem of the relationship between the scalability of automated tools and the transparency and interpretability of their results remains particularly understudied. This is central to sustainable monitoring of digital environment quality and responsible use of AI tools in the public sector.

Techniques for identifying deceptive design patterns found in the literature are based mostly on in-depth expert analyses involving manual interface inspections, heuristic analyses, and case studies [18]. Publications involving large, systematically selected collections of websites are rarer. The contributions by Mathur et al. [4] and Nouwens et al. [11] remain among the few examples of such large-scale projects. Manual expert analysis can pinpoint subtle manipulative mechanisms, but poor or nonexistent scalability remains its core limitation. At a practical level, this means that manual identification of deceptive design patterns is feasible mainly for individual websites or small collections. This limits its usefulness for comparative studies and systematic monitoring of digital environment quality [18]. These limitations shift researchers’ attention towards tools for automated website analysis, including rule-based solutions, HTML structure analysis, machine learning, and LLMs [21]. These tools can potentially support large-scale analyses at a relatively low operational cost and with high repeatability of the research procedure [22]. In the context of sustainable development of digital services, automated detection could be considered a sine qua non for long-term monitoring of the quality of the digital environment. Still, automated detection of deceptive design patterns faces serious methodological limitations. Many manipulative patterns have no clear-cut, formally defined website code representations, nor do they appear as simple binary signals. Their identification often hinges on context, the sequence of interactions, the relationships between content and visual form, and implied design intentions [18]. 
As a result, tools for automated analysis could exhibit very good performance when identifying specific categories of patterns, such as covert consent or forced interaction, while struggling to detect more subtle forms of manipulation, such as time pressure, information asymmetry, or distorted choice clarity [12]. Consequently, although they are measurement tools by declaration, the solutions can offer only preliminary screening, flagging potential risk areas. Another undeniable challenge is the significant diversity of tools for automated website analysis. Tools that are web browser components are typically limited to detecting specific interface elements and verifying conformity. In contrast, generative AI models analyse websites more holistically, using semantic descriptions and linguistic patterns [23]. These differences lead to different detection scopes, different levels of detail in the results, and different degrees of susceptibility to classification errors. Therefore, the notion of ‘automated detection of deceptive design patterns’ does not refer to a single, uniform procedure but to a spectrum of approaches with varying strengths and limitations. Hence, the primary research problem is the scope and degree of the reliability of automated detection of deceptive design patterns, rather than determining the feasibility of the detection as such. The problem is directly pertinent to the evaluation of the usefulness of automated methods in comparative studies, quality audits, and the responsible use of AI tools in digital environment quality assessments.

The study follows an exploratory and comparative research design; therefore, instead of formal hypothesis testing, it is guided by research questions aimed at examining the characteristics and limitations of automated detection approaches. In light of the above, the author poses the following research question:

RQ1: What are the characteristics and scope of limitations of automated detection of deceptive design patterns using tools for automated website analysis?

The scope of automated detection of deceptive design patterns is defined as a property of tools for automated website analysis. It covers the ability to identify various design pattern categories, the coherence and repeatability of results, and the level of detail in the resulting analyses. Awareness of the limits of this particular application is the critical prerequisite for the responsible use of automated tools in comparative studies and quality audits. In this study, the reliability of automated auditing procedures is evaluated through the analysis of result distributions, variability measures (including standard deviation and interquartile range), and the strength of association between outputs generated by different tools (Pearson’s correlation coefficient).
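The reliability measures named above can be illustrated with a minimal sketch that uses only the Python standard library. The two score lists are hypothetical placeholders rather than study data, and the function names (`describe`, `iqr`, `pearson_r`) as well as the choice of the "inclusive" quantile method are assumptions made for illustration.

```python
import math
import statistics


def iqr(values):
    """Interquartile range (Q3 - Q1) using the 'inclusive' quantile method."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1


def describe(values):
    """Distribution and variability measures for one tool's scores."""
    return {
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "sd": statistics.stdev(values),  # sample standard deviation
        "iqr": iqr(values),
        "min": min(values),
        "max": max(values),
    }


def pearson_r(xs, ys):
    """Pearson's correlation coefficient for two paired samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two paired samples of equal length (>= 2)")
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / (sx * sy)


# Hypothetical scores for six websites from two procedures (not study data).
tool_a = [90, 90, 90, 95, 90, 85]
tool_b = [100, 70, 90, 90, 50, 100]

print(describe(tool_a))
print(describe(tool_b))
print(pearson_r(tool_a, tool_b))
```

A narrow standard deviation and interquartile range for one procedure alongside a wide spread for the other, combined with a correlation close to zero, would be the kind of pattern that points to limited comparability of the two outputs.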

2.2. Reliability and Discriminatory Power of Automated Audit Procedures

Automated website auditing procedures mean a significant shift not only in the scale and speed of analysis but also in the characteristics of analytical results [12]. In contrast to in-depth expert analysis, which is based on overt criteria and relatively stable assessment procedures, automated tools generate results indirectly, using complex data processing, heuristics, and probabilistic models. As a result, the key problem is not the ability to detect specific patterns as such, but the reliability and comparability of the procedure’s outcomes [21].

Deceptive design pattern studies increasingly rely on AI-based tools, including LLMs, which can analyse large collections of websites automatically. Still, researchers emphasise that these tools operate largely as black boxes, in which the mechanisms that generate the results are neither fully transparent nor controllable [24,25]. Therefore, the primary methodological problem is not the ability to automatically detect deceptive design patterns, but the reliability of the results generated by AI tools used in comparative analyses. It specifically concerns whether the generated results reflect the actual diversity of the investigated website collection or are operational artefacts of the tool (algorithm), its configuration, or its response-generation mechanisms. This is why it is critical to distinguish between the formal ability to generate a result and its actual value as a reliable, dependable basis for evaluating an object, which is fundamental to using AI and LLM tools in website quality audits. Considering this, the author poses the following research question:

RQ2: To what degree are results generated by LLM-based automated auditing procedures a reliable and accurate basis for comparative analysis of websites?

RQ2 is addressed through analysis of the characteristics of the outputs generated by the automated auditing procedures, treated as comparative data, with particular attention to their distribution, variability, and cross-website stability. In this context, accuracy is interpreted operationally as the adequacy of the generated results to represent variation within the analysed website collection, rather than as benchmark-based classification performance.

2.3. Deceptive Design Patterns on University Websites

Deceptive design pattern research to date has tended to focus on commercial websites such as e-commerce platforms, subscription services, or mobile applications, where deceptive design patterns were often investigated in the context of conversion optimisation or buying behaviour [3,4]. Systematic analyses have less often concerned public institution websites, including university websites, whose primary function is to ensure access to public information and handle educational and administrative processes.

University websites exhibit high information and functional complexity. They cater to diverse user groups such as prospective students, students, staff, and third parties. This has a practical impact on their design, in that these websites have extensive navigation structures, elaborate forms, and multi-stage interaction processes [7]. These circumstances are conducive to designs that introduce information asymmetry, hinder comparison, or affect the user’s decision-making regardless of the designer’s intent. The central question from the perspective of website quality research is the empirical assessment of the extent to which patterns considered deceptive design occur on university websites as systemic components rather than incidental lapses. The focus of this assessment is not to decide on design intentions. Instead, it concentrates on the observable presence of certain interface components and their prevalence across a uniform collection of websites.

Determining the prevalence of deceptive design patterns on university websites is a substantial aspect of comparative studies and indispensable for a thorough assessment of the quality of institutional websites. It is particularly pertinent for universities because their websites serve as the primary channel for accessing public information and administrative procedures. Any information asymmetries or structural barriers in choice architecture could affect access equality, process transparency, and user decision quality. Identifying the incidence of deceptive design patterns can help pinpoint areas of increased design risk and determine whether these patterns are marginal or sufficiently prevalent to warrant a systemic review of web design practices in higher education. To this end, the author poses the following research question:

RQ3: What is the extent of deceptive design pattern incidence on Polish universities’ websites?

The ‘extent of deceptive design pattern incidence’ is defined as the empirically observable frequency and distribution of signs indicating the presence of specific design components in the investigated collection of websites, rather than as a threshold-based classification of websites. The analysis focuses on identifying recurrent interface elements and navigation structures that may affect the user’s choice architecture, regardless of the declared design intent. Its objective is not to classify individual websites using normative categories. Instead, it serves to comparatively diagnose the scale of the problem across a uniform collection of websites and to identify areas of potential design risk that require further in-depth analysis.

3. Materials and Methods

The study population comprised the websites of all public universities in Poland (N = 65), as listed by the minister for higher education (as of 10 March 2026) [26]. The object of analysis was the website, considered an autonomous digital environment, investigated independently of the university’s institutional position, renown, or scientific profile. This angle facilitated a comparative evaluation of website characteristics without taking into account the university’s organisational profile as a research and educational institution.

The analysis was conducted in desktop mode, treated as the baseline. This is because key components of information architecture, decision-making structures, and interface components related to administrative processes are designed primarily for desktop use [7]. The analysis did not include deeper pages, multi-step user interactions, or full site crawls; instead, it was restricted to a standardised snapshot of the homepage to ensure comparability across all observations. The page retrieval process combined browser-based inspection of the rendered interface with the extraction of static HTML code (“view source”) as the primary analytical input. The study employs three analytically independent modes of data acquisition: (1) analysis of the rendered interface using browser-based tools, (2) analysis of static HTML source code, and (3) heuristic evaluation based on URL input. These modes differ in input type, level of abstraction, and analytical scope, and are treated as methodologically non-equivalent.

The analysis was limited to the website’s landing page, focusing on the above-the-fold interface area and those structural components that could affect the user’s choice architecture, such as consent configurations, navigation, and decision-relevant messages. This limitation is intentional and reflects the study’s focus on comparable, standardised entry points to the analysed websites. The landing page typically concentrates key elements of choice architecture, such as consent configurations, navigation structures, and decision-relevant interface components, while also ensuring methodological consistency across the entire dataset. Deeper or dynamically generated pages may introduce variability related to user pathways and context-dependent content, which would reduce the comparability of results in a large-scale, cross-sectional analysis. The author collected each website URL and archived the static HTML source code as the primary analytical input. This approach excludes dynamically generated content, client-side scripts, and interaction-dependent interface states. As a result, certain deceptive design patterns that rely on visual rendering, temporal sequencing, or user-triggered behaviour may not be captured. Therefore, the findings should be interpreted as reflecting structural risk indicators rather than a complete representation of all possible interface manipulations.
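The archiving step described above could be approximated along the following lines. The article does not state which tooling was used to collect the source code, so this is only a sketch under stated assumptions: the file-naming convention, the `User-Agent` string, and the provenance record are all hypothetical, and the fetch returns the HTML as served, without executing client-side scripts.

```python
import hashlib
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import Request, urlopen


def snapshot_name(url: str) -> str:
    """Derive a file name from the URL's host (hypothetical convention)."""
    host = urlparse(url).hostname
    if not host:
        raise ValueError(f"URL has no host: {url!r}")
    return f"{host}.html"


def fetch_static_html(url: str, timeout: float = 30.0) -> bytes:
    """Return the raw HTML bytes as served; no scripts are executed."""
    # The User-Agent value is an illustrative placeholder.
    request = Request(url, headers={"User-Agent": "static-snapshot-sketch/0.1"})
    with urlopen(request, timeout=timeout) as response:
        return response.read()


def archive_snapshot(url: str, html: bytes, out_dir: Path) -> dict:
    """Write the snapshot to disk and return a small provenance record."""
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / snapshot_name(url)
    path.write_bytes(html)
    return {
        "url": url,
        "file": str(path),
        "bytes": len(html),
        # The checksum allows later verification that the input is unchanged.
        "sha256": hashlib.sha256(html).hexdigest(),
    }
```

Calling `archive_snapshot(url, fetch_static_html(url), Path("snapshots"))` for each landing page would yield one file per website. Storing a checksum alongside each file is one way to document that every procedure analysed the same static input.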

3.1. Measurement Tools

The study employs an operational perspective akin to an actual audit conducted by a UX practitioner or compliance specialist without access to the back end. The analysis was conducted in a web browser in the client environment on a rendered version of the website that regular users access. This approach limits the possibility of technical intervention into the website’s structure while increasing the practical availability and replicability of the auditing procedure.

The study employs third-party tools for automated deceptive design pattern detection (Chrome browser extensions) and two original LLM-based tools utilising different analytical strategies: (1) heuristic generative evaluation and (2) rule-based structural screening (Table 2). The study investigates only observable characteristics of websites, not design intent or actual user behaviour. It does not rely on a ground truth benchmark dataset or a binary classification framework, because the objective is not to evaluate detection accuracy but to analyse the characteristics and comparability of outputs generated by different automated procedures. Consequently, detection correctness in terms of false positives and false negatives is not evaluated. Instead, the analysis focuses on the comparative behaviour and output characteristics of the different analytical procedures. The results should therefore be interpreted as an evaluation of the structurally embedded design risk potential across this specific collection of websites, rather than a normative classification of the design practices of individual universities.

In this study, “comparative behaviour” refers to differences in how the analysed procedures respond to the same set of websites, including variation in output levels, stability across observations, and consistency of results, while “output characteristics” denote the statistical properties of the generated results, such as distribution, dispersion, and cross-method correspondence.

Details of the two custom-designed GPT models used in the study are provided below. Their methodological differences are central to the comparative analysis of the scope and characteristics of automated detection of deceptive design patterns.

3.1.1. Profiles of the GPT Models Used in the Study

The two models used in this study are original analytical procedures designed by the author and implemented within a GPT-based environment. They do not involve model training in the machine learning sense, but are defined through analytical rules, input specifications, and response structures.

The author built both models in ChatGPT (v5.2) as independent analytical procedures for identifying deceptive design patterns. The models vary in scope, measurement methods, and operationalisation of deceptive design pattern detection (Table 3). The analysis used a structured prompt framework consisting of modular analytical instructions applied across all observations; the general characteristics of the model and its analytical procedure are described in Appendix A.2.

The comparison between Structural Interface Risk Screening (SIRS) and Real-Time Deceptive Pattern Auditor (RTDPA) reveals fundamental methodological differences in how they operationalise the concept of deceptive design patterns. The models differ in input scope, evidence assessment rules, transparency levels, and the measurement characteristics of their results. Analysis with SIRS is restricted to clearly observable configurations of static HTML code. It requires direct structural proof for each detection. Meanwhile, RTDPA is based on a broader heuristic interpretation and may involve contextualised evaluation.

SIRS detects deceptive design patterns by analysing static HTML code provided by the server. It covers only interface attributes found in the HTML structure. Each detection requires a direct reference to the code and an assignment of a predefined rule, which ensures transparency and replicability of results. The index value (Structural Risk Index, SRI) is calculated using Equation (1) and takes into account only confirmed structural configurations found in the HTML code, without conjectures about design intent.

SRI = 100 − (N_confirmed × 10)    (1)

where N_confirmed is the number of unambiguously confirmed HTML configurations meeting the criteria of one of the five predefined structural categories of deceptive design patterns: pre-selected options, absence of a symmetrical alternative, hidden elements, default affirmative configuration, and explicit linguistic pressure markers. These categories operationalise selected interface configurations described in the literature in the context of dark patterns and choice architecture. The author defined them so that they can be unambiguously identified in static HTML code [4,13,18]. The index values range from 0 to 100. Assigning an identical weight to each configuration is a heuristic choice intended to ensure transparency and comparability of the index. Equal weighting reduces the risk of arbitrary differentiation of the impacts of individual configurations and, in the absence of a validated empirical basis for assigning differentiated weights, avoids introducing additional subjectivity into the model. It also allows the index to be read as a simple measure of the number of confirmed structural signs, rather than a weighted assessment of their impacts.
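As a minimal illustration of Equation (1), the index can be computed as in the Python sketch below. The function name `structural_risk_index` is hypothetical, and the lower bound of 0 is an assumption made here only so that the output stays within the stated 0–100 range when more than ten configurations are confirmed; the article does not specify how such cases are handled.

```python
def structural_risk_index(n_confirmed: int) -> int:
    """Equation (1): SRI = 100 - (N_confirmed x 10), equal weight per detection."""
    if n_confirmed < 0:
        raise ValueError("n_confirmed must be non-negative")
    # Assumption (not stated in the article): clamp at 0 so that the
    # index stays within the stated 0-100 range.
    return max(0, 100 - 10 * n_confirmed)


for n in (0, 1, 5):
    print(n, structural_risk_index(n))
```

Under this formula, a website with no confirmed configurations scores 100, and each additional confirmed configuration lowers the score by 10 points, so a score of 50 corresponds to five confirmed configurations.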

Unlike SIRS, the RTDPA model represents a methodological approach grounded in generative heuristic analysis. It can accommodate a broader linguistic context and potential design implications, and its evaluations do not have to be directly grounded in unambiguous structural proof, such as static HTML code. When an analysis is initiated solely using a URL, RTDPA operates in the heuristic-exploratory mode: the tool does not render the website in a web browser, execute any JavaScript, or simulate user interaction. Hence, its result should be interpreted as a heuristic warning about potential design risk areas based on available structural and semantic inputs.

SIRS minimises the risk of misinterpretations, but its ability to detect subtle, context-dependent forms of manipulation in HTML code is limited. Meanwhile, RTDPA has a larger detection range at the cost of a greater risk of heuristic overproduction of false positives. Therefore, the two approaches should be considered complementary: SIRS as a measurement tool exhibiting a high level of methodological control, and RTDPA as an exploratory system yielding hypotheses that need to be verified.

3.1.2. Operational Principles of the RTDPA Procedure

RTDPA operates on representations of the webpage that approximate the information available to the user, rather than relying on a single technical format. The analytical input, depending on configuration, may include simplified HTML (DOM structure), visible textual content (e.g., interface copy, button labels, or system messages), and inferred interaction cues (e.g., expected outcomes of user actions such as clicks or form submissions). In this study, RTDPA was applied in a URL-based mode, without direct access to rendered interface states or full HTML input; its evaluation therefore relies on heuristic assessment of interface logic based on common UX patterns rather than direct structural inspection.

Screenshots are not treated as the primary input modality and may be used only as auxiliary visual input. The analysis focuses primarily on the structural and semantic properties of the interface, as well as on the inferred logic of the user interaction flow.

The analytical procedure is not based on a single fixed prompt but on a structured prompt framework. This framework consists of a coherent set of instructions organised into modular sections corresponding to typical interface contexts, such as cookie consent banners, subscription mechanisms, and transactional flows. While these modules address different functional areas, they apply a consistent set of evaluation principles across contexts.

The output generated by RTDPA is hybrid. It includes a numerical score in the range 0–100 (Deceptive Design Pattern index, DDP), a qualitative label (e.g., Clean or Problematic), and a structured set of identified patterns. The numerical score is not interpreted in isolation but always in conjunction with the qualitative assessment, which provides context and supports interpretation of the result.

Due to the probabilistic nature of large language models, RTDPA does not produce fully deterministic outputs at the level of individual observations. However, the use of a consistent prompt framework and standardised analytical structure ensures comparability of results at the level of distributions across the analysed website collection.

3.1.3. Operational Principles of the SIRS Procedure

SIRS is a rule-based analytical procedure operating exclusively on static HTML obtained via “view source” or explicitly provided code. The method does not analyse the rendered DOM, JavaScript-driven mutations, external stylesheets, or user interaction flows. Only attributes present in the static HTML structure (e.g., checked, selected, hidden, or inline styles such as display:none) are considered.
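The kind of attribute-level evidence described above could be collected with a standard HTML parser. The sketch below uses Python’s built-in `html.parser` module; the class name `StaticAttributeScanner` and the sample markup are hypothetical, and the code is not the SIRS procedure itself, which is implemented as a prompt-based GPT model. It only flags candidate evidence in static HTML and does not decide whether a complete decision-related component meets a rule.

```python
from html.parser import HTMLParser


class StaticAttributeScanner(HTMLParser):
    """Collects static HTML attributes that may serve as structural evidence."""

    FLAG_ATTRS = ("checked", "selected", "hidden")

    def __init__(self):
        super().__init__()
        self.findings = []  # (tag, evidence, line number in the source)

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)  # boolean attributes arrive with value None
        line = self.getpos()[0]
        for name in self.FLAG_ATTRS:
            if name in attr_map:
                self.findings.append((tag, name, line))
        # Only inline styles are inspected; external stylesheets are ignored.
        style = (attr_map.get("style") or "").replace(" ", "").lower()
        if "display:none" in style:
            self.findings.append((tag, "style=display:none", line))


# Hypothetical markup, used only to exercise the scanner.
sample = """<form id="newsletter">
  <input type="checkbox" name="marketing" checked>
  <a href="/decline" style="display: none">No, thanks</a>
  <button type="submit">Subscribe</button>
</form>"""

scanner = StaticAttributeScanner()
scanner.feed(sample)
for tag, evidence, line in scanner.findings:
    print(f"line {line}: <{tag}> {evidence}")
```

Because the parser sees only the markup passed to it, anything injected later by JavaScript or hidden through an external stylesheet remains invisible, which mirrors the scope restriction of the procedure.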

The unit of analysis is a complete decision-related interface component, such as a cookie consent banner, subscription form, consent form, or checkout block. Individual elements (e.g., input fields or checkboxes) are treated as evidence but are not interpreted in isolation if the relational context of the component is missing.

The procedure is restricted to five predefined structural categories of deceptive design patterns: (1) pre-selected options, (2) hidden or obscured elements, (3) asymmetry of choice, (4) forced or constrained choice structures, and (5) limited availability of alternative options. Each confirmed instance of a rule violation constitutes a single detection. Multiple independent occurrences of the same category within a single component are counted separately (e.g., multiple pre-selected checkboxes corresponding to distinct decision options are recorded as separate detections).

The analytical process follows a deterministic sequence: inspection of static HTML, identification of complete decision-related components, evaluation against the predefined structural categories, recording only detections supported by explicit HTML evidence, aggregation of confirmed cases (N_confirmed), and calculation of the Structural Risk Index (SRI).

The output of SIRS extends beyond the numerical score. It includes a list of confirmed detections with corresponding HTML excerpts, non-scored heuristic observations, elements that cannot be assessed within the adopted methodological scope, calculation of the SRI, explicit methodological constraints, and mapping of confirmed structural categories to potentially related classes of deceptive design patterns. This design ensures full procedural transparency, determinism, and reproducibility, while deliberately limiting detection to explicitly verifiable structural configurations.
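One way to picture this output structure is as a small record type that keeps scored detections apart from non-scored material. The Python sketch below is illustrative only: the names `Detection` and `ScreeningReport`, the field names, and the sample entries are hypothetical and are not taken from the SIRS implementation. The clamp at 0 in the score is likewise an assumption.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Detection:
    category: str      # one of the predefined structural categories
    component: str     # decision-related component, e.g. a consent form
    html_excerpt: str  # static HTML evidence supporting the detection


@dataclass
class ScreeningReport:
    confirmed: List[Detection] = field(default_factory=list)
    heuristic_observations: List[str] = field(default_factory=list)  # not scored
    not_assessable: List[str] = field(default_factory=list)          # out of scope

    def confirm(self, detection: Detection) -> None:
        # A detection counts only if it carries explicit HTML evidence.
        if not detection.html_excerpt.strip():
            raise ValueError("a confirmed detection requires an HTML excerpt")
        self.confirmed.append(detection)

    @property
    def n_confirmed(self) -> int:
        return len(self.confirmed)

    @property
    def sri(self) -> int:
        # Equation (1); the lower bound of 0 is an assumption.
        return max(0, 100 - 10 * self.n_confirmed)


report = ScreeningReport()
# Two independent occurrences of the same category are counted separately.
report.confirm(Detection("pre-selected options", "newsletter form",
                         '<input type="checkbox" name="marketing" checked>'))
report.confirm(Detection("pre-selected options", "newsletter form",
                         '<input type="checkbox" name="partners" checked>'))
report.heuristic_observations.append("Accept button may be visually dominant")
print(report.n_confirmed, report.sri)
```

Keeping heuristic observations in a separate list makes it explicit that they are reported but never enter the score.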

Figure 1 summarises the operational workflows of the two procedures and clarifies the basis of their comparison. The diagram highlights that SIRS and RTDPA rely on fundamentally different types of analytical evidence and processing logic, which directly shape the nature of their outputs. As a result, the comparison between the two approaches does not assume measurement equivalence, but focuses on differences in output behaviour, including distribution, variability, and consistency across observations. This distinction is important for interpreting the results and avoiding overgeneralisation of automated assessments.

3.2. Research Pipeline

First, each website was tested using the selected Google Chrome extensions. The websites were subjected to the following procedure: (1) load the landing page; (2) wait for all static assets to be fully loaded; (3) simulate basic interaction (closing the cookies banner or clicking the default option if it was required to unlock the content); (4) register output from the extension. The results of the measurement tools were recorded as the count of detected patterns, a system message (such as ‘Error’), or a list of reported issues as per the tool’s format (Appendix A.1).

In the second stage of the research procedure, the author recorded the static HTML code (‘view source’) for each investigated website and archived it as a snapshot of the source material. SIRS was fed the complete HTML file, and patterns were identified only by analysing the code’s structure. Each flagged deceptive design pattern had to be unambiguously confirmed by the presence of a relevant HTML code configuration. The number of confirmed configurations was the basis for calculating the SRI as per Equation (1).

In the case of RTDPA, the model generated a heuristic evaluation of potential manipulative patterns based on the website’s URL. It was provided with a publicly available address of the landing page, instead of static HTML code. Therefore, its evaluation was based on a generative, contextualised analysis. As a consequence, RTDPA operated in exploratory mode, allowing for heuristic interpretation and contextualised evaluation, while SIRS analysed only the HTML code provided to it.

To quantify the association between the results from the two different analytical procedures, the generative heuristic RTDPA audit and the rule-based structural screening with SIRS, the author calculated Pearson’s correlation coefficient. The analysis involved pairs of RTDPA and SIRS scores assigned to the same 65 observations representing university websites. Both variables were continuous and expressed using the same numerical scale of 0 to 100, which justifies using a linear correlation metric. The coefficient was interpreted as a measure of the strength and direction of the linear correlation between the two measurements, without considering causal relationships or measurement equivalence.
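For readers who wish to reproduce this step, Pearson’s coefficient and the associated t statistic can be computed as in the Python sketch below. The function names are hypothetical, and the article does not state which software was used for the original calculation. The final lines only check the reported summary values: with r = 0.089 and n = 65, the t statistic is about 0.71 on 63 degrees of freedom, which is consistent with the reported two-sided p ≈ 0.48.

```python
import math


def pearson_r(xs, ys):
    """Pearson's product-moment correlation for paired observations."""
    n = len(xs)
    if n != len(ys) or n < 3:
        raise ValueError("need at least three paired observations")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    if abs(r) >= 1:
        raise ValueError("t is undefined for |r| = 1")
    return r * math.sqrt((n - 2) / (1 - r * r))


# Check against the values reported in the article (r = 0.089, n = 65).
print(round(t_statistic(0.089, 65), 3))
```

The guard for a constant variable matters in this setting, because a procedure that assigns the same score to every website would leave the coefficient undefined.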

4. Results

4.1. Results from Browser Extensions

Data from the five tools for detecting deceptive design patterns exhibited significant differences in both the characteristics of the results and their quantitative distributions. The Langford Dark Pattern Detector reported the number of deceptive design patterns detected, and its results varied significantly across the websites. The most common outcome was no detected signs (29 websites). For some websites, the tool reported many potential deceptive design-pattern problems, with a maximum of ten detected signs (Table 4).

The Pattern Shield extension typically reported being unable to analyse the website. It displayed an error message for 45 of the 65 investigated websites, which means that it provided an interpretable result for only 20 websites. It detected a single deceptive design pattern on 15 websites and two deceptive design patterns on five websites. When the tool did generate a result, the most common pattern category was ‘pre-selected options’, which involves specific choice options being selected by default. Less frequent patterns were ‘forced continuity’, ‘sneak into basket’, ‘confirmshaming’, and ‘misdirection’. In some cases, several pattern categories were reported together. The most common combination was ‘pre-selected options’ accompanied by other categories.

The Dark Pattern Detector exhibited the most diversified results. The number of issues it reported varied from 0 to 104. It did not flag any problems on 12 websites. The remaining observations spanned a broad array of values, with many high issue counts. The distribution was very wide, with many extreme values, yet the tool provided no information on the severity of individual issues. The primary category reported by this tool was ‘potential blended ad: styled like native content, flagged for review’. This pattern was the most common and basic detection type in the collection. It indicates a potential similarity between advertising components and the website’s original content (native-like styling). The tool reported these issues as signs to be verified (‘flagged for review’), rather than a clear confirmation of a violation. The second most common issue category was ‘pre-checked checkbox (possible dark pattern)’. It denotes options with a checkbox selected by default, which is classified as a potential opt-in asymmetry. In some cases, ‘pre-checked checkbox’ and ‘blended ad’ co-occurred. The third category was ‘forced modal with no visible close option’. It denotes modal dialogue boxes for which no close option is displayed. It was less common than the two previous patterns, but occurred in parallel with other deceptive designs in certain cases.

The results indicate that the web browser tools generate data of various characteristics and distributions. As a consequence, the values fail to provide a uniform basis for comparison; instead, they exhibit significant variation in the number of reported signs across the investigated collection of websites. The observed inconsistency of outputs and the high rate of failed analyses indicate that browser-based extensions currently do not provide a stable or methodologically consistent basis for comparative evaluation across large website collections.

4.2. Results from the GPT Models

Results from the GPT models indicate clear differences between evaluations generated by the heuristic RTDPA model and the outcomes of structural screening with SIRS. The DDP values from RTDPA clustered around 90 points (mean 90.03, median 90, and standard deviation 1.55), which indicates minor score variability across the websites. The distribution spanned a narrow interval (80–95 points), and most websites were classified as ‘mostly clean’, indicating a low level of signs of potential deceptive design patterns. The results reported by SIRS were very different. They exhibited much greater diversity of SRI values (mean 89.23, median 90, and standard deviation 11.50), with scores ranging from 50 to 100 points (Table 5).

Correlation analysis between DDP (RTDPA) and SRI (SIRS) revealed a very weak linear association. Pearson’s correlation coefficient (r = 0.089, p ≈ 0.48) does not indicate a significant association between the two indicators. This suggests that the tools analyse different interface attributes and generate hardly comparable results. Moreover, the first, second, and third quartiles were identical for RTDPA (Q1 = 90, median = 90, and Q3 = 90), indicating that at least 50% of the observations had exactly the same score and at least 75% of the results did not exceed 90 points. The interquartile range of zero indicates no diversification of results in the central half of the distribution, and the minimum of 80 points shows that there were no low extreme values. SIRS’s results exhibit different characteristics. Their quartiles indicated a clear dispersion (Q1 = 80, median = 90, and Q3 = 100), with an interquartile range of 20 points. When compared, the percentile distributions indicate that RTDPA yields a very concentrated distribution of scores, whereas SIRS shows much greater variability across the collection of websites. Therefore, the results confirm that heuristic generative evaluation with RTDPA combines a high level of stability with a poor capability to discriminate between the websites, while the rule-based structural screening with SIRS reveals actual differences due to confirmed high-risk interface configurations.
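The quartile comparison can be illustrated with the Python sketch below, which uses `statistics.quantiles` from the standard library. The two score lists are hypothetical and are not the study data; they were chosen only to reproduce the two distribution shapes discussed here. The ‘inclusive’ quantile method is also an assumption, as the article does not state which quartile definition was applied.

```python
from statistics import quantiles


def quartile_summary(scores):
    """Return (Q1, median, Q3, IQR) using the 'inclusive' quantile method."""
    q1, q2, q3 = quantiles(scores, n=4, method="inclusive")
    return q1, q2, q3, q3 - q1


# Hypothetical scores, NOT the study data.
concentrated = [80, 90, 90, 90, 90, 90, 90, 95]   # RTDPA-like shape
dispersed = [50, 80, 80, 90, 90, 100, 100, 100]   # SIRS-like shape

print(quartile_summary(concentrated))
print(quartile_summary(dispersed))
```

In the concentrated list, all three quartiles coincide and the interquartile range is zero even though the minimum and maximum differ, which shows why a zero interquartile range describes only the central half of a distribution and not its tails.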

Figure 2 presents the relationship between RTDPA and SIRS values. Each observation represents a single website, while the identity line y = x represents a hypothetical full consistency between the two models. The concentration of scores stems from RTDPA assigning the same or nearly the same scores to many websites. This leads to overlapping observations in the chart space, suggesting low dispersion in the model’s results. In contrast, SIRS offers greater diversity of scores at the same RTDPA values, indicating different sensitivities of the models to the structural attributes of the websites.

The distribution of points in relation to the y = x line does not indicate a systematic consistency between the scores from the two models (Figure 2a). This means that RTDPA and SIRS do not order the investigated websites in a comparable way, despite using the same nominal scoring scale. The high stability of RTDPA scores does not entail consistency with SIRS scores; instead, it indicates a limited ability of RTDPA to differentiate between the investigated websites. Figure 2b shows the distribution of differences between scores generated by SIRS and RTDPA. The differences do not cluster at zero, and their distribution shows a clear-cut asymmetry. This indicates that the differences between the two models’ scores are not random. Therefore, the models do not generate comparable results, and the differences are structural.

5. Discussion

As opposed to large-scale studies with dedicated crawlers and custom programming tools, such as those reported by Mathur et al. [4], this article assumes an operational perspective approximating audits by a UX designer, tester, or website quality expert. The tools used in this case, particularly browser extensions and GPT models run in a browser environment, are available without constructing additional back-end infrastructure, creating dedicated software, or performing mass network crawls. Moreover, the analysis was performed on the client side (thin client) on a rendered version of the website available to regular users. This means that the research procedure partly simulates actual conditions of user-interface interaction, rather than a laboratory analysis based on automated processing of large sets of HTML data. This approach improves the availability and replicability of the research procedure under practical operational conditions. Therefore, the results should be interpreted as an effect of using web-browser auditing tools rather than an outcome of a sophisticated detection pipeline.

5.1. Observations

The RTDPA (DDP) results for the website collection are mostly similar and cluster around 90 points. The interquartile range of zero (IQR = 0) indicates no differentiation of scores in the central part of the distribution. Still, the interpretation of the result is not unambiguous. It can reflect a relative uniformity in the set and a low actual level of deceptive design patterns on university websites. At the same time, the concentration of DDP values could be due to RTDPA’s limited sensitivity in differentiating among the investigated websites. This would suggest that the tool is primarily a screening procedure for identifying potential risk areas, rather than a reliable detection mechanism. Characteristics of university websites could additionally promote uniform scoring by RTDPA. These websites serve mostly informational purposes, are under low commercial pressure, and employ few sales mechanisms, which organically limits the risk of sophisticated manipulative patterns. The investigated websites lack high-risk mechanisms typical of commercial settings, such as forced subscription, auto-renewal (forced continuity), hidden costs, time pressure (fake scarcity), difficult cancellation (roach motel), or aggressive nagging. As a result, the deceptive design pattern evaluation could remain uniform across the sample. The only recurring component that could be potentially problematic is the structure of cookie consent banners. They frequently exhibit choice asymmetry; the acceptance button may be more prominent than the refusal button, and refusal may require additional effort. These obstacles are classified as interface interference and obstructive deceptive design patterns. As this pattern occurred in a similar manner on many websites, the score reduction in subsequent analyses was similar.

The comparison between the results from web browser extensions and the RTDPA and SIRS models demonstrates significant differences in how the problems are operationalised and in the results’ paradigms. The browser extensions (add-ons) generate results that vary in form: from simple reports on detected problems and lists of issues to no reports due to processing errors. Additionally, there is no clear-cut definition of a ‘measurement unit’ or transparent principles of aggregating scores, which hinders direct comparison. As a result, the values they generate fail to establish a common plane for comparison. In this context, RTDPA and SIRS represent different qualitative paradigms. The first one generates stable, highly clustered distributions of results, reflecting the heuristic nature of generative evaluation. Meanwhile, the other reveals variability in results due to the number of confirmed structural configurations in HTML code. It is noteworthy that the low level of correlation between RTDPA’s values and SIRS’s results indicates that the tools operationalise the same phenomenon differently and generate results that are hardly comparable. Hence, empirical data do not support the assumption that it is possible to fully automate the detection of deceptive design patterns. On the contrary, they indicate the need to clearly distinguish between tools that flag potential risk areas, heuristics-based exploratory procedures, and rule-based structural screening methods. This distinction is necessary to avoid illusory precision of results generated by automated tools in research on deceptive design patterns on university websites.

These observations indicate that automated detection of dark patterns is possible today only to a limited extent. Furthermore, it requires a clear distinction between tools for signalling potential risks, research prototypes, and methods based on structural proof. Tools that claim to perform fully automated audits without access to specifications generate results with limited measurement accuracy. They should be used only to support exploratory analysis, rather than for standalone evaluations of the design quality of websites.

5.2. Answers to Research Questions

The use of a GPT model as a tool for automated detection of deceptive design patterns entails a significant risk of results that are seemingly precise, yet methodologically uncertain. Large language models are neither deterministic DOM (Document Object Model) parsers nor rules engines. Instead, they are language models that optimise response coherence. This means that when they come across missing data, they could fill in interpretative gaps and offer results that are not actually grounded in the code. An analysis based only on ‘view source’ can detect only structural interface attributes (such as pre-checked checkboxes, no alternative options, or specific wording of linguistic pressure), but it cannot evaluate the visual, dynamic, or behavioural layers, which often determine the manipulative character of the pattern. Consequently, its scoring could seem objective, even though it is in fact an outcome of a heuristic interpretation by the model. Therefore, it is necessary to curb the risk of misinterpretation by ‘forcing proof’ (for example, by requiring that the tool indicate the excerpts of HTML code relevant to each detected problem), explicitly defining scoring rules, and clearly distinguishing structural detections from hypotheses that require manual validation. In this setting, the GPT model can support screening, but it does not serve as an autonomous detector of deceptive design patterns. All of this positions the automated solutions employed in this study in the screening paradigm, rather than as instruments of expert audit.

The juxtaposition of SIRS and RTDPA reveals a fundamental methodological difference between the rule-based approach, in which code is analysed, and the generative-heuristic approach. SIRS takes into account solely static HTML code provided by the server and identifies only those interface configurations that are directly observable and verifiable within the document’s structure. This ensures a high level of transparency, determinism, and replicability of the results while limiting the scope of detection to attributes found in the code. As a result, SIRS offers high precision but a narrow measurement range. RTDPA represents a different perspective. It is based on a broad, heuristic interpretation and generative analysis of the linguistic context and potential design implications. In this way, it can identify more potential manipulative patterns at the expense of poorer transparency of the evaluation process. Their differences make the tools complementary: SIRS can serve as a controlled measurement tool, while RTDPA is an exploratory system that offers hypotheses to be verified later.

Regarding RQ1, which concerns the paradigm and limitations of automated analysis of deceptive design patterns, the present results show that the automated tools primarily serve the purpose of screening and flagging potential design-risk areas. Third-party tools, such as browser extensions (plugins), generate results exhibiting limited stability and comparability that do not conform to the criteria of unambiguous quantification. GPT models, on the other hand, generate results based on a different methodology, depending on the specific procedure (heuristic or rule-based). Only the procedure that employs rule-based structural scanning (SIRS) can automatically identify verified interface configurations in a transparent and replicable manner.

Regarding RQ2, which addresses the reliability of automated auditing, the results indicate that today’s automated tools fail to provide sufficient stability and comparability to be considered a fully reliable auditing aid. In addition, the greater variety in SIRS results follows from its analysis being based on clearly predefined configurations, which improves the transparency and controllability of the measurement while simultaneously restricting detection to formally specified technical attributes of the interface.

Concerning RQ3, the results do not suggest any systemic high-risk manipulative patterns in the collection of websites. Both the RTDPA’s results and the structural values of SIRS place most of the investigated websites in the low- or moderate-risk zones. The problems they detected are isolated and restricted to individual interface configurations. Note that the results should be interpreted as an assessment of the structural risk potential rather than a final diagnosis.

5.3. Three-Tier Deceptive Design Pattern Detection Model

The present results suggest dividing the deceptive design pattern detection process into three levels:

Tier 1: mechanical (algorithmic, deterministic, structure-based). It concerns the detection of clearly definable interface artefacts such as pre-checked checkboxes, choice-option asymmetry, specific form attributes, recurring wording, or specific DOM configurations. In this case, deterministic rules and universal analytical procedures can be employed. This tier is highly controllable and repeatable, but covers only clearly identifiable structural elements.
Tier 2: heuristic (a GPT model as an analytical assistant supporting heuristic interface analysis). At this level, LLM-based classification takes place. The tools suggest potential risk areas and interpret the patterns in the linguistic and functional contexts. The detection is probabilistic and must be validated by an expert, making it an automated, indicative prescreening. The tool is an analytical support rather than an autonomous measurement instrument.
Tier 3: interpretative (qualification of the manipulativeness of design). The highest level involves assessment of whether a specific interface configuration can be considered manipulative in design and regulatory contexts. It requires an analysis of the context of use, cognitive asymmetry, user choice architecture, and potential design intent. It cannot be fully automated as of today, and remains a domain of expert audit.

This three-pronged scheme facilitates distinguishing between the detection of structural attributes and normative qualification. It also reduces the risk of equating algorithmic signalling with a fully fledged audit.

6. Conclusions

The empirical results indicate that the investigated university websites rarely resort to deceptive design patterns, and any such instances are limited in scope. No high-risk patterns typical of commercial environments, such as forced subscription, hidden costs, time pressure, or cancellation obstruction, were identified under the adopted measurement procedure. The most common potentially problematic element was the design of cookie consent banners, which exhibited an asymmetry between the acceptance and refusal options.

Under the employed research design, the present results do not confirm the assumption that reliable, fully automated detection of deceptive design patterns is possible using tools available to a front-end user. The browser extensions exhibited poor operational performance and a high rate of failed or incomplete tests. Their results are indicative, follow no transparent detection rules, and lack a clearly defined measurement unit or principles for aggregating results. Therefore, they cannot be considered methodologically reliable auditing instruments. At the same time, the results confirm the usefulness of automated measurement procedures as screening tools, effectively flagging potential areas at risk of deceptive design patterns across large collections of websites.

The generative LLM-based analysis has similar limitations. Although language models can generate coherent and convincing narratives about potential manipulative patterns, their evaluations are mostly heuristic and probabilistic rather than proof-based. If no clear, unambiguous structural proof is available, the model can fill in the interpretative gaps by generating a score whose accuracy remains dubious. This scoring creates the illusion of objective, precise measurement, whereas it is actually a structured opinion of the model rather than the result of a deterministic analytical procedure.

A deceptive design pattern audit requires expertise in information architecture, including choice architecture, interface design, and regulatory context. Automated tools can support pre-screening for selected, predefined attributes of a hypertext document, such as pre-checked checkboxes, option asymmetry, or covert opt-in/opt-out configurations. Still, they are incapable of offering a reliable assessment of visual manipulation, perception asymmetry, sequential cancellation obstruction, or contextualised decision pressure. Hence, automated auditing procedures should not be considered a substitute for expert evaluation but rather a preliminary step to identify risk areas for further in-depth analysis.

These findings indicate that the current capabilities of automated tools offer merely fragmented detection of selected deceptive design patterns, instead of a complete systemic diagnosis of the problem. This limitation is inherent in the characteristics of deceptive design patterns as context-dependent, relational, and often dynamic phenomena, whereas the tools employ simplified structural or heuristic representations. Failing to differentiate between risk signalling and full measurement creates a methodological risk that can be described as the illusion of automated precision.

6.1. Practical Implications

The role of automated tools for detecting deceptive design patterns should be limited to the initial identification of risk signs based on the website’s structural attributes that can be extracted from the code. A contextualised analysis and evaluation of the component’s actual impact on user decisions are necessary to classify a design as manipulative. For auditing practice, this means it is necessary to make a clear-cut distinction between two analytical levels: (1) preliminary automated detection of structural attributes and (2) in-depth expert evaluation of interface conformity with design standards for ethical choice architecture. Treating the output of the tool, including a result or score, as a direct indicator of the website’s conformity with design and regulatory standards is a methodological oversimplification that could lead to false judgements and misleading comparisons.

The article argues for employing a hybrid evaluation model to identify deceptive design patterns, which is highly relevant to public institutions, including supervisory bodies, web administrators, web designers, and internal test teams. The model combines algorithmic identification of cases requiring further verification or optimisation (flagging critical areas) with in-depth expert evaluation within a single auditing procedure. As technology advances and auditors’ and testers’ expectations grow, researchers and practitioners seek automated solutions that combine detecting isolated attributes of hypertext documents with analysing the visual, sequential, and contextualised dimensions of user interaction. Nevertheless, any deceptive design pattern inspection still requires human effort and expert analysis.

6.2. Limitations and Future Research

The study is subject to specific methodological limitations due to the adopted measurement procedure and the paradigms of the employed analytical tools. The research procedures focus on identifying formally occurring structural attributes of the interface and risk signs that can be detected without simulating full user interaction. Because the analysis relied, among other inputs, on static HTML records, it excluded the dynamic, visual, and behavioural interface layers, which are often critical for identifying deceptive design patterns. The analysis therefore focused mostly on the systemic configurations of a web interface, rather than on the complete array of dynamic behaviours that occur only after several consecutive user steps or actions. The study does not include multi-step user interactions, subscription flows, or cancellation processes, which may be critical for identifying certain types of deceptive design patterns. These aspects require context-dependent and sequential analysis and are therefore better suited to in-depth case studies than to large-scale, standardised comparative research.

Moreover, the employed tools operated at different levels of abstraction and lacked a shared analytical framework, which limited their direct comparability. In the case of generative, heuristic configurations of GPT models, whose evaluations are not unambiguously grounded in static HTML code analysis, the results are probabilistic, which adds to the risk of misinterpretation and hallucination. The study does not include a quantitative evaluation of false positives and false negatives, as it does not rely on a ground truth benchmark or binary classification framework. Instead, the analysis focuses on the characteristics, variability, and comparability of outputs generated by different automated procedures. As a result, potential over-detection or under-detection is discussed in qualitative terms, particularly in relation to heuristic and rule-based approaches.

Additionally, the sample consisted of a uniform category of websites (universities), which ensures comparability but limits the generalisability of the findings to domains characterised by stronger commercial pressure, where deceptive design patterns may be more prevalent and structurally different.

Further studies should focus on determining the boundaries of the automation potential of deceptive design pattern detection, particularly by distinguishing patterns whose structure can be easily formalised from those that require contextualised interpretation. At the same time, it is necessary to develop and validate tools that integrate structural, visual, and sequential analysis to capture the relational and dynamic nature of user interactions. From a broader research perspective, this entails determining whether deceptive design can be identified through objective automated detection or, by definition, requires expert interpretation grounded in design and regulatory context.

Funding

Co-financed by the Minister of Science under the ‘Regional Initiative of Excellence’ programme. Agreement No. RID/SP/0039/2024/01. Subsidised amount: PLN 6,187,000.00. Project period: 2024–2027.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset supporting the findings of this study is available at: https://doi.org/10.6084/m9.figshare.31375567.

Conflicts of Interest

The author declares no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

AI	Artificial Intelligence
DDP	Deceptive Design Patterns
DOM	Document Object Model
DSA	Digital Services Act
GDPR	General Data Protection Regulation
GPT	Generative Pre-trained Transformer
HTML	HyperText Markup Language
IQR	Interquartile Range
LLM	Large Language Model
RTDPA	Real-Time Deceptive Pattern Auditor
SIRS	Structural Interface Risk Screening
SRI	Structural Risk Index
URL	Uniform Resource Locator
UX	User Experience

Appendix A

Appendix A.1

The measurement data used in the study are made available to ensure transparency and replicability of the research. The dataset includes the results of an automated audit of the websites for deceptive design patterns and is available in an open repository under DOI: https://doi.org/10.6084/m9.figshare.31375567 (accessed on 28 March 2026).

Appendix A.2

RTDPA (Real-Time Deceptive Pattern Auditor) is an autonomous analytical procedure implemented in a GPT-based environment. It classifies and evaluates the prevalence of deceptive design patterns on websites based on a URL input. The system operates in a heuristic mode and generates a structured output consisting of a numerical score (DDP), a qualitative label, and a list of identified patterns. The analytical procedure is defined through a structured prompt framework and does not involve model training. It requires ChatGPT Plus.

Appendix A.3

SIRS (Structural Interface Risk Screening) is a rule-based analytical procedure implemented in a GPT-based environment. It analyses static HTML code to identify structural interface attributes that may correspond to deceptive design patterns. The procedure operates on predefined detection rules and requires explicit structural evidence for each identified instance. It produces a numerical score (SRI) based on confirmed detections and provides a transparent, reproducible output. The analytical procedure is defined through a structured prompt framework and does not involve model training. It requires ChatGPT Plus.

References

1. Morales-Vargas, A.; Pedraza-Jimenez, R.; Codina, L. Website Quality Evaluation: A Model for Developing Comprehensive Assessment Instruments Based on Key Quality Factors. J. Doc. 2023, 79, 95–114.
2. Muhammad, A.; Siddique, A.; Naveed, Q.N.; Khaliq, U.; Aseere, A.M.; Hasan, M.A.; Qureshi, M.R.N.; Shahzad, B. Evaluating Usability of Academic Websites through a Fuzzy Analytical Hierarchical Process. Sustainability 2021, 13, 2040.
3. Luguri, J.; Strahilevitz, L.J. Shining a Light on Dark Patterns. J. Leg. Anal. 2021, 13, 43–109.
4. Mathur, A.; Acar, G.; Friedman, M.J.; Lucherini, E.; Mayer, J.; Chetty, M.; Narayanan, A. Dark Patterns at Scale: Findings from a Crawl of 11K Shopping Websites. Proc. ACM Hum.-Comput. Interact. 2019, 3, 1–32.
5. Ji, Z.; Lee, N.; Frieske, R.; Yu, T.; Su, D.; Xu, Y.; Ishii, E.; Bang, Y.J.; Madotto, A.; Fung, P. Survey of Hallucination in Natural Language Generation. ACM Comput. Surv. 2023, 55, 1–38.
6. Zając, P.; Królak, A. Analysis of the Digital Accessibility of Selected Polish Government Portals for Persons with Disabilities. Univers. Access Inf. Soc. 2025, 24, 2705–2720.
7. Fakrudeen, M. Evaluation of the Accessibility and Usability of University Websites: A Comparative Study of the Gulf Region. Univers. Access Inf. Soc. 2025, 24, 1883–1898.
8. Ara, J.; Sik-Lanyi, C.; Kelemen, A.; Guzsvinecz, T. An Inclusive Framework for Automated Web Content Accessibility Evaluation. Univers. Access Inf. Soc. 2025, 24, 1581–1607.
9. Lannelongue, L.; Grealey, J.; Inouye, M. Green Algorithms: Quantifying the Carbon Footprint of Computation. Adv. Sci. 2021, 8, 2100707.
10. Persson, H.; Åhman, H.; Yngling, A.A.; Gulliksen, J. Universal Design, Inclusive Design, Accessible Design, Design for All: Different Concepts—One Goal? On the Concept of Accessibility—Historical, Methodological and Philosophical Aspects. Univers. Access Inf. Soc. 2015, 14, 505–526.
11. Nouwens, M.; Liccardi, I.; Veale, M.; Karger, D.; Kagal, L. Dark Patterns after the GDPR: Scraping Consent Pop-Ups and Demonstrating Their Influence. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, Honolulu, HI, USA, 21 April 2020; ACM: New York, NY, USA, 2020; pp. 1–13.
12. Gray, C.M.; Santos, C.T.; Bielova, N.; Mildner, T. An Ontology of Dark Patterns Knowledge: Foundations, Definitions, and a Pathway for Shared Knowledge-Building. In Proceedings of the CHI Conference on Human Factors in Computing Systems, Honolulu, HI, USA, 11 May 2024; ACM: New York, NY, USA, 2024; pp. 1–22.
13. Münscher, R.; Vetter, M.; Scheuerle, T. A Review and Taxonomy of Choice Architecture Techniques. Behav. Decis. Mak. 2016, 29, 511–524.
14. Szaszi, B.; Palinkas, A.; Palfi, B.; Szollosi, A.; Aczel, B. A Systematic Scoping Review of the Choice Architecture Movement: Toward Understanding When and Why Nudges Work. Behav. Decis. Mak. 2018, 31, 355–366.
15. Escobar, G.G.; Mitchell, S.H. A Systematic Review of Effort Discounting Research in Humans: Current Knowledge, Recommendations, and Future Directions. Judgm. Decis. Mak. 2025, 20, e33.
16. Gray, C.M.; Sanchez Chamorro, L.; Obi, I.; Duane, J.-N. Mapping the Landscape of Dark Patterns Scholarship: A Systematic Literature Review. In Proceedings of the Designing Interactive Systems Conference, Pittsburgh, PA, USA, 10 July 2023; ACM: New York, NY, USA, 2023; pp. 188–193.
17. Mathur, A.; Kshirsagar, M.; Mayer, J. What Makes a Dark Pattern... Dark? Design Attributes, Normative Considerations, and Measurement Methods. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, Yokohama, Japan, 6 May 2021; ACM: New York, NY, USA, 2021; pp. 1–18.
18. Gray, C.M.; Kou, Y.; Battles, B.; Hoggatt, J.; Toombs, A.L. The Dark (Patterns) Side of UX Design. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, Montreal, QC, Canada, 21 April 2018; ACM: New York, NY, USA, 2018; pp. 1–14.
19. Owens, K.; Gunawan, J.; Choffnes, D.; Emami-Naeini, P.; Kohno, T.; Roesner, F. Exploring Deceptive Design Patterns in Voice Interfaces. In Proceedings of the 2022 European Symposium on Usable Security, Karlsruhe, Germany, 29 September 2022; ACM: New York, NY, USA, 2022; pp. 64–78.
20. Chordia, I.; Tran, L.-P.; Tayebi, T.J.; Parrish, E.; Erete, S.; Yip, J.; Hiniker, A. Deceptive Design Patterns in Safety Technologies: A Case Study of the Citizen App. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, Hamburg, Germany, 19 April 2023; ACM: New York, NY, USA, 2023; pp. 1–18.
21. Król, K. Between Truth and Hallucinations: Evaluation of the Performance of Large Language Model-Based AI Plugins in Website Quality Analysis. Appl. Sci. 2025, 15, 2292.
22. Ofori-Boateng, R.; Aceves-Martins, M.; Wiratunga, N.; Moreno-Garcia, C.F. Towards the Automation of Systematic Reviews Using Natural Language Processing, Machine Learning, and Deep Learning: A Comprehensive Review. Artif. Intell. Rev. 2024, 57, 200.
23. Langford, A.; Lin, S.; Rakower, R.; Tao, T. Dark Pattern Detector. Available online: https://dark-pattern-detector.xyz/about/ (accessed on 12 February 2026).
24. Zhao, H.; Chen, H.; Yang, F.; Liu, N.; Deng, H.; Cai, H.; Wang, S.; Yin, D.; Du, M. Explainability for Large Language Models: A Survey. ACM Trans. Intell. Syst. Technol. 2024, 15, 1–38.
25. Król, K. From Local Interfaces to Global Challenges: Auditing Digital Noise on University Websites in Poland. Information 2025, 16, 1047.
26. List of Public Higher Education Institutions Supervised by the Minister for Higher Education and Science—Public Universities. Available online: https://www.gov.pl/web/nauka/wykaz-uczelni-publicznych-nadzorowanych-przez-ministra-wlasciwego-ds-szkolnictwa-wyzszego-i-nauki-publiczne-uczelnie-akademickie (accessed on 28 March 2026).
27. Pattern Shield: Real-Time Deceptive Pattern Detector. Available online: https://www.patternshield.com/ (accessed on 28 March 2026).
28. dela Cruz, F.M. Dark Pattern Detector. Available online: https://www.floramaydc.com/ (accessed on 28 March 2026).

Figure 1. Conceptual comparison of the SIRS and RTDPA analytical procedures. Arrows indicate the direction of the analytical process.

Figure 2. Comparison of RTDPA and SIRS score behaviour across the same set of websites: (a) scatter plot showing the dispersion and consistency between RTDPA and SIRS scores, with the identity line (y = x, dashed line) indicating hypothetical full concordance between the models; (b) distribution of score differences calculated as Δ = SIRS (SRI) − RTDPA (DDP), illustrating the magnitude and direction of discrepancies between the two models across individual websites.

Table 1. Concepts of dark patterns and deceptive design patterns.

Criterion	Dark Patterns	Deceptive Design Patterns
Origin of the term	A popular science term introduced by the UX community	A normative and regulatory term used in research, compliance, and ethics
Literature references	[4,12,17]	[12,19,20]
Paradigm	A normative and evaluative concept emphasising ethically unacceptable or intentionally manipulative design practices	An analytical and operational concept aimed at describing observable interface configurations and their impact on the decision-making process, regardless of the designer’s intent
Semantic focus	Moral evaluation and presumed intentionality of manipulation	Operational principles and decision impact regardless of the designer’s intent
Designer’s intent	Often implied (manipulation); assumes intentional manipulative effort on the part of the designer	Need not be determined; the analysis rests on observable interface attributes and their effects
Phenomenological scope	A finite set of typical, clearly manipulative design patterns	A broad framework of diverse interface configurations that affect choice architecture
Legal and regulatory aspects	Weak/indirect. The notion is found in regulatory discourse mostly in descriptions and critical contexts	Strong (GDPR, DSA, and consumer protection); the notion is consistent with regulatory discourse and aimed at design effects and the quality of users’ decisions
Detection automation	Limited; domination of expert judgement and context-based interpretation	Unambiguous description using structural rules, potential for formalisation of rules and structural detection at scale
Typical examples	Confirmshaming, fake scarcity, and trick questions	Opt-in/opt-out asymmetry, covert costs, and problematic unsubscribing

Table 2. Profiles of automated tools used to identify deceptive design patterns.

ID	Tool	Type of Tool/Developer
1	Langford Dark Pattern Detector	Chrome * browser extension (v0.1.2) [23]
2	Pattern Shield: Real-Time Deceptive Pattern Detector ^	Chrome * browser extension (v1.0.0) [27]
3	Dark Pattern Detector	Chrome * browser extension (v0.2) [28]
4	Real-Time Deceptive Pattern Auditor (RTDPA)	RTDPA (v1.0, implemented in ChatGPT v5.2, OpenAI, San Francisco, CA, USA) (Appendix A.2)
5	Structural Interface Risk Screening (SIRS)	SIRS (v1.0, implemented in ChatGPT v5.2, OpenAI, San Francisco, CA, USA) (Appendix A.3)

* Google Chrome (v144.0.7559.133, Google LLC, Mountain View, CA, USA); ^ medium detection sensitivity.

Table 3. Comparison of SIRS and RTDPA.

Dimension	SIRS	RTDPA
Analytical approach and system characteristics	Deterministic, rule-based, transparent structural system	Generative, heuristic, black-box system
Scope of input	Static, server-delivered HTML (View source)	HTML + contextualised interpretation (often heuristic)
Detection type	Structural attributes of the interface	Potential manipulative patterns
Perceived designer’s intent	None; not considered	Often implied
Proof required (HTML code)	Obligatory	Optional
Heuristic layer	Separate, non-scored	Integrated with score
Score/indicator	Overt formula, Structural Risk Index (SRI), range: 0–100 plus DP Mapping—Dark Pattern Mapping	Typically narrative/model-based, Deceptive Design Patterns (DDPs), range: 0–100 with a label
Determinism	High (fixed rules)	Variable (probabilistic)
Replicability	High	Medium/low
Sensitivity	Low–moderate	High
Risk of over-detection	Low	High
Scope of UX interpretation	Structure only	Broad (language, context, presumptions)
Visual layer evaluation	None	Heuristic
Analysis of dynamic behaviour	None	Heuristic
Analytical perspective	Measurement tool	Exploratory tool

Table 4. Characteristics of the results from web browser extensions for detecting deceptive design patterns.

Tool	Result Type	Minimum Value	Maximum Value	Missing Data Count
Langford Dark Pattern Detector	Detected signs count (numerical value)	0	10	0
Pattern Shield	Pattern count/‘Error’ message	1	2	45
Dark Pattern Detector	Number of issues (numerical value)	0	104	0

Table 5. Summary of statistics for the results generated by the GPT models.

Statistic	RTDPA (DDP)	SIRS (SRI)
Observation count (N)	65	65
Mean	90.03	89.23
Median	90	90
Standard deviation	1.55	11.50
Minimum value	80	50
Maximum value	95	100
Score range	80–95	50–100

The comparison of the RTDPA’s values and SIRS’s results is intended to analyse the paradigm and stability of results generated by different analytical methodologies, rather than evaluate detection performance.

