Article

Beyond DOM: Unlocking Web Page Structure from Source Code with Neural Networks

by
Irfan Prazina
*,
Damir Pozderac
and
Vensada Okanović
Faculty of Electrical Engineering, University of Sarajevo, 71000 Sarajevo, Bosnia and Herzegovina
*
Author to whom correspondence should be addressed.
AI 2025, 6(9), 228; https://doi.org/10.3390/ai6090228
Submission received: 24 July 2025 / Revised: 27 August 2025 / Accepted: 9 September 2025 / Published: 12 September 2025

Abstract

We introduce a code-only approach for modeling web page layouts directly from their source code (HTML and CSS only), bypassing rendering. Our method employs a neural architecture with specialized encoders for style rules, CSS selectors, and HTML attributes. These encodings are then aggregated in another neural network that integrates hierarchical context (sibling and ancestor information) to form rich representational vectors for each web page element. Using these vectors, our model predicts eight spatial relationships between pairs of elements, focusing on edge-based proximity in a multilabel classification setup. For scalable training, labels are automatically derived from the Document Object Model (DOM) data of each web page; during inference, however, the model uses neither bounding boxes nor any DOM information, relying solely on the source code as input. This approach facilitates structure-aware visual analysis in a lightweight and fully code-based way. Our model demonstrates alignment with human judgment in the evaluation of web page similarity, suggesting that code-only layout modeling offers a promising direction for scalable, interpretable, and efficient web interface analysis. The evaluation metrics show that our method achieves comparable performance despite relying on less information.

1. Introduction

The success of a web page depends on more than just its content and functionality. The layout of a web page is often overlooked, yet it can dictate how users consume information and how they interact with the page. The layout is the structure and arrangement of a page's visual elements: the way content such as text, images, buttons, and navigation menus is organized and positioned. Despite its importance, analyzing web page layout has traditionally depended on browser rendering or browser-based automation tools. Analyzing layouts purely from code (HTML and CSS), without using a browser, has received much less attention, leaving a significant gap in existing research. Techniques that depend on browser rendering can be sensitive to differences in browser state: factors such as window size, user settings, or even ad blockers can alter the visual output of the same web page, potentially leading to inconsistent or unreliable analysis. By contrast, our approach analyzes web page layouts directly from the source code and is therefore less susceptible to these browser-related issues. By bypassing the rendering step, our method aims to provide consistent results regardless of the device or user environment. We hope this work offers a new, more consistent perspective on web UI analysis.
A well-designed layout is like an invisible hand that guides users and provides a pleasant user experience. Usability is one of the first objectives to keep in mind when building a web page. This becomes even more critical in our multi-device world, where the same content must seamlessly adapt to screens of all shapes and sizes.
A good user interface layout creates a visual hierarchy that guides users through the most important content. It also helps maintain consistency, making the application feel like a unified and intuitive whole. Even white space is important; it improves readability and helps avoid a cluttered look. Small errors in layout can negatively impact the user experience and make a web application difficult to use. This is why understanding and analyzing the impact of web layout is so crucial for creating effective web pages/applications.
The information architecture (IA) of a web application is also crucial to the user experience. IA is a way of organizing content to make a webpage easy to use. While a layout ensures elements are correctly positioned, IA focuses on the efficiency and clarity of the user interface (UI) so users can navigate it easily. According to Rosenfeld et al., “Good information architecture design is informed by all three areas [users, content, context], and all three are moving targets” [1].
Poor IA can lead to user frustration, no matter how visually appealing the layout (or the content) is. Even a correct layout can fail if it lacks a clear and consistent information structure.
In this paper, we focus primarily on the spatial and visual aspects of layout. However, we recognize that these aspects are interconnected with the higher-level information architecture, which influences how users find and understand the structured content within the interface.
Different approaches for the implementation of a web layout offer distinct advantages and disadvantages, especially when considering the diverse set of devices users usually use. The main approaches are as follows:
  • Fixed Layout—This approach uses constant width (and height) values for elements, regardless of screen dimensions. It is straightforward to design and predictable in behavior, but it is not ideal for users accessing the site on mobile devices, as content may appear cut off or require excessive zooming. This approach is no longer used today; it is included here as a historical step in the evolution of layout methods.
  • Fluid Layout—Instead of fixed values, this method utilizes relative units like percentages. This allows content to expand and contract based on screen size. Although it offers flexibility, if not implemented correctly, it can often lead to layout errors and become difficult to maintain.
  • Adaptive Layout—This type uses predefined width breakpoints, applying different layouts for each. This allows the layout to be adaptable to specific predetermined screen sizes. The drawback is that it does not behave fluidly between these breakpoints, meaning transitions can appear abrupt.
  • Responsive Layout—This modern approach combines fluid grids, flexible images, and media queries (special rules that apply styles based on the browser window or screen type). This allows layouts to dynamically adjust to any screen width, offering a seamless experience across all devices. Ethan Marcotte provides an excellent overview of the core technologies needed for responsive design in his work [2].
Beyond traditional strategies, modern web design uses new techniques to meet user expectations and work on different devices. Single-page applications (SPAs) and long-page designs improve interaction by preventing page reloads and allowing continuous scrolling. In the Mozilla documentation, SPA is defined as follows: “SPA is a web app implementation that loads only a single web document, and then updates the body content of that single document via JavaScript APIs … ” [3]. To address performance issues on smaller devices, Accelerated Mobile Pages (AMP) and other mobile solutions were created. These focus on speed and delivering lightweight content. In their official documentation, the AMP Project team writes, “AMP is a web component framework to easily create user-first experiences for the web.” [4].
These approaches, along with frameworks like Bootstrap or Material Design, show that layout is part of a larger design ecosystem shaped by Web 2.0 and Web 3.0 developments. The existence of so many different ways to create layouts highlights the complexity of modeling a web application’s layout from its source code alone. Although our work focuses on the spatial structure of layouts from source code, understanding these broader design techniques helps provide context for the many solutions developers use to implement web layouts.
Displaying a user interface (and verifying it is correctly displayed) is often not a straightforward task. Unlike many other types of graphical user interfaces, the layout of elements in web-based applications needs to be robust, ensuring usability across a variety of conditions. Many factors influence the appearance of a web application's user interface. Despite existing standards for interpreting HyperText Markup Language (HTML), Cascading Style Sheets (CSS), and JavaScript code, each web browser has its own interpretation, which sometimes leads to the same web page appearing differently in different browsers. This problem is now largely solved, but developers should still check whether a web page looks the same across browsers. Browsers still differ in how they display some controls, such as date pickers and scroll bars, and for newer controls (or new or updated standards) discrepancies may occur until implementations settle. Users also access web applications on a wide range of devices, each with a different hardware configuration, including screen size and resolution, display type, and available input devices. The operating system on which a user accesses a web application is not uniform across users and can affect how the user interface is displayed and functions. Even changing the language can impact the correct display of a web application: some text labels might be longer or shorter than the labels used during development, potentially disrupting the layout. Finally, the content within web applications can change dynamically, and any alteration might lead to an incorrect display of the element layout. We reference these claims in the related work section.
Understanding these important factors is crucial for developers and designers aiming to create adaptable and user-friendly web experiences, and also for testers to test if the web page performs in the expected way. Having a method that can translate the structure, layout, and position of webpage elements into a format a neural network can understand would open up new ways to integrate machine learning into this complex ecosystem. Although some existing tools and approaches use machine learning, to our knowledge, none can infer a webpage’s layout data solely from its source code (CSS and HTML).
There are many approaches for analyzing a web page’s visual appearance. These analyses are conducted in various fields, and as we will see in the next section, most of them rely on using web browser data. In this paper, we propose a solution for the part of the analysis that requires information on the spatial relationships of elements on a web page. This solution is based solely on the static source code and does not need information from a web browser or an automation solution that uses a web browser directly or indirectly, like Selenium. There are a few important benefits this approach offers.
Our method provides significant potential advantages, particularly regarding performance, determinism, and security. Methods relying on web browsers must load the entire web page, which consumes considerable memory and time, as they need to load all multimedia and scripts in addition to HTML and CSS. While we could not directly compare performance in terms of time and memory due to the unavailability of source code from the dataset’s methods, common sense suggests our approach should be more efficient. Furthermore, methods that depend on web browsers are susceptible to issues such as advertisements, browser version, and the current state of the browser at the time of analysis. In contrast, methods that rely solely on source code avoid these problems, as they will always produce the same output for the same input. Finally, in terms of security, methods that depend on web browsers are exposed to potential security vulnerabilities if they analyze malicious websites, as harmful scripts can be executed along with the page. The method presented in this paper is not subject to such attacks because it does not execute scripts from the web page, nor are web pages parsed within a web browser. Despite progress in web page analysis tools, most current methods depend on web browsers or browser automation. As we have discussed, these methods have drawbacks related to performance, predictability, and security.
However, very little research has explored whether source code alone (HTML and CSS) can accurately model the spatial relationships between web page elements. This is the gap our study aims to fill.
The purpose of this paper is to determine if neural networks can learn spatial information about web page layouts from just their static source code. During training, labels are calculated using DOM information. However, for inference, only the static source code is used as input, without any additional data. To guide this, we focus on three research questions:
  • Can web page layouts be accurately represented using only HTML and CSS, without a browser?
  • Can neural networks learn the spatial relationships between elements from this representation?
  • How well can these learned representations support applications like evaluating web page similarity?
The main hypothesis of this paper proposes the feasibility of modeling web pages using neural networks. These networks can learn the spatial information of elements solely from their source code (HTML and CSS), and this information can then be used in applications such as web page similarity evaluation.
This paper is organized into an Introduction, Methods, Results, and Discussion. The Methods section is dedicated to introducing all components of our approach and their connectivity, along with a detailed explanation of our methodology. The Results section is primarily focused on presenting the training results for all key components and the evaluation outcomes of the model in the application of web page similarity evaluation. The Discussion section offers final remarks regarding the results and concludes the paper, addressing study limitations and outlining future work.

2. Related Work

Analysis of web-based user interfaces typically involves three main approaches: web browser data analysis (often via the Document Object Model, DOM), screenshot analysis, and manual human analysis. These methods can be employed individually or in combination. A comprehensive overview of these methods and their research applications is provided by I. Prazina et al. in their paper [5].
Web phishing detection is one area where web page visual analysis is crucial. In this context, it is important to recognize malicious sites that mimic legitimate ones by adopting their visual appearance. J. Mao et al. [6] address this using CSS selectors and attributes for DOM-based classification, as detailed in their paper. Similarly, other methods for phishing detection that utilize visual similarity and are also based on DOM data can be found in the papers written by Zhang et al. [7] and Rosiello et al. [8].
Another significant research area is software testing, where the early detection of visual appearance errors on web pages is paramount to mitigating potential financial and other damages. While numerous studies address UI testing based on control flow, a subset specifically focuses on UI layout testing. The work by T. A. Walsh [9] serves as a key inspiration for our method; this paper analyzes web page layouts by employing layout graphs, in which elements are depicted as nodes and their layout relations as edges. This aligns substantially with our method's emphasis on modeling web page layouts, hierarchical context, and spatial relationships. Furthermore, the analyses by Ryou et al. [10] and Althomali et al. [11] also utilize layout graphs to detect UI errors.
Classic usability research offers principles that guide effective interface design. Nielsen’s ten usability heuristics [12], in particular, are still very important for evaluating graphical user interfaces. They focus on things like visibility of the system status, consistency, error prevention, and aesthetic design. These heuristics have been commonly used to assess usability in both desktop and web applications, and they continue to influence how we design modern web and mobile experiences. While Nielsen’s heuristics and similar UX frameworks focus on qualitatively evaluating usability, our work tackles a different but related challenge: the automatic, quantitative analysis of web page layouts directly from their static source code.
Layout errors can break several of Nielsen’s principles. For instance, a layout that gets cluttered goes against aesthetic and minimalist design, and inconsistent spacing or alignment in a layout disrupts consistency. By modeling the spatial relationships between web elements, our approach provides technical tools that could help support some evaluations.
Roy Choudhary et al. [13] and Mahajan et al. [14] address the problem of ensuring consistent web page layouts across different web browser environments. More recently, Watanabe et al. [15] proposed a method that combines screenshots and DOM information to find similar layout errors; the results were compared with popular tools used in cross-browser testing. We raise this very problem in our introduction, noting how a web page's appearance can change depending on the browser used. Although modern browsers have largely solved this, some inconsistencies can still appear in certain browsers. Another set of problems involves layout errors caused by internationalization. As mentioned in the introduction, these errors can occur when a web page's language is changed; for instance, the width of certain labels might change because the translated text is either longer or shorter than the original. Alameer et al. [16,17] discuss this issue and offer solutions. Layout problems can also arise from dynamic content changes. In their paper [18], S. Kolla et al. note, "Dynamic content represents one of the most persistent challenges in snapshot testing, creating frequent false positives that undermine confidence in test results". This highlights how problematic dynamic content can be for developers. Our method could help with some of these problems. Predicting spatial relationships directly from the source code might offer insight into why these layout problems occur; it could even serve as a lightweight pre-rendering analysis tool, helping detect potential layout issues before they appear in different browsers.
However, a commonality among these approaches is their reliance on DOM data obtained either directly from a web browser or through an automation framework such as Selenium.
Another connected problem relevant in web page testing is the consistency of an element locator. Elements in different versions of a web page can have different IDs, classes, and attributes, which can cause automated tests to fail. Ricardo Coppola et al. addressed this problem in their paper [19], offering a comparison of approaches to solve it. This work on resilient element location has some similarity with our approach because it also analyzes the underlying structure of a web page. Their problem is different; they aim to identify key components for a more robust testing process.
The analysis of layouts is an important part of web page similarity assessment or modeling human perception of web pages. In their paper, Bozkir et al. [20] directly address the layout-based computation of web page similarity, which constitutes a core component of this paper’s model evaluation. Their work also offers a valuable dataset of websites. Furthermore, M. Bakaev et al. [21] present another method that assesses similarity using Artificial Neural Networks (ANN). An earlier contribution by R. Song [22] focuses on recognizing important blocks within a layout. Their research aims to identify and understand the significant components within a web page for applications such as information retrieval or summarization through the analysis of both structural arrangement and textual content. Web page segmentation is a relevant technique that partitions a page into semantically coherent blocks. Huynh et al., in their paper [23], aim to extract the main content and disregard irrelevant regions based on structural information from the DOM tree. While their goal is different, this method is similar to ours because it also leverages the underlying structure of a web page rather than relying on rendered images for analysis.
Although recent studies have shown that source code can be used for layout tasks, their methods often rely on complex or computationally intensive frameworks. For example, Cheng et al. [24] use neural networks to assess the quality of the web page from the DOM tree, which means that they are restricted to using the DOM during both training and inference. Another paper that extracts information from HTML is [25] by Kawamura et al., which focuses only on HTML tables and does not use CSS for its analysis. Other papers, like [26] by Seol et al. and [27] by Tang et al., focus on generating new layouts rather than analyzing existing ones, a problem that is outside the scope of our method. In contrast, our approach is designed to be lightweight, predictable, and secure by working only with static HTML and CSS, which offers potential advantages in both performance and consistency.
In the literature review, we identified various fields that address the challenges of web page layout analysis. Some techniques analyze screen images, others analyze the tree structure of the DOM, and some combine both types of information (for more details, see [5]). However, most methods are based on a web browser or an environment that uses one in some way. Those methods that do not depend on a web browser are typically focused on very specific applications.
This dependence on browsers often necessitates frequent updates to these methods and their tools due to browser changes. In addition, it makes the analysis results susceptible to various factors. These include the web browser’s state at the time of analysis, the web page’s current state (such as its data, script states, and advertisements), and network conditions.
We see that many fields benefit from the spatial information of web page elements. An approach that could provide this information without directly using a web browser would solve the aforementioned problems and open opportunities for further research and cooperation among different fields.

3. Methods

In this section, we will describe our approach and all its parts. Unlike other approaches, the method presented here utilizes web browser outputs as reference values only during the training phase; after training, only the source code is needed. This paves the way for addressing numerous issues such as performance, security, and analysis consistency, which will be discussed further. We will illustrate how this approach solves these problems through one of its model applications.
The approach is based on encoding parts of HTML and CSS using autoencoders made and trained for this paper. This step translates string inputs into corresponding vectors suitable for further processing within machine learning models based on neural networks. The second part of the approach combines different segments of the source code for each web page element to create a relevant vector per element, which can then be used to identify spatial relationships. The scheme of the approach can be seen in Figure 1.
Our approach is designed to analyze HTML and its associated CSS files. It does not consider the state of the web browser, nor does it account for scripts or the changes they induce on a page. Given any pair of elements, our method can identify their spatial relationships, recognizing the 8 distinct spatial relations (or their absence) (Figure 2).
While currently not implemented, the approach can be easily extended to recognize other spatial relations, such as element overlap, the overlap of two or more edges, or whether one element contains another. The output of our approach is an array of 8 values, with each value representing the probability that two elements share a given spatial relationship. Currently, the approach cannot work with dynamically added or modified content. However, if a script were available to save the HTML and CSS state at the moment of change, our method could then analyze such pages.

3.1. Autoencoders

To convert web page elements into vectors, we need to encode several key components: the style of HTML elements, the style selectors applied to those elements, and the attributes of the HTML elements themselves. These parts of the user interface specification include not only predefined keywords but also a large number of arbitrarily chosen words and values, such as class names, IDs, textual and numerical content, and web links.
To adapt this varied content for the final model—which will determine the spatial relationships of user interface elements based on the provided source code—it is essential to encode it into a fixed-length vector. For this task, we employed an autoencoder built upon a Long Short-Term Memory (LSTM) neural network. LSTM was chosen over a Transformer architecture due to its simplicity, performance, and suitability for the specific problem we are solving. While Transformer networks are often more appropriate for encoding natural language, their complexity is excessive for the type of content encoding required here.
Before training the model, we preprocessed the data. Web pages often contain a large number of classes and IDs, whose values can be either words from natural language or generated words with no specific meaning. Such values make it difficult for the autoencoder model to learn the structure of the elements it needs to encode and unnecessarily increase the complexity of the training dataset. The actual meaning of a class or ID name within a web page is not essential for determining its style and appearance. The only crucial aspect is that an ID name is unique within the web page and that a class name does not collide with other class names. Bearing this in mind, we can replace all classes and IDs within a web page with unique values that follow the structure 'cl_CODE' or 'id_CODE', where CODE is a generated sequence of random values unique to a given class or ID.
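A minimal, regex-based sketch of this replacement is shown below; it uses a running counter in place of the random codes and, for brevity, omits the corresponding rewrite of class and ID names inside the CSS selectors, which the actual preprocessing must also perform:

import re
from itertools import count

_codes = count()
_mapping = {}

def _code_for(prefix, name):
    # The same original name always maps to the same generated code
    if (prefix, name) not in _mapping:
        _mapping[(prefix, name)] = f"{prefix}_{next(_codes):04d}"
    return _mapping[(prefix, name)]

def anonymize_ids(html):
    # Replace id attribute values with unique id_CODE placeholders
    return re.sub(r'id="([^"]+)"',
                  lambda m: f'id="{_code_for("id", m.group(1))}"', html)

def anonymize_classes(html):
    # Replace every class name (an attribute may list several) with cl_CODE placeholders
    def repl(m):
        names = m.group(1).split()
        return 'class="' + " ".join(_code_for("cl", n) for n in names) + '"'
    return re.sub(r'class="([^"]+)"', repl, html)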
Beyond replacing classes and IDs, URLs within attributes and style values significantly impact the length of strings that need encoding. These values are considerably longer than attributes directly contributing to the style and final layout of user interface elements, and they can hinder the autoencoder’s ability to encode useful information. For this reason, we decided to remove all links from attributes and styles before encoding. We achieved this by utilizing the PostCSS (“PostCSS is a tool for transforming styles with JS plugins” [28]) and PurgeCSS (“PurgeCSS is a tool to remove unused CSS from your project” [29]) libraries.
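For illustration only, the same idea can be sketched with regular expressions (the actual pipeline relies on the PostCSS and PurgeCSS libraries rather than on regexes):

import re

def strip_css_urls(css):
    # Replace url(...) values so long links do not dominate the strings to be encoded
    return re.sub(r"url\([^)]*\)", "url()", css)

def strip_attribute_urls(html):
    # Blank out href/src attribute values in the HTML before encoding
    return re.sub(r'\b(href|src)="[^"]*"', r'\1=""', html)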

Autoencoder Model and Structure

Our autoencoder was implemented in the PyTorch 2.7.0+cu118 environment, utilizing the following layers:
  • Embedding Layer—This PyTorch module represents discrete token values as continuous values, making them suitable for an LSTM network. Mapping from the token space to a continuous value space is essential because LSTMs are not designed to directly process words or text. In PyTorch, the embedding layer is realized as a special lookup table where each input token index returns a tensor of a predefined dimension (known as the embedding_dimension). During training, the returned tensor for a given index is adjusted to represent increasingly meaningful values.
  • LSTM Encoder—This LSTM module is responsible for encoding the sequence of tokens, which have already been mapped to continuous values by the Embedding layer.
  • Linear Bottleneck—A linear layer designed to reduce the dimensionality of the tensor. The output of this layer serves as the encoded value of the elements for other models.
  • Linear ‘Debottleneck’—A linear layer that increases the dimensionality of the tensor, effectively reversing the bottleneck’s operation.
  • LSTM Decoder—An LSTM model used to decode the output of the ‘debottleneck’ layer, with the aim of reconstructing the original input.
  • Linear Output—A final linear layer that maps the output of the LSTM decoder back into the space of the token vocabulary.
We built the autoencoder as a PyTorch module, incorporating the previously mentioned layers, each implemented using its corresponding PyTorch module. This model expects a BPE (Byte Pair Encoding) encoded string sequence as input, representing the element to be encoded.
The model's output has the shape [batch_size, sequence_length, vocab_size]. Here, batch_size refers to the batch size used during training, sequence_length is the length of the BPE sequence after padding (all strings have a uniform length after encoding), and vocab_size represents the size of the BPE token vocabulary. This output signifies the probability that each token from the given vocabulary appears at every position within the sequence. A visual representation of this autoencoder's scheme can be found in Figure 3.
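A minimal PyTorch sketch of this layer arrangement is given below; the embedding, hidden, and bottleneck dimensions are illustrative placeholders, not the values used in our experiments:

import torch
import torch.nn as nn

class LSTMAutoencoder(nn.Module):
    # Sketch of an LSTM autoencoder for BPE-encoded source-code strings

    def __init__(self, vocab_size, embedding_dim=64, hidden_dim=128, bottleneck_dim=36):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)   # token index -> dense vector
        self.encoder = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
        self.bottleneck = nn.Linear(hidden_dim, bottleneck_dim)    # compressed representation used by later models
        self.debottleneck = nn.Linear(bottleneck_dim, hidden_dim)  # expand back for decoding
        self.decoder = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.output = nn.Linear(hidden_dim, vocab_size)            # logits over the BPE vocabulary

    def encode(self, tokens):
        # tokens: [batch_size, sequence_length] of BPE token indices
        embedded = self.embedding(tokens)
        _, (hidden, _) = self.encoder(embedded)
        return self.bottleneck(hidden[-1])                         # [batch_size, bottleneck_dim]

    def forward(self, tokens):
        code = self.encode(tokens)
        seq_len = tokens.size(1)
        expanded = self.debottleneck(code).unsqueeze(1).repeat(1, seq_len, 1)
        decoded, _ = self.decoder(expanded)
        return self.output(decoded)                                # [batch_size, sequence_length, vocab_size]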
Autoencoders, by their nature, aim to learn a compressed version of their input during the training process. This type of learning is known as representation learning, in which the model extracts the fundamental and representative information from the given input by compressing the data. The input to our system is source code (CSS for style and HTML for structure). Although this code is structured, its complexity can vary significantly across different elements. Additionally, some parts of HTML and CSS do not directly influence the user interface element layout we are modeling. These two aforementioned facts are precisely why this type of data compression and unification is necessary. This means that every element is represented by a tensor of consistent length and structure, encapsulating the most significant information from the given code.

3.2. Finding Target Spatial Relations and Layout Graph

In the literature review, it can be observed that some methods for analyzing web-based user interfaces represent element layouts as a graph. In these graphs, nodes represent individual elements, and edges depict the type of spatial relationship between two elements. This representation simplifies the comparison of similar user interfaces and provides a way to identify differences during regression testing.
The layout graph also provides a useful representation for responsive pages. Here, information can be added within the edges to indicate at which resolutions a particular relationship holds true. This approach allows for effective modeling of changes due to responsiveness.
The layout graph was an inspiration for our method. We devised our approach to obtain the edges from node representational vectors. For the target edge values, we need the real state of each element as displayed on the web page; this information is obtained from the Selenium WebDriver DOM (Document Object Model) data structure, using each element's bounding rectangle. To efficiently identify relationships between all elements, we need a structure that allows for effective searching based on element proximity. When considering the proximity between two elements in terms of spatial relations between their edges, there are four horizontal pairs and four vertical pairs of edges (see Figure 2). Given that this search involves multiple dimensions when looking for an element's neighbors, a K-D Tree emerges as a suitable structure. The search procedure is explained in Algorithm 1.
The K-D tree construction process happens recursively. First, the array of elements is sorted along the current dimension. Then, the median element is found, and a new tree node is created based on this median element. After the node is created, the process repeats for the current node’s left and right children. The left child will receive half of the array with smaller values along the current dimension, and the right child will receive the other half. This process continues for new levels of the tree until, at the step where the array needs to be split, it contains no more than one element. If the array contains exactly one element, a node is simply created for that element, and it is placed in the appropriate position in the tree as a leaf.
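A minimal sketch of this recursive construction, assuming each element carries a rect indexed by its four edge coordinates (the names below are illustrative):

from dataclasses import dataclass

@dataclass
class KDNode:
    element: dict            # element record holding its bounding-rectangle edge coordinates
    left: "KDNode" = None
    right: "KDNode" = None

K = 4  # one dimension per edge: top, bottom, left, right

def build_kdtree(elements, depth=0):
    # Recursively build a K-D tree over element bounding-box edges
    if not elements:
        return None
    if len(elements) == 1:
        return KDNode(elements[0])                               # single element becomes a leaf
    axis = depth % K                                             # alternate the split dimension per level
    elements = sorted(elements, key=lambda e: e["rect"][axis])   # sort along the current dimension
    median = len(elements) // 2
    node = KDNode(elements[median])                              # node is created from the median element
    node.left = build_kdtree(elements[:median], depth + 1)       # smaller values along this axis
    node.right = build_kdtree(elements[median + 1:], depth + 1)  # larger values along this axis
    return node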
A K-D Tree is a binary tree in which each node represents a k-dimensional point. This tree organizes elements by alternately dividing the k-dimensional space by levels, with each level bisecting a specific dimension of that space. For our search, we use a pair of values, DaDb, which indicates the direction along which we want to align edges, where Da and Db can be top, bottom, left, or right. To derive the spatial relations for each element, we initiate a nearest neighbor search for the 8 possible DaDb pairs, which yields up to 8 potential spatial relations. In some instances, an element may not have neighbors for every DaDb pair. For example, an element positioned along the left edge of the screen might not have another element to its left. In such cases, that particular spatial relationship will not be detected.
Algorithm 1 Finding near elements using K-D Tree
Require: Tree is constructed
 1: function findNear(targetNode, dir)
 2:     closest ← null
 3:     function nearSearch(currentRoot, depth)    ▹ Called recursively to search through the tree
 4:         if currentRoot == null then
 5:             return
 6:         end if
 7:         distanceTC ← sideDistance(dir, targetNode.rect, currentRoot.rect)    ▹ sideDistance calculates the Euclidean distance between the centers of the two edges
 8:         if isRightSide(targetNode, currentRoot, dir) AND currentRoot.key ≠ targetNode.key then    ▹ isRightSide checks that the edges are in the right order; for example, for the left-right direction, the first edge's x position (distance from the left side of the screen) must be less than the second edge's x position
 9:             if closest ≠ null AND distanceTC < sideDistance(dir, targetNode, closest) then
10:                 closest ← {node: [currentRoot], distance: distanceTC}
11:             else if closest ≠ null AND distanceTC == sideDistance(dir, targetNode, closest) then
12:                 closest.node.push(currentRoot)
13:             else
14:                 closest ← {node: [currentRoot], distance: distanceTC}
15:             end if
16:         end if
17:         currentDim ← depth mod 4
18:         dirIndex ← (dir.split('_'))[1]
19:         if currentDim == this.direction[dirIndex] then    ▹ If the current tree depth corresponds to the search dimension (the position of the edge of interest), the search can be pruned
20:             if isRightSide(targetNode.rect, currentRoot.childL.rect, dir) then
21:                 nearSearch(currentRoot.childL, depth + 1)
22:             else if isRightSide(targetNode.rect, currentRoot.childR.rect, dir) then
23:                 nearSearch(currentRoot.childR, depth + 1)
24:             end if
25:         else    ▹ Pruning is not possible at a depth that does not correspond to the search dimension, so both subtrees are searched
26:             if currentRoot.childL ≠ null then
27:                 nearSearch(currentRoot.childL, depth + 1)
28:             end if
29:             if currentRoot.childR ≠ null then
30:                 nearSearch(currentRoot.childR, depth + 1)
31:             end if
32:         end if
33:         return closest
34:     end function
35:     nearSearch(this.root, 0)
36:     return closest
37: end function

3.3. Representing Element Relationships Through Deep Learning

We created the model of the relationship graph as a neural network incorporating linear and convolutional layers. This network aggregates data from multiple nodes within the graph to produce a vector containing the necessary information for determining spatial relations with other nodes. In this model, for each node, we utilize data from its ancestor nodes and from nodes whose direct parent is the same as the current node's parent (sibling nodes). The influence of an individual parent's value and that of a direct neighbor is learned during training. In the context of this section, a direct neighbor refers to a node that shares the same direct parent as another node. A direct parent is a parent that directly contains the given node, meaning there is a direct connection without any intermediate nodes between the parent and the given node.
Each node within our model holds the following information:
  • Encoded Styles: This is represented as an array of tensors, capturing the style information of the HTML elements.
  • Encoded Selectors: An array of tensors of the same length as the encoded styles array, where each element at a given position in this array corresponds directly to the style element at the same position in the style array. This links styles to their specific selectors for that element.
  • Encoded Element Attributes: A single tensor containing the encoded attributes of the element.
  • Ancestor Information: A list of node names that represent all of the element’s ancestors in the hierarchy.
  • Siblings Information: A list of node names that represent all of the element's siblings.
  • List of target relations for the pair of elements (only for training): An array of 8 values, where each value is a binary (true/false) answer to the question: "Does the pair of elements share a given spatial relationship?".
We used a combination of Node.js (v23) and Python (v3.13) scripts to obtain the previously described data. The Node.js scripts, built with Selenium WebDriver, parse the web page, gather information from the generated DOM, and create an offline version of the page. This offline version then serves as input for the model training process. The Python scripts leverage the pre-trained autoencoder models discussed earlier. Information extracted from the DOM object is used to construct the K-D tree, from which we derive spatial relationship data. From the offline version of the page, we identify elements for which to create corresponding pairs sharing spatial relationships, along with structural relationships like parent and direct neighbor, and attribute data for each element. Style and selector data are obtained from offline versions of files linked via <link> tags that contain styles, as well as from <style> tags within the HTML file itself.
The described scripts create a dataset of web pages. The modular design of these web page preprocessing scripts makes this dataset highly versatile. Since every intermediate step and script result is saved, the dataset contains a wealth of useful information. It is also easily expandable, as the scripts can be simply utilized to add new pages. The dataset has the following limitations:
  • It does not include multimedia (images and video content from the pages);
  • It also does not include JavaScript files.
All CSS content from <link> tags or <style> tags is consolidated into a single allCSS.json file. This means the style data for elements is combined and merged from all CSS source files.
The dataset itself comprises 100 web pages selected from the Mozilla Top 500 list. Web pages were selected if they had a complete layout (not just a simple text page) and were not merely a login screen (some pages from the Mozilla Top 500 list require authentication). The structure of this dataset is detailed in Table 1.

Aggregating Node Information

The model’s input consists of a pair of nodes. Their encoded data is loaded from the encoded_elements.json file. Additionally, information about supplementary nodes (parents and neighbors of both original nodes) that require processing, along with a list of relationships between the two chosen nodes, is loaded from completePairs.json. The model’s output is an array. When the sigmoid function is applied to this array, it produces probabilities (values between 0 and 1) for each of the 8 spatial relationships. This represents a form of multilabel classification, where each pair can have one or more labels of spatial relationships.
The core part of the model is the encodeNode method, which takes all the prepared data for a node and returns an aggregated tensor. This tensor should capture as much relevant information as possible to determine spatial relationships with other nodes.
When the forward method receives two nodes for which it needs to determine relationships, it first calls encodeNode for the first node and then for the second. The resulting tensors (let us call them encoded_A and encoded_B) are then concatenated with the ordinal numbers of both nodes. In this context, the ordinal number of a node, on_X, refers to the count of elements in the same parent that precede a given node in the HTML document. This information is crucial and cannot be inferred from other data within the encoded tensors. The ordinal number of an element significantly impacts spatial relationships because three elements might have identical styles and attributes, yet their ordinal positions determine whether or not they share a particular spatial relationship. After creating this combined tensor (on_A, encoded_A, on_B, encoded_B), we apply a linear layer within the forward method. This layer outputs a tensor containing the final probabilities for each of the spatial relationships.
Figure 4 illustrates the schema for the encodeNode method. This method takes three main parameters. The first parameter represents the node to be encoded, including its tensor of encoded attributes and arrays of tensors for both encoded styles and selectors. The second parameter is a list of all the parent nodes, each parent containing the same data structure as the first parameter. Finally, the third parameter is a list of all neighbor nodes, formatted identically to the second parameter.
The next part of the encodeNode method is responsible for encoding styles. Considering that an element can have multiple styles and corresponding selectors, we need to process each one. All style values are combined with the element’s attributes to understand the meaningful influence of the style relative to those attributes (e.g., some selectors have higher priority). This combination happens in the processSSA method, as illustrated in Figure 5.
The core of the processSSA method acts as an attention layer. Depending on the attributes and selectors, it assigns a specific weight to each style. After applying this weight to a style, it is added to a list of styles. Within the encodeNode method, this list of styles is then processed by combining all styles with the totalTensor vector, which was previously aggregated from neighbor and ancestor nodes. During this combination, a residual bias parameter is used, allowing the model to learn how much each new style influences the overall totalTensor.
Once all styles are processed and a single, combined totalTensor is obtained, the final step is executed. A MultiHeadAttention layer is applied, giving all relevant tensors (ancestor tensors and neighbor tensors) the opportunity to contribute to the final result. This final result is then passed through a linear layer and, ultimately, a ReLU activation layer to produce the most representative tensor possible.
In Listing 1, the final steps of the model are presented. To estimate the spatial relationships between a pair of web page elements, we utilize their representative tensors (36-dimensional FloatTensor obtained from the encodeNode method, lines 1 and 2 in the listing). The process of determining the probabilities for each of the 8 spatial relationships begins by concatenating the tensors of the two elements with their respective ordinal numbers within their parents (lines 7 and 8 in the listing). This results in a 74-dimensional tensor (36D for the first element, 1D for its ordinal number, 36D for the second element, and 1D for its ordinal number).
A linear layer with an 8 × 74 weight matrix (output dimension 8, input dimension 74) is applied (line 11 of the listing) to this combined tensor. The second layer in this part of the network is another linear layer with dimensions of 8 × 45 . The input to this layer is the concatenated output of the previous layer and a tensor representing the difference between the two element tensors (lines 12–14 of the listing). This design allows the model to learn as many dependencies as possible between the paired elements when forming the final result.
Listing 1. Final steps of the model for detecting spatial relationships between elements ’a’ and ’b’.
Ai 06 00228 i001
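A minimal PyTorch sketch of these final steps, following the dimensions described above (36-dimensional node tensors, an 8 × 74 linear layer, and an 8 × 45 linear layer); the class and variable names are illustrative rather than a verbatim reproduction of Listing 1:

import torch
import torch.nn as nn

class RelationHead(nn.Module):
    # Sketch of the final classification step; dimensions follow the text of Section 3.3

    def __init__(self, node_dim=36, num_relations=8):
        super().__init__()
        pair_dim = 2 * (node_dim + 1)        # 74: two 36-D node tensors plus two ordinal numbers
        diff_dim = node_dim + 1              # 37: difference of the (ordinal, encoding) tensors
        self.fc1 = nn.Linear(pair_dim, num_relations)                   # 8 x 74 weight matrix
        self.fc2 = nn.Linear(num_relations + diff_dim, num_relations)   # 8 x 45 weight matrix

    def forward(self, encoded_a, on_a, encoded_b, on_b):
        # encoded_a, encoded_b: [36]; on_a, on_b: 1-element tensors with the ordinal numbers
        a = torch.cat([on_a, encoded_a])     # [37]
        b = torch.cat([on_b, encoded_b])     # [37]
        pair = torch.cat([a, b])             # [74]
        h = self.fc1(pair)                   # [8]
        out = self.fc2(torch.cat([h, a - b]))  # [8] raw scores for the 8 spatial relations
        return out                           # apply torch.sigmoid(out) to obtain per-relation probabilities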

3.4. Applying the Model: Web Page Similarity

In the spatial relationship recognition model, we can obtain a representative vector for each node of the web page being modeled. This vector implicitly contains information about the element's spatial relationships with other elements. This information can be leveraged in many applications where spatial data about an element itself is needed, or where aggregated information is used to compare parts of a web page, or an entire web page, with others.
In this section of the paper, we will provide an example application that simply utilizes spatial relationship information to calculate the similarity of one web page to others.
In [20], the authors provide a dataset of web pages along with the results of their experiment, where users were asked to rate the similarity of given pairs of web pages based on their own perception. This dataset is well-documented and valuable because, in addition to user interface screenshots, it includes the source code of the pages themselves. This last feature makes it particularly useful for applying the models described earlier in our work. Furthermore, this dataset will serve as an excellent form of validation, as none of its pages were used in the training process of the models presented in this paper.
Before the user evaluation, the authors in [20] categorized the pages into four groups, a fact they did not disclose to the participants. A total of 312 participants took part in the evaluation. Following the evaluation, the authors compared the results and found that users grouped the pages by similarity with a high degree of concordance with the pre-established groups. Therefore, we can utilize these same groups, just as the authors did, when training and validating our similarity model. The dataset from the aforementioned work comprises 40 pages. Unfortunately, two of these pages were not properly archived, making them unusable. Despite this, the remaining 38 pages can be utilized without issue.
The approach developed in this paper exclusively uses the raw HTML and CSS code of a web page. It generates representative vectors for each element without needing a web browser. This is a significant distinction from the original work from which our dataset was sourced. Unlike that work, which had access to DOM states and user interface images, our method relies solely on the source code. This fact makes our approach browser-agnostic and simplifies its integration into any environment, even one without a web browser. Furthermore, it opens up more opportunities for optimization. Processing a web page no longer requires loading it into a web browser, which typically consumes substantial RAM for parsing and rendering and significant processing time for parsing and executing some of the scripts that ultimately will not affect the web page’s visual appearance. Moreover, by not directly depending on a specific browser version, our approach reduces its reliance on web browsers, enabling future possibilities to treat browser type and resolution as just another parameter within our model.
Figure 6 illustrates the scheme of our approach, which leverages the models previously described. The autoencoder models (for style, selectors, and attributes) are used to extract the necessary properties per element for the model that generates a representative vector of spatial relationships. Unlike other approaches mentioned in Section 2, our method relies exclusively on source code, completely eliminating the need for a web browser’s capabilities.
Once we have a representative vector containing implicit spatial information, we need to adapt it by extracting useful data for similarity comparisons. When comparing two web pages, we will compare their aggregated vectors. For our distance metric, we are using Euclidean distance because, in addition to the angle between two vectors, their magnitude is also important (pages with more elements will have larger values in certain vector dimensions).
To obtain the necessary vector, we created a simple neural network. The network needs to be straightforward due to the limited number of available web pages; more complex networks would easily face overfitting issues. The role of this network is to map the vector and adapt it for comparison. The network has two layers with a ReLU activation function between them to learn nonlinear relationships. The hidden layer of the network has a dimension of 64, while the output vector has a dimension of 16. The input to the network is an aggregated vector of dimension 36 (Figure 7). Finally, similarity between pages is calculated using the Euclidean distance between their vectors (the 16-dimensional vectors obtained after adaptation through the simple neural network).
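A minimal sketch of this adaptation network and the distance computation (layer dimensions follow the description above; the function names are illustrative):

import torch
import torch.nn as nn

class SimilarityAdapter(nn.Module):
    # Maps a 36-D aggregated page vector to a 16-D vector used for comparison

    def __init__(self, in_dim=36, hidden_dim=64, out_dim=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(),                       # nonlinearity between the two layers
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, page_vector):
        return self.net(page_vector)

def page_distance(adapter, vec_a, vec_b):
    # Similarity is measured as the Euclidean distance between the adapted page vectors
    return torch.dist(adapter(vec_a), adapter(vec_b), p=2)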

3.5. Training

3.5.1. Autoencoders

After preprocessing, we have three distinct datasets: a set of attributes (expanded with HTML element names), a set of selectors applied to the chosen HTML elements, and a set of styles. BPE (Byte Pair Encoding) was applied separately to each of these three sets.
The BPE algorithm is a compression technique where an arbitrarily large vocabulary is represented by a smaller set of subwords. A subword, in this context, is a sequence of characters that frequently appears across all words in the original vocabulary. This new set of subwords is constrained by the size of the resulting vocabulary, which we set as a parameter for the BPE algorithm. The role of BPE encoding is to identify the most frequently repeated sequences within strings and, based on them, create a dictionary of tokens used in these datasets. Applying BPE encoding to a string yields a sequence of tokens suitable for further processing, an approach commonly used in NLP (Natural Language Processing) problems. The resulting token sequence also represents a compressed version of the original string. We used the Hugging Face Tokenizer Python library to create the BPE token dictionaries. These dictionaries were created separately for each of the three datasets due to their differing structures and content, ensuring the best possible representation of the data for tokenization.
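A minimal sketch of training one such BPE tokenizer with the Hugging Face Tokenizers library; the file name, vocabulary size, special tokens, and pre-tokenizer choice shown here are illustrative:

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# One tokenizer is trained per dataset (attributes, selectors, styles)
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=2000, special_tokens=["[UNK]", "[PAD]"])
tokenizer.train(files=["styles.txt"], trainer=trainer)

encoding = tokenizer.encode("div { display: flex; margin: 0 auto; }")
print(encoding.ids)   # BPE token-id sequence fed to the autoencoder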
Our models were trained on data sourced from real-world websites. These pages were selected from Mozilla's Top 500 most-visited sites. While the majority of these sites are from English-speaking regions, this selection does not impact the model's generality, because every class and ID name is replaced with a corresponding code and the remaining input to the autoencoders primarily consists of CSS and HTML syntax keywords. From the chosen set of pages, we extracted the styles for each element, encompassing two values (the CSS selector and the CSS properties), as well as the attributes of each element. Three distinct datasets were formed from this collected data and used for training the autoencoders. The models were trained for 10 epochs. For validation, we set aside 10 pages not used in training; these were used to monitor the training progress and ensure overfitting did not occur.
The loss function employed during training was cross-entropy, implemented within the torch.nn package. This function takes the autoencoder's raw output (logits) and applies softmax internally, constraining the values to a 0–1 range that represents the probability of a specific token from the vocabulary appearing at each position.
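Assuming output and target tensors with the shapes described above, the loss computation reduces to flattening the sequence dimension before applying the criterion (a sketch, not the exact training code):

import torch.nn as nn

criterion = nn.CrossEntropyLoss()

def reconstruction_loss(output, target):
    # output: [batch_size, sequence_length, vocab_size] logits; target: [batch_size, sequence_length] token ids
    return criterion(output.reshape(-1, output.size(-1)), target.reshape(-1))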

3.5.2. Spatial Relationship Model

We used 100 real-world websites to train our model. Pages were selected from the Mozilla Top 500 list. The inclusion criteria for this dataset were that a web page have a nontrivial layout and that it not be hidden behind an authentication form. For each selected page, we generated dataset files using the scripts described earlier in Section 3.3. During the training process, we experimented with various optimizers, with the AdamW optimizer proving to be the most effective so far. AdamW operates on the principles of the Adam optimizer, utilizing an adaptive learning rate along with added L2-style regularization of weights (known as weight decay). This characteristic of the optimizer stabilizes training and significantly helps prevent overfitting. By applying weight decay, the model is penalized if certain weights become too large. This encourages the model to remain simpler and rely less on obvious features, thereby increasing its generalization capability. A model with better generalization performs more effectively on unseen inputs.
After selecting an optimizer, a significant challenge during the training of our relationship graph model was class imbalance (as detailed in Table 2). Some spatial relationship classes appeared with much higher frequency than others, while some were quite rare.
To address this class imbalance, we opted to exclude the most frequent classes from the dataset, meaning that not all pairs with common spatial relationships were included. Our method involved keeping a counter for frequent pairs: if the counter was below 2 and a pair contained a frequent spatial relationship, we did not add that pair to the set. However, once the counter reached 2, the pair was added even if it included a frequent spatial relationship. Applying this rule resulted in a more uniform class distribution within the dataset (see Table 3).
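One possible reading of this counter rule is sketched below; the reset behavior, the relation names, and the helper has_frequent_relation are assumptions made for illustration:

FREQUENT_RELATIONS = {"top-top", "bottom-bottom"}   # hypothetical names of the most frequent classes

def has_frequent_relation(pair):
    return any(r in FREQUENT_RELATIONS for r in pair.relations)

def filter_pairs(all_pairs):
    # One reading of the rule: skip two consecutive frequent pairs, keep the third, repeat
    skip_counter = 0
    filtered = []
    for pair in all_pairs:
        if has_frequent_relation(pair):
            if skip_counter < 2:
                skip_counter += 1      # drop this pair containing a frequent relation
                continue
            skip_counter = 0           # counter reached 2: keep the pair and reset
        filtered.append(pair)
    return filtered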
Besides choosing a different set of web pages or modifying the existing one, another solution is to select a loss function that mitigates the impact of imbalance. Imbalance still persists for label configurations, which represent the types of relationships shared by a pair of elements. Many pairs exhibit the same configurations. We addressed this problem by introducing a configuration-specific weight. If a particular configuration is frequent, its weight is smaller, meaning it has less influence on training. Without this adjustment, the model would primarily learn frequent configurations, effectively ignoring those that appear less often.
For our loss function (Formula (1)), we selected a combination of Dice loss, binary focal loss, and L1 loss for the sum of labels. Dice loss was specifically designed to address the problem of imbalanced datasets. During training, we observed that while Dice loss alone (compared to cross-entropy loss) achieved better recall and precision for underrepresented classes, it led to slightly worse recall and precision for well-represented classes. Combining these loss functions yields better results for both minority and majority classes.
The role of the L1 loss function is to minimize the difference between the number of predicted labels and the actual number of labels. This reduces the chance of the model predicting more or fewer spatial relationships than are actually present. The weights for each component of the loss function were determined through trial training with various weight parameters, resulting in the most stable training performance for these specific values.
The loss function formula is
$$\mathrm{Loss} = 0.85 \times w_j \times \mathrm{Focal}(y_{true}, y_{pred}) + 1.2 \times \mathrm{DiceLoss}(y_{true}, y_{pred}) + 0.2 \times L_1\!\left(\sum y_{pred}, \sum y_{true}\right)$$
where:
  • $w_j$ is the weight of the current label configuration, based on the target labels $y_{true}$;
  • $y_{true}$ is the target, and $y_{pred}$ is the model prediction;
  • $\mathrm{Focal}$ is the binary focal loss function;
  • $\mathrm{DiceLoss}$ is the Dice loss function;
  • $L_1$ is the L1 loss function.
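A minimal sketch of how the three terms could be combined, assuming sigmoid outputs over the eight labels and textbook formulations of soft Dice and binary focal loss (the focal exponent and the toy configuration weight are assumptions, not values from our training):

```python
import torch
import torch.nn.functional as F

def dice_loss(y_pred, y_true, eps=1e-6):
    # Soft Dice over the eight relationship labels; y_pred holds probabilities in [0, 1]
    intersection = (y_pred * y_true).sum()
    return 1.0 - (2.0 * intersection + eps) / (y_pred.sum() + y_true.sum() + eps)

def binary_focal_loss(y_pred, y_true, gamma=2.0, eps=1e-6):
    # Textbook binary focal loss; gamma = 2.0 is an assumed value
    y_pred = y_pred.clamp(eps, 1.0 - eps)
    pt = torch.where(y_true > 0.5, y_pred, 1.0 - y_pred)
    return (-((1.0 - pt) ** gamma) * pt.log()).mean()

def combined_loss(y_pred, y_true, config_weight):
    # config_weight corresponds to w_j, the weight of the current label configuration
    focal = binary_focal_loss(y_pred, y_true)
    dice = dice_loss(y_pred, y_true)
    count_l1 = F.l1_loss(y_pred.sum(), y_true.sum())   # penalize mismatched label counts
    return 0.85 * config_weight * focal + 1.2 * dice + 0.2 * count_l1

# Toy usage with an eight-dimensional multilabel target
y_true = torch.tensor([1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0])
y_pred = torch.sigmoid(torch.randn(8))
loss = combined_loss(y_pred, y_true, config_weight=0.5)
```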
Since our dataset is not structured for direct batch processing (due to varying numbers of parent and child nodes, which would necessitate extensive padding for uniform batch sizes), we employed a batch simulation approach, also known as delayed optimization. The loss value is aggregated over a specific number of samples or pairs (in our case, 100). The optimization step then occurs after every 100th pair. This method prevented large jumps in the training gradient and kept the training from oscillating between local minima, leading to a more stable learning process.
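A sketch of this delayed optimization loop, written as a generic helper (the function and argument names are ours; dividing each loss by the accumulation count is one common convention for keeping the effective step size comparable to per-pair updates):

```python
def train_with_delayed_optimization(model, optimizer, pair_stream, loss_fn, accum_pairs=100):
    """Accumulate gradients over `accum_pairs` element pairs, then take one optimizer step."""
    optimizer.zero_grad()
    for step, (pair_input, target, config_weight) in enumerate(pair_stream, start=1):
        prediction = model(pair_input)
        loss = loss_fn(prediction, target, config_weight)
        (loss / accum_pairs).backward()        # gradients accumulate across pairs
        if step % accum_pairs == 0:
            optimizer.step()                   # one update per simulated batch of 100 pairs
            optimizer.zero_grad()
```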

3.5.3. Simple Neural Network for Vector Adaptation in Similarity Detection

Training for this simple network was based on the triplet principle. For this process, the dataset was divided into 30 pages for training and 8 pages for validation. From the set of 30 training pages, 5000 unique triplets were generated. Each triplet consists of two elements that should belong to the same group and one element that should not. The group labels used were those provided in the paper where the dataset was originally introduced. During training, our objective was to increase the distance between elements from different groups while decreasing the distance between elements within the same group.
The triplet loss function is defined over a triplet ( a , p , n ), where a is the anchor (the element for which the loss value is calculated), p is the positive element (an element from the same group as the anchor), and n is the negative element (an element from a different group than the anchor). A minimal sketch of this loss follows the definitions below.
The formula for the loss function used in training this network is
$$\mathrm{Loss} = \mathrm{ReLU}\big(E_d(a, p) - E_d(a, n) + \mathrm{margin}\big)$$
where:
  • $E_d$ is the Euclidean distance;
  • a is the anchor element;
  • p is the positive element;
  • n is the negative element.
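A compact sketch of the triplet loss under these definitions (the margin and vector dimensionality are placeholders; PyTorch's built-in torch.nn.TripletMarginLoss implements the same expression):

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=1.0):
    # margin = 1.0 is a placeholder; the paper does not state the value used
    d_pos = F.pairwise_distance(anchor, positive)   # Euclidean distance anchor-positive
    d_neg = F.pairwise_distance(anchor, negative)   # Euclidean distance anchor-negative
    return F.relu(d_pos - d_neg + margin).mean()

# Toy usage with batched element vectors (the dimension 64 is illustrative)
a, p, n = torch.randn(4, 64), torch.randn(4, 64), torch.randn(4, 64)
loss = triplet_loss(a, p, n)
```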

4. Results

The Results section is structured in two parts. The first part covers the training results for each phase of our approach. The second part presents the model metrics for the web page similarity application, as well as the precision and recall metrics for spatial relationship detection.

4.1. Training Results

4.1.1. Autoencoders

Figure 8, Figure 9 and Figure 10 display graphs illustrating the change in the loss function over epochs. Based on our chosen validation set, we ensured that the models do not overfit the data; in all training runs, the validation error never increased.
The inputs for the selector autoencoder are the simplest, so there is not a significant difference between the validation and training errors. In contrast, for the style and attribute models, the difference in errors (validation and training) was more noticeable. The graphs also show training scenarios with both 50 and 90 pages, along with their respective validations using a 10-page validation set. It is evident that training with more pages yielded better results.
When the training set was increased from 50 to 90 web pages, the following improvements were observed at the end of 10 epochs of autoencoder training:
  • Attributes: ∼20% better loss value in training and ∼10% better in validation.
  • Selectors: ∼80% better loss value in both training and validation.
  • Styles: ∼59% better loss value in training and ∼33% better in validation.
From the above list of improvements when increasing the number of pages, we can conclude that increasing the dataset size enhances autoencoder performance. The selector autoencoder benefits the most from a larger dataset, while the attribute autoencoder benefits the least. This can be explained by the fact that selectors have more inherent structure and rules in their formation, whereas attributes can have a wide variety of values and show significant differences across web pages.
The graphs clearly show that the loss function value consistently decreases across all epochs. The use of the AdamW optimizer also reduced the chance of overfitting. This algorithm improves upon the Adam optimizer (which is based on adaptive moment estimation) by introducing a decoupled weight decay factor: at each weight update step, the weights are additionally shrunk by a small amount proportional to their magnitude, independently of the gradient-based update.

4.1.2. Main Model

Model training was conducted in phases because pairs from different web pages can have vastly different properties (attribute, style, and selector vectors), and training progressed very slowly when pairs from highly dissimilar pages appeared in quick succession. We solved this by splitting training into several steps, each validated with new pages. For validation, we selected three pages that use different layout approaches: CSS flex, CSS grid, and float with relative units. This selection ensured that the training did not favor any particular CSS approach for creating web page layouts. The training results by phase are shown in Figure 11.
The phases were structured so that we would take a portion of the dataset and train on it (e.g., the first 40 pages). In the next phase, training would continue with 30 new pages plus 10 pages from the previous phase. This approach prevented the model from overfitting while also giving it space to learn specific characteristics that are valid across diverse pages.
The model was also fine-tuned after initial training. For this step, we used geometric conditions that should hold true among the recognized labels. For example, if the model identifies a “left-right” relationship between elements A and B (meaning the left edge of element A is next to the right edge of element B), then when the input is reversed (element B then element A), the model should recognize a “right-left” relationship. This is the property of symmetry. We calculated the symmetry error using Mean Squared Error (MSE) between symmetric relationships when the pair is presented as (A, B) versus (B, A).
Additionally, if elements A and B have a “left-right” relationship, they cannot simultaneously have a “right-left” relationship. This is the property of exclusivity.
The results of the fine-tuning process are visible in Figure 12. We experimented with different influence factors for these geometric conditions. A factor of 0.3 proved to be the best, as the model achieved the lowest overall loss with this value.
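The sketch below illustrates how such symmetry and exclusivity penalties could be expressed; the label ordering, the index mapping, and the function name are our assumptions for illustration, while the 0.3 influence factor is the value reported above:

```python
import torch
import torch.nn.functional as F

# Assumed label order: [L-R, L-L, R-L, R-R, T-B, T-T, B-T, B-B]
# When the pair (A, B) is reversed to (B, A), each label maps to its mirrored counterpart.
MIRROR = [(0, 2), (1, 1), (2, 0), (3, 3), (4, 6), (5, 5), (6, 4), (7, 7)]

def geometric_penalty(pred_ab, pred_ba, factor=0.3):
    """Symmetry and exclusivity penalty for fine-tuning (illustrative formulation)."""
    # Symmetry: prediction i for (A, B) should match its mirrored prediction j for (B, A)
    symmetry = sum(F.mse_loss(pred_ab[..., i], pred_ba[..., j]) for i, j in MIRROR) / len(MIRROR)
    # Exclusivity: "left-right"/"right-left" (and "top-bottom"/"bottom-top")
    # cannot both hold for the same ordered pair
    exclusivity = (pred_ab[..., 0] * pred_ab[..., 2]).mean() + (pred_ab[..., 4] * pred_ab[..., 6]).mean()
    return factor * (symmetry + exclusivity)
```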
Below, we present the precision and recall metrics for the three selected validation pages. The first is Wikipedia (Figure 13), whose layout is primarily built using relative units and floated elements.
Figure 14 and Figure 15 show the recall and precision values for all spatial relationships for elements in the Wikipedia web page.
The second page is Futbin (Figure 16), whose layout is primarily built using the CSS flex property.
Figure 17 and Figure 18 display the recall and precision values for all spatial relationships in the Futbin web page. For elements up to depth 10, the average recall and precision values are better than when considering elements up to depth 16. This is likely because deeper levels contain a higher number of smaller elements, making their spatial relationships more challenging to determine.
The third page is the Japanese-grid page (Figure 19), whose layout is primarily built using CSS grid properties.
Figure 20 and Figure 21 show the recall and precision values for all spatial relationships. On this particular web page, the approach does not recognize “bottom-top” and “top-bottom” relationships, but it achieves similar results for other relationships compared to the previous two pages.
These three examples, using web pages not included in the training set, demonstrate that the model can effectively recognize certain relationships between elements. Furthermore, the page's implementation approach (e.g., CSS Grid or Flexbox) has no significant impact on the spatial relationship recognition results.
Another form of validation for this model can be found in Section 4.2. There, the model is validated in a real-world application on a much larger set of new pages.

4.1.3. Why Not BERT for Autoencoders?

In this section, we explain why we developed our own autoencoders instead of leveraging existing approaches such as the BERT model. BERT, developed by Google, is a prominent model for NLP. It excels at analyzing words within the context of their surrounding text, building on the transformer architecture, which is based on the multi-head attention mechanism (for more on BERT, see [30]). One key application of BERT is encoding words into representative vectors, a role our autoencoders also fulfill.
To evaluate whether BERT would be a better fit for our application compared to our custom autoencoders, we trained our approach using BERT as well. After one epoch of training on 40 pages, the results show that when our custom autoencoders are used to encode element properties, the model achieves a loss function value that is, on average, ∼8% better than when using BERT.
Furthermore, utilizing our custom autoencoders results in a simpler final model with a smaller memory footprint for weights (353.6 KB with our autoencoders versus 661.3 KB with BERT). The dataset size is also significantly smaller with our autoencoders (3.2 GB with our autoencoders versus 15 GB with BERT). For this analysis, we used ’bert-base-uncased’, which is one of the more compact BERT versions. For BERT’s encoding tensor, we took the output from its last layer (last_hidden_state[:, 0, :]). These results do not claim that the autoencoders in this paper are universally superior to BERT, but rather that their application in this specific context is more appropriate and efficient than using BERT.
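For reference, the BERT encoding described above can be obtained with the Hugging Face transformers library roughly as follows; we assume this library for the sketch, and the example input string is illustrative:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

text = "div.nav-bar > ul li:hover"          # e.g., a CSS selector treated as plain text
inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    outputs = bert(**inputs)
embedding = outputs.last_hidden_state[:, 0, :]   # [CLS] vector, shape (1, 768)
```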

4.2. Applying the Approach for Web Page Similarity Assessment

We used the previously described dataset from Bozkir et al. [20] to validate our approach. Since this dataset was developed independently of our methods, it provided a robust way to validate our models. Our hypothesis going into the experiment was that our approach could recognize the similarity of web pages with comparable success to the work from which the dataset was taken, but without using any results or intermediate results from a web browser (i.e., without using the DOM or screenshots of web pages).
Our experiment had two limitations. First, a portion of the pages in the dataset were inadequately archived. This limitation was easily overcome, as the number of poorly archived pages was small (2 out of 40). The second limitation was that the original approach, from the work where the dataset was introduced, used comparisons based on screenshots of only a portion of the web page (specifically, 1024 px of page height). This means the clustering in the original study was limited to only a part of the web page. Consequently, two web pages that are partially similar might not be similar in their entirety, and vice versa, web pages that are entirely similar might not appear so based on the sampled portions. This limitation was not as straightforward to overcome. However, by training an auxiliary neural network, we can implicitly assign greater importance to vectors or parts of vectors that originate from elements displayed within the 1024 px height.
The expected outcome of the experiment was that our approach would be able to identify the four distinct groups of similarity of the web pages and that there would be a similarity match with the method proposed in Bozkir et al. [20]. We also anticipated that the approach described in this paper would produce meaningful distances between pages that align with the groups from the aforementioned work.
The metric used in the original dataset paper is ANR (Average Normalized Rank). This metric, ranging from 0 to 1 (where 0 is the best value), evaluates how well a system can return expected results for a given input. It reflects the average relative position of the correct results among the retrieved results (from the system). ANR is frequently used in information retrieval tasks to assess whether a search consistently ranks elements from the same group close to each other.
The formula for ANR is
$$\mathrm{ANR} = \frac{1}{N \cdot N_r}\left(\sum_{i=1}^{N_r} R_i - \frac{N_r (N_r + 1)}{2}\right)$$
where:
  • N is the total number of elements;
  • $N_r$ is the number of relevant elements;
  • $R_i$ is the real rank of the i-th relevant element.
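As a small worked example, the metric can be computed for a single query as follows (the function name and the toy ranks are illustrative):

```python
def average_normalized_rank(relevant_ranks, n_total):
    """ANR for one query: `relevant_ranks` are the 1-based positions of the relevant
    items in the retrieved list, `n_total` is the total number of elements."""
    n_rel = len(relevant_ranks)
    return (sum(relevant_ranks) - n_rel * (n_rel + 1) / 2) / (n_total * n_rel)

# Toy example: 40 pages, 9 relevant pages retrieved at these positions
print(average_normalized_rank([1, 2, 3, 5, 6, 8, 10, 12, 20], n_total=40))  # ~0.061
```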
The utility model (the simple neural network for preparing vectors for comparison) was trained for 12 epochs. The training results are shown in Figure 22 and Figure 23. These graphs indicate that training over epochs reduces the model’s error, and validation with the ANR top 5 metric confirms that the model achieves improved performance.
After training, we measured the model's performance. On the training set, it achieved an ANR top 5 score of 0.119; on the validation set, its ANR top 5 score was 0.2.
To put this result in context, we compared the validation score (0.2) with the best-performing configuration reported by Bozkir et al. [20], which achieved an ANR of 0.1233. Our score is slightly worse, but our model has a notable advantage: it uses much less information. The original method relies on cropped screenshots and other web page data, whereas ours needs only the website's source code. Using cropped screenshots can be problematic because it may make web pages that are actually similar appear different from each other (and vice versa). Our method avoids this by focusing on the underlying code, which provides a more consistent basis for determining similarity.
The use of cropped screenshots by Bozkir et al. could have also impacted their ANR score. Because their source code is not publicly available, we could not compare ANR results using full web page screenshots to see how that might have affected their findings.
It is worth noting that the original paper also calculates the ANR top 10 metric, which we could not compute due to our reduced set of pages (some were discarded due to incomplete archiving). Another limitation of our results is that the original paper's top 5 comparison is based on user evaluation results, whereas we rely on predefined groups since complete user evaluation data is not available in our dataset.
The results of the evaluation metrics applied to our approach, mentioned earlier, showed that it yields similar results, even without access to all the information available to the method from the previously cited work (screen images and DOM values). This confirms our earlier hypothesis.
Even without these metrics, we could visually assess the capability of our approach by directly comparing the similarity matrices it generates against those from other methods. Figure 24 displays the similarity matrix of our approach. If we compare it to the matrix from the user evaluation in the paper that introduced the dataset (Figure 25a), we can see that, visually (in addition to the metrics already discussed), our approach produced results very similar to human perception.
In this case, one can even observe that web page groups 1 and 2 (the first half, up to page number 20) have a less clear boundary between them, while groups 3 and 4 (the latter half) are well demarcated. This aligns with the user evaluation results in the work of Bozkir et al. [20].
If we compare the results of our method with those from the earlier work (Figure 24 and Figure 25b), we can see that our approach visually corresponds better to human perception and more effectively distinguishes different groups. On the similarity matrices, the intersections of different groups show more white values, indicating greater distances.

5. Discussion

In this paper, we presented a generalized approach that can be applied without directly using a web browser. To build our model, we focused on modeling individual elements and their relationships as displayed on a web page. This process involved creating three autoencoder models for element attributes, selectors, and styles, along with one model for learning relationships based on these encodings.
The autoencoder models showed strong performance during training, effectively mapping data from the code domain (CSS code, CSS selectors, or HTML attributes) into a continuous value domain. Such values are well-suited for various neural network-based applications.
After developing models for encoding elements, we created a method to extract crucial information for learning directly from a web page’s document. Only during training did we identify relevant elements and their relationships using a K-D Tree and the DOM. The DOM provided the necessary relationship labels for the model to learn. With this, we had all the information required for training. On the other hand, during inference, the model could then take only HTML and CSS to recognize spatial relationships.
The models showed good results in recognizing spatial relationships between elements, with recall ranging from 0.35 to 0.979 and precision from 0.1 to 0.46. The low precision was a trade-off for achieving higher recall in this multi-label classification problem. Furthermore, precision varies significantly across different spatial relationships. Some relationships are inherently more complex and difficult to assess accurately. For instance, relationships between elements involving complex CSS layouts or elements with smaller dimensions are challenging to classify with high precision. The lower precision scores (e.g., 0.10) are often associated with these more ambiguous cases. The fact that the model still identifies most of the relationships (high recall) is what we consider a successful outcome for our application. While these results could have been improved with more training and a larger dataset, even with these values, the model has been successfully applied to the problem of web page similarity assessment.
Considering the results obtained in this paper, we can conclude that our approach is effective for assessing web page similarity. Unlike other methods discussed in the literature review and those known to the authors, ours is unique in that it does not require any data from a web browser (neither screenshots, nor the DOM tree of a parsed web page, nor a combination of these). Even without this information, our method achieves strong results with a validation ANR of 0.2. This is comparable to the approach presented in the work of Bozkir et al. [20], which uses all that information and achieved an ANR of 0.1233 for its best configuration. Based on a visual inspection of the similarity matrix, we found that the patterns of web page groupings our method created were very similar to those obtained from the original user evaluation data. In other words, our model classified web pages as “similar” in a way that closely matched human judgment.
These results show that we have a method that can successfully assess how visually similar web pages are without actually rendering them in a web browser. This confirms our hypothesis and also offers several advantages over other methods:
  • Performance: Methods relying on a web browser must load the entire web page, consuming significant memory and time (beyond HTML and CSS, all multimedia and scripts need to be loaded). We could not compare performance (time and memory) directly, as we do not have the source code for the method from the available dataset.
  • Determinism: Browser-dependent methods are sensitive to ads, browser versions, and the browser’s current state at the time of analysis. Methods that rely solely on source code do not have this problem; they will always produce the same output for the same input.
  • Security: If browser-reliant methods analyze malicious web pages, they risk potential security vulnerabilities because malicious scripts can execute alongside the page content. The method presented in this paper is immune to such attacks because it does not execute scripts from the web page, nor are web pages parsed within a browser.
Beyond these practical benefits, our approach shows a strong alignment with human perception, opening up possibilities for automating web page analysis. This approach automates parts of the analysis that would typically require human effort. In fact, human evaluation of visual web page similarity based solely on available source code is often impossible for more complex web pages with many elements and numerous CSS rules applying to the same elements.

Challenges, Limitations, and Future Work

Modeling web pages and their relationships is a challenging process. The challenges are numerous because the methods of creating web pages change drastically and frequently, leading to diverse web pages that achieve similar element layouts in different ways.
Some of these approaches require little data analysis (e.g., elements with fixed dimensions and positions), while others demand not only an analysis of the elements being compared but also an examination of their parents (and ancestors) and their siblings. Throughout this process, web pages often contain many elements that are effectively outliers (elements added as advertisements or as by-products of specific libraries, frameworks, or plugins). This further complicates the creation of a generalizable model.
These challenges show why creating adaptable methods for modeling web layouts is difficult and why many current methods depend on browser-based rendering. Our approach, however, was designed specifically to work despite these challenges. By focusing on the structural and spatial relationships identified directly from the source code, it can learn robust patterns that capture the layout's core elements. This demonstrates that, even though the overall task is demanding, our method offers a practical and effective solution to the problem, which we evaluated through its application in assessing web page similarity.
Our current model was trained on 100 web pages, which is time-consuming (one day for three training cycles). Training with more web pages would help the model better handle the many different ways web pages are designed and would likely lead to more accurate recognition of spatial relationships.
While the current accuracy is good enough for the application shown in this paper, other uses might need better precision. For those cases, the model could be retrained with more data using the same method described here.
Beyond the challenges and limitations already mentioned, it would be beneficial to explore other applications in the future. The work we have done has opened up possibilities for applying these models in various areas and to problems identified during the analysis of the field’s current state. Although this goes beyond the scope of this particular paper, it would be valuable to investigate how the models and approaches developed could be utilized in diverse domains and their specific challenges.

Author Contributions

Conceptualization, I.P. and V.O.; methodology, I.P.; software, I.P.; validation, I.P., D.P., and V.O.; formal analysis, V.O. and I.P.; investigation, I.P.; resources, I.P.; data curation, I.P.; writing—original draft preparation, I.P.; writing—review and editing, D.P. and V.O.; visualization, I.P.; supervision, V.O.; project administration, V.O.; funding acquisition, I.P. and V.O. All authors have read and agreed to the published version of the manuscript.

Funding

The APC was funded by the Federal Ministry of Education and Science of the Federation of Bosnia and Herzegovina.

Data Availability Statement

The dataset created for this paper can be seen at: https://drive.google.com/file/d/1lx35UwnwZX-fgjsQnM8ucddH5P6tH64g/view?usp=sharing, accessed on 1 September 2025.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:
DOM: Document Object Model
HTML: Hyper Text Markup Language
CSS: Cascading Style Sheets
IA: Information Architecture
UI: User Interface
SPA: Single-page Application
AMP: Accelerated Mobile Pages
UX: User Experience
ANN: Artificial Neural Networks
LSTM: Long Short-Term Memory
BPE: Byte Pair Encoding
NLP: Natural Language Processing
ANR: Average Normalized Rank
JS: JavaScript
URL: Uniform Resource Locator

References

  1. Rosenfeld, L.; Morville, P.; Arango, J. Information Architecture: For the Web and Beyond, 4th ed.; O’Reilly Media: Sebastopol, CA, USA, 2015. [Google Scholar]
  2. Marcotte, E. Responsive Web Design; A Book Apart, LLC: New York, NY, USA, 2011. [Google Scholar]
  3. Mozilla. SPA. Available online: https://developer.mozilla.org/en-US/docs/Glossary/SPA (accessed on 25 August 2025).
  4. AMP Project. AMP Documentation. Available online: https://amp.dev/documentation (accessed on 25 August 2025).
  5. Prazina, I.; Bećirović, Š.; Cogo, E.; Okanović, V. Methods for Automatic Web Page Layout Testing and Analysis: A Review. IEEE Access 2023, 11, 13948–13964. [Google Scholar] [CrossRef]
  6. Mao, J.; Bian, J.; Tian, W.; Zhu, S.; Wei, T.; Li, A.; Liang, Z. Detecting Phishing Websites via Aggregation Analysis of Page Layouts. Proc. Comput. Sci. 2018, 129, 224–230. [Google Scholar] [CrossRef]
  7. Zhang, W. Web Phishing Detection Based on Page Spatial Layout Similarity. Informatica 2013, 37, 231–244. [Google Scholar]
  8. Rosiello, A.P.E.; Kirda, E.; Kruegel, C.; Ferrandi, F. A Layout-Similarity-Based Approach for Detecting Phishing Pages. In Proceedings of the Third International Conference on Security and Privacy in Communications Networks and the Workshops—Securecomm, Nice, France, 17–21 September 2007; pp. 454–463. [Google Scholar]
  9. Walsh, T.A.; Kapfhammer, G.M.; McMinn, P. ReDeCheck: An Automatic Layout Failure Checking Tool for Responsively Designed Web Pages. In Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis, New York, NY, USA, 10–14 July 2017; ACM: New York, NY, USA, 2017; pp. 360–363. [Google Scholar]
  10. Ryou, Y.; Ryu, S. Automatic Detection of Visibility Faults by Layout Changes in HTML5 Web Pages. In Proceedings of the IEEE 11th Conference on Software Testing, Validation and Verification (ICST), Västerås, Sweden, 9–13 April 2018; pp. 182–192. [Google Scholar]
  11. Althomali, I.; Kapfhammer, G.M.; McMinn, P. Automated Visual Classification of DOM-Based Presentation Failure Reports for Responsive Web Pages. Softw. Test. Verif. Reliab. 2021, 31, e1756. [Google Scholar] [CrossRef]
  12. Nielsen, J. Heuristic Evaluation. In Usability Inspection Methods; Nielsen, J., Mack, R.L., Eds.; John Wiley & Sons: New York, NY, USA, 1994; pp. 25–64. [Google Scholar]
  13. Roy Choudhary, S.; Prasad, M.; Orso, A. X-PERT: A Web Application Testing Tool for Cross-Browser Inconsistency Detection. In Proceedings of the International Symposium on Software Testing and Analysis (ISSTA), San Jose, CA, USA, 21–25 July 2014; ACM Press: New York, NY, USA, 2014; pp. 417–420. [Google Scholar]
  14. Mahajan, S.; Alameer, A.; McMinn, P.; Halfond, W. XFix: An Automated Tool for the Repair of Layout Cross Browser Issues. In Proceedings of the ACM SIGSOFT International Symposium on Software Testing and Analysis, New York, NY, USA, 10–14 July 2017; pp. 368–371. [Google Scholar]
  15. Watanabe, W.M.; dos Santos, D.A.; de Oliveira, C. Layout Cross-Browser Failure Classification for Mobile Responsive Design Web Applications: Combining Classification Models Using Feature Selection. ACM Trans. Web 2023, 17, 32. [Google Scholar] [CrossRef]
  16. Alameer, A.; Mahajan, S.; Halfond, W.G.J. Detecting and Localizing Internationalization Presentation Failures in Web Applications. In Proceedings of the IEEE International Conference on Software Testing, Verification and Validation (ICST), Chicago, IL, USA, 11–16 April 2016; pp. 202–212. [Google Scholar]
  17. Alameer, A.; Chiou, P.T.; Halfond, W.G.J. Efficiently Repairing Internationalization Presentation Failures by Solving Layout Constraints. In Proceedings of the 12th IEEE Conference on Software Testing, Validation and Verification (ICST), Xi’an, China, 22–27 April 2019; pp. 172–182. [Google Scholar]
  18. Kolla, S. Best Practices for UI Testing Using Snapshot Testing. J. Comput. Sci. Technol. Stud. 2025, 7, 262–271. [Google Scholar] [CrossRef]
  19. Coppola, R.; Feldt, R.; Nass, M.; Alégroth, E. Ranking approaches for similarity-based web element location. J. Syst. Softw. 2025, 222, 112286. [Google Scholar] [CrossRef]
  20. Bozkir, A.S.; Sezer, E.A. Layout-based computation of web page similarity ranks. Int. J. Hum. Comput. Stud. 2018, 110, 95–114. [Google Scholar] [CrossRef]
  21. Bakaev, M.; Khvorostov, V.; Heil, S.; Gaedke, M. Evaluation of User-Subjective Web Interface Similarity with Kansei Engineering-Based ANN. In Proceedings of the IEEE 25th International Requirements Engineering Conference Workshops (REW), Lisbon, Portugal, 4–8 September 2017; pp. 125–131. [Google Scholar]
  22. Song, R.; Liu, H.; Wen, J.-R.; Ma, W.-Y. Learning Important Models for Web Page Blocks Based on Layout and Content Analysis. ACM SIGKDD Explor. Newsl. 2004, 6, 14–23. [Google Scholar] [CrossRef]
  23. Huynh, H.; Le, Q.T.; Nguyen, V.; Nguyen, T. A DOM-Structural Cohesion Analysis Approach for Segmentation of Modern Web Pages. World Wide Web 2025, 28, 23. [Google Scholar] [CrossRef]
  24. Cheng, A.; Liu, Y.; Li, W.; Dong, Q.; Wang, S.; Huang, Z.; Yin, D. Layout-Aware Webpage Quality Assessment. arXiv 2023, arXiv:2301.12152. [Google Scholar] [CrossRef]
  25. Kawamura, K.; Yamamoto, A. HTML-LSTM: Information Extraction from HTML Tables in Web Pages Using Tree-Structured LSTM. In Discovery Science, Proceedings of the 24th International Conference, DS 2021, Halifax, NS, Canada, 11–13 October 2021; Springer International Publishing: Cham, Switzerland, 2021; pp. 29–43. [Google Scholar]
  26. Seol, J.; Kim, S.; Yoo, J. Posterllama: Bridging Design Ability of Language Model to Content-Aware Layout Generation. In Computer Vision—ECCV 2024, Proceedings of the 18th European Conference, Milan, Italy, 29 September–4 October 2024; Springer Nature: Cham, Switzerland, 2024; pp. 451–468. [Google Scholar]
  27. Tang, Z.; Wu, C.; Li, J.; Duan, N. LayoutNUWA: Revealing the Hidden Layout Expertise of Large Language Models. arXiv 2023, arXiv:2309.09506. [Google Scholar] [CrossRef]
  28. PostCSS. Available online: https://www.npmjs.com/package/postcss (accessed on 1 July 2025).
  29. PurgeCSS. Available online: https://www.npmjs.com/package/purgecss (accessed on 1 July 2025).
  30. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers); Association for Computational Linguistics: Minneapolis, MN, USA, 2019; pp. 4171–4186. [Google Scholar]
Figure 1. Schematics of the proposed approach. Unlike other methods, this approach uses browser outputs as reference values only during training, relying solely on source code afterwards.
Figure 2. Spatial relationships between elements are illustrated here. The gray box represents the reference element, while the colored box indicates an element whose sides have a spatial relationship with the chosen reference.
Figure 3. This figure illustrates an autoencoder scheme. The encoder’s output is the result of the bottleneck layer. For training the autoencoder, the decoder’s components (the debottleneck layer, the LSTM layer below it, and the final output layer) are essential. During training, cross-entropy loss is used to evaluate how well the decoder can reconstruct the data encoded by the autoencoder.
Figure 4. The schema of the encodeNode method used in the model is designed to combine multiple vectors representing the current HTML element, along with its connected ancestors and siblings. This method ultimately reduces a list of these vectors into a single, combined vector. This combined vector contains all relevant data, not just from the current node, but also important information from its ancestors and siblings.
Figure 5. The schema of procesSSA method used in the encodeNode method. The method combines pairs of styles and their selectors for each CSS rule of the current HTML element.
Figure 6. Algorithm for generating a representative spatial information vector. This illustrates our approach: autoencoders extract element properties, which then feed into the model that creates the spatial relationship vector. This process is done for all elements on the web page, and their vectors are then aggregated.
Figure 7. Neural network schema of vector adaptation for the similarity comparison process.
Figure 8. The visualization of the change in the loss function during the training of the LSTM autoencoder for HTML element attributes. As the graph shows, the loss for the validation set also decreases, indicating that the model is not overfitting the data.
Figure 9. The visualization of the change during the training of the LSTM autoencoder for HTML element styles. It can be seen that the loss for the validation set also decreases, indicating that the model is not overfitting the data.
Figure 10. The visualization of the change in the loss function during the training of the LSTM autoencoder for CSS selectors. As the graph indicates, the loss for the validation set also decreases, confirming that the model is not overfitting the data.
Figure 11. Visualization of loss change during different stages of training for our model that recognizes spatial relationships.
Figure 12. Fine-tuning the model with geometric conditions.
Figure 13. Wikipedia page layout.
Figure 14. Wikipedia page: recall and precision for elements up to depth 10. At a depth of 10, some labels appear less frequently (or are entirely absent), leading to lower metric values for those specific relationships.
Figure 15. Wikipedia page: recall and precision for elements up to depth 16.
Figure 16. Futbin page layout.
Figure 17. Futbin page: recall and precision for elements up to depth 10.
Figure 18. Futbin page: recall and precision for elements up to depth 16.
Figure 19. Japanese grid page layout.
Figure 20. Japanese grid page: recall and precision for elements up to depth 10.
Figure 21. Japanese grid page: recall and precision for elements up to depth 16.
Figure 22. Logarithmic loss function value during 12 epochs of training.
Figure 23. This graph shows the ANR top 5 metric values throughout the training process, including the ANR value for the validation set. Lower values indicate better performance.
Figure 24. This matrix shows the distances between web pages using our method. Greener values indicate a smaller distance, meaning higher similarity.
Figure 25. This image presents the similarity matrix from the original paper [20] by Bozkir et al. (a) User evaluation results: Here, greener squares indicate more similar pages, reflecting user perceptions. (b) Best configuration from proposed method: This shows the similarity matrix generated by the best configuration of the method proposed in their paper. Reprinted from International Journal of Human-Computer Studies, Vol 110, Ahmet Selman Bozkir, Ebru Akcapinar Sezer, Layout-based computation of web page similarity ranks, Pages No. 95-114, Copyright (2018), with permission from Elsevier.
Table 1. Content of the created dataset for training.
File Name | Format | Description
pageSource | HTML | HTML source code from the web page
graph_2 | JSON | Graph structure of the document obtained using Selenium WebDriver DOM
graphSource | JSON | Graph structure of the document obtained using only the HTML source
layouts | JSON | The list of spatial relations with other elements
layouts2 | JSON | The list of spatial relations with other elements; classes and ids are replaced with standardized tokens
pairs | JSON | The list of pairs of elements that share some spatial relationship, and the list of ancestors and siblings for each element of the pair
matches | CSV | Matching of elements from the HTML source to the DOM elements
idMap | JSON | Map of original ids and their standardized tokens
classMap | JSON | Map of original classes and their standardized tokens
allCSS2 | JSON | The list of all CSS rules in the web page (unused rules and URLs are removed from the list)
stylePerElement | JSON | The list of all CSS rules per element in the web page
stylePerElement2 | JSON | The list of all CSS rules per element in the web page; classes and IDs are replaced with standardized tokens
completePairs | JSON | The list of pairs where all spatial relations for the pair are collected
encoded_elements | JSON | The encoded data for elements, with tensors of size 100 for attributes and a list of tensors for each style rule and selector. This encoding is calculated using the autoencoder method encode.
Table 2. Distribution of classes in dataset—without pruning common classes.
Relation type | L-R | L-L | R-L | R-R | T-B | T-T | B-T | B-B
Percent of samples with the type | 5.65 | 26.7 | 3.07 | 21.92 | 3.39 | 17.62 | 3.38 | 18.31

Table 3. Distribution of classes in dataset—after pruning common classes.
Relation type | L-R | L-L | R-L | R-R | T-B | T-T | B-T | B-B
Percent of samples with the type | 10.99 | 16.82 | 11.17 | 15.03 | 10.61 | 12.84 | 10.03 | 12.49
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
