Article

Region Partitioning Framework (RCF) for Scatterplot Analysis: A Structured Approach to Absolute and Normalized Data Interpretation

by
Eungi Kim
Department of Library and Information Science, Keimyung University, 1095 Dalgubeoldaero, Dalseo-Gu, Daegu 42601, Republic of Korea
Submission received: 20 February 2025 / Revised: 3 April 2025 / Accepted: 3 April 2025 / Published: 8 April 2025

Abstract

Scatterplots can reveal important data relationships, but their visual complexity can make pattern identification challenging. Systematic analytical approaches help structure interpretation by dividing scatterplots into meaningful regions. This paper introduces the region partitioning framework (RCF), a systematic method for dividing scatterplots into interpretable regions using k × k grids, in order to enhance visual data analysis and quantify structural changes through transformation metrics. RCF partitions the x and y dimensions into k × k grids (e.g., 4 × 4 or 16 regions), balancing granularity and readability. Each partition is labeled using an R(p, q) notation, where p and q indicate the position along each axis. Two perspectives are supported: the absolute mode, based on raw values (e.g., “very short, narrow”), and the relative mode, based on min–max normalization (e.g., “short relative to population”). I propose a set of transformation metrics—density, net flow, relative change ratio, and redistribution index—to quantify how data structures change between modes. The framework is demonstrated using both the Iris dataset and a subset of the airquality dataset, showing how RCF captures clustering behavior, reveals outlier effects, and exposes normalization-induced redistributions.

1. Introduction

Scatterplots are essential tools for visualizing relationships between two continuous variables, offering a clear representation of data distribution, patterns, and potential correlations. Over time, various techniques have been developed to enhance scatterplot visualization, including trend lines to indicate overall patterns, movement tracking to observe shifts over time [1], geometric clustering [2], statistical overlays [3], interactive exploration methods [4], normalization techniques [5], density estimation [6], and pattern recognition [7]. These enhancements improve usability, but tend to emphasize global summaries or interactivity, which limits their ability to reveal localized patterns within the data.
Existing scatterplot partitioning methods [8,9,10,11] typically segment data points based on absolute distributions, but often lack interpretability due to the absence of structured region labels and clear relative comparisons. Recent approaches have improved efficiency or density handling (e.g., extreme clustering [12]), but still fall short in supporting intuitive, relative interpretations of data distributions. Recognizing these gaps, recent research emphasizes frameworks that enhance scatterplots through structured segmentation, labeling, and explicit relative comparisons [13,14,15], highlighting the need for more interpretable and systematically relative approaches to scatterplot partitioning.
To address these limitations, this paper introduces the region partitioning framework (RCF), a method that divides scatterplots into a k × k grid where each region is labeled R(p,q). RCF enhances interpretability by generating descriptive, human-readable region labels such as “very short, very narrow” or “long, wide”. This approach allows for localized analysis that retains spatial and structural data relationships. The structured labeling system supports both analytical rigor and communicative clarity. With this method, users can pinpoint patterns within specific subregions rather than relying solely on aggregate measures or visual approximation.
In this study, RCF supports two modes: absolute partitioning, which segments based on raw x and y values, and relative partitioning, which normalizes values for cross-population comparison. While relative partitioning improves comparability, absolute partitioning better preserves raw data structure, an important distinction in contexts such as academic impact or innovation mapping, where natural clustering may be critical. The combined flexibility of both modes makes RCF a novel and versatile tool for scatterplot interpretation.
The remainder of this paper is organized as follows. Section 2 presents the formulation and algorithms of RCF, detailing both absolute and relative partitioning methods. Section 3 introduces four metrics to quantify transformation effects between partitioning modes. Section 4 applies RCF to the Iris and airquality datasets [16] with comparative visualizations. Section 5 discusses key characteristics, nuances of interpretation, and relationships to existing frameworks alongside potential applications. Section 6 outlines methodological limitations, and Section 7 concludes with key contributions and future research directions.

2. Region Partitioning Framework (RCF)

2.1. Region Partitioning

To effectively analyze scatterplots, the region partitioning framework (RCF) divides the data space into a k × k grid, creating structured regions for analysis. Each axis is divided into k intervals, resulting in k² total regions. Each region is labeled R(p,q), where p indicates the column index (from left to right along the x-axis) and q indicates the row index (from bottom to top along the y-axis). While numerically expressed, R(p,q) denotes categorical grid positions used for region labeling, not continuous values. For example, R(1,1) refers to the bottom-left region, while R(k,k) is the top-right region.
To determine which region a data point belongs to, the x- and y-axes are divided into fixed intervals using the (a, b] notation, which includes the upper bound, but not the lower. That is, a value belongs to an interval (a, b] if it is greater than a and less than or equal to b. This ensures that each data point is assigned to exactly one region, avoiding overlaps or gaps. To handle edge cases and ensure that maximum values are included, I make an exception for the last interval on each axis using [a, b], which includes both the lower and upper bounds. For example, if the x-axis is divided into four intervals using cut points x0, x1, x2, x3, x4, the intervals would be:
  • Region 1: (x0,x1].
  • Region 2: (x1,x2].
  • Region 3: (x2,x3].
  • Region 4 (last region): [x3,x4] (includes both endpoints).
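The interval rule above can be sketched in a few lines of Python (the paper's own computations were done in R; Python is used here only for illustration). This is a minimal sketch, not the published implementation. The function name `interval_index` and the example cut points are hypothetical. Two tie-breaking choices are assumptions of the sketch: a value equal to the lowest cut point x0 is placed in the first interval, and a value lying exactly on an interior cut point goes to the lower of the two adjacent intervals.

```python
from bisect import bisect_left

def interval_index(value, cuts):
    """Return the 1-based index of the interval that contains `value`.

    `cuts` is the sorted list of k + 1 cut points x0..xk. Intervals are
    right-closed, (cuts[i-1], cuts[i]]. Assumption of this sketch: a value
    equal to cuts[0] is assigned to interval 1.
    """
    if value < cuts[0] or value > cuts[-1]:
        raise ValueError("value lies outside the cut points")
    # bisect_left returns the first i with cuts[i] >= value, which is the
    # index of the right-closed interval containing value.
    return max(bisect_left(cuts, value), 1)

cuts = [0.0, 1.0, 2.0, 3.0, 4.0]  # x0..x4, hypothetical values
for v in (0.0, 0.5, 1.0, 1.01, 3.0, 4.0):
    print(v, "-> interval", interval_index(v, cuts))
```

With these cut points, the maximum value 4.0 lands in the last interval and 3.0 lands in the third, so every value receives exactly one index.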
The choice of k influences both interpretability and granularity. A small k results in broader regions that simplify patterns but may overlook finer structures, while a large k increases granularity but risks excessive sparsity or over-partitioning. In this study, I illustrate RCF with values such as 4, 16, and 64, selected to balance region detail with practical interpretability. Future work could explore adaptive methods where k is dynamically adjusted based on data characteristics, such as clustering tendencies or density distributions. The partitioning process begins by defining cut points along each axis, which determine the region boundaries. These cut points can be set in two ways:
  • Absolute partitioning: uses fixed-width intervals based on the raw data range.
  • Relative (normalized) partitioning: uses equal-width intervals after normalizing values to the [0, 1] range.
Once the cut points are established, each data point (x, y) is assigned to a region R(p, q) based on its position within these intervals, where p is the column index along the x-axis and q is the row index along the y-axis. Although numeric in format, these labels are categorical in function: they identify distinct regions of the partitioned grid rather than representing continuous values, and so provide a structured reference for analysis.
In absolute partitioning, the raw values of x and y are divided into fixed-width intervals based on the data’s range. In contrast, relative partitioning applies min–max normalization, which rescales the original values to fall between 0 and 1. The normalized values are computed as:
$$x_{\text{normalized}} = \frac{x - x_{\min}}{x_{\max} - x_{\min}}, \qquad y_{\text{normalized}} = \frac{y - y_{\min}}{y_{\max} - y_{\min}}$$
where $x_{\text{normalized}}$ and $y_{\text{normalized}}$ are the rescaled positions of data points along the x- and y-axes, with 0 corresponding to the minimum and 1 to the maximum observed value. These normalized axes are then divided into equal-width intervals, creating consistent region boundaries across datasets regardless of original scale. While this approach reduces the influence of extreme values, it does not guarantee a balanced distribution of data points across regions. Together, these two partitioning modes define the core structure of RCF and serve as the basis for the region-based comparisons in the following sections.
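As a rough illustration of the two ways of setting cut points, the following Python sketch computes min–max normalized values and equal-width cut points. The helper names `min_max_normalize` and `equal_width_cuts`, the sample values, and the absolute axis range of 0 to 12 are all assumptions made for the example; the text does not specify which fixed range the absolute grid spans.

```python
def min_max_normalize(values):
    """Rescale values to [0, 1] with min-max normalization."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical; normalization is undefined")
    return [(v - lo) / (hi - lo) for v in values]

def equal_width_cuts(lo, hi, k):
    """Return the k + 1 cut points of k equal-width intervals over [lo, hi]."""
    return [lo + i * (hi - lo) / k for i in range(k)] + [hi]

values = [2.0, 4.0, 6.0, 10.0]  # hypothetical raw x values
k = 4

# Relative mode: normalize first, then cut the unit interval.
print(min_max_normalize(values))        # [0.0, 0.25, 0.5, 1.0]
print(equal_width_cuts(0.0, 1.0, k))    # [0.0, 0.25, 0.5, 0.75, 1.0]

# Absolute mode: cut a fixed raw range (0 to 12 is an assumed range).
print(equal_width_cuts(0.0, 12.0, k))   # [0.0, 3.0, 6.0, 9.0, 12.0]
```

One consequence of these definitions is worth noting: if the absolute grid were cut over exactly the observed range from the minimum to the maximum, equal-width cuts on the raw values and equal-width cuts on the normalized values would place every point in the same interval. In this sketch the two modes differ only because the absolute grid spans a different fixed range.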

2.2. Algorithm Overview

The algorithm underlying RCF follows a systematic sequence of steps. First, the number of intervals k is determined, which in turn establishes the total number of regions m = k². Next, the appropriate cut points for both the x- and y-axes are computed, either as fixed-width intervals for absolute partitioning or as equal-width intervals after normalization for relative partitioning. Following this, if relative partitioning is used, the data are normalized into the [0, 1] range. Each data point is then assigned to its corresponding region R(p,q) based on its value relative to the established intervals. Finally, for each region, key statistical measures such as density, mean, variance, and range are computed. These statistics provide valuable insights into the distribution of points within each region, facilitating a more interpretable analysis of clustering patterns and data structure.
Through the consistent application of these definitions and procedures, RCF offers a robust methodology for dissecting scatterplots into human-interpretable regions. This approach not only simplifies the analysis of complex datasets but also enhances the clarity and reproducibility of the results by ensuring that symbols like k, m, and R(p, q) are used uniformly throughout the framework.
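The final step of the algorithm, computing region-level statistics, might look like the following Python sketch. It assumes the region of each point has already been determined. The function name `region_statistics` and the toy data are hypothetical, and the use of population variance is an assumption, since the text does not say whether sample or population variance is intended.

```python
from collections import defaultdict
from statistics import mean, pvariance

def region_statistics(points, regions):
    """Group (x, y) points by region and summarize each group.

    `regions[i]` is the (p, q) pair already assigned to `points[i]`.
    Variance is the population variance (an assumption of this sketch).
    """
    grouped = defaultdict(list)
    for point, region in zip(points, regions):
        grouped[region].append(point)

    total = len(points)
    stats = {}
    for region, members in grouped.items():
        xs = [x for x, _ in members]
        ys = [y for _, y in members]
        stats[region] = {
            "density": len(members) / total,
            "mean": (mean(xs), mean(ys)),
            "variance": (pvariance(xs), pvariance(ys)),
            "range": (max(xs) - min(xs), max(ys) - min(ys)),
        }
    return stats

points = [(1.0, 2.0), (3.0, 4.0), (8.0, 9.0)]   # hypothetical data
regions = [(1, 1), (1, 1), (4, 4)]              # hypothetical assignments
stats = region_statistics(points, regions)
print(stats[(1, 1)])
```

Only regions that contain at least one point appear in the result; empty regions would need to be filled in separately with a density of zero.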

2.3. Step-by-Step Procedure

Below are the steps for implementing RCF.
  • Determine the number of partitions: Calculate k = √m, where m is the desired total number of regions. Each region corresponds to a partition in the scatterplot grid, arranged in a k × k configuration. For example, if m = 16, then k = 4, resulting in a 4 × 4 grid. This defines how finely the x and y axes will be partitioned.
  • Compute axis cut points: (a) for absolute partitioning, divide the x-axis and y-axis into k fixed-width intervals based on the raw data range; (b) for relative partitioning, normalize x and y values to the [0, 1] range using min–max scaling, then apply uniform (equal-width) cuts along each axis. This creates consistent region boundaries across datasets, though it does not guarantee balanced data density across regions.
  • Normalize the data (only for relative partitioning): Apply min–max normalization to transform x and y values into the [0, 1] range. This ensures the partitioning is relative to the distribution of the data, enabling better comparability across different datasets.
  • Assign each point to a region: For each data point (x, y), determine its region R(p,q) based on which intervals the values fall into. Each axis is divided into k intervals using k + 1 cut points (e.g., x_0, x_1, …, x_k). The value x is assigned to the p-th interval if it falls in (x_{p−1}, x_p], that is, if it is greater than x_{p−1} and less than or equal to x_p. Similarly, y is assigned to the q-th interval (y_{q−1}, y_q]. To ensure that the maximum values are included, the last intervals are closed on both ends: [x_{k−1}, x_k] and [y_{k−1}, y_k]. For example, if x falls in the third interval and y in the second, the point is assigned to region R(3,2).
  • Compute region-level metrics: For each region, compute basic descriptive statistics (mean, variance, range) for x and y values and evaluate how the distribution shifts under relative partitioning using metrics such as density, net flow, relative change ratio, and redistribution index. For detailed definitions and analytical interpretations of these metrics, see Section 3.
  • Label each region: Each region R(p,q) is assigned a semantic label that describes its position on the x and y axes. For example, in data with features like “length” and “width,” labels might be “short length, narrow width” or “medium length, wide width.” These labels provide an intuitive understanding of the data distribution.
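The steps above can be strung together in a short end-to-end sketch. This is an illustrative Python outline under stated assumptions, not the R code in the accompanying repository. All function names, the toy data, and the absolute axis range of 0 to 10 are hypothetical; the label vocabularies are taken from the length/width example in the text. As an assumption of the sketch, a value equal to the lowest cut point is assigned to the first interval.

```python
import math
from bisect import bisect_left
from collections import Counter

X_LABELS = ["very short", "short", "medium", "long"]
Y_LABELS = ["very narrow", "narrow", "medium", "wide"]

def grid_size(m):
    """Step 1: k = sqrt(m); m must be a perfect square."""
    k = math.isqrt(m)
    if k * k != m:
        raise ValueError("m must be a perfect square")
    return k

def equal_width_cuts(lo, hi, k):
    """Step 2: k + 1 equal-width cut points over [lo, hi]."""
    return [lo + i * (hi - lo) / k for i in range(k)] + [hi]

def normalize(values):
    """Step 3: min-max normalization (relative mode only)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def axis_index(value, cuts):
    """Right-closed interval index; the lowest cut point maps to 1."""
    if not cuts[0] <= value <= cuts[-1]:
        raise ValueError("value outside the grid")
    return max(bisect_left(cuts, value), 1)

def assign_regions(xs, ys, x_cuts, y_cuts):
    """Step 4: map each (x, y) to its region (p, q)."""
    return [(axis_index(x, x_cuts), axis_index(y, y_cuts))
            for x, y in zip(xs, ys)]

def label(region):
    """Step 6: semantic label for a region of a 4 x 4 grid."""
    p, q = region
    return f"{X_LABELS[p - 1]} length, {Y_LABELS[q - 1]} width"

xs = [1.0, 2.0, 5.0, 9.0]   # hypothetical lengths
ys = [1.0, 3.0, 9.0, 5.0]   # hypothetical widths
k = grid_size(16)

# Absolute mode over an assumed fixed axis range of 0 to 10.
abs_cuts = equal_width_cuts(0.0, 10.0, k)
abs_regions = assign_regions(xs, ys, abs_cuts, abs_cuts)

# Relative mode: normalize, then cut the unit interval.
unit_cuts = equal_width_cuts(0.0, 1.0, k)
rel_regions = assign_regions(normalize(xs), normalize(ys), unit_cuts, unit_cuts)

# Step 5 (frequencies only): counts per region in each mode.
print(Counter(abs_regions))
print(Counter(rel_regions))
print(label((1, 2)))
```

In this toy example the second point moves from R(1,2) in absolute mode to R(1,1) in relative mode, which is the kind of shift the transformation metrics are meant to quantify.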

3. Measuring Effects of RCF Transformation

This study introduces four metrics to understand how RCF transforms data distribution:
  • Density shows where data points concentrate.
  • Net flow tracks how populations shift between regions.
  • Relative change ratio measures the size of changes compared to starting values.
  • Redistribution index indicates how much data move overall.
Each metric reveals a different aspect of how normalization reshapes data structures, giving us a complete picture of transformation effects.
The first metric is density, which represents the proportion of total data points assigned to a specific region. It is calculated as:
$$\text{density} = \frac{\text{frequency in region}}{\text{total number of points}}$$
This metric provides insight into how populated each region is before and after normalization, identifying areas that experience significant concentration or dispersion of data.
The second metric, net flow, measures the shift in data points between partitions due to normalization:
$$\text{net flow} = \text{frequency}_{\text{normalized}} - \text{frequency}_{\text{absolute}}$$
A positive value indicates that a region gains more data points after normalization. Conversely, a negative value suggests a loss.
The third metric, relative change ratio, expresses the relative change in a region’s data representation:
$$\text{relative change ratio} = \frac{\text{frequency}_{\text{normalized}} - \text{frequency}_{\text{absolute}}}{\text{frequency}_{\text{absolute}} + 1}$$
This measure prevents division by zero and helps quantify the extent to which a region expands or contracts relative to its original data composition.
The fourth metric, the redistribution index, measures how extensively data points have shifted across regions:
$$\text{redistribution index} = \frac{\left|\text{frequency}_{\text{normalized}} - \text{frequency}_{\text{absolute}}\right|}{\text{frequency}_{\text{absolute}} + \text{frequency}_{\text{normalized}} + 1}$$
A higher redistribution index (closer to 1) indicates substantial reassignment, suggesting that normalization meaningfully alters regional composition. Conversely, a lower value (closer to 0) implies minimal change, indicating that the original data structure remains largely intact. This metric is especially useful for evaluating whether normalization improves cross-dataset comparability or distorts meaningful clustering patterns. To prevent division by zero, a 1 is added to the denominator of both the relative change ratio and the redistribution index. For the redistribution index, whose numerator is the absolute difference in frequencies, this adjustment also keeps values between 0 and 1, where 0 indicates no change and values closer to 1 reflect stronger redistribution. The relative change ratio, by contrast, is signed and has no upper bound: it is negative for regions that lose points and can exceed 1 for regions that were nearly empty before normalization. Collectively, these four metrics (density, net flow, relative change ratio, and redistribution index) offer a comprehensive basis for understanding how distributional adjustments reshape regional structures, setting the stage for the empirical results presented in the next section.
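The four metrics can be computed per region from two frequency tables, one for each mode. The Python sketch below is illustrative only: the function name `transformation_metrics` and the counts are hypothetical, and the redistribution index is assumed to use the absolute difference in its numerator, which is the reading consistent with the non-negative values reported in Section 4.

```python
def transformation_metrics(freq_abs, freq_norm):
    """Compute per-region metrics from region -> count dictionaries.

    Regions missing from a dictionary are treated as having zero points.
    """
    total = sum(freq_abs.values())
    if total != sum(freq_norm.values()):
        raise ValueError("both modes must contain the same number of points")

    rows = {}
    for region in sorted(set(freq_abs) | set(freq_norm)):
        a = freq_abs.get(region, 0)
        n = freq_norm.get(region, 0)
        rows[region] = {
            "density_abs": a / total,
            "density_norm": n / total,
            "net_flow": n - a,
            "relative_change_ratio": (n - a) / (a + 1),
            "redistribution_index": abs(n - a) / (a + n + 1),
        }
    return rows

# Hypothetical counts for three regions, 10 points in each mode.
freq_abs = {(1, 1): 6, (2, 1): 0, (2, 2): 4}
freq_norm = {(1, 1): 3, (2, 1): 5, (2, 2): 2}

metrics = transformation_metrics(freq_abs, freq_norm)
for region, row in metrics.items():
    print(region, row)
```

Because every point is assigned to exactly one region in each mode, the net flows across all regions sum to zero. In the example, the initially empty region (2, 1) has a relative change ratio of 5.0, which shows that this ratio is not confined to the unit interval.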

4. Empirical Results and Analysis

This section evaluates the behavior of the proposed transformation metrics under two experimental settings using datasets with contrasting distributional characteristics. The analysis proceeds in two parts: the first involves a well-structured benchmark dataset, while the second explores a noisier real-world dataset. In both experiments, absolute and relative (normalized) partitioning schemes are applied using a 4 × 4 spatial grid, as described in Section 2.1. The Iris and airquality datasets, both standard datasets in R, ensure reproducibility and ease of use. The subsections that follow present visual comparisons, metric outputs, and interpretation of transformation effects for each dataset independently. All computations and visualizations were conducted using R, and the complete source code is available at https://github.com/egkim68/rcf (accessed on 31 March 2025).

4.1. Experiment I—Iris Dataset

4.1.1. Experimental Setup

The Iris dataset consists of 150 records, evenly divided among three species: Setosa, Versicolor, and Virginica. Each record includes four numeric features describing flower morphology: sepal length, sepal width, petal length, and petal width. For this experiment, sepal length and sepal width are selected as the input dimensions due to their interpretability and relatively distinct distribution patterns across species. A 4 × 4 spatial grid is applied to this two-dimensional feature space, using both absolute and normalized (relative) partitioning. This configuration enables analysis of how transformation metrics respond to different scaling assumptions and spatial partitioning in a well-structured dataset. Visual comparisons and region-based summaries are presented in Figure 1 and Tables 1 and 2.

4.1.2. Visualizing Partition Effects

Figure 1 illustrates the impact of absolute and relative partitioning applied to the sepal (length, width) and petal (length, width) measurements of the Iris dataset. The figure consists of four scatterplots, each representing a different partitioning strategy. Panel A (top-left plot) shows the raw sepal length and sepal width values divided into predefined absolute regions, while panel B (top-right plot) presents the same data in a normalized space, allowing for a scale-invariant comparison. Similarly, panel C (bottom-left plot) applies absolute partitioning to petal length and petal width, whereas panel D (bottom-right plot) displays their normalized counterparts. Each point in these plots is color-coded based on species, making it easier to observe natural clustering patterns. The regional divisions (R(p,q) labels) denote partitioned sections of the feature space, illustrating how different methodologies influence the granularity of data segmentation. This visualization helps compare how absolute partitioning, which relies on fixed-value divisions, contrasts with relative partitioning, which scales the data for consistency across different feature ranges. The results show how normalization can mitigate the effects of varying feature magnitudes, which makes comparisons more robust, especially when features differ in units or distribution.

4.1.3. Absolute and Relative Partitioning Analysis

Table 1 shows the region-wise distribution of points in the Iris dataset using absolute mode, where sepal length (X-axis) and sepal width (Y-axis) are divided into four equal-width intervals, forming a 4 × 4 grid. Each region is labeled R(p, q), with p representing sepal length and q representing sepal width. The characteristics for each region are defined by combinations of sepal length and sepal width categories, each mapped to specific value ranges. For sepal length (p), the categories and corresponding value ranges are very short (p = 1: minimum to 25% of range), short (p = 2: 25–50% of range), medium (p = 3: 50–75% of range), and long (p = 4: 75–100% of range). Similarly, for sepal width (q), very narrow (q = 1), narrow (q = 2), medium (q = 3), and wide (q = 4) correspond to the four successive equal-width quarters of the measurement range. For example, R(1,1) represents “very short sepal length, very narrow sepal width” and R(4,4) corresponds to “long sepal length, wide sepal width”. In absolute partitioning, these ranges are fixed-width intervals based on the raw data values.
The absolute mode reveals uneven data distribution due to fixed-width intervals applied to raw sepal measurements. Regions like R(1,1), R(2,1), and R(1,4) have zero counts, indicating no points fall in these intervals. In contrast, R(3,3) and R(4,3) have the highest densities (0.34 and 0.39, respectively). These regions, characterized as medium sepal length, medium sepal width and long sepal length, medium sepal width, show where most of the data points cluster. The use of fixed-width intervals results in sparse regions for low or high sepal values. Midrange values are more populated, reflecting the natural clustering in the Iris dataset.
Table 2 presents the region-wise distribution of points in the Iris dataset using relative mode, where sepal length (X-axis) and sepal width (Y-axis) are scaled to a [0, 1] range and divided into equal-width intervals, forming a 4 × 4 grid. Each region, labeled R(p, q), is characterized by positions such as “very short sepal length, very narrow sepal width” (R(1,1)) and “long sepal length, wide sepal width” (R(4,4)). This approach balances data distribution by reducing the influence of outliers.
Unlike absolute mode, relative mode results in a more even spread of points across the grid, with only R(1,4) and R(3,4) having zero counts. High-density regions, such as R(3,2) (density = 0.21) and R(2,2) (density = 0.18), are characterized as medium sepal length, narrow sepal width and short sepal length, narrow sepal width, respectively. Normalization rescales the data to [0, 1], reducing the influence of extreme values and improving comparability across regions. The use of uniform cuts in the normalized space helps avoid extreme sparsity, resulting in fewer empty regions than in absolute mode and enabling more consistent identification of sepal characteristics linked to species distinctions.

4.1.4. Density and Metric Interpretation in Relative Mode

This section interprets regional density patterns and transformation metrics under relative partitioning mode, which is referred to as “normalized” in the accompanying visualizations. Figure 2 (panels A–D) illustrates how the two partitioning methods affect data representation and interpretation. In the sepal analysis (panels A and B), absolute mode (panel A) uses fixed-width cuts, resulting in uneven densities. Certain regions, such as R(3,3) and R(4,3), exhibit high concentrations (dark red), while regions in the upper and left areas appear sparsely populated or empty, as indicated by white cells. This highlights absolute mode’s sensitivity to raw data scales and outliers, where sparsely populated regions may distort interpretation.
In contrast, relative mode (panel B) applies scaling to a [0, 1] range, ensuring that data points are more evenly distributed across regions. This minimizes the impact of outliers and enables relative comparisons. The heatmap for normalized sepal distributions (panel B) exhibits a smoother density distribution, with moderate-density regions distributed more uniformly across the plot and fewer empty regions compared to panel A. The petal measurements (panels C and D) show a similar pattern, though the contrast between absolute and relative modes is less pronounced, with both showing strong clustering in the R(1,1) region. Although density variation still exists across all panels, relative mode significantly improves interpretability by reducing distortions from extreme values and enhancing comparability across regions. The visual distinction between absolute and normalized methods, particularly evident in the sepal analysis (panels A vs. B), underscores the suitability of relative mode for analyses requiring outlier robustness and balanced density representation.
To quantify how data distributions change when switching from absolute to relative partitioning, I employ four key metrics that capture the shifts in point distribution: density, net flow, relative change ratio, and redistribution index. These metrics provide a comprehensive view of how points redistribute across sepal and petal regions during the normalization process, capturing both the direction and magnitude of changes between the two partitioning approaches.
Table 3 illustrates the patterns of change across partitioned sepal regions, comparing absolute and normalized distributions. Each region displays key metrics, including frequency, density, net flow, relative change ratio, and redistribution index. The frequency columns show the number of points in each region before and after normalization, whereas density represents the proportion of total points in each region. Net flow captures the change in point count, with positive values indicating an increase in points and negative values indicating a reduction. For instance, region R(3,3) experienced a sharp decrease in frequency from 51 (absolute) to 6 (normalized), resulting in a net flow of −45, a relative change ratio of −0.87, and a redistribution index of 0.78, indicating a significant redistribution of points. In contrast, regions such as R(2,2), which shifted from 0 to 27 points, exhibit large positive relative change ratio and redistribution index values. The mean values for all metrics are included at the bottom, revealing an average frequency of 9.38 points for both absolute and relative modes. The mean redistribution index of 0.78 indicates that on average, there was substantial movement of points across regions, reflecting the impact of normalization on sepal partitioned regions.
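The values quoted for R(3,3) and R(2,2) can be reproduced by hand from the metric definitions in Section 3. The worked arithmetic below assumes that the redistribution index uses the absolute difference in its numerator, which is consistent with the positive index of 0.78 reported for a region with negative net flow.

$$
\begin{aligned}
R(3,3):\quad \text{net flow} &= 6 - 51 = -45, \\
\text{relative change ratio} &= \frac{-45}{51 + 1} \approx -0.87, \\
\text{redistribution index} &= \frac{|-45|}{51 + 6 + 1} = \frac{45}{58} \approx 0.78, \\[4pt]
R(2,2):\quad \text{net flow} &= 27 - 0 = 27, \\
\text{relative change ratio} &= \frac{27}{0 + 1} = 27, \\
\text{redistribution index} &= \frac{27}{0 + 27 + 1} = \frac{27}{28} \approx 0.96.
\end{aligned}
$$

A region that starts empty therefore has a relative change ratio equal to its new count, so a few such regions can pull the mean ratio well above 1 even though every redistribution index stays below 1.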
Table 4 presents the patterns of change across partitioned petal regions, comparing absolute and normalized distributions. Each region (from R(1,1) to R(4,4)) includes metrics such as frequency, density, net flow, relative change ratio, and redistribution index. Frequency shows the number of points before and after normalization, while density represents the proportion of total points in each region. Net flow captures the change in point count, with positive values indicating an increase and negative values a decrease. For example, R(3,3) experienced a decrease in frequency from 43 (absolute) to 33 (normalized), resulting in a net flow of −10, a relative change ratio of −0.23, and a redistribution index of 0.13. In contrast, R(2,2) saw an increase from 3 to 10, with strong positive relative change and redistribution indices. Mean values show a frequency of 9.38 points, a net flow of 0, and a redistribution index of 0.15. This indicates less point movement in petal regions than in sepal regions. Both absolute and normalized modes maintain identical mean frequencies, as normalization preserves the total point count.
Compared to Table 3, Table 4 shows that petal regions exhibit much less structural change than sepal regions. Although both tables share the same mean frequency (9.38), the mean values of both the relative change ratio and redistribution index for petal regions are substantially lower than those of sepal regions. Specifically, petal regions have a mean relative change ratio of 0.15, whereas sepal regions have a much higher value of 7.13. Similarly, the mean redistribution index is 0.15 for petal regions compared to 0.78 for sepal regions. This indicates that normalization caused more substantial shifts in sepal regions, particularly in areas like R(3,3) and R(4,3), which experienced large frequency reductions. In contrast, changes in the corresponding petal regions are more moderate, indicating greater structural stability. These results suggest that sepal data were more unevenly distributed prior to normalization, requiring greater adjustment. The redistribution index thus serves as a useful diagnostic for evaluating whether normalization improves interpretability or alters meaningful patterns. Overall, the comparison between sepal and petal regions demonstrates how RCF captures structural shifts induced by normalization. The redistribution index and relative change ratio offer clear, interpretable measures of how different features respond to transformation. These results support the value of RCF in diagnosing stability versus distortion across partitioning modes, and they provide a foundation for the interpretive insights discussed in the next section.

4.2. Experiment II—Airquality Dataset

4.2.1. Experimental Setup

The second experiment uses the airquality dataset, which contains 153 daily measurements of atmospheric conditions in New York City from May to September 1973. Three variables are selected: ozone, wind, and temperature, with ozone serving as the primary outcome variable. Partitioning is applied to two variable pairs (ozone versus wind and ozone versus temperature) using a 4 × 4 grid. As in the previous experiment using the Iris dataset, both absolute and normalized (relative) partitioning are employed to examine how spatial distribution and transformation metrics respond to scale effects and irregular data concentrations. Compared to the Iris dataset, the airquality data exhibit greater variability, missing values, and less clearly defined structure, offering a more complex testbed for metric evaluation.

4.2.2. Visualizing Partition Effects

For the airquality dataset, I apply the same analytical framework but summarize results more concisely. The analysis focuses on selected variable pairs and key metrics to illustrate transformation effects. As shown in Figure 3, ozone levels vary across combinations of wind and temperature using both absolute and normalized partitioning. In the absolute views (panels A and C), ozone concentrations are highest when wind is low (R(2,1) in panel A) and when temperature is high (R(4,1) in panel C), reflecting intuitive relationships: low wind and high heat tend to trap pollutants, resulting in elevated ozone. When normalization is applied (panels B and D), the overall distribution remains consistent but becomes more balanced across regions. High-density zones become less extreme, and moderate-density regions become more visible due to uniform scaling. This shift reveals how normalization can smooth sharp concentrations and highlight subtle structural patterns that absolute partitioning may obscure. Thus, normalization supports both interpretability and comparability, especially in data with uneven natural distributions.

4.2.3. Comparative Transformation: Ozone vs. Temperature

Table 5 presents the transformation patterns across temperature-partitioned regions in the airquality dataset, comparing absolute and normalized densities. Several regions that had no data in the absolute mode (e.g., R(1,1), R(2,1), R(3,2)) gain substantial density after normalization, reflecting the redistribution of points due to scaling. In contrast, regions like R(4,1) and R(4,2), which originally had high absolute counts, show large negative net flows and high redistribution indices (0.93 and 0.49, respectively), indicating significant shifts away from concentrated extremes. Most of the redistribution occurs in the lower rows (higher temperature ranges), especially along the first column, confirming that ozone levels tend to be higher when temperatures are elevated. However, normalization smooths these extremes by redistributing density more evenly across regions. Thus, whereas normalization enhances regional balance and surfaces hidden structures, it may also dampen meaningful extremes inherent in the original data—highlighting the trade-off between interpretability and fidelity to natural environmental distributions.

4.2.4. Comparative Transformation: Ozone vs. Wind

Following the format of Table 5, Table 6 presents the transformation patterns across wind-partitioned regions in the airquality dataset, comparing absolute and normalized densities. The overall changes are more moderate than those observed in the temperature-based analysis. Region R(2,1) exhibits the highest absolute and normalized frequencies, increasing from 34 to 39, with a small net flow (+5), a low relative change ratio (0.14), and a low redistribution index (0.07), suggesting it remains a central concentration zone both before and after normalization. Similarly, region R(3,1), which also had a high initial frequency, experiences a slight reduction (from 33 to 27), yielding a modest net flow (−6) and redistribution index (0.10), indicating only a subtle redistribution. Notable shifts include regions like R(1,3), where the frequency increased from 3 to 7, resulting in a relative change ratio of 1.00 and redistribution index of 0.36. On the other hand, region R(2,3) experienced a decrease from 9 to 5, with a negative relative change (−0.40) and redistribution index of 0.27. These changes, while not extreme, highlight how normalization redistributes densities in less populated regions. Still, the overall redistribution is limited: many regions (especially in the bottom rows and far-right columns) remain at zero before and after normalization. Thus, compared to the temperature-based patterns, wind-related transformations exhibit lower volatility, with key concentration zones largely preserved. Normalization in this context smooths minor disparities without substantially altering the structural layout of the data.
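As a consistency check, the frequencies quoted above for the four wind regions can be passed through the metric definitions from Section 3. The Python sketch below uses only the counts and rounded values stated in this paragraph. The variable names are hypothetical, and the redistribution index is assumed to take the absolute difference in its numerator.

```python
# (region, frequency absolute, frequency normalized,
#  reported relative change ratio or None, reported redistribution index)
reported = [
    ("R(2,1)", 34, 39, 0.14, 0.07),
    ("R(3,1)", 33, 27, None, 0.10),   # ratio not quoted in the text
    ("R(1,3)", 3, 7, 1.00, 0.36),
    ("R(2,3)", 9, 5, -0.40, 0.27),
]

TOLERANCE = 0.005  # reported values are rounded to two decimals

consistent = True
for region, f_abs, f_norm, ratio_reported, index_reported in reported:
    net_flow = f_norm - f_abs
    ratio = net_flow / (f_abs + 1)
    index = abs(net_flow) / (f_abs + f_norm + 1)
    ok = abs(index - index_reported) < TOLERANCE
    if ratio_reported is not None:
        ok = ok and abs(ratio - ratio_reported) < TOLERANCE
    consistent = consistent and ok
    print(f"{region}: net flow {net_flow:+d}, ratio {ratio:.2f}, "
          f"index {index:.2f}, matches text: {ok}")
```

Each recomputed value agrees with the figure quoted in the paragraph to two decimal places.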

5. Discussion

Demonstrations with the Iris and airquality datasets illustrate RCF’s flexibility and its potential usefulness across diverse areas, including health analytics and research evaluation. Intuitive labels, such as “short length, narrow width”, further enhance clarity, enabling easier interpretation, effective communication, and systematic comparisons across datasets. Although direct comparisons with existing methods are challenging, RCF uniquely combines structured labeling, quantifiable redistribution effects, and adjustable granularity through the parameter k. Furthermore, relative partitioning enhances robustness against skewed data and outliers by uniformly normalizing the data, balancing regional densities, and reducing distortions arising from scale differences.
In the broader context of scatterplot design, several frameworks have been proposed to improve scalability and interpretability as data complexity increases. Sarikaya and Gleicher [17] outlined a task-driven framework for selecting scatterplot designs based on data characteristics, aligning with RCF’s goal of flexible, interpretable analysis. Rutter et al. [18] demonstrated how interactive visualizations enhance model diagnostics in RNA-seq studies, whereas Goh et al. [19] addressed perceptual limits by introducing techniques like sliding windows and axis bisection to reveal hidden associations. Doppalapudi et al. [13] found that scatterplot presentation affects user trust in recommender systems, emphasizing the importance of visual design. Complementing these, Rave et al. [14] proposed a density-equalizing transformation using integral images to reduce clutter and enhance readability in dense scatterplots. These visualization-enhancing strategies mirror RCF’s aim of making complex patterns more interpretable through labeled region structures. In terms of binning logic, RCF’s use of (a, b] intervals—with closed final bounds to ensure complete coverage—aligns with established conventions. Cattaneo et al. [20], for example, adopt a similar approach in their nonlinear binscatter framework, using right-closed intervals and a fully closed upper bin. Though their method is regression-focused, the underlying partitioning logic closely parallels RCF’s structure, reinforcing its design validity.
Two characteristics of RCF are worth noting. First, the mean redistribution index measures how strongly data points redistribute between regions after normalization, with higher scores indicating substantial restructuring of cluster distributions and lower scores suggesting preserved structural integrity. Analysis of the Iris dataset revealed that sepal regions have a much higher mean redistribution index than petal regions (0.78 in Table 3 versus 0.15 in Table 4), indicating that normalization had a more pronounced effect on sepal-based distributions, a finding that aligns with the natural variability in sepal dimensions across species. In parallel, results from the airquality dataset showed that ozone–temperature combinations underwent more substantial redistribution than ozone–wind combinations, suggesting that temperature had a more uneven or extreme distribution in the original data. Second, the choice between absolute and relative partitioning significantly affects data interpretation. Absolute partitioning preserves raw distributions and effectively identifies natural clusters and morphological variations, but it remains vulnerable to outlier effects. Relative partitioning, conversely, enhances cross-group comparability by evening out the distribution of data points across the feature space and reducing scale-dependent distortions, though it may diminish the visibility of natural variations in the process. The redistribution index serves as a valuable tool for quantifying these normalization-induced structural changes.
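
As a rough check on the sepal versus petal comparison, the mean redistribution index can be recomputed from the absolute and normalized frequencies listed in Tables 3 and 4. The sketch below is illustrative only: the helper names are hypothetical, and the index formula is an assumption inferred from the tabulated values, not a quoted definition.

```python
def redistribution_index(abs_count: int, norm_count: int) -> float:
    # ASSUMPTION: formula inferred from the table values, including the +1 term.
    return abs(norm_count - abs_count) / (abs_count + norm_count + 1)


def mean_redistribution(abs_counts, norm_counts) -> float:
    """Average the per-region index over all k x k regions."""
    scores = [redistribution_index(a, n) for a, n in zip(abs_counts, norm_counts)]
    return sum(scores) / len(scores)


# Frequencies in table row order: R(1,1), R(2,1), R(3,1), R(4,1), R(1,2), ...
sepal_abs = [0, 0, 0, 0, 0, 0, 1, 3, 0, 0, 51, 58, 0, 0, 31, 6]    # Table 3
sepal_norm = [6, 12, 5, 1, 16, 27, 32, 8, 19, 9, 6, 3, 0, 6, 0, 0]
petal_abs = [48, 2, 0, 0, 0, 3, 12, 0, 0, 0, 43, 8, 0, 0, 8, 26]   # Table 4
petal_norm = [50, 0, 0, 0, 0, 10, 18, 0, 0, 1, 33, 9, 0, 0, 10, 19]

sepal_mean = mean_redistribution(sepal_abs, sepal_norm)
petal_mean = mean_redistribution(petal_abs, petal_norm)
print(f"sepal: {sepal_mean:.2f}, petal: {petal_mean:.2f}")
```

Averaging over all 16 regions, including empty ones, matches the Mean rows of Tables 3 and 4 to two decimal places under the assumed formula.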
One notable strength of RCF is its relative partitioning method, which facilitates meaningful interpretations of datasets with outliers or skewed distributions. Absolute partitioning relies on fixed-width intervals based on raw data, making it vulnerable to distortion by extreme values and resulting in sparsely populated or misleading regions. In contrast, relative partitioning employs min–max normalization, dividing the [0, 1] interval into equal-width segments. This normalization approach reduces the disproportionate impact of outliers by distributing data more evenly across regions, thereby supporting clearer and more balanced relative comparisons. In the airquality dataset, absolute partitioning highlighted only the most extreme ozone concentrations in a few regions, whereas relative partitioning clarified relative patterns by smoothing extremes and uncovering moderately dense regions previously obscured, though at the cost of diminishing some naturally meaningful extreme observations.
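
The two partitioning modes described above can be sketched as follows. This is a hedged illustration, not the published code: the function names are hypothetical, the first interval is treated as closed on the left so that the minimum is covered, and the absolute mode is assumed to span 0 to the observed maximum (an assumption suggested by the empty low regions in Table 1, not a rule stated in the text). The sample values are illustrative sepal-length-like numbers.

```python
import math

# Axis labels used for sepal length in Tables 1 and 2 (k = 4).
LABELS = ["very short", "short", "medium", "long"]


def bin_index(value: float, lo: float, hi: float, k: int = 4) -> int:
    """Return the 1-based index of the equal-width (a, b] interval holding value.

    The first interval also includes lo (assumption about complete coverage).
    Assumes hi > lo.
    """
    if not lo <= value <= hi:
        raise ValueError("value lies outside [lo, hi]")
    if value == lo:
        return 1
    return min(k, math.ceil((value - lo) / (hi - lo) * k))


def relative_bins(values, k: int = 4):
    """Min-max normalize to [0, 1], then bin on equal-width segments."""
    lo, hi = min(values), max(values)
    return [bin_index((v - lo) / (hi - lo), 0.0, 1.0, k) for v in values]


def absolute_bins(values, k: int = 4):
    """ASSUMPTION: absolute mode spans 0 to the observed maximum."""
    hi = max(values)
    return [bin_index(v, 0.0, hi, k) for v in values]


sample = [4.3, 5.0, 5.8, 7.9]  # illustrative values, not the full dataset
abs_idx = absolute_bins(sample)
rel_idx = relative_bins(sample)
for v, a, r in zip(sample, abs_idx, rel_idx):
    print(f"{v}: absolute={LABELS[a - 1]}, relative={LABELS[r - 1]}")
```

With these sample values, the absolute mode places every point in the "medium" or "long" bins, while the relative mode spreads them from "very short" to "long", which mirrors the contrast between Tables 1 and 2.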
In addition to these analytic advantages, RCF also lends itself well to integration with interactive dashboards and automated analytic workflows. While visualizations allow users to observe how data points are distributed, RCF’s metrics—such as net flow, relative change ratio, and redistribution index—enable precise tracking of how partition structures change across conditions. This dual capability makes RCF especially valuable for real-time data exploration, such as dynamic filtering or zooming in visual dashboards. For example, users could examine localized shifts by interacting with specific subregions (e.g., R(3,2)) and immediately receive quantitative feedback on structural transformation. Such integration with tools like Shiny [21] or Plotly [22] would extend RCF’s utility beyond academic settings, offering an explainable and adaptable framework for applied data visualization and decision support.
RCF is broadly applicable across domains where interpretability and comparative analysis are essential. First, in health analytics, RCF helps analyze BMI distributions: absolute partitioning reveals natural height–weight clusters, and relative partitioning enables fairer comparisons across demographic groups. A high redistribution index—around 0.8—in this context indicates that normalization substantially reshapes interpretation. Second, in research evaluation, RCF can uncover citation–ranking patterns. For instance, highly cited, but lower-ranked journals may appear in region R(4,2), and normalization adjusts for field-specific citation norms. A moderate redistribution index—around 0.5—suggests improved fairness without losing meaningful distinctions. Third, in environmental datasets like airquality, RCF helps differentiate between structural changes due to natural data skew (e.g., high ozone on hot, windless days) and those induced by normalization, allowing analysts to understand how preprocessing affects spatial interpretation. More broadly, RCF supports applications that require consistent spatial segmentation—such as comparing datasets, populations, or time points—by offering interpretable region labels and dynamic metrics not typically available in traditional methods. Overall, RCF enables interpretable, region-based analysis and facilitates structured comparisons across diverse datasets.

6. Limitations

RCF enables structured scatterplot analysis across various domains, but presents several important limitations. First, fixed grid partitioning can oversimplify complex or nonlinear patterns, particularly at lower resolutions. Although higher resolutions provide greater granularity, they may also increase interpretive complexity. Second, normalization improves comparability across scales, but can obscure meaningful variation in raw data. Third, region labels in normalized mode are dataset-specific and may not generalize well to other contexts. Fourth, redistribution indices quantify structural shifts, but do not indicate whether these shifts enhance or distort interpretation, requiring cautious analysis. Lastly, RCF assumes all data points carry equal weight, which may be limiting for applications involving importance-weighted or hierarchical data.

7. Conclusions

Scatterplots often display complex patterns that can be challenging to interpret clearly. RCF helps clarify these patterns by dividing scatterplots into concise, labeled regions using absolute and relative partitioning. Unlike conventional clustering or static grid-based methods, RCF assigns human-readable labels to regions, providing clearer insights into data structure. Equal-width partitioning after min–max normalization tends to yield more evenly populated regions and better density balance, and it enables meaningful comparisons across datasets with varying measurement scales. This structured approach supports multilevel analysis, from identifying localized clusters to conducting broader cross-group evaluations, as demonstrated through the Iris and airquality datasets.
A key contribution of RCF is its ability to assess the impact of normalization on data structure, particularly through the redistribution index, which quantifies shifts in data density after transformation. This metric helps determine whether normalization enhances comparability or suppresses meaningful structural variation. Future studies may explore ways to make the partitioning process more adaptive: for instance, by adjusting grid resolution dynamically based on local density or by using coarser bins where data is sparse. Another direction is extending RCF to multivariate settings, potentially through layered or interactive visualizations. Applying RCF to large-scale or weighted datasets may also help evaluate its utility in real-world contexts. Finally, benchmarking RCF against other partitioning strategies or synthetic baselines could help further validate its interpretability and refine its analytical potential.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original data and figures presented in this study are included within the article. The source code utilized to perform partitioning, compute region-level metrics, and generate visualizations is openly accessible via GitHub at https://github.com/egkim68/rcf, version 1.0.0 (accessed on 24 March 2025). Further inquiries regarding data or methods can be directed to the author.

Conflicts of Interest

The author declares no competing interests.

References

  1. Chan, Y.H.; Correa, C.D.; Ma, K.L. The Generalized Sensitivity Scatterplot. IEEE Trans. Vis. Comput. Graph. 2013, 19, 1768–1781.
  2. Cui, W.; Zhou, H.; Qu, H.; Wong, P.C.; Li, X. Geometry-Based Edge Clustering for Graph Visualization. IEEE Trans. Vis. Comput. Graph. 2008, 14, 1277–1284.
  3. Emerson, J.W.; Green, W.A.; Hartigan, J.A. The Generalized Pairs Plot. J. Comput. Graph. Stat. 2013, 22, 79–91.
  4. Hilasaca, G.M.; Marcílio, W.E., Jr.; Eler, D.M.; Martins, R.M.; Paulovich, F.V. A Grid-Based Method for Removing Overlaps of Dimensionality Reduction Scatterplot Layouts. IEEE Trans. Vis. Comput. Graph. 2024, 30, 5733–5749.
  5. Sanchez-Cabo, F.; Trajanoski, Z.; Cho, K.H.; Wolkenhauer, O. A Graphical User Interface to Normalize Microarray Data. In Proceedings of the 3rd International Workshop on Distributed Statistical Computing, Vienna, Austria, 20–22 March 2003; pp. 1–16.
  6. Hurley, C.B. Clustering Visualizations of Multidimensional Data. J. Comput. Graph. Stat. 2004, 13, 788–806.
  7. Xia, J.; Lin, W.; Jiang, G.; Wang, Y.; Chen, W.; Schreck, T. Visual Clustering Factors in Scatterplots. IEEE Comput. Graph. Appl. 2021, 41, 79–89.
  8. Poddar, M.; Sohns, J.-T.; Beck, F. Not Just Alluvial: Towards a More Comprehensive Visual Analysis of Data Partition Sequences. In Vision, Modeling, and Visualization; Linsen, L., Thies, J., Eds.; Eurographics Association: Lower Saxony, Germany, 2024; pp. 1–8.
  9. Mishra, A.; Gupta, A. A Partition-Based Visual Secret Sharing Scheme for Large Images. In Proceedings of the 2015 1st International Conference on Next Generation Computing Technologies (NGCT), Dehradun, India, 4–5 September 2015; IEEE: Piscataway, NJ, USA, 2015; pp. 885–890.
  10. Chen, Y.; De Spiegelaere, W.; Trypsteen, W.; Gleerup, D.; Vandesompele, J.; Lievens, A.; Vynck, M.; Thas, O. Benchmarking Digital PCR Partition Classification Methods with Empirical and Simulated Duplex Data. Brief. Bioinform. 2024, 25, bbae120.
  11. Thrun, M.C.; Ultsch, A. Using Projection-Based Clustering to Find Distance- and Density-Based Clusters in High-Dimensional Data. J. Classif. 2021, 38, 280–312.
  12. Wang, S.; Li, Q.; Zhao, C.; Zhu, X.; Yuan, H.; Dai, T. Extreme Clustering—A Clustering Method via Density Extreme Points. Inf. Sci. 2021, 542, 24–39.
  13. Doppalapudi, S.; Stiso, J.T.; Matzen, K.S. How Scatterplots Influence Trust in AI Recommendations. arXiv 2024, arXiv:2409.13917.
  14. Rave, H.; Molchanov, V.; Linsen, L. De-Cluttering Scatterplots with Integral Images. IEEE Trans. Vis. Comput. Graph. 2025, 31, 2114–2125.
  15. Bae, S.S.; Fujiwara, T.; Tseng, C.; Szafir, D. Uncovering How Scatterplot Features Skew Visual Class Separation. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems (CHI ’25), Yokohama, Japan, 26 April–1 May 2025; ACM: New York, NY, USA, 2025.
  16. R Core Team. R Datasets Package: Base R Built-in Datasets; R Project for Statistical Computing. 2024. Available online: https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/00Index.html (accessed on 20 March 2025).
  17. Sarikaya, A.; Gleicher, M. Scatterplots: Tasks, Data, and Designs. IEEE Trans. Vis. Comput. Graph. 2018, 24, 402–416.
  18. Rutter, J.D.; O’Shaughnessy, K.; Wright, E.A. Visualization Methods for Differential Expression Analysis. BMC Bioinform. 2019, 20, 128.
  19. Goh, W.W.B.; Foo, R.; Wong, L. What Can Scatterplots Teach Us about Doing Data Science Better? Res. Sq. 2022, preprint.
  20. Cattaneo, M.D.; Jansson, M.; Ma, X. Nonlinear Binscatter Methods. arXiv 2024, arXiv:2407.15276.
  21. Posit, PBC. Shiny: Web Application Framework for R. Available online: https://shiny.posit.co/ (accessed on 1 April 2025).
  22. Plotly Technologies Inc. Plotly for Python: Interactive Graphing Library. Available online: https://plotly.com/python/ (accessed on 1 April 2025).
Figure 1. Comparison of absolute and relative partitioning on the Iris dataset. Absolute mode reflects raw scale boundaries, while relative mode distributes data evenly across regions.
Figure 2. Density comparison under absolute and relative partitioning (“Normalized” in plots)—Iris dataset. Relative mode shows a more balanced distribution across 4 × 4 grids of sepal and petal dimensions.
Figure 3. Density comparison under absolute and normalized partitioning (airquality dataset). Normalized partitioning balances ozone distribution across wind and temperature regions, revealing subtler patterns while reducing extreme concentrations.
Table 1. Absolute-mode summary (Iris dataset).

| Region | Count | Mean X | Mean Y | Density | Characteristics |
|---|---|---|---|---|---|
| R(1,1) | 0 | N/A | N/A | 0.00 | very short sepal length, very narrow sepal width |
| R(1,2) | 0 | N/A | N/A | 0.00 | very short sepal length, narrow sepal width |
| R(1,3) | 0 | N/A | N/A | 0.00 | very short sepal length, medium sepal width |
| R(1,4) | 0 | N/A | N/A | 0.00 | very short sepal length, wide sepal width |
| R(2,1) | 0 | N/A | N/A | 0.00 | short sepal length, very narrow sepal width |
| R(2,2) | 0 | N/A | N/A | 0.00 | short sepal length, narrow sepal width |
| R(2,3) | 0 | N/A | N/A | 0.00 | short sepal length, medium sepal width |
| R(2,4) | 0 | N/A | N/A | 0.00 | short sepal length, wide sepal width |
| R(3,1) | 0 | N/A | N/A | 0.00 | medium sepal length, very narrow sepal width |
| R(3,2) | 1 | 5.00 | 2.00 | 0.01 | medium sepal length, narrow sepal width |
| R(3,3) | 51 | 5.26 | 2.84 | 0.34 | medium sepal length, medium sepal width |
| R(3,4) | 31 | 5.18 | 3.65 | 0.21 | medium sepal length, wide sepal width |
| R(4,1) | 0 | N/A | N/A | 0.00 | long sepal length, very narrow sepal width |
| R(4,2) | 3 | 6.07 | 2.20 | 0.02 | long sepal length, narrow sepal width |
| R(4,3) | 58 | 6.61 | 2.94 | 0.39 | long sepal length, medium sepal width |
| R(4,4) | 6 | 6.88 | 3.57 | 0.04 | long sepal length, wide sepal width |
Table 2. Relative-mode summary (Iris dataset).

| Region | Count | Mean X | Mean Y | Density | Characteristics |
|---|---|---|---|---|---|
| R(1,1) | 6 | 0.17 | 0.14 | 0.04 | very short sepal length, very narrow sepal width |
| R(1,2) | 16 | 0.11 | 0.45 | 0.11 | very short sepal length, narrow sepal width |
| R(1,3) | 19 | 0.19 | 0.63 | 0.13 | very short sepal length, medium sepal width |
| R(1,4) | 0 | N/A | N/A | 0.00 | very short sepal length, wide sepal width |
| R(2,1) | 12 | 0.39 | 0.19 | 0.08 | short sepal length, very narrow sepal width |
| R(2,2) | 27 | 0.42 | 0.36 | 0.18 | short sepal length, narrow sepal width |
| R(2,3) | 9 | 0.32 | 0.64 | 0.06 | short sepal length, medium sepal width |
| R(2,4) | 6 | 0.33 | 0.87 | 0.04 | short sepal length, wide sepal width |
| R(3,1) | 5 | 0.57 | 0.17 | 0.03 | medium sepal length, very narrow sepal width |
| R(3,2) | 32 | 0.63 | 0.41 | 0.21 | medium sepal length, narrow sepal width |
| R(3,3) | 6 | 0.59 | 0.56 | 0.04 | medium sepal length, medium sepal width |
| R(3,4) | 0 | N/A | N/A | 0.00 | medium sepal length, wide sepal width |
| R(4,1) | 1 | 0.94 | 0.25 | 0.01 | long sepal length, very narrow sepal width |
| R(4,2) | 8 | 0.86 | 0.40 | 0.05 | long sepal length, narrow sepal width |
| R(4,3) | 3 | 0.92 | 0.72 | 0.02 | long sepal length, medium sepal width |
| R(4,4) | 0 | N/A | N/A | 0.00 | long sepal length, wide sepal width |
Table 3. Patterns of change across sepal partitioned regions (Iris dataset).

| Region | Frequency Absolute | Frequency Normalized | Density Absolute | Density Normalized | Net Flow | Relative Change Ratio | Redistribution Index |
|---|---|---|---|---|---|---|---|
| R(1,1) | 0 | 6 | 0.00 | 0.04 | 6.00 | 6.00 | 0.86 |
| R(2,1) | 0 | 12 | 0.00 | 0.08 | 12.00 | 12.00 | 0.92 |
| R(3,1) | 0 | 5 | 0.00 | 0.03 | 5.00 | 5.00 | 0.83 |
| R(4,1) | 0 | 1 | 0.00 | 0.01 | 1.00 | 1.00 | 0.50 |
| R(1,2) | 0 | 16 | 0.00 | 0.11 | 16.00 | 16.00 | 0.94 |
| R(2,2) | 0 | 27 | 0.00 | 0.18 | 27.00 | 27.00 | 0.96 |
| R(3,2) | 1 | 32 | 0.01 | 0.21 | 31.00 | 15.50 | 0.91 |
| R(4,2) | 3 | 8 | 0.02 | 0.05 | 5.00 | 1.25 | 0.42 |
| R(1,3) | 0 | 19 | 0.00 | 0.13 | 19.00 | 19.00 | 0.95 |
| R(2,3) | 0 | 9 | 0.00 | 0.06 | 9.00 | 9.00 | 0.90 |
| R(3,3) | 51 | 6 | 0.34 | 0.04 | −45.00 | −0.87 | 0.78 |
| R(4,3) | 58 | 3 | 0.39 | 0.02 | −55.00 | −0.93 | 0.89 |
| R(1,4) | 0 | 0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| R(2,4) | 0 | 6 | 0.00 | 0.04 | 6.00 | 6.00 | 0.86 |
| R(3,4) | 31 | 0 | 0.21 | 0.00 | −31.00 | −0.97 | 0.97 |
| R(4,4) | 6 | 0 | 0.04 | 0.00 | −6.00 | −0.86 | 0.86 |
| Mean | 9.38 | 9.38 | 0.06 | 0.06 | 0.00 | 7.13 | 0.78 |
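Table 4. Patterns of change across petal partitioned regions (Iris dataset).

| Region | Frequency Absolute | Frequency Normalized | Density Absolute | Density Normalized | Net Flow | Relative Change Ratio | Redistribution Index |
|---|---|---|---|---|---|---|---|
| R(1,1) | 48 | 50 | 0.32 | 0.33 | 2.00 | 0.04 | 0.02 |
| R(2,1) | 2 | 0 | 0.01 | 0.00 | −2.00 | −0.67 | 0.67 |
| R(3,1) | 0 | 0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| R(4,1) | 0 | 0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| R(1,2) | 0 | 0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| R(2,2) | 3 | 10 | 0.02 | 0.07 | 7.00 | 1.75 | 0.50 |
| R(3,2) | 12 | 18 | 0.08 | 0.12 | 6.00 | 0.46 | 0.19 |
| R(4,2) | 0 | 0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| R(1,3) | 0 | 0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| R(2,3) | 0 | 1 | 0.00 | 0.01 | 1.00 | 1.00 | 0.50 |
| R(3,3) | 43 | 33 | 0.29 | 0.22 | −10.00 | −0.23 | 0.13 |
| R(4,3) | 8 | 9 | 0.05 | 0.06 | 1.00 | 0.11 | 0.06 |
| R(1,4) | 0 | 0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| R(2,4) | 0 | 0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| R(3,4) | 8 | 10 | 0.05 | 0.07 | 2.00 | 0.22 | 0.11 |
| R(4,4) | 26 | 19 | 0.17 | 0.13 | −7.00 | −0.26 | 0.15 |
| Mean | 9.38 | 9.38 | 0.06 | 0.06 | 0.00 | 0.15 | 0.15 |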
Table 5. Patterns of change across ozone-temperature regions (airquality dataset).

| Region | Frequency Absolute | Frequency Normalized | Density Absolute | Density Normalized | Net Flow | Relative Change Ratio | Redistribution Index |
|---|---|---|---|---|---|---|---|
| R(1,1) | 0 | 17 | 0.00 | 0.11 | 17 | 17.00 | 0.94 |
| R(2,1) | 0 | 31 | 0.00 | 0.20 | 31 | 31.00 | 0.97 |
| R(3,1) | 33 | 23 | 0.22 | 0.15 | −10 | −0.29 | 0.18 |
| R(4,1) | 39 | 1 | 0.25 | 0.01 | −38 | −0.95 | 0.93 |
| R(1,2) | 0 | 0 | 0.00 | 0.00 | 0 | 0.00 | 0.00 |
| R(2,2) | 0 | 0 | 0.00 | 0.00 | 0 | 0.00 | 0.00 |
| R(3,2) | 0 | 20 | 0.00 | 0.13 | 20 | 20.00 | 0.95 |
| R(4,2) | 30 | 10 | 0.20 | 0.07 | −20 | −0.65 | 0.49 |
| R(1,3) | 0 | 0 | 0.00 | 0.00 | 0 | 0.00 | 0.00 |
| R(2,3) | 0 | 0 | 0.00 | 0.00 | 0 | 0.00 | 0.00 |
| R(3,3) | 0 | 2 | 0.00 | 0.01 | 2 | 2.00 | 0.67 |
| R(4,3) | 12 | 10 | 0.08 | 0.07 | −2 | −0.15 | 0.09 |
| R(1,4) | 0 | 0 | 0.00 | 0.00 | 0 | 0.00 | 0.00 |
| R(2,4) | 0 | 0 | 0.00 | 0.00 | 0 | 0.00 | 0.00 |
| R(3,4) | 0 | 2 | 0.00 | 0.01 | 2 | 2.00 | 0.67 |
| R(4,4) | 2 | 0 | 0.01 | 0.00 | −2 | −0.67 | 0.67 |
Table 6. Patterns of change across ozone-wind regions (airquality dataset).

| Region | Frequency Absolute | Frequency Normalized | Density Absolute | Density Normalized | Net Flow | Relative Change Ratio | Redistribution Index |
|---|---|---|---|---|---|---|---|
| R(1,1) | 0 | 1 | 0.00 | 0.01 | 1.00 | 1.00 | 0.50 |
| R(2,1) | 34 | 39 | 0.22 | 0.25 | 5.00 | 0.14 | 0.07 |
| R(3,1) | 33 | 27 | 0.22 | 0.18 | −6.00 | −0.18 | 0.10 |
| R(4,1) | 5 | 5 | 0.03 | 0.03 | 0.00 | 0.00 | 0.00 |
| R(1,2) | 6 | 9 | 0.04 | 0.06 | 3.00 | 0.43 | 0.19 |
| R(2,2) | 18 | 15 | 0.12 | 0.10 | −3.00 | −0.16 | 0.09 |
| R(3,2) | 6 | 6 | 0.04 | 0.04 | 0.00 | 0.00 | 0.00 |
| R(4,2) | 0 | 0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| R(1,3) | 3 | 7 | 0.02 | 0.05 | 4.00 | 1.00 | 0.36 |
| R(2,3) | 9 | 5 | 0.06 | 0.03 | −4.00 | −0.40 | 0.27 |
| R(3,3) | 0 | 0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| R(4,3) | 0 | 0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| R(1,4) | 2 | 2 | 0.01 | 0.01 | 0.00 | 0.00 | 0.00 |
| R(2,4) | 0 | 0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| R(3,4) | 0 | 0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| R(4,4) | 0 | 0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |

Share and Cite

MDPI and ACS Style

Kim, E. Region Partitioning Framework (RCF) for Scatterplot Analysis: A Structured Approach to Absolute and Normalized Data Interpretation. Metrics 2025, 2, 6. https://doi.org/10.3390/metrics2020006
