Article

Applying Methods of Exploratory Data Analysis and Methods of Modeling the Unemployment Rate in Spatial Terms in Poland

1
Department of Socio-Economic Geography, Institute of Spatial Management and Geography, Faculty of Geoengineering, University of Warmia and Mazury in Olsztyn, Prawochenskiego Street 15, 10-720 Olsztyn, Poland
2
Institute of Urban Geography, Tourism and Geoinformation, Faculty of Geographical Sciences, University of Lodz, 90-139 Łódź, Poland
*
Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(8), 4136; https://doi.org/10.3390/app15084136
Submission received: 24 January 2025 / Revised: 24 March 2025 / Accepted: 3 April 2025 / Published: 9 April 2025
(This article belongs to the Special Issue IoT in Smart Cities and Homes, 2nd Edition)

Abstract

The level of unemployment in a region can be a good illustration of its socio-economic development. The choice of data modeling method, in both spatial and spatio-temporal approaches, depends on the results of exploratory data analysis. The aim of the research is to investigate which methods of GIS spatial analysis can be used for the cartographic presentation of the variability of the unemployment rate in Poland, broken down into provinces (voivodeships) and districts, in terms of time and space. This goal will be achieved by performing an exploratory analysis of data on the unemployment rate in Poland for the period 2004–2022 in order to select GIS methods for the cartographic presentation and communication of the level of unemployment in Poland in spatial and spatio-temporal terms. This study, excluding data analysis based on statistical tests, focuses on examining the distribution of unemployment rates in Poland by districts and provinces from 2004 to 2022. This leads to the selection of optimal methods for the visualization and analysis of spatial data. Data analysis methods based on statistical tests and the examination of the distribution of data on the unemployment rate in Poland at county (district) and province (voivodeship) level for the period 2004–2022 will be used to validate the results of the research. The selection of optimal methods of visualization and analysis of spatial data is intended to be a model for use in other areas of research.

1. Introduction

The purpose of this article is to model the unemployment rate, which gives a good indication of the level of socio-economic development of regions. A spatial description of unemployment as a description of negative socio-economic phenomena is an important task of spatial econometrics, while modeling economic processes must be characterized by spatial dependencies [1]. The spatial analysis of unemployment rates has been widely studied, utilizing various modeling techniques [2,3,4]. A spatial approach requires the use of exploratory data analysis (EDA) and based on the results of this analysis, selecting the most appropriate visualization methods. This problem has been described in foreign literature [5,6,7,8] and in Polish literature [9,10]. EDA is the basis for making decisions about the choice of space-modeling tools [11]. According to Filiztekin [12], most of the models presented in the literature (see the comprehensive review in [13]) discuss the existence of unemployment disparities, but do not necessarily relate them to geography. Therefore, in Filiztekin’s research, Moran’s I test is calculated to examine the role of location in shaping the distribution of unemployment. The investigation of spatial clustering effects also requires an analysis of the spatial distribution of unemployment rates [14], as well as the application of statistical tests of spatial dependence to assess the non-random distribution of neighboring regions, as emphasized by the authors. Studying regional unemployment rates necessitates demonstrating positive spatial autocorrelation, revealing clusters of geographically adjacent regions with similar unemployment levels through the use of Moran’s I test [15]. Understanding the persistence of regional unemployment requires the consideration of spatial heterogeneity [16]. Regional models that incorporate spatial factors often provide a deeper insight into unemployment trends, particularly in areas affected by labor market transformations [17]. 
The analysis of spatial patterns in economic activity helps identify factors that influence regional unemployment rates [18]. In practice, Moran’s I tests are applied to both large [19] and small samples [20], and spatial autocorrelation indices provide evidence of spatial interactions between regions, which must be accounted for in the regression analyses and spatial modeling of unemployment rates, for instance, by using Moran’s scatterplots [21,22]. To analyze and visualize spatial patterns, researchers employ Local Indicators of Spatial Association (LISA) statistics [23] and Getis-Ord Gi* statistics [24,25], including their extensions for spatio-temporal analysis [26]. Visualization tools such as choropleth maps and cartograms are used, taking into account the heterogeneity of administrative units, both for regression models and for representing unemployment rates [7,27,28]. Additionally, Kernel Density Estimation (KDE) is used for spatial visualization [29].
In this publication, the type of distribution and the statistical characteristics of data such as the unemployment rate will be examined and determined for various locational (spatial) approaches. For this purpose, an exploratory analysis of data on the unemployment rate measured in provinces, districts, and cities in Poland will be carried out in GIS software (ArcGIS Pro 3.4). Then, following the analysis of statistical test results, appropriate spatial modeling methods will be selected and applied. The aim of the article is to examine the unemployment rate in Poland as statistical data and to analyze it spatially at the provincial level for the period 2004–2022, and at the district and city level in 2021, using statistically justified data modeling tools in GIS. The innovativeness of the proposed approach lies in the application of advanced spatial analysis methods for visualizing and modeling unemployment, integrating not only classical statistical indicators but also geostatistical techniques to assess local trends and spatial heterogeneity, thereby enabling more precise decision-making in labor market policy.
The authors posed the following research questions: Do traditional methods of unemployment analysis adequately reflect their spatial structure? Which spatial analysis techniques provide the most accurate representation of unemployment distribution and support decision-making in labor market policies? To what extent does exploratory data analysis (EDA) influence the selection of modeling and visualization methods for unemployment rate analysis?
Traditionally, studies on the unemployment rate focus on identifying the causes and factors influencing its level. In this study, a different approach was adopted, whereby the primary objective of the analysis was to achieve the best possible spatial representation of the phenomenon. To this end, exploratory data analysis (EDA) was used as the foundation for spatial analysis. Additionally, it was examined whether the data exhibited significant spatial dependencies, which was a key step in further modeling.

2. Materials and Methods

The methodological innovation in this study lies in its shift from traditional unemployment rate analyses, which typically focus on identifying causal factors, to an approach prioritizing the spatial representation of the phenomenon. The study employs exploratory data analysis (EDA) as the foundation for spatial analysis, first testing the null hypothesis of randomness to determine whether unemployment exhibits significant spatial patterns. If the results confirm that unemployment is not randomly distributed, the application of spatial modeling techniques is justified instead of simple visualizations of the average unemployment rate based on administrative boundaries. Interpolation methods, particularly Kriging, were used to create a continuous spatial surface from discrete data, optimizing model selection based on approximation errors compared to actual measurements. This innovative approach enhances precision in representing unemployment distribution, offering a more accurate and detailed spatial depiction of the phenomenon.
The unemployment rate data analyzed in this study were obtained from the official statistics provided by the Polish Central Statistical Office (GUS). The data are calculated using the classical method, based on the official Eurostat definition, ensuring consistency and comparability with other European statistical data. According to the official statistics of the Polish Central Statistical Office (GUS) and Eurostat, the unemployment rate is defined as the ratio of the number of unemployed individuals actively seeking work to the number of economically active individuals. This definition is widely accepted in economic research and ensures comparability with international studies. The dataset covers the period 2004–2022 and is publicly available in the Local Data Bank (https://bdl.stat.gov.pl/BDL/start, 20 January 2025) and the Geostatistical Portal (https://geo.stat.gov.pl/, 20 January 2025), as well as in the annual Statistical Yearbooks published by GUS. This ensures the reliability and transparency of the input data used for spatial analyses presented in this paper. According to Bal-Domańska, this is the most commonly used measure of unemployment, defined as the ratio of the number of people who are unemployed but actively seeking work to the number of economically active individuals [30]. In the literature, numerous studies focus on modeling unemployment changes in Poland [31], which can be classified into those utilizing spatial analyses and those employing GIS for visualizing the results of econometric research. All of these studies address the spatial variation of unemployment levels in Poland [32,33], with some specifically dedicated to the application of spatial analyses [34]. Spatial models extend classical econometric models by incorporating so-called spatial effects: spatial dependence and spatial heterogeneity [35,36]. 
However, these studies most commonly address the spatial aspect in a modeling context [37] or focus on correlations with potential factors [22,38].
The authors formulated the following research hypotheses:
H1. 
Traditional methods of unemployment analysis, based on administrative divisions, may distort the actual spatial structure of this phenomenon, justifying the need for spatial analysis methods.
H2. 
Exploratory data analysis (EDA) facilitates the optimal selection of modeling and visualization methods for unemployment rate analysis, improving result interpretation accuracy and enhancing its application in labor market policy.
H3. 
The application of spatial analysis techniques allows for a more precise identification of unemployment distribution patterns, enabling better-targeted policy interventions and resource allocation in labor market planning.
The data modeling process applied in this study follows a universal scheme (Figure 1).
The results of EDA determine the further selection of visualization methods and spatial analyses. This approach ensures that the applied tools are statistically justified and aligned with the nature of the data. Unfortunately, this methodological rigor is not always maintained, and the validation parameters are often omitted, which significantly reduces the credibility and reliability of such data. The first stage is to perform a spatial autocorrelation analysis (using Global Moran’s I). Next, the nearest neighbor analysis is applied to evaluate spatial clustering tendencies. Data distribution analysis, using charts, histograms, and QQ plots, helps assess whether the data meet normality assumptions. Finally, outlier detection is performed using Thiessen polygons and Voronoi diagrams. The visualization and modeling method is selected based on the results of these analyses. If the null hypothesis of random data distribution is confirmed, the recommended visualizations include choropleth maps, diagrams, graduated symbols, and proportional symbols. However, if the null hypothesis is rejected, spatial interpolation techniques (both geostatistical and deterministic) can be applied without methodological concerns. Additionally, model optimization is performed using the parametric estimation quality assessment method (MPQE), and advanced spatial analyses such as the hot spot analysis and Kernel Density Maps are recommended to identify significant spatial patterns.
A spatial dependency analysis was conducted in ArcGIS Pro, and appropriate methods for visualizing the unemployment rate were selected based on the results of EDA, which comprises a set of statistical techniques used to characterize observations exhibiting spatial relationships.
In the first step, the EDA will be performed using a spatial autocorrelation analysis (Global Moran’s I). The analysis is based on a binary weights matrix, and the significance of Moran’s I statistic is assessed by means of a test in which the following hypotheses are verified [39]:
H0. 
Spatial autocorrelation does not occur.
H1. 
There is a spatial autocorrelation.
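The test can be illustrated with a minimal sketch, assuming a binary weights matrix stored as neighbour lists. The function names, the six-region example, and its values are hypothetical and are not taken from the study. Significance is assessed here with a random permutation test, which is a substitute technique chosen for brevity and not necessarily the procedure used by the GIS software.

```python
import random


def morans_i(values, neighbors):
    """Global Moran's I with binary weights (w_ij = 1 if j is a neighbour of i)."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    s0 = sum(len(nb) for nb in neighbors)  # sum of all binary weights
    cross = sum(dev[i] * dev[j] for i in range(n) for j in neighbors[i])
    return (n / s0) * cross / sum(d * d for d in dev)


def permutation_test(values, neighbors, n_perm=999, seed=0):
    """Pseudo p-value: share of random relabelings whose |I| is at least
    as large as the observed |I| (assumed test, see the lead-in)."""
    rng = random.Random(seed)
    observed = morans_i(values, neighbors)
    shuffled = list(values)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(morans_i(shuffled, neighbors)) >= abs(observed):
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical example: six regions in a row, each adjacent to the next one.
chain = [[1], [0, 2], [1, 3], [2, 4], [3, 5], [4]]
trend = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]        # similar values side by side
alternating = [1.0, 6.0, 1.0, 6.0, 1.0, 6.0]  # dissimilar values side by side

i_trend, p_trend = permutation_test(trend, chain)
i_alt, _ = permutation_test(alternating, chain)
print(f"trend: I = {i_trend:.3f}, pseudo p = {p_trend:.3f}")
print(f"alternating: I = {i_alt:.3f}")
```

A positive statistic points to neighbouring units with similar values and a negative one to neighbouring units with dissimilar values. H0 is rejected only when the pseudo p-value falls below the chosen significance level.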
In the second step of the EDA, the Average Nearest Neighbor test will be performed. If the index is less than 1, the pattern shows clustering; if the index is greater than 1, the tendency is towards dispersion or competition [40,41]. In the third step of the EDA, the type of data distribution will be examined using a histogram and a normal QQ plot. Understanding data distribution is an important step in the data-mining process [42]. For comparison, a normal distribution overlay can be added to the histogram [43]. Normal QQ plots, like histograms, make it possible to study the effect of logarithmic and square-root transformations on the distribution of data while comparing them with the normal distribution [44]. In the fourth step of the EDA, outliers will be identified using Thiessen polygons. Each Thiessen polygon contains exactly one input point feature, and every location within a Thiessen polygon is closer to its associated point than to any other input point feature. The resulting Voronoi diagrams are used to determine the outliers. A global outlier is a measured sample point that has a very high or very low value relative to all values in the dataset. A local outlier is a measured sample point whose value lies within the normal range for the entire dataset but is unusually high or low compared with the surrounding points [45].
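The nearest neighbor index from the second step can be sketched as follows. This is an illustration under stated assumptions: the expected distance and its standard error use the standard constants for a random pattern (0.5 and 0.26136), the study area is passed in explicitly, and the point sets are hypothetical.

```python
import math


def average_nearest_neighbor(points, area):
    """Return (index, z_score) for a point pattern within a study area.

    index < 1 suggests clustering, index > 1 suggests dispersion.
    """
    n = len(points)
    nearest = []
    for i, (xi, yi) in enumerate(points):
        # Distance from point i to its closest other point.
        nearest.append(min(math.hypot(xi - xj, yi - yj)
                           for j, (xj, yj) in enumerate(points) if j != i))
    observed = sum(nearest) / n
    expected = 0.5 / math.sqrt(n / area)         # mean distance under randomness
    std_err = 0.26136 / math.sqrt(n * n / area)  # standard error of that mean
    return observed / expected, (observed - expected) / std_err


# Hypothetical regular grid: 16 points with unit spacing in a 4 x 4 area.
grid = [(float(i), float(j)) for i in range(4) for j in range(4)]
grid_index, grid_z = average_nearest_neighbor(grid, area=16.0)

# Hypothetical tight group of 4 points inside a 10 x 10 area.
group = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1)]
group_index, group_z = average_nearest_neighbor(group, area=100.0)

print(f"grid:  index = {grid_index:.2f}, z = {grid_z:.2f}")
print(f"group: index = {group_index:.2f}, z = {group_z:.2f}")
```

The result is sensitive to the area value, so the same study area should be used whenever indices are compared.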
The next stage of work will be the selection of spatial statistics tools for visualization. Statistics are divided into descriptive and inferential statistics (statistical tests). In the case of descriptive statistics, classification depends more on the choices and methods of the analyst than on the results of statistical tests. In statistical tests, on the other hand, the analysis is more likely to produce clusters whose size, shape or distribution cannot be influenced by the analyst, because they are formed automatically by the algorithm of the analysis technique. If the null hypothesis of a random distribution of data is confirmed, cartograms, diagrams, and graduated and proportional symbols will be produced; if the null hypothesis is rejected, data interpolation and model optimization will be performed using the parametric estimation quality assessment method. In addition, hot spot analyses and kernel density maps will be produced.
A cartogram represents a phenomenon within the boundaries of a given territorial unit by color, shade or hatching. Depending on the situation and the intention of the analyst, this can make the cartogram overly general, which can be seen as a disadvantage. Another disadvantage of the cartogram is the freedom to choose a scale, which can change the appearance of a given phenomenon [46]. Pie charts are often used to represent the overall value of a phenomenon (expressed by the area of a circle) and the percentage of its components, giving the reader a quick idea of the proportional distribution of data [47]. In the next step, maps will be produced using geostatistical methods; in this work, the Kriging method is used. Each measurement z(x) is a realization of the random variable Z(x) (where x denotes the location in the area), i.e., if the measurement were performed several times at point x, the results would differ. However, for any pair of points xn and xn + h, the random variables Z(xn) and Z(xn + h) are linked by a correlation resulting from the spatial continuity of the studied phenomenon [48]. Data modeling is carried out using semivariogram analysis. The course of the semivariogram function indicates the rate at which the interaction between variables decreases with increasing distance [49]. In spatial statistics, measurement values are treated as a regionalized variable (also called spatial or localized), which is defined as a continuous function of the spatial coordinates. The values of this regionalized variable are known only at the places of measurement, and these are few compared to the studied space. In order to allow statistical interpretation of a regionalized variable, certain restrictions are imposed on it. 
In particular, the hypothesis of weak stationarity is assumed, which means that the expected value of the variable does not depend on the place of measurement and its covariance is only a function of the distance between the measurement points. In practice, because the hypothesis of weak stationarity is very strict for real parameters, a much milder restriction is applied, which assumes weak stationarity not of the regionalized variable itself, but of its increments. Of significant importance here is the variance of increments, which defines the basic characteristic function of geostatistics, known as the semivariogram [50]. In addition, when modeling a space using a semivariogram, the semivariance function must satisfy the condition of positive definiteness. This mathematical condition is met by the function families used in ArcGIS Pro (Nugget effect model, Spherical model, Gaussian model, Power model, Exponential model, and Linear model). For these function models, the Kriging equations can be solved. The indicators considered when comparing the results of spatial analyses are the statistical characteristics of the interpolation error:
(a) Mean prediction error (ME);
(b) Root mean square prediction error (RMSE);
(c) Average standard error (ASE);
(d) Mean standardized prediction error (MSE);
(e) Root mean square standardized prediction error (RMSSE).
During the modeling, the scheme for the optimal selection of approximating functions was used [51]. If the values of the interpolation parameters and the semivariogram model were chosen correctly, meaning that the results of the spatial analysis are satisfactory, then:
(a) ME is close to 0;
(b) RMSE takes the smallest possible value;
(c) ASE is close to RMSE. If ASE is greater than RMSE, the variability in the dataset has been overestimated; if ASE is less than RMSE, it has been underestimated;
(d) MSE is close to 0;
(e) RMSSE is close to 1.
  • Formula 1.
If the RMSSE exceeds 1, the variability in the dataset has been underestimated; if it is less than 1, the variability has been overestimated.
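The five error characteristics listed above can be computed from cross-validation output as in the following sketch. It assumes the commonly used definitions, with the error taken as predicted minus measured and the standardized error as the error divided by the prediction standard error. The function name, dictionary keys, and sample numbers are hypothetical and do not come from the study.

```python
import math


def cross_validation_stats(measured, predicted, std_errors):
    """Summary statistics of interpolation errors from cross validation."""
    n = len(measured)
    errors = [p - m for p, m in zip(predicted, measured)]
    standardized = [e / s for e, s in zip(errors, std_errors)]
    return {
        "mean_error": sum(errors) / n,
        "rmse": math.sqrt(sum(e * e for e in errors) / n),
        "average_standard_error": math.sqrt(sum(s * s for s in std_errors) / n),
        "mean_standardized_error": sum(standardized) / n,
        "rmsse": math.sqrt(sum(z * z for z in standardized) / n),
    }


# Made-up unemployment rates (%) at four hypothetical validation points.
measured = [10.0, 12.0, 8.0, 11.0]
predicted = [11.0, 11.0, 9.0, 10.0]
std_errors = [1.0, 1.0, 1.0, 1.0]  # kriging prediction standard errors (assumed)

stats = cross_validation_stats(measured, predicted, std_errors)
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```

In this constructed case the errors cancel out on average and match the stated standard errors, so every statistic sits at its ideal value.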
To verify the optimal method, the proprietary method of parameter assessment of the quality of estimation (MPQE) was used, which is based on an optimization algorithm determined both during cross validation and subset validation [52]:
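  • C1, C2, …, Cm—assessment of the quality of estimation;
  • Δ1, Δ2, …, Δn—weight;
  • a1, a2, …, am—parameter a;
  • b1, b2, …, bm—parameter b;
  • n1, n2, …, nm—parameter n.

$$
\begin{aligned}
\Delta_1 a_1 + \Delta_2 b_1 + \dots + \Delta_n n_1 &= C_1 \\
\Delta_1 a_2 + \Delta_2 b_2 + \dots + \Delta_n n_2 &= C_2 \\
&\;\;\vdots \\
\Delta_1 a_m + \Delta_2 b_m + \dots + \Delta_n n_m &= C_m
\end{aligned}
$$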
  • Formula 2.
The geostatistical method whose quality score Cn is equal or closest to 0 is the optimal method:
Cn ≤ 0 = max
For the purposes of this study, the algorithm was adapted by supplementing the method with an additional three parameters from cross validation and three parameters from subset validation. This algorithm was adapted to the scheme of optimal selection of the modeling method when approximating functions:
  • a1a2… an—parameter a—average error for 100% of data;
  • b1b2… bn—parameter b—square root of the average error for 100% of data;
  • c1c2… cn—parameter c—average error for 100% of data;
  • d1d2… dn—parameter d—average standard error for 100% of data;
  • e1e2… en—parameter e—average standardized error for 100% of data;
  • f1f2… fn—parameter f—square root of the mean standardized error for 100% of data;
  • g1g2… gn—parameter g—average error for 90% of data;
  • h1h2… hn—parameter h—square root of the mean error for 90% of data;
  • i1i2… in—parameter i—average error for 90% of data;
  • j1j2… jn—parameter j—average standard error for 90% of data;
  • k1k2… kn—parameter k—average standardized error for 90% of data;
  • l1l2… ln—parameter l—square root of the mean standardized error for 90% of data;
  • Parameter weights: Δb and Δh are 30%, and the other parameters are 5% each.
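The weighted scoring can be sketched as below, under several assumptions that the text does not settle. The twelve parameters are taken as already prepared so that smaller magnitudes are better, the weights are used exactly as stated (two at 30% and ten at 5%, which add up to 1.10, not 1), and the method with the score closest to 0 is selected. The method names and parameter values are invented for illustration.

```python
PARAMS = "abcdefghijkl"  # a-f: 100% of data, g-l: 90% of data

# Weights as stated in the text: b and h at 30%, all others at 5% each.
WEIGHTS = {p: 0.05 for p in PARAMS}
WEIGHTS["b"] = 0.30
WEIGHTS["h"] = 0.30


def mpqe_score(params):
    """Weighted sum C of the twelve validation parameters of one method."""
    return sum(WEIGHTS[p] * params[p] for p in PARAMS)


def best_method(candidates):
    """Return the name of the method whose score is closest to 0, plus all scores."""
    scores = {name: mpqe_score(vals) for name, vals in candidates.items()}
    return min(scores, key=lambda name: abs(scores[name])), scores


# Hypothetical candidates with made-up, already prepared parameter values.
candidates = {
    "method_A": {p: 0.1 for p in PARAMS},
    "method_B": {p: 0.2 for p in PARAMS},
}
winner, scores = best_method(candidates)
print(winner, scores)
```

How each raw statistic is transformed before weighting (for example, measuring the distance of a standardized error from its ideal value) would need to be taken from the cited method description [52].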
In the next step, a hot spot analysis (Getis–Ord Gi*) will be carried out. This tool identifies statistically significant spatial clusters of high values (hot spots) and low values (cold spots). It creates an output feature class with a z-score, a p-value, and a confidence level bin field for each feature in the input class. Z-scores and p-values are measures of statistical significance that indicate whether to reject the null hypothesis. As a result, they show whether the observed spatial grouping of high or low values is more pronounced than would be expected in a random distribution of the same values. The z-score and p-value fields do not reflect any type of FDR (False Discovery Rate) correction [53]. Another spatial analysis will be a density analysis with the Kernel Density tool, based on non-parametric estimation with the Epanechnikov kernel [54].
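For orientation, the Gi* z-score can be computed by hand as in the sketch below. It assumes binary weights in which every feature counts as its own neighbour, and it applies no FDR correction. The neighbour lists and values describe a hypothetical row of eight cities and are not data from the study.

```python
import math


def getis_ord_gi_star(values, neighbors):
    """Gi* z-scores with binary weights; each feature includes itself.

    Undefined (division by zero) if all values are equal or if a feature
    is a neighbour of every other feature.
    """
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum(v * v for v in values) / n - mean * mean)
    z_scores = []
    for i in range(n):
        members = set(neighbors[i]) | {i}
        w_sum = len(members)  # binary weights: sum of w equals sum of w squared
        local_sum = sum(values[j] for j in members)
        numerator = local_sum - mean * w_sum
        denominator = s * math.sqrt((n * w_sum - w_sum ** 2) / (n - 1))
        z_scores.append(numerator / denominator)
    return z_scores


# Hypothetical row of eight cities; the first three have high rates.
row = [[1], [0, 2], [1, 3], [2, 4], [3, 5], [4, 6], [5, 7], [6]]
rates = [10.0, 10.0, 10.0, 2.0, 2.0, 2.0, 2.0, 2.0]

z = getis_ord_gi_star(rates, row)
print([round(v, 2) for v in z])
```

Large positive z-scores mark candidate hot spots and large negative ones candidate cold spots. The thresholds depend on the confidence level chosen by the analyst.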
  • Formula 3.
This tool calculates a magnitude per unit area from point or polyline features, using a kernel function to fit a smoothly tapered surface to each point or polyline. A barrier can be used to alter the influence of a feature when calculating kernel density [55].
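$$
\hat{f}_{\lambda}(x) = \frac{1}{n\lambda} \sum_{i=1}^{n} k_0\!\left(\frac{x - x_i}{\lambda}\right),
\qquad
k_0(t) =
\begin{cases}
0.75\,(1 - t^2) & \text{for } |t| \le 1, \\
0 & \text{in other cases,}
\end{cases}
$$
where
  • k0—quadratic kernel function;
  • λ—smoothing parameter.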
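The estimator can be sketched in one dimension as follows, assuming the quadratic kernel 0.75(1 − t²) on |t| ≤ 1 and zero elsewhere. The sample locations and bandwidth are hypothetical, and the Kernel Density tool in GIS software works on two-dimensional features with its own implementation, so this is only an illustration of the principle.

```python
def epanechnikov(t):
    """Quadratic kernel: 0.75 * (1 - t^2) inside [-1, 1], zero outside."""
    return 0.75 * (1.0 - t * t) if abs(t) <= 1.0 else 0.0


def kde(x, samples, bandwidth):
    """Kernel density estimate at x with smoothing parameter `bandwidth`."""
    n = len(samples)
    return sum(epanechnikov((x - xi) / bandwidth) for xi in samples) / (n * bandwidth)


# Hypothetical one-dimensional sample locations.
samples = [0.0, 1.0, 4.0]
bandwidth = 2.0

# Riemann-sum check that the estimated density integrates to about 1.
step = 0.001
total = sum(kde(-3.0 + k * step, samples, bandwidth) for k in range(10001)) * step

print(kde(0.5, samples, bandwidth), total)
```

A larger bandwidth gives a smoother surface that hides local peaks, while a smaller one highlights individual points, so the choice should be reported together with the map.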

3. Results

The research process begins, as shown in Figure 2, with exploratory data analysis (EDA), which serves as the foundation for selecting further spatial analysis and visualization methods. Figure 2 shows the results of Moran’s I test for data on the unemployment rate broken down by voivodship.
Data for the period 2004–2022 were verified. In each case, the result was the same: the null hypothesis could not be rejected. It follows that the data show no spatial autocorrelation (random distribution).
Figure 3 shows the results of Moran’s I test for data on the unemployment rate broken down by district.
For data on the unemployment rate in districts (counties), the null hypothesis also could not be rejected, which means that the data are distributed randomly. In this case too, no spatial autocorrelation was found, which limits the possibilities of data visualization using inferential spatial statistics.
Figure 4 shows the results of Moran’s I test for urban unemployment rate data.
Based on Figure 4, we conclude that the data are clustered, and therefore the null hypothesis about the randomness of their distribution should be rejected.
In order to examine whether the distribution of data on the unemployment rate is random, the Average Nearest Neighbor test was carried out on the point data for Polish cities. Figure 5 shows the results of this test.
Based on Figure 5, we conclude that the null hypothesis is rejected. A result of 4.601594 was obtained, which indicates a probability of less than 1% that the observed scattered pattern could be the result of random chance.
In the next stage, the distribution of data was examined. Figure 6 presents the histogram for data on the unemployment rate in cities.
During the study, it was established that the distribution is closest to normal when a logarithmic transformation is applied. This is an important indication that should be considered during geostatistical modeling. In addition, if the subset validation method is used to validate the maps, it should be examined whether the distribution of the data changes significantly after the test set has been separated from the data. The histogram for data without the test set is shown in Figure 7.
Based on Figure 7, it can be concluded that performing a logarithmic transformation has brought the data distribution closer to a normal distribution. Therefore, during modeling, the analyst should take this information into account and, particularly when using the Kriging method, apply the logarithmic transformation in the geostatistical wizard. The next step of the EDA is to examine the outliers based on Thiessen polygons and the analysis of the Voronoi diagram, as shown in Figure 8.
Based on Figure 8, we conclude that there are outlying values of the unemployment rate. These data should be analyzed carefully, as they carry a higher probability of measurement error. However, for a variable such as the unemployment rate, local spikes in the distribution are plausible; therefore, none of the data will be removed in the main study.
Based on the results of the EDA, data on the unemployment rate for districts are presented in the form of a diagram in Figure 9.
Figure 9 confirms the conclusion drawn from the EDA that assigning statistical characteristics to large areas and drawing strong conclusions on this basis flattens the picture and reduces the reliability of the results. In addition, data on the unemployment rate in provinces and districts are not normally distributed, so classification based on standard deviation should not be used, which makes correct inference even more difficult. Figure 10 shows the result of the hot spot analysis for the unemployment rate.
A hot spot analysis allows the clusters of the unemployment rate to be interpreted using Getis and Ord statistics. This makes it possible to differentiate the significance of the highlighted clusters depending on the strength of local autocorrelation [54]. Based on Figure 10, we conclude that the unemployment rate groups several cities into clusters with the same values. Therefore, the creation of a continuous surface of the phenomenon is justified. Although such a surface should not be tied to the cadastral units of the area, their boundaries can be overlaid on maps made using the techniques of inferential statistics.
Figure 11 shows the results of the kernel density analysis [56] of the unemployment rate. Although this method is traditionally used for large point datasets, recent literature highlights its increasing application in socio-economic studies with a limited number of observations [57,58]. In this study, the Kernel Density Estimation (KDE) method was applied as one of the techniques for visualizing the spatial distribution of unemployment rates. Its application aimed to capture spatial continuity and smooth transitions between unemployment rates in neighboring cities. In this context, KDE also enables the identification of local concentrations and areas with increased unemployment risk, making it a valuable tool in spatial analyses of socio-economic phenomena. The resulting map in Figure 11 presents the unemployment rate as a continuous surface on which hot spots are visible, so that attention is focused on the values themselves. Figure 11 also shows the boundaries of districts to confirm the above-described shortcomings of cartograms and diagrams describing the value of the unemployment rate. In addition, it can be determined whether the problem of high unemployment is homogeneous within the administrative boundaries of districts or whether it occurs only in localized parts. It is also possible to indicate where there is no problem because the unemployment rate is low.
Figure 12 shows maps made by geostatistical methods, developed on the basis of the unemployment rate values of cities recorded in the geodatabase. In modeling the unemployment rate data, the principles of process optimization were applied. The presented results are therefore the optimal version of the map for a given method.
Based on Figure 12, it is difficult to indicate which of the maps is optimal. A parametric estimation quality method was used to validate the maps. The results of the validation process are presented in Table 1, Table 2 and Table 3.
Based on the data contained in Table 1, Table 2 and Table 3, we conclude that the optimal geostatistical method for mapping the unemployment rate is the Empirical Bayesian Kriging (EBK) method presented in Figure 13. Studies have shown that these differences are statistically significant.
Figure 13 shows the distribution of the unemployment rate without displaying the voivodeship borders. In this case too, provinces can be identified in which the unemployment rate varies within the administrative unit. The map shows the spatial diversity of the examined feature and allows for the analysis of places with a high unemployment rate. In addition, it is noticeable that administrative boundaries do not affect changes in value, and the locations of cold and hot spots do not depend on the administrative unit. Areas of low and high unemployment rates form zones that can be compared to irregularly shaped buffers.

4. Discussion and Conclusions

The innovativeness of the proposed approach lies in the application of spatial analysis methods to the unemployment rate calculated by classical methods at the city level, instead of relying solely on traditional averages computed for administrative units. The conventional approach, widely used in spatial studies, often leads to data oversimplification, resulting in artificially homogeneous representations of socio-economic conditions. However, the analyses conducted in this study clearly demonstrate that administrative areas are, in fact, spatially heterogeneous, and this local variability should be properly considered both in scientific research and in policy-making processes.
The proposed methodology, combining classical unemployment rate calculation with spatial analysis techniques, allows this spatial heterogeneity to be captured. Furthermore, the data modeling scheme described in the article, based on exploratory data analysis (EDA), ensures that the selection of visualization and modeling methods is statistically justified. This enables spatial interpolation, allowing for the estimation of unemployment rates in towns and cities where direct measurements are unavailable. Importantly, statistical tests confirmed the reliability of these estimations, since the spatial distribution of the unemployment rate exhibits clear patterns rather than randomness. An additional benefit of this approach is the identification of local anomalies and spatial outliers, which can be of particular relevance for labor market policies and regional development strategies.
The analyses confirmed that spatial methods applied to city-level unemployment rate data provide a much more detailed and reliable picture of spatial disparities than traditional approaches based on aggregated administrative units. Moran’s I test showed statistically significant spatial autocorrelation, which justifies the use of spatial statistics. In contrast, for polygon data at the province or district level, descriptive statistics may be more appropriate. This highlights a key methodological contribution of this study: spatial unemployment patterns within administrative units are often masked when only averaged values per unit are considered.
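As a minimal sketch of the statistic behind this test, the global Moran's I can be computed from a list of values and a spatial weight matrix. The function name `morans_i`, the four-city chain, and the binary adjacency weights are assumptions made for illustration; the significance test (z-score and p-value) reported by GIS software is omitted here.

```python
from typing import Sequence


def morans_i(values: Sequence[float], weights: Sequence[Sequence[float]]) -> float:
    """Global Moran's I for `values` under spatial weights `weights`.

    `weights[i][j]` is the weight between locations i and j; the diagonal is ignored.
    """
    n = len(values)
    mean = sum(values) / n
    deviations = [v - mean for v in values]
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    weight_sum = sum(weights[i][j] for i, j in pairs)
    cross_products = sum(weights[i][j] * deviations[i] * deviations[j] for i, j in pairs)
    variance_term = sum(d * d for d in deviations)
    return (n / weight_sum) * (cross_products / variance_term)


# Hypothetical example: four cities in a row, each adjacent to its neighbors.
chain_weights = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
clustered = morans_i([2.0, 4.0, 6.0, 8.0], chain_weights)    # similar values adjacent
alternating = morans_i([8.0, 2.0, 8.0, 2.0], chain_weights)  # dissimilar values adjacent
```

In this toy setup the smoothly increasing series gives a positive index and the alternating series a negative one, which is the contrast between clustering and dispersion that the test quantifies.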
Moreover, the removal of artificial administrative boundaries in the visualization process allows for more accurate spatial targeting of anti-unemployment policies. This refined approach could help allocate financial support more precisely, directing funds not to entire administrative units, but to the specific cities and subregions where interventions are most urgently needed. This practical application further highlights the relevance of integrating spatial statistics and classical economic indicators in labor market research.
The analyses confirmed hypothesis 1: traditional methods of unemployment analysis, based on the aggregation of administrative data, can distort the actual picture of this phenomenon. Hypothesis 2 was also confirmed: exploratory data analysis (EDA) enabled the optimal selection of modeling and visualization methods and revealed spatial unemployment patterns that were not detectable through traditional statistical analyses. Furthermore, the results validated hypothesis 3, indicating that integrating GIS-based spatial analysis methods with classical economic indicators allows areas requiring intervention to be identified more precisely, which underlines the crucial role of spatial analysis in modeling socio-economic phenomena.
The presented approach to spatial unemployment analysis can be applied in a wide range of socio-economic studies, including labor market monitoring, modeling the impact of transportation accessibility on employment, assessing the effectiveness of activation policies, and optimizing the allocation of support funds in regions with the highest unemployment risk. Additionally, the proposed methodology enhances the adaptability of labor market analyses by allowing for dynamic updates as new data become available, ensuring that policy responses can be more timely and data-driven. By integrating spatial interpolation with exploratory data analysis, this approach also provides a robust framework for detecting emerging labor market trends and forecasting potential shifts in unemployment distribution. Furthermore, the combination of classical economic indicators with spatial modeling opens new possibilities for comparative studies across different countries and regions, facilitating cross-border labor market assessments. The flexibility of this method makes it suitable for analyzing not only unemployment, but also other socio-economic phenomena, such as income disparities, workforce migration, and regional economic resilience, providing a comprehensive tool for decision-makers and researchers alike.

Author Contributions

Conceptualization, M.O.; methodology, M.O.; software, M.O.; validation, M.O.; formal analysis, M.O.; investigation, M.O.; resources, M.O.; data curation, writing—original draft preparation, M.O. and M.J.; writing—review and editing, M.O. and M.J.; visualization, M.O. and M.J.; supervision, M.O. and M.J.; project administration, M.O.; funding acquisition, M.O. and M.J. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

No new data were created or analyzed in this study.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

| Abbreviation | Full name | Description |
|---|---|---|
| ME | Mean Error | The average prediction error. It shows the average difference between actual and predicted values. In a properly calibrated model, this value should be close to zero. |
| RMSE | Root Mean Square Error | The square root of the mean squared error. It is one of the key measures of prediction accuracy; the lower the value, the better the model fits the data. |
| MSE | Mean Standardized Error | The average standardized prediction error. In well-calibrated models, this value should be close to zero. |
| RMSSE | Root Mean Square Standardized Error | The square root of the mean standardized squared error. The optimal value is 1; values higher than 1 indicate underestimation of variance, while values below 1 indicate overestimation. |
| ASE | Average Standard Error | The average standard error. It should match the RMSE in a well-calibrated model, indicating that the variability is correctly assessed and that subset validation results assess the overall model quality. The lower the MPQE, the better the model performance. |
| MPQE CW | Mean Parametric Quality Estimation for Cross Validation | MPQE calculated for cross validation. |
| MPQE SW | Mean Parametric Quality Estimation for Subset Validation | MPQE calculated for subset validation. |
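The first five measures above can be sketched in a few lines. The formulas follow the definitions commonly used in geostatistical cross-validation reports; the article does not spell them out, so they are an assumption, as are the function name `cross_validation_stats` and the sample numbers. MPQE is not reproduced because its formula is not given in this section.

```python
import math
from typing import Dict, Sequence


def cross_validation_stats(
    observed: Sequence[float],
    predicted: Sequence[float],
    std_errors: Sequence[float],
) -> Dict[str, float]:
    """Compute ME, RMSE, MSE, RMSSE and ASE from validation results.

    `std_errors` are the prediction standard errors reported by the model.
    """
    n = len(observed)
    errors = [p - o for p, o in zip(predicted, observed)]
    standardized = [e / s for e, s in zip(errors, std_errors)]
    return {
        "ME": sum(errors) / n,
        "RMSE": math.sqrt(sum(e * e for e in errors) / n),
        "MSE": sum(standardized) / n,
        "RMSSE": math.sqrt(sum(z * z for z in standardized) / n),
        # Assumed convention: root of the mean prediction variance.
        "ASE": math.sqrt(sum(s * s for s in std_errors) / n),
    }


# Hypothetical validation sample (unemployment rates in %).
observed = [10.0, 12.0, 8.0, 14.0]
predicted = [11.0, 11.0, 9.0, 13.0]
calibrated = cross_validation_stats(observed, predicted, [1.0, 1.0, 1.0, 1.0])
miscalibrated = cross_validation_stats(observed, predicted, [2.0, 2.0, 2.0, 2.0])
```

With standard errors of 1 the hypothetical sample gives RMSE, RMSSE, and ASE all equal to 1; doubling the standard errors leaves RMSE unchanged but moves RMSSE to 0.5 and ASE to 2, so ASE no longer matches RMSE.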

References

  1. Pietrzak, M.B. Application of economic distance for the purposes of a spatial analysis of the unemployment rate for Poland. Oeconomia Copernic. 2010, 1, 79–98. [Google Scholar] [CrossRef]
  2. Kopczewska, K. Models of changes in the unemployment rate in spatial terms. Wiadomości Stat. 2010, 2010, 26–40. [Google Scholar] [CrossRef]
  3. Müller-Frączek, I.; Pietrzak, M. Analysis of the unemployment rate in Poland using the spatial MESS model. Acta Univ. 2011, 11, 203–2011. [Google Scholar] [CrossRef]
  4. Müller-Frączek, I.; Pietrzak, M. Analysis of the unemployment rate in Poland in spatial-temporal terms. Oeconomia Copernic. 2012, 3, 43–49. [Google Scholar] [CrossRef]
  5. Tukey, J. Exploratory Data Analysis; Digital Publishing Institute: Morgantown, MV, USA, 1997; Volume 19, p. 7128. [Google Scholar]
  6. McKean, J.W.; Sheather, S.J. Diagnostic procedures. In Wiley Interdisciplinary Reviews: Computational Statistics; John Wiley & Sons Ltd.: Hoboken, NJ, USA, 2009; pp. 1221–1233. [Google Scholar] [CrossRef]
  7. Morgenthaler, S. Exploratory data analysis. In Wiley Interdisciplinary Reviews: Computational Statistics; John Wiley & Sons Ltd.: Hoboken, NJ, USA, 2009; Volume 1, pp. 33–44. [Google Scholar] [CrossRef]
  8. Camizuli, E.; Carranza, E.J. Exploratory Data Analysis (EDA). In The Encyclopedia of Archaeological Sciences; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 2018; pp. 1–7. [Google Scholar] [CrossRef]
  9. Antczak, E.; Lewandowska-Gwarda, K. Application of exploratory methods of spatial data analysis in the study of mortality in Poland. Sci. Pap. Wroc. Univ. Econ. Taxon. 2009, 16, 333–340. [Google Scholar]
  10. Osowski, S.; Muszyński, M. Data mining methods for gene selection on the basis of gene expression arrays. Int. J. Appl. Math. Comput. Sci. 2014, 24, 657–668. [Google Scholar] [CrossRef]
  11. Ogryzek, M.; Krypiak-Gregorczyk, A.; Wielgosz, P. Optimal geostatistical methods for interpolation of the ionosphere: A case study on the St Patrick’s day storm of 2015. Sensors 2020, 20, 2840. [Google Scholar] [CrossRef]
  12. Filiztekin, A. Regional unemployment in Turkey. Pap. Reg. Sci. 2009, 88, 863–879. [Google Scholar] [CrossRef]
  13. Elhorst, J. The mystery of regional unemployment differentials: Theoretical and empirical explanations. J. Econ. Surv. 2003, 17, 709–740. [Google Scholar] [CrossRef]
  14. Khamis, F. Spatial dimensions of the unemployment rate in Jordan 2008. Austrian J. Stat. 2016, 40, 177–190. [Google Scholar] [CrossRef]
  15. Anggani, N.L.; Amrullah, H.M.; Gemilang, D.S.A. Moran I autocorrelation study for level spatial pattern analysis. J. Indones. Sos. Teknol. 2023, 4, 1285–1291. [Google Scholar] [CrossRef]
  16. Patuelli, R.; Schanne, N.; Griffith, D.A.; Nijkamp, P. Persistence of regional unemployment: Application of a spatial filtering approach to local labor markets in Germany. J. Reg. Sci. 2012, 52, 300–323. [Google Scholar] [CrossRef]
  17. Dubrovskaya, J.; Kosonogova, E. The impact of digitalization on the demand for labor in the context of working specialties: Spatial analysis. St Petersburg Univ. J. Econ. Stud. 2021, 37, 395–412. [Google Scholar] [CrossRef]
  18. Gelebo, B.M. Spatial modelling of disparity in economic activity and unemployment in southern and oromia regional states of Ethiopia. Am. J. Theor. Appl. Stat. 2015, 4, 347. [Google Scholar] [CrossRef]
  19. Sen, A. Large sample-size distribution of statistics used in testing for spatial correlation. Geogr. Anal. 1976, 9, 175–184. [Google Scholar] [CrossRef]
  20. Anselin, L.; Florax, R.J.G.M. Small sample properties of tests for spatial dependence in regression models: Some further results. In New Directions in Spatial Econometrics; Springer: Berlin/Heidelberg, Germany, 1995; pp. 21–74. [Google Scholar]
  21. Semerikova, E. Spatial Patterns of German Labor Market: Panel Data Analysis of Regional Unemployment. In Geographical Labor Market Imbalances. AIEL Series in Labour Economics; Mussida, C., Pastore, F., Eds.; Springer: Berlin/Heidelberg, Germany, 2015; pp. 37–64. [Google Scholar] [CrossRef]
  22. Cracolici, M.; Cuffaro, M.; Nijkamp, P. Geographical distribution of unemployment: An analysis of provincial differences in Italy. SSRN Electron. J. 2007, 38, 649–670. [Google Scholar] [CrossRef]
  23. Anselin, L. Local indicators of spatial association—LISA. Geogr. Anal. 1995, 27, 93–115. [Google Scholar] [CrossRef]
  24. Getis, A.; Ord, J.K. The analysis of spatial association by use of distance statistics. Geogr. Anal. 1992, 24, 189–206. [Google Scholar] [CrossRef]
  25. Ord, J.K.; Getis, A. Local spatial autocorrelation statistics: Diapplicationes and an application. Geogr. Anal. 1995, 27, 286–306. [Google Scholar] [CrossRef]
  26. Netrdová, P.; Blažek, J. Soaring unemployment in Czechia during the global economic crisis. J. Maps 2019, 15, 69–76. [Google Scholar] [CrossRef]
  27. Bolińska, M.; Chornenka, O. Spatial diversity of unemployment in Ukraine. Humanit. Soc. Sci. 2019, 26, 7–26. [Google Scholar] [CrossRef]
  28. Zielecka-Dębska, D.; Pawlak, E.; Tukiendorf, A.; Szelachowska, J.; Wiśniewska, I.; Błaszczyk, J.; Matkowski, R. Socioeconomic aspect of breast cancer incidence and mortality in women in lower Silesia (Poland) in 2005–2014. Postępy Hig. Med. Doświadczalnej 2022, 76, 62–70. [Google Scholar] [CrossRef]
  29. Inspektor, T.; Ivan, I.; Horák, J. Mapping and monitoring unemployment hot spots towards identification of socially excluded localities: Case study of Ostrava. J. Maps 2013, 10, 35–46. [Google Scholar] [CrossRef]
  30. Bal-Domańska, B.; Sobczak, E. Propozycja poszerzonej miary bezrobocia. Pr. Nauk. Uniw. Ekon. We Wrocławiu 2001, 4350, 11–22. [Google Scholar] [CrossRef]
  31. Haładus, K.; Wolak, J. Analiza przestrzenna zmian stopy bezrobocia w Polsce. W: Przestrzeń w badaniach geograficznych. Znaczenie kategorii przestrzeni w geografii. Pr. Geogr. 2018, 152, 33–48. [Google Scholar]
  32. Pośpiech, E. Analiza przestrzenna bezrobocia w Polsce. Stud. Ekon. 2015, 227, 59–74. [Google Scholar]
  33. Tokarski, T. Regionalne zróżnicowanie bezrobocia. Wiadomości Stat. 2010, 5, 41–56. [Google Scholar] [CrossRef]
  34. Kopczewska, K. Modele zmian stopy bezrobocia w ujęciu przestrzennym. Wiadomości Stat. 2010, 5, 26–40. [Google Scholar] [CrossRef]
  35. Kossowski, T. Teoretyczne aspekty modelowania przestrzennego w badaniach regionalnych. Rozw. Reg. I Polityka Reg. 2010, 12, 9–26. [Google Scholar]
  36. Anselin, L.; Regression, S.; Fotheringham, A.S.; Rogerson, P.A. The SAGE Handbook of Spatial Analysis; SAGE Publications: Thousand Oaks, CA, USA, 2009. [Google Scholar]
  37. Müller-Frączek, I.; Pietrzak, M.B. Przestrzenne modelowanie zmian stopy bezrobocia. Pr. Nauk. Uniw. Ekon. We Wrocławiu 2015, 391, 118–127. [Google Scholar]
  38. Aragon, Y.; Haughton, D.; Haughton, J.; Leconte, E.; Malin, E.; Ruiz-Gazen, A.; Thomas-Agnan, C. Explaining the pattern of regional unemployment: The case of the Midi-Pyrénées region. Pap. Reg. Sci. 2003, 82, 155–174. [Google Scholar] [CrossRef]
  39. Suchecki, B. Spatial econometrics. In Methods and Models of Spatial Data Analysis; Beck, C.H., Ed.; Springer: Warsaw, Poland, 2010. [Google Scholar]
  40. Jackowski, A. Encyclopedia Geography; Zielona Sowa: Warsaw, Poland, 2004. [Google Scholar]
  41. Śliwicki, D. Application of nuclear estimators for estimating the effectiveness of active labour market programs. Acta Univ. Nicolai Copernic. Ekon. 2014, 45, 27–40. [Google Scholar] [CrossRef]
  42. Coltuc, D.; Bolon, P.; Chassery, J.M. Exact histogram specification. IEEE Trans. Image Process. 2006, 15, 1143–1152. [Google Scholar] [CrossRef]
  43. Hummel, R.A. Histogram modification techniques. Comput. Graph. Image Process. 1975, 4, 209–224. [Google Scholar] [CrossRef]
  44. Marden, J.I. Positions and QQ plots. Stat. Sci. 2004, 19, 606–614. [Google Scholar] [CrossRef]
  45. Gold, C. Voronoi Methods in GIS; Springer: Berlin/Heidelberg, Germany, 2006. [Google Scholar] [CrossRef]
  46. Ogryzek, M.; Ciski, M. Methods of Presentation the Average Transaction Prices of the Undeveloped Land. Civ. Environ. Eng. Rep. 2018, 28, 85–100. [Google Scholar] [CrossRef]
  47. Pasławski, J. Notes on the classification of quantitative forms of cartographic presentation. Pol. Przegląd Kartogr. 2005, 37, 95–100. [Google Scholar]
  48. Krige, D.G. A statistical approach to some basic mine valuation problems on the Witwatersrand. J. Chem. Metall. Min. Soc. S. Afr. 1951, 52, 119–139. [Google Scholar]
  49. Krige, D.G. A statistical analysis of some of the borehole values in the Orange Free State goldfield. J. Chem. Metall. Min. Soc. S. Afr. 1952, 53, 47–64. [Google Scholar]
  50. Matheron, G. Traité de Geostatistique Appliquée. Tome II. Le Krigeage (Treatise on Applied Geostatistics. Volume II. Kriging). Editions BRGM: Paris, France, 1965. [Google Scholar]
  51. Ogryzek, M.; Kurowska, K. Geostatistical Methods of Mapping Average Transaction Prices of Undeveloped Agricultural Land. Studia Prace WNEiZ 2016, 45/1, 397–408. [Google Scholar] [CrossRef]
  52. Ogryzek, M. Parametric assessment of the quality of estimation of maps developed by geostatistical methods. Stud. I Pr. WNEiZ 2018, 54, 319–330. [Google Scholar]
  53. Grubesic, T.; Murray, A. Detecting hot spots using cluster analysis and GIS. In Proceedings of the Fifth Annual International Crime Mapping Research Conference, Orlando, FL, USA, 25–27 June 2001; pp. 1–12. [Google Scholar]
  54. Mordwa, S. GIS techniques—In search of crime hot spots. Arch. Criminol. 2015, 27, 279–302. [Google Scholar] [CrossRef]
  55. Źróbek-Różańska, A.; Ogryzek, M.; Źróbek-Sokolnik, A. Creating a Healthy Environment for Children: GIS Tools for Improving the Quality of the Social Welfare Management System. Int. J. Environ. Res. Public Health 2022, 19, 7128. [Google Scholar] [CrossRef]
  56. Jażdżewska, I. Changes of the urban population density in Central Poland. Population Density Distribution Estimation Using Nonparametric Kernel Functions. Człowiek I Środowikso 2012, 36, 7–19. [Google Scholar]
  57. Kopczewska, K. Applied Spatial Statistics and Econometrics: Data Analysis in R*; Routledge: London, UK, 2020. [Google Scholar]
  58. Griffith, D.A. Spatial Autocorrelation and Spatial Filtering. In Advanced Spatial Statistics; Springer: Berlin/Heidelberg, Germany, 2003; pp. 89–127. [Google Scholar]
Figure 1. Data modeling diagram.
Figure 2. Moran’s I test report for the unemployment rate in voivodeships.
Figure 3. Moran’s I test report for the unemployment rate in districts.
Figure 4. Moran’s I test report for the urban unemployment rate.
Figure 5. Average nearest neighbor test report for the unemployment rate.
Figure 6. Histogram of the unemployment rate in cities.
Figure 7. Histogram and the unemployment rate in cities.
Figure 8. Voronoi diagram of the unemployment rate in cities.
Figure 9. Unemployment rate in districts. Source: own work.
Figure 10. Hot spots of the unemployment rate in Poland in 2021.
Figure 11. Hot spots of the unemployment rate in Poland in 2021 determined using the kernel density estimation method.
Figure 12. Modeling the unemployment rate in Poland in 2021 using geostatistical methods.
Figure 13. Modeling the unemployment rate in Poland in 2021 using the EBK method.
Table 1. Cross-validation results.

| Method | ME | RMSE | MSE | RMSSE | ASE | MPQE |
|---|---|---|---|---|---|---|
| KO | 0.009 | 3.154 | 0.002 | 0.787 | 3.893 | 2.3615 |
| KS | 0.266 | 3.243 | 0.054 | 0.874 | 3.64 | 2.4292 |
| EBK | 0.045 | 3.325 | 0.055 | 0.943 | 3.394 | 2.4387 |
| KU | 0.026 | 3.489 | 0.132 | 18.154 | 0.191 | 3.9437 |

Source: authors' own work based on data from the GIS data modeling validator.
Table 2. Subset validation results.

| Method | ME | RMSE | MSE | RMSSE | ASE | MPQE |
|---|---|---|---|---|---|---|
| EBK | 0.011 | 2.443 | 0.003 | 0.975 | 2.483 | 1.4737 |
| KO | 0.065 | 2.464 | 0.003 | 0.9 | 3.129 | 1.8881 |
| KS | 0.124 | 2.467 | 0.022 | 0.932 | 2.582 | 1.8462 |
| KU | 0.49 | 2.92 | 2.9 | 21.704 | 0.167 | 2.2801 |

Source: authors' own work based on data from the GIS data modeling validator.
Table 3. Validation results from the parametric assessment of estimation quality.

| Method | MPQE CW | MPQE SW | MPQE |
|---|---|---|---|
| EBK | 2.4387 | 1.4737 | 1.9562 |
| KO | 2.3615 | 1.8462 | 2.10385 |
| KS | 2.4292 | 2.2801 | 2.35465 |
| KU | 3.9437 | 2.2801 | 3.1119 |

Source: authors' own work based on data from the GIS data modeling validator.
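The final column of Table 3 is numerically consistent with the arithmetic mean of the cross-validation (CW) and subset-validation (SW) scores. Assuming that is how it was obtained, the combination and the resulting ranking can be sketched as follows; the function names `combined_mpqe` and `rank_methods` are hypothetical, and the input values are copied from Table 3.

```python
from typing import Dict, List

# MPQE scores as listed in Table 3 (lower is better).
mpqe_cw = {"EBK": 2.4387, "KO": 2.3615, "KS": 2.4292, "KU": 3.9437}
mpqe_sw = {"EBK": 1.4737, "KO": 1.8462, "KS": 2.2801, "KU": 2.2801}


def combined_mpqe(cw: Dict[str, float], sw: Dict[str, float]) -> Dict[str, float]:
    """Average the CW and SW scores per method (assumed combination rule)."""
    return {method: (cw[method] + sw[method]) / 2 for method in cw}


def rank_methods(scores: Dict[str, float]) -> List[str]:
    """Order methods from best (lowest score) to worst."""
    return sorted(scores, key=scores.get)


combined = combined_mpqe(mpqe_cw, mpqe_sw)
ranking = rank_methods(combined)
```

Under this assumption EBK ranks first, followed by KO, KS, and KU, which matches the row order of the table.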
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Ogryzek, M.; Jaskulski, M. Applying Methods of Exploratory Data Analysis and Methods of Modeling the Unemployment Rate in Spatial Terms in Poland. Appl. Sci. 2025, 15, 4136. https://doi.org/10.3390/app15084136
