Beyond Geography and Budget: Machine Learning for Calculating Cyber Risk in the External Perimeter of Local Public Entities
Abstract
1. Introduction
1.1. Related Work
1.1.1. Cyber Attack Figures
1.1.2. Cyber Threats
1.1.3. Local Public Environment
1.1.4. Cybersecurity Framework
1.1.5. Massive Data in Local Public Entities
1.2. Summary of Literature and Research Objectives
2. Materials and Methods
- The process of acquiring and processing captured data;
- Risk definition;
- Empirical justification of risk thresholds;
- A correlation study between technological investments and the identified risk;
- Generation of a qualitative model (classification) that indicates whether an entity is at risk and identifies the most representative risk variables;
- Generation of an optimized quantitative model (regression) that establishes a numerical risk value for each entity;
- Verification of the model’s suitability using data from subsequent years.
2.1. Data Acquisition and Processing Framework
2.1.1. Data Sources
- Technological variables: This information was obtained through the implementation of active scans of the exposed digital perimeter of each municipality. These scans, performed by automated scripts, measure technical aspects of public web services. The metrics encompass SSL/TLS certificate configuration, the presence of security headers, response times, known vulnerabilities (CVEs) in exposed components, and technical SEO best practices;
- Competential or Contextual variables: These data were obtained from public and government sources, including the National Institute of Statistics (INE) and the Ministry of Finance. The variables characterize each municipality and include demographic data (population, age distribution), socio-economic data (unemployment rate, average income) and budgetary data (specific investment in IT items, total budget).
- Complete information on the variables is available in Appendix A.2.
2.1.2. Definition of the Risk Metric: CIORank
- Security: Aggregates defensive security metrics, such as the quality of cryptographic implementation, the absence of vulnerabilities, and the adoption of security headers;
- Availability: Measures the reliability and performance of web services, including loading speed, correct time synchronization, and the absence of the domain from blacklists;
- SEO: Evaluates technical optimization, compliance with web standards, and accessibility, which act as a proxy for development quality and maintenance. This category is referred to as “Web Quality & Performance” to accurately describe its focus on technical audits. The underlying individual metrics retain their original SEO prefix (e.g., SEO3_performance) for consistency with the data collection tools used.
2.1.3. Empirical Threshold of Risk
2.2. Modeling with Machine Learning
2.2.1. Dataset Preparation
2.2.2. Model Training and Selection
- Classification Models: The objective is to predict the binary risk category (“at risk” or “not at risk”). A wide range of algorithms was compared, including tree ensembles (such as Random Forest and CatBoost), logistic regression, and Support Vector Machines (SVMs). The Area Under the ROC Curve (AUC) was the primary selection metric, because it does not depend on a single decision threshold and is less distorted by class imbalance than accuracy;
- Regression Models: The objective is to predict the numerical value of risk. The regression versions of analogous algorithms were evaluated. The selection metric was the Mean Absolute Error (MAE), which is expressed in the same units as the risk value and is therefore straightforward to interpret.
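The two selection criteria above can be illustrated with a minimal, standard-library sketch. It is not the authors' pipeline: the candidate names (`model_a`, `model_b`), the labels and all scores are made-up toy values, and the AUC is computed through its rank interpretation (the probability that a randomly chosen "at risk" entity receives a higher score than a randomly chosen "not at risk" entity).

```python
from itertools import product


def auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative.

    A tie between a positive and a negative score counts as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))


def mae(y_true, y_pred):
    """Mean absolute error, in the same units as the target."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy, made-up hold-out data: 2 "at risk" entities (label 1) out of 8 (imbalanced).
labels = [1, 1, 0, 0, 0, 0, 0, 0]
classifier_scores = {  # hypothetical candidates and their predicted risk scores
    "model_a": [0.9, 0.4, 0.5, 0.3, 0.2, 0.2, 0.1, 0.1],
    "model_b": [0.8, 0.7, 0.6, 0.3, 0.2, 0.2, 0.1, 0.1],
}
# Classification: keep the candidate with the highest AUC.
best_classifier = max(classifier_scores, key=lambda m: auc(labels, classifier_scores[m]))

risk_values = [40.0, 55.0, 70.0, 62.0]  # made-up numerical risk values
regressor_preds = {  # hypothetical candidates and their predictions
    "model_a": [42.0, 54.0, 73.0, 61.0],
    "model_b": [45.0, 50.0, 65.0, 60.0],
}
# Regression: keep the candidate with the lowest MAE.
best_regressor = min(regressor_preds, key=lambda m: mae(risk_values, regressor_preds[m]))

print(best_classifier, best_regressor)
```

Because the AUC only depends on how entities are ranked, it is unaffected by where the decision threshold is placed, which is one reason it is often preferred over accuracy when one class is rare.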
2.2.3. Variable Importance Analysis
3. Results
3.1. Correlation Between Risk and IT Investment
3.2. Qualitative Supervised Learning Model Results
3.3. Analysis of Variables in the Qualitative Model
3.4. Optimized Quantitative Supervised Learning Model Results
- S5_sslabs_scan;
- SEO3_performance;
- SEO4_cookies;
- SEO3_seo;
- SX_shodan;
- A7_black_list;
- province;
- S3_openports;
- A2_ntp;
- SEO3_bestpractices.
3.5. Goodness of the Quantitative Model in the 2024 Data
4. Discussion
- Risk and cyber losses stand in a cause-effect relationship: the presence of a risk does not imply an economic loss for an entity until a cybersecurity event occurs;
- The same study reports a weakening of the correlation between budget size and cyber losses. Its authors attribute this to an increase in attacks on smaller entities with limited financial resources.
- At the National level: Leverage structures such as CCN-CERT in Spain (or its counterparts) to centralize incident reporting by local administrations in a standardized, anonymized manner that is accessible to the research sector;
- At the European level: We propose that ENISA lead the development of a unified platform and a standardized taxonomy of incidents, in line with the NIS2 Directive. A European-level risk observatory would make it possible to validate and refine predictive models such as ours on an unprecedented scale.
- Extend the study to local public entities in nearby countries. This could reveal a supranational geographical trend among similar countries;
- Develop models that relate the risk metric to the expected economic loss if risks are not mitigated, using the FAIR framework [34];
- Perform attack simulations based on the vulnerabilities detected in the external perimeter. Since this study conducts a strategic analysis at the country level, answering “what” and “why”, techniques such as Monte Carlo simulation would allow tactical actions to be prioritized by defining “how”;
- Conduct a perception study among key C-level executives in public administrations, such as Chief Operating Officers (COOs), Chief Technology Officers (CTOs), Chief Information Officers (CIOs), and Chief Information Security Officers (CISOs), in order to align the weighting with the country’s overall strategies.
- Explore and correlate the study results with publicly available datasets on regional differences, if accessible. This could include national statistics on digital infrastructure (e.g., broadband penetration) or official reports from national/European agencies, which would help further contextualize the impact of institutional versus infrastructural factors;
- The temporal dynamics of risk should be investigated by means of causal inference methodologies appropriate for mixed data types. Although standard Granger causality is constrained to numerical variables, future research could adapt advanced techniques to explore time-lagged relationships. Such analysis has the potential to yield valuable leading indicators of risk, for example, by determining whether certain technical changes systematically precede a change in an entity’s security posture.
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
Abbreviation | Meaning |
---|---|
AI | Artificial Intelligence |
AUC | Area Under the ROC Curve |
CERT | Computer Emergency Response Team |
CCN-CERT | Computer Emergency Response Team of the Centro Criptológico Nacional (Spain) |
CVE | Common Vulnerabilities and Exposures |
DDoS | Distributed Denial-of-Service |
ENISA | European Union Agency for Cybersecurity |
GenAI | Generative AI |
ICMA | International City/County Management Association |
LLMs | Large Language Models |
MAE | Mean Absolute Error |
MFA | Multi-Factor Authentication |
ML | Machine Learning |
MSE | Mean Square Error |
NIS2 | Network and Information Systems Directive 2 |
PII | Personally Identifiable Information |
RMSE | Root Mean Square Error |
SHAP | SHapley Additive Explanations |
SEO | Search Engine Optimization |
Appendix A. Data Acquisition Framework
Appendix A.1. Medallion Architecture for Data Processing
Appendix A.2. Data Sources and Variables
- INE—National Statistics Institute [36];
- Seguridad Social—Spanish Ministry of Social Security [37];
- SEPE—Spanish Ministry of Employment [38];
- AEAT—Tax Office [39];
- MINHAP—Spanish Ministry of Treasury [40];
- CNIG—National Center for Geographic Information [41];
- Datos.gov.es—Ministry for Digital Transformation [42].
- S1—The remaining time for HTTPS/SSL/TLS certificates to expire (in days);
- S2—The detection of obsolete cryptographic digests, such as SHA1, in encrypted communications;
- S3—The number of open ports in the main domain, in addition to those associated with web traffic;
- S4—Number of documents indexed by Google that contain entity metadata, such as user account locations;
- S5—Evaluation of the SSL/TLS digital certificate, where A is indicative of excellent quality and F is indicative of insecure quality;
- S6—The presence of a Robots.txt file serves to prevent indexing by search engines and impede the reconnaissance phase of the cyber kill chain;
- SX_Safe_Browsing—The domain reputation databases are consulted to confirm if the entity’s domain is listed as unsafe;
- SX_Shodan—A check is made to identify whether the entity’s domain has any known vulnerabilities that have not yet been patched. The vulnerabilities are identified by their CVE (Common Vulnerabilities and Exposures) code. The Shodan tool has been used for this purpose, facilitating the cataloging and identification of each monitored entity;
- A1—Average download speed of web content;
- A2—The synchronization of the online service with the Network Time Protocol (NTP) time provider;
- A3—The number of servers that provide the online service. This metric is associated with the resilience, fault tolerance and high availability of the published service;
- A7—The number of lists of malicious domains on which the entity’s domain appears. Commercial services such as MXToolBox provide this kind of lookup;
- SEO1—The optimization of the service when using a desktop or laptop device;
- SEO2—The optimization of the service when using mobile devices, including tablets, smartphones, Chromebooks, and wearables;
- SEO3_accessibility—The service has been optimized in accordance with the accessibility criteria set out in the Web Content Accessibility Guidelines (WCAG);
- SEO3_best_practices—The service was optimized in accordance with the standards of good web programming practices, which encompass HTML, CSS and JavaScript code;
- SEO3_performance—This metric concerns the optimization of the service from the moment a request is made until the end user receives the complete content that has been requested. This metric is employed in order to ascertain the quality of the user experience;
- SEO3_pwa—The degree to which the service is optimized as a progressive web application (PWA). This type of application can run on any web-capable platform, operating system, or device, which facilitates portability and simplifies the management of push notifications and offline operations;
- SEO3_seo—The service has been optimized in order to improve its ranking in web search engines;
- SEO4—The number of cookies that the online service requires the end user to accept. This metric gauges the user experience and perception of privacy in relation to the General Data Protection Regulation (GDPR);
- SEO5_Bing—The number of results displayed by the Microsoft Bing search engine in response to a query about the public entity;
- SEO6_Google—The number of results displayed by the Google search engine after a query about the public entity;
- SEO7_links—The number of web links displayed on the main domain page of the monitored entity.
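As an illustration of how one of the technological variables above could be measured, the following sketch computes an S1-style value (days remaining until certificate expiry) with the Python standard library. It is an assumption about how such a metric might be collected, not the authors' scanning script; the function names `days_until_expiry` and `fetch_not_after` are hypothetical. The live lookup performs a single ordinary TLS handshake, which stays within what a regular visitor's browser does.

```python
import socket
import ssl
from datetime import datetime, timezone

SECONDS_PER_DAY = 86_400


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between `now` and a certificate's notAfter field."""
    expiry_ts = ssl.cert_time_to_seconds(not_after)  # e.g. "Jan 31 00:00:00 2025 GMT"
    return (expiry_ts - now.timestamp()) / SECONDS_PER_DAY


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Read notAfter from the certificate served by `host` (requires network access)."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


# Offline example with fixed dates, so the result is reproducible.
remaining = days_until_expiry(
    "Jan 31 00:00:00 2025 GMT",
    now=datetime(2025, 1, 1, tzinfo=timezone.utc),
)
print(remaining)  # 30.0
```

With the default context, `fetch_not_after` raises an `ssl.SSLError` subclass when the certificate fails validation (for example, when it has already expired), so a real collector would need to record that case separately rather than treat it as missing data.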
- Total population—Total population, overall and by gender (×3 variables);
- Social security—Number of citizens working in a common or special scheme, sector category (×7 variables);
- Municipal taxes—Taxes exclusive to the municipality (×41 variables);
- Registered unemployment—Registered unemployment by sector and age bracket (×12 variables);
- Land data—Surface, area, perimeter and GPS position of the entity (×6 variables);
- Public finance data—Debt registered with the Treasury (×1 variable).
- Budget_cnpt_1: Chapter 1, personnel expenses;
- Budget_cnpt_206: Chapter 2 of the current goods and services expenses includes “renting computer equipment, office automation applications, data transmission, operating systems, database management applications, and any other computer equipment and software”;
- Budget_cnpt_216: Chapter 2 of the current goods and services expenses includes “Maintenance of web services, intranet, Internet, data network, voice network, antivirus software, document management, warranties, repairs of computer equipment, and maintenance of computer programs”;
- Budget_cnpt_220.02: Chapter 2 of the current goods and services expenses includes “non-inventoriable computer equipment for the normal functioning of computer, office automation, transmission and other equipment such as diskettes, continuous paper, standard software packages”;
- Budget_cnpt_222.03: Chapter 2 of the current goods and services expenses includes “telephone services and computer communications”;
- Budget_cnpt_636: Chapter 6 of Real Investments includes “purchases of equipment for information processes”;
- Budget_cnpt_641: Chapter 6 of Real Investments includes “purchases of computer applications”.
Appendix A.3. Risk Metric
- Security: This indicator uses metrics such as the quality of digital certificates, the presence of common vulnerabilities and exposures (CVEs), and open ports. All of the previously mentioned technological variables beginning with SX are used in this indicator;
- Availability: The availability indicator is composed of various technical metrics, including download speed, server time synchronization and the blacklisting of domains;
- SEO: The SEO indicator includes several metrics such as optimization, compliance and accessibility of the monitored systems.
Appendix A.4. CIORank as a Standardized Penetration Testing Approach
- Planning & Preparation: The target is the perimeter exposed to the Internet by entities. The scope is defined by the full list of local public entities provided by the National Institute of Statistics. A black-box testing approach is defined, in which the infrastructures of each entity are unknown a priori. The Rules of Engagement specify that only actions equivalent to those of a regular user are performed, ensuring no disruption to the monitored entity;
- Information Gathering/Reconnaissance: Public information on each of the entities to be monitored is collected from government data sources. This information is supplemented with passive enumeration tools such as Whois, Google, Bing, and Shodan.
- Vulnerability Analysis/Threat Modeling: Ports, services, software and hardware versions are identified, and potential attack vectors are found in relation to the identified technologies and known vulnerabilities (CVE).
- Exploitation (Scanning & Enumeration): In this phase, the objective is to exploit the detected vulnerabilities to attempt code execution, privilege escalation, or lateral movement. It should be noted that none of these actions were executed during this research.
- Post-Testing Activities: In this phase, the impact to which an entity is exposed is assessed, preparing reports and recommendations to maximize resilience and user experience in the face of cybersecurity events.
Phase | Competency Metrics | Technological Metrics |
---|---|---|
Planning & Preparation | Government sources: INE | N/A |
Information Gathering | Government sources: INE; Seguridad Social; SEPE; AEAT; MINHAP; CNIG; Datos.gov.es | S4—sensitive metadata indexed by Google; S6—presence of a Robots.txt file; SX_Safe_Browsing—domain reputation; SX_Shodan—known CVEs reported by Shodan; SEO5_Bing—number of results in Bing; SEO6_Google—number of results in Google; SEO7_links—links on the home page |
Vulnerability Analysis | N/A | S1—time to SSL/TLS certificate expiration; S2—obsolete cryptographic algorithms; S3—number of open ports; S5—SSL/TLS quality assessment; A2—synchronization with NTP; A3—servers providing HA service; A7—domain inclusion in blacklists; SEO4—cookies required |
Exploitation | N/A | S2—weak cipher confirmation; S3—exploitation of exposed ports; S5—SSL/TLS flaw validation; SX_Shodan—testing for detected CVEs |
Post-Testing Activities | N/A | A1—content download speed; SEO1—optimization for desktop devices; SEO2—optimization for mobile devices; SEO3_accessibility—compliance with WCAG criteria; SEO3_best_practices—compliance with good programming practices; SEO3_performance—web application performance; SEO3_pwa—support for progressive applications; SEO3_seo—SEO optimization of the service |
Appendix B. Machine Learning Model Flow
Appendix B.1. Phase 1 Data Capture
Appendix B.2. Phase 2 Data Collection
Appendix B.3. Phase 3 Data Cleaning
Appendix B.4. Phase 4 Feature Selection
Appendix B.5. Phase 5 Model Selection & Training
- Light Gradient Boosting Machine (lightgbm) [51];
- Extra Trees Classifier (et) [52];
- Random Forest Classifier (rf) [53];
- Gradient Boosting Classifier (gbc) [54];
- Logistic Regression (lr) [55];
- Ridge Classifier (ridge) [56];
- Linear Discriminant Analysis (lda) [57];
- K Neighbors Classifier (knn) [58];
- Ada Boost Classifier (ada) [59];
- SVM—Linear Kernel (svm) [60];
- Decision Tree Classifier (dt) [61];
- Quadratic Discriminant Analysis (qda) [62];
- Naive Bayes (nb) [63];
- Dummy Classifier (dummy).
- Light Gradient Boosting Machine (lightgbm) [51];
- Random Forest Regressor (rf) [53];
- Gradient Boosting Regressor (gbr) [54];
- Extra Trees Regressor (et) [52];
- K Neighbors Regressor (knn) [58];
- Least Angle Regression (lar) [64];
- Bayesian Ridge (br) [65];
- Ridge Regression (ridge) [56];
- Linear Regression (lr) [66];
- Huber Regressor (huber) [67];
- Ada Boost Regressor (ada) [59];
- Decision Tree Regressor (dt) [52];
- Passive Aggressive Regressor (par) [68];
- Orthogonal Matching Pursuit (omp) [69];
- Elastic Net (en) [70];
- Lasso Regression (lasso) [71];
- Lasso Least Angle Regression (llar) [64].
Appendix B.6. Phase 6 Model Evaluation
- Accuracy: It is the proportion of correct predictions relative to the total number of observations evaluated. In instances where the dataset is imbalanced, this metric may appear to be highly effective, despite the model’s potential inadequacies;
- AUC: The Area Under the ROC (Receiver Operating Characteristic) Curve measures how well the model discriminates between classes. The ROC curve plots the true positive rate (recall) against the false positive rate (FPR) at different thresholds;
- Recall: This is the Sensitivity or True Positive Rate, the proportion of actual positive cases that the model detects. It is a key metric for determining the best models when missing a positive case (a false negative) has a significant cost;
- Precision: It is the proportion of cases classified as positive that actually belong to the positive class. High precision indicates that the model produces few false positives;
- F1-Score: It is the harmonic mean of Recall and Precision. This metric is particularly useful when the objective is to select a model that balances the two, especially in scenarios involving imbalanced data.
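The classification metrics listed above follow directly from the four cells of a confusion matrix. The sketch below uses made-up counts for an imbalanced toy case (5 "at risk" entities out of 100) to show how accuracy can look strong while recall stays modest; the function name and the counts are illustrative assumptions, not results from the study.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, recall, precision and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    recall = tp / (tp + fn) if (tp + fn) else 0.0        # share of real positives found
    precision = tp / (tp + fp) if (tp + fp) else 0.0     # share of alarms that are real
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "recall": recall, "precision": precision, "f1": f1}


# Made-up counts: 5 entities truly at risk, 95 not at risk.
metrics = classification_metrics(tp=3, fp=1, fn=2, tn=94)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Here accuracy is 0.97 even though two of the five at-risk entities are missed (recall 0.60), which is the effect the Accuracy item warns about.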
- MAE—Mean Absolute Error: It measures the average of the absolute differences between predicted and actual values. Every error contributes in proportion to its size, so large errors are not penalized more heavily than small ones;
- MSE—Mean Square Error: It is a widely used metric that calculates the average of squared errors. It penalizes large errors and outliers most heavily but is difficult to interpret because it is expressed in squared units;
- RMSE—Root Mean Square Error: It is the square root of the MSE. It is expressed in the same units as the target variable (like MAE), which eases interpretation, while remaining sensitive to large errors;
- R2—Coefficient of determination: This metric measures what proportion of the variability in the dependent variable is explained by the model. The closer this coefficient is to 1, the better the model.
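The four regression metrics above can be computed in a few lines. The following sketch uses made-up values in which one prediction is off by 10 units, to show how a single large error raises RMSE well above MAE; names and numbers are illustrative only.

```python
from math import sqrt


def regression_metrics(y_true, y_pred):
    """MAE, MSE, RMSE and R2 for paired actual and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = sqrt(mse)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total variability
    ss_res = sum(e * e for e in errors)                 # unexplained variability
    r2 = 1.0 - ss_res / ss_tot
    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2}


actual = [50.0, 60.0, 70.0, 80.0]      # made-up risk values
predicted = [52.0, 58.0, 70.0, 90.0]   # last prediction is a large miss
scores = regression_metrics(actual, predicted)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

In this toy case the MAE is 3.5 while the RMSE is about 5.2: the squared term lets the single 10-unit miss dominate.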
- Y axis—The vertical axis contains all the variables of the model ordered by importance, with the most important variables being those located at the top;
- X axis—The horizontal axis represents the SHAP values, showing the importance of the impact of each variable in the model. Any positive value indicates that the characteristic increases the probability of the predicted class. A negative value decreases that probability;
- Colors—Each dot in the graph represents a row of data in the dataset. Red indicates a high value of the variable and blue a low value;
- Dot scatter—The horizontal spread of the dots for each variable shows the range of its effects across the rows of the dataset.
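To make the sign convention of the X axis concrete, the sketch below computes exact Shapley values for a tiny, invented model by enumerating every coalition of features. This is not the SHAP library and not the model from the study: `toy_risk_model`, its coefficients, the input `x` and the single all-zeros `baseline` are all assumptions chosen for illustration (SHAP implementations typically average over a background dataset and use faster, model-specific algorithms).

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of `predict` at `x`, relative to `baseline`.

    Features outside a coalition are replaced by their baseline value.
    """
    n = len(x)

    def value(coalition):
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                # Marginal contribution of feature i when added to coalition s.
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


def toy_risk_model(z):
    """Invented score: feature 0 raises it, feature 1 lowers it, 0 and 2 interact."""
    return 0.2 + 0.5 * z[0] - 0.3 * z[1] + 0.1 * z[0] * z[2]


x = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_risk_model, x, baseline)
print([round(p, 4) for p in phi])
```

The contributions add up to the difference between the prediction for `x` and the prediction for the baseline, which is the property that lets each dot on the plot be read as a push to the right (positive value) or to the left (negative value) of the model's reference output.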
Appendix B.7. Phase 7 Optimization & Fine Tuning
Appendix B.8. Phase 8 Deploying the Model
Appendix C. Risk Threshold Sensitivity Analysis
Appendix C.1. Sensitivity in the Subcomponents of the Risk Metric
- The overall predictive performance of the model, measured by its coefficient of determination (R2);
- The stability of the predictor hierarchy, verifying whether the province variable maintained its predominance as a key risk factor.
Scenario | Strategic Focus | Security Weight | Availability Weight | SEO Weight |
---|---|---|---|---|
A | Baseline. Balanced | 33% | 33% | 33% |
B | Focus on Security (CISO Vision) | 50% | 25% | 25% |
C | Focus on User Experience (COO Vision) | 25% | 50% | 25% |
D | Focus on Technical Quality (CTO Vision) | 25% | 25% | 50% |
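The scenario weights in the table above can be applied to an entity's three sub-scores as follows. The sketch assumes the composite is a linear weighted sum of the sub-scores, which the surviving text does not state explicitly; the entity's sub-scores are made-up values, and the baseline's 33% is represented as exactly one third so that the weights sum to 1.

```python
# Weights per scenario, taken from the table (33% written as exactly 1/3).
SCENARIOS = {
    "A": {"security": 1 / 3, "availability": 1 / 3, "seo": 1 / 3},
    "B": {"security": 0.50, "availability": 0.25, "seo": 0.25},
    "C": {"security": 0.25, "availability": 0.50, "seo": 0.25},
    "D": {"security": 0.25, "availability": 0.25, "seo": 0.50},
}


def composite_score(sub_scores, weights):
    """Weighted sum of sub-scores (ASSUMED aggregation rule, for illustration)."""
    return sum(weights[name] * sub_scores[name] for name in weights)


# Made-up sub-scores for one hypothetical entity, on a 0-100 scale.
entity = {"security": 80.0, "availability": 40.0, "seo": 60.0}

by_scenario = {name: composite_score(entity, w) for name, w in SCENARIOS.items()}
for name, score in by_scenario.items():
    print(name, round(score, 2))
```

For this hypothetical entity the composite moves between 55 and 65 depending on the scenario, which is the kind of shift the sensitivity analysis checks against the model's R2 and the ranking of the province variable.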
Scenario | Strategic Focus | R2 | Province Position |
---|---|---|---|
A | Baseline. Balanced | 0.8457 | 4th |
B | Focus on Security (CISO Vision) | 0.8508 | 7th |
C | Focus on User Experience (COO Vision) | 0.8536 | 6th |
D | Focus on Technical Quality (CTO Vision) | 0.8438 | 6th |
Appendix C.2. Risk Threshold Sensitivity Analysis
- Conservative Threshold (55%): A stricter threshold to identify a larger number of potentially at-risk entities;
- Liberal Threshold (65%): A more lenient threshold, focused on identifying the most evident high-risk cases with greater certainty.
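The effect of moving the threshold can be sketched as below. The direction of the rule is an assumption: an entity is flagged when its risk score is at or above the threshold, which is consistent with the 55% threshold identifying more at-risk entities than the 65% one. The scores are made up, and the label "baseline" for the 60% value is introduced here only for illustration.

```python
# Thresholds compared in the sensitivity analysis; "baseline" is an illustrative label.
THRESHOLDS = {"conservative": 0.55, "baseline": 0.60, "liberal": 0.65}


def label_at_risk(risk_scores, threshold):
    """Flag entities whose score is at or above the threshold (ASSUMED direction)."""
    return [score >= threshold for score in risk_scores]


# Made-up risk scores for four hypothetical entities.
risk_scores = [0.50, 0.57, 0.62, 0.70]

flagged_counts = {
    name: sum(label_at_risk(risk_scores, t)) for name, t in THRESHOLDS.items()
}
print(flagged_counts)
```

Because the binary labels change with the threshold, each threshold defines a different classification task with a different class balance, so the classifier has to be retrained and re-evaluated for each one.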
Threshold | Accuracy | AUC | Recall | Precision | F1 |
---|---|---|---|---|---|
55% | 0.9225 | 0.9671 | 0.9225 | 0.9214 | 0.9213 |
60% | 0.8475 | 0.9303 | 0.8475 | 0.8480 | 0.8474 |
65% | 0.8772 | 0.9253 | 0.8772 | 0.8730 | 0.8722 |
Appendix D. Interpretability of ML Algorithms
Appendix D.1. Maximum Interpretability
- Linear/Logistic Regression (lr, ridge, lasso, lar): These models are based on a simple mathematical equation. Each feature has a coefficient (a weight) that indicates how much it influences the final outcome, and in which direction (positive or negative). For a policymaker, this is the gold standard of explanation, although coefficients describe associations rather than proven causes;
- Decision Tree (dt): A single decision tree is a flowchart of “if/then” rules. It is highly intuitive: depending on the depth of the tree, users can visualize and follow the path that leads to the prediction for any given municipality.
Appendix D.2. Moderate Interpretability
- K-Nearest Neighbors (knn): This algorithm predicts risk by evaluating the k most similar municipalities in the training data. A municipality could be at risk because its closest neighbors are also at risk. This does not provide a global rule for interpretation, but rather offers a very clear local explanation.
- Linear Discriminant Analysis (lda) & SVM with a Linear Kernel (svm): Given two classes, ‘at risk’ versus ‘protected’, these methods identify the line (or hyperplane) that best separates them (geometric logic). Because the boundary is linear, the idea is relatively straightforward. However, concepts such as maximizing the margin between classes add abstraction and make these models harder to understand than a simple regression equation.
Appendix D.3. Low Interpretability
- Random Forest (rf) & Extra Trees (et): These algorithms are constructed with hundreds or thousands of decision trees, not just one. Each tree has one vote, and the final decision is made by majority vote. It is challenging to follow a single logical path through all the trees, although it is possible to determine which characteristics are, in general, the most important for the forest.
Appendix D.4. Minimal Interpretability: Advanced Boosting Models
- AdaBoost (ada), Gradient Boosting (gbc), LightGBM (lightgbm), and CatBoost (catboost): Boosting models build many trees, like Random Forest, but sequentially. Each new tree attempts to correct the errors of the previous ones using techniques such as gradient descent. It is almost impossible for a human to trace the logic of a single prediction, which is why external tools such as SHAP were created to explain their internal workings and the impact of each feature on their results.
Interpretability | Algorithms |
---|---|
Maximum | lr, ridge, lasso, lar, dt |
Moderate | knn, lda, svm |
Low | rf, et |
Minimal | ada, gbc, lightgbm, catboost |
References
- Nurmi, J.; Niemelä, M.; Brumley, B.B. Malware Finances and Operations: A Data-Driven Study of the Value Chain for Infections and Compromised Access. In Proceedings of the 18th International Conference on Availability, Reliability and Security, Benevento, Italy, 29 August–1 September 2023; pp. 1–12.
- Zoltan, M. Dark Web Price Index. 2023. Available online: https://www.privacyaffairs.com/dark-web-price-index-2023/ (accessed on 22 March 2025).
- Cherian, S. Healthcare Data: The Perfect Storm. Available online: https://www.forbes.com/councils/forbestechcouncil/2022/01/14/healthcare-data-the-perfect-storm/ (accessed on 22 March 2025).
- Brundage, M.; Avin, S.; Clark, J.; Toner, H.; Eckersley, P.; Garfinkel, B.; Dafoe, A.; Scharre, P.; Zeitzoff, T.; Filar, B.; et al. The Malicious Use of Artificial Intelligence: Forecasting, Prevention, and Mitigation. arXiv 2018, arXiv:1802.07228.
- Mirsky, Y.; Demontis, A.; Kotak, J.; Shankar, R.; Gelei, D.; Yang, L.; Zhang, X.; Lee, W.; Elovici, Y.; Biggio, B. The Threat of Offensive AI to Organizations. Comput. Secur. 2023, 124, 103006.
- Papadopoulos, P.; Katsikas, S.; Pitropakis, N. Editorial: Cybersecurity and Artificial Intelligence: Advances, Challenges, Opportunities, Threats. Front. Big Data 2025, 7, 1537878.
- Hossain, S.T.; Yigitcanlar, T.; Nguyen, K.; Xu, Y. Local Government Cybersecurity Landscape: A Systematic Review and Conceptual Framework. Appl. Sci. 2024, 14, 5501.
- CrowdStrike CrowdStrike Global Threat Report. 2024. Available online: https://www.crowdstrike.com/en-us/resources/reports/crowdstrike-2024-global-threat-report/ (accessed on 20 May 2024).
- Verizon Verizon Data Breach Investigations Report. 2024. Available online: https://www.verizon.com/business/resources/Te3/reports/2024-dbir-data-breach-investigations-report.pdf (accessed on 18 March 2025).
- Al-Hawawreh, M.; Aljuhani, A.; Jararweh, Y. Chatgpt for Cybersecurity: Practical Applications, Challenges, and Future Directions. Clust. Comput. 2023, 26, 3421–3436.
- Alawida, M.; Abu Shawar, B.; Abiodun, O.I.; Mehmood, A.; Omolara, A.E.; Al Hwaitat, A.K. Unveiling the Dark Side of ChatGPT: Exploring Cyberattacks and Enhancing User Awareness. Information 2024, 15, 27.
- European Union Agency for Cybersecurity. ENISA Threat Landscape 2024: July 2023 to June 2024; Publications Office: Luxembourg, 2024; ISBN 978-92-9204-675-0.
- Perez, E. Un Ciberataque Paraliza el Ayuntamiento de Sevilla: Piden un Rescate de Cinco MILLONES de euros para Recuperarlo. Available online: https://www.xataka.com/seguridad/ciberataque-paraliza-ayuntamiento-sevilla-piden-rescate-cinco-millones-euros-para-recuperarlo (accessed on 20 May 2024).
- Hoffman, C. Washington County Pays $350,000 Ransom After Cyberattack. Available online: https://www.cbsnews.com/pittsburgh/news/washington-county-pays-ransom-cyberattack/ (accessed on 20 May 2024).
- Longo, A. Westpole-PA Digitale, il vero Conto del Disastro: Enorme. Available online: https://www.cybersecurity360.it/nuove-minacce/westpole-pa-digitale-il-vero-conto-del-disastro-enorme/ (accessed on 20 May 2024).
- Paganini, P. The Ransomware Attack on Westpole Is Disrupting Digital Services for Italian Public Administration. Available online: https://securityaffairs.com/156090/cyber-crime/westpole-ransomware-attack.html (accessed on 20 May 2024).
- Norris, D.F.; Mateczun, L.; Forno, R. Cybersecurity and Local Government; John Wiley & Sons: Hoboken, NJ, USA, 2022; ISBN 978-1-119-78831-7.
- ICMA Icma.Org. Available online: https://icma.org/ (accessed on 20 May 2024).
- Chourabi, H.; Nam, T.; Walker, S.; Gil-Garcia, J.R.; Mellouli, S.; Nahon, K.; Pardo, T.A.; Scholl, H.J. Understanding Smart Cities: An Integrative Framework. In Proceedings of the 2012 45th Hawaii International Conference on System Sciences (HICSS), Maui, HI, USA, 4–7 January 2012; pp. 2289–2297.
- Norris, D. A Look at Local Government Cybersecurity in 2020|Icma.Org. Available online: https://icma.org/articles/pm-magazine/look-local-government-cybersecurity-2020 (accessed on 20 May 2024).
- European Parliament 2019/881 EU Regulation 2019/881 on ENISA and on Information and Communications Technology Cybersecurity Certification. Available online: http://data.europa.eu/eli/reg/2019/881/oj (accessed on 19 May 2024).
- European Commission The EU Cybersecurity Act. Available online: https://digital-strategy.ec.europa.eu/en/policies/cybersecurity-act (accessed on 26 May 2024).
- European Parliament 2022/2555 EU Directive 2022/2555 on Measures for a High Common Level of Cybersecurity Across the Union. Available online: https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:32022L2555 (accessed on 19 May 2024).
- Sanchez-Zurdo, J.; San-Martín, J. A Country Risk Assessment from the Perspective of Cybersecurity in Local Entities. Appl. Sci. 2024, 14, 12036.
- Kesan, J.P.; Zhang, L. An Empirical Investigation of the Relationship between Local Government Budgets, IT Expenditures and Cyber Losses. IEEE Trans. Emerg. Top. Comput. 2021, 9, 582–596.
- Prokhorenkova, L.; Gusev, G.; Vorobev, A.; Dorogush, A.V.; Gulin, A. CatBoost: Unbiased Boosting with Categorical Features. arXiv 2017, arXiv:1706.09516.
- Hancock, J.T.; Khoshgoftaar, T.M. CatBoost for Big Data: An Interdisciplinary Review. J. Big Data 2020, 7, 94.
- Baral, A.; Reynolds, T.; Susskind, L.; Weitzner, D.J.; Wu, A. Municipal Cyber Risk Modeling Using Cryptographic Computing to Inform Cyber Policymaking. arXiv 2024, arXiv:2402.01007.
- Yigitcanlar, T.; Senadheera, S.; Marasinghe, R.; Bibri, S.E.; Sanchez, T.; Cugurullo, F.; Sieber, R. Artificial Intelligence and the Local Government: A Five-Decade Scientometric Analysis on the Evolution, State-of-the-Art, and Emerging Trends. Cities 2024, 152, 105151.
- Jha, R.K.; Jha, M. Optimizing E-Government Cybersecurity through Artificial Intelligence Integration. J. Trends Comput. Sci. Smart Technol. 2024, 6, 67–87.
- Criado, J.I.; O.de Zarate-Alcarazo, L. Technological Frames, CIOs, and Artificial Intelligence in Public Administration: A Socio-Cognitive Exploratory Study in Spanish Local Governments. Gov. Inf. Q. 2022, 39, 101688.
- European Parliament 2024/1689. Regulation (EU) 2024/1689 of the European Parliament and of the Council of 13 June 2024 Laying down Harmonised Rules on Artificial Intelligence (Artificial Intelligence Act); European Union: Maastricht, The Netherlands, 2024.
- Dong, F.; Wang, L.; Nie, X.; Shao, F.; Wang, H.; Li, D.; Luo, X.; Xiao, X. DISTDET: A Cost-Effective Distributed Cyber Threat Detection System. In Proceedings of the 32nd USENIX Security Symposium (USENIX Security 23), Anaheim, CA, USA, 9–11 August 2023.
- FAIR Institute. Available online: https://www.fairinstitute.org/ (accessed on 11 August 2025).
- Schneider, J.; Gröger, C.; Lutsch, A.; Schwarz, H.; Mitschang, B. The Lakehouse: State of the Art on Concepts and Technologies. SN Comput. Sci. 2024, 5, 449. [Google Scholar] [CrossRef]
- INE INE—National Statistics Institute. Available online: https://www.ine.es/ (accessed on 26 May 2024).
- Spanish Ministry of Social Security Seguridad Social—Spanish Ministry of Social Security. Available online: https://www.seg-social.es/wps/portal/wss/internet/Inicio (accessed on 26 May 2024).
- Spanish Ministry of Employment SEPE—Servicio Público de Empleo Estatal—State Public Employment Service. Available online: https://www.sepe.es/HomeSepe (accessed on 26 May 2024).
- AEAT AEAT—Tax Office. Available online: https://sede.agenciatributaria.gob.es/ (accessed on 26 May 2024).
- MINHAP Hacienda—Contabilidad Pública y Control. Available online: https://www.hacienda.gob.es/es-ES/Paginas/Home.aspx (accessed on 26 May 2024).
- CNIG CNIG—Centro Nacional de Información Geográfica. Available online: http://www.ign.es/web/ign/portal/qsm-cnig (accessed on 26 May 2024).
- Ministry for Digital Transformation Datos.Gob.Es. Available online: https://datos.gob.es/es/ (accessed on 26 May 2024).
- Qualys, S.L. SSL Server Test. Available online: https://www.ssllabs.com/ssltest/ (accessed on 26 May 2024).
- Mozilla Mozilla Observatory. Available online: https://observatory.mozilla.org/ (accessed on 26 May 2024).
- Google. Google Safe Browsing; Google: Mountain View, CA, USA, 2024. [Google Scholar]
- Shodan Search Engine for the Internet of Everything. Available online: https://www.shodan.io/ (accessed on 26 May 2024).
- Network Time Foundation NTP Pool Project. Available online: https://www.ntppool.org/en/ (accessed on 26 May 2024).
- MXToolBox, Inc. MXToolbox Supertool Blacklists. Available online: https://mxtoolbox.com/blacklists.aspx (accessed on 26 May 2024).
- W3C Web Content Accessibility Guidelines (WCAG) 2.1. Available online: https://www.w3.org/TR/WCAG21/ (accessed on 26 May 2024).
- Scarfone, K.A.; Souppaya, M.P.; Cody, A.; Orebaugh, A.D. Technical Guide to Information Security Testing and Assessment. NIST SP 800-115; National Institute of Standards and Technology: Gaithersburg, MD, USA, 2008; p. NIST SP 800-115. [Google Scholar] [CrossRef]
- Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.-Y. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. In Proceedings of the 31st Annual Conference on Neural Information Processing Systems (NIPS), Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
- Breiman, L.; Friedman, J.; Olshen, R.A.; Stone, C.J. Classification And Regression Trees, 1st ed.; Chapman and Hall: New York, NY, USA; CRC: New York, NY, USA, 1984; ISBN 978-1-315-13947-0. [Google Scholar]
- Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
- Friedman, J.H. Greedy Function Approximation: A Gradient Boosting Machine. Ann. Statist. 2001, 29, 1189–1232. [Google Scholar] [CrossRef]
- Cox, D.R. The Regression Analysis of Binary Sequences. J. R. Stat. Soc. B Stat. Methodol. 1958, 20, 215–232. [Google Scholar] [CrossRef]
- Hoerl, A.E.; Kennard, R.W. Ridge Regression: Biased Estimation for Nonorthogonal Problems. Technometrics 1970, 12, 55–67. [Google Scholar] [CrossRef]
- Fisher, R.A. The Use of Multiple Measurements in Taxonomic Problems. Ann. Eugen. 1936, 7, 179–188. [Google Scholar] [CrossRef]
- Cover, T.; Hart, P. Nearest Neighbor Pattern Classification. IEEE Trans. Inform. Theory. 1967, 13, 21–27. [Google Scholar] [CrossRef]
- Freund, Y.; Schapire, R.E. A Decision-Theoretic Generalization of On-Line Learning and an Application to Boost-ing. J. Comput. Syst. Sci. 1997, 55, 119–139. [Google Scholar] [CrossRef]
- Cortes, C.; Vapnik, V. Support-Vector Networks. Mach. Learn. 1995, 20, 273–297. [Google Scholar] [CrossRef]
- Quinlan, J.R. Induction of Decision Trees. Mach. Learn. 1986, 1, 81–106. [Google Scholar] [CrossRef]
- Nembhard, H.B. Statistical Process Adjustment Methods for Quality Control. J. Am. Stat. Assoc. 2004, 99, 567–568. [Google Scholar] [CrossRef]
- Maron, M.E. Automatic Indexing: An Experimental Inquiry. J. ACM 1961, 8, 404–417. [Google Scholar] [CrossRef]
- Efron, B.; Hastie, T.; Johnstone, I.; Tibshirani, R. Least Angle Regression. Ann. Statist. 2004, 32, 407–451. [Google Scholar] [CrossRef]
- Tipping, M.E. Sparse Bayesian Learning and the Relevance Vector Machine. J. Mach. Learn. Res. 2001, 1, 211–244. [Google Scholar] [CrossRef]
- Galton, F. Regression Towards Mediocrity in Hereditary Stature. J. Anthropol. Inst. Great Br. Irel. 1886, 15, 246–263. [Google Scholar] [CrossRef]
- Huber, P.J. Robust Estimation of a Location Parameter. Ann. Math. Statist. 1964, 35, 73–101. [Google Scholar] [CrossRef]
- Crammer, K.; Dekel, O.; Keshet, J.; Shalev-Shwartz, S.; Singer, Y. Online Passive-Aggressive Algorithms. J. Mach. Learn. Res. 2006, 7, 551–585. [Google Scholar]
- Pati, Y.C.; Rezaiifar, R.; Krishnaprasad, P.S. Orthogonal Matching Pursuit: Recursive Function Approximation with Applications to Wavelet Decomposition. In Proceedings of the 27th Asilomar Conference on Signals, Systems and Computers, Pacific Grove, CA, USA, 1–3 November 1993. [Google Scholar]
- Zou, H.; Hastie, T. Regularization and Variable Selection Via the Elastic Net. J. R. Stat. Soc. B Stat. Methodol. 2005, 67, 301–320. [Google Scholar] [CrossRef]
- Tibshirani, R. Regression Shrinkage and Selection Via the Lasso. J. R. Stat. Soc. Ser. B Methodol. 1996, 58, 267–288. [Google Scholar] [CrossRef]

Year | n | CIORank Average
---|---|---
2022 | 125 | 57.3%
2023 | 129 | 56.8%

Budget Concept | Security | Availability | SEO | CIORank
---|---|---|---|---
1 | −0.0363 | 0.0827 | 0.0336 | 0.0459
206 | 0.0123 | 0.0525 | −0.0069 | 0.0266
216 | −0.0736 | 0.0173 | −0.0136 | −0.031
220.02 | −0.0701 | 0.0147 | 0.0058 | −0.022
222.03 | −0.0494 | 0.0841 | 0.0093 | 0.0272
636 | −0.0087 | 0.029 | −0.0115 | −0.0135
641 | −0.0061 | 0.0012 | −0.0025 | −0.0091
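
The coefficients in the table above are all close to zero, which indicates a very weak linear association between each budget concept and the scores. As a minimal sketch of how one such coefficient could be computed, the example below implements the Pearson correlation in plain Python. The choice of Pearson (rather than a rank correlation) is an assumption, and the `spend` and `score` values are hypothetical, not data from the study.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products of deviations (covariance without the 1/n factor).
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Square roots of the sums of squared deviations.
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (dev_x * dev_y)


# Hypothetical values: IT spend per budget concept vs. an entity score.
spend = [10.0, 20.0, 30.0, 40.0, 50.0]
score = [55.0, 58.0, 54.0, 57.0, 56.0]
r = pearson_r(spend, score)
print(f"r = {r:.4f}")
```

A coefficient near 0, as in this toy data, means that knowing the spend says almost nothing about the score, which is how the small values in the table would be read under the Pearson assumption.
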

Model | Accuracy | AUC | Recall | Precision | F1
---|---|---|---|---|---
catboost | 0.848 | 0.940 | 0.848 | 0.848 | 0.847
lightgbm | 0.845 | 0.929 | 0.845 | 0.845 | 0.844
et | 0.842 | 0.919 | 0.842 | 0.842 | 0.842
rf | 0.834 | 0.914 | 0.834 | 0.834 | 0.834
gbc | 0.832 | 0.920 | 0.832 | 0.833 | 0.832
lr | 0.821 | 0.903 | 0.821 | 0.822 | 0.821
ridge | 0.819 | 0.901 | 0.819 | 0.820 | 0.818
lda | 0.819 | 0.901 | 0.819 | 0.820 | 0.818
knn | 0.819 | 0.895 | 0.819 | 0.819 | 0.819
ada | 0.816 | 0.903 | 0.816 | 0.817 | 0.816
svm | 0.807 | 0.903 | 0.807 | 0.816 | 0.806
dt | 0.774 | 0.774 | 0.774 | 0.775 | 0.774
qda | 0.604 | 0.730 | 0.604 | 0.655 | 0.561
dummy | 0.506 | 0.500 | 0.506 | 0.256 | 0.340
nb | 0.496 | 0.804 | 0.496 | 0.516 | 0.351
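
In the table above, Recall equals Accuracy in every row, which is what support-weighted averaging of per-class recall always produces. The following sketch, written under that assumption (the averaging mode is inferred from the numbers, not stated here), computes weighted metrics in plain Python for a baseline that always predicts the majority class. The label counts (506 of 1000) are hypothetical and chosen only to mirror the `dummy` row.

```python
from collections import Counter


def weighted_scores(y_true, y_pred):
    """Return accuracy and support-weighted recall, precision, and F1."""
    n = len(y_true)
    support = Counter(y_true)    # number of true samples per class
    predicted = Counter(y_pred)  # number of predictions per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    accuracy = sum(correct.values()) / n
    recall = precision = f1 = 0.0
    for cls, count in support.items():
        weight = count / n
        r = correct[cls] / count
        # A class that is never predicted gets precision 0 by convention.
        p = correct[cls] / predicted[cls] if predicted[cls] else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        recall += weight * r
        precision += weight * p
        f1 += weight * f
    return accuracy, recall, precision, f1


# Hypothetical labels: 506 of 1000 entities belong to the majority class.
y_true = [1] * 506 + [0] * 494
y_dummy = [1] * 1000  # baseline that always predicts the majority class

acc, rec, prec, f1 = weighted_scores(y_true, y_dummy)
print(f"acc={acc:.3f} recall={rec:.3f} precision={prec:.3f} f1={f1:.3f}")
```

With these counts the baseline's weighted precision is 0.506² ≈ 0.256 and its weighted F1 is about 0.340, while accuracy and recall both equal 0.506, so the `dummy` row is consistent with a majority-class predictor on a nearly balanced target.
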

Model | MAE | MSE | RMSE | R²
---|---|---|---|---
catboost | 2.4319 | 10.2184 | 3.1947 | 0.8457
lightgbm | 2.4780 | 10.5687 | 3.2491 | 0.8403
rf | 2.5249 | 11.2499 | 3.3521 | 0.8301
gbr | 2.6563 | 11.9368 | 3.4539 | 0.8197
et | 2.6058 | 12.1019 | 3.4770 | 0.8173
knn | 2.6554 | 12.5221 | 3.5366 | 0.8106
lar | 3.2710 | 16.9024 | 4.1100 | 0.7453
br | 3.2709 | 16.9024 | 4.1100 | 0.7453
ridge | 3.2709 | 16.9023 | 4.1100 | 0.7453
lr | 3.2710 | 16.9024 | 4.1100 | 0.7453
huber | 3.2476 | 17.1601 | 4.1407 | 0.7417
ada | 3.4064 | 17.9231 | 4.2325 | 0.7293
dt | 3.2027 | 19.3444 | 4.3949 | 0.7080
par | 3.7511 | 22.8557 | 4.7635 | 0.6526
omp | 5.5543 | 49.3850 | 7.0232 | 0.2581
en | 5.6148 | 56.1674 | 7.4853 | 0.1603
lasso | 5.8728 | 60.7659 | 7.7857 | 0.0915
llar | 2.4780 | 10.5687 | 3.2491 | 0.8403
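
The four columns of the regression table above can be reproduced from a vector of true values and a vector of predictions. The sketch below shows the standard definitions in plain Python. It also illustrates one plausible reason why the reported RMSE is close to, but not exactly, the square root of the reported MSE (for catboost, √10.2184 ≈ 3.1966 against a reported 3.1947): if each metric is averaged over cross-validation folds, the mean of the per-fold RMSE values is never larger than the square root of the mean MSE. Fold averaging is an assumption here, and the fold data are hypothetical.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE, and R2 for one set of predictions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mean_true = sum(y_true) / n
    ss_res = mse * n                                     # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    return {"MAE": mae, "MSE": mse, "RMSE": math.sqrt(mse),
            "R2": 1.0 - ss_res / ss_tot}


# Two hypothetical cross-validation folds: (true scores, predicted scores).
folds = [
    ([50.0, 60.0, 70.0, 80.0], [52.0, 58.0, 71.0, 79.0]),
    ([40.0, 55.0, 65.0, 90.0], [44.0, 52.0, 69.0, 87.0]),
]
per_fold = [regression_metrics(t, p) for t, p in folds]

# Average each metric across folds, as a fold-based report would.
mean_mse = sum(m["MSE"] for m in per_fold) / len(per_fold)
mean_rmse = sum(m["RMSE"] for m in per_fold) / len(per_fold)
print(f"mean MSE={mean_mse:.4f}  sqrt(mean MSE)={math.sqrt(mean_mse):.4f}  "
      f"mean RMSE={mean_rmse:.4f}")
```

In this toy case the two folds have MSE values of 2.5 and 12.5, so the square root of the mean MSE is larger than the mean of the two RMSE values. The same ordering appears in every row of the table where the two differ.
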

Year | MAE | MSE | RMSE | R²
---|---|---|---|---
2022 and 2023 | 2.43 | 10.22 | 3.19 | 0.8457
2024 | 2.62 | 10.57 | 3.25 | 0.8403

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Sanchez-Zurdo, J.; San-Martín, J. Beyond Geography and Budget: Machine Learning for Calculating Cyber Risk in the External Perimeter of Local Public Entities. Electronics 2025, 14, 3845. https://doi.org/10.3390/electronics14193845