Event Prediction Using Spatial–Temporal Data for a Predictive Traffic Accident Approach Through Categorical Logic

Koutsaki, Eleftheria; Vardakis, George; Papadakis, Nikos

doi:10.3390/data10060085


by Eleftheria Koutsaki, George Vardakis and Nikos Papadakis *

Department of Electrical and Computer Engineering, Hellenic Mediterranean University, 71410 Heraklion, Greece

* Author to whom correspondence should be addressed.

Data 2025, 10(6), 85; https://doi.org/10.3390/data10060085

Submission received: 3 May 2025 / Revised: 16 May 2025 / Accepted: 21 May 2025 / Published: 3 June 2025

(This article belongs to the Section Information Systems and Data Management)


Abstract

An event is an occurrence that takes place at a specific time and location that can be either weather-related (snowfall), social (crime), natural (earthquake), political (political unrest), or medical (pandemic) in nature. These events do not belong to the “normal” or “usual” spectrum and result in a change in a given situation; thus, their prediction would be very beneficial, both in terms of timely response to them and for their prevention, for example, the prevention of traffic accidents. However, this is currently challenging for researchers, who are called upon to manage and analyze a huge volume of data in order to design applications for predicting events using artificial intelligence and high computing power. Although significant progress has been made in this area, the heterogeneity in the input data that a forecasting application needs to process—in terms of their nature (spatial, temporal, and semantic)—and the corresponding complex dependencies between them constitute the greatest challenge for researchers. For this reason, the initial forecasting applications process data for specific situations, in terms of number and characteristics, while, at the same time, having the possibility to respond to different situations, e.g., an application that predicts a pandemic can also predict a central phenomenon, simply by using different data types. In this work, we present the forecasting applications that have been designed to date. We also present a model for predicting traffic accidents using categorical logic, creating a Knowledge Base using the Resolution algorithm as a proof of concept. We study and analyze all possible scenarios that arise under different conditions. Finally, we implement the traffic accident prediction model using the Prolog language with the corresponding Queries in JPL.

Keywords:

events; event prediction; traffic accident; spatiotemporal data; artificial intelligence; machine learning; OSRM; categorical logic; knowledge base; resolution algorithm; Prolog; JPL

1. Introduction

An event has a specific identity; that is, it involves a specific human activity or the environment. It can be weather-related, social, or medical, and it takes place at a specific time and location. A set of events can be related to a long-term situation, such as the course of business, medical care, or political stability. The study of events spans all aspects of human life and the environment. There has been particular interest in studying events in recent decades, particularly in terms of their detection and prediction [1]. Predicting events can have enormous benefits. By predicting events, a person can act promptly and effectively to manage a difficult situation and take necessary preventive measures, such as dealing with and avoiding a traffic accident, illness, crime, or even a threat to one’s home [2]. Until a few years ago, the mere thought of predicting events was prohibitive because of the heterogeneity in and interrelationship between events and their various causes, as well as the limited resources that researchers had at their disposal [3]. However, in the last decade, progress in artificial intelligence, computing power, and big data has allowed researchers to predict events using various methodologies. By studying the characteristics of past events and, with the help of machine learning and spatiotemporal data mining, extracting the patterns within them, researchers can now design models that predict future events. Spatiotemporal data differ from the purely spatial data on which many computational approaches were developed: although their sources have been available for many decades, they have dynamic characteristics, since they are constantly modified in real time. Spatiotemporal data include trajectory data, reference point data, raster data, and spatiotemporal event-type data (which we will use in this study) [4,5].
Several problems arise when predicting events by studying and analyzing big data. The following are the most important:

(1): An event depends on time, space, and its nature, intensity, and duration. With each change in these heterogeneous but interdependent parameters, a different event arises. Because of the variety of results/events, determining the label of each result requires automatic methods, which introduce errors during the coding of events; thus, we cannot guarantee the quality of the label. It is difficult to determine the criteria by which a prediction can be characterized as false or true (valid or not) [6].
(2): Because of the interdependence of events, most of the time, one event indirectly or directly affects another or is the cause of another. Consequently, many and complex dependencies appear between the predicted events during the forecasting process, creating a problem in terms of examining and evaluating these correlations [7,8].

Event prediction requires the study and analysis of data on past events to derive the desired predictions; however, events are dynamic and are constantly changing, making monitoring them in a training model problematic; for example, a disease can spread to affect 20% of a population at one point in time, and then, very soon after, cover 70% of the population [9]. Consequently, the distribution of input data, as well as their sources, dynamically changes in real time. This requires trained learning models to be constantly upgraded and updated, which costs both time and money. Additional weaknesses include the prediction of rare events, the inability to recover lost input data, the inability to make long-term predictions, the inability to separate useful from scattered and irrelevant input data, and the management of uncertainty during prediction. To address the challenges associated with event prediction, researchers have conducted targeted experiments using specific and controlled input data. These experiments aim to isolate and better understand the complexities inherent in predictive modeling [10]. In recent years, substantial research has been devoted to overcoming these obstacles, both in refining the methodologies used for event prediction and expanding their practical applications.

Despite this progress, event prediction remains in its infancy compared to other scientific domains. Most techniques developed thus far are limited in scope, having been tailored to narrowly defined datasets and conditions. This specificity significantly constrains their generalizability and broader applicability.

One of the core challenges lies in the sensitivity of prediction outcomes to variations in input data—particularly in terms of accuracy and the timing of data acquisition. Even minor deviations in these parameters can produce significantly different results, hindering efforts to establish standardized forecasting methodologies. The absence of widely accepted benchmarks or identified bottlenecks further impedes progress, leaving the field fragmented and underdeveloped.

This study aims to systematically document the existing approaches to event prediction using big data, with a particular emphasis on spatiotemporal (st) data. We begin by categorizing predictive methods based on the nature of the problems they address and the primary data dimension they prioritize—namely, time, location, or semantics. These categories are then compared to highlight the strengths and limitations of each.

The resulting classification is intended to support researchers in selecting the most suitable forecasting techniques for specific use cases and to help define the levels of reporting and abstraction that future applications might require. We also propose a framework for standardizing the evaluation of forecasting methods, acknowledging the wide variability in prediction outcomes that depend on when, where, and under what conditions input data are collected [11].

Finally, this study presents a case study focused on traffic accident prediction, formulated under the following conditions:

“A driver is operating a vehicle on a roadway while under the influence of a significant amount of alcohol—exceeding the legal blood alcohol concentration limit of 0.25%. The environmental conditions are adverse, characterized by heavy rainfall at a rate of 50 mm/s. The route the driver intends to follow includes a sharp turn. Furthermore, the journey takes place during late-night hours (between 12:00 a.m. and 5:00 a.m.), resulting in low visibility due to darkness. Compounding these risks, the driver is traveling at a high speed, exceeding 100 km/h. The event predicted—and demonstrated in this study—is the occurrence of a traffic accident under these combined conditions. It is important to note that the threshold values assigned to the variables (e.g., alcohol level, rainfall intensity, speed) are indicative and may be adjusted to reflect the specific requirements of different scenarios or environments, depending on the problem under investigation”.

To enable the prediction of traffic accidents, this study employs categorical logic to construct a knowledge base, which is then evaluated using the Resolution algorithm. All potential scenarios that may emerge from the predicted situation are systematically examined by varying the values of the relevant descriptive variables. The predicted event scenario is subsequently implemented in a Prolog program, with associated queries executed through the Java-Prolog Library (JPL). This represents a novel approach that, to the best of our knowledge, has not yet been explored in prior research.
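As a minimal sketch of how the quoted scenario could be turned into a checkable rule, the Python fragment below encodes the indicative thresholds as constants and predicts an accident only when every condition holds at once. The class name, field names, and the single conjunctive rule are assumptions made for illustration; the knowledge base constructed later in the study may encode the conditions differently, and the units are simply those quoted in the scenario.

```python
from dataclasses import dataclass

# Indicative thresholds copied from the quoted scenario (units as stated there).
ALCOHOL_LIMIT = 0.25        # blood alcohol concentration limit
HEAVY_RAIN_RATE = 50.0      # rainfall rate regarded as heavy
HIGH_SPEED_KMH = 100.0      # speed threshold in km/h
NIGHT_START_HOUR = 0        # 12:00 a.m.
NIGHT_END_HOUR = 5          # 5:00 a.m.


@dataclass(frozen=True)
class DrivingConditions:
    """Hypothetical container for the descriptive variables of the scenario."""
    alcohol_level: float
    rain_rate: float
    sharp_turn_ahead: bool
    hour_of_day: int        # 0-23
    speed_kmh: float


def accident_predicted(c: DrivingConditions) -> bool:
    """Assumed rule: an accident is predicted only if all risk factors hold."""
    drunk = c.alcohol_level > ALCOHOL_LIMIT
    heavy_rain = c.rain_rate >= HEAVY_RAIN_RATE
    night = NIGHT_START_HOUR <= c.hour_of_day < NIGHT_END_HOUR
    speeding = c.speed_kmh > HIGH_SPEED_KMH
    return drunk and heavy_rain and c.sharp_turn_ahead and night and speeding


scenario = DrivingConditions(0.4, 50.0, True, 2, 120.0)
sober_variant = DrivingConditions(0.0, 50.0, True, 2, 120.0)
```

Varying one field at a time, as in `sober_variant`, mirrors the idea of examining the scenarios that arise when the descriptive variables change.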

This study is structured as follows. Section 2 reviews the relevant literature in the field of event prediction. Section 3 explores various domains where forecasting techniques have been applied. Section 4 delves into specific methodologies for event prediction, while Section 5 introduces categorical logic as a formal framework for modeling such problems. In Section 6, we present the proposed traffic accident prediction model. We begin by formulating the problem using categorical logic and constructing a corresponding knowledge base (Section 6.1). We then demonstrate the logical validity of the predicted event using the Resolution algorithm (Section 6.2). To explore the model’s robustness, we extract all possible scenarios by varying the input variables across their full range of combinations (Section 6.3). The implementation phase follows in Section 6.4, where the knowledge base is encoded in the Prolog programming language. Section 6.5 outlines the execution of related queries using the Java-Prolog Library (JPL) to facilitate user interaction through a dynamic interface. Finally, Section 7 discusses open challenges and outlines future directions for research in event prediction. This study concludes with a summary of findings related to traffic accident forecasting, including a brief overview of the real-time application we developed. This application utilizes a live-updated map interface to feed variable values into the knowledge base, enabling continuous prediction updates under actual driving conditions.

2. Related Research

This section identifies and discusses three major categories of research related to event prediction using big data. These categories are defined by (i) the methods employed for detecting events, (ii) the nature and structure of the input data used in predictive models, and (iii) the types of outcomes targeted, often specific to particular application domains [12].

Early research primarily concentrated on event detection, which involves identifying historical or ongoing events rather than forecasting future occurrences. The objective of detection is to extract recurring patterns, identify anomalies, and group related observations [13]. For example, many modern applications utilize interactive maps to detect and log extreme or unusual events in real time [14]. Over the past decade, significant advancements have been made in this area, especially in fields such as social media analysis, where event detection techniques are used to uncover emergent patterns and disruptions [15].

In predictive modeling, the analysis process must adapt continuously to fluctuations in input data and changes in the associated dependent variable. This dependent variable, or target, can take the form of a scalar value, a vector, or a structured object—representing anything from economic activity or geographic regions to emotional sentiment. Notably, the target does not always pertain to future states; rather, its type determines the analytical approach. Depending on the nature of the prediction, outputs may be structured temporally, spatially, or semantically. These dimensions help to classify the predictive models and their use in various applications.

In parallel, several methods have been developed specifically for spatiotemporal event prediction, incorporating temporally and spatially dependent variables to improve predictive accuracy [16,17].

Considerable research attention has focused on domain-specific event forecasting. These include social phenomena, such as civil unrest; environmental conditions, like droughts and floods; renewable energy trends, including the forecasting of peak solar or wind prices at specific locations; and business-critical events, such as organizational failures or bankruptcies [18]. Despite growing interest, researchers face persistent and often complex challenges in modeling such events because of the dynamic, multidimensional nature of the data involved.

2.1. Time, Location, and Semantics Prediction

To predict future events by specifying their time, location, and semantic characteristics, researchers have developed a range of techniques. These approaches generally fall into three categories, i.e., (i) system-based, (ii) model-based, and (iii) tensor-based methods, which are more specifically defined as follows:

System-based techniques rely on integrated systems that employ fusion methods to forecast future events using human-derived predictive inputs. One of the primary challenges in these approaches is the considerable variability in individual predictive abilities, which often stems from differences in cognitive background and domain expertise. To mitigate this, some systems group participants based on similar competencies and then aggregate their predictions to improve overall accuracy. An alternative system-based method involves synthesizing inputs from multiple individual predictors. In this framework, contributors assign a confidence value to each prediction (typically represented as a virtual “coupon”), which quantifies their certainty regarding the outcome. These predictions are then traded in a simulated market environment where participants “buy” or “sell” outcomes. This mechanism incentivizes accurate predictions by rewarding correct outcomes and penalizing incorrect ones, thereby reinforcing reliability and accountability. Some system-based approaches are specifically designed to detect “programmed” future events, that is, those that are anticipated based on identifiable trends or indicators extracted from structured or unstructured data sources, such as social media content or online news. These methods often leverage natural language processing (NLP) and are typically implemented through a four-stage pipeline. Stage 1 involves content filtering, where texts related to the target event are selected using either supervised methods (e.g., text classifiers) or unsupervised techniques (e.g., keyword-based filtering). Stage 2 focuses on time expression identification, which entails detecting future-oriented temporal references in the text using linguistic rules or NLP parsing tools. Stage 3 extracts future reference expressions, which serve as the core indicators of potential upcoming events. These expressions are identified using regular expressions or classification algorithms. Stage 4 addresses location identification, which is particularly challenging because of the inconsistency and noise in spatial references. To improve precision, geocoding techniques are employed. Spatial data may be drawn from article metadata, author information, or contextual clues, and the spatial scope is refined either through geometric boundaries or by logically merging similar location expressions to reduce redundancy and error [19,20,21].
Model-based approaches rely on predictive systems that integrate and compile multiple models to forecast future events. These systems are capable of determining not only the time, location, and semantic context of an event but also its frequency and type. A prominent example is the EMBERS system [22], which operates primarily in the digital domain and analyzes diverse data sources to anticipate civil unrest and other events of interest. EMBERS has demonstrated high levels of both prediction accuracy and recall, thereby enhancing user trust in its outputs. The methodology underpinning such systems typically begins with the independent evaluation of each predictive model, emphasizing accuracy regardless of recall. Once all candidate models are generated, their outputs are combined using fusion techniques—such as Bayesian fusion—to exploit their complementary strengths. This fusion process significantly improves recall, enabling the system to detect a broader range of potential outcomes. An illustrative example of this approach is the Cardon system, which also leverages multiple predictive models and combines their outputs to improve both the reliability and comprehensiveness of event prediction [23,24,25].
Tensor-based approaches represent data as multidimensional arrays—or tensors—that encode information across three primary dimensions: time, location, and semantics. These tensors are then decomposed into lower-order matrices, each of which captures latent patterns or unresolved relationships within a specific dimension. This decomposition facilitates the extraction of meaningful features from complex, high-dimensional data. To enable forecasting, the original tensor is extended to cover future time intervals using various extrapolation techniques. One such method involves extending the time dimension by multiplying it with matrices representing other contextual dimensions, thereby generating a new tensor that projects into the future. An alternative approach introduces blank entries—corresponding to future values—into the initial tensor. These missing values are then estimated using tensor completion or integration techniques, ultimately producing predictions that reflect plausible future events [26].
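The four-stage text pipeline described under system-based techniques can be illustrated with a small, purely rule-based Python sketch. The keyword list, the regular expressions, and the two-entry gazetteer (with approximate coordinates) are hypothetical stand-ins chosen for illustration; a real system would use trained classifiers, an NLP parser, and a geocoding service.

```python
import re

# Hypothetical resources, chosen only for illustration.
EVENT_KEYWORDS = {"protest", "strike", "rally"}
FUTURE_TIME = re.compile(
    r"\b(tomorrow|next \w+|on \d{1,2} (?:January|February|March|April|May|June|July|"
    r"August|September|October|November|December) \d{4})\b",
    re.IGNORECASE,
)
FUTURE_REFERENCE = re.compile(
    r"\b(will|is planned|is scheduled|are planning)\b", re.IGNORECASE
)
GAZETTEER = {  # approximate (latitude, longitude)
    "heraklion": (35.3387, 25.1442),
    "athens": (37.9838, 23.7275),
}


def extract_planned_event(text):
    """Return (time expression, place, coordinates), or None if any stage fails."""
    lowered = text.lower()
    # Stage 1: content filtering (unsupervised, keyword-based).
    if not any(keyword in lowered for keyword in EVENT_KEYWORDS):
        return None
    # Stage 2: time expression identification.
    time_match = FUTURE_TIME.search(text)
    if time_match is None:
        return None
    # Stage 3: future reference expression.
    if FUTURE_REFERENCE.search(text) is None:
        return None
    # Stage 4: location identification via a toy gazetteer (stand-in for geocoding).
    for place, coords in GAZETTEER.items():
        if place in lowered:
            return time_match.group(0), place, coords
    return None


example = extract_planned_event(
    "A strike will take place in Heraklion on 23 September 2025."
)
```

Each stage acts as a filter, so a text that mentions no event keyword, no future time, or no known place yields no candidate event.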

2.2. Event Prediction Evaluation

Event prediction evaluation seeks to determine whether the set of predicted events, denoted as Y′, accurately corresponds to and represents the actual set of observed events, Y. The evaluation techniques typically produce outputs in the form of entities characterized by multiple attributes, such as time, location, and type. However, before a meaningful assessment of a prediction model’s performance can be made, it is essential to establish prediction pairs—each consisting of a predicted event and its corresponding real-world counterpart. These pairs must be carefully labeled and matched to ensure that both prediction accuracy and error rates are reliably measured and interpreted [27,28,29].

2.3. The Techniques That Researchers Have Followed to Date

Assignment of Predicted to Real Events:

(i) One common method for aligning predicted events with actual occurrences is prefix matching. In this approach, a predicted event is considered a match to a real event when there is a high degree of similarity across key characteristics, particularly time and location. This method typically assumes that each predicted event corresponds to a single real event that occurs at a specific point in time and space. For example, a predicted event scheduled for 23 September 2025 (time/t) in Heraklion (location/l) would only be matched to a real event that actually takes place on that exact date and in that specific location [30].

(ii) Optimized matching: when a prediction cannot perfectly match any real event, the predicted event is linked to the real event that most closely resembles it in terms of characteristic similarity, at the cost of some inaccuracy and a reduction in the technique’s overall precision. Characteristics are compared using Euclidean distance or another appropriate metric over the attributes of the two events, and the pair with the smallest distance is selected. For example, if the predicted event is “at 10 a.m. on 20 September 2025” (t), “Heraklion, Crete, Greece” (l), “flood” (s), and the two actual events are (1) “at 10 a.m. on 20 September 2025” (t), “Heraklion, Crete, Greece” (l), “snowfall” (s), and (2) “20 September 2025” (t), “Heraklion, Crete, Greece” (l), “strike” (s), then the prediction is matched with the more similar of the two, namely the first one, which agrees with the prediction on the exact time as well as the location [13].
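A minimal sketch of optimized matching, using the example events given above, is shown below. The distance function is an assumption: it scales the time gap by 24 hours, scores location and semantics as 0 (equal) or 1 (different), and combines the three terms with a Euclidean norm. The second actual event carries only a date, so midnight is assumed for its time.

```python
from datetime import datetime
from math import sqrt


def event_distance(pred, real, hours_scale=24.0):
    """Euclidean distance over three per-attribute mismatches, each in [0, 1]."""
    gap_hours = abs((pred["t"] - real["t"]).total_seconds()) / 3600.0
    t_gap = min(gap_hours / hours_scale, 1.0)
    l_gap = 0.0 if pred["l"] == real["l"] else 1.0
    s_gap = 0.0 if pred["s"] == real["s"] else 1.0
    return sqrt(t_gap ** 2 + l_gap ** 2 + s_gap ** 2)


def best_match(pred, real_events):
    """Return the real event with the smallest distance to the prediction."""
    return min(real_events, key=lambda real: event_distance(pred, real))


place = "Heraklion, Crete, Greece"
predicted = {"t": datetime(2025, 9, 20, 10, 0), "l": place, "s": "flood"}
actual_1 = {"t": datetime(2025, 9, 20, 10, 0), "l": place, "s": "snowfall"}
actual_2 = {"t": datetime(2025, 9, 20, 0, 0), "l": place, "s": "strike"}  # date only

match = best_match(predicted, [actual_1, actual_2])  # selects actual_1
```

Under this metric both candidates differ from the prediction in semantics, so the exact agreement in time is what makes the first event the closer one.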

In practice, researchers today commonly adopt the optimized matching technique to pair predicted and actual events; the resulting pairs are then compared to evaluate the quality of the event prediction method.

The effectiveness of a prediction technique is assessed using two key indicators: (1) goodness of matching, which measures the percentage of predicted events that have been successfully matched with actual events [31], and (2) qualitative correspondence, which assesses how closely each predicted event aligns with its corresponding actual event among the established pairs [32].
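The two indicators can be made concrete with a short sketch. It assumes that each matched pair has already been assigned a distance in [0, 1] and that closeness is scored as one minus that distance; both the scoring and the function name are illustrative choices, not a metric prescribed by the cited works.

```python
def matching_indicators(pair_distances, n_predicted):
    """Return (goodness of matching, qualitative correspondence).

    pair_distances: distances in [0, 1], one per predicted event that was matched.
    n_predicted: total number of predicted events, matched or not.
    """
    if n_predicted == 0:
        return 0.0, 0.0
    # (1) Share of predicted events that found a real counterpart.
    goodness = len(pair_distances) / n_predicted
    # (2) Average closeness of the matched pairs (assumed: 1 - distance).
    if pair_distances:
        correspondence = sum(1.0 - d for d in pair_distances) / len(pair_distances)
    else:
        correspondence = 0.0
    return goodness, correspondence


# Four predictions, three of which were matched with the given distances.
goodness, correspondence = matching_indicators([0.0, 0.5, 0.25], n_predicted=4)
```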

2.4. Event Forecasting Techniques

Event forecasting techniques can be categorized into distinct types and subtypes, depending on the nature of the output produced by the forecasting method. This output is typically defined by three key dimensions: time, location, and the semantic nature of the predicted event. Forecasting methods are further classified according to their primary objective—whether the aim is to predict the timing, location, nature of the event, or some combination of these factors. The corresponding techniques applied in each case are presented in the following subsections.

2.4.1. Time Forecasting Techniques

Time forecasting techniques aim to determine the precise moment at which an event will occur [11]. These techniques are categorized as follows:

Event Forecasting: This method focuses on determining whether a specific event will occur within a given time frame. If the event is predicted to occur, it is labeled as a positive class; if not, it falls under the negative class. This approach effectively constitutes a binary classification of future events.
Anomaly Detection: This technique involves identifying anomalies in historical data to learn the characteristics of typical, or “normal,” patterns—those under which an event is not expected to occur. The distance of a new event’s data from these normal patterns is then measured; a significant deviation suggests the potential occurrence of a future event [33].
Discrete Time Prediction: In addition to predicting whether an event will occur, this approach aims to estimate the approximate time of occurrence [34]. Time is initially segmented into discrete intervals (e.g., hours, days, months), and the goal is to identify the interval during which the event is most likely to happen. These methods are further divided into the following approaches:
○ Direct approaches, which estimate the specific interval or ordinal scale (e.g., immediate, short-term, or long-term future) during which the event may occur. This is typically achieved using regression or ordinal regression techniques to determine either exact time boundaries or ranked time categories.
○ Indirect approaches, which first align the input data temporally and then apply autoregressive models to historical time series in order to forecast future time series. Once these future sequences are predicted, the presence of events is detected using methods such as burst detection, change detection, or supervised learning. In supervised techniques, researchers infer future event patterns based on historical observations, with or without labeled data. If no time series is available, labeled training data can still be used to extract the relevant predictive patterns.
Continuous Time Prediction: This method addresses the challenges of forecasting events on a continuous time scale. The primary difficulty lies in achieving the required time resolution, which often demands extremely high computational power. Moreover, the process is highly sensitive to the precision of time prediction, making it difficult and time-consuming, particularly during model training and synchronization phases [35].
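The anomaly detection item in the list above can be illustrated with a simple z-score sketch: the "normal" pattern is summarized by the mean and standard deviation of historical observations, and a new observation is flagged when it lies far from that pattern. The threshold of three standard deviations and the sample values are assumptions made for illustration.

```python
from statistics import mean, stdev


def anomaly_score(normal_history, new_value):
    """Distance of a new observation from the normal pattern, in standard deviations.

    normal_history must contain at least two observations.
    """
    mu = mean(normal_history)
    sigma = stdev(normal_history)
    if sigma == 0:
        return 0.0 if new_value == mu else float("inf")
    return abs(new_value - mu) / sigma


def event_expected(normal_history, new_value, threshold=3.0):
    """Flag a potential future event when the deviation exceeds the threshold."""
    return anomaly_score(normal_history, new_value) > threshold


# Hypothetical hourly counts observed under normal conditions.
history = [10, 12, 11, 9, 10, 11, 10, 12]
```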

To mitigate these challenges, researchers have proposed simplified modeling approaches, including the following:

Simple Regression [36];
Point Process Models [37].
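As a small worked example of a point process model, the sketch below assumes the simplest case, a homogeneous Poisson process. The rate is estimated as the number of past events divided by the length of the observation window, and the probability of at least one event within a future horizon h is 1 − exp(−λh). The event times are hypothetical, and the models cited above may be considerably richer than this.

```python
from math import exp


def poisson_rate(event_times, window_length):
    """Maximum-likelihood rate of a homogeneous Poisson process (events per unit time)."""
    return len(event_times) / window_length


def prob_event_within(rate, horizon):
    """Probability of at least one event in the next `horizon` time units."""
    return 1.0 - exp(-rate * horizon)


# Hypothetical event days observed over a 50-day window.
past_event_days = [3, 10, 18, 25, 40]
rate = poisson_rate(past_event_days, window_length=50)   # 0.1 events per day
p_week = prob_event_within(rate, horizon=7)              # roughly 0.50
```

Because the prediction is a continuous function of the horizon, no discretization of time into intervals is needed.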

2.4.2. Location Prediction Techniques

Location prediction techniques aim to determine the geographic position where an event is likely to occur. The predicted location can be represented in two primary forms: raster-based or point-based.

When the event is expected to occur over a broad spatial extent—such as a general region rather than a specific coordinate—the output is typically represented as a raster. A raster is a spatial grid composed of individual cells, each corresponding to a portion of the target area. This format is particularly useful for forecasting events with diffuse or large-scale spatial characteristics.

Conversely, when the event is localized and confined to a very small area—such as at discrete points or network nodes—the forecast output is defined as a point. This point-based representation is more appropriate for predicting events with precise geographic locations [38].
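The difference between the two output forms can be sketched in a few lines: a point-based forecast keeps the coordinates themselves, whereas a raster-based forecast reports the grid cell that contains them. The grid origin and the cell size of 0.1 degrees below are hypothetical values chosen for illustration.

```python
def to_raster_cell(lat, lon, south, west, cell_deg):
    """Map a point to the (row, col) of a regular grid anchored at its south-west corner."""
    row = int((lat - south) // cell_deg)
    col = int((lon - west) // cell_deg)
    return row, col


# Point-based output: the coordinates themselves (approximate location of Heraklion).
point_forecast = (35.3387, 25.1442)

# Raster-based output: the cell of a hypothetical 0.1-degree grid containing the point.
cell_forecast = to_raster_cell(*point_forecast, south=35.0, west=25.0, cell_deg=0.1)
```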

2.4.3. Semantic Prediction Techniques

Semantic prediction focuses not on determining when or where an event will occur but rather on forecasting its description, subject, or other semantic attributes. In these approaches, the goal is to predict the nature or category of an event—its meaning—rather than its temporal or spatial dimensions.

Unlike time or location prediction methods that often rely on numerical input, semantic prediction methods may utilize a variety of data formats, including symbolic representations and natural language text. As a result, the choice of technique is closely tied to the nature of the input data.

Three primary data types are commonly used in semantic prediction:

Rule-based data, where prediction is driven by association mining or the identification of logical patterns derived from historical data. These rules capture relationships that help anticipate future events based on past occurrences.
Sequential data, in which events are assumed to follow a temporal chain or order. By analyzing these sequences, it becomes possible to predict future events by extending the logical progression of prior occurrences.
Graph-based data, which build on sequential modeling by representing event relationships as graphs. This approach captures complex dependencies and interconnections among events by modeling them as nodes and edges within a structured graph [39].
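For the sequential case, a minimal sketch is a first-order transition model: count how often one event type follows another in historical chains and predict the most frequent successor. The event sequences are invented for illustration, and the choice of a first-order model is an assumption; the transition counts can also be read as weighted edges of a simple event graph.

```python
from collections import Counter, defaultdict


def fit_transitions(sequences):
    """Count how often each event type is directly followed by each other type."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for current, following in zip(seq, seq[1:]):
            counts[current][following] += 1
    return counts


def predict_next(counts, current):
    """Return the most frequent successor of `current`, or None if it was never seen."""
    if current not in counts:
        return None
    return counts[current].most_common(1)[0][0]


# Hypothetical historical event chains.
chains = [
    ["heavy rain", "flood", "road closure"],
    ["heavy rain", "flood", "evacuation"],
    ["heavy rain", "traffic jam"],
    ["flood", "road closure"],
]
transitions = fit_transitions(chains)
```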

2.4.4. Multifaceted Prediction Techniques

Multifaceted prediction refers to forecasting approaches that simultaneously consider multiple dimensions of an event—specifically, its time, location, and semantic content. These methods aim to provide a more comprehensive understanding of future events by integrating all relevant aspects of their occurrence.

There are three primary approaches within this framework, based on how these dimensions are weighted and combined:

One approach treats time and semantics as equally significant predictive factors.
Another approach emphasizes time and location.
The most comprehensive approach considers time, location, and semantics together, offering a fully integrated prediction model [40,41,42].

3. Fields in Which Event Prediction Techniques Are Applied

Event prediction techniques are now applied across a wide range of domains. In healthcare, they are used to forecast the onset and spread of diseases, particularly in the context of epidemics. In multimedia analysis, data from video, audio, or text sources are employed to predict actions in sports events or to anticipate future news developments. In the field of human mobility and transportation, these techniques help forecast both individual and group movements [43,44].

Additionally, event prediction has demonstrated high accuracy in political forecasting, including the prediction of social unrest and conflicts, primarily by leveraging data from social media platforms. In environmental applications, it is used to anticipate natural disasters, such as floods or earthquakes. In the business sector, prediction models are utilized to identify potential bankruptcies and to forecast consumer purchasing behavior within specific populations [45]. Moreover, these techniques are applied in security and public safety contexts to predict delinquent behavior, such as robbery, crime, or even terrorist attacks [27,46].

4. Event Prediction Problem

An event is defined as an occurrence that takes place at a specific time and location and is characterized by a distinct semantic identity—for example, a traffic accident. Formally, an event can be represented as:

y = (t, l, s)

where t denotes the time of the event, l represents the location, and s describes the semantic nature or type of the event. The location l can be specified at various levels of granularity: it may refer to a broad area, such as a neighborhood or city, or to an exact point defined by geographic coordinates. Similarly, the time parameter t can be expressed as either a precise timestamp or a broader time interval, for example, a 24-hour period. The semantic parameter s may encompass any descriptive characteristic that helps define the nature of the event. Using this modeling framework, an example event might be expressed as “1:00 p.m. on 12 May 2024” (t), “Heraklion, Crete, Greece” (l), “heavy rain” (s)—where the values represent time, location, and semantics, respectively. In an event prediction system, the inputs used to forecast such events are referred to as indicators (denoted as X). These indicators contain various types of information relevant to the potential occurrence of an event. However, not all inputs are equally useful; in addition to critical predictive features, some may include irrelevant or noisy information [47]. This relationship is typically formalized as:

X ⊆ T × L × F

In this context, let L represent the location, T the time, and F a set of features or information attributes that are unrelated to time or location. If we define the current moment as t_now and distinguish between past and future times, we can denote:

T⁻ ≡ {t | t ≤ t_now, t ∈ T} and T⁺ ≡ {t | t > t_now, t ∈ T}.

Thus, the event prediction problem can be formulated as follows. Given a set of event indicators:

X ⊆ T⁻ × L × F

and a corresponding set of historical event data Y₀:

Y₀ ⊆ T⁻ × L × S,

where S denotes the set of possible event semantics, the goal of event prediction is to derive a set of forecasted future events, denoted as Ŷ. This formulation defines prediction as the process of generating future event instances Ŷ based on prior indicator data and historical event records:

Ŷ ⊆ T⁺ × L × S,

such that every predicted future event ŷ = (t, l, s) ∈ Ŷ satisfies t > t_now.
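The formulation above can be illustrated with a minimal Python sketch. The `Event` class and the `split_by_now` helper are hypothetical names introduced only for this illustration, and apart from the "heavy rain" example taken from the text, the sample events are invented placeholders rather than data from this study.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Event:
    """An event y = (t, l, s): time, location, and semantic type."""
    t: datetime  # time of the event
    l: str       # location, at any level of granularity
    s: str       # semantic nature of the event


def split_by_now(events, t_now):
    """Partition events into past (t <= t_now) and future (t > t_now)."""
    past = [e for e in events if e.t <= t_now]
    future = [e for e in events if e.t > t_now]
    return past, future


t_now = datetime(2024, 5, 12, 13, 0)
place = "Heraklion, Crete, Greece"
events = [
    Event(datetime(2024, 5, 12, 13, 0), place, "heavy rain"),       # example from the text
    Event(datetime(2024, 5, 12, 2, 0), place, "traffic accident"),  # invented historical record
    Event(datetime(2024, 5, 13, 1, 0), place, "traffic accident"),  # invented forecast
]

# y0 plays the role of the historical events, y_hat that of the forecasted ones.
y0, y_hat = split_by_now(events, t_now)
```

The sketch only shows the data model and the past/future split; how the forecasted events are actually produced depends on the prediction method.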

It is important to note that different prediction methods assign varying levels of emphasis to the three core parameters—time, location, and semantic nature—depending on the specific requirements of the forecasting problem. For instance, in modeling the progression of an individual’s illness, the location of the patient may be of minimal or no relevance, whereas the duration and severity of the illness are critical factors [48]. In contrast, when predicting the spread of an infectious disease, the location becomes a primary variable of concern, with time and semantics playing a comparatively lesser role.

As a result, the evaluation criteria used across forecasting methods differ significantly. This variation stems from the distinct weighting assigned to each parameter—time, place, and nature of the event—in alignment with the goals of the specific application domain.

To effectively represent both temporal and spatial aspects—along with a variety of additional characteristics that accompany forecasted events—researchers increasingly turn to categorical logic. This approach offers a more abstract and algebraic representation of knowledge, closely aligned with human reasoning, in contrast to traditional classical logic, which tends to be rigid and strictly defined [49].

5. Categorical Logic in the Service of Event Prediction

Categorical logic, as the name implies, is a form of logic developed within the framework of category theory. It is a branch of algebraic logic, offering a structured and abstract representation of logical reasoning. At its core, categorical logic captures the way humans intuitively approach and interpret the world, but within an algebraic formalism.

Much like algebraic logic encodes propositional logic—whether classical, intuitionistic, or otherwise—through structures such as Lindenbaum–Tarski algebras (e.g., Boolean algebras, Heyting algebras), categorical logic generalizes this concept to first-order and higher-order logics. These logics are encoded in categories equipped with additional structural properties, such as Boolean categories and Heyting categories. From a technical standpoint, categorical logic can be seen as a generalization of the algebraic encoding of propositional logic, extending it into more expressive logical systems.

Unlike propositional logic, categorical logic goes beyond atomic propositions: it expresses predicates over objects and relations between them, including functions. A key distinction lies in its incorporation of quantifiers: the existential quantifier (∃) and the universal quantifier (∀). The existential quantifier asserts the existence of an object satisfying a certain condition, while the universal quantifier asserts that a condition holds for all objects within the domain of discourse.

In the context of prediction modeling, categorical logic consists of facts that define the properties of objects, which are the central entities in a prediction problem. For example, in the context of traffic accident prediction, these objects might include the driver(X), the road(Y), time(T), the vehicle speed (Tax), and the rain condition (B). The logic also includes rules or predicates that express properties or relationships between these objects—such as heavy_rain(B), if B > “value”.

Categorical logic enables connections to other foundational concepts, such as intuitive reasoning, recursive functions, and completeness theorems for various logical systems. More than just a technical tool, categorical logic provides a framework that reveals fundamental properties and conceptual insights about the structures it encodes. Many results derived from categorical techniques carry meaningful philosophical implications, offering a deeper understanding of logic beyond classical formalisms.

While categorical logic is firmly grounded in mathematics, its abstraction, flexibility, and intuitive alignment with human reasoning make it a particularly powerful tool for modeling and predicting real-world events. Researchers increasingly apply it to describe and reason about future events, establishing their logical existence using mechanisms such as the Resolution algorithm [1,49,50,51].

The next section presents a case study of future event prediction—specifically, a traffic accident—first formulated in natural language and then translated into categorical logic, followed by a formal proof of the event’s existence using the Resolution algorithm.

6. Problem Statement

Today, traffic accidents remain a leading cause of injury and death worldwide, often resulting from human error, adverse weather conditions, or other environmental factors. In countries such as Greece, the situation is particularly severe, with a significant number of fatalities each year—many involving young drivers. Despite growing awareness and legislative efforts, the number of traffic-related deaths remains alarmingly high.

Computer Science, through advancements in predictive learning and event forecasting, offers promising tools to help address this critical public health issue. Predictive models can be developed to anticipate the likelihood of traffic accidents, thereby enabling timely interventions that may help prevent them altogether.

Although driving under the influence of alcohol, speeding, and other reckless behaviors are prohibited by law, these regulations are not always observed. This disregard for traffic laws endangers not only the drivers themselves but also others on the road. In Greece, the high rate of fatalities among young drivers under the influence of alcohol is one of the country’s most pressing safety concerns.

While authorities have implemented various countermeasures—including increased police patrols, stricter fines, license revocation, and public awareness campaigns delivered through schools, television, news media, and social networks—the problem persists. These efforts, though important, are not always sufficient to prevent tragic outcomes, especially when drivers are incapable of making rational decisions in critical moments.

In this context, science—particularly data-driven methods and intelligent systems—can play a vital role. By supporting decision-making in real time, predictive systems can act when the driver cannot, enhancing safety for all road users.

This paper proposes the development of a traffic accident prediction model designed not only to forecast the likelihood of an accident but also to actively encourage safer driving behavior. The system aims to warn drivers about imminent dangers and promote preventive action, ultimately contributing to the reduction of accidents and saving lives.

The problem statement is as follows:

“Consider a scenario (Problem 1) in which a driver X is operating a vehicle A on a roadway Y, traveling at a speed Tax at a given time T. The overall risk associated with the journey is influenced by several factors. One such factor is the configuration of the road itself, particularly whether it includes sharp turns, denoted as Ap_str, which increase the likelihood of losing control. Additionally, the specific route chosen by the driver may introduce varying degrees of difficulty or danger. Weather conditions also play a crucial role; for example, heavy rainfall can create slippery surfaces, significantly compromising vehicle stability and braking capability. The driver’s sobriety further affects safety, as heavy alcohol consumption can impair decision-making, reduce situational awareness, and slow reflexes—all of which are critical for safe driving. Moreover, if the driver is traveling at high speed (e.g., ≥100 km/h), the severity and probability of an accident increase substantially. This risk is compounded if the journey takes place at night, where reduced visibility due to darkness further impairs the driver’s ability to perceive hazards in time. Taken together, these conditions form a high-risk environment that can potentially result in a traffic accident (at).”

In this study, we aim to address the problem of traffic accident prediction by employing categorical logic and constructing a dedicated knowledge base [51] (see Section 6.1). We will demonstrate the existence of the predicted event—a traffic accident—through formal proof using the Resolution algorithm [52] (Section 6.2). Subsequently, we generate and analyze all possible scenarios that may arise from the initial problem by assigning different values to each of the variables that define the context, exploring every possible combination (Section 6.3).

In addition, we implement this predictive framework using the Prolog programming language (Section 6.4) and integrate it with a user interface via the Java-Prolog Library (JPL) (Section 6.5). To the best of our knowledge, this integrated and logic-based approach to traffic accident prediction has not been previously undertaken in the existing research.

6.1. Knowledge Base

A verbal description (text format) of the real-world driving conditions outlined in Problem 1 is provided below. These conditions are formally represented through a set of predicates within the corresponding knowledge base:

“A driver (driver(X)) is driving a car (car(A), driving(X,A)) on a road (road(Y)) under the influence of a large amount of alcohol (alcohol(Al), Al >= 1, where Al is the amount of alcohol consumed and the limit in the human body is 0.25%). The driver has bad driving behavior (bdb(X,Al)). The weather conditions are bad: it is raining heavily (rain(B), heavy_rain(B), B >= 50, where B is the amount of rain, here at least 50 mm/s), and the road is slippery (sl(Y,B)). On the route that the driver wishes to follow, there is a turn (turn(F)), and it is sharp (sharp_turn(As), where As = ‘yes’ means that the turn is sharp, so is(F,As) holds for the turn F and have(Y,F,As) states that the road Y has a sharp turn). It is late at night (time(T), night(T), with T >= 0 and T =< 5, i.e., between midnight and 5 a.m.), so it is dark and the driver’s visibility is reduced (rv(X,T)). Finally, the driver is running at high speed (speed(Tax), with Tax >= 100, i.e., at least 100 km/h, so running(X,Tax) holds). The limits of the values given to the variables are indicative and can be changed on a case-by-case basis to satisfy the conditions of another environment, depending on the problem that the user is called upon to face. The predicted event, which we wish to prove in this work, is that a traffic accident will occur (at(X,Y,A,F,As,Tax,B,Al,T))”.

Then, the extraction of facts and predicates is carried out.

The facts include the following:

1. driver(X) (some driver);

2. road(Y) (some road);

3. car(A) (some car);

4. speed(Tax) (some speed);

5. time(T) (at some time);

6. turn(F) (some turn);

7. sharp_turn(As) (is sharp);

8. alcohol(Al) (amount of alcohol);

9. rain(B) (some rain);

10. Tax>=100 (high speed);

11. B>=50 (large amount of rain);

12. As=‘yes’ (is sharp);

13. Al>=1 (large amount of alcohol);

14. T>=0, T=<5 (time interval between 0 and 5 in the morning).

The rules are as follows:

15. “Every driver X drives a car A” (rule formulation in natural language):

driver(X),car(A) → driving(X,A) (rule formulation in categorical logic).

Using De Morgan’s rule ¬(P^Q) ↔ ¬P˅¬Q and the equivalence P → Q ↔ ¬P˅Q:

¬driver(X)˅¬car(A)˅driving(X,A) (clausal form, i.e., a disjunction of literals, where “^” is the logical and, “˅” is the logical or, “→” is the implication, “↔” is the equivalence, and “¬” is the logical not);

\+driver(X);\+car(A);driving(X,A) (rule execution in Prolog, where “\+” is the logical not, “;” is the logical or, “,” is the logical and, and “:-” is the implication).

16. “All drivers drive at some speed, and if this speed is high, then the driver is running”:

driver(X),speed(Tax),Tax>=100 → running(X,Tax);

¬driver(X)˅¬speed(Tax)˅¬(Tax>=100)˅running(X,Tax);

\+driver(X);\+speed(Tax);\+(Tax>=100);running(X,Tax).

17. “Between 0 and 5 a.m., it is night”:

time(T),T>=0,T=<5 → night(T);

¬time(T)˅¬(T>=0^T=<5)˅night(T);

\+time(T);\+(T>=0,T=<5);night(T).

18. “A turn F is sharp”:

turn(F),sharp_turn(As),As=‘yes’ → is(F,As);

¬turn(F)˅¬sharp_turn(As)˅¬(As=‘yes’)˅is(F,As);

\+turn(F);\+sharp_turn(As);\+(As=‘yes’);is(F,As).

19. “A road Y has a turn F, and the turn is sharp (As=‘yes’)”:

road(Y),turn(F),is(F,As) → have(Y,F,As);

¬road(Y)˅¬turn(F)˅¬is(F,As)˅have(Y,F,As);

\+road(Y);\+turn(F);\+is(F,As);have(Y,F,As).

20. “A driver X has consumed a large amount of alcohol (Al>=1) and has bad driving behavior (bdb)”:

driver(X),alcohol(Al),Al>=1 → bdb(X,Al);

¬driver(X)˅¬alcohol(Al)˅¬(Al>=1)˅bdb(X,Al);

\+driver(X);\+alcohol(Al);\+(Al>=1);bdb(X,Al).

21. “There is rain (rain), and it is heavy (heavy_rain with B>=50)”:

rain(B),B>=50 → heavy_rain(B);

¬rain(B)˅¬(B>=50)˅heavy_rain(B);

\+rain(B);\+(B>=50);heavy_rain(B).

22. “There is reduced visibility (rv) due to night (night)”:

driver(X),night(T) → rv(X,T);

¬driver(X)˅¬night(T)˅rv(X,T);

\+driver(X);\+night(T);rv(X,T).

23. “There is slipperiness (sl) on the road Y due to heavy rain (heavy_rain)”:

road(Y),heavy_rain(B) → sl(Y,B);

¬road(Y)˅¬heavy_rain(B)˅sl(Y,B);

\+road(Y);\+heavy_rain(B);sl(Y,B).

24. “When a driver X is driving a car A with bad driving behavior (bdb) and there is a sharp turn (have(Y,F,As)) on the road Y, with slipperiness (sl) and reduced visibility (rv), and, at the same time, the driver is running (running) at high speed (Tax>=100), the result is a traffic accident (at())”:

driving(X,A),bdb(X,Al),have(Y,F,As),sl(Y,B),rv(X,T),running(X,Tax),Tax>=100,B>=50,Al>=1,T>=0,T=<5,As=‘yes’ → at(X,Y,A,F,As,Tax,B,Al,T);

¬driving(X,A)˅¬bdb(X,Al)˅¬have(Y,F,As)˅¬sl(Y,B)˅¬rv(X,T)˅¬running(X,Tax)˅¬(Tax>=100)˅¬(B>=50)˅¬(Al>=1)˅¬(T>=0^T=<5)˅¬(As=‘yes’)˅at(X,Y,A,F,As,Tax,B,Al,T);

\+driving(X,A);\+bdb(X,Al);\+have(Y,F,As);\+sl(Y,B);\+rv(X,T);\+running(X,Tax);\+(Tax>=100);\+(B>=50);\+(Al>=1);\+(T>=0,T=<5);\+(As=‘yes’);at(X,Y,A,F,As,Tax,B,Al,T).

25. “No accident will occur”:

¬at(X,Y,A,F,As,Tax,B,Al,T);

\+at(X,Y,A,F,As,Tax,B,Al,T).

(Based on this sentence, by reductio ad absurdum through the Resolution algorithm, we will be led to the conclusion that an accident will occur.)
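To make the chain of rules concrete, the following Python sketch evaluates the intermediate predicates for one set of ground values and then checks whether the accident rule fires. It is only an illustration under stated assumptions: the function name `derive`, the dictionary keys, and the sample values are hypothetical, and the thresholds are the indicative ones listed in the facts above. It is not the Prolog implementation presented later in the paper.

```python
def derive(speed, rain, sharp, alcohol, hour):
    """Forward-chain from ground values to the derived predicates.

    Thresholds are the indicative ones from the knowledge base:
    speed >= 100 km/h, rain >= 50, alcohol >= 1, 0 <= hour <= 5, sharp == 'yes'.
    """
    d = {}
    d["driving"] = True                # driver(X),car(A) -> driving(X,A)
    d["running"] = speed >= 100        # running(X,Tax)
    d["night"] = 0 <= hour <= 5        # night(T)
    d["is_sharp"] = sharp == "yes"     # is(F,As)
    d["have"] = d["is_sharp"]          # have(Y,F,As): road Y has the sharp turn F
    d["bdb"] = alcohol >= 1            # bdb(X,Al): bad driving behavior
    d["heavy_rain"] = rain >= 50       # heavy_rain(B)
    d["rv"] = d["night"]               # rv(X,T): reduced visibility at night
    d["sl"] = d["heavy_rain"]          # sl(Y,B): slippery road
    # The accident rule fires only if every premise holds.
    premises = ("driving", "bdb", "have", "sl", "rv", "running")
    d["at"] = all(d[k] for k in premises)
    return d


# Hypothetical scenario values, chosen only to satisfy every threshold.
risky = derive(speed=120, rain=60, sharp="yes", alcohol=1.2, hour=2)
# Same scenario at moderate speed: running(X,Tax) is not derived.
calmer = derive(speed=80, rain=60, sharp="yes", alcohol=1.2, hour=2)
```

Here `risky["at"]` is `True` and `calmer["at"]` is `False`. A `False` result only means that the accident rule does not fire for those values; the knowledge base states sufficient conditions for an accident, not necessary ones.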

Although the knowledge base is represented using categorical logic, we do not explicitly use quantifiers. This is because all variables in the predicates are implicitly universally quantified. As such, the need for explicit quantifiers is eliminated, either through variable substitution or because the logic inherently assumes universal applicability. Below, we present two representative examples.

Rule 15 is stated as follows:

“Every driver X drives a car A” (rule formulation in natural language).

In predicate logic, this is expressed as follows:

∀X∀A(driver(X),car(A) → driving(X,A)).

The variables X and A are both universal, so they are eliminated, resulting in the following:

driver(X),car(A) →driving(X,A)

With the equivalence P→Q ↔ ¬P˅Q, and using De Morgan’s rule ¬(P^Q) ↔ ¬P˅¬Q, we then have the following:

¬(driver(X)^car(A))˅driving(X,A)

or

¬driver(X) ˅ ¬car(A)˅driving(X,A).

And in an executable Prolog rule,

\+driver(X);\+car(A);driving(X,A).
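The rewriting of the rule into a disjunction of literals can be checked mechanically. The sketch below, with hypothetical function names, compares the implication form and the clause form over all truth assignments of the three ground propositions, treating the implication as a material implication defined directly by its truth table.

```python
from itertools import product


def implies(p, q):
    """Material implication: false only when p is true and q is false."""
    return q if p else True


def rule_form(driver, car, driving):
    # driver(X),car(A) -> driving(X,A)
    return implies(driver and car, driving)


def clause_form(driver, car, driving):
    # not driver(X) or not car(A) or driving(X,A)
    return (not driver) or (not car) or driving


def equivalent():
    """True if both forms agree on all eight truth assignments."""
    return all(
        rule_form(*values) == clause_form(*values)
        for values in product([False, True], repeat=3)
    )
```

Calling `equivalent()` returns `True`, which confirms that the two formulations of the rule are logically interchangeable for ground instances.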

Rule 16, by contrast, is stated as follows:

“All drivers drive at some speed, and if this speed is high, then the driver is running”, that is,

∀X∃Tax(driver(X),speed(Tax),Tax>=100 → running(X,Tax)).

For each X, there is some Tax; in functional form,

f(X)=Tax or X1=Tax.

Then, by substitution, we are led to the following relation:

driver(X),speed(X1),X1>=100 → running(X,X1),

which is equivalent to

driver(X),speed(Tax),Tax>=100 → running(X,Tax).

And based on the equivalence P → Q ↔ ¬P˅Q and De Morgan’s rule, we then have the following:

¬driver(X)˅¬speed(Tax)˅¬(Tax>=100)˅running(X,Tax);

\+driver(X);\+speed(Tax);\+(Tax>=100);running(X,Tax).

6.2. Resolution Algorithm

In mathematical logic and automated theorem proving, Resolution is a fundamental inference rule used to derive logical conclusions. It serves as the basis for a refutation-complete proof technique, applicable to both propositional logic and first-order logic, and hence to the categorical-logic knowledge base used here. For propositional logic, systematic application of the Resolution rule provides a decision procedure for determining the unsatisfiability of a formula, effectively solving the complement of the Boolean satisfiability problem. For first-order logic, unsatisfiability is only semi-decidable, so the procedure is guaranteed to terminate only when the formula is in fact unsatisfiable.

The Resolution algorithm is considered one of the most robust inference mechanisms in categorical logic. It operates by generating a new clause that is logically implied by two existing clauses containing complementary literals. A literal is an atomic predicate or the negation of an atomic predicate. Two literals are complementary when one is the negation of the other, for example, driver(X) and ¬driver(X) [52].

The Resolution algorithm (Figure 1) functions by combining clauses and systematically eliminating complementary literals, that is, predicates and their corresponding negations. This process introduces minimal changes to the original sentences and aims to test the satisfiability of the knowledge base. In this way, the algorithm provides a means to validate a hypothesized event by demonstrating that the assumption of the opposite event leads to a logical contradiction.

In the context of this study, the algorithm is applied as a proof by contradiction (refutation). We introduce into the knowledge base the negation of the predicted event, namely that a traffic accident will not occur. The algorithm then tests whether this negated assumption is logically consistent with the remaining facts and rules. If it leads to an unsatisfiable knowledge base, the contradiction confirms that the negation is false.

As a result, we are able to prove that a traffic accident will occur under the specified conditions, which are described using variables such as Al, B, As, and others. The resulting contradiction provides formal evidence that the prediction is valid.
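The refutation described above can be sketched in Python on a ground (variable-free) instance of the knowledge base. This is an illustration under stated assumptions, not the procedure of Figure 1: the atom names (for example `speed_ge_100`) are hypothetical shorthand for the ground facts and threshold tests, and the sketch uses unit resolution, a restricted strategy that suffices here because every clause has at most one positive literal (Horn clauses).

```python
def negate(lit):
    """Complement of a literal; '~' marks negation."""
    return lit[1:] if lit.startswith("~") else "~" + lit


def refute(clauses):
    """Return True if unit resolution derives the empty clause."""
    clauses = {frozenset(c) for c in clauses}
    changed = True
    while changed:
        changed = False
        units = [c for c in clauses if len(c) == 1]
        for unit in units:
            (lit,) = unit
            for c in list(clauses):
                if negate(lit) in c:
                    resolvent = c - {negate(lit)}  # drop the complementary literal
                    if not resolvent:
                        return True                # empty clause: contradiction
                    if resolvent not in clauses:
                        clauses.add(resolvent)
                        changed = True
    return False


# Ground facts, with threshold tests already evaluated for one scenario.
FACTS = [["driver"], ["car"], ["road"], ["turn"], ["speed_ge_100"],
         ["rain_ge_50"], ["as_yes"], ["al_ge_1"], ["t_0_to_5"]]

# Rules in clausal form (disjunctions of literals).
RULES = [
    ["~driver", "~car", "driving"],
    ["~driver", "~speed_ge_100", "running"],
    ["~t_0_to_5", "night"],
    ["~turn", "~as_yes", "is_sharp"],
    ["~road", "~turn", "~is_sharp", "have"],
    ["~driver", "~al_ge_1", "bdb"],
    ["~rain_ge_50", "heavy_rain"],
    ["~driver", "~night", "rv"],
    ["~road", "~heavy_rain", "sl"],
    ["~driving", "~bdb", "~have", "~sl", "~rv", "~running", "at"],
]

NEGATED_GOAL = [["~at"]]  # "No accident will occur"

accident_proved = refute(FACTS + RULES + NEGATED_GOAL)
```

With the negated goal added, `refute` derives the empty clause, so `accident_proved` is `True`. Without the negated goal, or with a premise such as `speed_ge_100` removed from the facts, no contradiction is found and `refute` returns `False`.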

6.3. All Possible Scenarios

The following scenarios are generated by assigning different values to the variables that define the problem (Figure 2):

1. The driver is drunk, there is heavy rain that causes slipperiness on the road, and it is dark because it is night; therefore, there is reduced visibility. On the route, there is a sharp turn, and the driver is running at a speed of over 100 km/h. Consequently, there is a high certainty of an accident (at()).