SARIMA Modelling Approach for Forecasting of Trafﬁc Accidents

: To achieve greater sustainability of the trafﬁc system, the trend of trafﬁc accidents in road trafﬁc was analysed. Injuries from trafﬁc accidents are among the leading factors in the suffering of people around the world. Injuries from road trafﬁc accidents are predicted to be the third leading factor contributing to human deaths. Road trafﬁc accidents have decreased in most countries during the last decade because of the Decade of Action for Road Safety 2011–2020. The main reasons behind the reduction of trafﬁc accidents are improvements in the construction of vehicles and roads, the training and education of drivers, and advances in medical technology and medical care. The primary objective of this paper is to investigate the pattern in the time series of trafﬁc accidents in the city of Belgrade. Time series have been analysed using exploratory data analysis to describe and understand the data, the method of regression and the Box–Jenkins seasonal autoregressive integrated moving average model (SARIMA). The study found that the time series has a pronounced seasonal character. The model presented in the paper has a mean absolute percentage error (MAPE) of 5.22% and can be seen as an indicator that the prognosis is acceptably accurate. The forecasting, in the context of number of a trafﬁc accidents, may be a strategy to achieve different goals such as trafﬁc safety campaigns, trafﬁc safety strategies and action plans to achieve the objectives deﬁned in trafﬁc safety strategies.


Introduction
Road transport is a preferred mode of transport due to its low cost and delivery time. General transport and licensing regulations are possible for each country [1]. Transport remains an important development factor in every country [2].
Transport is one of the four existential functions of every living space (work, housing, recreation, and transport), the aim of which is to combine other functions, with as many negative effects as possible. The main harmful consequences of transport today are: depletion of natural resources; pollution of the environment through noise, exhaust fumes and waste materials; traffic accidents (light and serious injury and deaths); property damage; losses; and costs. Active road safety refers to the prevention of road accidents, i.e., reducing the probability of accidents. Active road safety measures reduce the number of road accidents. Passive road safety refers to reducing the harmful consequences of road accidents that have occurred. Society has not always had the same road safety problems (in terms of type and magnitude). These problems did not have the same importance, were not treated in the same way and were not solved in the same way.

•
Ambition-To reduce mortality and the risk of serious injury to the level of the most successful countries in the European Union; • Mission-A stable and effective road safety system; • Vision-Road transport without casualties, with a significantly reduced number of injuries and significantly reduced costs of road accidents.
Accepting the recommendations of the United Nations expressed in the document "Global Plan of the Decade of Action for Traffic Safety 2011-2020" prepared by the World Health Organization in different parts of the world [4][5][6], the strategy identifies five key areas of work (five pillars) to achieve the desired state of traffic safety: • In the event of a road accident causing damage to property or injury to people, the driver must stop and inform the competent authority so that the necessary measures can be taken to care for the injured. In order to qualitatively manage road safety, data relevant to road safety and a well-developed database on road safety indicators are needed. The Road Safety Agency of the Republic of Serbia, as one of the most important institutions in the field of road safety, has developed an integrated database on road safety characteristics. In addition to the integrated database on road safety characteristics, all those interested in this field in the Republic of Serbia have access to the portal of publicly available data.
The data will be used to train road accident prediction models to estimate the number of accidents on a monthly basis. The aim of our study is to enable the development of a software tool that will help the municipal services of the City of Belgrade to take preventive measures before major problems arise. The paper uses historical data from the Road Safety Agency and the Ministry of Interior of the Republic of Serbia to predict the future number of road accidents on a monthly basis.
According to the Decade of Action for Road Traffic Safety 2011-2020, it was expected that the number of road traffic accidents would decrease. The main contribution of the study is to show the increase in the number of traffic accidents in the city of Belgrade for the period from 2016 to 2019. Each municipality in the Republic of Serbia allocates certain budgetary funds for strategic documents such as the Road Traffic Safety Strategy and Action Plan for a period of five years.
A particularly important document is the Action Plan, which for each pillar of road safety specifies not only the funds but also the period of the year and the stakeholders who must implement a particular measure. An important contribution of the study is the reference to the last quarter of the year when most traffic accidents occur in the city of Belgrade.
The main motive for the research is to investigate the possibility of applying one of the forecasting models to a time series of traffic accidents in one of the largest cities in the Western Balkans, such as Belgrade.
A particularly important motive was to investigate whether there is a statistically significant period of the year when coordinated activities should be undertaken by the traffic police and all stakeholders to reduce the number of traffic accidents. The main novelty of the study is that in the last fourth quarter of the year (October, November, and December), most road accidents occurred in the capital of the Republic of Serbia.
The article consists of seven sections. The introductory section focuses on the importance of the theme and justifies why this particular theme was chosen. This is followed by Section 2, which contains a review of the literature. Section 3 contains the materials that make up the data sample used for the study. Section 4 describes the method of data preparation and time series analysis and presents the analysis procedure. Section 5 describes the results of the model obtained. Section 6 then presents and discusses the research findings, while Section 7 presents the conclusions with guidelines for further research studies.

Literature Review
The statistical change of many phenomena over time is described by time series. The levels of time series are formed under the influence of a number of long-term and short-term factors as well as a number of random influences.
Due to changes in these influences, there are also fluctuations in the levels of time series that indicate changes in events over time. The annual number of persons killed and injured in road accidents is of great interest in most countries.
Real-time traffic accident detection is very important for drivers and passengers as well as for city services dealing with traffic. Social networks play an important role in the detection of traffic accidents and can help in the analysis [7].
In the UK, Scott [8] analysed the time series of monthly data of road traffic accidents for the period 1970-1978, and Broughton [9] analysed deaths in road traffic accidents for the period 1949-1989. In addition, Quddus [10] studied annual road traffic deaths in the UK between 1950 and 2005. In Sweden, the problem of predicting the number of road traffic deaths was addressed by Brüde [11] based on data from the period 1977-1991. Dadashova et al. [12] used the monthly data on the number of fatal accidents in Spain in the period 2000-2011.
There are many different statistical models used in the analysis of time series of road traffic data. There are two categories of forecasting approaches in transport and traffic: parametric and non-parametric. The main difference is the functional dependence between independent variables and dependent variable [13].
In addition to the ARIMA model, Ihueze and Onwurah [20] used the autoregressive integrated moving average with explanatory variables (ARIMAX). The San-gare et al. study [21] shows an approach to predicting road traffic accidents using analytical measures and hybrid machine learning. Almeida et al. [22] propose the Seasonal Auto Regressive Integrated Moving Average (SARIMA) model in their study of traffic flow characteristics.
Artificial neural network algorithms have also been proposed in Refs. [22][23][24][25][26][27][28] for the forecasting approach. The most commonly used algorithms are the Feed-Forward Neural Network (FFNN), the Long Short-Term Memory (LSTM), the Convolutional Neural Network (CNN) and a hybrid LSTM-CNN. Naqvi et al. [29] propose SARIMA in their study of the relationship between higher fuel prices and road accidents.
Katrakazas et al. [30] use SARIMA to investigate the impact of COVID-19 on collisions, fatalities and injuries in Greece. Time series trends of the safety effects of pavement resurfacing with SARIMA are shown in the study by Park et al. [31]. Analysis of road crash mortality based on time of occurrence is also performed with SARIMA in the work of Vipin and Rahul [32]. Roland et al. [33] used a multilayer perceptron (MLP) neural network model to predict where and when accidents will occur on a given day and time in the study area. Shannon and Fountas [34] extend the Heston model to predict the collision rate of motor vehicles. Time Series Regression (TSR) analysis and SARIMA has been applied to assess the relationship between the outcome variable, the number of persons killed due to road traffic accidents, and the variables quantifying the trend and seasonal effects [32,35]. Timeseries models include autoregressive integrated moving average (ARIMA), ARIMA with seasonal factors (SARIMA), SARIMA with exogenous explanatory variables (SARIMAX), and nonlinear auto-regression exogenous (NARX) [36]. SARIMA models have been able to adapt and make good predictions even in the presence of anomalies [22]. The SARIMAX model is based on the ARIMA model with the added capability to include seasonality and exogenous parameters [37].
In Ref. [38], several Seasonal Auto-Regressive Integrated Moving Averages with Exogenous factors (SARIMAX) models were used to analyse waste and recycling timeseries trends and their relationship with exogenous explanatory variables regarding waste production. In Ref. [39], ARIMA, βARMA, and KARMA models were compared to forecast the mortality rates due to occupational accidents in the southern region of Brazil. Ref. [40] proposed an ARFIMA-GARCH model for the long memory property in crash risk analysis.
The Autoregressive Integrated Moving Average (ARIMA) is one of the most commonly used forecasting methods for univariate time series data. An extension of ARIMA that supports direct modelling of the seasonal component of the series is called SARIMA or Seasonal ARIMA. In this paper, the SARIMA model is used for short-term forecasts.
When examining the data on the number of traffic accidents in the city of Belgrade, it was found that there is a pronounced seasonal unevenness, which is why the SARIMA model was chosen accordingly.
The main advantages of SARIMA are: • the model is deterministic and computationally easy; • the model has the advantage of requiring multiple model parameters to describe time series that exhibit non-stationarity both within and between seasons; • conventional ARIMA cannot capture seasonality and trend in data sets.

•
The main disadvantages of SARIMA are: • the model can only predict a short period of time; • the model can only extract linear relationships within the time series data.

Materials
Open government data initiatives are on the rise in many countries. The main goals of open access to data are to democratise data access and knowledge production [41]. Open data plays an important role in many applications and services, such as social innovation, policymaking, public opinion research and economic growth [42].
The paper by de Souza et al. [43] highlights the key benefits of Open Government such as transparency, participation and cooperation and uses the term Government 2.0.
The database of open-source data is maintained by government agencies [44]. According to Refs. [44,45], open data can be defined as data produced and funded with public money and made available to the public without restrictions. The data must comply with all privacy and confidentiality laws. For the research, sets of open-source data were adopted from the Republic of Serbia [46,47]. According to the Law on e-Government [48], open data are data that are available for reuse together with metadata in a machine-readable and open form.
According to Ref. [48], data may be re-used by individuals or legal entities for commercial and non-commercial purposes that are different from the original purpose for which they were created.
At the web address data.gov.rs there is the Open Data Portal, where open datasets of the state agencies of the Republic of Serbia are published. In the datasets, there are a number related to road traffic accidents within the topic of public safety. The use cases mention the research of the Data Science Serbia organisation, whose research was conducted on the dataset on traffic accidents in the city of Belgrade for one year (2015). When searching for the data sets on traffic accidents in road traffic, two types of data sets can be identified:

•
Data on traffic accidents for the area of the city of Belgrade [46]; • Data on traffic accidents by police administration and municipality [47].
For these data sets [46,47] each row represents information about a traffic accident. The first dataset [46] is available in files with the extension .ods (OpenDocument Spreadsheet) for the period from 2015 to 2019 (28 February 2019). In the 2015 dataset, complete data are available for the first 11 months, and only data for one road accident were given for December. As the data for 2015 is not complete, it has not been considered in the analysis. When data for 2019 is considered, only the first two months are available for analysis (January and February). The data for the period from 2016 to 2018 are complete.
For example, the columns in the first dataset for 2016 [46] are: • The second dataset [47] is available in files with the extension. xlsx for the period from 2015 to 2020. The extension.xlsx refers to Microsoft Excel, which has been used in this programme since 2007.
The 2015 dataset is not complete in either source [46,47] and was not used for the study. For example, the columns in the second dataset for 2016 [47] are: • Column H: Name-type of traffic accident: traffic accident with one vehicle, traffic accident with at least two vehicles-without turning, traffic accident with at least two vehicles-turning or crossing, traffic accident with parked vehicles, traffic accident with pedestrians; • Column I: Detailed description of the traffic accident: traffic accident with one vehicle: 11 types of cases, traffic accident with at least two vehicles-without turning: 9 types of cases, traffic accident with at least two vehicles-turning or crossing: 18 types of cases, traffic accident with parked vehicles: 5 types of cases, pedestrian accident: 25 types of cases.
The first dataset [46] and the second dataset [47] have the same columns, but the second dataset has two more columns (police administration and municipality). The number of case types (column G [46] and column I [47]) may differ from the dataset for the year under investigation. The research area is the city of Belgrade, the capital of the Republic of Serbia. The time frame of the research is the period from 1 January 2016 to 31 December 2019. For the problem studied in this paper, the SARIMA model was used. In this part of the paper, the basic theoretical assumptions of the said model are presented.
The second dataset [47] for the period from 2016 to 2019 was used for the study, with data available for all months of the year (Table 1). Open data were used in the work, representing the effort of the Republic of Serbia in partnership with the United Nations Development Programme to gain new values through research projects. Data on the number of traffic accidents by month from 2016 to 2018 (36 months) were used to build the model, while data on the number of traffic accidents by month from 2019 (12 months) were used to test the accuracy of the model.
Data on the number of road accidents by month from 2020 were not used due to the circumstances caused by the COVID-19 pandemic.
Limitations of the study are: • The analysis of road accidents included only columns with latitude and longitude from the data sets; • The time series was only examined on a monthly basis; • Only years with complete data were used; • The year 2020 was not included in the analysis due to conditions caused by the COVID-19 pandemic; • Data were available for 48 months, with a model developed based on 36 months and a model that was tested based on 12 months (2019); • Data are limited due to constraints imposed by public use, so data on gender, age, length of driving experience and other driver characteristics are not available, but neither are data on the vehicle(s) involved in the accident.

Methodology
The advantages of SARIMA lie in its well-known statistical properties and its effective modelling process. The SARIMA model is one of the most effective linear models for seasonal time series forecasting. Although SARIMA is not a new method, it has been applied in this work in an innovative way, which implies its application under new conditions. In this paper, SARIMA was implemented using common statistical software, such as the R programming language and RStudio. The main advantage of SARIMA processes is its ability to model time series with trends, seasonal patterns and short-term correlation with a small data set.
The following steps are followed in the application of SARIMA time series analysis [49]: • The non-seasonal and seasonal Auto Regressive (AR) polynomial term of order p and P, Equations (1) and (2): • The non-seasonal and seasonal Moving Average (MA) part of order q and Q, Equations (3) and (4): • Non-seasonal differencing operator is the of order d used to eliminate polynomial trends, Equation (5): • Seasonal differencing operator is the order of D used to eliminate seasonal patterns, Equation (6): Parameters φ and θ are the ordinary ARMA coefficients, Φ and Θ are the seasonal ARMA coefficients, B is the backshift operator, whose effect on a time series Y t can be summarized as Equation (7): Therefore, the generalized form of the SARI MA(p, d, q) × (P, D, Q) s model for a series Y t can be written as in Ref. [50], Equation (8): where s is the length of the periodicity (seasonality) and ε t is a white noise sequence.
The following notations were used in the formulas for obtaining the value of the forecasting errors: Accuracy of the model was computed through three measures: • Mean Absolute Error (MAE) or Mean Absolute Deviation (MAD), Equation (9): • Mean Absolute Percentage Error (MAPE), Equation (10): • Theil's U1-statistics is a measure of forecast accuracy (0 ≤ U ≤ 1; U = 0 means a perfect fit), Equation (11):

Results
In this section, the results of the prediction with the SARIMA method are presented. The R programming language was used for the analysis. The programming language R is a powerful modern computing environment for data manipulation, statistical calculations and visualisation [51].

Basic Data on the Time Series
U e y f n n n (11)

Results
In this section, the results of the prediction with the SARIMA method are presented. The R programming language was used for the analysis. The programming language R is a powerful modern computing environment for data manipulation, statistical calculations and visualisation [51].  The basic descriptive data on the time series can be seen from the box-plot diagram in Figure 2 below. The basic descriptive data on the time series can be seen from the box-plot diagram in Figure 2 below. The lowest number of road accidents occurs in February, when the average number of accidents is 1248 for the four-year period studied. Looking at the quarters during the year, the first and third quarters correspond to approximately 1393 and 1390 road accidents on a monthly basis. The most dangerous time of the year for drivers, passengers, pedestrians and other road users is certainly the fourth quarter, when an average of 1659 reported road accidents occur per month. The most unsafe month of the year is December, which logically belongs to the fourth quarter, when an average of 1717 traffic accidents

Development of the SARIMA Model
In order to use the SARIMA model, a certain data arrangement is required. Data from 2016 to 2018 were used to build the model, while data from 2019 are used for testing. The following figure shows the logarithmic values of the number of traffic accidents by month (Figure 3), and the next figure ( Figure 4) shows diff(log(values)). The following libraries of the R programming language were used for data analysis: MASS [52], tseries [53], forecast [54] and astsa [55].  According to Ref. [56], many variables are used in logarithms (logs) for forecasts and economic analyses. In the analysis of time series, this transformation is often used to stabilise the variance of a series. The application of the function auto.arima with seasonal influence in the programming language R (Table 2) showed that it is the best model ARIMA (0,1,2) × (1,1,0) 12 . In Table 2 is presented information criteria such as the Akaike information criterion (AIC),  According to Ref. [56], many variables are used in logarithms (logs) for forecasts and economic analyses. In the analysis of time series, this transformation is often used to stabilise the variance of a series. The application of the function auto.arima with seasonal influence in the programming language R (Table 2) showed that it is the best model ARIMA (0,1,2) × (1,1,0) 12 . In Table 2 is presented information criteria such as the Akaike information criterion (AIC), the Akaike information criterion with a correction for small sample sizes (AICc) and the Bayesian information criterion (BIC).  According to Ref. [56], many variables are used in logarithms (logs) for forecasts and economic analyses. In the analysis of time series, this transformation is often used to stabilise the variance of a series.
The application of the function auto.arima with seasonal influence in the programming language R (Table 2) showed that it is the best model ARIMA (0,1,2) × (1,1,0) 12 . In Table 2 is presented information criteria such as the Akaike information criterion (AIC), the Akaike information criterion with a correction for small sample sizes (AICc) and the Bayesian information criterion (BIC). In statistics, AIC is used to compare different possible models and determine which one is the best fit for the data. The basic principles for the use of AIC are:

•
A lower value indicates a simpler model compared with a model with a higher AIC; • It is a relative measure of model parsimony, i.e., it is only significant when we compare the AIC for alternative hypotheses (=different models of the data).
The information about the AICc value of the model (the lower case 'c' indicates that the value was calculated from the AIC test corrected for small sample sizes). The smaller the AIC value, the better the model fits.
The Bayesian Information Criterion (BIC) is a criterion for model selection from a finite set of models. It is based in part on the likelihood function and is closely related to the Akaike information criterion (AIC). As the complexity of the model increases, the BIC value increases and as the likelihood increases, the BIC value decreases (a lower value is better).
The test results of the SARIMA model can be found in Table 3, based on the actual values and the predicted values of 2019.
Based on Equations (9)-(11), the calculated prediction error indicators are calculated in Table 4. The mean absolute error, MAE = 77, is the average (absolute) difference between the actual value and the forecast value for 2019. The mean absolute percentage error (MAPE) is the mean or average of the absolute percentage errors of the forecasts.
The MAPE is a relative measure that expresses the errors as a percentage of the actual data. This is its greatest advantage, as it provides a simple and intuitive way to assess the magnitude or significance of errors. Theil's U1 statistic ranges from 0 to 1, with values closer to 0 indicating higher predictive accuracy.
The interpretation of typical MAPE values is shown in Table 5, according to Ref. [57].

<10
Highly accurate forecasting [10][11][12][13][14][15][16][17][18][19][20] Good forecasting 20-50 Reasonable forecasting >50 Inaccurate forecasting MAPE is commonly used because it is easy to interpret and explain. For example, a MAPE value of 5.22% means that the average difference between the forecasted value and the actual value is 5.22%. It measures this accuracy as a percentage which can be calculated as the average absolute percent error for each time period minus actual values divided by actual values.
The logarithmic values for the prediction of the number of road accidents by month with confidence intervals of 80% and 95% are shown in Figure 5.
Log transformation is one of the most popular data transformation techniques. It is mainly used to transform a skewed distribution into a normal/less skewed distribution. In other words, the log transformation reduces or eliminates the skewness of our original data.  Figure 6. Figure 6 shows the pattern of traffic accidents, which is repeated from year to year with small changes. It can be seen that after a lower number of accidents in the first quarter, the number of accidents increases in the second quarter. Log transformation is one of the most popular data transformation techniques. It is mainly used to transform a skewed distribution into a normal/less skewed distribution. In other words, the log transformation reduces or eliminates the skewness of our original data.  The values of the forecast of the number of traffic accidents by month for 2019 (dashed line) are shown in Figure 6. Figure 6 shows the pattern of traffic accidents, which is repeated from year to year with small changes. It can be seen that after a lower number of accidents in the first quarter, the number of accidents increases in the second quarter.   Figure 6 shows the pattern of traffic accidents, which is repeated from year to year with small changes. It can be seen that after a lower number of accidents in the first quarter, the number of accidents increases in the second quarter.
Due to the holiday season, the number of road accidents decreased in the third quarter. In the period studied (2016-2019), most traffic accidents occurred in the last month of the third quarter (September-the beginning of the school year for primary and secondary schools) and in the last quarter.

Discussion
According to Almeida et al. [22], the SARIMA method is used for short-term forecasts. In this work, 75% of the data is used for training (data from 2016-2018) and another 25% for testing (data from 2019).
The choice of the best SARIMA model is shown in Table 6. Table 6. Choice of best SARIMA model-Data (2016-2018).
The model identification is based on a comparison of the autocorrelation functions (ACF) of the partial autocorrelation (PACF) with the theoretical profiles of these functions.
Model identification is characterised by considerable subjectivity. To minimise subjectivity and improve the process of determining the ranks of the ARIMA process, some of the model selection criteria are used.
The best known are information criteria such as the Akaike Information Criterion (AIC) or the Bayesian Information Criterion (BIC) and the normalised version of the BIC.
All criteria are based on the assessment of the fit of nonlinear models, taking into account the number of model parameters.
They consist of the natural logarithm of the least square error and the penalty for the number of estimated parameters [58,59], Equations (12)- (14): In Equations (12)- (14), T represents the number of observations and k the number of parameters of the model (k = p + q + P + Q + 1). The number MSE is the mean square error. The result of the identification steps is represented by the corresponding model structure (p,d,q) × (P,D,Q) S . Model estimation is based on fitting the model selected in the previous step and determining the model parameters. This step is based on the non-linear least squares and maximum reliability methods.
Model validation, the final step of the Box-Jenkins method, involves analysing the stationarity, invertibility and redundancy of the model parameters [60]. If the residuals, i.e., the difference between the actual values and the values estimated by the model, are random, the model is satisfactory. Otherwise, it is necessary to repeat the process of model identification and estimation and find a better model.
The following figures (Figures 7 and 8) show the plot of the autocorrelation function (ACF) and the partial ACF (PACF).
Our aim now is to find a suitable SARIMA model based on the ACF and PACF shown in ln( ) 2   AIC T MSE k (12) ln( ) ln( )   BIC T MSE k T (13) In Equations (12)- (14), T represents the number of observations and k the number of parameters of the model (k = p + q + P + Q + 1). The number MSE is the mean square error. The result of the identification steps is represented by the corresponding model structure (p,d,q) × (P,D,Q) S . Model estimation is based on fitting the model selected in the previous step and determining the model parameters. This step is based on the non-linear least squares and maximum reliability methods.
Model validation, the final step of the Box-Jenkins method, involves analysing the stationarity, invertibility and redundancy of the model parameters [60]. If the residuals, i.e., the difference between the actual values and the values estimated by the model, are random, the model is satisfactory. Otherwise, it is necessary to repeat the process of model identification and estimation and find a better model.
The following figures (Figures 7 and 8) show the plot of the autocorrelation function (ACF) and the partial ACF (PACF).  The PACF show significant swings at lag 4 and 8, suggesting that some additional non-seasonal terms need to be included in the model. We also tried other models with AR terms, but none yielded a smaller AICc value. Therefore, we decided to use the SARI-MA (0,1,2) × (1,1,0) 12 model.
When all the data (2016-2019) were used to build the model, applying the auto.arima function resulted in the following model SARIMA(0,1,1) × (1,1,0) 12 . The results of the residual analysis are shown in Figure 9. Almost all spikes are now within the significance limits in the ACF diagram.  The PACF show significant swings at lag 4 and 8, suggesting that some additional non-seasonal terms need to be included in the model. We also tried other models with AR terms, but none yielded a smaller AICc value. Therefore, we decided to use the SARI-MA (0,1,2) × (1,1,0) 12 model.
When all the data (2016-2019) were used to build the model, applying the auto.arima function resulted in the following model SARIMA(0,1,1) × (1,1,0) 12 . The results of the residual analysis are shown in Figure 9. Almost all spikes are now within the significance limits in the ACF diagram. Our aim now is to find a suitable SARIMA model based on the ACF and PACF shown in The PACF show significant swings at lag 4 and 8, suggesting that some additional non-seasonal terms need to be included in the model. We also tried other models with AR terms, but none yielded a smaller AICc value. Therefore, we decided to use the SARI-MA (0,1,2) × (1,1,0) 12 model.
When all the data (2016-2019) were used to build the model, applying the auto.arima function resulted in the following model SARIMA(0,1,1) × (1,1,0) 12 . The results of the residual analysis are shown in Figure 9. Almost all spikes are now within the significance limits in the ACF diagram.  The standardised residual is determined by dividing the difference between the observed and expected values by the square root of the expected value. One type of residual that we often use to identify outliers in a model is the standardised residual. From the figures of the standardised residuals and the ACF of the residuals, we can see that there are no large outliers. The figure Normal Q-Q plot of Std residuals shows that only 5 of the 48 values are outside the confidence interval.
A significant p-value in the Ljung-Box statistical test rejects the null hypothesis that the time series is not autocorrelated. The figure of p-values for the Ljung-Box statistic shows that there is no basis for rejecting the hypothesis.
To show the robustness of the proposed model, a comparison is made with other known methods from Ref. [61]. The dataset from Ref. [61] corresponds to the monthly production of electrical equipment (computers, electronic and optical products) in the Eurozone (17 countries) in the period January 1996-March 2012. In the study from Ref. [61], it can be seen that the SARIMA method is in third place among the ten selected methods (with the lowest value) and the MAPE value is 0.09% higher compared with the first two methods. Ten selected methods were (in decreasing order for value MAPE): Naïve, Prophet, Seasonal Naïve, NNETAR, Exponential smoothing, ARIMA + Decomposition, Exponential smoothing + Decomposition, SARIMA, GARCH + Decomposition and TBATS. [61].

Conclusions
As in other studies [62][63][64], the number of traffic accidents was considered on a monthly basis. The study of spatio-temporal correlation of traffic accidents is addressed in papers [63,64] that are not included in our paper and represent one of the directions for further research. The main difference with the work studied is that this work clearly indicates that most traffic accidents occur in the last quarter of the year. If the spatiotemporal correlation was undertaken on the basis of individual zones and different years, then we would see the results of the work on road safety prevention. This paper presents an analysis of the time series that gave a clear picture of the existence of a pattern in the number of road accidents on a monthly basis for the period 2016-2019.
Predicting the number of traffic accidents during certain periods of the year is an important part of traffic management in a given area. Informing citizens about the status of road safety contributes significantly to understanding the problem, improving citizens' attitudes towards road safety and improving traffic behaviour. Assistance and support to other road safety sectors should contribute to their work and to the improvement of certain aspects of road safety.
Improving the monthly forecast of the number of traffic accidents can have several positive effects on the city and its citizens. For example, identifying problematic months of the year can lead the relevant authorities to take action to combat the problems through changes to the road safety strategy.
The results of the analysis of the time series and the SARIMA method show that for four consecutive years, most road accidents occurred in the fourth quarter. The road safety campaign is a system of activities whose general objective is to promote safer road use. The specific objectives of road safety campaigns relate to changing road knowledge, attitudes, skills and behaviour, all with the aim of improving road safety.
In this paper, a mathematical model for traffic accident prediction was proposed for the case of the city of Belgrade. The result of the model should be taken with caution, as not all road accidents are reported to the police, especially those involving minor property damage.
The results of applying the libraries of the R programming language show that the SARIMA (0,1,2) × (1,1,0) 12 model is the most suitable for modelling according to the open data from 2016 to 2018. If open data from 2016 to 2019 were used to build the model, then the best model is SARIMA (0,1,1) × (1,1,0) 12 . Future research will focus on developing forecasting models for other municipalities in the Republic of Serbia as well as for other countries.
The study of predicting the number of traffic accidents on a monthly basis has not yet been fully explored. In the future, we plan to test new algorithms to detect anomalies (holidays, specific routes, etc.). This research can be undertaken through a hybrid method composed of statistical and neural network approaches presented in numerous literature.
Modelling and prediction can be further improved to extract more desirable data on traffic accidents and reduce complexity.
As motorisation increases, so does the number of traffic accidents and all the negative consequences associated with them. Therefore, a constant review, analysis and evaluation of the existing situation is necessary. Statistically, there are about 47 road accidents with injured persons and about 130 road accidents with property damage for every road accident with fatalities, as shown in the available data for 2016. This paper has shown that meaningful results and conclusions can be obtained from open data from the database on traffic accidents. These results and conclusions can be used in determining preventive measures and campaigns for road safety in the city of Belgrade.
A possible measure and improvement in the management of open data on road accidents could be to increase the amount of information published on the open data portal. Of course, care should be taken to protect the personal data of road accident participants.
Some general data can be published in the open data from the investigation documentation, such as the meteorological conditions prevailing at the time of the accident or data on general visibility.
Among other things, it is necessary to provide information on the general condition of the vehicle when the vehicle is referred for a technical investigation (e.g., condition of the brake system).
A special category could refer to the driver and general information about them (e.g., age, profession, length of driving experience, whether he is a weekend driver, whether they were under the influence of alcohol or opiates). Finally, a column could be added to the note containing information that the investigating agencies consider important without being included in any of the previous columns.