Statistical Analysis for Long-Term Weather Forecast †

Abstract: A weather forecast is the result of applying science and technology to predict the conditions of the atmosphere at a selected location and time in the future. The main input is a collection of data (atmospheric, land, and ocean), and the resulting output is a meteorological forecast (how atmospheric conditions will change). People have been trying to predict the weather by observing nature for thousands of years, but in a scientific way only since the 19th century: first manually, based mostly on changes in barometric pressure, and later, in modern times, with the contribution of computer-based models (numerical weather prediction). Despite the high inaccuracy of numerical weather prediction beyond 10 days, interest in long-term weather forecasts is very high for social reasons (energy sector, civil protection, etc.), and the scientific effort is constant. Here, we propose a statistical weather model for long-term weather forecasts based on weather/climate data time series. We analyze atmospheric data at 850 hPa for a period of 35 years, resulting in temperature ensembles and temperature deviations for specific periods. Finally, we contrast the results of the statistical weather model (prediction) with the real data to check the accuracy of the model.


Introduction
A long-term weather forecast is the Holy Grail of meteorology. Supercomputers and numerical methods [1] are combined to improve weather forecasts beyond 10 days. Despite the advanced science and technology applied, results are poor due to fundamental obstacles in numerical forecasting [2]. Therefore, long-term weather forecasts are produced mostly from models based on statistical analysis.
Statistical analysis is the in-depth study of a problem using calculations, tables, and charts in order to produce statistical conclusions [3]. The presentation of the primary statistical material is called classification [4], with geometrical classification and time series classification as the most important subsets.
Time series classification (TSC) [5] is the tracking of the evolution of a variable over time. A time series is described by one or more of the following features: a long-term trend, a periodic trend, and irregular or random variations. Statistical analysis of a time series is recommended to describe any regularity that exists between successive values of a variable, with the purpose of predicting the future behavior of that series.
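The three features named above can be illustrated with a textbook additive decomposition, in which each value is treated as trend + periodic component + irregular remainder. The sketch below is not the model of this paper; it is a minimal moving-average decomposition applied to a synthetic series. The cycle length of 73 (the number of full 5-day periods in a 365-day year), the function names, and the synthetic data are assumptions made only for illustration.

```python
import math
import random
from statistics import mean


def centered_moving_average(values, window):
    """Smooth `values` with a centered moving average of odd length `window`.

    Positions that do not have a full window around them are set to None.
    """
    if window % 2 == 0:
        raise ValueError("window must be odd")
    half = window // 2
    smoothed = [None] * len(values)
    for i in range(half, len(values) - half):
        smoothed[i] = mean(values[i - half:i + half + 1])
    return smoothed


def decompose(values, period):
    """Additive split of a series: value = trend + periodic + irregular.

    `period` is the cycle length in samples and must be odd here, so that the
    moving-average window covers exactly one full cycle.
    """
    if len(values) < 2 * period:
        raise ValueError("need at least two full cycles")
    trend = centered_moving_average(values, period)

    # Group the detrended values by their position inside the cycle.
    buckets = [[] for _ in range(period)]
    for i, (value, level) in enumerate(zip(values, trend)):
        if level is not None:
            buckets[i % period].append(value - level)

    # Average each position and center the profile around zero.
    profile = [mean(bucket) for bucket in buckets]
    offset = mean(profile)
    profile = [p - offset for p in profile]

    periodic = [profile[i % period] for i in range(len(values))]
    irregular = [
        None if level is None else value - level - cycle
        for value, level, cycle in zip(values, trend, periodic)
    ]
    return trend, periodic, irregular


# Synthetic example (made-up numbers, not observations):
# slow warming + yearly cycle + random noise, sampled in 5-day periods.
PERIODS_PER_YEAR = 73  # assumption: 365 days / 5
rng = random.Random(0)
series = [
    0.01 * i
    + 5.0 * math.sin(2 * math.pi * i / PERIODS_PER_YEAR)
    + rng.gauss(0.0, 0.5)
    for i in range(PERIODS_PER_YEAR * 6)
]
trend, periodic, irregular = decompose(series, PERIODS_PER_YEAR)
```

The moving average spans exactly one cycle, so the periodic component cancels inside the window and only the slow trend remains; what is left after removing trend and cycle is the irregular part.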

Method
The method is based on a time-series analysis of temperature data at 850 hPa. The 850 hPa level is selected because it is not affected by surface conditions such as the Foehn effect [6], temperature inversion [7], or the urban heat island effect [8].

Ten major administrative regions of Greece (Thrace, Macedonia, Epirus, Thessaly, Ionian islands/West Greece, Central Greece/Attica/Euboea, Peloponnese, North Aegean, Crete, and South Aegean) are selected, representing the grids of the study. For each grid, a dataset of daily temperatures at 850 hPa for the last 35 years is downloaded from the Physical Sciences Laboratory, NOAA [9], and registered in a database.
Once the registration is completed, the database is simplified into 5-day periods and interfaced with the formula of the model.
The formula is a machine learning algorithm based on climate and astronomical cycles (El Niño, La Niña, and solar cycles) and on statistical weights, sums, and probabilities. It is described by four main processes:
1. Calculation of average temperatures in periods of 35, 22, 9, and 7 years (A35, A22, A9, and A7 values) (Figure 1);
For each season, the model produces 18 prediction values (6 per month), resulting in a temperature ensemble for each studied region (Figure 4).
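The simplification into 5-day periods and the first process (the A35, A22, A9, and A7 averages) can be sketched as follows. This is only one plausible reading, not the author's formula: it assumes that each A value is the mean of the same 5-day period over the most recent 35, 22, 9, or 7 years. The function names and the `pentads_by_year` layout are hypothetical.

```python
from statistics import mean

PERIOD_DAYS = 5
WINDOWS = (35, 22, 9, 7)  # window lengths in years, as listed in the text


def to_five_day_means(daily_temps):
    """Collapse one year of daily temperatures into consecutive 5-day means.

    A trailing block with fewer than 5 days is dropped (an assumption).
    """
    full_blocks = len(daily_temps) // PERIOD_DAYS
    return [
        mean(daily_temps[i * PERIOD_DAYS:(i + 1) * PERIOD_DAYS])
        for i in range(full_blocks)
    ]


def window_averages(pentads_by_year, period_index, windows=WINDOWS):
    """Average one 5-day period over the most recent n years, for each n.

    pentads_by_year: dict mapping year -> list of 5-day means
    (hypothetical layout). Returns e.g. {"A35": ..., "A22": ..., ...}.
    """
    years = sorted(pentads_by_year)
    averages = {}
    for n in windows:
        if len(years) < n:
            raise ValueError(f"need at least {n} years of data")
        recent = years[-n:]
        averages[f"A{n}"] = mean(
            pentads_by_year[year][period_index] for year in recent
        )
    return averages
```

With 5-day periods, a 30-day month holds 6 periods, which matches the 6 prediction values per month (18 per season) mentioned above.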
Predictions proceed into a second-level analysis resulting in temperature deviations (Figure 5), and they are plotted in weather maps (Figure 6).
The total amount of information (temperature ensembles, temperature deviations, and weather maps) is combined for translating data into text. Short paragraphs are published with reference to possible dates with significant changes in weather circulation.

Figure 6. Plotting of deviation (5-day periods). Gradient color scale from -30 degrees Celsius (negative deviation: deep blue to light blue) to +30 degrees Celsius (positive deviation: light red to deep red) [10].

Results
We examine the accuracy of the model for the period from 1 December 2021 to 30 November 2022 in terms of deviation trend, separated into four (4) seasons and with respect to the grid of Attica.
We start with the registration of the predicted deviation values and the real deviation values, the latter downloaded from the Physical Sciences Laboratory, NOAA [9]. Once the dataset is built, the values are transformed into binary form: 1 for positive values and 0 for negative values. The predicted and real binary values are then compared: identical values result in 1, and non-identical values in 0 (Figure 7).
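The transformation and comparison described above can be sketched in a few lines. The function names are hypothetical and the sample values are made up for illustration, not taken from the study. The text does not say how a deviation of exactly zero is treated; here it is assumed to map to 0. The share of matches is shown as one simple summary and is not necessarily the coefficient used in the paper.

```python
def to_binary(deviation):
    """Return 1 for a positive deviation and 0 otherwise.

    Treating an exact zero as 0 is an assumption; the text does not cover it.
    """
    return 1 if deviation > 0 else 0


def agreement_flags(predicted, observed):
    """Return 1 where predicted and observed deviations share a sign, else 0."""
    if len(predicted) != len(observed):
        raise ValueError("series must have the same length")
    return [
        1 if to_binary(p) == to_binary(o) else 0
        for p, o in zip(predicted, observed)
    ]


def hit_rate(predicted, observed):
    """Share of periods in which the predicted sign matches the observed sign."""
    flags = agreement_flags(predicted, observed)
    return sum(flags) / len(flags)


# Made-up deviations for six 5-day periods (degrees Celsius).
predicted_example = [1.5, -0.5, 2.0, -1.0, 0.3, -2.2]
observed_example = [0.8, 0.4, 1.1, -3.0, -0.2, -0.1]

flags = agreement_flags(predicted_example, observed_example)
score = hit_rate(predicted_example, observed_example)
```

This check rewards only the direction of the deviation (warmer or colder than the reference), not its size, which is consistent with evaluating the model "in terms of deviation trend".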

Attica
The accuracy of each season is determined from the resulting coefficient. Among the twelve months, three achieve high scores (February, March, October), five achieve higher-than-average scores (December, January, April, May, September), two achieve average scores (July and August), and two achieve low scores (June and November).
The above results show a weakness in the algorithm during the summer period, as well as in November. In contrast, from December to May (as well as in September), the results show strong algorithmic performance, while October achieves an exceptional score.
The poor performance in the summer period is under investigation. A mismatch in the order of the D and A values in the formula has been detected and corrected, and the correction will be reevaluated in the next seasonal prediction (published on 19 May 2023) [11].
Further research is required, involving more seasons/years of predictions, both to detect specific malfunctions of the algorithm and to confirm the good results. Additionally, the method should be applied to more grids, and globally if possible, to ensure general applicability (ongoing). A final and more complicated step is the application of the method to temperature data at 500 hPa and the combination of the resulting values at 850 hPa and 500 hPa for a more accurate prediction.