Article

Clustering of Civil Aviation Occurrences in Brazil: Operational Patterns and Critical Contexts

by Felipe Duarte Santana 1, Daniel Alberto Pamplona 1,2,*, Mateus Habermann 1, Lila Kacedan 1 and Marcelo Xavier Guterres 2

1 Graduate Program in Operational Applications, Aeronautical Institute of Technology, São José dos Campos 12228-612, SP, Brazil
2 Department of Air Transportation, Aeronautical Institute of Technology, São José dos Campos 12228-612, SP, Brazil
* Author to whom correspondence should be addressed.
Future Transp. 2025, 5(4), 185; https://doi.org/10.3390/futuretransp5040185
Submission received: 7 October 2025 / Revised: 21 November 2025 / Accepted: 27 November 2025 / Published: 2 December 2025

Abstract

This study applied clustering algorithms to reveal latent structures in 9791 Brazilian civil aviation occurrences recorded from 2007 to 2023. We tested K-means, hierarchical clustering, and K-medoids, using aircraft type, flight phase, and severity as variables in different configurations. The K-medoids method with Manhattan distance produced the best separation. It formed clusters that isolated accidents involving helicopters, ultralights, and critical phases such as takeoff and landing. It also highlighted a specific group of specialized operations. Results confirm that occurrences with similar operational profiles tend to group together, which may help prioritize investigation and prevention actions. The analysis also shows that combining different types of aviation in the same dataset reduces specificity, as heterogeneous operations are mixed. Even so, the findings provide a first overview of safety dynamics in Brazilian civil aviation. The study concludes that clustering can expose latent structures not detected by traditional descriptive analyses and may support the development of more targeted safety policies.

1. Introduction

Air transport is important for the economic and social development of a country. A robust civil aviation system is also a sign of growth and integration. In 2024, the Brazilian air transport market increased by 5% in number of passengers and 1.5% in number of departures [1]. At the global level, the International Air Transport Association (IATA) projected 5.2 billion passengers and a 4.4% revenue growth for airlines in 2025 [2].
Despite this expansion, aviation safety remains a constant concern. The year 2024 registered the highest number of fatalities in the past six years, with 296 fatalities and a fatality rate of 65 per billion passengers [3]. In Brazil, in the same year, an ATR 72-500 crashed and everyone on board died. This accident heightened public concern about risk and safety. Accident investigation in Brazil is conducted by the Brazilian Aeronautical Accidents Investigation and Prevention Center (CENIPA). The agency focuses on identifying contributing factors and on preventing recurrence.
Research on aviation safety has developed over decades. Early works focused on quantifying risks through frequency and severity analysis [4]. Traditional models, such as Reason’s Swiss Cheese model [5], the Human Factors Analysis and Classification System (HFACS) [6], and Leveson’s Systems-Theoretic Accident Model and Processes (STAMP) [7], have been widely applied in investigations. These models allowed for systemic analysis of human error, organizational factors, and technical failures. Comparative studies showed that HFACS is strong for classifying human error, while STAMP and AcciMap capture broader system interactions [8,9]. However, these methods are mainly descriptive and retrospective, limiting their ability to deal with heterogeneous datasets or to identify new risks in real time [10].
To complement these approaches, recent studies have applied clustering methods to aviation safety data. Čokorilo et al. [11] used clustering on accidents from 1985–2010 and showed that groups of technical and contextual factors could be distinguished. Zhao et al. [12,13] proposed adaptive clustering methods to detect anomalies in flight data, aiming to identify failures early. Kharoufah et al. [14] examined human factors in more than 200 occurrences and confirmed the prevalence of situational awareness issues. Passarella et al. [15] applied K-means to commercial aviation accidents and identified profiles of high-severity events linked to adverse weather. More recently, Jasra et al. [16] showed that K-medoids provides more interpretable clusters in large flight monitoring datasets. Other applications in Indonesia [17] also demonstrated that hierarchical clustering could identify accident profiles linked to pilot experience and weather conditions.
These contributions confirm that clustering can detect latent structures in safety data that are not visible with descriptive statistics. They also highlight that data-driven techniques may complement the classical frameworks by capturing heterogeneous operational contexts.
The aim of this study is to apply clustering algorithms to Brazilian civil aviation occurrences recorded between 2007 and 2023. The analysis focuses on aircraft type, flight phase, and severity as main variables. We tested K-means, hierarchical clustering, and K-medoids and evaluated their capacity to identify latent operational patterns. The goal is to show how clustering methods can complement traditional descriptive analyses and provide objective support for investigation and prevention in aviation safety.
While clustering has been applied to aviation safety in international studies, its use with Brazilian occurrence data remains limited. This study applies clustering methods to the official CENIPA database, covering civil aviation occurrences over more than 15 years, thus linking data-driven techniques to the regulatory and preventive practices of Brazilian authorities.
The remainder of this article is organized as follows. Section 2 describes the materials and methods, including the dataset and clustering techniques applied. Section 3 presents the results. Section 4 discusses the main findings in light of previous research. Finally, Section 5 summarizes the conclusions of the study.

2. Materials and Methods

2.1. Dataset

The dataset used in this study comes from CENIPA. It includes civil aviation occurrences reported in Brazilian territory from 2007 to 2023. The data include information such as type of aircraft, number of fatalities, location, date, time, and other variables used in accident and incident investigation.
According to [18], an accident is an occurrence related to aircraft operation in which a person is fatally or seriously injured, the aircraft sustains significant structural damage affecting its performance, or the aircraft becomes missing or completely inaccessible. A serious incident refers to an occurrence during aircraft operation that indicates a high probability of an accident but does not result in fatal or serious injuries. The difference between an accident and a serious incident lies only in the outcome. An incident is any occurrence, other than an accident, associated with aircraft operation that affects or could affect flight safety.
After downloading five spreadsheets, we merged them into a single dataframe, removing redundant columns. These data are publicly available in the Brazilian Open Data Portal under the CENIPA section [19].
Variables unrelated to the study were excluded, resulting in 22 variables, most of them categorical, as shown in Table 1.
The initial merge created 33,519 rows due to multiple values per occurrence. These were aggregated by occurrence code. After cleaning, the final dataset contained 9791 occurrences. Data types were corrected, categorical classes standardized, and missing or inconsistent entries coded as “Not Informed”.
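The aggregation step described above can be illustrated with a minimal sketch. The field names (`occurrence_code`, `aircraft_type`, `factor`) and the rule of joining distinct values with a semicolon are assumptions made for illustration; the study itself was carried out in R and does not specify the exact aggregation rule.

```python
from collections import defaultdict


def aggregate_by_occurrence(rows, key="occurrence_code"):
    """Collapse repeated rows into one record per occurrence code.

    Assumed rule (not from the paper): distinct values of a field are
    joined with "; ", and empty or missing entries become "Not Informed".
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for row in rows:
        code = row[key]
        for field, value in row.items():
            if field == key:
                continue
            # Code missing or empty entries explicitly.
            value = value if value not in (None, "") else "Not Informed"
            if value not in grouped[code][field]:
                grouped[code][field].append(value)

    result = []
    for code, fields in grouped.items():
        record = {key: code}
        for field, values in fields.items():
            # Prefer informative values when at least one exists.
            informed = [v for v in values if v != "Not Informed"]
            record[field] = "; ".join(informed) if informed else "Not Informed"
        result.append(record)
    return result


# Hypothetical rows: occurrence 101 appears twice because it has two factors.
rows = [
    {"occurrence_code": 101, "aircraft_type": "AIRPLANE", "factor": "JUDGMENT"},
    {"occurrence_code": 101, "aircraft_type": "AIRPLANE", "factor": "PLANNING"},
    {"occurrence_code": 102, "aircraft_type": "HELICOPTER", "factor": ""},
]
occurrences = aggregate_by_occurrence(rows)
```

In this toy input, three merged rows reduce to two occurrences, which mirrors on a small scale how the 33,519 merged rows reduce to one record per occurrence code.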

2.2. Tools

Data processing was conducted in R (version 4.4.2) using RStudio (version 2024.12.0+467). Main packages included tidyverse for data manipulation, cluster for PAM and hierarchical clustering, factoextra for visualization and determination of optimal k, and fastDummies for one-hot encoding.

2.3. Data Preparation

Only occurrences from 2007 to 2023 were considered. Older records were not available, and newer ones might be incomplete. Flight phases were grouped according to the standard categories used by CENIPA. The most frequent phases included takeoff, landing, landing roll, climb, cruise, descent, taxi, final approach, and maneuver.
Occurrences coded as “Other” correspond to less frequent or non-standard phases, such as pushback, engine start, traffic circuit, hovering, low-altitude flight, ground operation, or undetermined phases not clearly reported in the original records.
Although the dataset includes 22 parameters, only three were selected for clustering: occurrence classification, aircraft type, and flight phase.
These variables were chosen because they are operationally relevant, complete, and directly related to the dynamics of aviation occurrences. Parameters such as date, geographic location, or report identifiers were excluded, as they do not express intrinsic similarities among occurrences and would reduce cluster interpretability.
We tested other variables, such as contributing-factor codes, engine type, and location. These variables produced non-interpretable clusters due to sparsity and heterogeneity. For this reason, the study focused on the three variables that provided consistent and operationally meaningful structures.
This choice was also based on exploratory univariate and bivariate analysis. Variables such as year, month, state, or engine type were excluded because they did not reveal clear operational patterns. Preliminary tests showed that including them led to clusters that were hard to interpret for safety analysis. Categorical variables were transformed using one-hot encoding. Standardization (z-score) was applied to ensure that the variables had a mean of 0 and variance of 1, avoiding distortions in distance-based methods.
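As an illustration of the encoding and standardization steps, the following sketch one-hot encodes categorical records and then applies a z-score to each resulting column. The record fields and category labels are hypothetical, and the use of the population standard deviation is an assumption (R's `scale()` uses the sample standard deviation, which gives slightly different values).

```python
from statistics import mean, pstdev


def one_hot(records, fields):
    """Return (columns, matrix): one 0/1 column per (field, level) pair."""
    columns = []
    for field in fields:
        for level in sorted({r[field] for r in records}):
            columns.append((field, level))
    matrix = [
        [1.0 if r[field] == level else 0.0 for field, level in columns]
        for r in records
    ]
    return columns, matrix


def zscore(matrix):
    """Standardize each column to mean 0 and (population) variance 1."""
    cols = list(zip(*matrix))
    means = [mean(c) for c in cols]
    sds = [pstdev(c) for c in cols]
    return [
        # Constant columns carry no information; map them to 0.
        [(v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, sds)]
        for row in matrix
    ]


# Hypothetical records for illustration only.
records = [
    {"aircraft_type": "AIRPLANE", "flight_phase": "LANDING"},
    {"aircraft_type": "HELICOPTER", "flight_phase": "TAKEOFF"},
    {"aircraft_type": "AIRPLANE", "flight_phase": "TAKEOFF"},
]
columns, matrix = one_hot(records, ["aircraft_type", "flight_phase"])
scaled = zscore(matrix)
```

One consequence worth noting is that, after standardization, rare categories receive larger absolute values than common ones, so a mismatch on a rare category weighs more in any distance computed afterwards.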

2.4. Clustering Methods

We applied three clustering algorithms: K-means, hierarchical clustering, and K-medoids implemented through the Partitioning Around Medoids (PAM) algorithm. These methods were selected because they are well-established in the literature, but present complementary properties regarding distance metrics, initialization, and interpretability [20,21,22,23]. Different distance metrics and validation strategies were tested in preliminary runs; details of these comparative results are reported in Section 3 (Results).

2.4.1. K-Means

K-means was used as a baseline because of its simplicity and wide adoption. The method divides the dataset into k groups, minimizing the variance inside each cluster. It works in an iterative way. First, initial centroids are selected. Then each observation is assigned to the nearest centroid using Euclidean distance. After that, the centroid of each group is recalculated. The process repeats until the groups do not change anymore [20].
The algorithm can be summarized in three main steps [21,22]:
  • Initialization: k observations are randomly selected as the initial centroids;
  • Assignment: each remaining point is assigned to the cluster with the nearest centroid, calculated using Euclidean distance;
  • Update: after assignment, the centroid of each cluster is recalculated as the mean of all its observations.
The steps of assignment and update repeat many times. They stop when no further changes happen in the clusters. At this stage, the algorithm is considered stable.
A known limitation of K-means is its sensitivity to the initial random centroids. A poor initialization may lead to suboptimal clusters. To reduce this effect, each K-means execution in this study was run 25 times with different initial centroids, and the best result, the one with the lowest within-cluster sum of squares (WSS), was retained.
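The three steps and the multiple-restart strategy can be sketched as follows. This is a minimal pure-Python illustration, not the R implementation used in the study; the function names, the fixed seed, and the toy points are assumptions.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_once(points, k, rng, max_iter=100):
    """One K-means run: random initialization, then assignment/update loops."""
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:
            break  # stable: no point changed cluster
        labels = new_labels
        # Update: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    wss = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, centroids, wss


def kmeans(points, k, n_start=25, seed=0):
    """Repeat K-means n_start times and keep the run with the lowest WSS."""
    rng = random.Random(seed)
    runs = (kmeans_once(points, k, rng) for _ in range(n_start))
    return min(runs, key=lambda run: run[2])


# Toy data: two well-separated groups of three points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids, wss = kmeans(points, k=2)
```

Keeping the run with the lowest WSS does not guarantee a global optimum, but it makes the result far less dependent on a single unlucky initialization.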
In this study, K-means was applied to the dataset with the variables aircraft type, flight phase, and occurrence classification. All analyses were performed in R (packages cluster and factoextra). Although Euclidean distance is the default metric, its performance was compared with alternative approaches; only the most consistent results are presented in Section 3.

2.4.2. Hierarchical Clustering

Hierarchical clustering is a method that builds nested groups of observations, organized as a tree-like structure [23]. One of its main advantages, compared to K-means, is that it does not require the number of clusters to be defined in advance. Instead, the algorithm produces a full hierarchy, and different “cuts” can be explored later [24].
In this study, we applied the agglomerative (bottom-up) approach, which is the most common. Each observation starts as a single cluster. Then, at each step, the two closest clusters are identified based on a distance metric, and they are merged into a new cluster. This process is repeated until only one final group remains. The sequence of merges is usually visualized through a dendrogram, which shows how groups are joined step by step.
Because our dataset contains a large number of observations, full dendrograms were not informative. For this reason, we reported the results through cluster composition tables, after defining k with the elbow and silhouette methods.
The implementation used Ward’s linkage criterion, which tends to create compact clusters of relatively equal size [25,26].
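The agglomerative procedure with Ward's criterion can be sketched on a small example. The sketch assumes the classical formulation of Ward's criterion on squared Euclidean distances, in which merging clusters A and B increases the within-cluster sum of squares by |A||B|/(|A|+|B|) times the squared distance between their centroids. It is a naive illustration for tiny inputs and does not reproduce the R implementation or the Manhattan-based variant reported in the results.

```python
def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    return [sum(c) / len(cluster) for c in zip(*cluster)]


def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b merge."""
    ca, cb = centroid(a), centroid(b)
    d2 = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * d2


def agglomerate(points, k):
    """Bottom-up merging until k clusters remain (naive, for small inputs)."""
    clusters = [[p] for p in points]  # every observation starts alone
    while len(clusters) > k:
        pairs = (
            (i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        # Merge the pair whose union increases the total variance the least.
        i, j = min(pairs, key=lambda ij: ward_cost(clusters[ij[0]], clusters[ij[1]]))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
clusters = agglomerate(points, k=2)
```

Stopping the loop at different values of k corresponds to cutting the dendrogram at different heights, which is how a single hierarchy can yield several candidate partitions.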

2.4.3. K-Medoids

K-medoids was adopted as a robust alternative to K-means, since it also divides the dataset into k predefined groups. This method is implemented through the Partitioning Around Medoids (PAM) algorithm [23]. Instead of centroids, it uses medoids, which are real observations within the dataset [22].
The main difference is in the cluster center. K-means uses the mean of all points (a centroid), which can be a virtual point not present in the data. PAM, instead, uses a medoid. A medoid is the most central observation of the cluster, a real data point. This makes the algorithm less sensitive to noise and outliers, since the center cannot be strongly moved by extreme values. PAM is also deterministic, so it is not affected by random initialization as in K-means. The process has two phases [22]:
  • Build phase: k medoids are selected step by step, choosing points that reduce overall dissimilarity;
  • Swap phase: each medoid is tested against non-medoid points. If a swap reduces the total dissimilarity, it is kept. This continues until no further improvement is possible, and the clusters are stable.
Because cluster representatives correspond to actual cases, PAM also improves interpretability. For both hierarchical clustering and PAM, different distance metrics were tested. Across the experiments, Manhattan distance consistently produced more stable, robust, and interpretable clusters, particularly after one-hot encoding of categorical variables [23]. Therefore, Manhattan distance was adopted in the final analyses reported in Section 3.
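The build and swap phases of PAM with Manhattan distance can be sketched as below. This is a simplified illustration that recomputes the total dissimilarity from scratch at every step, which is only practical for small inputs; the function names and toy data are assumptions, and the R `cluster` package uses a more efficient implementation.

```python
def manhattan(a, b):
    """Manhattan (L1) distance between two points."""
    return sum(abs(x - y) for x, y in zip(a, b))


def total_cost(points, medoids, dist):
    """Sum of each point's dissimilarity to its closest medoid."""
    return sum(min(dist(p, points[m]) for m in medoids) for p in points)


def pam(points, k, dist=manhattan):
    n = len(points)
    medoids = []
    # Build phase: greedily add the observation that lowers the cost most.
    while len(medoids) < k:
        candidates = (i for i in range(n) if i not in medoids)
        best = min(candidates, key=lambda i: total_cost(points, medoids + [i], dist))
        medoids.append(best)
    # Swap phase: try replacing each medoid with each non-medoid.
    improved = True
    while improved:
        improved = False
        current = total_cost(points, medoids, dist)
        for pos in range(k):
            for cand in range(n):
                if cand in medoids:
                    continue
                trial = medoids[:pos] + [cand] + medoids[pos + 1:]
                cost = total_cost(points, trial, dist)
                if cost < current:  # keep only strictly improving swaps
                    medoids, current, improved = trial, cost, True
    labels = [
        min(range(k), key=lambda j: dist(p, points[medoids[j]])) for p in points
    ]
    return medoids, labels


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
medoids, labels = pam(points, k=2)
```

The returned medoids are indices of real observations, so each cluster can be summarized by an actual occurrence record rather than by an averaged, possibly non-existent profile.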

2.5. Cluster Validation

The number of clusters (k) was evaluated from 2 to 15 using the elbow method (WSS) and the silhouette score [23]. These metrics guided the choice, but interpretability was the final criterion. In some cases, solutions with slightly lower scores were preferred when they provided clearer operational meaning. Preliminary tests sometimes showed low silhouette values even when the elbow curve was flat; these limitations are discussed in Section 4.
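For reference, the silhouette score used in the validation can be computed as in the sketch below. For each observation, a is the mean distance to the other members of its own cluster and b is the lowest mean distance to any other cluster; the score is (b - a) / max(a, b), averaged over all observations. The convention of assigning 0 to singleton clusters and to the degenerate case a = b = 0 is an assumption of this sketch, as are the toy data.

```python
def manhattan(a, b):
    """Manhattan (L1) distance between two points."""
    return sum(abs(x - y) for x, y in zip(a, b))


def silhouette(points, labels, dist=manhattan):
    """Mean silhouette width; requires at least two clusters."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)

    scores = []
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # singleton clusters score 0 by convention
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: mean distance to the nearest other cluster.
        b = min(
            sum(dist(points[i], points[j]) for j in other) / len(other)
            for m, other in clusters.items()
            if m != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good = silhouette(points, [0, 0, 0, 1, 1, 1])
bad = silhouette(points, [0, 1, 0, 1, 0, 1])
```

Evaluating this function for each candidate k, for example from 2 to 15, gives the curve that is inspected alongside the elbow plot.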

3. Results

To identify latent structures in the dataset, the methods described in Section 2 were applied. The analysis started with an exploratory evaluation of the data, followed by clustering with two and three variables. This section presents the main results.

3.1. Exploratory Data Analysis

The univariate analysis first examined Occurrence Classification, Aircraft Type, Flight Phase, and State of Occurrence. Figure 1 shows the distribution of these categories for Brazilian civil aviation occurrences from 2007 to 2023.
Incidents represent the majority of cases, followed by accidents and then serious incidents, the latter accounting for about 10% of the total. Regarding aircraft type, airplanes dominate, and only helicopters and ultralights also appear in significant numbers. In terms of flight phase, landing, takeoff, and cruise concentrate most occurrences, while counts fall off in the remaining phases. Geographically, the state of São Paulo leads with more than twice the cases of the second state, and the Southeast region concentrates the highest share of occurrences.
Serious incidents are, by definition, rare high-risk precursors. Hence, accidents can outnumber serious incidents in annual records. This pattern reflects the ICAO classification logic, where only occurrences with a high probability of an accident but without serious injury or damage are coded as serious incidents.
Figure 2 shows the distribution of occurrences by year, month, and engine type. In the annual series, 2023 shows a visible spike in counts. Since total movements are not normalized, a causal interpretation is avoided. According to [1], there was an increase in flight activity from 2022 to 2023, which helps explain part of the rise in reported occurrences. The monthly distribution for 2023 does not show isolated spikes, suggesting that the increase is spread across several months.
Figure 3 presents the bivariate analysis of civil aviation occurrences in Brazil (2007–2023). The figure shows the relationship between occurrence classification and flight phase.
It can be observed that incidents, being the majority of occurrences, set the trend already seen in the univariate analysis. For accidents, there is an inversion between landing and takeoff phases, while the less frequent phases grouped under “Other” still represent a significant share. The other phases appear more balanced when compared with accidents or incidents.
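A bivariate view of this kind amounts to a cross-tabulation of two categorical variables. The sketch below shows one way to build the counts and the within-row shares; the field names and the toy records are hypothetical and do not reproduce the figures of the study.

```python
from collections import Counter


def crosstab(records, row_field, col_field):
    """Count records for every (row level, column level) combination."""
    counts = Counter((r[row_field], r[col_field]) for r in records)
    rows = sorted({row for row, _ in counts})
    cols = sorted({col for _, col in counts})
    # Counter returns 0 for combinations that never occur.
    return {row: {col: counts[(row, col)] for col in cols} for row in rows}


def row_shares(table):
    """Convert counts to shares within each row."""
    shares = {}
    for row, cells in table.items():
        total = sum(cells.values())
        shares[row] = {col: n / total for col, n in cells.items()}
    return shares


# Hypothetical records for illustration only.
records = [
    {"classification": "ACCIDENT", "flight_phase": "TAKEOFF"},
    {"classification": "ACCIDENT", "flight_phase": "LANDING"},
    {"classification": "ACCIDENT", "flight_phase": "TAKEOFF"},
    {"classification": "INCIDENT", "flight_phase": "LANDING"},
]
table = crosstab(records, "classification", "flight_phase")
shares = row_shares(table)
```

Comparing the row shares across classifications is what reveals patterns such as a different ordering of landing and takeoff between accidents and incidents.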
Figure 4 shows the analysis by aircraft type.
In this case, there is a clear concentration of helicopter and ultralight occurrences within the accident category, compared with their distribution in the univariate analysis. The opposite effect is seen for airplanes, which appear more strongly in incidents than in accidents.
Figure 5 shows the distribution of flight phases for airplanes, helicopters and ultralights.
For airplanes, the result is very similar to the univariate analysis, which is expected since they represent 84% of all occurrences and drive the overall behavior of the dataset. This predominance is consistent with their operational share in Brazilian civil aviation, although the data are not normalized by exposure, for example number of flights or flight hours.

3.2. Clustering

3.2.1. Two Variables

The clustering analysis started with the variables Occurrence Classification and Flight Phase. The phases of flight were clustered separately for each occurrence classification. The first tests used the K-means algorithm. However, both the Elbow Test and the Silhouette Test gave unsatisfactory results, with low values and weak separation between classes. As shown in the Silhouette Score plot (Figure 6), a significant increase in k would be required to reach minimally acceptable values, but this would reduce the interpretability of the clusters.
In the first runs with Hierarchical Clustering using Euclidean distance, the results produced mostly single-class clusters. Switching to Manhattan distance improved the grouping. For accidents, incidents, and serious incidents, the cut at five clusters yielded silhouette scores between 0.40 and 0.52, which were considered acceptable for categorical data.
With K-medoids, single-class clusters reappeared, but they carried operational meaning. The most frequent phases, landing, takeoff, and cruise, formed individual clusters across all occurrence categories. The chosen k ranged from four to five, depending on the category, with silhouette scores around 0.47–0.50.
For Aircraft Type versus Occurrence Classification, the optimal k was three. Helicopters and ultralights consistently separated from airplanes, either individually or together. Clustering by flight phase for each aircraft type showed similar patterns, confirming that the most frequent phases tended to form their own clusters.
Overall, the two-variable analysis validated and structured the tendencies observed in the exploratory analysis, providing a formal quantitative basis for the three-variable clustering presented next. This stage ensured methodological robustness and supported the interpretability of subsequent multidimensional results.

3.2.2. Three Variables

Figure 7 shows the silhouette scores for K-means, Hierarchical Clustering, and K-medoids using three variables. For K-means and Hierarchical Clustering, the values remained low across different k. To reach around 0.3, more than ten clusters would be required. For K-medoids, the scores were also low at small k, but values above 0.4 were obtained when k exceeded eight.
The results were not considered satisfactory for K-means. Although most clusters contained homogeneous observations, some classes of the variables were not fully assigned to a single cluster. This produced overlap and reduced interpretability. In addition, there was no clear logic in the separation of occurrences at the intersections of clusters, which made the structure harder to explain.
For Hierarchical Clustering, the problems of overlap were even stronger. Regardless of the value of k, the number of well-defined clusters was small, and classes appeared in multiple groups. As a result, the interpretability was the lowest among the three algorithms. This pattern was observed for accidents, incidents, and serious incidents.
The K-medoids results were more consistent. With k = 9, the silhouette score was 0.4652. Despite the higher number of clusters, they had operational meaning and a clear separation logic. The main identified clusters for accidents were:
  • Helicopters;
  • Ultralights;
  • Takeoff;
  • Landing;
  • Cruise;
  • Maneuver;
  • Landing roll;
  • Specialized operations;
  • Other fixed-wing phases.
When conflicts occurred (e.g., helicopter accidents during takeoff), priority was given first to helicopters and ultralights, then to clusters defined by flight phases. Cluster 9 (other fixed-wing phases) always yielded to the others in these cases.
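The allocation pattern described above can be restated as a simple priority rule. The sketch below is only a descriptive summary of how the accident clusters resolved overlaps, not the clustering algorithm itself; the category labels are hypothetical, and treating specialized operations as a flight-phase category is an assumption based on the list of clusters.

```python
# Hypothetical labels; the first two groups take precedence over flight phases.
AIRCRAFT_CLUSTERS = ("HELICOPTER", "ULTRALIGHT")
PHASE_CLUSTERS = (
    "TAKEOFF",
    "LANDING",
    "CRUISE",
    "MANEUVER",
    "LANDING ROLL",
    "SPECIALIZED",
)
FALLBACK = "OTHER FIXED-WING PHASES"


def describe_accident_cluster(aircraft_type, flight_phase):
    """Return the cluster an accident would fall into under the priority rule."""
    if aircraft_type in AIRCRAFT_CLUSTERS:
        return aircraft_type  # aircraft type wins over flight phase
    if flight_phase in PHASE_CLUSTERS:
        return flight_phase
    return FALLBACK  # the residual cluster yields to all others


examples = [
    describe_accident_cluster("HELICOPTER", "TAKEOFF"),
    describe_accident_cluster("AIRPLANE", "TAKEOFF"),
    describe_accident_cluster("AIRPLANE", "TAXI"),
]
```

Writing the rule down this way makes explicit that a helicopter accident during takeoff is counted with helicopters, so the takeoff cluster should be read as fixed-wing takeoff accidents.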
This structure shows the relevance of helicopter and ultralight accidents, even though they represent a smaller share of cases. Frequent phases such as takeoff and landing also stood out. When k = 5 was used, these clusters remained stable, while others were merged. Specialized operations also formed a distinct cluster, despite their small number of occurrences. These were mostly agricultural aviation and some media flights.
The silhouette results for incidents are shown in Figure 8. Again, the scores for K-means and Hierarchical Clustering were low, although slightly higher than those for accidents.
The clusters obtained with K-means and Hierarchical Clustering were similar to the accident case, but the overlap problem was even stronger. Even when reduced to two groups, hierarchical clustering continued to show the worst performance in this dataset.
For K-medoids, with k = 7, the silhouette score was 0.4951. The identified clusters were:
  • Helicopters;
  • Takeoff;
  • Landing;
  • Cruise;
  • Landing roll;
  • Taxi;
  • Other fixed-wing phases.
In this case, the clustered events were less severe, but with a much larger number of occurrences. Helicopters remained a distinct group, while ultralights were absorbed into the other fixed-wing clusters. This suggests that only the most severe ultralight cases stand out from the rest. A similar effect was observed for specialized phases, where agricultural operations lost relevance in the context of incidents. In addition, a new cluster appeared for the taxi phase, indicating that ground occurrences form a latent component among the less severe cases. The same allocation rule for intersections was applied: priority to helicopters, followed by clusters defined by flight phases.
Finally, the clustering analysis was performed for serious incidents. The silhouette results are shown in Figure 9.
The Silhouette Score values were slightly higher for K-means, while hierarchical clustering showed significant improvement only for higher k values. However, the pattern of clusters remained uninformative, with strong overlap among classes even when larger values of k were used.
In K-means, an apparent improvement was observed for k > 4, as clusters became more defined. A deeper look, however, revealed that the algorithm was simply dividing observations by aircraft type, which did not provide new insights beyond the Exploratory Data Analysis.
For K-medoids, using k = 7, the Silhouette Score reached 0.5672. The following clusters were identified:
  • Helicopters;
  • Ultralights;
  • Takeoff;
  • Landing;
  • Cruise;
  • Landing roll;
  • Other fixed-wing phases.
The results for serious incidents were very similar to those of accidents. If k = 9 had been applied, the only difference would have been the replacement of the specialized phase (observed in accidents) by the climb phase.
Across all tests, K-medoids provided the most consistent and operationally meaningful clusters, while hierarchical clustering produced unstable groupings and K-means offered limited insights.
The results highlighted three consistent findings: (i) helicopters concentrate a disproportionate share of severe accidents and incidents, despite representing a smaller share of the fleet; (ii) ultralights and agricultural aircraft also appear more frequently in severe outcomes, suggesting operational vulnerabilities; and (iii) the phases of takeoff and landing remain critical points of susceptibility, confirming trends widely recognized in global aviation safety studies.

4. Discussion

4.1. Exploratory Data Analysis

The exploratory data analysis highlights that incidents are the most frequent category, while accidents represent about one quarter of the total. The predominance of airplanes is expected, but the presence of helicopters and ultralights indicates operational diversity. The predominance of incidents supports the idea that aviation remains a safe mode of transport, consistent with Boeing’s 2025 report [4].
The distribution of occurrences by flight phase also follows global trends, with landing and takeoff being the most critical segments. The concentration of accidents in airplanes reflects the composition of the Brazilian fleet, while helicopters and ultralights, although fewer, still account for a relevant portion of cases. Finally, the dominance of the Southeast region is aligned with the economic and operational weight of this region, as indicated by the Brazilian Aeronautical Registry [27]. The distribution by engine type follows the structure of the Brazilian fleet, which is predominantly composed of turbofan and piston-powered aircraft, consistent with national aviation statistics [1].
The distribution of occurrences across flight phases reinforces the evidence that takeoff and landing remain the most critical moments in aviation operations, as also highlighted in international safety reports [1,2]. Compared with incidents, which are more evenly distributed and reflect a wide range of minor deviations, accidents and serious incidents emphasize how risk escalates in phases that demand higher workload and situational awareness from pilots.
Helicopters and ultralights, though smaller in number, appear consistently in more severe categories, a finding also noted in general aviation studies. These exploratory results already indicated that takeoff, landing, and non-conventional aircraft types would emerge as critical categories. The clustering analysis confirmed these tendencies and provided more structural insights.

4.2. Clustering Analysis

This study applied clustering techniques to Brazilian aviation occurrences, seeking to identify latent structures that traditional descriptive methods could not capture. The silhouette values were modest, which is expected for heterogeneous datasets composed mainly of categorical variables, but they still pointed to stable structures consistent with operational expectations. The interpretation of clusters was based exclusively on their data composition and operational characteristics, following quantitative validation rather than subjective assessment. Critically, the clusters uncovered multidimensional interaction patterns beyond marginal distributions and cross-tabulations. This indicates that even when validation scores are not particularly high, clustering may reveal patterns of practical relevance for aviation safety.
When compared to previous research, our findings align with Čokorilo et al. [11] and Passarella et al. [15], who also identified clustering patterns associating specific flight phases with higher severity levels. Similarly, Zhao et al. [12,13] demonstrated that clustering-based methods can detect operational anomalies in real time, reinforcing the potential of these techniques as complementary tools for safety monitoring. Unlike those studies, which often used continuous flight data or simulator records, our analysis relied on the national investigation database maintained by CENIPA. This demonstrates the feasibility of applying clustering not only in highly technological contexts but also in the systematic analysis of occurrence reports. In this sense, the study helps bridge data-driven methods with regulatory and preventive practices in Brazil.
From a methodological perspective, K-medoids provided the most consistent and interpretable results, particularly when dealing with heterogeneous data and categorical variables. This supports previous observations by Jasra et al. [16], who emphasized the role of medoids in improving interpretability for safety analysts. The recurrent limitations of hierarchical clustering in our study confirm concerns raised in earlier reviews by Netjasov & Janić [9] about the difficulty of extracting operational meaning from overly fragmented groupings. K-means gave stable groups, but they were not always useful in practice. Many clusters had little operational meaning, so the method added less value for safety analysis. One limitation of our approach is that clustering different aircraft categories together may blur segment-specific risk profiles. Future analyses stratified by aircraft type could yield more precise patterns.
The present study offers two lessons. First, helicopters and ultralights exhibit different risk profiles and need specific attention in investigation and prevention. Second, clustering can help safety agencies such as CENIPA identify structural patterns in the data and direct investigative focus and preventive interventions to the phases of flight or types of aircraft that need them most. Table 2 shows the operational interpretation of clusters by domain.
Table 2 highlights the areas where operational attention is usually concentrated, such as human, operational, technical, and maintenance factors during phases like takeoff and landing. Although causality cannot be inferred, some tendencies are more evident. For example, helicopters and ultralights appear more frequently among accidents and serious incidents, whereas taxi and ground configurations are more visible among incidents. These groupings support the direction of specific actions for the prevention of aviation occurrences.
Although cluster analysis does not aim to model cause-and-effect relations, the resulting groups align with the contributing factors used in official safety investigation reports. Occurrences grouped under the takeoff and landing clusters are usually associated with operational culture and human factors, such as workload and approach stabilization. Clusters related to cruise and climb phases often involve technical or maintenance issues and system reliability. A notable finding of this study is the separation of helicopters and ultralights into distinct clusters, which indicates that these aircraft types, each with specific aerodynamic and operational profiles, should be analyzed according to their own characteristics. This result highlights the need for dedicated sectors or prevention programs focused on each aircraft type and on critical phases such as takeoff and landing.
Cluster analysis is a technique that identifies latent structures beyond the marginal distributions and cross-tab tools used in exploratory data analysis. It adds a multidimensional view by grouping safety occurrences according to their similarity across variables. This approach makes it possible to identify simultaneous separation patterns, such as those of helicopters and ultralights versus phase-driven groups (takeoff and landing compared with cruise and climb). As a result, data-driven organizations can adjust their prevention programs to specific aviation safety occurrences, such as stabilized approach programs, instead of relying only on general statistical prevalences.
One of the main advantages of cluster analysis is the ability to organize accidents into homogeneous groups. In this way, safety professionals can examine individual cases within a specific cluster, explore the particularities of each occurrence, and extract practical meaning for the prevention of safety events. This post-clustering workflow increases analytical depth and supports hypothesis generation and targeted prevention, while preserving the unsupervised and non-causal nature of the approach.
The purpose of applying clustering techniques in this study is exploratory rather than predictive. From a research perspective, the use of unsupervised learning methods allows the researcher to find groups of similar occurrences in a large and mixed dataset. This process does not try to confirm or reject a hypothesis. It organizes the data and shows how the occurrences are related.
In doing so, clustering enables the identification of latent structures and similarity patterns within occurrence data. Even when the patterns match known trends, clustering adds value because it makes these relations visible and measurable. It also provides a clear understanding of how aviation occurrences are organized into meaningful groups. In practice, the resulting clusters serve as worklists for focused, case-by-case audit of narratives and contributing-factor coding, enhancing depth of analysis without imposing causal assumptions. This can be used to support further safety studies and practical decisions.
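The post-clustering workflow described above can be sketched in a few lines. The example below is illustrative only: the occurrence identifiers, cluster labels, and the helper names `build_worklists` and `cluster_profile` are hypothetical, and the labels are assumed to come from a clustering step that has already been run.

```python
from collections import Counter, defaultdict

# Hypothetical rows: (occurrence_id, aircraft_type, flight_phase, cluster_label).
occurrences = [
    (101, "airplane", "landing", 0),
    (102, "airplane", "landing", 0),
    (103, "airplane", "takeoff", 0),
    (201, "helicopter", "maneuver", 1),
    (202, "helicopter", "landing", 1),
    (203, "helicopter", "maneuver", 1),
    (301, "ultralight", "takeoff", 2),
    (302, "ultralight", "takeoff", 2),
]


def build_worklists(rows):
    """Group occurrence IDs by cluster so each cluster can be audited case by case."""
    worklists = defaultdict(list)
    for occ_id, _aircraft_type, _phase, label in rows:
        worklists[label].append(occ_id)
    return dict(worklists)


def cluster_profile(rows):
    """Describe each cluster by its dominant aircraft type and flight phase."""
    members = defaultdict(list)
    for _occ_id, aircraft_type, phase, label in rows:
        members[label].append((aircraft_type, phase))
    profile = {}
    for label, items in members.items():
        top_type, n_type = Counter(t for t, _ in items).most_common(1)[0]
        top_phase, n_phase = Counter(p for _, p in items).most_common(1)[0]
        profile[label] = {
            "size": len(items),
            "top_type": top_type,
            "type_share": n_type / len(items),
            "top_phase": top_phase,
            "phase_share": n_phase / len(items),
        }
    return profile


worklists = build_worklists(occurrences)
profile = cluster_profile(occurrences)
for label in sorted(profile):
    p = profile[label]
    print(label, p["top_type"], p["top_phase"], worklists[label])
```

The profile is purely descriptive. It tells the analyst which reports to open first, not why the occurrences happened.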

5. Conclusions

The aim of the present study was to identify latent structures in Brazilian civil aviation occurrences using the variables aircraft type, flight phase, and occurrence classification. The clustering techniques revealed distinct profiles. Helicopters and ultralights formed exclusive groups linked to higher severity, while takeoff and landing phases stood out as recurrent critical contexts. A specific cluster of specialized flights, mainly agricultural operations, also emerged as a relevant category. These patterns, hidden in raw data, provide a basis for investigation and prevention strategies, indicating priorities and more vulnerable segments of aviation.
Beyond descriptive patterns, the study demonstrated that clustering can reveal operational structures consistent with the main contributing-factor domains used in accident investigations. The results confirmed that unsupervised learning methods can capture multidimensional relationships even when validation metrics, such as silhouette values, are modest. By organizing occurrences into homogeneous groups, the method offers an analytical framework for safety professionals to review cases within each cluster and identify actionable insights for prevention.
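For readers less familiar with the silhouette metric mentioned above, the sketch below computes it from first principles with Manhattan distance. For each point, a is the mean distance to the other members of its own cluster, b is the lowest mean distance to the members of any other cluster, and the silhouette is (b - a) / max(a, b). The 0/1 vectors and labels are hypothetical, the function name `silhouette_values` is an assumption, and this is not the code used to produce the figures in this study.

```python
def manhattan(a, b):
    """Sum of absolute coordinate differences (L1 distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def silhouette_values(vectors, labels):
    """Per-point silhouette; requires at least two clusters."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    values = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            # Convention: a singleton cluster gets a silhouette of 0.
            values.append(0.0)
            continue
        a = sum(manhattan(vectors[i], vectors[j]) for j in own) / len(own)
        b = min(
            sum(manhattan(vectors[i], vectors[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        values.append((b - a) / denom if denom > 0 else 0.0)
    return values


# Hypothetical one-hot vectors and cluster labels.
vectors = [
    [1, 0, 1, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 1],  # shares one attribute with each cluster
    [0, 1, 0, 1],
    [0, 1, 0, 1],
]
labels = [0, 0, 0, 1, 1]

values = silhouette_values(vectors, labels)
average = sum(values) / len(values)
print(values, average)
```

The third record sits exactly between the two clusters and scores 0, which pulls the average down even though the other four records are well separated. A modest average can therefore coexist with interpretable groups when some records share attributes with more than one cluster.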
At the same time, the heterogeneity of the dataset limited interpretability. Mixing airplanes, helicopters, ultralights, and specialized aircraft in the same analysis may blur or mask segment-specific risk profiles. Future studies can explore stratification by aircraft category or operational domain to achieve finer resolution of safety patterns.

Author Contributions

Conceptualization, F.D.S., D.A.P. and M.X.G.; methodology, D.A.P.; software, F.D.S.; validation, F.D.S., D.A.P., M.H., L.K. and M.X.G.; formal analysis, F.D.S., D.A.P., M.H., L.K. and M.X.G.; investigation, F.D.S., D.A.P., M.H., L.K. and M.X.G.; resources, F.D.S. and D.A.P.; data curation, F.D.S. and D.A.P.; writing—original draft preparation, F.D.S., D.A.P., M.H. and L.K.; writing—review and editing, D.A.P. and M.X.G.; visualization, F.D.S. and D.A.P.; supervision, M.X.G.; project administration, D.A.P.; funding acquisition, D.A.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original data presented in the study are openly available in https://dados.gov.br/dados/conjuntos-dados/ocorrencias-aeronauticas-da-aviacao-civil-brasileira (accessed on 28 July 2025).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Brazilian National Civil Aviation Agency (ANAC). ANAC Consumer Bulletin 2024; ANAC: Brasília, Brazil, 2025. Available online: https://www.gov.br/anac/pt-br/noticias/2025/anac-divulga-boletim-do-consumidor-2024/BoletimANACConsumidor2024Final.pdf (accessed on 1 October 2025).
  2. International Air Transport Association (IATA). Strengthened Profitability Expected in 2025 Even as Supply Chain Issues Persist; IATA Press Release: Montreal, QC, Canada; Geneva, Switzerland, 2024; Available online: https://www.iata.org/en/pressroom/2024-releases/2024-12-10-01/ (accessed on 1 October 2025).
  3. International Civil Aviation Organization (ICAO). State of Global Aviation Safety. ICAO Safety Report 2025 Edition; ICAO: Montreal, QC, Canada, 2025; Available online: https://www.icao.int/safety/pages/safety-report.aspx (accessed on 1 October 2025).
  4. BOEING. Statistical Summary of Commercial Jet Airplane Accidents: Worldwide Operations, 1959–2024, 56th ed.; Boeing Commercial Airplanes: Seattle, WA, USA, 2025; 34p, Available online: https://www.boeing.com/content/dam/boeing/boeingdotcom/company/about_bca/pdf/statsum.pdf (accessed on 5 July 2025).
  5. Reason, J. Human Error; Cambridge University Press: Cambridge, UK, 1990.
  6. Wiegmann, D.A.; Shappell, S.A. A Human Error Analysis of Commercial Aviation Accidents Using the Human Factors Analysis and Classification System (HFACS). Aviat. Space Environ. Med. 2001, 72, 1006–1016.
  7. Leveson, N. Engineering a Safer World: Systems Thinking Applied to Safety; MIT Press: Cambridge, MA, USA, 2011.
  8. Salmon, P.; Cornelissen, M.; Trotter, M. Systems-Based Accident Analysis Methods: A Comparison of Accimap, HFACS, and STAMP. Saf. Sci. 2012, 50, 1158–1170.
  9. Netjasov, F.; Janić, M. A Review of Research on Risk and Safety Modelling in Civil Aviation. J. Air Transp. Manag. 2008, 14, 213–220.
  10. Roelen, A.L.C.; Lin, P.H.; Hale, A.R. Accident Models and Organisational Factors in Air Safety: The Need for Multi-Method Models. Saf. Sci. 2011, 49, 1170–1179.
  11. Čokorilo, O.; De Luca, M.; Dell’Acqua, G. Aircraft Safety Analysis Using Clustering Algorithms. J. Risk Res. 2014, 17, 1325–1340.
  12. Zhao, W.; He, F.; Li, L.; Xiao, G. An Adaptive Online Learning Model for Flight Data Cluster Analysis. In Proceedings of the IEEE/AIAA 37th Digital Avionics Systems Conference (DASC), London, UK, 23–27 September 2018; IEEE: Piscataway, NJ, USA, 2018.
  13. Zhao, W.; Li, L.; Alam, S.; Wang, Y. An Incremental Clustering Method for Anomaly Detection in Flight Data. Transp. Res. Part C Emerg. Technol. 2021, 132, 103406.
  14. Kharoufah, H.; Murray, J.; Baxter, G.; Wild, G. A Review of Human Factors Causations in Commercial Air Transport Accidents and Incidents: From 2000 to 2016. Prog. Aerosp. Sci. 2018, 99, 1–13.
  15. Passarella, R.; Iqbal, M.D.; Buchari, M.A.; Veny, H. Analysis of Commercial Airplane Accidents Worldwide Using K-Means Clustering. Int. J. Saf. Secur. Eng. 2023, 13, 813–819.
  16. Jasra, S.K.; Valentino, G.; Muscat, A.; Camilleri, R. A Comparative Study of Unsupervised Machine Learning Methods for Anomaly Detection in Flight Data: Case Studies from Real-World Flight Operations. Aerospace 2025, 12, 151.
  17. Passarella, R.; Iqbal, M.D.; Buchari, M.A.; Veny, H. Using the Agglomerative Hierarchical Clustering Method to Examine Human Factors in Indonesian Aviation Accidents. Int. J. Adv. Comput. Sci. Appl. 2022, 13, 38.
  18. International Civil Aviation Organization (ICAO). Annex 13—Aircraft Accident and Incident Investigation, 11th ed.; ICAO: Montreal, QC, Canada, 2016; Available online: https://www.icao.int/ (accessed on 1 October 2025).
  19. Brazilian Aeronautical Accidents Investigation and Prevention Center (CENIPA). Aeronautical Occurrences in Brazilian Civil Aviation (2007–2023); Open Data Portal, Brazilian Government: Brasília, Brazil, 2025. Available online: https://dados.gov.br/dados/conjuntos-dados/ocorrencias-aeronauticas-da-aviacao-civil-brasileira (accessed on 1 October 2025).
  20. Han, J.; Kamber, M.; Pei, J. Data Mining: Concepts and Techniques, 3rd ed.; Morgan Kaufmann: San Francisco, CA, USA, 2012.
  21. Kaufman, L.; Rousseeuw, P.J. Finding Groups in Data: An Introduction to Cluster Analysis; Wiley: New York, NY, USA, 1990.
  22. James, G.; Witten, D.; Hastie, T.; Tibshirani, R. An Introduction to Statistical Learning: With Applications in R, 2nd ed.; Springer: New York, NY, USA, 2021.
  23. Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed.; Springer: New York, NY, USA, 2009.
  24. Ward, J.H. Hierarchical Grouping to Optimize an Objective Function. J. Am. Stat. Assoc. 1963, 58, 236–244.
  25. Murtagh, F.; Legendre, P. Ward’s Hierarchical Agglomerative Clustering Method: Which Algorithms Implement Ward’s Criterion? J. Classif. 2014, 31, 274–295.
  26. Géron, A. Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems, 2nd ed.; O’Reilly Media: Sebastopol, CA, USA, 2019.
  27. Brazilian National Civil Aviation Agency (ANAC). Brazilian Aeronautical Registry: Open Data; ANAC: Brasília, Brazil, 2025. Available online: https://www.gov.br/anac/pt-br/sistemas/rab/dados-abertos-rab (accessed on 6 July 2025).
Figure 1. Exploratory analysis of civil aviation occurrences in Brazil (2007–2023): distribution by occurrence classification, aircraft type, flight phase, and state. Definitions of “accident”, “serious incident”, and “incident” follow [18].
Figure 2. Exploratory analysis of civil aviation occurrences in Brazil (2007–2023): distribution by year, month, number of engines, and engine type.
Figure 3. Top 11 flight phases in Brazilian civil aviation occurrences (2007–2023), grouped by occurrence classification.
Figure 4. Frequency of aircraft types in Brazilian civil aviation occurrences (2007–2023), grouped by occurrence classification. Counts are not normalized by exposure.
Figure 5. Distribution of the top 11 flight phases in Brazilian civil aviation occurrences (2007–2023), grouped by aircraft type.
Figure 6. Average silhouette width for different numbers of clusters (accidents dataset).
Figure 7. Silhouette scores for different numbers of clusters (k) in accident data.
Figure 8. Silhouette scores for different numbers of clusters (k) in incident data.
Figure 9. Average silhouette scores for different numbers of clusters (k) in serious incident data.
Table 1. Structure and description of variables contained in the dataset (2007–2023).
| Variable Name | Description | Type |
|---|---|---|
| aircraft_damage_level | level of damage sustained by the aircraft | categorical |
| aircraft_engine_quantity | number of engines installed on the aircraft | categorical |
| aircraft_engine_type | type of aircraft engine (piston, turboprop, jet, etc.) | categorical |
| aircraft_model | model designation of the aircraft | categorical |
| aircraft_total_fatalities | total number of fatalities associated with the occurrence | categorical |
| aircraft_type | general type of aircraft (airplane, helicopter, ultralight, etc.) | categorical |
| city | city where the occurrence took place | categorical |
| contributing_factor_domain | domain or category of the contributing factor (human, operational, technical, etc.) | categorical |
| contributing_factor_name | name of the contributing factor(s) identified in the investigation | categorical |
| country | country where the occurrence took place | categorical |
| flight_phase | phase of flight at the time of the occurrence (takeoff, landing, cruise, etc.) | categorical |
| latitude | latitude coordinate of the occurrence site | numeric |
| longitude | longitude coordinate of the occurrence site | numeric |
| occurrence_category | broader operational category of the occurrence | categorical |
| occurrence_classification | classification of the event (accident, incident, serious incident) | categorical |
| occurrence_date | date of the occurrence | date |
| occurrence_id | unique identifier of the occurrence record | integer |
| occurrence_month | month of the occurrence | integer |
| occurrence_state | Brazilian state (federative unit) where the occurrence took place | categorical |
| occurrence_type | specific operational type of occurrence | categorical |
| occurrence_year | year of the occurrence | integer |
| report_number | identification number of the published investigation report | categorical |
Table 2. Operational interpretation of clusters by domain (illustrative, non-causal).
| Cluster Label | Operational Interpretation | Likely Contributing-Factor Domain * |
|---|---|---|
| cruise ¹,²,³ | system reliability; route failure; anomaly management. | technical/maintenance; (environmental) |
| helicopters ¹,²,³ | mission profiles with low altitude/speed; operations in confined areas. | human/operational; (environmental) |
| landing ¹,²,³ | approach stabilization; runway/wind effect; go-around decision. | human/operational; (environmental) |
| landing roll ¹,²,³ | directional control; braking; runway/contamination. | human/operational; (environmental) |
| maneuver ¹ | outside the standard profile; low height/sharp curves. | human/operational |
| other fixed-wing phases ¹,²,³ | infrequent/undetermined situations. | operational |
| specialized operations ¹ | specific mission profiles (agriculture, photography/filming, inspection). | human/operational |
| takeoff ¹,²,³ | high workload; setup/briefing; takeoff performance. | human/operational |
| taxi ² | situational awareness; signage/lighting; incursions. | human/operational |
| ultralights ¹,³ | limited performance envelope; lower structural robustness; recreational aviation. | human/operational; (technical/maintenance) |

¹ accident cluster; ² incident cluster; ³ serious incident cluster. * Interpretative, non-causal mapping. Domains in parentheses denote secondary influences.
