6.1. Results Considering the Linear Weighted Kappa
Table 14 presents the agreement results between the reference FMECA ranking RPI(SC 5) and the six fuzzy-based configurations (T2-FIS 01 to T2-FIS 06), based on the linear weighted kappa coefficient κw-lin. It lists the rankings assigned to each failure mode, the corresponding kappa values, the strength of agreement, the test statistic, and the outcomes of the hypothesis tests.
The triangular membership functions used in this work (Figure 1), although computationally simple, are not symmetric about their peak, nor is the maximum value located at the geometric center of each category. Nevertheless, in T2-FIS 03, their use for Severity appears effective, possibly because Severity is often rated with more consensus among experts, reducing the impact of the membership functions’ asymmetry. The trapezoidal membership functions used for FRPN in this configuration (Figure 2), while also asymmetric and not centered, provide wider support regions that may have helped in aggregating fuzzy values more smoothly.
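To make the shape arguments concrete, the following minimal sketch implements the textbook formulas for the four membership function families named in this section. The parameter values are hypothetical and chosen only for illustration; the actual parameterization shown in Figures 1 and 2 is not reproduced here.

```python
import math

def triangular(x, a, b, c):
    """Triangular MF with feet at a and c and peak at b (a < b < c)."""
    return max(0.0, min((x - a) / (b - a), (c - x) / (c - b)))

def trapezoidal(x, a, b, c, d):
    """Trapezoidal MF with feet at a and d and a flat top on [b, c]."""
    return max(0.0, min((x - a) / (b - a), 1.0, (d - x) / (d - c)))

def gaussian(x, c, sigma):
    """Gaussian MF centered at c with spread sigma."""
    return math.exp(-0.5 * ((x - c) / sigma) ** 2)

def gbell(x, a, b, c):
    """Generalized bell MF: width a, slope b, center c."""
    return 1.0 / (1.0 + abs((x - c) / a) ** (2 * b))

# Hypothetical asymmetric triangle: peak at 2, feet at 1 and 5.
# Points at the same distance (0.5) on either side of the peak
# receive different membership grades.
left_grade = triangular(1.5, 1, 2, 5)
right_grade = triangular(2.5, 1, 2, 5)
print(left_grade, right_grade)

# A Gaussian with the same center treats both sides identically.
print(gaussian(1.5, 2, 1), gaussian(2.5, 2, 1))
```

The asymmetric triangle assigns unequal grades to inputs equally far from its peak, whereas the Gaussian and g-bell shapes are symmetric by construction, which is the property the discussion attributes to them.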
In contrast, configuration T2-FIS 06 produced the lowest agreement (κw-lin = 0.4154), classified as “moderate agreement”. This configuration used a g-bell function for Severity, a triangular function for Occurrence, and trapezoidal functions for both Detection and FRPN. Several factors may contribute to this lower performance. First, the triangular MFs used for Occurrence are asymmetric and exhibit sharp transitions, which may limit their expressiveness in modeling uncertainty due to incomplete failure data, particularly when Occurrence is not based on robust statistics. Second, the trapezoidal MFs used for Detection and FRPN, while offering broader support, can introduce ambiguity near membership transitions, especially if the flat tops are not well aligned with the central values of their respective categories. Such ambiguity may lead to less accurate activation of fuzzy rules and less precise inference outputs.
Intermediate κw-lin values were obtained by configurations T2-FIS 05 (0.6308) and by T2-FIS 02 and T2-FIS 04 (both 0.6615). These configurations differ primarily in the combinations of trapezoidal, Gaussian, and g-bell membership functions (MFs) assigned to the input and output variables. For instance, T2-FIS 05 uses trapezoidal MFs for Severity, Gaussian for Occurrence, g-bell for Detection, and trapezoidal for FRPN. The presence of smooth, symmetric MFs (g-bell and Gaussian) in at least one of the subjective inputs (Detection or Occurrence) seems to improve alignment with the reference ranking, though not enough to surpass the performance of T2-FIS 03.
Interestingly, T2-FIS 02, which combines Gaussian membership functions for Severity and Occurrence, also achieved a relatively high agreement (κw-lin = 0.6615); this is the same kappa value obtained by T2-FIS 04, even though their assigned rankings differ for several failure modes. This equivalence in agreement strength despite differing rankings highlights an important feature of the weighted kappa coefficient: it accounts not only for exact matches but also for the magnitude and distribution of disagreements across the rankings. That is, two configurations can have different ranking patterns, but if their deviations from the reference are similarly distributed in distance and direction, they may result in comparable kappa values.
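The point that different rankings can share the same linear weighted kappa can be checked with a small sketch. It assumes the standard definition of the coefficient with disagreement weights |i − j|/(k − 1); the function name and the five-item rankings are hypothetical and are not taken from Table 14.

```python
def linear_weighted_kappa(ref, other, k):
    """Linear weighted kappa for two rank vectors with categories 1..k."""
    n = len(ref)
    # Observed joint proportions of (reference rank, assigned rank).
    obs = [[0.0] * k for _ in range(k)]
    for r, o in zip(ref, other):
        obs[r - 1][o - 1] += 1.0 / n
    row = [sum(obs[i]) for i in range(k)]
    col = [sum(obs[i][j] for i in range(k)) for j in range(k)]
    observed = expected = 0.0
    for i in range(k):
        for j in range(k):
            w = abs(i - j) / (k - 1)           # linear disagreement weight
            observed += w * obs[i][j]           # weighted observed disagreement
            expected += w * row[i] * col[j]     # weighted chance disagreement
    return 1.0 - observed / expected

# Hypothetical rankings of five failure modes.
ref = [1, 2, 3, 4, 5]
config_a = [2, 1, 4, 3, 5]   # four items off by one position
config_b = [1, 2, 5, 4, 3]   # two items off by two positions

kappa_a = linear_weighted_kappa(ref, config_a, 5)
kappa_b = linear_weighted_kappa(ref, config_b, 5)
print(kappa_a, kappa_b)
```

Both hypothetical configurations accumulate the same total rank displacement, so the linear scheme scores them identically even though they disagree with the reference on different items.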
From a modeling perspective, T2-FIS 02 uses smooth, symmetric functions for the two more uncertain risk factors (Severity and Occurrence), which may help capture uncertainty more consistently. However, it relies on trapezoidal MFs for Detection, which, due to their asymmetry and flat tops not centered on the category midpoints, may contribute to ambiguity near the rule activation thresholds, reducing overall model precision.
In contrast, T2-FIS 04 applies trapezoidal MFs for Severity, g-bell for Occurrence, triangular for Detection, and trapezoidal for FRPN. The use of g-bell functions for Occurrence likely contributes positively due to their symmetrical and flexible shape, but the triangular and trapezoidal MFs used elsewhere may have reduced the system’s sensitivity to subtle input variations. Notably, the triangular MFs used for Detection in this configuration are not symmetric and have sharper transitions, potentially limiting their ability to reflect gradual shifts in expert assessments.
The fact that both configurations reach the same κw-lin value suggests that the kappa statistic reflects an aggregate measure of concordance, rather than rewarding specific item-wise matches. This reinforces the idea that kappa-based agreement is a more robust and statistically grounded metric than simply comparing rankings, which may overlook broader consistency patterns or disproportionately penalize near matches. Therefore, even when the rankings differ, the overall distribution and magnitude of disagreements may result in statistically equivalent levels of agreement, as seen with T2-FIS 02 and T2-FIS 04.
Finally, it is worth noting that the null hypothesis of agreement occurring by chance was rejected for all six configurations, as indicated by z statistics exceeding the critical threshold (z > 1.645). This confirms the statistical significance of the observed agreements and supports the reliability of the weighted kappa as a comparative metric across fuzzy-based FMECA models.
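The threshold of 1.645 corresponds to a one-sided test at the 5% significance level under a standard normal approximation. The sketch below shows that decision rule; the function name, the kappa value, and the null standard error are hypothetical placeholders, since the standard errors behind Table 14 are not given in this section.

```python
from statistics import NormalDist

def kappa_z_test(kappa, se_null, alpha=0.05):
    """One-sided z-test of H0: agreement is due to chance (kappa = 0)."""
    z = kappa / se_null
    z_crit = NormalDist().inv_cdf(1.0 - alpha)   # about 1.645 for alpha = 0.05
    p_value = 1.0 - NormalDist().cdf(z)
    return z, z_crit, p_value, z > z_crit

# Hypothetical inputs: kappa = 0.50 with a null standard error of 0.20.
z, z_crit, p_value, reject = kappa_z_test(0.50, 0.20)
print(round(z, 3), round(z_crit, 3), reject)
```

With these placeholder values the statistic is 2.5, which exceeds the critical value and leads to rejection of the chance-agreement hypothesis.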
6.2. Results for Quadratic Weighted Kappa
Table 15 presents the agreement results between the reference ranking RPI(SC 5) and the six fuzzy-based configurations (T2-FIS 01 to T2-FIS 06), using the quadratic weighted kappa coefficient (κw-quad). As with the linear weighting case, this metric captures not only exact matches but also the degree to which each ranked item deviates from the reference. However, the quadratic scheme penalizes larger deviations more heavily, making it more sensitive to distant mismatches in the rankings.
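The difference between the two schemes can be written out explicitly. The notation below is an assumption introduced for illustration (k rank categories, observed joint proportions p_ij, marginal proportions p_i· and p_·j) and follows the standard definition of the weighted kappa in terms of disagreement weights.

```latex
\[
v^{\text{lin}}_{ij} = \frac{|i-j|}{k-1}, \qquad
v^{\text{quad}}_{ij} = \frac{(i-j)^2}{(k-1)^2}, \qquad
\kappa_w = 1 - \frac{\sum_{i,j} v_{ij}\, p_{ij}}{\sum_{i,j} v_{ij}\, p_{i\cdot}\, p_{\cdot j}}
\]
\[
\frac{v^{\text{lin}}_{i,i+3}}{v^{\text{lin}}_{i,i+1}} = 3, \qquad
\frac{v^{\text{quad}}_{i,i+3}}{v^{\text{quad}}_{i,i+1}} = 9
\]
```

Relative to a one-position error, a three-position error costs three times as much under linear weights and nine times as much under quadratic weights, which is why a few distant mismatches can separate configurations that the linear scheme scores equally.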
The highest agreement was again achieved by T2-FIS 03, with κw-quad = 0.9033, falling within the “almost perfect” agreement category. This configuration uses triangular membership functions (MFs) for Severity, a g-bell function for Occurrence, a Gaussian function for Detection, and trapezoidal functions for FRPN. As discussed previously, this combination benefits from the smoothness and symmetry of Gaussian and g-bell functions, attributes that improve inference quality for variables subject to uncertainty and expert variability (notably Detection and Occurrence). The trapezoidal output MFs provide broader aggregation support, enhancing the model’s resolution. Despite the triangular MFs for Severity being asymmetric and off-centered, they appear suitable given the more consistent and less subjective nature of this risk factor.
As in the previous section, T2-FIS 02 achieved the second-highest kappa value (κw-quad = 0.8901), surpassing T2-FIS 04, which had shown identical κw-lin to T2-FIS 02 in the linear weighting case. Here, however, T2-FIS 04 drops to κw-quad = 0.8374. This divergence underscores a key distinction: under quadratic weighting, not only does the number of mismatches matter, but also their magnitude. T2-FIS 02 appears to exhibit smaller average deviations from the reference ranking than T2-FIS 04, despite having a similar total number of exact matches. This suggests that T2-FIS 02 produced disagreements that were closer to the reference, perhaps rank offsets of ±1, whereas T2-FIS 04 included some more severe outliers (e.g., FM6 and FM9). In modeling terms, T2-FIS 02 utilizes g-bell and Gaussian membership functions (MFs) for Severity and Occurrence, providing symmetry and smoother transitions, which may help maintain output consistency, especially when Detection and FRPN use trapezoidal and Gaussian MFs.
T2-FIS 05 shows a slightly better performance (κw-quad = 0.8549) than both T2-FIS 01 (κw-quad = 0.8418) and T2-FIS 04 (κw-quad = 0.8374), despite having only two exact ranking matches with the reference, the same number as T2-FIS 01 and fewer than the six of T2-FIS 04. This highlights once more that the quadratic weighted kappa captures the overall proximity of rankings more effectively than simply counting exact matches. T2-FIS 05 uses trapezoidal functions for Severity and FRPN, Gaussian functions for Occurrence, and g-bell functions for Detection, a combination that likely contributed to its performance by enabling smoother aggregation and better modeling of subjectivity.
Configuration T2-FIS 01, despite using symmetric and relatively expressive functions (Gaussian for Occurrence and g-bell for Detection), has one of the lowest κw-quad values in the “almost perfect” group (0.8418), above only T2-FIS 04 (0.8374). This suggests that some of its ranking errors lie further from the reference, possibly because of its triangular MFs for Severity and FRPN, which, being asymmetric and narrow, may reduce expressiveness in aggregating fuzzy scores and lead to more rigid prioritization outcomes.
Finally, T2-FIS 06 exhibits the lowest agreement (κw-quad = 0.6571, “substantial” agreement). This configuration uses a g-bell for Severity, a triangular function for Occurrence, and a trapezoidal function for both Detection and FRPN. The combination of a complex shape (g-bell) for a typically less subjective input, such as Severity, with less expressive or abrupt transitions in the remaining factors, seems to impair consistency in the prioritization process. Additionally, triangular MFs for Occurrence may not adequately represent the uncertainty typical of that input when failure data are incomplete.
The T2-FIS 03 configuration consistently achieves the highest level of agreement under both schemes, suggesting that its fuzzy membership function design is particularly effective and can serve as a strong reference for future fuzzy-based FMECA applications in power transformers.
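The agreement labels quoted in this section (“moderate”, “substantial”, “almost perfect”) are consistent with the Landis and Koch interpretation bands. Assuming that this is the scale in use, the sketch below maps the κw-quad values reported above to their labels; the function name is hypothetical.

```python
def agreement_strength(kappa):
    """Map a kappa value to a label, assuming the Landis and Koch bands."""
    if kappa < 0.0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial")]:
        if kappa <= upper:
            return label
    return "almost perfect"

# Quadratic weighted kappa values as reported in the text for Table 15.
kappa_quad = {
    "T2-FIS 01": 0.8418, "T2-FIS 02": 0.8901, "T2-FIS 03": 0.9033,
    "T2-FIS 04": 0.8374, "T2-FIS 05": 0.8549, "T2-FIS 06": 0.6571,
}

# List configurations from highest to lowest agreement.
for name, value in sorted(kappa_quad.items(), key=lambda kv: -kv[1]):
    print(name, value, agreement_strength(value))
```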
Figure 6 presents a radar chart comparing the failure mode rankings produced by the reference method RPI(SC 5) and four selected fuzzy-based FMECA configurations (T2-FIS 01, 02, 03, and 05).
While a first visual inspection reveals varying degrees of alignment with the reference, a deeper analysis shows that visual closeness or the number of exact ranking matches does not always accurately reflect the statistical agreement captured by the kappa coefficient.
To illustrate this, the number of exact matches and the corresponding quadratic weighted kappa (κw-quad) values are summarized as follows:
T2-FIS 01: 2 exact matches, κw-quad = 0.8418;
T2-FIS 02: 1 exact match, κw-quad = 0.8901;
T2-FIS 03: 4 exact matches, κw-quad = 0.9033;
T2-FIS 05: 2 exact matches, κw-quad = 0.8549.
These values emphasize a crucial point: a higher number of exact matches does not necessarily imply a better overall agreement. For instance, T2-FIS 05 and T2-FIS 01 both have two exact matches, yet T2-FIS 05 attains the higher κw-quad value. This suggests that the distance of mismatches plays a more important role in statistical concordance than the absolute number of matches. This is visually reflected in Figure 6, where T2-FIS 05 traces a line slightly more distant from the reference in some segments but overall remains close, especially for the most critical failure modes (FM1, FM2, and FM10). Conversely, T2-FIS 01, although achieving the same number of exact matches, exhibits larger divergences at several points (e.g., FM5 and FM6), resulting in a lower κw-quad score.
A particularly revealing case is T2-FIS 02, which has only one exact match but achieves κw-quad = 0.8901, outperforming both T2-FIS 01 and T2-FIS 05. This apparent contradiction is explained by the fact that T2-FIS 02 produces small deviations across the board, i.e., most ranks differ from the reference by only one or two positions. Thus, the global ranking structure is better preserved, even with a single exact match.
This analysis highlights a fundamental limitation in traditional FMECA evaluation practices that rely solely on comparing rankings item by item. While intuitive, this approach fails to capture near matches, such as a failure mode being ranked 6th by one method and 7th by another. Such differences are negligible in practical contexts, yet coincidence counting treats them as full disagreements.
The weighted kappa statistic overcomes this limitation by applying distance-sensitive penalties and adjusting for agreement expected by chance. Consequently, it offers a more holistic and statistically grounded measure of methodological alignment.
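The contrast between coincidence counting and a distance-sensitive measure can be illustrated with a short sketch. It relies on a simplifying assumption that may not hold for the rankings in Table 15: the rankings are tie-free permutations of n items with one category per rank position, in which case the quadratic weighted kappa reduces to 1 − 6Σd²/(n(n² − 1)). The function name and the six-item rankings are hypothetical.

```python
def match_and_distance(ref, other):
    """Return (exact matches, sum of squared rank differences, kappa_quad).

    The closed form for kappa_quad is valid only for tie-free rankings
    of n items that use each rank 1..n exactly once.
    """
    diffs = [o - r for r, o in zip(ref, other)]
    n = len(diffs)
    exact = sum(1 for d in diffs if d == 0)
    sq = sum(d * d for d in diffs)
    kappa_quad = 1.0 - 6.0 * sq / (n * (n * n - 1))
    return exact, sq, kappa_quad

# Hypothetical rankings of six failure modes.
ref = [1, 2, 3, 4, 5, 6]
near_misses = [2, 1, 4, 3, 6, 5]   # no exact match, every item off by one
few_outliers = [1, 2, 3, 6, 5, 4]  # four exact matches, two items off by two

result_near = match_and_distance(ref, near_misses)
result_out = match_and_distance(ref, few_outliers)
print(result_near)
print(result_out)
```

In this toy case the ranking with no exact matches obtains the higher quadratic kappa, because all of its errors are adjacent-rank swaps, while the ranking with four exact matches is penalized for its two larger displacements.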
These results have meaningful implications for practical risk assessment in engineering systems, particularly in critical infrastructure such as power transformers. The observed sensitivity of the kappa coefficient to the type and distribution of membership functions underscores that fuzzy modeling decisions are far from trivial; they materially influence the prioritization of failure modes and, consequently, the reliability of maintenance or mitigation strategies. For instance, Gaussian membership functions may better reflect smooth transitions in uncertainty for risk factors such as Occurrence or Detection, while trapezoidal membership functions can more effectively represent hard thresholds, particularly in the aggregation stage (FRPN).
From a methodological perspective, configuration T2-FIS 03 emerged as a robust and consistent fuzzy modeling approach for FMECA applied to power transformers. Its superior agreement values under both linear and quadratic weighting schemes indicate a stable prioritization output closely aligned with the reference method. These results support the suitability of this configuration as a candidate reference for future studies involving fuzzy-based FMECA in similar industrial settings.
However, in complex practical applications, where multiple factors interact and system conditions are dynamic, the diversity and uncertainty introduced by different membership function designs can compromise the consistency and reliability of fuzzy-FMECA outputs. To address these challenges, future work should explore adaptive or data-driven strategies for tuning membership functions. For example, optimization techniques or machine learning approaches could be used to calibrate membership function parameters based on empirical failure data or expert consensus. Furthermore, incorporating robustness testing, such as sensitivity analyses across membership function types or configurations, and validating the models through cross-domain case studies will be key to ensuring generalizability and methodological resilience.
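As one possible reading of the data-driven tuning suggested above, the sketch below calibrates the center and spread of a Gaussian membership function by a least-squares grid search against elicited membership grades. Everything in it is an assumption for illustration: the function names, the parameter grids, and the synthetic "expert" grades, which are generated from a known Gaussian so that the result can be checked.

```python
import math

def gaussian_mf(x, c, sigma):
    """Gaussian membership function centered at c with spread sigma."""
    return math.exp(-0.5 * ((x - c) / sigma) ** 2)

def fit_gaussian_mf(points, centers, sigmas):
    """Grid search for the (c, sigma) pair minimizing squared error.

    points is a list of (input value, target membership grade) pairs.
    """
    best = None
    for c in centers:
        for s in sigmas:
            err = sum((gaussian_mf(x, c, s) - m) ** 2 for x, m in points)
            if best is None or err < best[0]:
                best = (err, c, s)
    return best[1], best[2]

# Synthetic grades standing in for expert judgments (hypothetical data).
points = [(x, gaussian_mf(x, 5.0, 1.5)) for x in (2.0, 3.5, 5.0, 6.5, 8.0)]
centers = [4.0 + 0.5 * i for i in range(5)]   # 4.0, 4.5, ..., 6.0
sigmas = [0.5 * i for i in range(1, 7)]       # 0.5, 1.0, ..., 3.0

c_hat, sigma_hat = fit_gaussian_mf(points, centers, sigmas)
print(c_hat, sigma_hat)
```

A grid search is used here only because it is transparent; a gradient-based or evolutionary optimizer could replace it, and the objective could equally be an agreement measure such as the weighted kappa against a reference ranking.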
Another complementary solution involves the direct involvement of domain experts in defining and validating membership functions. Experts in FMECA can contribute valuable insights into the appropriate shape, symmetry, and distribution of the membership functions to accurately reflect real-world behavior and system thresholds. By systematically integrating expert knowledge into the fuzzy modeling process, the resulting functions can more effectively capture operational semantics and enhance model interpretability. While this introduces a degree of subjectivity, structured elicitation methods and guided parameterization tools can help standardize the process and enhance consistency across applications.
Finally, the integration of statistical validation mechanisms, such as kappa significance testing, adds rigor to the comparison between FMECA methods and elevates the evaluation beyond subjective or qualitative matching. Further research could investigate hybrid weighting schemes or the development of adaptive membership function structures that respond dynamically to the characteristics of the analyzed system, ultimately enhancing agreement with expert-based prioritization.