The Distribution and Severity of Corrosion Damage at Eight Distinct Zones of Metallic Femoral Stem Implants

Abstract: Metallic taper junctions of modular total hip replacement implants are analysed for corrosion damage using visual scoring at different levels of granularity, ranging from scoring the taper holistically to dividing the taper into several distinct zones. This study aims to objectively explore the spatial distribution and the severity of corrosion damage on the surface of metallic stem tapers. An ordinal logistic regression model was developed to find the odds of receiving a higher score at eight distinct zones of 137 retrieved stem tapers. A method to find the order of damage severity across the eight zones is introduced based on an overall test of statistical significance. The findings show that corrosion at the stem tapers occurred more commonly in the distal region than in the proximal region. Also, the medial-distal zone was found to have the most severe corrosion damage among the eight zones studied.


Introduction
Despite the clinical benefits of modularity in total hip replacement (THR) implants, modular interfaces such as the head-neck taper junction sustain mechanically assisted crevice corrosion due to relative micro-motions at the metallic interface and the presence of corrosive body fluid [1,2]. Previous studies [3-5] have reported that the solid and soluble wear debris and corrosion products released from the head-neck junction may elicit untoward host body reactions such as osteolysis, peri-prosthetic fracture, and metallosis. Depending on the intensity of these postoperative complications, revision surgeries may be needed to replace failed prostheses.
Through large-scale retrieval studies, the surface damage sustained by retrieved implants is assessed, and possible associations between several implant/patient factors and the extent/location of the damage are investigated. The severity of the damage is quantified by using visual scoring methods [6,7]. To date, many studies have applied these methods (with or without modifications) to various modular junctions [7-10]. Upon scoring the damage, each study employs causal-explanatory statistical modelling to investigate the effect of a particular set of factors (predictors) on the damage score.
In the head-neck junction, stems have a tapered geometry which can be divided into several zones (e.g., anterior, medial, posterior, and lateral quadrants). A deeper level of score granularity can provide more details about the severity and spatial distribution of damage. Distribution of the corrosion damage over the distinct zones of tapers has been investigated by a limited number of studies [8,11-19].
The number of zones scored at stem tapers has seldom gone beyond four (anterior, medial, posterior, and lateral quadrants). One reason for that could be the complexity of conducting pairwise comparisons among the levels of the zone factor. With four zones, six combinations (order disregarded) would be required. If it is desired to consider the distal and proximal regions of each quadrant as well, 28 pairwise comparisons (i.e., $\frac{8!}{(8-2)!\,2!} = 28$) would be required to investigate the damage thoroughly. The studies that scored the distal and proximal regions separately have observed different damage patterns within these regions [15,18,19]. Therefore, it is necessary to look at stem taper zones with a higher level of granularity in order to explore whether any significant difference exists between the distal and proximal regions of the quadrants.
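The comparison counts quoted above can be checked with a short sketch. The quadrant labels are placeholders chosen for illustration, and the zone abbreviations are the ones defined in the visual assessment section of this paper.

```python
from itertools import combinations
from math import comb

# Placeholder quadrant labels (anterior, medial, posterior, lateral).
quadrants = ["A", "M", "P", "L"]
# Zone abbreviations as defined in the visual assessment section.
zones = ["PD", "PP", "MD", "MP", "AD", "AP", "LD", "LP"]

# Unordered pairs: order of the two members is disregarded.
quadrant_pairs = list(combinations(quadrants, 2))
zone_pairs = list(combinations(zones, 2))

print(len(quadrant_pairs), comb(4, 2))  # both count the four-quadrant case
print(len(zone_pairs), comb(8, 2))      # both count the eight-zone case
```

Enumerating the pairs explicitly, rather than only counting them, also gives the list of comparisons that a pairwise analysis has to cover.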
This study introduces a method for addressing this gap. Using this approach, eight individual corrosion scores are assigned to eight distinct zones of each metallic stem taper. Next, an ordinal logistic regression (OLR) model is used to quantitatively compare the severity of corrosion damage at these eight zones.

Retrieved Implants Information
This study was approved by the Southern Adelaide Clinical Human Research Ethics Committee (Reference No. 485.13). A total of 137 total hip replacement implants retrieved between 1995 and 2015 at the Royal Adelaide Hospital (RAH), Adelaide, Australia were selected. The selection was limited to detached head-neck junctions so that the stem tapers were accessible for assessment. The retrieved implants had been disinfected by immersion in 70% ethanol for four days followed by a 4% Biogram solution (polyphenolic disinfectant and detergent with 18% phenol) for 48-72 h. Biologic debris (blood or proteinaceous films) had been removed using a cotton bud without abrasion. The stem tapers selected for this research were further cleaned with acetone followed by a gentle wipe with a soft nylon brush. Eleven implant/patient factors were retrieved from Our Patient Management and Outcomes Database (OPMOD) of the RAH. Table 1 provides the demography of these categorical and continuous factors. For each factor, the reported counts together with the missing entries add up to 137. Because this study only looks at the distribution and severity of corrosion, the missing patient and implant information did not pose any concern.

Visual Assessment of Corrosion Damage
Goldberg's scoring method [7] was used to inspect and rate corrosion on the stem tapers (Table 2). Based on this method, eight distinct zones of the retrieved stem tapers were scored individually. Fretting wear was not scored because several studies have reported that fretting may be masked by corrosion damage and is therefore hard to identify visually [14,15,19,20]. Also, it is thought that the severity of fretting in Goldberg's method cannot be measured consistently because the pitch of the machined threads over the taper surface varies among different stem designs [14,21]. Lastly, fretting scars can be mixed up with scratches caused by attaching or detaching the head intraoperatively [7,12,21].
To keep the scoring consistent, one trained investigator (RM) evaluated the damage. The stem tapers were visually scored twice in a random order. Each stem taper was photographed and eight zones (posterior-distal (PD), posterior-proximal (PP), medial-distal (MD), medial-proximal (MP), anterior-distal (AD), anterior-proximal (AP), lateral-distal (LD), and lateral-proximal (LP)) were identified according to our previous study [22]. Figure 1 displays an exemplary taper for each score level.

Statistical Analysis
In this study, SPSS (version 25) was used for the statistical analysis and a p-value of <0.05 was taken as the level of statistical significance. Weighted kappa (κW) with quadratic weights was run to determine the single-observer repeatability of the corrosion scores. A confusion matrix was established to quantify the disagreements. With quadratic weights, the further away a disagreement was from perfect agreement, the more heavily that disagreement was penalised. The strength of agreement based on the magnitude of the weighted kappa (κW) was interpreted according to the guideline reported in Landis et al. [23].
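The following is a minimal sketch of how a quadratic weighted kappa can be computed from two rounds of scores by way of a confusion matrix. It is not the SPSS procedure used in this study; the function name and the two short score lists are hypothetical and serve only as an illustration.

```python
def quadratic_weighted_kappa(first, second, levels=(1, 2, 3, 4)):
    """Cohen's weighted kappa with quadratic weights for two score lists."""
    if len(first) != len(second) or not first:
        raise ValueError("both scoring rounds must be non-empty and equally long")
    k = len(levels)
    index = {level: i for i, level in enumerate(levels)}
    n = len(first)

    # Confusion matrix: rows = first scoring round, columns = second round.
    observed = [[0] * k for _ in range(k)]
    for a, b in zip(first, second):
        observed[index[a]][index[b]] += 1

    row_totals = [sum(row) for row in observed]
    col_totals = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            # Quadratic weight: grows with the squared distance from the diagonal.
            weight = (i - j) ** 2 / (k - 1) ** 2
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * row_totals[i] * col_totals[j] / n
    if weighted_expected == 0.0:
        raise ValueError("kappa is undefined when all scores fall in one level")
    return 1.0 - weighted_observed / weighted_expected


# Hypothetical scores for eight zones, not data from this study.
round_1 = [1, 2, 2, 3, 4, 2, 1, 3]
round_2 = [1, 2, 3, 3, 4, 2, 2, 3]
kappa_example = quadratic_weighted_kappa(round_1, round_2)
print(round(kappa_example, 3))
```

Because the weight is the squared distance between the two scores, a disagreement of two levels counts four times as much as a disagreement of one level.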
Having an ordinal dependent variable (DV) as the response, OLR was employed to capture the ordered nature of the DV levels. The OLR model in this study uses cumulative logits. Selection of cumulative logits against other models (e.g., adjacent or continuation categories) was due to the interest of this study to use the entire response scale regardless of the score level.
Consequently, the cumulative odds OLR came with the proportional odds constraint to ensure that the regression lines across the DV levels are parallel. This OLR model divides the categories of the ordinal DV to run cumulative logits, as demonstrated in Table 3.

Table 3. An ordinal dependent variable (DV) with four levels giving three cumulative probabilities and consequently logits.
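How four ordered levels give three cumulative probabilities and logits can be illustrated numerically. The sketch below assumes the cumulative probability is defined as P(score <= j), which is one common convention, and uses the overall score frequencies reported in the results of this paper (359, 512, 174, and 51). The function name is hypothetical.

```python
import math


def cumulative_logits(counts):
    """Return (P(Y <= j), logit) pairs for j = 1 .. J-1 from ordered category counts."""
    total = sum(counts)
    results = []
    running = 0
    # The last level is skipped: P(Y <= J) is always 1 and has no finite logit.
    for count in counts[:-1]:
        running += count
        p = running / total
        results.append((p, math.log(p / (1.0 - p))))
    return results


# Overall frequencies of score levels 1 to 4 as reported in this paper.
score_counts = [359, 512, 174, 51]
logit_table = cumulative_logits(score_counts)
for level, (p, logit) in enumerate(logit_table, start=1):
    print(f"P(score <= {level}) = {p:.3f}, logit = {logit:+.3f}")
```

Each row corresponds to one split of the response scale (level 1 versus 2 to 4, levels 1 to 2 versus 3 to 4, and levels 1 to 3 versus 4), so every split uses all of the scores.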

The Ordinal Logistic Regression (OLR) Assumptions
Before deploying an OLR model, four assumptions (constraints) needed to be considered to ensure the validity of the results. The first assumption requires the DV (visual scores) to have an ordinal level of measurement, which holds here. Under the second assumption, there should be at least one independent variable (IV) that is continuous, ordinal or categorical (including dichotomous variables), which holds as well.
The other two assumptions are related to the characteristics of the data. The third assumption requires no multi-collinearity between the IVs. It was checked using the collinearity diagnostics under linear regression, which return the variance inflation factor (VIF). The VIF indicates to what extent a particular IV contributes to multi-collinearity issues within the dataset. In this study, as a rule of thumb, VIF values beyond 10 were considered to indicate multi-collinearity.
The fourth assumption checks for having proportional odds. Here, the test of parallel lines was used to compare the fit of the proportional odds model to a model with varying slope coefficients. It was desired not to reject the null hypothesis that states the slope coefficients are the same across the three cumulative regression models. If true, the effect of each IV will be identical at each cumulative logit which is desired here.

Overall Parameter Estimates
As pointed out earlier, the type of OLR model used in this study produces an equation for each cumulative logit. As there are four categories of the DV, three cumulative logits (Equations (1)-(3)) are expected. Also, the assumption of proportional odds constrains the slope coefficients to be the same for all three equations, so only the thresholds may vary between the three equations.
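The three equations are not reproduced in this text. As a hedged reconstruction, a common parameterisation of the proportional odds model for a four-level DV is sketched below. The symbols $\theta_j$ (thresholds), $\beta_k$ (shared slope coefficients), and $x_k$ (IVs, here indicator variables for the non-reference zones) are assumptions introduced for illustration, and sign conventions differ between software packages. The equation tags only mirror the numbering referred to above.

```latex
\begin{align}
\ln\frac{P(Y \le 1)}{P(Y > 1)} &= \theta_1 - \sum_{k} \beta_k x_k \tag{1} \\
\ln\frac{P(Y \le 2)}{P(Y > 2)} &= \theta_2 - \sum_{k} \beta_k x_k \tag{2} \\
\ln\frac{P(Y \le 3)}{P(Y > 3)} &= \theta_3 - \sum_{k} \beta_k x_k \tag{3}
\end{align}
```

Under this sign convention, only $\theta_1 \le \theta_2 \le \theta_3$ differ between the equations, and $e^{\beta_k}$ is the odds ratio of receiving a higher score for a unit change in $x_k$ at every split of the response scale.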
Since changes in log odds do not have much intuitive meaning, the ratio of the odds between any two categories or for a unit change in a numerical IV is reported. The odds ratio (OR) was calculated as the exponential of the slope coefficient, which is expressed in log odds. Also, the 95% confidence intervals of the OR and the significance levels are reported.
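The conversion from a slope coefficient to an odds ratio can be sketched as follows. This assumes a Wald-type 95% confidence interval computed on the log odds scale, which is a common choice but is not confirmed by the text; the coefficient and standard error are hypothetical.

```python
import math


def odds_ratio_with_ci(beta, std_error, z=1.959964):
    """Convert a slope coefficient (log odds ratio) to an OR with a Wald interval."""
    lower = math.exp(beta - z * std_error)
    upper = math.exp(beta + z * std_error)
    return math.exp(beta), lower, upper


# Hypothetical coefficient and standard error, not values from this study.
or_value, ci_low, ci_high = odds_ratio_with_ci(beta=0.5, std_error=0.2)

# An interval that excludes 1 corresponds to a significant OR at the 5% level.
significant = not (ci_low <= 1.0 <= ci_high)
print(f"OR = {or_value:.2f} (95% CI {ci_low:.2f} to {ci_high:.2f}), significant: {significant}")
```

Exponentiating the interval bounds rather than building the interval around the OR keeps the interval positive and asymmetric, as expected on the odds scale.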
Unlike the numerical and dichotomous IVs, zone, as a polytomous IV, demands additional calculations to complete an overall test of statistical significance. To exhaust all pairwise comparisons between the categories, one category was taken as the reference, and the rest were compared with it as primary categories. For each significance test, the zone variable had to be recoded into a new variable with the desired reference category coded as the last category (highest level).
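The recoding step can be sketched as below, assuming that the fitting software treats the highest code as the reference category, as the text describes. The helper name and the small zone column are hypothetical.

```python
ZONES = ["PD", "PP", "MD", "MP", "AD", "AP", "LD", "LP"]


def codes_with_reference_last(levels, reference):
    """Assign integer codes so that the chosen reference zone gets the highest code."""
    if reference not in levels:
        raise ValueError(f"unknown reference zone: {reference}")
    ordered = [level for level in levels if level != reference] + [reference]
    return {level: code for code, level in enumerate(ordered, start=1)}


# Hypothetical zone column for five scores, not data from this study.
zone_column = ["MD", "PP", "LD", "MD", "AP"]

# One recoded variable per reference zone; each would be used in a separate model fit.
recoded = {}
for reference in ZONES:
    codes = codes_with_reference_last(ZONES, reference)
    recoded[reference] = [codes[zone] for zone in zone_column]

print(recoded["MD"])  # scores taken at the MD zone carry the highest code here
```

Each choice of reference yields seven comparisons, so cycling through all eight references covers every pair of zones twice, once in each direction.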

Results
For the assessment of intra-observer repeatability, the weighted kappa (κW) with quadratic weights indicated a statistically significant agreement, κW = 0.64 (95% CI, 0.59 to 0.69), p < 0.001 between the two sets of scores. According to [23], the strength of the agreement was classified as good. Before using the OLR model, preliminary data analysis was carried out by looking at frequency histograms of the scores in different zones. Using various bin sizes and definitions, a number of different histograms were generated to graphically summarise the distribution of scores across the eight taper zones.

Distribution of Corrosion Scores
Visual scoring of the 137 stem tapers across the eight zones resulted in 1096 corrosion scores. Table 4 summarizes the frequency of each score level. Score level 2 had the highest quantity (512) while the lowest quantity (51) belonged to score level 4. Because the score levels are unbalanced, the first two score levels, which are higher in quantity (i.e., 359 and 512), always show higher percentages than score levels 3 and 4 within each zone.
To better compare the severity of damage across the zones, two more configurations of scores (obtained by combining the original score levels) were also explored. The second configuration groups the first two and the last two score levels into low and high groups, respectively. Figure 3 visualizes this configuration and compares each score group across the eight zones. As expected, the low score group, which comprises (359 + 512) scores, has a higher frequency than the high score group (174 + 51). This configuration can better show which zones have more severe corrosion damage (for example, the MD and LD zones). Also, the smallest and largest gaps between these two combined score levels were observed at zones MD and PP, respectively. The third configuration preserves score level 1 and combines the other three score levels to form two new score groups of intact and corroded stem tapers. Figure 4 illustrates the frequencies of these two score groups.
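The arithmetic behind the three score configurations can be reproduced from the overall frequencies quoted above. This is only a sketch of the grouping at the level of the whole data set; the per-zone frequencies shown in the figures are not available in this text and are not reconstructed here.

```python
# Overall frequency of each score level, as reported above.
counts = {1: 359, 2: 512, 3: 174, 4: 51}
total = sum(counts.values())  # 137 tapers x 8 zones

# Original configuration: percentage of each of the four score levels.
percent = {level: 100.0 * n / total for level, n in counts.items()}

# Low/high configuration: levels 1-2 versus levels 3-4.
low_high = {"low": counts[1] + counts[2], "high": counts[3] + counts[4]}

# Intact/corroded configuration: level 1 versus levels 2-4.
intact_corroded = {"intact": counts[1], "corroded": counts[2] + counts[3] + counts[4]}

print(total, low_high, intact_corroded)
print({level: round(p, 1) for level, p in percent.items()})
```

Collapsing levels trades detail for larger groups, which makes differences between zones easier to see in a histogram but discards the ordering within each group.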
The medial distal zone had the largest difference between these two score groups which confirms that this particular zone is most damaged. Also, the posterior-proximal zone had the smallest difference between the two score groups (thus least damaged). As a key finding, the distal regions of the four quadrants showed more corrosion damage compared with the proximal regions.
These findings from the histograms can shed light on the likely outcome of the OLR model. In particular, when the number of DV levels is higher, cumulative logit models may become infeasible. Histograms can help determine which score levels are more important to compare using other types of OLR models, such as adjacent categories.

Comparison of Corrosion in the Zones
Cumulative odds OLR with proportional odds was employed to conduct pairwise comparisons between the zones. First, it was established whether zone is statistically significant overall. From the model test performed in SPSS, zone was observed to be a statistically significant (p = 0.002) predictor of corrosion scores in this univariate regression model.
Since no specific zone was of prior interest, 28 pairwise comparisons had to be undertaken, which incurred additional calculations to obtain the overall omnibus statistical test. Table 5 summarizes the OR, p-values, and confidence intervals. Significant OR values are highlighted in grey. In this table, each zone has been used seven times either as the primary or reference (inside brackets) group to exhaust the combinations. OR values below 1 indicate that, for the primary category, the odds of having a higher corrosion score are lower than those of the reference category.

The reciprocal of an odds ratio can be calculated to compare a reference group with a primary group. To compare the severity of corrosion across all eight zones, the odds ratios were sorted and plotted (Figure 5). The red and blue bars indicate the significant and insignificant OR values, respectively. An OR equal to 1 indicates equal odds of observing a higher corrosion score at the primary and reference zones. Moving away from unity, the odds ratios that are at first insignificant eventually become significant. How quickly this transition takes place depends on the chosen level of statistical significance.
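The reciprocal relationship and the sorting step can be sketched as follows. The odds ratios and confidence bounds are hypothetical placeholders, not values from Table 5, and significance is judged here by whether the 95% confidence interval excludes 1, which is an assumption of this sketch.

```python
def flip_comparison(or_value, ci_low, ci_high):
    """Swap primary and reference: take reciprocals and swap the interval bounds."""
    return 1.0 / or_value, 1.0 / ci_high, 1.0 / ci_low


# Hypothetical (primary, reference) -> (OR, CI low, CI high); not values from Table 5.
comparisons = {
    ("LD", "PP"): (1.60, 1.05, 2.44),
    ("PP", "MD"): (0.50, 0.33, 0.76),
    ("AD", "MD"): (0.80, 0.52, 1.22),
}

# Sort the comparisons by OR, as would be done before plotting them as bars.
ranked = sorted(comparisons.items(), key=lambda item: item[1][0])
labelled = []
for (primary, reference), (or_value, low, high) in ranked:
    significant = not (low <= 1.0 <= high)
    labelled.append((primary, reference, or_value, significant))
    print(f"{primary} vs {reference}: OR = {or_value:.2f}, significant: {significant}")

# Reversing the direction of a comparison keeps its significance but inverts the OR.
flipped = flip_comparison(0.5, 0.25, 0.8)
```

In the sorted list, the comparisons nearest to an OR of 1 are the ones whose intervals include 1, which matches the transition from insignificant to significant described above.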
For each zone, Table 5 provides seven OR values in which that zone appears as either the primary or the reference, and the severity of corrosion at each zone relative to the others was assessed from these values. Table 6 sorts the eight zones from the least to the most severely damaged according to the value of C1 + C2. This value quantifies how many times each zone had a higher likelihood of damage compared with the other seven zones throughout the 28 pairwise comparisons. C1 indicates how many times a particular zone, as the primary, had an OR value above 1, while C2 indicates how many times that same zone, as the reference, had an OR value below 1. Therefore, both C1 and C2 reflect how often each zone appeared as more severely damaged than the other zones. Zones PP and MD were identified as having the lowest and highest severity of corrosion, respectively. Interestingly, the proximal and distal zones were found to group together in this table, with the distal region showing more damage than the proximal region across the four quadrants of the studied stem tapers.
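The C1 + C2 count described above can be sketched with a small example. The function name is hypothetical and the three odds ratios are made-up placeholders for three zones only, not values from Table 5; the full analysis would hold 28 entries, one per pair of zones.

```python
from collections import Counter


def severity_counts(pairwise_or):
    """Return {zone: (C1, C2, C1 + C2)} from a {(primary, reference): OR} mapping."""
    c1 = Counter()  # zone was the primary and the OR was above 1
    c2 = Counter()  # zone was the reference and the OR was below 1
    zones = set()
    for (primary, reference), or_value in pairwise_or.items():
        zones.update((primary, reference))
        if or_value > 1.0:
            c1[primary] += 1
        elif or_value < 1.0:
            c2[reference] += 1
    return {zone: (c1[zone], c2[zone], c1[zone] + c2[zone]) for zone in zones}


# Hypothetical odds ratios, each unordered pair listed once.
example_or = {
    ("AD", "PP"): 1.5,
    ("AD", "MD"): 0.7,
    ("PP", "MD"): 0.4,
}
scores = severity_counts(example_or)

# Sort from the least to the most severely damaged zone.
order = sorted(scores, key=lambda zone: scores[zone][2])
print(order, scores)
```

Note that this count uses only the direction of each odds ratio, so two zones can differ in rank even when the comparison between them is not statistically significant.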

Discussion
Eight distinct zones of the stem tapers including anterior-distal, anterior-proximal, medial-distal, medial-proximal, posterior-distal, posterior-proximal, lateral-distal, and lateral-proximal were scored and statistically compared to identify the zone(s) with the most severe corrosion damage in the retrieved implants studied in this work. It is noted that there are several studies in the literature that chose to score stem tapers holistically, not locally [9,10,24-26].
Within the studies [11,12,15-19] that scored stem tapers locally, the pools of implants had a limited diversity in terms of implant properties (e.g., head diameter, articulation type, and stem design). Therefore, it was deemed necessary to explore whether a similar distribution of corrosion damage can be seen in a more heterogeneous pool of implants.
To the best of our knowledge, there are only two studies [18,19] in the literature that, similar to this work, have assigned eight local scores to the stems with the rest using lower numbers of zones. In those two studies, one did not compare the scores between the zones [18]. The other compared the four quadrants, and the two distal and proximal regions separately in terms of corrosion severity and did not determine which zone(s) had the most severe damage [19].
Routine causal-explanatory statistical analyses require only one score as the descriptor of damage for each implant. The majority of these studies have chosen to combine the local scores by calculating an overall value [8,11-14]. This approach has led to the presumption that this global score is a continuous variable, and thus statistical analyses for continuous variables have been utilised. Analysing a continuous variable with an interval or ratio level of measurement is generally less complex in nature. However, an increased number of levels in the global score does not necessarily imply a known "distance" between the score levels. Therefore, this approach was treated with suspicion in this study and was not adopted.
Here, the corrosion scores were analysed using a univariate OLR model, and the odds ratios along with their p-values were reported. Since there was no particular hypothesis about the relative level of corrosion at the eight zones, 28 pairwise comparisons were carried out to cover every pair of zones. The distal region of the medial quadrant was found to have the highest odds of receiving a higher corrosion score, which is aligned with previous findings in the literature that identified the distal region [19,20,27] and the medial quadrant [7,10,16,28] as having the highest corrosion scores. Also, this study shows that the distal region of all four quadrants had more corrosion damage than the proximal region of those quadrants. Therefore, it was found that, regardless of the quadrant, corrosion damage is more present distally than proximally.
Generally, the higher severity of wear or corrosion at a specific zone has been attributed to several factors such as increased micro-motions at the interface, head or stem materials, head diameter, high friction moments, and poor lubrication of the bearing articulation. While some act as root causes, the others play the role of causal factors. Also, damage at the head-neck taper junction usually appears as a combination of wear and corrosion mechanisms. Some of these factors may only contribute to a specific mode of damage, while others may contribute towards a set of damage mechanisms.
In a retrieval study of 231 implants [7], the stem tapers received four fretting and corrosion scores corresponding to the four quadrants. The medial and lateral scores were observed to be significantly higher than the scores at the other two quadrants (posterior and anterior). This was explained to be due to a higher likelihood of micro-motions between the head and neck about an axis in the sagittal plane. Similar to the present study, the pool of implants in that work had a wide diversity, and the higher corrosion scores at the medial quadrant suggest that this could be a phenomenon independent of the included patient and implant factors.
Wilson et al. [29] explained how at the double-tapered cone design of Profemur Z, the proximal end of the neck experiences an almost pure compression and shear loading. High frictional moments at taper junctions were related to poor lubrication of the articulation interfaces by another study [30].
The medial quadrant was identified to have higher corrosion scores in a retrieval study of 52 S-ROM components [16]. It was hypothesised that greater micro-motions at this quadrant could result in a more frequent disruption of the passive oxide layer and, consequently, more severe corrosion damage. Similar to the conclusion of the Wilson et al. [29] study, they reported that this region is generally under a compression-loading regime. A computational model of the stresses in stem tapers paired with large diameter heads confirmed this hypothesis after showing maximum levels of principal stresses at the medial quadrant [31]. In that work, a 3D model of a 12/14 titanium taper was paired with cobalt-chromium and alumina heads. Increasing the head diameter significantly increased this quadrant's stresses distal to the junction. It was highlighted that the pairing of a small taper and a large head leads to a larger moment arm transmitting a higher force to a small surface area, which facilitates tribo-corrosion.
A relatively higher amount of load and stress at the medial quadrant causes elastic strains which appear as surface compression. This condition may lead to micro-motions of approximately 5 to 40 µm [32] which in turn may result in abrasion or fracture of the oxide layer. The subsequent changes in the metal surface potential and the continuous re-passivation of the oxide layer change the chemistry of the crevice solution. Ultimately, the deaeration and pH decrease of the solution initiate crevice corrosive attacks [33,34]. Crevice corrosion has been reported to occur near the bore opening which may explain observing more severe corrosion at the distal region [35].
Besides micro-motions, galvanic corrosion at this interface due to using mixed metal components is a potential source of material loss. In this study, 18 (13.1%) implants had mixed head and stem materials, whereas 45 (32.8%) had similar materials. Therefore, galvanic corrosion cannot be nominated as the sole mechanism of corrosion.
These studies have used relatively homogeneous pools of implants, yet they observed higher levels of corrosion at the medial quadrant or distal zones of stem tapers. The present study shows that the distal region of the medial quadrant sustains the most severe corrosion damage; from this it is understood that this particular zone is the most severely damaged of all the zones, regardless of the implant properties and patient characteristics of the investigated pool of implants.

Conclusions
This study introduces a method to statistically compare the severity of corrosion damage at eight distinct zones of stem tapers. A pool of 137 retrieved total hip replacement implants was visually scored at the head-neck junction for corrosion damage using Goldberg's method. An OLR model with proportional odds was used to determine the odds of observing a higher score at a primary zone compared with a reference zone. The findings of this study can be highlighted as below:

• The corrosion score level 2 was observed to have the highest frequency (46.7%).
• Posterior-proximal and medial-distal were identified as the zones with the least and most severe corrosion, respectively.
• Interestingly, the proximal and distal regions were found to be grouping together with the distal region showing more damage in comparison with the proximal region of the four quadrants.
• Out of the 28 pairwise comparisons of these eight zones, nine pairs of zones were identified to be significantly different regarding corrosion damage. This observation objectively shows the high diversity in corrosion damage across these zones.
• Retrieval studies of taper junctions are, therefore, recommended to score the zones separately and avoid adding up local scores to be used with an interval or ratio level of measurement.

Funding: This research received no external funding.