Integrated Testing Strategies for Skin Sensitization Hazard and Potency Assessment—State of the Art and Challenges

The paper provides an overview of existing Integrated Testing Strategies (ITS) for assessing hazard and potency of skin sensitization. ITS research is active, diverse and constantly evolving as new assays are developed and new mechanistic insights are discovered. Despite the need to assess potency, the majority of the ITS approaches developed to date assess hazard only. Reasons for this situation are analyzed and include, for example, the dynamic range of existing alternative assays versus the range of in vivo responses, as well as the sporadic use of kinetic information and molar units. Depending on the application, regulatory or product development, standardized and nonstandard ITS approaches will be developed. Challenges to practical applications, with a focus on regulatory use, are discussed.


Introduction
Skin sensitization is one of the critical endpoints when assessing the safety of cosmetic ingredients. Legislative changes increasingly mandate that skin sensitization potential be assessed with non-animal methods. Since 2013, the European Union has banned the use of animals for testing cosmetic products and ingredients [1]. Similar legislation may be proposed in the United States in 2015 [2]. These changes have accelerated the mechanistic understanding of skin sensitization [3,4] and produced many promising novel alternatives to animal tests [5]. The new data streams are heterogeneous in metrics, levels of biological organization and time scales of the biological events they address. Examples are omics data, cell-based assays with dose-response curves for both cell marker induction and cytotoxicity, reactivity assays of reaction kinetics or peptide depletion, and diverse in silico readouts: molecular descriptors, structural alerts, predictions of in vivo response. In an effort to organize these data in the context of evolving mechanistic knowledge to understand the adverse effects of chemicals, the Organization for Economic Co-operation and Development (OECD) coordinates the development of Adverse Outcome Pathways (AOP) [6,7,8]. The AOP for skin sensitization triggered by chemicals that bind covalently to proteins [9] includes four key events (KE) that occur after a substance (parent chemical or abiotically transformed product) penetrates through the skin and is potentially transformed into active metabolites: KE1: covalent binding to skin proteins; KE2: activation of inflammatory cytokines and induction of cyto-protective genes in the keratinocyte; KE3: activation (induction of inflammatory cytokines and surface molecules) and mobilization of dendritic cells in the skin; KE4: activation and proliferation of antigen-specific T-cells. Reisinger et al. [5] provide a very good overview of how different existing assays map onto the AOP.

ITS Approaches State of the Art
Today the critical open question is how to translate AOP information into predictive, mechanistically interpretable ITS frameworks that are reliable, applicable to broad chemical classes and accessible to many stakeholders. These frameworks need to serve hazard assessment, classification and labeling, and eventually risk assessment, each of which sets different success criteria in the spirit of fit for purpose. In addition, there is strong interest in keeping these new frameworks time- and cost-effective.
Even before structuring of the existing mechanistic knowledge into the AOP, several authors argued that the process of skin sensitization is too complex to be able to predict an adverse in vivo outcome using a single alternative assay. The need for an integrated testing strategy (ITS) approach was very convincingly illustrated by Jowsey et al. [10] and later reiterated by Basketter and Kimber [11].
Historically, expert approaches, widely known as weight-of-evidence (WoE) approaches, have been used to integrate existing information and determine the need for additional testing. Because experts frequently differ in opinion, expert-based approaches suffer from subjectivity and a lack of consistency and transparency [12], and they tend to err on the conservative side when in doubt or when the available evidence is conflicting. To make the situation even more challenging for traditional expert-based evaluations, the amount of available alternative assay data is, as mentioned earlier, increasing rapidly and becoming more diverse. Because of their qualitative underpinning, expert-based approaches have been used for hazard assessments but struggle with potency assessments, which clearly need quantitative approaches.
The main drivers for developing approaches that predict potency are classification under the Globally Harmonized System (GHS) [13] and quantitative risk assessment. However, the majority of ITS approaches developed to date address hazard only. Predicting hazard is far easier than predicting potency. Why is predicting potency so difficult? First, as measured by the murine local lymph node assay (LLNA), skin sensitization potency may span four orders of magnitude [14]. Second, existing alternative test methods are deemed inappropriate for potency predictions [15,16], partly due to an insufficient dynamic range. Third, it is not clear what measurements, other than reaction kinetics, are necessary to predict potency [11,17]. Fourth, measurements of in vivo potency have substantial variability due to inherent biological variability and the use of diverse vehicles in in vivo experiments [18]. Given this variability, what exactly is to be predicted needs to be formulated carefully: deterministic predictions of a single value are likely to result in poorly predictive models. Finally, there is probably a fifth element: units. Predictive models, especially for potency, need to be grounded in chemistry and biology over the entire dynamic range of responses, and expressing potency in molar units is necessary to achieve this. Until now, many predictive models have been constructed to predict the in vivo result expressed on a weight basis in order to directly meet existing regulatory needs. While this seems pragmatic, it may compromise the robustness of potency predictions. Hazard predictions are not affected by the unit issue.
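The unit argument can be illustrated with a small calculation. The sketch below assumes the frequently used definition pEC3 = log10(MW / EC3), with EC3 given in percent by weight and MW in g/mol; this definition, the function name and the two example chemicals (molecular weights and EC3 values) are illustrative assumptions and are not taken from the works cited here.

```python
import math


def pec3(ec3_percent: float, mol_weight: float) -> float:
    """Molar-based potency, assuming pEC3 = log10(MW / EC3).

    ec3_percent: LLNA EC3 value in percent by weight (assumed unit).
    mol_weight:  molecular weight in g/mol.
    """
    if ec3_percent <= 0 or mol_weight <= 0:
        raise ValueError("EC3 and molecular weight must be positive")
    return math.log10(mol_weight / ec3_percent)


# Two hypothetical chemicals with the same weight-based EC3 of 1%.
light = pec3(ec3_percent=1.0, mol_weight=100.0)
heavy = pec3(ec3_percent=1.0, mol_weight=400.0)

# On a weight basis both look equally potent; on a molar basis the heavier
# chemical is more potent, because the same mass contains fewer molecules.
print(f"pEC3 (MW 100): {light:.2f}")
print(f"pEC3 (MW 400): {heavy:.2f}")
```

Under these assumptions, the two chemicals are indistinguishable on a weight basis but differ by about 0.6 log units on a molar basis, which is the kind of shift that a potency model grounded in chemistry would need to capture.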
The majority of ITS approaches focus on data integration and require that all input data be available at the onset of the process to make a prediction for a given chemical. They use hybrid sets of inputs, most often combinations of physico-chemical properties, in silico predictions, and experimental data from one or more in vitro assays representing key events. The choice of the tests (i.e., variables in the ITS) is most often contingent on the origin of the assays, as assay developers explore the utility of their assays in the larger context of ITS. The simplest approach is based on majority voting from the outcome of three in vitro tests [29]. Approaches based on machine learning algorithms are very popular. Among them are linear regression-based methods [23,24,30] and nonlinear methods like neural networks [19,20], support vector machines [31] and random-forest models [27].
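The majority-voting idea can be sketched in a few lines. The example below is a minimal illustration, assuming that each of three assays yields a binary positive/negative call; the assay names used as dictionary keys are placeholders chosen for readability and are not necessarily the assays used in [29].

```python
from typing import Mapping


def majority_vote(calls: Mapping[str, bool]) -> bool:
    """Return True (predicted sensitizer) if at least two of three calls are positive."""
    if len(calls) != 3:
        raise ValueError("exactly three assay calls are expected")
    positives = sum(1 for is_positive in calls.values() if is_positive)
    return positives >= 2


# Hypothetical assay outcomes for one chemical (placeholder assay names).
example_calls = {"assay_KE1": True, "assay_KE2": False, "assay_KE3": True}
print(majority_vote(example_calls))  # two of three positive, so True
```

Such a rule yields a hazard call only; it carries no information about potency and gives every assay the same fixed weight.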
The common characteristic of these models is that the underlying model structure is dictated by the chosen machine learning algorithm while parameters (such as regression coefficients) are data-driven. As such, despite the fact that they use mechanistically relevant input data, these approaches do not have the ability to make mechanistically interpretable integrated predictions because they exploit correlative dependencies and not causal dependencies between inputs and outputs.
While the above-mentioned approaches rely heavily on biological tests to explain and predict sensitization potency, there is another view: that skin sensitization can be predicted by an in-depth understanding of chemistry alone [17,24]. A chemistry-based approach combines knowledge of reactivity domains, reactivity measurements, molecular descriptors of reactivity and reaction kinetics to predict sensitization potency. Although the results published to date indicate that it is equally viable [17,24], this school of thought has received less attention. The chemistry-based approach overcomes the perceived dynamic range limitation of biological assays by using the kinetic rate as one of the key variables. Potency is always expressed in molar units. The difficulty with a chemistry-oriented approach is the standardization of measurements, especially kinetic rates and descriptors, over broad chemical classes; therefore, practical application is lagging except for well-defined reactivity domains and chemical classes.
To bridge biology and chemistry oriented approaches, Natsch et al. [23] and Patlewicz et al. [32] developed ITS frameworks based on local reactivity domain models, physico-chemical properties, structural alerts, in silico simulators of skin metabolism, auto-oxidation, hydrolysis, and in vitro experimental data. While the approach in [23] is quantitative and estimates the pEC3, a measure of potency, the approach taken by Patlewicz et al. [32] does not have a built-in algorithm to make quantitative estimates. It is a WoE tool with decisions driven by expert opinion. The WoE tool is limited to hazard characterization and the authors suggest that it be used in read-across determinations. The work of Natsch et al. [23] and Patlewicz et al. [32] serves as inspiration to better integrate knowledge of reaction chemistry in ITS frameworks, which other approaches tend to lack.
There are few ITS approaches in which the model structure encodes the skin sensitization process, i.e., a fully mechanistic framework, because of the complexity and residual uncertainty of the process. It is a very challenging task to formalize this process into equations and then populate the model with data for the parameters. To this end, Maxwell and Mackay [33] published a pharmacodynamic-pharmacokinetic model of human immune response to chemical agents, represented by a set of differential equations describing underlying chemical and biological dynamic processes of mass transport, reaction kinetics, cell population dynamics and receptor binding events. To date the model has been parameterized only for 2,4-dinitrochlorobenzene, and its readiness for routine risk assessment awaits parameter estimation for other chemicals.
Another possibility is to incorporate existing, imperfect mechanistic information into a model structure by using a Bayesian network (BN). The Bayesian network is uniquely suited to represent uncertain knowledge when one knows which variables, not necessarily all, are important in the process of interest, but where the relationships between the variables are not well characterized, or complex, or both. The utility of BN methods to serve as a framework for ITS approaches has been discussed by Jaworska et al. [34]. We later applied this to skin sensitization class potency prediction as described below.

Sequential Strategies
Some of the ITS approaches do not require all data to be entered at the beginning of the evaluation process. If the tests are executed in a tiered sequence, there is a potential for generating less data and saving cost. Sequential strategies require the development of rules for when to stop testing and make a decision, as well as rules for when to proceed to additional data generation. Prediction models of standalone assays are used as the foundation of the rules and thresholds. In these approaches, if a final result is equivocal, expert opinion is needed to complete the prediction. Examples of sequential strategies include the human cell line activation test (h-CLAT)-direct peptide reactivity assay (DPRA) sequence [35], and the sequence of DPRA, KeratinoSens or the expression of 10 biomarker genes in a human HaCaT keratinocyte cell line, and h-CLAT [36]. In addition, in [36] the results of four in silico approaches (MultiCASE, CAESAR, DEREK, and the OECD QSAR Toolbox) are combined into a single result using a naive Bayesian technique. While the accuracy of the mentioned tiered approaches on the training set was high, evaluations on an external test set have not been published.
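The logic of a tiered sequence with stopping rules can be sketched as follows. This is a deliberately simplified illustration, not the decision scheme of [35] or [36]: it assumes that each tier returns a positive, negative or equivocal call, that testing stops at the first conclusive call, and that an expert review is requested if every tier is equivocal. All names are hypothetical.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# A tier is a name plus a function that runs the assay and returns
# True (positive), False (negative) or None (equivocal).
Tier = Tuple[str, Callable[[], Optional[bool]]]


def run_tiered(tiers: Sequence[Tier]) -> Tuple[str, List[str]]:
    """Run assays in order; stop at the first conclusive call (assumed stop rule)."""
    performed: List[str] = []
    for name, run_assay in tiers:
        call = run_assay()
        performed.append(name)
        if call is not None:
            decision = "sensitizer" if call else "non-sensitizer"
            return decision, performed
    # No tier was conclusive: the prediction must be completed by an expert.
    return "equivocal: expert review", performed


# Hypothetical chemical: the first tier is equivocal, the second is positive,
# so the third assay is never run.
decision, performed = run_tiered([
    ("tier_1", lambda: None),
    ("tier_2", lambda: True),
    ("tier_3", lambda: False),
])
print(decision, performed)
```

In this sketch the saving comes from the third assay not being performed; in practice, the stop rules and thresholds would be derived from the prediction models of the individual assays.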
Recently, Jaworska et al. [28] published a BN ITS (ITS-3), the third generation of the BN ITS [25,26], as a decision support system for a risk assessor that provides a quantitative weight of evidence, leads to a mechanistically interpretable potency hypothesis, and formulates an adaptive testing strategy for a chemical. The ITS-3 structure represents the first three key events of the adverse outcome pathway [9]. The KEs are represented by three validated assays for skin sensitization: DPRA, KeratinoSens and h-CLAT. Corrections for bioavailability, both in vivo and for all three assays, are included in the ITS model structure. The skin sensitization potency prediction is provided as a probability distribution over four pEC3 potency classes. This allows any percentile of the distribution to be calculated; thus the approach is fully quantitative. The prediction process is very structured and takes into account the individual assays' applicability domains. It exploits the fact that the BN ITS framework can build a hypothesis with incomplete data records. Physico-chemical properties (water solubility and fraction ionized) define bioavailability-oriented applicability domains of individual assays. Further, the ITS-3 uses TIMES [37] to form an in silico hypothesis about potency as well as to assess a chemical's potential for being a pre- or pro-hapten. Data outside the applicability domains are excluded from the integrated prediction or, in the case of pre- or pro-haptenation, are treated with caution in the prediction process. Eliminating data on the basis of being outside the applicability domains is a key tool for removing data conflict and reducing mispredictions. The accuracy for predicting LLNA outcomes on the external test set (n = 60) depends on the particular regulatory application: hazard (two classes), 100%; GHS potency classification (three classes), 96%; potency (four classes), 89%.
The authors argued that this work demonstrates that skin sensitization potency prediction based on data from three key events, and often less, is possible, reliable over broad chemical classes, and ready for practical applications.
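How a probability distribution over ordered potency classes supports quantitative statements can be sketched briefly. The example below is not the ITS-3 model; the class labels and the probabilities are hypothetical, and the functions only show how a percentile and a hazard probability follow from such a distribution.

```python
from itertools import accumulate
from typing import Sequence

# Four ordered potency classes, from least to most potent (assumed labels).
CLASSES = ["non-sensitizer", "weak", "moderate", "strong"]


def class_at_percentile(probs: Sequence[float], q: float) -> str:
    """Return the first class at which the cumulative probability reaches q."""
    if len(probs) != len(CLASSES) or abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probs must cover all classes and sum to 1")
    if not 0.0 < q <= 1.0:
        raise ValueError("q must lie in (0, 1]")
    for label, cumulative in zip(CLASSES, accumulate(probs)):
        if cumulative >= q - 1e-12:  # small tolerance for rounding error
            return label
    return CLASSES[-1]


def hazard_probability(probs: Sequence[float]) -> float:
    """Probability that the chemical is a sensitizer of any potency."""
    return 1.0 - probs[0]


# Hypothetical posterior distribution for one chemical.
posterior = [0.05, 0.15, 0.60, 0.20]
print(class_at_percentile(posterior, 0.50))   # median class
print(class_at_percentile(posterior, 0.95))   # conservative upper percentile
print(round(hazard_probability(posterior), 2))
```

Reporting the full distribution, rather than only the most probable class, lets the assessor choose a percentile that matches the level of conservatism required by the application.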

Making Decisions with ITS
From a policy perspective, the value of a model-based analysis lies not simply in its ability to generate a point estimate for a specific outcome, but also in the systematic examination and reporting of the uncertainty surrounding the prediction and the ultimate decision to which it is applied. Deterministic frameworks are limited in their ability to correctly represent and handle data variability and knowledge uncertainty, while probabilistic frameworks have a built-in capability to handle them. The uncertainty in the ITS model prediction comes from two main sources: uncertainty in the model structure and variability in the experimental data. The former is related to uncertainty in knowledge and limitations in the coverage of AOP key events by the current tests. The latter is associated with the variability of biological data as well as data quality. In a probabilistic framework, transparent and consistent criteria to accept a prediction are easy to set up and the need for uncertainty factors is diminished. Despite calls for this type of approach [18], only two probabilistic ITS frameworks (Jaworska et al. [25,26,28] and Luechtefeld et al. [27]) have been published to date.
In the majority of ITS approaches derived using machine learning techniques, the "weights" of individual assays are constant and established during ITS model development on the basis of a training set. In a BN ITS, the weight of an individual information source is context specific: it changes every time a new piece of information is provided to the system. This adaptive nature of Bayesian frameworks has important implications for decision-making, which is treated as a dynamic process. A BN ITS accounts for dynamically changing interrelationships among tests based on the evidence provided; thus there is no constant weight of an AOP event. Since evidence can be added to the BN ITS sequentially rather than all at once, interim predictions and decisions can be generated. As a result, a BN ITS can also be used to guide and optimize a testing strategy before testing is commenced. The optimal testing strategy amounts to identifying the test that leads to the maximum uncertainty reduction in the predicted variable. This means that there is no single best strategy; the best strategy depends on the existing evidence.
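The notion of choosing the test with the maximum expected uncertainty reduction can be made concrete with a small sketch. The example assumes that uncertainty is measured by Shannon entropy and that each test is described by a table of P(result | class); the three classes, the two tests and all probabilities are hypothetical and serve only to show that the preferred test changes with the current belief.

```python
import math
from typing import Dict, List, Sequence

Likelihood = List[List[float]]  # likelihood[r][c] = P(result r | class c)


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def expected_information_gain(prior: Sequence[float], likelihood: Likelihood) -> float:
    """Expected reduction in entropy of the class variable after seeing the result."""
    gain = entropy(prior)
    for row in likelihood:
        p_result = sum(l * p for l, p in zip(row, prior))
        if p_result == 0:
            continue
        posterior = [l * p / p_result for l, p in zip(row, prior)]
        gain -= p_result * entropy(posterior)
    return gain


def best_next_test(prior: Sequence[float], tests: Dict[str, Likelihood]) -> str:
    return max(tests, key=lambda name: expected_information_gain(prior, tests[name]))


# Hypothetical tests over three classes (columns), two results each (rows).
TESTS = {
    "test_A": [[0.9, 0.1, 0.1], [0.1, 0.9, 0.9]],  # separates class 0 from the rest
    "test_B": [[0.9, 0.9, 0.1], [0.1, 0.1, 0.9]],  # separates class 2 from the rest
}

print(best_next_test([0.45, 0.45, 0.10], TESTS))  # → test_A
print(best_next_test([0.10, 0.45, 0.45], TESTS))  # → test_B
```

With the first belief state the open question is whether the chemical belongs to class 0, so the test that discriminates class 0 is selected; with the second, the other test is more informative. This mirrors the statement that the best strategy depends on the evidence already available.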

Challenges to Practical Applications of ITS
Practical applications of ITS approaches are lagging because investigators often neglect to evaluate their approaches with an external validation (test) set [38]. Among ITSs for skin sensitization, performance evaluations based on an external dataset have been published for only three approaches: Jaworska et al. [24,25,28], Luechtefeld et al. [27] and Strickland et al. [31]. In addition to good statistical performance, which is necessary, the ability to explain a conflict among input data is a critical criterion for the successful use of ITS. Usually, when there is no conflict among the data, a correct hypothesis is formed. In contrast, predictions based on conflicting input data are often problematic. Methods for resolving conflicting data need to be better developed and incorporated into ITS frameworks. Mechanistic frameworks are much more suitable for this than statistical models based on correlation.
Another hurdle is the lack of broad, public accessibility of the ITS approaches to interested users. A unique effort was undertaken by the US National Toxicology Program Interagency Center for the Evaluation of Alternative Toxicological Methods (NICEATM), which reproduced the ITS-2 [25] and distributes it in an open source version [39] at the National Toxicology Program (NTP) Integrated Testing Strategies website [40]. Web-based applications with graphical user interfaces seem like the way forward; however, they require hosting institutions.
Lack of hands-on guidance in implementing ITS and the lack of regulatory guidance regarding the acceptability of ITS approaches further inhibit practical application of ITS [38]. The enormous diversity of approaches and inputs used presents a challenge for all stakeholders. Activities at OECD on the development of templates for Integrated Approaches to Testing and Assessment (IATA) aim to address this need, and it is expected that skin sensitization specific IATA templates together with selected case studies will be published in 2016. To this end, OECD efforts to develop reporting templates are critical to facilitate the transition from research to regulatory application. For regulatory applications, ITS approaches that use only validated assays have an advantage, as at least the inputs are standardized and the strengths and weaknesses of these assays are documented. However, for product development purposes, where input is needed for decisions such as "should we continue to develop this chemical or should we drop it from our portfolio?", non-standard approaches may be preferred depending on the specific case. Sometimes key information (especially chemistry information) may not be obtainable from a validated assay but could quite easily be generated by appropriate experimental (or in some cases theoretical) chemistry.
In summary, the ITS field is active, very diverse and constantly evolving as new assays are being developed and new mechanistic insights are discovered. Depending on the application, standardized and nonstandard approaches will be developed.

Conflicts of Interest:
The author declares no conflict of interest.