The reform of the nation’s primary chemicals law, as directed by the Frank R. Lautenberg Chemical Safety for the 21st Century Act, required the Environmental Protection Agency to take affirmative steps to support the development of nonanimal testing methods and encourage third parties to use them. The agency released a big-picture plan on New Approach Methodologies (NAM) in March, followed in April by a pilot implementation policy that would set a precedent by allowing the results from a single in vitro test to be sufficient for classifying a substance as a sensitizer, that is, a hazard for causing allergy-type responses.
Designating a substance as a sensitizer has consequences: obligations for hazard communication on Safety Data Sheets and labeling, along with regulatory and market pressure for product de-selection and phaseout. That is reason enough to require policies and approaches that lead to scientifically accurate determinations. But there are also big-picture implications of direct regulatory endorsement of a “one-and-done” in vitro testing approach. This precedent will certainly be pointed to as sufficient basis for making other consequential hazard designations, such as endocrine disruption and carcinogenicity.
TSCA reform legislation didn’t call for substances to be subject to labeling based on a single fluorescing test tube. Yet the pilot policy floated by the EPA in April opens that path. The goal of reducing animal testing and a plan to identify information of equivalent or better scientific quality ended up at a flow chart rendering the fluorescence of a single protein on the surface of artificially maintained cells sufficient to make a significant regulatory and commercial hazard designation. The path from the initial goal to this flow chart illustrates how this pilot implementation establishes a precedent for conferring hazard designations that carry regulatory and precautionary burdens without requiring that dots be connected between cellular responses and adverse health effects.
Reduce Animal Tests and Rely on In-Depth Analysis
The EPA draft Strategic Plan to Promote the Development and Implementation of Alternative Test Methods released March 7, 2018, specifies that multi-factor, systematic review methods will be relied upon as an approach to ensure that information from NAM is of equivalent or better scientific quality than existing animal-based testing. TSCA reform requirements [Section 26 (h) and (i)] and the strategic plan refer to weight-of-evidence (WOE) approaches that incorporate information from various sources into evaluations. Researchers have been active and fruitful for more than a decade developing comprehensive, systematic WOE methods to provide updated approaches for balanced, scientific consideration of the overall information set available for substances. The strategic plan indicates that the agency will use several defined and structured WOE systems including the Defined Approach (DA) method and the Integrated Approaches to Testing and Assessment (IATA) method.
The agency’s first implementation of the strategic plan, an interim policy released for public comment on April 4, 2018, and titled Interim Science Policy: Use of Alternative Approaches for Skin Sensitization as a Replacement for Laboratory Animal Testing, references a DA framework for determining whether tested substances are designated as skin sensitizers. But the decision matrix pares down testing specifications so that only one element of the DA, addressing a single cellular step in the chain of events leading to sensitization, need be tested. The decision maker is thus relying on results from testing a limited aspect of the DA, rather than on a comprehensive review. This simplification establishes a worrisome precedent that may be cited as sufficient and extended to other endpoints and designations.
The implementation policy lays out the chain of events from chemical reactions to cellular and organ responses that can lead to sensitization as an adverse outcome, summarizing four sequential key events (KE) identified by researchers and incorporated into test guidelines maintained by the Organisation for Economic Co-operation and Development (OECD). The EPA policy correctly acknowledges that “there are no validated non-animal methods that assess the ability of substances to activate KE4,” the final KE in the chain.
Though the chain of necessary events is broken by the absence of this link to the actual outcome, the policy deems two DA methods acceptable for agency submission: 1) AOP “2 out of 3,” and 2) KE3/1 Sequential Testing Strategy (STS). As the name implies for the former, the approach is to conduct validated tests relating to at least two of Key Events 1, 2 and 3. Concordant positive or negative results for two of the KEs are considered sufficient to determine whether or not the substance qualifies as a skin sensitizer. The latter DA provides the “one-and-done” pathway through the decision flowchart, allowing for any of the KE3 tests in OECD Test Guideline 442E to be conducted and the results accepted as sufficient to designate the substance as a sensitizer if this first KE3 test is positive.
If the KE3 test does not produce a positive response, the substance must then undergo KE1 testing, which looks simply at whether the test substance can stick to skin proteins. A positive response in a KE1 test is enough to designate a substance a sensitizer, even without a cellular response for KE3, which is further along in the chain of events and a necessary step prior to a sensitization outcome. Mandating this interpretation of results from sequential tests amounts to adopting a highly precautionary bias. Effectively, a substance that sticks to skin proteins, even if it does not produce the KE3 cellular response that is a prerequisite for sensitization, must be designated a skin sensitizer. Starting from the TSCA reform requirement to use only equivalent or better science in selecting non-animal methods and arriving at this proposed policy of interpreting in vitro tests is a scientific disconnect.
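The two decision pathways described above can be sketched in a few lines of code. This is only an illustration of the logic as summarized in this article, not the agency's flowchart itself: the function names, the result labels and the handling of the case where both KE3 and KE1 are negative (shown here as "not designated") are assumptions made for the sketch.

```python
from typing import Mapping, Optional


def sts_ke3_then_ke1(ke3_positive: bool, ke1_positive: Optional[bool] = None) -> str:
    """KE3/1 Sequential Testing Strategy, as summarized in the article.

    Hypothetical helper: names and return labels are illustrative only.
    """
    if ke3_positive:
        # "One-and-done": a single positive KE3 test is treated as sufficient.
        return "sensitizer"
    if ke1_positive is None:
        raise ValueError("KE3 was not positive, so a KE1 result is required")
    # A positive KE1 result designates a sensitizer even without a KE3 response.
    # The label for the double-negative case is an assumption of this sketch.
    return "sensitizer" if ke1_positive else "not designated"


def two_out_of_three(results: Mapping[str, bool]) -> str:
    """AOP "2 out of 3": two concordant results among KE1, KE2 and KE3 decide."""
    positives = sum(1 for outcome in results.values() if outcome)
    negatives = len(results) - positives
    if positives >= 2:
        return "sensitizer"
    if negatives >= 2:
        return "non-sensitizer"
    return "inconclusive"  # one positive and one negative: a third test is needed


if __name__ == "__main__":
    print(sts_ke3_then_ke1(ke3_positive=True))
    print(sts_ke3_then_ke1(ke3_positive=False, ke1_positive=True))
    print(two_out_of_three({"KE1": True, "KE3": False}))
```

Written this way, the asymmetry is plain: under the sequential strategy one positive result ends the inquiry, while a negative result only triggers another test.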
Setting the Bar Low
A convenient and scientifically based source can help evaluate the likelihood of misidentifying skin sensitizers via the proposed decision pathways. As part of validating the tests in Test Guideline 442E, OECD reviewed the accuracy of the individual tests, considering both sensitivity (ability of the test to correctly identify known sensitizers) and specificity (ability of the test to correctly rule out non-sensitizers). The validation review also considered the consistency of results from multiple runs of the same test and from multiple laboratories (reproducibility).
Overall, the three specific tests included in the test guideline perform well with regard to correctly identifying known sensitizers. The sensitivity ranged from 91 percent to 100 percent for the three tests, indicating that these tests are not expected to yield many “false negatives,” or miss a substance that actually is a sensitizer.
While a bit less reliable for reproducibility, the tests are similar to many routinely used tests in this regard. For one of the three tests, the reproducibility was only 80 percent when run in either the same laboratory or two different laboratories. This means that one out of five times, a single test will produce the “wrong” answer compared to retests of the same substance. Although not out of the ordinary, this does illustrate the uncertainty associated with relying on a limited testing approach. The reproducibility for the other two Test Guideline 442E tests ranged from 84 percent to 90 percent.
A substantial reliability issue for the policy emerges when looking at the possibility of incorrectly designating non-sensitizers as sensitizers. For two of the tests, according to OECD’s validation, there is less than a 50:50 chance of correctly identifying substances as non-sensitizers. Compared to results of actual human sensitization potential, one of the tests has only 39 percent specificity and another 47 percent specificity. This means that as many as six of every 10 non-sensitizers tested would be incorrectly designated as sensitizers if these tests were used alone, as the “one-and-done” pathway in the proposed policy allows. A “false positive” rate this high is inconsistent with regulatory application and would result in many incorrect labels under hazard communication programs and unnecessary regulatory efforts and responses. The third test has a specificity rate of 66 percent, which would still result in about one-third of non-sensitizers being incorrectly designated as sensitizers.
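The arithmetic behind these figures is simple: the false-positive rate is one minus the specificity. The short sketch below works through it using the three specificity values cited above. The labels "test A" through "test C" are placeholders, since the tests are not matched to their values by name here.

```python
def false_positive_rate(specificity: float) -> float:
    """Share of true non-sensitizers that a test would flag as sensitizers."""
    if not 0.0 <= specificity <= 1.0:
        raise ValueError("specificity must be between 0 and 1")
    return 1.0 - specificity


# Specificity values as cited in the text; the test labels are placeholders.
SPECIFICITY = {"test A": 0.39, "test B": 0.47, "test C": 0.66}

if __name__ == "__main__":
    for name, spec in SPECIFICITY.items():
        fpr = false_positive_rate(spec)
        per_ten = round(fpr * 10)
        print(f"{name}: specificity {spec:.0%}, false-positive rate {fpr:.0%} "
              f"(about {per_ten} of every 10 non-sensitizers)")
```

These rates apply to substances that are not in fact sensitizers; how many labels end up wrong overall would also depend on how many of the substances tested are true sensitizers, a figure not given here.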
The fact that none of the validated tests in the test guideline required under the policy yield an appropriately low likelihood of false positives reveals the scientific inadequacy of proposing to rely on one of these tests alone. The OECD test guideline explicitly addresses this limitation, explaining that these tests alone may not be sufficient to reach conclusions about sensitization potential and referencing reliance on a broader base of information: “Data generated with the test methods described in this Test Guideline are proposed to support the discrimination between skin sensitisers (i.e. UN GHS Category 1) and non-sensitisers when used within Integrated Approaches to Testing and Assessment (IATA), together with other relevant information.” The proposed EPA policy thus runs counter to the test guideline it relies on to provide simplification down to the one-and-done approach.
As currently proposed, EPA’s policy goes further, suggesting that the agency expects the use of the one-and-done approach to be preferred. “USEPA acknowledges that the 3/1 STS is simpler, will lead to fewer studies conducted, and is thus more efficient … some chemicals are likely to have existing evaluations using the 2 out of 3 DA; the agency will continue to accept these” (Interim Science Policy, p.2). This language stands in marked contrast to the discussion in the EPA strategic plan of the need to rely on weight-of-evidence and systematic review of integrated information to achieve the goal of getting information of equivalent or better scientific quality than traditional animal models.
TSCA Reform and Precedents
Beyond the need for refinement and less simplification of the WOE approach for designating skin sensitizers are several broader implications of EPA’s proposed policy: First is the issue of whether it meets TSCA reform requirements. Second is the precedent of accepting tests with high false-positive rates for important and expensive hazard designations. Third is the precedent of making decisions based on a set of information that does not connect the links between in vitro testing responses and actual adverse outcomes.
With regard to TSCA reform, the Lautenberg Act specifically directs the agency to apply the criterion that new methods have equivalent or better scientific quality than animal testing either to list them as agency-approved methods or to require third parties to use them. The strategic plan recognizes and references the applicability of this criterion in both situations. The proposed policy specifies that the included methods are approved for use in third-party submissions to the EPA. The policy applies to single chemical pesticide substance submissions to be reviewed by the agency. This use places it squarely within the requirements of TSCA reform that new methods meet the equivalent or better science criterion.
The proposed policy states that the described DAs “demonstrate comparable or superior performance to the LLNA” [a standard animal test for skin sensitization] (Interim Science Policy, p. 8) and provides a discussion of reviews of LLNA accuracy. The agency adopts 14 percent to 20 percent as the range of false-positive rates for the LLNA. Because no comparison is made to the false-positive rates included in the OECD Test Guideline, it is unclear how the agency supports the statement that the one-and-done DA performs comparably. The false-positive rates from that source range from 33 percent to 61 percent for the three included tests, all of which are worse, not better, than the LLNA.
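The comparison the policy leaves out can be made directly from the numbers quoted above. The sketch below is illustrative only: it takes the LLNA range and the endpoints of the in vitro range as stated in this article, and the constant and function names are hypothetical.

```python
# Figures as quoted in the text (fractions rather than percentages).
LLNA_FPR_LOW, LLNA_FPR_HIGH = 0.14, 0.20        # range the agency adopts for the LLNA
IN_VITRO_FPR_LOW, IN_VITRO_FPR_HIGH = 0.33, 0.61  # range for the three in vitro tests


def exceeds_llna(fpr: float, llna_high: float = LLNA_FPR_HIGH) -> bool:
    """True if a false-positive rate is worse than the least favorable LLNA figure."""
    return fpr > llna_high


def ratio_to_llna(fpr: float, llna_high: float = LLNA_FPR_HIGH) -> float:
    """How many times higher a rate is than the least favorable LLNA figure."""
    return fpr / llna_high


if __name__ == "__main__":
    for label, fpr in (("best in vitro test", IN_VITRO_FPR_LOW),
                       ("worst in vitro test", IN_VITRO_FPR_HIGH)):
        print(f"{label}: {fpr:.0%} false positives, "
              f"{ratio_to_llna(fpr):.1f} times the LLNA upper bound, "
              f"exceeds LLNA: {exceeds_llna(fpr)}")
```

Even the comparison most favorable to the in vitro tests, the best of the three against the least favorable LLNA figure, leaves the in vitro test with the higher false-positive rate.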
The statement quoted above may be intended to assert that the decision matrix in the policy complies with TSCA reform, but the discrepancy between this conclusion and the findings of the OECD test guideline warrants explanation. The OECD guideline is cited as the source for the acceptable DAs, yet the validation findings from this source contradict the conclusion about being better than the LLNA. The policy, therefore, fails to meet the requirements of TSCA reform in that context.
Unrelated to the formality of TSCA reform compliance is the precedent of the EPA relying on methods with such high false-positive rates for significant agency determinations and substance categorization. The obligations that follow designation as a skin sensitizer, or other designations such as endocrine disruptors, entail expenditure of substantial resources, as well as the commercial and competitive implications of being labeled with such hazard properties. If a false positive rate as high as 61 percent is acceptable in a test being interpreted as a one-and-done standard for skin sensitization, arguably a level of accuracy this relaxed would be acceptable for other endpoints and designations.
While the criterion that NAM perform better than the traditional animal test is scientifically sound for relative comparisons, and it is difficult to assign a generally applicable floor level of reliability given the differences among types of tests, there has to be some level of uncertainty that is too great, on an absolute scale, for agency reliance on test results. As the extreme examples of the validation results from OECD Test Guideline 442E illustrate, more than a 50:50 likelihood of erroneously designating a substance as a sensitizer is just too high to make sense as a regulatory policy. An animal method that performed this poorly would not have been acceptable to the agency in the first place. Allowing such relaxed standards for in vitro tests that replace animal tests fails the mandate of being a scientific improvement.
Besides the precedent of a low floor for reliability, there is the additional precedent of accepting DAs that do not include a complete chain of key events between initial chemical reactions and actual adverse outcomes. The analogy of links in a chain fits for evaluations of Adverse Outcome Pathways (AOPs) because these pathways incorporate events and responses that occur in an interconnected sequence. If the chain is broken, that is, one or more of the key events does not occur, then the ultimate outcome in the pathway is not realized.
In the case of the skin sensitization AOP, there is no test for the fourth KE, the only one considered an organ-level response. And the proposed policy doesn’t require demonstration of an unbroken chain from KE1 through KE3. On the contrary, one of the approved DAs is even named “2 out of 3” and the other starts with KE3 and skips back to KE1 if the substance fails to produce a response in the KE3 test. These inconsistencies underscore the uncertainty of the policy as the first implementation of an approach for endorsing and accepting alternative methods. More importantly, if the agency is willing to overlook the absence of a test connecting the adverse effect in animals to the in vitro progression and rely on varied combinations of demonstrated key events for skin sensitization, these shortcuts could be promoted for complex AOPs such as cancer or endocrine disruption endpoints. Adopting this policy for the skin sensitization pathway will make it more difficult for the agency to refuse such incomplete chains for AOPs on other endpoints. When the overarching directive is to upgrade the scientific quality of decision-making information, implementing shortcuts in the name of efficiency and allowing hazard designation on the basis of a single test result seem off track.
Conclusion
A frequently raised criticism of historical regulatory use of toxicity testing information was that adverse outcomes in one animal test were often given excessive or sole weight in data evaluation and priority in determining chemical hazards. Scientists have worked to develop and validate updated approaches that enable extensive sets of results from more refined testing to be considered in systematic, weight-of-evidence approaches. And TSCA reform and the EPA strategic plan make clear that improved, comprehensive evaluation methods should be used in conjunction with adopting new methods.
Yet the EPA’s first proposed policy to implement a transition from animal testing to new methods includes an approach that makes the designation of a skin sensitization hazard an obligatory interpretation of a positive result in a single test. If the proverbial one mouse with a tumor is supplanted by one glowing test tube as a decision criterion for designating the hazard related to a chemical, the link between regulatory characterizations and actual potential for adverse effects will be weakened rather than strengthened by emerging scientific methods and advances.
The lure of quick and convenient testing approaches comes with a hook: increased effort to address the hazard designations that result and to develop substitute products. A policy that accepts the results from single tests with high false-positive rates exacerbates the potential for substances to be erroneously labeled with problematic hazard properties. That is the situation created by the currently proposed EPA policy for replacing skin sensitization testing with in vitro methods. This policy should not become a precedent.
Dr. Robert DeMott is a Principal Toxicologist and global Health Sciences leader in the Environment & Health business of leading global engineering and consulting firm Ramboll. A board-certified toxicologist, he has more than 25 years of experience evaluating exposures to chemicals in the workplace and the environment.
Dr. Gavin Thompson is a Principal Consultant in Ramboll’s global Environment & Health business. He has more than 30 years of experience advising manufacturers, marketers, and industry associations on evaluating chemical exposures to humans from consumer products, food, food processing and packaging, household items and environmental media.
The opinions expressed here do not represent those of Bloomberg Environment, which welcomes other points of view.