ML21053A236

Npic Hmit Paper Applying Single Failure Criteria to Digital
Person / Time
Issue date: 02/22/2021
From: Richard Stattel
NRC/NRR/DEX/ESEB
To:
Richard J. Stattel 301-415-8472


Text

APPLYING SINGLE FAILURE CRITERIA TO DIGITAL I&C SYSTEMS

Author: Richard J. Stattel
US Nuclear Regulatory Commission
One White Flint North
11555 Rockville Pike
Rockville, MD 20852-2738
Richard.stattel@nrc.gov

ABSTRACT

The single failure criterion (SFC), though conceived before the advent of digital technology, remains equally applicable to today's safety-related I&C systems regardless of the technologies they use. Applying this criterion to modern digital I&C safety systems and demonstrating that it is met have created unique licensing challenges, particularly when the effects of common mode software or logic errors are taken into consideration.

The nuclear industry and its regulators have found that methods used to demonstrate SFC compliance vary greatly depending on individual system characteristics and logistical limitations such as the inability to perform comprehensive testing of complex digital systems. A method successfully used for one type of system may be inadequate when applied to a different type of system particularly when different technologies are involved. For this reason, there is currently a renewed interest in producing new guidance for addressing the SFC using methods that are based on best practices that have been developed by technologists in these areas.

The IEEE Nuclear Power Engineering Committee (NPEC) is currently developing a revision to IEEE Std 379, "Standard for Application of the Single-Failure Criterion to Nuclear Power Generating Station Safety Systems," to clarify how the potential for common cause failures (CCFs) in software or logic can be screened or otherwise addressed within the context of a single failure analysis.

This paper explores the challenges presented in applying the SFC to today's digital upgrade projects. It also explores the evolutionary changes that the single failure criterion has undergone since it was first conceived over 50 years ago.

Key Words: Failure Modes and Effects Analysis (FMEA), Diagnostics, Common Cause Failure (CCF), Diversity and Defense-in-Depth.

1 INTRODUCTION

Applying single failure criteria (SFC) to digital protection systems of a nuclear power plant is one of the greatest challenges facing I&C engineers today. The SFC requires these systems to respond in a known and predictable way to an increasingly unknown and unpredictable set of failure modes, many of which are difficult to anticipate during system development.

A traditional approach to addressing the single failure criterion would be to first design a system using a standard set of design criteria, which would of course include the SFC, and then to test the completed system by introducing a limited set of credible failures and monitoring system response. In this way, one could use test results as confirmation that the SFC is met by the system. However, these traditional methods of system development and testing can no longer be relied upon for such a demonstration. This paper examines several factors that have made demonstrations of SFC compliance challenging and, in many cases, unattainable for digital systems. We begin with an explanation of the SFC and its regulatory significance.

2 SINGLE FAILURE CRITERIA AND REGULATORY SIGNIFICANCE

The single failure criterion (SFC) is derived from two different U.S. federal regulation sources. The first is the set of general design criteria provided in 10 CFR 50, Appendix A, "General Design Criteria for Nuclear Power Plants." Specifically, Criterion 21, "Protection system reliability and testability," states, in part, the following:

Redundancy and independence designed into the protection system shall be sufficient to assure that (1) no single failure results in loss of the protection function.

The second regulatory source for the single failure criterion is the pair of standards incorporated by reference via 10 CFR 50.55a(h): IEEE Std 279 and IEEE Std 603-1991. The single failure criterion for nuclear I&C systems appeared in IEEE Std 279 in 1971 as follows:

Single Failure Criterion. Any single failure within the protection system shall not prevent proper protective action at the system level when required.

Though this criterion has been expanded upon in IEEE Std 603, the fundamental criterion is essentially identical. The advent of digital instrumentation and protection systems, as well as the complexities involved in predicting the failure modes of these systems, has made it exceedingly difficult to design systems that meet this criterion and, more importantly, to demonstrate that a well-designed system meets it.
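To make the criterion concrete, the following minimal sketch enumerates single failures in a redundant coincidence-logic arrangement and checks whether the protective action is still produced. The voting architectures, the function names, and the restriction to a single "fails to trip" failure mode are illustrative assumptions; none of them is taken from IEEE Std 279 or IEEE Std 603.

```python
def protective_action(channel_demands, required_votes):
    """Coincidence logic: actuate when enough channels demand a trip."""
    return sum(channel_demands) >= required_votes


def survives_any_single_failure(n_channels, required_votes):
    """Return True if no single channel failure blocks the protective action.

    Assumption: the only failure mode modeled is one channel stuck in the
    non-tripped state while every healthy channel correctly demands a trip.
    """
    for failed_channel in range(n_channels):
        demands = [True] * n_channels
        demands[failed_channel] = False  # postulated single failure
        if not protective_action(demands, required_votes):
            return False
    return True


if __name__ == "__main__":
    print(survives_any_single_failure(4, 2))  # hypothetical 2-out-of-4 logic
    print(survives_any_single_failure(2, 2))  # hypothetical 2-out-of-2 logic
```

Under these assumptions, a 2-out-of-4 arrangement tolerates any one failed channel, whereas a 2-out-of-2 arrangement does not. A real single failure analysis would have to cover far more failure modes than this one, which is the difficulty the remainder of this paper addresses.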

3 DIGITAL I&C SYSTEM CHARACTERISTICS

Modern I&C systems have been far more reliable and less prone to failures than their predecessors, the analog I&C systems that the SFC was originally intended to address. So why hasn't SFC compliance become easier to demonstrate, given the performance data available? I will attempt to answer this by discussing some of the unique characteristics of digital technology. As it happens, some of the very characteristics of digital instrumentation that make it more reliable and fault tolerant than analog instrumentation also make these systems more susceptible to SFC compliance challenges. To examine these characteristics more closely, the following question shall be considered:

Why are digital instruments more reliable than their analog counterparts?

Here are a few characteristics that have improved the reliable performance of digital systems:

  • Self-Monitoring and Self-Diagnostic Capabilities - Including fault detection and fault response functions. Self-diagnostic functionality has undeniably led to equipment performance improvements. Analog circuits, with few exceptions, were largely incapable of performing self-monitoring. Digital systems, on the other hand, can typically be programmed to both detect and respond to a wide range of system failures in a controlled manner.
  • Standardized Integrated Circuitry - The integration and standardization of circuit designs has resulted in a reduction in the number of discrete circuit components and a corresponding reduction of component failure rates. Though not strictly a digital electronics phenomenon, these trends in electronic circuit design have also caused an increase in component complexity.
  • Lower Power Level Signal Processing - By reducing power consumption rates, digital electronic components can be designed to operate at reduced stress levels, resulting in improved performance over longer periods of time.
  • Redundant Signal Processing Capabilities - Many digital systems are designed with two or more redundant signal processing paths to increase reliability and enhance fault tolerance levels.

Though each of these characteristics has served to improve system reliability, they have done so at the expense of increasing the level of system complexity. Along with increased complexity comes an increase in uncertainty, and an expansion of what I will refer to as the credible failure domain. It is also much more difficult to identify and demonstrate the effects of failures when more complex system designs are used.

Even though there is higher confidence in modern digital systems that protection functions will be maintained for a greater number of failure modes, the real challenge has been demonstrating system performance in this larger failure domain.

Factors that affect a system's compliance with the SFC include 1) the number of failure modes that a system has, 2) the severity or effects of each failure mode, 3) the ability of a system design engineer to minimize the effects of failures, and 4) the ability to test and demonstrate the effects of postulated failures.

3.1 Number of failure modes

In comparison to older analog I&C systems, digital systems typically have a much larger number of failure modes to be analyzed. One principal reason for this is that digital systems employ multiple layers of technology to perform their functions. For example, a typical digital system uses electronic circuitry, microprocessor based digital processing, digital communications, and stored executable programming instructions. Conversely, analog technologies generally employ only a single layer of technology (analog electronic circuitry) to accomplish a similar degree of functionality.

This larger failure mode domain is the principal reason that SFC compliance is more difficult to demonstrate when applied to digital systems. Figure 1 is a Venn diagram which illustrates the failure mode domains as they are defined in the standards IEEE Std 279 and IEEE Std 603. The SFC as stated in IEEE 279 uses the phrase "Any single failure within the protection system" to establish applicability and scope. This is shown in Figure 1 as Domain I. The SFC as stated in IEEE 603 adds some clarification by limiting the failure domain to 1) detectable failures that are concurrent with all identifiable but non-detectable failures (shown as Domain II) and 2) failures caused by the single failure (shown as Domain III). While helpful, this still leaves developers with a very large set of potential failures to which the SFC must be applied. This combined scope includes both Domains II and III of Figure 1.

[Figure 1, Failure Mode Domains. Venn diagram with the following labels. Domain 0: all conceivable failure modes of a system. Domain I: IEEE 279, any single failure of a protection system. Domain II: IEEE 603, 1) detectable failure modes concurrent with identifiable non-detectable failure modes. Domain III: IEEE 603, 2) failures caused by the single failure. Domain IV: IEEE 603-2018, credible failures. An additional label reads: IEEE 603, certain failure modes that need not be considered.]

The standard, IEEE Std 603-1991, also provides an allowance that certain postulated failures need not be considered in the application of the SFC. This introduces the concept of failure credibility into the failure mode scope of the SFC. The more recent 2018 version of IEEE Std 603 reinforces this concept by stating the following:

Credible failures and events shall be addressed in the single failure analysis.

Of course, this also conversely implies that non-credible failures do not need to be addressed in the SFC. This provides an opening that could be used to improve and simplify the single failure analysis process.

Domain IV in Figure 1 represents the subset of these credible failure modes for which the SFC must be applied. Unfortunately, the IEEE 603 SFC does little to establish criteria for determining the credibility of a postulated failure mode. This leaves the determination of Domain IV scope in an ambiguous state and it is the source of many challenges the nuclear industry currently faces. If we could somehow develop a set of mutually agreed upon criteria for establishing the boundaries of this domain, then much of the uncertainty involved with the licensing of digital I&C systems could be alleviated.
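The nesting of these domains can be expressed with ordinary set operations. The sketch below is illustrative only: the failure mode identifiers are hypothetical placeholders, and the `credible` set simply stands in for the credibility criteria that, as noted above, the standards do not yet define.

```python
def single_failure_analysis_scope(domain_ii, domain_iii, credible):
    """Return Domain IV: credible failure modes within the IEEE 603 scope.

    domain_ii  -- detectable failures concurrent with identifiable
                  non-detectable failures
    domain_iii -- failures caused by the single failure
    credible   -- failure modes judged credible (criteria are assumed)
    """
    ieee_603_scope = domain_ii | domain_iii  # combined Domains II and III
    return ieee_603_scope & credible


# Hypothetical failure mode identifiers, invented for illustration.
domain_ii = {"fm_01", "fm_02", "fm_03"}
domain_iii = {"fm_04"}
credible = {"fm_02", "fm_03", "fm_04"}

domain_iv = single_failure_analysis_scope(domain_ii, domain_iii, credible)
```

In this toy case, `fm_01` falls inside the IEEE 603 scope but outside Domain IV because it was judged non-credible. The entire difficulty lies in how the `credible` set is justified, which the code deliberately leaves open.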

The guidance of IEEE Std 379, "IEEE Standard for Application of the Single-Failure Criterion to Nuclear Power Generating Station Safety Systems," may provide a means of accomplishing this objective. This standard already provides a limited amount of guidance for establishing the scope of Domain IV. The current version of the standard includes a screening process that can be used to eliminate certain types of failure from SFC consideration. The standard suggests that the following failure types do not need to be subjected to the single failure analysis and thus can be excluded from Domain IV in our model:

  • Causative factors from external environmental effects (e.g., voltage, frequency, radiation, temperature, humidity, pressure, vibration, and electromagnetic interference)
  • design deficiencies
  • manufacturing errors

The standard, however, provides no objective criteria for performing such a failure credibility screening activity. Instead, it points to various pathways for addressing the effects of these failures without defining what it means for a failure mode to have been addressed. For example, the standard says that external environmental effects should be addressed by performing equipment qualification. This would imply that if all of the equipment within a system is environmentally qualified, then the common mode failure resulting from the subject environmental conditions can be considered non-credible and excluded from SFC consideration. This result is widely accepted, and environmental CCFs are generally exempt from SFC consideration.

For the case of digital CCF, the standard states that IEEE Std 7-4.3.2, "IEEE Standard Criteria for Digital Computers in Safety Systems of Nuclear Power Generating Stations," should be used to address the effects of such failure. By following the same logic used for environmental failures in the previous example, it might seem reasonable to assume that if the guidance of IEEE Std 7-4.3.2 is followed, then digital CCFs need not be considered credible.

There is, however, a fundamental flaw in this line of reasoning. Without having performed any testing or assessment of the probability or the consequences of a digital CCF, and without having postulated the CCF itself, one is led to the conclusion that digital CCFs are not credible based only on the fact that IEEE 7-4.3.2 guidance was followed during a system's development. While IEEE Std 7-4.3.2 is a valuable standard that provides sound guidance for the development of high quality digital systems, it does not provide assurance or a documented analysis to show either that a digital CCF in a resulting system is not credible or that the system will not prevent proper protective action at the system level when required, which is, of course, the fundamental criterion of the SFC. This would be equivalent to accepting environmental CCFs as non-credible based on the equipment design processes alone, without requiring environmental qualification testing of the equipment. Such an approach is clearly not a widely accepted practice in other categories of CCF.

Following good engineering processes certainly reduces the likelihood of digital CCF, but does it really make such failures non-credible? To answer this question, we should consider the relative risks associated with the digital CCF. Since risk includes both probability of occurrence and consequence of occurrence, the IEEE 7-4.3.2 processes, which reduce likelihood, only partially address the elements of risk entailed in digital CCF.

Figure 3 provides an additional breakdown of failure Domain IV for consideration. Of the failures that could be considered credible under the criteria provided in IEEE 279 and IEEE 603, there are failures that have very low likelihood of occurrence (Domain V) and there are failures that have very low consequence of occurrence (Domain VI). We could also hypothesize that failures in either of these categories could be excluded from SFC consideration if an effective process could be developed to screen them out based on risk insights. This could effectively reduce the number of credible failures for CCF consideration to a manageable number or even to zero for many cases.

[Figure: breakdown of failure Domain IV, with the following labels. Domains II and III: all detectable and non-detectable failure modes of a system as well as failures caused by the single failure. Domain IV: IEEE 603-2018, credible failures. Domain V: failures that are credible but have extremely low likelihood of occurrence. Domain VI: failures that are credible but have very low or no consequence of occurrence.]

Failure modes in either Domain V or Domain VI could be excluded as part of a defined screening process, using an analytical approach that includes justification and documentation. The recent guidance of IEEE 603 Section 5.16, "Common cause failure," provides a good starting point for screening criteria because it has guidance for addressing failure mode risk that includes assessment of both probability and consequence for each postulated failure.
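A screening process of this kind might be sketched as follows. This is a hypothetical illustration and not a method defined in IEEE Std 603 or IEEE Std 379: the `PostulatedFailure` record, the numeric scales, and both threshold values are assumptions chosen only to show how a documented justification could accompany each excluded failure mode.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PostulatedFailure:
    name: str
    likelihood: float   # assumed probability of occurrence per year
    consequence: float  # assumed normalized scale: 0.0 (none) to 1.0 (severe)


def screen(failures, max_likelihood=1e-6, max_consequence=0.01):
    """Split credible failures into retained and screened-out groups.

    The thresholds are placeholders, not regulatory values. Each screened
    failure is recorded with the reason for its exclusion.
    """
    retained = []
    screened = {}
    for failure in failures:
        if failure.likelihood <= max_likelihood:
            screened[failure.name] = "Domain V: very low likelihood"
        elif failure.consequence <= max_consequence:
            screened[failure.name] = "Domain VI: very low consequence"
        else:
            retained.append(failure.name)
    return retained, screened


# Hypothetical postulated failures, invented for illustration.
candidates = [
    PostulatedFailure("ccf_a", likelihood=1e-8, consequence=0.9),
    PostulatedFailure("ccf_b", likelihood=1e-3, consequence=0.001),
    PostulatedFailure("ccf_c", likelihood=1e-3, consequence=0.5),
]
retained, screened = screen(candidates)
```

Only `ccf_c` remains for single failure analysis in this example. Keeping the exclusion reason alongside each screened failure reflects the point made above that exclusion should rest on justification and documentation rather than on development process alone.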

3.2 Determining the Effects of Common Cause Failures

While quantification of CCF probability can be performed using existing techniques such as reliability analysis or by using empirical data available for a particular system or platform, the consequence part of the risk equation is understandably more difficult to analyze, and only limited guidance for performing such an analysis exists today. One challenge in determining consequence is that doing so necessarily involves postulating the CCF, which happens to be the one activity that most designers and licensees would very much like to avoid. Such a postulation often entails analysis of faults that violate the assumptions made in the plant accident analysis. Violating those assumptions, or even contemplating a system that has the potential for doing so, may seem like a formidable task.

However, such tasks are not without precedent. Several digital I&C safety systems have been placed into operation at nuclear power plants in the US. Most of these upgrade projects have included a diversity and defense-in-depth (D3) analysis activity to identify existing backup functions to the functions performed by the safety system and to identify the need for additional backup functionality for these systems. A D3 analysis normally involves postulating that the safety system fails to perform its assigned safety functions. Even though not all CCF failure modes would have this effect, such a postulation is made to simplify the analysis and to encompass the wide CCF failure domain that is necessary for application of the SFC.
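The bounding postulation used in a D3 analysis can be illustrated with a short sketch. It assumes, hypothetically, that the analyst has a list of safety functions and a mapping from each function to its diverse backups; the function and backup names below are invented for illustration and do not describe any particular plant.

```python
def d3_gaps(safety_functions, diverse_backups):
    """Return safety functions left without a backup.

    Bounding assumption: the digital safety system is postulated to fail
    to perform every one of its assigned safety functions, so a function
    is covered only if at least one diverse backup is listed for it.
    """
    return sorted(
        function
        for function in safety_functions
        if not diverse_backups.get(function)
    )


# Hypothetical inputs, invented for illustration.
safety_functions = {"function_a", "function_b", "function_c"}
diverse_backups = {
    "function_a": ["diverse_manual_action"],
    "function_b": [],  # reviewed, but no diverse backup identified
}

gaps = d3_gaps(safety_functions, diverse_backups)
```

Functions returned in `gaps` are the ones for which additional backup functionality would need to be considered under this bounding assumption.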

3.3 Minimizing the Effects of Common Cause Failures

Once the CCFs are postulated and the effects of these failures are established, a developer can introduce measures to either reduce the potential for the identified CCFs or address the effects of these failures to a practicable extent. Upon completion, the risks of CCF are known and can therefore be used as a means of screening CCFs out of SFC consideration. To corroborate this point, NRC Branch Technical Position BTP 7-19 includes the following statement:

Over the years, the U.S. Nuclear Regulatory Commission (NRC) staff has approved applications that use various design features to address CCF vulnerabilities in DI&C systems. Some of these use multiple design solutions within different parts of a single DI&C system. In reviewing these applications, the staff has evaluated several different solutions that successfully address CCF vulnerabilities.

Since plants are already taking actions to reduce CCF risk to an acceptable level, it can be postulated that a risk informed means of screening digital system CCFs out of SFC consideration can provide a reasonable solution to this problem. Furthermore, if a well-defined method for performing a risk informed screening of software and logic based CCFs could be developed, then licensees would have a means of performing digital system upgrades without the unattainable objective of proving that digital CCFs cannot happen or that no digital CCF can result in failure to meet the SFC.

3.4 Ability to Test and Demonstrate the Effects of Credible Failures

Adopting an enhanced risk informed CCF screening process could obviate the need for excessive or unachievable levels of system testing. The goal of the test program would no longer be to test every possible failure mode and validate every possible effect on system operation. That objective has always been far too broad to be practical for most digital systems. Instead, the testing objective would be to test and confirm each of the analyzed probable and credible failure modes that could reasonably be expected to occur during system operation. This approach would reduce and re-focus testing activities to a level that is both manageable and efficient at identifying the effects of postulated credible failure modes.

Such an enhanced test program could be much more efficient than current test-all approaches because it would be more capable of comprehensively addressing the system failure domains that pose the greatest risk to safe plant operation.

The test results alone would no longer be relied upon to show SFC compliance but would be used in conjunction with risk analysis, and risk informed screening processes to demonstrate with reasonable assurance that safety functions would remain capable of performing required protective actions in the presence of credible single failures. These improved and focused test processes would become part of a multifaceted approach to addressing the SFC and would have a much more meaningful nexus to system safety.

4 CONCLUSIONS

As mentioned above, IEEE Std 379 contains limited guidance for CCF screening that could be expanded to include an enhanced risk informed CCF screening process. The Nuclear Power Engineering Committee (NPEC) working group subcommittee 6.3 is currently working on a revision to IEEE 379 and is actively developing new guidance for Section 5.5, "Common Cause Failures," that will include an update to the existing CCF screening process. This important work has the potential to facilitate significant reactor safety improvements and should be supported by the nuclear power community at all levels.

5 ACKNOWLEDGMENTS

Many thanks to my colleagues David Rahn and Wendell Morton and to the members of NPEC working group 6.3, whose efforts and critical feedback have made this work possible.

6 REFERENCES

1. IEEE Std 279-1971, Criteria for Protection Systems for Nuclear Power Generating Stations
2. IEEE Std 603-1991, Criteria for Safety Systems for Nuclear Power Generating Stations
3. IEEE Std 379-2014, Application of the Single-Failure Criterion to Nuclear Power Generating Station Safety Systems
4. IEEE Std 7-4.3.2, IEEE Standard Criteria for Digital Computers in Safety Systems of Nuclear Power Generating Stations