ML23272A033
| ML23272A033 | |
|---|---|
| Issue date: | 09/29/2023 |
| From: | Sushil Birla NRC/RES/DE |
| Sushil Birla 301-415-2311 | |
Identifying Hazards in a System Design
Sushil Birla
Senior Technical Advisor
U.S. Nuclear Regulatory Commission, Office of Nuclear Regulatory Research
The views expressed herein are those of the author and do not represent an official position of the U.S. NRC.
Guest Lecture at the University of Michigan Nuclear Engineering & Radiological Sciences Dept.
Course # NERS 462
General Systems Thinking Framework for Hazard Analysis
Weinberg's partitioning of the problem space
- Organized simplicity
  - Region of mechanical laws
- Unorganized complexity
  - Region of sufficient diversity or randomness for statistics to be reliable
- Organized complexity
  - Region too complex for analysis and too organized for statistics
Source: An Introduction to General Systems Thinking, Gerald M. Weinberg, PhD Communication Sciences, 1963, University of Michigan
[Figure labels: Statistics, Probability; Analytic Reduction; Systems Thinking, Systems Theory (ST)]
Organized Simplicity Example: A mechanical system (pump; valve)
- Composed of well-proven, well-understood components
- Well-understood behavior
- Well-understood degradation: cause-effect relationships
- Assembled with well-proven joint designs
- Behavior is well understood (normal; abnormal)
- Predictable
- Deterministic
- Valid range & combination of inputs known & enforceable
- Valid range of environmental conditions known & enforceable
- Failure modes and effects well-understood
- Properties confirmed with operating experience
Unorganized complexity
Example: gas in a vessel subjected to heat. What is the behavior of the gas particles in this vessel?
Brownian motion:
- Particles are perpetually in a state of random jittery motion.
- "The collisions between molecules and particles, that are the cause of Brownian motion, are not truly random. They follow well-known physical laws and can in principle be predicted." (Source)
- "However, we can easily make predictions about the average behaviour of a collection of particles, and this is often just as useful." (Source)
Takeaway: Test the validity of the selected distribution for your context (purpose; system; etc.)
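As an illustration of the takeaway, the sketch below compares the simulated average behavior of many particles with what a simple statistical model predicts. It assumes a one-dimensional unit-step random walk as a stand-in for Brownian motion; the function name, particle count, and step count are hypothetical choices made for this example, not part of the lecture.

```python
import random

def mean_squared_displacement(n_particles, n_steps, seed=0):
    """Average of x^2 over many independent 1-D unit-step random walks."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    total = 0.0
    for _ in range(n_particles):
        x = 0
        for _ in range(n_steps):
            x += rng.choice((-1, 1))  # each step is equally likely left or right
        total += x * x
    return total / n_particles

# For this assumed model, the predicted mean squared displacement equals n_steps.
predicted = 100
msd = mean_squared_displacement(n_particles=5000, n_steps=100)
relative_error = abs(msd - predicted) / predicted
print(f"simulated={msd:.1f} predicted={predicted} relative_error={relative_error:.3f}")
```

Individual trajectories are unpredictable, yet the ensemble average can be checked against the model. If the relative error were large, that would be a signal that the assumed distribution does not fit the context.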
Organized complexity
Modern systems [a] are a combination of dissimilar elements:
- physical (e.g.: mechanical; electrical; electronic)
- cyber (software-based)
The behavior of the system is more than the collective behavior of its elements:
- The system exhibits emergent behaviors (e.g.: safety is an emergent property)
Design goal:
- Achieve required emergent behavior (perform the required safety function)
- Avoid, preclude, or prevent hazardous emergent behaviors
- Best achieved by constraining (pruning) the design space.
- Identify hazardous conditions, then formulate preventative constraints.
a: Known as cyber-physical systems
On system analyzability for safety
Some system design characteristics that make safety analysis difficult:
- Dependencies across components
- Component behavior in the system may not be the same as its standalone behavior
- Emergent behavior not manifested in the aggregated behaviors of the components
- Interactions can be complex, non-linear, dynamic, e.g.:
  - feedback loops between components
  - behavior of one part may affect the behavior of others
Some thoughts on system design for safety (for more, see RIL-1101):
- Reactors are purpose-specific engineered systems
- Constrain the design space to admit only solutions that are analyzable for safety
Some characteristics of system architecture for analyzability
- Component behavior does not change when used in a system
- Nested hierarchy
- No feedback loops between components
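The "no feedback loops between components" characteristic can be checked mechanically once the architecture is written down as a directed graph. The following is a minimal sketch using depth-first search; the component names and the edge-list representation are hypothetical and are not taken from RIL-1101.

```python
def find_feedback_loop(edges):
    """Return one cycle as a list of nodes, or None if the graph is acyclic."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
        graph.setdefault(dst, [])
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = {node: WHITE for node in graph}
    path = []

    def visit(node):
        color[node] = GRAY
        path.append(node)
        for nxt in graph[node]:
            if color[nxt] == GRAY:  # edge back onto the current path: a loop
                return path[path.index(nxt):] + [nxt]
            if color[nxt] == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for node in graph:
        if color[node] == WHITE:
            found = visit(node)
            if found:
                return found
    return None

# Hypothetical architectures: a strict hierarchy, and the same with one back edge.
hierarchical = [("system", "subsystem_a"), ("system", "subsystem_b"),
                ("subsystem_a", "sensor")]
with_loop = hierarchical + [("sensor", "system")]
print(find_feedback_loop(hierarchical))
print(find_feedback_loop(with_loop))
```

A check like this only covers dependencies that are explicit in the model; hidden couplings would still need to be found by other means.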
Some resources on the web
- Risk identification methods
- Risk analysis methods (consequences)
- Risk analysis methods (severity)
- Risk analysis methods (causes; threats)
- Risk analysis methods (likelihood)
  - BEWARE OF applying to rare event cases! (The Black Swan: The Impact of the Highly Improbable)
- Risk evaluation methods
- ISO 31000 Risk management guidelines
- TUHH Overview
- ISO Overview
Approaches and Methods used for Hazard Analysis
- 1. Cause Consequence Analysis (CCA)
  Covered in STPA.
- 2. Common-Cause Failure Analysis (CCFA)
  May be viewed as an extension of CCA. Covered in STPA.
- 3. Functional Failure Modes and Effects Analysis (FFMEA)
  Performed at concept phase, useful in driving design. For generic failure/fault modes, see RIL-1002.
- 4. Design Failure Modes and Effects Analysis (DFMEA)
  Good fit for the organized simplicity category of systems. Bottom-up method.
- 5. Dynamic Flowgraph Method (DFM)
  Able to cover hazards from interactions in the organized complexity category of systems. Needs system design.
- 6. Functional Hazard Analysis (FuHA)
  Performed at concept phase, useful in driving design. Top-down method. STPA can fulfill the FuHA role.
- 7. Fault Hazard Analysis (FHA)
  Similar to FuHA, but applied to component design. Less effort than FMEA when performed qualitatively.
- 8. Fault Tree Analysis (FTA)
  Top-down (outside-in) method. Traditional FT does not capture systemic interactions or emergent behaviors.
- 9. Hazard and Operability Study (HAZOP)
  Benefit: prevention when applied to concept; validation when applied to system architecture.
- 10. Hazard Analysis & Critical Control Points (HACCP)
  Focuses effort on critical points in the system development process.
- 11. System-Theoretic Process Analysis (STPA)
  Top-down (outside-in) method.
- 12. What-If Analysis (W/IA)
  Complements procedural methods; based on expert team brainstorming, drawing upon heuristics.
- 13. Markov Cell-to-Cell Mapping Technique (CCMT)
  Benefit similar to DFM, but less evidence of successful commercial use and support.
Using combination of methods

| Method | Limitation | Antidote |
|---|---|---|
| DFMEA | Discovery is late in the development process; most hazards are introduced and are preventable earlier in the development process. Does not reflect systemic interactions, if not explicit in the design. | Use in combination with a top-down method. |
| FTA | Does not reflect systemic interactions, if not explicit in the design. | Use in combination with a method that uncovers hazards from interactions. |
| STPA | Completeness not assured. Volume of information could increase beyond human comprehensibility, as the analysis progresses down contributory levels. Creation of control model requires skill; not a mechanical process. | Combine with What-If Analysis. Assist with automation (e.g., modeling environments), with libraries of pre-validated domain-specific building blocks. Example: CAMET tool suite. |
| HAZOP | Skill-dependent. | Domain-specific templates. Domain-specific guide words. Guidance of skilled facilitator. Support with catalog of well-known hazard-contributing scenarios. See RIL-1101. |

Presentation on comparison of methods.
Hazard Identification Example: Checklist vs Active Search

| | Checklist | Active Search (aka Red Teaming) |
|---|---|---|
| General Process | Stepping through list: 1) Ask what each hazard might do; 2) Screen or retain for further analysis using established criteria | Looking at undesired state (e.g., failure of key components), ask: 1) What conditions might cause this undesired state? 2) What hazards or hazard combinations might create these conditions? 3) If there are protective barriers preventing the undesired conditions, what might fail these barriers? |
| Advantages | More complete. Methodical, easy to document. | More direct. Less restricted by categorization. Engages imagination. |
| Challenges | Not wasting time on unimportant categories. Avoiding urge to screen (to finish the job). | Tempering imagination with plausibility. Ensuring reasonable completeness. |
Risk Information: Qualitative + Quantitative*
$\text{Risk} \equiv \{\langle S_i, C_i, P_i \rangle\}$
- $S_i$: What can go wrong?
- $C_i$: What are the consequences?
- $P_i$: How likely is it?
- Kaplan/Garrick triplet definition has been adopted by NRC. See:
- White Paper on Risk-Informed and Performance-Based Regulation (Revised), SRM to SECY-98-144, March 1, 1999
- Glossary of Risk-Related Terms in Support of Risk-Informed Decisionmaking, NUREG-2122, May 2013
- Probabilistic Risk Assessment and Regulatory Decisionmaking: Some Frequently Asked Questions, NUREG-2201, September 2016
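To make the triplet idea concrete, the sketch below stores each scenario with its consequence and likelihood and then computes one possible summary figure. The class name, field names, units, and numeric values are hypothetical and chosen only for illustration; they do not come from the cited NRC documents.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RiskTriplet:
    scenario: str       # What can go wrong?
    consequence: float  # What are the consequences? (hypothetical units)
    frequency: float    # How likely is it? (hypothetical events per year)

def expected_consequence(triplets):
    """Frequency-weighted sum of consequences over all scenarios."""
    return sum(t.frequency * t.consequence for t in triplets)

# Two made-up scenarios: one frequent and mild, one rare and severe.
risk = [
    RiskTriplet("scenario A", consequence=10.0, frequency=1e-3),
    RiskTriplet("scenario B", consequence=1000.0, frequency=1e-6),
]
total = expected_consequence(risk)
print(f"expected consequence per year: {total:.4f}")
```

Keeping the full set of triplets preserves the qualitative information (which scenarios, how severe each one is) that a single aggregated number hides, which matters most for rare, high-consequence scenarios.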
Examples: Identifying hazards controlling conditions
RIL-1101 Table 1: Considerations in broadly evaluating hazard analysis

| ID (H-n-mm) | Contributory hazards | ID (H-0-i) | Conditions that reduce the hazard space |
|---|---|---|---|
| H-0-6 | Hazard controls needed to satisfy system constraints (which prevent hazards) are inadequate. | H-0-6G1 | Hazard controls are identified and validated to be correct, complete, and consistent. [H-0-7G1] |
| H-0-7 | Flow-down to verifiable requirements and constraints is inadequate. | H-0-7G1 | Requirements and constraints [H-0-6G1] are formulated and validated to be correct, complete, consistent. |
| H-0-11 | Required control action is degraded. | H-0-11G1 | Each required control action is analyzed for ways in which it can lead to a hazard, e.g. the control action: 1) is not provided when needed; 2) is provided when not needed; 3) is provided at incorrect time; 4) is provided too long; 5) is provided too short; 6) is intermittent; 7) interferes with another..; 8) exhibits Byzantine behavior. Also: 9) Incorrect state transition occurs; 10) Incorrect input value. |

Sources: RIL-1101; RIL-1002
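The list of ways a control action can lead to a hazard lends itself to systematic enumeration: every control action is crossed with every failure mode to produce prompts for the analyst. The sketch below uses only the first six modes from the list above (the truncated item 7 is left out); the function name and the example control actions are assumptions for illustration.

```python
# The first six ways a control action can lead to a hazard, as listed above.
FAILURE_MODES = [
    "not provided when needed",
    "provided when not needed",
    "provided at incorrect time",
    "provided too long",
    "provided too short",
    "is intermittent",
]

def hazard_prompts(control_actions):
    """Cross every control action with every failure mode."""
    return [f"'{action}' {mode}" for action in control_actions
            for mode in FAILURE_MODES]

# Hypothetical control actions for a reactor trip function.
prompts = hazard_prompts(["Trip", "Open breakers"])
for prompt in prompts:
    print(prompt)
```

Each generated prompt is only a question to investigate; deciding whether it is actually hazardous in a given context still requires engineering judgment.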
Examples: Controlling causes of hazards from complexity

| ID (H-S-) | Contributory hazards | ID (H-S-) | Conditions that reduce the hazard space |
|---|---|---|---|
| 1 | The system is not sufficiently verifiable and understandable... considerations and criteria are not formulated at the beginning of the development lifecycle; therefore, corresponding architectural constraints are not formalized and checked. | 1G1 | Verifiability is a required property, flowing down from the system to its most finely grained constituents. |
| | | 1G1.1 | Verifiability is checked at every phase, at every level of integration, before the next phase. |
| | | 1.1G1.1 | The behavior is unambiguously specified (incl. unexpected inputs) at every level of integration. |
| | | 1.1G1.2 | The flow-down (from composition to decomposition) ensures that: 1) Allocated behaviors satisfy the behavior specified at the next higher level. 2) Unspecified behavior does not occur. |
| | | 1.1G1.3 | System behavior is composed of element behaviors such that when all elements are verified individually, their compositions may also be considered verified; no unspecified behavior emerges. |
| | | 1.1G1.4 | Development follows a refinement process. |
| 1.1.1 | Unanalyzed/unanalyzable conditions exist, e.g. unknown/unwanted system states. | 1.1.1G1 | Static analyzability: System is statically analyzable. 1) All states, including fault conditions, are known. 2) All fault states that lead to failure modes are known. 3) The safe-state space of the system is known. |
| 1.2 | | | |
| 1.3 | | | |
| 2 | Comprehensibility: System behavior not interpreted correctly/consistently by its users [H-S-1]. | 2G1 | Behavior is completely and explicitly specified. |
| | | 2G3 | Behavior is understood or interpreted completely, correctly, consistently, and unambiguously. |
| | | 2G6 | The architecture is specified such that it is unambiguously interpretable by the community of its users (e.g., reviewers, architects, designers, implementers), that is, the people and the tools they use. |

Source: RIL-1101
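One way to read the static analyzability condition ("all states, including fault conditions, are known") is as a reachability check on a finite state model: every state the system can reach must appear in the specification. The sketch below assumes a small hypothetical state machine; the state names, events, and transition table are invented for illustration and do not come from RIL-1101.

```python
from collections import deque

def reachable_states(initial, transitions):
    """Breadth-first search over a {(state, event): next_state} table."""
    seen = {initial}
    queue = deque([initial])
    while queue:
        state = queue.popleft()
        for (src, _event), dst in transitions.items():
            if src == state and dst not in seen:
                seen.add(dst)
                queue.append(dst)
    return seen

# Hypothetical model of a protection function.
transitions = {
    ("operating", "trip_demand"): "tripped",
    ("operating", "fault_detected"): "faulted",
    ("faulted", "trip_demand"): "tripped",
    ("tripped", "reset"): "operating",
}
specified = {"operating", "tripped", "faulted"}

# Any reachable state that is missing from the specification is a gap.
unspecified = reachable_states("operating", transitions) - specified
print(sorted(unspecified))
```

For realistic systems the state space grows quickly, which is one reason the constraints above aim to keep the design simple enough for this kind of analysis to remain feasible.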
Examples: Controlling causes of hazards from interference

| ID (H-SA-) | Contributory hazards | ID (H-SA-) | Conditions that reduce the hazard space |
|---|---|---|---|
| 3 | A system, device, or other element (external or internal to a safety system) might affect a safety function adversely through unintended interactions caused by some combination of deficiencies, disorders, malfunctions, or oversights. | 3G2 | Interactions and interconnections that preclude complete V&V are avoided, eliminated, or prevented. |
| | | 3G3 | Freedom from interference is assured provably across: 1) Lines of defense. 2) Redundant divisions of system. 3) Degrees of safety qualification. 4) Monitoring & monitored elements of the system. |
| | | 3G4 | Analysis of the system demonstrates that unintended behavior is not possible. 1) Interaction across different sources of uncertainty is avoided. 2) The architecture precludes unwanted interactions, unwanted or hidden couplings. 3) Specified information exchanges or communications occur in safe ways. |
| | | 3G6 | Constraints are identified for such contributing hazards from the environment as EMI; |
| | | 3G7 | The impact of dependency-affecting change is analyzed to demonstrate no adverse effect. |
| 4 | [H-SA-3G4]: A function, whose execution is required at a particular time, cannot be performed as required because of interference through sharing of some resource it needs. | 4G1 | Analysis of the execution-behavior of the system proves that such interference will not occur. For example, worst-case execution time is guaranteed. |
| 5 | Timing constraints are not correctly specified and not correctly allocated. | 5G1 | |

Source: RIL-1101
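As a simple illustration of the worst-case execution time (WCET) argument in 4G1, the sketch below assumes a static, non-preemptive cyclic schedule in which tasks share one processor and run in a fixed order within a frame. That scheduling model, the function names, and the numbers are assumptions made for this example; RIL-1101 does not prescribe them.

```python
def fits_in_frame(wcets_ms, frame_ms):
    """True if all tasks finish within the frame even when each takes its WCET."""
    return sum(wcets_ms) <= frame_ms

def worst_case_completion(wcets_ms, index):
    """Completion time of task `index` when it and every earlier task take their WCET."""
    return sum(wcets_ms[: index + 1])

# Hypothetical WCETs (milliseconds) for three tasks run in a fixed order.
frame_tasks = [2.0, 3.5, 1.5]
print(fits_in_frame(frame_tasks, frame_ms=10.0))
print(worst_case_completion(frame_tasks, index=1))
```

Under these assumptions, a task can be delayed only by the tasks scheduled before it in the same frame, so interference through the shared processor is bounded by construction.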
STPA in 4 steps
- 1. Identify the losses of concern
- 2. Model the control structures
- 3. Identify unsafe control actions
- 4. Identify causal scenarios
References:
- The STPA Handbook
- Introduction to STAMP (theoretical foundation) - Nancy Leveson
- Introduction to STPA (hazard analysis method) - John Thomas
- CAST tutorial (method to analyze and learn from mishaps)
Examples of losses of concern in a nuclear reactor
- Unwanted radioactive emissions
- Loss of electricity production
- Damage to equipment
Example of STPA Step 1 for a nuclear reactor
Losses of concern (examples)
- L-1: Loss of life; injury to people; long-term health effects on people
- L-2: Damage to environment (e.g.: Contamination)
- L-3: Loss of electricity production
- L-4: Other financial loss
- L-5: Loss of goodwill, reputation, trust, investor confidence, customer confidence
Hazards (examples)
- H-1: Plant creates unacceptable radioactive exposure [L-1, L-2, L-3, L-4, L-5]
  - H-1.1: Large early release
  - H-1.2: Exposure without release
  - H-1.3: Other release
- H-2: Plant releases excessive energy (e.g. explosion, steam release) [L-1, L-3, L-4, L-5]
  - H-2.1: Kinetic energy
  - H-2.2: Thermal energy
  - H-2.3: Other
- H-3: Plant is unable to generate sufficient power [L-3, L-4]
- H-4: Plant is physically damaged, degraded, or needs repair [L-3, L-4]
  - H-4.1: Damage to fuel
  - H-4.2: Damage to balance-of-plant equipment, turbines, generator
  - H-4.3: Damage to reactor structure
  - H-4.4: Damage to containment
- H-5: Plant is outside the envelope of its licensing basis [L-3, L-4, L-5]
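The bracketed loss references on each hazard form a traceability relation that can be checked automatically. The sketch below encodes the top-level hazards and losses from the example above and looks for two kinds of gaps: hazards that cite an undefined loss, and losses that no hazard traces to. The function name and the data layout are assumptions for illustration.

```python
# Losses and top-level hazard-to-loss links, copied from the example above.
losses = {"L-1", "L-2", "L-3", "L-4", "L-5"}
hazards = {
    "H-1": {"L-1", "L-2", "L-3", "L-4", "L-5"},
    "H-2": {"L-1", "L-3", "L-4", "L-5"},
    "H-3": {"L-3", "L-4"},
    "H-4": {"L-3", "L-4"},
    "H-5": {"L-3", "L-4", "L-5"},
}

def traceability_gaps(losses, hazards):
    """Return (hazards citing undefined losses, losses not cited by any hazard)."""
    dangling = {h: sorted(linked - losses)
                for h, linked in hazards.items() if linked - losses}
    covered = set().union(*hazards.values()) if hazards else set()
    uncovered = sorted(losses - covered)
    return dangling, uncovered

dangling, uncovered = traceability_gaps(losses, hazards)
print(dangling, uncovered)
```

The same check extends naturally to later steps, for example verifying that every unsafe control action cites a defined hazard.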
Generic Control Structure (STPA step 2)
[Figure: Controller(s), containing Control Algorithm(s) and Process Model(s), send Control Actions to the Controlled Process and receive Feedback from it.]
Control structure for a reactor trip system (simplified)
[Figure: Elements: Operators (MCR); Operators (RSS); Protection System (Reactor Trip Module; Engineered Safety Features Module); RTBs; Contacts; Sensors; Process within the reactor (relevant to safety). Control actions: Trip; Open breakers; Open contacts. Feedback: Plant Parameters; Rod status; Power level; Neutron flux; Breaker position; Contact position.]
STPA step 3: Identify unsafe control actions (UCAs)
Control action: (Manual) trip on pressurizer (PZR) pressure too high or too low
- Not providing causes hazard
  - UCA-1: Operator does not provide Trip signal when PZR pressure is lower or higher than setpoints (under non-bypass conditions) (H-4.2)
  - UCA-2: Operator does not provide trip signal when RTS / DAS fails (H-3)
- Providing causes hazard
  - UCA-4: Operator provides Trip signal when PZR pressure is below Hi setpoint or above Lo setpoint (H-3)
  - UCA-5: Operator provides Trip signal under bypass conditions (H-3)
- Too early; too late; out of order
  - UCA-7: Operator delays trip when PZR Press above Hi setpoint (under non-bypass conditions) and when the PS has not provided the trip (H-4.2)
  - UCA-8: Operator is early to trip when PZR Press below Hi setpoint and above Lo setpoint (H-3)
  - UCA-9: UCA-8 Operator is early to trip
- Stopped too soon; applied too long
  - UCA-11: Operator does not disengage trip signal when appropriate (H-3)
  - UCA-12: Operator engaged but stopped before Reactor trip circuitry could engage (H-4.2)
Prevention of UCAs → Safety constraints → Drive system design
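The "providing" and "not providing" UCAs above can be expressed as conditions on the plant state, which is one way safety constraints start to drive design. The sketch below covers only UCA-1, UCA-4, and UCA-5. It assumes UCA-4 means the pressure is within the band between the Lo and Hi setpoints, and the class, function, and numeric values are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PlantState:
    pzr_pressure: float  # hypothetical units
    lo_setpoint: float
    hi_setpoint: float
    bypass: bool

def classify_trip(state, operator_trips):
    """Return the matching UCA label, or None if the action is not flagged."""
    out_of_band = (state.pzr_pressure < state.lo_setpoint
                   or state.pzr_pressure > state.hi_setpoint)
    if state.bypass:
        # UCA-5: trip provided under bypass conditions.
        return "UCA-5" if operator_trips else None
    if out_of_band and not operator_trips:
        return "UCA-1"  # trip needed but not provided
    if not out_of_band and operator_trips:
        return "UCA-4"  # trip provided while pressure is within the band (assumed reading)
    return None

# Made-up setpoints and pressures, for illustration only.
high = PlantState(pzr_pressure=16.0, lo_setpoint=13.0, hi_setpoint=15.5, bypass=False)
normal = PlantState(pzr_pressure=15.0, lo_setpoint=13.0, hi_setpoint=15.5, bypass=False)
print(classify_trip(high, operator_trips=False))
print(classify_trip(normal, operator_trips=True))
```

Timing-related UCAs (too early, too late, stopped too soon) would need a model with time in it and are not covered by this sketch.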
STPA step 4: Identify causal scenarios
What could cause the UCAs?
a) Identifying scenarios that lead to Unsafe Control Actions: How could incorrect feedback, inadequate requirements, design errors, component failures, and other factors cause unsafe control actions and ultimately lead to losses?
b) Identifying scenarios in which control actions are improperly executed or not executed: How might safe control actions be provided but not followed or executed properly, leading to a loss?
This is a backward propagation search process.
STPA step 4a: Identifying scenarios that lead to Unsafe Control Actions 1/2
Common categories of causes of unsafe controller behavior
- 1. Failures related to the controller (for physical controllers)
  - 1. Physical failure of the controller itself
  - 2. Power interruption (e.g.: power supply failure; intermittent connection; short circuit; open circuit)
- 2. Inadequate control algorithm
  - 1. Flawed implementation of the specified control algorithm
  - 2. The specified control algorithm is flawed
  - 3. The specified control algorithm becomes inadequate over time due to changes or degradation
- 3. Unsafe control input: UCA received from another controller (already addressed when considering UCAs from other controllers)
- 4. Inadequate process model
  - 1. Controller receives incorrect feedback/information
  - 2. Controller receives correct feedback/information but interprets it incorrectly or ignores it
  - 3. Controller does not receive feedback/information when needed (delayed or never received)
  - 4. Necessary controller feedback/information does not exist
STPA step 4a: Identifying scenarios that lead to Unsafe Control Actions 2/2
Common categories of causes of inadequate feedback and information
- 1. Feedback of information not received
  - 1. Feedback/info sent by sensor but not received by controller
  - 2. Feedback/info is not sent by sensor(s) but is received or applied to sensor(s)
  - 3. Feedback/info is not received by or applied to sensor(s)
  - 4. Feedback/info does not exist in control structure or sensor(s) do(es) not exist
- 2. Inadequate feedback is received
  - 1. Sensors respond adequately but controller receives inadequate feedback/info
  - 2. Sensors respond inadequately to feedback/info received by or applied to sensors
  - 3. Sensors are not capable of providing necessary feedback/info (i.e., not designed to)
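The category "feedback/info does not exist in control structure" can be screened for directly on a control-structure model: any element that receives a control action but reports no feedback is a candidate gap. The sketch below borrows element names from the reactor trip example; the pairing of each feedback signal with an element, and the function name, are assumptions made for illustration.

```python
def targets_without_feedback(control_actions, feedback):
    """Elements that receive a control action but provide no feedback signal."""
    observed = set(feedback.values())
    return sorted({target for target in control_actions.values()
                   if target not in observed})

# Control action -> element it acts on (assumed pairing).
control_actions = {"Open breakers": "RTBs", "Open contacts": "Contacts"}
# Feedback signal -> element it reports on (assumed pairing).
complete = {"Breaker position": "RTBs", "Contact position": "Contacts"}
incomplete = {"Breaker position": "RTBs"}

print(targets_without_feedback(control_actions, complete))
print(targets_without_feedback(control_actions, incomplete))
```

A structural screen like this finds only missing feedback paths; feedback that exists but is delayed, incorrect, or misinterpreted falls under the other categories and needs scenario-level analysis.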
STPA step 4b: Identifying scenarios in which control actions are not executed (or not executed correctly): Scenarios involving the control path
- 1. Control action not executed
  - 1. Control action is sent by controller but not received by actuator
  - 2. Control action is received by actuator but actuator does not respond
  - 3. Actuator responds but the control action is not applied to or received by the controlled process
- 2. Control action is improperly executed
  - 1. Control action is sent by controller but received improperly by actuator, e.g. due to
    - 1. Delay in communication
    - 2. Out-of-order transmission
    - 3. Lost communication
  - 2. Control action is received correctly by actuator but actuator response is inadequate, e.g. due to
    - 1. Loss of power to the actuator
    - 2. Inaccuracies in actuator operation
    - 3. Actuator misbehavior
    - 4. Delay in actuator response
    - 5. Actuator receives some other (possibly conflicting) command from some other source
    - 6. Incorrect priority scheme used by actuator
    - 7. Incorrect configuration
    - 8. Actuator behavior changes or degrades over time
    - 9. Unanticipated conditions in the actuator environment
  - 3. Actuator responds adequately but the controlled process receives or applies the control action improperly
  - 4. Control action is not sent by controller but actuators or other elements respond as if it did
STPA step 4b: Identifying scenarios in which control actions are not executed (or not executed correctly): Scenarios related to the controlled process
- 1. Control action not executed
  - 1. Control action is applied to or received by the controlled process but the controlled process does not respond
- 2. Control action improperly executed
  - 1. Control action is applied to or received by the controlled process but the controlled process responds improperly
  - 2. Control action is not applied to or received by the controlled process but the process responds as if it did
Progression of hazard analysis during development
[Figure: Reference model from IEEE Std 1012. System Development phases: Plans; Concept; Requirements; Architecture; Detailed design; Implementation; Testing. Safety Engineering (with Requirements from NPP Safety Analysis): HAp, HAc, HAr, HAr, HAdd, HAi, HAi. Verification & Validation (V&V): Vp, Vc, Vr, Va, Vdd, Vi, Vt.]
The Future around the corner Model-based systems engineering to support safety analysis:
- SysML version 2 (modeling critical systems)
- RAAML (Risk Analysis & Assessment Modeling Language)
- AADL (modeling critical cyber-physical systems)
- AADL Error Library blog
- AADL Error Library paper
- Article on fault modeling and analysis using AADL
- Example implementation