ML23272A033

Identifying Hazards in a System Design
ML23272A033
Person / Time
Issue date: 09/29/2023
From: Sushil Birla
NRC/RES/DE
To:
Sushil Birla 301-415-2311


Text

Identifying Hazards in a System Design

Guest Lecture at the University of Michigan, Nuclear Engineering & Radiological Sciences Dept., Course # NERS 462

Sushil Birla, Senior Technical Advisor
U.S. Nuclear Regulatory Commission, Office of Nuclear Regulatory Research

The views expressed herein are those of the author and do not represent an official position of the U.S. NRC.

General Systems Thinking Framework for Hazard Analysis

Weinberg's partitioning of the problem space

  • Organized simplicity

- Region of mechanical laws.

  • Unorganized complexity

- Region of sufficient diversity or randomness for statistics to be reliable

  • Organized complexity

- Region too complex for analysis and too organized for statistics

Source: An Introduction to General Systems Thinking, Gerald M. Weinberg, PhD Communication Sciences, 1963, University of Michigan

[Figure: Weinberg's partitioning of the problem space, annotated with analysis approaches: Statistics, Probability; Systems Thinking, Systems Theory (ST); Analytic Reduction]

Organized Simplicity Example: A mechanical system (pump; valve)

  • Composed of well-proven, well-understood components

- Well-understood behavior

- Well-understood degradation: cause-effect relationships

  • Assembled with well-proven joint designs
  • Behavior is well understood (normal; abnormal)

- Predictable

- Deterministic

- Valid range & combination of inputs known & enforceable

- Valid range of environmental conditions known & enforceable

  • Failure modes and effects well-understood
  • Properties confirmed with operating experience

Unorganized complexity

What is the behavior of gas particles in this vessel, subjected to heat?

[Figure: a vessel of gas subjected to heat]

Brownian motion:

- Particles are perpetually in a state of random jittery motion.

- The collisions between molecules and particles, that are the cause of Brownian motion, are not truly random. They follow well-known physical laws and can in principle be predicted.

- However, we can easily make predictions about the average behaviour of a collection of particles, and this is often just as useful.

Takeaway: Test the validity of the selected distribution for your context (purpose; system; etc.)
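The takeaway can be tried out on a toy model. The sketch below is only an illustration and is not from the lecture: it assumes a one-dimensional random walk with steps of +1 or -1 as a stand-in for Brownian motion, and the function names are hypothetical. Individual walkers end up in very different places, but the mean squared displacement of the collection should be close to the number of steps, which is the kind of statistical prediction one can test against data before relying on it.

```python
import random
import statistics


def simulate_walks(n_particles: int, n_steps: int, seed: int = 0) -> list:
    """Final positions of independent 1-D walkers taking steps of +1 or -1."""
    rng = random.Random(seed)  # fixed seed so the experiment is repeatable
    finals = []
    for _ in range(n_particles):
        position = 0
        for _ in range(n_steps):
            position += rng.choice((-1, 1))
        finals.append(position)
    return finals


def mean_squared_displacement(finals: list) -> float:
    """Average of x^2 over all walkers."""
    return statistics.fmean([x * x for x in finals])


def relative_error(observed: float, expected: float) -> float:
    """How far the observed statistic is from the assumed model's prediction."""
    return abs(observed - expected) / expected


if __name__ == "__main__":
    n_steps = 200
    finals = simulate_walks(n_particles=1000, n_steps=n_steps)
    msd = mean_squared_displacement(finals)
    # For an unbiased +/-1 walk the expected mean squared displacement is n_steps.
    print("observed MSD:", msd, "model prediction:", n_steps)
    print("relative error:", relative_error(msd, n_steps))
```

If the relative error were large for the data at hand, that would be a signal that the assumed model (here, unbiased independent steps) does not fit the context.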

Organized complexity

  • Modern systems (a) are a combination of dissimilar elements:

- physical (e.g.: mechanical; electrical; electronic)

- cyber (software-based)

  • The behavior of the system is more than the collective behavior of its elements:

- The system exhibits emergent behaviors (e.g.: safety is an emergent property)

  • Design goal:

- Achieve required emergent behavior (perform the required safety function)

- Avoid, preclude, or prevent hazardous emergent behaviors

  • Best achieved by constraining (pruning) the design space.
  • Identify hazardous conditions → Formulate preventative constraints.

a: Known as cyber-physical systems

On system analyzability for safety

Some system design characteristics that make safety analysis difficult:

  • Dependencies across components
  • Component behavior in the system may not be the same as its standalone behavior
  • Emergent behavior not manifested in the aggregated behaviors of the components
  • Interactions can be complex, non-linear, dynamic, e.g.:

- feedback loops between components

- behavior of one part may affect the behavior of others

Some thoughts on system design for safety (for more, see RIL-1101):

  • Reactors are purpose-specific engineered systems

- Constrain the design space to admit only solutions that are analyzable for safety

  • Some characteristics of system architecture for analyzability

- Component behavior does not change when used in a system

- Nested hierarchy

  • No feedback loops between components
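The "no feedback loops between components" characteristic can be checked mechanically once component dependencies are written down. The following is a minimal sketch, assuming the architecture is available as a dictionary mapping each component to the components it depends on; the component names and the function `find_feedback_loop` are hypothetical, and the check is a plain depth-first cycle search rather than any method prescribed by RIL-1101.

```python
from __future__ import annotations

from collections import defaultdict

WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, fully explored


def find_feedback_loop(depends_on: dict[str, list[str]]) -> list[str] | None:
    """Return one dependency cycle as a list of names, or None if there is none."""
    color: dict[str, int] = defaultdict(int)
    path: list[str] = []

    def visit(node: str) -> list[str] | None:
        color[node] = GRAY
        path.append(node)
        for nxt in depends_on.get(node, []):
            if color[nxt] == GRAY:  # back edge: nxt is already on the path
                return path[path.index(nxt):] + [nxt]
            if color[nxt] == WHITE:
                cycle = visit(nxt)
                if cycle:
                    return cycle
        path.pop()
        color[node] = BLACK
        return None

    for node in list(depends_on):
        if color[node] == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


# Hypothetical nested hierarchy: no component depends on anything above it.
nested = {
    "system": ["division_a", "division_b"],
    "division_a": ["sensor_a", "voter_a"],
    "division_b": ["sensor_b", "voter_b"],
}

# Hypothetical architecture with a loop between three components.
looped = {"comp_x": ["comp_y"], "comp_y": ["comp_z"], "comp_z": ["comp_x"]}

if __name__ == "__main__":
    print(find_feedback_loop(nested))
    print(find_feedback_loop(looped))
```

A check like this only covers dependencies that are explicit in the model; hidden couplings still have to be found by analysis.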

Some resources on the web

  • Risk identification methods
  • Risk analysis methods (consequences)
  • Risk analysis methods (severity)
  • Risk analysis methods (causes; threats)
  • Risk analysis methods (likelihood)

- BEWARE OF applying to rare event cases! (The Black Swan: The impact of the highly improbable)

  • Risk evaluation methods
  • ISO 31000 Risk management guidelines

- TUHH Overview

- ISO Overview

Approaches and Methods used for Hazard Analysis

1. Cause Consequence Analysis (CCA)

Covered in STPA

2. Common-Cause Failure Analysis (CCFA)

May be viewed as extension of CCA. Covered in STPA

3. Functional Failure Modes and Effects Analysis (FFMEA)

Performed at concept phase, useful in driving design. For generic failure/fault modes, see RIL-1002

4. Design Failure Modes and Effects Analysis (DFMEA)

Good fit for the organized simplicity category of systems. Bottom-up method.

5. Dynamic Flowgraph Method (DFM)

Able to cover hazards from interactions (organized complexity category of systems). Needs system design.

6. Functional Hazard Analysis (FuHA)

Performed at concept phase, useful in driving design. Top-down method. STPA can fulfill FuHA role.

7. Fault Hazard Analysis (FHA)

Similar to FuHA, but applied to component design. Less effort than FMEA when performed qualitatively.

8. Fault Tree Analysis (FTA)

Top-down (outside-in) method. Traditional FT does not capture systemic interactions or emergent behaviors.

9. Hazard and operability study (HAZOP)

Benefit: Prevention when applied to concept; Validation when applied to system architecture.

10. Hazard Analysis & Critical Control Points (HACCP)

Focuses effort on critical points in the system development process.

11. System-Theoretic Process Analysis (STPA)

Top-down (outside-in) method.

12. What-If Analysis (W/IA)

Complements procedural methods; based on expert team brainstorming, drawing upon heuristics.

13. Markov Cell-to-Cell Mapping Technique (CCMT)

Benefit similar to DFM, but less evidence of successful commercial use and support.

Using combination of methods

  • DFMEA

- Limitation: Discovery is late in the development process; most hazards are introduced and are preventable earlier in the development process. Does not reflect systemic interactions, if not explicit in the design.

- Antidote: Use in combination with a top-down method.

  • FTA

- Limitation: Does not reflect systemic interactions, if not explicit in the design.

- Antidote: Use in combination with a method that uncovers hazards from interactions.

  • STPA

- Limitation: Completeness not assured.

- Antidote: Combine with What-If.

- Limitation: Volume of information could increase beyond human comprehensibility, as the analysis progresses down contributory levels. Creation of control model requires skill; it is not a mechanical process.

- Antidote: Assist with automation (e.g., modeling environments), with libraries of pre-validated domain-specific building blocks. Example: CAMET tool suite.

  • HAZOP

- Limitation: Skill-dependent.

- Antidote: Domain-specific templates. Domain-specific guide words. Guidance of skilled facilitator. Support with catalog of well-known hazard-contributing scenarios. See RIL-1101.

Presentation on comparison of methods.

Hazard Identification Example: Checklist vs Active Search

  • Checklist

- General process: Stepping through list
1) Ask what each hazard might do
2) Screen or retain for further analysis using established criteria

- Advantages: More complete. Methodical, easy to document.

- Challenges: Not wasting time on unimportant categories. Avoiding urge to screen (to finish the job).

  • Active Search (aka Red Teaming)

- General process: Looking at undesired state (e.g., failure of key components), ask
1) What conditions might cause this undesired state
2) What hazards or hazard combinations might create these conditions
3) If there are protective barriers preventing the undesired conditions, what might fail these barriers

- Advantages: More direct. Less restricted by categorization. Engages imagination.

- Challenges: Tempering imagination with plausibility. Ensuring reasonable completeness.

Risk Information: Qualitative + Quantitative*

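$\text{Risk} \equiv \{\langle s_i, C_i, p_i \rangle\}$, where for each scenario $i$ the triplet holds the scenario $s_i$, its consequences $C_i$, and its likelihood $p_i$: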

  • What can go wrong?
  • What are the consequences?
  • How likely is it?
  • Kaplan/Garrick triplet definition has been adopted by NRC. See:

- White Paper on Risk-Informed and Performance-Based Regulation (Revised), SRM to SECY-98-144, March 1, 1999

- Glossary of Risk-Related Terms in Support of Risk-Informed Decisionmaking, NUREG-2122, May 2013

- Probabilistic Risk Assessment and Regulatory Decisionmaking: Some Frequently Asked Questions, NUREG-2201, September 2016
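The triplet definition maps directly onto a small data structure. The sketch below is only an illustration: the class name `RiskTriplet`, the example scenarios, and the frequency values are all hypothetical and are not taken from the NRC documents cited above. It shows that "risk" in this sense is a set of (scenario, consequence, likelihood) entries rather than a single number.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RiskTriplet:
    scenario: str       # What can go wrong?
    consequence: str    # What are the consequences?
    likelihood: float   # How likely is it? (assumed here: events per year)


def rank_by_likelihood(risk: list) -> list:
    """Order the triplets from most to least likely."""
    return sorted(risk, key=lambda t: t.likelihood, reverse=True)


# Made-up entries for illustration only.
RISK = [
    RiskTriplet("Sensor drifts out of calibration", "Delayed trip", 1e-2),
    RiskTriplet("Loss of power to actuator", "Trip not executed", 1e-4),
    RiskTriplet("Operator trips under bypass conditions", "Lost production", 1e-3),
]

if __name__ == "__main__":
    for triplet in rank_by_likelihood(RISK):
        print(f"{triplet.likelihood:.0e}/yr  {triplet.scenario} -> {triplet.consequence}")
```

Ranking by likelihood alone is a design choice made for brevity; a qualitative consequence scale could be added as another field when severity also matters.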

Examples: Identifying hazards; controlling conditions

RIL-1101 Table 1: Considerations in broadly evaluating hazard analysis. Each contributory hazard is followed by the conditions that reduce the hazard space.

  • H-0-6: Hazard controls needed to satisfy system constraints (which prevent hazards) are inadequate.

- H-0-6G1: Hazard controls are identified and validated to be correct, complete, and consistent. [H-0-7G1]

  • H-0-7: Flow-down to verifiable requirements and constraints is inadequate.

- H-0-7G1: Requirements and constraints [H-0-6G1] are formulated and validated to be correct, complete, consistent.

  • H-0-11: Required control action is degraded.

- H-0-11G1: Each required control action is analyzed for ways in which it can lead to a hazard, e.g.

1. ~ not provided when needed
2. ~ provided when not needed
3. ~ provided at incorrect time
4. ~ provided too long
5. ~ provided too short
6. ~ is intermittent
7. ~ interferes with another ..
8. ~ exhibits Byzantine behavior
9. Incorrect state transition occurs
10. Incorrect input value

Sources: RIL-1101; RIL-1002
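The list of ways a control action can degrade lends itself to systematic enumeration. The following sketch is an assumption about how one might use it, not a procedure from RIL-1101: it substitutes each control action for the "~" in the guide phrases and produces one question per combination for the analyst to answer. Item 7 is left out because its wording is truncated in the source, and items 9 and 10 are left out because they do not take a control action as subject. The control action names are hypothetical.

```python
# Guide phrases copied from items 1-6 and 8 of the list above.
GUIDE_PHRASES = [
    "not provided when needed",
    "provided when not needed",
    "provided at incorrect time",
    "provided too long",
    "provided too short",
    "is intermittent",
    "exhibits Byzantine behavior",
]


def analysis_prompts(control_actions: list) -> list:
    """One question per (control action, guide phrase) pair."""
    return [
        f"'{action}' {phrase}: can this lead to a hazard?"
        for action in control_actions
        for phrase in GUIDE_PHRASES
    ]


if __name__ == "__main__":
    # Hypothetical control actions for illustration.
    for prompt in analysis_prompts(["Trip signal", "Open breakers"]):
        print(prompt)
```

The enumeration only guarantees that each combination is asked about; judging whether a combination is hazardous in a given context remains an analysis task.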

Examples: Controlling causes of hazards from complexity

Contributory hazards (ID H-S-) are each followed by the conditions that reduce the hazard space (ID H-S-).

  • 1: The system is not sufficiently verifiable and understandable.

- 1G1: Verifiability is a required property, flowing down from the system to its most finely grained constituents.

- 1G1.1: Verifiability is checked at every phase, at every level of integration, before the next phase.

  • ... considerations and criteria are not formulated at the beginning of the development lifecycle; therefore, corresponding architectural constraints are not formalized and checked.

- 1.1G1.1: The behavior is unambiguously specified (incl. unexpected inputs) at every level of integration.

- 1.1G1.2: The flow-down (from composition to decomposition) ensures that:
1. Allocated behaviors satisfy the behavior specified at the next higher level.
2. Unspecified behavior does not occur.

- 1.1G1.3: System behavior is composed of element behaviors such that when all elements are verified individually, their compositions may also be considered verified; no unspecified behavior emerges.

- 1.1G1.4: Development follows a refinement process.

  • 1.1.1: Unanalyzed/unanalyzable conditions exist, e.g. unknown/unwanted system states.

- 1.1.1G1: Static analyzability: System is statically analyzable.
1. All states, including fault conditions, are known.
2. All fault states that lead to failure modes are known.
3. The safe-state space of the system is known.

  • 1.2, 1.3: (no description shown)

  • 2: Comprehensibility: System behavior not interpreted correctly/consistently by its users [H-S-1].

- 2G1: Behavior is completely and explicitly specified.

- 2G3: Behavior is understood or interpreted completely, correctly, consistently, and unambiguously.

- 2G6: The architecture is specified such that it is unambiguously interpretable by the community of its users (e.g., reviewers, architects, designers, implementers), that is, the people and the tools they use.

Source: RIL-1101

Examples: Controlling causes of hazards from interference

Contributory hazards (ID H-SA-) are each followed by the conditions that reduce the hazard space (ID H-SA-).

  • 3: A system, device, or other element (external or internal to a safety system) might affect a safety function adversely through unintended interactions caused by some combination of deficiencies, disorders, malfunctions, or oversights.

- 3G2: Interactions and interconnections that preclude complete V&V are avoided, eliminated, or prevented.

- 3G3: Freedom from interference is assured provably across:
1. Lines of defense.
2. Redundant divisions of system.
3. Degrees of safety qualification.
4. Monitoring & monitored elements of the system.

- 3G4: Analysis of the system demonstrates that unintended behavior is not possible.
1. Interaction across different sources of uncertainty is avoided.
2. The architecture precludes unwanted interactions, unwanted or hidden couplings.
3. Specified information exchanges or communications occur in safe ways.

- 3G6: Constraints are identified for such contributing hazards from the environment as EMI;

- 3G7: The impact of dependency-affecting change is analyzed to demonstrate no adverse effect.

  • 4 [H-SA-3G4]: A function, whose execution is required at a particular time, cannot be performed as required because of interference through sharing of some resource it needs.

- 4G1: Analysis of the execution-behavior of the system proves that such interference will not occur. For example, worst-case execution time is guaranteed.

  • 5: Timing constraints are not correctly specified and not correctly allocated.

- 5G1: (no description shown)

Source: RIL-1101
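Condition 4G1 mentions guaranteed worst-case execution time as an example of evidence that interference through a shared resource will not occur. A very small sketch of that kind of argument follows. It rests on loud assumptions that are not in the source: a single processor, a static cyclic schedule in which every task runs once per frame without preemption, and WCET bounds that have already been established by some other analysis. The task names and numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    wcet_ms: float  # assumed to be a proven upper bound on execution time


def frame_slack_ms(tasks: list, frame_ms: float) -> float:
    """Time left in the frame if every task takes its worst-case time."""
    return frame_ms - sum(task.wcet_ms for task in tasks)


def fits_in_frame(tasks: list, frame_ms: float, margin_ms: float = 0.0) -> bool:
    """True if all tasks complete within the frame, keeping the required margin."""
    return frame_slack_ms(tasks, frame_ms) >= margin_ms


# Hypothetical task set sharing one processor in a 50 ms frame.
TASKS = [
    Task("read_sensors", 5.0),
    Task("trip_logic", 12.0),
    Task("self_test", 8.0),
]

if __name__ == "__main__":
    print("slack (ms):", frame_slack_ms(TASKS, frame_ms=50.0))
    print("fits with 10 ms margin:", fits_in_frame(TASKS, 50.0, margin_ms=10.0))
```

The check is only as good as the WCET bounds fed into it; preemptive or multi-core designs need a different schedulability analysis.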

STPA in 4 steps

1. Identify the losses of concern
2. Model the control structures
3. Identify unsafe control actions
4. Identify causal scenarios

References:

  • The STPA Handbook
  • Introduction to STAMP (theoretical foundation) - Nancy Leveson
  • Introduction to STPA (hazard analysis method) - John Thomas
  • CAST tutorial (method to analyze and learn from mishaps)

Examples of losses of concern in a nuclear reactor

  • Unwanted radioactive emissions
  • Loss of electricity production
  • Damage to equipment

Example of STPA Step 1 for a nuclear reactor

Losses of concern (examples)

L-1: Loss of life; injury to people; long-term health effects on people
L-2: Damage to environment (e.g.: Contamination)
L-3: Loss of electricity production
L-4: Other financial loss
L-5: Loss of goodwill, reputation, trust, investor confidence, customer confidence

Hazards (examples)

H-1: Plant creates unacceptable radioactive exposure [L-1, L-2, L-3, L-4, L-5]

- H-1.1: Large early release

- H-1.2: Exposure without release

- H-1.3: Other release

H-2: Plant releases excessive energy (e.g. explosion, steam release) [L-1, L-3, L-4, L-5]

- H-2.1: Kinetic energy

- H-2.2: Thermal energy

- H-2.3: Other

H-3: Plant is unable to generate sufficient power [L-3, L-4]

H-4: Plant is physically damaged, degraded, or needs repair [L-3, L-4]

- H-4.1: Damage to fuel

- H-4.2: Damage to balance-of-plant equipment, turbines, generator

- H-4.3: Damage to reactor structure

- H-4.4: Damage to containment

H-5: Plant is outside the envelope of its licensing basis [L-3, L-4, L-5]
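The bracketed loss IDs after each hazard form a traceability relation that can be checked for gaps. The sketch below records the top-level hazards and their loss links from the example above and runs two simple checks; the function names are hypothetical and the checks are an illustration, not part of the STPA method definition.

```python
# Loss IDs and hazard-to-loss links as given in the example above.
LOSSES = {"L-1", "L-2", "L-3", "L-4", "L-5"}

HAZARD_TO_LOSSES = {
    "H-1": ["L-1", "L-2", "L-3", "L-4", "L-5"],
    "H-2": ["L-1", "L-3", "L-4", "L-5"],
    "H-3": ["L-3", "L-4"],
    "H-4": ["L-3", "L-4"],
    "H-5": ["L-3", "L-4", "L-5"],
}


def badly_traced_hazards(hazards: dict, losses: set) -> list:
    """Hazards that cite no loss at all, or cite a loss that is not defined."""
    return sorted(
        hazard
        for hazard, linked in hazards.items()
        if not linked or any(loss not in losses for loss in linked)
    )


def uncovered_losses(hazards: dict, losses: set) -> list:
    """Losses that no hazard traces to."""
    covered = {loss for linked in hazards.values() for loss in linked}
    return sorted(losses - covered)


if __name__ == "__main__":
    print("badly traced hazards:", badly_traced_hazards(HAZARD_TO_LOSSES, LOSSES))
    print("uncovered losses:", uncovered_losses(HAZARD_TO_LOSSES, LOSSES))
```

Both lists come back empty for the example as given, meaning every hazard links to defined losses and every loss is reached by at least one hazard.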

Generic Control Structure (STPA step 2)

[Figure: generic control loop. Controller(s), containing Control Algorithm(s) and Process Model(s), issue Control Actions to the Controlled Process and receive Feedback from it.]

Control structure for a reactor trip system (simplified)

[Figure: control structure diagram. Elements and labels recoverable from the slide:]

  • Plant Operators (MCR); Plant Operators (RSS)
  • Labels at the operators: Parameters; Parameters; Trip
  • Protection System: Reactor Trip Module; Engineered Safety Features Module
  • RTBs (labels: Open breakers; Breaker position)
  • Contacts (labels: Open contacts; Contact position)
  • Sensors: Rod status; Power level; Neutron flux
  • Process within the reactor (relevant to safety)
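A control structure can be written down as two edge lists, one for control actions and one for feedback, and then checked for loops that are not closed. The sketch below uses labels from the slide, but the direction and endpoints of each edge are assumptions read off those labels, since the diagram itself did not survive extraction. The function `open_loops` is hypothetical.

```python
# (source, target, label). Edge directions and endpoints are assumptions.
CONTROL_ACTIONS = [
    ("Plant Operators (MCR)", "Protection System", "Trip"),
    ("Protection System", "RTBs", "Open breakers"),
    ("Protection System", "Contacts", "Open contacts"),
]

FEEDBACK = [
    ("Protection System", "Plant Operators (MCR)", "Parameters"),
    ("RTBs", "Protection System", "Breaker position"),
    ("Contacts", "Protection System", "Contact position"),
]


def open_loops(control_actions: list, feedback: list) -> list:
    """Control actions whose target sends no feedback to the issuing controller."""
    feedback_pairs = {(source, target) for source, target, _ in feedback}
    return [
        (source, target, label)
        for source, target, label in control_actions
        if (target, source) not in feedback_pairs
    ]


if __name__ == "__main__":
    print("open loops:", open_loops(CONTROL_ACTIONS, FEEDBACK))
    # Dropping the breaker-position feedback exposes an open loop.
    reduced = [edge for edge in FEEDBACK if edge[2] != "Breaker position"]
    print("open loops without breaker position:", open_loops(CONTROL_ACTIONS, reduced))
```

A control action with no matching feedback is a natural place to look for the "necessary feedback does not exist" class of causal scenarios in step 4.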

STPA step 3: Identify unsafe control actions (UCAs)

Control action: (Manual) trip on pressurizer (PZR) pressure too high or too low

  • Not providing causes hazard

- UCA-1: Operator does not provide Trip signal when PZR pressure is lower or higher than setpoints (under non-bypass conditions) (H-4.2)

- UCA-2: Operator does not provide trip signal when RTS / DAS fails (H-3)

  • Providing causes hazard

- UCA-4: Operator provides Trip signal when PZR pressure is below Hi setpoint or above Lo setpoint (H-3)

- UCA-5: Operator provides Trip signal under bypass conditions (H-3)

  • Too early; too late; out of order

- UCA-7: Operator delays trip when PZR Press above Hi setpoint (under non-bypass conditions) and when the PS has not provided the trip (H-4.2)

- UCA-8: Operator is early to trip when PZR Press below Hi setpoint and above Lo setpoint (H-3)

  • Stopped too soon; applied too long

- UCA-11: Operator does not disengage trip signal when appropriate (H-3)

- UCA-12: Operator engaged but stopped before Reactor trip circuitry could engage (H-4.2)

UCA-9: UCA-8 Operator is early to trip

Prevention of UCAs → Safety constraints → Drive system design
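The step from UCAs to safety constraints is largely a matter of inverting each UCA statement while keeping its context and hazard links. The sketch below shows that inversion for three of the UCAs listed above. The record layout, the "SC-" identifiers, and the inversion wording are assumptions made for illustration; they are not taken from the slides.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class UnsafeControlAction:
    uca_id: str
    controller: str
    behavior: str            # how the control action is unsafe
    action_and_context: str  # the control action and the context that makes it unsafe
    hazards: tuple[str, ...]


# Illustrative inversion wording (an assumption, not from the source).
INVERSION = {
    "does not provide": "must provide",
    "provides": "must not provide",
    "delays": "must not delay",
}


def to_constraint(uca: UnsafeControlAction) -> str:
    """Turn a UCA into a controller constraint that keeps the hazard trace."""
    constraint_id = uca.uca_id.replace("UCA", "SC", 1)
    required = INVERSION[uca.behavior]
    traced = ", ".join(uca.hazards)
    return f"{constraint_id}: {uca.controller} {required} {uca.action_and_context} [{traced}]"


UCAS = [
    UnsafeControlAction(
        "UCA-1", "Operator", "does not provide",
        "Trip signal when PZR pressure is lower or higher than setpoints "
        "(under non-bypass conditions)",
        ("H-4.2",),
    ),
    UnsafeControlAction(
        "UCA-5", "Operator", "provides",
        "Trip signal under bypass conditions", ("H-3",),
    ),
    UnsafeControlAction(
        "UCA-7", "Operator", "delays",
        "trip when PZR pressure is above Hi setpoint (under non-bypass conditions) "
        "and the PS has not provided the trip",
        ("H-4.2",),
    ),
]

if __name__ == "__main__":
    for uca in UCAS:
        print(to_constraint(uca))
```

Keeping the hazard IDs on each constraint preserves the trace from losses through hazards and UCAs down to the design constraints.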

STPA step 4: Identify causal scenarios

What could cause the UCAs?

a) How could incorrect feedback, inadequate requirements, design errors, component failures, and other factors cause unsafe control actions and ultimately lead to losses?

- Identifying scenarios that lead to Unsafe Control Actions

b) How might safe control actions be provided but not followed or executed properly, leading to a loss?

- Identifying scenarios in which control actions are improperly executed or not executed

This is a backward propagation search process.

STPA step 4a: Identifying scenarios that lead to Unsafe Control Actions 1/2

Common categories of causes of unsafe controller behavior

1. Failures related to the controller (for physical controllers)
   1. Physical failure of the controller itself
   2. Power interruption (e.g.: power supply failure; intermittent connection; short circuit; open circuit)
2. Inadequate control algorithm
   1. Flawed implementation of the specified control algorithm
   2. The specified control algorithm is flawed
   3. The specified control algorithm becomes inadequate over time due to changes or degradation
3. Unsafe control input UCA received from another controller (already addressed when considering UCAs from other controllers)
4. Inadequate process model
   1. Controller receives incorrect feedback/information
   2. Controller receives correct feedback/information but interprets it incorrectly or ignores it
   3. Controller does not receive feedback/information when needed (delayed or never received)
   4. Necessary controller feedback/information does not exist

STPA step 4a: Identifying scenarios that lead to Unsafe Control Actions 2/2

Common categories of causes of inadequate feedback and information

1. Feedback or information not received
   1. Feedback/info sent by sensor but not received by controller
   2. Feedback/info is not sent by sensor(s) but is received or applied to sensor(s)
   3. Feedback/info is not received by or applied to sensor(s)
   4. Feedback/info does not exist in control structure or sensor(s) do(es) not exist
2. Inadequate feedback is received
   1. Sensors respond adequately but controller receives inadequate feedback/info
   2. Sensors respond inadequately to feedback/info received by or applied to sensors
   3. Sensors are not capable of providing necessary feedback/info (i.e., not designed to)


STPA step 4b: Identifying scenarios in which control actions are not executed (or not executed correctly): Scenarios involving the control path

1. Control action not executed
   1. Control action is sent by controller but not received by actuator
   2. Control action is received by actuator but actuator does not respond
   3. Actuator responds but the control action is not applied to or received by the controlled process
2. Control action is improperly executed
   1. Control action is sent by controller but received improperly by actuator, e.g. due to
      1. Delay in communication
      2. Out-of-order transmission
      3. Lost communication
   2. Control action is received correctly by actuator but actuator response is inadequate, e.g. due to
      1. Loss of power to the actuator
      2. Inaccuracies in actuator operation
      3. Actuator misbehavior
      4. Delay in actuator response
      5. Actuator receives some other (possibly conflicting) command from some other source
      6. Incorrect priority scheme used by actuator
      7. Incorrect configuration
      8. Actuator behavior changes or degrades over time
      9. Unanticipated conditions in the actuator environment
   3. Actuator responds adequately but the controlled process receives or applies the control action improperly
   4. Control action is not sent by controller but actuators or other elements respond as if it did
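Three of the control-path scenarios above (delay in communication, out-of-order transmission, lost communication) can be detected from message logs if each command carries a sequence number and timestamps. The sketch below is a minimal illustration under those assumptions; the log format, the function `classify_delivery`, and the numbers are hypothetical and not from the lecture.

```python
def classify_delivery(sent: dict, received: list, max_delay: float) -> dict:
    """Compare controller and actuator logs.

    sent: sequence number -> time the controller sent the command
    received: (sequence number, arrival time) pairs in arrival order
    max_delay: largest acceptable transit time
    """
    arrived = {seq for seq, _ in received}
    lost = sorted(seq for seq in sent if seq not in arrived)

    delayed = [
        seq for seq, arrival in received
        if seq in sent and arrival - sent[seq] > max_delay
    ]

    out_of_order = []
    highest_seen = None
    for seq, _ in received:
        if highest_seen is not None and seq < highest_seen:
            out_of_order.append(seq)  # arrived after a later command
        else:
            highest_seen = seq

    return {"lost": lost, "delayed": delayed, "out_of_order": out_of_order}


if __name__ == "__main__":
    # Hypothetical logs: command 2 arrives late and after command 3; command 4 never arrives.
    sent_log = {1: 0.0, 2: 0.1, 3: 0.2, 4: 0.3}
    received_log = [(1, 0.01), (3, 0.21), (2, 0.45)]
    print(classify_delivery(sent_log, received_log, max_delay=0.05))
```

Detection at run time is only one use; the same classification can serve as a checklist when reviewing whether the design tolerates each of these conditions.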

STPA step 4b: Identifying scenarios in which control actions are not executed (or not executed correctly): Scenarios related to the controlled process

1. Control action not executed
   1. Control action is applied to or received by the controlled process but the controlled process does not respond
2. Control action improperly executed
   1. Control action is applied to or received by the controlled process but the controlled process responds improperly
   2. Control action is not applied to or received by the controlled process but the process responds as if it did

Progression of hazard analysis during development

[Figure: reference model from IEEE Std 1012. Requirements from NPP Safety Analysis feed System Development. Each development phase is paired with a Verification & Validation (V&V) activity and a Safety Engineering hazard analysis (HA) activity, labeled as on the slide:]

  • Plans: Vp; HAp
  • Concept: Vc; HAc
  • Requirements: Vr; HAr
  • Architecture: Va; HAr
  • Detailed design: Vdd; HAdd
  • Implementation: Vi; HAi
  • Testing: Vt; HAi

The Future around the corner

Model-based systems engineering to support safety analysis:

  • SysML version 2 (modeling critical systems)
  • RAAML (Risk Analysis & Assessment Modeling Language)
  • AADL (modeling critical cyber-physical systems)

- AADL Error Library blog

- AADL Error Library paper

- Article on fault modeling and analysis using AADL

- Example implementation