ML20059N404

Conference Paper Root Cause Analysis of Operating Problems, to Be Presented at Regulatory Info Conference on 890418-20
Person / Time
Issue date: 04/18/1989
From: Ernst M
NRC OFFICE OF INSPECTION & ENFORCEMENT (IE REGION II)
To:
Shared Package
ML20059N403 List:
References
NUDOCS 9010120189


Text


ROOT CAUSE ANALYSIS OF OPERATING PROBLEMS

BY MALCOLM L. ERNST
DEPUTY REGIONAL ADMINISTRATOR, REGION II
U. S. NUCLEAR REGULATORY COMMISSION

FOR PRESENTATION AT THE NRC REGULATORY INFORMATION CONFERENCE
WASHINGTON, D.C.
APRIL 18-20, 1989


9010120189 890328 PDR MISC 9010120188 FDC


ROOT CAUSE ANALYSIS OF OPERATING PROBLEMS

SCOPE

This paper addresses the importance of root cause analysis as it relates to plant safety and the reduction of the potential for significant events to occur. Several examples of recent occurrences will be addressed to highlight where evaluations have been effective and ineffective.

It is not the purpose here to provide a detailed discussion of the methodological aspects of root cause analysis.

DISCUSSION

10 CFR 50, Appendix B, Criterion XVI, "Corrective Action," requires that, for conditions adverse to quality (such as equipment failures, defective material, and improper personnel actions), licensees assure that the cause of the condition is determined and corrective action is taken to preclude repetition. Hence, "Root Cause Determination."

Root cause analysis is defined as the method by which the most basic cause of an event is determined in order to best prevent recurrence.

For a given event factor to be a root cause it must meet two criteria:

(1) It must contribute substantially to the event in question such that, upon removal of the factor, repeated events either do not occur or are much less likely to occur; and (2) It must be reasonably within management's control to correct.

A failure to meet one of these criteria results in the failure to properly identify the basic cause of the event.
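The two criteria can be read as a simple conjunctive test. The following is a minimal sketch only and is not part of the original paper: the names `CandidateFactor` and `is_root_cause` are hypothetical, and the two boolean judgments would come from an engineering evaluation, not from code. The sample values are an illustrative reading of the bolt failures discussed in example 1.

```python
from dataclasses import dataclass


@dataclass
class CandidateFactor:
    """A factor identified during an event evaluation (hypothetical structure)."""
    description: str
    # Criterion 1: removing the factor prevents recurrence or makes it much less likely.
    removal_prevents_recurrence: bool
    # Criterion 2: correction is reasonably within management's control.
    within_management_control: bool


def is_root_cause(factor: CandidateFactor) -> bool:
    """Return True only if the factor meets both criteria."""
    return factor.removal_prevents_recurrence and factor.within_management_control


# Illustrative values: bolts replaced on the over-torquing assumption failed again,
# so that factor does not satisfy criterion 1.
over_torquing = CandidateFactor("assumed over-torquing of bolts", False, True)
stress_corrosion = CandidateFactor("intergranular stress corrosion cracking", True, True)
```

A factor that fails either test is treated as a symptom or contributing condition, not as the root cause.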

It is each licensee's responsibility to adequately evaluate occurrences in order to provide continued safe plant operation.

The NRC assures that appropriate root cause analysis is adequately addressed through the inspection and enforcement processes.

Additionally, the Division of Operational Events Assessment, specifically the Events Assessment Branch, evaluates reported events, reviews the actions to be taken to prevent recurrence, and relates significant event history and operating experience to industry through generic correspondence prepared by the NRC staff.

One principal purpose of such correspondence is to provide the results of licensees' root cause analyses on potential generic concerns, thus providing a mechanism for other facilities to benefit from identified lessons learned.

Identification of root causes has both economic and safety benefits.

Proper resolution of problems decreases the likelihood of repetitive equipment failures, common mode failures, or multiple failures resulting in reactor transients or safety system failures.

Therefore, root cause analysis is a valuable tool in minimizing the risk of severe reactor accidents.

From an economic standpoint, resolution of true root causes can reduce unnecessary downtime for the utility and thus maximize electricity production and potential profits.

It is recognized that root cause analysis does have limitations.

In-depth analyses require significant time, staff effort, and financial resources, and the benefits associated with them may not be immediately realized.

Often, root cause analyses are performed only after the occurrence of numerous repeat events, where the cost of the failures becomes a significant issue or the NRC exercises considerable influence.

Licensee management must establish priorities for the performance of causal analysis, with the most attention directed to the abnormal operation of safety related equipment or events which unnecessarily challenge safety systems.

Such events include unplanned reactor trips, conditions beyond the design basis of the plant, two or more failures in redundant systems, multiple failures due to an apparent common cause during an event, and multiple human errors or failures of an important safety component over an extended period of time.

The following examples are provided to illustrate recent events which resulted in either effective or ineffective root cause determination.

1. Multiple failures of silicon-bronze carriage head bolts in the bus bars connecting safety related motor control centers.

Over a period of approximately one year, multiple failures of 5/16" silicon-bronze bolts were identified by maintenance personnel.

Initially, it was assumed that the failures were caused by over-torquing, and the bolts were replaced.

Some bolts were replaced a second time, which should have triggered a concern that the real failure cause had not been determined.

Emphasis was not placed on these failures until an outage, when a significant number of failed bolts led to increased NRC concern and initiation of an evaluation by the licensee.

The subsequent analysis identified the failure mechanism to be intergranular stress corrosion cracking. However, once the root cause was determined, there was no integrated effort to initiate a sampling program to determine the existence or condition of other sizes and types of silicon-bronze bolts located throughout the plant.

This example illustrates a poor root cause analysis program.

Personnel were not adequately trained to identify potential problems and properly communicate such situations to other groups for resolution.

Instead, a superficial cause was assumed with the rectification efforts confined to a limited group. Therefore, the failures were not adequately evaluated and resolved in a timely manner. The lack of communications between technical support and maintenance also resulted in the failure to understand the potential scope of the problem.

2. Numerous failures of the High Pressure Coolant Injection System over the life of the plant.

Another example of a poor root cause determination involved the operability of the High Pressure Coolant Injection System (HPCI). Here, numerous operability problems had been identified with HPCI since the plant came on-line.

These included under-sized motors, valve actuator deficiencies (steam admission and HPCI injection valves), EQ problems on skid mounted equipment, operator errors regarding manual HPCI start-up, and improperly set setpoints.

After a number of "fixes" over a several year period, the licensee experienced HPCI injection failure due to a HPCI turbine trip.

The licensee pursued this failure and determined that the turbine trip was caused by low suction pressure during initial pump startup.

This failure mode had been obscured previously by all the other failures, and this brought into question whether the HPCI system had ever been fully operational.

To confirm the low suction pressure theory, the licensee examined plant computer data and performed an at-power HPCI vessel injection test to recreate the failure conditions.

The results of this test confirmed the original theory, and the low suction pressure trip was removed temporarily until a plant modification to insert a 10 second time delay in the trip could be implemented.

Although the licensee pursued the resolution of the latest failure aggressively, the failure mechanism could have been identified much sooner.

Because HPCI is a safety related system which serves a vital function during transient conditions, the numerous failures, indicative of poor reliability, should have prompted an intensive study of the system.

The licensee had previously conducted a HPCI SSFI which identified many issues, including the need to remove the low suction pressure trip (because the trip was primarily for pump protection); however, the recommendations were not aggressively pursued.

Lack of responsiveness to issues shows significant deficiencies in the corrective action and root cause processes.
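The effect of the 10 second time delay described in this example can be sketched as follows. This is an illustration only, not the licensee's design: the function name `low_suction_trip`, the sampled-data representation, and all pressure and setpoint values are hypothetical; only the 10 second delay comes from the text. The trip actuates only if suction pressure remains below the setpoint for the full delay, so a brief dip during pump startup does not trip the turbine.

```python
def low_suction_trip(samples, setpoint, delay_s=10.0):
    """Return True if pressure stays below setpoint for at least delay_s seconds.

    samples: iterable of (time_s, pressure) pairs in increasing time order.
    """
    low_since = None  # time at which pressure first dropped below the setpoint
    for t, pressure in samples:
        if pressure < setpoint:
            if low_since is None:
                low_since = t
            if t - low_since >= delay_s:
                return True  # sustained low suction pressure: trip
        else:
            low_since = None  # pressure recovered; reset the timer
    return False


# Hypothetical data in arbitrary pressure units.
startup_dip = [(0, 30), (1, 10), (2, 10), (3, 10), (4, 30), (5, 30)]
sustained_low = [(t, 10) for t in range(13)]
```

With the hypothetical setpoint of 15, the short startup dip does not trip, while the sustained low-pressure condition does.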

3. Repeated emergency diesel generator failures.

A third example of a poor root cause determination involved repeated failures of emergency diesel generators (EDG) during surveillance testing.

In this case, root cause determination and implementation of corrective actions were extremely delayed.

Initially, the EDG failures were attributed to problems with the shuttle valves in the control air pneumatic shutdown logic.

These valves were replaced repeatedly, but the failures did not cease.

The cause of the shuttle valve failures was never confirmed; and, eventually, they were replaced with different, fast-acting valves.

However, EDG failures continued.

An EDG reliability task force was established to evaluate and determine the root cause of the failures.

The failures were identified to be from two root causes: (1) excessive moisture in the control air system and subsequent corrosion of components; and (2) deficiencies in the pressure sensors.

In both cases, adequate initial evaluation could have prevented repeated failures.

Poor instrument air quality was a known problem for some time, but no action was taken as it was perceived as a low priority maintenance issue.

Secondly, vendor assistance should have been requested earlier to help resolve the pressure sensor issues.

The initial mind-set on the shuttle valve problems may have masked the consideration of other potential problems which appeared to arise later.

4. Failure of redundant containment isolation valves to close on demand.

A good example of an effective root cause analysis is illustrated by an incident where four containment isolation valves failed to close upon receipt of an automatic signal (2 redundant isolation valves in the drywell equipment drain and 2 redundant valves in the drywell floor drain system).

Significant attention was focused on these failures by both the licensee and NRC.

The licensee immediately formed an event investigation team to evaluate the failure mechanism.

Based on the event circumstances (no valve closure on automatic signal, immediate closure of 2 valves on manual demand, and delayed closure of the remaining two valves on manual demand), efforts were concentrated in two areas: solenoid valve failure and relay failure in the control and logic circuitry.

The analysis included: operability tests of the components in question; disassembly and inspection of the solenoid valves; vendor consultation and inspection; relay inspection; retrieval of repair history; and the performance of intensive failure analysis by the company's metallurgy unit.

Although these analyses did define a problem with the relay in the control and logic circuitry contributing to the failure of one valve, the overall root cause was determined to be the failure of the solenoid valves to vent and close the primary valve.

This was attributed to sticking of the solenoid valve's lower disk to its lower seat.

Extensive testing revealed the mechanism for the sticking was the accelerated oxidation of the EPDM seat when in contact with the copper lower disk.

This oxidation was determined to take place at lower than expected temperatures.

Both short term and long term corrective actions were implemented to preclude further sticking problems.

These actions included replacement of the EPDM seats with Viton and cycling the safety related solenoid valves of this type until replacements were performed.

Causal analysis and corrective action for this case were thorough and prompt.

In many cases, an effort is made to determine the root cause of an event; however, the licensee falls into "traps" which detract from the development of the final resolution and prevention of recurrence.

Examples of these traps include:

(1) assuming the identified problem is the cause; (2) blaming the circumstance on personnel error when in fact the individual may have been set up to fail; (3) jumping to conclusions, where a cause is assumed based on limited information and then data is gathered to support that theory; (4) inadequate definition of the problem due to varied, contradictory, and complex data; (5) overkill, where many actions are taken to address the issue and it is never known what actually solved the problem; (6) delaying the resolution until a planned outage; and (7) low maintenance prioritization because the component or system is not specifically safety related.

Relating to the previously discussed examples, many of these "traps" are illustrated, as well as others.

In example 1, for instance, maintenance perceived the failure of the bolts to be due to over-torquing, and did not pursue other cause alternatives.

A more prudent course of action would have been to proceed with a laboratory metallurgical analysis in parallel with the bolt replacement program.

In addition, poor communication between groups contributed to the delayed root cause determination and its application to other bolts.

In example 2, numerous problems related to one system clouded the real issue of overall system reliability.

Consequently, aggressive evaluation which challenged the system design was not performed, and a problem which significantly impacted effective system operation went undetected.

In example 3, fixation on one apparent issue contributed to the masking of other causes.

In addition, the delay in performance of corrective action on the control air system due to low priority resulted in further diesel failures.

As detailed by example 4, a valid root cause analysis involves numerous factors.

Most importantly, resources must be mobilized quickly and efficiently, including vendor support.

The "team" approach also appears effective in this regard in that it integrates the talents of knowledgeable, diverse experts all focusing on a common goal.

Other factors relating to a successful root cause determination include:

(1) preservation of the incident scene and related records for accurate reconstruction; (2) detailed personnel statements; (3) evaluators intimately familiar with the areas being studied; (4) objective evaluators, uninfluenced by outside factors; (5) thorough and broad-scope evaluations; and (6) realistic and useable recommendations.

CONCLUSION

Although the elements and staff implementation of a root cause program are important, management support is critical to its success.

Upper management support and involvement assures that (1) feedback of root cause determinations is factored into all aspects of plant operations; (2) proper emphasis is placed on learning from past experiences; and (3) performance is trended to measure whether basic root causes are found.
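One simple way to trend performance in the sense of item (3) is to count repeat failures per component, since recurring failures of the same item suggest that the basic root cause has not been found. The sketch below is illustrative only: the function name `repeat_failures`, the record layout, the threshold, and the sample log entries are all hypothetical and do not come from any plant's records.

```python
from collections import Counter


def repeat_failures(records, threshold=2):
    """Return components whose failure count meets or exceeds the threshold.

    records: iterable of (component, date_string) tuples from a maintenance log.
    """
    counts = Counter(component for component, _ in records)
    return {c: n for c, n in counts.items() if n >= threshold}


# Hypothetical maintenance log entries.
log = [
    ("EDG shuttle valve", "1988-01"),
    ("EDG shuttle valve", "1988-03"),
    ("MCC bus bar bolt", "1988-02"),
]
flagged = repeat_failures(log)
```

Components flagged this way would be candidates for a more thorough causal analysis, not proof in themselves that an earlier evaluation was wrong.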

From a regulatory standpoint, ensuring that licensees adequately identify root causes of safety significant problems is extremely important.

Assurance that the licensee is thorough, objective, and a self-evaluator is indicative of the proper management attention to overall safe operation of the plant.
