ML23262B164

Meeting Slides 20230322-final
Issue date: 03/22/2023
From: Chang Y, Friedman C, Mishkin A, Polra S, Pringle S, Tanya Smith, Vasquez G
NRC/RES/DRA/HFRB, Sphere of Influence

Machine Learning Demo Wednesday
Prioritizing Inspections using ML
Wednesday, March 22, 2023
Alec Mishkin, Guillermo Vasquez, Stuti Polra, Casey Friedman, Scott Pringle, Theresa Smith

Agenda 2

Clustering
Data Analysis
- Inspection Report Characterization
Topic Modeling
Tool Analysis
Progress and Next Steps

Clustering 3

Clustering: Topic Modeling can lead to Safety Clusters 4

Generic groupings based on words that frequently occur together in text

The words at the left occurred together more often than they occurred with other words

Once clusters of words (topics) are identified, documents can be aligned with those topics.

A document may align with more than one topic; that is, more than one topic may have been discussed within the document.

Topics can be characterized (given a name) by analyzing the words that define them.

The topics discovered in our early experiments appear to align well with Safety Clusters.

Topic Modeling

Clustering samples - Potential Safety Clusters 5

Title - This clustering approach shows words that commonly occur together.

This safety cluster has reactor and radiation as the most relevant unifying words.

Additional experiments will be performed on larger text samples - potentially using the entire inspection report rather than just the title.

Different approaches including bi-grams and tri-grams will be explored.
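As a rough illustration of what bi-grams and tri-grams are, the sketch below counts adjacent word pairs and triples across a few titles. The titles are invented for illustration and are not taken from the inspection findings data; the helper name `ngrams` is likewise an assumption, not part of the study's pipeline.

```python
from collections import Counter

def ngrams(text, n):
    """Return the n-grams of a text as tuples of lowercase words."""
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

# Hypothetical report titles, invented for illustration only.
titles = [
    "failure to maintain reactor coolant pump seal",
    "reactor coolant pump seal leakage not evaluated",
    "emergency diesel generator failed to start",
]

# Count every bi-gram and tri-gram across all titles.
bigram_counts = Counter(g for t in titles for g in ngrams(t, 2))
trigram_counts = Counter(g for t in titles for g in ngrams(t, 3))

print(bigram_counts.most_common(3))
print(trigram_counts[("reactor", "coolant", "pump")])
```

Treating phrases such as "reactor coolant pump" as single terms may help a topic model separate clusters that share individual words.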

Time periods will be introduced to ensure insights are current.

These are very early results - several algorithms and parameter settings will be tried as we move into the experimentation phase.

Safety Clusters

Topics

Clusters formed in Topic Modeling are characterized by the words that are unique to that collection.

Reactor and Radiation appear in every title in this topic, and do not appear in any other titles.

The same is true for High, Coolant, and Vessel.

Using the words most strongly connected to a given topic helps us characterize, or name that topic.

That named topic can be considered a Safety Cluster
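The idea of topic-defining terms can be sketched with plain set operations: keep the words that appear in every title of a cluster and in no title outside it. This is a simplified illustration under that strict definition, not the method used in the study; the function name `defining_words` and the sample titles are hypothetical.

```python
def defining_words(cluster_titles, other_titles):
    """Words present in every title of the cluster and in no title outside it."""
    cluster_sets = [set(t.lower().split()) for t in cluster_titles]
    in_every = set.intersection(*cluster_sets) if cluster_sets else set()
    outside = set()
    for t in other_titles:
        outside |= set(t.lower().split())
    return in_every - outside

# Hypothetical titles, invented for illustration only.
cluster = ["reactor radiation monitor failure", "radiation survey near reactor"]
others = ["diesel generator failed to start", "valve failure in service water"]

print(sorted(defining_words(cluster, others)))
```

In practice a topic model assigns weights rather than strict membership, so a softer score (for example, how much more often a word appears inside the cluster than outside) is likely to be more robust.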

A next step could be to characterize the reports in this Safety Cluster in terms of Cornerstones and Cross Cutting Areas.

Topic Defining Terms => Safety Clusters

Safety Cluster formation 6

Safety Clustering Samples - Relative Importance 7

Terms on the right can be used to best characterize the cluster

These terms apply to this cluster more than to any other

Clustering Samples - Overlap could indicate widespread issues 8

Data Analysis 9

Reactor Locations: Site Name and Code, Unit, Docket, Reactor Type, Containment, Region, State, City, Latitude, Longitude, Parent Company, Operating

Missing information on sites that have findings: FCS, CR, VOG3, CGS, PILG, KEWA, SANO, VY, TMI, OC

Performance Indicators: Date (Year, Quarter), Docket, 17 indicators across 6 safety cornerstones

3 Initiating Events, 6 Mitigating Systems, 2 Barrier Integrity, 3 Emergency Preparedness, 1 Public Radiation Safety, 1 Occupational Radiation Safety

Inspection Reports: 11,672 reports (4GB pdfs, 552MB extracted text)

Text extracted from 10,671 pdfs; ~1000 pdfs are unreadable

Inspection Findings: Year, Quarter, Issue Date, Report and Accession Numbers, Link, Region, Site Code and Name, Docket Number, Type, Item Severity Type Code, Significance, Procedure, Cornerstone and its Attribute Type, Cross-cutting Area and its Aspect, Identified By, Traditional Enforcement, Title and Item Introduction

Data Sources 10

Event Notifications: Event Notification Id and Number, Reactor Indicator, Site Name and Unit, Docket, Region, State, City, County, Time zone, Rx Type; Event, Notification and Update Dates; Notified by, Operations Officer, Staff Names, Organizations, Text; Current (3) and Initial (3) Power; Critical Indicators (3); Scram Type (3); Event and CFR Descriptions (4); I Mode (3), F Mode (3); Emergency Class; Agreement State, Licensee Name and Number, Containment Type (3); Document Type Code and CFR Code (4); Security, Release Date, Note; Of Interest, Interest; Docket Number (2,3)

38,864 event notifications

Licensee Event Reports: Plant Name, Event and Report Dates, LER Number, Accession Number, Title, Abstract

18,872 with Title & Abstract in HTML; 18,695 cleaned text with Title & Abstract separated

Part 21 Reports: Log Number, Event/Accession Number, Report Date, Notifier, Description, Body, Links

2,690 reports

Action Matrix: Date (Year, Quarter), Docket, total number of actions

Data Sources 11

Reactor Locations: Site Name and Code, Unit, Docket, Reactor Type, Containment, Region, State, City, Latitude, Longitude, Parent Company, Operating

Missing information on sites that have findings:

Inspection reports for these sites are not available on the main webpage
- FCS: Fort Calhoun
- CR: Crystal River
- PILG: Pilgrim
- KEWA: Kewaunee
- SANO: San Onofre
- VY: Vermont Yankee
- TMI: Three Mile Island
- OC: Oyster Creek

Will attempt to download from the links in inspection findings file

Data Sources 12

Inspection Report Characterization 13

Inspection Findings: Cornerstone 14

Figure: Inspection Findings by Cornerstones. Axes: Cornerstone, Inspection Findings. Cornerstones shown: Public Radiation Safety, Emergency Preparedness, Occupational Radiation Safety, Barrier Integrity, Initiating Events, Mitigating Systems. Legend: Green, Greater Than Green, White, Yellow, Red.

Inspection Findings: Cornerstone Attribute 15

Figure: Inspection Findings by Cornerstone Attribute. Axes: Cornerstone Attribute, Inspection Findings. Attributes shown: Cladding Performance, ERO Readiness, Material Control and Accounting, RCS Equipment and Barrier Performance, Offsite EP, SSC Performance, Facilities and Equipment, Plant Facilities/Equipment and Instrumentation, ERO Performance, SSC and Barrier Performance, Configuration Control, Program & Process, Human Performance, Procedure Quality, Protection Against External Factors, Design Control, Equipment Performance. Legend: Green, Greater Than Green, White.

Inspection Findings: Cross-cutting Area 16

Figure: Inspection Findings by Cross-Cutting Area. Axes: Cross-Cutting Area, Inspection Findings. Areas shown: Supplemental Cross-Cutting Aspects, Safety Conscious Work Environment, Problem Identification and Resolution, Human Performance. Legend: Green, White, Yellow, Red.

Inspection Findings: Cross-Cutting Aspect 17

Figure: Inspection Findings by Cross-Cutting Aspect. Axes: Cross-Cutting Aspect, Inspection Findings. Aspects shown: P.1(e) Alternative CAP Process; X.12 Accountability for Decisions; S.1 SCWE Policy; P.6 Self-Assessment; P.3(c) Communicates & Acts on Assessment Results; H.10 Bases for Decisions; P.6 Self Assessment; P.4 Trending; P.1(b) Trend Performance CAP; H.1(c) Communication of Decisions; P.3(a) Self-Assessment; H.2(d) Facilities & Equipment; P.2(a) Evaluating & Communicating Operating Experience; H.2 Field Presence; P.5 Operating Experience; H.2(a) Maintaining Long Term Plant Safety; H.9 Training; H.2(b) Personnel Training & Qualifications; H.3 Change Management; P.1 Identification; H.13 Consistent Process; H.4 Teamwork; H.6 Design Margins; H.7 Documentation; H.11 Challenge the Unknown; P.2(b) Implementing Operating Experience; P.3 Resolution; H.1(a) Systematic Process for Decisions; H.1 Resources; H.14 Conservative Bias; H.8 Procedure Adherence; H.3(a) Work Planning; H.3(b) Work Activity Coordination; H.4(c) Supervisory & Management Oversight; P.1(a) Corrective Action Program Issue Identification; H.5 Work Management; P.2 Evaluation; H.12 Avoid Complacency; H.4(a) Human Error Prevention; P.1(d) Implementation of Corrective Actions; H.4(b) Procedural Compliance; H.1(b) Conservative Assumptions & Safe Actions; H.2(c) Documentation, Procedures & Component Labeling; P.1(c) Evaluation of Identified Problems. Legend: Green, White, Yellow, Red.

Inspection Findings: Site 18

Figure: Inspection Findings by Site. Axes: Reactor Site Code, Inspection Findings. Site codes shown: VOG3, CR, ROB, VY, SEAB, TMI, GINN, HAR, SUM, FITZ, BV, OC, CAT, PILG, VOG, MCG, MONT, HAT, SUR, NMP, DUAN, LIM, WB, STL, FAR, KEWA, HOPE, PB, BRU, CLIN, DAVI, MILL, SEQ, PALI, FERM, CALV, WAT, TP, CGS, CALL, PERR, BYRO, RBS, LASA, STP, SALM, FCS, SANO, WC, COOK, GG, SUSQ, BRAI, PRAI, ANO, DIAB, INPT, BF, QUAD, CNS, CP, DRES, OCO, POIN, PALO. Legend: Green, Greater Than Green, White, Yellow, Red.

Inspection Findings: Year 19

Figure: Inspection Findings by Year. Axes: Year, Inspection Findings. Years shown: 1998 through 2023. Legend: Green, Greater Than Green, White, Yellow, Red.

Topic Modeling 20

Unsupervised discovery of topics from a collection of text documents

Latent Dirichlet Allocation (LDA)
- Describe a document as a bag-of-words
- Model each document as a mixture of latent topics
- Topic is represented as a distribution over the words in the vocabulary

Variants of Topic Modeling can be explored
- Text-embeddings from Language Models and Neural Topic Modeling can be used to improve the quality of results

Topic Modeling 21

Diagram labels: Corpus of Documents; Topic Model; Topics (word/phrase frequency that distinguishes and characterizes a topic); Proportion of topics in each document (50% topic 1, 25% topic 2, 25% topic 3); Clusters of documents by topic
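To make the LDA description above concrete, here is a minimal sketch of a collapsed Gibbs sampler in pure Python. It is a teaching toy under stated assumptions, not the implementation used in the study: the function name `lda_gibbs`, the hyperparameters `alpha` and `beta`, the iteration count, and the sample documents are all chosen here for illustration. A production run would more likely rely on an established Python library.

```python
import random

def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Tiny collapsed Gibbs sampler for LDA over tokenized documents."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]        # tokens in doc d assigned to topic k
    topic_word = [[0] * V for _ in range(num_topics)]   # times word w is assigned to topic k
    topic_total = [0] * num_topics                      # tokens assigned to topic k
    assignments = []

    # Give every token a random initial topic.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid, k = word_id[w], assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][wid] -= 1
                topic_total[k] -= 1
                # Unnormalized conditional probability of each topic for this token.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][wid] + beta) / (topic_total[t] + V * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][wid] += 1
                topic_total[k] += 1

    # Smoothed topic proportions per document; each row sums to 1.
    theta = [
        [(c + alpha) / (len(doc) + num_topics * alpha) for c in counts]
        for doc, counts in zip(docs, doc_topic)
    ]
    # The three highest-count words for each topic.
    top_words = [
        [vocab[i] for i in sorted(range(V), key=lambda i: -row[i])[:3]]
        for row in topic_word
    ]
    return theta, top_words

# Hypothetical tokenized titles; as a bag-of-words model, LDA ignores word order.
docs = [
    "pump valve feedwater pump valve".split(),
    "feedwater pump valve leak".split(),
    "emergency diesel generator start".split(),
    "diesel generator emergency test".split(),
]
theta, top_words = lda_gibbs(docs, num_topics=2)
for row in theta:
    print([round(p, 2) for p in row])
print(top_words)
```

Each row of `theta` is the mixture of topics for one document, and `top_words` lists the words most associated with each topic, which is the information an analyst would use to name a topic.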

Demo 22

Actionable Insights 23

Inspection Report ML15037A011 Indian Point

The algorithm reveals the safety topics from findings

These are the safety clusters

One report may mention multiple safety topics

The algorithm links the report to those safety clusters

Each safety topic has a probability of match

Actionable Insights within a facility

Indicators for inspectors based on all of the Safety Clusters that a facility is linked to.

All issues that are part of a topic cluster tend to occur together - ensure inspections cover each of these areas.

It appears that each inspection has a focus; these safety clusters could help validate that focus.

Actionable Insights from Cluster Analysis 24

C1: water, service, valve, cool
C5: emergency, diesel, generator
C9: pump, valve, feedwater

Match probabilities shown: 13%, 17%, 61%
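Linking a report to safety clusters by probability of match can be sketched as a simple threshold on the report's topic mixture. The cluster labels, probabilities, and the 10% cutoff below are illustrative assumptions, not values from the study.

```python
def link_to_clusters(topic_probs, threshold=0.10):
    """Return (cluster, probability) pairs at or above the threshold, strongest first."""
    linked = [(c, p) for c, p in topic_probs.items() if p >= threshold]
    return sorted(linked, key=lambda cp: cp[1], reverse=True)

# Illustrative topic probabilities for one report; not taken from the study.
report_topics = {"C1": 0.15, "C2": 0.03, "C5": 0.20, "C7": 0.02, "C9": 0.60}

print(link_to_clusters(report_topics))
```

The threshold controls how many clusters a report is tied to; a lower value links more clusters per report at the cost of weaker associations.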

Inspection Reports:
- ML042190340 Grand Gulf
- ML101330214 Indian Point

Both reactors are strongly tied to Safety Cluster #9: pump, valve, and feedwater

Actionable Insights from Cluster Analysis 25

C9: pump, valve, feedwater 91%

Actionable Insights across facilities

Indicators that tie reactors together based on safety findings

Analyze other factors to seek similarities in designs, procedures, etc. to reveal potential hazards.
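One possible way to tie reactors together is to compare their topic mixtures with cosine similarity. The sketch below assumes each site has already been summarized as a vector of cluster proportions; the site names and numbers are hypothetical, and cosine similarity is one choice among several distance measures.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def most_similar(site, site_topics):
    """Name of the other site whose topic mixture is closest to the given site's."""
    others = [s for s in site_topics if s != site]
    return max(others, key=lambda s: cosine_similarity(site_topics[site], site_topics[s]))

# Hypothetical topic mixtures per site (proportions over three clusters).
site_topics = {
    "SITE_A": [0.05, 0.04, 0.91],
    "SITE_B": [0.10, 0.05, 0.85],
    "SITE_C": [0.70, 0.25, 0.05],
}

print(most_similar("SITE_A", site_topics))
```

Sites that score as similar could then be compared on design, procedures, and other factors, as the slide suggests.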

Classification

Input from:
- Performance Indicators
- Event Notifications
- Event Reports
- Part 21 Reports

Classify on:
- Significance
- Cornerstone Areas
- Crosscutting Areas

Actionable Insights from classification analysis 26

Actionable Insights for Classification

Use alternate sources of input.

Exclude data on the target feature (significance)

Use the remaining features to assign probabilities to the significance categories

Could also classify on cornerstone and crosscutting areas

Can use historical inspection reports to verify accuracy

Significance   Probability
No Finding     47%
Green          28%
White          14%
Yellow         9%
Red            2%
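The classification idea above (exclude the target feature, then use the remaining features to assign a probability to each significance category) can be sketched with a small categorical naive Bayes model. Naive Bayes is a stand-in chosen here for brevity, not the classifier selected by the study, and the feature names (`pi_band`, `recent_event`) and records are invented for illustration.

```python
import math
from collections import Counter, defaultdict

def train(records, target):
    """Count class frequencies and per-class feature values, excluding the target."""
    class_counts = Counter(r[target] for r in records)
    value_counts = defaultdict(Counter)   # (feature, class) -> Counter of values
    feature_values = defaultdict(set)     # feature -> set of observed values
    for r in records:
        for f, v in r.items():
            if f == target:
                continue  # the target itself is never used as an input
            value_counts[(f, r[target])][v] += 1
            feature_values[f].add(v)
    return class_counts, value_counts, feature_values

def predict_proba(model, features):
    """Naive Bayes with add-one smoothing; returns class -> probability."""
    class_counts, value_counts, feature_values = model
    total = sum(class_counts.values())
    log_scores = {}
    for c, n in class_counts.items():
        score = math.log(n / total)
        for f, v in features.items():
            if v not in feature_values.get(f, ()):
                continue  # skip features or values never seen in training
            count = value_counts[(f, c)][v]
            score += math.log((count + 1) / (n + len(feature_values[f])))
        log_scores[c] = score
    # Normalize in a numerically stable way so the probabilities sum to 1.
    top = max(log_scores.values())
    exp = {c: math.exp(s - top) for c, s in log_scores.items()}
    z = sum(exp.values())
    return {c: e / z for c, e in exp.items()}

# Hypothetical historical records; feature names and values are invented.
records = [
    {"pi_band": "green", "recent_event": "no", "significance": "No Finding"},
    {"pi_band": "green", "recent_event": "no", "significance": "No Finding"},
    {"pi_band": "green", "recent_event": "yes", "significance": "Green"},
    {"pi_band": "white", "recent_event": "yes", "significance": "White"},
    {"pi_band": "white", "recent_event": "yes", "significance": "Green"},
]
model = train(records, target="significance")
probs = predict_proba(model, {"pi_band": "green", "recent_event": "no"})
for label, p in sorted(probs.items(), key=lambda kv: -kv[1]):
    print(f"{label}: {p:.0%}")
```

Holding out a portion of the historical inspection findings and comparing predicted categories against recorded significance is one way to check accuracy, in line with the verification step mentioned above.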

Predictive

Input from:
- Performance Indicators
- Event Notifications
- Event Reports
- Part 21 Reports

Predictions on:
- Cornerstone Areas / Attributes
- Crosscutting Areas / Aspects

Actionable Insights from predictive analysis 27

Actionable Insights for Future

Use alternate sources of input.

Exclude data on the target feature (cornerstones)

Use the remaining features to discover cornerstone trends and patterns

Assigns a probability to findings in inspections

Can use historical inspection reports to verify accuracy

Are these types of actionable insights what you would like to see from the study?

What other goals do you have?

Questions on direction for clustering and insights 28

Tool Analysis 29

If you want simple, but limited:

Platform Supplied Algorithms

Pros

Integrated into user experience

Can be combined into standard pipelines

Work well in common cases (social media)

Well known algorithms

Cons

Do not handle technical domains well

Difficult to tailor pipelines

Limited selection of algorithms

Limited selection of pre-training datasets

Advanced algorithms not available

Strengths in one area (Topic Modeling) may not translate to other areas (Classification, clustering, regression, anomaly detection)

If you want flexible and timely:

Python Library Algorithms

Pros

Flexibility to select from multiple algorithms

Flexibility to select from multiple pre-trainings

Advanced models available

Flexibility to address highly technical domains

Flexibility to leverage internal parameters

Work well in Machine Learning Notebooks

Can be deployed in any cloud or on premises

Cons

Requires some knowledge of Python

Algorithms must be researched and selected

Survey of Platforms and Tools 30

Platforms and Tools:

- Amazon's AWS SageMaker
- Microsoft Azure's AI/ML Services
- Google's Cloud AI Products
- MATLAB

Initial survey indicates that all platforms provide the ability to use:
- Python libraries needed for pre-processing textual data
- Python libraries that provide algorithms for unsupervised learning with textual and numerical data
- Various pre-trained models that can be easily fine-tuned with our data

Python notebooks that are currently used locally can be launched and scaled on all platforms

Refine evaluation factors and their weights & ranges with continued exploration of unsupervised techniques

Survey of Platforms and Tools 31

What do you see as the primary role of the people executing this type of analysis?

Are there longer-term aspirations for ML initiatives?

Questions on Tool Selection Criteria 32

Progress 33

SOW Task Status 34

Phase I: March 6, 2023 - April 9, 2023
Task                                            Status
Describe the Problem                            In progress
Search the Literature                           In progress
Select Candidates                               In progress
Select Evaluation Factors                       In progress
Develop evaluation factor weights               In progress
Define evaluation factor ranges                 In progress
Perform assessment                              Not started
Report Results                                  Not started
Deliver Trade study report                      Not started

Phase II: March 20, 2023 - May 7, 2023
Task                                            Status
Platform/system selection and installation      Not started
Data acquisition and preparation                In progress
Feature pipeline engineering                    In progress
Clustering method experimentation & selection   In progress
Cluster pipeline engineering                    In progress
Anomaly detection (as needed)                   Not started
Model Development, Training, Evaluation         Not started
Test harness development                        Not started
PoC integration and demonstration               Not started
Trial runs and evaluation                       Not started
Demonstrate PoC capability                      Not started

Phase III: April 19, 2023 - June 16, 2023
Task                                            Status
Live data ingestion                             Not started
Model execution                                 Not started
Cluster evaluation                              Not started
Critical Method documentation                   Not started
Technical Report Document                       Not started
Deliver final report with findings              Not started

Next Steps 35

Next Steps 36

Determine algorithms and approaches needed to perform the study:
- Explore LDA with other text data: full inspection reports, event notifications, LERs, Part 21s
- Explore neural topic modeling with BERTopic and Contextualized Topic Model
- Explore other ML areas as directed

Define the selection criteria for tools / environments based on study needs:
- Compare the services and methods available for topic modeling across our 4 platforms