ML23262B164
| Person / Time | |
|---|---|
| Issue date: | 03/22/2023 |
| From: | Chang Y, Friedman C, Mishkin A, Polra S, Pringle S, Tanya Smith, Vasquez G NRC/RES/DRA/HFRB, Sphere of Influence |
Machine Learning Demo: Prioritizing Inspections Using ML
Wednesday, March 22, 2023
Alec Mishkin, Guillermo Vasquez, Stuti Polra, Casey Friedman, Scott Pringle, Theresa Smith
Agenda 2
- Clustering
- Data Analysis
- Inspection Report Characterization
- Topic Modeling
- Tool Analysis
- Progress and Next Steps
Clustering 3
Clustering: Topic Modeling can lead to Safety Clusters 4
Generic groupings based on words that frequently occur together in text
The words at the left occurred together more often than they occurred with other words
Once clusters of words (topics) are identified, documents can be aligned with those topics.
A document may align with more than one topic - that is, more than one topic may have been discussed within the document.
Topics can be characterized (given a name) by analyzing the words that define them.
The topics discovered in our early experiments appear to align well with Safety Clusters.
Topic Modeling
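To make the word-cluster idea above concrete, here is a minimal, illustrative sketch (not the project's actual pipeline) of discovering topics in inspection-report titles with LDA and reading off each title's topic proportions. The titles, the topic count, and the choice of scikit-learn are assumptions for the example.

```python
# Minimal sketch: word clusters (topics) from titles, and per-title topic alignment.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

titles = [
    "Reactor coolant system pressure boundary inspection",
    "Occupational radiation safety and ALARA program review",
    "Emergency diesel generator surveillance and testing",
    "Reactor vessel head penetration inspection",
    "Radiation protection program review",
    "Emergency preparedness drill observation",
]  # illustrative titles only

# Bag-of-words counts for each title
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(titles)

# Fit LDA; the number of topics is a tunable assumption
lda = LatentDirichletAllocation(n_components=3, random_state=0)
doc_topic = lda.fit_transform(counts)   # per-document topic proportions

# Words that most strongly define each topic (candidate "safety cluster" labels)
terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top_words = [terms[i] for i in weights.argsort()[::-1][:5]]
    print(f"Topic {k}: {top_words}")

# A document may align with more than one topic
print(doc_topic[0])   # e.g. something like [0.8, 0.1, 0.1]
```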
Clustering samples - Potential Safety Clusters 5
Title - this clustering approach shows words that commonly occur together in the inspection report titles.
This safety cluster has reactor and radiation as the most relevant unifying words.
Additional experiments will be performed on larger text samples - potentially using the entire inspection report rather than just the title.
Different approaches, including bi-grams and tri-grams, will be explored (a minimal n-gram sketch follows this list).
Time periods will be introduced to ensure insights are current.
These are very early results - several algorithms and parameter settings will be tried as we move into the experimentation phase.
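As a minimal sketch of the bi-gram/tri-gram variation mentioned above (the example text and parameter values are assumptions), the same bag-of-words step can count one- to three-word phrases so that terms such as "emergency diesel generator" stay together:

```python
from sklearn.feature_extraction.text import CountVectorizer

report_texts = [
    "emergency diesel generator failed to start during surveillance",
    "emergency diesel generator output breaker tripped on overcurrent",
]  # illustrative text only

# Count 1- to 3-word phrases instead of single words
vectorizer = CountVectorizer(ngram_range=(1, 3), stop_words="english")
counts = vectorizer.fit_transform(report_texts)
print(vectorizer.get_feature_names_out()[:10])  # includes phrases such as "emergency diesel generator"
```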
Safety Clusters
Topics
Clusters formed in Topic Modeling are characterized by the words that are unique to that collection.
Reactor and Radiation appear in every title in this topic and do not appear in any other titles; the same is true for High, Coolant, and Vessel.
Using the words most strongly connected to a given topic helps us characterize, or name that topic.
That named topic can be considered a Safety Cluster
A next step could be to characterize the reports in this Safety Cluster in terms of Cornerstones and Cross-Cutting Areas.
Topic-Defining Terms => Safety Clusters
Safety Cluster Formation 6
Safety Clustering Samples - Relative Importance 7
Terms on the right can be used to best characterize the cluster.
These terms apply to this cluster more than to any other
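One hedged illustration of scoring "terms that apply to this cluster more than to any other" (an assumption about the scoring, not the team's documented method) is to compare each term's weight in one fitted LDA topic against its largest weight in the remaining topics:

```python
import numpy as np

def distinctive_terms(lda, terms, topic_idx, top_n=10):
    """Terms whose weight in topic `topic_idx` most exceeds their largest weight in any other topic."""
    weights = lda.components_ / lda.components_.sum(axis=1, keepdims=True)  # per-topic word distributions
    own = weights[topic_idx]
    best_other = np.delete(weights, topic_idx, axis=0).max(axis=0)
    score = own - best_other  # positive means the term is comparatively unique to this topic
    return [terms[i] for i in np.argsort(score)[::-1][:top_n]]

# Usage with the `lda` model and `terms` array from the earlier LDA sketch:
# print(distinctive_terms(lda, terms, topic_idx=0))
```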
Clustering Samples - Overlap could indicate widespread issues 8
Data Analysis 9
Data Sources 10
Reactor Locations: Site Name and Code, Unit, Docket, Reactor Type, Containment, Region, State, City, Latitude, Longitude, Parent Company, Operating. Missing information on sites that have findings: FCS, CR, VOG3, CGS, PILG, KEWA, SANO, VY, TMI, OC.
Performance Indicators: Date (Year, Quarter), Docket; 17 indicators across 6 safety cornerstones (3 Initiating Events, 6 Mitigating Systems, 2 Barrier Integrity, 3 Emergency Preparedness, 1 Public Radiation Safety, 1 Occupational Radiation Safety).
Inspection Reports: 11,672 reports (4 GB of PDFs, 552 MB of extracted text). Text extracted from 10,671 PDFs; ~1,000 PDFs are unreadable (a minimal extraction sketch follows these Data Sources slides).
Inspection Findings: Year, Quarter, Issue Date, Report and Accession Numbers, Link; Region, Site Code and Name, Docket Number; Type, Item Severity Type Code, Significance; Procedure, Cornerstone and its Attribute; Type, Cross-cutting Area and its Aspect; Identified By, Traditional Enforcement; Title and Item Introduction.

Data Sources 11
Event Notifications (38,864): Event Notification Id and Number, Reactor Indicator, Site Name and Unit, Docket, Region, State, City, County, Time Zone, Rx Type; Event, Notification, and Update Dates; Notified By, Operations Officer, Staff Names, Organizations, Text; Current (3) and Initial (3) Power; Critical Indicators (3); Scram Type (3); Event and CFR Descriptions (4); I Mode (3), F Mode (3); Emergency Class; Agreement State, Licensee Name and Number, Containment Type (3); Document Type Code and CFR Code (4); Security, Release Date, Note; Of Interest, Interest; Docket Number (2, 3).
Licensee Event Reports: Plant Name, Event and Report Dates, LER Number, Accession Number, Title, Abstract. 18,872 with Title & Abstract in HTML; 18,695 cleaned text with Title & Abstract separated.
Part 21 Reports (2,690): Log Number, Event/Accession Number, Report Date, Notifier, Description, Body, Links.
Action Matrix: Date (Year, Quarter), Docket, total number of actions.
Data Sources 12
Reactor Locations: Site Name and Code, Unit, Docket, Reactor Type, Containment, Region, State, City, Latitude, Longitude, Parent Company, Operating
Missing information on sites that have findings - inspection reports for these sites are not available on the main webpage:
- FCS: Fort Calhoun
- CR: Crystal River
- PILG: Pilgrim
- KEWA: Kewaunee
- SANO: San Onofre
- VY: Vermont Yankee
- TMI: Three Mile Island
- OC: Oyster Creek
Will attempt to download from the links in the inspection findings file.
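As referenced above, here is a minimal extraction sketch. It assumes the `pypdf` package and a hypothetical `inspection_reports` folder; the team's actual extraction tooling is not stated in the slides.

```python
from pathlib import Path
from pypdf import PdfReader  # assumed extraction library

readable, unreadable = 0, 0
texts = {}
for pdf_path in Path("inspection_reports").glob("*.pdf"):  # hypothetical folder of report PDFs
    try:
        reader = PdfReader(pdf_path)
        text = "\n".join(page.extract_text() or "" for page in reader.pages)
        if text.strip():
            texts[pdf_path.name] = text
            readable += 1
        else:
            unreadable += 1  # e.g. scanned, image-only PDFs with no embedded text
    except Exception:
        unreadable += 1      # corrupt or encrypted files
print(f"Extracted text from {readable} PDFs; {unreadable} unreadable")
```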
Inspection Report Characterization 13
Inspection Findings: Cornerstone 14
[Bar chart: Inspection Findings by Cornerstone - number of inspection findings per cornerstone (Public Radiation Safety, Emergency Preparedness, Occupational Radiation Safety, Barrier Integrity, Initiating Events, Mitigating Systems), broken out by significance (Green, Greater Than Green, White, Yellow, Red); axis scale 0 to 13,000.]
Inspection Findings: Cornerstone Attribute 15
[Bar chart: Inspection Findings by Cornerstone Attribute - number of inspection findings per cornerstone attribute (Cladding Performance, ERO Readiness, Material Control and Accounting, RCS Equipment and Barrier Performance, Offsite EP, SSC Performance, Facilities and Equipment, Plant Facilities/Equipment and Instrumentation, ERO Performance, SSC and Barrier Performance, Configuration Control, Program & Process, Human Performance, Procedure Quality, Protection Against External Factors, Design Control, Equipment Performance), broken out by significance (Green, Greater Than Green, White); axis scale 0 to 450.]
Inspection Findings: Cross-Cutting Area 16
[Bar chart: Inspection Findings by Cross-Cutting Area - number of inspection findings per cross-cutting area (Supplemental Cross-Cutting Aspects, Safety Conscious Work Environment, Problem Identification and Resolution, Human Performance), broken out by significance (Green, White, Yellow, Red); axis scale 0 to 7,500.]
Inspection Findings: Cross-Cutting Aspect 17
[Bar chart: Inspection Findings by Cross-Cutting Aspect - number of inspection findings per cross-cutting aspect (roughly 45 aspects, listed from P.1(e) Alternative CAP Process through H.1(b) Conservative Assumptions & Safe Actions, H.2(c) Documentation, Procedures & Component Labeling, and P.1(c) Evaluation of Identified Problems), broken out by significance (Green, White, Yellow, Red); axis scale 0 to 800.]
Inspection Findings: Site 18
[Bar chart: Inspection Findings by Site - number of inspection findings per reactor site code (about 65 site codes, listed from VOG3 and CR through POIN and PALO), broken out by significance (Green, Greater Than Green, White, Yellow, Red); axis scale 0 to 800.]
Inspection Findings: Year 19
[Bar chart: Inspection Findings by Year - number of inspection findings per year from 1998 through 2023, broken out by significance (Green, Greater Than Green, White, Yellow, Red); axis scale 0 to 1,300.]
Topic Modeling 20
Unsupervised discovery of topics from a collection of text documents: Latent Dirichlet Allocation (LDA)
- Describe a document as a bag-of-words
- Model each document as a mixture of latent topics
- A topic is represented as a distribution over the words in the vocabulary
Variants of topic modeling can be explored:
- Text embeddings from language models and neural topic modeling can be used to improve the quality of results (a minimal embedding-based sketch follows this slide)
Topic Modeling 21
[Diagram: Corpus of Documents -> Topic Model -> Topics (e.g., 50% topic 1, 25% topic 2, 25% topic 3); outputs are clusters of documents by topic, the proportion of topics in each document, and the word/phrase frequencies that distinguish and characterize each topic.]
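A minimal sketch of the embedding-based variant mentioned above. The `sentence-transformers` package, the "all-MiniLM-L6-v2" model, the toy documents, and the cluster count are all assumptions; the study has not committed to a specific model.

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

docs = [
    "reactor coolant system leakage identified during walkdown",
    "emergency preparedness drill critique identified notification weakness",
    "auxiliary feedwater pump discharge valve found mispositioned",
    "emergency siren system test failure",
]  # illustrative text only

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
embeddings = model.encode(docs)                   # one dense vector per document

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(embeddings)
print(kmeans.labels_)                             # cluster id assigned to each document
```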
Demo 22
Actionable Insights 23
Inspection Report ML15037A011 Indian Point
The algorithm reveals the safety topics from findings
These are the safety clusters
One report may mention multiple safety topics
The algorithm links the report to those safety clusters
Each safety topic has a probability of match.
Actionable Insights within a facility
Indicators for inspectors based on all of the Safety Clusters that a facility is linked to.
All issues that are part of a topic cluster tend to occur together - ensure inspections cover each of these areas.
It appears that each inspection has a focus; these safety clusters could help validate that focus.
Actionable Insights from Cluster Analysis 24
Safety cluster matches for the report:
- C1: water, service, valve, cool - 13%
- C5: emergency, diesel, generator - 17%
- C9: pump, valve, feedwater - 61%

Actionable Insights from Cluster Analysis 25
Inspection Reports: ML042190340 (Grand Gulf) and ML101330214 (Indian Point)
Both reactors are strongly tied to safety Cluster #9 - pump, valve, and feedwater.
- C9: pump, valve, feedwater - 91%
Actionable Insights across facilities
Indicators that tie reactors together based on safety findings
Analyze other factors to seek similarities in designs, procedures, etc. to reveal potential hazards.
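One way such cross-facility indicators could be computed (a sketch with invented numbers, not project results) is to describe each facility by the share of its findings in each safety cluster and compare facilities by cosine similarity:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Rows = facilities, columns = safety clusters (e.g. C1, C5, C9);
# each value is the share of that facility's findings in that cluster (invented numbers).
profiles = np.array([
    [0.13, 0.17, 0.61],  # hypothetical Facility A
    [0.05, 0.04, 0.91],  # hypothetical Facility B
    [0.70, 0.20, 0.10],  # hypothetical Facility C
])

similarity = cosine_similarity(profiles)
print(np.round(similarity, 2))  # high off-diagonal values flag facility pairs worth comparing
```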
Classification
Input from:
Performance Indicators
Event Notifications
Event Reports
Part 21 Reports
Classify on:
Significance
Cornerstone Areas
Cross-cutting Areas
Actionable Insights from classification analysis 26

Actionable Insights for Classification
Use alternate sources of input.
Exclude data on the target feature (significance)
Use the remaining features to assign probabilities to the significance categories.
Could also classify on cornerstone and cross-cutting areas.
Can use historical inspection reports to verify accuracy.

| Significance | Probability |
|---|---|
| No Finding | 47% |
| Green | 28% |
| White | 14% |
| Yellow | 9% |
| Red | 2% |
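A minimal sketch of this classification idea, with hypothetical features and labels (the actual feature set drawn from the performance-indicator, event-notification, LER, and Part 21 sources is not defined in the slides): withhold significance, train on the remaining features, and read off per-category probabilities like the table above.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical feature table: one row per site/quarter, with the significance column withheld
X = pd.DataFrame({
    "unplanned_scrams": [0, 2, 1, 0],
    "event_notifications": [3, 7, 5, 2],
    "part21_reports": [0, 1, 0, 0],
})
y = ["No Finding", "Green", "White", "No Finding"]  # historical significance labels (invented)

clf = RandomForestClassifier(random_state=0).fit(X, y)

# Per-category probabilities for a new, unlabeled row
new_row = pd.DataFrame({"unplanned_scrams": [1], "event_notifications": [4], "part21_reports": [0]})
for label, prob in zip(clf.classes_, clf.predict_proba(new_row)[0]):
    print(f"{label}: {prob:.0%}")
```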
Predictive
Input from:
Performance Indicators
Event Notifications
Event Reports
Part 21 Reports
Predictions on:
Cornerstone Areas / Attributes
Cross-cutting Areas / Aspects
Actionable Insights from predictive analysis 27

Actionable Insights for Future
Use alternate sources of input.
Exclude data on the target feature (cornerstones)
Use the remaining features to discover cornerstone trends and patterns
Assign a probability to findings in inspections
Can use historical inspection reports to verify accuracy
Are these types of actionable insights what you would like to see from the study?
What other goals do you have?
Questions on direction for clustering and insights 28
Tool Analysis 29
If you want simple, but limited:
Platform-Supplied Algorithms
Pros:
Integrated into user experience
Can be combined into standard pipelines
Work well in common cases (social media)
Well-known algorithms
Cons:
Do not handle technical domains well
Difficult to tailor pipelines
Limited selection of algorithms
Limited selection of pre-training datasets
Advanced algorithms not available
Strengths in one area (Topic Modeling) may not translate to other areas (Classification, clustering, regression, anomaly detection)
If you want flexible and timely:
Python Library Algorithms
Pros:
Flexibility to select from multiple algorithms
Flexibility to select from multiple pre-trainings
Advanced models available
Flexibility to address highly technical domains
Flexibility to leverage internal parameters
Work well in Machine Learning Notebooks
Can be deployed in any cloud or on premises
Cons:
Requires some knowledge of Python
Algorithms must be researched and selected
Survey of Platforms and Tools 30
Platforms and Tools:
- Amazon's AWS SageMaker
- Microsoft Azure's AI/ML services
- Google's Cloud AI products
- MATLAB
Initial survey indicates that all platforms provide:
- The ability to use the Python libraries needed for pre-processing textual data
- Python libraries that provide algorithms for unsupervised learning with textual and numerical data
- Various pre-trained models, with the ability to easily fine-tune them with our data
Python notebooks that are currently used locally can be launched and scaled on all platforms
Refine evaluation factors and their weights & ranges with continued exploration of unsupervised techniques.
Survey of Platforms and Tools 31
What do you see as the primary role of the people executing this type of analysis?
Are there longer-term aspirations for ML initiatives?
Questions on Tool Selection Criteria 32
Progress 33
SOW Task Status 34

Phase I: March 6, 2023 - April 9, 2023
| Task | Status |
|---|---|
| Describe the Problem | In progress |
| Search the Literature | In progress |
| Select Candidates | In progress |
| Select Evaluation Factors | In progress |
| Develop evaluation factor weights | In progress |
| Define evaluation factor ranges | In progress |
| Perform assessment | Not started |
| Report Results | Not started |
| Deliver Trade study report | Not started |

Phase II: March 20, 2023 - May 7, 2023
| Task | Status |
|---|---|
| Platform/system selection and installation | Not started |
| Data acquisition and preparation | In progress |
| Feature pipeline engineering | In progress |
| Clustering method experimentation & selection | In progress |
| Cluster pipeline engineering | In progress |
| Anomaly detection (as needed) | Not started |
| Model Development, Training, Evaluation | Not started |
| Test harness development | Not started |
| PoC integration and demonstration | Not started |
| Trial runs and evaluation | Not started |
| Demonstrate PoC capability | Not started |

Phase III: April 19, 2023 - June 16, 2023
| Task | Status |
|---|---|
| Live data ingestion | Not started |
| Model execution | Not started |
| Cluster evaluation | Not started |
| Critical Method documentation | Not started |
| Technical Report Document | Not started |
| Deliver final report with findings | Not started |
Next Steps 35
Next Steps 36
Determine algorithms and approaches needed to perform the study:
Explore LDA with other text data: full inspection reports, event notifications, LERs, Part 21s
Explore neural topic modeling with BERTopic and Contextualized Topic Model (a minimal BERTopic sketch follows these next steps)
Explore other ML areas as directed
Define the selection criteria for tools / environments based on study needs:
Compare the services and methods available for topic modeling across our 4 platforms
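As referenced in the next steps above, here is a minimal BERTopic sketch. It uses the `bertopic` package's default pipeline on a small invented corpus (repeated with varied suffixes only so the default settings have enough documents to work with); all parameter values are toy assumptions, not the study's configuration.

```python
from bertopic import BERTopic  # assumed package; default embedding + UMAP + HDBSCAN pipeline

base = [
    "emergency diesel generator failed to start during surveillance testing",
    "reactor coolant system leakage exceeded technical specification limits",
    "auxiliary feedwater pump discharge valve found mispositioned",
    "emergency preparedness drill critique identified notification weakness",
    "occupational radiation dose controls not followed in containment",
    "security barrier degraded without compensatory measures",
]
# Repeat with varied suffixes only so the toy corpus is large enough for the defaults;
# the real study would pass the extracted inspection-report text instead.
docs = [f"{text} unit {i}" for i in range(6) for text in base]

topic_model = BERTopic(min_topic_size=3)      # toy value for this example
topics, probs = topic_model.fit_transform(docs)
print(topic_model.get_topic_info())           # one row per discovered topic with its top terms
```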