ML23262B183

Meeting Slides 20230510-final
Person / Time
Issue date: 05/10/2023
From: Chang Y, Friedman C, Mishkin A, Polra S, Pringle S, Tanya Smith, Vasquez G
NRC/RES/DRA/HFRB, Sphere of Influence

Machine Learning Demo Wednesday

Prioritizing Inspections using ML

Alec Mishkin, Guillermo Vasquez, Stuti Polra, Casey Friedman, Scott Pringle, Theresa Smith

Wednesday, May 10, 2023

Agenda

- Topic Modeling Metrics Update

- Interpreting Topics with a Customized Vocabulary and Automated Key Phrase Extraction

- Further Investigation

- Progress and Next Steps

Corpus Testing

Several corpora have been suggested to us as reference documents for the coherence metric

- Our current baseline is the standard technical specifications for the different reactors

- Another corpus we are inspecting is a set of documents from the NRC technical reports (NUREGs)

Below we calculate the Coherence Metric using four different corpora

- Standard Technical Specifications

- Standard Technical Specifications with multi-grams

- Subset of NUREGs

- Subset of NUREGs with multi-grams

The trends for all four metrics are consistent

To identify whether any of these metrics is superior to the others, Guillermo is ranking subsets of topics, and we will then compare his rankings to the results from these different metrics

- The ranking provided by Guillermo can also be used in the future for similar work
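The slides do not say which coherence formula is used, so the following is only a minimal sketch of one common choice, NPMI with document-level co-occurrence, scored against a reference corpus. The function name `npmi_coherence` and the toy documents are hypothetical, and the sketch handles single-word terms only; a multi-gram variant would need phrases joined into single tokens before scoring.

```python
import math
from itertools import combinations


def npmi_coherence(topic_words, reference_docs):
    """Mean NPMI over all pairs of topic words, using document-level co-occurrence."""
    doc_sets = [set(doc.lower().split()) for doc in reference_docs]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of reference documents containing every given word.
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(topic_words, 2):
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        if p12 == 0:
            scores.append(-1.0)  # pair never co-occurs: lowest NPMI
        elif p12 == 1:
            scores.append(1.0)   # pair co-occurs in every document
        else:
            scores.append(math.log(p12 / (p1 * p2)) / -math.log(p12))
    return sum(scores) / len(scores) if scores else 0.0


# Toy stand-in for a reference corpus (not real NRC text).
reference_docs = [
    "diesel generator fuel oil",
    "diesel generator start",
    "radiation dose workers",
    "dose rates radiation area",
]
coherent = npmi_coherence(["diesel", "generator"], reference_docs)
mixed = npmi_coherence(["diesel", "dose"], reference_docs)
```

Swapping `reference_docs` for a different corpus changes the co-occurrence statistics and therefore the score, which is why the choice of reference corpus matters when comparing topics.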

Interpreting Topics with a Customized Vocabulary and Automated Key Phrase Extraction

Creating a Customized Vocabulary

Various data sources used to create a vocabulary

- ML14300A223 (656), ML17004A106 (995), and Reactor Concepts R-100 Acronyms (1050), reduced to 1004 unique terms and phrases by the NRC

- Unique list of 407 words and phrases for different failure modes of reactor systems and components, from the NRC

- 1411 unique words and phrases after combining the three sources and dropping duplicates

- Some abbreviations have multiple different full-forms that can't be disambiguated with simple string matching

- Inspection findings item intros often use the abbreviations multiple times after providing the full-form once

- If we don't replace all abbreviations with their full-forms and only match on the full-forms, our term counts will be lower than the true number of appearances of the terms, so abbreviations are included in this vocabulary

Possible enhancements:

- Abbreviation Detector from Allen AI's SciSpaCy package can be used to detect abbreviations and their defined full forms from text, and then replace all occurrences of the abbreviations with their full-forms

> Works on typical patterns like "cause of the Emergency Diesel Generator (EDG) failure ... the EDG failure occurred on"

- SpaCy's Entity Linker can be used to link found abbreviations (entities) to a knowledge base and defined aliases
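The abbreviation replacement idea can be illustrated with a small regex-based stand-in. This is not the SciSpaCy Abbreviation Detector; it is a simplified sketch that only handles the "Full Form (ABBR)" pattern where each letter of the abbreviation matches the initial of one preceding word. The function names and the sample sentence are hypothetical.

```python
import re


def find_abbreviations(text):
    """Map each abbreviation defined as 'Full Form (ABBR)' to its full form."""
    found = {}
    for match in re.finditer(r"\(([A-Z]{2,})\)", text):
        abbr = match.group(1)
        words = text[:match.start()].split()
        candidate = words[-len(abbr):]
        # Accept only if the initials of the preceding words spell the abbreviation.
        if len(candidate) == len(abbr) and all(
            w[0].upper() == c for w, c in zip(candidate, abbr)
        ):
            found[abbr] = " ".join(candidate)
    return found


def expand_abbreviations(text, mapping):
    """Replace every use of each abbreviation with its full form."""
    for abbr, full in mapping.items():
        # Drop the parenthetical definition, then expand standalone uses.
        text = text.replace(f"{full} ({abbr})", full)
        text = re.sub(rf"\b{re.escape(abbr)}\b", lambda m, f=full: f, text)
    return text


# Hypothetical item introduction text.
sample = (
    "The cause of the Emergency Diesel Generator (EDG) failure was a fuel leak. "
    "The EDG failure occurred on startup."
)
abbreviations = find_abbreviations(sample)
expanded = expand_abbreviations(sample, abbreviations)
```

After expansion, matching on full-forms alone would count both mentions. Abbreviations with several possible full-forms, or ones whose letters do not line up with word initials, are outside what this sketch handles.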

Creating a List of Keywords and Phrases

Guided KeyBERT and KeyphraseVectorizers used to extract the top 20 key phrases from 14938 Inspection Findings Item Introductions (~2.5hrs)

- KeyphraseVectorizers used to extract candidates that follow a defined POS pattern

- Using custom vocabulary (abbreviations, full forms and failure modes) as the seed words for Guided KeyBERT

- Unique list of 66,325 words and phrases

- Filtering out some unhelpful words and phrases with a custom stop words list will improve the key phrases list (in progress, not in today's results)
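The custom stop word filtering step could look roughly like the sketch below. The stop words and phrases shown are made-up examples, not the project's actual list, and the rule used here (drop a phrase when every token is a stop word, then de-duplicate on normalized text) is an assumption about how such a filter might work.

```python
def filter_key_phrases(phrases, stop_words):
    """Drop phrases made up entirely of stop words and remove duplicates."""
    stop = {w.lower() for w in stop_words}
    kept, seen = [], set()
    for phrase in phrases:
        # Normalize case and whitespace so near-identical phrases collapse.
        tokens = phrase.lower().split()
        norm = " ".join(tokens)
        if not tokens or all(t in stop for t in tokens):
            continue
        if norm not in seen:
            seen.add(norm)
            kept.append(norm)
    return kept


# Hypothetical extracted phrases and stop words.
phrases = [
    "Inspectors identified",
    "emergency diesel generator",
    "Emergency  Diesel Generator",
    "finding",
    "fuel oil leak",
]
stop_words = ["inspectors", "identified", "finding"]
filtered = filter_key_phrases(phrases, stop_words)
```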

Interpreting Topics with Custom String Matching

Our custom vocabulary (abbreviations, full forms, failure modes) and/or our list of extracted key words/phrases can be used to interpret the topics we discover in our text after running BERTopic

- Comparative approach to using MMR, POS, KeyBERTInspired or Text Generation representation models from BERTopic to interpret the found topics in slightly different ways

Extract string matches in each document, for each topic (including the documents in the outlier topic)

- Using SpaCy's Phrase Matcher and our custom vocabulary of terms and phrases

- Allowing a match on lower case text to account for casing differences between vocabulary and inspection findings text

For each topic (composed of all the documents assigned to it):

- Use counts to see how many times each term or phrase from our vocabulary appears in each topic

- Use tf-idf representation to see how much each term or phrase from our vocabulary appears in each topic (reducing the impact of common terms that occur frequently across all topics)

For each document:

- Use counts to see how many times each term or phrase from our vocabulary appears in each document

Visualize counts and tf-idf representations of vocabulary words in documents and topics with wordclouds
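The matching and weighting steps above can be sketched without any NLP library. This stand-in uses regular expressions instead of the Phrase Matcher, and it assumes a plain tf-idf weight of count times log(number of topics / number of topics containing the term); the slides do not give the exact formula. All names and the sample documents are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def count_matches(text, vocabulary):
    """Count case-insensitive whole-phrase matches of each vocabulary entry."""
    lowered = text.lower()
    counts = Counter()
    for phrase in vocabulary:
        pattern = rf"(?<!\w){re.escape(phrase.lower())}(?!\w)"
        n = len(re.findall(pattern, lowered))
        if n:
            counts[phrase] += n
    return counts


def topic_counts(docs, topics, vocabulary):
    """Sum per-document match counts over the documents assigned to each topic."""
    per_topic = defaultdict(Counter)
    for doc, topic in zip(docs, topics):
        per_topic[topic].update(count_matches(doc, vocabulary))
    return dict(per_topic)


def topic_tfidf(per_topic):
    """Down-weight terms that appear in many topics (assumed idf = log(N / df))."""
    n_topics = len(per_topic)
    df = Counter(term for counts in per_topic.values() for term in counts)
    return {
        topic: {t: c * math.log(n_topics / df[t]) for t, c in counts.items()}
        for topic, counts in per_topic.items()
    }


# Hypothetical documents with topic assignments.
docs = [
    "The EDG failed. The emergency diesel generator fuel oil leaked and caused a failure.",
    "Workers received a dose above limits; the failure was procedural.",
    "EDG fuel oil storage tank level was low.",
]
topics = [1, 0, 1]
vocabulary = ["EDG", "emergency diesel generator", "fuel oil", "dose", "failure"]

counts = topic_counts(docs, topics, vocabulary)
weights = topic_tfidf(counts)
```

In this toy data "failure" appears in both topics, so its tf-idf weight drops to zero while its raw count does not. Either dictionary could then serve as the font-size weights for a word cloud.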

Topic Modeling Configuration

Using topics learned from inspection findings item introduction summaries generated by the pegasus-cnn-dailymail model

Three different ways of learning topic representations compared to BERTopic:

- Using vocabulary of 1004 abbreviations and full-forms + 407 failure modes for topic representation

- Using 66,325 key words/phrases automatically extracted from item introductions for topic representation

- Using both the vocabulary and the extracted key words/phrases (67,402) for topic representation

BERTopic configuration:

- MMR (diversity = 0.6)

- n-grams range of 1-3

- min cluster size 20

- 15 neighbors, 5 components

- all-MiniLM-L6-v2
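To show what the MMR diversity setting controls, here is a standalone sketch of maximal marginal relevance over toy embeddings. It is not BERTopic's own code; it assumes the usual scoring rule, (1 - diversity) times relevance to the topic minus diversity times the highest similarity to an already selected term. The two-dimensional vectors are invented for illustration.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(topic_vec, candidates, top_n=3, diversity=0.6):
    """Pick terms that are relevant to the topic but not redundant with each other.

    candidates: dict mapping term -> embedding vector.
    """
    if not candidates:
        return []
    relevance = {t: cosine(v, topic_vec) for t, v in candidates.items()}
    selected = [max(relevance, key=relevance.get)]  # start with the most relevant term
    while len(selected) < min(top_n, len(candidates)):
        best, best_score = None, -math.inf
        for term, vec in candidates.items():
            if term in selected:
                continue
            redundancy = max(cosine(vec, candidates[s]) for s in selected)
            score = (1 - diversity) * relevance[term] - diversity * redundancy
            if score > best_score:
                best, best_score = term, score
        selected.append(best)
    return selected


# Toy 2-d "embeddings" (hypothetical values).
topic_vec = [1.0, 0.0]
candidates = {
    "emergency diesel generator": [1.0, 0.1],
    "diesel generator": [1.0, 0.12],
    "fuel oil": [0.6, 0.8],
}
diverse = mmr(topic_vec, candidates, top_n=2, diversity=0.6)
greedy = mmr(topic_vec, candidates, top_n=2, diversity=0.0)
```

With diversity at 0.6 the near-duplicate "diesel generator" is skipped in favor of "fuel oil"; with diversity at 0 the selection reduces to ranking by relevance alone.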

Topic 0 - Radiation (879 documents)

Topic Terms from BERTopic: dose - overexposure - area - rates - workers - occupational radiation safety - rem - safety cornerstone - exposure - ability assess

[Figure: word clouds of String Matches across Documents in the Topic, one using counts as weights for font size and one using TF-IDF representation as weights for font size]

Topic 0 - Radiation (879 documents): Vocab (abbreviations, full-forms, failure modes)

[Figure: word clouds of String Matches across Documents in the Topic, one using counts as weights for font size and one using TF-IDF representation as weights for font size]

Topic 0 - Radiation (879 documents): Key Words/Phrases extracted from KeyBERT

[Figure: word clouds of String Matches across Documents in the Topic, one using counts as weights for font size and one using TF-IDF representation as weights for font size]

Topic 0 - Radiation (879 documents): Vocab (abbreviations, full-forms, failure modes) + Key Words/Phrases extracted from KeyBERT

[Figure: word clouds of String Matches across Documents in the Topic, one using counts as weights for font size and one using TF-IDF representation as weights for font size]

Topic 1 - Emergency Diesel Generator (763 documents)

Topic Terms from BERTopic: emergency diesel generator - edg - emergency - fuel - generators - transfer - generator fuel oil - division - air start - oil storage

[Figure: word clouds of String Matches across Documents in the Topic, one using counts as weights for font size and one using TF-IDF representation as weights for font size]

Topic 1 - Emergency Diesel Generator (763 documents): Vocab (abbreviations, full-forms, failure modes)

[Figure: word clouds of String Matches across Documents in the Topic, one using counts as weights for font size and one using TF-IDF representation as weights for font size]

Topic 1 - Emergency Diesel Generator (763 documents): Key Words/Phrases extracted from KeyBERT

[Figure: word clouds of String Matches across Documents in the Topic, one using counts as weights for font size and one using TF-IDF representation as weights for font size]

Topic 1 - Emergency Diesel Generator (763 documents): Vocab (abbreviations, full-forms, failure modes) + Key Words/Phrases extracted from KeyBERT

[Figure: word clouds of String Matches across Documents in the Topic, one using counts as weights for font size and one using TF-IDF representation as weights for font size]

Topic 4 - Auxiliary Feedwater Pump (349 documents)

Topic Terms from BERTopic: auxiliary feedwater pump - feedwater - auxiliary - turbine driven auxiliary - tdafw - service - motor driven - service water pump - esw pump - bearing oil

[Figure: word clouds of String Matches across Documents in the Topic, one using counts as weights for font size and one using TF-IDF representation as weights for font size]

Topic 4 - Auxiliary Feedwater Pump (349 documents): Vocab (abbreviations, full-forms, failure modes)

[Figure: word clouds of String Matches across Documents in the Topic, one using counts as weights for font size and one using TF-IDF representation as weights for font size]

Topic 4 - Auxiliary Feedwater Pump (349 documents): Key Words/Phrases extracted from KeyBERT

[Figure: word clouds of String Matches across Documents in the Topic, one using counts as weights for font size and one using TF-IDF representation as weights for font size]

Topic 4 - Auxiliary Feedwater Pump (349 documents): Vocab (abbreviations, full-forms, failure modes) + Key Words/Phrases extracted from KeyBERT

[Figure: word clouds of String Matches across Documents in the Topic, one using counts as weights for font size and one using TF-IDF representation as weights for font size]

Takeaways and Next Steps

Vocabulary of abbreviations, full-forms and failure modes is not comprehensive enough to accurately represent the smaller, more unique topics

- Reactor structures, systems, and components can be added to the vocabulary for more coverage

Key phrase extraction works well at the document level to find text spans of interest, but at the topic level there are not enough common phrases to give useful insights

- Custom stop word removal at the topic level will decrease the appearance of unhelpful terms/phrases

Topic model and its steps have not been extensively tuned yet, which can affect results

- Embedding model (finetuned on domain data)

- Embedding reduction and clustering tuned together

Use extracted key phrases from each item introduction as the input to topic modeling to form more cohesive, focused clusters

Further Investigation

Based on our research and analysis in the three phases of this project, we have identified 5 topics that warrant further investigation and prototyping. Each of these initiatives could be pursued independently and would be similar in size to the current project. Additional details on selected topics will be provided.

Document Gisting tool to Accelerate Analysis

Assist analysts in understanding large volumes of documents quickly.

- Machine generated summaries

- Quick understanding of numerous documents.

- Inspection Reports, LERs, and others.

Dynamic Analysis and Discovery

Enable analysts to explore and understand Safety Issues quickly.

- Software and storage for dynamic pivots, analyst directed queries

- Find sites with similar connection structures.

- Dynamic topic modeling to see how safety clusters change through time.

Cluster Representation - Names and Descriptions

Provide analysts with better insights into clustered safety issues

- Custom defined named entity recognition + customized pattern matching

- Topic modeling by category

- Text similarity of input between documents and inspection procedures

Safety Event Alerting

Inform analysts of potential safety events.

- Text classification - Use inspection reports to train a system that can classify events and LERs into cornerstones + cross-cutting areas.

Safety Cluster Tuning - a

Guided Cluster Discovery using custom vocabularies.

- Influence cluster formation with NRC specified safety terms and concepts.

- Use supervised or semi-supervised methods to discover clusters rather than relying on fully unsupervised approaches.

Safety Cluster Tuning - b

Fine-tuning a pre-trained model or a custom trained model.

- Enhance existing cluster models with NRC specific language.

- Enable Text summarization or question-answering.

Safety Cluster Tuning - c

Pre-training a language model with NRC text.

- SOTA language models suited for scientific/engineering text.

- Use regulations, manuals, inspection procedures, licensee event notifications + reports, inspection reports, part 21 reports and more.

Progress

SOW Task Status

Phase I: March 6, 2023 - April 9, 2023

- Describe the Problem: Complete

- Search the Literature: Complete

- Select Candidates: Complete

- Select Evaluation Factors: Complete

- Develop evaluation factor weights: Complete

- Define evaluation factor ranges: Complete

- Perform assessment: Complete

- Report Results: Complete

- Deliver Trade study report: Complete

Phase II: March 20, 2023 - May 7, 2023

- Platform/system selection and installation: Complete

- Data acquisition and preparation: Complete

- Feature pipeline engineering: Complete

- Clustering method experimentation & selection: Complete

- Cluster pipeline engineering: Complete

- Anomaly detection (as needed): Not needed

- Model Development, Training, Evaluation: Complete

- Test harness development: Complete

- PoC integration and demonstration: Complete

- Trial runs and evaluation: Complete

- Demonstrate PoC capability: Complete

Phase III: April 19, 2023 - June 16, 2023

- Live data ingestion: In progress

- Model execution: In progress

- Cluster evaluation: In progress

- Critical Method documentation: Not started

- Technical Report Document: Not started

- Deliver final report with findings: Not started

Next Steps

Cluster Representations

- Continue implementing string matching approaches to find effective cluster descriptions

Metrics

- Assess clusters formed using full Item Introductions vs various machine generated summaries

Deliverables

- Consolidate experiments, inputs, and outputs to deliverable format