ML23262B183
| ML23262B183 | |
| Person / Time | |
|---|---|
| Issue date: | 05/10/2023 |
| From: | Chang Y, Friedman C, Mishkin A, Polra S, Pringle S, Tanya Smith, Vasquez G NRC/RES/DRA/HFRB, Sphere of Influence |
Text
Machine Learning Demo Wednesday
Prioritizing Inspections using ML
Wednesday, May 10, 2023
Alec Mishkin, Guillermo Vasquez, Stuti Polra, Casey Friedman, Scott Pringle, Theresa Smith
Agenda
- Topic Modeling Metrics Update
- Interpreting Topics with a Customized Vocabulary and Automated Key Phrase Extraction
- Further Investigation
- Progress and Next Steps
Corpus Testing
- Several corpora have been suggested to us as reference documents for the coherence metric
- Our current baseline is the standard technical specifications for the different reactors
- Another corpus we are inspecting consists of documents from the NRC technical reports (NUREGs)
- Below we calculate the coherence metric using four different corpora:
  - Standard Technical Specifications
  - Standard Technical Specifications with multi-grams
  - Subset of NUREGs
  - Subset of NUREGs with multi-grams
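How a coherence score depends on the reference corpus can be sketched in a few lines. The slides do not say which coherence measure is used, so this example assumes NPMI (normalized pointwise mutual information) over document-level co-occurrence. The function name and the tiny reference corpus are hypothetical, and multi-gram handling is left out.

```python
import math
from itertools import combinations


def npmi_coherence(topic_words, reference_docs):
    """Mean NPMI over all pairs of topic words, in the range [-1, 1].

    Probabilities are estimated from how many reference documents contain
    each word (or both words of a pair). Topic words must be lowercase.
    """
    doc_sets = [set(doc.lower().split()) for doc in reference_docs]
    n_docs = len(doc_sets)

    def prob(*words):
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(topic_words, 2):
        p12 = prob(w1, w2)
        if p12 == 0.0:
            scores.append(-1.0)  # never co-occur: lower bound of NPMI
        elif p12 == 1.0:
            scores.append(1.0)   # co-occur in every document: upper bound
        else:
            pmi = math.log(p12 / (prob(w1) * prob(w2)))
            scores.append(pmi / -math.log(p12))
    if not scores:
        raise ValueError("need at least two topic words")
    return sum(scores) / len(scores)


# Hypothetical stand-in for a reference corpus such as the technical specifications.
reference_docs = [
    "the diesel generator failed",
    "diesel generator fuel oil leak",
    "radiation dose exceeded limit",
    "worker radiation dose",
]
focused = npmi_coherence(["diesel", "generator"], reference_docs)
mixed = npmi_coherence(["diesel", "generator", "radiation"], reference_docs)
```

Swapping `reference_docs` for a different corpus changes the co-occurrence statistics and therefore the score, which is why the same topics are scored against several corpora.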
- The trends for all four metrics are consistent
- To identify whether any of these metrics is superior to the others, Guillermo is ranking subsets of topics, and we will then compare his rankings against these metrics
- The ranking provided by Guillermo can also be used in the future for similar work
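One way to compare an expert ranking against each metric is a rank correlation. The slides do not state how the comparison will be made, so the sketch below assumes Spearman's rho, computed as the Pearson correlation of average ranks. All scores shown are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1 (smallest) upward; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation between two equally long score lists."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


# Hypothetical data: expert scores for five topics (higher = more coherent)
# and the coherence assigned to the same topics by two candidate metrics.
expert = [5, 4, 3, 2, 1]
metric_a = [0.9, 0.5, 0.7, 0.3, 0.1]
metric_b = [0.1, 0.3, 0.5, 0.7, 0.9]

rho_a = spearman(expert, metric_a)
rho_b = spearman(expert, metric_b)
```

Under this approach, the metric whose rho is closest to 1 agrees best with the expert ranking.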
Interpreting Topics with a Customized Vocabulary and Automated Key Phrase Extraction
Creating a Customized Vocabulary
Various data sources used to create a vocabulary:
- ML14300A223 (656), ML17004A106 (995), Reactor Concepts R-100 Acronyms (1050), reduced to 1004 unique terms and phrases by the NRC
- Unique list of 407 words and phrases describing different failure modes of reactor systems and components, from the NRC
- 1411 unique words and phrases after combining the three sources and dropping duplicates
Some abbreviations have multiple different full-forms that can't be disambiguated with simple string matching
- Inspection findings item intros often use the abbreviations multiple times after providing the full-form once
- If we don't replace all abbreviations with their full-forms and only use the full-forms to match on, our term counts will be lower than the true number of appearances of the terms, so abbreviations are included in this vocabulary
Possible enhancements:
- Abbreviation Detector from Allen AI's SciSpaCy package can be used to detect abbreviations and their defined full forms from text, and then replace all occurrences of the abbreviations with their full-forms
  - Works on typical patterns like "cause of the Emergency Diesel Generator (EDG) failure ... the EDG failure occurred on ..."
- spaCy's Entity Linker can be used to link found abbreviations (entities) to a knowledge base and defined aliases
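The detect-then-replace idea can be illustrated with a small standard-library sketch. This is not SciSpaCy's Abbreviation Detector. It is a simplified stand-in that only handles the "Full Form (ABBR)" pattern where the abbreviation is made of the initials of the preceding words. Both function names are hypothetical.

```python
import re


def find_abbreviations(text):
    """Map each abbreviation to its full form for 'Full Form (ABBR)' patterns."""
    found = {}
    for match in re.finditer(r"\(([A-Z]{2,10})\)", text):
        abbr = match.group(1)
        # Candidate full form: as many preceding words as the abbreviation has letters.
        candidate = text[:match.start()].split()[-len(abbr):]
        initials = "".join(word[0].upper() for word in candidate)
        if len(candidate) == len(abbr) and initials == abbr:
            found[abbr] = " ".join(candidate)
    return found


def expand_abbreviations(text):
    """Replace every later use of a defined abbreviation with its full form."""
    for abbr, full in find_abbreviations(text).items():
        # Drop the parenthetical definition, then expand the bare abbreviation.
        text = text.replace(f"{full} ({abbr})", full)
        text = re.sub(rf"\b{re.escape(abbr)}\b", full, text)
    return text


sample = (
    "The cause of the Emergency Diesel Generator (EDG) failure was a fuel leak. "
    "The EDG failure occurred on startup."
)
expanded = expand_abbreviations(sample)
```

After expansion, counting the full form alone captures both mentions, which is the undercounting problem described above. Because definitions are resolved per document, an abbreviation with several possible full forms is expanded according to the text in which it is defined.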
Creating a List of Keywords and Phrases
- Guided KeyBERT and KeyphraseVectorizers used to extract top 20 key phrases from 14938 Inspection Findings Item Introductions (~2.5 hrs)
  - KeyphraseVectorizers extracts candidates that follow a defined POS pattern
  - Custom vocabulary (abbreviations, full forms and failure modes) used as the seed words for Guided KeyBERT
- Unique list of 66,325 words and phrases
- Filtering out some unhelpful words and phrases with a custom stop words list will improve the key phrases list (in progress, not in today's results)
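The extraction and filtering steps above might look roughly like the following sketch. The `keybert` and `keyphrase_vectorizers` calls are real library entry points, but the embedding model name, the function names, and the stop-phrase list are assumptions made for illustration, not the team's actual settings.

```python
def extract_key_phrases(doc, seed_words, top_n=20):
    """Guided KeyBERT over POS-pattern candidates; returns (phrase, score) pairs."""
    # Imported lazily so the filtering helper below runs without these packages.
    from keybert import KeyBERT
    from keyphrase_vectorizers import KeyphraseCountVectorizer

    kw_model = KeyBERT(model="all-MiniLM-L6-v2")  # assumed embedding model
    return kw_model.extract_keywords(
        doc,
        vectorizer=KeyphraseCountVectorizer(),  # candidates must match a POS pattern
        seed_keywords=seed_words,               # custom vocabulary guides the ranking
        top_n=top_n,
    )


def drop_stop_phrases(scored_phrases, stop_phrases):
    """Remove unhelpful phrases, comparing case-insensitively."""
    stops = {phrase.lower() for phrase in stop_phrases}
    return [(p, score) for p, score in scored_phrases if p.lower() not in stops]


# Hypothetical extraction output and custom stop-phrase list.
scored = [("emergency diesel generator", 0.71), ("inspectors", 0.40), ("Finding", 0.35)]
kept = drop_stop_phrases(scored, ["inspectors", "finding"])
```

Filtering after extraction keeps the stop list easy to revise without rerunning the multi-hour extraction step.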
Interpreting Topics with Custom String Matching
- Our custom vocabulary (abbreviations, full forms, failure modes) and/or our list of extracted key words/phrases can be used to interpret the topics we discover in our text after running BERTopic
  - A comparative approach to using the MMR, POS, KeyBERTInspired or Text Generation representation models from BERTopic, which interpret the found topics in slightly different ways
- Extract string matches in each document, for each topic (including the documents in the outlier topic)
  - Using spaCy's PhraseMatcher and our custom vocabulary of terms and phrases
  - Allowing a match on lower case text to account for casing differences between vocabulary and inspection findings text
- For each topic (composed of all the documents assigned to it):
  - Use counts to see how many times each term or phrase from our vocabulary appears in each topic
  - Use tf-idf representation to see how much each term or phrase from our vocabulary appears in each topic (reducing the impact of common terms that occur frequently across all topics)
- For each document:
  - Use counts to see how many times each term or phrase from our vocabulary appears in each document
- Visualize counts and tf-idf representations of vocabulary words in documents and topics with word clouds
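The matching and weighting steps above can be sketched with the standard library alone. In practice spaCy's `PhraseMatcher` with `attr="LOWER"` performs the case-insensitive matching. The regular-expression matcher here is a simplified stand-in, and the tf-idf formula (each topic treated as one document) is an assumption, since the slides do not give the exact weighting.

```python
import math
import re
from collections import Counter


def count_matches(text, vocabulary):
    """Count case-insensitive, whole-word matches of each vocabulary entry."""
    lowered = text.lower()
    counts = Counter()
    for term in vocabulary:
        pattern = r"\b" + re.escape(term.lower()) + r"\b"
        n = len(re.findall(pattern, lowered))
        if n:
            counts[term] = n
    return counts


def topic_counts(docs, topics, vocabulary):
    """Sum per-document counts over all documents assigned to each topic."""
    totals = {}
    for doc, topic in zip(docs, topics):
        totals.setdefault(topic, Counter()).update(count_matches(doc, vocabulary))
    return totals


def topic_tfidf(totals):
    """Weight counts so that terms present in every topic score zero."""
    n_topics = len(totals)
    topic_freq = Counter(term for counts in totals.values() for term in counts)
    weights = {}
    for topic, counts in totals.items():
        total = sum(counts.values())
        weights[topic] = {
            term: (n / total) * math.log(n_topics / topic_freq[term])
            for term, n in counts.items()
        }
    return weights


# Hypothetical vocabulary, documents, and topic assignments.
vocabulary = ["EDG", "emergency diesel generator", "dose", "failure"]
docs = [
    "The EDG failure occurred. The edg was restored.",
    "Emergency Diesel Generator failure during test.",
    "Worker dose exceeded limits after a monitor failure.",
]
topics = [1, 1, 0]

totals = topic_counts(docs, topics, vocabulary)
weights = topic_tfidf(totals)
```

In this toy data "failure" appears in both topics, so its tf-idf weight drops to zero while its raw count stays high. Either dictionary can then serve as the word frequencies for a word cloud.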
Topic Modeling Configuration
- Using topics learned from inspection findings item introduction summaries generated by the pegasus-cnn-dailymail model
- BERTopic configuration:
  - all-MiniLM-L6-v2
  - 15 neighbors, 5 components
  - min cluster size 20
  - n-grams range of 1-3
  - MMR (diversity = 0.6)
- Three different ways of learning topic representations compared to BERTopic:
  - Using vocabulary of 1004 abbreviations and full-forms + 407 failure modes for topic representation
  - Using 66,325 key words/phrases automatically extracted from item introductions for topic representation
  - Using both the vocabulary and the extracted key words/phrases (67,402) for topic representation
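The configuration listed above could be wired into BERTopic roughly as follows. The mapping of "neighbors" and "components" to UMAP and of "min cluster size" to HDBSCAN is an assumption based on BERTopic's usual pipeline. `min_dist`, `metric`, and `prediction_data` are not given in the slides and are also assumptions, as are the `CONFIG` and `build_topic_model` names.

```python
# Values taken from the slide; the dictionary keys are illustrative names.
CONFIG = {
    "embedding_model": "all-MiniLM-L6-v2",
    "n_neighbors": 15,
    "n_components": 5,
    "min_cluster_size": 20,
    "ngram_range": (1, 3),
    "mmr_diversity": 0.6,
}


def build_topic_model(config=CONFIG):
    """Assemble a BERTopic model from the configuration above."""
    # Imported lazily so the configuration can be inspected without the
    # heavy dependencies installed.
    from bertopic import BERTopic
    from bertopic.representation import MaximalMarginalRelevance
    from hdbscan import HDBSCAN
    from sentence_transformers import SentenceTransformer
    from sklearn.feature_extraction.text import CountVectorizer
    from umap import UMAP

    return BERTopic(
        embedding_model=SentenceTransformer(config["embedding_model"]),
        umap_model=UMAP(
            n_neighbors=config["n_neighbors"],
            n_components=config["n_components"],
            min_dist=0.0,      # assumption, not stated in the slides
            metric="cosine",   # assumption, not stated in the slides
        ),
        hdbscan_model=HDBSCAN(
            min_cluster_size=config["min_cluster_size"],
            prediction_data=True,  # assumption, not stated in the slides
        ),
        vectorizer_model=CountVectorizer(ngram_range=config["ngram_range"]),
        representation_model=MaximalMarginalRelevance(
            diversity=config["mmr_diversity"]
        ),
    )
```

Calling `build_topic_model().fit_transform(summaries)` on a list of summary strings would then return the topic assignment for each document along with the fitted probabilities.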
Topic 0 - Radiation (879 documents)
Topic Terms from BERTopic: dose - overexposure - area - rates - workers - occupational radiation safety - rem - safety cornerstone - exposure - ability assess
[Word clouds: String Matches across Documents in the Topic, using counts and using TF-IDF representation as weights for font size. Shown for three vocabularies: Vocab (abbreviations, full-forms, failure modes); Key Words/Phrases extracted from KeyBERT; Vocab + Key Words/Phrases extracted from KeyBERT]
Topic 1 - Emergency Diesel Generator (763 documents)
Topic Terms from BERTopic: emergency diesel generator - edg - emergency - fuel - generators - transfer - generator fuel oil - division - air start - oil storage
[Word clouds: String Matches across Documents in the Topic, using counts and using TF-IDF representation as weights for font size. Shown for three vocabularies: Vocab (abbreviations, full-forms, failure modes); Key Words/Phrases extracted from KeyBERT; Vocab + Key Words/Phrases extracted from KeyBERT]
Topic 4 - Auxiliary Feedwater Pump (349 documents)
Topic Terms from BERTopic: auxiliary feedwater pump - feedwater - auxiliary - turbine driven auxiliary - tdafw - service - motor driven - service water pump - esw pump - bearing oil
[Word clouds: String Matches across Documents in the Topic, using counts and using TF-IDF representation as weights for font size. Shown for three vocabularies: Vocab (abbreviations, full-forms, failure modes); Key Words/Phrases extracted from KeyBERT; Vocab + Key Words/Phrases extracted from KeyBERT]
Takeaways and Next Steps
- The vocabulary of abbreviations, full-forms and failure modes is not comprehensive enough to accurately represent the smaller, more specific topics
  - Reactor structures, systems, and components can be added to the vocabulary for more coverage
- Key phrase extraction works well at the document level to find text spans of interest, but at the topic level there are not enough common phrases to give useful insights
  - Custom stop word removal at the topic level will reduce the appearance of unhelpful terms/phrases
- The topic model and its steps have not been extensively tuned yet, which can affect results
  - Embedding model (fine-tuned on domain data)
  - Embedding reduction and clustering tuned together
  - Use extracted key phrases from each item introduction as the input to topic modeling to form more cohesive, focused clusters
Further Investigation
Based on our research and analysis in the three phases of this project, we have identified 5 topics that warrant further investigation and prototyping. Each of these initiatives could be pursued independently and would be similar in size to the current project. Additional details on selected topics will be provided.
Cluster Representation - Names and Descriptions
Provide analysts with better insights into clustered safety issues.
- Custom defined named entity recognition + customized pattern matching
- Topic modeling by category
- Text similarity of input between documents and inspection procedures
Safety Event Alerting
Inform analysts of potential safety events.
- Text classification: use inspection reports to train a system that can classify events and LERs into cornerstones + cross-cutting areas
Document Gisting Tool to Accelerate Analysis
Assist analysts in understanding large volumes of documents quickly.
- Machine generated summaries
- Quick understanding of numerous documents
- Inspection Reports, LERs, and others
Dynamic Analysis and Discovery
Enable analysts to explore and understand safety issues quickly.
- Software and storage for dynamic pivots, analyst directed queries
- Find sites with similar connection structures
- Dynamic topic modeling to see how safety clusters change through time
Safety Cluster Tuning - a
Guided cluster discovery using custom vocabularies.
- Influence cluster formation with NRC specified safety terms and concepts
- Use supervised or semi-supervised methods to discover clusters rather than relying on fully unsupervised approaches
Safety Cluster Tuning - b
Fine-tuning a pre-trained model or a custom trained model.
- Enhance existing cluster models with NRC specific language
- Enable text summarization or question-answering
Safety Cluster Tuning - c
Pre-training a language model with NRC text.
- SOTA language models suited for scientific/engineering text
- Use regulations, manuals, inspection procedures, licensee event notifications + reports, inspection reports, part 21 reports and more
Tool Model
Progress
SOW Task Status

Phase I: March 6, 2023 - April 9, 2023

| Task | Status |
|---|---|
| Describe the Problem | Complete |
| Search the Literature | Complete |
| Select Candidates | Complete |
| Select Evaluation Factors | Complete |
| Develop evaluation factor weights | Complete |
| Define evaluation factor ranges | Complete |
| Perform assessment | Complete |
| Report Results | Complete |
| Deliver Trade study report | Complete |

Phase II: March 20, 2023 - May 7, 2023

| Task | Status |
|---|---|
| Platform/system selection and installation | Complete |
| Data acquisition and preparation | Complete |
| Feature pipeline engineering | Complete |
| Clustering method experimentation & selection | Complete |
| Cluster pipeline engineering | Complete |
| Anomaly detection (as needed) | Not needed |
| Model Development, Training, Evaluation | Complete |
| Test harness development | Complete |
| PoC integration and demonstration | Complete |
| Trial runs and evaluation | Complete |
| Demonstrate PoC capability | Complete |

Phase III: April 19, 2023 - June 16, 2023

| Task | Status |
|---|---|
| Live data ingestion | In progress |
| Model execution | In progress |
| Cluster evaluation | In progress |
| Critical Method documentation | Not started |
| Technical Report Document | Not started |
| Deliver final report with findings | Not started |
Next Steps
Cluster Representations
- Continue implementing string matching approaches to find effective cluster descriptions
Metrics
- Assess clusters formed using full Item Introductions vs. various machine generated summaries
Deliverables
- Consolidate experiments, inputs, and outputs to deliverable format