ML23262B192
| ML23262B192 | |
| Person / Time | |
|---|---|
| Issue date: | 05/31/2023 |
| From: | Chang Y, Friedman C, Mishkin A, Polra S, Pringle S, Tanya Smith, Vasquez G NRC/RES/DRA/HFRB, Sphere of Influence |
| To: | |
| References | |
Machine Learning Demo Wednesday
Prioritizing Inspections using ML
Wednesday, May 31, 2023
Alec Mishkin, Guillermo Vasquez, Stuti Polra, Casey Friedman, Scott Pringle, Theresa Smith
Agenda
- Varying Inputs and Topic Representations
- Stopword Removal
  - Input text
  - Topic Representation
- Outlier Reduction Techniques
- Topic Modeling Metrics Applied to Experiments
- Progress
Varying Inputs and Topic Representations
Varying Inputs and Topic Representations for Cluster Formation
- 1. Topic Modeling Input (3)
  - Item Introduction
  - Item Introduction Summary (Pegasus_cnn_dailymail model)
  - Item Introduction Key Phrases (KeyphraseVectorizer + Guided KeyBERT with custom vocab of 1411 abbreviations, full forms and failure modes)
- 2. Topic Modeling Parameters (4)
  - all-MiniLM-L6-v2
  - 15 neighbors, 5 components
  - min cluster size 10, 20, 40, 60
  - n-grams range of 1-3
- 3. Topic Representation (5)
  - BERTopic MMR: MMR (diversity = 0.6)
  - BERTopic MMR + POS: MMR (diversity = 0.6) + POS (NOUN, PROPN, ADJ-NOUN, ADJ-PROPN)
  - Vocabulary: 1411 abbreviations + full forms + failure modes
  - Key Phrases: 66,325 words/phrases extracted from Item Introductions using KeyphraseVectorizer + Guided KeyBERT with vocab of 1411 abbreviations, full forms and failure modes
  - Vocabulary + Key Phrases (67,402)
  - TF-IDF, Counts
    - String matching on full item-intros in each topic cluster
    - TF-IDF on input text in each topic cluster
- 3 inputs x 4 cluster sizes = 12 experiments
- 2 BERTopic representations for all 12 experiments
- 3 custom representations for all 12 experiments
- Stopword removal at input level and at topic representation level
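The MMR (diversity = 0.6) setting can be illustrated with a small sketch. It assumes the standard Maximal Marginal Relevance formulation, in which each candidate term is scored by its similarity to the topic minus its similarity to terms already selected, weighted by the diversity value. The slides do not show the implementation, so the function names and the toy 2-D vectors below are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mmr(topic_vec, cand_vecs, top_n=3, diversity=0.6):
    """Return indices of candidates chosen by Maximal Marginal Relevance."""
    relevance = [cosine(topic_vec, v) for v in cand_vecs]
    # Start with the candidate most similar to the topic vector.
    selected = [max(range(len(cand_vecs)), key=relevance.__getitem__)]
    while len(selected) < min(top_n, len(cand_vecs)):
        best, best_score = None, -math.inf
        for i in range(len(cand_vecs)):
            if i in selected:
                continue
            # Redundancy: similarity to the closest already-selected term.
            redundancy = max(cosine(cand_vecs[i], cand_vecs[j]) for j in selected)
            score = (1 - diversity) * relevance[i] - diversity * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
    return selected

# Toy vectors, made up for illustration: candidates 0 and 1 are near-duplicates.
topic = [1.0, 0.0]
candidates = [[1.0, 0.0], [0.99, 0.1], [0.6, 0.8]]
diverse_pick = mmr(topic, candidates, top_n=2, diversity=0.6)
relevant_pick = mmr(topic, candidates, top_n=2, diversity=0.0)
print(diverse_pick, relevant_pick)
```

With diversity = 0.6 the near-duplicate candidate is skipped in favor of the less similar one, while diversity = 0 reduces to ranking purely by relevance.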
Stopword Removal
Input Text
- Reactor site names and codes
- Reactor parent company names
- City names and State abbreviations
- Both standardized names and short-form variations of site and parent company names are removed
- Leaving these in the input text affects the topics that are found: documents discussing different safety issues are clustered on the common site name (Palo Verde, Diablo Canyon, Wolf Creek) or parent company name (Entergy, Exelon, Dominion, PSEG)
- Removing these names helps us find the true underlying safety clusters
- Discovered safety clusters can still be viewed by the site and parent company categories
Topic Representation
- Curated list of 117 stopwords specific to NRC text that don't help in identifying safety issues
  - Ex: safety, reactor, inspection, inspectors, licensee, 10 CFR, ASME, cornerstone, cross-cutting area, significance, appendix, criterion
- Singular words from Cornerstones and Cross-cutting Areas
  - public, cross-cutting, events, integrity, systems, performance, mitigating, emergency, resolution, safety, occupational, preparedness, problem, work, barrier, supplemental, environment, human, identification, initiating, aspects, conscious
- Full phrases are retained; cornerstone attributes and cross-cutting aspects are also retained
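The two levels of stopword removal described above can be sketched as follows. This is a minimal illustration, not the team's pipeline: the lists are abbreviated and hypothetical (the curated lists are not reproduced in the slides), and the helper names `strip_names` and `filter_terms` are made up. Names are stripped from the input text, while the representation-level filter drops only terms that exactly match a stopword, so full phrases survive.

```python
import re

# Hypothetical, abbreviated lists for illustration only.
SITE_AND_COMPANY_NAMES = ["Palo Verde", "Diablo Canyon", "Wolf Creek",
                          "Entergy", "Exelon", "Dominion", "PSEG"]
REPRESENTATION_STOPWORDS = {"safety", "reactor", "inspection", "licensee",
                            "mitigating", "systems"}

def strip_names(text, names=SITE_AND_COMPANY_NAMES):
    """Input level: remove site and parent company names from the text."""
    # Longest names first so multi-word names are matched as a whole.
    ordered = sorted(names, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, ordered)) + r")\b",
                         re.IGNORECASE)
    return re.sub(r"\s+", " ", pattern.sub(" ", text)).strip()

def filter_terms(terms, stopwords=REPRESENTATION_STOPWORDS):
    """Representation level: drop terms that are exactly a stopword."""
    return [t for t in terms if t.lower() not in stopwords]

cleaned = strip_names(
    "Inspectors at Wolf Creek found that the licensee failed to test a fire pump."
)
kept = filter_terms(["safety", "fire pump", "licensee", "mitigating systems", "mitigating"])
print(cleaned)
print(kept)
```

Exact matching is what keeps a full phrase such as "mitigating systems" while the single words "mitigating" and "systems" are removed.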
Stopword Removal (Without Stopword Removal)
Note: topics are not from the same experiment run, but are from runs with comparable model parameters (min cluster size 20, MMR BERTopic representation)
Topic 8, Topic 10, Topic 11, Topic 35, Topic 48, Topic 51, Topic 61
Stopword Removal (With Stopword Removal)
Topic 6, Topic 8, Topic 11, Topic 16, Topic 30, Topic 59, Topic 69
Stopword Removal (With Stopword Removal)
- Some stopwords can still be removed to improve topic representations
- Safety clusters are no longer forming on site name or parent company name commonalities
- Generic reactor-safety related terms and inspection findings boilerplate words are appearing less often
Stopword Removal (With Stopword Removal)
- Safety clusters are no longer forming on site name or parent company name commonalities
- Generic reactor-safety related terms and inspection findings boilerplate words are appearing less often
Outlier Reduction
Outlier Reduction Techniques
- 1) Topic Probability: soft-clustering from HDBSCAN is used to find the best matching topic for each outlier document
- 2) Topic Distribution: find the most frequent topic discussed in the outlier document, and assign that topic to the outlier document
  - A sliding window is applied to the document, the c-TF-IDF of each window is computed and compared to existing topics, and the similarities of each window to the topics are summed to create a topic distribution for the whole document
- 3) c-TF-IDF: find the most similar c-TF-IDF topic representation to the c-TF-IDF representation of the outlier document, and assign that topic to the outlier document
- 4) Embedding: find the most similar topic embedding to the outlier document's embedding, and assign that topic to the outlier document
- Threshold: probability or distance used to control how many outliers are assigned to topics vs. kept as outliers; keeping the default value of 0 for now
- Topic representations can be left unchanged after adding the outlier documents to existing topics, but we recompute the topic representations after adding outlier documents so that metrics can be calculated before and after outlier assignment
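The Embedding technique and the role of the threshold can be sketched in a few lines. The four techniques correspond to the strategies offered by BERTopic's `reduce_outliers` method, but the sketch below does not call that library: it is a standalone illustration that assumes cosine similarity, uses -1 as the outlier label (the HDBSCAN convention), and uses made-up 2-D vectors and a hypothetical function name.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def assign_outliers(doc_embeddings, topics, topic_embeddings, threshold=0.0):
    """Reassign documents labelled -1 to the most similar topic embedding.

    topic_embeddings maps topic id -> vector. An outlier is reassigned only
    if its best similarity is at least `threshold`; otherwise it stays -1.
    """
    new_topics = list(topics)
    for i, (vec, topic) in enumerate(zip(doc_embeddings, topics)):
        if topic != -1:
            continue  # already assigned by the clustering step
        best_topic, best_sim = max(
            ((t, cosine(vec, emb)) for t, emb in topic_embeddings.items()),
            key=lambda pair: pair[1],
        )
        if best_sim >= threshold:
            new_topics[i] = best_topic
    return new_topics

# Toy data, made up for illustration.
topic_embeddings = {0: [1.0, 0.0], 1: [0.0, 1.0]}
doc_embeddings = [[0.9, 0.1], [0.2, 0.9], [-1.0, -1.0]]
topics = [0, -1, -1]

loose = assign_outliers(doc_embeddings, topics, topic_embeddings, threshold=0.0)
strict = assign_outliers(doc_embeddings, topics, topic_embeddings, threshold=0.99)
print(loose, strict)
```

A threshold of 0 reassigns nearly every outlier, which is consistent with the decision to recompute topic representations afterwards; raising it keeps weakly matching documents as outliers.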
Outlier Reduction (Item Introduction)
Topic 52 (top 20 terms)
provisions approved protection - license condition - provisions approved - protection implement - license - approved protection - effect provisions approved - provisions - effect provisions - facility operating license - operating license - transient combustible - facility operating - protection specifically - approved protection specifically - license condition protection - identified facility operating - maintain effect provisions - combustible - condition protection - protection
Topic 52 (top 20 terms) after Embedding Outlier Assignment
protection - condition - operability - failed - procedure - entered - procedures - 0609 - specifically - manual - required - technical - chapter - determined - manual chapter - determination - ensure - equipment - manual chapter 0609
Outlier Reduction (Item Introduction Summary)
Topic 44 (top 20 terms)
team identified technical - identified technical - technical procedures - technical failed establish - identified technical procedures - identified noncited technical - identified technical failed - technical failed - establish adequate procedures - technical - team - team identified - adequate procedures ensure - technical procedures implement - noncited technical - annunciator - nature team identified - audit process applicability - audit process - adequate procedures severe
Topic 44 (top 20 terms) after Embedding Outlier Assignment
procedure - procedures - quality - technical - implement - team - determined - personnel - maintenance - follow - adequate - identified - process - entergy - identified technical - instructions - procedure quality - team identified - technical procedures - operations
Outlier Reduction (Item Introduction Key Phrases)
Topic 64 (top 20 terms)
brigade - low protection - tables - detection - detectors - license condition - extinguishers - license - post incident - strategy plan - chemicals - compensatory - adequate warnings - low brigade response - hazards - low protection procedures - qcnps - extinguisher - areas - low brigade - warnings
Topic 64 (top 20 terms) after Embedding Outlier Assignment
brigade - license - co2 - license condition - suppression - operating license - detection - protection - hose - doors - room - facility operating license - door - facility operating - areas - detectors - compensatory - low protection - hose stations - consequences
Coherence Metric Verification based on inputs
Coherence Metric Verification
- Last week we provided coherence results for different inputs. For the majority of experiments, the introduction as input had the highest coherence. An example of this is presented below on the right
- Guillermo has rated 30 different topics from three different inputs (Introduction, Key Phrases, Summary) as good, intermediate, or bad in coherence
- Both our coherence metric and Guillermo's results show the introduction is the best input
- According to Guillermo, the summary is a slightly worse input than key phrases, which contradicts our results
Outlier Reduction with Metrics
Coherence of outlier reductions
Outlier reduction is an important step to improving the practicality of topic modelling
Of course, there is always the concern that outlier reduction will decrease the quality of the topics
Below we have plotted the coherence metric for the mmr-pos representation produced when using introduction as input and key phrases as input (top 40 words have been used)
[Figure: Coherence Metric (closer to 0 is better), axis from -3 to 0, for each outlier reduction technique (None, CTIFD, EMBD, DIST, PROB); series: Intro only, Key Phrase]
Outlier reduction does increase the overall coherence of the topics
Based on these results, C-TF-IDF or Topic probability seem like the best methods for outlier reduction
Generally, the effect of outlier reduction on the overall coherence seems negligible
The next slide contains the results from the diversity metric
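The slides do not say which coherence metric was used. One common measure whose scores are typically negative, with values closer to 0 indicating more coherent topics, is UMass coherence, which compares how often pairs of top words co-occur in documents. The sketch below assumes that metric purely for illustration; the function name, the whitespace tokenization, and the toy documents are all hypothetical.

```python
import math
from itertools import combinations

def umass_coherence(top_words, documents):
    """Average UMass-style coherence over all ordered pairs of top words.

    top_words must be ordered from most to least important. For each pair
    (earlier word w_l, later word w_m) the score is
    log((D(w_m, w_l) + 1) / D(w_l)), where D counts documents.
    """
    doc_sets = [set(doc.lower().split()) for doc in documents]

    def doc_freq(*words):
        # Number of documents containing every word in `words`.
        return sum(all(w in d for w in words) for d in doc_sets)

    scores = []
    for l, m in combinations(range(len(top_words)), 2):
        w_l, w_m = top_words[l], top_words[m]
        d_l = doc_freq(w_l)
        if d_l == 0:
            continue  # skip words that never occur in the corpus
        scores.append(math.log((doc_freq(w_m, w_l) + 1) / d_l))
    return sum(scores) / len(scores) if scores else float("nan")

# Toy corpus, made up for illustration.
docs = ["fire brigade drill", "fire brigade response",
        "fire pump test", "valve leak test"]
coherent = umass_coherence(["fire", "brigade"], docs)
incoherent = umass_coherence(["fire", "valve"], docs)
print(coherent, incoherent)
```

Words that appear together often score near 0, while words that never share a document are pushed toward large negative values, which matches the "closer to 0 is better" reading of the chart.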
Diversity of outlier reduction
Below are the pairwise word embedding distance calculations from the mmr-pos representation (higher means more diverse)
Unfortunately, due to time constraints, only the top 40 topics and the top 10 words from each topic could be used. For completeness, we have started a much longer run.
Generally, outlier reduction decreases the diversity of the topics. The reduction is small, though not negligible.
The outlier reduction technique that decreases diversity the least for both inputs is C-TF-IDF
[Figure: Pair-Wise embedding Distance, axis from 0.786 to 0.802, for each outlier reduction technique (None, CTIFD, EMBD, DIST, PROB); series: Intro only, Key Phrase]
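A pairwise embedding distance of this kind can be sketched as the mean distance over all pairs of top-word embeddings. The slides do not state which distance was used or how topics were aggregated, so the sketch assumes cosine distance and a single set of words; the function names and the toy vectors are hypothetical.

```python
import math
from itertools import combinations

def cosine_distance(a, b):
    """1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - (dot / norm if norm else 0.0)

def pairwise_embedding_distance(word_vectors):
    """Mean cosine distance over all pairs of word vectors (higher = more diverse)."""
    pairs = list(combinations(word_vectors, 2))
    if not pairs:
        raise ValueError("need at least two word vectors")
    return sum(cosine_distance(a, b) for a, b in pairs) / len(pairs)

# Toy 2-D vectors, made up for illustration.
redundant_words = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
varied_words = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
low = pairwise_embedding_distance(redundant_words)
high = pairwise_embedding_distance(varied_words)
print(low, high)
```

Because the number of pairs grows quadratically with the number of words, restricting the calculation to the top 10 words per topic keeps the run time manageable.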
Progress
SOW Task Status

Phase I: March 6, 2023 - April 9, 2023

| Task | Status |
|---|---|
| Describe the Problem | Complete |
| Search the Literature | Complete |
| Select Candidates | Complete |
| Select Evaluation Factors | Complete |
| Develop evaluation factor weights | Complete |
| Define evaluation factor ranges | Complete |
| Perform assessment | Complete |
| Report Results | Complete |
| Deliver Trade study report | Complete |

Phase II: March 20, 2023 - May 7, 2023

| Task | Status |
|---|---|
| Platform/system selection and installation | Complete |
| Data acquisition and preparation | Complete |
| Feature pipeline engineering | Complete |
| Clustering method experimentation & selection | Complete |
| Cluster pipeline engineering | Complete |
| Anomaly detection (as needed) | Not needed |
| Model Development, Training, Evaluation | Complete |
| Test harness development | Complete |
| PoC integration and demonstration | Complete |
| Trial runs and evaluation | Complete |
| Demonstrate PoC capability | Complete |

Phase III: April 19, 2023 - June 16, 2023

| Task | Status |
|---|---|
| Live data ingestion | In progress |
| Model execution | In progress |
| Cluster evaluation | In progress |
| Critical Method documentation | In progress |
| Technical Report Document | In progress |
| Deliver final report with findings | Not started |