ML23262B192

Meeting Slides 20230531-final
Person / Time
Issue date: 05/31/2023
From: Chang Y, Friedman C, Mishkin A, Polra S, Pringle S, Tanya Smith, Vasquez G
NRC/RES/DRA/HFRB, Sphere of Influence

Machine Learning Demo Wednesday
Prioritizing Inspections using ML
Alec Mishkin, Guillermo Vasquez, Stuti Polra, Casey Friedman, Scott Pringle, Theresa Smith
Wednesday, May 31, 2023

Agenda

  • Varying Inputs and Topic Representations
  • Stopword Removal
    - Input text
    - Topic Representation
  • Outlier Reduction Techniques
  • Topic Modeling Metrics Applied to Experiments
  • Progress

Varying Inputs and Topic Representations

Varying Inputs and Topic Representations for Cluster Formation

  • 3 inputs x 4 cluster sizes = 12 experiments
  • 2 BERTopic representations for all 12 experiments
  • 3 custom representations for all 12 experiments
  • Stopword removal at input level and at topic representation level

1. Topic Modeling Input (3)

  • Item Introduction
  • Item Introduction Summary (Pegasus_cnn_dailymail model)
  • Item Introduction Key Phrases (KeyphraseVectorizer + Guided KeyBERT with custom vocab of 1411 abbreviations, full forms and failure modes)

2. Topic Modeling Parameters (4)

  • min cluster size 10, 20, 40, 60
  • n-grams range of 1-3
  • 15 neighbors, 5 components
  • all-MiniLM-L6-v2

3. Topic Representation (5)

TF-IDF on input text in each topic cluster:

  • BERTopic MMR: MMR (diversity = 0.6)
  • BERTopic MMR + POS: MMR (diversity = 0.6) + POS (NOUN, PROPN, ADJ-NOUN, ADJ-PROPN)

TF-IDF, Counts: String matching on full item intros in each topic cluster:

  • Vocabulary: 1411 abbreviations + full forms + failure modes
  • Key Phrases: 66,325 words/phrases extracted from Item Introductions using KeyphraseVectorizer + Guided KeyBERT with vocab of 1411 abbreviations, full forms and failure modes
  • Vocabulary + Key Phrases (67,402)
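The experiment count on this slide can be checked with a short enumeration. The sketch below is illustrative only: the values come from the slide, but the identifier names and dictionary keys (for example `n_neighbors` and `n_components`) are assumptions, not names taken from the project's code.

```python
from itertools import product

# Values from the slide; the identifier spellings are illustrative assumptions.
INPUTS = [
    "item_introduction",
    "item_introduction_summary",
    "item_introduction_key_phrases",
]
MIN_CLUSTER_SIZES = [10, 20, 40, 60]
REPRESENTATIONS = [
    "bertopic_mmr",
    "bertopic_mmr_pos",
    "custom_vocabulary",
    "custom_key_phrases",
    "custom_vocabulary_plus_key_phrases",
]

# Settings the slide lists once, assumed here to be fixed across all runs.
FIXED = {
    "n_neighbors": 15,
    "n_components": 5,
    "ngram_range": (1, 3),
    "embedding_model": "all-MiniLM-L6-v2",
}


def build_experiments():
    """One clustering run per (input, min cluster size) pair."""
    return [
        {"input": text_input, "min_cluster_size": size, **FIXED}
        for text_input, size in product(INPUTS, MIN_CLUSTER_SIZES)
    ]


def build_representation_runs(experiments):
    """Pair every clustering run with every topic representation."""
    return [
        {**exp, "representation": rep}
        for exp, rep in product(experiments, REPRESENTATIONS)
    ]


experiments = build_experiments()
runs = build_representation_runs(experiments)
print(len(experiments), len(runs))  # 12 60
```

Keeping the clustering grid separate from the representation list mirrors the slide: the 12 clustering runs are computed once, and each of the 5 representations is then derived for every run.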

Stopword Removal

Input Text

  • Reactor site names and codes
  • Reactor parent company names
  • City names and State abbreviations
  • Both standardized names and short-form variations of sites and parent company names are removed
  • Leaving these in the input text affects the topics that are found: documents discussing different safety issues are clustered on the common site name (Palo Verde, Diablo Canyon, Wolf Creek) or parent company name (Entergy, Exelon, Dominion, PSEG)
  • Removing these names helps us find the true underlying safety clusters
  • Discovered safety clusters can still be viewed by the site and parent company categories

Topic Representation

  • Curated list of 117 stopwords specific to NRC text that don't help in identifying safety issues
    - Ex: safety, reactor, inspection, inspectors, licensee, 10 CFR, ASME, cornerstone, cross-cutting area, significance, appendix, criterion
  • Singular words from Cornerstones and Cross-cutting Areas
    - public, cross-cutting, events, integrity, systems, performance, mitigating, emergency, resolution, safety, occupational, preparedness, problem, work, barrier, supplemental, environment, human, identification, initiating, aspects, conscious
    - Full phrases retained; cornerstone attributes and cross-cutting aspects also retained
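The two removal levels described above can be sketched as follows. This is a minimal illustration, not the project's implementation: the name list and stopword set are small subsets of the examples on the slide, and the function names are hypothetical.

```python
import re

# Illustrative subsets only; the full site/company lists and the curated
# 117-word stopword list are not reproduced in the slides.
SITE_AND_COMPANY_NAMES = [
    "Palo Verde", "Diablo Canyon", "Wolf Creek",
    "Entergy", "Exelon", "Dominion", "PSEG",
]
REPRESENTATION_STOPWORDS = {
    "safety", "reactor", "inspection", "inspectors", "licensee", "significance",
}


def remove_names(text, names=SITE_AND_COMPANY_NAMES):
    """Input-level step: delete site and parent company names, ignoring case."""
    # Longest names first so a multi-word name wins over any shorter overlap.
    ordered = sorted(names, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(n) for n in ordered) + r")\b",
        re.IGNORECASE,
    )
    cleaned = pattern.sub(" ", text)
    return re.sub(r"\s+", " ", cleaned).strip()


def filter_terms(terms, stopwords=REPRESENTATION_STOPWORDS):
    """Representation-level step: drop single-word stopwords, keep full phrases."""
    return [t for t in terms if t.lower() not in stopwords]


print(remove_names("Inspectors at Palo Verde identified a degraded fire barrier."))
print(filter_terms(["safety", "fire barrier", "Reactor", "safety culture"]))
```

Because `filter_terms` compares whole terms against the stopword set, a phrase such as "safety culture" survives even though "safety" alone is dropped, which matches the slide's note that full phrases are retained.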

Stopword Removal

Note: topics are not from the same experiment run, but are from runs with comparable model parameters (min cluster size 20, MMR BERTopic representation)

Without stopword removal (topics shown): Topic 8, Topic 10, Topic 11, Topic 35, Topic 48, Topic 51, Topic 61

With stopword removal (topics shown): Topic 6, Topic 8, Topic 11, Topic 16, Topic 30, Topic 59, Topic 69

With stopword removal

  • Safety clusters are no longer forming on site name or parent company name commonalities
  • Generic reactor-safety related terms and inspection findings boilerplate words are appearing less often

Some stopwords can still be removed to improve topic representations


Outlier Reduction

Outlier Reduction Techniques

1) Topic Probability

- Soft-clustering from HDBSCAN to find the best matching topic for each outlier document

2) Topic Distribution

- Find the most frequent topic discussed in the outlier document, and assign that topic to the outlier document

> A sliding window is applied to the document; the c-TF-IDF of each window is computed and compared to existing topics, and the similarities of each window to the topics are summed to create a topic distribution for the whole document

3) c-TF-IDF

- Find the most similar c-TF-IDF topic representation to the c-TF-IDF representation of the outlier document and assign that topic to the outlier document

4) Embedding

- Find the most similar topic embedding to the outlier document's embedding and assign that topic to the outlier document

Threshold: probability or distance to control how many outliers are assigned to topics vs kept as outliers

- keeping default value of 0 for now

Topic representations can be left unchanged after adding the outlier documents to existing topics, but we recompute the topic representations after adding outlier documents so that metrics can be calculated before and after outlier assignment
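The embedding strategy and the role of the threshold can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the library routine used in the experiments: outliers are assumed to carry the label -1 (the HDBSCAN noise label), similarity is assumed to be cosine similarity, and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def assign_outliers(doc_topics, doc_embeddings, topic_embeddings, threshold=0.0):
    """Reassign outlier documents (topic -1) to the most similar topic embedding.

    An outlier keeps the label -1 unless its best similarity exceeds `threshold`.
    """
    new_topics = []
    for topic, emb in zip(doc_topics, doc_embeddings):
        if topic != -1:
            new_topics.append(topic)  # already clustered, leave unchanged
            continue
        best_topic, best_sim = -1, threshold
        for topic_id, topic_emb in topic_embeddings.items():
            sim = cosine_similarity(emb, topic_emb)
            if sim > best_sim:
                best_topic, best_sim = topic_id, sim
        new_topics.append(best_topic)
    return new_topics


# Toy 2-D embeddings, purely illustrative.
topic_embeddings = {0: [1.0, 0.0], 1: [0.0, 1.0]}
doc_topics = [0, -1, -1, 1]
doc_embeddings = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.7], [0.0, 1.0]]

print(assign_outliers(doc_topics, doc_embeddings, topic_embeddings))        # [0, 0, 1, 1]
print(assign_outliers(doc_topics, doc_embeddings, topic_embeddings, 0.98))  # [0, -1, 1, 1]
```

With the threshold at 0 both toy outliers are absorbed into a topic; raising it to 0.98 leaves the weaker match as an outlier, which is the trade-off the threshold controls.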

Outlier Reduction (Item Introduction)

Topic 52 (top 20 terms) after Embedding Topic 52 (top 20 terms) Outlier Assignment provisions approved protection - license protection - condition - operability - failed -

condition - provisions approved - protection procedure - entered - procedures - 0609 -

implement - license - approved protection - specifically - manual - required - technical -

effect provisions approved - provisions - effect chapter - determined - manual chapter -

provisions - facility operating license - operating determination - ensure - equipment - manual license - transient combustible - facility chapter 0609 operating - protection specifically - approved protection specifically - license condition protection - identified facility operating -

maintain effect provisions - combustible -

condition protection - protection

Outlier Reduction (Item Introduction Summary)

Topic 44 (top 20 terms) after Embedding Topic 44 (top 20 terms) Outlier Assignment team identified technical - identified technical - procedure - procedures - quality - technical -

technical procedures - technical failed establish implement - team - determined - personnel -

- identified technical procedures - identified maintenance - follow - adequate - identified -

noncited technical - identified technical failed - process - entergy - identified technical -

technical failed - establish adequate procedures instructions - procedure quality - team identified

- technical - team - team identified - adequate - technical procedures - operations procedures ensure - technical procedures implement - noncited technical - annunciator -

nature team identified - audit process applicability - audit process - adequate procedures severe

Outlier Reduction (Item Introduction Key Phrases)

Topic 64 (top 20 terms) after Embedding Topic 64 (top 20 terms) Outlier Assignment brigade - low protection - tables - detection - brigade - license - co2 - license condition -

detectors - license condition - extinguishers - suppression - operating license - detection -

license - post incident - strategy plan - protection - hose - doors - room - facility chemicals - compensatory - adequate warnings operating license - door - facility operating -

- low brigade response - hazards - low areas - detectors - compensatory - low protection procedures - qcnps - extinguisher - protection - hose stations - consequences -

areas - low brigade - warnings

Coherence Metric Verification based on inputs

Coherence Metric Verification

  • Last week we provided coherence results for different inputs. For the majority of experiments, introduction as input had the highest coherence. An example of this is presented below on the right
  • Guillermo has classified 30 different topics from three different inputs (Introduction, Key Phrases, Summary) as good, intermediate, or bad in coherence
  • Both our coherence metric and Guillermo's results show the introduction is the best input.
  • According to Guillermo, the summary is a slightly worse input than key phrases, which contradicts our work

Outlier Reduction with Metrics

Coherence of outlier reductions

  • Outlier reduction is an important step to improving the practicality of topic modelling
  • Of course, there is always the concern that outlier reduction will decrease the quality of the topics
  • Below we have plotted the coherence metric for the mmr-pos representation produced when using introduction as input and key phrases as input (top 40 words have been used)
  • Outlier reduction does increase the overall coherence of the topics
  • Based on these results, C-TF-IDF or probability seem like the best methods for outlier reduction
  • Generally, the effect of outlier reduction on the overall coherence seems negligible
  • The next slide contains the results from the diversity metric

Figure: Topic Coherence Metric (closer to 0 is better), y-axis from -3 to 0, by outlier reduction technique (None, CTFIDF, EMBD, DIST, PROB) for the Intro only and Key Phrase inputs
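The slides do not name the coherence metric that was plotted. One widely used metric whose scores are typically negative and improve toward 0 is UMass coherence; the sketch below assumes that metric purely for illustration, averages the pair scores rather than summing them, and uses a hypothetical function name and toy documents.

```python
import math
from itertools import combinations


def umass_coherence(top_words, documents, epsilon=1.0):
    """Average UMass-style score over the word pairs of one topic.

    top_words: topic words, most important first.
    documents: iterable of token lists used as the reference corpus.
    Each pair contributes log((D(later, earlier) + epsilon) / D(earlier)),
    where D counts the documents containing all of the given words.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        return sum(1 for d in doc_sets if all(w in d for w in words))

    scores = []
    for i, j in combinations(range(len(top_words)), 2):
        earlier, later = top_words[i], top_words[j]
        denom = doc_freq(earlier)
        if denom == 0:
            continue  # word never occurs in the reference documents
        scores.append(math.log((doc_freq(later, earlier) + epsilon) / denom))
    return sum(scores) / len(scores) if scores else float("nan")


# Toy corpus, purely illustrative.
docs = [
    ["fire", "barrier", "door"],
    ["fire", "barrier"],
    ["fire", "pump"],
    ["valve", "pump"],
]
coherent = umass_coherence(["fire", "barrier"], docs)
incoherent = umass_coherence(["fire", "valve"], docs)
print(coherent > incoherent)  # True
```

Words that frequently share documents score near 0, while words that never co-occur are pushed toward more negative values, which is the reading direction stated on the plot axis.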

Diversity of outlier reduction

  • Below are the pair-wise embedding distance calculations from the mmr-pos representation (higher means more diverse)
    - Unfortunately, due to time constraints only the top 40 topics and top 10 words from each topic could be used
    - For completeness, we have started a much longer run
  • Generally, outlier reduction decreases the diversity of the topics. The reduction is small, though not negligible.
  • The outlier reduction technique that decreases diversity the least for both inputs is C-TF-IDF

Figure: Pair-Wise Embedding Distance, y-axis from 0.786 to 0.802, by outlier reduction technique (None, CTFIDF, EMBD, DIST, PROB) for the Intro only and Key Phrase inputs
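A pair-wise embedding distance of this kind can be sketched as the mean cosine distance over all pairs of word embeddings. The slides do not give the exact definition, so the choice of cosine distance and the hypothetical function names below are assumptions.

```python
import math
from itertools import combinations


def cosine_distance(a, b):
    """1 - cosine similarity; raises if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("zero-length embedding")
    return 1.0 - dot / norm


def pairwise_embedding_distance(word_embeddings):
    """Mean cosine distance over all unordered pairs of word embeddings."""
    pairs = list(combinations(word_embeddings, 2))
    if not pairs:
        raise ValueError("need at least two word embeddings")
    return sum(cosine_distance(a, b) for a, b in pairs) / len(pairs)


# Toy 2-D embeddings, purely illustrative.
similar_words = [[1.0, 0.0], [1.0, 0.0]]
different_words = [[1.0, 0.0], [0.0, 1.0]]
print(pairwise_embedding_distance(similar_words) < pairwise_embedding_distance(different_words))  # True
```

The number of pairs grows quadratically with the number of words compared, which is consistent with the slide's note that only the top 40 topics and top 10 words could be evaluated in the available time.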

Progress

SOW Task Status

Phase I: March 6, 2023 - April 9, 2023

  • Describe the Problem: Complete
  • Search the Literature: Complete
  • Select Candidates: Complete
  • Select Evaluation Factors: Complete
  • Develop evaluation factor weights: Complete
  • Define evaluation factor ranges: Complete
  • Perform assessment: Complete
  • Report Results: Complete
  • Deliver Trade study report: Complete

Phase II: March 20, 2023 - May 7, 2023

  • Platform/system selection and installation: Complete
  • Data acquisition and preparation: Complete
  • Feature pipeline engineering: Complete
  • Clustering method experimentation & selection: Complete
  • Cluster pipeline engineering: Complete
  • Anomaly detection (as needed): Not needed
  • Model Development, Training, Evaluation: Complete
  • Test harness development: Complete
  • PoC integration and demonstration: Complete
  • Trial runs and evaluation: Complete
  • Demonstrate PoC capability: Complete

Phase III: April 19, 2023 - June 16, 2023

  • Live data ingestion: In progress
  • Model execution: In progress
  • Cluster evaluation: In progress
  • Critical Method documentation: In progress
  • Technical Report Document: In progress
  • Deliver final report with findings: Not started