ML23262B192

Meeting Slides 20230531-final
ML23262B192
Person / Time
Issue date: 05/31/2023
From: Chang Y, Friedman C, Mishkin A, Polra S, Pringle S, Tanya Smith, Vasquez G
NRC/RES/DRA/HFRB, Sphere of Influence


Text

Machine Learning Demo Wednesday

Prioritizing Inspections using ML

Wednesday, May 31, 2023

Alec Mishkin, Guillermo Vasquez, Stuti Polra, Casey Friedman, Scott Pringle, Theresa Smith

Agenda

- Varying Inputs and Topic Representations
- Stopword Removal
  - Input text
  - Topic Representation
- Outlier Reduction Techniques
- Topic Modeling Metrics Applied to Experiments
- Progress

Varying Inputs and Topic Representations

Varying Inputs and Topic Representations for Cluster Formation

1. Topic Modeling Input (3)
- Item Introduction
- Item Introduction Summary (Pegasus_cnn_dailymail model)
- Item Introduction Key Phrases (KeyphraseVectorizer + Guided KeyBERT with custom vocab of 1411 abbreviations, full forms and failure modes)

2. Topic Modeling Parameters (4)
- all-MiniLM-L6-v2
- 15 neighbors, 5 components
- min cluster size 10, 20, 40, 60
- n-grams range of 1-3

3. Topic Representation (5)
- BERTopic MMR: MMR (diversity = 0.6)
- BERTopic MMR + POS: MMR (diversity = 0.6) + POS (NOUN, PROPN, ADJ-NOUN, ADJ-PROPN)
- Vocabulary: 1411 abbreviations + full forms + failure modes
- Key Phrases: 66,325 words/phrases extracted from Item Introductions using KeyphraseVectorizer + Guided KeyBERT with vocab of 1411 abbreviations, full forms and failure modes
- Vocabulary + Key Phrases (67,402)

Experiments

- 3 inputs x 4 cluster sizes = 12 experiments
- 2 BERTopic representations for all 12 experiments
- 3 custom representations for all 12 experiments
- Stopword removal at input level and at topic representation level
- TF-IDF, Counts: String matching on full item-intros in each topic cluster
- TF-IDF on input text in each topic cluster
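The experiment counts on this slide (3 inputs x 4 cluster sizes = 12 experiments, with 2 BERTopic and 3 custom representations each) can be enumerated directly. The sketch below is illustrative only: the dictionary keys and label strings are hypothetical names chosen for this example, and mapping "15 neighbors, 5 components" to the keys `n_neighbors` and `n_components` is an assumption.

```python
from itertools import product

# Labels are hypothetical; the values come from the slide.
INPUTS = [
    "item_introduction",
    "item_introduction_summary",
    "item_introduction_key_phrases",
]
MIN_CLUSTER_SIZES = [10, 20, 40, 60]
REPRESENTATIONS = [
    "bertopic_mmr",
    "bertopic_mmr_pos",
    "vocabulary",
    "key_phrases",
    "vocabulary_plus_key_phrases",
]

# Parameters held fixed across all runs (key names are assumptions).
FIXED_PARAMETERS = {
    "embedding_model": "all-MiniLM-L6-v2",
    "n_neighbors": 15,
    "n_components": 5,
    "ngram_range": (1, 3),
}


def build_experiments():
    """Return one configuration per (input, min cluster size) pair."""
    return [
        {"input": name, "min_cluster_size": size, **FIXED_PARAMETERS}
        for name, size in product(INPUTS, MIN_CLUSTER_SIZES)
    ]


experiments = build_experiments()

# Every experiment gets all five topic representations.
representation_runs = [
    (exp["input"], exp["min_cluster_size"], rep)
    for exp in experiments
    for rep in REPRESENTATIONS
]

print(len(experiments))          # 12 experiments
print(len(representation_runs))  # 60 representation outputs (12 x 5)
```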

Stopword Removal

Input Text

- Reactor site names and codes
- Reactor parent company names
- City names and State abbreviations

Both standardized names and short-form variations of site and parent company names are removed.

Leaving these in the input text affects the topics that are found: documents discussing different safety issues are clustered on the common site name (Palo Verde, Diablo Canyon, Wolf Creek) or parent company name (Entergy, Exelon, Dominion, PSEG).

Removing these names helps us find the true underlying safety clusters.

Discovered safety clusters can still be viewed by the site and parent company categories.

Topic Representation

Curated list of 117 stopwords specific to NRC text that don't help in identifying safety issues. Ex: safety, reactor, inspection, inspectors, licensee, 10 CFR, ASME, cornerstone, cross-cutting area, significance, appendix, criterion

Singular words from Cornerstones and Cross-cutting Areas: public, cross-cutting, events, integrity, systems, performance, mitigating, emergency, resolution, safety, occupational, preparedness, problem, work, barrier, supplemental, environment, human, identification, initiating, aspects, conscious

Full phrases are retained; cornerstone attributes and cross-cutting aspects are also retained.
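Input-level removal of site and parent company names can be sketched with a single regular expression. This is a minimal illustration, not the team's implementation: the function names are hypothetical, and the name list holds only the examples quoted on the slide (a real list would also cover site codes, city names, state abbreviations, and short-form variations).

```python
import re

# Example names quoted on the slide; the full list is not given there.
NAMES_TO_REMOVE = [
    "Palo Verde", "Diablo Canyon", "Wolf Creek",
    "Entergy", "Exelon", "Dominion", "PSEG",
]


def build_name_pattern(names):
    """Compile one case-insensitive pattern matching any name as a whole word or phrase."""
    # Try longer names first so a multi-word name wins over a shorter overlap.
    ordered = sorted(names, key=len, reverse=True)
    alternatives = "|".join(re.escape(name) for name in ordered)
    return re.compile(r"\b(?:" + alternatives + r")\b", flags=re.IGNORECASE)


def strip_names(text, pattern):
    """Remove matched names, then collapse the whitespace left behind."""
    without_names = pattern.sub(" ", text)
    return re.sub(r"\s+", " ", without_names).strip()


pattern = build_name_pattern(NAMES_TO_REMOVE)
cleaned = strip_names("Inspectors at Palo Verde identified a failed pump.", pattern)
print(cleaned)
```

Because the names are removed before embedding, documents can no longer cluster on a shared site or company name, while the original metadata remains available for viewing clusters by site afterwards.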

Stopword Removal: without stopword removal

Note: topics are not from the same experiment run, but are from runs with comparable model parameters (min cluster size 20, MMR BERTopic representation)

Topics shown: Topic 8, Topic 10, Topic 11, Topic 35, Topic 48, Topic 51, Topic 61

Stopword Removal: with stopword removal


Topics shown: Topic 6, Topic 8, Topic 11, Topic 16, Topic 30, Topic 59, Topic 69

Stopword Removal: with stopword removal

- Some stopwords can still be removed to improve topic representations
- Safety clusters are no longer forming on site name or parent company name commonalities
- Generic reactor-safety related terms and inspection findings boilerplate words are appearing less often

Outlier Reduction

Outlier Reduction Techniques

1) Topic Probability: Soft-clustering from HDBSCAN to find the best matching topic for each outlier document
2) Topic Distribution: Find the most frequent topic discussed in the outlier document, and assign that topic to the outlier document. A sliding window is applied to the document, the c-TF-IDF of each window is computed and compared to existing topics, and the similarities of each window to the topics are summed to create a topic distribution for the whole document
3) c-TF-IDF: Find the most similar c-TF-IDF topic representation to the c-TF-IDF representation of the outlier document and assign that topic to the outlier document
4) Embedding: Find the most similar topic embedding to the outlier document's embedding and assign that topic to the outlier document

Threshold: probability or distance that controls how many outliers are assigned to topics vs. kept as outliers; keeping the default value of 0 for now

Topic representations can be left unchanged after adding the outlier documents to existing topics, but we recompute the topic representations after adding outlier documents so that metrics can be calculated before and after outlier assignment
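The embedding strategy and the role of the threshold can be illustrated with a small, dependency-free sketch. The four strategies listed match the options of BERTopic's `reduce_outliers` method, but the slides do not say which API was called, so the code below is not that implementation: the function names, the toy vectors, and the exact threshold rule (assign only when the best cosine similarity is at least the threshold) are assumptions made for illustration. The label -1 follows the HDBSCAN convention for outliers.

```python
import math

OUTLIER = -1  # HDBSCAN labels documents it cannot cluster as -1


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def assign_outliers(doc_embeddings, topics, topic_embeddings, threshold=0.0):
    """Reassign outlier documents to the most similar topic embedding.

    doc_embeddings: one vector per document
    topics: current topic id per document (OUTLIER for outliers)
    topic_embeddings: dict mapping topic id to its embedding
    threshold: minimum similarity required to assign an outlier (assumed rule)
    """
    new_topics = list(topics)
    for index, (embedding, topic) in enumerate(zip(doc_embeddings, topics)):
        if topic != OUTLIER:
            continue  # documents that already have a topic are left alone
        best_topic, best_similarity = OUTLIER, -math.inf
        for topic_id, topic_embedding in topic_embeddings.items():
            similarity = cosine_similarity(embedding, topic_embedding)
            if similarity > best_similarity:
                best_topic, best_similarity = topic_id, similarity
        if best_similarity >= threshold:
            new_topics[index] = best_topic
    return new_topics


# Toy data: two topics and one outlier that lies close to topic 0.
doc_embeddings = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
topics = [0, 1, OUTLIER]
topic_embeddings = {0: [1.0, 0.0], 1: [0.0, 1.0]}

reassigned = assign_outliers(doc_embeddings, topics, topic_embeddings)
strict = assign_outliers(doc_embeddings, topics, topic_embeddings, threshold=0.999)
print(reassigned, strict)
```

With the threshold at 0, the outlier is absorbed by topic 0; raising the threshold leaves it as an outlier, which is the trade-off the threshold controls.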

Outlier Reduction (Item Introduction)

Topic 52 (top 20 terms)

provisions approved protection - license condition - provisions approved - protection implement - license - approved protection - effect provisions approved - provisions - effect provisions - facility operating license - operating license - transient combustible - facility operating - protection specifically - approved protection specifically - license condition protection - identified facility operating - maintain effect provisions - combustible - condition protection - protection

Topic 52 (top 20 terms) after Embedding Outlier Assignment

protection - condition - operability - failed - procedure - entered - procedures - 0609 - specifically - manual - required - technical - chapter - determined - manual chapter - determination - ensure - equipment - manual chapter 0609

Outlier Reduction (Item Introduction Summary)

Topic 44 (top 20 terms)

team identified technical - identified technical - technical procedures - technical failed establish - identified technical procedures - identified noncited technical - identified technical failed - technical failed - establish adequate procedures - technical - team - team identified - adequate procedures ensure - technical procedures implement - noncited technical - annunciator - nature team identified - audit process applicability - audit process - adequate procedures severe

Topic 44 (top 20 terms) after Embedding Outlier Assignment

procedure - procedures - quality - technical - implement - team - determined - personnel - maintenance - follow - adequate - identified - process - entergy - identified technical - instructions - procedure quality - team identified - technical procedures - operations

Outlier Reduction (Item Introduction Key Phrases)

Topic 64 (top 20 terms)

brigade - low protection - tables - detection - detectors - license condition - extinguishers - license - post incident - strategy plan - chemicals - compensatory - adequate warnings - low brigade response - hazards - low protection procedures - qcnps - extinguisher - areas - low brigade - warnings

Topic 64 (top 20 terms) after Embedding Outlier Assignment

brigade - license - co2 - license condition - suppression - operating license - detection - protection - hose - doors - room - facility operating license - door - facility operating - areas - detectors - compensatory - low protection - hose stations - consequences

Coherence Metric Verification based on inputs

Coherence Metric Verification

Last week we provided coherence results for different inputs. For the majority of experiments, the introduction as input had the highest coherence. An example of this is presented below on the right.

Guillermo has qualified 30 different topics from three different inputs (Introduction, Key Phrases, Summary) as good, intermediate, or bad in coherence.

Both our coherence metric and Guillermo's results show the introduction is the best input.

According to Guillermo, the summary is a slightly worse input than key phrases, which contradicts our work.

Outlier Reduction with Metrics

Coherence of outlier reductions

Outlier reduction is an important step to improving the practicality of topic modelling

Of course, there is always the concern that outlier reduction will decrease the quality of the topics

Below we have plotted the coherence metric for the mmr-pos representation produced when using introduction as input and key phrases as input (top 40 words have been used)

[Figure: Coherence Metric (closer to 0 is better; axis from -3 to 0) by outlier reduction technique (None, C-TF-IDF, EMBD, DIST, PROB) for the Intro only and Key Phrase inputs]

Outlier reduction does increase the overall coherence of the topics

Based on these results, C-TF-IDF or Topic probability seem like the best methods for outlier reduction

Generally, the effect of outlier reduction on the overall coherence seems negligible
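The slides do not name the coherence metric. A score that is negative and better when closer to 0 is consistent with a log-ratio document co-occurrence coherence such as the UMass measure, so the sketch below assumes that family purely for illustration. The function names and toy documents are hypothetical, and the score is averaged over word pairs rather than summed, which is a variant chosen here to keep scores comparable across different numbers of top words.

```python
import math
from itertools import combinations


def document_frequency(docs, *words):
    """Number of documents (sets of tokens) that contain all the given words."""
    return sum(1 for doc in docs if all(word in doc for word in words))


def umass_style_coherence(top_words, docs):
    """Mean log-ratio co-occurrence score over ordered pairs of a topic's top words.

    top_words: the topic's terms, most important first
    docs: list of sets of tokens
    """
    scores = []
    # combinations() keeps list order, so `earlier` is always ranked above `later`.
    for earlier, later in combinations(top_words, 2):
        count_earlier = document_frequency(docs, earlier)
        if count_earlier == 0:
            continue  # a word that never occurs gives no usable evidence
        count_both = document_frequency(docs, earlier, later)
        # The +1 smoothing keeps the logarithm defined when the pair never co-occurs.
        scores.append(math.log((count_both + 1) / count_earlier))
    if not scores:
        raise ValueError("no scorable word pairs")
    return sum(scores) / len(scores)


# Toy corpus: "fire" and "brigade" co-occur often, "fire" and "pump" never do.
docs = [
    {"fire", "brigade"},
    {"fire", "hose"},
    {"fire", "brigade", "hose"},
    {"pump"},
]
related = umass_style_coherence(["fire", "brigade"], docs)
unrelated = umass_style_coherence(["fire", "pump"], docs)
print(related, unrelated)
```

Words that rarely appear together pull the score further below zero, which is why a topic diluted by poorly matched outlier documents would be expected to score lower.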

The next slide contains the results from the diversity metric

Diversity of outlier reduction

Below are the pair-wise word embedding distance calculations from the mmr-pos representation (higher means more diverse).

Unfortunately, due to time constraints, only the top 40 topics and top 10 words from each topic could be used. For completeness, we have started a much longer run.

Generally, outlier reduction decreases the diversity of the topics. The reduction is small, though not negligible.

The outlier reduction technique that decreases diversity the least for both inputs is C-TF-IDF.

[Figure: Pair-Wise Embedding Distance (axis from 0.786 to 0.802) by outlier reduction technique (None, C-TF-IDF, EMBD, DIST, PROB) for the Intro only and Key Phrase inputs]
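A pair-wise embedding distance can be sketched as the mean cosine distance over all pairs of word vectors. The slides do not state the distance function or whether pairs are formed within or across topics, so cosine distance over one flat list of vectors is an assumption here, and the function names and toy vectors are hypothetical.

```python
import math
from itertools import combinations


def cosine_distance(a, b):
    """1 minus cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def mean_pairwise_distance(word_vectors):
    """Average cosine distance over every unordered pair of word vectors."""
    pairs = list(combinations(word_vectors, 2))
    if not pairs:
        raise ValueError("need at least two word vectors")
    return sum(cosine_distance(a, b) for a, b in pairs) / len(pairs)


# Toy embeddings: two identical words and one unrelated word.
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
diversity = mean_pairwise_distance(vectors)
print(diversity)
```

Higher values mean the top words point in more varied directions in embedding space, matching the slide's reading that higher is more diverse.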

Progress

SOW Task Status

Phase I: March 6, 2023 - April 9, 2023
- Describe the Problem: Complete
- Search the Literature: Complete
- Select Candidates: Complete
- Select Evaluation Factors: Complete
- Develop evaluation factor weights: Complete
- Define evaluation factor ranges: Complete
- Perform assessment: Complete
- Report Results: Complete
- Deliver Trade study report: Complete

Phase II: March 20, 2023 - May 7, 2023
- Platform/system selection and installation: Complete
- Data acquisition and preparation: Complete
- Feature pipeline engineering: Complete
- Clustering method experimentation & selection: Complete
- Cluster pipeline engineering: Complete
- Anomaly detection (as needed): Not needed
- Model Development, Training, Evaluation: Complete
- Test harness development: Complete
- PoC integration and demonstration: Complete
- Trial runs and evaluation: Complete
- Demonstrate PoC capability: Complete

Phase III: April 19, 2023 - June 16, 2023
- Live data ingestion: In progress
- Model execution: In progress
- Cluster evaluation: In progress
- Critical Method documentation: In progress
- Technical Report Document: In progress
- Deliver final report with findings: Not started