ML23262B200

From kanterella
Revision as of 00:38, 29 September 2023 by StriderTol (talk | contribs) (StriderTol Bot insert)
Meeting Slides 20230614-final
ML23262B200
Person / Time
Issue date: 06/14/2023
From: Chang Y, Friedman C, Mishkin A, Polra S, Pringle S, Tanya Smith, Vasquez G
NRC/RES/DRA/HFRB, Sphere of Influence
To:
References


Text

Machine Learning Demo Wednesday
Prioritizing Inspections using ML
Alec Mishkin, Guillermo Vasquez, Stuti Polra, Casey Friedman, Scott Pringle, Theresa Smith
Wednesday, June 14, 2023

Agenda
  • Metrics Update
  • Overview of Study Experiments
  • Demonstration of End-to-End Jupyter Notebook

Diversity Metric Update

Bi-grams and Tri-grams

  • The two metrics that we have been using are the coherence metric and the pair-wise embedding distance diversity metric.
  • We have been giving both metrics bi-grams and tri-grams where spaces are replaced with _ (Nuclear Energy => Nuclear_Energy).
  • This was the correct format for the coherence metric but incorrect for the diversity metric.
  • The following slides contain recalculations of the diversity metrics with the original spaces still intact.
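The two input formats can be illustrated with a minimal sketch. The helper names and the sample topic words are hypothetical and not from the slides; the only behavior taken from the slides is that n-grams are joined with _ for the coherence metric and keep their spaces for the embedding-based diversity metric.

```python
def to_coherence_format(ngram: str) -> str:
    """Join the words of an n-gram with underscores (format used for coherence)."""
    return "_".join(ngram.split())


def to_diversity_format(ngram: str) -> str:
    """Restore spaces so an embedding model sees the individual words."""
    return " ".join(ngram.replace("_", " ").split())


# Illustrative topic words only; "Nuclear Energy" is the example from the slide.
topic_words = ["Nuclear Energy", "inspection", "reactor coolant system"]

coherence_input = [to_coherence_format(w) for w in topic_words]
diversity_input = [to_diversity_format(w) for w in coherence_input]
```

Note that this round trip assumes no topic word contains a genuine underscore of its own.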

Diversity of outlier (Recalculation)

Below are the pair-wise embedding distance calculations from the mmr-pos representation (higher means more diverse)

- Unfortunately, due to time constraints only the top 40 topics and top 10 words from each topic could be used

[Figure: two bar charts, "Outlier Reduction Techniques (old results)" and "Outlier Reduction Techniques (new results)". Each plots Pair-Wise Embedding Distance for the outlier reduction techniques None, CTIFD, EMBD, DIST, and PROB, for the Intro Only and Key Phrase inputs.]

  • The recalculation still shows that diversity of the topics does not change much after performing outlier reduction
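The slides do not spell out the formula behind the pair-wise embedding distance. One common reading, assumed here, is the mean cosine distance over all pairs of top-word embeddings in a topic. The sketch below uses toy 2-D vectors in place of a real embedding model, and the function names are hypothetical.

```python
from itertools import combinations
from math import sqrt


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two non-zero, equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def pairwise_embedding_distance(embeddings):
    """Average cosine distance over all unordered pairs of embeddings."""
    pairs = list(combinations(embeddings, 2))
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)


# Toy vectors standing in for word embeddings (illustrative only).
similar_words = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
spread_words = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

low = pairwise_embedding_distance(similar_words)
high = pairwise_embedding_distance(spread_words)
```

Under this definition, a topic whose top words point in the same direction scores near 0, and a topic whose words are spread out scores higher, which matches the "higher means more diverse" reading.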

Outlier Reduction of Custom Representations (Recalculation)

Coherence Scores
Input         Reduction                       Vocab    Key Phrases   Vocab + Key Phrase
Introduction  No Reduction                    -1.201   -0.801        -0.804
Introduction  Probability Outlier Reduction   -1.207   -0.685        -0.686
Key Phrases   No Reduction                    -1.453   -0.886        -0.877
Key Phrases   Probability Outlier Reduction   -1.298   -0.711        -0.717

(Old) Diversity Scores
Input         Reduction                       Vocab    Key Phrases   Vocab + Key Phrase
Introduction  No Reduction                    0.809    0.809         0.809
Introduction  Probability Outlier Reduction   0.805    0.801         0.804
Key Phrases   No Reduction                    0.808    0.810         0.812
Key Phrases   Probability Outlier Reduction   0.803    0.796         0.799

(New) Diversity Scores
Input         Reduction                       Vocab    Key Phrases   Vocab + Key Phrase
Introduction  No Reduction                    0.836    0.812         0.813
Introduction  Probability Outlier Reduction   0.826    0.803         0.806
Key Phrases   No Reduction                    0.836    0.817         0.820
Key Phrases   Probability Outlier Reduction   0.828    0.802         0.804

  • We still see a small decrease in diversity after outlier reduction. However, now the custom vocab representation provides the most diverse response.
  • Given these results, we still think outlier reduction is worth the decrease in diversity
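The size of the decrease can be checked with a few lines of arithmetic. The values below are copied from the (New) Diversity Scores; the dictionary layout and variable names are only one possible way to organize them.

```python
# (New) Diversity Scores, keyed by (input, representation).
no_reduction = {
    ("Introduction", "Vocab"): 0.836,
    ("Introduction", "Key Phrases"): 0.812,
    ("Introduction", "Vocab + Key Phrase"): 0.813,
    ("Key Phrases", "Vocab"): 0.836,
    ("Key Phrases", "Key Phrases"): 0.817,
    ("Key Phrases", "Vocab + Key Phrase"): 0.820,
}
probability_reduction = {
    ("Introduction", "Vocab"): 0.826,
    ("Introduction", "Key Phrases"): 0.803,
    ("Introduction", "Vocab + Key Phrase"): 0.806,
    ("Key Phrases", "Vocab"): 0.828,
    ("Key Phrases", "Key Phrases"): 0.802,
    ("Key Phrases", "Vocab + Key Phrase"): 0.804,
}

# Change in diversity after probability outlier reduction (negative = less diverse).
changes = {
    key: round(probability_reduction[key] - no_reduction[key], 3)
    for key in no_reduction
}
largest_drop = min(changes.values())
most_diverse = max(probability_reduction, key=probability_reduction.get)
```

Every change is negative and none exceeds 0.016 in magnitude, and the Vocab representation remains the most diverse after reduction.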

Overview of Study Experiments

Structure of Solution

1. Topic Modeling Input
2. Topic Modeling Parameters
3. Topic Representation and Visualization

Pipeline components: Embedding, Dimension Reduction, Clustering, Tokenizer & Vectorizer, Weighting, Representation

Input - Item Introduction: Select the text that will be used in unsupervised learning
  • Full Text
  • Summary: BART-large-cnn, T5-Base, Flan-t5-base, Pegasus xsum, Pegasus cnn-daily mail, Pegasus arxiv, Pegasus pubmed
  • Question Answering: Flan-t5-base, Roberta-base-squad2, Bert-large-cased
  • Key Phrase: Unsupervised KeyBERT, Guided KeyBERT, Unsupervised KeyBERT KeyphraseVectorizers, Guided KeyBERT KeyphraseVectorizers
  • Named Entity Recognition

Embedding: Create a mathematical representation of the document to use in the algorithms
  • all-MiniLM-L6-v2
  • all-mpnet-base-v2
  • xlnet-base-cased
  • SPECTER
  • multi-qa-MiniLM-L6-dot-v1

Dimension Reduction: Reduce the number of parameters from hundreds to dozens
  • UMAP N=5, C=5
  • UMAP N=10, C=5
  • UMAP N=15, C=5
  • UMAP N=20, C=5
  • UMAP N=5, C=10
  • UMAP N=10, C=50
  • UMAP N=15, C=50
  • UMAP N=20, C=50

Clustering: Select parameters for unsupervised clustering
  • HDBSCAN Min doc/cluster 10
  • HDBSCAN Min doc/cluster 20
  • HDBSCAN Min doc/cluster 40
  • HDBSCAN Min doc/cluster 60

Tokenizer & Vectorizer: Split up text and count occurrences of the tokens
  • Uni-grams
  • Uni + Bi-grams
  • Uni + Bi + Tri-grams

Weighting: Calculate importance of terms
  • TF-IDF

Representation: Present the cluster themes in analyst friendly terms
  • MMR
  • MMR+POS
  • Vocab
  • Key Phrase
  • Vocab + Key Phrase
  • KeyBERT Inspired

Option counts on the slide: 15 tested, 5 tested, 10 tested, 4 tested, 3 tested, 5 tested; 1 Chosen, 3 Chosen, 1 Chosen, 1 Chosen, 1 Chosen, 1 Chosen, 2 Chosen
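The Tokenizer & Vectorizer and Weighting stages can be sketched in plain Python. This is a textbook TF-IDF over uni-, bi-, and tri-grams, assumed here for illustration; the slides name TF-IDF but not the exact variant, and the tooling used in the study may weight terms differently. The sample texts and function names are made up.

```python
from collections import Counter
from math import log


def ngrams(text, max_n=3):
    """Return all n-grams of length 1..max_n from a lowercased, whitespace-split text."""
    tokens = text.lower().split()
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def tf_idf(documents, max_n=3):
    """Score each n-gram per document as term frequency * log(N / document frequency)."""
    counts = [Counter(ngrams(doc, max_n)) for doc in documents]
    # Iterating a Counter yields each term once, so this counts documents per term.
    doc_freq = Counter(term for c in counts for term in c)
    n_docs = len(documents)
    return [
        {
            term: (freq / sum(c.values())) * log(n_docs / doc_freq[term])
            for term, freq in c.items()
        }
        for c in counts
    ]


# Made-up example texts, not taken from the inspection data.
docs = ["nuclear energy inspection", "nuclear energy safety"]
scores = tf_idf(docs)
```

Terms that appear in every document, such as "nuclear energy" here, score zero, while terms unique to one document, such as "inspection", receive a positive weight.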

Final Pipeline and Demo Content

  • Input - Item Introduction: Full Text; Summary (Pegasus cnn-daily mail); Key Phrase (Guided KeyBERT KeyphraseVectorizers)
  • Embedding: all-MiniLM-L6-v2
  • Dimension Reduction: UMAP N=15, C=5
  • Clustering: HDBSCAN Min doc/cluster 20
  • Tokenizer & Vectorizer: Uni + Bi + Tri-grams
  • Weighting: TF-IDF
  • Representation: MMR + POS; Vocab + Key Phrase
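The chosen settings can also be written down as data. The sketch below rests on an assumption the slides do not confirm: that N and C stand for UMAP's n_neighbors and n_components, and that "Min doc/cluster" corresponds to HDBSCAN's min_cluster_size. The class and field names are hypothetical, and the code only builds keyword-argument dictionaries; it does not import or run any of those libraries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FinalPipeline:
    """Settings listed for the final pipeline (field names are hypothetical)."""
    embedding_model: str = "all-MiniLM-L6-v2"
    umap_n: int = 15               # "N=15" on the slide
    umap_c: int = 5                # "C=5" on the slide
    min_docs_per_cluster: int = 20
    max_ngram: int = 3             # uni + bi + tri-grams
    weighting: str = "TF-IDF"
    representations: tuple = ("MMR + POS", "Vocab + Key Phrase")


def library_kwargs(cfg):
    """Map the slide shorthand to keyword arguments (ASSUMED mapping, not confirmed)."""
    return {
        "umap": {"n_neighbors": cfg.umap_n, "n_components": cfg.umap_c},
        "hdbscan": {"min_cluster_size": cfg.min_docs_per_cluster},
        "vectorizer": {"ngram_range": (1, cfg.max_ngram)},
    }


kwargs = library_kwargs(FinalPipeline())
```

Keeping the settings in one frozen object makes it easy to record which combination produced a given set of topics.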

Demonstration of End-to-End Jupyter Notebook

Progress

SOW Task Status

Phase I: March 6, 2023 - April 9, 2023
Task                                             Status
Describe the Problem                             Complete
Search the Literature                            Complete
Select Candidates                                Complete
Select Evaluation Factors                        Complete
Develop evaluation factor weights                Complete
Define evaluation factor ranges                  Complete
Perform assessment                               Complete
Report Results                                   Complete
Deliver Trade study report                       Complete

Phase II: March 20, 2023 - May 7, 2023
Task                                             Status
Platform/system selection and installation       Complete
Data acquisition and preparation                 Complete
Feature pipeline engineering                     Complete
Clustering method experimentation & selection    Complete
Cluster pipeline engineering                     Complete
Anomaly detection (as needed)                    Not needed
Model Development, Training, Evaluation          Complete
Test harness development                         Complete
PoC integration and demonstration                Complete
Trial runs and evaluation                        Complete
Demonstrate PoC capability                       Complete

Phase III: April 19, 2023 - June 16, 2023
Task                                             Status
Live data ingestion                              Complete
Model execution                                  Complete
Cluster evaluation                               Complete
Critical Method documentation                    Complete
Technical Report Document                        In progress
Deliver final report with findings               Not started