ML23262B200

Meeting Slides 20230614-final
Person / Time
Issue date: 06/14/2023
From: Chang Y, Friedman C, Mishkin A, Polra S, Pringle S, Tanya Smith, Vasquez G
NRC/RES/DRA/HFRB, Sphere of Influence

Machine Learning Demo Wednesday
Prioritizing Inspections Using ML
Wednesday, June 14, 2023
Alec Mishkin, Guillermo Vasquez, Stuti Polra, Casey Friedman, Scott Pringle, Theresa Smith

Agenda

Metrics Update
Overview of Study Experiments
Demonstration of End-to-End Jupyter Notebook

Diversity Metric Update

Bi-grams and Tri-grams

The two metrics that we have been using are the coherence metric and the pair-wise embedding distance diversity metric.

We have been giving both metrics bi-grams and tri-grams in which spaces are replaced with an underscore (_).

Nuclear Energy => Nuclear_Energy

This was the correct format for the coherence metric but incorrect for the diversity metric.

The following slides contain recalculations of the diversity metrics with the original spaces left intact.
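The formatting difference described above can be sketched as follows (a minimal illustration; `join_ngram` is a hypothetical helper, not the study's actual code):

```python
def join_ngram(tokens, joiner):
    """Format an n-gram either with underscores (the coherence metric's
    expected input) or with the original spaces (what the diversity
    metric should receive)."""
    return joiner.join(tokens)

bigram = ["nuclear", "energy"]

# Coherence metric expects underscore-joined tokens:
coherence_form = join_ngram(bigram, "_")   # "nuclear_energy"

# Diversity (pair-wise embedding distance) should keep spaces intact:
diversity_form = join_ngram(bigram, " ")   # "nuclear energy"

print(coherence_form, "|", diversity_form)
```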

Diversity of Outliers (Recalculation)

[Bar charts: pair-wise embedding distance (higher = more diverse) for the Intro Only and Key Phrase inputs across the outlier-reduction techniques None, CTIFD, EMBD DIST, and PROB; old results on the left (y-axis roughly 0.786-0.802), new results on the right (y-axis roughly 0.782-0.798).]

The recalculation still shows that diversity of the topics does not change much after performing outlier reduction

Below are the pair-wise embedding distance calculations from the MMR+POS representation (higher means more diverse).

Unfortunately, due to time constraints, only the top 40 topics and the top 10 words from each topic could be used.
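A pair-wise embedding distance diversity score of this kind can be sketched as below, assuming word embeddings are already available (function name and the stand-in random data are illustrative, not the study's code; real inputs would be the embeddings of the top-10 words from each of the top-40 topics):

```python
import numpy as np

def pairwise_embedding_diversity(word_embeddings):
    """Mean pair-wise cosine distance between topic-word embeddings.

    Higher values mean the words (and hence the topics) are more diverse.
    `word_embeddings` is an (n_words, dim) array.
    """
    X = np.asarray(word_embeddings, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit-normalise rows
    sims = X @ X.T                                    # cosine similarities
    iu = np.triu_indices(len(X), k=1)                 # each unique pair once
    return float(np.mean(1.0 - sims[iu]))             # mean cosine distance

# Toy check with stand-in embeddings:
rng = np.random.default_rng(0)
identical = np.tile(rng.normal(size=(1, 8)), (5, 1))  # five copies of one vector
varied = rng.normal(size=(5, 8))                      # five unrelated vectors
print(pairwise_embedding_diversity(identical))        # ~0.0: no diversity
print(pairwise_embedding_diversity(varied))           # larger: more diverse
```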

Outlier Reduction of Custom Representations (Recalculation)

(Old) Diversity Scores
                                    Vocab    Key Phrases   Vocab + Key
Introduction Input
  No Reduction                      0.809    0.809         0.809
  Probability Outlier Reduction     0.805    0.801         0.804
Key Phrases Input
  No Reduction                      0.808    0.810         0.812
  Probability Outlier Reduction     0.803    0.796         0.799

(New) Diversity Scores
                                    Vocab    Key Phrases   Vocab + Key
Introduction Input
  No Reduction                      0.836    0.812         0.813
  Probability Outlier Reduction     0.826    0.803         0.806
Key Phrases Input
  No Reduction                      0.836    0.817         0.820
  Probability Outlier Reduction     0.828    0.802         0.804

Coherence Scores
                                    Vocab    Key Phrases   Vocab + Key
Introduction Input
  No Reduction                      -1.201   -0.801        -0.804
  Probability Outlier Reduction     -1.207   -0.685        -0.686
Key Phrases Input
  No Reduction                      -1.453   -0.886        -0.877
  Probability Outlier Reduction     -1.298   -0.711        -0.717

We still see a small decrease in diversity after outlier reduction. However, the custom vocab representation now provides the most diverse response.

Given these results, we still think outlier reduction is worth the decrease in diversity.
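A probability-based outlier-reduction step of the kind weighed above can be sketched as follows (names and the threshold are illustrative; BERTopic offers a comparable `reduce_outliers` utility):

```python
import numpy as np

def reduce_outliers_by_probability(topics, topic_probs, threshold=0.1):
    """Reassign outlier documents (topic -1) to their most probable topic.

    `topic_probs` is an (n_docs, n_topics) array of topic probabilities.
    Documents whose best topic probability is below `threshold` remain
    outliers, which is what trades a little diversity for fewer outliers.
    """
    topics = np.asarray(topics).copy()
    probs = np.asarray(topic_probs, dtype=float)
    for i, t in enumerate(topics):
        if t == -1:                          # outlier document
            best = int(np.argmax(probs[i]))  # most probable topic
            if probs[i, best] >= threshold:  # only reassign if confident enough
                topics[i] = best
    return topics

topics = [0, -1, 1, -1]
probs = [[0.90, 0.10],
         [0.20, 0.80],    # confident: reassigned to topic 1
         [0.30, 0.70],
         [0.05, 0.05]]    # below threshold: stays an outlier
print(reduce_outliers_by_probability(topics, probs).tolist())  # [0, 1, 1, -1]
```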

Overview of Study Experiments

Structure of Solution

1. Topic Modeling Input
2. Topic Modeling Parameters: Embedding, Dimension Reduction, Clustering, Tokenizer & Vectorizer, Weighting
3. Topic Representation and Visualization

Input (select the text that will be used in unsupervised learning; 15 tested, 3 chosen):
  Item Introduction; Full Text
  Summary: BART-large-cnn, T5-Base, Flan-t5-base, Pegasus xsum, Pegasus cnn-daily mail, Pegasus arxiv, Pegasus pubmed
  Question Answering: Flan-t5-base, Roberta-base-squad2, Bert-large-cased
  Key Phrase: Unsupervised KeyBERT, Guided KeyBERT, Unsupervised KeyBERT + KeyphraseVectorizers, Guided KeyBERT + KeyphraseVectorizers
  Named Entity Recognition

Embedding (create a mathematical representation of the document to use in the algorithms; 5 tested, 1 chosen):
  all-mpnet-base-v2, xlnet-base-cased, SPECTER, multi-qa-MiniLM-L6-dot-v1, all-MiniLM-L6-v2

Dimension Reduction (reduce the number of parameters from hundreds to dozens; 10 tested, 1 chosen):
  UMAP: N=5 C=5, N=5 C=10, N=10 C=5, N=10 C=50, N=15 C=5, N=15 C=50, N=20 C=5, N=20 C=50

Clustering (select parameters for unsupervised clustering; 4 tested, 1 chosen):
  HDBSCAN with min doc/cluster of 10, 20, 40, or 60

Tokenizer & Vectorizer (split up text and count occurrences of the tokens; 3 tested, 1 chosen):
  Uni-grams; Uni + Bi-grams; Uni + Bi + Tri-grams

Weighting (calculate importance of terms; 1 chosen):
  TF-IDF

Representation (present the cluster themes in analyst-friendly terms; 5 tested, 2 chosen):
  MMR+POS, MMR, Vocab, Key Phrase, Vocab + Key Phrase, KeyBERT Inspired

Chosen pipeline: Full Text input; all-MiniLM-L6-v2 embedding; UMAP N=15, C=5; HDBSCAN min doc/cluster 20; Uni + Bi + Tri-grams; TF-IDF; MMR+POS and Vocab + Key Phrase representations (Pegasus cnn-daily mail summaries; Guided KeyBERT + KeyphraseVectorizers key phrases)

Final Pipeline and Demo Content

Input: Full Text (select the text that will be used in unsupervised learning)
Embedding: all-MiniLM-L6-v2 (create a mathematical representation of the document to use in the algorithms)
Dimension Reduction: UMAP N=15, C=5 (reduce the number of parameters from hundreds to dozens)
Clustering: HDBSCAN min doc/cluster 20 (select parameters for unsupervised clustering)
Tokenizer & Vectorizer: Uni + Bi + Tri-grams (split up text and count occurrences of the tokens)
Weighting: TF-IDF (calculate importance of terms)
Representation: MMR + POS and Vocab + Key Phrase, built from Pegasus cnn-daily mail summaries and Guided KeyBERT + KeyphraseVectorizers key phrases (present the cluster themes in analyst-friendly terms)
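For reference, the chosen settings can be collected into a single configuration sketch (key names are illustrative, not the study's code; N and C are interpreted here as UMAP's n_neighbors and n_components):

```python
# Final pipeline selections from the study, as a plain configuration dict.
# All values are taken from the slides; the dict structure itself is an
# illustrative convenience, not the actual implementation.
FINAL_PIPELINE = {
    "input": "Full Text",
    "embedding_model": "all-MiniLM-L6-v2",
    "dimension_reduction": {"method": "UMAP", "n_neighbors": 15, "n_components": 5},
    "clustering": {"method": "HDBSCAN", "min_docs_per_cluster": 20},
    "tokenizer_vectorizer": "uni- + bi- + tri-grams",
    "weighting": "TF-IDF",
    "representations": {
        "mmr_pos": "MMR + POS",
        "vocab_plus_key_phrase": {
            "summary_model": "Pegasus cnn-daily mail",
            "key_phrases": "Guided KeyBERT + KeyphraseVectorizers",
        },
    },
}

for stage, choice in FINAL_PIPELINE.items():
    print(f"{stage}: {choice}")
```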

Demonstration of End-to-End Jupyter Notebook

Progress

SOW Task Status

Phase I: March 6, 2023 - April 9, 2023
  Describe the Problem                         Complete
  Search the Literature                        Complete
  Select Candidates                            Complete
  Select Evaluation Factors                    Complete
  Develop evaluation factor weights            Complete
  Define evaluation factor ranges              Complete
  Perform assessment                           Complete
  Report Results                               Complete
  Deliver Trade study report                   Complete

Phase II: March 20, 2023 - May 7, 2023
  Platform/system selection and installation   Complete
  Data acquisition and preparation             Complete
  Feature pipeline engineering                 Complete
  Clustering method experimentation & selection  Complete
  Cluster pipeline engineering                 Complete
  Anomaly detection (as needed)                Not needed
  Model Development, Training, Evaluation      Complete
  Test harness development                     Complete
  PoC integration and demonstration            Complete
  Trial runs and evaluation                    Complete
  Demonstrate PoC capability                   Complete

Phase III: April 19, 2023 - June 16, 2023
  Live data ingestion                          Complete
  Model execution                              Complete
  Cluster evaluation                           Complete
  Critical Method documentation                Complete
  Technical Report Document                    In progress
  Deliver final report with findings           Not started