ML23262B165
| Person / Time | |
|---|---|
| Issue date: | 03/29/2023 |
| From: | Chang Y, Friedman C, Mishkin A, Polra S, Pringle S, Tanya Smith, Vasquez G NRC/RES/DRA/HFRB, Sphere of Influence |
Text
Machine Learning Demo Wednesday
Prioritizing Inspections using ML
Wednesday, March 29, 2023
Alec Mishkin, Guillermo Vasquez, Stuti Polra, Casey Friedman, Scott Pringle, Theresa Smith
Agenda
- Data Presentation
- Neural Topic Modeling
- Actionable Insights
- Tool Analysis
- Progress and Next Steps
Topic Modeling
Unsupervised discovery of topics from a collection of text documents
Latent Dirichlet Allocation (LDA)
- Describe a document as a bag-of-words
- Model each document as a mixture of latent topics
- Topic is represented as a distribution over the words in the vocabulary
Variants of Topic Modeling can be explored
- Text-embeddings from Language Models and Neural Topic Modeling can be used to improve the quality of results
[Figure: a corpus of documents is passed to a topic model, which produces topics (the word/phrase frequencies that distinguish and characterize each topic), clusters of documents by topic, and the proportion of topics in each document (for example, 50% topic 1, 25% topic 2, 25% topic 3).]
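The bag-of-words and mixture ideas can be illustrated with a minimal sketch. The topic-word probabilities and the sample sentence below are invented for illustration; only the 50% / 25% / 25% mixture comes from the slide. The sketch shows how a document's word probabilities follow from its topic proportions, and it does not fit an LDA model.

```python
from collections import Counter

def bag_of_words(text):
    """Lowercase, split on whitespace, and count tokens (word order is discarded)."""
    return Counter(text.lower().split())

# Hypothetical topic-word distributions; each sums to 1 over a toy vocabulary.
topics = {
    "topic 1": {"valve": 0.5, "pump": 0.3, "fire": 0.1, "flood": 0.1},
    "topic 2": {"valve": 0.1, "pump": 0.1, "fire": 0.7, "flood": 0.1},
    "topic 3": {"valve": 0.1, "pump": 0.1, "fire": 0.1, "flood": 0.7},
}

# Document-topic mixture taken from the slide's illustration.
mixture = {"topic 1": 0.50, "topic 2": 0.25, "topic 3": 0.25}

def word_probability(word, mixture, topics):
    """p(word | document) = sum over topics of p(topic | document) * p(word | topic)."""
    return sum(weight * topics[name].get(word, 0.0) for name, weight in mixture.items())

counts = bag_of_words("Pump valve leak near pump room")
doc_word_probs = {
    word: word_probability(word, mixture, topics)
    for word in ["valve", "pump", "fire", "flood"]
}
print(counts)
print(doc_word_probs)
```

Fitting LDA means estimating both the mixture and the topic-word distributions from the word counts alone; here they are simply given.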
Neural Topic Modeling
BERTopic
Generate document embeddings with pre-trained transformer-based language models
Reduce dimensionality of document embeddings
Cluster document embeddings
Generate topic representations with class-based TF-IDF procedure to overcome centroid-based perspective
Coherent and diverse topics
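The class-based TF-IDF step can be sketched in plain Python. This is a minimal sketch, assuming the clusters have already been produced by the embedding, dimensionality reduction, and clustering steps. It scores each term as its frequency in a cluster times log(1 + A / frequency across all clusters), where A is the average number of words per cluster, following the formula published for BERTopic. The library's own implementation may differ in normalization details, and the sample findings are invented.

```python
import math
from collections import Counter

def c_tf_idf(docs_by_cluster):
    """Score terms per cluster: tf(term, cluster) * log(1 + A / tf(term, all clusters))."""
    # Treat all documents in a cluster as one long document.
    cluster_counts = {
        cluster: Counter(" ".join(docs).lower().split())
        for cluster, docs in docs_by_cluster.items()
    }
    total_counts = Counter()
    for counts in cluster_counts.values():
        total_counts.update(counts)
    # A: average number of words per cluster.
    avg_words = sum(total_counts.values()) / len(cluster_counts)
    return {
        cluster: {
            term: tf * math.log(1 + avg_words / total_counts[term])
            for term, tf in counts.items()
        }
        for cluster, counts in cluster_counts.items()
    }

def top_terms(term_scores, n=3):
    """Highest-scoring terms, ties broken alphabetically."""
    return sorted(term_scores, key=lambda t: (-term_scores[t], t))[:n]

# Hypothetical clustered findings, for illustration only.
clusters = {
    0: ["fire barrier degraded", "fire door failed inspection"],
    1: ["flood seal degraded", "flood barrier inspection missed"],
}
scores = c_tf_idf(clusters)
for cluster_id in clusters:
    print(cluster_id, top_terms(scores[cluster_id]))
```

Terms shared by both clusters ("barrier", "degraded", "inspection") score lower than terms concentrated in one cluster, which is what makes the resulting keywords descriptive of a topic.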
BERTopic: Modularity
BERTopic offers modularity at each step of the process
- Embedding
- Dimensionality Reduction
- Clustering
- Tokenizer
- Weighting scheme
- Representation tuning
Each component can be easily swapped according to the goals and to accommodate the data
BERTopic: Representing a Topic
Refine how a topic is represented and interpreted
KeyBERT
- Extract keywords for each topic and a set of representative documents per topic
- Compare the embeddings of the keywords and the representative documents
Maximal Marginal Relevance
- Reduce redundancy and improve diversity of keywords
Part of Speech
- Extract keywords for each topic and documents that contain the keywords
- Use a part-of-speech tagger to generate new candidate keywords
Zero-shot Classification
- Assign candidate labels to topics given keywords for each topic
Text Generation and Prompts
- Create topic labels based on representative documents and keywords
- Hugging Face Transformers, OpenAI GPT, co:here, LangChain
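Maximal Marginal Relevance can be sketched as a greedy selection that trades relevance to the topic against similarity to keywords already chosen. The 2-D vectors below are made-up stand-ins for real embeddings, and the `diversity` parameter name is an assumption chosen for this sketch rather than a confirmed library signature.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mmr(topic_vec, candidates, top_n=2, diversity=0.5):
    """Greedily pick keywords that are relevant to the topic but not redundant.

    candidates maps each keyword to its embedding vector.
    """
    selected = []
    remaining = sorted(candidates)  # sorted so ties resolve deterministically
    while remaining and len(selected) < top_n:
        def score(word):
            relevance = cosine(candidates[word], topic_vec)
            redundancy = max(
                (cosine(candidates[word], candidates[s]) for s in selected),
                default=0.0,
            )
            return (1 - diversity) * relevance - diversity * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical embeddings: "fire" and "fires" point in nearly the same direction.
topic_vec = [0.8, 0.6]
keywords = {"fire": [0.98, 0.2], "fires": [1.0, 0.0], "barrier": [0.0, 1.0]}

print(mmr(topic_vec, keywords, diversity=0.0))  # relevance only
print(mmr(topic_vec, keywords, diversity=0.5))  # penalizes near-duplicates
```

With `diversity=0.0` the two near-duplicate keywords are both selected; raising it swaps the second pick for a less similar keyword.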
BERTopic: Topic Modeling Variations
Topic Distributions
- Approximate topic distributions per document when using a hard-clustering approach
Topics per Class (Category)
- Extract topic representations for each class or category of interest from topic model
Dynamic Topic Modeling
- Analyze how the representation of a topic changes over time
Hierarchical Topic Modeling
- Obtain insights into which topics are similar and sub-topics that may exist in data
Online Topic Modeling
- Continue updating topic model with new data
Semi-supervised Topic Modeling
- Steer dimensionality reduction of document embeddings into a space close to the topic labels for some or all documents
Guided (Seeded) Topic Modeling
- Predefined keywords or phrases for the topic model to converge to by comparing document embeddings with seeded topic embeddings
Supervised Topic Modeling
- If topic labels are already known, discover relationships between documents and topics
Manual Topic Modeling
- Find topic representations for document topic labels that are already known and use other topic modeling variations with this model
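The topics-per-class and dynamic variations both re-describe an existing topic within a subset of its documents. A minimal sketch, assuming each document already carries a topic assignment plus a category and a year: it ranks terms by raw frequency within each (topic, group) pair as a simplified stand-in for a class-based TF-IDF weighting. The field names and records are hypothetical.

```python
from collections import Counter, defaultdict

def top_terms_by_group(records, group_key, n=2):
    """For each (topic, group) pair, return the n most frequent terms."""
    counts = defaultdict(Counter)
    for rec in records:
        counts[(rec["topic"], rec[group_key])].update(rec["text"].lower().split())
    return {
        key: [term for term, _ in sorted(c.items(), key=lambda kv: (-kv[1], kv[0]))[:n]]
        for key, c in counts.items()
    }

# Hypothetical records: one topic observed at two sites over two years.
records = [
    {"topic": 0, "year": 2021, "site": "A", "text": "fire door door degraded"},
    {"topic": 0, "year": 2022, "site": "B", "text": "fire barrier barrier missing"},
    {"topic": 0, "year": 2022, "site": "A", "text": "barrier fire"},
]

by_year = top_terms_by_group(records, "year")  # dynamic view: grouping by time
by_site = top_terms_by_group(records, "site")  # per-class view: grouping by category
print(by_year)
print(by_site)
```

Grouping by year shows the same topic drifting from "door" to "barrier" wording, which is the kind of change the dynamic variation is meant to surface.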
BERTopic: Preliminary Results
- Input Text: inspection findings titles (13-258 words, avg. 85 words)
- all-MiniLM-L6-v2 model used for embeddings (max sequence length: 256, embedding dimension: 384)
- Default BERTopic configuration + Maximal Marginal Relevance to tune topic representations
BERTopic: Preliminary Results
- Input Text: inspection findings item introductions (44-11,670 words, avg. 1,649 words)
- all-MiniLM-L6-v2 model used for embeddings (max sequence length: 256, embedding dimension: 384)
- Default BERTopic configuration + Maximal Marginal Relevance to tune topic representations
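The stated maximum sequence length matters for the longer input. The slides do not say how long inputs were handled, so this rough calculation assumes text beyond the 256-token limit is truncated rather than chunked, and that every word costs at least one token. Under those assumptions it gives an upper bound on the share of a document that reaches the embedding.

```python
def max_coverage(n_words, max_tokens=256):
    """Upper bound on the fraction of a document the encoder can see.

    Assumes no chunking and at least one token per word, so real
    coverage of long documents can only be lower.
    """
    return min(1.0, max_tokens / n_words)

# Word counts taken from the two preliminary-results slides.
for label, n_words in [
    ("titles, average", 85),
    ("item introductions, average", 1649),
    ("item introductions, longest", 11670),
]:
    print(f"{label}: at most {max_coverage(n_words):.1%} of the text is embedded")
```

If truncation does apply, splitting long introductions into passages before embedding is one option to evaluate during model tuning.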
Actionable Insights
Actionable Insights: Descriptions
- Zero-shot Classification
- Text Generation and Prompts
- Topic Distributions
- Topics per Class (Category)
- Dynamic Topic Modeling
- Hierarchical Topic Modeling
- Online Topic Modeling
Leverage Fine Tuning to Analyze Safety Clusters
- Once Safety Clusters have been defined, we can apply additional ML techniques to gain insights into the overarching themes in the Safety Cluster.
Using Topics to Name the Safety Clusters
- This can aid the SMEs in characterizing the Safety Issues
Actionable Insights: Unsupervised vs Supervised
Actionable Insights Using Seeds
- Topic Modeling can be influenced
- This would be more of a supervised approach, telling the system what the safety clusters are, then grouping inspection reports based on those
- Different from discovering safety clusters
- May be a way to do both and compare
- Supervised, semi-supervised, guided, manual
Forcing Clusters
This capability is only available for some models - BERTopic provides them
Inspection Procedures
1. Initial Management Meeting Materials Licensees
2. Quality Assurance Program Implementation During Construction and Pre-Construction Activities
3. Quality Assurance Implementation Inspection
4.
Reactor safety inspection items
1. Adverse Weather Protection
2. Equipment Alignments
3. Fire Protection
4. Flood Protection
5. Licensed Operator Requalification
6. Maintenance Implementation
7. Maintenance Risk Assessments and Emergent Work Evaluation
8. Personnel Performance During Nonroutine Plant Evolutions and Events
9. Operability Evaluations
10. Operator Workarounds
11. Post-maintenance Testing
12.
Reactors
1. POIN
2. FAR
3. DIAB
4. GINN
5. SUR
6. HAR
7. FERM
8.
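Forcing clusters from predefined categories can be sketched with a deliberately simple stand-in. Instead of comparing document embeddings with seeded topic embeddings, this sketch assigns each text to the seed whose keywords it overlaps most. The three labels come from the reactor safety inspection items listed above; the keyword sets and sample texts are hypothetical.

```python
def assign_to_seed(text, seeds, min_overlap=1):
    """Return the seed label whose keywords overlap the text most, or None."""
    tokens = set(text.lower().split())
    best_label, best_overlap = None, 0
    for label, keywords in seeds.items():
        overlap = len(tokens & keywords)
        if overlap > best_overlap:
            best_label, best_overlap = label, overlap
    return best_label if best_overlap >= min_overlap else None

# Labels from the slide; keyword sets are illustrative assumptions.
seeds = {
    "Fire Protection": {"fire", "barrier", "extinguisher"},
    "Flood Protection": {"flood", "seal", "drain"},
    "Adverse Weather Protection": {"weather", "storm", "freeze"},
}

for text in [
    "degraded fire barrier in cable room",
    "flood seal missing",
    "operator requalification exam",
]:
    print(text, "->", assign_to_seed(text, seeds))
```

Texts that match no seed return `None`, so they stay available for unsupervised discovery. That is one way to run the forced and the discovered groupings side by side and compare them.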
Questions on direction for clustering and insights
In addition to the insights discussed last week:
- Within a reactor
- Across reactors
- Classification
Would these types of insights be beneficial?
Tool Analysis
Survey of Platforms and Tools
Platform Supplied Algorithms
Pros
- Integrated into user experience
- Can be combined into standard pipelines
- Work well in common cases (social media)
Cons
- Do not handle technical domains well
- Black box algorithms (unpublished models)
- Difficult to tailor pipelines
- Limited selection of algorithms
- Limited selection of pre-training datasets
- Advanced algorithms not available
- Strengths in one area (Topic Modeling) may not translate to other areas (Classification, clustering, regression, anomaly detection)
Python Library Algorithms
Pros
- Flexibility to select from multiple algorithms
- Flexibility to select from multiple pre-trainings
- Advanced models available
- Flexibility to address highly technical domains
- Flexibility to leverage internal parameters
- Work well in Machine Learning Notebooks
- Can be deployed in any cloud or on premises
- Limited coding knowledge required
Cons
- Requires some knowledge of Python
- Algorithms must be researched and selected
Based on preliminary results, we expect a Notebook based approach using Neural Topic Modeling to be the top candidate
Candidate Evaluation Criteria
Model Support
- LDA Topic Modeling
- Neural Topic Modeling
- Other relevant approaches for future use/experiments with text data
Processing Support
- Notebook Integration
- Text Extraction
- Text Pre-processing
- Text embedding
- Visual Programming
Python Libraries
- Pandas, NumPy, BeautifulSoup, NLTK, spaCy, Gensim, pyLDAvis, Matplotlib, BERTopic, hdbscan, scikit-learn, SciPy, Hugging Face transformers, PyTorch, sentence-transformers

| Criteria | Weight |
|---|---|
| Neural Topic Modeling | .27 |
| LDA | .17 |
| Visual Programming | .14 |
| Text Pre-processing | .11 |
| Text Embedding | .10 |
| Other Text Approaches | .09 |
| Notebook Integration | .08 |
| Text Extraction | .03 |

Are these criteria and weights in line with the goals of the project?
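One way the weights could be applied in the trade study is a weighted average of per-criterion ratings. The weights below are the ones on the slide (they sum to 0.99, presumably from rounding, so the sketch divides by the total). The 0-to-1 rating scale and both candidates are hypothetical, since the evaluation factor ranges were still being defined.

```python
WEIGHTS = {
    "Neural Topic Modeling": 0.27,
    "LDA": 0.17,
    "Visual Programming": 0.14,
    "Text Pre-processing": 0.11,
    "Text Embedding": 0.10,
    "Other Text Approaches": 0.09,
    "Notebook Integration": 0.08,
    "Text Extraction": 0.03,
}

def weighted_score(ratings, weights=WEIGHTS):
    """Weighted average of ratings in [0, 1]; criteria with no rating count as 0."""
    total_weight = sum(weights.values())
    return sum(w * ratings.get(name, 0.0) for name, w in weights.items()) / total_weight

# Hypothetical candidates, for illustration only.
candidate_a = {name: 1.0 for name in WEIGHTS}              # supports everything
candidate_b = {"LDA": 1.0, "Visual Programming": 1.0}      # LDA plus a visual designer only

print(f"Candidate A: {weighted_score(candidate_a):.2f}")
print(f"Candidate B: {weighted_score(candidate_b):.2f}")
```

Because Neural Topic Modeling carries the largest weight, a candidate without it cannot score above roughly 0.73 on this scale, however well it rates elsewhere.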
Platforms and Tools: Evaluation Criteria
- Notebook Integration: the ability to launch and scale a Python notebook on the platform
- Text Extraction: the ability to extract text from PDFs
- Text Pre-processing: the ability to clean text data before passing it to a machine learning algorithm
  - remove punctuation, emails, URLs, specific string patterns, correct spelling errors
  - filter stopwords, stemming & lemmatization, extract linguistic features
- Text Embedding: the ability to represent a text document as a vector of numbers using various embedding techniques and pre-trained models
- Topic Modeling: LDA, Neural approaches
- Other relevant approaches available on the platform for future use/experiments with text data
  - Text classification
  - Summarization, key phrase extraction
  - Named entity recognition and linking
  - Question-Answering
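Several of the pre-processing steps named above can be sketched with the Python standard library alone. This is a minimal sketch: the stopword list is a small illustrative subset, the sample sentence is invented, and spelling correction, stemming, and lemmatization are left out because they typically rely on a library such as NLTK or spaCy.

```python
import re

# Illustrative subset only; a real pipeline would use a fuller stopword list.
STOPWORDS = {"the", "a", "an", "of", "to", "and", "at", "in", "was", "is"}

def preprocess(text):
    """Lowercase, strip emails, URLs, and punctuation, then drop stopwords."""
    text = text.lower()
    text = re.sub(r"\S+@\S+", " ", text)                 # email addresses
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # URLs
    text = re.sub(r"[^a-z0-9\s]", " ", text)             # punctuation and symbols
    return [tok for tok in text.split() if tok not in STOPWORDS]

sample = (
    "The inspector (j.doe@example.com) noted a degraded fire barrier; "
    "see https://example.com/report."
)
tokens = preprocess(sample)
print(tokens)
```

Emails and URLs are removed before punctuation so that their fragments do not survive as stray tokens.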
Google AI
Google AI has many products (below are relevant products)
- Vertex AI: Unified platform for training, hosting, and managing models
- Natural Language AI: Sentiment analysis and classification of unstructured text
- Document AI: Machine learning and AI to unlock insights from your documents
- Contact Center AI: AI model for speaking with customers and assisting human agents
Out of the box capabilities:
- Vertex AI: Supervised learning tasks for image, tabular, text, and video domains
- Natural Language AI: Sentiment analysis, Entity analysis, Entity Sentiment analysis, Syntactic analysis, Content classification
- Document AI: Collect structured data from unstructured text data (able to fill in missing data using Google's knowledge graph)
- Contact Center AI: Only product I could find with topic modeling out of the box. Requires 10,000 conversations and both customer and agent responses for training
These out of the box capabilities do not align with the unsupervised clustering approach envisioned for this task
MathWorks MATLAB
- Simple function calls and interactive notebook-like environment
- Ability to call Python libraries from MATLAB environment and import/export deep learning frameworks with Open Neural Network Exchange (ONNX) format
Text Analytics Toolbox
- Text extraction: supports extraction from various formats (text, PDF, HTML, CSV, Excel, and Word)
- Text pre-processing: remove punctuation, URLs, correct spelling errors, filter stopwords, stemming & lemmatization, extract linguistic features
- Text embedding: word and n-gram counting, word2vec, CBOW, FastText, GloVe
- Topic modeling: LDA
- Other relevant offerings: document summarization, text classification, and keyword extraction with limited deep learning models
Basic NLP functionalities offered by MATLAB, but more advanced methods will likely be needed to obtain actionable results from technical text data
Any code written will be specific to MATLAB and not easily portable to other platforms or a Python notebook
Microsoft Azure AI + ML
- Many AI/ML Products: Applied AI Services, Cognitive Services, Form Recognizer, Cognitive Search, OpenAI Services, Machine Learning and more
- Various environments supported: visual no-code (Azure ML Designer), code-first (Jupyter Notebooks, Python SDK, CLI)
Azure Machine Learning
- Text extraction: supported through Form Recognizer if needed, else import data from an Azure Datastore and transform via Designer
- Text pre-processing: remove stopwords, regular expressions for string matching, lemmatization, case normalization, remove special characters, patterns, emails or URLs
- Text embedding: N-gram features, Word2Vec, FastText, GloVe
- Topic modeling: LDA
- Other relevant offerings: text classification and named entity recognition via Designer; many NLP tasks (key phrase extraction, entity recognition and linking, summarization, question-answering) supported via Cognitive Services
Basic NLP functionalities and LDA topic modeling with a visual no-code environment or notebook approach to explore more advanced methodologies
Amazon AWS SageMaker
- Many AI/ML Products: Comprehend, Textract, Augmented AI and more
- Various SageMaker Environments: visual no-code (Canvas), code-first (Studio, Notebook Instances, Studio Lab)
AWS SageMaker Built-in functionalities
- Text extraction: supported through Amazon Comprehend if needed, but data can be provided in various formats (text, csv, json) for use within SageMaker
- Text pre-processing: not built-in, but can be imported from libraries as needed
- Text embedding: BlazingText (for learning CBOW, skip-gram or batch skip-gram embeddings with Word2Vec; learning character n-gram embeddings), Object2Vec (for learning embeddings with sentence pairs)
- Topic modeling: LDA and Neural Topic Modeling
- Other relevant offerings: text classification, summarization, entity recognition and relationship extraction, question-answering
Some advanced NLP functionalities, LDA and neural topic modeling with a visual no-code environment or notebook approach to explore more advanced methodologies
Questions on Tool Selection Criteria
- What do you see as the primary role of the people executing this type of analysis?
- Are there longer-term aspirations for ML initiatives?
- Minimal Python coding in a Notebook environment enables many more options
Progress
SOW Task Status

| Phase I: March 6, 2023 - April 9, 2023 | Status |
|---|---|
| Describe the Problem | Complete |
| Search the Literature | Complete |
| Select Candidates | Complete |
| Select Evaluation Factors | In progress |
| Develop evaluation factor weights | In progress |
| Define evaluation factor ranges | In progress |
| Perform assessment | In progress |
| Report Results | Not started |
| Deliver Trade study report | Not started |

| Phase II: March 20, 2023 - May 7, 2023 | Status |
|---|---|
| Platform/system selection and installation | Not started |
| Data acquisition and preparation | In progress |
| Feature pipeline engineering | In progress |
| Clustering method experimentation & selection | In progress |
| Cluster pipeline engineering | In progress |
| Anomaly detection (as needed) | Not started |
| Model Development, Training, Evaluation | Not started |
| Test harness development | Not started |
| PoC integration and demonstration | Not started |
| Trial runs and evaluation | Not started |
| Demonstrate PoC capability | Not started |

| Phase III: April 19, 2023 - June 16, 2023 | Status |
|---|---|
| Live data ingestion | Not started |
| Model execution | Not started |
| Cluster evaluation | Not started |
| Critical Method documentation | Not started |
| Technical Report Document | Not started |
| Deliver final report with findings | Not started |
Next Steps
- Update evaluation criteria and weighting factors
  - Based on feedback from this session
- Perform candidate evaluations
- Produce Trade Study Report
- Begin model tuning