ML23262B165

Meeting Slides 20230329-final
Person / Time
Issue date: 03/29/2023
From: Chang Y, Friedman C, Mishkin A, Polra S, Pringle S, Tanya Smith, Vasquez G
NRC/RES/DRA/HFRB, Sphere of Influence
To:


Text

Machine Learning Demo Wednesday
Prioritizing Inspections using ML
Wednesday, March 29, 2023
Alec Mishkin, Guillermo Vasquez, Stuti Polra, Casey Friedman, Scott Pringle, Theresa Smith

Agenda 2

- Data Presentation
- Neural Topic Modeling
- Actionable Insights
- Tool Analysis
- Progress and Next Steps

Topic Modeling 3

Unsupervised discovery of topics from a collection of text documents

Latent Dirichlet Allocation (LDA)

- Describe a document as a bag-of-words
- Model each document as a mixture of latent topics
- A topic is represented as a distribution over the words in the vocabulary
- Variants of Topic Modeling can be explored
- Text embeddings from Language Models and Neural Topic Modeling can be used to improve the quality of results

Topic Modeling 4

[Diagram: a corpus of documents feeds a topic model, which produces topics (e.g., 50% topic 1, 25% topic 2, 25% topic 3 for a document), clusters of documents by topic, the proportion of topics in each document, and the word/phrase frequencies that distinguish and characterize each topic.]
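For reference, a minimal sketch of LDA along these lines using the gensim library; the toy documents and topic count are illustrative assumptions, not project data.

```python
# Minimal LDA sketch with gensim; toy corpus and topic count are illustrative.
from gensim import corpora
from gensim.models import LdaModel

documents = [
    "licensee failed to perform adequate flood protection inspection",
    "fire protection barriers were degraded during maintenance activities",
    "operator requalification training records were found incomplete",
]

# Bag-of-words representation: tokenize, build a dictionary, and count words
texts = [doc.lower().split() for doc in documents]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Each document is modeled as a mixture of latent topics;
# each topic is a distribution over the words in the vocabulary.
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=10, random_state=42)

for topic_id, words in lda.print_topics(num_words=5):
    print(topic_id, words)

# Proportion of topics in the first document
print(lda.get_document_topics(corpus[0]))
```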

Neural Topic Modeling 5

BERTopic

Generate document embeddings with pre-trained transformer-based language models

Reduce dimensionality of document embeddings

Cluster document embeddings

Generate topic representations with a class-based TF-IDF (c-TF-IDF) procedure to overcome the limitations of a centroid-based perspective

Coherent and diverse topics

BERTopic offers modularity at each step of the process

- Embedding

- Dimensionality Reduction

- Clustering

- Tokenizer

- Weighting scheme

- Representation tuning

Each component can be easily swapped according to the goals and to accommodate the data.

BERTopic: Modularity 7
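A sketch of how this modular pipeline can be assembled with the BERTopic library; the specific component choices and parameter values below are assumptions, not the project's configuration.

```python
# Each BERTopic component below can be swapped independently; values are illustrative.
from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer

embedding_model = SentenceTransformer("all-MiniLM-L6-v2")            # 1. Embedding
umap_model = UMAP(n_neighbors=15, n_components=5, metric="cosine")   # 2. Dimensionality reduction
hdbscan_model = HDBSCAN(min_cluster_size=10, prediction_data=True)   # 3. Clustering
vectorizer_model = CountVectorizer(stop_words="english")             # 4. Tokenizer
ctfidf_model = ClassTfidfTransformer()                               # 5. Weighting scheme (c-TF-IDF)

topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer_model,
    ctfidf_model=ctfidf_model,
)

docs = [...]  # placeholder for the inspection-finding texts
# topics, probs = topic_model.fit_transform(docs)
```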

Refine how a topic is represented and interpreted

KeyBERT
- Extract keywords for each topic and a set of representative documents per topic
- Compare the embeddings of the keywords and the representative documents

Maximal Marginal Relevance
- Reduce redundancy and improve diversity of keywords

Part of Speech
- Extract keywords for each topic and documents that contain the keywords
- Use a part-of-speech tagger to generate new candidate keywords

Zero-shot Classification
- Assign candidate labels to topics given keywords for each topic

Text Generation and Prompts
- Create topic labels based on representative documents and keywords
- Hugging Face Transformers, OpenAI GPT, co:here, LangChain

BERTopic: Representing a Topic 8
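A sketch of how these representation options can be combined in BERTopic; chaining exactly these models and the diversity setting are illustrative choices, not the project's.

```python
# Chain several representation models; the Part of Speech model assumes the
# spaCy en_core_web_sm pipeline is installed.
from bertopic import BERTopic
from bertopic.representation import (
    KeyBERTInspired,
    MaximalMarginalRelevance,
    PartOfSpeech,
)

representation_model = [
    KeyBERTInspired(),                        # compare keyword and representative-document embeddings
    PartOfSpeech("en_core_web_sm"),           # POS-filtered candidate keywords
    MaximalMarginalRelevance(diversity=0.3),  # reduce redundancy, improve keyword diversity
]

topic_model = BERTopic(representation_model=representation_model)

docs = [...]  # placeholder for the inspection-finding texts
# topics, probs = topic_model.fit_transform(docs)
# Representations can also be refined after fitting:
# topic_model.update_topics(docs, representation_model=representation_model)
```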

Topic Distributions
- Approximate topic distributions per document when using a hard-clustering approach

Topics per Class (Category)
- Extract topic representations for each class or category of interest from the topic model

Dynamic Topic Modeling
- Analyze how the representation of a topic changes over time

Hierarchical Topic Modeling
- Obtain insights into which topics are similar and which sub-topics may exist in the data

Online Topic Modeling
- Continue updating the topic model with new data

Semi-supervised Topic Modeling
- Steer dimensionality reduction of document embeddings into a space close to the topic labels for some or all documents

Guided (Seeded) Topic Modeling
- Predefine keywords or phrases for the topic model to converge to by comparing document embeddings with seeded topic embeddings

Supervised Topic Modeling
- If topic labels are already known, discover relationships between documents and topics

Manual Topic Modeling
- Find topic representations for document topic labels that are already known and use other topic modeling variations with this model

BERTopic: Topic Modeling Variations 9
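A sketch of a few of these variations as exposed by the BERTopic API; docs, timestamps, and classes are placeholders for the project's data, and realistic document volumes are needed for the clustering to succeed.

```python
from bertopic import BERTopic

docs = [...]        # inspection-finding texts (placeholder)
timestamps = [...]  # e.g., one issue date per document (placeholder)
classes = [...]     # e.g., one reactor or procedure label per document (placeholder)

topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)

# Topic distributions per document, even with hard clustering
topic_distr, _ = topic_model.approximate_distribution(docs)

# Topics per class/category of interest
topics_per_class = topic_model.topics_per_class(docs, classes=classes)

# Dynamic topic modeling: topic representations over time
topics_over_time = topic_model.topics_over_time(docs, timestamps)

# Hierarchical topic modeling: similar topics and potential sub-topics
hierarchical_topics = topic_model.hierarchical_topics(docs)
```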

Input Text: inspection findings titles (13-258 words, avg. 85 words)
all-MiniLM-L6-v2 model used for embeddings (max sequence length: 256, embedding dimension: 384)

Default BERTopic configuration + Maximal Marginal Relevance to tune topic representations

BERTopic: Preliminary Results 10

Input Text: inspection findings item introductions (44-11,670 words, avg. 1,649 words)
all-MiniLM-L6-v2 model used for embeddings (max sequence length: 256, embedding dimension: 384)

Default BERTopic configuration + Maximal Marginal Relevance to tune topic representations

BERTopic: Preliminary Results 11
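A sketch of the configuration described on these two slides (all-MiniLM-L6-v2 embeddings with an otherwise default BERTopic, plus Maximal Marginal Relevance); the document list is a placeholder and the diversity value is an assumption.

```python
from bertopic import BERTopic
from bertopic.representation import MaximalMarginalRelevance
from sentence_transformers import SentenceTransformer

docs = [...]  # findings titles or item introductions (placeholder)

# all-MiniLM-L6-v2: 384-dimensional embeddings, 256-token max sequence length,
# so long item introductions are truncated unless chunked beforehand.
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

topic_model = BERTopic(
    embedding_model=embedding_model,
    representation_model=MaximalMarginalRelevance(diversity=0.3),  # diversity is an assumption
)
topics, probs = topic_model.fit_transform(docs)
print(topic_model.get_topic_info().head())
```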

Actionable Insights 12

Zero-shot Classification

Text Generation and Prompts

Topic Distributions

Topics per Class (Category)

Dynamic Topic Modeling

Hierarchical Topic Modeling

Online Topic Modeling

Leverage Fine Tuning to Analyze Safety Clusters

Once Safety Clusters have been defined, we can apply additional ML techniques to gain insights into the overarching themes in the Safety Cluster.

Actionable Insights: Descriptions 13

Using Topics to Name the Safety Clusters
This can aid the SMEs in characterizing the Safety Issues.
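One possible way to implement this naming step, sketched with a Hugging Face zero-shot classification pipeline; the topic keywords and candidate safety-cluster labels below are hypothetical.

```python
from transformers import pipeline

# Hypothetical keywords produced by the topic model for one safety cluster
topic_keywords = "flood seals, watertight doors, external flood protection, barriers"

# Hypothetical candidate labels that SMEs might supply
candidate_labels = ["Flood Protection", "Fire Protection", "Maintenance Effectiveness"]

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
result = classifier(topic_keywords, candidate_labels)
print(result["labels"][0], result["scores"][0])  # highest-scoring name for the cluster
```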

Actionable Insights Using Seeds

Topic Modeling can be influenced

This would be more of a supervised approach: telling the system what the safety clusters are, then grouping inspection reports based on them

Different from discovering safety clusters

May be a way to do both and compare

Supervised, semi-supervised, guided, manual

Actionable Insights: Unsupervised vs Supervised 14

Forcing Clusters

Inspection Procedures

Materials Licensees:
1. Initial Management Meeting
2. Quality Assurance Program Implementation During Construction and Pre-Construction Activities
3. Quality Assurance Implementation Inspection
4.

Reactor safety inspection items:
1. Adverse Weather Protection
2. Equipment Alignments
3. Fire Protection
4. Flood Protection
5. Licensed Operator Requalification
6. Maintenance Implementation
7. Maintenance Risk Assessments and Emergent Work Evaluation
8. Personnel Performance During Nonroutine Plant Evolutions and Events
9. Operability Evaluations
10. Operator Workarounds
11. Post-maintenance Testing
12.

Reactors:
1. POIN
2. FAR
3. DIAB
4. GINN
5. SUR
6. HAR
7. FERM
8.

This capability is only available for some models; BERTopic provides it.
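A sketch of how such forced clusters could be expressed as guided (seeded) topic modeling in BERTopic; the seed lists are abbreviated from the examples above and the exact keyword phrasing is an assumption.

```python
from bertopic import BERTopic

# Abbreviated seed lists based on the reactor safety inspection items above;
# real seeds would come from SMEs or the inspection procedures themselves.
seed_topic_list = [
    ["adverse weather protection", "flood protection", "fire protection"],
    ["maintenance implementation", "maintenance risk assessments", "post-maintenance testing"],
    ["operability evaluations", "operator workarounds", "licensed operator requalification"],
]

docs = [...]  # inspection-finding texts (placeholder)

topic_model = BERTopic(seed_topic_list=seed_topic_list)
# topics, probs = topic_model.fit_transform(docs)
```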

In addition to the insights discussed last week:
- Within a reactor
- Across reactors
- Classification

Would these types of insights be beneficial?

Questions on direction for clustering and insights 15

Tool Analysis 16

Platform-Supplied Algorithms

Pros

Integrated into user experience

Can be combined into standard pipelines

Work well in common cases (social media)

Cons

Do not handle technical domains well

Black box algorithms (unpublished models)

Difficult to tailor pipelines

Limited selection of algorithms

Limited selection of pre-training datasets

Advanced algorithms not available

Strengths in one area (Topic Modeling) may not translate to other areas (Classification, clustering, regression, anomaly detection)

Python Library Algorithms

Pros

Flexibility to select from multiple algorithms

Flexibility to select from multiple pre-trainings

Advanced models available

Flexibility to address highly technical domains

Flexibility to leverage internal parameters

Work well in Machine Learning Notebooks

Can be deployed in any cloud or on premises

Limited coding knowledge required

Cons

Requires some knowledge of Python

Algorithms must be researched and selected

Survey of Platforms and Tools 17

Based on preliminary results, we expect a notebook-based approach using Neural Topic Modeling to be the top candidate.

Candidate Evaluation Criteria 18

Model Support
- LDA Topic Modeling
- Neural Topic Modeling
- Other relevant approaches for future use/experiments with text data

Processing Support
- Notebook Integration
- Text Extraction
- Text Pre-processing
- Text Embedding
- Visual Programming

Python Libraries: Pandas, NumPy, BeautifulSoup, NLTK, spaCy, Gensim, pyLDAvis, Matplotlib, BERTopic, hdbscan, scikit-learn, SciPy, Hugging Face Transformers, PyTorch, sentence-transformers

Criteria and weights:
- Neural Topic Modeling: .27
- LDA: .17
- Visual Programming: .14
- Text Pre-processing: .11
- Text Embedding: .10
- Other Text Approaches: .09
- Notebook Integration: .08
- Text Extraction: .03

Are these criteria and weights in line with the goals of the project?
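For illustration, a small sketch of how these weights could roll up into a single score per candidate platform; the per-criterion ratings are made up.

```python
# Criteria weights from the slide; the 0-5 ratings for a candidate are hypothetical.
weights = {
    "Neural Topic Modeling": 0.27,
    "LDA": 0.17,
    "Visual Programming": 0.14,
    "Text Pre-processing": 0.11,
    "Text Embedding": 0.10,
    "Other Text Approaches": 0.09,
    "Notebook Integration": 0.08,
    "Text Extraction": 0.03,
}

candidate_ratings = {
    "Neural Topic Modeling": 5, "LDA": 4, "Visual Programming": 2,
    "Text Pre-processing": 4, "Text Embedding": 5, "Other Text Approaches": 4,
    "Notebook Integration": 5, "Text Extraction": 3,
}

weighted_score = sum(weights[c] * candidate_ratings[c] for c in weights)
print(round(weighted_score, 2))
```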

Notebook Integration: the ability to launch and scale a python notebook on the platform

Text Extraction: the ability to extract text from PDFs

Text Pre-processing: the ability to clean text data before passing it to a machine learning algorithm (remove punctuation, emails, URLs, and specific string patterns; correct spelling errors; filter stopwords; apply stemming and lemmatization; extract linguistic features); a brief code sketch of this and Text Embedding follows these definitions

Text Embedding: the ability to represent a text document as a vector of numbers using various embedding techniques and pre-trained models

Topic Modeling: LDA, Neural approaches

Other relevant approaches: approaches available on the platform for future use/experiments with text data, such as text classification; summarization and key phrase extraction; named entity recognition and linking; question answering

Platforms and Tools: Evaluation Criteria 19
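As a concrete illustration of the Text Pre-processing and Text Embedding criteria, a minimal sketch using NLTK and sentence-transformers; the cleaning rules, sample sentence, and model choice are assumptions.

```python
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sentence_transformers import SentenceTransformer

nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

def preprocess(text: str) -> str:
    """Strip URLs, emails, and punctuation; lowercase; remove stopwords; lemmatize."""
    text = re.sub(r"https?://\S+", " ", text)         # URLs
    text = re.sub(r"\S+@\S+", " ", text)              # email addresses
    text = re.sub(r"[^a-zA-Z\s]", " ", text).lower()  # punctuation and digits
    lemmatizer = WordNetLemmatizer()
    stop = set(stopwords.words("english"))
    tokens = [lemmatizer.lemmatize(tok) for tok in text.split() if tok not in stop]
    return " ".join(tokens)

raw = "The licensee failed to maintain flood seals; see https://example.gov for details."
clean = preprocess(raw)

# Represent the cleaned text as a dense vector
model = SentenceTransformer("all-MiniLM-L6-v2")
embedding = model.encode(clean)
print(clean)
print(embedding.shape)  # (384,)
```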

Google AI 20

Google AI has many products (below are the relevant products)

Vertex AI - Unified platform for training, hosting, and managing models

Natural Language AI - Sentiment analysis and classification of unstructured text

Document AI - Machine learning and AI to unlock insights from your documents

Contact Center AI - AI model for speaking with customers and assisting human agents

Out-of-the-box capabilities:

Vertex AI - Supervised learning tasks for image, tabular, text, and video domain

Natural Language AI - Sentiment analysis, Entity analysis, Entity Sentiment analysis, Syntactic analysis, Content classification

Document AI - Collect structured data from unstructured text data (Able to fill in missing data using Google's knowledge graph)

Contact Center AI - The only product found with topic modeling out of the box; requires 10,000 conversations and both customer and agent responses for training.

These out-of-the-box capabilities do not align with the unsupervised clustering approach envisioned for this task.

Simple function calls and an interactive, notebook-like environment

Ability to call python libraries from MATLAB environment and import/export deep learning frameworks with Open Neural Network Exchange (ONNX) format

Text Analytics Toolbox

Text extraction: supports extraction from various formats (text, PDF, HTML, CSV, Excel, and Word)

Text pre-processing: remove punctuation and URLs, correct spelling errors, filter stopwords, stemming & lemmatization, extract linguistic features
Text embedding: word and n-gram counting, word2vec, CBOW, FastText, GloVe
Topic modeling: LDA
Other relevant offerings: document summarization, text classification, and keyword extraction with limited deep learning models

Basic NLP functionalities offered by MATLAB, but more advanced methods will likely be needed to obtain actionable results from technical text data

Any code written will be specific to MATLAB and not easily portable to other platforms or a Python notebook.

MathWorks MATLAB 21

Many AI/ML Products: Applied AI Services, Cognitive Services, Form Recognizer, Cognitive Search, OpenAI Services, Machine Learning and more

Various environments supported: visual no-code (Azure ML Designer), code-first (Jupyter Notebooks, Python SDK, CLI)

Azure Machine Learning
Text extraction: supported through Form Recognizer if needed; otherwise import data from an Azure Datastore and transform via Designer
Text pre-processing: remove stopwords, regular expressions for string matching, lemmatization, case normalization, remove special characters, patterns, emails or URLs
Text embedding: N-gram features, Word2Vec, FastText, GloVe
Topic modeling: LDA
Other relevant offerings: text classification and named entity recognition via Designer; many NLP tasks (key phrase extraction, entity recognition and linking, summarization, question-answering) supported via Cognitive Services

Basic NLP functionalities and LDA topic modeling with a visual no-code environment or notebook approach to explore more advanced methodologies

Microsoft Azure AI + ML 22

Many AI/ML Products: Comprehend, Textract, Augmented AI and more

Various SageMaker Environments: visual no-code (Canvas), code-first (Studio, Notebook Instances, Studio Lab)

AWS SageMaker built-in functionalities
Text extraction: supported through Amazon Comprehend if needed, but data can be provided in various formats (text, CSV, JSON) for use within SageMaker
Text pre-processing: not built-in, but can be imported from libraries as needed
Text embedding: BlazingText (for learning CBOW, skip-gram or batch skip-gram embeddings with Word2Vec; learning character n-gram embeddings), Object2Vec (for learning embeddings with sentence pairs)

Topic modeling: LDA and Neural Topic Modeling
Other relevant offerings: text classification, summarization, entity recognition and relationship extraction, question-answering

Some advanced NLP functionalities, LDA and neural topic modeling with a visual no-code environment or notebook approach to explore more advanced methodologies

Amazon AWS SageMaker 23

What do you see as the primary role of the people executing this type of analysis?

Are there longer-term aspirations for ML initiatives?

Minimal Python coding in a Notebook environment enables many more options

Questions on Tool Selection Criteria 24

Progress 25

SOW Task Status 26

Phase I: March 6, 2023 - April 9, 2023
- Describe the Problem: Complete
- Search the Literature: Complete
- Select Candidates: Complete
- Select Evaluation Factors: In progress
- Develop evaluation factor weights: In progress
- Define evaluation factor ranges: In progress
- Perform assessment: In progress
- Report Results: Not started
- Deliver Trade Study Report: Not started

Phase II: March 20, 2023 - May 7, 2023
- Platform/system selection and installation: Not started
- Data acquisition and preparation: In progress
- Feature pipeline engineering: In progress
- Clustering method experimentation & selection: In progress
- Cluster pipeline engineering: In progress
- Anomaly detection (as needed): Not started
- Model development, training, evaluation: Not started
- Test harness development: Not started
- PoC integration and demonstration: Not started
- Trial runs and evaluation: Not started
- Demonstrate PoC capability: Not started

Phase III: April 19, 2023 - June 16, 2023
- Live data ingestion: Not started
- Model execution: Not started
- Cluster evaluation: Not started
- Critical method documentation: Not started
- Technical report document: Not started
- Deliver final report with findings: Not started

Next Steps 27

Next Steps 28

- Update evaluation criteria and weighting factors, based on feedback from this session
- Perform candidate evaluations
- Produce Trade Study Report
- Begin model tuning