ML23262B165

Meeting Slides 20230329-final
Person / Time
Issue date: 03/29/2023
From: Chang Y, Friedman C, Mishkin A, Polra S, Pringle S, Tanya Smith, Vasquez G
NRC/RES/DRA/HFRB, Sphere of Influence

Machine Learning Demo Wednesday
Prioritizing Inspections using ML
Alec Mishkin, Guillermo Vasquez, Stuti Polra, Casey Friedman, Scott Pringle, Theresa Smith
Wednesday, March 29, 2023

Agenda

  • Data Presentation
  • Neural Topic Modeling
  • Actionable Insights
  • Tool Analysis
  • Progress and Next Steps

Topic Modeling

  • Unsupervised discovery of topics from a collection of text documents
  • Latent Dirichlet Allocation (LDA)
  • Describe a document as a bag-of-words
  • Model each document as a mixture of latent topics
  • Topic is represented as a distribution over the words in the vocabulary
  • Variants of Topic Modeling can be explored
  • Text-embeddings from Language Models and Neural Topic Modeling can be used to improve the quality of results

[Figure: a corpus of documents is passed through a topic model, which yields topics (word/phrase frequency that distinguishes and characterizes a topic), the proportion of topics in each document (e.g., 50% topic 1, 25% topic 2, 25% topic 3), and clusters of documents by topic.]
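The bag-of-words and mixture-of-topics ideas above can be illustrated with a minimal sketch. It is not an LDA implementation (nothing is learned here); it only shows how a document's word distribution follows from its topic proportions. The vocabulary, the topic-word probabilities, and the function names are hypothetical; the 50/25/25 split is taken from the slide's figure.

```python
from collections import Counter


def bag_of_words(text):
    """Lowercase, split on whitespace, and count words (word order is discarded)."""
    return Counter(text.lower().split())


def document_word_distribution(topic_proportions, topic_word_dists):
    """p(word | doc) = sum over topics of p(topic | doc) * p(word | topic)."""
    mixture = {}
    for topic, proportion in topic_proportions.items():
        for word, prob in topic_word_dists[topic].items():
            mixture[word] = mixture.get(word, 0.0) + proportion * prob
    return mixture


# Hypothetical topics: each is a probability distribution over a tiny vocabulary.
topic_word_dists = {
    "topic 1": {"pump": 0.5, "valve": 0.5},
    "topic 2": {"fire": 0.5, "barrier": 0.5},
    "topic 3": {"flood": 0.5, "seal": 0.5},
}
# Proportions from the slide's figure: 50% topic 1, 25% topic 2, 25% topic 3.
topic_proportions = {"topic 1": 0.50, "topic 2": 0.25, "topic 3": 0.25}

counts = bag_of_words("Pump valve pump fire")
word_probs = document_word_distribution(topic_proportions, topic_word_dists)
```

In an actual LDA fit, both the topic-word distributions and the per-document proportions would be inferred from the corpus rather than supplied by hand.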

Neural Topic Modeling

BERTopic

- Generate document embeddings with pre-trained transformer-based language models
- Reduce dimensionality of document embeddings
- Cluster document embeddings
- Generate topic representations with class-based TF-IDF procedure to overcome centroid-based perspective
- Coherent and diverse topics
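The class-based TF-IDF step can be sketched in plain Python. This is a simplified illustration, assuming the weighting tf(word, cluster) * log(1 + A / f(word)), where A is the average number of words per cluster and f(word) is the word's frequency across all clusters. It uses raw counts with no normalization, and the function names and sample documents are hypothetical.

```python
import math
from collections import Counter


def class_tfidf(docs_by_cluster):
    """Score each word per cluster with tf(word, cluster) * log(1 + A / f(word))."""
    # Treat every cluster as one concatenated "class document".
    cluster_counts = {
        cluster: Counter(" ".join(docs).lower().split())
        for cluster, docs in docs_by_cluster.items()
    }
    # f(word): frequency of each word across all clusters.
    total_counts = Counter()
    for counts in cluster_counts.values():
        total_counts.update(counts)
    # A: average number of words per cluster.
    avg_words = sum(total_counts.values()) / len(cluster_counts)

    return {
        cluster: {
            word: tf * math.log(1 + avg_words / total_counts[word])
            for word, tf in counts.items()
        }
        for cluster, counts in cluster_counts.items()
    }


def top_words(scores, n=2):
    """Return the n highest-scoring words of one cluster."""
    return sorted(scores, key=scores.get, reverse=True)[:n]


# Hypothetical clusters of short finding titles.
docs_by_cluster = {
    0: ["fire barrier degraded", "fire door failed inspection"],
    1: ["pump seal leak", "pump inspection failed"],
}
scores = class_tfidf(docs_by_cluster)
```

Words that appear in every cluster (here "failed" and "inspection") receive a smaller weight than words concentrated in a single cluster, which is what makes the resulting keywords characteristic of a topic.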

BERTopic: Modularity
BERTopic offers modularity at each step of the process:

- Embedding
- Dimensionality Reduction
- Clustering
- Tokenizer
- Weighting scheme
- Representation tuning

Each component can be easily swapped according to the goals and to accommodate the data.
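As a rough sketch of what swapping components looks like, the function below passes one object per pipeline step to the `BERTopic` constructor. The specific components and parameter values (UMAP and HDBSCAN settings, the `diversity` value, the English stopword list) are illustrative assumptions, not the configuration used by the presenters, and the function name `build_topic_model` is hypothetical.

```python
def build_topic_model():
    """Assemble a BERTopic model with every pipeline component passed explicitly."""
    # Imports are kept inside the function so the sketch can be read without the
    # heavy dependencies installed; calling it requires all of these packages.
    from bertopic import BERTopic
    from bertopic.representation import MaximalMarginalRelevance
    from bertopic.vectorizers import ClassTfidfTransformer
    from hdbscan import HDBSCAN
    from sentence_transformers import SentenceTransformer
    from sklearn.feature_extraction.text import CountVectorizer
    from umap import UMAP

    return BERTopic(
        # Embedding
        embedding_model=SentenceTransformer("all-MiniLM-L6-v2"),
        # Dimensionality reduction (assumed values)
        umap_model=UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric="cosine"),
        # Clustering (assumed values)
        hdbscan_model=HDBSCAN(min_cluster_size=10, prediction_data=True),
        # Tokenizer
        vectorizer_model=CountVectorizer(stop_words="english"),
        # Weighting scheme
        ctfidf_model=ClassTfidfTransformer(),
        # Representation tuning (assumed diversity value)
        representation_model=MaximalMarginalRelevance(diversity=0.3),
    )
```

With the dependencies installed, `topics, probs = build_topic_model().fit_transform(docs)` would fit the model on a list of strings `docs`; replacing any one argument swaps that step without touching the others.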

BERTopic: Representing a Topic
Refine how a topic is represented and interpreted

KeyBERT
- Extract keywords for each topic and a set of representative documents per topic
- Compare the embeddings of the keywords and the representative documents

Maximal Marginal Relevance
- Reduce redundancy and improve diversity of keywords

Part of Speech
- Extract keywords for each topic and documents that contain the keywords
- Use a part-of-speech tagger to generate new candidate keywords

Zero-shot Classification
- Assign candidate labels to topics given keywords for each topic

Text Generation and Prompts
- Create topic labels based on representative documents and keywords
- Hugging Face Transformers, OpenAI GPT, co:here, LangChain
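Maximal Marginal Relevance can be sketched as a greedy selection that trades relevance to the topic against similarity to keywords already chosen. The 2-D vectors below are hypothetical stand-ins for real keyword and topic embeddings, and the `diversity` parameterization is one common way to write the trade-off, not necessarily the exact form used in the presenters' runs.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(topic_vec, candidates, top_n=2, diversity=0.5):
    """Greedily pick keywords that are relevant to the topic but unlike earlier picks."""
    selected = []
    remaining = dict(candidates)
    while remaining and len(selected) < top_n:

        def score(word):
            relevance = cosine(remaining[word], topic_vec)
            # Highest similarity to any keyword that was already selected.
            redundancy = max(
                (cosine(remaining[word], candidates[s]) for s in selected),
                default=0.0,
            )
            return (1 - diversity) * relevance - diversity * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        del remaining[best]
    return selected


# Hypothetical 2-D embeddings; "fire" and "fires" are near-duplicates.
topic_vec = [1.0, 1.0]
candidates = {
    "fire": [1.0, 0.8],
    "fires": [1.0, 0.7],
    "barrier": [0.6, 1.0],
}

diverse = mmr(topic_vec, candidates, top_n=2, diversity=0.5)
relevance_only = mmr(topic_vec, candidates, top_n=2, diversity=0.0)
```

With `diversity=0.0` the two near-duplicate keywords are both kept; raising it replaces the second one with a less redundant keyword.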

BERTopic: Topic Modeling Variations

Topic Distributions
- Approximate topic distributions per document when using a hard-clustering approach

Topics per Class (Category)
- Extract topic representations for each class or category of interest from topic model

Dynamic Topic Modeling
- Analyze how the representation of a topic changes over time

Hierarchical Topic Modeling
- Obtain insights into which topics are similar and sub-topics that may exist in data

Online Topic Modeling
- Continue updating topic model with new data

Semi-supervised Topic Modeling
- Steer dimensionality reduction of document embeddings into a space close to the topic labels for some or all documents

Guided (Seeded) Topic Modeling
- Predefined keywords or phrases for the topic model to converge to by comparing document embeddings with seeded topic embeddings

Supervised Topic Modeling
- If topic labels are already known, discover relationships between documents and topics

Manual Topic Modeling
- Find topic representations for document topic labels that are already known and use other topic modeling variations with this model

BERTopic: Preliminary Results
Input Text: inspection findings titles (13-258 words, avg. 85 words)

- all-MiniLM-L6-v2 model used for embeddings (max sequence length: 256, embedding dimension: 384)
- Default BERTopic configuration + Maximal Marginal Relevance to tune topic representations

BERTopic: Preliminary Results
Input Text: inspection findings item introductions (44-11,670 words, avg. 1,649 words)

- all-MiniLM-L6-v2 model used for embeddings (max sequence length: 256, embedding dimension: 384)
- Default BERTopic configuration + Maximal Marginal Relevance to tune topic representations

Actionable Insights

Actionable Insights: Descriptions

Using Topics to Name the Safety Clusters
- Zero-shot Classification
- Text Generation and Prompts
- Topic Distributions
- Topics per Class (Category)
- Dynamic Topic Modeling
- Hierarchical Topic Modeling
- Online Topic Modeling

Leverage Fine Tuning to Analyze Safety Clusters
Once Safety Clusters have been defined, we can apply additional ML techniques to gain insights into the overarching themes in the Safety Cluster. This can aid the SMEs in characterizing the Safety Issues.

Actionable Insights: Unsupervised vs Supervised

Forcing Clusters: Actionable Insights Using Seeds
- Topic Modeling can be influenced
- This would be more of a supervised approach, telling the system what the safety clusters are, then grouping inspection reports based on those
- Different from discovering safety clusters
- May be a way to do both and compare
- Supervised, semi-supervised, guided, manual
- This capability is only available for some models - BERTopic provides them

Reactor safety inspection items
1. Adverse Weather Protection
2. Equipment Alignments
3. Fire Protection
4. Flood Protection
5. Licensed Operator Requalification
6. Maintenance Implementation
7. Maintenance Risk Assessments and Emergent Work Evaluation
8. Personnel Performance During Nonroutine Plant Evolutions and Events
9. Operability Evaluations
10. Operator Workarounds
11. Post-maintenance Testing
12.

Inspection Procedures
1. Initial Management Meeting Materials Licensees
2. Quality Assurance Program Implementation During Construction and Pre-Construction Activities
3. Quality Assurance Implementation Inspection
4.

Reactors
1. POIN
2. FAR
3. DIAB
4. GINN
5. SUR
6. HAR
7. FERM
8.
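The idea of telling the system what the safety clusters are and then grouping inspection text around them can be sketched as a nearest-seed assignment. This is a deliberate simplification: it uses bag-of-words vectors as a stand-in for sentence embeddings and makes a hard assignment, which is not how BERTopic's guided mode works internally. The seed keywords, sample texts, threshold, and function names are all hypothetical; only the two cluster names come from the slide.

```python
import math
from collections import Counter


def embed(text):
    """Stand-in for a sentence embedding: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def assign_to_seeds(docs, seed_topics, threshold=0.1):
    """Assign each document to its most similar seeded topic, or None if too dissimilar."""
    seed_vecs = {name: embed(" ".join(words)) for name, words in seed_topics.items()}
    assignments = {}
    for doc in docs:
        vec = embed(doc)
        best = max(seed_vecs, key=lambda name: cosine(vec, seed_vecs[name]))
        assignments[doc] = best if cosine(vec, seed_vecs[best]) >= threshold else None
    return assignments


# Hypothetical seed keywords for two of the inspection items listed on the slide.
seed_topics = {
    "Fire Protection": ["fire", "barrier", "door"],
    "Flood Protection": ["flood", "seal", "drain"],
}
docs = [
    "degraded fire barrier found in cable room",
    "flood seal missing from conduit",
    "operator requalification exam records",
]
assignments = assign_to_seeds(docs, seed_topics)
```

Documents that match no seed are left unassigned, which is where an unsupervised pass could still discover clusters that the seeds did not anticipate.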

Questions on direction for clustering and insights
In addition to the insights discussed last week:

- Within a reactor
- Across reactors
- Classification

Would these types of insights be beneficial?

Tool Analysis

Survey of Platforms and Tools

Platform Supplied Algorithms
Pros
- Integrated into user experience
- Can be combined into standard pipelines
- Work well in common cases (social media)
Cons
- Do not handle technical domains well
- Black box algorithms (unpublished models)
- Difficult to tailor pipelines
- Limited selection of algorithms
- Limited selection of pre-training datasets
- Advanced algorithms not available
- Strengths in one area (Topic Modeling) may not translate to other areas (Classification, clustering, regression, anomaly detection)

Python Library Algorithms
Pros
- Flexibility to select from multiple algorithms
- Flexibility to select from multiple pre-trainings
- Advanced models available
- Flexibility to address highly technical domains
- Flexibility to leverage internal parameters
- Work well in Machine Learning Notebooks
- Can be deployed in any cloud or on premises
- Limited coding knowledge required
Cons
- Requires some knowledge of Python
- Algorithms must be researched and selected

Based on preliminary results, we expect a Notebook-based approach using Neural Topic Modeling to be the top candidate.

Candidate Evaluation Criteria

Model Support
- LDA Topic Modeling
- Neural Topic Modeling
- Other relevant approaches for future use/experiments with text data

Processing Support
- Notebook Integration
- Text Extraction
- Text Pre-processing
- Text embedding
- Visual Programming

Python Libraries
- Pandas, Numpy, BeautifulSoup, NLTK, Spacy, Gensim, pyLDAvis, Matplotlib, BERTopic, hdbscan, scikit-learn, scipy, huggingface transformers, Pytorch, sentence-transformers

Criteria                  Weight
Neural Topic Modeling     .27
LDA                       .17
Visual Programming        .14
Text Pre-processing       .11
Text Embedding            .10
Other Text Approaches     .09
Notebook Integration      .08
Text Extraction           .03

Are these criteria and weights in line with the goals of the project?
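One way such criteria and weights might be applied is a weighted average of per-criterion ratings. The weights below are the ones listed on the slide (they sum to 0.99, so the sketch divides by their total); the candidates, their ratings, the 0-1 rating scale, and the function name are hypothetical and do not reflect any actual evaluation.

```python
# Weights as listed on the slide.
WEIGHTS = {
    "Neural Topic Modeling": 0.27,
    "LDA": 0.17,
    "Visual Programming": 0.14,
    "Text Pre-processing": 0.11,
    "Text Embedding": 0.10,
    "Other Text Approaches": 0.09,
    "Notebook Integration": 0.08,
    "Text Extraction": 0.03,
}


def weighted_score(ratings, weights=WEIGHTS):
    """Weighted average of per-criterion ratings; criteria without a rating count as 0."""
    total_weight = sum(weights.values())
    weighted_sum = sum(w * ratings.get(name, 0.0) for name, w in weights.items())
    return weighted_sum / total_weight


# Hypothetical ratings on a 0-1 scale for two made-up candidates.
candidates = {
    "Candidate A": {"Neural Topic Modeling": 1.0, "LDA": 1.0, "Notebook Integration": 1.0},
    "Candidate B": {"LDA": 1.0, "Visual Programming": 1.0, "Text Pre-processing": 1.0},
}
ranking = sorted(
    candidates, key=lambda name: weighted_score(candidates[name]), reverse=True
)
```

Dividing by the total weight keeps scores on the same 0-1 scale as the ratings even if the weights are later revised and no longer sum to exactly one.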

Platforms and Tools: Evaluation Criteria

Notebook Integration: the ability to launch and scale a Python notebook on the platform

Text Extraction: the ability to extract text from PDFs

Text Pre-processing: the ability to clean text data before passing it to a machine learning algorithm
- remove punctuation, emails, URLs, specific string patterns, correct spelling errors
- filter stopwords, stemming & lemmatization, extract linguistic features

Text Embedding: the ability to represent a text document as a vector of numbers using various embedding techniques and pre-trained models

Topic Modeling: LDA, Neural approaches

Other relevant approaches available on the platform for future use/experiments with text data
- Text classification
- Summarization, key phrase extraction
- Named entity recognition and linking
- Question-Answering
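A few of the text pre-processing steps named above (removing URLs, emails, and punctuation, then filtering stopwords) can be sketched with the standard library alone. The regular expressions and the tiny stopword list are simplified assumptions; spelling correction, stemming, and lemmatization are not covered and would normally come from a library such as NLTK or spaCy.

```python
import re
import string

# A deliberately tiny, hypothetical stopword list.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "at", "for", "is", "was"}

# Simplified patterns; real-world URLs and emails can be messier.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\S+@\S+")
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_text(text):
    """Remove URLs, emails, and punctuation, then lowercase and drop stopwords."""
    text = URL_RE.sub(" ", text)      # URLs first, before punctuation is stripped
    text = EMAIL_RE.sub(" ", text)    # then email addresses
    text = text.translate(PUNCT_TO_SPACE)
    return [tok for tok in text.lower().split() if tok not in STOPWORDS]


tokens = clean_text(
    "The pump failed; see https://example.com or email ops@example.com for details."
)
```

URLs and emails are removed before punctuation so that their internal dots and slashes do not break them into stray tokens.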

Google AI
Google AI has many products (relevant products are listed below):

- Vertex AI: Unified platform for training, hosting, and managing models
- Natural Language AI: Sentiment analysis and classification of unstructured text
- Document AI: Machine learning and AI to unlock insights from your documents
- Contact Center AI: AI model for speaking with customers and assisting human agents

Out-of-the-box capabilities:

- Vertex AI: Supervised learning tasks for image, tabular, text, and video domains
- Natural Language AI: Sentiment analysis, Entity analysis, Entity Sentiment analysis, Syntactic analysis, Content classification
- Document AI: Collect structured data from unstructured text data (able to fill in missing data using Google's knowledge graph)
- Contact Center AI: The only product found with topic modeling out of the box. Requires 10,000 conversations and both customer and agent responses for training

These out-of-the-box capabilities do not align with the unsupervised clustering approach envisioned for this task.

MathWorks MATLAB

- Simple function calls and interactive notebook-like environment
- Ability to call Python libraries from the MATLAB environment and import/export deep learning frameworks with Open Neural Network Exchange (ONNX) format

Text Analytics Toolbox
- Text extraction: supports extraction from various formats (text, PDF, HTML, CSV, Excel, and Word)
- Text pre-processing: remove punctuation, URLs, correct spelling errors, filter stopwords, stemming & lemmatization, extract linguistic features
- Text embedding: word and n-gram counting, word2vec, CBOW, FastText, GloVe
- Topic modeling: LDA
- Other relevant offerings: document summarization, text classification, and keyword extraction with limited deep learning models

Basic NLP functionalities are offered by MATLAB, but more advanced methods will likely be needed to obtain actionable results from technical text data. Any code written will be specific to MATLAB and not easily portable to other platforms or a Python notebook.

Microsoft Azure AI + ML

- Many AI/ML Products: Applied AI Services, Cognitive Services, Form Recognizer, Cognitive Search, OpenAI Services, Machine Learning and more
- Various environments supported: visual no-code (Azure ML Designer), code-first (Jupyter Notebooks, Python SDK, CLI)

Azure Machine Learning
- Text extraction: supported through Form Recognizer if needed, else import data from an Azure Datastore and transform via Designer
- Text pre-processing: remove stopwords, regular expressions for string matching, lemmatization, case normalization, remove special characters, patterns, emails or URLs
- Text embedding: N-gram features, Word2Vec, FastText, GloVe
- Topic modeling: LDA
- Other relevant offerings: text classification and named entity recognition via Designer; many NLP tasks (key phrase extraction, entity recognition and linking, summarization, question-answering) supported via Cognitive Services

Basic NLP functionalities and LDA topic modeling with a visual no-code environment or notebook approach to explore more advanced methodologies.

Amazon AWS SageMaker

- Many AI/ML Products: Comprehend, Textract, Augmented AI and more
- Various SageMaker Environments: visual no-code (Canvas), code-first (Studio, Notebook Instances, Studio Lab)

AWS SageMaker Built-in functionalities
- Text extraction: supported through Amazon Comprehend if needed, but data can be provided in various formats (text, csv, json) for use within SageMaker
- Text pre-processing: not built-in, but can be imported from libraries as needed
- Text embedding: BlazingText (for learning CBOW, skip-gram or batch skip-gram embeddings with Word2Vec; learning character n-gram embeddings), Object2Vec (for learning embeddings with sentence pairs)
- Topic modeling: LDA and Neural Topic Modeling
- Other relevant offerings: text classification, summarization, entity recognition and relationship extraction, question-answering

Some advanced NLP functionalities, LDA and neural topic modeling with a visual no-code environment or notebook approach to explore more advanced methodologies.

Questions on Tool Selection Criteria

- What do you see as the primary role of the people executing this type of analysis?
- Are there longer-term aspirations for ML initiatives?

Minimal Python coding in a Notebook environment enables many more options.

Progress

SOW Task Status

Phase I: March 6, 2023 - April 9, 2023          Status
Describe the Problem                            Complete
Search the Literature                           Complete
Select Candidates                               Complete
Select Evaluation Factors                       In progress
Develop evaluation factor weights               In progress
Define evaluation factor ranges                 In progress
Perform assessment                              In progress
Report Results                                  Not started
Deliver Trade study report                      Not started

Phase II: March 20, 2023 - May 7, 2023          Status
Platform/system selection and installation      Not started
Data acquisition and preparation                In progress
Feature pipeline engineering                    In progress
Clustering method experimentation & selection   In progress
Cluster pipeline engineering                    In progress
Anomaly detection (as needed)                   Not started
Model Development, Training, Evaluation         Not started
Test harness development                        Not started
PoC integration and demonstration               Not started
Trial runs and evaluation                       Not started
Demonstrate PoC capability                      Not started

Phase III: April 19, 2023 - June 16, 2023       Status
Live data ingestion                             Not started
Model execution                                 Not started
Cluster evaluation                              Not started
Critical Method documentation                   Not started
Technical Report Document                       Not started
Deliver final report with findings              Not started

Next Steps

- Update evaluation criteria and weighting factors based on feedback from this session
- Perform candidate evaluations
- Produce Trade Study Report
- Begin model tuning