ML23262B168
| Person / Time | |
|---|---|
| Issue date: | 04/12/2023 |
| From: | Chang Y, Friedman C, Mishkin A, Polra S, Pringle S, Tanya Smith, Vasquez G NRC/RES/DRA/HFRB, Sphere of Influence |
| To: | |
| References | |
Text
Machine Learning Demo: Prioritizing Inspections using ML
Wednesday, April 12, 2023
Alec Mishkin, Guillermo Vasquez, Stuti Polra, Casey Friedman, Scott Pringle, Theresa Smith
Agenda 2
- BERTopic Jupyter Notebook Demo
- Named Entity Extraction
- Varying Text Embeddings in BERTopic
- Topic Modeling for Different Categories
- Progress and Next Steps
BERTopic Demo 3
Language models that have been trained to recognize text spans as different kinds of named entities
Most available pre-trained models can extract 4 or 18 named entities
- 4-class NER models: person, location, organization, miscellaneous
- 18-class NER models: cardinal value, date, event, building or facility, geo-political entity, language, law, location, money, nationality, religious or political affiliation, ordinal, organization, percent, person, product, quantity, time, work of art
These models can be trained to extract custom entities, but this would require training on a large text corpus with entities and their corresponding text spans labeled
For our use case, a pre-trained model that does a good job of extracting the entities of interest is sufficient Named Entity Recognition (NER) 4
Flair's ner-english-ontonotes-large model and spaCy's en_core_web_trf model Named Entity Recognition Example 5
Flair's ner-english-ontonotes-large model and spaCy's en_core_web_trf model Named Entity Recognition Example 6
Different models will be better at recognizing different entities (e.g., Law with Flair, Facility with spaCy, Product with both)
Pre-trained NER models aren't completely reliable on technical data; they may miss parts of entities or some entities altogether
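For reference, the snippet below is a minimal sketch of running the two pre-trained taggers named above on a short illustrative sentence; the package setup, model downloads, and the sample text are assumptions for illustration, not part of the demo.

```python
# Minimal sketch (assumed setup): pip install spacy flair, plus
# python -m spacy download en_core_web_trf. The sample sentence is illustrative.
import spacy
from flair.data import Sentence
from flair.models import SequenceTagger

text = "An inspection at Peach Bottom Unit 2 reviewed compliance with 10 CFR 50.65 in March 2012."

# spaCy: transformer pipeline with an 18-class OntoNotes NER component
nlp = spacy.load("en_core_web_trf")
doc = nlp(text)
print([(ent.text, ent.label_) for ent in doc.ents])

# Flair: large OntoNotes sequence tagger (also 18 entity classes)
tagger = SequenceTagger.load("flair/ner-english-ontonotes-large")
sentence = Sentence(text)
tagger.predict(sentence)
print([(span.text, span.tag) for span in sentence.get_spans("ner")])
```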
Extracted entities can be used at various points in the BERTopic pipeline
- As categories to view found topics under, and extract topic representations for
- Steer dimensionality reduction of document embeddings closer to embeddings of extracted entities in a semi-supervised topic modeling approach
- With more reliable and accurate NER, and customized code:
  - Select which text is fed to the embedding step (along with POS tagging to select patterns of interest)
  - Tokenize text before the c-TF-IDF step (along with POS tagging to select patterns of interest); a sketch of this idea follows the slide
Named Entity Recognition (NER) 7
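As one possible realization of the tokenization idea above, the sketch below feeds spaCy entities and noun chunks (rather than raw word tokens) into BERTopic's c-TF-IDF step through a custom CountVectorizer analyzer; the spaCy model name and the choice of spans to keep are assumptions, not the project's chosen method.

```python
# Hedged sketch: use named entities plus noun chunks as the "tokens" that reach
# BERTopic's c-TF-IDF step. Assumes en_core_web_sm is installed; keeping
# entities + noun chunks is an illustrative choice.
import spacy
from sklearn.feature_extraction.text import CountVectorizer
from bertopic import BERTopic

nlp = spacy.load("en_core_web_sm")

def entity_noun_analyzer(text: str):
    """Return entity and noun-chunk spans for a single document."""
    doc = nlp(text)
    spans = [ent.text.lower() for ent in doc.ents]
    spans += [chunk.text.lower() for chunk in doc.noun_chunks]
    return spans

vectorizer_model = CountVectorizer(analyzer=entity_noun_analyzer)
topic_model = BERTopic(vectorizer_model=vectorizer_model)
```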
BERTopic Experiments Overview 8
BERTopic offers modularity at each step of the process
- Embedding
- Dimensionality Reduction
- Clustering
- Tokenizer
- Weighting scheme
- Representation tuning
Each component can be easily swapped according to the goals and to accommodate the data (a sketch of swapping components follows below). BERTopic: Modularity 9
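To make the modularity concrete, here is a minimal sketch of how each of the six components listed above maps to its own constructor argument in BERTopic; the specific model choices and hyperparameters are illustrative assumptions, not the configuration used in the demo.

```python
# Hedged sketch of BERTopic's modularity: every pipeline stage is passed in as
# its own model object and can be swapped independently.
from bertopic import BERTopic
from bertopic.representation import MaximalMarginalRelevance
from bertopic.vectorizers import ClassTfidfTransformer
from hdbscan import HDBSCAN
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer
from umap import UMAP

topic_model = BERTopic(
    embedding_model=SentenceTransformer("all-MiniLM-L6-v2"),      # Embedding
    umap_model=UMAP(n_neighbors=15, n_components=5),               # Dimensionality reduction
    hdbscan_model=HDBSCAN(min_cluster_size=10),                    # Clustering
    vectorizer_model=CountVectorizer(stop_words="english"),        # Tokenizer
    ctfidf_model=ClassTfidfTransformer(),                          # Weighting scheme
    representation_model=MaximalMarginalRelevance(diversity=0.3),  # Representation tuning
)
```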
Topic Representation
- Maximal marginal relevance (MMR) to reduce redundant keywords
- KeyBERT-inspired approach to find keywords that are closely related to the representative documents of each topic
- Rule-based part-of-speech (POS) matching to find keywords or key phrases from representative documents that follow a specified part-of-speech pattern (nouns, adjectives followed by nouns)
- Using text generation models to label topics by providing a prompt with the keywords and representative documents
- Chaining multiple topic representation approaches: MMR → KeyBERT, MMR → POS, MMR → KeyBERT → Text Generation, MMR → POS → Text Generation (a sketch of chaining follows below)
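A minimal sketch of the chaining idea, using BERTopic's support for passing a list of representation models (applied in sequence) and a dict of named alternatives; the diversity value and the spaCy model used for POS matching are illustrative assumptions.

```python
# Hedged sketch: chain representation models by passing them as a list, and keep
# several alternative chains side by side by passing a dict of named variants.
from bertopic import BERTopic
from bertopic.representation import (
    KeyBERTInspired,
    MaximalMarginalRelevance,
    PartOfSpeech,
)

mmr = MaximalMarginalRelevance(diversity=0.3)

representation_model = {
    "Main": [mmr, KeyBERTInspired()],              # MMR -> KeyBERT
    "POS": [mmr, PartOfSpeech("en_core_web_sm")],  # MMR -> POS
}

topic_model = BERTopic(representation_model=representation_model)
```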
Text Embedding
- Varying the language models used to embed document text
- Models with larger token limits, and those without token limits
- Models that perform character-level embeddings to better capture technical language
Topic Modeling per Category
- Perform topic modeling for each known category of the data
- Regions, reactor sites, reactor units, cornerstone areas, cross-cutting areas
BERTopic Experiments Overview 10
Text Embedding Experimentation 11
An embedding is the first LEGO piece in the BERTopic model
The chosen embedding converts text into a numerical format that computers can understand
When a person tries to understand a problem, they bring in their own bias and experiences
Similarly, the embeddings give the BERTopic model the context and experience necessary to identify topics.
Just as two different people can approach the same problem differently, two different embeddings can affect the topics identified by BERTopic.
The next few slides summarize a few of the many possible embeddings we are using and present a few results. Embeddings 12
Example Embeddings 13
BERTopic's modularity allows us to very easily test different embeddings from different online repositories (a sketch of swapping embeddings follows the table)

| Embedding | Description |
|---|---|
| all-MiniLM-L6-v2 | Designed for general purpose and speed. Trained on a large corpus of online data |
| all-mpnet-base-v2 | Designed for general purpose and quality. Trained on a large corpus of online data |
| xlnet-base-cased | Designed to work on language tasks that involve long context |
| SPECTER | Trained on scientific citations and designed to estimate the similarity of two publications |
| multi-qa-MiniLM-L6-dot-v1 | Designed to find relevant passages from specific queries. Trained on a large and diverse set of (question, answer) pairs |
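As a sketch of how the comparison in the following slides can be set up, the loop below swaps the sentence-transformers identifier and reports how many topics each embedding yields; the 20 Newsgroups sample is only a public stand-in for the inspection-item text, and the model list is truncated to two entries.

```python
# Hedged sketch: swap the embedding model and compare topic counts.
# fetch_20newsgroups is a stand-in corpus, not the NRC inspection data.
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from sklearn.datasets import fetch_20newsgroups

docs = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes")).data[:2000]

for model_name in ["all-MiniLM-L6-v2", "all-mpnet-base-v2"]:
    embedding_model = SentenceTransformer(model_name)
    topic_model = BERTopic(embedding_model=embedding_model)
    topics, _ = topic_model.fit_transform(docs)
    n_topics = len(set(topics)) - (1 if -1 in topics else 0)  # exclude the outlier topic
    print(f"{model_name}: {n_topics} topics (plus the outlier topic)")
```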
Results all-MiniLM-L6-v2 14
- 182 topics found (including outlier topic)
- [Figure: the top 6 most common results]
all-MiniLM-L6-v2 15
- [Figure: hierarchical clustering after topic reduction]
Results all-mpnet-base-v2 16
- 176 topics found (including outlier topic)
- [Figure: the top 6 most common results]
all-mpnet-base-v2 17
- [Figure: hierarchical clustering after topic reduction]
Results xlnet-base-cased 18
- 186 topics found (including outlier topic)
- [Figure: the top 6 most common results]
xlnet-base-cased 19
- [Figure: hierarchical clustering after topic reduction]
Results SPECTER 20
- 132 topics found (including outlier topic)
- [Figure: the top 6 most common results]
SPECTER 21
- [Figure: hierarchical clustering after topic reduction]
Results multi-qa-MiniLM-L6-dot-v1 22
- 169 topics found (including outlier topic)
- [Figure: the top 6 most common results]
multi-qa-MiniLM-L6-dot-v1 23
- [Figure: hierarchical clustering after topic reduction]
Topic Modeling for Different Categories 24
- Create a topic model and then extract, for each topic, its representation per class (see the sketch after this list)
- See how certain topics, calculated over all documents, are represented for certain subgroups
- Default BERTopic configuration (all-MiniLM-L6-v2 for embedding item introduction text and MMR to reduce redundancy)
Categories explored:
- Region
- Reactor Site
- Cornerstones
- Cross-cutting Areas
Topic Modeling for Different Categories 25
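A minimal sketch of the per-category view using BERTopic's topics_per_class; the 20 Newsgroups corpus and its category labels stand in for the inspection-item text and the categories listed above (region, site, cornerstone, cross-cutting area).

```python
# Hedged sketch: fit a global topic model, then compute per-class topic
# representations. Stand-in corpus and labels replace the real item text/categories.
from bertopic import BERTopic
from bertopic.representation import MaximalMarginalRelevance
from sklearn.datasets import fetch_20newsgroups

data = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))
docs = data.data[:2000]
classes = [data.target_names[label] for label in data.target[:2000]]

# Default-style configuration: all-MiniLM-L6-v2 embeddings with MMR to reduce redundancy
topic_model = BERTopic(
    embedding_model="all-MiniLM-L6-v2",
    representation_model=MaximalMarginalRelevance(diversity=0.3),
)
topics, _ = topic_model.fit_transform(docs)

# Re-compute each topic's keyword representation within every class
topics_per_class = topic_model.topics_per_class(docs, classes=classes)
fig = topic_model.visualize_topics_per_class(topics_per_class)
fig.write_html("topics_per_class.html")
```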
Topic Modeling for Different Categories: Demo (see screenshots at the end of the deck) 26
[Figure: topic keywords per class]
Safety Clusters: Reactor, Fire, Valve, Generator, Procedure, Control, Maintain
Reactors: POIN, FAR, DIAB, GINN, SUR, HAR, FERM
Reports: ML13317A5, ML12026A4, ML1030752, ML11312A0, ML1019005, ML13204A0, ML12121A5
Inspection Proc: 71153, 71111.13, 71111.15, 71111.20, 71111.21, 71114.06, 71152B
2000, 2023, 2020
Progress 28
SOW Task Status 29

| Phase I: March 6, 2023 - April 9, 2023 | Status |
|---|---|
| Describe the Problem | Complete |
| Search the Literature | Complete |
| Select Candidates | Complete |
| Select Evaluation Factors | Complete |
| Develop evaluation factor weights | Complete |
| Define evaluation factor ranges | Complete |
| Perform assessment | Complete |
| Report Results | Complete |
| Deliver Trade study report | Complete |

| Phase II: March 20, 2023 - May 7, 2023 | Status |
|---|---|
| Platform/system selection and installation | In progress |
| Data acquisition and preparation | In progress |
| Feature pipeline engineering | In progress |
| Clustering method experimentation & selection | In progress |
| Cluster pipeline engineering | In progress |
| Anomaly detection (as needed) | Not started |
| Model Development, Training, Evaluation | Not started |
| Test harness development | Not started |
| PoC integration and demonstration | Not started |
| Trial runs and evaluation | Not started |
| Demonstrate PoC capability | Not started |

| Phase III: April 19, 2023 - June 16, 2023 | Status |
|---|---|
| Live data ingestion | Not started |
| Model execution | Not started |
| Cluster evaluation | Not started |
| Critical Method documentation | Not started |
| Technical Report Document | Not started |
| Deliver final report with findings | Not started |
Next Steps 30
Next Steps 31
Experiment with alternatives in BERTopic's composable parts:
- Topic representation
- Embedding
- Variants of topic modeling
Begin Azure configuration for no-code solution
Share early topics / safety clusters with SMEs
Backup 32
Cornerstone 33
Region 34
Site Name 35
Cross Cutting Area 36