ML23262B168

Meeting Slides 20230412-final
ML23262B168
Person / Time
Issue date: 04/12/2023
From: Chang Y, Friedman C, Mishkin A, Polra S, Pringle S, Tanya Smith, Vasquez G
NRC/RES/DRA/HFRB, Sphere of Influence
To:



Machine Learning Demo: Prioritizing Inspections using ML
Wednesday, April 12, 2023
Alec Mishkin, Guillermo Vasquez, Stuti Polra, Casey Friedman, Scott Pringle, Theresa Smith

Agenda 2

- BERTopic Jupyter Notebook Demo
- Named Entity Extraction
- Varying Text Embeddings in BERTopic
- Topic Modeling for Different Categories
- Progress and Next Steps

BERTopic Demo 3

Language models that have been trained to recognize text spans as different kinds of named entities

Most available pre-trained models can extract 4 or 18 named entity classes:

- 4-class NER models: person, location, organization, miscellaneous
- 18-class NER models: cardinal value, date, event, building or facility, geo-political entity, language, law, location, money, nationality, religious or political affiliation, ordinal, organization, percent, person, product, quantity, time, work of art

These models can be trained to extract custom entities, but this would require training on a large text corpus with entities and their corresponding text spans labeled

For our use case, a pre-trained model that does a good job of extracting the entities of interest is sufficient

Named Entity Recognition (NER) 4

Flair's ner-english-ontonotes-large model; spaCy's en_core_web_trf model

Named Entity Recognition Example 5

Flair's ner-english-ontonotes-large model; spaCy's en_core_web_trf model

Named Entity Recognition Example 6

Different models will be better at recognizing different entities (e.g., Law with Flair, Facility with spaCy, Product with both)

Pre-trained NER models aren't completely reliable on technical data; they may miss some parts of the entities, or some entities altogether

Extracted entities can be used at various points in the BERTopic pipeline

- As categories to view found topics under, and extract topic representations for

- Steer dimensionality reduction of document embeddings closer to embeddings of extracted entities in a semi-supervised topic modeling approach

- With more reliable and accurate NER, and customized code:

Select which text is fed to the embedding step (along with POS tagging to select patterns of interest)

Tokenize text before c-TF-IDF step (along with POS tagging to select patterns of interest)

Named Entity Recognition (NER) 7
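As a concrete illustration of putting extracted entities to use, here is a minimal sketch of filtering NER output down to labels of interest before feeding it into the topic pipeline. The label set, helper name, and example spans are hypothetical, assuming entities arrive as (text, label) pairs — the shape both spaCy's `doc.ents` and Flair's sentence spans can be reduced to:

```python
# Hypothetical helper: keep only NER spans whose labels matter for inspections.
# Assumes entities arrive as (text, label) pairs, the shape both spaCy's
# doc.ents and Flair's spans can be reduced to.

LABELS_OF_INTEREST = {"LAW", "FAC", "PRODUCT", "ORG"}  # illustrative label set

def filter_entities(entities, labels=LABELS_OF_INTEREST):
    """Return the entity texts whose label is in the set of interest."""
    return [text for text, label in entities if label in labels]

# Example spans a model might emit for an inspection finding (invented data)
entities = [
    ("10 CFR 50.65", "LAW"),
    ("emergency diesel generator", "PRODUCT"),
    ("Unit 2 turbine building", "FAC"),
    ("Wednesday", "DATE"),
]
print(filter_entities(entities))
# keeps the law, product, and facility spans; drops the date
```

The filtered texts could then serve as the categories or seed phrases described above.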

BERTopic Experiments Overview 8

BERTopic offers modularity at each step of the process

- Embedding

- Dimensionality Reduction

- Clustering

- Tokenizer

- Weighting scheme

- Representation tuning

Each component can be easily swapped according to the goals and the data

BERTopic: Modularity 9
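The weighting step in the list above refers to BERTopic's class-based TF-IDF (c-TF-IDF), which scores terms within each topic rather than each document. Below is a minimal sketch of that formula, assuming whitespace tokenization; BERTopic's actual implementation operates on sparse scikit-learn matrices:

```python
import math
from collections import Counter

def c_tf_idf(class_docs):
    """Class-based TF-IDF, the weighting scheme BERTopic applies per topic:
    w(t, c) = tf(t, c) * log(1 + A / f(t)),
    where tf(t, c) is the count of term t in class c, f(t) is the count of
    t across all classes, and A is the average word count per class."""
    counts = {c: Counter(" ".join(docs).split()) for c, docs in class_docs.items()}
    total = Counter()
    for cnt in counts.values():
        total.update(cnt)
    avg_words = sum(total.values()) / len(class_docs)
    return {
        c: {t: tf * math.log(1 + avg_words / total[t]) for t, tf in cnt.items()}
        for c, cnt in counts.items()
    }

# Toy example: two "topics" worth of invented inspection text
weights = c_tf_idf({
    0: ["valve leak valve", "valve repair"],
    1: ["fire door", "fire watch fire"],
})
top_0 = max(weights[0], key=weights[0].get)  # highest-weighted term in topic 0
```

Terms frequent inside one topic but rare across topics get the highest weights, which is what makes the per-topic keyword lists distinctive.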

Topic Representation

- Maximal marginal relevance to reduce redundant keywords
- KeyBERT-inspired approach to find keywords that are closely related to the representative documents of each topic
- Rule-based part-of-speech matching to find keywords or key phrases from representative documents that follow a specified part-of-speech pattern (nouns, adjectives followed by nouns)

- Using text generation models to label topics by providing a prompt with the keywords and representative documents
- Chaining multiple topic representation approaches: MMR + KeyBERT, MMR + POS, MMR + KeyBERT + Text Generation, MMR + POS + Text Generation

Text Embedding

- Varying the language models used to embed document text
- Models with larger token limits, and those without token limits
- Models that perform character-level embeddings to better capture technical language

Topic Modeling per Category

- Perform topic modeling for each known category of the data: regions, reactor sites, reactor units, cornerstone areas, cross-cutting areas

BERTopic Experiments Overview 10
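One common workaround for the token limits mentioned above is to window a long document and pool the chunk embeddings. This is only a sketch — chunk-and-mean-pool is an assumption for illustration, not necessarily the approach taken here, and the `embed` argument and toy embedder are stand-ins for a real model:

```python
def chunk_and_pool(tokens, embed, max_tokens=512, stride=512):
    """Embed a document longer than a model's token limit by splitting it
    into windows, embedding each window, and mean-pooling the vectors.
    `embed` is a stand-in for any model mapping a token list to a vector."""
    chunks = [tokens[i:i + max_tokens] for i in range(0, len(tokens), stride)]
    vectors = [embed(chunk) for chunk in chunks]
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

# Toy embedder: a 1-d "vector" holding the average token length of the chunk
toy_embed = lambda chunk: [sum(len(t) for t in chunk) / len(chunk)]
doc = ["valve"] * 600 + ["fire"] * 400   # 1000 tokens, over a 512-token limit
vec = chunk_and_pool(doc, toy_embed, max_tokens=512)
```

Overlapping windows (a smaller `stride`) are a common variant when sentence boundaries matter.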

Text Embedding Experimentation 11

An embedding is the first LEGO piece in the BERTopic model

The chosen embedding converts text into a numerical format that computers can understand

When a person tries to understand a problem, they bring in their own bias and experiences

Similarly, the Embeddings give the BERTopic model the context and experience necessary to identify topics.

Just as two different people can approach the same problem differently, two different embeddings can affect the topics identified by BERTopic

The next few slides summarize a few of the many possible embeddings we are using and present some results

Embeddings 12

Example Embeddings 13

BERTopic's modularity allows us to very easily test different embeddings from different online repositories

- all-MiniLM-L6-v2: Designed for general purpose and speed. Trained on a large corpus of online data
- all-mpnet-base-v2: Designed for general purpose and quality. Trained on a large corpus of online data
- xlnet-base-cased: Designed to work on language tasks that involve long context
- SPECTER: Trained on scientific citations and designed to estimate the similarity of two publications
- multi-qa-MiniLM-L6-dot-v1: Designed to find relevant passages from specific queries. Trained on a large and diverse set of (question, answer) pairs

Results

182 topics found (including outlier topic)

The top 6 most common topics were shown in a chart (not captured in this text)

all-MiniLM-L6-v2 14

Hierarchical clustering after topic reduction

all-MiniLM-L6-v2 15

Results

176 topics found (including outlier topic)

The top 6 most common topics were shown in a chart (not captured in this text)

all-mpnet-base-v2 16

Hierarchical clustering after topic reduction

all-mpnet-base-v2 17

Results

186 topics found (including outlier topic)

The top 6 most common topics were shown in a chart (not captured in this text)

xlnet-base-cased 18

Hierarchical clustering after topic reduction

xlnet-base-cased 19

Results

132 topics found (including outlier topic)

The top 6 most common topics were shown in a chart (not captured in this text)

SPECTER 20

Hierarchical clustering after topic reduction

SPECTER 21

Results

multi-qa-MiniLM-L6-dot-v1 22

169 topics found (including outlier topic)

The top 6 most common topics were shown in a chart (not captured in this text)

Hierarchical clustering after topic reduction

multi-qa-MiniLM-L6-dot-v1 23

Topic Modeling for Different Categories 24

- Create a topic model and then extract, for each topic, its representation per class
- See how certain topics, calculated over all documents, are represented for certain subgroups
- Default BERTopic configuration (all-MiniLM-L6-v2 for embedding item introduction text, and MMR to reduce redundancy)

Categories explored:

- Region

- Reactor Site

- Cornerstones

- Cross-cutting Areas

Topic Modeling for Different Categories 25
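The per-class extraction described above can be approximated with simple term counts. This sketch is a bare-bones analogue of BERTopic's `topics_per_class`, which instead re-computes c-TF-IDF on the documents of each class; the documents and region labels below are invented:

```python
from collections import Counter, defaultdict

def topics_per_class(docs, topics, classes, top_n=3):
    """For each (topic, class) pair, return the most frequent terms — a
    bare-bones analogue of BERTopic's topics_per_class representation."""
    grouped = defaultdict(Counter)
    for doc, topic, cls in zip(docs, topics, classes):
        grouped[(topic, cls)].update(doc.split())
    return {
        key: [term for term, _ in cnt.most_common(top_n)]
        for key, cnt in grouped.items()
    }

# Toy findings labeled with a topic id and an NRC region (invented data)
reps = topics_per_class(
    docs=["valve leak valve", "valve stem", "fire door fire"],
    topics=[0, 0, 1],
    classes=["Region I", "Region II", "Region I"],
)
```

The same topic can then be inspected side by side across regions, sites, or cornerstone areas to see whether its vocabulary shifts by subgroup.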

Demo (see screenshots at the end of the deck)

Topic Modeling for Different Categories 26

Per-category topic terms recovered from the demo screenshots (duplicates removed):

- Safety Clusters: Reactor, Fire, Valve, Generator, Procedure, Control, Maintain
- Reactors: POIN, FAR, DIAB, GINN, SUR, HAR, FERM
- Reports: ML13317A5, ML12026A4, ML1030752, ML11312A0, ML1019005, ML13204A0, ML12121A5
- Inspection Proc: 71153, 71111.13, 71111.15, 71111.20, 71111.21, 71114.06, 71152B
- 2000, 2023, 2020

Progress 28

SOW Task Status 29

Phase I: March 6, 2023 - April 9, 2023
- Describe the Problem: Complete
- Search the Literature: Complete
- Select Candidates: Complete
- Select Evaluation Factors: Complete
- Develop evaluation factor weights: Complete
- Define evaluation factor ranges: Complete
- Perform assessment: Complete
- Report Results: Complete
- Deliver Trade study report: Complete

Phase II: March 20, 2023 - May 7, 2023
- Platform/system selection and installation: In progress
- Data acquisition and preparation: In progress
- Feature pipeline engineering: In progress
- Clustering method experimentation & selection: In progress
- Cluster pipeline engineering: In progress
- Anomaly detection (as needed): Not started
- Model Development, Training, Evaluation: Not started
- Test harness development: Not started
- PoC integration and demonstration: Not started
- Trial runs and evaluation: Not started
- Demonstrate PoC capability: Not started

Phase III: April 19, 2023 - June 16, 2023
- Live data ingestion: Not started
- Model execution: Not started
- Cluster evaluation: Not started
- Critical Method documentation: Not started
- Technical Report Document: Not started
- Deliver final report with findings: Not started

Next Steps 30

Next Steps 31

Experiment with alternatives in BERTopic composable parts:

Topic representation

Embedding

Variants of topic modeling

Begin Azure configuration for no-code solution

Share early topics / safety clusters with SMEs

Backup 32

Cornerstone 33

Region 34

Site Name 35

Cross Cutting Area 36