ML23262B168

Meeting Slides 20230412-final
Person / Time
Issue date: 04/12/2023
From: Chang Y, Friedman C, Mishkin A, Polra S, Pringle S, Tanya Smith, Vasquez G
NRC/RES/DRA/HFRB, Sphere of Influence


Text

Machine Learning Demo Wednesday: Prioritizing Inspections using ML
Alec Mishkin, Guillermo Vasquez, Stuti Polra, Casey Friedman, Scott Pringle, Theresa Smith
Wednesday, April 12, 2023

Agenda

- BERTopic Jupyter Notebook Demo
- Named Entity Extraction
- Varying Text Embeddings in BERTopic
- Topic Modeling for Different Categories
- Progress and Next Steps 2

BERTopic Demo 3

Named Entity Recognition (NER)

Language models that have been trained to recognize text spans as different kinds of named entities. Most available pre-trained models extract either 4 or 18 named entity classes:

- 4-class NER models: person, location, organization, miscellaneous

- 18-class NER models: cardinal value, date, event, building or facility, geo-political entity, language, law, location, money, nationality, religious or political affiliation, ordinal, organization, percent, person, product, quantity, time, work of art

These models can be trained to extract custom entities, but this would require training on a large text corpus with entities and their corresponding text spans labeled. For our use case, a pre-trained model that does a good job of extracting the entities of interest is sufficient. 4

Named Entity Recognition Example: Flair's ner-english-ontonotes-large model and spaCy's en_core_web_trf model 5

Named Entity Recognition Example: Flair's ner-english-ontonotes-large model and spaCy's en_core_web_trf model 6

Named Entity Recognition (NER)

Different models will be better at recognizing different entities (e.g., Law with Flair, Facility with spaCy, Product with both)

Pre-trained NER models aren't completely reliable on technical data; they may miss parts of some entities, or some entities altogether.

Extracted entities can be used at various points in the BERTopic pipeline:

- As categories to view found topics under, and extract topic representations for

- Steer dimensionality reduction of document embeddings closer to embeddings of extracted entities in a semi-supervised topic modeling approach

- With more reliable and accurate NER, and customized code:

> Select which text is fed to the embedding step (along with POS tagging to select patterns of interest)

> Tokenize text before c-TF-IDF step (along with POS tagging to select patterns of interest) 7
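The first of those customizations, selecting which text is fed to the embedding step, can be sketched in plain Python. The (start, end, label) span format mirrors what NER libraries commonly return; the entity labels, sample text, and function name below are illustrative assumptions, not from this project:

```python
# Sketch: use NER spans to select which text is fed to the embedding step.
# Assumption: a NER model (e.g. Flair or spaCy) has already produced
# (start, end, label) character spans over the original text.

ENTITY_LABELS_OF_INTEREST = {"FAC", "LAW", "PRODUCT", "ORG"}

def select_text_for_embedding(text, spans, keep_labels=ENTITY_LABELS_OF_INTEREST):
    """Keep only spans whose entity label is of interest, joined in
    document order, as a reduced input for the embedding model."""
    kept = [text[s:e] for (s, e, label) in sorted(spans) if label in keep_labels]
    return " ".join(kept)

# Toy example with hand-labeled spans (not real model output):
text = "The licensee violated 10 CFR 50.65 at the auxiliary feedwater pump room."
spans = [(22, 34, "LAW"), (42, 71, "FAC")]
print(select_text_for_embedding(text, spans))  # 10 CFR 50.65 auxiliary feedwater pump room
```

The same span selection could equally feed a custom tokenizer ahead of the c-TF-IDF step.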

BERTopic Experiments Overview 8

BERTopic: Modularity

BERTopic offers modularity at each step of the process:

- Embedding

- Dimensionality Reduction

- Clustering

- Tokenizer

- Weighting scheme

- Representation tuning

Each component can be easily swapped according to the goals and to accommodate the data. 9
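These six components map one-to-one onto BERTopic's constructor arguments. A minimal wiring sketch follows; the specific component choices are examples for illustration, not the project's selected configuration:

```python
from bertopic import BERTopic
from bertopic.representation import MaximalMarginalRelevance
from bertopic.vectorizers import ClassTfidfTransformer
from hdbscan import HDBSCAN
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer
from umap import UMAP

topic_model = BERTopic(
    embedding_model=SentenceTransformer("all-MiniLM-L6-v2"),       # 1. embedding
    umap_model=UMAP(n_components=5),                               # 2. dimensionality reduction
    hdbscan_model=HDBSCAN(min_cluster_size=10),                    # 3. clustering
    vectorizer_model=CountVectorizer(stop_words="english"),        # 4. tokenizer
    ctfidf_model=ClassTfidfTransformer(),                          # 5. weighting scheme
    representation_model=MaximalMarginalRelevance(diversity=0.3),  # 6. representation tuning
)
```

Swapping any one line changes only that stage of the pipeline, which is what makes the experiments on the following slides cheap to run.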

BERTopic Experiments Overview

Topic Representation

- Maximal marginal relevance to reduce redundant keywords

- KeyBERT-inspired approach to find keywords that are closely related to the representative documents of each topic

- Rule-based part-of-speech matching to find keywords or key phrases from representative documents that follow a specified part-of-speech pattern (e.g., nouns, or adjectives followed by nouns)

- Using text generation models to label topics by providing a prompt with the keywords and representative documents

- Chaining multiple topic representation approaches

> MMR → KeyBERT, MMR → POS

> MMR → KeyBERT → Text Generation, MMR → POS → Text Generation

Text Embedding

- Varying the language models used to embed document text

- Models with larger token limits, and models without token limits

- Models that perform character-level embeddings to better capture technical language

Topic Modeling per Category

- Perform topic modeling for each known category of the data

- Regions, reactor sites, reactor units, cornerstone areas, cross-cutting areas 10
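The maximal marginal relevance step that several of the representation chains above start with can be illustrated in pure Python. The toy keyword vectors and diversity weight below are made up; this is a sketch of the greedy relevance-versus-redundancy trade-off, not BERTopic's implementation:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mmr(topic_vec, candidates, top_n=2, diversity=0.5):
    """candidates: {keyword: vector}. Greedily pick keywords relevant to
    the topic vector but dissimilar to keywords already picked."""
    selected, remaining = [], dict(candidates)
    while remaining and len(selected) < top_n:
        def score(word):
            relevance = cosine(remaining[word], topic_vec)
            redundancy = max((cosine(remaining[word], candidates[s]) for s in selected),
                             default=0.0)
            return (1 - diversity) * relevance - diversity * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        del remaining[best]
    return selected

# "reactors" is most relevant; once picked, the near-duplicate "reactor" is
# penalized and the diverse keyword "fire" wins the second slot:
print(mmr([1, 0.2], {"reactor": [1, 0], "reactors": [0.99, 0.1], "fire": [0, 1]}))
# ['reactors', 'fire']
```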

Text Embedding Experimentation 11

Embeddings

An embedding is the first LEGO piece in the BERTopic model. The chosen embedding converts text into a numerical format that computers can understand. When a person tries to understand a problem, they bring in their own biases and experiences; similarly, the embeddings give the BERTopic model the context and experience necessary to identify topics.

Just as two different people can approach the same problem differently, two different embeddings can affect the topics identified by BERTopic. The next few slides summarize a few of the many possible embeddings we are using and present a few results. 12
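As a toy illustration of what "a numerical format" buys us: once documents are vectors, similarity is just geometry, and two models can disagree about which findings are close. The two-dimensional vectors below are invented for illustration; real models such as all-MiniLM-L6-v2 emit hundreds of dimensions:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Two hypothetical embeddings of the same three findings (made-up vectors):
embedding_a = {"valve leak": [0.9, 0.1], "valve failure": [0.8, 0.2], "fire watch": [0.1, 0.9]}
embedding_b = {"valve leak": [0.5, 0.5], "valve failure": [0.1, 0.9], "fire watch": [0.2, 0.9]}

for name, emb in [("model A", embedding_a), ("model B", embedding_b)]:
    sim = cosine(emb["valve leak"], emb["valve failure"])
    print(f"{name}: similarity(valve leak, valve failure) = {sim:.2f}")
# Model A sees the two valve findings as near-duplicates; model B does not,
# so the two models would cluster (and topic-model) them differently.
```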

Example Embeddings

BERTopic's modularity allows us to very easily test different embeddings from different online repositories.

Embedding | Description
all-MiniLM-L6-v2 | Designed for general purpose and speed. Trained on a large corpus of online data
all-mpnet-base-v2 | Designed for general purpose and quality. Trained on a large corpus of online data
xlnet-base-cased | Designed to work on language tasks that involve long context
SPECTER | Trained on scientific citations and designed to estimate the similarity of two publications
multi-qa-MiniLM-L6-dot-v1 | Designed to find relevant passages from specific queries. Trained on a large and diverse set of (question, answer) pairs
13

all-MiniLM-L6-v2 Results
- 182 topics found (including the outlier topic)
- Hierarchical clustering after topic reduction
- The top 6 most common results were: 14

all-MiniLM-L6-v2 15

all-mpnet-base-v2 Results
- 176 topics found (including the outlier topic)
- Hierarchical clustering after topic reduction
- The top 6 most common results were: 16

all-mpnet-base-v2 17

xlnet-base-cased Results
- 186 topics found (including the outlier topic)
- Hierarchical clustering after topic reduction
- The top 6 most common results were: 18

xlnet-base-cased 19

SPECTER Results
- 132 topics found (including the outlier topic)
- Hierarchical clustering after topic reduction
- The top 6 most common results were: 20

SPECTER 21

multi-qa-MiniLM-L6-dot-v1 Results
- 169 topics found (including the outlier topic)
- Hierarchical clustering after topic reduction
- The top 6 most common results were: 22

multi-qa-MiniLM-L6-dot-v1 23
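Each results slide above mentions hierarchical clustering after topic reduction. A minimal agglomerative sketch over toy topic centroids shows the idea; BERTopic itself builds its topic hierarchy with scipy linkage functions, and the centroid averaging here is a deliberate simplification:

```python
import math

def euclid(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def agglomerate(topics, n_clusters):
    """topics: {topic_id: centroid vector}. Repeatedly merge the two
    closest clusters (averaging centroids) until n_clusters remain."""
    clusters = {tid: ([tid], vec) for tid, vec in topics.items()}
    while len(clusters) > n_clusters:
        _, a, b = min(
            (euclid(va, vb), i, j)
            for i, (_, va) in clusters.items()
            for j, (_, vb) in clusters.items()
            if i < j
        )
        members = clusters[a][0] + clusters[b][0]
        centroid = [(x + y) / 2 for x, y in zip(clusters[a][1], clusters[b][1])]
        del clusters[b]
        clusters[a] = (members, centroid)
    return [sorted(m) for m, _ in clusters.values()]

# Topics 0 and 1 are close, topic 2 is far away:
print(agglomerate({0: [0, 0], 1: [0.1, 0], 2: [5, 5]}, 2))  # [[0, 1], [2]]
```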

Topic Modeling for Different Categories 24

Topic Modeling for Different Categories

- Create a topic model and then extract, for each topic, its representation per class
- See how certain topics, calculated over all documents, are represented for certain subgroups
- Default BERTopic configuration (all-MiniLM-L6-v2 for embedding item introduction text, and MMR to reduce redundancy)

Categories explored:

- Region

- Reactor Site

- Cornerstones

- Cross-cutting Areas 25
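The per-class representation can be sketched with a simplified class-based tf-idf in pure Python. This is a stand-in for BERTopic's per-class c-TF-IDF machinery, with an invented scoring formula and toy documents, purely to show the shape of the computation:

```python
import math
from collections import Counter

def top_words_per_class(docs, classes, top_n=2):
    """Score words per class by term frequency, down-weighted by how many
    classes the word appears in (the intuition behind c-TF-IDF)."""
    class_counts = {}
    for doc, cls in zip(docs, classes):
        class_counts.setdefault(cls, Counter()).update(doc.lower().split())
    n_classes = len(class_counts)
    class_freq = Counter(w for counts in class_counts.values() for w in counts)
    top = {}
    for cls, counts in class_counts.items():
        scores = {w: tf * math.log(1 + n_classes / class_freq[w]) for w, tf in counts.items()}
        top[cls] = [w for w, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:top_n]]
    return top

# Toy inspection findings grouped by region (invented examples):
docs = ["valve leak found", "valve stuck open", "fire watch missed", "fire door blocked"]
classes = ["Region I", "Region I", "Region II", "Region II"]
result = top_words_per_class(docs, classes)
print(result)  # 'valve' leads Region I, 'fire' leads Region II
```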

Topic Modeling for Different Categories Demo (See screen shots at the end of the deck) 26

[Demo screenshots: safety clusters per category, with year filters 2000 / 2020 / 2023. Recovered table content:]

Safety Clusters | Reactors | Reports | Inspection Proc
Reactor | POIN | ML13317A5 | 71153
Fire | FAR | ML12026A4 | 71111.13
Valve | DIAB | ML1030752 | 71111.15
Generator | GINN | ML11312A0 | 71111.20
Procedure | SUR | ML1019005 | 71111.21
Control | HAR | ML13204A0 | 71114.06
Maintain | FERM | ML12121A5 | 71152B

Progress 28

SOW Task Status

Phase I: March 6, 2023 - April 9, 2023
- Describe the Problem: Complete
- Search the Literature: Complete
- Select Candidates: Complete
- Select Evaluation Factors: Complete
- Develop evaluation factor weights: Complete
- Define evaluation factor ranges: Complete
- Perform assessment: Complete
- Report Results: Complete
- Deliver Trade study report: Complete

Phase II: March 20, 2023 - May 7, 2023
- Platform/system selection and installation: In progress
- Data acquisition and preparation: In progress
- Feature pipeline engineering: In progress
- Clustering method experimentation & selection: In progress
- Cluster pipeline engineering: In progress
- Anomaly detection (as needed): Not started
- Model Development, Training, Evaluation: Not started
- Test harness development: Not started
- PoC integration and demonstration: Not started
- Trial runs and evaluation: Not started
- Demonstrate PoC capability: Not started

Phase III: April 19, 2023 - June 16, 2023
- Live data ingestion: Not started
- Model execution: Not started
- Cluster evaluation: Not started
- Critical Method documentation: Not started
- Technical Report Document: Not started
- Deliver final report with findings: Not started 29

Next Steps 30

Next Steps

Experiment with alternatives in BERTopic's composable parts:

- Topic representation
- Embedding
- Variants of topic modeling

Begin Azure configuration for no-code solution. Share early topics / safety clusters with SMEs. 31

Backup 32

Cornerstone 33

Region 34

Site Name 35

Cross Cutting Area 36