ML23262B171

Meeting Slides 20230419-final
ML23262B171
Person / Time
Issue date: 04/19/2023
From: Chang Y, Friedman C, Mishkin A, Polra S, Pringle S, Tanya Smith, Vasquez G
NRC/RES/DRA/HFRB, Sphere of Influence

Machine Learning Demo Wednesday

Prioritizing Inspections using ML

Alec Mishkin, Guillermo Vasquez, Stuti Polra, Casey Friedman, Scott Pringle, Theresa Smith

Wednesday, April 19, 2023

Agenda

- Topic Modeling Metrics

- Topic Reduction with BERTopic

- Supplemental NLP Tasks

> Summarization

> Question-Answering

- Progress and Next Steps

Topic Modeling Metrics

Metrics

As we established over the last few weeks, there are many hyperparameters and embeddings that can be used for topic modeling.

Although the most important metric will come from the NRC, non-domain-specific metrics for topic modeling are crucial to assist us during the development stage.

Over the years, many different evaluation techniques have been developed. In the following slides we will review three techniques for topic model evaluation that we plan to use in the future:

- Diversity - seeking highly diverse topics

> Inverted Rank-Biased Overlap

> Word Embedding-based Pairwise Distance

- Coherence - seeking internally coherent topics

> Topic Coherence Model

Inverted Rank-Biased Overlap

Inverted Rank-Biased Overlap (IRBO) is the inverse of Rank-Biased Overlap (RBO), an estimate of the similarity of two ranked lists. It is commonly computed as IRBO = 1 - RBO.

$$\mathrm{RBO}(\text{List 1}, \text{List 2}, p) = (1 - p) \sum_{d=1}^{k} p^{\,d-1} A_d$$

- $k$ = number of top words

- $d$ = depth of the ranking (number of top words to compare)

- $p$ = tunable weighting factor between 0 and 1

- $A_d = \dfrac{|\text{List 1}_{1:d} \cap \text{List 2}_{1:d}|}{d}$ = fraction of words shared by the two lists down to depth $d$

Notice how, due to the exponential term $p^{d-1}$, similarity between the top words of the lists counts for more than similarity between words lower in the lists.

To make use of IRBO, the average IRBO over all pairs of topics (comparing their top-n word lists) is calculated.

- The higher the average IRBO, the more distinct the topics are

- We are seeking topics with minimal overlap
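The calculation described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the project: it uses the plain truncated sum, takes IRBO as 1 - RBO, and the function names, the default `p`, and the example word lists are all assumptions made for illustration. Library implementations may extrapolate or normalize differently.

```python
from itertools import combinations


def rbo(list1, list2, p=0.9):
    """Truncated Rank-Biased Overlap of two ranked word lists.

    Computes (1 - p) * sum_{d=1..k} p**(d-1) * A_d, where A_d is the
    fraction of words shared by the top-d prefixes of both lists.
    """
    k = min(len(list1), len(list2))
    seen1, seen2 = set(), set()
    total = 0.0
    for d in range(1, k + 1):
        seen1.add(list1[d - 1])
        seen2.add(list2[d - 1])
        agreement = len(seen1 & seen2) / d  # A_d
        total += p ** (d - 1) * agreement   # early depths get larger weights
    return (1 - p) * total


def average_irbo(topics, p=0.9):
    """Mean of (1 - RBO) over every pair of topic word lists."""
    pairs = list(combinations(topics, 2))
    return sum(1.0 - rbo(a, b, p) for a, b in pairs) / len(pairs)


# Hypothetical top-3 word lists for three topics (illustrative only).
topics = [
    ["pump", "valve", "leak"],
    ["pump", "relay", "trip"],
    ["flood", "diesel", "battery"],
]
print(average_irbo(topics, p=0.5))
```

With the truncated sum, two identical lists of length k score 1 - p^k rather than exactly 1, so the smallest possible IRBO in this sketch is p^k, not 0.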

Word Embedding-based Pairwise Distance

While the previous diversity metric did not require any external resources, we can use the word embeddings that were input to BERTopic during training to yield another diversity metric.

Word Embedding-based Pairwise Distance (WEPWD) is a very simple metric designed to calculate the distance between two lists of words in embedding space:

$$\mathrm{WEPWD}(\text{List 1}, \text{List 2}) = \sum_{d_1=1}^{k} \sum_{d_2=1}^{k} \text{CosineDistance}(v_{d_1}, w_{d_2})$$

- $k$ = number of top words

- $d_1$ = current depth of List 1

- $d_2$ = current depth of List 2

- $v_{d_1}$ = the embedding vector of the word at depth $d_1$ in List 1

- $w_{d_2}$ = the embedding vector of the word at depth $d_2$ in List 2

This metric does not give a higher weight to the more important words.

A variant of this calculation can be used to measure the distance between the centroids of different topics.

- The larger the pairwise distance, the more diverse the topics are

- We are seeking topics that are far away from each other
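As a rough illustration of the pairwise-distance idea and the centroid variant mentioned on this slide, the sketch below sums cosine distances over every cross-list word pair. The two-dimensional `toy_embeddings` dictionary and all function names are hypothetical; in practice the vectors would come from the embedding model used with BERTopic.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def wepwd(list1, list2, embeddings):
    """Sum of cosine distances over every (word in list1, word in list2) pair."""
    return sum(
        cosine_distance(embeddings[w1], embeddings[w2])
        for w1 in list1
        for w2 in list2
    )


def centroid(words, embeddings):
    """Unweighted mean of the word vectors of one topic."""
    vectors = [embeddings[w] for w in words]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def centroid_distance(list1, list2, embeddings):
    """Variant: cosine distance between the two topic centroids."""
    return cosine_distance(centroid(list1, embeddings), centroid(list2, embeddings))


# Hypothetical 2-D embeddings, chosen only to make the arithmetic easy to follow.
toy_embeddings = {
    "pump": [1.0, 0.0],
    "valve": [0.8, 0.6],
    "relay": [0.0, 1.0],
    "breaker": [0.6, 0.8],
}
print(wepwd(["pump", "valve"], ["relay", "breaker"], toy_embeddings))
print(centroid_distance(["pump", "valve"], ["relay", "breaker"], toy_embeddings))
```

Because the sum is not normalized, it grows with the number of words compared, so scores are only comparable between topic pairs that use the same number of top words.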

Topic Coherence

- Topic Coherence is part of the gensim library

- Unlike the previous two methods, it works on individual topics (although results can be aggregated)

- The goal of the algorithm is to estimate how coherent the words in a topic are, based on some external corpus

- The algorithm is split into four parts: segmentation, probability calculation, confirmation measure, and aggregation

[Figure: the four parts of the topic coherence pipeline. Image from https://miro.medium.com/v2/resize:fit:1400/format:webp/1*9rH9c-VJ3K27Q__wulDNvA.png]

Segmentation and Probability Calculations

The first step in Topic Coherence is Segmentation.

- Segmentation creates subsets of the list of top words. For example, if the words are {desk, pencil, chair}, then one collection of subsets could be: S={(desk, pencil),(desk, chair),(pencil, desk),(pencil, chair),(chair, desk),(chair, pencil)}

- The type of segmentation we perform will dictate the type of coherence we are calculating.

The second step, Probability Calculation, estimates the probability of one or multiple words showing up in an external corpus.

- One method is to estimate how many times multiple words show up in a sliding window of text from your external corpus. This allows us to estimate the co-occurrence of words in our topics

- The choice of external corpus is very important. Some tasks use Wikipedia. I would like to try this methodology using the significant amount of documentation that the NRC has, including inspection manuals, procedures, etc.

The third step, Confirmation Measure, takes a single pair of subsets from the segmentation step and their corresponding probabilities from the probability calculation step to compute how strongly one subset supports another.

- There are many different confirmation measures we can use. One example is the difference confirmation measure:

- $m(\text{subset}_1, \text{subset}_2) = P(\text{subset}_1 \mid \text{subset}_2) - P(\text{subset}_1)$

The final step is Aggregation, which aggregates the confirmation measures over all subset pairs into a single coherence score. We typically use the mean or median for this.

We are seeking cohesive topics.
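The four steps can be walked through with a small from-scratch sketch. This is not gensim's implementation: it assumes one-to-one word pairs for segmentation, a boolean sliding window for probability estimation, the difference measure P(w1 | w2) - P(w1) for confirmation, and the mean for aggregation. The tiny corpus and every function name are hypothetical.

```python
from itertools import permutations
from statistics import mean


def segment_one_one(top_words):
    """Step 1: every ordered pair of distinct top words."""
    return list(permutations(top_words, 2))


def window_probabilities(corpus_tokens, window_size):
    """Step 2: estimate probabilities from sliding windows over the corpus."""
    windows = []
    for doc in corpus_tokens:
        if len(doc) <= window_size:
            windows.append(set(doc))  # short document: a single window
        else:
            for start in range(len(doc) - window_size + 1):
                windows.append(set(doc[start:start + window_size]))

    def prob(*words):
        # Fraction of windows in which all given words co-occur.
        hits = sum(1 for window in windows if all(w in window for w in words))
        return hits / len(windows)

    return prob


def difference_confirmation(w1, w2, prob):
    """Step 3: P(w1 | w2) - P(w1); 0.0 if w2 never occurs."""
    p_w2 = prob(w2)
    if p_w2 == 0:
        return 0.0
    return prob(w1, w2) / p_w2 - prob(w1)


def coherence(top_words, corpus_tokens, window_size=5):
    """Step 4: aggregate the confirmation scores with the mean."""
    prob = window_probabilities(corpus_tokens, window_size)
    scores = [
        difference_confirmation(w1, w2, prob)
        for w1, w2 in segment_one_one(top_words)
    ]
    return mean(scores)


# Hypothetical tokenized external corpus (illustrative only).
corpus = [["desk", "pencil", "chair"], ["desk", "pencil"], ["valve", "pump"]]
print(coherence(["desk", "pencil"], corpus))  # words that co-occur
print(coherence(["desk", "pump"], corpus))    # words that never co-occur
```

Words that share windows score above zero and words that never co-occur score below zero, which is the behavior the aggregation step is meant to summarize per topic.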

BERTopic: Topic Reduction

- BERTopic discovers a large number of topics under the default configuration

- Topics can be combined manually through inspection of topic terms

1) Increase the minimum number of documents needed to form a cluster in HDBSCAN to obtain broader topics (default is 10)

Topic Embeddings

- Topic representations for each topic (top N words or phrases, depending on the representation model used)

- Embed topic representations using embedding model to get multiple embeddings per topic

- Take weighted average of embeddings in a topic by their c-TF-IDF score (giving more weight to the embeddings of the words or phrases that best represent the topic)

2) Reduce to a specific number: Hierarchical (Agglomerative) Clustering of topic embeddings

- Agglomerative clustering: bottom-up approach where each point starts in its own cluster and clusters are successively merged using a linkage criterion that determines the merging strategy (ward, maximum, average, single)

- Similar topics are iteratively merged using the cosine distance between topic embeddings

- Merges topics even if they are not very similar to reduce to the specified number of topics

3) Automatically reduce to a number: Use HDBSCAN to cluster topic embeddings and merge topics that cluster together

- HDBSCAN generates outliers, preventing topics from being merged if no other topics are similar

The following results use the default BERTopic configuration (all-MiniLM-L6-v2 for embedding, UMAP and HDBSCAN with minimum cluster size = 10 documents, CountVectorizer with 1-grams, c-TF-IDF) plus an MMR representation (diversity = 0.6).
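The topic-embedding and agglomerative-merging ideas from this slide can be illustrated with a small pure-Python sketch. It is not BERTopic's internal code: it assumes average linkage, uses made-up 2-D vectors and c-TF-IDF weights, and every function name is hypothetical.

```python
import math


def weighted_topic_embedding(word_vectors, ctfidf_scores):
    """Average the word vectors of a topic, weighted by c-TF-IDF score."""
    total = sum(ctfidf_scores)
    dims = len(word_vectors[0])
    return [
        sum(score * vec[i] for score, vec in zip(ctfidf_scores, word_vectors)) / total
        for i in range(dims)
    ]


def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return 1.0 - dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def reduce_topics(topic_embeddings, n_topics):
    """Bottom-up merging with average linkage until n_topics clusters remain."""
    clusters = [[i] for i in range(len(topic_embeddings))]

    def linkage(c1, c2):
        # Average cosine distance between all members of the two clusters.
        distances = [
            cosine_distance(topic_embeddings[i], topic_embeddings[j])
            for i in c1
            for j in c2
        ]
        return sum(distances) / len(distances)

    while len(clusters) > n_topics:
        # Find and merge the closest pair, however far apart it is.
        _, a, b = min(
            (linkage(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Hypothetical topic embeddings: topics 0 and 1 are close, as are 2 and 3.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(reduce_topics(embeddings, 2))
print(weighted_topic_embedding([[1.0, 0.0], [0.0, 1.0]], [3.0, 1.0]))
```

Because the loop always merges the closest remaining pair until the target count is reached, asking for a very small number of topics forces dissimilar topics together, which is the drawback noted for option 2.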

Topic Reduction: 1) create larger clusters

Increase the minimum number of documents needed to form a cluster in HDBSCAN to obtain broader topics

- Increase minimum cluster size from 10 to 30

- Min Docs = 10: 129 topics

- Min Docs = 30: 47 topics

Topic Reduction: 2) hierarchical clustering of topic embeddings

Hierarchical (agglomerative) clustering of topic embeddings

- Before: 120 topics

- After: 29 topics

Topic Reduction: 3) merge topics that cluster together

Use HDBSCAN to cluster topic embeddings and only merge topics that cluster together, leaving outliers as standalone topics

Supplemental Natural Language Processing Tasks

Named Entity Recognition (NER)

- Labels spans of text as different named entities (law, facility, product, etc.)

- Extracted entities can be used for topic modeling per category or semi-supervised topic modeling

- Extracted entities can better focus the input text to topic modeling or the topic representation in found topics

Summarization and Question-Answering (QA)

- Obtaining the most important phrases or sentences of interest from paragraphs or long documents

- Extracted text can be used to narrow down and better focus the input text to topic modeling

- Summarize each of the three representative documents for a topic to get a theme for the topic

Fine-tuning pre-trained models on our data would provide much better results. In the absence of labeled training data, however, we can use models pre-trained on large text corpora, or models fine-tuned on text with technical language like ours, to supplement our unsupervised discovery of safety issues.

Summarization

Creating short, accurate summaries of longer text documents

Input

- Single document: summary of shorter text from one document

- Multi-document: summary of arbitrarily long text that can span multiple documents

Output

- Extractive: select the most important sentences in the text (older, frequency-based approaches)

- Abstractive: generate phrases and sentences that provide a meaningful summary of the text (newer, deep-learning based NLP approaches)

Purpose

- Generic: no assumption about domain or content of text

- Domain-specific: text to be summarized belongs to a specialized domain

- Query-based: summaries are focused on answering the question

Transformer language models available from the Hugging Face library (pre-trained on various datasets and fine-tuned for summarization on various datasets) were explored using NRC Inspection Findings item introductions

- T5-base, Flan-T5-base, bart-large-cnn, Pegasus (xsum, cnn-dailymail, arxiv, pubmed)
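To make the extractive category concrete, here is a minimal frequency-based extractive summarizer of the older kind mentioned above. It is only a sketch for contrast with the transformer models listed on this slide, which are abstractive or fine-tuned systems. The stopword list, the scoring rule, and the example text are assumptions made for illustration.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption, not a standard resource).
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "was", "because", "is", "for", "that"}


def tokenize(text):
    """Lowercase word tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z0-9']+", text.lower()) if w not in STOPWORDS]


def extractive_summary(text, n_sentences=1):
    """Keep the n sentences whose words are most frequent in the whole text."""
    # Naive sentence split; abbreviations such as "i.e." would need special handling.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(tokenize(text))

    def score(sentence):
        tokens = tokenize(sentence)
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:n_sentences])  # restore original sentence order
    return " ".join(sentences[i] for i in keep)


# Hypothetical input text (illustrative only).
text = "The pump failed. The pump failed because the pump seal leaked. Weather was mild."
print(extractive_summary(text, n_sentences=1))
```

An extractive method can only return sentences that already exist in the input, whereas the abstractive models explored here generate new wording.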

Summarization

google/pegasus-xsum

Input: A self-revealed finding with its safety significance as yet to be determined (TBD) and an associated Apparent Violation (AV) of 10 CFR Part 50, Appendix B, Criterion III, Design Control were identified for the licensee's apparent failure to adequately translate the design basis of the Auxiliary Feedwater System (AFW) into procedures and instructions which resulted in the inoperability of the turbine-driven auxiliary feedwater pump (TDAFWP) on August 3, 2022.

Summary: The US Nuclear Regulatory Commission (NRC) has opened an investigation into the failure of a turbine-driven auxiliary feedwater pump at a nuclear power plant in Pennsylvania.

google/pegasus-cnn_dailymail

Input: The inspectors identified a finding (FIN) of very low safety significance (Green), for the licensee's failure to perform an adequate operating experience evaluation for NRC Information Notice 2017-06 which identified how the site might be vulnerable to the situation described in the IN. This was contrary to Point Beach Procedure PI-AA-102-1001, Operating Experience Program Screening and Responding to Incoming Operating Experience. Specifically, the licensee failed to evaluate the effects of the additional short circuit current expected from the battery chargers and determine the impact on the direct current distribution buses and interrupting devices.

Summary: The licensee failed to evaluate the effects of the additional short circuit current expected from the battery chargers . The licensee also failed to determine the impact on the direct current distribution buses and interrupting devices .

Summarization

google/pegasus-cnn_dailymail

Input: A self-revealed Green finding and associated non-cited violation of 10 CFR 50.65(a)(4) Requirements for Monitoring the Effectiveness of Maintenance at Nuclear Power Plants, was identified for the licensee's failure to assess and manage the increase in risk that may result from maintenance activities that were performed in the switchyard house. Specifically, the licensee's failure to assess and manage the increase in risk associated with the movement of floor tiles near vibration sensitive relays on August 3, 2022, resulted in a (1) partial loss of offsite power to the 1B' startup transformer (SUT) and (2) absent the KC-2 relay setpoint, the loss of generation (i.e., runback) of approximately 10-percent power.

Summary: The licensee failed to assess and manage the increase in risk associated with the movement of floor tiles near vibration sensitive relays . The failure resulted in a partial loss of offsite power to the 1B' startup transformer (SUT) and the loss of generation (i.e., runback) of approximately 10-percent power .

google/pegasus-cnn_dailymail

Input: The NRC inspectors identified a Green finding and associated Non-cited Violation (NCV) of 10 CFR 50, Appendix B, Criterion III, Design Control when the licensee performed a plant modification to remove the automatic trip of the circulating water pumps after a flood is detected in the turbine building. Specifically, the licensee did not evaluate the impact of water leakage into the emergency diesel generator (EDG) and battery rooms during the analyzed internal flooding event.

Summary: NRC inspectors identified a Green finding and associated Non-cited Violation (NCV) of 10 CFR 50, Appendix B, Criterion III, Design Control .The licensee did not evaluate the impact of water leakage into the emergency diesel generator (EDG) and battery rooms during the analyzed internal flooding event .

Summarization

facebook/bart-large-cnn

Input: A self-revealed Green finding and associated non-cited violation of 10 CFR 50.65(a)(4) Requirements for Monitoring the Effectiveness of Maintenance at Nuclear Power Plants, was identified for the licensee's failure to assess and manage the increase in risk that may result from maintenance activities that were performed in the switchyard house. Specifically, the licensee's failure to assess and manage the increase in risk associated with the movement of floor tiles near vibration sensitive relays on August 3, 2022, resulted in a (1) partial loss of offsite power to the 1B' startup transformer (SUT) and (2) absent the KC-2 relay setpoint, the loss of generation (i.e., runback) of approximately 10-percent power.

Summary: A self-revealed Green finding and associated non-cited violation of 10 CFR 50.65(a)(4) requirements for Monitoring the Effectiveness of Maintenance at Nuclear Power Plants, was identified. The licensee's failure to assess and manage the increase in risk associated with the movement of floor tiles near vibration sensitive relays on August 3, 2022, resulted in a partial loss of offsite power.

google/t5-base

Input: A self-revealed Green finding was identified for the licensee's failure to take appropriate action to address the impacts of a setpoint change on relay KC-2 into design-basis documentation in accordance with licensee procedure NMP-ES-039-001, Calculations - Preparation and Revision, Version 6. As a result, the licensee failed to update drawing A-177048, U1 Main Transformer Switchyard Fault Detector, Version 1, with the correct relay setting, which resulted in an improperly configured relay and unnecessary automatic reactor trip of Unit 1 on August 3, 2022.

Summary: licensee failed to update drawing A-177048, U1 Main Transformer Switchyard Fault Detector, Version 1, with the correct relay setting. resulted in an improperly configured relay and unnecessary automatic reactor trip of Unit 1 on august 3, 2022

Summarization

google/flan-t5-base

Input: The NRC inspectors identified a Green finding and associated Non-cited Violation (NCV) of 10 CFR 50, Appendix B, Criterion III, Design Control when the licensee performed a plant modification to remove the automatic trip of the circulating water pumps after a flood is detected in the turbine building. Specifically, the licensee did not evaluate the impact of water leakage into the emergency diesel generator (EDG) and battery rooms during the analyzed internal flooding event.

Summary: The NRC found a Green finding and a NCV in 10 CFR 50, Appendix B, Criterion III, Design Control when the licensee performed a plant modification to remove the automatic trip of the circulating water pumps after a flood is detected in the turbine building.

google/flan-t5-base

Input: The inspectors identified a finding of very low safety significance (Green) and a Non-Cited Violation of 10 CFR Part 50, Appendix B, Criterion III, Design Control, for the licensee's failure to ensure core cooling flow could be maintained and not interrupted during the transition from the injection phase to the recirculation phase assuming a single failure. Specifically, when verifying the adequacy of design via time critical operator actions, the licensee failed to assume a more limiting single failure of valve SI-856A or SI-856B.

Summary: The licensee failed to ensure core cooling flow could be maintained and not interrupted during the transition from the injection phase to the recirculation phase assuming a single failure.

Question-Answering

Retrieve or generate an answer to a question, with or without given context

Input

- Open domain (answer questions about anything)

- Closed domain (answer questions from only one specialized domain)

- Open book (context passage provided or retrieved from a knowledge base)

- Closed book (no context passage provided)

Output

- Extractive: extract the answer from given text or context

- Abstractive: generate the answer based on the given context

Transformer language models available from the Hugging Face library (pre-trained on various datasets and fine-tuned for question-answering on various datasets) were explored, using NRC Inspection Findings item introductions as the context and a question about the safety issue in the passage

- Flan-T5-base, roberta-base-squad2, bert-large-uncased-whole-word-masking-finetuned-squad

Question-Answering

deepset/roberta-base-squad2

Context: The inspectors identified a finding of very low safety significance (Green) and an associated non-cited violation (NCV) of 10 CFR 50.55a(f), Preservice and Inservice Testing Requirements, for the licensee's failure to include Emergency Diesel Generator (EDG) Fuel Oil Transfer Pump (FOTP) Discharge Unloader Valve, FO-3983A, into the IST Program in accordance with American Society of Mechanical Engineers (ASME) Operation and Maintenance (OM) Code - 2017 Edition, Subsection ISTA-1100, Scope. Specifically, during the most recent IST Program update, the licensee incorrectly applied the exclusion criteria from the ASME OM Code to the valve resulting in the exclusion of its safety functions from the scope of the IST Program for testing.

Question: What is the most important safety issue discussed in this passage?

Answer: low safety significance

deepset/roberta-base-squad2

Context: The NRC inspectors identified a Green finding and associated Non-cited Violation (NCV) of 10 CFR 50, Appendix B, Criterion III, Design Control when the licensee performed a plant modification to remove the automatic trip of the circulating water pumps after a flood is detected in the turbine building. Specifically, the licensee did not evaluate the impact of water leakage into the emergency diesel generator (EDG) and battery rooms during the analyzed internal flooding event.

Question: What is the most important safety issue discussed in this passage?

Answer: water leakage

Question-Answering

bert-large-uncased-whole-word-masking-finetuned-squad

Context: The inspectors identified a finding of very low safety significance (Green) and an associated non-cited violation (NCV) of 10 CFR 50.55a(f), Preservice and Inservice Testing Requirements, for the licensee's failure to include Emergency Diesel Generator (EDG) Fuel Oil Transfer Pump (FOTP) Discharge Unloader Valve, FO-3983A, into the IST Program in accordance with American Society of Mechanical Engineers (ASME) Operation and Maintenance (OM) Code - 2017 Edition, Subsection ISTA-1100, Scope. Specifically, during the most recent IST Program update, the licensee incorrectly applied the exclusion criteria from the ASME OM Code to the valve resulting in the exclusion of its safety functions from the scope of the IST Program for testing.

Question: What is the most important safety issue discussed in this passage?

Answer: failure to include Emergency Diesel Generator (EDG) Fuel Oil Transfer Pump

bert-large-uncased-whole-word-masking-finetuned-squad

Context: A self-revealed Green finding and associated non-cited violation of 10 CFR 50.65(a)(4) Requirements for Monitoring the Effectiveness of Maintenance at Nuclear Power Plants, was identified for the licensee's failure to assess and manage the increase in risk that may result from maintenance activities that were performed in the switchyard house. Specifically, the licensee's failure to assess and manage the increase in risk associated with the movement of floor tiles near vibration sensitive relays on August 3, 2022, resulted in a (1) partial loss of offsite power to the 1B' startup transformer (SUT) and (2) absent the KC-2 relay setpoint, the loss of generation (i.e., runback) of approximately 10-percent power.

Question: What is the most important safety issue discussed in this passage?

Answer: movement of floor tiles near vibration sensitive relays

Question-Answering

bert-large-uncased-whole-word-masking-finetuned-squad

Context: A self-revealed finding with its safety significance as yet to be determined (TBD) and an associated Apparent Violation (AV) of 10 CFR Part 50, Appendix B, Criterion III, Design Control were identified for the licensee's apparent failure to adequately translate the design basis of the Auxiliary Feedwater System (AFW) into procedures and instructions which resulted in the inoperability of the turbine-driven auxiliary feedwater pump (TDAFWP) on August 3, 2022.

Question: What is the most important safety issue discussed in this passage?

Answer: inoperability of the turbine-driven auxiliary feedwater pump

bert-large-uncased-whole-word-masking-finetuned-squad

Context: The inspectors identified a finding of very low safety significance (Green) and a Non-Cited Violation of 10 CFR Part 50, Appendix B, Criterion III, Design Control, for the licensee's failure to ensure core cooling flow could be maintained and not interrupted during the transition from the injection phase to the recirculation phase assuming a single failure. Specifically, when verifying the adequacy of design via time critical operator actions, the licensee failed to assume a more limiting single failure of valve SI-856A or SI-856B.

Question: What is the most important safety issue discussed in this passage?

Answer: failure to ensure core cooling flow could be maintained

Question-Answering

google/flan-t5-base

Context: The inspectors identified a finding of very low safety significance (Green) and an associated non-cited violation (NCV) of 10 CFR 50.55a(f), Preservice and Inservice Testing Requirements, for the licensee's failure to include Emergency Diesel Generator (EDG) Fuel Oil Transfer Pump (FOTP) Discharge Unloader Valve, FO-3983A, into the IST Program in accordance with American Society of Mechanical Engineers (ASME) Operation and Maintenance (OM) Code - 2017 Edition, Subsection ISTA-1100, Scope. Specifically, during the most recent IST Program update, the licensee incorrectly applied the exclusion criteria from the ASME OM Code to the valve resulting in the exclusion of its safety functions from the scope of the IST Program for testing.

Question: What is the safety issue in this passage?

Answer: The licensee incorrectly applied the exclusion criteria from the ASME OM Code to the valve resulting in the exclusion of its safety functions from the scope of the IST Program for testing

google/flan-t5-base

Context: The inspectors identified a finding (FIN) of very low safety significance (Green), for the licensee's failure to perform an adequate operating experience evaluation for NRC Information Notice 2017-06 which identified how the site might be vulnerable to the situation described in the IN. This was contrary to Point Beach Procedure PI-AA-102-1001, Operating Experience Program Screening and Responding to Incoming Operating Experience. Specifically, the licensee failed to evaluate the effects of the additional short circuit current expected from the battery chargers and determine the impact on the direct current distribution buses and interrupting devices.

Question: What is the safety issue in this passage?

Answer: The licensee failed to evaluate the effects of the additional short circuit current expected from the battery chargers and determine the impact on the direct current distribution buses and interrupting devices

Question-Answering

google/flan-t5-base

Context: A self-revealed Green finding and associated non-cited violation of 10 CFR 50.65(a)(4) Requirements for Monitoring the Effectiveness of Maintenance at Nuclear Power Plants, was identified for the licensee's failure to assess and manage the increase in risk that may result from maintenance activities that were performed in the switchyard house. Specifically, the licensee's failure to assess and manage the increase in risk associated with the movement of floor tiles near vibration sensitive relays on August 3, 2022, resulted in a (1) partial loss of offsite power to the 1B' startup transformer (SUT) and (2) absent the KC-2 relay setpoint, the loss of generation (i.e., runback) of approximately 10-percent power.

Question: What is the safety issue in this passage?

Answer: The licensee's failure to assess and manage the increase in risk associated with the movement of floor tiles near vibration sensitive relays on August 3, 2022, resulted in a (1) partial loss of offsite power to the 1B' startup transformer (SUT) and (2) absent the KC-2 relay setpoint, the loss of generation (i.e., runback

google/flan-t5-base

Context: The NRC inspectors identified a Green finding and associated Non-cited Violation (NCV) of 10 CFR 50, Appendix B, Criterion III, Design Control when the licensee performed a plant modification to remove the automatic trip of the circulating water pumps after a flood is detected in the turbine building. Specifically, the licensee did not evaluate the impact of water leakage into the emergency diesel generator (EDG) and battery rooms during the analyzed internal flooding event.

Question: What is the safety issue in this passage?

Answer: The licensee did not evaluate the impact of water leakage into the emergency diesel generator (EDG) and battery rooms during the analyzed internal flooding event

Comparisons

Input: The NRC inspectors identified a Green finding and associated Non-cited Violation (NCV) of 10 CFR 50, Appendix B, Criterion III, Design Control when the licensee performed a plant modification to remove the automatic trip of the circulating water pumps after a flood is detected in the turbine building. Specifically, the licensee did not evaluate the impact of water leakage into the emergency diesel generator (EDG) and battery rooms during the analyzed internal flooding event.

google/pegasus-cnn_dailymail (summary)

NRC inspectors identified a Green finding and associated Non-cited Violation (NCV) of 10 CFR 50, Appendix B, Criterion III, Design Control .The licensee did not evaluate the impact of water leakage into the emergency diesel generator (EDG) and battery rooms during the analyzed internal flooding event .

google/flan-t5-base (summary)

The NRC found a Green finding and a NCV in 10 CFR 50, Appendix B, Criterion III, Design Control when the licensee performed a plant modification to remove the automatic trip of the circulating water pumps after a flood is detected in the turbine building.

deepset/roberta-base-squad2 (Q&A)

Question: What is the most important safety issue discussed in this passage?

Answer: water leakage

google/flan-t5-base (Q&A)

Question: What is the safety issue in this passage?

Answer: The licensee did not evaluate the impact of water leakage into the emergency diesel generator (EDG) and battery rooms during the analyzed internal flooding event

Comparisons

Input: The inspectors identified a finding of very low safety significance (Green) and an associated non-cited violation (NCV) of 10 CFR 50.55a(f), Preservice and Inservice Testing Requirements, for the licensee's failure to include Emergency Diesel Generator (EDG) Fuel Oil Transfer Pump (FOTP) Discharge Unloader Valve, FO-3983A, into the IST Program in accordance with American Society of Mechanical Engineers (ASME) Operation and Maintenance (OM) Code - 2017 Edition, Subsection ISTA-1100, Scope. Specifically, during the most recent IST Program update, the licensee incorrectly applied the exclusion criteria from the ASME OM Code to the valve resulting in the exclusion of its safety functions from the scope of the IST Program for testing.

google/flan-t5-base (summary)

The inspectors identified a finding of very low safety significance (Green) and an associated non-cited violation (NCV) of 10 CFR 50.55a(f) The licensee incorrectly applied the exclusion criteria from the OM Code to the valve resulting in the exclusion of its safety functions from the scope of the IST Program for testing.

google/pegasus-cnn_dailymail (summary)

The IST Program was re-examined for a violation of 10 CFR 50.55a(f) for the failure to include the emergency diesel generators Fuel Oil Transfer Pump (FOTP) Discharge Unloader Valve, FO-3983A, into the IST Program.

google/flan-t5-base (Q&A)

Question: What is the safety issue in this passage?

Answer: The licensee incorrectly applied the exclusion criteria from the ASME OM Code to the valve resulting in the exclusion of its safety functions from the scope of the IST Program for testing

bert-large-uncased-whole-word-masking-finetuned-squad (Q&A)

Question: What is the most important safety issue discussed in this passage?

Answer: failure to include Emergency Diesel Generator (EDG) Fuel Oil Transfer Pump

Named Entity Recognition (NER)

- Language models that have been trained to recognize text spans as different kinds of named entities

- Most available pre-trained models can extract 4 or 18 named entities

> 4-class NER models: person, location, organization, miscellaneous

> 18-class NER models: cardinal value, date, event, building or facility, geo-political entity, language, law, location, money, nationality, religious or political affiliation, ordinal, organization, percent, person, product, quantity, time, work of art

- These models can be trained to extract custom entities, but this would require training with a large text corpus in which entities and their corresponding text spans are labeled

- For our use case, a pre-trained model that does a good job of extracting the entities of interest is sufficient

Named Entity Recognition Example

[Figures: two slides of example entity annotations comparing Flair's ner-english-ontonotes-large model with spaCy's en_core_web_trf model]

Named Entity Recognition (NER)

Different models will be better at recognizing different entities (e.g., Law with Flair, Facility with spaCy, Product with both)

Pre-trained NER models aren't completely reliable on technical data; they may miss some parts of the entities or some entities altogether

Extracted entities can be used at various points in the BERTopic pipeline

- As categories to view found topics under, and extract topic representations for

- Steer dimensionality reduction of document embeddings closer to embeddings of extracted entities in a semi-supervised topic modeling approach

- With more reliable and accurate NER, and customized code:

> Select which text is fed to the embedding step (along with POS tagging to select patterns of interest)

> Tokenize text before c-TF-IDF step (along with POS tagging to select patterns of interest)

Progress

SOW Task Status

Phase I: March 6, 2023 - April 9, 2023

- Describe the Problem: Complete

- Search the Literature: Complete

- Select Candidates: Complete

- Select Evaluation Factors: Complete

- Develop evaluation factor weights: Complete

- Define evaluation factor ranges: Complete

- Perform assessment: Complete

- Report Results: Complete

- Deliver Trade study report: Complete

Phase II: March 20, 2023 - May 7, 2023

- Platform/system selection and installation: In progress

- Data acquisition and preparation: In progress

- Feature pipeline engineering: In progress

- Clustering method experimentation & selection: In progress

- Cluster pipeline engineering: In progress

- Anomaly detection (as needed): Not started

- Model Development, Training, Evaluation: In progress

- Test harness development: In progress

- PoC integration and demonstration: Not started

- Trial runs and evaluation: In progress

- Demonstrate PoC capability: Not started

Phase III: April 19, 2023 - June 16, 2023

- Live data ingestion: In progress

- Model execution: Not started

- Cluster evaluation: Not started

- Critical Method documentation: Not started

- Technical Report Document: Not started

- Deliver final report with findings: Not started

Next Steps

- Experiment with alternatives in BERTopic composable parts

- Iterate on clusters and number of clusters

- Evaluate clusters using proposed metrics

- Share clusters and metrics with NRC SMEs to ensure metrics are effective