ML23262B171
| ML23262B171 | |
| Person / Time | |
|---|---|
| Issue date: | 04/19/2023 |
| From: | Chang Y, Friedman C, Mishkin A, Polra S, Pringle S, Tanya Smith, Vasquez G NRC/RES/DRA/HFRB, Sphere of Influence |
| To: | |
| References | |
Machine Learning Demo Wednesday
Prioritizing Inspections using ML
Wednesday, April 19, 2023
Alec Mishkin, Guillermo Vasquez, Stuti Polra, Casey Friedman, Scott Pringle, Theresa Smith
Agenda
- Topic Modeling Metrics
- Topic Reduction with BERTopic
- Supplemental NLP Tasks
  - Summarization
  - Question-Answering
- Progress and Next Steps
Topic Modeling Metrics

Model Metrics
As we established over the last few weeks, many hyperparameters and embeddings can be used for topic modeling.
Although the most important metric will come from the NRC, non-domain-specific metrics for topic modeling are crucial to assist us in the development stage.
Many evaluation techniques have been developed over the years. In the following slides we review three techniques for topic model evaluation that we plan to use:
- Diversity (seeking highly diverse topics)
  - Inverted Rank-Biased Overlap
  - Word Embedding-based Pairwise Distance
- Coherence (seeking internally coherent topics)
  - Topic Coherence
Inverted Rank-Biased Overlap
Inverted Rank-Biased Overlap (IRBO) is the inverse of Rank-Biased Overlap (RBO), an estimate of the similarity of two ranked lists.
$$\mathrm{RBO}(\text{List 1}, \text{List 2}, p) = (1 - p) \sum_{d=1}^{n} p^{d-1} A_d$$
- $n$ = number of top words
- $d$ = depth of the ranking (number of top words to compare)
- $p$ = tunable weighting factor between 0 and 1
- $A_d = |\text{List 1}_{1:d} \cap \text{List 2}_{1:d}| / d$, the agreement between the two lists at depth $d$
Notice that, because of the exponential term $p^{d-1}$, similarity between the words at the top of the lists counts for more than similarity between the words near the bottom.
To use IRBO as a topic model metric, the average IRBO over all pairs of topics' top-n word lists is calculated.
The higher the average IRBO, the more distinct the topics are.
We are seeking topics with minimal overlap.
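The RBO sum and the averaged IRBO can be sketched in a few lines of plain Python. This is a minimal illustration, not the project's code: the function names `rbo` and `average_irbo`, the toy topics, and the choice of IRBO = 1 − RBO are assumptions made here. The sum is truncated at the list length, so two identical lists score 1 − p^n rather than exactly 1.

```python
from itertools import combinations


def rbo(list1, list2, p=0.9):
    """Truncated rank-biased overlap of two ranked word lists."""
    n = min(len(list1), len(list2))  # number of top words compared
    seen1, seen2 = set(), set()
    total = 0.0
    for d in range(1, n + 1):
        seen1.add(list1[d - 1])
        seen2.add(list2[d - 1])
        agreement = len(seen1 & seen2) / d  # A_d: overlap of the top-d prefixes
        total += p ** (d - 1) * agreement   # deeper ranks receive smaller weights
    return (1 - p) * total


def average_irbo(topics, p=0.9):
    """Mean of (1 - RBO) over every pair of topic word lists (assumed inversion)."""
    pairs = list(combinations(topics, 2))
    return sum(1 - rbo(a, b, p) for a, b in pairs) / len(pairs)


# Hypothetical top-word lists for three topics.
topics = [
    ["pump", "valve", "leak"],
    ["pump", "relay", "breaker"],
    ["flood", "door", "seal"],
]
print(average_irbo(topics, p=0.5))
```

Because the weight shrinks with depth, two lists that share only their first word score a higher RBO than two lists that share only their last word.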
Word Embedding-based Pairwise Distance
While the previous diversity metric did not require any external resources, we can use the word embeddings that were input to BERTopic training to obtain another diversity metric.
Word Embedding-based Pairwise Distance (WEPWD) is a very simple metric that calculates the distance between two lists of words in embedding space.
$$\mathrm{WEPWD}(\text{List 1}, \text{List 2}) = \frac{1}{n^2} \sum_{d_1=1}^{n} \sum_{d_2=1}^{n} \text{Cosine Distance}(v_{d_1}, w_{d_2})$$
- $n$ = number of top words
- $d_1$ = current depth in List 1
- $d_2$ = current depth in List 2
- $v_{d_1}$ = the embedding vector of the word at depth $d_1$ in List 1
- $w_{d_2}$ = the embedding vector of the word at depth $d_2$ in List 2
This metric does not give a higher weight to the more important words.
A variant of this calculation gives the distance between the centroids of different topics.
The larger the pairwise distance, the more diverse the topics are.
We are seeking topics that are far away from each other.
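A minimal sketch of the pairwise distance and of the centroid variant follows. The two-dimensional `toy_embeddings` and the function names are hypothetical; in practice the vectors would come from the embedding model used for topic modeling.

```python
import math


def cosine_distance(v, w):
    """1 - cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(v, w))
    norm_v = math.sqrt(sum(a * a for a in v))
    norm_w = math.sqrt(sum(b * b for b in w))
    return 1.0 - dot / (norm_v * norm_w)


def wepwd(list1, list2, embeddings):
    """Average cosine distance over all word pairs from the two lists."""
    total = sum(
        cosine_distance(embeddings[a], embeddings[b])
        for a in list1
        for b in list2
    )
    return total / (len(list1) * len(list2))


def centroid_distance(list1, list2, embeddings):
    """Variant: cosine distance between the mean vectors of the two lists."""
    def centroid(words):
        vectors = [embeddings[w] for w in words]
        return [sum(column) / len(vectors) for column in zip(*vectors)]
    return cosine_distance(centroid(list1), centroid(list2))


# Hypothetical 2-D embeddings, for illustration only.
toy_embeddings = {
    "pump": [1.0, 0.0], "valve": [0.9, 0.1],
    "relay": [0.0, 1.0], "breaker": [0.1, 0.9],
}
mechanical = ["pump", "valve"]
electrical = ["relay", "breaker"]
print(wepwd(mechanical, electrical, toy_embeddings))
print(centroid_distance(mechanical, electrical, toy_embeddings))
```

Every word pair contributes equally to the average, which matches the note above that the metric does not weight the more important words more heavily.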
Topic Coherence
Topic Coherence is part of the gensim library.
Unlike the previous two methods, it works on individual topics (although results can be aggregated).
The goal of the algorithm is to estimate how coherent the words in a topic are, based on some external corpus.
The algorithm is split into four parts: segmentation, probability calculation, confirmation measure, and aggregation.
Image from https://miro.medium.com/v2/resize:fit:1400/format:webp/1*9rH9c-VJ3K27Q__wulDNvA.png
Segmentation and Probability Calculations
The first step in Topic Coherence is Segmentation. Segmentation creates subsets of the list of top words. For example, if the words are {desk, pencil, chair}, then one collection of subsets could be: S = {(desk, pencil), (desk, chair), (pencil, desk), (pencil, chair), (chair, desk), (chair, pencil)}
The type of segmentation we perform dictates the type of coherence we are calculating.
The second step, Probability Calculation, estimates the probability of one or more words showing up in an external corpus.
One method is to count how many times multiple words show up together in a sliding window of text from the external corpus. This allows us to estimate the co-occurrence of the words in our topics.
The choice of external corpus is very important. Some tasks use Wikipedia. I would like to try this methodology using the significant amount of documentation that NRC has, including inspection manuals, procedures, etc.
The third step, Confirmation Measure, takes a single pair of subsets from the segmentation step and their corresponding probabilities from the probability calculation step to compute how strongly one subset supports another.
There are many different confirmation measures we can use. One example is the difference confirmation measure:
$$\text{difference confirmation measure}(\text{subset}_1, \text{subset}_2) = P(\text{subset}_1 \mid \text{subset}_2) - P(\text{subset}_1)$$
The final step is Aggregation, which combines the confirmation measures over all subset pairs into a single coherence score. We typically use the mean or median for this.
We are seeking cohesive topics.
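The four steps can be sketched end to end in plain Python. This is a simplified illustration under stated assumptions, not gensim's implementation: it uses ordered word pairs for segmentation, a boolean sliding window for probabilities, the difference confirmation measure, and the mean for aggregation. The function names and the toy token stream are hypothetical.

```python
from itertools import permutations
from statistics import mean


def segment_pairs(top_words):
    """Segmentation: every ordered pair of distinct top words."""
    return list(permutations(top_words, 2))


def window_probability(tokens, window_size):
    """Probability calculation: fraction of sliding windows containing all given words."""
    if window_size > len(tokens):
        raise ValueError("window_size exceeds corpus length")
    windows = [set(tokens[i:i + window_size])
               for i in range(len(tokens) - window_size + 1)]

    def prob(*words):
        hits = sum(1 for window in windows if all(w in window for w in words))
        return hits / len(windows)

    return prob


def difference_confirmation(w1, w2, prob):
    """Confirmation measure: P(w1 | w2) - P(w1)."""
    p2 = prob(w2)
    if p2 == 0:
        return 0.0  # w2 never occurs, so it offers no support either way
    return prob(w1, w2) / p2 - prob(w1)


def coherence(top_words, tokens, window_size=3):
    """Aggregation: mean confirmation over all segmented pairs."""
    prob = window_probability(tokens, window_size)
    return mean(difference_confirmation(a, b, prob)
                for a, b in segment_pairs(top_words))


# Hypothetical external corpus, already tokenized.
corpus = ["desk", "chair", "pencil", "desk", "chair", "pencil",
          "pump", "valve", "relay", "pump", "valve", "relay"]
print(segment_pairs(["desk", "pencil", "chair"]))
print(coherence(["desk", "chair"], corpus))   # words that co-occur
print(coherence(["desk", "valve"], corpus))   # words that never co-occur
```

In this toy corpus the pair that shares windows scores above zero and the pair that never shares a window scores below zero, which is the behavior the confirmation step is meant to capture.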
BERTopic: Topic Reduction
BERTopic discovers a large number of topics under the default configuration.
Topics can be combined manually through inspection of topic terms, or reduced in one of three ways:
- 1) Increase the minimum number of documents needed to form a cluster in HDBSCAN to obtain broader topics (default is 10)
- 2) Reduce to a specific number: Hierarchical (Agglomerative) Clustering of topic embeddings
  - Agglomerative clustering: bottom-up approach where each point starts in its own cluster and clusters are successively merged using a linkage criterion that determines the merging strategy (ward, maximum, average, single)
  - Similar topics are iteratively merged using the cosine distance between topic embeddings
  - Merges topics even if they are not very similar, in order to reduce to the specified number of topics
- 3) Automatically reduce to a number: Use HDBSCAN to cluster topic embeddings and merge topics that cluster together
  - HDBSCAN generates outliers, preventing topics from being merged if no other topics are similar
Topic Embeddings
- Start from the topic representation for each topic (top N words or phrases, depending on the representation model used)
- Embed the topic representations using the embedding model to get multiple embeddings per topic
- Take the weighted average of the embeddings in a topic by their c-TF-IDF score (giving more weight to the embeddings of the words or phrases that best represent the topic)
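The topic-embedding average and the second reduction strategy can be sketched as follows. This is a standalone illustration and not BERTopic's internal code: the function names, the toy vectors, and the choice of average linkage are assumptions made for the example.

```python
import math
from itertools import combinations


def weighted_topic_embedding(word_vectors, ctfidf_scores):
    """Average of a topic's word embeddings, weighted by c-TF-IDF score."""
    total = sum(ctfidf_scores)
    dims = len(word_vectors[0])
    return [
        sum(score * vec[i] for vec, score in zip(word_vectors, ctfidf_scores)) / total
        for i in range(dims)
    ]


def cosine_distance(v, w):
    dot = sum(a * b for a, b in zip(v, w))
    return 1.0 - dot / (math.sqrt(sum(a * a for a in v)) * math.sqrt(sum(b * b for b in w)))


def reduce_topics(topic_embeddings, n_topics):
    """Bottom-up merging (average linkage, cosine distance) down to n_topics clusters."""
    clusters = [[i] for i in range(len(topic_embeddings))]  # each topic starts alone

    def linkage(c1, c2):
        dists = [cosine_distance(topic_embeddings[a], topic_embeddings[b])
                 for a in c1 for b in c2]
        return sum(dists) / len(dists)

    while len(clusters) > n_topics:
        # Merge the closest pair, however far apart it may be.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical embeddings for four topics.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.2, 0.8]]
print(weighted_topic_embedding([[1.0, 0.0], [0.0, 1.0]], [3.0, 1.0]))
print(reduce_topics(embeddings, 2))
```

Asking for a single topic would merge all four, including the dissimilar pairs, which is the drawback noted for this strategy; clustering the topic embeddings with HDBSCAN avoids it by leaving outlier topics unmerged.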
The following results use the default BERTopic config (all-minilm-l6-v2 for embedding, UMAP and HDBSCAN with min cluster size = 10 docs, CountVectorizer with 1-grams, c-TF-IDF) plus an MMR representation (diversity=0.6).
Topic Reduction: 1) create larger clusters
Increase the minimum number of documents needed to form a cluster in HDBSCAN to obtain broader topics.
Increase minimum cluster size from 10 to 30:
- Min Docs = 10: 129 Topics
- Min Docs = 30: 47 Topics

Topic Reduction: 2) hierarchical clustering of topic embeddings
Hierarchical (Agglomerative) Clustering of topic embeddings:
- Before: 120 Topics
- After: 29 Topics

Topic Reduction: 3) merge topics that cluster together
Use HDBSCAN to cluster topic embeddings and only merge topics that cluster together, leaving outliers as standalone topics.
Supplemental Natural Language Processing Tasks
Named Entity Recognition (NER)
- Labels spans of text as different named entities (law, facility, product, etc.)
- Extracted entities can be used for topic modeling per category or semi-supervised topic modeling
- Extracted entities can better focus the input text to topic modeling or the topic representation in found topics
Summarization and Question-Answering (QA)
- Obtaining the most important phrases or sentences of interest from paragraphs or long documents
- Extracted text can be used to narrow down and better focus the input text to topic modeling
- Summarize each of the three representative documents for a topic to get a theme for the topic
Fine-tuning pre-trained models on our data would provide much better results, but in the absence of labeled training data, we can use models pre-trained on large text corpora, or those fine-tuned on text with technical language like our data, to supplement our unsupervised discovery of safety issues.
Summarization
Creating short, accurate summaries of longer text documents
Input
- Single document: summary of shorter text from one document
- Multi-document: summary of arbitrarily long text that can span multiple documents
Output
- Extractive: select the most important sentences in the text (older, frequency-based approaches)
- Abstractive: generate phrases and sentences that provide a meaningful summary of the text (newer, deep-learning based NLP approaches)
Purpose
- Generic: no assumption about domain or content of text
- Domain-specific: text to be summarized belongs to a specialized domain
- Query-based: summaries are focused on answering a question
Transformer language models (pre-trained on various datasets and fine-tuned for summarization tasks on various datasets) available from the HuggingFace library were explored using NRC Inspection Findings item introductions:
- T5-base, Flan-t5-base, bart-large-cnn, Pegasus (xsum, cnn-dailymail, arxiv, pubmed)
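To make the extractive, frequency-based approach concrete, here is a minimal sketch in plain Python. It is not one of the transformer models evaluated in this work: it scores each sentence by the average corpus frequency of its non-stopword tokens and returns the top sentences in their original order. The stopword list, the function names, and the sample text are all hypothetical.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption, not a standard resource).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "for",
             "was", "were", "is", "that", "during", "on", "with"}


def content_words(text):
    """Lowercased alphanumeric tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS]


def summarize(text, n_sentences=1):
    """Extractive summary: keep the n sentences with the highest average word frequency."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(content_words(text))

    def score(sentence):
        tokens = content_words(sentence)
        return sum(freq[w] for w in tokens) / max(len(tokens), 1)

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:n_sentences])  # restore original sentence order
    return " ".join(sentences[i] for i in keep)


sample = ("The pump failed during testing. "
          "The pump failure caused a plant trip. "
          "Weather was mild.")
print(summarize(sample, n_sentences=1))
```

An extractive method like this can only return sentences already present in the input, whereas abstractive models generate new text.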
Summarization examples: google/pegasus-xsum and google/pegasus-cnn_dailymail

google/pegasus-xsum
Input: A self-revealed finding with its safety significance as yet to be determined (TBD) and an associated Apparent Violation (AV) of 10 CFR Part 50, Appendix B, Criterion III, Design Control were identified for the licensee's apparent failure to adequately translate the design basis of the Auxiliary Feedwater System (AFW) into procedures and instructions which resulted in the inoperability of the turbine-driven auxiliary feedwater pump (TDAFWP) on August 3, 2022.
Summary: The US Nuclear Regulatory Commission (NRC) has opened an investigation into the failure of a turbine-driven auxiliary feedwater pump at a nuclear power plant in Pennsylvania.

google/pegasus-cnn_dailymail
Input: The inspectors identified a finding (FIN) of very low safety significance (Green), for the licensee's failure to perform an adequate operating experience evaluation for NRC Information Notice 2017-06 which identified how the site might be vulnerable to the situation described in the IN. This was contrary to Point Beach Procedure PI-AA-102-1001, Operating Experience Program Screening and Responding to Incoming Operating Experience. Specifically, the licensee failed to evaluate the effects of the additional short circuit current expected from the battery chargers and determine the impact on the direct current distribution buses and interrupting devices.
Summary: The licensee failed to evaluate the effects of the additional short circuit current expected from the battery chargers. The licensee also failed to determine the impact on the direct current distribution buses and interrupting devices.
Summarization examples: google/pegasus-cnn_dailymail

Input: A self-revealed Green finding and associated non-cited violation of 10 CFR 50.65(a)(4) Requirements for Monitoring the Effectiveness of Maintenance at Nuclear Power Plants, was identified for the licensee's failure to assess and manage the increase in risk that may result from maintenance activities that were performed in the switchyard house. Specifically, the licensee's failure to assess and manage the increase in risk associated with the movement of floor tiles near vibration sensitive relays on August 3, 2022, resulted in a (1) partial loss of offsite power to the 1B' startup transformer (SUT) and (2) absent the KC-2 relay setpoint, the loss of generation (i.e., runback) of approximately 10-percent power.
Summary: The licensee failed to assess and manage the increase in risk associated with the movement of floor tiles near vibration sensitive relays. The failure resulted in a partial loss of offsite power to the 1B' startup transformer (SUT) and the loss of generation (i.e., runback) of approximately 10-percent power.

Input: The NRC inspectors identified a Green finding and associated Non-cited Violation (NCV) of 10 CFR 50, Appendix B, Criterion III, Design Control when the licensee performed a plant modification to remove the automatic trip of the circulating water pumps after a flood is detected in the turbine building. Specifically, the licensee did not evaluate the impact of water leakage into the emergency diesel generator (EDG) and battery rooms during the analyzed internal flooding event.
Summary: NRC inspectors identified a Green finding and associated Non-cited Violation (NCV) of 10 CFR 50, Appendix B, Criterion III, Design Control. The licensee did not evaluate the impact of water leakage into the emergency diesel generator (EDG) and battery rooms during the analyzed internal flooding event.
Summarization examples: facebook/bart-large-cnn and google/t5-base

facebook/bart-large-cnn
Input: A self-revealed Green finding and associated non-cited violation of 10 CFR 50.65(a)(4) Requirements for Monitoring the Effectiveness of Maintenance at Nuclear Power Plants, was identified for the licensee's failure to assess and manage the increase in risk that may result from maintenance activities that were performed in the switchyard house. Specifically, the licensee's failure to assess and manage the increase in risk associated with the movement of floor tiles near vibration sensitive relays on August 3, 2022, resulted in a (1) partial loss of offsite power to the 1B' startup transformer (SUT) and (2) absent the KC-2 relay setpoint, the loss of generation (i.e., runback) of approximately 10-percent power.
Summary: A self-revealed Green finding and associated non-cited violation of 10 CFR 50.65(a)(4) requirements for Monitoring the Effectiveness of Maintenance at Nuclear Power Plants, was identified. The licensee's failure to assess and manage the increase in risk associated with the movement of floor tiles near vibration sensitive relays on August 3, 2022, resulted in a partial loss of offsite power.

google/t5-base
Input: A self-revealed Green finding was identified for the licensee's failure to take appropriate action to address the impacts of a setpoint change on relay KC-2 into design-basis documentation in accordance with licensee procedure NMP-ES-039-001, Calculations - Preparation and Revision, Version 6. As a result, the licensee failed to update drawing A-177048, U1 Main Transformer Switchyard Fault Detector, Version 1, with the correct relay setting, which resulted in an improperly configured relay and unnecessary automatic reactor trip of Unit 1 on August 3, 2022.
Summary: licensee failed to update drawing A-177048, U1 Main Transformer Switchyard Fault Detector, Version 1, with the correct relay setting. resulted in an improperly configured relay and unnecessary automatic reactor trip of Unit 1 on august 3, 2022
Summarization examples: google/flan-t5-base

Input: The NRC inspectors identified a Green finding and associated Non-cited Violation (NCV) of 10 CFR 50, Appendix B, Criterion III, Design Control when the licensee performed a plant modification to remove the automatic trip of the circulating water pumps after a flood is detected in the turbine building. Specifically, the licensee did not evaluate the impact of water leakage into the emergency diesel generator (EDG) and battery rooms during the analyzed internal flooding event.
Summary: The NRC found a Green finding and a NCV in 10 CFR 50, Appendix B, Criterion III, Design Control when the licensee performed a plant modification to remove the automatic trip of the circulating water pumps after a flood is detected in the turbine building.

Input: The inspectors identified a finding of very low safety significance (Green) and a Non-Cited Violation of 10 CFR Part 50, Appendix B, Criterion III, Design Control, for the licensee's failure to ensure core cooling flow could be maintained and not interrupted during the transition from the injection phase to the recirculation phase assuming a single failure. Specifically, when verifying the adequacy of design via time critical operator actions, the licensee failed to assume a more limiting single failure of valve SI-856A or SI-856B.
Summary: The licensee failed to ensure core cooling flow could be maintained and not interrupted during the transition from the injection phase to the recirculation phase assuming a single failure.
Question-Answering
Retrieve or generate an answer to a question, with or without given context
Input
- Open domain (answer questions about anything)
- Closed domain (answer questions from only one specialized domain)
- Open book (context passage provided or retrieved from a knowledge base)
- Closed book (no context passage provided)
Output
- Extractive: extract the answer from given text or context
- Abstractive: generate the answer based on the given context
Transformer language models (pre-trained on various datasets and fine-tuned for question-answering tasks on various datasets) available from the HuggingFace library were explored using NRC Inspection Findings item introductions as the context and a question about the safety issue in the passage:
- Flan-t5-base, roberta-base-squad2, bert-large-cased-whole-word-masking-finetuned-squad
Question-Answering examples: deepset/roberta-base-squad2
Question: What is the most important safety issue discussed in this passage?

Context: The inspectors identified a finding of very low safety significance (Green) and an associated non-cited violation (NCV) of 10 CFR 50.55a(f), Preservice and Inservice Testing Requirements, for the licensee's failure to include Emergency Diesel Generator (EDG) Fuel Oil Transfer Pump (FOTP) Discharge Unloader Valve, FO-3983A, into the IST Program in accordance with American Society of Mechanical Engineers (ASME) Operation and Maintenance (OM) Code - 2017 Edition, Subsection ISTA-1100, Scope. Specifically, during the most recent IST Program update, the licensee incorrectly applied the exclusion criteria from the ASME OM Code to the valve resulting in the exclusion of its safety functions from the scope of the IST Program for testing.
Answer: low safety significance

Context: The NRC inspectors identified a Green finding and associated Non-cited Violation (NCV) of 10 CFR 50, Appendix B, Criterion III, Design Control when the licensee performed a plant modification to remove the automatic trip of the circulating water pumps after a flood is detected in the turbine building. Specifically, the licensee did not evaluate the impact of water leakage into the emergency diesel generator (EDG) and battery rooms during the analyzed internal flooding event.
Answer: water leakage
Question-Answering examples: bert-large-uncased-whole-word-masking-finetuned-squad
Question: What is the most important safety issue discussed in this passage?

Context: The inspectors identified a finding of very low safety significance (Green) and an associated non-cited violation (NCV) of 10 CFR 50.55a(f), Preservice and Inservice Testing Requirements, for the licensee's failure to include Emergency Diesel Generator (EDG) Fuel Oil Transfer Pump (FOTP) Discharge Unloader Valve, FO-3983A, into the IST Program in accordance with American Society of Mechanical Engineers (ASME) Operation and Maintenance (OM) Code - 2017 Edition, Subsection ISTA-1100, Scope. Specifically, during the most recent IST Program update, the licensee incorrectly applied the exclusion criteria from the ASME OM Code to the valve resulting in the exclusion of its safety functions from the scope of the IST Program for testing.
Answer: failure to include Emergency Diesel Generator (EDG) Fuel Oil Transfer Pump

Context: A self-revealed Green finding and associated non-cited violation of 10 CFR 50.65(a)(4) Requirements for Monitoring the Effectiveness of Maintenance at Nuclear Power Plants, was identified for the licensee's failure to assess and manage the increase in risk that may result from maintenance activities that were performed in the switchyard house. Specifically, the licensee's failure to assess and manage the increase in risk associated with the movement of floor tiles near vibration sensitive relays on August 3, 2022, resulted in a (1) partial loss of offsite power to the 1B' startup transformer (SUT) and (2) absent the KC-2 relay setpoint, the loss of generation (i.e., runback) of approximately 10-percent power.
Answer: movement of floor tiles near vibration sensitive relays
Question-Answering examples: bert-large-uncased-whole-word-masking-finetuned-squad
Question: What is the most important safety issue discussed in this passage?

Context: A self-revealed finding with its safety significance as yet to be determined (TBD) and an associated Apparent Violation (AV) of 10 CFR Part 50, Appendix B, Criterion III, Design Control were identified for the licensee's apparent failure to adequately translate the design basis of the Auxiliary Feedwater System (AFW) into procedures and instructions which resulted in the inoperability of the turbine-driven auxiliary feedwater pump (TDAFWP) on August 3, 2022.
Answer: inoperability of the turbine-driven auxiliary feedwater pump

Context: The inspectors identified a finding of very low safety significance (Green) and a Non-Cited Violation of 10 CFR Part 50, Appendix B, Criterion III, Design Control, for the licensee's failure to ensure core cooling flow could be maintained and not interrupted during the transition from the injection phase to the recirculation phase assuming a single failure. Specifically, when verifying the adequacy of design via time critical operator actions, the licensee failed to assume a more limiting single failure of valve SI-856A or SI-856B.
Answer: failure to ensure core cooling flow could be maintained
Question-Answering examples: google/flan-t5-base
Question: What is the safety issue in this passage?

Context: The inspectors identified a finding of very low safety significance (Green) and an associated non-cited violation (NCV) of 10 CFR 50.55a(f), Preservice and Inservice Testing Requirements, for the licensee's failure to include Emergency Diesel Generator (EDG) Fuel Oil Transfer Pump (FOTP) Discharge Unloader Valve, FO-3983A, into the IST Program in accordance with American Society of Mechanical Engineers (ASME) Operation and Maintenance (OM) Code - 2017 Edition, Subsection ISTA-1100, Scope. Specifically, during the most recent IST Program update, the licensee incorrectly applied the exclusion criteria from the ASME OM Code to the valve resulting in the exclusion of its safety functions from the scope of the IST Program for testing.
Answer: The licensee incorrectly applied the exclusion criteria from the ASME OM Code to the valve resulting in the exclusion of its safety functions from the scope of the IST Program for testing

Context: The inspectors identified a finding (FIN) of very low safety significance (Green), for the licensee's failure to perform an adequate operating experience evaluation for NRC Information Notice 2017-06 which identified how the site might be vulnerable to the situation described in the IN. This was contrary to Point Beach Procedure PI-AA-102-1001, Operating Experience Program Screening and Responding to Incoming Operating Experience. Specifically, the licensee failed to evaluate the effects of the additional short circuit current expected from the battery chargers and determine the impact on the direct current distribution buses and interrupting devices.
Answer: The licensee failed to evaluate the effects of the additional short circuit current expected from the battery chargers and determine the impact on the direct current distribution buses and interrupting devices
Question-Answering examples: google/flan-t5-base
Question: What is the safety issue in this passage?

Context: A self-revealed Green finding and associated non-cited violation of 10 CFR 50.65(a)(4) Requirements for Monitoring the Effectiveness of Maintenance at Nuclear Power Plants, was identified for the licensee's failure to assess and manage the increase in risk that may result from maintenance activities that were performed in the switchyard house. Specifically, the licensee's failure to assess and manage the increase in risk associated with the movement of floor tiles near vibration sensitive relays on August 3, 2022, resulted in a (1) partial loss of offsite power to the 1B' startup transformer (SUT) and (2) absent the KC-2 relay setpoint, the loss of generation (i.e., runback) of approximately 10-percent power.
Answer: The licensee's failure to assess and manage the increase in risk associated with the movement of floor tiles near vibration sensitive relays on August 3, 2022, resulted in a (1) partial loss of offsite power to the 1B' startup transformer (SUT) and (2) absent the KC-2 relay setpoint, the loss of generation (i.e., runback

Context: The NRC inspectors identified a Green finding and associated Non-cited Violation (NCV) of 10 CFR 50, Appendix B, Criterion III, Design Control when the licensee performed a plant modification to remove the automatic trip of the circulating water pumps after a flood is detected in the turbine building. Specifically, the licensee did not evaluate the impact of water leakage into the emergency diesel generator (EDG) and battery rooms during the analyzed internal flooding event.
Answer: The licensee did not evaluate the impact of water leakage into the emergency diesel generator (EDG) and battery rooms during the analyzed internal flooding event
Comparisons
Input: The NRC inspectors identified a Green finding and associated Non-cited Violation (NCV) of 10 CFR 50, Appendix B, Criterion III, Design Control when the licensee performed a plant modification to remove the automatic trip of the circulating water pumps after a flood is detected in the turbine building. Specifically, the licensee did not evaluate the impact of water leakage into the emergency diesel generator (EDG) and battery rooms during the analyzed internal flooding event.
- google/pegasus-cnn_dailymail (summary): NRC inspectors identified a Green finding and associated Non-cited Violation (NCV) of 10 CFR 50, Appendix B, Criterion III, Design Control. The licensee did not evaluate the impact of water leakage into the emergency diesel generator (EDG) and battery rooms during the analyzed internal flooding event.
- google/flan-t5-base (summary): The NRC found a Green finding and a NCV in 10 CFR 50, Appendix B, Criterion III, Design Control when the licensee performed a plant modification to remove the automatic trip of the circulating water pumps after a flood is detected in the turbine building.
- deepset/roberta-base-squad2 (Q&A), question "What is the most important safety issue discussed in this passage?": water leakage
- google/flan-t5-base (Q&A), question "What is the safety issue in this passage?": The licensee did not evaluate the impact of water leakage into the emergency diesel generator (EDG) and battery rooms during the analyzed internal flooding event
Comparisons
Input: The inspectors identified a finding of very low safety significance (Green) and an associated non-cited violation (NCV) of 10 CFR 50.55a(f), Preservice and Inservice Testing Requirements, for the licensee's failure to include Emergency Diesel Generator (EDG) Fuel Oil Transfer Pump (FOTP) Discharge Unloader Valve, FO-3983A, into the IST Program in accordance with American Society of Mechanical Engineers (ASME) Operation and Maintenance (OM) Code - 2017 Edition, Subsection ISTA-1100, Scope. Specifically, during the most recent IST Program update, the licensee incorrectly applied the exclusion criteria from the ASME OM Code to the valve resulting in the exclusion of its safety functions from the scope of the IST Program for testing.
- google/flan-t5-base (Q&A), question "What is the safety issue in this passage?": The licensee incorrectly applied the exclusion criteria from the ASME OM Code to the valve resulting in the exclusion of its safety functions from the scope of the IST Program for testing
- bert-large-uncased-whole-word-masking-finetuned-squad (Q&A), question "What is the most important safety issue discussed in this passage?": failure to include Emergency Diesel Generator (EDG) Fuel Oil Transfer Pump
- google/flan-t5-base (summary): The IST Program was re-examined for a violation of 10 CFR 50.55a(f) for the failure to include the emergency diesel generators Fuel Oil Transfer Pump (FOTP) Discharge Unloader Valve, FO-3983A, into the IST Program.
- google/pegasus-cnn_dailymail (summary): The inspectors identified a finding of very low safety significance (Green) and an associated non-cited violation (NCV) of 10 CFR 50.55a(f) The licensee incorrectly applied the exclusion criteria from the OM Code to the valve resulting in the exclusion of its safety functions from the scope of the IST Program for testing.
Named Entity Recognition (NER)
Language models that have been trained to recognize text spans as different kinds of named entities
Most available pre-trained models can extract 4 or 18 named entities:
- 4-class NER models: person; location; organization; miscellaneous
- 18-class NER models: cardinal value; date; event; building or facility; geo-political entity; language; law; location; money; nationality, religious or political affiliation; ordinal; organization; percent; person; product; quantity; time; work of art
These models can be trained to extract custom entities, but this would require training with a large text corpus in which the entities and their corresponding text spans are labeled
For our use case, a pre-trained model that does a good job of extracting the entities of interest is sufficient
Named Entity Recognition Example (two slides): Flair's ner-english-ontonotes-large model compared with spaCy's en_core_web_trf model
Named Entity Recognition (NER)
Different models will be better at recognizing different entities (e.g., Law with Flair, Facility with spaCy, Product with both)
Pre-trained NER models aren't completely reliable on technical data; they may miss some parts of the entities or some entities altogether
Extracted entities can be used at various points in the BERTopic pipeline
- As categories to view found topics under, and extract topic representations for
- To steer dimensionality reduction of document embeddings closer to embeddings of extracted entities in a semi-supervised topic modeling approach
- With more reliable and accurate NER, and customized code:
  - Select which text is fed to the embedding step (along with POS tagging to select patterns of interest)
  - Tokenize text before c-TF-IDF step (along with POS tagging to select patterns of interest)
Progress

SOW Task Status
| Phase | Task | Status |
|---|---|---|
| Phase I: March 6, 2023 - April 9, 2023 | Describe the Problem | Complete |
| | Search the Literature | Complete |
| | Select Candidates | Complete |
| | Select Evaluation Factors | Complete |
| | Develop evaluation factor weights | Complete |
| | Define evaluation factor ranges | Complete |
| | Perform assessment | Complete |
| | Report Results | Complete |
| | Deliver Trade study report | Complete |
| Phase II: March 20, 2023 - May 7, 2023 | Platform/system selection and installation | In progress |
| | Data acquisition and preparation | In progress |
| | Feature pipeline engineering | In progress |
| | Clustering method experimentation & selection | In progress |
| | Cluster pipeline engineering | In progress |
| | Anomaly detection (as needed) | Not started |
| | Model Development, Training, Evaluation | In progress |
| | Test harness development | In progress |
| | PoC integration and demonstration | Not started |
| | Trial runs and evaluation | In progress |
| | Demonstrate PoC capability | Not started |
| Phase III: April 19, 2023 - June 16, 2023 | Live data ingestion | In progress |
| | Model execution | Not started |
| | Cluster evaluation | Not started |
| | Critical Method documentation | Not started |
| | Technical Report Document | Not started |
| | Deliver final report with findings | Not started |
Next Steps
- Experiment with alternatives in BERTopic composable parts
- Iterate on clusters and number of clusters
- Evaluate clusters using proposed metrics
- Share clusters and metrics with NRC SMEs to ensure metrics are effective