ML23262B214
Issue date: 06/16/2023
From: Chang Y, Klacar S (NRC/RES/DRA/HFRB, Sphere of Influence)
An AI engineering company. When human intelligence isn't enough.

We are pleased to present this final report as the Phase III deliverable for the Use Machine Learning to Prioritize Inspections effort.
Sphere of Influence (SphereOI) creates intelligent systems that engage AI/ML to raise the bar for efficiency, speed, and sustainability. In our digital factory, our proprietary SpeedShift design-engineering cadence accelerates tool selection and product development, turning years into months by delivering custom, defining technologies on small budgets. Our past performance covers a wide array of innovations that span precision agriculture, missile defense, and medical and consumer products.
Sanda Klacar, Assistant Treasurer, Sphere of Influence, Inc.
Date: June 16, 2023
Phase III Final Report: Use Machine Learning to Prioritize Inspections
Order Number: 31310023P0005
Sam.gov: Registered and up to date
CAGE: 338K9
Business Type: Small business
UEI: LDQAKL4UDN27
Contract Type: Firm Fixed Price
Sphere of Influence, Inc.
1420 Spring Hill Rd., Suite 525, Tysons Corner, Virginia 22102
www.SphereOI.ai
Contracts: Sanda Klacar, Asst. Treasurer, SKlacar@sphereoi.com, 703-548-5406 (T), 703-842-8479 (F)
Program Manager: Scott Pringle, EVP, SPringle@sphereoi.com, 301-919-9393 (T)
Contents

I. Executive Summary
II. Introduction
   A. Background and Context
   B. Objectives and scope of the proof-of-concept study
      1. Objective
      2. Scope of Work
   C. Significance and potential impact of the study
III. Methodology
   A. Study design
      1. Phase 1
      2. Phase 2
      3. Phase 3
   B. Data supplied by the NRC
   C. Data analysis plan
      1. Input
      2. Neural Topic Modeling
      3. Neural Topic Modeling Extensions
   D. Metrics
      1. Segmentation
      2. Probability Estimation
      3. Confirmation Measure
      4. Aggregation
IV. Results
   A. Presentation of findings
   B. Analysis and interpretation of results
      1. Topic Modeling Inputs
      2. Stop word Removal
      3. Topic Representations
      4. Outlier Reduction
   C. Discussion of any unexpected or significant findings
V. Recommendations for further research or improvements
   A. Document Summarization tool to Accelerate Analysis
      1. Problem
      2. Background
      3. Description
      4. Benefit to NRC
   B. Dynamic Analysis and Discovery
      1. Problem
      2. Background
      3. Description
      4. Benefit to NRC
   C. Safety Cluster Tuning
      1. Problem
      2. Background
      3. Description
      4. Benefit to NRC
   D. Cluster Representation - Giving the clusters good names and descriptions
      1. Problem
      2. Background
      3. SOW Description
      4. Benefit to NRC
   E. Safety Event Alerting
      1. Problem
      2. Background
      3. Description
      4. Benefit to NRC
VI. Conclusion
   A. Recap of the study objectives and main findings
      1. Flexibility in selecting Algorithms
      2. Cost
      3. Flexibility in execution environment
   B. Summary of the study's contributions and potential benefits
VII. References
   A. List of cited sources and relevant literature
VIII. Appendix

List of Figures

Figure 1 High Level Pipeline
Figure 2 Detailed Experiment Outline for Pipeline
Figure 3 Recommended Pipeline
Figure 4 Unsupervised KeyBERT Approach for Key Phrase Extraction [12]
Figure 5 Semi-supervised Guided KeyBERT Approach for Key Phrase Extraction [12]
Figure 6 PatternRank: Combining KeyphraseVectorizers with KeyBERT [13]
Figure 7 Class-based TF-IDF Weighing used for BERTopic Topic Representations [25]
Figure 8 The number of documents attributed to topics after Probability Outlier Assignment (blue), Topic Distribution Outlier Assignment (blue), c-TF-IDF Outlier Assignment (green), Embedding Outlier Assignment (red). Item Introduction was used as input with MMR-POS Topic Representation.
Figure 9 The number of documents attributed to topics after Probability Outlier Assignment (blue), Topic Distribution Outlier Assignment (blue), c-TF-IDF Outlier Assignment (green), Embedding Outlier Assignment (red). Item Introduction Summaries were used as input with MMR-POS Topic Representation.
Figure 10 The number of documents attributed to topics after Probability Outlier Assignment (blue), Topic Distribution Outlier Assignment (blue), c-TF-IDF Outlier Assignment (green), Embedding Outlier Assignment (red). Item Introduction Key Phrases were used as input with MMR-POS Topic Representation.
Figure 11 Recommended Pipeline Configuration
Figure 12 An Example of a Topic results table. Bertopic (mmr), Bertopic (mmr_pos), Topic Terms by TF-IDF (vocab_key_phrases) are all topic representations with the top 100 terms used for their wordclouds.
Figure 13 An example of a word cloud for a topic using BERTopic MMR-POS topic representation.
Figure 14 An example of a word cloud for a topic using Vocab + Key Phrases custom topic representation.
Figure 15 Topic Modeling by Category: Site Name. Using Item Introduction as input and MMR-POS Topic Representation.
Figure 16 Distribution of all topics across the Cornerstone category. Using Item Introduction as input with MMR-POS Topic Representation.
Figure 17 Topic 0 Using Item Introductions as Topic Modeling Input
Figure 18 Topic 1 Using Item Introduction as Topic Modeling Input
Figure 19 Topic 3 Using Summaries as Topic Modeling Input
Figure 20 Topic 2 Using Summaries as Topic Modeling Input
Figure 21 Topic 0 Using Key Phrases as Topic Modeling Input
Figure 22 Topic 1 Using Key Phrases as Topic Modeling Input
Figure 23 Coherence metric results for different representations and different inputs. The red circle is used to highlight the differences in the y-axis.
Figure 24 Subject Matter Expert, Guillermo Vazquez, qualification of 30 different topics, with 10 being created by using the full item introduction (blue) as input, the key phrases of the item introduction (orange) as input, and the summary of the item introduction (green) as input.
Figure 25 Word clouds for seven different safety clusters that still contain stop words.
Figure 26 Word clouds for seven different safety clusters that do not contain stop words.
Figure 27 Topic 14 with MMR Topic Representation Using Item Introduction as Input
Figure 28 Topic 14 with Vocab Topic Representation Using Item Introduction as Input
Figure 29 Topic 14 with MMR-POS Topic Representation Using Item Introduction as Input
Figure 30 Topic 14 with Key Phrases Topic Representation Using Item Introduction as Input
Figure 31 Topic 14 with Vocab + Key Phrases Topic Representation Using Item Introduction as Input
Figure 32 Coherence metric of topics generated by the same input for four different minimum number of documents per topic.
Figure 33 Subject Matter Expert, Guillermo Vazquez, qualification of 60 different topics, with 20 being created by using MMR representation (blue), MMR+POS representation (orange), and no representation (green).
Figure 34 Topic 41 with MMR-POS before Topic Probabilities Outlier Reduction
Figure 35 Topic 41 with MMR-POS after Topic Probabilities Outlier Reduction
Figure 36 Topic 32 with Vocab + Key Phrases Representation before Topic Probabilities Outlier Reduction
Figure 37 Topic 32 with Vocab + Key Phrases Representation after Topic Probabilities Outlier Reduction
Figure 38 The coherence metric of the topics before outlier reduction (None) or after c-TF-IDF (CTIFD), Embedding (EMBD), Topic Distribution (DIST), and Topic Probability (PROB) outlier reduction. Results were presented when using introduction as input (blue) and key phrases of introduction as input (black).
Figure 39 The pairwise embedding distance (Diversity Metric) of the topics before outlier reduction (None) or after c-TF-IDF (CTIFD), Embedding (EMBD), Topic Distribution (DIST), and Topic Probability (PROB) outlier reduction. Results were presented when using introduction as input (blue) and key phrases of introduction as input (black).
I. Executive Summary

The NRC is seeking to prioritize inspections using machine learning. This study was designed to evaluate the suitability of commercially available machine learning (ML) systems to perform unsupervised learning to identify safety clusters among US nuclear power plants and to generate safety clusters using techniques deemed suitable.
The study is composed of three phases - Phase I to evaluate and select a cloud environment, Phase II to design an AI/ML pipeline to create safety clusters, and Phase III to fine-tune algorithms and generate safety clusters.
The answer to the primary question, whether commercially available ML systems are suited to this task, is an unqualified yes. The vast array of ML systems, libraries, and pre-trained models makes this a solvable problem. SphereOI was able to evaluate cloud environments, evaluate libraries and pre-trained models, and construct an ML pipeline that generates useful safety clusters in under 90 days.
In addition to evaluating the suitability of the commercially available resources, SphereOI performed numerous experiments with multiple configurations for each stage in the ML pipeline to enhance the quality and understandability of the safety clusters generated using unsupervised methods. With the pipeline and models in hand, and the results and analysis for the currently available inspection reports, the NRC is poised to add the automation and tools needed to deliver these insights to the analysts. These results will enhance productivity, reduce cognitive load, and enable data-driven prioritization of inspection activities.
II. Introduction

A. Background and Context

Inspections are an important element of the NRC's oversight of its licensees. To ensure safe operations, the NRC conducts inspections of licensed nuclear power plants, fuel cycle facilities, and radioactive materials activities and operations to verify that licensees meet the NRC's regulatory requirements. The inspections are performed at set frequencies.
During abnormal situations such as COVID-19, the number of inspections is reduced and the duration between inspections is increased. Developing a performance-based, data-driven method to inform the inspection priority and frequency would improve the effectiveness of inspections during abnormal situations.
The decisions on the priority and frequency of inspections during abnormal situations, such as COVID-19, largely rely on expert judgment. ML is an advanced technique for providing predictions and has been successfully implemented in many applications. ML intakes a large volume of data to train its algorithms, and the prediction results are used to improve the algorithms. The cycle of prediction and refining the algorithms continuously improves the algorithms and prediction reliability. In many cases, data-based decision making performs better than expert judgment. As a result, the increase in human workload is minimized when scaling up the operations. The prediction results inform the NRC of inspection priorities and, potentially, data-based industry performance trends.
B. Objectives and scope of the proof-of-concept study

- 1. Objective

The objective of this acquisition is to evaluate the suitability of commercially available machine learning (ML) systems to perform unsupervised learning to identify safety clusters among US nuclear power plants and to perform an in-depth evaluation of a selected ML system to identify safety clusters using the inspection reports of the Nuclear Regulatory Commission as input data.

- 2. Scope of Work

The scope of the contemplated effort includes all tasks, activities, deliverables, and reports needed to successfully evaluate the suitability of ML systems for analyzing the NRC's inspection reports to identify the clusters in safety performance among US nuclear power plants and to select and use a specific ML system to prioritize inspections for the facilities in the data provided by the NRC.
- a. Tasks
- 1) Trade Study - Phase I Duration: 34 calendar days.
- 2) Core PoC development - Phase II Duration: 48 calendar days.
- 3) PoC Study - Phase III Duration: 55 calendar days.
C. Significance and potential impact of the study

Using machine learning to analyze inspection reports to identify trends and patterns in safety concerns at nuclear facilities can reduce the workload on inspectors and operating experience analysts while maintaining the high level of safety in the fleet. By creating safety clusters, that is, groups of events and findings that have common characteristics, the machine learning system can help the analyst consolidate and validate recommendations and updates to procedures and guidance. By adding an automated support tool, analysts will be able to consider a wider scope of issues, findings, and facilities.
III. Methodology

A. Study design

- 1. Phase 1

Phase 1 delivered a trade study between ML offerings. A separate report was delivered detailing this trade study. The results showed that any of the major cloud providers, including AWS, Microsoft, and Google, could supply the tools and services needed to perform the machine learning tasks for this study. It was decided that MS Azure was a slightly better environment for the NRC and that the current study would be performed using on-premises resources and a Jupyter notebook.
- 2. Phase 2

Phase 2 involved selecting between various ML algorithms to perform unsupervised clustering of inspection reports. In this phase, two primary approaches, Topic Modeling (Latent Dirichlet Allocation (LDA)) and Neural Topic Modeling (Bidirectional Encoder Representations from Transformers (BERTopic)), were compared to identify the best candidate for the final safety cluster generation.
Neural Topic Modeling using the BERTopic suite of tools was selected.
- 3. Phase 3

Phase 3 used the BERTopic suite to prototype the generation of safety clusters, metrics to evaluate the clusters, and representations to allow analysts to understand the content of the clusters. The BERTopic algorithms and inputs were tuned to improve cluster performance.
B. Data supplied by the NRC

The primary data source for this study was a spreadsheet of inspection findings from inspection reports provided by the NRC. The spreadsheet contains titles and item introductions of the findings from the inspection reports from 1998 through 2023. There are numerous other columns in the spreadsheet with metadata related to the finding, like Year, Quarter, Issue Date, Report Number, Accession Number, Link, Region, Site Code, Site Name, Docket Number, Type (of finding or violation), Item Severity Type Code, Significance, Procedure, CornerstoneCode, Cornerstone, Cornerstone Attribute Type, CrossCutting Area, CrossCutting Aspect, Idby, and IsTraditionalEnforcement. There were 19,359 rows in the spreadsheet; however, some rows duplicate titles and item introductions, differing only in the value of one of the metadata columns. After removing the rows with blank item introductions and deduplicating, there were 14,937 unique item introductions that were used in this study.

The NRC also provided a spreadsheet with information about the various reactor locations, including some that are no longer operational but had findings in the past that were included in the inspection findings spreadsheet. Some of the information from this reactor locations spreadsheet, like reactor type, parent company, and operating status, was joined with the inspection findings spreadsheet using the site code present in both spreadsheets.
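As an illustration of the data preparation described above, the following is a minimal pandas sketch; the file names and column names (such as "Item Introduction" and "Site Code") are assumptions for illustration and may not match the actual spreadsheet schema.

```python
import pandas as pd

# Load the inspection findings and reactor locations spreadsheets
# (file names and column names are illustrative assumptions).
findings = pd.read_excel("inspection_findings.xlsx")
locations = pd.read_excel("reactor_locations.xlsx")

# Drop rows with blank item introductions, then deduplicate on the
# introduction text so repeated findings with differing metadata collapse
# to a single row.
findings = findings.dropna(subset=["Item Introduction"])
findings = findings.drop_duplicates(subset=["Item Introduction"])

# Join reactor-location attributes (reactor type, parent company,
# operating status) onto the findings using the shared site code.
merged = findings.merge(
    locations[["Site Code", "Reactor Type", "Parent Company", "Operating Status"]],
    on="Site Code",
    how="left",
)

docs = merged["Item Introduction"].tolist()  # text inputs for topic modeling
```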
A customized vocabulary of terms and phrases was developed for use in this study from various data sources provided by the NRC. Abbreviations and full forms of reactor systems, components, and processes were combined from ML14300A223 (656), ML17004A106 (995), and Reactor Concepts (R-100) Acronyms (1050) and reduced to 1,004 unique terms and phrases by the NRC. A list of 407 common failure modes (failure, trip, scram, misalign, corrosion, etc.) of various reactor systems was also provided by the NRC.
A reference corpus was generated for use with the coherence metric. This reference corpus included 269 NUREG publications and 195 Research Information Letters scraped from the NRC website.
C. Data analysis plan

The data analysis plan was to:

- Perform numerous experiments to create safety clusters.
- Evaluate the clusters using SME analysis and metrics.
- Recommend a configuration of the BERTopic stack.
- Supply all options in a Python file to enable algorithm mixing if desired.

The basic structure of cluster creation is as follows:

Input → Topic Modeling → Representation → Outlier Reduction

Figure 1 High Level Pipeline

This can be further broken into the component parts of topic modeling:

Input → Embedding → Dimension Reduction → Clustering → Tokenizer → Weighting → Representation → Outlier Reduction

Visually, we can represent the experiments performed and selections made as follows:

Figure 2 Detailed Experiment Outline for Pipeline

Experiments and analyses were conducted to find the best candidates in each of these seven areas, given the domain of NRC Safety Inspection Reports, resulting in the final pipeline for the project having the following components:

Figure 3 Recommended Pipeline

Details of the candidates and experiments are provided below.
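To make the high-level pipeline concrete, the sketch below shows one plausible way to run the Input → Topic Modeling → Representation → Outlier Reduction flow with the BERTopic library. It is a minimal illustration rather than the project's delivered configuration, and the `docs` variable (the cleaned item introduction texts) is assumed to exist.

```python
from bertopic import BERTopic

# docs: a list of cleaned Item Introduction strings (assumed to exist).
topic_model = BERTopic(calculate_probabilities=True)

# Topic Modeling: fit the model and assign each document to a topic
# (-1 denotes the outlier topic).
topics, probs = topic_model.fit_transform(docs)

# Representation: inspect the top terms that describe each topic.
print(topic_model.get_topic_info().head())

# Outlier Reduction: reassign outlier documents to their closest topic
# using the per-document topic probabilities.
new_topics = topic_model.reduce_outliers(docs, topics, probabilities=probs,
                                         strategy="probabilities")
topic_model.update_topics(docs, topics=new_topics)
```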
- 1. Input:
The neural topic modeling approach requires text for input. NRC provided pointers to the publicly available PDF reports and an Excel spreadsheet with salient fields extracted from the inspection database. The Excel spreadsheet served as the primary source of data for this study. One column in the spreadsheet is the Item Introduction. This field contains a lengthy prose description of the environment, areas of concern, and findings. The Item Introductions contain a large amount of text. The embedding models used in BERTopic to convert text to vectors of numbers have various supported input lengths, and any text after that length is truncated, which could cause important information to be lost from long spans of text. As these embedding techniques work better on smaller quantities of more focused content, we conducted experiments with text summarization techniques, question answering techniques, and key phrase extraction techniques to condense Item Introductions into smaller spans of text that still hold the safety information. The full text of the Item Introduction along with three condensed versions of the Item Introduction (summary, question answering, and key phrases) were evaluated for use in the clustering algorithms.
- a. Full Item Introduction

The text from the spreadsheet was cleaned by removing HTML tags, normalizing spacing and control characters, and performing stop word analysis (see the Stop Word Removal section below). Item Introductions ranged in length from 42 to 11,670 words and averaged about 1,649 words.
Below is a sample Item Introduction - this Item Intro is the basis for samples of the various summary techniques presented below.
The inspectors identified a Green NCV of Unit 3 Technical Specification (TS) 5.4.1 when Entergy did not take adequate measures to control transient combustibles in accordance with established procedures and thereby did not maintain in effect all provisions of the approved fire protection program, as described in the Unit 3 final safety analysis report. Specifically, on two separate occasions, Entergy did not ensure that transient combustibles were evaluated in accordance with established procedures; and as a result, they allowed combustible loading in the 480 volt emergency switchgear room to exceed limits established in the fire hazards analysis (FHA) of record. The inspectors determined that not completing a TCE, as required by EN-DC-161, Control of Combustibles, Revision 18, was a performance deficiency, given that it was reasonably within Entergy's ability to foresee and correct and should have been prevented. Specifically, on August 28, 2018, wood in excess of 100 pounds was identified in the switchgear room; however, an associated TCE had not been developed. Additionally, on October 1, 2018, three 55-gallon drums of EDG lube oil were stored in the switchgear room without an associated TCE having been developed to authorize storage in this room, as required for a volume of lube oil in excess of 5 gallons. The inspectors determined the performance deficiency was more than minor because it was associated with the protection against external factors attribute of the Mitigating Systems cornerstone, and it adversely affected the cornerstone goal of ensuring the availability, reliability, and capability of systems that respond to initiating events to prevent undesirable consequences. Specifically, storage of combustibles in excess of the maximum permissible combustibles loading could have the potential to challenge the capability of fire barriers to prevent a fire from affecting multiple fire zones and further degrading plant equipment. Additionally, this issue was similar to an example listed in IMC 0612, Appendix E, "Examples of Minor Issues," Example 4.k.,
because the fire loading was not within the FHA limits established at the time. Entergy required the issuance of a revised evaluation to provide reasonable assurance that the presence of combustibles of a quantity in excess of the loading limit of record would not challenge the capacity of fire barriers, and further evaluation and the issuance of an EC was necessary to raise the established loading limit to a less-conservative value.
The inspectors assessed the significance of the finding using IMC 0609, Appendix F, Fire Protection Significance Determination Process, and determined that this finding screened to Green (very low safety significance) because it had a low degradation rating in accordance with Attachment 2 of the appendix. The inspectors determined that this finding had a cross-cutting aspect in the area of Human Performance, Work Management, because Entergy did not adequately plan, control, and execute work activities such that nuclear safety was the overriding priority, nor did they adequately identify risk associated with work being performed or coordinate across working groups to anticipate and manage this risk. Specifically, in the case of wood scaffolding being stored in the switchgear room, while planning work to be performed, Entergy did not adequately consider the fire risk that would be introduced by the presence of additional combustible materials. In the case of lube oil being stored in the room, Entergy did not take adequate action to ensure that activities were executed in a manner that would prevent work taking place in one area (the adjacent EDG cell) from introducing additional fire risk into a space for which it had not been evaluated (the switchgear room). In both cases, Entergy did not take sufficient action to ensure that workers were aware of the fire protection requirements associated with activities being conducted and to ensure that they coordinated as needed across working groups to adequately assess and mitigate the associated fire risk.
- b. Item Introduction Summary

We utilize text summarization techniques to generate condensed versions of the longer Item Introduction text input. As opposed to the older frequency-based extractive summarization approaches that select important sentences from the input text, we perform abstractive summarization with deep-learning-based natural language processing approaches to generate short, condensed summaries of longer passages of text that are syntactically correct and convey the most important information without altering its meaning. We evaluate seven different transformer-based language models, available from the HuggingFace Transformers library [1], with different architectures, pre-training techniques, and various pre-training and fine-tuning datasets.
Seven summary algorithms were evaluated:
| Summary | Description |
|---|---|
| T5-Base [2] | T5 is a transformer-based encoder-decoder model that converts all NLP problems into a text-to-text format, where the model is fed some text for context or conditioning along with a task-specific prefix and produces the appropriate output text. T5 was pre-trained on a variety of unsupervised, self-supervised and supervised objectives using the Colossal Clean Crawled Corpus (C4) dataset. |
| Flan-t5-base [3] | FLAN-T5 is an enhanced version of the T5 model, developed with instruction finetuning with a focus on increasing the number of tasks and the model size as well as finetuning on chain-of-thought data. |
| BART-large-cnn [4] | BART is a sequence-to-sequence transformer-based architecture with a bidirectional encoder and a left-to-right decoder. It was pre-trained with a combination of text infilling and sentence permutation objectives. |
| Pegasus | Pegasus is a sequence-to-sequence transformer-based architecture with a bidirectional encoder and a left-to-right decoder like BART. It was jointly pre-trained with two objectives: masked language modeling (MLM) and gap sentence generation (GSG), which closely resembles the downstream task of summarization. The model was pre-trained on the Colossal Clean Crawled Corpus (C4) dataset and the HugeNews dataset. We use four variants of the Pegasus model, each fine-tuned for summarization on a different dataset: CNN/Daily Mail, XSum, arXiv, and PubMed. |
| Pegasus-cnn_dailymail [5] | The Pegasus-cnn_dailymail model was finetuned on datasets of news articles and their bullet point summaries from CNN and the Daily Mail newspapers. |
| Pegasus-xsum [6] | The Pegasus-xsum model was finetuned on a dataset of BBC articles and their one-sentence summaries. |
| Pegasus-arxiv [7] | The Pegasus-arxiv model was fine-tuned on a dataset of scientific publications and their abstracts from arXiv.org. |
| Pegasus-pubmed [8] | The Pegasus-pubmed model was fine-tuned on a dataset of scientific publications and their abstracts. |
Below is a set of samples from several summary models for the item introduction presented above.
| Model | Sample Summary |
|---|---|
| google/flan-t5-base | Entergy did not take adequate measures to control transient combustibles in accordance with established procedures and thereby did not maintain in effect all provisions of the approved fire protection program, as described in the Unit 3 final safety analysis report. |
| google/pegasus-xsum | The US Department of Labors Occupational Safety and Health Administration (OSHA) has issued an inspection order against an oil and gas operator for violations of the Inservice Testing (IST) Program. |
| google/pegasus-cnn_dailymail | Entergy did not ensure that transient combustibles were evaluated in accordance with established procedures.<n>Entergy did not take sufficient action to ensure that workers were aware of the fire protection requirements associated with activities being conducted.<n>Entergy did not adequately identify risk associated with work being performed or coordinate across working groups to anticipate and manage this risk |
| facebook/bart-large-cnn | The inspectors identified a Green NCV of Unit 3 Technical Specification (TS) 5.4.1 when Entergy did not take adequate measures to control transient combustibles. The inspectors determined that not completing a TCE, as required by EN-DC-161, Control of Combustibles, Revision 18, was a performance deficiency. |
| t5-base | inspectors identified a Green NCV of Unit 3 Technical Specification (TS) 5.4.1 when Entergy did not take adequate measures to control transient combustibles. the inspectors determined that not completing a TCE was a performance deficiency, given that it was reasonably within entergys ability to foresee and correct and should have been prevented. in the case of wood scaffolding being stored in the switch |
| google/pegasus-arxiv | on, 10, wood in room ; on, 10, wood in room ; on, 10, wood in room ; on, 10, wood in room ; on, 10, wood in room ; on, 10, wood in room ; on, 10, wood in room ; on, 10, wood in room ; on, 10, wood in room ; on, 10, wood in room |
| google/pegasus-pubmed | the united states environmental protection agency ( u.s. epa ) conducted an inspection of chemical storage facilities in the united states during the period january 1, 2000 to december 31, 2009. <n> the purpose of the inspection was to identify non - compliance with u.s. <n> environmental protection agency standards for chemical storage facilities. <n> the inspection was conducted in response to a communication from the u.s. <n> environmental protection agency ( |
The summaries produced by the Pegasus-arxiv and Pegasus-pubmed models were very poor in quality as they either repeated a couple of words to reach the output length specified or generated very general words and phrases related to inspections. The one-sentence summaries produced by the Pegasus-xsum model were slightly better, but tended to be shorter and did not capture all aspects of the safety issue discussed in the Item Introduction. The summaries generated by Pegasus-cnn_dailymail, BART-large-cnn, T5-base, and Flan-T5 models were significantly better in quality and captured various aspects of the safety issues like the reactor system or component that was affected, any regulations or specifications that were violated and potential impacts of the safety issue.
Selection: With NRC feedback on sample summaries generated by these models, we chose to use the summaries generated by the Pegasus-cnn_dailymail model (minimum of 105 words, maximum of 713 words and on average 327 words) as an input for topic modeling.
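As a rough illustration of how such summaries can be produced, the following is a minimal sketch using the HuggingFace Transformers summarization pipeline with the selected google/pegasus-cnn_dailymail checkpoint. The generation lengths shown are illustrative assumptions, not the settings used in the study.

```python
from transformers import pipeline

# Load a pre-trained abstractive summarization model
# (the checkpoint selected for this study).
summarizer = pipeline("summarization", model="google/pegasus-cnn_dailymail")

def summarize(item_introduction: str) -> str:
    # Generation lengths here are illustrative; the study's exact settings
    # are not reproduced in this sketch.
    result = summarizer(item_introduction, max_length=256, min_length=64,
                        truncation=True)
    return result[0]["summary_text"]

# summaries = [summarize(doc) for doc in docs]  # docs: Item Introduction texts
```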
- c. Question Answering

We utilize text question-answering techniques to generate condensed versions of the longer Item Introduction text input as an alternative to text summarization. Question-answering is a natural language processing task in which a neural language model is trained to retrieve, extract, or generate answers to questions, with or without given context. Open-domain models are trained to answer questions about anything, whereas closed-domain models are only trained to answer questions about the specific domain they were trained on. Open-book QA refers to models that are trained to answer questions from a context passage that is provided or retrieved from a knowledge base, whereas closed-book QA refers to models that answer questions without any context. We use pre-trained transformer models from the HuggingFace Transformers library [1], intended for open-domain, open-book QA, that have been fine-tuned on question-answering datasets. We provide the Item Introduction as the context passage to the models and append a question about the reactor safety issue discussed in the passage.
Three question-answering algorithms were evaluated:
| Question Answering | Description |
|---|---|
| Flan-t5-base [3] | FLAN-T5 is an enhanced version of the T5 model, developed with instruction finetuning with a focus on increasing the number of tasks and the model size as well as finetuning on chain-of-thought data. |
| Roberta-base-squad2 [9] | RoBERTa is a transformer model derived from BERT, with some hyperparameter modifications and removal of the next-sentence pre-training objective. Roberta-base-squad2 is a version of the RoBERTa base model fine-tuned on the SQuAD2.0 dataset of question-answer pairs. |
| Bert-large-cased-whole-word-masking-finetuned-squad [10] | BERT is a transformer model pre-trained on the BookCorpus and English Wikipedia datasets using a masked language modeling (MLM) objective. Bert-large-cased-whole-word-masking-finetuned-squad is a version of the large BERT model fine-tuned on the SQuAD dataset of question-answer pairs. |
Below is a set of samples from several QA models for the item introduction presented above.
| Model | Sample Answer |
|---|---|
| QA(google/flan-t5-base) | Storage of combustibles in excess of the maximum permissible combustibles loading could have the potential to challenge the capability of fire barriers to prevent a fire from affecting multiple fire zones and further degrading plant equipment |
| QA(deepset/roberta-base-squad2) | nuclear safety |
| QA(bert-large-uncased-whole-word-masking-finetuned-squad) | nuclear safety |

The answers generated by the roberta-base-squad2 and bert-large-uncased-whole-word-masking-finetuned-squad models were very short (between 3-10 words) and composed of generic terms frequently used in the inspection findings item introductions.
In some instances, the answers generated by the bert-large-uncased-whole-word-masking-finetuned-squad model contained the reactor system or component of interest with respect to the safety issue, or a procedure or event that caused the safety issue; however, the results were not consistent for all item introductions. The answers produced by the flan-t5-base model were better in quality compared to those from the BERT and RoBERTa models, as they were longer phrases or short sentences describing the safety issue; however, they did not capture all of the important aspects from the item introduction like the system or component, the violated regulation, and potential impacts.
Selection: The answers obtained from the question-answering models did not prove to be a condensed, complete alternative to the full item introduction text, so we chose not to use them as an input to topic modeling.
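For reference, an extractive QA model of the kind evaluated here can be applied roughly as in the sketch below, using the HuggingFace question-answering pipeline. The question wording is an illustrative assumption; the exact prompt used in the study is not reproduced here.

```python
from transformers import pipeline

# Extractive (open-book) QA: the model selects an answer span from the
# provided context passage.
qa_model = pipeline("question-answering", model="deepset/roberta-base-squad2")

def ask(item_introduction: str) -> str:
    # The question text is an assumption for illustration only.
    result = qa_model(question="What reactor safety issue is discussed?",
                      context=item_introduction)
    return result["answer"]
```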
- d. Item Introduction Key Phrases

Key Phrase Extraction can be used to reduce and focus large text inputs. As an alternative to full item introductions, we extract key phrases using several approaches.
KeyBERT [11] is an unsupervised approach to extract key phrases from text using a pre-trained language model, but it can also be used in a semi-supervised manner with a pre-defined list of important words and phrases to guide the algorithm.
Figure 4 Unsupervised KeyBERT Approach for Key Phrase Extraction [12]
Figure 5 Semi-supervised Guided KeyBERT Approach for Key Phrase Extraction [12]
KeyphraseVectorizers is an unsupervised approach to extract phrases from text that follow a pre-defined part-of-speech pattern using a pre-trained part-of-speech tagger without considering their importance to the overall text. PatternRank is the approach that uses KeyphraseVectorizers as a part of the KeyBERT pipeline to find key phrases that follow a pre-defined part-of-speech pattern.
Figure 6 PatternRank: Combining KeyphraseVectorizers with KeyBERT [13]
We utilize four combinations of these techniques to extract the top 20 key phrases from item introductions: unsupervised and Guided KeyBERT, with and without the use of KeyphraseVectorizers. A customized vocabulary (see the Custom Vocabulary section below) composed of terms, acronyms, their full forms, and failure modes of reactor systems and components was used as the list of pre-defined words and phrases for Guided KeyBERT.
Four key phrase extraction approaches were evaluated:
| Key Phrase | Description |
|---|---|
| KeyBERT [11] | KeyBERT is an unsupervised key phrase extraction algorithm that first tokenizes the text into words and phrases (of varying length n-grams) to obtain candidate keywords and phrases. Next, the full text document and all candidate keywords or phrases are embedded using a pre-trained sentence transformer model. Finally, it computes cosine similarity between the embedded document and the embedded key phrases and retrieves the top N keywords or phrases that are most similar to the document. |
| Guided KeyBERT [11] | Guided KeyBERT is a slight variation of the KeyBERT approach where a pre-defined list of important words and phrases is provided to the algorithm as seeded keywords. These are embedded by the same pre-trained sentence transformer model and combined with the document embeddings with a weighted average before computing similarity between the embeddings of the candidate key phrases and the embedding of the document. |
| KeyphraseVectorizers + KeyBERT | Following the PatternRank approach [13], KeyphraseVectorizers [14] is used to extract candidate phrases that have zero or more adjectives followed by one or more nouns and KeyBERT is used to determine which candidate phrases are most similar to the full document in an unsupervised manner. |
| KeyphraseVectorizers + Guided KeyBERT | Following the PatternRank approach [13], KeyphraseVectorizers [14] is used to extract candidate phrases that have zero or more adjectives followed by one or more nouns and a Guided KeyBERT is used to determine which candidate phrases are most similar to the full document in a semi-supervised manner, using the provided list of important words and phrases. |
Below is a set of samples from several Key Phrase models for the item introduction presented above.
| Key Phrase Approach | Sample Key Phrases |
|---|---|
| KeyBERT | allowed combustible loading, allowed combustible, combustibles revision 18, combustibles evaluated accordance, permissible combustibles loading, result allowed combustible, combustibles revision, combustibles evaluated, permissible combustibles, transient combustibles evaluated, additional combustible, maximum permissible combustibles, combustibles loading, 161 control combustibles, final safety analysis, control combustibles revision, presence additional combustible, unit final safety, combustibles accordance established, established hazards analysis |
| Guided-KeyBERT | allowed combustible loading, final safety analysis, unit final safety, combustibles evaluated accordance, safety analysis report, combustibles revision 18, allowed combustible, permissible combustibles loading, transient combustibles evaluated, combustibles evaluated, permissible combustibles, result allowed combustible, maximum permissible combustibles, 161 control combustibles, combustibles revision, combustibles loading, established hazards analysis, safety significance, safety analysis, control combustibles revision |
| KeyphraseVectorizers + KeyBERT | additional fire risk, fire protection requirements, final safety analysis report, fire risk, maximum permissible combustibles loading, fire protection significance determination process, fire barriers, additional combustible materials, combustible loading, fire protection program, combustibles, transient combustibles, low safety significance, edg lube oil, fire loading, entergy, multiple fire zones, fire, nuclear safety, further degrading plant equipment |
| KeyphraseVectorizers + Guided-KeyBERT | final safety analysis report, fire protection requirements, additional fire risk, fire risk, fire protection significance determination process, maximum permissible combustibles loading, fire barriers, fire protection program, combustible loading, low safety significance, additional combustible materials, transient combustibles, combustibles, edg lube oil, further degrading plant equipment, fire loading, nuclear safety, entergy, multiple fire zones, volt emergency |

The phrases extracted by KeyBERT and Guided KeyBERT include some important safety information; however, they tend to be incomplete or contain extra words before or after the phrase of interest due to the nature of the tokenization algorithm. Because the n-gram range for key phrase candidates must be specified, there is no comprehensive way to extract complete and syntactically correct phrases of all lengths. In contrast, when KeyphraseVectorizers is used, complete and syntactically correct noun phrases are extracted as candidate key phrases from the text, and the resulting key phrases found with either KeyBERT or Guided KeyBERT both capture the essential safety-related information from the Item Introduction text. There was not a significant difference between the sets of key phrases discovered by the KeyphraseVectorizers + KeyBERT approach and the KeyphraseVectorizers + Guided-KeyBERT approach, as most key phrases discovered by them are the same, just in a different order of importance. However, in some instances, providing the custom vocabulary to the KeyphraseVectorizers + Guided KeyBERT approach allowed a few new phrases to be extracted that shed more light on the safety issue discussed in the Item Introduction.
Selection: The KeyphraseVectorizers + Guided KeyBERT approach to extract key phrases (minimum of 13 words, maximum of 713 words and on average 392 words) was chosen as an alternative form of input for topic modeling.
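The selected key phrase approach can be approximated with the KeyBERT and KeyphraseVectorizers libraries roughly as sketched below; the `custom_vocabulary` list and the embedding model name are assumptions for illustration, and parameters may differ from the study's configuration.

```python
from keybert import KeyBERT
from keyphrase_vectorizers import KeyphraseCountVectorizer

# custom_vocabulary: list of NRC terms, acronyms, full forms, and failure
# modes (assumed to be loaded elsewhere).
kw_model = KeyBERT(model="all-MiniLM-L6-v2")

def extract_key_phrases(item_introduction: str, custom_vocabulary: list[str]):
    # PatternRank-style extraction: candidate noun phrases come from the
    # part-of-speech based vectorizer, and Guided KeyBERT ranks them using
    # the custom vocabulary as seed keywords.
    return kw_model.extract_keywords(
        item_introduction,
        vectorizer=KeyphraseCountVectorizer(),
        seed_keywords=custom_vocabulary,
        top_n=20,
    )
```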
- e. Stop Word Removal

After several iterations of topic modeling, we observed that the names of reactor sites and their parent companies present in the Item Introduction texts were having a significant impact on the formation of several safety clusters. To prevent several different safety issues from being grouped together based solely on a common plant or company name, we decided to remove all references to reactor site names, site codes, and their parent companies from the input text (Item Introductions, Summaries, Key Phrases) used for topic modeling. Removing these terms as stop words allowed the input text to cluster into a topic that more closely resembles the safety issue discussed within the text.
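A simple way to apply this kind of site-specific stop word removal is sketched below; the `site_terms` list (site names, site codes, and parent companies) is an assumed input, and the regex-based approach is illustrative rather than the study's exact implementation.

```python
import re

def remove_site_terms(text: str, site_terms: list[str]) -> str:
    # Remove whole-word occurrences of site names, site codes, and parent
    # company names, case-insensitively, before topic modeling.
    for term in site_terms:
        text = re.sub(r"\b" + re.escape(term) + r"\b", " ", text,
                      flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", text).strip()  # normalize whitespace

# cleaned_docs = [remove_site_terms(doc, site_terms) for doc in docs]
```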
- f. Custom Vocabulary

A custom vocabulary was developed with input and feedback from the NRC on terms and phrases that are of interest when considering safety issues across reactor sites.
Abbreviations and full forms of reactor systems, components, and processes were combined from ML14300A223 (656), ML17004A106 (995), and Reactor Concepts (R-100) Acronyms (1050) and reduced to 1,004 unique terms and phrases by the NRC. A list of 407 common failure modes (failure, trip, scram, misalign, corrosion, etc.) of various reactor systems was also provided by the NRC. The final vocabulary combining these abbreviations, full forms, and failure modes consisted of 1,411 unique words and phrases.
This custom vocabulary was used as the seed words for Guided KeyBERT to extract key phrases from Item Introductions, and in forming custom topic representations.
- 2. Neural Topic Modeling

BERTopic [15] is a neural topic modeling approach that can be used to discover topics that are present in text documents, group documents by the topic they discuss and represent the discovered topics in an understandable way. Although the variety of topics discussed within a document can be analyzed via the trained model, the creation of the model itself assumes that one document primarily discusses one topic and that is the topic the document should be assigned to. BERTopic is composed of 6 layers of processing. The layers and the variables are provided below.
- a. Embeddings

This initial layer takes the text inputs and converts them into vectors of numbers that can be mathematically compared to each other. Embeddings produced by language models can vary in quality depending on the architecture of the model as well as the pre-training data and objectives used to train the language model. While some models are better suited at accurately embedding longer text due to their architecture, others may do a better job of producing contextualized embeddings if their pre-training or fine-tuning data was similar to the data being used in our downstream topic modeling task. Many language models have an input token limit, after which the input will be truncated, and lost information will not be reflected in the produced embedding. Keeping these aspects in mind, we considered several pre-trained embedding models, available from the Sentence Transformers library [16] and the HuggingFace Transformers library [1].
Five embedding approaches were evaluated:
| Embedding | Description |
|---|---|
| all-MiniLM-L6-v2 [17] | Designed for general purpose and speed. Trained on a large corpus of online data. |
| all-mpnet-base-v2 [18] | Designed for general purpose and quality. Trained on a large corpus of online data. |
| xlnet-base-cased [19] | Designed to work on language tasks that involve long context. |
| SPECTER [20] | Trained on scientific citations and designed to estimate the similarity of two publications. |
| multi-qa-MiniLM-L6-dot-v1 [21] | Designed to find relevant passages from specific queries. Trained on a large and diverse set of (question, answer) pairs. |
The evaluation included generating the top six topics (including the outlier topic) using all five embeddings on the same input. All results can be found in the April 12, 2023 Demo Wednesday slides. All embeddings produced acceptable but not great results, with no embedding clearly being the best.

Selection: Since all-MiniLM-L6-v2 is the fastest model, it was chosen as the candidate embedding for running experiments. With a significant amount of work, the results from all-MiniLM-L6-v2 have gone from acceptable to great.
These initial evaluations were performed before both the metric and word cloud work had started. With these tools now easily available, future work could include a more thorough evaluation of these and other embeddings.
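The selected embedding step can be reproduced approximately with the Sentence Transformers library as in the sketch below; this is a minimal illustration of the embedding layer only, not the full pipeline configuration.

```python
from sentence_transformers import SentenceTransformer

# all-MiniLM-L6-v2 maps each document to a 384-dimensional vector.
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

# docs: list of Item Introduction texts (or summaries / key phrases).
# Inputs longer than the model's token limit are truncated.
embeddings = embedding_model.encode(docs, show_progress_bar=True)
print(embeddings.shape)  # (number of documents, 384)
```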
- b. Dimension Reduction

The embedding vectors produced by the embedding models for each text document can have hundreds of dimensions (384 dimensions for all-MiniLM-L6-v2), which can make it increasingly difficult for clustering algorithms to compare them. For relatively small text samples, it is best to reduce that dimensionality before proceeding with the clustering step of the topic modeling process.
A variety of algorithms like Uniform Manifold Approximation and Projection (UMAP) [22],
Principal Component Analysis (PCA), t-Distributed Stochastic Neighbor Embedding (t-SNE), and Truncated Singular Value Decomposition (SVD) can be used for dimensionality reduction. They each have their advantages and disadvantages when it comes to the kinds of structures they can capture in data as well as their training and inference times. The UMAP algorithm preserves the distance to the kth nearest point in low- and high-density areas when going to a lower dimension and is recommended for use with density-based clustering algorithms like HDBSCAN. It also preserves more of the local and global features of high-dimensional data in lower projected dimensions than PCA and t-SNE. We chose to use UMAP for dimensionality reduction due to its theoretical and practical advantages and experimented with several combinations of its two main parameters: the number of neighbors considered and the number of components (dimensions) to reduce the embeddings down to.
| UMAP Parameter | Description |
|---|---|
| Number of Neighbors | The size of the neighborhood (number of neighbor samples) used during dimensionality reduction. While larger values preserve global views of the data, smaller values preserve more local data. Values: 5, 10, 15, 20 |
| Number of Components | The dimension of the space the data should be embedded into. Values: 5, 10, 20, 50 |

In general, the size of the neighborhood had a larger impact on the number of topics generated at the end than the dimension of the embeddings. As the number of neighbors increased from 5 to 10, 15, and 20, the number of topics found decreased from ~94 to ~77, ~63, and ~59, while the number of topics stayed around ~70-77 as the components increased from 5 to 10, 20, and 50. When the number of topics was larger, there were some small, similar clusters being created that could be merged. When the number of topics was smaller, there were some large but generic topics being created where the more specific, smaller ones had been merged into larger ones.
Selection: To avoid issues with large and small numbers of topics, we chose to use the default parameter combination of UMAP with 15 neighbors and 5 components.
It should be noted that UMAP is stochastic in nature and the results produced by it from one run to the next will be slightly different. We have not noticed any major differences or discrepancies in the topics found at the end, but slight variation is to be expected. There is a random seed parameter that can be set in UMAP to get consistent results for every run; however, this will disable parallelism and increase the run time of the UMAP algorithm.
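A minimal sketch of the selected dimensionality reduction step is shown below, assuming the umap-learn package and the embeddings produced earlier; the cosine metric and min_dist value are common defaults rather than project-specific choices, and setting random_state for reproducibility disables parallelism, as discussed.

```python
from umap import UMAP

# Reduce the 384-dimensional sentence embeddings to 5 components using the
# selected neighborhood size of 15. Set random_state only if reproducibility
# matters more than the run-time cost of disabling parallelism.
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric="cosine")
reduced_embeddings = umap_model.fit_transform(embeddings)
print(reduced_embeddings.shape)  # (n_documents, 5)
```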
- c. Clustering This stage of the topic modeling pipeline performs clustering on the document embeddings whose dimensionality has been reduced using UMAP. Like the previous
stage, a variety of unsupervised clustering algorithms like HDBSCAN, K-Means, and Agglomerative Clustering can be used for this step. K-Means is a centroid-based clustering algorithm where the number of clusters must be specified and every point (or document) is forced into a cluster, which can lead to noisy clusters and poor topic representations if there are outliers in the data. In contrast, HDBSCAN [23] is a density-based hierarchical clustering algorithm that finds stable clusters of varying densities without needing to specify the number of expected clusters. It is also a soft-clustering approach that allows outliers to be modeled as noise, which prevents unrelated documents from being assigned to a topic. Considering these advantages, we chose to use HDBSCAN for the clustering stage and experimented with the minimum cluster size parameter, as it had the largest effect on the results.
Four HDBSCAN clustering parameter values (10, 20, 40, 60) for minimum cluster size (the number of documents needed to form a cluster) were evaluated.
- Min Cluster Size 10: at least 10 documents were required to form a cluster, which resulted in about 130-280 topics for the various input text types.
- Min Cluster Size 20: at least 20 documents were required to form a cluster, which resulted in about 60-150 topics for the various input text types.
- Min Cluster Size 40: at least 40 documents were required to form a cluster, which resulted in about 40-65 topics for the various input text types.
- Min Cluster Size 60: at least 60 documents were required to form a cluster, which resulted in about 15-50 topics for the various input text types.
The number of clusters or topics formed decreases as the minimum number of documents required to form a cluster increases. The input text used for topic modeling also has an impact on the number of clusters formed: item introductions form the smallest number of clusters, summaries form an intermediate number, and key phrases form the largest number. With a minimum cluster size of 10, too many clusters are formed and there is significant overlap in top terms for some middle- and small-sized clusters that should be merged. With a minimum cluster size of 60, too few clusters are formed and only the frequent safety issues are captured, as some of the more specific, less frequent issues are merged into larger clusters. At minimum cluster sizes of 20 and 40, noticeable redundant clusters were not formed, but some unique clusters present at 20 were merged into larger clusters at 40.
Selection: A minimum cluster size of 20 was chosen for use in the HDBSCAN clustering algorithm. It should be noted that UMAP and HDBSCAN have not been jointly tuned to find the ideal number of clusters and there is potential to explore many combinations of parameters to tune the two algorithms in future studies.
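The clustering step with the chosen minimum cluster size can be sketched as follows, assuming the hdbscan package; prediction_data=True keeps the information needed later for the soft-clustering probabilities used in outlier reduction.

```python
from hdbscan import HDBSCAN

# Cluster the UMAP-reduced embeddings; a label of -1 marks an outlier document.
hdbscan_model = HDBSCAN(
    min_cluster_size=20,        # selected value: at least 20 documents per topic
    metric="euclidean",
    cluster_selection_method="eom",
    prediction_data=True,       # retain data needed for soft-clustering probabilities
)
cluster_labels = hdbscan_model.fit_predict(reduced_embeddings)

n_topics = len(set(cluster_labels)) - (1 if -1 in cluster_labels else 0)
print(f"{n_topics} topics found, {sum(cluster_labels == -1)} outlier documents")
```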
- d. Tokenizer and Vectorizer After the clustering of the document embeddings, the next steps in the topic modeling process use the text documents instead of the embeddings by turning the text into a bag-of-words. Because the clustering step can use any algorithm, centroid-based or density-based, the subsequent topic representation steps cannot make any assumptions about the cluster structure. All documents in a cluster are combined into one long document that represents the cluster. This long document can then be broken into words and phrases (a bag-of-words representation at the cluster level), and the occurrence of each word or phrase in each cluster can be counted to form a topic representation.
A tokenizer is used to break the input text (grouped by the topic it belongs in) into words or phrases depending on the specified n-gram range and then create a matrix of word or phrase occurrences across topics (the collection of documents that make up each topic).
Although there is flexibility to use complex tokenizers like a part-of-speech tagger or KeyphraseVectorizers that break text based on its semantic structure or extract key phrases from the text, doing so is practically infeasible due to the design of BERTopic's internal code, which combines all text documents in a topic together before feeding them to the tokenizer. When large topics of long documents are concatenated together and passed to these complex tokenizers, it can take hours to process them. To keep the pipeline fast and efficient, we use the default tokenizer and vectorizer (Scikit-Learn CountVectorizer
[24]) in BERTopic which breaks the text into words and phrases based on the n-gram range specified.
Three n-gram ranges with the Tokenizer and Vectorizer were evaluated.
- Unigrams: input text is broken up into single words (unigrams) separated by spaces, and their occurrences are counted across the documents that make up each topic.
- Unigrams and Bigrams: input text is broken up into single words (unigrams) and two-word pairs (bigrams), and their occurrences are counted across the documents that make up each topic.
- Unigrams, Bigrams, and Trigrams: input text is broken up into single words, two-word pairs (bigrams), and three-word sequences (trigrams), and their occurrences are counted across the documents that make up each topic.
A qualitative evaluation of the BERTopic topic representations produced with the unigram, bigram, and trigram ranges was performed. With only unigrams, the topic terms were too generic (valves, radiation, diesel, voltage) and full system names could not be captured. Similarly, with unigrams and bigrams, there were generic and incomplete terms, as well as stop words like "the" attached to the beginning of system names, as in "the diesel". When allowing unigrams, bigrams, and trigrams, most system and component names could be captured to identify the discovered topics. Although it is possible to allow larger n-gram ranges, doing so can pick up frequently used phrases (inspection finding boilerplate text) that are not as informative when identifying topics.
Selection: Unigrams, Bigrams and Trigrams were chosen as the range in the Tokenization and Vectorization step.
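A sketch of the selected tokenization and vectorization configuration using Scikit-Learn's CountVectorizer is shown below; the stop-word handling here is simplified (the custom list is described in the Stop Word Removal section).

```python
from sklearn.feature_extraction.text import CountVectorizer

# Count unigrams, bigrams, and trigrams so that multi-word system names such as
# "main steam isolation valve" can appear as topic terms.
vectorizer_model = CountVectorizer(ngram_range=(1, 3), stop_words="english")

# Inside BERTopic, this vectorizer is applied to each cluster's documents after
# they have been concatenated into one long class document.
```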
- e. Weighting The previous step grouped all documents in a cluster together and represented each cluster as a bag of words and phrases. From this bag-of-words representation of the clusters, we must extract words and phrases that best represent each topic and make each cluster different from the others. In order to do this, BERTopic introduced c-TF-IDF, or class-based TF-IDF, a modification of the traditional document-based TF-IDF (Term-Frequency Inverse-Document-Frequency) that compares the importance of words between documents. By applying c-TF-IDF to the concatenation of all documents that represent a cluster, we obtain importance scores for words and phrases within the cluster; the more important the words, the more representative they are of that topic.
Figure 7 Class-based TF-IDF Weighting used for BERTopic Topic Representations [25]
The c-TF-IDF score for a term in a class is calculated as follows. The set of documents in a cluster is concatenated to form a single document, or class. Then, the frequency tf(t, c) of term t in class c is computed and L1-normalized to account for differences in topic sizes, yielding the class-based TF representation. Next, the class-based IDF is computed as log(1 + A / f(t)): the logarithm of one plus the average number of words per class, A, divided by the frequency f(t) of term t across all classes; the plus one inside the logarithm forces values to be positive. Finally, the TF and IDF values are multiplied, W(t, c) = tf(t, c) × log(1 + A / f(t)), to give the importance score for each word or phrase in each class or cluster.
Selection: The weighting approach recommended by BERTopic was chosen.
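The class-based weighting described above can be expressed compactly; the sketch below is a simplified NumPy illustration of the formula, not BERTopic's internal implementation.

```python
import numpy as np

def class_tf_idf(counts: np.ndarray) -> np.ndarray:
    """counts: (n_classes, n_terms) term counts, one row per concatenated cluster."""
    tf = counts / counts.sum(axis=1, keepdims=True)              # L1-normalised class TF
    avg_words_per_class = counts.sum() / counts.shape[0]          # A in the formula
    idf = np.log(1 + avg_words_per_class / counts.sum(axis=0))    # class-based IDF per term
    return tf * idf                                                # importance per term, per class
```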
- f. Representation The tokenization, vectorization, and weighting scheme from the previous two stages essentially yield the topic representations for the document clusters; however, they can be fine-tuned further to make them more interpretable for the analyst. We explore five topic representation approaches from BERTopic (MMR, POS, MMR+POS, KeyBERTInspired, and Text Generation) and create three custom representations from NRC text (Vocab, Key Phrases, Vocab + Key Phrases).
A total of 8 representation approaches were evaluated.
Representation Description MMR Maximal Marginal Relevance (MMR) is used to decrease the redundancy and improve the diversity of the topic terms and phrases by considering the similarity of the keywords/phrases in the documents that make up a topic along with the similarity of the keywords/phrases that have been selected. A diversity value of 0.6 and top_n_words value of 100 were used.
POS Part-of-Speech (POS) is used to extract topic words and phrases that follow a specified part-of-speech pattern. For each topic, we find documents that contain any of the top 100 topic terms and pass these to the SpaCy Rule-Based Matching module [26] along with a set of part-of-speech patterns (nouns, proper nouns, adjectives followed by nouns, adjectives followed by proper nouns). The SpaCy module uses a pre-trained part-of-speech tagger to tag each word with its part-of-speech and the words and phrases that match our given patterns are returned. All of these words and phrases are treated as a set of new topic terms and phrases for each topic and are sorted by their c-TF-IDF values to get the top 100 terms and phrases.
MMR+POS The MMR+POS approach combines the Maximal Marginal Relevance technique to reduce redundant topic terms and phrases before finding terms and phrases that match our defined part-of-speech patterns.
KeyBERTInspired KeyBERTInspired approach for topic representation is a slight modification of the KeyBERT approach (used to obtain key phrases) for speed and efficiency. This approach takes the top N topic terms/phrases based on their c-TF-IDF scores and embeds them to obtain candidate key phrase embeddings. Then it obtains a sample of representative documents by comparing their c-TF-IDF with topic c-TF-IDF, embeds them and averages their embeddings to come up with a document embedding. The candidate embeddings and document embedding are compared to find the top N terms and phrases that best represent the topics.
Text Generation The Text Generation approach uses a pre-trained language model (we use google/flan-t5-base [3] from HuggingFace Transformers [1], but many others are supported) to produce short labels for topics. A text prompt is created including the top N topic terms and representative documents for the topic, followed by a question about the label for the topic, and the language model's response is used as the topic label.
Vocab This custom topic representation approach uses the vocabulary created with NRC abbreviations, full-forms, and reactor systems failure modes (see section Custom Vocabulary). SpaCy's PhraseMatcher [27] is used to find spans of text in the full Item Introduction that match any of the words or phrases in the vocabulary of abbreviations, full-forms, and reactor systems failure modes. These string matches for all item introductions are precomputed and saved so that they can be read in for all topic modeling experiments, but they can also be computed on the fly. The full Item Introduction is used for string matching to provide more text to match on even if the input type is Summary or Key Phrases.
The string matches for the documents in each topic are combined and TF-IDF scores are computed for each term or phrase, and the top N terms are used as the topic representation.
Key Phrases This custom topic representation approach uses the unique list of key phrases obtained from the item introduction key phrases that were extracted using KeyphraseVectorizers and Guided KeyBERT with custom vocabulary (see section Item Introduction Key Phrases). SpaCy's PhraseMatcher [27] is used to find spans of text in the full Item Introduction that match any of the words or phrases in the list of unique key phrases. These string matches for all item introductions are precomputed and saved so that they can be read in for all topic modeling experiments, but they can also be computed on the fly. The full Item Introduction is used for string matching to provide more text to match on even if the input type is Summary or Key Phrases. The string matches for the documents in each topic are combined, TF-IDF scores are computed for each term or phrase, and the top N terms are used as the topic representation.
Vocab + Key Phrases This custom topic representation approach combines the vocabulary and the key phrases into a unique set of words and phrases. SpaCy's PhraseMatcher [27] is used to find spans of text in the full Item Introduction that match any of the words or phrases in the vocabulary of abbreviations, full-forms, and reactor systems failure modes or the list of unique key phrases. These string matches for all item introductions are precomputed and saved so that they can be read in for all topic modeling experiments, but they can also be computed on the fly. The full Item Introduction is used for string matching to provide more text to match on even if the input type is Summary or Key Phrases. The string matches for the documents in each topic are combined and TF-IDF scores are computed for
each term or phrase, and the top N terms are used as the topic representation.
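A sketch of the string-matching step behind these custom representations, using spaCy's PhraseMatcher, is shown below; the vocabulary and document variables are placeholders for the NRC vocabulary and item introductions described above.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")  # a blank pipeline is enough for surface string matching

# Placeholder: excerpt of the combined list of abbreviations, full-forms,
# failure modes, and/or unique key phrases used for the custom representations.
vocabulary = ["emergency diesel generator", "EDG", "main steam isolation valve", "MSIV"]

matcher = PhraseMatcher(nlp.vocab, attr="LOWER")  # case-insensitive matching
matcher.add("NRC_TERMS", [nlp.make_doc(term) for term in vocabulary])

item_intro = "The inspectors identified a failure of the emergency diesel generator (EDG)."
doc = nlp.make_doc(item_intro)
matches = [doc[start:end].text for _, start, end in matcher(doc)]
print(matches)  # ['emergency diesel generator', 'EDG']
```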
Of the five BERTopic topic representations experimented with, Text Generation had the worst results, as the pre-trained models were unable to provide specific topic labels from the topic terms and representative documents. Most topics were labeled "nuclear safety" or "reactor safety", or, for some of the fire or flooding related topics, "fire hazard" or "flooding".
These pre-trained models are better suited to generating labels for topics in general domains like news or social media, whose data aligns with the models' pre-training data, but they are not recommended for technical text domains like the NRC inspection findings.
The MMR approach is recommended over no representation so that redundant words can be reduced from the topic terms. However, it will allow bi-grams and tri-grams that have numbers, letters, and prepositions before or after other words like nouns or verbs, which can make the representation a little less intuitive and interpretable. One benefit of this representation is that it allows abbreviations like MSIV, RCIC, HPCI, and SBLC, which makes it very easy for a domain expert to quickly identify which reactor systems or components a topic is about. In contrast, the POS approach will only allow terms that follow our defined patterns (nouns, proper nouns, adjective - noun, adjective - proper noun), so the topic representations are far more grammatically sound and more descriptive of the safety issue. The MMR+POS approach yields similar results to the POS approach, but by reducing the redundancy of the topic terms before the POS module, we perform the tagging on fewer documents, making the process more efficient. MMR can also be applied after the POS step to further reduce redundancies if needed. It should be noted that when using the POS representation alone or with MMR, acronyms and abbreviations like MSIV, RCIC, HPCI, and SBLC are not retained because they do not follow the specified part-of-speech patterns. This can make it a little harder to identify some topics quickly, but these representations can be used in conjunction with one of the custom representations that do preserve abbreviations.
The KeyBERTInspired approach produces topic representations that have similar characteristics to those from MMR and POS because it includes acronyms and mostly picks nouns and noun phrases. However, the downside to this approach is that it uses a sample of representative documents to come up with the representation, and if the terms selected are not as representative of all documents in the cluster as they are of the sample documents, then we could be misrepresenting the cluster. Because we can obtain better results with custom representations, we do not recommend this approach.
The three custom representations are all created by string matching the item introductions in each topic against a list of words and phrases (acronyms, full-forms and failure modes vocabulary, unique key phrases, or vocabulary and key phrases). Hence, all three representations contain full reactor system and component names and are more intuitive to a domain expert than any of the BERTopic representations. Seeing the complete phrases (like main steam isolation valve) pinpoints the exact safety issue and
the domain expert or analyst can interpret the topic without any confusion in the scope or breadth of the topic that comes from non-specific topic terms (like valve or generator).
While the vocabulary representation highlights the failure modes and system acronyms, it can only include in the topic terms the matches that are found between the item introductions and the vocabulary of 1411 words and phrases. If the documents contain other words or phrases describing the safety issue that are not in the vocabulary, then they will not be included in the topic representation. The vocabulary can be expanded to cover more terms, but there is always a possibility that it won't contain the phrases used to describe safety issues in some inspection findings. To avoid incomplete representations or potentially misrepresented topics, we recommend using the Key Phrases or the Vocab +
Key Phrases representations. Because the list of unique key phrases comes from the top 20 key phrases extracted from all item introductions, it is less likely to skip over important phrases describing safety issues.
Selection: The topic representation chosen plays a large role in the interpretability and understandability of the discovered topics. After examining coherence and diversity metrics, and incorporating NRC feedback on the various representations, the MMR+POS BERTopic representation and the Vocab+KeyPhrases custom representation were chosen as the best topic representation approaches.
- g. Stop Word Removal A common problem among the various topic representations was the presence of stop words. Although a list of common English stop words is used by default to remove stop words in the Scikit-Learn CountVectorizer [24] used in the Tokenization and Vectorization step, there were many other words and phrases that could be deemed stop words for the NRC inspection findings text. We replace the default list of stop words in Scikit-Learn's CountVectorizer (used both for BERTopic representations and the three custom representations) with a list of 337 English stop words from Gensim and 136 custom stop words, for a total of 473 words. This custom list contains words and phrases like 'safety', 'reactor', 'power plant', 'inspector', 'license', 'finding', 'cornerstone', 'cross-cutting area', and more that are found across all findings texts and are not helpful in identifying safety issues. We also include 15 unique singular words that appear in the names of cornerstones and cross-cutting areas, as they often appear as dominating terms in most clusters, for example mitigating, systems, barrier, integrity, initiating, event, human, performance, problem, identification, and resolution. Removing these custom stop words allowed other terms to be included in the top N topic terms that were far more useful in identifying and interpreting the topics.
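A sketch of how the combined stop-word list can be supplied to the vectorizer is shown below; only a handful of single-word entries are shown (the project's full list also includes phrases), and the exact custom list is the one described above.

```python
from gensim.parsing.preprocessing import STOPWORDS
from sklearn.feature_extraction.text import CountVectorizer

# Excerpt of the domain-specific stop words; the full custom list has 136 entries.
custom_stop_words = [
    "safety", "reactor", "inspector", "license", "finding", "cornerstone",
    "mitigating", "barrier", "integrity", "initiating", "event", "resolution",
]

# 337 general English stop words from Gensim plus the custom NRC terms.
stop_words = list(STOPWORDS) + custom_stop_words

vectorizer_model = CountVectorizer(ngram_range=(1, 3), stop_words=stop_words)
```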
- 3. Neural Topic Modeling Extensions
- a. Outlier Reduction The HDBSCAN clustering algorithm chosen in the topic modeling pipeline is a soft-clustering approach that allows noise to be modeled as outliers and as a result, about a third of the documents end up in the outlier cluster (cluster number -1). Experimenting
with the minimum cluster size values did not decrease this outlier number, hence we explore four outlier reduction approaches to assign outlier documents to existing clusters. Probability or similarity thresholds can be set for each approach to limit the number of outlier documents that are assigned, but we use the default configuration, allowing all outlier documents to be assigned to a cluster.
Outlier Reduction Technique Description Topic Probabilities This approach uses the HDBSCAN instance that was fitted with the documents to find topic clusters in order to find the most probable topic that each outlier document should belong to. The soft-clustering performed by HDBSCAN computes cluster-oriented distance-based membership vectors and locality-oriented outlier-based membership vectors, views them as probability mass functions, and combines them into a posterior distribution via Bayes' theorem to approximate the probability that a point is a member of each cluster. The probability threshold can be specified to limit how many outliers are assigned to existing topics.
Topic Distribution This approach computes the distribution of topics within each outlier document and assigns the document to the most frequently discussed topic. Each outlier document is split into tokens with a sliding window (by default, a 4-token window with a stride of 1 token), then the c-TF-IDF score is calculated for each window of tokens. Finally, the similarity of each window's c-TF-IDF to the c-TF-IDF of existing topics is computed, and these similarity scores for each window are summed to get a topic distribution for the full document. The document is assigned to the topic with the highest distribution.
Minimum similarity threshold can be specified to limit how many outliers are assigned to existing topics.
C-TF-IDF This approach calculates the c-TF-IDF representation for each outlier document and then finds the best matching c-TF-IDF topic representation using cosine similarity. Minimum similarity threshold can be specified to limit how many outliers are assigned to existing topics.
Embeddings This approach uses the embedding model used for the topic modeling pipeline to embed each outlier document and each topic representation. It then finds the most similar topic embedding for each outlier document embedding, and the outlier document is assigned to the corresponding topic.
Minimum similarity threshold can be specified to limit how many outliers are assigned to existing topics.
These outlier reduction techniques are available from BERTopic, and three (Topic Distribution, C-TF-IDF, and Embeddings) out of four use the BERTopic topic representations as a part of the outlier assignment process, which means that they
cannot be applied to the three custom topic representations. There is an option to perform each of these techniques to reduce outliers but keep the original topic representations computed from the non-outlier documents of each topic. We chose to update the topic representations for all topics after the outlier documents have all been assigned to an existing topic so that changes in the quality, coherence, and diversity of the topics could be measured.
Figures 8-10 below show the number of documents that were assigned to each existing topic after performing the four different outlier assignment techniques with MMR-POS BERTopic representation, when using Item Introduction, Summaries and Key Phrases as inputs for topic modeling.
Figure 8 The number of documents attributed to topics after Probability Outlier Assignment (blue),
Topic Distribution Outlier Assignment (blue), c-TF-IDF Outlier Assignment (green), Embedding Outlier Assignment (red). Item Introduction was used as input with MMR-POS Topic Representation Figure 9 The number of documents attributed to topics after Probability Outlier Assignment (blue),
Topic Distribution Outlier Assignment (blue), c-TF-IDF Outlier Assignment (green), Embedding Outlier Assignment (red). Item Introduction Summaries were used as input with MMR-POS Topic Representation.
Figure 10 The number of documents attributed to topics after Probability Outlier Assignment (blue),
Topic Distribution Outlier Assignment (blue), c-TF-IDF Outlier Assignment (green), Embedding Outlier Assignment (red). Item Introduction Key Phrases were used as input with MMR-POS Topic Representation.
The charts in the figures above, comparing the size of topics before and after each outlier reduction technique, showed that when using the Embeddings approach, several small topics significantly increased in size to become equivalent to or even larger than the three originally largest topics. Upon inspection, it was obvious that these large new topics after outlier reduction contained many generic terms common to inspection findings text (like procedure, technical, license condition) that were not indicative of any specific safety issues. Because a pre-trained model is being used for the embeddings, it is possible that this embeddings approach for outlier reduction focuses too heavily on the more generic terms that the pre-trained model has seen before, while not representing the reactor-specific terminology as accurately.
The Topic Distribution approach can be appropriate for some data if the kinds of terms used in the text are highly indicative of a topic. That is not the case for NRC inspection findings text, as many reactor systems and components are mentioned across texts that have different safety issues and should be in different topics. If Topic Distribution is used to perform outlier reduction, we may assign documents mentioning a particular reactor system to a topic about that reactor system, even if the main safety issue in the text was about something else.
The C-TF-IDF technique relies on the BERTopic topic representation which allows unigrams, bi-grams and tri-grams. As the inspection findings text contains many multi-word phrases referring to reactor systems and components (like main steam isolation valve), the n-grams may fail to capture the full phrase in the C-TF-IDF representations of the documents and the topics. With partial phrases, we may incorrectly assign outlier documents to topics that have similar partial phrases.
The Topic Probabilities technique for outlier reduction uses the same clustering algorithm that was used to find the original clusters to assign outlier documents to the most probable topic. This approach in general preserves the topic representations the most
after adding outlier documents. It also does not rely on the BERTopic topic representations, so we can compute all of our custom representations for the topics after assigning outlier documents to existing topics.
Selection: Topic Probabilities was chosen as the Outlier Reduction approach.
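The selected outlier assignment can be reproduced with BERTopic's reduce_outliers utility, as sketched below; it assumes the topic model was fitted with calculate_probabilities=True so that per-document topic probabilities are available, and that docs, topics, probs, and vectorizer_model come from the earlier pipeline stages.

```python
# topic_model is a fitted BERTopic instance; docs, topics, and probs come from
# topics, probs = topic_model.fit_transform(docs) with calculate_probabilities=True.
new_topics = topic_model.reduce_outliers(
    docs,
    topics,
    probabilities=probs,
    strategy="probabilities",   # assign each outlier to its most probable topic
    threshold=0.0,              # default: assign all outlier documents
)

# Recompute the topic representations now that the outlier documents have been added.
topic_model.update_topics(docs, topics=new_topics, vectorizer_model=vectorizer_model)
```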
- b. Categorical Topic Modeling BERTopic offers numerous topic modeling variants that can be used to further analyze the discovered topics. One such approach is topic modeling by class or category (we use the term category to avoid confusion with classification). This technique is used to see how topics are represented across different categories and lends itself well to the kind of metadata that is available with each inspection finding. Instead of running the topic model for every category, we create a topic model and then extract, for each topic, its representation per category. This allows us to view how topics, obtained from all documents, are represented for certain subgroups of documents with different properties.
First, the documents are split by the topic they have been assigned to. Then, each group of documents is split by the provided category value. Finally, c-TF-IDF representations are calculated for each subset of documents with the same topic and category.
Optionally, these category-level c-TF-IDF representations can be further tuned by averaging them with the topic-level c-TF-IDF; however, we turn this off to prevent including words in topic representation that could not be found in the documents for a category.
We use the values of the following fields for categorical topic modeling:
Category (number of unique values / number of missing values in the dataset):
- Cornerstone: 6 / 121
- Cornerstone Attribute Type: 17 / 13,928
- CrossCutting Area: 4 / 7,129
- CrossCuttingAspect: 44 / 7,129
- Region: 4 / 0
- Type: 7 / 0
- Significance: 5 / 0
- Procedure: 119 / 0
- Parent Company: 30 / 1,534
- Site Name: 66 / 0
- Reactor Type: 2 / 1,534
As of this writing, BERTopic only allows one category per document, so in cases where a finding has multiple cross-cutting areas or procedures, one value must be picked for categorical topic modeling (we pick the first). Some of the categories of interest did not have their values filled out in the spreadsheet, and we fill these in with Category Not Specified when performing this task, as BERTopic does not allow any missing category values. It should be noted that computing these representations for categories with a large number of values is time consuming (~2 minutes for 50 unique values, ~4 minutes for 100 unique values).
We compute categorical representations for the 11 categories listed in the table above, before and after topic probabilities outlier reduction, for the MMR and MMR-POS BERTopic representations, as we cannot perform categorical topic modeling using our custom representations.
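Categorical topic modeling corresponds to BERTopic's topics-per-class functionality; a sketch, assuming a fitted model and a list of category values aligned with the documents (cornerstone_values is a placeholder for the Cornerstone column), is shown below.

```python
# classes must contain one category value per document, with missing values
# filled with "Category Not Specified" and only the first value kept where a
# finding has several cross-cutting areas or procedures.
topics_per_class = topic_model.topics_per_class(docs, classes=cornerstone_values)

# Interactive chart of topic frequencies per category value, as in Figures 15-16.
fig = topic_model.visualize_topics_per_class(topics_per_class)
fig.write_html("topics_by_cornerstone.html")
```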
- c. Topic Reduction Due to the large number of topics that were discovered in some runs of topic modeling, we briefly experimented with two topic reduction approaches available in BERTopic.
Topic Reduction Technique Description Manual The manual approach requires the user to specify the number of topics that the total topics should be reduced to. It uses the c-TF-IDF vectors for each topic to iteratively find and merge the most similar topics, starting from the least frequent topic, until the specified number of topics is reached.
Automatic The automatic approach does not require the user to specify a number of topics to reduce to. Instead, it uses HDBSCAN to cluster the topic representations and merges the topics that cluster together, while leaving the outliers as standalone topics.
While manual topic reduction can be of interest to an analyst to discover similar topics that can be merged, it does force topics to be combined even if they are not very similar, just to reach the number of specified topics. In that aspect, automatic topic reduction is the more flexible solution that allows topics to not be merged if there are no similar topics. By adjusting the minimum cluster size parameter in HDBSCAN, we were able to obtain a reasonable number of topics that were specific and coherent.
Selection: Given the ability to use cluster size to achieve a similar result, we did not spend as much time experimenting with topic reduction approaches, and chose not to incorporate an additional step. There is scope for experimentation with topic reduction techniques in future studies.
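For completeness, both reduction modes are single calls on a fitted BERTopic model, as sketched below; neither is part of the recommended pipeline, and the two calls show alternatives rather than steps to run in sequence.

```python
# Manual reduction: iteratively merge the most similar topics until 40 remain.
topic_model.reduce_topics(docs, nr_topics=40)

# Automatic reduction: cluster the topic representations with HDBSCAN and merge
# only the topics that naturally group together.
topic_model.reduce_topics(docs, nr_topics="auto")
```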
D. Metrics There are many parameters that can be altered during the topic modeling process, including those mentioned in the previous sections. Ideally, subject matter experts such as Guillermo Vasquez would analyze the results from every topic modeling experiment, but realistically that would be extremely inefficient. Instead, the feedback received from Guillermo Vasquez and the NRC team is supported by two pre-existing metrics: Coherence and Diversity.
The coherence metric comes from the Gensim library [28] and implements the work outlined in "Exploring the Space of Topic Coherence Measures" [29]. The goal of the coherence metric is to quantify how coherent the top words in a topic are. The coherence metric algorithm contains four separate stages:
- 1. Segmentation:
Creates subsets of the top words from a topic of interest
- 2. Probability Estimation:
Estimates the probability of individual n-grams or the joint probability of pairs of n-grams inside a reference corpus. The reference corpus is very important and ideally should come from the same domain or a similar domain to the corpus that the topic modelling is taking place on. We built our own corpus out of a subset of NUREG and Research Information Letters from the NRC website.
- 3. Confirmation Measure:
Calculates a measure for a single pair of n-grams that represents how strongly the first n-gram supports the other n-gram. There are several different confirmation measures. In this work, UMass was the chosen confirmation measure. The UMass confirmation measure is equivalent to the log-conditional-likelihood confirmation measure [29].
m(w_i, w_j) = log((P(w_i, w_j) + ε) / P(w_j))
where (w_i, w_j) is the pair of n-grams, P(w_j) and P(w_i, w_j) are the probabilities estimated in step 2, and ε is an infinitesimally small value that keeps the argument of the logarithm positive. A larger UMass value suggests a more coherent topic.
- 4. Aggregation:
Aggregates the confirmation measures from all n-gram pairs. The mean and median are common aggregation functions.
Following these steps produces a coherence value for one topic. The average coherence over all topics is then calculated to give a final coherence score for the topic model.
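A sketch of the UMass coherence computation with Gensim is shown below, assuming the topic top words and a tokenized reference corpus (built here from the NUREG and Research Information Letter documents) are already available; the variable names are placeholders.

```python
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel

# reference_texts: list of tokenized reference documents (NUREG / RIL corpus).
dictionary = Dictionary(reference_texts)
bow_corpus = [dictionary.doc2bow(text) for text in reference_texts]

# topic_top_words: one list of top terms per discovered topic; terms must appear
# in the reference dictionary to be scored.
cm = CoherenceModel(
    topics=topic_top_words,
    corpus=bow_corpus,
    dictionary=dictionary,
    coherence="u_mass",
)
print(cm.get_coherence())            # average coherence across topics
print(cm.get_coherence_per_topic())  # one UMass value per topic
```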
The diversity metric used in this work is the pairwise embedding distance metric: the average cosine distance between the top n-grams in all pairs of topics of interest. The cosine distance is calculated between the embedded vectors of the n-grams. The embedding model used for the diversity metric is the same embedding model used in topic modeling training, all-MiniLM-L6-v2. All diversity scores calculated for this project have been positive, and thus higher diversity is considered ideal. However, since the distance calculated is the cosine distance, larger negative numbers could also mean higher diversity. The metric has been simplified in this work by only returning the absolute value of the diversity metric. Higher diversity in the topic models is good because there is less overlap among the generated topics. One important caveat of the diversity metric is that no subject matter experts have compared diversity scores to the quality of the topics, so the diversity metric should be used cautiously. The original code for the diversity metric can be found in silviatti's GitHub repository [30].
IV. Results
A. Presentation of findings During our time on this project, we have performed countless experiments yielding results that can only be truly analyzed qualitatively. Also, the results of all these experiments cannot be easily represented in a Word document, since one experiment can easily contain tens of files with thousands of words across different topic representations, as well as 40-100 figures for each representation. Thus, to keep this document readable, we focus on only a few specific results and examples. To ensure this study's completeness, we provide a full set of results in the PipelineResults-Full directory.
In the Data analysis plan section, we present various configurations for the different stages of the pipeline and select one or two recommended approaches for each. The results in the PipelineResults-Recommended directory are from using the recommended configurations for the topic modeling pipeline in Figure 11.
Three inputs (item introduction, summaries from Pegasus_cnn_dailymail, key phrases extracted from item introductions using KeyphraseVectorizers and Guided KeyBERT with custom vocabulary of abbreviations, full-forms and failure modes as the seeded words) were used as input for topic modeling. UMAP was configured with 15 neighbors and 5 components, and a minimum cluster size of 20 was used for HDBSCAN. For topic representation, MMR+POS and Vocab + Key Phrases were used in conjunction. Outlier reduction was performed using the Topic Probabilities technique, and the MMR+POS and Vocab + Key Phrases topic representations have been updated after outlier assignment. Categorical topic modeling, for 11 different categories, was performed on the MMR+POS BERTopic representation after outlier reduction.
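Putting the recommended choices together, the pipeline can be assembled roughly as sketched below; docs and stop_words are placeholders, the POS patterns shown are the library defaults rather than the project's exact patterns, and the custom Vocab + Key Phrases representation (computed outside BERTopic) is not shown.

```python
from bertopic import BERTopic
from bertopic.representation import MaximalMarginalRelevance, PartOfSpeech
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer
from umap import UMAP
from hdbscan import HDBSCAN

topic_model = BERTopic(
    embedding_model=SentenceTransformer("all-MiniLM-L6-v2"),
    umap_model=UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric="cosine"),
    hdbscan_model=HDBSCAN(min_cluster_size=20, prediction_data=True),
    vectorizer_model=CountVectorizer(ngram_range=(1, 3), stop_words=stop_words),
    representation_model=[MaximalMarginalRelevance(diversity=0.6),
                          PartOfSpeech("en_core_web_sm")],  # MMR first, then POS
    calculate_probabilities=True,   # needed for probability-based outlier reduction
)

topics, probs = topic_model.fit_transform(docs)
topics = topic_model.reduce_outliers(docs, topics, probabilities=probs,
                                     strategy="probabilities")
topic_model.update_topics(docs, topics=topics)
```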
Figure 11 Recommended Pipeline Configuration
The results found inside the directories include tables of topics and word clouds to visualize the topics. The topics file includes information on the discovered topics, like a Topic-Id (-1 refers to the outlier topic in the full results), the number of documents in each topic, a name for the topic generated by concatenating the top 4 words/phrases, the top 100 words for several different representations, and three different representative documents (which could be item introductions, summaries, or key phrases depending on what was used as input for topic modeling).
Figure 12 An example of a topic results table. Bertopic (mmr), Bertopic (mmr_pos), and Topic Terms by TF-IDF (vocab_key_phrases) are all topic representations with the top 100 terms used for their word clouds.
Figure 12 exemplifies the difficulty of utilizing the table of topics to understand the identified safety clusters. After much discussion, we decided that word clouds created more understandable safety clusters. Word clouds are generated for both BERTopic and custom topic representations, and each topic has its own topic word cloud where the sizes for words correlate with their TF-IDF scores.
Figure 13 An example of a word cloud for a topic using the BERTopic MMR-POS topic representation. Figure 14 An example of a word cloud for a topic using the Vocab + Key Phrases custom topic representation.
The document-topics file in the results directory includes the information found in the original inspection findings file provided by the NRC, some reactor location information, as well as the discovered topic-related information (topic number, name, various representations) for each row. It should be noted that the inspection findings file contained duplicated item introductions (with different metadata values in other columns), and the discovered topic information has also been duplicated for them. Some rows in the original file had blank item introductions; these rows will not have any topic-related information filled out. When using custom representations, there will also be a column (Item Introduction Term Matches) in this file that lists the string matches found in each item introduction that were used for the custom topic representation.
The document-topic file was used to create the ClusterResultsWithHeatMapsAndPivots Excel Worksheet for each input type. Sample heatmaps and pivots are provided in this file and can be extended as needed by NRC analysts.
A set of 11 categorical topic modeling charts can also be found in the results directory for BERTopic representations. These charts aid the analyst in drawing out additional insights about the discovered topics as they relate to cornerstones, cross-cutting areas, sites, parent companies, and other categories. Figure 15 below shows a screenshot of an interactive categorical topic modeling chart for the reactor site category, with the top two topics, about EDG and radiation overexposure, selected. Figure 16 shows a screenshot of the frequencies of all topics across the different values of the cornerstone category, which can be obtained by double-clicking anywhere on the Global Topic Representation pane.
Figure 15 Topic Modeling by Category: Site Name. Using Item Introduction as input and MMR-POS Topic Representation.
Figure 16 Distribution of all topics across the Cornerstone category. Using Item Introduction as input with MMR-POS Topic Representation.
In the following section we will present the word clouds for several select experiments related to different inputs, stop-words, custom representations, outlier reduction and categorical topic modeling.
B. Analysis and interpretation of results
- 1. Topic Modeling Inputs In C-1 of the Methodology section, we discussed the inputs being used in this work.
These inputs included the full item introduction, summary of the item introduction, question and answer response from item introduction, and key phrases of the item introduction. The question-answering algorithms did not consistently capture the important information from item introductions, so they were not useful inputs for topic modelling. Various configurations for the other three input methods were tested, and feedback from the NRC was used to guide the configuration choices: full item introduction, summaries from the Pegasus_cnn_dailymail model, and key phrases from KeyphraseVectorizers + Guided KeyBERT.
Despite the variation in the input text used for topic modeling, a couple of the generated clusters were very similar in content and size. Figures 17-22 show word clouds that were generated for two topics across the three inputs.
Figure 17 Topic 0 Using Item Introductions as Topic Modeling Input. Figure 18 Topic 1 Using Item Introduction as Topic Modeling Input. Figure 19 Topic 3 Using Summaries as Topic Modeling Input. Figure 20 Topic 2 Using Summaries as Topic Modeling Input. Figure 21 Topic 0 Using Key Phrases as Topic Modeling Input. Figure 22 Topic 1 Using Key Phrases as Topic Modeling Input.
The figures on the left show word clouds for a safety issue related to the Emergency Diesel Generator (EDG), while the figures on the right show a safety issue related to radiation overexposure. In both cases, the topics are consistent in size, with a large number of documents in them, across the different inputs. NRC feedback on the prominence and frequency of some common issues related to the EDG and radiation corroborates the discovery of these topics and their sizes.
The number of topics discovered differed across the three inputs but was consistent across multiple runs with the same topic modeling parameters. The fewest topics (56) were generated when using summaries as input. As the long item introduction text was condensed into fewer sentences covering less information in the summaries, there was less text to be embedded and used for topic modeling, which can result in fewer clusters. When using the full item introduction as input, 76 topics were produced, while 100 topics were produced when using the key phrases as input. Technically, the full item introduction text contains the most information, but because the sentence transformer model used to embed the text truncates the input after reaching its input limit, a large amount of text with important information could be discarded before reaching the topic model. In contrast, our approach of extracting the top 20 key phrases is more likely to capture phrases indicative of the safety issues from the full text so that all the important information can be embedded and passed to the topic model. These intricacies in the way the text is captured can affect the content and size of the generated clusters.
Due to the qualitative nature of these topics, we calculated the coherence metric for several different topic representations using these different inputs. We then supplemented the analysis with guidance from subject matter expert, Guillermo Vasquez.
The coherence metric was calculated for four different representations using the full item introduction, a summary of the item introduction, and the key phrases from the item introduction. We also used four different values for the minimum number of documents per topic to check the robustness of these calculations. The full results can be seen in the May 24th, 2023 Demo Wednesday slide deck.
Figure 23 Coherence metric results for different representations and different inputs. The red circle is used to highlight the differences in the y-axis.
Using the introduction as input provides the most coherent topics for almost all representations and minimum cluster sizes. Guillermo Vasquez was also given 20 words from each of 30 topics, with 10 generated when the introduction was used as input, 10 generated when the summary of the introduction was used as input, and 10 generated when the key phrases of the introduction were used as input. The MMR representation was used for these experiments. He was asked to qualify each topic's coherence as good, intermediate, or bad. The topics were shuffled so he was unaware of which input each topic was generated from. The results are presented in Figure 24 and confirmed that using the introduction as input created the most coherent topics. The full set of results can be found in the May 31st, 2023 slide deck.
Figure 24 Subject Matter Expert, Guillermo Vazquez, qualification of 30 different topics, with 10 being created by using the full item introduction (blue) as input, the key phrases of the item introduction (orange) as input, and the summary of the item introduction (green) as input.
These metrics results show that using the introduction as input provides the most coherent topics.
- 2. Stop word Removal.
Stop word removal is an important processing step for topic modeling tasks, even more so in a specialized technical domain like that of the NRC inspection findings text.
Throughout this study, we gathered input from the NRC on which kinds of words and phrases are considered stop words when analyzing safety clusters. The feedback was incorporated into our pipeline at two levels: stop word removal from the topic modeling input text (discussed in C-1-e of the Methodology section) and stop word removal from the topic representations (discussed in C-2-g of the Methodology section).
The presence of stop words like reactor site names and parent company names in the input text caused clusters to be formed around them, ignoring the underlying safety issues. The presence of stop words in the topic representations resulted in safety clusters with few useful identifying words. An example of these phenomena can be seen in Figure 25 where the displayed safety clusters still contain stop words at the topic
representation level (topic 11) and at the input text level (topics 8, 10, 35, 48, 51 and 61), where word clouds are inundated by the common company and reactor site names.
Figure 25 Word clouds for seven different safety clusters that still contain stop words.
Figure 26 Word clouds for seven different safety clusters that do not contain stop words.
When the stop words are removed, the word clouds in Figure 26 show that the most important words identified in the safety clusters are problem specific and of much greater use to the NRC team. These full results can be found in the May 31st, 2023 Demo Wednesday slide deck.
Currently, the input stop word removal takes out site names, codes and parent company names, and the representation stop word removal uses a customized list of 473 words and phrases, but these can be expanded to further improve the clarity of the safety clusters.
- 3. Topic Representations In this project, eight different topic representations were studied. Five of them come from BERTopic (MMR, POS, MMR+POS, KeyBERTInspired, and Text Generation) and three of them were custom generated (Vocab, Key Phrases, Vocab + Key Phrases). These different representations were explained at length in C-2-f of the Methodology section.
Figures 27-31 show the topic word clouds for the same cluster (topic 14) that were generated for each of the five different topic representations (2 from BERTopic and 3 custom) when using Item Introduction as the input for topic modeling.
Figure 27 Topic 14 with MMR Topic Representation Using Item Introduction as Input. Figure 28 Topic 14 with Vocab Topic Representation Using Item Introduction as Input. Figure 29 Topic 14 with MMR-POS Topic Representation Using Item Introduction as Input. Figure 30 Topic 14 with Key Phrases Topic Representation Using Item Introduction as Input. Figure 31 Topic 14 with Vocab + Key Phrases Topic Representation Using Item Introduction as Input.
The word clouds for the various topic representations illustrate the different characteristics of each method. While MMR includes reactor system acronyms like rcic, it also has repeated incomplete phrases like core isolation, isolation cooling, core isolation cooling, and core isolation cooling rcic, which can make the representation redundant. The MMR-POS representation doesn't include the acronyms, but its selection of nouns and noun phrases makes the topic representation less redundant and more grammatically sound. The Vocab representation is generally less dense than the others because it only represents what is included in the customized vocabulary of 1411 words and phrases, but it can provide insights about the failure modes of reactor systems and components that are not as prominent in other representations. The Key Phrases representation may not always capture the acronyms, but it does capture the complete phrase describing a system or component, like reactor core isolation cooling. As illustrated in Figure 31, the Vocab + Key Phrases representation is ideal as it can capture the acronyms, the complete phrases describing systems or components, and their failure modes.
The coherence metric was calculated on the topics generated from the same input for three different topic representations (MMR, MMR+POS, no representation) and four different minimum cluster sizes, shown in Figure 32. The results show that MMR+POS is the most coherent representation. These results were confirmed by Guillermo Vasquez in Figure 33. The full set of results can be found in the May 24th, 2023 Demo Wednesday slide deck.
Figure 32 Coherence metric of topics generated by the same input for four different minimum number of documents per topic
Figure 33 Subject Matter Expert, Guillermo Vazquez, qualification of 60 different topics, with 20 being created by using MMR representation (blue), MMR+POS representation (orange), and no representation (green).
Figure 23 contained the coherence metric for the MMR, Vocab, Key Phrases, and Vocab + Key Phrases representations for the same inputs. Key Phrases and Vocab + Key Phrases had the highest coherence metric, meaning that these two representations produce the most coherent safety clusters. These results, combined with feedback from the NRC, have led to the decision that the default topic representation should be Vocab + Key Phrases.
- 4. Outlier Reduction The four different outlier reduction techniques described in C-3-a of the Methodology section were applied to topics generated from item introductions, summaries of item introductions and key phrases of item introductions. The resulting BERTopic topic representations can be found in the PipelineResults-Full directory. As discussed in the C-3-a of the Methodology section, we chose to use Topic Probability outlier assignment for both BERTopic and custom topic representations. Figures 34 - 37 below compare the topic representations before and after assigning outlier documents to existing topics with the Topic Probabilities approach.
Figure 34 Topic 41 with MMR-POS before Topic Probabilities Outlier Reduction Figure 35 Topic 41 with MMR-POS after Topic Probabilities Outlier Reduction Figure 36 Topic 32 with Vocab +
Key Phrases Representation before Topic Probabilities Outlier Reduction Figure 37 Topic 32 with Vocab +
Key Phrases Representation after Topic Probabilities Outlier Reduction Assigning outlier documents to existing topics by using the Topic Probabilities technique does not change the representation of the topic itself as visualized in the figures above.
The main safety issue related to circuit breakers represented in Topic 41 does not change after outlier reduction, but some words become more significant to the topic as more documents are added, and new words may also appear in the word cloud.
Similarly, Topic 32 is about main steam isolation valves (MSIV), which stays consistent after outlier reduction. The cluster representations not changing significantly after outlier assignment indicates that the outlier documents assigned to these clusters did contain similar terms and phrases, and thus have been placed in the best topic.
Utilizing this technique allowed almost a third of the documents that were considered outliers in the clustering stage to be reliably assigned to an existing cluster.
The coherence and diversity metrics were calculated before and after applying the four different outlier reduction techniques to topics generated using the item introduction as
input and the key phrases of the item introduction as input. The representation was MMR+POS. The line plots are shown in Figure 38 and Figure 39. There is no significant difference in the coherence metric, and only a small difference is seen in the diversity metric results.
Figure 38 The coherence metric of the topics before outlier reduction (None) or after C-TF-IDF (CTIFD), Embedding (EMBD), Topic Distribution (DIST), and Topic Probability (PROB) outlier reduction. Results are presented when using the introduction as input (blue) and the key phrases of the introduction as input (black).
Figure 39 The pairwise embedding distance (Diversity Metric) of the topics before outlier reduction (None) or after C-TF-IDF (CTIFD), Embedding (EMBD), Topic Distribution (DIST), and Topic Probability (PROB) outlier reduction. Results are presented when using the introduction as input (blue) and the key phrases of the introduction as input (black).
A similar experiment was performed when the custom representations were being used. This time, when Probability Outlier Reduction is applied to the topics, there is much more variation in both the coherence and diversity metrics. The coherence metric increases significantly while the diversity metric decreases slightly. The results are presented in the tables below. Based on the increase in coherence and the slight decrease in diversity, outlier reduction is recommended.

Coherence, Introduction Input (Vocab / Key Phrases / Vocab + Key Phrases):
- No Reduction: -1.201 / -0.801 / -0.804
- Probability Outlier Reduction: -1.207 / -0.685 / -0.686

Coherence, Key Phrases Input (Vocab / Key Phrases / Vocab + Key Phrases):
- No Reduction: -1.453 / -0.886 / -0.877
- Probability Outlier Reduction: -1.298 / -0.711 / -0.717

Diversity, Introduction Input (Vocab / Key Phrases / Vocab + Key Phrases):
- No Reduction: 0.836 / 0.812 / 0.813
- Probability Outlier Reduction: 0.826 / 0.803 / 0.806

Diversity, Key Phrases Input (Vocab / Key Phrases / Vocab + Key Phrases):
- No Reduction: 0.836 / 0.817 / 0.820
- Probability Outlier Reduction: 0.828 / 0.802 / 0.804

As discussed in the Methodology section, the Topic Probabilities technique uses the same clustering algorithm that found the original clusters to assign outlier documents to their most probable topic, generally preserves the topic representations the most after adding outlier documents, and does not rely on the BERTopic topic representations, so all of our custom representations can be computed after outlier assignment. Hence, Topic Probabilities was chosen as the Outlier Reduction approach.
C. Discussion of any unexpected or significant findings Text summarization and Question-Answering, described in the Methodology section, were only intended to be preprocessing techniques for item introductions so that more condensed versions of the text could be used for topic modeling. The tables below include samples of summaries and answers generated by different pre-trained language models for the same inspection finding item introductions that were presented to the NRC.
The NRC inspectors identified a Green finding and associated Non-cited Violation (NCV) of 10 CFR 50, Appendix B, Criterion III, Design Control when the licensee performed a plant modification to remove the automatic trip of the circulating water pumps after a flood is
detected in the turbine building. Specifically, the licensee did not evaluate the impact of water leakage into the emergency diesel generator (EDG) and battery rooms during the analyzed internal flooding event.
Summary (google/pegasus-cnn_dailymail)
NRC inspectors identified a Green finding and associated Non-cited Violation (NCV) of 10 CFR 50, Appendix B, Criterion III, Design Control. The licensee did not evaluate the impact of water leakage into the emergency diesel generator (EDG) and battery rooms during the analyzed internal flooding event.
Summary (google/flan-t5-base)
The NRC found a Green finding and a NCV in 10 CFR 50, Appendix B, Criterion III, Design Control when the licensee performed a plant modification to remove the automatic trip of the circulating water pumps after a flood is detected in the turbine building.
Question-Answering (deepset/roberta-base-squad2)
What is the most important safety issue discussed in this passage?
water leakage
Question-Answering (google/flan-t5-base)
What is the safety issue in this passage?
The licensee did not evaluate the impact of water leakage into the emergency diesel generator (EDG) and battery rooms during the analyzed internal flooding event

The inspectors identified a finding of very low safety significance (Green) and an associated non-cited violation (NCV) of 10 CFR 50.55a(f), Preservice and Inservice Testing Requirements, for the licensee's failure to include Emergency Diesel Generator (EDG) Fuel Oil Transfer Pump (FOTP) Discharge Unloader Valve, FO-3983A, into the IST Program in accordance with American Society of Mechanical Engineers (ASME) Operation and Maintenance (OM) Code - 2017 Edition, Subsection ISTA-1100, Scope. Specifically, during the most recent IST Program update, the licensee incorrectly applied the exclusion criteria from the ASME OM Code to the valve, resulting in the exclusion of its safety functions from the scope of the IST Program for testing.
Summary (google/pegasus-cnn_dailymail)
The inspectors identified a finding of very low safety significance (Green) and an associated non-cited violation (NCV) of 10 CFR 50.55a(f)
The licensee incorrectly applied the exclusion criteria from the OM Code to the valve resulting in the exclusion of its safety functions from the scope of the IST Program for testing.
Summary (google/flan-t5-base)
The IST Program was re-examined for a violation of 10 CFR 50.55a(f) for the failure to include the emergency diesel generators Fuel Oil Transfer Pump (FOTP) Discharge Unloader Valve, FO-3983A, into the IST Program.
Question-Answering (bert-large-uncased-whole-word-masking-finetuned-squad)
What is the most important safety issue discussed in this passage?
failure to include Emergency Diesel Generator (EDG) Fuel Oil Transfer Pump
Question-Answering (google/flan-t5-base)
What is the safety issue in this passage?
The licensee incorrectly applied the exclusion criteria from the ASME OM Code to the valve resulting in the exclusion of its safety functions from the scope of the IST Program for testing

In general, the question-answering models provided shorter responses that focused on the broader safety issue discussed in the item introduction, while the summarization models provided multi-sentence summaries covering the affected reactor system or component, the causes and impacts of the event, and even the federal regulation that was violated. The different characteristics of these models allowed different kinds of information to be pulled from the longer documents, which can be insightful for inspection officers and operating experience officers at the NRC and can make it much easier and faster for them to review documents.
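As an illustration of how such summaries and answers can be produced, the following is a minimal sketch using the Hugging Face transformers pipeline API with two of the pre-trained models discussed above; the variable `introduction` stands in for an item-introduction string, and the length limits shown are illustrative assumptions rather than the settings used in this study.

```python
from transformers import pipeline

# Abstractive, multi-sentence summary of an item introduction
summarizer = pipeline("summarization", model="google/pegasus-cnn_dailymail")
summary = summarizer(introduction, max_length=96, min_length=24, truncation=True)[0]["summary_text"]

# Short, extractive answer focused on the safety issue
qa = pipeline("question-answering", model="deepset/roberta-base-squad2")
answer = qa(
    question="What is the most important safety issue discussed in this passage?",
    context=introduction,
)["answer"]

print(summary)
print(answer)
```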
While we shared these summary and question-answering results, NRC staff expressed significant interest in a summarization or gisting tool as a standalone utility. A proposal related to this work has been added to the Recommendations for further research or improvements section.
V.
Recommendations for further research or improvements

This study was designed to determine whether unsupervised machine learning could be used to identify safety concerns in nuclear facility inspection reports. The study concluded that safety clusters can be formed that convey useful information to safety inspectors and operating experience analysts.
Based on our research and analysis in the three phases of this project, we have identified five topics that warrant further investigation. Each of these initiatives could be pursued independently and would be similar in size to the current project. SphereOI considers the study itself a success; further work will be required to enable NRC analysts to easily access the insights produced by the unsupervised models.
A.
Document Summarization tool to Accelerate Analysis.
- 1. Problem:
Analysts encounter an overwhelming number of documents that must be understood and prioritized quickly.
2. Background
While presenting the methods and results of the feasibility study, we demonstrated the capability to summarize long documents using natural language processing with pre-trained models. The analysts attending the sessions expressed significant interest in the ability to summarize these and other documents to enhance their understanding of large volumes of documents. Several models are available for this purpose, and bringing those models together into a tool that presents documents and summaries to the analyst will improve understanding and save significant time and effort in knowledge-gathering tasks.
- 3. SOW
Description:
- a. Create an operationalizable Tool Prototype to assist analysts in understanding large volumes of documents quickly.
- b. Machine generate document summaries that are appropriate for the NRC domain to enable quick understanding of numerous documents.
- c. Select, configure, assess, and compare multiple NLP models to match the NRC staff's needs.
- d. Process multiple types of documents including Inspection Reports, LERs, and others.
- e. Perform summarization as needed for small and large sets of documents.
- f. Store document summaries for shared access.
- g. Deliverables:
- 1) Deployable software and user interface to present tailorable document summaries to the analyst.
- 2) Storage solution for document summaries.
- 3) Inferencing solution for multiple NLP models.
- h. Key Personnel
- 1) NLP expert - master's degree, 3 years of experience, experience working with NRC documents and formats, understanding of NRC inspection reports. Deep knowledge of natural language processing, Python, topic modeling, unsupervised machine learning, AI/ML pipeline design, experiment design, experiment execution, and analysis of results.
- 2) Nuclear Subject Matter Expert - PhD in physical sciences, knowledge of nuclear power plant operations, inspections, and reports. Knowledge of radiation safety and measurements.
- 3) User Experience expert - demonstrated ability to present AI/ML model results in an intuitive user interface that improves productivity.
- 4) Software Engineering expert - Integrating AI/ML models with operational software systems including deploying inferencing systems and providing storage and access to model results.
- i. 2-week turnaround for proposal.
- j. Max $150K.
- 4. Benefit to NRC:
Analysts in the Operating Experience and Inspection groups must read and digest vast amounts of report data on a regular basis. To reduce the load on these analysts, and to enable a high-level understanding quickly, document summaries can be generated and presented. This enables the analyst to get an overall understanding of the body of documents and can provide the information needed to flag documents for more in-depth evaluations.
B.
Dynamic Analysis and Discovery.
- 1. Problem:
Serious safety issues and trends are difficult to identify using text-based reports.
2. Background
The feasibility study confirmed that unsupervised machine learning can be used to create safety clusters that contain inspection reports about related safety concerns. To turn the information contained in the safety clusters created in the feasibility study into actionable intelligence, additional software tools are needed. This software will present model results and enable the analysts to perform ad hoc analysis and query of the safety clusters and the inspection report documentation using visual queries.
- 3. SOW
Description:
- a. Create an operationalizable Tool Prototype to allow NRC analysts to build visual queries and perform deep-dive analysis of inspection reports and safety clusters.
- b. Enable analysts to explore and understand safety clusters and safety issues quickly.
- c. Integrate with current and future safety cluster machine learning models to visualize the content and connections discovered in the clustering results.
- d. Provide a search and storage solution for safety clusters integrated with current and historical site information.
- e. Create a user interface to allow dynamic pivots based on safety issues identified in safety clusters and NRC inspection and other reports.
- f. Find sites, both within and across safety clusters, that exhibit similar circumstances and conditions that could lead to safety hazards.
- g. Provide dynamic modeling and visualization of safety cluster changes over time.
- h. Deliverables:
- 1) Deployable software and user interface for on-demand visual queries and analysis of safety cluster and historical site data.
- 2) User interface to enable dynamic pivots on inspection report information.
- 3) Integrated user interface to model and visualize safety cluster changes over time.
- 4) Search and storage solution for safety clusters integrated with historical site data.
- i. Key Personnel
- 1) NLP expert - master's degree, 3 years of experience, experience working with NRC documents and formats, understanding of NRC inspection reports. Deep knowledge of natural language processing, Python, topic modeling, unsupervised machine learning, AI/ML pipeline design, experiment design, experiment execution, and analysis of results.
- 2) Nuclear Subject Matter Expert - PhD in physical sciences, knowledge of nuclear power plant operations, inspections, and reports. Knowledge of radiation safety and measurements.
- 3) User Experience expert - demonstrated ability to present AI/ML model results in an intuitive user interface that improves productivity.
- 4) Software Engineering expert - Integrating AI/ML models with operational software systems including deploying inferencing systems and providing storage and access to model results.
- 5) Storage and search expert - provide storage solution to enable dynamic and visual queries of safety clusters and historical data.
- j. Deliverables:
- 1) Deployable software and user interface to allow NRC analysts to build visual queries and perform deep-dive analysis of inspection reports and safety clusters.
- 2) Storage solution to support ad hoc visual queries.
- k. 2-week turnaround for proposal.
- l. Max $150K.
- 4. Benefit to NRC:
Analysts will be able to identify and research safety issues and potential remedies more quickly through deep-dive searches based on cornerstones, cross-cutting areas, and other key data items. "Pulling the string" analysis of safety issues, how they relate to each other, and which inspection procedures are involved will accelerate the development of guidance and recommendations.
C.
Safety Cluster Tuning:
- 1. Problem:
The feasibility study to determine if unsupervised machine learning could be used to prioritize safety inspections was a success but stopped short of optimizing the safety clusters. Additional model tuning is needed to ensure safety issues are effectively assigned to clusters.
2. Background
A multitude of tuning approaches are available for improving the quality of the safety clusters developed during the feasibility study. A handful of those methods were applied during the study to create useful clusters and to validate the approach. Additional tuning will enable the NRC to further focus the safety clusters to reveal the most impactful safety concerns, leading to meaningful enhancements to analysis and procedures.
- 3. SOW
Description:
- a. Model research and development study.
- b. Enhance machine learning models to align safety findings and clusters.
- 1) Describe up to 3 different approaches to enhance safety cluster performance using machine learning.
- 2) Provide strengths and weaknesses of each method as they pertain to the data and desired outcomes.
- 3) Provide details on how the approach used in the feasibility study can be extended and improved.
- c. Implement one of the 3 approaches as selected by the NRC.
- d. Use metrics developed in the feasibility study to assess updated cluster quality.
- e. Refine metrics to enhance clustering algorithm selection.
- f. Provide SME analysis of updated clusters to assess quality and to validate metrics and metric enhancements.
- g. Present clusters with representations developed during the feasibility study to enable objective, side-by-side comparisons to previous solutions.
- h. Key Personnel:
- 1) AI/ML expert with knowledge of refining, training, and fine-tuning large models. The AI/ML expert must understand the nuances of leveraging pre-trained models and when the pre-trained models must be enhanced. PhD in Machine Learning, Physical Science, or related fields.
- 2) NLP expert - master's degree, 3 years of experience, experience working with NRC documents and formats, understanding of NRC inspection reports. Deep knowledge of natural language processing, Python, topic modeling, unsupervised machine learning, AI/ML pipeline design, experiment design, experiment execution, and analysis of results.
- 3) Nuclear Subject Matter Expert - PhD in physical sciences, knowledge of nuclear power plant operations, inspections, and reports. Knowledge of radiation safety and measurements.
- 4) User Experience expert - demonstrated ability to present AI/ML model results in an intuitive user interface that improves productivity.
- 5) Software Engineering expert - Integrating AI/ML models with operational software systems including deploying inferencing systems and providing storage and access to model results.
- 6) Storage and search expert - provide storage solution to enable dynamic and visual queries of safety clusters and historical data.
- i. Deliverables:
- 1) Enhanced safety cluster generation model.
- 2) Study results and tradeoffs.
- 3) Integration into Jupyter notebook solution delivered for the feasibility study.
- j. 2-week turnaround for proposal.
- k. Max $150K.
- 4. Benefit to NRC:
Increasing the quality of the safety clusters will improve the performance of inspectors and Operations Experience teams, further reducing the physical and mental load on the analysts.
D.
Cluster Representation - Giving the clusters good names and descriptions.
- 1. Problem:
Safety clusters bring together safety issues from numerous sites, components, cornerstones, and inspection procedures. Because this technology is new, interpreting the unifying theme of a cluster can be difficult.
2. Background
Safety clusters will provide important insights to a wide variety of stakeholders at the NRC. Inspectors, operating experience analysts, on-site engineers, accreditors, and others will be able to leverage the results of the clustering analysis. Each of these groups will require slightly different information about the clusters to perform their analysis.
Curating multiple representations of the clusters, and presenting them to the analysts from different domains, will multiply the value of the clustering analysis.
- 3. SOW Description
- a. Model research and development study.
- b. Provide analysts with better insights into safety issues identified in a machine generated safety cluster.
- c. Explore the ability to recommend potential remedies, inspection procedure identification, and procedure updates.
- d. Extend the custom cluster representation efforts performed in the feasibility study to narrow in on the systems and failure modes presented in the inspection reports.
- e. Leverage custom-defined named entity recognition plus customized pattern matching (laws/codes/regulations, reactor locations/sites, parent companies, reactor structures, systems and components, and their failure modes); a brief illustrative sketch appears after this list. Present cluster descriptions, including entities and patterns, in an intuitive online dashboard.
- f. Expand topic modeling by category (Cornerstones + Cross-cutting Areas, Sites, Regions, Inspection Procedures) presented in the feasibility study. Identify categories that are most useful to the NRC and codify the presentation.
- g. Key Personnel:
- 1) NLP expert - master's degree, 3 years of experience, experience working with NRC documents and formats, understanding of NRC inspection reports. Deep knowledge of natural language processing, Python, topic modeling, unsupervised machine learning, AI/ML pipeline design, experiment design, experiment execution, and analysis of results.
- 2) Nuclear Subject Matter Expert - PhD in physical sciences, knowledge of nuclear power plant operations, inspections, and reports. Knowledge of radiation safety and measurements.
- 3) User Experience expert - demonstrated ability to present AI/ML model results in an intuitive user interface that improves productivity.
- 4) Software Engineering expert - Integrating AI/ML models with operational software systems including deploying inferencing systems and providing storage and access to model results.
- h. Deliverables:
- 1) Improved cluster representations including selectable representations for different stakeholders.
- 2) Integration into Jupyter Notebook solution from the feasibility study.
- 3) Visual presentation of representations.
- i. 2-week turnaround for proposal.
- j. Max $150K.
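The following is a brief, hedged sketch of the pattern-matching half of the approach described in item e above, using spaCy's PhraseMatcher; the phrase lists, labels, and the variable `introduction` are hypothetical placeholders (real lists would be curated with NRC subject matter experts), and custom named entity recognition is not shown.

```python
import spacy
from spacy.matcher import PhraseMatcher

# Assumes the small English model has been installed: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")

# Hypothetical, illustrative pattern lists keyed by entity label
patterns = {
    "REGULATION": ["10 CFR 50", "10 CFR 50.55a(f)", "ASME OM Code"],
    "COMPONENT": ["emergency diesel generator", "fuel oil transfer pump", "circulating water pump"],
}
for label, phrases in patterns.items():
    matcher.add(label, [nlp.make_doc(text) for text in phrases])

# introduction: item-introduction text for one inspection finding (placeholder)
doc = nlp(introduction)
entities = [(nlp.vocab.strings[match_id], doc[start:end].text) for match_id, start, end in matcher(doc)]
print(entities)
```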
- 4. Benefit to NRC:
Providing intuitive descriptions of the safety clusters discovered through unsupervised machine learning will expose the underlying issues and allow the analyst to focus on the most serious safety issues in priority order. This will reduce the cognitive load on the analyst by allowing high-risk areas to be identified quickly, without requiring each analyst to maintain full intellectual control over all nuclear sites simultaneously.
E.
Safety Event Alerting.
- 1. Problem:
The unsupervised learning approach used in the initial study serves to discover safety clusters and to assign existing inspection reports to those clusters. This analysis stops short of using the predictive power of advanced analytics to identify when and where the next significant safety finding may occur.
2. Background
In addition to revealing clusters of safety concerns, AI/ML can serve as a check and balance on the inspection process. Using various inputs including Licensee Event Reports, Event Messages, and self-reporting metrics, patterns can be discovered and used as a second check on the inspections. Expected findings can be compared to actuals and statistically significant discrepancies can be examined.
- 3. SOW
Description:
Use self-reporting, LERs, event messages, and other NRC inputs to:
- a. Model research and development study with an operationalizable Tool Prototype.
- b. Inform analysts of potential safety events.
- c. Text classification.
- 1) Use inspection reports to train a system that can classify event notifications and licensee event reports into cornerstones and cross-cutting areas; a brief illustrative classification sketch appears after this list.
- 2) Enable finer-grained classification into cornerstone attributes, cross-cutting area aspects, or inspection procedures.
- d. Key Personnel:
- 1) AI/ML expert with knowledge of refining, training, and fine-tuning large models. The AI/ML expert must understand the nuances of leveraging pre-trained models and when the pre-trained models must be enhanced. PhD in Machine Learning, Physical Science, or related fields.
- 2) NLP expert - master's degree, 3 years of experience, experience working with NRC documents and formats, understanding of NRC inspection reports. Deep knowledge of natural language processing, Python, topic modeling, unsupervised machine learning, AI/ML pipeline design, experiment design, experiment execution, and analysis of results.
- 3) Nuclear Subject Matter Expert - PhD in physical sciences, knowledge of nuclear power plant operations, inspections, and reports. Knowledge of radiation safety and measurements.
- 4) User Experience expert - demonstrated ability to present AI/ML model results in an intuitive user interface that improves productivity.
- 5) Software Engineering expert - Integrating AI/ML models with operational software systems including deploying inferencing systems and providing storage and access to model results.
- e. Deliverables:
- 1) Predictive model for upcoming safety events.
- 2) Ability to compare prediction to actual report findings.
- 3) Alerts for expected significant findings.
- f. 2-week turnaround for proposal.
- g. Max $150K.
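As a rough illustration of the kind of text classification proposed in item c above, the sketch below trains a simple TF-IDF plus logistic regression baseline; this is one plausible starting point, not the proposed method, and the variables `report_texts`, `cornerstone_labels`, and `new_event_text` are placeholders for NRC data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# report_texts: inspection-finding texts; cornerstone_labels: their cornerstone assignments (placeholders)
classifier = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=2, stop_words="english"),
    LogisticRegression(max_iter=1000),
)
classifier.fit(report_texts, cornerstone_labels)

# Classify a new event notification or licensee event report into a cornerstone
predicted_cornerstone = classifier.predict([new_event_text])[0]
print(predicted_cornerstone)
```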
- 4. Benefit to NRC:
Enables analysts to proactively address safety issues rather than waiting for inspectors to report findings. This will improve the self-assessment process, reduce time on-site, and remove bias from the reports.
VI.
Conclusion A.
Recap of the study objectives and main findings

The objective of this acquisition was to evaluate the suitability of commercially available machine learning (ML) systems to perform unsupervised learning to identify safety clusters among US nuclear power plants, and to perform an in-depth evaluation of a selected ML system to identify safety clusters using the inspection reports of the Nuclear Regulatory Commission as input data.
The commercially available cloud environments that were studied (Amazon, Azure, Google, and MATLAB) all provide capabilities and resources to perform studies of this nature. However, to build a machine learning pipeline that is flexible, robust, and tailored to the NRC data, using Python libraries and pre-trained models in a Jupyter Notebook environment provides three significant advantages over choosing a single cloud provider's environment.
- 1. Flexibility in selecting Algorithms.
Each cloud provider offers a very limited set of algorithms and models for machine learning. For unsupervised topic modeling, the offerings were LDA and an unpublished neural model. Neither provided the configuration flexibility or the extensive documentation we were able to leverage by using the BERTopic modeling approach; a minimal configuration sketch of that approach appears after this list.
- 2. Cost.
The scale of the effort is relatively small in machine learning terms; the studies performed and the models used for this task could be handled on commodity desktop hardware. While the cost of cloud resources to perform these tasks would have been minimal, using existing resources avoided that cost entirely.
- 3. Flexibility in execution environment.
By using the Jupyter notebook approach, future work on this task can be performed using on-premises hardware or any of the cloud service providers. There is no lock-in to a single environment or configuration, allowing new versions of the algorithms and models to be incorporated without significant effort.
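For reference, the following is a minimal sketch of a BERTopic pipeline of the kind described above, combining a sentence-transformers embedding model with UMAP and HDBSCAN; the model name and all parameter values shown here are illustrative assumptions, not the tuned configuration delivered with this study.

```python
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN

# Embedding, dimensionality-reduction, and clustering components (parameters illustrative only)
embedding_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
umap_model = UMAP(n_neighbors=15, n_components=5, metric="cosine", random_state=42)
hdbscan_model = HDBSCAN(min_cluster_size=25, metric="euclidean", prediction_data=True)

topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    calculate_probabilities=True,
)

# docs: list of inspection-finding item introductions (placeholder)
topics, probs = topic_model.fit_transform(docs)
print(topic_model.get_topic_info().head())
```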
The second objective, identifying and creating safety clusters based on NRC inspection reports, was also successfully completed. The models were configured and tuned to improve the quality of the clusters. Representations of the clusters were provided to the NRC for input, and the responses were positive. As a final validation of the clusters, a short study was performed: four inspection reports were identified by the Operating Experience group as being both important and related to each other. More than 20,000 reports were placed into approximately 100 clusters. Analysis showed that three of the four inspection reports of interest fell into the same cluster, and the fourth was assigned to a cluster that made logical sense to the NRC and SphereOI teams.
Word cloud representations of the safety clusters provide insight into the safety issues that formed clusters. These clusters were formed using unsupervised methods, so no subject matter experts supplied themes, systems, components or other organizing factors. This further validates the approach and promise of using unsupervised machine learning technology to identify and analyze safety topics in the nuclear reactor fleet.
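A word cloud of this kind can be generated directly from a fitted topic model's term weights; the sketch below shows one possible way to do so, assuming the `wordcloud` package and a fitted BERTopic model, with `topic_id` as a placeholder for the safety cluster of interest.

```python
from wordcloud import WordCloud

# topic_model: fitted BERTopic model; topic_id: integer id of one safety cluster (placeholder)
term_weights = dict(topic_model.get_topic(topic_id))  # {term: c-TF-IDF weight}

cloud = WordCloud(width=800, height=400, background_color="white")
cloud.generate_from_frequencies(term_weights)
cloud.to_file(f"safety_cluster_{topic_id}_wordcloud.png")
```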
B.
Summary of the study's contributions and potential benefits

The objectives of this study posed several novel challenges. First, safety clusters have a fuzzy definition: there are no hard-and-fast rules for what constitutes a safety cluster, and there are no previously defined safety clusters to use as guidance. This means that there is no labeled data available to train a supervised machine learning system. To overcome this, unsupervised machine learning was employed.
Second, performing studies with unsupervised machine learning offers its own set of challenges.
In most cases, unsupervised machine learning requires large volumes of data to detect patterns in a generalized fashion. The data available for this study, while substantial, does not rise to the typical volume needed to perform unsupervised learning. To overcome this, pre-trained models were assessed through extensive experiments to select models that provided valuable results.
Third, when the domain under study contains jargon, technical language, and niche concepts, using pre-trained models can present problems. Care must be taken to ensure the key concepts in the technical language are given proper weight and consideration. To overcome this, numerous embedding techniques were assessed to identify those that provided the best vectorized versions of the text for use in the clustering algorithms.
Finally, measures of the quality of safety clusters do not exist, so understanding the quality of the results is difficult. To overcome this, three techniques were employed: 1) metrics for cohesion and diversity in the clusters were researched and implemented, 2) subject matter experts provided independent assessments of the clusters, and 3) custom representations were developed using a curated vocabulary and word cloud visualizations to quickly reveal the themes contained in the machine-generated clusters.
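As an illustration of the first technique, the sketch below computes a topic coherence score with Gensim and a pairwise-embedding-distance diversity score; the coherence measure ("c_npmi"), the embedding model, and the variables `topic_words` and `tokenized_docs` are assumptions for illustration, not the exact formulations used in this study.

```python
import itertools
import numpy as np
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel
from sentence_transformers import SentenceTransformer

# topic_words: list of top-term lists, one per safety cluster (placeholder)
# tokenized_docs: tokenized corpus used for clustering (placeholder)
dictionary = Dictionary(tokenized_docs)
coherence = CoherenceModel(
    topics=topic_words, texts=tokenized_docs, dictionary=dictionary, coherence="c_npmi"
).get_coherence()

# Diversity as the mean pairwise distance between embedded topic descriptions
embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
topic_vectors = embedder.encode([" ".join(terms) for terms in topic_words])
pairs = itertools.combinations(range(len(topic_vectors)), 2)
diversity = np.mean([np.linalg.norm(topic_vectors[i] - topic_vectors[j]) for i, j in pairs])

print(f"coherence={coherence:.3f}, diversity={diversity:.3f}")
```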
By overcoming these challenges, this study has opened the door to improving the safety of the nuclear reactor fleet by identifying patterns and groupings in the thousands of inspection reports, a task that is difficult for an analyst given the cognitive load required to read and understand more than 20,000 documents and to recall details from each report.
VII.
References A.
List of cited sources and relevant literature
[1] https://huggingface.co/docs/transformers/index
[2] https://huggingface.co/t5-base
[3] https://huggingface.co/google/flan-t5-base
[4] https://huggingface.co/facebook/bart-large-cnn
[5] https://huggingface.co/google/pegasus-cnn_dailymail
[6] https://huggingface.co/google/pegasus-xsum
[7] https://huggingface.co/google/pegasus-arxiv
[8] https://huggingface.co/google/pegasus-pubmed
[9] https://huggingface.co/deepset/roberta-base-squad2
[10] https://huggingface.co/bert-large-cased-whole-word-masking-finetuned-squad
[11] https://github.com/MaartenGr/KeyBERT
[12] https://maartengr.github.io/KeyBERT/guides/quickstart.html
[13] https://arxiv.org/pdf/2210.05245.pdf
[14] https://github.com/TimSchopf/KeyphraseVectorizers
[15] https://arxiv.org/abs/2203.05794, https://github.com/MaartenGr/BERTopic
[16]
[17] https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
[18] https://huggingface.co/sentence-transformers/all-mpnet-base-v2
[19] https://huggingface.co/xlnet-base-cased
[20] https://huggingface.co/sentence-transformers/allenai-specter
[21] https://huggingface.co/sentence-transformers/multi-qa-MiniLM-L6-dot-v1
[22] https://github.com/lmcinnes/umap
[23] https://github.com/scikit-learn-contrib/hdbscan
[24]
[25] https://maartengr.github.io/BERTopic/algorithm/algorithm.html#5-topic-representation
[26] https://spacy.io/usage/rule-based-matching
[27] https://spacy.io/api/phrasematcher
[28] Gensim: Topic modelling for humans (radimrehurek.com)
[29] Exploring the Space of Topic Coherence Measures, Proceedings of the Eighth ACM International Conference on Web Search and Data Mining
[30] https://github.com/silviatti/topic-model-diversity
VIII.
Appendix

Additional data and code can be found in the files delivered to Box.
Final Report: this PDF file contains our final report, including the executive summary, methodology, results, and future work.
PipelineResults-Recommended: this folder contains a full run of the recommended pipeline including the word clouds for all of the safety clusters and spreadsheets with heatmaps and pivot tables.
PipelineResults-Full: this folder contains multiple pipeline runs, including data before and after outlier reduction, for in-depth analysis if desired.
Jupyter Notebooks: this folder contains Jupyter notebooks and data inputs for the recommended pipeline and the metrics calculations. There is also a list of the software required to run the notebooks.
Python Code for Extended Experiments: this folder contains the standard pipeline in a loop to be able to run the variations described in the final report in addition to the recommended pipeline. This can be modified to suit current and future needs of the team.