ML21277A098
| ML21277A098 | |
| Person / Time | |
|---|---|
| Issue date: | 08/18/2021 |
| From: | Hathaway T NRC/RES/DSA |
| To: | Dennis M |
Text
Resource Prediction Using Natural Language Processing
Trey Hathaway
U.S. Nuclear Regulatory Commission, RES/DSA/AAB
August 18, 2021
NRC Data Science and Artificial Intelligence Regulatory Applications Workshops: Current Topics
Natural Language Processing
- Techniques that allow computers to understand the contents of natural language
- Allows for the extraction of information and insights from documents
- Collection of techniques:
  - Rule-based, statistical, or neural
Use Cases
Goals
- Apply Natural Language Processing techniques to NRC data and use cases
- Demonstrate successes
Resource Prediction
Challenge: Deviations between resource estimates to complete a licensing review and the actual hours charged
Goal: Create a tool to assist project managers in formulating resource estimates
- Leverage historical data
- Find historically similar reviews
Method: Use term frequency-inverse document frequency vectors to represent documents and perform similarity calculations
- Rank documents based on similarity
Resource Prediction
- Term Frequency-Inverse Document Frequency (tf-idf)
  - Weighting factor for words
  - Product of term frequency and inverse document frequency
- Term Frequency (tf)
  - How frequently a word appears in a document
  - Importance of the word within that document
- Inverse Document Frequency (idf)
  - Reflects how frequently a word appears in a collection of documents; words common across the collection are weighted down
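The weighting described above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the tool's implementation: the function name `tf_idf`, the sample documents, and the unsmoothed form `tf * log(N / df)` are all assumptions, and library implementations often use smoothed variants.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {word: weight} dict per tokenized document (illustrative only)."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word.
    df = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            # tf = share of the document's words; idf = log(N / df).
            word: (count / len(doc)) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return weights

# Made-up sample documents, already split into words.
docs = [
    "license amendment request for reactor".split(),
    "license renewal application".split(),
    "reactor coolant pump inspection".split(),
]
weights = tf_idf(docs)
```

In this sample, "amendment" appears in only one document and so receives a higher weight than "license", which appears in two. A word that appears in every document would receive a weight of zero under this unsmoothed form.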
Term Frequency-Inverse Document Frequency (Vector Representation)
Represent a document as a vector
- The vector reflects the word usage in the document
- The vector will have 1000s of dimensions
[Figure: axes labeled wordx and wordz]
Term Frequency-Inverse Document Frequency (Vector Space Corpus)
Represent the collection of documents as vectors
- Create a vocabulary of all words used in the collection
[Figure: axes labeled wordx and wordz]
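As a rough sketch of the vector representation, the snippet below builds a vocabulary from a small collection and maps each document onto it. The helper names `build_vocabulary` and `to_vector` are hypothetical, and raw word counts stand in for tf-idf weights to keep the example short.

```python
def build_vocabulary(docs):
    """Map every distinct word in the collection to a vector index."""
    words = sorted({word for doc in docs for word in doc})
    return {word: index for index, word in enumerate(words)}

def to_vector(doc, vocabulary):
    """Count vector for one document; words outside the vocabulary are ignored."""
    vector = [0] * len(vocabulary)
    for word in doc:
        if word in vocabulary:
            vector[vocabulary[word]] += 1
    return vector

# Made-up tokenized documents.
docs = [["license", "amendment", "request"], ["license", "renewal"]]
vocabulary = build_vocabulary(docs)
vectors = [to_vector(doc, vocabulary) for doc in docs]
```

Each document becomes a vector with one dimension per vocabulary word, which is why a realistic collection leads to vectors with thousands of dimensions.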
Term Frequency-Inverse Document Frequency (Similarity Calculations)
A new document is converted to a vector based on the vocabulary of the collection of documents
- The similarity (cosine of the angle between vectors) is calculated as the dot product between the length-normalized vectors
- Documents ranked by similarity score
[Figure: axes labeled wordx and wordz]
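The similarity calculation and ranking might look like the following sketch. The vectors, review names, and function names are made up for illustration; the only technique assumed is cosine similarity, that is, the dot product divided by the product of the vector lengths.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # An all-zero vector shares no words with anything.
    return dot / (norm_a * norm_b)

def rank_by_similarity(query, collection):
    """Return (name, score) pairs sorted from most to least similar."""
    scores = [(name, cosine_similarity(query, vec)) for name, vec in collection.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)

# Hypothetical weighted vectors for three historical reviews.
collection = {
    "review_a": [1.0, 0.0, 2.0],
    "review_b": [0.0, 3.0, 0.0],
    "review_c": [1.0, 1.0, 1.0],
}
query = [2.0, 0.0, 4.0]  # New document, vectorized with the same vocabulary.
ranking = rank_by_similarity(query, collection)
```

Here `review_a` points in the same direction as the query and ranks first, while `review_b` shares no words with it and ranks last. Because the score depends on direction rather than length, a long document and a short one with the same word proportions score as identical.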
Resource Prediction Approach
- Acquire historical licensing actions and resource requirements
- Extract text data from pdf files
- Clean data
- Create tf-idf matrix
- Create User Interface
  - Extracts text data
  - Performs similarity calculations
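Two of the steps above, cleaning the extracted text and turning the most similar historical reviews into an estimate, can be sketched as follows. Everything here is an assumption made for illustration: the stop-word list, the function names, the choice of the top three matches, and the hours, which are invented numbers rather than NRC data.

```python
import re
from statistics import mean

STOP_WORDS = {"the", "of", "and", "to", "a", "in", "for"}  # Illustrative subset only.

def clean_text(raw):
    """Lowercase, keep alphabetic tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", raw.lower())
    return [token for token in tokens if token not in STOP_WORDS]

def summarize_hours(ranked, hours, top_k=3):
    """Summarize hours charged for the top_k most similar historical reviews."""
    top = [hours[name] for name, _score in ranked[:top_k]]
    return {"min": min(top), "mean": mean(top), "max": max(top)}

tokens = clean_text("Request for Amendment of the Technical Specifications, Rev. 2")

# Hypothetical similarity ranking and hours charged (invented values).
ranked = [("review_a", 0.91), ("review_b", 0.75), ("review_c", 0.40), ("review_d", 0.10)]
hours = {"review_a": 120, "review_b": 200, "review_c": 160, "review_d": 900}
summary = summarize_hours(ranked, hours)
```

Reporting a range alongside the mean gives a project manager a sense of how much the similar reviews varied, rather than a single number.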
Resource Estimation Tool
Current Status and Follow-on Work
- Preliminary acceptance testing complete
- Historical data provides reasonable estimates of required resources and review durations
- NRR/EMBARK and NRR/DORL coordinating to finalize visualizations
- Develop and deploy final User Interface
Potential Follow-on Work:
- Search capabilities
- Predict Branch assignments
- Predict Standard Review Plan
- Predict which Regulatory Guide(s) were used for the licensing action
Regulatory Named Entity Recognition
Challenge: Title 10 of the Code of Federal Regulations (CFR), and other regulatory documents, reference sections of 10 CFR
- Revisions to 10 CFR could impact other sections
Goal: Create a tool to find and extract 10 CFR references from documents
Method: Use Named Entity Recognition (NER) to label text as regulations and extract that text
Named Entity Recognition
- Used Python package spaCy
- spaCy default entities
- Addition of NRC-specific language patterns
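To give a feel for what an NRC-specific language pattern could look like, the sketch below labels 10 CFR citations using Python's standard `re` module. This is a stand-in chosen so the example runs without extra dependencies; it is not spaCy, and the pattern, the `REGULATION` label, and the function name are assumptions rather than the tool's actual rules.

```python
import re

# Hypothetical pattern for citations such as "10 CFR 50.59", "10 CFR Part 21",
# or "10 CFR 50.55a(g)(4)"; an assumption, not the tool's actual rule set.
CFR_PATTERN = re.compile(
    r"\b10\s+CFR\s+(?:Part\s+)?\d+(?:\.\d+[a-z]?)?(?:\([A-Za-z0-9]+\))*"
)

def find_cfr_references(text):
    """Return (start, end, label, matched_text) for each 10 CFR citation found."""
    return [
        (match.start(), match.end(), "REGULATION", match.group())
        for match in CFR_PATTERN.finditer(text)
    ]

sample = (
    "The change was evaluated under 10 CFR 50.59 and reported per "
    "10 CFR Part 21, consistent with 10 CFR 50.55a(g)(4)."
)
references = find_cfr_references(sample)
```

Each result carries character offsets and a label, which mirrors how an NER component marks a span of text as an entity. A rule-based pattern like this only finds citations written in the forms it anticipates, so it would miss variants such as spelled-out titles.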
10 CFR Reference Identification Tool
Conclusions
- Natural Language Processing is a powerful tool to leverage unstructured data in historical documents
- Deploying these tools would increase efficiency of staff by reducing time required for manual searches
- Staff can leverage historical data in informing decisions