ML21277A098
ML21277A098 | |
Person / Time | |
---|---|
Issue date: | 08/18/2021 |
From: | Hathaway T NRC/RES/DSA |
To: | Dennis M |
Text
Resource Prediction Using Natural Language Processing
Trey Hathaway
U.S. Nuclear Regulatory Commission
RES/DSA/AAB
August 18, 2021
NRC Data Science and Artificial Intelligence Regulatory Applications Workshops: Current Topics
Natural Language Processing
- Techniques that allow computers to understand the contents of natural language
- Allow for the extraction of information and insights from documents
- Collection of techniques:
- Rule-based, statistical, or neural
Use Cases Goals
- Apply Natural Language Processing techniques to NRC data and use cases
- Demonstrate successes
Resource Prediction
- Challenge: Deviations between resource estimates to complete a licensing review and the actual hours charged
- Goal: Create tool to assist project managers in formulating resource estimates
- Leverage historical data
- Find historically similar reviews
- Method: Use term frequency-inverse document frequency vectors to represent documents and perform similarity calculations
- Rank documents based on similarity
- Term Frequency-Inverse Document Frequency (tf-idf)
- Weighting factor for words
- Product of term frequency and inverse document frequency
- Term Frequency (tf)
- How frequently a word appears in a document
- Indicates the importance of the word within that document
- Inverse Document Frequency (idf)
- Inverse of how frequently a word appears across a collection of documents; rarer words receive higher weight
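The weighting described above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's implementation: it assumes one common variant (tf as a word's share of the document, idf as the log of total documents over documents containing the word), and the function name `tf_idf` and the toy corpus are invented for the example.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {word: weight} dict per document in `docs` (non-empty token lists)."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word.
    df = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            # tf: share of the document taken up by the word.
            # idf: log of (number of documents / documents containing the word).
            word: (count / len(doc)) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return weights

# Toy corpus, invented for illustration only.
docs = [
    "license amendment request for reactor coolant pump".split(),
    "license amendment request for spent fuel pool".split(),
    "relief request for reactor vessel inspection".split(),
]
weights = tf_idf(docs)
```

Under this variant, a word that appears in every document (here "for") gets a weight of zero, while a word unique to one document (here "pump") gets the largest weight in that document.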
Term Frequency-Inverse Document Frequency (Vector Representation)
[Figure: axes labeled wordx and wordz]
- Represent a document as a vector
- The vector reflects the word usage in the document
- The vector will have thousands of dimensions, one per word in the vocabulary
Term Frequency-Inverse Document Frequency (Vector Space Corpus)
[Figure: axes labeled wordx and wordz]
- Represent the collection of documents as vectors
- Create a vocabulary of all words used in the collection
Term Frequency-Inverse Document Frequency (Similarity Calculations)
[Figure: axes labeled wordx and wordz]
- A new document is converted to a vector based on the vocabulary of the collection of documents
- The similarity (cosine of the angle between vectors) is calculated as the dot product between the length-normalized vectors
- Documents ranked by similarity score
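The similarity and ranking steps can be sketched as follows. This is an illustrative example under stated assumptions: vectors are stored as sparse `{word: weight}` dictionaries, and the function names, review names, and weights are all hypothetical rather than taken from the tool.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as {word: weight}."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero vector has no direction to compare
    return dot / (norm_a * norm_b)

def rank_by_similarity(query, corpus):
    """Return (name, score) pairs sorted from most to least similar."""
    scores = [(name, cosine_similarity(query, vec)) for name, vec in corpus.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Hypothetical tf-idf vectors; names and weights are invented.
corpus = {
    "review_a": {"coolant": 0.5, "pump": 0.5},
    "review_b": {"fuel": 0.7, "pool": 0.3},
    "review_c": {"pump": 0.2, "valve": 0.9},
}
new_document = {"coolant": 0.4, "pump": 0.6}
ranking = rank_by_similarity(new_document, corpus)
```

Dividing by the vector lengths makes the score depend on word usage rather than document length, so a short and a long review on the same topic can still rank as similar.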
Approach
- Acquire historical licensing actions and resource requirements
- Extract text data from pdf files
- Clean data
- Create tf-idf matrix
- Create User Interface
- Extracts text data
- Performs similarity calculations
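The slides do not say which cleaning steps were applied. As a rough sketch of what the "Clean data" step might involve for text extracted from pdf files, the example below rejoins words hyphenated across line breaks, lowercases the text, keeps alphabetic words, and drops a few stop words. The function name `clean_text`, the stop-word list, and the sample string are assumptions made for illustration.

```python
import re

# Small illustrative stop-word list; a real list would be much longer.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "the", "to"}

def clean_text(raw):
    """Lowercase the text, keep alphabetic words, and drop stop words."""
    text = raw.replace("-\n", "")  # rejoin words hyphenated across line breaks
    tokens = re.findall(r"[a-z]+", text.lower())
    return [tok for tok in tokens if tok not in STOP_WORDS]

# Invented sample resembling text extracted from a pdf page.
sample = "Request for Amend-\nment to the Technical Specifications, Page 3 of 12"
tokens = clean_text(sample)
```

The resulting token lists are the kind of input a tf-idf matrix is built from.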
Resource Estimation Tool
Current Status and Follow-on Work
- Preliminary acceptance testing complete
- Historical data provides reasonable estimates of required resources and review durations
- NRR/EMBARK and NRR/DORL coordinating to finalize visualizations
- Develop and deploy final User Interface
- Potential Follow-on Work:
- Search capabilities
- Predict Branch assignments
- Predict Standard Review Plan
- Predict which Regulatory Guide(s) were used for the licensing action
Regulatory Named Entity Recognition
- Challenge: Title 10 of the Code of Federal Regulations (CFR), and other regulatory documents, reference sections of 10 CFR
- Revisions to 10 CFR could impact other sections
- Goal: Create a tool to find and extract 10 CFR references from documents
- Method: Use Named Entity Recognition (NER) to label text as regulations and extract that text
Named Entity Recognition
- Used Python package spaCy
- spaCy default entities
- Addition of NRC-specific language patterns
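The idea of labeling regulation references with language patterns can be illustrated with a plain regular expression. This is a simplified stand-in using only the Python standard library, not the spaCy-based tool described in the presentation; the pattern, the function name `find_cfr_references`, and the sample sentence are assumptions made for the example.

```python
import re

# Hypothetical pattern: "10 CFR" followed by "Part N" or a section such as
# 50.59 or 50.55a, with optional paragraph markers such as (g)(4).
CFR_PATTERN = re.compile(
    r"\b10\s+CFR\s+(?:Part\s+\d+|\d+\.\d+[a-z]?(?:\([0-9A-Za-z]+\))*)"
)

def find_cfr_references(text):
    """Return (start, end, matched_text) for each 10 CFR reference found."""
    return [(m.start(), m.end(), m.group(0)) for m in CFR_PATTERN.finditer(text)]

# Invented sample sentence.
sample = (
    "The change was evaluated under 10 CFR 50.59 and meets "
    "10 CFR 50.55a(g)(4); see also 10 CFR Part 52."
)
references = [span[2] for span in find_cfr_references(sample)]
```

Returning character offsets alongside the matched text mirrors how an entity recognizer reports labeled spans, so matches could be highlighted in the source document.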
10 CFR Reference Identification Tool
Conclusions
- Natural Language Processing is a powerful tool to leverage unstructured data in historical documents
- Deploying these tools would increase efficiency of staff by reducing time required for manual searches
- Staff can leverage historical data in informing decisions