ML21277A098

5 Trey Hathaway - Resource Prediction Using NLP
Person / Time
Issue date: 08/18/2021
From: Hathaway T
NRC/RES/DSA
To:
Dennis M


Resource Prediction Using Natural Language Processing

Trey Hathaway
U.S. Nuclear Regulatory Commission
RES/DSA/AAB
August 18, 2021

NRC Data Science and Artificial Intelligence Regulatory Applications Workshops: Current Topics

Natural Language Processing

Techniques that allow computers to understand the contents of natural language

- Allows for the extraction of information and insights from documents

- Collection of techniques:

  • Rule-based, statistical, or neural

Use Cases

Goals

- Apply Natural Language Processing techniques to NRC data and use cases

- Demonstrate successes

Resource Prediction

Challenge: Deviations between resource estimates to complete a licensing review and the actual hours charged

Goal: Create tool to assist project managers in formulating resource estimates

- Leverage historical data

- Find historically similar reviews

Method: Use term frequency-inverse document frequency vectors to represent documents and perform similarity calculations

- Rank documents based on similarity

Resource Prediction

  • Term Frequency-Inverse Document Frequency (tf-idf)

- Weighting factor for words

- Product of term frequency and inverse document frequency

  • Term Frequency (tf)

- How frequently a word appears in a document

- Importance of word

  • Inverse document Frequency (idf)

- Inverse measure of how frequently a word appears across a collection of documents; words common to many documents receive less weight
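
The weighting described above can be sketched in a few lines of Python. The slides do not say which tf-idf variant the tool uses, so this sketch assumes one common form: tf is a word's count divided by the document length, and idf is log(N / df), where N is the number of documents and df is the number of documents containing the word. The function names and sample sentences are hypothetical.

```python
import math
from collections import Counter

def tokenize(text):
    # Lowercase and split on whitespace; real cleaning would be more thorough.
    return text.lower().split()

def tf_idf(docs):
    """Return one {word: weight} dict per document (assumed variant: tf * log(N / df))."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each word.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            word: (count / total) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return weights

# Hypothetical sample text, for illustration only.
docs = [
    "license amendment request for spent fuel pool",
    "license amendment request for emergency diesel generator",
    "relief request for pump testing",
]
weights = tf_idf(docs)
```

In this example "request" and "for" appear in every document, so their idf is log(1) = 0 and they carry no weight, while "spent" appears in only one document and is weighted most heavily there.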

Term Frequency-Inverse Document Frequency (Vector Representation)

Represent a document as a vector

- The vector reflects the word usage in the document

- The vector will have 1000s of dimensions

[Figure: axes wordx, wordz]

Term Frequency-Inverse Document Frequency (Vector Space Corpus)

Represent the collection of documents as vectors

- Create a vocabulary of all words used in the collection

[Figure: axes wordx, wordz]

Term Frequency-Inverse Document Frequency (Similarity Calculations)

A new document is converted to a vector based on the vocabulary of the collection of documents

- The similarity (cosine of the angle between vectors) is calculated as the dot product of the vectors divided by the product of their lengths

- Documents ranked by similarity score

[Figure: axes wordx, wordz]
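
The similarity calculation and ranking can be illustrated with a minimal sketch. It assumes the usual cosine similarity (the dot product divided by the product of the vector lengths) and stores each vector sparsely as a {word: weight} dictionary. The review names and weights are made up for illustration and are not taken from the tool.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as {word: weight}."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # An empty vector is treated as similar to nothing.
    return dot / (norm_a * norm_b)

def rank_by_similarity(query, collection):
    """Return (doc_id, score) pairs, most similar first."""
    scores = [(doc_id, cosine_similarity(query, vec))
              for doc_id, vec in collection.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Hypothetical tf-idf weights, for illustration only.
collection = {
    "review_a": {"fuel": 0.8, "pool": 0.6},
    "review_b": {"diesel": 0.7, "generator": 0.7},
    "review_c": {"fuel": 0.5, "diesel": 0.5},
}
query = {"fuel": 1.0, "pool": 1.0}
ranking = rank_by_similarity(query, collection)
```

Here review_a shares both query words and ranks first, review_c shares one word and ranks second, and review_b shares none and scores zero.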

Resource Prediction Approach

  • Acquire historical licensing actions and resource requirements
  • Extract text data from PDF files
  • Clean data
  • Create tf-idf matrix
  • Create User Interface

- Extracts text data

- Performs similarity calculations
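
Once similar historical reviews have been ranked, their recorded hours can be summarized to inform an estimate. The slides do not describe how the tool presents or aggregates those hours, so the sketch below is only one plausible approach: report the minimum, median, and maximum hours charged on the top few matches. The function name, the `top_k` parameter, and all data are hypothetical.

```python
from statistics import median

def summarize_similar_reviews(ranked, hours_by_review, top_k=3):
    """Summarize hours charged on the top_k most similar historical reviews."""
    top = ranked[:top_k]
    if not top:
        raise ValueError("no historical reviews to summarize")
    hours = [hours_by_review[doc_id] for doc_id, _score in top]
    return {
        "reviews": [doc_id for doc_id, _score in top],
        "min_hours": min(hours),
        "median_hours": median(hours),
        "max_hours": max(hours),
    }

# Hypothetical similarity ranking (most similar first) and hours charged.
ranked = [("review_a", 0.91), ("review_c", 0.74), ("review_d", 0.52), ("review_b", 0.10)]
hours_by_review = {"review_a": 120, "review_b": 400, "review_c": 150, "review_d": 90}
summary = summarize_similar_reviews(ranked, hours_by_review)
```

Reporting a range rather than a single number leaves the judgment with the project manager, which matches the stated goal of assisting, not replacing, the estimate.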

Resource Estimation Tool

Current Status and Follow-on Work

Preliminary acceptance testing complete

- Historical data provides reasonable estimates of required resources and review durations

NRR/EMBARK and NRR/DORL coordinating to finalize visualizations

Develop and deploy final User Interface

Potential Follow-on Work:

- Search capabilities

- Predict Branch assignments

- Predict Standard Review Plan

- Predict which Regulatory Guide(s) were used for the licensing action

Regulatory Named Entity Recognition

Challenge: Title 10 of the Code of Federal Regulations (CFR), and other regulatory documents, reference sections of 10 CFR

- Revisions to 10 CFR could impact other sections

Goal: Create a tool to find and extract 10 CFR references from documents

Method: Use Named Entity Recognition (NER) to label text as regulations and extract that text

Named Entity Recognition

- Used Python package spaCy

- spaCy default entities

- Addition of NRC-specific language patterns
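
The idea of labeling regulation references with a language pattern can be shown without spaCy. The sketch below is a plain regular-expression stand-in, not the tool's implementation: the pattern, the `REGULATION` label, and the sample sentence are assumptions made for illustration, and the pattern covers only simple forms such as "10 CFR 50.12" or "10 CFR Part 50". spaCy offers rule-based entity patterns that could play a similar role within its pipeline.

```python
import re

# Assumed pattern: "10 CFR", an optional "Part", a part number, an optional
# section such as ".55a", and any trailing paragraph markers such as "(z)(1)".
CFR_PATTERN = re.compile(
    r"\b10\s+CFR\s+(?:Part\s+)?\d+(?:\.\d+[a-z]?)?(?:\([A-Za-z0-9]+\))*"
)

def find_cfr_references(text):
    """Return (start, end, matched_text, label) for each 10 CFR reference found."""
    return [
        (m.start(), m.end(), m.group(0), "REGULATION")
        for m in CFR_PATTERN.finditer(text)
    ]

# Hypothetical sentence, for illustration only.
text = (
    "The licensee requested an exemption under 10 CFR 50.12 from "
    "10 CFR Part 50, Appendix R, and cited 10 CFR 50.55a(z)(1)."
)
references = [ref[2] for ref in find_cfr_references(text)]
```

Returning character offsets along with the label mirrors how NER output is usually represented, so the matches can be highlighted in the source document or collected into a cross-reference list.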

10 CFR Reference Identification Tool

Conclusions

Natural Language Processing is a powerful tool to leverage unstructured data in historical documents

Deploying these tools would increase efficiency of staff by reducing time required for manual searches

- Staff can leverage historical data in informing decisions