ML21277A098

5 Trey Hathaway - Resource Prediction Using NLP
Person / Time
Issue date: 08/18/2021
From: Hathaway T, NRC/RES/DSA
To: Dennis M


Text

Resource Prediction Using Natural Language Processing
Trey Hathaway
U.S. Nuclear Regulatory Commission, RES/DSA/AAB
August 18, 2021
NRC Data Science and Artificial Intelligence Regulatory Applications Workshops: Current Topics

Natural Language Processing

  • Techniques that allow computers to understand the contents of natural language

- Allows for the extraction of information and insights from documents

- Collection of techniques:

  • Rule-based, statistical, or neural

Goals

  • Apply Natural Language Processing techniques to NRC data and use cases
  • Demonstrate successes

Resource Prediction

  • Challenge: Deviations between the resource estimates for completing a licensing review and the actual hours charged
  • Goal: Create a tool to assist project managers in formulating resource estimates

- Leverage historical data

- Find historically similar reviews

  • Method: Use term frequency-inverse document frequency vectors to represent documents and perform similarity calculations

- Rank documents based on similarity

  • Term Frequency-Inverse Document Frequency (tf-idf)

- Weighting factor for words

- Product of term frequency and inverse document frequency

  • Term Frequency (tf)

- How frequently a word appears in a document

- Importance of the word within the document

  • Inverse Document Frequency (idf)

- How frequently a word appears in a collection of documents
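
The weighting described above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's implementation: the three-document corpus is hypothetical, and the idf formula `log(N / df)` is only one of several common variants (the presentation does not say which was used).

```python
import math
from collections import Counter

# Hypothetical toy corpus; the real tool draws on historical licensing actions.
corpus = [
    "license amendment request for technical specification change",
    "license amendment request for power uprate",
    "relief request for inservice inspection",
]
docs = [text.split() for text in corpus]

def tf(word, doc):
    """Term frequency: fraction of the document's words that are `word`."""
    return Counter(doc)[word] / len(doc)

def idf(word, docs):
    """Inverse document frequency, assumed here to be log(N / df).

    Only defined for words that occur in at least one document.
    """
    containing = sum(1 for doc in docs if word in doc)
    return math.log(len(docs) / containing)

def tfidf(word, doc, docs):
    """tf-idf weight: the product of tf and idf."""
    return tf(word, doc) * idf(word, docs)

# "request" occurs in every document, so it carries no weight.
# "uprate" occurs in one document only, so it is weighted positively.
print(tfidf("request", docs[1], docs))
print(tfidf("uprate", docs[1], docs))
```

Under this variant, a word that appears in every document gets a weight of zero, which is why common boilerplate terms contribute little to the similarity ranking.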

Term Frequency-Inverse Document Frequency (Vector Representation)

[Figure: axis labels wordx, wordz]

  • Represent a document as a vector

- The vector reflects the word usage in the document

- The vector will have 1000s of dimensions

Term Frequency-Inverse Document Frequency (Vector Space Corpus)

[Figure: axis labels wordx, wordz]

  • Represent the collection of documents as vectors

- Create a vocabulary of all words used in the collection
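
As a rough sketch of the vector representation, the snippet below builds a vocabulary from a hypothetical three-document corpus and turns each document into one row of a tf-idf matrix. The corpus, the `to_vector` helper, and the `log(N / df)` idf variant are assumptions made for illustration; a real collection would produce vectors with thousands of dimensions rather than twelve.

```python
import math
from collections import Counter

# Hypothetical toy corpus standing in for the historical document collection.
corpus = [
    "license amendment request for technical specification change",
    "license amendment request for power uprate",
    "relief request for inservice inspection",
]
docs = [text.lower().split() for text in corpus]

# Vocabulary: every distinct word in the collection, in a fixed order.
# Each vocabulary word is one dimension of the vector space.
vocabulary = sorted({word for doc in docs for word in doc})

# Document frequency: number of documents that contain each word.
df = Counter(word for doc in docs for word in set(doc))
idf = {word: math.log(len(docs) / df[word]) for word in vocabulary}

def to_vector(doc):
    """Return the tf-idf vector of a tokenized document over the vocabulary."""
    counts = Counter(doc)
    return [counts[word] / len(doc) * idf[word] for word in vocabulary]

# One row per document, one column per vocabulary word.
matrix = [to_vector(doc) for doc in docs]
print(len(matrix), "documents x", len(vocabulary), "dimensions")
```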

Term Frequency-Inverse Document Frequency (Similarity Calculations)

[Figure: axis labels wordx, wordz]

  • A new document is converted to a vector based on the vocabulary of the collection of documents

- The similarity (cosine of the angle between vectors) is calculated as the dot product between unit-length vectors

- Documents ranked by similarity score
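
The ranking step can be illustrated with a small, self-contained sketch. The review names, texts, and helper functions below are hypothetical, and the idf variant is an assumption. Vectors are scaled to unit length so that the dot product equals the cosine of the angle between them.

```python
import math
from collections import Counter

# Hypothetical historical reviews keyed by made-up identifiers.
corpus = {
    "review-A": "license amendment request for technical specification change",
    "review-B": "license amendment request for power uprate",
    "review-C": "relief request for inservice inspection",
}
docs = {name: text.split() for name, text in corpus.items()}

vocabulary = sorted({w for words in docs.values() for w in words})
df = Counter(w for words in docs.values() for w in set(words))
idf = {w: math.log(len(docs) / df[w]) for w in vocabulary}  # assumed idf variant

def to_unit_vector(words):
    """tf-idf vector over the collection vocabulary, scaled to unit length.

    Words that are not in the vocabulary are ignored. Raw counts are used
    for tf because the unit-length scaling removes any constant factor.
    """
    counts = Counter(words)
    vec = [counts[w] * idf[w] for w in vocabulary]
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec

def rank(new_text):
    """Rank the historical documents by cosine similarity to `new_text`."""
    query = to_unit_vector(new_text.split())
    scores = {
        name: sum(q * d for q, d in zip(query, to_unit_vector(words)))
        for name, words in docs.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

ranking = rank("amendment request for extended power uprate")
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

In this toy example the power uprate review ranks first, and the relief request, which shares only zero-weight words with the new document, scores zero.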

Approach

  • Acquire historical licensing actions and resource requirements
  • Extract text data from PDF files
  • Clean data
  • Create tf-idf matrix
  • Create User Interface

- Extracts text data

- Performs similarity calculations
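
The presentation does not say what the "Clean data" step involves. One plausible reading, shown here purely as an assumption, is lowercasing the extracted text, keeping alphabetic tokens, and dropping common stop words before the tf-idf matrix is built. The `STOP_WORDS` set and the `clean` function are hypothetical.

```python
import re

# Hypothetical, deliberately tiny stop-word list; a real one would be longer.
STOP_WORDS = {"the", "of", "to", "and", "a", "in", "for", "is"}

def clean(text):
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]

sample = "Request for Amendment to the Technical Specifications (TS 3.8.1)"
print(clean(sample))
```

Dropping digits and punctuation is a design choice that suits word-based similarity. It would be the wrong choice for a task where section numbers matter, such as finding regulatory citations.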

Resource Estimation Tool

Current Status and Follow-on Work

  • Preliminary acceptance testing complete

- Historical data provides reasonable estimates of required resources and review durations

  • NRR/EMBARK and NRR/DORL coordinating to finalize visualizations
  • Develop and deploy final User Interface
  • Potential Follow-on Work:

- Search capabilities

- Predict Branch assignments

- Predict Standard Review Plan

- Predict which Regulatory Guide(s) were used for the licensing action

Regulatory Named Entity Recognition

  • Challenge: Title 10 of the Code of Federal Regulations (CFR), and other regulatory documents, reference sections of 10 CFR

- Revisions to 10 CFR could impact other sections

  • Goal: Create a tool to find and extract 10 CFR references from documents
  • Method: Use Named Entity Recognition (NER) to label text as regulations and extract that text

Named Entity Recognition

[Figure panels: spaCy Default Entities; Addition of NRC Specific Language Patterns]

  • Used the Python package spaCy
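
The NRC-specific patterns added to spaCy are not shown in the presentation. As a rough stand-in that does not use spaCy at all, a regular expression can illustrate the kind of rule that labels 10 CFR citations in text. The pattern, the `REGULATION` label, and the function name are assumptions, and the pattern covers only a few citation shapes.

```python
import re

# Assumed citation shapes: "10 CFR 50.59", "10 CFR Part 50", "10 CFR 50.55a(g)(4)".
CFR_PATTERN = re.compile(
    r"\b10\s+CFR\s+(?:Part\s+)?\d+(?:\.\d+[a-z]?)?(?:\([a-z0-9]+\))*",
    re.IGNORECASE,
)

def find_cfr_references(text):
    """Return (start, end, label, matched_text) for each 10 CFR citation."""
    return [
        (m.start(), m.end(), "REGULATION", m.group(0))
        for m in CFR_PATTERN.finditer(text)
    ]

sample = (
    "The licensee evaluated the change under 10 CFR 50.59 and requested "
    "relief per 10 CFR 50.55a(g)(4), consistent with 10 CFR Part 50."
)
references = find_cfr_references(sample)
for start, end, label, matched in references:
    print(label, matched)
```

A statistical or rule-augmented NER pipeline can generalize beyond a fixed pattern like this one, which would miss citations written in other forms.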

10 CFR Reference Identification Tool

Conclusions

  • Natural Language Processing is a powerful tool to leverage unstructured data in historical documents
  • Deploying these tools would increase efficiency of staff by reducing time required for manual searches

- Staff can leverage historical data in informing decisions