5 Trey Hathaway - Resource Prediction Using NLP
	ML21277A098
Person / Time
Issue date:	08/18/2021
From:	Hathaway T; NRC/RES/DSA
To:	Dennis M

Resource Prediction Using Natural Language Processing

Trey Hathaway

U.S. Nuclear Regulatory Commission

RES/DSA/AAB

August 18, 2021

NRC Data Science and Artificial Intelligence Regulatory Applications Workshops: Current Topics

Natural Language Processing

Techniques that allow computers to understand the contents of natural language

- Allows for the extraction of information and insights from documents

- Collection of techniques: rule-based, statistical, or neural

Goals

- Apply Natural Language Processing techniques to NRC data and use cases

- Demonstrate successes

Resource Prediction

Challenge: Deviations between resource estimates to complete a licensing review and the actual hours charged

Goal: Create a tool to assist project managers in formulating resource estimates

- Leverage historical data

- Find historically similar reviews

Method: Use term frequency-inverse document frequency vectors to represent documents and perform similarity calculations

- Rank documents based on similarity

Term Frequency-Inverse Document Frequency (tf-idf)

- Weighting factor for words

- Product of term frequency and inverse document frequency

Term Frequency (tf)

- How frequently a word appears in a document

- Indicates the importance of the word to the document

Inverse Document Frequency (idf)

- How frequently a word appears in a collection of documents; words common across the collection are weighted less
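The weighting described above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's implementation: the function names, the toy corpus, and the choice of log(N / df) as the idf formula are all assumptions (several idf variants are in common use).

```python
import math
from collections import Counter

def term_frequency(tokens):
    """Relative frequency of each word within one document."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count / total for word, count in counts.items()}

def inverse_document_frequency(corpus):
    """log(N / df) per word, where df is the number of documents containing it."""
    n_docs = len(corpus)
    doc_freq = Counter(word for tokens in corpus for word in set(tokens))
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}

def tf_idf(tokens, idf):
    """Product of term frequency and inverse document frequency."""
    tf = term_frequency(tokens)
    return {word: tf[word] * idf.get(word, 0.0) for word in tf}

# Hypothetical toy corpus; real inputs would be text from licensing documents.
corpus = [
    "license amendment request for reactor".split(),
    "license renewal request".split(),
    "reactor coolant pump inspection".split(),
]
idf = inverse_document_frequency(corpus)
weights = tf_idf(corpus[0], idf)
```

In this toy corpus "amendment" occurs in only one document while "license" occurs in two, so "amendment" receives the larger weight in the first document even though both words appear there once.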

Term Frequency-Inverse Document Frequency (Vector Representation)

[Figure: axes wordx and wordz]

Represent a document as a vector

- The vector reflects the word usage in the document

- The vector will have 1000s of dimensions

Term Frequency-Inverse Document Frequency (Vector Space Corpus)

[Figure: axes wordx and wordz]

Represent the collection of documents as vectors

- Create a vocabulary of all words used in the collection
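A minimal sketch of the vocabulary step, assuming documents are already split into words. Raw word counts stand in for tf-idf weights to keep the example short, and the function names and sample text are hypothetical.

```python
from collections import Counter

def build_vocabulary(corpus):
    """Sorted list of every distinct word used anywhere in the collection."""
    return sorted({word for tokens in corpus for word in tokens})

def to_count_vector(tokens, vocabulary):
    """One entry per vocabulary word; words outside the vocabulary are ignored."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

# Hypothetical two-document collection.
corpus = [
    "pump seal replacement".split(),
    "pump inspection".split(),
]
vocabulary = build_vocabulary(corpus)
matrix = [to_count_vector(tokens, vocabulary) for tokens in corpus]

# A new document is mapped onto the same vocabulary; "leak" is unseen and dropped.
new_vector = to_count_vector("pump seal leak".split(), vocabulary)
```

Each row of `matrix` has one dimension per vocabulary word, which is why a realistic collection produces vectors with thousands of dimensions.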

Term Frequency-Inverse Document Frequency (Similarity Calculations)

[Figure: axes wordx and wordz]

A new document is converted to a vector based on the vocabulary of the collection of documents

- The similarity (cosine of the angle between vectors) is calculated as the dot product between the length-normalized vectors

- Documents ranked by similarity score
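The similarity and ranking steps can be sketched as follows. This is an illustration under stated assumptions: the vectors are short made-up examples rather than real tf-idf vectors, and the review names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Dot product of two vectors divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector shares no words with anything
    return dot / (norm_a * norm_b)

def rank_by_similarity(query, documents):
    """Return (name, score) pairs sorted from most to least similar."""
    scores = [(name, cosine_similarity(query, vec)) for name, vec in documents.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Hypothetical historical reviews, already converted to vectors.
documents = {
    "review_a": [1.0, 0.0, 2.0],
    "review_b": [0.0, 3.0, 0.0],
    "review_c": [1.0, 1.0, 2.0],
}
query = [1.0, 0.0, 2.0]
ranking = rank_by_similarity(query, documents)
```

Here `review_a` ranks first because it points in the same direction as the query, and `review_b` ranks last because it shares no words with it.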

Approach

Acquire historical licensing actions and resource requirements

Extract text data from pdf files

Clean data


Create tf-idf matrix

Create User Interface

- Extracts text data

- Performs similarity calculations
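The slides do not describe what the "Clean data" step involves. The sketch below shows one common form of text cleaning for PDF-extracted text, purely as an assumption: the specific rules, the stop-word list, and the function names are hypothetical.

```python
import re

def clean_text(raw):
    """Rejoin hyphenated line breaks, lowercase, drop non-letters, collapse whitespace."""
    text = re.sub(r"-\s*\n\s*", "", raw)    # rejoin words split across lines
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)   # replace digits and punctuation with spaces
    return re.sub(r"\s+", " ", text).strip()

def tokenize(raw, stop_words=frozenset({"the", "of", "a", "and", "to", "for"})):
    """Split cleaned text into words, skipping a small illustrative stop-word list."""
    return [word for word in clean_text(raw).split() if word not in stop_words]

# Hypothetical fragment of text as it might come out of a PDF extractor.
sample = "Request for License Amend-\nment, Page 3 of 12"
tokens = tokenize(sample)
```

Removing digits is a design choice that suits word-based similarity but would discard citations such as section numbers, so a real pipeline might keep them.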

Resource Estimation Tool

Current Status

Preliminary acceptance testing complete

- Historical data provides reasonable estimates of required resources and review durations

NRR/EMBARK and NRR/DORL coordinating to finalize visualizations
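The slides do not say how the hours of similar historical reviews are turned into an estimate. One plausible approach, shown here only as an assumption, is to summarize the hours charged for the most similar reviews. All names, scores, and hour values below are made up for illustration.

```python
from statistics import median

def estimate_hours(ranked_reviews, hours_by_review, top_k=3):
    """Median and range of hours charged for the top_k most similar reviews."""
    top = [name for name, _score in ranked_reviews[:top_k]]
    hours = [hours_by_review[name] for name in top]
    return {
        "median": median(hours),
        "low": min(hours),
        "high": max(hours),
        "based_on": top,
    }

# Hypothetical similarity ranking (most similar first) and historical hours.
ranked_reviews = [("review_a", 0.91), ("review_c", 0.84), ("review_b", 0.40), ("review_d", 0.12)]
hours_by_review = {"review_a": 120, "review_b": 200, "review_c": 150, "review_d": 900}
estimate = estimate_hours(ranked_reviews, hours_by_review)
```

Reporting a range alongside the median gives a project manager a sense of how much the similar reviews disagree, rather than a single number.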

Develop and deploy final User Interface and

Potential Follow-on Work:

- Search capabilities

- Predict Branch assignments

- Predict Standard Review Plan

- Predict which Regulatory Guide(s) were used for the licensing action

Regulatory Named Entity Recognition

Challenge: Title 10 of the Code of Federal Regulations (CFR), and other regulatory documents, reference sections of 10 CFR

- Revisions to 10 CFR could impact other sections

Goal: Create a tool to find and extract 10 CFR references from documents

Method: Use Named Entity Recognition (NER) to label text as regulations and extract that text

Named Entity Recognition

[Figures: spaCy Default Entities; Addition of NRC Specific Language Patterns]

Used the Python package spaCy
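To illustrate what a language pattern for 10 CFR citations might look like, the sketch below uses a plain regular expression from the Python standard library. This is a stand-in technique, not spaCy NER and not the patterns used in the tool; the pattern, function name, and sample sentence are assumptions.

```python
import re

# Matches citations such as "10 CFR 50.59", "10 CFR Part 50", or "10 CFR 50.55a(g)(4)".
CFR_PATTERN = re.compile(
    r"\b10\s+CFR\s+(?:Part\s+)?\d+(?:\.\d+[a-z]?)?(?:\([A-Za-z0-9]+\))*"
)

def extract_cfr_references(text):
    """Return (matched text, start, end) for each 10 CFR citation found."""
    return [(m.group(0), m.start(), m.end()) for m in CFR_PATTERN.finditer(text)]

# Hypothetical sentence containing three citation styles.
sample = (
    "The change was evaluated under 10 CFR 50.59 and the inservice "
    "inspection requirements of 10 CFR 50.55a(g)(4), consistent with 10 CFR Part 50."
)
references = extract_cfr_references(sample)
```

The character offsets returned with each match play the same role as entity spans in an NER pipeline: they let the surrounding text be labeled or highlighted.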

10 CFR Reference Identification Tool

Conclusions

Natural Language Processing is a powerful tool to leverage unstructured data in historical documents

Deploying these tools would increase efficiency of staff by reducing time required for manual searches

- Staff can leverage historical data in informing decisions
