
Trey Hathaway - Resource Prediction Using NLP
ML21277A098
Person / Time
Issue date: 08/18/2021
From: Hathaway T (NRC/RES/DSA)
To: Dennis M
Download: ML21277A098 (18)


Text

Resource Prediction Using Natural Language Processing
Trey Hathaway
U.S. Nuclear Regulatory Commission, RES/DSA/AAB
August 18, 2021
NRC Data Science and Artificial Intelligence Regulatory Applications Workshops: Current Topics

Natural Language Processing

  • Techniques that allow computers to understand the contents of natural language

- Allows for the extraction of information and insights from documents

- A collection of techniques: rule-based, statistical, or neural

Goals

  • Apply Natural Language Processing techniques to NRC data and use cases
  • Demonstrate successes

Resource Prediction

  • Challenge: Deviations between the resource estimates to complete a licensing review and the actual hours charged
  • Goal: Create a tool to assist project managers in formulating resource estimates

- Leverage historical data

- Find historically similar reviews

  • Method: Use term frequency-inverse document frequency vectors to represent documents and perform similarity calculations

- Rank documents based on similarity

  • Term Frequency-Inverse Document Frequency (tf-idf)

- Weighting factor for words

- Product of term frequency and inverse document frequency

  • Term Frequency (tf)

- How frequently a word appears in a document

- Reflects the importance of a word within that document

  • Inverse Document Frequency (idf)

- Inversely related to how frequently a word appears across the collection of documents, so rare words receive higher weight (see the sketch after this list)
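Putting the two definitions together, a word's weight is tf-idf(word, document) = tf(word, document) x idf(word), where idf(word) = log(N / df(word)) for a collection of N documents. The following is a minimal Python sketch of that weighting, not the staff's actual implementation; the toy corpus is invented for illustration.

    import math
    from collections import Counter

    def tf_idf(corpus):
        """Compute tf-idf weights for each tokenized document in a corpus."""
        n_docs = len(corpus)
        # Document frequency: how many documents contain each word
        df = Counter(word for doc in corpus for word in set(doc))
        weights = []
        for doc in corpus:
            counts = Counter(doc)
            weights.append({
                word: (count / len(doc))          # term frequency (tf)
                * math.log(n_docs / df[word])     # inverse document frequency (idf)
                for word, count in counts.items()
            })
        return weights

    corpus = [
        "license amendment for emergency diesel generator".split(),
        "license amendment for spent fuel pool".split(),
        "relief request for inservice inspection".split(),
    ]
    for doc_weights in tf_idf(corpus):
        print(doc_weights)

Words common to every document (here "for") receive a weight of zero, which is exactly the down-weighting of uninformative words described above.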

Term Frequency-Inverse Document Frequency (Vector Representation)

  • Represent a document as a vector

- The vector reflects the word usage in the document

- The vector will have 1000s of dimensions

Term Frequency-Inverse Document Frequency (Vector Space Corpus)

  • Represent the collection of documents as vectors

- Create a vocabulary of all words used in the collection

Term Frequency-Inverse Document Frequency (Similarity Calculations)

  • A new document is converted to a vector based on the vocabulary of the collection of documents

- The similarity (the cosine of the angle between vectors) is calculated as the normalized dot product between vectors

- Documents are ranked by similarity score, as in the sketch below
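A minimal sketch of the ranking step, assuming the scikit-learn library (the presentation does not name the library used); the historical review titles are invented for illustration.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    historical_reviews = [
        "license amendment for emergency diesel generator testing",
        "license amendment for spent fuel pool criticality",
        "relief request for inservice inspection program",
    ]

    # Build the vocabulary and tf-idf matrix from the historical collection
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(historical_reviews)

    # Convert a new licensing action to a vector over the same vocabulary
    new_vector = vectorizer.transform(
        ["license amendment for diesel generator surveillance"]
    )

    # Cosine similarity is the normalized dot product; rank reviews by it
    scores = cosine_similarity(new_vector, tfidf_matrix).ravel()
    for idx in scores.argsort()[::-1]:
        print(f"{scores[idx]:.3f}  {historical_reviews[idx]}")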

Approach

  • Acquire historical licensing actions and resource requirements
  • Extract text data from PDF files
  • Clean data
  • Create tf-idf matrix
  • Create User Interface

- Extracts text data

- Performs similarity calculations (a pipeline sketch follows this list)
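A sketch of how these steps might chain together, assuming the pypdf package for text extraction and scikit-learn for the matrix; the presentation names neither library, and the folder name is hypothetical.

    import re
    from pathlib import Path

    from pypdf import PdfReader
    from sklearn.feature_extraction.text import TfidfVectorizer

    def extract_text(pdf_path):
        """Pull the raw text out of every page of a PDF file."""
        reader = PdfReader(pdf_path)
        return " ".join(page.extract_text() or "" for page in reader.pages)

    def clean(text):
        """Lowercase and keep only letters and spaces."""
        return re.sub(r"[^a-z\s]", " ", text.lower())

    # Hypothetical folder of historical licensing-action PDFs
    pdf_files = sorted(Path("historical_reviews").glob("*.pdf"))
    documents = [clean(extract_text(p)) for p in pdf_files]

    # tf-idf matrix over the cleaned historical collection
    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf_matrix = vectorizer.fit_transform(documents)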

Resource Estimation Tool

  • Current Status:

- Preliminary acceptance testing complete

- Historical data provides reasonable estimates of required resources and review durations

- NRR/EMBARK and NRR/DORL coordinating to finalize visualizations

- Develop and deploy final User Interface

  • Potential Follow-on Work:

- Search capabilities

- Predict Branch assignments

- Predict Standard Review Plan

- Predict which Regulatory Guide(s) were used for the licensing action

Regulatory Named Entity Recognition

  • Challenge: Title 10 of the Code of Federal Regulations (CFR), and other regulatory documents, reference sections of 10 CFR

- Revisions to 10 CFR could impact other sections

  • Goal: Create a tool to find and extract 10 CFR references from documents
  • Method: Use Named Entity Recognition (NER) to label text as regulations and extract that text

Named Entity Recognition

  • Used Python package spaCy

- Started with spaCy's default entities

- Added NRC-specific language patterns (see the sketch after this list)
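The slides do not show the actual patterns, so the REGULATION label and the token pattern below are illustrative assumptions; the sketch adds one NRC-specific pattern to a blank spaCy pipeline through its EntityRuler component.

    import spacy

    nlp = spacy.blank("en")
    ruler = nlp.add_pipe("entity_ruler")

    # Match references like "10 CFR 50.46": the literal tokens "10 CFR"
    # followed by a section number such as "50" or "50.46"
    ruler.add_patterns([{
        "label": "REGULATION",   # illustrative label, not from the slides
        "pattern": [
            {"TEXT": "10"},
            {"LOWER": "cfr"},
            {"TEXT": {"REGEX": r"^\d+(\.\d+)?$"}},
        ],
    }])

    doc = nlp("The amendment is evaluated against 10 CFR 50.46 and 10 CFR 52.")
    for ent in doc.ents:
        print(ent.text, "->", ent.label_)

Running this prints both "10 CFR 50.46" and "10 CFR 52" labeled as REGULATION, which is the find-and-extract behavior the Goal bullet describes.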

10 CFR Reference Identification Tool

Conclusions

  • Natural Language Processing is a powerful tool to leverage unstructured data in historical documents
  • Deploying these tools would increase the efficiency of staff by reducing the time required for manual searches

- Staff can leverage historical data in informing decisions