ML21277A098: Difference between revisions
Latest revision as of 15:09, 18 January 2022
ML21277A098 | |
Person / Time | |
---|---|
Issue date: | 08/18/2021 |
From: | Hathaway T NRC/RES/DSA |
To: | |
Dennis M | |
References | |
Download: ML21277A098 (18) | |
Text
Resource Prediction Using Natural Language Processing
Trey Hathaway, U.S. Nuclear Regulatory Commission, RES/DSA/AAB
August 18, 2021
NRC Data Science and Artificial Intelligence Regulatory Applications Workshops: Current Topics
Natural Language Processing
- Techniques that allow computers to understand the contents of natural language
- Allows for the extraction of information and insights from documents
- Collection of techniques:
- Rule-based, statistical, or neural
Goals
- Apply Natural Language Processing techniques to NRC data and use cases
- Demonstrate successes
Resource Prediction
- Challenge: Deviations between the resource estimates for completing a licensing review and the actual hours charged
- Goal: Create a tool to assist project managers in formulating resource estimates
- Leverage historical data
- Find historically similar reviews
- Method: Use term frequency-inverse document frequency vectors to represent documents and perform similarity calculations
- Rank documents based on similarity
- Term Frequency-Inverse Document Frequency (tf-idf)
- Weighting factor for words
- Product of term frequency and inverse document frequency
- Term Frequency (tf)
- How frequently a word appears in a document
- Importance of the word
- Inverse Document Frequency (idf)
- Reflects how frequently a word appears across the collection of documents; words common to many documents receive lower weight
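The weighting described above can be sketched in a few lines. The slides do not say which tf or idf variant the tool uses, so this example assumes relative term frequency and an unsmoothed natural-log idf; the function name `tf_idf` and the sample documents are made up for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            # tf (relative frequency) times idf (log of inverse document share).
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Illustrative documents only; not taken from NRC data.
docs = [
    "license amendment request for spent fuel pool".split(),
    "license amendment request for emergency diesel generator".split(),
    "relief request for pump testing".split(),
]
weights = tf_idf(docs)
```

Under these assumptions a word that occurs in every document, such as "request" here, gets a weight of zero, while a word unique to one document gets the largest weight.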
Term Frequency-Inverse Document Frequency (Vector Representation)
[Figure: axes labeled wordx and wordz]
- Represent a document as a vector
- The vector reflects the word usage in the document
- The vector will have 1000s of dimensions
Term Frequency-Inverse Document Frequency (Vector Space Corpus)
[Figure: axes labeled wordx and wordz]
- Represent the collection of documents as vectors
- Create a vocabulary of all words used in the collection
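As a rough illustration of the vocabulary and vector steps, the sketch below maps each distinct word in a small collection to a vector index and then turns each document into a vector over that vocabulary. It uses raw counts for brevity; in the approach described here the entries would be tf-idf weights. The helper names and sample text are hypothetical.

```python
from collections import Counter

def build_vocabulary(docs):
    """Map every distinct term in the collection to a vector index."""
    terms = sorted({term for doc in docs for term in doc})
    return {term: i for i, term in enumerate(terms)}

def to_count_vector(doc, vocabulary):
    """Return a dense term-count vector; out-of-vocabulary terms are ignored."""
    vector = [0] * len(vocabulary)
    for term, count in Counter(doc).items():
        if term in vocabulary:
            vector[vocabulary[term]] = count
    return vector

# Illustrative documents only.
corpus = [
    "pump testing relief request".split(),
    "fuel pool amendment request".split(),
]
vocabulary = build_vocabulary(corpus)
matrix = [to_count_vector(doc, vocabulary) for doc in corpus]
```

With a real collection the vocabulary grows to thousands of terms, so most entries in each vector are zero and a sparse representation is usually preferable to the dense lists used here.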
Term Frequency-Inverse Document Frequency (Similarity Calculations)
[Figure: axes labeled wordx and wordz]
- A new document is converted to a vector based on the vocabulary of the collection of documents
- The similarity (cosine of the angle between vectors) is calculated as the dot product between length-normalized vectors
- Documents ranked by similarity score
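A minimal sketch of the similarity and ranking step follows, assuming cosine similarity over dense vectors. The function names and the toy vectors are illustrative and are not taken from the tool itself.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # An all-zero vector has no direction to compare.
    return dot / (norm_a * norm_b)

def rank_by_similarity(query, vectors):
    """Return (index, score) pairs, most similar first."""
    scores = [(i, cosine_similarity(query, v)) for i, v in enumerate(vectors)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Toy vectors standing in for historical reviews and one new review.
historical = [[1.0, 0.0, 2.0], [0.0, 3.0, 0.0], [1.0, 1.0, 0.0]]
new_review = [1.0, 0.0, 2.0]
ranking = rank_by_similarity(new_review, historical)
```

Dividing by the vector lengths makes the score depend on the direction of the vectors rather than on document length, which is why a plain dot product is only equivalent when the vectors are already normalized.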
Approach
- Acquire historical licensing actions and resource requirements
- Extract text data from PDF files
- Clean data
- Create tf-idf matrix
- Create User Interface
- Extracts text data
- Performs similarity calculations
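The slides do not describe what the "Clean data" step involves. One common form of it, sketched here purely as an assumption, is to lowercase the extracted text, keep alphabetic tokens, and drop stop words before building the tf-idf matrix. The stop-word list and sample text are hypothetical.

```python
import re

# Hypothetical stop-word list for illustration; the slides do not list one.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "the", "to"}

def clean_text(raw):
    """Lowercase, keep alphabetic tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", raw.lower())
    return [t for t in tokens if t not in STOP_WORDS]

# Sample text standing in for the output of a PDF text extractor.
page = "Request for Amendment to the Technical\nSpecifications (Page 3 of 12)"
tokens = clean_text(page)
```

The resulting token lists are the kind of input the vocabulary and tf-idf steps operate on.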
Resource Estimation Tool
Current Status and Follow-on Work
- Preliminary acceptance testing complete
- Historical data provides reasonable estimates of required resources and review durations
- NRR/EMBARK and NRR/DORL coordinating to finalize visualizations
- Develop and deploy final User Interface
- Potential Follow-on Work:
- Search capabilities
- Predict Branch assignments
- Predict Standard Review Plan
- Predict which Regulatory Guide(s) was used for the licensing action
Regulatory Named Entity Recognition
- Challenge: Title 10 of the Code of Federal Regulations (CFR), and other regulatory documents, reference sections of 10 CFR
- Revisions to 10 CFR could impact other sections
- Goal: Create a tool to find and extract 10 CFR references from documents
- Method: Use Named Entity Recognition (NER) to label text as regulations and extract that text
Named Entity Recognition
[Figure panels: spaCy Default Entities; Addition of NRC Specific Language Patterns]
- Used Python package spaCy
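The NRC-specific language patterns added to spaCy are not shown in the slides. As a stand-in, the sketch below uses a plain regular expression from the Python standard library (not spaCy) to show the kind of rule that can label 10 CFR references. The pattern, function name, and sample sentence are assumptions made for illustration.

```python
import re

# Hypothetical pattern: "10 CFR", an optional "Part", a part number with an
# optional section number and letter suffix, then parenthesized paragraphs.
CFR_PATTERN = re.compile(
    r"\b10\s+CFR\s+(?:Part\s+)?\d+(?:\.\d+[a-z]?)?(?:\([A-Za-z0-9]+\))*"
)

def find_cfr_references(text):
    """Return (start, end, matched_text) for each 10 CFR reference found."""
    return [(m.start(), m.end(), m.group(0)) for m in CFR_PATTERN.finditer(text)]

# Illustrative sentence only.
sample = (
    "The request is submitted under 10 CFR 50.90 and evaluated against "
    "10 CFR 50.55a(g)(4) and 10 CFR Part 20."
)
references = [ref for _, _, ref in find_cfr_references(sample)]
```

A rule like this covers only the citation forms it was written for; a production tool would need patterns for ranges, appendices, and other citation styles found in the documents.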
10 CFR Reference Identification Tool
Conclusions
- Natural Language Processing is a powerful tool to leverage unstructured data in historical documents
- Deploying these tools would increase efficiency of staff by reducing time required for manual searches
- Staff can leverage historical data in informing decisions