varun196 / knowledge_graph_from_unstructured_text

Building knowledge graph from input data

Python 72.34% Java 22.45% Batchfile 0.77% Shell 4.43%

knowledge_graph_from_unstructured_text's Introduction

From unstructured text to knowledge graph

The project is a complete end-to-end solution for generating knowledge graphs from unstructured text. Named Entity Recognition (NER) can be run on the input with NLTK, spaCy, or the Stanford NER API. Optionally, coreference resolution can be performed via a Python wrapper around Stanford CoreNLP. Relation extraction is then done with Stanford Open IE. Lastly, post-processing produces a CSV file that can be uploaded to Graph Commons to visualize the knowledge graph.

More details can be found in the Approach folder.
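As a minimal sketch of the first and last pipeline stages (spaCy NER, then CSV output; the sample sentence and output file name are illustrative only):

    import csv

    import spacy

    # NER stage: tag entities in a raw sentence with the spaCy backend.
    nlp = spacy.load("en_core_web_sm")
    doc = nlp("John transferred 5000 dollars to Rohan in New York.")
    entities = [(ent.text, ent.label_) for ent in doc.ents]

    # Post-processing stage: write the entities to a CSV that a graph
    # visualiser such as Graph Commons can ingest.
    with open("entities.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["entity", "type"])
        writer.writerows(entities)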

Running the code

  1. Clone the repository
  2. Ensure your system is set up properly (refer to the Setup instructions below)
  3. Put your input data files (.txt) in data/input
  4. Run knowledge_graph.py:
    python3 knowledge_graph.py spacy
    You can provide several arguments to knowledge_graph.py. For a more detailed list, refer to the "Running knowledge_graph.py" section below.
  5. Run relation_extractor.py:
    python3 relation_extractor.py
  6. Run create_structured_csv.py:
    python3 create_structured_csv.py
  7. The resulting CSV is available in the data/results folder (a quick way to check it is sketched below)
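To sanity-check step 7, a minimal sketch that loads whatever CSVs land in data/results (assuming pandas is installed; file names vary with the input):

    import glob

    import pandas as pd

    # Print the shape of every result CSV so an empty output is caught early.
    for path in glob.glob("data/results/*.csv"):
        print(path, pd.read_csv(path).shape)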

Setup

The following installation steps are written for a Linux operating system and Python 3.

  1. Create a new python3 virtual environment:
    python3 -m venv <path_to_env/env_name>
  2. Switch to the environment:
    source path_to_env/env_name/bin/activate
  3. Install spaCy:
    pip3 install spacy
  4. Install the en_core_web_sm model for spaCy:
    python3 -m spacy download en_core_web_sm
  5. Install NLTK:
    pip3 install nltk
  6. Install the required NLTK data. Either install the required packages individually or install everything with
    python3 -m nltk.downloader all
    Refer: https://www.nltk.org/data.html
  7. Install the stanfordcorenlp Python package:
    pip3 install stanfordcorenlp
  8. Download and unzip stanford-corenlp-full:
    https://stanfordnlp.github.io/CoreNLP/download.html
  9. Download and set up Stanford NER (https://nlp.stanford.edu/software/CRF-NER.shtml#Download) as described in the NLTK documentation: http://www.nltk.org/api/nltk.tag.html#module-nltk.tag.stanford (not required if already present from the git clone)
  10. Download and unzip Stanford Open IE (not required if already present from the git clone)
  11. Install python3-tk:
    sudo apt-get install python3-tk
  12. Install pandas:
    pip3 install pandas
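Once the steps above are done, a quick hedged sanity check of the Python side (it does not verify the CoreNLP or Open IE downloads):

    # Verify that each Python dependency installed above imports cleanly.
    import nltk
    import pandas
    import spacy
    import stanfordcorenlp  # wrapper only; the CoreNLP jars live elsewhere

    # Loading the model confirms step 4 actually fetched en_core_web_sm.
    spacy.load("en_core_web_sm")
    print("Python-side setup looks OK")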

knowledge_graph.py

Performs Named Entity Recognition (NER) on the input data using NLTK, spaCy, or Stanford (or all of them). Also performs coreference resolution. The coreferences are used by relation_extractor.py; the recognised named entities are used by create_structured_csv.py.
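For a sense of what the NLTK and spaCy backends each return on the same sentence, a small sketch (the repo's own wrappers differ in detail; NLTK needs the data from setup step 6):

    import nltk
    import spacy

    sentence = "Barack Obama was born in Hawaii."

    # NLTK route: tokenize, POS-tag, then chunk named entities.
    tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sentence)))
    print(tree)  # entities appear as labelled subtrees, e.g. (PERSON Barack/NNP)

    # spaCy route: the pre-trained pipeline exposes entities directly.
    doc = spacy.load("en_core_web_sm")(sentence)
    print([(ent.text, ent.label_) for ent in doc.ents])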

Running knowledge_graph.py

Will only run on Linux-like operating systems, with POSIX-style paths like abc/def/file.txt.

Please note that the coreference resolution server requires around 4 GB of free system RAM. If this is not available, the Stanford server may stop with an error, or thrashing may cause the program to run very slowly.

python3 knowledge_graph.py <options>

options:

  • nltk runs Named Entity Recognition using custom code written with the help of NLTK
  • stanford runs NER using Stanford's library
  • spacy uses spaCy's pre-trained models for NER
  • verbose gives detailed output
  • optimized runs coreference resolution for better output (see the sketch after the example below). This increases the time taken significantly and imposes a limit on the size of each file, so data may need to be split across files.

e.g.:

python3 knowledge_graph.py optimized verbose nltk spacy
will output NER via NLTK and spaCy, and perform coreference resolution
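The optimized flag drives the Stanford CoreNLP server through the stanfordcorenlp wrapper; a hedged sketch of what that coref call looks like (the CoreNLP path is a placeholder for wherever setup step 8 unzipped it):

    import json

    from stanfordcorenlp import StanfordCoreNLP

    # Placeholder path: point at the unzipped stanford-corenlp-full folder.
    nlp = StanfordCoreNLP("./stanford-corenlp-full-2018-10-05", memory="4g")

    # Request coreference chains; this is the step that needs ~4GB of RAM.
    output = json.loads(nlp.annotate(
        "John took a loan. He repaid it later.",
        properties={"annotators": "coref", "outputFormat": "json"},
    ))
    print(output["corefs"])  # chains mapping 'He' back to 'John'
    nlp.close()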

inputs to knowledge_graph.py

The input unstructured data files must be in the ./data/input folder, i.e. the data folder must be in the same directory as knowledge_graph.py.

outputs from knowledge_graph.py

data/output/ner --- contains recognised named entities
data/output/caches --- intended to contain result pickles of coreferences obtained from Stanford CoreNLP
data/output/kg --- contains input files with coreferences resolved

knowledge_graph_from_unstructured_text's People

Contributors

rajatdge, varun196


knowledge_graph_from_unstructured_text's Issues

relation extractor problem

when I try to use the relation extractor I get this error

Traceback (most recent call last):
  File "C:\Users\Siyavash\Desktop\maliheh\knowledge_graph_from_unstructured_text\relation_extractor.py", line 28, in <module>
    Stanford_Relation_Extractor()
  File "C:\Users\Siyavash\Desktop\maliheh\knowledge_graph_from_unstructured_text\relation_extractor.py", line 19, in Stanford_Relation_Extractor
    p = subprocess.Popen(['./process_large_corpus.sh',f,f + '-out.csv'], stdout=subprocess.PIPE)
  File "C:\Users\Siyavash\anaconda3\envs\mynlp\lib\subprocess.py", line 951, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "C:\Users\Siyavash\anaconda3\envs\mynlp\lib\subprocess.py", line 1420, in _execute_child
    hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
OSError: [WinError 193] %1 is not a valid Win32 application
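WinError 193 means Windows tried to execute the shell script directly, which it cannot do: process_large_corpus.sh is a bash script. A hedged workaround is to launch it through a bash available on the machine (Git Bash or WSL), along these lines:

    import subprocess

    # On Windows, prefix the script with an explicit bash interpreter instead
    # of executing the .sh file directly (which raises WinError 193).
    f = "data/output/kg/input_data.txt"  # illustrative input path
    p = subprocess.Popen(["bash", "./process_large_corpus.sh", f, f + "-out.csv"],
                         stdout=subprocess.PIPE)
    p.communicate()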

Error when running python3 create_structured_csv.py

I can run the first and second commands successfully, but I get an error when running
python3 create_structured_csv.py
The error is below:

input_data
Traceback (most recent call last):
  File "create_structured_csv.py", line 58, in <module>
    main()
  File "create_structured_csv.py", line 30, in main
    df = pd.read_csv(curr_dir +"/data/output/kg/"+file_name+".txt-out.csv")
  File "/usr/local/lib/python3.5/dist-packages/pandas/io/parsers.py", line 702, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/local/lib/python3.5/dist-packages/pandas/io/parsers.py", line 435, in _read
    data = parser.read(nrows)
  File "/usr/local/lib/python3.5/dist-packages/pandas/io/parsers.py", line 1139, in read
    ret = self._engine.read(nrows)
  File "/usr/local/lib/python3.5/dist-packages/pandas/io/parsers.py", line 1995, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 899, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 914, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas/_libs/parsers.pyx", line 968, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 955, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 2172, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 3 fields in line 30, saw 4
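The ParserError means one row of the intermediate CSV has more commas than the header declares, typically because an extracted relation itself contains a comma. A hedged mitigation is to skip malformed rows when loading (on_bad_lines needs pandas >= 1.3; older versions use error_bad_lines=False):

    import pandas as pd

    # Skip rows whose field count does not match the header instead of failing.
    df = pd.read_csv("data/output/kg/input_data.txt-out.csv",
                     on_bad_lines="skip")
    # Older pandas (as in the traceback above):
    # df = pd.read_csv(path, error_bad_lines=False)
    print(df.shape)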

Not getting any output

When I tried to give the input: "John transferred 5000 dollars to Rohan. John committed fraud of 2 million dollars. While going through serious debt he also took loan from Bank Of Baroda. USA allegedly provided John money of $8 million."

I got nothing (result/named_entity_input.csv was empty) after executing create_structured_csv.py (tried using both "optimized verbose nltk spacy" and "spacy").

Can I know what pattern this module captures in a paragraph

Or what kind/structure of input is optimal for this module to handle.

Btw, I really liked your approach and thanks in advance!

create_structured_csv.py does not work for any new data!

create_structured_csv.py needs the input_data.txt-out.csv file to be created in the intermediate process, but the file is not being generated/saved. The program only works if input_data.txt-out.csv is downloaded from GitHub (already present), which is not the case for new data.

Relation Extraction

Hello all,

Can someone please help me with this error?

Traceback (most recent call last):
  File "relation_extractor.py", line 25, in <module>
    Stanford_Relation_Extractor()
  File "relation_extractor.py", line 16, in Stanford_Relation_Extractor
    p = subprocess.Popen(['./process_large_corpus.sh',f,f + '-out.csv'], stdout=subprocess.PIPE)
  File "/usr/lib/python3.6/subprocess.py", line 729, in __init__
    restore_signals, start_new_session)
  File "/usr/lib/python3.6/subprocess.py", line 1364, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: './process_large_corpus.sh': './process_large_corpus.sh'

I get this error when I run the relation_extractor.py file. There is no process_large_corpus.sh file in the repo.

Any help or hint will be appreciated.

Thanks very much
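A FileNotFoundError on ./process_large_corpus.sh usually means the relative path is being resolved from the wrong working directory (it only works when Python starts in the repo root) or the script lost its executable bit. A hedged, more defensive launch:

    import os
    import stat
    import subprocess

    # Resolve the script relative to this file so the call works from any cwd.
    repo_root = os.path.dirname(os.path.abspath(__file__))
    script = os.path.join(repo_root, "process_large_corpus.sh")

    # Restore the executable bit in case it did not survive the clone.
    os.chmod(script, os.stat(script).st_mode | stat.S_IEXEC)

    f = "data/output/kg/input_data.txt"  # illustrative input path
    subprocess.Popen([script, f, f + "-out.csv"],
                     stdout=subprocess.PIPE).communicate()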

Relation-Extraction: "process_large_corpus.sh" file not found

Hello guys,

Can someone please help me with this error? The process_large_corpus.sh file is in the right directory, so I do not understand what the issue is.

Traceback (most recent call last):
  File "relation_extractor.py", line 27, in <module>
    Stanford_Relation_Extractor()
  File "relation_extractor.py", line 17, in Stanford_Relation_Extractor
    p = subprocess.Popen(['./process_large_corpus.sh',f,f + '-out.csv'], stdout=subprocess.PIPE)
  File "/usr/lib/python3.6/subprocess.py", line 729, in __init__
    restore_signals, start_new_session)
  File "/usr/lib/python3.6/subprocess.py", line 1364, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: './process_large_corpus.sh': './process_large_corpus.sh'

Thanks very much!

IndexError: list index out of range while coreference

Traceback (most recent call last):
  File "knowledge_graph.py", line 292, in <module>
    main()
  File "knowledge_graph.py", line 287, in main
    doc = resolve_coreferences(doc,stanford_core_nlp_path,named_entities,verbose)
  File "knowledge_graph.py", line 217, in resolve_coreferences
    result = coref_obj.resolve_coreferences(corefs,doc,ner,verbose)
  File "knowledge_graph.py", line 200, in resolve_coreferences
    replaced_sent = words[i] + " "+ replaced_sent
IndexError: list index out of range


Data file added for reproducing the error
input_data (1).txt

Primary analysis suggests the file has tokens like "North-East" and "third-largest". The Stanford tokenizer used for coreference splits across the hyphen, while NLTK does not. So, per NLTK, the token length of the corresponding sentence is 37, which does not match the coreference indices (41 tokens, e.g. ['North', '-', 'East', 'third', '-', 'largest']).
