machinalis / iepy
Information Extraction in Python
License: BSD 3-Clause "New" or "Revised" License
Adding more logging to the pipeline in core.py would be useful.
The third-party download script downloads additional packages that are not needed, like:
The script should only download what is actually used by IEPY.
Classifier Configuration:
"classifier_config": {
"classifier": "svm",
"classifier_args": {
"class_weight": {
"false": 1,
"true": 1
},
"gamma": 0.0,
"kernel": "rbf"
},
"dimensionality_reduction": null,
"dimensionality_reduction_dimension": null,
"feature_selection": "frequency_filter",
"feature_selection_dimension": 10,
"features": [
"bag_of_words_in_between",
"bag_of_pos_in_between",
"bag_of_wordpos_in_between",
"entity_order",
"entity_distance",
"other_entities_in_between",
"verbs_count_in_between"
],
"scaler": true,
"sparse": true
},
Full traceback:
Processing | | 2/288
2014-06-09 13:34:29,163 - root - ERROR - Experiment failed because of ValueError ColumnFilter eliminates all columns!, skipping...
Traceback (most recent call last):
File "experimentation/loop/experiment_runner.py", line 338, in <module>
use_git_info_from_path=path)
File "/home/jmansilla/.virtualenvs/iepy/local/lib/python2.7/site-packages/featureforge/experimentation/runner.py", line 67, in main
result = single_runner(config)
File "experimentation/loop/experiment_runner.py", line 128, in run_iepy
answers_given, progression = iepyloop.run_experiment()
File "experimentation/loop/experiment_runner.py", line 76, in run_experiment
self.force_process() # blocking
File "/home/jmansilla/projects/iepy/repo/iepy/core.py", line 181, in force_process
self.do_iteration(None)
File "/home/jmansilla/projects/iepy/repo/iepy/core.py", line 135, in do_iteration
data = step(data)
File "/home/jmansilla/projects/iepy/repo/iepy/core.py", line 266, in learn_fact_extractors
classifiers[rel] = self._build_extractor(rel, Knowledge(k))
File "/home/jmansilla/projects/iepy/repo/iepy/core.py", line 274, in _build_extractor
return FactExtractorFactory(self.extractor_config, data)
File "/home/jmansilla/projects/iepy/repo/iepy/fact_extractor.py", line 185, in FactExtractorFactory
p.fit(data)
File "/home/jmansilla/projects/iepy/repo/iepy/fact_extractor.py", line 158, in fit
self.predictor.fit(X, y)
File "/home/jmansilla/.virtualenvs/iepy/local/lib/python2.7/site-packages/sklearn/pipeline.py", line 130, in fit
Xt, fit_params = self._pre_transform(X, y, **fit_params)
File "/home/jmansilla/.virtualenvs/iepy/local/lib/python2.7/site-packages/sklearn/pipeline.py", line 122, in _pre_transform
Xt = transform.fit(Xt, y, **fit_params_steps[name]) \
File "/home/jmansilla/projects/iepy/repo/iepy/fact_extractor.py", line 442, in fit
raise ValueError("ColumnFilter eliminates all columns!")
ValueError: ColumnFilter eliminates all columns!
Traceback (most recent call last):
File "scripts/cross_validate.py", line 114, in <module>
accuracy, precision, recall = main(opts)
File "scripts/cross_validate.py", line 67, in main
standard = Knowledge.load_from_csv(options['<gold_standard>'], connection)
TypeError: load_from_csv() takes exactly 2 arguments (3 given)
I start with seeds of type (disease, symptom) but in the second round of questions IEPY starts asking about (disease, disease).
In my case I run with the command:
$ python scripts/iepy_runner.py house_pages_current seeds.csv facts.csv
Only "iepy" is listed.
On documentation is not completely clear that once you run the preprocess you can already browse and edit the processed data.
Right now our LiteralNER is very literal, so in some cases it does not work.
Example: an entry like this
takayasu's arteritis
is never found, because the documents will be tokenized, transforming this
John had takayasu's arteritis
into this
John had takayasu 's arteritis
making a match impossible (notice that 's is a separate token).
Also, what makes things harder is that the tokenizer used when parsing the LiteralNER entries must be the same tokenizer used when tokenizing the text.
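A minimal sketch of what tokenizer-consistent matching could look like (the function name and API are assumptions for illustration, not iepy's actual code): tokenize each LiteralNER entry with the same tokenizer applied to the documents, then compare token-wise.

```python
def find_literal_matches(doc_tokens, entry, tokenize):
    """Return (start, end) token spans where the gazetteer entry matches.

    `tokenize` must be the SAME tokenizer used on the documents, so that
    "takayasu's arteritis" becomes ["takayasu", "'s", "arteritis"] on
    both sides of the comparison.  Hypothetical helper, not iepy API.
    """
    entry_tokens = [t.lower() for t in tokenize(entry)]
    n = len(entry_tokens)
    matches = []
    for i in range(len(doc_tokens) - n + 1):
        if [t.lower() for t in doc_tokens[i:i + n]] == entry_tokens:
            matches.append((i, i + n))
    return matches
```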
Setting override=True in any step of the pipeline has no effect. I think the problem is in PreProcessPipeline, where get_documents_lacking_preprocess() is used regardless of whether we are trying to override.
The bug can be reproduced answering 'n' to a question and then 'run'.
In my case:
$ python scripts/iepy_runner.py house_pages_current seeds.csv facts.csv
...
(y/n/d/run/STOP): n
...
(y/n/d/run/STOP): run
...
Traceback (most recent call last):
File "scripts/iepy_runner.py", line 40, in <module>
p.force_process()
File "/home/francolq/Documents/comp/machinalis/iepy/iepy/core.py", line 318, in force_process
self.do_iteration(None)
File "/home/francolq/Documents/comp/machinalis/iepy/iepy/core.py", line 281, in do_iteration
data = step(data)
File "/home/francolq/Documents/comp/machinalis/iepy/iepy/core.py", line 401, in extract_facts
true_index = list(classifier.classes_).index(True)
ValueError: True is not in list
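A possible guard against this crash (an assumption about the fix, not the actual patch): when every answer so far was 'n', the classifier only ever saw the False label, so `classes_` lacks True and `.index(True)` raises ValueError.

```python
def true_probability(classifier, scores_row):
    """Probability assigned to the True class, tolerating a classifier
    that was trained with only negative labels.

    Hypothetical helper sketching a fix for extract_facts; `scores_row`
    is one row of predict_proba-style output aligned with classes_.
    """
    classes = list(classifier.classes_)
    if True not in classes:
        return 0.0  # the classifier never saw a positive example
    return scores_row[classes.index(True)]
```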
Would be great to add the following features to the generate_seeds
script:
--list-entity-kinds
that prints the available entity kinds (reported by jmansilla)
By default an iepy --create
instantiation uses an SQLite database. Since the performance of this database is poor, it would be good to warn the user about this in the documentation.
Installing IEPY with the following command
$ pip install git+https://github.com/machinalis/iepy
makes the scripts difficult to access and the templates directly unavailable.
Maybe a single binary 'iepy' could be provided to give access to the scripts.
iepy/app_preprocess.py.template
The filter_facts step adds to the knowledge the evidence that was classified as positive with high probability, but doesn't check whether the evidence was already labeled as negative by the user.
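A rough sketch of the proposed check (all names here are hypothetical, not iepy's actual signatures): before promoting high-probability evidence into the knowledge base, skip anything the user already labeled as negative.

```python
def filter_facts(evidence_scores, user_answers, threshold=0.9):
    """Keep evidence classified positive with high probability, unless
    the user has explicitly answered "no" for it.

    evidence_scores: mapping evidence -> positive-class probability
    user_answers:    mapping evidence -> bool (human labels so far)
    """
    accepted = {}
    for evidence, score in evidence_scores.items():
        if score < threshold:
            continue
        if user_answers.get(evidence) is False:
            continue  # the user already rejected this evidence
        accepted[evidence] = score
    return accepted
```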
I've tried to import my corpus of Ukrainian texts and apparently one of them was too big for iepy:
Added 2503 documents
Traceback (most recent call last):
File "bin/csv_to_iepy.py", line 29, in <module>
csv_to_iepy(filepath)
File "/Users/dchaplinsky/Projects/pullenti-ukr/iepy/venv/lib/python3.4/site-packages/iepy/utils.py", line 111, in csv_to_iepy
for i, d in enumerate(reader):
File "/usr/local/Cellar/python3/3.4.1_1/Frameworks/Python.framework/Versions/3.4/lib/python3.4/csv.py", line 110, in __next__
row = next(self.reader)
_csv.Error: field larger than field limit (131072)
While I understand that having such a text in the corpus is a bit silly, I think a good solution here would be:
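One possible workaround on the importer side (a sketch of my own, not necessarily the intended fix): raise the csv module's per-field size limit, which defaults to 131072 characters, before reading the corpus.

```python
import csv
import sys

# Raise the csv field size limit so very large documents don't abort the
# import.  Some platforms reject sys.maxsize with OverflowError, so back
# off until a value is accepted.
limit = sys.maxsize
while True:
    try:
        csv.field_size_limit(limit)
        break
    except OverflowError:
        limit = int(limit / 2)
```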
Right now download_third_party_data.py
re-downloads data without checking whether it's necessary. It would be great if it only downloaded what is needed.
(reported by jmansilla)
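A skip-if-present helper could address this; a minimal sketch (the function and its callers are assumptions, not the script's actual structure):

```python
import os

def ensure_downloaded(target_path, download_fn):
    """Fetch a third-party resource only when it is missing.

    `download_fn(target_path)` performs the actual download; returns True
    when a download happened, False when the resource was already there.
    Hypothetical helper for download_third_party_data.py.
    """
    if os.path.exists(target_path):
        return False  # already present, nothing to do
    download_fn(target_path)
    return True
```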
Once the BootstrappedIEPipeline starts running, the Evidence instances generated are (kind of) read-only.
The same evidences will be generated over and over in stage 1. Therefore, the vectorization needed by the classifier at stages 3 and 5 could be cached for future pipeline cycles, improving execution speed.
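Since the Evidence instances are effectively read-only, they can serve as cache keys; a sketch of such a memoizing wrapper (not iepy code, names assumed):

```python
class VectorizationCache:
    """Memoize feature vectors keyed by the (read-only) evidence.

    `vectorize` is whatever callable turns one evidence into its feature
    vector; repeated pipeline cycles reuse the cached result instead of
    recomputing it.
    """

    def __init__(self, vectorize):
        self._vectorize = vectorize
        self._cache = {}

    def __call__(self, evidence):
        if evidence not in self._cache:
            self._cache[evidence] = self._vectorize(evidence)
        return self._cache[evidence]
```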
The method _confidence of the class BootstrappedIEPipeline is not being used.
Should we remove it?
In our example tvseries app, the word "House", which refers to the person, is instead labeled as an organization by the default NER [1] we have.
[1] Probably because our NER was trained on the Wall Street Journal, so it may associate it with the White House... dunno.
It's accessing the DB too many times. Refactor needed.
Classification fails (is impossible) if only negative (or only positive) evidence is present.
Right now, a failure (an uncaught exception) occurs if only negative examples are present.
This issue has not been consciously addressed, and correct behavior should be ensured.
"What is correct behavior?" is open for discussion. What I propose is: in all stages of the pipeline except the human interaction (stage 2), skip relations that don't have both positive and negative evidence.
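The proposed skip could be as simple as this sketch (names and data shapes are assumptions, not iepy's actual structures):

```python
def trainable_relations(labeled_evidence_by_relation):
    """Return only the relations whose labeled evidence contains BOTH a
    positive and a negative example, i.e. the ones a binary classifier
    can actually be trained on.

    labeled_evidence_by_relation: mapping relation -> [(evidence, bool)]
    """
    trainable = []
    for relation, labeled in labeled_evidence_by_relation.items():
        labels = {label for _evidence, label in labeled}
        if True in labels and False in labels:
            trainable.append(relation)
    return trainable
```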
Apparently the problem is in
corenlp.sh
which is sitting here:
/Users/dchaplinsky/Library/Application Support/iepy/stanford-corenlp-full-2014-08-27/corenlp.sh
and dirname $0 fails because of the space in the path.
Fixed it by patching the .sh to use dirname "$0".
I understand that this is more of a StanfordNER problem, but maybe my hack will be useful to somebody.
In my own app I am trying to subclass IEDocument to have additional fields and methods, but mongoengine says I can't. Not sure if it's a bug or a design decision, but it would be useful in my case to be able to subclass IEDocument.
For instance:
from iepy.models import IEDocument
class MyDocument(IEDocument):
    pass
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/mongoengine/base/metaclasses.py", line 332, in __new__
new_class = super_new(cls, name, bases, attrs)
File "/usr/local/lib/python2.7/dist-packages/mongoengine/base/metaclasses.py", line 120, in __new__
base.name)
ValueError: Document IEDocument may not be subclassed
RuntimeWarning: invalid value encountered in double_scalars
scale = lambda x: (x - min_score) * range_delta / score_range + range_min
Shall we worry about this?
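The warning comes from the division: when all scores are equal, score_range is 0 and the expression evaluates 0/0, which is exactly the "invalid value encountered in double_scalars" NaN. A guarded version could look like this sketch (a rewrite of the lambda, not the actual code):

```python
def scale(x, min_score, max_score, range_min=0.0, range_max=1.0):
    """Rescale x from [min_score, max_score] into [range_min, range_max],
    returning range_min when all scores are equal instead of dividing by
    a zero score_range (the source of the RuntimeWarning)."""
    score_range = max_score - min_score
    if score_range == 0:
        return range_min
    range_delta = range_max - range_min
    return (x - min_score) * range_delta / score_range + range_min
```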
Last parameter is missing:
logger.info('Database %s created with %i documents', dbname, )
Classifier configuration:
"classifier_config": {
"classifier": "svm",
"classifier_args": {
"class_weight": {
"false": 1,
"true": 1
},
"gamma": 0.0,
"kernel": "rbf"
},
"dimensionality_reduction": null,
"dimensionality_reduction_dimension": null,
"feature_selection": "frequency_filter",
"feature_selection_dimension": 5,
"features": [
"bag_of_words_in_between",
"bag_of_pos_in_between",
"bag_of_wordpos_in_between",
"entity_order",
"entity_distance",
"other_entities_in_between",
"verbs_count_in_between"
],
"scaler": true,
"sparse": true
},
Full traceback:
Processing | | 1/288
2014-06-09 13:27:39,152 - root - ERROR - Experiment failed because of IndexError index out of bounds, skipping...
Traceback (most recent call last):
File "experimentation/loop/experiment_runner.py", line 338, in <module>
use_git_info_from_path=path)
File "/home/jmansilla/.virtualenvs/iepy/local/lib/python2.7/site-packages/featureforge/experimentation/runner.py", line 67, in main
result = single_runner(config)
File "experimentation/loop/experiment_runner.py", line 128, in run_iepy
answers_given, progression = iepyloop.run_experiment()
File "experimentation/loop/experiment_runner.py", line 76, in run_experiment
self.force_process() # blocking
File "/home/jmansilla/projects/iepy/repo/iepy/core.py", line 181, in force_process
self.do_iteration(None)
File "/home/jmansilla/projects/iepy/repo/iepy/core.py", line 135, in do_iteration
data = step(data)
File "/home/jmansilla/projects/iepy/repo/iepy/core.py", line 266, in learn_fact_extractors
classifiers[rel] = self._build_extractor(rel, Knowledge(k))
File "/home/jmansilla/projects/iepy/repo/iepy/core.py", line 274, in _build_extractor
return FactExtractorFactory(self.extractor_config, data)
File "/home/jmansilla/projects/iepy/repo/iepy/fact_extractor.py", line 185, in FactExtractorFactory
p.fit(data)
File "/home/jmansilla/projects/iepy/repo/iepy/fact_extractor.py", line 158, in fit
self.predictor.fit(X, y)
File "/home/jmansilla/.virtualenvs/iepy/local/lib/python2.7/site-packages/sklearn/pipeline.py", line 131, in fit
self.steps[-1][-1].fit(Xt, y, **fit_params)
File "/home/jmansilla/.virtualenvs/iepy/local/lib/python2.7/site-packages/sklearn/svm/base.py", line 140, in fit
y = self._validate_targets(y)
File "/home/jmansilla/.virtualenvs/iepy/local/lib/python2.7/site-packages/sklearn/svm/base.py", line 442, in _validate_targets
self.class_weight_ = compute_class_weight(self.class_weight, cls, y)
File "/home/jmansilla/.virtualenvs/iepy/local/lib/python2.7/site-packages/sklearn/utils/class_weight.py", line 52, in compute_class_weight
if classes[i] != c:
IndexError: index out of bounds
It would be great to have a tool to import and export annotated corpora.
That would allow us to easily share the corpora we already have (perdate and orgloc) and to receive contributions from future IEPY users.
As an example, this package is doing it
Every other script in the instances takes a parameter for the relation to use, but the rules runner takes it from the rules.py file.
For uniformity, we should think about how we can change that, keeping in mind that a set of rules is specific to a relation.
I open this for discussion.
Add Sphinx stuff to the docs, so we can integrate the documentation into readthedocs easily.
The tokenizer is not following standard contraction tokenization [0], expected by the Stanford POS tagger. Contractions are not split and should be.
Also, the apostrophe character ´ is not handled.
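For illustration, a hypothetical sketch of Penn-Treebank-style contraction splitting, the convention the Stanford POS tagger expects ("don't" becomes "do" + "n't", "John's" becomes "John" + "'s"); this is not iepy's tokenizer:

```python
import re

def split_contractions(token):
    """Split a single token PTB-style.  Covers the common cases only:
    the n't suffix and clitics like 's, 're, 've, 'll, 'd, 'm."""
    m = re.match(r"^(.+)(n't)$", token, re.IGNORECASE)
    if m:
        return [m.group(1), m.group(2)]
    m = re.match(r"^(.+)('s|'re|'ve|'ll|'d|'m)$", token, re.IGNORECASE)
    if m:
        return [m.group(1), m.group(2)]
    return [token]
```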
When "speeding up" the database, the focus should be on the operations needed for running the iepy core, not the preprocessing.
Traceback (most recent call last):
File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
"__main__", fname, loader, pkg_name)
File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
exec code in run_globals
File "/home/francolq/Documents/comp/machinalis/experimental/iepy/scripts/iepy_runner.py", line 51, in <module>
p.force_process()
File "iepy/core.py", line 182, in force_process
self.do_iteration(None)
File "iepy/core.py", line 136, in do_iteration
data = step(data)
File "iepy/core.py", line 267, in learn_fact_extractors
classifiers[rel] = self._build_extractor(rel, Knowledge(k))
File "iepy/core.py", line 277, in _build_extractor
return FactExtractorFactory(self.extractor_config, data)
File "iepy/fact_extractor.py", line 176, in FactExtractorFactory
p.fit(data)
File "iepy/fact_extractor.py", line 157, in fit
self.predictor.fit(X, y)
File "/usr/local/lib/python2.7/dist-packages/sklearn/pipeline.py", line 130, in fit
Xt, fit_params = self._pre_transform(X, y, **fit_params)
File "/usr/local/lib/python2.7/dist-packages/sklearn/pipeline.py", line 122, in _pre_transform
Xt = transform.fit(Xt, y, **fit_params_steps[name]) \
File "iepy/fact_extractor.py", line 429, in fit
if not any(self.mask):
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
Hi,
I'm interested in the whole subject matter of iepy, but the vocabulary and theories attached to NLP are very new to me. My apologies if what I'm asking shows a complete lack of understanding.
I've installed iepy, and created a simple .csv file and imported it:
$ python bin/csv_to_iepy.py docs.csv
Importing Documents to IEPY from docs.csv
Added 1 documents
Added 2 documents
Then preprocessed:
$ python bin/preprocess.py
Starting preprocessing step <iepy.preprocess.stanford_preprocess.StanfordPreprocess object at 0x7ff43878ce90>
Loading StanfordCoreNLP...
Done for 1 documents
Done for 2 documents
Starting preprocessing step <iepy.preprocess.segmenter.SyntacticSegmenterRunner object at 0x7ff43878cf90>
About to set 1 segments for current doc
New 1 segments created
Done for 1 documents
About to set 1 segments for current doc
New 1 segments created
Done for 2 documents
Using the web interface, I created a relation called `rvb`
and an entity kind `Entity` (to which the relation is attached, left and right).
If I then run
$ python bin/iepy_runner.py rvb
I get
Loading candidate evidence from database...
Getting labels from DB
Sorting labels them by evidence
Labels conflict solving
Traceback (most recent call last):
File "bin/iepy_runner.py", line 71, in <module>
performance_tradeoff=tuning_mode)
File "/home/mathieu/dev/iepy-test/lib/python3.3/site-packages/iepy/extraction/active_learning_core.py", line 48, in __init__
self._setup_labeled_evidences(labeled_evidences)
File "/home/mathieu/dev/iepy-test/lib/python3.3/site-packages/iepy/extraction/active_learning_core.py", line 162, in _setup_labeled_evidences
raise ValueError("Cannot start core without candidate evidence")
ValueError: Cannot start core without candidate evidence
As I said, this is probably due to my complete lack of understanding of the theories behind IEPY, but how can I create candidate evidence?
javier@my_computer:~/repo$ python scripts/download_third_party_data.py
Downloading third party software...
Downloading punkt tokenizer
[nltk_data] Downloading package 'punkt' to /home/javier/nltk_data...
[nltk_data] Unzipping tokenizers/punkt.zip.
Downloading wordnet
[nltk_data] Downloading package 'wordnet' to /home/javier/nltk_data...
[nltk_data] Unzipping corpora/wordnet.zip.
Traceback (most recent call last):
File "scripts/download_third_party_data.py", line 17, in <module>
download_third_party_data()
File "scripts/download_third_party_data.py", line 11, in download_third_party_data
download_tagger()
File "/home/javier/repo/iepy/tagger.py", line 67, in download
os.mkdir(DIRS.user_data_dir)
OSError: [Errno 2] No such file or directory: '/home/javier/.config/iepy'
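A likely cause (an assumption, not a confirmed diagnosis): os.mkdir raises ENOENT when the parent of the user data dir ('~/.config' here) does not exist yet, whereas os.makedirs creates the whole chain. A sketch of the fix:

```python
import os

def ensure_user_data_dir(user_data_dir):
    """Create the user data directory and any missing parents.

    exist_ok=True makes a second run (or a race with another process)
    harmless.  Hypothetical replacement for the os.mkdir call in
    iepy/tagger.py's download().
    """
    os.makedirs(user_data_dir, exist_ok=True)
    return user_data_dir
```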
Right now, if no --extractor-config option is passed, iepy will use the internal Python defaults.
If I create an instance, it feels a bit boilerplate to be obliged to explicitly say
--extractor-config=extractor_config.json each time.
There's a re-definition of the "rules" variable at:
https://github.com/machinalis/iepy/blob/master/iepy/instantiation/iepy_rules_runner.py#L43
This means the rules are not being loaded.
Running the preprocess, or the tests under python 2 logs:
DeprecationWarning: get_or_create is scheduled to be deprecated. The approach is flawed without transactions. Upserts should be preferred.
Enhancement proposal by @makmanalp thanks for reporting!
"While installing, it took me a minute to realize that it was python 3 only, which is not a big deal, but it might help to have something in setup.py that errors out and gives you a message if you try to do that - or maybe there's a way in pypi to disallow from installing on python 2?"
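For reference, a setup.py sketch of both approaches suggested above (this is an illustration, not iepy's actual setup.py): fail fast with a message when run under Python 2, and declare python_requires so that pip versions which understand the metadata refuse to install at all.

```python
import sys

from setuptools import setup

# Error out early with a clear message when installed under Python 2.
if sys.version_info[0] < 3:
    sys.exit("iepy requires Python 3")

setup(
    name="iepy",
    # Newer pip reads this metadata and skips the package on Python 2.
    python_requires=">=3",
)
```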
Specifically, those facts that are seed facts. For instance:
disease,botulism,symptom,paralysis,CAUSES,,,,,1
Features such as bag_of_words, bag_of_pos, bag_of_wordpos and their bigram versions shouldn't include the two entities involved in the evidence.
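A sketch of what excluding the entities could look like for the in-between features (the function name and span convention are assumptions, not iepy's implementation):

```python
def bag_of_words_in_between(tokens, left_entity_span, right_entity_span):
    """Bag of words strictly between the two entity occurrences.

    Spans are (start, end) token indices, end exclusive; the entity
    tokens themselves are deliberately excluded from the feature.
    """
    start = left_entity_span[1]   # first token after the left entity
    end = right_entity_span[0]    # first token of the right entity
    return set(tokens[start:end])
```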
Hi, I've noticed that the direction of the arrows when you are building a corpus (labeling by hand) depends on the order in which you click the entity occurrences, but has no relation to the order the entity kinds have in the relation.
I don't know whether it's intentional; I just noticed it and wanted to leave a record.
Some features of the classifier in fact_extraction require stemming and/or lemmatization.
This is currently done using nltk.
Since CoreNLP already computes this information during preprocessing, it would be more efficient to save it into the database and then reuse it (instead of re-calculating it).