anhaidgroup / py_entitymatching

License: BSD 3-Clause "New" or "Revised" License

Python 28.25% Jupyter Notebook 45.56% PowerShell 0.13% Batchfile 0.12% Shell 0.01% HTML 25.33% C++ 0.42% Cython 0.16% MDX 0.01%

py_entitymatching's People

Contributors

abdaniel, anhaidgroup, anson-doan, christiemj09, dmvieira, fventuri-availity, hanli91, kjinxin, kvpradap, pjmartinkus, raghuramayya, sanjibkd, yashg19

py_entitymatching's Issues

How do you block a particular tuple?

I want to block a particular tuple pair in a candidate set, but I didn't find an available interface. The interface py_entitymatching.OverlapBlocker.block_tuples only checks whether a tuple pair would be blocked. Is there any way to do this?
Thank you!
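
A possible workaround (a sketch under assumptions, not an official interface): a candidate set is a plain pandas DataFrame, so a specific pair can be filtered out directly. A and B are the input tables, C the candidate set; the key column names and id values below are placeholders.

import py_entitymatching as em

# block_tuples only *tests* whether a single pair survives blocking:
ob = em.OverlapBlocker()
is_blocked = ob.block_tuples(A.loc[1], B.loc[2], 'title', 'title')

# To actually drop the pair (ltable id 1, rtable id 2) from candidate set C,
# filter the DataFrame; 'ltable_id'/'rtable_id' are placeholder key columns.
C2 = C[~((C['ltable_id'] == 1) & (C['rtable_id'] == 2))]
em.copy_properties(C, C2)  # carry over the catalog metadata to the filtered set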

Consider developing plan for dropping Python 2

Since Python 2 officially reached end of life on January 1, 2020, more and more high-profile packages are dropping support for it. A notable case is numpy: while there's some effort to "be nice" to Python 2 users, problems are already apparent in cases like the following:

# From py_entitymatching project root

# Default Python is 2.7.16
docker container run -dit --name conda-latest continuumio/miniconda:latest /bin/bash

# Try installing py-entitymatching dependencies in a clean environment
docker cp requirements.txt conda-latest:/
docker container attach conda-latest
apt-get update && apt-get install -y build-essential
pip install -r requirements.txt
# ^^ py-stringmatching tries installing latest version of numpy for build,
#     but latest version of numpy requires Python >= 3.5.

# Dirty fix
pip install numpy==1.16.2
pip install -r requirements.txt

The PyMatcher project should revisit its support for Python 2, possibly developing a plan for dropping support in the future.

Multiprocessing Bug on Windows

Tests for multi-threading enter an infinite loop when run on Windows machines. Find out what is causing the problem.
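
A hedged hypothesis (not confirmed in the issue): on Windows, multiprocessing uses the spawn start method, which re-imports the main module in every worker process; without a __main__ guard, each worker re-executes the pool-creating code and spawns workers recursively. A minimal illustration, unrelated to the project's actual test code:

from multiprocessing import Pool

def square(x):
    return x * x

if __name__ == '__main__':  # required on Windows to avoid recursive spawning
    with Pool(2) as pool:
        print(pool.map(square, range(4)))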

The table I used to train the model was the ACM_DBLP data set, and the table I evaluated on was the DBLP_Scholar data set. Finally, the prediction function reports the following error:

Metadata file already exists at D:\web\DATA\end-to-end/GGG.metadata. Overwriting it
Traceback (most recent call last):
  File "d:/workspace/graph_EA/DBLP_Scholar/entitymatching_DS.py", line 104, in <module>
    append=True, target_attr='predicted', inplace=False)
  File "C:\Users\周周\AppData\Roaming\Python\Python36\site-packages\py_entitymatching-0.3.2-py3.6-win-amd64.egg\py_entitymatching\matcher\mlmatcher.py", line 239, in predict
    y = self._predict_ex_attrs(table, exclude_attrs, return_prob=return_probs)
  File "C:\Users\周周\AppData\Roaming\Python\Python36\site-packages\py_entitymatching-0.3.2-py3.6-win-amd64.egg\py_entitymatching\matcher\mlmatcher.py", line 179, in _predict_ex_attrs
    res = self._predict_sklearn(x, check_rem=False, return_prob=return_prob)
  File "C:\Users\周周\AppData\Roaming\Python\Python36\site-packages\py_entitymatching-0.3.2-py3.6-win-amd64.egg\py_entitymatching\matcher\mlmatcher.py", line 137, in _predict_sklearn
    y = self.clf.predict(x)
  File "d:\anaconda\envs\pytorch\lib\site-packages\sklearn\tree\tree.py", line 430, in predict
    X = self._validate_X_predict(X, check_input)
  File "d:\anaconda\envs\pytorch\lib\site-packages\sklearn\tree\tree.py", line 402, in _validate_X_predict
    % (self.n_features_, n_features))
ValueError: Number of features of the model must match the input. Model n_features is 14 and input n_features is 23

The reason seems to be that the feature tables generated from the two data sets differ. But the number of columns in the two data sets is the same. Why does this happen, and how can it be solved? Thank you.
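
A plausible cause (an assumption, not a maintainer-confirmed diagnosis): the feature vectors used for prediction must be extracted with the same feature table that produced the training vectors; generating a fresh feature table per data set can yield a different number of feature columns even when the raw tables have the same number of columns. A minimal sketch, where A/B are the source tables, H the labeled candidate set, and L the candidate set to predict on (all placeholders):

import py_entitymatching as em

# Build the feature table once and reuse it for both extraction calls so the
# model and the evaluation vectors share the same feature columns.
F = em.get_features_for_matching(A, B)
train_vecs = em.extract_feature_vecs(H, feature_table=F, attrs_after='label')
test_vecs = em.extract_feature_vecs(L, feature_table=F)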

Bug in simfunctions.abs_norm(d1, d2)

When one of d1, d2 is negative and the other is zero, it gives a ZeroDivisionError.
Moreover, what is the correct way to compute abs_norm when one or both of d1 and d2 are negative?
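
A minimal reproduction sketch, assuming abs_norm is implemented as 1 - |d1 - d2| / max(d1, d2) (an assumption inferred from the reported behavior, not the library's exact code):

def abs_norm(d1, d2):
    # assumed implementation, for illustration only
    return 1.0 - abs(d1 - d2) / max(d1, d2)

abs_norm(-3, 0)  # max(-3, 0) == 0 -> ZeroDivisionError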

Update error strings in pickles.py

  1. In the following code in the save_object function,

if not isinstance(file_path, six.string_types):
    logger.error('Input file path is not of type string')
    raise AssertionError('Input file path is not of type string')

Update the error string with the input file path.

  2. In the following code in the load_object function,

if not isinstance(file_path, six.string_types):
    logger.error('Input file path is not of type string')
    raise AssertionError('Input file path is not of type string')

Update the error string with the input file path.

  3. In the following code in the load_table function,

if not isinstance(data_frame, pd.DataFrame):
    logging.error('Input object is not of type pandas DataFrame')
    raise AssertionError('Input object is not of type pandas DataFrame')

Update the error string with the type of the input data frame.

  4. In the following code in save_table,

if not isinstance(metadata_ext, six.string_types):
    logger.error('Input metadata ext is not of type string')
    raise AssertionError('Input metadata ext is not of type string')

Update the error string with the input metadata extension.

  5. In the following code in load_table,

if not isinstance(file_path, six.string_types):
    logger.error('Input file path is not of type string')
    raise AssertionError('Input file path is not of type string')

Update the error string with the input file path.

  6. In the following code in load_table,

if not isinstance(metadata_ext, six.string_types):
    logger.error('Input metadata ext is not of type string')
    raise AssertionError('Input metadata ext is not of type string')

Update the error string with the metadata ext.
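
For illustration, one hedged way the requested change could look for the file-path check (a sketch mirroring the snippets above, not the actual patch):

if not isinstance(file_path, six.string_types):
    # include the offending value in the message, as the issue requests
    error_msg = 'Input file path %r is not of type string' % (file_path,)
    logger.error(error_msg)
    raise AssertionError(error_msg)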

New Labeler Colors

When the user labels a tuple pair, the color of the new label flashes and then returns to the color of the old label. The tuple pair does not update until the user clicks one of the filter options in the top left corner (show yes, show no, etc.). This does not occur when using the show all filter option.

Update DataFrame.ix usage

Starting with pandas 0.20.0, the .ix indexer was deprecated. pandas 1.0.0, released on January 29, 2020, removed the deprecated indexer from the package. Don't pin py_entitymatching to an old version of pandas; instead, update usage of the deprecated indexer to .loc or .iloc.
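
For reference, a small sketch of the migration (the DataFrame and labels are placeholders):

import pandas as pd

df = pd.DataFrame({'title': ['a', 'b']}, index=[10, 20])

# df.ix[10, 'title']   # .ix: deprecated in 0.20.0, removed in 1.0.0
df.loc[10, 'title']    # label-based replacement
df.iloc[0, 0]          # position-based replacement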

Refactor error messages in py_entitymatching

The py_entitymatching package has a lot of duplicated code for raising errors. Specifically, we perform similar validations in many functions, and we currently repeat the error-checking code. It would be good to refactor these checks into small helper functions (placed in generic_helper.py).
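
A hedged sketch of what such a helper might look like (the name and signature are hypothetical, not the project's actual API):

import logging

logger = logging.getLogger(__name__)

def validate_object_type(obj, expected_type, name='Input object'):
    # one shared check instead of repeating the same three lines everywhere
    if not isinstance(obj, expected_type):
        error_msg = '%s is not of type %s' % (name, expected_type.__name__)
        logger.error(error_msg)
        raise AssertionError(error_msg)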

Newlines in Feature Creation

You can't include newlines when adding a custom feature: the parser cannot correctly handle such strings, even though newlines are fine in normal Python code.

Causes Error:

feature_string = """
jaccard(
...
)
"""

Works Correctly:

feature_string = """jaccard(...)"""

New Labeler AppVeyor bug

AppVeyor fails for both Python versions 2.7 and 3.4. For some reason, the new labeler test cases are not skipped for these Python versions in the AppVeyor tests on Windows.

'pip install py_entitymatching' error with command 'gcc' failed with exit status 1

Hi,

I encountered an error while installing the package; see details below.

Error message:

Command "/opt/conda/bin/python -u -c "import setuptools, tokenize;__file__='/tmp/pip-install-jagug70a/py-stringsimjoin/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-record-qmor4dym/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /tmp/pip-install-jagug70a/py-stringsimjoin/
error: command 'gcc' failed with exit status 1

macOS Mojave 10.14
Python 3.7.2

extract_feature_vecs() issue with cloudpickle + Fix

Hello, I am using Python 3.7 and cloudpickle 1.5.

When I ran the extract_feature_vecs() function, it yielded an error stating that cloudpickle has no attribute "dumps".

Below is, symbolically, what I was running:

import py_entitymatching as em

matching_features_df = em.extract_feature_vecs(candidate_table,
                                               feature_table=matching_features,
                                               show_progress=False)

All I had to do to fix this was change "from cloudpickle import cloudpickle" to "import cloudpickle" on line 14 of extractfeatures.py, and it solved my issue :)

Great package and I thought I would let others know of this fix.

New Labeler Pages

  1. The page number in the bottom left corner of the screen is shown with a decimal. It would probably be better to show an integer here.
  2. There is no limit on the pages shown, so the user is able to go to blank pages that don't contain any tuples.

New Labeler Scrolling

When the user scrolls down the page and then selects a label for a tuple, the labeler returns to the top of the page instead of staying at the same place the user left off.

Parallelize extract_feat_vecs

The function extract_feat_vecs converts the candidate set into a set of feature vectors. This function is embarrassingly parallelizable, since each candidate pair can be scored independently.

Refer to the blocker implementation for parallelization.
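
A hedged sketch of the idea using joblib (the helper names and the representation of feature functions as row-wise callables are assumptions, not the project's internals):

import pandas as pd
from joblib import Parallel, delayed

def _apply_features(chunk, feature_fns):
    # evaluate every feature function on each candidate pair in the chunk
    return pd.DataFrame({name: chunk.apply(fn, axis=1)
                         for name, fn in feature_fns.items()})

def extract_feat_vecs_parallel(candset, feature_fns, n_jobs=4):
    # rows are independent, so split the candidate set and score chunks in parallel
    chunks = [candset.iloc[i::n_jobs] for i in range(n_jobs)]
    parts = Parallel(n_jobs=n_jobs)(
        delayed(_apply_features)(c, feature_fns) for c in chunks)
    return pd.concat(parts).sort_index()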

Rule-Based Matcher and Triggers Tasks

Rule-Based Matcher

  • Coding
    1. Update __init__.py to import BooleanRuleMatcher
    2. Update booleanrulematcher.py to add rule-based matching and pass test cases
    3. Add parse_conjunct to generic_helper.py in utils. Have both the rule-based matcher and the blocker use it.
  • Testing
    1. Add test_rule_based_matcher.py
  • Documentation
    1. Update matchers.rst with ML sections and an RB section. See blocking.rst for an example.
    2. Update supported_matchers.rst with an ML section and an RB section
  • Jupyter Notebook
    1. Create a notebook for the RB Matcher

Triggers

  • Coding
    1. Update __init__.py to import MatchTrigger
    2. Update matchtrigger.py to add triggers and pass test cases
  • Testing
    1. Add test_match_trigger.py
  • Documentation
    1. Add triggers.rst
  • Jupyter Notebook
    1. Create a notebook for triggers

Update usage of pandas.np

Using NumPy referenced through pandas (pandas.np) raises a FutureWarning advising to import and use NumPy directly instead. Update existing usage of pandas.np to use numpy directly in order to prevent future dependency-based errors.
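
A small illustration of the change (the array call is just an example):

# before: NumPy reached through the pandas namespace (deprecated)
import pandas as pd
arr = pd.np.array([1, 2, 3])

# after: import NumPy directly
import numpy as np
arr = np.array([1, 2, 3])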

[Overlap blocker] The token "of" is not included in the stop_words set

I'm from CS638; we have been working on our project stage 4 using Magellan. Many of the existing stop words, such as "the", "and", etc., were common across multiple strings we used for blocking (particularly board game titles). The rem_stop_words code managed to prune these from the overlap calculations, thus reducing our false positives greatly.

However, many tuple pairs were formed from strings (board game titles) sharing the token "of". Perhaps "of" could be added to this otherwise fairly extensive list of stop_words?

Pull changes from RIT branch

Need to pull in the changes from the RIT branch. Specifically, the following needs to be pulled in:

  1. inclusion of xgboost as one of the matchers
  2. addition of return_probs to the predict command
  3. parallelization of cross validation
  4. bug fix in simfunctions.py

Update the following assertion errors in to_csv_metadata

In the following code,

if not isinstance(data_frame, pd.DataFrame):
    logging.error('Input dataframe is not of type pandas dataframe')
    raise AssertionError('Input dataframe is not of type pandas dataframe')

The error does not include the current type of the input. Update the error to include the type of the input object.

Further in this code,

if not isinstance(file_path, six.string_types):
    logger.error('Input file path is not of type string')
    raise AssertionError('Input file path is not of type string')

The input file path should be included in the error string by attempting to convert file_path to a string; if the conversion fails, the error message need not include the given file_path.
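
A hedged sketch of the best-effort behavior described above (illustrative only, not the actual patch):

if not isinstance(file_path, six.string_types):
    try:
        # best effort: include the offending value if it can be stringified
        error_msg = 'Input file path %s is not of type string' % str(file_path)
    except Exception:
        error_msg = 'Input file path is not of type string'
    logger.error(error_msg)
    raise AssertionError(error_msg)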

sklearn>=0.22 and Py3.8 compat: sklearn.preprocessing.Imputer has been removed

scikit-learn deprecated the sklearn.preprocessing.Imputer in 0.20.0 and removed it in 0.22.0:
https://github.com/scikit-learn/scikit-learn/blame/0.22.1/doc/whats_new/v0.20.rst#L1444-L1468

This means py_entitymatching is incompatible with scikit-learn >=0.22 due to
https://github.com/anhaidgroup/py_entitymatching/blob/v0.3.2/py_entitymatching/matcher/matcherutils.py#L11
and
https://github.com/anhaidgroup/py_entitymatching/blob/v0.3.2/py_entitymatching/matcher/matcherutils.py#L221-L224

This in turn means py_entitymatching does not fully work with Python 3.8, since scikit-learn <0.22 does not support that Python version.
ref: conda-forge/py_entitymatching-feedstock#3

The above referenced whats_new/v0.20.rst mentions sklearn.impute.SimpleImputer, sklearn.preprocessing.FunctionTransformer (for the axis keyword), and numpy.nan (for the missing_values keyword) as replacements.
I have no experience with scikit-learn so am not able to offer further assistance/PRs, unfortunately.
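
For reference, a sketch of the replacement suggested in the scikit-learn changelog (mean imputation as an example strategy):

import numpy as np
from sklearn.impute import SimpleImputer

# old (removed in scikit-learn 0.22):
#   from sklearn.preprocessing import Imputer
#   imp = Imputer(missing_values='NaN', strategy='mean', axis=0)

# new:
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
X_imputed = imp.fit_transform([[1.0, np.nan], [3.0, 4.0]])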

Update the assertion error in down sampling

The following code in down sampling raises an assertion error:

def _get_str_cols_list(table):
    if len(table) == 0:
        logger.error('_get_str_cols_list: Size of the input table is 0')
        raise AssertionError('_get_str_cols_list: Size of the input table is 0')

There are two problems with this assertion error:

  1. The assertion error is not reachable, as the down_sample function already checks the table lengths.
  2. The log message and the error need not include the "_get_str_cols_list:" prefix, since the stack trace already shows where the error originated.

Proposed solution:
For now, fix the English in _get_str_cols_list; specifically, remove the "_get_str_cols_list:" prefix.

New Labeler Counts

  1. When a user labels a tuple, the counts in the top left corner of the screen (number of yes, no, unsure, unlabeled, and all tuples) are reset, except for the category of the tuple's new label (yes, no, or unsure). This bug does not occur when the user is in the show all view.
  2. When the user changes the display mode (horizontal, vertical, one at a time), the counts in the top left corner of the screen are reset, except for the category the user is currently viewing. Again, this does not occur when the user is in the show all view.

Utilising the underlying dask machinery

It looks like many of the computationally intensive tasks (blocking mechanisms, feature value extraction) are reimplemented using Dask (in the dask submodule), but they are not installed when installing the package via pip. Have they been omitted purposefully due to the lack of testing on them (as suggested by the functions' docstrings), or is there a way (for instance, setting some installation flags) to install the Dask utilities?

efficiency of rule-based blocker

Hello,

It seems that the rule-based blocker iterates through all pairs of tuples from the two DataFrames, which in effect implies a run time of O(N * M), where N and M are the numbers of rows in the two tables. Isn't the purpose of the blocker to reduce the candidate set size and avoid computation over all pairs in the Cartesian product space?

It would be great if you could clarify which of the blockers iterate through all possible pairs and which do not.

Thanks,

Dependencies

The scikit-learn dependency should be no more than 0.20, since Imputer no longer exists. (Enforce a 0.18 version?)

Return True Probability

Thank you for adding the return_probs option.

But it seems the returned probability relates to the predicted label (if the prediction is True, the probability is the probability of True).

I want the probability of True in all cases.
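
A hedged workaround sketch using the underlying scikit-learn classifier directly; the tiny training set exists only to make the snippet self-contained, and the positive class being labeled 1 is an assumption:

import numpy as np
from sklearn.linear_model import LogisticRegression

X_train = np.array([[0.1], [0.9], [0.2], [0.8]])
y_train = np.array([0, 1, 0, 1])
clf = LogisticRegression().fit(X_train, y_train)

# predict_proba returns one column per class, ordered as in clf.classes_;
# selecting the positive class's column yields P(True) for every pair.
pos_col = list(clf.classes_).index(1)
proba_true = clf.predict_proba(X_train)[:, pos_col]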

Make Dask modules "extras"

Ports of core py_entitymatching modules to leverage Dask exist and are packaged as py_entitymatching.dask, but they throw "experimental" warnings upon instantiation, do not have unit tests, and have their own set of dependencies. Make the existing experimental Dask modules "extras" for a py_entitymatching installation.
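
A hypothetical sketch of how the packaging side could look in setup.py (the extra's name and its dependency list are assumptions, not the project's actual configuration):

from setuptools import setup, find_packages

setup(
    name='py_entitymatching',
    packages=find_packages(),
    # installed only on request, e.g.: pip install "py_entitymatching[dask]"
    extras_require={
        'dask': ['dask[dataframe]', 'cloudpickle'],
    },
)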

Release plan 0.1.0

  • Initial release of py_entitymatching
  • Include commands to support matching two tables using supervised learning.

GUI error and foreign key errors

Dear Team,

While preparing the gold data set, the PyQt module doesn't seem to work when we run it on a cloud server. And after manually generating the data set, running em.train_test_split(...) on the training data set throws a foreign key error. Please guide me through this.

Thanks,
Karunakar Chinnabathini.

[Bug][OverlapBlocker] rem_stop_words attribute to block_candset, block_tables not working

Another issue to share from an annoying CS638 student; slightly more pressing.

When using the OverlapBlocker class, we noticed the rem_stop_words attribute in block_candset():
rem_stop_words (boolean): A flag to indicate whether stop words (e.g., a, an, the) should be removed from the token sets of the overlap attribute values (defaults to False).

However, we noticed that, regardless of the attribute's value ("rem_stop_words=False" or "rem_stop_words=True"), this processing always runs, pruning the stop words from the token sets.

Upon inspection, it looks like the rem_stop_words attribute is passed down but never used. In particular, the stop word pruning seems to occur here, regardless of the attribute's value:

https://github.com/anhaidgroup/py_entitymatching/blob/master/py_entitymatching/blocker/overlap_blocker.py#L584-L586

And:

https://github.com/anhaidgroup/py_entitymatching/blob/master/py_entitymatching/blocker/overlap_blocker.py#L601-L604

Perhaps an if statement should be used here to apply this "stop word pruning" conditionally?
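
Something like the following guard is presumably what the reporter has in mind (a sketch; the function and variable names are assumptions, not the blocker's actual internals):

def prune_tokens(tokens, stop_words, rem_stop_words):
    # apply stop-word pruning only when the caller asked for it
    if rem_stop_words:
        return [t for t in tokens if t not in stop_words]
    return tokens

print(prune_tokens(['game', 'of', 'thrones'], {'a', 'an', 'the', 'of'}, True))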

Update assertion error in read_csv_metadata

The following assertion error needs to be updated in read_csv_metadata.

In the following code,

if not isinstance(file_path, six.string_types):
    logger.error('Input file path is not of type string')
    raise AssertionError('Input file path is not of type string')

The error is thrown if the file path is not of type string. It would be good to include what the input is (on a best-effort basis). Specifically, if the input file_path can be converted to a string, include that in the error message; otherwise, emit the error message without the input file path.
