anhaidgroup / py_entitymatching

License: BSD 3-Clause "New" or "Revised" License

Python 28.25% Jupyter Notebook 45.56% PowerShell 0.13% Batchfile 0.12% Shell 0.01% HTML 25.33% C++ 0.42% Cython 0.16% MDX 0.01%

py_entitymatching's People

Contributors

abdaniel, anhaidgroup, anson-doan, christiemj09, dmvieira, fventuri-availity, hanli91, kjinxin, kvpradap, pjmartinkus, raghuramayya, sanjibkd, yashg19

py_entitymatching's Issues

How do you block a particular tuple?

I want to block a particular tuple pair in a candidate set, but I didn't find an available interface. The interface py_entitymatching.OverlapBlocker.block_tuples only checks whether a tuple pair would be blocked. Is there any way to do this?
Thank you!
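
A possible workaround (a sketch under assumptions, not an official interface): a candidate set is a plain pandas DataFrame, so a specific pair can be filtered out directly. A and B are the input tables, C the candidate set; the key column names and id values below are placeholders.

import py_entitymatching as em

# block_tuples only *tests* whether a single pair survives blocking:
ob = em.OverlapBlocker()
is_blocked = ob.block_tuples(A.loc[1], B.loc[2], 'title', 'title')

# To actually drop the pair (ltable id 1, rtable id 2) from candidate set C,
# filter the DataFrame; 'ltable_id'/'rtable_id' are placeholder key columns.
C2 = C[~((C['ltable_id'] == 1) & (C['rtable_id'] == 2))]
em.copy_properties(C, C2)  # carry over the catalog metadata to the filtered set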

Consider developing plan for dropping Python 2

Since Python 2 officially reached end of life on January 1, 2020, more and more high-profile packages are dropping support for it. A notable case is numpy: while there's some effort to "be nice" to Python 2 users, problems are already apparent in cases like the following:

# From py_entitymatching project root

# Default Python is 2.7.16
docker container run -dit --name conda-latest continuumio/miniconda:latest /bin/bash

# Try installing py-entitymatching dependencies in a clean environment
docker cp requirements.txt conda-latest:/
docker container attach conda-latest
apt-get update && apt-get install -y build-essential
pip install -r requirements.txt
# ^^ py-stringmatching tries installing latest version of numpy for build,
#     but latest version of numpy requires Python >= 3.5.

# Dirty fix
pip install numpy==1.16.2
pip install -r requirements.txt

The PyMatcher project should revisit its support for Python 2, possibly developing a plan for dropping support in the future.

Multiprocessing Bug on Windows

Tests for multi-threading enter an infinite loop when run on Windows machines. Find out what is causing the problem.
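
A hedged hypothesis (not confirmed in the issue): on Windows, multiprocessing uses the spawn start method, which re-imports the main module in every worker process; without a __main__ guard, each worker re-executes the pool-creating code and spawns workers recursively. A minimal illustration, unrelated to the project's actual test code:

from multiprocessing import Pool

def square(x):
    return x * x

if __name__ == '__main__':  # required on Windows to avoid recursive spawning
    with Pool(2) as pool:
        print(pool.map(square, range(4)))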

The table I used to train the model was the ACM_DBLP data set, and the table I evaluated on was the DBLP_Scholar data set. Finally, the prediction function reports the following error:

Metadata file already exists at D:\web\DATA\end-to-end/GGG.metadata. Overwriting it
Traceback (most recent call last):
  File "d:/workspace/graph_EA/DBLP_Scholar/entitymatching_DS.py", line 104, in <module>
    append=True, target_attr='predicted', inplace=False)
  File "C:\Users\周周\AppData\Roaming\Python\Python36\site-packages\py_entitymatching-0.3.2-py3.6-win-amd64.egg\py_entitymatching\matcher\mlmatcher.py", line 239, in predict
    y = self._predict_ex_attrs(table, exclude_attrs, return_prob=return_probs)
  File "C:\Users\周周\AppData\Roaming\Python\Python36\site-packages\py_entitymatching-0.3.2-py3.6-win-amd64.egg\py_entitymatching\matcher\mlmatcher.py", line 179, in _predict_ex_attrs
    res = self._predict_sklearn(x, check_rem=False, return_prob=return_prob)
  File "C:\Users\周周\AppData\Roaming\Python\Python36\site-packages\py_entitymatching-0.3.2-py3.6-win-amd64.egg\py_entitymatching\matcher\mlmatcher.py", line 137, in _predict_sklearn
    y = self.clf.predict(x)
  File "d:\anaconda\envs\pytorch\lib\site-packages\sklearn\tree\tree.py", line 430, in predict
    X = self._validate_X_predict(X, check_input)
  File "d:\anaconda\envs\pytorch\lib\site-packages\sklearn\tree\tree.py", line 402, in _validate_X_predict
    % (self.n_features_, n_features))
ValueError: Number of features of the model must match the input. Model n_features is 14 and input n_features is 23

The reason seems to be that the feature tables generated from the two data sets differ. But the number of columns in the two data sets is the same. Why does this happen, and how can it be solved? Thank you.
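
A plausible cause (an assumption, not a maintainer-confirmed diagnosis): the feature vectors used for prediction must be extracted with the same feature table that produced the training vectors; generating a fresh feature table per data set can yield a different number of feature columns even when the raw tables have the same number of columns. A minimal sketch, where A/B are the source tables, H the labeled candidate set, and L the candidate set to predict on (all placeholders):

import py_entitymatching as em

# Build the feature table once and reuse it for both extraction calls so the
# model and the evaluation vectors share the same feature columns.
F = em.get_features_for_matching(A, B)
train_vecs = em.extract_feature_vecs(H, feature_table=F, attrs_after='label')
test_vecs = em.extract_feature_vecs(L, feature_table=F)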

Bug in simfunctions.abs_norm(d1, d2)

When one of d1, d2 is negative and the other is zero, it gives a ZeroDivisionError.
Moreover, what is the correct way to compute abs_norm when one or both of d1 and d2 are negative?
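
A minimal reproduction sketch, assuming abs_norm is implemented as 1 - |d1 - d2| / max(d1, d2) (an assumption inferred from the reported behavior, not the library's exact code):

def abs_norm(d1, d2):
    # assumed implementation, for illustration only
    return 1.0 - abs(d1 - d2) / max(d1, d2)

abs_norm(-3, 0)  # max(-3, 0) == 0 -> ZeroDivisionError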

Update error strings in pickles.py

  1. In the following code in the save_object function,

if not isinstance(file_path, six.string_types):
    logger.error('Input file path is not of type string')
    raise AssertionError('Input file path is not of type string')

Update the error string with the input file path.

  2. In the following code in the load_object function,

if not isinstance(file_path, six.string_types):
    logger.error('Input file path is not of type string')
    raise AssertionError('Input file path is not of type string')

Update the error string with the input file path.

  3. In the following code in the load_table function,

if not isinstance(data_frame, pd.DataFrame):
    logging.error('Input object is not of type pandas DataFrame')
    raise AssertionError('Input object is not of type pandas DataFrame')

Update the error string with the type of the input data frame.

  4. In the following code in save_table,

if not isinstance(metadata_ext, six.string_types):
    logger.error('Input metadata ext is not of type string')
    raise AssertionError('Input metadata ext is not of type string')

Update the error string with the input metadata extension.

  5. In the following code in load_table,

if not isinstance(file_path, six.string_types):
    logger.error('Input file path is not of type string')
    raise AssertionError('Input file path is not of type string')

Update the error string with the input file path.

  6. In the following code in load_table,

if not isinstance(metadata_ext, six.string_types):
    logger.error('Input metadata ext is not of type string')
    raise AssertionError('Input metadata ext is not of type string')

Update the error string with the metadata ext.
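
For illustration, one hedged way the requested change could look for the file-path check (a sketch mirroring the snippets above, not the actual patch):

if not isinstance(file_path, six.string_types):
    # include the offending value in the message, as the issue requests
    error_msg = 'Input file path %r is not of type string' % (file_path,)
    logger.error(error_msg)
    raise AssertionError(error_msg)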

New Labeler Colors

When the user labels a tuple pair, the color of the new label flashes and then returns to the color of the old label. The tuple pair does not update until the user clicks one of the filter options in the top left corner (show yes, show no, etc.). This does not occur when using the show all filter option.

Update DataFrame.ix usage

Starting with pandas 0.20.0, the .ix indexer was deprecated. pandas 1.0.0, released on January 29, 2020, removed the deprecated indexer from the package. Don't pin py_entitymatching to an old version of pandas; instead, update usage of the deprecated indexer to .loc or .iloc.
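
For reference, a small sketch of the migration (the DataFrame and labels are placeholders):

import pandas as pd

df = pd.DataFrame({'title': ['a', 'b']}, index=[10, 20])

# df.ix[10, 'title']   # .ix: deprecated in 0.20.0, removed in 1.0.0
df.loc[10, 'title']    # label-based replacement
df.iloc[0, 0]          # position-based replacement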

Refactor error messages in py_entitymatching

The py_entitymatching package has a lot of duplicated code for raising errors. Specifically, we perform similar validations in many functions, and we currently repeat the error-checking code. It would be good to refactor these checks into small helper functions (placed in generic_helper.py).
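
A hedged sketch of what such a helper might look like (the name and signature are hypothetical, not the project's actual API):

import logging

logger = logging.getLogger(__name__)

def validate_object_type(obj, expected_type, name='Input object'):
    # one shared check instead of repeating the same three lines everywhere
    if not isinstance(obj, expected_type):
        error_msg = '%s is not of type %s' % (name, expected_type.__name__)
        logger.error(error_msg)
        raise AssertionError(error_msg)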

Newlines in Feature Creation

You can't include newlines when adding a custom feature: the parser cannot correctly handle such strings, even though newlines are fine in normal Python code.

Causes Error:

feature_string = """
jaccard(
...
)
"""

Works Correctly:

feature_string = """jaccard(...)"""

New Labeler AppVeyor bug

AppVeyor fails for both Python versions 2.7 and 3.4. For some reason, the new labeler test cases are not skipped for these Python versions in the AppVeyor tests on Windows.

'pip install py_entitymatching' error with command 'gcc' failed with exit status 1

Hi,

I encountered an error while installing the package; see details below.

Error message:

Command "/opt/conda/bin/python -u -c "import setuptools, tokenize;__file__='/tmp/pip-install-jagug70a/py-stringsimjoin/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-record-qmor4dym/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /tmp/pip-install-jagug70a/py-stringsimjoin/
error: command 'gcc' failed with exit status 1

macOS Mojave 10.14
Python 3.7.2

extract_feature_vecs() issue with cloudpickle + Fix

Hello, I am using Python 3.7 and cloudpickle 1.5.

When I ran the extract_feature_vecs() function, it yielded an error stating that cloudpickle has no attribute "dumps".

Below is, symbolically, what I was running:

import py_entitymatching as em

matching_features_df = em.extract_feature_vecs(candidate_table,
                                               feature_table=matching_features,
                                               show_progress=False)

All I had to do to fix this was change "from cloudpickle import cloudpickle" to "import cloudpickle" on line 14 of extractfeatures.py, and it solved my issue :)

Great package and I thought I would let others know of this fix.

New Labeler Pages

  1. The page number in the bottom left corner of the screen is shown with a decimal. It would probably be better to show an integer here.
  2. There is no limit on the pages shown, so the user is able to go to blank pages that don't contain any tuples.

New Labeler Scrolling

When the user scrolls down the page and then selects a label for a tuple, the labeler returns to the top of the page instead of staying at the same place the user left off.

Parallelize extract_feat_vecs

The function extract_feat_vecs converts the candidate set into a set of feature vectors. This function is embarrassingly parallelizable, since each candidate pair can be scored independently.

Refer to the blocker implementation for parallelization.
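
A hedged sketch of the idea using joblib (the helper names and the representation of feature functions as row-wise callables are assumptions, not the project's internals):

import pandas as pd
from joblib import Parallel, delayed

def _apply_features(chunk, feature_fns):
    # evaluate every feature function on each candidate pair in the chunk
    return pd.DataFrame({name: chunk.apply(fn, axis=1)
                         for name, fn in feature_fns.items()})

def extract_feat_vecs_parallel(candset, feature_fns, n_jobs=4):
    # rows are independent, so split the candidate set and score chunks in parallel
    chunks = [candset.iloc[i::n_jobs] for i in range(n_jobs)]
    parts = Parallel(n_jobs=n_jobs)(
        delayed(_apply_features)(c, feature_fns) for c in chunks)
    return pd.concat(parts).sort_index()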

Rule-Based Matcher and Triggers Tasks

Rule-Based Matcher

  • Coding
    1. Update __init__.py to import BooleanRuleMatcher
    2. Update booleanrulematcher.py to add rule-based matching and pass test cases
    3. Add parse_conjunct to generic_helper.py in utils. Have both the rule-based matcher and the blocker use it.
  • Testing
    1. Add test_rule_based_matcher.py
  • Documentation
    1. Update matchers.rst with ML sections and an RB section. See blocking.rst for an example.
    2. Update supported_matchers.rst with an ML section and an RB section
  • Jupyter Notebook
    1. Create a notebook for the RB Matcher

Triggers

  • Coding
    1. Update __init__.py to import MatchTrigger
    2. Update matchtrigger.py to add triggers and pass test cases
  • Testing
    1. Add test_match_trigger.py
  • Documentation
    1. Add triggers.rst
  • Jupyter Notebook
    1. Create a notebook for triggers

Update usage of pandas.np

Using NumPy referenced through pandas (pandas.np) raises a FutureWarning advising to import and use NumPy directly instead. Update existing usage of pandas.np to use numpy directly in order to prevent future dependency-based errors.
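
A small illustration of the change (the array call is just an example):

# before: NumPy reached through the pandas namespace (deprecated)
import pandas as pd
arr = pd.np.array([1, 2, 3])

# after: import NumPy directly
import numpy as np
arr = np.array([1, 2, 3])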

[Overlap blocker] The token "of" is not included in the stop_words set

I'm from CS638; we have been working on our project stage 4 using Magellan. Many of the existing stop words, such as "the", "and", etc., were common across multiple strings we used for blocking (particularly board game titles). The rem_stop_words code managed to prune these from the overlap calculations, thus reducing our false positives greatly.

However, many tuple pairs were formed from strings (board game titles) sharing the token "of". Perhaps "of" could be added to this otherwise fairly extensive list of stop_words?

Pull changes from RIT branch

Need to pull in the changes from the RIT branch. Specifically, the following needs to be pulled in:

  1. inclusion of xgboost as one of the matchers
  2. addition of return_probs to the predict command
  3. parallelization of cross validation
  4. bug fix in simfunctions.py

Update the following assertion errors in to_csv_metadata

In the following code,

if not isinstance(data_frame, pd.DataFrame):
    logging.error('Input dataframe is not of type pandas dataframe')
    raise AssertionError('Input dataframe is not of type pandas dataframe')

The error does not include the current type of the input. Update the error to include the type of the input object.

Further in this code,

if not isinstance(file_path, six.string_types):
    logger.error('Input file path is not of type string')
    raise AssertionError('Input file path is not of type string')

The input file path should be included in the error string by attempting to convert file_path to a string; if the conversion fails, the error message need not include the given file_path.
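
A hedged sketch of the best-effort behavior described above (illustrative only, not the actual patch):

if not isinstance(file_path, six.string_types):
    try:
        # best effort: include the offending value if it can be stringified
        error_msg = 'Input file path %s is not of type string' % str(file_path)
    except Exception:
        error_msg = 'Input file path is not of type string'
    logger.error(error_msg)
    raise AssertionError(error_msg)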

sklearn>=0.22 and Py3.8 compat: sklearn.preprocessing.Imputer has been removed

scikit-learn deprecated the sklearn.preprocessing.Imputer in 0.20.0 and removed it in 0.22.0:
https://github.com/scikit-learn/scikit-learn/blame/0.22.1/doc/whats_new/v0.20.rst#L1444-L1468

This means py_entitymatching is incompatible with scikit-learn >=0.22 due to
https://github.com/anhaidgroup/py_entitymatching/blob/v0.3.2/py_entitymatching/matcher/matcherutils.py#L11
and
https://github.com/anhaidgroup/py_entitymatching/blob/v0.3.2/py_entitymatching/matcher/matcherutils.py#L221-L224

This in turn means py_entitymatching does not fully work with Python 3.8, since scikit-learn <0.22 does not support that Python version.
ref: conda-forge/py_entitymatching-feedstock#3

The above referenced whats_new/v0.20.rst mentions sklearn.impute.SimpleImputer, sklearn.preprocessing.FunctionTransformer (for the axis keyword), and numpy.nan (for the missing_values keyword) as replacements.
I have no experience with scikit-learn so am not able to offer further assistance/PRs, unfortunately.
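
For reference, a sketch of the replacement suggested in the scikit-learn changelog (mean imputation as an example strategy):

import numpy as np
from sklearn.impute import SimpleImputer

# old (removed in scikit-learn 0.22):
#   from sklearn.preprocessing import Imputer
#   imp = Imputer(missing_values='NaN', strategy='mean', axis=0)

# new:
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
X_imputed = imp.fit_transform([[1.0, np.nan], [3.0, 4.0]])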

Update the assertion error in down sampling

The following code in down sampling raises an assertion error:

def _get_str_cols_list(table):
    if len(table) == 0:
        logger.error('_get_str_cols_list: Size of the input table is 0')
        raise AssertionError('_get_str_cols_list: Size of the input table is 0')

There are two problems with this assertion error:

  1. The assertion error is not reachable, as the down_sample function already checks the table lengths.
  2. The log message and the error need not include the "_get_str_cols_list:" prefix, since the stack trace already shows where the error originated.

Proposed solution:
For now, fix the English in _get_str_cols_list; specifically, remove the "_get_str_cols_list:" prefix.

New Labeler Counts

  1. When a user labels a tuple, the counts in the top left corner of the screen (number of yes, no, unsure, unlabeled, and all tuples) are reset, except for the category of the tuple's new label (yes, no, or unsure). This bug does not occur when the user is in the show all view.
  2. When the user changes the display mode (horizontal, vertical, one at a time), the counts in the top left corner of the screen are reset, except for the category the user is currently viewing. Again, this does not occur when the user is in the show all view.

Utilising the underlying dask machinery

It looks like many of the computationally intensive tasks (blocking mechanisms, feature value extraction) are reimplemented using Dask (in the dask submodule), but they are not installed when installing the package via pip. Have they been omitted purposefully due to the lack of testing on them (as suggested by the functions' docstrings), or is there a way (for instance, setting some installation flags) to install the Dask utilities?

efficiency of rule-based blocker

Hello,

It seems that the rule-based blocker iterates through all pairs of tuples from the two DataFrames, which in effect implies a run time of O(N * M), where N and M are the numbers of rows in the two tables. Isn't the purpose of the blocker to reduce the candidate set size and avoid computation over all pairs in the Cartesian product space?

It would be great if you could clarify which of the blockers iterate through all possible pairs and which do not.

Thanks,

Dependencies

The scikit-learn dependency should be no more than 0.20, since Imputer no longer exists. (Enforce a 0.18 version?)

Return True Probability

Thank you for adding the return_probs option.

But it seems the returned probability relates to the predicted label (if the prediction is True, the probability is the probability of True).

I want the probability of True in all cases.
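
A hedged workaround sketch using the underlying scikit-learn classifier directly; the tiny training set exists only to make the snippet self-contained, and the positive class being labeled 1 is an assumption:

import numpy as np
from sklearn.linear_model import LogisticRegression

X_train = np.array([[0.1], [0.9], [0.2], [0.8]])
y_train = np.array([0, 1, 0, 1])
clf = LogisticRegression().fit(X_train, y_train)

# predict_proba returns one column per class, ordered as in clf.classes_;
# selecting the positive class's column yields P(True) for every pair.
pos_col = list(clf.classes_).index(1)
proba_true = clf.predict_proba(X_train)[:, pos_col]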

Make Dask modules "extras"

Ports of core py_entitymatching modules to leverage Dask exist and are packaged as py_entitymatching.dask, but they throw "experimental" warnings upon instantiation, do not have unit tests, and have their own set of dependencies. Make the existing experimental Dask modules "extras" for a py_entitymatching installation.
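
A hypothetical sketch of how the packaging side could look in setup.py (the extra's name and its dependency list are assumptions, not the project's actual configuration):

from setuptools import setup, find_packages

setup(
    name='py_entitymatching',
    packages=find_packages(),
    # installed only on request, e.g.: pip install "py_entitymatching[dask]"
    extras_require={
        'dask': ['dask[dataframe]', 'cloudpickle'],
    },
)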

Release plan 0.1.0

  • Initial release of py_entitymatching
  • Include commands to support matching two tables using supervised learning.

GUI error and foreign key errors

Dear Team,

While preparing the gold data set, the PyQt module doesn't seem to work when we run it on a cloud server. And after manually generating the data set, running em.train_test_split(...) on the training data set throws a foreign key error. Please guide me through this.

Thanks,
Karunakar Chinnabathini.

[Bug][OverlapBlocker] rem_stop_words attribute to block_candset, block_tables not working

Another issue to share from an annoying CS638 student; slightly more pressing.

When using the OverlapBlocker class, we noticed the rem_stop_words attribute in block_candset():
rem_stop_words (boolean): A flag to indicate whether stop words (e.g., a, an, the) should be removed from the token sets of the overlap attribute values (defaults to False).

However, we noticed that, regardless of the attribute's value ("rem_stop_words=False" or "rem_stop_words=True"), this processing always runs, pruning the stop words from the token sets.

Upon inspection, it looks like the rem_stop_words attribute is passed down but never used. In particular, the stop word pruning seems to occur here, regardless of the attribute's value:

https://github.com/anhaidgroup/py_entitymatching/blob/master/py_entitymatching/blocker/overlap_blocker.py#L584-L586

And:

https://github.com/anhaidgroup/py_entitymatching/blob/master/py_entitymatching/blocker/overlap_blocker.py#L601-L604

Perhaps an if statement should be used here to apply this "stop word pruning" conditionally?
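
Something like the following guard is presumably what the reporter has in mind (a sketch; the function and variable names are assumptions, not the blocker's actual internals):

def prune_tokens(tokens, stop_words, rem_stop_words):
    # apply stop-word pruning only when the caller asked for it
    if rem_stop_words:
        return [t for t in tokens if t not in stop_words]
    return tokens

print(prune_tokens(['game', 'of', 'thrones'], {'a', 'an', 'the', 'of'}, True))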

Update assertion error in read_csv_metadata

The following assertion error needs to be updated in read_csv_metadata.

In the following code,

if not isinstance(file_path, six.string_types):
    logger.error('Input file path is not of type string')
    raise AssertionError('Input file path is not of type string')

The error is thrown if the file path is not of type string. It would be good to include what the input is (on a best-effort basis). Specifically, if the input file_path can be converted to a string, include that in the error message; otherwise, emit the error message without the input file path.
