anhaidgroup / py_entitymatching
License: BSD 3-Clause "New" or "Revised" License
Installing from Source Distribution
Download the py_entitymatching package from here
I want to block a particular tuple in a candidate set, but I didn't find an available interface. The interface py_entitymatching.OverlapBlocker.block_tuples only checks whether a tuple pair would be blocked. Is there any way to do this?
Thank you!
Since Python 2 officially reached end of life on January 1, 2020, more and more high-profile packages are dropping support for it. A notable case is numpy; while there is some effort to "be nice" to Python 2 users, problems are already apparent in cases like the following:
# From py_entitymatching project root
# Default Python is 2.7.16
docker container run -dit --name conda-latest continuumio/miniconda:latest /bin/bash
# Try installing py-entitymatching dependencies in a clean environment
docker cp requirements.txt conda-latest:/
docker container attach conda-latest
apt-get update && apt-get install -y build-essential
pip install -r requirements.txt
# ^^ py-stringmatching tries installing latest version of numpy for build,
# but latest version of numpy requires Python >= 3.5.
# Dirty fix
pip install numpy==1.16.2
pip install -r requirements.txt
The PyMatcher project should revisit its support for Python 2, possibly developing a plan for dropping support in the future.
Tests for multi-threading start an infinite loop when running the test suite on Windows machines. Find out what is causing the problem.
Metadata file already exists at D:\web\DATA\end-to-end/GGG.metadata. Overwriting it
Traceback (most recent call last):
File "d:/workspace/graph_EA/DBLP_Scholar/entitymatching_DS.py", line 104, in
append=True, target_attr='predicted', inplace=False)
File "C:\Users\周周\AppData\Roaming\Python\Python36\site-packages\py_entitymatching-0.3.2-py3.6-win-amd64.egg\py_entitymatching\matcher\mlmatcher.py", line 239, in predict
y = self._predict_ex_attrs(table, exclude_attrs, return_prob=return_probs)
File "C:\Users\周周\AppData\Roaming\Python\Python36\site-packages\py_entitymatching-0.3.2-py3.6-win-amd64.egg\py_entitymatching\matcher\mlmatcher.py", line 179, in _predict_ex_attrs
res = self._predict_sklearn(x, check_rem=False, return_prob=return_prob)
File "C:\Users\周周\AppData\Roaming\Python\Python36\site-packages\py_entitymatching-0.3.2-py3.6-win-amd64.egg\py_entitymatching\matcher\mlmatcher.py", line 137, in _predict_sklearn
y = self.clf.predict(x)
File "d:\anaconda\envs\pytorch\lib\site-packages\sklearn\tree\tree.py", line 430, in predict
X = self._validate_X_predict(X, check_input)
File "d:\anaconda\envs\pytorch\lib\site-packages\sklearn\tree\tree.py", line 402, in validate_X_predict
% (self.n_features, n_features))
ValueError: Number of features of the model must match the input. Model n_features is 14 and input n_features is 23
The reason seems to be that the feature tables generated for the two data sets are different, but the number of columns in the two data sets is the same. Why does this happen, and how can it be solved? Thank you.
When one of d1, d2 is negative and the other is zero, abs_norm raises a ZeroDivisionError.
Moreover, what is the correct way to calculate abs_norm when one or both of d1 and d2 are negative?
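For reference, here is a minimal sketch of an abs_norm-style similarity (a hypothetical re-implementation, not the library's exact code) that reproduces the failure:

```python
def abs_norm_sketch(d1, d2):
    # Hypothetical abs_norm-style measure: 1 - |d1 - d2| / max(d1, d2).
    # The normalizer max(d1, d2) is 0 when one input is negative and
    # the other is zero, hence the ZeroDivisionError.
    return 1.0 - abs(d1 - d2) / max(d1, d2)

print(abs_norm_sketch(3, 4))  # 0.75
try:
    abs_norm_sketch(-1, 0)    # max(-1, 0) == 0
except ZeroDivisionError:
    print("ZeroDivisionError for d1=-1, d2=0")
```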
if not isinstance(file_path, six.string_types):
logger.error('Input file path is not of type string')
raise AssertionError('Input file path is not of type string')
Update the error string with the input file path.
if not isinstance(data_frame, pd.DataFrame):
logging.error('Input object is not of type pandas DataFrame')
raise AssertionError('Input object is not of type pandas DataFrame')
Update the error string with the type of input data frame.
In the following code in save_table,
if not isinstance(metadata_ext, six.string_types):
logger.error('Input metadata ext is not of type string')
raise AssertionError('Input metadata ext is not of type string')
Update the error string with the input metadata extension.
In the following code in load_table,
if not isinstance(file_path, six.string_types):
logger.error('Input file path is not of type string')
raise AssertionError('Input file path is not of type string')
Update the error string with input file path
In the following code in load_table
if not isinstance(metadata_ext, six.string_types):
logger.error('Input metadata ext is not of type string')
raise AssertionError('Input metadata ext is not of type string')
Update the error string with metadata ext.
When the user labels a tuple pair, the color of the new label flashes and then returns to the color of the old label. The tuple pair does not update until the user clicks one of the filter options in the top left corner (show yes, show no, etc.). This does not occur when using the show all filter option.
Starting with pandas 0.20.0, the .ix indexer was deprecated. Pandas 1.0.0, released on January 29, 2020, removed the deprecated indexer from the package. Don't pin py_entitymatching to an old version of pandas; instead, update usage of the deprecated indexer to .loc or .iloc.
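For example, a .ix call can usually be rewritten with .loc (label-based) or .iloc (position-based):

```python
import pandas as pd

df = pd.DataFrame({"name": ["apple", "banana"], "price": [1.0, 0.5]},
                  index=[10, 20])

# Old style (removed in pandas 1.0): df.ix[10, "price"]
# Label-based replacement:
print(df.loc[10, "price"])   # 1.0
# Position-based replacement:
print(df.iloc[0, 1])         # 1.0
```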
The py_entitymatching package has a lot of duplicated code for raising errors. Specifically, we perform similar validations in many functions, and we currently repeat the error-checking code. It would be good to refactor these checks into smaller functions (and put them in generic_helper.py).
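One possible shape for such a helper (the name validate_type and its exact message format are assumptions, not the library's API):

```python
import logging

logger = logging.getLogger(__name__)

def validate_type(value, expected_type, name):
    # Hypothetical consolidated check for generic_helper.py: log and
    # raise a single, uniformly formatted AssertionError.
    if not isinstance(value, expected_type):
        msg = 'Input %s is not of type %s (got %s)' % (
            name, expected_type.__name__, type(value).__name__)
        logger.error(msg)
        raise AssertionError(msg)

# Each call site then shrinks to one line:
validate_type('table.csv', str, 'file path')  # passes silently
```

This would also address the many issues below asking that error strings include the offending value, since the message only needs to be fixed in one place.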
The tests in py_entitymatching/tests/_test_matcherselector_mlmatcherselection_xg.py are outdated and do not conform to the current py_entitymatching
API. Update them so that the optional XGBoost functionality can be tested.
You can't include newlines when adding a custom feature; the parser will not be able to parse such strings correctly, even though newlines are fine in normal Python code.
# Fails to parse:
feature_string = """
jaccard(
...
)
"""
# Parses correctly:
feature_string = """jaccard(...)"""
AppVeyor fails for both Python versions 2.7 and 3.4. For some reason, the new labeler test cases are not skipped for these versions of Python in the AppVeyor tests on Windows.
Hi,
I encountered an error while installing the package, see details below:
Error message:
Command "/opt/conda/bin/python -u -c "import setuptools, tokenize;__file__='/tmp/pip-install-jagug70a/py-stringsimjoin/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-record-qmor4dym/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /tmp/pip-install-jagug70a/py-stringsimjoin/
error: command 'gcc' failed with exit status 1
macOS Mojave 10.14
Python version 3.7.2
Hello, I am using Python 3.7 and cloudpickle 1.5.
When I ran the extract_feature_vecs() function, it raised an error stating that cloudpickle has no attribute "dumps".
Below is, symbolically, what I was running:
import py_entitymatching as em
matching_features_df = em.extract_feature_vecs(candidate_table,
                                               feature_table=matching_features,
                                               show_progress=False)
All I had to do was change
from cloudpickle import cloudpickle
to
import cloudpickle
on line 14 of extractfeatures.py, and that solved my issue :)
Great package, and I thought I would let others know of this fix.
Update the error message to state which columns are required in the feature table.
When the user scrolls down the page and then selects a label for a tuple, the labeler returns to the top of the page instead of staying at the same place the user left off.
The function extract_feature_vecs converts the candidate set into a set of feature vectors. This function is embarrassingly parallelizable.
Refer to the blocker implementation for parallelization.
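The usual pattern, sketched below with hypothetical names, is to split the candidate set into chunks and featurize each chunk independently; the real implementation would reuse the blocker's multi-process machinery rather than the thread pool used here to keep the sketch self-contained (threads do not give CPU parallelism for Python-level feature functions because of the GIL):

```python
from concurrent.futures import ThreadPoolExecutor

def featurize(pair):
    # Stand-in for applying the feature table to one tuple pair.
    a, b = pair
    return abs(len(a) - len(b))

def extract_parallel(pairs, n_workers=4):
    # Contiguous chunks keep the output in candidate-set order.
    size = max(1, -(-len(pairs) // n_workers))  # ceiling division
    chunks = [pairs[i:i + size] for i in range(0, len(pairs), size)]
    with ThreadPoolExecutor(max_workers=n_workers) as ex:
        parts = ex.map(lambda chunk: [featurize(p) for p in chunk], chunks)
    return [v for part in parts for v in part]

print(extract_parallel([("apple", "apples"), ("bob", "robert")]))  # [1, 3]
```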
Training some classifiers takes a long time, so we should reduce the computational cost in cross_validation.
Rule-Based Matcher
Triggers
When the user sets show_progress=False, the progress bar is still displayed for block_candset in OverlapBlocker.
Using numpy referenced through pandas raises a FutureWarning advising to import and use numpy directly instead. Update existing usage of pandas.np to use numpy directly, in order to prevent future dependency-based errors.
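The fix is mechanical, for example:

```python
# Before (warns on recent pandas and breaks once pandas.np is removed):
#     missing = pd.np.nan
# After: import numpy directly.
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan]})
print(bool(np.isnan(df["a"].iloc[1])))  # True
```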
I'm from CS638, we have been working on our project stage 4 using Magellan. Many of the existing stop words such as "the", "and", etc were common across multiple strings we used for blocking (particularly from board game titles). The rem_stop_words code managed to prune these from the calculations of overlap, thus reducing our false positives greatly.
However, many tuple pairs were formed from strings (board game titles) sharing the token "of". Perhaps "of" can be added to this already extensive list of stop words?
The current link under Important Links > Project Homepage points to a "Page not found" page. If I'm not mistaken, the correct link should be https://sites.google.com/site/anhaidgroup/projects/magellan/py_entitymatching
Need to pull in the changes from RIT branch. Specifically, the following needs to be pulled in
In the following code,
if not isinstance(data_frame, pd.DataFrame):
logging.error('Input dataframe is not of type pandas dataframe')
raise AssertionError('Input dataframe is not of type pandas dataframe')
The error does not include the actual type of the input. Update the error to include the type of the input object.
Further in this code,
if not isinstance(file_path, six.string_types):
logger.error('Input file path is not of type string')
raise AssertionError('Input file path is not of type string')
The input file path should be included in the error strings by trying to convert the input file_path to a string. If that fails, the error message need not include the given file_path.
Dear team,
I found a simple typo in an API docstring.
The correct sentence should be D = rb.block_tuples(A.loc[0], B.loc[1])
scikit-learn deprecated sklearn.preprocessing.Imputer in 0.20.0 and removed it in 0.22.0:
https://github.com/scikit-learn/scikit-learn/blame/0.22.1/doc/whats_new/v0.20.rst#L1444-L1468
This means py_entitymatching is incompatible with scikit-learn >= 0.22 due to
https://github.com/anhaidgroup/py_entitymatching/blob/v0.3.2/py_entitymatching/matcher/matcherutils.py#L11
and
https://github.com/anhaidgroup/py_entitymatching/blob/v0.3.2/py_entitymatching/matcher/matcherutils.py#L221-L224
This in turn means py_entitymatching does not fully work with Python 3.8, since scikit-learn < 0.22 does not support it.
ref: conda-forge/py_entitymatching-feedstock#3
The above-referenced whats_new/v0.20.rst mentions sklearn.impute.SimpleImputer, sklearn.preprocessing.FunctionTransformer (for the axis keyword), and numpy.nan (for the missing_values keyword) as replacements.
I have no experience with scikit-learn, so unfortunately I am not able to offer further assistance/PRs.
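For what it's worth, the migration suggested by whats_new/v0.20.rst looks roughly like this (a sketch, assuming column-wise mean imputation as in the old default):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Old (removed in scikit-learn 0.22):
#     from sklearn.preprocessing import Imputer
#     imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
# New:
imp = SimpleImputer(missing_values=np.nan, strategy='mean')

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, 6.0]])
X_filled = imp.fit_transform(X)
print(X_filled[1, 0])  # column mean of [1.0, 7.0] -> 4.0
```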
Python 3.8 was released in October 2019, and Python 3.9 is currently in development. Add support for Python 3.8 to the PyMatcher projects.
The following code in down sampling raises an assertion error:
def _get_str_cols_list(table):
    if len(table) == 0:
        logger.error('_get_str_cols_list: Size of the input table is 0')
        raise AssertionError('get_str_cols_list: Size of the input table is 0')
There are two problems with this assertion error
Proposed solution:
As of now, fix the English statement in _get_str_cols_list. Specifically, remove the "_get_str_cols_list:" prefix.
In sim_functions.py, the methods abs_norm and rel_diff, which are meant for numeric data, fail when the prediction data contains string-typed rows.
Travis failed only for Python 2.7 and the error specified that the problem was related to gcc.
It looks like that many of the computationally intensive tasks (blocking mechanisms, feature value extraction) are reimplemented using dask (in submodule: dask), but they're not installed when installing the package via pip. Have they been omitted purposefully due to the lack of testing on them (as suggested by the docstrings of the functions) or is there a way (for instance setting some installation flags) to install the dask utilities?
Would it be possible to publish compiled wheels for Windows and Linux on pypi.org, please?
Hello,
It seems that the rule-based blocker iterates through all pairs of tuples from the two dataframes, which in effect implies a run time of O(N * M), where N and M are the numbers of rows in each table. Isn't the purpose of the blocker to reduce the candidate set size, to avoid computations over all pairs in the Cartesian product space?
It would be great if you clarify which of the blockers iterate through all possible pairs, and which ones do not.
Thanks,
The scikit-learn dependency should be no more than 0.20, since Imputer no longer exists. (Enforce a 0.18 version?)
Thank you for the update with that return_probs function.
But it seems the returned probability is relative to the prediction
(if the prediction is True, the probability is the probability of True).
I want the probability of True in all cases.
Ports of core py_entitymatching modules to leverage Dask exist and are packaged as py_entitymatching.dask, but they throw "experimental" warnings upon instantiation, do not have unit tests, and have their own set of dependencies. Make the existing experimental Dask modules "extras" for a py_entitymatching installation.
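In setuptools terms this would be an extras_require entry; the dependency list below is an illustrative assumption, not the actual set of Dask requirements:

```python
# Sketch of a setup.py fragment declaring the Dask port as an extra.
extras_require = {
    # Hypothetical dependency list for the py_entitymatching.dask modules.
    'dask': ['dask[dataframe]', 'distributed'],
}

# setup(name='py_entitymatching', ..., extras_require=extras_require)
# Users then opt in explicitly:
#     pip install py_entitymatching[dask]
```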
Dear Team,
While preparing the gold dataset, the PyQt module doesn't seem to work when we run it on a cloud server. After manually generating the dataset, running em.train_test_split(...) on the training dataset throws a foreign key error. Please guide me through this.
Thanks,
Karunakar Chinnabathini.
Another issue to share from an annoying CS638 student. Slightly more pressing:
When using the OverlapBlocker class, we noticed the rem_stop_words attribute in block_candset():
rem_stop_words (boolean): A flag to indicate whether stop words (e.g., a, an, the) should be removed from the token sets of the overlap attribute values (defaults to False).
However, we noticed that, regardless of the attribute's value ("rem_stop_words=False" or "rem_stop_words=True"), this process always runs, pruning these stop words from the token sets.
Upon inspection, it looks like the rem_stop_words attribute is passed down but never used. In particular, the stop word pruning seems to occur here, regardless of the attribute's value.
And:
Perhaps an if statement should be used here to apply this "stop word pruning" conditionally?
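A sketch of the suggested guard (names are hypothetical; the real tokenization lives inside OverlapBlocker):

```python
STOP_WORDS = {'a', 'an', 'the', 'and', 'of'}

def tokenize(value, rem_stop_words=False):
    tokens = value.lower().split()
    if rem_stop_words:  # prune only when the caller asked for it
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens

print(tokenize('The Lord of the Rings'))
# ['the', 'lord', 'of', 'the', 'rings']
print(tokenize('The Lord of the Rings', rem_stop_words=True))
# ['lord', 'rings']
```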
The following assertion error needs to be updated in read_csv_metadata.
In the following code,
if not isinstance(file_path, six.string_types):
logger.error('Input file path is not of type string')
raise AssertionError('Input file path is not of type string')
The error is thrown if the file path is not of type string. It would be good to include what the input is (in a best-effort manner). Specifically, if the input file_path can be converted to a string, include it in the error message; otherwise, leave the error message without the input file path.
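A best-effort version of that check might look like this (the helper name is hypothetical; the real code checks six.string_types for Python 2/3 compatibility, while plain str is used here to keep the sketch self-contained):

```python
import logging

logger = logging.getLogger(__name__)

def validate_file_path(file_path):
    if not isinstance(file_path, str):
        try:
            # Best effort: show the offending value when it stringifies.
            msg = 'Input file path is not of type string: %r' % (file_path,)
        except Exception:
            msg = 'Input file path is not of type string'
        logger.error(msg)
        raise AssertionError(msg)

validate_file_path('table.csv')  # passes
```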
In the function get_type, update the error messages for the cases where the column does not contain even a single recognizable type or contains multiple types.