christopher-thornton / hmni

📛 Fuzzy Name Matching with Machine Learning

License: MIT License

Python 100.00%
ai artificial-intelligence data-science fuzzy-matching machine-learning natural-language-processing nlp python

hmni's Issues

fuzzymerge throws error if match cannot be found

Modified example from the readme.md:

import pandas as pd
import hmni

matcher = hmni.Matcher(model='latin')  # as in the README example

df1 = pd.DataFrame({'name': ['Al', 'Mark', 'James', 'Harold', 'Leon']})  # added name 'Leon'
df2 = pd.DataFrame({'name': ['Mark', 'Alan', 'James', 'Harold']})
merged = matcher.fuzzymerge(df1, df2, how='left', on='name')

This throws an error. The root cause seems to be that no match is found for 'Leon'. Indeed, setting threshold=0.4 runs without an error, since 'Leon' is matched to 'Alan' with similarity 0.43.
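Given that observation, the workaround is simply to pass the lower threshold explicitly (a sketch reusing the objects defined above; the threshold keyword is the one already mentioned in this report):

# With threshold=0.4, 'Leon' is allowed to match 'Alan' (similarity ~0.43),
# so fuzzymerge no longer fails on the otherwise-unmatched left-hand row.
merged = matcher.fuzzymerge(df1, df2, how='left', on='name', threshold=0.4)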

AttributeError: module 'hmni.syllable_tokenizer' has no attribute 'tokenize'

@Christopher-Thornton, I got the error

AttributeError: module 'hmni.syllable_tokenizer' has no attribute 'tokenize'

while running your cell 21 from the notebook

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-25-f1b5a7cac224> in <module>
----> 1 df = featurize(df)

<ipython-input-15-eff31deb72c1> in featurize(df)
     12         '[^a-zA-Z]+', '', unidecode.unidecode(row['b']).lower().strip()), axis=1)
     13 
---> 14     df['syll_a'] = df.apply(lambda row: syllable_tokenizer.tokenize(row.name_a), axis=1)
     15     df['syll_b'] = df.apply(lambda row: syllable_tokenizer.tokenize(row.name_b), axis=1)
     16 

~\Anaconda3\lib\site-packages\pandas\core\frame.py in apply(self, func, axis, raw, result_type, args, **kwds)
   7766             kwds=kwds,
   7767         )
-> 7768         return op.get_result()
   7769 
   7770     def applymap(self, func, na_action: Optional[str] = None) -> DataFrame:

~\Anaconda3\lib\site-packages\pandas\core\apply.py in get_result(self)
    183             return self.apply_raw()
    184 
--> 185         return self.apply_standard()
    186 
    187     def apply_empty_result(self):

~\Anaconda3\lib\site-packages\pandas\core\apply.py in apply_standard(self)
    274 
    275     def apply_standard(self):
--> 276         results, res_index = self.apply_series_generator()
    277 
    278         # wrap results

~\Anaconda3\lib\site-packages\pandas\core\apply.py in apply_series_generator(self)
    288             for i, v in enumerate(series_gen):
    289                 # ignore SettingWithCopy here in case the user mutates
--> 290                 results[i] = self.f(v)
    291                 if isinstance(results[i], ABCSeries):
    292                     # If we have a view on v, we need to make a copy because

<ipython-input-15-eff31deb72c1> in <lambda>(row)
     12         '[^a-zA-Z]+', '', unidecode.unidecode(row['b']).lower().strip()), axis=1)
     13 
---> 14     df['syll_a'] = df.apply(lambda row: syllable_tokenizer.tokenize(row.name_a), axis=1)
     15     df['syll_b'] = df.apply(lambda row: syllable_tokenizer.tokenize(row.name_b), axis=1)
     16 

AttributeError: module 'hmni.syllable_tokenizer' has no attribute 'tokenize'

ImportError: cannot import name 'MyVocabularyProcessor' from 'preprocess' (D:\Anaconda\lib\site-packages\preprocess.py)

Hi, thanks for this fantastic library. But I have an issue. I have Python 3.8.3. I installed the required libraries that you mentioned, but I was also required to install preprocess. After this, I tried to run just the import of your library again, and the code returned the error:

ModuleNotFoundError: No module named 'packaging.about'

During handling of the above exception, another exception occurred:

ImportError                               Traceback (most recent call last)
in <module>
----> 1 import hmni
      2
      3 # Initialize a Matcher Object
      4 #matcher = hmni.Matcher(model='latin')
      5

D:\Anaconda\lib\site-packages\hmni\__init__.py in <module>
----> 1 from . import input_helpers, matcher, preprocess, siamese_network, syllable_tokenizer
      2 import os
      3 import tarfile
      4
      5 # extract model tarball into directory if doesnt exist

D:\Anaconda\lib\site-packages\hmni\input_helpers.py in <module>
     33     from .preprocess import MyVocabularyProcessor
     34 except:
---> 35     from preprocess import MyVocabularyProcessor
     36
     37

ImportError: cannot import name 'MyVocabularyProcessor' from 'preprocess' (D:\Anaconda\lib\site-packages\preprocess.py)

What is the problem?? Thanks again!
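For what it's worth, the fallback import "from preprocess import MyVocabularyProcessor" inside hmni's input_helpers.py is resolving to the standalone preprocess package installed from PyPI, which has no MyVocabularyProcessor. A quick diagnostic sketch (plain Python, nothing hmni-specific) to confirm which module is being picked up:

import preprocess

# If this prints ...\site-packages\preprocess.py (the standalone PyPI package)
# rather than a path inside the hmni package, that package is shadowing
# hmni's own hmni/preprocess.py in the fallback import shown above.
print(preprocess.__file__)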

featurize(df) throwing error: AttributeError: module hmni.syllable_tokenizer has no attribute 'tokenize'

Hi!
I was trying to build the hmni model from the source code. Python 3.8.5, pandas 1.1.3, sklearn 0.23.2.

I am facing the following error when running featurize(df) in model_building.ipynb:

AttributeError: module hmni.syllable_tokenizer has no attribute 'tokenize'

from the following code:
df['syll_a'] = df.apply(lambda row: syllable_tokenizer.tokenize(row.name_a), axis=1)

How can we solve this?

Thanks,
Farhana

pandas fuzzy merge with particular distance algorithm

I see in the requirements that hmni uses abydos, which in turn is the only package I have found implementing some distance algorithms, like Rees-Levenshtein.

I wonder if someone could provide an example of how to fuzzy merge two pandas dataframes on a common column, but establishing a minimum distance cutoff using that particular algorithm.

import pandas as pd
df1 = pd.DataFrame({"name":["Johnn Doe", "Jon Doe", "Margaret Tacher", "Marareth Tatcher"]})
df2 = pd.DataFrame(
	[["John Doe", "US"],["Margareth Thatcher","UK"]],
	columns=["name", "country"])

For example, get the country for each person's name in df1, based on fuzzy matches against df2 with a particular algorithm.

Is it possible to do that with hmni?
I suppose I should use the fuzzymerge method, but I can't figure out how to specify an algorithm taken from abydos.

Thanks a lot for your help in advance
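One possible sketch, sidestepping hmni's fuzzymerge and driving abydos directly from plain pandas (this assumes the standard abydos.distance classes with their sim() method; Levenshtein here is just a stand-in for whichever abydos measure is wanted):

import pandas as pd
from abydos.distance import Levenshtein

df1 = pd.DataFrame({"name": ["Johnn Doe", "Jon Doe", "Margaret Tacher", "Marareth Tatcher"]})
df2 = pd.DataFrame(
    [["John Doe", "US"], ["Margareth Thatcher", "UK"]],
    columns=["name", "country"])

measure = Levenshtein()   # swap in any other abydos distance class here
cutoff = 0.7              # minimum normalized similarity to accept a match

def best_match(name, candidates):
    # Return the candidate with the highest similarity, or None if below the cutoff.
    scored = [(measure.sim(name, c), c) for c in candidates]
    score, match = max(scored)
    return match if score >= cutoff else None

df1["match"] = df1["name"].apply(lambda n: best_match(n, df2["name"]))
merged = df1.merge(df2, left_on="match", right_on="name",
                   how="left", suffixes=("", "_df2"))
print(merged[["name", "country"]])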

matcher.similarity gives an error

I tried the example given in the README, but I got the following error:

Python code:

import hmni

# Initialize a Matcher Object
matcher = hmni.Matcher(model='latin')

# Single Pair Similarity
matcher.similarity('Alan', 'Al')

Terminal:

Traceback (most recent call last):
  File "/home/cecile/.local/lib/python3.6/site-packages/hmni/input_helpers.py", line 33, in <module>
    from .preprocess import MyVocabularyProcessor
  File "/home/cecile/.local/lib/python3.6/site-packages/hmni/preprocess.py", line 30, in <module>
    from tensorflow.python.platform import gfile
  File "/home/cecile/.local/lib/python3.6/site-packages/tensorflow/__init__.py", line 28, in <module>
    from tensorflow.python import pywrap_tensorflow  # pylint: disable=unused-import
  File "/home/cecile/.local/lib/python3.6/site-packages/tensorflow/python/__init__.py", line 52, in <module>
    from tensorflow.core.framework.graph_pb2 import *
  File "/home/cecile/.local/lib/python3.6/site-packages/tensorflow/core/framework/graph_pb2.py", line 7, in <module>
    from google.protobuf import descriptor as _descriptor
  File "/home/cecile/.local/lib/python3.6/site-packages/google/protobuf/descriptor.py", line 47, in <module>
    from google.protobuf.pyext import _message
AttributeError: module 'google.protobuf.internal.containers' has no attribute 'MutableMapping'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "name_similarity.py", line 2, in <module>
    import hmni
  File "/home/cecile/.local/lib/python3.6/site-packages/hmni/__init__.py", line 1, in <module>
    from . import input_helpers, matcher, preprocess, siamese_network, syllable_tokenizer
  File "/home/cecile/.local/lib/python3.6/site-packages/hmni/input_helpers.py", line 35, in <module>
    from preprocess import MyVocabularyProcessor
ModuleNotFoundError: No module named 'preprocess'

Slow, even with small dataset

We built a 5000 row proof of concept that searches first and last name only, and it takes about a minute to show a result. Our implementation would need to search a few million rows. Do you have any suggestions on how to improve performance?
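Not a full answer, but one generic mitigation sketch: cache pairwise scores so each distinct name pair goes through the model only once (this assumes the matcher.similarity call shown elsewhere in these issues; real name columns usually contain many repeats, so the cache can cut the number of model calls substantially):

from functools import lru_cache

import hmni

matcher = hmni.Matcher(model='latin')

@lru_cache(maxsize=None)
def cached_similarity(a, b):
    # Each distinct (a, b) pair is scored by the model only once;
    # repeated pairs become cheap dictionary lookups.
    return matcher.similarity(a, b)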

ImportError: cannot import name 'Iterable' from 'collections'

I just pip installed it and wanted to try it out, but can't even get that far. :/

If I just have a single line:
import hmni

It immediately fails with the above error. I tried running pip install a second time in case something was missed; it gave a bunch of "Requirement already satisfied" messages.

Some searching led me to try
from collections.abc import Iterable
but it doesn't help. Full error text follows:

Traceback (most recent call last):
  File "C:\Users\abc\something.py", line 1, in <module>
    import hmni
  File "C:\Users\abc\AppData\Local\Programs\Python\Python310\lib\site-packages\hmni\__init__.py", line 1, in <module>
    from . import input_helpers, matcher, preprocess, siamese_network, syllable_tokenizer
  File "C:\Users\abc\AppData\Local\Programs\Python\Python310\lib\site-packages\hmni\matcher.py", line 30, in <module>
    from abydos.distance import (IterativeSubString, BISIM, DiscountedLevenshtein, Prefix, LCSstr, MLIPNS, Strcmp95,
  File "C:\Users\abc\AppData\Local\Programs\Python\Python310\lib\site-packages\abydos\distance\__init__.py", line 369, in <module>
    from ._ample import AMPLE
  File "C:\Users\abc\AppData\Local\Programs\Python\Python310\lib\site-packages\abydos\distance\_ample.py", line 22, in <module>
    from ._token_distance import _TokenDistance
  File "C:\Users\abc\AppData\Local\Programs\Python\Python310\lib\site-packages\abydos\distance\_token_distance.py", line 35, in <module>
    from ..tokenizer import QGrams, QSkipgrams, WhitespaceTokenizer
  File "C:\Users\abc\AppData\Local\Programs\Python\Python310\lib\site-packages\abydos\tokenizer\__init__.py", line 99, in <module>
    from ._q_grams import QGrams
  File "C:\Users\abc\AppData\Local\Programs\Python\Python310\lib\site-packages\abydos\tokenizer\_q_grams.py", line 22, in <module>
    from collections import Iterable
ImportError: cannot import name 'Iterable' from 'collections' (C:\Users\abc\AppData\Local\Programs\Python\Python310\lib\collections\__init__.py)
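The failing import lives inside abydos itself (_q_grams.py), so adding "from collections.abc import Iterable" to your own script cannot reach it. A heavily hedged shim sketch, assuming nothing else in the stack objects to the alias being restored (other Python 3.10 incompatibilities in an old abydos release may still surface afterwards):

# Restore the pre-3.10 alias before anything imports abydos, so that
# abydos' "from collections import Iterable" resolves again.
import collections
import collections.abc

if not hasattr(collections, "Iterable"):
    collections.Iterable = collections.abc.Iterable

import hmni  # abydos is imported here with the alias in place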

Tensorflow Dependency Error

During pip install I'm getting the following error:

ERROR: Could not find a version that satisfies the requirement tensorflow<2.0,>=1.11 (from hmni) (from versions: 2.2.0rc1, 2.2.0rc2, 2.2.0rc3, 2.2.0rc4, 2.2.0, 2.3.0rc0, 2.3.0rc1, 2.3.0rc2, 2.3.0)
ERROR: No matching distribution found for tensorflow<2.0,>=1.11 (from hmni)


OS details:

Distributor ID: Ubuntu
Description: Ubuntu 20.04.1 LTS
Release: 20.04
Codename: focal

TypeError: 'Matcher' object is not callable

Sir, I am simply importing hmni and making a "Matcher" object like:
matcher = hmni.Matcher(model='latin')

And then I am using "matcher.similarity(str_1, str_2)"

But it throws the error in the title above.
Can you tell me what I should do to fix this?
Python - 3.8.0
tensorflow - 2.9.1
hmni - 0.1.8

If there is something else you need to know about my system, please do tell, but I want to use this package ASAP

Thanks in advance

Cannot import name 'float' from 'numpy'

I am getting this stack trace when I try to import hmni:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/homebrew/lib/python3.10/site-packages/hmni/__init__.py", line 1, in <module>
    from . import input_helpers, matcher, preprocess, siamese_network, syllable_tokenizer
  File "/opt/homebrew/lib/python3.10/site-packages/hmni/matcher.py", line 30, in <module>
    from abydos.distance import (IterativeSubString, BISIM, DiscountedLevenshtein, Prefix, LCSstr, MLIPNS, Strcmp95,
  File "/opt/homebrew/lib/python3.10/site-packages/abydos/distance/__init__.py", line 368, in <module>
    from ._aline import ALINE
  File "/opt/homebrew/lib/python3.10/site-packages/abydos/distance/_aline.py", line 25, in <module>
    from numpy import float as np_float
ImportError: cannot import name 'float' from 'numpy' (/opt/homebrew/lib/python3.10/site-packages/numpy/__init__.py)
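The failing line is abydos' _aline.py doing "from numpy import float as np_float"; NumPy removed the long-deprecated numpy.float alias in 1.24. Pinning an older numpy (below 1.24) is the cleaner fix; the sketch below is a hacky shim under the assumption that nothing else in the environment depends on the alias staying removed:

# Recreate the removed numpy.float alias before hmni (and thus abydos) is imported.
import numpy as np

if not hasattr(np, "float"):
    np.float = float  # the old alias pointed at the builtin float anyway

import hmni  # abydos' "from numpy import float as np_float" now resolves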
