christopher-thornton / hmni
Fuzzy Name Matching with Machine Learning
License: MIT License
I just pip installed it and wanted to try it out, but can't even get that far. :/
If I just have a single line:
import hmni
It immediately fails with the error shown below. I tried running pip install a second time in case something was missed; it just gave a bunch of "Requirement already satisfied" messages.
Some searching led me to try
from collections.abc import Iterable
but it doesn't help. Full error text follows:
Traceback (most recent call last):
  File "C:\Users\abc\something.py", line 1, in <module>
    import hmni
  File "C:\Users\abc\AppData\Local\Programs\Python\Python310\lib\site-packages\hmni\__init__.py", line 1, in <module>
    from . import input_helpers, matcher, preprocess, siamese_network, syllable_tokenizer
  File "C:\Users\abc\AppData\Local\Programs\Python\Python310\lib\site-packages\hmni\matcher.py", line 30, in <module>
    from abydos.distance import (IterativeSubString, BISIM, DiscountedLevenshtein, Prefix, LCSstr, MLIPNS, Strcmp95,
  File "C:\Users\abc\AppData\Local\Programs\Python\Python310\lib\site-packages\abydos\distance\__init__.py", line 369, in <module>
    from ._ample import AMPLE
  File "C:\Users\abc\AppData\Local\Programs\Python\Python310\lib\site-packages\abydos\distance\_ample.py", line 22, in <module>
    from ._token_distance import _TokenDistance
  File "C:\Users\abc\AppData\Local\Programs\Python\Python310\lib\site-packages\abydos\distance\_token_distance.py", line 35, in <module>
    from ..tokenizer import QGrams, QSkipgrams, WhitespaceTokenizer
  File "C:\Users\abc\AppData\Local\Programs\Python\Python310\lib\site-packages\abydos\tokenizer\__init__.py", line 99, in <module>
    from ._q_grams import QGrams
  File "C:\Users\abc\AppData\Local\Programs\Python\Python310\lib\site-packages\abydos\tokenizer\_q_grams.py", line 22, in <module>
    from collections import Iterable
ImportError: cannot import name 'Iterable' from 'collections' (C:\Users\abc\AppData\Local\Programs\Python\Python310\lib\collections\__init__.py)
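For anyone landing here: Python 3.10 removed the long-deprecated `collections.Iterable` alias, which the abydos release pulled in by hmni still imports. Upgrading abydos is the clean fix; as a stopgap, restoring the aliases before importing hmni should let the legacy import resolve. A sketch, not an official fix:

```python
import collections
import collections.abc

# Python 3.10 removed deprecated aliases such as collections.Iterable.
# Restore them BEFORE importing hmni so the old abydos import succeeds.
for _name in ("Iterable", "Mapping", "MutableMapping", "Sequence", "Callable"):
    if not hasattr(collections, _name):
        setattr(collections, _name, getattr(collections.abc, _name))

# import hmni  # the abydos import should now get past _q_grams.py
```

This has to run before the first `import hmni` in the process, since the failing import happens at module load time.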
Hi, is it possible to alter any parameters/thresholds of this model, or the model itself such that recall is prioritised instead of precision?
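As far as I can tell hmni doesn't expose a recall/precision knob directly, but similarity() returns a continuous score, so you can trade precision for recall by lowering whatever cutoff you apply downstream (fuzzymerge's threshold parameter, mentioned elsewhere in these issues, plays the same role). A toy illustration with made-up scores, not real hmni output:

```python
def classify(scored_pairs, threshold):
    """Label each (name_a, name_b, score) triple as match/non-match."""
    return [score >= threshold for _, _, score in scored_pairs]

# Hypothetical similarity scores for three candidate pairs.
scored = [("Al", "Alan", 0.62), ("Jon", "John", 0.91), ("Mark", "Mary", 0.35)]

strict = classify(scored, 0.8)  # favours precision -> [False, True, False]
loose = classify(scored, 0.5)   # favours recall    -> [True, True, False]
```

Lowering the cutoff admits more true matches at the price of more false ones; where to set it depends on which error is cheaper for your application.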
@Christopher-Thornton , got the error
AttributeError: module 'hmni.syllable_tokenizer' has no attribute 'tokenize'
while running your cell 21 from the notebook
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-25-f1b5a7cac224> in <module>
----> 1 df = featurize(df)
<ipython-input-15-eff31deb72c1> in featurize(df)
12 '[^a-zA-Z]+', '', unidecode.unidecode(row['b']).lower().strip()), axis=1)
13
---> 14 df['syll_a'] = df.apply(lambda row: syllable_tokenizer.tokenize(row.name_a), axis=1)
15 df['syll_b'] = df.apply(lambda row: syllable_tokenizer.tokenize(row.name_b), axis=1)
16
~\Anaconda3\lib\site-packages\pandas\core\frame.py in apply(self, func, axis, raw, result_type, args, **kwds)
7766 kwds=kwds,
7767 )
-> 7768 return op.get_result()
7769
7770 def applymap(self, func, na_action: Optional[str] = None) -> DataFrame:
~\Anaconda3\lib\site-packages\pandas\core\apply.py in get_result(self)
183 return self.apply_raw()
184
--> 185 return self.apply_standard()
186
187 def apply_empty_result(self):
~\Anaconda3\lib\site-packages\pandas\core\apply.py in apply_standard(self)
274
275 def apply_standard(self):
--> 276 results, res_index = self.apply_series_generator()
277
278 # wrap results
~\Anaconda3\lib\site-packages\pandas\core\apply.py in apply_series_generator(self)
288 for i, v in enumerate(series_gen):
289 # ignore SettingWithCopy here in case the user mutates
--> 290 results[i] = self.f(v)
291 if isinstance(results[i], ABCSeries):
292 # If we have a view on v, we need to make a copy because
<ipython-input-15-eff31deb72c1> in <lambda>(row)
12 '[^a-zA-Z]+', '', unidecode.unidecode(row['b']).lower().strip()), axis=1)
13
---> 14 df['syll_a'] = df.apply(lambda row: syllable_tokenizer.tokenize(row.name_a), axis=1)
15 df['syll_b'] = df.apply(lambda row: syllable_tokenizer.tokenize(row.name_b), axis=1)
16
AttributeError: module 'hmni.syllable_tokenizer' has no attribute 'tokenize'
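Until the notebook is fixed, a quick way to work around this kind of AttributeError is to look up what the installed module actually exports instead of hard-coding one name. A small generic helper; the hmni attribute names in the commented line are guesses, not confirmed API:

```python
import importlib


def find_callable(module_name, candidates):
    """Return the first callable attribute from `candidates` that the
    installed module actually defines -- handy when an API has been
    renamed between versions."""
    mod = importlib.import_module(module_name)
    for name in candidates:
        attr = getattr(mod, name, None)
        if callable(attr):
            return attr
    raise AttributeError(f"{module_name} defines none of {candidates}")


# Hypothetical usage against hmni (names are guesses, not confirmed API):
# tokenize = find_callable("hmni.syllable_tokenizer", ["tokenize", "syllables"])
```

Running `dir(hmni.syllable_tokenizer)` in the notebook will also show what the module exposes in the installed version.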
I am getting this stack trace when I try to import hmni
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/homebrew/lib/python3.10/site-packages/hmni/__init__.py", line 1, in <module>
from . import input_helpers, matcher, preprocess, siamese_network, syllable_tokenizer
File "/opt/homebrew/lib/python3.10/site-packages/hmni/matcher.py", line 30, in <module>
from abydos.distance import (IterativeSubString, BISIM, DiscountedLevenshtein, Prefix, LCSstr, MLIPNS, Strcmp95,
File "/opt/homebrew/lib/python3.10/site-packages/abydos/distance/__init__.py", line 368, in <module>
from ._aline import ALINE
File "/opt/homebrew/lib/python3.10/site-packages/abydos/distance/_aline.py", line 25, in <module>
from numpy import float as np_float
ImportError: cannot import name 'float' from 'numpy' (/opt/homebrew/lib/python3.10/site-packages/numpy/__init__.py)
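This one is the NumPy counterpart of the collections error above: `np.float` was removed in NumPy 1.24, and the abydos release pulled in by hmni still imports it. Pinning `numpy<1.24` or upgrading abydos is cleaner; a stopgap shim, run before `import hmni`:

```python
import numpy as np

# NumPy 1.24 removed the deprecated alias np.float (it was just the
# builtin float). Restore it before importing hmni so the old abydos
# `from numpy import float` succeeds. This is a workaround, not a fix.
if not hasattr(np, "float"):
    np.float = float

# import hmni  # the abydos import should now get past _aline.py
```

As with the collections shim, this must run before hmni (and therefore abydos) is first imported in the process.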
@Christopher-Thornton , thanks for a great library!
Did you share the code you used for building the model?
https://towardsdatascience.com/fuzzy-name-matching-with-machine-learning-f09895dce7b4
I am looking to match 30,000+ vs 5,000+ names and I am trying to find a solution tailored specifically for names.
This is exactly what I need
Hi, thanks for this fantastic library. But I have an issue. I am on Python 3.8.3. I installed the required libraries that you mentioned, but it was also necessary to install preprocess. After this, I tried to run just the import of your library again, and the code returned the error:
ModuleNotFoundError: No module named 'packaging.about'
During handling of the above exception, another exception occurred:
ImportError Traceback (most recent call last)
in
----> 1 import hmni
2
3 # Initialize a Matcher Object
4 #matcher = hmni.Matcher(model='latin')
5
D:\Anaconda\lib\site-packages\hmni\__init__.py in
----> 1 from . import input_helpers, matcher, preprocess, siamese_network, syllable_tokenizer
2 import os
3 import tarfile
4
5 # extract model tarball into directory if doesnt exist
D:\Anaconda\lib\site-packages\hmni\input_helpers.py in
33 from .preprocess import MyVocabularyProcessor
34 except:
---> 35 from preprocess import MyVocabularyProcessor
36
37
ImportError: cannot import name 'MyVocabularyProcessor' from 'preprocess' (D:\Anaconda\lib\site-packages\preprocess.py)
What is the problem? Thanks again!
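One thing the traceback hints at: Python is resolving `from preprocess import MyVocabularyProcessor` to a standalone D:\Anaconda\lib\site-packages\preprocess.py, i.e. the separately installed `preprocess` package is shadowing hmni's bundled module of the same name. Removing it and reinstalling hmni is worth a try; these commands are a guess at the fix, not a confirmed solution:

```shell
# Remove the third-party "preprocess" package that shadows hmni's
# internal module, then reinstall hmni so its own files are intact.
pip uninstall preprocess
pip install --force-reinstall hmni
```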
I believe that you can incorporate third-party code under an MIT licence in a package released under Apache 2.0 licence but not the other way round -- see e.g. https://dwheeler.com/essays/floss-license-slide.html .
As the third-party code used here is under a mixture of these two licences -- in particular the code from nltk (here hmni/syllable_tokenizer.py) is under Apache 2.0 -- the whole thing needs to be under Apache 2.0 (or a compatible licence) rather than MIT.
I see in the requirements that hmni uses abydos, which in turn is the only package I have found implementing some distance algorithms, like Rees-Levenshtein.
I wonder if someone could provide an example of how to fuzzy-merge two pandas dataframes on a common column, while establishing a minimal distance cutoff using that particular algorithm.
import pandas as pd

df1 = pd.DataFrame({"name": ["Johnn Doe", "Jon Doe", "Margaret Tacher", "Marareth Tatcher"]})
df2 = pd.DataFrame(
    [["John Doe", "US"], ["Margareth Thatcher", "UK"]],
    columns=["name", "country"])
For example, get country for each person name in df1, based on fuzzy matches against df2 with a particular algorithm.
Is it possible to do that with hmni?
I should probably use the fuzzymerge method, but I can't figure out how to specify an algorithm taken from abydos.
Thanks a lot for your help in advance
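I don't believe fuzzymerge accepts an abydos metric directly, but nothing stops you from rolling the merge yourself with any scorer. A sketch using pandas, with difflib standing in for the abydos algorithm (with abydos installed you could swap in e.g. `abydos.distance.DiscountedLevenshtein().sim`; the 0.6 cutoff is arbitrary):

```python
import difflib

import pandas as pd


def best_match(name, candidates, cutoff=0.6):
    """Most similar candidate, or None if nothing clears the cutoff.
    difflib.SequenceMatcher stands in here for any abydos scorer."""
    scored = [(difflib.SequenceMatcher(None, name.lower(), c.lower()).ratio(), c)
              for c in candidates]
    score, match = max(scored)
    return match if score >= cutoff else None


df1 = pd.DataFrame({"name": ["Johnn Doe", "Jon Doe", "Margaret Tacher", "Marareth Tatcher"]})
df2 = pd.DataFrame([["John Doe", "US"], ["Margareth Thatcher", "UK"]],
                   columns=["name", "country"])

# Attach the best fuzzy match, then do an ordinary exact merge on it.
df1["match"] = df1["name"].apply(lambda n: best_match(n, df2["name"]))
merged = df1.merge(df2, left_on="match", right_on="name",
                   how="left", suffixes=("", "_canonical"))
```

Rows whose best score falls below the cutoff get a None match and therefore NaN country after the left merge, which keeps unmatched names visible instead of dropping them.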
Hi!
I was trying to build the hmni model from the source code (Python 3.8.5, pandas 1.1.3, sklearn 0.23.2).
I am facing the following error when running featurize(df) in model_building.ipynb:
AttributeError: module 'hmni.syllable_tokenizer' has no attribute 'tokenize'
from the following code:
df['syll_a'] = df.apply(lambda row: syllable_tokenizer.tokenize(row.name_a), axis=1)
How can we solve this?
Thanks,
Farhana
Thank you for sharing!
Could you share the training code? I would like to try training on my own dataset.
Thank you!
What should I install to resolve this? Did not find any answer. Thanks!
ModuleNotFoundError: No module named 'syllable_tokenizer'
I tried the example given in the README, but I got the following error:
Python code:
import hmni
# Initialize a Matcher Object
matcher = hmni.Matcher(model='latin')
# Single Pair Similarity
matcher.similarity('Alan', 'Al')
Terminal:
Traceback (most recent call last):
File "/home/cecile/.local/lib/python3.6/site-packages/hmni/input_helpers.py", line 33, in <module>
from .preprocess import MyVocabularyProcessor
File "/home/cecile/.local/lib/python3.6/site-packages/hmni/preprocess.py", line 30, in <module>
from tensorflow.python.platform import gfile
File "/home/cecile/.local/lib/python3.6/site-packages/tensorflow/__init__.py", line 28, in <module>
from tensorflow.python import pywrap_tensorflow # pylint: disable=unused-import
File "/home/cecile/.local/lib/python3.6/site-packages/tensorflow/python/__init__.py", line 52, in <module>
from tensorflow.core.framework.graph_pb2 import *
File "/home/cecile/.local/lib/python3.6/site-packages/tensorflow/core/framework/graph_pb2.py", line 7, in <module>
from google.protobuf import descriptor as _descriptor
File "/home/cecile/.local/lib/python3.6/site-packages/google/protobuf/descriptor.py", line 47, in <module>
from google.protobuf.pyext import _message
AttributeError: module 'google.protobuf.internal.containers' has no attribute 'MutableMapping'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "name_similarity.py", line 2, in <module>
import hmni
File "/home/cecile/.local/lib/python3.6/site-packages/hmni/__init__.py", line 1, in <module>
from . import input_helpers, matcher, preprocess, siamese_network, syllable_tokenizer
File "/home/cecile/.local/lib/python3.6/site-packages/hmni/input_helpers.py", line 35, in <module>
from preprocess import MyVocabularyProcessor
ModuleNotFoundError: No module named 'preprocess'
Sir, I am simply importing hmni and making a "Matcher" object like:
matcher = hmni.Matcher(model='latin')
And then I am using "matcher.similarity(str_1, str_2)"
But it throws the error given in the title above.
Can you tell me what I should do to fix this?
Python - 3.8.0
tensorflow - 2.9.1
hmni - 0.1.8
If there is something else you need to know about my system, please do tell, but I want to use this package ASAP
Thanks in advance
Modified example from the readme.md:
import pandas as pd
df1 = pd.DataFrame({'name': ['Al', 'Mark', 'James', 'Harold', 'Leon']}) # added name 'Leon'
df2 = pd.DataFrame({'name': ['Mark', 'Alan', 'James', 'Harold']})
merged = matcher.fuzzymerge(df1, df2, how='left', on='name')
This throws an error. The root cause seems to be that no match is found for 'Leon'. Indeed, setting threshold=0.4 runs without an error, since 'Leon' is matched to 'Alan' with similarity 0.43.
During pip install I'm getting the following error:
ERROR: Could not find a version that satisfies the requirement tensorflow<2.0,>=1.11 (from hmni) (from versions: 2.2.0rc1, 2.2.0rc2, 2.2.0rc3, 2.2.0rc4, 2.2.0, 2.3.0rc0, 2.3.0rc1, 2.3.0rc2, 2.3.0)
ERROR: No matching distribution found for tensorflow<2.0,>=1.11 (from hmni)
OS details:
Distributor ID: Ubuntu
Description: Ubuntu 20.04.1 LTS
Release: 20.04
Codename: focal
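The likely root cause: hmni pins tensorflow<2.0, and TensorFlow 1.x wheels were only ever published for Python 3.7 and earlier, so pip on a newer interpreter finds no candidate versions at all. One workaround, assuming conda is available, is a dedicated Python 3.7 environment:

```shell
# TF 1.x has no wheels for Python >= 3.8, so use a 3.7 interpreter.
conda create -n hmni-env python=3.7
conda activate hmni-env
pip install hmni
```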
Hello again :)
I am trying to use the library and get this error
UserWarning: Trying to unpickle estimator MinMaxScaler from version 0.23.1 when using version 1.0.2.
Downgrading scikit-learn did not help. How do you resolve situations like this? Thanks!
We built a 5000 row proof of concept that searches first and last name only, and it takes about a minute to show a result. Our implementation would need to search a few million rows. Do you have any suggestions on how to improve performance?
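A minute for 5,000 rows suggests an all-pairs scan. One standard speedup, independent of hmni's internals, is blocking: only score pairs that share a cheap key (initials, a phonetic code, etc.), which shrinks the comparison set dramatically. A stdlib-only sketch of the idea:

```python
from collections import defaultdict


def block_by_initials(names):
    """Group names by (first-name initial, last-name initial).
    Scoring only within a block replaces one huge all-pairs scan with
    many much smaller per-block scans; a phonetic key (e.g. Soundex)
    blocks even more aggressively, at some recall cost."""
    blocks = defaultdict(list)
    for name in names:
        parts = name.lower().split()
        key = (parts[0][0], parts[-1][0]) if parts else ("", "")
        blocks[key].append(name)
    return blocks


names = ["John Doe", "Jon Doe", "Mary Major", "Mark Mills"]
blocks = block_by_initials(names)
# Only "John Doe"/"Jon Doe" (block ('j', 'd')) ever get scored against
# each other; the ('m', 'm') block holds the other two names.
```

For millions of rows you would also precompute and cache the expensive per-name features once per unique name rather than per pair.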