christopher-thornton / hmni

📛 Fuzzy Name Matching with Machine Learning

License: MIT License

Python 100.00%
ai artificial-intelligence data-science fuzzy-matching machine-learning natural-language-processing nlp python

hmni's Issues

fuzzymerge throws error if match cannot be found

Modified example from the readme.md:

import pandas as pd
import hmni

matcher = hmni.Matcher(model='latin')  # as in the README example

df1 = pd.DataFrame({'name': ['Al', 'Mark', 'James', 'Harold', 'Leon']})  # added name 'Leon'
df2 = pd.DataFrame({'name': ['Mark', 'Alan', 'James', 'Harold']})
merged = matcher.fuzzymerge(df1, df2, how='left', on='name')

This throws an error. The root cause seems to be that no match is found for 'Leon'. Indeed, setting threshold=0.4 runs without an error, since 'Leon' is matched to 'Alan' with similarity 0.43.
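Given that observation, the workaround is simply to pass the lower threshold explicitly (a sketch reusing the objects defined above; the threshold keyword is the one already mentioned in this report):

# With threshold=0.4, 'Leon' is allowed to match 'Alan' (similarity ~0.43),
# so fuzzymerge no longer fails on the otherwise-unmatched left-hand row.
merged = matcher.fuzzymerge(df1, df2, how='left', on='name', threshold=0.4)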

AttributeError: module 'hmni.syllable_tokenizer' has no attribute 'tokenize'

@Christopher-Thornton, I got the error

AttributeError: module 'hmni.syllable_tokenizer' has no attribute 'tokenize'

while running your cell 21 from the notebook

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-25-f1b5a7cac224> in <module>
----> 1 df = featurize(df)

<ipython-input-15-eff31deb72c1> in featurize(df)
     12         '[^a-zA-Z]+', '', unidecode.unidecode(row['b']).lower().strip()), axis=1)
     13 
---> 14     df['syll_a'] = df.apply(lambda row: syllable_tokenizer.tokenize(row.name_a), axis=1)
     15     df['syll_b'] = df.apply(lambda row: syllable_tokenizer.tokenize(row.name_b), axis=1)
     16 

~\Anaconda3\lib\site-packages\pandas\core\frame.py in apply(self, func, axis, raw, result_type, args, **kwds)
   7766             kwds=kwds,
   7767         )
-> 7768         return op.get_result()
   7769 
   7770     def applymap(self, func, na_action: Optional[str] = None) -> DataFrame:

~\Anaconda3\lib\site-packages\pandas\core\apply.py in get_result(self)
    183             return self.apply_raw()
    184 
--> 185         return self.apply_standard()
    186 
    187     def apply_empty_result(self):

~\Anaconda3\lib\site-packages\pandas\core\apply.py in apply_standard(self)
    274 
    275     def apply_standard(self):
--> 276         results, res_index = self.apply_series_generator()
    277 
    278         # wrap results

~\Anaconda3\lib\site-packages\pandas\core\apply.py in apply_series_generator(self)
    288             for i, v in enumerate(series_gen):
    289                 # ignore SettingWithCopy here in case the user mutates
--> 290                 results[i] = self.f(v)
    291                 if isinstance(results[i], ABCSeries):
    292                     # If we have a view on v, we need to make a copy because

<ipython-input-15-eff31deb72c1> in <lambda>(row)
     12         '[^a-zA-Z]+', '', unidecode.unidecode(row['b']).lower().strip()), axis=1)
     13 
---> 14     df['syll_a'] = df.apply(lambda row: syllable_tokenizer.tokenize(row.name_a), axis=1)
     15     df['syll_b'] = df.apply(lambda row: syllable_tokenizer.tokenize(row.name_b), axis=1)
     16 

AttributeError: module 'hmni.syllable_tokenizer' has no attribute 'tokenize'

ImportError: cannot import name 'MyVocabularyProcessor' from 'preprocess' (D:\Anaconda\lib\site-packages\preprocess.py)

Hi, thanks for this fantastic library. But I have an issue. I have Python 3.8.3. I installed the required libraries that you mentioned, but I was also required to install preprocess. After this, I tried to run just the import of your library again, and the code returned the error:

ModuleNotFoundError: No module named 'packaging.about'

During handling of the above exception, another exception occurred:

ImportError                               Traceback (most recent call last)
in <module>
----> 1 import hmni
      2
      3 # Initialize a Matcher Object
      4 #matcher = hmni.Matcher(model='latin')
      5

D:\Anaconda\lib\site-packages\hmni\__init__.py in <module>
----> 1 from . import input_helpers, matcher, preprocess, siamese_network, syllable_tokenizer
      2 import os
      3 import tarfile
      4
      5 # extract model tarball into directory if doesnt exist

D:\Anaconda\lib\site-packages\hmni\input_helpers.py in <module>
     33     from .preprocess import MyVocabularyProcessor
     34 except:
---> 35     from preprocess import MyVocabularyProcessor
     36
     37

ImportError: cannot import name 'MyVocabularyProcessor' from 'preprocess' (D:\Anaconda\lib\site-packages\preprocess.py)

What is the problem?? Thanks again!
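For what it's worth, the fallback import "from preprocess import MyVocabularyProcessor" inside hmni's input_helpers.py is resolving to the standalone preprocess package installed from PyPI, which has no MyVocabularyProcessor. A quick diagnostic sketch (plain Python, nothing hmni-specific) to confirm which module is being picked up:

import preprocess

# If this prints ...\site-packages\preprocess.py (the standalone PyPI package)
# rather than a path inside the hmni package, that package is shadowing
# hmni's own hmni/preprocess.py in the fallback import shown above.
print(preprocess.__file__)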

featurize(df) throwing error: AttributeError: module hmni.syllable_tokenizer has no attribute 'tokenize'

Hi!
I was trying to build the hmni model from the source code. Python 3.8.5, pandas 1.1.3, sklearn 0.23.2.

I am facing the following error when running featurize(df) in model_building.ipynb:

AttributeError: module hmni.syllable_tokenizer has no attribute 'tokenize'

from the following code:
df['syll_a'] = df.apply(lambda row: syllable_tokenizer.tokenize(row.name_a), axis=1)

How can we solve this?

Thanks,
Farhana

pandas fuzzy merge with particular distance algorithm

I see in the requirements that hmni uses abydos, which in turn is the only package I have found implementing some distance algorithms, like Rees-Levenshtein.

I wonder if someone could provide an example of how to fuzzy merge two pandas dataframes on a common column, but establishing a minimum distance cutoff using that particular algorithm.

import pandas as pd
df1 = pd.DataFrame({"name":["Johnn Doe", "Jon Doe", "Margaret Tacher", "Marareth Tatcher"]})
df2 = pd.DataFrame(
	[["John Doe", "US"],["Margareth Thatcher","UK"]],
	columns=["name", "country"])

For example, get the country for each person's name in df1, based on fuzzy matches against df2 with a particular algorithm.

Is it possible to do that with hmni?
I suppose I should use the fuzzymerge method, but I can't figure out how to specify an algorithm taken from abydos.

Thanks a lot for your help in advance
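One possible sketch, sidestepping hmni's fuzzymerge and driving abydos directly from plain pandas (this assumes the standard abydos.distance classes with their sim() method; Levenshtein here is just a stand-in for whichever abydos measure is wanted):

import pandas as pd
from abydos.distance import Levenshtein

df1 = pd.DataFrame({"name": ["Johnn Doe", "Jon Doe", "Margaret Tacher", "Marareth Tatcher"]})
df2 = pd.DataFrame(
    [["John Doe", "US"], ["Margareth Thatcher", "UK"]],
    columns=["name", "country"])

measure = Levenshtein()   # swap in any other abydos distance class here
cutoff = 0.7              # minimum normalized similarity to accept a match

def best_match(name, candidates):
    # Return the candidate with the highest similarity, or None if below the cutoff.
    scored = [(measure.sim(name, c), c) for c in candidates]
    score, match = max(scored)
    return match if score >= cutoff else None

df1["match"] = df1["name"].apply(lambda n: best_match(n, df2["name"]))
merged = df1.merge(df2, left_on="match", right_on="name",
                   how="left", suffixes=("", "_df2"))
print(merged[["name", "country"]])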

matcher.similarity gives an error

I tried the example given in the README, but I got the following error:

Python code:

import hmni

# Initialize a Matcher Object
matcher = hmni.Matcher(model='latin')

# Single Pair Similarity
matcher.similarity('Alan', 'Al')

Terminal:

Traceback (most recent call last):
  File "/home/cecile/.local/lib/python3.6/site-packages/hmni/input_helpers.py", line 33, in <module>
    from .preprocess import MyVocabularyProcessor
  File "/home/cecile/.local/lib/python3.6/site-packages/hmni/preprocess.py", line 30, in <module>
    from tensorflow.python.platform import gfile
  File "/home/cecile/.local/lib/python3.6/site-packages/tensorflow/__init__.py", line 28, in <module>
    from tensorflow.python import pywrap_tensorflow  # pylint: disable=unused-import
  File "/home/cecile/.local/lib/python3.6/site-packages/tensorflow/python/__init__.py", line 52, in <module>
    from tensorflow.core.framework.graph_pb2 import *
  File "/home/cecile/.local/lib/python3.6/site-packages/tensorflow/core/framework/graph_pb2.py", line 7, in <module>
    from google.protobuf import descriptor as _descriptor
  File "/home/cecile/.local/lib/python3.6/site-packages/google/protobuf/descriptor.py", line 47, in <module>
    from google.protobuf.pyext import _message
AttributeError: module 'google.protobuf.internal.containers' has no attribute 'MutableMapping'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "name_similarity.py", line 2, in <module>
    import hmni
  File "/home/cecile/.local/lib/python3.6/site-packages/hmni/__init__.py", line 1, in <module>
    from . import input_helpers, matcher, preprocess, siamese_network, syllable_tokenizer
  File "/home/cecile/.local/lib/python3.6/site-packages/hmni/input_helpers.py", line 35, in <module>
    from preprocess import MyVocabularyProcessor
ModuleNotFoundError: No module named 'preprocess'

Slow, even with small dataset

We built a 5000 row proof of concept that searches first and last name only, and it takes about a minute to show a result. Our implementation would need to search a few million rows. Do you have any suggestions on how to improve performance?
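Not a full answer, but one generic mitigation sketch: cache pairwise scores so each distinct name pair goes through the model only once (this assumes the matcher.similarity call shown elsewhere in these issues; real name columns usually contain many repeats, so the cache can cut the number of model calls substantially):

from functools import lru_cache

import hmni

matcher = hmni.Matcher(model='latin')

@lru_cache(maxsize=None)
def cached_similarity(a, b):
    # Each distinct (a, b) pair is scored by the model only once;
    # repeated pairs become cheap dictionary lookups.
    return matcher.similarity(a, b)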

ImportError: cannot import name 'Iterable' from 'collections'

I just pip installed it and wanted to try it out, but can't even get that far. :/

If I just have a single line:
import hmni

It immediately fails with the above error. I tried running pip install a second time in case something was missed; it gave a bunch of "Requirement already satisfied" messages.

Some searching led me to try
from collections.abc import Iterable
but it doesn't help. Full error text follows:

Traceback (most recent call last):
  File "C:\Users\abc\something.py", line 1, in <module>
    import hmni
  File "C:\Users\abc\AppData\Local\Programs\Python\Python310\lib\site-packages\hmni\__init__.py", line 1, in <module>
    from . import input_helpers, matcher, preprocess, siamese_network, syllable_tokenizer
  File "C:\Users\abc\AppData\Local\Programs\Python\Python310\lib\site-packages\hmni\matcher.py", line 30, in <module>
    from abydos.distance import (IterativeSubString, BISIM, DiscountedLevenshtein, Prefix, LCSstr, MLIPNS, Strcmp95,
  File "C:\Users\abc\AppData\Local\Programs\Python\Python310\lib\site-packages\abydos\distance\__init__.py", line 369, in <module>
    from ._ample import AMPLE
  File "C:\Users\abc\AppData\Local\Programs\Python\Python310\lib\site-packages\abydos\distance\_ample.py", line 22, in <module>
    from ._token_distance import _TokenDistance
  File "C:\Users\abc\AppData\Local\Programs\Python\Python310\lib\site-packages\abydos\distance\_token_distance.py", line 35, in <module>
    from ..tokenizer import QGrams, QSkipgrams, WhitespaceTokenizer
  File "C:\Users\abc\AppData\Local\Programs\Python\Python310\lib\site-packages\abydos\tokenizer\__init__.py", line 99, in <module>
    from ._q_grams import QGrams
  File "C:\Users\abc\AppData\Local\Programs\Python\Python310\lib\site-packages\abydos\tokenizer\_q_grams.py", line 22, in <module>
    from collections import Iterable
ImportError: cannot import name 'Iterable' from 'collections' (C:\Users\abc\AppData\Local\Programs\Python\Python310\lib\collections\__init__.py)
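The failing import lives inside abydos itself (_q_grams.py), so adding "from collections.abc import Iterable" to your own script cannot reach it. A heavily hedged shim sketch, assuming nothing else in the stack objects to the alias being restored (other Python 3.10 incompatibilities in an old abydos release may still surface afterwards):

# Restore the pre-3.10 alias before anything imports abydos, so that
# abydos' "from collections import Iterable" resolves again.
import collections
import collections.abc

if not hasattr(collections, "Iterable"):
    collections.Iterable = collections.abc.Iterable

import hmni  # abydos is imported here with the alias in place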

Tensorflow Dependency Error

During pip install I'm getting the following error:

ERROR: Could not find a version that satisfies the requirement tensorflow<2.0,>=1.11 (from hmni) (from versions: 2.2.0rc1, 2.2.0rc2, 2.2.0rc3, 2.2.0rc4, 2.2.0, 2.3.0rc0, 2.3.0rc1, 2.3.0rc2, 2.3.0)
ERROR: No matching distribution found for tensorflow<2.0,>=1.11 (from hmni)


OS details:

Distributor ID: Ubuntu
Description: Ubuntu 20.04.1 LTS
Release: 20.04
Codename: focal

TypeError: 'Matcher' object is not callable

Sir, I am simply importing hmni and making a "Matcher" object like:
matcher = hmni.Matcher(model='latin')

And then I am using "matcher.similarity(str_1, str_2)"

But it throws the error in the title above.
Can you tell me what I should do to fix this?
Python - 3.8.0
tensorflow - 2.9.1
hmni - 0.1.8

If there is something else you need to know about my system, please do tell, but I want to use this package ASAP

Thanks in advance

Cannot import name 'float' from 'numpy'

I am getting this stack trace when I try to import hmni:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/homebrew/lib/python3.10/site-packages/hmni/__init__.py", line 1, in <module>
    from . import input_helpers, matcher, preprocess, siamese_network, syllable_tokenizer
  File "/opt/homebrew/lib/python3.10/site-packages/hmni/matcher.py", line 30, in <module>
    from abydos.distance import (IterativeSubString, BISIM, DiscountedLevenshtein, Prefix, LCSstr, MLIPNS, Strcmp95,
  File "/opt/homebrew/lib/python3.10/site-packages/abydos/distance/__init__.py", line 368, in <module>
    from ._aline import ALINE
  File "/opt/homebrew/lib/python3.10/site-packages/abydos/distance/_aline.py", line 25, in <module>
    from numpy import float as np_float
ImportError: cannot import name 'float' from 'numpy' (/opt/homebrew/lib/python3.10/site-packages/numpy/__init__.py)
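The failing line is abydos' _aline.py doing "from numpy import float as np_float"; NumPy removed the long-deprecated numpy.float alias in 1.24. Pinning an older numpy (below 1.24) is the cleaner fix; the sketch below is a hacky shim under the assumption that nothing else in the environment depends on the alias staying removed:

# Recreate the removed numpy.float alias before hmni (and thus abydos) is imported.
import numpy as np

if not hasattr(np, "float"):
    np.float = float  # the old alias pointed at the builtin float anyway

import hmni  # abydos' "from numpy import float as np_float" now resolves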
