
christopher-thornton / hmni


📛 Fuzzy Name Matching with Machine Learning

License: MIT License

Python 100.00%
natural-language-processing fuzzy-matching nlp machine-learning data-science python artificial-intelligence ai

hmni's People

Contributors

christopher-thornton


hmni's Issues

ImportError: cannot import name 'Iterable' from 'collections'

I just pip installed it and wanted to try it out, but can't even get that far. :/

If I just have a single line:
import hmni

It immediately fails with the above error. I tried running pip install a second time in case something was missed, but it only printed a series of "Requirement already satisfied" messages.

Some searching led me to try
from collections.abc import Iterable
but it doesn't help. Full error text follows:

Traceback (most recent call last):
  File "C:\Users\abc\something.py", line 1, in <module>
    import hmni
  File "C:\Users\abc\AppData\Local\Programs\Python\Python310\lib\site-packages\hmni\__init__.py", line 1, in <module>
    from . import input_helpers, matcher, preprocess, siamese_network, syllable_tokenizer
  File "C:\Users\abc\AppData\Local\Programs\Python\Python310\lib\site-packages\hmni\matcher.py", line 30, in <module>
    from abydos.distance import (IterativeSubString, BISIM, DiscountedLevenshtein, Prefix, LCSstr, MLIPNS, Strcmp95,
  File "C:\Users\abc\AppData\Local\Programs\Python\Python310\lib\site-packages\abydos\distance\__init__.py", line 369, in <module>
    from ._ample import AMPLE
  File "C:\Users\abc\AppData\Local\Programs\Python\Python310\lib\site-packages\abydos\distance\_ample.py", line 22, in <module>
    from ._token_distance import _TokenDistance
  File "C:\Users\abc\AppData\Local\Programs\Python\Python310\lib\site-packages\abydos\distance\_token_distance.py", line 35, in <module>
    from ..tokenizer import QGrams, QSkipgrams, WhitespaceTokenizer
  File "C:\Users\abc\AppData\Local\Programs\Python\Python310\lib\site-packages\abydos\tokenizer\__init__.py", line 99, in <module>
    from ._q_grams import QGrams
  File "C:\Users\abc\AppData\Local\Programs\Python\Python310\lib\site-packages\abydos\tokenizer\_q_grams.py", line 22, in <module>
    from collections import Iterable
ImportError: cannot import name 'Iterable' from 'collections' (C:\Users\abc\AppData\Local\Programs\Python\Python310\lib\collections\__init__.py)
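A note on why the attempted fix has no effect: the failing statement is inside abydos (_q_grams.py), not in the calling script, so adding "from collections.abc import Iterable" to your own file does not change what abydos imports. Python 3.10 removed the abstract base class aliases from the collections module. One possible workaround, sketched below as an assumption rather than an official fix from hmni or abydos, is to put the aliases back before importing hmni. The helper name restore_collections_aliases is hypothetical.

```python
import collections
import collections.abc

# ABC names that older libraries may still import from `collections`.
_LEGACY_ABC_NAMES = ("Iterable", "Mapping", "MutableMapping", "Sequence")


def restore_collections_aliases(names=_LEGACY_ABC_NAMES):
    """Re-create aliases such as collections.Iterable if the interpreter lacks them.

    Hypothetical helper: returns the list of names that had to be restored.
    """
    restored = []
    for name in names:
        if not hasattr(collections, name):
            setattr(collections, name, getattr(collections.abc, name))
            restored.append(name)
    return restored


restore_collections_aliases()

# The legacy import style now resolves on Python 3.10+ as well.
from collections import Iterable  # noqa: E402

# import hmni  # would follow here, after the aliases are in place
```

This only addresses the collections import shown in the traceback. Other incompatibilities between the installed dependencies and Python 3.10 may still surface afterwards, in which case running hmni on an older Python version may be the simpler route.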

AttributeError: module 'hmni.syllable_tokenizer' has no attribute 'tokenize'

@Christopher-Thornton, I got the error

AttributeError: module 'hmni.syllable_tokenizer' has no attribute 'tokenize'

while running your cell 21 from the notebook

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-25-f1b5a7cac224> in <module>
----> 1 df = featurize(df)

<ipython-input-15-eff31deb72c1> in featurize(df)
     12         '[^a-zA-Z]+', '', unidecode.unidecode(row['b']).lower().strip()), axis=1)
     13 
---> 14     df['syll_a'] = df.apply(lambda row: syllable_tokenizer.tokenize(row.name_a), axis=1)
     15     df['syll_b'] = df.apply(lambda row: syllable_tokenizer.tokenize(row.name_b), axis=1)
     16 

~\Anaconda3\lib\site-packages\pandas\core\frame.py in apply(self, func, axis, raw, result_type, args, **kwds)
   7766             kwds=kwds,
   7767         )
-> 7768         return op.get_result()
   7769 
   7770     def applymap(self, func, na_action: Optional[str] = None) -> DataFrame:

~\Anaconda3\lib\site-packages\pandas\core\apply.py in get_result(self)
    183             return self.apply_raw()
    184 
--> 185         return self.apply_standard()
    186 
    187     def apply_empty_result(self):

~\Anaconda3\lib\site-packages\pandas\core\apply.py in apply_standard(self)
    274 
    275     def apply_standard(self):
--> 276         results, res_index = self.apply_series_generator()
    277 
    278         # wrap results

~\Anaconda3\lib\site-packages\pandas\core\apply.py in apply_series_generator(self)
    288             for i, v in enumerate(series_gen):
    289                 # ignore SettingWithCopy here in case the user mutates
--> 290                 results[i] = self.f(v)
    291                 if isinstance(results[i], ABCSeries):
    292                     # If we have a view on v, we need to make a copy because

<ipython-input-15-eff31deb72c1> in <lambda>(row)
     12         '[^a-zA-Z]+', '', unidecode.unidecode(row['b']).lower().strip()), axis=1)
     13 
---> 14     df['syll_a'] = df.apply(lambda row: syllable_tokenizer.tokenize(row.name_a), axis=1)
     15     df['syll_b'] = df.apply(lambda row: syllable_tokenizer.tokenize(row.name_b), axis=1)
     16 

AttributeError: module 'hmni.syllable_tokenizer' has no attribute 'tokenize'

Cannot import name 'float' from 'numpy'

I am getting this stack trace when I try to import hmni:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/homebrew/lib/python3.10/site-packages/hmni/__init__.py", line 1, in <module>
    from . import input_helpers, matcher, preprocess, siamese_network, syllable_tokenizer
  File "/opt/homebrew/lib/python3.10/site-packages/hmni/matcher.py", line 30, in <module>
    from abydos.distance import (IterativeSubString, BISIM, DiscountedLevenshtein, Prefix, LCSstr, MLIPNS, Strcmp95,
  File "/opt/homebrew/lib/python3.10/site-packages/abydos/distance/__init__.py", line 368, in <module>
    from ._aline import ALINE
  File "/opt/homebrew/lib/python3.10/site-packages/abydos/distance/_aline.py", line 25, in <module>
    from numpy import float as np_float
ImportError: cannot import name 'float' from 'numpy' (/opt/homebrew/lib/python3.10/site-packages/numpy/__init__.py)
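For context: numpy.float was a deprecated alias for the builtin float and was removed in NumPy 1.24, so the import inside abydos (_aline.py) fails on newer NumPy releases. The sketch below shows one possible workaround, restoring the alias before hmni is imported. It is an assumption, not a fix confirmed by the hmni or abydos maintainers, and the helper name restore_aliases is hypothetical. The function takes the module as an argument so the sketch can run even where NumPy is not installed.

```python
import types

# Aliases removed from NumPy's namespace in 1.24, mapped to the builtins they pointed to.
_REMOVED_ALIASES = {"float": float, "int": int}


def restore_aliases(np_module, aliases=_REMOVED_ALIASES):
    """Set np_module.float etc. when missing; return the names that were restored."""
    restored = []
    for name, builtin in aliases.items():
        if not hasattr(np_module, name):
            setattr(np_module, name, builtin)
            restored.append(name)
    return restored


# Stand-in module so the sketch runs without NumPy installed.
fake_numpy = types.ModuleType("fake_numpy")
print(restore_aliases(fake_numpy))  # ['float', 'int']
```

In a real session the call would be "import numpy as np", then "restore_aliases(np)", then "import hmni". Installing a NumPy release older than 1.24 is another option, provided the rest of the environment allows it.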

ImportError: cannot import name 'MyVocabularyProcessor' from 'preprocess' (D:\Anaconda\lib\site-packages\preprocess.py)

Hi, thanks for this fantastic library. I have an issue on Python 3.8.3. I installed the required libraries that you mentioned, but I also had to install preprocess. After that, I tried running only the import of your library again, and it returned this error:

ModuleNotFoundError: No module named 'packaging.about'

During handling of the above exception, another exception occurred:

ImportError Traceback (most recent call last)
in
----> 1 import hmni
2
3 # Initialize a Matcher Object
4 #matcher = hmni.Matcher(model='latin')
5

D:\Anaconda\lib\site-packages\hmni\__init__.py in
----> 1 from . import input_helpers, matcher, preprocess, siamese_network, syllable_tokenizer
2 import os
3 import tarfile
4
5 # extract model tarball into directory if doesnt exist

D:\Anaconda\lib\site-packages\hmni\input_helpers.py in
33 from .preprocess import MyVocabularyProcessor
34 except:
---> 35 from preprocess import MyVocabularyProcessor
36
37

ImportError: cannot import name 'MyVocabularyProcessor' from 'preprocess' (D:\Anaconda\lib\site-packages\preprocess.py)

What is the problem?? Thanks again!
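A reading of the traceback above: input_helpers.py first tries "from .preprocess import MyVocabularyProcessor" and, when that raises, falls back to a top-level "preprocess" module. The separately installed preprocess package is unrelated to hmni, so the second error is a side effect. The error worth chasing is the first one in the chain (here, the missing packaging.about module). The sketch below simulates that fallback pattern with stand-in exceptions and walks the chain back to the first error. The names root_cause and fallback_import are hypothetical.

```python
def root_cause(exc):
    """Follow __cause__ / __context__ links back to the first exception raised."""
    seen = set()
    while True:
        earlier = exc.__cause__ or exc.__context__
        if earlier is None or id(earlier) in seen:
            return exc
        seen.add(id(exc))
        exc = earlier


def fallback_import():
    """Stand-in for the try/except import pattern seen in the traceback."""
    try:
        # Simulates the relative import failing for an unrelated reason.
        raise ModuleNotFoundError("No module named 'packaging.about'")
    except Exception:
        # Simulates the fallback picking up the wrong top-level module.
        raise ImportError("cannot import name 'MyVocabularyProcessor' from 'preprocess'")


try:
    fallback_import()
except ImportError as err:
    cause = root_cause(err)
    print(type(cause).__name__, "-", cause)
```

Applied to this report, that points at the environment (the package that fails to provide packaging.about) rather than at the preprocess module, which probably did not need to be installed at all.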

pandas fuzzy merge with particular distance algorithm

I see in the requirements that hmni uses abydos, which in turn is the only package I have found that implements certain distance algorithms, such as Rees-Levenshtein.

I wonder if someone could provide an example of how to fuzzy merge two pandas dataframes on a common column while establishing a minimum distance cutoff using that particular algorithm.

import pandas as pd
df1 = pd.DataFrame({"name":["Johnn Doe", "Jon Doe", "Margaret Tacher", "Marareth Tatcher"]})
df2 = pd.DataFrame(
	[["John Doe", "US"],["Margareth Thatcher","UK"]],
	columns=["name", "country"])

For example, get country for each person name in df1, based on fuzzy matches against df2 with a particular algorithm.

Is it possible to do that with hmni?
I assume I should use the fuzzymerge method, but I can't figure out how to specify an algorithm taken from abydos.

Thanks a lot for your help in advance
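Independently of hmni, the lookup described in this question can be sketched with a pluggable similarity function. The example below is a minimal sketch under stated assumptions: it uses difflib.SequenceMatcher from the standard library as a stand-in for the abydos algorithm, keeps the data in plain Python structures instead of dataframes, and the names ratio_similarity and best_match, as well as the 0.7 cutoff, are hypothetical. Whether fuzzymerge itself accepts a custom algorithm is not something this sketch answers.

```python
from difflib import SequenceMatcher


def ratio_similarity(a, b):
    """Stand-in similarity in [0, 1]; any callable of the same shape can replace it."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_match(name, candidates, similarity=ratio_similarity, cutoff=0.7):
    """Return the candidate most similar to `name`, or None if none reaches `cutoff`."""
    best, best_score = None, cutoff
    for candidate in candidates:
        score = similarity(name, candidate)
        if score >= best_score:
            best, best_score = candidate, score
    return best


# Same data as in the question, as plain Python structures.
names = ["Johnn Doe", "Jon Doe", "Margaret Tacher", "Marareth Tatcher"]
countries = {"John Doe": "US", "Margareth Thatcher": "UK"}

# Map each name to its best match, then look up the country (None when unmatched).
matched = {name: best_match(name, countries) for name in names}
result = {name: countries.get(key) for name, key in matched.items()}
print(result)
```

With pandas, the same function could be applied to df1["name"] to build a helper column of matched names, followed by an ordinary left merge of that column against df2["name"]. A similarity method from abydos could be passed as the similarity argument, assuming it takes two strings and returns a score between 0 and 1.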

featurize(df) throwing error: AttributeError: module hmni.syllable_tokenizer has no attribute 'tokenize'

Hi!
I was trying to build the hmni model from the source code, using Python 3.8.5, pandas 1.1.3, and sklearn 0.23.2.

I am facing the following error when running featurize(df) in model_building.ipynb:

AttributeError: module hmni.syllable_tokenizer has no attribute 'tokenize'

from the following code:
df['syll_a']= df.apply(lambda row: syllable_tokenizer.tokenize(row.name_a), axis=1)

How can we solve this?

Thanks,
Farhana

matcher.similarity gives an error

I tried the example given in the README, but I got the following error:

Python code:

import hmni

# Initialize a Matcher Object
matcher = hmni.Matcher(model='latin')

# Single Pair Similarity
matcher.similarity('Alan', 'Al')

Terminal:

Traceback (most recent call last):
  File "/home/cecile/.local/lib/python3.6/site-packages/hmni/input_helpers.py", line 33, in <module>
    from .preprocess import MyVocabularyProcessor
  File "/home/cecile/.local/lib/python3.6/site-packages/hmni/preprocess.py", line 30, in <module>
    from tensorflow.python.platform import gfile
  File "/home/cecile/.local/lib/python3.6/site-packages/tensorflow/__init__.py", line 28, in <module>
    from tensorflow.python import pywrap_tensorflow  # pylint: disable=unused-import
  File "/home/cecile/.local/lib/python3.6/site-packages/tensorflow/python/__init__.py", line 52, in <module>
    from tensorflow.core.framework.graph_pb2 import *
  File "/home/cecile/.local/lib/python3.6/site-packages/tensorflow/core/framework/graph_pb2.py", line 7, in <module>
    from google.protobuf import descriptor as _descriptor
  File "/home/cecile/.local/lib/python3.6/site-packages/google/protobuf/descriptor.py", line 47, in <module>
    from google.protobuf.pyext import _message
AttributeError: module 'google.protobuf.internal.containers' has no attribute 'MutableMapping'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "name_similarity.py", line 2, in <module>
    import hmni
  File "/home/cecile/.local/lib/python3.6/site-packages/hmni/__init__.py", line 1, in <module>
    from . import input_helpers, matcher, preprocess, siamese_network, syllable_tokenizer
  File "/home/cecile/.local/lib/python3.6/site-packages/hmni/input_helpers.py", line 35, in <module>
    from preprocess import MyVocabularyProcessor
ModuleNotFoundError: No module named 'preprocess'

TypeError: 'Matcher' object is not callable

Sir, I am simply importing hmni and making a "Matcher" object like:
matcher = hmni.Matcher(model='latin')

And then I am using "matcher.similarity(str_1, str_2)"

But it throws the error in the title.
Can you tell me what I should do to fix this?
Python - 3.8.0
tensorflow - 2.9.1
hmni - 0.1.8

If there is something else you need to know about my system, please do tell, but I want to use this package ASAP

Thanks in advance

fuzzymerge throws error if match cannot be found

Modified example from the readme.md:

import hmni
import pandas as pd

# Matcher initialized as in the README example
matcher = hmni.Matcher(model='latin')
df1 = pd.DataFrame({'name': ['Al', 'Mark', 'James', 'Harold', 'Leon']}) # added name 'Leon'
df2 = pd.DataFrame({'name': ['Mark', 'Alan', 'James', 'Harold']})
merged = matcher.fuzzymerge(df1, df2, how='left', on='name')

This throws an error. The root cause seems to be that no match is found for 'Leon'. Indeed, setting threshold=0.4 runs without an error, since 'Leon' is matched to 'Alan' with similarity 0.43.

Tensorflow Dependency Error

During pip install I'm getting the following error:

ERROR: Could not find a version that satisfies the requirement tensorflow<2.0,>=1.11 (from hmni) (from versions: 2.2.0rc1, 2.2.0rc2, 2.2.0rc3, 2.2.0rc4, 2.2.0, 2.3.0rc0, 2.3.0rc1, 2.3.0rc2, 2.3.0)
ERROR: No matching distribution found for tensorflow<2.0,>=1.11 (from hmni)


OS details:

Distributor ID: Ubuntu
Description: Ubuntu 20.04.1 LTS
Release: 20.04
Codename: focal
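A likely explanation, stated as an assumption: TensorFlow 1.x wheels on PyPI were published for Python versions up to 3.7, while Ubuntu 20.04 ships Python 3.8 by default. That would match the list in the error, which starts at 2.2.0rc1. The sketch below checks the running interpreter against that limit. The constant TF1_MAX_PYTHON and the function tf1_installable are hypothetical names.

```python
import sys

# Assumption: TensorFlow 1.x wheels on PyPI target CPython up to 3.7.
TF1_MAX_PYTHON = (3, 7)


def tf1_installable(version_info=sys.version_info):
    """Return True if this Python version is old enough for tensorflow<2.0 wheels."""
    return tuple(version_info[:2]) <= TF1_MAX_PYTHON


print(
    "Python",
    ".".join(str(part) for part in sys.version_info[:3]),
    "can install tensorflow<2.0 from PyPI:",
    tf1_installable(),
)
```

If the check prints False, creating a virtual environment with Python 3.7 or earlier (for example through conda or pyenv) may let the pinned requirement resolve.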

Slow, even with small dataset

We built a 5,000-row proof of concept that searches on first and last name only, and it takes about a minute to return a result. Our implementation would need to search a few million rows. Do you have any suggestions on how to improve performance?
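One general technique from record linkage that may help here is blocking: group the candidate names by a cheap key and run the expensive similarity function only within the matching group. The sketch below is not an hmni feature. It uses the first letter as the blocking key and difflib.SequenceMatcher as a stand-in for the model score, and the names block_key, build_index and search are hypothetical.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def ratio_similarity(a, b):
    """Stand-in for an expensive similarity call (for example, a model score)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def block_key(name):
    """Cheap blocking key: the first letter, lower-cased (an illustrative choice)."""
    return name.strip().lower()[:1]


def build_index(names):
    """Group names by blocking key once, ahead of any searches."""
    index = defaultdict(list)
    for name in names:
        index[block_key(name)].append(name)
    return index


def search(query, index, similarity=ratio_similarity, cutoff=0.85):
    """Score only the candidates that share the query's block, best match first."""
    candidates = index.get(block_key(query), [])
    scored = [(similarity(query, candidate), candidate) for candidate in candidates]
    return sorted((pair for pair in scored if pair[0] >= cutoff), reverse=True)


index = build_index(["Alan", "Alana", "Mark", "Marc", "James"])
results = search("Allan", index)
print(results)
```

The trade-off is recall: a first-letter key never compares names such as "Kathy" and "Cathy". A phonetic key, or several keys combined, reduces that loss at the cost of larger blocks.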
