mrpowers / ceja

PySpark phonetic and string matching algorithms

License: MIT License

Python 100.00%

pyspark jaro-winkler nysiis metaphone match-rating-comparisons porter-stemmer damerau-levenshtein hamming-distance jaro-similarity

ceja's People

zhuohuwu0603 thunter009 carlonicolini squipix ashvardanian

ceja's Issues

Accelerating string operations

I've noticed that ceja relies on jellyfish for Levenshtein distance computations, which opens an optimization opportunity. StringZilla should be a few times faster for that operation, and may be very handy for other tasks as well 🤗
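
For context, the operation in question is classic edit distance. Below is a minimal pure-Python sketch of the textbook dynamic-programming algorithm. It is not the jellyfish or StringZilla implementation, and the function name `levenshtein` is hypothetical. A baseline like this may be useful mainly as a correctness reference when comparing faster libraries.

```python
def levenshtein(a: str, b: str) -> int:
    """Textbook two-row dynamic-programming edit distance (illustrative only)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]
```

Any faster backend should return the same integers as this sketch on the same inputs, so it can serve as a quick sanity check before benchmarking.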

Perhaps add thefuzz

https://github.com/seatgeek/thefuzz

Seems like this is a popular lib.

Error while using Stemming

Hello,

I am facing issues when trying to apply stemming to text data in AWS with PySpark. Here is the error message I'm getting:
PythonException: An exception was thrown from a UDF: 'pyspark.serializers.SerializationError: Caused by Traceback (most recent call last):

How can I resolve this?

Thank you for your support.

Best,
Evangelia

pip install downloads pyspark by default and fails to work

When I pip install ceja, pip automatically downloads
pyspark-3.1.1.tar.gz (212.3MB),
which is a problem because it's the wrong version (I am using 3.0.0 on both EMR and WSL).
Even after I uninstall it, I still get errors on EMR.
Can this behavior be stopped?

[hadoop@ip-172-31-89-109 ~]$ sudo /usr/bin/pip3 install ceja
WARNING: Running pip install with root privileges is generally not a good idea. Try `pip3 install --user` instead.
Collecting ceja
  Downloading https://files.pythonhosted.org/packages/c6/80/f372c62a83175f4c54229474f543aeca3344f4c64aab4bcfe7cf05f50cbf/ceja-0.2.0-py3-none-any.whl
Collecting pyspark>2.0.0 (from ceja)
  Downloading https://files.pythonhosted.org/packages/45/b0/9d6860891ab14a39d4bddf80ba26ce51c2f9dc4805e5c6978ac0472c120a/pyspark-3.1.1.tar.gz (212.3MB)
    100% |████████████████████████████████| 212.3MB 6.3kB/s
Collecting jellyfish<0.9.0,>=0.8.2 (from ceja)
  Downloading https://files.pythonhosted.org/packages/04/3f/d03cb056f407ef181a45569255348457b1a0915fc4eb23daeceb930a68a4/jellyfish-0.8.2.tar.gz (134kB)
    100% |████████████████████████████████| 143kB 9.1MB/s
Collecting py4j==0.10.9 (from pyspark>2.0.0->ceja)
  Downloading https://files.pythonhosted.org/packages/9e/b6/6a4fb90cd235dc8e265a6a2067f2a2c99f0d91787f06aca4bcf7c23f3f80/py4j-0.10.9-py2.py3-none-any.whl (198kB)
    100% |████████████████████████████████| 204kB 6.5MB/s
Installing collected packages: py4j, pyspark, jellyfish, ceja
  Running setup.py install for pyspark ... done
  Running setup.py install for jellyfish ... done
Successfully installed ceja-0.2.0 jellyfish-0.8.2 py4j-0.10.9 pyspark-3.1.1


[hadoop@ip-172-31-89-109 ~]$ sudo /usr/bin/pip3 uninstall pyspark
Proceed (y/n)? y
..(snip)..
  Successfully uninstalled pyspark-3.1.1

After doing the above, I attempt to use the DataFrame:

>>> df_m.columns
['guid_consumer_hashed_df10', 'guid_customer_hashed_df10', 'guidr_m', 'jws_fnm_m', 'jws_lnm_m', 'gender_m', 'state_m', 'zip3_m', 'soundex_fnm_m', 'lev_gender_m', 'lev_state_m', 'lev_zip3_m', 'lev_soundex_fnm_m']

The jws_???_m columns are created with:

...     .withColumn(
...         "jws_fnm_m",
...         ceja.jaro_winkler_similarity(f.col("firstname_df10"), f.col("firstname_df4")),
...     )

I can see the columns, but show() fails:

```
>>> df_m.show()
21/03/26 06:01:50 WARN TaskSetManager: Lost task 0.0 in stage 14.0 (TID 40007, ip-172-31-80-99.ec2.internal, executor 1): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/mnt/yarn/usercache/hadoop/appcache/application_1616735627370_0001/container_1616735627370_0001_01_000002/pyspark.zip/pyspark/worker.py", line 589, in main
    func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, eval_type)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1616735627370_0001/container_1616735627370_0001_01_000002/pyspark.zip/pyspark/worker.py", line 447, in read_udfs
    udfs.append(read_single_udf(pickleSer, infile, eval_type, runner_conf, udf_index=i))
  File "/mnt/yarn/usercache/hadoop/appcache/application_1616735627370_0001/container_1616735627370_0001_01_000002/pyspark.zip/pyspark/worker.py", line 254, in read_single_udf
    f, return_type = read_command(pickleSer, infile)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1616735627370_0001/container_1616735627370_0001_01_000002/pyspark.zip/pyspark/worker.py", line 74, in read_command
    command = serializer._read_with_length(file)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1616735627370_0001/container_1616735627370_0001_01_000002/pyspark.zip/pyspark/serializers.py", line 172, in _read_with_length
    return self.loads(obj)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1616735627370_0001/container_1616735627370_0001_01_000002/pyspark.zip/pyspark/serializers.py", line 458, in loads
    return pickle.loads(obj, encoding=encoding)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1616735627370_0001/container_1616735627370_0001_01_000002/pyspark.zip/pyspark/cloudpickle.py", line 1110, in subimport
    __import__(name)
ModuleNotFoundError: No module named 'jellyfish'
```

Attempting to install the module fails:
```
$ sudo /usr/bin/pip3 install jellifish
WARNING: Running pip install with root privileges is generally not a good idea. Try `pip3 install --user` instead.
Collecting jellifish
  Could not find a version that satisfies the requirement jellifish (from versions: )
No matching distribution found for jellifish
```
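
Two details in the logs above may be worth checking. The install command asks for `jellifish`, while the traceback names `jellyfish`. The import also fails inside an executor container (the frames are in `worker.py`), so the module presumably has to be importable in the Python environment the executors use, not only on the driver. Here is a minimal sketch for checking importability. The helper name `missing_modules` is hypothetical, and it only reports on the interpreter it runs in.

```python
import importlib.util
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found by this interpreter.

    Only top-level names are supported here; dotted names whose parent
    package is absent would raise instead of returning None.
    """
    return [name for name in names if importlib.util.find_spec(name) is None]


# Example (the result depends on the environment it runs in):
# missing_modules(["jellyfish", "ceja"])
```

Running this on the driver only describes the driver. To learn about the executors, the same function would have to run there, for example from inside a function passed to `mapPartitions`.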

java.net.SocketException: Datenübergabe unterbrochen (broken pipe) (Write failed)

Hey,
If I run the following code:

import ceja
import pyspark.sql.functions as F

# `spark` is assumed to be an existing SparkSession (e.g. from the pyspark shell)
data = [
    ("jellyfish", "smellyfish"),
    ("li", "lee"),
    ("luisa", "bruna"),
    (None, None)
]
df = spark.createDataFrame(data, ["word1", "word2"])
actual_df = df.withColumn("jaro_winkler_similarity", ceja.jaro_winkler_similarity(F.col("word1"), F.col("word2")))
actual_df.show()

I get a lot of exceptions about interrupted data transmission:

java.net.SocketException: Datenübergabe unterbrochen (broken pipe) (Write failed)
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
at java.io.DataOutputStream.flush(DataOutputStream.java:123)
at org.apache.spark.api.python.PythonAccumulatorV2.merge(PythonRDD.scala:732)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$updateAccumulators$1(DAGScheduler.scala:1524)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$updateAccumulators$1$adapted(DAGScheduler.scala:1515)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.scheduler.DAGScheduler.updateAccumulators(DAGScheduler.scala:1515)
at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1627)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2588)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2533)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2522)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)

I am running the code on a Hadoop cluster with Spark v3.2.0.3.2.7170.0-49 on YARN,
using Python 3.9.19.

The code seems to run, but the exceptions are a bit annoying in the logs.

The output matches your example:

+---------+----------+-----------------------+
|    word1|     word2|jaro_winkler_similarity|
+---------+----------+-----------------------+
|jellyfish|smellyfish|              0.8962963|
|       li|       lee|              0.6111111|
|    luisa|     bruna|                    0.6|
|     null|      null|                   null|
+---------+----------+-----------------------+
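
As a cross-check of the values in the table, here is a minimal pure-Python sketch of Jaro-Winkler similarity. It is not ceja's or jellyfish's code. The function names, the 0.1 prefix scale, the four-character prefix limit, and the 0.7 boost threshold are assumptions taken from the usual textbook description. The threshold in particular is assumed because it reproduces the li/lee value above.

```python
from typing import Optional


def jaro(s1: str, s2: str) -> float:
    """Plain Jaro similarity (illustrative sketch)."""
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        # look for an unused matching character within the window
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (
        matches / len(s1) + matches / len(s2) + (matches - transpositions) / matches
    ) / 3


def jaro_winkler(s1: Optional[str], s2: Optional[str]) -> Optional[float]:
    """Jaro similarity with a common-prefix boost; None in, None out."""
    if s1 is None or s2 is None:
        return None
    score = jaro(s1, s2)
    if score <= 0.7:  # assumed boost threshold
        return score
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return score + prefix * 0.1 * (1 - score)
```

Under these assumptions the sketch gives about 0.8962963 for jellyfish/smellyfish, 0.6111111 for li/lee, and 0.6 for luisa/bruna, which agrees with the table.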

Jellyfish module

I'm trying to use the library, but when I run the toy example from the documentation, an error appears:

ModuleNotFoundError: No module named 'jellyfish'

I tried to install the module, but the problem persists.

License and source-code link on PyPI

Hi MrPowers,

Could you include the repository and license fields in the next version?
Our scanner complains that no license is defined for this package.

This only requires updating pyproject.toml; see: https://python-poetry.org/docs/pyproject/

They would then be visible here: https://pypi.org/project/ceja/
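
A small sketch of the kind of check a scanner might run against already-parsed pyproject.toml data. The helper name `missing_poetry_fields` and the sample dictionary are hypothetical. The only real names are the `license` and `repository` keys of Poetry's `[tool.poetry]` table.

```python
from typing import Any, Dict, List

# The two fields requested above; both are keys of the [tool.poetry] table.
REQUESTED_FIELDS = ("license", "repository")


def missing_poetry_fields(pyproject: Dict[str, Any]) -> List[str]:
    """Return the requested [tool.poetry] fields that are absent or empty."""
    poetry = pyproject.get("tool", {}).get("poetry", {})
    return [field for field in REQUESTED_FIELDS if not poetry.get(field)]


# Hypothetical parsed pyproject.toml content, not the project's actual file.
sample = {"tool": {"poetry": {"name": "ceja", "description": "example"}}}
print(missing_poetry_fields(sample))
```

With both keys filled in, the function returns an empty list, which is roughly what the scanner would need to see.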
