GithubHelp home page GithubHelp logo

mrpowers / ceja Goto Github PK

View Code? Open in Web Editor NEW
33.0 3.0 5.0 33 KB

PySpark phonetic and string matching algorithms

License: MIT License

Python 100.00%
pyspark jaro-winkler nysiis metaphone match-rating-comparisons porter-stemmer damerau-levenshtein hamming-distance jaro-similarity

ceja's People

Contributors

mrpowers avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar

ceja's Issues

Accelerating string operations

I've noticed that ceja relies on jellyfish for Levenshtein distance computations, which opens an optimization opportunity. StringZilla should be a few times faster for that operation, and may be very handy for other tasks as well πŸ€—

Error while using Stemming

Hello,

I am facing issues when trying to apply stemming on text data in AWS with Pyspark. Here is the error message I'm getting:
PythonException: An exception was thrown from a UDF: 'pyspark.serializers.SerializationError: Caused by Traceback (most recent call last):

How can I resolve this?

Thank you for your support.

Best,
Evangelia

pip install Download pyspark as default & fails to work

When I pip install ceja, I automatically get
pyspark-3.1.1.tar.gz (212.3MB)
which is a problem because it's the wrong version (using 3.0.0 on both EMR & WSL).
Even when I eliminate it, I still get errors on EMR.
Can this behavior be stopped?

[hadoop@ip-172-31-89-109 ~]$ sudo /usr/bin/pip3 install ceja
WARNING: Running pip install with root privileges is generally not a good idea. Try `pip3 install --user` instead.
Collecting ceja
  Downloading https://files.pythonhosted.org/packages/c6/80/f372c62a83175f4c54229474f543aeca3344f4c64aab4bcfe7cf05f50cbf/ceja-0.2.0-py3-none-any.whl
Collecting pyspark>2.0.0 (from ceja)
  Downloading https://files.pythonhosted.org/packages/45/b0/9d6860891ab14a39d4bddf80ba26ce51c2f9dc4805e5c6978ac0472c120a/pyspark-3.1.1.tar.gz (212.3MB)
    100% |β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 212.3MB 6.3kB/s
Collecting jellyfish<0.9.0,>=0.8.2 (from ceja)
  Downloading https://files.pythonhosted.org/packages/04/3f/d03cb056f407ef181a45569255348457b1a0915fc4eb23daeceb930a68a4/jellyfish-0.8.2.tar.gz (134kB)
    100% |β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 143kB 9.1MB/s
Collecting py4j==0.10.9 (from pyspark>2.0.0->ceja)
  Downloading https://files.pythonhosted.org/packages/9e/b6/6a4fb90cd235dc8e265a6a2067f2a2c99f0d91787f06aca4bcf7c23f3f80/py4j-0.10.9-py2.py3-none-any.whl (198kB)
    100% |β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 204kB 6.5MB/s
Installing collected packages: py4j, pyspark, jellyfish, ceja
  Running setup.py install for pyspark ... done
  Running setup.py install for jellyfish ... done
Successfully installed ceja-0.2.0 jellyfish-0.8.2 py4j-0.10.9 pyspark-3.1.1


[hadoop@ip-172-31-89-109 ~]$ sudo /usr/bin/pip3 uninstall pyspark
Proceed (y/n)? y
..(snip)..
  Successfully uninstalled pyspark-3.1.1

When I do above & attempt to use:

>>> df_m.columns
['guid_consumer_hashed_df10', 'guid_customer_hashed_df10', 'guidr_m', 'jws_fnm_m', 'jws_lnm_m', 'gender_m', 'state_m', 'zip3_m', 'soundex_fnm_m', 'lev_gender_m', 'lev_state_m', 'l
ev_zip3_m', 'lev_soundex_fnm_m']

jws_???_m are created with:

...     .withColumn(
...         "jws_fnm_m",
...         ceja.jaro_winkler_similarity(f.col("firstname_df10"), f.col("firstname_df4")),
...     )

I can see columns but show fails

>>> df_m.show()
21/03/26 06:01:50 WARN TaskSetManager: Lost task 0.0 in stage 14.0 (TID 40007, ip-172-31-80-99.ec2.internal, executor 1): org.apache.spark.api.python.PythonException: Traceback (m
ost recent call last):
  File "/mnt/yarn/usercache/hadoop/appcache/application_1616735627370_0001/container_1616735627370_0001_01_000002/pyspark.zip/pyspark/worker.py", line 589, in main
    func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, eval_type)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1616735627370_0001/container_1616735627370_0001_01_000002/pyspark.zip/pyspark/worker.py", line 447, in read_udfs
    udfs.append(read_single_udf(pickleSer, infile, eval_type, runner_conf, udf_index=i))
  File "/mnt/yarn/usercache/hadoop/appcache/application_1616735627370_0001/container_1616735627370_0001_01_000002/pyspark.zip/pyspark/worker.py", line 254, in read_single_udf
    f, return_type = read_command(pickleSer, infile)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1616735627370_0001/container_1616735627370_0001_01_000002/pyspark.zip/pyspark/worker.py", line 74, in read_command
    command = serializer._read_with_length(file)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1616735627370_0001/container_1616735627370_0001_01_000002/pyspark.zip/pyspark/serializers.py", line 172, in _read_with_leng
th
    return self.loads(obj)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1616735627370_0001/container_1616735627370_0001_01_000002/pyspark.zip/pyspark/serializers.py", line 458, in loads
    return pickle.loads(obj, encoding=encoding)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1616735627370_0001/container_1616735627370_0001_01_000002/pyspark.zip/pyspark/cloudpickle.py", line 1110, in subimport
    __import__(name)
ModuleNotFoundError: No module named 'jellyfish'
```

attempting install fails
```
$ sudo /usr/bin/pip3 install jellifish
WARNING: Running pip install with root privileges is generally not a good idea. Try `pip3 install --user` instead.
Collecting jellifish
  Could not find a version that satisfies the requirement jellifish (from versions: )
No matching distribution found for jellifish
```

java.net.SocketException: DatenΓΌbergabe unterbrochen (broken pipe) (Write failed)

Hey,
If I try to run the code

import pyspark.sql.functions as F
data = [
    ("jellyfish", "smellyfish"),
    ("li", "lee"),
    ("luisa", "bruna"),
    (None, None)
]
df = spark.createDataFrame(data, ["word1", "word2"])
actual_df = df.withColumn("jaro_winkler_similarity", ceja.jaro_winkler_similarity(F.col("word1"), F.col("word2")))
actual_df.show()

I will get a lot of Exception from missed data Transmissions.

java.net.SocketException: DatenΓΌbergabe unterbrochen (broken pipe) (Write failed)
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
at java.io.DataOutputStream.flush(DataOutputStream.java:123)
at org.apache.spark.api.python.PythonAccumulatorV2.merge(PythonRDD.scala:732)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$updateAccumulators$1(DAGScheduler.scala:1524)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$updateAccumulators$1$adapted(DAGScheduler.scala:1515)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.scheduler.DAGScheduler.updateAccumulators(DAGScheduler.scala:1515)
at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1627)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2588)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2533)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2522)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)

I am running the code on an Hadoop Cluster with Spark v3.2.0.3.2.7170.0-49 on yarn.
uses Python version is 3.9.19

The code seems to run but it is a bit anying for the logs.

The Output seems like on your example.

+---------+----------+-----------------------+
| word1 | word2 |jaro_winkler_similarity|
+---------+----------+-----------------------+
| jellyfish. | smellyfish| 0.8962963 |
| li | lee | 0.6111111 |
| luisa | bruna | 0.6 |
| null | null | null |
+---------+----------+-----------------------+

Jellyfish module

I'm try to use the library, but when I do toy example from the documentation an error appear:

ModuleNotFoundError: No module named 'jellyfish'

I'm try to install the module, but the problem continues.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    πŸ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. πŸ“ŠπŸ“ˆπŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❀️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.