# Example of wrapping a Java function in a Java Spark UDF, then calling it from PySpark

The purpose is to demonstrate the use case of calling a Java library from PySpark through a UDF.
- Start with a Java function (`triple()`) in a library (`Multiples`).
- Wrap the Java function in a Java Spark UDF (`TripleUdf`).
- Call the UDF from PySpark.
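The Java source itself is not shown in this README. A minimal sketch of the two pieces is below; the method signatures and package layout are assumptions (only the names `Multiples`, `triple()`, and `TripleUdf` come from the project), and the Spark `UDF1` interface is referenced in a comment rather than implemented so the sketch compiles without Spark on the classpath.

```java
// Sketch of the library function and the Spark UDF wrapper.
// In the real project, TripleUdf would implement
// org.apache.spark.sql.api.java.UDF1<Double, Double>; that import is
// omitted here so the sketch stands alone.

// The plain Java library function (the Multiples artifact).
class Multiples {
    static double triple(double x) {
        return 3.0 * x;
    }
}

// The Spark UDF wrapper: call() is the single method UDF1 requires.
class TripleUdf /* implements UDF1<Double, Double> */ {
    public Double call(Double value) {
        return Multiples.triple(value);
    }
}
```

The wrapper does nothing but delegate, which keeps the library free of any Spark dependency.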
This is a quickly-built POC, so it's not ready for repeatable execution. Follow the steps below to run the example and see how the functionality works.
- Build and install the `Multiples` library as a jar:

  ```shell
  mvn package install
  ```

  `Multiples` is referenced by the UDF in the UDF's `pom.xml`, so `mvn install` is required:

  ```xml
  <dependency>
      <groupId>org.example.functions</groupId>
      <artifactId>Multiples</artifactId>
      <version>1.0-SNAPSHOT</version>
  </dependency>
  ```
- Build the `MultipleUdf` project, which will create a fat jar in the `target` directory: `MultipleUdf-1.0-SNAPSHOT.jar`.
- Open up the pyspark shell, referencing the additional jar in the classpath:

  ```shell
  pyspark --jars /Users/donaldsawyer/git/PysparkJavaUdfExample/MultipleUdf/target/MultipleUdf-1.0-SNAPSHOT.jar
  ```

- Run the pyspark commands that are in `executePythonJavaUdf.py`:

  ```python
  from pyspark.sql import functions as F
  from pyspark.sql.types import DoubleType

  # Register the Java UDF class under the SQL function name "triple"
  spark.udf.registerJavaFunction("triple", "TripleUdf")

  df = spark.createDataFrame([0.0, 4.111, -4.5], DoubleType()).toDF("value")
  df.withColumn("tripled", F.expr("triple(value)")).show()
  ```
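A note on the fat jar: Maven does not produce one by default, so the `MultipleUdf` build presumably uses something like the `maven-shade-plugin`. The snippet below is an assumption about what that `pom.xml` block could look like, not a copy of the actual pom:

```xml
<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-shade-plugin</artifactId>
    <version>3.4.1</version>
    <executions>
        <execution>
            <phase>package</phase>
            <goals>
                <goal>shade</goal>
            </goals>
        </execution>
    </executions>
</plugin>
```

Bundling the `Multiples` dependency into the UDF jar is what lets a single `--jars` argument give pyspark everything it needs.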