This project forked from jpmml/pyspark2pmml


PySpark2PMML

Python library for converting Apache Spark ML pipelines to PMML.

Features

This library is a thin wrapper around the JPMML-SparkML library. For a list of supported Apache Spark ML Estimator and Transformer types, please refer to the documentation of the JPMML-SparkML project.

Prerequisites

  • Apache Spark 2.0.X, 2.1.X, 2.2.X or 2.3.X.
  • Python 2.7, 3.4 or newer.

Installation

Install the latest version from GitHub:

pip install --user --upgrade git+https://github.com/jpmml/pyspark2pmml.git

Configuration

PySpark2PMML must be paired with JPMML-SparkML based on the following compatibility matrix:

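| Apache Spark version | JPMML-SparkML development branch | JPMML-SparkML version |
|----------------------|----------------------------------|-----------------------|
| 2.0.X                | 1.1.X                            | 1.1.19                |
| 2.1.X                | 1.2.X                            | 1.2.11                |
| 2.2.X                | 1.3.X                            | 1.3.7                 |
| 2.3.X                | master                           | 1.4.4                 |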
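
The compatibility matrix can also be expressed as a small lookup, which may be handy in launch scripts. This is only a sketch: the names `COMPATIBILITY` and `jpmml_sparkml_for` are hypothetical helpers, not part of PySpark2PMML, and the values are copied from the matrix above.

```python
# Hypothetical helper, not part of PySpark2PMML.
# Apache Spark release line -> (JPMML-SparkML development branch, JPMML-SparkML version),
# copied from the compatibility matrix.
COMPATIBILITY = {
    "2.0": ("1.1.X", "1.1.19"),
    "2.1": ("1.2.X", "1.2.11"),
    "2.2": ("1.3.X", "1.3.7"),
    "2.3": ("master", "1.4.4"),
}

def jpmml_sparkml_for(spark_version):
    """Return (branch, version) for a Spark version string such as '2.3.1'."""
    # Only the major and minor components select the row of the matrix.
    release_line = ".".join(spark_version.split(".")[:2])
    if release_line not in COMPATIBILITY:
        raise ValueError("No JPMML-SparkML pairing listed for Apache Spark " + spark_version)
    return COMPATIBILITY[release_line]

branch, version = jpmml_sparkml_for("2.3.1")
print("branch=%s version=%s" % (branch, version))
```

Versions outside the matrix raise a `ValueError` rather than guessing a pairing.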

Apache Spark 2.3.X

Launch PySpark; use the --packages command-line option to specify the Maven Central repository coordinates of the JPMML-SparkML library:

pyspark --packages org.jpmml:jpmml-sparkml:${version}
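
When running a standalone script instead of the interactive shell, the same coordinates can be supplied through Spark's `spark.jars.packages` configuration property, which is the configuration counterpart of `--packages`. The sketch below assumes this approach suits your deployment; the function names and the application name are hypothetical, and the property must be set before the Spark session (and its JVM) is created.

```python
# Hypothetical helpers, not part of PySpark2PMML.

def jpmml_spark_conf(version):
    """Build the Spark configuration entry that pulls in JPMML-SparkML."""
    return {"spark.jars.packages": "org.jpmml:jpmml-sparkml:" + version}

def create_session(version, app_name="pyspark2pmml-example"):
    """Create a SparkSession with the JPMML-SparkML library on the classpath."""
    # Imported lazily so that jpmml_spark_conf() stays usable without PySpark.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in jpmml_spark_conf(version).items():
        builder = builder.config(key, value)
    return builder.getOrCreate()

# Example for Apache Spark 2.3.X, using the version from the compatibility matrix:
# spark = create_session("1.4.4")
# sc = spark.sparkContext
```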

Apache Spark 2.0.X through 2.2.X

Apache Spark versions prior to 2.3.0 prepend a legacy version of the JPMML-Model library to the application classpath, which causes fatal class loading errors with all JPMML software, including the JPMML-SparkML library. This conflict is documented in SPARK-15526.

The workaround is to switch from the JPMML-SparkML library to the JPMML-SparkML uber-JAR file that bundles "shaded" classes. Currently, this JPMML-SparkML uber-JAR file needs to be built locally using Apache Maven:

git clone https://github.com/jpmml/jpmml-sparkml.git
cd jpmml-sparkml
# Check out the intended development branch
git checkout ${development branch}
mvn clean package

Launch PySpark; use the --jars command-line option to specify the location of the JPMML-SparkML uber-JAR file:

pyspark --jars /path/to/jpmml-sparkml/target/converter-executable-${version}.jar

Usage

Fitting an example pipeline model:

from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import RFormula

df = spark.read.csv("Iris.csv", header = True, inferSchema = True)

formula = RFormula(formula = "Species ~ .")
classifier = DecisionTreeClassifier()
pipeline = Pipeline(stages = [formula, classifier])
pipelineModel = pipeline.fit(df)

Exporting the fitted example pipeline model to a PMML byte array:

from pyspark2pmml import PMMLBuilder

pmmlBuilder = PMMLBuilder(sc, df, pipelineModel) \
	.putOption(classifier, "compact", True)

pmmlBytes = pmmlBuilder.buildByteArray()
print(pmmlBytes.decode("UTF-8"))
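
To keep the result around, the byte array can be written to disk with plain Python. The following is a minimal sketch, assuming the bytes are a complete PMML document; `save_pmml` is a hypothetical helper, not part of PySpark2PMML, and the root-element check is only a light sanity test, not schema validation.

```python
import xml.etree.ElementTree as ET

def save_pmml(pmml_bytes, path):
    """Check that the bytes parse as XML with a PMML root element, then write them to path."""
    root = ET.fromstring(pmml_bytes)
    # Namespaced tags look like "{namespace-uri}PMML"; keep only the local name.
    local_name = root.tag.rsplit("}", 1)[-1]
    if local_name != "PMML":
        raise ValueError("Expected a PMML root element, got " + local_name)
    # Write the original bytes unchanged, so the encoding is preserved.
    with open(path, "wb") as pmml_file:
        pmml_file.write(pmml_bytes)
    return path

# With the example above:
# save_pmml(pmmlBytes, "DecisionTreeIris.pmml")
```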

License

PySpark2PMML is dual-licensed under the GNU Affero General Public License (AGPL) version 3.0, and a commercial license.

Additional information

PySpark2PMML is developed and maintained by Openscoring Ltd, Estonia.

Interested in using JPMML software in your application? Please contact [email protected]
