
Home Page: https://github.com/databricks/congruity

License: Apache License 2.0



congruity


Migrating classic Spark applications, which have the full power and flexibility of the Spark APIs at their disposal, to the Spark Connect-compatible DataFrame API can be challenging in many ways.

The goal of this library is to provide a compatibility layer that makes it easier to adopt Spark Connect. The library is designed to be simply imported into your application; it will then monkey-patch the existing API to provide the legacy functionality.
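The monkey-patching idea can be sketched in a few lines. Everything below is illustrative only (the `Session` class and method names are stand-ins, not congruity's internals): at import time, a legacy-style method is attached to an existing class so that old call sites keep working.

```python
# Illustrative sketch of import-time monkey-patching (hypothetical names;
# this is NOT congruity's actual implementation).

class Session:
    """Stand-in for a Connect-style session that lacks the legacy API."""

    def create_dataframe(self, rows):
        # The "modern" API that is still available.
        return list(rows)


def _parallelize(self, data):
    # Legacy-style entry point, delegating to the supported API.
    return self.create_dataframe(data)


# The patch is applied once, when the module is imported.
if not hasattr(Session, "parallelize"):
    Session.parallelize = _parallelize

Session().parallelize([1, 2, 3])  # returns [1, 2, 3]
```

Because the patch only adds attributes to existing classes, code written against the legacy API keeps working unchanged after the import.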

Non-Goals

This library is not intended to be a long-term solution. The goal is to provide a compatibility layer that becomes obsolete over time. In addition, we do not aim to provide compatibility for all methods and features but only a select subset. Lastly, we do not aim to achieve the same performance as using some of the native RDD APIs.

Usage

Spark JVM & Spark Connect compatibility library.

```shell
pip install spark-congruity
```

```python
import congruity
```

Example

Here is code that works on Spark JVM:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.remote("sc://localhost").getOrCreate()
data = [("Java", "20000"), ("Python", "100000"), ("Scala", "3000")]
spark.sparkContext.parallelize(data).toDF()
```

This code doesn't work with Spark Connect. The congruity library rearranges the code under the hood, so the old syntax works on Spark Connect clusters as well:

```python
import congruity  # noqa: F401
from pyspark.sql import SparkSession

spark = SparkSession.builder.remote("sc://localhost").getOrCreate()
data = [("Java", "20000"), ("Python", "100000"), ("Scala", "3000")]
spark.sparkContext.parallelize(data).toDF()
```

Contributing

We very much welcome contributions to this project. The easiest way to start is to pick any of the RDD or SparkContext methods listed below and implement the compatibility layer. Once you have done that, open a pull request and we will review it.
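As a rough illustration of what such a compatibility shim can look like (all names below are hypothetical and do not reflect congruity's internals), many legacy RDD methods can be expressed in terms of operations that are already supported, e.g. `first` in terms of `take`:

```python
# Hypothetical shape of a compatibility shim; the class and its backing
# storage are illustrative stand-ins, not congruity's implementation.

class RDDShim:
    def __init__(self, rows):
        self._rows = rows  # stand-in for data behind a Connect DataFrame

    def take(self, n):
        # An already-supported building block.
        return self._rows[:n]

    def first(self):
        # A newly contributed method can often be defined via existing ones.
        rows = self.take(1)
        if not rows:
            raise ValueError("RDD is empty")
        return rows[0]

RDDShim(["a", "b"]).first()  # returns "a"
```

Layering new methods on top of already-implemented ones keeps each contribution small and easy to review.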

What's supported?

RDD

| RDD API | Comment |
| --- | --- |
| aggregate | |
| aggregateByKey | |
| barrier | |
| cache | |
| cartesian | |
| checkpoint | |
| cleanShuffleDependencies | |
| coalesce | |
| cogroup | |
| collect | |
| collectAsMap | |
| collectWithJobGroup | |
| combineByKey | |
| count | |
| countApprox | |
| countByKey | |
| countByValue | |
| distinct | |
| filter | |
| first | |
| flatMap | |
| fold | First version |
| foreach | |
| foreachPartition | |
| fullOuterJoin | |
| getCheckpointFile | |
| getNumPartitions | |
| getResourceProfile | |
| getStorageLevel | |
| glom | |
| groupBy | |
| groupByKey | |
| groupWith | |
| histogram | |
| id | |
| intersection | |
| isCheckpointed | |
| isEmpty | |
| isLocallyCheckpointed | |
| join | |
| keyBy | |
| keys | |
| leftOuterJoin | |
| localCheckpoint | |
| lookup | |
| map | |
| mapPartitions | First version, based on mapInArrow. |
| mapPartitionsWithIndex | |
| mapPartitionsWithSplit | |
| mapValues | |
| max | |
| mean | |
| meanApprox | |
| min | |
| name | |
| partitionBy | |
| persist | |
| pipe | |
| randomSplit | |
| reduce | |
| reduceByKey | |
| repartition | |
| repartitionAndSortWithinPartitions | |
| rightOuterJoin | |
| sample | |
| sampleByKey | |
| sampleStdev | |
| sampleVariance | |
| saveAsHadoopDataset | |
| saveAsHadoopFile | |
| saveAsNewAPIHadoopDataset | |
| saveAsNewAPIHadoopFile | |
| saveAsPickleFile | |
| saveAsTextFile | |
| setName | |
| sortBy | |
| sortByKey | |
| stats | |
| stdev | |
| subtract | |
| subtractByKey | |
| sum | First version. |
| sumApprox | |
| take | Ordering might not be guaranteed in the same way as it is in RDD. |
| takeOrdered | |
| takeSample | |
| toDF | |
| toDebugString | |
| toLocalIterator | |
| top | |
| treeAggregate | |
| treeReduce | |
| union | |
| unpersist | |
| values | |
| variance | |
| withResources | |
| zip | |
| zipWithIndex | |
| zipWithUniqueId | |

SparkContext

| SparkContext API | Comment |
| --- | --- |
| parallelize | Does not support numSlices yet. |

Limitations

  • Error handling and checking is fairly limited right now. We try to emulate the existing behavior, but this is not always possible because the invariants are not encoded in Python but rather somewhere in Scala.
  • numSlices - we don't emulate this behavior for now.
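To illustrate what emulating numSlices would involve (a hypothetical sketch, not part of congruity), classic parallelize divides its input into contiguous, roughly equal chunks along these lines:

```python
# Hypothetical helper showing how parallelize(data, numSlices) splits its
# input; this is an illustration, not congruity's code.

def split_into_slices(data, num_slices):
    """Partition `data` into `num_slices` contiguous chunks of roughly
    equal size."""
    n = len(data)
    chunks = []
    for i in range(num_slices):
        start = (i * n) // num_slices
        end = ((i + 1) * n) // num_slices
        chunks.append(data[start:end])
    return chunks

split_into_slices([1, 2, 3, 4, 5], 2)  # returns [[1, 2], [3, 4, 5]]
```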

Contributors

dependabot[bot], grundprinzip

