awesome-spark / awesome-spark Goto Github PK

View Code? Open in Web Editor NEW

1.7K 85.0 325.0 214 KB

A curated list of awesome Apache Spark packages and resources.

License: Creative Commons Zero v1.0 Universal

Shell 100.00%

apache-spark pyspark awesome sparkr

awesome-spark's Introduction

Awesome Spark

A curated list of awesome Apache Spark packages and resources.

Apache Spark is an open-source cluster-computing framework. Originally developed at the University of California, Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since. Spark provides an interface for programming entire clusters with implicit data parallelism and fault-tolerance (Wikipedia 2017).

Users of Apache Spark may choose between different the Python, R, Scala and Java programming languages to interface with the Apache Spark APIs.

Packages
Resources

Packages

Language Bindings

Kotlin for Apache Spark - Kotlin API bindings and extensions.
Flambo - Clojure DSL.
Mobius - C# bindings (Deprecated in favor of .NET for Apache Spark).
.NET for Apache Spark - .NET bindings.
sparklyr - An alternative R backend, using dplyr.
sparkle - Haskell on Apache Spark.

Notebooks and IDEs

almond - A scala kernel for Jupyter.
Apache Zeppelin - Web-based notebook that enables interactive data analytics with plugable backends, integrated plotting, and extensive Spark support out-of-the-box.
Polynote - Polynote: an IDE-inspired polyglot notebook. It supports mixing multiple languages in one notebook, and sharing data between them seamlessly. It encourages reproducible notebooks with its immutable data model. Originating from Netflix.
Spark Notebook - Scalable and stable Scala and Spark focused notebook bridging the gap between JVM and Data Scientists (incl. extendable, typesafe and reactive charts).
sparkmagic - Jupyter magics and kernels for working with remote Spark clusters, for interactively working with remote Spark clusters through Livy, in Jupyter notebooks.

General Purpose Libraries

Succinct - Support for efficient queries on compressed data.
itachi - A library that brings useful functions from modern database management systems to Apache Spark.
spark-daria - A Scala library with essential Spark functions and extensions to make you more productive.
quinn - A native PySpark implementation of spark-daria.
Apache DataFu - A library of general purpose functions and UDF's.
Joblib Apache Spark Backend - joblib backend for running tasks on Spark clusters.

SQL Data Sources

SparkSQL has serveral built-in Data Sources for files. These include csv, json, parquet, orc, and avro. It also supports JDBC databases as well as Apache Hive. Additional data sources can be added by including the packages listed below, or writing your own.

Spark CSV - CSV reader and writer (obsolete since Spark 2.0 [SPARK-12833]).
Spark Avro - Apache Avro reader and writer (obselete since Spark 2.4 [SPARK-24768]).
Spark XML - XML parser and writer.
Spark Cassandra Connector - Cassandra support including data source and API and support for arbitrary queries.
Spark Riak Connector - Riak TS & Riak KV connector.
Mongo-Spark - Official MongoDB connector.
OrientDB-Spark - Official OrientDB connector.

Storage

Delta Lake - Storage layer with ACID transactions.
lakeFS - Integration with the lakeFS atomic versioned storage layer.

Bioinformatics

ADAM - Set of tools designed to analyse genomics data.
Hail - Genetic analysis framework.

GIS

Magellan - Geospatial analytics using Spark.
Apache Sedona - Cluster computing system for processing large-scale spatial data.

Time Series Analytics

Spark-Timeseries - Scala / Java / Python library for interacting with time series data on Apache Spark.
flint - A time series library for Apache Spark.

Graph Processing

Mazerunner - Graph analytics platform on top of Neo4j and GraphX.
GraphFrames - Data frame based graph API.
neo4j-spark-connector - Bolt protocol based, Neo4j Connector with RDD, DataFrame and GraphX / GraphFrames support.
SparklingGraph - Library extending GraphX features with multiple functionalities useful in graph analytics (measures, generators, link prediction etc.).

Machine Learning Extension

Clustering4Ever Scala and Spark API to benchmark and analyse clustering algorithms on any vectorization you can generate.
dbscan-on-spark - An Implementation of the DBSCAN clustering algorithm on top of Apache Spark by irvingc and based on the paper from He, Yaobin, et al. MR-DBSCAN: a scalable MapReduce-based DBSCAN algorithm for heavily skewed data.
Apache SystemML - Declarative machine learning framework on top of Spark.
Mahout Spark Bindings [status unknown] - linear algebra DSL and optimizer with R-like syntax.
spark-sklearn - Scikit-learn integration with distributed model training.
KeystoneML - Type safe machine learning pipelines with RDDs.
JPMML-Spark - PMML transformer library for Spark ML.
Distributed Keras - Distributed deep learning framework with PySpark and Keras.
ModelDB - A system to manage machine learning models for spark.ml and scikit-learn .
Sparkling Water - H2O interoperability layer.
BigDL - Distributed Deep Learning library.
MLeap - Execution engine and serialization format which supports deployment of o.a.s.ml models without dependency on SparkSession.
Microsoft ML for Apache Spark - A distributed ml library with support for LightGBM, Vowpal Wabbit, OpenCV, Deep Learning, Cognitive Services, and Model Deployment.
MLflow - Machine learning orchestration platform.

Middleware

Livy - REST server with extensive language support (Python, R, Scala), ability to maintain interactive sessions and object sharing.
spark-jobserver - Simple Spark as a Service which supports objects sharing using so called named objects. JVM only.
Mist - Service for exposing Spark analytical jobs and machine learning models as realtime, batch or reactive web services.
Apache Toree - IPython protocol based middleware for interactive applications.
Apache Kyuubi - A distributed multi-tenant JDBC server for large-scale data processing and analytics, built on top of Apache Spark.

Monitoring

Data Mechanics Delight - Cross-platform monitoring tool (Spark UI / Spark History Server replacement).

Utilities

silex - Collection of tools varying from ML extensions to additional RDD methods.
sparkly - Helpers & syntactic sugar for PySpark.
pyspark-stubs - Static type annotations for PySpark (obsolete since Spark 3.1. See SPARK-32681).
Flintrock - A command-line tool for launching Spark clusters on EC2.
Optimus - Data Cleansing and Exploration utilities with the goal of simplifying data cleaning.

Natural Language Processing

spark-corenlp - DataFrame wrapper for Stanford CoreNLP.
spark-nlp - Natural language processing library built on top of Apache Spark ML.

Streaming

Apache Bahir - Collection of the streaming connectors excluded from Spark 2.0 (Akka, MQTT, Twitter. ZeroMQ).

Interfaces

Apache Beam - Unified data processing engine supporting both batch and streaming applications. Apache Spark is one of the supported execution environments.
Blaze - Interface for querying larger than memory datasets using Pandas-like syntax. It supports both Spark DataFrames and RDDs.
Koalas - Pandas DataFrame API on top of Apache Spark.

Testing

deequ - Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.
spark-testing-base - Collection of base test classes.
spark-fast-tests - A lightweight and fast testing framework.

Web Archives

Archives Unleashed Toolkit - Open-source toolkit for analyzing web archives.

Workflow Management

Cromwell - Workflow management system with Spark backend.

Resources

Books

Learning Spark, 2nd Edition - Introduction to Spark API with Spark 3.0 covered. Good source of knowledge about basic concepts.
Advanced Analytics with Spark - Useful collection of Spark processing patterns. Accompanying GitHub repository: sryza/aas.
Mastering Apache Spark - Interesting compilation of notes by Jacek Laskowski. Focused on different aspects of Spark internals.
Spark Gotchas - Subjective compilation of tips, tricks and common programming mistakes.
Spark in Action - New book in the Manning's "in action" family with +400 pages. Starts gently, step-by-step and covers large number of topics. Free excerpt on how to setup Eclipse for Spark application development and how to bootstrap a new application using the provided Maven Archetype. You can find the accompanying GitHub repo here.

Papers

Large-Scale Intelligent Microservices - Microsoft paper that presents an Apache Spark-based micro-service orchestration framework that extends database operations to include web service primitives.
Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing - Paper introducing a core distributed memory abstraction.
Spark SQL: Relational Data Processing in Spark - Paper introducing relational underpinnings, code generation and Catalyst optimizer.
Structured Streaming: A Declarative API for Real-Time Applications in Apache Spark - Structured Streaming is a new high-level streaming API, it is a declarative API based on automatically incrementalizing a static relational query.

MOOCS

Data Science and Engineering with Apache Spark (edX XSeries) - Series of five courses (Introduction to Apache Spark, Distributed Machine Learning with Apache Spark, Big Data Analysis with Apache Spark, Advanced Apache Spark for Data Science and Data Engineering, Advanced Distributed Machine Learning with Apache Spark) covering different aspects of software engineering and data science. Python oriented.
Big Data Analysis with Scala and Spark (Coursera) - Scala oriented introductory course. Part of Functional Programming in Scala Specialization.

Workshops

AMP Camp - Periodical training event organized by the UC Berkeley AMPLab. A source of useful exercise and recorded workshops covering different tools from the Berkeley Data Analytics Stack.

Projects Using Spark

Oryx 2 - Lambda architecture platform built on Apache Spark and Apache Kafka with specialization for real-time large scale machine learning.
Photon ML - A machine learning library supporting classical Generalized Mixed Model and Generalized Additive Mixed Effect Model.
PredictionIO - Machine Learning server for developers and data scientists to build and deploy predictive applications in a fraction of the time.
Crossdata - Data integration platform with extended DataSource API and multi-user environment.

Docker Images

apache/spark - Apache Spark Official Docker images.
jupyter/docker-stacks/pyspark-notebook - PySpark with Jupyter Notebook and Mesos client.
sequenceiq/docker-spark - Yarn images from SequenceIQ.
datamechanics/spark - An easy to setup Docker image for Apache Spark from Data Mechanics.

Miscellaneous

Spark with Scala Gitter channel - "A place to discuss and ask questions about using Scala for Spark programming" started by @deanwampler.
Apache Spark User List and Apache Spark Developers List - Mailing lists dedicated to usage questions and development topics respectively.

References

Wikipedia. 2017. “Apache Spark — Wikipedia, the Free Encyclopedia.” https://en.wikipedia.org/w/index.php?title=Apache_Spark&oldid=781182753.

License

This work (Awesome Spark, by https://github.com/awesome-spark/awesome-spark), identified by Maciej Szymkiewicz, is free of known copyright restrictions.

Apache Spark, Spark, Apache, and the Spark logo are trademarks of The Apache Software Foundation. This compilation is not endorsed by The Apache Software Foundation.

Inspired by sindresorhus/awesome.

awesome-spark's People

Contributors

Stargazers

Watchers

Forkers

realforce1024 allengaller cloudnuts kangkot theho andypetrella malcolmgreaves khajavi dmitryfill eliasah cycorey veterun anandhs krasnyanskiy oluies praveen-symphony alonsoir rootcss srinivasagit naliazheli saimaung flying0er mtoto wypb joe2hpimn mtunique dog-sunflower veelenga ppr10 spushkarev youngwookim welcome9 vaquarkhan dutrow anant zhuohuwu0603 riomus cbaenziger phonchi ashisharma23 ahmedkamal reloadbrain sayantanmukherjee6 daroza yuanjie-ai mrpowers marilenaoita developerswithpassion dattatele sbarman25 kondisettyravi anuragit19 ajoeajoe awesome-library maniiyer7 lxiong nuthanreddy biaoma-ty anyuray umitsamima angelom88 kashcm chaitanyaphalak worldmovers enggen lucianosb jxie418 charismatron gangk manojlds alexgids 2check91 shubhampachori12110095 ptzagk awesome-resources mengmengan danigunawan mstei4176 faviovazquez xiaoximiao hanks110 nchammas yangmaoer stdatalabs kadiy1k stp008 vinodkandula guptam mallik-g ahbab dailyactie ashwin-patil pineda-vv pavithranrao toquoccuong rtu thenextbiglogicprovider chabobo tanthml cu-noyvirt

awesome-spark's Issues

Remove double line separators

In particular `------` which expands to:

Consider spark-tensorflow-connector

This repo contains a library for loading and storing TensorFlow records with Apache Spark. The library implements data import from the standard TensorFlow record format ([TFRecords] (https://www.tensorflow.org/how_tos/reading_data/)) into Spark SQL DataFrames, and data export from DataFrames to TensorFlow records.

https://github.com/tensorflow/ecosystem/tree/master/spark/spark-tensorflow-connector#spark-tensorflow-connector

This project is part of Tensorflow ecosystem which contains examples for integrating TensorFlow with other open-source frameworks e.g kubernetes, docker, marathon, spark, etc.

What do you think @zero323 @oluies ?

Consider combust/mleap

MLeap is a common serialization format and execution engine for machine learning pipelines. It supports Spark, Scikit-learn and Tensorflow for training pipelines and exporting them to an MLeap Bundle. Serialized pipelines (bundles) can be deserialized back into Spark, Scikit-learn, TensorFlow graphs, or an MLeap pipeline for use in a scoring engine (API Servers).

Consider dist-keras

https://github.com/cerndb/dist-keras

Distributed Keras is a distributed deep learning framework built op top of Apache Spark and Keras, with a focus on "state-of-the-art" distributed optimization algorithms. We designed the framework in such a way that a new distributed optimizer could be implemented with ease, thus enabling a person to focus on research. Several distributed methods are supported, such as, but not restricted to, the training of ensembles and models using data parallel methods.

Adding spark-in-a-box

I was thinking about adding spark-in-a-box into docker images, what do you think ?

Consider snappydata

SnappyData bundles Spark and supports all the Spark APIs. It seemed like leggit project. What do you think ?

https://github.com/SnappyDataInc/snappydata
https://www.snappydata.io/

Does it make sense to include sparkinabox

@eliasah What do you think? It is not particularly mature and I am cautious when it comes to self-promotion...

Consider kontextfrei

https://github.com/dwestheide/kontextfrei

This library enables you to write the business logic of your Spark application without depending on RDDs and the SparkContext.

cc @dwestheide

Consider Flintrock

https://github.com/nchammas/flintrock

We should probably wait until 1.0, but I'll leave to you @eliasah. Not like I am big AWS user :)

Add description to Mastering Apache Spark.

@jaceklaskowski Would you like to take care of that?

Add photon-ml

https://github.com/linkedin/photon-ml

@eliasah Would you do the honors?

Add description to Lightning-Fast Big Data Analysis. Advanced Analytics with Spark.

Include Spark logo

It has not much practical value but it could look nice.

Consider removing spark-timeseries

spark-timeseries is officially no longer developed: sryza/spark-timeseries@17b78f7.

Should we keep it in the awesome?

Non-active repositories

What should we do about inactive repositories ? example : https://github.com/irvingc/dbscan-on-spark

Change logo width size

The logo should be high-DPI, so set it to maximum half the width of the original image also for this sindresorhus/awesome#895

The licence badge has to be in SVG not PNG

To be able to make this sindresorhus/awesome#895 accept we need to change the badge format to SVG.

How are we supposed to make a svg of the licence batch if it's a link ? https://github.com/awesome-spark/awesome-spark/blob/master/README.md#license

Add description to Apache Zeppelin

Add description to Spark Notebook

Add archived MOOCS section and move cs110x and cs120x there.

Use consitent line ending strategy

Right now there are two different approaches:

finishing with full-stop
no-punctuation mark at all

The first approach seems to be much better IMHO.

Add Spark website URL to the list description .

A curated list of awesome Apache Spark packages and resources.

Version and language labeling

Should we provide an optional version or language specific labels? The former one can become a serious issue with upcoming 2.0+. The latter one is probably less useful and not worth the effort.

Remove spark_dbscan

Project: spark_dbscan
Reason: no longer maintained.

sparkapi and sparklyr

should we include these two or wait and see how things go?
if we include where should these go? Should we add these separately?

@eliasah

Consider Spark with Java

Hi,

Here is a small intro to Spark with Java:

Spark with Java - New book in Manning's Early Access Program (MEAP) with a very strong focus on Java, targeted towards software and data engineers. The author focuses on existing developing existing skills rather than learning Spark, Scala, and Hadoop at the same time. You can find the accompanying GitHub repos here.

Thanks for considering it,
HIH,

Tutorials

Should we add some tutorials links ? What do you think ?

More language bindings

I can't vouch for the quality or maturity of these bindings, but you may want to consider them.

Haskell: https://github.com/tweag/sparkle
Node.js:
- https://github.com/EclairJS/eclairjs-node
- https://github.com/henridf/apache-spark-node

Add description to Learning Spark

Consider pyspark-asyncactions

https://github.com/zero323/pyspark-asyncactions

A proof of concept asynchronous actions for PySpark using concurent.futures Originally developed as proof-of-concept solution for SPARK-20347

Consider intel-analytics/BigDL

https://github.com/intel-analytics/BigDL

Consider arangodb-spark-connector

https://github.com/arangodb/arangodb-spark-connector

Right now it is a tad useless (doesn't support data source API), maybe it'll change in the future.

Add header to the TOC

Contribution guidelines

We need a clear contribution guidelines.

Databricks Spark Knowledge Base

Is there any point in adding it here? It hasn't been updated in two years and ignoring iconic Avoid GroupByKey where the title seems to be to only thing than readers remember it is rather shallow.

Consider twosigma/flint

https://github.com/twosigma/flint

Use consistent title case capitalization for headers.

CMS with articles (a, an, the), coordinating conjunctions (and, but, or, for, nor), and prepositions, regardless of length, are lowercased unless they are the first or last word of the title?

Consider pyspark-stubs

While this is still work in progress I would like to add it at some point.

https://github.com/zero323/pyspark-stubs

Logo should use RawGit instead of direct link

Clean MOOCS section

CS100 and CS190 are no longer offered. The new curriculum contains five different courses. We also need proper descriptions.

dist-keras is listed twice

Fix header usage

use h2 for top level categories (like packages)
use h3 for subcategories (like Language bindings)

Dealing with packages for spark 1.x and 2.x and also abandoned projects

This is point is subject of discussion.

Docker images

I've been thinking about adding a list of useful Spark Docker images. This it not very long but I think it can be useful. It could go the resources section.

Resources

...

Blogs

....

Docker images:

jupyter/docker-stacks/pyspark-notebook/ - PySpark with Jupyter Notebook and Mesos client.
Spark Notebook Generator - Package generator for Spark notebook supporting multiple formats, including Docker images.
sequenceiq/docker-spark - Spark on Yarn images.
zero323/spark-in-a-box - Awesome Docker images generator.

Consider Sparkle

https://github.com/tweag/sparkle

sparkle: Apache Spark applications in Haskell

sparkle [spär′kəl]: a library for writing resilient analytics applications in Haskell that scale to thousands of nodes, using Spark and the rest of the Apache ecosystem under the hood. See this blog post for the details.

Moved from #97

CC @alpmestan

Rename spark_dbscan to Spark DBSCAN

It looks like this is the name used in the source repo.

Consider Big Data for Data Engineers Specialization

https://www.coursera.org/specializations/big-data-engineering

In four concise courses you will learn the basics of Hadoop, MapReduce, Spark, methods of offline data processing for warehousing, real-time data processing and large-scale machine learning. And Capstone project for you to build and deploy your own Big Data Service (make your portfolio even more competitive).

Consider EclairJS / EclairJS-Node

https://github.com/EclairJS/eclairjs

Considering that the official statement is:

There has been a team here in IBM working on EclairJS for about a year now. Unfortunately from the team's perspective, the project has not gained traction and so they are going to discontinue work on it. The eclairjs.org domain name will continue to exist for a while longer although there will a few other changes such as shuttering this (largely unused) Slack channel

I doubt it make sense to follow this route, but I could be biased.

Moved from #97

Clean AMP Camp entry

We should either provide a short description of the AMP Camp in general or provide some links to individual events, each with its own description.

Inconsistent captialization of the descriptions.

Right now have three different conventions:

[link] - lowercase foo bar.
[link] - Uppercase foo bar.
[link] - Uppercase Foo Bar.

We should choose one (the second approach is probably canonical given we use full sentences for descriptions) and adjust the rest.

Recommend Projects

React

A declarative, efficient, and flexible JavaScript library for building user interfaces.
Vue.js

🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
Typescript

TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
TensorFlow

An Open Source Machine Learning Framework for Everyone
Django

The Web framework for perfectionists with deadlines.
Laravel

A PHP framework for web artisans
D3

Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

javascript

JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
web

Some thing interesting about web. New door for the world.
server

A server is a program made to process requests and deliver data to clients.
Machine learning

Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Visualization

Some thing interesting about visualization, use data art
Game

Some thing interesting about game, make everyone happy.

Recommend Org

Facebook

We are working to build community through open source technology. NB: members must have two-factor auth.
Microsoft

Open source projects and samples from Microsoft.
Google

Google ❤️ Open Source for everyone.
Alibaba

Alibaba Open Source for everyone
D3

Data-Driven Documents codes.
Tencent

China tencent open source team.

Jobs

Jooble

awesome-spark / awesome-spark Goto Github PK

awesome-spark's Introduction

Awesome Spark

Contents

Packages

Language Bindings

Notebooks and IDEs

General Purpose Libraries

SQL Data Sources

Storage

Bioinformatics

GIS

Time Series Analytics

Graph Processing

Machine Learning Extension

Middleware

Monitoring

Utilities

Natural Language Processing

Streaming

Interfaces

Testing

Web Archives

Workflow Management

Resources

Books

Papers

MOOCS

Workshops

Projects Using Spark

Docker Images

Miscellaneous

References

License

awesome-spark's People

Contributors

Stargazers

Watchers

Forkers

awesome-spark's Issues

In particular ------ which expands to:

Resources

Blogs

Docker images:

Recommend Projects

Recommend Topics

Recommend Org

Jobs

In particular `------` which expands to: