
gmrqs / lasagna


A Docker Compose template that builds an interactive development environment for PySpark with Jupyter Lab, MinIO as object storage, Hive Metastore, Trino and Kafka.


lasagna's Introduction

Lasagna (or pastabricks) is an interactive development environment I built to learn and practice PySpark.

It is built from a Docker Compose template that provisions Jupyter Lab, a two-worker Spark Standalone cluster, MinIO object storage, a Hive Standalone Metastore, Trino, and a Kafka cluster for simulating events.

Requirements:

  • Docker Desktop
  • Docker Compose

To use it you just have to clone this repository and execute the following:

docker compose up -d

Docker will build the images by itself. I recommend a wired internet connection for this step.

After all containers are up and running, execute the following to get the Jupyter Lab access link:

 docker logs workspace 2>&1 | grep http://127.0.0.1

(You can also find the link in the Docker Desktop logs.)

Click the link http://127.0.0.1:8888/lab?token=<token_gigante_super_seguro>
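If you would rather pull the link out programmatically than read through the logs, the sketch below shows one way to do it. The log excerpt is a hypothetical sample, and the regular expression assumes the URL has the same shape as the link above; the real output of `docker logs workspace` may look different.

```python
import re

# Matches a tokenized Jupyter Lab URL of the form shown in the README.
LAB_URL = re.compile(r"http://127\.0\.0\.1:\d+/lab\?token=\w+")


def find_lab_url(log_text):
    """Return the last tokenized Jupyter Lab URL found in the logs, or None."""
    matches = LAB_URL.findall(log_text)
    return matches[-1] if matches else None


# Hypothetical log excerpt, for illustration only.
sample = (
    "[I ServerApp] Jupyter Server is running at:\n"
    "[I ServerApp]     http://127.0.0.1:8888/lab?token=abc123\n"
)
print(find_lab_url(sample))
```

To run it against the real container, capture both stdout and stderr of `docker logs workspace` (for example with `subprocess.run(..., capture_output=True, text=True)`) and pass the combined text to `find_lab_url`.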

To start the Kafka broker you need to go to the kafka folder and execute the following:

docker compose up -d

What does Lasagna create?


The docker-compose.yml template creates a series of containers:

📙 Workspace

A Jupyter Lab client for interactive development sessions, featuring:

  • A work directory to persist your scripts and notebooks;
  • spark-defaults.conf pre-configured to make Spark Sessions easier to create;
  • Dedicated kernels for PySpark with Hive, Iceberg or Delta;

👀 Use the %SparkSession command to easily configure a Spark Session
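To see what a pre-configured spark-defaults.conf amounts to, here is a minimal sketch that parses the file format (one property per line, key and value separated by whitespace, `#` for comments) into a dictionary. The example content, including the `spark-master` host name, is hypothetical and is not copied from the repository.

```python
def parse_spark_defaults(text):
    """Parse spark-defaults.conf text into a dict of property -> value."""
    conf = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split(None, 1)  # key first, rest of the line is the value
        conf[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return conf


# Hypothetical file content, for illustration only.
example = """
# cluster settings
spark.master                      spark://spark-master:7077
spark.sql.catalogImplementation   hive
"""

conf = parse_spark_defaults(example)
print(conf["spark.master"])
```

Because Spark reads these defaults when a session starts, a notebook only has to set whatever differs from the file.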


📂 MinIO Object Storage

A single MinIO instance to serve as object storage:

  • Web UI accessible at localhost:9090 (user: admin, password: password);
  • s3a protocol API available at port 9000;
  • mount/minio and mount/minio-config directories mounted to persist data between sessions.
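As an illustration of how Spark can be pointed at this MinIO instance, the sketch below collects the usual Hadoop S3A properties in a dictionary. The `minio` host name is an assumption about the compose service name, and reusing the web UI credentials as S3 access keys is also an assumption; check docker-compose.yml and spark-defaults.conf for the values actually in use.

```python
def minio_s3a_conf(endpoint="http://minio:9000",
                   access_key="admin",
                   secret_key="password"):
    """Build Spark properties for reaching MinIO through the s3a connector.

    The defaults are assumptions based on the README (port 9000, admin/password);
    the `minio` host name is hypothetical.
    """
    return {
        "spark.hadoop.fs.s3a.endpoint": endpoint,
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
        # MinIO is normally addressed path-style rather than by bucket subdomain.
        "spark.hadoop.fs.s3a.path.style.access": "true",
        # The endpoint above is plain HTTP.
        "spark.hadoop.fs.s3a.connection.ssl.enabled": "false",
    }


for key, value in minio_s3a_conf().items():
    print(key, value)
```

Each pair can be passed to `SparkSession.builder.config(key, value)` or written as a line in spark-defaults.conf.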

✨ Spark Cluster

A standalone Spark cluster for workload processing:

  • 1 Master node (master at port 7077, web-ui at localhost:5050)
  • 2 Worker nodes (web-ui at localhost:5051 and localhost:5052)
  • All the necessary dependencies for MinIO connection;
  • Connectivity with MinIO @ port 9000.
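The ports listed above can be turned into connection strings as in the following sketch. The `spark-master` host name is hypothetical (use the service name from docker-compose.yml), and the `connect` helper is only an example of how a session could be opened against the standalone master; it needs PySpark installed and is not called here.

```python
def spark_endpoints(master_host="spark-master", ui_host="localhost"):
    """Return the master URL and web UI addresses for the ports in the README."""
    return {
        "master": f"spark://{master_host}:7077",
        "master_ui": f"http://{ui_host}:5050",
        "worker_uis": [f"http://{ui_host}:{port}" for port in (5051, 5052)],
    }


def connect(app_name="lasagna-demo"):
    """Open a SparkSession against the standalone master (requires pyspark)."""
    from pyspark.sql import SparkSession  # imported lazily; optional dependency

    master = spark_endpoints()["master"]
    return SparkSession.builder.master(master).appName(app_name).getOrCreate()


print(spark_endpoints())
```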

🐝 Hive Standalone Metastore

A Hive Standalone Metastore instance using PostgreSQL as its back-end database, allowing table metadata to persist between sessions.

  • mount/postgres directory to persist tables between development sessions;
  • Connectivity with the Spark cluster through the Thrift protocol at port 9083;
  • Connectivity with PostgreSQL through JDBC at port 5432.
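The two connections described above correspond to a Thrift URI on the Spark side and a JDBC URL on the metastore side. The sketch below builds both. The `hive-metastore` host name matches the container name that appears in the issue logs, while the `postgres` host name and the `metastore` database name are hypothetical placeholders.

```python
def metastore_conf(host="hive-metastore", port=9083):
    """Spark properties that point a session at the Hive metastore over Thrift."""
    return {
        "spark.sql.catalogImplementation": "hive",
        "spark.hadoop.hive.metastore.uris": f"thrift://{host}:{port}",
    }


def postgres_jdbc_url(host="postgres", port=5432, database="metastore"):
    """JDBC URL the metastore would use for its back-end database.

    Host and database names are hypothetical; only the port is from the README.
    """
    return f"jdbc:postgresql://{host}:{port}/{database}"


print(metastore_conf()["spark.hadoop.hive.metastore.uris"])
print(postgres_jdbc_url())
```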

🐰 Trino

A single Trino instance to serve as query engine.

  • Hive, Delta and Iceberg catalogs configured. All tables created using PySpark are accessible with Trino;
  • Standard service available at port 8080.

👀 Don't forget you can use the %trino magic command in your notebooks!
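Outside a notebook, Trino can also be reached over its HTTP client protocol on port 8080. The sketch below only builds the initial request with the standard library and does not send it. The user name `lasagna` is a hypothetical value, and whether the service is published on `localhost` depends on the port mapping in docker-compose.yml.

```python
import urllib.request


def trino_statement_request(sql, host="localhost", port=8080, user="lasagna"):
    """Build the initial POST /v1/statement request for a SQL statement."""
    return urllib.request.Request(
        url=f"http://{host}:{port}/v1/statement",
        data=sql.encode("utf-8"),        # the SQL text is the request body
        headers={"X-Trino-User": user},  # Trino requires a user on each request
        method="POST",
    )


req = trino_statement_request("SHOW CATALOGS")
print(req.get_method(), req.full_url)
```

Sending the request with `urllib.request.urlopen(req)` should return a JSON document. While it contains a `nextUri` field, the client follows that URI to fetch the remaining results. For anything beyond a quick check, a dedicated Trino client library is likely the more convenient choice.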

🌊 Kafka

A separate Docker Compose template with a single-node ZooKeeper + Kafka instance to mock data streams with a Python producer.

  • Uses the same network that the lasagna Docker Compose template creates;
  • A kafka-producer notebook/script is available to create random events with the Faker library;
  • Accessible at kafka:29092.
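To give an idea of what mocking a data stream involves, here is a minimal sketch that generates random events and serializes them to the bytes a Kafka producer would send. It uses only the standard library instead of Faker. The event fields are hypothetical and are not the schema used by the repository's kafka-producer notebook. Actually sending the payload requires a Kafka client library, which is left out.

```python
import json
import random
import uuid
from datetime import datetime, timezone

BOOTSTRAP_SERVER = "kafka:29092"  # broker address from the README


def make_event(rng):
    """Create one random event (hypothetical schema, for illustration only)."""
    return {
        "event_id": str(uuid.UUID(int=rng.getrandbits(128))),
        "user_id": rng.randint(1, 1000),
        "action": rng.choice(["view", "click", "purchase"]),
        "ts": datetime.now(timezone.utc).isoformat(),
    }


def serialize(event):
    """Encode an event as UTF-8 JSON, a common Kafka message value format."""
    return json.dumps(event).encode("utf-8")


rng = random.Random(42)  # seeded so repeated runs produce the same user and action
payload = serialize(make_event(rng))
print(payload)
```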


lasagna's Issues

Docker compose up ERROR Workspace 23/48 RUN jupyter serverextension enable --py jupyterlab_s3_browser

I'm trying to set up lasagna on a computer and am running into the following issue:

Jupyter Command jupyter-serverextension not found.

Edit: I was experimenting with the PySpark and Jupyter Lab versions and was able to get compose up working with PySpark 3.5.0 and Jupyter Lab 3.6.6. The compose failed a couple of times, which I think was related to network issues. I'm now checking whether updating PySpark to 3.5.0 causes other issues when working with notebooks.

hive-metastore | sh: 1: /entrypoint.sh: Permission denied

I tried to start the Docker Compose file and I'm getting the following (the first line repeats several times):

hive-metastore | sh: 1: /entrypoint.sh: Permission denied
hive-metastore exited with code 126

Is there anything else I should be doing?
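Exit code 126 conventionally means the command was found but could not be executed. One possible cause, not confirmed in this thread, is that the entrypoint script lost its executable bit on the host before being copied into the image (this can happen with some checkouts). A minimal sketch of checking and restoring the bit is shown below; the script path passed to `ensure_executable` is up to you, since the location of entrypoint.sh inside the repository is not stated here.

```python
import os
import stat


def ensure_executable(path):
    """Add execute permission for user, group and others if any is missing.

    Returns True if the mode was changed, False if it was already executable.
    """
    mode = stat.S_IMODE(os.stat(path).st_mode)  # permission bits only
    exec_bits = stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH
    if mode & exec_bits == exec_bits:
        return False
    os.chmod(path, mode | exec_bits)
    return True
```

After changing the file, the image would need to be rebuilt (for example with `docker compose build`) for the fix to reach the container.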
