CI/CD with Databricks

Continuous Integration and Continuous Deployment (CI/CD) is a process for delivering software to end users faster, more efficiently, and more reliably. CI/CD automates building, testing, and deployment, which shortens the time between committing code changes and releasing them to production.

Databricks is a cloud-based data engineering and analytics platform that provides a collaborative environment for building, managing, and deploying data pipelines, machine learning models, and analytics applications. It also provides an API to programmatically interact with the Databricks workspace.

In this repository, we'll discuss how to set up a CI/CD pipeline with Databricks using GitHub Actions, the Databricks CLI, and the Databricks REST API.

Prerequisites

  • A GitHub account.
  • A Databricks account with an active workspace.
  • Databricks CLI installed on your local machine.
  • An Azure Blob storage account to store build artifacts (Optional).

Setup

Create a Databricks token

  • Go to your Databricks workspace.
  • Click on the user icon in the top-right corner and select "User Settings".
  • Click on the "Access Tokens" tab.
  • Click on the "Generate New Token" button.
  • Give a name to the token, select the "Manage" permission, and click on the "Generate" button.
  • Copy the generated token.
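
Once the token is copied, it can be used to authenticate calls to the Databricks REST API. The following is a minimal sketch, not part of this repository, showing how a request might be built with only the Python standard library. The function name build_databricks_request and the example host are hypothetical; the sketch assumes the token is sent as a Bearer token and uses the clusters list endpoint purely as an illustration.

```python
import urllib.request


def build_databricks_request(host: str, token: str, path: str) -> urllib.request.Request:
    """Build (but do not send) an authenticated GET request to a workspace.

    host  -- workspace URL (hypothetical example: "https://example.test")
    token -- the personal access token generated in the steps above
    path  -- REST API path, e.g. "/api/2.0/clusters/list"
    """
    # Normalise slashes so host and path always join with exactly one "/".
    url = host.rstrip("/") + "/" + path.lstrip("/")
    return urllib.request.Request(
        url,
        # Assumption: the token is passed as a Bearer token.
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )
```

To actually send the request, pass the returned object to urllib.request.urlopen. Keep the token out of source control, for example by reading it from an environment variable.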

Set up GitHub integration with Databricks

  • Navigate to GitHub --> Settings --> Developer settings --> Personal access tokens
  • Generate new token (classic) --> Enter password --> Enter note
  • Select the repo and workflow scopes --> Submit --> Copy the token
  • Log in to Databricks --> User Settings --> Git Integration
  • Git provider (GitHub) --> Git provider username or email (your Git credentials) --> Token
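
The same Git credential can, in principle, be registered programmatically instead of through the UI. The sketch below is an assumption-laden illustration rather than this repository's method: it assumes the Databricks Git credentials endpoint (/api/2.0/git-credentials) with the fields git_provider, git_username, and personal_access_token, and the provider value "gitHub". Verify these against the current Databricks REST API reference before relying on them. The function name is hypothetical.

```python
import json
import urllib.request


def build_git_credential_request(
    host: str, databricks_token: str, git_username: str, github_token: str
) -> urllib.request.Request:
    """Build (but do not send) a request that registers a GitHub token."""
    # Assumed payload shape; check the field names against the API reference.
    payload = {
        "git_provider": "gitHub",
        "git_username": git_username,
        "personal_access_token": github_token,
    }
    return urllib.request.Request(
        host.rstrip("/") + "/api/2.0/git-credentials",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {databricks_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Note that two different tokens are involved: the Databricks token authenticates the API call, while the GitHub token is the credential being stored.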

Configure GitHub Actions (CI/CD)

Configure secrets

  • Go to Settings --> Secrets and variables --> Actions and add the following secrets:
    • DATABRICKS_HOST_PRD: https://< databricks_workspace_name >.com/
    • DATABRICKS_TOKEN_PRD: < personal access token generated in Databricks >
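
A script run inside a workflow step can read these two secrets and fail fast if either is missing. The following is a minimal sketch under one assumption: that the workflow exposes the secrets to the step as environment variables with the same names (GitHub Actions does not do this automatically; the workflow has to map them). The names DatabricksConfig and load_prd_config are hypothetical.

```python
import os
from typing import Mapping, NamedTuple

REQUIRED_SECRETS = ("DATABRICKS_HOST_PRD", "DATABRICKS_TOKEN_PRD")


class DatabricksConfig(NamedTuple):
    host: str
    token: str


def load_prd_config(env: Mapping[str, str] = os.environ) -> DatabricksConfig:
    """Read the production host and token, raising if either is absent or empty."""
    missing = [name for name in REQUIRED_SECRETS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing required secrets: " + ", ".join(missing))
    return DatabricksConfig(
        # Strip the trailing slash so API paths can be appended cleanly.
        host=env["DATABRICKS_HOST_PRD"].rstrip("/"),
        token=env["DATABRICKS_TOKEN_PRD"],
    )
```

Passing the environment as a parameter keeps the function easy to test without touching the real process environment.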

Action time!

  • Go to Databricks Repos and clone your repository.
  • Check out the branch feature/< username >.
  • A few code-level changes are required before you raise a pull request:
  • Update the bronze unit test test_load_data_into_bronze and set expected_num_files = 2.
  • Update the silver unit tests and add a new assertion. You are free to write any test case.
  • Review the integration_suite test, which uses the Files in Repos feature, and fill in the missing elements:
    • src/main/python/gold/gold_layer_etl.py
    • src/main/tests/integration_suite/test_integration_gold_layer_etl
  • Review the dbx deployment file.
    • Open cicd_with_databricks/deployment/deploy-job.yaml and update the notebook_path variable to your Databricks Repos location.
  • Review the files under .github/workflows/ to understand the CI/CD plan.
  • Create a pull request and view the CI/CD unit testing job that spins up in GitHub → Actions.
  • Once it succeeds, merge the pull request into the develop branch and view the CI/CD integration testing job that spins up.
  • Once the integration tests complete on the develop branch, raise a PR from develop into main. View the CI/CD job that spins up and runs the unit and integration tests.
  • Once it succeeds, merge the pull request into the main branch and view the CI/CD job that creates Databricks workflow jobs and launches them.
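
The branch flow described in the steps above can be summarised as a small lookup table. This is only an illustrative sketch of the plan, not the actual workflow definitions in the repository: the event names, the stage names, and the function stages_for are hypothetical, and it assumes the first pull request targets develop and the final merge lands on main.

```python
from typing import Dict, List, Tuple

# (event, target branch) -> stages that run; all names are illustrative.
CICD_PLAN: Dict[Tuple[str, str], List[str]] = {
    ("pull_request", "develop"): ["unit_tests"],
    ("merge", "develop"): ["integration_tests"],
    ("pull_request", "main"): ["unit_tests", "integration_tests"],
    ("merge", "main"): ["create_workflow_jobs", "launch_workflow_jobs"],
}


def stages_for(event: str, target_branch: str) -> List[str]:
    """Return the stages for an event, or an empty list if nothing is configured."""
    return list(CICD_PLAN.get((event, target_branch), []))
```

For example, stages_for("pull_request", "main") returns both test stages, while an event on any other branch returns an empty list.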

Contributors

shivampanicker, shivampanicker16
