
datayoga-io / datayoga

streaming data pipeline platform

Home Page: https://datayoga-io.github.io/datayoga/

License: Apache License 2.0

Languages: Python 99.41%, Dockerfile 0.59%
Topics: etl, pipeline, transformation, cdc, sqlalchemy, kafka, singer, elt, data, database

datayoga's Introduction


DataYoga is a framework for building and running streaming or batch data pipelines. It takes a low-code approach: pipelines are defined declaratively in YAML files.


[DataYoga overview diagram]

Concepts

Job - A Job is composed of a series of Steps that read information from a source, perform transformations, and write to a target. Many sources and targets are supported, including relational databases, non-relational databases, file formats, cloud storage, and HTTP servers.

Step - Each Step runs a Block that implements specific business logic. The output of each Step is fed into the next, creating a chain of transformations.

Block - A Block defines the business logic. Blocks can:

  • Read from and write to relational and non-relational databases
  • Read, write, and parse data from local storage and cloud storage
  • Perform transformations: modify structure, add computed fields, rename fields, or remove fields
  • Enrich data from external sources and APIs
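
To make these concepts concrete, here is a minimal sketch of a job file. The block names files.read_csv and std.write are assumptions used only for illustration; add_field appears elsewhere on this page:

steps:
  - uses: files.read_csv      # source step: read records from a CSV file (illustrative block name)
    with:
      file: data/users.csv
  - uses: add_field           # transform step: add a computed field via a JMESPath expression
    with:
      field: full_name
      language: jmespath
      expression: join(' ', [fname, lname])
  - uses: std.write           # target step: write records to standard output (illustrative block name)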

DataYoga Runtime

DataYoga provides a standalone stream processing engine, the DataYoga Runtime, that validates and runs transformation Jobs. The Runtime provides:

  • Validation
  • Error handling
  • Metrics and observability
  • Credentials management

The Runtime supports multiple stream processing strategies, including buffering and rate limiting. It supports async processing, multi-threading, and multi-processing to enable maximum throughput with a low footprint.

Quickstart

Install DataYoga using pip:

pip install datayoga

Verify that the installation completed successfully by running this command:

datayoga --version

Create a New DataYoga Project

To create a new DataYoga project, use the init command:

datayoga init hello_world
cd hello_world

Directory structure
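
The exact contents depend on the scaffold shipped with your version; since the sample job below is addressed as sample.hello, the layout presumably looks roughly like this:

hello_world/
├── jobs/
│   └── sample/
│       └── hello.yaml    # the sample job run below
└── data/                 # sample input data (location is an assumption)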

Run Your First Job

Let's run our first job. It is pre-defined in the samples folder created by the init command:

datayoga run sample.hello

If all goes well, you should see some startup logs, and eventually:

{"id": "1", "fname": "john", "lname": "doe", "credit_card": "1234-1234-1234-1234", "country_code": "972", "country_name": "israel", "gender": "M", "full_name": "John Doe", "greeting": "Hello Mr. John Doe"}
{"id": "2", "fname": "jane", "lname": "doe", "credit_card": "1000-2000-3000-4000", "country_code": "972", "country_name": "israel", "gender": "F", "full_name": "Jane Doe", "greeting": "Hello Ms. Jane Doe"}
{"id": "3", "fname": "bill", "lname": "adams", "credit_card": "9999-8888-7777-666", "country_code": "1", "country_name": "usa", "gender": "M", "full_name": "Bill Adams", "greeting": "Hello Mr. Bill Adams"}

That's it! You've created your first job: it loads data from a CSV file, runs it through a series of transformation steps, and writes the result to standard output. A good start. Read on for a more detailed tutorial, or check out the reference to see the different block types currently available.

datayoga's People

Contributors

mungabanna, spicy-sauce, vd2org, zalmane


datayoga's Issues

Data record should be of key, value, source format

Currently a record is Dict[str, Any], so we get the payload of either the key or the value. What we really need is both:

{
	"key": {
		"field1": "xx",
		"field2": "zz"
	},
	"value": {
		"field3": "aa",
		"field4": "bb"
	}
}

So it should be something like:
Dict[Literal["key", "value"], Dict[str, Any]]

validate job_settings

Currently this is done as part of the compile method in __init__.py, which is good. We also need a method that validates a job as a standalone task, for cases where the user wants to see in real time whether the job (YAML) is valid.

`init` CLI command

  • creates a new folder
  • scaffolds based on the 'scaffold' folder, with sample jobs and configurations

change JMESPath's concat implementation in README and test

Currently this is the example we have:

  - uses: add_field
    with:
      field: full_name
      language: jmespath
      expression: '{ "fname": fname, "lname": lname} | join('' '', values(@))'

We have the concat block for this purpose, so we should probably replace this example with something more intuitive.
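
One possibility, assuming concat is exposed as an expression function that accepts a list of values (the exact signature is an assumption):

  - uses: add_field
    with:
      field: full_name
      language: jmespath
      expression: concat([fname, ' ', lname])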

jsonschema block

Validates message content against a JSON Schema, provided either inline, via a file, or via a ConfigMap.

`http.read` block

Two blocks:

  • http to fetch data (web services), as a source
  • http to post data, as a sink
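
For example, the source variant might be used like this (the block name and parameters are hypothetical, since the block does not exist yet):

  - uses: http.read          # hypothetical block: fetch records from a web service
    with:
      url: https://example.com/api/users
      method: GET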

`datetime` custom functions

As a user of the transformation engine, I need easier ways to deal with dates, both for filtering and for adding date fields.

I would like to have custom functions that can:

  • Help me filter based on an absolute date (provided as a string, or as day, month, year, hour, min, sec, ms, tz) - we can base this on Python datetime methods
  • Help me filter based on a moving window (x days, y hours, z min, v sec, etc. from now) - we can base this on timedelta methods
  • Express now(tz)
  • Translate a date to Unix epoch
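
Usage could look like the following filter step; every function name here (now, date_add, to_epoch) is a hypothetical illustration of the request above:

  - uses: filter
    with:
      language: jmespath
      # keep only records from the last 7 days (sketch)
      expression: to_epoch(order_date) >= to_epoch(date_add(now('UTC'), -7, 'days'))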

framework for managing connections

A utility for exposing connections to the blocks. Blocks should be able to import a method and use the context to grab a connection by name.
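
For example, connections could be declared in a catalog file and resolved by name; the file name and fields below are assumptions:

# connections.yaml (hypothetical catalog)
hr:
  type: postgresql
  host: localhost
  port: 5432
  database: hr

A block would then declare something like connection: hr in its with section and let the framework supply the actual connection and credentials.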

`add_field`, `remove_field`, `rename_field` blocks - support list

Instead of defining multiple blocks for adding new fields, support a list format too, for example:

  - uses: add_field
    with:
      - field: BillingHouseNumber
        expression: split(`BillingAddress`,'.')[0]
      - field: BillingStreet
        expression: split(`BillingAddress`,'.')[1]
      - field: BillingCountry
        expression: upper(`BillingCountry`)
      - field: VAT
        expression: total * 1.17
        language: sql

The current format should also still be supported for dealing with a single field:

  - uses: add_field
    with:
        field: BillingHouseNumber
        expression: split(`BillingAddress`,'.')[0]

`hash`, `uuid` custom functions

As a user I need the following functions:

  • hash - so that I can hash a string (or the whole entry)
  • checksum - so that I can compare hashes
  • uid - so that I can add a UID to an entry key
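
In a job, these could surface as expression functions; the names mirror the list above and are hypothetical:

  - uses: add_field
    with:
      field: card_hash
      language: jmespath
      expression: hash(credit_card)    # hypothetical function: hash a single field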

support multiple 'block' repositories

Ability to 'import' a block repo from a URL or from an offline package (zip?). This should enable block packages, so that only 'core' blocks are included in the basic repo and others can be loaded on demand.
