
datayoga-io / datayoga

streaming data pipeline platform

Home Page: https://datayoga-io.github.io/datayoga/

License: Apache License 2.0

Languages: Python 99.41%, Dockerfile 0.59%
Topics: etl, pipeline, transformation, cdc, sqlalchemy, kafka, singer, elt, data, database

datayoga's Introduction


DataYoga is a framework for building and running streaming or batch data pipelines. It takes a low-code approach: pipelines are defined declaratively in YAML files.


[DataYoga overview diagram]

Concepts

Job - A Job is composed of a series of Steps that read information from a source, perform transformations, and write to a target. Many sources and targets are supported, including relational databases, non-relational databases, file formats, cloud storage, and HTTP servers.

Step - Each Step runs a Block that implements specific business logic. The output of each Step is fed into the next, creating a chain of transformations.

Block - A Block defines the business logic. Blocks can:

  • Read from and write to relational and non-relational databases
  • Read, write, and parse data from local storage and cloud storage
  • Perform transformations: modify structure, add computed fields, rename fields, or remove fields
  • Enrich data from external sources and APIs
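
To make these concepts concrete, here is a minimal sketch of a job file. The block names files.read_csv and std.write are assumptions used only for illustration; add_field appears elsewhere on this page:

steps:
  - uses: files.read_csv      # source step: read records from a CSV file (illustrative block name)
    with:
      file: data/users.csv
  - uses: add_field           # transform step: add a computed field via a JMESPath expression
    with:
      field: full_name
      language: jmespath
      expression: join(' ', [fname, lname])
  - uses: std.write           # target step: write records to standard output (illustrative block name)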

DataYoga Runtime

DataYoga provides a standalone stream processing engine, the DataYoga Runtime, that validates and runs transformation Jobs. The Runtime provides:

  • Validation
  • Error handling
  • Metrics and observability
  • Credentials management

The Runtime supports multiple stream processing strategies, including buffering and rate limiting. It supports async processing, multi-threading, and multi-processing to enable maximum throughput with a low footprint.

Quickstart

Install DataYoga using pip:

pip install datayoga

Verify that the installation completed successfully by running this command:

datayoga --version

Create a New DataYoga Project

To create a new DataYoga project, use the init command:

datayoga init hello_world
cd hello_world

Directory structure
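
The exact contents depend on the scaffold shipped with your version; since the sample job below is addressed as sample.hello, the layout presumably looks roughly like this:

hello_world/
├── jobs/
│   └── sample/
│       └── hello.yaml    # the sample job run below
└── data/                 # sample input data (location is an assumption)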

Run Your First Job

Let's run our first job. It is pre-defined in the samples folder created by the init command:

datayoga run sample.hello

If all goes well, you should see some startup logs, and eventually:

{"id": "1", "fname": "john", "lname": "doe", "credit_card": "1234-1234-1234-1234", "country_code": "972", "country_name": "israel", "gender": "M", "full_name": "John Doe", "greeting": "Hello Mr. John Doe"}
{"id": "2", "fname": "jane", "lname": "doe", "credit_card": "1000-2000-3000-4000", "country_code": "972", "country_name": "israel", "gender": "F", "full_name": "Jane Doe", "greeting": "Hello Ms. Jane Doe"}
{"id": "3", "fname": "bill", "lname": "adams", "credit_card": "9999-8888-7777-666", "country_code": "1", "country_name": "usa", "gender": "M", "full_name": "Bill Adams", "greeting": "Hello Mr. Bill Adams"}

That's it! You've created your first job: it loads data from a CSV file, runs it through a series of transformation steps, and writes the result to standard output. A good start. Read on for a more detailed tutorial, or check out the reference to see the different block types currently available.

datayoga's People

Contributors

mungabanna, spicy-sauce, vd2org, zalmane


datayoga's Issues

Data record should be of key, value, source format

Currently a record is Dict[str, Any], so we get the payload of either the key or the value. What we really need is both:

{
	"key": {
		"field1": "xx",
		"field2": "zz"
	},
	"value": {
		"field3": "aa",
		"field4": "bb"
	}
}

So it should be something like:
Dict[Literal["key", "value"], Dict[str, Any]]

validate job_settings

Currently this is done as part of the compile method in __init__.py, which is good. We also need a method that validates a job as a standalone task, for cases where the user wants to see in real time whether the job (YAML) is valid.

`init` CLI command

  • creates a new folder
  • scaffolds based on the 'scaffold' folder, with sample jobs and configurations

change JMESPath's concat implementation in README and test

Currently this is the example we have:

  - uses: add_field
    with:
      field: full_name
      language: jmespath
      expression: '{ "fname": fname, "lname": lname} | join('' '', values(@))'

We have the concat block for this purpose, so we should probably replace this example with something more intuitive.
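
One possibility, assuming concat is exposed as an expression function that accepts a list of values (the exact signature is an assumption):

  - uses: add_field
    with:
      field: full_name
      language: jmespath
      expression: concat([fname, ' ', lname])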

jsonschema block

Validates message content against a JSON Schema, provided either inline, via a file, or via a ConfigMap.

`http.read` block

Two blocks:

  • http to fetch data (web services), as a source
  • http to post data, as a sink
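
For example, the source variant might be used like this (the block name and parameters are hypothetical, since the block does not exist yet):

  - uses: http.read          # hypothetical block: fetch records from a web service
    with:
      url: https://example.com/api/users
      method: GET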

`datetime` custom functions

As a user of the transformation engine, I need easier ways to deal with dates, both for filtering and for adding date fields.

I would like to have custom functions that can:

  • Help me filter based on an absolute date (provided as a string, or as day, month, year, hour, min, sec, ms, tz) - we can base this on Python datetime methods
  • Help me filter based on a moving window (x days, y hours, z min, v sec, etc. from now) - we can base this on timedelta methods
  • Express now(tz)
  • Translate a date to Unix epoch
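
Usage could look like the following filter step; every function name here (now, date_add, to_epoch) is a hypothetical illustration of the request above:

  - uses: filter
    with:
      language: jmespath
      # keep only records from the last 7 days (sketch)
      expression: to_epoch(order_date) >= to_epoch(date_add(now('UTC'), -7, 'days'))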

framework for managing connections

A utility for exposing connections to the blocks. Blocks should be able to import a method and use the context to grab a connection by name.
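
For example, connections could be declared in a catalog file and resolved by name; the file name and fields below are assumptions:

# connections.yaml (hypothetical catalog)
hr:
  type: postgresql
  host: localhost
  port: 5432
  database: hr

A block would then declare something like connection: hr in its with section and let the framework supply the actual connection and credentials.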

`add_field`, `remove_field`, `rename_field` blocks - support list

Instead of defining multiple blocks for adding new fields, support a list format too, for example:

  - uses: add_field
    with:
      - field: BillingHouseNumber
        expression: split(`BillingAddress`,'.')[0]
      - field: BillingStreet
        expression: split(`BillingAddress`,'.')[1]
      - field: BillingCountry
        expression: upper(`BillingCountry`)
      - field: VAT
        expression: total * 1.17
        language: sql

The current format should also still be supported for dealing with a single field:

  - uses: add_field
    with:
        field: BillingHouseNumber
        expression: split(`BillingAddress`,'.')[0]

`hash`, `uuid` custom functions

As a user I need the following functions:

  • hash - so that I can hash a string (or the whole entry)
  • checksum - so that I can compare hashes
  • uid - so that I can add a UID to an entry key
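
In a job, these could surface as expression functions; the names mirror the list above and are hypothetical:

  - uses: add_field
    with:
      field: card_hash
      language: jmespath
      expression: hash(credit_card)    # hypothetical function: hash a single field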

support multiple 'block' repositories

Ability to 'import' a block repo from a URL or from an offline package (zip?). This should enable block packages, so that only 'core' blocks are included in the basic repo and others can be loaded on demand.
