raystack / optimus

Optimus is an easy-to-use, reliable, and performant workflow orchestrator for data transformation, data modeling, pipelines, and data quality management.

Home Page: https://raystack.github.io/optimus

License: Apache License 2.0

Languages: Dockerfile 0.02%, Makefile 0.20%, Go 98.18%, Python 1.49%, Shell 0.11%
Topics: airflow, etl, workflows, automation, golang, bigquery, data-warehouse, analytics, data-modelling, analytics-engineering

optimus's Introduction

Optimus


Optimus is an easy-to-use, reliable, and performant workflow orchestrator for data transformation, data modeling, pipelines, and data quality management. It enables data analysts and engineers to transform their data by writing simple SQL queries and YAML configuration while Optimus handles dependency management, scheduling and all other aspects of running transformation jobs at scale.

Key Features

Discover why users choose Optimus as their main data transformation tool.

  • Warehouse management: Optimus allows you to create and manage your data warehouse tables and views through YAML-based configuration.
  • Scheduling: Optimus provides an easy way to schedule your SQL transformations through a YAML-based configuration.
  • Automatic dependency resolution: Optimus parses your data transformation queries and builds the dependency graph automatically, instead of requiring users to define their source and target dependencies in DAGs.
  • Dry runs: Before a SQL query is scheduled for transformation, it is dry-run during deployment to make sure it passes basic sanity checks.
  • Powerful templating: Optimus provides compile-time query templating with variables, loops, if statements, macros, etc., allowing users to write complex transformation logic.
  • Cross tenant dependency: Optimus is a multi-tenant service; if two tenants, serviceA and serviceB, are registered, serviceB can write queries referencing serviceA as a source, and Optimus will handle this dependency as well.
  • Hooks: Optimus provides hooks for post-transformation logic, e.g. sinking BigQuery tables to Kafka.
  • Extensibility: Optimus supports Python transformations and allows writing custom plugins.
  • Workflows: Optimus provides industry-proven workflows using Git-based specification management, as well as REST/gRPC-based specification management, for data warehouse management.

Usage

Optimus has two components: the Optimus service, the core orchestrator installed on the server side, and a CLI binary used to interact with this service. You can install the Optimus CLI using Homebrew on macOS:

$ brew install raystack/tap/optimus
$ optimus --help

Optimus is an easy-to-use, reliable, and performant workflow orchestrator for
data transformation, data modeling, pipelines, and data quality management.

Usage:
  optimus [command]

Available Commands:
  backup      Backup a resource and its downstream
  completion  Generate the autocompletion script for the specified shell
  extension   Operate with extension
  help        Help about any command
  init        Interactively initialize Optimus client config
  job         Interact with schedulable Job
  migration   Command to do migration activity
  namespace   Commands that will let the user to operate on namespace
  playground  Play around with some Optimus features
  plugin      Manage plugins
  project     Commands that will let the user to operate on project
  resource    Interact with data resource
  secret      Manage secrets to be used in jobs
  scheduler   Scheduled/run job related functions
  serve       Starts optimus service
  version     Print the client version information

Flags:
  -h, --help       help for optimus
      --no-color   Disable colored output

Use "optimus [command] --help" for more information about a command.

Documentation

Explore the following resources to get started with Optimus:

  • Guides provides guidance on using Optimus.
  • Concepts describes all important Optimus concepts.
  • Reference contains details about configurations, metrics and other aspects of Optimus.
  • Contribute contains resources for anyone who wants to contribute to Optimus.

Running locally

Optimus requires the following dependencies:

  • Golang (version 1.16 or above)
  • Git

Run the following commands to compile Optimus from source:

$ git clone git@github.com:raystack/optimus.git
$ cd optimus
$ make

Use the following command to run it:

$ ./optimus version

The Optimus service can be started with:

$ ./optimus serve

The serve command has a few required configurations that need to be set before it can start. Read more about them in the getting started guide.

Compatibility

Optimus is currently undergoing heavy development with frequent, breaking API changes. The current major version is zero (v0.x.x) to accommodate rapid development and fast iteration while getting early feedback from users (feedback on APIs is appreciated). The public API could change without a major version update before the v1.0.0 release.

Contribute

Development of Optimus happens in the open on GitHub, and we are grateful to the community for contributing bugfixes and improvements. Read below to learn how you can take part in improving Optimus.

Read our contributing guide to learn about our development process, how to propose bugfixes and improvements, and how to build and test your changes to Optimus.

To help you get your feet wet and get you familiar with our contribution process, we have a list of good first issues that contain bugs which have a relatively limited scope. This is a great place to get started.

License

Optimus is Apache 2.0 licensed.

optimus's People

Contributors

anuraagbarde, arinda-arif, dependabot[bot], deryrahman, irainia, kushsharma, lollyxsrinand, mauliksoneji, mryashbhardwaj, novanxyz, ravisuhag, rootcss, sbchaos, scortier, siddhanta-rath, smarchint, sravankorumilli, sumitagrawal03071989, tharun1718333, vianhazman


optimus's Issues

Secrets can be used through macros in the job specs.

All user-created secrets at the namespace or project level can be referenced by users in the job spec and should be evaluated while fetching the assets and the spec.

Acceptance Criteria

  • When the secrets are referenced properly, they should be evaluated and returned by the registerInstance API.
  • When the secrets are not referenced properly, the API call should fail with a relevant message.
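As an illustration only (the exact macro syntax and compiler hooks in Optimus may differ), secret evaluation at asset-fetch time could look like this Go sketch, where a secrets map is applied to the asset with text/template and a badly referenced secret surfaces as an error:

package compiler

import (
	"bytes"
	"text/template"
)

// renderAsset is a sketch of secret macro evaluation while fetching assets:
// occurrences like {{ .secret.SOME_NAME }} are replaced from the project or
// namespace secret map. Names and syntax here are illustrative, not the
// actual Optimus implementation.
func renderAsset(asset string, secrets map[string]string) (string, error) {
	tmpl, err := template.New("asset").Option("missingkey=error").Parse(asset)
	if err != nil {
		return "", err
	}
	var out bytes.Buffer
	// A missing or badly referenced secret makes Execute fail, which maps to
	// the acceptance criterion of failing the API call with a relevant message.
	if err := tmpl.Execute(&out, map[string]interface{}{"secret": secrets}); err != nil {
		return "", err
	}
	return out.String(), nil
}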

Support for External Sensor for Optimus Jobs

Currently, Optimus supports sensors for job dependencies that are within or outside the project, as long as they are managed by the same Optimus server. It would be helpful if Optimus also supported sensors for jobs managed by a different Optimus deployment, since an organisation may run many deployments, and checking only for data availability does not always guarantee the completeness and correctness of data, which Optimus dependencies do guarantee.

Expectation:
The sensor checks the status of the jobs between the input window boundaries.

Configuration:

dependencies:
- job:
  type: external
  project:
  host:
  start_time:  # start time of the data that the job depends on
  end_time:    # end time of the data that the job depends on

The Optimus server that accepts the request checks, based on its window and schedule configuration, for all the jobs that output data for the given window.

This has the challenge of breaking dependencies when a job name changes.

Broken Link

The link to the contributing guide in line 101 of Readme.md is broken.
The line:
Read our [contributing guide](docs/contribute/contribution.md) to learn about our development process, how to propose bugfixes and improvements, and how to build and test your changes to Optimus.

Show recent replay list of a project

Currently, the available Replay subcommands are run, to start a replay, and status, to get the status of one replay using its ID. However, Optimus should also be able to list the recent replays of a project, so that users can check the latest replay requests, including the ID, which job and date range is being replayed, when the replay happened, and its status. A list subcommand will be added to accommodate this.

Example: optimus replay list --project [project_name]

Delete call of job specification doesn't work

Sending a REST call to delete a job specification returns 404, whereas the gRPC call works fine. Steps to reproduce:

curl -X DELETE "http://localhost:9100/v1/project/my-project/namespace/kush/helloworld" -H  "accept: application/json"

Optimus command to start a program with context envs injected by default

When a plugin docker container executes and requires configuration/asset files as input, it can either use gRPC/REST calls to fetch them from Optimus or use the existing command optimus admin build instance, which writes them to the local filesystem.

What if we had a run command instead that injects these configuration variables as environment variables by default? Something like optimus admin run instance python3 main.py --some-arg, where python3 main.py --some-arg will vary based on each plugin.

Create backup resource run

  • Add the command
  • Generate the job spec from the destination (already available, need to test)
  • Resolve the dependencies of the job (just use the resolver)
  • Response of list of tables to be backed up
    • As part of the response, we can highlight which of the tables will be backed up and which will not (if the user chose to ignore downstream backups)
  • Prompt for confirmation to proceed with the backup
  • Decide if the job can be backed up, and which datastore and destination to use
  • Backup the resource
    • Bigquery
      • Table
      • View
      • External table (to be verified)

Refactor optimus plugins to have a base plugin interface

Currently, each plugin type has its own interface with a set of functions in it. We should break this down into multiple interfaces, such as a BaseInterface implemented by all plugins, a CLIInterface for plugins that want to be exposed via CLI questions/answers, etc. A minimal sketch is shown below.
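A minimal sketch of the proposed split, with illustrative names rather than the actual Optimus plugin API:

package plugin

// BasePlugin is what every plugin would implement.
type BasePlugin interface {
	Name() string
	Description() string
}

// CLIPlugin is implemented only by plugins that want to be surfaced in the
// CLI's questions/answers flow; other capability-specific interfaces could
// follow the same pattern.
type CLIPlugin interface {
	BasePlugin
	Questions() []string
}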

Support for RateLimits for replay such that the scheduled jobs are not impacted.

Is your feature request related to a problem? Please describe.
The Replay feature is handy, but it can also be a problem when it disrupts regular scheduled runs by hitting the worker pool limits or the limits of the underlying datastore.
It would be nice to have a feature to limit the number of jobs running due to replay.

Describe the solution you'd like
A config that can be set at the project or namespace level to rate-limit the number of active job runs due to replay, considering all the downstream runs. An option to force this validation should also be available, mainly for admin usage when replaying a higher-priority job.

Describe alternatives you've considered
Currently, the design has no mechanism to identify whether a job run was triggered by a replay or by the regular schedule, so there is no way to configure such jobs differently and apply limits at the underlying datastore or any downstream level.
Even assigning all the replay requests to a different pool at the scheduler level is not an option right now. For all of that, a custom scheduler with state management handled by Optimus is the solution, but that would be a big change.

Additional context
With secret management, Optimus will have the flexibility to configure different service accounts for different jobs, so one can assign service accounts accordingly such that resources are used properly across jobs.

Optimus should not fail on startup because of bad plugins

When loading a plugin after the discovery phase, at the first gRPC client handshake between Optimus core and the plugin server, don't terminate the application on error; instead, simply log a warning and continue, treating the binary as an invalid plugin. A sketch of this behaviour follows.
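A minimal sketch of the intended behaviour, assuming a hypothetical connect function that wraps the gRPC handshake:

package plugin

import "log"

// loadAll attempts the handshake for every discovered binary; on failure it
// logs a warning and skips that binary instead of terminating the server.
func loadAll(binaries []string, connect func(string) error) []string {
	var loaded []string
	for _, bin := range binaries {
		if err := connect(bin); err != nil {
			log.Printf("warning: skipping invalid plugin %s: %v", bin, err)
			continue
		}
		loaded = append(loaded, bin)
	}
	return loaded
}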

Option to ignore inferred dependencies

Optimus supports figuring out dependencies automatically by parsing task assets. This dependency-finding logic is implemented per task. Users can also choose not to depend on the task's inferred dependencies and pass them explicitly in the job.yaml specification. We need a way to explicitly ignore the task's automatically inferred dependencies.

We can use the existing specification file to add a non-breaking change as follows:

name: job1
dependencies:
- job: hello_world1
- job: hello_world3
- ignore_job: hello_world2
- ignore_job: hello_world4

In this case, if the task used in this job somehow infers hello_world2 as one of its upstreams, we will choose not to treat it as an upstream dependency. Similarly, if the inference logic does not find hello_world4 as an upstream, nothing happens and no error should be thrown. A filtering sketch is shown below.
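A sketch of how the ignore list could be applied to inferred dependencies (function and field names are illustrative, not the actual Optimus code):

package job

// filterIgnored drops inferred dependencies listed under ignore_job; names in
// the ignore list that were never inferred are silently skipped, so entries
// like hello_world4 do not raise an error.
func filterIgnored(inferred []string, ignored map[string]bool) []string {
	var kept []string
	for _, dep := range inferred {
		if ignored[dep] {
			continue
		}
		kept = append(kept, dep)
	}
	return kept
}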

Create backup resource dry run

  • Add backup command
  • Generate the job spec using datastore destination (already available, need to test)
  • Resolve the dependencies of the job (use the resolver)
  • Response of list of tables to be backed up

Implement google sheets external data table type for bigquery datastore

Enable Google Sheets external table management for the BigQuery datastore via Optimus.

User should be able to:

  • Create a google sheets external table by specifying sheets URI
  • Define schema for the google sheets external table
  • Use metadata management feature such as Labels

The implementation should be able to extend other BigQuery supported external data sources for future development.
About BigQuery external tables: https://cloud.google.com/bigquery/docs/external-tables

Support for opentelemetry metrics

Currently, no stats/metrics/traces are being pushed by the Optimus service. It should support emitting basic stats like CPU/memory/GC usage, time taken to complete gRPC calls, etc.

Using integer type in job spec configs causes panic

Currently, the job spec YAML configuration only supports string key-value pairs. Having string key-value pairs is fine, but passing an int (or any other type) should be handled gracefully instead of causing a panic.

panic: interface conversion: interface {} is int, not string
goroutine 1 [running]:
github.com/odpf/optimus/store/local.JobSpecAdapter.ToSpec
....
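A sketch of graceful handling (not the actual JobSpecAdapter code): scalar YAML values are stringified, and only genuinely unsupported types return an error instead of panicking:

package local

import "fmt"

// toConfigString converts a decoded YAML config value to a string without a
// hard type assertion, so an int (or bool, float) no longer panics.
func toConfigString(v interface{}) (string, error) {
	switch val := v.(type) {
	case string:
		return val, nil
	case int, int64, float64, bool:
		return fmt.Sprintf("%v", val), nil
	default:
		return "", fmt.Errorf("unsupported config value type %T", v)
	}
}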

Backup & Replay Improvements.

The table name of the backup result should preferably have a timestamp suffix instead of a UUID.

To be considered:
  • the limit on table name length
  • a separator in the timestamp to keep it readable
  • the timestamp should equal the backup time and be used as the suffix for all downstream tables

The backup list response can include extra useful information:
  • high-level information (from the user-request point of view)
  • the ignore-downstream choice should be added
  • TTL (not the expiry time)

Add a Backup details subcommand to show the list of all tables backed up.

User should be able to create/update a secret through apis & cli.

A user, through APIs or the CLI, should be able to create/update secrets so that they can reference the secrets and use them in Optimus jobs.

The user should be provided an option to create a secret at the project or namespace level.

Acceptance Criteria

  • gRPC endpoints to create/update a secret, accepting base64-encoded values.
  • CLI to create/update secrets, accepting both base64 and plain text.
  • Update documentation
  • Secrets to be encrypted securely.

Support Replay and Backup for multiple namespaces project

Users should be able to do backup and replay for downstream jobs in a different namespace, as long as they are authorized to do so.

  • Should be able to accept allowed_downstream with possible values * (all namespaces) or empty (only the requested namespace); applied to both replay and backup.
  • Should be able to accept ignore_downstream in Replay.

Support custom date range generation via SQL query

Right now in Optimus, the date range is generated by this window config in the task section of the job.yaml file:

window:
    size: 24h
    offset: 24h
    truncate_to: d

In some of our use cases, we need to generate a custom date range based on certain conditions in the form of a SQL query. For example:

SELECT DISTINCT DATE(event_date) as data_date
FROM some_table
WHERE (event_date >= start_date and event_date < end_date)

UNION DISTINCT

SELECT DISTINCT DATE(created_date) as data_date
FROM some_table_2
WHERE ((created_date >= start_date AND created_date < end_date)  
OR (last_modified_date >= start_date AND last_modified_date < end_date))
ORDER BY 1

The date range generated by the above query will then be used as parameters for the job. A sketch of this flow is shown below.
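A sketch of that flow, assuming the query returns a single data_date column and using database/sql generically (the driver, query, and how the dates feed job runs are placeholders, not part of Optimus today):

package main

import (
	"database/sql"
	"time"
)

// datesFromQuery runs a user-supplied date-range query with the window's
// start and end as parameters and returns each data_date, to be used as a
// run parameter for the job.
func datesFromQuery(db *sql.DB, query string, start, end time.Time) ([]time.Time, error) {
	rows, err := db.Query(query, start, end)
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	var dates []time.Time
	for rows.Next() {
		var d time.Time
		if err := rows.Scan(&d); err != nil {
			return nil, err
		}
		dates = append(dates, d)
	}
	return dates, rows.Err()
}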

Provide a provision to configure resources for Jobs

Is your feature request related to a problem? Please describe.
Pods should have defaults configured but should also allow per-job configuration; currently there is no provision for configuring CPU and memory for jobs. Provide a mechanism to configure resources for jobs.


User should be able to delete a secret.

User should be able to delete a secret which is no longer used in any job specs.

Acceptance Criteria

  • If the secret is referenced in any job spec, the delete should fail and list all the jobs where it is referenced.
  • If the secret is not referenced, the deletion should succeed.
  • If the secret doesn't exist, show a clear message that the secret doesn't exist.

Allow plugins to skip assets compilation with Go template

Some plugins might have their own asset compilation method, for example using Go templates or Jinja, which use their own set of variables. For example, an email-sending plugin might have various kinds of variables:

Dear {{ .RECIPIENTS }},
Attached in this email is the monthly report for {{ .MONTH }}

Based on the discussion with @kushsharma, the proposed workaround is to add a SkipCompile flag to CompileAssetsResponse, configurable at the plugin level. A sketch of the proposed shape follows.
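A sketch of the proposed shape (field names follow the issue's wording; the real plugin contract may differ):

package plugin

// CompileAssetsResponse with the proposed SkipCompile flag: a plugin that
// templates its own assets sets it so Optimus leaves the placeholders alone.
type CompileAssetsResponse struct {
	Assets      map[string]string
	SkipCompile bool
}

// compileEmailAssets shows a plugin opting out of Optimus-side compilation
// for its own Go-template-style variables.
func compileEmailAssets() CompileAssetsResponse {
	return CompileAssetsResponse{
		Assets: map[string]string{
			"body.tmpl": "Dear {{ .RECIPIENTS }},\nAttached in this email is the monthly report for {{ .MONTH }}",
		},
		SkipCompile: true,
	}
}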

Basic user authentication and authorisation

We can support basic authentication with minimal permission-based rule enforcement that can be read from a file on the Optimus server. The file could be stored locally, in a k8s config map, in GCS, etc., for the server to fetch.

[
  {
     "username": "foo",
     "password": "bar",
     "perms": ["*"]
  },
  {
     "username": "optimus",
     "password": "$2a$10$fKRHxrEuyDTP6tXIiDycr.nyC8Q7UMIfc31YMyXHDLgRDyhLK3VFS",
     "perms": ["deploy:t-data", "deploy:g-data"]
  },
  {
     "username": "prime",
     "password": "pass",
     "perms": ["deploy:*", "register:project", "register:secret"]
  }
]

Passwords can be cleartext or bcrypt hashes. Each permission is mapped as action:entity, and * is used as a wildcard for all. To avoid authentication for internal clients (Airflow docker images), we can break the Optimus API into two parts, public and internal, exposed on different ports; only the public part will be served to external users. A sketch of the credential and permission check follows.
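A sketch of the credential and permission check under those assumptions (helper names are illustrative), using golang.org/x/crypto/bcrypt for hashed passwords:

package auth

import (
	"strings"

	"golang.org/x/crypto/bcrypt"
)

// authorize accepts either a bcrypt hash or a cleartext stored password, then
// matches the requested action:entity against the user's permissions, where
// "*" is a full wildcard and "action:*" matches any entity for that action.
func authorize(storedPassword, givenPassword string, perms []string, action, entity string) bool {
	if bcrypt.CompareHashAndPassword([]byte(storedPassword), []byte(givenPassword)) != nil &&
		storedPassword != givenPassword {
		return false
	}
	want := action + ":" + entity
	for _, p := range perms {
		switch {
		case p == "*" || p == want:
			return true
		case strings.HasSuffix(p, ":*") && strings.HasPrefix(want, strings.TrimSuffix(p, "*")):
			return true
		}
	}
	return false
}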

From the CLI we can either:

  • Use a .netrc file to store user credentials.
  • Have users provide these credentials in their .optimus.yaml file while running the command (the auth method and username can only be configured in the file; the password will either be passed as a flag or asked for on stdin).

Refactor logger used in packages

The current implementation of the logger is very rough: a global variable is used across different packages. It should be properly injected from the top wherever it is needed, as in the sketch below.
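A minimal sketch of the injection pattern (the Logger interface and repository are illustrative, not the actual Optimus types):

package store

// Logger is whatever structured logger the caller wires in at startup.
type Logger interface {
	Info(msg string, args ...interface{})
	Error(msg string, args ...interface{})
}

// JobRepository receives its logger through the constructor instead of
// reaching for a package-level global.
type JobRepository struct {
	logger Logger
}

func NewJobRepository(l Logger) *JobRepository {
	return &JobRepository{logger: l}
}

func (r *JobRepository) Save(name string) {
	r.logger.Info("saving job spec", "name", name)
}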

Duplicate cross project dependencies should be handled gracefully

If a job inferred a cross-project dependency from its task and the same dependency is also mentioned in job.yaml specification, they are treated as duplicates. The reason is inferred dependencies when used in map for dependencies uses job name whereas cross-project dependency mentioned in specification uses project_name/job_name so duplicates are created inside the same dependency map.
For example:

...
dependencies:
- job: foo-project/bar-job
  type: inter

The map will contain two entries, bar-job and foo-project/bar-job.

Although users can choose to simply write the spec properly, the expected behaviour is for Optimus to handle this gracefully, e.g. by normalising map keys as in the sketch below.
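One possible normalisation, sketched with illustrative names: always key the dependency map by project/job so the inferred form and the explicitly specified form collide into a single entry:

package job

import "strings"

// canonicalKey qualifies a dependency name with the project that owns it when
// the name is not already written as project_name/job_name, so bar-job and
// foo-project/bar-job resolve to the same map key.
func canonicalKey(owningProject, jobName string) string {
	if strings.Contains(jobName, "/") {
		return jobName
	}
	return owningProject + "/" + jobName
}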

Two jobs with same destination will cause ambiguous dependency resolution

The current database model doesn't properly handle the case where two jobs within a project write to a single destination, which causes ambiguity during dependency resolution. The destination should also include the type of destination, not just the name, to handle a variety of destinations like buckets/databases/tables/etc.

User should be able to list all secrets.

A user should be provided an option to list all secrets within the project through the API and CLI; only digests should be shown, to protect the secrets.

Acceptance Criteria

  • All secrets, along with their digests, should be shown to the user when requested.
  • The operation should fail with relevant details when invalid/insufficient params are provided.
  • Documentation to be updated accordingly.
