raystack / optimus

Optimus is an easy-to-use, reliable, and performant workflow orchestrator for data transformation, data modeling, pipelines, and data quality management.

Home Page: https://raystack.github.io/optimus

License: Apache License 2.0

Languages: Dockerfile 0.02%, Makefile 0.20%, Go 98.18%, Python 1.49%, Shell 0.11%
Topics: airflow, etl, workflows, automation, golang, bigquery, data-warehouse, analytics, data-modelling, analytics-engineering

optimus's Introduction

Optimus


Optimus is an easy-to-use, reliable, and performant workflow orchestrator for data transformation, data modeling, pipelines, and data quality management. It enables data analysts and engineers to transform their data by writing simple SQL queries and YAML configuration while Optimus handles dependency management, scheduling and all other aspects of running transformation jobs at scale.

Key Features

Discover why users choose Optimus as their main data transformation tool.

  • Warehouse management: Optimus allows you to create and manage your data warehouse tables and views through YAML-based configuration.
  • Scheduling: Optimus provides an easy way to schedule your SQL transformations through a YAML-based configuration.
  • Automatic dependency resolution: Optimus parses your data transformation queries and builds the dependency graph automatically, instead of requiring users to define their source and target dependencies in DAGs.
  • Dry runs: Before a SQL query is scheduled for transformation, it is dry-run during deployment to make sure it passes basic sanity checks.
  • Powerful templating: Optimus provides compile-time query templating with variables, loops, if statements, macros, etc., allowing users to write complex transformation logic.
  • Cross tenant dependency: Optimus is a multi-tenant service; if two tenants, serviceA and serviceB, are registered, serviceB can write queries referencing serviceA as a source, and Optimus will handle this dependency as well.
  • Hooks: Optimus provides hooks for post-transformation logic, e.g. sinking BigQuery tables to Kafka.
  • Extensibility: Optimus supports Python transformations and allows writing custom plugins.
  • Workflows: Optimus provides industry-proven workflows using Git-based specification management, as well as REST/gRPC-based specification management, for data warehouse management.

Usage

Optimus has two components: the Optimus service, the core orchestrator installed on the server side, and a CLI binary used to interact with this service. You can install the Optimus CLI using Homebrew on macOS:

$ brew install raystack/tap/optimus
$ optimus --help

Optimus is an easy-to-use, reliable, and performant workflow orchestrator for
data transformation, data modeling, pipelines, and data quality management.

Usage:
  optimus [command]

Available Commands:
  backup      Backup a resource and its downstream
  completion  Generate the autocompletion script for the specified shell
  extension   Operate with extension
  help        Help about any command
  init        Interactively initialize Optimus client config
  job         Interact with schedulable Job
  migration   Command to do migration activity
  namespace   Commands that will let the user to operate on namespace
  playground  Play around with some Optimus features
  plugin      Manage plugins
  project     Commands that will let the user to operate on project
  resource    Interact with data resource
  secret      Manage secrets to be used in jobs
  scheduler   Scheduled/run job related functions
  serve       Starts optimus service
  version     Print the client version information

Flags:
  -h, --help       help for optimus
      --no-color   Disable colored output

Use "optimus [command] --help" for more information about a command.

Documentation

Explore the following resources to get started with Optimus:

  • Guides provides guidance on using Optimus.
  • Concepts describes all important Optimus concepts.
  • Reference contains details about configurations, metrics and other aspects of Optimus.
  • Contribute contains resources for anyone who wants to contribute to Optimus.

Running locally

Optimus requires the following dependencies:

  • Golang (version 1.16 or above)
  • Git

Run the following commands to compile Optimus from source:

$ git clone git@github.com:raystack/optimus.git
$ cd optimus
$ make

Use the following command to run it:

$ ./optimus version

The Optimus service can be started with:

$ ./optimus serve

The serve command has a few required configurations that need to be set before it can start. Read more about them in the getting started guide.

Compatibility

Optimus is currently undergoing heavy development with frequent, breaking API changes. The current major version is zero (v0.x.x) to accommodate rapid development and fast iteration while getting early feedback from users (feedback on APIs is appreciated). The public API could change without a major version update before the v1.0.0 release.

Contribute

Development of Optimus happens in the open on GitHub, and we are grateful to the community for contributing bugfixes and improvements. Read below to learn how you can take part in improving Optimus.

Read our contributing guide to learn about our development process, how to propose bugfixes and improvements, and how to build and test your changes to Optimus.

To help you get your feet wet and get you familiar with our contribution process, we have a list of good first issues that contain bugs which have a relatively limited scope. This is a great place to get started.

License

Optimus is Apache 2.0 licensed.

optimus's People

Contributors

anuraagbarde, arinda-arif, dependabot[bot], deryrahman, irainia, kushsharma, lollyxsrinand, mauliksoneji, mryashbhardwaj, novanxyz, ravisuhag, rootcss, sbchaos, scortier, siddhanta-rath, smarchint, sravankorumilli, sumitagrawal03071989, tharun1718333, vianhazman


optimus's Issues

Secrets can be used through macros in the job specs.

All user-created secrets at the namespace or project level can be referenced by users in the job spec and should be evaluated while fetching the assets and the spec.

Acceptance Criteria

  • When the secrets are referenced properly, they should be evaluated and returned by the registerInstance API.
  • When the secrets are not referenced properly, the API call should fail with a relevant message.
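As an illustration only (the exact macro syntax and compiler hooks in Optimus may differ), secret evaluation at asset-fetch time could look like this Go sketch, where a secrets map is applied to the asset with text/template and a badly referenced secret surfaces as an error:

package compiler

import (
	"bytes"
	"text/template"
)

// renderAsset is a sketch of secret macro evaluation while fetching assets:
// occurrences like {{ .secret.SOME_NAME }} are replaced from the project or
// namespace secret map. Names and syntax here are illustrative, not the
// actual Optimus implementation.
func renderAsset(asset string, secrets map[string]string) (string, error) {
	tmpl, err := template.New("asset").Option("missingkey=error").Parse(asset)
	if err != nil {
		return "", err
	}
	var out bytes.Buffer
	// A missing or badly referenced secret makes Execute fail, which maps to
	// the acceptance criterion of failing the API call with a relevant message.
	if err := tmpl.Execute(&out, map[string]interface{}{"secret": secrets}); err != nil {
		return "", err
	}
	return out.String(), nil
}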

Support for External Sensor for Optimus Jobs

Currently, Optimus supports sensors for job dependencies that are within or outside the project, as long as they are managed by the same Optimus server. It would be helpful if Optimus also supported sensors for jobs managed by a different Optimus deployment, since an organisation may run many deployments, and checking only for data availability does not always guarantee the completeness and correctness of data, which Optimus dependencies do guarantee.

Expectation:
The sensor checks the status of the jobs between the input window boundaries.

Configuration:

dependencies:
- job:
  type: external
  project:
  host:
  start_time:  # start time of the data that the job depends on
  end_time:    # end time of the data that the job depends on

The Optimus server that accepts the request checks, based on its window and schedule configuration, for all the jobs that output data for the given window.

This has the challenge of breaking dependencies when a job name changes.

Broken Link

The link to the contributing guide in line 101 of Readme.md is broken.
The line:
Read our [contributing guide](docs/contribute/contribution.md) to learn about our development process, how to propose bugfixes and improvements, and how to build and test your changes to Optimus.

Show recent replay list of a project

Currently, the available Replay subcommands are run, to start a replay, and status, to get the status of one replay using its ID. However, Optimus should also be able to list the recent replays of a project, so that users can check the latest replay requests, including the ID, which job and date range is being replayed, when the replay happened, and its status. A list subcommand will be added to accommodate this.

Example: optimus replay list --project [project_name]

Delete call of job specification doesn't work

Sending a REST call to delete a job specification returns 404, whereas the gRPC call works fine. Steps to reproduce:

curl -X DELETE "http://localhost:9100/v1/project/my-project/namespace/kush/helloworld" -H  "accept: application/json"

Optimus command to start a program with context envs injected by default

When a plugin docker container executes and requires configuration/asset files as input, it can either use gRPC/REST calls to fetch them from Optimus or use the existing command optimus admin build instance, which writes them to the local filesystem.

What if we had a run command instead that injects these configuration variables as environment variables by default? Something like optimus admin run instance python3 main.py --some-arg, where python3 main.py --some-arg will vary based on each plugin.

Create backup resource run

  • Add the command
  • Generate the job spec from the destination (already available, need to test)
  • Resolve the dependencies of the job (just use the resolver)
  • Response of list of tables to be backed up
    • As part of the response, we can highlight which of the tables will be backed up and which will not (if the user chose to ignore downstream backups)
  • Prompt for confirmation to proceed with the backup
  • Decide if the job can be backed up, and which datastore and destination to use
  • Backup the resource
    • Bigquery
      • Table
      • View
      • External table (to be verified)

Refactor optimus plugins to have a base plugin interface

Currently, each plugin type has its own interface with a set of functions in it. We should break this down into multiple interfaces, such as a BaseInterface implemented by all plugins, a CLIInterface for plugins that want to be exposed via CLI questions/answers, etc. A minimal sketch is shown below.
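A minimal sketch of the proposed split, with illustrative names rather than the actual Optimus plugin API:

package plugin

// BasePlugin is what every plugin would implement.
type BasePlugin interface {
	Name() string
	Description() string
}

// CLIPlugin is implemented only by plugins that want to be surfaced in the
// CLI's questions/answers flow; other capability-specific interfaces could
// follow the same pattern.
type CLIPlugin interface {
	BasePlugin
	Questions() []string
}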

Support for RateLimits for replay such that the scheduled jobs are not impacted.

Is your feature request related to a problem? Please describe.
The Replay feature is handy, but it can also be a problem when it disrupts regular scheduled runs by hitting the worker pool limits or the limits of the underlying datastore.
It would be nice to have a feature to limit the number of jobs running due to replay.

Describe the solution you'd like
A config that can be set at the project or namespace level to rate-limit the number of active job runs due to replay, considering all the downstream runs. An option to force this validation should also be available, mainly for admin usage when replaying a higher-priority job.

Describe alternatives you've considered
Currently, the design has no mechanism to identify whether a job run was triggered by a replay or by the regular schedule, so there is no way to configure such jobs differently and apply limits at the underlying datastore or any downstream level.
Even assigning all the replay requests to a different pool at the scheduler level is not an option right now. For all of that, a custom scheduler with state management handled by Optimus is the solution, but that would be a big change.

Additional context
With secret management, Optimus will have the flexibility to configure different service accounts for different jobs, so one can assign service accounts accordingly such that resources are used properly across jobs.

Optimus should not fail on startup because of bad plugins

When loading a plugin after the discovery phase, at the first gRPC client handshake between Optimus core and the plugin server, don't terminate the application on error; instead, simply log a warning and continue, treating the binary as an invalid plugin. A sketch of this behaviour follows.
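A minimal sketch of the intended behaviour, assuming a hypothetical connect function that wraps the gRPC handshake:

package plugin

import "log"

// loadAll attempts the handshake for every discovered binary; on failure it
// logs a warning and skips that binary instead of terminating the server.
func loadAll(binaries []string, connect func(string) error) []string {
	var loaded []string
	for _, bin := range binaries {
		if err := connect(bin); err != nil {
			log.Printf("warning: skipping invalid plugin %s: %v", bin, err)
			continue
		}
		loaded = append(loaded, bin)
	}
	return loaded
}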

Option to ignore inferred dependencies

Optimus supports figuring out dependencies automatically by parsing task assets. This dependency-finding logic is implemented per task. Users can also choose not to depend on the task's inferred dependencies and pass them explicitly in the job.yaml specification. We need a way to explicitly ignore the task's automatically inferred dependencies.

We can use the existing specification file to add a non-breaking change as follows:

name: job1
dependencies:
- job: hello_world1
- job: hello_world3
- ignore_job: hello_world2
- ignore_job: hello_world4

In this case, if the task used in this job somehow infers hello_world2 as one of its upstreams, we will choose not to treat it as an upstream dependency. Similarly, if the inference logic does not find hello_world4 as an upstream, nothing happens and no error should be thrown. A filtering sketch is shown below.
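A sketch of how the ignore list could be applied to inferred dependencies (function and field names are illustrative, not the actual Optimus code):

package job

// filterIgnored drops inferred dependencies listed under ignore_job; names in
// the ignore list that were never inferred are silently skipped, so entries
// like hello_world4 do not raise an error.
func filterIgnored(inferred []string, ignored map[string]bool) []string {
	var kept []string
	for _, dep := range inferred {
		if ignored[dep] {
			continue
		}
		kept = append(kept, dep)
	}
	return kept
}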

Create backup resource dry run

  • Add backup command
  • Generate the job spec using datastore destination (already available, need to test)
  • Resolve the dependencies of the job (use the resolver)
  • Response of list of tables to be backed up

Implement google sheets external data table type for bigquery datastore

Enable Google Sheets external table management for the BigQuery datastore via Optimus.

User should be able to:

  • Create a google sheets external table by specifying sheets URI
  • Define schema for the google sheets external table
  • Use metadata management feature such as Labels

The implementation should be able to extend other BigQuery supported external data sources for future development.
About BigQuery external tables: https://cloud.google.com/bigquery/docs/external-tables

Support for opentelemetry metrics

Currently, no stats/metrics/traces are being pushed by the Optimus service. It should support emitting basic stats like CPU/memory/GC usage, time taken to complete gRPC calls, etc.

Using integer type in job spec configs causes panic

Currently, the job spec YAML configuration only supports string key-value pairs. Having string key-value pairs is fine, but passing an int (or any other type) should be handled gracefully instead of causing a panic.

panic: interface conversion: interface {} is int, not string
goroutine 1 [running]:
github.com/odpf/optimus/store/local.JobSpecAdapter.ToSpec
....
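A sketch of graceful handling (not the actual JobSpecAdapter code): scalar YAML values are stringified, and only genuinely unsupported types return an error instead of panicking:

package local

import "fmt"

// toConfigString converts a decoded YAML config value to a string without a
// hard type assertion, so an int (or bool, float) no longer panics.
func toConfigString(v interface{}) (string, error) {
	switch val := v.(type) {
	case string:
		return val, nil
	case int, int64, float64, bool:
		return fmt.Sprintf("%v", val), nil
	default:
		return "", fmt.Errorf("unsupported config value type %T", v)
	}
}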

Backup & Replay Improvements.

The table name of the backup result should preferably have a timestamp suffix instead of a UUID.

To be considered:
  • the limit on table name length
  • a separator in the timestamp to keep it readable
  • the timestamp should equal the backup time and be used as the suffix for all downstream tables

The backup list response can include extra useful information:
  • high-level information (from the user-request point of view)
  • the ignore-downstream choice should be added
  • TTL (not the expiry time)

Add a Backup details subcommand to show the list of all tables backed up.

User should be able to create/update a secret through apis & cli.

A user, through APIs or the CLI, should be able to create/update secrets so that they can reference the secrets and use them in Optimus jobs.

The user should be provided an option to create a secret at the project or namespace level.

Acceptance Criteria

  • gRPC endpoints to create/update a secret, accepting base64-encoded values.
  • CLI to create/update secrets, accepting both base64 and plain text.
  • Update documentation
  • Secrets to be encrypted securely.

Support Replay and Backup for multiple namespaces project

Users should be able to do backup and replay for downstream jobs in a different namespace, as long as they are authorized to do so.

  • Should be able to accept allowed_downstream with possible values * (all namespaces) or empty (only the requested namespace); applied to both replay and backup.
  • Should be able to accept ignore_downstream in Replay.

Support custom date range generation via SQL query

Right now in Optimus, the date range is generated by this window config in the task section of the job.yaml file:

window:
    size: 24h
    offset: 24h
    truncate_to: d

In some of our use cases, we need to generate a custom date range based on certain conditions in the form of a SQL query. For example:

SELECT DISTINCT DATE(event_date) as data_date
FROM some_table
WHERE (event_date >= start_date and event_date < end_date)

UNION DISTINCT

SELECT DISTINCT DATE(created_date) as data_date
FROM some_table_2
WHERE ((created_date >= start_date AND created_date < end_date)  
OR (last_modified_date >= start_date AND last_modified_date < end_date))
ORDER BY 1

The date range generated by the above query will then be used as parameters for the job. A sketch of this flow is shown below.
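A sketch of that flow, assuming the query returns a single data_date column and using database/sql generically (the driver, query, and how the dates feed job runs are placeholders, not part of Optimus today):

package main

import (
	"database/sql"
	"time"
)

// datesFromQuery runs a user-supplied date-range query with the window's
// start and end as parameters and returns each data_date, to be used as a
// run parameter for the job.
func datesFromQuery(db *sql.DB, query string, start, end time.Time) ([]time.Time, error) {
	rows, err := db.Query(query, start, end)
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	var dates []time.Time
	for rows.Next() {
		var d time.Time
		if err := rows.Scan(&d); err != nil {
			return nil, err
		}
		dates = append(dates, d)
	}
	return dates, rows.Err()
}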

Provide a provision to configure resources for Jobs

Is your feature request related to a problem? Please describe.
Pods should have defaults configured but should also allow per-job configuration; currently there is no provision for configuring CPU and memory for jobs. Provide a mechanism to configure resources for jobs.


User should be able to delete a secret.

User should be able to delete a secret which is no longer used in any job specs.

Acceptance Criteria

  • If the secret is referenced in any job spec, the delete should fail and list all the jobs where it is referenced.
  • If the secret is not referenced, the deletion should succeed.
  • If the secret doesn't exist, show a clear message that the secret doesn't exist.

Allow plugins to skip assets compilation with Go template

Some plugins might have their own asset compilation method, for example using Go templates or Jinja, which use their own set of variables. For example, an email-sending plugin might have various kinds of variables:

Dear {{ .RECIPIENTS }},
Attached in this email is the monthly report for {{ .MONTH }}

Based on the discussion with @kushsharma, the proposed workaround is to add a SkipCompile flag to CompileAssetsResponse, configurable at the plugin level. A sketch of the proposed shape follows.
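A sketch of the proposed shape (field names follow the issue's wording; the real plugin contract may differ):

package plugin

// CompileAssetsResponse with the proposed SkipCompile flag: a plugin that
// templates its own assets sets it so Optimus leaves the placeholders alone.
type CompileAssetsResponse struct {
	Assets      map[string]string
	SkipCompile bool
}

// compileEmailAssets shows a plugin opting out of Optimus-side compilation
// for its own Go-template-style variables.
func compileEmailAssets() CompileAssetsResponse {
	return CompileAssetsResponse{
		Assets: map[string]string{
			"body.tmpl": "Dear {{ .RECIPIENTS }},\nAttached in this email is the monthly report for {{ .MONTH }}",
		},
		SkipCompile: true,
	}
}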

Basic user authentication and authorisation

We can support basic authentication with minimal permission-based rule enforcement that can be read from a file on the Optimus server. The file could be stored locally, in a k8s config map, in GCS, etc., for the server to fetch.

[
  {
     "username": "foo",
     "password": "bar",
     "perms": ["*"]
  },
  {
     "username": "optimus",
     "password": "$2a$10$fKRHxrEuyDTP6tXIiDycr.nyC8Q7UMIfc31YMyXHDLgRDyhLK3VFS",
     "perms": ["deploy:t-data", "deploy:g-data"]
  },
  {
     "username": "prime",
     "password": "pass",
     "perms": ["deploy:*", "register:project", "register:secret"]
  }
]

Passwords can be cleartext or bcrypt hashes. Each permission is mapped as action:entity, and * is used as a wildcard for all. To avoid authentication for internal clients (Airflow docker images), we can break the Optimus API into two parts, public and internal, exposed on different ports; only the public part will be served to external users. A sketch of the credential and permission check follows.
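A sketch of the credential and permission check under those assumptions (helper names are illustrative), using golang.org/x/crypto/bcrypt for hashed passwords:

package auth

import (
	"strings"

	"golang.org/x/crypto/bcrypt"
)

// authorize accepts either a bcrypt hash or a cleartext stored password, then
// matches the requested action:entity against the user's permissions, where
// "*" is a full wildcard and "action:*" matches any entity for that action.
func authorize(storedPassword, givenPassword string, perms []string, action, entity string) bool {
	if bcrypt.CompareHashAndPassword([]byte(storedPassword), []byte(givenPassword)) != nil &&
		storedPassword != givenPassword {
		return false
	}
	want := action + ":" + entity
	for _, p := range perms {
		switch {
		case p == "*" || p == want:
			return true
		case strings.HasSuffix(p, ":*") && strings.HasPrefix(want, strings.TrimSuffix(p, "*")):
			return true
		}
	}
	return false
}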

From the CLI we can either:

  • Use a .netrc file to store user credentials.
  • Have users provide these credentials in their .optimus.yaml file while running the command (the auth method and username can only be configured in the file; the password will either be passed as a flag or asked for on stdin).

Refactor logger used in packages

The current implementation of the logger is very rough: a global variable is used across different packages. It should be properly injected from the top wherever it is needed, as in the sketch below.
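A minimal sketch of the injection pattern (the Logger interface and repository are illustrative, not the actual Optimus types):

package store

// Logger is whatever structured logger the caller wires in at startup.
type Logger interface {
	Info(msg string, args ...interface{})
	Error(msg string, args ...interface{})
}

// JobRepository receives its logger through the constructor instead of
// reaching for a package-level global.
type JobRepository struct {
	logger Logger
}

func NewJobRepository(l Logger) *JobRepository {
	return &JobRepository{logger: l}
}

func (r *JobRepository) Save(name string) {
	r.logger.Info("saving job spec", "name", name)
}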

Duplicate cross project dependencies should be handled gracefully

If a job inferred a cross-project dependency from its task and the same dependency is also mentioned in job.yaml specification, they are treated as duplicates. The reason is inferred dependencies when used in map for dependencies uses job name whereas cross-project dependency mentioned in specification uses project_name/job_name so duplicates are created inside the same dependency map.
For example:

...
dependencies:
- job: foo-project/bar-job
  type: inter

The map will contain two entries, bar-job and foo-project/bar-job.

Although users can choose to simply write the spec properly, the expected behaviour is for Optimus to handle this gracefully, e.g. by normalising map keys as in the sketch below.
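One possible normalisation, sketched with illustrative names: always key the dependency map by project/job so the inferred form and the explicitly specified form collide into a single entry:

package job

import "strings"

// canonicalKey qualifies a dependency name with the project that owns it when
// the name is not already written as project_name/job_name, so bar-job and
// foo-project/bar-job resolve to the same map key.
func canonicalKey(owningProject, jobName string) string {
	if strings.Contains(jobName, "/") {
		return jobName
	}
	return owningProject + "/" + jobName
}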

Two jobs with same destination will cause ambiguous dependency resolution

The current database model doesn't properly handle the case where two jobs within a project write to a single destination, which causes ambiguity during dependency resolution. The destination should also include the type of destination, not just the name, to handle a variety of destinations like buckets/databases/tables/etc.

User should be able to list all secrets.

A user should be provided an option to list all secrets within the project through the API and CLI; only digests should be shown, to protect the secrets.

Acceptance Criteria

  • All secrets, along with their digests, should be shown to the user when requested.
  • The operation should fail with relevant details when invalid/insufficient params are provided.
  • Documentation to be updated accordingly.
