
nicholasyager / dbt-loom


A dbt-core plugin to weave together multi-project dbt-core deployments

License: The Unlicense

Python 100.00%
dbt dbt-core plugin python3

dbt-loom's People

Contributors

alangner, bl3f, cedric-orange, nawfel-bacha, nicholasyager, sayeaud-accelins, sdaylor

dbt-loom's Issues

Log the dbt-loom version during execution

Is your feature request related to a problem? Please describe.
While debugging or triaging issues for the community, I don't always know which version of dbt-loom is in use. This can make it challenging to pin down the exact issue, or to confirm if newer versions of dbt-loom have resolved the issue already.

Describe the solution you'd like
I'd like dbt-loom to print the current version much like adapters and dbt-core itself. For example:

08:25:16  Running with dbt=1.7.14
08:25:17  dbt-loom: Registering plugin: dbt-loom=0.5.2
08:25:17  dbt-loom: Patching ref protection methods to support dbt-loom dependencies.
08:25:17  dbt-loom: Loading manifest for `project_a` from `file`
08:25:17  Registered adapter: bigquery=1.7.7
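A minimal sketch of how such a version line might be produced, assuming the plugin is installed under the distribution name `dbt-loom`. The helper names `get_plugin_version` and `registration_message` are hypothetical and not part of dbt-loom; only `importlib.metadata` is a real standard-library API here.

```python
from importlib.metadata import PackageNotFoundError, version


def get_plugin_version(distribution: str = "dbt-loom") -> str:
    """Return the installed version of a distribution, or "unknown"."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        # Package metadata is unavailable, e.g. when running from a source checkout.
        return "unknown"


def registration_message(distribution: str, plugin_version: str) -> str:
    """Format a log line in the style requested above (the format is an assumption)."""
    return f"dbt-loom: Registering plugin: {distribution}={plugin_version}"


if __name__ == "__main__":
    print(registration_message("dbt-loom", get_plugin_version()))
```

Falling back to "unknown" keeps the log line harmless when metadata cannot be read, rather than failing the dbt invocation over a diagnostic message.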

Describe alternatives you've considered

  • Do nothing.
  • Find ways of adding this to artifacts or metadata.

Additional context
Add any other context or screenshots about the feature request here.

Support dbt Cloud as a manifest source

This is particularly useful to enable hybrid development in multi-cloud deployments. For example, this would enable dbt Cloud to use its own dbt Labs supported injection mechanism for orchestration and their cloud IDE, and dbt-loom would support people running dbt-core locally, backed by the same up-to-date manifests.

IMHO, this is required for adequate community adoption of multi-project deployments without vendor lock-in.

Show the group when a private model is referenced

Is your feature request related to a problem? Please describe.
PR 40 shows a better error when accessing a private or protected model of the upstream project; however, the group is missing from the message shown.

Node model.great_bay.test_balboa_models attempted to reference node model.balboa.us_population, which is not allowed because the referenced node is private to the '' group.

Describe the solution you'd like
Ideally the group of the upstream model is shown in the error like:

Node model.great_bay.test_balboa_models attempted to reference node model.balboa.us_population, which is not allowed because the referenced node is private to the 'marketing' group.

Describe alternatives you've considered
None

Additional context
None

Support postgres cross-database project dependencies

I am proposing a feature for dbt-loom to work with multiple projects where each has its own Postgres database, or any database built on PG, such as Hydra DB.

My use case is that I am building a Data Product segregated data mesh. Each product has its own PG database, usually hosted on the same PG server, but that could change based on scale. Each Data Product has its own dbt-core project. Rather than replicate all the models using extract-load between the databases, it makes more sense to be able to read (read-only) the data from each database. I faced some challenges getting dbt-loom working with this, but I did get it working. Thanks! dbt-loom is a really useful solution.

Let's say I have a data product "calendar" and one called "workday". Workday refers to Calendar models. dbt-loom successfully generated the following SQL:

select * from "calendar"."product"."days"

However, in PG this gives an error on execution as PG does not allow cross db joins.

postgres cross-database references are not implemented

So the solution I came up with involved

  1. some minor "tweaks" to __init__.py
  2. using the PG IMPORT FOREIGN SCHEMA to create a set of alias-style tables. For example, inside the "workdays" db (this is an abbreviated version):

`(create extension and grant usage - removed for brevity)

CREATE SERVER IF NOT EXISTS foreign_calendar
FOREIGN DATA WRAPPER postgres_fdw
OPTIONS (host 'localhost', port '5432', dbname 'calendar');

(also create user mapping - removed for brevity)

create schema if not exists calendar;

CREATE FOREIGN TABLE calendar.days (
id text
, col1 date
, col2 numeric
, col3 text
, col4 text
)
SERVER foreign_calendar
OPTIONS (schema_name 'product', table_name 'days');

-- this now works to read the foreign database and table
select * from "workhours"."calendar"."days"
`

Describe the solution you'd like
Obviously the CREATE FOREIGN TABLE scaffolding could be automated, for each PUBLIC dbt model - and even the CREATE SERVER etc could be run each initialisation - and pulled from the connection details of the project.

Each time a model changes and the table DDL changes, you need to add/update/remove columns from the FOREIGN TABLE which could be automated.
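As a rough illustration of the automation described above, the sketch below renders a CREATE FOREIGN TABLE statement from a list of column names and types. The function names and the idea of feeding it columns taken from a model's metadata are assumptions made for this example; nothing here is existing dbt-loom behaviour.

```python
from typing import Sequence, Tuple


def quote_ident(name: str) -> str:
    """Quote a PostgreSQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def quote_literal(value: str) -> str:
    """Quote a PostgreSQL string literal, doubling any embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def foreign_table_ddl(
    local_schema: str,
    table: str,
    columns: Sequence[Tuple[str, str]],
    server: str,
    remote_schema: str,
) -> str:
    """Render CREATE FOREIGN TABLE for one public model.

    `columns` holds (column_name, data_type) pairs. Data types are inserted
    verbatim, so they must come from a trusted source.
    """
    column_lines = ",\n".join(
        f"    {quote_ident(name)} {data_type}" for name, data_type in columns
    )
    return (
        "CREATE FOREIGN TABLE IF NOT EXISTS "
        f"{quote_ident(local_schema)}.{quote_ident(table)} (\n"
        f"{column_lines}\n"
        ")\n"
        f"SERVER {quote_ident(server)}\n"
        f"OPTIONS (schema_name {quote_literal(remote_schema)}, "
        f"table_name {quote_literal(table)});"
    )


if __name__ == "__main__":
    print(
        foreign_table_ddl(
            "calendar",
            "days",
            [("id", "text"), ("col1", "date"), ("col2", "numeric")],
            "foreign_calendar",
            "product",
        )
    )
```

Because IF NOT EXISTS does not alter an existing table, handling column changes would still need a drop-and-recreate or ALTER step, which this sketch leaves out.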

Describe alternatives you've considered
It is workable to have to write the above code for every model, but not ideal.

Additional context
I would like feedback on whether this makes sense as a feature of dbt-loom or should be implemented as a separate plugin. I am new to dbt, but not new to SQL and programming. I may be able to attempt this, but any starting pointers or code reviews would be most welcome if it's a wanted feature.

Protected seeds in upstream project breaking dbt runs in downstream project

Describe the bug
Running any dbt command that requires compiling from Project B using the manifest from Project A throws the following error:

08:25:16  Running with dbt=1.7.14
08:25:17  dbt-loom: Patching ref protection methods to support dbt-loom dependencies.
08:25:17  dbt-loom: Loading manifest for `project_a` from `file`
08:25:17  Registered adapter: bigquery=1.7.7
08:25:27  dbt-loom: Injecting nodes
08:25:28  Encountered an error:
Compilation Error
  'model.project_a.stg_my_seed_file' depends on 'seed.project_a.seed_my_seed_file' which is not in the graph!

This error refers to a seed and model that are not upstream of the model I was trying to run. In Project A, there are no errors when running or compiling.

I tried running with --no-partial-parse as well as running dbt clean before the commands to no avail.
Deleting the seed and related model only gave the same error with another seed and model pair.

I have looked at the manifest.json from Project A and I can see the seed node. However, in the manifest generated from the run in Project B, the seed only appears as a dependency of the staging model; it does not seem to be injected as a node itself.
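The symptom described here, a node whose parent is absent from the graph, can be checked for directly. Below is a small sketch that scans a mapping of nodes shaped like the `nodes` section of a manifest.json (each with a `depends_on.nodes` list) and reports dangling parents. The function name is hypothetical, and this is a diagnostic aid only, not how dbt-loom or dbt-core performs the check.

```python
from typing import Dict, List, Mapping


def find_missing_dependencies(nodes: Mapping[str, Mapping]) -> Dict[str, List[str]]:
    """Map each node's unique_id to the parents that are absent from `nodes`.

    Note: in a real manifest, sources live under a separate "sources" key, so
    pass a merged mapping if source parents should not be reported as missing.
    """
    missing: Dict[str, List[str]] = {}
    for unique_id, node in nodes.items():
        parents = node.get("depends_on", {}).get("nodes", [])
        absent = [parent for parent in parents if parent not in nodes]
        if absent:
            missing[unique_id] = absent
    return missing


if __name__ == "__main__":
    injected = {
        "model.project_a.stg_my_seed_file": {
            "depends_on": {"nodes": ["seed.project_a.seed_my_seed_file"]}
        },
    }
    print(find_missing_dependencies(injected))
```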

To Reproduce
Steps to reproduce the behavior:

  1. Project A has protected seeds that are upstream of some other protected models
  2. Project A has a public model that is downstream of one or more seeds.
  3. Project A compiles with no errors
  4. Project B references a public model from the upstream project
  5. Run any dbt command that requires compiling (e.g. dbt run)
  6. Get the error above

Expected behavior
Project B should also compile without any errors.

  • Ubuntu 20.04
  • dbt-loom v0.5.1/v0.5.2 (error with both)
  • dbt-core v1.7.14/v1.6.9 (error with both)
  • dbt-bigquery v1.7.7/v1.6.9 (error with both)
  • python v3.10.12

Additional context
Thanks to smilingthax (not tagging, to avoid spamming emails) for pointing out that this seed error, which I mentioned in another issue, had already been discussed in a separate issue (which I missed when researching).

I haven't had time to rebuild this with a toy project yet. If that is needed, I will take some time later in the week, as I have been pushing at work to get dbt-loom working, so I am invested, hah. Thanks for all the efforts and for creating and maintaining this amazing project!!

Discussion: Generate API DBT project instead of ingesting models parsed from manifest

The current solution requires manifest.json to be present before upstream projects can be compiled. In some cases this is not convenient.
Instead, I think it could be useful to generate an API package from an existing dbt project; that package could then be used as a simple dbt package.

Some details of the idea:

  1. Public models in core project are exposed as sources
  2. Ephemeral models are created in core_api project which just do select * from {{ source('core', 'public_model_name') }}
  3. If public core model is versioned - several SQL files can be created to reflect model versions like select * from {{ source('core', 'public_model_name_v2') }}
  4. dbt_project.yml is copied from core to core_api with all model configurations excluded except public models and of course name changed from core to core_api.
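To make steps 2 and 3 concrete, here is a hedged sketch that renders the SQL body of one generated ephemeral model and derives the per-version source names. The function names and the `_v<version>` naming scheme follow the examples in the list above; they are illustrative assumptions, not an existing tool.

```python
from typing import Iterable, List


def render_api_model(source_name: str, table_name: str) -> str:
    """Return the SQL for an ephemeral pass-through model over a source."""
    # Plain concatenation avoids having to escape Jinja braces in an f-string.
    return (
        "{{ config(materialized='ephemeral') }}\n"
        "select * from {{ source('" + source_name + "', '" + table_name + "') }}\n"
    )


def versioned_table_names(model_name: str, versions: Iterable[int]) -> List[str]:
    """Return one source table name per model version, e.g. public_model_name_v2."""
    return [f"{model_name}_v{version}" for version in versions]


if __name__ == "__main__":
    for table in versioned_table_names("public_model_name", [1, 2]):
        print(render_api_model("core", table))
```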

Do you think it can work? Do you see any issues with that?

[Feature] Create new configuration abstraction for a Project

Is your feature request related to a problem? Please describe.
Currently, dbt-loom does not have a good way to leverage a dbt project's restrict-access configuration. This config is important for limiting cross-project access to protected and private nodes. Since this is an entirely new file to load, we do not currently have a defined path for loading this information.

flowchart LR
  
  subgraph project_a
    dbt_project.yml
    manifest.json
  end

  subgraph project_b
    project_b_project[dbt_project.yml]
    project_b_manifest[manifest.json]
    dbt_loom.config.yml

    subgraph dbt-loom

      ManifestLoader
    
      Plugin 

    end
  end

  
  manifest.json --> ManifestLoader

  
  ManifestLoader --> Plugin

  dbt_loom.config.yml --> Plugin


  dbt-core

  Plugin --> dbt-core
  project_b_project --> dbt-core
  project_b_manifest --> dbt-core

Describe the solution you'd like
I'd like to be able to define a Project, and with this project a location for its dbt_project.yml file and an associated manifest.json file. This should support all of our existing artifact sources where possible.

flowchart LR
  
  subgraph project_a
    dbt_project.yml
    manifest.json
  end

  subgraph project_b
    project_b_project[dbt_project.yml]
    project_b_manifest[manifest.json]
    dbt_loom.config.yml

    subgraph dbt-loom

      ManifestLoader
      ProjectLoader
      Plugin 

    end
  end

  dbt_project.yml --> ProjectLoader
  manifest.json --> ManifestLoader

  ProjectLoader --> Plugin
  ManifestLoader --> Plugin

  dbt_loom.config.yml --> Plugin


  dbt-core

  Plugin --> dbt-core
  project_b_project --> dbt-core
  project_b_manifest --> dbt-core

To configure this, we can introduce a new optional top-level concept of a Project.

dependencies:

  - name: core
    description: All common core dependencies across our `n` base projects.
    artifacts:
      - type: s3
        config:
          bucket_name: com.example.dbt_artifacts
          object_prefix: latest/

  - name: revenue
    description: A proof-of-concept local-only revenue reporting project.
    artifacts:
      - type: file
        config:
          path: path/to/manifest.json
      - type: file
        config:
          path: path/to/dbt_project.yml
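One possible in-memory shape for this proposed configuration is sketched below using standard-library dataclasses. It assumes the YAML has already been loaded into a plain dict by some YAML parser; the class and function names (`ArtifactReference`, `ProjectDependency`, `parse_dependencies`) are hypothetical and do not reflect dbt-loom's actual configuration classes.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Mapping


@dataclass
class ArtifactReference:
    """One artifact location, e.g. a manifest.json or a dbt_project.yml."""

    type: str
    config: Dict[str, Any]


@dataclass
class ProjectDependency:
    """A named upstream project and the artifacts that describe it."""

    name: str
    description: str = ""
    artifacts: List[ArtifactReference] = field(default_factory=list)


def parse_dependencies(raw: Mapping[str, Any]) -> List[ProjectDependency]:
    """Build ProjectDependency objects from an already-loaded config mapping."""
    projects = []
    for entry in raw.get("dependencies", []):
        artifacts = [
            ArtifactReference(type=item["type"], config=dict(item.get("config", {})))
            for item in entry.get("artifacts", [])
        ]
        projects.append(
            ProjectDependency(
                name=entry["name"],
                description=entry.get("description", ""),
                artifacts=artifacts,
            )
        )
    return projects
```

Keeping `artifacts` as a list leaves room for more than one file per project, which is what allows a dbt_project.yml to sit alongside the manifest.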
   

Describe alternatives you've considered

  • Have restrict-access configured in the dbt_loom.config.yml file instead.
  • Don't support restrict-access at all.

Additional context

  • It would be ideal to keep the door open for bulk loading of artifacts, as described in #31

Make node resolution environment aware

Is your feature request related to a problem? Please describe.
Hi folks,
Thanks for the great work. I wonder whether it is possible to make the node resolution process environment (target) aware?

In our use case, lower environments (targets) like dev and ci use one set of Snowflake tables for defer, and we can put that in the manifest, but for higher environments (targets) like prod the same model should resolve to a different fully qualified name. To rephrase the question: can we hook the generate_schema_name and generate_database_name macros into node resolution?
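A minimal sketch of what target-aware resolution could look like, assuming a per-target override table maintained outside the manifest. The `overrides` structure and the `resolve_relation` function are hypothetical; this does not call dbt's generate_schema_name or generate_database_name macros, it only illustrates the lookup being asked for.

```python
from typing import Mapping


def resolve_relation(
    node: Mapping[str, str],
    target_name: str,
    overrides: Mapping[str, Mapping[str, str]],
) -> str:
    """Return database.schema.identifier for a node under the given target.

    `node` uses manifest-style keys: database, schema, name, and optional alias.
    `overrides` maps a target name to replacement database and/or schema values.
    """
    override = overrides.get(target_name, {})
    database = override.get("database", node["database"])
    schema = override.get("schema", node["schema"])
    identifier = node.get("alias") or node["name"]
    return f"{database}.{schema}.{identifier}"


if __name__ == "__main__":
    node = {"database": "analytics_dev", "schema": "core", "name": "orders"}
    overrides = {"prod": {"database": "analytics"}}
    print(resolve_relation(node, "dev", overrides))
    print(resolve_relation(node, "prod", overrides))
```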

Describe alternatives you've considered
I am not sure there is a solution other than maintaining N different manifests, one for each environment, which does not look optimal and is potentially error prone.

Additional context
I'm ready to contribute to the solution with enough guidance provided.

error when using deprecation_date at version level

I created a model in my upstream project with a deprecation_date at the model level, generated the manifest.json, and everything works fine in my downstream project.
If I move the deprecation_date to the version level, parsing of the downstream project breaks with this error:

  File "/home/daniele/.cache/pypoetry/virtualenvs/credem-poc-pkg-aG8EzHFC-py3.10/lib/python3.10/site-packages/dbt/parser/manifest.py", line 579, in check_for_model_deprecations
    if resolved_ref.deprecation_date < datetime.datetime.now().astimezone():
TypeError: '<' not supported between instances of 'str' and 'datetime.datetime'

The issue arises from dbt-core, but if I move all the models and configs into one single project, the error disappears, so I guess it's somehow related to dbt-loom node injection.
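The TypeError in the traceback says a string reached a comparison that expects a datetime. The sketch below shows one way such a value could be coerced before comparison; `coerce_deprecation_date` is a hypothetical helper, and treating naive values as UTC is an assumption made for this example, not confirmed dbt-core or dbt-loom behaviour.

```python
import datetime
from typing import Optional, Union


def coerce_deprecation_date(
    value: Union[str, datetime.datetime, None],
) -> Optional[datetime.datetime]:
    """Return a timezone-aware datetime for an ISO 8601 string or datetime."""
    if value is None:
        return None
    if isinstance(value, str):
        # Accepts forms such as '2023-10-30' and '2024-01-01T00:00:00+00:00'.
        value = datetime.datetime.fromisoformat(value)
    if value.tzinfo is None:
        # Assumption: naive values are interpreted as UTC.
        value = value.replace(tzinfo=datetime.timezone.utc)
    return value


if __name__ == "__main__":
    deprecation_date = coerce_deprecation_date("2023-10-30")
    # Comparing two aware datetimes no longer raises TypeError.
    print(deprecation_date < datetime.datetime.now().astimezone())
```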

This is a sample yml config for the upstream project:

version: 2

models:
  - name: GESTORE
    access: public
    latest_version: 2

    config:
      contract:
        enforced: true

    columns:
      - name: customer_id
        data_type: string
        constraints:
          - type: not_null
          - type: primary_key
      - name: customer_name
        data_type: string

    versions:
      - v: 1
        deprecation_date: '2023-10-30'

      - v: 2 
        columns:
          - include: all
            exclude: [customer_name]
          - name: customer_desc
            data_type: string

Add logging to the plugin

As a user of the plugin, it would be useful to have logging functionality. This would make it easier to track which projects are being injected and which models are being referenced.

Enable errors for Private and Protected nodes

Is your feature request related to a problem? Please describe.
Right now, upstream producer private and protected nodes are not included in the consumer project. When a user tries to access one of these nodes, they are given an error that the node does not exist. It would be a better UX to show an error like

Parsing Error
  Node model.great_bay.test_us_population attempted to reference node model.balboa.us_population, which is not allowed because the referenced node is private to the marketing group.
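A rough sketch of such a check, under the assumption that upstream node metadata (unique_id, access, group, package_name) is kept available even for nodes that are not injected. The exception class, function name, and the wording used for protected nodes are illustrative only and are not dbt-core's or dbt-loom's actual implementation.

```python
from typing import Mapping


class CrossProjectAccessError(Exception):
    """Raised when a consumer project references a non-public upstream node."""


def check_cross_project_access(referencing_id: str, referenced: Mapping[str, str]) -> None:
    """Raise a descriptive error unless the referenced node is public."""
    # dbt treats models without an explicit access level as protected.
    access = referenced.get("access", "protected")
    if access == "public":
        return
    if access == "private":
        scope = f"private to the '{referenced.get('group') or ''}' group"
    else:
        scope = f"protected within the '{referenced.get('package_name') or ''}' project"
    raise CrossProjectAccessError(
        f"Node {referencing_id} attempted to reference node "
        f"{referenced['unique_id']}, which is not allowed because the "
        f"referenced node is {scope}."
    )
```

For example, calling it with a private node in the `marketing` group produces a message in the same style as the one proposed above.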

Add a code of conduct to the repository

Is your feature request related to a problem? Please describe.
Nope!

Describe the solution you'd like
A clear and concise description of what is expected for people interacting with the development process of this project. This should also outline what is expected of commercial interests who are participating in this project's development.

Describe alternatives you've considered
Be a bunch of cowpokes and enjoy the wild west and all the challenges that entails 🤠

Confirm support for 1.8.0rc1

Is your feature request related to a problem? Please describe.
As an avid dbt-core user, I want dbt-loom to be compatible with dbt-core 1.8.0-rc1.

Describe the solution you'd like
Update the version range for dbt-loom to include 1.8.0-rc1, and include this new version in the testing matrix.

Describe alternatives you've considered
Go for a walk on the beach. 🏖️

Additional context
Nope

Problem with versioned models?!

Describe the bug

dbt-loom/test_projects/customer_success# dbt build
09:32:20  Running with dbt=1.7.14
09:32:20  dbt-loom: Patching ref protection methods to support dbt-loom dependencies.
09:32:20  dbt-loom: Loading manifest for `revenue` from `file`
09:32:20  Registered adapter: duckdb=1.7.4
09:32:20  dbt-loom: Injecting nodes
09:32:20  [WARNING]: Model orders has passed its deprecation date of 2024-01-01T00:00:00+00:00. This model should be disabled or removed.            ## (removing deprecation_date does not change anything)
09:32:20  Encountered an error:
Compilation Error
  'model.revenue.not_null_orders_v1_order_id' depends on 'model.revenue.orders.v1' which is not in the graph!

To Reproduce

  1. Install/Setup dbt-core, dbt-duckdb, dbt-loom, ...
  2. git clone https://github.com/nicholasyager/dbt-loom (to retrieve test_projects/)
  3. In dbt-loom/test_projects/revenue run dbt deps, dbt build, dbt run
  4. In dbt-loom/test_projects/customer_success run dbt deps, try dbt build or dbt run
  5. See error, above.

Expected behavior

The test project from the dbt-loom repository should compile without errors.
Other projects which use versioned models also compile without errors.

  • OS: python:3.12-bookworm-based container running on amd64 linux
  • dbt-loom Version 0.5.1
  • dbt-core Version 1.7.14, also 1.7.13

Additional context

This first happened in my own project, but just using the test_projects from the dbt-loom repository exhibits the same behaviour.

AFAICT the corresponding node name/id in revenue/target/manifest.json is "model.revenue.orders.v1", whereas in customer_success/target/manifest.json the injected(?) node seems to be called "model.revenue.orders.v1.0" – but some/all(?) references to it (depends_on, ...) still use the "original" "model.revenue.orders.v1" name/id, which then cannot be found, as said in the error message (... depends on 'model.revenue.orders.v1' which is not in the graph!)...
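One way a "v1" versus "v1.0" mismatch can arise in Python is when the same version is carried as an int in one place and as a float in another. The sketch below only illustrates that effect; it is not a confirmed diagnosis of this bug, and `model_unique_id` and `normalize_version` are hypothetical helpers.

```python
from typing import Optional, Union

Version = Union[int, float, str]


def model_unique_id(package: str, name: str, version: Optional[Version] = None) -> str:
    """Build a dbt-style unique_id, appending a version suffix when present."""
    if version is None:
        return f"model.{package}.{name}"
    return f"model.{package}.{name}.v{version}"


def normalize_version(version: Version) -> Version:
    """Collapse whole-number floats such as 1.0 to the integer 1."""
    if isinstance(version, float) and version.is_integer():
        return int(version)
    return version


if __name__ == "__main__":
    print(model_unique_id("revenue", "orders", 1.0))
    print(model_unique_id("revenue", "orders", normalize_version(1.0)))
```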

Non-versioned models seem to be unaffected / work fine.

Add a contributing guide to the repository

Is your feature request related to a problem? Please describe.
Nope!

Describe the solution you'd like
A clear and concise description of how new contributors can submit code to the dbt-loom project.

Describe alternatives you've considered
Be a bunch of cowpokes and enjoy the wild west and all the responsibility it entails 🤠

Support pulling in multiple manifests from single bucket

Currently, dbt-loom supports pulling in a manifest from cloud storage using bucket name + object name.

However, for organizations with n dbt-core projects that need to peer with each other, adding an entry to each repo gets difficult. I propose that in the s3 and gcp clients, we add a method that allows for specifying just the bucket name. From there, dbt-loom will iterate through all the manifests in the bucket and add them to the project.
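The iteration step could start from something like the sketch below, which picks out manifest objects from a list of object keys. Listing the keys is left to whichever storage client is in use; the function name and the assumption that every manifest is stored under the filename manifest.json are illustrative.

```python
from typing import Iterable, List


def select_manifest_keys(keys: Iterable[str], filename: str = "manifest.json") -> List[str]:
    """Return the sorted object keys whose final path segment equals `filename`."""
    return sorted(
        key for key in keys if key == filename or key.endswith("/" + filename)
    )


if __name__ == "__main__":
    bucket_keys = [
        "core/manifest.json",
        "core/run_results.json",
        "revenue/latest/manifest.json",
        "revenue/latest/old_manifest.json",
    ]
    print(select_manifest_keys(bucket_keys))
```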

I could take a first stab at implementing s3 version.

Edit: Would actually prefer trying this in Artifactory first, if this is something we want to do. Can implement single and multi-manifest JSON pull from Artifactory.

Support S3 buckets as a manifest source

Having access to manifests for multi-project deployments is cool and all, but it would be even cooler to have support for S3 get object calls. This would enable "mid-tier" uses of this tool: beyond a single machine, but not production-worthy.

Requirements:

  • Can provide an S3 path, and the plugin will use AWS environment variables to authenticate and get the object.
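As a small illustration of the "provide an S3 path" requirement, the sketch below splits an `s3://bucket/key` path into the bucket and key that a get-object call needs. It assumes the `s3://` URI form, which the issue does not specify; the function name is hypothetical, and authentication and the download itself are out of scope here.

```python
from typing import Tuple
from urllib.parse import urlparse


def parse_s3_uri(uri: str) -> Tuple[str, str]:
    """Split an s3://bucket/key URI into (bucket, key)."""
    parsed = urlparse(uri)
    key = parsed.path.lstrip("/")
    if parsed.scheme != "s3" or not parsed.netloc or not key:
        raise ValueError(f"Not a valid S3 object URI: {uri!r}")
    return parsed.netloc, key


if __name__ == "__main__":
    print(parse_s3_uri("s3://com.example.dbt_artifacts/latest/manifest.json"))
```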

Child project no longer compiles successfully (possibly when seeds are involved upstream)

Describe the bug
Upstream node injection in version 0.5.0 may have introduced a bug. Models in a child project that depend on public models in a parent project no longer compile. Things were running smoothly in previous versions of dbt-loom.

A typical error message resulting from compiling the child project looks as follows:

14:26:03  Running with dbt=1.7.13
14:26:04  dbt-loom: Patching ref protection methods to support dbt-loom dependencies.
14:26:04  dbt-loom: Loading manifest for `parent` from `file`
14:26:04  Registered adapter: snowflake=1.7.3
14:26:04  dbt-loom: Injecting nodes
14:26:05  Encountered an error:
Compilation Error
  'model.parent.dim_commission_type' depends on 'seed.parent.seed_commission_type' which is not in the graph!

The error message above mentions the public dim_commission_type model in the parent project that stems from the protected seed_commission_type seed in the parent project as well. Some models of the child project reference the dim_commission_type model in the parent project, but none references the seed_commission_type seed. Since we have other projects for which cross-project dependencies work fine with dbt-loom 0.5.0, the problem may be limited to scenarios where protected seeds are used upstream.

To Reproduce
The overall setup is as follows:

  1. A parent project, say, parent, contains various models, with both protected and public models.
  2. The parent project compiles/builds successfully.
  3. The resulting manifest.json file is properly referenced in the dbt_loom.config.yml file at the root of a child project, say, child.
  4. Some models in the child project reference public models in the parent project, in turn based on protected seeds in the parent project.
  5. Commands such as dbt run, dbt build or dbt compile will all produce error messages similar to the one above when executed in the child project.

Expected behavior
To have the child project compiling successfully.

Setup

  • OS: macOS Sonoma 14.4.1
  • Code editor: VS Code 1.88.1
  • Python version: 3.9.18
  • dbt-core: 1.7.13
  • dbt-loom: 0.5.0
  • dbt-snowflake: 1.7.3

Please, let me know if additional information could be useful.

Thanks,
