nicholasyager / dbt-loom
A dbt-core plugin to weave together multi-project dbt-core deployments
License: The Unlicense
The ability to retrieve manifest files from Azure Storage will round out support for the three major cloud platforms.
I'll work on this feature.
Currently, the ManifestReference and all ReferenceConfigs are defined as part of __init__.py. This isn't bad, per se, but it would be more ergonomic to split the base types out, and define reference configs alongside the appropriate client classes.
It would be useful for GitHub to run dbt-loom against multiple versions of dbt-core to confirm functionality.
Is your feature request related to a problem? Please describe.
While debugging or triaging issues for the community, I don't always know which version of dbt-loom is in use. This can make it challenging to pin down the exact issue, or to confirm if newer versions of dbt-loom have resolved the issue already.
Describe the solution you'd like
I'd like dbt-loom to print the current version much like adapters and dbt-core itself. For example:
08:25:16 Running with dbt=1.7.14
08:25:17 dbt-loom: Registering plugin: dbt-loom=0.5.2
08:25:17 dbt-loom: Patching ref protection methods to support dbt-loom dependencies.
08:25:17 dbt-loom: Loading manifest for `project_a` from `file`
08:25:17 Registered adapter: bigquery=1.7.7
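A minimal sketch of how a plugin could look up its own installed version and build a log line like the one requested above. It assumes the package is installed under the distribution name dbt-loom; the helper names installed_version and registration_message are hypothetical and not part of dbt-loom. Only importlib.metadata from the standard library is used.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution: str) -> str:
    """Return the installed version of a distribution, or "unknown" if absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return "unknown"


def registration_message(distribution: str = "dbt-loom") -> str:
    """Build a log line in the style of the example above (hypothetical helper)."""
    return (
        f"dbt-loom: Registering plugin: "
        f"{distribution}={installed_version(distribution)}"
    )


if __name__ == "__main__":
    print(registration_message())
```

Falling back to "unknown" keeps the log line usable when the package is run from a source checkout without installed metadata.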
This is particularly useful to enable hybrid development in multi-cloud deployments. For example, this would enable dbt Cloud to use its own dbt Labs supported injection mechanism for orchestration and their cloud IDE, and dbt-loom would support people running dbt-core locally, backed by the same up-to-date manifests.
IMHO, this is required for adequate community adoption of multi-project deployments without vendor lock-in.
Is your feature request related to a problem? Please describe.
PR 40 shows a better error when accessing a private or protected model of the upstream project; however, the group is missing from the message shown.
Node model.great_bay.test_balboa_models attempted to reference node model.balboa.us_population, which is not allowed because the referenced node is private to the '' group.
Describe the solution you'd like
Ideally the group of the upstream model is shown in the error like:
Node model.great_bay.test_balboa_models attempted to reference node model.balboa.us_population, which is not allowed because the referenced node is private to the 'marketing' group.
Describe alternatives you've considered
None
Additional context
None
I am proposing a feature for dbt-loom to work with multiple projects where each has its own Postgres database, or any database built on PG, like Hydra DB.
My use case is that I am building a Data Product segregated data mesh. Each product has its own PG database, usually hosted on the same PG server, but that could change based on scale. Each Data Product has its own dbt-core project. Rather than replicate all the models using extract-load between the databases, it makes more sense to be able to read (read-only) the data from each database. I faced some challenges getting dbt-loom working with this, but I did get it working. Thanks! dbt-loom is a really useful solution.
Let's say I have a data product "calendar" and one called "workday". Workday refers to Calendar models. dbt-loom successfully generated the following SQL:
select * from "calendar"."product"."days"
However, in PG this gives an error on execution as PG does not allow cross db joins.
postgres cross-database references are not implemented
So the solution I came up with involved postgres_fdw:
-- (create extension and grant usage - removed for brevity)
CREATE SERVER IF NOT EXISTS foreign_calendar
FOREIGN DATA WRAPPER postgres_fdw
OPTIONS (host 'localhost', port '5432', dbname 'calendar');
-- (also create user mapping - removed for brevity)
CREATE SCHEMA IF NOT EXISTS calendar;
CREATE FOREIGN TABLE calendar.days (
    id text
  , col1 date
  , col2 numeric
  , col3 text
  , col4 text
)
SERVER foreign_calendar
OPTIONS (schema_name 'product', table_name 'days');
-- this now works to read the foreign database and table
select * from "workhours"."calendar"."days"
Describe the solution you'd like
Obviously the CREATE FOREIGN TABLE scaffolding could be automated for each public dbt model, and even the CREATE SERVER etc. could be run on each initialisation, pulled from the connection details of the project.
Each time a model changes and the table DDL changes, you need to add/update/remove columns from the FOREIGN TABLE, which could also be automated.
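As a rough sketch of what that automation could look like, the function below renders one CREATE FOREIGN TABLE statement per public model found in an upstream manifest dictionary. It relies on the nodes, resource_type, access, schema, name and columns fields of a dbt manifest; the function name foreign_table_ddl, the fallback to text for columns without a data_type, and the lack of identifier quoting are assumptions made for illustration only.

```python
from typing import Any, Dict, List


def foreign_table_ddl(
    manifest: Dict[str, Any], server: str, local_schema: str
) -> List[str]:
    """Render CREATE FOREIGN TABLE statements for public models (hypothetical helper)."""
    statements = []
    for node in manifest.get("nodes", {}).values():
        if node.get("resource_type") != "model" or node.get("access") != "public":
            continue
        columns = node.get("columns", {})
        if not columns:
            continue  # Without documented columns there is nothing to declare.
        # Assumption: fall back to "text" when a column has no data_type.
        column_sql = "\n  , ".join(
            f'{name} {column.get("data_type") or "text"}'
            for name, column in columns.items()
        )
        statements.append(
            f"CREATE FOREIGN TABLE IF NOT EXISTS {local_schema}.{node['name']} (\n"
            f"    {column_sql}\n"
            f")\n"
            f"SERVER {server}\n"
            f"OPTIONS (schema_name '{node['schema']}', table_name '{node['name']}');"
        )
    return statements
```

Because the column list comes from the manifest, this only works for models whose columns and data types are documented, for example models with enforced contracts.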
Describe alternatives you've considered
It is workable to have to write the above code for every model, but not ideal.
Additional context
I would like feedback on whether this makes sense as a feature of dbt-loom or should be implemented as a separate plugin. I am new to dbt, but not new to SQL and programming. I may be able to attempt this, but any starting pointers or code reviews would be most welcome if it's a wanted feature.
Describe the bug
Running any dbt command that requires compiling from Project B using the manifest from Project A throws the following error:
08:25:16 Running with dbt=1.7.14
08:25:17 dbt-loom: Patching ref protection methods to support dbt-loom dependencies.
08:25:17 dbt-loom: Loading manifest for `project_a` from `file`
08:25:17 Registered adapter: bigquery=1.7.7
08:25:27 dbt-loom: Injecting nodes
08:25:28 Encountered an error:
Compilation Error
'model.project_a.stg_my_seed_file' depends on 'seed.project_a.seed_my_seed_file' which is not in the graph!
This error refers to a seed and model that are not upstream of the model I was trying to run. In Project A, there are no errors when running or compiling.
I tried running with --no-partial-parse as well as running dbt clean before the commands, to no avail.
Deleting the seed and related model only gave the same error with another seed and model pair.
I have looked at the manifest.json from Project A and I can see the seed node. However, when looking at the manifest generated from the run in Project B, I can only see the seed node as a dependency of the staging model; it doesn't seem to be injected as a node itself.
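To check a generated manifest for this condition, a small script can list every node whose depends_on entries point at IDs that are missing from the manifest. This is a diagnostic sketch only: it assumes a manifest dictionary with the usual nodes and sources keys, and the function name find_dangling_dependencies is hypothetical.

```python
from typing import Any, Dict, List, Tuple


def find_dangling_dependencies(manifest: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Return (node_id, missing_dependency_id) pairs for a manifest dict."""
    known_ids = set(manifest.get("nodes", {})) | set(manifest.get("sources", {}))
    dangling = []
    for unique_id, node in manifest.get("nodes", {}).items():
        for dependency in node.get("depends_on", {}).get("nodes", []):
            if dependency not in known_ids:
                dangling.append((unique_id, dependency))
    return dangling
```

Loading target/manifest.json with json.load and passing the result to this function would report pairs such as the staging model and seed named in the error above.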
To Reproduce
Steps to reproduce the behavior:
1. Project A has protected seeds that are upstream of some other protected models.
2. Project A has a public model that is downstream of one or more seeds.
3. Project A compiles with no errors.
4. Project B references a public model from the upstream project.
5. Run a dbt command that requires compiling in Project B (e.g. dbt run).
Expected behavior
Project B should also compile without any errors.
Additional context
Thanks to smilingthax (don't want to tag to spam emails) for pointing out that this error with seeds, which I mentioned in another issue, had already been discussed in a separate issue (that I missed when researching).
I haven't had time to rebuild this with a toy project at the moment. If that is needed I will take some time later in the week, as I have been pushing at work to get dbt-loom working, so I am invested hah. Thanks for all the efforts and for creating and maintaining this amazing project!!
The current solution requires manifest.json to be present before upstream projects can be compiled. In some cases that's not convenient.
Instead, I think it could be useful to generate an API package from an existing dbt project; that package can then be used as a simple dbt package.
Some details of the idea:
- Public models of the core project are exposed as sources.
- Models are created in a core_api project which just do select * from {{ source('core', 'public_model_name') }}
- If a core model is versioned, several SQL files can be created to reflect model versions, like select * from {{ source('core', 'public_model_name_v2') }}
- dbt_project.yml is copied from core to core_api with all model configurations excluded except public models, and of course the name changed from core to core_api.
Do you think it can work? Do you see any issues with that?
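To make the proposal concrete, here is a sketch of the generation step under the assumptions listed above: it reads public models from a manifest dictionary and returns the pass-through SQL files for the API package. The function name api_models and the _v<version> file naming are illustrative assumptions, not existing dbt-loom behaviour.

```python
from typing import Any, Dict


def api_models(manifest: Dict[str, Any], upstream: str = "core") -> Dict[str, str]:
    """Map file name -> SQL for each public model in the upstream manifest."""
    files = {}
    for node in manifest.get("nodes", {}).values():
        if node.get("resource_type") != "model" or node.get("access") != "public":
            continue
        version = node.get("version")
        # Assumption: versioned models get a _v<version> suffix.
        suffix = f"_v{version}" if version is not None else ""
        table = f"{node['name']}{suffix}"
        files[f"{table}.sql"] = (
            f"select * from {{{{ source('{upstream}', '{table}') }}}}\n"
        )
    return files
```

The returned dictionary can be written to the models directory of the generated package; the matching sources YAML would need to be generated in the same pass.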
Is your feature request related to a problem? Please describe.
Currently, dbt-loom does not have a good way to leverage a dbt project's restrict-access configuration. This config is important for limiting cross-project access to protected and private nodes. Since this is an entirely new file to load, we do not currently have a defined path for loading this information.
flowchart LR
subgraph project_a
dbt_project.yml
manifest.json
end
subgraph project_b
project_b_project[dbt_project.yml]
project_b_manifest[manifest.json]
dbt_loom.config.yml
subgraph dbt-loom
ManifestLoader
Plugin
end
end
manifest.json --> ManifestLoader
ManifestLoader --> Plugin
dbt_loom.config.yml --> Plugin
dbt-core
Plugin --> dbt-core
project_b_project --> dbt-core
project_b_manifest --> dbt-core
Describe the solution you'd like
I'd like to be able to define a Project, and with this project a location for its dbt_project.yml file and an associated manifest.json file. This should support all of our existing artifact sources where possible.
flowchart LR
subgraph project_a
dbt_project.yml
manifest.json
end
subgraph project_b
project_b_project[dbt_project.yml]
project_b_manifest[manifest.json]
dbt_loom.config.yml
subgraph dbt-loom
ManifestLoader
ProjectLoader
Plugin
end
end
dbt_project.yml --> ProjectLoader
manifest.json --> ManifestLoader
ProjectLoader --> Plugin
ManifestLoader --> Plugin
dbt_loom.config.yml --> Plugin
dbt-core
Plugin --> dbt-core
project_b_project --> dbt-core
project_b_manifest --> dbt-core
To configure this, we can introduce a new optional top-level concept of a Project.
dependencies:
  - name: core
    description: All common core dependencies across our `n` base projects.
    artifacts:
      - type: s3
        config:
          bucket_name: com.example.dbt_artifacts
          object_prefix: latest/
  - name: revenue
    description: A proof-of-concept local-only revenue reporting project.
    artifacts:
      - type: file
        config:
          path: path/to/manifest.json
      - type: file
        config:
          path: path/to/dbt_project.yml
Describe alternatives you've considered
- Have restrict-access configured in the dbt_loom.config.yml file instead.
- Not support restrict-access at all.
Additional context
Is your feature request related to a problem? Please describe.
Hi folks,
Thanks for the great work. I wonder whether it is possible to make the node resolution process environment (target) aware?
In our use case, lower environments (targets) like dev and ci use one set of Snowflake tables for defer, and we can put that in the manifest. For higher environments (targets) like prod, the same model should be resolved to a different fully qualified name. To rephrase the question: can we hook the generate_schema_name and generate_database_name macros into node resolution?
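One way to picture target-aware resolution is a lookup that rewrites an injected node's database and schema based on the active target. The sketch below is purely illustrative: the Relation class, the resolve_for_target function, and the shape of the overrides mapping are hypothetical and do not describe how dbt-loom or the generate_schema_name macros actually work.

```python
from dataclasses import dataclass, replace
from typing import Dict, Tuple


@dataclass(frozen=True)
class Relation:
    database: str
    schema: str
    identifier: str

    def render(self) -> str:
        return f"{self.database}.{self.schema}.{self.identifier}"


def resolve_for_target(
    relation: Relation, target: str, overrides: Dict[str, Tuple[str, str]]
) -> Relation:
    """Return the relation with database/schema swapped for the given target.

    Targets without an override keep the location recorded in the manifest.
    """
    database, schema = overrides.get(target, (relation.database, relation.schema))
    return replace(relation, database=database, schema=schema)


# Hypothetical example: the manifest records the dev location, prod is remapped.
from_manifest = Relation("DEV_DB", "analytics", "orders")
overrides = {"prod": ("PROD_DB", "analytics")}
print(resolve_for_target(from_manifest, "prod", overrides).render())
```

A single manifest plus a per-target mapping would avoid maintaining one manifest per environment, at the cost of keeping the mapping in sync with the upstream project's naming macros.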
Describe alternatives you've considered
I'm not sure there is a solution other than maintaining N different manifests, one for each environment, which doesn't look optimal and is potentially error-prone.
Additional context
I'm ready to contribute to the solution with enough guidance provided.
I created a model in my upstream project with a deprecation_date at model level, generated the manifest.json, and everything works fine in my downstream project.
If I move the deprecation_date to version level, the downstream project parsing breaks with the error:
File "/home/daniele/.cache/pypoetry/virtualenvs/credem-poc-pkg-aG8EzHFC-py3.10/lib/python3.10/site-packages/dbt/parser/manifest.py", line 579, in check_for_model_deprecations
if resolved_ref.deprecation_date < datetime.datetime.now().astimezone():
TypeError: '<' not supported between instances of 'str' and 'datetime.datetime'
The issue arises from dbt-core, but if I move all the models and configs into one single project, the error disappears, so I guess it's somehow related to dbt-loom node injection.
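The TypeError above is what Python raises when a string is compared with a datetime. As a sketch of one possible mitigation, an injected node's deprecation_date could be coerced to a timezone-aware datetime before dbt-core compares it. The helper name coerce_deprecation_date is hypothetical, and treating values without a timezone as UTC is an assumption.

```python
import datetime
from typing import Optional, Union


def coerce_deprecation_date(
    value: Union[None, str, datetime.date, datetime.datetime]
) -> Optional[datetime.datetime]:
    """Convert an ISO 8601 string (or date) into a timezone-aware datetime."""
    if value is None:
        return None
    if isinstance(value, datetime.datetime):
        parsed = value
    else:
        # str() also covers datetime.date values such as date(2023, 10, 30).
        parsed = datetime.datetime.fromisoformat(str(value))
    if parsed.tzinfo is None:
        # Assumption: values without a timezone are interpreted as UTC.
        parsed = parsed.replace(tzinfo=datetime.timezone.utc)
    return parsed


deprecation_date = coerce_deprecation_date("2023-10-30")
# The comparison from the traceback now works on two aware datetimes.
print(deprecation_date < datetime.datetime.now().astimezone())
```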
This is a sample yml config for the upstream project:
version: 2
models:
  - name: GESTORE
    access: public
    latest_version: 2
    config:
      contract:
        enforced: true
    columns:
      - name: customer_id
        data_type: string
        constraints:
          - type: not_null
          - type: primary_key
      - name: customer_name
        data_type: string
    versions:
      - v: 1
        deprecation_date: '2023-10-30'
      - v: 2
        columns:
          - include: all
            exclude: [customer_name]
          - name: customer_desc
            data_type: string
As a user of the plugin, it would be useful to have logging functionality. This will make it easier to track which projects are being injected, and which models are being referenced.
Is your feature request related to a problem? Please describe.
Right now, upstream producer private and protected nodes are not included in the consumer project. When a user tries to access one of these nodes, they are given an error that the node does not exist. It would be a better UX to show an error like
Parsing Error
Node model.great_bay.test_us_population attempted to reference node model.balboa.us_population, which is not allowed because the referenced node is private to the marketing group.
Is your feature request related to a problem? Please describe.
Nope!
Describe the solution you'd like
A clear and concise description of what is expected for people interacting with the development process of this project. This should also outline what is expected of commercial interests who are participating in this project's development.
Describe alternatives you've considered
Be a bunch of cowpokes and enjoy the wild west and all the challenges that entails 🤠
Is your feature request related to a problem? Please describe.
As an avid dbt-core user, I want dbt-loom to be compatible with dbt-core 1.8.0-rc1.
Describe the solution you'd like
Update the version range for dbt-loom to include 1.8.0-rc1, and include this new version in the testing matrix.
Describe alternatives you've considered
Go for a walk on the beach. 🏖️
Additional context
Nope
Describe the bug
dbt-loom/test_projects/customer_success# dbt build
09:32:20 Running with dbt=1.7.14
09:32:20 dbt-loom: Patching ref protection methods to support dbt-loom dependencies.
09:32:20 dbt-loom: Loading manifest for `revenue` from `file`
09:32:20 Registered adapter: duckdb=1.7.4
09:32:20 dbt-loom: Injecting nodes
09:32:20 [WARNING]: Model orders has passed its deprecation date of 2024-01-01T00:00:00+00:00. This model should be disabled or removed. ## (removing deprecation_date does not change anything)
09:32:20 Encountered an error:
Compilation Error
'model.revenue.not_null_orders_v1_order_id' depends on 'model.revenue.orders.v1' which is not in the graph!
To Reproduce
1. git clone https://github.com/nicholasyager/dbt-loom (to retrieve test_projects/)
2. In dbt-loom/test_projects/revenue, run dbt deps, dbt build, dbt run
3. In dbt-loom/test_projects/customer_success, run dbt deps, then try dbt build or dbt run
Expected behavior
The test project from the dbt-loom repository should compile without errors.
Other projects which use versioned models also compile without errors.
- python:3.12-bookworm-based container running on amd64 linux
- 0.5.1
- 1.7.14, also 1.7.13
Additional context
This first happened in my own project, but just using the test_projects from the dbt-loom repository exhibits the same behaviour.
AFAICT the corresponding node name/id in revenue/target/manifest.json is "model.revenue.orders.v1", whereas in customer_success/target/manifest.json the injected(?) node seems to be called "model.revenue.orders.v1.0", but some/all(?) references to it (depends_on, ...) still use the "original" "model.revenue.orders.v1" name/id, which then cannot be found, as said in the error message (... depends on 'model.revenue.orders.v1' which is not in the graph!)...
Non-versioned models seem to be unaffected / work fine.
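One plausible explanation, offered here as an assumption rather than a confirmed diagnosis, is that the version passes through a float somewhere during injection: formatting 1.0 yields "v1.0", while formatting 1 yields "v1". The sketch below shows the difference and a normalisation that keeps whole-number versions stable. The helper versioned_unique_id is hypothetical and is not dbt-core or dbt-loom code.

```python
from typing import Union


def versioned_unique_id(package: str, name: str, version: Union[int, float, str]) -> str:
    """Build a versioned model ID, collapsing whole-number floats to ints."""
    if isinstance(version, float) and version.is_integer():
        version = int(version)
    return f"model.{package}.{name}.v{version}"


# Naive formatting depends on the Python type of the version.
print(f"model.revenue.orders.v{1}")
print(f"model.revenue.orders.v{1.0}")
# Normalised formatting gives the same ID for 1 and 1.0.
print(versioned_unique_id("revenue", "orders", 1.0))
```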
Is your feature request related to a problem? Please describe.
Nope!
Describe the solution you'd like
A clear and concise description of how new contributors can submit code to the dbt-loom project.
Describe alternatives you've considered
Be a bunch of cowpokes and enjoy the wild west and all the responsibility it entails 🤠
Currently, dbt-loom supports pulling in a manifest from cloud storage using bucket name + object name.
However, for organizations with n dbt-core projects that need to peer with each other, adding an entry to each repo gets difficult. I propose that in the s3 and gcp clients, we add a method that allows for specifying just the bucket name. From there, dbt-loom will iterate through all the manifests in the bucket and add them to the project.
I could take a first stab at implementing the s3 version.
Edit: Would actually prefer trying this in Artifactory first if this is something we want to do. Can implement single and multi-manifest JSON pull from Artifactory.
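The discovery step could be sketched independently of any storage client: given the object keys listed from a bucket, pick out the manifest files and group them by project. The layout assumed here, where the first path segment is the project name, and the function name manifests_by_project are both hypothetical; a real implementation would obtain the keys from the storage provider's listing API.

```python
from pathlib import PurePosixPath
from typing import Dict, Iterable


def manifests_by_project(object_keys: Iterable[str]) -> Dict[str, str]:
    """Map project name -> object key for every <project>/.../manifest.json key."""
    found: Dict[str, str] = {}
    for key in sorted(object_keys):
        path = PurePosixPath(key)
        # Skip non-manifest objects and manifests not nested under a project.
        if path.name != "manifest.json" or len(path.parts) < 2:
            continue
        # Keep the first manifest seen for each project.
        found.setdefault(path.parts[0], key)
    return found
```

Keeping the selection logic separate from the listing call means the same function could serve S3, GCS, or Artifactory listings.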
Having access to manifests for multi-project deployments is cool and all, but it would be even cooler to have support for S3 get object calls. This would enable "mid-tier" uses of this tool: beyond a single machine, but not production-worthy.
Requirements:
As part of dbt Labs' new multi-project deployment functionality in dbt Cloud, they are using a new config file, called dependencies.yml, to track project dependencies. We should be leveraging this file to check for missing projects during injection.
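The check itself amounts to comparing the declared project names with the manifests that were actually loaded. A minimal sketch, assuming dependencies.yml has already been parsed into a dictionary with a projects list of name entries; the function name missing_projects is hypothetical.

```python
from typing import Any, Dict, Iterable, List


def missing_projects(dependencies: Dict[str, Any], loaded: Iterable[str]) -> List[str]:
    """Return declared project names that have no loaded manifest."""
    loaded_names = set(loaded)
    declared = [project["name"] for project in dependencies.get("projects", [])]
    return [name for name in declared if name not in loaded_names]
```

An empty result means every declared upstream project has a manifest available for injection; any names returned could be surfaced as a warning or an error.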
Add support for storing manifest files in a GCS bucket.
We (team at Astrafy) will work on that feature.
Describe the bug
Upstream node injection in version 0.5.0 may have introduced a bug. Models in a child project that depend on public models in a parent project no longer compile. Things were running smoothly in previous versions of dbt-loom.
A typical error message resulting from compiling the child project looks as follows:
14:26:03 Running with dbt=1.7.13
14:26:04 dbt-loom: Patching ref protection methods to support dbt-loom dependencies.
14:26:04 dbt-loom: Loading manifest for `parent` from `file`
14:26:04 Registered adapter: snowflake=1.7.3
14:26:04 dbt-loom: Injecting nodes
14:26:05 Encountered an error:
Compilation Error
'model.parent.dim_commission_type' depends on 'seed.parent.seed_commission_type' which is not in the graph!
The error message above mentions the public dim_commission_type model in the parent project, which stems from the protected seed_commission_type seed in the parent project as well. Some models of the child project reference the dim_commission_type model in the parent project, but none references the seed_commission_type seed. Since we have other projects for which cross-project dependencies work fine with dbt-loom 0.5.0, the problem may be limited to scenarios where protected seeds are used upstream.
To Reproduce
The overall setup is as follows:
1. A parent project, say, parent, contains various models, with both protected and public models.
2. The parent project compiles/builds successfully.
3. Its manifest.json file is properly referenced in the dbt_loom.config.yml file at the root of a child project, say, child.
4. Models in the child project reference public models in the parent project, in turn based on protected seeds in the parent project.
5. dbt run, dbt build or dbt compile will all produce error messages similar to the one above when executed in the child project.
Expected behavior
To have the child project compiling successfully.
Setup
Please, let me know if additional information could be useful.
Thanks,