GithubHelp home page GithubHelp logo

sovalinux / dbt-athena Goto Github PK

View Code? Open in Web Editor NEW

This project forked from tomme/dbt-athena

0.0 1.0 1.0 107 KB

The athena adapter plugin for dbt (https://getdbt.com)

License: Apache License 2.0

Python 99.17% Makefile 0.83%

dbt-athena's Introduction

dbt-athena

  • Supports dbt version 1.0.*
  • Supports Seeds
  • Correctly detects views and their columns
  • Support incremental models
    • Support two incremental update strategies: insert_overwrite and append
    • Does not support the use of unique_key
  • Only supports Athena engine 2

Installation

pip install git+https://github.com/SOVALINUX/dbt-athena

Prerequisites

To start, you will need an S3 bucket, for instance my-staging-bucket and an Athena database:

CREATE DATABASE IF NOT EXISTS analytics_dev
COMMENT 'Analytics models generated by dbt (development)'
LOCATION 's3://my-staging-bucket/'
WITH DBPROPERTIES ('creator'='Foo Bar', 'email'='[email protected]');

Notes:

  • Take note of your AWS region code (e.g. us-west-2 or eu-west-2, etc.).
  • You can also use AWS Glue to create and manage Athena databases.

Credentials

This plugin does not accept any credentials directly. Instead, credentials are determined automatically based on aws cli/boto3 conventions and stored login info. You can configure the AWS profile name to use via aws_profile_name. Checkout DBT profile configuration below for details.

Configuring your profile

A dbt profile can be configured to run against AWS Athena using the following configuration:

Option Description Required? Example
s3_staging_dir S3 location to store Athena query results and metadata Required s3://bucket/dbt/
region_name AWS region of your Athena instance Required eu-west-1
schema Specify the schema (Athena database) to build models into (lowercase only) Required dbt
database Specify the database (Data catalog) to build models into (lowercase only) Required awsdatacatalog
poll_interval Interval in seconds to use for polling the status of query results in Athena Optional 5
aws_profile_name Profile to use from your AWS shared credentials file. Optional my-profile
work_group Identifier of Athena workgroup Optional my-custom-workgroup
num_retries Number of times to retry a failing query Optional 3

Example profiles.yml entry:

athena:
  target: dev
  outputs:
    dev:
      type: athena
      s3_staging_dir: s3://athena-query-results/dbt/
      region_name: eu-west-1
      schema: dbt
      database: awsdatacatalog
      aws_profile_name: my-profile
      work_group: my-workgroup

Additional information

  • threads is supported
  • database and catalog can be used interchangeably

Usage notes

Zero-Downtime Tables Starting from adapter version 1.0.3 there is a way to do dbt run without downtime on table update
It creates a new CTAS table with name ctas_{{model.name}}_{{timestamp}} + creates or updates extra view {{model.name}} that selects from this CTAS

It works in the following way:

  1. Enable zero-downtime for tables
    a) Add var table_zero_downtime to the dbt_project.yml
vars:
  table_zero_downtime: true

b) Alternatively add tag table_zero_downtime to the specific model with table materialization

{{config(tags=['table_zero_downtime'])}}
  1. Cleaning up stale objects like ctas a) We've added a complimentary set of macroses to cleanup objects that not exist in Git Checkout these links:
on-run-end: "{% do athena_utils.delete_stale_ctas_run_end([target.schema, generate_schema_name('some_extra_schema', '')], False, '') %}"

b) For our projects for development purposes I've added wrapper for on-run-end hook that will not trigger on single model run on developer machines

Notes on Docker

If you ever going to add this connector to the Docker, please use Dockerfile from dbt v1.1 or higher https://github.com/dbt-labs/dbt-core/blob/1.1.latest/docker/Dockerfile

Models

Table Configuration

  • external_location (default=none)
    • The location where Athena saves your table in Amazon S3
    • If none then it will default to {s3_staging_dir}/tables
    • If you are using a static value, when your table/partition is recreated underlying data will be cleaned up and overwritten by new data
  • partitioned_by (default=none)
    • An array list of columns by which the table will be partitioned
    • Limited to creation of 100 partitions (currently)
  • bucketed_by (default=none)
    • An array list of columns to bucket data
  • bucket_count (default=none)
    • The number of buckets for bucketing your data
  • format (default='parquet')
    • The data format for the table
    • Supports ORC, PARQUET, AVRO, JSON, or TEXTFILE
  • write_compression (default=none)
    • The compression type to use for any storage format that allows compression to be specified. To see which options are available, check out CREATE TABLE AS
  • field_delimiter (default=none)
    • Custom field delimiter, for when format is set to TEXTFILE

More information: CREATE TABLE AS

Supported functionality

Support for incremental models:

  • Support two incremental update strategies with partitioned tables: insert_overwrite and append
  • Does not support the use of unique_key

Due to the nature of AWS Athena, not all core dbt functionality is supported. The following features of dbt are not implemented on Athena:

  • Snapshots

Known issues

  • Quoting is not currently supported

    • If you need to quote your sources, escape the quote characters in your source definitions:
    version: 2
    
    sources:
      - name: my_source
        tables:
          - name: first_table
            identifier: "first table"       # Not like that
          - name: second_table
            identifier: "\"second table\""  # Like this
  • Tables, schemas and database should only be lowercase

  • Only supports Athena engine 2

Running tests

First, install the adapter and its dependencies using make (see Makefile):

make install_deps

Next, configure the environment variables in dev.env to match your Athena development environment. Finally, run the tests using make:

make run_tests

Community

dbt-athena's People

Contributors

antauri avatar aut0clave avatar bensatori avatar courentin avatar dandandan avatar daniel-cortez-stevenson avatar danielcmessias avatar djlemkes avatar dverzal avatar friendofasquid avatar ignacioreyna avatar julianste avatar justasce avatar mederka avatar mrshu avatar nialloriordan avatar sovalinux avatar stevenayers avatar stumelius avatar tomme avatar tuan-seek avatar

Watchers

 avatar

Forkers

krisstinkou

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.