souvik-databricks / dlt-with-debug

A lightweight helper utility that lets developers build pipelines interactively by keeping a single source for both DLT runs and non-DLT interactive notebook runs.

Home Page: https://pypi.org/project/dlt-with-debug/

License: MIT License

Python 100.00%
big-data big-data-processing databricks delta-live-tables dlt etl etl-pipeline python3 spark

dlt-with-debug's People

Contributors

souvik-databricks

dlt-with-debug's Issues

create_stream_live_table

Has this been implemented yet? Also, how does it differ from the standard dlt.create_table or dlt.create_view?

approxQuantile works in DLT debug, but does not work within the Databricks website itself.

Hi,

First of all, thank you very much for this package. It is very convenient to be able to debug code meant to create a Delta Live Tables pipeline without having to run the entire thing. However, I am currently running into a problem that I find hard to resolve.

To give more context, please check the code below:

import dlt
from pyspark.sql import functions as F
from pyspark.sql.functions import col

@dlt.table(name = "customer_order_silver_v2")
def capping_unitPrice_Qt():
    df = dlt.read("customer_order_silver")
    boundary_unit = [0,0]
    boundary_qty = [0,0]
    # 5th and 95th percentile bounds for each column (relative error 0.25)
    boundary_unit = df.select(col("UnitPrice")).approxQuantile('UnitPrice',[0.05,0.95], 0.25)
    boundary_qty = df.select(col("Quantity")).approxQuantile('Quantity',[0.05,0.95], 0.25)
    print(boundary_unit)
    print(boundary_unit[0])
    print(boundary_unit[1])

    # Cap values that fall outside the bounds
    df = df.withColumn('UnitPrice', F.when(col('UnitPrice') > boundary_unit[1], boundary_unit[1])
                                       .when(col('UnitPrice') < boundary_unit[0], boundary_unit[0])
                                       .otherwise(col('UnitPrice')))

    df = df.withColumn('Quantity', F.when(col('Quantity') > boundary_qty[1], boundary_qty[1])
                                       .when(col('Quantity') < boundary_qty[0], boundary_qty[0])
                                       .otherwise(col('Quantity')))

    return df

When I run the code for this DLT table, approxQuantile() does not seem to work. This is what I get after running it:
[image]

Yet somehow, after I use the debug package and rewrite the code into:

# This way of writing might be too complex. An alternative solution is to write the DLT as a general function and then pass it as a function.
@dlt.create_table(name = "customer_order_silver_v2")
@dltwithdebug(globals())
# @dlt.table(name = "customer_order_silver_v2")
def capping_unitPrice_Qt():
    df = dlt.read("wtchk_customer_order_filtered")

    boundary_unit = [0,0]
    boundary_qty = [0,0]
    boundary_unit = df.select(col("UnitPrice")).approxQuantile('UnitPrice',[0.05,0.95], 0.25)
    boundary_qty = df.select(col("Quantity")).approxQuantile('Quantity',[0.05,0.95], 0.25)
    print(boundary_unit)
    print(boundary_unit[0])
    print(boundary_unit[1])

    df = df.withColumn('UnitPrice', F.when(col('UnitPrice') > boundary_unit[1], boundary_unit[1])
                                       .when(col('UnitPrice') < boundary_unit[0], boundary_unit[0])
                                       .otherwise(col('UnitPrice')))

    df = df.withColumn('Quantity', F.when(col('Quantity') > boundary_qty[1], boundary_qty[1])
                                       .when(col('Quantity') < boundary_qty[0], boundary_qty[0])
                                       .otherwise(col('Quantity')))

    return df

showoutput(capping_unitPrice_Qt)

The code runs and produces the table, as well as the values that I need:
[image]
I really cannot work out what is wrong with the first version. I would appreciate any kind of input or advice. Thank you very much!
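
For reference, the capping that both versions of the function aim for can be sketched in plain Python, independent of Spark and DLT. This is only an illustration of the intended logic, not a fix for the pipeline error: `quantile_bounds` and `cap_to_quantiles` are hypothetical helper names, and exact nearest-rank quantiles stand in here for the approximate values that approxQuantile returns.

```python
import math
from typing import List, Sequence, Tuple


def quantile_bounds(values: Sequence[float],
                    lower: float = 0.05,
                    upper: float = 0.95) -> Tuple[float, float]:
    """Return the (lower, upper) quantiles using the exact nearest-rank method."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    n = len(ordered)

    def nearest_rank(q: float) -> float:
        # Nearest-rank definition: the smallest value with at least q*n items at or below it.
        rank = max(1, math.ceil(q * n))
        return ordered[rank - 1]

    return nearest_rank(lower), nearest_rank(upper)


def cap_to_quantiles(values: Sequence[float],
                     lower: float = 0.05,
                     upper: float = 0.95) -> List[float]:
    """Clip every value into the [lower quantile, upper quantile] range."""
    low, high = quantile_bounds(values, lower, upper)
    return [min(max(v, low), high) for v in values]


# Hypothetical sample data; wider quantiles are used so the effect is visible on 8 values.
unit_prices = [1, 2, 3, 4, 5, 6, 7, 100]
print(cap_to_quantiles(unit_prices, lower=0.25, upper=0.75))  # [2, 2, 3, 4, 5, 6, 6, 6]
```

In the Spark version, the two bounds play the same role as `boundary_unit[0]` and `boundary_unit[1]` in the F.when(...).otherwise(...) chain.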

Add tests

Add unit tests for the functions and get coverage.

Ability to import dlt signatures without a current SparkSession

I've been happily using the dlt-with-debug library, but I'm running into an issue when importing the dlt signatures without an active SparkSession. I'm trying:

import dlt_with_debug.dlt_signatures as dlt

However, the __init__.py calls the v2.py file, which in turn tries to get the pipeline_id from the current SparkSession, and this fails when there is no active SparkSession.

I think my use case can be achieved with an extra check in

pipeline_id = spark.conf.get("pipelines.id", None)

that first confirms there is an actual SparkSession before calling .conf.get() on it.
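
A minimal sketch of what such a check could look like, assuming PySpark 3.0 or later, where `SparkSession.getActiveSession()` is available. `get_pipeline_id` is a hypothetical helper name and not something the library exposes today:

```python
from typing import Optional


def get_pipeline_id(default: Optional[str] = None) -> Optional[str]:
    """Return the `pipelines.id` Spark conf value, or `default` if no session is active."""
    try:
        # Imported lazily so the module can still be imported where PySpark is missing.
        from pyspark.sql import SparkSession
    except ImportError:
        return default

    spark = SparkSession.getActiveSession()
    if spark is None:
        # No active SparkSession: skip the conf lookup instead of failing.
        return default
    return spark.conf.get("pipelines.id", default)


# Outside a DLT pipeline (or without any SparkSession) this is simply None.
pipeline_id = get_pipeline_id()
```

With a guard like this, importing the signatures module would no longer depend on a session existing at import time, and `pipeline_id` would stay falsy outside a DLT run.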

@souvik-databricks curious what you think! I'd be happy to contribute this via a PR.
