saylieee / azure-databricks-data-conversion

This project forked from davevoyles/azure-databricks-data-conversion

For manipulating data and learning ADB. Mostly .csv & parquet work

License: Apache License 2.0

Python 100.00%

azure-databricks-data-conversion's Introduction

Incremental loading of parquet files in Azure Databricks with blob storage

AUTHOR: Dave Voyles | Microsoft | April-2020 | https://github.com/DaveVoyles/Azure-Databricks-data-conversion


GOAL: Incremental loading of parquet files

  1. Take a dataset that has a date-time column and filter it down to one day at a time.
  2. Append that filtered data to an existing Parquet dataset.
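
The goal above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the column name `timestamp` and the helper names are assumptions, and `df` is expected to be a Spark DataFrame supplied by the caller. The filter is passed as a SQL expression string, so the sketch itself needs no PySpark import.

```python
from datetime import date, datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"


def day_bounds(day):
    """Return the half-open [start, end) datetime window covering one calendar day."""
    start = datetime(day.year, day.month, day.day)
    return start, start + timedelta(days=1)


def day_filter_expr(day, ts_col="timestamp"):
    """Build a Spark SQL filter expression selecting rows that fall on `day`.

    `ts_col` is a hypothetical column name; use whatever your data calls it.
    """
    start, end = day_bounds(day)
    return (
        f"{ts_col} >= '{start.strftime(FMT)}' "
        f"AND {ts_col} < '{end.strftime(FMT)}'"
    )


def append_one_day(df, day, target_path, ts_col="timestamp"):
    """Filter `df` to a single day and append those rows to a Parquet dataset."""
    one_day = df.filter(day_filter_expr(day, ts_col))
    # mode("append") adds new files next to the existing ones instead of replacing them
    one_day.write.mode("append").parquet(target_path)
    return one_day
```

Calling `append_one_day(df, date(2020, 4, 1), "/mnt/blobmount/example/data/users/incremental")` once per day would load the data incrementally. A half-open window avoids counting a row at exactly midnight in two consecutive days.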

Steps:

  1. Mount Azure Blob Storage so we can read and write. The mount point is "/mnt/blobmount"
  2. Read the .csv file from "/mnt/blobmount/example/data/users.csv"
  3. Write the data back to blob storage as Parquet or CSV (Parquet in this example) under the "/example/data/users/incremental" folder. NOTE: The current date and time, down to the minute, is appended to the output path so the existing Parquet file is not overwritten
  4. Read the Parquet file back as a data frame
  5. Filter the data frame to rows between a start and end date. NOTE: All transactions occur on the same day, so we filter by hour ('03') here to return fewer results
  6. Append the newly filtered data frame to the existing Parquet file
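
The six steps above might look something like the sketch below. It is an approximation under stated assumptions rather than the repository's code: the timestamp suffix format, the column name `timestamp`, and the helper names are hypothetical, and `spark` and `dbutils` are the objects a Databricks notebook provides, passed in here as parameters. The container, storage account, and key are placeholders the caller must supply.

```python
from datetime import datetime

MOUNT_POINT = "/mnt/blobmount"                                    # from step 1
CSV_PATH = f"{MOUNT_POINT}/example/data/users.csv"                # from step 2
PARQUET_BASE = f"{MOUNT_POINT}/example/data/users/incremental"    # from step 3


def timestamped_path(base, now):
    """Suffix a path with the date and time, down to the minute.

    The exact suffix format is an assumption made for this sketch.
    """
    return f"{base}/{now.strftime('%Y-%m-%d_%H-%M')}"


def mount_blob(dbutils, container, account, account_key):
    """Step 1: mount the blob container unless it is already mounted."""
    if any(m.mountPoint == MOUNT_POINT for m in dbutils.fs.mounts()):
        return
    dbutils.fs.mount(
        source=f"wasbs://{container}@{account}.blob.core.windows.net",
        mount_point=MOUNT_POINT,
        extra_configs={
            f"fs.azure.account.key.{account}.blob.core.windows.net": account_key
        },
    )


def run_steps(spark, now, hour=3, ts_col="timestamp"):
    """Steps 2 to 6; returns the Parquet path that was written."""
    # Step 2: read the CSV, treating the first row as a header
    users = spark.read.csv(CSV_PATH, header=True, inferSchema=True)

    # Step 3: write it back as Parquet under a minute-resolution path
    target = timestamped_path(PARQUET_BASE, now)
    users.write.mode("overwrite").parquet(target)

    # Step 4: read the Parquet data back as a data frame
    df = spark.read.parquet(target)

    # Step 5: keep only rows from one hour of the day (ts_col is hypothetical)
    filtered = df.filter(f"hour({ts_col}) = {int(hour)}")

    # Step 6: append the filtered rows to the existing Parquet data
    filtered.write.mode("append").parquet(target)
    return target
```

In a notebook this would be called as `mount_blob(dbutils, ...)` followed by `run_steps(spark, datetime.now())`. Because step 6 appends rows that were read from the same dataset, those rows appear twice afterwards; the point of the exercise is the append itself.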

Conclusion

We have seen that appending data to an existing Parquet dataset is straightforward. This works well when you’re only adding records, as opposed to updating or deleting existing ones, in a cold data store (Azure Blob Storage here, or Amazon S3, for instance).

In the end, this provides a cheap replacement for a database when all you need is offline analysis of your data. You can add partitions to a Parquet dataset, but you can’t edit existing data in place.
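
As a rough illustration of adding partitions, the sketch below appends a data frame to a Parquet dataset partitioned by calendar day. The column names `timestamp` and `event_date` and both helper names are assumptions made for this example; `df` is a Spark DataFrame supplied by the caller.

```python
from datetime import date


def partition_dir(base, day, col="event_date"):
    """Directory holding one day's files under Spark's col=value partition layout."""
    return f"{base.rstrip('/')}/{col}={day.isoformat()}"


def append_partitioned(df, base, ts_col="timestamp", col="event_date"):
    """Append `df` to a Parquet dataset partitioned by calendar day.

    `ts_col` and `col` are hypothetical column names.
    """
    # Derive the partition column from the timestamp with a SQL expression
    with_day = df.selectExpr("*", f"to_date({ts_col}) AS {col}")
    # Each distinct day lands in its own col=value subdirectory under `base`
    with_day.write.mode("append").partitionBy(col).parquet(base)
    return with_day
```

Each append only adds new files. Changing rows that were already written means rewriting the files or partition that contain them, which is why this layout suits append-only workloads.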
