ziyadmsq / big_data_benchmarks

This project forked from xdssio/big_data_benchmarks

Comparisons of big data technologies for cleaning, manipulating, and generally wrangling data for analysis and machine learning.

License: MIT License

Jupyter Notebook 100.00%

big_data_benchmarks's Introduction

Big data technology benchmarks

This project compares big data technologies for cleaning, manipulating, and generally wrangling data for analysis and machine learning.

The benchmarks for this article.

The analysis is done on 100GB of Taxi data from 2009 to 2015.

Technologies

General Remarks

  • Some notebooks require a kernel restart after package installation.
  • Different notebooks run on different kernels; the kernel to use is noted at the top of each notebook.
  • The notebooks for technologies that do not run out of core are set to work with only 1M rows.
  • In special cases, notebooks needed to be restarted for optimal performance. That might not be fair, but I wanted to get the most out of each technology.
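
The 1M-row cap for in-memory tools could look roughly like the sketch below. This is an illustration only, not the notebooks' actual code: it assumes the data is a CSV file, and the names `read_capped` and `ROW_LIMIT` are hypothetical.

```python
import csv
import io
from itertools import islice

ROW_LIMIT = 1_000_000  # hypothetical constant mirroring the 1M-row cap


def read_capped(handle, limit=ROW_LIMIT):
    """Return the header and at most `limit` data rows from a CSV handle."""
    reader = csv.reader(handle)
    header = next(reader)
    # islice stops after `limit` rows, so the rest of the file is never read
    rows = list(islice(reader, limit))
    return header, rows


# Tiny in-memory stand-in for a real data file (assumed column names)
sample = io.StringIO("fare,tip\n10.0,2.0\n7.5,1.0\n12.0,3.0\n")
header, rows = read_capped(sample, limit=2)
print(header, len(rows))
```

Capping at read time keeps memory use bounded for tools that must hold the whole dataset in RAM, at the cost of benchmarking them on a much smaller input than the out-of-core tools.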

Instructions

  1. Create an S3 bucket for your results (or remove this part from the persist function in the code).
  2. Create an ml.c5d.4xlarge instance on AWS SageMaker with an extra 500GB of storage.
  3. Run the get_data.ipynb notebook to mount the SSD and download the data.
  4. Run the notebook you want to test.
  • At the beginning of each notebook, make sure the instance name and the S3 bucket name are correct.
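
To make the optional S3 step in the first instruction concrete, here is a minimal sketch of what a persist helper might look like. It is not the repository's implementation: the signature, the `timed` helper, the file layout, and the result keys are all assumptions. Results are always written to a local JSON file, and the upload runs only when a bucket name is passed.

```python
import json
import tempfile
import time
from pathlib import Path


def persist(results, name, out_dir="results", bucket=None):
    """Write results to a local JSON file; upload to S3 only if a bucket is given."""
    path = Path(out_dir) / f"{name}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(results, indent=2))
    if bucket:  # leave bucket=None to run without S3
        import boto3  # imported lazily so the local path works without boto3

        boto3.client("s3").upload_file(str(path), bucket, f"{name}.json")
    return path


def timed(func, *args, **kwargs):
    """Run func and return (value, elapsed_seconds)."""
    start = time.perf_counter()
    value = func(*args, **kwargs)
    return value, time.perf_counter() - start


# Example: time a toy workload and save the result locally, without S3
with tempfile.TemporaryDirectory() as tmp:
    _, seconds = timed(sum, range(1_000_000))
    saved = persist({"sum_seconds": seconds}, name="example", out_dir=tmp)
    print(saved.name)
```

Passing `bucket="your-bucket-name"` would additionally upload the file, which requires the boto3 package and AWS credentials on the instance.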

Good luck!
